AGU Notes
2012/12/04
Check out CUAHSI HydroDesktop
Chris Mattmann, senior Computer Scientist at NASA/JPL and officer of the Apache Software Foundation
Talked about the Square Kilometre Array (SKA), a planned array of instruments expected to produce 700 TB/sec of data coming off the wire that has to be recorded
Discussed “Rapid Science Algorithm Development” and the Apache Software Foundation (ASF) projects which aided it:
Apache Object Oriented Data Technology (OODT), which includes a file crawler in which metadata is a “first-class citizen”; uses Apache Tika for automatic file-type identification
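The crawler pattern can be sketched as follows. This is a minimal stdlib-only sketch, not OODT's actual crawler API; Tika is a Java library that does rich content-based detection, so Python's `mimetypes` (extension-based only) stands in for it here.

```python
import mimetypes
import os

def crawl(root):
    """Walk a directory tree and emit a metadata record per file.

    Metadata is a "first-class citizen": every file gets a record,
    even when type detection fails. (Apache Tika inspects file
    contents; stdlib mimetypes only looks at the extension.)
    """
    records = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            mime, _ = mimetypes.guess_type(name)
            records.append({
                "path": path,
                "mime_type": mime or "application/octet-stream",
                "size_bytes": os.path.getsize(path),
            })
    return records
```

In a real OODT deployment the crawler hands these records to a file manager for cataloguing; here they are just returned as a list.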
Apache Airavata for building science gateways, data processing, management
Apache HBase for column-oriented databases
Regional Climate Modeling DB
Earth Systems Grid Federation:
- OpenID sign-on, X509 certificates for scripts
- Search UI uses Apache Tika text processing; supports free text, facets, and geospatial and temporal queries
- Links to OPeNDAP or Live Access Server (LAS) visualizations
- Discovery over REST
- Future work: "Open Climate GIS"
Metadata-based system for...
Generating too much data to record everything. Used a machine learning algorithm to generate candidate events. A web-based interface lets reviewers collaboratively review candidate events (at their leisure), which are a small subset of the data. Reviewers can individually tag events, and those tags are used as feedback into the machine learning algorithm. Has been running since June 2012.
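The tag-feedback step could look something like this. This is a hypothetical heuristic to illustrate the loop, not the talk's actual algorithm: reviewer rejections are used to raise the detector's score threshold so similar false positives are suppressed next time.

```python
def review_feedback_threshold(candidates, tags, quantile=0.5):
    """Pick a new detection threshold from reviewer tags.

    candidates: {event_id: detector_score}
    tags: {event_id: bool}  -- reviewer says real event (True) or not
    Hypothetical rule: set the threshold at the given quantile of the
    scores of rejected events, so future candidates that score like
    known false positives are filtered out before review.
    """
    rejected = sorted(candidates[e] for e, ok in tags.items() if not ok)
    if not rejected:
        # No rejections yet: keep everything reviewable.
        return min(candidates.values())
    idx = int(quantile * (len(rejected) - 1))
    return rejected[idx]
```

A production system would retrain the classifier on the tags rather than just move a threshold, but the data flow (tags in, tighter candidate set out) is the same.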
NASA ECHO
Good story about “consuming” open source vs. contributing/participating in open source: Ruby had a concurrency bug that was fixed within a couple of hours of being reported. However, no one at NASA knew about it, and they ended up duplicating the effort by fixing it in their own hack-ish way.
On the continuum of open source collaboration:
- source drop (e.g. a tarball): low overhead, low value
- everyone works on a public repo: high overhead, much higher value
GitHub “organizations” support public and private code and can constrain users to certain roles
A voxel model for time-space is a good way of visualizing change in shoreline over time
2012/12/06
David Blodgett from the USGS
App stack UI | Data Broker | Subsetter and Querying | Enterprise Storage
Web services for MODIS data
NSIDC
“Portal proliferation problem”: one app stack for one UI. Service-Oriented Architecture encourages re-use.
- Adopt > Extend > Roll your own
- RESTful > Roll your own
- ESIP OpenSearch, ESIP Collection Cast, OAI-PMH?
Using service standards increases complexity, and you have to make some compromises in your development. CORS (Cross-Origin Resource Sharing) is part of the HTML5 standard. Ian Truslove: http://goo.gl/xkxgd
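CORS comes down to one response header. A minimal sketch of a data service enabling it, using Python's stdlib HTTP server (the endpoint and payload here are made up for illustration):

```python
from http.server import BaseHTTPRequestHandler

class CORSHandler(BaseHTTPRequestHandler):
    """Minimal handler that adds the CORS header so a browser app
    served from another origin is allowed to read the response."""

    def do_GET(self):
        body = b'{"status": "ok"}'
        self.send_response(200)
        # The key CORS header; "*" allows any origin, which is
        # reasonable for public, read-only science data services.
        self.send_header("Access-Control-Allow-Origin", "*")
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the sketch quiet
```

Without that header, a JavaScript UI hosted on portal A cannot read JSON from data service B, which is exactly the cross-portal re-use scenario SOA is after.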
Lessons from environmental systems towards big data
Total visionary talk out of left field, but very imaginative and interesting
Lessons from nature:
- Be mobile; keep moving
- Keep it simple
- Think ecosystem
- Allow for evolution
- Steer towards cooperation
Many more salient points.
2012/12/06
Service data enterprise
3 V’s (“volume, velocity, variety”) constitute “big” data. “Big” data can fall on a continuum where one end of the scale is small, heterogeneous, value-added data and the other end is large-volume, automatically collected data streams. Key constraints on the large-volume end are “how to find?” and “how to process?”. Mentioned that it would take 4 days to download 1 TB to his desktop (is that right? That’d be ~3 MB/s). Emphasized the “false dichotomy between paying for science and paying for data management”: compare the costs of doing data management vs. the costs of not doing data management.
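The back-of-envelope check on that download figure:

```python
def transfer_rate_mb_per_s(size_bytes, seconds):
    """Average transfer rate in MB/s (1 MB = 10**6 bytes)."""
    return size_bytes / seconds / 1e6

# 1 TB over 4 days:
rate = transfer_rate_mb_per_s(1e12, 4 * 24 * 3600)
```

This works out to roughly 2.9 MB/s, so the ~3 MB/s figure in the note checks out; it corresponds to a sustained link of about 23 Mbit/s.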
NOAA data management architect
Principles of all public data:
- discoverable
- accessible
- documented
- preserved
Different datasets, however, have different data lifecycles. Suggests that all NOAA grantees will have to address how they will facilitate data sharing. People have to have “both the right and the authority” to perform data management. Data citation… metadata tagging.
NASA Worldview for GIBS
GeoSeas "Common Data Index"
“Use common vocabulary.” Tools and services: high-res seismic data viewing, DTM viewer, litho-log tool. View 83,000 “datasets”: http://geo-seas.eu/
"Improved access to federated data"
Climate Change Research Institute data portal: on-the-fly delivery if cheap, vs. ordering. CWIC: “They provide a programmer, we provide support” for implementing glue.
metadata based approaches?
RDF + vocabulary mapping (more effort, more value) in an “enterprise” triple store, vs. XML-based schema.org markup (“piecemeal”), which is less effort but limited
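The vocabulary-mapping step is the "more effort, more value" part: local predicate names are rewritten onto shared terms so datasets from different providers answer the same query. A minimal sketch over bare (subject, predicate, object) triples, with made-up predicate names for illustration:

```python
def map_vocabulary(triples, predicate_map):
    """Rewrite predicates in RDF-style (subject, predicate, object)
    triples using a vocabulary mapping.

    predicate_map maps local predicate names to shared-vocabulary
    terms; predicates without a mapping pass through unchanged.
    """
    return [(s, predicate_map.get(p, p), o) for s, p, o in triples]
```

A real triple store (e.g. behind a SPARQL endpoint) would do this with ontology alignment rather than a flat dict, but the payoff is the same: one query vocabulary across federated data.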
OPeNDAP subsetter
Utility of subsets
- high indices ~ coordinates
- low: interest is skewed (relative to the source array), e.g.
- polygon
- time, not space
- incompatible structures are likely
- selection requires data values
Solution?
- subset creation from skewed or irregular grids
- binning
OPULS (OPeNDAP-Unidata Linked Servers, shared software). Subset creation from irregular grids (demo running):
- subsets defined by polygonal regions
- each subset _is_ itself an irregular grid (not regridding)
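The polygon-subset idea above can be sketched as a point-in-polygon filter over grid points: the result is just the points that fall inside, i.e. itself an irregular grid, with no regridding. This is an illustrative stdlib-only sketch, not OPULS code.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting point-in-polygon test.

    polygon is a list of (x, y) vertices in order.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge crosses the horizontal ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def polygon_subset(points, polygon):
    """Keep only the grid points inside the polygon.

    The result is a bare point list -- an irregular grid, not a
    regridded rectangle.
    """
    return [(x, y) for x, y in points if point_in_polygon(x, y, polygon)]
```

Subset-by-value works the same way, except the predicate tests data values instead of coordinates, which is why it requires reading the data before subsetting.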
Subset by values
Subsetter capabilities @ NCAR
Crawls remote THREDDS catalogues; 6.5 TB of data accessed as 46 GB
2012/12/07
Move from the compute-move-analyze paradigm. On the decision to develop open-source science software:
- Point: open development takes time to make software sharable. Counterpoint: you can ask others to do that which you are not good at or unable to do.
- Point: competition; if you give it away, others can write proposals to extend it. Counterpoint: well, if they do, it’s going to help your progress as well.
- Point: “my code is ‘sophisticated’.” Counterpoint: research shows that releasing it will do you no damage.
Pros:
- Improves style - think about how bad journals would be if there were no editors!
- Consistent with the scientific process
- Advertisement of you
- Enhances organization
- Encourages long-term re-use (even by yourself!)
Time-series data server (tsds.org, autoplot.org) builds on OPeNDAP and THREDDS. 4 GB of data at 2-second resolution. Start from the top down when making a data request, beginning with pixels/time-scale etc.; this lets you make the most efficient request possible.
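The top-down sizing idea can be sketched as: given the plot width in pixels, the requested time range, and the native sample cadence, pick a decimation stride so the server ships no more samples than the screen can show. A hypothetical sizing rule to illustrate, not TSDS's actual request logic:

```python
def request_stride(range_seconds, cadence_seconds, pixels):
    """Decimation stride for a top-down time-series request.

    Returns the largest stride that still yields at least one sample
    per pixel, so the transferred data roughly matches what the plot
    can display.
    """
    samples = range_seconds / cadence_seconds
    return max(1, int(samples // pixels))
```

For one day of 2-second data on a 1000-pixel plot this gives a stride of 43, i.e. about 1000 transferred samples instead of 43,200, which is the efficiency win the note describes.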