AGU Notes

2012/12/04

Chris Mattmann, senior Computer Scientist at NASA/JPL and officer of the Apache Software Foundation

Talked about the “Square Kilometre Array”, an array of instruments planned to produce 700 TB/sec of data coming off the wire, all of which has to be recorded.

Discussed “Rapid Science Algorithm Development” and ASF projects which aided it.

Apache Object Oriented Data Technology (OODT), with a file crawler built on it in which metadata is a “first-class citizen”. Uses Apache Tika for automatic file-type identification.

Apache Airavata for building science gateways, data processing, management

Apache HBase for column-oriented databases

Regional Climate Modeling DB

Earth Systems Grid Federation:

OpenID sign-on, X509 certificates for scripts
Search UI uses Apache Tika text processing, free text, facets, geospatial and temporal queries
Links to OPeNDAP or Live Access Server (LAS) visualizations
Discovery over REST
Future work: "Open Climate GIS"

Metadata-based system for...

Generating too much data to record everything
Used a machine learning algorithm to generate candidate events
Web-based interface lets reviewers collaboratively review the candidate events (at their leisure), which are a small subset of the data
Reviewers can individually tag events, and those tags are fed back into the machine learning algorithm
Has been running since June 2012

NASA ECHO

Good story about “consuming” open source vs. contributing to and participating in open source: Ruby had a concurrency bug that was fixed within a couple of hours of being reported. However, no one at NASA knew about the fix, and they ended up duplicating the effort by fixing it in their own hack-ish way.

On the continuum of open source collaboration:

source drop, e.g. tarball: low overhead, low value
everyone works on a public repo: high overhead, much higher value

GitHub “organizations”: public and private code… can constrain users to certain roles

voxel model for time-space: a good way of visualizing change in shoreline over time

2012/12/06

David Blodgett from the USGS

App stack: UI | Data Broker | Subsetter and Querying | Enterprise Storage

Web services for MODIS data

http://daac.ornl.gov/

NSIDC

“Portal proliferation problem”: one app stack for one UI
Service-oriented architecture encourages re-use
Adopt > Extend > Roll your own RESTful > Roll your own
ESIP OpenSearch, ESIP Collection Cast, OAI-PMH?
Using service standards increases complexity, and you have to make some compromises in your development
CORS (Cross-Origin Resource Sharing) is part of the HTML5 standard
Ian Truslove http://goo.gl/xkxgd
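As a rough illustration of what CORS asks of a data service, here is a minimal sketch of the server-side decision. The function name and the allowed-origin set are hypothetical; only the `Access-Control-Allow-Origin` and `Vary` header names come from the CORS mechanism itself.

```python
from typing import Dict, Optional, Set


def cors_response_headers(request_origin: Optional[str],
                          allowed_origins: Set[str]) -> Dict[str, str]:
    """Return the extra response headers for a simple cross-origin request.

    If the browser-supplied Origin is on the allow list, echo it back so the
    browser lets the calling page read the response. Otherwise add nothing,
    and the browser blocks the cross-origin read.
    """
    if request_origin is None or request_origin not in allowed_origins:
        return {}
    return {
        "Access-Control-Allow-Origin": request_origin,
        # Tell caches that the response differs depending on the Origin header.
        "Vary": "Origin",
    }


# Hypothetical portal origin, for illustration only.
allowed = {"https://portal.example.org"}
print(cors_response_headers("https://portal.example.org", allowed))
print(cors_response_headers("https://other.example.com", allowed))
```

With something like this in place, one data service can back several browser-based UIs hosted on other domains, which fits the re-use argument above.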

Lessons from environmental systems towards big data

Total visionary talk out of left field, but very imaginative and interesting

Lessons from nature =

Be mobile; keep moving
Keep it simple
Think ecosystem
Allow for evolution
Steer towards cooperation

Many more salient points.

2012/12/06

Service data enterprise

3 V’s: “volume, velocity, variety” (these constitute “big” data)
“Big” data can fall on a continuum where one end of the scale is small, heterogeneous, value-added data and the other end is large-volume, automatically collected data streams.
Key constraints on the large-volume end are “how to find?” and “how to process?”.
Mentioned that it would take 4 days to download 1 TB to his desktop (is that right? That’d be ~3 MB/s).
Emphasized the “false dichotomy between paying for science and paying for data management”: compare the costs of doing data management vs. the costs of not doing data management.
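A quick check of the download figure, as a small sketch. It assumes a decimal terabyte (10^12 bytes) and a constant transfer rate, neither of which was stated in the talk.

```python
def required_throughput_mb_per_s(size_bytes: float, days: float) -> float:
    """Average rate, in decimal MB/s, needed to move size_bytes in `days` days."""
    seconds = days * 24 * 60 * 60
    return size_bytes / seconds / 1e6


ONE_TB = 1e12  # assumption: decimal terabyte, not TiB

rate = required_throughput_mb_per_s(ONE_TB, days=4)
print(f"{rate:.2f} MB/s")           # prints "2.89 MB/s"
print(f"{rate * 8:.1f} Mbit/s")     # the same rate expressed in megabits per second
```

So the roughly 3 MB/s estimate holds up: 4 days for 1 TB corresponds to a sustained rate just under 3 MB/s.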

NOAA data management architect

Principles of all public data:

discoverable
accessible
documented
preserved

Different datasets, however, have different data lifecycles
Suggests that all NOAA grantees will have to address how they will facilitate data sharing
People have to have “both the right and the authority” to perform data management
data citation… metadata tagging

NASA Worldview for GIBS

GeoSeas "Common Data Index"

“use common vocabulary”
Tools and Services:
high-res seismic data viewing
DTM viewer
litho-log tool
view 83000 “datasets”
http://geo-seas.eu/

"Improved access to federated data"

Climate Change Research Institute data portal
on-the-fly delivery if cheap, vs. ordering
CWIC - “They provide a programmer, we provide support” for implementing glue

metadata based approaches?

RDF + vocabulary mapping (more effort, more value)
triple (3x) store
“enterprise” XML-based
schema.org = “piecemeal”, which is less effort but limited

OPeNDAP subsetter

Utility of subsets

high indices ~ coordinates
low - interest is skewed (relative to source array) e.g.

polygon
time, not space
incompatible structures are likely
selection requires data values

Solution?

subset creation from skewed or irregular grids
binning
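One possible reading of the binning idea, sketched under assumptions: scattered samples from an irregular grid are dropped into regular square cells and averaged per cell. The function name, the choice of the mean as the aggregate, and the cell size are all illustrative, not something the talk specified.

```python
import math
from collections import defaultdict
from typing import Dict, Sequence, Tuple


def bin_mean(xs: Sequence[float], ys: Sequence[float],
             values: Sequence[float],
             cell_size: float) -> Dict[Tuple[int, int], float]:
    """Average irregularly placed samples into regular square bins.

    Each sample is assigned to the bin whose integer index is
    (floor(x / cell_size), floor(y / cell_size)).
    """
    sums: Dict[Tuple[int, int], float] = defaultdict(float)
    counts: Dict[Tuple[int, int], int] = defaultdict(int)
    for x, y, v in zip(xs, ys, values):
        key = (math.floor(x / cell_size), math.floor(y / cell_size))
        sums[key] += v
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


binned = bin_mean([0.1, 0.4, 1.2], [0.2, 0.3, 0.1], [1.0, 3.0, 5.0], cell_size=1.0)
print(binned)  # {(0, 0): 2.0, (1, 0): 5.0}
```

Unlike the polygon subsetting described further on, binning does change the grid: the output lives on regular cells, not on the source coordinates.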

OPULS - OPeNDAP-Unidata Linked Servers (shared software)
Subset creation from irregular grids (demo running)

subsets defined by polygonal regions
each subset _is_ itself an irregular grid (not regridding)
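To make the polygon idea concrete, here is a minimal sketch that selects cells of an irregular grid whose coordinates fall inside a polygon. It uses a standard ray-casting test and is not the OPULS implementation; all names are hypothetical. Selected cells keep their original coordinates, so the result is still an irregular grid and nothing is regridded.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def point_in_polygon(x: float, y: float, polygon: Sequence[Point]) -> bool:
    """Ray casting: toggle on each polygon edge crossed by a ray going in +x."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            # x coordinate at which the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def subset_by_polygon(lons: Sequence[float], lats: Sequence[float],
                      values: Sequence[float],
                      polygon: Sequence[Point]) -> List[Tuple[float, float, float]]:
    """Keep (lon, lat, value) for every cell located inside the polygon."""
    return [(lon, lat, v) for lon, lat, v in zip(lons, lats, values)
            if point_in_polygon(lon, lat, polygon)]


triangle = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0)]
lons = [0.5, 1.2, 2.5, 0.2]
lats = [0.5, 0.5, 2.5, 1.9]
vals = [10.0, 20.0, 30.0, 40.0]
print(subset_by_polygon(lons, lats, vals, triangle))
# [(0.5, 0.5, 10.0), (1.2, 0.5, 20.0)]
```

Points that lie exactly on a polygon edge are not handled specially in this sketch.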

Subset by values

Subsetter capabilities @ NCAR

Crawl remote THREDDS catalogues
6.5 TB of data accessed as 46 GB

2012/12/07

Move from compute-move-analyze paradigm
Decision to dev OSScienceS
Point: Open development; it takes time to make software sharable
Counterpoint: You can ask others to do what you are not good at or are unable to do
Point: Competition; if you give it away, others can write proposals to extend it
Counterpoint: Well, if they do, it’s going to help your progress as well
Point: “My code is ‘sophisticated’”
Counterpoint: Research shows that it will do you no damage to release it

Pros:

Improves style - think about how bad journals would be if there were no editors!
Consistent with the scientific process
Advertisement of you
Enhances organization
Encourages long-term re-use (even by yourself!)

Time-series data server (tsds.org, autoplot.org)
builds on OPeNDAP and THREDDS
4 GB of data @ 2 sec resolution
start from the top down: build the data request starting from pixels, time-scale, etc.
Makes it so that you can make the most efficient request possible
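The top-down idea can be sketched as follows: begin with how many pixels the plot has and how much time it spans, then ask the server for only every Nth sample. This is an illustrative calculation with hypothetical names, not the actual TSDS or Autoplot request logic.

```python
import math


def decimation_stride(span_seconds: float,
                      native_resolution_seconds: float,
                      plot_width_pixels: int) -> int:
    """Largest stride that still leaves at least one sample per pixel column."""
    samples_available = span_seconds / native_resolution_seconds
    return max(1, math.floor(samples_available / plot_width_pixels))


# One day of 2-second data drawn on an 800-pixel-wide plot.
span = 24 * 60 * 60
stride = decimation_stride(span, native_resolution_seconds=2, plot_width_pixels=800)
samples_requested = math.ceil((span / 2) / stride)
print(stride, samples_requested)  # 54 800
```

In this example the request shrinks from 43200 samples to 800, with no visible loss at that plot width (ignoring aliasing, which a real server might address by returning per-bin min/max instead of plain striding).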


Published

10 December 2012
