Integrating NCAR’s data infrastructure with the OSDF
Project Website
Project Overview
Data-intensive research, including data analytics, machine learning,
and data assimilation, continues to drive innovation and discovery
across the geosciences. An obstacle to scientific discovery is that
critical research datasets are distributed and stored across many
disparate locations, making it challenging for researchers to easily
access data outside of their home environment and investigate
cross-disciplinary relationships such as those explored at NCAR and NEON.
The Open Science Data Federation (OSDF) is working to overcome this
challenge by providing a unified view of datasets stored across
autonomous facilities, integrated with the high-throughput
computational resources of the Open Science Pool (OSPool).
We propose to integrate NCAR’s curated research data collections with
the OSDF by
acquiring, operating, and maintaining OSDF Origin and OSDF Cache nodes
and by providing research, consulting, community engagement, and
training services to:
1) Broaden community access to NCAR’s model-generated (climate
projections and historical reanalysis) and observing-facility-produced
datasets on NSF national cyberinfrastructure resources,
2) Explore, develop, and publish example workflows that leverage
OSDF/OSPool resources to support investigation of reference research
use cases and identify future needs in the OSDF/OSPool infrastructure,
and
3) Engage and train researchers on how their research workflows can
leverage the capabilities of the OSDF, including how they can develop
and run workflows on OSPool resources and share their personal
datasets to the OSDF for reuse by others.
Example research use cases
We have developed the following example research use cases and documented them on GitHub.
Example workflows
- Access CESM2 LENS data from the AWS OpenData origin and the NCAR data origin to:
  - a) Benchmark data access speeds for subsets of various sizes.
  - b) Bias-correct surface temperature using ERA5 reanalysis.
  - c) Compute surface ocean heat content.
  - d) Compute and plot Global Mean Surface Temperature Anomaly (GMSTA).
- Access NOAA's SONAR data from an AWS origin to plot echograms.
- Benchmark data access speeds from the NCAR data origin using the DART reanalysis dataset and make diagnostic plots.
- Benchmark data access speeds from NCAR's data origin when the data is accessed from the OSPool's access point AP40.
- Run the temperature bias-correction workflow on:
  - a) Texas Advanced Computing Center's Stampede3 cluster
  - b) NCAR's Casper cluster
- Access NA-CORDEX data from NCAR's Research Data Archive and make diagnostic plots.
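As a rough illustration of the GMSTA computation listed above, the following minimal sketch computes an area-weighted global mean on a regular latitude-longitude grid and subtracts a baseline mean. It is not the project's published workflow: the function names, the cos(latitude) weighting on a regular grid, the choice of baseline period, and the tiny synthetic temperature fields are all assumptions made for illustration, and no CESM2 LENS data is read.

```python
import math


def area_weighted_mean(field, lats_deg):
    """Mean of a lat x lon grid, weighting each latitude row by cos(latitude).

    Assumes a regular grid, so cos(latitude) approximates relative cell area.
    """
    num = 0.0
    den = 0.0
    for row, lat in zip(field, lats_deg):
        w = math.cos(math.radians(lat))
        num += w * sum(row)
        den += w * len(row)
    return num / den


def gmsta(fields, lats_deg, baseline):
    """Global-mean anomaly per time step, relative to the mean over `baseline`.

    `fields` is a list of lat x lon grids (one per time step) and
    `baseline` is a slice selecting the reference time steps.
    """
    means = [area_weighted_mean(f, lats_deg) for f in fields]
    ref_means = means[baseline]
    ref = sum(ref_means) / len(ref_means)
    return [m - ref for m in means]


# Synthetic example: three time steps on a 3 x 2 grid (kelvin, made-up values).
lats = [-60.0, 0.0, 60.0]
fields = [
    [[280.0, 280.0], [300.0, 300.0], [280.0, 280.0]],
    [[281.0, 281.0], [301.0, 301.0], [281.0, 281.0]],
    [[282.0, 282.0], [302.0, 302.0], [282.0, 282.0]],
]
anomalies = gmsta(fields, lats, slice(0, 2))
print([round(a, 3) for a in anomalies])
```

The weighting matters because cells near the poles cover less area than cells near the equator; an unweighted mean would over-count high latitudes.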
Machine Learning Workflows
- Use logistic regression to predict Nino3.4 indices in advance. The training data are sea surface temperature (SST) values and observed Nino indices hosted on NCAR's RDA.
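The idea behind this workflow can be sketched with a one-feature logistic regression fitted by gradient descent. This is only a toy under stated assumptions: the target is assumed to be a binary label (whether the index is in a warm state some months later), the single predictor is assumed to be a current SST anomaly, and the eight training pairs are synthetic rather than taken from the RDA. The actual workflow may use different features, labels, and a library implementation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.5, epochs=2000):
    """Fit p(y=1 | x) = sigmoid(w * x + b) by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = 0.0
        grad_b = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y  # gradient of the log-loss wrt the logit
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    """Return 1 if the modelled probability of a warm state exceeds 0.5."""
    return 1 if sigmoid(w * x + b) > 0.5 else 0


# Synthetic training pairs (hypothetical): SST anomaly now -> warm state later.
sst_anomaly = [-1.5, -1.0, -0.5, -0.2, 0.3, 0.8, 1.2, 1.6]
warm_later = [0, 0, 0, 0, 1, 1, 1, 1]

w, b = fit_logistic(sst_anomaly, warm_later)
print(predict(w, b, 1.0), predict(w, b, -1.0))
```

In a real forecasting setup the features at time t would be paired with labels at time t plus the lead time, and skill would be checked on held-out years rather than on the training data.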
These use cases demonstrate the ingestion of data from both the NCAR
origin and the AWS OpenData origin.
Datasets from NCAR are now accessible via the OSDF:
- All NCAR datasets are accessible from NCAR's OSDF origin under:<dnnnnnn>, where <dnnnnnn>
maps to a unique dataset identifier.
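A small helper can make the identifier convention concrete. The sketch below assumes that <dnnnnnn> means the letter "d" followed by six digits, which the placeholder suggests but the text does not state outright. The origin prefix is not given here, so it is passed in by the caller; "/example/prefix" and "d123456" are hypothetical values used only to exercise the function.

```python
import re

# Assumed identifier shape: "d" followed by exactly six digits.
DATASET_ID = re.compile(r"d\d{6}")


def dataset_path(prefix, dataset_id):
    """Join a caller-supplied origin prefix and a validated dataset identifier."""
    if not DATASET_ID.fullmatch(dataset_id):
        raise ValueError(f"not a valid dataset identifier: {dataset_id!r}")
    return prefix.rstrip("/") + "/" + dataset_id


# Hypothetical prefix and identifier, for illustration only.
print(dataset_path("/example/prefix/", "d123456"))
```

Validating the identifier before building the path catches typos early, before any remote request is attempted.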
Future Plans
We plan to develop additional example research use cases and will
host a hackathon in summer 2025 as part of the Pythia cookoff series
to generate community-developed example use cases that leverage OSDF
resources.