Getting Started with NMDC Data

What data will be available during the NMDC Pilot?

The NMDC team is working with a range of microbiome research groups, funded by the US Department of Energy, the US Department of Agriculture, and the National Science Foundation, to explore best practices for data curation, data management, data discovery and access, and data citations. During the pilot, data sets from these partnerships will be made available to the community for search and download. Currently, the NMDC data portal has:

Natural Organic Matter
17.0 TB

How does the NMDC portal integrate data from different sources?

The development and adoption of standards is key to being able to integrate and search across data derived from different studies and stored in distributed data resources. The NMDC team is contributing to existing community-driven standards, creating a framework for using these standards, and mapping existing systems to these standards.
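As a rough illustration of what mapping an existing system to a shared standard can look like, the sketch below renames source-specific metadata fields to standard term names. The source field names and the sample record are hypothetical, and the target names are written in the style of MIxS terms; this is not the NMDC team's actual mapping.

```python
# Hypothetical source-specific field names mapped to MIxS-style term names.
# The keys on the left are invented for illustration only.
FIELD_MAP = {
    "sampling_date": "collection_date",
    "coordinates": "lat_lon",
    "biome": "env_broad_scale",
}


def harmonize(record: dict, field_map: dict = FIELD_MAP) -> dict:
    """Rename known fields to their standard names; keep unknown fields as-is."""
    return {field_map.get(key, key): value for key, value in record.items()}


# Made-up record from an imaginary source system.
source_record = {
    "sampling_date": "2019-06-01",
    "coordinates": "46.37 -119.27",
    "depth": "0.1 m",
}
harmonized = harmonize(source_record)
```

Once records from different sources share the same field names, a single query can search across all of them.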

Are NMDC data products reproducible and comparable?

The NMDC team has integrated existing bioinformatics tools into standardized workflows for processing raw multi-omics data (e.g., metagenome, metatranscriptome, metaproteome, and metabolome data) to produce reproducible and reusable annotated data products. These workflows are modular, containerized, and available for the community to use.

What types of data does the NMDC data portal provide?

The NMDC data portal hosts environmental multi-omics microbiome data. During the pilot phase of the project, the portal mainly hosts data generated at two DOE Office of Science User Facilities: the Joint Genome Institute (JGI) and the Environmental Molecular Sciences Laboratory (EMSL). The NMDC team is also exploring how to provide efficient search of and access to publicly available sequence data in INSDC repositories, specifically data sets that meet the NMDC sample metadata standards. The NMDC Pilot is not currently accepting data sets from other sources.

Multi-omics data that are currently available in the NMDC data portal include:

  • Metagenomes. Provide genomic information for microbes recovered directly from an environmental sample and are widely used for taxonomic and functional microbiome characterization.
  • Metatranscriptomes. Provide gene expression information for microbes within their natural environments and key information on the active functions within a microbiome at the time of sampling.
  • Metaproteomes. Provide information on the proteins that are present in a microbiome and serve as a critical link between sequence data and inferred phenotypes.
  • Metabolomes. Provide measurements of small molecules (metabolites) to understand the biochemical processes occurring in biological systems.

NMDC is a distributed data infrastructure

The NMDC leverages distributed data management and compute, yet hosts a centralized metadata store to link services and existing resources. This model enables the NMDC to build upon existing infrastructure resources hosted at the JGI and EMSL to support data management, including JGI’s Integrated Microbial Genomes & Microbiomes (IMG/M) and EMSL’s myEMSL.

NMDC Infrastructure Vision

The elements in purple highlight the various access points for the NMDC data. A user can download data through the NMDC Data Portal or through the API, and can access reads-based analyses through JGI’s Integrated Microbial Genomes and Microbiomes site. In the future, the data and workflows will be available through the KBase and EDGE platforms. Data are contributed at the NMDC nodes. The NMDC Pilot has started by building the DOE node, and in the next year we will work with NSF’s NEON project and CyVerse to make those data available through NMDC.

Instrumentation data and analysis products are available for download through the NMDC Data Portal or programmatically using the NMDC Data API, but are stored in distributed access points at the JGI and EMSL.
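To give a sense of what programmatic access involves, here is a minimal sketch of composing a query URL for a REST-style data API. The base URL, the resource names (`biosamples`, `studies`), and the paging parameters are placeholders chosen for illustration and are not taken from the NMDC Data API; consult the API documentation for the real address and endpoints.

```python
from urllib.parse import urlencode

# Placeholder address: NOT the real NMDC Data API base URL.
BASE_URL = "https://api.example.org"


def build_query_url(resource: str, **params) -> str:
    """Compose a GET URL for a resource collection with optional query parameters."""
    # Sort parameters so the same inputs always yield the same URL.
    query = urlencode(sorted(params.items()))
    url = f"{BASE_URL}/{resource.strip('/')}"
    return f"{url}?{query}" if query else url


# Hypothetical resource name and paging parameters.
url = build_query_url("biosamples", per_page=25, page=1)
```

With the real base URL and endpoint substituted, the resulting string could be passed to an HTTP client such as `urllib.request.urlopen`.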

NMDC Data Request Handling

The FICUS data are housed at NERSC and EMSL. When a data download request is made, the data are retrieved from the appropriate location and downloaded to the user’s system.
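The routing step described above can be pictured with a small sketch. The assignment of data types to NERSC or EMSL below is an assumption made for illustration, as are the `DataObject` class and the identifiers; the actual request handling is not described in this document.

```python
from dataclasses import dataclass

# Assumed mapping from data type to storage site, for illustration only.
STORAGE_BY_TYPE = {
    "metagenome": "NERSC",
    "metatranscriptome": "NERSC",
    "metaproteome": "EMSL",
    "metabolome": "EMSL",
}


@dataclass(frozen=True)
class DataObject:
    """Hypothetical record describing one downloadable file."""
    identifier: str
    data_type: str


def resolve_location(obj: DataObject) -> str:
    """Return the site that holds the object, or raise if the type is unknown."""
    try:
        return STORAGE_BY_TYPE[obj.data_type]
    except KeyError:
        raise ValueError(
            f"No storage location known for data type {obj.data_type!r}"
        ) from None


location = resolve_location(DataObject("example-0001", "metaproteome"))
```

Keeping the lookup in one table means the user-facing download request stays the same regardless of where the underlying file lives.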

Important Considerations

  • The NMDC data ecosystem does not broker access agreements on behalf of other institutions. Accessing data requires adherence to the data access policy of the repository where the data are housed.
  • Data in the NMDC data portal are made available under the CC BY 4.0 license.
  • Some data that are discoverable (e.g., metadata) in the NMDC data portal may be subject to usage and access restrictions.
