Data Integration
Data integration efforts across the DOE Biological and Environmental Research (BER) ecosystem
The Department of Energy (DOE) Biological and Environmental Research (BER) program supports research to “achieve a predictive understanding of complex biological, Earth, and environmental systems with the aim of advancing the nation’s energy and infrastructure security.” Towards this mission, BER supports three DOE Office of Science user facilities and numerous large-scale programs that provide access to scientific instruments, resources, and expertise to the global research community. The recent workshop report “A Unified Data Infrastructure for Biological and Environmental Research: Report from the BER Advisory Committee” provides a summary of existing capabilities for data management and recommendations for future data infrastructure.
To improve interoperability across existing resources and advance community standards, the NMDC serves as a data integration “hub” for standardized microbiome data, supporting how researchers create, use, and reuse data. Read on for details on how we work across BER user facilities and data resources.
JGI
The Joint Genome Institute (JGI) is a user facility located at the Lawrence Berkeley National Laboratory that provides advanced genomic capabilities, large-scale data, and professional expertise to support the global research community in studies of complex biological and environmental systems. The JGI generates sequence data from plants, microbes, and complex communities.
How does the NMDC work with the JGI?
The NMDC works closely with JGI staff and users. Incoming JGI projects use the NMDC Submission Portal when researchers have expressed interest in collaborating with the NMDC and sharing data across JGI and NMDC systems. The NMDC also pulls in microbiome data previously generated at the JGI so that it is findable and searchable in the NMDC Data Portal. In addition, we have a longstanding and highly productive collaboration with the teams behind JGI’s data management systems, the Genomes OnLine Database (GOLD) and the Integrated Microbial Genomes & Microbiomes (IMG/M) platform. Last year, we developed an automated process that fetches study and sample metadata from the GOLD API and regularly shares metadata updates across systems. We have also worked to share processed metagenome data more seamlessly with the IMG/M team for interoperability across the NMDC Data Portal and IMG/M. This year, the IMG/M team used the NMDC API to pull data into the IMG/M platform and built out the NMDC Metagenome Study List. Together, these efforts aim to harmonize and complement existing JGI data and comparative analysis services to support microbiome research.
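The regular sharing of metadata updates described above can be pictured as a compare-and-merge step between two sets of records. The following is a minimal sketch under stated assumptions, not the NMDC or GOLD implementation: the record shape, the identifiers, and the function names `diff_metadata` and `apply_updates` are all hypothetical, and the sketch works on in-memory dictionaries rather than calling any real API.

```python
from typing import Dict, List, Tuple

# Hypothetical record shape: a flat mapping of metadata field -> value.
Record = Dict[str, str]


def diff_metadata(
    local: Dict[str, Record], remote: Dict[str, Record]
) -> Tuple[List[str], List[str]]:
    """Return (new_ids, changed_ids) for records found in `remote`.

    `local` stands in for the records already held by one system and
    `remote` for the records just fetched from the other system.
    """
    new_ids = sorted(i for i in remote if i not in local)
    changed_ids = sorted(
        i for i in remote if i in local and remote[i] != local[i]
    )
    return new_ids, changed_ids


def apply_updates(
    local: Dict[str, Record], remote: Dict[str, Record]
) -> Dict[str, Record]:
    """Return a copy of `local` with new and changed remote records merged in."""
    new_ids, changed_ids = diff_metadata(local, remote)
    merged = dict(local)
    for record_id in new_ids + changed_ids:
        merged[record_id] = dict(remote[record_id])
    return merged


# Placeholder identifiers and fields, for illustration only.
local_records = {
    "study-001": {"name": "Example study", "ecosystem": "soil"},
    "study-002": {"name": "Another study", "ecosystem": "freshwater"},
}
remote_records = {
    "study-001": {"name": "Example study", "ecosystem": "soil"},
    "study-002": {"name": "Another study (renamed)", "ecosystem": "freshwater"},
    "study-003": {"name": "New study", "ecosystem": "marine"},
}

new_ids, changed_ids = diff_metadata(local_records, remote_records)
merged_records = apply_updates(local_records, remote_records)
```

Computing the difference before merging keeps each scheduled run cheap and makes it easy to log exactly which records were touched. Records that exist only locally are left untouched in this sketch; how deletions should propagate is a policy decision the source does not describe.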
EMSL
The Environmental Molecular Sciences Laboratory (EMSL) is a user facility located at Pacific Northwest National Laboratory that provides researchers with multi-omics, advanced microscopy, and other state-of-the-art techniques to contribute to DOE BER scientific goals.
How does the NMDC work with EMSL?
As with the JGI, users can submit their sample metadata to EMSL through the NMDC Submission Portal. Specialized tabs for JGI- and EMSL-specific metadata are available in the Submission Portal to streamline tracking of sample and processing metadata throughout the project lifecycle. The NMDC ingests microbiome multi-omics data from EMSL to make it accessible in the NMDC Data Portal. Through our collaboration with EMSL, we are advancing community data standards for natural organic matter data generated by high-resolution mass spectrometry. Specifically, the NMDC has incorporated into the NMDC schema the metadata standards for sample processing, data generation, and data processing that EMSL developed for the Molecular Observation Network (MONet). As MONet data become available, EMSL and the NMDC will share these metadata via APIs, and the NMDC Data Portal will point to the data locations in EMSL’s archive. Additionally, in collaboration with EMSL’s Computing and Data Operations group, we have established a backup instance of the NMDC Data Portal and Submission Portal on EMSL’s Kubernetes resources and initiated an allocation on Tahoma to provide compute hours for processing multi-omics data. By leveraging these BER infrastructure resources, we can ensure reliable access to microbiome data and resilience for data processing demands.
ESS-DIVE
The Environmental System Science Data Infrastructure for a Virtual Ecosystem (ESS-DIVE) is a primary data repository for environmental data generated through Department of Energy funding.
How does the NMDC work with ESS-DIVE?
The NMDC team has worked with the ESS-DIVE team to develop methods for linking samples and metadata across data infrastructures. Together, we detailed these collaborations in a blog post this past year and have further engaged the broader environmental research community in these activities through the ESS-DIVE Open Data Workshop. Specifically, our coordinated efforts have used persistent identifiers to add links and references on relevant ESS-DIVE and NMDC landing pages, connecting data across these systems. For example, the International Generic Sample Numbers (IGSNs) of the original source samples, registered with the System for Earth Sample Registration (SESAR) and used in ESS-DIVE, connect to NMDC’s sample identifiers. Additionally, ESS-DIVE Dataset landing pages and NMDC Study landing pages reference JGI award DOIs, associated journal publication DOIs, and data DOIs to provide cross-links between the systems (e.g., Genome Resolved Open Watershed study, East River Watershed study).
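To illustrate how a shared persistent identifier can connect records held in two systems, here is a minimal sketch under stated assumptions. It is not the ESS-DIVE or NMDC implementation: the field names `id` and `igsn`, the sample identifiers, the `IGSN:` prefix handling, and the helpers `normalize_igsn` and `link_samples` are hypothetical and chosen only for illustration.

```python
from typing import Dict, Iterable, List, NamedTuple


class SampleLink(NamedTuple):
    """A cross-reference between one IGSN and one sample identifier."""
    igsn: str
    nmdc_sample_id: str


def normalize_igsn(raw: str) -> str:
    """Trim whitespace and drop an optional 'IGSN:' prefix (an assumed convention)."""
    value = raw.strip()
    if value.upper().startswith("IGSN:"):
        value = value[len("IGSN:"):]
    return value.strip()


def link_samples(
    ess_dive_igsns: Iterable[str], nmdc_samples: Iterable[Dict[str, str]]
) -> List[SampleLink]:
    """Match IGSNs listed in one system to sample records in the other."""
    # Index the sample records by their normalized IGSN, skipping records
    # that do not carry one.
    by_igsn: Dict[str, str] = {}
    for sample in nmdc_samples:
        igsn = sample.get("igsn")
        if igsn:
            by_igsn[normalize_igsn(igsn)] = sample["id"]

    links: List[SampleLink] = []
    for raw in ess_dive_igsns:
        key = normalize_igsn(raw)
        if key in by_igsn:
            links.append(SampleLink(igsn=key, nmdc_sample_id=by_igsn[key]))
    return links


# Placeholder identifiers, for illustration only.
links = link_samples(
    ["IGSN:EXAMPLE0001", " EXAMPLE0003 "],
    [
        {"id": "sample-a", "igsn": "EXAMPLE0001"},
        {"id": "sample-b", "igsn": "igsn:EXAMPLE0002"},
        {"id": "sample-c"},
    ],
)
```

Because the match is made on the identifier alone, neither system needs to copy the other's metadata; each landing page only needs the link.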
Further, team members at ESS-DIVE are collaborating with the NMDC team to test DataHarmonizer, the sample metadata validation tool used by the Submission Portal. The goal is to support the ESS-DIVE sample identifier and metadata reporting format within the NMDC Submission Portal tooling. This new effort will make it easier for researchers to submit Environmental System Science samples that have been assigned IGSNs, and to harmonize sample metadata across NMDC, JGI, and EMSL to support sharing and interoperability.
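The general idea of template-based sample metadata validation can be sketched as follows. This is a minimal illustration under stated assumptions and does not use or reproduce the DataHarmonizer API, the NMDC schema, or the ESS-DIVE reporting format: the field names, the rules in `TEMPLATE`, and the function `validate_row` are hypothetical.

```python
from typing import Callable, Dict, List


def is_nonempty(value: str) -> bool:
    """Accept any value that contains non-whitespace characters."""
    return bool(value.strip())


def is_latitude(value: str) -> bool:
    """Accept decimal degrees between -90 and 90 inclusive."""
    try:
        return -90.0 <= float(value) <= 90.0
    except ValueError:
        return False


# Hypothetical template: required field name -> check applied to its value.
TEMPLATE: Dict[str, Callable[[str], bool]] = {
    "sample_name": is_nonempty,
    "igsn": is_nonempty,
    "latitude": is_latitude,
}


def validate_row(row: Dict[str, str]) -> List[str]:
    """Return human-readable problems for one sample row (empty list if valid)."""
    errors: List[str] = []
    for field, check in TEMPLATE.items():
        value = row.get(field)
        if value is None:
            errors.append(f"{field}: missing")
        elif not check(value):
            errors.append(f"{field}: invalid value {value!r}")
    return errors


# Placeholder rows, for illustration only.
good_row = {"sample_name": "s1", "igsn": "EXAMPLE0001", "latitude": "38.9"}
bad_row = {"sample_name": " ", "latitude": "123"}

good_errors = validate_row(good_row)
bad_errors = validate_row(bad_row)
```

Returning a list of problems per row, rather than stopping at the first failure, lets a submitter fix every issue in a sample sheet in one pass.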
KBase
The DOE Systems Biology Knowledgebase (KBase) is a community-driven research platform for systems biology centered around providing bioinformatics workflows and tools to researchers through the use of free DOE-provided computing resources.
How does the NMDC work with KBase?
Multi-omics microbiome data available in the NMDC can support advanced analyses and modeling of microbiomes. Towards this effort, we initiated a collaboration with the KBase team to support sharing of NMDC’s multi-omics data within their narrative infrastructure. These initial efforts were prototyped in 2021 for metagenome data from six studies (https://narrative.kbase.us/#org/nmdc), but maintaining data harmonization across systems proved challenging. To overcome these challenges, the KBase team has initiated the development of a “Data Transfer Service” (DTS) with the JGI to prototype data sharing and preserve high-level credit and metadata information for attribution and provenance. We have recently started working with the KBase team to prototype the use of this solution with NMDC data. This will allow data in the NMDC to be accessed directly through the KBase platform and narrative infrastructure.
Further, the NMDC team is engaged in proactive discussions with the KBase team on data modeling efforts and is exploring mechanisms to integrate the NMDC schema into a revised KBase architecture. We are also developing interactive notebooks that will serve as examples of how NMDC data and metadata can be used in the KBase platform for data analysis and modeling of microbiomes.
Read more about the DOE BER ecosystem and how the NMDC acts as a data integration hub in our recent DOE Performance Metric report.