The NMDC Standardized Workflows
Integrating bioinformatics software in interoperable and reusable workflows
While improvements to bioinformatics workflows are important for advancing science, the number of resulting microbiome datasets processed with different tools presents challenges for reusability and making cross-study comparisons. To address these challenges, the NMDC aims to integrate existing open-source bioinformatics tools into standardized workflows for processing raw multi-omics data (e.g., metagenome, metatranscriptome, metaproteome, and metabolome data) to produce interoperable and reusable annotated data products.
During the pilot phase of the NMDC, the workflows will include bioinformatics tools developed by the Joint Genome Institute (JGI), Environmental Molecular Sciences Laboratory (EMSL), and Los Alamos National Laboratory (LANL). This software will be made publicly available as standalone, containerized workflows, offering a unique opportunity for any institute/individual to obtain, install, and run the workflows in their own environments. The NMDC standardized workflows activities will be conducted in close coordination with developers at the JGI and EMSL User Facilities to ensure that the NMDC leverages their capabilities and integrates updates.
Establish key data pipelines for NMDC-compliant metagenomic, metatranscriptomic, metaproteomic, and metabolomic data products.
The NMDC Standardized Workflows Tasks
The NMDC team will implement the activities outlined here with the aim of providing interoperable and reusable bioinformatics workflows.
Establish data submission
Establish a data ingestion and user submission process for non-NMDC data sources by integrating data from external sources, including NCBI’s SRA.
Augment workflows to support broader sequencing platforms and data formats, de novo assembly of metatranscriptome data, and statistical metrics for quantifying gene transcription levels.
Automate the microbiome data reanalysis process by developing algorithms that scan novel taxonomies to trigger rerunning workflows on previously processed data.
not yet started
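The reanalysis task above can be pictured with a small sketch. This is only an illustration under stated assumptions, not the NMDC's algorithm (the task is listed as not yet started): the record shape `ProcessedDataset`, the function names, and the `min_unclassified` threshold are all hypothetical. The idea shown is to compare the taxa in two reference releases and, when new taxa have appeared, flag datasets that were processed against an older release and left many reads unclassified.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ProcessedDataset:
    """A previously processed dataset (hypothetical record shape)."""
    dataset_id: str
    reference_version: str        # reference release used by the earlier run
    unclassified_fraction: float  # share of reads the earlier run left unclassified


def novel_taxa(old_taxa: set[str], new_taxa: set[str]) -> set[str]:
    """Return taxa present in the new reference release but not in the old one."""
    return new_taxa - old_taxa


def datasets_to_rerun(
    datasets: list[ProcessedDataset],
    current_version: str,
    old_taxa: set[str],
    new_taxa: set[str],
    min_unclassified: float = 0.1,  # assumed threshold, for illustration only
) -> list[str]:
    """Flag dataset IDs worth reprocessing after a reference update."""
    # No new taxa means a rerun is unlikely to change any results.
    if not novel_taxa(old_taxa, new_taxa):
        return []
    return [
        d.dataset_id
        for d in datasets
        if d.reference_version != current_version
        and d.unclassified_fraction >= min_unclassified
    ]
```

A real trigger would likely weigh more signals, such as which lineages changed and which workflows depend on the reference, but the set difference captures the core check.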
The NMDC workflows guiding principles
- Workflows will be based on open-source* tools
- Workflows will be modular and containerized (Docker) for reproducibility (identical results) and portability (ease of deployment in different environments)
- Workflows will be documented (Read the Docs), with descriptions of all functions, dependencies, and computational requirements
- Workflows will be able to evolve as new tools, data types, and references emerge
To clearly capture all steps of analyses, the NMDC adopted the Workflow Description Language (WDL) as the workflow specification standard. Compared to the Common Workflow Language (CWL), WDL is better supported by the Cromwell workflow engine, which was developed by the Broad Institute, has already been adopted by the JGI, and has a growing community. Other guidelines include using free, open-source software whenever possible, providing sufficient documentation, testing and benchmarking intensively, and using proprietary software only when open-source, freely available alternatives do not exist or lack strong buy-in from the community. A standard documentation template using Sphinx and reStructuredText has been developed and can be extended to document the source code and generate documentation in various formats, including an online HTML site and PDF.
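To make the WDL and Cromwell pairing concrete, here is a minimal sketch of preparing a run from Python. It relies on two conventions: a WDL inputs file is JSON whose keys are qualified by the workflow name, and Cromwell's `run` mode takes a WDL file plus an `--inputs` file. The workflow name `example_qc`, its inputs, and all file paths are hypothetical placeholders, not names taken from the NMDC workflows.

```python
import json


def build_inputs(workflow_name: str, inputs: dict) -> dict:
    """Qualify each input key with the workflow name, as WDL inputs files expect."""
    return {f"{workflow_name}.{key}": value for key, value in inputs.items()}


def cromwell_command(cromwell_jar: str, wdl_path: str, inputs_path: str) -> list:
    """Assemble the argument list for a single local Cromwell run."""
    return ["java", "-jar", cromwell_jar, "run", wdl_path, "--inputs", inputs_path]


# Hypothetical workflow name and inputs, for illustration only.
inputs = build_inputs("example_qc", {"input_fastq": "reads.fastq.gz", "threads": 4})
print(json.dumps(inputs, indent=2))
print(" ".join(cromwell_command("cromwell.jar", "example_qc.wdl", "inputs.json")))
```

Writing the dictionary to `inputs.json` and passing the argument list to `subprocess.run` would launch the workflow, assuming Java and a Cromwell jar are available locally.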
All workflows are in GitHub/DockerHub:
For native installation, all of the workflows are now available on GitHub and have corresponding Docker images that bundle the third-party tools. Links to download requisite databases and test data sets are also provided.
Completed workflows available on NMDC EDGE:
NMDC EDGE is a platform that makes the NMDC workflows available to users through a user-friendly GUI. This platform is currently hosted at the San Diego Supercomputer Center. We are currently looking for beta-testers. If you are interested in testing these workflows, please contact us at email@example.com.
- All of the NMDC Metagenomic workflows have been included. These individual workflows include ReadsQC, Read-based Taxonomy Classification, Assembly, Annotation, and Metagenome-Assembled Genome (MAG) Generation.
- A full Metagenomics pipeline is also incorporated, which allows the user to run some or all of the individual metagenomic workflows with a single data input.
- A Natural Organic Matter workflow is now available for beta testing.
Workflows nearing completion within NMDC EDGE:
- The Metatranscriptomic workflow will be completed soon.
Completed metagenomics workflows documentation on NMDC EDGE, and documentation for native installations of all workflows
- Documentation for installing all NMDC workflows natively on an institutional system
- Documentation for running metagenomic workflows on NMDC EDGE.
- Metatranscriptomic, metaproteomic, metabolomic workflow availability on NMDC EDGE
- Augmenting process and workflow ontology – constructed initial data schema with Aim 1 to accommodate workflows and data products