The NMDC Standardized Workflows
Integrating bioinformatics software in interoperable and reusable workflows
While improvements to bioinformatics workflows are important for advancing science, the resulting proliferation of microbiome datasets processed with different tools makes it difficult to reuse data and compare results across studies. To address these challenges, the NMDC aims to integrate existing open-source bioinformatics tools into standardized workflows for processing raw multi-omics data (e.g., metagenome, metatranscriptome, metaproteome, and metabolome data) to produce interoperable and reusable annotated data products.
During the pilot phase of the NMDC, the workflows will include bioinformatics tools developed by the Joint Genome Institute (JGI), Environmental Molecular Sciences Laboratory (EMSL), and Los Alamos National Laboratory (LANL). This software will be made publicly available as standalone, containerized workflows, offering a unique opportunity for any institute/individual to obtain, install, and run the workflows in their own environments. The NMDC standardized workflows activities will be conducted in close coordination with developers at the JGI and EMSL User Facilities to ensure that the NMDC leverages their capabilities and integrates updates.
Establish key data pipelines for NMDC-compliant metagenomic, metatranscriptomic, metaproteomic, and metabolomic data products.
The NMDC Standardized Workflows Tasks
The NMDC team will implement the activities outlined here with the aim of providing interoperable and reusable bioinformatics workflows.
Establish data submission
Establish a data ingestion and user submission process for non-NMDC data sources, integrating data from external sources such as NCBI’s SRA.
Augment workflows to support broader sequencing platforms and data formats, de novo assembly of metatranscriptome data, and statistical metrics for quantifying gene transcription levels.
Automate the microbiome data reanalysis process by developing algorithms that scan for novel taxonomies and trigger workflow reruns on previously processed data.
not yet started
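The reanalysis task above has not been started, so no NMDC implementation exists to show. As a purely illustrative sketch of the idea, the Python below compares two releases of a reference taxonomy and flags datasets that were processed against an older release. The names `ProcessedDataset`, `novel_taxa`, and `datasets_to_rerun`, and the version strings, are hypothetical and do not come from any NMDC code.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class ProcessedDataset:
    """Hypothetical record of a dataset that has already been processed."""
    dataset_id: str
    reference_version: str  # taxonomy/reference release used at processing time


def novel_taxa(previous: Set[str], current: Set[str]) -> Set[str]:
    """Return taxa present in the current release but not the previous one."""
    return current - previous


def datasets_to_rerun(
    datasets: List[ProcessedDataset],
    current_version: str,
    added_taxa: Set[str],
) -> List[str]:
    """Flag datasets processed with an older release, but only if new taxa exist."""
    if not added_taxa:
        return []
    return [d.dataset_id for d in datasets if d.reference_version != current_version]


if __name__ == "__main__":
    added = novel_taxa({"Taxon A", "Taxon B"}, {"Taxon A", "Taxon B", "Taxon C"})
    processed = [
        ProcessedDataset("dataset-1", "release-1"),
        ProcessedDataset("dataset-2", "release-2"),
    ]
    print(datasets_to_rerun(processed, "release-2", added))
```

A real trigger would presumably also consider whether the new taxa are relevant to a given dataset; this sketch reruns everything processed against an outdated release.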
The NMDC workflows guiding principles
- Workflows will be based on open-source* tools
- Workflows will be modular and containerized (Docker) for reproducibility (identical results) and portability (ease of deployment in different environments)
- Workflows will be documented (Read the Docs), describing all functions, dependencies, and computational requirements
- Workflows will be able to evolve as new tools, data types, and references emerge
To clearly capture all steps of analyses, the NMDC adopted the Workflow Description Language (WDL) as the workflow specification standard. Compared to the Common Workflow Language (CWL), WDL is better supported by the Cromwell workflow engine, which was developed by the Broad Institute, has already been adopted by the JGI, and has a growing community. Other guidelines include using free, open-source software whenever possible, providing sufficient documentation, testing and benchmarking intensively, and using proprietary software only when open-source, freely available alternatives do not exist or lack strong buy-in from the community. A standard documentation template using Sphinx and reStructuredText has been developed and can be extended to document the source code and generate documentation in various formats, including an online HTML site and PDF.
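To make the WDL and Cromwell pairing concrete, here is a minimal Python sketch of how a WDL workflow is typically launched with Cromwell's `run` subcommand and a JSON inputs file. The jar path, WDL file name, and input key are hypothetical placeholders, not the actual NMDC file names; the sketch only builds the command and does not execute it.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def write_inputs(inputs: Dict[str, str], path: Path) -> Path:
    """Write workflow inputs as JSON, the format Cromwell reads via --inputs."""
    path.write_text(json.dumps(inputs, indent=2, sort_keys=True))
    return path


def cromwell_run_command(cromwell_jar: str, wdl: str, inputs_json: str) -> List[str]:
    """Build (but do not execute) a `cromwell run` command line."""
    return ["java", "-jar", cromwell_jar, "run", wdl, "--inputs", inputs_json]


if __name__ == "__main__":
    workdir = Path(tempfile.mkdtemp())
    # Hypothetical workflow name and input key, for illustration only.
    inputs_path = write_inputs(
        {"example_workflow.input_fastq": "sample.fastq.gz"},
        workdir / "inputs.json",
    )
    command = cromwell_run_command("cromwell.jar", "example_workflow.wdl", str(inputs_path))
    print(" ".join(command))
```

The resulting list could be passed to `subprocess.run` on a system where Java, the Cromwell jar, and the workflow's Docker images are available.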
All workflows are in GitHub/DockerHub:
For native installation, all of the workflows are now available in GitHub and have corresponding Docker images with the third-party tools. Links to download requisite databases and test data sets are also provided.
Completed metagenomic workflows available on NMDC EDGE (coming soon!):
- The workflows for shotgun metagenome assembly and annotation were adopted from existing production workflows from the JGI and the IMG/M platform. The assembly workflow had previously been converted to WDL; the annotation workflow has also been converted. In addition, the JGI QA/QC workflow was added to the suite as an upstream step before assembly. The QA/QC, Assembly, and Annotation workflows are each distinct, with their own WDL files, and can be found in the NMDC GitHub repository. The corresponding Docker images with all of the dependencies are available at DockerHub. The NMDC’s primary efforts for these workflows have been in documentation, removing dependencies on the JGI infrastructure, adapting them to be more portable, and bringing them into alignment with the principles described above.
- The workflow for shotgun metagenomic read-based analysis was based on an existing workflow from EDGE Bioinformatics. The NMDC’s efforts have focused on separating it out from EDGE, converting it to WDL, creating the corresponding Docker images for the third-party tools, documenting the workflow, and benchmarking its performance.
Workflows nearing completion within NMDC EDGE:
- The workflow for contig binning and metagenome-assembled genome (MAG) generation, based on the JGI MAG workflow, has been converted to WDL and added to the NMDC EDGE platform. It has not yet been benchmarked in this environment.
- The generation of a metatranscriptome de novo assembly follows current best practice at the JGI, which involves using MegaHIT. This assembly will be annotated using the workflow described above. Expression will be evaluated using an adaptation of a workflow from EDGE that is still in development. This includes read-mapping to the assembly (either the de novo metatranscriptome or a corresponding metagenome), followed by statistical analysis of the distribution and abundance of hits. This workflow has been converted to WDL, and the corresponding Docker image is being added to NMDC EDGE.
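The source does not say which statistic the expression workflow reports. As one common way to turn mapped-read counts into length-normalized abundances, the sketch below computes transcripts per million (TPM); treat it as an assumption for illustration, not as the NMDC or EDGE method. Gene names and numbers are made up.

```python
from typing import Dict


def tpm(read_counts: Dict[str, int], gene_lengths_bp: Dict[str, int]) -> Dict[str, float]:
    """Compute transcripts per million from mapped-read counts and gene lengths.

    Each count is first divided by gene length in kilobases, then the
    per-gene rates are scaled so that they sum to one million.
    """
    rates = {
        gene: read_counts[gene] / (gene_lengths_bp[gene] / 1000.0)
        for gene in read_counts
    }
    total = sum(rates.values())
    if total == 0:
        # No mapped reads at all: report zero abundance for every gene.
        return {gene: 0.0 for gene in rates}
    return {gene: rate / total * 1_000_000 for gene, rate in rates.items()}


if __name__ == "__main__":
    counts = {"gene_a": 10, "gene_b": 20}        # hypothetical mapped-read counts
    lengths = {"gene_a": 1000, "gene_b": 2000}   # hypothetical gene lengths (bp)
    print(tpm(counts, lengths))
```

In this example both genes have the same reads-per-kilobase rate, so they receive equal TPM values even though `gene_b` has twice as many mapped reads.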
Completed initial operational deployments of Cromwell at the National Energy Research Scientific Computing Center (NERSC) and Los Alamos National Laboratory (LANL)
- Workflow execution services have been deployed at NERSC and LANL. NERSC’s Spin container platform serves as the key production system, while LANL big-memory servers (> 1 TB of memory in a single node) are used for de novo metagenome assembly, complementing the limited number of big-memory nodes at NERSC. Cromwell was chosen as the workflow manager at both sites based on its strong support from the Broad Institute, growing community, and feature set.
Completed metagenomics workflows documentation on NMDC EDGE, and documentation for native installations of all workflows
- Documentation for installing all NMDC workflows natively on an institutional system
- Documentation for running metagenomic workflows on NMDC EDGE – coming soon!
- Metatranscriptomic, metaproteomic, metabolomic workflow availability on NMDC EDGE
- Augmenting process and workflow ontology – constructed initial data schema with Aim1 to accommodate workflows and data products