The NMDC Standardized Workflows

Integrating bioinformatics software in interoperable and reusable workflows

Motivation

While improvements to bioinformatics workflows are important for advancing science, the growing number of microbiome datasets processed with different tools makes data reuse and cross-study comparison difficult. To address these challenges, the NMDC aims to integrate existing open-source bioinformatics tools into standardized workflows for processing raw multi-omics data (e.g., metagenome, metatranscriptome, metaproteome, and metabolome data) to produce interoperable and reusable annotated data products.

During the pilot phase of the NMDC, the workflows will include bioinformatics tools developed by the Joint Genome Institute (JGI) and the Environmental Molecular Sciences Laboratory (EMSL). This User Facility software will be made publicly available as standalone, containerized workflows, giving any institution or individual the opportunity to obtain, install, and run the workflows in their own environments. The NMDC standardized workflow activities will be conducted in close coordination with developers at the JGI and EMSL User Facilities to ensure that the NMDC leverages their capabilities and integrates their updates.

      Establish Pipelines

      Establish key data pipelines for NMDC-Compliant Data Products

      Deploy workflow engines

      Deploy workflow management engines

      The NMDC Standardized Workflows Tasks

      The NMDC team will implement the activities outlined here with the aim of providing interoperable and reusable bioinformatics workflows.

      Generate Pilot Data

      Integrate and generate data products from pilot data providers

      Establish data submission

      Establish data ingestion and user submission process for non-NMDC data sources

      Improve Workflows

      Iterative Workflow Improvement and Novel Workflow Design

      Automate Updates

      Automating microbiome analysis updates

      Approach

      • Establish best practices and governance for baseline analysis products
      • Benchmark all workflows for reproducibility, accuracy, and scalability
      • For portability, use a workflow description format with software and dependencies in containers
      • Establish initial standardized open source workflows:
        • Data QC
        • Assembly and Annotation
        • Read-based taxonomy classification
        • Binning and MAG generation
        • Metatranscriptomics, metaproteomics, and metabolomics

      The NMDC workflows guiding principles

      • Workflows will be based on open-source tools
      • Workflows will be modular and containerized (Docker) for reproducibility (identical results) and portability (ease of deployment in different environments)
      • Workflows will be documented (Read the Docs), with descriptions of all functions, dependencies, and computational requirements
      • Workflows will be able to evolve as new tools, data types, and references emerge
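
As an illustration of the reproducibility principle above, the following is a minimal Python sketch of one way to check that two runs of the same workflow produced identical results. It assumes that "identical" means byte-identical output files; the function names and the directory layout are hypothetical and are not part of any NMDC tool.

```python
import hashlib
from pathlib import Path


def checksum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest(run_dir: Path) -> dict[str, str]:
    """Map each file's path (relative to run_dir) to its checksum."""
    return {
        p.relative_to(run_dir).as_posix(): checksum(p)
        for p in sorted(run_dir.rglob("*"))
        if p.is_file()
    }


def differing_outputs(run_a: Path, run_b: Path) -> list[str]:
    """List files that are missing from one run or differ between runs."""
    a, b = manifest(run_a), manifest(run_b)
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))
```

An empty list from `differing_outputs` means the two output directories match file for file. Byte-level comparison is a strict reading of reproducibility: outputs that embed timestamps or other run-specific values would need a format-aware comparison instead.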

      To capture every step of an analysis explicitly, the NMDC adopted the Workflow Description Language (WDL) as its workflow specification standard. Compared to the Common Workflow Language (CWL), WDL is better supported by the Cromwell workflow engine, which was developed by the Broad Institute, has already been adopted by the JGI, and has a growing community. Other guidelines include using free, open-source software whenever possible, providing sufficient documentation, testing and benchmarking intensively, and using proprietary software only when open-source, freely available alternatives do not exist or lack strong community buy-in. A standard documentation template based on Sphinx and reStructuredText has been developed; it can be extended to document the source code and to generate documentation in several formats, including an online HTML site and PDF.
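
To make the combination of WDL and containers concrete, the sketch below uses Python to write out a toy WDL 1.0 workflow and a matching inputs file. It is illustrative only and is not one of the NMDC workflows: the task and workflow names (`count_lines`, `qc_sketch`), the line-counting command, and the `ubuntu:22.04` image are placeholders chosen for this example.

```python
import json
from pathlib import Path

# A toy WDL 1.0 document (hypothetical; not an NMDC workflow).
# The runtime section names the container image the task runs in,
# which is how a WDL task carries its software dependencies with it.
WDL_SOURCE = """\
version 1.0

task count_lines {
  input {
    File reads
  }
  command <<<
    wc -l < ~{reads} > line_count.txt
  >>>
  output {
    File report = "line_count.txt"
  }
  runtime {
    docker: "ubuntu:22.04"
  }
}

workflow qc_sketch {
  input {
    File reads
  }
  call count_lines { input: reads = reads }
  output {
    File report = count_lines.report
  }
}
"""


def write_workflow(target_dir: Path, reads_path: str) -> tuple[Path, Path]:
    """Write the toy workflow and its inputs JSON; return both paths."""
    target_dir.mkdir(parents=True, exist_ok=True)
    wdl_path = target_dir / "qc_sketch.wdl"
    inputs_path = target_dir / "inputs.json"
    wdl_path.write_text(WDL_SOURCE)
    # Workflow-level inputs are keyed as "<workflow name>.<input name>".
    inputs_path.write_text(json.dumps({"qc_sketch.reads": reads_path}, indent=2))
    return wdl_path, inputs_path
```

With a local Cromwell jar and Docker available, files written this way could be run with a command along the lines of `java -jar cromwell.jar run qc_sketch.wdl --inputs inputs.json`, where the jar file name depends on the installed Cromwell version.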

      Accomplishments

      Workflows

      • Completed minimum viable product for:
        • Read QC
        • Assembly
        • Annotation
        • Read-based taxonomy
        • Metaproteomics
        • Metabolomics
      • Completed proof of concept for:
        • Metatranscriptomics
        • Contig binning and metagenome-assembled genome (MAG) workflows

      Workflow engine

      • Completed initial operational deployments of Cromwell at the National Energy Research Scientific Computing Center (NERSC) and Los Alamos National Laboratory (LANL).

      Ongoing activities

      • Workflow documentation
      • Workflow benchmarking
      • Processing of FICUS datasets
      • Augmenting the process and workflow ontology: constructed an initial data schema with Aim 1 to accommodate workflows and data products