Data Management Best Practices

Documenting how data is generated, organized, stored, and retrieved is crucial to making data findable, accessible, interoperable, and reusable (FAIR).

What role does data management play in data sharing?

According to the 8 Steps of the Data Life Cycle, data management is an iterative step that persists from the beginning to the end of the project. Documenting how data is generated, organized, stored, and retrieved is crucial to making data findable, accessible, interoperable, and reusable (FAIR). Data and data management plans need to be machine readable to fully abide by FAIR principles. Making data machine readable increases the scope and scale at which data can be shared and analyzed. The FAIR guiding principles are rooted in the idea that “good data management is not a goal in itself, but rather is the key conduit leading to knowledge discovery and innovation, and to subsequent data and knowledge integration and reuse by the community after the data publication process” (doi.org/10.1038/sdata.2016.18). A well-defined and well-executed data management plan lowers the barriers to sharing data and allows the research community to work together.

Why is sharing data important in microbiome research?

Over the past couple of decades, microbiome data have grown exponentially. However, the sheer amount of data available presents a significant bottleneck for analysis and interpretation (doi.org/10.1038/s41564-017-0077-3). Standardizing data through community-driven practices (FAIR, CARE, and TRUST) to enable cross-study search and access provides avenues for scientific collaboration and discovery, although the microbiome field is only just beginning to find ways to coordinate across these broader open science practices. Standardization of data is key to enabling reusability, creating massive potential for new discoveries and the advancement of microbiome sciences. By sharing FAIR data, the microbiome research community is one step closer to decoding the molecular underpinnings of fundamental biological processes and, ultimately, driving transformational discoveries (doi.org/10.1038/s41579-020-0377-0).

One method of sharing your data is to submit a “Microbiome Data Report”, which makes your data available for reuse and describes in detail how it was produced, increasing its reusability. Journals currently accepting Microbiome Data Reports include Nature Scientific Data and ASM’s Microbiology Resource Announcements. Consider publishing a Microbiome Data Report, especially if your sample set is particularly unique or relevant for your field.

For more information on the importance of data sharing, see the Nature Special Edition on Data Sharing.

What is a Data Management Plan?

A data management plan (DMP) is an integral part of grant applications. DMPs are required by every federal funder, but the guidelines vary depending on the agency. Here you can find information for Federal Funding Agencies that work with microbiome data.

Your DMP communicates how you and your team will collect, categorize, store, and share any data produced over the duration of a grant, and how that data will be preserved and made accessible after the completion of the project. While the DMP is important for your grant proposal, it also lays the groundwork for producing high-quality, accessible, and reusable data. A DMP should be a living document that sets expectations for your project team before and during the project. To maximize the impact of your DMP, it should be public, machine readable, and openly licensed (doi.org/10.1371/journal.pcbi.1004525).

Although it is a new concept, making your DMP machine readable increases the likelihood that your data will gain recognition and credit, because your data can be located, reused, and cited easily (doi.org/10.1371/journal.pcbi.1006750). To learn more about the merits of machine-readable DMPs, read Ten principles for machine-actionable data management plans.
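
As a rough illustration of what "machine readable" can mean in practice, the sketch below stores a small DMP as structured JSON instead of free text, so that software can read back which repositories the plan targets. The field names (title, license, public, datasets) and the example values are hypothetical choices made for this sketch; they are not an official DMP schema and are not taken from the NMDC template.

```python
import json

# Hypothetical DMP record. Field names are illustrative assumptions,
# not an official machine-actionable DMP schema.
dmp = {
    "title": "Example microbiome project DMP",
    "license": "CC-BY-4.0",   # an open license identifier
    "public": True,           # the plan itself is shared publicly
    "datasets": [
        {
            "data_type": "metagenomics",
            "standard": "GSC MIxS",
            "repository": "SRA",
        }
    ],
}


def to_machine_readable(plan: dict) -> str:
    """Serialize the plan to JSON so that software can parse it."""
    return json.dumps(plan, indent=2, sort_keys=True)


def repositories(plan_json: str) -> list:
    """Parse a serialized plan and list the repositories it names."""
    plan = json.loads(plan_json)
    return [dataset["repository"] for dataset in plan["datasets"]]


serialized = to_machine_readable(dmp)
print(repositories(serialized))
```

Because the plan is structured data, a funder, repository, or institution could query it automatically, which is much harder to do with a PDF or word-processor document.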

What is the NMDC’s role in data management?

Alongside the NMDC data portal and workflows, the NMDC team is collaborating with the research community to improve training resources in data literacy and data management practices. The microbiome research community has voiced a need for more support for training in the collection, analysis, and sharing of data generated through their experiments, in addition to more traditional training for technical and laboratory skills.

To support best practices in microbiome data management, the NMDC team has made available two key resources:

In partnership with the University of California Curation Center of the California Digital Library, the NMDC team has created a microbiome-specific DMPTool Template. DMPTool is an open-source application that assists researchers in the creation of data management plans compliant with federal funding requirements. The NMDC DMPTool template is funding-organization agnostic and was developed to support microbiome data management best practices with specifications unique to microbiome standards and data processing. Once you create an account, this link will take you to the NMDC team’s Microbiome Omics Research DMP Template which provides step-by-step prompts for your DMP.
The NMDC team provides Data Management Plan Consultancies where we can help you draft an effective DMP and provide you with tools and resources for completing your DMP in accordance with funder requirements and community best practices. The NMDC team has the expertise to offer guidance on the creation of DMPs for the Department of Energy Office of Science proposals. Below you can find DMP requirements and resources for all federal funders that work with microbiome research. Email us to schedule a consultation.

What to include in your DMP?

While your DMP should include information about all data collected through the duration of your research, these guidelines and best practices focus on large omic data generated from microbiome samples. If you are unfamiliar with microbiome data management and metadata standards, we recommend you begin with an Introduction to Metadata and Ontologies: Everything You Always Wanted to Know About Metadata and Ontologies (But Were Afraid to Ask) (doi.org/10.25979/1607365) and The NMDC Metadata Standards Documentation. These resources will introduce you to multi-omics metadata standards that leverage existing community-driven standards.

All of these sections are laid out in the NMDC DMPTool template. This is a living document. The NMDC team welcomes community feedback on these resources. Email us with comments, suggestions, or questions.

Sample and Data Types and Sources: This section outlines what kinds of data will be produced throughout the project.

Describe the data set, including basic identification information, average file size, and the estimated volume or number of data files produced.
What types of data are being generated?
How are they being generated (tools and instruments)?
What analysis stages will the data go through?
Will you be using existing data for any of your findings?
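
The size and volume estimates asked for above can be worked out with simple arithmetic. The following is a minimal sketch, assuming you know the number of samples, the files produced per sample, and an average file size. The function name, the example numbers, and the multiplier for derived products are all hypothetical and should be replaced with values from your own project.

```python
def estimate_storage_gb(n_samples, files_per_sample, avg_file_size_gb,
                        derived_multiplier=1.0):
    """Estimate total storage in GB.

    derived_multiplier is an assumed factor that scales the raw volume
    to account for processed and intermediate files (1.0 = raw data only).
    """
    raw_gb = n_samples * files_per_sample * avg_file_size_gb
    return raw_gb * derived_multiplier


# Hypothetical project: 48 samples, 2 files per sample, 5 GB per file.
raw_only = estimate_storage_gb(48, 2, 5.0)
with_derived = estimate_storage_gb(48, 2, 5.0, derived_multiplier=1.5)

print(f"Raw data: {raw_only:.0f} GB")
print(f"Raw plus derived products: {with_derived:.0f} GB")
```

Recording the assumptions behind the estimate in the DMP makes it easier to revise the numbers as the project changes.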

Data Standards and Formats: This section defines all variables of interest and communicates that you are aware of and will abide by community best practices whenever possible.

How is your data being processed?
What are the recognized community standards for your data and which will you follow?
How will your data adhere to FAIR Data Principles?
Who will ensure that the data standards and formats are maintained?
How will you define and categorize variables of interest that are not part of standard fields? List all variables of interest.
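
One way to define variables of interest that fall outside standard fields is a small data dictionary that records each variable's name, description, unit, and type. The sketch below is an assumption-laden example: the variable names (plot_shading_pct, days_since_irrigation) and the checking function are invented for illustration and do not come from MIxS or any other standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariableDefinition:
    name: str
    description: str
    unit: str
    dtype: type


# Hypothetical project-specific variables that are not standard fields.
DATA_DICTIONARY = {
    v.name: v
    for v in [
        VariableDefinition("plot_shading_pct",
                           "Percent of plot under canopy shade",
                           "percent", float),
        VariableDefinition("days_since_irrigation",
                           "Days since the plot was last irrigated",
                           "days", int),
    ]
}


def check_record(record: dict) -> list:
    """Return a list of problems: undefined variables or wrong types."""
    problems = []
    for key, value in record.items():
        definition = DATA_DICTIONARY.get(key)
        if definition is None:
            problems.append(f"{key}: not in data dictionary")
        elif not isinstance(value, definition.dtype):
            problems.append(f"{key}: expected {definition.dtype.__name__}")
    return problems


print(check_record({"plot_shading_pct": 40.0, "soil_color": "brown"}))
```

Keeping the dictionary in one place gives the person responsible for standards and formats a single list to maintain and to cite in the DMP.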

Microbiome Community Standards & Repositories*
Data Type	Metagenomics	Metatranscriptomics	Metaproteomics	Metabolomics
Standards	GSC MIxS	GSC MIxS	Proteomics Standards Initiative	Metabolomics Standards Initiative
Repositories	SRA, ENA, DDBJ	Gene Expression Omnibus, ArrayExpress	PRIDE, Protein Data Bank, ProteomeXchange	Metabolomics Workbench, MetaboLights

*This is not an exhaustive list; however, these repositories are widely accepted as standard in the omics field. While there are other repositories available, an important aspect of a quality DMP is the long-term preservation of data. By depositing your data in a primary data archive, you reduce the risk that the repository will stop being maintained and that your data will be lost. Additional repositories can be found through the Nature recommended scientific repositories.
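
If you manage several data types, the table above can be kept as a simple lookup so that scripts or templates can fill in the matching standard and repositories. This is only a sketch: the dictionary mirrors the table, while the function name and the error handling are assumptions made for the example.

```python
# Standards and repositories per data type, copied from the table above.
COMMUNITY_RESOURCES = {
    "metagenomics": {
        "standard": "GSC MIxS",
        "repositories": ["SRA", "ENA", "DDBJ"],
    },
    "metatranscriptomics": {
        "standard": "GSC MIxS",
        "repositories": ["Gene Expression Omnibus", "ArrayExpress"],
    },
    "metaproteomics": {
        "standard": "Proteomics Standards Initiative",
        "repositories": ["PRIDE", "Protein Data Bank", "ProteomeXchange"],
    },
    "metabolomics": {
        "standard": "Metabolomics Standards Initiative",
        "repositories": ["Metabolomics Workbench", "MetaboLights"],
    },
}


def lookup(data_type: str) -> dict:
    """Return the standard and repositories listed for a data type."""
    key = data_type.strip().lower()
    if key not in COMMUNITY_RESOURCES:
        raise KeyError(f"No entry for data type: {data_type!r}")
    return COMMUNITY_RESOURCES[key]


entry = lookup("Metaproteomics")
print(entry["standard"], entry["repositories"])
```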

Roles and Responsibilities: This section shows how your data management plan will be executed and ensures that your team’s data management responsibilities are clearly defined.

Who is responsible for data storage and access, quality control, documentation, and preservation during the project?
Who is responsible for coordinating the various data once collection is complete?
What is your estimated budget for data management activities?

Data Dissemination & Archiving: This section describes what the final data products will be and how you will protect data, if applicable.

What are the anticipated data products? Include secondary products (publications, presentations, etc.).
When will your data be released?
Who is the target audience for your data set?
Is the security of your data important for privacy reasons? If so, how do you intend to protect your data?

Policies for data sharing, public access, and re-use: This section communicates that you understand your funder’s data sharing policies and that you have a plan to ensure public availability.

How will you comply with your funder’s data policies? Most federal funders require the public availability of all data produced.
What are your data attribution standards for other researchers who may use your data?
Are there any inherent restrictions on the sharing of the data?

Data and Sample Preservation: This section communicates the sustainability plan for your data, showing your funder that the data products will last after the completion of the project.

Who is responsible for maintaining the data and metadata over time?
How much of your budget (if any) will be dedicated to data maintenance and preservation?
How much storage do you anticipate needing for your final data products?
How will you ensure adherence to your DMP?