Jul. 15, 2024 | Blog Post

“Metadata – A love letter to your future self”: Community-driven metadata standards to promote open and accessible science

Leah Johnson and Winston Anthony

To learn more about how we can support microbiome researchers, we often ask our community members to reflect on their experiences. We recently asked, “Have you reused data for your research and, if so, what are some of your sticking points in the data reuse process?” One of the most common answers we hear is missing metadata, the data about your data. When metadata is inaccurate or incomplete, a researcher cannot be sure how the data was generated, from sample collection to sample processing. This makes it extremely difficult to reuse datasets and hinders the research community’s ability to expand upon past work. Across domains, researchers are pushing to make data Findable, Accessible, Interoperable, and Reusable (FAIR), particularly as the amount of publicly available sequencing and multi-omics data rapidly expands [1]. A key part of FAIR-aligned data management is implementing metadata standards so that data carries enough relevant context to be properly reused. However, it may not always be clear which standards to use and how to adhere to them [2].

In 2005, the Genomic Standards Consortium (GSC) was formed by Dawn Field (1969–2020), a visionary in molecular biology and genomics, to create community-driven standards to make it easier for researchers to find, access, and compare genomic data [3]. Following many influential projects to develop genomic standards over the years, the “Minimum Information about any (x) Sequence (MIxS)” has emerged as the ‘go-to’ standard for sequence data. Within the MIxS framework, modular components enable researchers to combine information about how a sequence was generated with the environmental context in which a sample was collected. The National Microbiome Data Collaborative (NMDC) has worked towards creating standardized metadata ontologies for environmental microbiome researchers by integrating and curating community-driven standards. The curated NMDC metadata templates combine terms from the GSC’s MIxS, the Joint Genome Institute’s (JGI) Genomes OnLine Database (GOLD), and the Environment Ontology (EnvO) (Figure 1) [3,4,5]. From the beginning of the NMDC, we have worked closely with the GSC to create robust descriptors for environmental multi-omics microbiome data. Most recently, we co-authored a book chapter in Comparative Genomics: Methods and Protocols with the GSC team titled “A practical approach to using the Genomic Standards Consortium MIxS reporting standard for comparative genomics and metagenomics” [6]. We aimed to create a practical resource for researchers to understand the standard and easily use it in their own work.

Figure 1. MIxS extensions available through the GSC. The extensions on the left are currently supported by the NMDC.
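To make the modular idea concrete, here is a minimal sketch of checking a sample record against a combination of term sets: one describing how the sequence was generated, one describing environmental context, and one for a soil extension. The term names are real MIxS slot names, but the groupings below are small illustrative subsets that we chose for this example, and the `missing_terms` helper and the sample record are hypothetical. The authoritative lists live in the MIxS release itself.

```python
# Illustrative subsets of MIxS term names; the full, authoritative lists
# live in the MIxS release maintained by the GSC.
SEQUENCING_TERMS = {"samp_name", "project_name", "seq_meth"}
ENVIRONMENT_TERMS = {
    "collection_date", "lat_lon", "geo_loc_name",
    "env_broad_scale", "env_local_scale", "env_medium",
}
SOIL_EXTENSION_TERMS = {"depth", "elev"}  # assumed subset of the soil extension


def missing_terms(record, *term_sets):
    """Return the sorted required terms that are absent or blank in `record`."""
    required = set().union(*term_sets)
    return sorted(
        term for term in required
        if record.get(term) is None or not str(record[term]).strip()
    )


# Hypothetical sample record, for illustration only.
sample = {
    "samp_name": "plot7_core2",
    "project_name": "example soil survey",
    "seq_meth": "Illumina NovaSeq",
    "collection_date": "2023-06-14",
    "lat_lon": "35.88 N 106.31 W",
    "env_medium": "soil",
    "depth": "",  # present but blank, so it is reported as missing
}

print(missing_terms(sample, SEQUENCING_TERMS, ENVIRONMENT_TERMS, SOIL_EXTENSION_TERMS))
```

Passing the term sets as separate arguments mirrors the way a checklist and an extension are combined: swapping the soil set for another extension changes what is required without touching the rest of the check.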

Through our partnership with the GSC and the NMDC Champions program, we offer the research community an avenue to give feedback on these terms and help maintain the MIxS standard. Members of the research community can also contribute to the MIxS standard to support their specific research fields. For example, NMDC Champion Natalia Galud Erazo at the Scripps Institution of Oceanography is working to create a set of guidelines and standards for the Mangrove Microbiome Initiative. Her goal is to help researchers within her community access and reuse data, as well as facilitate collaborations and data sharing for comparative studies across sites worldwide. While she is still in the early stages of contributing to these guidelines, Erazo commented that being a part of the NMDC community has made this process easier: “I have received a lot of help, guidance, and advice, and that has made things easy to navigate especially when you start getting into the weeds of packages, standards, and ontologies which can be a little bit overwhelming and confusing at times!”

To get a ‘behind the scenes’ perspective on community standards, we connected with several GSC members to learn more about what it takes to develop and maintain standards, and what motivates them.

Mark Miller is a software developer at Lawrence Berkeley National Laboratory working to develop the NMDC metadata schema. Peter Woollard is a Data Standards Biocurator working with repositories such as the European Nucleotide Archive. Both are active members of the GSC’s Technical Working Group. Lynn Schriml is an Associate Professor at the University of Maryland School of Medicine and currently serves as the President of the GSC Board.

What tips do you have for researchers looking to adopt or contribute to metadata standards?

Lynn: For researchers looking to adopt standards, search for and identify relevant standards in your field. Standardize as much of the collected data as possible. If you need additional terms from a standard, contact the developers. To contribute, connect with the standards developers – it is an open, collaborative community.

Mark: Learn about the existing standards themselves, like MIxS, but also learn about the principles and technologies that underlie the standards. GSC has done a great job of collecting community input about the terminology that researchers want for describing their own research. But it’s earlier on in the process of ensuring that data described with MIxS terms can be consistently retrieved. Remember that you are a data user as well as a data generator!

Peter: Research which archive you want your data submitted to and then look for standards geared towards your field and that archive. If there are not suitable standards, approach GSC and offer to help work with them to create or to improve standards. Also think about the data as being a very valuable resource, not just a side product of material generated towards a scientific paper. If you organize your data well and with good metadata and data standards from the start it will help you get the most out of the data and for the data to be made available with rich metadata in an archive for posterity.

Why do you believe standards are important?

Mark: Standards are a protocol for sharing data with others, but also a love letter to your future self. Adhering to standards is also likely to be good preparation for artificial intelligence/machine learning analyses. Honestly, I think all of us know why standards are important, but we can be tempted to do the minimum because we have so many other challenges and opportunities. The important part is planning to work within a standards system from the beginning.

Peter: I believe that scientific understanding advances best with high quality and diverse data. By using appropriate metadata and data standards, data can be more easily found, accessed, interoperable and reproducible (FAIR).

Lynn: Standards are important to coordinate and share data to enable the reuse of highly valuable sequencing data, for example, in order to enable analysis exploring drivers of microbial diversity.

What do you enjoy most about developing or maintaining these standards?

Peter: It is helping in a small way to improve the handling of the world’s biological scientific knowledge. The Earth ecosystem and its peoples will continue to face challenges, like pandemics and climate change. The more FAIR data is, the more useful it will be in helping our scientific and medical people solve problems and discover more. [Outside] GSC, I am currently deeply involved with a project seeking good practice and finding gaps in aquatic eDNA data. Having been involved with GSC and ENA is helping me spot and understand the existence of gaps, and also hearing the frustrations of researchers active in the eDNA community. So I have enjoyed sharing knowledge of GSC and learning from them, on why certain standards gaps are more important than others and ultimately will need to feed this back to improve the standards, which iteratively improves future data.

Lynn: I most enjoy making genomic data discoverable, working with research communities to facilitate their data collection and analysis, enriching our capacity to further knowledge advancement.

Mark: I enjoy working with people who want to create tools that can be used to solve problems. In my case, I really want to see MIxS become more and more robust, transparent and utilitarian, especially because I manage a data schema that depends on MIxS.

Figure 2. NMDC team member and 2023 Dawn Field Award recipient Montana Smith presents on the NMDC metadata standards and NMDC Submission Portal at the GSC 2023 Annual Meeting.

MIxS in Action

Researchers across the globe are implementing the MIxS standard to make their research accessible to the broader community. Winston Anthony, a 2024 NMDC Ambassador, shares how metadata standards have improved his research:

As a computational microbiologist, I am involved in multiple projects tasked with assembling bacterial genomes from sequence data. It is remarkable to see how the standards produced by the Genomic Standards Consortium have changed the way I collect, analyze, and report data. Unifying and standardizing shared descriptors and environmental packages for metadata has increased the usage by others of the sequence data I upload to the Sequence Read Archive (SRA). This makes me feel more connected to the global research community, and increases the reach of my home institution and work.

Currently I am working on a large-scale data reuse project where we are looking at the taxonomy and functional potential of metagenome-assembled genomes (MAGs) from already published sequence data. Within the JGI IMG/M Data Portal, we used the GOLD ecosystem subtype category for “Soil”, and the standardized quality cut-offs for medium to high quality MAGs [assembly_qual] defined by the Minimum Information about any (x) Sequence (MIxS) as standardized terms for restricting our data search. In this way we were able to quickly and programmatically collect thousands of MAGs for analysis, each with a plethora of associated metadata to aid our analysis. This task would have been impossible 8 years ago, before the Minimum Information about a Single Amplified Genome (MISAG) and a Metagenome-Assembled Genome (MIMAG) standards were created. By requiring the inclusion of metadata like latitude and longitude coordinates of sampling locations and collection time/date, we now have incredibly rich, longitudinal datasets at the continental and even global scale that we can start to mine for new microbiological insight.
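As a rough illustration of why standardized terms make this kind of search programmable, the sketch below filters a list of MAG records by ecosystem and quality tier. It is not the IMG/M interface or the query used in the project: the record fields (`mag_id`, `ecosystem_subtype`, `completeness`, `contamination`, `has_rrna_and_trna`) and both function names are hypothetical. The thresholds follow the MIMAG cut-offs as commonly summarized (high quality: more than 90% completeness, less than 5% contamination, plus rRNA and tRNA genes; medium quality: at least 50% completeness and less than 10% contamination), so check them against the published standard before relying on them.

```python
def mimag_tier(completeness, contamination, has_rrna_and_trna=False):
    """Assign a quality tier from completeness and contamination percentages.

    Thresholds are the MIMAG cut-offs as commonly summarized (an assumption
    of this sketch, not a substitute for the published standard).
    """
    if completeness > 90 and contamination < 5 and has_rrna_and_trna:
        return "high"
    if completeness >= 50 and contamination < 10:
        return "medium"
    return "below_medium"


def select_mags(records, ecosystem="Soil", tiers=("medium", "high")):
    """Return IDs of records matching the ecosystem and an accepted tier."""
    return [
        r["mag_id"]
        for r in records
        if r["ecosystem_subtype"] == ecosystem
        and mimag_tier(
            r["completeness"],
            r["contamination"],
            r.get("has_rrna_and_trna", False),
        ) in tiers
    ]


# Hypothetical records, for illustration only.
records = [
    {"mag_id": "MAG_001", "ecosystem_subtype": "Soil",
     "completeness": 96.2, "contamination": 1.1, "has_rrna_and_trna": True},
    {"mag_id": "MAG_002", "ecosystem_subtype": "Soil",
     "completeness": 61.0, "contamination": 7.5},
    {"mag_id": "MAG_003", "ecosystem_subtype": "Soil",
     "completeness": 42.0, "contamination": 2.0},
    {"mag_id": "MAG_004", "ecosystem_subtype": "Marine",
     "completeness": 98.0, "contamination": 0.5, "has_rrna_and_trna": True},
]

print(select_mags(records))
```

The filter only works because every record uses the same vocabulary for ecosystem and quality. With free-text descriptions, each dataset would need its own hand-written parsing before any comparison could start.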

Figure 3. NMDC team members at the GSC 2023 Annual Meeting in Thailand (left to right: Julia Kelliher, Mark Miller, Montana Smith, Yuri Corilo).

The community comes together to discuss standards each year at the Genomic Standards Consortium Annual Meeting. This year, the conference will be held in Tucson, Arizona from August 5–9, 2024. Meeting organizer and NMDC Champion Bonnie Hurwitz encourages everyone to attend. “This year’s meeting features an amazing lineup of seasoned and emerging researchers in the genomic standards community. I am particularly excited to hear more about ways researchers are promoting genomic reproducibility, this year’s theme!” When asked about what sessions or topics she is most excited to discuss with the standards community, she replied, “This year’s special session on virome standards is especially relevant due to the new NIH Common Fund Human Virome Program, which aims to study the human virome across diverse cohorts, develop new tools, and reveal host-virome interactions. Coordinating data standards, integration and access is vital to these efforts.” NMDC team members will be in attendance to learn more about and discuss these new areas with the community. Connect with us at our workshop to talk about microbiome standards!

LA-UR-24-27004

References

[1] M. D. Wilkinson, M. Dumontier, Ij. J. Aalbersberg, G. Appleton, M. Axton, A. Baak, N. Blomberg, J.-W. Boiten, L. B. da Silva Santos, P. E. Bourne, J. Bouwman, A. J. Brookes, T. Clark, M. Crosas, I. Dillo, O. Dumon, S. Edmunds, C. T. Evelo, R. Finkers, A. Gonzalez-Beltran, A. J. G. Gray, P. Groth, C. Goble, J. S. Grethe, J. Heringa, P. A. C. ’t Hoen, R. Hooft, T. Kuhn, R. Kok, J. Kok, S. J. Lusher, M. E. Martone, A. Mons, A. L. Packer, B. Persson, P. Rocca-Serra, M. Roos, R. van Schaik, S.-A. Sansone, E. Schultes, T. Sengstag, T. Slater, G. Strawn, M. A. Swertz, M. Thompson, J. van der Lei, E. van Mulligen, J. Velterop, A. Waagmeester, P. Wittenburg, K. Wolstencroft, J. Zhao, B. Mons, The FAIR Guiding Principles for scientific data management and stewardship. Sci Data 3, 160018 (2016).
[2] P. Vangay, J. Burgin, A. Johnston, K. L. Beck, D. C. Berrios, K. Blumberg, S. Canon, P. Chain, J.-M. Chandonia, D. Christianson, S. V. Costes, J. Damerow, W. D. Duncan, J. P. Dundore-Arias, K. Fagnan, J. M. Galazka, S. M. Gibbons, D. Hays, J. Hervey, B. Hu, B. L. Hurwitz, P. Jaiswal, M. P. Joachimiak, L. Kinkel, J. Ladau, S. L. Martin, L. A. McCue, K. Miller, N. Mouncey, C. Mungall, E. Pafilis, T. B. K. Reddy, L. Richardson, S. Roux, L. M. Schriml, J. P. Shaffer, J. C. Sundaramurthi, L. R. Thompson, R. E. Timme, J. Zheng, E. M. Wood-Charlson, E. A. Eloe-Fadrosh, Microbiome Metadata Standards: Report of the National Microbiome Data Collaborative’s Workshop and Follow-On Activities. mSystems 6, e01194-20 (2021).
[3] D. Field, L. Amaral-Zettler, G. Cochrane, J. R. Cole, P. Dawyndt, G. M. Garrity, J. Gilbert, F. O. Glöckner, L. Hirschman, I. Karsch-Mizrachi, H.-P. Klenk, R. Knight, R. Kottmann, N. Kyrpides, F. Meyer, I. San Gil, S.-A. Sansone, L. M. Schriml, P. Sterk, T. Tatusova, D. W. Ussery, O. White, J. Wooley, The Genomic Standards Consortium. PLoS Biol 9, e1001088 (2011).
[4] T. B. K. Reddy, A. D. Thomas, D. Stamatis, J. Bertsch, M. Isbandi, J. Jansson, J. Mallajosyula, I. Pagani, E. A. Lobos, N. C. Kyrpides, The Genomes OnLine Database (GOLD) v.5: a metadata management system based on a four level (meta)genome project classification. Nucleic Acids Res 43, D1099-1106 (2015).
[5] P. L. Buttigieg, N. Morrison, B. Smith, C. J. Mungall, S. E. Lewis, the ENVO Consortium, The environment ontology: contextualising biological and biomedical entities. Journal of Biomedical Semantics 4, 43 (2013).
[6] E. A. Eloe-Fadrosh, C. J. Mungall, M. A. Miller, M. Smith, S. S. Patil, J. M. Kelliher, L. Y. D. Johnson, F. E. Rodriguez, P. S. G. Chain, B. Hu, M. B. Thornton, L. A. McCue, A. C. McHardy, N. L. Harris, T. B. K. Reddy, S. Mukherjee, C. I. Hunter, R. Walls, L. M. Schriml, “A Practical Approach to Using the Genomic Standards Consortium MIxS Reporting Standard for Comparative Genomics and Metagenomics” in Comparative Genomics: Methods and Protocols, J. C. Setubal, P. F. Stadler, J. Stoye, Eds. (Springer US, New York, NY, 2024; https://doi.org/10.1007/978-1-0716-3838-5_20), pp. 587–609.

Media Contact

Leah Johnson

leahjohnson@lanl.gov, https://doi.org/10.5281/zenodo.12745149
