Recommendations for open-source data and metadata standards

In conclusion of this report, we would like to propose a set of recommendations that distill the lessons learned from an examination of data and metadata standards through the lense of open-source software development practices. We divide this section into two parts: one aimed at the science and technology communities that develop and maintain open-source standards, and the other aimed at policy-making and funding agencies, who have an interest in fostering more efficient, more robust, and more transparent open-source standards.

Science and technology communities:

Establish standards governance based on OSS best practices

While best-practice governance principles are also relatively new in OSS communities, there is already a substantial set of prior art in this domain, on which the developers and maintainers of open-source data and metadata standards can rely. For example, it is now clear that governance principles and rules can mitigate some of the risks and challenges mentioned in (sec-challenges?), especially for communities beyond a certain size that need to converge toward a new standard or rely on an existing standard. Developers and maintainers should review existing governance practices such as those provided by The Open Source Way, (https://www.theopensourceway.org/).

Foster meta-standards development

One of the main conclusions that arise from our survey of the landscape of existing standards is that there is significant knowledge that exists across fields and domains and that informs the development of standards within each field, but that could be surfaced to the level where it may be adopted more widely in different domains and be more broadly useful. One approach to this is a comparative approach: in this approach, a readiness and/or maturity model can be developed that assesses the challenges and opportunities that a specific standard faces at its current phase of development. Developing such a maturity model, while it goes beyond the scope of the current report, could lead to the eventual development of a meta-standard or a standard-of-standards. This would facilitate a succinct description of cross-cutting best-practices that can be used as a basis for the analysis or assessment of an existing standard, or as guidelines to develop new standards. For instance, specific barriers to adopting a data standard that take into account the size of the community and its specific technological capabilities should be considered.

More generally, meta-standards could include formalization for versioning of standards and interactions with specific related software. This includes amplifying formalization/guidelines on how to create standards (for example, metadata schema specifications using LinkML, https://linkml.io). However, aspects of communication with potential user audiences (e.g., researchers in particular domains) should be taken into account as well. For example, in the quality of onboarding documentation and tools for ingestion or conversion into standards-compliant datasets.

An ontology for the standards-development process – for example top-down vs bottom-up, minimum number of datasets, target community size and technical expertise typical of this community, and so forth – could help guide the standards-development process towards more effective adoption and use. A set of meta-standards and high-level descriptions of the standards-development process – some of which is laid out in this report – could help standard developers avoid known pitfalls, such as the dreaded proliferation of standards, or complexity-impeded adoption. Surveying and documenting the success and failures of current standards for a specific dataset / domain can help disseminate knowledge about the standardization process. Resources such as Fairsharing ( https://fairsharing.org/) or the Digital Curation Center (https://www.dcc.ac.uk/guidance/standards) can help guide this process.

Develop standards in tandem with standards-associated software

Development of standards should be coupled and tightly linked with development of associated software. This produces a virtuous cycle where the use-cases and technical issues that arise in software development informs the development of the standard and vice versa. One of the lessons learned across a variety of different standards is the importance of automated validation of the standard. Automated validation is broadly seen as a requirement for the adoption of a standard and a factor in managing change of the standard over time. To advance this virtuous cycle, we recommend to make data standards machine readable, and make software creation an integral part of establishing a standard’s schema. Additionally, standards evolution should maintain software compatibility, and ability to translate and migrate between standards.

Policy-making and funding entities:

Fund the development of open-source standards

While some funding agencies already support standards development as part of the development of informatics infrastructures, data standards development should be seen as integral to science innovation and earmarked for funding in research grants, not only in specialized contexts. Funding models should encourage the development and adoption of standards, and fund associated community efforts and tools for this. The OSS model is seen as a particularly promising avenue for an investment of resources, because it builds on previously-developed procedures and technical infrastructure and because it provides avenues for the democratization of development processes and for community input along the way. At the same time, there are significant challenges associated with incentives to engage, ranging from the dilution of credit to individual contributors, and ranging through the burnout of maintainers and developers. The clarity offered by procedures for enhancement proposals and semantic versioning schemes adopted in standards development offers avenues for a range of stakeholders to propose well-defined contributions to large and field-wide standards efforts (e.g., (Pestilli et al. 2021)), and potentially helps alleviate some of these concerns by providing avenues for individual contributions to surface, as well as clarity of process, which can alleviate the risks of maintainer burnout.

Invest in data stewards

Advancing the development and adoption of open-source standards requires the dissemination of knowledge to researchers in a variety of fields, but this dissemination itself may not be enough without the fostering of specialized expertise. Therefore, it is important to recognize the distinct role that data stewards play in contemporary research. As policy demands for openness become increasingly high, it is crucial to truly support experts whose role will be to develop, maintain, and facilitate the adoption and use of open-source standards. This support needs to manifest in all stages of the career of these individuals: it will be necessary to set up programs for training for data stewards, and to invest in the career paths of individuals that receive such training so that this crucial role is encouraged. Initial proposals for the curriculum and scope of the role have already been proposed (e.g., in (Mons 2018)), but we identify here also a need to connect these individuals directly to the practices that exemplify open-source standards. Thus, it will be important for these individuals to be conversant in the methodology of OSS. This does not mean that they need to become software engineers – though for some of them there may be some overlap with the role of research software engineers (Connolly et al. 2023) – but rather that they need to become familiar with those parts of the OSS development life-cycle that are specifically useful for the development of open-source standards. For example, tools for version control, tools for versioning, and tools for creation and validation of compliant data and metadata. Stakeholder organizations should invest in training grants to establish curriculum for data and metadata standards education.

Ultimately, efficient use of data stewards and their knowledge will have to be applied. It is evident that not every project and every lab that produces data requires a full-time data steward. Instead, data stewardship could be centralized within organizations such as libraries, data science, or software engineering cores of larger research organizations. This would be akin to recent models for research software engineering that are becoming common in many research organization (Van Tuyl 2023). Efficiency considerations also suggest that the development of data standards would not have its intended purpose unless funds are also allocated to the implementation of the standard in practice. Mandating standards without appropriate funding for their implementation by data producers and data users could risk hampering science and could leading to researchers doing the bare minimum to make their data “open”.

Review open-source standards pathways

Invest in programs that examine retrospective pathways for establishing data standards. Encourage publication of lifecycles for successful data standards. These lifecycles should include the process, creators, affiliations, grants, and adoption journeys of open-source standards. To encourage sustainable development of open-source standards, and to build on prior experience, the documentation and dissemination of lifecycles should be seen as an integral step of the work of standards creators and granting agencies. In the meanwhile, it would be good to also retroactively document the lifecycle of existing standards that are seen as success stories, and to foster the awareness of these standards. In addition, fostering research projects on the principles that underlie successful open-source standards development will help formulate new standards and iterate on existing ones. In accordance, data management plans should promote the sharing of not only data, but also metadata and descriptions of how to use it.

Manage Cross Sector alliances

Encourage cross-sector and cross-domain alliances that can impact successful standards creation. Invest in robust program management of these alliances to align pace and create incentives (for instance via Open Source Program Offices at Universities or other research organizations). Similar to program officers at funding agencies, standards evolution need sustained PM efforts. Multi-party partnerships should include strategic initiatives for standard establishment such as the Pistoia Alliance (https://www.pistoiaalliance.org/).

Connolly, Andrew, Joseph Hellerstein, Naomi Alterman, David Beck, Rob Fatland, Ed Lazowska, Vani Mandava, and Sarah Stone. 2023. “Software Engineering Practices in Academia: Promoting the 3Rs—Readability, Resilience, and Reuse.” Harvard Data Science Review 5 (2).

Mons, Barend. 2018. Data Stewardship for Open Science: Implementing FAIR Principles. 1st ed. Vol. 1. Milton: CRC Press. https://doi.org/10.1201/9781315380711.

Pestilli, Franco, Russ Poldrack, Ariel Rokem, Theodore Satterthwaite, Franklin Feingold, Eugene Duff, Cyril Pernet, Robert Smith, Oscar Esteban, and Matt Cieslak. 2021. “A Community-Driven Development of the Brain Imaging Data Standard (BIDS) to Describe Macroscopic Brain Connections.” OSF.

Van Tuyl, Steve, ed. 2023. “Hiring, Managing, and Retaining Data Scientists and Research Software Engineers in Academia: A Career Guidebook from ADSA and US-RSE.” https://doi.org/https://doi.org/10.5281/zenodo.8329337.