Use cases

To understand how OSS development practices affect the development of data and metadata standards, it is informative to demonstrate this cross-fertilization through a few use cases. As we will see in these examples, some fields, such as astronomy, high-energy physics and earth sciences have a relatively long history of shared data resources from organizations such as SDSS, CERN, and NASA, while other fields have only relatively recently become aware of the value of data sharing and its impact. These disparate histories inform how standards have evolved and how OSS practices have pervaded their development. It also demonstrates field-specific limitations on the adoption of OSS tools and practices that exemplify some of the challenges, which we will explore subsequently.

Astronomy

An early prominent example of a community-driven standard is the FITS (Flexible Image Transport System) file format standard, which was developed in the late 1970s and early 1980s (Wells and Greisen 1979), and has been adopted worldwide for astronomy data preservation and exchange. Essentially every software platform used in astronomy reads and writes the FITS format. It was developed by observatories in the 1980s to store image data in the visible and x-ray spectrum. It has been endorsed by the International Astronomical Union (IAU), as well as funding agencies. Though the format has evolved over time, “once FITS, always FITS”. That is, the format cannot be evolved to introduce changes that break backward compatibility. Among the features that make FITS so durable is that it was designed originally to have a very restricted metadata schema. That is, FITS records were designed to be the lowest common denominator of word lengths in computer systems at the time. However, while FITS is compact, its ability to encode a coordinate frame for pixels, means that data from different observational instruments can be stored in this format and relationships between data from different instruments can be defined, rendering manual and error-prone procedures for conforming images obsolete. Nevertheless, the stability has also raised some issues as the field continues to adapt to new measurement methods and the demands of ever-increasing data volumes and complex data analysis use-case, such as interchange with other data and the use of complex data bases to store and share data (Scroggins and Boscoe 2020). Another prominent example of the use of open-source processes to develop standards in Astronomy is in the tools and protocols developed by the International Virtual Observatory Alliance (IVOA) and its national implementations, e.g., in the US Virtual Astronomical Observatory(Hanisch et al. 2015). The virtual observatories facilitate discovery and access across observatories around the world and underpin data discovery in astronomy. The IVOA took inspiration from the World-Wide Web Consortium (W3C) and adopted its process for the development of its standards (i.e., Working drafts \(\rightarrow\) Proposed Recommendations \(\rightarrow\) Recommendations), with individual standards developed by inter-institutional and international working groups. One of the outcomes of the coordination effort is the development of an ecosystem of software tools both developed within the observatory teams and within the user community that interoperate with the standards that were adopted by the observatories.

High-energy physics (HEP)

Because data collection is centralized, standards to collect and store HEP data have been established and the adoption of these standards in data analysis has high penetration (Basaglia et al. 2023). A top-down approach is taken so that within every large collaboration, standards are enforced, and this adoption is centrally managed. Access to raw data is essentially impossible because of its large volume, and making it publicly available would be technically very difficult. Therefore, analysis tools are tuned specifically to the standards of the released data. Incentives to use the standards are provided by funders that require data management plans that specify how the data is shared (i.e., in a standards-compliant manner).

Earth sciences

The need for geospatial data exchange between different systems began to be recognized in the 1970s and 1980s, but proprietary formats still dominated. Coordinated standardization efforts brought the Open Geospatial Consortium (OGC) establishment in the 1990s, a critical step towards open standards for geospatial data. The 1990s have also seen the development of key standards such as the Network Common Data Form (NetCDF) developed by the University Corporation for Atmospheric Research (UCAR), and the Hierarchical Data Format (HDF), a set of file formats (HDF4, HDF5) that are widely used, particularly in climate research. The GeoTIFF format, which originated at NASA in the late 1990s, is extensively used to share image data. The following two decades, the 2000s-2020s, brought an expansion of open standards and integration with web technologies developed by OGC, as well as other standards such as the Keyhole Markup Language (KML) for displaying geographic data in Earth browsers. Formats suitable for cloud computing also emerged, such as the Cloud Optimized GeoTIFF (COG), followed by Zarr and Apache Parquet for array and tabular data, respectively. In 2006, the Open Source Geospatial Foundation (OSGeo, https://www.osgeo.org) was established, demonstrating the community’s commitment to the development of open-source geospatial technologies. While some standards have been developed in the industry (e.g., Keyhole Markup Language (KML) by Keyhole Inc., which Google later acquired), they later became international standards of the OGC, which now encompasses more than 450 commercial, governmental, nonprofit, and research organizations working together on the development and implementation of open standards (https://www.ogc.org).

Neuroscience

In contrast to the previously-mentioned fields, Neuroscience has traditionally been a “cottage industry”, where individual labs have generated experimental data designed to answer specific experimental questions. While this model still exists, the field has also seen the emergence of new modes of data production that focus on generating large shared datasets designed to answer many different questions, more akin to the data generated in large astronomy data collection efforts (Koch and Clay Reid 2012). This change has been brought on through a combination of technical advances in data acquisition techniques, which now generate large and very high-dimensional/information-rich datasets, cultural changes, which have ushered in new norms of transparency and reproducibility, and funding initiatives that have encouraged this kind of data collection. However, because these changes are recent relative to the other cases mentioned above, standards for data and metadata in neuroscience have been prone to adopt many elements of modern OSS development. Two salient examples in neuroscience are the Neurodata Without Borders file format for neurophysiology data (Rübel et al. 2022) and the Brain Imaging Data Structure (BIDS) standard for neuroimaging data (Gorgolewski et al. 2016). BIDS in particular owes some of its success to the adoption of OSS development mechanisms (Poldrack et al. 2024). For example, small changes to the standard are managed through the GitHub pull request mechanism; larger changes are managed through a BIDS Enhancement Proposal (BEP) process that is directly inspired by the Python programming language community’s Python Enhancement Proposal procedure, which is used to introduce new ideas into the language. Though the BEP mechanism takes a slightly different technical approach, it tries to emulate the open-ended and community-driven aspects of Python development to accept contributions from a wide range of stakeholders and tap a broad base of expertise.

Community science

Another interesting use case for open-source standards is community/citizen science. An early example of this approach is OpenStreetMap (https://www.openstreetmap.org), which allows users to contribute to the project development with code and data and freely use the maps and other related geospatial datasets. But this example is not unique. Overall, this approach has grown in the last 20 years and has been adopted in many different fields. It has many benefits for both the research field that harnesses the energy of non-scientist members of the community to engage with scientific data, as well as to the community members themselves who can draw both knowledge and pride in their participation in the scientific endeavor. It is also recognized that unique broader benefits are accrued from this mode of scientific research, through the inclusion of perspectives and data that would not otherwise be included. To make data accessible to community scientists, and to make the data collected by community scientists accessible to professional scientists, it needs to be provided in a manner that can be created and accessed without specialized instruments or specialized knowledge. Here, standards are needed to facilitate interactions between an in-group of expert researchers who generate and curate data and a broader set of out-group enthusiasts who would like to make meaningful contributions to the science. This creates a particularly stringent constraint on transparency and simplicity of standards. Creating these standards in a manner that addresses these unique constraints can benefit from OSS tools, with the caveat that some of these tools require additional expertise. For example, if the standard is developed using git/GitHub for versioning, this would require learning the complex and obscure technical aspects of these system that are far from easy to adopt, even for many professional scientists.

Basaglia, T, M Bellis, J Blomer, J Boyd, C Bozzi, D Britzger, S Campana, et al. 2023. “Data Preservation in High Energy Physics.” The European Physical Journal C 83 (9): 795.

Gorgolewski, Krzysztof J, Tibor Auer, Vince D Calhoun, R Cameron Craddock, Samir Das, Eugene P Duff, Guillaume Flandin, et al. 2016. “The Brain Imaging Data Structure, a Format for Organizing and Describing Outputs of Neuroimaging Experiments.” Sci Data 3 (June): 160044. https://www.nature.com/articles/sdata201644.

Hanisch, R J, G B Berriman, T J W Lazio, S Emery Bunn, J Evans, T A McGlynn, and R Plante. 2015. “The Virtual Astronomical Observatory: Re-Engineering Access to Astronomical Data.” Astron. Comput. 11 (June): 190–209.

Koch, Christof, and R Clay Reid. 2012. “Observatories of the Mind.” http://dx.doi.org/10.1038/483397a.

Poldrack, Russell A, Christopher J Markiewicz, Stefan Appelhoff, Yoni K Ashar, Tibor Auer, Sylvain Baillet, Shashank Bansal, et al. 2024. “The Past, Present, and Future of the Brain Imaging Data Structure (BIDS).” ArXiv, January.

Rübel, Oliver, Andrew Tritt, Ryan Ly, Benjamin K Dichter, Satrajit Ghosh, Lawrence Niu, Pamela Baker, et al. 2022. “The Neurodata Without Borders Ecosystem for Neurophysiological Data Science.” Elife 11 (October).

Scroggins, Michael, and Bernadette M Boscoe. 2020. “Once FITS, Always FITS? Astronomical Infrastructure in Transition.” IEEE Ann. Hist. Comput. 42 (2): 42–54.

Wells, Donald Carson, and Eric W Greisen. 1979. “FITS-a Flexible Image Transport System.” In Image Processing in Astronomy, 445.