
Opportunities and risks for open-source standards

While open-source standards benefit from the technical and social practices of OSS, these same tools and practices also carry risks that need to be mitigated.

Flexibility vs. Stability

One of the defining characteristics of OSS is its dynamism and rapid evolution. Because OSS can be used by anyone and, in most cases, anyone can contribute, innovations flow into OSS in a bottom-up fashion from user/developers. Pathways to contribution are often well-defined: from the technical perspective (e.g., through a pull request on GitHub, or other similar mechanisms), from the social perspective (e.g., whether contributors need to accept certain licensing conditions through a contributor licensing agreement), and from the socio-technical perspective (e.g., how many people need to review a contribution, what the timelines are for a contribution to be reviewed and accepted, what the release cycles are that make the contribution available to a broader community of users, etc.). Similarly, open-source standards may find themselves addressing use cases and solutions that were not originally envisioned, through bottom-up contributions from members of the research community to which the standard pertains.

However, while this dynamism provides an avenue for flexibility, it also presents a source of tension. Data and metadata standards apply to already-existing datasets, and changes may affect the compliance of those datasets. Similarly, analysis technology stacks developed against an existing version of a standard have to adapt as new ideas and changes are introduced into the standard. Dynamic changes of this sort therefore risk a loss of faith in the standard by its user community, and migration away from the standard. Conversely, if a standard evolves too rapidly, users may stick with an outdated version for a long time, straining the community of developers and maintainers, who must accommodate long deprecation cycles. On the other hand, in cases in which some forms of dynamic change are prohibited – as with the FITS file format, which prohibits changes that break backwards-compatibility – there is also a cost associated with stability (Scroggins and Boscoe 2020): limiting the adoption of new types of measurements, new analysis methods, and new modes of data storage and data sharing.
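To make this tension concrete, consider how a tool might check whether a dataset written under one version of a standard can still be read. The following Python sketch is purely illustrative: the version numbers and the policy that major-version changes break backwards-compatibility are assumptions for this example, not the rules of any particular standard (the sketch relies on the packaging library for version parsing).

```python
from packaging.version import Version

# Illustrative policy (an assumption, not any specific standard's rule):
# the tool supports one major version, and very old versions are dropped
# only after a long deprecation cycle.
SUPPORTED = Version("2.0")          # version the tool was built against
DEPRECATED_BEFORE = Version("1.0")  # versions no longer accepted at all

def check_compatibility(dataset_version: str) -> str:
    v = Version(dataset_version)
    if v < DEPRECATED_BEFORE:
        return "unsupported: dataset predates the deprecation cutoff"
    if v.major != SUPPORTED.major:
        # Major-version changes are assumed to break backwards-compatibility.
        return "incompatible: conversion of the dataset is required"
    return "compatible"

print(check_compatibility("1.4"))  # incompatible under this policy
print(check_compatibility("2.1"))  # compatible
```

Every existing dataset and tool in an ecosystem sits somewhere in this matrix of versions, which is why rapid evolution of a standard propagates costs well beyond the standard's own repository.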

Mismatches between standards developers and user communities

Open-source standards often entail an inherent gap between the core developers of a standard and its users, who are members of the broader research field to which the standard pertains, in both interest and ability to engage with the technical details undergirding the standard and its development. This gap, in and of itself, creates friction on the path to broad adoption and best utilization of the standards. In extreme cases, the interests of researchers and standards developers may even seem at odds, as developers implement sophisticated mechanisms to automate the creation and validation of the standard, or advocate for more technically advanced mechanisms for evolving the standard. These advanced capabilities offer more robust development practices and consistency in cases where the standards are complex and elaborate, and they can ease the maintenance burden of the standard. On the other hand, they may end up sidelining potential users in the development of the standard, limiting their ability to provide feedback about the practical implications of changes. One example (already mentioned in the discussion of use cases above) is the use of git/GitHub for versioning of standards documents. This sets a high bar for participation in standards development for researchers in fields in which git/GitHub have not yet seen significant adoption as tools of day-to-day computational practice. At the same time, it provides clarity and robustness for standards developer communities that are well-versed in these tools.
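As an illustration of the kind of automated validation that raises the technical bar while increasing robustness, the sketch below uses the jsonschema Python library to check a metadata record against a schema, in the way a continuous-integration job might on every pull request. The schema and the record are invented for this example and do not belong to any real standard.

```python
from jsonschema import validate, ValidationError

# Hypothetical metadata schema of the kind a standards repository might
# check automatically on every proposed change.
schema = {
    "type": "object",
    "required": ["subject_id", "modality", "units"],
    "properties": {
        "subject_id": {"type": "string"},
        "modality": {"type": "string"},
        "units": {"type": "string"},
    },
}

record = {"subject_id": "sub-01", "modality": "MRI"}  # "units" is missing

try:
    validate(instance=record, schema=schema)
except ValidationError as err:
    # In CI, this would fail the check and surface the problem before merge.
    print(f"validation failed: {err.message}")
```

For developers, such automation enforces consistency; for researchers who do not use git/GitHub day to day, the same machinery can feel like a barrier to participation.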

Another layer of potential mismatches arises when a more complex set of stakeholders needs to be considered. For example, the Group on Earth Observations (GEO) is a network that aims to coordinate decision-making around satellite missions and to standardize the data that result from these missions. Because this group involves a range of different stakeholders – including individuals who are attuned to potential legal issues and researchers who are better equipped to evaluate technical and domain questions – communication is slower and more difficult. Because the group aims to move forward by consensus, these communication difficulties can slow progress. This is just one example of the many cases in which an OSS process that strives for consensus can slow progress.

Cross-domain gaps

There is much to be gained from the development of standards that apply across multiple domains. For example, many research fields use images as data, and array-based computing applicable to images is at the core of many scientific computing codes. Practitioners within any given field should therefore be motivated to draw on shared data standards and shared software implementations of operations that are common across fields; an illustration of such a shared operation appears below. In practice, however, it is hard to justify the investment in these common resources. Although investment in data standardization is arguably even more justified when a standard generalizes beyond any specific science domain, and even though the use cases originate in the domain sciences, data standardization is seen as a data-infrastructure investment rather than a science investment, reducing the immediate incentives for researchers to engage with such efforts. This is exacerbated by science funding schemes that eschew generalized cross-domain solutions and seek tangible impact only within a specific domain.
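To illustrate what a shared software implementation means in practice, the following sketch applies the same domain-agnostic operation to simulated images standing in for two different fields. The shapes and value ranges are invented for illustration.

```python
import numpy as np

def normalize(image: np.ndarray) -> np.ndarray:
    """Rescale intensities to the range [0, 1] -- domain-agnostic."""
    lo, hi = image.min(), image.max()
    return (image - lo) / (hi - lo)

# Simulated data standing in for two different domains:
microscopy = np.random.rand(512, 512) * 4095    # e.g., 12-bit camera counts
satellite = np.random.rand(256, 256) * 10000    # e.g., scaled reflectance

for img in (microscopy, satellite):
    out = normalize(img)
    print(out.min(), out.max())  # approximately 0.0 and 1.0 in both cases
```

Because both datasets are plain arrays, one implementation serves both fields; the same logic motivates shared standards for the metadata that travels with such arrays.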

Data instrumentation issues

Where there is commercial interest in the development of data analysis tools (e.g., in biomedical applications, or in applications where research funding can be directed towards commercial solutions), there is an incentive to create data formats and data analysis platforms that are proprietary. This may drive innovative applications of scientific measurements, but it also creates sub-fields in which scientific observations are generated by proprietary instrumentation, due to these commercial or other profit-driven incentives. There is no regulatory oversight compelling adherence to available standards or the evolution of common tools, which limits integration across different measurements. In cases where a significant amount of data is already stored in proprietary formats, substantial data transformations may be required to bring the data to a state that is amenable to open-source standards (a conversion of this kind is sketched below). In these sub-fields there may also be little incentive to set aside resources for establishing open-source data standards, leaving them relatively siloed.
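A typical conversion has two steps: reading the proprietary container, then writing its contents to an open, well-documented format. In the sketch below, read_vendor_file is a hypothetical stand-in for whatever vendor-supplied or reverse-engineered reader a given proprietary format requires; only the ZARR write is a real library call.

```python
import numpy as np
import zarr

def read_vendor_file(path: str) -> np.ndarray:
    # Hypothetical placeholder: a real implementation would parse the
    # vendor's proprietary container. Here we simulate its output.
    return np.zeros((1024, 1024), dtype=np.uint16)

def convert_to_open_format(src: str, dst: str) -> None:
    data = read_vendor_file(src)
    # Writing to an open, chunked format lets downstream open-source
    # tools read the data without the vendor's software.
    zarr.save(dst, data)

convert_to_open_format("scan_001.vnd", "scan_001.zarr")
```

The reading step is where the real cost lies: proprietary formats often require reverse engineering or vendor cooperation before any such pipeline can be built.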

Harnessing new computing paradigms and technologies

Open-source standards development faces the challenge of adapting to new computing paradigms and technologies. Cloud computing provides a particularly stark set of opportunities and challenges. On the one hand, cloud computing offers practical solutions to many challenges of contemporary data-driven research. For example, the scalability of cloud resources addresses the scale of data produced by instruments in many fields. The cloud also makes data access relatively straightforward, because data access permissions can be determined in a granular fashion. On the other hand, cloud computing requires re-engineering many data formats, because cloud data access patterns are fundamentally different from those used on local POSIX-style file systems.

Suspicion of cloud computing comes in two flavors. The first comes from researchers and administrators who may be wary of the costs associated with cloud computing, especially the difficulty of predicting those costs. Projects such as NSF’s CloudBank seek to mitigate some of these concerns by providing an additional layer of transparency into cloud costs (Norman et al. 2021). The other objection relates to the fact that cloud computing services, by their very nature, are closed ecosystems that resist portability and interoperability. Some aspects of the services will always remain hidden and privy only to the cloud computing service provider. In this respect, cloud computing runs afoul of some of the appealing aspects of OSS.

That said, the development of “cloud native” standards can provide significant benefits in terms of the research that can be conducted. For example, NOAA plans to use cloud computing to integrate across the multiple disparate datasets that it collects, building knowledge graphs that researchers can query to answer questions that can only be answered through this integration; putting all the data “in one place” should help with that. Adaptation to the cloud has also driven the development of new file formats. A salient example is the ZARR format (Miles et al. 2024), which supports random access into array-based datasets stored in cloud object storage, facilitating scalable and parallelized computing on these data; a sketch of this access pattern follows below. Indeed, data standards such as NWB (neuroscience) and OME (microscopy) now use ZARR as a backend for cloud-based storage. In other cases, file formats that were once not straightforward to use in the cloud, such as HDF5 and TIFF, have been adapted to cloud use (e.g., through the cloud-optimized GeoTIFF format).
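The access pattern that makes ZARR cloud-friendly can be shown in a few lines. In the sketch below, the bucket path is a made-up placeholder, and reading from S3 assumes the s3fs package is installed alongside zarr and fsspec; the point is that only the chunks overlapping the requested slice are fetched over the network.

```python
import fsspec
import zarr

# Map a (hypothetical) cloud-hosted ZARR store to a key-value interface.
store = fsspec.get_mapper("s3://example-bucket/recording.zarr", anon=True)
z = zarr.open(store, mode="r")

# Unlike a POSIX-style read of one monolithic file, this slice triggers
# requests only for the chunks it touches.
subset = z[:1000, :1000]
print(subset.shape)
```

This chunk-level, stateless access is exactly the pattern that object storage serves well, and it is why formats designed around POSIX-style seeks, such as classic HDF5 and TIFF, required adaptation for the cloud.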

Unclear pathways for standards success and sustainability

The development of open-source standards faces sustainability challenges similar to those faced by open-source software developed for research. Standards typically develop organically through sustained and persistent efforts by dedicated groups of data practitioners, including scientists and the broader ecosystem of data curators and users. However, there is no playbook for the structure and components of a data standard, or for the pathway by which the implementation of a specific data architecture (e.g., a particular file format) becomes a data standard. As a result, data standardization lacks formal avenues for success and recognition, for example through dedicated research grants (see also the discussion of cross-sector interactions). This hampers the long-term trajectory that is needed to embed a standard into the day-to-day practice of researchers.

Miles, Alistair, jakirkham, M Bussonnier, Josh Moore, Dimitri Papadopoulos Orfanos, Davis Bennett, David Stansby, et al. 2024. “Zarr-Developers/Zarr-Python: V3.0.0-Alpha.” Zenodo. https://doi.org/10.5281/zenodo.11592827.
Norman, Michael, Vince Kellen, Shava Smallen, Brian DeMeulle, Shawn Strande, Ed Lazowska, Naomi Alterman, et al. 2021. “CloudBank: Managed Services to Simplify Cloud Access for Computer Science Research and Education.” In Practice and Experience in Advanced Research Computing. PEARC ’21. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/3437359.3465586.
Scroggins, Michael, and Bernadette M. Boscoe. 2020. “Once FITS, Always FITS? Astronomical Infrastructure in Transition.” IEEE Annals of the History of Computing 42 (2): 42–54.