The Science
The Higgs boson is a fundamental particle responsible for generating the mass of other elementary particles. Since its discovery at the CERN Large Hadron Collider in 2012, researchers have developed strategies to understand how the Higgs boson interacts with the other elementary particles. Scientists are also searching this experimental data for clues that could indicate physics beyond our current understanding of nature. This science depends on the ability to extract new insights from massive experimental data sets. To help, researchers have defined practical FAIR (findable, accessible, interoperable, reusable) principles for data. FAIR principles make large data sets usable by both humans and computers and prepare them for processing in modern computing environments. This work is critical for developing artificial intelligence (AI) tools that can identify novel patterns and features in experimental data.
The Impact
This work provides a guide that enables researchers to create data sets and to evaluate whether those data sets adhere to FAIR principles. Adherence allows both humans and machines to use (and reuse) data sets, bypassing the need for time-consuming manual pre-processing. The guide also helps researchers prepare FAIR data sets for use in modern computing environments. If this vision is realized, scientific facilities will be able to seamlessly transfer experimental data to modern computing environments such as high performance computers, where researchers can use the data to train novel AI algorithms that provide trustworthy predictions and extract new knowledge.
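As a rough illustration of what such an assessment might look like in practice, the sketch below maps each FAIR principle to a simple pass/fail metadata check. The record fields and criteria are hypothetical placeholders introduced here for illustration, not the project's actual rubric; the published guide defines the authoritative assessment steps.

    # A minimal, hypothetical sketch of a FAIR assessment checklist.
    # The record fields and checks are illustrative placeholders, not
    # the project's actual assessment criteria.
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class DatasetRecord:
        doi: Optional[str]          # persistent identifier (findable)
        access_url: Optional[str]   # open retrieval endpoint (accessible)
        data_format: Optional[str]  # community-standard format (interoperable)
        license: Optional[str]      # explicit reuse terms (reusable)

    def fair_report(record: DatasetRecord) -> dict:
        """Map each FAIR principle to a simple pass/fail metadata check."""
        return {
            "findable": record.doi is not None,
            "accessible": record.access_url is not None,
            "interoperable": record.data_format in {"HDF5", "ROOT", "Parquet"},
            "reusable": record.license is not None,
        }

    example = DatasetRecord(
        doi="10.0000/example",                # placeholder identifier
        access_url="https://example.org/ds",  # placeholder endpoint
        data_format="HDF5",
        license="CC-BY-4.0",
    )
    print(fair_report(example))
    # {'findable': True, 'accessible': True, 'interoperable': True, 'reusable': True}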
Summary
In this project, researchers developed a domain-agnostic, step-by-step assessment guide for evaluating whether a given data set meets FAIR principles. They showcased its application on an open simulated data set produced by the CMS Collaboration at the CERN Large Hadron Collider, and they developed and shared tools to visualize and explore the data. The overarching goal of this work is to provide a blueprint for integrating data sets, AI tools, and smart cyberinfrastructure, leading to a rigorous AI framework for interdisciplinary discovery and innovation.
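To give a concrete sense of the kind of exploration such tools enable, here is a minimal sketch that opens a particle physics data set stored as an HDF5 file (a format commonly used for such releases) and plots one feature distribution. The file name and the internal path "jets/pt" are assumptions for illustration; the actual schema is documented in the data set's published metadata.

    # A minimal sketch of exploring an HDF5 data set. The file name and the
    # internal path "jets/pt" are hypothetical and stand in for whatever
    # layout the data set's metadata documents.
    import h5py
    import matplotlib.pyplot as plt

    with h5py.File("higgs_decay_sample.h5", "r") as f:
        # A FAIR data set documents its own structure; listing the contents
        # is safer than hard-coding assumptions about the layout.
        f.visit(print)
        jet_pt = f["jets/pt"][:]  # hypothetical feature: jet transverse momentum

    plt.hist(jet_pt, bins=100)
    plt.xlabel("jet transverse momentum [GeV]")
    plt.ylabel("entries")
    plt.savefig("jet_pt_distribution.png")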
Within the next decade, as the scientific community adopts FAIR AI models and data, researchers will gradually bridge existing gaps between theory and experimental science. AI models that are currently trained on large-scale simulations and approximate mathematical models will be refined to learn from and describe nature, identifying principles and patterns that go beyond existing theories. In time, AI will be capable of synthesizing knowledge from disparate disciplines to provide a holistic understanding of natural phenomena, unifying mathematics, physics, and scientific computing for the advancement of science.
Contact
Eliu Huerta
Argonne National Laboratory
elihu@anl.gov
Funding
This work was supported by the Department of Energy (DOE) Office of Science, Advanced Scientific Computing Research FAIR Data program FAIR Framework for Physics-Inspired Artificial Intelligence in High Energy Physics project. It used resources of the Argonne Leadership Computing Facility, a DOE Office of Science user facility. One of the researchers was also supported by a Halicioğlu Data Science Fellowship.
Publications
Chen, Y., et al., A FAIR and AI-ready Higgs boson decay dataset. Scientific Data 9, 31 (2022). [DOI: 10.1038/s41597-021-01109-0]
Related Links
FAIR4HEP project
Argonne National Laboratory press release: Argonne scientists make high energy physics data more FAIR.
Springer Nature Research Community blog, FAIR and AI-ready scientific datasets.