Introduction

Machine learning (ML) models produced by researchers are considered research output, just like more traditional outputs such as journal articles, conference papers, and book chapters. This means that, when creating and sharing machine learning models, researchers need to fulfil funder and institutional requirements for research outputs.

For example, researchers need to ensure that all their work falls under the ethical approval in projects where such approval is applicable (see guidelines from, e.g., the Swedish Research Council and the ERC). In addition, their research output needs to meet open access publication requirements (e.g., Swedish Research Council, ERC), open data requirements (e.g., Swedish Research Council, ERC), and open analysis workflow and code requirements (e.g., National guidelines for promoting open science in Sweden).

To meet these requirements, European and Swedish funders and universities currently recommend adhering to FAIR principles (e.g., Swedish Research Council: Making research data accessible and FAIR criteria).

FAIR ML models

FAIR (Findable, Accessible, Interoperable, Reusable) is a set of principles originally written for research data (see Wilkinson et al 2016) but since expanded to other research outputs (see Baker et al 2022, Patel et al 2023). There is no single way to 'make something FAIR'; instead, research output can adhere to the FAIR principles to different extents and in different ways.

The long-term goal of SciLifeLab Serve is to allow Swedish researchers to meet funder requirements concerning FAIR principles when sharing models without any extra work; in other words, everything should be done for you automatically when you share your model on SciLifeLab Serve. In the meantime, there are some things that researchers can do themselves. On this page, we give recommendations on basic steps researchers can take to adhere to FAIR principles to a reasonable extent when sharing machine learning models.

Meeting FAIR requirements in applications with ML models

Currently, researchers can share their machine learning models through SciLifeLab Serve by turning them into independent applications. We have guidelines for how to do it here. We also have a separate page describing how applications (including machine learning applications) can meet FAIR requirements. All models shared on SciLifeLab Serve should aim to fulfil the requirements described there as a starting point. Below, we provide additional recommendations that are specific to ML models.

Additional suggestions specific to ML models

When it comes specifically to machine learning models, researchers should additionally put extra effort into the descriptions of their models so that they provide useful information for the use and reuse of the model. Good descriptions (metadata) are one of the pillars of FAIR. Pick the recommendations below that are relevant for you, create a description file with text following the recommendations, and share this file in the same place where you share the model artifact files (for example, in a GitHub repository).

There is no single standard for describing ML models, both because the field is still relatively young and because agreeing on such a community standard is complicated by differences in ML modelling approaches. However, some recommendations have already gained traction. For example, you can follow the Model cards format suggested by researchers at Google to describe your model. Another influential recommendation is the Model cards format by Hugging Face. A model card is simply a file with structured text that describes specific aspects of the model.
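As an illustration, a minimal model card might look like the following. All names, values, and section headings here are hypothetical, loosely following the structure of the Google and Hugging Face model card formats; adapt them to your own model:

```markdown
# Model card: example-protein-classifier (hypothetical)

## Model details
- Developed by: Example Lab, Example University
- Model type: gradient-boosted decision trees
- Version: 1.0.0
- License: MIT

## Intended use
- Primary use: classifying protein sequences into functional families
- Out-of-scope uses: clinical decision making

## Training data
- Source: describe the dataset, its version, and where it can be obtained
- Preprocessing: e.g., deduplication; 80/10/10 train/validation/test split

## Evaluation
- Metrics: e.g., accuracy, F1 score (report the values and the test set used)

## Limitations and biases
- Describe known failure modes, dataset biases, and other caveats here
```

Even a short file like this, stored next to the model artifact files, makes the model considerably easier to find, assess, and reuse.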

DOME (Data, Optimization, Model and Evaluation) is a set of community recommendations for reporting supervised machine learning–based analyses applied to biological studies (Walsh et al 2021). The DOME recommendations are written specifically with the goal of improving machine learning assessment and reproducibility. They were developed primarily for the case of supervised learning in biological applications in the absence of direct experimental validation, as this is the most common type of ML approach used in biology. Since their publication, the DOME recommendations have been increasingly adopted by the community, with some journals now requiring descriptions according to DOME. There is also a DOME registry website where researchers can add their models.
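For orientation, a DOME-style description is organised around the four areas in its name. The outline below is an illustrative sketch, not the official DOME checklist; consult Walsh et al 2021 and the DOME registry for the authoritative list of items:

```markdown
# DOME-style description (illustrative outline)

## Data
- Provenance and size of the dataset; how training/validation/test splits were made
- Steps taken to avoid overlap or redundancy between the splits

## Optimization
- Learning algorithm and key hyperparameters, and how they were chosen
- Measures taken to detect and prevent overfitting

## Model
- Availability of the trained model (where it can be obtained)
- Type of output and, where possible, how predictions can be interpreted

## Evaluation
- Evaluation method (e.g., cross-validation or an independent test set)
- Performance measures reported and comparison with existing methods
```

Answering these points in your description file goes a long way towards making a supervised ML analysis assessable and reproducible.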

Open ML models

As mentioned in the Introduction, funders and institutions require open sharing of research output. Research projects using machine learning models make use of and create many artifacts, and all of these components need to be taken into account when considering the funder and institutional requirements. We at SciLifeLab Serve endorse the so-called Model Openness Framework (MOF, White et al 2024), developed by researchers at the Linux Foundation and elsewhere.

The Model Openness Framework identifies 17 components that researchers developing machine learning models can share, as well as appropriate open licenses under which each should be shared. There is also a Model Openness Tool where researchers can add their own models or get an overview of how other models are classified under the MOF.

While the Model Openness Framework is designed for deep learning artifacts and does not transfer directly to every form of learning in AI, we think it is a great starting point for any ML researcher wishing to share their models. We recommend that researchers strive to share as many components from the list as possible. The long-term goal of SciLifeLab Serve is to make sharing all these components as easy as possible.

Model Openness Framework Components and Licenses

Source: Table 2 of White et al 2024

| Component | Domain | Content Type | Accepted Open License |
| --- | --- | --- | --- |
| Datasets | Data | Data | Preferred: CDLA-Permissive-2.0, CC-BY-4.0. Acceptable: any, including unlicensed |
| Data Preprocessing Code | Data | Code | Acceptable: OSI-approved, e.g., The MIT License |
| Model Architecture | Model | Code | Acceptable: OSI-approved, e.g., The MIT License |
| Model Parameters | Model | Data | Preferred: CDLA-Permissive-2.0. Acceptable: OSI-approved, e.g., The MIT License, Permissive Open Data Licenses |
| Model Metadata | Model | Data | Preferred: CDLA-Permissive-2.0. Acceptable: CC-BY-4.0, Permissive Open Data Licenses |
| Training Code | Model | Code | Acceptable: OSI-approved, e.g., The MIT License |
| Inference Code | Model | Code | Acceptable: OSI-approved, e.g., The MIT License |
| Evaluation Code | Model | Code | Acceptable: OSI-approved, e.g., The MIT License |
| Evaluation Data | Model | Data | Preferred: CDLA-Permissive-2.0. Acceptable: CC-BY-4.0, Permissive Open Data Licenses |
| Evaluation Results | Model | Documentation | Preferred: CC-BY-4.0. Acceptable: Permissive Open Content Licenses |
| Supporting Libraries & Tools | Model | Code | Acceptable: OSI-approved, e.g., The MIT License |
| Model Card | Model | Documentation | Preferred: CC-BY-4.0. Acceptable: Permissive Open Content Licenses |
| Data Card | Data | Documentation | Preferred: CC-BY-4.0. Acceptable: Permissive Open Content Licenses |
| Technical Report | Model & Data | Documentation | Preferred: CC-BY-4.0. Acceptable: Permissive Open Content Licenses |
| Research Paper | Model & Data | Documentation | Preferred: CC-BY-4.0. Acceptable: Permissive Open Content Licenses |
| Sample Model Outputs | Model | Data or Code | Unlicensed |

Other sources of information

There are many other good sources of information about FAIR ML models and open ML models that you can use if you are interested in diving deeper into this topic. Here are some recommendations:

Ten simple rules for good model-sharing practices

FAIR, AI Readiness, and Reproducibility Network: Resources
