Use Versioning for Data, Model, Configurations and Training Scripts
22 / 46 •
Training •
This practice was ranked as basic.
This practice helps to increase the traceability of ML components.
Intent
Improve reproducibility, traceability and compliance.
Motivation
In order to reproduce previous machine learning experiments, one needs more than just the executable code. Versioning the training and testing data, the final model, and all configuration files is complementary to versioning the executable code.
Applicability
Versioning should be used in any machine learning application or experiment.
Description
Versioning in machine learning involves more components than in traditional software: alongside the executable code, we have to store the training and testing data sets, the configuration files and the final model artifacts.
Storing all of this information allows previous experiments to be reproduced and re-assessed. Moreover, it supports auditing, compliance, backward traceability and backward compatibility.
However, these artifacts vary widely in size and many of them are large, which makes versioning difficult. In most cases, data and model artifacts will be versioned in different systems than code and configuration files.
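One common way to bridge the two systems is to keep a small pointer file next to the code that records a content hash of the large artifact stored elsewhere. The following is a minimal sketch of that idea, not a prescribed implementation; the function names `file_sha256`, `write_pointer` and `verify_pointer` and the JSON layout are hypothetical choices made for illustration.

```python
import hashlib
import json
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_pointer(artifact_path, pointer_path):
    """Write a small JSON pointer that identifies a large artifact by content.

    The pointer is small enough to commit with the code, while the artifact
    itself stays in whatever storage system holds data and models.
    """
    artifact_path = Path(artifact_path)
    pointer = {
        "name": artifact_path.name,
        "sha256": file_sha256(artifact_path),
        "size_bytes": artifact_path.stat().st_size,
    }
    Path(pointer_path).write_text(json.dumps(pointer, indent=2, sort_keys=True))
    return pointer


def verify_pointer(artifact_path, pointer_path):
    """Return True if the artifact still matches the hash stored in the pointer."""
    pointer = json.loads(Path(pointer_path).read_text())
    return pointer["sha256"] == file_sha256(artifact_path)
```

Calling `verify_pointer` before training gives a cheap check that the data on disk is the version the code was committed against. Dedicated data versioning tools follow a similar principle, with remote storage and caching added on top.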
In order to avoid versioning issues, make sure to:
- include a link to the data version in the code / configuration artifacts together with a unique ID and a timestamp,
- add feature documentation for all data and link it to the code artifacts,
- add tests for data processing and merging,
- include scripts for running or deploying the experiment, e.g. bash scripts, infrastructure scripts, etc.
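The first item in the list above can be sketched as a small run manifest that is stored with the code and configuration. This is only an illustration under stated assumptions: the function `build_manifest`, the field names and the placeholder version strings are hypothetical, and a real project would fill them from its own data store and version control system.

```python
import json
import uuid
from datetime import datetime, timezone


def build_manifest(data_version, code_version, config):
    """Build a record that ties one experiment run to its inputs.

    data_version and code_version are opaque identifiers supplied by the
    caller, for example a content hash of the data and a commit id.
    """
    return {
        "run_id": uuid.uuid4().hex,  # unique ID for this run
        "created_at": datetime.now(timezone.utc).isoformat(),  # UTC timestamp
        "data_version": data_version,
        "code_version": code_version,
        "config": config,
    }


# Placeholder values for illustration only.
manifest = build_manifest(
    data_version="sha256:<digest of the training data>",
    code_version="<commit id of the training scripts>",
    config={"learning_rate": 0.01, "epochs": 10},
)
print(json.dumps(manifest, indent=2))
```

Saving this manifest next to the trained model makes it possible to trace a model back to the exact data, code and configuration that produced it.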
Read more
- 10 Best Practices for Deep Learning
- ModelOps: Cloud-based lifecycle management for reliable and trusted AI
- Machine learning: Moving from experiments to production
- Managing Machine Learning Projects
- Principled Machine Learning: Practices and Tools for Efficient Collaboration
- Versioning for end-to-end machine learning pipelines