Write Reusable Scripts for Data Cleaning and Merging
4 / 46 • Data
This practice was ranked as basic.
This practice helps to increase the traceability of ML components.
Intent
Avoid untidy data wrangling scripts, reuse code and increase reproducibility.
Motivation
Data cleaning and merging are exploratory processes and tend to lack structure. They often involve manual steps or poorly structured code that cannot be reused later, and such code cannot be integrated into a processing pipeline.
Applicability
Reusable data cleaning scripts should be written for any ML application that does not use raw or standard data sets.
Description
Training machine learning models is usually preceded by an exploratory phase in which unstructured code is written or manual steps are performed to get the data into the right format, merge several data sources, and so on. Notebooks in particular encourage ad-hoc data processing scripts that depend on variables left in memory by previously run cells, so the code only works when the cells are executed in a particular order.
Before moving to the training phase, it is important to convert this code into reusable scripts and move it into functions that can be called and tested individually. This enables code reuse and eases integration into processing pipelines.
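As an illustration, the conversion described above might look like the following minimal sketch. It is not taken from the practice itself: the function names (`clean_records`, `merge_records`, `build_dataset`) and the field names (`user_id`, `country`, `amount`) are hypothetical, and the sketch uses only the Python standard library so that it stays self-contained. The same structure applies when a data-frame library is used instead.

```python
from typing import Any, Dict, Iterable, List, Sequence

Record = Dict[str, Any]


def clean_records(records: Iterable[Record], required: Sequence[str]) -> List[Record]:
    """Strip whitespace from string values and drop rows missing a required field."""
    cleaned = []
    for record in records:
        # Normalise string values; leave other types untouched.
        row = {k: v.strip() if isinstance(v, str) else v for k, v in record.items()}
        # Treat None and empty strings as missing values.
        if all(row.get(field) not in (None, "") for field in required):
            cleaned.append(row)
    return cleaned


def merge_records(left: Iterable[Record], right: Iterable[Record], key: str) -> List[Record]:
    """Inner-join two collections of records on `key`.

    Assumes every record contains `key`. If `right` holds duplicate keys,
    the last occurrence is used; on field-name clashes, `right` wins.
    """
    index = {row[key]: row for row in right}
    return [{**row, **index[row[key]]} for row in left if row[key] in index]


def build_dataset(users: Iterable[Record], orders: Iterable[Record]) -> List[Record]:
    """Single entry point that a pipeline step or a notebook can call."""
    # Hypothetical schema: users carry a country, orders carry an amount.
    clean_users = clean_records(users, required=("user_id", "country"))
    clean_orders = clean_records(orders, required=("user_id", "amount"))
    return merge_records(clean_orders, clean_users, key="user_id")
```

Each function takes its inputs as arguments and returns a new value, so none of them depends on notebook state. This makes it possible to unit-test `clean_records` and `merge_records` separately with small hand-written inputs, and to call `build_dataset` unchanged from a pipeline.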
Adoption
Related
Read more
- Best Practices in Machine Learning Infrastructure
- Data management challenges in production machine learning
- ML Ops: Machine Learning as an Engineering Discipline