Write Reusable Scripts for Data Cleaning and Merging

June, 2025 • Alex Serban, Koen van der Blom, Joost Visser

4 / 46 • Data

This practice was ranked as basic.

This practice helps to increase the traceability of ML components.

Intent

Avoid untidy data wrangling scripts, reuse code, and increase reproducibility.

Motivation

Data cleaning and merging are exploratory processes and tend to lack structure. They often involve manual steps or poorly structured code that cannot be reused later. Such code also cannot be integrated into a processing pipeline.

Applicability

Reusable data cleaning scripts should be written for any ML application that does not use raw or standard data sets.

Description

Training machine learning models is usually preceded by an exploratory phase, in which unstructured code is written or manual steps are performed to get the data into the right format, merge several data sources, and so on. Especially in notebooks, there is a tendency to write ad-hoc data processing scripts that depend on variables left in memory by previously run cells.

Before moving to the training phase, it is important to convert this code into reusable scripts and move it into methods which can be called and tested individually. This will enable code reuse and ease integration into processing pipelines.
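As a rough illustration of this conversion, the sketch below wraps a cleaning step and a merging step in small functions that take all of their inputs as arguments, so nothing depends on notebook state. It is a minimal example using only the Python standard library, not a prescribed implementation. The function names (clean_records, merge_on_key, build_training_table) and the field names (customer_id, amount) are hypothetical and chosen only for this example.

```python
from typing import Iterable

Record = dict[str, str]


def clean_records(records: Iterable[Record], required: tuple[str, ...]) -> list[Record]:
    """Strip whitespace from all values and drop rows missing a required field."""
    cleaned = []
    for row in records:
        # Build a new dict so the caller's data is never mutated.
        tidy = {key: (value or "").strip() for key, value in row.items()}
        if all(tidy.get(field) for field in required):
            cleaned.append(tidy)
    return cleaned


def merge_on_key(left: Iterable[Record], right: Iterable[Record], key: str) -> list[Record]:
    """Inner-join `left` with `right` on `key`.

    Assumes `right` is a lookup table with at most one row per key value;
    on field-name clashes the right-hand value wins.
    """
    right_by_key = {row[key]: row for row in right}
    return [{**row, **right_by_key[row[key]]} for row in left if row[key] in right_by_key]


def build_training_table(customers: Iterable[Record], orders: Iterable[Record]) -> list[Record]:
    """Pipeline entry point: every input is an explicit argument (hypothetical schema)."""
    clean_customers = clean_records(customers, required=("customer_id",))
    clean_orders = clean_records(orders, required=("customer_id", "amount"))
    return merge_on_key(clean_orders, clean_customers, key="customer_id")
```

Because each function depends only on its arguments, it can be imported from a notebook, called from a pipeline step, and covered by a unit test that passes in a few hand-written rows and checks the returned list.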

Adoption