Write Reusable Scripts for Data Cleaning and Merging

4 / 45 Data This practice was ranked as basic. Click to read more.


Avoid untidy data-wrangling scripts, reuse code, and increase reproducibility.


Data cleaning and merging are exploratory processes and tend to lack structure. They often involve manual steps or poorly structured code that cannot be reused later. Such code also cannot be integrated into a processing pipeline.


Reusable data cleaning scripts should be written for any ML application that does not use raw or standard data sets.


Most of the time, training machine learning models is preceded by an exploratory phase in which unstructured code is written or manual steps are performed to get the data into the right format, merge several data sources, and so on. Notebooks in particular encourage ad-hoc data processing scripts that depend on variables left in memory by previously run cells.

Before moving to the training phase, it is important to convert this code into reusable scripts, with each step moved into a function that can be called and tested individually. This enables code reuse and eases integration into processing pipelines.
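As an illustration only, the sketch below shows what such a conversion might look like in Python. The record layout (`customer_id`, `email`, `order_id`) and the function names `clean_customers`, `merge_orders`, and `build_training_rows` are hypothetical and not taken from this practice. The point is that each step takes its inputs as arguments and returns a new value, so nothing depends on variables left over from earlier notebook cells.

```python
from typing import Dict, List, Optional

# Hypothetical record type: one dictionary per row.
Row = Dict[str, Optional[str]]


def clean_customers(rows: List[Row]) -> List[Row]:
    """Return cleaned copies of the rows; the input list is not modified."""
    cleaned: List[Row] = []
    seen_ids = set()
    for row in rows:
        customer_id = row.get("customer_id")
        # Drop rows without an identifier and rows repeating an identifier.
        if customer_id is None or customer_id in seen_ids:
            continue
        seen_ids.add(customer_id)
        email = row.get("email")
        cleaned.append({
            "customer_id": customer_id,
            # Normalize whitespace and case so equal emails compare equal.
            "email": email.strip().lower() if email else None,
        })
    return cleaned


def merge_orders(customers: List[Row], orders: List[Row]) -> List[Row]:
    """Inner-join orders to customers on customer_id."""
    by_id = {customer["customer_id"]: customer for customer in customers}
    merged: List[Row] = []
    for order in orders:
        customer = by_id.get(order.get("customer_id"))
        # Orders that reference an unknown customer are left out.
        if customer is not None:
            merged.append({**order, "email": customer["email"]})
    return merged


def build_training_rows(raw_customers: List[Row], raw_orders: List[Row]) -> List[Row]:
    """Single entry point that a processing pipeline could call."""
    return merge_orders(clean_customers(raw_customers), raw_orders)
```

Because each function is self-contained, it can be covered by a small unit test with a handful of hand-written rows, and the same `build_training_rows` call can be used from a notebook, a script, or a pipeline step.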


