Use Sanity Checks for All External Data Sources
Intent
Prevent invalid or incomplete data from being processed.
Motivation
Data is at the heart of any machine learning model. Therefore, avoiding data errors is crucial for model quality.
Applicability
Data quality control should be applied to any machine learning application.
Description
Whenever external data sources are used, or data is collected that may be incomplete or ill-formatted, it is important to verify data quality. Invalid or incomplete data may cause outages in production or lead to inaccurate models.
Start by checking simple data attributes, such as:
- data types,
- missing values,
- minimum and maximum values,
- histograms of continuous values,
and gradually include more complex data statistics, such as the ones recommended here.
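The simple checks listed above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a prescribed implementation: the `SCHEMA` dictionary, the column names `age` and `income`, the allowed ranges, and the helper names `check_rows` and `histogram` are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical schema: column name -> (expected type, min allowed, max allowed).
SCHEMA = {
    "age": (int, 0, 120),
    "income": (float, 0.0, 1e7),
}


def check_rows(rows, schema=SCHEMA):
    """Return a list of human-readable problems found in `rows` (a list of dicts)."""
    problems = []
    for i, row in enumerate(rows):
        for col, (typ, lo, hi) in schema.items():
            value = row.get(col)
            if value is None:
                # Missing value check.
                problems.append(f"row {i}: '{col}' is missing")
            elif isinstance(value, bool) or not isinstance(value, typ):
                # Data type check (bool is rejected because it is a subclass of int).
                problems.append(
                    f"row {i}: '{col}' has type {type(value).__name__}, "
                    f"expected {typ.__name__}"
                )
            elif not lo <= value <= hi:
                # Minimum / maximum check.
                problems.append(f"row {i}: '{col}'={value} outside [{lo}, {hi}]")
    return problems


def histogram(values, bin_width):
    """Count values per fixed-width bin, keyed by the lower edge of each bin."""
    return Counter((v // bin_width) * bin_width for v in values)


rows = [
    {"age": 34, "income": 52000.0},
    {"age": None, "income": 48000.0},
    {"age": 230, "income": "n/a"},
]
problems = check_rows(rows)
for p in problems:
    print(p)

# Inspect the distribution of the valid ages in bins of 10 years.
ages = [r["age"] for r in rows if isinstance(r["age"], int)]
print(histogram(ages, 10))
```

Collecting all problems in a list, rather than stopping at the first one, makes it easier to judge how badly a data source is affected before deciding whether to reject or repair it.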
Missing data can also be substituted using data imputation, for example imputation by zero, mean, median, or random values.
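As an illustration of imputation, the sketch below fills missing entries (represented here as `None`, which is an assumption about the data encoding) with zero, the mean, or the median of the observed values. The function name `impute` and its `strategy` parameter are hypothetical; random-value imputation is left out to keep the example deterministic.

```python
import statistics


def impute(values, strategy="mean"):
    """Return a copy of `values` with None replaced according to `strategy`."""
    observed = [v for v in values if v is not None]
    if strategy == "zero":
        fill = 0
    elif not observed:
        # Mean and median are undefined when nothing was observed.
        raise ValueError("cannot impute from a column with no observed values")
    elif strategy == "mean":
        fill = statistics.mean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return [fill if v is None else v for v in values]


print(impute([1.0, None, 3.0], "mean"))
print(impute([1, None, 2, 9], "median"))
print(impute([None, 5], "zero"))
```

Which strategy is appropriate depends on the feature: the median is less sensitive to outliers than the mean, while zero is only sensible when zero is a meaningful value for the column.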
Also, make sure the data verification scripts are reusable and can later be integrated into any processing pipeline.
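One possible way to keep verification reusable is to express each check as a small function and run a list of them through a single entry point that any pipeline step can call. The sketch below shows this design under stated assumptions: `DataValidationError`, `no_missing`, `in_range`, `validate`, and the `age` column are hypothetical names chosen for the example.

```python
class DataValidationError(ValueError):
    """Raised when a batch of rows fails one or more sanity checks."""


def no_missing(column):
    """Build a check that reports rows where `column` is absent or None."""
    def check(rows):
        bad = [i for i, r in enumerate(rows) if r.get(column) is None]
        return f"'{column}' missing in rows {bad}" if bad else None
    return check


def in_range(column, lo, hi):
    """Build a check that reports rows where `column` falls outside [lo, hi]."""
    def check(rows):
        bad = [
            i for i, r in enumerate(rows)
            if r.get(column) is not None and not lo <= r[column] <= hi
        ]
        return f"'{column}' out of [{lo}, {hi}] in rows {bad}" if bad else None
    return check


def validate(rows, checks):
    """Run every check; raise with all failures, otherwise return rows unchanged."""
    failures = [msg for msg in (check(rows) for check in checks) if msg]
    if failures:
        raise DataValidationError("; ".join(failures))
    return rows


# The same list of checks can be reused by training and serving pipelines.
CHECKS = [no_missing("age"), in_range("age", 0, 120)]
clean = validate([{"age": 30}, {"age": 41}], CHECKS)
```

Because `validate` returns its input when all checks pass, it can be placed between any two pipeline stages without changing the data that flows through.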
Adoption
Related
- Check that Input Data is Complete, Balanced and Well Distributed
- Write Reusable Scripts for Data Cleaning and Merging
Read more
- Data management challenges in production machine learning
- ML Ops: Machine Learning as an Engineering Discipline