Whenever external data sources are used, or collected data may be incomplete or ill-formatted, it is important to verify the data quality. Invalid or incomplete data may cause outages in production or lead to inaccurate models.
Start by checking simple data attributes, such as:
- data types,
- missing values,
- minimum and maximum values,
- histograms of continuous values,
and gradually include more complex data statistics, such as the ones recommended here.
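The simple checks listed above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a prescribed implementation: the function name `profile_column`, the `bins` parameter, and the sample `ages` column are all hypothetical, and missing values are assumed to be encoded as `None` or `NaN`.

```python
import math
from collections import Counter


def profile_column(values, bins=5):
    """Summarise one column: types seen, missing count, min/max, histogram."""
    # Assumption: missing values are encoded as None or float NaN.
    present = [
        v for v in values
        if v is not None and not (isinstance(v, float) and math.isnan(v))
    ]
    summary = {
        "types": Counter(type(v).__name__ for v in present),
        "missing": len(values) - len(present),
    }
    # Only numeric entries contribute to min/max and the histogram.
    numeric = [
        v for v in present
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    ]
    if numeric:
        lo, hi = min(numeric), max(numeric)
        summary["min"], summary["max"] = lo, hi
        # Equal-width bins; fall back to width 1.0 if all values are identical.
        width = (hi - lo) / bins or 1.0
        counts = [0] * bins
        for v in numeric:
            idx = min(int((v - lo) / width), bins - 1)
            counts[idx] += 1
        summary["histogram"] = counts
    return summary


# Hypothetical column with two missing entries and one stray string.
ages = [23, 31, None, 47, 52, float("nan"), 38, "unknown"]
report = profile_column(ages, bins=3)
print(report)
```

A mixed `types` counter (here both `int` and `str`) or an unexpected minimum or maximum is often the first visible sign that an upstream source has changed its format.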
Missing data can also be substituted using data imputation, for example imputation by zero, mean, median, or random values.
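As a rough sketch of these imputation strategies, the hypothetical helper below fills missing entries (assumed here to be encoded as `None`) using only the standard library. The name `impute`, the `strategy` strings, and the `seed` parameter are illustrative choices, and "random" is interpreted as sampling from the observed values, which is only one possible reading.

```python
import random
import statistics


def impute(values, strategy="mean", seed=0):
    """Return a copy of `values` with None entries replaced."""
    observed = [v for v in values if v is not None]
    if strategy == "zero":
        fill = lambda: 0
    elif not observed:
        raise ValueError("cannot impute from a column with no observed values")
    elif strategy == "mean":
        mean = statistics.mean(observed)
        fill = lambda: mean
    elif strategy == "median":
        median = statistics.median(observed)
        fill = lambda: median
    elif strategy == "random":
        # Assumption: "random" means drawing from the observed values.
        rng = random.Random(seed)
        fill = lambda: rng.choice(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return [fill() if v is None else v for v in values]


# Hypothetical sensor readings with two gaps.
readings = [2.0, None, 4.0, 9.0, None]
for name in ("zero", "mean", "median", "random"):
    print(name, impute(readings, name))
```

Whichever strategy is used, it is worth recording which values were imputed, since imputation changes the distribution that later checks and models will see.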
Also, make sure the data verification scripts are reusable and can later be integrated into any processing pipeline.
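One way to keep verification reusable is to separate the rules from the code that applies them, so the same function can run in a notebook, a test suite, or a pipeline step. The sketch below assumes rows arrive as dictionaries; `validate_rows`, the `SCHEMA` layout, and its keys (`type`, `min`, `max`, `required`) are hypothetical and not taken from any particular library.

```python
def validate_rows(rows, schema):
    """Return a list of readable problems; an empty list means the data passed."""
    problems = []
    for i, row in enumerate(rows):
        for field, rule in schema.items():
            value = row.get(field)
            if value is None:
                # Fields are treated as required unless the rule says otherwise.
                if rule.get("required", True):
                    problems.append(f"row {i}: '{field}' is missing")
                continue
            if not isinstance(value, rule["type"]):
                problems.append(
                    f"row {i}: '{field}' has type {type(value).__name__}, "
                    f"expected {rule['type'].__name__}"
                )
                continue
            lo, hi = rule.get("min"), rule.get("max")
            if lo is not None and value < lo:
                problems.append(f"row {i}: '{field}'={value} is below {lo}")
            if hi is not None and value > hi:
                problems.append(f"row {i}: '{field}'={value} is above {hi}")
    return problems


# Hypothetical rules and data for illustration only.
SCHEMA = {
    "age": {"type": int, "min": 0, "max": 120},
    "country": {"type": str},
}
rows = [
    {"age": 34, "country": "NL"},
    {"age": -1, "country": "DE"},
    {"age": "40", "country": None},
]
problems = validate_rows(rows, SCHEMA)
for problem in problems:
    print(problem)
```

Because the function returns problems instead of raising immediately, a calling pipeline can decide whether to fail the run, quarantine the offending rows, or only log a warning.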
- Check that Input Data is Complete, Balanced and Well Distributed
- Write Reusable Scripts for Data Cleaning and Merging
- Data management challenges in production machine learning
- ML Ops: Machine Learning as an Engineering Discipline