Check that Input Data is Complete, Balanced and Well Distributed
Intent
Motivation
Applicability
Description
Besides performing sanity checks on the input data, it is recommended to continuously check for data evolution. In a continuously changing environment, the data distribution will drift over time. For example, the distribution of your users across geographical regions may change, leading to future biases towards over-represented regions.
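As an illustration, the following minimal sketch compares the share of each region between two data snapshots and flags large changes; the column name, snapshot granularity and tolerance are assumptions, not prescriptions.

```python
import pandas as pd

def region_share_drift(old: pd.DataFrame, new: pd.DataFrame,
                       column: str = "region", tolerance: float = 0.05) -> pd.DataFrame:
    """Compare per-region proportions between two snapshots and flag large changes."""
    old_share = old[column].value_counts(normalize=True)
    new_share = new[column].value_counts(normalize=True)
    report = pd.DataFrame({"old_share": old_share, "new_share": new_share}).fillna(0.0)
    report["drifted"] = (report["new_share"] - report["old_share"]).abs() > tolerance
    return report
```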
Continuously check that:
- features are still present in enough examples,
- features have the expected number of values (cardinality), e.g. an example cannot contain more than one age or age-derived feature,
- no hidden dependencies between data attributes are present,
- the input data distribution has not shifted, e.g. such that a group becomes under- or over-represented (a sketch of such checks follows this list).
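A minimal sketch of these checks, assuming the data arrives as a pandas DataFrame and that the expected cardinalities and representation bounds are supplied by the team (the thresholds below are placeholders):

```python
import pandas as pd

def check_feature_presence(df: pd.DataFrame, min_fraction: float = 0.95) -> dict:
    """Return features whose non-missing fraction falls below the threshold."""
    present = df.notna().mean()
    return {col: frac for col, frac in present.items() if frac < min_fraction}

def check_cardinality(df: pd.DataFrame, expected: dict) -> dict:
    """Return features whose number of distinct values violates the expected (low, high) bounds."""
    violations = {}
    for col, (low, high) in expected.items():
        n_unique = df[col].nunique(dropna=True)
        if not low <= n_unique <= high:
            violations[col] = n_unique
    return violations

def check_group_representation(df: pd.DataFrame, column: str,
                               min_share: float = 0.05, max_share: float = 0.5) -> dict:
    """Return groups in `column` that are under- or over-represented."""
    shares = df[column].value_counts(normalize=True)
    return {group: share for group, share in shares.items()
            if share < min_share or share > max_share}
```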
Many machine learning algorithms rely on the “independent and identically distributed” (i.i.d.) assumption, which states that training and test samples are independent (i.e. changing one sample does not influence the others) and are drawn from the same distribution. If your algorithms rely on this assumption, include checks that compare training, testing and production data to ensure no distribution drift is present.
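One way to implement such a check for a numerical feature is a two-sample Kolmogorov-Smirnov test between the training and production samples; the significance level below is an assumption and should be tuned to your tolerance for false alarms.

```python
from scipy.stats import ks_2samp

def detect_drift(train_values, production_values, alpha: float = 0.01) -> bool:
    """Return True if the two samples likely come from different distributions."""
    statistic, p_value = ks_2samp(train_values, production_values)
    return p_value < alpha
```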
A strong data validation pipeline should also include:
- dashboards or visual elements to continuously monitor data quality, and
- alerts that inform team members when unusual events occur.
If your model is trained close to real time, or uses online learning, a strong alert system helps to detect errors early and correct them; a minimal alert sketch is shown below.
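For example, a simple threshold-based alert could look as follows; the metric names, expected ranges and the notification channel (plain logging here) are assumptions to be replaced with your own monitoring stack.

```python
import logging

logger = logging.getLogger("data_validation")

def alert_if_out_of_range(metric_name: str, value: float,
                          lower: float, upper: float) -> bool:
    """Log a warning and return True when a monitored metric leaves its expected range."""
    if not lower <= value <= upper:
        logger.warning("ALERT: %s=%.4f outside expected range [%.4f, %.4f]",
                       metric_name, value, lower, upper)
        return True
    return False
```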
Adoption
Related
- Use Sanity Checks for All External Data Sources
- Perform Checks to Detect Skew between Models
- Test for Social Bias in Training Data
Read more
- Continuous Training for Production ML in the TensorFlow Extended (TFX) Platform
- Data management challenges in production machine learning
- Hidden Technical Debt in Machine Learning Systems
- Managing Machine Learning Projects