Ensure Data Labelling is Performed in a Strictly Controlled Process
Intent
Motivation
Applicability
Description
In supervised learning, labels are crucial for the proper functioning of any algorithm. However, labelling large quantities of data is not trivial, for two reasons. Firstly, the volume of data is typically large. Secondly, choosing labels is a subjective activity and may introduce bias or noise. Incorrect labels, in turn, may lead to sub-optimal results.
Imposing a strictly controlled process for data labelling helps ensure that your algorithm is trained on the best data available, and makes model debugging and error tracing easier.
A mature data labelling process includes peer-reviewing all labels by a second team member.
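One way to make the peer-review step measurable is to compare the labels assigned by the original annotator and the reviewer. The sketch below is illustrative only: it assumes each item receives exactly one label from each of two people, and the function names `cohen_kappa` and `disagreements` as well as the sample item ids are hypothetical. It computes Cohen's kappa as a chance-corrected agreement score and lists the items that need a second look.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Fraction of items on which both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (observed - expected) / (1 - expected)


def disagreements(item_ids, labels_a, labels_b):
    """Return (item_id, label_a, label_b) for every item the two disagree on."""
    return [
        (item, a, b)
        for item, a, b in zip(item_ids, labels_a, labels_b)
        if a != b
    ]


# Hypothetical example data: four images labelled by an annotator and a reviewer.
ids = ["img1", "img2", "img3", "img4"]
annotator = ["cat", "dog", "cat", "dog"]
reviewer = ["cat", "dog", "dog", "dog"]

print(cohen_kappa(annotator, reviewer))
print(disagreements(ids, annotator, reviewer))
```

Items returned by `disagreements` can be sent to a third person or discussed by the two labellers, and a low kappa value may indicate that the labelling guidelines themselves need to be clarified.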
Low or sub-optimal label quality can impact the whole machine learning pipeline. If this problem cannot be addressed (and a machine learning solution is still desired), make sure you document and communicate the issue within the team.
Adoption
Read more
- A Survey on Data Collection for Machine Learning: A Big Data - AI Integration Perspective (2019)
- The curse of big data labeling and three ways to solve it
- The ultimate guide to data labeling for ML
- How to organize data labelling for ML