In supervised learning, labels are crucial for the proper functioning of any algorithm. However, labelling large quantities of data is not trivial, for two reasons. Firstly, the volume of data is typically large. Secondly, choosing labels is a subjective activity, so annotators may introduce bias or noise. Incorrect labels add noise to the training data and may lead to sub-optimal results.
Imposing a strictly controlled process for data labelling helps ensure that your algorithm is served with the best available data, and helps to avoid issues that would otherwise surface during model debugging and error tracing.
A mature data labelling process includes peer review of all labels by a second team member.
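One way to make such a peer review measurable is to compare the original labels with the reviewer's labels and compute an agreement score. The sketch below is an illustration only: it uses Cohen's kappa, a common chance-corrected agreement statistic that is not prescribed by this practice, and the function names `cohen_kappa` and `disagreements` as well as the sample labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators (Cohen's kappa)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Fraction of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical class for every item.
        return 1.0
    return (observed - expected) / (1 - expected)


def disagreements(labels_a, labels_b):
    """Indices of items the two annotators labelled differently."""
    return [i for i, (a, b) in enumerate(zip(labels_a, labels_b)) if a != b]


# Hypothetical sample: labels from the original annotator and from the reviewer.
annotator = ["cat", "dog", "cat", "cat", "dog", "dog"]
reviewer = ["cat", "dog", "dog", "cat", "dog", "cat"]

print(round(cohen_kappa(annotator, reviewer), 3))
print(disagreements(annotator, reviewer))
```

The list of disagreeing indices gives the team a concrete set of items to discuss and relabel. Which kappa value counts as acceptable is a team decision and is assumed to be agreed in advance.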
Low or sub-optimal label quality can impact the whole machine learning pipeline. If this problem cannot be addressed (and a machine learning solution is still desired), make sure you document the issue and communicate it within the team.
- A Survey on Data Collection for Machine Learning: A Big Data - AI Integration Perspective (2019)
- The curse of big data labeling and three ways to solve it
- The ultimate guide to data labeling for ML
- How to organize data labelling for ML