Ensure Data Labelling is Performed in a Strictly Controlled Process

June, 2025 • Alex Serban, Koen van der Blom, Joost Visser

5 / 46 • Data

This practice was ranked as basic.

Intent

Avoid invalid or incomplete labels.

Motivation

Controlling the data labelling process helps ensure label quality, an important quality driver for supervised learning algorithms.

Applicability

Data label control should be applied to any machine learning application that uses labels, i.e. supervised learning and its variants, such as semi-supervised learning.

Description

In supervised learning, labels are crucial for the proper functioning of any algorithm. However, labelling large quantities of data is not trivial, for two reasons. Firstly, the volume of data is typically large. Secondly, choosing labels is a subjective activity and may introduce bias or noise. Incorrect labels, in turn, may lead to sub-optimal results.

Imposing a strictly controlled process for data labelling helps ensure that your algorithm is served with high-quality data, and helps to avoid label-related issues that would otherwise surface only during model debugging and error tracing.
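One part of a controlled process can be an automated check for invalid or incomplete labels before data reaches training. The following is a minimal sketch, not a prescribed implementation: the label set `ALLOWED_LABELS`, the `LabelReport` type, and the `check_labels` function are hypothetical names chosen for illustration, and labels are assumed to be stored as a mapping from item id to label.

```python
from dataclasses import dataclass

# Hypothetical label set for illustration; replace with the project's own schema.
ALLOWED_LABELS = {"cat", "dog", "other"}


@dataclass
class LabelReport:
    missing: list  # ids of items that have no label (incomplete)
    invalid: list  # ids of items whose label is outside the allowed set


def check_labels(labels, allowed=ALLOWED_LABELS):
    """Report unlabelled items and labels outside the allowed set.

    `labels` maps an item id to its label; None or "" means unlabelled.
    """
    missing = [i for i, lab in labels.items() if lab is None or lab == ""]
    invalid = [
        i for i, lab in labels.items()
        if lab not in (None, "") and lab not in allowed
    ]
    return LabelReport(missing=sorted(missing), invalid=sorted(invalid))
```

Such a check could run whenever a labelled batch is submitted, so that problems are reported back to the labellers rather than discovered later in training.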

A mature data labelling process includes peer-reviewing all labels by a second team member.
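The outcome of such a peer review can be summarised numerically. The sketch below assumes that both the original labeller and the reviewer produce a mapping from item id to label. It lists the items they disagree on and computes Cohen's kappa, a common chance-corrected agreement measure. The choice of kappa and the function names are assumptions for illustration, not something this practice prescribes.

```python
from collections import Counter


def disagreements(first, second):
    """Ids where the reviewer's label differs from the original label.

    Items the reviewer has not labelled also count as disagreements.
    """
    return sorted(i for i in first if first[i] != second.get(i))


def cohen_kappa(first, second):
    """Cohen's kappa over the items labelled by both team members."""
    ids = sorted(set(first) & set(second))
    if not ids:
        raise ValueError("no items were labelled by both team members")
    n = len(ids)
    # Fraction of items on which both labels are identical.
    observed = sum(first[i] == second[i] for i in ids) / n
    # Agreement expected by chance, from each person's label frequencies.
    freq_first = Counter(first[i] for i in ids)
    freq_second = Counter(second[i] for i in ids)
    expected = sum(freq_first[lab] * freq_second[lab] for lab in freq_first) / (n * n)
    if expected == 1.0:
        return 1.0  # both used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

Disagreeing items can be sent back for discussion or a third opinion, and a low kappa may indicate that the labelling guidelines themselves need clarification.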

Low or sub-optimal label quality can impact the whole machine learning pipeline. If this problem cannot be addressed (and a machine learning solution is still desired), make sure you document and communicate this issue within the team.

Adoption