Make Data Sets Available on Shared Infrastructure (private or public)

June, 2025 • Alex Serban, Koen van der Blom, Joost Visser

8 / 46 • Data •

This practice was ranked as basic.
Click to read more. •

This practice helps to increase
the traceability of ML components.
Click to read more.

Intent

Avoid data duplication, data bottlenecks, or unnecessary transfer of large data sets.

Motivation

The amount of data processed by machine learning models is higher than usual software systems, raising concerns related to duplication, transfer, storage, and traceability. Making the data sets available on shared infrastructure helps mitigate these issues.

Applicability

Data availability on shared infrastructure should be applied to any machine learning application.

Description

Good data management and storage is important for several reasons:

access control,
virtualisation,
versioning,
maintainability and freshness,
to avoid duplication,
to avoid unnecessary transfers, and save time.

Many applications deal with large data volumes. Transferring (or copying) large data volumes is not trivial, and may introduce delays in the processing pipelines. Needless to say duplication becomes an issue with large volumes of data.

Making data sets available on shared infrastructure (e.g. S3 Buckets or mountable disks) helps mitigate these issues. Moreover, it facilitates the adoption of access control policies, and provides traceability i.e. by keeping a data access log.

Adopting standard naming conventions for the data sets – e.g. to reflect the version – is also considered a best practice.

Adoption