Good data management and storage are important for several reasons:
- enforcing access control,
- keeping data maintainable and fresh,
- avoiding duplication, and
- avoiding unnecessary transfers, which saves time.
Many applications deal with large data volumes. Transferring (or copying) these volumes is not trivial and may introduce delays in the processing pipelines. Needless to say, duplication also becomes an issue at this scale.
Making data sets available on shared infrastructure (e.g. S3 buckets or mountable disks) helps mitigate these issues. Moreover, it facilitates the adoption of access control policies and provides traceability, e.g. by keeping a data access log.
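As a minimal sketch, reading a data set directly from a shared S3 bucket avoids keeping private copies; the bucket and key names below are hypothetical placeholders, and access control and logging are assumed to be configured on the bucket itself:

```python
import boto3

# Hypothetical names; replace with your team's shared storage locations.
BUCKET = "team-shared-datasets"
KEY = "images/train/v1.0/metadata.csv"

# Download a single object from the shared bucket instead of copying the
# whole data set. Access is governed by the bucket's IAM policy, and each
# request can be recorded (e.g. via S3 server access logging) for
# traceability.
s3 = boto3.client("s3")
s3.download_file(BUCKET, KEY, "metadata.csv")
```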
Adopting standard naming conventions for data sets (e.g. to reflect the version) is also considered a best practice.
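For instance, a small helper can encode the data set name and version into the storage path; the prefix and the name/version/split layout below are illustrative assumptions, not a standard:

```python
def dataset_uri(name: str, version: str, split: str = "train") -> str:
    """Build a versioned S3 URI for a data set.

    The s3://team-shared-datasets prefix and the name/version/split layout
    are assumptions for illustration. The point is that the version is
    explicit in the path, so consumers never overwrite a data set or
    silently pick up a different revision.
    """
    return f"s3://team-shared-datasets/{name}/v{version}/{split}.parquet"


# e.g. s3://team-shared-datasets/customer-churn/v2.1/train.parquet
uri = dataset_uri("customer-churn", "2.1")
```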