Following the recent recognition by the Journal of Systems and Software (JSS) of our article on engineering best practices for ML as one of its two papers of the year for 2024, we conducted a follow-up study analyzing the latest trends in ML engineering best practices. Since our initial collection of practices was finalized around 2021, we identified and reviewed 108 new articles published from 2022 to 2025 in leading software engineering venues and on arXiv.

This article is the first in a series designed to reveal the latest developments in ML engineering best practices, pinpoint new challenges, and discuss topics that have received less attention. Our analysis identifies five major research directions in the recent literature and three underrepresented areas.

We discuss these findings, give details on article selection and analysis, and present a complete list of references categorized by the identified topics.

Data Collection

To gather relevant articles, we initially parsed the titles and abstracts of all published articles from 2022 to 2025 in prominent venues in the field, such as ICSE, ESEM, TSE, and others. These lists were obtained either directly from the publication sites or through dblp. For example, dblp conveniently compiles all TSE publications from a single year onto one page, which simplified our parsing process.

Next, we parsed all articles from arXiv’s computer science software engineering (cs.SE) category between 2022 and 2025 using the official API.
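For illustration, a minimal sketch of this step is shown below. It queries the public export endpoint and reads the returned Atom feed with feedparser; the page size, sleep interval, and client-side year cutoff are our own illustrative choices rather than part of the study protocol.

```python
import time
import urllib.parse
import urllib.request

import feedparser  # the arXiv API returns an Atom feed

BASE_URL = "http://export.arxiv.org/api/query"

def fetch_cs_se(start: int, page_size: int = 100) -> list:
    """Fetch one page of cs.SE submissions, newest first."""
    query = urllib.parse.urlencode({
        "search_query": "cat:cs.SE",
        "start": start,
        "max_results": page_size,
        "sortBy": "submittedDate",
        "sortOrder": "descending",
    })
    with urllib.request.urlopen(f"{BASE_URL}?{query}") as response:
        return feedparser.parse(response.read()).entries

papers, start, done = [], 0, False
while not done:
    entries = fetch_cs_se(start)
    if not entries:
        break
    for entry in entries:
        year = int(entry.published[:4])
        if year < 2022:   # results are newest first, so we can stop here
            done = True
            break
        if year <= 2025:
            papers.append({"title": entry.title, "abstract": entry.summary, "year": year})
    start += len(entries)
    time.sleep(3)         # stay well within the API's rate limits
```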

The titles and abstracts were first filtered automatically to check for the presence of keywords such as “ML”, “machine learning”, “artificial intelligence”, and others. The remaining titles were manually reviewed, resulting in the selection of 108 relevant articles, which are included in the references section.
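A minimal sketch of this keyword filter is shown below; the keyword list here is an illustrative subset of the one we actually used, and the matching is deliberately permissive since false positives are removed during the manual review that follows.

```python
import re

# Illustrative subset of the keyword list; the real filter used more terms.
KEYWORDS = [r"\bML\b", r"machine[- ]learning", r"artificial intelligence",
            r"\bAI\b", r"deep[- ]learning", r"\bLLMs?\b"]
PATTERN = re.compile("|".join(KEYWORDS), re.IGNORECASE)

def is_candidate(title: str, abstract: str) -> bool:
    """Keep a paper if any keyword appears in its title or abstract."""
    return bool(PATTERN.search(f"{title} {abstract}"))

# Example: filter a couple of records gathered in the previous step.
records = [
    {"title": "Code Smells for Machine Learning Applications", "abstract": "..."},
    {"title": "A Study of Build Systems", "abstract": "No learning involved."},
]
candidates = [r for r in records if is_candidate(r["title"], r["abstract"])]
print([r["title"] for r in candidates])  # -> only the ML-related title
```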

Additionally, we attempted to use the phi-4 model to filter relevant articles by providing the title and abstract in a prompt. The prompt asked whether the article pertains to (a) software engineering for machine learning (SE4ML), (b) machine learning for software engineering (ML4SE), or (c) solely software engineering without any machine learning components. Unfortunately, the results were unsatisfactory as the model often failed to distinguish between SE4ML and ML4SE. More advanced prompting techniques, such as providing explicit examples, might have yielded better results. However, manually reviewing the titles ensured the accuracy of our selection.
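For completeness, the sketch below shows the shape of this classification prompt, written against an OpenAI-compatible chat endpoint; the server URL, model name, and exact wording are illustrative assumptions, and as noted above, this zero-shot setup was not reliable enough for our purposes.

```python
from openai import OpenAI

# phi-4 served behind an OpenAI-compatible endpoint (e.g., a local vLLM or
# Ollama server); the URL and model name below are illustrative assumptions.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

PROMPT = """You are classifying research papers.
Given the title and abstract below, answer with exactly one letter:
(a) software engineering for machine learning (SE4ML)
(b) machine learning for software engineering (ML4SE)
(c) software engineering with no machine learning component

Title: {title}
Abstract: {abstract}
Answer:"""

def classify(title: str, abstract: str) -> str:
    """Return 'a', 'b', or 'c' according to the model's answer."""
    response = client.chat.completions.create(
        model="phi-4",
        messages=[{"role": "user",
                   "content": PROMPT.format(title=title, abstract=abstract)}],
        temperature=0.0,
    )
    return response.choices[0].message.content.strip().lower()[:1]
```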

To identify trends from the selected articles, we began by reviewing and categorizing them based on the primary engineering topics they addressed. Subsequently, for each topic, we identified subtopics to provide a more detailed analysis. This resulted in five core topics, discussed below.

1. Engineering Practices for ML and Responsible ML

The development of engineering practices for ML is steadily evolving, with increasing emphasis on Responsible ML. Among general ML practices, requirements engineering emerges as the most extensively represented area, with studies addressing elicitation techniques for general ML systems [90, 93], frameworks tailored to Responsible ML [1, 83], and broader mapping efforts that explore the landscape of ML-specific requirements engineering [34, 78]. Closely following are contributions in architecture design, where researchers propose novel architectures [21] and reusable architectural patterns for AI and ML systems [29, 82, 106].

Other areas, such as AutoML practices [63, 70] and data engineering [13, 36], remain significantly underrepresented despite their foundational role in successful ML development. This gap is addressed further in the challenges section, where the need for more robust and systematic data practices is discussed. Additionally, a notable portion of the literature focuses on practices within diverse development contexts [3, 14, 18, 25, 73, 80, 81, 86, 87], highlighting how real-world constraints shape ML engineering.

The most substantial body of work centers on Responsible ML – covering key concerns such as fairness, bias mitigation, transparency, security, and ethical governance. Within this domain, several comprehensive catalogues of best practices and design patterns have been proposed to guide practitioners [10, 12, 17, 35, 69, 95, 96]. Specific themes such as Green ML/AI [31, 43, 51] and AI governance [2, 23, 33, 38, 95] are also gaining traction, reflecting a broader shift toward aligning innovation with long-term social, environmental, and ethical responsibilities.

2. MLOps, Model Management, and Deployment

The operationalization of ML models emerged as another area of focus, with MLOps practices gaining considerable traction in both research and industry. A substantial body of work investigates the challenges, perceptions, and adoption of MLOps, highlighting gaps in automation, tooling, collaboration, and quality assurance across ML pipelines [32, 37, 40, 48, 50, 52, 76, 79, 89, 91, 94, 104, 108].

A second group of studies contributes to the development of MLOps practices and pipeline design, proposing architectural frameworks and best practices for automating training, deployment, and explainability within production settings [45, 60, 72, 103]. Complementing this, research on model management and maintenance addresses versioning, reuse, and governance – key to sustainable ML operations in dynamic environments [16, 24, 26].

Tooling support is another growing area, with emerging solutions aimed at improving collaborative model development and version control, such as MGit and Git-Theta [68, 71].

3. Off-the-Shelf Models and Agentic Architectures

Agentic architectures are emerging as a dynamic paradigm in ML system design, where autonomous agents interact, collaborate, and adapt within complex, often real-time environments. This approach is increasingly powered by foundation models and supported through multi-agent coordination frameworks. The studies identified investigate how to design [44, 54, 56], develop [20], and deploy [27] such systems responsibly [56].

Complementing this, a growing body of work addresses design patterns and software architectures for foundation model-based systems, offering taxonomies and reference frameworks that guide the development of such systems [74, 77].

4. ML Software Quality Assessment

A growing body of research explores ML-specific quality assessment, including testing techniques that highlight unique challenges such as non-determinism, generalization, and evaluation cost trade-offs [19, 22, 41, 61, 62, 65, 97]. Complementing this, more specialized techniques like test case prioritization have been explored to improve continuous integration workflows [98], while studies on testing in practice offer insights from real-world deployments [100].

Parallel efforts in Responsible ML testing aim to embed fairness, explainability, and ethics into ML testing pipelines – proposing tools, metrics, and frameworks for fairness auditing, risk assessment, and certification [6, 9, 42, 46, 47, 66, 67, 84, 85, 107].

Keeping ML systems healthy over time demands robust maintenance practices, with recent studies analyzing the evolution of maintenance practices in open-source ecosystems [55], the management of technical debt [58], and the identification of code smells and architectural tactics for maintenance [92, 101, 102, 105]. These efforts illustrate a maturing field, moving beyond performance-centric evaluation toward more integrated engineering strategies that account for long-term risk, responsible development, and maintenance.

5. Using ML to Improve Engineering Practices

Lastly, an emerging direction explores how ML and LLMs can be used to improve engineering practices. Recent work investigates the role of LLMs in software quality assurance [4], in developing trustworthy software [28], and in recommending engineering practices [64].

Open Challenges and Underrepresented Topics

Considering the trends discussed, we explore three topics that remain underdeveloped and warrant further attention. It is important to note that the perspectives presented here are subjective and open to debate.

1. Data Engineering

Regardless of the ML methodology employed, the literature consistently underscores the critical role of high-quality data in achieving optimal results and maintaining model performance over time. Therefore, it is essential to focus on developing advanced data pipelines that enable efficient model debugging, maintenance, and evolution.

These pipelines should prioritize automation, where monitoring systems can automatically flag outliers detected in deployed models and send them to annotation teams for review. Similarly, new data should undergo automatic quality and relevance checks before being integrated into annotation pipelines. Once new annotations are accepted, they should automatically trigger new training pipelines, ensuring that models are continually updated with the most accurate and relevant data.
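As a rough illustration of the kind of automation we have in mind, the sketch below flags low-confidence or drifted predictions at inference time and queues them for annotation; the thresholds and the queue interface are hypothetical placeholders rather than references to any particular tool.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class AnnotationQueue:
    """Hypothetical stand-in for a labeling tool's intake API."""
    items: List[dict] = field(default_factory=list)

    def submit(self, sample: dict, reason: str) -> None:
        self.items.append({"sample": sample, "reason": reason})

def monitor_prediction(sample: dict, confidence: float, drift_score: float,
                       queue: AnnotationQueue,
                       conf_threshold: float = 0.5,
                       drift_threshold: float = 3.0) -> None:
    """Flag low-confidence or drifted inputs for human review.

    Thresholds are illustrative; in practice they would be calibrated
    against a validation set and monitored over time.
    """
    if confidence < conf_threshold:
        queue.submit(sample, reason=f"low confidence ({confidence:.2f})")
    elif drift_score > drift_threshold:
        queue.submit(sample, reason=f"feature drift ({drift_score:.1f} sigma)")

# Example: two predictions, one of which gets routed to annotators.
queue = AnnotationQueue()
monitor_prediction({"id": 1}, confidence=0.92, drift_score=0.4, queue=queue)
monitor_prediction({"id": 2}, confidence=0.31, drift_score=0.2, queue=queue)
print(len(queue.items))  # -> 1
```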

While some practices in the data category indicate progress in these directions, the more advanced capabilities needed to keep models relevant in dynamic environments are rarely implemented or studied at scale, pointing to a significant area for further research and development.

2. Greater Focus on Scaling and Model Efficiency Engineering

Recently, there has been a surge in publications exploring advanced scaling and optimization techniques for training and running inference on ML models (e.g., [109, 110, 111]). However, these techniques are seldom adopted or discussed within the software engineering community and are rarely translated into practical applications. Methods such as model compression, quantization, distillation, and more sophisticated optimization strategies like developing custom kernels require further investigation, particularly as larger models frequently operate inefficiently in real-world settings.
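As one concrete example of the techniques we mean, the sketch below applies PyTorch's post-training dynamic quantization to a toy model; whether this particular method is appropriate depends on the model and the deployment target, and it stands in here for the broader family of compression techniques.

```python
import torch
from torch import nn

# Toy placeholder model; any module with Linear layers works the same way.
model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10)).eval()

# Post-training dynamic quantization: weights of the listed layer types are
# stored in int8 and dequantized on the fly during inference.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
with torch.no_grad():
    print(model(x).shape, quantized(x).shape)  # same interface, smaller weights
```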

While leading companies are already leveraging these techniques – exemplified by organizations like DeepSeek, OpenAI, and Mistral (which has one of the fastest inference speeds) – there is an urgent need for the broader software engineering community to examine, abstract, and convert these methods into practical guidelines. This will help ensure that the advantages of these advanced optimization techniques are more widely accessible and can be effectively utilized across diverse applications and industries.

3. Incentivization and Collection of User Feedback from Deployed Models

Collecting user feedback on deployed models, and incentivizing users to provide it, is essential for understanding model performance, identifying biases and areas for improvement, and increasing user satisfaction. However, the most effective methods for collecting this feedback and the contexts in which they should be applied remain unclear. This presents a significant opportunity for the software engineering community to develop and refine best practices.

Consider, for example, suggesting improved translations to DeepL or providing more detailed feedback on a generated phrase in Mistral. These platforms have begun to integrate user feedback mechanisms, but the mechanisms remain clumsy or incomplete. Here, the software engineering community can play a larger role in creating standardized approaches for collecting and utilizing user feedback, ensuring that deployed models continuously evolve and improve based on real-world usage and user insights. Given the current state of ML, fostering collective participation and collaboration is essential to mitigate some of the biases and inefficiencies present in today's systems.
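To make the idea concrete, the sketch below shows a minimal feedback endpoint that ties a rating and an optional correction to a specific prediction; the framework choice (FastAPI) and the field names are our own assumptions, not a description of how DeepL or Mistral implement feedback.

```python
from typing import Optional

from fastapi import FastAPI
from pydantic import BaseModel, Field

app = FastAPI()

class Feedback(BaseModel):
    """User feedback tied to a specific prediction."""
    prediction_id: str
    rating: int = Field(ge=1, le=5)   # coarse signal, e.g., 1-5 stars
    correction: Optional[str] = None  # optional improved output
    comment: Optional[str] = None

FEEDBACK_LOG: list[Feedback] = []     # stand-in for a real store

@app.post("/feedback")
def submit_feedback(item: Feedback) -> dict:
    # In a real system this would be persisted and joined with the model's
    # inputs, outputs, and version metadata for later analysis and retraining.
    FEEDBACK_LOG.append(item)
    return {"status": "recorded", "total": len(FEEDBACK_LOG)}
```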

Conclusions

In this first article, we explore trends in the evolution of machine learning engineering practices, building upon the prior work that resulted in the catalog of practices available on this website. By analyzing 108 articles published between 2022 and 2025, we identified five key trends and three underrepresented areas in the field of ML engineering. Our initial findings highlight significant progress, especially in Responsible ML, as well as persistent challenges.

In the upcoming articles, we will explore each identified trend in greater detail, examining novel practices and research directions while potentially uncovering new challenges and underrepresented topics.

References, ordered by date (descending)

  1. How to Elicit Explainability Requirements? A Comparison of Interviews, Focus Groups, and Surveys
  2. Toward Effective AI Governance: A Review of Principles
  3. Software Engineering for Self-Adaptive Robotics: A Research Agenda
  4. Advancing Software Quality: A Standards-Focused Review of LLM-Based Assurance Techniques
  5. Future of Code with Generative AI: Transparency and Safety in the Era of AI Generated Software
  6. What Makes a Fairness Tool Project Sustainable in Open Source?
  7. Can We Recycle Our Old Models? An Empirical Evaluation of Model Selection Mechanisms for AIOps Solutions
  8. Understanding Large Language Model Supply Chain: Structure, Domain, and Vulnerabilities
  9. Whence Is A Model Fair? Fixing Fairness Bugs via Propensity Score Matching
  10. Explainability for Embedding AI: Aspirations and Actuality
  11. Who Speaks for Ethics? How Demographics Shape Ethical Advocacy in Software Development
  12. Extending Behavioral Software Engineering: Decision-Making and Collaboration in Human-AI Teams for Responsible Software Engineering
  13. Data Requirement Goal Modeling for Machine Learning Systems
  14. Towards practicable Machine Learning development using AI Engineering Blueprints
  15. Systematic Literature Review of Automation and Artificial Intelligence in Usability Issue Detection
  16. Model Lake: a New Alternative for Machine Learning Models Management and Governance
  17. Contextual Fairness-Aware Practices in ML: A Cost-Effective Empirical Evaluation
  18. A Systematic Survey on Debugging Techniques for Machine Learning Systems
  19. Challenges in Testing Large Language Model Based Software: A Faceted Taxonomy
  20. Agentic AI Software Engineers: Programming with Trust
  21. Hierarchical Fallback Architecture for High Risk Online Machine Learning Inference
  22. A quantitative framework for evaluating architectural patterns in ML systems
  23. Securing the AI Frontier: Urgent Ethical and Regulatory Imperatives for AI-Driven Cybersecurity
  24. Variability-Aware Machine Learning Model Selection: Feature Modeling, Instantiation, and Experimental Case Study
  25. The Current Challenges of Software Engineering in the Era of Large Language Models
  26. An Efficient Model Maintenance Approach for MLOps
  27. Microservice-based edge platform for AI services
  28. Engineering Trustworthy Software: A Mission for LLMs
  29. Architectural Patterns for Designing Quantum Artificial Intelligence Systems
  30. Overview of Current Challenges in Multi-Architecture Software Engineering and a Vision for the Future
  31. Do Developers Adopt Green Architectural Tactics for ML-Enabled Systems? A Mining Software Repository Study
  32. Machine Learning Operations: A Mapping Study
  33. Responsible AI in Open Ecosystems: Reconciling Innovation with Risk Assessment and Disclosure
  34. How Mature is Requirements Engineering for AI-based Systems? A Systematic Mapping Study on Practices, Challenges, and Future Research Directions
  35. A Catalog of Fairness-Aware Practices in Machine Learning Engineering
  36. Data Quality Antipatterns for Software Analytics
  37. A Large-Scale Study of Model Integration in ML-Enabled Software Systems
  38. Balancing Innovation and Ethics in AI-Driven Software Development
  39. Project Archetypes: A Blessing and a Curse for AI Development
  40. Initial Insights on MLOps: Perception and Adoption by Practitioners
  41. ModelVerification.jl: a Comprehensive Toolbox for Formally Verifying Deep Neural Networks
  42. Deriva-ML: A Continuous FAIRness Approach to Reproducible Machine Learning Models
  43. Innovating for Tomorrow: The Convergence of SE and Green AI
  44. Agent Design Pattern Catalogue: A Collection of Architectural Patterns for Foundation Model based Agents
  45. Automating the Training and Deployment of Models in MLOps by Integrating Systems with Machine Learning
  46. Software Fairness Debt
  47. How fair are we? From conceptualization to automated assessment of fairness definitions
  48. Professional Insights into Benefits and Limitations of Implementing MLOps Principles
  49. Empirical Analysis on CI/CD Pipeline Evolution in Machine Learning Projects
  50. An Empirical Study of Challenges in Machine Learning Asset Management
  51. Green AI: A Preliminary Empirical Study on Energy Consumption in DL Models Across Different Runtime Infrastructures
  52. Towards MLOps: A DevOps Tools Recommender System for Machine Learning System
  53. A Systematic Literature Review on Explainability for Machine/Deep Learning-based Software Engineering Research
  54. Design Patterns for Machine Learning Based Systems with Human-in-the-Loop
  55. Analyzing the Evolution and Maintenance of ML Models on Hugging Face
  56. Towards Responsible Generative AI: A Reference Architecture for Designing Foundation Model based Agents
  57. Continuous Management of Machine Learning-Based Application Behavior
  58. An Empirical Study of Self-Admitted Technical Debt in Machine Learning Software
  59. Secure Software Development: Issues and Challenges
  60. Towards an MLOps Architecture for XAI in Industrial Applications
  61. Identifying Concerns When Specifying Machine Learning-Enabled Systems: A Perspective-Based Approach
  62. Software Testing of Generative AI Systems: Challenges and Opportunities
  63. A General Recipe for Automated Machine Learning in Practice
  64. On Using Information Retrieval to Recommend Machine Learning Good Practices for Software Engineers
  65. The Costly Dilemma: Generalization, Evaluation and Cost-Optimal Deployment of Large Language Models
  66. A New Perspective on Evaluation Methods for Explainable Artificial Intelligence (XAI)
  67. Revisiting the Performance-Explainability Trade-Off in Explainable Artificial Intelligence (XAI)
  68. MGit: A Model Versioning and Management System
  69. Designing Explainable Predictive Machine Learning Artifacts: Methodology and Practical Demonstration
  70. Fix Fairness, Don’t Ruin Accuracy: Performance Aware Fairness Repair using AutoML
  71. Git-Theta: A Git Extension for Collaborative Development of Machine Learning Models
  72. Responsible Design Patterns for Machine Learning Pipelines
  73. QB4AIRA: A Question Bank for AI Risk Assessment
  74. A Taxonomy of Foundation Model based Systems through the Lens of Software Architecture
  75. Towards machine learning guided by best practices
  76. Scaling ML Products At Startups: A Practitioner’s Guide
  77. A Reference Architecture for Designing Foundation Model based Systems
  78. Machine Learning with Requirements: a Manifesto
  79. Toward End-to-End MLOps Tools Map: A Preliminary Study based on a Multivocal Literature Review
  80. Analysis of Software Engineering Practices in General Software and Machine Learning Startups
  81. A Case Study on AI Engineering Practices: Developing an Autonomous Stock Trading System
  82. Design Patterns for AI-based Systems: A Multivocal Literature Review and Pattern Repository
  83. Requirements Engineering Framework for Human-centered Artificial Intelligence Software Systems
  84. Framework for Certification of AI-Based Systems
  85. Towards Concrete and Connected AI Risk Assessment (C$^2$AIRA): A Systematic Mapping Study
  86. What are the Machine Learning best practices reported by practitioners on Stack Exchange?
  87. Towards Modular Machine Learning Solution Development: Benefits and Trade-offs
  88. An architectural technical debt index based on machine learning and architectural smells
  89. Studying the Characteristics of AIOps Projects on GitHub
  90. Requirements Engineering for Artificial Intelligence Systems: A Systematic Mapping Study
  91. Quality Assurance in MLOps Setting: An Industrial Perspective
  92. Capabilities for Better ML Engineering
  93. Requirements Engineering for Machine Learning: A Review and Reflection
  94. Operationalizing Machine Learning: An Interview Study
  95. Responsible AI Pattern Catalogue: A Collection of Best Practices for AI Governance and Engineering
  96. A Methodology and Software Architecture to Support Explainability-by-Design
  97. Software Testing for Machine Learning
  98. Comparative Study of Machine Learning Test Case Prioritization for Continuous Integration Testing
  99. Aspirations and Practice of Model Documentation: Moving the Needle with Nudging and Traceability
  100. Exploring ML testing in practice – Lessons learned from an interactive rapid review with Axis Communications
  101. Achieving Guidance in Applied Machine Learning through Software Engineering Techniques
  102. Code Smells for Machine Learning Applications
  103. MLOps best practices, challenges and maturity models
  104. How Do Model Export Formats Impact the Development of ML-Enabled Systems? A Case Study on Model Integration
  105. Architectural tactics to achieve quality attributes of machine-learning-enabled systems: a systematic literature review
  106. Design Patterns for AI-based Systems: A Multivocal Literature Review and Pattern Repository
  107. Fairness-aware machine learning engineering: how far are we?
  108. We Have No Idea How Models will Behave in Production until Production: How Engineers Operationalize Machine Learning

Other references

  109. The Ultra-Scale Playbook: Training LLMs on GPU Clusters
  110. DeepSeek-V3 Technical Report
  111. Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures

References grouped by topic

Development of practices for Responsible ML/AI

  1. How to Elicit Explainability Requirements? A Comparison of Interviews, Focus Groups, and Surveys
  2. Toward Effective AI Governance: A Review of Principles
  3. Software Engineering for Self-Adaptive Robotics: A Research Agenda
  4. Explainability for Embedding AI: Aspirations and Actuality
  5. Extending Behavioral Software Engineering: Decision-Making and Collaboration in Human-AI Teams for Responsible Software Engineering
  6. Data Requirement Goal Modeling for Machine Learning Systems
  7. Towards practicable Machine Learning development using AI Engineering Blueprints
  8. Contextual Fairness-Aware Practices in ML: A Cost-Effective Empirical Evaluation
  9. A Systematic Survey on Debugging Techniques for Machine Learning Systems
  10. Hierarchical Fallback Architecture for High Risk Online Machine Learning Inference
  11. Securing the AI Frontier: Urgent Ethical and Regulatory Imperatives for AI-Driven Cybersecurity
  12. The Current Challenges of Software Engineering in the Era of Large Language Models
  13. Architectural Patterns for Designing Quantum Artificial Intelligence Systems
  14. Do Developers Adopt Green Architectural Tactics for ML-Enabled Systems? A Mining Software Repository Study
  15. Responsible AI in Open Ecosystems: Reconciling Innovation with Risk Assessment and Disclosure
  16. How Mature is Requirements Engineering for AI-based Systems? A Systematic Mapping Study on Practices, Challenges, and Future Research Directions
  17. A Catalog of Fairness-Aware Practices in Machine Learning Engineering
  18. Data Quality Antipatterns for Software Analytics
  19. Balancing Innovation and Ethics in AI-Driven Software Development
  20. Project Archetypes: A Blessing and a Curse for AI Development
  21. Innovating for Tomorrow: The Convergence of SE and Green AI
  22. Green AI: A Preliminary Empirical Study on Energy Consumption in DL Models Across Different Runtime Infrastructures
  23. A General Recipe for Automated Machine Learning in Practice
  24. Designing Explainable Predictive Machine Learning Artifacts: Methodology and Practical Demonstration
  25. Fix Fairness, Don’t Ruin Accuracy: Performance Aware Fairness Repair using AutoML
  26. QB4AIRA: A Question Bank for AI Risk Assessment
  27. Towards machine learning guided by best practices
  28. Machine Learning with Requirements: a Manifesto
  29. Analysis of Software Engineering Practices in General Software and Machine Learning Startups
  30. A Case Study on AI Engineering Practices: Developing an Autonomous Stock Trading System
  31. Design Patterns for AI-based Systems: A Multivocal Literature Review and Pattern Repository
  32. Requirements Engineering Framework for Human-centered Artificial Intelligence Software Systems
  33. What are the Machine Learning best practices reported by practitioners on Stack Exchange?
  34. Towards Modular Machine Learning Solution Development: Benefits and Trade-offs
  35. Requirements Engineering for Artificial Intelligence Systems: A Systematic Mapping Study
  36. Requirements Engineering for Machine Learning: A Review and Reflection
  37. Responsible AI Pattern Catalogue: A Collection of Best Practices for AI Governance and Engineering
  38. A Methodology and Software Architecture to Support Explainability-by-Design
  39. Aspirations and Practice of Model Documentation: Moving the Needle with Nudging and Traceability
  40. Design Patterns for AI-based Systems: A Multivocal Literature Review and Pattern Repository

MLOps, model management, and deployment

  1. Can We Recycle Our Old Models? An Empirical Evaluation of Model Selection Mechanisms for AIOps Solutions
  2. Model Lake: a New Alternative for Machine Learning Models Management and Governance
  3. Variability-Aware Machine Learning Model Selection: Feature Modeling, Instantiation, and Experimental Case Study
  4. An Efficient Model Maintenance Approach for MLOps
  5. Machine Learning Operations: A Mapping Study
  6. A Large-Scale Study of Model Integration in ML-Enabled Software Systems
  7. Initial Insights on MLOps: Perception and Adoption by Practitioners
  8. Automating the Training and Deployment of Models in MLOps by Integrating Systems with Machine Learning
  9. Professional Insights into Benefits and Limitations of Implementing MLOps Principles
  10. An Empirical Study of Challenges in Machine Learning Asset Management
  11. Towards MLOps: A DevOps Tools Recommender System for Machine Learning System
  12. Towards an MLOps Architecture for XAI in Industrial Applications
  13. MGit: A Model Versioning and Management System
  14. Git-Theta: A Git Extension for Collaborative Development of Machine Learning Models
  15. Responsible Design Patterns for Machine Learning Pipelines
  16. Scaling ML Products At Startups: A Practitioner’s Guide
  17. Toward End-to-End MLOps Tools Map: A Preliminary Study based on a Multivocal Literature Review
  18. Studying the Characteristics of AIOps Projects on GitHub
  19. Quality Assurance in MLOps Setting: An Industrial Perspective
  20. Operationalizing Machine Learning: An Interview Study
  21. MLOps best practices, challenges and maturity models
  22. How Do Model Export Formats Impact the Development of ML-Enabled Systems? A Case Study on Model Integration
  23. We Have No Idea How Models will Behave in Production until Production: How Engineers Operationalize Machine Learning

Off-the-Shelf Models and Agentic Architectures

  1. Agentic AI Software Engineers: Programming with Trust
  2. Microservice-based edge platform for AI services
  3. Agent Design Pattern Catalogue: A Collection of Architectural Patterns for Foundation Model based Agents
  4. Design Patterns for Machine Learning Based Systems with Human-in-the-Loop
  5. Towards Responsible Generative AI: A Reference Architecture for Designing Foundation Model based Agents
  6. A Taxonomy of Foundation Model based Systems through the Lens of Software Architecture
  7. A Reference Architecture for Designing Foundation Model based Systems

ML software quality assessment

  1. What Makes a Fairness Tool Project Sustainable in Open Source?
  2. Understanding Large Language Model Supply Chain: Structure, Domain, and Vulnerabilities
  3. Whence Is A Model Fair? Fixing Fairness Bugs via Propensity Score Matching
  4. Challenges in Testing Large Language Model Based Software: A Faceted Taxonomy
  5. A quantitative framework for evaluating architectural patterns in ML systems
  6. ModelVerification.jl: a Comprehensive Toolbox for Formally Verifying Deep Neural Networks
  7. Software Fairness Debt
  8. How fair are we? From conceptualization to automated assessment of fairness definitions
  9. Analyzing the Evolution and Maintenance of ML Models on Hugging Face
  10. An Empirical Study of Self-Admitted Technical Debt in Machine Learning Software
  11. Identifying Concerns When Specifying Machine Learning-Enabled Systems: A Perspective-Based Approach
  12. Software Testing of Generative AI Systems: Challenges and Opportunities
  13. The Costly Dilemma: Generalization, Evaluation and Cost-Optimal Deployment of Large Language Models
  14. A New Perspective on Evaluation Methods for Explainable Artificial Intelligence (XAI)
  15. Revisiting the Performance-Explainability Trade-Off in Explainable Artificial Intelligence (XAI)
  16. Framework for Certification of AI-Based Systems
  17. Towards Concrete and Connected AI Risk Assessment (C$^2$AIRA): A Systematic Mapping Study
  18. Capabilities for Better ML Engineering
  19. Software Testing for Machine Learning
  20. Comparative Study of Machine Learning Test Case Prioritization for Continuous Integration Testing
  21. Exploring ML testing in practice – Lessons learned from an interactive rapid review with Axis Communications
  22. Achieving Guidance in Applied Machine Learning through Software Engineering Techniques
  23. Code Smells for Machine Learning Applications
  24. Architectural tactics to achieve quality attributes of machine-learning-enabled systems: a systematic literature review
  25. Fairness-aware machine learning engineering: how far are we?
  26. Deriva-ML: A Continuous FAIRness Approach to Reproducible Machine Learning Models

Using ML to improve engineering practices

  1. Advancing Software Quality: A Standards-Focused Review of LLM-Based Assurance Techniques
  2. Future of Code with Generative AI: Transparency and Safety in the Era of AI Generated Software
  3. Engineering Trustworthy Software: A Mission for LLMs
  4. On Using Information Retrieval to Recommend Machine Learning Good Practices for Software Engineers