Quality Attributes for ML-Enabled Systems: A Research Synthesis
As part of our ongoing exploration of current trends in ML engineering best practices, we identified five key areas that define the landscape. This article discusses the final area: quality attributes for machine learning systems.
Our synthesis of 24 academic papers reveals how the community approaches ML system quality: fairness, safety, and explainability now rank above accuracy in engineering discussions. Three cross-cutting themes run through these articles: reproducibility as a first-class concern, the inadequacy of single aggregate metrics, and a preference for explicit over implicit decisions. The papers also cluster into seven interconnected research areas: code quality (cataloguing 22 ML-specific anti-patterns), testing strategies across traditional ML and LLMs, technical debt (ML projects carry twice the debt density of conventional software), fairness as both a conceptual framework and an operational challenge, architectural tactics with quantifiable trade-offs, explainability evaluation methods, and certification frameworks for safety-critical deployment.
Cross-Cutting Themes Across the Literature
Before examining individual topics, we highlight three recurring themes across the studies:
Reproducibility as a First-Class Concern. Studies on code quality [1], technical debt [6], and certification [13] independently emphasize reproducibility. Zhang et al. [1] identify “randomness uncontrolled” and “hyperparameters not explicitly set” as code smells. Bhatia et al. [6] find that reproducibility-related debt persists longer than other types. Andrade et al. [13] require deterministic inference as a precondition for certification. This convergence suggests reproducibility should be treated as a quality attribute alongside accuracy and performance.
The Inadequacy of Single Metrics. Multiple papers critique reliance on aggregate quality metrics. Dobslaw et al. [4] distinguish atomic from aggregated oracles in LLM testing. Indykov et al. [9] find that fairness and safety rank above accuracy in quality-attribute frequency. Yang et al. [3] demonstrate that capability-based evaluation predicts generalization better than overall accuracy. The consistent conclusion is that single-number evaluations obscure critical failure modes.
Explicit Over Implicit. Whether discussing hyperparameters [1], fairness definitions [7, 8], architectural trade-offs [9], or dataset provenance [13], the literature consistently favors explicit documentation and deliberate choice over implicit defaults.
Code Quality: Cataloguing ML-Specific Anti-Patterns
Zhang et al. [1] synthesized academic literature, grey literature, and practitioner forums to identify a set of code smells specific to ML applications. The identified smells cluster into three categories by pipeline stage:
Data Processing Smells include unnecessary iteration, chained indexing in dataframe operations, and implicit data type handling. These smells harm efficiency and increase error-proneness.
Training Smells include hyperparameters not explicitly set, uncontrolled randomness, memory not freed between training runs, and data leakage through improper cross-validation. These smells impact reproducibility, resource efficiency, and model validity.
Deployment Smells include gradient accumulation errors, tensor shape mismatches, and API deprecation issues. These impact reliability and maintainability.
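To make the training smells concrete, here is a minimal sketch (not taken from the paper) contrasting the smelly defaults with an explicit, reproducible version; the dataset and model choices are purely illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = rng.random((200, 5)), rng.integers(0, 2, 200)  # illustrative data

# Smell: uncontrolled randomness and hyperparameters not explicitly set.
# X_tr, X_te, y_tr, y_te = train_test_split(X, y)   # split changes on every run
# clf = RandomForestClassifier()                    # defaults left implicit

# Fix: seed every source of randomness and set key hyperparameters explicitly.
SEED = 42
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=SEED)
clf = RandomForestClassifier(n_estimators=100, max_depth=8, random_state=SEED)
clf.fit(X_tr, y_tr)
print(f"held-out accuracy: {clf.score(X_te, y_te):.2f}")
```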
Testing: Convergent Strategies Across ML Paradigms
The testing literature reveals both LLM-specific approaches and more generalizable principles.
Traditional ML Testing. Riccio et al. [2] systematically mapped testing approaches, identifying metamorphic testing as the dominant strategy for addressing the oracle problem. Rather than specifying expected outputs, metamorphic testing specifies relationships: rotating an image should not change its classification; adding neutral words should not flip sentiment.
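As an illustration, a metamorphic test for the sentiment example might look like the following sketch; `predict_sentiment` is a hypothetical stand-in for the classifier under test.

```python
import pytest

NEUTRAL_SUFFIXES = [" Anyway.", " That is all.", " Thanks for reading."]

@pytest.mark.parametrize("suffix", NEUTRAL_SUFFIXES)
def test_neutral_suffix_does_not_flip_sentiment(suffix):
    base = "The battery life on this laptop is excellent."
    # Metamorphic relation: appending neutral filler must not change the predicted label.
    # predict_sentiment is a hypothetical wrapper returning e.g. "positive" / "negative".
    assert predict_sentiment(base) == predict_sentiment(base + suffix)
```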
Mutation testing has also been proposed, with ML-specific adaptations. Traditional mutation operators designed for code do not transfer effectively to neural networks; domain-specific operators such as label perturbation, weight manipulation, and simulated data-augmentation failures prove more effective in practice.
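A label-perturbation operator, for instance, could be sketched as follows; this assumes a generic scikit-learn-style estimator and a caller-supplied test suite, not any specific mutation tool's API.

```python
import numpy as np

def perturb_labels(y, fraction=0.1, n_classes=2, seed=0):
    """Return a copy of y (NumPy array of int labels) with a fraction flipped to another class."""
    rng = np.random.default_rng(seed)
    y_mut = y.copy()
    idx = rng.choice(len(y), size=int(fraction * len(y)), replace=False)
    y_mut[idx] = (y_mut[idx] + rng.integers(1, n_classes, size=len(idx))) % n_classes
    return y_mut

def mutant_killed(estimator, X_train, y_train, test_suite, fraction=0.1):
    """A mutant is 'killed' if the test suite fails on a model trained on perturbed labels."""
    mutant = estimator.fit(X_train, perturb_labels(y_train, fraction))
    return not test_suite(mutant)  # test_suite returns True when all checks pass
```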
Industry Practice. Song et al. [14] conducted an interactive rapid review bridging academic testing literature with industrial practice. Their findings reveal a focus on testing data quality and preprocessing stages, areas under-represented in academic work. Marijan [15] further examines test case prioritization for continuous integration in ML systems, finding that ML-specific prioritization techniques outperform traditional approaches when training data evolves frequently.
LLM and GenAI Testing. Dobslaw et al. [4] propose a taxonomy for LLM testing challenges, distinguishing atomic oracles (pass/fail on individual cases) from aggregated oracles (statistical properties across many cases). Testing whether a chatbot “responds politely” requires aggregated assessment, not individual test cases.
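A minimal sketch of such an aggregated oracle is shown below; `chatbot_reply` and `is_polite` are hypothetical stand-ins for the system under test and a (possibly model-based) judge.

```python
def aggregated_politeness_oracle(prompts, chatbot_reply, is_polite, min_rate=0.95):
    """Pass only if at least `min_rate` of replies across the batch are judged polite."""
    replies = [chatbot_reply(p) for p in prompts]
    polite_rate = sum(is_polite(r) for r in replies) / len(replies)
    # The verdict is statistical: no single reply decides pass/fail on its own.
    return polite_rate >= min_rate, polite_rate
```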
Aleti [16] extends this taxonomy to generative AI systems more broadly, identifying the oracle problem as particularly acute: generated outputs have no single “correct” answer. Their analysis suggests that reference-based evaluation (comparing to gold standards) must be complemented by reference-free approaches that evaluate coherence, relevance, and safety independently.
Convergent Principles. Despite targeting different ML paradigms, the literature converges on several principles:
- Capability-based organization: Yang et al. [3] advocate organizing tests around abstract capabilities (negation handling, numerical reasoning) rather than individual examples.
- Adequacy metrics beyond coverage: Asgari et al. [5] propose test-suite adequacy metrics that measure diversity, failure-region coverage, and scenario completeness, moving ML testing beyond neuron coverage toward more meaningful adequacy criteria.
- Data-centric testing: Multiple papers emphasize that testing ML systems requires testing data quality, not just model behavior [1, 14, 15].
- Configuration as test parameter: Model configuration (hyperparameters, temperature, prompt format) must be part of the test specification, not assumed constant (see the sketch after this list).
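The sketch below combines two of these principles: tests are grouped by capability and parametrized over model configuration. `generate` is a hypothetical wrapper around the model under test.

```python
import pytest

TEMPERATURES = [0.0, 0.7]  # configuration is part of the test specification

@pytest.mark.parametrize("temperature", TEMPERATURES)
def test_capability_negation_handling(temperature):
    answer = generate("Is the sentiment of 'the food was not good at all' positive or negative?",
                      temperature=temperature)
    assert "negative" in answer.lower()

@pytest.mark.parametrize("temperature", TEMPERATURES)
def test_capability_numerical_reasoning(temperature):
    answer = generate("What is 17 + 26?", temperature=temperature)
    assert "43" in answer
```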
Technical Debt: Quantification for ML
Bhatia et al. [6] provide the first large-scale empirical comparison of self-admitted technical debt (SATD) between ML and non-ML projects. They find that ML projects carry twice the debt density (1.87% vs. 0.92%), quantifying what practitioners have long suspected.
Novel Debt Types. Beyond confirming higher debt levels, the study identifies two debt types absent from prior SATD taxonomies:
- Configuration debt: Arising from the complex configurations wiring together data pipelines, model architectures, and training procedures
- Inadequate tests debt: Arising from the difficulty of testing ML systems and the temptation to skip test coverage
These types account for a combined 15% of ML-specific SATD instances, suggesting they warrant dedicated detection and management strategies.
Connection to Code Smells. The debt study complements the code smell catalog [1]. Where Zhang et al. identify what patterns to avoid, Bhatia et al. quantify where such patterns accumulate and how long they persist. Together, they suggest a detection-prioritization pipeline: identify smells through static analysis, prioritize remediation in high-churn areas with debt history.
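A rough sketch of that pipeline, assuming smell counts come from a static analyzer and churn and debt-age figures come from version-control history (all hypothetical inputs), might look like this:

```python
def prioritize_modules(smells_per_module, churn_per_module, debt_age_per_module):
    """Rank modules by smell count weighted by recent churn and how long debt has lingered."""
    scores = {
        module: smells_per_module.get(module, 0)
        * (1 + churn_per_module.get(module, 0))
        * (1 + debt_age_per_module.get(module, 0))
        for module in smells_per_module
    }
    return sorted(scores, key=scores.get, reverse=True)  # most urgent first
```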
Fairness: From Technical Debt to Social Impact
Santos et al. [7] propose software fairness debt as a category distinct from technical and social debt. Their contribution is conceptual: arguing that the societal impact of biased systems demands its own debt metaphor with distinct causes, effects, and remediation strategies.
Root Cause Taxonomy. Their systematic review identifies eight root causes of fairness deficiency, with training data bias and design bias receiving the most attention in the literature. Model bias follows closely, while cognitive bias, historical bias, requirements bias, and testing bias receive moderate attention. Societal bias remains the least explored. Primary mitigations range from data augmentation and reweighting for training data issues, to inclusive design reviews, fairness-aware algorithms, and demographic test coverage. The relative under-exploration of societal and requirements bias suggests research opportunities in upstream bias sources.
Operationalizing Fairness. d’Aloisio et al. [8] complement the conceptual framework with operational tooling. Their approach provides a domain-specific language for specifying fairness definitions and automating assessment.
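The sketch below is not the MODNESS DSL; it merely illustrates, in plain Python, what operationalizing one common fairness definition (demographic parity) as an automated check can look like.

```python
import numpy as np

def demographic_parity_difference(y_pred, group):
    """Absolute gap in positive-prediction rates between two groups (group coded 0/1)."""
    y_pred, group = np.asarray(y_pred), np.asarray(group)
    return abs(y_pred[group == 0].mean() - y_pred[group == 1].mean())

def assert_demographic_parity(y_pred, group, threshold=0.1):
    """Fail the assessment when the gap exceeds a project-specific threshold."""
    gap = demographic_parity_difference(y_pred, group)
    assert gap <= threshold, f"demographic parity difference {gap:.2f} exceeds {threshold}"
```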
The Sustainability Crisis. Mim et al. [17] investigate what makes fairness tool projects sustainable in open source. Their finding that 53% of open-source fairness projects become inactive within their first three years highlights a sobering obstacle: while the research community has developed sophisticated fairness concepts and tools, maintaining them in practice proves difficult. The study identifies factors that predict sustainability, such as institutional backing, clear governance, and integration with widely-used ML frameworks.
Propensity-Based Fairness Testing. Peng et al. [18] propose FairMatch, applying propensity score matching to identify and fix fairness bugs. Unlike traditional approaches that retrain models, FairMatch identifies similar individuals across demographic groups and tests whether they receive similar predictions. This approach enables fairness testing without requiring ground truth labels for protected attributes.
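A rough sketch of the underlying idea (not the FairMatch implementation itself): estimate each individual's propensity of belonging to the protected group from the remaining features, match the closest pairs across groups, and compare the model's predictions within each matched pair.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def matched_prediction_gaps(X, group, model_predict):
    """X: non-protected features; group: 0/1 protected attribute; model_predict: system under test."""
    propensity = LogisticRegression(max_iter=1000).fit(X, group).predict_proba(X)[:, 1]
    idx_a, idx_b = np.where(group == 0)[0], np.where(group == 1)[0]
    gaps = []
    for i in idx_a:
        j = idx_b[np.argmin(np.abs(propensity[idx_b] - propensity[i]))]  # nearest-propensity match
        gaps.append(abs(model_predict(X[i:i + 1])[0] - model_predict(X[j:j + 1])[0]))
    return np.array(gaps)  # large gaps flag candidate fairness bugs
```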
Connection to Architecture. Indykov et al. [9] find that fairness ranks as the most frequently mentioned quality attribute in ML architecture literature. This architectural attention validates treating fairness as a system-level concern, not merely a model-level metric.
Architectural Tactics: Quality Trade-offs Made Explicit
Indykov et al. [9] conducted the most comprehensive systematic review in our corpus, analyzing 206 papers to identify quality attributes, architectural tactics, and their trade-offs.
Quality Attribute Rankings. The frequency analysis reveals community priorities that challenge traditional assumptions about ML system quality. Fairness and safety emerge as the most frequently discussed quality attributes, followed closely by security and explainability. Privacy ranks fifth and reliability sixth. Notably, accuracy – traditionally considered the primary measure of ML success – ranks only seventh, followed by robustness, maintainability, performance, and transparency.
That fairness and safety rank above accuracy reflects a shift from “does it work?” to “does it work responsibly?” This aligns with the fairness debt literature [7] and suggests architectural decisions should explicitly address responsible AI concerns.
Toward Quantitative Evaluation. Emanuilov and Dimov [25] move beyond qualitative trade-off discussion to quantitative measurement. Their framework for evaluating architectural patterns in ML systems provides metrics for scalability (e.g., Scalability Coefficient improvements from 1.62 to 6.97 in monolith-to-microservice transitions) and performance (e.g., 31% Performance Coefficient improvements via weight pruning). This quantification enables evidence-based architectural decisions rather than intuition-driven choices.
Requirements and Specification. Villamizar et al. [26] address a gap upstream of architecture: how to specify ML-enabled systems in the first place. They propose an approach that provides perspective-based guidance for identifying concerns across stakeholder viewpoints – from data scientists focused on model performance to operators concerned with deployment and monitoring. This requirements engineering approach ensures architectural decisions address all relevant quality attributes.
Explainability: Nuancing the Performance Trade-off
The literature on explainability challenges simplistic “interpretable vs. accurate” dichotomies, with multiple articles examining when the trade-off is real versus when it reflects insufficient engineering effort.
Revisiting the Trade-off. Crook et al. [19] systematically analyze claims about the performance-explainability trade-off. They find that for many application domains, the trade-off is overstated: explainable models exist within the set of near-optimal solutions. The challenge is to efficiently search for them.
When Complex Models Are Necessary. Two conditions justify black-box models:
- Hidden patterns: Some domains contain patterns too complex, non-linear, or distributed for human engineering. The success of deep learning in vision and language provides evidence that such patterns exist.
- Unengineerable features: Some concepts are intuitive to humans but resist formalization. “Natural-sounding speech” is recognizable but difficult to express as computable rules.
Evaluating Explanations. Speith and Langer [20] provide a classification of XAI evaluation methods, distinguishing between:
- Functionally-grounded evaluation: Testing whether explanations reflect actual model behavior (e.g., if weights are randomized, explanations should change)
- Human-grounded evaluation: Testing whether explanations help users understand and trust appropriately
- Application-grounded evaluation: Testing whether explanations improve downstream task performance
Their analysis reveals that most XAI research focuses on functionally-grounded evaluation, leaving human-grounded and application-grounded assessment under-explored. This gap matters because an explanation can be faithful to model behavior yet unhelpful – or even misleading – to users.
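As a concrete example of a functionally-grounded check, the sketch below applies the weight-randomization idea mentioned above; `explain` is a hypothetical attribution function, and the model is assumed to expose a Keras-like layer list.

```python
import numpy as np

def randomization_sanity_check(model, x, explain, max_similarity=0.5, seed=0):
    """If the weights are randomized, a faithful explanation should change substantially.

    Note: this mutates `model` in place; run it on a copy in practice.
    """
    original = np.asarray(explain(model, x))

    rng = np.random.default_rng(seed)
    for layer in model.layers:  # assumes a Keras-like API with get_weights/set_weights
        layer.set_weights([rng.normal(size=w.shape) for w in layer.get_weights()])

    randomized = np.asarray(explain(model, x))
    similarity = np.corrcoef(original.ravel(), randomized.ravel())[0, 1]
    return similarity < max_similarity  # low similarity means the check passes
```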
Certification and Verification: Toward Production Readiness
Multiple papers address what evidence is needed to certify ML systems for safety-critical deployment, drawing from both formal methods and domain-specific standards.
Aviation-Derived Frameworks. Andrade et al. [13] propose a certification framework adapted from aviation standards for ML components. Their framework requires identifying the five ingredients determining model behavior: training code, initial model weights, training data, inference code, and input data at inference time.
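A hypothetical manifest pinning those five ingredients might look like the sketch below; the field names and example values are illustrative, not part of the framework.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelBehaviorManifest:
    training_code_commit: str    # e.g. git SHA of the training pipeline
    initial_weights_digest: str  # hash of the starting checkpoint or init scheme
    training_data_digest: str    # hash or version tag of the training dataset
    inference_code_commit: str   # git SHA of the serving/inference code
    input_data_spec: str         # operational-domain description of inference inputs

manifest = ModelBehaviorManifest(
    training_code_commit="<git-sha>",
    initial_weights_digest="<sha256>",
    training_data_digest="<dataset-version>",
    inference_code_commit="<git-sha>",
    input_data_spec="640x480 RGB images, daylight driving scenes",
)
```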
Gariel et al. [23] extend this with a framework specifically for certifying AI-based systems in aerospace contexts. Their approach emphasizes operational domain characterization – formally specifying what inputs the system will encounter in deployment – and continuous monitoring to detect out-of-distribution inputs at runtime.
Formal Verification Tooling. Wei et al. [24] introduce a comprehensive toolbox for formally verifying deep neural networks. The tool supports verification of properties including robustness (small input perturbations don’t change outputs), safety (certain outputs are never produced), and reachability (characterizing possible outputs for input regions). While formal verification remains computationally expensive for large networks, the authors demonstrate tractable verification for safety-critical components.
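For reference, the local robustness property that such tools check for a classifier f around an input x, with perturbation budget ε, can be written as follows (this is the standard formulation, not notation taken from the paper):

$$\forall\, x' :\ \lVert x' - x \rVert_\infty \le \varepsilon \;\Longrightarrow\; \arg\max_i f_i(x') = \arg\max_i f_i(x)$$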
Runtime Verification. Beyond static certification, frameworks address runtime concerns: guardrails for unsafe outputs, out-of-distribution detection, and logging for post-hoc analysis. This connects to the architectural literature’s emphasis on monitoring and safety mechanisms [9].
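A minimal runtime-monitoring sketch in this spirit: flag inputs whose features fall outside ranges observed on the training data, and log every decision for post-hoc analysis. The class and its thresholds are illustrative, not drawn from any of the surveyed frameworks.

```python
import logging
import numpy as np

logger = logging.getLogger("ml_runtime_monitor")

class RangeMonitor:
    """Crude out-of-distribution guard based on per-feature training ranges."""

    def __init__(self, X_train, tolerance=0.1):
        self.low = X_train.min(axis=0)
        self.high = X_train.max(axis=0)
        self.margin = tolerance * (self.high - self.low)

    def check(self, x):
        """Return True if x lies within the (slightly widened) training range."""
        x = np.asarray(x)
        in_range = bool(np.all(x >= self.low - self.margin) and np.all(x <= self.high + self.margin))
        logger.info("range check passed: %s", in_range)  # retained for post-hoc analysis
        return in_range
```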
Synthesis and Conclusions
Across 24 papers spanning code quality, testing, debt, fairness, architecture, explainability, and certification, several convergent themes emerge:
Consensus Areas:
- Reproducibility requires explicit control of randomness, hyperparameters, data versioning, and environment configuration [1, 6, 13, 22]
- Single aggregate metrics obscure failure modes; capability-based and demographic-stratified evaluation provides better signal [3, 5, 18]
- Fairness is a system-level quality attribute requiring dedicated tooling, testing, and long-term maintenance commitment [7, 8, 17, 18]
- Architectural decisions involve explicit trade-offs between competing qualities; quantitative frameworks enable evidence-based choices [9, 25]
- Data-centric testing and requirements engineering are essential complements to model-centric evaluation [14, 15, 26]
Active Research:
- Aggregated oracles for LLM testing remain underserved by current tooling [4, 16]
- Configuration debt and test debt in ML systems need detection approaches [6]
- Fairness-performance trade-offs in compressed models warrant further study [9, 18]
- Supply chain security for LLM systems is newly recognized as critical [12]
- Formal verification is becoming tractable for safety-critical ML components [24]
Gaps and Opportunities:
- Integration across research communities (testing and fairness, architecture and debt)
- Longitudinal studies of ML system evolution in production [21]
- Human-grounded and application-grounded XAI evaluation [20]
- Standardized certification frameworks across domains [13, 23]
The literature collectively suggests that ML engineering is maturing from ad-hoc practices toward systematic discipline. The convergence of independent research groups on similar principles – reproducibility, explicit trade-offs, capability-based evaluation, fairness-as-architecture – indicates genuine knowledge accumulation.
For practitioners, the literature offers actionable guidance: catalog and avoid ML-specific code smells [1]; organize tests around capabilities and data quality [3, 14]; treat fairness as an architectural concern with long-term maintenance implications [7, 17]; make trade-offs explicit in design decisions [9, 25]; and invest in reproducibility infrastructure early [22].
References
1. Zhang et al. “Code Smells for Machine Learning Applications.” CAIN 2022.
2. Riccio et al. “Testing Machine Learning Based Systems: A Systematic Mapping.” 2020.
3. Yang et al. “Capabilities for Better ML Engineering.” CEUR Workshop 2022.
4. Dobslaw et al. “Challenges in Testing Large Language Model Based Software: A Faceted Taxonomy.” 2025.
5. Asgari et al. “Test Suite Instance Space Adequacy for AI Systems.” 2023.
6. Bhatia et al. “An Empirical Study of Self-Admitted Technical Debt in Machine Learning Software.” TOSEM 2025.
7. Santos et al. “Software Fairness Debt.” SE 2030, 2024.
8. d’Aloisio et al. “MODNESS: From Conceptualization to Automated Assessment of Fairness Definitions.” 2024.
9. Indykov et al. “Architectural Tactics for ML-Enabled Systems: A Systematic Literature Review.” JSS 2025.
10. Reimann & Kniesel-Wünsche. “Achieving Guidance in Applied Machine Learning through Software Engineering Techniques.” 2022.
11. Aryan et al. “The Costly Dilemma: Generalization, Evaluation and Cost-Optimal Deployment of Large Language Models.” 2023.
12. Hu et al. “Understanding Large Language Model Supply Chain: Structure, Domain, and Vulnerabilities.” 2025.
13. Andrade et al. “A Framework for Certifying Object Detection DNNs.” 2023.
14. Song et al. “Exploring ML Testing in Practice – Lessons Learned from an Interactive Rapid Review with Axis Communications.” 2022.
15. Marijan. “Comparative Study of Machine Learning Test Case Prioritization for Continuous Integration Testing.” 2022.
16. Aleti. “Software Testing of Generative AI Systems: Challenges and Opportunities.” 2023.
17. Mim et al. “What Makes a Fairness Tool Project Sustainable in Open Source?” 2025.
18. Peng et al. “Whence Is A Model Fair? Fixing Fairness Bugs via Propensity Score Matching.” 2025.
19. Crook et al. “Revisiting the Performance-Explainability Trade-Off in Explainable Artificial Intelligence (XAI).” 2023.
20. Speith & Langer. “A New Perspective on Evaluation Methods for Explainable Artificial Intelligence (XAI).” 2023.
21. Castaño et al. “Analyzing the Evolution and Maintenance of ML Models on Hugging Face.” 2023.
22. Li et al. “Deriva-ML: A Continuous FAIRness Approach to Reproducible Machine Learning Models.” 2024.
23. Gariel et al. “Framework for Certification of AI-Based Systems.” 2023.
24. Wei et al. “ModelVerification.jl: A Comprehensive Toolbox for Formally Verifying Deep Neural Networks.” 2024.
25. Emanuilov & Dimov. “A Quantitative Framework for Evaluating Architectural Patterns in ML Systems.” 2025.
26. Villamizar et al. “Identifying Concerns When Specifying Machine Learning-Enabled Systems: A Perspective-Based Approach (PerSpecML).” 2023.