Track and Govern AI Inference Costs
Intent
Motivation
Applicability
Description
As machine learning systems move into production, inference cost can quickly dominate the total system cost, particularly for large language models accessed via APIs or GPU-intensive deep learning workloads.
Instrument and Attribute Costs
Measure and log the compute cost of every inference request, broken down by model, endpoint, and use case. For API-based models, record token usage per call and aggregate by feature or user segment. For self-hosted models, track GPU/CPU utilisation, memory, and infrastructure spend per deployment. Cost instrumentation should be part of the same observability stack as latency and quality metrics.
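As a minimal sketch of per-request attribution for an API-based model, the example below prices each call from its token counts and aggregates by any recorded dimension. The model names, the price table, and the names InferenceRecord, request_cost, and aggregate_costs are all hypothetical; substitute your provider's real rates and your own logging schema.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical prices in USD per 1,000 tokens; replace with your provider's real rates.
PRICES_PER_1K = {
    "small-model": {"input": 0.0005, "output": 0.0015},
    "large-model": {"input": 0.01, "output": 0.03},
}


@dataclass(frozen=True)
class InferenceRecord:
    """One logged inference request (illustrative schema)."""
    model: str
    endpoint: str
    use_case: str
    input_tokens: int
    output_tokens: int


def request_cost(record):
    """Return the cost in USD of a single request, from its token counts."""
    price = PRICES_PER_1K[record.model]
    return (
        record.input_tokens * price["input"]
        + record.output_tokens * price["output"]
    ) / 1000


def aggregate_costs(records, key):
    """Sum request costs grouped by one attribute: 'model', 'endpoint', or 'use_case'."""
    totals = defaultdict(float)
    for record in records:
        totals[getattr(record, key)] += request_cost(record)
    return dict(totals)
```

In practice the records would likely be emitted to the same pipeline as latency and quality metrics, so that cost can be sliced along the same dimensions.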
Set Budgets and Alerts
Define cost budgets per model and per team, and configure alerts when spending approaches thresholds. Treat unexpected cost spikes as incidents: investigate whether they are caused by traffic growth, model changes, or inefficient prompting patterns.
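One way to sketch the budget and spike checks is shown below. The Budget class, the 80% warning fraction, and the rule "today's spend exceeds twice the trailing average" are illustrative assumptions, not recommended values; tune them to your own traffic patterns.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Budget:
    """A monthly cost budget for one owner (a model or a team)."""
    owner: str
    monthly_limit: float        # USD
    warn_fraction: float = 0.8  # assumed threshold: warn at 80% of the limit


def budget_status(budget, spend_to_date):
    """Classify current spend as 'ok', 'warning', or 'exceeded'."""
    if spend_to_date >= budget.monthly_limit:
        return "exceeded"
    if spend_to_date >= budget.warn_fraction * budget.monthly_limit:
        return "warning"
    return "ok"


def is_spike(previous_daily_spend, today, factor=2.0):
    """Flag today's spend if it exceeds `factor` times the trailing daily average."""
    if not previous_daily_spend:
        return False  # no history yet, so nothing to compare against
    return today > factor * mean(previous_daily_spend)
```

A "warning" or a detected spike would then open an incident, at which point the investigation into traffic growth, model changes, or prompting patterns begins.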
Make Cost a Design-Time Trade-Off
Before deploying a model, explicitly evaluate the cost-quality trade-off:
- benchmark lighter models or quantised variants against the target quality bar; if the cheaper model is sufficient, use it,
- evaluate batching, caching, and request coalescing strategies at design time rather than as afterthoughts,
- document the cost assumptions for the selected model in the deployment decision record.
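The first step in the list above can be sketched as a simple selection rule: among the benchmarked candidates, pick the cheapest one that meets the quality bar. The Candidate fields and the function name cheapest_sufficient are hypothetical, and a single scalar quality score is a simplifying assumption; real evaluations often involve several metrics.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    """A benchmarked model option (illustrative fields)."""
    name: str
    quality: float               # benchmark score, higher is better
    cost_per_1k_requests: float  # estimated USD per 1,000 requests


def cheapest_sufficient(candidates, quality_bar) -> Optional[Candidate]:
    """Return the lowest-cost candidate whose quality meets the bar, or None."""
    eligible = [c for c in candidates if c.quality >= quality_bar]
    return min(eligible, key=lambda c: c.cost_per_1k_requests, default=None)
```

The chosen candidate, the quality bar, and the cost estimates used for the comparison are the kind of assumptions worth recording in the deployment decision record.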
Govern API Usage
For third-party LLM APIs, apply rate limits and quotas at the application level to protect against runaway loops or misbehaving clients. Review and renegotiate API contracts as usage scales. Do not hard-code a specific external model version unless there is a plan to reassess its cost when pricing changes.
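An application-level quota can be as simple as a fixed-window counter per client, checked before each outbound API call. The sketch below assumes a single process and in-memory state; the ClientQuota class and its parameters are hypothetical, and a multi-instance service would need shared storage instead.

```python
import time


class ClientQuota:
    """Fixed-window request quota per client, held in memory."""

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock   # injectable so the behaviour can be tested
        self._windows = {}   # client_id -> (window_start, request_count)

    def allow(self, client_id):
        """Return True and count the request if the client is within quota."""
        now = self.clock()
        start, count = self._windows.get(client_id, (now, 0))
        if now - start >= self.window_seconds:
            start, count = now, 0  # the previous window expired; start a new one
        if count >= self.max_requests:
            self._windows[client_id] = (start, count)
            return False           # over quota: the caller should skip the API call
        self._windows[client_id] = (start, count + 1)
        return True
```

A runaway loop in one client then exhausts only that client's quota for the current window, rather than generating unbounded spend.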
Related
- Use the Most Efficient Models and Optimise for Inference
- Continuously Monitor the Behaviour of Deployed Models
- Build an ML Observability Infrastructure
- Establish Organizational Structures for Responsible AI Governance