Recently I was reviewing the monthly Azure bill for a proof-of-concept environment running a handful of GPT-4o deployments across support, summarisation, and internal knowledge base workloads. The number staring back at me was not small. In my opinion, the single biggest operational challenge with Azure OpenAI in production is not quality, latency, or even safety; it is cost. And for teams running multiple models across multiple use cases, it compounds quickly.
Thankfully, Microsoft shipped three features in 2025 that fundamentally change the cost optimisation playbook: Model Router for intelligent routing across model tiers, Provisioned Spillover for overflow management, and Stored Completions for capturing production traffic and distilling it into cheaper models. In this post, I am going to walk through each one, then show how they combine into a tiered architecture that keeps your Azure bill in check without sacrificing quality.
Let’s dive in!
The Cost Problem (In Real Numbers)
Before we get into the features, let me frame why this matters with some rough numbers. At the time of writing, GPT-4.1 on Global Standard runs at roughly US$2.00 per 1M input tokens and US$8.00 per 1M output tokens. GPT-4.1-nano, on the other hand, sits at US$0.10 per 1M input and US$0.40 per 1M output. That is a 20x difference on both input and output.
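To make that gap concrete, here is a quick back-of-the-envelope calculation. The traffic profile is made up purely for illustration; the per-token prices are the list prices quoted above:

```python
# Hypothetical monthly traffic profile (illustrative numbers only).
INPUT_TOKENS_M = 500   # millions of input tokens per month
OUTPUT_TOKENS_M = 100  # millions of output tokens per month

# Global Standard list prices at the time of writing (US$ per 1M tokens).
GPT_41 = {"input": 2.00, "output": 8.00}
GPT_41_NANO = {"input": 0.10, "output": 0.40}

def monthly_cost(price: dict) -> float:
    """Total monthly spend for the traffic profile above."""
    return INPUT_TOKENS_M * price["input"] + OUTPUT_TOKENS_M * price["output"]

print(f"GPT-4.1:      ${monthly_cost(GPT_41):,.2f}")       # $1,800.00
print(f"GPT-4.1-nano: ${monthly_cost(GPT_41_NANO):,.2f}")  # $90.00
```

Same traffic, a 20x difference in spend. Routing even half of that traffic to the cheaper tier moves the bill materially.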
The dirty secret of most production AI workloads is that a significant chunk of requests (simple lookups, straightforward summaries, FAQ-style questions) do not need your most expensive model. They need a model that is “good enough.” The challenge has always been: how do you route intelligently without building a custom orchestration layer?
That is exactly what Model Router solves.
Model Router: One Deployment, Many Models
Model Router is a trained language model that analyses your prompts in real time and routes each request to the most suitable underlying model. You deploy it like any other model in Foundry, point your application at the single deployment, and Model Router handles the rest. No custom routing code, no prompt classifiers, no if-else chains.
Under the hood, Model Router evaluates each prompt based on complexity, reasoning requirements, and task type, then selects from a pool of underlying models. With the 2025-05-19 preview version, it started with GPT-4.1, GPT-4.1-mini, GPT-4.1-nano, and o4-mini. The August 2025 update added GPT-5 series models, and the November 2025 GA version (2025-11-18) expanded to 18+ models including Anthropic Claude, DeepSeek, Grok, and Llama models.
Three Routing Modes
The real power is in the routing modes. When you create a custom deployment, you can choose:
| Mode | Behaviour | Best For |
|---|---|---|
| Balanced (default) | Considers models within 1-2% quality of the best option, picks the cheapest | General purpose workloads |
| Cost | Widens the quality band to 5-6%, picks the cheapest | High-volume, budget-sensitive workloads |
| Quality | Always picks the highest quality model for the prompt | Complex reasoning, compliance-critical outputs |
A community benchmark from the Microsoft Tech Community blog tested 10 prompts across the three modes and saw savings of 4.5% in Balanced, 4.7% in Cost, and 14.2% in Quality mode (that last one surprised me too; turns out selective premium routing is more efficient than blanket premium). In production with hundreds of thousands of requests and a mixed prompt profile, the savings compound significantly.
You can also define a custom model subset, restricting which models are eligible for routing. This is useful for compliance (only route to models hosted in your data zone) or cost control (exclude the expensive reasoning models entirely for a given deployment).
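Microsoft has not published the router's exact selection algorithm, but the mode semantics from the table above boil down to "cheapest model within a quality band of the best". Here is a conceptual sketch of that idea; the model names are real, but the quality scores, prices, and band widths are entirely illustrative:

```python
def pick_model(candidates, quality_band):
    """Pick the cheapest model whose quality score is within `quality_band`
    of the best candidate's score.

    candidates: list of (name, quality_score, price_per_1m_input) tuples.
    """
    best = max(q for _, q, _ in candidates)
    eligible = [c for c in candidates if c[1] >= best - quality_band]
    return min(eligible, key=lambda c: c[2])[0]

# Illustrative scores for a hypothetical "simple FAQ" prompt.
models = [
    ("gpt-4.1",      0.95, 2.00),
    ("gpt-4.1-mini", 0.94, 0.40),
    ("gpt-4.1-nano", 0.91, 0.10),
]

print(pick_model(models, quality_band=0.02))  # narrow band:  gpt-4.1-mini
print(pick_model(models, quality_band=0.06))  # wide band:    gpt-4.1-nano
print(pick_model(models, quality_band=0.00))  # quality-only: gpt-4.1
```

Widening the band (Cost mode) admits cheaper models; collapsing it to zero (Quality mode) always picks the best scorer, whatever it costs.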
Here is a quick example deploying Model Router via the Azure CLI:
```shell
az cognitiveservices account deployment create \
  --name my-foundry-resource \
  --resource-group rg-blog-07-cost-optimisation \
  --deployment-name model-router-cost \
  --model-name model-router \
  --model-version 2025-11-18 \
  --model-format OpenAI \
  --sku-capacity 150 \
  --sku-name GlobalStandard
```
Once deployed, your application code does not change at all. Just point to the model-router-cost deployment name instead of gpt-4.1 or whichever model you were using before. Model Router handles the selection per-request.
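As a sketch of what that looks like over the REST API: the URL shape is the standard chat completions endpoint, and only the deployment segment changes. The helper names below are mine, and the endpoint/key come from environment variables:

```python
import json
import os
import urllib.request

API_VERSION = "2024-10-21"

def chat_url(endpoint: str, deployment: str) -> str:
    """Chat completions URL -- only the deployment segment differs from
    whatever you were calling before."""
    return (f"{endpoint}/openai/deployments/{deployment}"
            f"/chat/completions?api-version={API_VERSION}")

def route_chat(messages, deployment="model-router-cost"):
    req = urllib.request.Request(
        chat_url(os.environ["AZURE_OPENAI_ENDPOINT"], deployment),
        data=json.dumps({"messages": messages}).encode(),
        headers={"Content-Type": "application/json",
                 "api-key": os.environ["AZURE_OPENAI_API_KEY"]},
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    # The response's "model" field reports which underlying model the router
    # actually picked for this request -- handy for auditing its decisions.
    return data["choices"][0]["message"]["content"], data["model"]
```

A nice side effect of the response reporting the underlying model is that you can chart the routing distribution over time and sanity-check that the mode you chose behaves as expected.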
Note: Model Router is currently available in East US 2 and Sweden Central for Global Standard and Data Zone Standard deployments. If you are running workloads out of Australia East (like me), you will need to use Global Standard, which routes to the nearest data zone anyway.
Provisioned Spillover: The Safety Valve for PTU Deployments
If you have invested in Provisioned Throughput Units (PTU), you know the dilemma: size your PTU for peak traffic and you waste money during quiet periods, size for average traffic and you drop requests during spikes. Provisioned Spillover (GA since August 2025) eliminates this binary choice.
The concept is pretty straightforward. When your PTU deployment hits capacity (returns a 429), Spillover automatically redirects overflow requests to a designated standard (pay-as-you-go) deployment in the same resource. Your application gets a successful response instead of an error, and you only pay standard token rates for the spillover traffic.
You can enable it at the deployment level (all requests get spillover protection) or per-request using the x-ms-spillover-deployment header for more granular control:
```shell
# Per-request spillover using the header
curl "$AZURE_OPENAI_ENDPOINT/openai/deployments/my-ptu-deployment/chat/completions?api-version=2024-10-21" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $AZURE_OPENAI_AUTH_TOKEN" \
  -H "x-ms-spillover-deployment: my-standard-deployment" \
  -d '{
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Summarise the quarterly earnings report."}
    ]
  }'
```
When a request spills over, the response includes the x-ms-spillover-from-deployment header so you know it happened. You can monitor the split between PTU and standard traffic using Azure Monitor metrics with the ModelDeploymentName and IsSpillover splits.
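In application code, detecting a spillover is as simple as a case-insensitive header lookup on the response. The helper below is my own sketch; only the header name comes from the service:

```python
def spillover_source(headers: dict):
    """Return the value of the spillover header, or None if the request was
    served by the PTU deployment directly. The lookup is case-insensitive,
    as HTTP header names are."""
    lowered = {k.lower(): v for k, v in headers.items()}
    return lowered.get("x-ms-spillover-from-deployment")

# Served directly by the PTU deployment: header absent, nothing to log.
assert spillover_source({"Content-Type": "application/json"}) is None

# Spilled over: the header is present, so emit a metric or log line here.
assert spillover_source(
    {"x-ms-spillover-from-deployment": "my-ptu-deployment"}
) is not None
```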
The cost implication is significant. Instead of sizing your PTU for the 99th percentile spike, you can size for your 80th or 90th percentile baseline and let spillover catch the rest. Given that PTU reservations can save you up to 85% over hourly rates, right-sizing your PTU allocation and relying on spillover for bursts is a materially cheaper strategy than over-provisioning.
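A toy sizing comparison makes the point. Every number below is hypothetical, not an Azure list rate; plug in your own reserved PTU pricing and traffic percentiles:

```python
# Hypothetical prices -- NOT Azure list rates.
PTU_UNIT_MONTHLY = 200.0      # assumed reserved cost per PTU per month
SPILLOVER_COST_PER_1M = 2.50  # assumed blended standard rate per 1M tokens

def monthly_cost(ptu_units: int, spilled_tokens_m: float) -> float:
    """Reserved PTU cost plus pay-as-you-go cost for spilled tokens."""
    return ptu_units * PTU_UNIT_MONTHLY + spilled_tokens_m * SPILLOVER_COST_PER_1M

# Option A: size PTU for the p99 spike, so nothing ever spills over.
peak_sized = monthly_cost(ptu_units=300, spilled_tokens_m=0)

# Option B: size for the p90 baseline and let ~40M tokens/month spill over.
right_sized = monthly_cost(ptu_units=220, spilled_tokens_m=40)

print(f"Peak-sized:  ${peak_sized:,.0f}")   # $60,000
print(f"Right-sized: ${right_sized:,.0f}")  # $44,100
```

The exact crossover depends on your traffic shape, but the pattern holds: idle reserved capacity is pure waste, while spillover tokens are only paid for when they actually occur.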
Be warned: Spillover requests may incur slightly higher latency because the service prioritises sending requests to the PTU deployment first. For latency-critical workloads, monitor your p99 carefully.
Stored Completions and Distillation: The Feedback Loop
This is where cost optimisation gets genuinely clever. Stored Completions (shipped December 2024, API available February 2025) lets you capture production conversation histories as training datasets with a single parameter change.
Just add Store = true to your chat completions options:
```csharp
using OpenAI;
using OpenAI.Chat;
using System.ClientModel.Primitives;
using Azure.Identity;

#pragma warning disable OPENAI001

BearerTokenPolicy tokenPolicy = new(
    new DefaultAzureCredential(),
    "https://ai.azure.com/.default");

ChatClient client = new OpenAIClient(
    authenticationPolicy: tokenPolicy,
    options: new OpenAIClientOptions()
    {
        Endpoint = new Uri("https://my-foundry-resource.openai.azure.com/openai/v1")
    }).GetChatClient("gpt-4.1");

ChatCompletionOptions options = new()
{
    // Store = true persists this completion (prompt + response) so it can be
    // filtered and distilled later in the Foundry portal.
    Store = true,
    Metadata =
    {
        ["use_case"] = "support-summarisation",
        ["environment"] = "production"
    }
};

ChatCompletion completion = await client.CompleteChatAsync([
    new SystemChatMessage("Summarise the customer support ticket concisely."),
    new UserChatMessage(ticketText)
], options);
```
Once you have accumulated a few hundred high-quality completions (10 is the minimum, but more is better), you can distill them directly into a smaller model. The idea is straightforward: your expensive GPT-4.1 deployment has been producing excellent summaries in production. You capture those outputs, then use them to fine-tune GPT-4.1-nano to produce the same quality for your specific task at a fraction of the cost.
The workflow in the Foundry portal is: Stored Completions > Filter by metadata/quality > Distill > Select target model (e.g., gpt-4.1-nano) > Fine-tune. No manual data engineering, no CSV exports, no JSONL wrangling.
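If it helps to reason about that filtering step in code, here is a conceptual sketch. The record shape and the rating field are invented for illustration (say, scores from human review or an eval pipeline) and do not match the real stored-completions schema:

```python
def select_for_distillation(completions, use_case, min_rating=4):
    """Filter stored completions down to a distillation training set:
    right use case, production traffic only, and rated highly enough."""
    return [
        c for c in completions
        if c["metadata"].get("use_case") == use_case
        and c["metadata"].get("environment") == "production"
        and c.get("rating", 0) >= min_rating
    ]

completions = [
    {"metadata": {"use_case": "support-summarisation", "environment": "production"}, "rating": 5},
    {"metadata": {"use_case": "support-summarisation", "environment": "staging"},    "rating": 5},
    {"metadata": {"use_case": "kb-search",             "environment": "production"}, "rating": 4},
    {"metadata": {"use_case": "support-summarisation", "environment": "production"}, "rating": 2},
]

print(len(select_for_distillation(completions, "support-summarisation")))  # 1
```

The point of tagging metadata at capture time (as in the C# example above) is exactly this: it makes the later filter trivial instead of a data-archaeology exercise.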
This creates a powerful feedback loop: deploy expensive model > capture outputs > distill into cheap model > deploy cheap model behind Model Router > repeat. Each cycle makes your architecture cheaper while maintaining quality on the tasks that matter.
Reinforcement Fine-Tuning: Teaching Reasoning Models New Tricks
For teams using reasoning models like o4-mini, Reinforcement Fine-Tuning (RFT) offers another angle. Instead of supervised fine-tuning with labelled data, RFT uses a reward-based process with custom graders to improve the model’s reasoning on your specific domain.
You define a grader (string-check, text-similarity, model-based, or even custom Python code), provide training data with ground truth, and the service trains the model to maximise the grader’s score. It is GA for o4-mini (2025-04-16) and in private preview for gpt-5.
The cost angle here is that a well-tuned o4-mini can replace a more expensive model for domain-specific reasoning tasks. The training cost is US$100/hour, with a built-in $5,000 auto-pause guardrail so you will not accidentally burn through your budget. Once trained, the per-token inference cost is the same as the base o4-mini model, which is substantially cheaper than o3 or gpt-5 for tasks where the fine-tuned model performs comparably.
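To build intuition for how graders drive the reward signal, here is a toy string-check grader. The real graders run inside the Azure fine-tuning service; this is only a conceptual sketch of the quantity RFT optimises:

```python
def string_check_grade(prediction: str, ground_truth: str) -> float:
    """Toy string-check grader: exact match after normalising case and
    surrounding whitespace. Returns 1.0 on a match, else 0.0."""
    return float(prediction.strip().lower() == ground_truth.strip().lower())

def average_reward(pairs) -> float:
    """Mean grader score over a batch -- the signal RFT pushes upward."""
    grades = [string_check_grade(p, t) for p, t in pairs]
    return sum(grades) / len(grades)

batch = [
    ("42", "42"),          # exact match       -> 1.0
    ("  Paris ", "paris"), # match after norm  -> 1.0
    ("London", "Paris"),   # wrong answer      -> 0.0
]
print(average_reward(batch))  # 0.6666666666666666
```

This is also why getting the grader right matters so much before a long training run: the model will happily maximise whatever score you give it, including the flaws.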
Putting It All Together: The Tiered Architecture
Here is how all four pieces fit together into a cost-optimised production architecture:
- Baseline layer (PTU): Size your Provisioned Throughput for your steady-state traffic. Use the capacity calculator to estimate PTU requirements based on your average requests per minute and token sizes.
- Spillover layer (Standard): Configure Provisioned Spillover to a standard deployment in the same resource. This catches traffic bursts without over-provisioning PTU.
- Routing layer (Model Router): Deploy Model Router in Cost or Balanced mode as your application’s primary endpoint. It automatically routes simple requests to cheaper models and complex requests to premium models.
- Feedback loop (Stored Completions + Distillation): Enable Store = true on your premium model deployments. Periodically distill high-quality outputs into smaller models. Redeploy the distilled models, and optionally include them in your Model Router's custom subset.
The result: your baseline traffic runs on discounted PTU capacity. Spikes overflow gracefully to pay-as-you-go. Routine requests get routed to the cheapest model that can handle them. And over time, your distilled models get better at your specific tasks, allowing Model Router to route even more traffic to cheaper tiers.
What to Watch Out For
A few gotchas I have hit or seen others run into:
- Model Router region limits. It is only available in East US 2 and Sweden Central. If data residency is a concern, use Data Zone Standard deployments and a custom model subset to stay within your compliance boundary.
- Spillover latency. The PTU deployment is always tried first, even when it is at capacity. This adds a round-trip before the spillover kicks in. For latency-sensitive workloads, monitor your p99 and consider whether a slightly over-provisioned PTU is worth it.
- Stored Completions storage cap. You can store a maximum of 10 GB of completions data per resource. If you are running high-volume workloads, use metadata filtering to keep only the high-quality examples you actually need for distillation.
- RFT training costs. The $5,000 auto-pause is helpful, but at $100/hour, a long training run can still add up. Start with small datasets and low compute_multiplier values, then scale up once you have confirmed the grader is working correctly.
Wrapping Up
Hopefully this post has given you a practical playbook for tackling Azure OpenAI costs at scale. The combination of Model Router (smart routing), Provisioned Spillover (burst management), Stored Completions (data capture), and Reinforcement Fine-Tuning (model customisation) gives you a layered strategy that gets cheaper over time without sacrificing quality. The days of “pick one model and hope for the best” are behind us.
As always, feel free to reach out with any questions or comments!
Until next time, stay cloudy!