The “It Works on My Machine” Problem, But for AI
Recently I was putting together a Foundry agent demo that passed every manual test I threw at it. It answered questions accurately, called the right tools, stayed on topic. I was feeling pretty pleased with myself. Then, within 48 hours of sharing it with the team, someone managed to coax it into recommending a competitor’s product and leaking an internal API endpoint in a code sample. Classic.
The thing is, I should have known better. We wouldn’t dream of deploying a web application without automated tests and a CI/CD pipeline, yet somehow the industry has been hand-waving agent quality assurance with “yeah, I asked it a few questions and it seemed fine.” Thankfully, Microsoft has shipped a stack of tooling in 2025 that makes proper agent evaluation not just possible, but pretty straightforward.
In this post, I’m going to walk through three layers of agent safety that, in my opinion, every team should be running before they let an agent anywhere near production:
- Evaluate with the Foundry Evaluation SDK’s agentic evaluators
- Red-team with the AI Red Teaming Agent (powered by PyRIT)
- Monitor with Defender for Foundry at runtime
Let’s dive in!
Layer 1: Agentic Evaluators in the Foundry SDK
The Azure AI Evaluation SDK now includes evaluators built specifically for agentic workflows. These aren’t your standard “is this response coherent?” checks (though those exist too). These evaluators understand the multi-step, tool-calling nature of agents.
Here are the ones I use most:
- IntentResolutionEvaluator: Did the agent correctly identify what the user was actually asking? This catches those frustrating cases where the agent confidently answers the wrong question.
- ToolCallAccuracyEvaluator: Did the agent call the right tools with the right parameters? This one is brilliant for agents with multiple function tools. It supports File Search, Azure AI Search, Bing Grounding, Code Interpreter, OpenAPI, and custom function tools.
- TaskAdherenceEvaluator: Did the agent stay within scope? If your agent is meant to book flights but starts offering financial advice, this evaluator catches it.
- CodeVulnerabilityEvaluator: Does the generated code contain security vulnerabilities? Covers Python, Java, C++, C#, Go, JavaScript, and SQL. If your agent writes code for users, this is non-negotiable.
- GroundednessEvaluator: Are the agent’s responses actually grounded in the tool outputs it received, or is it hallucinating?
Getting started is straightforward. Install the SDK and point it at your agent’s conversation data:
Note: The Azure AI Evaluation SDK is currently Python-only. There is no .NET equivalent at the time of writing. If your agent code is in C#, you can still run evaluations as a separate Python step in your CI/CD pipeline.
pip install azure-ai-evaluation
If you’re using Foundry Agent Service, the AIAgentConverter handles all the data wrangling for you. Here’s how to evaluate a single agent run:
import json
import os

from azure.ai.evaluation import (
    AIAgentConverter,
    IntentResolutionEvaluator,
    TaskAdherenceEvaluator,
    ToolCallAccuracyEvaluator,
    CodeVulnerabilityEvaluator,
    ContentSafetyEvaluator,
)
from azure.ai.projects import AIProjectClient
from azure.identity import DefaultAzureCredential

# Point at your Foundry project
project_endpoint = os.environ["AZURE_AI_PROJECT"]
project_client = AIProjectClient(
    endpoint=project_endpoint,
    credential=DefaultAzureCredential(),
)
# Convert agent thread data into evaluation format
# (thread and run come from the agent invocation you want to evaluate)
converter = AIAgentConverter(project_client)
converted_data = converter.convert(thread_id=thread.id, run_id=run.id)
# Configure evaluators with your judge model
model_config = {
    "azure_deployment": os.environ["AZURE_DEPLOYMENT_NAME"],
    "api_key": os.environ["AZURE_API_KEY"],
    "azure_endpoint": os.environ["AZURE_ENDPOINT"],
    "api_version": os.environ["AZURE_API_VERSION"],
}
# Run the evaluators
evaluators = {
    "intent": IntentResolutionEvaluator(model_config=model_config),
    "task": TaskAdherenceEvaluator(model_config=model_config),
    "tools": ToolCallAccuracyEvaluator(model_config=model_config),
}

for name, evaluator in evaluators.items():
    result = evaluator(**converted_data)
    print(f"{name}: {json.dumps(result, indent=2)}")
Each evaluator returns a score on a 1 to 5 Likert scale, a pass/fail result against a configurable threshold, and (this is the good bit) a reason explaining why it scored the way it did. That reason field is gold for debugging.
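To make that concrete, here’s roughly what a failing result looks like. Note the key names below are my illustration of the score/verdict/reason triple, not the SDK’s exact schema — inspect a real result to confirm:

```python
# Illustrative evaluator output — field names are assumptions, not the
# SDK's guaranteed schema.
result = {
    "intent_resolution": 2.0,            # 1-5 Likert score
    "intent_resolution_result": "fail",  # verdict against the threshold
    "intent_resolution_reason": (
        "The agent answered a related but different question."
    ),
}

# The reason field is the first thing to read when a run fails
if result["intent_resolution_result"] == "fail":
    print(f"Score {result['intent_resolution']}: "
          f"{result['intent_resolution_reason']}")
```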
Note: For complex evaluation tasks that need refined reasoning, consider using a reasoning model like o3-mini as the judge. You can enable this by passing is_reasoning_model=True when initialising the evaluator. The docs cover the full model support matrix.
For batch evaluation across multiple agent runs (which is what you want for CI/CD), use the evaluate() API:
from azure.ai.evaluation import evaluate
# Prepare evaluation data from multiple threads
# (thread_ids collected from the test conversations you want to score)
converter.prepare_evaluation_data(
    thread_ids=thread_ids,
    filename="evaluation_data.jsonl",
)

# Run batch evaluation
response = evaluate(
    data="evaluation_data.jsonl",
    evaluation_name="pre-deployment-check",
    evaluators=evaluators,
    azure_ai_project=os.environ["AZURE_AI_PROJECT"],
)
print(f"Average scores: {response['metrics']}")
print(f"View results: {response.get('studio_url')}")
The studio_url in the response takes you straight to the Foundry portal where you can compare runs, drill into individual failures, and track regression over time. It’s genuinely useful.
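One pattern I like on top of this: diff the averages against your last known-good run, so a PR can’t silently degrade quality even when it stays above the absolute threshold. A minimal sketch, with illustrative metric names:

```python
# Hypothetical regression gate: flag any metric that dropped more than
# `tolerance` versus a stored baseline. Metric keys are illustrative.
def regressed(current: dict, baseline: dict, tolerance: float = 0.5) -> list:
    return [metric for metric, previous in baseline.items()
            if current.get(metric, 0.0) < previous - tolerance]

baseline = {"intent_resolution": 4.2, "task_adherence": 4.0}
current = {"intent_resolution": 4.1, "task_adherence": 3.2}
print(regressed(current, baseline))  # ['task_adherence']
```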
Layer 2: Automated Red Teaming with PyRIT
Evaluation tells you how your agent performs on expected inputs. Red teaming tells you how it performs when someone is actively trying to break it. These are very different things, and you need both.
The AI Red Teaming Agent integrates Microsoft’s open-source PyRIT (Python Risk Identification Tool) framework directly into Foundry. It automatically probes your agent with adversarial inputs, evaluates whether the attacks succeeded, and produces a scorecard with Attack Success Rate (ASR) metrics.
The risk categories it covers include violence, hate/unfairness, sexual content, self-harm, protected materials, code vulnerabilities, and ungrounded attributes. For agents specifically, it also tests for prohibited actions, sensitive data leakage, and task adherence under adversarial pressure.
Here’s a basic scan against a model endpoint:
import os

from azure.ai.evaluation.red_team import RedTeam, RiskCategory, AttackStrategy
from azure.identity import DefaultAzureCredential

# Install with: pip install "azure-ai-evaluation[redteam]"
red_team_agent = RedTeam(
    azure_ai_project=os.environ["AZURE_AI_PROJECT"],
    credential=DefaultAzureCredential(),
    risk_categories=[
        RiskCategory.Violence,
        RiskCategory.HateUnfairness,
        RiskCategory.SelfHarm,
        RiskCategory.Sexual,
    ],
    num_objectives=10,  # 10 attack prompts per category
)

# Scan your model or application (scan() is async, so call it
# from inside an async function)
red_team_result = await red_team_agent.scan(
    target=azure_openai_config,  # your model/app target configuration
    scan_name="Pre-deployment safety scan",
    attack_strategies=[
        AttackStrategy.EASY,       # Base64, Flip, Morse encoding
        AttackStrategy.MODERATE,   # Tense conversion
        AttackStrategy.DIFFICULT,  # Composed multi-step attacks
    ],
    output_path="red_team_results.json",
)
What I love about this is the layered attack complexity. Easy attacks are simple encoding tricks (Base64, character flipping, Morse code). Moderate attacks use another LLM to rephrase the adversarial prompt. Difficult attacks compose multiple strategies together. You can also compose your own custom strategies:
# Compose a custom multi-step attack: Base64 encode, then apply ROT13
custom_attack = AttackStrategy.Compose([
    AttackStrategy.Base64,
    AttackStrategy.ROT13,
])
The output is a JSON scorecard breaking down ASR by risk category and attack complexity, which you can feed directly into a CI/CD gate. If your overall ASR exceeds your threshold, the pipeline fails. Simple as that.
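For the gate itself, I compute an overall ASR from the scorecard and compare it to a budget. The JSON shape below is an assumption for illustration — inspect your actual red_team_results.json for the real structure:

```python
MAX_ASR = 0.05  # fail the build if more than 5% of attacks succeed

# The scorecard layout here is assumed — check red_team_results.json
# for the real field names before wiring this into a pipeline.
def attack_success_rate(scorecard: dict) -> float:
    total = sum(c["attacks"] for c in scorecard["categories"])
    successes = sum(c["successes"] for c in scorecard["categories"])
    return successes / total if total else 0.0

scorecard = {"categories": [
    {"name": "violence", "attacks": 10, "successes": 0},
    {"name": "self_harm", "attacks": 10, "successes": 2},
]}
asr = attack_success_rate(scorecard)
print(f"ASR {asr:.0%} — gate {'FAILED' if asr > MAX_ASR else 'passed'}")
```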
Be warned: The AI Red Teaming Agent is currently in public preview and only works in East US 2, Sweden Central, France Central, and Switzerland West regions. Also, PyRIT requires Python 3.10 or above, so check your CI runner images.
For the truly adventurous, you can also bring your own custom attack seed prompts tailored to your specific use case. The Microsoft AI Red Teaming Playground Labs on GitHub are a great starting point for learning how to think like an adversary.
Layer 3: Runtime Monitoring with Defender for Foundry
Evaluation and red teaming happen before deployment. But what about after? This is where Microsoft Defender for Cloud’s AI threat protection comes in.
Defender now provides runtime threat detection for Foundry agents, covering threats aligned with OWASP guidance for LLM and agentic AI systems:
- Tool misuse: agents coerced into abusing APIs or backend systems
- Privilege compromise: permission misconfigurations or role exploitation
- Resource overload: attacks exhausting compute or service capacity
- Intent breaking: adversaries redirecting agent objectives
- Identity spoofing: false identity execution of actions
- Human manipulation: attackers exploiting trust in agent responses
Enabling it is a single click on your Azure subscription, and Defender starts detecting threats against your existing Foundry agents within minutes. The best part? Threat protection for Foundry Agent Service is currently free of charge and doesn’t consume tokens. You genuinely have no excuse not to turn it on.
Detections surface in the Defender for Cloud portal and integrate with Defender XDR and Microsoft Sentinel, so your SOC team can correlate AI-specific threats with broader security signals.
Note: Defender’s AI threat protection for Foundry agents is currently in public preview (as of February 2026). It also includes security posture recommendations that identify misconfigurations, excessive permissions, and insecure instructions in your agents.
Putting It All Together: The CI/CD Pattern
Here’s the pattern I recommend to anyone building agents on Foundry:
- Develop: Build your agent, write your evaluation test set (or generate one with the SDK)
- Evaluate: Run agentic evaluators on every PR. Gate merges on passing scores for intent resolution, tool call accuracy, and task adherence
- Red-team: Run the AI Red Teaming Agent on the candidate build. Gate deployment on ASR thresholds
- Deploy: Push to production with confidence
- Monitor: Defender for Foundry watches for runtime threats. Alerts feed into your incident response workflow
This mirrors what we already do for application security (SAST, DAST, runtime WAF), just adapted for the unique risks of agentic AI. The Cloud Adoption Framework’s guidance on building agents recommends exactly this “shift left” approach.
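Wired together, the two pre-deployment gates (steps 2 and 3) boil down to something like this — a sketch with illustrative thresholds, which you’d tune for your own risk appetite:

```python
# Sketch of the release gate: evaluation scores first, then red-team ASR.
# The floor and ASR budget are illustrative defaults, not recommendations.
def release_gate(eval_scores: dict, asr: float,
                 score_floor: float = 3.0, max_asr: float = 0.05) -> bool:
    if any(score < score_floor for score in eval_scores.values()):
        return False  # evaluation gate (step 2) failed
    if asr > max_asr:
        return False  # red-team gate (step 3) failed
    return True       # clear to deploy (step 4)

print(release_gate({"intent": 4.1, "task": 3.8, "tools": 4.5}, asr=0.02))
# True
```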
The evaluation SDK costs nothing beyond the underlying Azure OpenAI model usage for the judge. Safety evaluations run at $0.02 per 1K input tokens, and red teaming is billed as safety evaluation consumption. And Defender is currently free for Foundry agents. For what you get, the cost is trivial.
Wrapping Up
If you take one thing from this post, let it be this: agent evaluation is not optional. The tools exist, they’re accessible, and they integrate into the workflows you already know. Evaluate your agents like you test your code. Red-team them like you pen-test your APIs. Monitor them like you monitor your infrastructure.
Until next time, stay cloudy!