Computer Use Comes to Azure: Browser Automation, Desktop Agents, and the RPA Replacement

Recently I was exploring a third-party logistics portal for a demo, and let me tell you, it was built in a different era. No API. No webhooks. No export function. Just a web form from 2009 that required twelve clicks and three page loads to extract a single shipment status. The team using it had been copying and pasting data manually for years, and their “automation” was a Power Automate Desktop flow that broke every time the vendor tweaked a CSS class.

Then I discovered that Azure now lets an AI model look at a screenshot, figure out where to click, and actually do it. Not in some theoretical research paper, but as a production-ready tool you can wire into your agents today.

Over the course of 2025, Microsoft shipped three distinct layers of computer use capability, each targeting a different level of complexity. This post will walk through all three, cover the (very real) security considerations, and give you a practical decision framework for when to reach for computer use versus sticking with APIs or traditional RPA. Let’s dive in!

The Three Layers of Computer Use

Before we get into the weeds, here’s the high-level picture. Microsoft has built a stack, not a single tool:

  1. The computer-use-preview model (March 2025): A vision model accessed via the Responses API that interprets screenshots and proposes UI actions. You bring your own execution environment.
  2. Browser Automation tool (August 2025): A managed tool in Foundry Agent Service that runs Playwright in isolated, cloud-hosted sessions. The agent handles web tasks end-to-end; you don’t manage a browser.
  3. Computer Use tool (September 2025): A full desktop interaction tool in Foundry Agent Service that works with any Windows GUI, not just browsers.

And as a bonus, Copilot Studio brought computer use to no-code builders, which is frankly terrifying and exciting in equal measure.

Let me walk through each one.

Layer 1: The computer-use-preview Model

First cab off the rank is the computer-use-preview model, which landed alongside the Responses API back in March 2025. This is the foundation that everything else builds on.

The concept is pretty straightforward: you send the model a screenshot, it analyses the pixels, and it tells you what action to take next (click here, type this, scroll down). Your code then executes that action, captures a new screenshot, and sends it back. Rinse and repeat until the task is done.

Here’s the basic flow in C#:

using OpenAI;
using OpenAI.Responses;
using System.ClientModel.Primitives;
using Azure.Identity;

#pragma warning disable OPENAI001

BearerTokenPolicy tokenPolicy = new(
    new DefaultAzureCredential(),
    "https://ai.azure.com/.default");

OpenAIResponseClient client = new(
    model: "computer-use-preview",
    authenticationPolicy: tokenPolicy,
    options: new OpenAIClientOptions()
    {
        Endpoint = new Uri("https://YOUR-RESOURCE-NAME.openai.azure.com/openai/v1")
    });

ResponseCreationOptions options = new();
// The tool needs to know the environment and screen dimensions up front
options.Tools.Add(ResponseTool.CreateComputerTool(
    ComputerToolEnvironment.Browser,
    displayWidth: 1600,
    displayHeight: 900));

OpenAIResponse response = await client.CreateResponseAsync([
    ResponseItem.CreateUserMessageItem([
        ResponseContentPart.CreateInputTextPart(
            "Navigate to bing.com and search for 'Azure AI Foundry'.")
    ])
], options);

// The response contains a computer_call with actions to execute
foreach (ResponseItem item in response.OutputItems)
{
    if (item is ComputerCallResponseItem computerCall)
    {
        Console.WriteLine($"Actions to perform: {computerCall.Action}");
    }
}

The model returns actions like click (with x/y coordinates), type (with text), scroll, keypress, and screenshot. Your application code is responsible for actually executing these actions, whether that’s through Playwright, a desktop automation library, or whatever framework you prefer.
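To make the execution half of the loop concrete, here's a minimal sketch of an action dispatcher. The action shapes and handler names are illustrative assumptions, not the SDK's exact types; a recording handler stands in for Playwright or a desktop automation library so you can test the loop before wiring up a real UI.

```csharp
using System;
using System.Collections.Generic;

// Illustrative only: the real SDK returns strongly typed action objects.
var log = new List<string>();

// Map each action kind to an executor callback. In production these
// would call into Playwright (page.Mouse.ClickAsync, Keyboard.TypeAsync,
// and so on); here they just record what would have happened.
var executors = new Dictionary<string, Action<int, int, string?>>
{
    ["click"]  = (x, y, _)    => log.Add($"click({x},{y})"),
    ["type"]   = (_, _, text) => log.Add($"type({text})"),
    ["scroll"] = (_, dy, _)   => log.Add($"scroll({dy})"),
};

void Dispatch(string kind, int x = 0, int y = 0, string? text = null)
{
    if (!executors.TryGetValue(kind, out var run))
        throw new NotSupportedException($"Unhandled action: {kind}");
    run(x, y, text);
}

// A short simulated sequence of the kind the model might propose:
Dispatch("click", 200, 120);
Dispatch("type", text: "Azure AI Foundry");
Dispatch("click", 640, 120);

Console.WriteLine(string.Join(" -> ", log));
```

Swapping the recorded callbacks for real Playwright (or Windows automation) calls is the only change needed to go from dry run to live execution, which also makes the loop easy to unit test.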

Note: The computer-use-preview model requires limited access approval. It’s currently available in East US 2, Sweden Central, and South India. Not Australia East yet, unfortunately, so factor in the latency if you’re down under like me.

The important thing to understand is that this model operates on raw pixels. It doesn’t parse HTML or read the DOM. It literally looks at the screenshot the way a human would and figures out where to click. This makes it incredibly flexible (it works on any UI) but also means it needs a decent resolution screenshot to work accurately. Microsoft recommends 1440×900 or 1600×900 for optimal click accuracy.

Layer 2: Browser Automation (The Managed Option)

If Layer 1 is “here’s a vision model, go build your own loop,” then the Browser Automation tool is “let us handle the browser for you.” It shipped in August 2025 as part of Foundry Agent Service, and it takes a fundamentally different approach.

Rather than working with raw screenshots, Browser Automation parses the actual page structure (the DOM and accessibility tree) and uses that to reason about web elements. It runs inside Microsoft Playwright Workspaces: isolated, cloud-hosted browser sessions in your Azure subscription.

Here’s what the code looks like:

using Azure.Identity;
using Azure.AI.Projects;

#pragma warning disable OPENAI001

AIProjectClient projectClient = new(
    endpoint: new Uri("https://your-account.services.ai.azure.com/api/projects/your-project"),
    credential: new DefaultAzureCredential());

var browserTool = new BrowserAutomationPreviewToolDefinition(
    new BrowserAutomationToolParameters(
        new BrowserAutomationToolConnectionParameters(
            "your-browser-automation-connection-id")));

var agentDefinition = new PromptAgentDefinition("gpt-4.1-mini")
{
    Instructions = "You are a web research assistant. Use the browser to find information.",
    Tools = { browserTool }
};

AgentVersion agent = await projectClient.Agents.CreateAgentVersionAsync(
    "WebResearchAgent", options: new(agentDefinition));

var openAIClient = projectClient.GetOpenAIClient();
var responseClient = openAIClient.GetOpenAIResponseClient("WebResearchAgent");

ResponseCreationOptions options = new()
{
    ToolChoice = ResponseToolChoice.CreateRequiredChoice()
};

OpenAIResponse response = await responseClient.CreateResponseAsync(
    "Go to finance.yahoo.com and find today's MSFT stock price.",
    options);

Console.WriteLine(response.GetOutputText());

A few things I love about this approach. First, it works with any GPT model, not just the computer-use-preview model. That’s a big deal for cost and availability. Second, you don’t manage the screenshot loop yourself; the tool handles the full cycle of navigate, parse, act, and repeat internally. Third, each session is sandboxed in its own Playwright workspace, so there’s genuine isolation between tasks.

Be warned: Setting this up requires a few moving parts. You need a Playwright Workspace resource, an access token, and a project connection configured in the Foundry portal. The setup guide walks through it, but budget 20 minutes for the initial configuration.

Layer 3: Computer Use Tool (Full Desktop)

The Computer Use tool, which arrived in September 2025, extends beyond the browser to any desktop application. Think legacy Windows apps, thick clients, ERP systems with desktop UIs, anything with a graphical interface.

Like the raw model from Layer 1, it works by interpreting screenshots and proposing keyboard and mouse actions. But it’s packaged as a proper Agent Service tool with SDK support across Python, .NET, TypeScript, and Java.

The key difference from Browser Automation is the trade-off between flexibility and complexity:

Feature               Browser Automation               Computer Use tool
--------------------  -------------------------------  ---------------------------------
Model support         Any GPT model                    computer-use-preview only
Screen understanding  DOM/HTML parsing                 Raw pixel screenshots
Interfaces            Browser only                     Any desktop or browser UI
Session management    Managed Playwright Workspaces    You provide the environment
Setup complexity      Medium (Playwright connection)   Higher (sandboxed VM recommended)

Choose Browser Automation when your task is web-only and you want the simplest setup. Choose Computer Use when you need to interact with desktop applications or when the web app is so JavaScript-heavy that DOM parsing struggles with it.

Copilot Studio: Computer Use for Everyone

And then there’s the Copilot Studio angle. Microsoft brought computer use to no-code builders, meaning your business analysts and citizen developers can build agents that click through UIs using natural language instructions. No code required.

What makes the Copilot Studio implementation interesting is the enterprise wrappers they’ve built around it: built-in credentials management, session replay in audit logs, and Cloud PC pooling for running automations at scale. It also supports multiple foundation models, including both OpenAI’s computer-use-preview and Anthropic’s Claude Sonnet 4.5 for tasks that need nuanced UI reasoning.

In my opinion, this is where the real RPA disruption happens. The traditional RPA pitch was “automate repetitive UI tasks without changing your systems.” Computer use agents do the same thing, but they adapt when the UI changes instead of breaking. That 2009 logistics portal I mentioned at the start? A computer use agent would just figure out the new layout and keep going. A traditional RPA bot would crash and page someone at 3am.

Security: The Bit You Can’t Skip

I need to be direct here: computer use carries real security risks, and Microsoft is refreshingly upfront about it. The official documentation includes warnings in bold red boxes, and for good reason.

The core risks are:

  • Prompt injection via screenshots: A malicious website could display text that tricks the model into performing unintended actions. The API includes safety checks for this (malicious instruction detection, irrelevant domain detection, sensitive domain detection), but they’re not bulletproof.
  • Credential exposure: If the agent can see the screen, it can see passwords, tokens, and sensitive data. Never run computer use on machines with access to credentials or sensitive systems.
  • Unintended actions: The model might misinterpret a UI element and click the wrong button. In a banking application, that’s not a “whoops, try again” situation.

Microsoft’s recommended safeguards:

  1. Always use sandboxed environments: Low-privilege VMs or Playwright Workspaces with no access to sensitive data or production systems.
  2. Human-in-the-loop for sensitive actions: When safety checks fire (pending_safety_checks in the API response), require explicit human acknowledgement before proceeding.
  3. Audit everything: Use Foundry tracing to log every action the agent takes. Copilot Studio adds session replay for visual audit trails.
  4. Principle of least privilege: Give the agent’s environment only the access it needs for the specific task.
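The human-in-the-loop gate in step 2 can be as simple as refusing to proceed while any safety check is unacknowledged. The check names below follow the categories above (malicious instruction, irrelevant domain, sensitive domain detection), but the exact wire format is an assumption on my part, so verify against the current API reference.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Sketch of a fail-closed gate: ANY pending safety check pauses the
// agent, including check codes we've never seen before.
bool RequiresHumanAck(IEnumerable<string> pendingChecks) =>
    pendingChecks.Any();

// Example pending_safety_checks payload (codes assumed, not verified):
var pending = new List<string> { "sensitive_domain" };

if (RequiresHumanAck(pending))
{
    // Surface the checks to an operator; only after explicit sign-off
    // should the loop echo them back as acknowledged and continue.
    Console.WriteLine($"Paused for approval: {string.Join(", ", pending)}");
}
```

Failing closed matters here: a new check code you haven't mapped should stop the agent, not sail through.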

When to Use What: A Decision Framework

This is the question I get asked most often. Here’s how I think about it:

Use an API integration when:

  • The target system has a well-documented API
  • You need reliability, speed, and structured data
  • The integration is long-lived and worth the development investment

Use traditional RPA (Power Automate Desktop, UiPath, etc.) when:

  • The UI is stable and rarely changes
  • The workflow is rule-based with no decision-making required
  • You need robust error handling and retry logic built in
  • Your organisation already has RPA infrastructure and expertise

Use Browser Automation when:

  • The target is a web application with no API
  • The UI changes occasionally (DOM parsing is more resilient than pixel-based)
  • You want managed infrastructure and session isolation
  • Web-only scope is sufficient

Use Computer Use when:

  • You need to interact with desktop applications
  • The target UI is highly dynamic or JavaScript-heavy
  • You need visual verification of what the agent is doing
  • No other integration method is available

Use Copilot Studio Computer Use when:

  • The automation needs to be built and maintained by non-developers
  • Enterprise governance (audit trails, credentials management) is a hard requirement
  • You want Cloud PC pooling for running automations at scale

The honest truth is that API integration should always be your first choice when it’s available. It’s faster, cheaper, more reliable, and easier to secure. Computer use is the “last resort” tool for when there genuinely is no API, but it’s an incredibly powerful last resort.
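That priority order is simple enough to encode. This helper is just the post's own heuristic in code form (the parameter names are mine, not an official Microsoft guideline), but it makes the fallback chain explicit:

```csharp
using System;

// Encodes the decision framework above: prefer APIs, fall back through
// traditional RPA and Browser Automation, and reach for Computer Use last.
string ChooseIntegration(bool hasApi, bool uiIsStableAndRuleBased, bool webOnly)
{
    if (hasApi) return "API integration";
    if (uiIsStableAndRuleBased) return "Traditional RPA";
    if (webOnly) return "Browser Automation";
    return "Computer Use";
}

// No API, dynamic UI, but web-only scope:
Console.WriteLine(ChooseIntegration(
    hasApi: false, uiIsStableAndRuleBased: false, webOnly: true));
// prints "Browser Automation"
```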

Wrapping Up

2025 has been the year that AI stopped being limited to text boxes and API calls. The combination of the computer-use-preview model, Browser Automation, and the Computer Use tool gives Azure a complete stack for UI automation, from managed browser sessions all the way to full desktop control.

Is it a complete replacement for RPA today? Not yet. But the trajectory is clear: agents that can adapt to UI changes, understand context, and make decisions will eventually outpace rigid, rule-based bots. For that 2009 logistics portal, computer use is already the better answer.

As always, feel free to reach out with any questions or comments!

Until next time, stay cloudy!
