RPABOTS.WORLD

Bedrock Agents Classic Sunset: Migration Guide to AgentCore (2026)

Satish Prasad — Sun, 26 Jul 2026 16:27:57 +0000

On June 30, 2026, AWS quietly renamed Amazon Bedrock Agents to “Amazon Bedrock Agents Classic” and placed it in maintenance mode. Starting July 30, 2026, no new customers can onboard. The model catalog is frozen on that date. No new features will ship. The service that launched in November 2023 as AWS’s flagship AI agent product is being retired after just two years and eight months — part of a broader consolidation that also puts Amazon Kendra and Amazon Q Business into maintenance mode on the same timeline.

If you’re running production agents on Bedrock Agents today, nothing breaks on July 30. Your existing agents, APIs, and Infrastructure-as-Code templates continue to work. But the window is closing: you’re now on a platform that won’t receive new models, new features, or new integrations. The replacement, Amazon Bedrock AgentCore, is a fundamentally different architecture — not an upgrade, but a separate product. This guide walks you through exactly what changed, which migration path fits your workload, and how to execute the move before you’re stuck on a frozen platform.

What AWS Actually Retired — and Why

Bedrock Agents Classic isn’t the only casualty. AWS announced maintenance mode for three first-generation AI services simultaneously:

Retired Service	Launched	Maintenance Mode	New Customer Cutoff	Successor
Amazon Bedrock Agents (now “Classic”)	November 2023	June 30, 2026	July 30, 2026	Amazon Bedrock AgentCore
Amazon Kendra	2020	June 30, 2026	July 30, 2026	Amazon Bedrock Knowledge Bases
Amazon Q Business	April 2024	TBD	July 31, 2026	Amazon Quick Suite (QuickSight)

The pattern is clear: AWS is collapsing point solutions into a smaller set of composable, agent-native platforms. As Forbes reported on July 24, 2026, this retirement signals that AWS’s first-generation AI services were too tightly coupled and too opinionated for the way enterprises actually build agents in production.

The strategic logic: Bedrock Agents Classic was a fully managed, opinionated service. You configured instructions, a model, action groups, knowledge bases, and guardrails — AWS owned the orchestration loop. That worked for bounded, single-purpose agents. But production teams building multi-agent systems, custom orchestration, or framework-specific pipelines (LangGraph, CrewAI, Strands) hit the walls fast. AgentCore gives them the infrastructure without the opinions.

Bedrock Agents Classic vs. AgentCore: Architecture Comparison

These are not two versions of the same product. They’re architecturally different, and understanding the differences is the prerequisite for a clean migration.

Dimension	Bedrock Agents Classic	AgentCore
Architecture Model	Fully managed, opinionated service	Framework-agnostic composable infrastructure
Orchestration	AWS-owned agent loop (you configure, AWS orchestrates)	You own the loop (or use managed harness)
Framework Support	AWS-native only	Strands, LangGraph, LangChain, CrewAI, AutoGen, OpenAI Agents SDK, Claude Agent SDK, custom code
Model Support	Bedrock catalog (frozen July 30, 2026)	Any model provider: Bedrock, Anthropic, OpenAI, Google Gemini
Runtime	Managed, abstracted	Serverless microVMs with filesystem and shell access ($0.0895/vCPU-hour)
Components	Monolithic: agents, action groups, KBs, guardrails bundled	12 independent services: Runtime, Gateway, Memory, Policy, Identity, Observability, Evaluations, Browser, Code Interpreter, Payments, Search, plus model costs
Tool Protocol	Action groups (Lambda/API schema)	MCP, A2A, API Gateway, plus legacy action group support
Multi-Agent	Limited (single-agent focus)	Native multi-agent orchestration via Strands swarm/graph primitives
Session Duration	Minutes (synchronous invocations)	Up to 8 hours, asynchronous tool execution
Pricing	Per-invocation	Consumption-based per component (Runtime, Gateway, Memory, etc.)
Payments	Not available	Built-in (Stripe + Coinbase, preview)
Browser Automation	Not available	AgentCore Browser with OS-level interaction

The mental model shift: Bedrock Agents Classic was a product — a turnkey service you configured. AgentCore is infrastructure — a set of primitives you compose. If you’ve used Bedrock Agents Classic like a managed database (hand it a schema, it runs queries), AgentCore is more like running your own database engine on managed compute (you control the query planner, AWS manages the servers).

What Happens to Your Existing Agents on July 30

The immediate impact is carefully scoped. Here’s exactly what changes and what doesn’t:

What Continues Working

All existing APIs remain available to current customers: UpdateAgent, GetAgent, ListAgents, DeleteAgent, PrepareAgent, InvokeAgent, action group APIs, knowledge base APIs, and alias APIs. Your existing Infrastructure-as-Code templates (CloudFormation, Terraform, CDK) that create or manage Bedrock Agents continue to work for allowlisted accounts. Bug fixes and security patches will still be applied.

What Stops

CreateAgent and InvokeInlineAgent are restricted for accounts without prior usage — you can’t create new agents on Agents Classic from a fresh account after July 30. The model catalog is frozen: any new model released to Bedrock after July 30, 2026 will only be available through AgentCore. No new features will be developed. Feature requests are no longer considered.

The Real Risk

The danger isn’t an immediate outage — it’s platform drift. Every month you stay on Agents Classic, you fall further behind: no access to new models (including future Claude, GPT, and Gemini releases on Bedrock), no new tool integrations, no performance improvements, and growing distance from AWS’s primary documentation and support focus. As the Forbes analysis noted, this is a pattern AWS has executed before: maintenance mode is a slow sunset, not a hard stop.

The Two Migration Paths

AgentCore offers two distinct migration strategies. Your choice depends on how much control you need over the agent orchestration loop.

Path 1: AgentCore Managed Harness (Config-Based)

The managed harness is the closest analog to Bedrock Agents Classic. You define an agent by specifying a model, system prompt, and tools. AgentCore handles the full agent loop: reasoning, tool selection, action execution, and response streaming. Each session runs in its own isolated microVM with filesystem and shell access.

Best for: Teams migrating simple, single-purpose agents that don’t require custom orchestration logic. If your Bedrock Agents Classic setup was standard — model + instructions + action groups + knowledge bases — the harness path minimizes code changes.

What you get: Managed compute, memory, identity controls, observability, and tool integration — all configurable through three API calls. Supports any model provider (Bedrock, Anthropic, OpenAI, Gemini).

Limitation: You’re still delegating the orchestration loop. If you need custom routing, multi-agent coordination, or framework-specific patterns, you’ll hit the same walls you hit on Agents Classic — just in a newer product.

Path 2: Code-Defined Agents on AgentCore Runtime

Deploy your own agent code — built with Strands, LangGraph, CrewAI, OpenAI Agents SDK, Claude Agent SDK, or fully custom logic — on AgentCore’s managed infrastructure. You own the orchestration loop; AWS provides the Runtime (serverless microVMs), Gateway (MCP/A2A tool connections), Memory, Policy enforcement, Identity, and Observability.

Best for: Teams already using or planning to adopt an open-source agent framework. If you’re building multi-agent systems, need custom tool-calling patterns, or want framework portability, this is the path.

What you get: Full control over agent logic with production-grade infrastructure underneath. Sessions can run up to 8 hours with asynchronous tool execution. Native MCP and A2A protocol support means your agents can interoperate with tools and other agents across the ecosystem.

The Strands connection: Strands Agents is AWS’s open-source SDK (Python and TypeScript) for building agents. Version 1.0 shipped May 21, 2026 with multi-agent orchestration, A2A protocol support, and a remote session manager. It’s model-agnostic but has native AgentCore integration. If you don’t already have a framework preference, Strands is the path of least resistance for AWS-native teams.

Component-by-Component Migration Map

Here’s how each Bedrock Agents Classic concept maps to AgentCore:

Bedrock Agents Classic	AgentCore Equivalent	Migration Notes
Agent Instructions (system prompt)	System prompt parameter in Harness config, or prompt in your agent code	Direct port — copy your prompt text unchanged
Model Selection	Model parameter (any provider supported)	Can now use non-Bedrock models (OpenAI, Gemini, Anthropic direct)
Action Groups (Lambda functions)	AgentCore Gateway + MCP tools	Wrap Lambda functions as MCP servers, or use Gateway API connections. Existing Lambda code can be reused
Knowledge Bases	Bedrock Knowledge Bases (unchanged) + AgentCore Search	Knowledge Bases are NOT retired — they’re a separate service that AgentCore agents consume directly
Guardrails	Bedrock Guardrails (unchanged) + AgentCore Policy	Guardrails remain available. AgentCore Policy adds agent-specific action controls on top
Agent Aliases	AgentCore Gateway endpoints	Reconfigure client-side invocation URLs
Session State	AgentCore Memory	Managed memory service with persistence and personalization
Agent Invocation (InvokeAgent API)	AgentCore Runtime API	New API surface — requires client code changes
CloudWatch Metrics	AgentCore Observability (CloudWatch-based)	New metric namespace (AWS/Bedrock-AgentCore), more granular traces

Critical note on Knowledge Bases: Your existing Amazon Bedrock Knowledge Bases and Guardrails are not affected by the Agents Classic retirement. They’re separate services that continue to evolve independently. AgentCore agents consume them the same way Classic agents did. This is the one piece of the migration that doesn’t require rearchitecting.

Step-by-Step Migration Checklist

Use this checklist whether you’re taking the harness path or the code-defined path. Steps marked with [H] are harness-specific; [C] are code-defined-specific.

Phase 1: Inventory and Assessment (Week 1)

1. Audit your existing agents. Run ListAgents across all regions. For each agent, document: the model ID, system prompt, action groups (with Lambda ARNs), knowledge base associations, guardrail configurations, and alias mappings. Export these configurations — they’re your migration blueprint.

2. Classify each agent. Categorize by complexity:

Simple (single model + prompt + 1-2 action groups): Harness path candidate
Moderate (multiple action groups + knowledge bases + guardrails): Harness path, but verify all features are supported
Complex (custom prompt overrides, multi-turn orchestration, chained agents): Code-defined path with Strands or your framework of choice

3. Check your account’s AgentCore access. Verify you can create resources in AgentCore. Run a simple test deployment to confirm IAM permissions and service quotas.

Phase 2: Build the AgentCore Equivalent (Weeks 2-3)

4. [H] Use the AgentCore CLI import. The AgentCore CLI can import existing Bedrock Agents Classic configurations as a starting point. Run the import, then review and adjust the generated harness configuration. This handles model, prompt, and basic tool mappings automatically.

5. [C] Set up your agent framework. If taking the code-defined path, initialize your Strands/LangGraph/CrewAI project. Configure AgentCore Runtime as your deployment target. Map each Classic action group to an MCP tool definition or Gateway endpoint.

6. Migrate action groups to MCP tools. For each Lambda-based action group:

Keep the Lambda function — the business logic doesn’t change
Create an MCP server wrapper or register the Lambda as a Gateway tool
Test tool invocation independently before connecting to the agent

7. Connect Knowledge Bases. Your existing Bedrock Knowledge Bases work directly with AgentCore agents. Update the agent configuration to reference them via the AgentCore API surface rather than the Classic agent association.

8. Configure Policy and Guardrails. Port your existing Guardrails configuration (it’s the same service). Then add AgentCore Policy rules for any agent-specific action controls you need — spending limits, tool access restrictions, or approval workflows.

Phase 3: Test and Validate (Week 3-4)

9. Run parallel invocations. Send identical prompts to both your Classic agent and the new AgentCore agent. Compare responses for accuracy, tool-calling behavior, and latency. Document any behavioral differences.

10. Load test the AgentCore agent. AgentCore’s default quota supports up to 5,000 active concurrent sessions in US East (N. Virginia) and US West (Oregon), and 2,500 in other regions, with 200 agent interactions per second. Verify your workload fits within these limits or request increases.

11. Validate observability. Confirm that traces, metrics, and logs appear correctly in the new AWS/Bedrock-AgentCore CloudWatch namespace. Update any dashboards, alarms, or monitoring integrations that reference the old metric namespace.

Phase 4: Cut Over (Week 4-5)

12. Update client integrations. Switch your application code from the Bedrock Agents InvokeAgent API to the AgentCore Runtime API. This is a breaking API change — the request/response format differs. Update SDKs and test end-to-end.

13. Update IaC templates. Replace Bedrock Agents CloudFormation/Terraform resources with AgentCore equivalents. The resource types and property schemas are different.

14. Decommission Classic agents. Once the AgentCore agents are handling production traffic, delete the Classic agents to avoid confusion and stop any residual costs. Keep your audit documentation for compliance records.

AgentCore’s New Capabilities You Didn’t Have Before

Migration isn’t just about preserving existing functionality. AgentCore introduces capabilities that Bedrock Agents Classic never offered, and several of them can materially improve your agent deployments.

AgentCore Browser

Agents can now automate browser workflows with OS-level interaction capabilities beyond Chrome DevTools Protocol. This opens up web scraping, form filling, and UI testing workflows that previously required separate browser automation infrastructure. As of April 2026, the Browser tool supports both CDP-based and OS-level actions within the same session.

AgentCore Payments (Preview)

Built in partnership with Coinbase and Stripe, AgentCore Payments is the first managed payment capability purpose-built for autonomous agents. Agents can autonomously access and pay for APIs, MCP servers, web content, and other agents. Stablecoin support enables sub-cent microtransactions with configurable spending guardrails. This is genuinely new territory — agent governance frameworks will need to evolve to account for agents that hold and spend budgets.

AgentCore Evaluations and Optimization

The evaluations service lets you run batch evaluations and A/B tests against production agent workloads. The recommendations capability analyzes production traces to generate optimized system prompts and tool descriptions automatically. This closed-loop optimization didn’t exist in Classic — you had to build it yourself.

Framework Portability

AgentCore is framework-agnostic. You can deploy agents built with LangGraph, CrewAI, Microsoft Agent Framework, Strands, or fully custom code. This eliminates vendor lock-in at the orchestration layer — if you decide to switch from LangGraph to Strands (or vice versa), the infrastructure doesn’t change.

Long-Running Sessions

Classic agents were limited to synchronous invocations lasting minutes. AgentCore supports sessions up to 8 hours with asynchronous tool execution. For complex research tasks, multi-step data processing, or workflows that require human-in-the-loop approvals, this is a significant upgrade.

Pricing: What Changes

Bedrock Agents Classic used per-invocation pricing. AgentCore uses consumption-based pricing across 12 independent components. Here are the key rates:

AgentCore Component	Pricing	Notes
Runtime	$0.0895/vCPU-hour + $0.00945/GB-hour	Based on active CPU use and peak memory
Gateway	Per MCP operation (ListTools, CallTool, Ping)	Plus per-search-query and per-tool-indexed for semantic search
Memory	Two tiers (short-term session, long-term persistent)	Consumption-based storage and retrieval
Policy	Per policy evaluation	Scales with agent action volume
Identity	Per authentication/authorization event	Supports Okta, Entra, Cognito
Observability	CloudWatch pricing	Not a separate AgentCore rate — uses standard CW costs
Browser	Per session-minute	For web automation workloads
Code Interpreter	Per execution	Sandboxed code execution within agent sessions
Evaluations	Per evaluation run	Batch evaluations and A/B tests

The net cost impact depends heavily on your workload profile. High-throughput, short-lived agents may see costs increase due to the per-component billing model. Long-running, complex agents that previously required multiple invocations and external orchestration infrastructure may see costs decrease. Run the AgentCore pricing calculator with your actual invocation volumes before committing to a migration timeline.

The Broader Context: Why This Matters Beyond AWS

AWS’s move isn’t happening in isolation. Every major cloud vendor is converging on the same pattern: decomposed, framework-agnostic agent infrastructure that separates orchestration from platform services.

Google launched Vertex AI Agent Builder with ADK (Agent Development Kit) support, recently releasing Gemini 3.6 Flash with built-in Computer Use for agent automation.
Microsoft now requires a dedicated Agent 365 license (effective July 1, 2026) for AI agent security capabilities in Copilot Studio and Microsoft Foundry — separating agent governance from general cloud security licensing.
Salesforce just landed a $1.6 billion VA Agentforce deal, with Agentforce ARR reaching $1.2 billion (205% YoY growth). Enterprise agentic platforms are becoming their own product category.
Open-source frameworks like LangGraph and Strands are becoming the agent-building layer, with cloud platforms providing the deployment and governance layer underneath.

The strategic implication: the “agent platform” is splitting into two tiers. The orchestration tier (how agents think, plan, and act) is moving to open-source frameworks. The infrastructure tier (how agents deploy, scale, pay, authenticate, and get monitored) is where cloud vendors compete. AgentCore is AWS placing its bet on the infrastructure tier. If you’re building agents on AWS, understanding this split — and designing your architecture accordingly — is more important than the migration mechanics.

Common Migration Mistakes to Avoid

Based on patterns from early adopters and AWS’s own guidance, here are the pitfalls that catch teams during this transition:

1. Treating it as a version upgrade. AgentCore is not Bedrock Agents v2. It’s a different product with different APIs, different resource types, and a different architectural model. Teams that approach migration as a configuration update will stall when they discover the API surfaces don’t align. Plan for a rebuild, not a reconfiguration.

2. Migrating everything at once. Start with your simplest agent. Get one agent running on AgentCore, validate the full lifecycle (deploy, invoke, monitor, update), then migrate the rest. The first agent teaches you the operational differences; the rest are pattern repetition.

3. Ignoring the pricing model shift. Per-invocation pricing and per-component consumption pricing optimize differently. An agent that was cheap on Classic (few invocations, each doing a lot internally) might be expensive on AgentCore (each tool call, memory write, and policy check is metered separately). Model your costs before migrating production workloads.

4. Not updating observability. Your existing CloudWatch dashboards, alarms, and runbooks reference Bedrock Agents metrics and log groups. AgentCore publishes to a different namespace (AWS/Bedrock-AgentCore) with different metric names. A migration that skips the monitoring update leaves your ops team blind during the most critical period.

5. Over-engineering the framework choice. If your agents are simple and you’re happy with AWS-native tooling, use the managed harness. Don’t adopt LangGraph or Strands just because AgentCore supports them. Framework migration and platform migration at the same time doubles the risk. Pick one battle per quarter.

What to Do This Week

With the July 30 cutoff five days away, here’s a prioritized action plan:

Immediate (before July 30):

Run ListAgents across all regions and export every agent configuration. This is your insurance policy.
Verify your AWS account has AgentCore access and test a basic deployment.
If you were planning to create new agents on Classic, do it now — after July 30, CreateAgent is blocked for accounts without prior usage.

Next 30 days:

Classify your agents by complexity and assign migration paths (harness vs. code-defined).
Migrate your simplest agent as a proof of concept.
Run the AgentCore pricing calculator with your actual production volumes.

Next 90 days:

Complete migration of all production agents.
Update IaC templates, monitoring, and runbooks.
Evaluate AgentCore-only features (Browser, Payments, Evaluations) for new use cases.

FAQs

Will my existing Bedrock Agents stop working on July 30, 2026?

No. Existing agents continue to function normally. All management and invocation APIs remain available to current customers. What stops is the ability to create new agents from accounts that haven’t used the service before, and the model catalog freezes — no new models will be added after that date.

Can I use LangGraph or CrewAI agents on AgentCore?

Yes. AgentCore is framework-agnostic. You can deploy agents built with LangGraph, CrewAI, Strands, OpenAI Agents SDK, Claude Agent SDK, AutoGen, or fully custom code on AgentCore’s managed Runtime. The infrastructure services (Memory, Gateway, Policy, Identity, Observability) work identically regardless of framework.

Are Bedrock Knowledge Bases and Guardrails also being retired?

No. Knowledge Bases and Guardrails are separate services that continue to evolve independently. AgentCore agents consume them directly. This is the one part of the migration that doesn’t require rearchitecting.

What’s the difference between the AgentCore managed harness and code-defined agents?

The managed harness is config-based: you declare model, tools, and instructions, and AgentCore handles the orchestration loop. Code-defined agents give you full control: you write the agent logic in your framework of choice and deploy it on AgentCore’s Runtime. The harness is faster to set up; code-defined gives you more flexibility. Both use the same underlying infrastructure services.

How does AgentCore pricing compare to Bedrock Agents Classic?

Classic used per-invocation pricing. AgentCore uses consumption-based pricing across 12 components (Runtime, Gateway, Memory, Policy, etc.). The net impact depends on your workload: simple agents with few tool calls may cost more due to per-component metering; complex agents that previously required external orchestration infrastructure may cost less. Use the AgentCore pricing calculator to model your specific workload before migrating.

Key Takeaways

July 30, 2026 is the hard cutoff for new customer access to Bedrock Agents Classic. Existing agents keep running, but the model catalog freezes and no new features ship.
AgentCore is not an upgrade — it’s a separate product with a different architecture. Plan for a rebuild, not a reconfiguration.
Two migration paths exist: the managed harness (config-based, closest to Classic) and code-defined agents (framework-agnostic, full control). Pick based on orchestration complexity.
Knowledge Bases and Guardrails are safe — they’re separate services that AgentCore agents consume directly.
New capabilities unlock on migration: 8-hour sessions, browser automation, autonomous payments, framework portability, and closed-loop optimization via evaluations.
The broader industry is converging on the same pattern: open-source orchestration + cloud-native infrastructure. This migration is a chance to align your architecture with that direction.
Start this week: export all agent configurations before July 30, verify AgentCore access, and migrate your simplest agent first as a proof of concept.

References

AWS, Amazon Bedrock Agents Classic Maintenance Mode, AWS Documentation, 2026.
Janakiram MSV, AWS Kills The AI Services It Launched Just Two Years Ago, Forbes, July 24, 2026.
Faisal Haque, AWS Just Retired Its Flagship AI Agent Product. It Launched in 2023, AWS in Plain English, July 2026.
AWS, Amazon Bedrock AgentCore, AWS Product Page, 2026.
AWS, Amazon Bedrock AgentCore Increases Default Runtime Quota Limits, AWS What’s New, July 2026.
AWS, Release Notes for Amazon Bedrock AgentCore, AWS Documentation, 2026.
AWS, Technical Deep Dive: AgentCore Payments and Innovation in Agentic Commerce, AWS ML Blog, 2026.
Cipher Projects, Amazon Bedrock Agents vs AgentCore + Strands (2026): Managed Agent or Production Runtime?, 2026.
Cloud Burn, Amazon Bedrock AgentCore Pricing: 12 Components Breakdown, 2026.
AWS, Amazon Bedrock AgentCore Browser Adds OS-Level Interaction Capabilities, April 2026.

What to read next on rpabotsworld.com:

Gemini 3.6 Flash for Agent Builders: Complete Guide

Satish Prasad — Wed, 22 Jul 2026 17:14:13 +0000

On July 21, 2026, Google released three new Gemini models in a single drop — and buried the most interesting story under the least interesting headline. While most coverage focused on the incremental version bump from 3.5 to 3.6 Flash, the real signal is what these models reveal about where Google thinks the agentic AI market is heading: cheaper per-task costs for long-running agents, built-in GUI automation via Computer Use, and a government-restricted cybersecurity model that outperforms Google’s own flagship on vulnerability detection. If you’re building AI agents in production, this release reshapes your cost model, your tool-calling architecture, and possibly your security posture. Here’s everything that matters.

What Google Actually Released on July 21, 2026

Google shipped three models simultaneously, each targeting a different slice of the agentic workload spectrum:

Model	Role	Input Cost (per 1M tokens)	Output Cost (per 1M tokens)	Speed	Access
Gemini 3.6 Flash	General-purpose workhorse	$1.50	$7.50	Standard Flash tier	Public — Gemini API, AI Studio, Gemini Enterprise
Gemini 3.5 Flash-Lite	High-throughput, cost-optimized	$0.30	$2.50	350 output tokens/sec	Public — Gemini API, AI Studio
Gemini 3.5 Flash Cyber	Vulnerability detection & remediation	TBD (pilot pricing)	TBD (pilot pricing)	Flash-tier	Restricted — governments & trusted partners via CodeMender

The naming is worth parsing. Gemini 3.6 Flash is a generation bump — a new model, not a point release. Flash-Lite and Flash Cyber are variants of the existing 3.5 Flash base, fine-tuned for specific workload profiles. This three-model strategy signals that Google is moving away from the “one model does everything” approach toward a tiered architecture where different models serve different stages of an agentic pipeline.

Gemini 3.6 Flash: The Workhorse Gets Leaner

The headline number is 17% — that’s the reduction in output token usage compared to 3.5 Flash, confirmed by the Artificial Analysis Index. In isolation, 17% sounds incremental. In the context of agentic workloads, where a single task can involve dozens of tool calls and reasoning steps, it’s the difference between an agent run costing $0.12 and $0.10 — multiplied across thousands of daily executions.

What the Benchmarks Actually Show

Google published benchmark results across coding, knowledge work, and computer use. Here’s what moved:

Benchmark	Gemini 3.5 Flash	Gemini 3.6 Flash	Change
DeepSWE (coding)	37%	49%	+32% relative improvement
SWE-Bench Pro	55.1%	58.7%	+6.5% relative
OSWorld-Verified (Computer Use)	78.4%	83.0%	+5.9% relative

The DeepSWE jump from 37% to 49% is the standout. DeepSWE tests multi-file code changes across real-world repositories — exactly the kind of task that agentic coding tools need to perform reliably. A 32% relative improvement in that capability means 3.6 Flash can handle significantly more complex code modifications in a single pass, reducing the need for retry loops that inflate both latency and cost.

Built-In Computer Use: No More Separate Integrations

The Computer Use improvement from 78.4% to 83.0% on OSWorld-Verified matters less for the percentage and more for how it’s delivered. With 3.6 Flash, Computer Use — the ability to operate a graphical user interface by reading screen contents, moving a cursor, clicking, and typing — is now a built-in client-side tool available directly through the Gemini API and Gemini Enterprise. Previously, integrating GUI automation required separate tooling or custom wrappers. Now it ships as a native capability alongside function calling, code execution, and search grounding.

For agent builders, this means you can construct a single agent that reasons, calls APIs, executes code, and drives a web browser or desktop application, all through the same model and API surface. The practical impact is clearest in enterprise automation scenarios where some systems have APIs and others only have web interfaces — a common reality in large organizations running legacy ERP or industry-specific software.

Context and Knowledge

Both 3.6 Flash and 3.5 Flash-Lite retain the one-million-token input context window and 64,000-token maximum output limit from 3.5 Flash. The knowledge cutoff advances from January 2025 to March 2026 — a meaningful update for agents that need to reason about recent events, current product documentation, or evolving regulatory frameworks without requiring RAG for every query.

Pricing: The Agentic Cost Equation

Output pricing dropped from $9.00 to $7.50 per million tokens, while input held at $1.50. That 17% output reduction compounds with the price cut: an agent that previously spent $9.00 on output tokens for a task now spends approximately $6.23 — a 31% effective cost reduction for the same work. For teams running thousands of agent tasks daily, this reprices the entire economics of agentic deployment.

Gemini 3.5 Flash-Lite: The High-Throughput Tier

Flash-Lite occupies a new category in Google’s model lineup: the ultra-cheap, ultra-fast tier designed for workloads where volume matters more than peak capability. At $0.30 per million input tokens and $2.50 per million output tokens, it’s among the cheapest production-grade models available from any major provider.

The speed specification tells the story: 350 output tokens per second, per Artificial Analysis. This makes Flash-Lite the natural choice for several agentic patterns that are expensive or slow with larger models:

Agent-based search and retrieval: When an orchestrator agent dispatches dozens of sub-queries to search, filter, and rank results, each sub-query can run on Flash-Lite at a fraction of the cost of the orchestrator’s model.
Document processing pipelines: Extracting structured data from hundreds or thousands of documents — invoices, contracts, medical records — where each extraction is a straightforward but repetitive task.
Classification and routing: Determining which specialist agent should handle an incoming request, or categorizing data before it enters a more expensive processing stage.
Guardrail evaluation: Running safety, compliance, or quality checks on agent outputs before they reach the end user — a pattern that doubles the model calls per task but can now run cheaply on a dedicated lightweight model.

The strategic implication is that Google is encouraging a model-per-stage architecture for agentic systems: Flash-Lite for preprocessing, routing, and validation; 3.6 Flash for reasoning, tool calling, and code generation; and larger models (when 3.5 Pro eventually ships) for the hardest problems. This mirrors what sophisticated agent builders were already doing manually — Google is now pricing the models to make this architecture the obvious default.

Gemini 3.5 Flash Cyber: A Government-Grade Vulnerability Hunter

The most technically interesting model in the release is the one most people can’t use yet. Gemini 3.5 Flash Cyber is a fine-tuned variant of 3.5 Flash, purpose-built for finding, validating, and patching software vulnerabilities. It’s available exclusively to governments and trusted partners through CodeMender, Google’s security-focused coding agent, as part of a limited-access pilot program.

The V8 Benchmark: Real Numbers on Real Code

Google tested Flash Cyber against its own mainline models on the V8 JavaScript Engine — Chrome’s open-source JavaScript and WebAssembly engine, one of the most scrutinized codebases in the world. The results, published by Google DeepMind:

Model	Unique Confirmed Vulnerabilities Found	Exclusive Findings (missed by other models)
Gemini 3.5 Flash Cyber	55	10
Gemini 3.5 Flash (mainline)	47	—
Opus 4.6	36	—

Flash Cyber found 10 vulnerabilities that neither 3.5 Flash nor Opus 4.6 detected. On the CyberGym benchmark, it delivered a 42% improvement on long-range multi-turn cyber tasks compared to its Flash 3 predecessor. These aren’t synthetic benchmarks — Google uses the model internally to scan Chrome, Android, Cloud, Ads, and YouTube codebases.

Why This Matters for Enterprise Security Teams

The significance isn’t just that the model is good at finding bugs. It’s that Google built a domain-specific fine-tuned variant of a production model and restricted it to government and trusted-partner access. This sets a precedent for how foundation model providers may handle sensitive capabilities going forward: train a specialized model, gate access to vetted organizations, and deliver it through a managed agent (CodeMender) rather than a raw API.

For enterprise security architects, Flash Cyber represents a shift from “AI-assisted” to “AI-primary” vulnerability detection. The model’s strength in scanning large codebases and analyzing numerous codepaths makes it particularly suited for the kind of comprehensive security auditing that human teams struggle to perform at scale — reviewing every code path in a million-line codebase, not just the paths flagged by static analysis tools.

Gemini 3.6 Flash vs Claude Sonnet 5 vs GPT-5.6: Which Model for Your Agents

The July 2026 model landscape gives agent builders three strong options in the mid-tier. Here’s the decision table:

Dimension	Gemini 3.6 Flash	Claude Sonnet 5	GPT-5.6 Terra
Input / Output Cost	$1.50 / $7.50	$2.00 / $10.00	$2.50 / $15.00
DeepSWE	49%	—	—
SWE-Bench Pro	58.7%	Higher (per Anthropic)	Higher (per OpenAI)
Terminal-Bench 2.1	76.2% (3.5 Flash)	80.4%	88.8% (Sol tier)
Computer Use (OSWorld)	83.0%	Available	Available
Context Window	1M tokens	200K tokens	128K-1M (varies by tier)
Max Output	64K tokens	64K tokens	32K-64K (varies)
Built-in Computer Use	Yes (native)	Yes	Yes
Knowledge Cutoff	March 2026	Early 2026	Early 2026
Best For	Cost-sensitive agentic loops, long-context tasks	Multi-step tool use, coding agents	Peak reasoning, complex chains

The Verdict for Agent Builders

Use Gemini 3.6 Flash when cost per agentic task is your primary constraint, when your agents need to process long documents (the 1M context window is 5× Claude’s), or when you’re building on the Google Cloud / Vertex AI stack and want native integration with ADK and A2A protocol.

Use Claude Sonnet 5 when your agents perform complex multi-step tool use, long-running coding tasks, or sophisticated debugging workflows. Anthropic’s strength is in sustained, multi-turn reasoning where the model needs to maintain coherent plans across many steps. The July 2026 model showdown on rpabotsworld.com breaks this down in detail.

Use GPT-5.6 (Sol or Terra) when peak reasoning quality matters more than cost — Sol leads Terminal-Bench 2.1 at 88.8% — or when you need OpenAI’s ecosystem integrations (Assistants API, function calling conventions, existing fine-tunes).

Use Flash-Lite when the task doesn’t need frontier intelligence but needs to be fast and cheap: routing, classification, extraction, guardrails. At $0.30/$2.50, it’s a rounding error in most budgets and can replace expensive model calls for 60-70% of the steps in a typical agentic pipeline.

What This Means for Agentic Workloads in Practice

The three-model release encodes a specific architectural thesis about how production agents should be built. Let’s unpack it.

1. The Multi-Model Agent Pipeline Is Now the Default

Google’s pricing structure makes it irrational to use a single model for every stage of an agent’s work. Consider a typical enterprise agent that processes customer support tickets:

Intake and classification (Flash-Lite at $0.30/$2.50): Read the ticket, classify priority, route to the right specialist agent.
Research and reasoning (3.6 Flash at $1.50/$7.50): Search knowledge bases, analyze the customer’s history, generate a draft response with tool calls to CRM and ticketing systems.
Quality check (Flash-Lite at $0.30/$2.50): Verify the draft response meets compliance requirements, check for PII exposure, validate tone.
Security scan (Flash Cyber, if available): For agents with code-execution capabilities, scan generated code for vulnerabilities before deployment.

Running this entire pipeline on a single frontier model would cost 3-5× more than distributing tasks across the appropriate tier. Google is making the economic case so stark that multi-model architectures become the obvious engineering choice, not an optimization for later. This is consistent with the multi-agent framework patterns emerging across LangGraph, CrewAI, Microsoft Agent Framework, and Google ADK.

2. Computer Use Changes the Automation Boundary

With Computer Use now built into the Gemini API as a native tool, agent builders can construct workflows that cross the API/UI boundary without switching models or integrating separate screen-reading services. The practical scenarios include:

Legacy system automation: Agents that interact with older ERP systems, mainframe terminal emulators, or industry-specific desktop applications that lack APIs.
Cross-application workflows: An agent that reads data from a web dashboard, copies it into a desktop spreadsheet, processes it, and uploads results to a different web application — all within a single agentic session.
Testing and QA: Agents that navigate web or desktop applications to verify that deployments work correctly, filling forms, clicking through workflows, and validating outputs visually.

The 83% OSWorld-Verified score means the model handles roughly four out of five GUI tasks correctly on the first attempt. That’s not reliable enough for unsupervised production use on critical systems, but it’s well past the threshold for supervised automation where a human reviews the agent’s actions before they’re committed. For teams already working with LangGraph agents deployed on UiPath, the Computer Use capability creates a natural bridge between API-driven and UI-driven automation within the same orchestration layer.

3. Token Efficiency Is the New Moat

The 17% output token reduction in 3.6 Flash reflects a broader shift in how foundation model providers compete. Raw intelligence (benchmark scores) is converging across providers — the July 2026 leaderboards show Claude Sonnet 5, GPT-5.6, and Gemini all within striking distance on most tasks. The differentiator is increasingly how efficiently a model accomplishes the task: fewer tokens per step means lower cost, lower latency, and more steps within the same context window budget.

For agentic workloads specifically, token efficiency compounds across the agent loop. An agent that takes 10 steps to complete a task, with each step producing 17% fewer output tokens, uses roughly 83% of the total output budget — across hundreds of daily runs, that’s a meaningful line-item reduction. Combined with the price cut from $9.00 to $7.50 per million output tokens, the effective cost per agent task drops by approximately 31%.

The Elephant in the Room: Where Is Gemini 3.5 Pro?

The most conspicuous absence in this release is Gemini 3.5 Pro — Google’s frontier-tier model that was expected alongside or before the Flash updates. TechCrunch reported that Google is “facing internal delays” as the model “struggled to meet internal performance goals.” Google’s official statement says 3.5 Pro is “currently testing with partners” with broad availability coming “as soon as it’s ready.”

For agent builders, the Pro delay matters because 3.5 Pro was expected to be Google’s answer to Claude Opus and GPT-5.6 Sol at the frontier tier — the model you’d use for the hardest reasoning tasks where Flash isn’t sufficient. Without it, Google’s lineup has a gap at the top: Flash is excellent for cost-efficient production workloads, but teams that need peak reasoning on their most complex agent tasks still need to reach for Anthropic or OpenAI.

The silver lining is the Gemini 4 teaser. Google confirmed that DeepMind has “already started our most ambitious pre-training run yet, for Gemini 4” — though no timeline was given. The implication is that Google may be deprioritizing 3.5 Pro in favor of leaping to the next generation, rather than shipping an incremental upgrade that arrives late to a market where Anthropic and OpenAI have already established their frontier positions.

Getting Started: API Access and Integration

All three models are accessible through Google’s existing developer infrastructure, though with different availability:

Gemini 3.6 Flash and 3.5 Flash-Lite

Google AI Studio: Available immediately for prototyping and testing at aistudio.google.dev.
Gemini API: Production-ready access via the standard Gemini API endpoints. Model IDs: gemini-3.6-flash and gemini-3.5-flash-lite.
Vertex AI / Gemini Enterprise: For enterprise deployments with VPC-SC, CMEK, and data residency requirements.
Google ADK: Native integration with Google’s Agent Development Kit for building multi-agent systems with the A2A protocol. If you’re building agents that need to communicate with agents on other platforms — the Google Agent Studio and Gemini Enterprise platform guide covers the full stack.
Gemini App and Google Search: Consumer-facing rollout for end users.

Gemini 3.5 Flash Cyber

Access: Limited to governments and trusted partners through CodeMender.
How to apply: Google hasn’t published a public application process yet. Organizations interested in the pilot should contact their Google Cloud account team or the Google DeepMind security research team.
What to expect: Flash Cyber is delivered as a managed agent (CodeMender) rather than a raw model API, meaning you interact with its vulnerability-detection capabilities through a structured interface rather than free-form prompting.

Integration Patterns for Agent Builders

If you’re adding 3.6 Flash to an existing agentic stack, the key architectural decisions are:

Router pattern: Use Flash-Lite as your routing model. It classifies incoming tasks and dispatches them to 3.6 Flash (or a frontier model from another provider) based on complexity. At $0.30/$2.50, the routing step costs virtually nothing.
Fallback chain: Start with Flash-Lite → escalate to 3.6 Flash → escalate to a frontier model. Each step up increases cost but also capability. Most tasks resolve at the cheapest tier.
Computer Use integration: If your agents need GUI interaction, enable Computer Use as a tool in your Gemini API calls. The model can then decide when to use it alongside function calling and code execution, without you needing to build separate screen-reading logic.
Multi-model pipelines: Pair 3.6 Flash with models from other providers. Use Flash for the bulk of agent work (research, tool calling, drafting) and Claude Sonnet 5 or GPT-5.6 Sol for the steps that require peak reasoning. The AI engineering map covers how to structure these hybrid architectures.

What the Industry Is Saying

The release lands during a period of intense competition in the agentic AI model market. Gartner projects that 40% of enterprise applications will have embedded agents by the end of 2026, up from less than 5% in 2025. Google’s three-model strategy — cost-optimized workhorse, ultra-cheap throughput tier, and domain-specific security variant — reflects a market that’s moving past the “which model is smartest” phase and into the “which model architecture is most cost-effective for production” phase.

The A2A (Agent-to-Agent) protocol, which Google has been building alongside its broader knowledge and agent infrastructure, is now in production at 150 organizations. Combined with the new model tiers, Google’s play is becoming clear: own the infrastructure layer — the models, the orchestration protocol, the enterprise platform — that makes agentic AI deployable at scale.

The enterprise governance challenge remains the bottleneck, however. Even with cheaper and more capable models, organizations still need control planes, audit trails, and human-in-the-loop guardrails to deploy agents safely. Flash Cyber’s restricted-access approach hints at one solution: instead of giving everyone access to the most powerful security tools and hoping they use them responsibly, gate access through managed agents with built-in controls.

Frequently Asked Questions

Is Gemini 3.6 Flash better than Claude Sonnet 5 for building AI agents?

It depends on your priority. Gemini 3.6 Flash is cheaper ($1.50/$7.50 vs $2.00/$10.00 per million tokens) and has a 5× larger context window (1M vs 200K tokens). Claude Sonnet 5 scores higher on multi-step tool use and sustained coding tasks. For cost-sensitive, high-volume agentic workloads, Flash wins on economics. For complex, multi-turn coding agents that need to maintain coherent plans across many steps, Sonnet 5 has the edge.

Can I use Gemini 3.5 Flash Cyber for my company’s security scanning?

Not yet, unless your organization is a government entity or a trusted partner accepted into Google’s limited-access pilot. Flash Cyber is delivered through CodeMender, Google’s managed security agent, not as a standalone API. Google has indicated the program will expand over time, but no public timeline or application process has been announced.

What’s the difference between Gemini 3.6 Flash and 3.5 Flash-Lite?

3.6 Flash is the general-purpose model for reasoning, coding, and tool use at standard Flash-tier quality. Flash-Lite is a stripped-down, high-speed variant optimized for throughput (350 tokens/sec) at much lower cost ($0.30/$2.50). Use 3.6 Flash for the tasks that need intelligence; use Flash-Lite for classification, routing, extraction, and guardrails where speed and cost matter more than peak capability.

When will Gemini 3.5 Pro be available?

Google says it’s “testing with partners” with no firm release date. Reports indicate internal delays due to performance targets not being met. Google has already started pre-training Gemini 4, suggesting the company may prioritize the next generation over shipping a late 3.5 Pro.

Does Gemini 3.6 Flash support Computer Use out of the box?

Yes. Computer Use is a built-in client-side tool in the Gemini API and Gemini Enterprise. You enable it in your API call configuration, and the model can autonomously decide when to use GUI automation alongside function calling and code execution. No separate integration or screen-reading service is required.

Key Takeaways

Three models, one strategy: Google shipped 3.6 Flash (workhorse), Flash-Lite (throughput), and Flash Cyber (security) — encoding a tiered architecture where different models serve different pipeline stages.
31% effective cost reduction: The combination of 17% fewer output tokens and a price cut from $9.00 to $7.50 per million output tokens makes each agentic task roughly a third cheaper than on 3.5 Flash.
Computer Use goes native: GUI automation is now a built-in tool in the Gemini API — no separate integration needed. OSWorld-Verified score: 83%.
Flash Cyber sets a precedent: A domain-specific fine-tuned model restricted to government and trusted partners, delivered through a managed agent (CodeMender). Found 55 vulnerabilities in V8 vs 47 for mainline Flash and 36 for Opus 4.6.
Flash-Lite enables cheap agent scaffolding: At $0.30/$2.50 and 350 tokens/sec, it’s the obvious choice for routing, classification, and guardrails in multi-model agent pipelines.
Pro is missing, Gemini 4 is coming: Google’s frontier model is delayed, but Gemini 4 pre-training has begun — signaling a leap rather than an incremental update.
The market is shifting from model intelligence to model economics: With benchmark scores converging across providers, the competitive axis is now cost-per-task and architectural fit for production agents.

References

Google, “Introducing Gemini 3.6 Flash, 3.5 Flash-Lite, and 3.5 Flash Cyber,” The Keyword (blog.google), July 21, 2026. Link
Google DeepMind, “Introducing Gemini 3.5 Flash Cyber,” deepmind.google, July 21, 2026. Link
TechCrunch, “Google releases three new Gemini models — but no 3.5 Pro,” July 21, 2026. Link
9to5Google, “Google launches Gemini 3.6 Flash and 3.5 Flash-Lite, teases Gemini 4,” July 21, 2026. Link
MarkTechPost, “Google Releases Gemini 3.6 Flash, 3.5 Flash-Lite, and 3.5 Flash Cyber: A Cheaper, More Token-Efficient Flash Tier Built for Agentic Workloads,” July 21, 2026. Link
Artificial Analysis, “Gemini 3.6 Flash – Intelligence, Performance & Price Analysis,” July 2026. Link
CybersecurityNews, “Gemini 3.5 Flash Cyber With Automated Faster Vulnerability Detection and Patch Capabilities,” July 2026. Link
The Hacker News, “Google Launches Gemini 3.5 Flash Cyber AI to Find and Fix Software Vulnerabilities,” July 2026. Link
DataCamp, “Gemini 3.6 Flash: Features, Benchmarks, and Tests,” July 2026. Link
Help Net Security, “Google’s Gemini 3.5 Flash Cyber becomes a vulnerability hunter,” July 22, 2026. Link

Alibaba Cloud Agent Native Cloud: AgentRun, AgentLoop & AgentTeams Guide (2026)

Satish Prasad — Wed, 22 Jul 2026 17:02:57 +0000

On July 18, 2026, Qi Zhou — head of Alibaba Cloud’s Cloud-Native Application Platform — walked onto the stage at the World Artificial Intelligence Conference (WAIC) in Shanghai and made an announcement that most Western enterprise architects will take weeks to fully process: Alibaba Cloud is no longer just running AI workloads on cloud infrastructure. It is rebuilding the cloud itself around AI agents as the primary compute primitive.

The announcement introduced Agent Native Cloud — a suite of three coordinated services (AgentRun, AgentLoop, and AgentTeams) plus a new inference engine (TokenWorks) and an open-sourced chip software stack (T-Head SAIL). Together, they represent the most vertically integrated agent infrastructure stack any cloud provider has publicly assembled: from custom silicon at the bottom to multi-agent governance at the top.

For the rpabotsworld.com audience — RPA developers, solution architects, and automation engineers moving into the agentic AI space — this matters regardless of whether you ever deploy on Alibaba Cloud. The announcement confirms that agent operations is now a sellable cloud primitive, on the same level as compute, storage, and networking. Every hyperscaler is converging on this layer, and the architectural patterns Alibaba is productizing are the same ones you’ll need to understand whether you’re building on AWS, Azure, Google Cloud, or an open-source stack.

This guide breaks down what Alibaba actually announced (and what it conspicuously didn’t), how the stack compares to AWS Bedrock AgentCore, Microsoft Agent 365, and Google’s Agentic Data Cloud, and what practitioners should do about it right now.

What Is Agent Native Cloud? The Three-Layer Architecture

Agent Native Cloud is not a single product — it is a vertically integrated platform comprising three services that address distinct phases of the AI agent lifecycle. Understanding the boundaries between them is critical, because most secondary coverage collapses all three into a single “Alibaba launches agent platform” headline that obscures what’s actually new.

Here’s the precise breakdown, drawn from Alibaba Cloud’s official announcement:

AgentRun: Lifecycle Management (Pre-Existing, Expanded)

AgentRun is not new. It predates WAIC 2026 and provides lifecycle management covering development, deployment, and operations for AI agents. Think of it as the runtime layer — the managed environment where your agent actually executes. It provides:

Native sandbox environments — isolated execution spaces with strong workload isolation, so one agent’s failure or misbehavior doesn’t cascade
Elastic scaling — automatic resource allocation based on agent workload demands
Enterprise identity integration — agents inherit organizational identity and access policies rather than running with their own credentials
Development-to-production pipeline — a managed path from agent prototyping through staging to production deployment

If you’ve worked with container orchestration platforms like Kubernetes, the mental model is similar: AgentRun is to AI agents what a managed Kubernetes service is to containerized applications — the runtime layer that handles the operational mechanics so you can focus on the agent logic itself.

AgentLoop: Observability and Optimization (New at WAIC 2026)

AgentLoop is the first of two genuinely new services announced at WAIC. It provides real-time tracing, evaluation, and optimization of agent performance. In practical terms, this is the agent-equivalent of an Application Performance Monitoring (APM) suite — the tooling that most teams currently assemble from a patchwork of tracing SDKs, eval harnesses, and prompt dashboards.

What AgentLoop centralizes:

Real-time tracing — following the complete execution path of an agent through tool calls, LLM invocations, memory reads/writes, and inter-agent communications
Performance evaluation — measuring agent accuracy, latency, token consumption, and task completion rates against defined benchmarks
Optimization recommendations — surfacing actionable improvements based on observed agent behavior patterns

For teams currently building agentic systems with frameworks like LangGraph, CrewAI, or the Microsoft Agent Framework, this maps to the observability gap that everyone acknowledges but few have solved well. Today, most teams instrument their agents with custom logging, run periodic eval suites offline, and debug production issues by reading raw logs. AgentLoop’s promise is to replace that manual stack with a managed service.

AgentTeams: Multi-Agent Governance (New at WAIC 2026)

AgentTeams is the second new service, and arguably the most strategically significant. It provides coordination and governance across multiple agents, allowing organizations to manage complex workflows that involve agent-to-agent collaboration.

This addresses what is rapidly becoming the hardest problem in enterprise agent deployments: once you have more than one agent operating in production, who decides which agent handles which task? How do you prevent two agents from taking conflicting actions on the same data? How do you enforce organizational policies across an entire fleet of autonomous agents?

AgentTeams tackles the agent control plane problem at the infrastructure level:

Multi-agent routing — directing incoming tasks to the appropriate specialized agent based on capability, context, and policy
Hand-off protocols — managing the structured transfer of context and control when one agent delegates work to another
Governance policies — enforcing organizational rules about what agents can do, what data they can access, and what actions require human approval
Workflow coordination — orchestrating complex multi-step workflows that span multiple agents, with state management and error recovery

If you’re building multi-agent systems today, you know this layer. It’s the part you’re currently implementing with custom orchestration code — the supervisor agents, the routing logic, the shared state stores, the policy enforcement hooks. AgentTeams says: that’s infrastructure, and it should be managed by the cloud provider.

Below the Agent Layer: TokenWorks and T-Head SAIL

The agent services grabbed the headlines, but two lower-in-the-stack announcements may prove more immediately relevant for practitioners.

TokenWorks: Unified Inference Serving

TokenWorks is a new service within Alibaba’s Platform for AI (PAI), available through PAI-EAS. It integrates four functions that are typically handled by separate systems into a single service:

Function	What It Does	Traditional Approach
Request Routing	Directs inference requests to the optimal model endpoint	Load balancer + custom routing logic
Inference Execution	Runs the actual model inference	vLLM / TGI / TensorRT-LLM
Compute Reuse	Shares GPU resources across requests efficiently	Custom batching + scheduling
Scheduling	Manages job queues and priority across workloads	Kubernetes scheduler + custom priority queues

For teams running inference at scale — which includes anyone operating agentic systems where each task may trigger dozens of LLM calls — the cost and reliability implications of unifying these functions are substantial. Inference cost is consistently the #1 operational expense in production agent deployments, and the inefficiencies from running these as separate systems compound quickly.

T-Head SAIL: Open-Source Chip Software Stack

Alibaba’s chip unit, T-Head, open-sourced its SAIL (Software AI Layer) stack at WAIC, providing developers with access to the computing framework optimized for Alibaba’s Zhenwu AI chips. The stack spans operating systems, SDKs, and interfaces, and is designed to be compatible with mainstream AI ecosystems.

The numbers behind the silicon are notable: as of April 2026, cumulative Zhenwu chip shipments reached 560,000 units, supporting over 400 customers across more than 20 industries. That’s a meaningful install base, and open-sourcing the software stack is a clear play to expand the developer ecosystem around Alibaba’s custom silicon — the same strategy that made CUDA dominant for NVIDIA.

The Competitive Landscape: Every Hyperscaler Wants This Layer

Alibaba’s announcement does not exist in isolation. It enters a race that AWS, Microsoft, and Google are all running simultaneously, and understanding the competitive positioning is essential for any architecture decision.

Vendor	Agent Control Plane	What It Governs	Pricing Disclosed?	Status (July 2026)
Alibaba Cloud	AgentRun + AgentLoop + AgentTeams	Agent lifecycle, observability, multi-agent governance	No	Unveiled at WAIC 2026; no GA date
AWS	Bedrock AgentCore	Agent runtime, guardrails, evaluations, gateway routing	Yes — free harness, pay for compute	GA; 1M+ downloads
Microsoft	Agent 365	Agent registry, access control, security monitoring	Partial	Security transition underway (July 2026)
Google Cloud	Agentic Data Cloud + Gemini Enterprise	Data fabric, agent engine, A2A protocol	Partial	A2A v1.0 in production at 150 organizations
Open Source	agentgateway (Linux Foundation AAIF)	Agent-to-tool traffic gateway	N/A — open source	300+ contributors, 60+ organizations

AWS Bedrock AgentCore: The Most Mature Offering

Amazon’s Bedrock AgentCore is the most commercially mature offering in this space, with over 1 million downloads since general availability. The AgentCore harness lets you create and run an agent with just two API calls (CreateHarness and InvokeHarness), with the agent running in its own isolated environment with a filesystem and shell, persistent memory across sessions, and the ability to browse the web, call tools through gateway or MCP, and switch model providers mid-session. AWS also recently added batch evaluations, A/B testing, and Bedrock Guardrails integration for evaluating every agent action against prompt injection, harmful content, and sensitive data policies.

The key advantage AWS has: it’s priced and shipping today. The harness itself is free; you pay for the underlying compute and model inference. That transparency is something Alibaba’s announcement conspicuously lacks.

Microsoft Agent 365: Security-First Approach

Microsoft’s approach, rebranded from Copilot Studio security capabilities to Agent 365, centers on the security and governance dimensions. As of July 1, 2026, Microsoft Copilot Studio agent security capabilities through Defender for Cloud Apps are transitioning to Agent 365 observability logs for licensed tenants. Microsoft has also introduced new certification paths — the Agentic AI Business Solutions Architect (AB-100) and AI Agent Builder Associate (AB-620) — signaling that they see agent operations as a distinct professional discipline.

Google: A2A Protocol and the Interoperability Bet

Google’s approach is arguably the most architecturally ambitious. The Agent2Agent (A2A) protocol reached v1.0 in production at 150 organizations, enabling agents on different platforms to communicate through a standardized protocol. This means a Salesforce agent built on Agentforce can hand off a task to a Google agent running on Vertex AI, which can query a ServiceNow agent for IT asset data — all through A2A without any of the three systems needing to understand each other’s internal architecture.

Google also rebranded Vertex AI to the Gemini Enterprise Agent Platform at Cloud Next 2026, consolidating its agent tooling under a unified brand with 200+ models in the Model Garden.

The Open-Source Alternative: agentgateway

Solo.io donated its “agentgateway” project to the Linux Foundation’s Agentic AI Foundation in June 2026, where it now counts 300+ contributors across 60+ organizations including CoreWeave, Red Hat, Adobe, Salesforce, and Microsoft. This is the open-source counterweight to the hyperscaler lock-in play — a neutral gateway layer for agent-to-tool traffic that doesn’t tie you to any single cloud provider.

The Qwen 3.8 Connection: Why the Model Layer Matters

One day after the Agent Native Cloud unveiling, Alibaba released Qwen 3.8-Max-Preview — a 2.4 trillion-parameter multimodal model with a 1 million-token context window. The timing is not coincidental. The model, the agent infrastructure, and Alibaba’s Token Plan pricing system represent a coordinated strategy: a frontier model (Qwen 3.8), a consumption model (Token Plan), and an operational platform (Agent Native Cloud) — all announced in the same WAIC week.

For practitioners, the relevant question isn’t whether Qwen 3.8 is better than GPT-5.6 or Claude Sonnet 5 (that benchmarking is a separate conversation). It’s that Alibaba is the only vendor that announced a complete vertical stack from silicon to governance in a single event:

Layer	Alibaba’s Product	Purpose
Silicon	Zhenwu chips (560K shipped)	Custom AI inference hardware
Chip Software	T-Head SAIL (open-sourced)	OS, SDKs, interfaces for Zhenwu
Inference Serving	TokenWorks	Routing, execution, scheduling
Foundation Model	Qwen 3.8-Max-Preview (2.4T params)	Frontier multimodal model
Agent Runtime	AgentRun	Lifecycle management, sandboxes
Agent Observability	AgentLoop	Tracing, evaluation, optimization
Multi-Agent Governance	AgentTeams	Coordination, routing, policies

No other vendor — not AWS, not Microsoft, not Google — made a stack claim that vertically complete at a single event. Whether Alibaba can execute across all seven layers is an open question, but the ambition sets the direction for what “full-stack agent infrastructure” will mean by 2027.

What Alibaba Didn’t Say: The Critical Gaps

Precision matters in evaluating vendor announcements, and what was not disclosed at WAIC is as important as what was. Across the entire Agent Native Cloud announcement:

No pricing — for AgentRun, AgentLoop, AgentTeams, or TokenWorks. Not even indicative pricing tiers or a “starting at” figure.
No general availability date — no preview date, no GA date, no waitlist. The services were “unveiled” and “introduced,” not “launched.”
No named customers — no pilot deployments, no reference architectures from real deployments, no case studies.
No executive quote — unusually, no Alibaba executive is directly quoted anywhere in the primary announcement.
No compliance posture — no mention of SOC 2, ISO 27001, GDPR, or any compliance framework for the agent services.
No SLAs — no uptime guarantees, no latency commitments, no data retention policies.

This absence doesn’t make the announcement empty — hyperscalers routinely unveil infrastructure directions months before commercial terms follow. But it means the correct classification is stated intent, not a shippable product. Teams should track the direction, not plan migrations around it.

The Gartner counterweight is worth noting here: according to Forbes’ reporting of Gartner’s analysis, more than 40% of agentic AI projects will be canceled by 2027, citing escalating costs, unclear value, and weak risk controls. That prediction is the right lens for any agent infrastructure announcement that arrives without pricing or commercial terms — and it’s a pattern we’ve explored in depth in our analysis of why agentic automation programs fail.

The MCP and Skills Portal: Cloud as Agent Toolkit

One detail from Alibaba’s broader WAIC announcements that flew under the radar of most coverage deserves particular attention from the rpabotsworld.com audience: Alibaba Cloud launched a new Skills portal that converts common cloud capabilities across more than 60 cloud products into Skill-based and MCP-compatible formats.

This is significant. MCP (Model Context Protocol, originally developed by Anthropic) has rapidly become the de facto standard for connecting AI agents to external tools and data sources. Alibaba’s decision to make its cloud services natively MCP-compatible means that an agent running on AgentRun could invoke cloud resources — databases, storage, compute, messaging — as naturally as calling functions, through the same protocol that agents use to call any other tool.

For practitioners working with MCP-aware frameworks (which now includes most major agent frameworks), this lowers the integration barrier significantly. Instead of writing custom API wrappers for each Alibaba Cloud service, you’d use the same MCP protocol you already use for everything else.

Market Context: Why Agent Infrastructure Is the Next Cloud Primitive

The simultaneous convergence of four hyperscalers on the same product category — agent operations infrastructure — is not a coincidence. It reflects a fundamental shift in how enterprises consume cloud services.

The numbers tell the story clearly:

The global AI agents market is projected to reach $10.9–12 billion in 2026, with Deloitte forecasting a CAGR of roughly 53% to $45 billion by 2030.
Agentic infrastructure now represents 17–22% of enterprise AI line items in 2026, projected to grow to 26–32% by 2027.
Q1 2026 agent-native venture funding hit $4.7 billion (annualized to $20B+) — the largest software vertical funded since cloud-native in 2015–2017.
56% of enterprises now have a formal “AI agent owner” or “agentic ops” lead, up from 11% in 2024.
40% of enterprise applications are expected to embed task-specific AI agents by end of 2026.

The pattern is unmistakable: agent operations is following the same trajectory that container orchestration followed a decade ago. First came the runtime (agents themselves), then chaos (everyone building custom orchestration), then a scramble to own the governance and orchestration layer. The open question — and it’s a consequential one — is whether the agent world gets its Kubernetes moment (one neutral standard the ecosystem consolidates around) or fragments into per-cloud control planes that make agent governance a portability problem.

Practitioner Playbook: What to Do Right Now

Alibaba’s announcement, combined with the broader hyperscaler convergence, creates a clear set of practical implications depending on where your agent workloads currently live.

If You’re Building Custom Agents Today

The most important takeaway isn’t about Alibaba specifically — it’s that tracing, evaluation, and governance are now officially infrastructure, not application concerns. If you’re building agents with LangGraph, CrewAI, or any other framework, invest in a thin version of this layer in your own stack right now:

Tracing — instrument every LLM call, tool invocation, and agent decision point. Use OpenTelemetry-compatible formats so you can migrate between observability providers later.
Evaluation — build eval suites that run automatically against your agents, not just on a developer’s laptop. Measure accuracy, latency, cost-per-task, and failure modes.
Governance — implement policy enforcement at the orchestration layer. What can each agent access? What actions require human approval? What data is off-limits?

Building this layer yourself — even in a lightweight form — means that adopting any vendor’s control plane later is a migration of convenience, not a rescue operation.

If You’re on AWS

Bedrock AgentCore is the most mature offering with real pricing and GA status. If you’re already on AWS, there’s no reason to wait for Alibaba’s offering. But keep your agent logic framework-agnostic (LangGraph or similar) rather than coupling to AgentCore-specific APIs, so you can evaluate alternatives as they mature.

If You’re Evaluating Cloud Providers for Agent Workloads

The decision framework is straightforward:

Priority	Best Current Option	Why
Ship today, iterate later	AWS Bedrock AgentCore	GA, priced, 1M+ downloads, free harness
Multi-vendor agent interop	Google A2A Protocol	v1.0 in production, 150 orgs, open standard
Security-first governance	Microsoft Agent 365	Defender integration, enterprise identity, certifications
Vertical integration (model + infra)	Alibaba Agent Native Cloud	Full stack from silicon to governance — but unpriced/undated
Avoid lock-in entirely	Linux Foundation agentgateway	Open source, 300+ contributors, vendor-neutral

If You’re in a Regulated Industry

Nothing in Alibaba’s Agent Native Cloud announcement includes compliance certifications, data residency guarantees, or SLA commitments. For enterprises with regulatory exposure (finance, healthcare, government), there is currently nothing to evaluate from Alibaba on this front. Revisit when GA terms and legal documentation exist.

The Bigger Picture: From Cloud-Native to Agent-Native

The terminology shift from “cloud-native” to “agent-native” is not marketing — it reflects a genuine architectural evolution. In a cloud-native architecture, the fundamental unit of deployment is a container or serverless function. In an agent-native architecture, the fundamental unit is an autonomous agent that can reason, use tools, maintain memory, and collaborate with other agents.

This shift has cascading implications across the infrastructure stack:

Infrastructure Concern	Cloud-Native Era	Agent-Native Era
Unit of deployment	Container / function	Agent
Orchestration	Kubernetes / step functions	Multi-agent coordinators (AgentTeams)
Observability	APM / distributed tracing	Agent tracing + eval (AgentLoop)
Lifecycle management	CI/CD pipelines	Agent runtime platforms (AgentRun)
Security model	Network policies + RBAC	Agent identity + action-level governance
Scaling trigger	CPU / memory thresholds	Task queue depth + token consumption
Cost model	Compute hours	Token consumption + tool call volume

Alibaba Cloud’s Agent Native Cloud is the most explicit articulation of this transition that any cloud provider has made. Whether Alibaba or AWS or Google ultimately wins the agent infrastructure market is less important than the fact that all of them agree the transition is happening. For practitioners, the right response is not to pick a winner, but to understand the architectural patterns and start building for them.

Frequently Asked Questions

Is Alibaba Cloud Agent Native Cloud available to use right now?

No. As of July 2026, Agent Native Cloud (specifically AgentLoop and AgentTeams) has been unveiled but has no published general availability date, no pricing, and no public preview. AgentRun, the pre-existing lifecycle management component, has been available on Alibaba Cloud, but the two new services have no disclosed timeline for launch.

How does Agent Native Cloud compare to AWS Bedrock AgentCore?

AWS Bedrock AgentCore is more mature — it’s generally available, priced (free harness, pay for compute), and has over 1 million downloads. Alibaba’s offering is more vertically integrated (covering silicon to governance in one stack) but has no pricing, no GA date, and no named customers. For teams that need to ship today, AWS is the pragmatic choice. For teams tracking where the market is heading, Alibaba’s architecture is worth studying.

Do I need to be on Alibaba Cloud to benefit from these architectural patterns?

No. The three-layer pattern Alibaba is productizing — agent runtime, agent observability, and multi-agent governance — is vendor-agnostic in concept. You can build equivalent capabilities using open-source tools (LangSmith for tracing, custom eval harnesses, policy engines) on any cloud. The value of watching Alibaba’s announcement is understanding that these are now considered infrastructure concerns, not application-level add-ons.

What is TokenWorks and why should I care?

TokenWorks unifies inference request routing, execution, compute reuse, and scheduling into a single service. If you’re running agentic systems where each task triggers multiple LLM calls, inference cost is your largest operational expense. TokenWorks addresses this by optimizing how GPU resources are shared across requests — a problem that most teams currently solve with a patchwork of vLLM, custom batching logic, and load balancers.

Will Agent Native Cloud support non-Qwen models?

Alibaba hasn’t explicitly addressed this in the WAIC announcement. Given that Alibaba’s Model Studio already hosts 200+ models and the platform is described as framework-agnostic, multi-model support is likely — but no confirmation exists as of this writing.

Key Takeaways

Agent ops is now a cloud primitive. When four hyperscalers independently converge on the same product shape within a year, the category is no longer speculative — it’s an established layer of the enterprise stack.
Alibaba announced two new services, not three. AgentLoop (observability) and AgentTeams (multi-agent governance) are new. AgentRun (lifecycle management) already existed and was expanded.
Nothing is priced or dated. No pricing, no GA date, no named customer, no compliance certification. This is stated intent, not a shippable product.
The vertical integration is unprecedented. Silicon (Zhenwu) → chip software (SAIL) → inference serving (TokenWorks) → foundation model (Qwen 3.8) → agent runtime → observability → governance. No other vendor assembled this complete a stack at a single event.
Build your own thin agent-ops layer now. Regardless of which vendor you ultimately adopt, owning your traces, evals, and governance policies in portable formats ensures you can switch platforms later without a rescue migration.
AWS Bedrock AgentCore is the pragmatic choice today. It’s GA, priced, and proven at scale. Alibaba’s offering is worth tracking for architectural direction, not immediate adoption.
The Kubernetes question is unanswered. Will the agent world consolidate around an open standard (like the Linux Foundation’s agentgateway) or fragment into per-cloud control planes? Your portability decisions today depend on which outcome you’re betting on.

References

Alibaba Cloud Community, “Alibaba Cloud Unveils Agent-Native Innovations at WAIC 2026,” July 20, 2026.
Digital Applied, “Alibaba’s Agent-Native Cloud: AgentLoop and AgentTeams,” July 21, 2026.
CryptoBriefing, “Alibaba Cloud launches Agent Native Cloud to scale enterprise AI agents,” July 2026.
MarkTechPost, “Alibaba Previews Qwen3.8-Max, a 2.4 Trillion-Parameter Multimodal Model,” July 19, 2026.
AWS, “Introducing Amazon Bedrock AgentCore,” 2026.
AWS, “Amazon Bedrock AgentCore harness is now generally available,” 2026.
Microsoft Learn, “Transition agent security capabilities to Microsoft Agent 365,” July 2026.
Google Developers Blog, “Announcing the Agent2Agent Protocol (A2A),” 2026.
Constellation Research, “Amazon Bedrock AgentCore generally available,” 2026.
Salesforce Newsroom, “Salesforce and AWS Deepen Collaboration to Launch Agentforce 360 for AWS,” 2026.

AI Agent Control Planes: The 2026 Enterprise Governance Guide

Satish Prasad — Wed, 22 Jul 2026 02:09:07 +0000

An OutSystems survey of 1,900 IT leaders found that 96% of enterprises now run AI agents in production — but only 12% can actually govern them. That gap is not a product roadmap problem. It is an operational risk that compounds every week another team spins up another copilot, another embedded agent, another shadow deployment nobody in IT even knows about.

In Q2 2026, Forrester published its first-ever Agentic Control Plane Solutions Landscape, identifying 33 vendors competing to close that governance gap. The timing is not coincidental. Three of the world’s largest enterprise software companies — IBM, ServiceNow, and Google — shipped dedicated agent control planes within weeks of each other. Palo Alto Networks paid roughly $700 million to acquire AI gateway startup Portkey. And the open-source community launched its own answer through the Linux Foundation’s Agentic AI Foundation.

This article breaks down what an AI agent control plane actually is, compares the five leading approaches, and gives you a practical framework for choosing the right one based on where your organization sits on the agent maturity curve.

AI agent control plane architecture — four layers: discovery, policy, observability, enforcement

What Is an AI Agent Control Plane?

Forrester defines the agentic control plane as “a common enterprise governance and control platform that sits above and across a heterogeneous estate of AI agents and agentic skills and applies a consistent envelope of oversight, governance, and controls so the agent portfolio can be managed consistently across platforms, vendors, and use cases.”

In plain terms: it is the operational layer that answers four questions about every AI agent in your organization.

Discovery — What agents exist, where do they run, and who owns them?
Governance — What policies, permissions, and guardrails apply to each agent?
Observability — What is each agent doing right now, and how is it performing?
Enforcement — Can we shut down a misbehaving agent in real time, before it causes damage?

If you have built or deployed multi-agent systems using frameworks like LangGraph, CrewAI, or Google ADK, you already know how quickly the agent count multiplies once teams start composing agents into workflows. The control plane is the answer to what happens after the build phase — when dozens or hundreds of agents need to coexist in a governed production environment.

Why 2026 Is the Inflection Point

Three converging forces made the control plane category inevitable in 2026:

1. Agent Sprawl Hit Critical Mass

Forrester’s research found that organizations significantly underestimate their AI footprint as agents proliferate through copilots, embedded SaaS features, departmental builds, and shadow projects. Their blunt assessment: “Buyers can’t govern agents they can’t see.” When your enterprise has agents embedded in Salesforce, Microsoft 365, ServiceNow, custom LangGraph deployments, and vendor-specific copilots — all running simultaneously — the governance challenge is fundamentally different from managing a handful of RPA bots.

If you have seen this pattern play out in traditional RPA programs, you will recognize the parallels. Many of the same reasons agentic automation programs fail — lack of governance, no centralized oversight, “shadow automation” proliferating unchecked — apply directly to the current AI agent landscape, only now the agents are autonomous and can take consequential actions.

2. Regulatory Pressure Materialized

The EU AI Act’s enforcement timeline, NIST’s AI Risk Management Framework updates, and sector-specific AI governance mandates (particularly in financial services and healthcare) created a compliance forcing function. Enterprises cannot simply promise to govern agents later. They need auditable proof that governance is in place — now.

3. Multi-Vendor Agent Estates Became the Norm

No enterprise runs agents from a single vendor. A typical 2026 enterprise agent estate includes Microsoft Copilot agents, Salesforce Agentforce agents, ServiceNow autonomous workflows, custom agents built on open-source frameworks, and vendor-embedded AI features that technically qualify as agents but were never registered anywhere. The control plane exists because no single vendor can govern the whole estate — and every vendor knows it.

The Five Leading Approaches: A Detailed Comparison

The control plane market has already stratified into five distinct approaches. Each reflects a different theory about where governance should live and who should own it.

1. IBM watsonx Orchestrate — The Agentic Control Plane

IBM launched the Agentic Control Plane in June 2026 as part of watsonx Orchestrate, available on both AWS and IBM Cloud. The positioning is explicit: IBM wants watsonx Orchestrate to be the single pane of glass for every AI agent in the enterprise, regardless of where that agent was built.

Key capabilities:

Multi-framework agent support — IBM native agents, Langflow agents, LangGraph agents, and agents built with the open A2A (agent-to-agent) protocol are all supported, with broader interoperability on the roadmap.
Agent Catalog — A tenant-level catalog where teams publish agents with semantic versioning, descriptions, categories, and icons. When an agent has dependencies (collaborator agents, Python tools), those travel automatically during publication.
Security Control Center — Centralized security dashboard with real-time safety guardrails designed to prevent cascading failures in autonomous workflows.
Natural-language scheduling — Users can schedule recurring agent tasks directly through conversational chat, lowering the barrier for business users to operationalize agents.
Analytics overhaul — Revamped analytics experience providing end-to-end visibility into agent activities, performance metrics, and usage patterns.

Best fit: Enterprises already invested in IBM’s AI stack, or organizations running heterogeneous agent frameworks (especially LangGraph and A2A-compatible agents) that need a vendor-neutral orchestration layer. IBM’s support for open protocols (A2A) is a meaningful differentiator for organizations wary of lock-in.

At Think 2026, IBM also unveiled a catalog of over 150 pre-built agents, signaling that they see the control plane not just as a governance tool but as an agent marketplace for enterprise reuse.

2. ServiceNow AI Control Tower

ServiceNow took a different angle at Knowledge 2026: rather than positioning as an agent builder, they positioned AI Control Tower as the governance layer that sits above every agent vendor in the enterprise. The ambition is to discover, govern, observe, secure, and measure every AI agent, model, and workflow across the organization — regardless of origin.

Key capabilities:

Cross-platform discovery — 30 new enterprise integrations spanning AWS, Google Cloud, Microsoft Azure, and enterprise applications like SAP, Oracle, and Workday. This lets AI Control Tower discover AI assets deployed outside ServiceNow’s own ecosystem.
Risk frameworks — Five new risk assessment frameworks aligned to NIST and EU AI Act standards, providing compliance controls across agents, models, datasets, prompts, and classic ML systems.
Real-time enforcement — The Control Tower can detect abnormal agent behavior, flag it in real time, automatically revoke the agent’s permissions, and shut it down.
Identity governance — Extended identity access governance to hyperscaler AI environments and connected devices.
Project Arc — A joint initiative with NVIDIA creating an autonomous desktop agent secured by the NVIDIA OpenShell runtime and governed by AI Control Tower. This extends governance from cloud agents to desktop-level autonomous workflows.

Best fit: Large enterprises that already use ServiceNow as their IT Service Management (ITSM) or IT Operations Management (ITOM) backbone. ServiceNow’s strength is that it already knows your infrastructure, your users, your approval workflows, and your compliance posture — extending that to AI agent governance is a natural adjacency. The NIST/EU AI Act alignment is particularly relevant for regulated industries.

ServiceNow also deepened its integration with Microsoft, extending AI Control Tower’s governance to Microsoft Agent 365 and the broader Azure-backed Foundry and Copilot Studio ecosystem.

3. Google Gemini Enterprise Agent Platform

At Cloud Next 2026, Google made a significant architectural decision: it rebranded and consolidated Vertex AI into the Gemini Enterprise Agent Platform, absorbing Agentspace into a unified product. The message was clear — Google is no longer selling model access. It is selling the full agentic enterprise platform, with governance at the center.

Key capabilities:

Unified control plane — Every agent deployed inside a company is visible, auditable, and controllable through a single control layer. Agents built on Agent Platform and surfaced in the Gemini Enterprise app operate under the same rules.
Govern and Optimize layers — The most meaningful new product at Next ’26 focused on identity management, agent registry, gateway functions, anomaly detection, simulation, evaluation, and observability.
A2A protocol integration — Google’s Agent-to-Agent (A2A) protocol, designed for cross-vendor agent communication, is natively supported. If you have been building multi-agent systems with Google ADK, the governance layer now provides production-grade oversight for those deployments.
Security GA — Google Threat Intelligence moved its agentic AI capabilities from public preview to general availability for Enterprise and Enterprise+ customers, specifically targeting automated threat hunting, incident response, and alert triage governance.

Best fit: Organizations building on Google Cloud, particularly those using ADK for multi-agent orchestration. Google’s advantage is vertical integration — the models, the agent framework, the deployment platform, and the governance layer are all first-party. The trade-off is that the governance story is strongest for Google-native agents and weaker (though improving) for third-party agent estates. Bain & Company, in their post-Next analysis, described Google’s approach as “the agentic enterprise control plane coming into view” — acknowledging both the ambition and the fact that multi-vendor governance is still a work in progress.

4. Palo Alto Networks Prisma AIRS (via Portkey Acquisition)

Palo Alto Networks completed its acquisition of AI gateway startup Portkey in May 2026, in a deal The New Stack valued at roughly $700 million. This represents the cybersecurity industry’s largest bet on agentic AI governance to date — and it reflects a fundamentally different philosophy from the three approaches above.

Where IBM, ServiceNow, and Google approach the control plane from the platform side (helping you build and manage agents), Palo Alto approaches it from the security side (treating every agent as a “privileged insider” that needs to be monitored, authenticated, and contained).

Key capabilities:

AI Gateway as central nervous system — Portkey now serves as the AI Gateway for Palo Alto’s Prisma AIRS platform, inspecting, routing, and securing every AI transaction across the enterprise.
Scale — Portkey processes trillions of tokens per month with latency low enough for agent-to-agent communication, a critical requirement for real-time governance of autonomous agents.
Multi-model governance — Centralized management across more than 3,000 LLMs and MCP tools, with semantic routing and automated failovers providing 99.99% uptime for autonomous workloads.
Artifact management — Seamless versioning and secure access control across all AI models, agents, and MCP servers.
Runtime enforcement — Governance policies enforced at runtime — not just at deployment — meaning the gateway can intervene while agents are actively operating.

Best fit: Security-first organizations, particularly in financial services, government, and healthcare, where the CISO’s office needs to govern AI agents with the same rigor applied to network security. Palo Alto’s approach is especially relevant for enterprises where the security team, not the platform team, owns the agent governance mandate. If your concern is less about “which agents exist” and more about “what damage can they do,” this is the approach designed for you.

5. Open-Source: agentgateway (Agentic AI Foundation / Linux Foundation)

In June 2026, Solo.io donated its agentgateway project to the Agentic AI Foundation under the Linux Foundation’s governance. This made it the fourth hosted project under the foundation, and it represents the open-source community’s answer to the proprietary control plane race.

Key capabilities:

Protocol-agnostic traffic management — Handles MCP (Model Context Protocol), agent-to-agent (A2A), inference, HTTP, and gRPC traffic through a single data plane.
Community scale — Over 300 contributors across 60 organizations including CoreWeave, Red Hat, Adobe, Salesforce, and Microsoft — an unusually broad coalition for an early-stage project.
Apache 2.0 licensing — No vendor lock-in, no usage fees, full source access.
Composability — Designed to be embedded into larger governance stacks rather than replacing them, making it complementary to (rather than competitive with) the proprietary platforms above.

Best fit: Platform engineering teams building custom agent infrastructure, organizations with strong open-source mandates, and companies that want governance primitives without committing to a specific vendor’s control plane vision. The agentgateway is not a complete control plane — it is the data-plane layer that a control plane needs. Think of it as the Envoy of agentic AI: a building block, not a finished product.

For practitioners already working with the prompts-context-loops architecture of modern AI engineering, the agentgateway provides the infrastructure layer that sits beneath those abstractions and governs the traffic between them.

Decision Table: Choosing Your Control Plane

Criteria	IBM watsonx Orchestrate	ServiceNow AI Control Tower	Google Gemini Enterprise	Palo Alto Prisma AIRS	agentgateway (OSS)
Primary angle	Build + govern	Govern + comply	Build + govern (Google-native)	Secure + enforce	Route + observe (data plane)
Multi-vendor agent discovery	Moderate (A2A, LangGraph, Langflow)	Strong (30+ integrations, AWS/GCP/Azure/SAP/Oracle)	Moderate (strongest for GCP-native)	Strong (3,000+ LLMs, MCP tools)	Protocol-level (MCP, A2A, gRPC)
Compliance frameworks	IBM AI Ethics, custom	NIST, EU AI Act (5 frameworks)	Google AI Principles, evolving	Security-first, CISO-aligned	BYO compliance layer
Real-time enforcement	Yes (guardrails)	Yes (auto-revoke + shutdown)	Yes (anomaly detection)	Yes (runtime policy enforcement)	Routing-level only
Agent catalog/marketplace	150+ pre-built agents	50+ specialized agents (Slack/Teams/IT)	Gemini Enterprise app ecosystem	N/A (security focus)	N/A
Open protocol support	A2A	Proprietary + Microsoft Agent 365	A2A (creator)	MCP, multi-model	MCP, A2A, HTTP, gRPC
Best for	Multi-framework enterprises	ITSM-centric, regulated	Google Cloud-native	Security-first orgs	Platform engineering teams
Availability	GA (June 2026)	Innovation Lab (May); GA expected Aug 2026	GA (Cloud Next 2026)	GA (May 2026)	Apache 2.0 (June 2026)

The Architecture Pattern: What Every Control Plane Has in Common

Despite the differences in positioning and feature sets, all five approaches share a common architectural pattern. Understanding this pattern helps you evaluate any control plane — including ones from the other 28 vendors in Forrester’s landscape report that we have not covered here.

Layer 1: Agent Registry and Discovery

Every control plane starts with an inventory. You cannot govern agents you do not know about. The registry discovers agents across the enterprise — including embedded copilots, SaaS-native agents, and custom deployments — and maintains a live catalog with ownership, version, permissions, and dependency metadata.

This is where the 96% vs. 12% gap lives. Most enterprises have agents running in production that no central team authorized, registered, or even knows about. The registry closes that blind spot.

Layer 2: Policy Engine and Identity

Once agents are discovered, the control plane applies governance policies: which tools can each agent access, what data can it read, which actions require human approval, and what identity does the agent operate under. This is the layer that maps enterprise IAM (Identity and Access Management) concepts to agent-level permissions.

ServiceNow’s approach to this layer is particularly mature, given its existing identity governance infrastructure. IBM’s approach leverages its enterprise security heritage. Google integrates with Cloud IAM natively.

Layer 3: Observability and Telemetry

Real-time monitoring of agent activities: what each agent is doing, how long it takes, what errors it encounters, what data it accesses, and how it interacts with other agents. This layer produces the audit trail that compliance teams need and the performance data that operations teams use to optimize.

Layer 4: Enforcement and Circuit Breaking

The critical differentiator between a “governance dashboard” and a true control plane: the ability to intervene in real time. When an agent behaves anomalously — accessing data it should not, taking actions outside its scope, or cascading failures through a multi-agent workflow — the enforcement layer can revoke permissions, halt execution, or reroute traffic. ServiceNow’s auto-revoke-and-shutdown capability and Palo Alto’s runtime policy enforcement are the most aggressive implementations of this layer today.

Practical Implementation: Where to Start

For organizations that have not yet implemented a control plane, the Forrester research suggests a phased approach. Trying to boil the ocean — governing every agent on day one — is a recipe for the project stalling in committee.

Phase 1: Discovery and Inventory (Weeks 1-4)

Before selecting a vendor, run an internal discovery exercise. Catalog every AI agent, copilot, and embedded AI feature running in your environment. Include:

Vendor-provided agents (Microsoft Copilot, Salesforce Agentforce, ServiceNow autonomous workflows)
Custom-built agents (LangGraph, CrewAI, Google ADK deployments)
Shadow agents (departmental experiments, personal GPTs connected to enterprise data, browser-based copilots)
Embedded AI features in SaaS products that technically qualify as agents

Most organizations discover 3-5x more agents than they expected. That discovery itself often provides the executive sponsorship needed for the governance investment.

Phase 2: Risk Tiering (Weeks 4-6)

Not every agent needs the same level of governance. Tier your agents by risk:

Tier 1 (critical) — Agents that can take financial actions, access PII, or interact with customers
Tier 2 (significant) — Agents that access internal business data or make recommendations humans act on
Tier 3 (low) — Read-only agents, summarization tools, internal productivity copilots

Apply control plane governance to Tier 1 first. This keeps the initial scope manageable while addressing the highest-risk agents immediately.

Phase 3: Vendor Selection and Deployment (Weeks 6-12)

Use the decision table above to shortlist vendors based on your existing infrastructure, compliance requirements, and the composition of your agent estate. Key questions:

Is your agent estate primarily one vendor’s ecosystem, or genuinely multi-vendor? (Single-vendor → Google or IBM; multi-vendor → ServiceNow or Palo Alto)
Does governance report to the platform team or the security team? (Platform → IBM, Google, agentgateway; Security → Palo Alto, ServiceNow)
What compliance frameworks are mandatory for your industry? (EU AI Act / NIST → ServiceNow has the most mature framework alignment)
Do you need to govern desktop-level agents, not just cloud agents? (Desktop → ServiceNow’s Project Arc with NVIDIA OpenShell)

Phase 4: Operationalize and Scale (Ongoing)

The control plane is not a one-time deployment. As new agents enter the estate — and they will, constantly — the discovery, tiering, and governance cycle repeats. The best implementations treat the control plane as a living system with regular reviews, updated policies, and continuous monitoring.

If you are building custom agents today using frameworks like LangGraph, plan for governance from the start. The LangGraph deployment pipeline should include control plane registration as a standard step, not an afterthought bolted on months after the agent is already in production.

The Market Trajectory: What Comes Next

The AI governance platform market is projected to reach $492 million in 2026 and exceed $1 billion by 2030. Several trends will shape where this market heads:

Consolidation is already happening. Palo Alto’s Portkey acquisition is the first major M&A move, but it will not be the last. Expect infrastructure vendors (cloud providers, security platforms, ITSM providers) to acquire specialized governance startups throughout 2026 and 2027.

Open standards will determine interoperability. The A2A protocol (backed by Google, supported by IBM) and MCP (Model Context Protocol) are emerging as the communication standards agents use to talk to each other and to tools. Control planes that support these protocols natively will have an advantage over those requiring proprietary integrations. The agentgateway project’s protocol-agnostic approach — handling MCP, A2A, inference, HTTP, and gRPC through one data plane — reflects where the open-source community thinks this is heading.

The “agent gateway” category will merge with the control plane category. Today, some vendors offer gateways (traffic management, routing, security enforcement) and others offer control planes (discovery, governance, observability). Within 18 months, the market will expect both capabilities in a single product. Forrester’s landscape report already evaluates vendors across both dimensions.

FAQs

What is the difference between an AI agent control plane and an AI gateway?

An AI gateway operates at the data-plane level — it routes, authenticates, and monitors traffic between agents and the models or tools they access. A control plane operates at the management level — it discovers agents, applies governance policies, provides observability, and enforces compliance. In practice, the market is converging: enterprise buyers expect both capabilities in a single product, and Forrester evaluates vendors across both dimensions. Think of the gateway as the enforcement mechanism and the control plane as the policy and visibility layer that directs it.

Do I need a control plane if I only have a few AI agents?

You probably have more agents than you think. Forrester’s research found that enterprises systematically undercount their AI agent footprint, especially when factoring in embedded copilots, SaaS-native agents, and shadow deployments. If you have fewer than 10 agents, a formal control plane may be premature — but you still need an inventory. Start with a manual discovery exercise and revisit the control plane decision as your agent count grows past the point where a spreadsheet can track them.

Can I use multiple control planes for different parts of my organization?

You can, but it defeats the purpose. The core value of a control plane is a single, unified view of the entire agent estate. Running multiple control planes recreates the visibility gap the technology was designed to close. If organizational politics require divisional autonomy, consider a federated model: one primary control plane with delegated governance domains for different business units.

How do open-source options like agentgateway compare to commercial platforms?

The open-source agentgateway provides data-plane capabilities — traffic routing, protocol handling, and basic observability — under Apache 2.0 licensing with no vendor lock-in. It does not provide the full governance stack (discovery, compliance frameworks, agent catalogs, risk assessment) that commercial platforms offer. It is best positioned as a building block for platform engineering teams constructing custom governance infrastructure, or as the data-plane layer beneath a commercial control plane.

Which compliance frameworks do AI agent control planes support?

ServiceNow’s AI Control Tower leads with five built-in risk frameworks aligned to NIST and EU AI Act standards. IBM integrates with its AI Ethics frameworks and supports custom policy configurations. Google aligns with its published AI Principles and Cloud IAM policies. Palo Alto focuses on security-first compliance aligned with CISO requirements. The specific frameworks supported should be a primary selection criterion for regulated industries — check whether the vendor’s built-in frameworks match your regulatory obligations or whether you will need to build custom policies.

Key Takeaways

The governance gap is real and urgent: 96% of enterprises run AI agents in production, but only 12% have the tooling to govern them. Forrester has formally recognized “agentic control plane” as an enterprise software category with 33 vendors.
Five distinct approaches have emerged: IBM (build + govern), ServiceNow (govern + comply), Google (build + govern, Google-native), Palo Alto Networks (secure + enforce), and open-source agentgateway (route + observe).
Start with discovery, not vendor selection: Most organizations discover 3-5x more agents than expected. The inventory exercise itself often generates the executive sponsorship for governance investment.
Match the control plane to governance ownership: If governance reports to the platform team, look at IBM or Google. If it reports to the CISO, look at Palo Alto. If it reports to IT operations, look at ServiceNow.
Open protocols (A2A, MCP) will be the interoperability backbone: Prioritize control planes with native support for these standards to avoid lock-in as the multi-agent ecosystem matures.
Plan for governance at build time: If you are building custom agents today, include control plane registration in your deployment pipeline from day one.

References

Forrester, Agentic Control Plane Solutions Landscape, Q2 2026 — forrester.com
IBM, “Agentic Control Plane in IBM watsonx Orchestrate: One Place to Control Every AI Agent” — ibm.com
ServiceNow Newsroom, “ServiceNow expands AI Control Tower to discover, observe, govern, secure, and measure AI” — newsroom.servicenow.com
Bain & Company, “Google Cloud Next 2026: The Agentic Enterprise Control Plane Comes into View” — bain.com
Palo Alto Networks, “Palo Alto Networks Completes Acquisition of Portkey to Secure AI Agents” — paloaltonetworks.com
The New Stack, “Palo Alto Networks makes a $700M-class AI bet on Portkey gateway” — thenewstack.io
Forbes, “Agent Gateways Are Becoming The Control Plane For Enterprise AI” — forbes.com
TWIML AI, “Google Cloud Next ’26: Delivering the Agentic Control Plane” — twimlai.com
Speakeasy, “2026 Is the Year of Enterprise AI Governance” — speakeasy.com
OutSystems, 2026 IT Leader Survey (cited via Google Cloud Next coverage)

Oracle Fusion Agentic Applications: The Complete Guide to AI Agent Studio and the New Pro-Code Builder (2026)

Satish Prasad — Wed, 22 Jul 2026 01:56:42 +0000

When Oracle announced 22 Fusion Agentic Applications across ERP, HCM, SCM, and CX in March 2026, the enterprise software world took notice. But the July 14, 2026 follow-up — opening Oracle AI Agent Studio to pro-code developers and AI coding agents like Codex and Claude Code — is the announcement that changes the game for automation architects and enterprise developers. It signals that Oracle isn’t just adding AI features to its cloud suite. It’s repositioning Fusion as a full development platform for autonomous enterprise applications.

This guide walks through what Fusion Agentic Applications actually are, how the new AI Agent Studio builder experience works across no-code, low-code, and pro-code tiers, what the 22 launch applications cover, how Oracle’s approach compares to Salesforce Agentforce and SAP Joule, and what this means for practitioners building the next generation of enterprise automation.

What Are Oracle Fusion Agentic Applications?
Oracle AI Agent Studio: The Builder Platform
The New Pro-Code Builder Experience (July 2026)
The 22 Fusion Agentic Applications by Domain
Architecture: How Fusion Agentic Applications Work Under the Hood
Pricing and Licensing: What It Actually Costs
Oracle vs. Salesforce Agentforce vs. SAP Joule
What This Means for Automation Architects
Getting Started: A Practitioner’s Roadmap
FAQs
Key Takeaways
References

What Are Oracle Fusion Agentic Applications?

Fusion Agentic Applications are a new class of enterprise software that Oracle introduced with Release 26B. Unlike traditional enterprise applications that record data and wait for human action, agentic applications are outcome-driven systems backed by teams of specialized AI agents that reason, coordinate, and decide — then execute work through Fusion business objects, workflows, tools, policies, approvals, and logged actions.

The distinction matters. A copilot suggests; an agent acts. A chatbot answers questions about invoice status; an agentic application evaluates overdue balances, assesses risk signals, prioritizes collection actions, drafts customer communications, and routes exceptions to human reviewers — all within the existing Fusion security and governance framework.

“Enterprise software is moving beyond systems that record work to systems that actively drive and execute outcomes,” said Chris Leone, Executive Vice President of Applications Development at Oracle, in the July 14 announcement.

Four Defining Characteristics

Oracle defines Fusion Agentic Applications by four properties that separate them from copilots, chatbots, and standalone AI automation tools:

Outcome-driven coordination. Instead of performing individual tasks, teams of specialized agents work together to deliver complete business results. A single agentic application might coordinate a sourcing agent, an engineering analysis agent, and a supplier communication agent — each with its own specialty — to achieve a defined procurement objective.
Adaptive user experience. The UX adapts to the specific user and the work being performed. Rather than presenting a fixed set of screens, the interface reshapes itself based on context, role, and the current state of the workflow.
Autonomous actions. Agents handle routine, repetitive actions autonomously, keeping work moving while surfacing exceptions and decisions that require human judgment. The line between “do it automatically” and “ask a human” is governed by configurable policies and approval hierarchies.
Native Fusion foundation. These applications are built natively on the Oracle Fusion data model, security, and business rules. AI reasoning is grounded in actual business logic — not a disconnected copy of data or an external orchestration layer bolted on after the fact.

This last point is Oracle’s key differentiator. Fusion Agentic Applications inherit identity management, role-based access, approval frameworks, and end-to-end traceability from the existing Fusion runtime. There’s no separate AI infrastructure to manage, no custom integration layer to build, and no additional security model to configure.

Oracle AI Agent Studio: The Builder Platform

Oracle AI Agent Studio for Fusion Applications is the design-time environment where organizations create, extend, deploy, and manage AI agents and agent teams. It was first introduced in March 2025 as a no-code tool for business users and has evolved significantly since then.

What AI Agent Studio Provides

The studio is available at no additional cost to Oracle Fusion Applications customers and includes:

Agent creation and orchestration — Build individual agents and compose them into teams with defined coordination patterns
Advanced testing and validation — Test agents against production-like scenarios before deployment
Oracle AI Agent Marketplace — Access a growing catalog of pre-built agents, workflows, connectors, templates, and (as of the July update) complete agentic applications from Oracle and partners
Built-in security and governance — Agents inherit Fusion’s existing security model, including role-based access, approval hierarchies, and audit trails
Monitoring and observability — Token tracking, telemetry, evaluation tools, and visibility into agent decision paths (added in the October 2025 update)

As of July 2026, Oracle reports over 80,000 certified experts trained in AI Agent Studio — a significant ecosystem investment that suggests Oracle is treating this as a foundational capability rather than a feature release.

The 1,000+ Agent Foundation

Oracle has shipped over 1,000 AI agents across Fusion Applications. These aren’t experimental features; they’re production-grade components embedded into the ERP, HCM, SCM, and CX modules that thousands of organizations already run. The 22 new Fusion Agentic Applications are built on top of this agent foundation, combining multiple agents into coordinated, outcome-driven workflows.

The New Pro-Code Builder Experience (July 2026)

The July 14 announcement is the one that matters most for developers and automation architects. Oracle expanded AI Agent Studio from a primarily no-code/low-code platform into a unified builder experience that spans natural language, low-code, and pro-code development.

What Changed

The centerpiece is the new AI Studio Skill — a capability that lets professional developers use their existing tools to build Fusion Agentic Applications. Specifically, developers can now:

Build agents and agentic applications using Visual Studio Code
Use standard command-line interfaces (CLIs)
Manage code with Git-based lifecycle management
Leverage AI coding agents and assistants — Oracle explicitly named OpenAI Codex and Anthropic’s Claude Code as supported tools
Run local validation, debugging, and CI/CD workflows

“We started with people who can use natural language or a little bit of coding — no-code to low-code people,” said Natalia Rachelson, Senior Vice President of Applications Development at Oracle, in an interview with SiliconANGLE. “But we’re now going to go after what’s called pro-code people.”

How It Works in Practice

The workflow is designed to feel native to professional developers:

Describe the business outcome — The developer specifies what the agentic application should achieve (e.g., “increase renewal rate by 15%”)
Load the AI Studio Skill into a compatible coding agent (Codex, Claude Code, or any tool that supports skill-based prompting)
The coding agent generates Fusion artifacts — agent definitions, workflow configurations, tool bindings, and UX components — all conforming to the Fusion-native framework
Validate and test locally — Standard debugging and validation against the Fusion runtime
Deploy via Git-based CI/CD — The same lifecycle management patterns professional developers already use

Rachelson used a renewal management example: an organization tells an agentic application to increase its renewal rate by a specified amount. The application examines historical and current records, identifies opportunities, drafts renewal contracts, and potentially sends them to customers — all reasoning and transacting against data already in Fusion, not a detached copy.

Developer Resources

Oracle also announced a new public GitHub repository providing templates, starter projects, sample applications, reusable assets, and reference architectures. This is notable because Oracle’s enterprise software development has historically been a closed ecosystem — opening a GitHub-based resource path signals a genuine cultural shift toward meeting developers where they already work.

The 22 Fusion Agentic Applications by Domain

Release 26B introduced 22 Fusion Agentic Applications across four pillars. Here’s what each domain covers:

Oracle Fusion Cloud ERP (Finance)

Application	What It Does
Collectors Workspace	Evaluates overdue balances, invoice aging, risk signals, and recent interactions to recommend high-priority collection actions. Agents prioritize accounts, draft communications, and route escalations.
Cost Accounting Close Workspace	Coordinates cost accounting activities during period close, surfacing exceptions and automating routine reconciliation tasks.
Security Command Centre	Monitors and responds to security-relevant events across the Fusion environment, coordinating investigation and remediation workflows.

Oracle Fusion Cloud HCM (Human Resources)

Application	What It Does
Workforce Operations Command Center	Evaluates staffing coverage, policy risk, absence requests, and timecard data. Handles shift-drop requests with recommended replacements, flags policy conflicts, and enables bulk absence approvals.
Career Advancement Workspace	Coordinates career development activities, matching employee skills and goals to internal opportunities.
Manager Support Workspace	Assists managers with day-to-day people management tasks, surfacing actionable insights about team operations.
Learning & Talent Workspaces	Manages training recommendations, skill gap analysis, and talent pipeline coordination.

Oracle Fusion Cloud SCM (Supply Chain Management)

Application	What It Does
Design-to-Source Workspace	Bridges engineering and strategic sourcing — agents reason across design changes, manage supplier risk, and automate RFQ processes. This is the flagship cross-functional example.
Product Readiness Workspace	Coordinates product launch readiness across engineering, manufacturing, and commercial teams.
Production Shift Operations Workspace	Manages production floor operations, shift scheduling, and equipment utilization.
Sales Order Command Centre	Orchestrates order processing, fulfillment, and exception handling across the order-to-cash cycle.
Batch Process Manufacturing Workspace	Manages batch production workflows for process manufacturing environments.
Logistics Execution Command Centre	Coordinates transportation, warehousing, and delivery operations.
Maintenance Operations Workspace	Manages asset maintenance scheduling, work orders, and predictive maintenance workflows.
Warehouse Operations Workspace	Coordinates warehouse activities including receiving, putaway, picking, and shipping.
Sourcing Command Centre	Manages strategic sourcing activities, supplier evaluation, and procurement optimization.

Oracle Fusion Cloud CX (Customer Experience)

Application	What It Does
Sales Command Center	Identifies contract renewal risk, recommends quote revisions based on total contract value and margin analysis, and prepares customer presentations and follow-up communications.
Service Manager Workspace	Reasons over customer history and support data to identify escalations before they impact satisfaction. Agents coordinate service responses across channels.
Marketing Campaign Workspace	Evaluates customer data, prepares campaign content, and segments audiences for cross-selling initiatives.

These 22 applications ship as part of Release 26B and are available to existing Fusion Cloud customers. They extend the foundation of 1,000+ individual AI agents already embedded in Fusion Applications.

Architecture: How Fusion Agentic Applications Work Under the Hood

Understanding the architecture is essential for automation architects evaluating whether Oracle’s approach fits their enterprise strategy.

Native Runtime, Not a Bolt-On

The critical architectural decision Oracle made is running agentic applications inside the Fusion runtime rather than alongside it. This means:

No separate AI infrastructure. Agents execute within the same runtime that processes business transactions. There’s no external orchestration layer, no message queue between “the AI system” and “the business system.”
Direct business object access. Agents reason against live transactional data — the actual invoices, purchase orders, employee records, and workflows that Fusion manages. Not a synced copy, not an API-mediated view.
Inherited governance. Identity, roles, permissions, approval hierarchies, and audit trails are the same ones the organization already configured for Fusion. There’s no parallel security model to maintain.

This is fundamentally different from the pattern where organizations build AI agents outside their enterprise system and then spend months integrating identity, data access, approvals, audit trails, and lifecycle management. As Rachelson put it: “They’re not really a bolt-on, like an extra layer. They operate inside Fusion.”

Agent-to-Agent Interoperability

Oracle also built support for agent-to-agent interoperability patterns. This means:

Oracle AI Data Platform agents can participate in Fusion workflows
Third-party agents can connect and coordinate with Fusion Agentic Applications
Custom-built agents have the same execution capabilities as Oracle’s own agents

This open execution model is important for enterprises that run multi-vendor AI stacks — it means Oracle isn’t requiring organizations to go all-in on Oracle-only AI.

Auditability and Traceability

Every Fusion Agentic Application provides step-by-step traceability of agent decisions, tools used, and execution paths. This isn’t optional observability — it’s a fundamental requirement for regulated industries (financial services, healthcare, public sector) where “the AI decided” is not an acceptable audit response. Organizations can trace exactly which agent made which decision, what data it accessed, and what policy governed the action.

Pricing and Licensing: What It Actually Costs

Oracle’s pricing model for Fusion AI agents uses an AI Unit consumption framework:

Component	Cost
AI Agent Studio platform	No additional cost for Fusion customers
Basic LLM usage	No AI Unit consumption (effectively free)
Premium LLM usage	~5 AI Units per action (~$0.03–$0.05 per query)
Monthly allocation	20,000 AI Units included
Additional capacity	100,000 AI Unit packs at $1,000 each (with rollover)

Notably, there’s no pricing distinction between seeded Oracle agents, custom-built agents, and marketplace agents — they all consume from the same AI Unit pool. The “no additional cost” framing for the studio itself is significant; Oracle is betting that the platform’s value will come from deeper Fusion adoption and higher cloud consumption, not from charging for the builder tools.

Oracle vs. Salesforce Agentforce vs. SAP Joule: Where Each Wins

Enterprise buyers evaluating agentic AI platforms inevitably compare Oracle’s approach to Salesforce and SAP. Here’s a practitioner’s breakdown:

Dimension	Oracle Fusion Agentic Apps	Salesforce Agentforce	SAP Joule
Architecture	Agents run natively inside Fusion runtime; direct business object access	Agent platform on Salesforce Data Cloud; strong CRM-native integration	Cross-module copilot across SAP landscape; embedded in BTP
Sweet spot	Back-office: Finance, HR, Supply Chain, Manufacturing	Front-office: Sales, Service, Marketing, Commerce	End-to-end value chain for SAP-centric organizations
Builder experience	No-code + low-code + pro-code (VS Code, CLI, Git, AI coding agents)	Low-code Agent Builder + Prompt Builder; limited pro-code extensibility	SAP Build AI + BTP; developer tools improving but less mature
Agent marketplace	Oracle AI Agent Marketplace (expanding to full agentic apps)	AppExchange + Agentforce partner ecosystem	SAP Store + partner solutions
Governance	Inherited from Fusion: roles, approvals, audit trails, policy controls	Einstein Trust Layer; data masking and grounding	SAP security model; data governance via BTP
Pricing model	AI Unit consumption (~$0.01/unit); 20K monthly included	$2/conversation (Agentforce); volume discounts available	Bundled with SAP AI licensing; varies by module
Scale signal	1,000+ agents shipped; 22 agentic apps; 80,000 certified experts	$1.4B ARR with 114% growth; 75%+ of top deals include Agentforce	400,000+ SAP customers; Joule embedded across S/4HANA, SuccessFactors, Ariba

The Verdict

Choose Oracle when your organization runs Oracle Fusion Cloud for back-office operations and needs agentic capabilities that operate natively against ERP, HCM, and SCM data with full governance and auditability. The pro-code builder is the strongest option for organizations with professional developers who want to extend enterprise AI using modern tooling.

Choose Salesforce Agentforce when your priority is customer-facing operations — sales, service, marketing, commerce — and you’re already invested in the Salesforce ecosystem. Agentforce’s conversation-based pricing and commerce integrations (including the new OpenAI/ChatGPT and Google Shopping connections) make it the front-office leader.

Choose SAP Joule when your organization is SAP-centric and needs agentic capabilities that span the full value chain. SAP’s advantage is breadth across 400,000+ customers, but the platform’s complexity (multiple clouds, data models, and process models) makes cross-module agent coordination harder than Oracle’s unified approach.

In reality, many large enterprises will use more than one. Oracle’s support for agent-to-agent interoperability with third-party agents acknowledges this multi-vendor reality.

What This Means for Automation Architects

If you’re an automation architect, solution architect, or RPA developer, Oracle’s Fusion Agentic Applications represent a significant shift in how enterprise automation gets built and deployed. Here’s what to pay attention to:

1. The “Build Outside, Integrate Later” Pattern Is Dying

The traditional approach — build an AI agent externally, then spend months wiring it into the enterprise system’s identity, data, approvals, and audit framework — is increasingly uncompetitive. Oracle, Salesforce, and SAP are all moving toward native agentic capabilities where the AI runs inside the system of record. If you’re still building standalone automation agents that need custom integration layers, the platform vendors are making that pattern obsolete.

2. Pro-Code Access Changes the Developer Equation

Oracle’s decision to open AI Agent Studio to VS Code, Git, CLI workflows, and AI coding agents (Codex, Claude Code) means professional developers can now build enterprise-grade agentic applications without learning Oracle’s proprietary tooling. This is a talent acquisition play — the pool of developers who know VS Code and Git is orders of magnitude larger than the pool who know Oracle’s traditional development tools.

3. The 80,000 Certified Experts Number Matters

For practitioners, the certified ecosystem size is a leading indicator of job market depth. 80,000 certified Oracle AI Agent Studio experts means training programs, partner implementations, and consulting opportunities are already scaling. If you’re considering adding Oracle agentic skills to your resume, the ecosystem is past the early-adopter phase.

4. Governance Is the Differentiator, Not AI Capability

Every major enterprise software vendor now has AI agents. The competitive differentiation has shifted from “can it do AI” to “can it do AI with the governance, auditability, and compliance controls that regulated enterprises require.” Oracle’s native runtime approach — where agents inherit existing Fusion security without additional configuration — is a direct answer to the deployment barrier that kills most agentic automation programs.

Getting Started: A Practitioner’s Roadmap

For automation professionals looking to explore Oracle Fusion Agentic Applications, here’s a practical starting path:

Assess your Fusion footprint. Fusion Agentic Applications require Oracle Fusion Cloud Applications. If your organization runs Fusion ERP, HCM, SCM, or CX, AI Agent Studio is available at no additional cost.
Start with the 22 seeded applications. Don’t build custom agents first. Deploy one of the 22 built-in agentic applications in a sandbox environment and observe how it coordinates agents, handles exceptions, and maintains audit trails. The Collectors Workspace (Finance) and Workforce Operations Command Center (HCM) are good starting points because their outcomes are measurable.
Explore the AI Agent Marketplace. Before building custom, check the marketplace for partner-built agents and agentic applications that address your use case. The marketplace is expanding rapidly.
Set up the pro-code toolchain. If you have professional developers, install the AI Studio Skill and connect it to your existing VS Code + Git workflow. The new GitHub repository provides templates and starter projects that accelerate the learning curve.
Map your governance requirements. Before deploying to production, document which roles, approval hierarchies, and audit requirements apply. Fusion Agentic Applications inherit these controls, but you need to ensure they’re correctly configured in your Fusion environment first.
Measure outcomes, not activity. Fusion Agentic Applications are outcome-driven by design. Set clear success metrics (days sales outstanding reduced by X, renewal rate increased by Y, shift coverage gaps reduced by Z) and track them — don’t just count how many agents you deployed.

Frequently Asked Questions

Do Fusion Agentic Applications require additional licensing beyond Oracle Fusion Cloud?

Oracle AI Agent Studio is available at no additional cost to Fusion Cloud customers. However, AI agent usage consumes AI Units — you get 20,000 monthly units included, and additional 100,000-unit packs cost $1,000 each with rollover. Basic LLM usage has no unit consumption; premium actions cost approximately 5 units each (~$0.03–$0.05 per query).

Can I build custom agentic applications with my own development tools?

Yes, as of July 2026. The new AI Studio Skill supports Visual Studio Code, standard CLIs, Git-based workflows, and AI coding assistants including OpenAI Codex and Claude Code. You can use local validation, debugging, and CI/CD pipelines. A public GitHub repository provides templates and starter projects.

How do Fusion Agentic Applications handle security and compliance?

Agents inherit the existing Fusion security model — role-based access, approval hierarchies, permissions, and policies. Every agent action is logged with step-by-step traceability of decisions, tools used, data accessed, and execution paths. There’s no separate security configuration required beyond what you’ve already set up in Fusion.

Can third-party agents work with Fusion Agentic Applications?

Yes. Oracle supports agent-to-agent interoperability patterns that allow Oracle AI Data Platform agents, third-party agents, and custom-built agents to participate in Fusion workflows with the same capabilities as native agents.

What’s the difference between an AI agent and a Fusion Agentic Application?

An AI agent is a single specialized component that performs a specific task (e.g., analyzing invoice aging). A Fusion Agentic Application is a complete business application composed of multiple coordinated agents, user experiences, workflows, tools, policy controls, approvals, and runtime assets — all working together toward a defined business outcome.

Key Takeaways

Oracle launched 22 Fusion Agentic Applications across ERP, HCM, SCM, and CX in Release 26B — outcome-driven systems powered by coordinated agent teams, not standalone copilots.
The July 2026 pro-code builder opens AI Agent Studio to VS Code, Git, CLI, and AI coding agents (Codex, Claude Code), dramatically expanding the developer audience beyond Oracle specialists.
Native Fusion runtime means agents operate directly against live business data with inherited security, governance, and audit trails — no external orchestration layer required.
Pricing uses an AI Unit consumption model with 20,000 monthly units included and the builder platform at no additional cost.
80,000+ certified experts signal a maturing ecosystem with real job market and consulting demand.
The competitive landscape is domain-specific: Oracle leads in back-office (ERP/HCM/SCM), Salesforce in front-office (Sales/Service/Commerce), and SAP across the full value chain for SAP-centric organizations.
For automation architects, the key insight is that platform-native agentic AI (where agents run inside the system of record) is replacing the “build outside, integrate later” pattern.

References

Oracle. “Oracle Introduces AI-Native Builder Experience to Create and Run Agentic Applications in Oracle Fusion Applications.” Oracle Press Release, July 14, 2026. Link
Dotson, Kyt. “Oracle opens Fusion Agentic Applications to pro-code developers and coding agents.” SiliconANGLE, July 14, 2026. Link
Oracle. “Oracle Introduces Fusion Agentic Applications.” Oracle Press Release, March 24, 2026. Link
Oracle Fusion Development Team. “New Fusion Agentic Applications — details and demos.” Oracle Fusion Insider Blog, May 15, 2026. Link
Oracle. “Oracle Expands AI Agent Studio for Fusion Applications with Agentic Applications Builder.” Oracle Press Release, March 24, 2026. Link
TechTarget. “Oracle AI agent builder brings no-code, low-code and pro-code together.” TechTarget, July 2026. Link
Futurum Group. “Oracle’s Fusion Agentic Apps: Can Platform-First AI Finally Deliver Enterprise ROI?” Futurum Group, 2026. Link
Version1. “Oracle Fusion AI Agents 26C Pricing Explained.” Version1, 2026. Link
ERP Today. “Oracle’s Next Agentic AI Move Puts Builders Inside Fusion.” ERP Today, July 2026. Link
CIO. “Oracle supercharges AI Agent Studio to rival Microsoft, Google, and Salesforce.” CIO, 2026. Link

What to read next on rpabotsworld.com:

GPT-5.6 for Agent Builders: Sol, Terra, Luna Explained

Satish Prasad — Tue, 21 Jul 2026 19:05:36 +0000

On July 9, 2026, OpenAI moved the GPT-5.6 family to general availability — and for the first time, a flagship OpenAI release is less about a single smarter model and more about restructuring how agentic systems get built. GPT-5.6 ships as three durable tiers (Sol, Terra, and Luna), introduces Programmatic Tool Calling that lets the model write its own orchestration code, adds a multi-agent beta to the Responses API, and overhauls prompt caching with explicit breakpoints. If you architect AI agents for a living, this release changes several assumptions your current stack is built on.

This deep dive covers what actually shipped, what the benchmarks say (including where GPT-5.6 clearly does not lead), and what it means for teams running LangGraph, CrewAI, or vendor agent platforms in production.

What OpenAI Shipped: Three Tiers, One Generation

GPT-5.6 introduces a new naming system. The number identifies the model generation; Sol, Terra, and Luna identify capability tiers that OpenAI says will advance on their own cadence, according to OpenAI’s preview announcement. In practice this gives architects a stable routing target: you design workload routing against tiers, not against model version strings that churn every quarter.

Tier	Positioning	Input / Output (per 1M tokens)	Best fit
Sol	Flagship — frontier reasoning, complex coding, multi-step agentic tasks, cybersecurity research	$5 / $30	Planner agents, hard coding tasks, long-horizon workflows
Terra	Balanced tier — GPT-5.5-competitive performance at roughly half the cost	$2.50 / $15	Everyday worker agents, most production traffic
Luna	Speed and cost tier	$1 / $6	Summarization, drafting, classification, high-volume automation

All three tiers offer a 1 million token context window, and all three clear GPT-5.5 on DeepSWE v1.1, SWE-Bench Pro, and Agents’ Last Exam, per MarkTechPost’s analysis of OpenAI’s published eval tables. The 5x price spread across tiers is the point: OpenAI is explicitly inviting you to route by task difficulty rather than defaulting everything to the flagship.

Two new reasoning controls also arrive with this generation: a max reasoning effort that gives Sol maximum time to think, and an ultra mode that goes beyond a single agent by spawning subagents for complex work — more on that below.

The Unusual Launch: A Government-Coordinated Preview First

GPT-5.6 did not launch the normal way. On June 26, OpenAI began a limited preview restricted to a small group of trusted partners whose participation was shared with the U.S. government, ahead of the broad July 9 release. The reason is capability: OpenAI describes GPT-5.6 Sol as its most capable cybersecurity model yet, competitive with far larger models on exploit-development benchmarks while using roughly a third of the output tokens.

The company paired that capability with what it calls its most robust safeguard stack to date: refusal training, real-time cyber and biology misuse classifiers that can pause generation for review by a larger reasoning model, account-level review across conversations, and over 700,000 A100-equivalent GPU hours of automated red-teaming against universal jailbreaks. OpenAI states Sol does not cross the Cyber Critical threshold under its Preparedness Framework, but chose a phased release anyway.

Why this matters to enterprise architects: OpenAI itself warns that safeguards “may occasionally intervene on legitimate work, particularly in dual-use areas.” If your agents do security scanning, penetration-test reporting, or vulnerability triage, budget validation time for false-positive refusals and generation pauses before you migrate those workloads.

Programmatic Tool Calling: The Model Writes the Orchestration Loop

The most architecturally significant feature in this release is Programmatic Tool Calling in the Responses API. The classic pattern — model returns one tool call, your code executes it, you send the result back, repeat — has defined agent development since function calling debuted. It is also slow and token-expensive: every round trip re-sends context.

With Programmatic Tool Calling, GPT-5.6 instead writes JavaScript that orchestrates multiple tool calls itself. That code runs in an isolated, sandboxed V8 runtime with no network access; the runtime invokes your tools, passes intermediate results between calls, and processes outputs without a model round trip at each step. You opt tools in explicitly via an allowed_callers parameter, and the feature is ZDR-compatible with no additional container costs, per OpenAI’s developer documentation.

When to use it — and when not to

Good fit: bounded, tool-heavy workflows where the steps are mechanical — fetch 40 records, filter them, enrich each one, aggregate. OpenAI reports named-customer token reductions of 38% to 63.5% on such workloads.
Poor fit: workflows that require fresh model judgment between steps, human-in-the-loop approval gates, or tools with side effects you need to audit individually. When the model batches ten tool calls inside generated JavaScript, your per-step observability and interception points move — your tracing has to follow.

This is worth saying plainly: Programmatic Tool Calling shifts part of the orchestration loop from your framework into the model vendor’s runtime. That is a genuine efficiency win and a genuine governance question at the same time. If your compliance posture depends on inspecting every tool invocation before it executes, the classic loop is still your pattern. We covered this control-vs-autonomy trade-off in depth in our guide to prompts vs. context vs. loops in AI engineering.

Multi-Agent Mode: Ultra and the Responses API Beta

GPT-5.6’s ultra mode coordinates four agents in parallel by default and synthesizes their results, trading higher token spend for better scores and faster wall-clock results. The measured lift is real but modest on some benchmarks — Terminal-Bench 2.1 goes from 88.8% (single Sol) to 91.9% (Sol Ultra) — and larger on wide-search tasks, where OpenAI also charts 16-agent runs on BrowseComp.

In the API, a multi-agent beta in the Responses API lets developers build ultra-like flows: one GPT-5.6 instance coordinating subagents in parallel — researching independent sources, inspecting separate modules, generating competing designs, or reviewing an artifact from different roles.

The pattern OpenAI is productizing here — supervisor spawns parallel workers, then synthesizes — is exactly what teams have hand-built in LangGraph, CrewAI, Microsoft Agent Framework, and Google ADK. If you are evaluating whether to keep that logic in a framework or push it down into the model API, our multi-agent framework comparison lays out the portability and debugging trade-offs that still favor framework-level orchestration for complex, stateful workflows. The short version: the API beta is compelling for parallel read-heavy work (research, review, inspection), while durable state, checkpointing, and human approval steps still belong in your orchestration layer.

Prompt Caching Grows Up: Breakpoints, TTLs, and a New Line Item

For anyone running high-volume agents, the quiet money feature is caching. GPT-5.6 introduces explicit cache breakpoints and a 30-minute minimum cache life, making cache behavior predictable enough to design around — you can now structure system prompts, tool schemas, and shared context so that the stable prefix caches deterministically.

The catch: from GPT-5.6 onward, cache writes are billed at 1.25x the model’s uncached input rate, while cache reads keep the 90% cached-input discount. For chatty agents that re-read a large stable prefix dozens of times per session, this nets out strongly positive. For workloads with long, unique, once-read prompts, the 1.25x write premium is a new cost-model line item — re-run your token economics before assuming caching is free money.

Benchmarks: Where GPT-5.6 Leads, and Where It Honestly Doesn’t

OpenAI’s published eval tables, summarized by MarkTechPost, show a model family that is state of the art on agentic and terminal-driven work but not uniformly dominant:

Benchmark	GPT-5.6 Sol	Best competitor	Verdict
AA Coding Agent Index v1.1	80	Claude Fable 5 — 77.2	Sol leads, with under half the output tokens
Terminal-Bench 2.1	88.8% (91.9% Ultra)	Claude Mythos 5 — 88%	New state of the art
Agents’ Last Exam	52.7%	Claude Opus 4.8 — 45.2%	Clear lead on long-running professional workflows
OSWorld 2.0	62.6%	Claude Opus 4.8 — 54.8%	Leads, using 85% fewer output tokens
SWE-Bench Pro	64.6%	Claude Mythos 5 — 80.3%	~15-point deficit on a widely watched coding eval
Toolathlon	58%	Claude Fable 5 — 61.7%	Trails on tool-use breadth
AA Intelligence Index v4.1	58.9	Claude Fable 5 — 59.9	Trails on broad intelligence

Three caveats belong in any honest reading. First, OpenAI’s latency and cost figures are offline simulations, not measured production numbers. Second, the headline Agents’ Last Exam figure OpenAI promoted (53.6) does not appear in its own eval table, which lists 52.7% — an unexplained discrepancy. Third, Luna shows a real long-context weakness, dropping to 41.3% on the MRCR v2 8-needle retrieval eval despite the 1M-token window — a reminder that context window size and effective retrieval are different things, a distinction we unpack in our guide to agent memory and RAG.

Vendor-published benchmarks should be the start of your evaluation, not the end of it. Before routing production traffic to any new model tier, run your own task-level evals — our walkthrough of agent quality evaluation with LLM-as-judge covers how to build that harness.

The Bigger Picture: ChatGPT Work and Distribution Everywhere

GPT-5.6 did not arrive alone. The same week, OpenAI launched ChatGPT Work, an agentic workspace that merges ChatGPT with Codex to create documents, presentations, and websites — a direct answer to Anthropic’s Claude Cowork, as press coverage of the launch noted. Distribution moved fast on the enterprise side too: Microsoft made GPT-5.6 the preferred model in Microsoft 365 Copilot, and Amazon announced GPT-5.6 models on Amazon Bedrock — notable because Bedrock-based agent stacks can now mix OpenAI’s new tiers with existing AgentCore infrastructure. OpenAI is also bringing Sol to Cerebras hardware at up to 750 tokens per second for select customers in July.

For architects, the takeaway is that model choice and platform choice continue to decouple. The same GPT-5.6 tiers will show up behind Copilot, Bedrock, and OpenAI’s own API — with different guardrails, pricing, and data-residency properties in each wrapper.

An Adoption Playbook for Agent Teams

1. Re-tier your routing before you re-platform

The cheapest win is routing: move summarization, extraction, and classification traffic to Luna, keep default worker agents on Terra, and reserve Sol (with max reasoning where warranted) for planning and hard coding steps. Terra at GPT-5.5-class performance for half the price is the workhorse story of this release.

2. Pilot Programmatic Tool Calling on one bounded workflow

Pick a tool-heavy, judgment-light workflow with good test coverage. Measure tokens, latency, and — critically — whether your observability stack still shows you what happened inside the batched execution. Do not migrate approval-gated workflows first.

3. Keep durable orchestration in your framework

The multi-agent beta is worth prototyping for parallel research and review patterns, but checkpointing, retries, human-in-the-loop steps, and cross-vendor portability still argue for LangGraph-style graph orchestration at the core of production systems.

4. Re-run your cost model

Explicit cache breakpoints reward deliberate prompt architecture; the 1.25x cache-write premium punishes careless caching. Both belong in your token-economics spreadsheet now.

5. Evaluate against your tasks, not the leaderboard

A model that leads Terminal-Bench but trails SWE-Bench Pro by 15 points is telling you something specific: strong at operating in environments, weaker than the best competitor at deep repository-scale code changes. Which of those resembles your workload is a question only your own evals can answer. Skipping that step is one of the classic mistakes we cataloged in 16 reasons why agentic automation programs fail.

FAQ

What is the difference between GPT-5.6 Sol, Terra, and Luna?

They are capability tiers within one model generation. Sol ($5/$30 per 1M tokens) is the flagship for frontier reasoning and complex agentic work; Terra ($2.50/$15) matches previous-generation GPT-5.5 performance at roughly half the cost; Luna ($1/$6) is the fast, low-cost tier for high-volume tasks. All three have a 1M-token context window.

What is Programmatic Tool Calling in GPT-5.6?

Instead of returning tool calls one at a time for your code to execute in a loop, the model writes JavaScript that orchestrates multiple tool calls itself, running in an isolated V8 sandbox with no network access. Tools must be explicitly opted in via allowed_callers. OpenAI reports customer token reductions of 38–63.5% on tool-heavy workflows.

Does GPT-5.6 replace agent frameworks like LangGraph or CrewAI?

No. Programmatic Tool Calling and the multi-agent beta absorb some orchestration work for bounded, parallel, read-heavy patterns, but durable state, checkpointing, human-in-the-loop gates, cross-model portability, and auditability still favor framework-level orchestration for production systems.

Is GPT-5.6 better than Claude for coding agents?

It depends on the task. Sol leads the Artificial Analysis Coding Agent Index (80 vs. 77.2) and Terminal-Bench 2.1, but trails Claude Mythos 5 by roughly 15 points on SWE-Bench Pro and trails Claude Fable 5 on Toolathlon and the broad Intelligence Index. Run task-level evals on your own workloads before switching.

Why did GPT-5.6 launch through a government-coordinated preview?

Because of its cybersecurity capabilities, OpenAI previewed the models with the U.S. government and began with a limited preview for vetted partners on June 26 before general availability on July 9, pairing the release with layered safeguards including real-time misuse classifiers and extensive automated red-teaming.

Key Takeaways

GPT-5.6 ships as three durable tiers — Sol ($5/$30), Terra ($2.50/$15), Luna ($1/$6) per 1M tokens — designed for routing workloads by difficulty.
Programmatic Tool Calling moves the orchestration loop into model-written JavaScript in a sandboxed V8 runtime, cutting tokens 38–63.5% on tool-heavy workflows — at the cost of per-step interception points.
The ultra mode and Responses API multi-agent beta productize the supervisor/parallel-worker pattern; frameworks still win for stateful, auditable orchestration.
Prompt caching gains explicit breakpoints and a 30-minute minimum life, but cache writes now bill at 1.25x — re-run your cost model.
Benchmarks are strong but not uniform: state of the art on Terminal-Bench, Agents’ Last Exam, and OSWorld; ~15 points behind Claude Mythos 5 on SWE-Bench Pro; Luna is weak on long-context retrieval.
The release was phased through a government-coordinated preview due to cyber capabilities — expect occasional safeguard interventions on legitimate dual-use security work.

References

Claude Sonnet 5 vs GPT-5.6 vs Grok 4.5 vs Muse Spark 1.1: The July 2026 Agentic AI Model Showdown

Satish Prasad — Tue, 21 Jul 2026 17:09:22 +0000

Between June 30 and July 9, 2026, four frontier AI labs shipped major new models within days of each other. Anthropic released Claude Sonnet 5 on June 30. xAI dropped Grok 4.5 on July 8. OpenAI followed with GPT-5.6 (in three tiers — Luna, Terra, and Sol) on July 9. Meta launched Muse Spark 1.1 and its first paid developer API the same day.

For teams building agentic automation pipelines, this is the most consequential model refresh cycle since GPT-4 reshaped what “tool use” meant. Each model is explicitly engineered for agentic workloads — long-horizon planning, multi-step tool calling, computer use, and subagent orchestration — and each makes different trade-offs on price, speed, context length, and benchmark performance.

This article cuts through the launch noise. We benchmarked the public numbers, mapped the pricing, and mapped each model to the automation scenarios where it earns its cost. No single winner — but a clear decision framework for the teams actually building with these models.

What Just Happened: Four Models in 48 Hours

The compressed timeline matters because it was not coincidental. The major labs have been tracking each other’s release calendars, and the simultaneous push reflects a broader strategic shift: the battlefield has moved from chat to agents. Each launch announcement led with agentic capability claims, not raw reasoning scores on academic benchmarks.

Model	Lab	Release Date	Tiers
Claude Sonnet 5	Anthropic	June 30, 2026	Single model
Grok 4.5	xAI (SpaceXAI)	July 8, 2026	Single model, configurable reasoning
GPT-5.6 (Sol / Terra / Luna)	OpenAI	July 9, 2026	Three distinct tiers
Muse Spark 1.1	Meta Superintelligence Labs	July 9, 2026	Single model, “Thinking” consumer mode

Quick-Reference Decision Table

If you are in a hurry, this table summarises the key dimensions. The detailed analysis follows.

Dimension	Claude Sonnet 5	GPT-5.6 Sol	GPT-5.6 Luna	Grok 4.5	Muse Spark 1.1
Input cost ($/1M tokens)	$2.00	$5.00	$1.00	$2.00 ($0.50 cached)	$1.25
Output cost ($/1M tokens)	$10.00	$30.00	$6.00	$6.00	$4.25
Context window	1M tokens	1M tokens	1M tokens	500K tokens	1M tokens
SWE-Bench Pro	82.1%	~88% (Sol)	61.3%	54.7%	Not published
MCP Atlas (tool use)	Not published	Not published	Not published	Not published	88.1
Agent reliability (multi-step)	94.8%	Not published	97.2%	89.6%	Not published
Best for	Code review, long codebases	Hard coding, research agents	High-volume agentic pipelines	Speed, SaaS automation	MCP tool use, computer use

Claude Sonnet 5: Anthropic’s Most Agentic Sonnet Yet

Anthropic shipped Claude Sonnet 5 on June 30, positioning it as a model that closes most of the gap with Opus 4.8 at a fraction of the cost. The headline number is SWE-Bench Pro at 82.1% — the strongest code-repair score of any mid-tier model in this cohort, and a meaningful lead over Grok 4.5’s 54.7% on the same benchmark.

What Changed for Agentic Use

The most significant change for automation builders is the jump in computer use reliability. Sonnet 5 can navigate browser-based workflows — competitive analysis, procurement approvals, customer onboarding forms — with enough stability to deploy in production, not just demos. Anthropic’s own documentation highlights “greater accuracy and reliability” in digital environment navigation compared to Sonnet 4.6.

Planning depth is the other upgrade that matters. Long-horizon tasks — think 40+ step UiPath-integrated workflows or multi-agent LangGraph pipelines — previously required Opus 4.8 to hold coherent state across the full run. Sonnet 5 handles that load at roughly one-fifth the output token cost. For architects who have been using Opus as a planning layer and cheaper models for execution, Sonnet 5 opens the question of whether you can consolidate to a single model tier.

Pricing and Availability

Sonnet 5 launches at $2.00 per million input tokens and $10.00 per million output tokens, with introductory pricing locked until August 31, 2026. It is the default model for all Claude plans — Free, Pro, Max, Team, and Enterprise — and is available in Claude Code and via the Claude Platform API. Compared to Opus 4.8’s $25 per million output tokens, it represents a 60% reduction in output cost while covering the majority of what most agentic pipelines actually need from a frontier model.

Safety Profile

Anthropic’s safety assessments confirmed that Sonnet 5 shows a lower rate of undesirable behaviours than Sonnet 4.6 in agentic contexts — a relevant data point for teams deploying autonomous agents with real tool access, where a model that refuses spuriously or hallucinates tool parameters has direct operational cost. The system card notes particular improvements in instruction-following stability over long agent runs.

For teams already using LangGraph agents deployed on UiPath, Sonnet 5 is the natural first test upgrade — the API surface is unchanged, and the improvement in coding and debugging makes it a drop-in improvement for the planning node.

GPT-5.6 (Sol, Terra, Luna): OpenAI’s Three-Tier Strategy

OpenAI’s July 9 launch was notable not for a single model but for a deliberate three-tier architecture that forces buyers to make an explicit trade-off at contract time: Sol for the hardest work, Terra for everyday agentic tasks, Luna for high-volume pipelines where cost per call dominates the decision.

The Three Tiers Explained

Sol ($5 input / $30 output per 1M tokens): The new flagship. OpenAI calls it “a step function better than GPT-5.5, not an incremental polish.” Terminal-Bench 2.1 score of 88.8%, rising to 91.9% in the new “ultra” sub-agent mode where Sol spawns parallel subagents to attack different parts of a complex coding task. Leads the Artificial Analysis Coding Agent Index at 80, 2.8 points above Claude Fable 5.
Terra ($2.50 input / $15 output per 1M tokens): The everyday workhorse. The most interesting benchmark here is agent reliability in multi-step tool chains: Terra scores 97.2%, ahead of Sonnet 5’s 94.8% and Grok 4.5’s 89.6%. For production pipelines where a failed step means a failed transaction, reliability beats raw reasoning score.
Luna ($1 input / $6 output per 1M tokens): Speed and cost. GPT-5.5 level performance at roughly half the cost of Terra. The obvious routing target for high-frequency tasks in a hybrid pipeline — summarisation, classification, parameter extraction — where you pay Sol-level prices only for the steps that genuinely need them.

All three models share a 1-million-token context window, 128,000 maximum output tokens, and a February 2026 knowledge cutoff. The shared specs matter for agent architects: you can route across tiers in a single pipeline without hitting context-length mismatches.

Programmatic Tool Calling

The architectural novelty in GPT-5.6 is Programmatic Tool Calling — the model can write and execute JavaScript in an isolated V8 runtime (no network access) as a native step rather than calling an external code execution tool. For automation engineers building tool-calling chains, this collapses one integration layer. The model can compute transformations, validate data shapes, and route decisions in-process rather than round-tripping to a separate executor.

When paired with the multi-agent frameworks like LangGraph or Microsoft Agent Framework, GPT-5.6 Terra in particular becomes a strong candidate for the orchestrator node in a heterogeneous agent system — high reliability, native computation, and a context window large enough to hold the full session state of most enterprise workflows.

Grok 4.5: xAI’s Speed-and-Cost Contender

Grok 4.5 launched publicly on July 8, 2026, positioned as xAI’s flagship for coding, agentic tasks, and knowledge work. Its benchmark profile is the inverse of Sonnet 5: lower on coding correctness (SWE-Bench Pro at 54.7%), but top of the field on agentic tool use and Terminal-Bench 2.1 at 83.3%. It ranks fourth on the Artificial Analysis Intelligence Index overall but leads on agentic tool use specifically.

The Configurable Reasoning Design

Grok 4.5 ships with a reasoning_effort parameter (low / medium / high, with high as default) that lets architects explicitly trade compute cost for reasoning depth on a per-call basis. This is an important design choice for agentic systems: you can route high-stakes planning steps through high reasoning and high-frequency tool calls through low reasoning without switching models. The same API key, the same pricing curve, different cost-per-call based on the task complexity you pass in.

SaaS and Office Automation

The integrations that set Grok 4.5 apart for enterprise automation are its native SaaS connectors. xAI built direct integrations with Gmail, Google Sheets, Slack, and — unusually — Microsoft Office (Word, PowerPoint, Excel via Office plugins). For automation engineers building cross-application workflows that touch both productivity suites, this is a native capability that other models require external tool wrappers to replicate.

Cursor integration is also first-class: Grok 4.5 is available on all Cursor plans and handles “full application generation, PR stacks, and debugging across languages” according to the launch documentation, including Rust and C/C++ — a differentiated capability for teams building systems software rather than web applications.

Pricing

At $2.00 per million input tokens and $6.00 per million output tokens (with cached input at $0.50 per million), Grok 4.5 is the most cost-efficient option when comparing input-to-output cost ratios. The 500K context window — half of the others in this cohort — is the meaningful constraint. Workflows that need to hold large codebases, long conversation histories, or extensive retrieval context in a single call will hit this limit. Workflows that don’t gain Grok 4.5’s speed and cost advantage without the ceiling being a concern.

Muse Spark 1.1: Meta Enters the Paid Agent Race

Meta’s launch of Muse Spark 1.1 on July 9 was significant not just as a model release but as a strategic pivot: it accompanied the opening of the Meta Model API in public preview, marking the end of Meta’s open-weight-only commercial strategy. Mark Zuckerberg announced it on X (his first post there in three years), and the benchmark claim that led — top score on MCP Atlas at 88.1 — was the sharpest positioning move in the cohort.

MCP Atlas and Tool Use at Scale

MCP Atlas is a benchmark specifically designed to test scaled tool use — not just whether a model can call a tool correctly once, but whether it can orchestrate many tool calls in the right sequence across a complex task. Muse Spark 1.1’s score of 88.1 is the highest published figure for any model on this benchmark as of July 2026. For teams building MCP-native agent stacks, this is the signal that matters most. The model “zero-shot generalises to new native tools, MCP servers, and custom skills” according to Meta’s launch documentation — meaning it can use a newly registered MCP server without fine-tuning or few-shot examples.

Parallel Subagent Orchestration

Muse Spark 1.1 is explicitly designed for multi-agent systems. Meta describes it as trained to operate as both an orchestrator (gathering context, forming a plan, delegating execution across parallel subagents to optimise end-to-end latency) and as a subagent (adhering to its assigned scope, understanding available tools, and knowing when to escalate back). This dual-mode design makes it a natural fit for hierarchical agent architectures — the kind that agent memory frameworks and RAG pipelines typically require at both the planning and retrieval layers.

For teams building agentic workflows where agent memory and RAG are core components, the 1M context window with active context management (the model can compact earlier steps while retaining critical decision points) makes Muse Spark 1.1 worth evaluating as the backbone model.

API Compatibility

One practical adoption advantage: the Meta Model API is self-serve and speaks both the OpenAI SDK (Chat Completions and Responses formats) and the Anthropic Messages format. Pointing an existing agent at Muse Spark 1.1 is a base-URL and API key change, not a rewrite. For teams running OpenAI-SDK-based LangGraph agents, this significantly reduces the switching cost for a trial evaluation.

Pricing

At $1.25 per million input tokens and $4.25 per million output tokens, Muse Spark 1.1 is the most affordable 1M-context model in this cohort by a meaningful margin. The output token cost is more than 55% cheaper than Claude Sonnet 5 and 86% cheaper than GPT-5.6 Sol. For agentic pipelines where output token volume is the primary cost driver — which is almost all of them, once you account for planning outputs, tool call responses, and intermediate reasoning — that gap compounds quickly at scale.

Head-to-Head: Agentic Performance Benchmarks

Comparing these models on benchmarks requires a caveat upfront: the labs publish different benchmarks, run them on different dates, and sometimes benchmark themselves on different versions of the same test suite. Where possible, we use third-party benchmark data from sources like the Artificial Analysis indexes and BenchLM rather than vendor-reported numbers alone.

Benchmark	Claude Sonnet 5	GPT-5.6 Sol	GPT-5.6 Terra/Luna	Grok 4.5	Muse Spark 1.1
SWE-Bench Pro (coding repair)	82.1%	~88%	61.3% (Luna)	54.7%	Not published
Terminal-Bench 2.1	76.1%	88.8% (91.9% ultra)	Not published	83.3%	Not published
MCP Atlas (tool use)	Not published	Not published	Not published	Not published	88.1
Agent reliability, multi-step	94.8%	Not published	97.2%	89.6%	Not published
Agentic score (avg)	81.9	Not published	Not published	83.3	Not published
Artificial Analysis Intelligence Index	Not in top 4	Not published	Not published	#4 (score 54)	Not published

What the benchmark gaps actually tell you: The non-published cells are not gaps in capability — they reflect that each lab chose to benchmark against the metrics where their model looks best. Meta published MCP Atlas because it tops that leaderboard. OpenAI published Terminal-Bench because Sol tops that one. A practitioner reading this table should treat the missing cells as “unknown” rather than “poor.” The agent reliability column, where Terra’s 97.2% and Sonnet 5’s 94.8% are directly comparable, is the most practically useful single number for teams building production multi-step pipelines — a 3-point reliability gap at 10 steps compounds to a meaningful difference in end-to-end success rate.

The Price War Paradox: Cheaper Tokens, Higher Agentic Bills

The simultaneous launches triggered a price war. Luna landed at $1 per million input tokens, Muse Spark at $1.25, and Grok 4.5 at $2 — versus $25-50 for legacy flagships. Output token costs fell similarly. On the surface this looks like unambiguous good news for automation builders.

It is partially good news, and partially a trap.

Agentic workflows consume between 5 and 30 times more tokens per user task than the equivalent 2024 prompt-and-response interaction, according to analysis published by Finout and DigitalApplied in July 2026. The reason is structural: an agent pipeline turns a single user request into repeated rounds of planning, retrieval, tool calls, validation, retries, and synthesis. Each round generates output tokens. The price per token can fall 80% while the cost per completed task rises, because the task now involves 10x as many tokens.

Forbes’ analysis of enterprise AI cost data noted that 73% of enterprises exceeded their original AI cost projections last fiscal year despite declining per-token prices. The disconnect is the shift from chat to agentic use patterns — a shift these July 2026 model releases are explicitly accelerating.

This is not an argument against adopting the new models. It is an argument for building cost-aware routing into your agent architecture from day one, rather than discovering the bill at the end of the month. A good model router — something we cover in the context of prompt vs. context vs. loop engineering — should assign the cheapest model capable of each task, not the best model available. Luna or Muse Spark for summarisation and classification. Terra or Sonnet 5 for planning and code generation. Sol or Opus 4.8 for complex debugging and multi-agent orchestration where getting it right on the first pass is cheaper than retrying three times at a lower tier.

Which Model Should You Use?

Use Claude Sonnet 5 when:

Your primary workload is code review, refactoring, or long-codebase comprehension (82.1% SWE-Bench Pro is the best mid-tier score available)
You are already deployed on the Anthropic platform and want a cost-efficient upgrade from Opus 4.8
Your agent needs computer-use reliability in browser-based workflows
Safety constraints matter and you need documented safety assessments (Anthropic’s system card is publicly available)

Use GPT-5.6 Sol when:

You need the absolute frontier for hard coding and terminal-based agent tasks (88.8% Terminal-Bench, rising to 91.9% in ultra mode)
You are building research agents that need to decompose complex tasks into parallel subagents (the “ultra” mode is purpose-built for this)
Cost is a secondary concern relative to first-pass accuracy

Use GPT-5.6 Terra when:

Production reliability is the non-negotiable (97.2% agent reliability in multi-step tool chains is the highest published figure in this cohort)
You are building customer-facing automation where a failed step has direct business cost
You want a mid-range price point that supports native Programmatic Tool Calling

Use GPT-5.6 Luna when:

You are routing high-frequency, lower-complexity tasks (summarisation, extraction, classification) in a multi-tier agent pipeline
Per-call cost is the primary constraint and GPT-5.5 level performance is sufficient

Use Grok 4.5 when:

Your workload is SaaS automation involving Gmail, Google Sheets, Slack, or Microsoft Office — where native integrations reduce your tool-wrapping overhead
Speed-per-dollar matters more than context length (500K is the ceiling, but the price and configurable reasoning design are optimised for throughput)
You are building in Cursor and want first-class IDE integration

Use Muse Spark 1.1 when:

MCP-native agentic tool use is your core pattern (88.1 MCP Atlas is the top published score)
You need computer use across desktop, browser, and mobile in the same pipeline
Cost is a primary driver — at $4.25 per million output tokens, it is the most affordable 1M-context model in this cohort
You want OpenAI SDK or Anthropic Messages API compatibility without a codebase rewrite

The Hybrid Routing Play

The biggest practical takeaway from this model generation is that the “pick one model” question is the wrong question. The per-token costs are now low enough and the capability differentiation clear enough that a hybrid routing layer pays for itself.

Analysis from BenchLM and independent automation engineers suggests that intelligent model routing cuts API spend by 30-45% compared to defaulting to the highest-ranked model on every call. The architecture is not complex: an orchestrator node classifies the incoming task by type and complexity, routes it to the appropriate model tier, and merges the result back into the shared agent state. For teams already building on LangGraph, this is a router node with a conditional edge. For UiPath-based pipelines, it is a decision activity upstream of the LLM activity.

The four-model landscape that just shipped maps cleanly to a four-tier routing strategy:

Frontier tier (GPT-5.6 Sol): Complex coding, hard debugging, multi-stage research with parallel subagents
Workhorse tier (Claude Sonnet 5 / GPT-5.6 Terra): Planning, code generation, code review, browser automation
Tool-use tier (Muse Spark 1.1): MCP orchestration, computer use, cross-application workflow coordination
Volume tier (GPT-5.6 Luna / Grok 4.5): Classification, summarisation, parameter extraction, high-frequency lightweight calls

This is also the moment to revisit your agent quality evaluation setup. When you add a routing layer, your evaluation pipeline needs to account for model-switching — a task that was previously always routed to Opus 4.8 may now go to Sonnet 5 or Terra, and your LLM-as-judge evaluator needs to be calibrated to the new expected output distribution.

For teams just starting to build multi-model routing, the most common failure mode is assuming the router itself is free — it adds latency, complexity, and its own error surface. Build the routing logic after you have validated each model tier independently on your specific workload, not before. A wrong routing decision that sends a hard coding task to Luna at 61.3% SWE-Bench Pro will cost you more in retries than you saved on the routing.

Key Takeaways

Four major agentic AI models shipped in 48 hours in early July 2026: Claude Sonnet 5, Grok 4.5, GPT-5.6 (Sol/Terra/Luna), and Muse Spark 1.1.
No single model wins across all agentic workloads — each has a clear benchmark leadership position for a specific task type.
Claude Sonnet 5 leads on code review and long-codebase work (82.1% SWE-Bench Pro); GPT-5.6 Sol leads on hard terminal tasks (88.8% Terminal-Bench); Muse Spark 1.1 leads on MCP tool use (88.1 MCP Atlas); GPT-5.6 Terra leads on multi-step agent reliability (97.2%).
Muse Spark 1.1 is the most cost-efficient 1M-context model at $4.25/M output tokens, with OpenAI and Anthropic SDK compatibility for zero-friction adoption.
Per-token prices fell significantly, but agentic workflows consume 5–30x more tokens per task than chat — so total spend may still rise if you do not build cost-aware routing.
A hybrid four-tier routing strategy (frontier / workhorse / tool-use / volume) can cut API spend by 30–45% versus defaulting to the strongest model on every call.
Evaluate your agent quality metrics and LLM-as-judge setup before switching model tiers — output distributions shift, and your evaluator needs to account for it.

FAQ

Is Claude Sonnet 5 better than GPT-5.6 for agentic AI?

It depends on the workload. Claude Sonnet 5 leads on code review and long-codebase comprehension (SWE-Bench Pro 82.1% vs GPT-5.6 Luna’s 61.3%). GPT-5.6 Sol leads on terminal tasks and multi-agent orchestration. GPT-5.6 Terra leads on agent reliability in multi-step tool chains (97.2%). The right answer is task-specific, and many production teams will route across both.

What is Muse Spark 1.1 and how does it compare to Claude and GPT?

Muse Spark 1.1 is Meta’s first paid commercial AI model, released July 9, 2026. It tops the MCP Atlas benchmark for scaled tool use (88.1), operates as both orchestrator and subagent in multi-agent systems, and is the cheapest 1M-context model in this cohort at $1.25/$4.25 per million tokens. It is compatible with the OpenAI SDK and Anthropic Messages API, making it easy to test as a drop-in replacement for Claude or GPT in existing pipelines.

Is Grok 4.5 worth it for enterprise agentic automation?

Yes, for specific workloads. Grok 4.5’s native SaaS integrations (Gmail, Sheets, Slack, Microsoft Office) and configurable reasoning effort make it the strongest option for SaaS automation pipelines and Cursor-based development workflows. The 500K context limit is a real constraint for large codebases or long conversation histories, so evaluate your context requirements first.

Will the AI price war actually lower my agentic automation costs?

Lower per-token prices help, but agentic workflows use 5–30x more tokens per task than chat interactions. Without model-tier routing, switching to a cheaper model for the wrong tasks can increase total spend through higher retry rates. Build cost-aware routing into your architecture from the start and measure cost-per-completed-task, not cost-per-token.

Which model should I use for a LangGraph multi-agent pipeline?

A common production pattern: Muse Spark 1.1 or GPT-5.6 Terra as the orchestrator node (strong MCP and tool-use reliability), Claude Sonnet 5 for subagents that do code generation or analysis, and GPT-5.6 Luna or Grok 4.5 for high-frequency classification and extraction subagents. Validate routing logic against your own workload benchmarks before optimising on published scores alone.

References

Anthropic. Introducing Claude Sonnet 5. anthropic.com/news/claude-sonnet-5, June 30, 2026.
TechCrunch. Anthropic launches Claude Sonnet 5 as a cheaper way to run agents. techcrunch.com, June 30, 2026.
TechRadar. Claude Sonnet 5 is here, and the ‘most agentic Sonnet model yet’ shows that the AI war is shifting from chat to agents. techradar.com, June 30, 2026.
MarkTechPost. OpenAI Releases GPT-5.6: A Three-Tier Model Family With Programmatic Tool Calling. marktechpost.com, July 9, 2026.
Simon Willison. The new GPT-5.6 family: Luna, Terra, Sol. simonwillison.net, July 9, 2026.
Meta AI. Introducing Muse Spark 1.1. ai.meta.com/blog/introducing-muse-spark-meta-model-api/, July 9, 2026.
MarkTechPost. Meta Superintelligence Labs Releases Muse Spark 1.1: A Multimodal Reasoning Model for Agentic Tasks. marktechpost.com, July 9, 2026.
eesel AI. Grok 4.5: benchmarks, pricing, and what it means. eesel.ai/blog/grok-4-5, July 8, 2026.
BenchLM. Claude Sonnet 5 vs Grok 4.5: Benchmarks, Pricing, Speed (July 2026). benchlm.ai, July 2026.
Finout. AI Model Cost Breakdowns: The Complete 2026 Comparison Guide. finout.io, July 2026.

Prompts vs. Context vs. Loops: The 2026 AI Engineering Map

Satish Prasad — Sun, 19 Jul 2026 12:27:00 +0000

“The model can’t follow instructions.”

That complaint gets filed against three completely different bugs, and most teams debug it as if it were always the same one. Sometimes the instruction really is ambiguous. Sometimes the instruction is fine but the model never saw the right file, the right ticket, or the right schema. And sometimes the model had everything it needed and still drifted off course by step twelve of a forty-step run. Three different failures, one symptom, and if you fix the wrong layer you’ll spend a week rewriting a prompt to solve a problem that was never about the prompt.

This is the actual substance behind the “loops vs. context vs. prompts” question that’s been circulating since a June 2026 post by Peter Steinberger — founder of OpenClaw — crossed six million views in days: “You shouldn’t be prompting coding agents anymore. You should be designing loops that prompt your agents.”[^1] Boris Cherny, who built Claude Code at Anthropic, said the same thing on stage days earlier: “I don’t prompt Claude anymore. I have loops that are running… My job is to write loops.”[^2] Within weeks, Andrew Ng had mapped his own three nested loops for product building, and developer Addy Osmani had given the pattern a name: loop engineering.[^3]

None of this retires prompt engineering. It’s not a succession story — prompts got replaced by context, which got replaced by loops. It’s a layering story. Each discipline solves a harder problem than the one before it, and production AI systems in 2026 need all of them at once. This guide maps all four layers — prompt, context, harness, and loop engineering — explains what each one actually controls, gives you a diagnostic for finding your real bottleneck, and covers what this means if you’re building agentic automation on UiPath, LangGraph, or any stack in between.

The short answer

Four disciplines, one ratchet. Each term took over when task complexity broke the previous layer, and each corresponds to a progressively harder question about the same underlying goal: getting a model to do the right thing.[^4]

Layer	Time horizon	Core question	What you engineer	Failure mode
Prompt Engineering	This one turn	Did I say it clearly?	Instruction wording, rubric, role, examples	Model misreads or ignores a constraint
Context Engineering	This moment in this turn	Does the model see the right information?	Retrieval, tool outputs, message history, file structure	Model is fluent but confidently wrong
Harness Engineering	One full agent run	Does it keep doing the right thing across steps?	Tools, state, verification, error recovery	Drift, error compounding, false “done”
Loop Engineering	Many runs, hours to days, unattended	Does the work happen without me?	Triggers, scheduling, verifiers, budgets, stop conditions	Runaway cost, or you’re the bottleneck

Each layer contains the one before it: the prompt is an object inside the context window; the context pipeline is a subsystem inside the harness; the harness is what a loop calls, repeatedly, on a schedule.[^4] A sloppy prompt still hurts inside the best-engineered loop ever built — nothing here is optional, it’s additive.

Why “the model can’t follow instructions” is usually a misdiagnosis

Puppyone’s practitioner framing gives a genuinely useful three-question diagnostic for figuring out which layer actually broke, and it’s worth running before you touch anything:[^5]

Can you reproduce the bad output with the same prompt and the same context, twice? If yes, it’s a prompt problem — the instruction is ambiguous or under-specified, and that’s deterministic. If it’s flaky within a single turn, something noisy got into the input: stale retrieval, tool-output variance, hidden state. That’s a context problem, and it’s stochastic because an upstream retriever or tool changed the input, not because the prompt changed.

Did the model receive the right files, tools, and facts? If no, that’s context engineering, full stop. If yes, but the model ignored them anyway, you’re looking at a prompt problem (wrong priority order) or a loop policy problem (bad tool selection, incomplete acceptance criteria).

Did it know when to stop? If no — the loop kept retrying, drifted in scope, or claimed success falsely — that’s a harness or loop problem, and it’s almost always traceable to a missing or weak verifier.

The reason this matters operationally: most teams over-invest in the layer that’s fashionable and under-invest in the layer that’s actually broken. A team that keeps rewriting prompts when their real problem is stale retrieval is polishing the wrong floor of the building.

Layer 1: Prompt Engineering — say it clearly

Prompt engineering is the founding discipline: same model, different phrasing, different output. “Summarize this article” gets mush; “as a senior editor, summarize in three paragraphs — core claim, evidence, limitations, 150 words each” gets something usable. The toolkit is well established: role assignment, few-shot examples, step-by-step decomposition, output format contracts, explicit refusal boundaries.[^4]

Mechanically, an LLM is a context-sensitive probability machine. A role shifts the sampling distribution toward that persona’s training data; examples establish a pattern to continue; explicit constraints raise the weight of compliance. Prompting doesn’t command the model — it shapes the probability space the answer gets drawn from.

The ceiling: prompts can’t conjure facts that aren’t present. A perfectly worded instruction to “analyze our internal architecture doc” still fails if the doc was never in the context window. Prompt engineering solves the expression problem. It cannot solve the information problem, and the moment work shifts from open-ended Q&A to “do something with my data,” the center of gravity moves to the next layer.

Layer 2: Context Engineering — feed it the right information

Anthropic’s own applied AI team frames this precisely: context engineering is “the set of strategies for curating and maintaining the optimal set of tokens (information) during LLM inference, including all the other information that may land there outside of the prompts” — tools, message history, retrieved documents, MCP connections, everything the model sees beyond the instruction itself.[^6] The prompt is one curated object inside the context window, not the whole interface.

Two forces make this a real engineering discipline rather than a bigger prompt. First, agents generate far more candidate context than chat ever did — an agent run can touch dozens of tools, each spraying results, errors, and state into a finite window. Second, and less intuitively, more context isn’t free. Anthropic’s team points to context rot: as token count in the window increases, a model’s ability to accurately recall information from that context decreases, even in models built for long context.[^6] This isn’t a bug you can prompt around — it’s architectural. Transformers give every token pairwise attention to every other token (n² relationships for n tokens), so as context grows, that attention gets stretched thinner, and models simply have less training-data experience with long-range dependencies than short ones.

The practical rule Anthropic states directly: good context engineering means finding the smallest possible set of high-signal tokens that maximizes the likelihood of the desired outcome.[^6] Concretely, that plays out in a few well-documented techniques:

Compaction — summarizing a conversation nearing its context limit and reinitiating with the summary, preserving architectural decisions and unresolved issues while discarding redundant tool output. Claude Code does this automatically, keeping the five most recently accessed files alongside the compressed summary.[^6]
Structured note-taking (agentic memory) — the agent writes persistent notes outside the context window (a NOTES.md, a to-do list) and reads them back after a context reset. This is how long-running agents maintain coherence across thousands of steps without ever holding the full history in the window at once.[^6]
Just-in-time retrieval — rather than pre-loading everything, the agent keeps lightweight references (file paths, stored queries) and loads data on demand, mirroring how humans use file systems and bookmarks instead of memorizing entire corpora.[^6]
Sub-agent architectures — specialized sub-agents explore extensively in their own clean context windows (tens of thousands of tokens each) and return only a condensed summary (1,000–2,000 tokens) to the lead agent, keeping the orchestrating context lean.[^6]

Bad tool design is one of the most common context failures in practice: a bloated tool set with overlapping functionality creates ambiguous decision points, and “if a human engineer can’t definitively say which tool should be used in a given situation, an AI agent can’t be expected to do better.”[^6]

The ceiling: perfect inputs, unsupervised execution. The model can have every fact it needs and still plan well, then drift by step seven; misread a tool result and build on the misreading; or report confident completion on work that doesn’t actually run. Input quality was never the whole game, because nobody was watching the work happen turn over turn. That’s the next layer.

Layer 3: Harness Engineering — control the run

LangChain states this one as a formula: Agent = Model + Harness. Harness = Agent − Model.[^4] The harness is everything around the weights — what the model sees, what it can touch, how steps sequence, what persists, who checks the output, and what happens on failure. Oracle’s engineering team makes the same split explicit: an agent’s architecture has two separable layers, the model (the inference engine that reasons) and the harness (the code that prepares context, executes tool calls, enforces constraints, and persists state) — and most agent engineering work happens in the harness, not the model.[^7]

The analogy that lands cleanest: you brief a new hire before an important client visit (that’s the prompt). You hand them account history and a pricing sheet (that’s context). But if the meeting genuinely matters, you also send a checklist, require a check-in call at key milestones, review the recording afterward, and check results against criteria. Nobody skips that bundle for high-stakes human work. Harness engineering is refusing to skip it for agents.[^4]

Concretely, a harness owns:

Tool integration — defining tools with clear names, descriptions, and schemas, classified as data tools (retrieve context), action tools (side effects), or orchestration tools (invoke sub-agents).[^7]
State and memory — the read/write lifecycle around each turn: reading conversational history, knowledge base entries, and entity memory before the model is called; writing new memory after it acts.[^7]
Verification — a separate check on whether the model’s “done” claim is actually true, since a terminal message with no more tool calls doesn’t mean the goal was met; it might just mean the model gave up or asked a clarifying question the harness didn’t notice.[^7]
Observability — structured logging of every reasoning step, tool call, argument, and result, because a 20-iteration run touching eight tools produces a trace too complex to debug from vibes alone.[^7]

The results are measurable. OpenAI ran a near-million-line production application where agents wrote 100% of the code and the engineers built the environment around them; Anthropic got Claude running unattended for hours via fresh-context resets and independent evaluator agents; LangChain reported the same model jumping from rank 30 to rank 5 on Terminal Bench purely by changing the harness, not the weights.[^4]

The ceiling: a perfect harness still controls one run, and you’re still the trigger for every run. You decide when to start it, you read the result, you decide what happens next. The agent got supervised; the workflow didn’t get autonomous. For a single high-stakes task, that’s fine. The moment you have recurring work — triage every new issue, refresh a report weekly, keep a test suite green — you become the cron job. That’s the bottleneck the fourth layer removes.

Layer 4: Loop Engineering — remove yourself from the run

Addy Osmani’s definition, the one nearly every other write-up on this topic cites: “Loop engineering is replacing yourself as the person who prompts the agent. You design the system that does it instead.”[^3] The prompt still exists — a machine writes it now. The harness still runs — a loop decides when it runs and whether the result is actually done. Instead of issuing instructions turn by turn, you define a goal with a testable termination condition, a verifier that decides “good enough,” a trigger (cron, webhook, another agent), and an exit path for failure. Then you walk away.

The five building blocks (plus the one everyone forgets)

Osmani’s breakdown of what a working loop needs is now the reference model cited across the space:[^3]

Block	Job	What it looks like
Automations	Starts loops on schedule or event	cron, scheduled tasks, webhooks
Worktrees	Keeps parallel agents from colliding	git worktrees, isolated checkouts per agent
Skills	Encodes project knowledge so the agent stops guessing	`SKILL.md`, playbooks, runbooks
Connectors	Lets the agent touch real systems	MCP servers, GitHub, Slack, ticketing APIs
Sub-agents	Separates the maker from the checker	writer agent vs. verifier agent
Memory (the +1)	Persists state outside the conversation	a markdown file, a board, a database the loop re-reads

The memory piece is easy to underrate. “The model forgets everything between runs so the memory has to be on disk and not in the context. The agent forgets, the repo doesn’t.”[^3] Two native implementations have emerged as the reference primitives: Claude Code’s /goal (a durable objective that persists across turns, checked by a separate evaluator model so the agent that wrote the work isn’t the one grading it) and OpenAI Codex’s equivalent /goal command, which supports pause, resume, and clear across a long-horizon run.[^8]

What the loop is actually doing, mechanically

Strip away the tooling and every agent loop — whether it’s a single tool-calling agent or a full autonomous pipeline — runs the same five-stage cycle, repeating until a stop condition fires:[^7]

Perceive — take in input: a user message, a tool result, an error, or the outcome of the last action.
Reason — the LLM processes everything currently in context and decides what to do next.
Plan — for complex tasks, decompose the objective into subtasks; simple tasks skip straight to acting.
Act — execute something: a tool call, an API request, a code run.
Observe — examine the result. Did it work? Is the task actually complete? Does the plan need adjusting?

Then it loops back to step one. In pseudocode this reduces to six lines — while not done: call the model, execute any tool calls, append results, repeat until no tool calls remain — and that six-line pattern, formalized as ReAct (Reasoning + Acting) by Princeton and Google Research in 2022, is the architecture every major AI company has converged on independently.[^7][^9] Yao et al.’s original ReAct paper measured the gap directly: models that reason, act, and observe in a loop scored 34% higher on ALFWorld and 10% higher on WebShop than action-only baselines.[^9] Loop engineering isn’t a new architecture; it’s the discipline of making that 2022 pattern survive unattended, at scale, for hours.

Stop conditions are not optional

A loop that doesn’t know when to stop doesn’t fail gracefully — it fails expensively. The canonical cautionary example, cited across multiple independent sources covering this topic: an agent deployed to scrape a website hit a structure change, got an empty result, had no hard stopping condition, and called the broken tool 400 times in five minutes before hitting a platform rate limit.[^7] A production loop needs, at minimum:

A hard iteration cap (max turns before forced exit)
A wall-clock timeout
A token or dollar budget per run
No-progress detection — exiting when the same tool gets called with identical arguments for the third consecutive time, a strong signal the agent is stuck, not working[^7]
A goal-completion check that is a real predicate, not just the absence of further tool calls — the model returning a terminal message doesn’t mean the objective was met; it might mean the model asked a question the harness silently ignored[^7]

The economics are real, not theoretical

This is where loop engineering diverges from every layer before it: it’s the first layer where the failure mode is measured in dollars, not just quality. Anthropic’s own internal data shows agents consume roughly 4x more tokens than standard chat, and multi-agent systems push that to approximately 15x.[^6][^7] Uber reportedly capped engineers at $1,500 per person per tool per month for agent tooling after burning its annual AI budget in four months — the expense moved from writing code to running the thing that writes code.[^10] The practitioner consensus on guardrails, repeated across every source covering this in 2026, comes down to three non-negotiables: a hard iteration ceiling, a diff/no-progress check that kills a run once recent passes stop changing anything, and a spend cap that ends the run before billing does.[^10] Without all three, what you’re running isn’t a loop — it’s an open invoice.

There’s a second, quieter risk that compounds the longer a loop runs unattended: comprehension debt, Osmani’s term for the growing gap between what exists in your codebase or knowledge base and what you actually understand, because the faster a loop ships work you didn’t personally review, the faster that gap widens.[^3] Verification by a sub-agent tells you tests passed. It does not tell you the loop didn’t also flip a feature flag, use the wrong credentials, or make a change outside your intended blast radius — that requires governance (identity, scope, audit trail, rollback path), not just a verifier.[^11]

The full comparison

	Prompt Eng.	Context Eng.	Harness Eng.	Loop Eng.
Object	The instruction	The input environment	The execution system	The recurring system
Core question	Did I say it clearly?	Does it see the right info?	Does it keep doing it right?	Does it run without me?
Typical artifact	System prompt, rubric	RAG index, file tree, tool defs	Verifier, state store, logging	Scheduler, verifier, budget guard
Failure it fights	Misunderstanding	Missing or noisy knowledge	Drift, error compounding, false “done”	Human-as-bottleneck, runaway cost
Canonical source	IBM prompt engineering overview	Anthropic, “Effective context engineering”[^6]	LangChain, “The Anatomy of an Agent Harness”[^4]	Addy Osmani, “Loop Engineering”[^3]
Skill background	Language design	Information architecture	Systems + verification design	SRE + control-theory thinking

A practical diagnostic for daily use, adapted from the AI Builder Club’s framing: output misunderstands the ask → prompt problem. Output is fluent but wrong or stale → context problem. Output starts right and degrades across steps, or claims success falsely → harness problem, usually a weak verifier. Output is fine but nothing happens unless you personally kick it off → loop problem: you’re the cron job.[^4]

What this means for RPA and agentic automation teams

If you’re building on UiPath, Maestro, or any agent framework, this stack maps directly onto decisions you’re already making, whether you’ve named them or not.

Prompt engineering is the system prompt on your triage agent or the instructions block in Agent Builder — worth getting right, but it’s the smallest lever you have.

Context engineering is the layer where a knowledge format like Open Knowledge Format (OKF) does its work: a versioned, typed bundle of process documentation, runbooks, and schema definitions is exactly the “smallest set of high-signal tokens” Anthropic’s guidance calls for, instead of an agent re-deriving your queue-retry policy from scratch every run. It’s also the layer MCP servers operate at — MCP is how an agent’s context gets extended with live tool access rather than static documents.

Harness engineering is your Orchestrator integration, your exception-handling logic, and — critically — whatever checks whether a bot actually completed a transaction correctly versus just exiting without erroring. This is the layer most RPA teams have historically under-invested in relative to prompt/context work, and it’s a recurring root cause in why agentic automation programs fail: a bot that reports success without a real verification step is a harness failure wearing a green checkmark.

Loop engineering is a scheduled Orchestrator trigger evolving into something closer to /goal — not “run this process at 6am” but “keep reconciling this queue until it’s clean, verify with a separate check, and escalate to a human if you’re not making progress after N attempts.” Very few automation programs are here yet, and the token-cost lesson from coding agents applies directly: an unattended automation loop without a hard iteration cap and a cost budget is not more autonomous, it’s just a slower version of the same runaway-tool-call failure mode documented above.

The organizations getting real value from agentic automation in 2026 aren’t the ones with the cleverest prompts. They’re the ones who correctly diagnosed which of these four layers was actually their bottleneck before spending engineering time on it.

Key Takeaways

Prompt, context, harness, and loop engineering are nested layers, not a succession of fads. Each one wraps the layer before it; none of them retire the others, and a sloppy prompt still hurts inside the best-engineered loop.
“The model can’t follow instructions” is usually a misdiagnosis. Use the three-question test: reproducible with the same prompt and context → prompt problem. Flaky within one turn → context problem. Only breaks after N turns → harness or loop problem.
Context engineering’s core discipline is subtraction, not addition — Anthropic’s own guidance is to find the smallest set of high-signal tokens, because context rot means more tokens can actively degrade recall, not just cost more.
Harness engineering is where most of the actual engineering work happens — LangChain’s formula, Agent = Model + Harness, means the model is often not your bottleneck; your verification and state logic is.
Loop engineering is the first layer where failure is measured in dollars. Agents run ~4x the tokens of standard chat, multi-agent systems ~15x. A loop without a hard iteration cap, a no-progress check, and a spend budget isn’t a loop — it’s an open invoice.
For automation teams, this stack maps directly onto tools you already use: OKF-style knowledge bundles are context engineering, Orchestrator exception handling is harness engineering, and scheduled /goal-style automations are the next step in loop engineering — most programs are still one or two layers behind where the leading edge already is.

FAQs

What’s the difference between prompt engineering and context engineering? Prompt engineering is about what you ask the model to do in a single turn — the instruction wording, format, and constraints. Context engineering is about everything else the model sees while doing it — retrieved documents, tool outputs, message history, file structure. Anthropic frames context engineering as the natural progression of prompt engineering: the prompt becomes one curated object inside a much larger, actively managed context window.

What is loop engineering? Loop engineering, a term popularized by Addy Osmani in June 2026, is the practice of designing the system that prompts, checks, and re-runs an AI agent so a human doesn’t have to trigger every turn manually. It wraps prompt, context, and harness engineering rather than replacing them, adding scheduling, verification, and stop conditions on top.

Is loop engineering just agent orchestration rebranded? Partly. Orchestration frameworks (LangGraph, CrewAI, etc.) are the tools. Loop engineering is the discipline: deciding how work continues across runs, how it’s checked, and how and when it stops. You can use an orchestration framework badly (no verifier, no budget) or well; loop engineering is the practice that makes the difference.

How do you stop an AI agent loop from running forever? Layer multiple stop conditions: a hard maximum-iteration cap, a wall-clock timeout, a token or dollar spend budget, no-progress detection (exiting when repeated iterations produce no new information or repeat the same tool call), and a goal-completion check based on a verifiable predicate rather than just the model going silent.

What is harness engineering and how is it different from loop engineering? The harness is the environment a single agent run operates inside — its tools, state, and verification logic (LangChain’s formula: Agent = Model + Harness). The loop is what decides when that harness runs and whether the result is good enough to stop on. A loop without a harness is an unattended agent with no guardrails; a harness without a loop is a safe system that still needs a human to press go every time.

Why do AI agents consume so many more tokens than a chat conversation? Because every loop iteration is a full LLM call that includes the growing context of prior tool results, reasoning traces, and state. Anthropic’s internal data shows single agents run roughly 4x the token consumption of standard chat interactions, and multi-agent systems (where a lead agent spawns sub-agents) run closer to 15x, since each sub-agent independently explores and reasons before returning a condensed summary.

References

[^1]: Peter Steinberger (@steipete), post on X, June 7, 2026, cited across multiple sources including Firecrawl and Data Science Dojo coverage of loop engineering.

[^2]: Boris Cherny, creator of Claude Code at Anthropic, quoted in Firecrawl’s “Loop Engineering: Should You Stop Prompting Agents and Start Designing Loops,” June 11, 2026 — https://www.firecrawl.dev/blog/loop-engineering

[^3]: Addy Osmani, “Loop Engineering,” June 7, 2026 — https://addyosmani.com/blog/loop-engineering/

[^4]: AI Builder Club, “Prompt vs Context vs Harness vs Loop Engineering: The 4 Shifts,” June 11, 2026 (updated July 2, 2026) — https://www.aibuilderclub.com/blog/prompt-context-harness-evolution

[^5]: Puppyone, “Loop Engineering vs Prompt Engineering vs Context Engineering: A 2026 Practitioner’s Map,” June 17, 2026 — https://www.puppyone.ai/en/blog/loop-vs-prompt-vs-context-engineering-2026-map

[^6]: Anthropic Applied AI team, “Effective context engineering for AI agents,” September 29, 2025 — https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents

[^7]: Richmond Alake, “The Agent Loop Decoded,” Oracle Developers Blog, June 11, 2026 — https://blogs.oracle.com/developers/the-agent-loop-decoded-three-levels-every-agent-engineer-must-know; and Casius Lee, “What Is the AI Agent Loop?,” Oracle Developers Blog, March 16, 2026 — https://blogs.oracle.com/developers/what-is-the-ai-agent-loop-the-core-architecture-behind-autonomous-ai-systems

[^8]: Puppyone, “Loop Engineering: 5 Building Blocks + The Missing One,” June 12, 2026 — https://www.puppyone.ai/en/blog/what-is-loop-engineering-5-building-blocks-missing-one

[^9]: Yao et al., “ReAct: Synergizing Reasoning and Acting in Language Models,” Princeton University and Google Research, arXiv:2210.03629, 2022 — https://arxiv.org/abs/2210.03629

[^10]: Data Science Dojo, “Agentic Loops Explained: From ReAct to Loop Engineering (2026 Guide),” June 9, 2026 — https://datasciencedojo.com/blog/agentic-loops-explained-from-react-to-loop-engineering-2026-guide/

[^11]: Puppyone, “Loop Engineering: 5 Building Blocks + The Missing One” (governed workspace / identity, scope, audit, rollback section) — https://www.puppyone.ai/en/blog/what-is-loop-engineering-5-building-blocks-missing-one

Further sources consulted: MindStudio, “What Is Loop Engineering? The New Meta for AI Coding Agents,” June 9, 2026; Firecrawl, “Loop Engineering: Should You Stop Prompting Agents and Start Designing Loops,” June 11, 2026.

What Is Google’s Open Knowledge Format (OKF)? A Practitioner’s Guide

Satish Prasad — Sun, 19 Jul 2026 12:08:43 +0000

Every team building AI agents eventually hits the same wall. The model is capable enough — it can write SQL, draft a workflow, summarize a document — but it doesn’t know which table holds “active users,” why the finance team’s revenue number excludes refunds, or that the checkout API changed last sprint. That knowledge exists somewhere. It’s just scattered across a wiki, a Slack thread, a senior engineer’s head, and a comment in a repo nobody opens anymore.

On June 12, 2026, Google Cloud’s Data Cloud team — led by Sam McVeety and Amir Hormati — published a proposed fix: the Open Knowledge Format (OKF), an open, vendor-neutral specification for packaging organizational knowledge so both humans and AI agents can read it without translation.[^1] It’s not a database, not a new model, and not a Google product you sign up for. It’s a file format — plain markdown with YAML frontmatter — and that plainness is the entire pitch.

This guide covers what OKF actually is, how the v0.1 spec works down to the file structure, how it stacks up against RAG and adjacent conventions like llms.txt and AGENTS.md, and what it means if you’re building or documenting automation programs on UiPath, LangGraph, or any agent stack in between.

What the Open Knowledge Format actually is

OKF represents knowledge as a bundle: a directory of markdown files, each one a self-contained unit called a concept. A concept can be anything worth capturing — a database table, a metric definition, an incident runbook, an API endpoint, a business process. Each concept file has two parts: a short YAML frontmatter block with structured metadata, and a markdown body with the actual explanation, schema, or steps.

Here’s the entire idea in one file, taken directly from the v0.1 spec:

---
type: BigQuery Table
title: Customer Orders
description: One row per completed customer order across all channels.
resource: https://console.cloud.google.com/bigquery?p=acme&d=sales&t=orders
tags: [sales, orders, revenue]
timestamp: 2026-05-28T14:30:00Z
---

# Schema

| Column        | Type      | Description                              |
|---------------|-----------|-------------------------------------------|
| `order_id`    | STRING    | Globally unique order identifier.         |
| `customer_id` | STRING    | Foreign key into [customers](/tables/customers.md). |

# Joins

Joined with [customers](/tables/customers.md) on `customer_id`.

Only one field is required: type. Everything else — title, description, resource, tags, timestamp, or any custom key a producer wants to add — is optional. Google’s stated design goal was to standardize the smallest possible set of conventions needed for interoperability, and leave everything else to the people actually writing the knowledge.[^2]

Three properties make this different from “just another docs folder”:

It’s just files. No SDK, no API, no runtime. If a tool can read a directory of .md files, it can consume an OKF bundle.
It’s diffable and versionable. Because a bundle is plain text, it lives naturally in Git — pull requests, code review, and blame history apply to your knowledge the same way they apply to your code.
Producers and consumers are decoupled. A human can hand-write a concept file. A pipeline can auto-generate one from a database schema. An LLM can draft one and a different LLM, from a different vendor, can consume it — because the contract is the file format, not an integration.

Why Google built it: the fragmented-context problem

Google’s announcement frames the motivation plainly: in most organizations, the knowledge an agent needs to be useful — schema definitions, business logic behind a metric, join paths between two systems, the reason an API was deprecated — is spread across metadata catalogs with their own APIs, wikis and shared drives, code comments and docstrings, and “the heads of a few senior engineers.”[^1]

When an agent needs to answer something like “how do we compute weekly active users from the event stream,” it has to reassemble that answer from scattered, mutually incompatible surfaces — every single time, for every single agent, at every single company independently solving the exact same problem. Google’s bet is that the fix isn’t a better retrieval pipeline or another vendor catalog. It’s a shared format that lets knowledge itself become portable.

The pattern it formalizes: Karpathy’s “LLM wiki”

OKF didn’t appear from nowhere. It’s an explicit formalization of a pattern AI researcher Andrej Karpathy described in a widely-circulated gist: instead of having a model re-search the same raw documents for the same facts on every run, give it a persistent markdown wiki it reads from and writes back into, the way a human maintains a personal knowledge base — except the model doesn’t get bored or forget to fix a stale cross-reference.[^3] Karpathy’s framing, quoted directly in Google’s post: “LLMs don’t get bored, don’t forget to update a cross-reference, and can touch 15 files in one pass.”

That pattern had already been reinvented independently and informally — Obsidian vaults wired to coding agents, the AGENTS.md/CLAUDE.md family of repo convention files, ad hoc index.md/log.md pairs inside data teams’ “metadata as code” repos. They all rhyme (markdown, frontmatter, cross-links), but none of them agree on what fields a document should carry or what a filename means. OKF’s contribution is narrow but useful: it’s the same idea, specified enough that a bundle produced by one team’s tooling is legible to another team’s agent without a translation layer.

How OKF works, structurally

Bundle structure

A bundle is a directory tree. Producers organize it however fits the domain — there’s no fixed taxonomy:

sales/
├── index.md
├── datasets/
│   ├── index.md
│   └── orders_db.md
├── tables/
│   ├── index.md
│   ├── orders.md
│   └── customers.md
└── metrics/
    ├── index.md
    └── weekly_active_users.md

A concept’s concept ID is just its file path with the .md stripped — tables/orders.md has the concept ID tables/orders. The filesystem location is the identity.

The two reserved filenames

Two filenames carry special meaning at any level of the tree and can’t be used for ordinary concepts:

Filename	Purpose
`index.md`	A directory listing that supports progressive disclosure — an agent reads the index first to decide what’s worth opening, instead of loading an entire directory into context at once. Contains no frontmatter, just grouped links with short descriptions.
`log.md`	An append-only, date-grouped changelog for that scope (`## 2026-05-22` followed by `Update:`, `Creation:`, or `Deprecation:` entries). Optional but recommended for anything an agent maintains over time.

Cross-linking turns the tree into a graph

Concepts link to each other with ordinary markdown links — either bundle-root-relative (/tables/customers.md, the recommended form) or path-relative (./customers.md). A link asserts a relationship; the kind of relationship (join, dependency, reference) lives in the surrounding prose, not in the link syntax itself. Consumers that render a graph view treat every link as a directed edge. Notably, the spec requires consumers to tolerate broken links — a dangling reference isn’t malformed, it’s just knowledge that hasn’t been written yet.

Conformance is deliberately loose

A bundle is v0.1-conformant if every non-reserved .md file has parseable YAML frontmatter with a non-empty type field, and reserved files follow the index.md/log.md conventions when present. That’s it. Unknown type values, missing optional fields, unknown extra frontmatter keys, and broken links are all things a compliant consumer must tolerate rather than reject.[^2] This permissiveness is intentional — the format needs to stay useful as bundles are partially generated by agents and grow messier over time.

Worked example: an OKF bundle for a UiPath automation program

To make this concrete for an automation team, here’s what a small OKF bundle might look like for a UiPath Orchestrator estate — the kind of tribal knowledge that normally lives in a wiki nobody keeps current:

automation-coe/
├── index.md
├── processes/
│   ├── index.md
│   └── invoice-processing.md
├── queues/
│   ├── index.md
│   └── ap-exceptions.md
└── runbooks/
    ├── index.md
    └── orchestrator-robot-offline.md

runbooks/orchestrator-robot-offline.md might read:

---
type: Playbook
title: Unattended robot shows Offline in Orchestrator
description: Triage steps when a robot goes offline mid-job.
resource: https://orchestrator.acme.com/robots
tags: [uipath, orchestrator, oncall]
timestamp: 2026-06-30T09:00:00Z
---

# Trigger

Robot status flips to Offline while a job from
[invoice-processing](/processes/invoice-processing.md) is In Progress.

# Steps

1. Check the machine's connectivity to Orchestrator (port 443).
2. Confirm the UiRobot service is running on the host.
3. Requeue the affected item in
   [ap-exceptions](/queues/ap-exceptions.md) rather than resubmitting the job.

# Citations

[1] UiPath Orchestrator troubleshooting guide — https://docs.uipath.com

Nothing here requires UiPath tooling to understand — it’s readable in any editor and parseable by any agent, which is exactly the point. An agent handling an on-call alert, a coding agent generating a new dispatcher workflow, or a new hire’s onboarding bot can all read the same file without a bespoke integration into Orchestrator’s API.

OKF vs. RAG: different tools for different knowledge

It’s tempting to read OKF as a RAG replacement. It isn’t, and Google doesn’t frame it that way — but the comparison is useful because it clarifies what each approach is actually for.

	RAG	OKF
Core approach	Search and retrieve on demand from a vector index	Maintain a persistent, curated wiki agents read and update
Knowledge shape	Unstructured chunks / embeddings	Structured markdown + YAML
Portability	Low — tied to a vector DB and embedding model	High — plain files, vendor-neutral
Version control	Rarely versioned directly	Native — lives in Git
Setup complexity	High (embeddings, vector store, retrieval pipeline)	Low (a directory of files)
Best for	Large, unstructured document corpora	Curated, repeatedly-needed organizational knowledge
Knowledge growth	Static — re-retrieved every query	Cumulative — the wiki gets better over time
Maturity (as of mid-2026)	High, years of tooling (LangChain, LlamaIndex)	Very early — v0.1, published June 2026

RAG still wins when you have millions of raw, unstructured documents and no pre-existing structure to lean on — support tickets, contracts, research papers. OKF wins for the smaller, denser layer of knowledge an agent needs reliably and repeatedly: schema definitions, metric logic, runbooks, join paths — the stuff currently being re-retrieved, re-explained, or re-guessed on every single agent run. Think of RAG as a library and OKF as a well-maintained team handbook. Most agentic systems will eventually want both, with OKF as the always-loaded core context and RAG handling the long tail.

OKF vs. llms.txt vs. AGENTS.md/CLAUDE.md

OKF also sits inside a broader, informal stack of conventions for making a system legible to machines, and it’s worth knowing where each layer stops:

Format	What it is	Scope	Who reads it
`robots.txt` / `sitemap.xml`	Tells a crawler which URLs exist	A site	Crawlers
`llms.txt`	A single pointer file at a site root	A handful of pages worth reading	Web crawlers, LLMs
`AGENTS.md` / `CLAUDE.md`	Instructions for how a coding agent should behave	One repo or agent	The coding agent in that repo
OKF	A directory of typed, cross-linked markdown concepts	A whole knowledge base	Any agent or tool, across organizations

llms.txt points; AGENTS.md instructs; OKF hands over the actual knowledge, structured as a traversable graph. They’re complementary rather than competing — a repo can ship an AGENTS.md for behavior and an OKF bundle for domain knowledge at the same time.

What ships alongside the v0.1 spec

Google published reference implementations at both ends of the pipeline, explicitly as proofs of concept rather than the only valid way to build one:[^1]

An enrichment agent that walks a BigQuery dataset, drafts an OKF concept for every table and view, then runs a second LLM pass to enrich each concept with citations, schemas, and join paths pulled from documentation.
A static HTML visualizer — a single self-contained file that turns any OKF bundle into an interactive graph, with no backend and no data leaving the page.
Three sample bundles (GA4 e-commerce, Stack Overflow, Bitcoin public datasets) committed to the repo as working examples of conformant OKF.

The spec, sample bundles, and reference agent are all in the public GoogleCloudPlatform/knowledge-catalog repository on GitHub, and Google has already wired its own Knowledge Catalog product to ingest OKF bundles.[^1] Independent tooling is already appearing outside Google — including a free WordPress plugin that auto-generates a bundle from published posts, and third-party OKF conformance validators.[^4]

Limitations and open risks

A few things are worth being honest about before adopting this on a live system:

It’s genuinely early. Google calls v0.1 “a starting point, not a finished standard.” Field conventions, tooling, and even the reserved-filename list may change in future minor or major versions.
Curation isn’t free. Unlike RAG, which can point at raw documents as-is, OKF’s value comes from someone — human or agent — actually writing typed, cross-linked concepts. A bundle where every file has the same generic type and no real relationships is barely more useful than a folder of loose notes; several early guides flag exactly this failure mode with auto-generated bundles.[^4]
It’s a new attack surface if agents write into it. A bundle that an agent updates from untrusted input is a plausible vector for indirect prompt injection — what’s allowed to write into a shared, always-loaded knowledge base matters as much as what’s allowed to read from it.
It is not an SEO or ranking signal. Despite some early coverage implying otherwise, Google’s search systems don’t fetch a public OKF bundle from a website to rank it. It’s an internal knowledge format for agents, not a web publishing signal.[^4]

What this means for RPA and agentic automation teams

For readers building or running automation programs, OKF is worth watching for a specific reason: it targets exactly the knowledge layer that agentic automation keeps tripping over — the gap between “the model can generate a workflow” and “the model knows your schema, your queue-retry policy, and why the checkout API changed last sprint.” That gap is a recurring failure mode in agentic rollouts (see our breakdown of why agentic automation programs fail, where undocumented tribal context shows up repeatedly as a root cause).

Automation Centers of Excellence already maintain a version of this knowledge informally — process documentation, Orchestrator runbooks, queue definitions, exception-handling playbooks. The pitch of OKF is that turning that material into a typed, linked, Git-versioned bundle costs little more than writing it down properly once, and it pays off the moment you’re feeding context to a coding agent building a new dispatcher workflow, a triage agent responding to Orchestrator alerts, or an onboarding assistant for a new automation developer. It’s also a natural complement to MCP-based agent architectures: MCP servers expose actions an agent can take, while an OKF bundle can supply the context the agent needs to decide which action is correct.

Whether OKF specifically becomes the standard, or gets superseded by something else in the next year of iteration, the underlying discipline — curated, versioned, agent-readable knowledge instead of scattered tribal memory — is worth adopting regardless of which file format wins.

Key Takeaways

OKF is a file format, not a service. A knowledge bundle is a directory of markdown files with YAML frontmatter — no SDK, database, or proprietary runtime required.
Only one field is mandatory: type. Everything else (title, description, resource, tags, timestamp, custom keys) is optional, keeping the barrier to producing a bundle low.
It formalizes Karpathy’s “LLM wiki” pattern — a persistent, agent-maintained markdown knowledge base instead of re-retrieving the same facts on every run.
It doesn’t replace RAG. RAG still wins for large, unstructured corpora; OKF wins for curated, repeatedly-needed organizational knowledge like schemas, metrics, and runbooks.
It’s genuinely v0.1. Treat it as an early, evolving spec — useful to pilot now, not yet something to bet a production architecture on wholesale.
For automation teams, OKF maps directly onto the process documentation, runbooks, and queue definitions that Centers of Excellence already maintain — formalizing that material is low-cost and pays off the moment agents need to reason about it.

FAQs

What is the Open Knowledge Format (OKF)? OKF is an open specification from Google Cloud, published June 12, 2026, that represents organizational knowledge as a directory of markdown files with YAML frontmatter. It’s designed to be readable by humans and parseable by AI agents without a translation layer or SDK.

Who created OKF? Google Cloud’s Data Cloud team created OKF. The announcement was authored by Sam McVeety (Tech Lead, Data Analytics) and Amir Hormati (Tech Lead, BigQuery), and the spec is published as an open standard in the public GoogleCloudPlatform/knowledge-catalog GitHub repository.

Does OKF replace RAG? No. OKF and RAG solve different problems. RAG is best for large, unstructured document corpora searched on demand. OKF is best for curated, structured knowledge — schemas, metric definitions, runbooks — that an agent needs reliably and repeatedly. Most systems will end up using both.

Is OKF an SEO or Google ranking signal? No. Google’s search systems do not fetch a public OKF bundle to rank a website with it. OKF is an internal knowledge format for AI agents, not a web publishing or SEO signal.

How do I create an OKF bundle? Write the knowledge worth capturing as individual markdown files, each with a YAML frontmatter block containing at least a type field. Cross-link related concepts with markdown links, add an index.md per directory for navigation, and host the bundle in a Git repo (recommended) or as a tarball. Google’s reference enrichment agent and third-party validators can help automate and check the output.

Is OKF only for BigQuery or Google Cloud data? No. The reference implementations Google shipped happen to target BigQuery, but the format itself is domain-agnostic — it works equally well for runbooks, API documentation, RPA process libraries, or any knowledge a team wants an agent to read.

References

[^1]: Sam McVeety and Amir Hormati, “Introducing the Open Knowledge Format,” Google Cloud Blog, June 12, 2026 — https://cloud.google.com/blog/products/data-analytics/how-the-open-knowledge-format-can-improve-data-sharing

[^2]: “Open Knowledge Format (OKF), Version 0.1 — Draft,” GoogleCloudPlatform/knowledge-catalog, GitHub — https://github.com/GoogleCloudPlatform/knowledge-catalog/blob/main/okf/SPEC.md

[^3]: Andrej Karpathy, “LLM Wiki,” GitHub Gist — https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f [^4]: WitsCode, “Open Knowledge Format (OKF): The Complete 2026 Guide,” published June 18, 2026, updated June 21, 2026 — https://witscode.com/open-knowledge-format

Further industry commentary consulted: “Google’s New Knowledge Standard: What Is the Open Knowledge Format (OKF)?”, Medium, June 19, 2026 — https://medium.com/@aristojeff/googles-new-knowledge-standard-what-is-the-open-knowledge-format-okf-f044ddf5b6bd;

“Google’s Open Knowledge Format (OKF) vs. RAG,” AlphaMatch, June 30, 2026 — https://www.alphamatch.ai/blog/google-open-knowledge-format-okf-vs-rag-2026

Guardrails in Amazon Bedrock: Building Safer and Governed Generative AI Applications in LangGrap

Satish Prasad — Wed, 15 Jul 2026 02:44:01 +0000

A guardrail-less LangGraph agent is one bad tool call away from a headline. Not hypothetically — this is the exact failure mode we catalogued in 16 Reasons Why Agentic Automation Programs Fail: the agent that started approving things it was never authorized to approve, the pipeline that silently produced output nobody checked. LangGraph gives you precise control over how an agent moves through a task. It does not, by itself, stop that agent from leaking a customer’s SSN, answering a question it should have refused, or hallucinating a policy that doesn’t exist. That’s a separate layer, and Amazon Bedrock Guardrails is one of the more complete implementations of it — governance controls that work whether your model is on Bedrock, self-hosted, or a third-party API entirely.

This guide does two things. First, it walks through wiring Bedrock Guardrails into a real LangGraph agent — not the console click-through, the actual code. Second, it puts Bedrock Guardrails side by side with the other guardrail options a LangGraph developer will realistically run into: NVIDIA NeMo Guardrails, Guardrails AI, Azure AI Content Safety, and — since a good share of this audience is building on UiPath — the guardrails built into the UiPath LangChain SDK itself.

What Bedrock Guardrails Actually Checks

Amazon Bedrock Guardrails is a policy layer that sits between your application, the prompt, and the model response — evaluating both directions against six configurable safeguards (AWS, Bedrock Guardrails documentation):

Content filters — detect and filter Hate, Insults, Sexual, Violence, Misconduct, and Prompt Attack content, with configurable strength per category. The Standard tier extends this detection into code elements — comments, variable names, string literals — which matters if your agent generates or reviews code.
Denied topics — you define topics that are simply off-limits for your application (AWS’s own example: a banking app blocking fiduciary/investment advice), and the guardrail blocks them regardless of how the prompt is phrased.
Word filters — exact-match blocking of custom words or phrases: profanity, competitor names, internal project codenames.
Sensitive information filters — probabilistic PII detection (SSN, date of birth, address, and more) with block-or-mask actions, plus custom regex patterns for things PII detection won’t catch on its own, like internal account number formats.
Contextual grounding checks — flags model responses that aren’t actually supported by your retrieved source material, or that don’t answer the user’s question. This is the RAG-hallucination check most guardrail products don’t have.
Automated Reasoning checks — validates a response against a set of formal logical rules rather than a probabilistic classifier. AWS claims up to 99% accuracy on this check and describes it as the first guardrail feature to use formal logic (rather than another LLM) to catch hallucinations — a meaningfully different approach from every other product in this comparison. (AWS, Bedrock Guardrails product page)

AWS states its content filters block up to 88% of harmful content with these safeguards active. Guardrails can be invoked two ways: automatically during a model inference call (attach a guardrailIdentifier and guardrailVersion to the request), or independently via the standalone ApplyGuardrail API, which evaluates content against your policies without calling a foundation model at all — useful for gating retrieved documents or tool output before they ever reach the LLM.

Why This Matters More in LangGraph Than in a Simple Chatbot

A single-turn chatbot has one place to add a safety check: right before the response goes out. A LangGraph agent has many — the user’s initial input, every tool call’s arguments, every tool’s return value, every intermediate LLM turn, and the final output. LangGraph’s whole architectural pitch (see our LangGraph vs. CrewAI vs. Microsoft Agent Framework vs. Google ADK comparison) is that you get explicit control over every node and edge in that flow — which means you also get to decide exactly where guardrails sit, rather than accepting a single bolt-on filter at the edges.

Building It: A Guarded LangGraph Support Agent

We’ll build a banking customer-support agent — the same domain AWS uses in its own guardrails examples — that answers account questions but must never give investment advice, must redact PII in both directions, and must stay grounded in retrieved account documentation rather than inventing policy details.

Step 1: Install dependencies

pip install boto3 langgraph langchain-aws

Step 2: Create the guardrail

This is a one-time setup step, typically run once via a script or the console, not on every agent invocation:

import boto3

bedrock = boto3.client(service_name="bedrock", region_name="us-east-1")

guardrail = bedrock.create_guardrail(
    name="banking-support-guardrail",
    description="Blocks investment advice, redacts PII, checks grounding for banking support agent.",
    topicPolicyConfig={
        "topicsConfig": [{
            "name": "Investment Advice",
            "definition": "Personalized recommendations on investing, trading, or asset allocation.",
            "examples": [
                "Should I move my savings into stocks?",
                "What's a good retirement investment strategy?",
            ],
            "type": "DENY",
        }]
    },
    contentPolicyConfig={
        "filtersConfig": [
            {"type": "SEXUAL", "inputStrength": "HIGH", "outputStrength": "HIGH"},
            {"type": "VIOLENCE", "inputStrength": "HIGH", "outputStrength": "HIGH"},
            {"type": "HATE", "inputStrength": "HIGH", "outputStrength": "HIGH"},
            {"type": "INSULTS", "inputStrength": "HIGH", "outputStrength": "HIGH"},
            {"type": "MISCONDUCT", "inputStrength": "HIGH", "outputStrength": "HIGH"},
            {"type": "PROMPT_ATTACK", "inputStrength": "HIGH", "outputStrength": "NONE"},
        ]
    },
    sensitiveInformationPolicyConfig={
        "piiEntitiesConfig": [
            {"type": "EMAIL", "action": "ANONYMIZE"},
            {"type": "PHONE", "action": "ANONYMIZE"},
            {"type": "NAME", "action": "ANONYMIZE"},
            {"type": "US_SOCIAL_SECURITY_NUMBER", "action": "BLOCK"},
            {"type": "US_BANK_ACCOUNT_NUMBER", "action": "BLOCK"},
            {"type": "CREDIT_DEBIT_CARD_NUMBER", "action": "BLOCK"},
        ]
    },
    contextualGroundingPolicyConfig={
        "filtersConfig": [
            {"type": "GROUNDING", "threshold": 0.75},
            {"type": "RELEVANCE", "threshold": 0.75},
        ]
    },
    blockedInputMessaging="I can't help with that request through this channel. Please contact support directly.",
    blockedOutputsMessaging="I'm not able to provide that information here. Let me connect you with a specialist.",
)

# Snapshot the working draft into a stable, deployable version
version = bedrock.create_guardrail_version(
    guardrailIdentifier=guardrail["guardrailId"],
    description="v1 - initial banking support policy",
)

(This mirrors the structure in AWS’s own Guardrail API notebook — worth reading directly if you want the console-driven version of the same workflow, which the dev.to walkthrough by Dipayan Das covers screenshot-by-screenshot.)

Step 3: Attach the guardrail to the model inside LangGraph

langchain-aws exposes a guardrail_config parameter directly on ChatBedrockConverse — every call through this model object is automatically evaluated against your guardrail, in both directions:

from langchain_aws import ChatBedrockConverse

guarded_llm = ChatBedrockConverse(
    model="anthropic.claude-3-5-sonnet-20241022-v2:0",
    region_name="us-east-1",
    guardrail_config={
        "guardrailIdentifier": guardrail["guardrailId"],
        "guardrailVersion": version["version"],
        "trace": "enabled",  # populates response_metadata["trace"] for auditing
    },
)

A useful optimization for multi-turn conversations: set guard_last_turn_only=True alongside guardrail_config if you only need the guardrail to inspect the newest user message rather than re-scanning the entire conversation history on every turn — this cuts guardrail latency in longer-running agent sessions. (langchain-aws reference, BedrockBase.guardrails)

Step 4: Wire it into a graph with an explicit intervention branch

This is the part a plain chatbot integration skips, and where LangGraph’s explicit control actually earns its keep. When the guardrail intervenes, the response includes stopReason: "guardrail_intervened" in response_metadata — check for it and route accordingly, the same confidence-gated pattern we used for human-in-the-loop classification on UiPath:

from typing import Literal
from langgraph.graph import StateGraph, START, END, MessagesState
from langgraph.types import Command


async def call_guarded_model(state: MessagesState) -> Command:
    response = await guarded_llm.ainvoke(state["messages"])
    return Command(update={"messages": [response]})


def route_on_guardrail(state: MessagesState) -> Literal["respond", "handle_intervention"]:
    last_message = state["messages"][-1]
    if last_message.response_metadata.get("stopReason") == "guardrail_intervened":
        return "handle_intervention"
    return "respond"


async def handle_intervention(state: MessagesState) -> Command:
    # Log the trace for audit, notify a human reviewer, or route to a specialist queue.
    trace = state["messages"][-1].response_metadata.get("trace")
    # ... send `trace` to your logging/observability pipeline here ...
    return Command(update={})


async def respond(state: MessagesState) -> Command:
    return Command(update={})  # deliver state["messages"][-1] to the user


builder = StateGraph(MessagesState)
builder.add_node("call_model", call_guarded_model)
builder.add_node("handle_intervention", handle_intervention)
builder.add_node("respond", respond)

builder.add_edge(START, "call_model")
builder.add_conditional_edges("call_model", route_on_guardrail, ["respond", "handle_intervention"])
builder.add_edge("respond", END)
builder.add_edge("handle_intervention", END)

graph = builder.compile()

Step 5: Guard content the model never generated — RAG results and tool output

Model-level guardrail_config only inspects what the LLM says. If your agent retrieves documents or calls tools, that content can carry PII or violate policy before the model ever sees it. This is exactly what the standalone ApplyGuardrail API is for — call it directly as its own graph node, without invoking a foundation model at all:

import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")


async def guard_retrieved_content(state: MessagesState) -> Command:
    retrieved_text = state.get("retrieved_context", "")
    result = bedrock_runtime.apply_guardrail(
        guardrailIdentifier=guardrail["guardrailId"],
        guardrailVersion=version["version"],
        source="OUTPUT",  # evaluating content as if it were a model response
        content=[{"text": {"text": retrieved_text}}],
    )
    if result["action"] == "GUARDRAIL_INTERVENED":
        return Command(update={"retrieved_context": "[content withheld by policy]"})
    return Command(update={})

Because ApplyGuardrail doesn’t invoke a model, it’s cheap and fast enough to run on every tool output as a matter of course — a pattern worth adopting even for agents where the LLM itself is already guarded.

Production Considerations

Tiers matter. Bedrock Guardrails ships Classic and Standard safeguard tiers; Standard extends coverage into code elements and is the one to pick for coding agents or DevOps-adjacent workflows.

Cross-account and cross-region. For organizations running multiple AWS accounts, Bedrock Guardrails supports centralized cross-account enforcement — one policy applied automatically across organizational units, rather than reconfiguring the same guardrail per account. Guardrail inference can also be distributed across regions for latency and resilience.

It’s not Bedrock-only. The ApplyGuardrail API works against any foundation model — including self-hosted models and third-party APIs like OpenAI and Google Gemini — so a Bedrock guardrail policy isn’t locked to Bedrock-hosted models. If your LangGraph agent calls multiple model providers, you can still centralize policy enforcement through one guardrail.

Turn on trace, always. "trace": "enabled" costs nothing and gives you the assessment breakdown — which specific policy fired, at what confidence — for every intervention. Skipping this is the single most common mistake in the walkthroughs we reviewed for this guide; without it, you know a guardrail blocked something but not why, which makes tuning thresholds pure guesswork.

How It Compares to the Other Guardrail Options

	Amazon Bedrock Guardrails	UiPath Guardrails (LangChain SDK)	NVIDIA NeMo Guardrails	Guardrails AI	Azure AI Content Safety
Model	Managed AWS service	Open-source middleware/decorator (`uipath_langchain`)	Open-source toolkit (Apache 2.0)	Open-source Python library	Managed Azure service
Integration style	`guardrail_config` on the model, or standalone `ApplyGuardrail` API call	`middleware=[...]` on `create_agent()`, or `@guardrail` decorator on any LangGraph node	Colang DSL defining rails around your LLM calls	`Guard` object wrapping validator checks	REST API / SDK call
Works across model providers?	Yes — `ApplyGuardrail` is explicitly model-agnostic, including OpenAI and Gemini	Yes, within LangChain/LangGraph regardless of model provider	Yes, framework-agnostic by design	Yes, any Python LLM call	Primarily Azure OpenAI-native
Unique strength	Automated Reasoning checks — formal-logic hallucination detection, not just classifier-based	Native `EscalateAction` → UiPath Action Center human review, built for LangGraph out of the box	Deepest dialog/topical control via Colang; sub-100ms GPU-accelerated checks	Largest library of pre-built validators (50+ on Guardrails Hub)	Groundedness detection tightly coupled to Azure RAG pipelines
Human-in-the-loop	Not built-in — you wire it yourself (as in Step 4 above)	Built-in — `EscalateAction` suspends the run and creates an Action Center task	Not built-in	Not built-in	Not built-in
Learning curve	Low if you already use `boto3`/`langchain-aws`	Low if you’re already on `create_agent()`; two clear patterns (middleware vs. decorator)	Moderate — Colang is a new DSL to learn	Low — plain Python validators	Low — REST API
Best fit	AWS-native stacks, especially where Automated Reasoning’s audit trail matters for regulated decisions	Teams deploying LangGraph agents into UiPath Orchestrator who want guardrail violations to become Action Center tasks automatically	Teams needing fine-grained conversational flow control across a topic list	Python teams wanting composable, testable output validators independent of any cloud vendor	Azure OpenAI / Azure AI Foundry shops already inside that ecosystem

A few things worth calling out that the table can’t fully capture:

UiPath’s guardrails are the most natural fit if you’re already following this site’s LangGraph-on-UiPath deployment pattern. The EscalateAction we referenced doesn’t just log a violation — it calls the same interrupt(CreateEscalation(...)) primitive from our UiPath human-in-the-loop coverage, suspending the graph and creating a real Action Center task for a human reviewer, with Approve/Reject resuming or terminating the run. Bedrock Guardrails gives you the block/log/mask decision; it doesn’t hand you a human review workflow for free the way UiPath’s does. Nothing stops you from combining both — Bedrock Guardrails as the detection layer, UiPath’s EscalateAction pattern (or the manual conditional-edge approach in Step 4) as the review layer.

NeMo Guardrails is the deepest option if “guardrail” means “stay on-topic” more than “block harmful content.” Its five rail types (input, dialog, retrieval, execution, output) let you script entire permitted conversation flows, not just filter categories — genuinely different from the policy-filter model every other product here uses.

Guardrails AI is the right call if vendor neutrality is the actual requirement. It’s a plain Python library with no cloud dependency, which matters if your LangGraph agent needs to run identically whether it’s calling Bedrock, Azure OpenAI, or a local model.

Azure AI Content Safety is the one to reach for only if you’re already committed to Azure OpenAI — its groundedness detection is strong, but the tightest value is inside that specific ecosystem, similar to how Bedrock Guardrails’ Automated Reasoning checks are strongest when your models already live on Bedrock.

The Verdict

Use Bedrock Guardrails when Automated Reasoning’s formal-logic hallucination check matters for your use case (regulated decisions, financial or medical adjacent content), or when you want one guardrail policy that stays consistent whether your LangGraph agent calls Bedrock-hosted or third-party models via ApplyGuardrail.

Use UiPath’s LangChain guardrails when you’re deploying into UiPath Orchestrator anyway and want a policy violation to become an Action Center task without hand-building the escalation graph yourself — the tightest fit for this site’s core audience.

Use NeMo Guardrails when the problem is conversational scope control as much as harmful-content filtering — you need the agent to stay strictly on a defined set of topics and flows.

Use Guardrails AI when cloud-vendor neutrality is a hard requirement and you want a large library of ready-made validators without adopting a managed service.

Use Azure AI Content Safety only if your models already live in Azure OpenAI or Azure AI Foundry — outside that ecosystem it doesn’t offer enough over the alternatives to justify the integration.

None of these are mutually exclusive within a single LangGraph graph — a defense-in-depth approach (a fast local validator on tool inputs, a managed cloud guardrail on model calls, a human escalation path for anything in between) is a reasonable default for anything touching regulated or customer-facing data.

Key Takeaways

Bedrock Guardrails checks six policies — content filters, denied topics, word filters, sensitive information, contextual grounding, and Automated Reasoning — and can run either attached to a model call or standalone via ApplyGuardrail.
In LangGraph, attach guardrail_config to ChatBedrockConverse for automatic model-level checks, and use ApplyGuardrail as its own graph node to check RAG content and tool output the model never generated.
Always check response_metadata["stopReason"] == "guardrail_intervened" and route it through an explicit conditional edge — don’t let a guardrail block silently disappear into a generic error path.
ApplyGuardrail is model-agnostic — it works against OpenAI, Gemini, and self-hosted models, not just Bedrock-hosted ones.
UiPath’s own LangChain guardrails are the only option in this comparison with human-in-the-loop escalation built in natively — worth combining with Bedrock Guardrails’ detection strength rather than choosing one exclusively.
Turn on trace from day one; it’s free and it’s the only way to actually tune your thresholds instead of guessing.

FAQs

Do I need Bedrock Guardrails if I’m already using UiPath’s guardrails middleware? Not necessarily one instead of the other. UiPath’s middleware is strong on PII/harmful-content detection and has human-in-the-loop escalation built in; Bedrock Guardrails adds contextual grounding checks and formal-logic Automated Reasoning checks that UiPath’s guardrails don’t currently offer. Many production setups layer both.

Does ApplyGuardrail cost the same as running the guardrail through a model call? ApplyGuardrail is priced separately from model invocation and doesn’t incur foundation model inference costs since no model is called — check current Bedrock pricing for exact rates, but it’s meaningfully cheaper to run on every tool output than routing that content through an LLM call just to check it.

Can I use Bedrock Guardrails with a non-Anthropic, non-Amazon model? Yes — through ApplyGuardrail, guardrails work with any foundation model, including third-party APIs like OpenAI and Google Gemini, not just models hosted on Bedrock.

What’s the difference between BlockAction and masking/redaction in Bedrock Guardrails? Content filters and denied topics generally block outright; sensitive information filters give you the choice per PII type — BLOCK refuses the response entirely, ANONYMIZE/mask replaces the sensitive span and lets the rest of the response through.

Is Automated Reasoning the same as contextual grounding checks? No — grounding checks verify a response is supported by retrieved source material (a RAG-specific check). Automated Reasoning validates a response against formal logical rules you define, independent of any retrieved documents — it’s closer to a policy compliance check than a hallucination-in-RAG check, though both catch overlapping failure modes.

References

Amazon Bedrock Guardrails documentation — docs.aws.amazon.com/bedrock/latest/userguide/guardrails.html
Amazon Bedrock Guardrails product page — aws.amazon.com/bedrock/guardrails
Guardrail API Example notebook — AWS Bedrock Samples
Using Guardrails in Amazon Bedrock (console walkthrough) — Dipayan Das, DEV Community
Amazon Bedrock Guardrails: Essential Setup Guide 2026 — Tech Jacks Solutions
langchain-aws guardrails reference — reference.langchain.com
langchain-aws Guardrails and Content Filtering internals — DeepWiki
UiPath LangChain SDK Guardrails documentation — uipath.github.io/uipath-python/langchain/guardrails
NVIDIA NeMo Guardrails (GitHub) — github.com/NVIDIA-NeMo/Guardrails
Guardrails AI LangChain integration — guardrailsai.com/docs/integrations/langchain