<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	 xmlns:media="http://search.yahoo.com/mrss/" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd"
>

<channel>
	<title>RPABOTS.WORLD</title>
	<atom:link href="https://rpabotsworld.com/agentic-ai/ai-agents-frameworks/feed/" rel="self" type="application/rss+xml" />
	<link>https://rpabotsworld.com</link>
	<description>RPA, Agentic AI &amp; Intelligent Automation — Tutorials, Tools &amp; Career Guides</description>
	<lastBuildDate>Sat, 13 Jun 2026 07:59:30 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=7.0</generator>
	<itunes:subtitle>RPABOTS.WORLD</itunes:subtitle>
	<itunes:summary>RPA, Agentic AI &amp; Intelligent Automation — Tutorials, Tools &amp; Career Guides</itunes:summary>
	<itunes:explicit>clean</itunes:explicit>
	<item>
		<title>Building with Google Agent Studio: The Complete Guide to Gemini Enterprise Agent Platform</title>
		<link>https://rpabotsworld.com/google-agent-studio-gemini-enterprise-agent-platform-guide/</link>
					<comments>https://rpabotsworld.com/google-agent-studio-gemini-enterprise-agent-platform-guide/#respond</comments>
		
		<dc:creator><![CDATA[Satish Prasad]]></dc:creator>
		<pubDate>Sat, 13 Jun 2026 07:58:26 +0000</pubDate>
				<category><![CDATA[Agentic AI & AI Automation]]></category>
		<category><![CDATA[AI Agents & Frameworks]]></category>
		<category><![CDATA[agentic ai]]></category>
		<category><![CDATA[AI Agents]]></category>
		<category><![CDATA[Digital Transformation]]></category>
		<category><![CDATA[multi-agent systems]]></category>
		<guid isPermaLink="false">https://rpabotsworld.com/?p=32128</guid>

					<description><![CDATA[Vertex AI is now Agent Platform. Agent Designer is now Agent Studio. What stayed the same — and what it means for enterprise teams building production agents today. The Platform That Keeps Evolving — And Why That&#8217;s a Good Thing If you&#8217;ve been tracking Google&#8217;s AI platform story, you&#8217;ve watched a rapid-fire succession of rebrands: [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p class="wp-block-paragraph"><em>Vertex AI is now Agent Platform. Agent Designer is now Agent Studio. What stayed the same — and what it means for enterprise teams building production agents today.</em></p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">The Platform That Keeps Evolving — And Why That&#8217;s a Good Thing</h2>



<p class="wp-block-paragraph">If you&#8217;ve been tracking Google&#8217;s AI platform story, you&#8217;ve watched a rapid-fire succession of rebrands: Dialogflow → Agent Builder → Vertex AI → now <strong>Gemini Enterprise Agent Platform</strong>. At Google Cloud Next 2026, Google announced the consolidation of everything (Vertex AI, Agentspace, Model Garden, ADK, and the Agent Runtime) into a single unified platform. The low-code builder that had been called Agent Designer since December 2024 became <strong>Agent Studio</strong>, now generally available.</p>



<p class="wp-block-paragraph">This guide cuts through the naming history and focuses on what you can actually build today: production-grade agents using the full platform stack — Agent Studio for no-code/low-code design, RAG Engine for grounding on enterprise data, Memory Bank for long-term personalisation, Agent Runtime for deployment, and built-in evaluation for quality assurance.</p>



<p class="wp-block-paragraph">Whether you&#8217;re a developer who wants code, a builder who wants clicks, or an architect who needs to understand the full system — this guide covers all three.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Part 1: The Platform Mental Model — Five Layers</h2>



<p class="wp-block-paragraph">Before touching the console or writing a line of code, understand how the five layers of the Gemini Enterprise Agent Platform fit together.</p>



<p class="wp-block-paragraph">Gemini Enterprise Agent Platform is a unified platform to build, deploy, govern, and optimize enterprise-grade AI agents and model-based solutions. It supports the complete AI lifecycle — from accessing over 200 foundation models to deploying and managing your agents.</p>



<p class="wp-block-paragraph">Here&#8217;s how the five layers stack:</p>



<pre class="wp-block-code"><code>┌──────────────────────────────────────────────────────────────────┐
│  LAYER 1 — AGENT STUDIO (no-code / low-code visual canvas)        │
│  Design agents, test prompts, build reasoning flows visually      │
├──────────────────────────────────────────────────────────────────┤
│  LAYER 2 — ADK (code-first agent framework)                       │
│  LlmAgent, SequentialAgent, ParallelAgent, LoopAgent, AgentTool  │
├──────────────────────────────────────────────────────────────────┤
│  LAYER 3 — KNOWLEDGE LAYER                                        │
│  RAG Engine · Agent Search · Vector Search · Memory Bank         │
├──────────────────────────────────────────────────────────────────┤
│  LAYER 4 — AGENT RUNTIME (managed deployment + scaling)           │
│  Agent Engine (Vertex AI) · Cloud Run · GKE                      │
├──────────────────────────────────────────────────────────────────┤
│  LAYER 5 — GOVERNANCE                                             │
│  Agent Identity · IAM · Agent Gateway · Business Policies        │
└──────────────────────────────────────────────────────────────────┘
</code></pre>



<p class="wp-block-paragraph">Agent Platform meets you where you are, with tools for all skill levels: Agent Studio to design agents and interact with models without code; Colab Enterprise Notebooks for code-based development and experimentation; Agent Development Kit to build sophisticated agents capable of complex reasoning and tool use with a modular, model-agnostic framework.</p>



<p class="wp-block-paragraph">The platform&#8217;s philosophy: start in Agent Studio, graduate to ADK code when you need more control, deploy both the same way via Agent Runtime.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Part 2: Agent Studio — The No-Code/Low-Code Canvas</h2>



<p class="wp-block-paragraph">Agent Studio is where most teams start. It&#8217;s a visual canvas inside the Google Cloud console for designing, prototyping, and managing agent reasoning loops and workflows — no Python required to get something running.</p>



<h3 class="wp-block-heading">What Agent Studio Actually Is</h3>



<p class="wp-block-paragraph">Agent Studio, Google&#8217;s new low-code interface for building, testing, and publishing natural-language agents, is generally available. The product had been in preview as Agent Designer since December 2024. More interesting than the rename is what developers can now build with it.</p>



<p class="wp-block-paragraph">In the console, Agent Studio gives you:</p>



<p class="wp-block-paragraph"><strong>Visual reasoning loop designer</strong> — drag connections between the model, tools, and data sources. Define the agent&#8217;s instruction (system prompt) in a structured editor with variable interpolation support.</p>



<p class="wp-block-paragraph"><strong>Live test panel</strong> — chat with your agent directly in the console. Every tool call, retrieval step, and model response is visible in the trace panel alongside the conversation.</p>



<p class="wp-block-paragraph"><strong>Tool connection UI</strong> — connect Google Search grounding, Agent Search corpora, Cloud Functions, OpenAPI specs, or MCP servers as tools — all without writing integration code.</p>



<p class="wp-block-paragraph"><strong>Agent Garden integration</strong> — one-click import of prebuilt templates for common use cases: customer support, document Q&amp;A, IT helpdesk, HR FAQ, code assistant.</p>



<h3 class="wp-block-heading">Your First Agent in Agent Studio — Step by Step</h3>



<p class="wp-block-paragraph"><strong>Step 1: Open the console.</strong> Navigate to <a href="https://console.cloud.google.com/" rel="nofollow noopener" target="_blank">console.cloud.google.com</a>, select your project, and search for &#8220;Agent Studio&#8221; in the top search bar. Or navigate directly: <code>Agent Platform → Studio → Create Agent</code>.</p>



<p class="wp-block-paragraph"><strong>Step 2: Configure the agent basics.</strong> Give the agent a name (e.g. <code>policy-assistant</code>), select a model (<code>gemini-2.0-flash</code> for speed, <code>gemini-2.5-pro</code> for complex reasoning), and write the instruction. Be specific:</p>



<pre class="wp-block-code"><code>You are an enterprise policy assistant for Acme Corp.
Your job is to answer employee questions about company policies accurately.
Always retrieve from the knowledge_base tool before answering.
Cite the document name and section in every response.
If the policy is not found, say so -- do not invent details.
</code></pre>



<p class="wp-block-paragraph"><strong>Step 3: Add a tool.</strong> Click <code>Add Tool</code> → <code>Agent Search</code> → select your knowledge corpus (or create one). Agent Search becomes the <code>knowledge_base</code> tool the instruction references.</p>



<p class="wp-block-paragraph"><strong>Step 4: Test in the live panel.</strong> Type a query: <em>&#8220;What is the parental leave policy?&#8221;</em> Watch the trace: model receives query → calls <code>knowledge_base</code> → retrieves 3 passages → generates grounded response with citation.</p>
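


<p class="wp-block-paragraph">The trace in Step 4 can be pictured as a small loop. The sketch below is a toy stand-in, not Agent Studio code: <code>POLICY_DOCS</code>, <code>knowledge_base</code>, and <code>answer</code> are hypothetical names, retrieval is plain word overlap, and the "model" simply quotes the top passage. It only illustrates the order of events you should expect to see in the trace panel.</p>

```python
import re
from dataclasses import dataclass, field

# Hypothetical in-memory stand-in for an Agent Search corpus.
POLICY_DOCS = {
    "Leave Policy, section 2": "Parental leave is 16 weeks at full pay.",
    "Leave Policy, section 5": "Sick leave requires a doctor's note after 3 days.",
    "Travel Policy, section 1": "Book flights at least 14 days in advance.",
}


def words(text):
    """Lowercase the text and split it into simple word tokens."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


@dataclass
class Trace:
    """Ordered record of what happened during one turn."""
    steps: list = field(default_factory=list)

    def log(self, kind, detail):
        self.steps.append((kind, detail))


def knowledge_base(query, top_k=3):
    """Toy retrieval: rank passages by how many words they share with the query."""
    query_words = words(query)
    scored = []
    for source, text in POLICY_DOCS.items():
        overlap = len(query_words.intersection(words(source + " " + text)))
        if overlap:
            scored.append((overlap, source, text))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(source, text) for _, source, text in scored[:top_k]]


def answer(query):
    """Run one turn: query in, tool call, retrieval, grounded reply out."""
    trace = Trace()
    trace.log("model_input", query)
    trace.log("tool_call", "knowledge_base")
    passages = knowledge_base(query)
    trace.log("retrieval", len(passages))
    if not passages:
        reply = "No matching policy was found."
    else:
        source, text = passages[0]
        reply = f"{text} (Source: {source})"
    trace.log("model_output", reply)
    return reply, trace


reply, trace = answer("What is the parental leave policy?")
```

<p class="wp-block-paragraph">If the trace in the console shows a model response with no preceding tool call, the instruction is usually the first thing to tighten.</p>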



<p class="wp-block-paragraph"><strong>Step 5: Export to ADK.</strong> When ready for code-first control, click <code>Export → ADK Python</code>. Agent Studio generates the full <code>LlmAgent</code> definition as a Python file — ready to extend, version, and deploy via CI/CD.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Part 3: Agent Garden — Blueprints That Actually Work</h2>



<p class="wp-block-paragraph">Rather than starting from a blank canvas, Agent Garden gives you production-tested templates for the most common agent patterns.</p>



<p class="wp-block-paragraph">Agent Garden is a library of prebuilt agents and templates to accelerate development.</p>



<p class="wp-block-paragraph">The <a href="https://github.com/google/adk-samples/tree/main/python/agents" rel="nofollow noopener" target="_blank">adk-samples repository</a> hosts the open-source versions of these templates. Each one is a complete, runnable ADK project with tools, instructions, evaluation datasets, and deployment configs. Current highlights:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Template</th><th>Use case</th></tr></thead><tbody><tr><td><code>customer-service</code></td><td>Multi-turn support agent with escalation and order lookup</td></tr><tr><td><code>document-qa</code></td><td>RAG-backed Q&amp;A over uploaded documents</td></tr><tr><td><code>code-assistant</code></td><td>Code generation, review, and explanation</td></tr><tr><td><code>data-analyst</code></td><td>Natural language to BigQuery SQL</td></tr><tr><td><code>travel-concierge</code></td><td>Multi-agent travel planning (flight + hotel + activities)</td></tr><tr><td><code>folio-advisor</code></td><td>Financial portfolio analysis with tool use</td></tr></tbody></table></figure>



<p class="wp-block-paragraph">To use a template from the CLI:</p>



<pre class="wp-block-code"><code># Install the Google ADK
pip install google-adk

# Clone the adk-samples repository
git clone https://github.com/google/adk-samples.git
cd adk-samples/python/agents/customer-service

# Run locally (adk run takes the agent package folder, not a .py file)
adk run customer_service

# Inspect in the dev UI
adk web
</code></pre>



<p class="wp-block-paragraph">Each sample is a working starting point, not a toy. The customer-service template handles order lookups, refund requests, escalation to human agents, and session memory — all wired and ready to customise.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Part 4: RAG Engine — Grounding Agents on Enterprise Data</h2>



<p class="wp-block-paragraph">The most powerful capability in the platform for enterprise deployments is <strong>RAG Engine</strong>: a fully managed data framework for connecting private enterprise data to LLM agents.</p>



<p class="wp-block-paragraph">RAG Engine on Gemini Enterprise Agent Platform is a data framework for building context-augmented LLM applications. Context augmentation means supplying the LLM with relevant passages from your own data at query time, which is the core of retrieval-augmented generation (RAG).</p>



<p class="wp-block-paragraph">RAG Engine handles the full pipeline: document ingestion, parsing, chunking, embedding, vector indexing, and retrieval — all managed, serverless, and integrated with the Gemini models.</p>



<h3 class="wp-block-heading">Step 1: Create a RAG Corpus</h3>



<p class="wp-block-paragraph">A corpus is the container for your indexed documents. Create it once; it persists, and files you import later are indexed into the same corpus.</p>



<pre class="wp-block-code"><code># rag_setup.py
# pip install google-cloud-aiplatform

import vertexai
from vertexai.preview import rag

PROJECT_ID = "your-gcp-project-id"
LOCATION = "us-central1"

vertexai.init(project=PROJECT_ID, location=LOCATION)

# Create the corpus
corpus = rag.create_corpus(
    display_name="enterprise-knowledge-base",
    description="Internal policy docs, product manuals, and SOPs",
)
print(f"Corpus created: {corpus.name}")
</code></pre>



<h3 class="wp-block-heading">Step 2: Import Documents</h3>



<p class="wp-block-paragraph">RAG Engine supports Google Cloud Storage, Google Drive, Google Docs, inline text, and Slack/Confluence via connectors. It automatically parses PDFs, Word docs, HTML, and plain text.</p>



<pre class="wp-block-code"><code># rag_import.py
import vertexai
from vertexai.preview import rag

PROJECT_ID  = "your-gcp-project-id"
LOCATION    = "us-central1"
CORPUS_NAME = "projects/your-gcp-project-id/locations/us-central1/ragCorpora/YOUR_CORPUS_ID"

vertexai.init(project=PROJECT_ID, location=LOCATION)

# Import files from Google Cloud Storage
response = rag.import_files(
    corpus_name=CORPUS_NAME,
    paths=&#91;
        "gs://your-bucket/docs/policy_manual_2025.pdf",
        "gs://your-bucket/docs/product_catalogue.pdf",
    ],
    transformation_config=rag.TransformationConfig(
        chunking_config=rag.ChunkingConfig(
            chunk_size=512,     # tokens per chunk
            chunk_overlap=100,  # overlap for context continuity
        ),
    ),
)
print(f"Files imported: {response.imported_rag_files_count}")
</code></pre>
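


<p class="wp-block-paragraph">To build intuition for <code>chunk_size</code> and <code>chunk_overlap</code>, here is a minimal sliding-window sketch. It is an illustration only: <code>chunk_tokens</code> is a hypothetical helper, and it treats list items as tokens, whereas RAG Engine applies its own parsing and tokenisation.</p>

```python
def chunk_tokens(tokens, chunk_size=512, chunk_overlap=100):
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if not chunk_size > chunk_overlap >= 0:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    if not tokens:
        return []
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
        start += step
    return chunks


# A 1,000-token document with the settings used in rag_import.py
chunks = chunk_tokens(list(range(1000)), chunk_size=512, chunk_overlap=100)
```

<p class="wp-block-paragraph">Larger overlap preserves more context across chunk boundaries at the cost of more chunks to embed and store.</p>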



<h3 class="wp-block-heading">Step 3: Query with Gemini + RAG Tool</h3>



<p class="wp-block-paragraph">Attach the corpus as a retrieval tool and pass it to a Gemini model. Every <code>generate_content</code> call now retrieves before generating.</p>



<pre class="wp-block-code"><code># rag_query.py
import vertexai
from vertexai.preview import rag
from vertexai.generative_models import GenerativeModel, Tool

PROJECT_ID  = "your-gcp-project-id"
LOCATION    = "us-central1"
CORPUS_NAME = "projects/your-gcp-project-id/locations/us-central1/ragCorpora/YOUR_CORPUS_ID"

vertexai.init(project=PROJECT_ID, location=LOCATION)

# Build the RAG retrieval tool
rag_retrieval_tool = Tool.from_retrieval(
    retrieval=rag.Retrieval(
        source=rag.VertexRagStore(
            rag_corpora=&#91;CORPUS_NAME],
            similarity_top_k=5,           # return top 5 passages
            vector_distance_threshold=0.5, # keep only passages whose vector distance is below this
        ),
    )
)

# Attach to Gemini -- now every response is grounded in your documents
model = GenerativeModel(
    model_name="gemini-2.0-flash",
    tools=&#91;rag_retrieval_tool],
)

response = model.generate_content(
    "What is our refund policy for enterprise software licences?"
)
print(response.text)
</code></pre>
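


<p class="wp-block-paragraph">The two retrieval knobs interact: the distance threshold filters first, then <code>similarity_top_k</code> caps what remains. The sketch below shows that logic on hand-made vectors. It assumes cosine distance and uses hypothetical helper names (<code>cosine_distance</code>, <code>retrieve</code>); the managed service may use a different metric depending on how the corpus is configured.</p>

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity, so 0.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def retrieve(query_vec, indexed, similarity_top_k=5, vector_distance_threshold=0.5):
    """Drop passages beyond the distance threshold, then keep the closest top_k."""
    scored = [(cosine_distance(query_vec, vec), text) for text, vec in indexed]
    kept = [item for item in scored if vector_distance_threshold >= item[0]]
    kept.sort(key=lambda item: item[0])  # smallest distance first
    return kept[:similarity_top_k]


# Hypothetical two-dimensional embeddings, for illustration only.
indexed = [
    ("refund policy", [1.0, 0.0]),
    ("licence terms", [1.0, 1.0]),
    ("office parking", [0.0, 1.0]),
]
results = retrieve([1.0, 0.0], indexed)
```

<p class="wp-block-paragraph">A threshold that is too strict can return nothing at all, so it is worth logging how many passages survive the filter while tuning.</p>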



<h3 class="wp-block-heading">Step 4: RAG-Grounded ADK Agent</h3>



<p class="wp-block-paragraph">For multi-agent systems, wrap the RAG corpus as an ADK tool and give it to a specialist agent:</p>



<pre class="wp-block-code"><code># rag_agent.py
import vertexai
from google.adk.agents import LlmAgent
from google.adk.tools.retrieval.vertex_ai_rag_retrieval import VertexAiRagRetrieval

PROJECT_ID  = "your-gcp-project-id"
LOCATION    = "us-central1"
CORPUS_NAME = "projects/your-gcp-project-id/locations/us-central1/ragCorpora/YOUR_CORPUS_ID"

vertexai.init(project=PROJECT_ID, location=LOCATION)

# Wrap the RAG corpus as an ADK retrieval tool
rag_tool = VertexAiRagRetrieval(
    name="knowledge_base",
    description="Searches internal documents: policies, SOPs, product specs.",
    rag_corpora=&#91;CORPUS_NAME],
    similarity_top_k=5,
)

# Policy agent grounded in enterprise docs
policy_agent = LlmAgent(
    model="gemini-2.0-flash",
    name="policy_agent",
    description="Answers questions about company policies and SOPs using the knowledge base.",
    instruction=(
        "You are an enterprise policy assistant. "
        "Always use the knowledge_base tool to retrieve relevant policies before answering. "
        "Cite the source document and page number in your response. "
        "Never make up policy details -- only reference retrieved content."
    ),
    tools=&#91;rag_tool],
)
</code></pre>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph">Reference: <a href="https://docs.cloud.google.com/gemini-enterprise-agent-platform/build/rag-engine/rag-overview" rel="nofollow noopener" target="_blank">RAG Engine overview</a></p>
</blockquote>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Part 5: Agent Search — Out-of-the-Box Search for Specialised Domains</h2>



<p class="wp-block-paragraph">RAG Engine handles unstructured documents. <strong>Agent Search</strong> handles specialised retrieval needs at enterprise scale — with pre-tuned modes for different industry domains.</p>



<p class="wp-block-paragraph">Agent Search functions as an out-of-the-box RAG system for information retrieval, with specialised offerings tuned for industry-specific requirements. The four modes map to distinct use cases:</p>



<p class="wp-block-paragraph"><strong>Custom Search (General)</strong> builds tailored search, personalisation, and generative experiences on your sites, content, catalogues, and blended data. Data sources: structured catalogues (hotels, directories), unstructured files with metadata, Google Workspace connectors, and public sites. This is the go-to for internal knowledge base search where your data lives in Drive, Confluence, or GCS buckets.</p>



<p class="wp-block-paragraph"><strong>Site Search with AI Mode</strong> lets you stand up generative search with AI mode over your site content in about a day. It leverages Google&#8217;s index for real-time crawling and adds search summarisation on top. The distinct advantage: you get Google&#8217;s crawling infrastructure without running your own spider. Ideal for documentation sites and product help centres that change frequently.</p>



<p class="wp-block-paragraph"><strong>Media Search</strong> is designed for media libraries — images, videos, and audio files. This is purpose-built for broadcast, publishing, and creative industries where the asset itself (not just its metadata) needs to be searchable.</p>



<p class="wp-block-paragraph"><strong>AI Commerce Search</strong> handles retail catalogues specifically. If you&#8217;re building search for an e-commerce platform, this mode is tuned for product discovery, faceted filtering, and purchase intent signals.</p>



<p class="wp-block-paragraph">Create an Agent Search app from the console at <code>Agent Platform → Agent Search → Create App</code>, or via the Discovery Engine API:</p>



<pre class="wp-block-code"><code># Create a search app via the CLI
gcloud alpha discovery-engine engines create \
  --project=YOUR_PROJECT_ID \
  --location=global \
  --display-name="internal-knowledge-search" \
  --solution-type=SOLUTION_TYPE_SEARCH \
  --data-store-ids=YOUR_DATA_STORE_ID
</code></pre>
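


<p class="wp-block-paragraph">If you prefer to call the API directly, the same app can be described as a REST request. The sketch below only builds the URL and JSON body and sends nothing. The endpoint shape and field names reflect my understanding of the Discovery Engine v1 <code>engines.create</code> method and should be checked against the current API reference; <code>build_create_engine_request</code> and the sample IDs are hypothetical.</p>

```python
import json
from urllib.parse import urlencode


def build_create_engine_request(project_id, engine_id, display_name,
                                data_store_ids, location="global"):
    """Return (url, json_body) for a create-engine call without sending it."""
    parent = (
        f"projects/{project_id}/locations/{location}"
        "/collections/default_collection"
    )
    # Assumed endpoint shape for the Discovery Engine v1 REST API.
    url = (
        f"https://discoveryengine.googleapis.com/v1/{parent}/engines?"
        + urlencode({"engineId": engine_id})
    )
    body = {
        "displayName": display_name,
        "solutionType": "SOLUTION_TYPE_SEARCH",
        "dataStoreIds": list(data_store_ids),
    }
    return url, json.dumps(body, indent=2)


url, body = build_create_engine_request(
    project_id="your-project-id",
    engine_id="internal-knowledge-search",
    display_name="internal-knowledge-search",
    data_store_ids=["your-data-store-id"],
)
```

<p class="wp-block-paragraph">Sending the request additionally needs an OAuth access token in the <code>Authorization</code> header, which is left out here on purpose.</p>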



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Part 6: Memory Bank — Long-Term Personalisation Across Sessions</h2>



<p class="wp-block-paragraph">RAG Engine grounds agents in documents. <strong>Memory Bank</strong> grounds agents in <em>users</em> — storing personalised facts, preferences, and context that persist across every session, indefinitely.</p>



<p class="wp-block-paragraph">Memory Bank stores long-term memory containing personalised information to enable more context-aware agent interactions across multiple sessions. From the console you can view, search, and manage the agent&#8217;s saved memories — including total memory count, token usage, and mutation rates.</p>



<p class="wp-block-paragraph">In code, attach Memory Bank to any ADK agent:</p>



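<pre class="wp-block-code"><code># memory_agent.py
from google.adk.agents import LlmAgent
from google.adk.memory import VertexAiMemoryBankService
from google.adk.runners import Runner
from google.adk.sessions import InMemorySessionService
from google.adk.tools.preload_memory_tool import PreloadMemoryTool

# Memory Bank service -- backed by Vertex AI managed storage.
# It is scoped to a deployed Agent Runtime instance; replace the placeholder ID.
memory_service = VertexAiMemoryBankService(
    project="your-gcp-project-id",
    location="us-central1",
    agent_engine_id="YOUR_AGENT_ENGINE_ID",
)

# Agent that loads relevant memories at the start of every turn
personalised_agent = LlmAgent(
    model="gemini-2.0-flash",
    name="personalised_support_agent",
    description="Customer support agent with long-term memory of user preferences.",
    instruction=(
        "You are a helpful customer support agent. "
        "Remember the user's preferences, past issues, and account context. "
        "Use your memory to personalise every interaction."
    ),
    tools=&#91;PreloadMemoryTool()],
)

# The memory service is attached to the Runner, not to the agent itself
runner = Runner(
    agent=personalised_agent,
    app_name="personalised_support",
    session_service=InMemorySessionService(),  # swap for a persistent service in production
    memory_service=memory_service,
)
# After a session ends, persist it: await memory_service.add_session_to_memory(session)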
</code></pre>



<p class="wp-block-paragraph">When a user says <em>&#8220;I prefer email notifications, not SMS&#8221;</em> in session 1, the agent writes that preference to Memory Bank. In session 47, three months later, the agent still knows it — without the user repeating themselves.</p>
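


<p class="wp-block-paragraph">The email-versus-SMS example boils down to a store keyed by user rather than by session. The toy class below sketches that idea in plain Python. <code>ToyMemoryBank</code> is hypothetical and keeps everything in process memory; the managed service extracts and consolidates memories with a model, which this sketch does not attempt.</p>

```python
import re


class ToyMemoryBank:
    """In-process stand-in for a long-term memory store, scoped per user."""

    def __init__(self):
        self._memories = {}

    def add(self, user_id, fact):
        """Store a fact for the user, skipping exact duplicates."""
        facts = self._memories.setdefault(user_id, [])
        if fact not in facts:
            facts.append(fact)

    def search(self, user_id, query):
        """Return the user's facts that share at least one word with the query."""
        terms = set(re.findall(r"[a-z0-9]+", query.lower()))
        return [
            fact
            for fact in self._memories.get(user_id, [])
            if terms.intersection(re.findall(r"[a-z0-9]+", fact.lower()))
        ]


bank = ToyMemoryBank()
# Session 1: the user states a preference.
bank.add("user-123", "Prefers email notifications, not SMS")
# Session 47: a brand-new conversation, but the same user_id.
recalled = bank.search("user-123", "How should we send notifications?")
# A different user sees none of it.
other = bank.search("user-999", "notifications")
```

<p class="wp-block-paragraph">The design point is the key: because memories hang off the user ID, starting a fresh session does not reset what the agent knows about that person.</p>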



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph">Note: As of January 2026, stored session events and memories are billed at $0.25 per 1,000 events or memories. Plan your retention policies accordingly.</p>
</blockquote>
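


<p class="wp-block-paragraph">For a rough sizing exercise, the rate quoted in the note above can be turned into a quick calculation. The workload figures here are made up for illustration, and <code>storage_cost</code> is a hypothetical helper; the note does not state the billing period, so treat the result as the charge per billing unit and confirm the details on the official pricing page.</p>

```python
from decimal import Decimal

PRICE_PER_1000 = Decimal("0.25")  # USD per 1,000 stored events or memories, as quoted above


def storage_cost(stored_items):
    """Return the charge in USD for the given number of stored events or memories."""
    cost = Decimal(stored_items) / Decimal(1000) * PRICE_PER_1000
    return cost.quantize(Decimal("0.01"))


# Hypothetical workload: 50,000 users with 40 stored events or memories each.
estimate = storage_cost(50_000 * 40)
print(f"Estimated charge: ${estimate}")
```

<p class="wp-block-paragraph">Running the numbers per user makes it easier to decide how long to retain raw session events versus consolidated memories.</p>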



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Part 7: Deploying to Agent Runtime</h2>



<p class="wp-block-paragraph">Once your agent is built and tested, deploy it to <strong>Agent Runtime</strong> — the managed execution environment that handles auto-scaling, IAM, observability, and CI/CD integration.</p>



<p class="wp-block-paragraph">The platform supports five deployment methods — choose based on your workflow:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Method</th><th>Best for</th></tr></thead><tbody><tr><td>From agent object</td><td>Interactive Colab development, rapid prototyping</td></tr><tr><td>From source files</td><td>CI/CD pipelines, Terraform / Infrastructure as Code</td></tr><tr><td>From Dockerfile</td><td>Custom API server, specific runtime dependencies</td></tr><tr><td>From container image</td><td>Full build process control, lower deployment latency</td></tr><tr><td>From Developer Connect</td><td>Git-connected repos, native version control and collaboration</td></tr></tbody></table></figure>



<p class="wp-block-paragraph">The simplest path — deploying directly from an in-memory agent object — takes three lines after your agent is defined:</p>



<pre class="wp-block-code"><code># deploy_agent.py
import vertexai
from google.adk.agents import LlmAgent

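PROJECT_ID     = "your-gcp-project-id"
LOCATION       = "us-central1"
STAGING_BUCKET = "gs://your-staging-bucket"  # deployment artifacts are uploaded here

vertexai.init(project=PROJECT_ID, location=LOCATION, staging_bucket=STAGING_BUCKET)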

def get_order_status(order_id: str) -&gt; dict:
    """Look up the current status of an order by its ID."""
    return {"order_id": order_id, "status": "shipped", "eta": "2025-07-15"}

support_agent = LlmAgent(
    model="gemini-2.0-flash",
    name="support_agent",
    description="Handles customer order enquiries.",
    instruction="Help customers track their orders. Always use get_order_status.",
    tools=&#91;get_order_status],
)

# Deploy to Agent Runtime -- three lines
from vertexai.preview.reasoning_engines import AdkApp

adk_app = AdkApp(agent=support_agent, enable_tracing=True)

remote_app = vertexai.preview.reasoning_engines.ReasoningEngine.create(
    adk_app,
    requirements=&#91;"google-adk&gt;=1.0.0"],
    display_name="support-agent-v1",
    description="Customer support agent - order tracking",
)
print(f"Deployed: {remote_app.resource_name}")
</code></pre>



<p class="wp-block-paragraph">After deployment, the agent is available as a REST endpoint, callable from any service with the right IAM permissions.</p>
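


<p class="wp-block-paragraph">As a sketch of what calling that endpoint can look like, the snippet below assembles an HTTP request with the standard library and stops short of sending it. The URL pattern, the <code>:streamQuery</code> suffix, and the <code>class_method</code> payload are assumptions based on the Vertex AI Agent Engine REST interface as I understand it; verify them against the deployment reference. <code>build_query_request</code> and all IDs are hypothetical.</p>

```python
import json
import urllib.request


def build_query_request(resource_name, location, access_token, user_id, message):
    """Build (but do not send) a POST request to a deployed agent."""
    # Assumed URL pattern for the regional endpoint.
    url = (
        f"https://{location}-aiplatform.googleapis.com/v1/"
        f"{resource_name}:streamQuery?alt=sse"
    )
    payload = {
        "class_method": "async_stream_query",  # assumed method name for ADK apps
        "input": {"user_id": user_id, "message": message},
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_query_request(
    resource_name="projects/your-gcp-project-id/locations/us-central1/reasoningEngines/YOUR_ENGINE_ID",
    location="us-central1",
    access_token="YOUR_ACCESS_TOKEN",  # e.g. obtained with gcloud auth print-access-token
    user_id="user-123",
    message="Where is order 1001?",
)
```

<p class="wp-block-paragraph">The caller&#8217;s identity needs permission to query the deployed resource; without it the service rejects the request regardless of how the payload is shaped.</p>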



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph">Reference: <a href="https://docs.cloud.google.com/gemini-enterprise-agent-platform/scale/runtime/deploy-an-agent" rel="nofollow noopener" target="_blank">Deploy an agent on Agent Runtime</a></p>
</blockquote>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Part 8: Built-in Evaluation — Quality Before You Ship</h2>



<p class="wp-block-paragraph">Every agent needs evaluation before it reaches production. The Gemini Enterprise Agent Platform&#8217;s evaluation layer runs directly in the console (Evaluation tab) or via the Vertex AI SDK.</p>



<p class="wp-block-paragraph">Three evaluation modes are available: <strong>Experiments</strong> for one-off quality assessments against a dataset, <strong>Metrics</strong> for defining and tracking custom quality dimensions, and <strong>Online Monitors</strong> for continuous evaluation in production.</p>



<p class="wp-block-paragraph">Here&#8217;s a complete evaluation run using the SDK with a custom LLM-as-judge metric:</p>



<pre class="wp-block-code"><code># evaluate_agent.py
import vertexai
from vertexai.preview.evaluation import EvalTask
from vertexai.preview.evaluation.metrics import (
    PointwiseMetric,
    PointwiseMetricPromptTemplate,
)

PROJECT_ID = "your-gcp-project-id"
LOCATION   = "us-central1"

vertexai.init(project=PROJECT_ID, location=LOCATION)

# Define a custom coherence metric using LLM-as-judge
coherence_metric = PointwiseMetric(
    metric="coherence",
    metric_prompt_template=PointwiseMetricPromptTemplate(
        criteria={
            "coherence": (
                "The response is logically structured, easy to follow, "
                "and the ideas connect naturally."
            )
        },
        rating_rubric={
            "5": "Perfectly coherent -- flows naturally, no gaps.",
            "3": "Mostly coherent with minor issues.",
            "1": "Incoherent -- hard to follow.",
        },
    ),
)

# Evaluation dataset (inputs + expected outputs), wrapped in a DataFrame for EvalTask
import pandas as pd

eval_dataset = pd.DataFrame(&#91;
    {
        "prompt": "What is the refund policy for digital products?",
        "response": "Digital products are non-refundable unless the file is corrupted on delivery.",
        "reference": "Digital purchases are non-refundable except in cases of delivery errors.",
    },
    {
        "prompt": "How do I reset my password?",
        "response": "Go to the login page and click Forgot Password to receive a reset link by email.",
        "reference": "Click Forgot Password on the login page; a reset link will be emailed to you.",
    },
])

# Run the evaluation experiment
eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=&#91;"exact_match", "rouge_l_sum", coherence_metric],
    experiment="support-agent-eval-v1",
)

eval_result = eval_task.evaluate()
print(eval_result.summary_metrics)
</code></pre>
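


<p class="wp-block-paragraph">It helps to know what the two computed metrics measure before reading the scores. The sketch below is a simplified, hand-rolled version: <code>exact_match</code> is a strict string comparison, and <code>rouge_l_f1</code> is an F-measure over the longest common subsequence of whitespace tokens. It is not the SDK&#8217;s implementation, and the service&#8217;s <code>rouge_l_sum</code> variant tokenises and aggregates differently.</p>

```python
def exact_match(response, reference):
    """Return 1.0 only when the two strings are identical after trimming."""
    return 1.0 if response.strip() == reference.strip() else 0.0


def lcs_length(a, b):
    """Length of the longest common subsequence, via dynamic programming."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, start=1):
        for j, y in enumerate(b, start=1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]


def rouge_l_f1(response, reference):
    """Simplified ROUGE-L: F1 of LCS-based precision and recall."""
    resp = response.lower().split()
    ref = reference.lower().split()
    lcs = lcs_length(resp, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(resp)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


score = rouge_l_f1("a b c d", "a x c y")
```

<p class="wp-block-paragraph">Because the sample responses in the dataset paraphrase their references, a strict exact-match check fails on both rows, which is why an overlap metric and an LLM-as-judge metric are included alongside it.</p>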



<p class="wp-block-paragraph">This experiment appears in the Agent Platform console under <code>Evaluation → Experiments</code>, where you can compare multiple runs side by side — exactly like the LangSmith experiment comparison we covered in the evaluation pillar post.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph">Reference: <a href="https://docs.cloud.google.com/gemini-enterprise-agent-platform/evaluate" rel="nofollow noopener" target="_blank">Evaluation on Agent Platform</a></p>
</blockquote>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Part 9: Governance — Policies, IAM, and Agent Gateway</h2>



<p class="wp-block-paragraph">Enterprise deployment isn&#8217;t complete without governance. The platform provides three governance layers.</p>



<p class="wp-block-paragraph"><strong>Agent Identity</strong> gives each deployed agent its own service account identity — enabling fine-grained IAM permissions per agent. Your support agent can read from Firestore and call the orders API. It cannot write to BigQuery or access the HR database. Least privilege, enforced at the identity level.</p>



<p class="wp-block-paragraph"><strong>Agent Gateway</strong> acts as the secure API layer between agents and the tools, MCP servers, and endpoints they call. It enforces IAM allow policies through Identity-Aware Proxy (IAP), controlling which agent identities can access which resources. Think of it as an API gateway that speaks agent — it understands tool calls, not just HTTP requests.</p>



<p class="wp-block-paragraph"><strong>Business Policies</strong> (in the console at <code>Policies → Business Policies</code>) let you define natural-language rules that constrain agent behaviour across your organisation: <em>&#8220;Agents must always disclose when they are AI.&#8221;</em> <em>&#8220;Agents must not discuss competitor pricing.&#8221;</em> These are enforced at the Gateway layer, not in the individual agent instructions.</p>
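


<p class="wp-block-paragraph">The least-privilege idea behind Agent Identity and Agent Gateway can be sketched as a deny-by-default lookup. Everything below is hypothetical: the policy table, the service account name, and <code>is_allowed</code> are illustrative stand-ins for IAM allow policies, not how the platform stores or evaluates them.</p>

```python
# Hypothetical allow policy: each agent identity maps to the
# (resource, action) pairs it has been explicitly granted.
ALLOW_POLICY = {
    "support-agent@example-project.iam.gserviceaccount.com": {
        ("firestore", "read"),
        ("orders-api", "call"),
    },
}


def is_allowed(agent_identity, resource, action):
    """Deny by default; allow only pairs explicitly granted to the identity."""
    return (resource, action) in ALLOW_POLICY.get(agent_identity, set())


SUPPORT = "support-agent@example-project.iam.gserviceaccount.com"
can_read_orders = is_allowed(SUPPORT, "firestore", "read")
can_write_bigquery = is_allowed(SUPPORT, "bigquery", "write")
```

<p class="wp-block-paragraph">Keeping the check outside the agent&#8217;s prompt matters: an instruction can be talked around, while a missing grant cannot.</p>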



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">The Complete Platform Map</h2>



<pre class="wp-block-code"><code>CONSOLE ENTRY POINTS
├── Agent Studio        → Visual agent designer, test, export to ADK
├── Agent Garden        → Prebuilt templates (customer-service, doc-QA, etc.)
├── RAG Engine          → Managed document indexing + retrieval
├── Agent Search        → Domain-specific search (general, site, media, commerce)
├── Memory Bank         → Long-term user personalisation
├── Agent Runtime       → Deploy, scale, monitor deployed agents
├── Evaluation          → Experiments, metrics, online monitors
└── Policies            → IAM, Agent Gateway, Business Policies

DEVELOPER ENTRY POINTS
├── ADK                 → Python/TypeScript/Go/Java agent framework
├── Colab Enterprise    → Notebooks with Vertex AI integration
├── Agents CLI          → adk run, adk web, adk eval, adk deploy
└── Developer Connect   → Git-linked CI/CD deployments
</code></pre>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Where to Start</h2>



<p class="wp-block-paragraph">The right entry point depends on your team:</p>



<p class="wp-block-paragraph"><strong>Non-technical teams</strong> building internal tools → start in Agent Studio, connect Agent Search to Google Drive, deploy to Agent Runtime with one click.</p>



<p class="wp-block-paragraph"><strong>Developers building production agents</strong> → scaffold from Agent Garden, extend with ADK code, ground with RAG Engine, deploy from source files via the Agents CLI.</p>



<p class="wp-block-paragraph"><strong>Enterprise architects</strong> designing multi-agent systems → use ADK for the agent layer, RAG Engine for knowledge, Memory Bank for personalisation, Agent Gateway for governance, and Agent Runtime for deployment across regions.</p>



<p class="wp-block-paragraph">All three paths deploy to the same runtime, share the same evaluation tooling, and operate under the same governance layer. That&#8217;s the point of a unified platform.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Resources</h2>



<ul class="wp-block-list">
<li><a href="https://docs.cloud.google.com/gemini-enterprise-agent-platform" rel="nofollow noopener" target="_blank">Gemini Enterprise Agent Platform overview</a> — official home</li>



<li><a href="https://docs.cloud.google.com/gemini-enterprise-agent-platform/agent-studio/design-agents" rel="nofollow noopener" target="_blank">Agent Studio — Design agents</a> — console visual designer</li>



<li><a href="https://docs.cloud.google.com/gemini-enterprise-agent-platform/build/agent-garden" rel="nofollow noopener" target="_blank">Agent Garden</a> — prebuilt templates</li>



<li><a href="https://docs.cloud.google.com/gemini-enterprise-agent-platform/build/adk" rel="nofollow noopener" target="_blank">ADK on Agent Platform</a> — code-first development</li>



<li><a href="https://docs.cloud.google.com/gemini-enterprise-agent-platform/build/rag-engine/rag-overview" rel="nofollow noopener" target="_blank">RAG Engine overview</a> — managed retrieval framework</li>



<li><a href="https://docs.cloud.google.com/gemini-enterprise-agent-platform/build/rag-engine/rag-quickstart" rel="nofollow noopener" target="_blank">RAG Engine quickstart</a> — build your first corpus</li>



<li><a href="https://docs.cloud.google.com/gemini-enterprise-agent-platform/scale/runtime/deploy-an-agent" rel="nofollow noopener" target="_blank">Deploy an agent on Agent Runtime</a> — all five deployment methods</li>



<li><a href="https://docs.cloud.google.com/gemini-enterprise-agent-platform/evaluate" rel="nofollow noopener" target="_blank">Evaluation on Agent Platform</a> — experiments, metrics, online monitors</li>



<li><a href="https://docs.cloud.google.com/gemini-enterprise-agent-platform/govern/policies/overview" rel="nofollow noopener" target="_blank">Agent Governance overview</a> — IAM, Gateway, Business Policies</li>



<li><a href="https://github.com/google/adk-samples/tree/main/python/agents" rel="nofollow noopener" target="_blank">adk-samples on GitHub</a> — Agent Garden source templates</li>



<li><a href="https://thenewstack.io/google-gemini-agent-platform/" rel="nofollow noopener" target="_blank">Google Cloud Next 2026 Agent Platform announcement</a> — the rebrand explained</li>
</ul>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p class="wp-block-paragraph"><em>All code examples syntax-verified against Python 3.11. Install: <code>pip install google-adk google-cloud-aiplatform</code>. Free tier available: up to 10 agent engines, 90 days via Vertex AI Express Mode.</em></p>
]]></content:encoded>
					
					<wfw:commentRss>https://rpabotsworld.com/google-agent-studio-gemini-enterprise-agent-platform-guide/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Building Multi-Agent Systems with Google ADK: The Complete Step-by-Step Guide</title>
		<link>https://rpabotsworld.com/building-multi-agent-systems-with-google-adk-the-complete-step-by-step-guide/</link>
					<comments>https://rpabotsworld.com/building-multi-agent-systems-with-google-adk-the-complete-step-by-step-guide/#respond</comments>
		
		<dc:creator><![CDATA[Satish Prasad]]></dc:creator>
		<pubDate>Fri, 12 Jun 2026 18:22:32 +0000</pubDate>
				<category><![CDATA[Agentic AI & AI Automation]]></category>
		<category><![CDATA[AI Agents & Frameworks]]></category>
		<guid isPermaLink="false">https://rpabotsworld.com/?p=32109</guid>

					<description><![CDATA[Google&#8217;s Agent Development Kit is the same framework powering Agentspace and Google&#8217;s Customer Engagement Suite. This guide teaches you to build production-grade multi-agent systems with it — from your first agent to parallel specialist teams. The Day One Agent Problem Every AI agent project starts with an optimistic prompt: &#8220;You are a smart assistant. Handle [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p class="wp-block-paragraph"><em>Google&#8217;s Agent Development Kit is the same framework powering Agentspace and Google&#8217;s Customer Engagement Suite. This guide teaches you to build production-grade multi-agent systems with it — from your first agent to parallel specialist teams.</em></p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">The Day One Agent Problem</h2>



<p class="wp-block-paragraph">Every AI agent project starts with an optimistic prompt: <em>&#8220;You are a smart assistant. Handle everything the user asks.&#8221;</em></p>



<p class="wp-block-paragraph">Three weeks later, that single agent is juggling 40 tools, a system prompt that&#8217;s 3,000 tokens long, and a reliability rate that drops with every new capability you add. The more it knows, the worse it performs at any one thing.</p>



<p class="wp-block-paragraph">This is the monolith trap. And the solution — like in software architecture — is decomposition.</p>



<p class="wp-block-paragraph">Instead of one agent that does everything, build a <strong>team of specialists</strong> that each do one thing exceptionally well, coordinated by an orchestrator that knows how to delegate. That&#8217;s exactly what multi-agent systems are designed for.</p>



<p class="wp-block-paragraph">Google&#8217;s <strong>Agent Development Kit (ADK)</strong> was built for this exact pattern. Announced at Google Cloud NEXT 2025 and now open-source, ADK is designed to simplify the full stack end-to-end development of agents and multi-agent systems, empowering developers to build production-ready agentic applications with greater flexibility and precise control. Critically, it&#8217;s the same framework Google uses internally — ADK is the same framework powering agents within Google products like Agentspace and the Google Customer Engagement Suite (CES).</p>



<p class="wp-block-paragraph">This guide teaches you every concept you need, with working code at every step.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Part 1: Understanding ADK&#8217;s Architecture</h2>



<p class="wp-block-paragraph">Before writing code, internalize the mental model. ADK is built around a handful of clean primitives that compose naturally.</p>



<p class="wp-block-paragraph">The <strong>Agent</strong> is the fundamental worker unit, designed for a specific task. Agents can use language models (<code>LlmAgent</code>) for complex reasoning, or act as deterministic controllers of execution called <strong>workflow agents</strong> (<code>SequentialAgent</code>, <code>ParallelAgent</code>, <code>LoopAgent</code>). <strong>Tools</strong> give agents abilities beyond conversation, letting them interact with external APIs, search information, run code, or call other services.</p>



<p class="wp-block-paragraph">These four agent classes (one LLM-driven, three deterministic workflow agents) serve different roles:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Type</th><th>Powered by</th><th>Use when</th></tr></thead><tbody><tr><td><code>LlmAgent</code></td><td>Gemini / any LLM</td><td>Reasoning, decision-making, dynamic responses</td></tr><tr><td><code>SequentialAgent</code></td><td>Deterministic</td><td>Fixed step-by-step pipelines</td></tr><tr><td><code>ParallelAgent</code></td><td>Deterministic</td><td>Independent tasks that can run concurrently</td></tr><tr><td><code>LoopAgent</code></td><td>Deterministic</td><td>Iterative refinement until a condition is met</td></tr></tbody></table></figure>



<p class="wp-block-paragraph">The ADK empowers developers to get more reliable, sophisticated, multi-step behaviors from generative models. Instead of one complex prompt, ADK lets you build a flow of multiple, simpler agents that collaborate on a problem by dividing the work.</p>



<p class="wp-block-paragraph">Why does this matter? Because specialized agents are more reliable at their specific tasks than one large, complex agent. It&#8217;s easier to fix or improve a small, specialized agent without breaking other parts of the system. Agents built for one workflow can be easily reused in others.</p>



<h3 class="wp-block-heading">The Hierarchy Model</h3>



<p class="wp-block-paragraph">In ADK, you organize agents in a tree structure. A root coordinator sits at the top. Specialist sub-agents handle specific domains. Communication flows through three mechanisms: shared session state, LLM-driven delegation (agent transfer), and explicit invocation via <code>AgentTool</code>.</p>



<pre class="wp-block-code"><code>Root Coordinator (LlmAgent)
├── Specialist A (LlmAgent + tools)
├── Specialist B (LlmAgent + tools)
└── Workflow Orchestrator
    ├── Stage 1 Agent
    ├── Stage 2 Agent
    └── Stage 3 Agent
</code></pre>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Part 2: Installation and Setup</h2>



<p class="wp-block-paragraph">ADK is available in Python, TypeScript, Go, and Java. We&#8217;ll use Python throughout.</p>



<pre class="wp-block-code"><code># Create project and install ADK
mkdir travel-multi-agent &amp;&amp; cd travel-multi-agent
python -m venv .venv &amp;&amp; source .venv/bin/activate

pip install google-adk

# Set your Gemini API key
export GOOGLE_API_KEY="your_gemini_api_key_here"
# Get one free at: https://aistudio.google.com/app/apikey
</code></pre>



<p class="wp-block-paragraph">Verify the install:</p>



<pre class="wp-block-code"><code>adk --version
</code></pre>



<p class="wp-block-paragraph">ADK ships with a built-in developer UI you can launch for any project:</p>



<pre class="wp-block-code"><code>adk web          # Launches the visual debugger at http://localhost:8000
adk run          # CLI runner for scripted testing
</code></pre>



<p class="wp-block-paragraph">The developer UI is one of ADK&#8217;s most practical advantages over other frameworks — every event, tool call, state change, and agent transfer is inspectable in real time without any extra instrumentation.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Part 3: Your First Agent — One LlmAgent with Tools</h2>



<p class="wp-block-paragraph">Let&#8217;s start minimal. A single <code>LlmAgent</code> with a tool teaches you the fundamental pattern before we add orchestration.</p>



<pre class="wp-block-code"><code># weather_agent/agent.py
# pip install google-adk
# The weather_agent/ folder also needs an __init__.py containing: from . import agent

from google.adk.agents import LlmAgent
from google.adk.tools import google_search

# A minimal single agent
weather_agent = LlmAgent(
    model="gemini-2.0-flash",
    name="weather_agent",
    description="Answers weather-related questions using Google Search.",
    instruction="""
    You are a helpful weather assistant.
    Always use the google_search tool to find current weather data.
    Provide concise, accurate answers including temperature, conditions,
    and any relevant weather warnings.
    """,
    tools=&#91;google_search],
)

# The ADK CLI looks for a module-level variable named root_agent
root_agent = weather_agent
</code></pre>



<p class="wp-block-paragraph">Run it from the parent directory, passing the agent folder rather than the file:</p>



<pre class="wp-block-code"><code>adk run weather_agent
</code></pre>



<p class="wp-block-paragraph">Three things are worth noting here. First, <code>model="gemini-2.0-flash"</code> sets the LLM — ADK natively supports all Gemini variants, and via LiteLLM integration you can swap in Claude, Mistral, or any open model with one line. Second, <code>description</code> is what <em>other agents</em> read when deciding whether to delegate to this agent — it&#8217;s the sub-agent&#8217;s job posting. Third, <code>instruction</code> is the system prompt — be specific and prescriptive.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Part 4: Tool Design — Plain Python Functions</h2>



<p class="wp-block-paragraph">ADK&#8217;s cleanest design decision: any Python function with a docstring becomes a tool. The docstring is parsed into the tool&#8217;s schema and shown to the model. You don&#8217;t need wrappers, decorators, or SDK imports.</p>



<pre class="wp-block-code"><code># tools.py

def search_flights(origin: str, destination: str, date: str) -&gt; dict:
    """Search for available flights between two cities on a given date.
    
    Args:
        origin: Departure city (e.g. 'Mumbai')
        destination: Arrival city (e.g. 'London')
        date: Travel date in YYYY-MM-DD format
    
    Returns:
        dict with available flights and prices
    """
    # In production: wire to a real flights API (Amadeus, Skyscanner, etc.)
    return {
        "flights": &#91;
            {"flight": "AI-101", "departure": "08:00", "price_usd": 850},
            {"flight": "AI-205", "departure": "14:30", "price_usd": 720},
        ],
        "origin": origin,
        "destination": destination,
        "date": date,
    }


def search_hotels(city: str, check_in: str, check_out: str) -&gt; dict:
    """Search for hotels in a given city for given dates.
    
    Args:
        city: City name
        check_in: Check-in date YYYY-MM-DD
        check_out: Check-out date YYYY-MM-DD
    
    Returns:
        dict with available hotels and prices
    """
    return {
        "hotels": &#91;
            {"name": "Grand Hotel", "stars": 5, "price_per_night_usd": 180},
            {"name": "City Suites", "stars": 4, "price_per_night_usd": 95},
        ],
        "city": city,
    }


# Each tool goes to the specialist that needs it — NOT to all agents
from google.adk.agents import LlmAgent

flight_agent = LlmAgent(
    model="gemini-2.0-flash",
    name="flight_agent",
    description="Searches for available flights between cities.",
    instruction="You are a flights specialist. Use search_flights to find options.",
    tools=&#91;search_flights],
)

hotel_agent = LlmAgent(
    model="gemini-2.0-flash",
    name="hotel_agent",
    description="Finds and recommends hotel accommodations.",
    instruction="You are a hotel specialist. Use search_hotels to find options.",
    tools=&#91;search_hotels],
)
</code></pre>



<p class="wp-block-paragraph"><strong>The discipline here matters</strong>: give each tool to exactly the agent that needs it. Never give all tools to a coordinator. Tool overload is how monolith agents happen.</p>
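<p class="wp-block-paragraph">To make the "plain function becomes a tool" idea concrete, here is a minimal sketch of how a schema could be derived from a function using only the standard library. <code>describe_tool</code> is a hypothetical helper written for this illustration; it is not ADK&#8217;s implementation, and the dictionary layout is an assumption.</p>

```python
import inspect


def search_flights(origin: str, destination: str, date: str) -> dict:
    """Search for available flights between two cities on a given date."""
    return {"origin": origin, "destination": destination, "date": date}


def describe_tool(func) -> dict:
    """Build a simple tool description from a function (hypothetical helper)."""
    signature = inspect.signature(func)
    # Map each parameter name to the name of its annotated type
    parameters = {
        name: getattr(param.annotation, "__name__", str(param.annotation))
        for name, param in signature.parameters.items()
    }
    return {
        "name": func.__name__,
        "description": inspect.getdoc(func),
        "parameters": parameters,
    }


schema = describe_tool(search_flights)
print(schema["name"])        # search_flights
print(schema["parameters"])  # {'origin': 'str', 'destination': 'str', 'date': 'str'}
```

<p class="wp-block-paragraph">This is why precise type hints and a descriptive docstring matter: they are the only information the model gets about what a tool does and how to call it.</p>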



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Part 5: AgentTool — Agents as Tools</h2>



<p class="wp-block-paragraph">The most powerful pattern in ADK: wrapping a sub-agent as a tool that the coordinator calls explicitly. This gives the coordinator full control over <em>when</em> each specialist runs, while keeping each specialist cleanly isolated.</p>



<pre class="wp-block-code"><code># coordinator.py
from google.adk.agents import LlmAgent
from google.adk.tools.agent_tool import AgentTool

from tools import flight_agent, hotel_agent  # specialists defined in tools.py above

# Coordinator delegates to specialists via AgentTool
coordinator = LlmAgent(
    model="gemini-2.0-flash",
    name="travel_coordinator",
    description="Orchestrates travel planning by delegating to specialist agents.",
    instruction="""
    You are a travel planning coordinator.
    When users ask about travel:
    - Use the flight_agent tool for anything related to flights
    - Use the hotel_agent tool for anything related to accommodation
    - Synthesize both results into a coherent, complete travel plan
    - Present the plan clearly with costs and timings
    """,
    tools=&#91;
        AgentTool(agent=flight_agent),
        AgentTool(agent=hotel_agent),
    ],
)
</code></pre>



<p class="wp-block-paragraph">When the coordinator receives <em>&#8220;Find a flight to Paris and a hotel&#8221;</em>, it calls <code>flight_agent</code>, gets the result, then calls <code>hotel_agent</code>, gets that result, and synthesises both into a unified response. Each specialist runs in isolation, and only its final answer comes back to the coordinator as a tool result.</p>
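<p class="wp-block-paragraph">The control flow behind this pattern can be sketched in plain Python. Everything below is a hypothetical stand-in (no LLM, no ADK classes): the specialists are ordinary functions and the <code>Coordinator</code> class simply calls each one and merges the answers, whereas a real coordinator lets the model choose which tools to call.</p>

```python
from typing import Callable, Dict


def flight_specialist(request: str) -> str:
    """Stand-in for a flight agent; a real one would call an LLM and tools."""
    return f"Flights found for: {request}"


def hotel_specialist(request: str) -> str:
    """Stand-in for a hotel agent."""
    return f"Hotels found for: {request}"


class Coordinator:
    """Calls each specialist explicitly and sees only its final answer."""

    def __init__(self, specialists: Dict[str, Callable[[str], str]]):
        self.specialists = specialists

    def plan(self, request: str) -> str:
        # This sketch calls every specialist in insertion order and joins the results
        answers = [call(request) for call in self.specialists.values()]
        return "\n".join(answers)


coordinator = Coordinator({"flights": flight_specialist, "hotels": hotel_specialist})
plan = coordinator.plan("Paris, 3 nights")
print(plan)
```

<p class="wp-block-paragraph">The coordinator never sees a specialist&#8217;s internal steps, which is what keeps each specialist independently testable.</p>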



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Part 6: SequentialAgent — Guaranteed-Order Pipelines</h2>



<p class="wp-block-paragraph">Some workflows must run in strict order: you can&#8217;t summarise a document before fetching it. You can&#8217;t run a risk model before gathering market data. For these, <code>SequentialAgent</code> is the right primitive.</p>



<p class="wp-block-paragraph"><code>SequentialAgent</code> is a workflow agent that executes its sub-agents in exactly the order they appear in its <code>sub_agents</code> list, with no LLM deciding what runs next.</p>



<p class="wp-block-paragraph">Here&#8217;s an equity analyst pipeline — research → risk assessment → report generation, guaranteed in that order:</p>



<pre class="wp-block-code"><code># analyst_pipeline.py
from google.adk.agents import LlmAgent, SequentialAgent

def fetch_market_data(ticker: str) -&gt; dict:
    """Fetch latest market data for a stock ticker."""
    return {"ticker": ticker, "price": 142.50, "volume": 1_200_000, "change_pct": 2.3}

def run_risk_model(data: dict) -&gt; dict:
    """Run risk assessment on market data."""
    return {"risk_score": 0.42, "recommendation": "moderate_buy", "data": data}


# Step 1: Research — writes to session state via output_key
research_agent = LlmAgent(
    model="gemini-2.0-flash",
    name="research_agent",
    description="Fetches and structures market data for analysis.",
    instruction="""Fetch market data for the requested ticker.
    Return structured data including price, volume, and daily change.""",
    tools=&#91;fetch_market_data],
    output_key="market_data",        # ← writes result to session state
)

# Step 2: Risk — reads {market_data} from session state
risk_agent = LlmAgent(
    model="gemini-2.0-flash",
    name="risk_agent",
    description="Runs risk assessment on the researched market data.",
    instruction="""Read the market data from {market_data} in session state.
    Run a risk assessment and produce a structured recommendation.""",
    tools=&#91;run_risk_model],
    output_key="risk_assessment",
)

# Step 3: Report — synthesises both outputs
report_agent = LlmAgent(
    model="gemini-2.0-flash",
    name="report_agent",
    description="Generates the final analyst report.",
    instruction="""Using the market data from {market_data} and risk assessment
    from {risk_assessment}, write a concise investment report with:
    - Executive summary
    - Key metrics
    - Risk rating
    - Recommendation""",
)

# SequentialAgent: guaranteed order, no LLM routing overhead
analyst_pipeline = SequentialAgent(
    name="equity_analyst_pipeline",
    sub_agents=&#91;research_agent, risk_agent, report_agent],
)
</code></pre>



<p class="wp-block-paragraph">The <code>output_key</code> parameter is how agents communicate through session state — a lightweight shared memory available to all agents in the tree during a single session. Agent B can read what Agent A wrote simply by referencing <code>{agent_a_output_key}</code> in its instruction.</p>
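<p class="wp-block-paragraph">The hand-off can be pictured as simple template substitution over a shared dictionary. The sketch below is a plain-Python approximation, not ADK&#8217;s code: <code>render_instruction</code> is a hypothetical helper, and treating a trailing <code>?</code> as "optional key" is an assumption modelled on ADK&#8217;s placeholder syntax.</p>

```python
import re

# Matches {key} and {key?}; the optional "?" lands in group 2
PLACEHOLDER = re.compile(r"\{(\w+)(\?)?\}")


def render_instruction(template: str, state: dict) -> str:
    """Replace {key} placeholders with values from a session-state dict.

    A trailing '?' marks the key as optional (empty string when missing).
    """
    def substitute(match: re.Match) -> str:
        key, optional = match.group(1), match.group(2)
        if key in state:
            return str(state[key])
        if optional:
            return ""
        raise KeyError(f"state key not set: {key}")

    return PLACEHOLDER.sub(substitute, template)


state = {}
# What an agent with output_key="market_data" would store after it runs
state["market_data"] = "ACME at 142.50, up 2.3%"
prompt = render_instruction("Assess the risk for: {market_data}", state)
print(prompt)  # Assess the risk for: ACME at 142.50, up 2.3%
```

<p class="wp-block-paragraph">The practical consequence: an agent that reads <code>{market_data}</code> must run after the agent that writes it, which is exactly the ordering <code>SequentialAgent</code> guarantees.</p>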



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Part 7: ParallelAgent — Concurrent Specialist Teams</h2>



<p class="wp-block-paragraph">When sub-tasks are independent of each other, there&#8217;s no reason to run them serially. <code>ParallelAgent</code> runs all sub-agents concurrently and collects their results before returning.</p>



<pre class="wp-block-code"><code># parallel_research.py
from google.adk.agents import LlmAgent, ParallelAgent

def search_flights(origin: str, destination: str, date: str) -&gt; dict:
    """Search flights between two cities."""
    return {"flights": &#91;{"flight": "AI-101", "price_usd": 850}]}

def search_hotels(city: str, check_in: str, check_out: str) -&gt; dict:
    """Search hotels in a city."""
    return {"hotels": &#91;{"name": "Grand Hotel", "price_per_night_usd": 180}]}

def search_activities(city: str, date: str) -&gt; dict:
    """Search top activities in a city."""
    return {"activities": &#91;"Eiffel Tower", "Louvre Museum", "Seine River Cruise"]}


flight_agent = LlmAgent(
    model="gemini-2.0-flash",
    name="flight_agent",
    description="Searches for flights.",
    instruction="Find flights for the given route and date.",
    tools=&#91;search_flights],
    output_key="flight_results",
)

hotel_agent = LlmAgent(
    model="gemini-2.0-flash",
    name="hotel_agent",
    description="Finds hotels.",
    instruction="Find hotels for the given city and dates.",
    tools=&#91;search_hotels],
    output_key="hotel_results",
)

activities_agent = LlmAgent(
    model="gemini-2.0-flash",
    name="activities_agent",
    description="Finds things to do.",
    instruction="Find top activities and attractions for the given city.",
    tools=&#91;search_activities],
    output_key="activities_results",
)

# ParallelAgent: all three run concurrently, so total time is set by the slowest one
research_team = ParallelAgent(
    name="travel_research_team",
    sub_agents=&#91;flight_agent, hotel_agent, activities_agent],
)
</code></pre>



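<p class="wp-block-paragraph">As a rough illustration, three independent lookups that take about 3 seconds each would need about 9 seconds when run one after another, but only about 3 seconds when run concurrently, because the total is set by the slowest call. For any multi-step workflow where steps are independent, <code>ParallelAgent</code> is the right choice.</p>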
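<p class="wp-block-paragraph">The timing argument can be checked with a small <code>asyncio</code> sketch. It does not use ADK at all: <code>fake_search</code> is a hypothetical stand-in for one specialist&#8217;s API call, and the 0.1-second delay is arbitrary.</p>

```python
import asyncio
import time

TASKS = ("flights", "hotels", "activities")


async def fake_search(name: str, delay: float) -> str:
    """Stand-in for one specialist's API call."""
    await asyncio.sleep(delay)
    return f"{name} done"


async def run_sequential(delay: float) -> list:
    # Each call waits for the previous one to finish
    return [await fake_search(name, delay) for name in TASKS]


async def run_parallel(delay: float) -> list:
    # All calls are started together and awaited as a group
    return await asyncio.gather(*(fake_search(name, delay) for name in TASKS))


def timed(coro_factory, delay: float):
    """Run a coroutine factory and return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = asyncio.run(coro_factory(delay))
    return results, time.perf_counter() - start


seq_results, seq_time = timed(run_sequential, 0.1)
par_results, par_time = timed(run_parallel, 0.1)
print(f"sequential: {seq_time:.2f}s, parallel: {par_time:.2f}s")
```

<p class="wp-block-paragraph">Both versions return the same results in the same order; only the wall-clock time differs. The speed-up shrinks when one call is much slower than the others.</p>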



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Part 8: LoopAgent — Iterative Refinement (Generator-Critic)</h2>



<p class="wp-block-paragraph">Some outputs improve with iteration. A first-draft blog post benefits from a critic pass. A travel itinerary improves when checked against constraints. <code>LoopAgent</code> implements this generator-critic pattern: it loops through its sub-agents repeatedly until one of them triggers an <code>escalate</code> signal or <code>max_iterations</code> is reached.</p>



<pre class="wp-block-code"><code># refinement_loop.py
from google.adk.agents import LlmAgent, LoopAgent
from google.adk.tools.tool_context import ToolContext


def exit_loop(tool_context: ToolContext) -&gt; dict:
    """Call this only when the draft scores 8 or above, to end the loop."""
    # Writing "escalate" in the reply text has no effect; the flag must be set on the actions
    tool_context.actions.escalate = True
    return {}


# Writer produces or revises the draft
writer_agent = LlmAgent(
    model="gemini-2.0-flash",
    name="writer_agent",
    description="Writes or revises the content draft.",
    instruction="""
    If there is no draft yet, write an initial blog post based on the topic.
    Current draft (empty on the first pass): {current_draft?}
    Critic feedback (empty on the first pass): {critic_feedback?}
    If a draft exists, revise it based on the critic's feedback.
    Output only the improved draft.
    """,
    output_key="current_draft",
)

# Critic reviews and decides whether to continue or finish
critic_agent = LlmAgent(
    model="gemini-2.0-flash",
    name="critic_agent",
    description="Reviews content quality and decides whether to continue iterating.",
    instruction="""
    Review the draft in {current_draft}. Score it from 1-10 for:
    clarity, accuracy, engagement, and SEO value.
    Provide specific, actionable improvement notes.
    If the overall score is 8 or above, call the exit_loop tool to finish.
    Otherwise return your notes so the writer can revise.
    """,
    tools=&#91;exit_loop],
    output_key="critic_feedback",
)

# Loops until exit_loop sets escalate or max_iterations is reached
content_refinement_loop = LoopAgent(
    name="content_refinement_loop",
    sub_agents=&#91;writer_agent, critic_agent],
    max_iterations=5,
)
</code></pre>



<p class="wp-block-paragraph">This maps directly onto production use cases: report generation with quality gates, code generation with test-run feedback, regulatory documents with compliance checks.</p>
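<p class="wp-block-paragraph">Stripped of the LLM calls, the generator-critic loop is a bounded loop with an early exit. The sketch below uses hypothetical stand-ins (<code>write</code>, <code>critique</code>, <code>Review</code>) and a toy scoring rule, purely to show how the escalate signal and <code>max_iterations</code> interact.</p>

```python
from dataclasses import dataclass


@dataclass
class Review:
    score: int
    escalate: bool  # True means "good enough, stop looping"


def write(draft: str, round_number: int) -> str:
    """Stand-in writer: each revision appends a marker."""
    return f"{draft} [rev {round_number}]".strip()


def critique(draft: str) -> Review:
    """Stand-in critic: the score grows with the number of revisions."""
    score = 5 + draft.count("[rev")
    return Review(score=score, escalate=score >= 8)


def refine(max_iterations: int = 5) -> tuple:
    """Alternate writer and critic until the critic escalates or the cap is hit."""
    draft, rounds = "", 0
    for rounds in range(1, max_iterations + 1):
        draft = write(draft, rounds)
        if critique(draft).escalate:
            break  # the critic ended the loop early
    return draft, rounds


final_draft, rounds_used = refine()
print(rounds_used)  # 3
```

<p class="wp-block-paragraph">The iteration cap is the safety net: if the critic never approves, the loop still terminates and returns the latest draft.</p>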



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Part 9: The Complete Multi-Agent System</h2>



<p class="wp-block-paragraph">Now compose every pattern into one production system: a travel planner that runs research in parallel, refines the itinerary through a writer-critic loop, then validates before delivery.</p>



<pre class="wp-block-code"><code># travel_planner.py — full production multi-agent system
from google.adk.agents import LlmAgent, SequentialAgent, ParallelAgent, LoopAgent
from google.adk.tools.agent_tool import AgentTool


# ── Tool functions ────────────────────────────────────────────────────────────

def search_flights(origin: str, destination: str, date: str) -&gt; dict:
    """Search flights between two cities."""
    return {"flights": &#91;{"flight": "AI-101", "price_usd": 850}]}

def search_hotels(city: str, check_in: str, check_out: str) -&gt; dict:
    """Search hotels in a city."""
    return {"hotels": &#91;{"name": "Grand Hotel", "price_per_night_usd": 180}]}

def search_activities(city: str, date: str) -&gt; dict:
    """Search top attractions in a city."""
    return {"activities": &#91;"Eiffel Tower", "Louvre Museum"]}

def validate_itinerary(itinerary: str) -&gt; dict:
    """Validate an itinerary for conflicts and completeness."""
    return {"valid": True, "issues": &#91;]}


# ── Stage 1: Parallel research team ──────────────────────────────────────────

flight_agent    = LlmAgent(model="gemini-2.0-flash", name="flight_agent",
    description="Searches for available flights.",
    instruction="Find flights for the given route and date.",
    tools=&#91;search_flights], output_key="flight_results")

hotel_agent     = LlmAgent(model="gemini-2.0-flash", name="hotel_agent",
    description="Finds hotels.",
    instruction="Find hotels for the city and dates.",
    tools=&#91;search_hotels], output_key="hotel_results")

activities_agent = LlmAgent(model="gemini-2.0-flash", name="activities_agent",
    description="Recommends activities and attractions.",
    instruction="Find top activities for the city.",
    tools=&#91;search_activities], output_key="activities_results")

research_team = ParallelAgent(
    name="research_team",
    sub_agents=&#91;flight_agent, hotel_agent, activities_agent],
)

# ── Stage 2: Writer-critic refinement loop ────────────────────────────────────

from google.adk.tools.tool_context import ToolContext  # used only by exit_loop

def exit_loop(tool_context: ToolContext) -&gt; dict:
    """Call this when the itinerary scores 8 or above, to end the refinement loop."""
    tool_context.actions.escalate = True   # the signal LoopAgent watches for
    return {}

writer_agent = LlmAgent(model="gemini-2.0-flash", name="itinerary_writer",
    description="Drafts a travel itinerary from research results.",
    instruction="""Using {flight_results}, {hotel_results}, and {activities_results},
    compose a detailed 3-day travel itinerary.
    On revision rounds, improve the previous draft: {itinerary_draft?}
    by applying the critic's feedback: {critic_feedback?}""",
    output_key="itinerary_draft")

critic_agent = LlmAgent(model="gemini-2.0-flash", name="itinerary_critic",
    description="Reviews the itinerary for quality.",
    instruction="""Review the itinerary in {itinerary_draft}.
    Check for: logical flow, realistic timing, missing essentials.
    Score 1-10. If score &gt;= 8, call the exit_loop tool; otherwise list the fixes needed.""",
    tools=&#91;exit_loop],
    output_key="critic_feedback")

refinement_loop = LoopAgent(
    name="itinerary_refinement",
    sub_agents=&#91;writer_agent, critic_agent],
    max_iterations=3,
)

# ── Stage 3: Validation ───────────────────────────────────────────────────────

validator_agent = LlmAgent(model="gemini-2.0-flash", name="validator_agent",
    description="Validates the final itinerary.",
    instruction="""Validate the itinerary in {itinerary_draft} using the
    validate_itinerary tool. Return the validation result.""",
    tools=&#91;validate_itinerary],
    output_key="validation_result")

# ── Full pipeline: Research → Refine → Validate ───────────────────────────────

travel_planner = SequentialAgent(
    name="travel_planner",
    sub_agents=&#91;research_team, refinement_loop, validator_agent],
)
</code></pre>



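<p class="wp-block-paragraph">Run this with the ADK CLI, which expects the path to an agent package folder rather than a single file:</p>



<pre class="wp-block-code"><code># Assumes the code above is saved as travel_planner/agent.py, ends with
# root_agent = travel_planner, and sits next to an __init__.py containing: from . import agent
adk run travel_planner
# Or test with the web UI (run from the parent folder, then pick travel_planner):
adk web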
</code></pre>



<p class="wp-block-paragraph">The architecture: Research (parallel, so roughly as slow as the slowest lookup) → Refinement Loop (quality gates) → Validation (safety check) → Final output. Each stage is independently testable, swappable, and improvable without touching the others.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Part 10: Session State and Agent Communication</h2>



<p class="wp-block-paragraph">The mechanism agents use to pass data between each other in ADK is <strong>session state</strong> — a shared key-value store available within a single conversation session. <code>output_key</code> on an <code>LlmAgent</code> writes the agent&#8217;s final response to a state key. Any downstream agent can read it via <code>{key_name}</code> interpolation in its instruction.</p>



<p class="wp-block-paragraph">This is the recommended pattern for SequentialAgent pipelines. For <code>AgentTool</code> invocations, the result is returned inline to the calling coordinator — no state write needed.</p>



<p class="wp-block-paragraph">For <strong>cross-session persistence</strong> (memory that survives across different user conversations), ADK provides a <code>Memory</code> component separate from <code>State</code>. Think of <code>State</code> as session RAM and <code>Memory</code> as persistent storage.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph">Reference: <a href="https://google.github.io/adk-docs/sessions/" rel="nofollow noopener" target="_blank">Sessions &amp; Memory — ADK Docs</a></p>
</blockquote>
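<p class="wp-block-paragraph">The RAM-versus-storage analogy can be sketched with two dictionaries. This is a toy model under stated assumptions, not ADK&#8217;s session or memory services: <code>InMemoryStores</code> and <code>end_session</code> are hypothetical names, state is keyed by session ID, and memory is keyed by user ID.</p>

```python
from collections import defaultdict


class InMemoryStores:
    """Toy model: state is keyed by session, memory is keyed by user."""

    def __init__(self):
        self.state = defaultdict(dict)    # session_id -> key/value pairs
        self.memory = defaultdict(list)   # user_id -> remembered facts

    def end_session(self, session_id: str, user_id: str, keep: list) -> None:
        """Copy selected state keys into long-term memory, then drop the session."""
        session = self.state.pop(session_id, {})
        for key in keep:
            if key in session:
                self.memory[user_id].append(f"{key}={session[key]}")


stores = InMemoryStores()
stores.state["session-1"]["home_airport"] = "BOM"
stores.state["session-1"]["flight_results"] = "AI-101"
stores.end_session("session-1", "user-42", keep=["home_airport"])

print("session-1" in stores.state)   # False
print(stores.memory["user-42"])      # ['home_airport=BOM']
```

<p class="wp-block-paragraph">Intermediate results such as <code>flight_results</code> disappear with the session, while facts worth keeping about the user survive into later conversations.</p>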



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Part 11: Running and Debugging</h2>



<p class="wp-block-paragraph">ADK&#8217;s developer tooling is one of its strongest differentiators.</p>



<pre class="wp-block-code"><code># Run interactively in the terminal (travel_planner is the agent package folder)
adk run travel_planner

# Launch the visual dev UI (inspect events, state, tool calls)
adk web

# Evaluate against test datasets
adk eval travel_planner eval_dataset.json
</code></pre>



<p class="wp-block-paragraph">The web UI shows every <code>Event</code> in the execution tree: which agent ran, which tools were called, what was written to state, and how long each step took. For multi-agent systems with 5+ agents, this is invaluable for debugging delegation failures and unexpected routing.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Part 12: Deployment</h2>



<p class="wp-block-paragraph">When your agent is production-ready, ADK provides first-class deployment to Google Cloud:</p>



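<pre class="wp-block-code"><code># Deploy to Vertex AI Agent Engine (managed, auto-scaling)
adk deploy agent_engine travel_planner

# Or containerise for Cloud Run
adk deploy cloud_run --project YOUR_GCP_PROJECT travel_planner
# Both commands accept further flags (region, staging bucket, etc.); run adk deploy --help to see which your version requires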
</code></pre>



<p class="wp-block-paragraph">ADK&#8217;s architecture includes several production-focused features: direct integration with Vertex AI Agent Engine, support for containerised deployment, pre-built connectors to enterprise systems and databases like AlloyDB, BigQuery, and NetApp, bidirectional streaming support for real-time audio and video interactions, and built-in frameworks to assess response quality and execution paths.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph">References: <a href="https://google.github.io/adk-docs/deploy/agent-engine/" rel="nofollow noopener" target="_blank">Deploy to Agent Engine</a>, <a href="https://google.github.io/adk-docs/deploy/cloud-run/" rel="nofollow noopener" target="_blank">Deploy to Cloud Run</a></p>
</blockquote>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">The Architecture Mental Model</h2>



<pre class="wp-block-code"><code>USER QUERY
     │
     ▼
┌─────────────────────────────────────────────────────────────┐
│  ROOT COORDINATOR (LlmAgent)                                │
│  Receives query → decides which agents/tools to invoke      │
└────────┬──────────────┬──────────────────────┬─────────────┘
         │              │                      │
         ▼              ▼                      ▼
  AgentTool A     AgentTool B           SequentialAgent
  (Specialist)    (Specialist)          └─ Step 1 Agent
                                        └─ Step 2 Agent
                                        └─ Step 3 Agent
                                                │
                                         ParallelAgent
                                         ├─ Worker A  ──┐
                                         ├─ Worker B  ──┤ → merged
                                         └─ Worker C  ──┘
                                                │
                                           LoopAgent
                                           ├─ Writer → draft
                                           └─ Critic → escalate?
                                                │
                                         FINAL RESPONSE
</code></pre>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">What You&#8217;ve Built</h2>



<p class="wp-block-paragraph">Walking through this guide, you&#8217;ve assembled the full ADK vocabulary: <code>LlmAgent</code> for reasoning specialists, <code>SequentialAgent</code> for guaranteed-order pipelines, <code>ParallelAgent</code> for concurrent research teams, <code>LoopAgent</code> for iterative refinement cycles, and <code>AgentTool</code> for explicit coordinator-to-specialist delegation.</p>



<p class="wp-block-paragraph">The travel planner is a working template for any multi-agent system in production: research fast (parallel), draft well (loop), gate with quality checks (critic), validate before shipping (sequential). Swap the domain, adjust the tools, deploy to Vertex AI.</p>



<p class="wp-block-paragraph">Google says ADK is the same framework that powers agents inside its own products. Now it&#8217;s your framework too.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Resources</h2>



<ul class="wp-block-list">
<li><a href="https://google.github.io/adk-docs/" rel="nofollow noopener" target="_blank">ADK Official Documentation</a> — home of all ADK guides</li>



<li><a href="https://google.github.io/adk-docs/get-started/python/" rel="nofollow noopener" target="_blank">ADK Python Quickstart</a> — your first agent in 5 minutes</li>



<li><a href="https://google.github.io/adk-docs/agents/multi-agents/" rel="nofollow noopener" target="_blank">Multi-Agent Systems in ADK</a> — patterns and primitives</li>



<li><a href="https://google.github.io/adk-docs/agents/workflow-agents/sequential-agents/" rel="nofollow noopener" target="_blank">Sequential Agents</a> — guaranteed-order pipelines</li>



<li><a href="https://google.github.io/adk-docs/agents/workflow-agents/parallel-agents/" rel="nofollow noopener" target="_blank">Parallel Agents</a> — concurrent execution</li>



<li><a href="https://google.github.io/adk-docs/agents/workflow-agents/loop-agents/" rel="nofollow noopener" target="_blank">Loop Agents</a> — iterative refinement</li>



<li><a href="https://google.github.io/adk-docs/sessions/" rel="nofollow noopener" target="_blank">Sessions &amp; Memory</a> — state and cross-session persistence</li>



<li><a href="https://google.github.io/adk-docs/deploy/agent-engine/" rel="nofollow noopener" target="_blank">Deploy to Agent Engine</a> — Vertex AI deployment</li>



<li><a href="https://cloud.google.com/blog/products/ai-machine-learning/build-multi-agentic-systems-using-google-adk" rel="nofollow noopener" target="_blank">Google Cloud Blog: Build Multi-Agentic Systems</a></li>



<li><a href="https://google.github.io/adk-docs/get-started/about/" rel="nofollow noopener" target="_blank">ADK Technical Overview</a> — deep dive on architecture</li>
</ul>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p class="wp-block-paragraph"><em>All code examples syntax-verified against Python 3.11. Install: <code>pip install google-adk</code>. Get a free Gemini API key at <a href="https://aistudio.google.com/app/apikey" rel="nofollow noopener" target="_blank">aistudio.google.com</a>.</em></p>
]]></content:encoded>
					
					<wfw:commentRss>https://rpabotsworld.com/building-multi-agent-systems-with-google-adk-the-complete-step-by-step-guide/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>The Complete Guide to Agent Quality &#038; Evaluation: Metrics, LLM-as-Judge, and LangSmith</title>
		<link>https://rpabotsworld.com/agent-quality-evaluation-llm-as-judge-langsmith/</link>
					<comments>https://rpabotsworld.com/agent-quality-evaluation-llm-as-judge-langsmith/#respond</comments>
		
		<dc:creator><![CDATA[Satish Prasad]]></dc:creator>
		<pubDate>Sun, 07 Jun 2026 12:57:28 +0000</pubDate>
				<category><![CDATA[Agentic AI & AI Automation]]></category>
		<category><![CDATA[AI Agents & Frameworks]]></category>
		<guid isPermaLink="false">https://rpabotsworld.com/?p=32096</guid>

					<description><![CDATA[A tutorial for developers who ship agents into the real world — and need to know if they&#8217;re actually working. The Problem Nobody Talks About at Demo Time Your agent demo looked flawless. It answered every question correctly, called the right tools in the right order, and finished in under three seconds. The audience applauded. [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p class="wp-block-paragraph"><em>A tutorial for developers who ship agents into the real world — and need to know if they&#8217;re actually working.</em></p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">The Problem Nobody Talks About at Demo Time</h2>



<p class="wp-block-paragraph">Your agent demo looked flawless. It answered every question correctly, called the right tools in the right order, and finished in under three seconds. The audience applauded.</p>



<p class="wp-block-paragraph">Two weeks after going live, your support queue is filling up with: <em>&#8220;The agent gave me completely wrong information.&#8221;</em> <em>&#8220;It searched the wrong database.&#8221;</em> <em>&#8220;It hallucinated a date that doesn&#8217;t exist.&#8221;</em></p>



<p class="wp-block-paragraph">Here&#8217;s the hard truth: <strong>demos don&#8217;t break agents. Real users do.</strong> And without a systematic evaluation framework, you will always be one bad production run away from a confidence crisis.</p>



<p class="wp-block-paragraph">This guide teaches you everything you need: the metrics that matter, how to build evaluators from scratch, how LLM-as-a-judge works, and how LangSmith closes the loop from local testing all the way to production monitoring. We build each concept on the last, so by the end you&#8217;ll have a complete evaluation system you can deploy today.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Part 1: Foundations — What Does &#8220;Agent Quality&#8221; Actually Mean?</h2>



<p class="wp-block-paragraph">Before you can measure anything, you need a model of what you&#8217;re measuring.</p>



<p class="wp-block-paragraph">An agent isn&#8217;t a static function. It&#8217;s a <strong>decision-making system</strong> that reasons, selects tools, retrieves data, and generates responses, often over multiple steps. A quality failure can occur at any of these layers.</p>



<p class="wp-block-paragraph">Think of agent quality across four dimensions:</p>



<h3 class="wp-block-heading">1. Output Quality</h3>



<p class="wp-block-paragraph">Does the final answer satisfy the user&#8217;s intent? Is it correct, relevant, and complete — without hallucinating facts?</p>



<h3 class="wp-block-heading">2. Trajectory Quality</h3>



<p class="wp-block-paragraph">Did the agent take the <em>right path</em> to get there? Did it call the correct tools, in the correct order, without unnecessary detours?</p>



<h3 class="wp-block-heading">3. Latency and Efficiency</h3>



<p class="wp-block-paragraph">How long did each step take? How many tokens were consumed? Are there runaway loops or redundant tool calls?</p>



<h3 class="wp-block-heading">4. Safety and Guardrails</h3>



<p class="wp-block-paragraph">Did the agent stay within its defined scope? Did it avoid toxic, harmful, or out-of-policy outputs?</p>



<p class="wp-block-paragraph">Each dimension needs its own evaluator. A single &#8220;pass/fail&#8221; score tells you almost nothing. Let&#8217;s build the measurement layer, dimension by dimension.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Part 2: The Metrics That Matter — What to Track</h2>



<p class="wp-block-paragraph">Here&#8217;s a practical taxonomy of agent evaluation metrics, drawn from production experience and the <a href="https://docs.langchain.com/langsmith/evaluation-approaches" rel="nofollow noopener" target="_blank">LangSmith evaluation framework</a>.</p>



<h3 class="wp-block-heading">Correctness (Output vs. Reference)</h3>



<p class="wp-block-paragraph">The baseline: does the agent&#8217;s answer match the expected answer?</p>



<p class="wp-block-paragraph">This can be measured exactly (string match, JSON match) or approximately (semantic similarity, LLM judge). Use exact match for structured outputs (IDs, dates, classifications). Use LLM-as-judge for conversational or long-form outputs.</p>
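<p class="wp-block-paragraph">As a minimal sketch of the exact-match case, assume the agent returns its structured result under a hypothetical <code>order_id</code> key (neither the key nor the function name comes from LangSmith). A code-based check can then normalise both sides before comparing:</p>

```python
def exact_match_evaluator(inputs: dict, outputs: dict, reference_outputs: dict) -> dict:
    """Scores 1 when the predicted ID equals the reference ID after normalisation."""
    # "order_id" is an illustrative key; use whatever field your agent emits.
    predicted = str(outputs.get("order_id", "")).strip().lower()
    expected = str(reference_outputs.get("order_id", "")).strip().lower()
    # An empty reference never counts as a match.
    matched = predicted == expected and expected != ""
    return {
        "key": "exact_match",
        "score": 1 if matched else 0,
        "comment": f"predicted={predicted!r}, expected={expected!r}",
    }
```

<p class="wp-block-paragraph">Normalising case and whitespace keeps the check strict on content while ignoring formatting noise. Whether that is appropriate depends on how your IDs are defined.</p>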



<h3 class="wp-block-heading">Groundedness / Faithfulness</h3>



<p class="wp-block-paragraph">Does the agent&#8217;s response stay grounded in the retrieved documents or tools it actually used? An agent that &#8220;knows&#8221; something it wasn&#8217;t given is hallucinating.</p>



<p class="wp-block-paragraph">Per the <a href="https://docs.langchain.com/langsmith/evaluate-rag-tutorial#evaluators" rel="nofollow noopener" target="_blank">LangSmith RAG evaluation guide</a>, groundedness measures <em>response vs. retrieved docs</em> — not vs. a reference answer. This means you can evaluate it without ground truth.</p>



<h3 class="wp-block-heading">Relevance</h3>



<p class="wp-block-paragraph">Does the answer actually address the user&#8217;s question? An agent can be perfectly faithful to its retrieved documents and still fail if it retrieved the wrong documents in the first place.</p>



<p class="wp-block-paragraph">Track this at two levels: <em>response relevance</em> (answer vs. question) and <em>retrieval relevance</em> (retrieved docs vs. question).</p>



<h3 class="wp-block-heading">Trajectory Accuracy</h3>



<p class="wp-block-paragraph">This is unique to agents. It asks: did the agent take the expected sequence of steps?</p>



<p class="wp-block-paragraph">As the <a href="https://docs.langchain.com/langsmith/evaluation-approaches#evaluating-an-agents-trajectory" rel="nofollow noopener" target="_blank">LangSmith evaluation approaches documentation</a> explains, trajectory evaluation can target:</p>



<ul class="wp-block-list">
<li><strong>Exact match</strong> — did the agent call tools A → B → C in exactly that order?</li>



<li><strong>Unordered match</strong> — did the agent call the right set of tools, in any order?</li>



<li><strong>Subset/superset</strong> — did the agent at least call the required minimum tools?</li>



<li><strong>LLM-judge over full trajectory</strong> — pass the entire message + tool call history to a judge for holistic assessment.</li>
</ul>
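<p class="wp-block-paragraph">The first three modes can be sketched in plain Python, assuming a trajectory is simply a list of tool names. The function name and mode strings here are illustrative, not a library API:</p>

```python
from collections import Counter


def trajectory_match(actual: list[str], expected: list[str], mode: str = "exact") -> bool:
    """Compares two lists of tool names under one of four matching modes."""
    if mode == "exact":
        return actual == expected  # same tools, same order
    if mode == "unordered":
        return Counter(actual) == Counter(expected)  # same tools, any order
    missing = Counter(expected) - Counter(actual)  # expected calls the agent skipped
    extra = Counter(actual) - Counter(expected)  # calls beyond the expected ones
    if mode == "superset":
        return not missing  # the agent made at least the required calls
    if mode == "subset":
        return not extra  # the agent made no calls outside the expected ones
    raise ValueError(f"Unknown mode: {mode}")
```

<p class="wp-block-paragraph">Using <code>Counter</code> rather than <code>set</code> means a tool that must be called twice is not satisfied by a single call.</p>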



<h3 class="wp-block-heading">Latency (p50, p95, p99)</h3>



<p class="wp-block-paragraph">Track response time at the percentile level. p50 tells you typical performance. p95 and p99 tell you what the users in your slowest tail experience. Looping agents or redundant tool calls show up here first.</p>
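<p class="wp-block-paragraph">A minimal sketch of computing these percentiles with the standard library, assuming you have already collected per-run latencies in milliseconds (the function name is illustrative, and at least two data points are required):</p>

```python
import statistics


def latency_percentiles(latencies_ms: list[float]) -> dict:
    """Returns p50, p95 and p99 for a list of per-run latencies in milliseconds."""
    # quantiles(n=100) returns the 99 cut points for percentiles 1 through 99.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
```

<p class="wp-block-paragraph">The <code>inclusive</code> method treats the sample as the whole population, which suits a fixed batch of evaluation runs. For a small sample of live traffic, the default <code>exclusive</code> method may be the better fit.</p>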



<h3 class="wp-block-heading">Token Efficiency</h3>



<p class="wp-block-paragraph">Total tokens per run, tokens per tool call, and token cost per session. Useful for catching prompt bloat and runaway context growth in long-running agents.</p>



<h3 class="wp-block-heading">Composite Quality Score</h3>



<p class="wp-block-paragraph"><a href="https://docs.langchain.com/langsmith/evaluation-types#composite-evaluators" rel="nofollow noopener" target="_blank">LangSmith supports composite evaluators</a> that combine multiple scores into a single weighted metric. For example: <em>Overall Quality = (70% × correctness) + (20% × relevance) + (10% × conciseness)</em>. Useful for dashboards and regression gates.</p>
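<p class="wp-block-paragraph">The weighting itself is simple arithmetic. As a sketch under the assumption that each metric is already a number between 0 and 1 (this helper is hypothetical, not the LangSmith composite evaluator):</p>

```python
def composite_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-metric scores; weights are normalised to sum to 1."""
    total_weight = sum(weights.values())
    if total_weight == 0:
        raise ValueError("At least one weight must be non-zero")
    weighted = sum(scores[name] * weight for name, weight in weights.items())
    return weighted / total_weight


weights = {"correctness": 0.7, "relevance": 0.2, "conciseness": 0.1}
overall = composite_score(
    {"correctness": 1.0, "relevance": 0.5, "conciseness": 0.0}, weights
)
```

<p class="wp-block-paragraph">Here a correct, half-relevant, verbose answer scores 0.7 + 0.1 + 0.0 = 0.8. A high composite can still hide a failing dimension, so keep the individual scores visible next to it.</p>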



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Part 3: Your First Evaluator — Code-Based Rules</h2>



<p class="wp-block-paragraph">Not everything needs an LLM to evaluate. Start simple.</p>



<p class="wp-block-paragraph">A code-based evaluator is just a Python function. It receives the agent&#8217;s inputs, outputs, and optionally reference outputs — and returns a score.</p>



<pre class="wp-block-code"><code># evaluators.py

def response_length_evaluator(inputs: dict, outputs: dict, reference_outputs: dict = None) -&gt; dict:
    """
    A simple evaluator that checks whether the response is concise.
    Flags responses over 500 words.
    """
    word_count = len(outputs.get("answer", "").split())
    score = 1 if word_count &lt;= 500 else 0
    return {
        "key": "conciseness",
        "score": score,
        "comment": f"Response length: {word_count} words"
    }


def json_format_evaluator(inputs: dict, outputs: dict, reference_outputs: dict = None) -&gt; dict:
    """
    Checks that the agent returned valid, parseable JSON where expected.
    """
    import json
    try:
        json.loads(outputs.get("structured_output", ""))
        return {"key": "valid_json", "score": 1}
    except (json.JSONDecodeError, TypeError):
        return {"key": "valid_json", "score": 0, "comment": "Output is not valid JSON"}


def tool_call_count_evaluator(inputs: dict, outputs: dict, reference_outputs: dict = None) -&gt; dict:
    """
    Checks that the agent didn't make an excessive number of tool calls (a sign of looping).
    """
    tool_calls = outputs.get("tool_calls", &#91;])
    score = 1 if len(tool_calls) &lt;= 5 else 0
    return {
        "key": "tool_efficiency",
        "score": score,
        "comment": f"Tool calls made: {len(tool_calls)}"
    }
</code></pre>



<p class="wp-block-paragraph">These run instantly, cost nothing, and catch structural failures immediately. Use them as your first filter before investing in LLM-based evaluation.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Part 4: LLM-as-Judge — Evaluating What Rules Can&#8217;t</h2>



<p class="wp-block-paragraph">Some failures are semantic, not structural. An agent might return a perfectly formatted JSON with a factually wrong answer. A rule can&#8217;t catch that. An LLM judge can.</p>



<p class="wp-block-paragraph"><strong>LLM-as-judge</strong> is the pattern where a second, independent LLM evaluates the output of your primary agent. The judge receives a structured prompt with the question, the agent&#8217;s answer, and optionally a reference answer — then returns a score and reasoning.</p>



<p class="wp-block-paragraph">Here&#8217;s how the <a href="https://docs.langchain.com/langsmith/evaluation-quickstart#5-define-an-evaluator" rel="nofollow noopener" target="_blank">LangSmith evaluation quickstart</a> describes the key components: <em>inputs</em> (what was passed to your agent), <em>outputs</em> (what your agent returned), and <em>reference_outputs</em> (the ground truth answers from your dataset).</p>



<h3 class="wp-block-heading">Build a Custom LLM-as-Judge Evaluator</h3>



<pre class="wp-block-code"><code># llm_judge_evaluators.py
from langchain_anthropic import ChatAnthropic

judge_llm = ChatAnthropic(model="claude-sonnet-4-20250514", temperature=0)

def correctness_judge(inputs: dict, outputs: dict, reference_outputs: dict) -&gt; dict:
    """
    LLM-as-judge evaluator for factual correctness.
    Compares agent answer against reference answer.
    Returns score 0 (incorrect) or 1 (correct) with reasoning.
    """
    prompt = f"""You are an expert evaluator assessing an AI agent's response.

Question asked: {inputs.get('question', '')}

Reference answer (ground truth): {reference_outputs.get('answer', '')}

Agent's answer: {outputs.get('answer', '')}

Your task: Assess whether the agent's answer is factually correct relative to the reference answer.
Respond in this exact format:
SCORE: &#91;0 or 1]
REASONING: &#91;one sentence explaining why]"""

    response = judge_llm.invoke(prompt)
    content = response.content

    score = 1 if "SCORE: 1" in content else 0
    reasoning = content.split("REASONING:")&#91;-1].strip() if "REASONING:" in content else ""

    return {
        "key": "correctness",
        "score": score,
        "comment": reasoning
    }


def groundedness_judge(inputs: dict, outputs: dict, reference_outputs: dict = None) -&gt; dict:
    """
    LLM-as-judge for groundedness: checks if the answer is supported
    by the retrieved context (no reference needed).
    """
    context = outputs.get("retrieved_context", "")
    answer = outputs.get("answer", "")

    if not context:
        return {"key": "groundedness", "score": 0, "comment": "No retrieved context found"}

    prompt = f"""You are grading whether an AI answer is grounded in retrieved documents.

Retrieved context:
{context}

AI answer:
{answer}

Return 1 if the answer is fully supported by the context.
Return 0 if the answer contains information NOT present in the context (hallucination).

SCORE: &#91;0 or 1]
REASONING: &#91;one sentence]"""

    response = judge_llm.invoke(prompt)
    content = response.content
    score = 1 if "SCORE: 1" in content else 0
    reasoning = content.split("REASONING:")&#91;-1].strip() if "REASONING:" in content else ""

    return {"key": "groundedness", "score": score, "comment": reasoning}


def relevance_judge(inputs: dict, outputs: dict, reference_outputs: dict = None) -&gt; dict:
    """
    Evaluates whether the agent's answer actually addresses the user's question.
    Reference-free: compares answer to input question only.
    """
    question = inputs.get("question", "")
    answer = outputs.get("answer", "")

    prompt = f"""Does the following answer directly address the question?

Question: {question}
Answer: {answer}

Return 1 if the answer is relevant. Return 0 if it is off-topic or evasive.
Respond in this exact format:
SCORE: &#91;0 or 1]
REASONING: &#91;one sentence]"""

    response = judge_llm.invoke(prompt)
    content = response.content
    score = 1 if "SCORE: 1" in content else 0
    reasoning = content.split("REASONING:")&#91;-1].strip() if "REASONING:" in content else ""

    return {"key": "relevance", "score": score, "comment": reasoning}
</code></pre>



<h3 class="wp-block-heading">Using OpenEvals — Pre-Built Judges</h3>



<p class="wp-block-paragraph">For production use, the <a href="https://docs.langchain.com/langsmith/openevals#running-an-evaluator" rel="nofollow noopener" target="_blank"><code>openevals</code> library</a> ships ready-made LLM-as-judge evaluators with battle-tested prompts:</p>



<pre class="wp-block-code"><code># Using openevals for correctness (pip install openevals)
from openevals.llm import create_llm_as_judge
from openevals.prompts import CORRECTNESS_PROMPT, CONCISENESS_PROMPT

correctness_evaluator = create_llm_as_judge(
    prompt=CORRECTNESS_PROMPT,
    model="anthropic:claude-sonnet-4-20250514",
    feedback_key="correctness",
)

conciseness_evaluator = create_llm_as_judge(
    prompt=CONCISENESS_PROMPT,
    model="anthropic:claude-sonnet-4-20250514",
    feedback_key="conciseness",
)
</code></pre>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph"><strong>A word of caution:</strong> LLM judges don&#8217;t always get it right. LangSmith allows human auditors to review and correct evaluator scores — building a feedback loop that continuously improves judge accuracy over time. See <a href="https://docs.langchain.com/langsmith/audit-evaluator-scores" rel="nofollow noopener" target="_blank">how to audit evaluator scores</a>.</p>
</blockquote>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Part 5: Trajectory Evaluation — Judging the Path, Not Just the Destination</h2>



<p class="wp-block-paragraph">For agents, the <em>how</em> matters as much as the <em>what</em>. An agent that arrives at the right answer after 12 unnecessary tool calls isn&#8217;t production-ready.</p>



<p class="wp-block-paragraph">The <a href="https://docs.langchain.com/oss/python/langchain/test/evals#agent-evals" rel="nofollow noopener" target="_blank"><code>agentevals</code> package</a> provides trajectory evaluators:</p>



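<pre class="wp-block-code"><code># trajectory_eval.py
# pip install agentevals langsmith

import json

from agentevals.trajectory.match import create_trajectory_match_evaluator
from langsmith import evaluate


def tool_call_message(tool_name: str) -&gt; dict:
    """Wraps a tool name in an OpenAI-style assistant message, the shape agentevals compares."""
    return {
        "role": "assistant",
        "content": "",
        "tool_calls": &#91;{"function": {"name": tool_name, "arguments": json.dumps({})}}],
    }


# Expected trajectory for a customer support query.
# Assumption: each dataset example stores this under a "trajectory" reference output.
reference_trajectory = &#91;
    tool_call_message("retrieve_customer_profile"),
    tool_call_message("check_order_status"),
    tool_call_message("generate_response"),
]

# Create a trajectory match evaluator in "unordered" mode
# (tools must all appear, but order flexible)
match_evaluator = create_trajectory_match_evaluator(
    trajectory_match_mode="unordered"
)


def trajectory_evaluator(outputs: dict, reference_outputs: dict) -&gt; dict:
    """Adapts LangSmith's dict outputs to the message lists agentevals expects."""
    return match_evaluator(
        outputs=outputs&#91;"trajectory"],
        reference_outputs=reference_outputs&#91;"trajectory"],
    )


def run_agent_and_track(inputs: dict) -&gt; dict:
    """
    Wraps your agent to capture both the final response and the tool trajectory.
    In LangGraph, use astream with stream_mode='debug' to capture node names.
    """
    # Simulated agent run; in production, return the agent's real message history
    tools_called = &#91;"retrieve_customer_profile", "check_order_status", "generate_response"]
    trajectory = &#91;tool_call_message(name) for name in tools_called]
    answer = "Your order #1234 is out for delivery and will arrive today."

    return {
        "answer": answer,
        "trajectory": trajectory
    }


# Run trajectory evaluation
results = evaluate(
    run_agent_and_track,
    data="customer-support-dataset",       # Your LangSmith dataset name
    evaluators=&#91;trajectory_evaluator],
    experiment_prefix="support-agent-v2-trajectory",
)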
</code></pre>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph">Reference: <a href="https://docs.langchain.com/langsmith/evaluation-approaches#evaluating-an-agents-trajectory" rel="nofollow noopener" target="_blank">Evaluating an agent&#8217;s trajectory</a>, <a href="https://docs.langchain.com/langsmith/trajectory-evals#trajectory-match-evaluator" rel="nofollow noopener" target="_blank">Trajectory match evaluator</a></p>
</blockquote>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Part 6: The Evaluation Framework — Putting It All Together</h2>



<p class="wp-block-paragraph">Now you have individual evaluators. Let&#8217;s wire them into a complete evaluation pipeline using LangSmith&#8217;s <code>evaluate</code> function.</p>



<h3 class="wp-block-heading">Step 1: Create Your Dataset</h3>



<p class="wp-block-paragraph">A dataset is a collection of test examples — each with an <em>input</em> and an optional <em>reference output</em>. Build your first dataset from three sources:</p>



<ul class="wp-block-list">
<li>Manually curated golden examples (high signal)</li>



<li>Historical production traces where the agent did well (realistic coverage)</li>



<li>Synthetic variations generated by an LLM (breadth at scale)</li>
</ul>



<pre class="wp-block-code"><code>from langsmith import Client

client = Client()

# Create a dataset
dataset = client.create_dataset(
    dataset_name="agent-quality-v1",
    description="Evaluation dataset for the customer support agent"
)

# Add examples
examples = &#91;
    {
        "inputs": {"question": "What is the refund policy for digital products?"},
        "outputs": {"answer": "Digital products are non-refundable unless the file is corrupted."}
    },
    {
        "inputs": {"question": "How do I track my order?"},
        "outputs": {"answer": "Log in to your account, go to Orders, and click Track on the relevant order."}
    },
    {
        "inputs": {"question": "Can I change my shipping address after ordering?"},
        "outputs": {"answer": "You can change your address within 1 hour of placing the order by contacting support."}
    },
]

client.create_examples(
    inputs=&#91;e&#91;"inputs"] for e in examples],
    outputs=&#91;e&#91;"outputs"] for e in examples],
    dataset_id=dataset.id,
)
</code></pre>



<h3 class="wp-block-heading">Step 2: Define the Target Function</h3>



<pre class="wp-block-code"><code># The function LangSmith will evaluate
def my_agent_target(inputs: dict) -&gt; dict:
    """
    Your agent call wrapped in a target function.
    LangSmith passes each dataset example's input here.
    """
    from langchain_anthropic import ChatAnthropic

    model = ChatAnthropic(model="claude-sonnet-4-20250514", temperature=0)
    question = inputs.get("question", "")
    response = model.invoke(f"You are a helpful customer support agent.\n\nQuestion: {question}")
    return {"answer": response.content}
</code></pre>



<h3 class="wp-block-heading">Step 3: Run the Full Evaluation</h3>



<pre class="wp-block-code"><code>from langsmith import evaluate
# Import your evaluators from earlier sections
from evaluators import response_length_evaluator, json_format_evaluator
from llm_judge_evaluators import correctness_judge, relevance_judge

results = evaluate(
    my_agent_target,
    data="agent-quality-v1",
    evaluators=&#91;
        correctness_judge,
        relevance_judge,
        response_length_evaluator,
    ],
    experiment_prefix="customer-support-v1",
    num_repetitions=1,        # Run each example once
    max_concurrency=4,        # Parallel evaluation for speed
)

print(f"Experiment complete: {results.experiment_name}")  # evaluate() also prints a link to the results
</code></pre>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph">Reference: <a href="https://docs.langchain.com/langsmith/evaluation-quickstart" rel="nofollow noopener" target="_blank">Evaluation quickstart — LangSmith</a>, <a href="https://docs.langchain.com/oss/python/langchain/test/evals#run-evals-in-langsmith" rel="nofollow noopener" target="_blank">Run evals in LangSmith</a></p>
</blockquote>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Part 7: The LangSmith Platform — Closing the Loop</h2>



<p class="wp-block-paragraph">Everything above can run locally. But LangSmith is where evaluation becomes a continuous discipline rather than a one-time script.</p>



<h3 class="wp-block-heading">What LangSmith Actually Is</h3>



<p class="wp-block-paragraph"><a href="https://docs.langchain.com/langsmith/home.md" rel="nofollow noopener" target="_blank">LangSmith</a> is a <strong>framework-agnostic platform for building, debugging, and deploying AI agents</strong>. It works with LangGraph, plain LangChain, OpenAI calls, and any other stack. You get tracing, evaluation, prompt management, and monitoring in one place.</p>



<p class="wp-block-paragraph">The workflow is a cycle: <strong>Trace → Evaluate → Compare → Monitor → Improve</strong>, then back to tracing the improved agent.</p>



<h3 class="wp-block-heading">Offline Evaluation: Test Before You Ship</h3>



<p class="wp-block-paragraph">The <a href="https://docs.langchain.com/langsmith/evaluation#offline-evaluation-flow" rel="nofollow noopener" target="_blank"><code>evaluate</code> function</a> runs your agent against a dataset and logs every result as an <em>experiment</em> in LangSmith. Each experiment shows:</p>



<ul class="wp-block-list">
<li>Per-example scores for every evaluator</li>



<li>Aggregate pass rates across the dataset</li>



<li>Side-by-side diff when you compare two experiments</li>
</ul>



<p class="wp-block-paragraph"><strong>Regression testing</strong> is where this becomes powerful. After every prompt change or model upgrade, run the same dataset. LangSmith&#8217;s comparison view highlights exactly which examples regressed — no manual diffing needed.</p>



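<pre class="wp-block-code"><code># Compare two experiments after a model upgrade
# (my_agent_target_v1 and my_agent_target_v2 are two copies of the Part 6 target function, one per model)
# Run experiment 1: old model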
results_v1 = evaluate(my_agent_target_v1, data="agent-quality-v1",
                       experiment_prefix="support-agent-gpt4")

# Run experiment 2: new model
results_v2 = evaluate(my_agent_target_v2, data="agent-quality-v1",
                       experiment_prefix="support-agent-claude")

# In LangSmith UI: select both experiments → Compare
# Instantly see which examples improved or regressed
</code></pre>
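<p class="wp-block-paragraph">The comparison view does this in the UI. If you would rather check it in a script, the underlying logic can be sketched as below, assuming you have already collected per-example scores for each experiment into dictionaries keyed by example ID (a hypothetical shape, not a LangSmith export format):</p>

```python
def find_regressions(baseline: dict[str, float], candidate: dict[str, float]) -> list[str]:
    """Returns IDs of examples whose score dropped from baseline to candidate."""
    regressed = []
    for example_id, old_score in baseline.items():
        new_score = candidate.get(example_id)
        # Skip examples that are missing from the candidate run.
        if new_score is not None and old_score > new_score:
            regressed.append(example_id)
    return sorted(regressed)
```

<p class="wp-block-paragraph">Looking at individual examples rather than the aggregate matters because an upgrade can fix three examples and break three others while leaving the overall pass rate unchanged.</p>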



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph">Reference: <a href="https://docs.langchain.com/langsmith/compare-experiment-results" rel="nofollow noopener" target="_blank">How to compare experiment results</a></p>
</blockquote>



<h3 class="wp-block-heading">Online Evaluation: Monitor in Production</h3>



<p class="wp-block-paragraph">Once your agent is live, you can&#8217;t run every interaction against a dataset — there&#8217;s no reference answer for real user queries. This is where <strong>online evaluation</strong> takes over.</p>



<p class="wp-block-paragraph">Online evaluators run automatically on your production traces, in near real-time, using reference-free checks:</p>



<ul class="wp-block-list">
<li><strong>Safety checks</strong> — is the output within policy?</li>



<li><strong>Format validation</strong> — is structured output parseable?</li>



<li><strong>Quality heuristics</strong> — is the response suspiciously short or empty?</li>



<li><strong>Reference-free LLM-as-judge</strong> — does the answer address the question?</li>
</ul>



<pre class="wp-block-code"><code># This runs automatically on every production trace, no code changes needed.
# Set up via LangSmith UI → Projects → Your Project → Evaluators tab → + Evaluator
</code></pre>



<p class="wp-block-paragraph">Apply <strong>sampling rates</strong> to control cost — for example, run the full LLM judge on 10% of traces and code evaluators on 100%.</p>
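<p class="wp-block-paragraph">LangSmith applies the sampling rate for you once it is configured. To illustrate the idea for a pipeline of your own, here is a sketch of deterministic sampling keyed on a trace ID (the function and the ID format are assumptions for illustration):</p>

```python
import hashlib


def should_sample(trace_id: str, rate: float) -> bool:
    """Deterministically selects roughly `rate` of traces, keyed on the trace ID."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 4 bytes of the hash to a float in [0, 1).
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return rate > bucket
```

<p class="wp-block-paragraph">Hashing instead of calling <code>random.random()</code> makes the decision reproducible: the same trace is always in or out of the sample, which helps when you re-run an evaluator after changing its prompt.</p>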



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph">Reference: <a href="https://docs.langchain.com/langsmith/evaluation#online-evaluation-flow" rel="nofollow noopener" target="_blank">Online evaluation flow</a>, <a href="https://docs.langchain.com/langsmith/evaluation-types#online-evaluation-types" rel="nofollow noopener" target="_blank">Online evaluation types</a></p>
</blockquote>



<h3 class="wp-block-heading">The Feedback Loop: From Production Failures to Dataset Gold</h3>



<p class="wp-block-paragraph">This is the highest-value workflow in LangSmith and the most underused:</p>



<ol class="wp-block-list">
<li>A production trace scores poorly on your online evaluator.</li>



<li>You click <strong>Add to Dataset</strong> directly in the LangSmith UI.</li>



<li>That failing example becomes a new test case in your offline dataset.</li>



<li>You fix the prompt, run the evaluation — and verify the fix holds on the exact input that broke production.</li>



<li>Redeploy. Repeat.</li>
</ol>
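<p class="wp-block-paragraph">Steps 1 to 3 can also be scripted. The sketch below assumes each trace has been reduced to a plain dictionary with <code>id</code>, <code>inputs</code>, and <code>feedback</code> keys (a hypothetical shape chosen for illustration) and only prepares the examples. Uploading them, and writing the reference answers, is left to you:</p>

```python
def failing_traces_to_examples(
    traces: list[dict], feedback_key: str, threshold: float = 0.5
) -> list[dict]:
    """Turns low-scoring traces into dataset examples that still need a reference answer."""
    examples = []
    for trace in traces:
        score = trace["feedback"].get(feedback_key)
        # Traces that were never scored on this key are left alone.
        if score is not None and threshold > score:
            examples.append(
                {
                    "inputs": trace["inputs"],
                    "outputs": None,  # reference answer to be written by a reviewer
                    "metadata": {"source_trace": trace["id"]},
                }
            )
    return examples
```

<p class="wp-block-paragraph">Keeping the source trace ID in the metadata lets you trace every regression test back to the production incident that motivated it.</p>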



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph"><em>&#8220;Add failing production traces to your dataset, create targeted evaluators, validate fixes with offline experiments, and redeploy.&#8221;</em> — <a href="https://docs.langchain.com/langsmith/evaluation-concepts#online-evaluations" rel="nofollow noopener" target="_blank">LangSmith evaluation concepts</a></p>
</blockquote>



<p class="wp-block-paragraph">This loop — production failure → curated dataset → targeted eval → verified fix — is what separates teams that continuously improve their agents from teams that perpetually firefight.</p>
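
<p class="wp-block-paragraph">If you ever want to batch this curation step instead of clicking through the UI, the logic is simple enough to sketch locally. The <code>Trace</code> class and the 0.5 threshold are hypothetical stand-ins; uploading the resulting examples to a LangSmith dataset is deliberately left out.</p>

```python
from dataclasses import dataclass

@dataclass
class Trace:
    """Hypothetical stand-in for a production trace with an evaluator score."""
    inputs: dict
    outputs: dict
    score: float

def failing_traces_to_examples(traces: list[Trace], threshold: float = 0.5) -> list[dict]:
    """Turn low-scoring traces into regression examples, skipping duplicate inputs."""
    examples, seen = [], set()
    for trace in traces:
        key = repr(sorted(trace.inputs.items()))
        if trace.score >= threshold or key in seen:
            continue
        seen.add(key)
        # The bad output is kept for context; a human still writes the reference answer.
        examples.append(
            {"inputs": trace.inputs, "observed_output": trace.outputs, "reference": None}
        )
    return examples
```

<p class="wp-block-paragraph">Leaving <code>reference</code> empty is intentional: the failing output is not the correct answer, so someone has to supply the expected one before the example becomes a useful test case.</p>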



<h3 class="wp-block-heading">Pytest Integration: Eval as Code</h3>



<p class="wp-block-paragraph">For CI/CD pipelines, LangSmith&#8217;s <a href="https://docs.langchain.com/langsmith/pytest" rel="nofollow noopener" target="_blank">pytest integration</a> lets you define evaluations as unit tests. Every <code>@pytest.mark.langsmith</code>-decorated test syncs to a dataset and creates an experiment on each run:</p>



<pre class="wp-block-code"><code># test_agent_quality.py
import pytest
from langsmith import testing as lst

# my_agent_target is your own agent entry point: a callable that takes the
# inputs dict and returns {"answer": ...}. Import it from your application code.

@pytest.mark.langsmith
def test_refund_policy_answer():
    """Agent must correctly answer the refund policy question."""
    inputs = {"question": "Are digital products refundable?"}
    output = my_agent_target(inputs)

    lst.log_inputs(inputs)
    lst.log_outputs(output)
    lst.log_reference_outputs({"answer": "Digital products are non-refundable unless the file is corrupted."})

    assert "non-refundable" in output&#91;"answer"].lower(), (
        f"Expected refund policy language, got: {output&#91;'answer']}"
    )
</code></pre>



<p class="wp-block-paragraph">Run it:</p>



<pre class="wp-block-code"><code>LANGSMITH_API_KEY=your_key pytest test_agent_quality.py -v
</code></pre>



<p class="wp-block-paragraph">Each run creates a new experiment in LangSmith with its own pass/fail rate, so you can fail the CI job whenever the pass rate drops below your threshold.</p>
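
<p class="wp-block-paragraph">Plain pytest already fails the job on any failed assertion. If you would rather tolerate a small failure rate, for example with non-deterministic LLM outputs, a threshold gate can be sketched like this. The function names and the 0.9 default are assumptions for illustration.</p>

```python
def pass_rate(results: list[bool]) -> float:
    """Fraction of passing tests; an empty run counts as 0.0 so it cannot pass the gate."""
    return sum(results) / len(results) if results else 0.0

def ci_gate(results: list[bool], threshold: float = 0.9) -> int:
    """Return a process exit code: 0 to let the pipeline continue, 1 to block it."""
    return 0 if pass_rate(results) >= threshold else 1
```

<p class="wp-block-paragraph">In a CI script you would finish with <code>sys.exit(ci_gate(results))</code>, because most CI systems treat a non-zero exit code as a failed step.</p>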



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Part 8: The Full Evaluation Architecture</h2>



<p class="wp-block-paragraph">Here is the complete mental model — evaluation at every stage of the agent lifecycle:</p>



<pre class="wp-block-code"><code>LOCAL DEVELOPMENT
├── Unit evaluators (code-based, instant)
├── LLM-as-judge (correctness, relevance, groundedness)
└── Trajectory match (tool call sequence checks)
            │
            ▼
PRE-SHIP (CI/CD Gate)
├── LangSmith dataset evaluation (offline)
├── Experiment comparison vs. baseline
└── pytest regression suite → block on fail
            │
            ▼
PRODUCTION (Continuous)
├── LangSmith tracing (every run captured)
├── Online evaluators (safety, format, quality — sampled)
├── Dashboards + alerts (p95 latency, eval score trends)
└── Feedback loop → failing traces → dataset → fix
</code></pre>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">What You&#8217;ve Built</h2>



<p class="wp-block-paragraph">Walk through what we&#8217;ve just constructed:</p>



<p class="wp-block-paragraph">Starting with <em>why quality matters</em>, you built a multi-dimensional mental model — output quality, trajectory quality, efficiency, and safety. Then you built code-based evaluators for structural checks, LLM-as-judge evaluators for semantic quality, and trajectory evaluators for agent path validation. You wired them into a LangSmith evaluation pipeline backed by a curated dataset, ran offline experiments to gate CI/CD, and deployed online evaluators to monitor production in real time. Finally, you closed the loop — turning production failures into dataset gold.</p>



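<p class="wp-block-paragraph">This layered setup, with offline gates before release and sampled online evaluators after it, follows the workflow described in the LangSmith documentation linked throughout. Treat the code blocks as starting points and run them against your own agent and installed library versions before relying on them.</p>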



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Resources</h2>



<ul class="wp-block-list">
<li><a href="https://docs.langchain.com/langsmith/home.md" rel="nofollow noopener" target="_blank">LangSmith home</a></li>



<li><a href="https://docs.langchain.com/langsmith/evaluation-quickstart" rel="nofollow noopener" target="_blank">Evaluation quickstart</a></li>



<li><a href="https://docs.langchain.com/langsmith/evaluation-concepts" rel="nofollow noopener" target="_blank">Evaluation concepts — offline vs. online</a></li>



<li><a href="https://docs.langchain.com/langsmith/llm-as-judge-sdk" rel="nofollow noopener" target="_blank">LLM-as-judge SDK guide</a></li>



<li><a href="https://docs.langchain.com/langsmith/openevals" rel="nofollow noopener" target="_blank">OpenEvals — pre-built evaluators</a></li>



<li><a href="https://docs.langchain.com/langsmith/evaluation-approaches#evaluating-an-agents-trajectory" rel="nofollow noopener" target="_blank">Evaluating agent trajectories</a></li>



<li><a href="https://docs.langchain.com/langsmith/trajectory-evals" rel="nofollow noopener" target="_blank">Trajectory match evaluator — agentevals</a></li>



<li><a href="https://docs.langchain.com/langsmith/evaluate-rag-tutorial" rel="nofollow noopener" target="_blank">RAG evaluation — correctness, groundedness, relevance</a></li>



<li><a href="https://docs.langchain.com/langsmith/compare-experiment-results" rel="nofollow noopener" target="_blank">Compare experiment results</a></li>



<li><a href="https://docs.langchain.com/langsmith/online-evaluations-llm-as-judge" rel="nofollow noopener" target="_blank">Online evaluation — LLM-as-judge</a></li>



<li><a href="https://docs.langchain.com/langsmith/pytest" rel="nofollow noopener" target="_blank">Pytest integration for CI/CD</a></li>



<li><a href="https://docs.langchain.com/langsmith/evaluation-types#composite-evaluators" rel="nofollow noopener" target="_blank">Composite evaluators</a></li>



<li><a href="https://docs.langchain.com/langsmith/audit-evaluator-scores" rel="nofollow noopener" target="_blank">Audit and correct evaluator scores</a></li>
</ul>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p class="wp-block-paragraph"><em>All code examples verified against current LangSmith and LangChain documentation. Install: <code>pip install langsmith openevals agentevals langchain-anthropic</code></em></p>
]]></content:encoded>
					
					<wfw:commentRss>https://rpabotsworld.com/agent-quality-evaluation-llm-as-judge-langsmith/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>From Zero to Deep Agent: A Step-by-Step Guide Using LangGraph</title>
		<link>https://rpabotsworld.com/build-deep-agents-langgraph-step-by-step/</link>
					<comments>https://rpabotsworld.com/build-deep-agents-langgraph-step-by-step/#respond</comments>
		
		<dc:creator><![CDATA[Satish Prasad]]></dc:creator>
		<pubDate>Sun, 07 Jun 2026 12:40:24 +0000</pubDate>
				<category><![CDATA[Agentic AI & AI Automation]]></category>
		<category><![CDATA[AI Agents & Frameworks]]></category>
		<guid isPermaLink="false">https://rpabotsworld.com/?p=32094</guid>

					<description><![CDATA[From State to Subagents — learn how to build production-grade deep agents using LangGraph, with tested Python examples covering tools, memory, human-in-the-loop gates, and the Deep Agents harness.]]></description>
										<content:encoded><![CDATA[
<p class="wp-block-paragraph"><em>A story for every builder who has stared at a blank Python file and wondered: &#8220;Where do I even begin?&#8221;</em></p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">The Day My First Agent Broke in Production</h2>



<p class="wp-block-paragraph">Let me take you back to a Monday morning. I had just shipped what I thought was a beautiful AI agent — it answered questions, called APIs, even had a nice streaming UI. By Tuesday afternoon, it was dead. It had lost track of its own conversation, forgotten what tools it had already used, and looped itself into oblivion on a complex multi-step task.</p>



<p class="wp-block-paragraph">The real problem wasn&#8217;t the model. The model was smart enough. The problem was I had no framework for <em>orchestrating</em> the agent&#8217;s thinking — no shared memory, no controlled routing between steps, no way to pause for human review. I had built a racecar with no steering wheel.</p>



<p class="wp-block-paragraph">That&#8217;s when I found LangGraph. And more recently — <strong>LangGraph&#8217;s Deep Agents harness</strong>.</p>



<p class="wp-block-paragraph">This guide walks you through every concept you need, with working code at each step. By the end, you&#8217;ll have a fully functional deep research agent that plans tasks, delegates to subagents, and remembers its work across sessions.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Chapter 1: What Is LangGraph — And Why Should You Care?</h2>



<p class="wp-block-paragraph">Before we write a single line of code, you need to understand the mental model.</p>



<p class="wp-block-paragraph"><a href="https://docs.langchain.com/oss/python/langgraph/overview" rel="nofollow noopener" target="_blank">LangGraph</a> is a <strong>low-level orchestration framework</strong> for building stateful, long-running agents. Trusted by companies like Klarna, Uber, and J.P. Morgan, it gives you precise control over <em>how</em> your agent thinks and moves through a problem.</p>



<p class="wp-block-paragraph">The key idea is elegant: <strong>your agent&#8217;s behavior is a graph</strong>.</p>



<p class="wp-block-paragraph">Every agent you build has three moving parts:</p>



<ul class="wp-block-list">
<li><strong>State</strong> — a shared data structure representing a snapshot of everything the agent knows right now.</li>



<li><strong>Nodes</strong> — functions that do the actual work: calling an LLM, running a tool, grading a result.</li>



<li><strong>Edges</strong> — the routing logic that decides what happens next. They can be fixed transitions or conditional branches based on the current state.</li>
</ul>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph"><em>&#8220;Nodes do the work. Edges tell what to do next.&#8221;</em> — <a href="https://docs.langchain.com/oss/python/langgraph/graph-api" rel="nofollow noopener" target="_blank">LangGraph Graph API docs</a></p>
</blockquote>



<p class="wp-block-paragraph">This is fundamentally different from a chain or a simple prompt loop. In LangGraph, the agent can cycle back, branch to a different path, pause for a human, or delegate to a subagent — all in a structured, observable way.</p>
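
<p class="wp-block-paragraph">Before touching the real library, it can help to see the idea in plain Python. The following is a toy runner written for this explanation, not LangGraph&#8217;s implementation: nodes return partial updates, edges pick the next node, and one conditional edge creates a cycle.</p>

```python
from typing import Callable

State = dict
Node = Callable[[State], dict]
Edge = Callable[[State], str]

def run_graph(state: State, nodes: dict[str, Node], edges: dict[str, Edge], start: str) -> State:
    """Run nodes until an edge returns "END", merging each node's update into state."""
    current = start
    while current != "END":
        update = nodes[current](state)
        state = {**state, **update}
        current = edges[current](state)
    return state

def draft(state: State) -> dict:
    """Node: extend the text and count the attempt."""
    return {"text": state["text"] + "!", "attempts": state["attempts"] + 1}

def grade(state: State) -> dict:
    """Node: decide whether the draft is good enough."""
    return {"good_enough": state["text"].endswith("!!!")}

nodes = {"draft": draft, "grade": grade}
edges = {
    "draft": lambda s: "grade",  # fixed transition
    "grade": lambda s: "END" if s["good_enough"] else "draft",  # conditional branch
}
final = run_graph({"text": "hi", "attempts": 0, "good_enough": False}, nodes, edges, start="draft")
print(final)
```

<p class="wp-block-paragraph">The graph loops through <code>draft</code> and <code>grade</code> three times before the conditional edge routes to the end. That cycle is exactly what a linear chain cannot express.</p>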



<p class="wp-block-paragraph">And sitting on top of LangGraph is the newer <strong>Deep Agents</strong> harness — a batteries-included layer that adds built-in planning, a virtual filesystem, subagent spawning, and long-term memory. Think of it like this:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Layer</th><th>Role</th></tr></thead><tbody><tr><td><strong>LangGraph</strong></td><td>Orchestration runtime — durable execution, streaming, human-in-the-loop</td></tr><tr><td><strong>LangChain</strong></td><td>Agent framework — models, tools, agent loops</td></tr><tr><td><strong>Deep Agents</strong></td><td>Agent harness — planning, subagents, context management</td></tr><tr><td><strong>LangSmith</strong></td><td>Observability — tracing, evaluation, debugging</td></tr></tbody></table></figure>



<p class="wp-block-paragraph">We&#8217;ll build from the bottom up — starting with a raw LangGraph graph, then upgrading to Deep Agents patterns.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Chapter 2: Your First Real Graph — State, Nodes, and Edges</h2>



<p class="wp-block-paragraph">Install the dependencies:</p>



<pre class="wp-block-code"><code>pip install langgraph langchain-anthropic
</code></pre>



<p class="wp-block-paragraph">Now let&#8217;s build the simplest possible agent: one that receives a message and responds.</p>



<h3 class="wp-block-heading">Step 1: Define Your State</h3>



<p class="wp-block-paragraph">State is the backbone. Everything your agent knows — messages, intermediate results, flags — lives here.</p>



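<pre class="wp-block-code"><code>from typing import Annotated, TypedDict
from langchain_core.messages import AnyMessage
from langgraph.graph.message import add_messages

class AgentState(TypedDict):
    # add_messages merges updates into the history by message ID instead of
    # overwriting it, which prebuilt nodes such as ToolNode rely on.
    messages: Annotated&#91;list&#91;AnyMessage], add_messages]
    task_complete: bool
</code></pre>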



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph">Reference: <a href="https://docs.langchain.com/oss/python/langgraph/use-graph-api#define-state" rel="nofollow noopener" target="_blank">Define state — LangGraph Graph API</a></p>
</blockquote>



<h3 class="wp-block-heading">Step 2: Define Your Nodes</h3>



<p class="wp-block-paragraph">Each node is a plain Python function. It receives the current state and returns updates to the state.</p>



<pre class="wp-block-code"><code>from langchain_anthropic import ChatAnthropic

model = ChatAnthropic(model="claude-sonnet-4-20250514", temperature=0)

def call_llm(state: AgentState) -&gt; AgentState:
    """Node: call the LLM with current message history."""
    response = model.invoke(state&#91;"messages"])
    return {"messages": state&#91;"messages"] + &#91;response]}

def check_complete(state: AgentState) -&gt; AgentState:
    """Node: mark task as complete (simplified)."""
    return {"task_complete": True}
</code></pre>



<h3 class="wp-block-heading">Step 3: Wire the Graph</h3>



<pre class="wp-block-code"><code>from langgraph.graph import START, END, StateGraph

builder = StateGraph(AgentState)

# Add nodes
builder.add_node("call_llm", call_llm)
builder.add_node("check_complete", check_complete)

# Add edges
builder.add_edge(START, "call_llm")
builder.add_edge("call_llm", "check_complete")
builder.add_edge("check_complete", END)

graph = builder.compile()
</code></pre>



<h3 class="wp-block-heading">Step 4: Run It</h3>



<pre class="wp-block-code"><code>from langchain_core.messages import HumanMessage

result = graph.invoke({
    "messages": &#91;HumanMessage(content="What is LangGraph?")],
    "task_complete": False
})

print(result&#91;"messages"]&#91;-1].content)
</code></pre>



<p class="wp-block-paragraph">That&#8217;s your first graph. Four steps, a working agent. But this one can&#8217;t use tools, remember anything across sessions, or route conditionally. Let&#8217;s fix that.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Chapter 3: Adding Tools and Conditional Routing</h2>



<p class="wp-block-paragraph">Real agents don&#8217;t just chat — they <em>act</em>. Let&#8217;s add tool calling and teach the graph to route based on whether the model wants to use a tool.</p>



<h3 class="wp-block-heading">Define Tools</h3>



<pre class="wp-block-code"><code>from langchain_core.tools import tool

@tool
def web_search(query: str) -&gt; str:
    """Search the web for current information."""
    # In production, hook up to Tavily, SerpAPI, etc.
    return f"Search results for: {query} — &#91;placeholder result]"

@tool
def calculator(expression: str) -&gt; str:
    """Evaluate a mathematical expression."""
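    # Demo only: eval() executes arbitrary Python, so never feed it untrusted
    # model output in production. Use a dedicated math parser instead.
    try:
        return str(eval(expression))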
    except Exception as e:
        return f"Error: {e}"

tools = &#91;web_search, calculator]
</code></pre>



<h3 class="wp-block-heading">Bind Tools to the Model</h3>



<pre class="wp-block-code"><code>model_with_tools = model.bind_tools(tools)
</code></pre>



<h3 class="wp-block-heading">Add a ToolNode and Conditional Router</h3>



<pre class="wp-block-code"><code>from langgraph.graph import START, END, StateGraph
from langgraph.prebuilt import ToolNode
from langchain_core.messages import AnyMessage
from typing import Literal

def agent_node(state: AgentState):
    response = model_with_tools.invoke(state&#91;"messages"])
    return {"messages": state&#91;"messages"] + &#91;response]}

def route_after_agent(state: AgentState) -&gt; Literal&#91;"tools", "__end__"]:
    """Conditional edge: go to tools if the model made tool calls, else end."""
    last_message = state&#91;"messages"]&#91;-1]
    if getattr(last_message, "tool_calls", None):
        return "tools"
    return "__end__"

tool_node = ToolNode(tools)

builder = StateGraph(AgentState)
builder.add_node("agent", agent_node)
builder.add_node("tools", tool_node)

builder.add_edge(START, "agent")
builder.add_conditional_edges("agent", route_after_agent)
builder.add_edge("tools", "agent")  # loop back after tool use

graph = builder.compile()
</code></pre>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph">Reference: <a href="https://docs.langchain.com/oss/javascript/langgraph/workflows-agents#agents" rel="nofollow noopener" target="_blank">Agents — LangGraph workflows</a></p>
</blockquote>



<p class="wp-block-paragraph">Now your agent can loop: it calls the model, decides to use a tool, executes the tool, passes results back to the model, and continues until it&#8217;s done. This is the <strong>ReAct loop</strong> — the foundation of most production agents.</p>
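
<p class="wp-block-paragraph">Stripped of the framework, the loop looks roughly like the sketch below. The scripted <code>fake_model</code>, the <code>add</code> tool, and the dict-based message format are all stand-ins invented for this illustration, so it runs without an API key.</p>

```python
def fake_model(messages: list[dict]) -> dict:
    """Scripted stand-in for an LLM: request the add tool once, then answer."""
    tool_results = [m for m in messages if m["role"] == "tool"]
    if not tool_results:
        call = {"name": "add", "args": {"a": 2, "b": 3}}
        return {"role": "assistant", "content": "", "tool_calls": [call]}
    answer = f"The answer is {tool_results[-1]['content']}."
    return {"role": "assistant", "content": answer, "tool_calls": []}

TOOLS = {"add": lambda a, b: a + b}

def react_loop(messages: list[dict], max_steps: int = 5) -> list[dict]:
    """Alternate model and tool steps until the model stops requesting tools."""
    for _ in range(max_steps):
        reply = fake_model(messages)
        messages = messages + [reply]
        if not reply["tool_calls"]:
            break  # the same decision route_after_agent makes
        for call in reply["tool_calls"]:
            result = TOOLS[call["name"]](**call["args"])
            messages = messages + [{"role": "tool", "content": str(result)}]
    return messages

history = react_loop([{"role": "user", "content": "What is 2 + 3?"}])
print(history[-1]["content"])
```

<p class="wp-block-paragraph">The <code>max_steps</code> cap matters: without it, a model that keeps requesting tools would loop forever. LangGraph guards against the same failure with its recursion limit.</p>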



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Chapter 4: Memory and Persistence with Checkpointers</h2>



<p class="wp-block-paragraph">Here&#8217;s where most tutorial agents fail: they forget everything between runs.</p>



<p class="wp-block-paragraph">LangGraph solves this with <strong>checkpointers</strong> — a persistence layer that saves your agent&#8217;s state at every step. Resume a paused run, recover from a crash, or let a human review mid-task.</p>



<pre class="wp-block-code"><code>from langgraph.checkpoint.memory import InMemorySaver

checkpointer = InMemorySaver()
graph = builder.compile(checkpointer=checkpointer)
</code></pre>



<p class="wp-block-paragraph">Now invoke with a <code>thread_id</code> to maintain session continuity:</p>



<pre class="wp-block-code"><code>config = {"configurable": {"thread_id": "user-session-001"}}

# First message
result = graph.invoke(
    {"messages": &#91;HumanMessage(content="My name is Satish. Remember that.")], "task_complete": False},
    config=config
)

# Second message — same thread, same memory
result2 = graph.invoke(
    {"messages": result&#91;"messages"] + &#91;HumanMessage(content="What is my name?")]},
    config=config
)

print(result2&#91;"messages"]&#91;-1].content)
# → "Your name is Satish."
</code></pre>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph">Reference: <a href="https://docs.langchain.com/oss/python/langgraph/persistence#using-in-langgraph" rel="nofollow noopener" target="_blank">Using in LangGraph — Persistence</a></p>
</blockquote>



<p class="wp-block-paragraph">For production, swap <code>InMemorySaver</code> for a Redis or PostgreSQL checkpointer. The <code>compile(checkpointer=...)</code> call stays the same; only the construction of the saver (connection details and one-time setup) changes.</p>
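
<p class="wp-block-paragraph">If the role of <code>thread_id</code> still feels abstract, this toy checkpointer shows the core idea: one list of state snapshots per thread. It is a conceptual sketch with made-up names and does not mirror LangGraph&#8217;s saver interface.</p>

```python
import copy
from typing import Optional

class ToyCheckpointer:
    """Keeps one list of state snapshots per thread_id (in memory only)."""

    def __init__(self) -> None:
        self._threads: dict[str, list[dict]] = {}

    def save(self, thread_id: str, state: dict) -> None:
        # Deep-copy so later mutations cannot rewrite history.
        self._threads.setdefault(thread_id, []).append(copy.deepcopy(state))

    def latest(self, thread_id: str) -> Optional[dict]:
        snapshots = self._threads.get(thread_id)
        return copy.deepcopy(snapshots[-1]) if snapshots else None

def chat_turn(cp: ToyCheckpointer, thread_id: str, user_text: str) -> dict:
    """Load the thread's last state, append the new message, and checkpoint it."""
    state = cp.latest(thread_id) or {"messages": []}
    state["messages"].append(user_text)
    cp.save(thread_id, state)
    return state
```

<p class="wp-block-paragraph">Two calls with the same <code>thread_id</code> share history, while a different <code>thread_id</code> starts from an empty state. Swapping the dictionary for a database table is what a persistent backend amounts to.</p>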



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Chapter 5: Human-in-the-Loop — The Safety Net</h2>



<p class="wp-block-paragraph">An autonomous agent making decisions at scale is powerful. An autonomous agent making decisions <em>without any oversight</em> is a liability — especially in FSI or regulated environments.</p>



<p class="wp-block-paragraph">LangGraph&#8217;s <code>interrupt()</code> lets you pause an agent mid-graph and wait for human input before continuing.</p>



<pre class="wp-block-code"><code>from langgraph.graph import START, END, StateGraph
from langgraph.types import interrupt, Command
from langgraph.checkpoint.memory import InMemorySaver
from typing import TypedDict

class ReviewState(TypedDict):
    task: str
    draft_output: str
    approved: bool

def draft_node(state: ReviewState):
    # Simulate the agent drafting something
    return {"draft_output": f"Draft response to: {state&#91;'task']}"}

def human_review_node(state: ReviewState):
    # Pause here and surface the draft to a human
    decision = interrupt({
        "draft": state&#91;"draft_output"],
        "instruction": "Approve or edit this output before we proceed."
    })
    return {"approved": decision.get("approved", False)}

def finalize_node(state: ReviewState):
    if state&#91;"approved"]:
        return {"draft_output": f"&#91;APPROVED] {state&#91;'draft_output']}"}
    return {"draft_output": "&#91;REJECTED — needs revision]"}

checkpointer = InMemorySaver()

review_graph = (
    StateGraph(ReviewState)
    .add_node("draft", draft_node)
    .add_node("human_review", human_review_node)
    .add_node("finalize", finalize_node)
    .add_edge(START, "draft")
    .add_edge("draft", "human_review")
    .add_edge("human_review", "finalize")
    .add_edge("finalize", END)
    .compile(checkpointer=checkpointer)
)

config = {"configurable": {"thread_id": "review-001"}}

# Run to the interrupt
review_graph.invoke({"task": "Write quarterly summary", "draft_output": "", "approved": False}, config)

# Human approves — resume
review_graph.invoke(Command(resume={"approved": True}), config)
</code></pre>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph">Reference: <a href="https://docs.langchain.com/oss/python/langgraph/thinking-in-langgraph#testing-the-agent" rel="nofollow noopener" target="_blank">Testing the agent — human-in-the-loop</a></p>
</blockquote>



<p class="wp-block-paragraph">This pattern maps directly onto governance gates in regulated industries: the agent drafts, a human reviews, execution continues only on explicit approval.</p>
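
<p class="wp-block-paragraph">The pause-and-resume mechanic can be imitated with a plain Python generator, which may make the control flow easier to picture. This is an analogy only: a generator keeps its state in memory, whereas LangGraph persists state through the checkpointer and, as far as I understand, re-runs the interrupted node from its start on resume.</p>

```python
def review_flow(task: str):
    """Generator-based sketch: yield pauses the flow, send() resumes it."""
    draft = f"Draft response to: {task}"
    decision = yield {"draft": draft, "instruction": "Approve or reject."}  # pause
    if decision.get("approved", False):
        return f"[APPROVED] {draft}"
    return "[REJECTED: needs revision]"

def run_with_decision(task: str, decision: dict) -> tuple[dict, str]:
    """Run to the pause, then resume with the human's decision."""
    flow = review_flow(task)
    payload = next(flow)  # runs until the pause and surfaces the draft
    try:
        flow.send(decision)  # resume with the reviewer's answer
    except StopIteration as done:
        return payload, done.value
    raise RuntimeError("flow paused more than once")

payload, outcome = run_with_decision("Write quarterly summary", {"approved": True})
print(outcome)
```

<p class="wp-block-paragraph">Note the default in <code>decision.get("approved", False)</code>: a missing or malformed decision is treated as a rejection, which is the safe direction for a governance gate.</p>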



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Chapter 6: Enter Deep Agents — The Harness Level</h2>



<p class="wp-block-paragraph">Now we level up. <strong>Deep Agents</strong> is the highest-level abstraction in the LangChain stack — an agent harness built on LangGraph that adds:</p>



<ul class="wp-block-list">
<li><strong>Built-in planning tools</strong> — the agent can decompose complex tasks into steps</li>



<li><strong>Virtual filesystem</strong> — agents read and write files across long runs</li>



<li><strong>Subagent spawning</strong> — delegate subtasks to specialist agents running in isolated context windows</li>



<li><strong>Long-term memory</strong> — update and retrieve knowledge across sessions</li>
</ul>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph"><em>&#8220;deepagents is a standalone library built on top of LangChain&#8217;s core building blocks for agents. It uses the LangGraph runtime for durable execution, streaming, human-in-the-loop, and other features.&#8221;</em> — <a href="https://docs.langchain.com/oss/python/deepagents/overview" rel="nofollow noopener" target="_blank">Deep Agents overview</a></p>
</blockquote>



<p class="wp-block-paragraph">Install:</p>



<pre class="wp-block-code"><code>pip install deepagents langchain-anthropic
</code></pre>



<h3 class="wp-block-heading">Building a Deep Research Agent</h3>



<p class="wp-block-paragraph">Here&#8217;s a complete, testable example — a coordinator agent that plans a research task, delegates to a web-search subagent and a summarizer subagent, then synthesizes the final answer.</p>



<pre class="wp-block-code"><code># deep_research_agent.py
from deepagents import create_deep_agent, SubAgent
from langchain_anthropic import ChatAnthropic
from langchain_core.tools import tool
from langgraph.checkpoint.memory import InMemorySaver

# ─── Model ───────────────────────────────────────────────
model = ChatAnthropic(model="claude-sonnet-4-20250514", temperature=0)

# ─── Tools ───────────────────────────────────────────────

@tool
def web_search(query: str) -&gt; str:
    """Search the web for information on a given topic."""
    # Wire to Tavily or SerpAPI in production
    return f"&#91;Search results for '{query}']: LangGraph was released by LangChain in 2024. It is a stateful agent orchestration framework built on a graph model with nodes, edges, and shared state."

@tool
def summarize_text(text: str) -&gt; str:
    """Summarize a block of text into key bullet points."""
    # In production, call the model here
    return f"Summary: {text&#91;:200]}..."

# ─── Subagents ────────────────────────────────────────────

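# The Researcher subagent: specialized in web search
# (system_prompt is a required SubAgent key in recent deepagents releases)
researcher = SubAgent(
    name="researcher",
    description="Searches the web and retrieves relevant information on any topic. Use this for fact-finding tasks.",
    system_prompt="You are a research specialist. Use web_search to gather facts and report them concisely.",
    tools=&#91;web_search],
    model=model,
)

# The Summarizer subagent: specialized in distillation
summarizer = SubAgent(
    name="summarizer",
    description="Takes raw text or search results and produces clean, structured summaries. Use this after research is complete.",
    system_prompt="You are a summarization specialist. Condense the supplied text into clear, structured bullet points.",
    tools=&#91;summarize_text],
    model=model,
)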

# ─── Coordinator (Deep Agent) ─────────────────────────────

checkpointer = InMemorySaver()

agent = create_deep_agent(
    model=model,
    subagents=&#91;researcher, summarizer],
    system_prompt="""You are a deep research coordinator.
When given a topic, you:
1. Plan which subtasks are needed
2. Delegate research to the researcher subagent
3. Delegate summarization to the summarizer subagent
4. Synthesize a final, structured answer

Always produce outputs in clear markdown with headings.""",
    checkpointer=checkpointer,
)

# ─── Run ──────────────────────────────────────────────────
if __name__ == "__main__":
    config = {"configurable": {"thread_id": "research-session-001"}}

    result = agent.invoke(
        {"messages": &#91;{"role": "user", "content": "Research how LangGraph works and give me a structured summary."}]},
        config=config
    )

    # Print the final coordinator message (the last entry in the history)
    print(result&#91;"messages"]&#91;-1].content)
</code></pre>



<p class="wp-block-paragraph">Run it:</p>



<pre class="wp-block-code"><code>python deep_research_agent.py
</code></pre>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph">Reference: <a href="https://docs.langchain.com/oss/python/deepagents/overview" rel="nofollow noopener" target="_blank">Deep Agents overview</a>, <a href="https://docs.langchain.com/oss/javascript/deepagents/subagents#compiledsubagent" rel="nofollow noopener" target="_blank">Subagents</a></p>
</blockquote>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Chapter 7: The Architecture Mental Model</h2>



<p class="wp-block-paragraph">Before you ship any of this to production, internalize this architecture. Deep Agents use a <strong>coordinator-worker model</strong>:</p>



<pre class="wp-block-code"><code>User Message
    │
    ▼
┌─────────────────────────────┐
│   COORDINATOR (Deep Agent)  │  ← Plans tasks, routes to subagents
│   - Receives user input     │
│   - Decides delegation      │
└────────┬───────────┬────────┘
         │           │
         ▼           ▼
┌──────────────┐  ┌──────────────┐
│  Researcher  │  │  Summarizer  │  ← Isolated context windows
│  Subagent    │  │  Subagent    │
└──────────────┘  └──────────────┘
         │           │
         └─────┬─────┘
               ▼
    ┌─────────────────┐
    │  Final Synthesis │  ← Coordinator assembles final answer
    └─────────────────┘
</code></pre>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph">Reference: <a href="https://docs.langchain.com/oss/python/deepagents/frontend/overview#architecture" rel="nofollow noopener" target="_blank">Architecture — Deep Agents frontend</a></p>
</blockquote>



<p class="wp-block-paragraph">Each subagent runs in its <strong>own isolated context window</strong>. This means:</p>



<ul class="wp-block-list">
<li>No context pollution between specialists</li>



<li>Each subagent can run longer, focused tasks</li>



<li>You can parallelize subagents for speed</li>



<li>Memory and state are cleanly separated per agent</li>
</ul>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Chapter 8: What Makes This &#8220;Deep&#8221;?</h2>



<p class="wp-block-paragraph">You might ask: isn&#8217;t this just multi-agent? What&#8217;s the <em>deep</em> part?</p>



<p class="wp-block-paragraph">The depth comes from the harness capabilities that LangGraph alone doesn&#8217;t give you out of the box:</p>



<p class="wp-block-paragraph"><strong>Context management across long runs.</strong> A research task might span 50 tool calls and thousands of tokens. Deep Agents automatically summarize history and offload large results to the virtual filesystem, which makes it far less likely that the agent hits context limits mid-task.</p>
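
<p class="wp-block-paragraph">The offloading idea can be sketched in a few lines. This is not how the harness implements it internally; the 200-character threshold, the path scheme, and the dict-as-filesystem are assumptions chosen to keep the example small.</p>

```python
def offload_large_results(messages: list[dict], files: dict[str, str], max_chars: int = 200) -> list[dict]:
    """Move oversized tool outputs into `files` and leave a short pointer behind."""
    compacted = []
    for i, msg in enumerate(messages):
        if msg["role"] == "tool" and len(msg["content"]) > max_chars:
            path = f"/results/tool_output_{i}.txt"
            files[path] = msg["content"]
            preview = msg["content"][:50]
            pointer = f"[{len(files[path])} chars saved to {path}] {preview}..."
            msg = {**msg, "content": pointer}  # new dict; the original is untouched
        compacted.append(msg)
    return compacted
```

<p class="wp-block-paragraph">The model keeps a short pointer and a preview in its history and can read the full file back only when it actually needs the details.</p>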



<p class="wp-block-paragraph"><strong>Subagent isolation.</strong> Each specialist runs fresh — no shared message history. This is critical for reliability: the summarizer doesn&#8217;t need to know the researcher&#8217;s entire search history; it just needs the results.</p>



<p class="wp-block-paragraph"><strong>Planning tools built in.</strong> The coordinator can use built-in planning capabilities to decompose &#8220;research LangGraph for my blog post&#8221; into: <code>search → collect → summarize → structure → draft</code>. This planning step is what separates a simple loop from a genuine reasoning agent.</p>



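<p class="wp-block-paragraph"><strong>Memory that persists.</strong> Lessons learned, user preferences, domain knowledge: all of it can be stored and retrieved across sessions through a LangGraph Store (<code>InMemoryStore</code> in dev, a database-backed store in production). A checkpointer such as <code>InMemorySaver</code>, by contrast, only preserves state within a single thread.</p>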



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Chapter 9: Production Checklist Before You Ship</h2>



<p class="wp-block-paragraph">You&#8217;ve built your agent. Here&#8217;s what separates a demo from a production-grade deployment:</p>



<p class="wp-block-paragraph"><strong>1. Swap InMemorySaver for a persistent checkpointer.</strong> Use <code>langgraph-checkpoint-redis</code> for Redis or <code>langgraph-checkpoint-postgres</code> for PostgreSQL. The compile interface is identical.</p>



<p class="wp-block-paragraph"><strong>2. Add retry policies on fragile nodes.</strong></p>



<pre class="wp-block-code"><code>from langgraph.types import RetryPolicy

builder.add_node(
    "web_search_node",
    search_node_fn,
    retry_policy=RetryPolicy(max_attempts=3),
)
</code></pre>
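
<p class="wp-block-paragraph">Conceptually, a retry policy wraps the node in a loop like the one below. This is a hand-rolled sketch for intuition, not LangGraph&#8217;s retry code; <code>with_retries</code> and <code>flaky_search</code> are hypothetical names.</p>

```python
import time

def with_retries(fn, max_attempts: int = 3, base_delay: float = 0.0):
    """Wrap fn so it is retried with exponential backoff before the error propagates."""
    def wrapper(*args, **kwargs):
        for attempt in range(1, max_attempts + 1):
            try:
                return fn(*args, **kwargs)
            except Exception:
                if attempt == max_attempts:
                    raise  # out of attempts: surface the last error
                time.sleep(base_delay * 2 ** (attempt - 1))
    return wrapper

calls = {"count": 0}

def flaky_search(query: str) -> str:
    """Fails twice, then succeeds, to imitate a transient network error."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return f"results for {query}"

result = with_retries(flaky_search, max_attempts=3)("langgraph")
print(result)
```

<p class="wp-block-paragraph">Retries only help with transient faults such as timeouts or rate limits. A node that fails deterministically will simply fail three times, so keep the attempt count low.</p>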



<p class="wp-block-paragraph"><strong>3. Instrument with LangSmith.</strong> Set your env vars and every graph invocation is traced automatically:</p>



<pre class="wp-block-code"><code>export LANGSMITH_API_KEY=your_key
export LANGSMITH_TRACING=true
</code></pre>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph"><a href="https://docs.langchain.com/langsmith/trace-with-langgraph" rel="nofollow noopener" target="_blank">Trace with LangGraph — LangSmith</a></p>
</blockquote>



<p class="wp-block-paragraph"><strong>4. Add human-in-the-loop gates for high-stakes actions.</strong> Any node that sends emails, modifies data, or calls external APIs should have an <code>interrupt()</code> gate before execution.</p>



<p class="wp-block-paragraph"><strong>5. Test subagent namespace isolation.</strong> If you&#8217;re running multiple subagents in parallel, ensure each has a unique node name to prevent checkpoint collisions.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph">Reference: <a href="https://docs.langchain.com/oss/javascript/langgraph/use-subgraphs#multiple-subgraph-calls-2" rel="nofollow noopener" target="_blank">Multiple subgraph calls</a></p>
</blockquote>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">The Lesson I Wish I&#8217;d Known Earlier</h2>



<p class="wp-block-paragraph">When my first agent broke on that Tuesday, I didn&#8217;t need a smarter model. I needed a smarter <em>structure</em>.</p>



<p class="wp-block-paragraph">LangGraph gives you that structure: a graph that is observable, resumable, and testable at every node. Deep Agents adds the harness that makes complex, multi-step, multi-agent workflows practical to build and maintain.</p>



<p class="wp-block-paragraph">The pattern we&#8217;ve walked through — State → Nodes → Edges → Tools → Checkpointer → Human Gate → Subagents — is the same pattern running inside production agents at enterprise scale today.</p>



<p class="wp-block-paragraph">Start with the simple graph. Add tools. Add memory. Add governance gates. Then, when your task is complex enough to need specialists, introduce subagents. Don&#8217;t over-engineer day one. The graph scales with your ambition.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Resources</h2>



<ul class="wp-block-list">
<li><a href="https://docs.langchain.com/oss/python/langgraph/overview" rel="nofollow noopener" target="_blank">LangGraph Python Overview</a></li>



<li><a href="https://docs.langchain.com/oss/python/langgraph/graph-api" rel="nofollow noopener" target="_blank">Graph API — Nodes, Edges, State</a></li>



<li><a href="https://docs.langchain.com/oss/python/langgraph/use-graph-api" rel="nofollow noopener" target="_blank">Use Graph API — Sequences</a></li>



<li><a href="https://docs.langchain.com/oss/python/langgraph/persistence" rel="nofollow noopener" target="_blank">Persistence &amp; Checkpointers</a></li>



<li><a href="https://docs.langchain.com/oss/python/langgraph/thinking-in-langgraph" rel="nofollow noopener" target="_blank">Human-in-the-Loop (interrupt)</a></li>



<li><a href="https://docs.langchain.com/oss/python/deepagents/overview" rel="nofollow noopener" target="_blank">Deep Agents Overview (Python)</a></li>



<li><a href="https://docs.langchain.com/oss/javascript/deepagents/subagents#compiledsubagent" rel="nofollow noopener" target="_blank">Deep Agents — Subagents</a></li>



<li><a href="https://docs.langchain.com/oss/python/langgraph/agentic-rag" rel="nofollow noopener" target="_blank">Agentic RAG with LangGraph</a></li>



<li><a href="https://docs.langchain.com/langsmith/trace-with-langgraph" rel="nofollow noopener" target="_blank">Trace with LangSmith</a></li>
</ul>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p class="wp-block-paragraph"><em>Built with verified LangChain documentation. All code examples are production-compatible with LangGraph&#8217;s current API. Install requirements: <code>pip install langgraph langchain-anthropic deepagents langsmith</code></em></p>
]]></content:encoded>
					
					<wfw:commentRss>https://rpabotsworld.com/build-deep-agents-langgraph-step-by-step/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<media:thumbnail url="https://rpabotsworld.com/wp-content/uploads/2023/05/indian-software-developer-XAA6LDA.jpg" />	</item>
		<item>
		<title>Agent Harness vs. Context Engineering: The Next Evolution of AI Agent Architecture with LangGraph</title>
		<link>https://rpabotsworld.com/agent-harness-vs-context-engineering-the-next-evolution-of-ai-agent-architecture-with-langgraph/</link>
					<comments>https://rpabotsworld.com/agent-harness-vs-context-engineering-the-next-evolution-of-ai-agent-architecture-with-langgraph/#respond</comments>
		
		<dc:creator><![CDATA[Satish Prasad]]></dc:creator>
		<pubDate>Sun, 07 Jun 2026 12:16:13 +0000</pubDate>
				<category><![CDATA[Agentic AI & AI Automation]]></category>
		<category><![CDATA[AI Agents & Frameworks]]></category>
		<category><![CDATA[AI Agents]]></category>
		<category><![CDATA[Human in the Loop]]></category>
		<category><![CDATA[multi-agent systems]]></category>
		<category><![CDATA[UiPath Communication Mining]]></category>
		<guid isPermaLink="false">https://rpabotsworld.com/?p=32091</guid>

					<description><![CDATA[Agent Harness vs Context Engineering: How to Build Reliable AI Agents with LangGraph]]></description>
										<content:encoded><![CDATA[
<p class="wp-block-paragraph">Building AI applications has evolved dramatically. The community has moved past simple prompt tuning into complex system architecture. If you are building production-grade workflows today, you are likely grappling with a massive shift: moving from fragile proof-of-concepts to resilient, enterprise-grade systems.</p>



<p class="wp-block-paragraph">For most of 2024 and 2025, the AI engineering community focused heavily on <strong>Prompt Engineering</strong> and later <strong>Context Engineering</strong>. As AI agents became more autonomous, however, engineers discovered that neither prompts nor context alone could reliably deliver production-grade agent behavior.</p>



<p class="wp-block-paragraph">A new paradigm is now taking hold: <strong>Agent Harness Engineering</strong>. Leading AI companies and frameworks increasingly describe agent systems using a simple equation:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph">Agent = Model + Harness</p>
</blockquote>



<p class="wp-block-paragraph">The language model provides raw reasoning capabilities, while the harness provides everything required to transform that reasoning into reliable, safe, and deterministic actions.</p>
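


<p class="wp-block-paragraph">As a minimal sketch of that split, the loop below treats the model as a function that only proposes actions, while the harness executes tools, traps errors, and enforces a step limit. Everything here is hypothetical: the action dictionary format, <code>run_agent</code>, and the scripted stand-in for a model are assumptions for illustration.</p>

```python
from typing import Callable

# The "model" sees the transcript so far and returns an action dictionary.
Model = Callable[[list[str]], dict]


def run_agent(model: Model, tools: dict[str, Callable[[str], str]], task: str, max_steps: int = 5) -> str:
    """Harness loop: the model decides, the harness executes and guards."""
    transcript = [f"task: {task}"]
    for _ in range(max_steps):
        action = model(transcript)
        if action["type"] == "final":
            return action["content"]
        tool = tools.get(action["tool"])
        if tool is None:
            transcript.append(f"error: unknown tool {action['tool']}")
            continue
        try:
            transcript.append(f"{action['tool']} -> {tool(action['input'])}")
        except Exception as exc:
            # Tool failures are fed back to the model instead of crashing the run.
            transcript.append(f"error: {exc}")
    return "stopped: step limit reached"


def scripted_model(transcript: list[str]) -> dict:
    """Stand-in for an LLM: call the add tool once, then answer."""
    if len(transcript) == 1:
        return {"type": "tool", "tool": "add", "input": "2 3"}
    return {"type": "final", "content": transcript[-1]}


def add(text: str) -> str:
    a, b = text.split()
    return str(int(a) + int(b))


answer = run_agent(scripted_model, {"add": add}, "add 2 and 3")
print(answer)
```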



<h2 class="wp-block-heading">1. Defining the Core Concepts</h2>



<p class="wp-block-paragraph">To understand how to build resilient systems, we must first look at the three evolutionary eras of AI engineering:</p>



<pre class="wp-block-code"><code>Prompt Engineering   ➔   Context Engineering   ➔   Harness Engineering
(Shapes Behavior)        (Shapes Knowledge)         (Shapes Reliability)
</code></pre>



<ul class="wp-block-list">
<li><strong>Phase 1: Prompt Engineering (Shapes Behavior):</strong> Early AI applications focused on better instructions, Chain-of-Thought formatting, and few-shot examples. The assumption was simple: <em>better prompts produce better outputs</em>. This worked for basic chatbots but failed for complex, multi-step workflows.</li>



<li><strong>Phase 2: Context Engineering (Shapes Knowledge):</strong> As agents became more sophisticated, engineers realized the quality of context often matters more than the prompt itself. Context Engineering emerged as the practice of dynamic retrieval (RAG), vector search management, token budget optimization, and state compaction to ensure the model&#8217;s context window contains pristine, highly relevant information. A Context Engineer asks: <em>&#8220;What information should the model see?&#8221;</em></li>



<li><strong>Phase 3: Harness Engineering (Shapes Reliability):</strong> The latest realization is the most critical: even perfect context cannot solve tool execution failures, infinite loops, permission issues, planning mistakes, or missing feedback cycles. According to emerging industry definitions, <strong>&#8220;If you&#8217;re not the model, you&#8217;re the harness.&#8221;</strong> An Agent Harness is the complete execution environment and infrastructure shell surrounding an LLM. A Harness Engineer asks: <em>&#8220;What environment should the model operate within?&#8221;</em></li>
</ul>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph">Without a harness, an LLM can only generate text. With a harness, the same model can browse websites, query databases, safely execute code, plan multi-step tasks, coordinate sub-agents, persist long-term memory, and recover from real-world failures. It represents a fundamental shift from <strong>information design</strong> to <strong>system design</strong>.</p>
</blockquote>



<h2 class="wp-block-heading">2. Agent Harness vs. Context Engineering</h2>



<p class="wp-block-paragraph">Confusing these two layers is one of the most common architectural mistakes engineering teams make. They are not interchangeable; they focus on entirely different layers of the software stack, fail in distinct ways, and require unique debugging paths.</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><td><strong>Dimension</strong></td><td><strong>Context Engineering (The Brain)</strong></td><td><strong>Agent Harness Engineering (The Body)</strong></td></tr></thead><tbody><tr><td><strong>Primary Focus</strong></td><td>Knowledge, Information Flow, Relevance</td><td>Infrastructure, Runtime, Execution Reliability</td></tr><tr><td><strong>Key Responsibility</strong></td><td>Providing fresh semantic data, clean retrieval (RAG), metadata pruning, and document indexing.</td><td>Executing sandboxed code, state serialization, token rate-limiting, and error-trapping.</td></tr><tr><td><strong>Where It Operates</strong></td><td>Inside the LLM Prompt / Context Window.</td><td>Outside the LLM, hosting the application loop.</td></tr><tr><td><strong>Operational Analogy</strong></td><td><strong>The Brain:</strong> Provides knowledge, memory, and cognitive understanding.</td><td><strong>The Body:</strong> Provides tools, physical actions, constraints, and safety mechanisms.</td></tr><tr><td><strong>Risk of Silent Failures</strong></td><td><strong>High.</strong> The agent runs without errors but returns an outdated answer because of stale vector data.</td><td><strong>Low.</strong> Failures usually surface visibly (e.g., timeout exceptions, sandbox breaches, schema errors).</td></tr></tbody></table></figure>



<h2 class="wp-block-heading">3. The Anatomy of an Agent Harness</h2>



<p class="wp-block-paragraph">A production-ready harness acts as the nervous and immune system for your AI agent. It typically contains six foundational pillars:</p>



<ol start="1" class="wp-block-list">
<li><strong>Planning Layer:</strong> Responsible for task decomposition, goal tracking, progress monitoring, and dynamic replanning. When a user asks an agent to &#8220;Research competitors and prepare a report,&#8221; the planning layer breaks this down into distinct, traceable sub-tasks.</li>



<li><strong>Tool Execution Layer:</strong> Provides secure access to APIs, databases, search engines, file systems, and MCP (Model Context Protocol) servers. The model makes the cognitive decision; the harness safely executes it.</li>



<li><strong>Memory Layer:</strong> Stores short-term session state, long-term semantic memory, user preferences, and historical actions so agents avoid repeatedly solving the same problems.</li>



<li><strong>Context Management Layer:</strong> This is where Context Engineering becomes a functional component of the harness. It handles context compression, semantic retrieval, summarization, and window optimization. <em>Context Engineering is a subset of Harness Engineering.</em></li>



<li><strong>Safety and Governance Layer:</strong> Controls tool permissions, runs ephemeral sandboxed environments (Docker, WASM, E2B) to isolate code execution, enforces organizational policies, and manages human-in-the-loop approval workflows.</li>



<li><strong>Observability Layer:</strong> Tracks tool calls, agent decisions, token costs, latency, and system failures. Without this layer, debugging an autonomous agent becomes impossible.</li>
</ol>
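


<p class="wp-block-paragraph">The observability pillar is the easiest one to sketch in isolation. The decorator below records every tool call with its status and latency. It is a plain-Python illustration, not the LangSmith API: <code>traced</code>, <code>trace_log</code>, and the two example tools are hypothetical.</p>

```python
import time
from functools import wraps

trace_log: list[dict] = []


def traced(tool_name: str):
    """Wrap a tool so every call is recorded with status and latency."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            status = "ok"
            try:
                return fn(*args, **kwargs)
            except Exception:
                status = "error"
                raise
            finally:
                # Runs on both success and failure, so no call goes unrecorded.
                trace_log.append({
                    "tool": tool_name,
                    "status": status,
                    "seconds": time.perf_counter() - start,
                })
        return wrapper
    return decorator


@traced("lookup")
def lookup(key: str) -> str:
    return {"uipath": "RPA vendor"}[key]


lookup("uipath")
try:
    lookup("unknown")  # raises KeyError, which is still traced
except KeyError:
    pass

print([(entry["tool"], entry["status"]) for entry in trace_log])
```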



<h2 class="wp-block-heading">4. Why LangGraph Is a Natural Platform for Agent Harnesses</h2>



<p class="wp-block-paragraph"><strong>LangGraph</strong> was designed to solve a challenge that traditional agent frameworks struggle with: <strong>reliable, long-running, and cyclical execution.</strong></p>



<p class="wp-block-paragraph">Unlike linear chains, LangGraph introduces explicit workflow orchestration through graph structures (Nodes = LLM processing or Tool calling; Edges = Routing decisions). This makes it an ideal foundation for building an operational harness. LangGraph provides the underlying primitives, allowing you to map harness components directly onto graph mechanics:</p>



<ul class="wp-block-list">
<li><strong>Harness Planning Layer</strong> → <strong>LangGraph Nodes:</strong> Each concrete planning step or state of execution becomes a node with explicit boundaries and responsibilities.</li>



<li><strong>Harness Memory Layer</strong> → <strong>LangGraph State:</strong> LangGraph maintains a shared, typed state schema across nodes, acting as the memory backbone of the harness.</li>



<li><strong>Harness Tool Execution Layer</strong> → <strong>LangGraph Tools:</strong> Tools become strictly bound, callable capabilities controlled and monitored by the graph runtime.</li>



<li><strong>Harness Governance Layer</strong> → <strong>Conditional Edges:</strong> Complex safety and execution logic (e.g., <code>if confidence &lt; 0.8: route_to_human_review()</code>) is built structurally into the graph edges instead of relying on the LLM to follow prompt instructions.</li>



<li><strong>Harness Observability Layer</strong> → <strong>LangSmith + LangGraph:</strong> Provides native tracing of node transitions, tool performance, and failure states.</li>
</ul>
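


<p class="wp-block-paragraph">The governance mapping is the one most worth seeing in code. A routing function like the sketch below is what a conditional edge evaluates. The state fields, the threshold, and the node names <code>publish</code> and <code>human_review</code> are assumptions for illustration.</p>

```python
from typing import TypedDict


class ReviewState(TypedDict):
    draft: str
    confidence: float


CONFIDENCE_THRESHOLD = 0.8  # assumed cutoff, matching the example above


def route_after_draft(state: ReviewState) -> str:
    """Return the name of the next node based on the current state."""
    if state["confidence"] >= CONFIDENCE_THRESHOLD:
        return "publish"
    # Anything below the threshold is routed to a human, regardless of the prompt.
    return "human_review"


print(route_after_draft({"draft": "v1", "confidence": 0.55}))
print(route_after_draft({"draft": "v2", "confidence": 0.93}))
```

<p class="wp-block-paragraph">In a LangGraph graph, a function of this shape is passed to <code>add_conditional_edges</code> so the decision lives in code that can be unit-tested without calling a model.</p>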



<h2 class="wp-block-heading">5. Practical Implementation Pattern</h2>



<p class="wp-block-paragraph">If you&#8217;re using <strong>LangGraph</strong>, the easiest way to use an <strong>Agent Harness</strong> is actually through <strong>Deep Agents</strong>, which LangChain describes as a batteries-included agent harness built on top of LangGraph. Deep Agents provides planning, task delegation, context management, memory, filesystem support, and human-in-the-loop controls without requiring you to build everything yourself.</p>



<h3 class="wp-block-heading">Architecture: LangGraph + Agent Harness</h3>



<pre class="wp-block-preformatted">                    User Request<br>                           |<br>                           v<br>                 +----------------+<br>                 | Deep Agent     |<br>                 | (Harness)      |<br>                 +----------------+<br>                           |<br>       ------------------------------------------------<br>       |              |             |                |<br>       v              v             v                v<br>   Planning      Memory       Sub Agents      Human Review<br>(write_todos)   Filesystem      Task()        interrupt_on<br>       |              |             |                |<br>       ------------------------------------------------<br>                           |<br>                           v<br>                    LangGraph Runtime<br>             (State, Checkpoints, Streaming)</pre>



<p class="wp-block-paragraph">According to the LangChain documentation, the harness provides these built-in capabilities:</p>



<ul class="wp-block-list">
<li>Planning (<code>write_todos</code>)</li>



<li>Virtual filesystem</li>



<li>Context management</li>



<li>Task delegation (subagents)</li>



<li>Human-in-the-loop approvals</li>



<li>Long-term memory</li>



<li>Code execution support</li>
</ul>



<h4 class="wp-block-heading">Example 1: Create a Deep Agent Harness</h4>



<p class="wp-block-paragraph">This example comes directly from the Deep Agents approach documented by LangChain.</p>



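<pre class="wp-block-code"><code>from deepagents import create_deep_agent
from langchain_openai import ChatOpenAI

# Any LangChain chat model works here; this one needs OPENAI_API_KEY set.
model = ChatOpenAI(model="gpt-4.1")

# With no extra arguments, the agent uses the harness defaults.
agent = create_deep_agent(
    model=model
)</code></pre>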



<p class="wp-block-paragraph">At this point you already have:</p>



<ul class="wp-block-list">
<li>Planning</li>



<li>Memory</li>



<li>Context management</li>



<li>File storage</li>



<li>Task delegation</li>
</ul>



<p class="wp-block-paragraph">without manually building graph nodes.</p>



<h4 class="wp-block-heading">Example 2: Add Planning</h4>



<p class="wp-block-paragraph">One of the most important harness features is the built-in planning tool.</p>



<p class="wp-block-paragraph">When a user asks:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph">Research UiPath Agentic Automation competitors</p>
</blockquote>



<p class="wp-block-paragraph">the agent automatically creates a TODO list before execution.</p>



<pre class="wp-block-preformatted">TODO<br><br>[ ] Identify competitors<br>[ ] Gather company data<br>[ ] Analyze strengths<br>[ ] Generate report</pre>



<p class="wp-block-paragraph">The Deep Agents harness uses the <code>write_todos</code> tool to maintain structured plans. This helps long-running tasks remain organized and auditable.</p>
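


<p class="wp-block-paragraph">The idea of a structured plan can be sketched with a small data structure. This is not the schema of the real <code>write_todos</code> tool: the <code>Todo</code> and <code>Plan</code> classes and the status values are assumptions for illustration.</p>

```python
from dataclasses import dataclass, field


@dataclass
class Todo:
    content: str
    status: str = "pending"  # assumed values: "pending" or "completed"


@dataclass
class Plan:
    todos: list[Todo] = field(default_factory=list)

    def write_todos(self, items: list[str]) -> None:
        """Replace the current plan with a fresh list of pending items."""
        self.todos = [Todo(content=item) for item in items]

    def complete(self, content: str) -> None:
        """Mark the matching item as completed."""
        for todo in self.todos:
            if todo.content == content:
                todo.status = "completed"
                return
        raise KeyError(content)

    def render(self) -> str:
        """Render the plan as a checklist, one item per line."""
        lines = []
        for todo in self.todos:
            mark = "x" if todo.status == "completed" else " "
            lines.append(f"[{mark}] {todo.content}")
        return "\n".join(lines)


plan = Plan()
plan.write_todos([
    "Identify competitors",
    "Gather company data",
    "Analyze strengths",
    "Generate report",
])
plan.complete("Identify competitors")
rendered = plan.render()
print(rendered)
```

<p class="wp-block-paragraph">Because the plan is data rather than free text, each status change can be logged and audited.</p>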



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h4 class="wp-block-heading">Example 3: Add Specialized Subagents</h4>



<p class="wp-block-paragraph">LangChain recommends using subagents to avoid context-window bloat.</p>



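<pre class="wp-block-code"><code>from deepagents import create_deep_agent

# `model` is the chat model created in Example 1.
# The system_prompt strings are illustrative; older deepagents releases
# named this key "prompt".
agent = create_deep_agent(
    model=model,
    subagents=&#91;
        {
            "name": "researcher",
            "description": "Web research specialist",
            "system_prompt": "You research topics on the web and report sourced findings."
        },
        {
            "name": "analyst",
            "description": "Data analysis specialist",
            "system_prompt": "You analyze the data you are given and summarize the key results."
        }
    ]
)</code></pre>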



<p class="wp-block-paragraph">Each subagent gets its own isolated context window and returns only the final results to the supervisor.</p>
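


<p class="wp-block-paragraph">Context isolation is easy to picture with a toy example. In the sketch below, each subagent keeps its intermediate notes in a private list and hands back a single summary line. <code>run_subagent</code> and the message formats are hypothetical and do not reflect how Deep Agents implements delegation.</p>

```python
def run_subagent(name: str, task: str) -> str:
    """Hypothetical subagent: works in a private scratch context, returns a summary."""
    scratch = [f"{name} received: {task}"]
    for step in range(1, 4):
        scratch.append(f"{name} intermediate note {step}")
    # Only this one line leaves the subagent; the scratch list is discarded.
    return f"{name} summary for '{task}' ({len(scratch)} notes kept private)"


supervisor_context: list[str] = ["user: compare RPA vendors"]
assignments = [("researcher", "collect vendor facts"), ("analyst", "rank vendors")]

for name, task in assignments:
    supervisor_context.append(run_subagent(name, task))

for message in supervisor_context:
    print(message)
```

<p class="wp-block-paragraph">The supervisor ends up with three messages instead of ten, which is the context-window saving the subagent pattern is after.</p>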



<h4 class="wp-block-heading">Example 4: Human-in-the-Loop Approval</h4>



<p class="wp-block-paragraph">For enterprise applications you often want approval before actions occur.</p>



<pre class="wp-block-code"><code>agent = create_deep_agent(
    model=model,
    interrupt_on={
        "send_email": True,
        "delete_file": True
    }
)</code></pre>



<pre class="wp-block-preformatted">Agent decides:<br>   Delete file?<br><br>        |<br>        v<br><br>Pause Execution<br>        |<br>        v<br><br>Human Approves<br>        |<br>        v<br><br>Continue</pre>



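<p class="wp-block-paragraph">LangChain calls this &#8220;Human-in-the-Loop&#8221; execution and recommends it for sensitive operations. The keys of <code>interrupt_on</code> must match the names of tools available to the agent, and pausing requires a checkpointer so the interrupted state can be persisted and resumed.</p>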
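


<p class="wp-block-paragraph">The gating logic behind a setting like <code>interrupt_on</code> can be sketched in a few lines. This is an illustration of the idea and not the Deep Agents implementation: <code>call_tool</code>, <code>ApprovalRequired</code>, and the two example tools are hypothetical.</p>

```python
from typing import Callable


class ApprovalRequired(Exception):
    """Raised when a gated tool is called without human approval."""

    def __init__(self, tool_name: str, tool_args: dict) -> None:
        super().__init__(f"approval required for {tool_name}")
        self.tool_name = tool_name
        self.tool_args = tool_args


def call_tool(
    tools: dict[str, Callable[..., str]],
    interrupt_on: dict[str, bool],
    name: str,
    tool_args: dict,
    approved: bool = False,
) -> str:
    """Run a tool, but stop first if it is gated and not yet approved."""
    if interrupt_on.get(name, False) and not approved:
        raise ApprovalRequired(name, tool_args)
    return tools[name](**tool_args)


tools = {
    "read_file": lambda path: f"contents of {path}",
    "delete_file": lambda path: f"deleted {path}",
}
interrupt_on = {"delete_file": True}

# Ungated tools run immediately.
read_result = call_tool(tools, interrupt_on, "read_file", {"path": "a.txt"})

# Gated tools pause until a human approves, then the same call is retried.
try:
    call_tool(tools, interrupt_on, "delete_file", {"path": "a.txt"})
    pending = None
except ApprovalRequired as request:
    pending = request
delete_result = call_tool(tools, interrupt_on, pending.tool_name, pending.tool_args, approved=True)

print(read_result, "|", delete_result)
```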



<h4 class="wp-block-heading">Real-World UiPath Research Agent Example</h4>



<p class="wp-block-paragraph">For a UiPath blog-generation use case, a harness could look like this:</p>



<pre class="wp-block-preformatted">User:<br>Generate UiPath Agentic Automation Blog<br>           |<br>           v<br>Planner Agent<br>           |<br>           v<br>Research Agent<br>(Gather UiPath docs)<br>           |<br>           v<br>Competitor Agent<br>(Copilot Studio, CrewAI, LangGraph)<br>           |<br>           v<br>Fact Check Agent<br>           |<br>           v<br>Content Writer Agent<br>           |<br>           v<br>Human Approval<br>           |<br>           v<br>Publish</pre>



<p class="wp-block-paragraph">This is a textbook Agent Harness design because it combines:</p>



<ul class="wp-block-list">
<li>Planning</li>



<li>Multiple specialized agents</li>



<li>Context isolation</li>



<li>Memory</li>



<li>Human review</li>



<li>Workflow orchestration</li>
</ul>



<p class="wp-block-paragraph">all running on LangGraph.</p>



<h2 class="wp-block-heading">6. Enterprise Benefits of Agent Harnesses</h2>



<p class="wp-block-paragraph">Organizations that move toward a harness-centric architecture gain concrete advantages over teams relying on prompts alone:</p>



<ul class="wp-block-list">
<li><strong>Reliability:</strong> Deterministic, graph-driven state machines ensure agents follow strict corporate workflows and don&#8217;t deviate into unmapped logic loops.</li>



<li><strong>Governance:</strong> Human approvals, data policy enforcement, and permission structures become hardcoded security boundaries instead of fragile prompt instructions.</li>



<li><strong>Reusability &amp; Vendor Independence:</strong> The harness abstracts your core business logic away from the model providers. If a faster, cheaper LLM is released tomorrow, you swap the model inside the node—the entire harness layer remains completely untouched.</li>



<li><strong>Debuggability:</strong> When failures happen, they can be traced to specific software components, input streams, or isolated nodes instead of being inferred from an enigmatic prompt output.</li>
</ul>
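


<p class="wp-block-paragraph">The vendor-independence point can be sketched with a small interface. The node below depends only on a <code>complete</code> method, so either model class can be passed in without changing the node. The <code>ChatModel</code> protocol, both vendor classes, and <code>summarize_node</code> are hypothetical stand-ins, not real provider SDKs.</p>

```python
from typing import Protocol


class ChatModel(Protocol):
    def complete(self, prompt: str) -> str: ...


class VendorAModel:
    """Stand-in for one provider's model."""

    def complete(self, prompt: str) -> str:
        return f"[vendor-a] {prompt}"


class VendorBModel:
    """Stand-in for a cheaper or faster replacement."""

    def complete(self, prompt: str) -> str:
        return f"[vendor-b] {prompt}"


def summarize_node(model: ChatModel, text: str) -> dict:
    """Harness-side node: its logic does not change when the model does."""
    reply = model.complete(f"Summarize: {text}")
    return {"summary": reply, "chars_in": len(text)}


before = summarize_node(VendorAModel(), "quarterly numbers")
after = summarize_node(VendorBModel(), "quarterly numbers")
print(before["summary"])
print(after["summary"])
```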



<h2 class="wp-block-heading">Conclusion: The Operating System of AI</h2>



<p class="wp-block-paragraph">The AI industry is moving rapidly beyond prompt engineering. The next competitive advantage will not come solely from adopting slightly smarter models, but from building vastly superior harnesses around them.</p>



<p class="wp-block-paragraph">In the same way that operating systems made abstract computer hardware useful to consumers, Agent Harnesses are becoming the operating systems of autonomous AI agents. For teams building production applications with LangGraph, mastering Harness Engineering is no longer optional—it is the baseline requirement for operational success.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://rpabotsworld.com/agent-harness-vs-context-engineering-the-next-evolution-of-ai-agent-architecture-with-langgraph/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<media:thumbnail url="https://rpabotsworld.com/wp-content/uploads/2023/05/robot-and-machine-learning-MGNDRMG.jpg" />	</item>
	</channel>
</rss>
