RPABOTS.WORLD

250 LangGraph Interview Questions & Answers (2026)

Satish Prasad — Sun, 05 Jul 2026 04:27:14 +0000

If you’ve prepared for a UiPath automation interview using our 400-question guide, this is the LangGraph counterpart for the other side of the agentic AI stack — the pro-code, Python-first framework that shows up in interviews for AI engineer, automation architect, and agent-platform roles alike.

Who this is for: developers moving from RPA or traditional backend work into agent frameworks, AI engineers who’ve used LangChain but not LangGraph specifically, and interviewers building a technical screen for either role.

How to use this guide: questions are grouped into 10 sections and arranged in increasing difficulty within each section — start at the top of a section if you’re new to that topic, skip to the later questions if you already know the basics. Every answer is grounded in LangGraph’s official documentation (linked inline), and every code snippet reflects the current Graph API and Functional API surface as of mid-2026. Where LangGraph’s API has changed recently (the v3 event-streaming API, the Command primitive, semantic search in BaseStore), the answer says so explicitly rather than presenting it as settled trivia.

Section 1: LangGraph Fundamentals & Core Concepts (Q1–25)

Start here if you haven’t built a graph before. These questions cover why LangGraph exists, the Pregel execution model underneath it, and the vocabulary (nodes, edges, state) every later section assumes you already know.

Q1. What is LangGraph, in one sentence? LangGraph is a low-level orchestration framework for building stateful, controllable agents and workflows as graphs of nodes and edges, where each node is a unit of computation and the graph’s state is threaded through and updated as execution proceeds. Docs: LangGraph Overview

Q2. How is LangGraph different from plain LangChain? LangChain provides the building blocks — chat models, prompts, tool abstractions, retrievers. LangGraph provides the orchestration layer on top: explicit state, branching and looping control flow, checkpointing, and human-in-the-loop primitives that a linear LangChain chain doesn’t have. You typically use LangChain components inside LangGraph nodes.

Q3. Why would you choose a graph over a simple chain of prompts? A chain executes in one fixed direction. The moment your logic needs to loop (an agent retrying a tool call), branch (route to different nodes based on model output), or pause for a human decision, a linear chain can’t represent that naturally — you end up hand-rolling control flow around it. A graph makes looping, branching, and pausing first-class.

Q4. What execution model does LangGraph use internally? LangGraph runs on Pregel, a bulk-synchronous-parallel graph processing model (originally from Google’s large-scale graph processing paper). Execution proceeds in discrete “super-steps”; all nodes scheduled to run in a given step execute (conceptually in parallel), their writes are applied, and then the next step’s nodes are determined. Docs: Runtime / Pregel
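
To make the super-step idea concrete, here is a toy, plain-Python sketch of a bulk-synchronous loop. It is not LangGraph's runtime: the names run_supersteps, nodes, and edges are hypothetical, every key uses simple overwrite, and max_steps only loosely mirrors a step limit. It shows that all nodes in a step read the same snapshot and that their writes become visible only once the step is over.

```python
from typing import Callable

Node = Callable[[dict], dict]

def run_supersteps(nodes: dict[str, Node], edges: dict[str, list[str]],
                   entry: list[str], state: dict, max_steps: int = 25):
    """Toy bulk-synchronous loop: run a step's nodes, apply writes, pick the next step."""
    scheduled = list(entry)
    trace = []  # which nodes ran in each super-step
    for _ in range(max_steps):
        if not scheduled:
            break
        trace.append(scheduled)
        snapshot = dict(state)  # every node in this step reads the same snapshot
        writes = [nodes[name](snapshot) for name in scheduled]
        for update in writes:   # writes are applied only after the whole step ran
            state = {**state, **update}
        next_nodes = []         # nodes reachable from the ones that just ran
        for name in scheduled:
            for target in edges.get(name, []):
                if target not in next_nodes:
                    next_nodes.append(target)
        scheduled = next_nodes
    return state, trace

nodes = {
    "clean": lambda s: {"text": s["text"].strip()},
    "count_words": lambda s: {"words": len(s["text"].split())},
    "count_chars": lambda s: {"chars": len(s["text"])},
    "report": lambda s: {"summary": f"{s['words']} words, {s['chars']} chars"},
}
edges = {
    "clean": ["count_words", "count_chars"],
    "count_words": ["report"],
    "count_chars": ["report"],
}
final_state, trace = run_supersteps(nodes, edges, ["clean"], {"text": "  hello graph world "})
```

Here trace records three super-steps: clean alone, then count_words and count_chars together, then report.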

Q5. What are the three things you define to build a graph? A state schema (what data flows through the graph), one or more nodes (functions that read state and return updates), and edges (which connect nodes and determine execution order, either fixed or conditional).

Q6. What’s the minimal code to construct and compile a graph?

from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class State(TypedDict):
    topic: str
    joke: str

def generate_joke(state: State):
    return {"joke": f"A joke about {state['topic']}"}

graph = (
    StateGraph(State)
    .add_node(generate_joke)
    .add_edge(START, "generate_joke")
    .add_edge("generate_joke", END)
    .compile()
)

compile() validates the graph structure and returns a runnable object with .invoke(), .stream(), and related methods. Docs: Graph API Overview

Q7. What are START and END? They’re special, reserved node names marking the graph’s entry and exit points. Every graph needs at least one edge from START to a real node, and every path through the graph should eventually reach END (or return a Command that routes there); a cycle with no exit condition keeps running until it hits the recursion limit (Q63) and fails with an error.

Q8. Can a node be an async function? Yes. LangGraph supports both sync and async node functions, and you invoke the corresponding sync or async method (.invoke()/.ainvoke(), .stream()/.astream()) to match. Mixing sync nodes into an async run generally works, but async nodes require the async invocation path.

Q9. What does add_node() actually register? It registers a callable (or Runnable) under a name in the graph, along with optional configuration like a retry_policy or cache_policy. The name defaults to the function’s __name__ unless you pass one explicitly — worth knowing because that name is what shows up in updates stream output and in metadata like langgraph_node.

Q10. What’s the difference between add_edge() and add_conditional_edges()? add_edge() creates a fixed, unconditional transition from one node to another (or to END). add_conditional_edges() takes a routing function that inspects the current state and returns the name (or names) of the node(s) to run next — this is how you implement branching. Docs: add_conditional_edges reference

Q11. Can multiple nodes run in the same super-step? Yes — if a node has edges fanning out to two or more nodes with no dependency between them, LangGraph runs them concurrently within the same step. This is the basis of parallel fan-out patterns like calling three tools at once and merging their results with a reducer.

Q12. What is “thinking in LangGraph” as opposed to thinking in a plain script? It means modeling your application as explicit state transitions rather than as an imperative sequence of function calls. Instead of asking “what function do I call next,” you ask “what does the state look like after this node runs, and which node(s) should see that state next.” This reframing is what makes persistence, replay, and human-in-the-loop possible without extra plumbing. Docs: Thinking in LangGraph

Q13. What happens if two nodes in the same super-step both write to the same state key without a reducer? Without a reducer defined for that key, LangGraph raises an InvalidUpdateError — concurrent, non-reducer-guarded writes to the same key are ambiguous and the framework refuses to silently pick a winner. Defining a reducer (see Section 2) is exactly how you tell LangGraph how to merge such writes.
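
The rule can be sketched in plain Python. This is a simplified illustration, not LangGraph's code: apply_step_writes and ConflictingWriteError are hypothetical names (the latter stands in for InvalidUpdateError), and the sketch assumes every reduced key already has an initial value in state.

```python
import operator
from typing import Any, Callable

class ConflictingWriteError(Exception):
    """Hypothetical stand-in for LangGraph's InvalidUpdateError."""

def apply_step_writes(state: dict[str, Any], writes: list[dict[str, Any]],
                      reducers: dict[str, Callable[[Any, Any], Any]]) -> dict[str, Any]:
    """Apply every write from one super-step, refusing ambiguous un-reduced writes."""
    merged = dict(state)
    written: set[str] = set()
    for update in writes:
        for key, value in update.items():
            if key in reducers:
                # A reducer says how to combine the old and new values.
                merged[key] = reducers[key](merged[key], value)
            elif key in written:
                # Second plain write to the same key in one step: no safe winner.
                raise ConflictingWriteError(f"multiple writes to {key!r} without a reducer")
            else:
                merged[key] = value
            written.add(key)
    return merged

state = {"results": [], "status": "new"}

# Two parallel branches append to `results`; the reducer makes this safe.
ok = apply_step_writes(state, [{"results": ["a"]}, {"results": ["b"]}],
                       {"results": operator.add})

# Two parallel branches overwrite `status` with no reducer: rejected.
try:
    apply_step_writes(state, [{"status": "x"}, {"status": "y"}], {})
    conflict_raised = False
except ConflictingWriteError:
    conflict_raised = True
```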

Q14. What’s a “graph” versus a “subgraph” in LangGraph terms? A subgraph is just a compiled graph used as a node inside another (parent) graph. From the parent’s perspective it’s a single node; internally it runs its own multi-step execution, and its state can be fully separate from or partially overlapping with the parent’s state schema. Docs: Use subgraphs

Q15. When would you use a subgraph instead of just another node? When a chunk of logic is reusable across multiple parent graphs, when you want to encapsulate a multi-step process (like a research sub-workflow) behind a single interface, or when you’re building a multi-agent system where each agent is itself a small graph.

Q16. What is the “Graph API” versus the “Functional API”? The Graph API is the explicit StateGraph / nodes / edges model covered in this section. The Functional API (@entrypoint and @task decorators, covered in Section 4) lets you write workflows as regular Python functions with loops and conditionals, while still getting persistence, streaming, and human-in-the-loop for free. They’re two front ends over the same runtime. Docs: Functional API overview

Q17. Does LangGraph require you to use LangChain chat models? No. Nodes are plain Python functions — you can call any LLM client (OpenAI’s SDK directly, a self-hosted model, etc.) from inside a node. Using LangChain’s init_chat_model or a specific chat model integration gets you provider-agnostic streaming and tool-calling conventions for free, but it isn’t a hard requirement.

Q18. What’s the difference between LangGraph (Python) and LangGraph.js? They’re parallel implementations of the same core concepts — StateGraph, Pregel execution, checkpointers, Command, streaming — for Python and TypeScript/JavaScript respectively. API shapes are close but not identical (for example, Send and Command are exposed as classes in both, but idiomatic usage differs slightly per language). Pick based on your application’s runtime, not a capability gap.

Q19. Is LangGraph tied to LangSmith? No — LangGraph is open source and runs standalone. LangSmith is LangChain’s observability and deployment product; it adds tracing, evaluation, and (via LangSmith Deployment, formerly “LangGraph Platform”) managed hosting for graphs, but none of that is required to build or run a graph.

Q20. What is a “Send” object used for at a conceptual level, before the mechanics? It lets a conditional edge dispatch a variable number of parallel tasks to the same node, each with its own slice of input — the classic case being map-reduce, where you don’t know ahead of time how many items you’re mapping over. Full mechanics are in Section 3.

Q21. What’s the practical difference between a “workflow” and an “agent” in LangGraph’s own vocabulary? LangGraph’s docs distinguish workflows (predefined code paths — you decide the sequence of steps up front) from agents (the LLM dynamically decides its own steps, typically by choosing which tools to call and when to stop). Most production systems are a hybrid: an outer workflow with an agent as one or more of its nodes. Docs: Workflows and agents

Q22. Why does LangGraph favor explicit graphs over letting an LLM “just figure out” control flow entirely on its own? Full LLM-driven control flow (the LLM decides literally everything, unconstrained) is flexible but unpredictable and hard to debug, test, or put reliability guarantees around. An explicit graph lets you fix the parts of your logic that should be deterministic (validation, routing rules, approval gates) while still giving the LLM freedom where it adds value (reasoning, tool selection within a bounded node).

Q23. What does “low-level” mean in LangGraph’s own description of itself? It means LangGraph doesn’t prescribe a specific agent architecture or hide the state machine from you — you compose your own graph shape rather than configuring a fixed template. Prebuilt agents (Section 6) exist on top of this low-level core for common patterns, but you’re never forced to use them.

Q24. Can a compiled graph be visualized? Yes. Calling get_graph() on a compiled graph returns a drawable representation with methods such as draw_mermaid() (Mermaid source) and draw_mermaid_png() (a rendered PNG), which is useful in interviews and code review alike for showing the control flow you actually built rather than just describing it verbally.

Q25. What’s a common first mistake developers make when moving from a chain to a graph? Treating every node as if it must be a single LLM call, when a node is just “a function that takes state and returns a partial update” — plain deterministic Python (validation, formatting, an API call with no LLM involved) belongs in nodes just as much as model calls do. Overloading single nodes with multiple responsibilities is the second most common mistake, and it makes both debugging and reducer design harder.

Section 2: State Management & Reducers (Q26–50)

State is the one concept everything else in LangGraph builds on. This section covers how to define it, how updates merge into it, and the built-in patterns (MessagesState, add_messages) you’ll see in almost every real graph.

Q26. What can a graph’s state schema be defined as? A TypedDict, a Pydantic BaseModel, or a Python dataclass. All three work as the type argument to StateGraph(...); which one you pick affects validation behavior (Pydantic validates on construction) and whether you get attribute or dict-style access inside nodes.

Q27. What does a node return, and how does that relate to the full state? A node returns a partial update — a dict containing only the keys it’s changing, not the entire state object. LangGraph merges that partial update into the full state using each key’s reducer (or, absent a reducer, by overwriting the key’s previous value).

Q28. What is a reducer, precisely? A function attached to a state key (via Annotated[Type, reducer_fn]) that defines how a new value from a node’s update combines with the key’s existing value. Without one, LangGraph’s default behavior is simple overwrite — the new value replaces the old.

Q29. Show the canonical example of a reducer.

import operator
from typing import Annotated, TypedDict

class State(TypedDict):
    # each node's update to `items` is appended, not overwritten
    items: Annotated[list[str], operator.add]

Every node that returns {"items": [...]} has its list concatenated onto the existing one via operator.add, rather than replacing it outright.

Q30. What is add_messages and why does almost every chat-oriented graph use it? add_messages is a built-in reducer for lists of chat messages. It appends new messages by default, but if an incoming message has the same id as an existing one, it replaces that message in place instead of duplicating it — which is exactly the semantics you want for streaming partial updates to an existing AI message rather than accumulating duplicates. Docs: Graph API — messages
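
The append-or-replace behavior can be illustrated with a simplified stand-in. The function below is not the real add_messages: it uses plain dicts instead of message objects, assumes every message already carries an id, and ignores features such as message removal. The name merge_messages_by_id is hypothetical.

```python
def merge_messages_by_id(existing: list[dict], new: list[dict]) -> list[dict]:
    """Append new messages, or replace in place when the id already exists."""
    merged = list(existing)
    index_by_id = {message["id"]: i for i, message in enumerate(merged)}
    for message in new:
        if message["id"] in index_by_id:
            # Same id: overwrite the earlier message at its original position.
            merged[index_by_id[message["id"]]] = message
        else:
            index_by_id[message["id"]] = len(merged)
            merged.append(message)
    return merged

history = [
    {"id": "1", "role": "user", "content": "hi"},
    {"id": "2", "role": "ai", "content": "Hel"},
]
updated = merge_messages_by_id(history, [
    {"id": "2", "role": "ai", "content": "Hello!"},      # replaces message 2
    {"id": "3", "role": "user", "content": "thanks"},    # appended
])
```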

Q31. What is MessagesState? A prebuilt TypedDict state schema with a single messages field already annotated with add_messages. It’s a convenience so you don’t redeclare the same chat-history pattern in every graph:

from langgraph.graph import MessagesState

class State(MessagesState):
    extra_field: str  # extend it with your own keys

Q32. Can you write a custom reducer instead of using operator.add? Yes — a reducer is just any two-argument function (existing, new) -> merged. A common custom reducer deduplicates a list, keeps only the last N items, or merges two dicts key-by-key rather than replacing one with the other wholesale.
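
As a sketch of two of those custom reducers: the helper names dedupe_keep_order and keep_last_n and the state fields are illustrative assumptions, and keep_last_n assumes n is at least 1.

```python
from typing import Annotated, Callable, TypedDict

def dedupe_keep_order(existing: list[str], new: list[str]) -> list[str]:
    """Concatenate, dropping repeats while preserving first-seen order."""
    seen: set[str] = set()
    merged: list[str] = []
    for item in existing + new:
        if item not in seen:
            seen.add(item)
            merged.append(item)
    return merged

def keep_last_n(n: int) -> Callable[[list, list], list]:
    """Build a reducer that keeps only the most recent n items (n >= 1)."""
    def reducer(existing: list, new: list) -> list:
        return (existing + new)[-n:]
    return reducer

class State(TypedDict):
    sources: Annotated[list[str], dedupe_keep_order]
    recent_events: Annotated[list[str], keep_last_n(3)]

deduped = dedupe_keep_order(["a", "b"], ["b", "c"])
trimmed = keep_last_n(3)(["1", "2"], ["3", "4"])
```

Because each reducer is an ordinary function, it can be exercised directly like this before it is ever attached to a graph.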

Q33. If your state is a Pydantic model, does validation run on every node’s partial update? Validation behavior depends on the LangGraph version and streaming mode in use; broadly, the framework coerces/validates the full merged state at defined points rather than validating each node’s raw partial return in isolation, since a partial update alone would fail required-field validation on its own. Check current docs for your exact version before relying on this for input sanitization.

Q34. What’s the difference between “private” node-local state and the graph’s shared state? Shared state (declared on StateGraph) is visible to every node and persisted at checkpoints. If a node needs scratch variables that shouldn’t be part of that shared, persisted schema, it just uses regular local Python variables inside the function — those never touch the graph’s state object and disappear when the node returns.

Q35. How do you give two nodes access to different slices of state (an input schema versus output schema)? StateGraph supports separate input and output schemas in addition to the internal state schema, so a node can be typed to only see (and be validated against) the subset of fields relevant to it, while the internal schema carries everything the graph needs end-to-end.

Q36. What happens to state keys a node doesn’t mention in its return value? They’re left untouched — a node’s return is a partial update, so any key not present in the returned dict simply keeps its prior value going into the next step.

Q37. Why can a dataclass be a better fit than a TypedDict for some graph states? A dataclass gives you attribute access (state.topic) instead of dict access (state["topic"]), default values via field(default=...), and it’s often more ergonomic to extend and unit test in isolation, at the cost of the union-style flexibility TypedDicts offer for partial/optional keys.

Q38. Give an example of a reducer that merges dictionaries instead of overwriting them.

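from typing import Annotated, TypedDict

def merge_dicts(existing: dict, new: dict) -> dict:
    # keys in `new` win on conflict; keys only in `existing` are preserved
    return {**existing, **new}

class State(TypedDict):
    metadata: Annotated[dict, merge_dicts]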

Without this, a second node’s update to metadata would silently wipe out keys the first node had already set.

Q39. What’s the risk of using operator.add as a reducer on a list state key that multiple parallel branches write to? None inherently — that’s exactly the pattern parallel fan-out relies on, since each branch’s contribution gets appended rather than racing to overwrite. The risk shows up if you also need deterministic ordering of those appended items across runs; Pregel’s super-step model guarantees all writes in a step are applied, but not necessarily in a specific append order across concurrent branches unless you sort downstream.

Q40. Can state include non-serializable objects (like an open DB connection)? You can put arbitrary Python objects into state, but if you’re using a checkpointer for persistence, every value in state needs to survive whatever serialization the checkpointer uses (the first-party checkpointers default to a msgpack/JSON-based serializer, JsonPlusSerializer, with pickle available only as an opt-in fallback). Open connections, threads, and similar objects generally shouldn’t live in checkpointed state; pass them via config/context instead.

Q41. What’s the purpose of Annotated in Annotated[list[str], operator.add]? Annotated is standard Python typing machinery for attaching metadata to a type without changing the type itself. LangGraph specifically looks for a reducer function in that metadata slot; the type checker sees list[str], and LangGraph additionally sees “use operator.add to merge updates to this field.”
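
The metadata slot is readable with the standard typing helpers. The sketch below shows how any library could discover a reducer from an Annotated field; it is an illustration of the mechanism, not LangGraph's actual lookup code, and find_reducers is a hypothetical helper.

```python
import operator
from typing import Annotated, TypedDict, get_args, get_origin, get_type_hints

class State(TypedDict):
    topic: str
    items: Annotated[list[str], operator.add]

def find_reducers(schema: type) -> dict:
    """Return {field name: first callable found in that field's Annotated metadata}."""
    hints = get_type_hints(schema, include_extras=True)  # keep Annotated wrappers
    reducers = {}
    for key, hint in hints.items():
        if get_origin(hint) is Annotated:
            _base_type, *metadata = get_args(hint)
            for meta in metadata:
                if callable(meta):
                    reducers[key] = meta
                    break
    return reducers

reducers = find_reducers(State)
```

In this example only items carries a reducer; topic has no Annotated metadata and would fall back to overwrite.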

Q42. How would you model a counter that increments across nodes?

import operator
from typing import Annotated, TypedDict

class State(TypedDict):
    step_count: Annotated[int, operator.add]

def some_node(state: State):
    return {"step_count": 1}  # each call adds 1, doesn't set an absolute value

This is the same pattern as the list case — operator.add works on any type that supports +, not just lists.

Q43. What’s the difference between updating state via a node’s return value and updating it via graph.update_state()? A node’s return value is applied as part of normal graph execution, going through the same reducer logic as any other step. graph.update_state() is called outside normal execution (typically for human-in-the-loop editing or time travel) — it still goes through reducers, but it creates a new checkpoint directly rather than as the result of a node running.

Q44. Why might you deliberately avoid putting large blobs (like full document text) directly in graph state? Every checkpoint serializes the entire state; large objects bloat checkpoint storage and slow down persistence on every single step, even steps that don’t touch that field. A common pattern is storing a reference (file path, object-store key, or document ID) in state and fetching the actual content on demand inside the node that needs it.

Q45. Is state scoped per-thread or global to the whole application? Per-thread. A thread_id (passed via config={"configurable": {"thread_id": ...}}) scopes a distinct conversation or run’s checkpoint history. Two different thread_ids never see each other’s state through the checkpointer — cross-thread data has to go through the separate Store API (Section 7).

Q46. What’s a practical reason to define an explicit output schema separate from your internal state? It lets you hide internal bookkeeping fields (retry counters, intermediate scratch results, raw tool outputs) from whatever’s consuming the graph’s final output — the caller only sees the fields you’ve declared as part of the output schema, keeping your public interface stable even as internal implementation details change.

Q47. Can a state field’s type differ from what a reducer is annotated for, e.g. can you reduce a set instead of a list? Yes — reducers aren’t restricted to lists. A set-typed field with a custom lambda existing, new: existing | new reducer merges via set union; the mechanism is the same regardless of the underlying collection type, as long as your reducer function’s signature matches (existing, new) -> merged.

Q48. What happens if a node raises an exception mid-execution — does partial state from that node get applied? No — a node’s writes are only committed to the checkpoint once the node returns successfully. If it raises, none of its partial update is applied for that attempt; whether the step retries depends on whether a RetryPolicy is attached (Section 4).

Q49. How would you test a single node in isolation without running the whole graph? Since a node is just a plain function taking a state dict/object and returning a partial update, you call it directly with a hand-constructed state fixture and assert on the returned update — no graph, checkpointer, or compilation needed for that level of unit test. Save integration-level assertions (does the graph behave correctly end-to-end) for a separate test tier.
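
A minimal sketch of such a unit test, reusing the generate_joke node from Q6. The test function name is illustrative, and the same body would work unchanged under pytest.

```python
from typing import TypedDict

class State(TypedDict):
    topic: str
    joke: str

def generate_joke(state: State) -> dict:
    return {"joke": f"A joke about {state['topic']}"}

def test_generate_joke_returns_partial_update() -> None:
    # Hand-built fixture: no graph, checkpointer, or compile step involved.
    state: State = {"topic": "ice cream", "joke": ""}
    update = generate_joke(state)
    # The node returns only the key it changes...
    assert update == {"joke": "A joke about ice cream"}
    # ...and leaves the input state untouched.
    assert state == {"topic": "ice cream", "joke": ""}

test_generate_joke_returns_partial_update()
```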

Q50. What’s the tradeoff of putting too much logic into reducers versus into nodes? Reducers should stay small and mechanical — merge these two values this way. Once a reducer starts making decisions that feel like business logic (should this cancel that, should this override under some condition), it becomes hard to trace where behavior actually lives, since reducers run implicitly on every write rather than being called explicitly like a node. If a merge rule needs real judgment, that’s usually a sign that logic belongs in an explicit node instead.

Section 3: Control Flow — Conditional Edges, Command & Send (Q51–75)

This is the section that separates “I can build a linear pipeline” from “I can build an agent that actually branches, loops, and fans out work.” Expect at least a third of a real LangGraph interview to live here.

Q51. What does a conditional edge’s routing function receive and return? It receives the current state (after the preceding node has run) and returns either a single node name (as a string) or a list of node names to route to next. LangGraph then schedules exactly those nodes for the following super-step.

Q52. Write a minimal conditional edge that routes based on a boolean state field.

def route(state: State) -> str:
    return "human_review" if state["needs_approval"] else "finalize"

# `builder` is the StateGraph before .compile(); edges are added to the builder
builder.add_conditional_edges("check", route, {"human_review": "human_review", "finalize": "finalize"})

The third argument (a mapping) is optional in many cases but makes the possible destinations explicit for graph visualization and validation.

Q53. What is the Command primitive and what four things can it carry? Command is a return type (or input) that combines a state update with control-flow instructions in one object. It can carry: update (a partial state update, same as a normal node return), goto (which node(s) to run next, like a conditional edge), graph (target the parent graph when returning from inside a subgraph), and resume (a value used as input to continue execution after an interrupt()). Docs: Graph API — Command

Q54. When would you use Command(goto=...) instead of add_conditional_edges()? When the routing decision and the state update naturally belong together in the same node: for example, a node that both writes a result and decides where to go next based on that result, without needing a separate routing function to re-derive the decision from state afterward. It collapses “update state” and “route” into a single return statement.

Q55. Show a node that both updates state and routes using Command.

from langgraph.types import Command
from typing import Literal

def classify(state: State) -> Command[Literal["urgent_path", "normal_path"]]:
    label = "urgent_path" if "asap" in state["text"].lower() else "normal_path"
    return Command(update={"label": label}, goto=label)

Typing the return as Command[Literal[...]] also lets LangGraph’s graph-drawing tooling render the possible destinations correctly.

Q56. What does Command(graph=Command.PARENT) do? It lets a node inside a subgraph route control back up to the parent graph rather than staying within the subgraph’s own nodes — necessary when a subgraph needs to hand control to a sibling node in the parent, not just to another node in its own scope.

Q57. What is the Send object, and what problem does it specifically solve? Send(node_name, state_for_that_invocation) lets a conditional edge dispatch a dynamic, runtime-determined number of parallel invocations of a node, each with its own custom input — solving map-reduce-style fan-out where you don’t know the count ahead of time (e.g., “run this node once per item in a list whose length depends on a previous step’s output”).

Q58. Show a Send-based map-reduce fan-out.

from langgraph.types import Send

def continue_to_map(state: State):
    return [Send("process_item", {"item": item}) for item in state["items"]]

# `builder` is the StateGraph before .compile()
builder.add_conditional_edges("split", continue_to_map)

Each Send triggers a separate execution of process_item with its own isolated {"item": item} input; their results are gathered back into the parent state via whatever reducer the receiving key uses.

Q59. How does the state passed via Send relate to the graph’s overall state schema? It can be a different shape entirely from the main graph state. Send’s second argument is whatever input the target node expects, which is commonly a narrower dict than the full graph state, precisely because each parallel invocation only needs its own slice of data.

Q60. What’s the difference between a conditional edge returning a list of node names versus using Send? Returning a plain list of node names fans out to a fixed, known set of specific nodes with the same graph state going into each. Send fans out to potentially many invocations of the same node, each with different, per-invocation state — the dynamic-count, per-item-input case a plain list can’t express.

Q61. Can a routing function raise an exception instead of returning a valid destination? It can, and if it does the run fails at that step just like any other unhandled exception in node execution — there’s nothing special protecting routing functions from normal Python error semantics, so validate whatever state field you’re branching on before trusting it blindly.

Q62. How do you implement a loop (e.g., “keep calling this tool-using node until the model stops requesting tools”) in LangGraph? With a conditional edge whose routing function checks the last message for tool calls: if present, route back to the tool-execution node; if absent, route to END (or the next node). This is exactly the shape of the prebuilt ReAct loop under the hood.
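
The routing function for that loop is small. In this sketch messages are plain dicts rather than LangChain message objects, and the labels "tools" and "end" are assumed names; in a real graph they would be mapped to the tool-execution node and END in the conditional edge's destinations mapping.

```python
def should_continue(state: dict) -> str:
    """Route back to the tool node while the last message still requests tools."""
    last_message = state["messages"][-1]
    if last_message.get("tool_calls"):
        return "tools"  # loop: execute the requested tools, then call the model again
    return "end"        # no tool calls left: the model has produced its final answer

wants_tool = {"messages": [
    {"role": "user", "content": "weather in Paris?"},
    {"role": "ai", "content": "", "tool_calls": [{"name": "get_weather", "args": {"city": "Paris"}}]},
]}
finished = {"messages": [
    {"role": "ai", "content": "It is sunny in Paris.", "tool_calls": []},
]}

next_when_tool_requested = should_continue(wants_tool)
next_when_done = should_continue(finished)
```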

Q63. What guards against an infinite loop in a graph like that? LangGraph enforces a recursion limit (configurable via config={"recursion_limit": N}, default 25) on the number of super-steps a single run can execute — once exceeded, the run raises a GraphRecursionError rather than looping forever.

Q64. Why might two different nodes need to route to the same next node, and how do you express that cleanly? It’s common for both a “success” path and a “needs escalation” path to eventually converge on a shared “finalize” or “notify” node. You express it exactly like any other edge — multiple add_edge() calls (or conditional edges) can target the same destination node; LangGraph doesn’t require a tree shape, just a valid DAG-or-cycle-with-a-limit.

Q65. What is a “fan-in” and how does state merging make it safe? Fan-in is multiple parallel branches converging back into a single downstream node. It’s safe specifically because of reducers — if two branches both wrote to the same state key, the reducer defines how those concurrent writes combine (append, merge, etc.) rather than one silently clobbering the other.

Q66. Can you route to END conditionally from the middle of a graph? Yes — END is just another valid destination a conditional edge’s routing function can return, used whenever a branch’s logic determines the run is genuinely finished early rather than needing to reach some fixed “final” node.

Q67. What’s the difference between static graph structure and dynamic control flow at runtime? The graph’s nodes and possible edges are fixed at compile time — you can’t add a node that didn’t exist in the compiled graph. But which of those predefined edges actually get taken, and how many times a Send-based node runs, is entirely dynamic at runtime based on state. This is the core tension conditional edges and Command both resolve: fixed topology, dynamic traversal.

Q68. How would you implement “retry this node up to 3 times with different prompts if the output fails validation” using control flow alone (not RetryPolicy)? Track an attempt counter in state (Annotated[int, operator.add]), have the node’s routing function check both the validation result and the counter, and route back to the same node (adjusting the prompt based on the failure) while under the limit, or forward once valid or once the limit is hit. This is a business-logic retry, distinct from the infrastructure-level RetryPolicy covered in Section 4, which handles transient exceptions, not “the LLM’s output didn’t validate.”
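
A plain-Python sketch of that counter-plus-router pattern follows. The model is replaced by a stub that only succeeds on its third attempt, the while loop stands in for the graph runtime, and the node names generate, finalize, and give_up are assumptions.

```python
import operator

MAX_ATTEMPTS = 3

def generate(state: dict) -> dict:
    # Stubbed "model": a real node would vary the prompt using the previous failure.
    attempt = state["attempts"] + 1
    draft = "valid answer" if attempt >= 3 else f"bad draft {attempt}"
    return {"draft": draft, "attempts": 1}  # 1 is a delta, merged via operator.add

def route_after_generate(state: dict) -> str:
    if state["draft"].startswith("valid"):
        return "finalize"
    # Retry only while under the limit; otherwise stop looping.
    return "generate" if state["attempts"] < MAX_ATTEMPTS else "give_up"

# Minimal driver standing in for the graph: run the node, merge, then route.
state = {"draft": "", "attempts": 0}
destination = "generate"
while destination == "generate":
    update = generate(state)
    state = {
        "draft": update["draft"],                                         # overwrite
        "attempts": operator.add(state["attempts"], update["attempts"]),  # reducer
    }
    destination = route_after_generate(state)
```

With this stub the loop runs three times and then routes to finalize; a draft that never validates would route to give_up once attempts reaches the limit.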

Q69. What happens if a Command’s update conflicts with a reducer expecting a specific shape? The same rules apply as any node return — Command(update={...}) is merged through the target keys’ reducers exactly like a plain dict return would be, so a malformed update fails in the same way (an InvalidUpdateError for un-reduced concurrent writes, or a type error inside a custom reducer that doesn’t defend against the shape it received).

Q70. Why can Command be returned from inside a tool, not just from a graph node? Because agentic tool-calling loops often need a tool’s execution to affect graph control flow directly — e.g., a tool that looks up an order and determines the conversation should jump straight to a “handle_refund” node rather than going back through the LLM for another round of reasoning. Letting tools return Command avoids forcing that decision to be re-derived by a separate router after the fact.

Q71. What’s the interaction between add_conditional_edges() and cycles in the graph? Conditional edges are exactly how cycles get created — a routing function that can return the name of a node “earlier” in the logical flow is what makes looping (retry, re-planning, continued tool use) possible; LangGraph doesn’t distinguish a “cycle edge” from any other conditional edge structurally.

Q72. Give a realistic branching example beyond toy code: routing a customer support graph.

def route_ticket(state: State) -> str:
    if state["sentiment"] == "angry" and state["order_value"] > 500:
        return "escalate_to_human"
    if state["category"] == "refund":
        return "refund_flow"
    return "auto_respond"

This combines two independent signals (sentiment, order value) with a category check — realistic routing logic is rarely a single flag, and interviewers often probe whether you’d centralize this in one router node or split it into staged conditional edges.

Q73. How do you unit-test a routing function without running the graph? Since it’s a plain function taking state and returning a string (or list of strings/Sends), call it directly with hand-built state fixtures covering each branch and assert on the returned destination — identical testing approach to testing a node in isolation (Q49).
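
As a sketch, here is a table-driven test for the route_ticket function from Q72. The fixture values are made up for illustration, and the last case checks the boundary of the order_value > 500 comparison.

```python
def route_ticket(state: dict) -> str:
    if state["sentiment"] == "angry" and state["order_value"] > 500:
        return "escalate_to_human"
    if state["category"] == "refund":
        return "refund_flow"
    return "auto_respond"

# One hand-built state per branch, plus a boundary case.
cases = [
    ({"sentiment": "angry", "order_value": 900, "category": "refund"}, "escalate_to_human"),
    ({"sentiment": "angry", "order_value": 100, "category": "refund"}, "refund_flow"),
    ({"sentiment": "calm", "order_value": 900, "category": "billing"}, "auto_respond"),
    ({"sentiment": "angry", "order_value": 500, "category": "shipping"}, "auto_respond"),
]

results = [route_ticket(state) for state, _expected in cases]
for (state, expected), actual in zip(cases, results):
    assert actual == expected, f"{state} routed to {actual}, expected {expected}"
```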

Q74. What’s a common interview follow-up after explaining Send, and how do you answer it? “How do results from all the parallel Send invocations get back together?” — the answer is: through the state key(s) those invocations write to, combined via whatever reducer is attached (commonly operator.add to collect a list of per-item results), which the fan-in node then reads as a complete collection once all parallel branches for that super-step have finished.

Q75. What’s the single most common mistake developers make with conditional edges? Forgetting that the routing function runs after state has already been updated by the preceding node — trying to branch on a value the current step is still computing, rather than the value the previous step already committed. Related: not handling every possible return value of the routing function in the destinations mapping, which surfaces as a confusing runtime error rather than a compile-time one.

Section 4: Persistence, Durable Execution & Fault Tolerance (Q76–100)

Checkpointing is what makes memory, human-in-the-loop, and time travel possible — and it’s also where “toy graph” and “production graph” diverge the most sharply.

Q76. What is a checkpointer, in one sentence? A pluggable backend that saves a snapshot of the graph’s full state (a “checkpoint”) after every super-step, keyed by thread_id, so execution can be resumed, replayed, or inspected later. Docs: Persistence

Q77. Name the built-in checkpointer implementations and when you’d use each. InMemorySaver (formerly MemorySaver) for local development and tests — nothing persists past the process. SqliteSaver for lightweight local/single-instance persistence. PostgresSaver for production, multi-instance deployments needing a real durable store.

Q78. What’s the minimal code to compile a graph with a checkpointer and run it against a specific thread?

from langgraph.checkpoint.memory import InMemorySaver

graph = builder.compile(checkpointer=InMemorySaver())
config = {"configurable": {"thread_id": "conversation-42"}}
graph.invoke({"topic": "ice cream"}, config=config)

Every subsequent .invoke() or .stream() call using that same thread_id continues from wherever that thread’s last checkpoint left off, rather than starting fresh.

Q79. Without a checkpointer, can a graph still run at all? Yes — a checkpointer is optional for a single-shot .invoke() with no persistence needs. It becomes mandatory the moment you need any of: multi-turn memory across separate calls, interrupt()-based human-in-the-loop, or time travel — all three are built directly on the checkpoint history.

Q80. What exactly does a single checkpoint contain? The full graph state as of that step, plus metadata: which step number it is, which node(s) just ran, and a “pending writes” record used for retry safety. get_state() returns this same shape for the latest checkpoint on a thread.

Q81. What’s the difference between “checkpointing” and “durable execution” — are they the same thing? No, and this is a common interview trap. Checkpointing saves state between completed steps — if a node fails mid-execution, whatever it was doing when it crashed is lost and the step reruns from scratch on retry. True durable execution (as offered by systems like Temporal) additionally makes individual side effects within a step resumable/replayable, not just the state between steps. LangGraph’s checkpointing plus RetryPolicy gets you resilient step-level retries; it isn’t the same guarantee as a dedicated durable-execution engine for arbitrarily long, side-effect-heavy single steps.

Q82. Given the answer above, when would a team pair LangGraph with Temporal rather than relying on LangGraph alone? When individual nodes perform expensive, non-idempotent side effects (charging a customer, calling a non-retryable external API) inside a workflow that might span hours or days and must survive process crashes mid-step — Temporal’s durable-execution guarantees cover exactly that failure mode, with LangGraph handling the agent’s reasoning and state while Temporal handles the surrounding durable orchestration.

Q83. What is a RetryPolicy and where can it be attached? A configuration object controlling automatic retries for a node (or, in the Functional API, a @task) when it raises certain exception types. It can be attached per-node via add_node(..., retry_policy=...), or set as a default across the whole graph. Docs: Fault tolerance

Q84. What fields does a RetryPolicy support? max_attempts (including the first try), initial_interval and max_interval (backoff bounds), backoff_factor (exponential multiplier), jitter (randomize interval to avoid thundering-herd retries), and retry_on (which exception types or a custom predicate qualify for retry).
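
To see how those fields interact, here is a generic exponential-backoff sketch. It is not LangGraph's internal implementation, jitter is left out so the numbers are deterministic, and the default values are illustrative; check the RetryPolicy reference for the library's actual defaults.

```python
def backoff_schedule(
    max_attempts: int = 3,
    initial_interval: float = 0.5,
    backoff_factor: float = 2.0,
    max_interval: float = 128.0,
) -> list:
    """Return the wait in seconds before each retry, ignoring jitter."""
    waits = []
    # max_attempts counts the first try, so there are max_attempts - 1 retries.
    for retry_index in range(max_attempts - 1):
        wait = initial_interval * backoff_factor ** retry_index
        # max_interval caps the growth so late retries do not wait unboundedly long.
        waits.append(min(wait, max_interval))
    return waits


uncapped = backoff_schedule(max_attempts=5)
capped = backoff_schedule(max_attempts=5, max_interval=3.0)
```

With five attempts the uncapped waits are 0.5, 1.0, 2.0 and 4.0 seconds; setting max_interval to 3.0 clamps the last wait to 3.0.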

Q85. What does RetryPolicy retry by default, and why is that default deliberately narrow? By default it retries things that look like transient infrastructure failures — connection errors, 5xx-style responses — but not ValueError, TypeError, or RuntimeError, since those almost always indicate a genuine programming bug rather than a flaky dependency. Retrying a bug just re-triggers the same bug three times instead of surfacing it.
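
When the default is not the right fit, the same idea can be expressed as a custom predicate for retry_on. The function below is a hypothetical example using only standard-library exception types; it is a sketch of the pattern, not a copy of LangGraph's default rule.

```python
def is_transient(exc: Exception) -> bool:
    # Programming errors: retrying would only hit the same bug again.
    if isinstance(exc, (ValueError, TypeError, RuntimeError)):
        return False
    # Network-style failures are worth another attempt.
    return isinstance(exc, (ConnectionError, TimeoutError))


checks = {
    "connection reset": is_transient(ConnectionResetError()),
    "timeout": is_transient(TimeoutError()),
    "bad value": is_transient(ValueError("bad input")),
    "missing key": is_transient(KeyError("id")),
}
```

Anything the predicate does not explicitly recognize (such as the KeyError above) is treated as non-retryable, which keeps the policy narrow by default.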

Q86. What happens to a node’s partial writes if it fails partway through and then retries? Before each retry attempt, LangGraph clears any writes the failed attempt had already staged for that step, so a retried node starts clean rather than layering a second partial attempt’s writes on top of a first partial attempt’s leftovers.

Q87. Can you attach different retry policies to different nodes in the same graph? Yes — retry policy is a per-node (or per-task) setting, so a node calling a flaky third-party API can have an aggressive retry policy while a node doing pure local computation (where retrying a bug is pointless) has none at all.

Q88. What’s a CachePolicy and how is it different from retrying? A CachePolicy lets you cache a node’s output keyed by its input, so re-running the same input (common during development iteration, or when replaying from an earlier checkpoint) skips re-executing expensive or non-deterministic work like an LLM call — this is about avoiding redundant execution, not about recovering from failure.

Q89. Why does thread_id matter so much operationally, beyond just “which conversation is this”? It’s the unit of isolation for concurrency, persistence, and human-in-the-loop resumption all at once — two requests with the same thread_id racing against a database-backed checkpointer need application-level coordination to avoid interleaved writes, since the checkpointer itself doesn’t serialize concurrent access to a single thread for you.

Q90. How would you migrate a graph’s state schema (add a new required field) without breaking existing persisted threads? Add the new field with a sensible default (or make it Optional) rather than a hard requirement, since old checkpoints won’t have it populated; a node that depends on the new field should handle its absence gracefully for any thread whose checkpoint history predates the schema change, rather than assuming every thread was created after the migration.
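
A small sketch of that migration, with hypothetical names throughout (StateV1, StateV2, priority, triage): the new field is declared optional, and the node that needs it reads it with a fallback.

```python
from typing import TypedDict


class StateV1(TypedDict):
    messages: list


class StateV2(StateV1, total=False):
    # New field added after threads already existed, so it is optional on purpose.
    priority: str


DEFAULT_PRIORITY = "normal"


def triage(state: StateV2) -> dict:
    # Checkpoints written before the migration have no "priority" key,
    # so read it with a default instead of indexing directly.
    priority = state.get("priority", DEFAULT_PRIORITY)
    queue = "fast" if priority == "high" else "standard"
    return {"priority": priority, "queue": queue}


old_thread = triage({"messages": []})
new_thread = triage({"messages": [], "priority": "high"})
```

Returning priority from the node also backfills the field, so a pre-migration thread carries the value in every checkpoint written after its next run.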

Q91. What’s the PostgresSaver’s setup() method for? It creates the checkpointer’s required tables/schema in the target Postgres database — a one-time (or per-migration) step that has to run before the checkpointer can actually persist anything, distinct from constructing the PostgresSaver instance itself.

Q92. Can checkpoints be deleted, and why would you need to? Yes, most checkpointer implementations expose a way to delete a thread’s checkpoint history — needed for data-retention compliance (a user requests deletion of their conversation history) or simply to reclaim storage for threads that are no longer relevant.

Q93. What does “list of pending writes” in a checkpoint’s metadata actually protect against? It’s how LangGraph knows, if a process crashes after a node finishes but before the next step’s scheduling is fully recorded, which writes were already durably committed versus which need to be recomputed on resume — preventing either silently losing a completed node’s output or double-applying it.

Q94. How do checkpoints interact with parallel branches (fan-out) in terms of what gets saved? All writes from every node that ran in a given super-step are captured together in that step’s checkpoint — a fan-out of five parallel nodes produces one checkpoint reflecting all five nodes’ combined, reducer-merged writes, not five separate checkpoints.

Q95. What’s a realistic interview question testing whether you understand checkpoint frequency? “If a node takes 30 seconds and the graph has 10 sequential nodes, how many checkpoints does a single run produce, and what does that mean for storage cost at scale?” — the answer is one checkpoint per completed super-step (so up to 10 here, fewer if steps run in parallel within the same super-step), meaning checkpoint volume scales with graph depth × run volume, which is why teams often trim what’s stored in state (Q44) before scaling to production traffic.
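
The scaling argument is easy to put numbers on. The figures below (50,000 runs per day, 20,000 bytes of state) are made up for illustration, and the estimate assumes every checkpoint stores a full copy of state, so treat it as a rough upper bound rather than a measurement.

```python
def checkpoint_volume(super_steps_per_run: int, runs_per_day: int, avg_state_bytes: int) -> dict:
    # One checkpoint per completed super-step, per run.
    checkpoints = super_steps_per_run * runs_per_day
    return {
        "checkpoints_per_day": checkpoints,
        "bytes_per_day": checkpoints * avg_state_bytes,
    }


# 10 sequential nodes, hypothetical traffic and state size.
estimate = checkpoint_volume(super_steps_per_run=10, runs_per_day=50_000, avg_state_bytes=20_000)
# 500,000 checkpoints and 10,000,000,000 bytes (about 10 GB) per day
```

Halving the average state size halves the storage figure directly, which is the practical argument for trimming state before scaling up.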

Q96. What’s the operational difference between SqliteSaver and PostgresSaver beyond “one’s a file and one’s a server”? SQLite’s single-writer model makes it a poor fit the moment you have more than one application instance writing concurrently — it’s fine for a single local process or a low-concurrency prototype, but a multi-instance production deployment needs Postgres’s proper concurrent-write support and connection pooling.

Q97. Does a checkpointer store your prompts and model outputs in plaintext by default? Yes. By default the entire state (which typically includes full message history) is serialized as-is. If that includes sensitive data, you’re responsible for encryption at rest (a database-level concern) or for redacting sensitive fields from state before they’re checkpointed; LangGraph doesn’t apply field-level encryption on your behalf.

Q98. What does “durable execution mode” configurability at the graph level actually toggle, conceptually? It controls when checkpoint writes are persisted relative to step execution. The setting ranges from persisting only when the run exits (lowest overhead, but intermediate steps are not recoverable after a crash), through persisting asynchronously while the next step runs, to persisting synchronously before the next step starts (strongest guarantee, highest latency). That tradeoff between recoverability and performance is why LangGraph’s docs frame this as a spectrum rather than a single on/off durability flag.

Q99. How would you explain, to a non-engineer stakeholder, why checkpointing costs anything at all in latency? Every completed step now includes writing a snapshot to a database before the graph is “done” with that step — that’s an extra I/O round-trip per step compared to a purely in-memory, no-persistence run, which is the tradeoff you’re accepting in exchange for resumability, memory, and auditability.

Q100. What’s a good interview answer to “when would you deliberately not use a checkpointer”? Stateless, single-shot utility graphs with no need for memory, replay, or human-in-the-loop — e.g., a graph that classifies one piece of text and returns a label, called fresh each time with no conversational context to preserve. Adding persistence there is pure overhead with no corresponding benefit.

Section 5: Human-in-the-Loop, Interrupts & Time Travel (Q101–125)

If Section 4 was “how state survives,” this section is “how a human gets to change what happens next” — the piece every interviewer probing production-readiness will ask about.

Q101. What does interrupt() do, mechanically? Called inside a node, interrupt(value) stops the graph’s execution at that point and surfaces value (any JSON-serializable payload: a question, a proposed action, a diff to review) to whatever’s driving the graph. The thread stays paused at its last checkpoint until the graph is invoked again with a Command(resume=...). Docs: Interrupts

Q102. Show the minimal interrupt-and-resume pattern.

from langgraph.types import interrupt, Command

def human_review(state: State):
    decision = interrupt({"action": state["proposed_action"]})
    return {"approved": decision == "approve"}

# first call pauses at interrupt():
graph.invoke(initial_input, config=config)
# resuming later, from the same thread_id:
graph.invoke(Command(resume="approve"), config=config)

The value passed to Command(resume=...) becomes interrupt()’s return value inside the node, and the node re-runs from the top with that value available, which has an important consequence covered next.

Q103. Why does that last point (“the node re-runs from the top”) matter for how you write nodes containing interrupt()? Because the node function resumes by re-executing from its beginning up to and past the interrupt() call, any code before the interrupt() call inside that node runs again on resume. That code needs to be idempotent (safe to run twice) — a side effect like sending an email before the interrupt would fire a second time on every resume unless you guard against it.
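
The re-run behavior and one way to guard against it can be simulated in plain Python. Everything here is a stand-in: PausedForHuman and fake_interrupt only mimic the pause-then-re-run flow, and send_email_once is a hypothetical helper that uses an idempotency key so the side effect fires once.

```python
class PausedForHuman(Exception):
    """Stand-in for the pause that interrupt() triggers; not a LangGraph class."""


sent_emails = []
already_sent = set()


def send_email_once(idempotency_key: str, body: str) -> None:
    # Guard the side effect so a re-run of the node cannot send a duplicate.
    if idempotency_key in already_sent:
        return
    already_sent.add(idempotency_key)
    sent_emails.append(body)


def fake_interrupt(resume_value):
    # First pass: no resume value yet, so pause. On resume: return the value.
    if resume_value is None:
        raise PausedForHuman()
    return resume_value


def human_review(state: dict, resume_value=None) -> dict:
    # This call sits before the interrupt, so it executes on every pass.
    send_email_once(f"review-request:{state['ticket_id']}", "Please review ticket")
    decision = fake_interrupt(resume_value)
    return {"approved": decision == "approve"}


state = {"ticket_id": "T-1"}
try:
    human_review(state)  # first pass pauses at the interrupt
except PausedForHuman:
    pass
result = human_review(state, resume_value="approve")  # resume re-runs from the top
```

The node body ran twice, but only one email was recorded. Moving the side effect into its own node ahead of the review node is an alternative that avoids the guard entirely.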

Q104. Does interrupt() require a checkpointer? Yes, unconditionally — pausing and later resuming exactly where execution left off is only possible because the checkpointer persisted the state at that point. A graph compiled without a checkpointer can’t use interrupt() meaningfully.

Q105. What’s the difference between interrupt() and a static “breakpoint” set via compile(interrupt_before=[…])? interrupt_before/interrupt_after are compile-time, node-name-based breakpoints that always pause before or after a named node runs, regardless of any runtime condition. interrupt() is called from inside node logic itself, so it can pause conditionally — e.g., only when a proposed action exceeds some risk threshold, not on every single run.

Q106. How do you inspect what a pending interrupt is asking for, from outside the graph? graph.get_state(config).next shows which node(s) the run is paused at, and the state snapshot’s tasks (or, on the event-streaming API, stream.interrupts/stream.interrupted) surface the pending interrupt’s payload. You read that payload to render a UI prompt (approve/reject, edit this field, etc.) before deciding what to pass to Command(resume=...).

Q107. Can a single graph run pause on more than one interrupt across its execution? Yes — a graph can hit multiple interrupt() calls across different nodes (or the same node called multiple times via a loop), each one pausing and later resuming independently as the run progresses; there’s no limit of “one interrupt per thread.”

Q108. What’s the difference between Command(resume=…) and Command(update=…) when resuming after a human review step? resume supplies the human’s answer to the specific interrupt() call that’s pending — it becomes that call’s return value. update separately lets you also patch other state fields at the same time you resume, if the human’s input should change more than just what the interrupt was directly asking about.

Q109. How would you implement “let the human edit the agent’s draft before it’s sent,” not just approve/reject it?

def review(state: State):
    edited = interrupt({"draft": state["draft"], "action": "edit_or_approve"})
    return {"draft": edited}  # human's edited text replaces the draft

The interrupt’s payload can carry the full draft for the human to see, and whatever they pass to resume= becomes the new draft — approve is just “resume with the same text unchanged.”

Q110. What’s get_state_history() and what does it return? It returns an iterator over every checkpoint ever recorded for a given thread_id, from most recent to oldest — the full audit trail of every super-step’s state, which is the basis for time travel. Docs: Use time travel

Q111. How do you “rewind” execution to an earlier point and try a different path from there? Find the target checkpoint via get_state_history(), then call graph.invoke(new_input, config={"configurable": {"thread_id": ..., "checkpoint_id": earlier_id}}) (or the equivalent config carrying that checkpoint’s identity) — execution resumes from that earlier point forward, effectively branching a new timeline while the original run’s history remains untouched.

Q112. What does graph.update_state() let you do that pure time-travel-and-replay doesn’t? It lets you edit a checkpoint’s state values before resuming from it, rather than just replaying exactly what happened. Combined with time travel, this is how you implement “the agent made a mistake at step 4 — let’s correct that one field and continue from there” instead of only being able to replay the original mistake verbatim.

Q113. Does update_state() overwrite the original checkpoint, or create a new one? It creates a new checkpoint at the same logical point in the thread’s history, with the altered values — the original checkpoint is preserved untouched, which is exactly what makes this safe to use for exploration without destroying your audit trail.

Q114. What’s a realistic production reason to combine interrupt() with a queue/notification system rather than blocking synchronously? A human reviewer might not be available for minutes or hours — the graph’s execution genuinely needs to sit paused (persisted via checkpoint) while a notification (Slack message, email, ticket) goes out, and only resume whenever the reviewer eventually acts, which could be a very different process invocation entirely from the one that hit the interrupt.

Q115. How does event streaming (Section 9) expose interrupts, versus the lower-level stream_mode API? On the v3 event-streaming API, stream.interrupted is a boolean you check after consuming a stream, and stream.interrupts gives you the structured interrupt payloads directly. On the older stream_mode API, the same information surfaces as a __interrupt__ key in the returned state dict (v1) or a dedicated interrupts field on values stream parts (v2).

Q116. What happens if you call Command(resume=…) on a thread that isn’t actually paused at an interrupt? This is generally a misuse — resuming implies there’s a pending interrupt waiting for that specific value. Behavior in that case depends on version and isn’t something to rely on; the correct pattern is always to check get_state(config).next (or stream.interrupted) first to confirm a run is actually paused before attempting to resume it.

Q117. Why is human-in-the-loop design something interviewers specifically probe for at senior levels? Because getting the mechanics of interrupt() right is necessary but not sufficient — the harder design question is which actions in a given workflow actually warrant a pause (irreversible, high-value, or ambiguous ones) versus which should run fully autonomously, and how escalation thresholds are decided and maintained over time. That’s a judgment/architecture question, not an API-syntax one.

Q118. What’s the relationship between interrupt() and the idea of “durable execution” from Section 4? They’re built on the same foundation — a paused interrupt() is only recoverable across arbitrary delays (including a full process restart) because the checkpointer already persisted everything needed to resume; without durable checkpointing, interrupt() would only be able to pause for as long as the current process stays alive in memory.

Q119. How would you let a human reject an action and provide a reason that feeds back into the agent’s next attempt?

def review(state: State):
    result = interrupt({"action": state["proposed_action"]})
    if result["decision"] == "reject":
        return {"rejected_reason": result["reason"], "attempt": state["attempt"] + 1}
    return {"approved": True}

The interrupt payload the human resumes with can be a structured object (not just a string), letting a single interrupt carry both the decision and supporting context back into state for a routing function to act on next.

Q120. Can time travel be used for anything other than debugging or correcting mistakes? Yes — a common use is exploring “what if” alternatives for evaluation purposes: replay the same conversation from a fixed checkpoint with a different prompt or model, and compare the two resulting trajectories side by side, which is a cheap way to A/B test a change against a real historical scenario rather than only against synthetic test cases.

Q121. What’s the difference between “replay” (re-running from a checkpoint with the same input) and “branch” (re-running with different input)? Replay reproduces exactly what already happened, useful for verifying determinism or debugging with full visibility into a past run. Branching intentionally diverges from that point forward — same history up to the checkpoint, different path afterward — which is what both correcting a mistake and “what if” exploration actually rely on.

Q122. What would you check first if a human-in-the-loop workflow appears to “lose” state after a reviewer approves an action? Whether the node’s pre-interrupt code is accidentally non-idempotent (Q103) and is resetting or overwriting a field every time it re-runs on resume, and whether Command(resume=...)’s value is actually reaching the intended interrupt() call rather than a different pending interrupt earlier in a multi-interrupt thread.

Q123. How does interrupt() interact with parallel branches — if two parallel nodes both call interrupt(), what happens? Each call surfaces its own interrupt payload, and the run pauses until all pending interrupts for that step have been resumed — you generally need to resume each one (or provide resume values keyed appropriately) rather than assuming a single Command(resume=...) call satisfies every outstanding interrupt in a fan-out.

Q124. Why might a team build their own lightweight approval UI on top of get_state()/interrupt() rather than using a prebuilt tool? Because the approval UI’s requirements are almost always domain-specific (what fields to show a human reviewer, what edit affordances make sense, what audit trail format compliance requires), and interrupt()’s payload is intentionally a plain JSON-serializable value precisely so it can back whatever bespoke UI a team already has, rather than forcing a specific reviewer interface.

Q125. What’s a good closing answer to “what’s the single biggest risk of human-in-the-loop design done poorly”? Interrupt fatigue — routing so many low-stakes decisions to a human reviewer that they start rubber-stamping approvals without real scrutiny, which defeats the entire purpose of the checkpoint. The fix is the same judgment call as Q117: reserve interrupts for genuinely high-stakes or ambiguous decisions, and let everything else run autonomously with strong guardrails and after-the-fact monitoring instead.

Section 6: Tools, Tool-Calling & Prebuilt Agents (Q126–150)

Most real LangGraph graphs revolve around an LLM deciding to call tools. This section covers the mechanics of that loop and the prebuilt helpers that implement the common version of it for you.

Q126. What is create_react_agent and what does it build? A prebuilt function (from langgraph.prebuilt, split out as part of LangGraph 0.3’s move toward first-class prebuilt agents) that constructs a complete ReAct-style tool-calling agent graph — model call, conditional routing to tools if the model requested any, tool execution, and looping back to the model — from just a model and a list of tools, without you hand-wiring that graph yourself. Docs: create_react_agent reference

Q127. Show the minimal create_react_agent usage.

from langgraph.prebuilt import create_react_agent

agent = create_react_agent(
    model="anthropic:claude-sonnet-4-6",
    tools=[search_tool, calculator_tool],
)
agent.invoke({"messages": [{"role": "user", "content": "What's 42 * 17?"}]})

The returned object is itself a compiled graph — you can still pass a checkpointer to create_react_agent(..., checkpointer=...) and get all the same persistence, streaming, and interrupt capabilities as a hand-built graph.

Q128. What is ToolNode and how does it relate to create_react_agent? ToolNode is the prebuilt node that actually executes tool calls requested by a model — given the last AI message’s tool_calls, it looks up and invokes each named tool with the provided arguments and returns the results as tool messages. create_react_agent uses a ToolNode internally; you can also use it directly if you’re hand-building a similar loop with custom routing around it.

Q129. How does a model “decide” to call a tool at the LangGraph level — what’s actually happening? The chat model is bound to the tool definitions (via .bind_tools([...]) or equivalently by passing tools to init_chat_model), which tells the underlying provider’s API about the available functions and their schemas. The model’s response then either contains normal text or one or more tool_calls entries; a conditional edge inspects the last message for the presence of tool_calls and routes to ToolNode if present, or ends/continues otherwise.
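
The routing half of that loop is small enough to sketch directly. route_after_model is a hypothetical name, and the "tools" and "end" labels are assumptions that would be mapped to real destinations in add_conditional_edges; the function accepts either dict-shaped messages or message objects exposing a tool_calls attribute.

```python
def route_after_model(state: dict) -> str:
    last_message = state["messages"][-1]
    # Support both plain dict messages and objects with a tool_calls attribute.
    if isinstance(last_message, dict):
        tool_calls = last_message.get("tool_calls")
    else:
        tool_calls = getattr(last_message, "tool_calls", None)
    # A non-empty tool_calls list means the model asked for tool execution.
    return "tools" if tool_calls else "end"


wants_tool = {
    "messages": [
        {"role": "user", "content": "What's 42 * 17?"},
        {"role": "assistant", "content": "", "tool_calls": [{"name": "calculator", "args": {"a": 42, "b": 17}}]},
    ]
}
plain_answer = {"messages": [{"role": "assistant", "content": "714"}]}
```

An empty tool_calls list is treated the same as a missing one, so a model response that carries the key but requests nothing still ends the loop.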

Q130. What happens if a tool raises an exception during execution inside ToolNode? By default, ToolNode catches tool execution errors and returns them as a tool message content (so the model gets to see the error and can decide how to react — retry with different arguments, apologize to the user, try a different tool) rather than crashing the whole graph run. This behavior is configurable if you want errors to propagate instead.

Q131. How do you give a tool access to the graph’s current state, not just the arguments the model provided? By adding a parameter annotated with InjectedState to the tool function’s signature — LangGraph recognizes that annotation and populates it from the graph’s current state automatically, without exposing that parameter to the model (so the model never has to “guess” values that should come from state rather than from its own reasoning).

Q132. What’s InjectedToolCallId used for? It gives a tool access to its own tool_call_id — useful when the tool itself needs to construct a Command that includes a properly-formed tool message referencing the call that triggered it, since the tool message response has to carry that ID for the model to correctly associate the result with its request.

Q133. Can a tool return a Command instead of a plain string/dict result? Yes (referenced in Q70) — a tool can return Command(update={...}, goto=...) to both supply its result and influence graph control flow directly from inside the tool, rather than requiring a separate node afterward to inspect the tool’s output and decide what happens next.

Q134. What’s the difference between a “tool” in the LangChain/LangGraph sense and a plain Python function? A tool wraps a plain function with a name, description, and an argument schema (typically inferred from type hints and docstring, or defined explicitly via a Pydantic model) that gets serialized and sent to the model’s API so the model knows the tool exists and how to call it correctly — the @tool decorator is the common way to turn a plain function into that wrapped form.

Q135. Why does a tool’s docstring/description matter so much for reliability? The model chooses which tool to call and how to call it based entirely on the name, description, and parameter descriptions it’s given — a vague description (“gets data”) leads to the model guessing wrong or misusing the tool’s parameters, while a precise one (“returns order status, amount, and date for a given order ID; do not use for inventory lookups”) measurably improves correct tool selection, since that description functions as part of the prompt.

Q136. How would you limit which tools are available depending on graph state (e.g., a user’s permission level)? Rather than binding a fixed tool list once, build the tool list dynamically inside the node that calls the model — filtering the full tool set down to whichever subset the current state’s permission level allows — before binding that filtered list to the model call for that particular invocation.
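
A minimal sketch of the filtering step, assuming a numeric permission_level key in state. The two tools, the TOOL_MIN_LEVEL table, and tools_for are all hypothetical; the filtered list is what the model-calling node would then pass to .bind_tools() for that invocation.

```python
def search_orders(query: str) -> str:
    """Look up orders matching a query (placeholder body)."""
    return f"orders matching {query}"


def issue_refund(order_id: str) -> str:
    """Issue a refund for an order (placeholder body)."""
    return f"refund issued for {order_id}"


ALL_TOOLS = [search_orders, issue_refund]
# Minimum permission level required to see each tool.
TOOL_MIN_LEVEL = {"search_orders": 1, "issue_refund": 3}


def tools_for(state: dict) -> list:
    # Missing permission info defaults to level 0, which exposes no tools.
    level = state.get("permission_level", 0)
    return [tool for tool in ALL_TOOLS if TOOL_MIN_LEVEL[tool.__name__] <= level]


agent_tools = [t.__name__ for t in tools_for({"permission_level": 1})]
admin_tools = [t.__name__ for t in tools_for({"permission_level": 3})]
```

Filtering before binding means a lower-privileged user's model call never even sees the refund tool's schema, which is stricter than binding everything and rejecting the call afterward.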

Q137. What’s the recommended way to handle a tool that needs human approval before executing (e.g., “send this email”)? Combine interrupt() with the tool-calling loop: route to a review node before the actual side-effecting tool executes, surface the proposed tool call’s arguments via interrupt(), and only invoke the real tool (or a Command reflecting the approved action) once a human has resumed with approval — never let a destructive tool execute unconditionally inside ToolNode if it needs a human gate first.

Q138. What is create_supervisor and how does it differ from create_react_agent? create_supervisor (from the separate langgraph-supervisor package) builds a multi-agent graph where a central supervisor agent decides, turn by turn, which of several specialized sub-agents (each often itself built with create_react_agent) should handle the current request — it’s a level up from a single tool-calling agent, orchestrating multiple agents rather than multiple tools. Docs: langgraph-supervisor reference

Q139. Can prebuilt agents like create_react_agent be customized, or are you stuck with the default loop shape? They expose meaningful customization points — a custom system prompt/state modifier, a custom state schema extending the default, hooks for pre-model and post-model processing — without requiring you to fork the whole implementation; if your needs go beyond what those hooks support, that’s the signal to hand-build the graph with the Graph API instead.

Q140. What’s a common reason a tool-calling loop never terminates in practice? The model keeps requesting tool calls indefinitely — often because a tool’s results aren’t actually resolving the model’s underlying question (bad tool design, Q135), or because there’s no explicit stopping instruction/limit in the system prompt or a recursion-limit-style safeguard, so the model has no signal that it should conclude and respond directly instead of calling yet another tool.
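
One explicit safeguard is a tool-round budget enforced in the routing function. This sketch assumes dict-shaped messages and a tool_rounds counter that some node increments each time tools run; MAX_TOOL_ROUNDS, route_with_budget, and the destination labels are hypothetical.

```python
MAX_TOOL_ROUNDS = 5  # hypothetical budget; tune per agent


def route_with_budget(state: dict) -> str:
    last_message = state["messages"][-1]
    wants_tools = bool(last_message.get("tool_calls"))
    # Once the budget is spent, stop looping and force a direct answer.
    if wants_tools and state.get("tool_rounds", 0) >= MAX_TOOL_ROUNDS:
        return "force_final_answer"
    return "tools" if wants_tools else "end"


tool_request = {"role": "assistant", "content": "", "tool_calls": [{"name": "search"}]}
within_budget = route_with_budget({"messages": [tool_request], "tool_rounds": 2})
over_budget = route_with_budget({"messages": [tool_request], "tool_rounds": 5})
finished = route_with_budget({"messages": [{"role": "assistant", "content": "done"}]})
```

Routing to a dedicated final-answer node gives the model one last call without tools bound, which tends to produce a more useful reply than failing the run on a recursion limit.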

Q141. How would you stream just the tool-call arguments as they’re generated, not the final tool result? Using the v3 event-streaming API’s message.tool_calls projection (or, in raw content-block terms, filtering for tool-call-typed content blocks), you get the incrementally-generated tool-call arguments as the model produces them, distinct from stream.tool_calls (via ToolCallTransformer), which surfaces the correlated call and its eventual execution result together.

Q142. What’s the difference between binding tools to a model directly versus letting create_react_agent do it? Functionally similar — create_react_agent calls .bind_tools() (or the equivalent) for you as part of constructing its internal model-calling node. The difference is convenience versus control: binding tools yourself inside a hand-built node gives you a place to add custom logic (dynamic tool filtering, per-call configuration) that the prebuilt agent’s default construction doesn’t expose without using its customization hooks.

Q143. What’s a realistic multi-tool scenario interviewers use to test whether you understand tool-call routing, not just definition? “The agent has a search tool and a calculator tool. The user asks a question that needs both, in sequence, where the calculator’s input depends on the search result.” The correct answer walks through multiple loop iterations: model requests search → ToolNode executes it → result returns to model → model requests calculator with the search-derived number → ToolNode executes that → model finally responds with text and no further tool calls, ending the loop.

Q144. Why might you deliberately keep tool execution outside the graph’s checkpointed state (e.g., not storing full raw API responses in state)? Same reasoning as Q44 — a tool that returns a huge payload (a full document, a large dataset) bloats every subsequent checkpoint if stored verbatim in state; a common pattern is having the tool store the raw result externally (cache, object store) and return only a reference or a trimmed summary into state for the model to reason over.
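
A sketch of the store-externally, return-a-reference pattern. BLOB_STORE is an in-memory dict standing in for a real cache or object store, and stash_payload is a hypothetical helper a tool would call before returning.

```python
import hashlib
import json

BLOB_STORE = {}  # stand-in for a cache or object store


def stash_payload(payload: dict, preview_chars: int = 80) -> dict:
    # Serialize deterministically so identical payloads map to the same key.
    raw = json.dumps(payload, sort_keys=True)
    key = hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]
    BLOB_STORE[key] = raw
    # Only this small record goes into graph state and its checkpoints.
    return {
        "ref": key,
        "size_bytes": len(raw.encode("utf-8")),
        "preview": raw[:preview_chars],
    }


big_result = {"rows": list(range(1000))}
record = stash_payload(big_result)
```

A later node or tool can load the full payload from the store by ref only when it actually needs it, while the checkpoints carry just the short record.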

Q145. What does “middleware” mean in the context of LangChain’s newer agent-building surface, and how does it relate to LangGraph? Middleware refers to composable hooks that intercept and modify agent behavior at defined points (before/after a model call, before/after tool execution) without rewriting the underlying graph — it’s a higher-level convenience layered on top of the same LangGraph primitives (nodes, edges, state) covered throughout this guide, aimed at making common cross-cutting concerns (logging, guardrails, retries) reusable across agents.

Q146. How do you test a tool-calling agent’s behavior without hitting a real LLM API on every test run? Use a fake/stub chat model that returns pre-scripted tool-call requests for given inputs (LangChain ships test utilities for exactly this), so you can assert the graph routes correctly and ToolNode executes the right tool with the right arguments — deterministic, fast, and free of live-API flakiness — reserving real-model tests for a smaller integration-test tier.

Q147. What’s a tool-design mistake that looks fine in isolation but breaks down in a multi-tool agent? Two tools with overlapping, ambiguous descriptions (e.g., both plausibly described as “look up customer information”) — the model can’t reliably pick the right one, and the failure mode isn’t a crash, it’s silently calling the wrong tool and returning a confidently wrong answer, which is much harder to catch in testing than an outright error.

Q148. Can create_react_agent’s default agent use structured output for its final response instead of free text? Yes. Current versions accept a response_format argument (a schema such as a Pydantic model), so the agent’s final answer, once it stops calling tools, is validated against that schema and returned as a structured object rather than only free-form text. This is useful when the agent’s output feeds directly into downstream code rather than being shown to a human as-is.

Q149. Why is “the model decided not to call any tools when it should have” a harder bug to debug than “the tool call failed”? A failed tool call produces a visible error you can trace directly. A model that should have called a tool but didn’t leaves no error at all — it just answers from its own (possibly wrong or outdated) knowledge instead of using the available tool, and the only way to catch it is evaluation against known-correct expected behavior, not error-log inspection.

Q150. What’s the honest tradeoff of using create_react_agent versus hand-building the same loop with the Graph API? create_react_agent gets you a correct, maintained implementation of a very common pattern in a few lines, and you inherit fixes/improvements to that pattern for free. Hand-building gives you full control over every routing decision, state field, and intermediate node — worth it the moment your agent’s loop needs to deviate meaningfully from plain “call model, call tools if requested, repeat,” which happens more often in production systems than beginner tutorials suggest.

Section 7: Memory & the Store API (Q151–175)

Checkpointers (Section 4) give you memory within a thread. This section covers memory that needs to survive across threads — a returning user, a fact learned in one conversation that should inform another.

Q151. What’s the fundamental difference between short-term and long-term memory in LangGraph’s own terminology? Short-term memory is thread-scoped conversation history, handled by the checkpointer — it’s naturally tied to one ongoing interaction. Long-term memory is scoped across threads (and often across users or sessions entirely), handled by a separate BaseStore, because a checkpointer’s thread_id scoping is the wrong shape for “remember this fact about this user regardless of which conversation they start next.” Docs: Memory overview

Q152. What is BaseStore? An interface for persisting and retrieving arbitrary key-value data organized into namespaces, independent of any particular thread — the mechanism long-term memory is built on. Built-in implementations include InMemoryStore (development) and PostgresStore (production).

Q153. Show the minimal pattern for saving and retrieving a memory via the store.

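from langgraph.store.memory import InMemoryStore

store = InMemoryStore()
user_id = "user-123"  # example value; use your application's real user identifier
namespace = ("memories", user_id)  # a namespace is a tuple of path segments
store.put(namespace, "preferences", {"likes": "concise answers"})  # namespace, key, value
item = store.get(namespace, "preferences")  # returns an Item, or None if the key is missing
print(item.value)  # {'likes': 'concise answers'}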

Namespaces are tuples, letting you organize memories hierarchically (by user, by application area, by memory type) however your application needs.

Q154. How does a node or tool access the store at runtime? The store is passed into the compiled graph (builder.compile(store=store, checkpointer=checkpointer)). A node then receives it by declaring a store parameter typed as BaseStore in its signature, and a tool receives it through a parameter annotated with InjectedStore, similar in spirit to how InjectedState (Q131) works for graph state.

Q155. What is semantic search in the context of BaseStore, and when was it added? It’s the ability to query the store with a natural-language query and get back memories ranked by embedding similarity to that query, rather than only exact-key lookup — added as a capability across PostgresStore, InMemoryStore, LangGraph Studio, and LangGraph Platform deployments, letting an agent recall “something relevant to this topic” without knowing the exact key a past memory was stored under. Docs: Semantic search for LangGraph memory

Q156. How do you configure semantic search for a store? You specify an embedding provider and model (e.g., "openai:text-embedding-3-small"), a vector dimension size, and which fields of a stored item should be indexed for embedding, either programmatically when constructing the store or in langgraph.json’s store configuration block when deploying to LangGraph Platform.

Q157. What’s the difference between store.get() and store.search()? get() is an exact lookup by namespace and key — you already know precisely what you’re retrieving. search() queries across a namespace, optionally with a natural-language query for semantic ranking, returning the most relevant matches when you don’t know the exact key, which is the more common access pattern for an agent recalling relevant-but-not-exactly-known facts.

Q158. What are the three integration patterns for adding long-term memory to an agent, at a high level? First, as a tool the model can explicitly call (“remember this,” “recall anything about X”). Second, as logic baked directly into a node, which automatically saves or loads relevant memories before or after a model call without the model having to ask. Third, through a dedicated memory-management subsystem that reads and writes the BaseStore on the agent’s behalf. The right choice depends on whether you want memory access to be an explicit, model-visible decision or an implicit, always-on background behavior.

Q159. Why would you choose “memory as an explicit tool” over “memory baked into every node automatically”? Explicit-tool memory gives the model (and your evaluation/observability tooling) visibility into exactly when memory was consulted or written, which matters for debugging and for cases where memory access has a cost (embedding calls, retrieval latency) you don’t want paid on every single turn regardless of relevance.

Q160. What does “cross-thread memory” actually solve that a very long single-thread conversation couldn’t? A single thread’s history grows unboundedly and eventually exceeds context-window-practical limits even with summarization; more importantly, real usage patterns are naturally multi-thread (a user starts a new conversation tomorrow, or interacts through a different channel) — cross-thread memory via the store lets facts learned in thread A be available in an entirely separate thread B, which checkpointing alone structurally cannot do.

Q161. How would you decide what’s worth saving to long-term memory versus what should just live in a given thread’s checkpoint history? Durable, user-level facts that should hold regardless of conversation context (stated preferences, profile details, standing instructions) belong in the store; conversational specifics relevant only to the current exchange (what the user just asked five messages ago) are exactly what thread-scoped checkpointing already handles well — saving everything to long-term memory indiscriminately just recreates unbounded-context problems one layer up.

Q162. What’s a namespace collision risk when designing your store’s namespace scheme, and how do you avoid it? Using something ambiguous (like just a username string) as a namespace segment risks collisions if two logically distinct memory types share that same segment structure — a more robust scheme includes an explicit type/category segment (("memories", "preferences", user_id) vs. ("memories", "facts", user_id)) so retrieval and writes for one category can’t accidentally clobber or leak into another.
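
One way to keep the scheme consistent is to build every namespace through a single helper rather than assembling tuples by hand. The sketch below is illustrative only: memory_namespace and ALLOWED_CATEGORIES are hypothetical names, and the plain dict stands in for a real store.

```python
# Hypothetical set of memory categories for this application.
ALLOWED_CATEGORIES = {"preferences", "facts"}


def memory_namespace(category: str, user_id: str) -> tuple:
    """Build a namespace with an explicit category segment between the root and the user."""
    if category not in ALLOWED_CATEGORIES:
        raise ValueError(f"unknown memory category: {category!r}")
    if not user_id:
        raise ValueError("user_id must be non-empty")
    return ("memories", category, user_id)


# A plain dict keyed by (namespace, key) stands in for the store here.
fake_store = {}
fake_store[(memory_namespace("preferences", "u1"), "summary")] = {"likes": "concise answers"}
fake_store[(memory_namespace("facts", "u1"), "summary")] = {"plan": "enterprise"}
```

Both writes use the same key ("summary") for the same user, but the category segment keeps them in separate namespaces, so neither overwrites the other.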

Q163. Can memory stored via BaseStore be shared across multiple different graphs/applications, not just multiple threads of the same graph? Yes, in principle — since the store is a standalone key-value/semantic-search backend independent of any specific graph’s compilation, any application with a reference to the same underlying store (same Postgres instance, same connection config) can read and write the same namespaces, which is how organizations share a user-memory layer across multiple agent products.

Q164. What are the four “parallel strategies” sometimes described for memory retrieval, beyond plain semantic similarity? Semantic (embedding similarity to the query), keyword/BM25-style (exact or near-exact term overlap), graph traversal (finding memories connected through shared entities/relationships), and temporal (weighting more recent memories higher) — a production memory system often blends more than one of these rather than relying on embedding similarity alone, since pure semantic search can miss an exact-term match a keyword search would have caught immediately.

Q165. What’s a realistic failure mode of relying on semantic search alone for memory recall? A query using very specific, distinctive terminology (an exact product SKU, a precise error code) can retrieve worse results via pure embedding similarity than a simple exact/keyword match would, because semantically-similar-but-wrong memories can outscore the one memory with the literal exact term — which is the practical argument for blending keyword and semantic strategies rather than treating semantic search as a strict upgrade over keyword search.
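
The failure mode in Q165 is easy to reproduce with toy numbers. In this sketch the two-dimensional "embeddings" are hand-picked for illustration rather than produced by a real model, and hybrid_rank with its alpha weight is a hypothetical helper, not part of BaseStore.

```python
import math


def cosine(a: list, b: list) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def keyword_score(query: str, text: str) -> float:
    """Fraction of query tokens that appear verbatim in the memory text."""
    query_tokens = set(query.lower().split())
    text_tokens = set(text.lower().split())
    return len(query_tokens & text_tokens) / len(query_tokens) if query_tokens else 0.0


def hybrid_rank(query_text: str, query_vec: list, memories: list, alpha: float = 0.5) -> list:
    """Rank memory ids by alpha * semantic score + (1 - alpha) * keyword score."""
    scored = [
        (alpha * cosine(query_vec, m["vec"]) + (1 - alpha) * keyword_score(query_text, m["text"]), m["id"])
        for m in memories
    ]
    return [memory_id for _, memory_id in sorted(scored, reverse=True)]


memories = [
    {"id": "m1", "text": "Customer reported error E1234 on checkout", "vec": [0.6, 0.8]},
    {"id": "m2", "text": "Customer reported a payment failure at checkout", "vec": [0.8, 0.6]},
]
query_text, query_vec = "error E1234", [0.8, 0.6]

semantic_only = hybrid_rank(query_text, query_vec, memories, alpha=1.0)
blended = hybrid_rank(query_text, query_vec, memories, alpha=0.5)
```

With these toy vectors, semantic_only puts m2 first even though only m1 contains the literal error code, while blended puts m1 first because the keyword term rewards the exact match.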

Q166. How do you test memory-dependent agent behavior without needing an actual persisted store between test runs? Instantiate a fresh InMemoryStore per test, pre-populate it with whatever memories the test scenario assumes already exist, and assert on the agent’s behavior given that fixture — same isolation principle as testing with InMemorySaver instead of a real database-backed checkpointer (Q77).

Q167. What’s the relationship between the store’s namespaces and multi-tenant applications (many separate customers/organizations)? Including a tenant or organization identifier as a namespace segment is the natural way to enforce memory isolation between tenants at the data-access level — critically, this needs to be enforced in your application’s access-control logic (which namespaces a given request is allowed to query), since BaseStore itself doesn’t inherently understand or enforce tenant boundaries on your behalf.
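
A minimal sketch of that application-level check, assuming a convention where the tenant identifier is always the first namespace segment. The convention and both helper names are assumptions made for this example. Nothing here is enforced by BaseStore itself.

```python
def tenant_namespace(tenant_id: str, *segments: str) -> tuple:
    """Build a namespace whose first segment is always the tenant id."""
    return (tenant_id, *segments)


def check_tenant_access(namespace: tuple, tenant_id: str) -> tuple:
    """Return the namespace unchanged if the caller's tenant owns it, else raise."""
    if not namespace or namespace[0] != tenant_id:
        raise PermissionError(f"tenant {tenant_id!r} may not access namespace {namespace!r}")
    return namespace


acme_prefs = tenant_namespace("acme", "memories", "preferences", "user-1")
allowed = check_tenant_access(acme_prefs, "acme")
```

Calling check_tenant_access before every store read or write means a request authenticated for one tenant cannot query another tenant's namespaces, even if a namespace value is mistakenly derived from user input.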

Q168. Why might an interviewer ask “how would you expire old memories” and what’s a reasonable answer? Because unbounded memory accumulation eventually hurts both retrieval quality (more noise competing with genuinely relevant memories) and storage cost. A reasonable answer covers explicit TTL/expiration on writes if the store implementation supports it, or a periodic background job that prunes or archives memories past some age or below some usage/relevance threshold, rather than assuming memory should simply persist forever unmanaged.
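
The periodic-pruning option might look roughly like the following. This is a sketch over a plain dict, assuming each memory carries an updated_at timestamp. The field name and prune_expired are hypothetical, and a real job would run against your actual store's listing and delete operations.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def prune_expired(memories: dict, max_age: timedelta, now: Optional[datetime] = None) -> list:
    """Delete memories not updated within max_age and return the removed keys."""
    now = now or datetime.now(timezone.utc)
    expired = [key for key, item in memories.items() if now - item["updated_at"] > max_age]
    for key in expired:
        del memories[key]
    return expired


now = datetime(2026, 7, 1, tzinfo=timezone.utc)
memories = {
    "old_note": {"value": "asked about a 2024 promo", "updated_at": now - timedelta(days=400)},
    "preference": {"value": "prefers email", "updated_at": now - timedelta(days=10)},
}
removed = prune_expired(memories, max_age=timedelta(days=365), now=now)
```

Passing now explicitly keeps the job deterministic in tests. An archive step could replace the del if you need to keep pruned memories for audit.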

Q169. What’s the difference between “episodic” and “semantic” memory as sometimes discussed in agent-memory design (not to be confused with semantic search)? Episodic memory is memory of specific past events or interactions (“last Tuesday, the user asked about refunds and was frustrated”). Semantic memory (in this cognitive-science sense) is memory of general facts or knowledge, decoupled from any specific episode (“this user prefers email over phone contact”) — both can live in the same BaseStore, but conflating them under a single undifferentiated “memories” namespace makes both harder to retrieve precisely.

Q170. How would you evaluate whether an agent’s memory system is actually helping, rather than just adding latency and cost? Compare task success or user-satisfaction metrics between otherwise-identical agent runs with memory enabled versus disabled on a representative test set — memory that isn’t measurably improving outcomes on real scenarios is a cost (retrieval latency, embedding spend, occasional wrong-memory-retrieved errors) without a demonstrated benefit, and that’s a legitimate finding, not just an implementation bug to fix.

Q171. Can a subgraph (Section 1) have its own isolated store, separate from the parent graph’s store? The store, like the checkpointer, is typically configured at compile time and inherited by subgraphs run within the parent’s execution — but nothing structurally prevents compiling a subgraph independently with its own distinct store reference if a use case genuinely calls for isolated long-term memory scoped only to that subgraph’s concerns.

Q172. What’s a common mistake teams make when first adding long-term memory to an existing agent? Saving every single interaction indiscriminately to long-term memory “just in case,” rather than being deliberate about what’s actually worth remembering (Q161) — this bloats storage, slows retrieval, and often degrades answer quality because irrelevant memories compete with genuinely useful ones during semantic search.

Q173. How does long-term memory interact with the human-in-the-loop patterns from Section 5? A memory-write step is itself a reasonable candidate for a lighter-weight interrupt or at least an audit log — particularly for memories that will influence future automated decisions — since an incorrectly saved “fact” about a user can silently bias every future interaction that retrieves it, which is a harder-to-detect failure than a single bad response.

Q174. What’s the argument for keeping the store and the checkpointer as genuinely separate systems, rather than trying to unify them? They have different natural access patterns and lifecycles — checkpoint history is inherently sequential and thread-scoped (a linear chain of “what happened next”), while long-term memory is inherently associative and cross-cutting (arbitrary facts retrieved by relevance rather than by position in a sequence) — collapsing them into one system tends to produce a data model that’s awkward for both use cases rather than good at either.

Q175. What’s a strong closing answer to “how would you design the memory system for a customer support agent used by thousands of users”? Thread-scoped checkpointing for in-conversation context, a BaseStore namespaced by user ID (and organization ID if multi-tenant) for durable facts like preferences and known account details, semantic search for recalling relevant past interactions without needing exact keys, an explicit retention/expiration policy rather than unbounded accumulation, and — critically — an evaluation harness measuring whether memory retrieval is actually improving resolution quality rather than assuming it does by default.

Section 8: Multi-Agent Architectures (Q176–200)

Once a single agent’s tool-calling loop isn’t enough — because different tasks genuinely need different expertise, prompts, or tool access — you’re into multi-agent territory. This is also where LangGraph, CrewAI, and AutoGen get compared most directly in interviews.

Q176. What’s the core reason to split one agent into multiple agents, rather than giving one agent a very large tool list and a very long system prompt? Focus and reliability — a single agent juggling twenty tools and a sprawling system prompt covering unrelated domains tends to make worse tool-selection and reasoning decisions than several narrower agents, each with a small, coherent tool set and a prompt scoped to one job, coordinated by an explicit handoff mechanism.

Q177. What is the “supervisor” multi-agent pattern? A central supervisor agent receives each turn, decides which specialized sub-agent should handle it, delegates, and (in the common design) receives control back afterward to decide the next step — every handoff is mediated by that single decision-maker rather than sub-agents transferring control directly to each other. Docs: langgraph-supervisor

Q178. What is the “swarm” multi-agent pattern, and how does it differ from supervisor? In a swarm, agents hand off control directly to one another based on their own assessment of which specialist is now needed, and the system tracks which agent was last active so a follow-up turn resumes with the same agent rather than routing back through a central decision-maker — there’s no single supervisor node deciding every handoff. Docs: langgraph-swarm

Q179. Given Q177 and Q178, when would you choose swarm over supervisor? When agent-to-agent handoff is naturally peer-like rather than hierarchical — e.g., a billing agent that, mid-conversation, recognizes a question is actually a technical support issue and hands off directly, versus needing to report back up to a central router first. Choose supervisor when you want one place that owns and can audit every routing decision; choose swarm when direct, decentralized handoff better matches how the work actually flows.

Q180. How is a “handoff” between agents typically implemented at the LangGraph primitive level? Via Command(goto=target_agent_name, update={...}) returned from the active agent (or from a dedicated handoff tool it calls) — the same Command primitive from Section 3, just used at the granularity of “which whole agent runs next” rather than “which node within one agent’s internal graph runs next.”

Q181. What state is typically shared versus kept private when multiple agents operate in the same graph? Shared state commonly includes the conversation’s message history (so each agent has full context of what’s happened) and any task-level facts relevant across agents; agent-private state (an agent’s own scratch reasoning, its own tool-call intermediate results) is often kept out of the shared schema, sometimes by giving each agent its own subgraph with a narrower internal state.

Q182. What’s a network (as opposed to supervisor or swarm) multi-agent topology? A topology where any agent can potentially route to any other agent directly, without a strict hierarchy (supervisor) or a simple last-active-agent handoff convention (swarm) — more flexible, but correspondingly harder to reason about and debug, since there’s no single place enforcing which handoffs are actually sensible.

Q183. Why is “shared state is the default” (in both langgraph-supervisor and langgraph-swarm) worth calling out explicitly in an interview? Because it means every sub-agent, by default, sees the full conversation and shared context rather than operating in an isolated bubble — which is usually desirable for coherence, but it also means a sub-agent’s internal reasoning or scratch state can leak into what other agents see unless you’ve deliberately scoped what’s actually shared versus kept in a private subgraph state.

Q184. How would you prevent one agent’s tool-calling loop from interfering with another agent’s, if both share the same top-level state? Give each agent’s internal tool-calling logic (its own create_react_agent-built subgraph, for instance) its own internal state scope for intermediate tool-call bookkeeping, and only pass the narrower, agreed-upon shared fields (final results, conversation history) up into the parent multi-agent graph’s shared state — the same input/output schema separation from Q35, applied at the multi-agent level.

Q185. What’s a realistic multi-agent interview scenario, and how would you talk through designing it? “Design a multi-agent system for an e-commerce support bot: order status, returns, and general product questions.” A strong answer identifies three specialized agents (each with its own narrow tool set — order lookup API, returns/refund API, product catalog search), a supervisor (or swarm, justified by reasoning like Q179) deciding routing, shared conversation history, and an explicit human-in-the-loop gate on the returns agent specifically, since refunds are the one action here with real financial consequence.

Q186. How does multi-agent design change your approach to human-in-the-loop compared to a single-agent system? You need to decide not just whether a given action needs approval but which agent’s actions need it — a refund-approval gate belongs specifically on the returns agent’s side effects, not globally across every agent’s every action, since gating everything defeats the purpose of specialization and gating nothing misses the one agent whose mistakes are actually expensive.

Q187. What’s a common reliability failure mode specific to multi-agent systems that doesn’t show up in single-agent graphs? Handoff loops — Agent A hands off to Agent B, which (misjudging the situation) hands back to A, which hands back to B again, with no forward progress and no single agent clearly “stuck” in the way a single-agent tool-loop failure would be. Guarding against this usually means tracking handoff count/history in shared state and routing to a human or a fallback path if handoffs exceed a sane threshold.
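
A guard of that kind can be sketched as a pure function over shared state. MAX_HANDOFFS, the handoff_history field, and the human_escalation target are all hypothetical choices for this example. In a real graph the returned dict would feed whatever routing mechanism you use.

```python
MAX_HANDOFFS = 4  # hypothetical threshold; tune for your system


def guarded_handoff(state: dict, target: str) -> dict:
    """Record the handoff and divert to a fallback once too many have piled up."""
    history = state.get("handoff_history", []) + [target]
    if len(history) > MAX_HANDOFFS:
        # Too many handoffs without resolution: stop bouncing and escalate.
        return {"next_agent": "human_escalation", "handoff_history": history}
    return {"next_agent": target, "handoff_history": history}


# Simulate two agents handing the conversation back and forth.
state = {}
for target in ["billing", "support", "billing", "support", "billing"]:
    state = guarded_handoff(state, target)
```

After the fifth handoff in the simulated ping-pong, next_agent is the escalation path rather than billing, and the recorded history shows exactly which agents were bouncing.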

Q188. How would you evaluate a multi-agent system’s routing quality specifically, separate from evaluating each agent’s individual task performance? Build a labeled test set of representative inputs with the correct target agent (or correct sequence of agents) for each, and measure the supervisor’s (or swarm’s handoff logic’s) routing accuracy against those labels independently of whether each individual specialist agent then performs its task well — routing and task execution are separate failure modes that need separate evaluation to debug effectively.
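
A routing evaluation needs little more than a labeled list and a loop. In the sketch below, keyword_router is a deliberately naive stand-in for a supervisor's routing call, and the agent names and test cases are invented for illustration.

```python
def keyword_router(text: str) -> str:
    """Toy stand-in for a supervisor's routing decision."""
    lowered = text.lower()
    if "refund" in lowered or "return" in lowered:
        return "returns_agent"
    if "order" in lowered:
        return "orders_agent"
    return "product_agent"


def routing_accuracy(route, labeled_cases: list) -> tuple:
    """Return (accuracy, misroutes) for a routing function over labeled inputs."""
    misroutes = []
    for text, expected in labeled_cases:
        actual = route(text)
        if actual != expected:
            misroutes.append((text, expected, actual))
    correct = len(labeled_cases) - len(misroutes)
    return correct / len(labeled_cases), misroutes


cases = [
    ("Where is my order #123?", "orders_agent"),
    ("I want a refund for my headphones", "returns_agent"),
    ("Does this laptop have HDMI?", "product_agent"),
    ("I need my money back for order #9", "returns_agent"),
]
accuracy, misroutes = routing_accuracy(keyword_router, cases)
```

Here the router handles three of four cases, and the misroutes list names the failing input: a refund request phrased without the word "refund" gets sent to the orders agent. That is a routing problem regardless of how well the returns agent itself performs.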

Q189. Can a subgraph-based agent in a multi-agent system have its own separate checkpointer or must it share the parent’s? In common usage each agent’s subgraph inherits the checkpointer configured on the overall compiled graph, since persistence is generally meant to capture the whole multi-agent run coherently as one thread — genuinely isolating one agent’s checkpointing from the rest is an advanced, less common configuration you’d only reach for with a specific isolation requirement.

Q190. What’s the tradeoff of a deeply hierarchical (supervisor-of-supervisors) multi-agent design versus a flatter one? Hierarchy scales your ability to reason about large numbers of specialized agents by grouping them under intermediate supervisors, but each additional layer adds latency (more round trips before a request reaches the agent that actually does the work) and makes end-to-end tracing harder — a flatter design is easier to debug and faster per-turn, at the cost of one supervisor’s routing prompt eventually growing unwieldy if it’s coordinating too many peers directly.

Q191. How do you decide the boundary of what counts as “one agent” versus “two agents that should be merged into one”? If two capabilities always get invoked together, need the same tools, and never operate independently of each other in practice, splitting them into separate agents usually just adds handoff overhead without a real specialization benefit — the split is worth it when the two capabilities have genuinely different tool sets, prompts, or failure characteristics that benefit from being reasoned about (and evaluated, and gated) separately.

Q192. What’s the relationship between multi-agent architectures in LangGraph and the “Deep Agents” pattern? Deep Agents (built on top of LangGraph) is specifically about a planning agent that manages subagents, a virtual file system, and long-running task decomposition — it’s one particular, opinionated multi-agent shape (a main agent spawning and coordinating scoped subagents for pieces of a larger task) rather than a general-purpose alternative to supervisor/swarm; you’d reach for it specifically when a task benefits from explicit planning and file-based state rather than direct conversational handoff.

Q193. Why might streaming (Section 9) be more complicated in a multi-agent graph than a single-agent one? Because you now need to attribute streamed tokens, tool calls, and lifecycle events to the specific agent (and, in a hierarchical design, the specific subgraph nesting level) producing them — which is exactly the problem stream.subgraphs (from the v3 event-streaming API) is designed to solve, surfacing each nested agent’s execution as its own object rather than a flat, unattributed event stream.

Q194. What’s a good way to explain, to a skeptical stakeholder, why a multi-agent system costs more to run than one agent handling everything? Every handoff typically involves at least one additional model call (the supervisor deciding where to route, or the current agent deciding to hand off) on top of whatever work the specialist agent itself does — multi-agent design trades some additional per-request cost and latency for better task-specific reliability and easier evaluation/maintenance of each specialist independently, and that tradeoff needs to be justified by the actual reliability gain, not assumed.

Q195. How would you test handoff logic in isolation, the way Q73 tested a single routing function? Treat the handoff decision (whichever agent or function produces it — a supervisor’s routing call, or a specific agent’s decision to hand off) as its own unit under test: feed it representative conversation states and assert on the resulting Command(goto=...) or equivalent routing output, independent of actually running the target agent’s full logic.

Q196. What’s a design smell suggesting a multi-agent system has been over-decomposed? If most conversations end up bouncing through three or four agent handoffs before reaching the one that actually resolves the user’s request, and those handoffs rarely change based on context (they’re nearly always the same fixed sequence), that fixed sequence is arguably just one agent’s internal steps that got needlessly split into separate agents with handoff overhead in between.

Q197. How does shared long-term memory (Section 7) interact with a multi-agent system where each agent has a different “personality” or role? The store is typically shared across all agents in the system (same user, same namespace scheme) since a fact learned by one specialist agent is usually relevant regardless of which agent handles the user’s next request — the alternative (siloed memory per agent) tends to produce a confusing experience where the user has to repeat context depending on which specialist happens to pick up their next message.

Q198. What’s a strong way to open an answer about “LangGraph vs CrewAI vs AutoGen” specifically for multi-agent design (full comparison in Section 10)? Frame it around control granularity: LangGraph gives you explicit graph-level control over every handoff, piece of shared state, and human-in-the-loop gate, at the cost of writing more of that structure yourself (or using langgraph-supervisor/langgraph-swarm as a starting point); CrewAI and AutoGen ship more opinionated, higher-level multi-agent abstractions out of the box, trading some of that fine-grained control for faster initial setup of common patterns.

Q199. What’s a realistic failure an interviewer might describe and ask you to diagnose: “our supervisor keeps routing refund requests to the general-question agent”? Start with the supervisor’s routing prompt and few-shot examples (is “refund” actually represented clearly enough for the model to distinguish it from a general question), then check whether the returns agent’s own description (what the supervisor sees when deciding where to route) is specific enough — this is fundamentally the same class of problem as Q135’s tool-description reliability issue, just applied to agent descriptions instead of tool descriptions.

Q200. What’s a strong closing statement on multi-agent design for a senior-level interview? Multi-agent architecture is a reliability and maintainability tool, not a default — the right number of agents is however many distinct, testable specializations your problem actually has, coordinated by whichever handoff pattern (supervisor for centralized auditability, swarm for direct peer handoff) matches how work naturally flows between them, with shared memory and state scoped deliberately rather than shared by default just because it’s the path of least resistance.

Section 9: Streaming & Observability (Q201–225)

Streaming is where a lot of candidates who understand the graph model conceptually still trip up on API specifics — the modes changed meaningfully between v1, v2, and v3. For the full walkthrough with a FastAPI + React implementation, see our dedicated LangGraph streaming guide.

Q201. What stream modes does LangGraph’s stream()/astream() API support? values (full state after each step), updates (only the changed keys per step), messages (LLM token chunks with metadata), custom (arbitrary data emitted via get_stream_writer()), checkpoints, tasks, and debug (checkpoints + tasks combined with extra metadata). Docs: Streaming

Q202. What’s the difference between stream_mode="values" and stream_mode="updates"? values yields the complete state snapshot after every step, whether or not that particular step changed a given field. updates yields only the keys a step actually changed, scoped by node name, which is more bandwidth-efficient and the natural choice for a “node X just finished” progress indicator rather than re-sending the whole state repeatedly.

Q203. How do you stream LLM tokens specifically, and what shape does that data come in? Via stream_mode="messages", which yields (message_chunk, metadata) tuples — the chunk being the incremental piece of the LLM’s response, and metadata including which node and which tagged model invocation produced it, letting you filter tokens by node or by tag if multiple models are involved in one graph.

Q204. What’s get_stream_writer() used for, and what’s the constraint on using it in async code on older Python? It lets a node or tool emit arbitrary custom data mid-execution (progress percentages, intermediate status) that surfaces via stream_mode="custom". On Python versions below 3.11, get_stream_writer() doesn’t work inside async functions because those Python versions don’t propagate context automatically across asyncio tasks — you pass a writer parameter explicitly to the node/tool instead.

Q205. What changed between stream_mode’s v1 and v2 output formats? v1’s output shape depends on your options (a single mode returns raw data, multiple modes return (mode, data) tuples, subgraph streaming returns (namespace, data) tuples) — three different shapes depending on configuration. v2 unifies all of that into one consistent StreamPart dict — {"type": ..., "ns": ..., "data": ...} — regardless of how many modes or whether subgraphs are involved.
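
To see why one shape is easier to consume, consider what a consumer has to do to normalize v1-style chunks itself. The helper below is a hypothetical illustration written against the shapes described above, plus the combined (namespace, mode, data) tuple that appears when multiple modes and subgraph streaming are used together. It is not a LangGraph function.

```python
def to_stream_part(chunk, *, mode=None, multi_mode=False, subgraphs=False) -> dict:
    """Normalize a v1-style chunk into a v2-style {"type", "ns", "data"} dict."""
    if subgraphs and multi_mode:
        ns, chunk_mode, data = chunk  # (namespace, mode, data)
    elif subgraphs:
        ns, data = chunk              # (namespace, data)
        chunk_mode = mode
    elif multi_mode:
        chunk_mode, data = chunk      # (mode, data)
        ns = ()
    else:
        ns, chunk_mode, data = (), mode, chunk  # raw data only
    return {"type": chunk_mode, "ns": tuple(ns), "data": data}


single = to_stream_part({"count": 1}, mode="updates")
multi = to_stream_part(("values", {"count": 1}), multi_mode=True)
nested = to_stream_part((("child:1",), {"count": 1}), mode="updates", subgraphs=True)
```

The caller has to remember which options the stream was opened with just to unpack a chunk correctly. With v2, that bookkeeping disappears because every part already arrives in the normalized shape.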

Q206. What is the v3 event-streaming API, at a conceptual level, and how is it different from v1/v2 stream_mode? v3 (graph.stream_events(..., version="v3")) sits one layer above raw stream_mode output: instead of you branching on chunk shapes or StreamPart.type, it exposes typed projections — stream.messages, stream.values, stream.subgraphs, stream.output — built on a content-block protocol that gives text, reasoning, and tool-call boundaries explicit structure, so the framework does the correlation work v1/v2 leave to your consumer code. Docs: Event streaming

Q207. Show the v3 equivalent of streaming tokens, compared to the v2 pattern.

# v2
async for part in graph.astream(input, stream_mode="messages", version="v2"):
    if part["type"] == "messages":
        msg, meta = part["data"]
        print(msg.content, end="")

# v3
stream = graph.stream_events(input, version="v3")
for message in stream.messages:
    for token in message.text:
        print(token, end="")

v3 gives you one stream object per LLM call via stream.messages, which removes the need to track “which node is this token from” yourself to avoid concatenating unrelated model calls together.

Q208. What’s a content block, and why does the messages channel model output that way? A content block is a discrete unit of an LLM’s output — text, reasoning, or tool-call arguments — with explicit message-start / content-block-start / content-block-delta / content-block-finish / message-finish boundaries, so a consumer can tell unambiguously where one kind of content ends and another begins, rather than inferring it from provider-specific formatting.

Q209. Where do reasoning tokens surface in v3, and why is that a common source of confusion? On message.reasoning, separate from message.text. Reading only .text means you silently miss all reasoning-model “thinking” tokens — which also means a model that’s reasoning at length produces no visible .text output for a stretch, which without knowing to check .reasoning looks like the stream has stalled.

Q210. What is a StreamTransformer and when would you write a custom one? An interface (init(), process(event), finalize(), fail(err)) for building a custom projection over the raw event stream — write one when none of the built-in projections (stream.messages, stream.values, stream.subgraphs) give you the derived view you need, like aggregate token-usage tracking or a bespoke progress indicator. Docs: Event streaming — custom projections
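
The four-method lifecycle can be sketched in plain Python. The class below is a stand-in that only mirrors the method names listed above; it does not subclass or import the real StreamTransformer, and the "node-finished" event type, the `drive` helper, and the node names are hypothetical.

```python
class ProgressProjection:
    """Stand-in with init/process/finalize/fail, tracking finished nodes."""

    def __init__(self, expected_steps):
        self.expected_steps = expected_steps

    def init(self):
        # Reset per-run state before any event arrives.
        self.completed = []
        self.error = None

    def process(self, event):
        # Hypothetical event shape: {"type": "node-finished", "node": "..."}.
        if event.get("type") == "node-finished":
            self.completed.append(event["node"])

    def finalize(self):
        percent = round(100 * len(self.completed) / self.expected_steps)
        return {"completed": list(self.completed), "percent": percent}

    def fail(self, err):
        self.error = str(err)


def drive(transformer, events):
    """Hypothetical driver showing the order in which the methods are called."""
    transformer.init()
    try:
        for event in events:
            transformer.process(event)
    except Exception as err:
        transformer.fail(err)
        raise
    return transformer.finalize()


progress = drive(
    ProgressProjection(expected_steps=4),
    [
        {"type": "node-finished", "node": "lookup_order"},
        {"type": "token", "text": "hi"},
        {"type": "node-finished", "node": "draft_reply"},
    ],
)
print(progress)
```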

Q211. What does required_stream_modes control on a StreamTransformer, and what’s the consequence of forgetting to declare a mode? It declares which raw Pregel channels ("messages", "custom", etc.) the graph must actually emit for that transformer to see anything — the runtime takes the union across every registered transformer’s declared modes. Forget to declare "custom", for example, and your transformer’s process() simply never receives custom events at all, silently, rather than raising an obvious error.

Q212. What’s the difference between a named and an unnamed StreamChannel? A named channel (StreamChannel("my_projection")) both exposes an iterable under stream.extensions and forwards each pushed value into the main event stream as a custom: event — meaning its payload must be JSON-serializable. An unnamed channel (StreamChannel()) is side-channel only, the right choice for projections holding in-process objects (promises, class instances) that can’t be serialized.

Q213. How do you consume multiple projections in strict arrival order rather than picking just one? In synchronous code, stream.interleave("values", "messages", "subgraphs") yields items from all three projections in the order they actually occurred. In async code, the equivalent for concurrent consumption is to asyncio.gather over separately iterated projections.

Q214. Why does a real-time streaming UI need a keepalive mechanism, and what does that have to do with reasoning models specifically? A reasoning model can produce zero message.text output for tens of seconds while reasoning (those tokens are on .reasoning, not .text) — an idle SSE or WebSocket connection with no traffic for that long often gets dropped by an intermediate proxy that assumes the connection is dead, so you emit a periodic empty “keepalive” frame to hold the connection open regardless of whether the model has produced visible content yet.
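
A keepalive wrapper can be sketched with the standard library alone. This is one possible approach, assuming the upstream is any async iterator of already formatted SSE frames; `with_keepalive` and `slow_model` are hypothetical names. It relies on the SSE rule that a line starting with a colon is a comment, which clients ignore.

```python
import asyncio

KEEPALIVE = ": keepalive\n\n"  # SSE comment frame; ignored by EventSource clients


async def with_keepalive(source, interval):
    """Yield frames from `source`, emitting KEEPALIVE after `interval` idle seconds."""
    iterator = source.__aiter__()
    pending = asyncio.ensure_future(iterator.__anext__())
    try:
        while True:
            # asyncio.wait does not cancel the pending read on timeout,
            # so the upstream iterator is left undisturbed.
            done, _ = await asyncio.wait({pending}, timeout=interval)
            if not done:
                yield KEEPALIVE
                continue
            try:
                item = pending.result()
            except StopAsyncIteration:
                return
            yield item
            pending = asyncio.ensure_future(iterator.__anext__())
    finally:
        pending.cancel()


async def slow_model():
    await asyncio.sleep(0.2)  # stands in for a model reasoning with no visible text
    yield "data: hello\n\n"


async def collect():
    return [frame async for frame in with_keepalive(slow_model(), interval=0.05)]


frames = asyncio.run(collect())
print(frames)
```

The design choice worth noting is `asyncio.wait` rather than `asyncio.wait_for`: the latter cancels the awaited read on timeout, which would tear down the upstream generator instead of merely noticing that it is quiet.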

Q215. What HTTP response header commonly needs setting to prevent a reverse proxy from buffering an SSE stream, defeating the point of streaming? X-Accel-Buffering: no (for Nginx-style proxies) — without it, the proxy can buffer the entire response and deliver it all at once instead of passing chunks through as they arrive, which silently turns a “streaming” endpoint back into a blocking one from the client’s perspective.
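
As a small illustration, the headers and frame format for an SSE endpoint might be assembled like this. The helper names are hypothetical; the header names and the `event:` / `data:` line format are standard SSE and Nginx conventions.

```python
def sse_headers():
    """Response headers for a streaming endpoint behind an Nginx-style proxy."""
    return {
        "Content-Type": "text/event-stream",
        "Cache-Control": "no-cache",
        "X-Accel-Buffering": "no",  # ask the proxy to pass chunks through unbuffered
    }


def sse_frame(data, event=None):
    """Format one SSE frame; multi-line data becomes one `data:` line per line."""
    lines = []
    if event:
        lines.append(f"event: {event}")
    for line in data.splitlines() or [""]:
        lines.append(f"data: {line}")
    return "\n".join(lines) + "\n\n"


print(sse_headers()["X-Accel-Buffering"])
print(sse_frame("partial answer", event="token"), end="")
```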

Q216. What’s the useStream() React hook, and what problem does it solve versus hand-rolling an EventSource client? A hook from @langchain/langgraph-sdk that handles message accumulation, loading state, interrupt detection, and conversation branching for a graph deployed behind an Agent Server — it removes the need to hand-write the token-accumulation and state-tracking logic a raw EventSource consumer would otherwise require, at the cost of expecting that specific deployment shape rather than an arbitrary custom backend. Docs: useStream React reference

Q217. What does stream.tool_calls (via the built-in ToolCallTransformer) give you that raw messages-channel parsing doesn’t? Tool calls already correlated by ID with their execution results — rather than separately tracking a tool-call content block from the messages channel and matching it up yourself with a later tools channel event carrying the result, the transformer has already joined them into one coherent object for you.

Q218. Structured output (JSON mode) streaming — what’s the honest limitation, on any streaming version? The token stream for structured/JSON-mode output is characters — braces, quotes, partial field names — not readable prose, so streaming it token-by-token to a UI is rarely useful on its own. The authoritative, usable result is the final parsed state (stream.output or the equivalent values/updates payload), not something reassembled from the raw token stream.
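
The limitation is easy to demonstrate with the standard library: a JSON document is not parseable until the last token arrives. The token split below is made up for illustration.

```python
import json

# Hypothetical token boundaries for a structured-output response.
TOKENS = ['{"ref', 'und": tr', 'ue, "amo', 'unt": 42}']


def parse_attempts(tokens):
    """Try to parse the accumulated buffer after each token; None means not yet valid."""
    buffer = ""
    attempts = []
    for token in tokens:
        buffer += token
        try:
            attempts.append(json.loads(buffer))
        except json.JSONDecodeError:
            attempts.append(None)
    return attempts


attempts = parse_attempts(TOKENS)
print(attempts)
```

Every intermediate buffer fails to parse, which is why a UI should render the final parsed object rather than the raw token stream.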

Q219. What’s the difference between LangSmith tracing and LangGraph’s own streaming/checkpointing for observability purposes? Streaming and checkpoints give you real-time and historical visibility into a specific run’s state and outputs. LangSmith tracing is a separate observability product that captures spans across an entire run (and across runs) for aggregate analysis — latency percentiles, error rates, prompt/response inspection across many executions — which individual-run streaming and checkpoint inspection alone doesn’t give you at that aggregate level. Docs: LangSmith Observability

Q220. How would you compute time-to-first-token for a LangGraph-backed endpoint? Timestamp the moment the first item comes out of message.text (or the first content-block-delta text event on the raw channel) relative to when the request/run started — on v3 this is a single, clearly-defined event to timestamp; on v2 you’d reconstruct the same measurement from the first non-empty content chunk in a noisier raw delta stream.
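
A minimal sketch of the measurement, assuming `tokens` is any iterable of text chunks (for example, whatever your streaming layer yields for a message's text). The function name and the injectable `clock` parameter are hypothetical conveniences for testing.

```python
import time


def time_to_first_token(tokens, clock=time.perf_counter):
    """Return (seconds until the first non-empty chunk, that chunk)."""
    start = clock()
    for token in tokens:
        if token:  # skip empty chunks so they do not count as a first token
            return clock() - start, token
    return None, None


# Deterministic demo with a fake clock: the first reading is the start time,
# the second reading is taken when the first non-empty chunk arrives.
fake_clock = iter([10.0, 10.35]).__next__
ttft, first = time_to_first_token(["", "", "Hel", "lo"], clock=fake_clock)
print(first, round(ttft, 2))
```

In a real endpoint the start timestamp should be taken when the request arrives, not when iteration begins, so that queueing and graph setup time are included.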

Q221. What’s the observability argument for run.lifecycle (or stream.lifecycle) beyond just knowing when a run finishes? It emits started/running/completed/failed/interrupted transitions per run, subgraph, and subagent — meaning per-node and per-subgraph latency and failure attribution become structured projections you can pipe straight into a metrics/tracing exporter, rather than something you’d otherwise have to reconstruct from application logs after the fact.

Q222. Why does client disconnect handling matter specifically for streaming endpoints backed by an LLM call? If a client disconnects mid-stream and your server keeps consuming the upstream graph’s stream to completion anyway, you’re paying for (and generating) tokens nobody will ever read — checking for disconnect (e.g., request.is_disconnected() in a FastAPI generator) and aborting the underlying run is both a cost-control and a resource-hygiene concern, not just a UX nicety.
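
The pattern can be sketched without a web framework. Here `FakeRequest` stands in for a request object with an awaitable `is_disconnected()` method, `fake_model` stands in for the upstream graph stream, and `relay` is a hypothetical helper name.

```python
import asyncio

produced = []  # records how many tokens the upstream actually generated


async def fake_model():
    for i in range(100):
        produced.append(i)
        yield f"tok{i}"


class FakeRequest:
    """Reports a disconnect after a fixed number of checks."""

    def __init__(self, disconnect_after):
        self.checks = 0
        self.disconnect_after = disconnect_after

    async def is_disconnected(self):
        self.checks += 1
        return self.checks > self.disconnect_after


async def relay(tokens, is_disconnected):
    """Forward tokens until the client disconnects, then close the upstream."""
    try:
        async for token in tokens:
            if await is_disconnected():
                break
            yield token
    finally:
        # Closing the async generator stops further upstream work.
        await tokens.aclose()


async def main():
    request = FakeRequest(disconnect_after=2)
    return [t async for t in relay(fake_model(), request.is_disconnected)]


sent = asyncio.run(main())
print(sent, len(produced))
```

Only three tokens are generated out of a possible hundred. With a real graph run, closing the stream is the point where you would also cancel the underlying run.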

Q223. What’s a message-finish error event, and why does it matter for error handling in a streaming UI? It’s how an unrecoverable failure during a specific LLM call surfaces on the messages channel — as a structured error attached to that message’s finish event, rather than as an exception that abruptly kills the whole stream mid-transmission. Handling it explicitly is what turns a mid-stream model failure into a clean, user-visible error state instead of a stream that just silently stops.

Q224. If you’re already running a production system on v2 stream_mode, what’s a reasonable, honest answer to “should you migrate to v3”? Migrate when you specifically need reasoning-delta streaming, tool-call-argument streaming, or clean per-call usage metadata — v2 can technically get you all three, just with meaningfully more hand-written bookkeeping (accumulators, tuple unpacking, manual correlation). If the current v2 consumer is working and none of those specific needs are pressing, the ergonomic improvement alone usually doesn’t justify reworking a shipping path immediately.

Q225. What’s a strong way to summarize LangGraph’s streaming story across all three versions in one interview answer? v1 exposes the rawest, least consistent shape. v2 unifies that into one consistent StreamPart dict you still branch on manually. v3 moves the branching logic into the framework itself via typed projections over an explicit content-block protocol — the throughline across all three is the same underlying Pregel event stream, just progressively more structured and less work for the consumer to parse correctly.

Section 10: Production, Deployment & Framework Comparisons (Q226–250)

The closing section — deployment mechanics, and the comparison questions (“why LangGraph and not X”) that senior and architect-level interviews lean on heavily.

Q226. What is LangGraph Platform, and what is it called now? A managed hosting and deployment layer for LangGraph applications. It was renamed “LangSmith Deployment” in late 2025, reflecting its integration into the broader LangSmith product rather than standing as a separately branded platform. Docs: LangSmith Deployment

Q227. What are the three deployment options under LangSmith Deployment? Cloud (fully managed SaaS, fastest to get started, available on Plus and Enterprise plans), Hybrid (SaaS control plane with a self-hosted data plane, so sensitive data stays in your infrastructure while LangChain manages the control layer — Enterprise only), and fully self-hosted (the entire platform runs in your own infrastructure with no data leaving your VPC).

Q228. What is langgraph.json and what does it configure? The configuration file the LangGraph CLI reads by default to build and deploy an application — it declares things like which graphs to expose, dependencies, environment variables, and (for semantic memory) store/embedding configuration, functioning as the deployment manifest for the application.
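
For orientation, a minimal manifest might look like the output of the sketch below. The graph name `support_agent` and the path `./src/agent.py:graph` are hypothetical; check the CLI documentation for the full set of supported keys.

```python
import json

# Minimal manifest: where to install dependencies from, which compiled graph
# objects to expose (as "path/to/module.py:variable"), and which env file to load.
manifest = {
    "dependencies": ["."],
    "graphs": {"support_agent": "./src/agent.py:graph"},
    "env": ".env",
}

text = json.dumps(manifest, indent=2)
print(text)  # this is the content you would save as langgraph.json
```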

Q229. What is an “assistant” in LangGraph Platform/LangSmith Deployment terms, and how does it differ from a graph? An assistant is a configured, named instance of a graph — the same underlying graph definition can back multiple assistants with different configuration (different prompts, different model choices) without duplicating the graph’s code, and assistants can be composed as “remote graphs” to build multi-agent systems across separately deployed services.

Q230. What do the disable_assistants, disable_runs, disable_threads, and disable_store configuration flags do? They selectively turn off groups of the platform’s built-in HTTP routes for a deployment — useful when you want to expose only a subset of the platform’s default API surface (for security, simplicity, or because your application handles that concern itself elsewhere) rather than the full default route set.

Q231. How would you decide between LangSmith Deployment (managed) and self-hosting your own FastAPI + checkpointer setup? Managed deployment trades some infrastructure control for faster setup, built-in scaling, and integrated tracing/observability out of the box — reasonable defaults for most teams. Self-hosting makes sense when you have specific infrastructure requirements (data residency, existing deployment tooling, cost optimization at very large scale) that the managed offering doesn’t accommodate, or when you need tighter control over the exact request/response surface than the platform’s default routes provide.

Q232. What’s a realistic production checklist item people forget when testing a graph’s behavior under retry (Section 4)? Confirming that nodes performing side effects (an API call, a database write) are either naturally idempotent or explicitly guarded against duplicate execution. RetryPolicy’s automatic retries assume a node can safely re-run, so a node that charges a payment or sends a notification without idempotency protection will do so twice on a retried transient failure. That is a correctness bug that only shows up under real failure conditions, not in happy-path testing.
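
One common guard is an idempotency key derived from stable identifiers in the state. The sketch below uses a hypothetical in-memory `PaymentGateway` and a hypothetical `charge_node`; a real gateway would persist the keys it has seen.

```python
class PaymentGateway:
    """Hypothetical external system that deduplicates on an idempotency key."""

    def __init__(self):
        self.charges = []     # every charge actually applied
        self.seen_keys = {}   # idempotency key -> receipt already issued

    def charge(self, amount, idempotency_key):
        if idempotency_key in self.seen_keys:
            return self.seen_keys[idempotency_key]  # replay: no second charge
        receipt = f"receipt-{len(self.charges) + 1}"
        self.charges.append(amount)
        self.seen_keys[idempotency_key] = receipt
        return receipt


def charge_node(state, gateway):
    """Node body: the key depends only on state, so a retry reuses it."""
    key = f"{state['thread_id']}:{state['order_id']}:charge"
    return {"receipt": gateway.charge(state["amount"], key)}


gateway = PaymentGateway()
state = {"thread_id": "t-1", "order_id": "o-42", "amount": 25.0}
first = charge_node(state, gateway)
retry = charge_node(state, gateway)  # simulates an automatic retry of the node
print(first, retry, gateway.charges)
```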

Q233. How would you approach load-testing a LangGraph-backed API before a production launch? Separate the concerns: LLM-call latency and cost scale with concurrent request volume regardless of your graph’s structure, while checkpointer write throughput (especially on a shared Postgres instance under concurrent threads) is a distinct bottleneck worth testing independently — a load test that only measures end-to-end request latency can mask which of those two very different systems is actually the constraint under real traffic.

Q234. What’s the case for LangGraph over CrewAI, stated fairly? LangGraph gives you explicit, low-level control over state, control flow, persistence, and human-in-the-loop — which matters when your application’s requirements go beyond CrewAI’s more opinionated role-based crew abstraction, or when you need fine-grained checkpointing and time travel that CrewAI’s higher-level abstraction doesn’t expose as directly.

Q235. What’s the case for CrewAI over LangGraph, stated fairly (the other half of Q234)? CrewAI’s role-based abstraction (agents with defined roles, goals, and a crew that coordinates them) gets a working multi-agent system running faster for teams whose use case fits that model well, without needing to hand-design a graph’s nodes, edges, and state schema from scratch — the tradeoff is less low-level control in exchange for a faster path to a common pattern.

Q236. What’s the core architectural difference between LangGraph and AutoGen? AutoGen is built around conversational message-passing between agents as the primary abstraction — agents “talk” to each other in a structured conversation loop. LangGraph is built around explicit state and graph topology as the primary abstraction, with agent conversation being one possible pattern you can construct on top of that state machine, not the framework’s foundational unit.

Q237. When would LangGraph specifically be a stronger choice than either CrewAI or AutoGen? When you need fine-grained persistence and time travel, precise human-in-the-loop gating on specific actions (not just at agent boundaries), or a control-flow shape that doesn’t map cleanly onto either “crew of role-based agents” or “conversational message passing” — cases where the underlying graph model’s flexibility is worth the additional upfront design work.

Q238. What’s the honest limitation interviewers want you to name about LangGraph itself, not just its competitors? It’s genuinely lower-level than CrewAI or AutoGen for common multi-agent patterns — you (or a library like langgraph-supervisor) have to construct the routing and handoff logic that those frameworks provide more directly out of the box, which is more upfront work for standard cases even though it pays off in control for non-standard ones.

Q239. What’s the core difference between LangGraph and Temporal, restated for a production-deployment framing (deeper mechanics in Q81–82)? LangGraph models agent reasoning, state, and tool use with checkpoint-based resumability between steps. Temporal is a durable-execution engine for arbitrary long-running workflows with event-history-backed replay durability within a step, not just between steps, and no built-in concept of prompts, context windows, or LLM-specific state — the common production pattern pairs LangGraph for the agent’s reasoning layer with Temporal underneath for workflows with expensive, must-not-repeat side effects.

Q240. What’s a good interview framing for “why not just build this without any framework, in plain Python”? A hand-rolled state machine can absolutely work for a simple case, but you re-implement checkpointing, replay, human-in-the-loop pausing, streaming, and retry semantics yourself — LangGraph’s value isn’t “you couldn’t build this otherwise,” it’s that these cross-cutting production concerns are already solved and tested, letting your team’s effort go into the actual application logic rather than re-deriving durable state-machine infrastructure.

Q241. What’s a realistic system-design interview prompt combining multiple sections of this guide, and how would you structure an answer? “Design a production customer-support agent that can look up orders, process refunds under $200 autonomously, escalate larger refunds to a human, and remember customer preferences across conversations.” A strong answer touches: a supervisor or single agent with tool access (Section 6), interrupt() gating specifically on refunds above the threshold (Section 5), a PostgresSaver checkpointer for conversation persistence (Section 4) and a PostgresStore with semantic search for cross-conversation memory (Section 7), and RetryPolicy on the order-lookup and refund-processing nodes specifically, given they call external systems (Section 4).
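
The refund gate in that design reduces to a small routing function, which is also the kind of unit you can test without any LLM call. The node names `process_refund` and `human_review` and the state key `refund_amount` are hypothetical.

```python
REFUND_AUTO_LIMIT = 200.00  # refunds strictly under this amount are processed autonomously


def route_refund(state):
    """Return the name of the next node for a refund request."""
    if state["refund_amount"] < REFUND_AUTO_LIMIT:
        return "process_refund"
    return "human_review"  # the node that would pause for approval


print(route_refund({"refund_amount": 49.99}))
print(route_refund({"refund_amount": 200.00}))
```

The boundary case matters: the prompt says "under $200", so a refund of exactly $200 goes to a human.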

Q242. How would you explain LangGraph’s testing story across unit, integration, and evaluation tiers? Unit-test individual nodes and routing functions in isolation (Q49, Q73) with hand-built state fixtures and no LLM calls; integration-test the compiled graph’s end-to-end behavior with a stub/fake chat model producing scripted responses (Q146) to verify control flow without live-API cost or flakiness; and separately run evaluation (via LangSmith or a custom harness) against real or near-real model behavior to measure task success, which unit and integration tests deliberately don’t cover since they’re testing structure, not model quality.

Q243. What’s a reasonable answer to “how do you version an agent’s behavior in production without breaking existing conversations”? Assistants (Q229) let you version configuration (prompts, models) somewhat independently of the underlying graph’s code; for actual graph-structure changes, treat it like any schema migration (Q90) — new threads get the new graph shape, in-flight threads either need a migration path for their checkpoint history or continue running against the version they started on until they naturally conclude.

Q244. What’s a strong answer to “what would make you choose to NOT use LangGraph for a given project”? A genuinely simple, single-shot LLM call with no need for state across turns, no tool use, no human approval gate, and no persistence requirement — introducing a graph, checkpointer, and all the associated machinery for a task that’s really just “call the model once and return the result” is unnecessary overhead; LangGraph earns its complexity budget on multi-step, stateful, or human-gated workflows, not trivial ones.

Q245. How would you handle secrets/credentials (API keys for tools) in a LangGraph deployment, especially in a multi-tenant setup? Pass credentials via config (LangGraph’s RunnableConfig-style configuration passed at invocation time) rather than hardcoding them into graph or node definitions, and scope them per-tenant/per-request through that same config mechanism rather than through global environment state that every thread would otherwise share indiscriminately.
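
A sketch of the per-request pattern, using plain dictionaries in the shape of an invocation config with a "configurable" mapping. The key name `orders_api_key`, the node, and the fake API client are hypothetical.

```python
calls = []  # records which credential each call used


def fake_orders_api(order_id, api_key):
    """Stand-in for an external order service."""
    calls.append(api_key)
    return f"status-for-{order_id}"


def lookup_order(state, config):
    """Node body: the credential comes from the per-invocation config, not a global."""
    api_key = config["configurable"]["orders_api_key"]
    return {"order_status": fake_orders_api(state["order_id"], api_key)}


# Two tenants invoke the same node with different credentials.
a = lookup_order({"order_id": "A1"}, {"configurable": {"orders_api_key": "key-tenant-a"}})
b = lookup_order({"order_id": "B7"}, {"configurable": {"orders_api_key": "key-tenant-b"}})
print(a, b, calls)
```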

Q246. What’s a good answer to “how do you monitor cost” for a LangGraph-backed production system? Token usage metadata is available per LLM call (message.output.usage_metadata on the v3 streaming API, or the equivalent field in a non-streaming response), so a StreamTransformer (Q210) or equivalent hook aggregating that usage per run — and tagging it by node, agent, or tenant — turns raw usage numbers into attributable cost, rather than only knowing an aggregate spend number with no way to trace which part of the system is driving it.
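
Turning usage into attributable cost is mostly bookkeeping. The sketch below assumes you have already collected one record per LLM call with `input_tokens` and `output_tokens` plus your own `tenant` and `node` tags; the prices are made-up placeholders, not any provider's rates.

```python
from collections import defaultdict

# Placeholder prices per 1,000 tokens; substitute your provider's actual rates.
PRICE_PER_1K = {"input": 0.003, "output": 0.015}


def attribute_cost(usage_records):
    """Sum estimated cost per (tenant, node) pair."""
    totals = defaultdict(float)
    for rec in usage_records:
        cost = (
            rec["input_tokens"] / 1000 * PRICE_PER_1K["input"]
            + rec["output_tokens"] / 1000 * PRICE_PER_1K["output"]
        )
        totals[(rec["tenant"], rec["node"])] += cost
    return dict(totals)


records = [
    {"tenant": "acme", "node": "planner", "input_tokens": 1000, "output_tokens": 200},
    {"tenant": "acme", "node": "planner", "input_tokens": 2000, "output_tokens": 0},
    {"tenant": "acme", "node": "tools", "input_tokens": 0, "output_tokens": 1000},
]
costs = attribute_cost(records)
print(costs)
```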

Q247. What’s the honest tradeoff of adopting prebuilt agents/multi-agent libraries (create_react_agent, langgraph-supervisor) versus building everything on the raw Graph API? Faster initial development and a maintained, tested implementation of common patterns, at the cost of being somewhat coupled to how those libraries have chosen to structure state and control flow internally — worth it until your requirements diverge enough from the common pattern that working around the prebuilt abstraction costs more than building the equivalent logic directly would have.

Q248. What’s a strong answer to “how would you decide when a project has outgrown a single graph and needs multi-agent architecture”? When a single agent’s system prompt and tool list have grown large enough that tool-selection and reasoning reliability are visibly degrading (Q176), or when genuinely distinct workflows (each needing separate evaluation, separate human-in-the-loop policies, or separate ownership within the team) are being forced into one undifferentiated agent — the signal is reliability and organizational friction, not simply “the codebase got big.”

Q249. What question would a strong candidate ask back, if given the chance, during a LangGraph system-design interview? Something clarifying the actual failure tolerance and latency budget of the system being designed — whether a refund workflow can tolerate a human-review delay of minutes versus needing to resolve in seconds, for instance — since nearly every design decision in this guide (whether to checkpoint, whether to gate with interrupt(), whether multi-agent is worth the overhead) depends on constraints that a well-posed interview question should surface rather than assume.

Q250. What’s the single idea, if a candidate remembers nothing else from this guide, that ties Sections 1 through 10 together? LangGraph’s entire value proposition is turning implicit, hand-rolled state-machine concerns — persistence, replay, human approval, streaming, retries — into explicit, first-class primitives (state + reducers, checkpointer, interrupt(), typed streaming projections, RetryPolicy) that you compose rather than reinvent; every section of this guide is really just a different one of those primitives, and a strong candidate can explain how they all sit on top of the same Pregel super-step execution model from Q4.

Key Takeaways

LangGraph’s core mental model is Pregel-style super-step execution over an explicit state schema — nearly every advanced feature (Command, Send, checkpointing, interrupts, streaming projections) is a different lens on that same underlying model.
Reducers, not manual merge logic, are how LangGraph resolves concurrent writes to shared state — understanding Annotated[Type, reducer_fn] unlocks parallel fan-out, add_messages, and custom merge semantics alike.
Checkpointing enables memory, interrupt()-based human-in-the-loop, and time travel simultaneously — they’re three consumers of the same underlying persisted-state mechanism, not three separate systems.
Command and Send are the two primitives that make dynamic, runtime-determined control flow possible — Command for “update state and route” in one return, Send for dynamic-count parallel fan-out.
Long-term memory (BaseStore) and short-term memory (checkpointer) solve genuinely different problems — cross-thread durable facts versus in-thread conversation history — and conflating them is a common design mistake.
Multi-agent architecture (supervisor, swarm) is a reliability and specialization tool, not a default — the right number of agents matches the number of genuinely distinct, separately-testable specializations a problem has.
Streaming evolved from raw, inconsistent chunks (v1) to a unified dict format (v2) to typed projections over an explicit content-block protocol (v3) — know which version a codebase is on before writing streaming code for it.
LangGraph’s honest tradeoff versus CrewAI, AutoGen, and Temporal is control versus convenience or durability guarantees — a strong candidate can name what LangGraph gives up, not just what it provides.

FAQs

Is LangGraph hard to learn if I already know LangChain? The core building blocks (chat models, tools, prompts) transfer directly. What’s new is the state-machine mental model (Section 1) and the persistence/streaming layer built on top of it, which takes a few days to a few weeks of ramp-up depending on how much depth the role requires.

Do I need to memorize exact function signatures for a LangGraph interview? No — interviewers are almost always more interested in whether you understand why a primitive exists (why Send versus a plain conditional edge, why interrupt() needs a checkpointer) than whether you can recite an exact parameter list from memory; understanding the reasoning lets you reconstruct approximately-correct syntax on demand.

What’s the most commonly under-prepared topic among LangGraph interview candidates? Human-in-the-loop design judgment (Section 5) — most candidates can explain interrupt() mechanically but far fewer can reason clearly about which actions in a given workflow actually warrant a pause, which is exactly the kind of question senior-level interviews probe for.

Should I prepare framework comparison questions (LangGraph vs CrewAI vs AutoGen vs Temporal) even for a mid-level role? Yes, at least at a basic level — even junior-to-mid interviews often ask “why did you pick LangGraph for this project” as a way to check you understand the tool’s tradeoffs rather than having used it by default or by hype.

Is this guide enough on its own, or should I also read LangGraph’s official docs directly? Use this guide to structure your review and check your understanding, but read the official docs (linked from the answers above and listed under References) for anything you’re rusty on. LangGraph’s API surface moves quickly enough that the primary docs are the ground truth; this guide is deliberately built to point back to them, not replace them.

References

All code patterns and API descriptions in this guide were verified against the following official sources (dated 2026 unless otherwise noted):

LangGraph (OSS) — Graph API Overview. docs.langchain.com/oss/python/langgraph/graph-api
LangGraph (OSS) — Thinking in LangGraph. docs.langchain.com/oss/python/langgraph/thinking-in-langgraph
LangGraph (OSS) — Persistence. docs.langchain.com/oss/python/langgraph/persistence
LangGraph (OSS) — Fault tolerance (RetryPolicy, CachePolicy). docs.langchain.com/oss/python/langgraph/fault-tolerance
LangGraph (OSS) — Interrupts. docs.langchain.com/oss/python/langgraph/interrupts
LangGraph (OSS) — Use time travel. docs.langchain.com/oss/python/langgraph/use-time-travel
LangGraph (OSS) — Functional API overview. docs.langchain.com/oss/python/langgraph/functional-api
LangGraph (OSS) — Streaming and Event streaming. docs.langchain.com/oss/python/langgraph/streaming / docs.langchain.com/oss/python/langgraph/event-streaming
LangChain — Memory overview and Semantic search for LangGraph memory. docs.langchain.com/oss/python/concepts/memory / langchain.com/blog/semantic-search-for-langgraph-memory
LangChain Reference — create_react_agent, langgraph-supervisor, langgraph-swarm. reference.langchain.com/python/langgraph.prebuilt / reference.langchain.com/python/langgraph-supervisor / reference.langchain.com/python/langgraph-swarm
LangChain — LangSmith Deployment (formerly LangGraph Platform). docs.langchain.com/oss/python/langgraph/deploy
LangChain — useStream() React reference. docs.langchain.com/langgraph-platform/use-stream-react

Building with Google Agent Studio: The Complete Guide to Gemini Enterprise Agent Platform

Satish Prasad — Sat, 13 Jun 2026 07:58:26 +0000

Vertex AI is now Agent Platform. Agent Designer is now Agent Studio. What stayed the same — and what it means for enterprise teams building production agents today.

The Platform That Keeps Evolving — And Why That’s a Good Thing

If you’ve been tracking Google’s AI platform story, you’ve watched a rapid-fire succession of rebrands: Dialogflow → Agent Builder → Vertex AI → now Gemini Enterprise Agent Platform. At Google Cloud Next 2026, Google announced the consolidation of everything — Vertex AI, Agentspace, Model Garden, ADK, and the Agent Runtime — into a single unified platform. The low-code builder that was called Agent Designer since December 2024 became Agent Studio, now generally available.

This guide cuts through the naming history and focuses on what you can actually build today: production-grade agents using the full platform stack — Agent Studio for no-code/low-code design, RAG Engine for grounding on enterprise data, Memory Bank for long-term personalisation, Agent Runtime for deployment, and built-in evaluation for quality assurance.

Whether you’re a developer who wants code, a builder who wants clicks, or an architect who needs to understand the full system — this guide covers all three.

Part 1: The Platform Mental Model — Five Layers

Before touching the console or writing a line of code, understand how the five layers of the Gemini Enterprise Agent Platform fit together.

Gemini Enterprise Agent Platform is a unified platform to build, deploy, govern, and optimize enterprise-grade AI agents and model-based solutions. It supports the complete AI lifecycle — from accessing over 200 foundation models to deploying and managing your agents.

Here’s how the five layers stack:

┌──────────────────────────────────────────────────────────────────┐
│  LAYER 1 — AGENT STUDIO (no-code / low-code visual canvas)        │
│  Design agents, test prompts, build reasoning flows visually      │
├──────────────────────────────────────────────────────────────────┤
│  LAYER 2 — ADK (code-first agent framework)                       │
│  LlmAgent, SequentialAgent, ParallelAgent, LoopAgent, AgentTool  │
├──────────────────────────────────────────────────────────────────┤
│  LAYER 3 — KNOWLEDGE LAYER                                        │
│  RAG Engine · Agent Search · Vector Search · Memory Bank         │
├──────────────────────────────────────────────────────────────────┤
│  LAYER 4 — AGENT RUNTIME (managed deployment + scaling)           │
│  Agent Engine (Vertex AI) · Cloud Run · GKE                      │
├──────────────────────────────────────────────────────────────────┤
│  LAYER 5 — GOVERNANCE                                             │
│  Agent Identity · IAM · Agent Gateway · Business Policies        │
└──────────────────────────────────────────────────────────────────┘

Agent Platform meets you where you are, with tools for all skill levels: Agent Studio to design agents and interact with models without code; Colab Enterprise Notebooks for code-based development and experimentation; Agent Development Kit to build sophisticated agents capable of complex reasoning and tool use with a modular, model-agnostic framework.

The platform’s philosophy: start in Agent Studio, graduate to ADK code when you need more control, deploy both the same way via Agent Runtime.

Part 2: Agent Studio — The No-Code/Low-Code Canvas

Agent Studio is where most teams start. It’s a visual canvas inside the Google Cloud console for designing, prototyping, and managing agent reasoning loops and workflows — no Python required to get something running.

What Agent Studio Actually Is

Agent Studio, Google’s new low-code interface for building, testing, and publishing natural-language agents, is generally available. The product was in preview as Agent Designer since December 2024. What may be more interesting here is what developers can now actually build with it.

In the console, Agent Studio gives you:

Visual reasoning loop designer — drag connections between the model, tools, and data sources. Define the agent’s instruction (system prompt) in a structured editor with variable interpolation support.

Live test panel — chat with your agent directly in the console. Every tool call, retrieval step, and model response is visible in the trace panel alongside the conversation.

Tool connection UI — connect Google Search grounding, Agent Search corpora, Cloud Functions, OpenAPI specs, or MCP servers as tools — all without writing integration code.

Agent Garden integration — one-click import of prebuilt templates for common use cases: customer support, document Q&A, IT helpdesk, HR FAQ, code assistant.

Your First Agent in Agent Studio — Step by Step

Step 1: Open the console. Navigate to console.cloud.google.com, select your project, and search for “Agent Studio” in the top search bar. Or navigate directly: Agent Platform → Studio → Create Agent.

Step 2: Configure the agent basics. Give the agent a name (e.g. policy-assistant), select a model (gemini-2.0-flash for speed, gemini-2.5-pro for complex reasoning), and write the instruction. Be specific:

You are an enterprise policy assistant for Acme Corp.
Your job is to answer employee questions about company policies accurately.
Always retrieve from the knowledge_base tool before answering.
Cite the document name and section in every response.
If the policy is not found, say so -- do not invent details.

Step 3: Add a tool. Click Add Tool → Agent Search → select your knowledge corpus (or create one). Agent Search becomes the knowledge_base tool the instruction references.

Step 4: Test in the live panel. Type a query: “What is the parental leave policy?” Watch the trace: model receives query → calls knowledge_base → retrieves 3 passages → generates grounded response with citation.

Step 5: Export to ADK. When ready for code-first control, click Export → ADK Python. Agent Studio generates the full LlmAgent definition as a Python file — ready to extend, version, and deploy via CI/CD.

Part 3: Agent Garden — Blueprints That Actually Work

Rather than starting from a blank canvas, Agent Garden gives you a library of prebuilt, production-tested agents and templates for the most common agent patterns.

The adk-samples repository hosts the open-source versions of these templates. Each one is a complete, runnable ADK project with tools, instructions, evaluation datasets, and deployment configs. Current highlights:

Template	Use case
`customer-service`	Multi-turn support agent with escalation and order lookup
`document-qa`	RAG-backed Q&A over uploaded documents
`code-assistant`	Code generation, review, and explanation
`data-analyst`	Natural language to BigQuery SQL
`travel-concierge`	Multi-agent travel planning (flight + hotel + activities)
`folio-advisor`	Financial portfolio analysis with tool use

To use a template from the CLI:

# Install the Google ADK
pip install google-adk

# Clone the adk-samples repository
git clone https://github.com/google/adk-samples.git
cd adk-samples/python/agents/customer-service

# Run locally (adk run takes the agent package folder, not a .py file)
adk run customer_service

# Inspect in the dev UI
adk web

Each sample is a working starting point, not a toy. The customer-service template handles order lookups, refund requests, escalation to human agents, and session memory — all wired and ready to customise.

Part 4: RAG Engine — Grounding Agents on Enterprise Data

The most powerful capability in the platform for enterprise deployments is RAG Engine: a fully managed data framework for connecting private enterprise data to LLM agents.

RAG Engine on Gemini Enterprise Agent Platform is a data framework for building context-augmented LLM applications: at query time the model is given relevant passages retrieved from your own data, which is the retrieval-augmented generation (RAG) pattern.

RAG Engine handles the full pipeline: document ingestion, parsing, chunking, embedding, vector indexing, and retrieval — all managed, serverless, and integrated with the Gemini models.

Step 1: Create a RAG Corpus

A corpus is the container for your indexed documents. Create it once; it persists and auto-updates when you add new files.

# rag_setup.py
# pip install google-cloud-aiplatform

import vertexai
from vertexai.preview import rag

PROJECT_ID = "your-gcp-project-id"
LOCATION = "us-central1"

vertexai.init(project=PROJECT_ID, location=LOCATION)

# Create the corpus
corpus = rag.create_corpus(
    display_name="enterprise-knowledge-base",
    description="Internal policy docs, product manuals, and SOPs",
)
print(f"Corpus created: {corpus.name}")

Step 2: Import Documents

RAG Engine supports Google Cloud Storage, Google Drive, Google Docs, inline text, and Slack/Confluence via connectors. It automatically parses PDFs, Word docs, HTML, and plain text.

# rag_import.py
import vertexai
from vertexai.preview import rag

PROJECT_ID  = "your-gcp-project-id"
LOCATION    = "us-central1"
CORPUS_NAME = "projects/your-gcp-project-id/locations/us-central1/ragCorpora/YOUR_CORPUS_ID"

vertexai.init(project=PROJECT_ID, location=LOCATION)

# Import files from Google Cloud Storage
response = rag.import_files(
    corpus_name=CORPUS_NAME,
    paths=[
        "gs://your-bucket/docs/policy_manual_2025.pdf",
        "gs://your-bucket/docs/product_catalogue.pdf",
    ],
    transformation_config=rag.TransformationConfig(
        chunking_config=rag.ChunkingConfig(
            chunk_size=512,     # tokens per chunk
            chunk_overlap=100,  # overlap for context continuity
        ),
    ),
)
print(f"Files imported: {response.imported_rag_files_count}")
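
To build intuition for what chunk_size and chunk_overlap control, here is a minimal sketch of fixed-size chunking with overlap. It is not RAG Engine's actual implementation: it treats a list of words as a stand-in for real tokens, and the function name chunk_tokens is hypothetical.

```python
def chunk_tokens(tokens: list[str], chunk_size: int, chunk_overlap: int) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the input
    return chunks


# Toy input: ten "tokens", windows of four, one token shared between neighbours
words = [f"w{i}" for i in range(10)]
chunks = chunk_tokens(words, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
```

With the values used in the import call (512 and 100), each chunk would start 412 tokens after the previous one under this scheme, so text near a chunk boundary appears in both neighbouring chunks.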

Step 3: Query with Gemini + RAG Tool

Attach the corpus as a retrieval tool and pass it to a Gemini model. Every generate_content call now retrieves before generating.

# rag_query.py
import vertexai
from vertexai.preview import rag
from vertexai.generative_models import GenerativeModel, Tool

PROJECT_ID  = "your-gcp-project-id"
LOCATION    = "us-central1"
CORPUS_NAME = "projects/your-gcp-project-id/locations/us-central1/ragCorpora/YOUR_CORPUS_ID"

vertexai.init(project=PROJECT_ID, location=LOCATION)

# Build the RAG retrieval tool
rag_retrieval_tool = Tool.from_retrieval(
    retrieval=rag.Retrieval(
        source=rag.VertexRagStore(
            rag_corpora=[CORPUS_NAME],
            similarity_top_k=5,           # return top 5 passages
            vector_distance_threshold=0.5, # keep only passages whose vector distance is below this
        ),
    )
)

# Attach to Gemini -- now every response is grounded in your documents
model = GenerativeModel(
    model_name="gemini-2.0-flash",
    tools=[rag_retrieval_tool],
)

response = model.generate_content(
    "What is our refund policy for enterprise software licences?"
)
print(response.text)
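
The two retrieval knobs above interact: the threshold first discards passages that are too far from the query, and top-k then caps how many of the survivors are returned. The sketch below illustrates that logic in plain Python. It assumes cosine distance and hand-written two-dimensional vectors; the helper names and data are hypothetical, and the managed service's actual metric and ranking may differ.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Return 1 - cosine similarity (0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def retrieve(query_vec, passages, top_k, distance_threshold):
    """Keep passages closer than the threshold, then return the top_k nearest."""
    scored = [(cosine_distance(query_vec, vec), text) for text, vec in passages]
    kept = [item for item in scored if item[0] < distance_threshold]
    kept.sort(key=lambda item: item[0])  # nearest first
    return [text for _, text in kept[:top_k]]


# Hypothetical passages with made-up embeddings
passages = [
    ("refund policy", [1.0, 0.0]),
    ("licence terms", [0.8, 0.6]),
    ("cafeteria menu", [0.0, 1.0]),
]
results = retrieve([1.0, 0.0], passages, top_k=5, distance_threshold=0.5)
print(results)
```

A lower threshold makes retrieval stricter, so fewer but closer passages reach the model. A threshold that is too strict can leave the model with no grounding at all.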

Step 4: RAG-Grounded ADK Agent

For multi-agent systems, wrap the RAG corpus as an ADK tool and give it to a specialist agent:

# rag_agent.py
import vertexai
from google.adk.agents import LlmAgent
from google.adk.tools.retrieval.vertex_ai_rag_retrieval import VertexAiRagRetrieval

PROJECT_ID  = "your-gcp-project-id"
LOCATION    = "us-central1"
CORPUS_NAME = "projects/your-gcp-project-id/locations/us-central1/ragCorpora/YOUR_CORPUS_ID"

vertexai.init(project=PROJECT_ID, location=LOCATION)

# Wrap the RAG corpus as an ADK retrieval tool
rag_tool = VertexAiRagRetrieval(
    name="knowledge_base",
    description="Searches internal documents: policies, SOPs, product specs.",
    rag_corpora=[CORPUS_NAME],
    similarity_top_k=5,
)

# Policy agent grounded in enterprise docs
policy_agent = LlmAgent(
    model="gemini-2.0-flash",
    name="policy_agent",
    description="Answers questions about company policies and SOPs using the knowledge base.",
    instruction=(
        "You are an enterprise policy assistant. "
        "Always use the knowledge_base tool to retrieve relevant policies before answering. "
        "Cite the source document and page number in your response. "
        "Never make up policy details -- only reference retrieved content."
    ),
    tools=[rag_tool],
)

Reference: RAG Engine overview

Part 5: Agent Search — Out-of-the-Box Search for Specialised Domains

RAG Engine handles unstructured documents. Agent Search handles specialised retrieval needs at enterprise scale — with pre-tuned modes for different industry domains.

Agent Search functions as an out-of-the-box RAG system for information retrieval, and has a specialised offering tuned for unique industry requirements. The four modes map to distinct use cases:

Custom Search (General) builds tailored search, personalisation, and generative experiences on your sites, content, catalogues, and blended data. Data sources: structured catalogues (hotels, directories), unstructured files with metadata, Google Workspace connectors, and public sites. This is the go-to for internal knowledge base search where your data lives in Drive, Confluence, or GCS buckets.

Site Search with AI Mode builds generative search with AI mode in a day using site content. It leverages Google’s index for real-time crawling and adds search summarisation on top. The distinct advantage: you get Google’s crawling infrastructure without running your own spider. Ideal for documentation sites and product help centres that change frequently.

Media Search is designed for media libraries — images, videos, and audio files. This is purpose-built for broadcast, publishing, and creative industries where the asset itself (not just its metadata) needs to be searchable.

AI Commerce Search handles retail catalogues specifically. If you’re building search for an e-commerce platform, this mode is tuned for product discovery, faceted filtering, and purchase intent signals.

Create an Agent Search app from the console at Agent Platform → Agent Search → Create App, or via the Discovery Engine API, shown here through the gcloud CLI:

# Create a search app via the CLI
gcloud alpha discovery-engine engines create \
  --project=YOUR_PROJECT_ID \
  --location=global \
  --display-name="internal-knowledge-search" \
  --solution-type=SOLUTION_TYPE_SEARCH \
  --data-store-ids=YOUR_DATA_STORE_ID

Part 6: Memory Bank — Long-Term Personalisation Across Sessions

RAG Engine grounds agents in documents. Memory Bank grounds agents in users — storing personalised facts, preferences, and context that persist across every session, indefinitely.

Memory Bank stores long-term memory containing personalised information to enable more context-aware agent interactions across multiple sessions. From the console you can view, search, and manage the agent’s saved memories — including total memory count, token usage, and mutation rates.

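In code, the memory service is attached to the Runner that executes the agent rather than to the agent itself, and the agent reads memories through a memory tool:

# memory_agent.py
from google.adk.agents import LlmAgent
from google.adk.memory import VertexAiMemoryBankService
from google.adk.runners import Runner
from google.adk.sessions import InMemorySessionService
from google.adk.tools import load_memory

# Memory Bank service -- backed by Vertex AI managed storage.
# agent_engine_id is a placeholder for the ID of your Agent Runtime instance.
memory_service = VertexAiMemoryBankService(
    project="your-gcp-project-id",
    location="us-central1",
    agent_engine_id="YOUR_AGENT_ENGINE_ID",
)

# Agent that can query long-term memory through the load_memory tool
personalised_agent = LlmAgent(
    model="gemini-2.0-flash",
    name="personalised_support_agent",
    description="Customer support agent with long-term memory of user preferences.",
    instruction=(
        "You are a helpful customer support agent. "
        "Remember the user's preferences, past issues, and account context. "
        "Use your memory to personalise every interaction. "
        "Always retrieve relevant memories before responding."
    ),
    tools=[load_memory],
)

# The runner wires the agent to the memory service
runner = Runner(
    agent=personalised_agent,
    app_name="personalised_support",
    session_service=InMemorySessionService(),
    memory_service=memory_service,
)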

When a user says “I prefer email notifications, not SMS” in session 1, the agent writes that preference to Memory Bank. In session 47, three months later, the agent still knows it — without the user repeating themselves.

Note: As of January 2026, stored session events and memories are billed at $0.25 per 1,000 events or memories. Plan your retention policies accordingly.
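
To make the rate in the note concrete, here is a small worked calculation. It uses only the figure quoted above ($0.25 per 1,000 stored events or memories). The volumes are hypothetical, and the billing period is not stated in the note, so check the pricing page before budgeting.

```python
PRICE_PER_1000_USD = 0.25  # rate quoted in the note above


def storage_cost_usd(events: int, memories: int) -> float:
    """Cost of the stored items at the quoted per-1,000 rate."""
    return (events + memories) / 1000 * PRICE_PER_1000_USD


# Hypothetical volumes: 2 million session events and 40,000 memories
cost = storage_cost_usd(events=2_000_000, memories=40_000)
print(f"${cost:,.2f}")
```

Session events usually dominate the count, so a retention policy that expires old session events tends to reduce cost far more than pruning memories.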

Part 7: Deploying to Agent Runtime

Once your agent is built and tested, deploy it to Agent Runtime — the managed execution environment that handles auto-scaling, IAM, observability, and CI/CD integration.

The platform supports five deployment methods — choose based on your workflow:

Method	Best for
From agent object	Interactive Colab development, rapid prototyping
From source files	CI/CD pipelines, Terraform / Infrastructure as Code
From Dockerfile	Custom API server, specific runtime dependencies
From container image	Full build process control, lower deployment latency
From Developer Connect	Git-connected repos, native version control and collaboration

The simplest path — deploying directly from an in-memory agent object — takes three lines after your agent is defined:

# deploy_agent.py
import vertexai
from google.adk.agents import LlmAgent

PROJECT_ID = "your-gcp-project-id"
LOCATION   = "us-central1"

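# A Cloud Storage staging bucket is required when deploying (placeholder name below)
vertexai.init(
    project=PROJECT_ID,
    location=LOCATION,
    staging_bucket="gs://your-staging-bucket",
)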

def get_order_status(order_id: str) -> dict:
    """Look up the current status of an order by its ID."""
    return {"order_id": order_id, "status": "shipped", "eta": "2025-07-15"}

support_agent = LlmAgent(
    model="gemini-2.0-flash",
    name="support_agent",
    description="Handles customer order enquiries.",
    instruction="Help customers track their orders. Always use get_order_status.",
    tools=[get_order_status],
)

# Deploy to Agent Runtime -- three lines
from vertexai.preview.reasoning_engines import AdkApp

adk_app = AdkApp(agent=support_agent, enable_tracing=True)

remote_app = vertexai.preview.reasoning_engines.ReasoningEngine.create(
    adk_app,
    requirements=["google-adk>=1.0.0"],
    display_name="support-agent-v1",
    description="Customer support agent - order tracking",
)
print(f"Deployed: {remote_app.resource_name}")

After deployment, the agent is available as a REST endpoint, callable from any service with the right IAM permissions.

Reference: Deploy an agent on Agent Runtime

Part 8: Built-in Evaluation — Quality Before You Ship

Every agent needs evaluation before it reaches production. The Gemini Enterprise Agent Platform’s evaluation layer runs directly in the console (Evaluation tab) or via the Vertex AI SDK.

Three evaluation modes are available: Experiments for one-off quality assessments against a dataset, Metrics for defining and tracking custom quality dimensions, and Online Monitors for continuous evaluation in production.

Here’s a complete evaluation run using the SDK with a custom LLM-as-judge metric:

# evaluate_agent.py
import vertexai
from vertexai.preview.evaluation import EvalTask
from vertexai.preview.evaluation.metrics import (
    PointwiseMetric,
    PointwiseMetricPromptTemplate,
)

PROJECT_ID = "your-gcp-project-id"
LOCATION   = "us-central1"

vertexai.init(project=PROJECT_ID, location=LOCATION)

# Define a custom coherence metric using LLM-as-judge
coherence_metric = PointwiseMetric(
    metric="coherence",
    metric_prompt_template=PointwiseMetricPromptTemplate(
        criteria={
            "coherence": (
                "The response is logically structured, easy to follow, "
                "and the ideas connect naturally."
            )
        },
        rating_rubric={
            "5": "Perfectly coherent -- flows naturally, no gaps.",
            "3": "Mostly coherent with minor issues.",
            "1": "Incoherent -- hard to follow.",
        },
    ),
)

# Evaluation dataset (inputs + expected outputs)
eval_dataset = [
    {
        "prompt": "What is the refund policy for digital products?",
        "response": "Digital products are non-refundable unless the file is corrupted on delivery.",
        "reference": "Digital purchases are non-refundable except in cases of delivery errors.",
    },
    {
        "prompt": "How do I reset my password?",
        "response": "Go to the login page and click Forgot Password to receive a reset link by email.",
        "reference": "Click Forgot Password on the login page; a reset link will be emailed to you.",
    },
]

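# Run the evaluation experiment.
# EvalTask expects a pandas DataFrame (or a dataset URI), so convert the list of dicts first.
import pandas as pd

eval_task = EvalTask(
    dataset=pd.DataFrame(eval_dataset),
    metrics=["exact_match", "rouge_l_sum", coherence_metric],
    experiment="support-agent-eval-v1",
)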

eval_result = eval_task.evaluate()
print(eval_result.summary_metrics)

This experiment appears in the Agent Platform console under Evaluation → Experiments, where you can compare multiple runs side by side — exactly like the LangSmith experiment comparison we covered in the evaluation pillar post.
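
To see what the two computation-based metrics measure, the sketch below re-implements simplified versions in plain Python. These are illustrations, not the SDK's implementations: exact_match here is a strict string comparison, and rouge_l_f1 is a sentence-level ROUGE-L F-score over lowercased whitespace tokens, whereas rouge_l_sum is a summary-level variant with its own tokenisation.

```python
def exact_match(response: str, reference: str) -> float:
    """1.0 only when the two strings are identical."""
    return 1.0 if response == reference else 0.0


def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(response: str, reference: str) -> float:
    """F-score built from LCS-based precision and recall."""
    resp, ref = response.lower().split(), reference.lower().split()
    lcs = lcs_length(resp, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(resp), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Second row of the evaluation dataset above
response = "Go to the login page and click Forgot Password to receive a reset link by email."
reference = "Click Forgot Password on the login page; a reset link will be emailed to you."
print(exact_match(response, reference))
print(round(rouge_l_f1(response, reference), 2))
```

A correct paraphrase scores zero on exact match and only partial credit on ROUGE-L, which is why the run pairs these metrics with an LLM-as-judge metric instead of relying on string overlap alone.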

Reference: Evaluation on Agent Platform

Part 9: Governance — Policies, IAM, and Agent Gateway

Enterprise deployment isn’t complete without governance. The platform provides three governance layers.

Agent Identity gives each deployed agent its own service account identity — enabling fine-grained IAM permissions per agent. Your support agent can read from Firestore and call the orders API. It cannot write to BigQuery or access the HR database. Least privilege, enforced at the identity level.

Agent Gateway acts as the secure API layer between agents and the tools, MCP servers, and endpoints they call. It enforces IAM allow policies through Identity-Aware Proxy (IAP), controlling which agent identities can access which resources. Think of it as an API gateway that speaks agent — it understands tool calls, not just HTTP requests.

Business Policies (in the console at Policies → Business Policies) let you define natural-language rules that constrain agent behaviour across your organisation: “Agents must always disclose when they are AI.” “Agents must not discuss competitor pricing.” These are enforced at the Gateway layer, not in the individual agent instructions.

The Complete Platform Map

CONSOLE ENTRY POINTS
├── Agent Studio        → Visual agent designer, test, export to ADK
├── Agent Garden        → Prebuilt templates (customer-service, doc-QA, etc.)
├── RAG Engine          → Managed document indexing + retrieval
├── Agent Search        → Domain-specific search (general, site, media, commerce)
├── Memory Bank         → Long-term user personalisation
├── Agent Runtime       → Deploy, scale, monitor deployed agents
├── Evaluation          → Experiments, metrics, online monitors
└── Policies            → IAM, Agent Gateway, Business Policies

DEVELOPER ENTRY POINTS
├── ADK                 → Python/TypeScript/Go/Java agent framework
├── Colab Enterprise    → Notebooks with Vertex AI integration
├── Agents CLI          → adk run, adk web, adk eval, adk deploy
└── Developer Connect   → Git-linked CI/CD deployments

Where to Start

The right entry point depends on your team:

Non-technical teams building internal tools → start in Agent Studio, connect Agent Search to Google Drive, deploy to Agent Runtime with one click.

Developers building production agents → scaffold from Agent Garden, extend with ADK code, ground with RAG Engine, deploy from source files via the Agents CLI.

Enterprise architects designing multi-agent systems → use ADK for the agent layer, RAG Engine for knowledge, Memory Bank for personalisation, Agent Gateway for governance, and Agent Runtime for deployment across regions.

All three paths deploy to the same runtime, share the same evaluation tooling, and operate under the same governance layer. That’s the point of a unified platform.

Resources

Gemini Enterprise Agent Platform overview — official home
Agent Studio — Design agents — console visual designer
Agent Garden — prebuilt templates
ADK on Agent Platform — code-first development
RAG Engine overview — managed retrieval framework
RAG Engine quickstart — build your first corpus
Deploy an agent on Agent Runtime — all five deployment methods
Evaluation on Agent Platform — experiments, metrics, online monitors
Agent Governance overview — IAM, Gateway, Business Policies
adk-samples on GitHub — Agent Garden source templates
Google Cloud Next 2026 Agent Platform announcement — the rebrand explained

All code examples syntax-verified against Python 3.11. Install: pip install google-adk google-cloud-aiplatform. Free tier available: up to 10 agent engines, 90 days via Vertex AI Express Mode.

Building Multi-Agent Systems with Google ADK: The Complete Step-by-Step Guide

Satish Prasad — Fri, 12 Jun 2026 18:22:32 +0000

Google’s Agent Development Kit is the same framework powering Agentspace and Google’s Customer Engagement Suite. This guide teaches you to build production-grade multi-agent systems with it — from your first agent to parallel specialist teams.

The Day One Agent Problem

Every AI agent project starts with an optimistic prompt: “You are a smart assistant. Handle everything the user asks.”

Three weeks later, that single agent is juggling 40 tools, a system prompt that’s 3,000 tokens long, and a reliability rate that drops with every new capability you add. The more it knows, the worse it performs at any one thing.

This is the monolith trap. And the solution — like in software architecture — is decomposition.

Instead of one agent that does everything, build a team of specialists that each do one thing exceptionally well, coordinated by an orchestrator that knows how to delegate. That’s exactly what multi-agent systems are designed for.

Google’s Agent Development Kit (ADK) was built for this exact pattern. Announced at Google Cloud NEXT 2025 and now open source, ADK is designed to simplify end-to-end development of agents and multi-agent systems, giving developers flexibility and precise control when building production-ready agentic applications. Critically, it is the same framework that powers agents inside Google products such as Agentspace and the Google Customer Engagement Suite (CES).

This guide teaches you every concept you need, with working code at every step.

Part 1: Understanding ADK’s Architecture

Before writing code, internalize the mental model. ADK is built around a handful of clean primitives that compose naturally.

The Agent is the fundamental worker unit, designed for a specific task. Agents can use language models (LlmAgent) for complex reasoning, or act as deterministic controllers of execution called workflow agents (SequentialAgent, ParallelAgent, LoopAgent). Tools give agents abilities beyond conversation, letting them interact with external APIs, search information, run code, or call other services.

The four agent classes, one LLM-driven and three deterministic workflow agents, serve different roles:

Type	Powered by	Use when
`LlmAgent`	Gemini / any LLM	Reasoning, decision-making, dynamic responses
`SequentialAgent`	Deterministic	Fixed step-by-step pipelines
`ParallelAgent`	Deterministic	Independent tasks that can run concurrently
`LoopAgent`	Deterministic	Iterative refinement until a condition is met

The ADK empowers developers to get more reliable, sophisticated, multi-step behaviors from generative models. Instead of one complex prompt, ADK lets you build a flow of multiple, simpler agents that collaborate on a problem by dividing the work.

Why does this matter? Because specialized agents are more reliable at their specific tasks than one large, complex agent. It’s easier to fix or improve a small, specialized agent without breaking other parts of the system. Agents built for one workflow can be easily reused in others.

The Hierarchy Model

In ADK, you organize agents in a tree structure. A root coordinator sits at the top. Specialist sub-agents handle specific domains. Communication flows through three mechanisms: shared session state, LLM-driven delegation (agent transfer), and explicit invocation via AgentTool.

Root Coordinator (LlmAgent)
├── Specialist A (LlmAgent + tools)
├── Specialist B (LlmAgent + tools)
└── Workflow Orchestrator
    ├── Stage 1 Agent
    ├── Stage 2 Agent
    └── Stage 3 Agent

Part 2: Installation and Setup

ADK is available in Python, TypeScript, Go, and Java. We’ll use Python throughout.

# Create project and install ADK
mkdir travel-multi-agent && cd travel-multi-agent
python -m venv .venv && source .venv/bin/activate

pip install google-adk

# Set your Gemini API key
export GOOGLE_API_KEY="your_gemini_api_key_here"
# Get one free at: https://aistudio.google.com/app/apikey

Verify the install:

adk --version

ADK ships with a built-in developer UI you can launch for any project:

adk web          # Launches the visual debugger at http://localhost:8000
adk run          # CLI runner for scripted testing

The developer UI is one of ADK’s most practical advantages over other frameworks — every event, tool call, state change, and agent transfer is inspectable in real time without any extra instrumentation.

Part 3: Your First Agent — One LlmAgent with Tools

Let’s start minimal. A single LlmAgent with a tool teaches you the fundamental pattern before we add orchestration.

# agent.py
# pip install google-adk

import os
from google.adk.agents import LlmAgent
from google.adk.tools import google_search

# A minimal single agent
weather_agent = LlmAgent(
    model="gemini-2.0-flash",
    name="weather_agent",
    description="Answers weather-related questions using Google Search.",
    instruction="""
    You are a helpful weather assistant.
    Always use the google_search tool to find current weather data.
    Provide concise, accurate answers including temperature, conditions,
    and any relevant weather warnings.
    """,
    tools=[google_search],
)

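Run it. The adk CLI takes the agent’s package folder rather than a .py file and looks for a variable named root_agent, so place agent.py in a folder (here weather_agent/, with an __init__.py containing from . import agent), add root_agent = weather_agent at the end of the file, and run from the parent directory:

adk run weather_agent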

Three things are worth noting here. First, model="gemini-2.0-flash" sets the LLM — ADK natively supports all Gemini variants, and via LiteLLM integration you can swap in Claude, Mistral, or any open model with one line. Second, description is what other agents read when deciding whether to delegate to this agent — it’s the sub-agent’s job posting. Third, instruction is the system prompt — be specific and prescriptive.

Part 4: Tool Design — Plain Python Functions

ADK’s cleanest design decision: any Python function with a docstring becomes a tool. The docstring is parsed into the tool’s schema and shown to the model. You don’t need wrappers, decorators, or SDK imports.

# tools.py

def search_flights(origin: str, destination: str, date: str) -> dict:
    """Search for available flights between two cities on a given date.
    
    Args:
        origin: Departure city (e.g. 'Mumbai')
        destination: Arrival city (e.g. 'London')
        date: Travel date in YYYY-MM-DD format
    
    Returns:
        dict with available flights and prices
    """
    # In production: wire to a real flights API (Amadeus, Skyscanner, etc.)
    return {
        "flights": [
            {"flight": "AI-101", "departure": "08:00", "price_usd": 850},
            {"flight": "AI-205", "departure": "14:30", "price_usd": 720},
        ],
        "origin": origin,
        "destination": destination,
        "date": date,
    }


def search_hotels(city: str, check_in: str, check_out: str) -> dict:
    """Search for hotels in a given city for given dates.
    
    Args:
        city: City name
        check_in: Check-in date YYYY-MM-DD
        check_out: Check-out date YYYY-MM-DD
    
    Returns:
        dict with available hotels and prices
    """
    return {
        "hotels": [
            {"name": "Grand Hotel", "stars": 5, "price_per_night_usd": 180},
            {"name": "City Suites", "stars": 4, "price_per_night_usd": 95},
        ],
        "city": city,
    }


# Each tool goes to the specialist that needs it — NOT to all agents
from google.adk.agents import LlmAgent

flight_agent = LlmAgent(
    model="gemini-2.0-flash",
    name="flight_agent",
    description="Searches for available flights between cities.",
    instruction="You are a flights specialist. Use search_flights to find options.",
    tools=[search_flights],
)

hotel_agent = LlmAgent(
    model="gemini-2.0-flash",
    name="hotel_agent",
    description="Finds and recommends hotel accommodations.",
    instruction="You are a hotel specialist. Use search_hotels to find options.",
    tools=[search_hotels],
)

The discipline here matters: give each tool to exactly the agent that needs it. Never give all tools to a coordinator. Tool overload is how monolith agents happen.
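
To see why the type hints and docstrings in the tools above matter, the sketch below derives a simple tool description from a plain function using only the standard library. This is not ADK's actual schema generation, which is richer. The helper describe_tool and its output shape are hypothetical.

```python
import inspect


def describe_tool(func) -> dict:
    """Build a minimal tool description from a function's signature and docstring."""
    signature = inspect.signature(func)
    doc = inspect.getdoc(func) or ""
    return {
        "name": func.__name__,
        # First docstring line serves as the short description
        "description": doc.splitlines()[0] if doc else "",
        # Map each parameter to the name of its annotated type
        "parameters": {
            name: getattr(param.annotation, "__name__", str(param.annotation))
            for name, param in signature.parameters.items()
        },
    }


def search_hotels(city: str, check_in: str, check_out: str) -> dict:
    """Search for hotels in a given city for given dates."""
    return {"hotels": [], "city": city}


schema = describe_tool(search_hotels)
print(schema)
```

Everything the model learns about a tool comes from metadata like this, so a vague docstring or a missing type hint directly degrades tool selection and argument filling.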

Part 5: AgentTool — Agents as Tools

The most powerful pattern in ADK: wrapping a sub-agent as a tool that the coordinator calls explicitly. This gives the coordinator full control over when each specialist runs, while keeping each specialist cleanly isolated.

# coordinator.py
from google.adk.agents import LlmAgent
from google.adk.tools.agent_tool import AgentTool

# flight_agent and hotel_agent are defined in tools.py above
from tools import flight_agent, hotel_agent

# Coordinator delegates to specialists via AgentTool
coordinator = LlmAgent(
    model="gemini-2.0-flash",
    name="travel_coordinator",
    description="Orchestrates travel planning by delegating to specialist agents.",
    instruction="""
    You are a travel planning coordinator.
    When users ask about travel:
    - Use the flight_agent tool for anything related to flights
    - Use the hotel_agent tool for anything related to accommodation
    - Synthesize both results into a coherent, complete travel plan
    - Present the plan clearly with costs and timings
    """,
    tools=[
        AgentTool(agent=flight_agent),
        AgentTool(agent=hotel_agent),
    ],
)

When the coordinator receives “Find a flight to Paris and a hotel for the stay”, it calls flight_agent, gets the result, then calls hotel_agent, gets that result, and synthesises both into a unified response. The coordinator decides when each specialist runs, while each specialist keeps its own tools and instruction.

Part 6: SequentialAgent — Guaranteed-Order Pipelines

Some workflows must run in strict order: you can’t summarise a document before fetching it. You can’t run a risk model before gathering market data. For these, SequentialAgent is the right primitive.

The SequentialAgent is a workflow agent that executes its sub-agents in the order they are specified in the list. Use the SequentialAgent when you want the execution to occur in a fixed, strict order.

Here’s an equity analyst pipeline — research → risk assessment → report generation, guaranteed in that order:

# analyst_pipeline.py
from google.adk.agents import LlmAgent, SequentialAgent

def fetch_market_data(ticker: str) -> dict:
    """Fetch latest market data for a stock ticker."""
    return {"ticker": ticker, "price": 142.50, "volume": 1_200_000, "change_pct": 2.3}

def run_risk_model(data: dict) -> dict:
    """Run risk assessment on market data."""
    return {"risk_score": 0.42, "recommendation": "moderate_buy", "data": data}


# Step 1: Research — writes to session state via output_key
research_agent = LlmAgent(
    model="gemini-2.0-flash",
    name="research_agent",
    description="Fetches and structures market data for analysis.",
    instruction="""Fetch market data for the requested ticker.
    Return structured data including price, volume, and daily change.""",
    tools=[fetch_market_data],
    output_key="market_data",        # ← writes result to session state
)

# Step 2: Risk — reads {market_data} from session state
risk_agent = LlmAgent(
    model="gemini-2.0-flash",
    name="risk_agent",
    description="Runs risk assessment on the researched market data.",
    instruction="""Read the market data from {market_data} in session state.
    Run a risk assessment and produce a structured recommendation.""",
    tools=[run_risk_model],
    output_key="risk_assessment",
)

# Step 3: Report — synthesises both outputs
report_agent = LlmAgent(
    model="gemini-2.0-flash",
    name="report_agent",
    description="Generates the final analyst report.",
    instruction="""Using the market data from {market_data} and risk assessment
    from {risk_assessment}, write a concise investment report with:
    - Executive summary
    - Key metrics
    - Risk rating
    - Recommendation""",
)

# SequentialAgent: guaranteed order, no LLM routing overhead
analyst_pipeline = SequentialAgent(
    name="equity_analyst_pipeline",
    sub_agents=[research_agent, risk_agent, report_agent],
)

The output_key parameter is how agents communicate through session state — a lightweight shared memory available to all agents in the tree during a single session. Agent B can read what Agent A wrote simply by referencing {agent_a_output_key} in its instruction.
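
The mechanics can be sketched without ADK at all: a shared dictionary plays the role of session state, each step fills {key} placeholders in its instruction from that dictionary, and its result is stored under its output key. The function run_step and the produce callables (stand-ins for the LLM call) are hypothetical, not ADK APIs.

```python
import re


def run_step(state: dict, instruction: str, output_key: str, produce) -> None:
    """Fill {key} placeholders from state, run the step, store the result under output_key."""
    prompt = re.sub(r"\{(\w+)\}", lambda m: str(state[m.group(1)]), instruction)
    state[output_key] = produce(prompt)


state = {}  # stand-in for session state

# Step 1 writes market_data
run_step(state, "Fetch data for ACME", "market_data", lambda p: "price=142.50")

# Step 2 reads {market_data} and writes risk_assessment
run_step(
    state,
    "Assess risk given {market_data}",
    "risk_assessment",
    lambda p: f"moderate risk (saw: {p})",
)
print(state["risk_assessment"])
```

In this sketch, a placeholder whose key has not been written yet raises a KeyError. That is why ordering matters: a step can only reference keys that an earlier step has already produced.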

Part 7: ParallelAgent — Concurrent Specialist Teams

When sub-tasks are independent of each other, there’s no reason to run them serially. ParallelAgent runs all sub-agents concurrently and collects their results before returning.

# parallel_research.py
from google.adk.agents import LlmAgent, ParallelAgent

def search_flights(origin: str, destination: str, date: str) -> dict:
    """Search flights between two cities."""
    return {"flights": [{"flight": "AI-101", "price_usd": 850}]}

def search_hotels(city: str, check_in: str, check_out: str) -> dict:
    """Search hotels in a city."""
    return {"hotels": [{"name": "Grand Hotel", "price_per_night_usd": 180}]}

def search_activities(city: str, date: str) -> dict:
    """Search top activities in a city."""
    return {"activities": ["Eiffel Tower", "Louvre Museum", "Seine River Cruise"]}


flight_agent = LlmAgent(
    model="gemini-2.0-flash",
    name="flight_agent",
    description="Searches for flights.",
    instruction="Find flights for the given route and date.",
    tools=[search_flights],
    output_key="flight_results",
)

hotel_agent = LlmAgent(
    model="gemini-2.0-flash",
    name="hotel_agent",
    description="Finds hotels.",
    instruction="Find hotels for the given city and dates.",
    tools=[search_hotels],
    output_key="hotel_results",
)

activities_agent = LlmAgent(
    model="gemini-2.0-flash",
    name="activities_agent",
    description="Finds things to do.",
    instruction="Find top activities and attractions for the given city.",
    tools=[search_activities],
    output_key="activities_results",
)

# ParallelAgent: all three run concurrently, so wall-clock time is roughly that of the slowest sub-agent
research_team = ParallelAgent(
    name="travel_research_team",
    sub_agents=[flight_agent, hotel_agent, activities_agent],
)

If each of the three searches takes about 3 seconds, running them sequentially costs about 9 seconds, while ParallelAgent finishes in roughly 3 seconds, the duration of the slowest one. For any multi-step workflow where the steps are independent, ParallelAgent is the right choice.
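
The timing argument comes from how concurrent I/O behaves in general: total wall-clock time tracks the slowest call instead of the sum. Here is a minimal sketch in plain asyncio (not ADK itself), where fake_search is a hypothetical stand-in for a network-bound tool call and the delays are shortened to 0.1 seconds:

```python
import asyncio
import time


async def fake_search(name: str, delay: float = 0.1) -> str:
    """Hypothetical stand-in for an I/O-bound API call."""
    await asyncio.sleep(delay)
    return f"{name}_results"


async def run_sequential(names: list[str]) -> list[str]:
    # Each call waits for the previous one: total is roughly the sum of delays.
    return [await fake_search(n) for n in names]


async def run_parallel(names: list[str]) -> list[str]:
    # All calls start together: total is roughly the slowest single delay.
    return list(await asyncio.gather(*(fake_search(n) for n in names)))


def timed(coro) -> tuple[list[str], float]:
    """Run a coroutine to completion and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start


names = ["flight", "hotel", "activities"]
seq_results, seq_seconds = timed(run_sequential(names))
par_results, par_seconds = timed(run_parallel(names))
print(f"sequential: {seq_seconds:.2f}s, parallel: {par_seconds:.2f}s")
```

The speedup only holds when the calls are independent and I/O-bound. With LLM-backed sub-agents, the slowest one sets the floor.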

Part 8: LoopAgent — Iterative Refinement (Generator-Critic)

Some outputs improve with iteration. A first-draft blog post benefits from a critic pass. A travel itinerary improves when checked against constraints. LoopAgent implements this generator-critic pattern: it loops through its sub-agents repeatedly until one of them triggers an escalate signal or max_iterations is reached.

# refinement_loop.py
from google.adk.agents import LlmAgent, LoopAgent
from google.adk.tools.tool_context import ToolContext


def exit_loop(tool_context: ToolContext) -> dict:
    """Call this only when the draft scores 8 or above, to end the refinement loop."""
    # An LLM cannot set escalate by writing text; a tool sets it on the event actions.
    tool_context.actions.escalate = True
    return {}


# Writer produces or revises the draft
writer_agent = LlmAgent(
    model="gemini-2.0-flash",
    name="writer_agent",
    description="Writes or revises the content draft.",
    # The trailing "?" marks a state key as optional, so the first pass
    # (before any draft or feedback exists) does not fail on a missing key.
    instruction="""
    If there is no draft yet, write an initial blog post based on the topic.
    If there is a draft in {current_draft?}, revise it based on the critic's
    feedback in {critic_feedback?}. Output the improved draft.
    """,
    output_key="current_draft",
)

# Critic reviews and decides whether to continue or finish
critic_agent = LlmAgent(
    model="gemini-2.0-flash",
    name="critic_agent",
    description="Reviews content quality and decides whether to continue iterating.",
    instruction="""
    Review the draft in {current_draft}. Score it from 1-10 for:
    clarity, accuracy, engagement, and SEO value.
    Provide specific, actionable improvement notes.
    If the overall score is 8 or above, call the exit_loop tool to finish.
    Otherwise do not call it, so the writer gets another revision pass.
    """,
    tools=[exit_loop],
    output_key="critic_feedback",
)

# Loops until exit_loop sets escalate or max_iterations is reached
content_refinement_loop = LoopAgent(
    name="content_refinement_loop",
    sub_agents=[writer_agent, critic_agent],
    max_iterations=5,
)

This maps directly onto production use cases: report generation with quality gates, code generation with test-run feedback, regulatory documents with compliance checks.
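
To see the control flow without any model calls, here is a framework-free sketch of the same generator-critic loop. The names refine, toy_writer, and toy_critic are hypothetical, and the toy critic simply raises its score with each revision; it illustrates the stopping rule, not ADK's implementation.

```python
from typing import Callable


def refine(
    write: Callable[[str, str, str], str],
    critique: Callable[[str], tuple[int, str]],
    topic: str,
    max_iterations: int = 5,
    pass_score: int = 8,
) -> tuple[str, int]:
    """Generic generator-critic loop: returns (final_draft, iterations_used)."""
    draft, feedback = "", ""
    for iteration in range(1, max_iterations + 1):
        draft = write(topic, draft, feedback)
        score, feedback = critique(draft)
        if score >= pass_score:  # plays the role of the escalate signal
            return draft, iteration
    return draft, max_iterations  # budget exhausted: return the best effort


def toy_writer(topic: str, draft: str, feedback: str) -> str:
    # First pass writes a draft; later passes append a revision marker.
    return f"Post about {topic}" if not draft else draft + " (revised)"


def toy_critic(draft: str) -> tuple[int, str]:
    # Score starts at 6 and rises by one per revision.
    return 6 + draft.count("(revised)"), "tighten the intro"


final_draft, iterations = refine(toy_writer, toy_critic, "LoopAgent")
print(iterations, final_draft)
```

The max_iterations cap matters as much as the quality threshold: a critic that never awards a passing score would otherwise loop forever.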

Part 9: The Complete Multi-Agent System

Now compose every pattern into one production system: a travel planner that runs research in parallel, refines the itinerary through a writer-critic loop, then validates before delivery.

# travel_planner.py — full production multi-agent system
from google.adk.agents import LlmAgent, SequentialAgent, ParallelAgent, LoopAgent
from google.adk.tools.agent_tool import AgentTool


# ── Tool functions ────────────────────────────────────────────────────────────

def search_flights(origin: str, destination: str, date: str) -> dict:
    """Search flights between two cities."""
    return {"flights": [{"flight": "AI-101", "price_usd": 850}]}

def search_hotels(city: str, check_in: str, check_out: str) -> dict:
    """Search hotels in a city."""
    return {"hotels": [{"name": "Grand Hotel", "price_per_night_usd": 180}]}

def search_activities(city: str, date: str) -> dict:
    """Search top attractions in a city."""
    return {"activities": ["Eiffel Tower", "Louvre Museum"]}

def validate_itinerary(itinerary: str) -> dict:
    """Validate an itinerary for conflicts and completeness."""
    return {"valid": True, "issues": []}


# ── Stage 1: Parallel research team ──────────────────────────────────────────

flight_agent    = LlmAgent(model="gemini-2.0-flash", name="flight_agent",
    description="Searches for available flights.",
    instruction="Find flights for the given route and date.",
    tools=[search_flights], output_key="flight_results")

hotel_agent     = LlmAgent(model="gemini-2.0-flash", name="hotel_agent",
    description="Finds hotels.",
    instruction="Find hotels for the city and dates.",
    tools=[search_hotels], output_key="hotel_results")

activities_agent = LlmAgent(model="gemini-2.0-flash", name="activities_agent",
    description="Recommends activities and attractions.",
    instruction="Find top activities for the city.",
    tools=[search_activities], output_key="activities_results")

research_team = ParallelAgent(
    name="research_team",
    sub_agents=[flight_agent, hotel_agent, activities_agent],
)

# ── Stage 2: Writer-critic refinement loop ────────────────────────────────────

from google.adk.tools.tool_context import ToolContext  # needed by exit_loop below

def exit_loop(tool_context: ToolContext) -> dict:
    """Call this only when the itinerary scores 8 or above, to end the loop."""
    tool_context.actions.escalate = True  # the signal LoopAgent watches for
    return {}

writer_agent = LlmAgent(model="gemini-2.0-flash", name="itinerary_writer",
    description="Drafts a travel itinerary from research results.",
    instruction="""Using the flights in {flight_results}, hotels in {hotel_results},
    and activities in {activities_results}, compose a detailed 3-day travel itinerary.
    On revision rounds, apply the feedback in {critic_feedback?}.""",
    output_key="itinerary_draft")

critic_agent = LlmAgent(model="gemini-2.0-flash", name="itinerary_critic",
    description="Reviews the itinerary for quality.",
    instruction="""Review the itinerary in {itinerary_draft}.
    Check for: logical flow, realistic timing, missing essentials.
    Score 1-10. If score >= 8, call the exit_loop tool.""",
    tools=[exit_loop],
    output_key="critic_feedback")

refinement_loop = LoopAgent(
    name="itinerary_refinement",
    sub_agents=[writer_agent, critic_agent],
    max_iterations=3,
)

# ── Stage 3: Validation ───────────────────────────────────────────────────────

validator_agent = LlmAgent(model="gemini-2.0-flash", name="validator_agent",
    description="Validates the final itinerary.",
    instruction="""Validate the itinerary in {itinerary_draft} using the
    validate_itinerary tool. Return the validation result.""",
    tools=[validate_itinerary],
    output_key="validation_result")

# ── Full pipeline: Research → Refine → Validate ───────────────────────────────

travel_planner = SequentialAgent(
    name="travel_planner",
    sub_agents=[research_team, refinement_loop, validator_agent],
)

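Run this with the ADK CLI. The CLI loads an agent package (a folder containing an __init__.py with "from . import agent" and an agent.py that exposes root_agent), not a single script. Save the code above as travel_planner/agent.py, add root_agent = travel_planner at the bottom, then run from the parent folder:

adk run travel_planner
# Or test with web UI:
adk web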

The architecture: Research (Parallel, 3x faster) → Refinement Loop (quality gates) → Validation (safety check) → Final output. Each stage is independently testable, swappable, and improvable without touching the others.

Part 10: Session State and Agent Communication

The mechanism agents use to pass data between each other in ADK is session state — a shared key-value store available within a single conversation session. output_key on an LlmAgent writes the agent’s final response to a state key. Any downstream agent can read it via {key_name} interpolation in its instruction.

This is the recommended pattern for SequentialAgent pipelines. For AgentTool invocations, the result is returned inline to the calling coordinator — no state write needed.

For cross-session persistence (memory that survives across different user conversations), ADK provides a Memory component separate from State. Think of State as session RAM and Memory as persistent storage.
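
To make the state mechanics concrete, here is a small simulation of what output_key and {key} interpolation do. It is not ADK code: render_instruction is a hypothetical helper, the state is a plain dict, and the trailing "?" for optional keys follows the convention described in ADK's state documentation (treat that as an assumption if your version differs).

```python
import re


def render_instruction(template: str, state: dict) -> str:
    """Replace {key} with state[key]; {key?} is optional and renders as empty."""

    def substitute(match: re.Match) -> str:
        key, optional = match.group(1), match.group(2) == "?"
        if key in state:
            return str(state[key])
        if optional:
            return ""
        raise KeyError(f"state key not found: {key}")

    return re.sub(r"\{(\w+)(\??)\}", substitute, template)


# Simulated session state: one dict shared by every agent in the session.
state: dict = {}

# What output_key="flight_results" does once the flight agent has responded.
state["flight_results"] = "AI-101 at $850"

# What a downstream agent's instruction looks like after interpolation.
prompt = render_instruction(
    "Plan a trip using {flight_results}. Notes: {critic_feedback?}", state
)
print(prompt)
```

A required key that is missing fails loudly, which is usually what you want in a pipeline: it means an upstream agent did not run or wrote to a different key.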

Reference: Sessions & Memory — ADK Docs

Part 11: Running and Debugging

ADK’s developer tooling is one of its strongest differentiators.

# Run interactively in the terminal (travel_planner is the agent package folder)
adk run travel_planner

# Launch the visual dev UI (inspect events, state, tool calls)
adk web

# Evaluate against test datasets
adk eval travel_planner eval_dataset.json

The web UI shows every Event in the execution tree: which agent ran, which tools were called, what was written to state, and how long each step took. For multi-agent systems with 5+ agents, this is invaluable for debugging delegation failures and unexpected routing.

Part 12: Deployment

When your agent is production-ready, ADK provides first-class deployment to Google Cloud:

# Deploy to Vertex AI Agent Engine (managed, auto-scaling)
# (project, region, and staging bucket options omitted; see the deployment docs)
adk deploy agent_engine travel_planner

# Or containerise for Cloud Run
adk deploy cloud_run --project YOUR_GCP_PROJECT --region YOUR_REGION travel_planner

ADK’s architecture includes several production-focused features: direct integration with Vertex AI Agent Engine, support for containerised deployment, pre-built connectors to enterprise systems and databases like AlloyDB, BigQuery, and NetApp, bidirectional streaming support for real-time audio and video interactions, and built-in frameworks to assess response quality and execution paths.

References: Deploy to Agent Engine, Deploy to Cloud Run

The Architecture Mental Model

USER QUERY
     │
     ▼
┌─────────────────────────────────────────────────────────────┐
│  ROOT COORDINATOR (LlmAgent)                                │
│  Receives query → decides which agents/tools to invoke      │
└────────┬──────────────┬──────────────────────┬─────────────┘
         │              │                      │
         ▼              ▼                      ▼
  AgentTool A     AgentTool B           SequentialAgent
  (Specialist)    (Specialist)          └─ Step 1 Agent
                                        └─ Step 2 Agent
                                        └─ Step 3 Agent
                                                │
                                         ParallelAgent
                                         ├─ Worker A  ──┐
                                         ├─ Worker B  ──┤ → merged
                                         └─ Worker C  ──┘
                                                │
                                           LoopAgent
                                           ├─ Writer → draft
                                           └─ Critic → escalate?
                                                │
                                         FINAL RESPONSE

What You’ve Built

Walking through this guide, you’ve assembled the full ADK vocabulary: LlmAgent for reasoning specialists, SequentialAgent for guaranteed-order pipelines, ParallelAgent for concurrent research teams, LoopAgent for iterative refinement cycles, and AgentTool for explicit coordinator-to-specialist delegation.

The travel planner is a working template for any multi-agent system in production: research fast (parallel), draft well (loop), gate with quality checks (critic), validate before shipping (sequential). Swap the domain, adjust the tools, deploy to Vertex AI.

This is how Google builds its own production agent systems. Now it’s your framework too.

Resources

ADK Official Documentation — home of all ADK guides
ADK Python Quickstart — your first agent in 5 minutes
Multi-Agent Systems in ADK — patterns and primitives
Sequential Agents — guaranteed-order pipelines
Parallel Agents — concurrent execution
Loop Agents — iterative refinement
Sessions & Memory — state and cross-session persistence
Deploy to Agent Engine — Vertex AI deployment
Google Cloud Blog: Build Multi-Agentic Systems
ADK Technical Overview — deep dive on architecture

All code examples syntax-verified against Python 3.11. Install: pip install google-adk. Get a free Gemini API key at aistudio.google.com.

The Complete Guide to Agent Quality & Evaluation: Metrics, LLM-as-Judge, and LangSmith

Satish Prasad — Sun, 07 Jun 2026 12:57:28 +0000

A tutorial for developers who ship agents into the real world — and need to know if they’re actually working.

The Problem Nobody Talks About at Demo Time

Your agent demo looked flawless. It answered every question correctly, called the right tools in the right order, and finished in under three seconds. The audience applauded.

Two weeks after going live, your support queue is filling up with: “The agent gave me completely wrong information.” “It searched the wrong database.” “It hallucinated a date that doesn’t exist.”

Here’s the hard truth: demos don’t break agents. Real users do. And without a systematic evaluation framework, you will always be one bad production run away from a confidence crisis.

This guide teaches you everything you need: the metrics that matter, how to build evaluators from scratch, how LLM-as-a-judge works, and how LangSmith closes the loop from local testing all the way to production monitoring. We build each concept on the last, so by the end you’ll have a complete evaluation system you can deploy today.

Part 1: Foundations — What Does “Agent Quality” Actually Mean?

Before you can measure anything, you need a model of what you’re measuring.

An agent isn’t a static function. It’s a decision-making system that reasons, selects tools, retrieves data, and generates responses — often over multiple steps. Quality failure can happen at any of those layers.

Think of agent quality across four dimensions:

1. Output Quality

Does the final answer satisfy the user’s intent? Is it correct, relevant, and complete — without hallucinating facts?

2. Trajectory Quality

Did the agent take the right path to get there? Did it call the correct tools, in the correct order, without unnecessary detours?

3. Latency and Efficiency

How long did each step take? How many tokens were consumed? Are there runaway loops or redundant tool calls?

4. Safety and Guardrails

Did the agent stay within its defined scope? Did it avoid toxic, harmful, or out-of-policy outputs?

Each dimension needs its own evaluator. A single “pass/fail” score tells you almost nothing. Let’s build the measurement layer, dimension by dimension.

Part 2: The Metrics That Matter — What to Track

Here’s a practical taxonomy of agent evaluation metrics, drawn from production experience and the LangSmith evaluation framework.

Correctness (Output vs. Reference)

The baseline: does the agent’s answer match the expected answer?

This can be measured exactly (string match, JSON match) or approximately (semantic similarity, LLM judge). Use exact match for structured outputs (IDs, dates, classifications). Use LLM-as-judge for conversational or long-form outputs.
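
For the exact-match case, a minimal sketch might look like the following. It uses the same (inputs, outputs, reference_outputs) shape as the evaluators later in this guide; the "answer" field name and the whitespace/case normalisation are assumptions you should adapt to your own output schema.

```python
def exact_match_evaluator(inputs: dict, outputs: dict, reference_outputs: dict) -> dict:
    """Score 1 when the normalised answer equals the normalised reference."""

    def normalise(value) -> str:
        # Lowercase and collapse whitespace so trivial formatting does not fail the check.
        return " ".join(str(value).lower().split())

    predicted = normalise(outputs.get("answer", ""))
    expected = normalise(reference_outputs.get("answer", ""))
    return {"key": "exact_match", "score": int(predicted == expected)}


result = exact_match_evaluator(
    {"question": "What is the order status?"},
    {"answer": "  Out for Delivery "},
    {"answer": "out for delivery"},
)
print(result)
```

Exact match is deliberately strict. If a paraphrase would also be acceptable, that is the signal to switch to a semantic or LLM-based check.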

Groundedness / Faithfulness

Does the agent’s response stay grounded in the retrieved documents or tools it actually used? An agent that “knows” something it wasn’t given is hallucinating.

Per the LangSmith RAG evaluation guide, groundedness measures response vs. retrieved docs — not vs. a reference answer. This means you can evaluate it without ground truth.

Relevance

Does the answer actually address the user’s question? An agent can be perfectly faithful to its retrieved documents and still fail if it retrieved the wrong documents in the first place.

Track this at two levels: response relevance (answer vs. question) and retrieval relevance (retrieved docs vs. question).

Trajectory Accuracy

This is unique to agents. It asks: did the agent take the expected sequence of steps?

As the LangSmith evaluation approaches documentation explains, trajectory evaluation can target:

Exact match — did the agent call tools A → B → C in exactly that order?
Unordered match — did the agent call the right set of tools, in any order?
Subset/superset — did the agent at least call the required minimum tools?
LLM-judge over full trajectory — pass the entire message + tool call history to a judge for holistic assessment.
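
The first three modes can be sketched in a few lines of plain Python over tool names. This is an illustration of the definitions above, not the agentevals implementation; the function name trajectory_match and the mode strings are assumptions chosen for this sketch.

```python
from collections import Counter


def trajectory_match(actual: list[str], expected: list[str], mode: str = "exact") -> bool:
    """Compare two tool-call sequences under one of four matching modes."""
    if mode == "exact":
        return actual == expected  # same tools, same order
    if mode == "unordered":
        return Counter(actual) == Counter(expected)  # same tools, any order
    if mode == "superset":
        # Agent called at least every expected tool (extras allowed).
        return not (Counter(expected) - Counter(actual))
    if mode == "subset":
        # Agent called nothing beyond the expected tools.
        return not (Counter(actual) - Counter(expected))
    raise ValueError(f"unknown mode: {mode}")


expected = ["retrieve_customer_profile", "check_order_status", "generate_response"]
reordered = ["check_order_status", "retrieve_customer_profile", "generate_response"]
with_extra = expected + ["send_survey"]

print(trajectory_match(reordered, expected, "exact"))
print(trajectory_match(reordered, expected, "unordered"))
print(trajectory_match(with_extra, expected, "superset"))
```

Pick the loosest mode that still catches the failures you care about. Exact match is brittle when several tool orders are equally valid.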

Latency (p50, p95, p99)

Track response time at the percentile level. p50 tells you typical performance. p95 and p99 tell you what your worst users experience. Looping agents or redundant tool calls show up here first.
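
If you export raw latencies from your traces, the percentiles are a few lines of standard-library Python. This sketch assumes a list of per-run latencies in milliseconds; the sample data is synthetic and latency_percentiles is a hypothetical helper name.

```python
import statistics


def latency_percentiles(latencies_ms: list[float]) -> dict[str, float]:
    """Return p50, p95, and p99 for a list of latencies (needs at least 2 values)."""
    # quantiles(n=100) returns the 99 cut points p1..p99.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


# Synthetic sample: 95 typical runs at 100 ms plus 5 runs stuck in a loop at 5000 ms.
samples = [100.0] * 95 + [5000.0] * 5
stats = latency_percentiles(samples)
print(stats)
```

In this sample the median hides the problem entirely, while the tail percentiles expose the looping runs. That is the reason to track more than the average.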

Token Efficiency

Total tokens per run, tokens per tool call, and token cost per session. Useful for catching prompt bloat and runaway context growth in long-running agents.

Composite Quality Score

LangSmith supports composite evaluators that combine multiple scores into a single weighted metric. For example: Overall Quality = (70% × correctness) + (20% × relevance) + (10% × conciseness). Useful for dashboards and regression gates.
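
The weighted formula itself is simple enough to sketch in plain Python. This is not LangSmith's composite evaluator API, just the arithmetic behind it; composite_score and the WEIGHTS dict are hypothetical names, and missing scores are treated as 0.

```python
WEIGHTS = {"correctness": 0.7, "relevance": 0.2, "conciseness": 0.1}


def composite_score(scores: dict[str, float], weights: dict[str, float] = WEIGHTS) -> float:
    """Weighted average of per-metric scores; a missing metric counts as 0."""
    total_weight = sum(weights.values())
    weighted = sum(weights[name] * scores.get(name, 0.0) for name in weights)
    return weighted / total_weight


overall = composite_score({"correctness": 1.0, "relevance": 0.5, "conciseness": 0.0})
print(round(overall, 3))
```

Dividing by the total weight keeps the result on a 0 to 1 scale even if you later change the weights and they no longer sum to 1.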

Part 3: Your First Evaluator — Code-Based Rules

Not everything needs an LLM to evaluate. Start simple.

A code-based evaluator is just a Python function. It receives the agent’s inputs, outputs, and optionally reference outputs — and returns a score.

# evaluators.py

def response_length_evaluator(inputs: dict, outputs: dict, reference_outputs: dict = None) -> dict:
    """
    A simple evaluator that checks whether the response is concise.
    Flags responses over 500 words.
    """
    word_count = len(outputs.get("answer", "").split())
    score = 1 if word_count <= 500 else 0
    return {
        "key": "conciseness",
        "score": score,
        "comment": f"Response length: {word_count} words"
    }


def json_format_evaluator(inputs: dict, outputs: dict, reference_outputs: dict = None) -> dict:
    """
    Checks that the agent returned valid, parseable JSON where expected.
    """
    import json
    try:
        json.loads(outputs.get("structured_output", ""))
        return {"key": "valid_json", "score": 1}
    except (json.JSONDecodeError, TypeError):
        return {"key": "valid_json", "score": 0, "comment": "Output is not valid JSON"}


def tool_call_count_evaluator(inputs: dict, outputs: dict, reference_outputs: dict = None) -> dict:
    """
    Checks that the agent didn't make an excessive number of tool calls (a sign of looping).
    """
    tool_calls = outputs.get("tool_calls", [])
    score = 1 if len(tool_calls) <= 5 else 0
    return {
        "key": "tool_efficiency",
        "score": score,
        "comment": f"Tool calls made: {len(tool_calls)}"
    }

These run instantly, cost nothing, and catch structural failures immediately. Use them as your first filter before investing in LLM-based evaluation.

Part 4: LLM-as-Judge — Evaluating What Rules Can’t

Some failures are semantic, not structural. An agent might return a perfectly formatted JSON with a factually wrong answer. A rule can’t catch that. An LLM judge can.

LLM-as-judge is the pattern where a second, independent LLM evaluates the output of your primary agent. The judge receives a structured prompt with the question, the agent’s answer, and optionally a reference answer — then returns a score and reasoning.

Here’s how the LangSmith evaluation quickstart describes the key components: inputs (what was passed to your agent), outputs (what your agent returned), and reference_outputs (the ground truth answers from your dataset).

Build a Custom LLM-as-Judge Evaluator

# llm_judge_evaluators.py
from langchain_anthropic import ChatAnthropic

judge_llm = ChatAnthropic(model="claude-sonnet-4-20250514", temperature=0)

def correctness_judge(inputs: dict, outputs: dict, reference_outputs: dict) -> dict:
    """
    LLM-as-judge evaluator for factual correctness.
    Compares agent answer against reference answer.
    Returns score 0 (incorrect) or 1 (correct) with reasoning.
    """
    prompt = f"""You are an expert evaluator assessing an AI agent's response.

Question asked: {inputs.get('question', '')}

Reference answer (ground truth): {reference_outputs.get('answer', '')}

Agent's answer: {outputs.get('answer', '')}

Your task: Assess whether the agent's answer is factually correct relative to the reference answer.
Respond in this exact format:
SCORE: [0 or 1]
REASONING: [one sentence explaining why]"""

    response = judge_llm.invoke(prompt)
    content = response.content

    score = 1 if "SCORE: 1" in content else 0
    reasoning = content.split("REASONING:")[-1].strip() if "REASONING:" in content else ""

    return {
        "key": "correctness",
        "score": score,
        "comment": reasoning
    }


def groundedness_judge(inputs: dict, outputs: dict, reference_outputs: dict = None) -> dict:
    """
    LLM-as-judge for groundedness: checks if the answer is supported
    by the retrieved context (no reference needed).
    """
    context = outputs.get("retrieved_context", "")
    answer = outputs.get("answer", "")

    if not context:
        return {"key": "groundedness", "score": 0, "comment": "No retrieved context found"}

    prompt = f"""You are grading whether an AI answer is grounded in retrieved documents.

Retrieved context:
{context}

AI answer:
{answer}

Return 1 if the answer is fully supported by the context.
Return 0 if the answer contains information NOT present in the context (hallucination).

SCORE: [0 or 1]
REASONING: [one sentence]"""

    response = judge_llm.invoke(prompt)
    content = response.content
    score = 1 if "SCORE: 1" in content else 0
    reasoning = content.split("REASONING:")[-1].strip() if "REASONING:" in content else ""

    return {"key": "groundedness", "score": score, "comment": reasoning}


def relevance_judge(inputs: dict, outputs: dict, reference_outputs: dict = None) -> dict:
    """
    Evaluates whether the agent's answer actually addresses the user's question.
    Reference-free: compares answer to input question only.
    """
    question = inputs.get("question", "")
    answer = outputs.get("answer", "")

    prompt = f"""Does the following answer directly address the question?

Question: {question}
Answer: {answer}

Respond in this exact format:
SCORE: [1 if relevant, 0 if off-topic or evasive]
REASONING: [one sentence]"""

    response = judge_llm.invoke(prompt)
    content = response.content
    score = 1 if "SCORE: 1" in content else 0
    reasoning = content.split("REASONING:")[-1].strip() if "REASONING:" in content else ""

    return {"key": "relevance", "score": score, "comment": reasoning}
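
One fragile spot in the judges above is the substring check for "SCORE: 1", which also matches when that text appears inside the reasoning. A slightly more defensive parser, sketched here with a hypothetical parse_judge_response helper, only accepts a score at the start of a line and defaults to 0 when no verdict is found:

```python
import re


def parse_judge_response(content: str) -> tuple[int, str]:
    """Extract (score, reasoning) from 'SCORE: x' / 'REASONING: y' text."""
    # Only accept a verdict at the start of a line, and reject values like 10.
    score_match = re.search(
        r"^\s*SCORE:\s*\[?([01])(?!\d)", content, flags=re.MULTILINE | re.IGNORECASE
    )
    reason_match = re.search(
        r"REASONING:\s*(.+)", content, flags=re.DOTALL | re.IGNORECASE
    )
    score = int(score_match.group(1)) if score_match else 0
    reasoning = reason_match.group(1).strip() if reason_match else ""
    return score, reasoning


clean = parse_judge_response("SCORE: 1\nREASONING: Matches the reference.")
tricky = parse_judge_response(
    "SCORE: 0\nREASONING: The date is wrong; a SCORE: 1 would need the correct year."
)
print(clean, tricky)
```

For anything beyond a prototype, asking the judge model for structured output is usually more reliable than parsing free text.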

Using OpenEvals — Pre-Built Judges

For production use, the openevals library ships ready-made LLM-as-judge evaluators with battle-tested prompts:

# Using openevals for correctness (pip install openevals)
from openevals.llm import create_llm_as_judge
from openevals.prompts import CORRECTNESS_PROMPT, CONCISENESS_PROMPT

correctness_evaluator = create_llm_as_judge(
    prompt=CORRECTNESS_PROMPT,
    model="anthropic:claude-sonnet-4-20250514",
    feedback_key="correctness",
)

conciseness_evaluator = create_llm_as_judge(
    prompt=CONCISENESS_PROMPT,
    model="anthropic:claude-sonnet-4-20250514",
    feedback_key="conciseness",
)

A word of caution: LLM judges don’t always get it right. LangSmith allows human auditors to review and correct evaluator scores — building a feedback loop that continuously improves judge accuracy over time. See how to audit evaluator scores.

Part 5: Trajectory Evaluation — Judging the Path, Not Just the Destination

For agents, the how matters as much as the what. An agent that arrives at the right answer after 12 unnecessary tool calls isn’t production-ready.

The agentevals package provides trajectory evaluators:

# trajectory_eval.py
# pip install agentevals langsmith

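import json

from agentevals.trajectory.match import create_trajectory_match_evaluator
from langsmith import evaluate


def as_tool_call_messages(tool_names: list[str]) -> list[dict]:
    """agentevals compares OpenAI-style message lists, so wrap each tool name
    in an assistant message that carries one tool call."""
    return [
        {
            "role": "assistant",
            "content": "",
            "tool_calls": [{"function": {"name": name, "arguments": json.dumps({})}}],
        }
        for name in tool_names
    ]


# Define expected trajectory for a customer support query
reference_trajectory = as_tool_call_messages([
    "retrieve_customer_profile",
    "check_order_status",
    "generate_response"
])

# Create a trajectory match evaluator in "unordered" mode
# (tools must all appear, but order flexible)
trajectory_match = create_trajectory_match_evaluator(
    trajectory_match_mode="unordered"
)


def trajectory_evaluator(inputs: dict, outputs: dict, reference_outputs: dict = None) -> dict:
    """Adapter: pass only the trajectory (not the whole output dict) to agentevals."""
    return trajectory_match(
        outputs=outputs["trajectory"],
        reference_outputs=reference_trajectory,
    )


def run_agent_and_track(inputs: dict) -> dict:
    """
    Wraps your agent to capture both the final response and the tool trajectory.
    With a LangGraph agent that keeps a messages list in state, the final
    state's messages can usually serve as the trajectory.
    """
    # Simulate agent run; in production, return the agent's real message history
    trajectory = as_tool_call_messages(
        ["retrieve_customer_profile", "check_order_status", "generate_response"]
    )
    answer = "Your order #1234 is out for delivery and will arrive today."

    return {
        "answer": answer,
        "trajectory": trajectory
    }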


# Run trajectory evaluation
results = evaluate(
    run_agent_and_track,
    data="customer-support-dataset",       # Your LangSmith dataset name
    evaluators=[trajectory_evaluator],
    experiment_prefix="support-agent-v2-trajectory",
)

Reference: Evaluating an agent’s trajectory, Trajectory match evaluator

Part 6: The Evaluation Framework — Putting It All Together

Now you have individual evaluators. Let’s wire them into a complete evaluation pipeline using LangSmith’s evaluate function.

Step 1: Create Your Dataset

A dataset is a collection of test examples — each with an input and an optional reference output. Build your first dataset from three sources:

Manually curated golden examples (high signal)
Historical production traces where the agent did well (realistic coverage)
Synthetic variations generated by an LLM (breadth at scale)

from langsmith import Client

client = Client()

# Create a dataset
dataset = client.create_dataset(
    dataset_name="agent-quality-v1",
    description="Evaluation dataset for the customer support agent"
)

# Add examples
examples = [
    {
        "inputs": {"question": "What is the refund policy for digital products?"},
        "outputs": {"answer": "Digital products are non-refundable unless the file is corrupted."}
    },
    {
        "inputs": {"question": "How do I track my order?"},
        "outputs": {"answer": "Log in to your account, go to Orders, and click Track on the relevant order."}
    },
    {
        "inputs": {"question": "Can I change my shipping address after ordering?"},
        "outputs": {"answer": "You can change your address within 1 hour of placing the order by contacting support."}
    },
]

client.create_examples(
    inputs=[e["inputs"] for e in examples],
    outputs=[e["outputs"] for e in examples],
    dataset_id=dataset.id,
)

Step 2: Define the Target Function

# The function LangSmith will evaluate
def my_agent_target(inputs: dict) -> dict:
    """
    Your agent call wrapped in a target function.
    LangSmith passes each dataset example's input here.
    """
    from langchain_anthropic import ChatAnthropic

    model = ChatAnthropic(model="claude-sonnet-4-20250514", temperature=0)
    question = inputs.get("question", "")
    response = model.invoke(f"You are a helpful customer support agent.\n\nQuestion: {question}")
    return {"answer": response.content}

Step 3: Run the Full Evaluation

from langsmith import evaluate
# Import your evaluators from earlier sections
from evaluators import response_length_evaluator, json_format_evaluator
from llm_judge_evaluators import correctness_judge, relevance_judge

results = evaluate(
    my_agent_target,
    data="agent-quality-v1",
    evaluators=[
        correctness_judge,
        relevance_judge,
        response_length_evaluator,
    ],
    experiment_prefix="customer-support-v1",
    num_repetitions=1,        # Run each example once
    max_concurrency=4,        # Parallel evaluation for speed
)

# evaluate() prints the experiment link when it starts; the name identifies it in the UI
print(f"Experiment complete: {results.experiment_name}")

Reference: Evaluation quickstart — LangSmith, Run evals in LangSmith

Part 7: The LangSmith Platform — Closing the Loop

Everything above can run locally. But LangSmith is where evaluation becomes a continuous discipline rather than a one-time script.

What LangSmith Actually Is

LangSmith is a framework-agnostic platform for building, debugging, and deploying AI agents. It works with LangGraph, plain LangChain, OpenAI calls, and any other stack. You get tracing, evaluation, prompt management, and monitoring in one place.

The workflow is a repeating cycle: Trace → Evaluate → Compare → Monitor → Improve.

Offline Evaluation: Test Before You Ship

The evaluate function runs your agent against a dataset and logs every result as an experiment in LangSmith. Each experiment shows:

Per-example scores for every evaluator
Aggregate pass rates across the dataset
Side-by-side diff when you compare two experiments

Regression testing is where this becomes powerful. After every prompt change or model upgrade, run the same dataset. LangSmith’s comparison view highlights exactly which examples regressed — no manual diffing needed.

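# Compare two experiments after a model upgrade
# (my_agent_target_v1 and my_agent_target_v2 are the old-model and new-model
# versions of the target function; use the same evaluators for both runs,
# otherwise there are no scores to compare)
# Run experiment 1: old model
results_v1 = evaluate(my_agent_target_v1, data="agent-quality-v1",
                       evaluators=[correctness_judge, relevance_judge],
                       experiment_prefix="support-agent-gpt4")

# Run experiment 2: new model
results_v2 = evaluate(my_agent_target_v2, data="agent-quality-v1",
                       evaluators=[correctness_judge, relevance_judge],
                       experiment_prefix="support-agent-claude")

# In LangSmith UI: select both experiments → Compare
# Instantly see which examples improved or regressed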
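
If you also want a regression check in code, the comparison reduces to a per-example diff of scores. The sketch below assumes you have already exported each experiment's scores into a dict keyed by example ID; the dict shape, the IDs, and the find_regressions helper are all hypothetical.

```python
def find_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.0,
) -> list[str]:
    """Return IDs of examples whose score dropped by more than `tolerance`."""
    return sorted(
        example_id
        for example_id, old_score in baseline.items()
        if example_id in candidate and candidate[example_id] < old_score - tolerance
    )


# Hypothetical per-example correctness scores from two experiments.
old_scores = {"ex-1": 1.0, "ex-2": 1.0, "ex-3": 0.0}
new_scores = {"ex-1": 1.0, "ex-2": 0.0, "ex-3": 1.0}

regressed = find_regressions(old_scores, new_scores)
print(regressed)
```

Looking at regressions per example matters because an unchanged aggregate score can hide one example improving while another breaks, as in the sample data here.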

Reference: How to compare experiment results

Online Evaluation: Monitor in Production

Once your agent is live, you can’t run every interaction against a dataset — there’s no reference answer for real user queries. This is where online evaluation takes over.

Online evaluators run automatically on your production traces, in near real-time, using reference-free checks:

Safety checks — is the output within policy?
Format validation — is structured output parseable?
Quality heuristics — is the response suspiciously short or empty?
Reference-free LLM-as-judge — does the answer address the question?

# This runs automatically on every production trace, no code changes needed.
# Set up via LangSmith UI → Projects → Your Project → Evaluators tab → + Evaluator

Apply sampling rates to control cost — for example, run the full LLM judge on 10% of traces and code evaluators on 100%.
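
LangSmith handles sampling for you in the evaluator settings, but the idea is easy to see in code. This sketch shows deterministic sampling by hashing a trace ID, so the same trace always gets the same decision; should_sample, evaluators_for, and the evaluator names are hypothetical.

```python
import hashlib


def should_sample(trace_id: str, rate: float) -> bool:
    """Deterministically select roughly `rate` of traces from a hash of the ID."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate


def evaluators_for(trace_id: str) -> list[str]:
    """Cheap checks run on every trace; the LLM judge runs on about 10%."""
    selected = ["format_check", "safety_check"]
    if should_sample(trace_id, 0.10):
        selected.append("llm_relevance_judge")
    return selected


judged = sum(should_sample(f"trace-{i}", 0.10) for i in range(10_000))
print(f"LLM judge would run on {judged} of 10000 traces")
```

Hash-based sampling is handy for debugging because a given trace is either always judged or never judged, regardless of which worker processes it.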

Reference: Online evaluation flow, Online evaluation types

The Feedback Loop: From Production Failures to Dataset Gold

This is the highest-value workflow in LangSmith and the most underused:

A production trace scores poorly on your online evaluator.
You click Add to Dataset directly in the LangSmith UI.
That failing example becomes a new test case in your offline dataset.
You fix the prompt, run the evaluation — and verify the fix holds on the exact input that broke production.
Redeploy. Repeat.

“Add failing production traces to your dataset, create targeted evaluators, validate fixes with offline experiments, and redeploy.” — LangSmith evaluation concepts

This loop — production failure → curated dataset → targeted eval → verified fix — is what separates teams that continuously improve their agents from teams that perpetually firefight.

Pytest Integration: Eval as Code

For CI/CD pipelines, LangSmith’s pytest integration lets you define evaluations as unit tests. Every @pytest.mark.langsmith-decorated test syncs to a dataset and creates an experiment on each run:

# test_agent_quality.py
import pytest
from langsmith import testing as lst

# Hypothetical module name: import the target function from wherever you saved Step 2
from agent_target import my_agent_target

@pytest.mark.langsmith
def test_refund_policy_answer():
    """Agent must correctly answer the refund policy question."""
    inputs = {"question": "Are digital products refundable?"}
    output = my_agent_target(inputs)

    lst.log_inputs(inputs)
    lst.log_outputs(output)
    lst.log_reference_outputs({"answer": "Digital products are non-refundable unless the file is corrupted."})

    assert "non-refundable" in output["answer"].lower(), (
        f"Expected refund policy language, got: {output['answer']}"
    )

Run it:

LANGSMITH_API_KEY=your_key pytest test_agent_quality.py -v

Every run creates a new experiment in LangSmith with a pass/fail rate. Block your CI pipeline if pass rate drops below your threshold. Ship with confidence.
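
The gate itself can be as small as a function that turns pass/fail results into a process exit code. This is a sketch of the idea, not a LangSmith feature; the gate helper, the 0.9 threshold, and the sample scores are assumptions.

```python
import sys


def gate(scores: list[int], threshold: float = 0.9) -> int:
    """Return exit code 0 when the pass rate meets the threshold, else 1."""
    if not scores:
        print("No evaluation results found; failing the gate.")
        return 1
    pass_rate = sum(scores) / len(scores)
    print(f"Pass rate: {pass_rate:.0%} (threshold {threshold:.0%})")
    return 0 if pass_rate >= threshold else 1


# Hypothetical results: 1 = example passed, 0 = example failed.
exit_code = gate([1, 1, 1, 1, 1, 1, 1, 1, 1, 0])

if __name__ == "__main__":
    sys.exit(exit_code)
```

Treating an empty result set as a failure is a deliberate choice: a broken evaluation job should block the pipeline, not wave changes through.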

Part 8: The Full Evaluation Architecture

Here is the complete mental model — evaluation at every stage of the agent lifecycle:

LOCAL DEVELOPMENT
├── Unit evaluators (code-based, instant)
├── LLM-as-judge (correctness, relevance, groundedness)
└── Trajectory match (tool call sequence checks)
            │
            ▼
PRE-SHIP (CI/CD Gate)
├── LangSmith dataset evaluation (offline)
├── Experiment comparison vs. baseline
└── pytest regression suite → block on fail
            │
            ▼
PRODUCTION (Continuous)
├── LangSmith tracing (every run captured)
├── Online evaluators (safety, format, quality — sampled)
├── Dashboards + alerts (p95 latency, eval score trends)
└── Feedback loop → failing traces → dataset → fix

What You’ve Built

Walk through what we’ve just constructed:

Starting with why quality matters, you built a multi-dimensional mental model — output quality, trajectory quality, efficiency, and safety. Then you built code-based evaluators for structural checks, LLM-as-judge evaluators for semantic quality, and trajectory evaluators for agent path validation. You wired them into a LangSmith evaluation pipeline backed by a curated dataset, ran offline experiments to gate CI/CD, and deployed online evaluators to monitor production in real time. Finally, you closed the loop — turning production failures into dataset gold.

This is the evaluation system that the best agent teams in production are running today. Every piece is documented, every link verified, and every code block is tested and runnable.

Resources

All code examples verified against current LangSmith and LangChain documentation. Install: pip install langsmith openevals agentevals langchain-anthropic

From Zero to Deep Agent: A Step-by-Step Guide Using LangGraph

Satish Prasad — Sun, 07 Jun 2026 12:40:24 +0000

A story for every builder who has stared at a blank Python file and wondered: “Where do I even begin?”

The Day My First Agent Broke in Production

Let me take you back to a Monday morning. I had just shipped what I thought was a beautiful AI agent — it answered questions, called APIs, even had a nice streaming UI. By Tuesday afternoon, it was dead. It had lost track of its own conversation, forgotten what tools it had already used, and looped itself into oblivion on a complex multi-step task.

The real problem wasn’t the model. The model was smart enough. The problem was I had no framework for orchestrating the agent’s thinking — no shared memory, no controlled routing between steps, no way to pause for human review. I had built a racecar with no steering wheel.

That’s when I found LangGraph. And more recently — LangGraph’s Deep Agents harness.

This guide walks you through every concept you need, with working code at each step. By the end, you’ll have a fully functional deep research agent that plans tasks, delegates to subagents, and remembers its work across sessions.

Chapter 1: What Is LangGraph — And Why Should You Care?

Before we write a single line of code, you need to understand the mental model.

LangGraph is a low-level orchestration framework for building stateful, long-running agents. Trusted by companies like Klarna, Uber, and J.P. Morgan, it gives you precise control over how your agent thinks and moves through a problem.

The key idea is elegant: your agent’s behavior is a graph.

Every agent you build has three moving parts:

State — a shared data structure representing a snapshot of everything the agent knows right now.
Nodes — functions that do the actual work: calling an LLM, running a tool, grading a result.
Edges — the routing logic that decides what happens next. They can be fixed transitions or conditional branches based on the current state.

“Nodes do the work. Edges tell what to do next.” — LangGraph Graph API docs

This is fundamentally different from a chain or a simple prompt loop. In LangGraph, the agent can cycle back, branch to a different path, pause for a human, or delegate to a subagent — all in a structured, observable way.

And sitting on top of LangGraph is the newer Deep Agents harness — a batteries-included layer that adds built-in planning, a virtual filesystem, subagent spawning, and long-term memory. Think of it like this:

Layer	Role
LangGraph	Orchestration runtime — durable execution, streaming, human-in-the-loop
LangChain	Agent framework — models, tools, agent loops
Deep Agents	Agent harness — planning, subagents, context management
LangSmith	Observability — tracing, evaluation, debugging

We’ll build from the bottom up — starting with a raw LangGraph graph, then upgrading to Deep Agents patterns.

Chapter 2: Your First Real Graph — State, Nodes, and Edges

Install the dependencies:

pip install langgraph langchain-anthropic

Now let’s build the simplest possible agent: one that receives a message and responds.

Step 1: Define Your State

State is the backbone. Everything your agent knows — messages, intermediate results, flags — lives here.

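from typing import Annotated, TypedDict
from langchain_core.messages import AnyMessage
from langgraph.graph.message import add_messages

class AgentState(TypedDict):
    # add_messages merges updates into the history by message ID instead of
    # overwriting it, which prebuilt nodes such as ToolNode rely on.
    messages: Annotated[list[AnyMessage], add_messages]
    task_complete: bool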

Reference: Define state — LangGraph Graph API

Step 2: Define Your Nodes

Each node is a plain Python function. It receives the current state and returns updates to the state.

from langchain_anthropic import ChatAnthropic

model = ChatAnthropic(model="claude-sonnet-4-20250514", temperature=0)

def call_llm(state: AgentState) -> AgentState:
    """Node: call the LLM with current message history."""
    response = model.invoke(state["messages"])
    return {"messages": state["messages"] + [response]}

def check_complete(state: AgentState) -> AgentState:
    """Node: mark task as complete (simplified)."""
    return {"task_complete": True}

Step 3: Wire the Graph

from langgraph.graph import START, END, StateGraph

builder = StateGraph(AgentState)

# Add nodes
builder.add_node("call_llm", call_llm)
builder.add_node("check_complete", check_complete)

# Add edges
builder.add_edge(START, "call_llm")
builder.add_edge("call_llm", "check_complete")
builder.add_edge("check_complete", END)

graph = builder.compile()

Step 4: Run It

from langchain_core.messages import HumanMessage

result = graph.invoke({
    "messages": [HumanMessage(content="What is LangGraph?")],
    "task_complete": False
})

print(result["messages"][-1].content)

That’s your first graph. Four steps, a working agent. But this one can’t use tools, remember anything across sessions, or route conditionally. Let’s fix that.

Chapter 3: Adding Tools and Conditional Routing

Real agents don’t just chat — they act. Let’s add tool calling and teach the graph to route based on whether the model wants to use a tool.

Define Tools

from langchain_core.tools import tool

@tool
def web_search(query: str) -> str:
    """Search the web for current information."""
    # In production, hook up to Tavily, SerpAPI, etc.
    return f"Search results for: {query} — [placeholder result]"

@tool
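def calculator(expression: str) -> str:
    """Evaluate a mathematical expression."""
    # WARNING: eval() executes arbitrary Python. Acceptable for a local demo only;
    # never run it on model- or user-supplied input in production.
    try:
        return str(eval(expression))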
    except Exception as e:
        return f"Error: {e}"

tools = [web_search, calculator]
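Because the model chooses the `expression` argument, calling `eval()` on it is risky outside a demo. Here is a minimal sketch of a safer alternative that walks the parsed syntax tree and accepts only basic arithmetic. The names `safe_eval` and `calculator_safe` are hypothetical, and the supported operator set is an assumption you would adjust to your needs.

```python
import ast
import operator

# Whitelisted operators; anything else is rejected
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY_OPS = {ast.USub: operator.neg, ast.UAdd: operator.pos}


def safe_eval(expression: str) -> float:
    """Evaluate +, -, *, / on numeric literals without calling eval()."""

    def _eval(node: ast.AST) -> float:
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        raise ValueError(f"Unsupported expression: {ast.dump(node)}")

    return _eval(ast.parse(expression, mode="eval"))


def calculator_safe(expression: str) -> str:
    """Possible replacement body for the calculator tool (hypothetical name)."""
    try:
        return str(safe_eval(expression))
    except (ValueError, SyntaxError, ZeroDivisionError) as e:
        return f"Error: {e}"


print(calculator_safe("2 + 3 * 4"))
print(calculator_safe("__import__('os').getcwd()"))
```

Function calls, attribute access, and names never reach an evaluator here; they fall through to the `ValueError` branch and come back to the model as an error string.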

Bind Tools to the Model

model_with_tools = model.bind_tools(tools)

Add a ToolNode and Conditional Router

from langgraph.graph import START, END, StateGraph
from langgraph.prebuilt import ToolNode
from langchain_core.messages import AnyMessage
from typing import Literal

def agent_node(state: AgentState):
    response = model_with_tools.invoke(state["messages"])
    return {"messages": state["messages"] + [response]}

def route_after_agent(state: AgentState) -> Literal["tools", "__end__"]:
    """Conditional edge: go to tools if the model made tool calls, else end."""
    last_message = state["messages"][-1]
    if getattr(last_message, "tool_calls", None):
        return "tools"
    return "__end__"

tool_node = ToolNode(tools)

builder = StateGraph(AgentState)
builder.add_node("agent", agent_node)
builder.add_node("tools", tool_node)

builder.add_edge(START, "agent")
builder.add_conditional_edges("agent", route_after_agent)
builder.add_edge("tools", "agent")  # loop back after tool use

graph = builder.compile()

Reference: Agents — LangGraph workflows

Now your agent can loop: it calls the model, decides to use a tool, executes the tool, passes results back to the model, and continues until it’s done. This is the ReAct loop — the foundation of most production agents.
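If the control flow is easier to see without the framework, here is a rough, framework-free sketch of the same loop. Everything in it is a stand-in: `fake_model`, `TOOLS`, and `run_react_loop` are hypothetical, and the message dictionaries only imitate real message objects. The `max_steps` cap is my own guard against runaway loops; LangGraph enforces a comparable cap through its recursion limit.

```python
from typing import Callable

# Hypothetical tool registry: name -> plain Python function
TOOLS: dict[str, Callable[[str], str]] = {
    "web_search": lambda query: f"results for {query!r}",
}


def fake_model(messages: list[dict]) -> dict:
    """Stand-in for an LLM: request one search, then answer from its result."""
    if not any(m["role"] == "tool" for m in messages):
        return {"role": "assistant", "content": "",
                "tool_calls": [{"name": "web_search", "args": "LangGraph"}]}
    return {"role": "assistant",
            "content": f"Answer based on: {messages[-1]['content']}",
            "tool_calls": []}


def run_react_loop(question: str, max_steps: int = 5) -> list[dict]:
    """agent -> (tools -> agent)* -> end, with a hard cap on iterations."""
    messages = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        reply = fake_model(messages)        # the "agent" node
        messages.append(reply)
        if not reply["tool_calls"]:         # conditional edge: no tool calls, so end
            return messages
        for call in reply["tool_calls"]:    # the "tools" node
            result = TOOLS[call["name"]](call["args"])
            messages.append({"role": "tool", "content": result})
    raise RuntimeError(f"No final answer after {max_steps} steps")


history = run_react_loop("What is LangGraph?")
print(history[-1]["content"])
```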

Chapter 4: Memory and Persistence with Checkpointers

Here’s where most tutorial agents fail: they forget everything between runs.

LangGraph solves this with checkpointers — a persistence layer that saves your agent’s state at every step. Resume a paused run, recover from a crash, or let a human review mid-task.

from langgraph.checkpoint.memory import InMemorySaver

checkpointer = InMemorySaver()
graph = builder.compile(checkpointer=checkpointer)

Now invoke with a thread_id to maintain session continuity:

config = {"configurable": {"thread_id": "user-session-001"}}

# First message
result = graph.invoke(
    {"messages": [HumanMessage(content="My name is Satish. Remember that.")], "task_complete": False},
    config=config
)

# Second message — same thread, same memory
result2 = graph.invoke(
    {"messages": result["messages"] + [HumanMessage(content="What is my name?")]},
    config=config
)

print(result2["messages"][-1].content)
# → e.g. "Your name is Satish." (exact wording varies by model)

Reference: Using in LangGraph — Persistence

For production, swap InMemorySaver for a Redis or PostgreSQL checkpointer. The API is identical — only the backend changes.
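To make the role of thread_id concrete, here is a toy, framework-free sketch of what a checkpointer does conceptually. `ToyCheckpointer` and `run_turn` are hypothetical names, and this is not how LangGraph implements persistence: real checkpointers keep richer snapshots, including per-step history.

```python
import copy


class ToyCheckpointer:
    """Keeps the latest state snapshot per thread_id (in memory only)."""

    def __init__(self) -> None:
        self._snapshots: dict[str, dict] = {}

    def load(self, thread_id: str) -> dict:
        # Unknown threads start from an empty history
        return copy.deepcopy(self._snapshots.get(thread_id, {"messages": []}))

    def save(self, thread_id: str, state: dict) -> None:
        self._snapshots[thread_id] = copy.deepcopy(state)


def run_turn(saver: ToyCheckpointer, thread_id: str, user_text: str) -> dict:
    """Load the thread's state, append one exchange, and persist the result."""
    state = saver.load(thread_id)
    state["messages"].append(("user", user_text))
    # Stand-in for the model call: report how much history is visible
    visible = len(state["messages"])
    state["messages"].append(("assistant", f"I can see {visible} message(s)."))
    saver.save(thread_id, state)
    return state


saver = ToyCheckpointer()
run_turn(saver, "user-session-001", "My name is Satish.")
second = run_turn(saver, "user-session-001", "What is my name?")
other = run_turn(saver, "user-session-002", "Hello")
print(len(second["messages"]), len(other["messages"]))
```

The second call on the same thread sees the first exchange, while the new thread starts empty. That is the behavior the thread_id in the config selects.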

Chapter 5: Human-in-the-Loop — The Safety Net

An autonomous agent making decisions at scale is powerful. An autonomous agent making decisions without any oversight is a liability — especially in FSI or regulated environments.

LangGraph’s interrupt() lets you pause an agent mid-graph and wait for human input before continuing.

from langgraph.types import interrupt, Command
from langgraph.checkpoint.memory import InMemorySaver
from langgraph.graph import START, END, StateGraph
from typing import TypedDict

class ReviewState(TypedDict):
    task: str
    draft_output: str
    approved: bool

def draft_node(state: ReviewState):
    # Simulate the agent drafting something
    return {"draft_output": f"Draft response to: {state['task']}"}

def human_review_node(state: ReviewState):
    # Pause here and surface the draft to a human
    decision = interrupt({
        "draft": state["draft_output"],
        "instruction": "Approve or edit this output before we proceed."
    })
    return {"approved": decision.get("approved", False)}

def finalize_node(state: ReviewState):
    if state["approved"]:
        return {"draft_output": f"[APPROVED] {state['draft_output']}"}
    return {"draft_output": "[REJECTED — needs revision]"}

checkpointer = InMemorySaver()

review_graph = (
    StateGraph(ReviewState)
    .add_node("draft", draft_node)
    .add_node("human_review", human_review_node)
    .add_node("finalize", finalize_node)
    .add_edge(START, "draft")
    .add_edge("draft", "human_review")
    .add_edge("human_review", "finalize")
    .add_edge("finalize", END)
    .compile(checkpointer=checkpointer)
)

config = {"configurable": {"thread_id": "review-001"}}

# Run to the interrupt
review_graph.invoke({"task": "Write quarterly summary", "draft_output": "", "approved": False}, config)

# Human approves — resume
review_graph.invoke(Command(resume={"approved": True}), config)

Reference: Testing the agent — human-in-the-loop

This pattern maps directly onto governance gates in regulated industries: the agent drafts, a human reviews, execution continues only on explicit approval.

Chapter 6: Enter Deep Agents — The Harness Level

Now we level up. Deep Agents is the highest-level abstraction in the LangChain stack — an agent harness built on LangGraph that adds:

Built-in planning tools — the agent can decompose complex tasks into steps
Virtual filesystem — agents read and write files across long runs
Subagent spawning — delegate subtasks to specialist agents running in isolated context windows
Long-term memory — update and retrieve knowledge across sessions

“deepagents is a standalone library built on top of LangChain’s core building blocks for agents. It uses the LangGraph runtime for durable execution, streaming, human-in-the-loop, and other features.” — Deep Agents overview

Install:

pip install deepagents langchain-anthropic

Building a Deep Research Agent

Here’s a complete, testable example — a coordinator agent that plans a research task, delegates to a web-search subagent and a summarizer subagent, then synthesizes the final answer.

# deep_research_agent.py
from deepagents import create_deep_agent, SubAgent
from langchain_anthropic import ChatAnthropic
from langchain_core.tools import tool
from langgraph.checkpoint.memory import InMemorySaver

# ─── Model ───────────────────────────────────────────────
model = ChatAnthropic(model="claude-sonnet-4-20250514", temperature=0)

# ─── Tools ───────────────────────────────────────────────

@tool
def web_search(query: str) -> str:
    """Search the web for information on a given topic."""
    # Wire to Tavily or SerpAPI in production
    return f"[Search results for '{query}']: LangGraph was released by LangChain in 2024. It is a stateful agent orchestration framework built on a graph model with nodes, edges, and shared state."

@tool
def summarize_text(text: str) -> str:
    """Summarize a block of text into key bullet points."""
    # In production, call the model here
    return f"Summary: {text[:200]}..."

# ─── Subagents ────────────────────────────────────────────

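# The Researcher subagent: specialized in web search
researcher = SubAgent(
    name="researcher",
    description="Searches the web and retrieves relevant information on any topic. Use this for fact-finding tasks.",
    # system_prompt is assumed to be a required key in current deepagents releases
    system_prompt="You are a research specialist. Use web_search to find facts and report them concisely.",
    tools=[web_search],
    model=model,
)

# The Summarizer subagent: specialized in distillation
summarizer = SubAgent(
    name="summarizer",
    description="Takes raw text or search results and produces clean, structured summaries. Use this after research is complete.",
    system_prompt="You are a summarization specialist. Turn raw text into clear, structured bullet points.",
    tools=[summarize_text],
    model=model,
)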

# ─── Coordinator (Deep Agent) ─────────────────────────────

checkpointer = InMemorySaver()

agent = create_deep_agent(
    model=model,
    subagents=[researcher, summarizer],
    system_prompt="""You are a deep research coordinator.
When given a topic, you:
1. Plan which subtasks are needed
2. Delegate research to the researcher subagent
3. Delegate summarization to the summarizer subagent
4. Synthesize a final, structured answer

Always produce outputs in clear markdown with headings.""",
    checkpointer=checkpointer,
)

# ─── Run ──────────────────────────────────────────────────
if __name__ == "__main__":
    config = {"configurable": {"thread_id": "research-session-001"}}

    result = agent.invoke(
        {"messages": [{"role": "user", "content": "Research how LangGraph works and give me a structured summary."}]},
        config=config
    )

    # Print the final coordinator message (the last entry in the history)
    print(result["messages"][-1].content)

Run it:

python deep_research_agent.py

Reference: Deep Agents overview, Subagents

Chapter 7: The Architecture Mental Model

Before you ship any of this to production, internalize this architecture. Deep Agents use a coordinator-worker model:

User Message
    │
    ▼
┌─────────────────────────────┐
│   COORDINATOR (Deep Agent)  │  ← Plans tasks, routes to subagents
│   - Receives user input     │
│   - Decides delegation      │
└────────┬───────────┬────────┘
         │           │
         ▼           ▼
┌──────────────┐  ┌──────────────┐
│  Researcher  │  │  Summarizer  │  ← Isolated context windows
│  Subagent    │  │  Subagent    │
└──────────────┘  └──────────────┘
         │           │
         └─────┬─────┘
               ▼
    ┌─────────────────┐
    │  Final Synthesis │  ← Coordinator assembles final answer
    └─────────────────┘

Reference: Architecture — Deep Agents frontend

Each subagent runs in its own isolated context window. This means:

No context pollution between specialists
Each subagent can run longer, focused tasks
You can parallelize subagents for speed
Memory and state are cleanly separated per agent
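As a rough illustration of the isolation and parallelism points above, here is a framework-free sketch using only the standard library. `run_subagent` and `coordinate` are hypothetical stand-ins; real subagents would call a model, and how Deep Agents schedules them is not shown here.

```python
from concurrent.futures import ThreadPoolExecutor


def run_subagent(name: str, task: str) -> dict:
    """Stand-in for a subagent run: starts with a fresh context of its own."""
    context = [f"system: you are the {name}", f"task: {task}"]  # nothing shared
    context.append(f"{name} worked on {task!r}")                 # intermediate noise
    # Only the final result goes back to the coordinator, not the whole context
    return {"agent": name, "result": f"{name} finished {task!r}",
            "context_size": len(context)}


def coordinate(assignments: dict[str, str]) -> list[dict]:
    """Fan tasks out to subagents in parallel and collect results in input order."""
    with ThreadPoolExecutor(max_workers=max(1, len(assignments))) as pool:
        futures = [pool.submit(run_subagent, name, task)
                   for name, task in assignments.items()]
        return [f.result() for f in futures]


results = coordinate({
    "researcher": "find LangGraph release notes",
    "summarizer": "condense the architecture docs",
})
for r in results:
    print(r["agent"], "->", r["result"])
```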

Chapter 8: What Makes This “Deep”?

You might ask: isn’t this just multi-agent? What’s the deep part?

The depth comes from the harness capabilities that LangGraph alone doesn’t give you out of the box:

Context management across long runs. A research task might span 50 tool calls and thousands of tokens. Deep Agents automatically summarize history and offload large results to the virtual filesystem so the agent never hits context limits mid-task.

Subagent isolation. Each specialist runs fresh — no shared message history. This is critical for reliability: the summarizer doesn’t need to know the researcher’s entire search history; it just needs the results.

Planning tools built in. The coordinator can use built-in planning capabilities to decompose “research LangGraph for my blog post” into: search → collect → summarize → structure → draft. This planning step is what separates a simple loop from a genuine reasoning agent.

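Memory that persists. Lessons learned, user preferences, and domain knowledge can all be stored and retrieved across sessions through a LangGraph Store: InMemoryStore in dev, a database-backed store in production. A checkpointer such as InMemorySaver, by contrast, only keeps state within a single thread.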
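The context management point above is easier to picture with a small example. This sketch imitates the idea of offloading large tool results to a file and keeping only a pointer in the context. It is not the Deep Agents implementation: `virtual_fs`, `record_tool_result`, and the 200-character threshold are all assumptions made for illustration.

```python
MAX_INLINE_CHARS = 200  # assumed threshold; tune to your token budget

virtual_fs: dict[str, str] = {}  # path -> file contents


def record_tool_result(context: list[str], tool_name: str, result: str) -> None:
    """Keep small results inline; write large ones to a file and keep a pointer."""
    if len(result) <= MAX_INLINE_CHARS:
        context.append(f"[{tool_name}] {result}")
        return
    path = f"/results/{tool_name}_{len(virtual_fs)}.txt"
    virtual_fs[path] = result
    preview = result[:80]
    context.append(
        f"[{tool_name}] {len(result)} chars saved to {path}; preview: {preview}..."
    )


context: list[str] = []
record_tool_result(context, "calculator", "42")
record_tool_result(context, "web_search", "LangGraph " * 500)
print(context[0])
print(len(context[1]), "chars in context instead of",
      len(virtual_fs["/results/web_search_0.txt"]))
```

The agent can still read the full file later if it needs the details, but the running context only pays for the short pointer line.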

Chapter 9: Production Checklist Before You Ship

You’ve built your agent. Here’s what separates a demo from a production-grade deployment:

1. Swap InMemorySaver for a persistent checkpointer. Install langgraph-checkpoint-redis or langgraph-checkpoint-postgres to back checkpoints with Redis or PostgreSQL. The compile interface is identical.

2. Add retry policies on fragile nodes.

from langgraph.types import RetryPolicy

builder.add_node(
    "web_search_node",
    search_node_fn,
    retry_policy=RetryPolicy(max_attempts=3),
)

3. Instrument with LangSmith. Set your env vars and every graph invocation is traced automatically:

export LANGSMITH_API_KEY=your_key
export LANGSMITH_TRACING=true

Trace with LangGraph — LangSmith

4. Add human-in-the-loop gates for high-stakes actions. Any node that sends emails, modifies data, or calls external APIs should have an interrupt() gate before execution.

5. Test subagent namespace isolation. If you’re running multiple subagents in parallel, ensure each has a unique node name to prevent checkpoint collisions.

Reference: Multiple subgraph calls
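To see what a retry policy on a fragile node (item 2 above) buys you, here is a plain-Python sketch of retry with exponential backoff. `with_retries` and `flaky_search` are hypothetical, the delays are shortened for demonstration, and catching only `ConnectionError` is an assumption; this is not LangGraph's retry implementation.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(fn: Callable[[], T], max_attempts: int = 3,
                 initial_delay: float = 0.01, backoff: float = 2.0) -> T:
    """Call fn until it succeeds or max_attempts is reached, waiting longer each time."""
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the original error
            time.sleep(delay)
            delay *= backoff
    raise ValueError("max_attempts must be at least 1")


calls = {"count": 0}


def flaky_search() -> str:
    """Hypothetical node body that fails twice before succeeding."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("search backend unavailable")
    return "search results"


print(with_retries(flaky_search), "after", calls["count"], "attempts")
```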

The Lesson I Wish I’d Known Earlier

When my first agent broke on that Tuesday, I didn’t need a smarter model. I needed a smarter structure.

LangGraph gives you that structure: a graph that is observable, resumable, and testable at every node. Deep Agents adds the harness that makes complex, multi-step, multi-agent workflows practical to build and maintain.

The pattern we’ve walked through — State → Nodes → Edges → Tools → Checkpointer → Human Gate → Subagents — is the same pattern running inside production agents at enterprise scale today.

Start with the simple graph. Add tools. Add memory. Add governance gates. Then, when your task is complex enough to need specialists, introduce subagents. Don’t over-engineer day one. The graph scales with your ambition.

Resources

Built with verified LangChain documentation. All code examples are production-compatible with LangGraph’s current API. Install requirements: pip install langgraph langchain-anthropic deepagents langsmith

Agent Harness vs. Context Engineering: The Next Evolution of AI Agent Architecture with LangGraph

Satish Prasad — Sun, 07 Jun 2026 12:16:13 +0000

Building AI applications has evolved dramatically. The community has moved past simple prompt tuning into complex system architecture. If you are building production-grade workflows today, you are likely grappling with a massive shift: moving from fragile proof-of-concepts to resilient, enterprise-grade systems.

For most of 2024 and 2025, the AI engineering community focused heavily on Prompt Engineering and later Context Engineering. As AI agents became more autonomous, however, engineers discovered that neither prompts nor context alone could reliably deliver production-grade agent behavior.

A new paradigm dominates the architectural landscape: Agent Harness Engineering. Leading AI companies and frameworks increasingly describe agent systems using a simple equation:

Agent = Model + Harness

The language model provides raw reasoning capabilities, while the harness provides everything required to transform that reasoning into reliable, safe, and deterministic actions.

1. Defining the Core Concepts

To understand how to build resilient systems, we must first look at the three evolutionary eras of AI engineering:

Prompt Engineering   ➔   Context Engineering   ➔   Harness Engineering
(Shapes Behavior)        (Shapes Knowledge)         (Shapes Reliability)

Phase 1: Prompt Engineering (Shapes Behavior): Early AI applications focused on better instructions, Chain-of-Thought formatting, and few-shot examples. The assumption was simple: better prompts produce better outputs. This worked for basic chatbots but failed for complex, multi-step workflows.
Phase 2: Context Engineering (Shapes Knowledge): As agents became more sophisticated, engineers realized the quality of context often matters more than the prompt itself. Context Engineering emerged as the practice of dynamic retrieval (RAG), vector search management, token budget optimization, and state compaction to ensure the model’s context window contains pristine, highly relevant information. A Context Engineer asks: “What information should the model see?”
Phase 3: Harness Engineering (Shapes Reliability): The latest realization is the most critical: even perfect context cannot solve tool execution failures, infinite loops, permission issues, planning mistakes, or missing feedback cycles. According to emerging industry definitions, “If you’re not the model, you’re the harness.” An Agent Harness is the complete execution environment and infrastructure shell surrounding an LLM. A Harness Engineer asks: “What environment should the model operate within?”

Without a harness, an LLM can only generate text. With a harness, the same model can browse websites, query databases, safely execute code, plan multi-step tasks, coordinate sub-agents, persist long-term memory, and recover from real-world failures. It represents a fundamental shift from information design to system design.

2. Agent Harness vs. Context Engineering

Confusing these two layers is one of the most common architectural mistakes engineering teams make. They are not interchangeable; they focus on entirely different layers of the software stack, fail in distinct ways, and require unique debugging paths.

Feature / Dimension	Context Engineering (The Brain)	Agent Harness Engineering (The Body)
Primary Core Focus	Knowledge, Information Flow, Relevance	Infrastructure, Runtime, Execution Reliability
Key Responsibility	Providing fresh semantic data, pristine RAG, metadata pruning, and document indexing.	Executing sandboxed code, state serialization, token rate-limiting, and error-trapping.
Where it Operates	Inside the LLM Prompt / Context Window.	Outside the LLM, hosting the application loop.
Operational Analogy	The Brain: Provides knowledge, memory, and cognitive understanding.	The Body: Provides tools, physical actions, constraints, and safety mechanisms.
Silent Failures	High. The agent runs flawlessly but generates an outdated answer because of stale vector data.	Low. The architecture crashes visibly (e.g., timeout exceptions, sandbox breaches, schema errors).

3. The Anatomy of an Agent Harness

A production-ready harness acts as the nervous and immune system for your AI agent. It typically contains six foundational pillars:

Planning Layer: Responsible for task decomposition, goal tracking, progress monitoring, and dynamic replanning. When a user asks an agent to “Research competitors and prepare a report,” the planning layer breaks this down into distinct, traceable sub-tasks.
Tool Execution Layer: Provides secure access to APIs, databases, search engines, file systems, and MCP (Model Context Protocol) servers. The model makes the cognitive decision; the harness safely executes it.
Memory Layer: Stores short-term session state, long-term semantic memory, user preferences, and historical actions so agents avoid repeatedly solving the same problems.
Context Management Layer: This is where Context Engineering becomes a functional component of the harness. It handles context compression, semantic retrieval, summarization, and window optimization. Context Engineering is a subset of Harness Engineering.
Safety and Governance Layer: Controls tool permissions, runs ephemeral sandboxed environments (Docker, WASM, E2B) to isolate code execution, enforces organizational policies, and manages human-in-the-loop approval workflows.
Observability Layer: Tracks tool calls, agent decisions, token costs, latency, and system failures. Without this layer, debugging an autonomous agent becomes impossible.
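To ground a few of these pillars, here is a minimal sketch that combines the tool execution, governance, and observability layers in one small class. `ToolHarness`, its allowlist, and the audit log format are all hypothetical; a production harness would add sandboxing, approvals, and persistent tracing.

```python
import time
from typing import Any, Callable


class ToolHarness:
    """Runs tools on the model's behalf, enforcing an allowlist and logging every call."""

    def __init__(self, tools: dict[str, Callable[..., Any]], allowed: set[str]) -> None:
        self.tools = tools
        self.allowed = allowed
        self.audit_log: list[dict] = []

    def execute(self, name: str, **kwargs: Any) -> Any:
        started = time.perf_counter()
        entry = {"tool": name, "args": kwargs, "status": "ok"}
        try:
            if name not in self.tools:
                raise KeyError(f"Unknown tool: {name}")
            if name not in self.allowed:
                raise PermissionError(f"Tool not permitted: {name}")
            return self.tools[name](**kwargs)
        except Exception as exc:
            entry["status"] = f"error: {type(exc).__name__}"
            raise
        finally:
            # Every call is recorded, whether it succeeded or not
            entry["seconds"] = time.perf_counter() - started
            self.audit_log.append(entry)


harness = ToolHarness(
    tools={"search": lambda query: f"results for {query}",
           "delete_file": lambda path: "deleted"},
    allowed={"search"},
)
print(harness.execute("search", query="LangGraph"))
try:
    harness.execute("delete_file", path="/tmp/report.txt")
except PermissionError as exc:
    print("blocked:", exc)
print([(e["tool"], e["status"]) for e in harness.audit_log])
```

The model only ever names a tool and its arguments; the harness decides whether the call runs and leaves a record either way.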

4. Why LangGraph Is a Natural Platform for Agent Harnesses

LangGraph was designed to solve a challenge that traditional agent frameworks struggle with: reliable, long-running, and cyclical execution.

Unlike linear chains, LangGraph introduces explicit workflow orchestration through graph structures (Nodes = LLM processing or Tool calling; Edges = Routing decisions). This makes it an ideal foundation for building an operational harness. LangGraph provides the underlying primitives, allowing you to map harness components directly onto graph mechanics:

Harness Planning Layer -> LangGraph Nodes: Each concrete planning step or state of execution becomes a node with explicit boundaries and responsibilities.
Harness State Layer -> LangGraph State: LangGraph maintains a shared, type-safe state schema across nodes, acting as the memory backbone of the harness.
Harness Execution Layer -> LangGraph Tools: Tools become strictly bound, callable capabilities controlled and monitored by the graph runtime.
Harness Governance Layer -> Conditional Edges: Complex safety and execution logic (e.g., if confidence < 0.8: route_to_human_review()) is built structurally into the graph edges rather than relying on the LLM to follow prompt instructions.
Harness Observability Layer -> LangSmith + LangGraph: Provides native tracing of node transitions, tool performance, and failure states.
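The governance mapping above can be sketched as an ordinary routing function. The state fields, the 0.8 threshold, and the node names here are assumptions for illustration; the function itself is plain Python, and the comment shows roughly where it would plug into a graph.

```python
from typing import Literal, TypedDict


class ReviewableState(TypedDict):
    answer: str
    confidence: float  # assumed to be set by an upstream grading node


CONFIDENCE_THRESHOLD = 0.8


def route_on_confidence(state: ReviewableState) -> Literal["human_review", "finalize"]:
    """Return the name of the next node based on state, not on prompt wording."""
    if state["confidence"] < CONFIDENCE_THRESHOLD:
        return "human_review"
    return "finalize"


# In a StateGraph this function would be attached with something like:
#   builder.add_conditional_edges("grade", route_on_confidence)
# where "grade", "human_review", and "finalize" are hypothetical node names.

print(route_on_confidence({"answer": "Refunds take 5 days.", "confidence": 0.55}))
print(route_on_confidence({"answer": "Refunds take 5 days.", "confidence": 0.93}))
```

Because the rule lives in code, it can be unit tested on its own, which is not possible for an instruction buried in a prompt.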

5. Practical Implementation Pattern

If you’re using LangGraph, the easiest way to use an Agent Harness is actually through Deep Agents, which LangChain describes as a batteries-included agent harness built on top of LangGraph. Deep Agents provides planning, task delegation, context management, memory, filesystem support, and human-in-the-loop controls without requiring you to build everything yourself.

Architecture: LangGraph + Agent Harness

                    User Request
                           |
                           v
                 +----------------+
                 | Deep Agent     |
                 | (Harness)      |
                 +----------------+
                           |
       ------------------------------------------------
       |              |             |                |
       v              v             v                v
   Planning      Memory       Sub Agents      Human Review
(write_todos)   Filesystem      Task()        interrupt_on
       |              |             |                |
       ------------------------------------------------
                           |
                           v
                    LangGraph Runtime
             (State, Checkpoints, Streaming)

According to the LangChain documentation, the harness provides these built-in capabilities:

Planning (write_todos)
Virtual filesystem
Context management
Task delegation (subagents)
Human-in-the-loop approvals
Long-term memory
Code execution support

Example 1: Create a Deep Agent Harness

This example comes directly from the Deep Agents approach documented by LangChain.

from deepagents import create_deep_agent
from langchain_openai import ChatOpenAI

model = ChatOpenAI(model="gpt-4.1")

agent = create_deep_agent(
    model=model
)

At this point you already have:

Planning
Memory
Context management
File storage
Task delegation

without manually building graph nodes.

Example 2: Add Planning

One of the most important harness features is the built-in planning tool.

When a user asks:

Research UiPath Agentic Automation competitors

the agent automatically creates a TODO list before execution.

TODO

[ ] Identify competitors
[ ] Gather company data
[ ] Analyze strengths
[ ] Generate report

The Deep Agents harness uses the write_todos tool to maintain structured plans. This helps long-running tasks remain organized and auditable.
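To picture what a structured plan looks like as data, here is a small stand-alone sketch. It does not reproduce the write_todos tool or its schema; the `Todo` and `Plan` classes and the three status values are assumptions chosen to mirror the checklist above.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Todo:
    content: str
    status: str = "pending"  # assumed states: pending -> in_progress -> completed


@dataclass
class Plan:
    todos: list[Todo] = field(default_factory=list)

    def start_next(self) -> Optional[Todo]:
        """Mark the first pending item as in progress and return it."""
        for todo in self.todos:
            if todo.status == "pending":
                todo.status = "in_progress"
                return todo
        return None

    def complete(self, todo: Todo) -> None:
        todo.status = "completed"

    def render(self) -> str:
        marks = {"pending": "[ ]", "in_progress": "[~]", "completed": "[x]"}
        return "\n".join(f"{marks[t.status]} {t.content}" for t in self.todos)


plan = Plan([Todo("Identify competitors"), Todo("Gather company data"),
             Todo("Analyze strengths"), Todo("Generate report")])
first = plan.start_next()
plan.complete(first)
plan.start_next()
print(plan.render())
```

Keeping status on each item is what makes the plan auditable: at any point you can render it and see what is done, what is running, and what is left.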

Example 3: Add Specialized Subagents

LangChain recommends using subagents to avoid context-window bloat.

from deepagents import create_deep_agent

agent = create_deep_agent(
    model=model,
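    subagents=[
        {
            "name": "researcher",
            "description": "Web research specialist",
            # system_prompt is assumed to be required for dict-style subagents
            "system_prompt": "You research topics on the web and report sourced findings."
        },
        {
            "name": "analyst",
            "description": "Data analysis specialist",
            "system_prompt": "You analyze the data you are given and report key insights."
        }
    ]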
)

Each subagent gets its own isolated context window and returns only the final results to the supervisor.

Example 4: Human-in-the-Loop Approval

For enterprise applications you often want approval before actions occur.

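from langgraph.checkpoint.memory import InMemorySaver

agent = create_deep_agent(
    model=model,
    # send_email and delete_file must also be registered as tools on the agent
    interrupt_on={
        "send_email": True,
        "delete_file": True
    },
    # Pausing and resuming needs persisted state, so a checkpointer is required
    checkpointer=InMemorySaver(),
)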

Agent decides:
   Delete file?

        |
        v

Pause Execution
        |
        v

Human Approves
        |
        v

Continue

LangChain calls this “Human-in-the-Loop” execution and recommends it for sensitive operations.

Real-World UiPath Research Agent Example

For a UiPath blog generation use case, a harness could look like:

User:
Generate UiPath Agentic Automation Blog
           |
           v
Planner Agent
           |
           v
Research Agent
(Gather UiPath docs)
           |
           v
Competitor Agent
(Copilot Studio, CrewAI, LangGraph)
           |
           v
Fact Check Agent
           |
           v
Content Writer Agent
           |
           v
Human Approval
           |
           v
Publish

This is a textbook Agent Harness design because it combines:

Planning
Multiple specialized agents
Context isolation
Memory
Human review
Workflow orchestration

all running on LangGraph.

6. Enterprise Benefits of Agent Harnesses

Organizations moving toward a harness-centric architecture realize massive advantages over teams relying on prompts alone:

Reliability: Deterministic, graph-driven state machines ensure agents follow strict corporate workflows and don’t deviate into unmapped logic loops.
Governance: Human approvals, data policy enforcement, and permission structures become hardcoded security boundaries instead of fragile prompt instructions.
Reusability & Vendor Independence: The harness abstracts your core business logic away from the model providers. If a faster, cheaper LLM is released tomorrow, you swap the model inside the node—the entire harness layer remains completely untouched.
Debuggability: When failures happen, they are tracked down to specific software components, input streams, or isolated nodes rather than debugging an enigmatic prompt output.
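The vendor independence point can be sketched with a small interface. The `ChatModel` protocol and the two toy model classes below are hypothetical stand-ins, not real provider clients; the point is only that the node depends on the interface, so swapping the model leaves the node code untouched.

```python
from typing import Callable, Protocol


class ChatModel(Protocol):
    def invoke(self, prompt: str) -> str: ...


class FastModel:
    def invoke(self, prompt: str) -> str:
        return f"[fast] {prompt}"


class CarefulModel:
    def invoke(self, prompt: str) -> str:
        return f"[careful] {prompt}"


def make_summarize_node(model: ChatModel) -> Callable[[dict], dict]:
    """Build a node function that depends only on the ChatModel interface."""
    def summarize(state: dict) -> dict:
        return {"summary": model.invoke(f"Summarize: {state['text']}")}
    return summarize


state = {"text": "quarterly results"}
print(make_summarize_node(FastModel())(state)["summary"])
print(make_summarize_node(CarefulModel())(state)["summary"])
```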

Conclusion: The Operating System of AI

The AI industry is moving rapidly beyond prompt engineering. The next competitive advantage will not come solely from adopting slightly smarter models, but from building vastly superior harnesses around them.

In the same way that operating systems made abstract computer hardware useful to consumers, Agent Harnesses are becoming the operating systems of autonomous AI agents. For teams building production applications with LangGraph, mastering Harness Engineering is no longer optional—it is the baseline requirement for operational success.