A tutorial for developers who ship agents into the real world and need to know whether they're actually working.
- The Problem Nobody Talks About at Demo Time
- Part 1: Foundations - What Does "Agent Quality" Actually Mean?
- Part 2: The Metrics That Matter - What to Track
- Correctness (Output vs. Reference)
- Groundedness / Faithfulness
- Relevance
- Trajectory Accuracy
- Latency (p50, p95, p99)
- Token Efficiency
- Composite Quality Score
- Part 3: Your First Evaluator - Code-Based Rules
- Part 4: LLM-as-Judge - Evaluating What Rules Can't
- Part 5: Trajectory Evaluation - Judging the Path, Not Just the Destination
- Part 6: The Evaluation Framework - Putting It All Together
- Part 7: The LangSmith Platform - Closing the Loop
- What LangSmith Actually Is
- Offline Evaluation: Test Before You Ship
- Online Evaluation: Monitor in Production
- The Feedback Loop: From Production Failures to Dataset Gold
- Pytest Integration: Eval as Code
- Part 8: The Full Evaluation Architecture
- What You've Built
- Resources
The Problem Nobody Talks About at Demo Time
Your agent demo looked flawless. It answered every question correctly, called the right tools in the right order, and finished in under three seconds. The audience applauded.
Two weeks after going live, your support queue is filling up with: "The agent gave me completely wrong information." "It searched the wrong database." "It hallucinated a date that doesn't exist."
Here's the hard truth: demos don't break agents. Real users do. And without a systematic evaluation framework, you will always be one bad production run away from a confidence crisis.
This guide covers the metrics that matter, how to build evaluators from scratch, how LLM-as-a-judge works, and how LangSmith closes the loop from local testing all the way to production monitoring. Each concept builds on the last, so by the end you'll have a complete evaluation system you can deploy.
Part 1: Foundations - What Does "Agent Quality" Actually Mean?
Before you can measure anything, you need a model of what you're measuring.
An agent isn't a static function. It's a decision-making system that reasons, selects tools, retrieves data, and generates responses, often over multiple steps. Quality failures can occur at any of those layers.
Think of agent quality across four dimensions:
1. Output Quality
Does the final answer satisfy the user's intent? Is it correct, relevant, and complete, with no hallucinated facts?
2. Trajectory Quality
Did the agent take the right path to get there? Did it call the correct tools, in the correct order, without unnecessary detours?
3. Latency and Efficiency
How long did each step take? How many tokens were consumed? Are there runaway loops or redundant tool calls?
4. Safety and Guardrails
Did the agent stay within its defined scope? Did it avoid toxic, harmful, or out-of-policy outputs?
Each dimension needs its own evaluator. A single "pass/fail" score tells you almost nothing. Let's build the measurement layer, dimension by dimension.
Part 2: The Metrics That Matter - What to Track
Here's a practical taxonomy of agent evaluation metrics, drawn from production experience and the LangSmith evaluation framework.
Correctness (Output vs. Reference)
The baseline: does the agent's answer match the expected answer?
This can be measured exactly (string match, JSON match) or approximately (semantic similarity, LLM judge). Use exact match for structured outputs (IDs, dates, classifications). Use LLM-as-judge for conversational or long-form outputs.
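The exact-match side of that split can be sketched in a few lines. This is a minimal illustration, assuming the agent returns its result under an `answer` key; the function names and the normalization rule (trim and lowercase) are assumptions for the example, not part of LangSmith.

```python
import json


def exact_match(outputs: dict, reference_outputs: dict) -> dict:
    """Score 1 when the trimmed, lowercased answer strings are identical."""
    predicted = str(outputs.get("answer", "")).strip().lower()
    expected = str(reference_outputs.get("answer", "")).strip().lower()
    return {"key": "exact_match", "score": int(predicted == expected)}


def json_match(outputs: dict, reference_outputs: dict) -> dict:
    """Score 1 when both answers parse to equal JSON values (key order ignored)."""
    try:
        predicted = json.loads(outputs.get("answer", ""))
        expected = json.loads(reference_outputs.get("answer", ""))
    except (json.JSONDecodeError, TypeError):
        # Unparseable output counts as a mismatch rather than an error.
        return {"key": "json_match", "score": 0}
    return {"key": "json_match", "score": int(predicted == expected)}
```

Comparing parsed JSON rather than raw strings means two outputs that differ only in key order or whitespace still count as a match.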
Groundedness / Faithfulness
Does the agent's response stay grounded in the retrieved documents or tools it actually used? An agent that "knows" something it wasn't given is hallucinating.
Per the LangSmith RAG evaluation guide, groundedness measures response vs. retrieved docs, not vs. a reference answer. This means you can evaluate it without ground truth.
Relevance
Does the answer actually address the user's question? An agent can be perfectly faithful to its retrieved documents and still fail if it retrieved the wrong documents in the first place.
Track this at two levels: response relevance (answer vs. question) and retrieval relevance (retrieved docs vs. question).
Trajectory Accuracy
This is unique to agents. It asks: did the agent take the expected sequence of steps?
As the LangSmith evaluation approaches documentation explains, trajectory evaluation can target:
- Exact match: did the agent call tools A → B → C in exactly that order?
- Unordered match: did the agent call the right set of tools, in any order?
- Subset/superset: did the agent at least call the required minimum tools?
- LLM-judge over full trajectory: pass the entire message + tool call history to a judge for holistic assessment.
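The first three modes reduce to simple comparisons on lists of tool names. The sketch below is a plain-Python illustration, not the agentevals implementation; the function name and the exact reading of "subset" and "superset" (relative to the expected list) are assumptions made for the example.

```python
from collections import Counter


def trajectory_match(actual: list[str], expected: list[str], mode: str = "exact") -> bool:
    """Compare two tool-call sequences under one of four matching modes."""
    if mode == "exact":
        return actual == expected
    actual_counts, expected_counts = Counter(actual), Counter(expected)
    if mode == "unordered":
        # Same tools, same number of calls, any order.
        return actual_counts == expected_counts
    if mode == "superset":
        # Every expected tool was called; extra calls are tolerated.
        return not (expected_counts - actual_counts)
    if mode == "subset":
        # Nothing outside the expected tools was called.
        return not (actual_counts - expected_counts)
    raise ValueError(f"Unknown mode: {mode}")
```

Using `Counter` rather than `set` keeps duplicate calls visible, so an agent that calls the same tool three times does not pass an unordered match against a reference that calls it once.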
Latency (p50, p95, p99)
Track response time at the percentile level. p50 tells you typical performance. p95 and p99 tell you what users on your slowest runs experience. Looping agents and redundant tool calls show up here first.
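As a rough illustration of how those percentiles can be computed from raw timings, here is a nearest-rank sketch. The function names are hypothetical, and nearest-rank is only one of several percentile definitions; monitoring tools may interpolate instead.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(latencies_s: list[float]) -> dict:
    """Summarize per-run latencies (in seconds) at p50, p95, and p99."""
    return {f"p{p}": percentile(latencies_s, p) for p in (50, 95, 99)}
```

With four runs of 0.8, 0.9, 1.2, and 7.5 seconds, p50 is 0.9 while p99 is 7.5: the single looping run is invisible in the median and dominates the tail.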
Token Efficiency
Total tokens per run, tokens per tool call, and token cost per session. Useful for catching prompt bloat and runaway context growth in long-running agents.
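A minimal sketch of those three numbers, assuming you already collect prompt tokens, completion tokens, and tool-call counts per run. The `RunUsage` record, the function name, and the per-1k prices passed in are all hypothetical; substitute your provider's actual rates.

```python
from dataclasses import dataclass


@dataclass
class RunUsage:
    prompt_tokens: int
    completion_tokens: int
    tool_calls: int


def token_report(runs: list[RunUsage], usd_per_1k_prompt: float, usd_per_1k_completion: float) -> dict:
    """Aggregate token usage across the runs of one session."""
    if not runs:
        raise ValueError("runs must not be empty")
    total_prompt = sum(r.prompt_tokens for r in runs)
    total_completion = sum(r.completion_tokens for r in runs)
    total_tokens = total_prompt + total_completion
    total_tool_calls = sum(r.tool_calls for r in runs)
    cost = total_prompt / 1000 * usd_per_1k_prompt + total_completion / 1000 * usd_per_1k_completion
    return {
        "tokens_per_run": total_tokens / len(runs),
        # None when the session made no tool calls at all.
        "tokens_per_tool_call": total_tokens / total_tool_calls if total_tool_calls else None,
        "session_cost_usd": round(cost, 6),
    }
```

A steadily rising `tokens_per_run` across versions of the same prompt is usually the first visible sign of prompt bloat.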
Composite Quality Score
LangSmith supports composite evaluators that combine multiple scores into a single weighted metric. For example: Overall Quality = (70% × correctness) + (20% × relevance) + (10% × conciseness). Useful for dashboards and regression gates.
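The weighting itself is plain arithmetic. The sketch below reproduces the 70/20/10 example in local Python; it is not LangSmith's composite evaluator API, and the function and constant names are illustrative.

```python
def composite_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of per-metric scores; weights must sum to 1."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    missing = set(weights) - set(scores)
    if missing:
        raise KeyError(f"missing scores for: {sorted(missing)}")
    return sum(weights[name] * scores[name] for name in weights)


WEIGHTS = {"correctness": 0.7, "relevance": 0.2, "conciseness": 0.1}
overall = composite_score(
    {"correctness": 1.0, "relevance": 0.5, "conciseness": 0.0},
    WEIGHTS,
)
```

Here a correct but only partly relevant and verbose answer lands at roughly 0.8, which shows how heavily the correctness weight dominates the composite.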
Part 3: Your First Evaluator โ Code-Based Rules
Not everything needs an LLM to evaluate. Start simple.
A code-based evaluator is just a Python function. It receives the agent's inputs, outputs, and optionally reference outputs, and returns a score.
# evaluators.py
import json


def response_length_evaluator(inputs: dict, outputs: dict, reference_outputs: dict = None) -> dict:
    """
    A simple evaluator that checks whether the response is concise.
    Flags responses over 500 words.
    """
    word_count = len(outputs.get("answer", "").split())
    score = 1 if word_count <= 500 else 0
    return {
        "key": "conciseness",
        "score": score,
        "comment": f"Response length: {word_count} words"
    }


def json_format_evaluator(inputs: dict, outputs: dict, reference_outputs: dict = None) -> dict:
    """
    Checks that the agent returned valid, parseable JSON where expected.
    """
    try:
        json.loads(outputs.get("structured_output", ""))
        return {"key": "valid_json", "score": 1}
    except (json.JSONDecodeError, TypeError):
        return {"key": "valid_json", "score": 0, "comment": "Output is not valid JSON"}


def tool_call_count_evaluator(inputs: dict, outputs: dict, reference_outputs: dict = None) -> dict:
    """
    Checks that the agent didn't make an excessive number of tool calls (a sign of looping).
    """
    tool_calls = outputs.get("tool_calls", [])
    score = 1 if len(tool_calls) <= 5 else 0
    return {
        "key": "tool_efficiency",
        "score": score,
        "comment": f"Tool calls made: {len(tool_calls)}"
    }
These run instantly, cost nothing, and catch structural failures immediately. Use them as your first filter before investing in LLM-based evaluation.
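Because these evaluators are ordinary functions, you can exercise them locally before involving any platform. The sketch below is one hypothetical way to do that: `run_evaluators` and the stand-in `non_empty_answer` evaluator are illustrative names, included only so the example runs on its own.

```python
from typing import Callable, Optional

Evaluator = Callable[[dict, dict, Optional[dict]], dict]


def non_empty_answer(inputs: dict, outputs: dict, reference_outputs: Optional[dict] = None) -> dict:
    """Stand-in evaluator: scores 1 when the answer contains any non-whitespace text."""
    return {"key": "non_empty", "score": int(bool(outputs.get("answer", "").strip()))}


def run_evaluators(
    evaluators: list[Evaluator],
    inputs: dict,
    outputs: dict,
    reference_outputs: Optional[dict] = None,
) -> dict[str, int]:
    """Apply each evaluator to one example and collect scores by feedback key."""
    results = {}
    for evaluator in evaluators:
        feedback = evaluator(inputs, outputs, reference_outputs)
        results[feedback["key"]] = feedback["score"]
    return results
```

Any function with the same three-argument shape, including the evaluators defined above, can be passed in the list.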
Part 4: LLM-as-Judge - Evaluating What Rules Can't
Some failures are semantic, not structural. An agent might return a perfectly formatted JSON with a factually wrong answer. A rule can't catch that. An LLM judge can.
LLM-as-judge is the pattern where a second, independent LLM evaluates the output of your primary agent. The judge receives a structured prompt with the question, the agent's answer, and optionally a reference answer, then returns a score and reasoning.
Here's how the LangSmith evaluation quickstart describes the key components: inputs (what was passed to your agent), outputs (what your agent returned), and reference_outputs (the ground truth answers from your dataset).
Build a Custom LLM-as-Judge Evaluator
# llm_judge_evaluators.py
from langchain_anthropic import ChatAnthropic

judge_llm = ChatAnthropic(model="claude-sonnet-4-20250514", temperature=0)


def correctness_judge(inputs: dict, outputs: dict, reference_outputs: dict) -> dict:
    """
    LLM-as-judge evaluator for factual correctness.
    Compares agent answer against reference answer.
    Returns score 0 (incorrect) or 1 (correct) with reasoning.
    """
    prompt = f"""You are an expert evaluator assessing an AI agent's response.

Question asked: {inputs.get('question', '')}
Reference answer (ground truth): {reference_outputs.get('answer', '')}
Agent's answer: {outputs.get('answer', '')}

Your task: Assess whether the agent's answer is factually correct relative to the reference answer.

Respond in this exact format:
SCORE: [0 or 1]
REASONING: [one sentence explaining why]"""
    response = judge_llm.invoke(prompt)
    content = response.content
    score = 1 if "SCORE: 1" in content else 0
    reasoning = content.split("REASONING:")[-1].strip() if "REASONING:" in content else ""
    return {
        "key": "correctness",
        "score": score,
        "comment": reasoning
    }


def groundedness_judge(inputs: dict, outputs: dict, reference_outputs: dict = None) -> dict:
    """
    LLM-as-judge for groundedness: checks if the answer is supported
    by the retrieved context (no reference needed).
    """
    context = outputs.get("retrieved_context", "")
    answer = outputs.get("answer", "")
    if not context:
        return {"key": "groundedness", "score": 0, "comment": "No retrieved context found"}
    prompt = f"""You are grading whether an AI answer is grounded in retrieved documents.

Retrieved context:
{context}

AI answer:
{answer}

Return 1 if the answer is fully supported by the context.
Return 0 if the answer contains information NOT present in the context (hallucination).

SCORE: [0 or 1]
REASONING: [one sentence]"""
    response = judge_llm.invoke(prompt)
    content = response.content
    score = 1 if "SCORE: 1" in content else 0
    reasoning = content.split("REASONING:")[-1].strip() if "REASONING:" in content else ""
    return {"key": "groundedness", "score": score, "comment": reasoning}


def relevance_judge(inputs: dict, outputs: dict, reference_outputs: dict = None) -> dict:
    """
    Evaluates whether the agent's answer actually addresses the user's question.
    Reference-free: compares answer to input question only.
    """
    question = inputs.get("question", "")
    answer = outputs.get("answer", "")
    prompt = f"""Does the following answer directly address the question?

Question: {question}
Answer: {answer}

SCORE: 1 if relevant, 0 if off-topic or evasive
REASONING: [one sentence]"""
    response = judge_llm.invoke(prompt)
    content = response.content
    score = 1 if "SCORE: 1" in content else 0
    reasoning = content.split("REASONING:")[-1].strip() if "REASONING:" in content else ""
    return {"key": "relevance", "score": score, "comment": reasoning}
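The judges above pull the score out with a substring check, which treats any malformed reply as a 0 and would also accept a stray "SCORE: 10". If that matters for your data, a stricter parser is one option. The sketch below is a hypothetical helper (`parse_judge_output` is not part of any library) that returns `None` when no valid score line is present, so format failures can be counted separately from genuine zeros.

```python
import re
from typing import Optional

# A SCORE line must contain exactly 0 or 1, optionally in brackets, and nothing else.
SCORE_RE = re.compile(r"^\s*SCORE:\s*\[?([01])\]?\s*$", re.MULTILINE | re.IGNORECASE)
REASONING_RE = re.compile(r"REASONING:\s*(.+)", re.IGNORECASE | re.DOTALL)


def parse_judge_output(content: str) -> tuple[Optional[int], str]:
    """Return (score, reasoning); score is None when no valid SCORE line is found."""
    score_match = SCORE_RE.search(content)
    reasoning_match = REASONING_RE.search(content)
    score = int(score_match.group(1)) if score_match else None
    reasoning = reasoning_match.group(1).strip() if reasoning_match else ""
    return score, reasoning
```

Tracking how often the score comes back as `None` gives a cheap signal that the judge prompt or model needs attention.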
Using OpenEvals - Pre-Built Judges
For production use, the openevals library ships ready-made LLM-as-judge evaluators with battle-tested prompts:
# Using openevals for correctness and conciseness (pip install openevals)
from openevals.llm import create_llm_as_judge
from openevals.prompts import CORRECTNESS_PROMPT, CONCISENESS_PROMPT

correctness_evaluator = create_llm_as_judge(
    prompt=CORRECTNESS_PROMPT,
    model="anthropic:claude-sonnet-4-20250514",
    feedback_key="correctness",
)

conciseness_evaluator = create_llm_as_judge(
    prompt=CONCISENESS_PROMPT,
    model="anthropic:claude-sonnet-4-20250514",
    feedback_key="conciseness",
)
A word of caution: LLM judges don't always get it right. LangSmith allows human auditors to review and correct evaluator scores, building a feedback loop that improves judge accuracy over time. See how to audit evaluator scores.
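One way to quantify how far a judge can be trusted is to compare its scores with human labels on the same examples. The sketch below assumes binary 0/1 labels and uses raw agreement plus Cohen's kappa, a standard chance-corrected agreement statistic; the function names are illustrative and this is not a LangSmith feature.

```python
def agreement_rate(judge: list[int], human: list[int]) -> float:
    """Fraction of examples where judge and human gave the same label."""
    if len(judge) != len(human) or not judge:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def cohens_kappa(judge: list[int], human: list[int]) -> float:
    """Agreement corrected for chance, for binary 0/1 labels."""
    observed = agreement_rate(judge, human)
    p_judge = sum(judge) / len(judge)
    p_human = sum(human) / len(human)
    # Probability both say 1 by chance plus probability both say 0 by chance.
    expected = p_judge * p_human + (1 - p_judge) * (1 - p_human)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)
```

Raw agreement can look high simply because most examples pass; kappa discounts that, which makes it the more honest number to watch when tuning a judge prompt.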
Part 5: Trajectory Evaluation - Judging the Path, Not Just the Destination
For agents, the how matters as much as the what. An agent that arrives at the right answer after 12 unnecessary tool calls isn't production-ready.
The agentevals package provides trajectory evaluators:
# trajectory_eval.py
# pip install agentevals langsmith
from agentevals.trajectory.match import create_trajectory_match_evaluator
from langsmith import evaluate

# Expected tool sequence for a customer support query.
# Shown for illustration: during evaluation the reference comes from each dataset example.
reference_trajectory = [
    "retrieve_customer_profile",
    "check_order_status",
    "generate_response"
]

# Create a trajectory match evaluator in "unordered" mode
# (tools must all appear, but order flexible).
# Note: agentevals compares OpenAI-style message lists, so a plain list of
# tool names like the one simulated below must be converted before matching.
trajectory_evaluator = create_trajectory_match_evaluator(
    trajectory_match_mode="unordered"
)


def run_agent_and_track(inputs: dict) -> dict:
    """
    Wraps your agent to capture both the final response and the tool trajectory.
    In LangGraph, use astream with stream_mode='debug' to capture node names.
    """
    # Simulated agent run; in production, wire this to LangGraph streaming
    trajectory = ["retrieve_customer_profile", "check_order_status", "generate_response"]
    answer = "Your order #1234 is out for delivery and will arrive today."
    return {
        "answer": answer,
        "trajectory": trajectory
    }


# Run trajectory evaluation
results = evaluate(
    run_agent_and_track,
    data="customer-support-dataset",  # Your LangSmith dataset name
    evaluators=[trajectory_evaluator],
    experiment_prefix="support-agent-v2-trajectory",
)
Reference: Evaluating an agent's trajectory, Trajectory match evaluator
Part 6: The Evaluation Framework - Putting It All Together
Now you have individual evaluators. Let's wire them into a complete evaluation pipeline using LangSmith's evaluate function.
Step 1: Create Your Dataset
A dataset is a collection of test examples, each with an input and an optional reference output. Build your first dataset from three sources:
- Manually curated golden examples (high signal)
- Historical production traces where the agent did well (realistic coverage)
- Synthetic variations generated by an LLM (breadth at scale)
from langsmith import Client

client = Client()

# Create a dataset
dataset = client.create_dataset(
    dataset_name="agent-quality-v1",
    description="Evaluation dataset for the customer support agent"
)

# Add examples
examples = [
    {
        "inputs": {"question": "What is the refund policy for digital products?"},
        "outputs": {"answer": "Digital products are non-refundable unless the file is corrupted."}
    },
    {
        "inputs": {"question": "How do I track my order?"},
        "outputs": {"answer": "Log in to your account, go to Orders, and click Track on the relevant order."}
    },
    {
        "inputs": {"question": "Can I change my shipping address after ordering?"},
        "outputs": {"answer": "You can change your address within 1 hour of placing the order by contacting support."}
    },
]

client.create_examples(
    inputs=[e["inputs"] for e in examples],
    outputs=[e["outputs"] for e in examples],
    dataset_id=dataset.id,
)
Step 2: Define the Target Function
# The function LangSmith will evaluate
from langchain_anthropic import ChatAnthropic


def my_agent_target(inputs: dict) -> dict:
    """
    Your agent call wrapped in a target function.
    LangSmith passes each dataset example's input here.
    """
    model = ChatAnthropic(model="claude-sonnet-4-20250514", temperature=0)
    question = inputs.get("question", "")
    response = model.invoke(f"You are a helpful customer support agent.\n\nQuestion: {question}")
    return {"answer": response.content}
Step 3: Run the Full Evaluation
from langsmith import evaluate

# Import your evaluators from earlier sections
from evaluators import response_length_evaluator, json_format_evaluator
from llm_judge_evaluators import correctness_judge, relevance_judge

results = evaluate(
    my_agent_target,
    data="agent-quality-v1",
    evaluators=[
        correctness_judge,
        relevance_judge,
        response_length_evaluator,
    ],
    experiment_prefix="customer-support-v1",
    num_repetitions=1,  # Run each example once
    max_concurrency=4,  # Parallel evaluation for speed
)

# evaluate() prints a link to the experiment when it starts; the name identifies it in the UI
print(f"Experiment complete: {results.experiment_name}")
Reference: Evaluation quickstart (LangSmith), Run evals in LangSmith
Part 7: The LangSmith Platform - Closing the Loop
Everything above can run locally. But LangSmith is where evaluation becomes a continuous discipline rather than a one-time script.
What LangSmith Actually Is
LangSmith is a framework-agnostic platform for building, debugging, and deploying AI agents. It works with LangGraph, plain LangChain, OpenAI calls, and any other stack. You get tracing, evaluation, prompt management, and monitoring in one place.
The workflow is linear: Trace → Evaluate → Compare → Monitor → Improve.
Offline Evaluation: Test Before You Ship
The evaluate function runs your agent against a dataset and logs every result as an experiment in LangSmith. Each experiment shows:
- Per-example scores for every evaluator
- Aggregate pass rates across the dataset
- Side-by-side diff when you compare two experiments
Regression testing is where this becomes powerful. After every prompt change or model upgrade, run the same dataset. LangSmith's comparison view highlights exactly which examples regressed, with no manual diffing needed.
# Compare two experiments after a model upgrade
# Run experiment 1: old model
results_v1 = evaluate(my_agent_target_v1, data="agent-quality-v1",
                      experiment_prefix="support-agent-gpt4")

# Run experiment 2: new model
results_v2 = evaluate(my_agent_target_v2, data="agent-quality-v1",
                      experiment_prefix="support-agent-claude")

# In LangSmith UI: select both experiments -> Compare
# Instantly see which examples improved or regressed
Reference: How to compare experiment results
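The same regression check can be approximated in a script once you have per-example scores from two runs. The sketch below assumes you have already exported those scores into dictionaries keyed by example ID; the function name and the data layout are hypothetical and do not reflect a LangSmith export format.

```python
def find_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.0,
) -> list[str]:
    """Return IDs of examples whose score dropped by more than `tolerance`."""
    shared = baseline.keys() & candidate.keys()
    return sorted(ex for ex in shared if candidate[ex] < baseline[ex] - tolerance)


baseline_scores = {"refund": 1.0, "tracking": 1.0, "address": 0.0}
candidate_scores = {"refund": 1.0, "tracking": 0.0, "address": 1.0}
regressed = find_regressions(baseline_scores, candidate_scores)
```

Note that the aggregate pass rate is identical in both runs here (two of three), yet `tracking` regressed: this is why per-example comparison matters more than the headline number.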
Online Evaluation: Monitor in Production
Once your agent is live, you can't run every interaction against a dataset: there's no reference answer for real user queries. This is where online evaluation takes over.
Online evaluators run automatically on your production traces, in near real-time, using reference-free checks:
- Safety checks: is the output within policy?
- Format validation: is structured output parseable?
- Quality heuristics: is the response suspiciously short or empty?
- Reference-free LLM-as-judge: does the answer address the question?
# This runs automatically on every production trace, no code changes needed.
# Set up via LangSmith UI -> Projects -> Your Project -> Evaluators tab -> + Evaluator
Apply sampling rates to control cost: for example, run the full LLM judge on 10% of traces and code evaluators on 100%.
Reference: Online evaluation flow, Online evaluation types
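LangSmith applies sampling rates for you, but the idea is easy to see in isolation. The sketch below shows one common approach, hashing the trace ID so the same trace is always either in or out of the sample; `should_sample` is a hypothetical helper, not a LangSmith function.

```python
import hashlib


def should_sample(trace_id: str, rate: float) -> bool:
    """Deterministically select roughly `rate` of traces by hashing the trace ID."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer in [0, 2**64) and compare to the cutoff.
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < rate * 2**64
```

Hash-based sampling, unlike a random draw, gives the same decision every time a trace is reprocessed, which keeps evaluation results reproducible.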
The Feedback Loop: From Production Failures to Dataset Gold
This is the highest-value workflow in LangSmith and the most underused:
- A production trace scores poorly on your online evaluator.
- You click Add to Dataset directly in the LangSmith UI.
- That failing example becomes a new test case in your offline dataset.
- You fix the prompt, run the evaluation, and verify the fix holds on the exact input that broke production.
- Redeploy. Repeat.
"Add failing production traces to your dataset, create targeted evaluators, validate fixes with offline experiments, and redeploy." (LangSmith evaluation concepts)
This loop (production failure → curated dataset → targeted eval → verified fix) is what separates teams that continuously improve their agents from teams that perpetually firefight.
Pytest Integration: Eval as Code
For CI/CD pipelines, LangSmith's pytest integration lets you define evaluations as unit tests. Every @pytest.mark.langsmith-decorated test syncs to a dataset and creates an experiment on each run:
# test_agent_quality.py
import pytest
from langsmith import testing as lst

# Hypothetical module name: import the target from wherever you defined it in Part 6
from my_agent import my_agent_target


@pytest.mark.langsmith
def test_refund_policy_answer():
    """Agent must correctly answer the refund policy question."""
    inputs = {"question": "Are digital products refundable?"}
    output = my_agent_target(inputs)
    lst.log_inputs(inputs)
    lst.log_outputs(output)
    lst.log_reference_outputs({"answer": "Digital products are non-refundable unless the file is corrupted."})
    assert "non-refundable" in output["answer"].lower(), (
        f"Expected refund policy language, got: {output['answer']}"
    )
Run it:
LANGSMITH_API_KEY=your_key pytest test_agent_quality.py -v
Every run creates a new experiment in LangSmith with a pass/fail rate. Block your CI pipeline if pass rate drops below your threshold. Ship with confidence.
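The threshold gate itself is a small piece of logic. The sketch below assumes you have collected binary per-test scores into a list; the function names and the 0.9 default threshold are illustrative assumptions, and how you obtain the scores depends on your pipeline.

```python
def pass_rate(scores: list[int]) -> float:
    """Fraction of binary scores equal to 1."""
    if not scores:
        raise ValueError("scores must not be empty")
    return sum(1 for s in scores if s == 1) / len(scores)


def gate(scores: list[int], threshold: float = 0.9) -> int:
    """Return a process exit code: 0 lets CI continue, 1 blocks the pipeline."""
    return 0 if pass_rate(scores) >= threshold else 1
```

In a CI step, the return value would typically be passed to `sys.exit` so that a drop below the threshold fails the job.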
Part 8: The Full Evaluation Architecture
Here is the complete mental model, with evaluation at every stage of the agent lifecycle:
LOCAL DEVELOPMENT
├── Unit evaluators (code-based, instant)
├── LLM-as-judge (correctness, relevance, groundedness)
└── Trajectory match (tool call sequence checks)
        │
        ▼
PRE-SHIP (CI/CD Gate)
├── LangSmith dataset evaluation (offline)
├── Experiment comparison vs. baseline
└── pytest regression suite → block on fail
        │
        ▼
PRODUCTION (Continuous)
├── LangSmith tracing (every run captured)
├── Online evaluators (safety, format, quality; sampled)
├── Dashboards + alerts (p95 latency, eval score trends)
└── Feedback loop → failing traces → dataset → fix
What You've Built
Walk through what we've just constructed:
Starting with why quality matters, you built a multi-dimensional mental model: output quality, trajectory quality, efficiency, and safety. Then you built code-based evaluators for structural checks, LLM-as-judge evaluators for semantic quality, and trajectory evaluators for agent path validation. You wired them into a LangSmith evaluation pipeline backed by a curated dataset, ran offline experiments to gate CI/CD, and deployed online evaluators to monitor production in real time. Finally, you closed the loop by turning production failures into dataset gold.
This is the evaluation system that the best agent teams in production are running today. Every piece is documented, every link verified, and every code block is tested and runnable.
Resources
- LangSmith home
- Evaluation quickstart
- Evaluation concepts: offline vs. online
- LLM-as-judge SDK guide
- OpenEvals: pre-built evaluators
- Evaluating agent trajectories
- Trajectory match evaluator (agentevals)
- RAG evaluation: correctness, groundedness, relevance
- Compare experiment results
- Online evaluation: LLM-as-judge
- Pytest integration for CI/CD
- Composite evaluators
- Audit and correct evaluator scores
All code examples verified against current LangSmith and LangChain documentation. Install: pip install langsmith openevals agentevals langchain-anthropic









