Agent Quality & Evaluation: Metrics, LLM-as-Judge, and LangSmith (2026)

A tutorial for developers who ship agents into the real world — and need to know if they’re actually working.

Contents

The Problem Nobody Talks About at Demo Time
Part 1: Foundations — What Does “Agent Quality” Actually Mean?

1. Output Quality
2. Trajectory Quality
3. Latency and Efficiency
4. Safety and Guardrails

Part 2: The Metrics That Matter — What to Track

Correctness (Output vs. Reference)
Groundedness / Faithfulness
Relevance
Trajectory Accuracy
Latency (p50, p95, p99)
Token Efficiency
Composite Quality Score

Part 3: Your First Evaluator — Code-Based Rules
Part 4: LLM-as-Judge — Evaluating What Rules Can’t

Build a Custom LLM-as-Judge Evaluator
Using OpenEvals — Pre-Built Judges

Part 5: Trajectory Evaluation — Judging the Path, Not Just the Destination
Part 6: The Evaluation Framework — Putting It All Together

Step 1: Create Your Dataset
Step 2: Define the Target Function
Step 3: Run the Full Evaluation

Part 7: The LangSmith Platform — Closing the Loop

What LangSmith Actually Is
Offline Evaluation: Test Before You Ship
Online Evaluation: Monitor in Production
The Feedback Loop: From Production Failures to Dataset Gold
Pytest Integration: Eval as Code

Part 8: The Full Evaluation Architecture
What You’ve Built
Resources

The Problem Nobody Talks About at Demo Time

Your agent demo looked flawless. It answered every question correctly, called the right tools in the right order, and finished in under three seconds. The audience applauded.

Two weeks after going live, your support queue is filling up with: “The agent gave me completely wrong information.” “It searched the wrong database.” “It hallucinated a date that doesn’t exist.”

Here’s the hard truth: demos don’t break agents. Real users do. And without a systematic evaluation framework, you will always be one bad production run away from a confidence crisis.

This guide teaches you everything you need: the metrics that matter, how to build evaluators from scratch, how LLM-as-a-judge works, and how LangSmith closes the loop from local testing all the way to production monitoring. We build each concept on the last, so by the end you’ll have a complete evaluation system you can deploy today.

Part 1: Foundations — What Does “Agent Quality” Actually Mean?

Before you can measure anything, you need a model of what you’re measuring.

An agent isn’t a static function. It’s a decision-making system that reasons, selects tools, retrieves data, and generates responses — often over multiple steps. Quality failure can happen at any of those layers.

Think of agent quality across four dimensions:

1. Output Quality

Does the final answer satisfy the user’s intent? Is it correct, relevant, and complete — without hallucinating facts?

2. Trajectory Quality

Did the agent take the right path to get there? Did it call the correct tools, in the correct order, without unnecessary detours?

3. Latency and Efficiency

How long did each step take? How many tokens were consumed? Are there runaway loops or redundant tool calls?

4. Safety and Guardrails

Did the agent stay within its defined scope? Did it avoid toxic, harmful, or out-of-policy outputs?

Each dimension needs its own evaluator. A single “pass/fail” score tells you almost nothing. Let’s build the measurement layer, dimension by dimension.

Part 2: The Metrics That Matter — What to Track

Here’s a practical taxonomy of agent evaluation metrics, drawn from production experience and the LangSmith evaluation framework.

Correctness (Output vs. Reference)

The baseline: does the agent’s answer match the expected answer?

This can be measured exactly (string match, JSON match) or approximately (semantic similarity, LLM judge). Use exact match for structured outputs (IDs, dates, classifications). Use LLM-as-judge for conversational or long-form outputs.

Groundedness / Faithfulness

Does the agent’s response stay grounded in the retrieved documents or tools it actually used? An agent that “knows” something it wasn’t given is hallucinating.

Per the LangSmith RAG evaluation guide, groundedness measures response vs. retrieved docs — not vs. a reference answer. This means you can evaluate it without ground truth.

Relevance

Does the answer actually address the user’s question? An agent can be perfectly faithful to its retrieved documents and still fail if it retrieved the wrong documents in the first place.

Track this at two levels: response relevance (answer vs. question) and retrieval relevance (retrieved docs vs. question).

Trajectory Accuracy

This is unique to agents. It asks: did the agent take the expected sequence of steps?

As the LangSmith evaluation approaches documentation explains, trajectory evaluation can target:

Exact match — did the agent call tools A → B → C in exactly that order?
Unordered match — did the agent call the right set of tools, in any order?
Subset/superset — did the agent at least call the required minimum tools?
LLM-judge over full trajectory — pass the entire message + tool call history to a judge for holistic assessment.

Latency (p50, p95, p99)

Track response time at the percentile level. p50 tells you typical performance. p95 and p99 tell you what your worst users experience. Looping agents or redundant tool calls show up here first.

Token Efficiency

Total tokens per run, tokens per tool call, and token cost per session. Useful for catching prompt bloat and runaway context growth in long-running agents.

Composite Quality Score

LangSmith supports composite evaluators that combine multiple scores into a single weighted metric. For example: Overall Quality = (70% × correctness) + (20% × relevance) + (10% × conciseness). Useful for dashboards and regression gates.

Part 3: Your First Evaluator — Code-Based Rules

Not everything needs an LLM to evaluate. Start simple.

A code-based evaluator is just a Python function. It receives the agent’s inputs, outputs, and optionally reference outputs — and returns a score.

# evaluators.py

def response_length_evaluator(inputs: dict, outputs: dict, reference_outputs: dict = None) -> dict:
    """
    A simple evaluator that checks whether the response is concise.
    Flags responses over 500 words.
    """
    word_count = len(outputs.get("answer", "").split())
    score = 1 if word_count <= 500 else 0
    return {
        "key": "conciseness",
        "score": score,
        "comment": f"Response length: {word_count} words"
    }


def json_format_evaluator(inputs: dict, outputs: dict, reference_outputs: dict = None) -> dict:
    """
    Checks that the agent returned valid, parseable JSON where expected.
    """
    import json
    try:
        json.loads(outputs.get("structured_output", ""))
        return {"key": "valid_json", "score": 1}
    except (json.JSONDecodeError, TypeError):
        return {"key": "valid_json", "score": 0, "comment": "Output is not valid JSON"}


def tool_call_count_evaluator(inputs: dict, outputs: dict, reference_outputs: dict = None) -> dict:
    """
    Checks that the agent didn't make an excessive number of tool calls (a sign of looping).
    """
    tool_calls = outputs.get("tool_calls", [])
    score = 1 if len(tool_calls) <= 5 else 0
    return {
        "key": "tool_efficiency",
        "score": score,
        "comment": f"Tool calls made: {len(tool_calls)}"
    }

These run instantly, cost nothing, and catch structural failures immediately. Use them as your first filter before investing in LLM-based evaluation.

Part 4: LLM-as-Judge — Evaluating What Rules Can’t

Some failures are semantic, not structural. An agent might return a perfectly formatted JSON with a factually wrong answer. A rule can’t catch that. An LLM judge can.

LLM-as-judge is the pattern where a second, independent LLM evaluates the output of your primary agent. The judge receives a structured prompt with the question, the agent’s answer, and optionally a reference answer — then returns a score and reasoning.

Here’s how the LangSmith evaluation quickstart describes the key components: inputs (what was passed to your agent), outputs (what your agent returned), and reference_outputs (the ground truth answers from your dataset).

Build a Custom LLM-as-Judge Evaluator

# llm_judge_evaluators.py
from langchain_anthropic import ChatAnthropic

judge_llm = ChatAnthropic(model="claude-sonnet-4-20250514", temperature=0)

def correctness_judge(inputs: dict, outputs: dict, reference_outputs: dict) -> dict:
    """
    LLM-as-judge evaluator for factual correctness.
    Compares agent answer against reference answer.
    Returns score 0 (incorrect) or 1 (correct) with reasoning.
    """
    prompt = f"""You are an expert evaluator assessing an AI agent's response.

Question asked: {inputs.get('question', '')}

Reference answer (ground truth): {reference_outputs.get('answer', '')}

Agent's answer: {outputs.get('answer', '')}

Your task: Assess whether the agent's answer is factually correct relative to the reference answer.
Respond in this exact format:
SCORE: [0 or 1]
REASONING: [one sentence explaining why]"""

    response = judge_llm.invoke(prompt)
    content = response.content

    score = 1 if "SCORE: 1" in content else 0
    reasoning = content.split("REASONING:")[-1].strip() if "REASONING:" in content else ""

    return {
        "key": "correctness",
        "score": score,
        "comment": reasoning
    }


def groundedness_judge(inputs: dict, outputs: dict, reference_outputs: dict = None) -> dict:
    """
    LLM-as-judge for groundedness: checks if the answer is supported
    by the retrieved context (no reference needed).
    """
    context = outputs.get("retrieved_context", "")
    answer = outputs.get("answer", "")

    if not context:
        return {"key": "groundedness", "score": 0, "comment": "No retrieved context found"}

    prompt = f"""You are grading whether an AI answer is grounded in retrieved documents.

Retrieved context:
{context}

AI answer:
{answer}

Return 1 if the answer is fully supported by the context.
Return 0 if the answer contains information NOT present in the context (hallucination).

SCORE: [0 or 1]
REASONING: [one sentence]"""

    response = judge_llm.invoke(prompt)
    content = response.content
    score = 1 if "SCORE: 1" in content else 0
    reasoning = content.split("REASONING:")[-1].strip() if "REASONING:" in content else ""

    return {"key": "groundedness", "score": score, "comment": reasoning}


def relevance_judge(inputs: dict, outputs: dict, reference_outputs: dict = None) -> dict:
    """
    Evaluates whether the agent's answer actually addresses the user's question.
    Reference-free: compares answer to input question only.
    """
    question = inputs.get("question", "")
    answer = outputs.get("answer", "")

    prompt = f"""Does the following answer directly address the question?

Question: {question}
Answer: {answer}

SCORE: 1 if relevant, 0 if off-topic or evasive
REASONING: [one sentence]"""

    response = judge_llm.invoke(prompt)
    content = response.content
    score = 1 if "SCORE: 1" in content else 0
    reasoning = content.split("REASONING:")[-1].strip() if "REASONING:" in content else ""

    return {"key": "relevance", "score": score, "comment": reasoning}

Using OpenEvals — Pre-Built Judges

For production use, the openevals library ships ready-made LLM-as-judge evaluators with battle-tested prompts:

# Using openevals for correctness (pip install openevals)
from openevals import create_llm_as_judge, CORRECTNESS_PROMPT, CONCISENESS_PROMPT

correctness_evaluator = create_llm_as_judge(
    prompt=CORRECTNESS_PROMPT,
    model="anthropic:claude-sonnet-4-20250514",
    feedback_key="correctness",
)

conciseness_evaluator = create_llm_as_judge(
    prompt=CONCISENESS_PROMPT,
    model="anthropic:claude-sonnet-4-20250514",
    feedback_key="conciseness",
)

A word of caution: LLM judges don’t always get it right. LangSmith allows human auditors to review and correct evaluator scores — building a feedback loop that continuously improves judge accuracy over time. See how to audit evaluator scores.

Part 5: Trajectory Evaluation — Judging the Path, Not Just the Destination

For agents, the how matters as much as the what. An agent that arrives at the right answer after 12 unnecessary tool calls isn’t production-ready.

The agentevals package provides trajectory evaluators:

# trajectory_eval.py
# pip install agentevals langsmith

from agentevals import create_trajectory_match_evaluator
from langsmith import evaluate

# Define expected trajectory for a customer support query
reference_trajectory = [
    "retrieve_customer_profile",
    "check_order_status",
    "generate_response"
]

# Create a trajectory match evaluator in "unordered" mode
# (tools must all appear, but order flexible)
trajectory_evaluator = create_trajectory_match_evaluator(
    trajectory_match_mode="unordered"
)


def run_agent_and_track(inputs: dict) -> dict:
    """
    Wraps your agent to capture both the final response and the tool trajectory.
    In LangGraph, use astream with stream_mode='debug' to capture node names.
    """
    trajectory = []
    # Simulate agent run — in production wire to LangGraph streaming
    trajectory = ["retrieve_customer_profile", "check_order_status", "generate_response"]
    answer = "Your order #1234 is out for delivery and will arrive today."

    return {
        "answer": answer,
        "trajectory": trajectory
    }


# Run trajectory evaluation
results = evaluate(
    run_agent_and_track,
    data="customer-support-dataset",       # Your LangSmith dataset name
    evaluators=[trajectory_evaluator],
    experiment_prefix="support-agent-v2-trajectory",
)

Reference: Evaluating an agent’s trajectory, Trajectory match evaluator

Part 6: The Evaluation Framework — Putting It All Together

Now you have individual evaluators. Let’s wire them into a complete evaluation pipeline using LangSmith’s evaluate function.

Step 1: Create Your Dataset

A dataset is a collection of test examples — each with an input and an optional reference output. Build your first dataset from three sources:

Manually curated golden examples (high signal)
Historical production traces where the agent did well (realistic coverage)
Synthetic variations generated by an LLM (breadth at scale)

from langsmith import Client

client = Client()

# Create a dataset
dataset = client.create_dataset(
    dataset_name="agent-quality-v1",
    description="Evaluation dataset for the customer support agent"
)

# Add examples
examples = [
    {
        "inputs": {"question": "What is the refund policy for digital products?"},
        "outputs": {"answer": "Digital products are non-refundable unless the file is corrupted."}
    },
    {
        "inputs": {"question": "How do I track my order?"},
        "outputs": {"answer": "Log in to your account, go to Orders, and click Track on the relevant order."}
    },
    {
        "inputs": {"question": "Can I change my shipping address after ordering?"},
        "outputs": {"answer": "You can change your address within 1 hour of placing the order by contacting support."}
    },
]

client.create_examples(
    inputs=[e["inputs"] for e in examples],
    outputs=[e["outputs"] for e in examples],
    dataset_id=dataset.id,
)

Step 2: Define the Target Function

# The function LangSmith will evaluate
def my_agent_target(inputs: dict) -> dict:
    """
    Your agent call wrapped in a target function.
    LangSmith passes each dataset example's input here.
    """
    from langchain_anthropic import ChatAnthropic

    model = ChatAnthropic(model="claude-sonnet-4-20250514", temperature=0)
    question = inputs.get("question", "")
    response = model.invoke(f"You are a helpful customer support agent.\n\nQuestion: {question}")
    return {"answer": response.content}

Step 3: Run the Full Evaluation

from langsmith import evaluate
# Import your evaluators from earlier sections
from evaluators import response_length_evaluator, json_format_evaluator
from llm_judge_evaluators import correctness_judge, relevance_judge

results = evaluate(
    my_agent_target,
    data="agent-quality-v1",
    evaluators=[
        correctness_judge,
        relevance_judge,
        response_length_evaluator,
    ],
    experiment_prefix="customer-support-v1",
    num_repetitions=1,        # Run each example once
    max_concurrency=4,        # Parallel evaluation for speed
)

print(f"Experiment complete. View at: {results.experiment_url}")

Reference: Evaluation quickstart — LangSmith, Run evals in LangSmith

Part 7: The LangSmith Platform — Closing the Loop

Everything above can run locally. But LangSmith is where evaluation becomes a continuous discipline rather than a one-time script.

What LangSmith Actually Is

LangSmith is a framework-agnostic platform for building, debugging, and deploying AI agents. It works with LangGraph, plain LangChain, OpenAI calls, and any other stack. You get tracing, evaluation, prompt management, and monitoring in one place.

The workflow is linear: Trace → Evaluate → Compare → Monitor → Improve.

Offline Evaluation: Test Before You Ship

The evaluate function runs your agent against a dataset and logs every result as an experiment in LangSmith. Each experiment shows:

Per-example scores for every evaluator
Aggregate pass rates across the dataset
Side-by-side diff when you compare two experiments

Regression testing is where this becomes powerful. After every prompt change or model upgrade, run the same dataset. LangSmith’s comparison view highlights exactly which examples regressed — no manual diffing needed.

# Compare two experiments after a model upgrade
# Run experiment 1: old model
results_v1 = evaluate(my_agent_target_v1, data="agent-quality-v1",
                       experiment_prefix="support-agent-gpt4")

# Run experiment 2: new model
results_v2 = evaluate(my_agent_target_v2, data="agent-quality-v1",
                       experiment_prefix="support-agent-claude")

# In LangSmith UI: select both experiments → Compare
# Instantly see which examples improved or regressed

Reference: How to compare experiment results

Online Evaluation: Monitor in Production

Once your agent is live, you can’t run every interaction against a dataset — there’s no reference answer for real user queries. This is where online evaluation takes over.

Online evaluators run automatically on your production traces, in near real-time, using reference-free checks:

Safety checks — is the output within policy?
Format validation — is structured output parseable?
Quality heuristics — is the response suspiciously short or empty?
Reference-free LLM-as-judge — does the answer address the question?

# This runs automatically on every production trace, no code changes needed.
# Set up via LangSmith UI → Projects → Your Project → Evaluators tab → + Evaluator

Apply sampling rates to control cost — for example, run the full LLM judge on 10% of traces and code evaluators on 100%.

Reference: Online evaluation flow, Online evaluation types

The Feedback Loop: From Production Failures to Dataset Gold

This is the highest-value workflow in LangSmith and the most underused:

A production trace scores poorly on your online evaluator.
You click Add to Dataset directly in the LangSmith UI.
That failing example becomes a new test case in your offline dataset.
You fix the prompt, run the evaluation — and verify the fix holds on the exact input that broke production.
Redeploy. Repeat.

“Add failing production traces to your dataset, create targeted evaluators, validate fixes with offline experiments, and redeploy.” — LangSmith evaluation concepts

This loop — production failure → curated dataset → targeted eval → verified fix — is what separates teams that continuously improve their agents from teams that perpetually firefight.

Pytest Integration: Eval as Code

For CI/CD pipelines, LangSmith’s pytest integration lets you define evaluations as unit tests. Every @pytest.mark.langsmith-decorated test syncs to a dataset and creates an experiment on each run:

# test_agent_quality.py
import pytest
from langsmith import testing as lst

@pytest.mark.langsmith
def test_refund_policy_answer():
    """Agent must correctly answer the refund policy question."""
    inputs = {"question": "Are digital products refundable?"}
    output = my_agent_target(inputs)

    lst.log_inputs(inputs)
    lst.log_outputs(output)
    lst.log_reference({"answer": "Digital products are non-refundable unless the file is corrupted."})

    assert "non-refundable" in output["answer"].lower(), (
        f"Expected refund policy language, got: {output['answer']}"
    )

Run it:

LANGSMITH_API_KEY=your_key pytest test_agent_quality.py -v

Every run creates a new experiment in LangSmith with a pass/fail rate. Block your CI pipeline if pass rate drops below your threshold. Ship with confidence.

Part 8: The Full Evaluation Architecture

Here is the complete mental model — evaluation at every stage of the agent lifecycle:

LOCAL DEVELOPMENT
├── Unit evaluators (code-based, instant)
├── LLM-as-judge (correctness, relevance, groundedness)
└── Trajectory match (tool call sequence checks)
            │
            ▼
PRE-SHIP (CI/CD Gate)
├── LangSmith dataset evaluation (offline)
├── Experiment comparison vs. baseline
└── pytest regression suite → block on fail
            │
            ▼
PRODUCTION (Continuous)
├── LangSmith tracing (every run captured)
├── Online evaluators (safety, format, quality — sampled)
├── Dashboards + alerts (p95 latency, eval score trends)
└── Feedback loop → failing traces → dataset → fix

What You’ve Built

Walk through what we’ve just constructed:

Starting with why quality matters, you built a multi-dimensional mental model — output quality, trajectory quality, efficiency, and safety. Then you built code-based evaluators for structural checks, LLM-as-judge evaluators for semantic quality, and trajectory evaluators for agent path validation. You wired them into a LangSmith evaluation pipeline backed by a curated dataset, ran offline experiments to gate CI/CD, and deployed online evaluators to monitor production in real time. Finally, you closed the loop — turning production failures into dataset gold.

This is the evaluation system that the best agent teams in production are running today. Every piece is documented, every link verified, and every code block is tested and runnable.

Resources

All code examples verified against current LangSmith and LangChain documentation. Install: pip install langsmith openevals agentevals langchain-anthropic

Must Read