In the world of AI-powered automation, building a capable agent is only half the battle. Ensuring it performs reliably and accurately in real-world scenarios is the true test. This is where a robust evaluation strategy comes in. Without it, you’re essentially deploying your automations blind, hoping they work as intended.
UiPath’s Agentic Automation platform provides powerful tools to systematically measure and improve your agent’s performance. The key to leveraging these tools effectively is organization. In this blog post, we’ll break down the best practices for structuring your evaluations, from grouping them into logical sets to choosing the right scoring engines, or “evaluators.”
The Core Philosophy of AI Agent Evaluation
At its heart, evaluating an AI agent is no different from quality assurance in software development or performance review for an employee. The goal is to systematically answer one critical question: “Is this agent reliably performing its intended task to the required standard?”
This moves you from anecdotal testing (“Let me try a few queries”) to empirical, measurable validation (“Based on 200 test cases, the agent achieves 95% accuracy on core tasks”).
The Universal “Why”: Why Evaluate AI Agents?
Think of an AI Agent as a new employee. You wouldn’t deploy them to handle critical business tasks without training and checking their work. Evaluation is that continuous training and quality check process.
- Reliability: Ensures the agent performs consistently, not just correctly on one lucky try.
- Accuracy: Measures if the agent’s outputs are factually correct and meet the task requirements.
- Robustness: Tests how the agent handles edge cases, errors, and unexpected inputs without breaking.
- Improvement: Provides a feedback loop to iteratively improve the agent’s prompts, tools, and reasoning (e.g., using RAG).
- Trust: Builds confidence to deploy the agent into production processes.
🔎 What are Evaluators?
Evaluators are the measurement mechanisms used to check if an agent is doing its job well. They can be:
- Rule-based evaluators → Compare agent output against expected results (ground truth).
- Metric-based evaluators → Use quantitative scores (e.g., accuracy, precision, latency).
- Human evaluators → End-users or SMEs rate usefulness, correctness, clarity.
- LLM-as-a-judge evaluators → Another AI model scores the agent’s output quality (used in LLM/agent frameworks like LangChain, LlamaIndex, DSPy).
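To make these categories concrete, here is a minimal, framework-agnostic sketch of the rule-based, metric-based, and LLM-as-a-judge styles as plain Python functions. Note that `ask_llm` is an assumed placeholder for whatever model client you use; it is not a UiPath or LangChain API.

```python
from difflib import SequenceMatcher

def rule_based_evaluator(actual: str, expected: str) -> bool:
    """Rule-based: pass only if the output matches the ground truth exactly."""
    return actual.strip() == expected.strip()

def metric_based_evaluator(actual: str, expected: str) -> float:
    """Metric-based: return a similarity score between 0.0 and 1.0."""
    return SequenceMatcher(None, actual, expected).ratio()

def llm_as_judge_evaluator(actual: str, expected: str, ask_llm) -> float:
    """LLM-as-a-judge: delegate scoring to a model.
    `ask_llm` is a stand-in for your own model client and is assumed
    to return a rating from 1 to 10 as text."""
    prompt = (
        "Rate from 1 to 10 how well the ACTUAL answer matches the intent of the "
        f"EXPECTED answer.\nEXPECTED: {expected}\nACTUAL: {actual}\n"
        "Reply with the number only."
    )
    return float(ask_llm(prompt)) / 10.0
```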
🔎 What is Evaluation?
Evaluation is the systematic process of testing agent behavior across dimensions like correctness, robustness, usability, and business value.
It helps answer questions like:
- Does the agent solve the intended problem?
- Is it reliable under different conditions?
- Does it align with business and compliance needs?
🔎 Techniques for Evaluation (Across Frameworks)
| Technique | Description | Examples / Usage | Primary Goal |
|---|---|---|---|
| Ground Truth / Rule-Based Testing | Outputs compared against predefined correct answers (classic software-style testing). | Validate extracted invoice amount = expected DB value. | Ensure correctness against known outcomes. |
| Metric-Based Evaluation | Uses quantitative KPIs to measure accuracy, efficiency, and performance. | Accuracy/Precision/Recall → extraction tasks. Latency/Throughput → performance. Cost per execution. | Measure performance and efficiency. |
| Simulation & Scenario Testing | Agents tested in synthetic but realistic environments, covering edge cases and noise. | Multi-agent setup → simulate multiple customer requests at once. | Test robustness and adaptability. |
| Human-in-the-Loop (HITL) Evaluation | SMEs or users validate correctness, usefulness, or context. | Customer support bots → humans rate empathy/clarity of responses. | Validate quality and contextual relevance. |
| Adversarial Testing | Stress test agents with unexpected, malformed, or malicious inputs. | LLM → jailbreak prompts. RPA → incomplete/malformed data. | Assess resilience and security. |
| LLM-as-a-Judge / Model-based Evaluation | Another AI model evaluates outputs instead of humans/rules. | Ask evaluator model: “Rate correctness (1–10)” or “Does this follow instructions?” | Automate qualitative evaluation at scale. |
| User Experience Testing | Collects qualitative feedback on usability, clarity, and satisfaction. | NPS surveys, feedback ratings, interaction analytics. | Improve usability and user satisfaction. |
| Continuous Evaluation (Monitoring & Logging) | Ongoing monitoring of live agent performance, drift detection, and retraining triggers. | Real-time dashboards, error logging, SLA tracking. | Ensure long-term reliability and improvement. |
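As a small illustration of the metric-based row above, the sketch below scores one extracted field across a batch of documents; the `amount` field and the sample values are assumptions made for the example.

```python
def extraction_accuracy(results: list[dict], ground_truth: list[dict], field: str) -> float:
    """Fraction of documents where the extracted field equals the expected value."""
    correct = sum(
        1 for got, want in zip(results, ground_truth)
        if got.get(field) == want.get(field)
    )
    return correct / len(ground_truth) if ground_truth else 0.0

# Two of three invoice amounts extracted correctly -> accuracy ≈ 0.67
extracted = [{"amount": 120.50}, {"amount": 99.00}, {"amount": 10.00}]
expected  = [{"amount": 120.50}, {"amount": 99.00}, {"amount": 10.01}]
print(extraction_accuracy(extracted, expected, "amount"))
```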
Why Group Evaluations into Sets?
Trying to test every possible scenario in one disorganized list is inefficient and unclear. Grouping your evaluations into purposeful sets allows you to:
- Focus your testing on specific areas of your agent’s behavior.
- Interpret results more easily by understanding the context of any failures.
- Manage your test suites efficiently as your agents evolve.
UiPath recommends creating the following primary types of evaluation sets to cover all the bases:
| Evaluation Set | Purpose | % of Total Evaluations (Guideline) | Example Content |
|---|---|---|---|
| Full Output / Core Scenarios | Test the agent’s ability to provide complete, accurate, and helpful responses for common requests. | ~50% | “How do I reset my password?”, “Create a Teams channel for project Apollo.” |
| Edge Cases & Error Handling | Test how the agent handles ambiguity, incomplete information, complex requests, and scenarios requiring escalation. | ~25% | “It doesn’t work.”, “I need access to everything.”, A request beyond the agent’s permissions. |
| Misspelling & Typographical Errors | Test the robustness of the model and its ability to understand user intent despite errors. | ~15% | “pasword reset”, “how 2 sharepoint file?”, “Excel is sheeting slowly.” |
| Complex Workflow & Tool Usage | Test multi-step processes, tool calling accuracy, parameter passing, and decision branches. | ~10% (Critical for complex agents) | A request that requires checking a database, sending an email, and updating a ticket. |
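One way to keep this split explicit is to label every test case with the set it belongs to. The structure below is purely illustrative (it is not a UiPath file format), but it shows how a dataset like this might be organized and sanity-checked:

```python
from collections import Counter

evaluation_cases = [
    # Full Output / Core Scenarios (~50%)
    {"set": "core", "input": "How do I reset my password?",
     "expected": "Step-by-step self-service password reset instructions."},
    {"set": "core", "input": "Create a Teams channel for project Apollo.",
     "expected": "Ticket routed to O365_Teams_ChannelCreation."},
    # Edge Cases & Error Handling (~25%)
    {"set": "edge", "input": "It doesn't work.",
     "expected": "Clarifying questions about the application, action, and error message."},
    # Misspelling & Typographical Errors (~15%)
    {"set": "typo", "input": "pasword reset",
     "expected": "Same guidance as the correctly spelled password-reset request."},
    # Complex Workflow & Tool Usage (~10%)
    {"set": "workflow", "input": "Check the database for my open ticket, email the owner, and update its status.",
     "expected": "Three tool calls in order: database lookup, send email, update ticket."},
]

# Quick sanity check of how the cases are distributed across the four sets
print(Counter(case["set"] for case in evaluation_cases))
```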
1. The Full Output Evaluation Set
This is your foundation—the suite of tests that validate normal, expected behavior under typical conditions.
- Purpose: To verify core functionality and logic.
- What it covers:
- Basic Functionality: Does the agent produce the correct output for valid inputs?
- Core Logic: Are calculations, data comparisons, and field validations working correctly?
- Example Tests:
- Does the total on an extracted invoice match the sum of its line items?
- Is the format of dates, currencies, and numbers correctly validated?
- Benefit: This set gives you confidence that the primary functions of your agent are working as expected. It’s your essential first pass.
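A rough sketch of what such a core-logic check could look like in code (the invoice fields used here are assumptions, not a UiPath schema):

```python
from datetime import datetime

def check_invoice(invoice: dict) -> list[str]:
    """Return a list of failures for one extracted invoice; an empty list means pass."""
    failures = []

    # Core logic: the total must equal the sum of the line-item amounts (within rounding)
    line_sum = round(sum(item["amount"] for item in invoice["line_items"]), 2)
    if abs(line_sum - invoice["total"]) > 0.01:
        failures.append(f"total {invoice['total']} does not match line-item sum {line_sum}")

    # Format validation: the invoice date must be ISO formatted (YYYY-MM-DD)
    try:
        datetime.strptime(invoice["invoice_date"], "%Y-%m-%d")
    except ValueError:
        failures.append(f"invalid date format: {invoice['invoice_date']}")

    return failures

sample = {"total": 150.00, "invoice_date": "2024-07-01",
          "line_items": [{"amount": 100.00}, {"amount": 50.00}]}
print(check_invoice(sample))  # [] -> both checks pass
```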
2. The Edge Case Evaluation Set
This set is designed to probe the boundaries and robustness of your agent, testing how it handles rare, unexpected, or extreme conditions.
- Purpose: To uncover hidden bugs that don’t appear in normal operation.
- What it covers:
- Input Boundaries: Testing with maximum/minimum values (e.g., extremely high quantities or totals).
- Abnormal Inputs: How does it handle empty fields, extremely long text, or unusual data formats?
- Unusual Conditions: Testing with missing or incomplete data.
- Example Tests:
- What happens if an invoice total exceeds the system’s maximum allowed value?
- How does the agent react if a required field like `VendorName` is missing?
- Benefit: This set is crucial for ensuring stability and preventing crashes or errors in non-ideal, real-world scenarios.
3. The Misspelling and Typographical Error Set
Users and upstream systems make mistakes. This set tests your agent’s ability to handle imperfect input gracefully.
- Purpose: To ensure the agent is user-friendly and robust enough to handle common input errors.
- What it covers:
- Misspelled Fields (e.g., “VenderName” instead of “VendorName”).
- Partial Matches & Case Sensitivity (e.g., “ABC Corp.” vs. “ABC Corporation”).
- Unexpected Characters like leading/trailing spaces or special symbols.
- Example Tests:
- If a user enters “Acme Co.” but the system expects “Acme Company,” does it flag an error or use fuzzy matching to understand?
- How does it handle accidental spaces in a `PONumber` field?
- Benefit: This testing ensures your automation is resilient and can process data successfully even when input isn’t perfect, which is vital for real-world deployment.
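For reference, a tolerant matching step like the "Acme Co." example above might be sketched with Python's standard-library difflib; the vendor list and similarity cutoff are illustrative assumptions.

```python
from difflib import get_close_matches

KNOWN_VENDORS = ["Acme Company", "ABC Corporation", "Globex Inc."]

def resolve_vendor(raw_value: str, cutoff: float = 0.6) -> str | None:
    """Map a possibly misspelled or abbreviated vendor name onto a known vendor.
    Returns None when nothing is close enough, so the agent can flag it instead of guessing."""
    matches = get_close_matches(raw_value.strip(), KNOWN_VENDORS, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(resolve_vendor("  Acme Co.  "))    # "Acme Company" (similarity ≈ 0.7, above the cutoff)
print(resolve_vendor("Unknwon Vendor"))  # None -> flag for human review
```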
Mastering UiPath Agent Evaluations
The Engine of Evaluation: Understanding Evaluators
Evaluation sets define what to test, but Evaluators define how to score the results. They are the scoring engines that determine whether an agent's output meets your quality bar. Without them, an evaluation set is just a list of inputs and expected outputs, with no way to judge what the agent actually produced.
UiPath provides several types of evaluators to match your needs:
| Evaluation Type | Description | Best For |
|---|---|---|
| LLM-as-a-judge: Semantic Similarity | Uses a Large Language Model (LLM) to compare the generated output against an expected output. It evaluates whether the meaning and intent are similar, even if the wording is different. | Testing the quality and correctness of open-ended, conversational responses where exact wording may vary. |
| Create your own LLM-based evaluator | Provides a flexible framework to define custom evaluation logic using an LLM. You can craft specific prompts to ask the LLM to judge outputs based on your own unique criteria (e.g., “Check if the output is polite and professional”). | Highly customized evaluation needs that go beyond simple similarity, such as checking tone, style, or specific domain knowledge. |
| Trajectory | Evaluates the entire path or sequence of steps the agent took to reach its final answer. This includes the tools it used, the questions it asked, and the intermediate results. | Testing complex agents that use multiple tools or require multi-step reasoning. It ensures the agent’s process is logical and efficient, not just the final output. |
| Exact Match | Checks if the agent’s output precisely and character-for-character matches the expected output. Any variation in wording, punctuation, or formatting will cause a failure. | Validating structured outputs like codes, specific commands, URLs, or names where absolute precision is critical. |
| JSON Similarity | Checks if two JSON structures (e.g., the agent’s output and the expected output) are semantically similar. It can ignore inconsequential differences like whitespace or the order of keys. | Testing agents that return structured data via tools, ensuring they extract or generate the correct information and format it properly. |
| Faithfulness (Groundedness) | Scores whether the claims in the agent’s final output are entirely supported by and grounded in the context provided to it (e.g., from knowledge retrieval or tool outputs). It detects “hallucination.” | Ensuring the agent’s responses are accurate and based solely on the information it was given, which is crucial for RAG (Retrieval-Augmented Generation) applications. |
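To illustrate the intent behind the JSON Similarity evaluator (a simplified stand-in, not UiPath's actual scoring logic), a structural comparison can ignore key order and whitespace while still catching wrong values:

```python
import json

def json_similarity(actual: str, expected: str) -> float:
    """Fraction of expected top-level fields whose values match in the actual output.
    Key order and whitespace differences are ignored because both sides are parsed."""
    got, want = json.loads(actual), json.loads(expected)
    if not isinstance(got, dict) or not isinstance(want, dict) or not want:
        return 1.0 if got == want else 0.0
    matched = sum(1 for key, value in want.items() if got.get(key) == value)
    return matched / len(want)

expected = '{"vendor": "Acme Company", "total": 150.0, "currency": "USD"}'
actual   = '{ "currency":"USD", "vendor":"Acme Company", "total": 150.5 }'
print(json_similarity(actual, expected))  # ≈ 0.67 -> formatting ignored, wrong total caught
```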
The Lifecycle of an Evaluator
- Create: Build your evaluator in the Evaluators panel. Choose its type and give it a clear, semantic name (e.g., “US-Invoice-Totals-Range”).
- Attach: Assign one or more evaluators to an evaluation set. You can mix and match types (e.g., use an Exact Match for a status code and an LLM-as-a-Judge for a summary field).
- Version: Any change to an evaluator creates a new version. This maintains historical audit trails. For CI/CD pipelines, pin evaluator versions just like you would package dependencies.
- Retire: If business rules change, clone an evaluator and edit the clone. Never edit an existing evaluator in-place if you need to maintain auditability for past runs.
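For the versioning step, pinning evaluator versions in a CI/CD pipeline could look something like the hypothetical snippet below; the names, version numbers, and lookup structure are invented for illustration and are not a UiPath API.

```python
# Hypothetical pipeline configuration: each evaluator is referenced by name and an
# explicit version, so later edits to an evaluator cannot silently change how CI runs are scored.
PINNED_EVALUATORS = {
    "US-Invoice-Totals-Range": "3",  # rule-based range check, frozen at version 3
    "Summary-LLM-Judge": "1",        # LLM-as-a-judge evaluator, frozen at version 1
}

def resolve_evaluator(name: str, available_versions: dict[str, list[str]]) -> str:
    """Fail the pipeline early if a pinned evaluator version no longer exists."""
    pinned = PINNED_EVALUATORS[name]
    if pinned not in available_versions.get(name, []):
        raise ValueError(f"Evaluator {name} v{pinned} not found; was it edited in place?")
    return f"{name}@v{pinned}"

print(resolve_evaluator("US-Invoice-Totals-Range", {"US-Invoice-Totals-Range": ["1", "2", "3"]}))
```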
When to Create Your Evaluations
The best time to build your evaluation sets is once your agent’s arguments are stable and complete—meaning your use case, prompts, tools, and Context Grounding indexes are finalized. This minimizes rework. If you modify your agent’s design later, you will need to adjust your evaluations accordingly.
A major advantage of this system is reusability. You can easily export and import evaluation sets between agents in the same organization or even across different organizations, saving you from rebuilding them from scratch.

Example of Creating Evaluations
Agent: Internal IT Support Triage and Resolution Agent
This AI agent is designed to automate and enhance the Level 1 IT support function within an organization. Its core use case is to instantly handle incoming employee queries via a chat interface (e.g., Microsoft Teams, a web portal, or service desk email), reducing resolution time and freeing human agents for more complex tasks. The agent intelligently parses the user’s request, cross-references it against a curated internal knowledge base of IT guides and FAQs, and determines the optimal response path.
For common, resolvable issues—such as configuring Outlook settings, troubleshooting Excel errors, or guiding users through SharePoint sharing permissions—the agent provides immediate, clear, and structured step-by-step instructions directly within the chat. For issues that require administrative access, specialized skills, or physical intervention—such as creating new Teams channels, restoring mailbox access, or handling software crashes—the agent automatically generates a pre-formatted support ticket, accurately categorizes it, and assigns it to the correct resolver group (e.g., O365_Teams_ChannelCreation, O365_SharedMailbox_Access). This ensures seamless escalation, provides users with clear expectations, and integrates directly with the company’s existing IT Service Management (ITSM) workflow.
1. The Full Output Evaluation Set
This set is designed to test the agent’s ability to provide complete, accurate, and helpful step-by-step solutions for common but non-trivial technical issues.
Purpose: To evaluate the agent’s core knowledge, clarity of instruction, and ability to structure a helpful response.
Examples:
Name: Laura
Message: I accidentally deleted an important email in Outlook. Is there a way to recover it?
Expected Output:
Hello Laura, thank you for reaching out. Yes, deleted emails in Outlook are often recoverable. Here’s how you can try to get it back:
- Check the “Deleted Items” Folder: First, look in your Deleted Items folder. If it’s there, you can simply right-click the email and select “Move” > “Inbox” or another folder.
- Recover from “Recoverable Items”: If you’ve emptied the Deleted Items folder, the email may still be in a hidden folder for a limited time.
- In your Outlook folder pane, click on the “Deleted Items” folder.
- On the “Home” tab, click “Recover deleted items” (or in Outlook on the web, click “Recover items deleted from this folder”).
- A dialog box will open. Select the email(s) you want to recover and click “Restore Selected Items”. The emails will be moved back to your Deleted Items folder.
If these steps don’t work, and the email was critically important, you may need to contact your IT admin, as they might be able to restore it from a server-level backup. Let me know if you need further assistance!
Name: David
Message: How do I stop Excel from changing my numbers into dates? It’s very frustrating.
Expected Output:
Hi David, that’s a common and understandably frustrating issue. Excel tries to be “helpful” by auto-formatting data. Here are the best ways to prevent this:
- Pre-format the Cells (Recommended):
- Select the column or cells where you will enter your numbers.
- Right-click and choose “Format Cells”.
- Select “Text” from the Category list and click “OK”. Now, anything you type in these cells will be treated as text and not converted to a date.
- Use an Apostrophe:
- Before typing a number that could be misinterpreted (like 1-2), type an apostrophe (`'`). For example, type `'1-2`. The apostrophe will not be visible in the cell, but it forces Excel to treat the entry as text.
- Import Data Using Text Import Wizard:
- If you’re importing data from a .CSV or text file, use Data > From Text/CSV.
- In the preview window, select the problematic column and change its data type to “Text” before loading it.
This should give you full control over how your data is displayed. Hope this helps!
Name: Chloe
Message: My OneDrive sync keeps saying “Processing changes” for hours and won’t finish. What can I do?
Expected Output:
Hello Chloe, a stuck OneDrive sync is usually resolved by resetting the client. Don’t worry, this won’t delete your files. Please follow these steps:
- Close OneDrive Completely: Right-click the OneDrive cloud icon in your taskbar notification area and click “Close OneDrive”.
- Reset OneDrive: Press `Win + R`, type the following command, and press Enter: `%localappdata%\Microsoft\OneDrive\onedrive.exe /reset` (Note: A command window may flash briefly; this is normal.)
- Restart OneDrive: Press `Win + R` again, type this command, and press Enter: `%localappdata%\Microsoft\OneDrive\onedrive.exe`
- Wait for Sync to Restart: It may take a few minutes for the icon to reappear and for syncing to begin again. This process clears the local sync cache and often resolves persistent “Processing changes” issues.
If the problem continues, there might be a problem with a specific file. Let me know, and we can try more advanced troubleshooting like pausing sync and syncing folders one-by-one.
2. The Edge Case Evaluation Set
This set tests the agent’s ability to handle ambiguity, lack of information, complex multi-part requests, and scenarios where a direct solution isn’t possible, requiring escalation or clarifying questions.
Purpose: To evaluate the agent’s logical reasoning, communication skills, and understanding of its own limitations.
Examples:
Name: Mark
Message: It doesn’t work.
Expected Output:
Hi Mark, I’m sorry to hear you’re having trouble. To help you effectively, I’ll need a bit more information.
Could you please describe:
- What application or service “doesn’t work” (e.g., Outlook, Teams, Excel)?
- What were you trying to do when you encountered the issue?
- Are you seeing a specific error message? If so, what does it say?
The more details you can provide, the better I can assist you.
Name: Sophia
Message: I need full access to the Finance SharePoint site, the ability to edit the budget Excel file that’s there, and also my printer isn’t working.
Expected Output:
Hello Sophia, thank you for your message. I understand you have three separate requests:
- Access to Finance SharePoint Site: Access permissions for SharePoint sites are managed by the site owners or your IT administration team. I will create a support ticket for the SharePoint_Access team to process your access request.
- Edit Permissions for Budget File: Editing permissions for a specific file are also managed by the file owner or site admin. I have added this detail to the ticket.
- Printer Issue: Printer troubleshooting is handled by a different specialist team. I will create a second, separate ticket for the Hardware_Printing team to assist you with that.
You will be contacted shortly by both teams. Thank you for your patience.
Name: Ben
Message: My boss needs a report of all emails sent from our department last quarter for an audit. How do I get that?
Expected Output:
Hello Ben, that sounds like an important request for compliance. Generating that type of comprehensive email audit report typically requires administrative privileges and access to the Microsoft Purview compliance portal or Exchange Admin Center, which is not available to standard users.
This request will need to be handled by your organization’s Compliance or IT Administration team. I have created a high-priority support ticket and assigned it to the Compliance_Audit_Reporting team. They have the tools and permissions necessary to generate and provide that report for you. They will contact you shortly to confirm the exact details of the audit requirements.
3. The Misspelling and Typographical Error Set
This set tests the agent’s robustness and ability to use intent recognition to understand user queries despite spelling mistakes, grammatical errors, and auto-correct artifacts.
Purpose: To evaluate the agent’s Natural Language Processing (NLP) capabilities and fault tolerance.
Examples:
Name: Rachel
Message: How can I sheare a foleder in Teams? I need to colaborate on documments.
Expected Output:
Hello Rachel, absolutely! To share a folder in Microsoft Teams for collaboration, the best way is to add the folder to a Teams channel. Here’s how:
- Go to the specific channel within your team.
- Click on the “Files” tab at the top.
- Click “Upload” and select the folder from your computer. This will upload the entire folder and its contents to the channel’s associated SharePoint site.
- Once uploaded, everyone in the team will have access to the folder and can collaborate on the documents inside it in real-time.
This is much more effective than sharing separately via email. Let me know if you need help with the specific steps!
Name: Thomas
Message: outlook keeps askng for my pasword every 5 minits its so anoying
Expected Output:
Hi Thomas, I understand that’s very annoying. The constant password prompts are usually caused by a corrupted Outlook profile or incorrect credential storage. Here are the most effective solutions:
- Check Credential Manager:
- Press `Win + R`, type `control.exe /name Microsoft.CredentialManager`, and press Enter.
- Go to “Windows Credentials”.
- Look for any Outlook-related credentials (e.g., `MicrosoftOffice[email]`) and remove them. Restart Outlook and it will prompt you to enter your password fresh, which often fixes the loop.
- Go to Control Panel > Mail (Microsoft Outlook) > Show Profiles.
- Click “Add…” to create a new profile, set it up with your email account, and set it as the default.
- You can then remove the old profile.
Try the first step first—it resolves the issue most of the time. If it doesn’t work, let me know and we can escalate to the O365_Outlook_Config team.
Name: Jessica
Message: can u help me instal power bi desktop i cant find the downlod
Expected Output:
Hello Jessica, I can certainly help with that! Here is the direct link to download Power BI Desktop from the official Microsoft website:
Download Power BI Desktop: https://aka.ms/pbidesktopstore
(You can simply click that link or copy and paste it into your browser).
Once the download is complete, run the installer and follow the on-screen instructions. It’s a straightforward process. After installation, you can sign in with your work account to get started. Let me know if you encounter any issues during the installation!
Key Coverage Principles to Follow
- Logical Coverage Over Quantity: Don’t just add more of the same test. Map out all possible input combinations, decision branches, and boundary conditions. Ensure each unique path is tested.
- Manage Redundancy: For each unique logical case (e.g., “password reset”), 3-5 evaluations with slightly different phrasings are sufficient to ensure consistency without cluttering the dataset.
- Quality is Paramount: A well-designed set of 50 evaluations that tests all critical paths is far more valuable than 200 repetitive or low-quality tests. Focus on meaningful scenarios that reflect real-world use and potential failures.
- Iterate: Evaluations are not a one-time task. As you add new features or intents to your agent, you must expand your evaluation sets to cover them.
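One lightweight way to apply the logical-coverage principle above is to enumerate the intent, input-style, and decision-branch combinations you intend to cover before writing individual test cases; the categories below are purely illustrative.

```python
from itertools import product

intents = ["password_reset", "teams_channel_request", "printer_issue"]
input_styles = ["well-formed", "vague", "misspelled"]
branches = ["self-service answer", "ticket escalation"]

# Each unique (intent, style, branch) combination is one logical case to cover;
# prune combinations that cannot occur, then write 3-5 phrasings for each remaining case.
coverage_matrix = list(product(intents, input_styles, branches))
print(f"{len(coverage_matrix)} logical cases before pruning")  # 18
for intent, style, branch in coverage_matrix[:3]:
    print(f"- {intent} / {style} input / expected path: {branch}")
```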
| Agent Complexity | Description | Recommended Number of Evaluations | Key Focus Areas |
|---|---|---|---|
| Simple | Handles 1-2 intents, simple logic, no tools or few parameters. | ~30 evaluations | Core use cases, basic edge cases, common typos. |
| Moderate | Handles multiple related intents, uses tools, has conditional logic. | 50 – 70 evaluations | Broader input coverage, tool usage patterns, more complex edge cases. |
| Complex | Handles diverse, unrelated intents, complex tool usage, multiple decision branches. | 100+ evaluations | Full logical coverage, extensive edge case testing, complex error handling, persona variety. |
Start Building with Confidence
A structured approach to evaluation is not just a best practice—it’s a necessity for deploying trustworthy and robust AI agents. By grouping your tests into logical sets and leveraging the power of different evaluators, you can gain deep, actionable insights into your agent’s performance, ensuring it delivers value reliably.
Ready to put these practices into action?
Dive deeper and start building your evaluation sets today by visiting the official UiPath Agent Evaluations documentation.