In the world of AI-powered automation, building a capable agent is only half the battle. Ensuring it performs reliably and accurately in real-world scenarios is the true test. This is where a robust evaluation strategy comes in. Without it, you're essentially deploying your automations blind, hoping they work as intended.
UiPath's Agentic Automation platform provides powerful tools to systematically measure and improve your agent's performance. The key to leveraging these tools effectively is organization. In this blog post, we'll break down the best practices for structuring your evaluations, from grouping them into logical sets to choosing the right scoring engines, or "evaluators."
The Core Philosophy of AI Agent Evaluation
At its heart, evaluating an AI agent is no different from quality assurance in software development or a performance review for an employee. The goal is to systematically answer one critical question: "Is this agent reliably performing its intended task to the required standard?"
This moves you from anecdotal testing ("Let me try a few queries") to empirical, measurable validation ("Based on 200 test cases, the agent achieves 95% accuracy on core tasks").
The Universal "Why": Why Evaluate AI Agents?
Think of an AI Agent as a new employee. You wouldn't deploy them to handle critical business tasks without training and checking their work. Evaluation is that continuous training and quality check process.
- Reliability: Ensures the agent performs consistently, not just correctly on one lucky try.
- Accuracy: Measures whether the agent's outputs are factually correct and meet the task requirements.
- Robustness: Tests how the agent handles edge cases, errors, and unexpected inputs without breaking.
- Improvement: Provides a feedback loop to iteratively improve the agent's prompts, tools, and reasoning (e.g., its use of RAG).
- Trust: Builds the confidence to deploy the agent into production processes.
What are Evaluators?
Evaluators are the measurement mechanisms used to check if an agent is doing its job well. They can be:
- Rule-based evaluators: Compare agent output against expected results (ground truth).
- Metric-based evaluators: Use quantitative scores (e.g., accuracy, precision, latency).
- Human evaluators: End users or SMEs rate usefulness, correctness, and clarity.
- LLM-as-a-judge evaluators: Another AI model scores the agent's output quality (common in LLM/agent frameworks such as LangChain, LlamaIndex, and DSPy).
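To make the last idea concrete, here is a minimal, framework-agnostic sketch of an LLM-as-a-judge evaluator in Python. The `call_llm` parameter is a stand-in for whatever chat-completion client you use; it is not a UiPath or framework API.

```python
# A minimal LLM-as-a-judge sketch. `call_llm` is a placeholder for any
# function that sends a prompt to a model and returns its text reply.
JUDGE_PROMPT = """You are an impartial evaluator.
Score the candidate answer against the expected answer for semantic
similarity on a scale of 0 to 100. Reply with only the number.

Expected answer: {expected}
Candidate answer: {candidate}"""

def llm_as_judge(candidate: str, expected: str, call_llm) -> float:
    """Ask a second model to grade the agent's output."""
    reply = call_llm(JUDGE_PROMPT.format(expected=expected, candidate=candidate))
    return float(reply.strip())  # assumes the judge followed the reply format

# Usage (with any client you already have):
# score = llm_as_judge(agent_output, ground_truth, call_llm=my_client.complete)
```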
What is Evaluation?
Evaluation is the systematic process of testing agent behavior across dimensions like correctness, robustness, usability, and business value.
It helps answer questions like:
- Does the agent solve the intended problem?
- Is it reliable under different conditions?
- Does it align with business and compliance needs?
Techniques for Evaluation (Across Frameworks)
| Technique | Description | Examples / Usage | Primary Goal |
|---|---|---|---|
| Ground Truth / Rule-Based Testing | Outputs compared against predefined correct answers (classic software-style testing). | Validate extracted invoice amount = expected DB value. | Ensure correctness against known outcomes. |
| Metric-Based Evaluation | Uses quantitative KPIs to measure accuracy, efficiency, and performance. | Accuracy/precision/recall for extraction tasks; latency/throughput for performance; cost per execution. | Measure performance and efficiency. |
| Simulation & Scenario Testing | Agents tested in synthetic but realistic environments, covering edge cases and noise. | Multi-agent setup: simulate multiple customer requests at once. | Test robustness and adaptability. |
| Human-in-the-Loop (HITL) Evaluation | SMEs or users validate correctness, usefulness, or context. | Customer support bots: humans rate the empathy and clarity of responses. | Validate quality and contextual relevance. |
| Adversarial Testing | Stress-test agents with unexpected, malformed, or malicious inputs. | LLMs: jailbreak prompts. RPA: incomplete/malformed data. | Assess resilience and security. |
| LLM-as-a-Judge / Model-based Evaluation | Another AI model evaluates outputs instead of humans/rules. | Ask the evaluator model: "Rate correctness (1-10)" or "Does this follow instructions?" | Automate qualitative evaluation at scale. |
| User Experience Testing | Collects qualitative feedback on usability, clarity, and satisfaction. | NPS surveys, feedback ratings, interaction analytics. | Improve usability and user satisfaction. |
| Continuous Evaluation (Monitoring & Logging) | Ongoing monitoring of live agent performance, drift detection, and retraining triggers. | Real-time dashboards, error logging, SLA tracking. | Ensure long-term reliability and improvement. |
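Several of these techniques can be combined in one small test harness. The sketch below, written against a generic callable `agent`, mixes a rule-based pass/fail check with two simple metrics (accuracy and latency); the structure is illustrative and not tied to any specific framework.

```python
import time

def run_eval(agent, test_cases):
    """Score an agent over (input, expected_output) pairs.

    Combines a rule-based pass/fail check with simple quantitative
    metrics. `agent` is any callable that maps a prompt to an output.
    """
    passed, latencies = 0, []
    for prompt, expected in test_cases:
        start = time.perf_counter()
        output = agent(prompt)
        latencies.append(time.perf_counter() - start)
        if output == expected:  # exact-match rule; swap in fuzzier checks as needed
            passed += 1
    return {
        "accuracy": passed / len(test_cases),
        "avg_latency_s": sum(latencies) / len(latencies),
    }

# Example: run_eval(my_agent, [("2+2?", "4"), ("Capital of France?", "Paris")])
```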
Why Group Evaluations into Sets?
Trying to test every possible scenario in one disorganized list is inefficient and hard to interpret. Grouping your evaluations into purposeful sets allows you to:
- Focus your testing on specific areas of your agent's behavior.
- Interpret results more easily by understanding the context of any failures.
- Manage your test suites efficiently as your agents evolve.
UiPath recommends creating the following primary types of evaluation sets to cover all the bases:
| Evaluation Set | Purpose | % of Total Evaluations (Guideline) | Example Content |
|---|---|---|---|
| Full Output / Core Scenarios | Test the agent's ability to provide complete, accurate, and helpful responses for common requests. | ~50% | "How do I reset my password?", "Create a Teams channel for project Apollo." |
| Edge Cases & Error Handling | Test how the agent handles ambiguity, incomplete information, complex requests, and scenarios requiring escalation. | ~25% | "It doesn't work.", "I need access to everything.", a request beyond the agent's permissions. |
| Misspelling & Typographical Errors | Test the robustness of the model and its ability to understand user intent despite errors. | ~15% | "pasword reset", "how 2 sharepoint file?", "Excel is sheeting slowly." |
| Complex Workflow & Tool Usage | Test multi-step processes, tool-calling accuracy, parameter passing, and decision branches. | ~10% (critical for complex agents) | A request that requires checking a database and sending an email and updating a ticket. |
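One convenient way to keep this mix visible is to encode it directly in your test data. The layout below is a hypothetical sketch (the field names are illustrative, not a UiPath schema) that carries the guideline percentages as explicit targets:

```python
# Hypothetical layout for an evaluation set; field names are illustrative,
# not a UiPath schema. The target shares mirror the guideline table above.
evaluation_set = {
    "name": "IT-Support-Agent-Evals",
    "categories": {
        "core_scenarios": {"target_share": 0.50, "cases": []},
        "edge_cases":     {"target_share": 0.25, "cases": []},
        "typos":          {"target_share": 0.15, "cases": []},
        "tool_workflows": {"target_share": 0.10, "cases": []},
    },
}

# Adding a core-scenario case:
evaluation_set["categories"]["core_scenarios"]["cases"].append({
    "input": "How do I reset my password?",
    "expected_output": "To reset your password, open the self-service portal...",
})
```

A structure like this also makes it easy to verify your actual case counts against the targets; a sketch of that check appears after the complexity table near the end of this post.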
1. The Full Output Evaluation Set
This is your foundation: the suite of tests that validate normal, expected behavior under typical conditions.
- Purpose: To verify core functionality and logic.
- What it covers:
  - Basic Functionality: Does the agent produce the correct output for valid inputs?
  - Core Logic: Are calculations, data comparisons, and field validations working correctly?
- Example Tests:
  - Does the total on an extracted invoice match the sum of its line items?
  - Is the format of dates, currencies, and numbers correctly validated?
- Benefit: This set gives you confidence that the primary functions of your agent are working as expected. It's your essential first pass.
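The example tests above translate almost directly into code. Here is a minimal sketch, assuming a simple invoice dictionary (the field names are hypothetical, not a UiPath document schema):

```python
import re

def validate_invoice(invoice: dict) -> list[str]:
    """Core-logic checks: line-item totals and basic format validation.

    The invoice layout (total, date, line_items) is a hypothetical
    example for illustration, not a UiPath document schema.
    """
    errors = []
    line_sum = sum(item["amount"] for item in invoice["line_items"])
    if abs(line_sum - invoice["total"]) > 0.01:  # tolerate float rounding
        errors.append(f"Total {invoice['total']} != line-item sum {line_sum}")
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", invoice["date"]):  # expect ISO dates
        errors.append(f"Unexpected date format: {invoice['date']}")
    return errors

# validate_invoice({"total": 120.0, "date": "2024-05-01",
#                   "line_items": [{"amount": 100.0}, {"amount": 20.0}]})  # -> []
```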
2. The Edge Case Evaluation Set
This set is designed to probe the boundaries and robustness of your agent, testing how it handles rare, unexpected, or extreme conditions.
- Purpose: To uncover hidden bugs that don't appear in normal operation.
- What it covers:
  - Input Boundaries: Testing with maximum/minimum values (e.g., extremely high quantities or totals).
  - Abnormal Inputs: How does it handle empty fields, extremely long text, or unusual data formats?
  - Unusual Conditions: Testing with missing or incomplete data.
- Example Tests:
  - What happens if an invoice total exceeds the system's maximum allowed value?
  - How does the agent react if a required field like VendorName is missing?
- Benefit: This set is crucial for ensuring stability and preventing crashes or errors in non-ideal, real-world scenarios.
3. The Misspelling and Typographical Error Set
Users and upstream systems make mistakes. This set tests your agentās ability to handle imperfect input gracefully.
- Purpose: To ensure the agent is user-friendly and robust enough to handle common input errors.
- What it covers:
  - Misspelled Fields (e.g., "VenderName" instead of "VendorName").
  - Partial Matches & Case Sensitivity (e.g., "ABC Corp." vs. "ABC Corporation").
  - Unexpected Characters like leading/trailing spaces or special symbols.
- Example Tests:
  - If a user enters "Acme Co." but the system expects "Acme Company," does it flag an error or use fuzzy matching to understand the intent?
  - How does it handle accidental spaces in a PONumber field?
- Benefit: This testing ensures your automation is resilient and can process data successfully even when input isn't perfect, which is vital for real-world deployment.
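As a rough illustration of the fuzzy-matching idea mentioned above, Python's standard-library difflib can score string similarity. A production system might use a dedicated library or the platform's own matching, but the principle is the same; the 0.85 threshold here is a tuning choice, not a standard.

```python
from difflib import SequenceMatcher

def fuzzy_match(candidate: str, expected: str, threshold: float = 0.85) -> bool:
    """Tolerant comparison for typo-prone fields like vendor names.

    Normalizes case and surrounding whitespace, then uses difflib's
    similarity ratio. The threshold is a tuning choice, not a standard.
    """
    a, b = candidate.strip().lower(), expected.strip().lower()
    return SequenceMatcher(None, a, b).ratio() >= threshold

print(fuzzy_match("VenderName", "VendorName"))    # True: one-letter typo
print(fuzzy_match(" Acme Co. ", "Acme Company"))  # False at this threshold:
                                                  # an abbreviation needs an alias list
```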
Mastering UiPath Agent Evaluations
The Engine of Evaluation: Understanding Evaluators
Evaluation sets define what to test, but Evaluators define how to score the results. They are the scoring engines that determine whether an agent's output meets your quality bar. Without them, an evaluation set is just a collection of expected outputs with no way to score the agent against them.
UiPath provides several types of evaluators to match your needs:
| Evaluation Type | Description | Best For |
|---|---|---|
| LLM-as-a-judge: Semantic Similarity | Uses a Large Language Model (LLM) to compare the generated output against an expected output. It evaluates whether the meaning and intent are similar, even if the wording is different. | Testing the quality and correctness of open-ended, conversational responses where exact wording may vary. |
| Create your own LLM-based evaluator | Provides a flexible framework to define custom evaluation logic using an LLM. You can craft specific prompts to ask the LLM to judge outputs based on your own unique criteria (e.g., "Check if the output is polite and professional"). | Highly customized evaluation needs that go beyond simple similarity, such as checking tone, style, or specific domain knowledge. |
| Trajectory | Evaluates the entire path, or sequence of steps, the agent took to reach its final answer. This includes the tools it used, the questions it asked, and the intermediate results. | Testing complex agents that use multiple tools or require multi-step reasoning. It ensures the agent's process is logical and efficient, not just the final output. |
| Exact Match | Checks if the agent's output precisely, character for character, matches the expected output. Any variation in wording, punctuation, or formatting will cause a failure. | Validating structured outputs like codes, specific commands, URLs, or names where absolute precision is critical. |
| JSON Similarity | Checks if two JSON structures (e.g., the agent's output and the expected output) are semantically similar. It can ignore inconsequential differences like whitespace or the order of keys. | Testing agents that return structured data via tools, ensuring they extract or generate the correct information and format it properly. |
| Faithfulness (Groundedness) | Scores whether the claims in the agent's final output are entirely supported by and grounded in the context provided to it (e.g., from knowledge retrieval or tool outputs). It detects "hallucination." | Ensuring the agent's responses are accurate and based solely on the information it was given, which is crucial for RAG (Retrieval-Augmented Generation) applications. |
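To illustrate the difference between Exact Match and JSON Similarity, here is a minimal sketch of the core idea behind the latter. This is an illustration of the concept only, not UiPath's implementation:

```python
import json

def json_equivalent(a: str, b: str) -> bool:
    """Compare two JSON strings while ignoring whitespace and key order.

    Parsing both sides means only the actual data has to match; an
    Exact Match evaluator would fail on any formatting difference.
    Note that lists still compare in order.
    """
    try:
        return json.loads(a) == json.loads(b)
    except json.JSONDecodeError:
        return False

print(json_equivalent('{"team": "Apollo", "members": 3}',
                      '{ "members": 3, "team": "Apollo" }'))  # True
print('{"team": "Apollo", "members": 3}' ==
      '{ "members": 3, "team": "Apollo" }')                   # False (exact match)
```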
The Lifecycle of an Evaluator
- Create: Build your evaluator in the Evaluators panel. Choose its type and give it a clear, semantic name (e.g., "US-Invoice-Totals-Range").
- Attach: Assign one or more evaluators to an evaluation set. You can mix and match types (e.g., use an Exact Match for a status code and an LLM-as-a-judge for a summary field).
- Version: Any change to an evaluator creates a new version, which maintains a historical audit trail. For CI/CD pipelines, pin evaluator versions just as you would pin package dependencies.
- Retire: If business rules change, clone the evaluator and edit the clone. Never edit an existing evaluator in place if you need to maintain auditability for past runs.
When to Create Your Evaluations
The best time to build your evaluation sets is once your agent's arguments are stable and complete, meaning your use case, prompts, tools, and Context Grounding indexes are finalized. This minimizes rework. If you modify your agent's design later, you will need to adjust your evaluations accordingly.
A major advantage of this system is reusability. You can easily export and import evaluation sets between agents in the same organization, or even across different organizations, saving you from rebuilding them from scratch.

Example of Creating Evaluations
Agent: Internal IT Support Triage and Resolution Agent
This AI agent is designed to automate and enhance the Level 1 IT support function within an organization. Its core use case is to instantly handle incoming employee queries via a chat interface (e.g., Microsoft Teams, a web portal, or service desk email), reducing resolution time and freeing human agents for more complex tasks. The agent intelligently parses the user's request, cross-references it against a curated internal knowledge base of IT guides and FAQs, and determines the optimal response path.
For common, resolvable issues, such as configuring Outlook settings, troubleshooting Excel errors, or guiding users through SharePoint sharing permissions, the agent provides immediate, clear, and structured step-by-step instructions directly within the chat. For issues that require administrative access, specialized skills, or physical intervention, such as creating new Teams channels, restoring mailbox access, or handling software crashes, the agent automatically generates a pre-formatted support ticket, accurately categorizes it, and assigns it to the correct resolver group (e.g., O365_Teams_ChannelCreation, O365_SharedMailbox_Access). This ensures seamless escalation, provides users with clear expectations, and integrates directly with the company's existing IT Service Management (ITSM) workflow.
1. The Full Output Evaluation Set
This set is designed to test the agent's ability to provide complete, accurate, and helpful step-by-step solutions for common but non-trivial technical issues.
Purpose: To evaluate the agent's core knowledge, clarity of instruction, and ability to structure a helpful response.
Examples:
Name: Laura
Message: I accidentally deleted an important email in Outlook. Is there a way to recover it?
Expected Output:
Hello Laura, thank you for reaching out. Yes, deleted emails in Outlook are often recoverable. Here's how you can try to get it back:
- Check the "Deleted Items" Folder: First, look in your Deleted Items folder. If it's there, you can simply right-click the email and select "Move" > "Inbox" or another folder.
- Recover from "Recoverable Items": If you've emptied the Deleted Items folder, the email may still be in a hidden folder for a limited time.
  - In your Outlook folder pane, click on the "Deleted Items" folder.
  - On the "Home" tab, click "Recover deleted items" (or in Outlook on the web, click "Recover items deleted from this folder").
  - A dialog box will open. Select the email(s) you want to recover and click "Restore Selected Items". The emails will be moved back to your Deleted Items folder.
If these steps don't work, and the email was critically important, you may need to contact your IT admin, as they might be able to restore it from a server-level backup. Let me know if you need further assistance!
Name: David
Message: How do I stop Excel from changing my numbers into dates? It's very frustrating.
Expected Output:
Hi David, that's a common and understandably frustrating issue. Excel tries to be "helpful" by auto-formatting data. Here are the best ways to prevent this:
- Pre-format the Cells (Recommended):
  - Select the column or cells where you will enter your numbers.
  - Right-click and choose "Format Cells".
  - Select "Text" from the Category list and click "OK". Now, anything you type in these cells will be treated as text and not converted to a date.
- Use an Apostrophe:
  - Before typing a number that could be misinterpreted (like 1-2), type an apostrophe ('). For example, type '1-2. The apostrophe will not be visible in the cell, but it forces Excel to treat the entry as text.
- Import Data Using the Text Import Wizard:
  - If you're importing data from a .CSV or text file, use Data > From Text/CSV.
  - In the preview window, select the problematic column and change its data type to "Text" before loading it.
This should give you full control over how your data is displayed. Hope this helps!
Name: Chloe
Message: My OneDrive sync keeps saying "Processing changes" for hours and won't finish. What can I do?
Expected Output:
Hello Chloe, a stuck OneDrive sync is usually resolved by resetting the client. Don't worry, this won't delete your files. Please follow these steps:
- Close OneDrive Completely: Right-click the OneDrive cloud icon in your taskbar notification area and click "Close OneDrive".
- Reset OneDrive: Press Win + R, type the following command, and press Enter: %localappdata%\Microsoft\OneDrive\onedrive.exe /reset (Note: a command window may flash briefly; this is normal.)
- Restart OneDrive: Press Win + R again, type this command, and press Enter: %localappdata%\Microsoft\OneDrive\onedrive.exe
- Wait for Sync to Restart: It may take a few minutes for the icon to reappear and for syncing to begin again. This process clears the local sync cache and often resolves persistent "Processing changes" issues.
If the problem continues, the issue might lie with a specific file. Let me know, and we can try more advanced troubleshooting, like pausing sync and syncing folders one by one.
2. The Edge Case Evaluation Set
This set tests the agent's ability to handle ambiguity, lack of information, complex multi-part requests, and scenarios where a direct solution isn't possible, requiring escalation or clarifying questions.
Purpose: To evaluate the agent's logical reasoning, communication skills, and understanding of its own limitations.
Examples:
Name: Mark
Message: It doesn't work.
Expected Output:
Hi Mark, I'm sorry to hear you're having trouble. To help you effectively, I'll need a bit more information.
Could you please describe:
- What application or service "doesn't work" (e.g., Outlook, Teams, Excel)?
- What were you trying to do when you encountered the issue?
- Are you seeing a specific error message? If so, what does it say?
The more details you can provide, the better I can assist you.
Name: Sophia
Message: I need full access to the Finance SharePoint site, the ability to edit the budget Excel file that's there, and also my printer isn't working.
Expected Output:
Hello Sophia, thank you for your message. I understand you have three separate requests:
- Access to Finance SharePoint Site: Access permissions for SharePoint sites are managed by the site owners or your IT administration team. I will create a support ticket for the SharePoint_Access team to process your access request.
- Edit Permissions for Budget File: Editing permissions for a specific file are also managed by the file owner or site admin. I have added this detail to the ticket.
- Printer Issue: Printer troubleshooting is handled by a different specialist team. I will create a second, separate ticket for the Hardware_Printing team to assist you with that.
You will be contacted shortly by both teams. Thank you for your patience.
Name: Ben
Message: My boss needs a report of all emails sent from our department last quarter for an audit. How do I get that?
Expected Output:
Hello Ben, that sounds like an important request for compliance. Generating that type of comprehensive email audit report typically requires administrative privileges and access to the Microsoft Purview compliance portal or Exchange Admin Center, which is not available to standard users.
This request will need to be handled by your organization's Compliance or IT Administration team. I have created a high-priority support ticket and assigned it to the Compliance_Audit_Reporting team. They have the tools and permissions necessary to generate and provide that report for you. They will contact you shortly to confirm the exact details of the audit requirements.
3. The Misspelling and Typographical Error Set
This set tests the agent's robustness and ability to use intent recognition to understand user queries despite spelling mistakes, grammatical errors, and auto-correct artifacts.
Purpose: To evaluate the agent's Natural Language Processing (NLP) capabilities and fault tolerance.
Examples:
Name: Rachel
Message: How can I sheare a foleder in Teams? I need to colaborate on documments.
Expected Output:
Hello Rachel, absolutely! To share a folder in Microsoft Teams for collaboration, the best way is to add the folder to a Teams channel. Here's how:
- Go to the specific channel within your team.
- Click on the "Files" tab at the top.
- Click "Upload" and select the folder from your computer. This will upload the entire folder and its contents to the channel's associated SharePoint site.
- Once uploaded, everyone in the team will have access to the folder and can collaborate on the documents inside it in real-time.
This is much more effective than sharing documents separately via email. Let me know if you need help with any of the specific steps!
Name: Thomas
Message: outlook keeps askng for my pasword every 5 minits its so anoying
Expected Output:
Hi Thomas, I understand that's very annoying. The constant password prompts are usually caused by a corrupted Outlook profile or incorrect credential storage. Here are the most effective solutions:
- Check Credential Manager:
  - Press Win + R, type control.exe /name Microsoft.CredentialManager, and press Enter.
  - Go to "Windows Credentials".
  - Look for any Outlook-related credentials (e.g., MicrosoftOffice[email]) and remove them. Restart Outlook and it will prompt you to enter your password fresh, which often fixes the loop.
- Go to Control Panel > Mail (Microsoft Outlook) > Show Profiles.
- Click āAddā¦ā to create a new profile, set it up with your email account, and set it as the default.
- You can then remove the old profile.
Try the first step first; it resolves the issue most of the time. If it doesn't work, let me know and we can escalate to the O365_Outlook_Config team.
Name: Jessica
Message: can u help me instal power bi desktop i cant find the downlod
Expected Output:
Hello Jessica, I can certainly help with that! Here is the direct link to download Power BI Desktop from the official Microsoft website:
Download Power BI Desktop: https://aka.ms/pbidesktopstore
(You can simply click that link or copy and paste it into your browser).
Once the download is complete, run the installer and follow the on-screen instructions. It's a straightforward process. After installation, you can sign in with your work account to get started. Let me know if you encounter any issues during the installation!

Key Coverage Principles to Follow
- Logical Coverage Over Quantity: Don't just add more of the same test. Map out all possible input combinations, decision branches, and boundary conditions. Ensure each unique path is tested.
- Manage Redundancy: For each unique logical case (e.g., "password reset"), 3-5 evaluations with slightly different phrasings are sufficient to ensure consistency without cluttering the dataset.
- Quality is Paramount:Ā A well-designed set of 50 evaluations that tests all critical paths is far more valuable than 200 repetitive or low-quality tests. Focus on meaningful scenarios that reflect real-world use and potential failures.
- Iterate:Ā Evaluations are not a one-time task. As you add new features or intents to your agent, you must expand your evaluation sets to cover them.
| Agent Complexity | Description | Recommended Number of Evaluations | Key Focus Areas |
|---|---|---|---|
| Simple | Handles 1-2 intents, simple logic, no tools or few parameters. | ~30 evaluations | Core use cases, basic edge cases, common typos. |
| Moderate | Handles multiple related intents, uses tools, has conditional logic. | 50-70 evaluations | Broader input coverage, tool usage patterns, more complex edge cases. |
| Complex | Handles diverse, unrelated intents, complex tool usage, multiple decision branches. | 100+ evaluations | Full logical coverage, extensive edge case testing, complex error handling, persona variety. |
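If you keep your evaluation sets in a structure like the hypothetical layout sketched earlier in this post, checking your actual case counts against these guidelines takes only a few lines; the tolerance below is an arbitrary illustration.

```python
def check_distribution(eval_set: dict, tolerance: float = 0.10) -> dict:
    """Compare each category's share of cases against its guideline target.

    Expects the hypothetical `evaluation_set` layout sketched earlier
    (categories with `target_share` and `cases`); adapt to your own format.
    """
    total = sum(len(c["cases"]) for c in eval_set["categories"].values())
    report = {}
    for name, cat in eval_set["categories"].items():
        share = len(cat["cases"]) / total if total else 0.0
        report[name] = {
            "share": round(share, 2),
            "target": cat["target_share"],
            "ok": abs(share - cat["target_share"]) <= tolerance,
        }
    return report

# check_distribution(evaluation_set) -> per-category share vs. target report
```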
Start Building with Confidence
A structured approach to evaluation is not just a best practice; it's a necessity for deploying trustworthy and robust AI agents. By grouping your tests into logical sets and leveraging the power of different evaluators, you can gain deep, actionable insights into your agent's performance, ensuring it delivers value reliably.
Ready to put these practices into action?
Dive deeper and start building your evaluation sets today by visiting the official UiPath Agent Evaluations documentation.