# Agent Evals

> Agent Evals lets you test your agent with structured test sets before publishing a new version, so you can catch issues before they reach your team.

<Frame>
  <img src="https://mintcdn.com/langdock-34/pZ8hEj6Kp-YF87M4/images/agent-evals.jpg?fit=max&auto=format&n=pZ8hEj6Kp-YF87M4&q=85&s=6fa8d66bf3cd3bcdcdc83be74e824cb3" alt="Agent Evals run view showing completed test cases with pass and fail statuses, prompts, and expected answers" style={{borderRadius: '6px'}} width="1920" height="1080" data-path="images/agent-evals.jpg" />
</Frame>

Agent Evals is a testing tool built into the agent editor. It lets you run structured evaluations against your agent to verify that it still behaves correctly after you make changes. Instead of manually testing prompts one by one, you define test cases, select how they should be graded, and run them all at once.

Open your agent and click the **Agent Evals** tab at the top of the agent editor. The tab contains two sections: **Test sets** and **Runs**.

## Test sets

A test set is a collection of test cases with shared configuration. Each test set defines the conversation shape, which checks to apply, and how tools should behave during the evaluation.

### Creating a test set

1. Click **New test set** in the Agent Evals tab.

<Frame>
  <img src="https://mintcdn.com/langdock-34/pZ8hEj6Kp-YF87M4/images/agent_evals_1.png?fit=max&auto=format&n=pZ8hEj6Kp-YF87M4&q=85&s=082452e24c5bb9e3253415510e63ae97" alt="Agent Evals welcome screen with Create your first test set button and Read the docs link" style={{borderRadius: '6px'}} width="1778" height="756" data-path="images/agent_evals_1.png" />
</Frame>

2. Enter a name for the test set and select a conversation shape. Currently, only **Single turn** is available.

<Frame>
  <img src="https://mintcdn.com/langdock-34/pZ8hEj6Kp-YF87M4/images/agent_evals2.png?fit=max&auto=format&n=pZ8hEj6Kp-YF87M4&q=85&s=17fde3451ba0c7c97fb5f946f3ff820a" alt="Create a test set dialog showing name field and conversation shape selector with Single turn selected" style={{borderRadius: '6px'}} width="1776" height="752" data-path="images/agent_evals2.png" />
</Frame>

3. Select one or more checks to grade your test cases: **AI judge**, **Tool check**, or **Keyword check**.

<Frame>
  <img src="https://mintcdn.com/langdock-34/pZ8hEj6Kp-YF87M4/images/agent_evals3.png?fit=max&auto=format&n=pZ8hEj6Kp-YF87M4&q=85&s=b55ee0b329654bebb3683384bcd8395b" alt="Check selection showing AI judge with model picker, Tool check, and Keyword check options" style={{borderRadius: '6px'}} width="1774" height="702" data-path="images/agent_evals3.png" />
</Frame>

4. If you selected AI judge, choose which model to use as the judge.

<Frame>
  <img src="https://mintcdn.com/langdock-34/pZ8hEj6Kp-YF87M4/images/agent_evals4.png?fit=max&auto=format&n=pZ8hEj6Kp-YF87M4&q=85&s=2d85896165c7bd6603a0af97449dfe64" alt="Model selector dropdown showing available models for the AI judge" style={{borderRadius: '6px'}} width="1848" height="626" data-path="images/agent_evals4.png" />
</Frame>

5. Click **Create test set**.

<Tip>
  Expand the **Advanced** section in the creation dialog to configure the tool execution mode. By default, evals run in **dry run** mode, where no tools actually execute. See [Tool execution modes](#tool-execution-modes) for details.
</Tip>

## Check types

Checks define how each test case is graded. You select them when creating a test set, and they apply to every case in that set.

### AI judge

AI judge compares your agent's actual response to an expected answer you define for each test case. It uses a language model to evaluate whether the response matches the intent and content of the expected answer, even if the wording differs.

You choose which model serves as the judge when creating the test set.

### Tool check

Tool check verifies whether your agent called the expected tools for a given prompt. You define the tools you expect for each test case, and the check confirms whether the agent called them during the evaluation.

This is useful for agents with integration actions where calling the correct tool matters as much as the response itself.
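The comparison behind a tool check can be sketched in Python. This is only an illustration, not Langdock's implementation: the function name `tool_check`, the tool names, and the rule that extra tool calls are ignored are all assumptions.

```python
def tool_check(expected_tools: set[str], called_tools: list[str]) -> dict:
    """Illustrative grader (hypothetical): pass if every expected tool was called.

    Assumption: calls to tools that were not expected do not cause a failure.
    """
    called = set(called_tools)
    missing = sorted(expected_tools - called)
    return {"passed": not missing, "missing": missing}


# Hypothetical tool names, for illustration only.
ok = tool_check({"create_ticket"}, ["search_tickets", "create_ticket"])
not_ok = tool_check({"create_ticket", "send_email"}, ["create_ticket"])
```

In this sketch, `ok` passes because the expected tool appears among the recorded calls, while `not_ok` fails and reports `send_email` as missing.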

### Keyword check

Keyword check verifies whether the agent's response contains required words or avoids forbidden words. Unlike AI judge, it does not involve a model and produces a deterministic pass or fail result.

Each test case has two fields for this check: **Must mention** and **Must not mention**. Use it for compliance requirements, brand guidelines, or any case where certain terms must appear or must be avoided in the response.
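Because the check is deterministic, its logic is easy to picture in code. The following is a minimal sketch, not the product's actual implementation: the function name `keyword_check` and the case-insensitive substring matching are assumptions, and Langdock may match terms differently.

```python
def keyword_check(
    response: str,
    must_mention: list[str],
    must_not_mention: list[str],
) -> dict:
    """Illustrative keyword grader (hypothetical; matching rules are assumed)."""
    text = response.lower()
    # Required terms that do not appear in the response.
    missing = [term for term in must_mention if term.lower() not in text]
    # Forbidden terms that do appear in the response.
    forbidden = [term for term in must_not_mention if term.lower() in text]
    return {
        "passed": not missing and not forbidden,
        "missing": missing,
        "forbidden": forbidden,
    }


result = keyword_check(
    "Refunds are processed within 14 days.",
    must_mention=["refund", "14 days"],
    must_not_mention=["guarantee"],
)
```

The same input always yields the same result, which is the property that distinguishes this check from AI judge.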

## Test cases

After creating a test set, you add the individual test cases that will be evaluated.

### Adding cases manually

Click **Add case** on the test set page to create a single test case. Each case includes a prompt and the expected outputs that your selected checks will grade against.

<Frame>
  <img src="https://mintcdn.com/langdock-34/pZ8hEj6Kp-YF87M4/images/agent_evals5.png?fit=max&auto=format&n=pZ8hEj6Kp-YF87M4&q=85&s=37ad7d7d2c618a8f9d1e27517d7d174d" alt="Empty test set page with Add case and Import CSV buttons" style={{borderRadius: '6px'}} width="1848" height="722" data-path="images/agent_evals5.png" />
</Frame>

### Importing from CSV

For larger test sets, click **Import CSV** to load multiple cases at once. The CSV can include prompts, expected answers, expected tools, and other fields that map to your selected checks.

<Info>
  Importing from CSV is the fastest way to build comprehensive test sets, especially if you already maintain a spreadsheet of prompts you use for manual testing.
</Info>
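If your prompts live in code or another system rather than a spreadsheet, a short script can produce the CSV. The sketch below uses Python's standard `csv` module. The column names are hypothetical placeholders, not the headers Langdock expects, so check the import dialog or an exported test set for the real ones before importing.

```python
import csv
import io

# Hypothetical column names; the real import format may use different headers.
FIELDS = ["prompt", "expected_answer", "expected_tools", "must_mention", "must_not_mention"]

cases = [
    {
        "prompt": "What is the refund window?",
        "expected_answer": "Refunds are possible within 14 days.",
        "expected_tools": "",
        "must_mention": "14 days",
        "must_not_mention": "guarantee",
    },
    {
        "prompt": "Create a ticket for a login bug",
        "expected_answer": "A ticket was created for the login bug.",
        "expected_tools": "create_ticket",
        "must_mention": "",
        "must_not_mention": "",
    },
]


def build_csv(rows: list[dict]) -> str:
    """Serialize test cases to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = build_csv(cases)
```

Writing `csv_text` to a `.csv` file gives you something to upload. Using `csv.DictWriter` rather than string concatenation keeps prompts that contain commas or quotes correctly escaped.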

## Running evals

Click **Run** in the top right of the test set page to start the evaluation. Results appear live as each case completes, showing the status, prompt, output, grader results, and duration for each case.

<Frame>
  <img src="https://mintcdn.com/langdock-34/pZ8hEj6Kp-YF87M4/images/agent_evals6.png?fit=max&auto=format&n=pZ8hEj6Kp-YF87M4&q=85&s=98e2a677bd6f9edf9dfc97723da29f31" alt="Test set with cases loaded showing prompts, expected answers, and expected tools columns with Export CSV, Edit, and Run buttons" style={{borderRadius: '6px'}} width="1846" height="460" data-path="images/agent_evals6.png" />
</Frame>

A single run can evaluate up to 100 cases at once.

<Info>
  Every eval case consumes usage in the same way as a regular agent conversation. Only one run can be active at a time.
</Info>
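If you maintain more than 100 prompts, one possible approach is to split them into groups of at most 100 before building your test sets. The helper below is a hypothetical sketch of that split, and it assumes you would create one test set per group.

```python
MAX_CASES_PER_RUN = 100  # limit stated in this section


def split_into_runs(cases: list, limit: int = MAX_CASES_PER_RUN) -> list[list]:
    """Split a list of cases into consecutive groups of at most `limit` items."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    return [cases[i:i + limit] for i in range(0, len(cases), limit)]


batches = split_into_runs([f"case-{n}" for n in range(250)])
sizes = [len(batch) for batch in batches]
```

Because only one run can be active at a time, groups created this way would have to be run one after another.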

### Reviewing results

<Frame>
  <img src="https://mintcdn.com/langdock-34/pZ8hEj6Kp-YF87M4/images/agent_evals7.png?fit=max&auto=format&n=pZ8hEj6Kp-YF87M4&q=85&s=ad04061ba52a339f3ee8411f225bf99c" alt="Completed run showing pass and fail statuses, prompts, expected answers, outputs, grader results, and duration for each case" style={{borderRadius: '6px'}} width="1850" height="386" data-path="images/agent_evals7.png" />
</Frame>

Once a case finishes, click it to open the detail view. You can review:

* The full conversation started by the prompt
* The agent's response
* Token usage and duration
* The status of each grader (passed or failed)

To analyze results outside Langdock, click **Download CSV** to export the full run as a spreadsheet.
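As an example of offline analysis, the sketch below computes a pass rate and lists failing prompts with Python's standard `csv` module. The column names (`prompt`, `status`, `duration_ms`) and the sample rows are hypothetical, so adjust them to the headers in your actual export.

```python
import csv
import io

# Hypothetical sample data; a real export may use different columns and values.
sample_export = """prompt,status,duration_ms
What is the refund window?,passed,1840
Create a ticket for a login bug,failed,2310
Summarize the onboarding guide,passed,1525
"""


def summarize(csv_text: str) -> dict:
    """Compute the pass rate and collect prompts whose status is not 'passed'."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    passed = sum(1 for row in rows if row["status"] == "passed")
    return {
        "total": len(rows),
        "passed": passed,
        "pass_rate": passed / len(rows) if rows else 0.0,
        "failed_prompts": [row["prompt"] for row in rows if row["status"] != "passed"],
    }


summary = summarize(sample_export)
```

To use it with a downloaded file, read the file's contents and pass the text to `summarize`.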

## Tool execution modes

The tool execution mode determines whether your agent's tools actually run during an evaluation. You configure this in the **Advanced** section when creating a test set.

| Mode                  | Behavior                                                                                                                     | When to use                                                                                             |
| --------------------- | ---------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------- |
| **Dry run** (default) | Records action and MCP calls without executing them. No real actions are taken.                                              | Most evaluations. Safe for testing without side effects.                                                |
| **Live mode**         | Executes action and MCP calls that do not require approval. Actions that require approval stop the run instead of executing. | When you need to verify the full execution flow, including tool behavior and external system responses. |

<Warning>
  Live mode can change external systems, such as sending emails, creating tickets, or updating records. Use it only when you intentionally want an end-to-end integration test.
</Warning>
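The difference between the two modes can be modeled with a small simulation. This sketch only mirrors the behavior described in the table. The `ToolCall` class, the `simulate` function, the mode strings, and the tool names are hypothetical and do not correspond to any Langdock API.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    name: str
    requires_approval: bool = False


def simulate(mode: str, calls: list[ToolCall]) -> dict:
    """Model how a run treats tool calls in 'dry_run' or 'live' mode (illustrative)."""
    if mode not in {"dry_run", "live"}:
        raise ValueError(f"unknown mode: {mode}")
    recorded: list[str] = []
    executed: list[str] = []
    for call in calls:
        recorded.append(call.name)
        if mode == "dry_run":
            continue  # recorded only, never executed
        if call.requires_approval:
            # Live mode: an approval-gated call stops the run instead of executing.
            return {"recorded": recorded, "executed": executed, "stopped_at": call.name}
        executed.append(call.name)
    return {"recorded": recorded, "executed": executed, "stopped_at": None}


calls = [
    ToolCall("search_tickets"),
    ToolCall("send_email", requires_approval=True),
    ToolCall("update_record"),
]
dry = simulate("dry_run", calls)
live = simulate("live", calls)
```

In the dry run, all three calls are recorded and none execute. In live mode, `search_tickets` executes and the run stops at `send_email`, so `update_record` is never reached.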
