Agent Evals

Agent Evals is a testing tool built into the agent editor. It lets you run structured evaluations against your agent to verify that it still behaves correctly after you make changes. Instead of manually testing prompts one by one, you define test cases, select how they should be graded, and run them all at once.

To get started, open your agent and click the Evals tab at the top of the agent editor. The tab contains two sections: Test sets and Runs.

Test sets

A test set is a collection of test cases with shared configuration. Each test set defines the conversation shape, which checks to apply, and how tools should behave during the evaluation.

Creating a test set

Click New test set in the Evals tab.

[Screenshot: Agent Evals welcome screen with Create your first test set button and Read the docs link]

Enter a name for the test set and select a conversation shape. Currently, only Single turn is available.

[Screenshot: Create a test set dialog showing name field and conversation shape selector with Single turn selected]

Optionally, select one or more checks to grade your test cases: AI judge, Tool check, or Keyword check.

[Screenshot: Check selection showing AI judge with model picker, Tool check, and Keyword check options]

If you selected AI judge, choose which model to use as the judge.

[Screenshot: Model selector dropdown showing available models for the AI judge]

Click Create test set.

To configure the tool execution mode, expand the Advanced section in the creation dialog before you click Create test set. By default, evals run in dry run mode, in which no tools actually execute. See Tool execution modes for details.

Check types

Checks define how each test case is graded. You select them when creating a test set, and they apply to every case in that set.

AI judge

AI judge compares your agent’s actual response to an expected answer you define for each test case. It uses a language model to evaluate whether the response meets the intent and content of the expected answer, even if the wording differs. You choose which model serves as the judge when creating the test set.

Tool check

Tool check verifies whether your agent called the expected tools for a given prompt. Define the tools you expect to be used, and the check confirms whether the agent called them during the evaluation. This is useful for agents with integration actions where calling the correct tool matters as much as the response itself.
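As a rough mental model, a tool check can be thought of as a set comparison between the tools you expect and the tools the agent called. The sketch below is illustrative only: the function name `tool_check`, the tool names, and the decision to ignore call order and tolerate extra calls are assumptions, not documented behavior.

```python
def tool_check(expected_tools: list[str], called_tools: list[str]) -> dict:
    """Hypothetical model of a tool check: pass when every expected tool was called.

    Assumption: call order is ignored and extra tool calls do not cause a failure.
    """
    missing = sorted(set(expected_tools) - set(called_tools))
    return {"passed": not missing, "missing": missing}


result = tool_check(
    expected_tools=["create_ticket", "send_email"],  # hypothetical tool names
    called_tools=["search_docs", "create_ticket"],
)
print(result)  # {'passed': False, 'missing': ['send_email']}
```

Reporting the missing tools alongside the pass flag makes it easier to see why a case failed.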

Keyword check

Keyword check verifies whether the agent’s response contains required words or avoids forbidden words. Unlike AI judge, it does not involve a model and produces a deterministic pass or fail result. Each test case has two fields for this check: Must mention and Must not mention. Use it for compliance requirements, brand guidelines, or any case where certain terms must appear or must be avoided in the response.
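The deterministic nature of this check can be sketched in a few lines. This is a minimal illustration, assuming case-insensitive substring matching; the actual matching rules (case sensitivity, word boundaries) are not specified here, and the function name `keyword_check` is hypothetical.

```python
def keyword_check(
    response: str,
    must_mention: list[str],
    must_not_mention: list[str],
) -> bool:
    """Hypothetical model of a keyword check.

    Assumption: a term matches when it appears anywhere in the response,
    ignoring case. The real matching rules may differ.
    """
    text = response.lower()
    has_all_required = all(term.lower() in text for term in must_mention)
    has_any_forbidden = any(term.lower() in text for term in must_not_mention)
    return has_all_required and not has_any_forbidden


response = "Your refund will arrive within 14 days."
print(keyword_check(response, ["refund", "14 days"], ["guarantee"]))  # True
print(keyword_check(response, ["refund"], ["14 days"]))               # False
```

Because no model is involved, the same response and the same term lists always produce the same result.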

Test cases

After creating a test set, you add the individual test cases that will be evaluated.

Adding cases manually

Click Add case on the test set page to create a single test case. Each case includes a prompt and the expected outputs that your selected checks will grade against.

[Screenshot: Empty test set page with Add case and Import CSV buttons]

Importing from CSV

For larger test sets, click Import CSV to load multiple cases at once. The CSV requires a prompt column. It can also include must_contain, must_not_contain, reference_answer, and expected_tools columns.

Importing from CSV is the fastest way to build comprehensive test sets, especially if you already maintain a spreadsheet of prompts you use for manual testing.
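If your prompts live in code or another system rather than a spreadsheet, a small script can generate the import file. The sketch below uses Python's standard `csv` module and the column names listed above. The helper name `build_cases_csv` and the sample case are hypothetical, and because the format for multiple values in one cell is not specified here, the example puts a single value in each cell.

```python
import csv
import io

# Column names as listed in this guide; only "prompt" is required.
COLUMNS = ["prompt", "must_contain", "must_not_contain", "reference_answer", "expected_tools"]
MAX_CASES = 50  # per-test-set limit stated in this guide


def build_cases_csv(cases: list[dict]) -> str:
    """Serialize test cases to CSV text, leaving omitted optional columns empty."""
    if len(cases) > MAX_CASES:
        raise ValueError(f"{len(cases)} cases exceeds the limit of {MAX_CASES}")
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, restval="")
    writer.writeheader()
    for case in cases:
        if not case.get("prompt", "").strip():
            raise ValueError("every case needs a non-empty prompt")
        writer.writerow(case)  # raises ValueError on unknown column names
    return buffer.getvalue()


csv_text = build_cases_csv([
    {
        "prompt": "How do I reset my password?",       # hypothetical case
        "must_contain": "reset link",
        "reference_answer": "Explain that a reset link is sent by email.",
    },
])
print(csv_text)
```

To save the result for upload, write `csv_text` to a file opened with `newline=""` so that line endings are preserved.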

Running evals

Click Run in the top right of the test set page to start the evaluation. Results appear live as each case completes, showing the status, prompt, output, grader results, and duration for each case.

[Screenshot: Test set with cases loaded showing prompts, expected answers, and expected tools columns with Export CSV, Edit, and Run buttons]

A test set can contain up to 50 cases.

Each test case consumes usage in the same way as a regular agent conversation. A test set can only have one active run at a time, and each workspace can have up to three active runs.
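The two concurrency limits combine as follows. This is only a sketch of the rule as described above; the function name `can_start_run` and the idea of tracking active runs as a list of test set IDs are assumptions made for illustration.

```python
MAX_ACTIVE_RUNS_PER_WORKSPACE = 3  # workspace limit stated in this guide


def can_start_run(test_set_id: str, active_runs: list[str]) -> bool:
    """Return True if a new run for test_set_id would respect both limits.

    active_runs holds the test set ID of every run currently active
    in the workspace (a hypothetical representation).
    """
    if test_set_id in active_runs:
        return False  # a test set can only have one active run at a time
    return len(active_runs) < MAX_ACTIVE_RUNS_PER_WORKSPACE


print(can_start_run("billing", ["onboarding", "support"]))            # True
print(can_start_run("billing", ["billing"]))                          # False
print(can_start_run("billing", ["onboarding", "support", "sales"]))   # False
```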

Reviewing results

Once a case finishes, click on it to open the detail view.

[Screenshot: Completed run showing pass and fail statuses, prompts, expected answers, outputs, grader results, and duration for each case]

You can review:

- The full conversation between the prompt and the agent
- The agent’s response
- Token usage and duration
- The result of each grader, such as pass, no pass, or unsure

To analyze results outside Langdock, click Download CSV to export the full run as a spreadsheet.
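Once you have the exported file, a short script can summarize it. The sketch below counts cases per status with Python's standard library. The column name `status` and the sample rows are assumptions for illustration only, since the export's exact column names are not listed here. Check the header row of your own export and pass the matching name.

```python
import csv
import io
from collections import Counter


def summarize_run(csv_text: str, status_column: str = "status") -> Counter:
    """Count how many cases ended in each status.

    status_column is a hypothetical default; use the header from your export.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    if status_column not in (reader.fieldnames or []):
        raise KeyError(f"column {status_column!r} not found in export")
    return Counter(row[status_column].strip().lower() for row in reader)


# Made-up sample data standing in for a downloaded export.
sample = (
    "prompt,status\n"
    "What is the refund policy?,pass\n"
    "Cancel my order,no pass\n"
    "Who are you?,pass\n"
)
counts = summarize_run(sample)
print(counts["pass"], "of", sum(counts.values()), "cases passed")  # 2 of 3 cases passed
```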

Tool execution modes

The tool execution mode determines whether your agent’s tools actually run during an evaluation. You configure this in the Advanced section when creating a test set.

| Mode | Behavior | When to use |
| --- | --- | --- |
| Dry run (default) | Records action and MCP calls without executing them. No real actions are taken. | Most evaluations. Safe for testing without side effects. |
| Live mode | Executes action and MCP calls that do not require approval. Actions that require approval stop the run instead of executing. | When you need to verify the full execution flow, including tool behavior and external system responses. |

Live mode can change external systems, such as sending emails, creating tickets, or updating records. Use it only when you intentionally want an end-to-end integration test.
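The difference between the two modes can be modeled in a short sketch. This is a simplified illustration of the behavior described above, not the actual implementation: the `ToolCall` and `RunLog` types, the mode strings `"dry_run"` and `"live"`, and the tool names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class ToolCall:
    name: str
    requires_approval: bool
    execute: Callable[[], str]  # the side effect a live run would trigger


@dataclass
class RunLog:
    recorded: list[str] = field(default_factory=list)  # calls the agent attempted
    executed: list[str] = field(default_factory=list)  # calls that really ran
    stopped_at: Optional[str] = None                   # call that halted the run


def run_tool_calls(calls: list[ToolCall], mode: str = "dry_run") -> RunLog:
    """Model dry run (record only) versus live mode (execute unless approval is needed)."""
    if mode not in ("dry_run", "live"):
        raise ValueError(f"unknown mode: {mode}")
    log = RunLog()
    for call in calls:
        log.recorded.append(call.name)
        if mode == "dry_run":
            continue  # recorded, never executed
        if call.requires_approval:
            log.stopped_at = call.name  # approval-gated call stops the run
            break
        call.execute()
        log.executed.append(call.name)
    return log


calls = [
    ToolCall("lookup_order", False, lambda: "found"),
    ToolCall("send_email", True, lambda: "sent"),
    ToolCall("update_record", False, lambda: "updated"),
]
print(run_tool_calls(calls, "dry_run").executed)  # []
live = run_tool_calls(calls, "live")
print(live.executed, live.stopped_at)  # ['lookup_order'] send_email
```

In this model, dry run records every attempted call and executes none of them, while live mode executes calls until it reaches one that requires approval.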
