Evaluations: Measure and Track Agent Quality

Test your AI agents against curated datasets and custom scoring criteria to measure quality, catch regressions, and track improvement over time.

Evaluations, Datasets, and Evaluators are available on the full edition only and are disabled in the open source build. See the introduction for a comparison of editions.

Evaluations give you a systematic way to measure how well your agents perform. Rather than testing flows manually by chatting with them, you define a dataset of test cases, set up evaluators that score each response, and run an evaluation that processes every case automatically. The results report a score for each test case and an aggregate across the full dataset, so you can compare flow versions and catch regressions before they reach users.

Evaluations in Forjinn have three components that work together:

Datasets

Collections of test cases — typically question-and-expected-answer pairs — that define what you want your agent to handle correctly.

Evaluators

Scoring definitions that judge each agent response against a criterion such as accuracy or relevance, or against custom logic you define.

Evaluations

Runs that send every test case in a dataset through a specific flow and apply one or more evaluators to the responses, producing scored results for every test case.

Datasets

A dataset is a set of rows, where each row represents one test case. At minimum, each row has an input (the question or prompt to send to the agent) and an expected output (the answer you consider correct). You can add more columns if your evaluator needs additional context.
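
As a rough illustration of the row structure described above, the sketch below models a dataset in plain Python and parses it from CSV text. The names `DatasetRow`, `rows_from_csv`, and the column headers `input`, `expected_output`, and `category` are assumptions made for this example; they are not Forjinn's actual schema or import format.

```python
import csv
import io
from dataclasses import dataclass, field


@dataclass
class DatasetRow:
    """One test case. Field names are hypothetical, not Forjinn's schema."""
    input: str                                  # question or prompt sent to the agent
    expected_output: str                        # answer you consider correct
    extra: dict = field(default_factory=dict)   # optional additional columns


def rows_from_csv(text: str) -> list[DatasetRow]:
    """Parse CSV text with 'input' and 'expected_output' headers into rows."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        record = dict(record)
        rows.append(
            DatasetRow(
                input=record.pop("input"),
                expected_output=record.pop("expected_output"),
                extra=record,  # whatever columns remain become extra context
            )
        )
    return rows


sample_csv = """input,expected_output,category
What is your refund window?,30 days,refunds
Do you ship internationally?,Yes,shipping
"""

dataset = rows_from_csv(sample_csv)
for row in dataset:
    print(row.input, "->", row.expected_output, row.extra)
```

Keeping extra columns in a separate dictionary mirrors the idea that only the input and expected output are required, while anything else is optional context for an evaluator.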

Create a dataset

In the sidebar, under Evaluations, click Datasets, then click New Dataset.

Give the dataset a name that reflects what it's testing — for example, "Product FAQ accuracy" or "Refund policy edge cases".

Click Add Row to add test cases one at a time. Fill in the Input and Expected Output columns. You can also paste or import rows in bulk.

Click Save. The dataset is now available to use in evaluations.

Start small — even 10 to 20 well-chosen test cases give you meaningful signal. Focus on edge cases, known failure modes, and the most important things your agent needs to get right.

Evaluators

An evaluator defines how each agent response is scored. You configure the scoring criteria — for example, whether the response matches the expected output, whether it's relevant to the input, or whether it follows a policy you describe in a prompt.

Create an evaluator

In the sidebar, under Evaluations, click Evaluators, then click New Evaluator.

Give the evaluator a descriptive name — for example, "Factual accuracy" or "Tone compliance".

Define what a good response looks like. Depending on the evaluator type you choose, you may write a scoring prompt, set a similarity threshold, or choose a metric like exact match, contains answer, or LLM-judged relevance.

Click Save. The evaluator is now available to attach to evaluations.
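
To make the metric names mentioned above more concrete, here is a minimal sketch of what exact match, contains answer, and a similarity threshold could compute. This is an illustration only: the function names, the normalization step, the default threshold of 0.8, and the use of `difflib` for similarity are all assumptions, and Forjinn's built-in evaluators may score responses differently.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def exact_match(response: str, expected: str) -> float:
    """1.0 if the normalized response equals the expected output, else 0.0."""
    return 1.0 if normalize(response) == normalize(expected) else 0.0


def contains_answer(response: str, expected: str) -> float:
    """1.0 if the expected output appears anywhere in the response, else 0.0."""
    return 1.0 if normalize(expected) in normalize(response) else 0.0


def similarity_at_least(response: str, expected: str, threshold: float = 0.8) -> float:
    """1.0 if character-level similarity reaches the threshold, else 0.0."""
    ratio = SequenceMatcher(None, normalize(response), normalize(expected)).ratio()
    return 1.0 if ratio >= threshold else 0.0


response = "Our refund window is 30 days."
expected = "30 days"
print(exact_match(response, expected))      # strict comparison
print(contains_answer(response, expected))  # lenient comparison
```

The example shows why the choice of metric matters: a verbose but correct response fails a strict exact-match check and passes a contains-answer check.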

Run an evaluation

In the sidebar, under Evaluations, click Evaluations, then click New Evaluation.

Choose the chatflow or agent flow you want to test.

Pick the dataset whose test cases you want to run through the flow.

Add one or more evaluators that will score each response. You can combine multiple evaluators in a single evaluation run.

Click Run. Forjinn sends each input row to your flow, collects the response, and scores it with each evaluator. This may take a few minutes depending on the size of your dataset.
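
Conceptually, the run described in the last step is a loop over dataset rows. The sketch below shows that loop under simplifying assumptions: `run_evaluation` and `stub_flow` are hypothetical names, the flow is a local stand-in function rather than a real chatflow, and the evaluators are simple lambdas. It does not call any Forjinn API.

```python
from typing import Callable

# An evaluator takes (response, expected_output) and returns a score.
Evaluator = Callable[[str, str], float]


def run_evaluation(
    flow: Callable[[str], str],
    rows: list[tuple[str, str]],
    evaluators: dict[str, Evaluator],
) -> list[dict]:
    """Send each input to the flow, then score the response with every evaluator."""
    results = []
    for input_text, expected in rows:
        response = flow(input_text)
        scores = {name: fn(response, expected) for name, fn in evaluators.items()}
        results.append(
            {"input": input_text, "expected": expected,
             "response": response, "scores": scores}
        )
    return results


def stub_flow(question: str) -> str:
    """Stand-in for a real flow; returns canned answers for the demo."""
    answers = {"What is your refund window?": "Our refund window is 30 days."}
    return answers.get(question, "I don't know.")


rows = [
    ("What is your refund window?", "30 days"),
    ("Do you ship internationally?", "Yes"),
]
evaluators = {
    "contains_answer": lambda r, e: float(e.lower() in r.lower()),
    "exact_match": lambda r, e: float(r.strip().lower() == e.strip().lower()),
}

results = run_evaluation(stub_flow, rows, evaluators)
for result in results:
    print(result["input"], result["scores"])
```

Because every row requires one flow call plus one scoring call per evaluator, run time grows with both the dataset size and the number of evaluators attached.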

Interpreting results

When an evaluation finishes, you see a results view with:

Overall score — an aggregated score across all test cases and evaluators
Per-row breakdown — the agent's response and score for each dataset row, so you can see exactly which cases passed and which failed
Evaluator breakdown — if you used multiple evaluators, each evaluator's scores are shown separately
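
The three views above can be thought of as different aggregations of the same per-row scores. The following sketch assumes a plain mean as the aggregation and a pass mark of 1.0; this page does not specify how Forjinn computes the overall score, so treat both as assumptions. The function names and sample data are hypothetical.

```python
from statistics import mean

# Hypothetical per-row results: one score per evaluator for each test case.
results = [
    {"input": "refund window", "scores": {"exact_match": 1.0, "contains_answer": 1.0}},
    {"input": "international shipping", "scores": {"exact_match": 0.0, "contains_answer": 1.0}},
    {"input": "warranty length", "scores": {"exact_match": 0.0, "contains_answer": 0.0}},
]


def evaluator_breakdown(results: list[dict]) -> dict[str, float]:
    """Average score per evaluator across all rows."""
    names = results[0]["scores"].keys()
    return {name: mean(r["scores"][name] for r in results) for name in names}


def overall_score(results: list[dict]) -> float:
    """Average of every score across all rows and evaluators (assumed aggregation)."""
    return mean(s for r in results for s in r["scores"].values())


def failed_rows(results: list[dict], pass_mark: float = 1.0) -> list[str]:
    """Inputs where at least one evaluator scored below the pass mark."""
    return [r["input"] for r in results
            if any(s < pass_mark for s in r["scores"].values())]


print(overall_score(results))
print(evaluator_breakdown(results))
print(failed_rows(results))
```

The per-row view is usually the most actionable of the three, since an aggregate can stay flat while individual cases flip between passing and failing.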

If you change your flow and want to retest, you can re-run an evaluation from its detail view. Forjinn stores all versions of the results so you can compare runs over time and verify that a change improved performance.
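
When comparing two runs, it can help to look past the overall score and check which individual rows changed. The sketch below assumes you have per-row scores from two runs keyed by test-case input; `compare_runs` and the sample scores are hypothetical and only illustrate the comparison, not a Forjinn feature or export format.

```python
def compare_runs(previous: dict[str, float], current: dict[str, float]) -> dict[str, list[str]]:
    """Classify test cases present in both runs as regressed, improved, or unchanged."""
    shared = sorted(previous.keys() & current.keys())
    return {
        "regressed": [k for k in shared if current[k] < previous[k]],
        "improved": [k for k in shared if current[k] > previous[k]],
        "unchanged": [k for k in shared if current[k] == previous[k]],
    }


# Hypothetical per-row scores from a run before and after a flow change.
run_before = {"refund window": 1.0, "international shipping": 1.0, "warranty length": 0.0}
run_after = {"refund window": 1.0, "international shipping": 0.0, "warranty length": 1.0}

diff = compare_runs(run_before, run_after)
print(diff)
```

In this example the average score is identical in both runs, yet one case regressed and another improved, which is exactly the kind of change an aggregate hides.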

Version history

Forjinn tracks whether an evaluation is outdated — that is, whether the underlying flow has changed since the evaluation last ran. When an evaluation shows as outdated, click Run Again to get fresh results against the current version of the flow.
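
This page does not describe how Forjinn detects that a flow has changed. One common way to implement this kind of staleness check, shown here purely as an assumption, is to store a fingerprint of the flow definition when the evaluation runs and compare it with a fingerprint of the current definition. The names `flow_fingerprint` and `is_outdated` and the flow dictionary below are hypothetical.

```python
import hashlib
import json


def flow_fingerprint(flow_definition: dict) -> str:
    """Hash a flow definition in a key-order-independent way."""
    canonical = json.dumps(flow_definition, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_outdated(fingerprint_at_run: str, current_definition: dict) -> bool:
    """True if the flow has changed since the evaluation last ran."""
    return fingerprint_at_run != flow_fingerprint(current_definition)


# Hypothetical flow definition at the time the evaluation ran.
flow_v1 = {"model": "example-model", "system_prompt": "Answer briefly."}
stored = flow_fingerprint(flow_v1)

# The same flow with keys in a different order is not considered changed.
flow_same = {"system_prompt": "Answer briefly.", "model": "example-model"}
# Editing the prompt changes the fingerprint, so the evaluation is outdated.
flow_v2 = {"model": "example-model", "system_prompt": "Answer in detail."}

print(is_outdated(stored, flow_same))
print(is_outdated(stored, flow_v2))
```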
