Evaluators — Configure Metrics and Scoring for AI Agent Assessment

Learn how to set up evaluators in Forjinn. Choose from built-in evaluation types, configure custom metrics, and establish scoring systems for thorough AI agent quality assessment.

The Evaluators page at /evaluators is the configuration center for defining how AI agent outputs are measured and scored. Evaluators determine the criteria by which agent responses are compared against ground truth data from Datasets, producing quantitative results you can use to assess quality, track improvements, and make deployment decisions.

[Figure: Forjinn Evaluators page showing available evaluator types, the configuration panel, and scoring options]

Accessing Evaluators

  1. From the left-hand sidebar, navigate to Settings.
  2. Click on Evaluators to open the evaluator configuration page.

Evaluators can also be accessed from the Dashboard tile under Settings, or directly from within the Evaluations page when creating a new evaluation run.


Available Evaluator Types

Forjinn provides a range of built-in evaluator types, each designed for specific assessment needs:

Exact Match

Compares agent output to expected output using strict string equality.

  • Best for: Code generation, factual answers with a single correct form, structured outputs.
  • Scoring: Binary — 1.0 for exact match, 0.0 otherwise.
  • Options: Case sensitivity toggle, whitespace normalization, trim trailing punctuation.
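
As a rough illustration of how these options interact, the Python sketch below applies each normalization before the strict comparison. The function name `exact_match` and its parameter names are assumptions made for this example; they are not Forjinn's API.

```python
import string


def exact_match(
    agent_output: str,
    expected_output: str,
    *,
    case_sensitive: bool = True,
    normalize_whitespace: bool = False,
    trim_trailing_punctuation: bool = False,
) -> float:
    """Return 1.0 if the two strings are equal after optional normalization."""

    def prepare(text: str) -> str:
        if normalize_whitespace:
            # Collapse runs of whitespace and strip both ends.
            text = " ".join(text.split())
        if trim_trailing_punctuation:
            text = text.rstrip(string.punctuation)
        if not case_sensitive:
            text = text.casefold()
        return text

    return 1.0 if prepare(agent_output) == prepare(expected_output) else 0.0


print(exact_match("paris", "Paris"))                        # strict comparison
print(exact_match("paris", "Paris", case_sensitive=False))  # case-insensitive
```

With every option off, the comparison is plain string equality, which is why this evaluator suits outputs that have a single correct form.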

Semantic Similarity

Uses embedding models to measure how closely the agent output matches the expected output in meaning, regardless of wording.

  • Best for: Open-ended generation, paraphrasing, summaries, customer support responses.
  • Scoring: Continuous score from 0.0 to 1.0 based on cosine similarity.
  • Options:
    • Embedding model selection (e.g., OpenAI text-embedding, local models).
    • Threshold configuration to define pass/fail.
    • Dimensionality settings for embedded vectors.
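
The scoring step can be sketched in Python as follows, assuming the two embedding vectors have already been produced by whichever model is selected. The function names, the default threshold of 0.8, and the choice to clamp negative similarities to 0.0 are all assumptions for illustration; the platform's actual mapping may differ.

```python
import math
from typing import Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embedding dimensionalities must match")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def semantic_score(
    agent_vec: Sequence[float],
    expected_vec: Sequence[float],
    threshold: float = 0.8,  # hypothetical pass threshold
) -> Tuple[float, bool]:
    # Assumption: negative similarities are clamped so the score stays in [0, 1].
    score = max(0.0, cosine_similarity(agent_vec, expected_vec))
    return score, score >= threshold


score, passed = semantic_score([0.2, 0.9, 0.1], [0.25, 0.85, 0.05])
print(round(score, 3), passed)
```

The dimensionality check mirrors the troubleshooting advice later on this page: vectors of different lengths cannot be compared.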

F1 / Token Overlap

Measures the overlap of individual tokens or n-grams between agent output and expected output, computing precision, recall, and F1.

  • Best for: Keyword extraction, entity recognition, lists, and structured text.
  • Scoring: F1 score from 0.0 to 1.0, with separate precision and recall available.
  • Options: N-gram size, tokenization strategy (word, character, subword), stopword filtering.
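
A minimal Python sketch of the word-level case is shown below. It assumes lowercased whitespace tokenization and counts repeated tokens at most as often as they appear in both texts; the n-gram and stopword options are left out, and the function name `token_f1` is hypothetical.

```python
from collections import Counter
from typing import Dict


def token_f1(agent_output: str, expected_output: str) -> Dict[str, float]:
    """Precision, recall, and F1 over lowercased whitespace tokens."""
    pred = agent_output.lower().split()
    gold = expected_output.lower().split()
    if not pred or not gold:
        # Two empty texts agree fully; one empty text shares nothing.
        match = 1.0 if pred == gold else 0.0
        return {"precision": match, "recall": match, "f1": match}

    # Multiset intersection: each token counts up to its minimum frequency.
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}

    precision = overlap / len(pred)
    recall = overlap / len(gold)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


print(token_f1("the cat sat", "the cat sat down"))
```

In this example every predicted token is correct (precision 1.0) but one expected token is missing (recall 0.75), so F1 falls between the two.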

LLM-as-Judge

Uses a secondary, high-capability LLM to assess the quality of agent output against rubric criteria or the expected answer.

  • Best for: Complex reasoning evaluation, creative writing, multi-criteria assessment, nuanced judgment calls.
  • Scoring: Configurable scale (e.g., 1-5 Likert, 0-100), with optional reasoning text.
  • Options:
    • Judge model selection (must be configured via Credentials).
    • Custom rubric prompts to define evaluation criteria.
    • Multi-dimensional scoring (e.g., accuracy, completeness, tone, safety).
    • Temperature and max token settings for the judge model.
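
To make the rubric idea concrete, the Python sketch below builds a judge prompt for a 1-5 scale and validates a JSON reply. The prompt wording, the JSON reply format, and the function names are assumptions for illustration only; the call to the judge model itself is omitted because it depends on the configured provider.

```python
import json
from typing import Dict, Tuple

# Dimensions taken from the multi-dimensional scoring example above.
RUBRIC_DIMENSIONS = ("accuracy", "completeness", "tone", "safety")


def build_judge_prompt(question: str, agent_output: str, expected_output: str) -> str:
    """Assemble a rubric prompt (hypothetical wording)."""
    dimensions = ", ".join(RUBRIC_DIMENSIONS)
    return (
        "You are grading an AI agent's answer.\n"
        f"Question: {question}\n"
        f"Agent answer: {agent_output}\n"
        f"Reference answer: {expected_output}\n"
        f"Rate each of these dimensions from 1 to 5: {dimensions}.\n"
        "Reply with a JSON object containing one integer per dimension "
        'and a "reasoning" string.'
    )


def parse_judge_reply(reply: str) -> Tuple[Dict[str, int], str]:
    """Validate the judge's JSON reply and return (scores, reasoning)."""
    data = json.loads(reply)
    scores = {}
    for dimension in RUBRIC_DIMENSIONS:
        value = int(data[dimension])
        if not 1 <= value <= 5:
            raise ValueError(f"{dimension} score {value} is outside 1-5")
        scores[dimension] = value
    return scores, str(data.get("reasoning", ""))


reply = '{"accuracy": 5, "completeness": 4, "tone": 3, "safety": 5, "reasoning": "Correct but terse."}'
print(parse_judge_reply(reply))
```

Validating the reply before scoring guards against a judge that drifts off the requested scale or omits a dimension.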

Regex Pattern Match

Validates agent output against one or more regular expressions.

  • Best for: Format validation, structured data extraction verification, constraint checking.
  • Scoring: Binary per pattern; overall score is the fraction of patterns matched.
  • Options: Multiple patterns, case-insensitive flag, partial vs. full match mode.
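
The fraction-of-patterns scoring can be sketched in Python with the standard `re` module. Partial mode is modeled with `search` and full mode with `fullmatch`; the function and parameter names are assumptions for this example.

```python
import re
from typing import Sequence


def regex_score(
    agent_output: str,
    patterns: Sequence[str],
    *,
    ignore_case: bool = False,
    full_match: bool = False,
) -> float:
    """Fraction of patterns that match the output."""
    if not patterns:
        raise ValueError("at least one pattern is required")
    flags = re.IGNORECASE if ignore_case else 0
    matched = 0
    for pattern in patterns:
        compiled = re.compile(pattern, flags)
        # fullmatch requires the whole output to match; search accepts any substring.
        hit = compiled.fullmatch(agent_output) if full_match else compiled.search(agent_output)
        if hit is not None:
            matched += 1
    return matched / len(patterns)


# Two of the three patterns match, so the score is 2/3.
print(regex_score("Order #12345 shipped", [r"#\d{5}", r"shipped", r"refund"]))
```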

Custom Code Evaluator

Run arbitrary Python or JavaScript code to compute evaluation scores.

  • Best for: Domain-specific metrics, integration with external scoring APIs, complex business rules.
  • Scoring: Fully user-defined; must return a numeric score between 0.0 and 1.0 (or configured range).
  • Options: Access to input, agent output, expected output, and any environment variables. Supports importing libraries (depending on execution sandbox).

Configuring Evaluators

To set up an evaluator:

  1. Navigate to the Evaluators page.
  2. Click New Evaluator.
  3. Select the Evaluator Type from the available options.
  4. Fill in the configuration:
     Field            Description
     Name             Identifier for the evaluator (e.g., "Semantic Accuracy v2")
     Description      Optional documentation of what the evaluator measures
     Type             The evaluation method (Exact Match, Semantic, LLM-as-Judge, etc.)
     Parameters       Type-specific settings (thresholds, models, patterns, rubrics)
     Weight           Relative importance when combining multiple evaluators (0.0 to 1.0)
     Pass Threshold   Minimum score at which a result is marked as passing
  5. Click Save to store the evaluator. It will appear in the list and be available when creating evaluations.

Evaluator Weights

When running evaluations with multiple evaluators, each evaluator can be assigned a weight to determine its contribution to the overall composite score. Weights are normalized to sum to 1.0.

For example:

  • Semantic Similarity: weight 0.5 (50% of total score)
  • LLM-as-Judge: weight 0.3 (30% of total score)
  • Regex Pattern: weight 0.2 (20% of total score)
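
One plausible reading of the normalization step, sketched in Python under the assumption that each weight is simply divided by the total, is shown below. The function name and the evaluator keys are hypothetical.

```python
from typing import Dict


def normalize_weights(weights: Dict[str, float]) -> Dict[str, float]:
    """Scale weights so they sum to 1.0 while keeping their proportions."""
    if any(weight < 0 for weight in weights.values()):
        raise ValueError("weights must be non-negative")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return {name: weight / total for name, weight in weights.items()}


# Weights entered as 1.0, 0.6, and 0.4 keep the same 5:3:2 proportions
# as the example above once normalized.
print(normalize_weights({"semantic": 1.0, "llm_judge": 0.6, "regex": 0.4}))
```

Under this reading, only the ratios between weights matter: doubling every weight leaves the composite score unchanged.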

Custom Evaluation Metrics

Beyond the built-in evaluator types, Forjinn supports creating fully custom evaluation metrics through the Custom Code Evaluator.

Writing a Custom Evaluator

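A custom evaluator is a function that receives the input, the agent output, the expected output, and an optional context, and returns a numeric score. In JavaScript: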

function evaluate({ input, agentOutput, expectedOutput, context }) {
  // Your scoring logic here; this placeholder scores an exact match.
  const score = agentOutput === expectedOutput ? 1.0 : 0.0;
  // Return a score between 0.0 and 1.0
  return score;
}

Or in Python:

def evaluate(input_text, agent_output, expected_output, context=None):
    # Your scoring logic here; this placeholder scores an exact match.
    score = 1.0 if agent_output == expected_output else 0.0
    # Return a float between 0.0 and 1.0
    return score

Common Custom Metric Patterns

  • Domain-specific accuracy: Industry-specific rules for determining correctness.
  • Toxicity / safety scoring: Integration with moderation APIs to flag unsafe content.
  • Hallucination detection: Cross-referencing agent output against a knowledge base for factuality.
  • Response time scoring: Penalizing outputs that exceed acceptable latency thresholds.
  • Format compliance: Checking that output follows a required schema (e.g., JSON structure, markup format).
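
As an example of the format compliance pattern, the Python sketch below uses the `evaluate` signature shown earlier and scores how many required top-level keys appear in a JSON output. The schema in `REQUIRED_KEYS` is hypothetical and would be replaced by your own.

```python
import json

# Hypothetical schema: the agent must return an object with these keys.
REQUIRED_KEYS = ("status", "items")


def evaluate(input_text, agent_output, expected_output, context=None):
    """Score JSON format compliance as the fraction of required keys present."""
    try:
        parsed = json.loads(agent_output)
    except (TypeError, ValueError):
        return 0.0  # not valid JSON at all
    if not isinstance(parsed, dict):
        return 0.0  # valid JSON, but not an object
    present = sum(1 for key in REQUIRED_KEYS if key in parsed)
    return present / len(REQUIRED_KEYS)


print(evaluate("list my orders", '{"status": "ok", "items": []}', ""))
print(evaluate("list my orders", '{"status": "ok"}', ""))
```

Returning a fraction rather than a binary result lets partially compliant outputs score above outright failures.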

External API Integration

Custom evaluators can call external services for scoring:

  • Connect via configured Credentials for API keys.
  • Use the platform's HTTP request utilities within the evaluator code.
  • Cache external API responses to reduce costs on repeated evaluations.
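
The caching advice can be sketched in Python with the standard library's `functools.lru_cache`. The function `call_moderation_api` below is a local stand-in, not a real service or a Forjinn utility; in practice it would issue the HTTP request using a key from Credentials.

```python
from functools import lru_cache


def call_moderation_api(text: str) -> float:
    """Stand-in for an external scoring call (hypothetical).

    Returns a toxicity value in [0, 1]; a real implementation would
    send `text` to a moderation service instead of checking a word list.
    """
    flagged_terms = ("idiot", "stupid")
    return 1.0 if any(term in text.lower() for term in flagged_terms) else 0.0


@lru_cache(maxsize=4096)
def cached_toxicity(text: str) -> float:
    # Identical outputs across repeated runs reuse the first response.
    return call_moderation_api(text)


def evaluate(input_text, agent_output, expected_output, context=None):
    """Safety score: 1.0 for clean output, 0.0 for flagged output."""
    return 1.0 - cached_toxicity(agent_output)


print(evaluate("", "Happy to help!", ""))
print(evaluate("", "Happy to help!", ""))  # served from the cache
print(cached_toxicity.cache_info())
```

An in-process cache like this only lasts for the lifetime of the evaluator process; persisting results across runs would need external storage.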

Scoring Systems

Forjinn's evaluation framework supports several scoring paradigms:

Binary Scoring

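A result either passes or fails. Used by the Exact Match evaluator and, for each individual pattern, by the Regex Pattern evaluator (whose overall score is the fraction of patterns matched).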

  • Score: 0.0 or 1.0
  • Clear pass/fail determination based on the configured threshold

Continuous Scoring

A numeric value representing degree of correctness. Used by Semantic Similarity, F1, and custom evaluators.

  • Score: 0.0 to 1.0 (float)
  • Allows nuanced comparison between outputs of varying quality

Ordinal / Rubric Scoring

A scaled assessment, typically produced by LLM-as-Judge evaluators.

  • Score: Integer on a custom scale (e.g., 1-5, 1-10)
  • Includes qualitative reasoning alongside the numeric score
  • Supports multi-dimensional rubrics (accuracy, completeness, clarity, tone)
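
To compare rubric scores with the 0.0 to 1.0 scores of other evaluators, an ordinal score can be rescaled linearly. The Python sketch below shows one way to do this and to average several rubric dimensions with equal weight; both the rescaling and the equal weighting are assumptions, not documented Forjinn behavior.

```python
from typing import Dict


def normalize_ordinal(score: int, scale_min: int = 1, scale_max: int = 5) -> float:
    """Map an ordinal score onto [0, 1] linearly."""
    if scale_max <= scale_min:
        raise ValueError("scale_max must be greater than scale_min")
    if not scale_min <= score <= scale_max:
        raise ValueError(f"score {score} is outside {scale_min}-{scale_max}")
    return (score - scale_min) / (scale_max - scale_min)


def rubric_score(dimension_scores: Dict[str, int], scale_min: int = 1, scale_max: int = 5) -> float:
    """Equal-weight average of normalized rubric dimensions."""
    if not dimension_scores:
        raise ValueError("at least one dimension is required")
    normalized = [
        normalize_ordinal(score, scale_min, scale_max)
        for score in dimension_scores.values()
    ]
    return sum(normalized) / len(normalized)


print(normalize_ordinal(3))                          # midpoint of a 1-5 scale
print(rubric_score({"accuracy": 5, "tone": 3}))
```

Note that on a 1-5 scale the lowest rating maps to 0.0, not 0.2, because the scale minimum is subtracted first.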

Composite Scoring

When multiple evaluators run together, Forjinn combines their individual scores into a weighted composite.

Composite Score = Σ (evaluator_score × evaluator_weight)

The composite provides a single number for overall assessment, while individual evaluator scores remain accessible for drill-down analysis.
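
The formula can be worked through in Python as a short sketch. It assumes the weights have already been normalized to sum to 1.0 and that every score lies between 0.0 and 1.0; the evaluator names and score values are invented for the example.

```python
import math
from typing import Dict


def composite_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of evaluator scores; weights must sum to 1.0."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same evaluators")
    if not math.isclose(sum(weights.values()), 1.0):
        raise ValueError("weights must be normalized to sum to 1.0")
    return sum(scores[name] * weights[name] for name in weights)


scores = {"semantic": 0.82, "llm_judge": 0.75, "regex": 1.0}
weights = {"semantic": 0.5, "llm_judge": 0.3, "regex": 0.2}

# 0.82 * 0.5 + 0.75 * 0.3 + 1.0 * 0.2 = 0.41 + 0.225 + 0.2 = 0.835
print(round(composite_score(scores, weights), 3))
```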

Pass/Fail Determination

Each evaluator — and the composite score — can have a pass threshold configured:

  • Score >= threshold: PASS
  • Score < threshold: FAIL

The overall evaluation result is marked as passing only if the composite score meets its threshold.
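
The Python sketch below illustrates this rule with invented numbers: one evaluator falls below its own threshold, yet the overall result still passes because only the composite threshold decides it. The function name and the result layout are assumptions for the example.

```python
from typing import Dict


def verdict(score: float, threshold: float) -> str:
    """PASS when the score meets or exceeds the threshold."""
    return "PASS" if score >= threshold else "FAIL"


def evaluation_result(
    scores: Dict[str, float],
    thresholds: Dict[str, float],
    composite: float,
    composite_threshold: float,
) -> dict:
    per_evaluator = {
        name: verdict(score, thresholds[name]) for name, score in scores.items()
    }
    # The overall result depends only on the composite score.
    return {
        "evaluators": per_evaluator,
        "overall": verdict(composite, composite_threshold),
    }


result = evaluation_result(
    scores={"semantic": 0.82, "regex": 0.5},
    thresholds={"semantic": 0.8, "regex": 0.9},
    composite=0.76,
    composite_threshold=0.75,
)
print(result)
```

Because a failing evaluator can be masked by the composite, it is worth reviewing the per-evaluator verdicts as well as the overall one.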


Best Practices for Evaluation

Design Principles

  • Multi-dimensional assessment: Use more than one evaluator type to capture different aspects of quality. Semantic similarity alone may miss correctness; exact match may miss paraphrased but accurate answers.
  • Appropriate thresholds: Calibrate pass/fail thresholds based on real-world expectations, not arbitrary values. Run pilot evaluations to see score distributions before setting thresholds.
  • Representative datasets: Ensure your Datasets cover the full range of expected inputs, including edge cases and domain-specific challenges.

Workflow Recommendations

  • Baseline first: Run an initial evaluation before making agent changes. This establishes a reference point for measuring improvement.
  • Incremental tuning: After adjusting prompts or agent configurations, re-run evaluations with the same dataset version to quantify impact.
  • A/B testing: Create parallel evaluations to compare two agent configurations or model choices on the same dataset.
  • Regression guarding: Use evaluations as part of a CI/CD pipeline — prevent deployments if evaluation scores drop below acceptable thresholds.
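
A regression guard in a CI/CD pipeline could look roughly like the Python sketch below. It assumes the pipeline has already obtained the current and baseline composite scores by some means; how those are fetched from Forjinn is not covered here, and the function names and the 0.02 regression allowance are hypothetical.

```python
from typing import List


def gate_problems(
    current: float,
    baseline: float,
    threshold: float,
    max_regression: float = 0.02,  # hypothetical allowance for run-to-run noise
) -> List[str]:
    """List the reasons a deployment should be blocked (empty means proceed)."""
    problems = []
    if current < threshold:
        problems.append(
            f"composite {current:.3f} is below threshold {threshold:.3f}"
        )
    if baseline - current > max_regression:
        problems.append(
            f"composite dropped {baseline - current:.3f} from baseline {baseline:.3f}"
        )
    return problems


def ci_exit_code(problems: List[str]) -> int:
    """Non-zero exit codes fail the pipeline step."""
    return 1 if problems else 0


problems = gate_problems(current=0.81, baseline=0.85, threshold=0.80)
for problem in problems:
    print("BLOCKED:", problem)
print("exit code:", ci_exit_code(problems))
```

A CI script would pass the result of `ci_exit_code` to `sys.exit` so that a blocked deployment fails the job.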

LLM-as-Judge Considerations

  • Judge model selection: Use a high-capability model as the judge, ideally one that outperforms the agent being evaluated.
  • Prompt engineering: Write clear, specific rubric prompts. Include examples of high-scoring and low-scoring outputs to calibrate the judge.
  • Cost management: LLM-as-Judge evaluation incurs token costs for every item. Monitor usage and consider caching results for unchanged items.
  • Judge bias: Be aware that the judge model may have style preferences. Test rubric consistency by running the same evaluation twice.
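
The consistency check suggested above can be quantified with a simple agreement rate between two runs over the same items. This Python sketch assumes each run yields one rubric score per item in the same order; the function name and the tolerance parameter are hypothetical.

```python
from typing import Sequence


def judge_agreement(
    run_a: Sequence[int],
    run_b: Sequence[int],
    tolerance: int = 0,
) -> float:
    """Fraction of items whose two scores differ by at most `tolerance`."""
    if len(run_a) != len(run_b):
        raise ValueError("both runs must score the same items")
    if not run_a:
        raise ValueError("at least one item is required")
    agreeing = sum(1 for a, b in zip(run_a, run_b) if abs(a - b) <= tolerance)
    return agreeing / len(run_a)


first_run = [5, 4, 3, 2]
second_run = [5, 4, 2, 2]
print(judge_agreement(first_run, second_run))               # exact agreement
print(judge_agreement(first_run, second_run, tolerance=1))  # within one point
```

A low exact-agreement rate suggests tightening the rubric or lowering the judge temperature before trusting the scores.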

Monitoring Evaluation Health

  • Track evaluator success rates: if an evaluator consistently returns errors, review its configuration.
  • Monitor score distributions: a saturated distribution (nearly all scores clustered near 0.0, or nearly all near 1.0) may indicate the evaluator is not discriminative enough.
  • Keep evaluator documentation up to date: include the rationale for chosen thresholds and weights.
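
A quick way to inspect a score distribution is to summarize its spread and how much of it sits at either extreme. The Python sketch below does this with the standard `statistics` module; the cutoffs of 0.05 and 0.95 and the minimum spread of 0.05 are arbitrary assumptions to tune for your data.

```python
import statistics
from typing import Sequence


def distribution_report(
    scores: Sequence[float],
    *,
    low: float = 0.05,
    high: float = 0.95,
    min_spread: float = 0.05,
) -> dict:
    """Summarize a score distribution and flag low discrimination."""
    if not scores:
        raise ValueError("no scores to analyse")
    count = len(scores)
    spread = statistics.pstdev(scores)  # population standard deviation
    return {
        "mean": statistics.fmean(scores),
        "spread": spread,
        "near_floor": sum(1 for s in scores if s <= low) / count,
        "near_ceiling": sum(1 for s in scores if s >= high) / count,
        # Very little spread means the evaluator barely separates outputs.
        "saturated": spread < min_spread,
    }


print(distribution_report([1.0, 1.0, 0.98, 1.0, 0.99]))
print(distribution_report([0.2, 0.5, 0.9, 0.7]))
```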

Troubleshooting

  • Evaluator returns errors: Verify that all required credentials are configured and the referenced models are accessible. Check the Server Logs for detailed error messages.
  • Unexpected zero scores: For semantic similarity, ensure the embedding model is loaded and the dimensionalities match. For custom evaluators, verify the function returns a valid numeric value.
  • Slow evaluation runs: LLM-as-Judge evaluators are the slowest. Consider reducing the number of items per evaluation or splitting large datasets into smaller batches.
  • Composite score seems wrong: Check each evaluator's weight (weights are normalized to sum to 1.0, so raising one weight lowers the relative contribution of the others) and confirm that individual evaluator scores are within their expected ranges before aggregation.

Related Pages

  • Datasets — Create and manage the ground truth data evaluations run against
  • Evaluations — Combine evaluators and datasets to run evaluations and view results
  • Components Guide — Evaluators — Build evaluator nodes directly into agentflows and chatflows
  • Agent Executions — Correlate evaluation results with live agent execution data
  • Server Logs — Debug evaluation failures and monitor system health
  • Credentials — Configure API keys needed for embedding models and LLM-as-Judge evaluators
