Evaluators

Learn about evaluators and how to implement them effectively.

Last updated: 12/9/2025

Evaluators & Evaluations: Benchmarking AI/Agent Performance

Evaluators in InnoSynth-Forjinn are core components for measuring, comparing, and ensuring the quality of AI-generated responses, agent actions, and chatflow outputs. They bring scientific rigor and automation to the process of model evaluation, prompt tuning, and workflow improvement.


What is an Evaluator?

An Evaluator is a module (node or workflow) that compares a system's output (from an LLM, agent, or workflow) against a known ground truth, using a variety of metrics:

  • Exact match (string/equality)
  • F1/ROUGE/BLEU (for text similarity)
  • Semantic similarity (embeddings-based)
  • Human preference/annotation
  • Custom scoring (via code or external API)

Evaluators can run automatically, in batch (across whole datasets), or interactively in the UI.
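
As a concrete illustration of the first two metric families, here is a minimal, dependency-free sketch of exact-match and token-level F1 scoring. The function names and signatures are illustrative only, not the platform's built-in evaluator API.

```python
from collections import Counter

def exact_match(prediction: str, ground_truth: str) -> float:
    """Return 1.0 only when the two strings are identical."""
    return 1.0 if prediction == ground_truth else 0.0

def token_f1(prediction: str, ground_truth: str) -> float:
    """Token-level F1: harmonic mean of precision and recall over shared tokens."""
    pred_tokens = prediction.split()
    truth_tokens = ground_truth.split()
    common = Counter(pred_tokens) & Counter(truth_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, token_f1("the cat sat", "a cat sat") returns roughly 0.67, while exact_match returns 0.0 for the same pair.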


Types of Evaluations

  • Automated Evaluations: Run against a dataset, compare predictions to annotated ground truths, compute accuracy, precision, recall, etc.
  • Manual Evaluations: Human reviewers approve, reject, or score specific outputs using platform UI tools.
  • Hybrid: Automated filter, then human review for borderline or flagged cases.
  • Second-Order Analysis: Use one LLM/agent to critique or grade another's output ("self-evaluation", GPT-judge, etc.), as sketched below.
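
For second-order analysis, the grading call is just a prompt sent to a judge model. The sketch below assumes the OpenAI Python SDK and a made-up rubric; the platform's own LLM-scoring node may structure this differently.

```python
from openai import OpenAI  # assumes the OpenAI Python SDK is installed and OPENAI_API_KEY is set

client = OpenAI()

JUDGE_PROMPT = """You are grading an AI answer against a reference answer.
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}
Reply with a single integer from 1 (wrong) to 5 (fully correct)."""

def llm_judge(question: str, reference: str, candidate: str) -> int:
    """Ask a judge model to grade the candidate answer on a 1-5 scale."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any chat-capable model can act as the judge
        temperature=0,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                question=question, reference=reference, candidate=candidate),
        }],
    )
    return int(response.choices[0].message.content.strip())
```

In practice you would constrain the judge's output format (or retry on parse failures) before trusting the returned grade.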

Setting Up Evaluators

1. Add an Evaluator Node

  • Drag an Evaluator node (e.g. ExactMatchEvaluator, FuzzyEvaluator, LLMScorer) to your workflow.
  • Connect it after the node whose output you want to assess (typically LLM, Retriever, or Chat Output).
  • Attach your dataset (with ground truths) in the node config; an illustrative record format follows below.
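
The exact dataset schema depends on how you map columns in the node config; purely as an illustration, annotated records with a ground-truth answer might look like this (field names are placeholders, not a required schema):

```python
# Illustrative ground-truth records; the actual column names come from your node's mapping.
dataset = [
    {"question": "What is the capital of France?",
     "context": "France is a country in Western Europe.",
     "answer": "Paris"},
    {"question": "Who wrote Hamlet?",
     "context": "Hamlet is an early-17th-century tragedy.",
     "answer": "William Shakespeare"},
]
```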

2. Configure Metrics & Criteria

  • Choose or define the metric(s), e.g.:
    • String match: set the path to the actual output and the path to the ground truth
    • Semantic: select the embedding model used to compute similarity
    • Custom: supply Python/JS code to calculate the match/score (a sketch follows this list)
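
A custom scorer usually boils down to a function that receives the model output and the ground truth and returns a score. The signature the platform expects is not documented here, so treat the following as a hedged sketch of one possible metric (keyword coverage):

```python
def keyword_coverage(output: str, ground_truth: str, threshold: float = 0.8) -> dict:
    """Example custom metric: fraction of ground-truth keywords that appear in the output."""
    keywords = {word.lower() for word in ground_truth.split() if len(word) > 3}
    if not keywords:
        return {"score": 1.0, "passed": True}
    found = sum(1 for word in keywords if word in output.lower())
    score = found / len(keywords)
    return {"score": score, "passed": score >= threshold}
```

Returning a dict with both a numeric score and a pass flag makes it easy to aggregate results and filter failures later.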

3. (Optional) UI & Human-in-the-Loop

  • Enable manual evaluation mode for human reviewers
  • Use annotation UI to accept/reject, rate, or provide feedback on model outputs

4. Run Evaluation

  • Manually test on sample inputs
  • Or, run batch evaluation against all records in a dataset
  • Results are logged per sample and summarized (scores, confusion matrices, histograms, etc.), as in the batch sketch below
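
Outside the UI, a batch run reduces to a loop over the dataset that calls your flow, scores each record, and aggregates the results. The sketch below assumes a hypothetical run_workflow callable standing in for your Retriever/LLM flow and reuses the metric helpers sketched earlier.

```python
def evaluate_batch(dataset, run_workflow, metric):
    """Run the flow on every record, score it, and return per-sample logs plus a summary."""
    results = []
    for record in dataset:
        prediction = run_workflow(record["question"])  # run_workflow is a placeholder for your flow
        score = metric(prediction, record["answer"])
        results.append({
            "question": record["question"],
            "prediction": prediction,
            "ground_truth": record["answer"],
            "score": score,
        })
    summary = {
        "samples": len(results),
        "mean_score": sum(r["score"] for r in results) / max(len(results), 1),
    }
    return results, summary
```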

Example: Evaluating RAG QA with String & Semantic Scores

  • Upload Dataset: Q/A pairs (question, context, answer)
  • Configure Workflow: Retriever → LLM → Evaluator
  • Set Evaluator:
    • Metric: ‘semantic similarity’ with threshold 0.85
  • Run: On each record, the Evaluator compares the model output to the known answer, scores it, and logs the result (see the similarity sketch below).
  • Visualize: Charts show per-metric performance and failed cases.
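
Under the hood, a semantic check with a 0.85 threshold can be approximated with any embedding model. This sketch uses the sentence-transformers library purely as an example; it is not necessarily the model or library the platform uses internally.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # example embedding model, not a platform default

def semantic_match(prediction: str, ground_truth: str, threshold: float = 0.85) -> bool:
    """Embed both strings and pass the sample if cosine similarity meets the threshold."""
    embeddings = model.encode([prediction, ground_truth], convert_to_tensor=True)
    similarity = util.cos_sim(embeddings[0], embeddings[1]).item()
    return similarity >= threshold
```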

Results & Reporting

  • Summaries: Overall accuracy, F1, semantic match
  • Per-sample reports for error analysis
  • Export to CSV/JSON for deeper, reproducible analysis (see the export example below)
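
Exporting per-sample results needs nothing beyond the standard library; the results list below is assumed to have the per-sample shape used in the batch sketch above.

```python
import csv
import json

def export_results(results, csv_path="evaluation.csv", json_path="evaluation.json"):
    """Write per-sample results to JSON (reproducible analysis) and CSV (spreadsheets)."""
    if not results:
        return
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(results, f, ensure_ascii=False, indent=2)
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(results[0].keys()))
        writer.writeheader()
        writer.writerows(results)
```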

Troubleshooting

  • Scores lower than expected: Check for case, whitespace, or format differences (a normalization helper is sketched after this list); semantic metrics are more tolerant than string metrics.
  • Dataset not recognized: Ensure column mapping is correct and all required fields are present.
  • Stuck on "Evaluating...": Large datasets can take time; batch in smaller chunks or optimize retriever/LLM speed.
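
If string-based scores look suspiciously low, a quick normalization pass before comparison usually shows whether formatting is the culprit. This helper is illustrative, not a built-in:

```python
def normalized_exact_match(prediction: str, ground_truth: str) -> float:
    """Exact match after trimming whitespace and lowercasing, to rule out formatting mismatches."""
    return 1.0 if prediction.strip().lower() == ground_truth.strip().lower() else 0.0
```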

Best Practices

  • Always validate that your evaluation set reflects live data.
  • Avoid leaking answer/context in prompts used for evaluation.
  • Use both automated and human/manual review for high-stakes or critical tasks.
  • Document metric settings and rationale in Sticky Notes or metadata.
  • Version your datasets and flows for reproducibility.

Evaluators make your AI measurable, improvable, and trustworthy, turning black-box AI into repeatable, scientific-grade workflows.