Evaluators

Learn about evaluators and how to implement them effectively.

Last updated: 12/9/2025

Evaluators & Evaluations: Benchmarking AI/Agent Performance

Evaluators in InnoSynth-Forjinn are core components for measuring, comparing, and ensuring the quality of AI-generated responses, agent actions, and chatflow outputs. They bring scientific rigor and automation to the process of model evaluation, prompt tuning, and workflow improvement.


What is an Evaluator?

An Evaluator is a module (node or workflow) that compares a system's output (from an LLM, agent, or workflow) against a known ground truth, using a variety of metrics:

  • Exact match (string/equality)
  • F1/ROUGE/BLEU (for text similarity)
  • Semantic similarity (embeddings-based)
  • Human preference/annotation
  • Custom scoring (via code or external API)

Evaluators can run automatically, in batch (across whole datasets), or interactively in the UI.
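
As a concrete illustration of the first two metric families, here is a minimal, dependency-free sketch of exact-match and token-level F1 scoring. The function names and signatures are illustrative only, not the platform's built-in evaluator API.

```python
from collections import Counter

def exact_match(prediction: str, ground_truth: str) -> float:
    """Return 1.0 only when the two strings are identical."""
    return 1.0 if prediction == ground_truth else 0.0

def token_f1(prediction: str, ground_truth: str) -> float:
    """Token-level F1: harmonic mean of precision and recall over shared tokens."""
    pred_tokens = prediction.split()
    truth_tokens = ground_truth.split()
    common = Counter(pred_tokens) & Counter(truth_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, token_f1("the cat sat", "a cat sat") returns roughly 0.67, while exact_match returns 0.0 for the same pair.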


Types of Evaluations

  • Automated Evaluations: Run against a dataset, compare predictions to annotated ground truths, compute accuracy, precision, recall, etc.
  • Manual Evaluations: Human reviewers approve, reject, or score specific outputs using platform UI tools.
  • Hybrid: Automated filter, then human review for borderline or flagged cases.
  • Second-Order Analysis: Use one LLM/agent to critique or grade another's output ("self-evaluation", GPT-judge, etc.), as sketched below.
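
For second-order analysis, the grading call is just a prompt sent to a judge model. The sketch below assumes the OpenAI Python SDK and a made-up rubric; the platform's own LLM-scoring node may structure this differently.

```python
from openai import OpenAI  # assumes the OpenAI Python SDK is installed and OPENAI_API_KEY is set

client = OpenAI()

JUDGE_PROMPT = """You are grading an AI answer against a reference answer.
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}
Reply with a single integer from 1 (wrong) to 5 (fully correct)."""

def llm_judge(question: str, reference: str, candidate: str) -> int:
    """Ask a judge model to grade the candidate answer on a 1-5 scale."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any chat-capable model can act as the judge
        temperature=0,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                question=question, reference=reference, candidate=candidate),
        }],
    )
    return int(response.choices[0].message.content.strip())
```

In practice you would constrain the judge's output format (or retry on parse failures) before trusting the returned grade.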

Setting Up Evaluators

1. Add an Evaluator Node

  • Drag an Evaluator node (e.g. ExactMatchEvaluator, FuzzyEvaluator, LLMScorer) to your workflow.
  • Connect it after the node whose output you want to assess (typically LLM, Retriever, or Chat Output).
  • Attach your dataset (with ground truths) in the node config; an illustrative record format follows below.
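
The exact dataset schema depends on how you map columns in the node config; purely as an illustration, annotated records with a ground-truth answer might look like this (field names are placeholders, not a required schema):

```python
# Illustrative ground-truth records; the actual column names come from your node's mapping.
dataset = [
    {"question": "What is the capital of France?",
     "context": "France is a country in Western Europe.",
     "answer": "Paris"},
    {"question": "Who wrote Hamlet?",
     "context": "Hamlet is an early-17th-century tragedy.",
     "answer": "William Shakespeare"},
]
```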

2. Configure Metrics & Criteria

  • Choose or define the metric(s), e.g.:
    • String match: set the path to the actual output and the path to the ground truth
    • Semantic: select the embedding model used to compute similarity
    • Custom: supply Python/JS code to calculate the match/score (a sketch follows this list)
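
A custom scorer usually boils down to a function that receives the model output and the ground truth and returns a score. The signature the platform expects is not documented here, so treat the following as a hedged sketch of one possible metric (keyword coverage):

```python
def keyword_coverage(output: str, ground_truth: str, threshold: float = 0.8) -> dict:
    """Example custom metric: fraction of ground-truth keywords that appear in the output."""
    keywords = {word.lower() for word in ground_truth.split() if len(word) > 3}
    if not keywords:
        return {"score": 1.0, "passed": True}
    found = sum(1 for word in keywords if word in output.lower())
    score = found / len(keywords)
    return {"score": score, "passed": score >= threshold}
```

Returning a dict with both a numeric score and a pass flag makes it easy to aggregate results and filter failures later.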

3. (Optional) UI & Human-in-the-Loop

  • Enable manual evaluation mode for human reviewers
  • Use annotation UI to accept/reject, rate, or provide feedback on model outputs

4. Run Evaluation

  • Manually test on sample inputs
  • Or, run batch evaluation against all records in a dataset
  • Results are logged per sample and summarized (scores, confusion matrices, histograms, etc.), as in the batch sketch below
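
Outside the UI, a batch run reduces to a loop over the dataset that calls your flow, scores each record, and aggregates the results. The sketch below assumes a hypothetical run_workflow callable standing in for your Retriever/LLM flow and reuses the metric helpers sketched earlier.

```python
def evaluate_batch(dataset, run_workflow, metric):
    """Run the flow on every record, score it, and return per-sample logs plus a summary."""
    results = []
    for record in dataset:
        prediction = run_workflow(record["question"])  # run_workflow is a placeholder for your flow
        score = metric(prediction, record["answer"])
        results.append({
            "question": record["question"],
            "prediction": prediction,
            "ground_truth": record["answer"],
            "score": score,
        })
    summary = {
        "samples": len(results),
        "mean_score": sum(r["score"] for r in results) / max(len(results), 1),
    }
    return results, summary
```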

Example: Evaluating RAG QA with String & Semantic Scores

  • Upload Dataset: Q/A pairs (question, context, answer)
  • Configure Workflow: Retriever → LLM → Evaluator
  • Set Evaluator:
    • Metric: ‘semantic similarity’ with threshold 0.85
  • Run: On each record, the Evaluator compares the model output to the known answer, scores it, and logs the result (see the similarity sketch below).
  • Visualize: Charts show per-metric performance and failed cases.
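
Under the hood, a semantic check with a 0.85 threshold can be approximated with any embedding model. This sketch uses the sentence-transformers library purely as an example; it is not necessarily the model or library the platform uses internally.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # example embedding model, not a platform default

def semantic_match(prediction: str, ground_truth: str, threshold: float = 0.85) -> bool:
    """Embed both strings and pass the sample if cosine similarity meets the threshold."""
    embeddings = model.encode([prediction, ground_truth], convert_to_tensor=True)
    similarity = util.cos_sim(embeddings[0], embeddings[1]).item()
    return similarity >= threshold
```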

Results & Reporting

  • Summaries: Overall accuracy, F1, semantic match
  • Per-sample reports for error analysis
  • Export to CSV/JSON for deeper, reproducible analysis (see the export example below)
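
Exporting per-sample results needs nothing beyond the standard library; the results list below is assumed to have the per-sample shape used in the batch sketch above.

```python
import csv
import json

def export_results(results, csv_path="evaluation.csv", json_path="evaluation.json"):
    """Write per-sample results to JSON (reproducible analysis) and CSV (spreadsheets)."""
    if not results:
        return
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(results, f, ensure_ascii=False, indent=2)
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(results[0].keys()))
        writer.writeheader()
        writer.writerows(results)
```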

Troubleshooting

  • Scores lower than expected: Check for case, whitespace, or format differences (a normalization helper is sketched after this list); semantic metrics are more tolerant than string metrics.
  • Dataset not recognized: Ensure column mapping is correct and all required fields are present.
  • Stuck on "Evaluating...": Large datasets can take time; batch in smaller chunks or optimize retriever/LLM speed.
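
If string-based scores look suspiciously low, a quick normalization pass before comparison usually shows whether formatting is the culprit. This helper is illustrative, not a built-in:

```python
def normalized_exact_match(prediction: str, ground_truth: str) -> float:
    """Exact match after trimming whitespace and lowercasing, to rule out formatting mismatches."""
    return 1.0 if prediction.strip().lower() == ground_truth.strip().lower() else 0.0
```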

Best Practices

  • Always validate that your evaluation set reflects live data.
  • Avoid leaking answer/context in prompts used for evaluation.
  • Use both automated and human/manual review for high-stakes or critical tasks.
  • Document metric settings and rationale in Sticky Notes or metadata.
  • Version your datasets and flows for reproducibility.

Evaluators make your AI measurable, improvable, and trustworthy, turning black-box AI into repeatable, scientific-grade workflows.