Datasets

Learn about datasets and how to use them effectively.


Datasets: Data Management & RAG/Evaluation Integration

Datasets in InnoSynth-Forjinn provide the essential foundation for RAG (Retrieval Augmented Generation), code/data evaluation, benchmarking, fine-tuning, and advanced analytics. This guide explains how to organize, import, label, and use datasets, both for live agent workflows and for systematic testing or retraining.


What is a Dataset?

A Dataset is a structured (or semi-structured) collection of examples—typically consisting of:

  • Text passages, Q&A pairs, document blobs
  • Labeled examples and ground truth answers (for benchmarking/evaluation)
  • File attachments, images, code snippets, etc. (for some tasks)
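
For a concrete picture, here is a minimal sketch of a single Q&A record. The question, context, and answer field names follow the JSON/JSONL convention described under Supported Formats below; your own schema may differ.

```python
# A minimal sketch of one dataset record. The field names (question,
# context, answer) are the conventions used in this guide; adapt them
# to your own schema.
record = {
    "question": "What is the capital of France?",
    "context": "Paris is the capital and most populous city of France.",
    "answer": "Paris",
}
```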

Datasets are used for:

  • Knowledge retrieval (RAG flows)
  • Automated and manual agent evaluation
  • Training and fine-tuning agents or prompt chains
  • Organizing knowledge/context for sub-workflows

Supported Formats & Dataset Types

  • Text/CSV: Upload tabular data, with each row as a record (supports a column-mapping UI).
  • JSON/JSONL: Structured and scalable—each entry is a JSON object with fields like question, context, answer.
  • PDF/Doc: Ingested and auto-chunked into text rows/fields.
  • Image/File: Attachments/linked objects (supported for image-to-text or file-to-text tasks).
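
To illustrate the column-mapping idea, here is a rough sketch that converts a CSV export into JSONL records. The source column names (prompt, reference, expected) are assumptions; substitute whatever your file actually uses.

```python
import csv
import json

# Sketch: convert a tabular CSV into JSONL records for import.
# The source column names ("prompt", "reference", "expected") are
# hypothetical; map them to whatever your CSV actually contains.
with open("qa_pairs.csv", newline="", encoding="utf-8") as src, \
        open("qa_pairs.jsonl", "w", encoding="utf-8") as dst:
    for row in csv.DictReader(src):
        record = {
            "question": row["prompt"],
            "context": row["reference"],
            "answer": row["expected"],
        }
        dst.write(json.dumps(record, ensure_ascii=False) + "\n")
```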

How to Create and Use Datasets

1. Create or Import a Dataset

  • Go to Datasets in the sidebar, or use "Attach Dataset" in a RAG/document loader node.
  • Click New Dataset or Import.
    • Upload a CSV/JSON file or connect a cloud storage location (S3/GCS/etc.); a staging sketch follows these steps.
    • Map data fields (e.g., set input column as question, output as answer, context for retrieval).
    • Assign a clear name, type, and tags.
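
If you go the cloud-storage route, the file usually needs to be staged in the bucket first. A minimal sketch using boto3; the bucket name and key are placeholders, and credentials are assumed to come from your environment.

```python
import boto3

# Sketch: stage a dataset file in S3 before connecting the bucket as a
# cloud storage source in the import dialog. Bucket and key are
# hypothetical placeholders.
s3 = boto3.client("s3")
s3.upload_file(
    Filename="qa_pairs.jsonl",
    Bucket="my-forjinn-datasets",       # hypothetical bucket name
    Key="datasets/qa_pairs_v1.jsonl",   # versioned key, per best practices below
)
```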

2. Labeling & Annotating

  • For Q&A or evaluation: use the platform UI for manual labeling, highlighting, and approval.
  • Add multiple correct answers, partial scores, or notes if your evaluation workflow requires them.
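
As a sketch of what such a labeled record could carry, the accepted_answers list below holds multiple correct answers with partial-credit scores plus a reviewer note. This schema is an assumption for illustration, not the platform's canonical format.

```python
# Sketch of a labeled record carrying several acceptable answers with
# partial-credit scores and a reviewer note. The schema is illustrative;
# the platform UI manages the canonical format.
labeled_record = {
    "question": "Which formats can a dataset be imported from?",
    "accepted_answers": [
        {"text": "CSV, JSON/JSONL, PDF/Doc, and image/file attachments", "score": 1.0},
        {"text": "CSV and JSON", "score": 0.5},  # partially correct
    ],
    "note": "Full credit requires mentioning document and file ingestion.",
}
```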

3. Attaching Datasets to Flows/Nodes

  • For RAG: In a Document Loader or Retriever node, select your dataset as the knowledge source.
  • For Eval: In an Evaluator or Test Suite node, set the dataset as the evaluation corpus (ground truth).
  • Chaining: Layer multiple datasets for complex/multi-context workflows.

4. Example: Using Datasets for RAG & Evaluation

  • Create a Dataset: Upload Q&A pairs with question (input), context (reference text), and answer (expected output) fields.
  • Configure Retriever/Loader: Point to dataset as knowledge source.
  • Connect Evaluator Node: Set the dataset as ground truth and auto-compare agent output with correct answers.
  • Export Results: Download or visualize pass/fail, accuracy, or full error analysis.
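
The compare-and-export steps can also be reproduced offline. Below is a minimal exact-match evaluation sketch; the agent outputs are hardcoded stand-ins for results collected from your flow, and real evaluators often use fuzzier or semantic matching.

```python
import csv
import json

# Sketch: exact-match evaluation of agent outputs against a ground-truth
# dataset, exported as a pass/fail CSV. Normalization is deliberately
# simple; production evaluators often use fuzzy or semantic matching.
def normalize(text: str) -> str:
    return " ".join(text.lower().split())

with open("qa_pairs.jsonl", encoding="utf-8") as f:
    ground_truth = [json.loads(line) for line in f]

# agent_outputs would come from running your flow; hardcoded for the sketch.
agent_outputs = ["Paris"]  # hypothetical

rows, passed = [], 0
for record, output in zip(ground_truth, agent_outputs):
    ok = normalize(output) == normalize(record["answer"])
    passed += ok
    rows.append({"question": record["question"], "output": output, "pass": ok})

with open("eval_report.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["question", "output", "pass"])
    writer.writeheader()
    writer.writerows(rows)

print(f"accuracy: {passed / len(rows):.1%}")
```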

Best Practices

  • Use version tags for evolving datasets; maintain strict validation schemas for production evals (a validation sketch follows this list).
  • Use representative, diverse, and well-labeled data for meaningful RAG and test results.
  • Keep access to sensitive data controlled if working with user or business data.
  • Document intended use, known bugs, and schema in the metadata/thumbnail field.
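
For the strict-validation recommendation, here is one way to gate records before a production eval, using the third-party jsonschema package. The schema itself is an assumption built from the question/context/answer fields used throughout this guide.

```python
import json
from jsonschema import validate, ValidationError

# Sketch: enforce a strict record schema before promoting a dataset to
# production evals. Requires the third-party jsonschema package; the
# schema is an assumed example, not a platform-defined format.
RECORD_SCHEMA = {
    "type": "object",
    "required": ["question", "answer"],
    "properties": {
        "question": {"type": "string", "minLength": 1},
        "context": {"type": "string"},
        "answer": {"type": "string", "minLength": 1},  # non-empty ground truth
    },
}

with open("qa_pairs.jsonl", encoding="utf-8") as f:
    for lineno, line in enumerate(f, start=1):
        try:
            validate(json.loads(line), RECORD_SCHEMA)
        except ValidationError as err:
            print(f"line {lineno}: {err.message}")
```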

Troubleshooting

  • Bad Import: If rows/fields are missing, check CSV delimiters/quotes or JSON structure.
  • No Retrieval: Ensure dataset is marked "active" and correctly attached to the RAG node.
  • Evaluation Errors: Make sure ground truths (answers) are non-empty and match expected output format.
  • Scaling Issues: For very large datasets, use the platform's chunked/streaming upload and consider cloud storage integration.
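
On the scaling point, the core idea is to never hold the whole file in memory. Here is a sketch that streams a JSONL dataset in fixed-size batches; the batch size is arbitrary, and the per-batch step is left as a placeholder.

```python
import json
from itertools import islice

# Sketch: iterate a large JSONL dataset in fixed-size batches rather than
# loading it all at once, mirroring a chunked/streaming upload.
def batches(path: str, size: int = 500):
    with open(path, encoding="utf-8") as f:
        while chunk := list(islice(f, size)):
            yield [json.loads(line) for line in chunk]

for i, batch in enumerate(batches("big_dataset.jsonl")):
    print(f"batch {i}: {len(batch)} records")  # replace with your upload/index step
```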

Datasets turn Forjinn workflows into production-grade AI systems—powering dynamic knowledge, reliable evaluation, and safe, repeatable automation.