Datasets: Data Management & RAG/Evaluation Integration
Datasets in InnoSynth-Forjinn provide the essential foundation for RAG (Retrieval Augmented Generation), code/data evaluation, benchmarking, fine-tuning, and advanced analytics. This guide explains how to organize, import, label, and use datasets, both for live agent workflows and for systematic testing or retraining.
What is a Dataset?
A Dataset is a structured (or semi-structured) collection of examples—typically consisting of:
- Text passages, Q&A pairs, document blobs
- Labeled examples and ground truth answers (for benchmarking/evaluation)
- File attachments, images, code snippets, etc. (for some tasks)
Datasets are used for:
- Knowledge retrieval (RAG flows)
- Automated and manual agent evaluation
- Training and fine-tuning agents or prompt chains
- Organizing knowledge/context for sub-workflows
Supported Formats & Dataset Types
- Text/CSV: Upload tabular data, with each row as a record (a column-mapping UI is supported).
- JSON/JSONL: Structured and scalable; each entry is a JSON object with fields like `question`, `context`, and `answer` (see the sketch after this list).
- PDF/Doc: Ingested and auto-chunked into text rows/fields.
- Image/File: Attachments/linked objects (supported for image-to-text or file-to-text tasks).
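To make the JSONL layout concrete, the sketch below writes a few records using the `question`/`context`/`answer` fields described above. The file name and record contents are hypothetical.

```python
import json

# Hypothetical Q&A records following the question/context/answer
# layout described above.
records = [
    {
        "question": "What is the refund window?",
        "context": "Orders can be refunded within 30 days of delivery.",
        "answer": "30 days",
    },
    {
        "question": "Which regions are supported?",
        "context": "The service is available in the EU and North America.",
        "answer": "The EU and North America",
    },
]

# JSONL stores one JSON object per line, which scales well for large datasets.
with open("qa_dataset.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```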
How to Create and Use Datasets
1. Create or Import a Dataset
- Go to Datasets in the sidebar, or use "Attach Dataset" in a RAG/document loader node.
- Click New Dataset or Import.
- Upload a CSV/JSON file or connect a cloud storage location (S3/GCS/etc.).
- Map data fields (e.g., set the `input` column as the question, `output` as the answer, and `context` for retrieval); a mapping sketch follows this list.
- Assign a clear name, type, and tags.
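If you prepare files outside the platform, you can apply the same column mapping up front. This is a minimal sketch assuming a CSV with hypothetical `input`, `output`, and `context` columns; the platform's mapping UI performs the equivalent step at import time.

```python
import csv
import json

# Map hypothetical CSV column names onto the dataset schema.
COLUMN_MAP = {"input": "question", "output": "answer", "context": "context"}

with open("raw_data.csv", newline="", encoding="utf-8") as src, \
     open("mapped_dataset.jsonl", "w", encoding="utf-8") as dst:
    for row in csv.DictReader(src):
        # Missing columns become empty strings rather than failing the import.
        record = {new: row.get(old, "") for old, new in COLUMN_MAP.items()}
        dst.write(json.dumps(record) + "\n")
```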
2. Labeling & Annotating
- For Q&A or evaluation: Use the platform UI for manual labeling, highlighting, and approval.
- Add multiple correct answers, partial scores, or notes if your evaluation workflow requires them (see the record sketch below).
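One way to represent multiple correct answers with partial credit is shown below. This is a hypothetical record layout for illustration, not the platform's exact annotation format.

```python
# Hypothetical annotated record: several acceptable answers, each with a
# partial-credit score, plus a free-form reviewer note.
annotated_record = {
    "question": "When was the service launched?",
    "answers": [
        {"text": "2021", "score": 1.0},         # fully correct
        {"text": "around 2021", "score": 0.5},  # acceptable, partial credit
    ],
    "note": "Exact year preferred; approximations earn half credit.",
}

def score(candidate: str) -> float:
    """Return the credit for a candidate answer, or 0.0 if nothing matches."""
    for answer in annotated_record["answers"]:
        if candidate.strip().lower() == answer["text"].lower():
            return answer["score"]
    return 0.0

print(score("2021"))  # 1.0
```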
3. Attaching Datasets to Flows/Nodes
- For RAG: In a Document Loader or Retriever node, select your dataset as the knowledge source.
- For Eval: In an Evaluator or Test Suite node, set the dataset as the evaluation corpus (ground truth).
- Chaining: Layer multiple datasets for complex/multi-context workflows; a hypothetical configuration sketch follows.
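Node attachment happens in the flow editor UI, but conceptually each node ends up holding a dataset reference. The dict below is purely illustrative: every key and value is hypothetical, not the platform's actual configuration format.

```python
# Purely illustrative; real nodes are configured in the flow editor,
# and all of these keys and values are hypothetical.
retriever_node = {
    "type": "Retriever",
    "dataset": "support-faq-v2",  # hypothetical dataset name
    "top_k": 4,                   # hypothetical number of chunks to retrieve
}

evaluator_node = {
    "type": "Evaluator",
    "ground_truth": "support-faq-v2",  # same dataset reused as eval corpus
    "metric": "exact_match",           # hypothetical metric identifier
}
```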
4. Example: Using Datasets for RAG & Evaluation
- Create a Dataset: Upload Q&A pairs as `input` (question), `context` (reference text), and `answer` (expected output).
- Configure Retriever/Loader: Point to the dataset as the knowledge source.
- Connect Evaluator Node: Set the dataset as ground truth so agent output is automatically compared with the correct answers (see the scoring sketch after this list).
- Export Results: Download or visualize pass/fail, accuracy, or full error analysis.
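The auto-comparison step boils down to scoring agent outputs against the stored answers. Here is a minimal sketch assuming exact-match scoring; `run_agent` is a hypothetical stand-in for invoking your actual flow.

```python
import json

def run_agent(question: str, context: str) -> str:
    """Hypothetical stand-in for calling your RAG flow."""
    return "..."  # replace with the real agent invocation

results = []
with open("qa_dataset.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        prediction = run_agent(record["question"], record["context"])
        # Exact match; swap in fuzzy or semantic scoring as needed.
        results.append(prediction.strip() == record["answer"].strip())

accuracy = sum(results) / len(results) if results else 0.0
print(f"Exact-match accuracy: {accuracy:.1%}")
```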
Best Practices
- Use version tags for evolving datasets; maintain strict validation schemas for production evals (a validation sketch follows this list).
- Use representative, diverse, and well-labeled data for meaningful RAG and test results.
- Keep sensitive data access controlled if working with user or business data.
- Document intended use, known bugs, and schema in metadata/thumbnail field.
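For the strict-validation practice, a schema check before import catches malformed records early. This sketch uses the third-party jsonschema package (`pip install jsonschema`); the schema mirrors the question/context/answer layout and is an assumption, not a platform requirement.

```python
import json
from jsonschema import validate, ValidationError  # pip install jsonschema

# Assumed record schema; adjust to your dataset's actual fields.
SCHEMA = {
    "type": "object",
    "properties": {
        "question": {"type": "string", "minLength": 1},
        "context": {"type": "string"},
        "answer": {"type": "string", "minLength": 1},
    },
    "required": ["question", "answer"],
}

with open("qa_dataset.jsonl", encoding="utf-8") as f:
    for line_no, line in enumerate(f, start=1):
        try:
            validate(instance=json.loads(line), schema=SCHEMA)
        except (ValidationError, json.JSONDecodeError) as err:
            print(f"Line {line_no}: {err}")
```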
Troubleshooting
- Bad Import: If rows/fields are missing, check the CSV delimiters/quotes or the JSON structure (see the delimiter-sniffing sketch after this list).
- No Retrieval: Ensure dataset is marked "active" and correctly attached to the RAG node.
- Evaluation Errors: Make sure ground truths (answers) are non-empty and match expected output format.
- Scaling Issues: For very large datasets, use the platform's chunked/streaming upload and consider cloud storage integration.
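For the bad-import case, Python's standard library can sniff the delimiter and quoting of a problem CSV before you re-upload it. A minimal diagnostic sketch; the file name is hypothetical.

```python
import csv

# Inspect a CSV that imported with missing rows or fields.
with open("problem_file.csv", newline="", encoding="utf-8") as f:
    sample = f.read(4096)
    dialect = csv.Sniffer().sniff(sample)  # raises csv.Error if undetectable
    print("delimiter:", repr(dialect.delimiter))
    print("quotechar:", repr(dialect.quotechar))

    f.seek(0)
    widths = {len(row) for row in csv.reader(f, dialect)}
    if len(widths) > 1:
        print("Inconsistent column counts per row:", sorted(widths))
```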
Datasets turn Forjinn workflows into production-grade AI systems—powering dynamic knowledge, reliable evaluation, and safe, repeatable automation.