Document Stores: Manage Data for RAG Pipelines
Document stores let you upload, process, and index documents into a vector search index so your flows can retrieve relevant content at query time.
A document store is a managed collection of documents that your flows can search by semantic similarity. You upload source documents (PDFs, web pages, spreadsheets, databases) and Forjinn chunks the text, generates embeddings, and loads everything into a vector store. Once indexed, any chatflow or agentflow can connect to the store and retrieve the most relevant passages in response to a user's question. This is the foundation of retrieval-augmented generation (RAG).
Core concepts
Before building a document store, it helps to understand the four layers involved:
Loaders
Components that read your source content. A loader knows how to extract text from a specific format — a PDF file, a website, a CSV, a database table, or an API endpoint.
Chunks
After loading, content is split into smaller pieces called chunks. Chunking means that retrieval at query time returns focused, relevant passages rather than entire documents.
Embeddings
Each chunk is converted into a numeric vector (an embedding) that captures its semantic meaning. Queries are also embedded, enabling similarity search rather than keyword matching.
Vector stores
The database that stores and indexes your embeddings. When a flow queries the document store, the vector store returns the chunks whose embeddings are closest to the query embedding.
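To make "closest to the query embedding" concrete, here is a minimal sketch of similarity search in plain Python. It assumes cosine similarity as the distance metric, which is common but depends on the vector store provider you choose. The function names (`cosine_similarity`, `top_k`) and the three-dimensional toy vectors are illustrative only and are not part of Forjinn.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def top_k(query_vec, indexed, k=2):
    """Return the k (score, text) pairs most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in indexed]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy embeddings (assumption): real models produce hundreds of dimensions.
indexed_chunks = [
    ("Refund policy: 30 days", [0.9, 0.1, 0.0]),
    ("Shipping takes 3-5 days", [0.1, 0.9, 0.1]),
    ("Returns require a receipt", [0.8, 0.2, 0.1]),
]
query_embedding = [1.0, 0.0, 0.0]

results = top_k(query_embedding, indexed_chunks, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

A production vector store uses an index structure rather than scoring every chunk, but the ranking idea is the same.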
Document store status
A document store moves through the following states as you work with it:
| Status | Meaning |
|---|---|
| EMPTY | The store exists but no loaders have been added yet. |
| NEW | A loader has been added but not yet processed. |
| SYNCING | Documents are currently being loaded and chunked. |
| SYNC | All loaders are processed and up to date. |
| STALE | One or more loaders have changed since the last process run. |
| UPSERTING | Chunks are being inserted into the vector store. |
| UPSERTED | All chunks are in the vector store and ready to query. |
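One way to reason about these states is as a small state machine. The sketch below models the statuses from the table as a Python enum. The transition map is an assumption inferred from the status descriptions, not a documented Forjinn contract, and `StoreStatus`, `ALLOWED_TRANSITIONS`, `can_transition`, and `is_queryable` are hypothetical names.

```python
from enum import Enum


class StoreStatus(Enum):
    EMPTY = "EMPTY"
    NEW = "NEW"
    SYNCING = "SYNCING"
    SYNC = "SYNC"
    STALE = "STALE"
    UPSERTING = "UPSERTING"
    UPSERTED = "UPSERTED"


# Assumed transitions, inferred from the table above (not an official list).
ALLOWED_TRANSITIONS = {
    StoreStatus.EMPTY: {StoreStatus.NEW},
    StoreStatus.NEW: {StoreStatus.SYNCING},
    StoreStatus.SYNCING: {StoreStatus.SYNC},
    StoreStatus.SYNC: {StoreStatus.STALE, StoreStatus.UPSERTING},
    StoreStatus.STALE: {StoreStatus.SYNCING},
    StoreStatus.UPSERTING: {StoreStatus.UPSERTED},
    StoreStatus.UPSERTED: {StoreStatus.STALE},
}


def can_transition(current: StoreStatus, nxt: StoreStatus) -> bool:
    """True if the assumed transition map allows moving from current to nxt."""
    return nxt in ALLOWED_TRANSITIONS[current]


def is_queryable(status: StoreStatus) -> bool:
    """Only a fully upserted store is ready to query."""
    return status is StoreStatus.UPSERTED


print(can_transition(StoreStatus.SYNC, StoreStatus.UPSERTING))
print(is_queryable(StoreStatus.SYNCING))
```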
Creating a document store
In the left sidebar, click Document Stores, then click Add New.
Give the store a name and an optional description. The description is used when Forjinn generates a tool definition for this store, so a clear description of what the store contains improves retrieval quality. Click Create.
Inside your new store, click Add Document Loader. Select the loader that matches your source:
- PDF / DOCX / CSV / JSON / TXT — upload a file directly
- Web scraper — provide a URL to crawl (supports Cheerio, Playwright, and Puppeteer scrapers)
- API loader — fetch content from an HTTP endpoint
- Unstructured file loader — handles a wide range of document formats
Configure the loader's settings, such as the file to upload or the URL to crawl, then click Preview to see how your content will be split before committing.
Choose a text splitter and set the chunk size (number of characters per chunk) and chunk overlap (characters shared between adjacent chunks). Overlap helps preserve context across chunk boundaries. Click Process to chunk the documents.
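The interaction between chunk size and chunk overlap is easier to see in code. The following is a minimal sketch of a fixed-width character splitter. It is not Forjinn's splitter: real text splitters typically prefer to break on separators such as paragraphs or sentences, and `split_text` is a hypothetical helper.

```python
def split_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into chunks of at most chunk_size characters.

    Adjacent chunks share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each new chunk advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks


chunks = split_text("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
print(chunks)
# → ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts `chunk_size - chunk_overlap` characters after the previous one, so a larger overlap produces more chunks for the same text.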
Click Configure Vector Store. Select your vector store provider and embedding model, then authenticate using a credential from your Credentials library. Click Save to store the configuration.
Click Upsert to generate embeddings for all chunks and insert them into the vector store. Forjinn runs this as a background job — you can monitor progress via the store's status badge.
Querying from the UI
Once the store is in UPSERTED status, you can test it directly from the document store detail page:
- Open the store and click Query.
- Type a natural-language question in the search box.
- Forjinn converts your query to an embedding, searches the vector store, and returns the most relevant chunks with their similarity scores.
Use this to validate retrieval quality before connecting the store to a flow.
Connecting a document store to a flow
In any chatflow or agentflow, add a Vector Store Retriever node and point it to your document store. The retriever node fetches the most relevant chunks for each incoming query and passes them to the language model as context.
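Conceptually, "passes them to the language model as context" means the retrieved chunks are placed into the prompt alongside the user's question. The sketch below shows one simple way to do that. The prompt wording and the `build_prompt` helper are assumptions for illustration, since the actual template depends on the nodes in your flow.

```python
def build_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Combine retrieved chunks and the user's question into one prompt."""
    # Number each chunk so the model (and you) can tell the passages apart.
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(retrieved_chunks, start=1)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Made-up example chunks, standing in for what a retriever would return.
prompt = build_prompt(
    "How long do refunds take?",
    ["Refunds are issued within 30 days.", "Returns require a receipt."],
)
print(prompt)
```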
If retrieval quality is poor, try reducing the chunk size so each chunk covers a narrower topic, or increasing the chunk overlap so relevant context is less likely to be split across a chunk boundary. Use Preview to compare different configurations before processing.
Refreshing content
When your source documents change, return to the document store and click Refresh on the relevant loader. Forjinn re-processes the source and updates only the chunks that have changed, keeping your vector store in sync without a full re-index.
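This page does not describe how changed chunks are detected. One common technique for incremental updates is to fingerprint each chunk with a content hash and compare the old and new sets. The sketch below illustrates that general idea only. It is an assumption, not a description of Forjinn's implementation, and `fingerprint` and `plan_refresh` are hypothetical names.

```python
import hashlib


def fingerprint(chunk: str) -> str:
    """Stable content hash used to tell whether a chunk has changed."""
    return hashlib.sha256(chunk.encode("utf-8")).hexdigest()


def plan_refresh(old_chunks: list[str], new_chunks: list[str]) -> dict[str, list[str]]:
    """Work out which chunks to add, delete, or keep after a source changes."""
    old = {fingerprint(c): c for c in old_chunks}
    new = {fingerprint(c): c for c in new_chunks}
    return {
        "add": [c for h, c in new.items() if h not in old],
        "delete": [c for h, c in old.items() if h not in new],
        "keep": [c for h, c in new.items() if h in old],
    }


plan = plan_refresh(
    ["intro", "pricing v1", "contact"],
    ["intro", "pricing v2", "contact"],
)
print(plan)
# → {'add': ['pricing v2'], 'delete': ['pricing v1'], 'keep': ['intro', 'contact']}
```

Only the chunks under "add" would need new embeddings, which is why an incremental refresh can be cheaper than a full re-index.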
Forjinn tracks which flows are using each document store. If you delete a store that is referenced by an active flow, the flow's retrieval nodes will fail until you update them to point to a different store.