Speech Text Processing
Learn about speech and text processing and how to implement it effectively.
Last updated: 12/9/2025
Speech & Text Processing Components
AI-powered speech and text utilities unlock voice interfaces, advanced parsing, and smarter context management in your Forjinn workflows. This guide documents built-in Speech-to-Text, Text Splitter, and related modules.
Speech-to-Text (STT) Nodes
These nodes convert audio (uploaded or streamed) into usable, searchable text.
Supported Providers
- OpenAI Whisper: Cloud-based, multi-language, high accuracy; used for both uploaded files and real-time audio
- AssemblyAI, Google STT, and others: Alternative cloud STT providers, available when enabled
Inputs
- Audio file (wav, mp3, ogg, etc.) or audio stream
- Language and prompt/context hints (optional)
Outputs
- Transcribed text
- Optional: word-level timing, speaker diarization, confidence scores
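To make the output shape concrete, here is a minimal sketch of how a transcript with the optional fields might be modeled and queried downstream. The `Word` and `Transcript` classes, their field names, and the 0.6 threshold are assumptions for illustration only; they are not the actual schema emitted by the Forjinn STT nodes or by any provider.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Word:
    """One transcribed word (hypothetical shape, not a provider schema)."""
    text: str
    start: float                   # seconds from the start of the audio
    end: float
    confidence: float              # 0.0 (low) to 1.0 (high)
    speaker: Optional[str] = None  # set only when diarization is enabled


@dataclass
class Transcript:
    text: str
    words: List[Word] = field(default_factory=list)


def low_confidence_words(transcript: Transcript, threshold: float = 0.6) -> List[Word]:
    """Return words whose confidence falls below the threshold."""
    return [w for w in transcript.words if w.confidence < threshold]


def text_by_speaker(transcript: Transcript) -> Dict[str, str]:
    """Group word text by speaker label, preserving word order."""
    grouped: Dict[str, List[str]] = {}
    for w in transcript.words:
        grouped.setdefault(w.speaker or "unknown", []).append(w.text)
    return {speaker: " ".join(words) for speaker, words in grouped.items()}
```

A downstream node could, for example, flag low-confidence words for human review before the transcript is indexed.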
Use Cases
- Voice chatbots/IVR
- Meeting/call transcription
- Multimodal data search (voice+text context)
Text Splitters
Text Splitter nodes break long documents or user input into chunks, a critical step for RAG, search, and LLM context building.
Options
- By Characters: Fixed number of chars per chunk
- By Tokens: Fixed number of tokens per chunk (preferred for LLMs, e.g., 500 tokens/chunk)
- By Sentence/Paragraph: Use built-in or custom splitter logic
- Overlap: Optional amount of text shared between adjacent chunks to retain context across splits
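The options above can be sketched in plain Python. This is an illustrative sketch, not the Text Splitter node's implementation: the function names, default sizes, and the simple punctuation-based sentence rule are all assumptions. Token-based splitting would apply the same sliding window to token IDs from the target model's tokenizer instead of to characters.

```python
import re
from typing import List


def split_by_chars(text: str, chunk_size: int = 1000, overlap: int = 100) -> List[str]:
    """Fixed-size character chunks; adjacent chunks share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


def split_by_sentences(text: str, max_chars: int = 1000) -> List[str]:
    """Pack whole sentences into chunks of at most `max_chars` characters.

    A single sentence longer than `max_chars` is kept intact as its own chunk.
    Sentence boundaries are approximated by '.', '!' or '?' followed by whitespace.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

With `chunk_size=4` and `overlap=1`, the string `"abcdefghij"` yields `["abcd", "defg", "ghij"]`: each chunk repeats the last character of the previous one.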
Use Cases
- Pre-indexing documents for vector search
- Breaking up chat logs, lengthy user submissions, and similar long inputs
Text Processing Utilities
- Normalizer/Cleaner: Fixes whitespace, encoding, and character-set issues
- Summarizer: Quick auto-summarize for long or multi-part docs
- Keyword Extraction: Highlight or extract key topics/entities
- Regex/Custom Parsers: Build or use provided plugins for special cases
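As a rough illustration of what a normalizer and a custom regex parser might do, here is a sketch using only the Python standard library. The function names and the `ORD-` followed by six digits ID format are hypothetical and chosen for the example; they are not part of the built-in nodes.

```python
import re
import unicodedata
from typing import List


def normalize_text(text: str) -> str:
    """Normalize Unicode forms, drop control characters, and tidy whitespace."""
    # NFKC folds compatibility characters (non-breaking spaces, ligatures,
    # full-width letters) into their plain equivalents.
    text = unicodedata.normalize("NFKC", text)
    # Remove control characters, keeping newlines and tabs.
    text = "".join(
        ch for ch in text if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )
    text = re.sub(r"[ \t]+", " ", text)     # collapse runs of spaces and tabs
    text = re.sub(r" ?\n ?", "\n", text)    # trim spaces around line breaks
    text = re.sub(r"\n{3,}", "\n\n", text)  # at most one blank line in a row
    return text.strip()


# Hypothetical order-ID format used only for this example.
ORDER_ID = re.compile(r"\bORD-\d{6}\b")


def extract_order_ids(text: str) -> List[str]:
    """Return all order IDs found in the text, in order of appearance."""
    return ORDER_ID.findall(text)
```

Running the normalizer before a regex parser or an embedding step keeps equivalent strings (for example, text with non-breaking spaces versus regular spaces) from being treated as different.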
Example: Voice-Driven Support Bot
- User uploads a voice memo
- Speech-to-Text node transcribes it to text
- Text Splitter node chunks the transcript (if long)
- LLM/agent node answers or routes based on the transcript
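The flow above can be sketched as a chain of three functions. Everything here is a stand-in: `transcribe` returns canned text instead of calling an STT provider, and `route` uses naive keyword matching where a real workflow would call an LLM or agent node. All names and the 2000-character limit are assumptions made for illustration.

```python
from typing import Callable, List


def transcribe(audio_path: str) -> str:
    """Placeholder for the Speech-to-Text node; returns canned text."""
    return "My invoice is wrong. I was charged twice this month."


def split_if_long(text: str, max_chars: int = 2000) -> List[str]:
    """Placeholder for the Text Splitter node: chunk only when needed."""
    if len(text) <= max_chars:
        return [text]
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


def route(chunks: List[str]) -> str:
    """Placeholder for the LLM/agent node: naive keyword routing."""
    joined = " ".join(chunks).lower()
    if "invoice" in joined or "charged" in joined:
        return "billing"
    return "general"


def handle_voice_memo(audio_path: str, stt: Callable[[str], str] = transcribe) -> str:
    """Run the three steps in order and return the chosen route."""
    transcript = stt(audio_path)
    chunks = split_if_long(transcript)
    return route(chunks)
```

Passing the STT step in as a parameter keeps the pipeline testable without audio files or network access.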
Best Practices
- Always convert audio to a supported format before uploading
- Set the correct language and prompt to improve STT accuracy
- Tune the chunking strategy to the task: small chunks with overlap for retrieval, longer chunks for document summarization
- Chain a normalizer before embedding for best search results
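Two of these practices lend themselves to a small sketch: checking the audio format before upload, and keeping per-task chunking presets in one place. The extension set below covers only the formats named in this guide; the full supported list may be longer. The preset numbers are illustrative placeholders (500 tokens echoes the example given for token-based splitting), not recommended or default values.

```python
from dataclasses import dataclass
from pathlib import Path

# Formats named in this guide; treat the set as an assumption, not the full list.
SUPPORTED_AUDIO = {".wav", ".mp3", ".ogg"}


def is_supported_audio(path: str) -> bool:
    """Check the file extension (case-insensitive) before uploading."""
    return Path(path).suffix.lower() in SUPPORTED_AUDIO


@dataclass(frozen=True)
class ChunkingPreset:
    chunk_tokens: int
    overlap_tokens: int


# Illustrative values only: small chunks with overlap for retrieval,
# longer chunks without overlap for summarization.
PRESETS = {
    "retrieval": ChunkingPreset(chunk_tokens=500, overlap_tokens=50),
    "summarization": ChunkingPreset(chunk_tokens=2000, overlap_tokens=0),
}


def preset_for(task: str) -> ChunkingPreset:
    """Look up the chunking preset for a task name."""
    try:
        return PRESETS[task]
    except KeyError:
        raise ValueError(
            f"unknown task {task!r}; expected one of {sorted(PRESETS)}"
        ) from None
```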
Troubleshooting
- Poor accuracy: Try another provider or adjust input format/preprocessing
- Missing chunks: Lower the chunk size, increase the overlap, and check the logs for parse errors
- Performance: For long docs/audio, use batch or background jobs to avoid blocking UI
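One way to keep long jobs off the calling thread is a thread pool from the Python standard library, sketched below. This is a generic pattern under assumed names (`submit_batch`, `collect`), not how Forjinn schedules work internally. The `worker` argument stands in for whatever slow step is being run, such as transcription or chunking.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, List

# A shared pool; the worker count is an arbitrary choice for the example.
executor = ThreadPoolExecutor(max_workers=4)


def submit_batch(items: List[str], worker: Callable[[str], str]) -> List[Future]:
    """Queue one background job per item and return immediately."""
    return [executor.submit(worker, item) for item in items]


def collect(futures: List[Future]) -> List[str]:
    """Wait for all jobs and gather results in submission order.

    A failed job is reported as an 'error: ...' string instead of
    aborting the whole batch.
    """
    results = []
    for future in futures:
        try:
            results.append(future.result())
        except Exception as exc:
            results.append(f"error: {exc}")
    return results
```

Because `submit_batch` returns right away, the caller can keep responding to the user and poll `future.done()` or call `collect` later, once the results are actually needed.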
Speech and advanced text nodes power truly multimodal, robust, and flexible automations for both human and machine users.