Performance Optimization & Scaling

Optimizing performance in Forjinn ensures your AI workflows respond quickly, scale smoothly, and stay within budget. The platform supports powerful metrics analysis and advanced tuning—these approaches apply to both self-hosted and cloud deployments.

Key Metrics & What to Watch

Request Latency: End-to-end time for predictions and API calls. Aim: <2s interactive, <500ms API.
Token Usage: Track LLM tokens per call/flow for budget control.
Active Workers: Number of concurrent job processors. Adjust for traffic.
Memory/CPU Consumption: Especially for large models/tools.
Cache Hit Rate: For retrievers, datasets, embedding generators.
API Throttle/Ratelimit: Avoid user-side wait times or skipped jobs.
Chat Builder Response Time: Monitor conversational agent latency separately from Visual Canvas flows.

Profiling and Benchmarking

Use platform metrics dashboard (if enabled) for high-level view.
Enable detailed log/tracing per node—see which step/agent adds delay.
Compare flow performance pre/post tweak: clone, run batch, measure.
For code heavy flows, add timing in Custom Function nodes.

Scaling Patterns

Caching

Use Redis, Momento, or platform in-memory cache for repeated retrievals/chunks.
Cache agent results for common/expensive prompts.

Load Balancing

In Docker: Use Traefik/Nginx/HAProxy in front of multiple containers.
In K8s: Set up Horizontal Pod Autoscaler (+ Worker Pool for heavy jobs).
Separate web/API from worker nodes—let heavy jobs queue and process asynchronously.
Route API Gateway traffic separately for higher-throughput endpoints.

Request Batching

In batch workflows (eval, retriever), send/score multiple samples in one LLM/API call to cut roundtrips.

Agent Framework Optimization

Google ADK

Leverages Google's infrastructure for optimized Gemini model routing
Best performance on GCP deployments with regional model endpoints

CrewAI

Role-based agent pools may consume more memory; size worker pods accordingly
Use task caching for repeated crew operations

AutoGen

Conversational agent chains benefit from reduced model roundtrips
Enable streaming for better perceived latency in long conversations

API Gateway Performance

Configure appropriate rate limits per endpoint to prevent abuse without throttling legitimate traffic
Use response caching for endpoints that return frequently-requested but static data
Monitor upstream flow latency separately for gateway vs. direct executions

Chat Builder Performance

Chat Builder sessions are optimized for low-latency conversational responses
Enable streaming responses for better user experience with longer model outputs
Memory components should be sized appropriately to avoid token budget exhaustion

Config Best Practices

Tune LLM/retriever chunk sizes, prompt length, and model params for speed/quality balance.
Use persistent external DB and file storage for high-availability setups.
Leverage platform's built-in prefetch/paginate for large dataset or report queries.

Troubleshooting Slow Performance

Node bottleneck: See logs for slowest step in execution trace.
LLM rate limits: Reduce concurrency or use secondary provider for spillover.
Database slowness: Ensure DB is on SSD, scale up vertical resources.
Cache misses: Check cache deployment and connection status.
API Gateway latency: Check rate limit configs and upstream flow health.

Advanced/Enterprise Scaling

Use managed cloud databases (RDS, CloudSQL, CosmosDB) for production.
Integrate with Prometheus, Grafana for deep custom metrics.
Periodically restart workers/pods to avoid memory leaks/GC pauses for long-running tasks.
Deploy dedicated API Gateway pods for high-throughput endpoint handling.

Scaling is about identifying real-world bottlenecks, automating what you can, and always measuring before optimizing. Confidently run Forjinn for thousands of users/flows with the right config and monitoring in place.

Performance Optimization & Scaling

On this page