Monitoring & Alerting Guide

Proactive monitoring and alerting are vital to keeping InnoSynth-Forjinn reliable, performant, and secure. This guide covers key metrics, integration options, alert setup, and best practices for both cloud and on-prem deployments.

Metrics and Logging

Core Platform Metrics

API/Agent Latency
Active Requests/Queue Depth
Worker/Pod CPU & Memory
LLM/Token Usage (flow, workspace, org, global)
Error Rates (per flow, node, agent)
Disk Usage (uploads, logs, DB)

Log Types

Access logs (user logins, API usage)
System logs (errors, crashes, restarts)
Audit logs (admin actions, sensitive changes)
Application/agent execution traces

Monitoring Integrations

Prometheus: Native metrics via /metrics endpoint.
- Scrape flows, workers, system containers.
- Expose to Grafana or another dashboard for visualization.
ELK (Elasticsearch/Logstash/Kibana): Ship logs for search and alerting.
Cloud Logging: Export to AWS CloudWatch, GCP Logging, or Azure Monitor.
Application Performance Monitoring (APM): Datadog, New Relic, and Sentry are supported for error and trace capture.
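To illustrate what a scraper sees at the /metrics endpoint, here is a minimal Python sketch that parses Prometheus text-format samples and derives an error ratio. The metric names (forjinn_http_requests_total, forjinn_queue_depth) and the sample payload are hypothetical, not names the platform is known to expose, and the parser assumes samples carry no trailing timestamp.

```python
from typing import Dict


def parse_metrics(text: str) -> Dict[str, float]:
    """Parse simple Prometheus text-format samples into {series: value}."""
    samples: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # Split on the last space so spaces inside label values survive.
        # Assumption: no optional timestamp follows the value.
        series, _, value = line.rpartition(" ")
        samples[series.strip()] = float(value)
    return samples


# Hypothetical payload; real metric names depend on the deployment.
SAMPLE = """\
# HELP forjinn_http_requests_total Hypothetical request counter.
# TYPE forjinn_http_requests_total counter
forjinn_http_requests_total{status="200"} 1027
forjinn_http_requests_total{status="500"} 3
forjinn_queue_depth 12
"""

metrics = parse_metrics(SAMPLE)
errors = metrics['forjinn_http_requests_total{status="500"}']
total = errors + metrics['forjinn_http_requests_total{status="200"}']
error_ratio = errors / total
print(f"queue depth: {metrics['forjinn_queue_depth']}, error ratio: {error_ratio:.4f}")
```

In a real deployment Prometheus does this parsing itself; a script like this is mainly useful for spot-checking that the endpoint returns the series you expect.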

Example: Prometheus + Grafana Setup

Enable metrics on the platform (PROMETHEUS_METRICS_ENABLED=true)
Add a scrape config for the platform container
Import prebuilt Grafana dashboards (if available)
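The scrape-config step above can be sketched as a small Python helper that renders a minimal Prometheus scrape_configs entry. The keys (job_name, metrics_path, scrape_interval, static_configs, targets) are standard Prometheus configuration fields; the job name, host, and port used here are assumptions and should be replaced with the values from your deployment.

```python
def render_scrape_config(
    job_name: str,
    target: str,
    metrics_path: str = "/metrics",
    interval: str = "15s",
) -> str:
    """Render a minimal Prometheus scrape_configs section as YAML text."""
    lines = [
        "scrape_configs:",
        f"  - job_name: {job_name}",
        f"    metrics_path: {metrics_path}",
        f"    scrape_interval: {interval}",
        "    static_configs:",
        f"      - targets: ['{target}']",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical job name and host:port; adjust to your environment.
config_text = render_scrape_config("forjinn-platform", "forjinn-platform:3000")
print(config_text)
```

The printed text can be merged into prometheus.yml. Generating it from code is only a convenience when many environments share the same shape; a hand-written file works equally well.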

Alerting

Built-in Alert Rules: For example, “more than X errors in Y minutes,” “high latency,” “failed logins,” and “low disk.”
Alert Channels: E-mail, Slack, PagerDuty, custom webhook
Custom Rules: Configure via your monitoring/alerting provider or scripts.
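As an illustration of how a custom rule of the form "more than X errors in Y minutes" might be scripted, the following Python sketch keeps a sliding window of error timestamps. The class name and thresholds are hypothetical, and this is not the platform's built-in implementation; it only shows the logic such a rule encodes.

```python
from collections import deque
from typing import Deque


class ErrorRateRule:
    """Fire when more than `max_errors` errors occur within `window_seconds`."""

    def __init__(self, max_errors: int, window_seconds: float) -> None:
        self.max_errors = max_errors
        self.window_seconds = window_seconds
        self._timestamps: Deque[float] = deque()

    def record_error(self, now: float) -> bool:
        """Record one error at time `now`; return True if the rule fires."""
        self._timestamps.append(now)
        # Drop errors that have aged out of the window.
        cutoff = now - self.window_seconds
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps) > self.max_errors


# Hypothetical thresholds: more than 3 errors within 5 minutes.
rule = ErrorRateRule(max_errors=3, window_seconds=300)
fired = [rule.record_error(t) for t in (0, 60, 120, 180, 600)]
print(fired)
```

Here the fourth error (at 180 seconds) trips the rule, while the error at 600 seconds does not, because the earlier errors have left the window. A script like this would then post to one of the alert channels listed above.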

Best Practices

Monitor all production and staging deployments
Alert on what matters most: availability, errors, and resource exhaustion
Regularly test your alerts and revise thresholds as you scale
Back up monitoring configs and dashboards

Troubleshooting

No metrics/alerts? Check endpoints, scrape configs, and environment variables
Too many alerts (“noise”)? Raise thresholds, add suppression, and alert only on persistent or critical events
Missing log entries? Verify log rotation configs and storage health
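The noise-reduction advice above (alert only on persistent events) can be sketched in Python as a condition that must hold continuously for a set duration before it fires, similar in spirit to the `for` clause of a Prometheus alerting rule. The class name, threshold, and hold time are hypothetical values chosen for illustration.

```python
from typing import Optional


class PersistentCondition:
    """Report a breach only after it has held for `hold_seconds` continuously."""

    def __init__(self, threshold: float, hold_seconds: float) -> None:
        self.threshold = threshold
        self.hold_seconds = hold_seconds
        self._breach_started: Optional[float] = None

    def observe(self, value: float, now: float) -> bool:
        """Feed one reading taken at time `now`; return True if an alert should fire."""
        if value <= self.threshold:
            # Back under the threshold: reset the timer and suppress the alert.
            self._breach_started = None
            return False
        if self._breach_started is None:
            self._breach_started = now
        return now - self._breach_started >= self.hold_seconds


# Hypothetical rule: CPU above 90% for at least 2 minutes.
cpu_alert = PersistentCondition(threshold=0.90, hold_seconds=120)
readings = [(0, 0.95), (60, 0.97), (90, 0.50), (120, 0.96), (180, 0.99), (240, 0.98)]
decisions = [cpu_alert.observe(value, now) for now, value in readings]
print(decisions)
```

The brief dip at 90 seconds resets the timer, so the alert fires only at 240 seconds, once the breach has lasted a full two minutes. Short spikes never page anyone.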

Strong monitoring and alerting are the foundation of reliable AI ops. Never go live without them.
