Monitoring Alerting
Learn about monitoring and alerting and how to implement them effectively.
Last updated: 12/9/2025
Monitoring & Alerting Guide
Proactive monitoring and alerting are vital to keeping InnoSynth-Forjinn reliable, performant, and secure. This guide covers key metrics, integration options, alert setup, and best practices for both cloud and on-prem deployments.
Metrics and Logging
Core Platform Metrics
- API/Agent Latency
- Active Requests/Queue Depth
- Worker/Pod CPU & Memory
- LLM/Token Usage (flow, workspace, org, global)
- Error Rates (per flow, node, agent)
- Disk Usage (uploads, logs, DB)
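As an illustration of how two of these metrics are commonly derived, here is a minimal sketch that computes a p95 latency and an error rate from a batch of request samples. The `RequestSample` type and its field names are assumptions made for this example, not part of the platform's API.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestSample:
    """One observed request; the field names are illustrative only."""
    latency_ms: float
    ok: bool


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list, for an integer pct in 1..100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(samples):
    """Return the p95 latency and the error rate for a non-empty batch."""
    latencies = [s.latency_ms for s in samples]
    errors = sum(1 for s in samples if not s.ok)
    return {
        "p95_latency_ms": percentile(latencies, 95),
        "error_rate": errors / len(samples),
    }


# Usage: 20 requests with latencies 10, 20, ..., 200 ms; the two slowest failed.
batch = [RequestSample(latency_ms=10.0 * i, ok=i <= 18) for i in range(1, 21)]
summary = summarize(batch)
```

In practice a metrics backend usually computes percentiles from histograms; the direct calculation above is only meant to show what the numbers represent.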
Log Types
- Access logs (user logins, API usage)
- System logs (errors, crashes, restarts)
- Audit logs (admin actions, sensitive changes)
- Application/agent execution traces
Monitoring Integrations
- Prometheus: Native metrics via the /metrics endpoint.
  - Scrape flows, workers, system containers.
  - Expose to Grafana or another dashboard for visualization.
- ELK (Elasticsearch/Logstash/Kibana): Ship logs for search, alerting.
- Cloud Logging: Export to AWS CloudWatch, GCP Logging, or Azure Monitor.
- Application Performance Monitoring (APM): Datadog, New Relic, and Sentry are supported for error and trace capture.
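To make the /metrics integration more concrete, the sketch below reads Prometheus text-format output using only the Python standard library. The metric names in `SAMPLE` are hypothetical, since the guide does not list the platform's actual metric names. The parser is deliberately simplified: it ignores timestamps and does not handle escaped characters or commas inside label values.

```python
import urllib.request


def fetch_metrics(url, timeout=5):
    """Download the raw text from a metrics endpoint (the URL is caller-supplied)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def parse_metrics(text):
    """Parse simple text-format lines into {(name, labels): value}.

    Handles `name value` and `name{k="v",...} value`; skips comments and blanks.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        series, _, value = line.rpartition(" ")
        name, _, label_part = series.partition("{")
        labels = ()
        if label_part:
            pairs = label_part.rstrip("}").split(",")
            labels = tuple(sorted(
                (key.strip(), val.strip().strip('"'))
                for key, val in (p.split("=", 1) for p in pairs if p)
            ))
        samples[(name.strip(), labels)] = float(value)
    return samples


# Hypothetical endpoint output, for illustration only.
SAMPLE = """\
# HELP forjinn_requests_total Total requests (hypothetical metric name).
# TYPE forjinn_requests_total counter
forjinn_requests_total{flow="demo",status="error"} 3
forjinn_requests_total{flow="demo",status="ok"} 97
forjinn_queue_depth 12
"""

metrics = parse_metrics(SAMPLE)
```

A quick script like this can help confirm that an endpoint is exposing the expected series before wiring it into a full Prometheus and Grafana stack.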
Example: Prometheus + Grafana Setup
- Enable metrics on platform (PROMETHEUS_METRICS_ENABLED=true)
- Add scrape config for platform container
- Import prebuilt Grafana dashboards (if available)
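The scrape-config step might look like the following sketch, which renders a minimal Prometheus `scrape_configs` fragment as YAML text. The keys (`job_name`, `metrics_path`, `scrape_interval`, `static_configs`, `targets`) are standard Prometheus configuration fields. The job name and the `platform:3000` target are placeholders, as the guide does not state the container's host or port.

```python
def render_scrape_config(job_name, targets, metrics_path="/metrics", interval="15s"):
    """Return a minimal Prometheus scrape_configs fragment as YAML text."""
    target_list = ", ".join(f'"{t}"' for t in targets)
    return (
        "scrape_configs:\n"
        f'  - job_name: "{job_name}"\n'
        f"    metrics_path: {metrics_path}\n"
        f"    scrape_interval: {interval}\n"
        "    static_configs:\n"
        f"      - targets: [{target_list}]\n"
    )


# Placeholder job name and target; replace with your deployment's values.
config_text = render_scrape_config("forjinn-platform", ["platform:3000"])
print(config_text)
```

The printed fragment can be merged into an existing prometheus.yml. If that file already has a `scrape_configs` section, only the job entry needs to be added under it.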
Alerting
- Built-in Alert Rules: E.g., “>X errors in Y min,” “high latency,” “failed logins,” “low disk.”
- Alert Channels: E-mail, Slack, PagerDuty, custom webhook
- Custom Rules: Configure via your monitoring/alerting provider or scripts.
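A rule of the form ">X errors in Y min" can be sketched as a sliding-window counter. The example below is one possible custom-rule script under stated assumptions, not the platform's built-in implementation. The class name and the webhook payload fields are hypothetical.

```python
import json
from collections import deque


class ErrorBurstRule:
    """Fires when more than `max_errors` errors occur within `window_s` seconds."""

    def __init__(self, max_errors, window_s):
        self.max_errors = max_errors
        self.window_s = window_s
        self._timestamps = deque()

    def record_error(self, now):
        """Record an error at time `now` (seconds); return True if the rule fires."""
        self._timestamps.append(now)
        # Drop errors that have aged out of the window.
        while self._timestamps and self._timestamps[0] <= now - self.window_s:
            self._timestamps.popleft()
        return len(self._timestamps) > self.max_errors


def build_webhook_payload(rule_name, max_errors, window_s):
    """Build a JSON body for a custom webhook; the field names are illustrative."""
    return json.dumps({
        "rule": rule_name,
        "message": f"more than {max_errors} errors in {window_s} s",
        "severity": "critical",
    })


# Usage: alert on more than 3 errors within 5 minutes.
rule = ErrorBurstRule(max_errors=3, window_s=300)
fired = [rule.record_error(t) for t in (0, 60, 120, 180, 1000)]
payload = build_webhook_payload("error-burst", 3, 300)
```

Passing the timestamp in explicitly keeps the rule easy to test. A real script would typically pass `time.time()` and POST the payload to the configured channel.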
Best Practices
- Monitor all production and staging deployments
- Alert on what matters most: availability, errors, and resource exhaustion
- Regularly test your alerts and revise thresholds as you scale
- Back up monitoring configs and dashboards
Troubleshooting
- No metrics/alerts? Check endpoints, scrape configs, environment variables
- Too many alerts (“noise”)? Increase thresholds, add suppression, alert only on persistent/critical events
- Missing log entries? Verify log rotation configs and storage health
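For the alert-noise case, one common approach is to fire only after a condition has persisted for several consecutive checks and then to suppress repeats during a cooldown. The sketch below illustrates that idea. The class and parameter names are assumptions for this example, and most monitoring providers offer equivalent settings natively.

```python
class PersistentAlert:
    """Fires only after `required_breaches` consecutive breaches,
    and at most once per `cooldown_s` seconds."""

    def __init__(self, required_breaches, cooldown_s):
        self.required_breaches = required_breaches
        self.cooldown_s = cooldown_s
        self._streak = 0
        self._last_fired = None

    def evaluate(self, breached, now):
        """Feed one check result at time `now` (seconds); return True to notify."""
        if not breached:
            # Any healthy check resets the streak, filtering out brief spikes.
            self._streak = 0
            return False
        self._streak += 1
        if self._streak < self.required_breaches:
            return False
        in_cooldown = (
            self._last_fired is not None
            and now - self._last_fired < self.cooldown_s
        )
        if in_cooldown:
            return False
        self._last_fired = now
        return True


# Usage: notify after 3 consecutive breaches, then stay quiet for 10 minutes.
alert = PersistentAlert(required_breaches=3, cooldown_s=600)
checks = [(True, 0), (True, 60), (True, 120), (True, 180), (False, 240),
          (True, 300), (True, 360), (True, 420), (True, 720)]
results = [alert.evaluate(breached, now) for breached, now in checks]
```

Here the first notification goes out at t=120, the breaches at t=180 and t=420 are suppressed by the cooldown, and the next notification is sent at t=720.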
Strong monitoring and alerting are the foundation of reliable AI ops. Never go live without them.