To outline our approach based on the monitoring tools potentially in place:
Prometheus and Grafana: If this setup is in use, we can deploy Prometheus node agents (e.g., node-exporter) on all Kubernetes nodes to collect metrics such as CPU and memory, and configure alerts in Grafana with Prometheus as the data source.
Managed Services: For managed resources such as Kafka and MemoryDB, we'll route their CloudWatch metrics into Grafana for alerting.
Functional Alerts: For functional checks (e.g., failed runs), we'll push metrics to Prometheus via a connector and alert on them in Grafana; see the sketch after this list.
Datadog/Splunk: If Datadog or Splunk is the monitoring tool instead, we can deploy its agents (e.g., the Datadog Agent) to monitor the entire infrastructure.
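As a minimal sketch of the functional-alert flow, the snippet below pushes custom metrics that Grafana can alert on. It assumes a Prometheus Pushgateway acts as the "connector"; the gateway address, job name, and metric names are hypothetical placeholders, not agreed names.

```python
# Minimal sketch, assuming a Prometheus Pushgateway at pushgateway:9091 is the
# "connector"; job and metric names below are hypothetical placeholders.
from prometheus_client import CollectorRegistry, Counter, Gauge, push_to_gateway

registry = CollectorRegistry()

# Count failed runs of a functional check so Grafana can alert on increases.
failed_runs = Counter(
    "functional_check_failed_runs_total",
    "Number of failed runs reported by the functional check",
    registry=registry,
)

# Record when the check last completed, useful for "stale job / no data" alerts.
last_run = Gauge(
    "functional_check_last_run_timestamp_seconds",
    "Unix timestamp of the last completed functional check",
    registry=registry,
)

def report_run(succeeded: bool) -> None:
    if not succeeded:
        failed_runs.inc()
    last_run.set_to_current_time()
    # Push the registry so Prometheus scrapes it from the Pushgateway.
    push_to_gateway("pushgateway:9091", job="functional-checks", registry=registry)
```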
Alerts
| Alert Name | Threshold | Severity |
| --- | --- | --- |
|  | 70% | L1 Warning |
|  | 90% | L2 Critical |
|  | 10 Gb | L1 Warning |
|  | 5 Gb | L2 Critical |
|  | < 50 | L1 Warning |
|  | < 25 | L2 Critical |
|  | > 3 sec | L1 Warning |
|  | > 5 sec | L2 Critical |
|  | > 3 sec | L1 Warning |
|  | > 5 sec | L2 Critical |
|  | 85 | L1 Warning |
|  | 90 | L2 Critical |
|  | 85 | L1 Warning |
|  | 90 | L2 Critical |
|  | < 15 | L2 Critical |
|  | > 3 | L2 Critical |
|  | 80% | L1 Warning |
|  | 90% | L2 Critical |
|  | 80% | L1 Warning |
|  | 90% | L2 Critical |
|  | > 500 | L1 Warning |
|  | > 1000 | L2 Critical |
|  | > 1000 | L2 Critical |
|  | <= 0 | L2 Critical |
|  | > 1 min | L2 Critical |
|  | > 5 | L2 Critical |
|  | 1 | L2 Critical |
|  | > 80% | L1 Warning |
|  | > 90% | L2 Critical |
|  | > 80% | L1 Warning |
|  | > 90% | L2 Critical |
|  | 1 | L2 Critical |
|  | 4 min | L1 Warning |
|  | 5 min | L2 Critical |
|  | > 80% | L1 Warning |
|  | > 90% | L2 Critical |
|  | < 20% | L1 Warning |
|  | < 10% | L2 Critical |
|  | 1 | L2 Critical |
|  | > 0 | L2 Critical |
|  | 0 | L2 Critical |
|  | > 50 hits within 5 minutes | L1 Warning |
|  | > 10 hits within 1 minute | L1 Warning |
|  | > 10 hits within 1 minute | L1 Warning |
|  | > 10 hits within 5 minutes | L1 Warning |
|  | > 5 hits within 5 minutes | L1 Warning |
|  | > 5 hits within 5 minutes | L1 Warning |
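Many of the thresholds above come in L1 Warning / L2 Critical pairs over the same metric. As a minimal sketch of how such a pair could be expressed as Prometheus alerting rules, the snippet below builds one warning and one critical rule from a shared expression; the alert name and the node-exporter CPU expression are illustrative assumptions, not the actual alerts in the table.

```python
# Minimal sketch: render an L1 Warning / L2 Critical rule pair in Prometheus
# alerting-rule structure. The alert name and PromQL expression are illustrative
# assumptions (node-exporter CPU utilisation), not the alerts from the table above.
import json

def rule_pair(alert: str, expr: str, warning: str, critical: str, duration: str = "5m"):
    """Build one warning rule and one critical rule over the same expression."""
    def rule(severity: str, threshold: str) -> dict:
        return {
            "alert": f"{alert}{severity.capitalize()}",
            "expr": f"{expr} > {threshold}",
            "for": duration,
            "labels": {"severity": severity},
            "annotations": {"summary": f"{alert} above {threshold} for {duration}"},
        }
    return [rule("warning", warning), rule("critical", critical)]

# Hypothetical example: node CPU utilisation with 80% warning / 90% critical thresholds.
cpu_expr = '100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)'
groups = {"groups": [{"name": "node-alerts", "rules": rule_pair("NodeCpuUtilisation", cpu_expr, "80", "90")}]}

# Serialise (to YAML in practice; JSON here keeps the sketch dependency-free) before loading into Prometheus.
print(json.dumps(groups, indent=2))
```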