    Playbooks

    BigQuery Cost Spike Investigation
    Identify rogue queries or misconfigured BI extractors.
    Service: analytics | Risk: Low | Downtime: None

    1. Check: Fetch top queries by cost in last 24h
    2. Query: bq query --use_legacy_sql=false 'SELECT ...'
    3. Decision: Is there a single outlier query?
    4. Action: Throttle service account or apply guardrail
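
The decision in step 3 can be made mechanical once the per-query costs from step 1 are in hand. The sketch below is only an illustration: the function name `single_outlier`, the `(query_id, cost)` row shape, and the 50% share threshold are assumptions, not part of the playbook.

```python
from typing import Optional, Sequence, Tuple


def single_outlier(
    costs: Sequence[Tuple[str, float]],
    share_threshold: float = 0.5,  # assumed cutoff; tune for your workload
) -> Optional[str]:
    """Return the id of the query whose share of total cost exceeds the
    threshold, or None if no single query dominates."""
    total = sum(cost for _, cost in costs)
    if total <= 0:
        # No rows or no spend: nothing to flag.
        return None
    top_id, top_cost = max(costs, key=lambda row: row[1])
    # Strict comparison: with a 0.5 threshold at most one query can qualify.
    return top_id if top_cost / total > share_threshold else None


# Hypothetical cost rows, e.g. parsed from the output of the step 2 query.
rows = [("job_a", 80.0), ("job_b", 10.0), ("job_c", 10.0)]
print(single_outlier(rows))
```

If the function returns an id, step 4 applies to the service account behind that query; if it returns None, the spike is spread across many queries and a broader guardrail is the more likely fix.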

    Kubernetes 5xx Surge - Gateway
    Triages 5xx spikes on the gateway; checks upstream health and retries.
    Service: edge-gateway | Risk: Medium | Downtime: < 2 min

    1. Check: Inspect HTTP 5xx rate and upstreams
    2. Query: kubectl -n edge logs deploy/gateway --tail=500
    3. Decision: Are upstream pods healthy?
    4. Action: Shift traffic to healthy subset; restart unhealthy pods
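
Steps 3 and 4 amount to splitting upstream pods into a healthy set that keeps receiving traffic and an unhealthy set to restart. A minimal sketch of that split follows, assuming per-pod request and 5xx counts have already been collected. The `stats` shape, the pod names, and the 5% error-rate cutoff are hypothetical.

```python
from typing import Dict, List, Tuple


def partition_upstreams(
    stats: Dict[str, Tuple[int, int]],  # pod name -> (total requests, 5xx count)
    max_error_rate: float = 0.05,       # assumed cutoff; not from the playbook
) -> Tuple[List[str], List[str]]:
    """Split pods into (healthy, unhealthy) by their 5xx error rate."""
    healthy: List[str] = []
    unhealthy: List[str] = []
    for pod, (total, errors) in sorted(stats.items()):
        # Pods with no traffic are treated as healthy in this sketch.
        rate = errors / total if total else 0.0
        if rate > max_error_rate:
            unhealthy.append(pod)
        else:
            healthy.append(pod)
    return healthy, unhealthy


# Hypothetical counts for three gateway upstreams.
healthy, unhealthy = partition_upstreams(
    {"gw-a": (1000, 5), "gw-b": (1000, 400), "gw-c": (0, 0)}
)
print(healthy, unhealthy)
```

Traffic would then be shifted to the first list and the pods in the second list restarted. Whether idle pods should count as healthy depends on how readiness is tracked in the cluster.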

    Cloud SQL Connection Pool Exhaustion
    Detects pool exhaustion and applies connection cap + backoff.
    Service: payments-api | Risk: Medium | Downtime: None

    1. Check: Verify active connections vs max
    2. Query: SELECT count(*) AS active, current_setting('max_connections')::int AS max_conn FROM pg_stat_activity;
    3. Action: Roll out connection cap via env var + HPA tune
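
The connection cap in step 3 has to be sized against both the database limit and the HPA ceiling, otherwise a scale-out can exhaust the pool again. The arithmetic can be sketched as below. The function name, the `reserved` headroom for admin and maintenance sessions, and the example numbers are assumptions for illustration only.

```python
def per_pod_connection_cap(
    max_connections: int,  # database-side limit
    max_replicas: int,     # HPA maxReplicas for the service
    reserved: int = 10,    # assumed headroom for admin/maintenance sessions
) -> int:
    """Largest per-pod pool size that cannot exceed the database limit
    even when the deployment is scaled to max_replicas."""
    if max_replicas <= 0:
        raise ValueError("max_replicas must be positive")
    usable = max_connections - reserved
    if usable < max_replicas:
        raise ValueError("not enough connections for one per replica")
    # Floor division keeps cap * max_replicas <= usable.
    return usable // max_replicas


# Hypothetical sizing: 200 connections, up to 8 replicas, 8 reserved.
print(per_pod_connection_cap(200, 8, reserved=8))
```

The resulting value is what would be rolled out through the env var. Raising the HPA ceiling later means recomputing the cap.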