Playbooks
BigQuery Cost Spike Investigation
Identifies rogue queries or misconfigured BI extractors behind a cost spike.
Service: analytics
Risk: Low
Downtime: None
- Check: Fetch top queries by cost in last 24h
- Query: bq query --use_legacy_sql=false 'SELECT ...'
- Decision: Is there a single outlier query?
- Action: Throttle the offending service account or apply a guardrail (e.g. a maximum-bytes-billed limit)
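The outlier decision in this playbook can be sketched as a small check over the rows returned by the cost query. This is a minimal sketch under stated assumptions, not the playbook's actual logic: the row shape (`job_id`, `cost_usd`), the function name `find_outlier`, and the 50% cost-share threshold are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Optional


def find_outlier(rows: List[Dict], share_threshold: float = 0.5) -> Optional[str]:
    """Return the job_id of a single query that dominates total cost, else None.

    Assumption: each row looks like {"job_id": str, "cost_usd": float}.
    The field names and the default threshold are hypothetical.
    """
    total = sum(r["cost_usd"] for r in rows)
    if total <= 0:
        # No spend recorded (or no rows at all): nothing to flag.
        return None
    top = max(rows, key=lambda r: r["cost_usd"])
    # Flag the top query only if it accounts for a large share of the spend.
    if top["cost_usd"] / total >= share_threshold:
        return top["job_id"]
    return None
```

A `None` result suggests the spike is spread across many queries, which points more toward a misconfigured extractor than a single rogue query.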
Kubernetes 5xx Surge - Gateway
Triages 5xx spikes on the gateway; checks upstream health and retries.
Service: edge-gateway
Risk: Medium
Downtime: < 2 min
- Check: Inspect HTTP 5xx rate and upstreams
- Query: kubectl -n edge logs deploy/gateway --tail=500
- Decision: Are upstream pods healthy?
- Action: Shift traffic to healthy subset; restart unhealthy pods
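The health decision and the traffic-shift action above can be sketched as a partition of upstream pods into a healthy subset and a restart list. This is an illustrative sketch only: the per-pod fields (`name`, `ready`, `requests`, `errors_5xx`), the function name `split_upstreams`, and the 5% error-rate cutoff are assumptions, and how the stats are collected is left out.

```python
from typing import Dict, List, Tuple


def split_upstreams(
    pods: List[Dict], max_5xx_rate: float = 0.05
) -> Tuple[List[str], List[str]]:
    """Partition pods into (healthy, unhealthy) name lists.

    Assumption: each pod dict has "name", "ready", "requests" and
    "errors_5xx" keys; these names and the cutoff are hypothetical.
    """
    healthy: List[str] = []
    unhealthy: List[str] = []
    for pod in pods:
        total = pod["requests"]
        # A pod that served no requests has no measurable error rate.
        rate = pod["errors_5xx"] / total if total else 0.0
        if pod["ready"] and rate <= max_5xx_rate:
            healthy.append(pod["name"])
        else:
            unhealthy.append(pod["name"])
    return healthy, unhealthy
```

Traffic would then be shifted to the first list and the second list queued for restart; if the healthy list comes back empty, restarting everything at once is likely to extend the outage, so that case deserves a manual look.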
Cloud SQL Connection Pool Exhaustion
Detects pool exhaustion and applies connection cap + backoff.
Service: payments-api
Risk: Medium
Downtime: None
- Check: Verify active connections vs max
- Query: SELECT count(*) AS active, current_setting('max_connections')::int AS max_connections FROM pg_stat_activity;
- Action: Roll out connection cap via env var + HPA tune
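The "active connections vs max" check and the backoff mentioned in this playbook's description can be sketched as two small helpers. This is a sketch under assumptions rather than the service's real configuration: the 90% utilization threshold, the function names, and the base and cap delays are all hypothetical.

```python
from typing import List


def pool_exhausted(active: int, max_connections: int, threshold: float = 0.9) -> bool:
    """Return True when active connections reach the given share of the max.

    Assumption: the 0.9 threshold is illustrative, not a documented limit.
    """
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    return active / max_connections >= threshold


def backoff_delays(retries: int, base_s: float = 0.1, cap_s: float = 5.0) -> List[float]:
    """Capped exponential backoff: base_s * 2**attempt, never above cap_s.

    Deterministic on purpose; production clients usually add random jitter
    so that retries from many clients do not line up.
    """
    return [min(cap_s, base_s * (2 ** attempt)) for attempt in range(retries)]
```

For example, `backoff_delays(5, base_s=1.0, cap_s=5.0)` yields delays of 1, 2, 4, 5, and 5 seconds, so a client waits longer after each failed connection attempt without ever exceeding the cap.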