Three Pillars
| Pillar | Tooling | Retention | Cardinality budget |
|---|---|---|---|
| Logs | structlog → stdout → Loki | 30 d hot, 365 d cold (R2) | unbounded; structured JSON |
| Metrics | Prometheus client → remote-write → Grafana Cloud | 13 mo | ≤ 50 k active series |
| Traces | OTLP → Tempo | 7 d | head-sample 10 % |
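
The 10 % head-sampling figure for traces can be illustrated with a minimal sketch. It assumes the keep/drop decision is derived from the trace ID itself, so that every service seeing the same trace reaches the same answer; the function names and the use of the low 64 bits are illustrative assumptions, not the project's actual sampler configuration.

```python
import random


def should_sample(trace_id_hex: str, ratio: float = 0.10) -> bool:
    """Decide at the head of a trace whether to keep it.

    Treats the low 64 bits of the trace ID as a uniform value, so the
    decision is deterministic for a given trace ID.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be within [0, 1]")
    low_64_bits = int(trace_id_hex, 16) & 0xFFFFFFFFFFFFFFFF
    # Keep the trace when its value falls below ratio * 2**64.
    return low_64_bits < int(ratio * 2**64)


def new_trace_id(rng: random.Random) -> str:
    """Generate a 128-bit trace ID as 32 hex characters (illustration only)."""
    return f"{rng.getrandbits(128):032x}"


if __name__ == "__main__":
    rng = random.Random(0)
    ids = [new_trace_id(rng) for _ in range(10_000)]
    kept = sum(should_sample(t) for t in ids)
    print(f"kept {kept} of {len(ids)} traces")
```

Because the decision depends only on the trace ID, roughly one trace in ten is kept end to end rather than one span in ten, which keeps sampled traces complete.
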
Golden Signals (per backend service)
- Latency: `snowir_request_latency_seconds` p50 / p95 / p99
- Traffic: `snowir_requests_total` rate by path
- Errors: `snowir_requests_total{status=~"5.."}` ratio
- Saturation: process CPU, RSS, file descriptors
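
To make the latency percentiles and the error ratio concrete, here is a small sketch of how they can be estimated from raw counters. It assumes `snowir_request_latency_seconds` is exported as a cumulative-bucket histogram (the source does not say whether it is a histogram or a summary), and the bucket bounds and counts below are made-up sample data. The interpolation mirrors the usual linear-within-bucket approach; it is not a copy of any particular query engine.

```python
import math
from typing import Sequence, Tuple

Bucket = Tuple[float, float]  # (upper bound "le", cumulative count)


def histogram_quantile(q: float, buckets: Sequence[Bucket]) -> float:
    """Estimate a quantile from cumulative histogram buckets.

    Interpolates linearly inside the bucket that contains the target rank.
    `buckets` must be sorted by upper bound and end with (math.inf, total).
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be within [0, 1]")
    if not buckets or buckets[-1][0] != math.inf:
        raise ValueError("buckets must end with a +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, count_below = 0.0, 0.0
    for upper_bound, cumulative in buckets:
        if cumulative >= rank:
            if upper_bound == math.inf:
                # Rank lands in the overflow bucket: report the highest finite bound.
                return lower_bound
            in_bucket = cumulative - count_below
            if in_bucket == 0:
                return lower_bound
            fraction = (rank - count_below) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, count_below = upper_bound, cumulative
    raise ValueError("cumulative counts must be non-decreasing")


def error_ratio(errors: float, total: float) -> float:
    """Share of requests that returned a 5xx status; 0.0 when there is no traffic."""
    return errors / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical bucket data for one scrape window.
    latency = [(0.1, 50), (0.25, 80), (0.5, 95), (1.0, 99), (math.inf, 100)]
    for label, q in (("p50", 0.50), ("p95", 0.95), ("p99", 0.99)):
        print(f"{label} ~ {histogram_quantile(q, latency):.3f} s")
    print(f"error ratio = {error_ratio(3, 1200):.4f}")
```

The estimate is only as fine as the bucket layout, so bucket bounds near the latency targets matter more than the number of buckets.
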
Domain SLIs
| SLI | Target | Window | Alert rule |
|---|---|---|---|
| Forecast pipeline freshness | last issue ≤ 90 min ago | 5 min | SnowIRForecastStale |
| Tile p95 | < 500 ms | 15 min | SnowIRTileLatencyP95 |
| Audit anchor success | ≥ 99 % | 24 h | SnowIRAuditAnchorFail |
| Rolling AUC | ≥ 0.80 | 30 d | SnowIRForecastSkillBreach |
| Winter availability | ≥ 99.5 % | 1 d | SnowIRWinterAvailabilityBreach |
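
Two of the targets above reduce to simple arithmetic, sketched here under stated assumptions: the freshness check compares the newest forecast issue time against the 90 min limit, and the availability target is turned into a downtime budget for its 1 d window. The function and constant names are hypothetical; the real alert rules (`SnowIRForecastStale`, `SnowIRWinterAvailabilityBreach`) are not shown in this document and may be defined differently.

```python
from datetime import datetime, timedelta, timezone

FRESHNESS_LIMIT = timedelta(minutes=90)  # "last issue <= 90 min ago"
AVAILABILITY_TARGET = 0.995              # winter availability target
AVAILABILITY_WINDOW = timedelta(days=1)  # evaluation window from the table


def forecast_is_stale(last_issue: datetime, now: datetime,
                      limit: timedelta = FRESHNESS_LIMIT) -> bool:
    """True when the newest forecast issue is older than the limit."""
    return now - last_issue > limit


def downtime_budget(target: float, window: timedelta) -> timedelta:
    """Downtime the window tolerates before the availability target is missed."""
    return window * (1.0 - target)


if __name__ == "__main__":
    now = datetime(2025, 1, 15, 12, 0, tzinfo=timezone.utc)
    print(forecast_is_stale(now - timedelta(minutes=60), now))
    print(forecast_is_stale(now - timedelta(minutes=91), now))
    print(downtime_budget(AVAILABILITY_TARGET, AVAILABILITY_WINDOW))
```

For a 99.5 % target over one day, the budget is 0.5 % of 1440 minutes, or about 7.2 minutes of downtime.
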
Dashboard Layout
`infra/grafana/dashboards/snow-ir-overview.json` is the single top-level board surfaced in NDMA's COP. Drill-down boards:
- `snow-ir-ingest.json`: sensor and satellite ingest health
- `snow-ir-simulation.json`: pipeline timing per stage
- `snow-ir-frontend.json`: RUM via web-vitals → Prometheus pushgateway
- `snow-ir-audit.json`: Polygon anchor lag and tx success rate
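
The frontend board depends on browser measurements reaching a Pushgateway, which accepts samples in the Prometheus text exposition format on a path of the form `/metrics/job/<job_name>`. The sketch below only builds such a request body and does not send it. It assumes a small server-side relay receives the web-vitals readings; the `snowir_rum_*` metric names and the `page` label are hypothetical, since the document does not list the actual RUM metrics.

```python
from typing import Mapping


def escape_label_value(value: str) -> str:
    """Escape backslash, double quote, and newline for the text exposition format."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_sample(name: str, labels: Mapping[str, str], value: float) -> str:
    """Render one sample line: name{label="value",...} value"""
    if labels:
        pairs = ",".join(
            f'{key}="{escape_label_value(val)}"' for key, val in sorted(labels.items())
        )
        return f"{name}{{{pairs}}} {value}"
    return f"{name} {value}"


def render_payload(vitals: Mapping[str, float], page: str) -> str:
    """Build a push body from web-vitals readings (metric names are hypothetical)."""
    lines = [
        render_sample(f"snowir_rum_{metric.lower()}", {"page": page}, reading)
        for metric, reading in sorted(vitals.items())
    ]
    return "\n".join(lines) + "\n"  # the format expects a trailing newline


if __name__ == "__main__":
    print(render_payload({"LCP": 2.1, "CLS": 0.05, "INP": 180.0}, "/map"), end="")
```

Keeping the `page` label to a short, fixed set of routes matters here, because every distinct label value creates a new series that counts against the 50 k active-series budget.
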