Observability Handbook

Three Pillars

| Pillar | Tooling | Retention | Cardinality budget |
|---|---|---|---|
| Logs | structlog → stdout → Loki | 30 d hot, 365 d cold (R2) | unbounded; structured JSON |
| Metrics | Prometheus client → remote-write → Grafana Cloud | 13 mo | ≤ 50 k active series |
| Traces | OTLP → Tempo | 7 d | head-sample 10 % |
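
To illustrate what "head-sample 10 %" means in practice, here is a minimal sketch of a ratio-based head sampler: the keep/drop decision is made once, when the trace starts, from the trace ID alone, so every service that sees the same ID reaches the same decision. The function names and the 64-bit comparison are assumptions for illustration, written in the spirit of trace-ID-ratio samplers; this is not the handbook's actual sampler configuration.

```python
import random

# Assumption: only the low 64 bits of the 128-bit trace ID are compared.
TRACE_ID_MASK = (1 << 64) - 1


def should_sample(trace_id: int, ratio: float = 0.10) -> bool:
    """Head-sampling decision, made once when the trace starts."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be within [0, 1]")
    # IDs below this bound are kept; for uniformly random IDs that is ~ratio of them.
    bound = round(ratio * (1 << 64))
    return (trace_id & TRACE_ID_MASK) < bound


def sampled_fraction(n: int, ratio: float = 0.10, seed: int = 0) -> float:
    """Hypothetical helper: fraction of n random trace IDs that would be kept."""
    rng = random.Random(seed)
    kept = sum(should_sample(rng.getrandbits(128), ratio) for _ in range(n))
    return kept / n
```

Because the decision depends only on the trace ID, no coordination between services is needed. The trade-off is that rare error traces are dropped at the same 10 % rate as everything else.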

Golden Signals (per backend service)

  1. Latency: snowir_request_latency_seconds p50 / p95 / p99
  2. Traffic: snowir_requests_total rate by path
  3. Errors: snowir_requests_total{status=~"5.."} ratio
  4. Saturation: process CPU, RSS, file descriptors
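
As a rough illustration of how the latency percentiles and the error ratio above can be derived from raw counters, the sketch below estimates a quantile from cumulative histogram buckets by linear interpolation (a simplified version of what Prometheus's histogram_quantile() does) and computes the share of 5xx responses. The function names and the bucket layout in the usage note are hypothetical; they are not taken from the snowir services.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    Buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be within [0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for upper, cumulative in buckets:
        if cumulative >= rank:
            if math.isinf(upper):
                # Quantile falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            in_bucket = cumulative - prev_count
            if in_bucket == 0:
                return upper
            # Assume observations are spread evenly inside the bucket.
            return prev_bound + (upper - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = upper, cumulative
    return prev_bound


def error_ratio(requests_by_status: dict[str, int]) -> float:
    """Share of requests whose status code matches 5xx."""
    total = sum(requests_by_status.values())
    if total == 0:
        return 0.0
    errors = sum(n for status, n in requests_by_status.items() if status.startswith("5"))
    return errors / total
```

For example, with hypothetical buckets `[(0.1, 50), (0.25, 80), (0.5, 95), (1.0, 99), (math.inf, 100)]`, the p95 estimate lands on the 0.5 s bucket boundary. The estimate can be no more precise than the bucket boundaries allow, so bucket choice matters for any p95 or p99 target.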

Domain SLIs

| SLI | Target | Window | Alert rule |
|---|---|---|---|
| Forecast pipeline freshness | last issue ≤ 90 min ago | 5 min | SnowIRForecastStale |
| Tile p95 | < 500 ms | 15 min | SnowIRTileLatencyP95 |
| Audit anchor success | ≥ 99 % | 24 h | SnowIRAuditAnchorFail |
| Rolling AUC | ≥ 0.80 | 30 d | SnowIRForecastSkillBreach |
| Winter availability | ≥ 99.5 % | 1 d | SnowIRWinterAvailabilityBreach |
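
To make two of these targets concrete, the sketch below works out how many minutes of unavailability a 99.5 % target tolerates over a 1 d window and checks the 90-minute freshness condition. It assumes availability is measured as a fraction of time, which the table does not state; the function names are hypothetical and the actual alert rules may be expressed differently.

```python
from datetime import datetime, timedelta, timezone


def allowed_bad_minutes(target: float, window: timedelta) -> float:
    """Minutes of failure a time-based target tolerates within one window."""
    if not 0.0 <= target <= 1.0:
        raise ValueError("target must be within [0, 1]")
    return (1.0 - target) * window.total_seconds() / 60.0


def is_forecast_stale(
    last_issue: datetime,
    now: datetime,
    max_age: timedelta = timedelta(minutes=90),
) -> bool:
    """True when the last forecast issue is older than max_age."""
    return now - last_issue > max_age


if __name__ == "__main__":
    # 99.5 % over one day leaves roughly 7.2 minutes of error budget.
    print(round(allowed_bad_minutes(0.995, timedelta(days=1)), 2))
    now = datetime.now(timezone.utc)
    print(is_forecast_stale(now - timedelta(minutes=120), now))
```

Under that assumption the daily budget is small enough that a single slow restart can consume it.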

Dashboard Layout

infra/grafana/dashboards/snow-ir-overview.json is the single top-level board surfaced in NDMA's COP. Drill-down boards:

  • snow-ir-ingest.json — sensor + satellite ingest health
  • snow-ir-simulation.json — pipeline timing per stage
  • snow-ir-frontend.json — RUM via web-vitals → Prometheus pushgateway
  • snow-ir-audit.json — Polygon anchor lag + tx success rate
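
A small sanity check can catch a missing or corrupted board before it reaches Grafana. The sketch below assumes the drill-down boards sit in the same directory as snow-ir-overview.json, which the text above does not confirm; `check_dashboards` is a hypothetical helper, not part of the repository.

```python
import json
from pathlib import Path

# File names as listed above; the shared directory is an assumption.
EXPECTED_DASHBOARDS = (
    "snow-ir-overview.json",
    "snow-ir-ingest.json",
    "snow-ir-simulation.json",
    "snow-ir-frontend.json",
    "snow-ir-audit.json",
)


def check_dashboards(directory: Path) -> dict[str, str]:
    """Return {filename: problem} for boards that are missing or not valid JSON."""
    problems: dict[str, str] = {}
    for name in EXPECTED_DASHBOARDS:
        path = directory / name
        if not path.is_file():
            problems[name] = "missing"
            continue
        try:
            json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            problems[name] = f"invalid JSON: {exc.msg}"
    return problems


if __name__ == "__main__":
    for name, problem in check_dashboards(Path("infra/grafana/dashboards")).items():
        print(f"{name}: {problem}")
```

An empty result means every expected file exists and parses. It does not validate the dashboard schema or the queries inside each panel.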