Phase 8 — V&V, Security & Production Deploy
Duration: ~7 working days
Critical-path predecessors: Phases 2, 3, 5
Status: in progress · final phase
Deliverables (§9 / §11 / §15)
- Hindcast harness — replay 7+ documented Chenab events 2010–2024
- STRIDE threat model + ZeroTrustShield middleware
- Static-analysis CI gate — Bandit, Safety, Trivy, ZAP baseline, gitleaks
- OpenTelemetry + Prometheus + Sentry observability stack
- Grafana SLO dashboards + Prometheus alert rules
- Runbooks — GLOF onset · avalanche aftermath · total cloud occlusion
- Production Netlify deploy with custom domain snow-ir.app
- GitHub Pages docs deploy with full handbook + ADR archive
- SystemHealthBadge — §15 Definition-of-Done stamp on the console
- release.yml workflow — semantic-version tagged v1.0
Acceptance Gate · §15 Definition of Done
| # | Criterion | Verifier | Evidence artefact |
|---|---|---|---|
| 1 | Hindcast AUC ≥ 0.80 over 2010–2024 catalogue | snow_ir.vv.hindcast.runner | data/vv/hindcast_report.json |
| 2 | Mean Brier ≤ 0.18 | skill_metrics.brier_score | release-notes summary |
| 3 | Mean lead time ≥ 24 h on documented events | hindcast scoring | hindcast_report.json |
| 4 | All §9.3 gates green (R² ≥ 0.85, NSE ≥ 0.6, KGE ≥ 0.55) | nightly validation-summary | /validation/summary/rolling30d |
| 5 | Lighthouse Performance ≥ 0.90, A11y ≥ 0.95 | lhci autorun | LHCI artefact in release.yml |
| 6 | Backend coverage ≥ 75 % | pytest-cov | CI log |
| 7 | Zero high/critical findings · Bandit, Safety, Trivy, ZAP | security-extended.yml | weekly SARIF upload |
| 8 | STRIDE threat model reviewed | docs/security/threat-model.md | quarterly review header |
| 9 | Three runbooks committed | docs/runbooks/*.md | repository tree |
| 10 | Polygon audit anchors verifiable for ≥ 100 alerts | AuditAgent + explorer | audit_events.tx_hash |
| 11 | OpenTelemetry traces visible in Grafana Tempo | observability.install | dashboard screenshot |
| 12 | Prometheus alerts wired to PagerDuty / OpsGenie | infra/grafana/alerts/*.yaml | alertmanager route |
| 13 | Netlify production site at snow-ir.app with HSTS preload | netlify.toml | securityheaders.com A+ |
| 14 | GitHub Pages docs at snow-ir.app/docs | deploy-docs job | actions deployment URL |
| 15 | SystemHealthBadge shows an unbroken streak of ≥ 30 days | console UI | screenshot in release notes |
| 16 | v1.0.0 tag pushed; release notes attached | release.yml | GitHub release |
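
Criteria 1, 2 and 4 above rest on four standard skill scores. The following is a minimal sketch of how they could be computed and compared against the thresholds in the table. It is not the code in snow_ir.vv.hindcast.runner or skill_metrics.brier_score; every function name here is hypothetical, and the KGE form used is the 2009 Kling-Gupta definition, which is an assumption because the plan does not say which variant applies.

```python
import math
from typing import Dict, Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared difference between forecast probability and the 0/1 outcome."""
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("probs and outcomes must be non-empty and of equal length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def roc_auc(scores: Sequence[float], outcomes: Sequence[int]) -> float:
    """AUC as the chance that a random event outscores a random non-event (ties count 0.5)."""
    pos = [s for s, o in zip(scores, outcomes) if o == 1]
    neg = [s for s, o in zip(scores, outcomes) if o == 0]
    if not pos or not neg:
        raise ValueError("need at least one event and one non-event")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def nse(obs: Sequence[float], sim: Sequence[float]) -> float:
    """Nash-Sutcliffe efficiency: 1 minus squared error over variance of observations."""
    mean_obs = sum(obs) / len(obs)
    num = sum((o - s) ** 2 for o, s in zip(obs, sim))
    den = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - num / den


def kge(obs: Sequence[float], sim: Sequence[float]) -> float:
    """Kling-Gupta efficiency (2009 form, assumed): correlation, variability ratio, bias ratio."""
    n = len(obs)
    mean_o, mean_s = sum(obs) / n, sum(sim) / n
    sd_o = math.sqrt(sum((o - mean_o) ** 2 for o in obs) / n)
    sd_s = math.sqrt(sum((s - mean_s) ** 2 for s in sim) / n)
    r = sum((o - mean_o) * (s - mean_s) for o, s in zip(obs, sim)) / (n * sd_o * sd_s)
    alpha = sd_s / sd_o      # variability ratio
    beta = mean_s / mean_o   # bias ratio
    return 1.0 - math.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)


def metric_gates(auc: float, mean_brier: float, nse_value: float, kge_value: float) -> Dict[str, bool]:
    """Compare scores with the thresholds listed in the acceptance table."""
    return {
        "auc>=0.80": auc >= 0.80,
        "brier<=0.18": mean_brier <= 0.18,
        "nse>=0.6": nse_value >= 0.6,
        "kge>=0.55": kge_value >= 0.55,
    }
```

A release check could then require `all(metric_gates(...).values())`. The R² gate, the 24 h lead-time gate and the remaining criteria are left out because the plan does not define their inputs.
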
Phase 8 Day-by-Day Plan
| Day | Focus | Owner |
|---|---|---|
| D1 | Repo scaffold + historical event catalogue + hindcast runner | V&V lead |
| D2 | Skill metrics, hindcast tests, EvaluationAgent rolling window | V&V lead |
| D3 | STRIDE threat model + ZeroTrustShield + secret-scan tests | Security lead |
| D4 | OpenTelemetry + Prometheus + Sentry wiring; Grafana dashboards | SRE |
| D5 | Runbooks (GLOF, avalanche, cloud occlusion); tabletop dry-run | Ops lead |
| D6 | release.yml + security-extended.yml; Netlify production cutover | DevOps |
| D7 | LHCI gate; system-health badge; v1.0 tag + retrospective | Tech lead |
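
For the D6 production cutover, criterion 13 calls for HSTS preload. As a hedged illustration, the sketch below checks a Strict-Transport-Security header value against the three header-level requirements published for the browser preload list: a max-age of at least 31536000 seconds, includeSubDomains, and preload. The function names are hypothetical and this is not part of the project's tooling; the other preload conditions (valid certificate, HTTP to HTTPS redirect) are not covered.

```python
from typing import Dict, Optional

MIN_PRELOAD_MAX_AGE = 31_536_000  # one year, in seconds


def parse_hsts(header: str) -> Dict[str, Optional[str]]:
    """Split a Strict-Transport-Security value into lower-cased directives."""
    directives: Dict[str, Optional[str]] = {}
    for part in header.split(";"):
        part = part.strip()
        if not part:
            continue
        name, _, value = part.partition("=")
        # Valueless directives such as "preload" are stored with None.
        directives[name.strip().lower()] = value.strip().strip('"') or None
    return directives


def is_preload_eligible(header: str) -> bool:
    """True if the header carries all three directives the preload list expects."""
    d = parse_hsts(header)
    max_age = d.get("max-age")
    if max_age is None or not max_age.isdigit():
        return False
    return (
        int(max_age) >= MIN_PRELOAD_MAX_AGE
        and "includesubdomains" in d
        and "preload" in d
    )
```

For example, `is_preload_eligible("max-age=63072000; includeSubDomains; preload")` is true, while a header that omits `preload` or uses a short max-age is rejected. A check like this could run against the deployed site's response headers before the securityheaders.com evidence is captured.
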
ADR Anchors
- ADR-014 — Why hindcast AUC, not POD/FAR, as the headline skill metric
- ADR-015 — Trade-off: ZeroTrustShield in-process vs API-gateway-only
- ADR-016 — Choice of Polygon PoS over Ethereum mainnet for audit anchors
- ADR-017 — Lighthouse 0.90 perf budget vs MapLibre WebGL workload
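
ADR-014 names AUC rather than POD/FAR as the headline skill metric. The ADR's own reasoning is not reproduced here. One commonly cited difference is that POD (probability of detection) and FAR (false alarm ratio) are computed at a fixed alert threshold, whereas AUC summarises ranking skill across all thresholds. The sketch below, with a hypothetical function name and made-up toy scores, shows how POD and FAR move when only the threshold changes.

```python
from typing import Sequence, Tuple


def pod_far(scores: Sequence[float], outcomes: Sequence[int], threshold: float) -> Tuple[float, float]:
    """POD = hits / (hits + misses); FAR = false alarms / (hits + false alarms)."""
    hits = sum(1 for s, o in zip(scores, outcomes) if s >= threshold and o == 1)
    misses = sum(1 for s, o in zip(scores, outcomes) if s < threshold and o == 1)
    false_alarms = sum(1 for s, o in zip(scores, outcomes) if s >= threshold and o == 0)
    pod = hits / (hits + misses) if (hits + misses) else float("nan")
    far = false_alarms / (hits + false_alarms) if (hits + false_alarms) else float("nan")
    return pod, far


# Toy forecast scores and observed outcomes (illustrative only, not project data).
scores = [0.9, 0.7, 0.6, 0.4, 0.2]
outcomes = [1, 0, 1, 0, 0]

low = pod_far(scores, outcomes, threshold=0.5)   # alerts on three of the five cases
high = pod_far(scores, outcomes, threshold=0.8)  # alerts on one of the five cases
```

With these toy values the lower threshold catches both events at the cost of one false alarm, and the higher threshold removes the false alarm but misses one event. The same forecasts therefore produce different POD/FAR pairs depending on the operating point.
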