# Cost, Lifecycle, Compliance

### 29) Cost Telemetry (unit economics, showback/chargeback, carbon)

* **Trigger**
  * Hourly CloudWatch metrics ingestion for online services; per-job hooks for training/batch; nightly CUR (Cost & Usage Report) refresh; weekly finance roll-up.
  * PR merge to main for tag compliance checks (prevents deploying untagged resources).
* **Inputs**
  * **AWS CUR** in S3 (hourly/daily granularity, resource IDs with cost allocation tags).
  * **Tags** (mandatory): `Project=ADAS`, `Env`, `WorkflowId` (e.g., `w13_training`), `DatasetTag` (DVC tag), `ModelVersion` (registry version), `Team`, `CostCenter`.
  * **Runtime counters** emitted by jobs/services: GPU hours, instance type, container image SHA, data scanned (Athena/EMR), request QPS, p50/p90/p99 latencies.
  * **Training metadata**: W&B runs (epochs, wall-clock, GPU type/count), SageMaker job descriptions.
  * **Carbon factors** (optional): region-level kgCO₂/kWh and GPU/CPU power draw (from `nvidia-smi` logs or instance specs).
* **Steps (with testing/validation)**
  * **(Tag enforcement pre-deploy)**
    * IaC policy-as-code (OPA/Conftest) validates that Terraform/CloudFormation resources declare the required cost tags.
    * GitHub Action fails the PR if mandatory tags are missing; unit test stubs check the Terraform plan for tags (see the tag-check sketch after this section).
  * **(Ingestion)**
    * Kinesis/Firehose → S3 for online ops metrics; an EMR/Spark job normalizes them into `ops_cost_metrics.parquet` (schema: `ts, resource_id, workflow_id, qps, p50, p99, gpu_util, cpu_util, mem_gb, bytes_out`).
    * CUR loader (Athena CTAS) builds **materialized views** by tag: `cur_by_workflow`, `cur_by_model`, `cur_by_dataset`.
    * Training hooks post to an SQS queue on each job start/stop with `{job_id, model_version, dataset_tag, nodes, gpus, start_ts, end_ts}`; an aggregator joins these with CUR line items by resource ARN.
  * **(Compute unit economics)**
    * **Per-model build sheet**: `$ / epoch`, `$ / mAP point`, `$ / 1M inferences`, `$ / GB scanned`, `$ / GPU-hour`.
    * **Per-scenario unit cost**: cost to acquire + label one example for key slices (rain/night, workzone, pedestrian).
  * **(Carbon telemetry)**
    * Estimate energy = Σ (GPU power × utilization × time + instance overhead); apply the region carbon factor → kgCO₂ per job; attach to the W&B run summary (see the carbon sketch after this section).
  * **(Anomaly detection & guardrails)**
    * Cost Explorer / Cost Anomaly Detection thresholds: alert if the **daily burn** for `w13_training` deviates by more than 2σ or if **S3 retrieval** spikes (Glacier restores).
    * Policy guard: block training jobs above the configured **budget per run** unless the label `SpendOverride=true` is set.
  * **(Validation)**
    * Great Expectations checks on the `cur_by_*` views: non-null tags; sum by tag == account total (±1% to allow for amortized fees).
    * Reconciliation test: a random sample of SageMaker jobs must appear in the CUR within 48 h.
    * Dashboard snapshot diffs (golden numbers for last week) to catch regressions.
* **Tooling/Services**
  * **AWS**: CUR on S3 + Athena, Cost Explorer API, Cost Anomaly Detection, CloudWatch, EventBridge, QuickSight dashboards, SageMaker APIs, S3 Inventory.
  * **Data**: EMR/Spark or Glue ETL; Athena CTAS; Parquet/Iceberg tables.
  * **CI**: OPA/Conftest for tag rules; GitHub Actions; Checkov/tfsec for IaC.
  * **Experiment tracking**: W&B (attach cost & carbon to runs).
* **Outputs & Storage**
  * `s3://…/governance/cost/cur_by_workflow.parquet`, `cur_by_model.parquet`, `ops_cost_metrics.parquet`.
  * QuickSight dashboards: **Model Unit Economics**, **Ops Cost & Latency Heatmap**, **Carbon per Run**.
  * Alerts in SNS/Slack: cost anomalies, budget breaches.
  * W&B run summaries updated with `train_cost_usd`, `gpu_hours`, `kgCO2`.
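The carbon-telemetry step above reduces to a small aggregation over per-interval GPU power/utilization samples. Below is a minimal sketch, assuming the `nvidia-smi` logs have already been parsed into `(power_watts, utilization, hours)` tuples; the `GpuSample` type, field names, and the example overhead wattage and carbon factor are illustrative, not a fixed schema.

```python
from dataclasses import dataclass

@dataclass
class GpuSample:
    power_watts: float   # board power draw from nvidia-smi for one GPU over the interval
    utilization: float   # average utilization over the interval, 0.0-1.0
    hours: float         # interval length in hours

def job_carbon(gpu_samples: list[GpuSample],
               wall_clock_hours: float,
               instance_overhead_watts: float,
               region_kg_co2_per_kwh: float) -> dict:
    """Estimate energy and carbon for one training job.

    energy_kWh = sum(GPU power x utilization x time) + instance overhead x wall clock
    kgCO2      = energy_kWh x region carbon factor
    """
    gpu_kwh = sum(s.power_watts * s.utilization * s.hours for s in gpu_samples) / 1000.0
    overhead_kwh = instance_overhead_watts * wall_clock_hours / 1000.0
    energy_kwh = gpu_kwh + overhead_kwh
    return {
        "gpu_hours": round(sum(s.hours for s in gpu_samples), 2),
        "energy_kwh": round(energy_kwh, 3),
        "kgCO2": round(energy_kwh * region_kg_co2_per_kwh, 3),
    }

# The aggregator would attach this dict to the W&B run summary, e.g.
#   run.summary.update(job_carbon(samples, wall_clock_hours=6.0,
#                                 instance_overhead_watts=150.0,
#                                 region_kg_co2_per_kwh=0.35))
# (the overhead wattage and carbon factor shown are placeholder values).
```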
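The tag-enforcement step can also be exercised as a plain unit test against a rendered Terraform plan, before any OPA policy runs. A minimal sketch, assuming the plan was exported with `terraform show -json tfplan > plan.json`; the taggable resource-type prefixes and the walk over only the root module are simplifications for illustration.

```python
import json

MANDATORY_TAGS = {"Project", "Env", "WorkflowId", "DatasetTag", "ModelVersion", "Team", "CostCenter"}
TAGGABLE_PREFIXES = ("aws_s3_bucket", "aws_instance", "aws_sagemaker")  # illustrative subset

def untagged_resources(plan_path: str) -> list[str]:
    """Return addresses of planned resources that are missing any mandatory cost tag."""
    with open(plan_path) as f:
        plan = json.load(f)
    failures = []
    # Child modules are skipped here for brevity; a real check would recurse into them.
    resources = plan.get("planned_values", {}).get("root_module", {}).get("resources", [])
    for res in resources:
        if not res["type"].startswith(TAGGABLE_PREFIXES):
            continue
        tags = (res.get("values") or {}).get("tags") or {}
        missing = MANDATORY_TAGS - set(tags)
        if missing:
            failures.append(f'{res["address"]}: missing {sorted(missing)}')
    return failures

if __name__ == "__main__":
    problems = untagged_resources("plan.json")
    if problems:
        raise SystemExit("Tag policy violations:\n" + "\n".join(problems))
```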
---

### 30) Data Lifecycle & Tiering (retention, tiering, compaction, right-to-erasure)

* **Trigger**
  * Daily lifecycle sweep; weekly compaction/OPTIMIZE; event-driven on **access pattern** change (S3 Storage Class Analysis).
* **Inputs**
  * **Data classes**: Bronze (raw logs), Silver (synced/converted), Gold (curated/labels), hot feature tables, cold archives.
  * S3 Access Logs / Storage Class Analysis (object age, last-access time).
  * **Lineage graph** (Neptune/Atlas): object → derived tables/manifests (enables erasure propagation).
  * Legal requests: DSR (data subject request) manifests: `{drive_id, vehicle_id, ts_range}`.
* **Steps (with testing/validation)**
  * **(Policy definition as code)**
    * Lifecycle YAML: for each **data class**, define retention, tier transitions, encryption, replication, and PII status (see the lifecycle-rule sketch after this section).
    * *Example*: Bronze camera: **30 days in Standard**, then **Intelligent-Tiering**, **archive to Glacier Deep Archive** at 180 days; retain 5 years if linked to an unresolved safety incident.
    * Glue Iceberg table properties: `write.target-file-size-bytes`, `commit.manifest.min-count-to-merge`, snapshot retention (e.g., keep 14 days of snapshots).
  * **(Automated actions)**
    * Create/maintain S3 Lifecycle rules + retrieval policies; **Intelligent-Tiering** for Silver, **One Zone-IA** for low-risk derived frames, **Glacier** for Bronze archives.
    * **Compaction/OPTIMIZE**: an EMR Spark job rewrites small Parquet/Iceberg files into large (512 MB–1 GB) partitions; ZSTD compression; sort by `(date, vehicle_id, sensor)` (see the compaction sketch after this section).
    * **Partition evolution**: validate the partition strategy (e.g., `dt=YYYY-MM-DD/vehicle_id=`) and update Athena/Glue accordingly.
  * **(Right-to-erasure / legal hold)**
    * DSR processor traverses lineage to locate **all derivatives** (frames, embeddings, labels); issues an S3 Batch Operations delete; tombstones rows in Iceberg; updates OpenSearch documents; re-compacts affected partitions.
    * Legal hold marks affected object versions for retention via S3 Object Lock (compliance mode) where required.
  * **(Access & cost optimization)**
    * Storage Class Analysis reports → move Gold labels not accessed for ≥90 days to IA; auto-restore on demand with caching.
    * **Athena workgroup budgets** and **per-query bytes-scanned limits** to curb runaway costs.
  * **(Validation & safety checks)**
    * Preflight: simulate lifecycle rules on a **canary bucket**; ensure no Gold/Registry artifacts are expired.
    * DVC/Git pointer audit: for each `dataset_spec.yaml`, verify the referenced URIs still exist after compaction/moves.
    * Weekly random restore test from Glacier; measure retrieval SLA; alert on failures.
    * GDPR audit trail: every erasure creates an `erasure_receipt.json` with the object list, versions, and timestamps.
* **Tooling/Services**
  * **AWS**: S3 (Lifecycle, Object Lock, Inventory), Glacier, Intelligent-Tiering, S3 Batch Operations, Glue/Athena, EMR Spark, Lake Formation (permissions), Macie (PII discovery), CloudTrail (audit).
  * **Catalog/lineage**: Glue Data Catalog + (optional) Atlas/Neptune for graph lineage; OpenSearch index updates.
* **Outputs & Storage**
  * `s3://…/governance/lifecycle/policy.yaml`, `compaction_reports/…`, `erasure_receipts/…json`.
  * Glue/Athena metadata reflecting the latest partitions; Lake Formation grants updated.
  * Ops dashboard: **Storage by Class**, **Hot/Cold by Dataset/Model**, **Restore SLA**.
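The policy-definition step above maps each data-class entry in the lifecycle YAML onto concrete S3 Lifecycle rules. A minimal boto3 sketch, hard-coding the Bronze-camera example from the policy (30 days Standard → Intelligent-Tiering → Glacier Deep Archive at 180 days); the bucket name and prefix are placeholders, and the real sweep would render a rule per class from `policy.yaml`.

```python
import boto3

# Illustrative rule for the Bronze camera class described in the lifecycle policy:
# 30 days in Standard, then Intelligent-Tiering, then Glacier Deep Archive at 180 days.
BRONZE_CAMERA_RULE = {
    "ID": "bronze-camera-tiering",
    "Filter": {"Prefix": "bronze/camera/"},  # placeholder prefix
    "Status": "Enabled",
    "Transitions": [
        {"Days": 30, "StorageClass": "INTELLIGENT_TIERING"},
        {"Days": 180, "StorageClass": "DEEP_ARCHIVE"},
    ],
    # Deliberately no Expiration action: objects tied to unresolved safety incidents
    # are retained (and protected via S3 Object Lock), not expired by lifecycle.
}

def apply_lifecycle(bucket: str, rules: list[dict]) -> None:
    """Idempotently replace the bucket's lifecycle configuration with the rendered rules."""
    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration={"Rules": rules},
    )

if __name__ == "__main__":
    apply_lifecycle("adas-datalake-bronze", [BRONZE_CAMERA_RULE])  # placeholder bucket name
```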
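The compaction step is essentially a sorted rewrite of a partition's many small files into a few large ones. A minimal PySpark sketch for plain Parquet partitions, assuming the `date`, `vehicle_id`, and `sensor` columns from the sort key exist in the data and that the caller derives the output file count from the ~512 MB–1 GB target; Iceberg tables would instead use the engine's built-in rewrite/OPTIMIZE procedure.

```python
from pyspark.sql import SparkSession

def compact_partition(spark: SparkSession, src: str, dst: str, num_files: int) -> None:
    """Rewrite one partition's small Parquet files into a few large, sorted, ZSTD files.

    `num_files` would normally be computed from the partition's byte size and the
    512 MB-1 GB target in the lifecycle policy; it is passed explicitly here.
    """
    (
        spark.read.parquet(src)
        .repartition(num_files)
        .sortWithinPartitions("date", "vehicle_id", "sensor")  # sort key from the policy
        .write.mode("overwrite")
        .option("compression", "zstd")
        .parquet(dst)
    )

if __name__ == "__main__":
    spark = SparkSession.builder.appName("silver-compaction").getOrCreate()
    # Placeholder paths; the real job iterates over partitions flagged by the lifecycle sweep.
    compact_partition(
        spark,
        src="s3://adas-datalake/silver/camera/dt=2024-01-01/",
        dst="s3://adas-datalake/silver_compacted/camera/dt=2024-01-01/",
        num_files=8,
    )
```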
---

### 31) Security Scans (code, containers, IaC, runtime, secrets)

* **Trigger**
  * On every PR and nightly; on container build; pre-deploy gate in CD; quarterly full DAST; after critical CVE advisories.
* **Inputs**
  * Source code (Python, infra), Dockerfiles, Terraform/IaC, Helm charts/K8s manifests.
  * SBOMs; dependency lockfiles; container images in ECR.
  * Staging endpoints for API DAST.
* **Steps (with testing/validation)**
  * **(SAST & dependency audit)**
    * **Semgrep/CodeQL**: rulepacks for Python (FastAPI, boto3 misuse, deserialization), Rego, Terraform.
    * **Bandit** for Python; **pip-audit**/**Safety** for Python dependencies; block on critical issues.
  * **(Containers & SBOM)**
    * **Trivy/Grype** image scan; fail on non-ignored CRITICAL CVEs; enforce non-root user, read-only filesystem, dropped capabilities (see the CI-gate sketch after this section).
    * **Syft** SBOM (CycloneDX/SPDX) published as an artifact and attached to the model package in the registry.
  * **(IaC & policy)**
    * **Checkov/tfsec** for Terraform; **cfn-nag** if CloudFormation is present; **Conftest (OPA)** enforces:
      * Encryption at rest (S3, EBS, RDS), TLS 1.2+, private subnets for GPU nodes, least-privilege security groups.
      * Mandatory cost tags; disallow `0.0.0.0/0` ingress to control planes; deny public S3 ACLs.
  * **(Secrets hygiene)**
    * **Gitleaks**/git-secrets on diffs; pre-commit hooks strip secrets.
    * CI verifies secrets come only from OIDC-assumed roles; no long-lived keys in the repo or in container layers.
  * **(DAST & API posture)**
    * **OWASP ZAP** active scan against staging APIs (rate-limited); **k6** smoke load to ensure auth flows hold up under scan.
    * TLS checker (sslyze) validates cipher suites; HTTP security-header lint.
  * **(Runtime hardening)**
    * EKS/ECS task definitions: seccomp `RuntimeDefault`, AppArmor (if supported), read-only root, tmpfs for `/tmp`, resource limits set.
    * **AWS Inspector** on instances/containers; **GuardDuty** and **Security Hub** aggregation.
  * **(Compliance pack & exceptions)**
    * Findings triaged to Jira; risk-acceptance workflow with expiry; exception registry kept in a codeowners-reviewed file.
    * Weekly roll-up: open vs. closed issues, MTTR, trend.
* **Tooling/Services**
  * **CI/CD**: GitHub Actions; CodeQL; Semgrep; Gitleaks; Trivy/Grype; Syft; Bandit; pip-audit; Checkov/tfsec; Conftest/OPA; OWASP ZAP; k6.
  * **AWS**: ECR image scanning, Inspector, GuardDuty, Security Hub, IAM Access Analyzer.
* **Outputs & Storage**
  * CI artifacts: `sast_report.sarif`, `dependency_vulns.json`, `sbom.cyclonedx.json`, `container_scan.json`, `iac_scan.json`, `zap_report.html`.
  * Security Hub/Inspector findings; Jira tickets; exception register `security/risk_acceptances.yaml`.
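The container-scan gate can be wired as a short script between the scanner and the CI exit code. A minimal sketch, assuming the scan ran with `trivy image --format json -o container_scan.json <image>`; loading accepted risks from `security/risk_acceptances.yaml` (with expiry checks) is stubbed out to keep the sketch dependency-free.

```python
import json
import sys

def critical_findings(scan_path: str, accepted_ids: set[str]) -> list[str]:
    """Return CRITICAL vulnerability IDs from a Trivy JSON report, minus accepted risks."""
    with open(scan_path) as f:
        report = json.load(f)
    findings = []
    for result in report.get("Results", []):
        for vuln in result.get("Vulnerabilities") or []:
            if vuln.get("Severity") == "CRITICAL" and vuln["VulnerabilityID"] not in accepted_ids:
                findings.append(f'{vuln["VulnerabilityID"]} ({vuln.get("PkgName", "?")})')
    return findings

if __name__ == "__main__":
    # Accepted risks would be loaded from security/risk_acceptances.yaml and filtered by expiry.
    blockers = critical_findings("container_scan.json", accepted_ids=set())
    if blockers:
        print("Blocking CRITICAL CVEs:", *blockers, sep="\n  ")
        sys.exit(1)
```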
---

### 32) Datasheets & Model Cards (governance, transparency, sign-off)

* **Trigger**
  * After **Eval & Robustness** (#16) completes and a candidate is marked **ready-for-promotion**; on **Promotion** (#18) a final signed snapshot is minted; regenerate if training/config changes.
* **Inputs**
  * W&B run metadata (hyperparams, metrics, artifacts), evaluation reports (per-slice metrics, calibration, robustness), training config YAML, dataset `slices.yaml` + datasheet, drift & bias audits (Evidently/GE), safety predicate versions, cost/carbon summary from #29, compliance attestations (security scans, PII checks), lineage (code SHA, container digest, data versions).
* **Steps (with testing/validation)**
  * **(Template render)**
    * Jinja2 templates for **Datasheet for Datasets** and **Model Card**; sections:
      * **Intended Use & Limitations** (operational domain, weather/time/sensor assumptions).
      * **Training Data** (provenance, size, class/condition balance, label sources: auto vs. human, QA rates).
      * **Evaluation**: overall & **slice metrics** (night/rain/workzone), error taxonomies, calibration plots, failure exemplars.
      * **Robustness**: perturbation tests (JPEG compression, blur, occlusion), drift sensitivity.
      * **Fairness/Compliance**: bias tests relevant to the domain; privacy notes; applicable standards (e.g., cybersecurity controls).
      * **Operational**: latency/throughput envelopes, memory/compute footprint, dependency SBOM hash.
      * **Cost/Carbon**: $ per epoch/run, kgCO₂ per training run.
      * **Safety Predicates**: policy IDs enforced, thresholds, fallback behavior.
      * **Change Log**: deltas vs. prior version; migration notes.
  * **(Artifact gathering)**
    * Pull plots from W&B and export as PNG; embed metrics tables from Athena queries; attach scan-report summaries (e.g., zero critical CVEs).
    * Link to the dataset datasheet: includes **collection methods**, **preprocessing**, **labeling guidelines**, **known gaps** (e.g., low snow coverage), **retention policy** (from #30).
  * **(Automated checks)**
    * **Completeness linter**: every required section present; numeric fields non-null; footnotes linkable (see the linter sketch at the end of this section).
    * **Consistency**: model hash in the card matches the registry entry & container digest; dataset tag matches the DVC tag; metrics match the evaluation report (checksum).
  * **(Approvals & signing)**
    * Codeowners-based reviewers: ML lead, Safety lead, Security, Product; review happens via a GitHub PR on `model-card-vX.Y.md`.
    * On approval: CI stamps an **attestation** (in-toto/SLSA provenance), signs it with KMS, and stores the signed PDF/HTML.
  * **(Distribution & discoverability)**
    * Publish HTML to the internal portal (S3 static hosting behind IAM/ALB); attach to the **Model Registry** entry; persist the link in the W&B run.
    * API endpoint `GET /model/{version}/card` returns the signed snapshot; its hash is logged in the promotion record.
* **Tooling/Services**
  * **Content**: Jinja2, Pandas, Matplotlib/Plotly for visuals.
  * **Tracking**: W&B Artifacts; DVC; Git for versioning.
  * **Signing**: in-toto attestation, AWS KMS; SLSA provenance (optional).
  * **Registry**: your model registry (SageMaker/MLflow-compatible) enriched with `model_card_uri`, `datasheet_uri`.
* **Outputs & Storage**
  * `s3://…/governance/model_cards/model_vX.Y/model_card.md|html|pdf` (signed), `datasheets/dataset_.md`.
  * Links recorded in the Model Registry + W&B; checksums in `promotion_record.json`.
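The completeness and consistency checks reduce to string and hash comparisons over the rendered card and the registry record. A minimal sketch over a Markdown card, assuming `##`-level section headings and a `model_hash:` field in the card; the field name and the registry-hash argument are illustrative, while the required-section list mirrors the template above.

```python
import hashlib
import re
from pathlib import Path

REQUIRED_SECTIONS = [
    "Intended Use & Limitations", "Training Data", "Evaluation", "Robustness",
    "Fairness/Compliance", "Operational", "Cost/Carbon", "Safety Predicates", "Change Log",
]

def lint_model_card(card_path: str, registry_model_hash: str) -> list[str]:
    """Return a list of lint errors; an empty list means the card passes."""
    text = Path(card_path).read_text(encoding="utf-8")
    errors = []
    headings = set(re.findall(r"^##\s+(.+?)\s*$", text, flags=re.MULTILINE))
    for section in REQUIRED_SECTIONS:
        if section not in headings:
            errors.append(f"missing section: {section}")
    # Consistency: the model hash printed in the card must match the registry entry.
    match = re.search(r"model_hash:\s*([0-9a-f]{64})", text)  # illustrative field name
    if not match:
        errors.append("model_hash field not found")
    elif match.group(1) != registry_model_hash:
        errors.append("model_hash does not match registry entry")
    return errors

def card_checksum(card_path: str) -> str:
    """SHA-256 of the rendered card, as recorded in promotion_record.json."""
    return hashlib.sha256(Path(card_path).read_bytes()).hexdigest()
```

---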