Cost, Lifecycle, Compliance

29) Cost Telemetry (unit economics, showback/chargeback, carbon)
Trigger
Hourly CloudWatch metrics ingestion for online services; per-job hooks for training/batch; nightly CUR (Cost & Usage Report) refresh; weekly finance roll-up.
PR merge to main for tag compliance checks (prevents deploying untagged resources).
Inputs
AWS CUR in S3 (hourly/daily granularity, resource IDs with cost allocation tags).
Tags (mandatory): Project=ADAS, Env, WorkflowId (e.g., w13_training), DatasetTag (DVC tag), ModelVersion (registry version), Team, CostCenter.
Runtime counters emitted by jobs/services: GPU hours, instance type, container image SHA, data scanned (Athena/EMR), request QPS, p50/p90/p99 latencies.
Training metadata: W&B runs (epochs, wall-clock, GPU type/num), SageMaker job descriptions.
Carbon factors (optional): region-level kgCO₂/kWh and GPU/CPU power draw (from nvidia-smi logs or instance specs).
Steps (with testing/validation)
(Tag enforcement pre-deploy)
IaC policy as code (OPA/Conftest) validates that Terraform/CloudFormation resources carry the required cost tags.
A GitHub Action fails the PR if mandatory tags are missing; unit-test stubs check the Terraform plan for tags (see the sketch below).
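A minimal sketch of the tag-check step, assuming the CI job has already exported the plan with `terraform show -json plan.out > plan.json`; the taggable resource prefixes and file paths are illustrative.

```python
# check_tags.py — fail CI if planned resources are missing mandatory cost tags.
# Assumes plan.json came from `terraform show -json`; tag keys mirror the mandatory list above.
import json
import sys

MANDATORY_TAGS = {"Project", "Env", "WorkflowId", "DatasetTag", "ModelVersion", "Team", "CostCenter"}
TAGGABLE_PREFIXES = ("aws_s3_bucket", "aws_instance", "aws_sagemaker")  # illustrative subset

def missing_tags(plan_path: str) -> list[str]:
    with open(plan_path) as f:
        plan = json.load(f)
    failures = []
    for rc in plan.get("resource_changes", []):
        if not rc["type"].startswith(TAGGABLE_PREFIXES):
            continue
        after = (rc.get("change") or {}).get("after") or {}
        tags = after.get("tags") or {}
        absent = MANDATORY_TAGS - set(tags)
        if absent:
            failures.append(f"{rc['address']}: missing {sorted(absent)}")
    return failures

if __name__ == "__main__":
    problems = missing_tags(sys.argv[1] if len(sys.argv) > 1 else "plan.json")
    if problems:
        print("\n".join(problems))
        sys.exit(1)
```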
(Ingestion)
Kinesis/Firehose → S3 for online ops metrics; an EMR/Spark job normalizes them to ops_cost_metrics.parquet (schema: ts, resource_id, workflow_id, qps, p50, p99, gpu_util, cpu_util, mem_gb, bytes_out).
CUR loader (Athena CTAS) builds materialized views by tag: cur_by_workflow, cur_by_model, cur_by_dataset.
Training hooks post to an SQS queue per job start/stop with {job_id, model_version, dataset_tag, nodes, gpus, start_ts, end_ts}; an aggregator joins these with CUR line items by resource ARN (see the sketch below).
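A sketch of that aggregator join, assuming the CUR is queryable in Athena and the job hooks land in a training_jobs table; the database, table, and bucket names are illustrative, while the CUR column names follow the standard Athena/CUR schema.

```python
# Join training job metadata with CUR line items by resource ARN via Athena CTAS (illustrative names).
import boto3

athena = boto3.client("athena")

QUERY = """
CREATE TABLE governance.cur_by_training_job
WITH (format = 'PARQUET',
      external_location = 's3://adas-governance/cost/cur_by_training_job/') AS
SELECT j.job_id,
       j.model_version,
       j.dataset_tag,
       SUM(c.line_item_unblended_cost) AS cost_usd,
       date_diff('second', MIN(j.start_ts), MAX(j.end_ts)) / 3600.0 * MAX(j.gpus) AS gpu_hours
FROM cost.cur_table c
JOIN governance.training_jobs j
  ON c.line_item_resource_id = j.resource_arn
 AND c.line_item_usage_start_date BETWEEN j.start_ts AND j.end_ts
GROUP BY j.job_id, j.model_version, j.dataset_tag
"""

athena.start_query_execution(
    QueryString=QUERY,
    WorkGroup="governance",
    ResultConfiguration={"OutputLocation": "s3://adas-governance/athena-results/"},
)
```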
(Compute unit economics)
Per-model build sheet: $ / epoch, $ / mAP point, $ / 1M inferences, $ / GB scanned, $ / GPU-hour.
Per-scenario unit cost: cost to acquire + label one example for key slices (rain/night, workzone, pedestrian).
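A small pandas sketch of the build-sheet ratios, assuming the joined cost table from the ingestion step and a per-run metrics export; all column names and paths are illustrative.

```python
# Build-sheet unit economics per model version (illustrative columns and paths).
import pandas as pd

cost = pd.read_parquet("s3://adas-governance/cost/cur_by_training_job/")   # job_id, model_version, cost_usd, gpu_hours
runs = pd.read_parquet("s3://adas-governance/cost/run_metrics.parquet")    # job_id, epochs, map_delta, inferences, gb_scanned

sheet = cost.merge(runs, on="job_id")
sheet["usd_per_epoch"] = sheet["cost_usd"] / sheet["epochs"]
sheet["usd_per_map_point"] = sheet["cost_usd"] / sheet["map_delta"].clip(lower=1e-6)
sheet["usd_per_1m_inferences"] = sheet["cost_usd"] / (sheet["inferences"] / 1e6)
sheet["usd_per_gb_scanned"] = sheet["cost_usd"] / sheet["gb_scanned"]
sheet["usd_per_gpu_hour"] = sheet["cost_usd"] / sheet["gpu_hours"]

cols = ["usd_per_epoch", "usd_per_map_point", "usd_per_1m_inferences",
        "usd_per_gb_scanned", "usd_per_gpu_hour"]
sheet.groupby("model_version")[cols].mean().to_parquet(
    "s3://adas-governance/cost/build_sheet.parquet"
)
```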
(Carbon telemetry)
Estimate energy = Σ (GPU power × utilization × time + instance overhead); apply region carbon factor → kgCO₂ per job; attach to W&B run summary.
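A sketch of the estimate attached to a W&B run summary; the average GPU power, overhead factor, and region carbon factor are placeholders to be replaced with measured nvidia-smi averages and published grid factors.

```python
# Estimate job energy and carbon, then attach to the W&B run (placeholder factors).
import wandb

def job_carbon(gpu_count, gpu_hours, avg_gpu_power_w=300.0, overhead_factor=1.4,
               region_kgco2_per_kwh=0.35):
    # energy (kWh) = GPU power × GPU-hours × GPU count, inflated by host/PUE overhead
    energy_kwh = (avg_gpu_power_w / 1000.0) * gpu_hours * gpu_count * overhead_factor
    return energy_kwh, energy_kwh * region_kgco2_per_kwh

run = wandb.init(project="adas-training", id="<run_id>", resume="allow")
energy_kwh, kgco2 = job_carbon(gpu_count=8, gpu_hours=12.5)
run.summary.update({"energy_kwh": energy_kwh, "kgCO2": kgco2})
run.finish()
```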
(Anomaly detection & guardrails)
Cost Explorer / Cost Anomaly Detection thresholds: alert if daily burn for w13_training deviates > 2σ or S3 retrieval spikes (Glacier restores).
Policy guard: block training jobs above the configured per-run budget unless the label SpendOverride=true is set (see the sketch below).
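A sketch of the per-run budget guard, assuming the submitting pipeline estimates run cost from instance pricing before launching the job; the budget value and tag names are illustrative.

```python
# Pre-submit budget guard: refuse training jobs whose estimated cost exceeds the per-run budget
# unless SpendOverride=true is present in the job tags (illustrative values).
MAX_RUN_BUDGET_USD = 2000.0

def enforce_budget(estimated_cost_usd: float, tags: dict[str, str]) -> None:
    override = tags.get("SpendOverride", "false").lower() == "true"
    if estimated_cost_usd > MAX_RUN_BUDGET_USD and not override:
        raise RuntimeError(
            f"Estimated run cost ${estimated_cost_usd:,.0f} exceeds budget "
            f"${MAX_RUN_BUDGET_USD:,.0f}; set SpendOverride=true to proceed."
        )

# Example: 8 GPUs × 24 h × $4.10/GPU-hour ≈ $787 → allowed.
enforce_budget(8 * 24 * 4.10, {"Project": "ADAS", "WorkflowId": "w13_training"})
```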
(Validation)
Great Expectations checks on cur_by_* views: non-null tags; sum by tag == account total (±1% to allow for amortized fees).
Reconciliation test: a random sample of SageMaker jobs must appear in CUR within 48 h.
Dashboard snapshot diffs (golden numbers for last week) to catch regressions.
Tooling/Services
AWS: CUR on S3 + Athena, Cost Explorer API, Cost Anomaly Detection, CloudWatch, EventBridge, QuickSight dashboards, SageMaker APIs, S3 Inventory.
Data: EMR/Spark or Glue ETL; Athena CTAS; Parquet/Iceberg tables.
CI: OPA/Conftest for tag rules; GitHub Actions; Checkov/tfsec for IaC.
Experiment tracking: W&B (attach cost & carbon to runs).
Outputs & Storage
s3://…/governance/cost/cur_by_workflow.parquet, cur_by_model.parquet, ops_cost_metrics.parquet.
QuickSight dashboards: Model Unit Economics, Ops Cost & Latency Heatmap, Carbon per Run.
Alerts in SNS/Slack: cost anomalies, budget breaches.
W&B run summaries updated with train_cost_usd, gpu_hours, kgCO2.
30) Data Lifecycle & Tiering (retention, tiering, compaction, right-to-erasure)
Trigger
Daily lifecycle sweep; weekly compaction/OPTIMIZE; event-driven on access pattern change (S3 Storage Class Analysis).
Inputs
Data classes: Bronze (raw logs), Silver (synced/converted), Gold (curated/labels), Hot feature tables, Cold archives.
S3 Access Logs / Storage Class Analysis (object age, last-access time).
Lineage graph (Neptune/Atlas): object → derived tables/manifests (enables erasure propagation).
Legal requests: DSR (data subject request) manifests: {drive_id, vehicle_id, ts_range}.
Steps (with testing/validation)
(Policy definition as code)
Lifecycle YAML: for each data class, define retention, tier transitions, encryption, replication, PII status.
Example: Bronze camera: 30 days in Standard, then Intelligent-Tiering, archive to Glacier Deep Archive at 180 days; retain 5 years if linked to an unresolved safety incident (a boto3 sketch of this rule follows below).
Glue Iceberg table properties: write.target-file-size-bytes, commit.manifest.min-count-to-merge, snapshot retention (e.g., keep 14 days of snapshots).
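A sketch of translating the Bronze camera policy above into an S3 Lifecycle rule with boto3; the bucket and prefix are illustrative, and the 5-year safety-incident retention would be handled via legal hold rather than this rule.

```python
# Apply the Bronze camera tiering policy as an S3 Lifecycle rule (illustrative bucket/prefix).
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="adas-bronze-logs",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "bronze-camera-tiering",
                "Filter": {"Prefix": "camera/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "INTELLIGENT_TIERING"},
                    {"Days": 180, "StorageClass": "DEEP_ARCHIVE"},
                ],
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
            }
        ]
    },
)
```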
(Automated actions)
Create/maintain S3 Lifecycle rules + Retrieval policies; Intelligent-Tiering for Silver, One-Zone-IA for low-risk derived frames, Glacier for Bronze archives.
Compaction/OPTIMIZE: an EMR Spark job rewrites small Parquet/Iceberg files into large (512 MB–1 GB) partitions; ZSTD compression; sorted by (date, vehicle_id, sensor) (see the sketch below).
Partition evolution: validate the partition strategy (e.g., dt=YYYY-MM-DD/vehicle_id=) and update Athena/Glue.
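A minimal PySpark sketch of the compaction pass for plain Parquet partitions (an Iceberg table would instead use its rewrite_data_files/OPTIMIZE procedures); the paths, column names, and target partition count are illustrative.

```python
# Rewrite small Parquet files into larger, sorted files with ZSTD compression (illustrative paths).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("silver-compaction")
    .config("spark.sql.parquet.compression.codec", "zstd")
    .getOrCreate()
)

df = spark.read.parquet("s3://adas-silver/frames/dt=2024-06-01/")

# Aim for ~512 MB–1 GB output files; in practice derive the partition count from input size.
compacted = df.repartition(64, "vehicle_id").sortWithinPartitions("date", "vehicle_id", "sensor")
compacted.write.mode("overwrite").parquet("s3://adas-silver/frames_compacted/dt=2024-06-01/")
```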
(Right-to-erasure / legal hold)
DSR processor traverses lineage to locate all derivatives (frames, embeddings, labels); issues S3 Batch Operations delete; tombstones rows in Iceberg; updates OpenSearch documents; re-compacts affected partitions.
Legal hold marks objects with non-current-version retention via S3 Object Lock (compliance mode) where required.
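A sketch of the Iceberg tombstoning step of the DSR processor, assuming the Gold tables are Iceberg tables queryable from Athena engine v3 (which supports DELETE); table, column, and bucket names are illustrative, and raw-object deletion via S3 Batch Operations runs separately.

```python
# Tombstone rows matched by a DSR manifest in an Iceberg table via Athena (illustrative names).
import boto3

athena = boto3.client("athena")
dsr = {"drive_id": "drv_0421", "vehicle_id": "veh_017",
       "ts_start": "2023-11-02 14:00:00", "ts_end": "2023-11-02 16:00:00"}

delete_sql = f"""
DELETE FROM gold.frame_labels
WHERE drive_id = '{dsr["drive_id"]}'
  AND vehicle_id = '{dsr["vehicle_id"]}'
  AND ts BETWEEN TIMESTAMP '{dsr["ts_start"]}' AND TIMESTAMP '{dsr["ts_end"]}'
"""

athena.start_query_execution(
    QueryString=delete_sql,
    WorkGroup="governance",
    ResultConfiguration={"OutputLocation": "s3://adas-governance/athena-results/"},
)
```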
(Access & cost optimization)
Storage Class Analysis reports → move infrequently accessed Gold labels ≥90 days to IA; auto-restore on demand with caching.
Athena workgroup budgets and per-query bytes scanned limits to curb runaway costs.
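A sketch of enforcing the per-query scan limit on an Athena workgroup with boto3; the workgroup name and the 1 TiB cutoff are illustrative.

```python
# Cap bytes scanned per query for the analytics workgroup (illustrative name and limit).
import boto3

athena = boto3.client("athena")
athena.update_work_group(
    WorkGroup="adas-analytics",
    ConfigurationUpdates={
        "BytesScannedCutoffPerQuery": 1 * 1024**4,  # 1 TiB hard stop per query
        "EnforceWorkGroupConfiguration": True,
    },
)
```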
(Validation & safety checks)
Preflight: simulate lifecycle on a canary bucket; ensure no Gold/Registry artifacts are expired.
DVC/Git pointers audit: for each dataset_spec.yaml, verify referenced URIs exist after compaction/moves (see the sketch below).
Random restore test from Glacier weekly; measure retrieval SLA; alert on failures.
GDPR audit trail: every erasure creates erasure_receipt.json with object list, versions, timestamps.
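A sketch of the pointer audit, assuming each dataset_spec.yaml lists its s3:// URIs under a top-level files: key; that key name and the bucket layout are assumptions for illustration.

```python
# Verify every URI referenced by a dataset spec still resolves after compaction/tiering moves.
import boto3
import yaml
from urllib.parse import urlparse

s3 = boto3.client("s3")

def audit_dataset_spec(spec_path: str) -> list[str]:
    with open(spec_path) as f:
        spec = yaml.safe_load(f)
    missing = []
    for uri in spec.get("files", []):          # assumed top-level `files:` list of s3:// URIs
        parsed = urlparse(uri)
        try:
            s3.head_object(Bucket=parsed.netloc, Key=parsed.path.lstrip("/"))
        except s3.exceptions.ClientError:
            missing.append(uri)
    return missing

if __name__ == "__main__":
    broken = audit_dataset_spec("dataset_spec.yaml")
    if broken:
        raise SystemExit(f"{len(broken)} referenced URIs missing, e.g. {broken[:5]}")
```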
Tooling/Services
AWS: S3 (Lifecycle, Object Lock, Inventory), Glacier, Intelligent-Tiering, S3 Batch Operations, Glue/Athena, EMR Spark, Lake Formation (permissions), Macie (PII discovery), CloudTrail (audit).
Catalog/lineage: Glue Data Catalog + (optional) Atlas/Neptune for graph lineage; OpenSearch index updates.
Outputs & Storage
s3://…/governance/lifecycle/policy.yaml, compaction_reports/…, erasure_receipts/…json.
Glue/Athena metadata reflecting the latest partitions; Lake Formation grants updated.
Ops dashboard: Storage by Class, Hot/Cold by Dataset/Model, Restore SLA.
31) Security Scans (code, containers, IaC, runtime, secrets)
Trigger
On every PR and nightly; on container build; pre-deploy gate in CD; quarterly full DAST; after critical CVE advisories.
Inputs
Source code (Python, infra), Dockerfiles, Terraform/IaC, Helm charts/K8s manifests.
SBOMs; dependency lockfiles; container images in ECR.
Staging endpoints for API DAST.
Steps (with testing/validation)
(SAST & dependency audit)
Semgrep/CodeQL: rulepacks for Python (FastAPI, boto3 misuse, deserialization), Rego, Terraform.
Bandit for Python; pip-audit/Safety for Python dependencies; block on critical issues.
(Containers & SBOM)
Trivy/Grype image scan; fail on CRITICAL CVEs (non-ignored); enforce non-root user, read-only filesystem, drop CAPs.
Syft SBOM (CycloneDX/SPDX) published as a CI artifact; attached to the model package in the registry.
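A sketch of the CI gate over the image scan described above, assuming Trivy's JSON output (trivy image --format json) with its Results[].Vulnerabilities[] schema; the accepted-risk file path is illustrative.

```python
# Fail the build if the Trivy image scan reports CRITICAL CVEs not on the accepted-risk list.
import json
import sys

def critical_findings(report_path: str, ignore_path: str = "security/ignored_cves.txt") -> list[str]:
    with open(report_path) as f:
        report = json.load(f)
    try:
        with open(ignore_path) as f:
            ignored = {line.strip() for line in f if line.strip()}
    except FileNotFoundError:
        ignored = set()
    hits = []
    for result in report.get("Results", []):
        for vuln in result.get("Vulnerabilities") or []:
            if vuln.get("Severity") == "CRITICAL" and vuln.get("VulnerabilityID") not in ignored:
                hits.append(f'{vuln["VulnerabilityID"]} in {vuln.get("PkgName", "?")}')
    return hits

if __name__ == "__main__":
    findings = critical_findings(sys.argv[1] if len(sys.argv) > 1 else "container_scan.json")
    if findings:
        print("\n".join(findings))
        sys.exit(1)
```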
(IaC & policy)
Checkov/tfsec for Terraform; cfn-nag if CFN present; Conftest (OPA) enforces:
Encryption at rest (S3, EBS, RDS), TLS 1.2+, private subnets for GPU nodes, SG least privilege.
Mandatory cost tags; disallow 0.0.0.0/0 ingress to control planes; deny public S3 ACLs.
(Secrets hygiene)
Gitleaks/git-secrets on diffs; pre-commit hooks strip secrets.
CI verifies secrets only from OIDC-assumed roles; no long-lived keys in repo or container layers.
(DAST & API posture)
OWASP ZAP active scan against staging APIs (rate-limited); k6 smoke load to ensure auth flows work under scan.
TLS checker (sslyze) validates cipher suites; HTTP security headers lint.
(Runtime hardening)
EKS/ECS task defs: seccomp
RuntimeDefault
, AppArmor (if supported), read-only root, tmpfs for/tmp
, resource limits set.AWS Inspector on instances/containers; GuardDuty and Security Hub aggregation.
(Compliance pack & exceptions)
Findings triage to Jira; risk acceptance workflow with expiry; exception registry in codeowners file.
Weekly roll-up: open vs. closed issues, MTTR, trend.
Tooling/Services
CI/CD: GitHub Actions; CodeQL; Semgrep; Gitleaks; Trivy/Grype; Syft; Bandit; pip-audit; Checkov/tfsec; Conftest/OPA; OWASP ZAP; k6.
AWS: ECR scan, Inspector, GuardDuty, Security Hub, IAM Access Analyzer.
Outputs & Storage
CI artifacts: sast_report.sarif, dependency_vulns.json, sbom.cyclonedx.json, container_scan.json, iac_scan.json, zap_report.html.
Security Hub/Inspector findings; Jira tickets; exception register security/risk_acceptances.yaml.
32) Datasheets & Model Cards (governance, transparency, sign-off)
Trigger
After Eval & Robustness (#16) completes and a candidate is marked ready-for-promotion; on Promotion (#18) a final signed snapshot is minted; regenerate if training/config changes.
Inputs
W&B run metadata (hyperparams, metrics, artifacts), evaluation reports (per-slice metrics, calibration, robustness), training config YAML, dataset slices.yaml + datasheet, drift & bias audits (Evidently/GE), safety predicate versions, cost/carbon summary from #29, compliance attestations (security scans, PII checks), lineage (code SHA, container digest, data versions).
Steps (with testing/validation)
(Template render)
Jinja2 templates for the Datasheet for Datasets and the Model Card (a render sketch follows this list); sections:
Intended Use & Limitations (operational domain, weather/time/sensor assumptions).
Training Data (provenance, size, class/condition balance, label sources: auto vs. human, QA rates).
Evaluation: overall & slice metrics (night/rain/workzone), error taxonomies, calibration plots, failure exemplars.
Robustness: perturbation tests (jpeg, blur, occlusion), drift sensitivity.
Fairness/Compliance: bias tests relevant to domain; privacy notes; applicable standards (e.g., cybersecurity controls).
Operational: latency/throughput envelopes, memory/compute footprint, dependency SBOM hash.
Cost/Carbon: $ per epoch/run, kgCO₂ per training run.
Safety Predicates: policy IDs enforced, thresholds, fallback behavior.
Change Log: deltas vs. prior version; migration notes.
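A minimal render sketch, assuming a model_card.md.j2 template whose placeholders match the sections above; the template path and context keys/values are illustrative and would be assembled from W&B, Athena, and the evaluation reports.

```python
# Render the model card from a Jinja2 template and a gathered context (illustrative keys/paths).
from jinja2 import Environment, FileSystemLoader, StrictUndefined

env = Environment(loader=FileSystemLoader("governance/templates"), undefined=StrictUndefined)
template = env.get_template("model_card.md.j2")

context = {
    "model_version": "vX.Y",
    "intended_use": "Highway + urban ADAS perception; not validated for heavy snow.",
    "slice_metrics": [{"slice": "night", "mAP": 0.0}, {"slice": "rain", "mAP": 0.0}],  # filled from eval report
    "train_cost_usd": 0.0,          # filled from #29
    "kgCO2": 0.0,                   # filled from #29
    "safety_predicates": ["<policy_id>"],
}

with open("model_card_vX.Y.md", "w") as f:
    f.write(template.render(**context))   # StrictUndefined raises if a required field is missing
```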
(Artifact gathering)
Pull plots from W&B and export as PNG; embed metrics tables from Athena queries; attach scan report summaries (e.g., 0 critical CVEs).
Link to dataset datasheet: includes collection methods, preprocessing, labeling guidelines, known gaps (e.g., low snow coverage), retention policy (from #30).
(Automated checks)
Completeness linter: every required section present; numeric fields non-null; footnotes linkable.
Consistency: model hash in card matches registry entry & container digest; dataset tag matches DVC tag; metrics match evaluation report (checksum).
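A sketch of the consistency check tying the card to the registry entry, assuming the card's front matter carries model_hash, container_digest, and dataset_tag and the registry record is exported as JSON; all field names and paths are illustrative.

```python
# Consistency check: the card's identifiers must match the registry entry exactly (illustrative fields).
import json
import yaml

def check_card_consistency(card_meta_path: str, registry_record_path: str) -> None:
    with open(card_meta_path) as f:
        card = yaml.safe_load(f)
    with open(registry_record_path) as f:
        registry = json.load(f)
    for field in ("model_hash", "container_digest", "dataset_tag"):
        if card.get(field) != registry.get(field):
            raise ValueError(
                f"Mismatch on {field}: card={card.get(field)!r} registry={registry.get(field)!r}"
            )

check_card_consistency("model_card_vX.Y.meta.yaml", "registry/model_vX.Y.json")
```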
(Approvals & signing)
Codeowners-based reviewers (ML lead, Safety lead, Security, Product) approve a GitHub PR for model-card-vX.Y.md.
On approval: CI stamps attestation (in-toto SLSA provenance), signs with KMS; stores signed PDF/HTML.
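A sketch of the KMS signing step, assuming an asymmetric signing key and that the artifact's SHA-256 digest is what gets signed; the key alias, algorithm, and filenames are illustrative.

```python
# Sign the rendered model card's SHA-256 digest with an asymmetric KMS key (illustrative alias).
import hashlib
import boto3

kms = boto3.client("kms")

with open("model_card_vX.Y.pdf", "rb") as f:
    digest = hashlib.sha256(f.read()).digest()

resp = kms.sign(
    KeyId="alias/model-card-signing",
    Message=digest,
    MessageType="DIGEST",
    SigningAlgorithm="RSASSA_PSS_SHA_256",
)
with open("model_card_vX.Y.sig", "wb") as f:
    f.write(resp["Signature"])
```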
(Distribution & discoverability)
Publish HTML to internal portal (S3 static hosting behind IAM/ALB); attach to Model Registry entry; persist link in W&B run.
API endpoint GET /model/{version}/card returns the signed snapshot; its hash is logged in the promotion record (see the sketch below).
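A sketch of that endpoint, assuming FastAPI (already in the stack per #31) and that signed snapshots live under a per-version prefix in the governance bucket; the bucket and key layout are illustrative.

```python
# Serve the signed model card snapshot for a given model version (illustrative bucket/keys).
import boto3
from botocore.exceptions import ClientError
from fastapi import FastAPI, HTTPException
from fastapi.responses import Response

app = FastAPI()
s3 = boto3.client("s3")
BUCKET = "adas-governance"

@app.get("/model/{version}/card")
def get_model_card(version: str):
    key = f"governance/model_cards/model_{version}/model_card.pdf"
    try:
        obj = s3.get_object(Bucket=BUCKET, Key=key)
    except ClientError:
        raise HTTPException(status_code=404, detail=f"No signed card for {version}")
    return Response(content=obj["Body"].read(), media_type="application/pdf")
```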
Tooling/Services
Content: Jinja2, Pandas, Matplotlib/Plotly for visuals.
Tracking: W&B Artifacts; DVC; Git for versioning.
Signing: in-toto attestation, AWS KMS; SLSA provenance (optional).
Registry: your model registry (SageMaker/MLflow-compatible) enriched with model_card_uri, datasheet_uri.
Outputs & Storage
s3://…/governance/model_cards/model_vX.Y/model_card.md|html|pdf (signed), datasheets/dataset_<tag>.md.
Links recorded in the Model Registry + W&B; checksums in promotion_record.json.