Cost, Lifecycle, Compliance

29) Cost Telemetry (unit economics, showback/chargeback, carbon)

  • Trigger

    • Hourly CloudWatch metrics ingestion for online services; per-job hooks for training/batch; nightly CUR (Cost & Usage Report) refresh; weekly finance roll-up.

    • PR merge to main for tag compliance checks (prevents deploying untagged resources).

  • Inputs

    • AWS CUR in S3 (hourly/daily granularity, resource IDs with cost allocation tags).

    • Tags (mandatory): Project=ADAS, Env, WorkflowId (e.g., w13_training), DatasetTag (DVC tag), ModelVersion (registry version), Team, CostCenter.

    • Runtime counters emitted by jobs/services: GPU hours, instance type, container image SHA, data scanned (Athena/EMR), requests QPS, p50/p90/p99 latencies.

    • Training metadata: W&B runs (epochs, wall-clock, GPU type/num), SageMaker job descriptions.

    • Carbon factors (optional): region-level kgCO₂/kWh and GPU/CPU power draw (from nvidia-smi logs or instance specs).

  • Steps (with testing/validation)

    • (Tag enforcement pre-deploy)

      • IaC policy as code (OPA/Conftest) validates Terraform/CloudFormation require cost tags.

      • GitHub Action fails the PR if mandatory tags are missing; unit-test stubs check the Terraform plan for tags (see the sketch below).
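
      • A minimal pytest sketch of that plan check, assuming the plan was exported via terraform show -json (file path and tag list are illustrative):

```python
# Hypothetical unit test: every taggable resource in the Terraform plan JSON
# must carry the mandatory cost tags. The plan.json path is an assumption.
import json

MANDATORY_TAGS = {"Project", "Env", "WorkflowId", "Team", "CostCenter"}

def test_plan_has_mandatory_tags():
    with open("plan.json") as f:
        plan = json.load(f)
    for rc in plan.get("resource_changes", []):
        after = (rc.get("change") or {}).get("after") or {}
        tags = after.get("tags")
        if tags is None:  # resource type is not taggable; skip
            continue
        missing = MANDATORY_TAGS - set(tags)
        assert not missing, f"{rc['address']} missing tags: {missing}"
```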

    • (Ingestion)

      • Kinesis/Firehose → S3 for online ops metrics; EMR/Spark job normalizes to ops_cost_metrics.parquet (schema: ts, resource_id, workflow_id, qps, p50, p99, gpu_util, cpu_util, mem_gb, bytes_out).

      • CUR loader (Athena CTAS) materializes tag-keyed tables: cur_by_workflow, cur_by_model, cur_by_dataset.

      • Training hooks post to an SQS queue per job start/stop with {job_id, model_version, dataset_tag, nodes, gpus, start_ts, end_ts}; aggregator joins with CUR line items by resource ARN.
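
      • A minimal sketch of that job hook (queue URL is hypothetical; field names mirror the payload above):

```python
# Post a start/stop event per training job to SQS so the aggregator can
# join it with CUR line items by resource ARN.
import json
import time

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/cost-telemetry"  # hypothetical

def emit_job_event(job_id, model_version, dataset_tag, nodes, gpus, start_ts, end_ts=None):
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({
            "job_id": job_id,
            "model_version": model_version,
            "dataset_tag": dataset_tag,
            "nodes": nodes,
            "gpus": gpus,
            "start_ts": start_ts,
            "end_ts": end_ts or int(time.time()),
        }),
    )
```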

    • (Compute unit economics)

      • Per-model build sheet: $ / epoch, $ / mAP point, $ / 1M inferences, $ / GB scanned, $ / GPU-hour.

      • Per-scenario unit cost: cost to acquire + label one example for key slices (rain/night, workzone, pedestrian).

    • (Carbon telemetry)

      • Estimate energy = Σ (GPU power × utilization × time + instance overhead); apply region carbon factor → kgCO₂ per job; attach to W&B run summary.
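
      • A back-of-envelope sketch of that estimate (power draws and region factors below are illustrative constants, not measured values):

```python
# Energy = (GPU draw x utilization x count + instance overhead) x hours;
# multiply by the region carbon factor to get kgCO2 per job.
GPU_WATTS = {"A100": 400, "V100": 300}       # nameplate draw, assumption
REGION_KGCO2_PER_KWH = {"eu-west-1": 0.28}   # illustrative factor

def job_kgco2(gpu_type, n_gpus, avg_util, hours, region, overhead_kw=0.3):
    gpu_kw = GPU_WATTS[gpu_type] / 1000.0 * n_gpus * avg_util
    return (gpu_kw + overhead_kw) * hours * REGION_KGCO2_PER_KWH[region]

# Attaching to a W&B run summary via the public API (run path hypothetical):
# import wandb
# run = wandb.Api().run("adas/perception/abc123")
# run.summary["kgCO2"] = job_kgco2("A100", 8, 0.85, 12.0, "eu-west-1")
# run.update()
```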

    • (Anomaly detection & guardrails)

      • Cost Explorer/Anomaly Detection thresholds: alert if daily burn for w13_training deviates > 2σ or S3 retrieval spikes (Glacier restores).

      • Policy guard: block training jobs whose projected cost exceeds the configured per-run budget unless tagged SpendOverride=true (see the sketch below).
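
      • One way to implement that guard, sketched as a periodic check against a running SageMaker job (hourly-rate table and budget are illustrative):

```python
# Stop a training job once its estimated burn exceeds the per-run budget,
# unless it carries SpendOverride=true. Rates and budget are assumptions.
from datetime import datetime, timezone

import boto3

sm = boto3.client("sagemaker")
HOURLY_RATE = {"ml.p4d.24xlarge": 32.77}   # illustrative on-demand rate
BUDGET_USD = 2000.0                        # configured per-run budget

def enforce_budget(job_name: str):
    job = sm.describe_training_job(TrainingJobName=job_name)
    tags = {t["Key"]: t["Value"]
            for t in sm.list_tags(ResourceArn=job["TrainingJobArn"])["Tags"]}
    if tags.get("SpendOverride") == "true":
        return  # explicit override: let it run
    start = job.get("TrainingStartTime")
    hours = (datetime.now(timezone.utc) - start).total_seconds() / 3600 if start else 0.0
    rc = job["ResourceConfig"]
    if HOURLY_RATE.get(rc["InstanceType"], 0.0) * rc["InstanceCount"] * hours > BUDGET_USD:
        sm.stop_training_job(TrainingJobName=job_name)
```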

    • (Validation)

      • Great Expectations checks on cur_by_* views: non-null tags; sum by tag == account total (±1% to allow amortized fees).

      • Reconciliation test: random sample of SageMaker jobs must appear in CUR within 48h.

      • Dashboard snapshot diffs (golden numbers for last week) to catch regressions.
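
      • A pandas sketch of the tag-total reconciliation above (column names follow the cur_by_workflow view and are assumptions):

```python
# Tagged spend must equal the account total within the +/-1% amortization
# allowance, and no row may carry a null WorkflowId.
import pandas as pd

def reconcile(cur_by_workflow: pd.DataFrame, account_total: float, tol: float = 0.01):
    assert cur_by_workflow["workflow_id"].notna().all(), "null WorkflowId tags"
    tagged = cur_by_workflow["unblended_cost"].sum()
    drift = abs(tagged - account_total) / account_total
    assert drift <= tol, f"tagged spend off by {drift:.1%} (> {tol:.0%} allowance)"
```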

  • Tooling/Services

    • AWS: CUR on S3 + Athena, Cost Explorer API, Cost Anomaly Detection, CloudWatch, EventBridge, QuickSight dashboards, SageMaker APIs, S3 Inventory.

    • Data: EMR/Spark or Glue ETL; Athena CTAS; Parquet/Iceberg tables.

    • CI: OPA/Conftest for tag rules; GitHub Actions; Checkov/tfsec for IaC.

    • Experiment tracking: W&B (attach cost & carbon to runs).

  • Outputs & Storage

    • s3://…/governance/cost/cur_by_workflow.parquet, cur_by_model.parquet, ops_cost_metrics.parquet.

    • QuickSight dashboards: Model Unit Economics, Ops Cost & Latency Heatmap, Carbon per Run.

    • Alerts in SNS/Slack: cost anomalies, budget breaches.

    • W&B run summaries updated with train_cost_usd, gpu_hours, kgCO2.


30) Data Lifecycle & Tiering (retention, tiering, compaction, right-to-erasure)

  • Trigger

    • Daily lifecycle sweep; weekly compaction/OPTIMIZE; event-driven on access pattern change (S3 Storage Class Analysis).

  • Inputs

    • Data classes: Bronze (raw logs), Silver (synced/converted), Gold (curated/labels), Hot feature tables, Cold archives.

    • S3 Access Logs / Storage Class Analysis (object age, last-access time).

    • Lineage graph (Neptune/Atlas): object → derived tables/manifests (enables erasure propagation).

    • Legal requests: DSR (data subject request) manifests of the form {drive_id, vehicle_id, ts_range}.

  • Steps (with testing/validation)

    • (Policy definition as code)

      • Lifecycle YAML: for each data class, define retention, tier transitions, encryption, replication, PII status.

        • Example: Bronze camera logs: 30 days in Standard, then Intelligent-Tiering; archive to Glacier Deep Archive at 180 days; retain 5 years if linked to an unresolved safety incident.

      • Glue Iceberg table properties: write.target-file-size-bytes, commit.manifest.min-count-to-merge, snapshot retention (e.g., keep 14 days of snapshots).
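
      • A PySpark sketch of setting those properties (catalog and table name are illustrative):

```python
# Apply the Iceberg sizing/retention properties named above to one table.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.sql("""
    ALTER TABLE glue_catalog.adas.silver_frames SET TBLPROPERTIES (
        'write.target-file-size-bytes' = '536870912',        -- ~512 MB
        'commit.manifest.min-count-to-merge' = '100',
        'history.expire.max-snapshot-age-ms' = '1209600000'  -- 14 days
    )
""")
```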

    • (Automated actions)

      • Create/maintain S3 Lifecycle rules + retrieval policies: Intelligent-Tiering for Silver, One-Zone-IA for low-risk derived frames, Glacier for Bronze archives (see the sketch after this list).

      • Compaction/OPTIMIZE: EMR Spark job rewrites small Parquet/Iceberg files into large (512 MB–1 GB) files within each partition; ZSTD compression; sorted by (date, vehicle_id, sensor).

      • Partition evolution: validate the partition strategy (e.g., dt=YYYY-MM-DD/vehicle_id=) and update Athena/Glue metadata when it changes.
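
      • A boto3 sketch of the lifecycle rule from the Bronze example above (bucket and prefix are illustrative):

```python
# Standard -> Intelligent-Tiering at 30 days, Glacier Deep Archive at 180 days.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="adas-data-bronze",  # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [{
            "ID": "bronze-camera-tiering",
            "Filter": {"Prefix": "camera/"},
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "INTELLIGENT_TIERING"},
                {"Days": 180, "StorageClass": "DEEP_ARCHIVE"},
            ],
        }]
    },
)
```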

    • (Right-to-erasure / legal hold)

      • DSR processor traverses lineage to locate all derivatives (frames, embeddings, labels); issues S3 Batch Operations delete; tombstones rows in Iceberg; updates OpenSearch documents; re-compacts affected partitions.

      • Legal hold: apply S3 Object Lock retention (compliance mode where required) to current and non-current object versions so held objects cannot be deleted until the hold is lifted.
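
      • A minimal sketch of the erasure-propagation step above (bucket, table, and batching are illustrative; use S3 Batch Operations at scale):

```python
# Delete raw objects found via the lineage graph, then tombstone derived
# rows with an Iceberg row-level DELETE; re-compaction runs afterwards.
import boto3
from pyspark.sql import SparkSession

s3 = boto3.client("s3")
spark = SparkSession.builder.getOrCreate()

def erase(drive_id: str, raw_keys: list):
    for i in range(0, len(raw_keys), 1000):  # delete_objects caps at 1000 keys
        s3.delete_objects(
            Bucket="adas-data-bronze",  # hypothetical
            Delete={"Objects": [{"Key": k} for k in raw_keys[i:i + 1000]]},
        )
    # NB: parameterize in production; f-string shown for brevity
    spark.sql(f"DELETE FROM glue_catalog.adas.gold_labels WHERE drive_id = '{drive_id}'")
```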

    • (Access & cost optimization)

      • Storage Class Analysis reports → move Gold labels not accessed for ≥90 days to Standard-IA; auto-restore on demand with caching.

      • Athena workgroup budgets and per-query bytes scanned limits to curb runaway costs.

    • (Validation & safety checks)

      • Preflight: simulate lifecycle on a canary bucket; ensure no Gold/Registry artifacts are expired.

      • DVC/Git pointer audit: for each dataset_spec.yaml, verify referenced URIs still exist after compaction/moves (see the sketch after this list).

      • Random restore test from Glacier weekly; measure retrieval SLA; alert on failures.

      • GDPR audit trail: every erasure creates erasure_receipt.json with object list, versions, timestamps.
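
      • A sketch of the pointer audit named above (a spec layout with a top-level uris list is an assumption):

```python
# Verify every URI referenced by a dataset spec still resolves after
# compaction/moves; dangling references fail the audit.
from urllib.parse import urlparse

import boto3
import yaml

s3 = boto3.client("s3")

def audit_dataset_spec(path: str = "dataset_spec.yaml"):
    spec = yaml.safe_load(open(path))
    missing = []
    for uri in spec.get("uris", []):  # assumed field
        p = urlparse(uri)
        try:
            s3.head_object(Bucket=p.netloc, Key=p.path.lstrip("/"))
        except s3.exceptions.ClientError:
            missing.append(uri)
    assert not missing, f"dangling references after lifecycle actions: {missing}"
```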

  • Tooling/Services

    • AWS: S3 (Lifecycle, Object Lock, Inventory), Glacier, Intelligent-Tiering, S3 Batch Operations, Glue/Athena, EMR Spark, Lake Formation (permissions), Macie (PII discovery), CloudTrail (audit).

    • Catalog/lineage: Glue Data Catalog + (optional) Atlas/Neptune for graph lineage; OpenSearch index updates.

  • Outputs & Storage

    • s3://…/governance/lifecycle/policy.yaml, compaction_reports/…, erasure_receipts/…json.

    • Glue/Athena metadata reflecting latest partitions; Lake Formation grants updated.

    • Ops dashboard: Storage by Class, Hot/Cold by Dataset/Model, Restore SLA.


31) Security Scans (code, containers, IaC, runtime, secrets)

  • Trigger

    • On every PR and nightly; on container build; pre-deploy gate in CD; quarterly full DAST; after critical CVE advisories.

  • Inputs

    • Source code (Python, infra), Dockerfiles, Terraform/IaC, Helm charts/K8s manifests.

    • SBOMs; dependency lockfiles; container images in ECR.

    • Staging endpoints for API DAST.

  • Steps (with testing/validation)

    • (SAST & dependency audit)

      • Semgrep/CodeQL: rulepacks for Python (FastAPI, boto3 misuse, deserialization), Rego, Terraform.

      • Bandit for Python; pip-audit/Safety for Python dependencies; block on critical issues.

    • (Containers & SBOM)

      • Trivy/Grype image scan; fail on CRITICAL CVEs (non-ignored); enforce non-root user, read-only filesystem, drop CAPs.

      • Syft SBOM (CycloneDX or SPDX) published as artifact; attach to model package in registry.

    • (IaC & policy)

      • Checkov/tfsec for Terraform; cfn-nag if CFN present; Conftest (OPA) enforces:

        • Encryption at rest (S3, EBS, RDS), TLS 1.2+, private subnets for GPU nodes, SG least privilege.

        • Mandatory cost tags; disallow 0.0.0.0/0 ingress to control planes; deny public S3 ACLs.
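
      • A pytest companion to those Conftest rules, checking the same ingress invariant directly on the plan JSON (complements, not replaces, OPA in CI):

```python
# Fail if any security-group rule in the Terraform plan opens ingress to
# 0.0.0.0/0. The plan.json path is an assumption.
import json

def test_no_world_open_ingress():
    with open("plan.json") as f:
        plan = json.load(f)
    for rc in plan.get("resource_changes", []):
        if rc["type"] != "aws_security_group_rule":
            continue
        after = (rc.get("change") or {}).get("after") or {}
        if after.get("type") == "ingress":
            assert "0.0.0.0/0" not in (after.get("cidr_blocks") or []), rc["address"]
```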

    • (Secrets hygiene)

      • Gitleaks/git-secrets on diffs; pre-commit hooks block commits that contain secrets.

      • CI verifies secrets only from OIDC-assumed roles; no long-lived keys in repo or container layers.

    • (DAST & API posture)

      • OWASP ZAP active scan against staging APIs (rate-limited); k6 smoke load to ensure auth flows work under scan.

      • TLS checker (sslyze) validates cipher suites; HTTP security headers lint.
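
      • A tiny header lint in that spirit (endpoint and required header set are illustrative):

```python
# Smoke-check HTTP security headers on a staging endpoint.
import requests

REQUIRED = ["Strict-Transport-Security", "X-Content-Type-Options", "Content-Security-Policy"]

def test_security_headers():
    r = requests.get("https://staging.example.internal/healthz", timeout=10)  # hypothetical URL
    missing = [h for h in REQUIRED if h not in r.headers]
    assert not missing, f"missing security headers: {missing}"
```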

    • (Runtime hardening)

      • EKS/ECS task defs: seccomp RuntimeDefault, AppArmor (if supported), read-only root, tmpfs for /tmp, resource limits set.

      • AWS Inspector on instances/containers; GuardDuty and Security Hub aggregation.

    • (Compliance pack & exceptions)

      • Findings triaged to Jira; risk-acceptance workflow with expiry dates; exception registry kept in a CODEOWNERS-reviewed file.

      • Weekly roll-up: open vs. closed issues, MTTR, trend.

  • Tooling/Services

    • CI/CD: GitHub Actions; CodeQL; Semgrep; Gitleaks; Trivy/Grype; Syft; Bandit; pip-audit; Checkov/tfsec; Conftest/OPA; OWASP ZAP; k6.

    • AWS: ECR scan, Inspector, GuardDuty, Security Hub, IAM Access Analyzer.

  • Outputs & Storage

    • CI artifacts: sast_report.sarif, dependency_vulns.json, sbom.cyclonedx.json, container_scan.json, iac_scan.json, zap_report.html.

    • Security Hub/Inspector findings; Jira tickets; exception register security/risk_acceptances.yaml.


32) Datasheets & Model Cards (governance, transparency, sign-off)

  • Trigger

    • After Eval & Robustness (#16) completes and a candidate is marked ready-for-promotion; on Promotion (#18) a final signed snapshot is minted; regenerate if training/config changes.

  • Inputs

    • W&B run metadata (hyperparams, metrics, artifacts); evaluation reports (per-slice metrics, calibration, robustness); training config YAML.

    • Dataset slices.yaml + datasheet; drift & bias audits (Evidently/GE); safety predicate versions.

    • Cost/carbon summary from #29; compliance attestations (security scans, PII checks); lineage (code SHA, container digest, data versions).

  • Steps (with testing/validation)

    • (Template render)

      • Jinja2 templates for Datasheet for Datasets and Model Card; sections:

        • Intended Use & Limitations (operational domain, weather/time/sensor assumptions).

        • Training Data (provenance, size, class/condition balance, label sources: auto vs. human, QA rates).

        • Evaluation: overall & slice metrics (night/rain/workzone), error taxonomies, calibration plots, failure exemplars.

        • Robustness: perturbation tests (jpeg, blur, occlusion), drift sensitivity.

        • Fairness/Compliance: bias tests relevant to domain; privacy notes; applicable standards (e.g., cybersecurity controls).

        • Operational: latency/throughput envelopes, memory/compute footprint, dependency SBOM hash.

        • Cost/Carbon: $ per epoch/run; kgCO₂ per training run.

        • Safety Predicates: policy IDs enforced, thresholds, fallback behavior.

        • Change Log: deltas vs. prior version; migration notes.
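
      • A minimal Jinja2 render sketch for that template (template path and context keys are assumptions mirroring the sections above):

```python
# Render the model card from a Jinja2 template; values shown are placeholders
# that the pipeline fills from W&B, Athena, and the #29 cost summary.
from jinja2 import Environment, FileSystemLoader

env = Environment(loader=FileSystemLoader("templates"), autoescape=False)
card = env.get_template("model_card.md.j2").render(  # hypothetical template
    model_version="vX.Y",
    intended_use={"odd": "highway; day/night; light rain"},
    slice_metrics={"night": None, "rain": None, "workzone": None},
    cost={"train_cost_usd": None, "kgCO2": None},
)
with open("model_card_vX.Y.md", "w") as f:
    f.write(card)
```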

    • (Artifact gathering)

      • Pull plots from W&B and export as PNG; embed metrics tables from Athena queries; attach scan report summaries (e.g., 0 critical CVEs).

      • Link to dataset datasheet: includes collection methods, preprocessing, labeling guidelines, known gaps (e.g., low snow coverage), retention policy (from #30).

    • (Automated checks)

      • Completeness linter: every required section present; numeric fields non-null; footnotes linkable.

      • Consistency: model hash in card matches registry entry & container digest; dataset tag matches DVC tag; metrics match evaluation report (checksum).
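
      • A sketch of those checks (required-section list and heading convention are assumptions):

```python
# Completeness + consistency lint: required sections present, and the model
# hash / dataset tag embedded in the card match the registry and DVC values.
import hashlib

REQUIRED_SECTIONS = ["Intended Use", "Training Data", "Evaluation", "Robustness",
                     "Cost/Carbon", "Safety Predicates", "Change Log"]

def lint_card(card_md: str, registry_model_hash: str, dvc_dataset_tag: str) -> str:
    missing = [s for s in REQUIRED_SECTIONS if f"## {s}" not in card_md]
    assert not missing, f"missing sections: {missing}"
    assert registry_model_hash in card_md, "model hash in card != registry entry"
    assert dvc_dataset_tag in card_md, "dataset tag in card != DVC tag"
    return hashlib.sha256(card_md.encode()).hexdigest()  # checksum for the promotion record
```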

    • (Approvals & signing)

      • Codeowners-based reviewers (ML lead, Safety lead, Security, Product) approve via a GitHub PR on model-card-vX.Y.md.

      • On approval: CI stamps an in-toto attestation (SLSA provenance), signs with KMS (sketch below), and stores the signed PDF/HTML.
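
      • A KMS signing sketch (key alias is hypothetical; an asymmetric sign/verify key is assumed):

```python
# Sign the SHA-256 digest of the approved card with a KMS asymmetric key.
import hashlib

import boto3

kms = boto3.client("kms")

def sign_card(card_bytes: bytes, key_id: str = "alias/model-card-signing"):
    digest = hashlib.sha256(card_bytes).digest()
    resp = kms.sign(
        KeyId=key_id,
        Message=digest,
        MessageType="DIGEST",
        SigningAlgorithm="RSASSA_PSS_SHA_256",
    )
    return resp["Signature"]
```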

    • (Distribution & discoverability)

      • Publish HTML to internal portal (S3 static hosting behind IAM/ALB); attach to Model Registry entry; persist link in W&B run.

      • API endpoint GET /model/{version}/card returns the signed snapshot; its hash is logged in the promotion record (sketch below).
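
      • A FastAPI sketch of that endpoint (an in-memory lookup stands in for the registry):

```python
# Serve the signed card snapshot per model version; unknown versions 404.
from fastapi import FastAPI, HTTPException

app = FastAPI()
CARDS = {}  # version -> {"uri": ..., "sha256": ..., "signature": ...}, loaded from the registry

@app.get("/model/{version}/card")
def get_card(version: str):
    card = CARDS.get(version)
    if card is None:
        raise HTTPException(status_code=404, detail="unknown model version")
    return card
```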

  • Tooling/Services

    • Content: Jinja2, Pandas, Matplotlib/Plotly for visuals.

    • Tracking: W&B Artifacts; DVC; Git for versioning.

    • Signing: in-toto attestation, AWS KMS; SLSA provenance (optional).

    • Registry: your model registry (SageMaker/MLflow-compatible) enriched with model_card_uri, datasheet_uri.

  • Outputs & Storage

    • s3://…/governance/model_cards/model_vX.Y/model_card.md|html|pdf (signed), datasheets/dataset_<tag>.md.

    • Links recorded in Model Registry + W&B; checksums in promotion_record.json.