Cost, Lifecycle, Compliance

29) Cost Telemetry (unit economics, showback/chargeback, carbon)
Trigger
Hourly CloudWatch metrics ingestion for online services; per-job hooks for training/batch; nightly CUR (Cost & Usage Report) refresh; weekly finance roll-up.
PR merge to main for tag compliance checks (prevents deploying untagged resources).
Inputs
AWS CUR in S3 (hourly/daily granularity, resource IDs with cost allocation tags).
Tags (mandatory): Project=ADAS, Env, WorkflowId (e.g., w13_training), DatasetTag (DVC tag), ModelVersion (registry version), Team, CostCenter.
Runtime counters emitted by jobs/services: GPU hours, instance type, container image SHA, data scanned (Athena/EMR), request QPS, p50/p90/p99 latencies.
Training metadata: W&B runs (epochs, wall-clock, GPU type/num), SageMaker job descriptions.
Carbon factors (optional): region-level kgCO₂/kWh and GPU/CPU power draw (from nvidia-smi logs or instance specs).
Steps (with testing/validation)
(Tag enforcement pre-deploy)
IaC policy as code (OPA/Conftest) validates that Terraform/CloudFormation resources carry the required cost tags.
A GitHub Action fails the PR if mandatory tags are missing; unit-test stubs check the Terraform plan for tags (see the sketch below).
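A minimal sketch of the tag-check step, assuming the CI job has already exported the plan with `terraform show -json plan.out > plan.json`; the taggable resource prefixes and file paths are illustrative.

```python
# check_tags.py — fail CI if planned resources are missing mandatory cost tags.
# Assumes plan.json came from `terraform show -json`; tag keys mirror the mandatory list above.
import json
import sys

MANDATORY_TAGS = {"Project", "Env", "WorkflowId", "DatasetTag", "ModelVersion", "Team", "CostCenter"}
TAGGABLE_PREFIXES = ("aws_s3_bucket", "aws_instance", "aws_sagemaker")  # illustrative subset

def missing_tags(plan_path: str) -> list[str]:
    with open(plan_path) as f:
        plan = json.load(f)
    failures = []
    for rc in plan.get("resource_changes", []):
        if not rc["type"].startswith(TAGGABLE_PREFIXES):
            continue
        after = (rc.get("change") or {}).get("after") or {}
        tags = after.get("tags") or {}
        absent = MANDATORY_TAGS - set(tags)
        if absent:
            failures.append(f"{rc['address']}: missing {sorted(absent)}")
    return failures

if __name__ == "__main__":
    problems = missing_tags(sys.argv[1] if len(sys.argv) > 1 else "plan.json")
    if problems:
        print("\n".join(problems))
        sys.exit(1)
```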
(Ingestion)
Kinesis/Firehose → S3 for online ops metrics; an EMR/Spark job normalizes them to ops_cost_metrics.parquet (schema: ts, resource_id, workflow_id, qps, p50, p99, gpu_util, cpu_util, mem_gb, bytes_out).
CUR loader (Athena CTAS) builds materialized views by tag: cur_by_workflow, cur_by_model, cur_by_dataset.
Training hooks post to an SQS queue per job start/stop with {job_id, model_version, dataset_tag, nodes, gpus, start_ts, end_ts}; an aggregator joins these with CUR line items by resource ARN (see the sketch below).
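A sketch of that aggregator join, assuming the CUR is queryable in Athena and the job hooks land in a training_jobs table; the database, table, and bucket names are illustrative, while the CUR column names follow the standard Athena/CUR schema.

```python
# Join training job metadata with CUR line items by resource ARN via Athena CTAS (illustrative names).
import boto3

athena = boto3.client("athena")

QUERY = """
CREATE TABLE governance.cur_by_training_job
WITH (format = 'PARQUET',
      external_location = 's3://adas-governance/cost/cur_by_training_job/') AS
SELECT j.job_id,
       j.model_version,
       j.dataset_tag,
       SUM(c.line_item_unblended_cost) AS cost_usd,
       date_diff('second', MIN(j.start_ts), MAX(j.end_ts)) / 3600.0 * MAX(j.gpus) AS gpu_hours
FROM cost.cur_table c
JOIN governance.training_jobs j
  ON c.line_item_resource_id = j.resource_arn
 AND c.line_item_usage_start_date BETWEEN j.start_ts AND j.end_ts
GROUP BY j.job_id, j.model_version, j.dataset_tag
"""

athena.start_query_execution(
    QueryString=QUERY,
    WorkGroup="governance",
    ResultConfiguration={"OutputLocation": "s3://adas-governance/athena-results/"},
)
```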
(Compute unit economics)
Per-model build sheet: $ / epoch, $ / mAP point, $ / 1M inferences, $ / GB scanned, $ / GPU-hour.
Per-scenario unit cost: cost to acquire + label one example for key slices (rain/night, workzone, pedestrian).
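A small pandas sketch of the build-sheet ratios, assuming the joined cost table from the ingestion step and a per-run metrics export; all column names and paths are illustrative.

```python
# Build-sheet unit economics per model version (illustrative columns and paths).
import pandas as pd

cost = pd.read_parquet("s3://adas-governance/cost/cur_by_training_job/")   # job_id, model_version, cost_usd, gpu_hours
runs = pd.read_parquet("s3://adas-governance/cost/run_metrics.parquet")    # job_id, epochs, map_delta, inferences, gb_scanned

sheet = cost.merge(runs, on="job_id")
sheet["usd_per_epoch"] = sheet["cost_usd"] / sheet["epochs"]
sheet["usd_per_map_point"] = sheet["cost_usd"] / sheet["map_delta"].clip(lower=1e-6)
sheet["usd_per_1m_inferences"] = sheet["cost_usd"] / (sheet["inferences"] / 1e6)
sheet["usd_per_gb_scanned"] = sheet["cost_usd"] / sheet["gb_scanned"]
sheet["usd_per_gpu_hour"] = sheet["cost_usd"] / sheet["gpu_hours"]

cols = ["usd_per_epoch", "usd_per_map_point", "usd_per_1m_inferences",
        "usd_per_gb_scanned", "usd_per_gpu_hour"]
sheet.groupby("model_version")[cols].mean().to_parquet(
    "s3://adas-governance/cost/build_sheet.parquet"
)
```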
(Carbon telemetry)
Estimate energy = Σ (GPU power × utilization × time + instance overhead); apply region carbon factor → kgCO₂ per job; attach to W&B run summary.
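A sketch of the estimate attached to a W&B run summary; the average GPU power, overhead factor, and region carbon factor are placeholders to be replaced with measured nvidia-smi averages and published grid factors.

```python
# Estimate job energy and carbon, then attach to the W&B run (placeholder factors).
import wandb

def job_carbon(gpu_count, gpu_hours, avg_gpu_power_w=300.0, overhead_factor=1.4,
               region_kgco2_per_kwh=0.35):
    # energy (kWh) = GPU power × GPU-hours × GPU count, inflated by host/PUE overhead
    energy_kwh = (avg_gpu_power_w / 1000.0) * gpu_hours * gpu_count * overhead_factor
    return energy_kwh, energy_kwh * region_kgco2_per_kwh

run = wandb.init(project="adas-training", id="<run_id>", resume="allow")
energy_kwh, kgco2 = job_carbon(gpu_count=8, gpu_hours=12.5)
run.summary.update({"energy_kwh": energy_kwh, "kgCO2": kgco2})
run.finish()
```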
(Anomaly detection & guardrails)
Cost Explorer / Cost Anomaly Detection thresholds: alert if daily burn for w13_training deviates > 2σ or S3 retrieval spikes (Glacier restores).
Policy guard: block training jobs above the configured per-run budget unless the label SpendOverride=true is set (see the sketch below).
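A sketch of the per-run budget guard, assuming the submitting pipeline estimates run cost from instance pricing before launching the job; the budget value and tag names are illustrative.

```python
# Pre-submit budget guard: refuse training jobs whose estimated cost exceeds the per-run budget
# unless SpendOverride=true is present in the job tags (illustrative values).
MAX_RUN_BUDGET_USD = 2000.0

def enforce_budget(estimated_cost_usd: float, tags: dict[str, str]) -> None:
    override = tags.get("SpendOverride", "false").lower() == "true"
    if estimated_cost_usd > MAX_RUN_BUDGET_USD and not override:
        raise RuntimeError(
            f"Estimated run cost ${estimated_cost_usd:,.0f} exceeds budget "
            f"${MAX_RUN_BUDGET_USD:,.0f}; set SpendOverride=true to proceed."
        )

# Example: 8 GPUs × 24 h × $4.10/GPU-hour ≈ $787 → allowed.
enforce_budget(8 * 24 * 4.10, {"Project": "ADAS", "WorkflowId": "w13_training"})
```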
(Validation)
Great Expectations checks on cur_by_* views: non-null tags; sum by tag == account total (±1% to allow for amortized fees).
Reconciliation test: a random sample of SageMaker jobs must appear in CUR within 48 h.
Dashboard snapshot diffs (golden numbers for last week) to catch regressions.
Tooling/Services
AWS: CUR on S3 + Athena, Cost Explorer API, Cost Anomaly Detection, CloudWatch, EventBridge, QuickSight dashboards, SageMaker APIs, S3 Inventory.
Data: EMR/Spark or Glue ETL; Athena CTAS; Parquet/Iceberg tables.
CI: OPA/Conftest for tag rules; GitHub Actions; Checkov/tfsec for IaC.
Experiment tracking: W&B (attach cost & carbon to runs).
Outputs & Storage
s3://…/governance/cost/cur_by_workflow.parquet, cur_by_model.parquet, ops_cost_metrics.parquet.
QuickSight dashboards: Model Unit Economics, Ops Cost & Latency Heatmap, Carbon per Run.
Alerts in SNS/Slack: cost anomalies, budget breaches.
W&B run summaries updated with train_cost_usd, gpu_hours, kgCO2.
30) Data Lifecycle & Tiering (retention, tiering, compaction, right-to-erasure)
Trigger
Daily lifecycle sweep; weekly compaction/OPTIMIZE; event-driven on access pattern change (S3 Storage Class Analysis).
Inputs
Data classes: Bronze (raw logs), Silver (synced/converted), Gold (curated/labels), Hot feature tables, Cold archives.
S3 Access Logs / Storage Class Analysis (object age, last-access time).
Lineage graph (Neptune/Atlas): object → derived tables/manifests (enables erasure propagation).
Legal requests: DSR (data subject request) manifests: {drive_id, vehicle_id, ts_range}.
Steps (with testing/validation)
(Policy definition as code)
Lifecycle YAML: for each data class, define retention, tier transitions, encryption, replication, PII status.
Example: Bronze camera: 30 days in Standard, then Intelligent-Tiering, archive to Glacier Deep Archive at 180 days; retain 5 years if linked to an unresolved safety incident (a boto3 sketch of this rule follows below).
Glue Iceberg table properties: write.target-file-size-bytes, commit.manifest.min-count-to-merge, snapshot retention (e.g., keep 14 days of snapshots).
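A sketch of translating the Bronze camera policy above into an S3 Lifecycle rule with boto3; the bucket and prefix are illustrative, and the 5-year safety-incident retention would be handled via legal hold rather than this rule.

```python
# Apply the Bronze camera tiering policy as an S3 Lifecycle rule (illustrative bucket/prefix).
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="adas-bronze-logs",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "bronze-camera-tiering",
                "Filter": {"Prefix": "camera/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "INTELLIGENT_TIERING"},
                    {"Days": 180, "StorageClass": "DEEP_ARCHIVE"},
                ],
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
            }
        ]
    },
)
```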
(Automated actions)
Create/maintain S3 Lifecycle rules + Retrieval policies; Intelligent-Tiering for Silver, One-Zone-IA for low-risk derived frames, Glacier for Bronze archives.
Compaction/OPTIMIZE: an EMR Spark job rewrites small Parquet/Iceberg files into large (512 MB–1 GB) partitions; ZSTD compression; sorted by (date, vehicle_id, sensor) (see the sketch below).
Partition evolution: validate the partition strategy (e.g., dt=YYYY-MM-DD/vehicle_id=) and update Athena/Glue.
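A minimal PySpark sketch of the compaction pass for plain Parquet partitions (an Iceberg table would instead use its rewrite_data_files/OPTIMIZE procedures); the paths, column names, and target partition count are illustrative.

```python
# Rewrite small Parquet files into larger, sorted files with ZSTD compression (illustrative paths).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("silver-compaction")
    .config("spark.sql.parquet.compression.codec", "zstd")
    .getOrCreate()
)

df = spark.read.parquet("s3://adas-silver/frames/dt=2024-06-01/")

# Aim for ~512 MB–1 GB output files; in practice derive the partition count from input size.
compacted = df.repartition(64, "vehicle_id").sortWithinPartitions("date", "vehicle_id", "sensor")
compacted.write.mode("overwrite").parquet("s3://adas-silver/frames_compacted/dt=2024-06-01/")
```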
(Right-to-erasure / legal hold)
DSR processor traverses lineage to locate all derivatives (frames, embeddings, labels); issues S3 Batch Operations delete; tombstones rows in Iceberg; updates OpenSearch documents; re-compacts affected partitions.
Legal hold marks objects with non-current-version retention via S3 Object Lock (compliance mode) where required.
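A sketch of the Iceberg tombstoning step of the DSR processor, assuming the Gold tables are Iceberg tables queryable from Athena engine v3 (which supports DELETE); table, column, and bucket names are illustrative, and raw-object deletion via S3 Batch Operations runs separately.

```python
# Tombstone rows matched by a DSR manifest in an Iceberg table via Athena (illustrative names).
import boto3

athena = boto3.client("athena")
dsr = {"drive_id": "drv_0421", "vehicle_id": "veh_017",
       "ts_start": "2023-11-02 14:00:00", "ts_end": "2023-11-02 16:00:00"}

delete_sql = f"""
DELETE FROM gold.frame_labels
WHERE drive_id = '{dsr["drive_id"]}'
  AND vehicle_id = '{dsr["vehicle_id"]}'
  AND ts BETWEEN TIMESTAMP '{dsr["ts_start"]}' AND TIMESTAMP '{dsr["ts_end"]}'
"""

athena.start_query_execution(
    QueryString=delete_sql,
    WorkGroup="governance",
    ResultConfiguration={"OutputLocation": "s3://adas-governance/athena-results/"},
)
```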
(Access & cost optimization)
Storage Class Analysis reports → move infrequently accessed Gold labels ≥90 days to IA; auto-restore on demand with caching.
Athena workgroup budgets and per-query bytes scanned limits to curb runaway costs.
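A sketch of enforcing the per-query scan limit on an Athena workgroup with boto3; the workgroup name and the 1 TiB cutoff are illustrative.

```python
# Cap bytes scanned per query for the analytics workgroup (illustrative name and limit).
import boto3

athena = boto3.client("athena")
athena.update_work_group(
    WorkGroup="adas-analytics",
    ConfigurationUpdates={
        "BytesScannedCutoffPerQuery": 1 * 1024**4,  # 1 TiB hard stop per query
        "EnforceWorkGroupConfiguration": True,
    },
)
```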
(Validation & safety checks)
Preflight: simulate lifecycle on a canary bucket; ensure no Gold/Registry artifacts are expired.
DVC/Git pointers audit: for each dataset_spec.yaml, verify referenced URIs exist after compaction/moves (see the sketch below).
Random restore test from Glacier weekly; measure retrieval SLA; alert on failures.
GDPR audit trail: every erasure creates erasure_receipt.json with object list, versions, timestamps.
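A sketch of the pointer audit, assuming each dataset_spec.yaml lists its s3:// URIs under a top-level files: key; that key name and the bucket layout are assumptions for illustration.

```python
# Verify every URI referenced by a dataset spec still resolves after compaction/tiering moves.
import boto3
import yaml
from urllib.parse import urlparse

s3 = boto3.client("s3")

def audit_dataset_spec(spec_path: str) -> list[str]:
    with open(spec_path) as f:
        spec = yaml.safe_load(f)
    missing = []
    for uri in spec.get("files", []):          # assumed top-level `files:` list of s3:// URIs
        parsed = urlparse(uri)
        try:
            s3.head_object(Bucket=parsed.netloc, Key=parsed.path.lstrip("/"))
        except s3.exceptions.ClientError:
            missing.append(uri)
    return missing

if __name__ == "__main__":
    broken = audit_dataset_spec("dataset_spec.yaml")
    if broken:
        raise SystemExit(f"{len(broken)} referenced URIs missing, e.g. {broken[:5]}")
```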
Tooling/Services
AWS: S3 (Lifecycle, Object Lock, Inventory), Glacier, Intelligent-Tiering, S3 Batch Operations, Glue/Athena, EMR Spark, Lake Formation (permissions), Macie (PII discovery), CloudTrail (audit).
Catalog/lineage: Glue Data Catalog + (optional) Atlas/Neptune for graph lineage; OpenSearch index updates.
Outputs & Storage
s3://…/governance/lifecycle/policy.yaml, compaction_reports/…, erasure_receipts/…json.
Glue/Athena metadata reflecting the latest partitions; Lake Formation grants updated.
Ops dashboard: Storage by Class, Hot/Cold by Dataset/Model, Restore SLA.
31) Security Scans (code, containers, IaC, runtime, secrets)
Trigger
On every PR and nightly; on container build; pre-deploy gate in CD; quarterly full DAST; after critical CVE advisories.
Inputs
Source code (Python, infra), Dockerfiles, Terraform/IaC, Helm charts/K8s manifests.
SBOMs; dependency lockfiles; container images in ECR.
Staging endpoints for API DAST.
Steps (with testing/validation)
(SAST & dependency audit)
Semgrep/CodeQL: rulepacks for Python (FastAPI, boto3 misuse, deserialization), Rego, Terraform.
Bandit for Python; pip-audit/Safety for Python dependencies; block on critical issues.
(Containers & SBOM)
Trivy/Grype image scan; fail on CRITICAL CVEs (non-ignored); enforce non-root user, read-only filesystem, drop CAPs.
Syft SBOM (CycloneDX/SPDX) published as a CI artifact; attached to the model package in the registry.
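A sketch of the CI gate over the image scan described above, assuming Trivy's JSON output (trivy image --format json) with its Results[].Vulnerabilities[] schema; the accepted-risk file path is illustrative.

```python
# Fail the build if the Trivy image scan reports CRITICAL CVEs not on the accepted-risk list.
import json
import sys

def critical_findings(report_path: str, ignore_path: str = "security/ignored_cves.txt") -> list[str]:
    with open(report_path) as f:
        report = json.load(f)
    try:
        with open(ignore_path) as f:
            ignored = {line.strip() for line in f if line.strip()}
    except FileNotFoundError:
        ignored = set()
    hits = []
    for result in report.get("Results", []):
        for vuln in result.get("Vulnerabilities") or []:
            if vuln.get("Severity") == "CRITICAL" and vuln.get("VulnerabilityID") not in ignored:
                hits.append(f'{vuln["VulnerabilityID"]} in {vuln.get("PkgName", "?")}')
    return hits

if __name__ == "__main__":
    findings = critical_findings(sys.argv[1] if len(sys.argv) > 1 else "container_scan.json")
    if findings:
        print("\n".join(findings))
        sys.exit(1)
```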
(IaC & policy)
Checkov/tfsec for Terraform; cfn-nag if CFN present; Conftest (OPA) enforces:
Encryption at rest (S3, EBS, RDS), TLS 1.2+, private subnets for GPU nodes, SG least privilege.
Mandatory cost tags; disallow 0.0.0.0/0 ingress to control planes; deny public S3 ACLs.
(Secrets hygiene)
Gitleaks/git-secrets on diffs; pre-commit hooks strip secrets.
CI verifies secrets only from OIDC-assumed roles; no long-lived keys in repo or container layers.
(DAST & API posture)
OWASP ZAP active scan against staging APIs (rate-limited); k6 smoke load to ensure auth flows work under scan.
TLS checker (sslyze) validates cipher suites; HTTP security headers lint.
(Runtime hardening)
EKS/ECS task defs: seccomp
RuntimeDefault
, AppArmor (if supported), read-only root, tmpfs for/tmp
, resource limits set.AWS Inspector on instances/containers; GuardDuty and Security Hub aggregation.
(Compliance pack & exceptions)
Findings triage to Jira; risk acceptance workflow with expiry; exception registry in codeowners file.
Weekly roll-up: open vs. closed issues, MTTR, trend.
Tooling/Services
CI/CD: GitHub Actions; CodeQL; Semgrep; Gitleaks; Trivy/Grype; Syft; Bandit; pip-audit; Checkov/tfsec; Conftest/OPA; OWASP ZAP; k6.
AWS: ECR scan, Inspector, GuardDuty, Security Hub, IAM Access Analyzer.
Outputs & Storage
CI artifacts: sast_report.sarif, dependency_vulns.json, sbom.cyclonedx.json, container_scan.json, iac_scan.json, zap_report.html.
Security Hub/Inspector findings; Jira tickets; exception register security/risk_acceptances.yaml.
32) Datasheets & Model Cards (governance, transparency, sign-off)
Trigger
After Eval & Robustness (#16) completes and a candidate is marked ready-for-promotion; on Promotion (#18) a final signed snapshot is minted; regenerate if training/config changes.
Inputs
W&B run metadata (hyperparams, metrics, artifacts), evaluation reports (per-slice metrics, calibration, robustness), training config YAML, dataset slices.yaml + datasheet, drift & bias audits (Evidently/GE), safety predicate versions, cost/carbon summary from #29, compliance attestations (security scans, PII checks), lineage (code SHA, container digest, data versions).
Steps (with testing/validation)
(Template render)
Jinja2 templates for the Datasheet for Datasets and the Model Card (a render sketch follows this list); sections:
Intended Use & Limitations (operational domain, weather/time/sensor assumptions).
Training Data (provenance, size, class/condition balance, label sources: auto vs. human, QA rates).
Evaluation: overall & slice metrics (night/rain/workzone), error taxonomies, calibration plots, failure exemplars.
Robustness: perturbation tests (jpeg, blur, occlusion), drift sensitivity.
Fairness/Compliance: bias tests relevant to domain; privacy notes; applicable standards (e.g., cybersecurity controls).
Operational: latency/throughput envelopes, memory/compute footprint, dependency SBOM hash.
Cost/Carbon: $ per epoch/run, kgCO₂ per training run.
Safety Predicates: policy IDs enforced, thresholds, fallback behavior.
Change Log: deltas vs. prior version; migration notes.
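A minimal render sketch, assuming a model_card.md.j2 template whose placeholders match the sections above; the template path and context keys/values are illustrative and would be assembled from W&B, Athena, and the evaluation reports.

```python
# Render the model card from a Jinja2 template and a gathered context (illustrative keys/paths).
from jinja2 import Environment, FileSystemLoader, StrictUndefined

env = Environment(loader=FileSystemLoader("governance/templates"), undefined=StrictUndefined)
template = env.get_template("model_card.md.j2")

context = {
    "model_version": "vX.Y",
    "intended_use": "Highway + urban ADAS perception; not validated for heavy snow.",
    "slice_metrics": [{"slice": "night", "mAP": 0.0}, {"slice": "rain", "mAP": 0.0}],  # filled from eval report
    "train_cost_usd": 0.0,          # filled from #29
    "kgCO2": 0.0,                   # filled from #29
    "safety_predicates": ["<policy_id>"],
}

with open("model_card_vX.Y.md", "w") as f:
    f.write(template.render(**context))   # StrictUndefined raises if a required field is missing
```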
(Artifact gathering)
Pull plots from W&B and export as PNG; embed metrics tables from Athena queries; attach scan report summaries (e.g., 0 critical CVEs).
Link to dataset datasheet: includes collection methods, preprocessing, labeling guidelines, known gaps (e.g., low snow coverage), retention policy (from #30).
(Automated checks)
Completeness linter: every required section present; numeric fields non-null; footnotes linkable.
Consistency: model hash in card matches registry entry & container digest; dataset tag matches DVC tag; metrics match evaluation report (checksum).
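A sketch of the consistency check tying the card to the registry entry, assuming the card's front matter carries model_hash, container_digest, and dataset_tag and the registry record is exported as JSON; all field names and paths are illustrative.

```python
# Consistency check: the card's identifiers must match the registry entry exactly (illustrative fields).
import json
import yaml

def check_card_consistency(card_meta_path: str, registry_record_path: str) -> None:
    with open(card_meta_path) as f:
        card = yaml.safe_load(f)
    with open(registry_record_path) as f:
        registry = json.load(f)
    for field in ("model_hash", "container_digest", "dataset_tag"):
        if card.get(field) != registry.get(field):
            raise ValueError(
                f"Mismatch on {field}: card={card.get(field)!r} registry={registry.get(field)!r}"
            )

check_card_consistency("model_card_vX.Y.meta.yaml", "registry/model_vX.Y.json")
```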
(Approvals & signing)
Codeowners-based reviewers (ML lead, Safety lead, Security, Product) approve a GitHub PR for model-card-vX.Y.md.
On approval: CI stamps attestation (in-toto SLSA provenance), signs with KMS; stores signed PDF/HTML.
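A sketch of the KMS signing step, assuming an asymmetric signing key and that the artifact's SHA-256 digest is what gets signed; the key alias, algorithm, and filenames are illustrative.

```python
# Sign the rendered model card's SHA-256 digest with an asymmetric KMS key (illustrative alias).
import hashlib
import boto3

kms = boto3.client("kms")

with open("model_card_vX.Y.pdf", "rb") as f:
    digest = hashlib.sha256(f.read()).digest()

resp = kms.sign(
    KeyId="alias/model-card-signing",
    Message=digest,
    MessageType="DIGEST",
    SigningAlgorithm="RSASSA_PSS_SHA_256",
)
with open("model_card_vX.Y.sig", "wb") as f:
    f.write(resp["Signature"])
```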
(Distribution & discoverability)
Publish HTML to internal portal (S3 static hosting behind IAM/ALB); attach to Model Registry entry; persist link in W&B run.
API endpoint GET /model/{version}/card returns the signed snapshot; its hash is logged in the promotion record (see the sketch below).
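A sketch of that endpoint, assuming FastAPI (already in the stack per #31) and that signed snapshots live under a per-version prefix in the governance bucket; the bucket and key layout are illustrative.

```python
# Serve the signed model card snapshot for a given model version (illustrative bucket/keys).
import boto3
from botocore.exceptions import ClientError
from fastapi import FastAPI, HTTPException
from fastapi.responses import Response

app = FastAPI()
s3 = boto3.client("s3")
BUCKET = "adas-governance"

@app.get("/model/{version}/card")
def get_model_card(version: str):
    key = f"governance/model_cards/model_{version}/model_card.pdf"
    try:
        obj = s3.get_object(Bucket=BUCKET, Key=key)
    except ClientError:
        raise HTTPException(status_code=404, detail=f"No signed card for {version}")
    return Response(content=obj["Body"].read(), media_type="application/pdf")
```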
Tooling/Services
Content: Jinja2, Pandas, Matplotlib/Plotly for visuals.
Tracking: W&B Artifacts; DVC; Git for versioning.
Signing: in-toto attestation, AWS KMS; SLSA provenance (optional).
Registry: your model registry (SageMaker/MLflow-compatible) enriched with model_card_uri, datasheet_uri.
Outputs & Storage
s3://…/governance/model_cards/model_vX.Y/model_card.md|html|pdf (signed), datasheets/dataset_<tag>.md.
Links recorded in the Model Registry + W&B; checksums in promotion_record.json.