Product Update

v2.9.21: Local-First AI Runtime and Kubernetes GPU Governance

New local AI runtime scanning, Kubernetes GPU capacity evidence, and governance APIs to help teams cut GPU waste with operator-safe execution.

By Ken. Reading time: 9 min

New Scan Surface

Kubernetes inventory and findings

Pods, nodes, workloads, services, PVs, PVCs, governance summaries, and action plans.

Trust Boundary

Local kubectl, read-only permissions

Kubeconfig and scan output stay on the operator machine. CWS does not require a hosted cluster collector.

Operator Value

Find cloud waste behind containers

Surface orphan PVs, underused node baselines, risky LoadBalancer services, and owner gaps.

Why this release matters now

For most teams running AI workloads, GPU spend is now one of the fastest-growing cost lines. The operational problem is not only price. The harder problem is evidence: many teams can see invoice growth, but cannot quickly prove where GPU waste is happening, who owns it, and which actions are safe.

v2.9.21 addresses that gap with a local-first workflow built for operators. Instead of relying on a hosted collector, Cloud Waste Scanner can now read local runtime signals and Kubernetes GPU allocation signals, then return concrete findings and recommendations for execution.

What shipped in v2.9.21

  • Local AI runtime scan support for NVIDIA and AMD runtime paths.
  • Kubernetes GPU summary endpoint to compare node allocatable GPUs against pod requests and limits.
  • AI governance report output for weekly execution, ownership review, and closure tracking.
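As a rough illustration of the allocatable-versus-requested comparison described above, the sketch below computes a similar summary from parsed output of `kubectl get nodes -o json` and `kubectl get pods --all-namespaces -o json`. It is not the CWS implementation and does not reflect the endpoint's response schema. The function name `gpu_summary` and the output keys are hypothetical. The resource names `nvidia.com/gpu` and `amd.com/gpu` are the ones advertised by the vendors' Kubernetes device plugins, which is assumed to be how GPUs are exposed in the cluster.

```python
from collections import defaultdict

# Extended resource names advertised by the NVIDIA and AMD device plugins.
GPU_RESOURCES = ("nvidia.com/gpu", "amd.com/gpu")


def gpu_summary(nodes: dict, pods: dict) -> dict:
    """Compare allocatable GPUs on nodes with GPU requests and limits on pods.

    `nodes` and `pods` are parsed `kubectl get ... -o json` list documents.
    The output keys are illustrative, not the CWS API schema.
    """
    totals = defaultdict(lambda: {"allocatable": 0, "requested": 0, "limited": 0})

    for node in nodes.get("items", []):
        allocatable = node.get("status", {}).get("allocatable", {})
        for res in GPU_RESOURCES:
            totals[res]["allocatable"] += int(allocatable.get(res, 0))

    for pod in pods.get("items", []):
        # Finished pods no longer hold GPU capacity, so skip them.
        if pod.get("status", {}).get("phase") in ("Succeeded", "Failed"):
            continue
        # Simplification: only regular containers are counted, not init containers.
        for container in pod.get("spec", {}).get("containers", []):
            resources = container.get("resources", {})
            for res in GPU_RESOURCES:
                totals[res]["requested"] += int(resources.get("requests", {}).get(res, 0))
                totals[res]["limited"] += int(resources.get("limits", {}).get(res, 0))

    for counts in totals.values():
        # GPUs that nodes offer but no active pod has requested.
        counts["unrequested"] = counts["allocatable"] - counts["requested"]
    return dict(totals)
```

A large `unrequested` count points at idle node capacity, while a `requested` count close to `allocatable` alongside low runtime utilization points at over-reservation. Either way, the numbers are only a starting point for review.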

What new problems you can solve

This release is designed for the common "invoice shock to action" path in AI infrastructure:

  • Find idle GPU windows where devices are allocated but underutilized.
  • Detect memory-stranded patterns where GPU memory pressure does not match compute usage.
  • Surface request-limit drift in Kubernetes workloads that reserve expensive GPU capacity without sustained demand.
  • Turn findings into an operator-ready governance report instead of ad-hoc screenshots and manual notes.
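To make the idle and memory-stranded categories above concrete, here is a minimal sketch of how a window of utilization samples might be labeled. The release notes do not publish the scanner's actual rules, so everything here is an assumption for illustration: the `GpuSample` fields, the `classify_window` name, the thresholds, and the reading of "memory-stranded" as high memory occupancy combined with low compute utilization.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class GpuSample:
    """One utilization reading for a device (fields are illustrative)."""
    util_pct: float      # compute utilization, 0-100
    mem_used_pct: float  # share of GPU memory in use, 0-100


def classify_window(samples: list[GpuSample],
                    idle_util_pct: float = 5.0,
                    low_util_pct: float = 20.0,
                    high_mem_pct: float = 70.0) -> str:
    """Label a window of samples; all thresholds are assumed defaults."""
    if not samples:
        return "no-data"
    avg_util = mean(s.util_pct for s in samples)
    avg_mem = mean(s.mem_used_pct for s in samples)
    # Memory is held while compute stays low: capacity is stranded.
    if avg_util < low_util_pct and avg_mem >= high_mem_pct:
        return "memory-stranded"
    # Almost no compute and no significant memory use: the device sits idle.
    if avg_util < idle_util_pct:
        return "idle"
    return "active"
```

Averaging over a window rather than reacting to single readings keeps short bursts from hiding a device that is idle most of the time. The right window length and thresholds depend on the workload.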

Operator benefits and expected outcomes

  • Faster triage: local runtime evidence reduces time from cost alert to actionable diagnosis.
  • Safer change decisions: capacity and usage context are presented together before rightsizing or scheduling changes.
  • Stronger accountability: findings can be grouped into weekly review queues for platform and FinOps owners.
  • Better repeatability: teams can use the same API outputs in dashboards, runbooks, and automation workflows.

How to use it in the UI

In the Community desktop app, the previous AI analyst entry has been replaced by an execution-oriented scan workflow.

  1. Open AI Device Scan from the left navigation.
  2. Click Run AI Device Scan to collect local GPU runtime findings.
  3. Review returned findings for idle, stranded, and efficiency signals.
  4. Open related scan results for cleanup planning and owner assignment.

You can also run this from the main scan wizard, where the AI device utilization scan is available as a first-class scan option.

How to use it through Local API

For automation and integrations, v2.9.21 extends the local API surface:

  • GET /v1/ai/devices: list detected local runtime devices.
  • POST /v1/ai/scans: trigger a local AI runtime scan.
  • GET /v1/ai/findings: retrieve AI runtime findings.
  • GET /v1/ai/recommendations: retrieve recommendations mapped to findings.
  • GET /v1/k8s/gpu-summary: retrieve cluster-level GPU allocatable/request/limit summary.
  • GET /v1/reports/ai-governance: retrieve governance-oriented report output.

Recommended execution pattern: trigger a scan, collect findings, append the Kubernetes GPU summary as context, then publish a weekly governance report for review and closure tracking.
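The execution pattern above could be scripted roughly as follows. This is a sketch under stated assumptions, not official client code. The base URL and port are hypothetical because the release notes do not state the local API address. Responses are assumed to be JSON, and the scan is assumed to have finished by the time the POST returns. If the scan runs asynchronously, a wait or polling step would be needed before reading findings. The names `http_fetch` and `weekly_governance_run` are illustrative.

```python
import json
import urllib.request
from typing import Any, Callable

# Hypothetical base URL: the release notes do not state the local API address.
BASE_URL = "http://127.0.0.1:8080"

# A fetch function takes (method, path) and returns the decoded response.
Fetch = Callable[[str, str], Any]


def http_fetch(method: str, path: str) -> Any:
    """Call the local API and decode the JSON body (assumes JSON responses)."""
    request = urllib.request.Request(BASE_URL + path, method=method)
    with urllib.request.urlopen(request, timeout=30) as response:
        body = response.read()
    return json.loads(body) if body else None


def weekly_governance_run(fetch: Fetch = http_fetch) -> dict:
    """Trigger a scan, then gather findings, Kubernetes context, and the report."""
    # Assumption: the scan has completed when this call returns.
    scan = fetch("POST", "/v1/ai/scans")
    return {
        "scan": scan,
        "findings": fetch("GET", "/v1/ai/findings"),
        "recommendations": fetch("GET", "/v1/ai/recommendations"),
        "k8s_gpu_summary": fetch("GET", "/v1/k8s/gpu-summary"),
        "governance_report": fetch("GET", "/v1/reports/ai-governance"),
    }
```

Passing the fetch function as a parameter keeps the sequence testable without a running scanner, and lets a team swap in its own HTTP client, authentication, or retry logic.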

Trust boundary and data handling

CWS remains local-first in this release. Runtime probes and Kubernetes reads execute in your environment. Credentials and raw outputs do not need to leave your machine unless you intentionally export or forward them.

FinOps Execution Insight

  • Because GPU waste is high-cost and bursty, local evidence speed directly improves decision quality.
  • Because Kubernetes reservations can mask true pressure, allocation context must be paired with runtime findings.
  • Because cleanup fails without ownership, governance report outputs should be built into the weekly operating cadence.