Founder Notes Part 6: Operating Model for Durable Cloud Savings

In Part 5, we covered simulation and reporting. This final chapter answers the operational question teams ask after first wins: how to keep savings from drifting back three months later.

Q: Why do many teams lose savings after an initially successful cleanup?

A: Because they optimize as a campaign, not as an operating model.

The first wave is often strong: obvious zombies get removed and budgets improve. Then entropy returns. New projects launch, ownership drifts, and low-priority resources quietly re-accumulate. Durable savings require recurring cadence, clear policy boundaries, and evidence that can move decisions across engineering, security, and finance.

Q: What changed in the product to support that operating model?

A: We expanded from scanner features to governance capabilities that compound over time.

Monitor 2.0: clearer operational health signals instead of raw metric noise.
Advanced Policy Engine: provider- and environment-specific thresholds instead of one global rule.
Global Rightsizing: rightsizing recommendations across AWS, Azure, and GCP.
Storage Tiering Analysis: lifecycle-policy intelligence for long-tail object storage cost.
Local API Playbooks: repeatable automation flows with schedule, account targeting, and report delivery.

The result is a practical shift: from ad-hoc cleanup to controlled, repeatable cloud governance.
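To make "provider- and environment-specific thresholds instead of one global rule" concrete, here is a minimal sketch of a threshold lookup with fallbacks. The field names (idle_days, min_monthly_cost), the sample values, and the lookup order are assumptions for illustration only, not the Advanced Policy Engine's actual schema.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Threshold:
    # Hypothetical fields: how long a resource must be idle, and the
    # minimum monthly cost before a finding is raised.
    idle_days: int
    min_monthly_cost: float


# The single global rule that the overrides below replace where needed.
GLOBAL_DEFAULT = Threshold(idle_days=30, min_monthly_cost=5.0)

# Keys are (provider, environment); None means "any".
OVERRIDES: Dict[Tuple[Optional[str], Optional[str]], Threshold] = {
    ("aws", "prod"): Threshold(idle_days=90, min_monthly_cost=20.0),
    ("aws", None): Threshold(idle_days=45, min_monthly_cost=10.0),
    (None, "dev"): Threshold(idle_days=14, min_monthly_cost=1.0),
}


def resolve_threshold(provider: str, environment: str) -> Threshold:
    """Return the most specific threshold, falling back to the global default."""
    candidates = (
        (provider, environment),  # exact match first
        (provider, None),         # then provider-wide
        (None, environment),      # then environment-wide
    )
    for key in candidates:
        if key in OVERRIDES:
            return OVERRIDES[key]
    return GLOBAL_DEFAULT
```

In this sketch a production AWS account gets the most conservative rule, while any dev environment without a provider-level override falls through to the looser dev rule.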

Q: How does this look in a real weekly operating rhythm?

A: We see high-performing teams run a simple weekly loop:

Monday: run targeted scans for shared and production accounts.
Tuesday: triage with policy context (auto-approve low-risk, escalate medium/high-risk).
Wednesday: execute approved actions in controlled batches.
Thursday: review outcomes in Monitor and validate savings retention.
Friday: update thresholds/ownership routing based on accepted vs rejected findings.

This loop is intentionally lightweight. The point is consistency, not ceremony.
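The Tuesday triage step can be sketched as a small routing function. The Finding shape and the risk labels are hypothetical. The only rule taken from the loop above is that low-risk findings are auto-approved and everything else is escalated.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class Finding:
    # Hypothetical finding shape for illustration.
    resource_id: str
    risk: str  # expected values: "low", "medium", "high"


def triage(findings: Iterable[Finding]) -> Tuple[List[Finding], List[Finding]]:
    """Split findings into (auto_approved, escalated)."""
    auto_approved: List[Finding] = []
    escalated: List[Finding] = []
    for finding in findings:
        if finding.risk == "low":
            auto_approved.append(finding)
        else:
            # Medium, high, and any unrecognized label go to a human.
            escalated.append(finding)
    return auto_approved, escalated
```

Escalating unrecognized risk labels is a deliberate choice in this sketch: a labeling mistake should cost review time, not an unreviewed action.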

Q: Where does Local API automation create the biggest lift?

A: It removes "manual trigger" bottlenecks without giving up control.

Teams can trigger scans from schedulers, target only selected accounts, and track each scan by scan_id. This allows platform teams to keep one integration pattern while each business unit controls its own account scope and review cadence.

Example (create an async scan job):

Quick Request Examples

Use the same route with Bash, Python, or JavaScript.

curl -X POST "http://127.0.0.1:9123/v1/scans" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"selected_accounts":["profile_abc123"],"report_emails":["finops@company.com","ops@company.com"]}'

import requests

payload = {
    "selected_accounts": ["profile_abc123"],
    "report_emails": ["finops@company.com", "ops@company.com"],
}

response = requests.post(
    "http://127.0.0.1:9123/v1/scans",
    headers={"Authorization": "Bearer YOUR_API_TOKEN"},
    json=payload,
    timeout=30,
)

# Fail loudly on HTTP errors before reading the body.
response.raise_for_status()
print(response.json())

const response = await fetch("http://127.0.0.1:9123/v1/scans", {
  method: "POST",
  headers: {
    Authorization: "Bearer YOUR_API_TOKEN",
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    selected_accounts: ["profile_abc123"],
    report_emails: ["finops@company.com", "ops@company.com"],
  }),
});

console.log(await response.json());

Q: Can post-scan report emails be automated for stakeholders?

A: Yes. You can pass report_emails in API scan requests, and the scan job records a delivery result in report_email_status.

In practice, this helps teams route outcomes to the right inboxes immediately: FinOps, platform operations, and application owners. It is one of the easiest ways to reduce the lag between detection and action.

We recommend treating report routing as governance metadata: decide once per environment and keep it versioned in your automation payloads. Each scan request accepts up to 5 recipients.
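One way to keep routing versioned is a small per-environment table that your scheduler turns into scan payloads. The sketch below is an assumption about how a team might organize this on its side. Only the selected_accounts and report_emails fields and the 5-recipient limit come from the text above. The ROUTING table, the "dev" profile name, and the helper are hypothetical.

```python
from typing import Dict, List

MAX_REPORT_EMAILS = 5  # per-request limit stated above

# Hypothetical routing table, kept in version control next to the automation.
ROUTING: Dict[str, Dict[str, List[str]]] = {
    "prod": {
        "selected_accounts": ["profile_abc123"],
        "report_emails": ["finops@company.com", "ops@company.com"],
    },
    "dev": {
        "selected_accounts": ["profile_dev_example"],  # placeholder name
        "report_emails": ["ops@company.com"],
    },
}


def build_scan_payload(environment: str, routing: Dict[str, Dict[str, List[str]]] = ROUTING) -> dict:
    """Build the JSON body for a scan request from the routing table."""
    entry = routing[environment]
    # Drop duplicate recipients while preserving order.
    emails = list(dict.fromkeys(entry["report_emails"]))
    if len(emails) > MAX_REPORT_EMAILS:
        raise ValueError(
            f"{environment}: {len(emails)} recipients exceeds the limit of {MAX_REPORT_EMAILS}"
        )
    return {
        "selected_accounts": list(entry["selected_accounts"]),
        "report_emails": emails,
    }
```

Validating the recipient count before the request is sent means a routing change that breaks the limit fails in code review or CI rather than at scan time.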

Example (poll delivery status):

Quick Request Examples

Check status in the same language your team already uses.

curl -H "Authorization: Bearer YOUR_API_TOKEN" "http://127.0.0.1:9123/v1/scans/SCAN_ID"

import requests

response = requests.get(
    "http://127.0.0.1:9123/v1/scans/SCAN_ID",
    headers={"Authorization": "Bearer YOUR_API_TOKEN"},
    timeout=30,
)

# Fail loudly on HTTP errors before reading the body.
response.raise_for_status()
print(response.json()["report_email_status"])

const response = await fetch("http://127.0.0.1:9123/v1/scans/SCAN_ID", {
  headers: {
    Authorization: "Bearer YOUR_API_TOKEN",
  },
});

const data = await response.json();
console.log(data.report_email_status);

Check report_email_status in the response (sent, failed: ..., or skipped: ...).
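If automation needs to branch on the delivery result, the status string can be split into a state and a detail. This sketch assumes the value is a plain string of the form shown above, with any detail following the first colon. The exact detail text is not specified here, so the code treats it as opaque.

```python
from typing import Tuple

KNOWN_STATES = {"sent", "failed", "skipped"}


def parse_report_email_status(status: str) -> Tuple[str, str]:
    """Split a status like 'failed: <detail>' into (state, detail)."""
    # partition() splits on the first colon only, so colons inside
    # the detail text are preserved.
    state, _, detail = status.partition(":")
    state = state.strip().lower()
    if state not in KNOWN_STATES:
        raise ValueError(f"unexpected report_email_status: {status!r}")
    return state, detail.strip()
```

A caller might alert on "failed", log "skipped", and do nothing on "sent".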

Q: What did we learn about rightsizing vs deletion at scale?

A: Most organizations need both, but rightsizing often unlocks faster consensus.

Deletion conversations can stall when dependency confidence is low. Rightsizing offers a middle path for many "alive but overprovisioned" workloads. With support expanded to Azure and GCP, teams can apply one governance language across multi-cloud estates instead of negotiating separate heuristics for each provider.

Q: Why include storage lifecycle governance in a FinOps operating model?

A: Because storage waste is quiet, durable, and easy to miss.

Compute waste is visible during incidents. Storage waste is often invisible until quarter-end surprises. Lifecycle analysis closes this gap by identifying buckets and object sets where retention, transition, and expiration controls are missing or misaligned with real usage.
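As a rough illustration of the "missing controls" half of that check, the sketch below flags buckets that lack a retention, transition, or expiration rule. The inventory shape (a dict with name and lifecycle keys) is hypothetical and is not the product's data model. Detecting rules that are present but misaligned with real usage would additionally need access data, which this sketch does not attempt.

```python
from typing import Dict, Iterable, List

REQUIRED_CONTROLS = ("retention", "transition", "expiration")


def missing_lifecycle_controls(bucket: dict) -> List[str]:
    """Return the lifecycle controls a bucket has not configured."""
    # Treat an absent or empty lifecycle section as "nothing configured".
    rules = bucket.get("lifecycle") or {}
    return [control for control in REQUIRED_CONTROLS if not rules.get(control)]


def buckets_needing_review(buckets: Iterable[dict]) -> Dict[str, List[str]]:
    """Map bucket name to its missing controls, skipping fully covered buckets."""
    report: Dict[str, List[str]] = {}
    for bucket in buckets:
        missing = missing_lifecycle_controls(bucket)
        if missing:
            report[bucket["name"]] = missing
    return report
```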

Q: What should teams measure to prove this is working?

A: Keep the KPI set small and operational:

Time to Action: elapsed time from a recommendation being generated to its approved execution.
Decision Quality: share of findings rejected because of missing context or false positives.
Savings Durability: share of savings still retained after 30, 60, and 90 days.
Coverage Depth: share of cloud accounts participating in recurring scans.

Better KPIs here usually mean better engineering discipline overall, not just lower cloud spend.
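The four KPIs above reduce to simple ratios and time differences. The following is a minimal sketch of one way to compute them. The function names and inputs are assumptions about what a team would track in its own records, not fields exposed by the product.

```python
from datetime import datetime


def _ratio(numerator: float, denominator: float) -> float:
    """Safe division that returns 0.0 when there is nothing to measure."""
    return numerator / denominator if denominator else 0.0


def time_to_action_hours(generated_at: datetime, executed_at: datetime) -> float:
    """Time to Action: hours from recommendation to approved execution."""
    return (executed_at - generated_at).total_seconds() / 3600


def rejection_rate(rejected_findings: int, reviewed_findings: int) -> float:
    """Decision Quality: share of reviewed findings that were rejected."""
    return _ratio(rejected_findings, reviewed_findings)


def savings_durability(initial_monthly_savings: float, retained_monthly_savings: float) -> float:
    """Savings Durability: share of initial savings still retained at a checkpoint."""
    return _ratio(retained_monthly_savings, initial_monthly_savings)


def coverage_depth(accounts_in_recurring_scans: int, total_accounts: int) -> float:
    """Coverage Depth: share of accounts participating in recurring scans."""
    return _ratio(accounts_in_recurring_scans, total_accounts)
```

Calling savings_durability once for each of the 30-, 60-, and 90-day checkpoints gives the retention curve that shows whether savings are drifting back.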

Q: If a team starts today, what rollout path is realistic?

A: Use a staged 30/60/90-day approach:

0-30 days: run baseline scans, segment policies (prod/dev/shared), and assign report-routing owners.
31-60 days: enable rightsizing and storage lifecycle workflows; automate low-risk recurring scans.
61-90 days: institutionalize weekly review cadence and tighten policy simulation feedback loops.

Final takeaway

The most important lesson from this series is simple: cloud efficiency is not a dashboard problem. It is an execution-system problem.

When monitoring, policy, simulation, reporting, and automation are designed as one loop, savings stop being a one-time win and become a durable operating capability.

This concludes the Founder Notes series. In future engineering updates, we will focus on cross-team rollout patterns, policy quality benchmarks, and automation runbooks from production environments.
