The SLA That Sales Invented

Published on 2026-04-15 10:30

There is a special kind of optimism that appears in technology companies right before engineering gets invited to a “quick alignment meeting.” It usually starts when somebody in a commercial conversation decides that reliability sounds more convincing with an extra nine, a firmer tone, and absolutely no consultation with the people who have to keep the thing alive at 3:14 a.m. The result is the SLA that sales invented: a contractual commitment born not from system behavior, failure testing, or operational evidence, but from vibes, ambition, and the gravitational pull of quarter-end targets.

This is not really a post about mocking sales. Well, not only that. It is about a deeper organizational habit: treating reliability as a messaging exercise instead of an engineering discipline. In healthy environments, SLAs sit on top of measured SLIs and realistic SLOs. In unhealthy ones, the order gets reversed. The promise goes first, the math comes later, and the on-call team becomes an unpaid translator between commercial aspiration and physical reality. Google’s SRE guidance is explicit that SLIs measure, SLOs define internal targets, and SLAs are the external commitments with consequences; in other words, the contract should be the last thing you derive, not the first thing you improvise.
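
If you want that ordering as something more concrete than a slogan, here is a minimal sketch with hypothetical numbers. The only point is the invariant at the end: measurement first, internal target second, contractual promise last, each with margin below the one above.

```python
# Illustrative ordering only; the numbers are hypothetical, not a recommendation.
measured_sli = 0.9995   # what the service actually achieved over a trailing window
internal_slo = 0.999    # internal target, set below measured performance
external_sla = 0.995    # contractual promise, set below the SLO to leave margin

# If this ordering does not hold, the promise came before the evidence.
assert measured_sli > internal_slo > external_sla
```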

That distinction matters because an SLA is not branding. It is not a customer romance language. It is a promise with remedies, service credits, and sometimes a direct path to executive panic. Microsoft’s guidance on reading SLAs makes the same point from a different angle: SLAs are inputs to architecture and resilience planning, not substitutes for it, and they often include exclusions and scope details that people conveniently forget when they are busy “closing the deal.”

The comedy, of course, is that everyone in tech claims to respect reality. We say things like “the data will tell us” and “we should be evidence-based,” right up until a customer asks for 99.99% uptime and someone decides evidence is for later. Yet the math is rude and deeply uninterested in pipeline goals. Over a year, 99.9% availability allows about 525.6 minutes of downtime. At 99.99%, that drops to roughly 52.56 minutes. At 99.999%, you are down to about 5.26 minutes for the entire year. That is not “work harder” territory. That is a different class of design, testing, redundancy, release discipline, and operating cost.
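
You can check the rudeness yourself. A quick back-of-envelope script, plain arithmetic and nothing vendor-specific:

```python
# Downtime allowed per year at each availability target.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a 365-day year

for target in (0.999, 0.9999, 0.99999):
    downtime_min = MINUTES_PER_YEAR * (1 - target)
    print(f"{target:.3%} -> {downtime_min:,.2f} min/year "
          f"({downtime_min / 60:.2f} hours)")

# 99.900% -> 525.60 min/year (8.76 hours)
# 99.990% -> 52.56 min/year (0.88 hours)
# 99.999% -> 5.26 min/year (0.09 hours)
```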

And those extra nines are not free. AWS’s Reliability Pillar frames reliability as an architectural property that must be designed, delivered, and maintained. Microsoft similarly notes that architectural choices affect the composite SLO of a solution, and higher resilience generally requires more redundancy. In plain English, if you want to promise more uptime, you usually need more infrastructure, more failover design, more careful dependency management, more operational maturity, and fewer cowboy changes on Friday afternoons.
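
To make "composite" concrete, here is the usual back-of-envelope model. It assumes independent failures, which real systems cheerfully violate, so treat the numbers as illustrative rather than as anyone's published SLA:

```python
def serial(*availabilities: float) -> float:
    """Every component must be up, so availabilities multiply (and shrink)."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result

def redundant(a: float, replicas: int) -> float:
    """Up if at least one independent replica is up: 1 - (1 - a)^n."""
    return 1 - (1 - a) ** replicas

api, db, queue = 0.9995, 0.9995, 0.9995  # hypothetical components
print(f"serial chain:      {serial(api, db, queue):.5f}")               # ~0.99850
print(f"with redundant DB: {serial(api, redundant(db, 2), queue):.5f}")  # ~0.99900
```

Notice the direction: chaining dependencies pulls the composite number below every individual component, and only added redundancy pushes it back up. That is the "more infrastructure" bill in miniature.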

That is where the SRE lens becomes useful, because SRE has always been suspicious of hand-wavy reliability claims. The discipline exists partly to force a harder conversation between product ambition and operational truth. Error budgets are a perfect example. Google’s SRE material describes them as the mechanism for balancing reliability and feature velocity, and its example policy is clear that when you overspend the budget, attention shifts toward stability work. That only works when the target is grounded in something real. If your SLA was invented in a meeting where no one discussed the current error budget, you are not managing reliability. You are outsourcing disappointment to future-you.
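
The arithmetic behind an error budget is embarrassingly simple, which is exactly why skipping it is embarrassing. A sketch with a hypothetical 30-day window and a 99.9% availability SLO:

```python
WINDOW_MINUTES = 30 * 24 * 60          # 43,200 minutes in a 30-day window
slo = 0.999

budget = WINDOW_MINUTES * (1 - slo)    # 43.2 minutes of allowed downtime
spent = 31.0                           # hypothetical downtime so far this window

remaining = budget - spent
print(f"budget {budget:.1f} min, spent {spent:.1f} min, "
      f"remaining {remaining:.1f} min ({remaining / budget:.0%})")
```

A team that can produce this number on demand can have an honest SLA conversation. A team that cannot is negotiating blind.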

This tension also reveals a very human truth about IT organizations: people confuse intent with capacity. Sales intends to reassure the customer. Leadership intends to show confidence. Engineering intends to be helpful and collaborative. But distributed systems do not care about intent. Databases fail with no regard for optimism. Network partitions do not pause because the account team said “strategic logo.” Human nature in tech is to believe that alignment language can temporarily defeat constraints. It cannot. It can only delay the moment when constraints send an invoice.

There is also a cultural trap here for DevOps teams. DevOps is often described as better collaboration between development and operations, which is true, but weak organizations misread that as “everyone is jointly responsible for making the impossible somehow happen.” That is not collaboration; that is collective fiction. Real DevOps maturity means commercial, product, engineering, and operations all participate early enough to shape commitments before they become obligations. The point is not just faster delivery. It is tighter feedback between what the business wants to promise and what the platform can actually sustain.

There are, to be fair, two legitimate camps in this debate. One camp argues that aggressive SLAs can be commercially useful. A bold SLA signals confidence, creates competitive differentiation, and may push engineering to modernize architecture faster. You can almost hear this side saying, “Pressure creates diamonds.” In some cases, that is not entirely wrong. Strong commitments can force overdue investment in redundancy, observability, incident management, and reliability engineering. The business may genuinely need a more resilient service than engineering has historically prioritized. Vendor guidance from cloud providers also shows that architectural upgrades such as zone redundancy and multi-region design can materially improve service commitments.

The other camp, the one usually holding the pager, argues that aggressive commitments without system proof are basically contract-shaped denial. They point out that SLAs have consequences, outages are expensive, and penalties for breaching SLAs are part of why downtime costs escalate. Uptime Institute’s 2024 survey found that 54% of respondents said their most recent significant outage cost more than $100,000, and 20% reported costs above $1 million. The same report explicitly notes SLA penalties among the factors contributing to high outage costs. That is the less glamorous side of promising perfection: when the service misses, the bill arrives wearing both finance and legal perfume.

Personally, I think both sides are partly right and fully dangerous when left unsupervised. Commercial ambition without engineering evidence produces fantasy contracts. Engineering caution without customer context can become a reflexive “no” that slows the business unnecessarily. The grown-up answer is not to let one side win. It is to force a shared mechanism for making reliability commitments legible to everyone.

That mechanism starts with measurement. Before an SLA discussion ever reaches a customer, teams need a credible history of service behavior tied to meaningful user journeys. Not vanity infrastructure metrics. Not “CPU looked fine.” Actual indicators that reflect customer experience: request success, latency, availability of critical workflows, recovery behavior, and dependency impact. Google’s SRE guidance emphasizes that SLOs should come from what users care about and then be iterated over time. That sounds obvious until you remember how many companies still promise uptime for a “platform” that is really seventeen dependencies wearing a trench coat.
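
As a sketch of the difference, here is a journey-level SLI. The record shape, field names, and 500 ms threshold are all hypothetical; the point is that "good" means the user's request succeeded fast enough, not that a host was up:

```python
from dataclasses import dataclass

@dataclass
class Request:
    journey: str        # e.g. "checkout", "search"
    success: bool
    latency_ms: float

LATENCY_THRESHOLD_MS = 500  # hypothetical "fast enough" cutoff

def journey_sli(requests: list[Request], journey: str) -> float:
    """Fraction of a journey's requests that succeeded within the threshold."""
    relevant = [r for r in requests if r.journey == journey]
    good = [r for r in relevant
            if r.success and r.latency_ms <= LATENCY_THRESHOLD_MS]
    return len(good) / len(relevant) if relevant else 1.0

sample = [
    Request("checkout", True, 210.0),
    Request("checkout", True, 740.0),   # succeeded, but too slow to count as good
    Request("checkout", False, 95.0),   # fast, but failed
]
print(f"checkout SLI: {journey_sli(sample, 'checkout'):.3f}")  # 0.333
```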

The second approach is to create an explicit reliability negotiation path between sales, product, legal, platform, and SRE. Not a ceremonial sign-off five minutes before contract review. A real operating process. Somebody should be able to say, “A 99.99% commitment for this workflow would require active-active failover, dependency isolation, release guardrails, and a larger on-call burden. Here is the cost. Here is the likely timeline. Here is the risk if we sign before we build it.” Azure’s reliability guidance is useful here because it frames SLAs as something to interpret critically and map back to actual design decisions. That mindset should exist inside your company too.

The third approach is to make error budgets socially real, not just technically documented. Many teams love the idea of error budgets right up until revenue is nearby. Then suddenly every exception is “temporary,” every burn is “understandable,” and every reliability concern is “not customer-facing enough.” Google’s published error budget policy material is refreshingly blunt: the mechanism is there to protect customers from repeated misses and create incentives to balance reliability with feature work. If your organization treats error budgets as optional but treats sales promises as sacred, you have not implemented SRE. You have implemented theater with dashboards.
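
What "socially real" might look like, sketched loosely in the spirit of Google's published policy material. The thresholds and actions here are invented for illustration, not quoted from any policy:

```python
def release_gate(budget_remaining: float) -> str:
    """Map remaining error budget (as a fraction) to a release posture."""
    if budget_remaining <= 0.0:
        return "FREEZE: reliability work only until the budget recovers"
    if budget_remaining < 0.25:
        return "SLOW: high-risk changes need explicit sign-off"
    return "SHIP: normal release cadence"

for remaining in (0.60, 0.15, -0.10):
    print(f"{remaining:+.0%} budget left -> {release_gate(remaining)}")
```

The code is trivial on purpose. The hard part is agreeing, in writing and in advance, that the FREEZE branch applies even when a strategic logo is watching.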

A fourth approach, and one that matters more in 2026 than many leaders admit, is governance. Reliability targets decay when systems, traffic patterns, dependencies, and business expectations evolve faster than the operational model around them. Recent industry guidance on SLO frameworks and SLO oversight has focused on keeping objectives current as services change, rather than treating them as one-time paperwork. That matters because yesterday’s honest commitment can quietly become today’s lie if the architecture or customer use case has shifted.

There is also an uncomfortable people dimension to all this. The SLA that sales invented rarely fails only at the contract level. It tends to fail downstream in morale, trust, and cognitive load. On-call engineers become cynical because they are measured against promises they did not help make. Sales becomes defensive because engineering sounds obstructive after the contract is signed. Leadership gets trapped mediating between a customer expectation and a technical estate that never agreed to the terms. This is how organizations slowly convert reliability from a shared goal into an internal blame economy. And once that happens, every outage is followed by the same ritual: a timeline, a root cause, and a room full of adults rediscovering that no amount of confidence can outvote queue depth.

So here are the questions I think are worth arguing about in the comments. When your company promises higher reliability, are you buying customer trust or borrowing it? At what point does “commercial confidence” become a tax on the on-call team? Should engineering ever accept an SLA that is intentionally ahead of current architecture, or is that just technical debt with legal formatting? And perhaps the most dangerous question of all: does your organization actually know the cost of each additional nine, or do you merely know how nice it looks in a proposal deck?

The real lesson is simple, even if organizations work hard to make it complicated. Reliability commitments should emerge from measured behavior, deliberate design, and explicit trade-offs. SRE and DevOps are at their best when they turn those trade-offs into something visible, discussable, and boring. Boring is good. Boring is how you avoid surprise legal energy. Boring is how you keep your promises without needing a monthly séance between product, sales, and the laws of distributed systems.

Because in the end, the SLA that sales invented is not funny because sales is foolish. It is funny because every technology company is one optimistic meeting away from doing the same thing. The trick is building the kind of organization where someone can say, early and without career risk, “That target is not impossible. It is just currently imaginary.”

References

1. Google Cloud Blog, “SRE fundamentals: SLIs, SLAs, and SLOs”

https://cloud.google.com/blog/products/devops-sre/sre-fundamentals-slis-slas-and-slos

2. Google SRE Workbook, “Chapter 2 - Implementing SLOs”

https://sre.google/workbook/implementing-slos/

3. Google SRE Workbook, “Error Budget Policy for Service Reliability”

https://sre.google/workbook/error-budget-policy/

4. Microsoft Learn, “How to Read a Service-Level Agreement (SLA)”

https://learn.microsoft.com/en-us/azure/reliability/concept-service-level-agreements

5. Uptime Institute, “2024 Global Data Center Survey Report”

https://datacenter.uptimeinstitute.com/rs/711-RIA-145/images/2024.GlobalDataCenterSurvey.Report.pdf

6. AWS Well-Architected Framework, “Reliability Pillar”

https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/welcome.html

#SRE #SiteReliability #DevOps #SLA #SLO #SLI #ErrorBudget #ReliabilityEngineering #PlatformEngineering #IncidentManagement #CloudArchitecture #DevOpsCulture