Created on 2025-07-21 07:38
Published on 2025-07-21 10:30
Capacity planning: the science—or is it art?—of figuring out how much infrastructure you’ll need to support your users, handle load, and stay ahead of growth. In theory, it’s a straightforward exercise: estimate traffic, map to resources, provision accordingly. In practice? It’s often a mix of guesswork, optimism, and hoping you don’t over-provision by 300% or cause an outage because you under-provisioned. So is capacity planning a disciplined engineering task? Or is it modern infrastructure astrology—reading signs, drawing curves, and crossing your fingers? Let’s explore both sides of this controversial (and often misunderstood) practice in the world of SRE.
The Case for Capacity Planning as Engineering
Capacity planning should be rooted in science. It’s not a magic trick: it’s math plus monitoring. Well-executed planning involves:
Historical usage trends: CPU, memory, I/O, throughput.
Growth projections: business forecasts, product launches, seasonality.
Benchmarking: how services behave under stress.
Redundancy models: N+1, N+2, failover, regional availability.
Risk tolerance: acceptable latency vs. overprovisioning.
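The redundancy models above reduce to simple arithmetic. Here is a minimal N+1 sizing sketch; the helper name and every number are illustrative, not a prescription:

```python
import math

def required_nodes(peak_load_rps: float, node_capacity_rps: float,
                   redundancy: int = 1, headroom: float = 0.2) -> int:
    """Nodes needed to serve peak load with headroom, plus N+k spares.

    Hypothetical sizing helper: size for peak load with a safety
    buffer, then add spare nodes so the fleet survives `redundancy`
    simultaneous failures.
    """
    base = math.ceil(peak_load_rps * (1 + headroom) / node_capacity_rps)
    return base + redundancy

# 12,000 rps peak, 1,500 rps per node, 20% headroom, N+1:
print(required_nodes(12_000, 1_500))  # ceil(14400/1500) = 10, +1 spare = 11
```

The same function expresses N+2 by passing `redundancy=2`, which is the whole point of encoding the model: the risk tolerance becomes an explicit parameter rather than a gut feeling.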
Good capacity planning includes:
Forecasting tools: extrapolating past usage into future demand, especially with smoothing, outlier detection, and seasonal modeling.
Stress testing: simulating load with tools like k6, Locust, or internal scripts to understand limits.
Buffering for bursts: using auto-scaling and burstable resources while still keeping predictable baselines.
Business collaboration: working with marketing, product, and sales to predict user behavior (think: Black Friday, product launches, ad campaigns).
Visibility: observability pipelines that track resource utilization trends and alert when forecasts approach limits.
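The forecasting idea above can be sketched in a few lines: smooth the history with a moving average, then extrapolate the average per-period growth. This is a toy model assuming a roughly linear trend; a real pipeline would layer on seasonal modeling and outlier rejection:

```python
from statistics import mean

def forecast_linear(history: list[float], periods_ahead: int,
                    window: int = 3) -> float:
    """Naive trend forecast (illustrative only): smooth, then
    extrapolate the average per-period growth."""
    # Smooth the series with a trailing moving average.
    smoothed = [mean(history[max(0, i - window + 1): i + 1])
                for i in range(len(history))]
    # Average per-period growth over the smoothed series.
    deltas = [b - a for a, b in zip(smoothed, smoothed[1:])]
    trend = mean(deltas) if deltas else 0.0
    return smoothed[-1] + trend * periods_ahead

# Four periods of peak usage trending up, projected three periods out:
print(forecast_linear([100, 110, 120, 130], periods_ahead=3))  # 140.0
```

Even a crude model like this beats eyeballing a dashboard, because it forces you to write down the assumptions (window size, linearity) that you can later revisit when the forecast misses.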
Capacity planning at its best is proactive. It prevents outages. It saves costs. It prepares the system for what’s coming—before it arrives.
The Case That It’s Still Guesswork
Yet ask most SREs and they’ll tell you: capacity planning still feels like a black art.
Forecasting is never precise. User growth doesn’t follow a clean line. Behavior changes. Incidents happen. The future doesn’t care about your spreadsheet.
Systems don’t behave linearly. Adding 10% more users doesn’t mean 10% more load. Some requests hit caches. Others explode databases.
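One way to see the nonlinearity is the textbook M/M/1 queue, where mean time in system is 1/(μ − λ): latency explodes as arrivals approach capacity. A toy model, not a claim about any particular service:

```python
def mm1_latency_ms(arrival_rps: float, service_rps: float) -> float:
    """Mean time in system for an M/M/1 queue: W = 1 / (mu - lambda).
    Textbook queueing model, for illustration only."""
    if arrival_rps >= service_rps:
        raise ValueError("queue is unstable: arrivals exceed capacity")
    return 1000.0 / (service_rps - arrival_rps)

# A server that handles 100 rps, running near saturation:
print(mm1_latency_ms(80, 100))  # 50 ms
print(mm1_latency_ms(88, 100))  # ~83 ms: 10% more load, ~67% more latency
```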
Services change constantly. A code optimization can halve CPU usage—or triple it. Meanwhile, new features quietly introduce massive new queries.
Dependencies are unpredictable. Your service may be fine—but what about the upstream auth service that you forgot to model?
Cloud is “infinite”… until it isn’t. Many teams assume autoscaling will save them, but limits exist: on IP addresses, CPUs, regional quotas. During spikes, you may not scale fast enough.
In this view, capacity planning is more like forecasting the weather: you build models, read the signs, make your best guess, and prepare for surprise.
The Hidden Cost of Overplanning
Over-provisioning “just to be safe” feels responsible. But it comes at a price.
Wasted money: unused compute resources rack up bills, especially in cloud environments.
False sense of security: “we have enough capacity” becomes a crutch, and teams stop monitoring trends or optimizing performance.
Environmental impact: running idle infrastructure is a sustainability concern, and one that’s increasingly scrutinized.
The Hidden Cost of Underplanning
On the flip side, underestimating capacity can lead to:
Performance degradation: latency spikes, failed requests, unhappy users.
Service outages: you hit quota limits, the autoscaler lags, connections fail.
Engineer burnout: teams scramble to patch holes during traffic surges.
Reputation damage: especially during product launches or peak events.
Capacity missteps aren’t just technical—they’re business risks.
Real-World Example: The Promo That Crashed the System
A streaming startup planned a big marketing push. The platform team was told to expect a “modest spike”: 2x normal traffic. They provisioned conservatively. Then the promo went viral. Traffic spiked 10x. The app crashed. Databases melted. CDN costs skyrocketed. The CTO had to explain on Twitter. The SRE team wasn’t to blame; they had asked for more runway, but the budget got cut. Afterward, the company adopted:
Load testing before every campaign.
A clear traffic escalation protocol.
Pre-allocating surge capacity buffers.
Improved cross-team communication.
Capacity planning became everyone’s problem, not just the infra team’s.
When Engineering Meets Intuition
Some of the best capacity planners aren’t just data-driven; they’re pattern matchers. They know when a feature looks small but hides a big cost, when an upcoming event could surprise, and when a subtle trend hints at deeper trouble. That’s where engineering meets instinct. It’s not astrology; it’s experience.
What Good Capacity Planning Looks Like
Automated dashboards: showing usage trends, forecasted growth, and limits across services.
Collaboration across teams: engineers, product managers, and business stakeholders in the same room.
Clear runbooks for scaling events: what to do if load spikes, including pre-approved escalations.
Synthetic load testing pipelines: CI-based load tests that evolve with the app.
Defined risk models: knowing which systems need strict headroom and which can ride close to the edge.
Postmortems on capacity events: learning from under- and over-provisioning, not just outages.
The Cloud Complicates—and Enables
Cloud changed the game: you can scale horizontally with a few API calls, and you can lean on autoscaling groups, spot instances, and serverless. But it also introduced quota limits and throttling, hidden costs from misuse, and tooling sprawl across regions and vendors. Good capacity planning today must understand both cloud primitives and application behavior.
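The quota problem becomes concrete if you sketch target-tracking scale-out clamped by a regional quota. This mirrors the shape of common autoscaler math, but every name and number here is illustrative:

```python
import math

def desired_replicas(current: int, cpu_util: float, target_util: float,
                     quota_max: int) -> int:
    """Target-tracking scale-out, clamped by a regional quota.
    Illustrative sketch: scale the fleet so average utilization
    lands near the target, but never past the quota ceiling."""
    want = math.ceil(current * cpu_util / target_util)
    return min(max(want, 1), quota_max)

# 10 replicas at 90% CPU, targeting 50%: wants 18, quota caps it at 16.
print(desired_replicas(10, 0.9, 0.5, 16))  # 16
```

The gap between “wants 18” and “gets 16” is exactly the failure mode the section describes: the autoscaler is working as designed, and you still fall short during a spike.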
Final Thought
Capacity planning isn’t astrology. But it’s not pure engineering, either. It’s a blend of measurement, modeling, experience, and adaptation. It’s about embracing uncertainty, building buffers, and communicating across teams. So no—it’s not inevitable that you’ll get it wrong. But it is inevitable that things will change. And the best SRE teams aren’t the ones who guess perfectly. They’re the ones who plan, measure, review, and evolve. Because in a world of unknowns, resilience starts with preparation—and the willingness to revise the plan.