Capacity Planning – Engineering or Astrology?

Created on 2025-07-21 07:38

Published on 2025-07-21 10:30

Capacity planning: the science—or is it art?—of figuring out how much infrastructure you’ll need to support your users, handle load, and stay ahead of growth. In theory, it’s a straightforward exercise: estimate traffic, map to resources, provision accordingly. In practice? It’s often a mix of guesswork, optimism, and hoping you don’t over-provision by 300% or cause an outage because you under-provisioned. So is capacity planning a disciplined engineering task? Or is it modern infrastructure astrology—reading signs, drawing curves, and crossing your fingers? Let’s explore both sides of this controversial (and often misunderstood) practice in the world of SRE.

The Case for Capacity Planning as Engineering

Capacity planning should be rooted in science. It’s not a magic trick; it’s math plus monitoring. Good capacity planning includes:

  1. Forecasting Tools     Extrapolating past usage into future demand—especially with smoothing, outlier detection, and seasonal modeling.

  2. Stress Testing     Simulating load with tools like k6, Locust, or internal scripts to understand limits.

  3. Buffering for Bursts     Using auto-scaling and burstable resources while still keeping predictable baselines.

  4. Business Collaboration     Working with marketing, product, and sales to predict user behavior (think: Black Friday, product launches, ad campaigns).

  5. Visibility     Observability pipelines that track resource utilization trends and raise alerts when forecasted usage nears a limit.

Capacity planning at its best is proactive. It prevents outages. It saves costs. It prepares the system for what’s coming—before it arrives.
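
To make the forecasting item concrete, here is a minimal sketch that fits a straight line to evenly spaced utilization samples and estimates how long until the trend crosses a limit. The function names and the sample numbers are hypothetical, and the linear model is an assumption; real demand usually needs the smoothing, outlier handling, and seasonal modeling mentioned above.

```python
from typing import Optional, Sequence, Tuple


def fit_linear_trend(samples: Sequence[float]) -> Tuple[float, float]:
    """Least-squares line through (0, samples[0]), (1, samples[1]), ...

    Returns (slope, intercept). Assumes samples are evenly spaced in time.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_limit(samples: Sequence[float], limit: float) -> Optional[float]:
    """Days after the last sample until the fitted trend reaches `limit`.

    Returns None when usage is flat or falling, and 0.0 when the trend
    has already reached the limit.
    """
    slope, intercept = fit_linear_trend(samples)
    if slope <= 0:
        return None
    last_day = len(samples) - 1
    crossing_day = (limit - intercept) / slope
    return max(0.0, crossing_day - last_day)


if __name__ == "__main__":
    # Hypothetical daily peak CPU utilization, in percent.
    daily_peak = [50, 52, 54, 56, 58]
    print(days_until_limit(daily_peak, limit=80))
```

With these made-up samples the trend grows two points per day, so the sketch reports 11.0 days of runway before the 80 percent limit. A number like that is a prompt for a conversation, not a guarantee.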

The Case That It’s Still Guesswork

Yet ask most SREs and they’ll tell you: capacity planning still feels like a black art.

But limits exist—on IPs, CPUs, regional quotas. During spikes, you may not scale fast enough. In this view, capacity planning is more like forecasting the weather: you build models, read the signs, make your best guess—and prepare for surprise.
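
A toy model shows why scaling can lose the race against a spike. The sketch below assumes capacity grows by a fixed amount per time step once a provisioning delay has passed, and is capped by a quota. Every name and number in it is hypothetical, and real autoscalers behave in more complicated ways.

```python
from typing import List, Sequence


def capacity_over_time(
    steps: int,
    start_capacity: int,
    add_per_step: int,
    delay_steps: int,
    max_capacity: int,
) -> List[int]:
    """Capacity available at each step under a simplified scaling model.

    New capacity starts arriving only after `delay_steps`, grows by
    `add_per_step` each step, and never exceeds `max_capacity` (the quota).
    """
    capacities = []
    for t in range(steps):
        scaled_steps = max(0, t - delay_steps)
        grown = start_capacity + add_per_step * scaled_steps
        capacities.append(min(max_capacity, grown))
    return capacities


def worst_shortfall(
    demand: Sequence[int],
    start_capacity: int,
    add_per_step: int,
    delay_steps: int,
    max_capacity: int,
) -> int:
    """Largest gap between demand and capacity across all steps."""
    capacities = capacity_over_time(
        len(demand), start_capacity, add_per_step, delay_steps, max_capacity
    )
    return max((max(0, d - c) for d, c in zip(demand, capacities)), default=0)


if __name__ == "__main__":
    # Hypothetical spike: requests per second, sampled once per minute.
    spike = [100, 200, 400, 800, 800]
    print(worst_shortfall(spike, start_capacity=100, add_per_step=100,
                          delay_steps=1, max_capacity=350))
```

In this made-up scenario demand doubles every minute while capacity grows linearly and then hits the quota, so at the worst point 500 requests per second go unserved.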

The Hidden Cost of Overplanning

Over-provisioning “just to be safe” feels responsible. But it comes at a price.

The Hidden Cost of Underplanning

On the flip side, underestimating capacity can lead to:

Capacity missteps aren’t just technical—they’re business risks.

Real-World Example: The Promo That Crashed the System

A streaming startup planned a big marketing push. The platform team was told to expect a “modest spike” of 2x normal traffic, and they provisioned conservatively. Then the promo went viral. Traffic spiked 10x. The app crashed. Databases melted. CDN costs skyrocketed. The CTO had to explain on Twitter. The SRE team wasn’t to blame: they had asked for more runway, but the budget got cut. Afterward, the company adopted:

Capacity planning became everyone’s problem, not just the infra team’s.

When Engineering Meets Intuition

Some of the best capacity planners aren’t just data-driven; they’re pattern matchers. They know when a feature looks small but hides a big cost, when an upcoming event could surprise, and when a subtle trend hints at deeper trouble. That’s where engineering meets instinct. It’s not astrology. It’s experience.

What Good Capacity Planning Looks Like

  1. Automated Dashboards     Showing usage trends, forecasted growth, and limits across services.

  2. Collaboration Across Teams     Engineers, product managers, and business stakeholders in the same room.

  3. Clear Runbooks for Scaling Events     What to do if load spikes, including pre-approved escalations.

  4. Synthetic Load Testing Pipelines     CI-based load tests that evolve with the app.

  5. Defined Risk Models     Knowing which systems need strict headroom—and which can ride close to the edge.

  6. Postmortems on Capacity Events     Learn from under- and over-provisioning—not just outages.
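
One way to write a risk model down is as a headroom percentage per service tier. The sketch below is only an illustration: the tier names and percentages are assumptions, not recommendations, and `required_instances` is a hypothetical helper.

```python
import math

# Hypothetical headroom policy: extra capacity above forecast peak, in percent.
HEADROOM_PCT_BY_TIER = {
    "critical": 50,     # strict headroom, e.g. checkout or auth
    "standard": 30,
    "best_effort": 10,  # can ride close to the edge
}


def required_instances(peak_rps: int, rps_per_instance: int, tier: str) -> int:
    """Instances needed to serve forecast peak load plus the tier's headroom.

    `rps_per_instance` should come from load testing, not from a guess.
    """
    if rps_per_instance <= 0:
        raise ValueError("rps_per_instance must be positive")
    headroom_pct = HEADROOM_PCT_BY_TIER[tier]
    target_rps = peak_rps * (100 + headroom_pct) / 100
    return math.ceil(target_rps / rps_per_instance)


if __name__ == "__main__":
    for tier in HEADROOM_PCT_BY_TIER:
        print(tier, required_instances(peak_rps=950, rps_per_instance=100, tier=tier))
```

Keeping the policy in one table makes the trade-off explicit and reviewable: the same forecast peak yields different fleet sizes depending on how much risk each service is allowed to carry.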

The Cloud Complicates—and Enables

Cloud changed the game. You can scale horizontally with a few API calls and use autoscaling groups, spot instances, and serverless. But it also introduced quota limits and throttling, hidden costs from misuse, and tooling sprawl across regions and vendors. Good capacity planning today must understand both cloud primitives and application behavior.
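
Quota limits are easy to forget until a scale-out fails. As a rough sketch, planned peak usage can be compared against quotas ahead of an event. The resource names and numbers below are invented for illustration, and the code deliberately calls no real cloud provider API; in practice the quota values would come from your provider's console or tooling.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class QuotaCheck:
    resource: str  # hypothetical label, e.g. "vcpus:region-a"
    planned: int   # usage expected at forecast peak
    quota: int     # current limit for this resource


def quota_risks(checks: Iterable[QuotaCheck], warn_ratio: float = 0.8) -> List[str]:
    """Resources whose planned usage reaches `warn_ratio` of the quota."""
    return [c.resource for c in checks if c.planned >= c.quota * warn_ratio]


if __name__ == "__main__":
    plan = [
        QuotaCheck("vcpus:region-a", planned=900, quota=1000),
        QuotaCheck("public-ips:region-a", planned=10, quota=100),
        QuotaCheck("load-balancers:region-a", planned=12, quota=10),
    ]
    print(quota_risks(plan))
```

Anything this check flags is a reason to request a quota increase early, since increases can take longer to approve than a spike takes to arrive.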

Final Thought

Capacity planning isn’t astrology. But it’s not pure engineering, either. It’s a blend of measurement, modeling, experience, and adaptation. It’s about embracing uncertainty, building buffers, and communicating across teams. So no—it’s not inevitable that you’ll get it wrong. But it is inevitable that things will change. And the best SRE teams aren’t the ones who guess perfectly. They’re the ones who plan, measure, review, and evolve. Because in a world of unknowns, resilience starts with preparation—and the willingness to revise the plan.