SRE Concepts Part 5 (Capacity Planning & Availability Monitoring)

The fifth article in this series on SRE concepts covers two topics: Capacity Planning and "Time-based Versus Aggregated Availability".

Capacity Planning

Capacity planning is the process of comparing the production capacity an organization needs to meet its customers' demands with the capacity it has at present.

Capacity planning helps companies plan their applications and products according to demand. Many industries use it, from automotive to information technology.

One of the most popular uses of the term is in IT. In information technology, capacity planning means estimating the hardware, software, and network resources needed over a given timespan.

For instance, estimating how much storage you will need for the next five months is capacity planning.

Capacity planning is quite complicated, even though it may seem simple, and it is essential to make the estimate as precise as possible. However, with cloud-based models you can be a bit more flexible, since you can dynamically deallocate and reallocate resources and pay as you go.

Even though capacity planning sounds like a redundant job with cloud models' "pay as you go" service model, it is worth knowing how many resources you may need in the future. It will give you an idea of how you can grow or scale as per the requirement.

How Does It Work?

Let us say that you have a cryptocurrency exchange platform. Since there is a crypto boom going on right now, you may be looking to increase the number of available servers. Even if you have a cloud model, you still have to present a reasonable budget to your supervisor.

A budget that exceeds what you actually need wastes company money. At the same time, you cannot propose a budget lower than what is required.

In such cases, you turn to capacity planning. You will take a look at the user growth tables of your company.

After making the necessary predictions and spending hours plotting charts, you present a budget that should cover your server charges for the next quarter. Your actual costs may end up above or below this budget.

All that matters is how close the budget is to the actual costs. The closer it is, the better your capacity planning was.
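The budgeting exercise above can be sketched in a few lines of Python. This is only a minimal illustration under simplifying assumptions that are not from the scenario itself: users grow at a constant compound monthly rate, each server handles a fixed number of users, and a fixed headroom is reserved for spikes. All names and figures (forecast_users, servers_needed, quarterly_budget, the growth rate, the cost per server) are hypothetical.

```python
import math


def forecast_users(current_users: int, monthly_growth_rate: float, months: int) -> list[int]:
    """Project user counts for each of the next `months` months.

    Assumes constant compound growth, which real traffic rarely follows exactly.
    """
    return [
        round(current_users * (1 + monthly_growth_rate) ** m)
        for m in range(1, months + 1)
    ]


def servers_needed(users: int, users_per_server: int, headroom: float = 0.2) -> int:
    """Servers required for `users`, with spare capacity (headroom) for spikes."""
    return math.ceil(users * (1 + headroom) / users_per_server)


def quarterly_budget(
    current_users: int,
    monthly_growth_rate: float,
    users_per_server: int,
    cost_per_server_month: float,
    headroom: float = 0.2,
) -> float:
    """Sum the projected server cost over the next three months."""
    projected = forecast_users(current_users, monthly_growth_rate, 3)
    return sum(
        servers_needed(u, users_per_server, headroom) * cost_per_server_month
        for u in projected
    )


if __name__ == "__main__":
    # Hypothetical inputs: 100,000 users today, 10% monthly growth,
    # 5,000 users per server, 200 currency units per server per month.
    print(forecast_users(100_000, 0.10, 3))
    print(quarterly_budget(100_000, 0.10, 5_000, 200.0))
```

In practice the growth rate would come from the user growth tables mentioned above, and comparing the forecast with the actual bill each quarter shows how good the planning was.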

Why Do You Need Capacity Planning?

Now, let us say that you're working at the same company. Since you have a cloud model, you decided not to perform capacity planning but pay as you go. However, the crypto boom brings millions of users to your platform, which you did not expect. Now, you have to shut down for hours in the middle of a crypto boom.

Does this look familiar to you? You guessed it. This is precisely what happened to numerous platforms during the BTC and dogecoin boom of January 2021.

Hence, without proper capacity planning, you may lose millions in profit and end up paying more for the same resources that you had before.

Conclusion

Capacity planning is essential for any business, cloud-based or not. Make sure you account for market growth when planning capacity so that demand does not outrun your resources.

Time-based Versus Aggregated Availability

One of the critical responsibilities of a Site Reliability Engineer is to ensure availability. Availability is considered the success condition for any product. In Site Reliability Engineering (SRE), availability means that a product performs its intended function without difficulty.

If a product is not available, in other words, if it is not performing the function it was intended to, then the product is a failure. Site Reliability Engineers have to ensure the availability of a product at all times. Usually, some automated tools are available to monitor the availability, but one should do manual checks on certain occasions.

Availability Checks

Availability checks are also a type of risk assessment. Since the success and failure conditions of products are defined by availability, it is essential to measure availability regularly.

However, external factors may skew the results. To get results that are as accurate as possible, there are two ways to measure availability: aggregated availability and time-based availability.

Time-Based Availability

The best way to represent risk tolerance in Site Reliability Engineering is by measuring the acceptable level of downtime of a product. As you may already know, Site Reliability Engineers do not eliminate risk. Instead, they try to keep it at a level that is realistic for the product.

We can calculate the service availability and the unplanned downtime using Time-Based Availability.

The formula to calculate Time-Based Availability is as given below.

Availability = uptime / (uptime + downtime)

In other words, Time-Based Availability is the ratio of uptime of a product to the total time since the product was deployed. We can calculate the acceptable downtime for a period using the above formula.

For instance, to calculate the total acceptable downtime in a year with an availability target of 99.99%, put the target on the left-hand side, set uptime + downtime to the total time in a year, and solve for downtime.

A year has 525600 minutes, and the availability target is 99.99%. Rearranging the formula gives downtime = (1 - availability) x total time, so the acceptable downtime in this case is 0.0001 x 525600 = 52.56 minutes.
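The calculation above can be checked with a short Python sketch. It simply rearranges the time-based availability formula to solve for downtime; the function names are illustrative and not part of any library.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525600, ignoring leap years


def time_based_availability(uptime: float, downtime: float) -> float:
    """Availability = uptime / (uptime + downtime)."""
    total = uptime + downtime
    if total <= 0:
        raise ValueError("total time must be positive")
    return uptime / total


def allowed_downtime(availability_target: float, period: float) -> float:
    """Acceptable downtime for a target, in the same unit as `period`.

    Derived from the formula above: downtime = (1 - availability) * total time.
    """
    if not 0.0 <= availability_target <= 1.0:
        raise ValueError("availability target must be between 0 and 1")
    return (1.0 - availability_target) * period


if __name__ == "__main__":
    budget = allowed_downtime(0.9999, MINUTES_PER_YEAR)
    print(f"Allowed downtime per year at 99.99%: {budget:.2f} minutes")
```

The same function works for any period, for example the minutes in a 30-day month, as long as the period and the result are read in the same unit.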

Aggregated Availability

Even though time-based availability is a useful metric, many organizations need something different. Many of them run their products in more than one place, so even if a product is down in one location, it may still be running in another. In that situation a single up-or-down measurement says little, and time-based availability becomes potentially useless.

Aggregated availability counters this by measuring requests instead of time. This metric is more reliable because requests fail whenever the product is not available. The formula to calculate Aggregated Availability is as given below.

Availability = Successful requests / Total Requests

Aggregated availability has a disadvantage of its own: not all requests are the same. For instance, a signup request is different from a purchase request, yet both count equally in the ratio.

However, availability measured as the success rate over all requests is a fair approximation of the overall acceptable downtime.
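As a rough illustration of the request-based formula, the Python sketch below combines request counts from several locations into one availability figure. The region names and counts are made up for the example, and every request is weighted equally, which is exactly the simplification discussed above.

```python
def aggregated_availability(successful: int, total: int) -> float:
    """Availability = successful requests / total requests."""
    if total <= 0:
        raise ValueError("total requests must be positive")
    if not 0 <= successful <= total:
        raise ValueError("successful requests must be between 0 and total")
    return successful / total


def combined_availability(counts: dict[str, tuple[int, int]]) -> float:
    """Combine per-location (successful, total) counts into one ratio.

    Summing the counts first means busier locations weigh more than quiet ones.
    """
    successful = sum(ok for ok, _ in counts.values())
    total = sum(all_requests for _, all_requests in counts.values())
    return aggregated_availability(successful, total)


if __name__ == "__main__":
    # Hypothetical counts: (successful requests, total requests) per location.
    counts = {
        "us": (999_000, 1_000_000),
        "eu": (499_900, 500_000),
    }
    print(f"Aggregated availability: {combined_availability(counts):.4%}")
```

Summing counts before dividing, rather than averaging per-location ratios, keeps the result faithful to the formula because each request contributes once.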

Conclusion

Availability is a crucial aspect of Site Reliability Engineering. Although it is mostly an operations concern, decisions made during the development phase can affect availability as well.