Created on 2020-08-14 19:05
Published on 2020-08-14 19:18
When people talk about high availability in uptime numbers, everybody talks about how many nines they need. But in a lot of cases they do not understand the level of commitment in programming, hardware and organization it takes to reach these numbers. And a lot of the time the benefits do not outweigh the cost. We have all been brainwashed into believing that the better the availability, the better the product or application. We need to get back to the "Good Enough" scenario for high availability. That is why I wrote this little article, which gives an overview of the consequences of the different nines. Let’s start.
Two nines means 99% uptime. When you calculate this, it comes down to 3.65 days, or 87 hours and 36 minutes, of downtime a year. So let’s look at some examples of how long you can be down.
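The downtime figures in this article all come from the same small calculation. Here is a minimal sketch of it in Python, assuming a 365-day year; the function name `downtime_per_year` is made up for this example.

```python
from datetime import timedelta

# Assumption: a non-leap year of 365 days, as used for the figures in this article.
HOURS_PER_YEAR = 365 * 24


def downtime_per_year(availability_percent: float) -> timedelta:
    """Return the yearly downtime allowed by an availability percentage."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    unavailable_fraction = (100 - availability_percent) / 100
    return timedelta(hours=HOURS_PER_YEAR * unavailable_fraction)


if __name__ == "__main__":
    # Print the yearly downtime budget for two to five nines.
    for percent in (99, 99.9, 99.99, 99.999):
        print(f"{percent}% uptime -> {downtime_per_year(percent)} downtime per year")
```

Every extra nine divides the allowed downtime by ten, which is why each step up gets so much harder.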
The hardware stack that you run on does not have to be anything special, and the software does not have to be anything special either.
Organization-wise, you do not need standby at night or on weekends. Everything can be dealt with in normal office hours.
Three nines means 99.9% uptime, which works out to about 8 hours and 46 minutes of downtime per year. So let’s look at some examples of how long you can be down, and when you cannot.
The hardware needs to be designed for this. At a minimum you want multiple servers on different power cables. You should think about RAID disks and a backup generator, but these are not strictly necessary to reach this availability. You need to test your code very carefully, and you will need to build extra monitoring into your application. The monitoring is needed to create the alerts that wake people up at night.
In your organization you will need a group of people who are on standby (on call) for the complete stack, because they need to take care of problems at night or on weekends. They can do this from home and should not have to come to the office.
Four nines means 99.99% uptime, which works out to about 52 minutes of downtime a year. At this level things start to get difficult and expensive. Examples of how long you can and cannot be down.
The hardware stack needs to be designed for this as well, but more thoroughly than for three nines. Now you get to the big items that you will need to start thinking about.
* 2 network connections per server
* 2 Power supplies per server
* 2 power grids per server
* RAID disks
* Not on the same power grid
* With backup generators
* This must be replicated in different Datacenters
* Logging must be very clear
* Metrics are a must
* Stateless (Preferred)
Organization-wise this also becomes a lot more critical. Let’s say it takes your monitoring about 10 minutes to see that something is really wrong. At that moment it sends out an alert to the standby. He wakes up in the middle of the night, turns on his laptop and logs into the company network. This takes another 10 minutes. Now he checks the logs, reads them and tries to find out what is wrong. This takes him another 10 minutes. Now he needs somebody else to do something for him to resolve it, so he calls the next standby.
As you can see, time flies when you are in this kind of situation: you have already lost 30 of your 52 minutes of downtime for the year. So the organization needs to be very flat, and the first person who is called needs to be able to solve most if not all of the problems, because if you need to start calling in extra people you will run out of your availability budget. The standby also needs to be aware of this and willing to stay close to a computer and an internet connection. Going swimming at the beach or going shopping is not an option; you will already lose too much time getting connected.
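The arithmetic behind that incident can be written out as a small budget calculation. This is only a sketch: the three 10-minute steps are the estimates from the story above, a 365-day year is assumed, and the name `remaining_budget_minutes` is made up for the example.

```python
# Assumption: a non-leap year of 365 days.
MINUTES_PER_YEAR = 365 * 24 * 60


def remaining_budget_minutes(availability_percent: float, step_minutes: list[float]) -> float:
    """Return the yearly downtime budget left after one incident.

    step_minutes holds the duration of each step of the incident, in minutes.
    A negative result means the incident alone broke the yearly target.
    """
    yearly_budget = MINUTES_PER_YEAR * (100 - availability_percent) / 100
    return yearly_budget - sum(step_minutes)


if __name__ == "__main__":
    # The incident from the text: detection, logging in, reading the logs.
    incident = {"monitoring notices": 10, "standby logs in": 10, "reading logs": 10}
    left = remaining_budget_minutes(99.99, list(incident.values()))
    print(f"Budget left for the rest of the year: {left:.1f} minutes")
```

With four nines, roughly 22 minutes are left for the actual fix and for every other incident that year.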
The most important thing with four and five nines is planning. You need to plan upfront for every possible failure that you can think of, so that if it happens, your stack (hardware and software) is designed to keep running despite the failure.
Five nines means 99.999% uptime; this is the domain of the gods. It means that you will have a downtime of about 5 minutes and 15 seconds in a year (5.26 minutes). Examples of how long you can and cannot be down.
You will need the hardware stack that is mentioned in the four nines plus.
For your organization it is a radical shift. No more standby. You will need people onsite 24/7, and those people must be able to fix the complete stack, whether that takes one person or several. You need highly skilled and highly motivated people.
The big thing I want to end with is what “Good Enough” means. A lot of people want the best uptime they can afford. But this is not a good way to look at it. There are 3 major things you need to consider when you decide what uptime you want to reach:
1. What does my customer expect? What if your customer expects 99.9% only during business hours? Why then would you have standby?
2. What technology is in front of my application and what is behind my application?
a. If any of these technologies has a lower uptime than what you are planning, why are you paying extra money for something nobody will see?
3. Cost. What are the losses I get from my downtime and what do I need to spend to get more uptime? If the cost of getting more uptime is higher than the losses, do not do it.
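The cost question in point 3 can be sketched as a simple comparison. The example below is an illustration only: the loss per hour and the yearly cost are invented numbers, the function name `worth_upgrading` is made up, and it assumes that losses grow linearly with downtime and that the full downtime budget is used every year.

```python
# Assumption: a non-leap year of 365 days.
HOURS_PER_YEAR = 365 * 24


def worth_upgrading(current_percent: float, target_percent: float,
                    loss_per_hour: float, extra_cost_per_year: float) -> bool:
    """Return True if the avoided downtime losses exceed the extra yearly cost.

    Simplifying assumptions: losses are linear in downtime and the whole
    downtime budget is used up each year.
    """
    avoided_hours = HOURS_PER_YEAR * (target_percent - current_percent) / 100
    avoided_losses = avoided_hours * loss_per_hour
    return avoided_losses > extra_cost_per_year


if __name__ == "__main__":
    # Hypothetical numbers: 1,000 lost per hour of downtime,
    # 50,000 per year for the extra hardware and standby.
    print("99%   -> 99.9%: ", worth_upgrading(99, 99.9, 1_000, 50_000))
    print("99.9% -> 99.99%:", worth_upgrading(99.9, 99.99, 1_000, 50_000))
```

With these made-up numbers the step from two to three nines pays for itself, while the step from three to four nines does not, because it only avoids about 8 more hours of downtime a year.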
The second one is important to think about. I can give you 2 examples from the Netherlands.
1. Internet providers only give a 99.9% uptime guarantee, so why would a website or an internet application go higher than that?
2. The mobile network is even worse: a study showed that in the Netherlands it only has a reliability of 99.7%. So again, why would you go higher than that?
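To see why the weakest link matters, you can multiply the availabilities of everything the user has to pass through. This is a rough sketch that assumes the components fail independently of each other and that all of them are needed to reach the application; the name `chain_availability` is made up for the example.

```python
from math import prod


def chain_availability(percentages: list[float]) -> float:
    """Return the combined availability (in percent) of components in series.

    Assumes independent failures and that every component must be up
    for the user to reach the application.
    """
    return 100 * prod(p / 100 for p in percentages)


if __name__ == "__main__":
    # A four-nines application reached over the connections mentioned above.
    via_isp = chain_availability([99.99, 99.9])
    via_mobile = chain_availability([99.99, 99.7])
    print(f"Behind a 99.9% internet connection: {via_isp:.3f}%")
    print(f"Behind a 99.7% mobile network:      {via_mobile:.3f}%")
```

Under these assumptions the user never sees more than the availability of the weakest component, so a 99.99% application behind a 99.9% connection still comes out just below 99.9%.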