SRE Concepts Part 8 (Break Your System & Test in Production)

The eighth article in this series on SRE concepts covers two topics: "Break your system" and "Test in Production".

Break Your System: Why?

Site Reliability Engineering relies heavily on rapid deployments of stable code while keeping the application available to all customers. This is not an easy job, considering how little downtime a production application can afford.

Some companies lose millions of dollars if they are down for more than a few minutes. Targets this tight call for a thorough search for weaknesses. You can hit very high uptime targets only if you know how to fix incidents quickly or avoid them in the first place.

One common rule in both cybersecurity and software development is that incidents will always happen. A Site Reliability Engineer has to handle each incident before it causes too much downtime.

The error budget, which is the downtime left over by the availability target, tells us how much total downtime the product can afford. In simple terms, there are just two types of incidents: the ones you anticipated and the ones you didn't. As a Site Reliability Engineer, your goal is to ensure that your team has anticipated most of them.
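To make the error budget concrete, here is a minimal sketch of the arithmetic. The 30-day window and the function names are assumptions chosen for illustration, not values taken from a real service.

```python
def error_budget_minutes(slo: float, period_days: int = 30) -> float:
    """Return the downtime (in minutes) that an availability target allows.

    slo is the availability target as a fraction, e.g. 0.999 for 99.9%.
    period_days is the measurement window (30 days is an assumption).
    """
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    total_minutes = period_days * 24 * 60
    return (1.0 - slo) * total_minutes


def budget_remaining(slo: float, downtime_minutes: float,
                     period_days: int = 30) -> float:
    """Return the unspent error budget; a negative value means it is blown."""
    return error_budget_minutes(slo, period_days) - downtime_minutes
```

With a 99.9% target over 30 days, the budget comes to about 43.2 minutes. An incident that costs 30 minutes leaves roughly 13 minutes for the rest of the window, which is why unanticipated incidents are so expensive.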

Automation tools are helpful, but they cannot detect every way in which an application can fail. The situation is even worse for failure modes nobody has seen before, much like zero-day vulnerabilities in security. The best way to learn about most of these incidents ahead of time is to break your system yourself.

Breaking your system on purpose is the core idea of chaos engineering. You take your product and try to break it in every way possible, for example by shutting down instances or cutting off dependencies. On the security side, you can run vulnerability analyses, try some good old SQL injection, and check whether cross-site scripting is possible. If you manage to break your system, go back to development and fix the issue.

Repeat the process until no one on the team can break the system. At that point, the system is far better prepared for the outside environment.

One of the mottos for SREs should be, "If it isn't broken, then let's break it!" It may sound weird at first, but it makes complete sense. If you can break your system, then it is likely that a trained hacker in the wild can too. Trying to break your system exposes many vulnerabilities that need fixing.

Chaos Engineering for the Win

Chaos engineering is a natural fit for SREs. As mentioned above, one of the primary goals of an SRE is to deliver stable, reliable applications rapidly. Before you try chaos engineering, every application or product carries some potential risks and hidden vulnerabilities.

However, once you know most of the issues associated with your product, you get a much clearer idea of when it is ready to deploy.

You can apply chaos engineering to pretty much any system an SRE team is responsible for. However, the implementation methods and procedures vary depending on the application.
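As one illustration of what "breaking" a system can look like in code, here is a small fault-injection sketch. It wraps a dependency so that a chosen fraction of calls fail on purpose, then checks whether the caller copes. Everything in it (FlakyDependency, fetch_price, price_with_fallback, the failure rates) is hypothetical and made up for this example; real chaos tooling works at the infrastructure level and is far more involved.

```python
import random


class FlakyDependency:
    """Wrap a callable and make a fraction of its calls fail on purpose."""

    def __init__(self, func, failure_rate, rng=None):
        if not 0.0 <= failure_rate <= 1.0:
            raise ValueError("failure_rate must be between 0 and 1")
        self.func = func
        self.failure_rate = failure_rate
        self.rng = rng if rng is not None else random.Random()

    def __call__(self, *args, **kwargs):
        # Inject a fault before the real call with the configured probability.
        if self.rng.random() < self.failure_rate:
            raise ConnectionError("injected fault")
        return self.func(*args, **kwargs)


def fetch_price(item):
    """Hypothetical downstream service, stubbed with a dictionary."""
    return {"apple": 3, "pear": 5}[item]


def price_with_fallback(lookup, item, retries=2, default=None):
    """Code under test: retry a few times, then degrade to a default."""
    for _ in range(retries + 1):
        try:
            return lookup(item)
        except ConnectionError:
            continue
    return default


def run_experiment(failure_rate, calls=1000, seed=42):
    """Call the code under test repeatedly while faults are being injected."""
    flaky = FlakyDependency(fetch_price, failure_rate, random.Random(seed))
    degraded = 0   # calls that fell back to the default value
    unhandled = 0  # calls where an exception escaped: the system "broke"
    for _ in range(calls):
        try:
            if price_with_fallback(flaky, "apple") is None:
                degraded += 1
        except Exception:
            unhandled += 1
    return {"calls": calls, "degraded": degraded, "unhandled": unhandled}
```

If run_experiment reports any unhandled failures, you have found a way to break the system and can go back to development to fix it. Raising failure_rate step by step shows how gracefully the code degrades.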

Conclusion

Breaking your system is one of the best methods to identify issues in it. If you execute this step correctly, you may be looking at an application with little to no downtime.

Why Do You Have to Test in Production?

There is a myth that testing in production means releasing untested code directly to customers and hoping that everything works. In reality, nothing could be further from the truth. Since a DevOps or SRE team focuses on speed, it is sometimes better to release tested code to production quickly.

However, in such models the development phase does not end with the release. Instead, the developers also check the application for errors in production. Usually, these are errors that are hard to detect during the build phase.

Hence, to increase deployment speed, testing in the production environment helps the developers see bugs quickly and fix them. This practice of watching for bugs is called continuous monitoring. In other words, the development team stays on the lookout for bugs even in production and is ready to fix them whenever one pops up.

Continuous monitoring principles are essential to any DevOps engineer and Site Reliability Engineer. They help speed up the deployment phase and deliver the application quickly. Testing in production also lets you monitor and analyze the user experience.

Why Not Just Test in Staging?

You can avoid much of the testing in production if you have a staging environment. Staging environments are replicas of production environments, but much smaller. If you already have a staging environment built and synced, it is best to use that for testing. However, keeping the environment in sync can be pretty challenging and may delay the product deployment.

When testing in staging environments, you also have to configure the application differently. Since the staging environment is much smaller than the production environment, many configuration options will differ. These differences may cause unexpected errors and confuse the engineers.

How to Conduct Production Testing?

You can conduct production testing in two ways: A/B testing and continuous monitoring.

A/B Testing

People primarily use A/B testing to analyze the user experience. Usually, two versions of a product are released, each to a share of the users. The version that the users prefer is selected and kept. The other is sent back for development or discarded.

A/B testing is a fantastic tool to measure the impact of UI and UX on the users. You can only do it in the production environment, since it relies on real users behaving naturally. The users prefer either option A or option B, hence the name A/B testing.
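A common way to split users between the two versions is to hash a user ID, so that the same user always sees the same version. The sketch below assumes string user IDs and a hypothetical experiment name; assign_variant and conversion_rate are illustrative helpers, not part of any particular A/B testing product.

```python
import hashlib

BUCKETS = 10_000  # granularity of the split (an arbitrary choice)


def assign_variant(user_id: str, experiment: str, share_b: float = 0.5) -> str:
    """Deterministically assign a user to version "A" or "B".

    Hashing the experiment name together with the user ID keeps the
    assignment stable across visits and independent between experiments.
    """
    if not 0.0 <= share_b <= 1.0:
        raise ValueError("share_b must be between 0 and 1")
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % BUCKETS  # 0 .. BUCKETS - 1
    return "B" if bucket < round(share_b * BUCKETS) else "A"


def conversion_rate(conversions: int, visitors: int) -> float:
    """Fraction of visitors who did what you hoped (e.g. signed up)."""
    return conversions / visitors if visitors else 0.0
```

You would then compare conversion_rate for the two groups to see which version users prefer. A real decision should also check that the difference is statistically significant, which this sketch leaves out.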

Continuous Monitoring

Continuous monitoring is an essential part of any DevOps team's work. The team first deploys the application as soon as possible. Once the application is available to the customers, bugs can surface faster than they would in a staging environment.

The engineers then fix any bugs that come up and release updated versions. The process continues throughout the life of the application, with each cycle removing more bugs.
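The monitoring half of this cycle can be sketched as a sliding-window error-rate check. This is a simplified illustration only: the window size, the 5% threshold, and the ErrorRateMonitor class are assumptions for the example, and a production team would normally rely on a dedicated monitoring system rather than hand-rolled code.

```python
from collections import deque


class ErrorRateMonitor:
    """Track the outcome of recent requests and flag a high error rate."""

    def __init__(self, window_size: int = 100, threshold: float = 0.05):
        # A deque with maxlen drops the oldest outcome automatically.
        self.outcomes = deque(maxlen=window_size)
        self.threshold = threshold

    def record(self, ok: bool) -> None:
        """Record one request: True for success, False for an error."""
        self.outcomes.append(ok)

    def error_rate(self) -> float:
        """Fraction of failed requests in the current window."""
        if not self.outcomes:
            return 0.0
        return self.outcomes.count(False) / len(self.outcomes)

    def should_alert(self) -> bool:
        """Alert only once the window is full, to avoid noisy early alarms."""
        window_full = len(self.outcomes) == self.outcomes.maxlen
        return window_full and self.error_rate() > self.threshold
```

Each request handled in production calls record, and an alert from should_alert is the signal for the engineers to investigate, fix the bug, and release an updated version.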

Conclusion

Testing in production can be essential to many rapid deployment and development teams, especially if they don't have a staging environment. Make sure you follow the continuous monitoring cycle to update the product over time and keep your customers happy.