SRE Concepts series Part 1

I have been asked many times about certain SRE concepts, so I am writing a series on 15 topics that feature in the Google SRE book. These concepts are not just for SRE; you can use them in any IT environment. In this first part, I briefly discuss what SRE is. After that, in no particular order, I will cover the following topics:

Risk
Time-based Versus Aggregated Availability
Error budget
Service Level Indicators
Service Level Objectives
Toil
Black box monitoring
White box monitoring
The value of automation
Continuous build and deployment
Stability versus Agility
Root cause analysis
Capacity planning
Break your systems
Testing in production

What is Site Reliability Engineering?

Site Reliability Engineering, or SRE for short, is a discipline that applies software engineering practices to infrastructure and operations problems. It aims to deliver more reliable and scalable applications and environments.

Even though it was developed before the DevOps movement, Site Reliability Engineering is still one of the most in-demand fields in the Information Technology sector. Site Reliability Engineers are a vital part of the teams that run modern cloud infrastructure. Depending on the infrastructure, a Site Reliability Engineer may take on quite a few different tasks.

A Site Reliability Engineer spends up to half of their time on operations work: handling issues, manual interventions, deployments, and other ad-hoc tasks. The Google SRE book treats 50% as a cap on this kind of work rather than a target. The engineer spends the rest of their time on development.
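
The split between operations and development time can be checked against logged hours. The following is a minimal sketch, assuming hours are already tracked per category; the function names and the default cap of 0.5 are illustrative and not taken from any particular tool.

```python
def ops_fraction(ops_hours: float, dev_hours: float) -> float:
    """Return the share of total working time spent on operations work."""
    total = ops_hours + dev_hours
    if total <= 0:
        raise ValueError("total hours must be positive")
    return ops_hours / total


def exceeds_ops_cap(ops_hours: float, dev_hours: float, cap: float = 0.5) -> bool:
    """True when operations work takes more than `cap` of the total time."""
    return ops_fraction(ops_hours, dev_hours) > cap


if __name__ == "__main__":
    # Hypothetical week: 24 hours of operations work, 16 hours of development.
    print(f"ops share: {ops_fraction(24, 16):.0%}")
    print("over cap:", exceeds_ops_cap(24, 16))
```

A team that is consistently over the cap has a signal that operations work should be automated or handed back, rather than absorbed.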

Site Reliability Engineers generally look for work that is easy to automate. Automating it leaves them more time to develop new features and improve the application while maintaining it. Because a Site Reliability Engineer needs to know both development and operations, skilled ones are usually hard to find.

SRE and DevOps work on the same base principle: the same engineers handle both development and operations. Site Reliability Engineering is often described as a specific implementation of DevOps.

Site Reliability Engineer Responsibilities

Share Responsibility

Many organizations are adopting a shared responsibility model to speed up development and ensure application security. A shared responsibility model also removes the single point of delay that appears when one team has to wait on another.

Site Reliability Engineers use the same tools and software as developers. They share the responsibility of building the product with the product developers, which is a significant part of the shared responsibility model.

Accept Failure and Prepare for it

Unlike traditional developers or operations engineers, Site Reliability Engineers accept that failure is normal and plan for failure scenarios. They track the product's downtime against an error budget and take corrective measures when the budget runs low. Embracing a measured amount of risk, rather than aiming for zero failures, is what lets the project keep moving.

Site Reliability Engineers quantify failure and availability in terms of Service Level Indicators (SLIs) and Service Level Objectives (SLOs).
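
To make the error budget concrete, here is a minimal sketch of the arithmetic, assuming the SLO is expressed as a fraction (for example 0.999 for 99.9%). The function names, the 30-day window, and the request counts are illustrative assumptions, not values from the SRE book.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for a time-based availability SLO."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


def budget_remaining(slo: float, good_events: int, total_events: int) -> float:
    """Fraction of a request-based error budget still unspent.

    Returns 1.0 when no budget has been used, 0.0 when it is exactly
    used up, and a negative value when the SLO has been missed.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    allowed_failures = (1.0 - slo) * total_events
    actual_failures = total_events - good_events
    return 1.0 - actual_failures / allowed_failures


if __name__ == "__main__":
    # A 99.9% SLO over 30 days leaves roughly 43 minutes of downtime.
    print(f"{error_budget_minutes(0.999):.1f} minutes")
    # Hypothetical traffic: 1,000,000 requests, 500 of them failed.
    print(f"{budget_remaining(0.999, 999_500, 1_000_000):.0%} of budget left")
```

Here the SLI is the measured ratio of good events to total events, the SLO is the target for that ratio, and the error budget is whatever the SLO leaves over.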

Use Automation for Menial Tasks

A core function of a Site Reliability Engineer is to automate simple, repetitive tasks. Automating menial tasks that would otherwise require significant manual effort considerably reduces the time spent on operations. Once the automation is in place, Site Reliability Engineers have more time to work on new features or improve existing ones.

Implement and Adapt to Gradual Changes

Site Reliability Engineers encourage developers to move quickly by ensuring that the cost of failure is low. Small, gradual changes are easier to verify and to roll back, so they can be rolled out more rapidly than with traditional approaches.

Quantify Parameters

Site Reliability Engineers try to quantify as much as possible, such as availability, downtime, and time spent on operations, so that decisions are based on measurements and aim to minimize losses and maximize gains.

Conclusion

Site Reliability Engineers are a core part of modern development and operations teams, especially when the cloud is involved. Combining a good cloud environment with a decent software development plan and a good SRE can bring in the results you are looking for.