Who is an SRE?

Site reliability engineer — who are they, and what tasks do they handle?

Site reliability engineering is an aspect of software development whose goal is to ensure the continuous reliability of software systems.

Note

Site reliability engineers (SREs) are production-level engineers who focus on the performance and reliability of software once it goes out into the real world.

The main task for an SRE is software reliability.

Let's break this down in more detail. SRE is an important part of the entire software production and operation cycle. Once software reaches production, it is important to do everything possible to ensure it runs reliably and is available to end users. As part of this, SRE addresses the following tasks:

  • Increasing uptime (that is, the availability of the application)
  • Optimizing the application's speed (taking cost constraints into account)
  • Working together with developers on the code to achieve its maximum efficiency
  • Working to prevent various kinds of attacks, including DDoS and intrusion
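To make the first task concrete, uptime is usually expressed as an availability percentage. The sketch below is only an illustration: the function names, the 30-day window, and the 99.9% target are assumptions chosen for the example, not figures from this article.

```python
def availability(uptime_seconds: float, downtime_seconds: float) -> float:
    """Fraction of the observation window during which the service was up."""
    total = uptime_seconds + downtime_seconds
    if total <= 0:
        raise ValueError("observation window must be positive")
    return uptime_seconds / total


def allowed_downtime_minutes(target: float, window_days: int = 30) -> float:
    """Minutes of downtime that still satisfy an availability target.

    `target` is a fraction, e.g. 0.999 for "three nines" (an assumed example).
    """
    if not 0.0 <= target <= 1.0:
        raise ValueError("target must be between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - target) * window_minutes


if __name__ == "__main__":
    # Hypothetical measurements: 9,990 s up and 10 s down.
    print(f"availability: {availability(9_990, 10):.3%}")
    # Downtime allowance for an assumed 99.9% target over 30 days.
    print(f"allowed downtime: {allowed_downtime_minutes(0.999):.1f} minutes")
```

Under these assumptions, a 99.9% target over 30 days leaves roughly 43 minutes of downtime, which shows how quickly the allowance shrinks as the target rises.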

Site reliability engineering is when you treat operations as an engineering problem.
Ben Treynor Sloss, Vice President of Engineering at Google

This statement emphasizes the importance of treating software operations as a complex engineering problem that requires a strategic approach and specialized knowledge.

This approach frees us from the outdated paradigm of "send us the code and we will run it on the servers." As a result, it aligns philosophically with the widespread DevOps movement, which seeks to remove the barriers between software development and operations.

To achieve this, Site Reliability Engineering entails fine-tuning the software and its underlying infrastructure, often by developing, adapting, or designing special tools, as well as advocating for best practices.

The biggest misconception about SRE is that it is focused on reliability only from a narrow point of view, i.e., the availability of the software. But, as you can see from all the factors listed above, that is not all there is to it. Issues such as performance, code quality, and security all affect the quality of the user experience and the perceived reliability of the software.

The Origins of SRE

The term Site Reliability Engineering, as we know it today, was coined in 2003 by the visionary Ben Treynor Sloss, who is currently the Vice President of Engineering at Google.

He grew a small team of seven software engineers into an organization that, as of 2016, numbered more than 1,200 SREs.

As Google grew, it recognized that rising software complexity would make reliability increasingly difficult to ensure. This recognition led to the development of SRE, a proactive approach to the growing problem of software complexity.

At its core, SRE is a set of practices that prioritize reliability in software development. The goal is to ensure the reliability and scalability of software systems, even as they continue to grow and become more complex.

It is very important to understand: SRE is not just a reactive approach to solving problems. On the contrary, it is a proactive approach aimed at preventing problems before they occur.

What is "reliability"?

Reliability, simply put, is the absence of errors. More precisely, it is the ability of a system to function correctly and consistently under various conditions. In the context of SRE, reliability refers to the ability of an application to work as expected, without downtime or failures.
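One common way to turn "works as expected" into a number is to count the share of requests that succeed. The following is a minimal sketch, not a standard definition: the `Request` type and the rule that any HTTP 5xx response counts as a failure are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Request:
    status: int  # HTTP status code returned to the user


def is_successful(request: Request) -> bool:
    # Assumption: only server-side errors (5xx) count against reliability.
    return request.status < 500


def success_ratio(requests: Iterable[Request]) -> float:
    """Fraction of requests that were served without a server-side failure."""
    items = list(requests)
    if not items:
        raise ValueError("at least one request is required")
    return sum(1 for r in items if is_successful(r)) / len(items)


if __name__ == "__main__":
    # Hypothetical sample: three successful responses and one 503.
    sample = [Request(200), Request(200), Request(503), Request(404)]
    print(f"success ratio: {success_ratio(sample):.2f}")
```

What counts as a "failure" is a design choice. A real service might also treat slow responses or incorrect results as failures.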

As we know, any change poses a significant threat to reliability. Even a seemingly small change in the code can break production or cause a hard-to-detect failure. That is why many mission-critical systems, such as those in banking and government, still rely on legacy COBOL software, some of it dating back to the 1960s: in these applications, any change carries the risk of potentially catastrophic errors. Changes can include deploying new code, updating infrastructure configurations, and much more. Thorough, multifaceted testing helps reduce these risks.

Conclusion

SREs are responsible for ensuring the reliability, scalability, and efficiency of software. They work closely with developers, operations teams, and other stakeholders to achieve these goals.

Thus, SREs play a crucial role in ensuring the timely delivery of software products and their compliance with the high standards expected by users and customers.