Site Reliability Engineer (SRE) Roles and Responsibilities

January 22, 2022

This post defines the roles and responsibilities of site reliability engineers and shows how SRE can improve the resilience of people, processes, and technology.

Software development is getting faster and more complex, and IT operations teams are feeling the strain more than ever. DevOps gained popularity as a way to combat siloed workflows, decreased collaboration, and a lack of visibility. While establishing a DevOps culture has helped teams collaborate better and deliver reliable software faster, DevOps teams don’t necessarily have someone dedicated to developing systems that increase site reliability and performance. That’s where a site reliability engineer (SRE) comes into the picture.

The concept of SRE was originally brought to life at Google by engineer Ben Treynor. Google later published its popular SRE book, which helped the movement gain traction in the industry. Site reliability engineers sit at the crossroads of traditional IT and software development: SRE teams are made up of software engineers who build and implement software to improve the reliability of their systems. Let’s first define the basic roles and responsibilities of a site reliability engineer and show how SRE can improve the resilience of your people, processes, and technology.

In the words of Ben Treynor, SRE is “what happens when you ask a software engineer to design an operations function.” In a traditional setup of siloed IT operations and software development teams, developers would throw their code over the wall to IT professionals. IT would then be in charge of deployment, maintenance, and any on-call responsibilities associated with the system in production.

DevOps came along and pushed developers to share accountability for systems in production, own their code, and take on-call responsibilities, making the reliability of applications and infrastructure a shared responsibility. While this is a great first step, it doesn’t proactively help teams add resilience to their systems. Many DevOps teams, even with shortened feedback loops and improved collaboration, can still find themselves deploying new, unreliable services into production at a rapid pace. Site reliability engineering is a way to bridge the gap between developers and IT operations, even in a DevOps culture.