Microservices and Site Reliability Engineering

A recent article discusses how the complexity introduced by microservices initially seems at odds with Site Reliability Engineering (SRE), and how companies such as Google are tackling the problem so that development groups can continue to embrace microservices while they and their SRE teams have the tools and understanding needed to make the two work well together.
Over recent years we’ve discussed the role of Site Reliability Engineering (SRE), and particularly how it has grown from being the domain of companies such as Google to being an expectation within companies in other sectors, such as finance and healthcare. Recently, technology journalist Alex Handy wrote about how SREs and microservices architectures fit together: […] while SREs and microservices evolved in parallel inside the world’s software companies, the latter actually makes life far more difficult for the former.
For Alex the reason for this is fairly clear: […] SREs live and die by their full stack view of the entire system they are maintaining and optimizing. The role combines the skills of a developer with those of an admin, producing an employee capable of debugging applications in production environments when things go completely sideways.
Alex goes on to cover some of the background of SRE and how the function works at scale, using Google as an example. He quotes Todd Underwood, one of Google’s SRE directors, on the practices and systems Google has put in place to help development groups consider reliability and availability, as well as on technology approaches such as using Paxos for consensus in their distributed systems. Underwood highlights another aspect of the SRE job that is essential, here, however: visibility. When microservices are throwing billions of packets across constantly changing ecosystems of cloud-based servers, containers, and databases, finding out what went wrong where is essential to troubleshooting any type of problem. This is where the full stack aspects of an SRE’s job come into play.
According to Morgan McLean, one of the Product Managers at Google, the key here is monitoring and traceability of microservices, something others have stated in the past and we’ve covered elsewhere. In his article, Alex mentions a few new tools Google has released to help tackle the problem: […] Google recently released Stackdriver Trace, Stackdriver Debugger, and Stackdriver Profiler. There’s a reason these tools sound like old-school testing and operations tools from traditional enterprise vendors: they perform the more traditional troubleshooting tasks developers and operations people are used to, but with a focus on microservices and performing these duties in the cloud.
Alex quotes Morgan McLean summarising what these tools do to help the SRE group better manage new microservice-based architectures. McLean states that although tracing is important, Google believes the profiling and debugging aspects of its tools are unique at this stage and bring key benefits to developers and SREs. Alex finishes his article by covering monitoring, metrics and observability in more depth, with further references from Google and others in the industry that are worth considering because they are likely to be relevant to a growing number of companies.
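To make the idea of traceability across microservices more concrete, here is a minimal sketch in Python of how a trace identifier can be carried from one service to the next so that the work done for a single request can be stitched back together. It is only an illustration and is not the Stackdriver API: the service functions, the `x-trace-id` and `x-parent-span-id` header names, and the in-memory `SPANS` list are all assumptions made up for this example.

```python
import time
import uuid
from contextlib import contextmanager

# Hypothetical in-memory span store. A real system would export spans
# to a tracing backend instead of appending them to a list.
SPANS = []


@contextmanager
def span(name, trace_id=None, parent_id=None):
    """Record one timed unit of work belonging to a trace."""
    record = {
        "trace_id": trace_id or uuid.uuid4().hex,  # new trace if none given
        "span_id": uuid.uuid4().hex[:16],
        "parent_id": parent_id,
        "name": name,
    }
    start = time.perf_counter()
    try:
        yield record
    finally:
        # The span is stored when its work finishes, so inner spans
        # are appended before the spans that enclose them.
        record["duration_ms"] = (time.perf_counter() - start) * 1000.0
        SPANS.append(record)


def inventory_service(headers):
    """Hypothetical downstream service: reads trace context from headers."""
    with span("inventory.lookup",
              trace_id=headers["x-trace-id"],
              parent_id=headers["x-parent-span-id"]):
        time.sleep(0.01)  # stand-in for real work
        return {"in_stock": True}


def checkout_service():
    """Hypothetical entry-point service: starts the trace and passes it on."""
    with span("checkout.handle") as root:
        # The headers stand in for what would travel with an HTTP or RPC call.
        headers = {
            "x-trace-id": root["trace_id"],
            "x-parent-span-id": root["span_id"],
        }
        return inventory_service(headers)


if __name__ == "__main__":
    checkout_service()
    for s in SPANS:
        print(s["trace_id"][:8], s["name"], s["parent_id"])
```

Because both spans share one trace ID and the child records its parent's span ID, someone investigating a slow or failed request can follow it across service boundaries, which is the kind of visibility the article describes.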
As we see more and more developers and companies employing microservices and many of them also using, or beginning to use, SRE teams, it will be interesting to see how architectures and tooling evolve to ensure that reliability, availability, consistency etc. are maintained such that developers and SRE teams can work in harmony. If you have any experiences to share in that regard, positive or negative, it would be useful for the wider community to hear about them.
