Russ Miles: Ignored Architects and Chaos Engineering

At the recent Event-Driven Microservices Conference in Amsterdam, Russ Miles claimed that the biggest challenge for an architect is being ignored. You have great ideas like event-driven microservices, but too often the reaction is that they sound good but are overly complicated for the needs at hand. Miles commonly gets this reaction when he suggests that companies consider asynchronous event-driven systems as a way of introducing scaling, redundancy and fault tolerance. The words often make sense to the company, but just as often they get ignored.
The main goal for Miles in his work is having reliable systems. For him, reliability is a measure of what the customer wants: a system that is feature-rich and always running. This means there are two opposing forces that don’t coexist easily, especially in complicated systems: continuous innovation and change versus a system that is always working.
According to Miles, the hardest thing for an architect is to get everyone to understand that you are building resilient systems. He emphasizes that he is not just talking about technology; he is referring to the whole system, which includes the people, the practices and the processes that surround it. Considering all this, he regards it as a minor miracle that systems in production ever work.
Miles refers to John Allspaw for defining resilience. If you build systems with a lot of redundancy, replication, distribution and so on, you may be building robust systems. For Allspaw resilience is when you also involve people. In the same way, chaos engineering is beyond the tools – it’s about how people think and approach a system.
For Miles, chaos engineering is a technique for finding failures before they happen, but also a mindset:
The single most important thing about chaos engineering for Miles is that you must be part of the team working on the system. You cannot be someone who hurts a system and then waits for others to fix the problem. You must experience the effects of what you have done and work with everyone else to fix them. Miles has seen companies that have a group of people who hurt systems for a living, but in his experience this doesn’t work.
Miles points out that, in his mind, chaos engineering is simple. There are only two key practices to learn, and he emphasizes that there is no need for any certification program:
If you are ready to start working with chaos engineering at your company, Miles’ first piece of advice is not to use the term at all. Don’t talk about breaking things; instead, talk about incidents that have happened, what you can learn from them and how you can improve. He notes that you are in a learning loop, working toward a system that gradually becomes more resilient.
In a summary of his points, Miles noted some rules from the “Chaos Club” that you must follow:
When working with an event-driven, microservices-based system, one of the hardest things is getting developers to understand how to be good citizens in production. This includes exposing the right endpoints to declare a service’s health and the right touchpoints to say whether it is OK or not. Good logging is another important aspect, and one way to improve it is to have developers read their own logs, for example during a game day when they must work out what the system did using only those logs.
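As an illustration of what declaring your health can look like, here is a minimal sketch in Python using only the standard library. It is not taken from the talk: the service name, the `/health` path, the check names and the `health_report` helper are all assumptions made for the example.

```python
import json
import logging
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, Dict, Tuple

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(name)s %(message)s")
log = logging.getLogger("orders-service")  # hypothetical service name

# Hypothetical dependency checks; a real service would probe its
# message broker, database and so on instead of returning True.
CHECKS: Dict[str, Callable[[], bool]] = {
    "event_bus": lambda: True,
    "database": lambda: True,
}


def health_report(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, dict]:
    """Run every check and return (HTTP status, JSON-serializable body)."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A check that raises counts as unhealthy, and the reason is logged
            # so that whoever reads the logs later can see why.
            log.exception("health check %r raised", name)
            results[name] = False
    healthy = all(results.values())
    body = {"status": "ok" if healthy else "degraded", "checks": results}
    return (200 if healthy else 503), body


class HealthHandler(BaseHTTPRequestHandler):
    """Serves GET /health; any other path returns 404."""

    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        status, body = health_report(CHECKS)
        payload = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port: int = 8080) -> None:
    """Block forever serving the health endpoint on localhost."""
    HTTPServer(("127.0.0.1", port), HealthHandler).serve_forever()
```

Calling `serve()` and requesting `/health` returns 200 while every check passes and 503 as soon as one fails, which gives a game day something concrete to watch while a dependency is taken away.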
When doing chaos engineering, one advantage of event-sourced systems is the observability they bring. For Miles, observability means the ability to debug the system in production without changing it. If you are running some form of chaos experiment, the first thing you want to do is debug the system to figure out what went wrong. With an event-sourced system you have a system of record, so you know exactly what happened and when.
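The idea of an event log as a system of record can be sketched in a few lines of Python. This is an illustrative example only, not code from the talk: the `Event` type, the account domain and the `replay` and `between` helpers are hypothetical names chosen for the sketch.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Event:
    """One immutable entry in the append-only log (hypothetical shape)."""
    sequence: int      # position in the log
    timestamp: float   # when it happened, in seconds
    kind: str          # "Deposited" or "Withdrawn" in this toy domain
    account: str
    amount: int


def replay(events: Iterable[Event], until: float = float("inf")) -> Dict[str, int]:
    """Rebuild account balances from the log, optionally as of a point in time.

    Only reads the log, so it can be run against production data
    without changing the system.
    """
    balances: Dict[str, int] = {}
    for event in sorted(events, key=lambda e: e.sequence):
        if event.timestamp > until:
            continue
        sign = 1 if event.kind == "Deposited" else -1
        balances[event.account] = balances.get(event.account, 0) + sign * event.amount
    return balances


def between(events: Iterable[Event], start: float, end: float) -> List[Event]:
    """Return, in log order, the events recorded in the window [start, end]."""
    window = [e for e in events if start <= e.timestamp <= end]
    return sorted(window, key=lambda e: e.sequence)


log = [
    Event(1, 1.0, "Deposited", "a", 100),
    Event(2, 2.0, "Withdrawn", "a", 30),
    Event(3, 3.0, "Deposited", "b", 5),
]
print(replay(log))               # state now
print(replay(log, until=1.5))    # state just before the withdrawal
print(between(log, 1.5, 2.5))    # what happened during the experiment window
```

After a chaos experiment, `between` answers what happened during the experiment window and `replay` answers what the state was at any moment, in both cases by reading the log rather than by adding instrumentation after the fact.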
Miles concluded by stating that for the first time in his career, there is a best practice. For the complex and maybe chaotic systems we build today, chaos engineering is a technique for which he wants to say “just do it”. Do a small amount of it, manually, as a game day or whatever works for you. If you care about the reliability or resilience of your systems, he believes it’s a tool for you.
