This article is about potential causes of failure in software architecture and how one can address, prevent, and manage those failures.
Let’s consider two things:
1.) Bad things happen to good people
2.) Architects are people
Ergo, bad things happen to good architects.

In other words, at some point, no matter how much effort you and your team put into designing resilient, high-performing, well-architected systems, something is going to blow up spectacularly and make you look silly. Call it Murphy’s Law.

When we design systems, we usually try to do it well. We write good code, we write test cases, we follow frameworks and best practices. All of these things are under our control (and even these aren’t always bulletproof). The problem comes in when things are not under our control, or when we don’t even consider the possibility that something can go wrong (the unknown unknowns).

A couple of examples of why things fail:
Ultimately, if we look beyond the code that we interact with, there’s a ton of complexity under the surface: from the hardware that something runs on (yes, that’s there, even if you are in the cloud) to operating systems, containers, virtual machines, runtime environments, networks, and so on.

Consider a trivial “Hello World!” program. A multitude of things have to come together for it to print some characters to a console. Now, think beyond “Hello World!” to distributed enterprise systems with multiple components, produced by multiple parties, running on multiple different tech stacks. As such, we should be wary of overconfidence in our own ability: systems are far from trivial, even if they are “simple” systems.

To make it easier to deal with failure, we should first accept that failure at some point is pretty much inevitable. Once you’ve made peace with this and it becomes a nagging concern in the back of your mind when you design something, you can start looking past some of your blind spots.

Systems are complex already, so if you design something, consider whether it can be simplified (while still being fit-for-purpose). Unnecessary complexity increases both the likelihood of failure (due to more moving parts) and the difficulty involved in trying to fix a failure. This is a good point to plug in a reminder of Kernighan’s Law: debugging is twice as hard as writing the code in the first place, so if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.

One of the dangers here comes in the form of resume-driven development. Sure, the shiny tech/framework/approach will look great on your CV, but is it actually necessary? Microservices have lots of benefits, but if your employee-leave tracking system will only ever have 50 users, does “LeaveService + EmployeeService + HolidayService + orchestration + all the overhead that goes with it” really give you any meaningful benefit over “LeaveTrackingSystem”?
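To make the earlier “Hello World!” point concrete, here is a minimal sketch in Python (the language is an illustrative choice, and the `greeting` and `main` names are hypothetical). The comments point out a few of the layers that have to cooperate before anything shows up on a console.

```python
import sys


def greeting() -> str:
    """Return the text to display (kept separate so it can be checked)."""
    return "Hello World!"


def main() -> None:
    # print() hands the string to sys.stdout, a buffered text stream that
    # encodes it to bytes before the interpreter asks the operating system
    # to write them to the terminal (or whatever stdout is attached to).
    print(greeting())
    # Flushing makes the hand-off to the OS explicit; even this step can
    # fail, for example when stdout is a pipe whose reader has gone away.
    sys.stdout.flush()


if __name__ == "__main__":
    main()
```

Even in this tiny program, the interpreter, the standard library, the operating system, and the terminal all sit between the one line you wrote and the characters you see, and none of them is under your control.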