When Disaster Strikes: Production Troubleshooting

May 31, 2022

Production is failing and everything is lost? That used to be the case. Fail whale, hysteria and panic. Developer observability fixes this!
Tom Granot and I have had the privilege of Vlad Mihalcea’s online company for a while now. As a result, we decided to do a workshop together, talking about a lot of the things we learned in the process. The workshop will be pretty informal and ad hoc: just a bunch of guys chatting and showing off what we can do with tooling. In celebration of that, I thought I’d write about some of the tricks we’ve discussed amongst ourselves in the past. It should give you a sense of what to expect when joining us for the workshop, but it’s also a useful tool in its own right.

Before we begin, I’d like to take a moment to talk about production and the role of developers within a production environment. As a hacker, I often do everything. That’s OK for a small company, but as companies grow we add processes. Production doesn’t go down in flames as much, thanks to staging, QA, CI/CD, and the DevOps people who rein in people like me…

So we have all of these things in place. We passed QA and staging, and everything’s perfect. Right? Well… not exactly.

Sure, modern DevOps made a huge difference to production quality, monitoring and performance. No doubt. But bugs are inevitable. The ones that slither through are the worst types of vermin. They’re hard to detect and often only happen at scale. Some problems, like performance issues, are only noticeable in production against a production database. Staging or dev environments can’t completely replicate modern complex deployments. Infrastructure as Code (IaC) helps a lot with that, but even with such solutions, production is at a different scale.

Everything that isn’t production is in place to facilitate production. That’s it. We can have the best and most extensive tests, with 100% coverage in our local environments. But when our system is running in production, its behavior is different. We can’t control it completely.

A knee-jerk reaction is “more testing”. I see that a lot.
If only we had a test for that… The supposed solution is to somehow think of every possible mistake we can make and build a test for it. That’s insane. If we know the mistake, we can just avoid it. The idea that a different team member will have that insight is, again, wrong. People make similar mistakes, and while we can eliminate some bugs this way, more tests create more problems… CI/CD becomes MUCH slower, which results in longer deploy times to production. That means that when we do have a production bug.