
Building a Better Notification System [Video]


Start thinking about the improvements you can make to your notification infrastructure, how to send better messages, and how to maintain these systems moving forward.
Today, we're going to talk about building a modern product notification system. This talk is really intended for engineers, product managers, and anyone working in a product organization that needs to send notifications and messages to their users, and who may be thinking about where and how to build that out over the next 6, 12, or 18 months as the product and organization evolve. By the end of this, we're hoping to give you enough to start thinking about what improvements you might want to make to your notification infrastructure, what notifications you want to send that you're not sending today, and how you should think about building and maintaining those systems moving forward. We are also putting together a white paper that'll be available soon, so take a look at that if you want to dig into even more detail than we cover in this discussion.

When we think about a modern product notification system, we need to think about the requirements. We're going to walk you through a few of the ways we think about it when we work with our customers, and what we hear from customers that had already built their own product notification system before they met Courier. The way I normally like to break it down is between requirements for the development team, the team that's building and maintaining these notifications and messages, and requirements for the product management, design, marketing, and support teams: the people who aren't directly responsible for building this infrastructure, but who rely on it to power a lot of the activities and objectives they're trying to achieve with the product.

When we think about the objectives and requirements for the development team, there are three we'll dig into later in this discussion: scale and reliability; abstracting your channels and providers, including how you route between the different kinds of channels while taking into account the preferences of the recipient; and how you take all of the messages flowing through this infrastructure and put a layer of observability and analytics on top of it, so you know what is and is not working the way you would want it to.

Beyond those, and we won't go into as much detail on these here, you should be thinking about the developer experience for other developers within the organization. While some companies we work with have a dedicated, centralized notification infrastructure managed by a dedicated, centralized comms team, many, many more companies we talk to don't. The infrastructure itself may or may not be centralized, but the team is very typically distributed, and different teams will need to interact with the infrastructure you're building to solve different use cases. They're essentially an internal customer of yours, so what should that experience look like? You also need to be thinking about the analytics needed by DevOps and other parts of the organization, not just at the business level but also at the operational level. How do you know when messages are going out as expected? When are they not? When are they delayed? Which of these investments are paying off and worth doubling down on, and which aren't and should be revisited and reconsidered?

Last, and this is really tied to that developer experience, is how you set up good testing environments. This is especially challenging for messaging infrastructure. You want to be able to run integration tests and scale tests, but how do you do that without accidentally sending messages to real people or significantly driving up costs with your downstream service providers? Think through how a developer can test against their local changes, how they can test within a staging environment that pulls together a number of different pull requests to see that they won't impact production negatively, and then how you handle things like smoke tests and the actual production sends from your live environment.
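One common way to keep test traffic away from real recipients is to gate every send on the environment and capture non-production messages for assertions instead of delivering them. Here is a minimal sketch in TypeScript; the names (SafeSender, Provider, and so on) are hypothetical, not a specific library's API.

```typescript
// Hypothetical sketch: gate real provider calls behind the environment,
// so integration and scale tests never reach real recipients.
type Environment = "local" | "staging" | "production";

interface SendRequest {
  to: string;          // recipient address (email, phone number, device token, ...)
  channel: "email" | "sms" | "push";
  body: string;
}

interface Provider {
  send(req: SendRequest): Promise<void>;
}

class SafeSender {
  constructor(
    private env: Environment,
    private realProvider: Provider,
    private capturedSends: SendRequest[] = []   // inspected by tests instead of real delivery
  ) {}

  async send(req: SendRequest): Promise<void> {
    if (this.env !== "production") {
      // Record the message instead of delivering it; tests can assert on the capture.
      this.capturedSends.push(req);
      return;
    }
    await this.realProvider.send(req);
  }
}
```

A production smoke test could bypass this guard deliberately, for example against a small allow-list of internal addresses, rather than relying on the blanket environment check.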
Whether you're sending 100 messages a day or 100 million messages a day, you need to think about scale and reliability. Obviously, scale becomes much harder at larger volumes, but what we've found is that even for companies with really small notification volumes, it's still harder to scale than you might think. The reason is that notification traffic tends to come in bursts. Your volume doesn't get spread out like peanut butter: if you're sending 30,000 messages a month, that doesn't mean you're sending 1,000 messages a day, and you can't just divide that by 24 hours and again by 60 minutes. Instead, what you see is huge spikes from time to time and then long valleys. When you're building your infrastructure, you need to account for how tall your tallest spike might be. And that's just the spike on your side; you also need to think about the downstream impact, because whatever channel you're using, whether it be email or mobile push or Slack or SMS, your service provider will impose constraints as well: how many messages can you send over how long a period of time? If your spike exceeds what that provider will accept, you need to back those messages up and robustly trickle them through the downstream service provider at the rate it allows (one way to do that is sketched below).

On the reliability side of things, messaging is not perfect, and it's pretty common to see issues and failures. With email, there are bounces, incorrect email addresses, and service outages at the ESPs, along with long delays in things like getting delivery confirmation back, not only at the send layer but at the receive layer. With SMS, you see issues that vary by region: you might see a temporary outage in one region of the world while the rest of the regions are working fine. With push, it's very common not to even know whether your message was successfully delivered; the Apple Push Notification service may have accepted the message, but that doesn't mean it ever showed up on the user's device. Each of these channels has its own unique constraints around how you know how well things are working and under what scenarios they might fail, and you need to think about what happens when they do fail.
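Going back to the burst-handling point: one way to think about it is a queue that absorbs the spike and a dispatcher that drains it no faster than the provider allows. The sketch below is illustrative only; the rate limit and class names are assumptions, not any particular provider's numbers or API.

```typescript
// Illustrative sketch: absorb a spike into a queue, then drain it at a
// steady rate so the downstream provider's limits are never exceeded.
interface QueuedMessage {
  to: string;
  body: string;
}

class ThrottledDispatcher {
  private queue: QueuedMessage[] = [];

  constructor(
    private maxPerSecond: number,                       // assumed provider-specific limit
    private deliver: (msg: QueuedMessage) => Promise<void>
  ) {}

  enqueue(msg: QueuedMessage): void {
    this.queue.push(msg);                               // spikes land here, not on the provider
  }

  // Call this on a timer (e.g., once per second) from a worker process.
  async drainOnce(): Promise<void> {
    const batch = this.queue.splice(0, this.maxPerSecond);
    for (const msg of batch) {
      try {
        await this.deliver(msg);
      } catch {
        this.queue.push(msg);                           // crude requeue; see the retry discussion below
      }
    }
  }
}
```

In a real system the queue would live in durable storage (a database table, SQS, Kafka, or similar) rather than in process memory, so a restart doesn't drop a spike on the floor.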
When failures do happen, one obvious thing to do is make sure you have robust retry infrastructure in place, so that if a message fails to go out for any reason, you can requeue it and retry it. If you do this, be aware that it will affect your scalability requirements: if a bunch of messages start to fail, say because of a general service outage at the downstream provider, or because your API key is wrong since somebody rotated it and forgot to update your environment variables, you're going to see a ton of that volume get requeued, reprocessed, fail again, and get requeued and reprocessed again. This is where things like exponential backoff come into play. You should also think about determining whether a failure is retryable or not. If it's an invalid API key, honestly, you probably shouldn't retry; that's unlikely to be resolved by a subsequent request. I've seen a few service providers where we get intermittent API key failures, and on our side we've had to add even more intelligence, but I would say that's an edge case. More than likely, if the API key is bad, you need to go fix that environment variable, and retrying isn't going to help. But there are other failures that may well be intermittent, perhaps downstream at the carrier level, where you are going to want to retry.
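Here's a rough sketch of what that classification plus exponential backoff can look like. The error codes and helper names are hypothetical; real providers each report failures in their own formats.

```typescript
// Hypothetical sketch: classify failures, retry only the transient ones,
// and back off exponentially between attempts.
interface SendError {
  code: "invalid_api_key" | "invalid_recipient" | "rate_limited" | "provider_unavailable";
}

function isRetryable(err: SendError): boolean {
  // Bad credentials or a bad address won't fix themselves on retry;
  // surface those to a human instead of requeueing.
  return err.code === "rate_limited" || err.code === "provider_unavailable";
}

async function sendWithRetry(
  attempt: () => Promise<void>,
  maxAttempts = 5,
  baseDelayMs = 500
): Promise<void> {
  for (let i = 0; i < maxAttempts; i++) {
    try {
      await attempt();
      return;
    } catch (err) {
      if (!isRetryable(err as SendError) || i === maxAttempts - 1) throw err;
      // Exponential backoff: 500ms, 1s, 2s, 4s, ... (real systems usually add jitter)
      const delay = baseDelayMs * 2 ** i;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```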
