Build A Modern Product Notification System: For Engineers and Product
Hi, I’m Troy Goode, the founder of Courier. Today, we’re going to talk about building a modern product notification system. This talk’s really intended for engineers, product managers, anyone working for a product organization that needs to send notifications and messages to their users, and maybe is thinking about where and how they need to be building that over the next 6 months, 12 months, 18 months as their product and organization evolves. By the end of this, we’re hoping to really give you enough to get started thinking about what improvements you might want to make to your notification infrastructure or what notifications you want to send that you’re not sending today and how you should think about building and maintaining those systems moving forward. We are putting together a white paper that’ll be linked below. So take a look at that if you want to dig into even more detail than what we cover in this discussion.
So when we think about a modern product notification system, we need to think about what are the requirements, right? And what we’re going to walk you through is a few of the ways we think about it when we work with our customers and what we hear about from our customers that may have already built their own product notification system before they met Courier. The way I normally like to break it down is between requirements that are for the development team, the team that’s building and maintaining these notifications and these messages, and requirements that are really for more of the product management team, the design team, the marketing team, and the support team: those that maybe aren’t directly responsible for building this infrastructure, but who really rely on it to power a lot of the activities and objectives that they’re trying to achieve with the product.
When we think about the objectives and the requirements for the development team, there are three we’re going to dig into deeper later on in this discussion: scale and reliability; abstracting your channels and providers, thinking about how you route between the different kinds of channels and take into account the preferences of the recipient or user; and how you take all of the messages that are flowing through this infrastructure and put a layer of observability and analytics on top of it so you can know what is and is not working the way you would want it to.
In addition to that, though, and we won’t go into these in quite as much detail in this discussion, you should be thinking about what the developer experience is for other developers within the organization. While some companies that we work with have a dedicated, centralized notification infrastructure managed by a dedicated, centralized comms team, many, many, many more companies that we talk to don’t. This infrastructure sometimes can be centralized, sometimes not, but the team is very typically distributed, and different teams will need to interact with the infrastructure that you’re building to solve different use cases.
They’re essentially a customer, they’re an internal customer of yours, so what should that experience look like? You need to be thinking about the analytics needed by dev ops and other parts of the organization, not just at the business level but also at the operational level. How do you know when the messages are going out as expected? When are they not? When are they delayed? Which of these investments are paying off and maybe you should double down on and which maybe aren’t, and maybe you should revisit and reconsider.
Last is, and this is really tied to that developer experience, how do you set up good testing environments? This is actually especially challenging for messaging infrastructure. You want to be able to potentially run integration tests and you want to be able to run scale tests, but how do you do this without accidentally sending messages to real people or significantly driving up costs with your downstream service providers? Think through how you create an environment where a developer can test against their local changes. How can they test within a staging environment, pulling together a number of different possible pull requests and testing it all together to see that it’s not going to impact production negatively? Then there’s being able to do things like smoke tests, and finally the actual production sends from your live environment.
Managing Volume Spikes
Whether you’re sending 100 messages a day or 100 million messages a day, you do need to think about scale and reliability. Obviously scale becomes much harder as you go to larger and larger volumes, but what we’ve found is that even for companies with really small amounts of notification volume, it’s still harder to scale than you might think. The reason why is because it tends to come in bursts. Your notification volume doesn’t really get spread out like peanut butter. If you’re sending 30,000 messages a month, that doesn’t mean you’re sending 1,000 messages a day and you wouldn’t then divide that by 24 hours and by 60 minutes. Instead, what you see is huge spikes from time to time and then long valleys.
Provider Constraints & Errors
When you’re thinking about building your infrastructure, you need to make sure that you’re accounting for what your tallest spike may be. And that’s the spike on your side but you also need to be thinking about downstream impact because whatever channel you’re using, whether it be email or mobile push or Slack or SMS, there are going to be constraints that your service provider implements as well. How many messages can you send out over how long of a period of time? You also need to be thinking about, “Okay, well, if my spike exceeds the possible spike input for that provider, I need to make sure that I’m backing up those messages and robustly being able to trickle them through the downstream service provider at the rate that that service provider allows.”
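One common way to implement the "trickle" behavior described above is a token-bucket rate limiter sitting between your queue and the downstream provider. The sketch below is a minimal illustration; the rate and burst numbers are hypothetical placeholders, and a real system would persist the queue durably rather than in memory.

```python
import time
from collections import deque

class TokenBucket:
    """Throttle sends to a downstream provider's allowed rate.

    `rate` is tokens refilled per second; `capacity` is the allowed burst.
    Values here are illustrative; check your provider's documented limits.
    """

    def __init__(self, rate: float, capacity: int):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last_refill = time.monotonic()

    def try_acquire(self) -> bool:
        # Refill tokens based on time elapsed, then spend one if available.
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # bucket empty: caller should hold the message and retry later

# Messages that exceed the allowed burst stay queued and trickle out over time.
queue = deque([f"msg-{i}" for i in range(5)])
bucket = TokenBucket(rate=1.0, capacity=3)  # hypothetical: 1 msg/sec, bursts of 3
sent = []
while queue and bucket.try_acquire():
    sent.append(queue.popleft())
# The first 3 messages go out immediately; the rest wait for tokens to refill.
```

In production, the same pattern is usually implemented against a shared store (e.g. Redis) so that all workers draw from one bucket per provider.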
On the reliability side of things, messaging is not perfect. It’s pretty common to see issues and failures. When we’re looking at email, we have things like bounces, incorrect email addresses, and service outages for ESPs. There can be long delays in things like getting delivery confirmation from the various ESPs, not only at the send layer, but at the receive layer. On SMS, you see a number of issues that can vary by region, where you might see temporary outages in one region of the world while the rest of the regions are working fine.
On things like push, it’s very common to not even be able to know, “Did my message get successfully delivered?” You might see that the Apple push notification service accepted this message. That doesn’t mean that it ever showed up on a device for the user. Across all of these different channels, each has its own unique constraints around how you know how well things are working and under what scenarios they might fail.
You need to be thinking about what happens when they fail. One obvious thing to do is to make sure you have robust retry infrastructure in place so that as a message goes out, if it fails for any reason, you want to be able to requeue it and retry it. If you do this, be aware that it will impact the scalability requirements that you have, because if a bunch of them start to fail, say it’s a general service outage of the downstream provider, or imagine your API key is wrong because somebody rotated it and forgot to update it within your environment variables, well, now you’re going to see a ton of that volume get requeued, reprocessed, fail, and get requeued and reprocessed again. This is where things like exponential backoff come into play.
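Exponential backoff with jitter is the standard way to space out retries without hammering a struggling provider. A minimal sketch, assuming a "full jitter" strategy (the base delay, cap, and function name are illustrative, not from any specific library):

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 300.0) -> float:
    """Exponential backoff with full jitter.

    attempt 0 -> up to 1s, attempt 1 -> up to 2s, doubling each time,
    capped at `cap` seconds. The random jitter spreads retries out so a
    provider outage doesn't produce a synchronized thundering herd the
    moment the service recovers.
    """
    return random.uniform(0, min(cap, base * (2 ** attempt)))

# Sample the delay schedule for the first ten retry attempts.
delays = [backoff_delay(n) for n in range(10)]
```

A worker would sleep for `backoff_delay(attempt)` (or schedule a delayed requeue) before each retry rather than retrying immediately.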
I also think you should really think about determining whether a failure is retryable or not. If it’s an API key that’s invalid, honestly, you probably shouldn’t retry it. It’s unlikely to be resolved with a subsequent request. I’ve seen a few service providers where we get intermittent API key failures, and so on Courier’s side we’ve had to add even more intelligence, but I would say that’s an edge case. More than likely you can say if the API key is bad, you need to go fix that environment variable; retrying isn’t going to help.
But there are other failures that may very well be intermittent and that might be downstream at the carrier level where you are going to want to retry that. If you have these retries in place, if you have exponential back offs, you need to be thinking about the time limit on them. Retry for up to how long? Up to 72 hours? Up to 24 hours? That can also really vary by the type of content you’re trying to deliver.
You probably don’t want to deliver a password reset message 72 hours later, that would be weird. For some messages, maybe it’s fine to deliver a few days or even a week later. Each of the kinds of messages that you’re sending may have a different correct strategy for how long you’re willing to retry. Think about that and think about ways to instrument your platform to say, “Okay, for this kind of message, here’s going to be our retry policy. For this one, we’re never going to retry it.” For a time-sensitive message that has to go out now or not at all, you probably want to retry at least a few times in a short time span, but you might not be interested in retrying for more than a few minutes. For others, you might be fine with a really long retry policy with less frequent retries during that life cycle, so that it has less impact on the scalability requirements of your system.
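Putting the last few points together, the retry decision usually boils down to two inputs: is this error class retryable at all, and is this message type still inside its retry window? A sketch of such a policy table; the category names, error codes, and windows are hypothetical examples, not Courier's actual values:

```python
from dataclasses import dataclass

@dataclass
class RetryPolicy:
    max_attempts: int
    window_seconds: int  # stop retrying once the message is older than this

# Hypothetical per-message-type policies: a password reset is worthless after
# a few minutes, while a weekly digest can tolerate days of retries.
POLICIES = {
    "password_reset": RetryPolicy(max_attempts=3, window_seconds=5 * 60),
    "weekly_digest": RetryPolicy(max_attempts=10, window_seconds=72 * 3600),
}
DEFAULT_POLICY = RetryPolicy(max_attempts=5, window_seconds=24 * 3600)

# Error classes where a retry cannot help (illustrative names).
NON_RETRYABLE = {"invalid_api_key", "invalid_recipient"}

def should_retry(message_type: str, error_code: str,
                 attempt: int, elapsed_seconds: float) -> bool:
    if error_code in NON_RETRYABLE:
        return False  # e.g. a bad API key won't fix itself on retry
    policy = POLICIES.get(message_type, DEFAULT_POLICY)
    return attempt < policy.max_attempts and elapsed_seconds < policy.window_seconds
```

The worker consults `should_retry` before requeueing; anything that falls outside its policy gets dead-lettered instead of retried forever.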
The other thing you can consider, especially if you have multi-channel infrastructure, is when do you failover from one channel to another? Let’s say that you were going to send me a mobile push message and for whatever reason, you get a failure. You know in this case that didn’t go through. Well, you could retry it and maybe in some cases that would be something you’d want to do, but maybe in other cases, what you should do is retry that message on a different channel. This time, send it to Troy via email or SMS instead of push. That’s channel failover.
You should also consider provider failover. Let’s say you’re sending an email, maybe SendGrid is your email service provider, and for whatever reason, the SendGrid account isn’t working right now. You attempt to send the message, it fails, and you can see that this is not really worth retrying with SendGrid. Again, a bad API key or a service outage on SendGrid’s side. Well now, instead of failing over to a different channel, do you instead failover and call Mailgun or Postmark? A lot of times, this is something that you really want to do at scale, because you have a lot of messages going out to a lot of users, and as in that password reset example, you don’t want to end up sending the message 48 hours later. You want to make sure that it goes out now, and while you probably have a service provider that’s your preferred, it’s good to have a backup.
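The provider-failover logic itself can be quite simple: an ordered list of providers, walked until one succeeds. A minimal sketch; the provider names and send functions here are simulated stand-ins, not real SDK calls:

```python
def send_with_failover(message, providers):
    """Try each configured provider in order; return the name that succeeded.

    `providers` is an ordered list of (name, send_fn) pairs: the first is
    your preferred provider, the rest are backups. Each send_fn returns
    True on an accepted send and False (or raises) on failure.
    """
    for name, send_fn in providers:
        try:
            if send_fn(message):
                return name
        except Exception:
            continue  # treat a hard error like a failed send and move on
    return None  # all providers failed: escalate to channel failover or requeue

# Simulated providers: the primary is "down", the backup works.
providers = [
    ("sendgrid", lambda m: False),  # pretend SendGrid is rejecting sends
    ("mailgun", lambda m: True),    # Mailgun accepts the message
]
winner = send_with_failover({"to": "troy@example.com"}, providers)
```

In practice you'd also feed the failure into your non-retryable classification first, so that transient SendGrid errors get retried on SendGrid and only hard failures trigger the failover chain.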
As you think about things like retry logic, you also want to make sure that you’re not resending the same notification to the same user. This is unfortunately pretty easy to end up doing. One of the things that you need to bake into your system is idempotency: processing the same send request more than once has no additional effect, which in practice means tracking which messages have already been sent to which users.
Imagine that you were sending a batch of messages and notifications to 1000 different users and you’re looping through this batch and halfway through processing it something fails. Well, you want to requeue and retry that batch. Now you make it to user number 800 and it fails again and you retry the whole thing again. But what ends up happening here is you see the same user receiving the same notification many, many times. HBO Max recently had an issue with this.
Well, what you want to do and what you want to bake into your infrastructure to prevent this is idempotency. You want to be able to assign each of these notifications an idempotency key, similar to how Stripe does it, so that you can make sure each of those notifications is processed for each user only once. Your infrastructure needs to be checking for this even if it’s processing everything in one big batch. That way you can safely retry not just individual messages, but also huge batches of messages. As you build out all of this infrastructure for retries, for failovers, for idempotency, the key things you’re going to want to measure are latency, and in this regard, we’re really talking about from the time that you said, “Send this message to Troy,” how long does it take for the infrastructure to attempt that send? Time to first attempt. Many things can happen after that, where maybe the provider failed, but that first bit, that’s your latency and you want to keep that as small as possible. Well under a second. The gold standard here is about 200 milliseconds.
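The idempotency check described above can be sketched in a few lines. This toy version uses an in-memory set; a real deployment would store the keys in a shared, durable store (Redis, DynamoDB, a database table) with an appropriate TTL. The key format and function names are illustrative:

```python
processed = set()  # in production: a shared store like Redis, not process memory

def send_once(notification_id: str, user_id: str, send_fn) -> bool:
    """Send a notification at most once per (notification, user) pair.

    Follows the Stripe-style pattern of keying each logical send with an
    idempotency key and skipping sends whose key was already recorded.
    """
    key = f"{notification_id}:{user_id}"
    if key in processed:
        return False  # already delivered on a previous (possibly failed) batch run
    send_fn(user_id)
    processed.add(key)
    return True

# Simulate the batch scenario from above: the whole batch is retried from
# the top after a mid-batch failure, but no user gets a duplicate.
delivered = []
users = [f"user-{i}" for i in range(5)]
for attempt in range(2):  # second pass represents the full-batch retry
    for u in users:
        send_once("welcome-email", u, delivered.append)
# Each user received the notification exactly once despite the retry.
```

Note the ordering matters: recording the key before the send risks dropping messages on a crash, while recording after (as here) risks a rare duplicate; many systems accept the latter and pair it with provider-side dedup where available.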
You also want to be checking things like, “Well, what is the deliverability rate I’m getting for each of these providers and for each of the channels?” Now, channels tend to be more of a performance optimization input. You can know which channels are most effective, but provider deliverability is more of a systems effectiveness measurement. You will see significant differences in deliverability between the different vendors and providers that you’re using. It’s really critical to measure each of them so you can understand which one’s working well for you and which one is not. Provider deliverability is what ultimately sets the ceiling of what is possible for that channel, and you don’t want to be artificially constrained by limitations imposed by issues with the provider that you’re using.
As you pull all of this data together around things like latency and deliverability, start to come up with what you think for your own internal needs to be your SLO. What is your service level objective? What is the goal that you want to make sure that your team is consistently hitting to make sure that it’s not negatively impacting the rest of the business?
Service Level Objectives
Once you’ve identified your SLO, then you want to be thinking about, “Okay, well, what are the SLIs that help you measure that? What are the service level indicators,” maybe you’re using Datadog or some sort of other observability platforms, “That I can look at to see what is the latency? What is the request per second that we’re able to process here? What is the deliverability rate we’re getting for each channel?” Look at those SLIs, figure out a way to pull that together, and report it out to the rest of your engineering team. Maybe it’s just your team, maybe it’s the entire department. You can figure that out for what works best for your business, but you have to make sure that you’re constantly measuring this, because it will just change over time, even without you interacting with it. Measuring it at just one point in time is not sufficient. Make sure that you have processes in place to continuously measure this and compare it against your own internal objectives.
The Right Message, at The Right Time, to The Right User, through The Right Channel
A few moments ago, we were talking about failover. When do you decide to failover between one channel and another, or failover between one provider within a channel to another provider? That’s one form of routing. Routing between channels, routing between providers. But routing is actually a broader concept that you should be thinking about as you design your modern product notification system. Failover and reliability is one goal here, but also things like, well, just making sure that the message is delivered to the user using the right channel at the right time. Scheduling falls under the routing umbrella. Do you deliver that message right now or do you wait until tomorrow morning, business hours, or maybe after work? Figure out when you want to deliver this message and figure out based upon when it’s being delivered, well, which channel’s going to be most effective?
If you’re a B2B product, delivering a message in the middle of the day, perhaps Slack would be a great channel. If you’re a B2C product, delivering a message at 7:00 PM, well maybe mobile push is likely the best channel for that message. That’s kind of a naive take though because you can also take a lot of data about the user to help influence this.
For example, let’s say you have in-app notifications, a little toast that pops in from the corner of the screen or a desktop notification that appears in the corner of the operating system, or even a PWA notification on their mobile device. Well, maybe you only want to send those if the user is currently online? Same thing for Slack. Do you want to send a message to the user in Slack if they’re not currently logged into Slack? Would you maybe instead reroute that message to email?
Take into account what you know about the user. If you don’t have their phone number, you’re not going to be sending an SMS, so make sure that you are able to reroute and redirect that notification to a channel that will be valid for them. If you have other metadata about the user, such as presence, whether they are online and on which platform, use that to help you understand which channel is going to be best. Use other context around what kind of user they are, what kind of use case it is, and what time of day it is in their time zone to help determine when and via which channel you should be delivering your messages.
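The routing rules from the last few paragraphs can be expressed as an ordered set of checks over user metadata. This is a deliberately naive sketch to show the shape of the decision, with every field name and rule being an illustrative assumption:

```python
def pick_channel(user: dict, is_business_hours: bool) -> str:
    """Choose a delivery channel from user metadata (illustrative rules).

    Checks go from most to least context-dependent: live presence first,
    then declared contact info, falling back to email as the most
    universally available channel.
    """
    if user.get("online_in_app"):
        return "in_app"     # a toast is only useful if they're online right now
    if user.get("slack_active") and is_business_hours:
        return "slack"      # B2B, middle of the day: Slack is effective
    if user.get("push_token") and not is_business_hours:
        return "push"       # B2C, evening: mobile push
    if user.get("phone"):
        return "sms"        # we can only send SMS if we have a number
    return "email"          # always-valid fallback

# User with only a phone number on file, contacted during business hours.
channel = pick_channel({"phone": "+15555550100"}, is_business_hours=True)
```

A production router would layer user preferences and per-notification channel allow-lists on top of rules like these, rather than hard-coding them.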
One last thing to think about when you’re thinking about building routing across channels into your notification infrastructure is well, does the user actually have a preference? As much intelligence as you can add to dynamically switch between different channels, a lot of times users might have their own opinion and being able to extend that opinion out via your app and ask them for their opinion, via for example a preferences page, is really important. Let them turn off just the notifications they don’t want to receive instead of everything. Let them say, “Hey, for this notification, I’d rather receive it via SMS instead of email.” Maybe at some point, you even consider things like letting them specify a different recipient. Maybe say, “Hey, you know what? I’m going to be on vacation. Please send this to my colleague instead of me,” or pause notifications for some time period, instead of turning it off entirely, “I’m going on PTO, don’t send any more of these notifications for the next week.”
As you think about this, there are going to be differences between what preferences you’re willing to let users set on some channels or some kinds of notifications versus others. For example, password reset emails are not something you ever want somebody to unsubscribe from, whereas the weekly newsletter, you probably want them to be able to unsubscribe from that without impacting some of your other more growth-oriented messages. Divide up the different kinds of notifications that you send. Think about how you segment them and think about the different channel options that are applicable to the user for each of those notifications.
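One way to model this segmentation is a per-category preference lookup with a set of "required" categories that users cannot mute. A minimal sketch; the category names and the single-preferred-channel model are simplifying assumptions:

```python
# Categories users may never opt out of (transactional/security messages).
REQUIRED_CATEGORIES = {"password_reset"}

def effective_channels(category: str, default_channels: list, prefs: dict) -> list:
    """Resolve which channels a notification should actually use.

    `prefs` maps a notification category to either "muted" or the user's
    single preferred channel; anything unset falls back to the defaults.
    """
    if category in REQUIRED_CATEGORIES:
        return default_channels  # security messages always go out as designed
    user_pref = prefs.get(category)
    if user_pref is None:
        return default_channels  # no opinion expressed: use product defaults
    if user_pref == "muted":
        return []                # user turned this category off entirely
    return [user_pref]           # user picked a specific channel, e.g. SMS

# A user who muted the newsletter but wants comment replies via SMS.
prefs = {"weekly_newsletter": "muted", "comment_reply": "sms"}
```

Richer models (per-channel toggles per category, snooze-until dates, delegate recipients) extend the same lookup without changing the call site.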
Lastly, think about potentially what we call digests, which is pulling many, many notifications together, and instead of sending many, many notifications out to the user, do you aggregate them and send them out to the user in a batch? If you think about LinkedIn, for example. Instead of receiving an email every time somebody asks to connect with you, sometimes you’ll receive a notification from LinkedIn saying five, six, 20 people have asked to connect with you. Do you start to batch and group things together and send out digests, which will increase the value of that notification and decrease the annoyance factor?
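The LinkedIn-style digest described above amounts to grouping pending events per user and collapsing groups over a threshold into one summary message. A sketch, with the threshold and summary wording as illustrative choices:

```python
from collections import defaultdict

def build_digests(events, threshold=3):
    """Collapse pending notification events into per-user digests.

    `events` is a list of (user_id, text) pairs. If a user has `threshold`
    or more pending events, emit one summary line instead of one message
    per event; below the threshold, pass events through individually.
    """
    by_user = defaultdict(list)
    for user_id, text in events:
        by_user[user_id].append(text)

    out = []
    for user_id, items in by_user.items():
        if len(items) >= threshold:
            out.append((user_id, f"{len(items)} new connection requests"))
        else:
            out.extend((user_id, text) for text in items)
    return out

# u1 has enough pending events to digest; u2's single event goes through as-is.
events = [("u1", "Ann wants to connect"), ("u1", "Bob wants to connect"),
          ("u1", "Cat wants to connect"), ("u2", "Dan wants to connect")]
messages = build_digests(events)
```

In a real pipeline the events would accumulate in storage over a batching window (say, hourly or daily) before a scheduled job calls something like `build_digests` and hands the results to the send path.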
We’ve talked a little bit about observability and analytics. We talked about using SLIs to make sure that you’re meeting your SLOs. What else do you need to be pulling together? Well, the way I think about it is this. Observability is really primarily geared towards support and engineering. It’s to make sure that the system is working the way you expect it to. There are some performance indicators that would fall under this, things like those SLIs, but you also need to think about logs. It’s more challenging than I think any of us might wish it were. When we think about channels, and often this varies by both channel and provider, really mature channels like email have decent tooling: if you’re a SendGrid customer, you can log into the SendGrid account and get a pretty good snapshot of what emails you’ve been sending to whom and what the impact was. Did it land? Did it bounce? Did it get opened or clicked?
But as you look at other channels, it starts to become something that you have less access to, certainly via the actual provider. Mobile push, you typically need to layer another service on top of to get access to that data. Things like Slack or WhatsApp, you have very little to no visibility into without building custom infrastructure. When you’re sending this message, how does the developer, or in some cases, support team look at what happened? What happened as soon as the developer said, “Send this notification to Troy,” what happened next? How did the system decide which channel that was going to deliver that message to Troy? What happened when it made that API call? Did it fail? Did it have to be retried? If it’s being retried, when’s it going to reattempt it and when will it be delivered?
Logging is really helpful for debugging use cases in test and development, but also in production when you do see things go wrong. Make sure that data’s all being pulled together in some way. You could put it into your data warehouse and create reports that give you direct log access, though those are typically not real-time, which makes it hard for certain debug scenarios. Maybe instead put it in something like Datadog, which is better for the engineer debugging scenario, but maybe a little bit tougher for the support debugging scenario. Regardless, make sure you’re finding a way to pull that data, those logs, together in a place that both the engineering team and the support team can access.
Engagement & UX Outcomes
Beyond the observability side of things, then we look at the analytics side of things, and this is really the business outcome side of your notifications. Is it driving the value that you’re hoping for? Is it creating engagement? Is it creating the right experience? If somebody’s resetting their password or receiving a magic login SMS or email, you want to make sure that’s coming in fast and that people are actually clicking it. You should see very few instances of people requesting password reset emails without them resetting their password. You should see very few instances of people getting a 2FA token delivered via SMS without them then logging in and inserting that 2FA token. You want to make sure that you’re measuring for each of these use cases how many of these notifications are going out, which channels are they going out via, and what was the outcome? Did it actually produce the intended effect?
You want to then look at this and you can start to see, “Hey, maybe one provider works better than another,” or maybe, “This channel isn’t as effective as we were hoping it would be.” Pulling that data together and being able to look at it in aggregate by channel, in aggregate by notification use case, by template for example, and by user cohort, these are all ways you are going to want to slice and dice this data so that people in the product management team, people in the engineering team, people in the data team can look at it and understand where it makes sense to continue to invest and where there might be problems lurking. Dig in deeper to figure out what needs to be fixed or improved.
Building a modern product notification system, honestly, is pretty complicated, but it’s not impossible. Most software companies have to do this at some point in their life cycle. If you’re just getting started, probably just pick a single channel and a single provider, and off you go. If it’s something like email or mobile push, which most companies usually start with, there are great platforms for both of those channels to get started with. As you expand and grow your audience and maybe start to investigate additional channels or additional use cases that are more complex, then you usually need to start thinking about investing in a broader ecosystem of infrastructure that can tie all of this together. At scale, you’re probably going to want multiple providers for each channel. In fact, if one of your channels is mobile push, you’re going to need that more or less from the beginning, because you’ve got the Apple ecosystem and the Android ecosystem, and you’re probably going to need to service both. Start to think about, even from the very beginning, how do you abstract away these different channels and these different providers? How do you pull together all of this data and create observability and analytics that can be consumed by your team and the rest of the organization?
Hopefully, this video gave you enough to get a feel for what parts of a modern product notification system are applicable to you today versus what might make sense down the road. For whatever parts are applicable to today, we’ve produced a white paper that digs into even more detail. We’d love to have you download it and check it out, looking for feedback, if you have any. Are there other parts of building a modern product notification system we haven’t really talked about that you’d love to hear how we do it or how we see other customers building? We also have a Discord server and would love to have you join us there. We’re all hanging out. Would love to chat with you and geek out around notifications if this is something you’re working on. Please stop by and say hi. Thank you.