Here are some details on the incident.
All times are in UTC, 24-hour format.
On November 18 at around 12:00, our internal monitoring alerted us to severe delays in sending push notifications for campaigns and recurring automations.
Our on-call team quickly investigated and found that a server in one of our distributed systems was overloaded; we opened this incident at around 12:30.
We quickly identified the issue; however, finding a solution proved more difficult. At around 13:40 we ran a maintenance operation that should have removed the server in question from the affected cluster. This did not work as expected and instead created more problems (more on that below). At 13:57 we decided to shut the server down completely: this resolved the issue, and our services started to catch up on sending the delayed notifications.
Unfortunately, we found out that the earlier maintenance operation had an adverse effect on one of the services responsible for sending a subset of the push notifications: it started to resend notifications it had already sent.
At around 19:30 everything was back to normal with no delays.
The impact was twofold.
First, push notifications for campaigns and recurring automations were significantly delayed, by up to 3 hours. Transactional and trigger automations were not affected.
Second, some notifications that were scheduled to be sent between 12:00 and 13:57 were sent more than once to the same users.
The initial delays were caused by a data replication process running on the overloaded server. We had performed a maintenance operation on this server to improve the reliability of the distributed system, but we didn't anticipate that the server couldn't keep up with the amount of data the replication process was sending its way. While replicating data, the server couldn't serve requests from our services, which were left waiting and thus accumulated delays.
Since then we've identified a number of ways to keep the replication process under control so that it no longer impacts the server.
The root cause of the duplicate push notifications was a misconfiguration in one of the services in the pipeline responsible for sending the notifications.
This service uses a configuration flag to control how frequently it saves which subset of a campaign or recurring automation (which we call a task) it has already processed, so that it can restart correctly after a failure or crash. Unfortunately, this flag was misconfigured to save this metadata only every 30 seconds. As a result, any failure this service experienced in accessing the distributed system could cause it to reprocess up to 30 seconds' worth of tasks it had already completed; this happened multiple times over a 10-minute window, which is why some notifications were sent more than once.
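To illustrate the failure mode, here is a minimal sketch in Python (with hypothetical names; this is not our actual service code) of how an infrequent checkpoint interacts with a crash: progress made since the last saved checkpoint is lost, so those tasks are processed, and their notifications sent, a second time.

```python
import time

CHECKPOINT_INTERVAL = 30.0  # seconds; the misconfigured save frequency


class Sender:
    """Toy notification sender that persists its progress only every
    CHECKPOINT_INTERVAL seconds instead of after each task."""

    def __init__(self):
        self.checkpoint = 0                 # index of the last *persisted* task
        self.sent = []                      # notifications actually delivered
        self.last_save = time.monotonic()

    def process(self, tasks, crash_at=None):
        # On (re)start we resume from the persisted checkpoint, not from the
        # real progress, so any task processed since the last save is redone.
        for i in range(self.checkpoint, len(tasks)):
            if crash_at is not None and i == crash_at:
                return                      # simulate a crash mid-run
            self.sent.append(tasks[i])      # "send" the notification
            if time.monotonic() - self.last_save >= CHECKPOINT_INTERVAL:
                self.checkpoint = i + 1     # persist progress
                self.last_save = time.monotonic()


sender = Sender()
tasks = ["n1", "n2", "n3"]
sender.process(tasks, crash_at=2)  # sends n1, n2, then crashes before any save
sender.process(tasks)              # restart: checkpoint is still 0, so n1 and n2 are resent
# sender.sent is now ["n1", "n2", "n1", "n2", "n3"]
```

Checkpointing after every task (or deduplicating on delivery) would avoid the duplicates, at the cost of more writes to the metadata store; the 30-second interval traded safety for write volume.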
This service has now been properly configured. Going forward, we will work on a systematic way to provide stronger delivery guarantees across our systems during this kind of maintenance, so that these problems do not recur.