Push campaign delivery issues
Incident Report for Batch
Postmortem

Here are some details on the incident.

Timeline

All times are in UTC and 24h time.

On November 18 at around 12:00, our internal monitoring alerted us to severe delays in sending push notifications for campaigns and recurring automations.

After a quick investigation, our on-call team found that a server in one of our distributed systems was overloaded; we opened this incident at around 12:30.

We quickly identified the issue; however, finding a solution proved more difficult. At around 13:40 we ran a maintenance operation that should have removed the server in question from the affected cluster, but it did not work as expected and instead created more problems (more on that later). At 13:57 we decided to shut down the server completely: this resolved the issue and our services started catching up on the delayed notifications.

Unfortunately, we found out that the maintenance operation we performed earlier had an adverse effect on one of our services, which is responsible for sending a subset of the push notifications: it started to resend notifications it had already sent.

At around 19:30 everything was back to normal with no delays.

Impact

The impact is twofold.

First, push notifications for campaigns and recurring automations were significantly delayed, by up to 3 hours. Transactional and trigger automations were not affected.

Second, some notifications that were scheduled to be sent between 12:00 and 13:57 were sent more than once to the same users.

Root cause

The server that caused the initial delays was overloaded by a data replication process that was running. We had performed a maintenance operation on this server to improve the reliability of the distributed system, but we didn’t anticipate that the server couldn’t keep up with the amount of data the replication process was sending its way. While replicating data, it couldn’t serve requests from our services, which were left waiting and accumulated delays.

Since then, we’ve identified a number of ways to keep the replication process under control and ensure it no longer impacts the server.

The root cause of the duplicate push notifications was a misconfiguration in one of the services in the pipeline responsible for sending the notifications.

A configuration flag tells the service how often it should save which subset of a campaign or recurring automation (which we call a task) it has already processed, so that it can restart correctly after a failure or a crash. Unfortunately, this flag was misconfigured to save this metadata only every 30 seconds. As a result, any failure this service experienced in accessing the distributed system could lead it to reprocess up to 30 seconds of tasks that had already been processed; this happened multiple times over 10 minutes and is the reason why we sent notifications more than once.
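
The effect of the save interval can be illustrated with a small simulation. This is a minimal sketch and not the actual service: the function names, the single simulated crash, and measuring the interval in tasks rather than seconds are all assumptions made for illustration.

```python
from collections import Counter


def process_with_checkpoints(tasks, checkpoint_every, crash_after):
    """Send tasks in order, saving progress every `checkpoint_every` tasks.

    One crash is simulated right after `crash_after` sends; processing then
    restarts from the last saved position. Returns every send that happened,
    duplicates included. (Hypothetical model, not the real pipeline.)
    """
    sent = []
    checkpoint = 0      # position recorded by the last durable save
    position = 0        # next task to process
    crashed = False
    while position < len(tasks):
        sent.append(tasks[position])
        position += 1
        if position % checkpoint_every == 0:
            checkpoint = position          # progress saved
        if not crashed and len(sent) == crash_after:
            crashed = True
            position = checkpoint          # restart from the last save
    return sent


def duplicates(sent):
    """Return the sorted list of tasks that were sent more than once."""
    return sorted(task for task, count in Counter(sent).items() if count > 1)
```

In this model, with a save every 5 tasks and a crash after the 8th send, tasks 5 to 7 are sent twice; saving after every task leaves no duplicates. The longer the interval between saves, the larger the window of work that a failure can replay.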

This service has now been properly configured. In the future we will work on a systematic way to provide better delivery guarantees across all our systems when we do this kind of maintenance work, so that these problems do not happen again.
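
One common way to strengthen delivery guarantees is to deduplicate sends on a stable key, so that a replayed task does not reach the user twice. The sketch below is only an illustration of that general technique, not a description of the planned work: the class name, the `(campaign_id, user_id)` key, and the in-memory set are all assumptions.

```python
class DedupingSender:
    """Wraps a delivery function and skips sends whose key was already seen.

    Hypothetical example: a real system would need a durable, shared store
    instead of an in-memory set for the keys to survive restarts.
    """

    def __init__(self, deliver):
        self._deliver = deliver   # callable(user_id, payload)
        self._seen = set()        # keys of notifications already delivered

    def send(self, campaign_id, user_id, payload):
        """Deliver once per (campaign_id, user_id); return True if delivered."""
        key = (campaign_id, user_id)
        if key in self._seen:
            return False          # replayed task: skip the duplicate
        self._deliver(user_id, payload)
        # Recorded after delivery: a crash between these two steps could
        # still cause one resend, so this narrows the window rather than
        # eliminating it.
        self._seen.add(key)
        return True
```

With such a wrapper, reprocessing a task after a restart results in skipped sends instead of duplicate notifications, as long as the key store outlives the restart.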

Posted Dec 01, 2023 - 16:26 UTC

Resolved
Push notification campaigns are fully back to normal.

However, we discovered that a small portion of the campaign pushes scheduled during the incident period (before 13:40 UTC) was sent in duplicate.

A detailed explanation will be provided in a following postmortem.
Posted Nov 18, 2023 - 19:34 UTC
Update
A fix has been deployed and the push notification delays are currently decreasing.
Posted Nov 18, 2023 - 14:43 UTC
Update
We are still working on a fix for this issue.
Posted Nov 18, 2023 - 14:02 UTC
Update
We are continuing to work on a fix for this issue.
Posted Nov 18, 2023 - 13:30 UTC
Identified
The issue has been identified. We are currently working on a fix.
Posted Nov 18, 2023 - 12:51 UTC
Investigating
We are aware of delays delivering push notifications for campaigns and recurring automations; other automations and transactional push notifications are not affected.
We are currently investigating.
Posted Nov 18, 2023 - 12:31 UTC
This incident affected: Delivery (Push campaigns, Push automations).