Push delivery issue

Incident Report for Batch

Postmortem

Here are some more details on the incident.

Timeline

At 12:29 UTC some instances of our internal service responsible for processing push campaigns started to fail.
At around 13:00 UTC all instances were stuck processing push campaigns.

At this point we were alerted by our monitoring system that these services weren't making progress; we then opened this incident at 13:19 UTC.
Soon after, we found that one of the caching database clusters used by this service was not behaving correctly because one node in the cluster was down; we then decided to restart this node.

At 13:29 UTC the node came back online, the database cluster resumed normal operation, and the service started making progress again.

At 13:54 UTC the service caught up and there was no longer any delay when processing push campaigns.

Impact

While the service was in a degraded state, push notifications could be delayed by 15 to 45 minutes.
After we restored the caching database cluster this delay was progressively reduced.

Posted Oct 26, 2022 - 07:42 UTC

Resolved

The incident has now been resolved.

Posted Oct 25, 2022 - 14:35 UTC

Monitoring

The caching database cluster is now functioning correctly again and our internal services are currently recovering. Push delivery is no longer delayed. We will keep monitoring the situation.

Posted Oct 25, 2022 - 13:54 UTC

Identified

We have identified an issue with one of our caching database clusters. We are working on a fix.

Posted Oct 25, 2022 - 13:37 UTC

Investigating

We are aware of delays affecting push notification delivery since approximately 12:50 UTC. We are currently investigating.

Posted Oct 25, 2022 - 13:19 UTC

This incident affected: [OLD_STATUSPAGE_MODEL] Delivery (Push campaigns).