Web push delivery issue
Incident Report for Batch
Postmortem

Summary

The overall impact of this incident was as follows:

  • Slowed delivery of campaign web push notifications, from 09:00 UTC to 09:40 UTC
  • No delivery of campaign push notifications on any platform, from 09:40 UTC to 10:10 UTC

Timeline

On 2021-04-15 at 09:00 UTC, we made a routine change to our message queuing system, which had an unforeseen effect on the delivery of campaign web push notifications.

Delivery was significantly slower until 09:40 UTC. Throughout this period we were still diagnosing the problem.

At 09:40 UTC, we decided to restart another component involved in the delivery of campaign push notifications, because the symptoms resembled a problem we had seen in the past.

Unfortunately, as soon as this component was restarted, it failed due to an obsolete configuration. This meant that no campaign push notifications were sent on any platform during this time.

Due to an obsolete deployment process for this component, we had trouble bringing it back up, but it was running again around 10:10 UTC.

While we were troubleshooting this component, a quick fix was developed for the campaign web push delivery.

After both components were redeployed, all notifications were sent correctly and in a timely manner.

Conclusion

This incident highlighted two things:

  • Don't rush to fix a problem at the risk of making things worse.
  • It is extremely important to standardize on a single deployment system, to avoid problems such as the obsolete deployment process that delayed recovery here.

Going forward, we will continue working on our processes and tooling to improve on these two points.

Posted Apr 19, 2021 - 14:47 UTC

Resolved
This incident has been resolved.
Posted Apr 15, 2021 - 17:18 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Apr 15, 2021 - 10:37 UTC
Update
APNS and GCM campaigns are fully functional again. Web push campaigns may still experience delivery latency.
Posted Apr 15, 2021 - 10:34 UTC
Update
After investigation, the issue is affecting push campaigns on all platforms (Web, GCM and APNS). Our team is working to restore the service as soon as possible.
Posted Apr 15, 2021 - 10:10 UTC
Identified
The issue has been identified and a fix is being implemented.
Posted Apr 15, 2021 - 09:53 UTC
Investigating
We're currently experiencing an issue affecting web push notification delivery. Our team is investigating and will keep you posted.
Posted Apr 15, 2021 - 09:43 UTC
This incident affected: Delivery (Push campaigns).