On February 16 at around 9:45, we detected significant delays for a subset of push notifications, emails and SMS sent for campaigns or recurring automations. Our investigation revealed that one instance of the service responsible for processing these campaigns and automations was having trouble keeping up with the incoming data. Once the issue was found, a mitigation was put in place immediately. The service then operated correctly again and started catching up on its delays. At around 9:50 all delays were resolved and everything was back to normal.
Impact
A small subset of push notifications, emails and SMS were sent with a delay. We estimate that around 4% of all messages were delayed by up to 7 hours.
Root cause
An issue with our profile selection system at around 3:00 caused a small part of the processing to halt. Once the problem was identified, our team mitigated it, after which the service worked correctly again and the delays were resolved.
Conclusion
Although the original problem was easy to fix, the main issue was that we lacked effective monitoring for this particular service, which resulted in much longer delivery delays than there should have been. In the near future we will improve the monitoring for this service so that we can address any issues much more quickly. We will also work on preventing these kinds of issues altogether.