On 21/11/2020 at 06:35 UTC, an alert fired pointing to network issues on ZooKeeper, the cluster that maintains our messaging services' data coherence and integrity.
What at first seemed to be minor network hiccups with little impact snowballed into a globally incoherent state, leading to severe data corruption. We were unable to send any push notifications, which impacted the "Push Campaigns", "Transactional Push", and "Trigger Campaigns" features.
We started working on restoring the messaging cluster, trying to get back to a stable state by fixing corrupted data manually. After 3 hours, as we still had no clear view of a possible ETA, we decided to split our efforts and started deploying a new messaging cluster while continuing the recovery work on the original one.
At around 12:45 UTC, we switched "Transactional Push" to the new messaging cluster, making this feature available again.
Around 14:10 UTC, the original messaging cluster was fixed and available again, which allowed us to restart the "Push Campaigns" feature's services and resume delivering campaigns.
15 minutes later, with some additional effort, we restored the "Trigger Campaigns" feature.