On Thursday, November 7, 2024, we encountered an unexpected issue while migrating one of our key message queue clusters. This issue resulted in a major outage of our delivery services and APIs.
A mandatory migration was underway for one of our core message queue clusters. This migration was a prerequisite for expanding our infrastructure across multiple data centers.
Unexpectedly, at the final step of the migration, a key sub-component of our message queue cluster (based on Kafka) encountered issues communicating with the other nodes in the cluster. The cluster became unavailable, causing our applications to stop processing messages.
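For readers less familiar with Kafka, node-to-node communication problems typically surface as under-replicated partitions: partitions whose in-sync replica set has shrunk below the assigned replica set. The sketch below is a generic diagnostic using the confluent-kafka Python client and a placeholder bootstrap address; it is not a view into our internal tooling or the specific failure we hit.

```python
from confluent_kafka.admin import AdminClient

# Placeholder bootstrap address, not our actual cluster.
admin = AdminClient({"bootstrap.servers": "kafka-1.internal:9092"})

# Fetch cluster metadata: which brokers the client can see, and which
# partitions currently have fewer in-sync replicas than assigned replicas.
metadata = admin.list_topics(timeout=10)

print(f"Brokers visible to the client: {sorted(metadata.brokers)}")

for topic_name, topic in metadata.topics.items():
    for partition_id, partition in topic.partitions.items():
        if len(partition.isrs) < len(partition.replicas):
            # Under-replicated partition: one or more replicas are not
            # keeping up, often a sign of broker-to-broker connectivity issues.
            print(
                f"{topic_name}[{partition_id}] is under-replicated: "
                f"replicas={partition.replicas} in-sync={partition.isrs}"
            )
```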
We detected this incident immediately and determined that it was caused by a bug triggered by an edge case in the migration process. However, we chose not to simply roll back the migration, as we needed to ensure data integrity and prevent any potential loss.
Completing the migration is what ultimately resolved the issue and allowed us to resume normal operations.
APIs: Experienced a 16% error rate across all services; the Custom Audience API continued to encounter errors until Nov. 8th, 09:31 GMT+1.
• Successful API calls (returning a success status code) were enqueued but not processed during the incident.
• Processing of enqueued requests began around 23:00 GMT+1 and concluded by Nov. 8th, 00:40 GMT+1.
• Action Required: Retry any important failed API calls, as they were not enqueued (see the retry sketch after this list).
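As a rough illustration of that retry step, the sketch below re-issues a failed call with exponential backoff. The endpoint, header, and payload are placeholders rather than our actual API surface, and it assumes you only replay calls that returned an error status: calls that returned a success code were enqueued and eventually processed, so replaying them could create duplicates.

```python
import time
import requests

# Placeholders for illustration only; adapt to the calls your integration makes.
API_URL = "https://api.example.com/transactional/send"  # hypothetical endpoint
API_KEY = "YOUR_REST_API_KEY"

def send_with_retry(payload, max_attempts=5, base_delay=2.0):
    """Re-issue a failed API call with exponential backoff.

    Only error responses and network failures are retried; calls that already
    returned a success status code should not be replayed.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            response = requests.post(
                API_URL,
                json=payload,
                headers={"Authorization": f"Bearer {API_KEY}"},  # placeholder auth header
                timeout=10,
            )
            if response.ok:
                return response
            print(f"Attempt {attempt} failed with HTTP {response.status_code}")
        except requests.RequestException as exc:
            print(f"Attempt {attempt} failed: {exc}")
        if attempt < max_attempts:
            time.sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
    raise RuntimeError("API call still failing after retries")
```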
SDK Web Services:
• In-App Automations with “Re-evaluate targeting just before display” did not function as expected.
• Events, attribute updates, and push opens from the mobile SDK and plugins will be retried when users reopen their apps.
• Events, attribute updates, and push opens from the Web SDK have been partially lost.
For clarity, this timeline only lists the most important events. All times are GMT+1.
Our alerting system detected that some of our core services were not working properly due to the unavailability of one of our key message queue clusters.
Since we were working on this cluster, we immediately identified the root cause and began our investigation.
After assessing the severity of the incident, we declared it publicly via our status page and started working on various plans to resume operations.
We decided not to force the migration or roll back until we fully understood the root cause and could ensure no data would be lost.
To restore all message delivery services as quickly as possible, we decided to implement a temporary workaround.
This change involved removing an internal feedback loop necessary for all post-delivery actions (analytics, marketing pressure, inbox).
The workaround was deployed and confirmed to be working.
Messages are being sent again.
We decided to resume the migration procedure.
All nodes were successfully migrated, and the cluster started healing itself.
We then reverted the workaround and restarted all services using this cluster.
All services seemed operational.
The incident remained open and under monitoring.
Due to a flood of alerts caused by the incident, the monitoring for our Custom Audience API was broken.
During an in-depth post-incident investigation, we detected this monitoring issue and fixed the Custom Audience API.
We marked the incident as resolved after verifying that all services were functioning as expected.
As this migration was part of a long-term plan to build a more resilient infrastructure, designed precisely to prevent this kind of issue, we will continue the deployment as planned.