On Thursday, November 7, 2024, we encountered an unexpected issue while migrating one of our key message queue clusters. This issue resulted in a major outage of our delivery services and APIs.
A mandatory migration was underway for one of our core message queue clusters. This migration was a prerequisite for expanding our infrastructure across multiple data centers.
Unexpectedly, at the final step of the migration, a key sub-component of our message queue cluster (based on Kafka) encountered issues communicating with the other nodes in the cluster. The cluster became unavailable, causing our applications to stop processing messages.
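For readers less familiar with Kafka, node-to-node communication problems typically surface as under-replicated partitions: partitions whose in-sync replica set has shrunk below the assigned replica set. The sketch below is a generic diagnostic using the confluent-kafka Python client and a placeholder bootstrap address; it is not a view into our internal tooling or the specific failure we hit.

```python
from confluent_kafka.admin import AdminClient

# Placeholder bootstrap address, not our actual cluster.
admin = AdminClient({"bootstrap.servers": "kafka-1.internal:9092"})

# Fetch cluster metadata: which brokers the client can see, and which
# partitions currently have fewer in-sync replicas than assigned replicas.
metadata = admin.list_topics(timeout=10)

print(f"Brokers visible to the client: {sorted(metadata.brokers)}")

for topic_name, topic in metadata.topics.items():
    for partition_id, partition in topic.partitions.items():
        if len(partition.isrs) < len(partition.replicas):
            # Under-replicated partition: one or more replicas are not
            # keeping up, often a sign of broker-to-broker connectivity issues.
            print(
                f"{topic_name}[{partition_id}] is under-replicated: "
                f"replicas={partition.replicas} in-sync={partition.isrs}"
            )
```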
We detected this incident immediately and determined that it was caused by a bug triggered by an edge case in the migration process. However, we chose not to simply roll back the migration, as we needed to ensure data integrity and prevent any potential loss.
Completing the migration is what ultimately resolved the issue and allowed us to resume normal operations.
APIs: Experienced a 16% error rate across all services; the Custom Audience API continued to encounter errors until Nov. 8th, 09:31 GMT+1.
• Successful API calls (returning a success status code) were enqueued but not processed during the incident.
• Processing of enqueued requests began around 23:00 GMT+1 and concluded by Nov. 8th, 00:40 GMT+1.
• Action Required: Retry any important failed API calls, as they were not enqueued (see the retry sketch after this list).
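As a rough illustration of that retry step, the sketch below re-issues a failed call with exponential backoff. The endpoint, header, and payload are placeholders rather than our actual API surface, and it assumes you only replay calls that returned an error status: calls that returned a success code were enqueued and eventually processed, so replaying them could create duplicates.

```python
import time
import requests

# Placeholders for illustration only; adapt to the calls your integration makes.
API_URL = "https://api.example.com/transactional/send"  # hypothetical endpoint
API_KEY = "YOUR_REST_API_KEY"

def send_with_retry(payload, max_attempts=5, base_delay=2.0):
    """Re-issue a failed API call with exponential backoff.

    Only error responses and network failures are retried; calls that already
    returned a success status code should not be replayed.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            response = requests.post(
                API_URL,
                json=payload,
                headers={"Authorization": f"Bearer {API_KEY}"},  # placeholder auth header
                timeout=10,
            )
            if response.ok:
                return response
            print(f"Attempt {attempt} failed with HTTP {response.status_code}")
        except requests.RequestException as exc:
            print(f"Attempt {attempt} failed: {exc}")
        if attempt < max_attempts:
            time.sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
    raise RuntimeError("API call still failing after retries")
```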
SDK Web Services:
• In-App Automations with “Re-evaluate targeting just before display” did not function as expected.
• Events, attribute updates, and push opens from the mobile SDK and plugins will be retried when users reopen their apps.
• Events, attribute updates, and push opens from the Web SDK have been partially lost.
For clarity, this timeline only lists the most important events. All times are GMT+1.
Our alerting system detected that some of our core services were not working properly due to the unavailability of one of our key message queue clusters.
Since we were working on this cluster, we immediately identified the root cause and began our investigation.
After assessing the severity of the incident, we declared it publicly via our status page and started working on various plans to resume operations.
We decided not to force the migration or roll back until we fully understood the root cause and could ensure no data would be lost.
To restore all message delivery services as quickly as possible, we decided to implement a temporary workaround.
This change involved removing an internal feedback loop necessary for all post-delivery actions (analytics, marketing pressure, inbox).
The workaround was deployed and confirmed to be working.
Messages are being sent again.
We decided to resume the migration procedure.
All nodes were successfully migrated, and the cluster started healing itself.
We then reverted the workaround and restarted all services using this cluster.
All services seemed operational.
The incident remained open and under monitoring.
Due to a flood of alerts caused by the incident, the monitoring for our Custom Audience API was broken.
During an in-depth post-incident investigation, we detected this monitoring issue and fixed the Custom Audience API.
We marked the incident as resolved after verifying that all services were functioning as expected.
As this migration was part of a long-term plan to build a more resilient infrastructure, designed precisely to prevent this kind of issue, we will continue the deployment as planned.