CEP & MEP Delivery Issues

Incident Report for Batch

Postmortem

On August 14th, 2025, we encountered a database issue when performing a maintenance operation. This issue resulted in a major outage of our MEP & CEP delivery services.

What happened?

During the rework of our internal DNS, we renamed servers using our IaC (Infrastructure as Code) tooling.

A quirk of one of our IaC plugins can cause it to wrongly believe that machines should be re-imaged.

We knew about this issue and had applied a safeguard, but this safeguard wasn’t enabled on 5 of the database instances. When we applied the IaC changes, those 5 servers were wiped clean of their data and re-imaged. Some data (~0.028% of total data) became unavailable because every replica of it was on the wiped machines. The loss of these nodes also broke cluster quorum, forcing services to operate at a degraded consistency level, which raised their error rates.
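To illustrate how losing five nodes can both make a small share of the data unreadable and break quorum for other data, here is a minimal sketch. It assumes a replication factor of 3 and a majority quorum; neither value is stated in this report, and the node names and replica placement are hypothetical.

```python
# Hypothetical illustration: the replication factor and replica placement
# are assumptions, not details of the actual cluster.

def quorum_size(replication_factor: int) -> int:
    """Majority quorum: more than half of the replicas."""
    return replication_factor // 2 + 1


def classify(replicas: set[str], wiped: set[str]) -> str:
    """Classify a partition by how many of its replicas survived."""
    alive = len(replicas - wiped)
    if alive == 0:
        return "unavailable"   # every replica was on a wiped node
    if alive < quorum_size(len(replicas)):
        return "degraded"      # readable, but not at quorum consistency
    return "healthy"


wiped_nodes = {"db-1", "db-2", "db-3", "db-4", "db-5"}
partitions = {
    "p1": {"db-1", "db-2", "db-3"},  # all replicas wiped
    "p2": {"db-4", "db-5", "db-6"},  # one survivor, below quorum
    "p3": {"db-5", "db-6", "db-7"},  # two survivors, quorum holds
    "p4": {"db-7", "db-8", "db-9"},  # untouched
}

for name, replicas in partitions.items():
    print(name, classify(replicas, wiped_nodes))
```

In this toy placement, only partitions whose replicas all sat on wiped nodes are unreadable, while partitions that kept a single replica can still be read at a weaker consistency level.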

Core services used by both the MEP & CEP were on this database cluster. In some cases, to preserve data integrity, services stopped processing events when they encountered unreadable database rows.

The issue was immediately noticed, and our incident response system was triggered.

We fixed it by restoring the wiped nodes from a backup. We then manually brought the system back online, restoring services slowly so that the database wasn’t overloaded while the applications caught up with the backlogged work.
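The recovery described above was done by hand. Purely as an illustration of the idea of a gradual ramp-up, the sketch below raises a processing rate step by step while the database has headroom and backs off when load gets high. The function name, thresholds, step size, and load metric are all hypothetical.

```python
# Hypothetical ramp-up controller: thresholds, step sizes, and the load
# metric are illustrative assumptions, not the procedure used in the incident.

def next_rate(current: float, db_load: float, *, max_rate: float,
              step: float = 0.25, high_load: float = 0.8) -> float:
    """Return the next processing rate (events/s) for a database load in [0, 1].

    The rate grows by `step` (25% by default) while the database has
    headroom and is halved as soon as load reaches the threshold.
    """
    if db_load >= high_load:
        return max(current / 2, 1.0)          # back off, but never reach zero
    return min(current * (1 + step), max_rate)  # ramp up, capped at max_rate


rate = 100.0
for load in [0.30, 0.45, 0.60, 0.85, 0.70, 0.50]:
    rate = next_rate(rate, load, max_rate=1000.0)
    print(f"db_load={load:.2f} -> rate={rate:.1f} events/s")
```

Backing off faster than ramping up is a common choice here: it favours database stability over how quickly the backlog drains.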

Impact on our platform

  • MEP: Push v1 Campaigns & Trigger Automations were unavailable during the incident. Trigger Automations were eventually sent, but many campaign messages were dropped. Campaigns sent between August 13th and 14th may be missing analytics.
  • CEP: Automations (Push v2, SMS, Email) were backlogged for the duration of the incident.
  • APIs: Experienced elevated error rates, especially around 12:10 CEST when the incident started.

Timeline & mitigation actions

For clarity, this timeline only lists the most important events. All times are CEST.

12:10

The IaC operation was applied, and the 5 servers were re-imaged.

12:20

Our infrastructure engineers were paged about the missing database nodes and started looking at the problem. The incident response process was triggered, calling other engineers to help. They began drafting an action plan.

12:40

The action plan was reviewed, and work on it started. Database backups were being restored.

A second plan was drafted, hoping that it could help us bring part of the system online sooner than expected.

We started an in-depth audit of all of our products to establish a list of what was working and what was not.

13:30–14:23

The severity of the incident was reevaluated; the Statuspage incident was drafted and published.

15:50

The database restore was finished, and the cluster was fully back online.

As we expected, human intervention was required to carefully bring the services back online while managing database load, because the system needed to catch up with the backlogged work. We started doing this immediately.

15:50–17:58

The system caught up with the backlogged work, and the database cluster was stable.

We wrote an update and moved the incident to the “Monitoring” phase.

07:37, the next morning

The on-call team reported that everything had been working as expected since we moved to “Monitoring”. We marked the incident as resolved.

Forthcoming actions

  • We are looking at ways to split this database cluster into smaller ones to reduce the blast radius of an unavailability event.
  • We improved our IaC configuration to add more protections against accidental re-imaging.
  • We will change how the IaC action plan is presented to human operators, reducing noise so that an unwanted change is less likely to be missed.
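
As an illustration of the last point, the sketch below filters a plan down to its destructive changes and lists them first. The plan format, field names, and action names are hypothetical; real IaC tools use their own schemas, and this is not the tooling described in this report.

```python
# Hypothetical plan format: each change is a dict with an "address" and a
# list of "actions". Real IaC tools use their own schemas.

DESTRUCTIVE = {"delete", "replace", "reimage"}


def destructive_changes(plan: list[dict]) -> list[dict]:
    """Keep only the changes that would destroy or re-image a resource."""
    return [c for c in plan if DESTRUCTIVE & set(c["actions"])]


def summarize(plan: list[dict]) -> str:
    """Build a short summary that puts destructive changes up front."""
    risky = destructive_changes(plan)
    lines = [f"{len(plan)} change(s), {len(risky)} destructive"]
    lines += [f"  !! {c['address']}: {', '.join(c['actions'])}" for c in risky]
    return "\n".join(lines)


plan = [
    {"address": "dns_record.db-1", "actions": ["update"]},
    {"address": "server.db-1", "actions": ["delete", "create"]},
    {"address": "server.web-1", "actions": ["update"]},
]
print(summarize(plan))
```

A summary like this could also be used as a gate, for example by refusing to apply a plan whose destructive count is not zero unless an operator confirms it explicitly.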
Posted Aug 21, 2025 - 07:11 UTC

Resolved

The system has been stable for the last 12 hours.

As of August 14, 18:04 CEST:
- Late MEP & CEP Trigger Automation messages have been sent
- Some MEP Push Campaigns & Recurring Automations have been partially sent and will not be retried
- API error rates are back to normal
- Webhooks are back to normal


This incident is now considered resolved.
Posted Aug 15, 2025 - 05:37 UTC

Monitoring

The system has now fully processed the backlog of CEP and MEP automation messages, including Email, SMS, and Push v1 and v2.
All previously delayed messages have been sent.

We are now moving into a monitoring phase, during which the on-call team will closely watch system performance.
Posted Aug 14, 2025 - 16:04 UTC

Update

The system is still catching up with the backlogged work.
All late CEP & MEP Automation messages (Email, SMS, Push v1 & v2) will be sent.

We will provide further updates as soon as we have more information or in 1 hour, whichever comes first.
Posted Aug 14, 2025 - 15:36 UTC

Update

The database issue has been resolved.
All components are gradually returning to normal.
Processing speed is being increased progressively, with priority given to maintaining system stability.

We will provide further updates as soon as we have more information or in 1 hour, whichever comes first.
Posted Aug 14, 2025 - 14:38 UTC

Update

We are still working on fixing the database problem.

The affected components are still the same.
Posted Aug 14, 2025 - 13:33 UTC

Identified

We have been experiencing a database issue since August 14th, 12:10 CEST.
We are working on a complete report of what is affected.

In the meantime, here is a brief overview of the platform’s status:

* CEP Automations (Email, SMS, Push v2) are not being sent
* MEP Automations (Push v1) are not being sent
* MEP Campaigns are partially being sent
* CEP Campaigns are partially being sent
* The transactional API (MEP) is working
* Data sent to APIs that returned a 200 HTTP status code may not be processed live, but it is enqueued and will eventually be processed
* Some APIs might experience elevated error rates over the afternoon
* Webhooks are partially working


We will provide further updates as soon as we have more information or in 1 hour, whichever comes first.
Posted Aug 14, 2025 - 12:23 UTC
This incident affected: Optional Services (Inbox, Webhook), CEP Core Services (Email delivery, Push Delivery, SMS Delivery, Event Targeting & Retargeting), and MEP Core Services (Push delivery, Data ingestion, Custom Audiences).