On August 14th, 2025, we encountered a database issue when performing a maintenance operation. This issue resulted in a major outage of our MEP & CEP delivery services.
During the rework of our internal DNS, we renamed servers using our IaC (Infrastructure as Code) tooling.
A quirk in one of our IaC plugins can cause it to incorrectly conclude that machines need to be re-imaged.
We knew about this issue and had applied a safeguard, but the safeguard wasn’t enabled on 5 of the database instances. When the IaC changes were applied, those 5 servers were wiped of their data and re-imaged. A small portion of data (~0.028% of the total) became unavailable because all of its replicas lived on the erased nodes. Losing those nodes also broke cluster quorum, forcing services to operate at a degraded consistency level and raising their error rates.
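The specific IaC tooling and resource layout aren’t detailed here, but as a rough illustration of the kind of safeguard involved, the sketch below assumes a Terraform-like tool that can render its plan as JSON; a pre-apply check along these lines can refuse to run a plan that would destroy or replace protected database nodes. The protected-prefix list and file paths are hypothetical.

```python
#!/usr/bin/env python3
"""Hypothetical pre-apply guard: refuse to apply an IaC plan that would
destroy (and therefore re-image) protected machines such as database nodes.

Assumes a Terraform-like tool that can render its plan as JSON with a
top-level "resource_changes" list; adapt the parsing for other tools.
"""
import json
import sys

# Hypothetical address prefixes that must never be destroyed or replaced.
PROTECTED_PREFIXES = ("module.database.",)

def destructive_changes(plan: dict) -> list[str]:
    """Return the addresses of protected resources the plan would delete."""
    flagged = []
    for change in plan.get("resource_changes", []):
        address = change.get("address", "")
        actions = change.get("change", {}).get("actions", [])
        # A replacement shows up as a delete plus a create, so checking for
        # "delete" catches both outright destruction and re-imaging.
        if address.startswith(PROTECTED_PREFIXES) and "delete" in actions:
            flagged.append(address)
    return flagged

if __name__ == "__main__":
    # e.g. `terraform show -json plan.out > plan.json` before running this guard
    with open(sys.argv[1]) as f:
        plan = json.load(f)
    flagged = destructive_changes(plan)
    if flagged:
        print("Refusing to apply: the plan would destroy protected nodes:")
        for address in flagged:
            print(f"  - {address}")
        sys.exit(1)
    print("No destructive changes to protected nodes found.")
```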
Core services used by both the MEP & CEP delivery pipelines ran on this database cluster. In some cases, services deliberately stopped processing events when they encountered unreadable database rows, in order to preserve data integrity.
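The services themselves aren’t described in detail, but a minimal sketch of that fail-stop behaviour, with hypothetical queue and database interfaces, looks like this: rather than skipping a row it cannot read (and silently losing an event), the worker re-queues the event and halts.

```python
# Minimal sketch of the fail-stop behaviour described above, with
# hypothetical queue/database interfaces: the worker prefers stopping
# over processing events against rows it cannot read.

class UnreadableRowError(Exception):
    """Raised when the database returns an unreadable or inconsistent row."""

def process_events(queue, db, handler):
    """Consume events one at a time; halt on the first unreadable row."""
    for event in queue.pending():
        try:
            row = db.read(event.key)      # may fail while quorum is degraded
        except UnreadableRowError:
            queue.requeue(event)          # keep the event for later replay
            raise                         # stop the worker: integrity over throughput
        handler(event, row)               # normal processing path
        queue.ack(event)                  # only acknowledge fully processed events
```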
The issue was detected within minutes, and our incident response process was triggered.
We fixed it by restoring backups of the wiped nodes. We then manually brought the system back online, restoring services gradually so that the database wasn’t overloaded while the applications caught up with the backlogged work.
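The exact ramp-up procedure isn’t spelled out here; as a rough sketch under assumed service names, load metric, and thresholds, it amounts to re-enabling services one at a time and pausing whenever database load climbs too high while the backlog drains.

```python
import time

# Rough sketch of a gradual ramp-up (hypothetical service names, load metric
# and thresholds): re-enable services one by one, and wait for database load
# to settle before moving on, so the cluster isn't overwhelmed while the
# applications work through the backlogged events.

DB_LOAD_CEILING = 0.75                                    # assumed fraction of DB capacity
RAMP_ORDER = ["ingest", "mep-delivery", "cep-delivery"]   # hypothetical order

def ramp_up(services, db_load, enable, ceiling=DB_LOAD_CEILING):
    """Enable each service in order, waiting for the database to settle first."""
    for name in services:
        while db_load() > ceiling:        # backlog catch-up keeps the DB busy
            time.sleep(30)                # back off until load drops
        enable(name)                      # bring the next service online
        time.sleep(60)                    # let its own backlog start draining
```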
For clarity, this timeline only lists the most important events. All times are CEST.
12:10
The IaC operation was applied, and the 5 servers were re-imaged.
12:20
Our infrastructure engineers were paged about the missing database nodes and started looking at the problem. The incident response process was triggered, calling other engineers to help. They began drafting an action plan.
12:40
The action plan was reviewed, and work on it started. Database backups were being restored.
A second plan was drafted in the hope that it could bring part of the system back online sooner than expected.
We started an in-depth audit of all of our products to establish a list of what was working and what was not.
13:30–14:23
The severity of the incident was reevaluated; the Statuspage incident was drafted and published.
15:50
The database restore was finished, and the cluster was fully back online.
As expected, human intervention was required to carefully bring the services back online while managing database load as the system caught up with the backlogged work; we started this immediately.
15:30–17:58
The system caught up with the backlogged work, and the database cluster was stable.
We wrote an update and moved the incident to the “Monitoring” phase.
07:37, the next morning
The on-call team reported that everything had been working as expected since we moved to “Monitoring”. We marked the incident as resolved.