At 01:30 GMT+1 on 02/02/2024, our on-call team was alerted to a potential issue on the platform. At 01:58 GMT+1, the team identified that the network was down on 189 servers. The physical network interfaces were down due to a misconfiguration by our hosting provider following a hardware failure, which made immediate action impossible.
Approximately 20 minutes later, some servers began to come back online spontaneously, leading to further investigation and attempts to restore services.
The network failure affected 189 servers, causing their physical network interfaces to go down. Once our hosting provider corrected the network configuration, our servers came back online. Some servers were restored by automated systems, while others required manual intervention.
For approximately 1 hour and 35 minutes, between 01:25 GMT+1 and 03:00 GMT+1 on 02/02/2024, various core and optional services experienced partial or complete downtime. The incident affected several components:
Despite the downtime, there was no impact on email and SMS delivery.
Regarding potential data loss: Requests accepted by our APIs during the incident were ingested, though some may have experienced delays. All retries made after 03:00 GMT+1 were successfully processed, and the data was properly ingested.
Campaigns scheduled to launch during the incident did not trigger at the expected time but were restored once the network came back online. However, a small portion of these campaigns expired after two hours and were not sent.
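As a rough illustration of why some delayed campaigns were still sent while others expired, here is a minimal Python sketch. It assumes the two-hour expiry is counted from the scheduled launch time, which this report does not specify. The names `campaign_outcome` and `EXPIRY_WINDOW` and the 01:45 schedule are hypothetical; only the 03:00 and 19:00 recovery times come from the timeline.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the expiry window starts at the scheduled launch time.
EXPIRY_WINDOW = timedelta(hours=2)
GMT_PLUS_1 = timezone(timedelta(hours=1))


def campaign_outcome(scheduled_at: datetime, recovered_at: datetime) -> str:
    """Classify a campaign that could not launch until `recovered_at`.

    Returns "sent_late" if recovery happened within the expiry window,
    otherwise "expired".
    """
    if recovered_at - scheduled_at <= EXPIRY_WINDOW:
        return "sent_late"
    return "expired"


# Hypothetical campaign scheduled during the incident.
scheduled = datetime(2024, 2, 2, 1, 45, tzinfo=GMT_PLUS_1)

# Served by a server that recovered at 03:00 (1 h 15 min late).
print(campaign_outcome(scheduled, datetime(2024, 2, 2, 3, 0, tzinfo=GMT_PLUS_1)))

# Served by a server that needed manual intervention at 19:00.
print(campaign_outcome(scheduled, datetime(2024, 2, 2, 19, 0, tzinfo=GMT_PLUS_1)))
```

Under this assumption, a campaign waiting on one of the servers that recovered by 03:00 would be sent late, while one waiting on a server restored only at 19:00 would have passed its two-hour window.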
01:30 GMT+1 - Pager alert triggered; our on-call SRE begins investigation.
01:58 GMT+1 - Network failure identified on 189 servers; physical interfaces down.
02:18 GMT+1 - Some servers begin to recover automatically; further investigation initiated.
03:00 GMT+1 - Most services restored; incident reported to infrastructure provider.
19:00 GMT+1 - Manual intervention by our infrastructure provider restores the few remaining servers; all services back online.