Network issue on our hosting provider

Incident Report for Batch

Postmortem

Foreword

At 01:30 GMT+1 on 02/02/2025, our on-call team was alerted to a potential issue on the platform. At 01:58 GMT+1, we identified that the network was down on 189 servers. Their physical network interfaces were down due to a misconfiguration by our hosting provider following a hardware failure, which made immediate action on our side impossible.

Approximately 20 minutes later, some servers began to come back online spontaneously, leading to further investigation and attempts to restore services.

Fault

The network failure affected 189 servers, causing their physical network interfaces to go down. Once our hosting provider corrected the configuration on the network, our servers came back online. While some servers were restored by automated systems, others required manual intervention.

Impact

For approximately 1 hour and 35 minutes, between 01:25 GMT+1 and 03:00 GMT+1 on 02/02/2025, various core and optional services experienced partial or complete downtime. The incident affected several components:

CEP Core services: Push delivery, Data ingestion, and Dashboard were partially down.
MEP Core services: Push delivery, In-app delivery, and Data analysis experienced partial downtime.
APIs: Several APIs, including Audience API, Export API, and Campaign API, were partially down (errors were intermittent while the system auto-healed).
Optional services: Inbox, Webhook, and Custom exports were partially down.

Despite the downtime, there was no impact on email and SMS delivery.

Regarding potential data loss: Requests accepted by our APIs during the incident were ingested, though some may have experienced delays. All retries made after 03:00 GMT+1 were successfully processed, and the data was properly ingested.
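For API clients, the retry behavior mentioned above can be approximated with exponential backoff and jitter. The following is a minimal sketch, not our SDK's implementation: `TransientError`, `backoff_delays`, `call_with_retry`, and the `send` callable are hypothetical names introduced for illustration, and the default delays are assumptions rather than recommended values.

```python
import random
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for a retryable failure (timeout, HTTP 5xx, ...)."""


def backoff_delays(retries: int, base: float = 1.0, cap: float = 60.0) -> List[float]:
    """Upper bound of the wait before each retry: base * 2**n, capped at `cap`."""
    return [min(cap, base * (2 ** n)) for n in range(retries)]


def call_with_retry(
    send: Callable[[], T],
    max_attempts: int = 5,
    base: float = 1.0,
    cap: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `send` until it succeeds or `max_attempts` is reached.

    Waits a random duration between 0 and the backoff bound between attempts
    ("full jitter"), so that many clients do not retry at the same instant.
    """
    delays = backoff_delays(max_attempts - 1, base, cap)
    for attempt in range(max_attempts):
        try:
            return send()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            sleep(random.uniform(0, delays[attempt]))
    raise ValueError("max_attempts must be at least 1")
```

A caller would wrap its own request function, raising `TransientError` on timeouts or server errors, and pass it as `send`. Capping the delay keeps a long outage from pushing individual waits beyond a bounded interval.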

Campaigns scheduled to launch during the incident did not trigger at the expected time but resumed once the network came back online. However, a small portion of these campaigns reached their two-hour expiry before they could be sent and were dropped.
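The expiry behavior described above can be illustrated with a short sketch. It assumes the two-hour window is counted from the scheduled launch time, which this report does not state explicitly; `late_campaign_action` and `EXPIRY_WINDOW` are hypothetical names, not part of the platform.

```python
from datetime import datetime, timedelta

# From the report: delayed campaigns expired after two hours.
# Assumption: the window starts at the scheduled launch time.
EXPIRY_WINDOW = timedelta(hours=2)


def late_campaign_action(
    scheduled_at: datetime,
    recovered_at: datetime,
    expiry: timedelta = EXPIRY_WINDOW,
) -> str:
    """Return "send" if the campaign can still go out late, else "expire"."""
    delay = recovered_at - scheduled_at
    return "expire" if delay > expiry else "send"


# Arbitrary placeholder date; only the clock times are taken from the report.
day = datetime(2000, 1, 1)

# Scheduled at 01:30, server back at 03:00: 1h30 late, still within the window.
early_recovery = late_campaign_action(day.replace(hour=1, minute=30), day.replace(hour=3))

# Scheduled at 01:30, server back at 19:00: far past the window.
late_recovery = late_campaign_action(day.replace(hour=1, minute=30), day.replace(hour=19))
```

Under this assumption, a campaign on a server that recovered around 03:00 would still be sent, while one tied to a server restored at 19:00 would not.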

Timeline

01:30 GMT+1 - Pager alert triggered; our on-call SRE begins investigation.

01:58 GMT+1 - Network failure identified on 189 servers; physical interfaces down.

02:18 GMT+1 - Some servers begin to recover automatically; further investigation initiated.

03:00 GMT+1 - Most services restored; incident reported to infrastructure provider.

19:00 GMT+1 - Manual intervention by our infrastructure provider restores the few remaining servers; all services back online.

Posted Feb 05, 2025 - 13:33 UTC

Resolved

After further investigation, we now consider this incident resolved. The root cause has been identified as a network issue. We are preparing a detailed analysis, which will be shared in the coming days to provide further insights into the incident.

Posted Feb 03, 2025 - 17:01 UTC

Monitoring

All our services have returned to normal.
We continue to closely monitor the platform.

Posted Feb 02, 2025 - 03:00 UTC

Update

A significant part of our infrastructure became unreachable at 01:30 GMT+1.
Most of the affected servers are now accessible again, and services are gradually returning to normal. The status page will be updated progressively to reflect component availability.

We will provide another update in 30 minutes, at 04:00 GMT+1.

Posted Feb 02, 2025 - 02:35 UTC

Identified

We're currently experiencing network problems that may impact all services. The issue likely originates with our hosting provider. We will keep you posted as the investigation progresses.

Posted Feb 02, 2025 - 00:45 UTC

This incident affected: MEP Core Services (Dashboard, In-app delivery, Data ingestion), Optional Services (Editorial dashboard, Inbox, Custom Exports), [OLD_STATUSPAGE_MODEL] Delivery (Transactional push, Push campaigns, In-app messaging), [OLD_STATUSPAGE_MODEL] Webservices (Custom attributes, Web push static resources), and API ([MEP] Campaigns API, [MEP] Transactional API, [MEP] Custom Data API, [MEP] Transactional API for Partners).