CEP Campaign Delivery Degradation (Email, Push & SMS)

Incident Report for Batch

Postmortem

Post-mortem: Partial Platform Outage

Summary

On December 22, 2025, between 10:17 and 13:40 UTC, part of our CEP experienced service disruptions affecting message delivery, data ingestion, and dashboards for some customers.

The incident was caused by a hardware-related issue on a high-grade server hosting part of our private virtualization infrastructure. The issue has since been fully resolved, and corrective actions have been implemented.

Impact

During the incident window, the following impacts were observed:

* Push and Email Campaigns (CEP): Delivery was interrupted for a subset of campaigns.
* Data Ingestion & Analytics: Temporary ingestion delays occurred.
* Dashboards: The Batch dashboard was intermittently unavailable.

No impact was observed on SMS delivery.

The incident affected a limited subset of customers, depending on their usage at the time.

Timeline (UTC)

10:17 – First service degradation detected.
10:20 – Investigation initiated.
11:00 – Infrastructure issue identified on one hypervisor.
11:30 – Mitigation and recovery actions started.
13:40 – All impacted services fully operational.

Root Cause

The incident was caused by a fault in the cooling system of a high-grade server used in our private virtualization infrastructure.

This cooling issue led to overheating, triggering a protective shutdown of a disk group on the affected server. As a result, the hypervisor abruptly lost access to multiple disks, causing all virtualized services hosted on that node to stop simultaneously.

Although this class of hardware is designed to provide strong reliability guarantees, this cooling failure resulted in the loss of a single hypervisor and exposed the impact of service colocation on shared infrastructure.

Resolution

* The cooling issue and impacted hardware were fully repaired and validated by our infrastructure provider.
* Affected services were restarted and resynchronized.
* Data consistency was verified after recovery.

Corrective and Preventive Actions

Following this incident, we took the following actions:

* The faulty hardware was removed from service, repaired, and kept out of production until fully validated.
* We reviewed service placement on our private virtualization infrastructure and reduced the colocation of critical components on single hypervisors.
* Stateful services (including Redis clusters) were redistributed to limit the blast radius of a single host failure.
* We strengthened monitoring and alerting around service colocation, so that we can detect and act earlier when multiple critical components are unintentionally placed on the same underlying host.

These actions aim to reduce the impact of similar infrastructure-level incidents in the future.
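
As an illustration only, the kind of colocation check described above could look like the sketch below. It is not our production tooling: the inventory format, the service and hypervisor names, and the `max_critical_services` threshold are all hypothetical. The sketch flags any host that carries more than one critical service, and any host that carries several instances of the same service (for example, two nodes of one Redis cluster).

```python
from collections import defaultdict

# Hypothetical inventory rows: (service, instance, hypervisor, is_critical).
PLACEMENTS = [
    ("redis-cluster-a", "node-1", "hv-01", True),
    ("redis-cluster-a", "node-2", "hv-01", True),
    ("redis-cluster-a", "node-3", "hv-02", True),
    ("push-sender", "worker-1", "hv-01", True),
    ("email-sender", "worker-1", "hv-03", True),
    ("report-export", "worker-1", "hv-03", False),
]


def colocation_alerts(placements, max_critical_services=1):
    """Return human-readable alerts for risky placements.

    Rule 1: a host carries more distinct critical services than allowed.
    Rule 2: a host carries more than one instance of the same service.
    """
    critical_by_host = defaultdict(set)
    instances = defaultdict(int)
    for service, _instance, host, is_critical in placements:
        if is_critical:
            critical_by_host[host].add(service)
        instances[(host, service)] += 1

    alerts = []
    for host, services in sorted(critical_by_host.items()):
        if len(services) > max_critical_services:
            names = ", ".join(sorted(services))
            alerts.append(
                f"{host}: {len(services)} critical services colocated ({names})"
            )
    for (host, service), count in sorted(instances.items()):
        if count > 1:
            alerts.append(f"{host}: {count} instances of {service} share this host")
    return alerts


alerts = colocation_alerts(PLACEMENTS)
for line in alerts:
    print(line)
```

With the sample inventory, both alerts point at `hv-01`, the only host whose loss would take down two critical services and two nodes of the same cluster at once.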

Conclusion

We apologize for the disruption this incident caused.

While hardware failures of this nature are rare, this event highlighted areas where we could further improve infrastructure resilience and observability.

We remain committed to transparency and continuous improvement.

—

The Batch Engineering Team

Posted Dec 24, 2025 - 13:51 UTC

Resolved

This incident has been resolved.

Posted Dec 22, 2025 - 17:01 UTC

Monitoring

An incident affected CEP campaign delivery across Email, Push, and SMS between 11:00 and 14:30 CET.

* Campaigns: Messages scheduled during this window were not sent and not retried. Analytics will show zero delivery and engagement. If impacted, consider resending your campaign.
* Automations: Delays occurred between 11:40 and 12:15 CET. Affected automations were eventually sent, and the displayed analytics are accurate.
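
To help identify campaigns worth reviewing, here is a minimal sketch that checks whether a scheduled send time falls inside the window stated above (11:00 to 14:30 CET on December 22, 2025, with CET taken as UTC+1). The campaign names and timestamps are hypothetical; in practice they would come from your own campaign records.

```python
from datetime import datetime, timedelta, timezone

# CET is UTC+1; the window is the one given in this update.
CET = timezone(timedelta(hours=1), "CET")
WINDOW_START = datetime(2025, 12, 22, 11, 0, tzinfo=CET)
WINDOW_END = datetime(2025, 12, 22, 14, 30, tzinfo=CET)


def was_in_incident_window(scheduled_at):
    """Return True if a timezone-aware send time falls inside the window."""
    if scheduled_at.tzinfo is None:
        raise ValueError("scheduled_at must be timezone-aware")
    return WINDOW_START <= scheduled_at <= WINDOW_END


# Hypothetical campaigns with their scheduled send times in UTC.
campaigns = {
    "winter-sale": datetime(2025, 12, 22, 10, 30, tzinfo=timezone.utc),
    "newsletter": datetime(2025, 12, 22, 14, 0, tzinfo=timezone.utc),
}

to_review = [name for name, ts in campaigns.items() if was_in_incident_window(ts)]
print(to_review)
```

Because the comparison uses timezone-aware datetimes, send times stored in UTC or in any other offset are compared correctly without manual conversion.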

The issue has been fixed, and we are actively monitoring the platform.

Posted Dec 22, 2025 - 13:43 UTC

This incident affected: CEP Core Services (Dashboard, Email delivery, Push Delivery, SMS Delivery).