Webhook and Export API data processing delays
Incident Report for Batch
Postmortem

Here are the details of the incident.

Timeline

All times are in UTC and 24h time.

On April 23 at around 15:45, our internal monitoring system alerted us that a data processing pipeline was starting to lag. This pipeline provides data to the Export API and Webhook features, so when it lags behind, the newest data is not immediately available to those features.
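
For context, the sketch below illustrates the kind of lag check this monitoring performs, assuming a hypothetical 15-minute threshold and a timestamp for the newest fully processed event; it is not our actual alerting code.

```python
import time

# Hypothetical alerting threshold; the real rule lives in our monitoring system.
LAG_ALERT_THRESHOLD_SECONDS = 15 * 60


def pipeline_lag_seconds(latest_processed_ts: float, now: float | None = None) -> float:
    """Lag is the age of the newest event the pipeline has fully processed."""
    if now is None:
        now = time.time()
    return max(0.0, now - latest_processed_ts)


def should_alert(latest_processed_ts: float) -> bool:
    return pipeline_lag_seconds(latest_processed_ts) > LAG_ALERT_THRESHOLD_SECONDS


if __name__ == "__main__":
    # An event processed 20 minutes ago exceeds the 15-minute threshold.
    print(should_alert(time.time() - 20 * 60))  # True
```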

Our team immediately started to investigate, but finding the root cause proved difficult. Given the accumulated delays, we decided to open this incident at around 16:45.

We continued to investigate and found the root cause at around 17:45: a subset of our database servers had a networking issue caused by a spurious restart of their network interfaces, which made them slow to respond to queries and thus slowed our data processing pipeline. We worked around the issue by restarting the problematic servers. After this operation the data processing pipeline was in a good state again and started catching up on its lag at full speed.

At around 19:00 the pipeline finally caught up and was processing data in real time again; everything was back to normal.

Impact

Two features were impacted, the Export API and Webhooks, and only a subset of each.

Export API

Only push campaign data exports were impacted. Due to the processing lag of up to 3h, if you requested an export between 15:45 and 19:00 covering data less than 3h old, you could have received an export with either no data or only a subset of the data.

Exports of transactional, in-app campaign, reachability and userbase data were not impacted.

No data was lost: if needed, you can trigger a new export for this time period to get the correct data.
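
If you would like to script this re-export, the sketch below shows the general shape of such a request. The endpoint URL, payload fields and authentication header are placeholders rather than the actual Export API contract, so please refer to the Export API documentation for the real parameters.

```python
import requests

# Placeholders: substitute your real API key and the endpoint/fields documented
# for the Export API. These names are illustrative, not the actual contract.
API_KEY = "YOUR_API_KEY"
EXPORT_ENDPOINT = "https://api.example.com/exports"  # placeholder URL


def retrigger_push_campaign_export(start_iso: str, end_iso: str) -> dict:
    """Request a fresh push campaign data export covering the impacted window."""
    response = requests.post(
        EXPORT_ENDPOINT,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"type": "push_campaign", "from": start_iso, "to": end_iso},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()


if __name__ == "__main__":
    # Re-export the window affected by the incident (April 23, 15:45-19:00 UTC).
    print(retrigger_push_campaign_export("2024-04-23T15:45:00Z", "2024-04-23T19:00:00Z"))
```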

Webhooks

Only the webhook event type push_campaign_sent was impacted. Due to the same processing lag, these events were delivered to your endpoints with a delay of up to 3h instead of in real time.

No data was lost: the events were eventually sent to your endpoints.
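
Because these events could arrive late, we recommend keying your processing on the timestamp carried inside the event payload rather than on the time of receipt. The minimal receiver below sketches that idea; the payload field names (event_type, sent_at) and the endpoint path are assumptions, not the documented webhook schema.

```python
from datetime import datetime, timezone

from flask import Flask, request

app = Flask(__name__)


@app.route("/webhooks/incoming", methods=["POST"])
def handle_event():
    event = request.get_json(force=True)
    # "event_type" and "sent_at" are assumed field names; use the fields
    # documented for your webhook integration.
    if event.get("event_type") == "push_campaign_sent":
        sent_at = datetime.fromisoformat(event["sent_at"].replace("Z", "+00:00"))
        delay = datetime.now(timezone.utc) - sent_at
        # During the incident this delay reached up to ~3h; ordering data by
        # sent_at rather than by receipt time keeps downstream reports correct.
        app.logger.info("push_campaign_sent received %.0fs late", delay.total_seconds())
    return "", 204


if __name__ == "__main__":
    app.run(port=8000)
```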

Conclusion

To prevent the same issue from happening again, we will mandate that a database server is always restarted when its network interface experiences a spurious restart, for any reason.

Restarting a database server is a routine maintenance operation, so this policy will have no negative impact and will make us more resilient to this kind of issue.
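
For illustration only (this is a sketch, not our actual tooling), the snippet below shows one way to detect such a spurious interface restart on Linux, by watching the standard carrier_changes counter exposed in sysfs; the interface name and the fact that it merely reports the flap instead of restarting anything are simplifying assumptions.

```python
from pathlib import Path


def carrier_changes(interface: str = "eth0") -> int:
    """Number of link state flips for the interface (standard Linux sysfs counter)."""
    return int(Path(f"/sys/class/net/{interface}/carrier_changes").read_text())


def interface_flapped(previous_count: int, interface: str = "eth0") -> bool:
    """True when the interface restarted (link flapped) since the last check.

    In the actual procedure this condition would trigger a controlled restart
    of the database server rather than just being reported.
    """
    return carrier_changes(interface) > previous_count
```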

Posted Apr 24, 2024 - 12:34 UTC

Resolved
This incident has been resolved.
Posted Apr 23, 2024 - 19:02 UTC
Monitoring
A fix has been implemented; our data processing pipelines are operating at full speed and catching up on the delays. We will continue monitoring the situation.
Posted Apr 23, 2024 - 18:28 UTC
Identified
The issue has been identified and we are currently applying a fix.
Posted Apr 23, 2024 - 18:17 UTC
Update
We are still working on identifying the issue. Data flowing through our pipelines is being processed, but at a slower rate than normal, which incurs delays of up to 1h30 in some cases.
Posted Apr 23, 2024 - 17:24 UTC
Investigating
We are aware of an issue impacting data processing for the webhook and export API features. We are currently investigating.
Posted Apr 23, 2024 - 16:45 UTC
This incident affected: Data (Exports).