Here are some details on the incident. All times are in UTC, 24-hour format.
On April 23 at around 15:45 our internal monitoring system alerted us that a data processing pipeline was starting to lag. This pipeline provides data to the Export API and Webhooks features; when it lags, the newest data is not immediately available to those features.
Our team immediately started to investigate, but finding the root cause proved difficult, and given the accumulating delays we decided to open this incident at around 16:45.
We continued to investigate and found the root cause at around 17:45: a subset of our database servers had a networking issue caused by a spurious restart of their network interfaces, which made them slow to respond to queries and in turn slowed our data processing pipeline. We worked around the issue by restarting the affected servers. After this operation the pipeline was in a good state again and started catching up on its lag at full speed.
At around 19:00 the pipeline finally caught up and was processing data in real time again; everything was back to normal.
Two features were impacted, the Export API and Webhooks, and only a subset of each.
Only the push campaign data exports were impacted. Due to the processing lag of up to 3 hours, if you requested an export between 15:45 and 19:00 for data less than 3 hours old, you may have received an export with either no data or only a subset of the data.
Exports of transactional, in-app campaign, reachability and userbase data were not impacted.
No data was lost: if needed you can trigger a new export for this time period to get the correct data.
Only the webhook event type push_campaign_sent was impacted. Due to the same processing lag, these events were sent with a delay of up to 3h to your endpoints instead of being sent in real time.
No data was lost: the events were eventually sent to your endpoints.
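Because the events were delayed rather than dropped, a webhook consumer that deduplicates by event id and records each event at its own timestamp (rather than the delivery time) handles this kind of delay transparently. Here is a minimal sketch of that pattern, assuming hypothetical `id` and `occurred_at` fields in the payload (check your actual webhook payload schema):

```python
def handle_event(payload: dict, seen_ids: set, state: dict) -> bool:
    """Apply a webhook event; return True if it was newly applied.

    Assumes each event carries an 'id' and an 'occurred_at' timestamp
    (hypothetical field names for illustration).
    """
    event_id = payload["id"]
    if event_id in seen_ids:
        # Duplicate delivery (e.g. a retry): ignore it.
        return False
    seen_ids.add(event_id)
    # Record the event at the time it occurred, not the time it was
    # delivered, so a push_campaign_sent event arriving 3h late is
    # still attributed to the correct moment.
    state[event_id] = payload["occurred_at"]
    return True
```

With this approach, a delayed delivery and a real-time delivery of the same event produce identical downstream state.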
To prevent the same issue from happening again in the future, we will mandate that a database server is always restarted when its network interface experiences a spurious restart, for any reason.
This is a routine maintenance operation, so it won’t have any negative impact and will make us more resilient to these kinds of issues.