Multiple processing delays
Incident Report for Batch
Postmortem

Here are some details about the incident.

The incident started on 2022/03/16 at 06:00 UTC and a fix was deployed around 09:00 UTC. Following a monitoring period, the incident was resolved at 11:12 UTC.

The incident impacted three major features. The impact on each feature is detailed below, followed by the root cause.

Trigger campaigns

During the incident, delivery of push notifications for trigger campaigns was significantly delayed, by up to ~1h40.
After the fix was implemented, some delayed push notifications were sent correctly, but push notifications delayed by more than 30 minutes were dropped.

This affected all running trigger campaigns.
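
To illustrate the drop rule described above, here is a minimal sketch of a staleness check applied to a backlog of delayed notifications. Only the 30-minute cutoff comes from this report; the function names, the backlog shape, and the choice to measure delay from the scheduled send time are assumptions made for the example, not a description of our actual pipeline.

```python
from datetime import datetime, timedelta, timezone

# The 30-minute cutoff is taken from the report; everything else is hypothetical.
MAX_DELAY = timedelta(minutes=30)


def should_send(scheduled_at, now, max_delay=MAX_DELAY):
    """Return True if the notification is delayed by at most max_delay."""
    return now - scheduled_at <= max_delay


def partition_backlog(backlog, now):
    """Split (notification_id, scheduled_at) pairs into ids to send and ids to drop."""
    to_send, dropped = [], []
    for notification_id, scheduled_at in backlog:
        if should_send(scheduled_at, now):
            to_send.append(notification_id)
        else:
            dropped.append(notification_id)
    return to_send, dropped


if __name__ == "__main__":
    now = datetime(2022, 3, 16, 9, 0, tzinfo=timezone.utc)
    backlog = [
        ("recent", now - timedelta(minutes=10)),
        ("stale", now - timedelta(minutes=100)),
    ]
    print(partition_backlog(backlog, now))
```

Under these assumptions, a notification delayed by exactly 30 minutes is still sent, and anything older is dropped.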

Custom audience ingestion

During the incident, ingestion of custom audiences was significantly delayed, by up to ~2h40.
This affected all custom audiences, regardless of identifier type.

Because of this delay, some custom audiences sent to our APIs were not usable in push or in-app campaigns until the fix was implemented and the delay was resolved.
After the fix was implemented, all delayed custom audiences were processed correctly and became usable in push or in-app campaigns.

Inbox

During the incident, the inbox processing pipeline was significantly delayed, by up to ~1h.

Because of this delay, the inbox in your application was not up to date with the latest push notifications until the fix was implemented and the delay was resolved.
After the fix was implemented, the inbox processing pipeline worked correctly again.

Root cause

The incident was caused by the custom audience ingestion pipeline. The peculiar way it works had an unforeseen consequence when processing a particularly large custom audience: it overloaded our database cluster, which is shared between the three features listed above.

While overloaded, the database cluster was unable to process the majority of requests for these services and pipelines, which caused the significant delays described above.

The fix we implemented is a good stopgap solution. In the future, we plan to change how custom audiences work to eliminate this overloading risk.
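
This report does not describe the stopgap fix or the planned redesign. Purely as an illustration of the overloading risk, the sketch below shows one common way to keep a very large ingestion from saturating a shared database: writing in bounded batches with a short pause between them. All names here (`ingest_audience`, `write_batch`, the batch size, and the pause) are hypothetical and do not reflect our actual pipeline.

```python
import time
from itertools import islice


def chunked(items, size):
    """Yield successive lists of at most `size` items from any iterable."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def ingest_audience(identifiers, write_batch, batch_size=1000,
                    pause_seconds=0.05, sleep=time.sleep):
    """Write identifiers in bounded batches, pausing between writes.

    `write_batch` is a hypothetical callable standing in for the database
    write. The pause leaves headroom for other workloads that share the
    same cluster. Returns the number of identifiers written.
    """
    written = 0
    for batch in chunked(identifiers, batch_size):
        write_batch(batch)
        written += len(batch)
        sleep(pause_seconds)
    return written


if __name__ == "__main__":
    batches = []
    total = ingest_audience((f"user-{i}" for i in range(2500)), batches.append)
    print(total, [len(b) for b in batches])
```

The trade-off in this kind of design is that a single large audience takes longer to ingest, in exchange for a bounded load on the shared cluster at any given moment.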

Posted Mar 16, 2022 - 16:23 UTC

Resolved
This incident has been resolved.
Posted Mar 16, 2022 - 11:12 UTC
Monitoring
A fix has been implemented; trigger campaign processing and custom audience ingestion are now working correctly.
Posted Mar 16, 2022 - 09:00 UTC
Identified
We have identified the issue and are implementing a fix.
Posted Mar 16, 2022 - 08:29 UTC
Investigating
We are aware of an issue with our trigger campaigns that is causing processing delays.
Our custom audience ingestion pipeline is facing the same issue and is also experiencing processing delays.

We are currently investigating.
Posted Mar 16, 2022 - 08:14 UTC
This incident affected: API (Custom Data API).