Here are some details about the incident.
The incident started on 2022/03/16 at 06:00 UTC, and a fix was deployed around 09:00 UTC. Following a monitoring period, the incident was resolved at 11:12 UTC.
The incident impacted 3 major features; the impact on each feature is detailed below.
We will also talk about the root cause at the end.
During the incident, delivery of push notifications for trigger campaigns was significantly delayed, by up to ~1h40.
After the fix was implemented, some delayed push notifications were sent correctly, but push notifications delayed by more than 30 min were dropped.
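As an illustration only, the drop rule described above can be sketched as follows. This is not the production code: the function name, the parameter names, and the assumption that the delay is measured from the originally scheduled send time are all hypothetical. Only the 30-minute threshold comes from this report.

```python
from datetime import datetime, timedelta, timezone

# Threshold taken from the report: notifications delayed by more than
# 30 minutes were dropped instead of being sent.
MAX_DELAY = timedelta(minutes=30)


def should_send(scheduled_at: datetime, now: datetime,
                max_delay: timedelta = MAX_DELAY) -> bool:
    """Return True if a delayed notification is still recent enough to send.

    Hypothetical helper: assumes the delay is the time elapsed between the
    scheduled send time and the moment the notification is finally processed.
    """
    return now - scheduled_at <= max_delay


# Example: processing resumes at 09:00 UTC on the day of the incident.
resumed_at = datetime(2022, 3, 16, 9, 0, tzinfo=timezone.utc)
late_by_20_min = resumed_at - timedelta(minutes=20)
late_by_60_min = resumed_at - timedelta(minutes=60)

print(should_send(late_by_20_min, resumed_at))  # sent
print(should_send(late_by_60_min, resumed_at))  # dropped
```

Under these assumptions, a notification scheduled 20 minutes before processing resumed would still be sent, while one scheduled an hour earlier would be dropped.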
This affected all running trigger campaigns.
During the incident, the ingestion of custom audiences was significantly delayed, by up to ~2h40.
This affected all custom audiences, regardless of identifier type.
This delay meant that some custom audiences that were sent to our APIs were not usable in push or in-app campaigns until after the fix was implemented and the delay resolved.
After the fix was implemented, all delayed custom audiences were processed correctly and became usable in push or in-app campaigns.
During the incident, the inbox processing pipeline was significantly delayed, by up to ~1h.
This delay meant that the inbox in your application was not up to date with the latest push notifications until after the fix was implemented and the delay resolved.
After the fix was implemented, the inbox processing pipeline worked correctly again.
The incident was caused by the custom audience ingestion pipeline. Its design is unusual, and this had an unforeseen consequence while processing a particularly large custom audience: it overloaded our database cluster, which is shared between the 3 features listed above.
While overloaded, the database cluster was unable to process the majority of requests for these services and pipelines, which caused the significant delays we saw.
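To give an intuition for how such delays build up, here is a deliberately simplified model. It assumes a single first-in, first-out queue with constant arrival and processing rates, which is not a description of the real system, and the rates used in the example are made-up values, not measurements from the incident.

```python
def delay_at_end_of_overload(overload_minutes: float,
                             arrival_rate: float,
                             service_rate: float) -> float:
    """Delay (in minutes) of the request being served when the overload ends.

    Simplified FIFO model with constant rates (requests per minute):
    after T minutes, service_rate * T requests have been served, and the
    request now at the head of the queue arrived at time
    service_rate * T / arrival_rate, so it has waited
    T * (1 - service_rate / arrival_rate).
    """
    if arrival_rate <= 0:
        raise ValueError("arrival_rate must be positive")
    if service_rate >= arrival_rate:
        # The cluster keeps up with incoming requests: no backlog builds up.
        return 0.0
    return overload_minutes * (1 - service_rate / arrival_rate)


# Hypothetical numbers: a 3-hour overload during which the cluster can
# only process half of the incoming requests.
print(delay_at_end_of_overload(180, arrival_rate=100, service_rate=50))
```

In this model, the longer the overload lasts and the smaller the share of requests the cluster can still process, the larger the accumulated delay.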
The fix we implemented is a good stopgap solution; in the future, we plan to change how custom audiences work to eliminate this overloading risk.