Push delivery issues for automations with a targeting using the installation date
Incident Report for Batch
Postmortem

Timeline

All times are in UTC and 24h time.

On September 27 at around 10:00 we received multiple reports of significant drops in the number of push notifications sent by some automations. The affected automations all used targeting conditions based on the installation date. After confirming that these conditions were indeed not working properly, we launched an investigation.

At around 13:00 we rolled back a change that had been applied to a service on September 12, which we suspected could be responsible for the behaviour we were seeing. We continued investigating to find the root cause.

On September 28 at around 07:00 we determined that the rollback had fixed the problem; we were still unsure why and continued to investigate.

At around 08:40 we realised that the installation date had simply been deleted for all users who had been active since September 12. This gave us a clue as to the root cause: had everything been operating normally, this data should never have been deleted.

At around 12:00 we found the root cause, which we explain below.

At around 16:00 we devised a plan to restore the lost data. Over the following days we implemented, tested and validated this plan.

On October 3 at around 16:00 we launched the data restoration script.

On October 6 the script finished its execution.

Impact

Between September 12 and October 6, any automation whose targeting used conditions based on the installation date could have sent fewer notifications than expected because of the missing data.
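To illustrate why missing data reduces the number of notifications sent, here is a minimal sketch of a targeting condition based on the installation date. It is not our actual targeting engine: the function name, the "installed at least 7 days ago" condition and the rule that a user without an installation date never matches are all assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def matches_install_condition(
    installed_at: Optional[datetime], now: datetime, min_age: timedelta
) -> bool:
    """Hypothetical condition: the app was installed at least `min_age` ago."""
    if installed_at is None:
        # Assumption: without an installation date the condition cannot be
        # evaluated, so the user is left out of the targeting.
        return False
    return now - installed_at >= min_age


now = datetime(2023, 9, 27, 10, 0, tzinfo=timezone.utc)
users = {
    "user_a": datetime(2023, 9, 1, tzinfo=timezone.utc),  # date still stored
    "user_b": None,  # installation date was deleted
}

targeted = [
    user_id
    for user_id, installed_at in users.items()
    if matches_install_condition(installed_at, now, timedelta(days=7))
]
print(targeted)
```

In this sketch only user_a is targeted, even though user_b may have installed the app just as long ago.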

After we rolled back the service on September 27 at around 13:00, the installation date was no longer deleted and these automations slowly started to send more notifications, depending on the exact targeting conditions.

The data restoration script managed to restore a majority of the missing data for active users.

Root cause

The root cause is fairly technical but ultimately comes down to two things: a misconfigured service and a surprising behaviour from our Cassandra database.

Due to a database migration that we are currently rolling out, the service responsible for storing the installation date was misconfigured and, in some specific cases, started to issue deletions for this data instead of leaving it untouched.

This misconfiguration was completely silent and not easily observable, which helps explain why it went unnoticed for almost two weeks. In addition, we didn’t anticipate this failure mode: we didn’t expect a benign operation to turn into a deletion simply by changing a configuration flag.
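This report does not detail the exact mechanism, so the following is only a sketch of one well-known way such a failure can occur. In Cassandra, writing an explicit null to a column creates a tombstone, which deletes any existing value. The toy in-memory store below mimics that rule. The class, the function and the `send_unknown_as_null` flag are hypothetical names invented for this illustration and are not taken from our services.

```python
from typing import Dict, Optional


class ToyColumnStore:
    """In-memory stand-in for a store where writing None deletes a column."""

    def __init__(self) -> None:
        self.rows: Dict[str, Dict[str, str]] = {}

    def upsert(self, key: str, columns: Dict[str, Optional[str]]) -> None:
        row = self.rows.setdefault(key, {})
        for name, value in columns.items():
            if value is None:
                row.pop(name, None)  # a null write acts as a deletion
            else:
                row[name] = value  # a normal write overwrites the value


def save_activity(
    store: ToyColumnStore,
    user_id: str,
    last_seen: str,
    install_date: Optional[str],
    send_unknown_as_null: bool,  # hypothetical configuration flag
) -> None:
    columns: Dict[str, Optional[str]] = {"last_seen": last_seen}
    if install_date is not None:
        columns["install_date"] = install_date
    elif send_unknown_as_null:
        # With the flag on, "nothing to update" becomes an explicit null write.
        columns["install_date"] = None
    store.upsert(user_id, columns)


store = ToyColumnStore()
save_activity(store, "u1", "2023-09-01", "2023-09-01", send_unknown_as_null=False)
save_activity(store, "u1", "2023-09-11", None, send_unknown_as_null=False)
before = dict(store.rows["u1"])  # installation date is still present

save_activity(store, "u1", "2023-09-12", None, send_unknown_as_null=True)
after = dict(store.rows["u1"])  # installation date has been deleted
print(before, after)
```

The calling code is identical in both cases. Only the flag differs, which is why a failure of this kind produces no error and is hard to observe.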

Since detecting this bug, we’ve made a number of changes to our deployment processes. First, we’ve modified our services so that this misconfiguration is no longer possible. Second, we’ve modified our standard database configuration to prevent its surprising behaviour.

This should ensure this failure mode doesn’t happen again.

Posted Oct 20, 2023 - 12:50 UTC

Resolved
We've experienced push delivery issues for automations with a targeting using the installation date.
You can find the complete details in the postmortem.
Posted Sep 27, 2023 - 13:00 UTC