Database outage
Incident Report for Batch
Postmortem

While we were conducting routine maintenance (cleaning up some stale data) on our Cassandra database cluster, part of one of our clusters started to consume a large amount of memory, which led to a partial collapse.

At first, we didn't suspect that this simple database operation was the cause, and we focused on getting the cluster back on its feet. Further investigation revealed that the outage was clearly due to human intervention.

During this outage:

  • the transactional and campaign API requests were still accepted
  • campaign creations and updates through our dashboard were still working
  • push notification delivery was significantly slowed down (for 28 min)
  • the Inbox service was unavailable for the whole incident (48 min)
  • almost all impacted notifications were eventually sent, but with a delay of up to 1 h
Posted Apr 01, 2020 - 20:00 UTC

Resolved
This incident has been resolved.
Posted Apr 01, 2020 - 13:34 UTC
Monitoring
Our services are getting back to a nominal state. A postmortem will be added to this incident. We're monitoring the situation to ensure our quality of service.
Posted Apr 01, 2020 - 13:29 UTC
Update
Our delivery components are back to a nominal state. Our fix is still being implemented.
Posted Apr 01, 2020 - 13:25 UTC
Identified
We've identified the source of the issue and are currently implementing what's necessary to get back to a nominal state.
Posted Apr 01, 2020 - 13:19 UTC
Update
We've isolated a potentially problematic component from the rest of our services. Pushes are now being sent, but our delivery services may still have performance and reliability issues until we've fully resolved the issue.
Posted Apr 01, 2020 - 13:10 UTC
Update
So far, we're aware that delivery is impacted on all 3 components, as is our Inbox webservice. We're continuing to investigate and will keep you informed as close to real time as possible.
Posted Apr 01, 2020 - 12:58 UTC
Update
We are continuing to investigate this issue.
Posted Apr 01, 2020 - 12:45 UTC
Investigating
We're aware of an outage on our databases. Our technical team is currently investigating and will keep you informed through this post.
Posted Apr 01, 2020 - 12:40 UTC
This incident affected: Delivery (Transactional push, Push campaigns, In-app messaging) and Webservices (Inbox).