Database outage

Incident Report for Batch

Postmortem

While we were conducting routine maintenance (cleaning up some stale data) on our Cassandra database cluster, part of one of our clusters started to consume a large amount of memory, which led to a partial collapse.

At first, we didn't suspect that this simple database operation was the cause, and we focused on getting the cluster back on its feet. Further investigation revealed that the outage was clearly due to human intervention.

During this outage:

the transactional and campaign API requests were still accepted
campaign creations and updates through our dashboard were still working
push notification delivery was significantly slowed down (for 28 min)
the Inbox service was unavailable for the whole incident (48 min)
almost all impacted notifications were sent, but with a delay of up to 1 h

Posted Apr 01, 2020 - 20:00 UTC

Resolved

This incident has been resolved.

Posted Apr 01, 2020 - 13:34 UTC

Monitoring

Our services are getting back to a nominal state. A postmortem will be added to this incident. We're monitoring the situation to ensure our quality of service.

Posted Apr 01, 2020 - 13:29 UTC

Update

Our delivery components are back to a nominal state. Our fix is still being implemented.

Posted Apr 01, 2020 - 13:25 UTC

Identified

We've identified the source of the issue and are currently implementing what's necessary to get back to a nominal state.

Posted Apr 01, 2020 - 13:19 UTC

Update

We've isolated a potentially problematic component from the rest of our services. Pushes are now being sent, but our delivery services may still have performance and reliability issues until we've fully resolved the issue.

Posted Apr 01, 2020 - 13:10 UTC

Update

So far, we're aware that all 3 of our delivery components are impacted, as well as our Inbox web service. We're continuing to investigate and will keep you informed as close to real time as possible.

Posted Apr 01, 2020 - 12:58 UTC

Update

We are continuing to investigate this issue.

Posted Apr 01, 2020 - 12:45 UTC

Investigating

We're aware of an outage on our databases. Our technical team is currently investigating and will keep you informed through this post.

Posted Apr 01, 2020 - 12:40 UTC

This incident affected: [OLD_STATUSPAGE_MODEL] Delivery (Transactional push, Push campaigns, In-app messaging) and Optional Services (Inbox).