Unavailability of our services
Incident Report for Batch
Postmortem

Summary

This incident relates to two separate downtimes:

  • The first one starting at 09:54 UTC, ending at 09:57 UTC.
  • The second one starting at 10:15 UTC, ending at 10:21 UTC.

First downtime

At 09:54 UTC, our monitoring system alerted us to an increase in the HTTP error rate on some of our HTTP endpoints, which we confirmed manually.

The immediate effect was that the dashboard, APIs and various webservices were unavailable.
This lasted until 09:57 UTC, when the error rate dropped and all services began to recover.

This was later diagnosed as a loss of connectivity in our private network, which prevented our HTTP servers from talking to our services.

Second downtime

When troubleshooting downtimes, we use a combination of external monitoring services and internal monitoring tools to get a sense of the global state of our services.
As described above, the private network was cut. Since our internal monitoring tools also rely on this network to do their job, they were unavailable, which meant that we didn't have a clear picture of the state of our services at that time.
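
The reasoning above can be sketched in code. The following is a minimal, hypothetical illustration, not the tooling actually used: the function name `assess_state`, the `ServiceState` values and the 5% threshold are all assumptions. It shows one way to keep "internal monitoring is unreachable" distinct from "services are degraded".

```python
from enum import Enum
from typing import Optional


class ServiceState(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNKNOWN = "unknown"


def assess_state(external_error_rate: float,
                 internal_error_rate: Optional[float],
                 threshold: float = 0.05) -> ServiceState:
    """Combine an external and an internal error-rate reading (0.0 to 1.0).

    internal_error_rate is None when the internal monitoring tools cannot
    be reached, for example because they share the failed network.
    The threshold is an assumed value, chosen only for illustration.
    """
    if external_error_rate > threshold:
        return ServiceState.DEGRADED
    if internal_error_rate is None:
        # External checks look fine, but there is no internal confirmation.
        return ServiceState.UNKNOWN
    if internal_error_rate > threshold:
        return ServiceState.DEGRADED
    return ServiceState.HEALTHY
```

With such a distinction, a missing internal signal is reported as unknown rather than being read as a degraded state.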

When the first downtime resolved itself, we still believed we were in a degraded operating state, so we decided to divert all HTTP traffic to a single cluster (instead of the two we usually use).
At 10:15 UTC, this diversion was completed. Unfortunately, a misconfiguration caused our HTTP servers to respond with errors instead of serving the requests.
Once we realized this, we rolled back the changes, and by 10:21 UTC all services had recovered.
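
A diversion like the one described above could be guarded so that a misconfiguration is caught before all traffic is affected. The following is a minimal sketch under stated assumptions, not the procedure actually used: `apply_step`, `error_rate` and `rollback` are hypothetical hooks, and the step sizes and error threshold are illustrative values.

```python
from typing import Callable, Sequence


def divert_with_guard(apply_step: Callable[[int], None],
                      error_rate: Callable[[], float],
                      rollback: Callable[[], None],
                      steps: Sequence[int] = (10, 50, 100),
                      max_error_rate: float = 0.05) -> bool:
    """Shift traffic to one cluster in stages, rolling back on errors.

    apply_step(p) routes p percent of traffic to the target cluster,
    error_rate() returns the observed HTTP error rate (0.0 to 1.0),
    rollback() restores the previous routing. All three are hypothetical
    hooks supplied by the caller.
    Returns True if the diversion completed, False if it was rolled back.
    """
    for percent in steps:
        apply_step(percent)
        if error_rate() > max_error_rate:
            # Stop at the first unhealthy stage and restore the old routing.
            rollback()
            return False
    return True
```

Staging the change this way limits the impact of a bad configuration to the first step instead of all requests.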

Conclusion

We will continue to improve our reliability and incident response procedures to further mitigate and reduce incidents like this one.

For this incident specifically, we identified two pain points that we will address:

  • We will design a more foolproof procedure to divert traffic to a single cluster.
  • We will improve the availability and reliability of our internal monitoring tools.
Posted Feb 17, 2021 - 16:53 UTC

Resolved
All our services are operating normally again.

We identified two separate downtimes related to this incident:
* The first one starting at 9:54 UTC, ending at 9:57 UTC.
* The second one starting at 10:15 UTC, ending at 10:21 UTC.

We're still investigating the root cause of this incident and will follow up with a postmortem once we have a better understanding of what happened.
Posted Feb 16, 2021 - 12:09 UTC
Monitoring
All services are available again. We continue to monitor the situation.
Posted Feb 16, 2021 - 10:38 UTC
Identified
We identified the issue and are currently working to recover all services.
Posted Feb 16, 2021 - 10:26 UTC
Update
We are continuing to investigate this issue.
Posted Feb 16, 2021 - 10:05 UTC
Investigating
We are aware that some of our services have been unavailable since 09:54 UTC. We are currently investigating and will update this incident when we know more.
Posted Feb 16, 2021 - 10:02 UTC
This incident affected: API (Transactional API, Campaigns API, Custom Data API, Transactional API for Partners), Dashboard (Main dashboard, Editorial dashboard), and Webservices (SDK, Inbox, Custom attributes, In-app messaging, Web push static resources).