Unavailability of our services

Incident Report for Batch

Postmortem

Summary

This incident relates to two separate downtimes:

  • The first one starting at 09:54 UTC, ending at 09:57 UTC.
  • The second one starting at 10:15 UTC, ending at 10:21 UTC.

First downtime

At 09:54 UTC, our monitoring system alerted us to an increase in the HTTP error rate on some of our HTTP endpoints, which we confirmed manually.

The immediate effect was that the dashboard, APIs and various web services were unavailable.
This lasted until 09:57 UTC when the error rate dropped and all services began to recover.

This was later diagnosed as a loss of connectivity in our private network, which prevented our HTTP servers from reaching our services.

Second downtime

When troubleshooting downtimes, we use a combination of external monitoring services and internal monitoring tools to get a sense of the global state of our services.
As described above, the private network was cut. Since our internal monitoring tools also rely on this network, they were unavailable too, which meant that we did not have a clear picture of our services' state at the time.
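
The difference between a monitoring source that reports failures and one that cannot report at all can be sketched in a few lines. This is only an illustration under assumed names (State, assess_state and the two boolean inputs are hypothetical, not part of our tooling): a probe result of None stands for "the monitoring source itself was unreachable".

```python
from enum import Enum
from typing import Optional


class State(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNKNOWN = "unknown"


def assess_state(external_ok: Optional[bool], internal_ok: Optional[bool]) -> State:
    """Combine two monitoring sources into one overall state.

    None means the source could not report at all (for example because
    the network it depends on is down), which is not the same as the
    source reporting a failure.
    """
    # Keep only the sources that actually produced a result.
    reports = [r for r in (external_ok, internal_ok) if r is not None]
    if not reports:
        # No evidence either way.
        return State.UNKNOWN
    if all(reports):
        # Every source that could report says the services are fine.
        return State.HEALTHY
    # At least one source that could report saw a failure.
    return State.DEGRADED
```

With this rule, a silent internal monitoring source combined with a healthy external one does not by itself count as a degraded state.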

When the first downtime resolved itself, we still believed we were in a degraded operating state, so we decided to divert all HTTP traffic to a single cluster (instead of the two we usually use).
At 10:15 UTC, this diversion was completed. Unfortunately, a misconfiguration caused our HTTP servers to respond with errors instead of serving the requests.
Once we realized this, we rolled back the changes, and by 10:21 UTC all services had recovered.
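
The apply, verify and roll back sequence described above is the kind of step that can be guarded automatically. The following is a minimal sketch, not our actual procedure: divert_traffic and its parameters are hypothetical, and the three callables stand in for whatever applies the new routing configuration, restores the previous one, and sends a test request that returns an HTTP status code.

```python
from typing import Callable


def divert_traffic(
    apply_config: Callable[[], None],
    rollback: Callable[[], None],
    probe: Callable[[], int],
    attempts: int = 5,
    max_failures: int = 0,
) -> bool:
    """Apply a traffic diversion, probe it, and roll back if probes fail.

    Returns True if the diversion was kept, False if it was rolled back.
    """
    apply_config()

    # Send a fixed number of test requests and count server-side errors.
    failures = 0
    for _ in range(attempts):
        status = probe()
        if status >= 500:
            failures += 1

    if failures > max_failures:
        # The new configuration is serving errors: restore the previous one.
        rollback()
        return False
    return True
```

In this sketch the rollback is triggered by the probes themselves rather than by someone noticing the errors, which shortens the window during which a misconfiguration is live.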

Conclusion

We will continue to improve our reliability and incident response procedures to further mitigate and reduce incidents like this one.

Specifically related to this incident, we identified two pain points that we will address:

  • We will design a more foolproof procedure for diverting traffic to a single cluster.
  • We will improve the availability and reliability of our internal monitoring tools.
Posted Feb 17, 2021 - 16:53 UTC

Resolved

All our services are operating normally again.

We identified two separate downtimes related to this incident:
* The first one starting at 9:54 UTC, ending at 9:57 UTC.
* The second one starting at 10:15 UTC, ending at 10:21 UTC.

We're still investigating the root cause of this incident and will follow up with a postmortem once we have a better understanding of what happened.
Posted Feb 16, 2021 - 12:09 UTC

Monitoring

All services are available again. We continue to monitor the situation.
Posted Feb 16, 2021 - 10:38 UTC

Identified

We identified the issue and are currently working to recover all services.
Posted Feb 16, 2021 - 10:26 UTC

Update

We are continuing to investigate this issue.
Posted Feb 16, 2021 - 10:05 UTC

Investigating

We are aware that some of our services have been unavailable since 09:54 UTC. We are currently investigating and will update this incident when we know more.
Posted Feb 16, 2021 - 10:02 UTC
This incident affected: MEP Core Services (Dashboard, In-app delivery, Data ingestion), Optional Services (Editorial dashboard, Inbox), [OLD_STATUSPAGE_MODEL] Webservices (Custom attributes, Web push static resources), and API ([MEP] Campaigns API, [MEP] Transactional API, [MEP] Custom Data API, [MEP] Transactional API for Partners).