This incident relates to two separate downtimes:
At 09:54 UTC, our monitoring system alerted us to an increase in the HTTP error rate on some of our HTTP endpoints, which we confirmed manually.
The immediate effect was that the dashboard, APIs, and various web services were unavailable.
This lasted until 09:57 UTC, when the error rate dropped and all services began to recover.
The cause was later diagnosed as a loss of connectivity in our private network, which prevented our HTTP servers from communicating with our services.
When troubleshooting downtime, we use a combination of external monitoring services and internal monitoring tools to get a global view of the state of our services.
As noted above, the private network was down, and since our internal monitoring tools also rely on that network, they were unavailable; this meant we did not have a clear picture of the state of our services at the time.
When the first downtime resolved itself, we still believed we were in a degraded operating state, so we decided to divert all HTTP traffic to a single cluster (instead of the two we usually use).
At 10:15 UTC, this diversion was completed; unfortunately, a misconfiguration caused our HTTP servers to respond with errors instead of serving requests.
Once we realized this, we rolled back the change, and by 10:21 UTC all services had recovered.
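To illustrate what "diverting all HTTP traffic to a single cluster" can look like, here is a minimal, hypothetical sketch in Go; it is not our actual configuration, and the cluster address is made up. It shows a front-end proxy pointed at one upstream instead of two, where a small mistake in that change, such as a wrong address or port, is enough to turn every request into an error.

    package main

    import (
    	"log"
    	"net/http"
    	"net/http/httputil"
    	"net/url"
    )

    func main() {
    	// Hypothetical single-cluster diversion: every incoming request is
    	// forwarded to one cluster's front end instead of being spread
    	// across two clusters. The address below is invented for illustration.
    	upstream, err := url.Parse("http://cluster-a.internal:8080")
    	if err != nil {
    		log.Fatal(err)
    	}

    	// If the upstream address, port, or scheme is wrong, the proxy
    	// answers with errors instead of serving the requests.
    	proxy := httputil.NewSingleHostReverseProxy(upstream)

    	log.Fatal(http.ListenAndServe(":80", proxy))
    }
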
We will continue to improve our reliability and incident response procedures to reduce the frequency and impact of incidents like this one.
Specifically related to this incident, we identified two pain points: