Here are more details on the incident that started on 2022/02/16 at around 17:00 UTC.
While the incident was closed on 2022/02/21 at around 13:00 UTC, a more complete analysis has led us to conclude that it had a less severe impact than we first thought.
The following are details about the definitive impact of the incident and the resolution times for each affected component.
Dashboard analytics, such as campaign analytics, the userbase page, and the debug page, were unavailable or not up to date starting on 2022/02/16 at around 17:00 UTC.
After a fix was deployed, the dashboard analytics pipeline began catching up on data processing and was fully up to date on 2022/02/17 at around 04:30 UTC.
Only recurring exports for web applications were affected; no mobile applications' recurring exports were affected.
The impact depends on the kind of recurring export:
Starting on 2022/02/17 at 15:00 UTC, all recurring exports worked as expected with no missing data.
When we opened the incident, it wasn't immediately obvious what the problem was; it took our response team some time to find the problematic table and service in our processing pipeline.
Unfortunately, even once we were confident we had found the source of the problem, it wasn't fixable without some work by our data team, which delayed the incident resolution.
The root cause was our database cluster misbehaving when updating data in the table responsible for storing web applications export data.
This made the cluster unstable and unable to reliably respond to queries, which meant our various data processing pipelines were unable to make progress and stalled for some time.
Therefore, on 2022/02/17 at around 00:30 UTC, we made the following plan:
This work was deployed on 2022/02/17 at around 15:00 UTC; after that, all recurring exports worked as expected.
While finding the issue was not straightforward, after the incident resolution we looked back at our telemetry and found several things which hinted at a problem with this table.
In particular, a (seemingly innocuous) configuration flag on this table was responsible for this misbehaviour, which we didn't anticipate.
It is not a standard flag in our typical table configuration, so we will take steps to ensure that it can no longer be used on any of our database clusters.
Because it was the first time we saw this misbehaviour, it was difficult to understand what the problem was; now that we have solved it and have various logs and metrics to look at, we will also work on improving the diagnostics process for our incident response team.