Issue impacting dashboard analytics and data exports

Incident Report for Batch

Postmortem

Here are more details on the incident that started on 2022/02/16 at around 17:00 UTC.

While the incident was closed on 2022/02/21 at around 13:00 UTC, after a more complete analysis we have concluded that the incident had a less severe impact than we first thought.
Below are details on the final impact of the incident and the resolution time for each affected component.

Dashboard analytics

Dashboard analytics such as campaign analytics, the userbase page, and the debug page were unavailable or not up to date starting on 2022/02/16 at around 17:00 UTC.
After a fix was deployed, the dashboard analytics pipeline started catching up on data processing and was up to date again on 2022/02/17 at around 04:30 UTC.

Data exports

Only recurring exports for web applications were affected; recurring exports for mobile applications were not affected.

The impact depends on the kind of recurring export:

daily exports executed on the morning of 2022/02/17, which cover data from the previous day, might have some missing data.
hourly exports executed between 2022/02/17 at 02:00 UTC and 2022/02/17 at 15:00 UTC might have some missing data.

Starting on 2022/02/17 at 15:00 UTC, all recurring exports worked as expected with no missing data.
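
To check whether a given web application recurring export falls inside the windows above, something like the following sketch could be used. The helper name `may_have_missing_data` and the `kind` values are hypothetical and not part of any Batch API. The hourly window comes from this report; the report only says "morning" for daily exports, so the 00:00 to 12:00 UTC bounds used here are an assumption.

```python
from datetime import datetime, timezone

# Hourly window as stated in the report (UTC). The end is treated as
# exclusive because exports worked as expected starting at 15:00 UTC.
HOURLY_START = datetime(2022, 2, 17, 2, 0, tzinfo=timezone.utc)
HOURLY_END = datetime(2022, 2, 17, 15, 0, tzinfo=timezone.utc)

# ASSUMPTION: the report only says "morning" for daily exports;
# 00:00 to 12:00 UTC is a guess, not a documented boundary.
DAILY_START = datetime(2022, 2, 17, 0, 0, tzinfo=timezone.utc)
DAILY_END = datetime(2022, 2, 17, 12, 0, tzinfo=timezone.utc)


def may_have_missing_data(kind: str, executed_at: datetime) -> bool:
    """Return True if a web application recurring export run at
    `executed_at` falls inside a window where data might be missing."""
    if executed_at.tzinfo is None:
        raise ValueError("executed_at must be timezone-aware")
    executed_at = executed_at.astimezone(timezone.utc)
    if kind == "hourly":
        return HOURLY_START <= executed_at < HOURLY_END
    if kind == "daily":
        return DAILY_START <= executed_at < DAILY_END
    raise ValueError(f"unknown export kind: {kind!r}")
```

For example, `may_have_missing_data("hourly", datetime(2022, 2, 17, 9, 0, tzinfo=timezone.utc))` flags an hourly export run at 09:00 UTC, while a run at 15:00 UTC or later is not flagged. Exports for mobile applications were not affected and do not need this check.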

Timeline

When we opened the incident, it wasn't immediately obvious what the problem was; it took our response team some time to find the problematic table and service in our processing pipeline.
Unfortunately, even once we were confident we had found the source of the problem, it could not be fixed without some work by our data team, which delayed the incident resolution.

The root cause was our database cluster misbehaving when updating data in the table responsible for storing web applications export data.
This made the cluster unstable and unable to respond to queries reliably, so our various data processing pipelines were unable to make progress and stalled for some time.

Therefore, on 2022/02/17 at around 00:30 UTC, we made the following plan:

first, restore almost all services except for the web applications' recurring exports
then, the following morning, work on a new version of the recurring exports service that would work around the problem

This work was deployed on 2022/02/17 at around 15:00 UTC; after that, all recurring exports worked as expected.

Future

While finding the issue was not straightforward, after the incident resolution we looked back at our telemetry and found several things that hinted at a problem with this table.

In particular, a (seemingly innocuous) configuration flag on this table was responsible for this misbehaviour, which we didn't anticipate.
It is not a standard flag in our typical table configuration, so we will take steps to ensure that it can no longer be used on any of our database clusters.

Because it was the first time we saw this misbehaviour, it was difficult to understand what the problem was. Now that we have solved it and have various logs and metrics to look at, we will also work on improving the diagnostics process for our incident response team.

Posted Feb 23, 2022 - 12:41 UTC

Resolved

This incident has been resolved.
All dashboard analytics and recurring data exports are now working as expected.

Posted Feb 21, 2022 - 15:23 UTC

Identified

The issue has been identified and we're currently working to resolve it.

Posted Feb 17, 2022 - 00:18 UTC

Investigating

We are aware of an issue impacting the dashboard analytics (campaign analytics and the userbase page, for example). They might be slow to load or not show up at all.
This also impacts recurring data exports.
We are currently investigating.

Posted Feb 16, 2022 - 20:47 UTC

This incident affected: MEP Core Services (Dashboard) and Optional Services (Custom Exports).