Significant processing delays and problems with digital twin APIs

Incident Report for Aloxy IIoT Hub - EU3

Postmortem

We are sorry for any inconvenience caused. We trust that this Post Incident Review (PIR) provides you with sufficient information to reassure you that we are actively working to improve. If you require further details, please do not hesitate to contact our support team.

Executive Summary

Between approximately 13:00 and 14:00 UTC on 2025-04-16, we were unable to process any incoming sensor messages because our digital twin was accidentally running with a faulty development configuration, causing significant processing delays. Almost no messages were lost while the digital twin was offline, as they accumulated in a backlog; however, this incident triggered two other incidents that did result in major message loss: https://status.aloxy.io/incidents/7c75j3wntlml and https://status.aloxy.io/incidents/hwll7v3znctq.

Leadup

When development of a new update to our digital twin deployment was finished, we staged it for testing in our acceptance/staging environment.

Fault

A bug in the deployment of our production cluster caused part of this configuration to also be deployed to production (~13:00 UTC 2025-04-16). The partial configuration misconfigured our digital twin, resulting in API errors and a failure to process any incoming messages.
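
To make the failure mode concrete, here is a simplified sketch of how a partial development overlay can clobber the connection and authentication settings a production component depends on. The key names and structure below are illustrative, not our actual deployment files:

    # Hypothetical sketch of a partial config overlay clobbering production settings.
    # Key names ("broker", "auth", ...) are illustrative, not our actual schema.
    production_config = {
        "broker": {"host": "broker.prod.internal", "port": 5671},
        "auth": {"username": "twin-prod", "password_secret": "twin-prod-credentials"},
        "twin": {"search_index": "enabled"},
    }

    # Only a fragment of the development configuration reaches production. In the
    # new layout the connection and credential settings live elsewhere, so the
    # fragment overwrites those sections with empty values.
    partial_dev_overlay = {
        "broker": {},
        "auth": {},
        "twin": {"search_index": "enabled", "log_level": "debug"},
    }

    effective = {**production_config, **partial_dev_overlay}

    # The twin starts without a broker connection or credentials and can no longer
    # consume messages: API errors and a halted processing pipeline.
    assert effective["broker"] == {} and effective["auth"] == {}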

Detection

After the faulty config was deployed to production (~13:00 UTC 2025-04-16), we received alerts indicating that processing was significantly delayed. Soon after, we received additional alerts indicating that our processing pipeline had fully halted.
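
These alerts behave roughly like the sketch below: one threshold on end-to-end processing delay and a second on the time since the last message was consumed. The metric names and thresholds are made up for illustration and are not our actual alerting rules:

    import time

    # Illustrative thresholds only.
    DELAY_WARNING_SECONDS = 300   # "processing significantly delayed"
    HALT_WINDOW_SECONDS = 120     # nothing consumed at all -> "pipeline halted"

    def evaluate_alerts(oldest_unprocessed_ts: float, last_consumed_ts: float) -> list[str]:
        """Return the alerts that should fire for the given pipeline state."""
        now = time.time()
        alerts = []
        if now - oldest_unprocessed_ts > DELAY_WARNING_SECONDS:
            alerts.append("ProcessingDelayed")
        if now - last_consumed_ts > HALT_WINDOW_SECONDS:
            alerts.append("ProcessingHalted")
        return alerts

    # Example: messages have been waiting 10 minutes and nothing was consumed for 3.
    print(evaluate_alerts(time.time() - 600, time.time() - 180))
    # -> ['ProcessingDelayed', 'ProcessingHalted']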

Root causes

  • Why did a development config get deployed on production?

    • Some production components were configured to listen to development changes.
  • Why was the production config configured this way?

    • This was a misconfiguration: a deployment bug introduced with the 1.20 deployment rewrite that was only caught now, because the affected component does not receive updates on a regular basis (a simplified illustration follows after this list).
  • Why was a partially incorrect config so detrimental to message processing?

    • The deployed component contained authentication and connection configuration, which had been migrated to a different location as part of an effort to simplify our deployment files. Applying the partial configuration eventually caused our digital twin to crash completely and stop consuming any new messages received by our ingress.
  • Why did it take so long for the misconfiguration in the deployment files to be noticed?

    • The misconfigured component rarely receives updates, and when it does, those updates usually do not affect the processing flow. As a result, the misconfiguration went unnoticed for a long time.
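
The sketch below illustrates the shape of the misconfiguration behind the first two answers. The names are hypothetical rather than our real deployment references: the component's config source was pinned to the development environment instead of being derived from the environment it runs in:

    import os

    # Hypothetical config sources; the real deployment references differ.
    CONFIG_SOURCES = {
        "development": "deployments/overlays/development",
        "production": "deployments/overlays/production",
    }

    def watched_source_buggy() -> str:
        # The 1.20-style bug: the environment is pinned, so a production
        # component also picks up development changes.
        return CONFIG_SOURCES["development"]

    def watched_source_fixed() -> str:
        # The source should follow the environment the component runs in.
        return CONFIG_SOURCES[os.environ.get("DEPLOY_ENV", "production")]

    print(watched_source_buggy())  # deployments/overlays/development  <- wrong in production
    print(watched_source_fixed())  # deployments/overlays/production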

Mitigation and resolution

The change had already been verified to work in our development clusters, which left us with two options to mitigate the problem: roll back or roll forward the misconfigured deployment. As rolling back would take longer and could introduce undefined behavior, we opted to roll the change forward and fix the misconfiguration afterwards. After updating and restarting the digital twin, we noticed that our message backlog was not decreasing. This was the result of two things:

  1. By rolling forward, we unknowingly also updated our message broker, which temporarily lowered the rate at which the digital twin could accept messages. This was resolved within 5 minutes, but it triggered incidents https://status.aloxy.io/incidents/7c75j3wntlml and https://status.aloxy.io/incidents/hwll7v3znctq.
  2. The accrued backlog was large, and processing it at full speed caused some parts of the digital twin to crash. We fixed this by temporarily increasing the resources available to the digital twin; a sketch of another way to limit this kind of pressure is shown below.

After all this, the backlog was fully processed.
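
For illustration, the sketch below shows one generic way to protect a consumer while draining a large backlog: bounding how many messages are processed concurrently. This is a hypothetical example rather than our production code; in this incident we addressed the pressure by scaling resources instead:

    import asyncio

    MAX_IN_FLIGHT = 100  # illustrative bound; would be tuned to the twin's capacity

    async def process(message: bytes) -> None:
        # Placeholder for updating the digital twin from one sensor message.
        await asyncio.sleep(0.01)

    async def drain_backlog(backlog: list[bytes]) -> None:
        """Work through a large backlog without overwhelming downstream components."""
        semaphore = asyncio.Semaphore(MAX_IN_FLIGHT)

        async def guarded(message: bytes) -> None:
            async with semaphore:
                await process(message)

        await asyncio.gather(*(guarded(m) for m in backlog))

    if __name__ == "__main__":
        asyncio.run(drain_backlog([b"sensor-reading"] * 10_000))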

Follow-Ups

  • Check if other components also listen to the incorrect environment
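
A possible starting point for this check is sketched below: scan the production deployment files for references to development configuration sources. The paths and markers are placeholders for illustration only:

    from pathlib import Path

    # Placeholder locations; the real deployment repository layout differs.
    PRODUCTION_DIR = Path("deployments/production")
    DEVELOPMENT_MARKERS = ("overlays/development", "env: development", "dev-cluster")

    def components_listening_to_development(root: Path = PRODUCTION_DIR) -> list[Path]:
        """List production deployment files that reference development sources."""
        offenders = []
        for path in root.rglob("*.yaml"):
            text = path.read_text(errors="ignore")
            if any(marker in text for marker in DEVELOPMENT_MARKERS):
                offenders.append(path)
        return offenders

    if __name__ == "__main__":
        for path in components_listening_to_development():
            print(f"production component references a development source: {path}")
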
Posted Apr 24, 2025 - 11:58 UTC

Resolved

The problem has not reoccurred for 15 minutes and processing delays are back to normal.
Posted Apr 16, 2025 - 14:15 UTC

Monitoring

We have processed the full backlog and everything seems operational again. We'll continue monitoring for a while to be sure the problem does not recur.
Posted Apr 16, 2025 - 14:00 UTC

Identified

It seems the large backlog is overstressing some of our internal components, which again is increasing processing delays.
Posted Apr 16, 2025 - 13:44 UTC

Monitoring

We have applied a patch and processing appears to be continuing again. We'll continue monitoring until all backlogged messages are processed.
Posted Apr 16, 2025 - 13:37 UTC

Identified

We have identified the issue and are working on a patch.
Posted Apr 16, 2025 - 13:11 UTC

Investigating

We noticed increased delays and failing Ditto APIs; we are investigating why.
Posted Apr 16, 2025 - 13:08 UTC
This incident affected: Digital twin (Digital twin search, Digital twin API, Digital twin streaming) and Processing (MVPI processing, History processing).