Major message loss

Incident Report for Aloxy IIoT Hub - EU3

Postmortem

We are sorry for any inconvenience caused. We trust that this Post Incident Review (PIR) provides you with sufficient information to reassure you that we are actively working to improve. If you require further details, please do not hesitate to contact our support team.

Executive Summary

A previous incident caused our message ingress to stop storing messages for most tenants. This resulted in major message loss between ~13:30 UTC and 21:24 UTC on 2025-04-16: positions and calibration data were not received or processed by the IIoT Hub during that window. All valve positions should have been corrected and updated before ~23:30 UTC on 2025-04-16.

The same problem recurred between 18:00 UTC on 2025-04-17 and 07:55 UTC on 2025-04-18.

Leadup

https://status.aloxy.io/incidents/7swhg6wdf543

Fault

During the roll-forward of the digital twin, we accidentally updated our message broker. This update caused a brief disconnect (around 5 minutes) between the broker and our ingress services. A bug in the ingress service then left most tenant connections permanently in a broken state, which resulted in message loss.
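
The ingress code itself is not public, so the following is a minimal, hypothetical Python sketch of that class of bug (all names and disconnect reasons are invented): a disconnect handler that treats certain disconnect reasons as unrecoverable and latches the tenant connection into a broken state instead of retrying.

    import time


    class TenantConsumer:
        """Hypothetical per-tenant consumer inside an ingress service."""

        def __init__(self, tenant_id, broker_client):
            self.tenant_id = tenant_id
            self.broker = broker_client
            self.broken = False

        def on_disconnect(self, reason):
            # BUG (the failure mode described above): certain disconnect
            # reasons are treated as unrecoverable, so a brief broker restart
            # leaves the tenant permanently broken until a service restart.
            if reason in ("server_shutdown", "connection_reset"):
                self.broken = True  # never retried again
            else:
                self.reconnect()

        def reconnect(self, max_attempts=5, delay_s=2.0):
            # What the handler should do instead: retry with backoff for
            # every disconnect reason rather than latching into a broken state.
            for attempt in range(max_attempts):
                try:
                    self.broker.connect()
                    self.broker.subscribe(f"tenants/{self.tenant_id}/#")
                    self.broken = False
                    return
                except ConnectionError:
                    time.sleep(delay_s * (attempt + 1))
            self.broken = True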

Detection

We detected the problem via an alert that was triggered outside our office hours.

Root causes

  • Why was the message broker updated?

    • During the roll-forward, we accidentally added our message broker to the update queue.
  • Why did the ingress service not automatically recover?

    • Our ingress service has a bug that prevents it from recovering from certain errors.
  • Why was the response so slow?

    • We only received an alert a few hours after the issue started.
  • Why did the alerts trigger so late?

    • The fast-triggering alert did not fire because the ingress service was still reporting that it was successfully receiving messages. We do have slower, long-window alerts for these cases (see the sketch after this list for an example of a faster check).
  • Why was this problem not caught right after the previous incident?

    • We did not re-check whether the ingress was affected after that incident, because we had already verified that the ingress was working while the first incident was ongoing.

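As a hypothetical illustration of a faster-triggering alert (see also the follow-ups below), a check could compare what the broker delivered per tenant against what the ingress actually stored, instead of trusting the ingress's own self-reporting. The function name, metric sources, and threshold below are assumptions for the sketch, not our actual monitoring configuration.

    def check_ingress_lag(broker_delivered, ingress_stored, max_loss_ratio=0.05):
        """Return tenants whose stored-message count lags far behind delivery.

        broker_delivered / ingress_stored: dicts mapping tenant_id -> message
        count observed over a short evaluation window (e.g. 5 minutes).
        """
        lagging = []
        for tenant_id, delivered in broker_delivered.items():
            if delivered == 0:
                continue  # nothing to ingest for this tenant in the window
            stored = ingress_stored.get(tenant_id, 0)
            loss_ratio = 1.0 - (stored / delivered)
            if loss_ratio > max_loss_ratio:
                lagging.append((tenant_id, loss_ratio))
        return lagging


    # Example: tenant "b" stored nothing despite 120 delivered messages,
    # so it would trip the alert within a single evaluation window.
    if __name__ == "__main__":
        delivered = {"a": 100, "b": 120}
        stored = {"a": 98, "b": 0}
        print(check_ingress_lag(delivered, stored))  # -> [('b', 1.0)]
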
Mitigation and resolution

A restart of the ingress service (around 21:24 UTC) resulted in all messages being accepted again.

Later, we prevented further recurrences by changing a setting in our message broker so that it no longer causes these disconnects.

Follow-Ups

  • Reproduce the issue and create an upstream fix
  • Add additional alerts that trigger faster
  • Revise Incident Playbooks
Posted Apr 24, 2025 - 11:59 UTC

Resolved

The previous incident triggered another incident, which resulted in major message loss between 13:30 UTC and 21:30 UTC for almost all tenants. This incident was resolved at 21:30 UTC, which means all positions should have been corrected before 23:30 UTC.

A Post Incident Review (PIR) for both incidents will be published at a later date.
Posted Apr 16, 2025 - 13:30 UTC