We are sorry for any inconvenience caused. We trust that this Post Incident Review (PIR) provides you with sufficient information to reassure you that we are actively working to improve. If you require further details, please do not hesitate to contact our support team.
A previous incident caused our message ingress to stop storing messages for most tenants. This led to major message loss between ~13:30 UTC and 21:24 UTC on 2024-04-16: valve positions and calibration data were not received or processed by the IIoT Hub during that window. All valve positions should have been corrected and updated by ~23:30 UTC on 2024-04-16.
The same problem recurred between 18:00 UTC on 2024-04-17 and 07:55 UTC on 2024-04-18.
https://status.aloxy.io/incidents/7swhg6wdf543
During the roll-forward of the digital twin, we accidentally updated our message broker. The update caused a brief disconnect (around 5 minutes) between the broker and our ingress services. A bug in the ingress service then left most tenant connections permanently in a broken state, which resulted in message loss.
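The failure pattern behind this is worth illustrating. The sketch below is a hypothetical Python example (the class, the per-tenant connection handling, and the dial callback are assumptions for illustration, not our actual ingress code): if a disconnect handler only marks a tenant connection as broken and nothing ever re-dials the broker, a brief broker restart leaves that tenant permanently disconnected until the whole service is restarted.

```python
import random
import time


class TenantConnection:
    """Hypothetical per-tenant broker connection (illustration only)."""

    def __init__(self, tenant_id, dial):
        self.tenant_id = tenant_id
        self._dial = dial          # callable that opens a broker connection
        self._conn = dial()
        self.broken = False

    def on_disconnect_buggy(self):
        # Pre-fix behaviour: the connection is flagged as broken, but
        # nothing re-dials the broker, so messages for this tenant are
        # lost until the whole ingress service is restarted.
        self.broken = True

    def on_disconnect_recovering(self):
        # Recovering behaviour: retry with capped exponential backoff
        # plus jitter until the broker accepts connections again.
        delay = 1.0
        while True:
            try:
                self._conn = self._dial()
                self.broken = False
                return
            except ConnectionError:
                time.sleep(delay + random.uniform(0, delay))
                delay = min(delay * 2, 60.0)
```

The point of the second handler is simply that recovery is driven from inside the service: a short broker outage then costs at most a few retries per tenant instead of a manual restart.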
We detected the problem via an alert that was triggered outside our office hours.
Why was the message broker updated?
Why did the ingress service not automatically recover?
Why was the response so slow?
Why did the alerts trigger so late?
Why was this problem not caught right after the previous incident?
A restart of the ingress service (around 21:24 UTC) resulted in all messages being accepted again.
Later, we prevented any further recurrences by changing a setting in our message broker so that it no longer causes these disconnects.