We are sorry for any inconvenience caused. We trust that this Post Incident Review (PIR) provides you with sufficient information to reassure you that we are actively working to improve. If you require further details, please do not hesitate to contact our support team.
Between ~13:00 UTC and ~14:00 UTC on 2024-04-16, we were unable to process any incoming sensor messages because our digital twin was accidentally running with a faulty development configuration, causing significant processing delays. Almost no messages were lost while the digital twin itself was offline; however, this incident triggered two other incidents which did result in major message loss: https://status.aloxy.io/incidents/7c75j3wntlml and https://status.aloxy.io/incidents/hwll7v3znctq.
When development of a new update to our digital twin deployment was finished, we staged it for testing in our acceptance/staging environment.
A bug in the deployment of our production cluster led to part of that staging configuration also being deployed to production (~13:00 UTC, 2024-04-16). This partial configuration misconfigured our digital twin, which resulted in API errors and a failure to process any incoming messages.
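To illustrate the failure mode, the sketch below shows how a partial configuration overlay can silently replace individual production values while leaving the rest intact. All keys, URLs, and values are hypothetical and purely illustrative; they are not our actual configuration or deployment tooling.

```python
# Minimal sketch (hypothetical names/values): a fragment of a staging
# configuration is overlaid onto the production configuration.

def merge(base: dict, overlay: dict) -> dict:
    """Recursively overlay `overlay` onto `base` and return a new dict."""
    merged = dict(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge(merged[key], value)
        else:
            merged[key] = value
    return merged

production_config = {
    "api": {"base_url": "https://api.prod.example.internal", "timeout_s": 30},
    "pipeline": {"workers": 8},
}

# Only a fragment of the staging configuration reaches production, so most
# production values survive and the result still looks plausible...
partial_staging_overlay = {
    "api": {"base_url": "https://api.staging.example.internal"},
}

effective_config = merge(production_config, partial_staging_overlay)

# ...but the service is now pointed at an endpoint it cannot reach from
# production, so its API calls fail and no incoming messages are processed.
print(effective_config["api"]["base_url"])
```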
After the faulty config was deployed to production (~13:00 UTC, 2024-04-16), we received alerts indicating that processing was significantly delayed. Soon after, we received additional alerts indicating that our processing pipeline had fully halted.
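The two alerts correspond to two distinct conditions: a growing processing lag and a complete stop in throughput. A minimal sketch, assuming hypothetical metric names and thresholds (not our actual alerting rules):

```python
# Illustrative only: hypothetical metrics and thresholds.
from dataclasses import dataclass


@dataclass
class PipelineMetrics:
    oldest_unprocessed_age_s: float  # how far processing is lagging behind
    processed_last_minute: int       # throughput over the last minute


def evaluate_alerts(m: PipelineMetrics) -> list[str]:
    alerts = []
    if m.oldest_unprocessed_age_s > 300:
        alerts.append("processing significantly delayed")
    if m.processed_last_minute == 0:
        alerts.append("processing pipeline halted")
    return alerts


# During the incident both conditions were eventually true at once.
print(evaluate_alerts(PipelineMetrics(oldest_unprocessed_age_s=1800,
                                      processed_last_minute=0)))
```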
Why did a development config get deployed on production?
Why was the production config configured this way?
Why was a partially incorrect config so detrimental to message processing?
Why did it take so long for the misconfiguration in the deployment files to be noticed?
The change had been verified as working in our development clusters, which left us with two options to mitigate the problem: roll back or roll forward the deployment misconfiguration. As rolling back would take longer and could introduce undefined behavior, we opted to roll forward the change and fix the misconfiguration afterwards. After updating and restarting the digital twin, we noticed that our message backlog was not decreasing. This was the result of 2 things:
After all this, the backlog was fully processed.
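The observation above that the backlog was not decreasing after the restart is the kind of thing a simple trend check on the backlog size surfaces. A minimal sketch, with hypothetical numbers and no relation to our actual tooling:

```python
# Hypothetical backlog samples (messages still queued), taken at a fixed
# interval after the digital twin was restarted; illustrative values only.
samples = [12_400, 12_450, 12_430, 12_460]

def backlog_is_draining(samples: list[int]) -> bool:
    """True only if every successive sample is strictly smaller than the last."""
    return all(later < earlier for earlier, later in zip(samples, samples[1:]))

# A flat or growing backlog after a restart means the fix has not taken
# effect yet and further investigation is needed.
print(backlog_is_draining(samples))  # False
```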