Post-mortem on our partial service outage.
In January we started moving to a new #Kubernetes based setup. During or immediately after #FOSDEM we managed to move almost all of our services over from the old #Rancher 1.x based setup. Three services remained: #Matrix, #Riot and the association homepage.
Today, 17th March, at approx. 11am UTC, the Matrix server, our #Riot instance and the homepage went offline because the LetsEncrypt certificate expired. Normally this certificate would have auto-renewed, but that no longer worked with our old Rancher setup.
Unfortunately, due to mail delivery failures which we're still looking into, we didn't get the LetsEncrypt warnings about the certificate expiring or the Uptimerobot alerts when the services became unavailable. Naturally, alertmanager notifications to our admin room went through the Matrix server, so we didn't receive those either.
The admin team was nevertheless alerted to the failure very soon after the cert expired, but unfortunately the key admins who had experience with the old setup were tied up at work. When a new certificate was finally received from LetsEncrypt, Rancher decided to push one more stick into the wheels by refusing to restore the load balancer to a working condition.
After a few hours of flaky networking while the admins juggled work, commutes and fighting load balancers, reliable service was finally restored to the Matrix and Riot instances at approx. 6:45pm UTC. The Matrix server should hopefully be catching up and be back to normal soon.
The Matrix parts will be migrated to the new Kubernetes setup asap, during the next week or so. This should avoid things falling apart due to legacy stuff that only a few people in the team know how to operate.
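For anyone wanting a check that does not depend on email or on the Matrix server itself, here is a minimal sketch of a certificate expiry probe using only the Python standard library. It is an illustration, not what our setup runs: the function names, the 14 day threshold and the idea of printing a warning from a cron job are all assumptions.

```python
import socket
import ssl
import time


def days_left(not_after, now=None):
    """Days until a certificate's notAfter timestamp (negative once expired).

    `not_after` uses the format returned by getpeercert(),
    e.g. "Mar 17 11:00:00 2030 GMT".
    """
    now = time.time() if now is None else now
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400


def fetch_not_after(host, port=443, timeout=10):
    """Open a TLS connection and return the peer certificate's notAfter field."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def check(host, warn_days=14):
    """Return a warning string, or None if the certificate looks healthy.

    warn_days=14 is an arbitrary, hypothetical threshold.
    """
    try:
        remaining = days_left(fetch_not_after(host))
    except ssl.SSLCertVerificationError as exc:
        # The default context rejects expired or otherwise invalid
        # certificates during the handshake, so this is already an alert.
        return f"{host}: certificate failed verification: {exc}"
    if remaining < warn_days:
        return f"{host}: certificate expires in {remaining:.1f} days"
    return None
```

Run from somewhere outside the monitored infrastructure, something like `print(check("example.org"))` (hostname is a placeholder) would surface an upcoming expiry even when the usual notification channels are down.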
Sorry for the inconvenience for any users on the #Feneas Matrix server 🤗