Recovering from Disaster! Chat server almost died.
Self Hosting… Worth it, but it’s hard sometimes. Specifically, right now. While I LOVE self hosting from a privacy perspective, it can be a real pain in the ass sometimes. I have been working on the Matrix Synapse server for the past day with mixed success. I started out with one goal: to fix the search functionality within Riot.im, since for whatever reason, searching for users was broken. After many hours of experimentation and research, I finally fixed it… Sweet! Since I was on a roll, I decided to go ahead and fix federation functionality… Which I did! However, video chat still doesn’t seem to work, and integrations still don’t work either. Argh. At least voice chat sort of works, although not super reliably. Overall, pretty good progress! Little did I know about the troubles to come…
I was about to wrap up my maintenance late in the day when my Ubuntu installation warned me about low disk space. It turns out the VM had only been given a 12 GB disk. This is… a problem. People upload content to this server when they share stuff with other users, so I need a big chunk of storage available for uploads. Well, to make a long story short, I tried to expand the VHD (Virtual Hard Disk), and when I then attempted to extend the partitions, the whole partition layout of the drive got messed up… Things basically broke to the point where recovery was not practical.
No worries- I did an export of the VM directly before I expanded it. I have been in the game long enough to know that shit always goes wrong, so a backup would mitigate that risk, right? Not totally, especially when that backup will NOT IMPORT. That’s correct- after everything was jacked up, I wanted to try again. I attempted to import my backup, only to see it fail every single time.
No problem, I’ll just make a copy of the VHD and mount it to the VM in place of the messed up VHD- it’ll be like restoring from a backup- NOT SO FAST! When I did the export, I had about 10 snapshots of the VM in place, which means that the VHD only held the disk as it was at the original snapshot, while the incremental changes between snapshots were stored in separate files- one per snapshot. I found this out the hard way, when I mounted the VHD only to discover that it acted just like a new Ubuntu install.
In this situation, one starts to panic. I had spent literal days working on the Matrix Synapse server, only to face the real possibility that I was going to have to rebuild from scratch. However, I was determined to figure something out. I did some aggressive googling on merging VHDX and AVHDX files. Turns out it’s possible, so long as you do it in the correct order. Each AVHDX file is a differencing disk that points at a parent file. I used Hyper-V’s Inspect Disk tool to check each AVHDX file, making notes on each parent file. Eventually, I had the list together.
Ubuntu 18.04.2 LTS (1).vhdx
Ubuntu 18.04.2 LTS (1)_5AC1CA94-704C-43F0-9FB9-3758BCA9B295.avhdx
Ubuntu 18.04.2 LTS (1)_4A5BF5C1-EEEF-4CE8-9D7A-36C513FAA37A.avhdx
Ubuntu 18.04.2 LTS (1)_CF0735F5-6A70-42B3-AE59-6F3F662808EB.avhdx
Ubuntu 18.04.2 LTS (1)_F9DA5B92-13C6-40FB-86BE-FA2A464C24CC.avhdx
Ubuntu 18.04.2 LTS (1)_037D43A4-FFE9-40FE-BEF0-B87A2C5EF266.avhdx
Ubuntu 18.04.2 LTS (1)_BC4104EB-9A06-4542-A2C6-0E199DF66F3E.avhdx
Ubuntu 18.04.2 LTS (1)_5759360A-2C78-400A-89F7-2C4EA8F84878.avhdx
Ubuntu 18.04.2 LTS (1)_D36B77C8-15B3-4BB4-BDD6-F62FCB173310.avhdx
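For anyone doing the same note-taking, here is a minimal Python sketch of turning "this file's parent is that file" notes into an ordered chain. It is only an illustration: the `order_chain` function and the short file names in `notes` are hypothetical stand-ins, not the real files listed above, and the sketch assumes the snapshots form one straight chain with no branches.

```python
def order_chain(parents):
    """Return disk names ordered from the base disk to the newest child.

    `parents` maps each differencing disk to the parent it reports.
    The base disk is the one name that shows up as a parent but has
    no parent entry of its own.
    """
    # Invert the notes: parent -> child. A parent with two children
    # means the snapshots branch, which this sketch does not handle.
    children = {}
    for child, parent in parents.items():
        if parent in children:
            raise ValueError(f"{parent} has more than one child")
        children[parent] = child

    bases = [p for p in children if p not in parents]
    if len(bases) != 1:
        raise ValueError("expected exactly one base disk")

    # Walk from the base disk down to the newest differencing disk.
    chain = [bases[0]]
    while chain[-1] in children:
        chain.append(children[chain[-1]])

    if len(chain) != len(parents) + 1:
        raise ValueError("some disks are not connected to the base disk")
    return chain


# Hypothetical notes (differencing disk -> parent), in no particular order.
notes = {
    "disk_C.avhdx": "disk_B.avhdx",
    "disk_A.avhdx": "base.vhdx",
    "disk_B.avhdx": "disk_A.avhdx",
}
print(order_chain(notes))
```

The sketch never touches any disk files. It only sorts the notes, so a mistake in it costs nothing.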
Then I went through the tiring process of merging each file, one at a time, with its parent file, using Hyper-V’s Edit Disk function. Slowly, I merged each file with success. In theory, the final file would be the Ubuntu disk at the exact point in time that my backup was taken… I made a backup of the newly merged VHD file, then finally mounted it. This was the moment of truth. If I started this VM and it didn’t work, I think I would have been totally broken. It HAD to work now.
It did! I was so relieved to see it work again. I have to say I learned a lot from this catastrophe, and I definitely came out smarter for it- that being said, when it comes at the expense of your 99.9% uptime metric, it definitely sucks. As someone who self-hosts, I want to offer services on par with or better than other solutions. Uptime is one area where I can’t compete, unfortunately.
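A footnote on the merge step, for anyone who ends up in the same hole: here is a small Python sketch that writes out a merge plan before you touch anything. It assumes (this is my assumption for the sketch, not something the Edit Disk wizard spells out) that each step folds the newest remaining differencing disk into its immediate parent, so the chain shrinks from the end. The `merge_plan` function and the file names are hypothetical.

```python
def merge_plan(chain):
    """Given disks ordered base-first, list (child, parent) merge steps.

    Assumes each step merges the newest remaining differencing disk
    into its immediate parent, working back toward the base disk.
    """
    return [(chain[i], chain[i - 1]) for i in range(len(chain) - 1, 0, -1)]


# Hypothetical chain, base disk first.
chain = ["base.vhdx", "disk_A.avhdx", "disk_B.avhdx", "disk_C.avhdx"]
for step, (child, parent) in enumerate(merge_plan(chain), start=1):
    print(f"{step}. merge {child} into {parent}")
```

This only prints a checklist. Do the merges themselves on copies of the files, so a wrong step doesn’t cost you the only backup you have.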