The Quest For 99.9999% Uptime

Uptime Kuma!

It's extremely difficult to build a system with enough redundancy to be online nearly 100% of the time, but that's the goal. Fifthdread Services has always aimed for high availability- especially since I strongly encourage my friends and family to use them. While things are better now than ever, I think there's a little more I can do to design a system that will never need to go down.

The obstacles to high availability have varied over the years. Here's a list of reasons a system might go down:

  • Software / System Updates
  • Hardware Failure
  • Hardware Upgrades
  • Power Loss due to Power Outage
  • System Crashes

All of these things have possible mitigations or solutions. To build a truly robust system, you must develop redundancies for these things.

The Current System

The current system is pretty decent for a home lab. I'm running mostly on consumer-grade hardware, so there aren't any hardware redundancies at the moment. If a server were to crash crash, as in, can't be recovered due to hardware failure, the hardware would have to be replaced... sad. That could mean a week of downtime for many of Fifthdread Services. Since I don't have paying customers, we'd live. I may even be able to migrate some of the services over to my other server. All this is to say- it's not very hardware redundant at the moment.

Power- now that's something that's decent (for 1 of 2 servers) since I have a UPS providing power to my main server. This ensures that during a power outage, the server stays powered on until the backup generator kicks on. Most of Fifthdread Services are protected from a power failure, and that's fantastic! I do want to improve this further in the future, but for now it's a decently robust solution.

System Crashes - now that's something that has been relevant to me recently. I had the GPU on my main server get stuck at 100% utilization, and the only way to stop it was to shut down the system. No signal, not even terminate or kill, would stop the GPU process- only a shutdown was sufficient. This was likely due to parts of the GPU driver living in the kernel. Because the main server's so important to Fifthdread Services, I have stopped generating AI Art on it. I was using ComfyUI, which is what triggered the GPU issues to begin with, and it's not worth taking down all of Fifthdread Services in order to correct it. I have since moved AI Art Generation to my Gaming PC. Besides that, the system is pretty robust and crash resistant.

The Future

So how do we get more redundant? Many of Fifthdread Services are simple web-based services with no need for major storage, such as Matrix Synapse (Element), Fifthdread.com, NGINX Proxy Manager, Mumble- pretty much everything. Lots don't require bulk storage- some do, like Peertube, Plex, or Jellyfin. Some require a ton of processing... like Peertube, Plex, or Jellyfin. lol But a majority of Fifthdread Services don't need a ton of resources or storage.

These essential services could transition to an affordable redundant solution that I'm dreaming up. The idea is to buy 3 mini PCs, put them in a Proxmox Cluster, and bam- I have a high-availability cluster that I can run my essential services on!
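Why 3 nodes and not 2? Here's a rough sketch of the math. It assumes each node gets one vote and the cluster needs a strict majority of votes to keep running, which is my understanding of how Proxmox clusters behave by default. The 99% per-node availability is a made-up number for illustration, and the function names are my own.

```python
from math import comb


def quorum(nodes: int) -> int:
    """Votes needed for a strict majority, assuming one vote per node."""
    return nodes // 2 + 1


def tolerated_failures(nodes: int) -> int:
    """How many nodes can be down while the rest still hold a majority."""
    return nodes - quorum(nodes)


def cluster_availability(nodes: int, node_availability: float) -> float:
    """Chance that at least a quorum of nodes is up at a given moment.

    Assumes nodes fail independently, which shared power or a shared
    switch would violate in a real home lab.
    """
    need = quorum(nodes)
    return sum(
        comb(nodes, up)
        * node_availability**up
        * (1 - node_availability) ** (nodes - up)
        for up in range(need, nodes + 1)
    )


if __name__ == "__main__":
    for n in (2, 3, 5):
        print(n, "nodes: quorum", quorum(n), "- survives", tolerated_failures(n), "failure(s)")
    # Hypothetical 99% per-node availability, purely for illustration.
    print(f"3 nodes at 99% each: {cluster_availability(3, 0.99):.6f}")
```

With 2 nodes, losing either one loses the majority, so 3 is the smallest cluster that can survive a node failure under these assumptions.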

These Mini PCs are getting kind of insane. They're very capable little devices, with multi-core Ryzen chips in them. With 8 cores and 16 threads, these things can run a surprising amount of stuff. Just look at these specs!

  • Beelink SER6 Mini PC
  • AMD Ryzen 9 6900HX
  • 32GB DDR5 RAM
  • 1TB PCIE4.0 SSD
  • AMD Radeon 680M
  • Triple Display
  • USB4.0
  • WiFi6
  • BT5.2
  • 2.5Gbps LAN

Not bad!

For the price, it's a pretty decent setup, but there are some things to consider. For one, I do wish I could make a 10GbE link between all the nodes. I saw a guy on YouTube build a 10GbE mesh network between his 3 clustered Proxmox machines... Very interesting. 2.5GbE isn't bad, but I'll be sharing that 2.5GbE link between the replication tasks and the services. Considering that I want replication to be as fast as possible, it would be ideal to have at least a dual-LAN solution available. These mini PCs don't have a PCIe slot available to throw in 10GbE cards. It could be smarter to go with something that does.
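To put rough numbers on the link speed question, here's a back-of-the-envelope sketch. It computes ideal line-rate transfer times only, ignoring protocol overhead and disk speed, so real numbers will be worse. The 100 GB payload and the 50% share are hypothetical values I picked for illustration.

```python
def transfer_seconds(size_gb: float, link_gbps: float, share: float = 1.0) -> float:
    """Ideal time to move size_gb gigabytes over a link.

    share is the fraction of the link available to replication
    (1.0 = the whole link, 0.5 = half of it goes to serving traffic).
    Ignores protocol overhead, so treat the result as a lower bound.
    """
    bits = size_gb * 8e9  # decimal gigabytes to bits
    return bits / (link_gbps * 1e9 * share)


if __name__ == "__main__":
    size_gb = 100  # hypothetical amount of data to replicate
    for label, gbps, share in (
        ("2.5GbE, dedicated", 2.5, 1.0),
        ("2.5GbE, half shared with services", 2.5, 0.5),
        ("10GbE, dedicated", 10.0, 1.0),
    ):
        print(f"{label}: {transfer_seconds(size_gb, gbps, share):.0f} s")
```

Under these assumptions, a dedicated second LAN port buys back whatever share the services would otherwise eat, which is the whole argument for dual LAN.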

I will say, though, that for a low-power solution, this is solid. I think I'll go with it for the services I described above- my essential services, which have both low power and low storage requirements. Running a Proxmox High Availability cluster between them could be a sweet project. Unfortunately, it comes with an up-front investment of around $1500 USD. Yes, I could do it for less, but I'm looking for bang for the buck. I really want 8 fast cores and 16 threads, with 32GB RAM, a 2TB NVMe SSD, and dual LAN. I may end up going a little over budget to get it.

Is it worth it?

Ehhhh it depends. If a server were to crash right now, I'd have to resort to a daily backup to bring things back online. We could lose maybe a day's worth of data, and I could bring things up on my second server...

But I can do better!

I want to make it so the machines appear to never go down. You'd never notice an outage, because the cluster keeps things truly redundant. That's what I want- the magic of 99.9999% uptime- at least regarding the machines' uptime. Network uptime is still a concern since I don't have any redundant WAN.
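For a sense of how strict that number is, here's a quick sketch that turns an availability percentage into a yearly downtime budget. It assumes a 365-day year, and the function name is just something I made up.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a non-leap year


def downtime_budget_seconds(availability_percent: float) -> float:
    """Seconds of downtime allowed per year at a given availability."""
    return SECONDS_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999, 99.9999):
        budget = downtime_budget_seconds(target)
        print(f"{target}% uptime allows {budget:.1f} s of downtime per year")
```

At 99.9999%, the budget works out to roughly 31.5 seconds per year, so a single reboot of a lone server would blow through it.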

Currently in my home, the best uptime number I see is from my Ubiquiti switch at 3 months. I've had much better uptime from my servers before- but not really beyond a year. My goal is to make Fifthdread Services more and more robust as time goes on, so with that said, look forward to the cluster project coming in the near future.