Why Some Fifthdread Services Are Down: Recovering From Disaster

There I was, enjoying a relaxing weekend after an absolutely brutal work week, when tragedy struck. Without warning, I could not access Fifthdread Services. It seemed like my entire network had just died. I immediately sprang into action. I've spent an ungodly amount of time building a setup meant to have extremely high uptime, yet here I was, down. This was unacceptable, and frankly surprising. So, what happened?

I tried pinging my main server, only to receive a "destination unreachable" message. That's a big deal. Why couldn't I ping anything in my server rack? Well, I had just changed some network configuration last weekend. Embarrassing as it is to admit, I had previously configured the 10g connection between my main network stack and my server rack to run at only 1g, and had only just discovered the error. I corrected it last weekend, and the link was finally negotiating at the full 10g. After some fiddling, the 10g copper SFP+ module looked like the likely culprit. I wasn't sure, but I assumed it was overheating, since those modules get extremely hot.

After I re-seated the module, the network came back to life. I decided the long-term solution would be to run fiber with 10g fiber SFP+ modules. I ended up running the fiber, and so far things are cooler and more stable, but I did have a slight hiccup after installation. Hopefully that wasn't a sign that the 10g switch is dying. If it is, the 10g switch I want from Ubiquiti will run me around $1000. It sucks, but it'll be worth it if it resolves my networking issues.

If that had been my only issue, I'd be fine. However, an even bigger issue showed itself. After bringing the network back up, I noticed my main server was down. "DreadNAS", as it's charmingly named, was not communicating, and my uptime monitor was complaining about it. This sucks because it holds my main storage pool of around 180TB, which supports all the Docker containers that need large amounts of data.

Troubleshooting DreadNAS

The problem was that I couldn't communicate with DreadNAS. I couldn't ping it and I couldn't SSH into it. It may as well have been off, but it wasn't. It was still powered on, so what happened? I checked my KVM connection, which lets me look at the terminal display. Odd, it just seemed stuck. I tried typing on the keyboard, and nothing moved. It was completely frozen. There were a bunch of errors on screen, but they didn't seem related to the crash. No big deal, I'd just reboot and call it an odd bug.
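The "can't ping it, can't SSH into it" check can be scripted so it doesn't depend on someone sitting at a terminal. This is only a rough sketch, not what my uptime monitor actually runs: it assumes SSH is listening on the default port 22, and the function name `tcp_reachable` is made up for illustration.

```python
import socket


def tcp_reachable(host: str, port: int = 22, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Port 22 (SSH) is an assumption; any port the server normally
    listens on would work just as well.
    """
    try:
        # create_connection resolves the host and attempts the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, unreachable hosts and DNS failures.
        return False
```

Calling `tcp_reachable("dreadnas.lan")` (a hypothetical hostname) would return `False` for a box in the state described above: the power is on, but nothing answers on the network.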

I brought it back up and things were fine. I even cleared up some of the lingering issues that were causing errors to pop up in the logs. However, it wasn't long before the server was once again completely frozen. Now we had a trend. I looked through journalctl and didn't see anything alarming. I poked around the system to see if anything odd was running. Nothing abnormal. Next, I checked ZFS. It seemed fine, although the SMART checks on the drives were signaling issues...

DreadNAS was built with two different ZFS pools: one is a mirror of the two boot NVMe SSDs, and the other is the 12-drive pool. One of the 12 drives was failing SMART checks, indicating spin-up issues. However, this did not impact the zpool, which can handle 2 drive failures before issues arise. No, the bigger issue was the NVMe SSD mirror. One of the two drives was failing SMART checks, indicating excessive wear. It was claiming to have written 1.6PB over its lifetime, which exceeded the manufacturer rating. The second SSD was not far behind.
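To put the wear figure in perspective, here is a small back-of-the-envelope sketch. NVMe drives report writes as "data units", where one unit is 1000 blocks of 512 bytes. The 1200 TBW endurance rating below is a made-up placeholder, not the actual rating of my SSDs, and the function names are just for illustration.

```python
# One NVMe "data unit" is 1000 blocks of 512 bytes.
NVME_DATA_UNIT_BYTES = 512 * 1000


def bytes_written(data_units_written: int) -> int:
    """Convert the NVMe 'Data Units Written' counter to bytes."""
    return data_units_written * NVME_DATA_UNIT_BYTES


def endurance_used(written_tb: float, rated_tbw: float) -> float:
    """Fraction of the rated endurance (TBW) consumed so far."""
    return written_tb / rated_tbw


# 1.6 PB = 1600 TB comes from the SMART report described above.
# 1200 TBW is a HYPOTHETICAL rating, chosen only to illustrate the math.
ratio = endurance_used(written_tb=1600.0, rated_tbw=1200.0)
print(f"{ratio:.0%} of rated endurance used")  # prints "133% of rated endurance used"
```

Anything over 100% means the drive has written more than the manufacturer rated it for. That doesn't guarantee it's about to die, but it does make it a suspect.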

The SSDs were worn out, but there were no errors on the zpool. Unsure whether the drives were the real concern here, I recruited the help of my buddy Allyn Malventano. I don't know anyone more qualified to help me diagnose this issue, and he graciously agreed to help.

After a lot of diagnostics, nothing we did could stop the machine from freezing after a random amount of time. In fact, the freezes seemed to arrive sooner and sooner, each one leaving the system completely unresponsive. Nothing we did seemed to help or hurt. The system gave us no warnings prior to freezing. I suspected a Docker container could be related, but even disabling Docker altogether did not resolve the freeze.
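When a machine locks up without logging anything, one generic trick is a heartbeat file: write a timestamp every few seconds, and after the next reboot the last entry tells you roughly when the freeze hit. This is a minimal sketch of that idea, not something we actually ran during this troubleshooting. The function names and the 10 second interval are assumptions.

```python
import os
import time
from datetime import datetime, timezone
from typing import Optional


def write_heartbeat(path: str) -> None:
    """Append the current UTC time to the file and force it to disk."""
    stamp = datetime.now(timezone.utc).isoformat()
    with open(path, "a", encoding="utf-8") as f:
        f.write(stamp + "\n")
        f.flush()
        # fsync so the entry survives a hard freeze instead of sitting in cache.
        os.fsync(f.fileno())


def last_heartbeat(path: str) -> Optional[datetime]:
    """Return the most recent recorded timestamp, or None if there is none."""
    try:
        with open(path, encoding="utf-8") as f:
            lines = [line.strip() for line in f if line.strip()]
    except FileNotFoundError:
        return None
    return datetime.fromisoformat(lines[-1]) if lines else None


def run(path: str, interval: float = 10.0) -> None:
    """Write a heartbeat forever; the interval is an arbitrary choice."""
    while True:
        write_heartbeat(path)
        time.sleep(interval)
```

After a reboot, comparing `last_heartbeat(path)` against the end of the system journal narrows down the window in which the hang happened. If the boot drives themselves are suspect, the file should live on different storage.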

With all this, I suspected the most likely issue was the NVMe SSDs failing, but it could still somehow be software related. I ordered two new SSDs and re-installed Proxmox from scratch on the new drives, thinking it would resolve the issue. Unfortunately, the freezing returned hours later.

With reality setting in that DreadNAS was truly borked, it was time to think about the next step: replacing the underlying hardware. Since the entire software stack had been replaced, it had to be a hardware issue, and not even the hardware I had suspected. What hardware would I replace? Basically everything. The CPU, the motherboard, and I'd need new sticks of ECC-capable DDR5 memory. That's a whole new build. Not fun. Not only that, but I may as well get a new power supply, since I don't trust the current one all that much.

What are my options? Well, the current build is made of my old consumer-grade gaming PC parts. It's an AMD 5950X build. Pretty great all things considered, but sadly these machines aren't truly made to be servers. They have limited PCIe lanes, and depending on the motherboard, the lanes they do have are split up in all sorts of random ways. It becomes a nightmare to find a board with a configuration that doesn't bottleneck your setup. No, this time, I'll be getting a build that has plenty of PCIe lanes. This time, I'll go Threadripper.

Threadripper 9000 came out recently, and it looks nice. So nice that I put together a build that'll run me around 3 to 4 thousand dollars. Rough, especially considering RAM prices... I'm still deciding when to pull the trigger on the build, especially since I'm tight on cash at the moment. Until then, I've migrated the most vital services off of that machine and onto a different one: DreadALT. DreadALT is an even older machine, my AMD 3950X gaming PC, which usually runs game servers and non-essentials. Now it runs my email services, a temporary Plex, and some other servers that I can't afford to have down... Thankfully, most of my essential services are already running inside the Docker Swarm, which gives me high availability for the most part, so long as the networking stays stable. However, I really, really want DreadNAS to come back to life. My entire media library is on there, all my bulk data, all my photos and videos... I need my NAS!

Alas, I must wait a while longer before I can afford to fix it. Until then, enjoy this sad uptime dashboard.