It's been a little while since I've had a good rant in this section, because for the most part I've been able to ignore most of the issues that I've run into so far. They either weren't big enough of a deal to write about, or simply didn't feel worth describing in detail. But today, I had yet another case of Linux being a menace, making me spend the majority of this Saturday debugging it, with mixed results.
So let's set the scene: one day I realize that one of my servers no longer starts up after the daily restart (which is there so I can sleep without the PSU noise). I realize that I cannot connect to it through SSH, nor do any of the Docker containers on it launch, so I go over to the desk and connect a monitor and keyboard to it. This is what I see as the result of the boot process:
It's not a boot loop. The server isn't actually booting all the way. I get past GRUB normally, but eventually it just decides to... sit there. My first thought is that it might have been some sort of a hardware failure, but thankfully GRUB still works and I can choose various boot options to figure out what's going on:
One of those options is opening a recovery menu, which luckily has some helpful options, like jumping straight into a shell prompt:
Doing that actually makes the server work: we are able to get root on it and interact. This makes me feel like it might be a driver-related issue; however, in this mode the output about the graphics devices isn't very helpful:
I have another server running besides this one with more or less the same hardware, so I can actually show you what a working configuration without driver issues would look like:
(note that I can actually use screenshots for the remote output from that one, because I'm not forced to take pictures with a phone on the broken one)
As you can see, what's missing on the broken one is:
driver=amdgpu
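If you want to repeat that comparison without squinting at screenshots, here is a minimal sketch. It assumes the output comes from lshw, which prints a driver=... field on the configuration line of each device. The gpu_driver helper is a hypothetical name of mine, not something that ships with the system.

```shell
# Print the driver= value found in `lshw -c video` output read from stdin,
# or "none" when no driver is bound to the device.
# gpu_driver is a hypothetical helper name, not a system command.
gpu_driver() {
    drv=$(sed -n 's/.*driver=\([^ ]*\).*/\1/p' | head -n 1)
    echo "${drv:-none}"
}

# Usage on a real machine (assumes lshw is installed):
#   sudo lshw -c video | gpu_driver
```

Running it on both servers and comparing the two answers is a lot quicker than comparing photos of a monitor.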
But why?
My best guess is that a software update to the drivers might have caused some issues to crop up. You see, I run AMD Athlon 200GE CPUs because of their low 35W TDP and they also have integrated Vega 3 graphics. Not something you would use for gaming, sure, but perfect for low power servers built on consumer hardware. Now, what doesn't make sense to me is the fact that only one of the servers would have issues, because they are updated at almost the same time.
What's even weirder is the fact that the server decided to not start up entirely, just because of a graphics related issue. I don't know about you, but I don't think that a server needs graphics to function, especially because the recovery mode (which perhaps doesn't bother with loading the GPU drivers) has no issues and I could probably still run all of my containers and software without a graphical environment. So, in what world does it make sense for your nodes to fail because of optional hardware functionality that's a part of your CPU package?
Then, there's the next question - how do I reinstall the drivers? Some of the first search results out there aren't too helpful. Some suggest enabling third party personal package archives (PPAs), others suggest that you should install the proprietary AMDGPU-PRO drivers... but none of that is what I need or want. All I really want is the equivalent of the following:
sudo apt reinstall amd-gpu-drivers
which, of course, is not how reality works, because nothing is ever easy.
The thing is that I don't have any amdgpu packages actually installed:
So as far as I'm concerned, how they work on Linux is basically magic, since no package is explicitly present. Some other sources suggest that you should look for a package meant to uninstall the drivers. But the thing is that I have nothing like that on the system either:
This is very odd, at least in my eyes, because it feels like the user experience is more or less a wall here. Now, I actually did go back and review the boot logs, to find out what the last line was before everything went wrong on the server, and as it turns out there is exactly one search result on the Internet for it (for PopOS, but that is based on Ubuntu, so close enough):
There is a bit of a discussion there, but it devolves into nothingness and isn't satisfactory enough, because it doesn't provide us with actionable advice:
So, what am I to do? As far as I'm concerned, the only option is to get rid of the graphics driver and just give up and render things in software, instead of using the hardware. Thankfully, testing that out isn't too hard:
We basically enter GRUB, press E to edit the command for booting and add the following option:
nomodeset
This is how it looks in my case:
And we can also make the change persistent later. Aaand... it works!
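For the persistent version, here is a rough sketch of what I mean. It assumes an Ubuntu-style setup where the kernel options live in /etc/default/grub and update-grub regenerates the actual boot configuration. The add_nomodeset function is a hypothetical helper written for this post, so treat it as an illustration rather than the one true way.

```shell
# Append "nomodeset" to GRUB_CMDLINE_LINUX_DEFAULT in the given file,
# unless the option is already there. add_nomodeset is a hypothetical helper.
add_nomodeset() {
    file=$1
    # Nothing to do if the option is already present.
    if grep -q '^GRUB_CMDLINE_LINUX_DEFAULT=.*nomodeset' "$file"; then
        return 0
    fi
    # Rewrite into a temporary file, then replace the original.
    sed 's/^\(GRUB_CMDLINE_LINUX_DEFAULT="[^"]*\)"/\1 nomodeset"/' "$file" > "$file.new" &&
        mv "$file.new" "$file"
}

# Usage as root on the server (assumed Ubuntu paths, back up first):
#   cp /etc/default/grub /etc/default/grub.bak
#   add_nomodeset /etc/default/grub
#   update-grub
```

The check at the top is there so that running it twice doesn't leave you with the option listed twice.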
Listen, I can understand that the RX 570 in my personal computer might not work all that well with ROCm, leaving me to do AI development with CPU and RAM instead, leading to something like this:
Well, maybe besides the fact that I ran into lots of DKMS issues, installing the actual custom drivers was a total mess, it broke like 4 times during it and took up most of the evening prior to me writing this post... and the fact that a 5-6 year old GPU apparently isn't good enough to do comparatively basic tasks because developers can only be bothered to support the latest hardware... But we're not talking about that here. We're talking about having drivers for an integrated GPU that are stable enough not to prevent the node from booting!
Regardless, we are now able to get as far as network configuration:
Which fails shortly afterwards...
What?
It wasn't bad enough for the GPU to fail initializing, now we are also having network issues? As expected, it's now impossible for me to connect to the server, or send any outgoing traffic, even though the boot itself more or less completes as expected:
My first idea was to just restart the network interface, surely that would be good enough, right? Well, sadly not:
Essentially, whenever I try to do something, I get the following:
Temporary failure in name resolution
I thought that perhaps the issue could be with the name servers, which resolve DNS queries from hostnames into IP addresses, which could explain at least the outgoing traffic issue, or why some random intermediate setup step could fail. These aren't actually managed in resolv.conf though:
It's odd that we're left with a configuration file that's just generated by something else, but thankfully adding a custom name server wasn't too hard either:
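In case the screenshots don't survive, here is a sketch of one way to do this. It assumes that the thing generating resolv.conf is systemd-resolved, which reads extra settings from drop-in files under /etc/systemd/resolved.conf.d/. The write_dns_dropin function and the custom-dns.conf file name are hypothetical, picked just for this example.

```shell
# Write a systemd-resolved drop-in that sets the given DNS servers.
# $1 is the drop-in directory, the remaining arguments are server addresses.
# write_dns_dropin and custom-dns.conf are hypothetical names.
write_dns_dropin() {
    dir=$1
    shift
    mkdir -p "$dir"
    # "$*" joins the addresses with spaces, which is the format DNS= expects.
    printf '[Resolve]\nDNS=%s\n' "$*" > "$dir/custom-dns.conf"
}

# Usage as root (assumes systemd-resolved is what manages resolv.conf):
#   write_dns_dropin /etc/systemd/resolved.conf.d 8.8.8.8 8.8.4.4
#   systemctl restart systemd-resolved
```

A drop-in file has the small advantage of leaving the distro's own resolved.conf untouched, so it is easy to undo by deleting one file.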
Except that while the service starts up correctly, still nothing works:
I actually did a few restarts but the only thing that it got me was some random output... in the login screen:
That's probably not something you'd expect or even want, for that matter. But with nothing working, now what? I checked the device information, but there was nothing interesting to be seen:
(again, they were pretty much the same as the other server and the idea of messing around with driver versions was pretty much the same as previously - demotivating)
One thing I did notice, however, was that the server was getting an IPv6 address, but not an IPv4 one, so that might well be the issue! Because of how my homelab is set up, this should be as simple as getting it from the DHCP server, so that implies that that's exactly where the issues might be. So let's try to see whether the DHCP client works or not! A brief check reveals that it just freezes:
In comparison, launching it on the other server returns information much more quickly:
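Since a frozen command is annoying to babysit, here is a small sketch for putting a deadline on that kind of check. It assumes the DHCP client is dhclient and that timeout from GNU coreutils is available. The run_with_deadline function is a hypothetical helper, and the 30 seconds in the usage line is an arbitrary number.

```shell
# Run a command with a time limit and say whether it finished or hung.
# run_with_deadline is a hypothetical helper; it relies on coreutils timeout,
# which exits with status 124 when the time limit is hit.
run_with_deadline() {
    secs=$1
    shift
    status=0
    timeout "$secs" "$@" || status=$?
    if [ "$status" -eq 124 ]; then
        echo "timed out after ${secs}s"
    else
        echo "exited with status $status"
    fi
    return "$status"
}

# Usage on the server (interface name taken from my setup, dhclient assumed):
#   run_with_deadline 30 sudo dhclient -v enp4s0
```

On a healthy machine this should come back with a lease well before the deadline, while on the broken one you at least get your prompt back.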
If DHCP has forsaken us, then the reasonable thing is to get rid of it, right? Just have to configure networking manually with a static IP address and we should be good! A little bit like this:
network:
  ethernets:
    enp4s0:
      dhcp4: false
      addresses: [192.168.8.251/24]
      gateway4: 192.168.8.1
      nameservers:
        addresses: [8.8.8.8, 8.8.4.4]
  version: 2
(well, more or less that, the addresses can obviously vary, but this is the format that Netplan needs)
Then, it should be just a matter of generating a new plan and applying it:
sudo netplan generate
sudo netplan apply
Aaaand... nope! We get the static IP address, but still get name resolution errors. I actually went to the router and tried checking whether the interface (ether3 and so on) has any data going through it and it does, just clearly not the data we need. Restarts also don't help and we still get errors in boot logs (journalctl -xe):
So then what?
Based on the title of the chapter, you might think that I had a faulty network card, but that's not the case!
I decided to try a LiveCD, to boot into a similar distro (Linux Mint, which I had at the time) off of a flash drive and then see what works or what doesn't. What I found was the following:
I was utterly dumbfounded. I tried the same thing with the main OS and that also suddenly started working. So it turned out that the cause for the networking issue was that the router decided to have one of its ethernet ports... just stop functioning? I have 0 idea why that is, because nothing odd is up with the configuration and I previously saw that apparently some traffic was going through it, even though the server didn't work as expected.
Excuse me, but what the heck? Shouldn't I have gotten an explicit message that would suggest that I should check the hardware, or something? Something along the lines of:
[ OK ] Netplan loaded correctly
[ OK ] Realtek drivers loaded correctly
[ OK ] Realtek device recognized, self-test successful
[FAIL] Attempts to contact gateway unsuccessful
[FAIL] DHCP lease unsuccessful, no IP address provisioned
[INFO] Please check your connection to the gateway, possible hardware issue
That does not appear to be the case, however, at least in my experience. With all of the hardware out there that Linux and its many distros attempt to support, alongside all of the (sometimes seemingly needless) changes that happen within the ecosystem, the software has a hard enough time running on its own, as opposed to having any advanced capabilities for diagnosing issues. The same happens with UX, where you're left poking around in the darkness, hoping to find the causes for your issues, having no idea whether it's a software or a hardware issue.
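To be fair, a rough version of that wish list can be cobbled together by hand. The following is only a sketch of the idea, working from the cable upwards: link, address, gateway, DNS. It assumes the usual iproute2, iputils ping and getent tools are present, and every function name in it (check, has_carrier, has_ipv4, default_gateway, diagnose) is hypothetical, as is using example.com for the DNS test.

```shell
# Run a command quietly and print a boot-log style status line for it.
check() {
    desc=$1
    shift
    if "$@" >/dev/null 2>&1; then
        echo "[ OK ] $desc"
    else
        echo "[FAIL] $desc"
        return 1
    fi
}

# True when the kernel reports a physical link on the interface.
has_carrier() { [ "$(cat "/sys/class/net/$1/carrier" 2>/dev/null)" = "1" ]; }

# True when the interface has at least one IPv4 address.
has_ipv4() { ip -4 -o addr show dev "$1" | grep -q 'inet '; }

# Print the address of the default IPv4 gateway, if there is one.
default_gateway() { ip -4 route show default | awk '{print $3; exit}'; }

# Walk up the stack and stop at the first layer that fails.
diagnose() {
    iface=$1
    check "link detected on $iface" has_carrier "$iface" || {
        echo "[INFO] check the cable and the port on the other end"
        return 1
    }
    check "IPv4 address present on $iface" has_ipv4 "$iface" || {
        echo "[INFO] no DHCP lease and no static address"
        return 1
    }
    gw=$(default_gateway)
    check "default gateway configured" test -n "$gw" || return 1
    check "gateway $gw answers ping" ping -c 1 -W 2 "$gw" || {
        echo "[INFO] please check your connection to the gateway, possible hardware issue"
        return 1
    }
    check "DNS resolves example.com" getent hosts example.com
}

# Usage: diagnose enp4s0
```

I can't say for sure where this would have stopped in my case, but with a static address configured my guess is the gateway ping, which would have pointed at the router a lot sooner than reading journalctl output did.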
As for the router - I have no idea what's wrong with it. I still have 1 spare port and if that fails, I'm just getting a new box sent to me by the ISP, for I have neither the expertise nor the time to bother trying to fix these things, probably just to kill my entire network in the process. So what does that leave me with?
So far, we had:
There is an amusing hypothesis about all of us living in a simulation. If that's the case, then I must have really upset someone, for them to put me in circumstances where I have to deal with this mess.
Someone once made this suggestion:
"Linux is only free if your time has no value."
You know what? I agree. I wholeheartedly agree that Linux is a bad operating system that runs most of the world's servers out there, that won't be migrated off of any time soon, because there are quite literally no alternatives with hardware support that's good enough (FreeBSD might be a great OS, but isn't as widely supported, or widely known).
For better or worse, we're stuck with Linux. And I dread how often seemingly simple things keep breaking in obscure ways, as well as how unpleasant it is to debug things when you haven't built everything from the ground up. Networking isn't my strong suit; I'm supposed to be just a regular software developer for the most part and yet I'm thrown into this area where I'm relatively incompetent, out of necessity.
This is also why I am forced to agree with statements along the lines of:
"Running your own homelab is signing up for keeping things running as well and dealing with all of the outages yourself."
But guess what - I'm too poor to host everything in the cloud and have all of my data there as well. Storage is cheap. My time is also cheap (due to the country that I live in, for the most part). So I have to essentially suck it up and deal with it, hopefully eventually read a few networking books and do a deep dive in this stuff - but by then I'll probably have lots of negative former experiences like this, which will make doing it much less enjoyable than it otherwise should be.
If I had to sum things up in a single sentence:
I want to like Linux distros, but in practice find myself frustrated a lot.
That's pretty much it.