You know, when I wrote "Never update anything" it was mostly meant as satire. We all know that updates are necessary for security reasons whenever a piece of software is connected to any network. However, that doesn't change the fact that updates are often unnecessarily brittle and break things seemingly at random, making for a stressful experience.
I have a number of servers running in the cloud, rented from Time4VPS, because they're affordable enough for my needs and have better uptime than I even want my homelab to have (given that I turn off my homelab servers while I sleep, so they don't use electricity or make noise). However, administering your own Linux servers comes with the responsibility of keeping them up to date.
That's exactly what I set out to do when updating a few servers that had an uptime of around 120 days, restarting them afterwards to make sure there were no boot-related issues that the unattended upgrades otherwise wouldn't make me aware of. There were.
The package upgrade process, which I kicked off manually this time, succeeded as the automated ones always do. It was only after a restart that issues manifested, issues that I wasn't even immediately aware of. Looking at the monitoring the next day, I saw that the CPU usage had spiked:
Not enough to actually break any of the sites served from that machine, but definitely enough to slow everything down. Surely there was a cause for all of it, one that I needed to discover and hopefully fix, since having 50% of your server capacity disappear isn't normal, no matter how you look at it.
The monitoring solution I use, Zabbix, is good for figuring things out at a glance, but not for digging into details at a finer-grained level, so I proceeded to connect to the actual server and use htop to look at the processes, though not before turning off the display of userland process threads, because having them shown makes everything a bit useless for me:
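For anyone who hasn't done that before, the toggle lives inside htop itself; the keybindings below are from memory, so double-check them against your version:

```bash
htop
# inside htop (as far as I remember):
#   F2 (Setup) -> Display options -> "Hide userland process threads"
#   or press Shift+H to toggle userland threads on the fly
#   Shift+K does the same for kernel threads
```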
With that out of the way, I could see... nothing useful:
Now, one could go on about the difference between CPU utilization and load averages in Linux, but I ended up in a bit of a silly situation here: I could see that something was using up the server's resources, yet no single process appeared to be the actual cause. That can happen when a process starts up and shuts down quickly but often, which made me think that htop isn't the best tool for the job. What I'd actually need is something along the lines of:
Here's a tree of processes that were started or running in the last X minutes, alongside any sub-processes that may have already exited, with their CPU time in %.
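The closest approximations I know of are atop and pidstat, neither of which gives you quite that tree, but both of which can catch short-lived processes; a rough sketch, with Debian/Ubuntu package names:

```bash
# atop samples the whole system at an interval and, with process
# accounting enabled (it tries to switch that on when run as root),
# also lists processes that exited during the interval
sudo apt-get install atop
sudo atop 5          # sample every 5 seconds; 'p' aggregates per program

# pidstat just prints per-process CPU usage at an interval,
# which is often enough to catch whatever keeps spiking
sudo apt-get install sysstat
pidstat -u 5         # CPU usage per process, every 5 seconds
```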
I didn't have anything quite like that at hand, so instead I just looked at the Docker container history, since I run most of my software as isolated containers with resource limits. That is not only a good idea from a manageability, update and stability perspective, but in this case it also revealed the issue:
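For anyone doing the same from a terminal rather than a dashboard, the equivalent check is roughly this (the format string is optional, it just trims the output):

```bash
# list every container, including stopped ones, with its current status;
# a crash-looping container shows up as e.g. "Restarting (1) 5 seconds ago"
docker ps -a --format 'table {{.Names}}\t{{.Status}}\t{{.RunningFor}}'
```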
I run my own mail server, and in this case it was stuck in a restart loop. As far as I can tell, it had been chugging away at constant restarts for the better part of a day, with no back-off by default (back-off is more of a Kubernetes thing than a Docker, Docker Compose or Docker Swarm one), which was what was eating up the CPU resources. You can actually specify a custom restart policy, which is useful in itself, along the lines of this:
```yaml
deploy:
  restart_policy:
    condition: on-failure
    delay: 60s
    max_attempts: 10
```
The problem with this is the trade-off: a long delay means downtime after one-off crashes, a limited number of restart attempts can turn something that crashes during a period of high load into a prolonged outage, and short delays with no limit on attempts get you what I got, the CPU being overcome with pointless restart churn. That's one of the things that Kubernetes does better.
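As an aside, if you suspect a restart loop like this one, Docker itself keeps enough state to confirm it, at least for standalone containers (the mailserver name below is just a placeholder):

```bash
# how many times the daemon has restarted this container under its restart policy
docker inspect --format '{{.RestartCount}}' mailserver

# when it last started and exited, and with what exit code
docker inspect --format '{{.State.StartedAt}} {{.State.FinishedAt}} {{.State.ExitCode}}' mailserver
```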
I then jumped over to Portainer to figure out exactly what was wrong with my mail server, simply because that's a bit easier than using the CLI, at least in my eyes. There we see much the same scene: a container that is scheduled to be running just keeps failing, over and over:
With a few mouse clicks, we see the actual cause of the issue:
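The CLI gets you the same answer, if that's more your thing; a sketch, with the service/container name again being a placeholder:

```bash
# for a Swarm service: list its tasks with the full, untruncated error messages
docker service ps --no-trunc mailserver

# for a plain container: read the last lines it logged before dying
docker logs --tail 100 mailserver
```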
So essentially what was wrong is that I had set up my mail server container to bind to a few ports on the node itself, so that it can act as a regular mail server, but those ports were not available. Something had already started up and was keeping them busy, making the container fail on startup every single time. Was I being hacked or something?
I decided to look into the causes for this and figure out what could possibly be running on port 25, since I hadn't installed anything like that. Thankfully, it was pretty easy to check:
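If you want to reproduce the check, either of these should do it (root is needed to see which process owns the socket):

```bash
# show whatever is listening on port 25, along with the owning process
sudo ss -tlnp 'sport = :25'

# or, with lsof
sudo lsof -i :25
```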
It appears that exim4 is to blame here: it is listening on port 25 and breaking my containers. Now here's a good question:
What the heck is exim4 and what is it doing on my system?
The first part of the question isn't too hard to answer, because the Debian Wiki has a page about it, which explains what's up:
Sadly, I have no satisfactory answer for the second part of that question. I have not manually installed a mail server on these machines, nor did I ever intend to. I didn't even manually install any software that might pull it in as a dependency. I have run no scripts on those servers whatsoever, since basically everything is done through containers with limited permissions. I also don't share access with anyone, and there's nothing weird in the login logs.
The reasonable answer, of course, is that the update process must have installed it for some unfathomable reason. Let's get rid of it:
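Getting rid of it, in this case, meant stopping the service and preventing it from starting on boot rather than uninstalling the package; something along these lines (unit name as shipped on Debian, as far as I know):

```bash
# stop exim4 now and keep it from coming back after the next reboot
sudo systemctl stop exim4
sudo systemctl disable exim4
```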
I double checked that it indeed wasn't doing anything (even though I didn't uninstall it, in case it was needed as a dependency for something else):
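The same check can be repeated from the shell, if you prefer:

```bash
# the service should now be inactive and disabled
systemctl status exim4
systemctl is-enabled exim4

# and nothing should be listening on port 25 anymore
sudo ss -tlnp 'sport = :25'
```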
And afterwards, the container started up correctly and I had my mail server up and running again:
In short, the problem wasn't that complicated and thankfully not too hard to fix, but annoying nonetheless. I checked my other nodes that run Ubuntu and none of them were affected by this, but all of the Debian ones suddenly had exim4 as an installed package, so I'm really leaning towards the updates being the culprit here, with the restart being what actually made the service launch.
So what can we learn from this? A few things, I'd say:
Overall, using containers is still far less brittle than installing everything on the system manually: at least I can plan updates for those pieces of software myself, as opposed to having something like kernel and essential service updates drag along a bunch of stuff I don't need or want, which then breaks other parts of my setup. That said, Debian isn't alone in this and I've had similar issues with basically every Linux distro out there, from Alpine to RHEL.
It's not horrible, of course, just a bit underwhelming and disappointing. I think a bit of additional alerting would go a long way, beyond just Zabbix and Uptime Kuma, given that not everything I run can have external health checks the way websites can, or will impact the server itself enough to show up there. What I need is good container monitoring that I can host myself.
Thankfully, at least the other forms of monitoring are a little bit diversified: for example, I wouldn't get e-mails from Zabbix if the mail server is down, but I'd still get Uptime Kuma notifications to my Mattermost instance. It'd be nice to have everything use multiple communication channels, but at least what I have right now is better than nothing.