My journey from ad hoc chaos to order (a tale of legacy code, services and

Date: 2021-11-27

Recently, there was a discussion on Hacker News about Kubernetes, containers and other ways to manage services. One of the arguments that i heard was also one that i had run into a lot previously:

Kubernetes is really complex and hard to use.

Many of the organizations really don't need it, Docker Compose would suffice.

Honestly, systemd services would also be enough for many out there.

I do agree with the points above, but i also believe that there's definitely a middle ground to be found in regards to running services that need to be present on multiple servers and be coordinated amongst themselves somehow, be it with HashiCorp Nomad, Docker Swarm or another solution entirely. Yet in many cases, people talk about these things as if there is nothing between the full-blown complexity of Kubernetes and running things on a single node with no orchestration (e.g. systemd or Docker Compose).

In my current company i've gone through quite the long path in finding the least painful ways to run a plethora of services in a similar setup, across multiple servers, so i figured that i'd write a quick blog post talking a bit more about my journey. Some of the details will be changed from the real world circumstances, of course, but the overall picture should remain useful nonetheless.

When i first joined the team, there was no service management in place

When i first joined the project that i'll tell you more about (well, maybe it was actually multiple projects, perhaps some that we took over the maintenance of from external developers etc.), it was pretty baffling to look at it, since everything just seemed very ad hoc and confusing, like looking at a pile of puzzle pieces that probably fit together, but you don't quite see how:

puzzle image

People just ran Tomcat instances through scripts in the bin folder and hoped that they'd keep running. This was bad for a number of reasons - when you have 10 servers and some of those routinely experience restarts because of updates and patches (dev environments), you'll find yourself struggling to keep up with everything and restarting everything that won't automatically start after such a restart.

Then there were also issues with permissions, because the users were also managed manually and not all of them had proper permissions in place. In many organizations this can also be a security problem, because creating users means that you also have to clean them up manually - unless this is explicitly described as a procedure to be followed (or even if it is), you'll sooner or later run into servers with accounts for people who no longer work in your organization. And even if you do remember to clean these accounts up, someone will still sit there with a checklist, wasting time looking over all of the environments manually. Or, if you try to centralize your account management in an organization wide manner, you'd better have really capable Ops people, otherwise you'll run into a situation where the single point of failure goes down and all of a sudden almost no one can do their work.

Also, you could expect to see things occasionally running as root, because clearly managing folder permissions for users and groups was too much to ask. Sure, one might say "oh, hey, those were just the dev environments that didn't contain anything important", but i don't think that that's a strong enough argument. In my experience, people will be as lazy and sloppy as you let them be, just to get what they want in a faster and simpler way. "It seems good enough, because it works for me" is the exact cause of bad permissions management, and perhaps the worst part is that people will oftentimes keep using those bad practices precisely because they won't be the ones to suffer from them (e.g. liability due to a shared root login account without even certificate based auth). Since they'll ship fast, by the time everything burns down or breaches happen, they'll already be at a different company, spreading bad practices further.

As for the actual setup, it was also routine that one service would start misbehaving and eat up all of the RAM, leading to things breaking. Also, there were problems with how the environments had been set up over the years and a lot of inconsistency - in one the applications lived under /app on the file system, in another under /dir01, another had things under /usr/share, or /var/opt, or /etc/tomcat or something else (don't really remember the particulars, just imagine lots of inconsistent configuration) and even the JDK versions differed. There was sloppiness all around, which wasn't necessarily acknowledged until the point where QA specialists couldn't do their work due to environments randomly breaking multiple times per week. The Slack/Teams chats occasionally looked like this:

chat example
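
To make that kind of inconsistency visible, here's a minimal Python sketch that compares a few settings across servers and reports the ones that differ. It only illustrates the idea: the host names, setting names and values are all made up, and in practice something would have to collect them from the servers first.

```python
from collections import defaultdict


def find_drift(inventory):
    """Return {setting: {value: [hosts]}} for settings that differ between hosts."""
    values_by_setting = defaultdict(lambda: defaultdict(list))
    for host, settings in inventory.items():
        for key, value in settings.items():
            values_by_setting[key][value].append(host)

    drift = {}
    for key, hosts_by_value in values_by_setting.items():
        defined_on = sum(len(hosts) for hosts in hosts_by_value.values())
        # A setting has drifted when more than one distinct value is in use,
        # or when some hosts do not define it at all.
        if len(hosts_by_value) > 1 or defined_on < len(inventory):
            drift[key] = {value: sorted(hosts) for value, hosts in hosts_by_value.items()}
    return drift


# Hypothetical inventory, loosely modelled on the paths mentioned above.
inventory = {
    "dev01": {"app_dir": "/app", "jdk": "8"},
    "dev02": {"app_dir": "/dir01", "jdk": "8"},
    "dev03": {"app_dir": "/app", "jdk": "11"},
}

for setting, variants in sorted(find_drift(inventory).items()):
    print(setting, variants)
```

With the sample data both settings get reported, since neither the application directory nor the JDK version is the same on all three servers.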

All of this is especially bad when you have many developers on the team who occasionally need to change the server config for the apps (and who sometimes won't do so properly for all of the servers), which will only accelerate the configuration drift and environment rot. That's just not sustainable and if you ever see an environment like that, your choices are to either turn and run for the hills, or spend months fixing everything and trying to avoid the eventual outcome of Knight Capital:

This is the story of how a company with nearly $400 million in assets went bankrupt in 45-minutes because of a failed deployment.

Sure, you might also just stick your head into the sand and get away with bad practices, but why put yourself at risk? Or, better yet, why put yourself at risk due to the incompetence or unwillingness to put in the work of someone else who cares about neither the code nor your future?

Thus, i chose to work in the service of the other team members, the clients and the company as a whole to untangle that utter mess and hopefully educate my colleagues on how not to make these mistakes in internal workshops and seminars later.

Then, i introduced systemd and Ansible

There weren't that many options that i could utilize to fix all of that, honestly. But i felt that i should start with the things that seem the worst off and see where that takes me - fixing how the services themselves are launched and fixing the way the application configuration is managed for the environments (as well as system packages, user accounts and everything else).

For these, Ansible and systemd seemed to make the most sense, being established, capable and boring solutions that wouldn't present too many surprises:

systemd and ansible

Now a lot of folks dislike systemd because of how much it attempts to do, and while i agree with that assertion, that fact also made it pretty much perfect for my use case, since i didn't have the time or resources to integrate 10 different solutions instead. That mattered all the more because the business was likely to view my success or failure as a reflection of how good the actual practices and attempts to fix things were in the first place, so succeeding was paramount.

Personally, i'd say that you should use whatever suits your needs, be it systemd, OpenRC, or anything else. The same goes for the likes of Ansible - some folks might prefer Salt, Chef, Puppet or something else entirely.

However, introducing both of these was a good idea, since this addressed two problems at once. The first was the configuration drift and having no idea who changed what for which environment and why. The second was that the services should keep working after restarts and should be as formalized as possible, since running /scripts/kill-process-thats-probably-a-zombie.sh and /scripts/start-process.sh isn't a good approach, even more so if you need something like that for 3/7 environments.

Systemd actually worked nicely for this, since you could also see the status of each service and manage it, so restarts and redeployments became more stable and easier:

systemd service status example

(random example of a systemd service status, found online not to display any actual services; though that little bit of green text alone does wonders for my blood pressure)
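
For a rough idea of what such a service definition might contain, here's a small Python sketch that renders a minimal unit file. This is not the actual configuration from the project: the description, user name, start command and memory limit are all hypothetical, and MemoryMax= assumes a reasonably recent systemd with cgroup v2 (older setups use MemoryLimit= instead).

```python
def render_unit(description, user, exec_start, memory_max):
    """Render a minimal systemd service unit as text."""
    lines = [
        "[Unit]",
        f"Description={description}",
        "After=network.target",
        "",
        "[Service]",
        f"User={user}",                # dedicated account instead of root
        f"ExecStart={exec_start}",     # must stay in the foreground
        "Restart=on-failure",          # bring the process back if it crashes
        f"MemoryMax={memory_max}",     # cap RAM usage (assumes cgroup v2)
        "",
        "[Install]",
        "WantedBy=multi-user.target",  # start at boot once the unit is enabled
    ]
    return "\n".join(lines) + "\n"


# All of the values below are made up for illustration.
unit = render_unit(
    description="Example Tomcat app",
    user="appuser",
    exec_start="/opt/example-app/bin/catalina.sh run",
    memory_max="2G",
)
print(unit)
```

Saved as something like /etc/systemd/system/example-app.service, a unit along these lines would be started now and after every reboot with `systemctl enable --now example-app.service`, which is the part that the old start scripts never handled.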

Not only that, but Ansible also works wonderfully for managing configuration, since all of a sudden you can have what you need for all of your dev environments within a single Git repo and automate configuration changes, or even do merge/pull requests and code review for them.

Where it went wrong and why we needed something more

The problem with systemd and Ansible was that the people we were developing the system for didn't have the time or resources to fix their infrastructure by following our example. While this caused minimal to no disparity (still just Tomcat instances or .jars with embedded Tomcat running, just launched differently) and the continued management of our own environments became much easier, many of the resource related issues remained.

So did the fact that there were no proper health checks or restarts in place, nor was there proper load balancing. Managing DNS changes and reverse proxying was a bit of a mess as well, especially on their environments, or on ours in cases where getting the Ops people to give us new DNS names took days instead of minutes. In practice some servers could still only be referenced as "134.241.211.40:3000", which is really ugly and bad for discoverability, as well as problematic when you want to test SSL/TLS certificates.
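
As an illustration of the kind of health check that was missing, here's a minimal Python sketch that polls an HTTP endpoint a few times before giving up. It's an assumption about how one could do it with just the standard library, not something we actually ran; the function names are made up, and so is any URL that you'd pass in.

```python
import time
import urllib.error
import urllib.request


def http_ok(url, timeout=2.0):
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure or an HTTP error status.
        return False


def wait_until_healthy(check, attempts=5, delay=1.0):
    """Call check() up to `attempts` times; return True on the first success."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)  # give the service a moment before retrying
    return False
```

A deployment script could then call something like `wait_until_healthy(lambda: http_ok("http://localhost:8080/health"))` (a hypothetical endpoint) and restart the service or fail the deployment when it returns False.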

Not only that, but needing specific JDK versions, Tomcat versions and OS package versions (e.g. running CentOS/Oracle Linux/Fedora in one set of environments and RHEL in another, or any other pairing of such distros that are mostly compatible but not really) was another thing that actually caused issues. Telling people to update a package version, only for them not to do so and for you not to notice because you don't have access to prod, is absolutely unacceptable.

So, you can't just make your own life easier by fixing your own environments, but you need to propagate the changes all the way to the clients or any other environments that you want to ensure will be consistent, something that i couldn't easily do with systemd and Ansible due to the work involved (CI server, pipelines to run the Ansible container as a part of CI, new accounts and certs for SSH access to the server, playbook configuration, the actual Git configuration changes and all of the parameters for customizing that config).

Thus, that road to getting laurels for a job well done was closed, because my job wasn't done - i'd also have to fix the clients' environments despite the clients themselves, and find an approach that's easy enough for them to utilize successfully:

it should just work

Personally, i think that on a technical level we could have stopped here, but in reality oftentimes there will be social factors and politics at play, that you can have no hope of tackling by yourself whatsoever. Thus, in many environments out there, what you ship is more or less what will be running in prod - any additional configuration that needs to be done or any of the best practices that you suggest should be used may or may not actually be done. If this happens after introducing solutions that should supposedly fix the longstanding problems, then clearly the solutions themselves are not enough for that particular environment and you need something more.

Don't just assume that you and your colleagues are fallible, but also that any third parties that will be running your code are. Always assume the worst so that your outcomes turn out for the best. So, if we could make our environments good but not poke around the clients' environments, then we'd just have to ship our entire environments to them, in a manner of speaking. Ergo, containers entered the picture.

Thus, we adopted Docker Swarm

So, we decided to ship the entire environment that each app needs separately, as a self contained package and one that'd coincidentally be the same that we'd test on our own end, alongside moving towards Infrastructure as Code and fully automated CI/CD cycles.

Now all of the environments were moved into Docker containers that run on Docker Swarm (more than enough for most small/medium deployments out there), with the added benefit of overlay networking that doesn't expose anything that we don't want to the outside world, while at the same time allowing us to easily manage everything from how ingress is handled, to what's in the containers (Tomcat/JDK/packages/distro) and just generally have higher confidence that what we deliver will actually work, and prevent rogue services from eating up all of the resources without having to manually mess around with cgroups, FreeBSD jails, or any other mechanism.

Plus, the developers no longer needed to manually log into the servers and could also have their write permissions revoked, forcing all of the changes to go to the servers through the formal merge/pull request approval process, Git versioning and automated deployment, short of exceptional circumstances:

container CI/CD cycle

(a simplified example of how it looks "from above" without container registries etc.)

The problem with this was that getting buy-in from the clients was needed, as well as that implementing all of this was pretty cumbersome, because while containers are amazing for working on projects from day 1, containerizing legacy code that hasn't ever heard of the 12 Factor App principles is a pain and rewrites are necessary.

Even more so when the clients demand that you upgrade app components and you don't have the power to say no and just get containers working first, before getting bogged down with upgrading outdated frameworks and code with insufficient unit tests, which you have to do just because no one else will.

To not slip into a rant of sorts, allow me to instead re-iterate how important the 12 Factor App principles above are:

twelve factor app

It's not enough to just throw any monolith into a container and call it a day. Even if you manage to avoid the above minefield of throwing in framework version updates alongside containerization and can just do the bare minimum to get the app working, you should still look into shared data, configuration management, logging methods and many other things that the site wonderfully goes over. It's surprisingly sane, not overcomplicated and has actionable advice on what to do to make your apps better.

And the beauty of it is that it can work for both containerized apps and any other piece of cloud native software that you might want to create. And it's mostly technology agnostic, so you won't have to struggle to use YAML or XML for configuring a Java service, a properties file or .env file for Python/PHP or whatever, but instead can use a more OS centric mechanism like reading configuration from the environment variables, or use mounted files for secrets management.
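
To illustrate the environment variable approach, here's a minimal Python sketch that reads regular settings from the environment and a secret from a mounted file. The variable names (APP_DB_HOST, APP_DB_PORT, APP_DB_PASSWORD_FILE) and the defaults are made up for this example, so treat it as one possible shape rather than a prescribed one.

```python
import os
from pathlib import Path


def load_config(env=None):
    """Build the app configuration from environment variables."""
    env = os.environ if env is None else env
    config = {
        # Hypothetical variable names with defaults for local development.
        "db_host": env.get("APP_DB_HOST", "localhost"),
        "db_port": int(env.get("APP_DB_PORT", "5432")),
    }
    # The secret itself lives in a mounted file; only the path to that
    # file is passed through the environment.
    secret_file = env.get("APP_DB_PASSWORD_FILE")
    if secret_file:
        config["db_password"] = Path(secret_file).read_text().strip()
    return config


print(load_config({"APP_DB_HOST": "db.internal", "APP_DB_PORT": "6543"}))
```

The same code works unchanged on a developer's machine, under systemd or in a container, since only the environment differs. With Docker Swarm secrets, for example, the file path would typically point somewhere under /run/secrets/.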

Apart from the organizational issues and the things surrounding the code, Docker Swarm actually performed wonderfully and i suggest that anyone else look into using Ansible and GitOps with it, and maybe use something like Portainer for onboarding people more easily.

And then another project with Kubernetes came along

While i explored Kubernetes (K3s and Rancher) as a part of my Master's degree and while i dabble in containers and orchestrators in my free time (Nomad is really nice btw), things still took a turn for the worse when i was called in to help with another project that had opted for K8s.

Now, this is unrelated to the previous example, however in my eyes it serves as an example of what can happen when you try to adopt complicated technologies without having the Ops expertise or resources to do so properly, possibly due to not being familiar with all of the downsides:

kubernetes for small teams

Now, i'm not saying that Kubernetes is a bad choice in all circumstances, because when it works, it works really well and is like having superpowers, and can prevent you from having to deal with lots of otherwise bureaucratic processes that have been automated away and also make you not worry about the factor of human error quite as much.

And yet, when working with small teams and with limited resources, you should almost always prioritize simplicity over most other concerns and not try to replicate what companies with 100x - 1000x more resources than you are doing, as long as you remember to avoid the aforementioned bad practices that lead to security risks. Surely, you can find a compromise to follow at least some of the good practices of the past decade, without having to rely on overly complex procedures or tools.

In this particular case, the clients essentially expected us to duplicate their entire environment, from hosting a Helm chart repo, to having K8s clusters, systems for managing secrets, their CI/CD pipelines and so on. The complexity of it was mind boggling and was encapsulated in around a 50 page long document - it was still great that there was documentation like that in place, however its length should be telling.

It's just another case of not going all the way as far as automation goes - if you can't give others a playbook that will create everything that's needed for development, then don't expect others to be able to do what probably took you weeks to months in a much shorter amount of time. What's perhaps the worst aspect here is that it's an utterly blocking issue - without a solution for it, no development of any sort can take place.

The management had already undertaken contractual obligations, so imagine the conversation with them going like: "Oh, hey, can you help us set up the environment? Think you could have it ready by 3 PM today?" and the answer being along the lines of "No, unless they have IaC playbooks that they can share (which they didn't), then doing all of that to completion will take anywhere from a week to a month."

Of course, you also can't forget about the hardware requirements for a technology like Kubernetes, especially with service meshes like Istio + Kiali and other solutions thrown in. The good news is that some distros, like K3s, can be pretty close to Nomad and Swarm in regards to the resources that you'll need for a Kubernetes cluster, and also have good management tools, the strongest option probably being Rancher (which supports K3s, not just RKE), but Portainer also being passable. However, if you want one of the heavier distros, then that might not be the case at all. This can be especially bad when you have to justify to Ops why you suddenly need a server with 16 - 32 GB of RAM for this one microservices project.

Alas, that's bound to make you suffer and create plenty of blockers along the way, even if OCI containers are a good idea. Their beauty and simplicity got lost somewhere along the way, i'm afraid.

Summary

Now, some (many) of the details above are changed a bit (e.g. you can replace Java with Ruby, or Node, or Python; or RPM distros with DEB distros; it's the same story with every technology, in my experience), but the point remains - the path to preventing systems rotting with age is a long and complicated one and you'll probably need to invest both time and effort into finding out what works for any particular deployment.

That said, if you don't have a strong history of DevOps in your org, tread lightly, especially if your organization unit is viewed as a cost center, rather than a profit center - you'll probably want as much buy in from management and your fellow engineers as possible and probably also utilize the least amount of software and the simplest possible configurations for tackling your problems, otherwise you're at risk of making things much worse in your real world conditions while chasing some abstract perfect environment.

I still think that containers are a good option, or even just systemd services + Ansible for managing server setup and configuration, but personally the ease of use and fault tolerance of OCI containers and technologies like Docker Swarm is hard to ignore, so i'd definitely suggest that you look into them, even if they're not currently hot on the market. As for Kubernetes, K3s is proof that things are getting at least a little bit better as time goes on, but in many cases, resource usage, ease of use, ease of setup and reproducibility of the entire cluster (not just what's running inside of it) aren't on the radar for people who want to use the trendy tech to make their CVs look better.

If possible, look out for the dangers of CV driven development, or find environments where that can be done responsibly (e.g. when you actually need those technologies and they solve a problem that your org has, and you actually have the resources to do pilot projects first that embrace them). Alternatively, Choose Boring Technology with fewer unknowns, while not overlooking some of the best practices out there, like 12 Factor Apps - you don't need to take any of it as gospel, but do implement approaches that can simplify things for you.