Keycloak is broken

Okay, so here's the thing: I've been bootstrapping an entire platform as part of some freelance work, and I decided to use a bunch of off-the-shelf components for it, especially for authn/authz. I still remember the words of Eoin Woods in a security presentation, which stuck with me: "Never invent security tech".

For that, something like Keycloak seems like a great choice: I've successfully used it previously in an enterprise setting, in general it works nicely and supports all sorts of configuration that you might need. Not only that, but I decided to opt for mod_auth_openidc, which is, in their own words:

OpenID Certified™ OpenID Connect Relying Party implementation for Apache HTTP Server 2.x

That might sound like a lot to unpack, but basically everything boils down to it being a gateway solution of sorts that can protect ANY web application or other resource with OpenID Connect or OAuth2. Much like how web servers like Apache or Nginx can add a bunch of security or other headers when acting as a reverse proxy, or handle TLS termination for you, or even integrate with something like Let's Encrypt through ACME for automated certificate renewal, mod_auth_openidc lets you hand off all of the heavy lifting with regard to authn/authz, redirects to your login portal and all that jazz.

Even the fact that it uses Apache2 isn't really a big issue for me, because at the scale that I work at, it's a perfectly serviceable web server and I haven't ever really needed to use the excellent HAProxy, nor has Nginx been needed for performance reasons (though it's an excellent web server otherwise, as long as you don't need a lot of custom modules, since compiling isn't pleasant). But I digress, the whole setup can be thought of as the following:

OpenID Connect

There's probably a fancy graph that someone could draw with all of the possible authorization flows, but all we need to know here is that the web server would see something like the following, when a user accesses a protected resource:

OIDC header example

(this is actually a debug endpoint in the API, so that admin users can test the headers in Swagger as the server receives them from mod_auth_openidc; normally users should never see these directly, only the server does, so don't get confused by the Swagger UI: this is what your API would see server side)

This means that if you have 4 different languages/frameworks in your API, you technically don't need 4 different OIDC libraries that you have to configure and integrate: merely reading the data from the headers (that are only allowed to be set by the Apache module) is enough! In addition to that, the module can help with token refresh logic, logging out and a bunch of other stuff. If you're not actually logged in and try to access a protected resource, you'll see something like this or will be redirected to the SSO platform:

unauthenticated example
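To make the header idea a bit more concrete, here's a minimal Go sketch of what the API side could look like. It assumes the module forwards claims with the OIDC_CLAIM_ prefix (which, as far as I know, is its default) and that the provider releases sub and email claims; the Identity type, the whoami handler and the sample values are made up for illustration. It also only makes sense if the backend is reachable exclusively through Apache, since otherwise anyone could send these headers themselves.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// Identity is the (hypothetical) subset of user data this sketch cares about.
type Identity struct {
	Subject string
	Email   string
}

// identityFromHeaders reads the claims forwarded by the gateway.
// Assumption: claims arrive as OIDC_CLAIM_<name> headers and the
// provider actually releases "sub" and "email".
func identityFromHeaders(h http.Header) (Identity, bool) {
	id := Identity{
		Subject: h.Get("OIDC_CLAIM_sub"),
		Email:   h.Get("OIDC_CLAIM_email"),
	}
	return id, id.Subject != ""
}

// whoami is a hypothetical endpoint that trusts the gateway headers.
func whoami(w http.ResponseWriter, r *http.Request) {
	id, ok := identityFromHeaders(r.Header)
	if !ok {
		http.Error(w, "no identity headers", http.StatusUnauthorized)
		return
	}
	fmt.Fprintf(w, "hello %s <%s>\n", id.Subject, id.Email)
}

func main() {
	// Simulate a request as the backend would receive it from the gateway.
	req := httptest.NewRequest(http.MethodGet, "/whoami", nil)
	req.Header.Set("OIDC_CLAIM_sub", "1234")
	req.Header.Set("OIDC_CLAIM_email", "user@example.com")

	rec := httptest.NewRecorder()
	whoami(rec, req)
	fmt.Print(rec.Body.String()) // prints "hello 1234 <user@example.com>"
}
```

No OIDC library is involved on the backend at all, which is the whole appeal: the same few lines could be rewritten in any of the languages in the stack.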

Pretty cool, right? It really feels like at a certain scale, this is indeed what one would call "the right tool for the job".

It wasn't meant to be

Unfortunately, Keycloak kind of rains on my parade here. Imagine looking at your production uptime monitoring and seeing something like the following:

uptime monitoring

(thankfully, this isn't the end of the world, since the service hasn't actually gone live and isn't available to the public yet)

Now, in my case individual failing requests aren't reported as an outage but rather as degraded service. Still, seeing that only 90% of the requests to Keycloak succeeded is insanity. In the industry, people talk about having "multiple nines" of uptime: whether their uptime is 99.99% or maybe 99.999%. Here we have way less than that, to the point where even the most low-stakes setup would find it unacceptable - how would you feel if a tenth of your users couldn't log in?

Well, first I went to have a look at Keycloak logs, but there was no luck there whatsoever:

no errors in the logs

The software has the typical Java issue of INFO logging having nothing useful in it, whereas DEBUG spams way too much, while still being more or less useless. Regardless, Keycloak itself doesn't report issues in the single node setup, which is unfortunate, because it means that something a bit more fundamental than just a distributed cache config has gone awry.

Then again, I'm distrustful of Keycloak in general, because their releases are straight up weird: by default, their container images actually install a bunch of stuff during startup, which is a big no-no. Telling users to create their own optimized container images also feels odd, though it's maybe understandable as a way to let people customize exactly what is included (they should still provide a default image with all functionality available out of the box, though):

enterpriseitis

(you can see an attempt to disable the caching of any static resources, though this didn't help either)

This "enterpriseitis" aside, actual web requests seem to work most of the time, when you try testing the login functionality:

how it should work

However when things break, it seems like it's the most random stuff ever:

web failures

Sometimes it's the CSS files, sometimes it's the PNG or WOFF2 resources, other times the page itself fails, but what fails and when seems entirely random. What the heck? Looking at a failed request, we don't see that much out of the ordinary, either:

headers

The mod_auth_openidc_state cookie is a bit large, but other than that it seems like we just get a server error, where Apache2 is unable to talk to Keycloak, which apparently just drops the request:

AH01097: pass request body failed to KEYCLOAK_APP

That's pretty horrible, especially because it happens seemingly with no rhyme or reason. People online have actually had pretty much the same issue, though their suggestion of increasing the Nginx proxy buffer size doesn't exactly work, when I'm not using Nginx:

proxy_buffer_size 128k;
proxy_buffers 4 256k;
proxy_busy_buffers_size 256k;

Even using whatever might be the closest Apache2 equivalent does pretty much nothing.

It's not like I can switch to Nginx either, since something like mod_auth_openidc doesn't exist there. This means that I'd have to rely on something like OpenResty, which brings its own complexity into the mix (and I'd also have to hope that the functionality that I need isn't locked away in some paid version). If I had to guess, I'd say that the lack of an equivalent module is in part due to Apache being more modular and thus not requiring anyone to recompile the whole thing to add modules, which is reflected by what's available in the ecosystem.

It works, 90% of the time, every time

So what's the end result there? I'd call it the Russian Roulette login approach.

Sometimes it will look okay:

01 looking okay

Other times things will work, but be slightly broken:

02 slightly broken

Or a bit more broken:

03 more broken

Perhaps it will look something out of the 90s:

04 more breakage

Although sometimes you'll get a mobile layout, just less functional:

05 different look altogether

Or, worst case, nothing will work at all:

That's kind of embarrassing to be honest, and just isn't workable. Unfortunately, mod_auth_openidc is kind of niche, as is the combination with Keycloak, so while they do have some nice docs, nothing like this is covered. Ouch.

Summary

So what's left for me to do? Look for another OpenID Connect Relying Party implementation? Maybe swap out Keycloak for something like Authentik and admit defeat? Well, no, I actually like the fact that Keycloak is popular and well documented, in addition to also running with MySQL/MariaDB, not just PostgreSQL, so I want to make it work.

To do that, however, I might have to be a bit devious along the way. Basically, I don't owe Keycloak anything - if it refuses to service a request, I can just make it again. And again. And I can keep doing this until it eventually works. Apache2 itself doesn't support this (many web servers don't have transparent retries, since technically you shouldn't do this), but frankly with something like Go and its built-in HTTP client/server, I could probably create something that will proxy requests for me and automatically retry them, up to a point:

keycloak plan
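The plan above could be sketched in Go roughly like this. Treat it as an illustration under assumptions rather than the finished proxy: the names (retryTransport, retryable, newRetryProxy), the retry rule (any transport error or any 5xx), the attempt count and the backoff are all my own picks, and the flaky test server merely stands in for Keycloak so that the example runs on its own. The request body is buffered in memory so it can be replayed, which also means that non-idempotent requests such as a login POST get repeated blindly.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"net/http/httputil"
	"net/url"
	"sync/atomic"
	"time"
)

// retryable reports whether an attempt should be repeated: either the
// upstream dropped the request (err != nil) or it answered with a 5xx.
func retryable(status int, err error) bool {
	return err != nil || status >= 500
}

// retryTransport wraps another RoundTripper and repeats failed requests.
type retryTransport struct {
	next     http.RoundTripper
	attempts int           // total tries, including the first one
	backoff  time.Duration // pause between tries
}

func (t *retryTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	// Buffer the body once so that every attempt can replay it.
	var body []byte
	if req.Body != nil {
		var err error
		body, err = io.ReadAll(req.Body)
		req.Body.Close()
		if err != nil {
			return nil, err
		}
	}
	for i := 1; ; i++ {
		attempt := req.Clone(req.Context())
		attempt.Body = http.NoBody
		if len(body) > 0 {
			attempt.Body = io.NopCloser(bytes.NewReader(body))
		}
		resp, err := t.next.RoundTrip(attempt)
		status := 0
		if err == nil {
			status = resp.StatusCode
		}
		if !retryable(status, err) || i >= t.attempts {
			return resp, err // success, or out of attempts
		}
		if err == nil {
			// Discard the failed response so the connection can be reused.
			io.Copy(io.Discard, resp.Body)
			resp.Body.Close()
		}
		time.Sleep(t.backoff)
	}
}

// newRetryProxy builds a reverse proxy that forwards to upstream with retries.
func newRetryProxy(upstream *url.URL, attempts int, backoff time.Duration) http.Handler {
	proxy := httputil.NewSingleHostReverseProxy(upstream)
	proxy.Transport = &retryTransport{next: http.DefaultTransport, attempts: attempts, backoff: backoff}
	return proxy
}

func main() {
	// Stand-in for the flaky upstream: fails twice, then answers normally.
	var calls atomic.Int32
	flaky := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if calls.Add(1) <= 2 {
			w.WriteHeader(http.StatusBadGateway)
			return
		}
		fmt.Fprint(w, "login page")
	}))
	defer flaky.Close()

	upstream, _ := url.Parse(flaky.URL)
	front := httptest.NewServer(newRetryProxy(upstream, 5, 10*time.Millisecond))
	defer front.Close()

	resp, err := http.Get(front.URL + "/")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	// prints "status: 200 upstream calls: 3"
	fmt.Println("status:", resp.StatusCode, "upstream calls:", calls.Load())
}
```

In a real deployment the two test servers would go away: the handler returned by newRetryProxy would be passed to http.ListenAndServe, and Apache2 would be pointed at that address instead of at Keycloak directly.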

Truth be told, the reality is that there's a lot of arguably broken software out there, but most of the issues should be possible to work around. Now, isn't it a bad thing that some requests will randomly hang for a bit longer while they're repeated multiple times behind the scenes, also leading to higher load on the Keycloak side? Well, yes, but also I don't care - I've done my due diligence, it seems that disabling/enabling caching on the Keycloak side does nothing, Apache2 configuration does nothing and I certainly don't have 10 years to spend debugging this odd issue. As long as it works, that's good enough, so I'll turn that 90% into at least 99%, whatever it takes.
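As a back-of-the-envelope check on the "90% into at least 99%" goal, here's a tiny Go sketch. It assumes that failures are independent between attempts, which matches how random they look but isn't something I've verified; successAfter is a made-up helper name.

```go
package main

import (
	"fmt"
	"math"
)

// successAfter returns the chance that at least one of n attempts succeeds,
// given a per-attempt success rate p and assuming independent failures.
func successAfter(p float64, n int) float64 {
	return 1 - math.Pow(1-p, float64(n))
}

func main() {
	for n := 1; n <= 3; n++ {
		fmt.Printf("%d attempt(s): %.1f%%\n", n, 100*successAfter(0.9, n))
	}
	// prints:
	// 1 attempt(s): 90.0%
	// 2 attempt(s): 99.0%
	// 3 attempt(s): 99.9%
}
```

Under that assumption a single retry would already get to 99%, and every further attempt adds roughly another nine. If the failures turn out to be correlated (say, Keycloak stalling for a few seconds at a time), the real numbers will be worse.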

Either way, it's still a bit saddening that things like this need to throw a wrench into the works and that software won't "just work".
