So here's a little something after a bit of a hiatus. Recently i had to deal with a pretty bad outage at work, one that required bringing me in to fix it. What should have been a pretty short cycle of gathering application metrics, digging through them to figure out what's causing the bottlenecks and taking actionable steps to improve the situation instead ended up being something that had me tearing my hair out (thankfully only figuratively) for numerous hours and even shipping versions at like 2 AM.
You see, the clients could reproduce the problem on their environments, but we could not on ours. That's something i found out only after some days, once i got adequate resources to do large scale automated testing (with tests that i sadly had to write myself, using Selenium and Docker). Prior to that, i had already done quite a few preliminary optimizations, which improved the request processing times noticeably, yet that wasn't enough.
And yet, the request processing times were still bad on the clients' side, despite the optimizations. Now, it's great that i had the data to prove how the request processing times changed due to my efforts, so i'd definitely advise everyone to have an APM solution in place and to automate it as far as possible, but if things still don't work on the actual clients' environments and in prod, then it's a total non-starter either way!
So, after benchmarking their servers and finding that the actual hardware wasn't to blame, and checking that their DB worked as expected, i finally checked their JDK version. As it turns out, they hadn't followed instructions that were almost a year old and as a consequence were still running Oracle JDK instead of OpenJDK, which, as i later found out, was to blame for everything being horrible.
What created this difference? No idea, my backlog is too long to care. Is this acceptable? Absolutely not! At least not if Java wants to keep claiming that it's "write once, run anywhere", since as the data shows, that claim needs an asterisk next to it and an explanation that performance might be downright terrible in some cases.
But that's not the real lesson to learn here. That honor belongs to the idea that you should never trust your clients. Well, maybe trust them, but verify everything nonetheless! Defensive programming should probably also apply to the environment that the application needs. JDK vendor doesn't match the one that's required by the application? Check it at startup and fail fast - make the app refuse to start unless the environment matches what you expect.
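To make that a bit more concrete, here's a minimal sketch of such a startup check. It assumes the application expects an OpenJDK build and looks at the standard `java.vm.name` system property, which on OpenJDK builds typically reads something like "OpenJDK 64-Bit Server VM", while Oracle JDK typically reports "Java HotSpot(TM) 64-Bit Server VM". The class and method names (`RuntimeGuard`, `isExpectedRuntime`, `requireExpectedRuntime`) are made up for illustration, and the exact property values vary between vendors and versions, so verify them on the runtimes you actually care about.

```java
// A sketch of a fail-fast runtime check, not a definitive implementation.
final class RuntimeGuard {

    // Assumption: the app requires an OpenJDK build, and OpenJDK builds
    // include this marker in the "java.vm.name" system property.
    static final String EXPECTED_VM_MARKER = "OpenJDK";

    private RuntimeGuard() {
    }

    // Pure helper, so the matching logic can be tested without swapping JDKs.
    static boolean isExpectedRuntime(String vmName) {
        return vmName != null && vmName.contains(EXPECTED_VM_MARKER);
    }

    // Throws if the JVM that is running this code doesn't look like OpenJDK.
    static void requireExpectedRuntime() {
        String vmName = System.getProperty("java.vm.name");
        if (!isExpectedRuntime(vmName)) {
            throw new IllegalStateException(
                    "Unsupported Java runtime: " + vmName
                    + " (vendor: " + System.getProperty("java.vendor")
                    + ", version: " + System.getProperty("java.version")
                    + "), expected a VM name containing '"
                    + EXPECTED_VM_MARKER + "'");
        }
    }

    public static void main(String[] args) {
        // Call this first, before any other initialization happens.
        requireExpectedRuntime();
        System.out.println("Runtime check passed, starting the application...");
    }
}
```

The reason for checking `java.vm.name` instead of `java.vendor` is that the vendor property names whoever built the binaries, so Oracle's own OpenJDK builds also report Oracle as the vendor. Either way, the error message includes all three properties, so whoever reads the logs can immediately see what the client is actually running.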
Alternatively, containerize everything and start living in the 21st century, by cutting dependency management out of the list of manual steps that need to be taken in the first place.