(3/4) Pidgeot: a system for millions of documents; Deploying and Scaling

You can read the previous part here: (2/4) Pidgeot: a system for millions of documents; The Application

In the previous parts, we setup PostgreSQL and MinIO for our document storage system, as well as wrote a Java API to glue them together. We also implemented some logic to generate test data, however now we actually need to get all of this up and running somewhere, so that I can generate a reasonable amount of data for testing!

Technology choices

The easiest way to get things to be portable by far, is to use containers, which will include the runtimes that any of my apps need, as well as allow for consistent and easy configuration, logging, resource management and so on.

I generally build my own containers when applicable and store them in my Nexus instance, to lessen my reliance on Docker Hub and also improve how consistent my containers are, to know how everything works inside of them.

In this case, I don't even need to build the PostgreSQL or MinIO containers separately, because they are already in my Nexus registry (which is mostly private for now), however you can use the Bitnami container images for this.

All I need is to launch them in some container stack/deployment and connect the application/dbmate to them. For that, I already have a Docker Compose file:

version: '3.7'
services:
  pidgeot_dbmate:
    image: docker-dev.registry.kronis.dev:443/dbmate
    volumes:
      - ./db/migrations:/workspace/db/migrations
    environment:
      - DATABASE_URL=postgres://pidgeot_pg_user:pidgeot_pg_password@pidgeot_postgres:5432/pidgeot_pg_database?sslmode=disable
    deploy:
      resources:
        limits:
          memory: 512M
          cpus: '1.00'
      restart_policy:
        condition: on-failure
        delay: 60s
  pidgeot_postgres:
    image: docker-prod.registry.kronis.dev:443/postgresql
    ports:
      - 5432:5432
    volumes:
      - pidgeot_postgres_data:/bitnami/postgresql
    environment:
      - POSTGRESQL_USERNAME=pidgeot_pg_user
      - POSTGRESQL_PASSWORD=pidgeot_pg_password
      - POSTGRESQL_DATABASE=pidgeot_pg_database
    deploy:
      resources:
        limits:
          memory: 2048M
          cpus: '2.00'
  pidgeot_minio:
    image: docker-prod.registry.kronis.dev:443/minio
    ports:
      - 9000:9000
      - 9001:9001
    volumes:
      - pidgeot_minio_data:/data
    environment:
      - MINIO_ROOT_USER=pidgeot_min_user
      - MINIO_ROOT_PASSWORD=pidgeot_min_password
      - MINIO_SERVER_ACCESS_KEY=pidgeot_min_access_key
      - MINIO_SERVER_SECRET_KEY=pidgeot_min_secret_key
    deploy:
      resources:
        limits:
          memory: 2048M
          cpus: '2.00'
volumes:
  pidgeot_postgres_data:
    driver: local
  pidgeot_minio_data:
    driver: local

(there might be additional deployment constraints in clusters with more nodes, as well as fewer exposed ports directly when I use an ingress)

Packaging the migrations

I want to actually package the migrations for any application version, instead of dynamically loading them from the file system. Because I already run the migrations in a Docker container, building a Dockerfile for dbmate is pretty simple:

# Use my own dbmate image as a base
FROM docker-dev.registry.kronis.dev:443/dbmate

# Copy over the migration files
COPY ./db/migrations /workspace/db/migrations

# That's it, the rest is already preconfigured in my own dbmate image
# Just need to specify the DATABASE_URL variable on startup!

This is it! Once I run this container with the appropriate DATABASE_URL variable set, it will connect to that database instance and attempt to run whatever migrations haven't been executed yet!

Now we can move on to the application.

Packaging the application

I also need a container image for my back end application, which is where things get a little bit more interesting, because I also want to use my Nexus instance for any Maven dependencies. In a sense, it's a caching proxy, so my own servers will be hit to download libraries and such. This should have a nice impact on the performance of everything, especially when build cache might not always persist.

The build itself will be multi stage, and the instructions will be laid out in such a way to maximize layer caching:

# BUILDER =====================================================================
FROM docker-dev.registry.kronis.dev:443/openjdk17 AS builder
LABEL stage=builder

# Setup Maven proxy configuration, so we use our own repo
COPY ./maven-settings-docker.xml /etc/maven/settings-docker.xml

# Copy the pom.xml first and install the dependencies so we can take advantage of Docker caching
WORKDIR /workspace
COPY pom.xml /workspace/pom.xml
RUN mvn -s /etc/maven/settings-docker.xml \
--batch-mode dependency:go-offline

# Build the application (and also run any tests we might have)
COPY . /workspace
RUN mvn -s /etc/maven/settings-docker.xml \
--batch-mode clean test package

# RUNNER ======================================================================
FROM docker-dev.registry.kronis.dev:443/openjdk17 AS runner
LABEL stage=runner

# Copy files from builder
WORKDIR /app
COPY --from=builder /workspace/pidgeot-api-template.yml /app/pidgeot-api-template.yml
COPY --from=builder /workspace/api-docker-entrypoint.sh /app/docker-entrypoint.sh
COPY --from=builder /workspace/target/pidgeot_api-latest.jar /app/pidgeot_api-latest.jar

# For running the app
RUN chmod +x /app/docker-entrypoint.sh

# Run app
CMD "/app/docker-entrypoint.sh"

I also have some custom Maven configuration for my registry, but apart from that, you might be interested to see how I handle the application configuration.

You see, Dropwizard expects a YAML file, but this doesn't really fit how 12 Factor Apps should be configured: instead, we should use environment variables. I've heard some people suggest that sometimes structured files can be easier to work with than one long list of environment variables, but at the same time it's hard to argue with the simplicity and ability to put all of your configuration in your container stack/deployment description.

Alas, I have something like this in my docker-entrypoint.sh script:

#!/bin/bash

echo "Generating configuration..."

export LOGGING_LEVEL=${LOGGING_LEVEL:-'INFO'}
export LOGGERS_DEV_KRONIS=${LOGGERS_DEV_KRONIS:-'DEBUG'}
export DATABASE_DRIVER_CLASS=${DATABASE_DRIVER_CLASS:-'org.postgresql.Driver'}
export DATABASE_USER=${DATABASE_USER:-'pidgeot_pg_user'}
export DATABASE_PASSWORD=${DATABASE_PASSWORD:-'pidgeot_pg_password'}
export DATABASE_URL=${DATABASE_URL:-'jdbc:postgresql://localhost:5432/pidgeot_pg_database'}
export DATABASE_PROPERTIES_CHARSET=${DATABASE_PROPERTIES_CHARSET:-'UTF-8'}
export DATABASE_MAX_WAIT_FOR_CONNECTION=${DATABASE_MAX_WAIT_FOR_CONNECTION:-'10s'}
export DATABASE_VALIDATION_QUERY=${DATABASE_VALIDATION_QUERY:-'"/* Pidgeot API Health Check */ SELECT 1"'}
export DATABASE_VALIDATION_QUERY_TIMEOUT=${DATABASE_VALIDATION_QUERY_TIMEOUT:-'10s'}
export DATABASE_MIN_SIZE=${DATABASE_MIN_SIZE:-'8'}
export DATABASE_MAX_SIZE=${DATABASE_MAX_SIZE:-'64'}
export DATABASE_CHECK_CONNECTION_WHEN_IDLE=${DATABASE_CHECK_CONNECTION_WHEN_IDLE:-'false'}
export DATABASE_EVICTION_INTERVAL=${DATABASE_EVICTION_INTERVAL:-'30s'}
export DATABASE_MIN_IDLE_TIME=${DATABASE_MIN_IDLE_TIME:-'1 minute'}
export MINIO_USERNAME=${MINIO_USERNAME:-'pidgeot_min_user'}
export MINIO_PASSWORD=${MINIO_PASSWORD:-'pidgeot_min_password'}
export MINIO_URL=${MINIO_URL:-'http://localhost:9000'}
export TEST_DATA_GENERATE_IF_NOT_PRESENT=${TEST_DATA_GENERATE_IF_NOT_PRESENT:-'true'}
export TEST_DATA_ACTUALLY_UPLOAD_TO_MINIO=${TEST_DATA_ACTUALLY_UPLOAD_TO_MINIO:-'false'}
export TEST_DATA_BLOB_GROUP_COUNT=${TEST_DATA_BLOB_GROUP_COUNT:-'10'}
export TEST_DATA_CLIENT_COUNT=${TEST_DATA_CLIENT_COUNT:-'10'}
export TEST_DATA_DOCUMENT_COUNT_FOR_CLIENTS=${TEST_DATA_DOCUMENT_COUNT_FOR_CLIENTS:-'10'}
export TEST_DATA_BLOB_COUNT_PER_DOCUMENT=${TEST_DATA_BLOB_COUNT_PER_DOCUMENT:-'10'}
export TEST_DATA_BLOB_VERSION_COUNT_PER_BLOB=${TEST_DATA_BLOB_VERSION_COUNT_PER_BLOB:-'10'}

envsubst < pidgeot-api-template.yml > pidgeot-api-container.yml

PRINT_TEMPLATE=${PRINT_TEMPLATE:-false}
if [ "$PRINT_TEMPLATE" = true ]
then
    echo "Printing template:"
    cat pidgeot-api-template.yml
else
    echo "You can print the template with PRINT_TEMPLATE=true"
fi

PRINT_CONFIGURATION=${PRINT_CONFIGURATION:-false}
if [ "$PRINT_TEMPLATE" = true ]
then
    echo "Printing configuration:"
    cat pidgeot-api-container.yml
else
    echo "You can print the configuration with PRINT_CONFIGURATION=true"
fi

echo "Checking Java version..."
java -version

echo "Checking Pidgeot API configuration..."
java -jar pidgeot_api-latest.jar check pidgeot-api-container.yml

echo "Running Pidgeot API..."
java -jar pidgeot_api-latest.jar server pidgeot-api-container.yml

You might wonder what all of those variables are. A brief look at how my configuration looks now would reveal the reason:

43 configuration example

Basically, I took the initial configuration in pidgeot-api.yml file, then created a template where the values are replaced with variables: pidgeot-api-template.yml. Then, once I run docker-entrypoint.sh, it will read what are the current environment variables and, based on those, use the envsubst command, to put my environment variables into a newly created file: pidgeot-api-container.yml.

What happens when some value is not set? You see, this file has default values included in the initialization!

One could definitely suggest that this generation based approach of: local configuration --[manual work]--> configuration template --[script+env]--> container configuration has a bit more work involved than there should be, but personally I haven't found anything better.

The only actual pain point is that I'd have to change the configuration in three places (the local configuration, configuration template and script), but apart from that it's easier than mucking about with files, or trying to rewrite how the application needs its configuration.

If internally it needed a .properties file, I could generate that as well, or JSON files, or XML files or anything else. If externally I wanted to pass the configuration to the container through a Helm chart, that's easily doable. If I wanted to set the configuration for a container locally or on server directly, I could easily use the -e parameter, without even opening a text editor. And, of course, Docker Compose and Docker Swarm (or Nomad, or Kubernetes) deployments can also be written consistently.

This overall simplicity is also reflected in the build process, which can be launched very easily locally:

docker build -t local_pidgeot -f .\api.Dockerfile .

Someone who wants to build the container doesn't need to think about whether they need Maven or a particular version of JDK installed locally, nor needs to worry about manual instructions. Instead, you just run the command and a little while later everything is successfully built:

44 first build

Not only that, but if you change only some of the code, the dependencies will already be cached and you'll just do a compile, thus greatly speeding up your builds:

45 cached build

That's about it for making this buildable, the next step is getting it to build on a server and to put it in Nexus, so I can enjoy some degree of automation.

Adding CI/CD into the mix

Now, all that we need is a CI setup, for which I'll use Drone CI to automatically use these container definitions for building and persisting container images in my Nexus instance.

Here, you can probably use anything like Jenkins, GitLab CI or GitHub Actions as well, but in my case Drone CI is pretty lightweight and works well.

For my setup, I simply go to the Drone CI UI and look for my repository:

46 Drone activity

There, I select and enable it for CI execution:

47 Drone activate

I then need a bit of configuration in regards to which features I want to enable:

48 Drone configuration

Whichever tool you pick, the gold standard here is to have the description of your build be declarative, as text files that you can run another instance of your CI solution that wouldn't be manually configured (e.g. not the old way of doing Jenkins builds).

In my case, it looks like the following:

kind: pipeline
type: docker
name: default
image_pull_secrets:
  - DOCKER_AUTH
volumes:
  - name: docker_socket
    host:
      path: /var/run/docker.sock
steps:
  - name: pidgeot_dbmate_image
    image: docker-prod.registry.kronis.dev:443/docker
    volumes:
      - name: docker_socket
        path: /var/run/docker.sock
    environment:
      DOCKER_USERNAME:
        from_secret: DOCKER_USERNAME
      DOCKER_PASSWORD:
        from_secret: DOCKER_PASSWORD
    commands:
      - IMAGE_TAG="docker-dev.registry.kronis.dev:443/pidgeot_dbmate"
      - DOCKERFILE="dbmate.Dockerfile"
      - docker build -t "$IMAGE_TAG" -f "$DOCKERFILE" .
      - docker push "$IMAGE_TAG"
  - name: pidgeot_api_image
    image: docker-prod.registry.kronis.dev:443/docker
    volumes:
      - name: docker_socket
        path: /var/run/docker.sock
    environment:
      DOCKER_USERNAME:
        from_secret: DOCKER_USERNAME
      DOCKER_PASSWORD:
        from_secret: DOCKER_PASSWORD
    commands:
      - IMAGE_TAG="docker-dev.registry.kronis.dev:443/pidgeot_api"
      - DOCKERFILE="api.Dockerfile"
      - docker build -t "$IMAGE_TAG" -f "$DOCKERFILE" .
      - docker push "$IMAGE_TAG"
  - name: send_email
    depends_on:
      - pidgeot_dbmate_image
      - pidgeot_api_image
    image: docker-prod.registry.kronis.dev:443/drone_email
    settings:
      host: ...
      port: ...
      username:
        from_secret: EMAIL_USERNAME
      password:
        from_secret: EMAIL_PASSWORD
      from: ...
      recipients: [ ... ]
    when:
      status:
        - failure
        - success

(some login instructions might be skipped, to keep this short and readable)

Once that file is in place and has been committed, builds will start after I push new changes:

49 build on commit

It also helps you visualize your build dependencies, which in my case are rather simple - just a dbmate and API container:

50 build structure

If need be, it's also possible to preview logs:

51 build structure

And once the build is completed, I get a nice e-mail about the results:

52 build email

The build speed is currently around 2 minutes, but depending on what and how I build, it can speed up thanks to my CI servers also having their local Docker caches. Either way, the images are now in my Nexus instance and the rest is just a matter of launching the containers on some server.

Running on a container cluster

Initially, I try it on my own homelab servers, which can also access the images, just to see whether the data generation also works. In my homelab I run Docker Swarm, which is delightfully simple, together with Portainer, which allows me to launch my new stack:

53 deploy on Swarm

(you'll notice that this file has "real" passwords, but the instances aren't exposed publicly and also were wiped by the time you're reading this)

There are a few interesting changes, here. Firstly, we now make our API to wait for both the DB and MinIO instance to be up. Also, we only expose to localhost, which means that I'll use port forwarding on my local machine to access the applications, the MinIO instance and even the database.

Also, you can't see it in the image above, but I also use bind mounts instead of Docker volumes, for the ability to more easily look at the underlying files. Some software is a bit picky about DNS names (e.g. no underscores), so we also setup network aliases for our containers.

For testing purposes, this is more than secure enough and adequate, but it's nice to be aware of these small details:

pidgeot_api:
  image: docker-dev.registry.kronis.dev:443/pidgeot_api
  depends_on:
    - pidgeot_postgres
    - pidgeot_minio
  networks:
    pidgeot_network:
      aliases:
        - pidgeot-api
  ports:
    - "127.0.0.1:10103:8080"
    - "127.0.0.1:10104:8081"

In a little while, all of the images are pulled and the containers start up one by one:

56 containers running

We can also look at the logs of any of those, for example, here's the API instance after a restart, when the data is already present:

57 successful startup

We can then use our preferred means of doing some SSH tunneling, which in my case is using the functionality build into MobaXTerm:

54 tunnel

This is not all that different from what Kubernetes lets you do, albeit we don't need a Kubeconfig file and just use some SSH keys:

55 tunnel

So, we can open a seemingly local URL, which will actually be the app instance running on the server, to which we connect through our SSH tunnel to forward the port:

58 tunnel running

You can also just do tunneling through the CLI, which can actually be a pretty nice experience as well, or use any other software package of choice. I do find it a bit odd how this doesn't seem to be as popular of an approach as it should be, since it's just so very usable.

With that done, we can gradually scale up and do some tests.

Trying to generate the data in my homelab

My initial idea was to simply use my homelab for all of those millions of records as well, so I just launched the instance and set it to work. A nice program by the name of iotop showed me that the disk writes weren't all that fast, though:

59 total disk writes

The containers themselves also seemed to just be doing their thing. Here you can see me running two instances of everything parallel, one with migrations that partition the tables and another without the partitions, as a control group:

60 cpu usage

Also, eventually I parallelized the actual generation process across multiple containers, thinking that maybe the DB thread pool or something else caused the generation performance to be bottlenecked:

61 refactored structure

You'll also notice that I have separate instances for the generation and another one for the web API with an exposed port. This is possible thanks to feature flags and lets me work around the limitation that only one container can be allocated to a certain port in host mode networking.

Otherwise only one of them would successfully start, whereas now one can be the public instance and others can handle any background processes that I might want (or data generation, in our case). I'd say that building systems modular like that is a really useful approach!

But alas, the load average seems high, yet the CPU utilization itself is comparatively low and not that much memory is actually used, which feels a bit odd:

62 increased resource usage

In the end, it turned out that my HDD was pretty much the limit here. HDDs are a rather cheap way of storing a decent amount of data, so I use them for backups, my own personal cloud, Minecraft servers for friends and other things like that, but they're not the best when you want to run databases with lots of IO against them.

So, with that I realized that it's time to move over to the cloud!

Moving to the cloud

Normally, I'd go for something like Time4VPS because they're pretty affordable in the long term and are more or less a local business here in the Baltics, so their ping times and other characteristics are very much to my liking.

This time, however, I'd need a bit more focus on beefier machines that I can rent out for a day or two and do some number crunching and IO on, as well as functionality like being able to easily attach additional volumes, which will make more sense shortly. Because of that I chose to go with Hetzner, another nice European cloud platform.

I wasn't entirely sure how much resources I'd actually need, so I provisioned a node that'd vaguely match what I have in my homelab, albeit with faster SSD storage (even if the total size is many times smaller than what I have):

63 pidgeot server

I then proceeded to install Docker on it and add it to my container cluster, so that I could easily schedule workloads on it:

64 node added

Once that was done, I could get this show back on the road and it seemed like the SSD really helped, allowing the CPU to also be used more efficiently. It appeared that the workload had been IO bound:

65 new load

It actually looked like all of the data would be generated in not too long of a while. Here, I had gone for a total of 10 million blob versions, each instance generating a part of the total (1 million):

66 generation performance

The problem, however, was the CPU usage, since it felt like if I'd keep the node at 100% load for too long, the server provider might have a few choice words to say about it. At my current setup, the data generation would take a few hours of this, which may or may not be okay:

67 the problem

Not only that, but it also appeared that the amount of data that we were generating, the MinIO files in particular was actually so plentiful that it was no longer easy to see how big individual directories were on the server. Even tools like ncdu ran into problems:

68 ncdu breaks

Either way, so far things seemed well and I would probably not even exhaust all of the storage, because even though plentiful, the test files were pretty small. That said, eventually things did break, just not for the reasons that I initially expected.

The first thing that broke was MinIO, complaining about not enough space on the device:

70 MinIO upload errors

Shortly after, the PostgreSQL database also broke and went into recovery mode:

72 database also died

Why was that? Well, the disk actually wasn't full, since there was still free space. Rather, I had run out of inodes, meaning that I had so many small files, that no new ones couldn't be created because of how the file system was initialized:

71 out of inodes

Curiously, this was still 5 times more inodes than my homelab servers have. So my choices were to either give up on actually storing all of the files and only test the database, or find another solution. But it wouldn't be much of a document storage system, if it couldn't store documents, now would it?

Cloud to the rescue

When the cloud doesn't suffice, just add more cloud! Or rather, in our case, a new attached volume:

73 xfs

The nice thing about Hetzner is that you can not only attach volumes to your servers, but also choose what file system the volume should be running. That means that you can use not just ext4, but something like XFS instead, which allows you to create far more files, among other things.

So, with just opening my wallet and a click of a button, I had essentially solved my problem:

74 xfs volume created

Shortly after, I connected to the server and saw the very nice number of 100 million inodes available:

76 new volume inodes

With that now out of the way, I allowed the generation logic to run throughout the night, which tells me that this is definitely not the fastest way to go about this, but is also passable, because by the time morning came, the generation had successfully been finished (it could have been even faster, but I dialed the CPU limits a bit lower):

77 data generation finished

The API instances told me that the total process had taken around 32200 seconds for a total of around 2000000 records each (I decreased the count of parallel instances slightly, but increased the amount of total data to generate):

78 data generation finished

This came out to:

  • around 60 new records per second, per instance
  • around 310 new records per second, per database/MinIO
  • around 620 new records per second, per the whole server

Considering that this included the file uploads and all of the related data and whatnot, I'd say that overall this was a pretty decent outcome. I also then proceeded to zip up the entire database and download the archive, for backup purposes:

80 download data

If needed, I could run PostgreSQL locally with all of that pre-generated data, because the actual DB compressed down to just around 2 GB when in an archive and was around 10 GB unzipped. For a comparison, the files were 100 GB and that's just when they were very short text strings:

82 results space

Also, the inodes were also doing a bit better this time around, so technically I could fit in twice the amount of files if I wanted to:

81 results inodes

The last thing that I did was to also make PostgreSQL do some vacuuming to ensure that both of the instances will perform as well as I might hope them to:

83 vacuum verbose analyze

With that, the data generation was done.

Summary

All of this wasn't strictly necessary to document or share, the actual system would probably work similarly well if you installed everything manually on the nodes themselves, or perhaps used Ansible for automation.

Then again, many people out there have their own setups and I find it interesting to look at those, so maybe you'll also appreciate learning a little bit about mine, since it took a while for me to settle on something so dead simple and usable. I have used Kubernetes as well and think that something like K3s can also definitely belong in a homelab, but for now this will do.

It was also interesting to run into the limitations of the file system, as well as show how easily the cloud can allow you to work around them. You could also set up XFS, or Btrfs or maybe even ZFS locally as well, but this was perhaps the most time effective solution for my needs.

All that's left, really, is to test the system!

You can read the next part here: (4/4) Pidgeot: a system for millions of documents; Testing and Conclusions

By the way, want an affordable VPN or VPS hosting in Europe?
Personally, I use Time4VPS for almost all of my hosting nowadays, including this very site and my homepage!
(affiliate link so I get discounts from signups; I sometimes recommend it to other people, so also put a link here)
Maybe you want to donate some money to keep this blog going?
If you'd like to support me, you can send me a donation through PayPal. There won't be paywalls for the content I make, but my schedule isn't predictable enough for Patreon either. If you like my blog, feel free to throw enough money for coffee my way!