(1/4) Pidgeot: a system for millions of documents; The Data

Let me set the scene: you need to process a large number of documents and have some basic versioning for them, quite possibly for a project at work. The specifics aren't as important as the overall domain; you can make up whatever scenario feels suitable. There are numerous situations like that out in the wild: working with generated PDF invoices, building a document management system, doing some sort of batch reporting, processing images and other media content that your users upload, perhaps something along the lines of an ETL pipeline, or anything else, really.

Yet I've seen many pitfalls with such solutions, both personally and through the experiences of others: sometimes there are issues with performance, other times the technologies are hard to work with, and sometimes the whole system is brittle and breaks.

That made me want to experiment with my own proof of concept, somewhat like what I built for "Apturi Covid" a while back. Back then, however, the solution was simpler and was primarily used for the storage of static assets, whereas now I want to test things at a larger scale. That's how I came up with the idea of Pidgeot, a system built just so I could play around with some of the technologies and have it act as a proof of concept:

00 pidgeot logo.jpg

(the logo was generated with the DALL·E 2 AI; the project is named after a Pokémon because I hope my code will be pretty fly)

A lot of the time, you end up with something like that just being a module in an existing system, which shares the database, or even the blob/file storage, with the rest of the system. Why? Because such an approach is easy and you can build on top of the foundations you already have. What's more, solutions like Ruby's ActiveStorage give you the ability to manage blobs (files, images, user uploads and so on) out of the box.

However, I'm here to tell you that you should consider your choices carefully if you intend to work with larger volumes of data! Of course, this little exercise is also a nice code kata to prepare for actually moving things forward with my DevProject 2022, where I'll soon have a full video series on how to develop cloud native solutions. Let's begin.

Technology choices

The first and simplest idea would be to store the files on the file system yourself (as the name implies). This is a good way to get started and will be adequate for many setups out there. For example, my homepage actually uses the aforementioned ActiveStorage, given that the scale I'm dealing with there is pretty small and so is the number of files.

However, if you intend to store millions of files, there are some technical aspects to consider:

  • you'll need to figure out some way to only allow valid file names, since those vary based on which file system you use
  • the number of inodes available to you, as well as the maximum number of files that you can have in a directory, will also be limited
  • if you ever want to scale this setup, you might need to look into using NFS, or even a distributed file system like GlusterFS or Ceph
  • apart from that, you are essentially one directory traversal attack away from a really bad time and your data being leaked
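To illustrate that last point, here's a small sketch of how such a traversal attack can be guarded against (the `STORAGE_ROOT` path and the helper function are made up for this example, not part of any real system):

```python
import os

STORAGE_ROOT = "/var/pidgeot/storage"  # hypothetical storage directory

def safe_storage_path(user_supplied_name: str) -> str:
    """Join a user-supplied file name onto the storage root, refusing
    any name that resolves to a path outside of it."""
    root = os.path.realpath(STORAGE_ROOT)
    candidate = os.path.realpath(os.path.join(root, user_supplied_name))
    # realpath() collapses ".." components, so a traversal attempt no
    # longer starts with the storage root after normalization
    if not candidate.startswith(root + os.sep):
        raise ValueError(f"refusing path outside storage root: {user_supplied_name}")
    return candidate

# A normal name stays inside the root; "../../etc/passwd" raises ValueError
print(safe_storage_path("invoice_25415.pdf"))
```

Naively concatenating user input onto a base directory, without a check like this, is exactly how those "really bad time" scenarios start.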

Here's a quick example of inodes on one of my servers, which you can check on yours with df -i:

02 inodes

(this will actually become more relevant later; for now, just remember that this limit depends on the file system that you use and won't be easy to extend in the case of something like ext4)

Don't get me wrong - the file system is a good abstraction, it's just that you might consider using a bit of software that someone else has written on top of it, something that has been around for a while and has been developed (and is maintained) by way more people than you could hire for your own needs.

So, I'd suggest that most of the time you should consider a blob storage solution instead - a specialized system with a consistent API that you can use for handling files in your system. This lets you avoid some of the pain points of using your file system and can allow you to scale out horizontally once that becomes necessary!

The most successful implementations of document management systems that I've seen also use another system for storing and searching through the metadata, either a relational database or even a document store (yes, MongoDB can be okay, but we'll focus on ACID here).

In this particular experiment, I'll focus on the following:

  • an S3 compatible blob store, such as MinIO (though Zenko is also fine)
  • a traditional relational database, such as PostgreSQL (some might prefer MariaDB/MySQL though)

Building our database schema and handling migrations

Let's start with the schema first and focus on our data model, to figure out how we'll store and query everything. For actually designing it, we'll be using pgAdmin, which you can get for free. You can, of course, also do this through the CLI, but I enjoy working with GUI software for something like this.

We'll generally focus on 3 levels of data:

  • blob_groups
  • blobs
  • blob_versions

In addition, we'll also have a few supporting tables:

  • clients
  • documents

If we wanted to make a super generic solution, these tables would normally live in other systems and we'd just keep references to blobs, but it's also acceptable to create something a bit more specialized, depending on the domain and what you want to achieve.

Supporting tables

Instead of jumping right into working with the blob storage logic, we'll probably want to model our data from the other end, thus starting with a table to keep track of a bunch of companies that we might be interested in collaborating with:

03 clients table

Note the UUID as the ID column type and the gen_random_uuid() function. This is generally an okay idea for when you don't want to use numeric IDs, which aren't always the best choice for security (although that shouldn't be a concern in most cases), or for situations where you might eventually want to split up your database into multiple separate ones, where numeric IDs aren't always viable.

Admittedly, if we didn't mind having just one instance (which is okay for most projects out there) and cared about the IDs being short, we could use SERIAL as well, which works a bit like AUTO_INCREMENT in MariaDB/MySQL. The performance difference between generating a UUID and incrementing a counter also shouldn't matter too much.

Then, we can proceed with a made-up linked table, in our case documents, which is supposed to demonstrate a schema that would involve more joins when queried:

04 documents table

As I said, technically these could be completely separate from our solution, in another database altogether. Perhaps another microservice, which would just pass us the necessary identifiers and store references to our blob IDs in its own DB, whenever necessary.

However, I'm curious about how the performance will be impacted when everything's on the same instance, so I'll leave this in here for benchmarking later. I will, however, NOT create any foreign keys between these tables and the blob related tables, keeping those two groups largely decoupled and separate.

I also create some simple indices for faster retrieval, just for some eventual example searches:

CREATE INDEX clients_name_idx ON public.clients (name);
CREATE INDEX documents_code_idx ON public.documents (code);
CREATE INDEX documents_name_idx ON public.documents (name);
CREATE INDEX documents_created_at_idx ON public.documents (created_at);

I won't share all of the code to keep things short, but you can assume a few simple indices like that here and there (albeit in practice you might want to create some against multiple columns, depending on how you intend to query your data).

Blob tables

Now, onward to the blob storage! The idea is to allow storing multiple types of different blobs, such as invoices, receipts, reports, certificates, tickets or whatever else that you might need, or even sub types of each, such as unpaid/paid/overdue invoices, which fit nicely as groups of data in your system:

05 blob groups table

This can be especially nice for things like figuring out how many entries of a certain kind your system has, exporting everything of a certain kind, or perhaps adding similar bits of data later on, like which system/service the entry type comes from and so on. We are going to partition the actual blobs, but for now a simple index or two against the code and name columns will probably be enough:

CREATE INDEX blob_groups_code_idx ON public.blob_groups (code);
CREATE INDEX blob_groups_name_idx ON public.blob_groups (name);

Then, there are the actual blobs themselves, though one could also call them "files" or "objects" for all I care. These are particular instances of stuff that fits in the above groups, for example: "Invoice 25415 for customer 23155", which is probably what you'll search for most often, since that's the most useful information here.

07 blobs

You'll actually notice that id is no longer a PRIMARY KEY, which is because under the hood we're now using partitions for this table. If we tried creating a primary key here, we'd get an error:

ERROR:  unsupported PRIMARY KEY constraint with partition key definition
DETAIL:  PRIMARY KEY constraints cannot be used when partition keys include expressions.
SQL state: 0A000

This is because the table definition is actually now a bit more interesting than one might expect:

CREATE TABLE public.blobs
(
    id uuid NOT NULL DEFAULT gen_random_uuid(),
    blob_group_id uuid NOT NULL REFERENCES public.blob_groups (id),
    document_id uuid NOT NULL,
    name character varying(256) NOT NULL,
    created_at timestamp without time zone NOT NULL
) PARTITION BY LIST (EXTRACT(YEAR FROM created_at));

Here, we have chosen to partition the records into multiple separate underlying tables, based on the year, which is contained within the created_at timestamp:

08 blob_partitions

The problem, however, is that all of those partitions must be created manually. There have been attempts to create plugins that can automatically create new partitions, as well as other approaches involving something like triggers, but if you keep things simple, what you have will look a bit like the following:

CREATE TABLE public.blobs_2020 PARTITION OF public.blobs FOR VALUES IN (2020);
CREATE TABLE public.blobs_2021 PARTITION OF public.blobs FOR VALUES IN (2021);
CREATE TABLE public.blobs_2022 PARTITION OF public.blobs FOR VALUES IN (2022);
CREATE TABLE public.blobs_2023 PARTITION OF public.blobs FOR VALUES IN (2023);
CREATE TABLE public.blobs_2024 PARTITION OF public.blobs FOR VALUES IN (2024);
CREATE TABLE public.blobs_2025 PARTITION OF public.blobs FOR VALUES IN (2025);
CREATE TABLE public.blobs_2026 PARTITION OF public.blobs FOR VALUES IN (2026);
CREATE TABLE public.blobs_2027 PARTITION OF public.blobs FOR VALUES IN (2027);
CREATE TABLE public.blobs_2028 PARTITION OF public.blobs FOR VALUES IN (2028);
CREATE TABLE public.blobs_2029 PARTITION OF public.blobs FOR VALUES IN (2029);
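Typing those out by hand gets old quickly, so here's a sketch of the sort of throwaway script that could generate them instead (the function name is made up; the table name and year range just mirror the statements above):

```python
# Throwaway generator for yearly LIST partition DDL, mirroring the
# hand-written CREATE TABLE ... PARTITION OF statements above
def partition_ddl(parent: str, values) -> str:
    return "\n".join(
        f"CREATE TABLE public.{parent}_{value} PARTITION OF public.{parent} "
        f"FOR VALUES IN ({value});"
        for value in values
    )

print(partition_ddl("blobs", range(2020, 2030)))
```

The same approach works for any finite list of partition key values, which will come in handy again a bit further down.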

Obviously, you might run into some issues if you try to insert some data in 3035, unless you specify a default partition, which might be a bad idea in and of itself, since you might forget that you need to fix the underlying tables and wonder why the performance gets progressively worse!

As for the reasons for partitioning these tables: the idea is that a lot of the time we might only care about the data from the last year. Thus, having a smaller table to search through could be pretty useful. What's more, if this data grows a lot, with a bit of clever engineering you could have the older years (say, 2020) stored on spinning HDDs, whereas the newer years could be stored on SSDs, or a beefy RAID array.

You might have also noticed that none of the columns in this table seem to have a lot to do with actual storage of... you know, files. That's because these don't necessarily correspond to individual files by themselves, at least not 1:1. Why?

Because in our domain we'll also have versioning for these: there might be multiple invoices for each customer, maybe they get an additional package of a service and the invoice needs to be re-generated, maybe there are discounts or other changes, maybe it's a usage based subscription or something else where you can just preview bills ahead of time.

So, let's create a table for the versions:

09 blob versions

This table will hold the largest amount of data, so some partitioning is in order here as well. That said, partitioning just by year seems somewhat... ineffectual? After all, we normally care about the creation date of the blobs themselves (e.g. when a contract is started), whereas for blob_versions we're more interested in which versions belong to a particular blob.

For that, we use the blob_id column, though an index on it alone might still not be enough. Knowing that those are UUID values, however, we can surmise that only the following symbols will be present in their textual form:

abcdef0123456789

Hmm, why not take the first character and use it as a partition key? It might not be an optimal solution, but with a bit of code generation (and a larger number of tables than the above), we can cover every single possible blob_id value of ours and end up with passable performance.
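As a quick sanity check of that claim, a few lines of Python confirm that the textual form of a random (version 4) UUID always starts with one of those sixteen characters:

```python
import uuid

# Collect the first character of a bunch of randomly generated UUIDs
first_chars = {str(uuid.uuid4())[0] for _ in range(10_000)}

# Every observed character is one of the sixteen hex digits, so sixteen
# LIST partitions would be enough to cover every possible blob_id
assert first_chars <= set("0123456789abcdef")
print(sorted(first_chars))
```

Since gen_random_uuid() also produces version 4 UUIDs, the values should spread roughly evenly across those partitions, too.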

Actually, just for the sake of an example, I might as well demonstrate how to achieve something like that for the entirety of the alphabet:

abcdefghijklmnopqrstuvwxyz0123456789

It isn't actually all that bad:

CREATE TABLE public.blob_versions
(
    id uuid NOT NULL DEFAULT gen_random_uuid(),
    blob_id uuid NOT NULL,
    filename character varying(256) NOT NULL,
    mime_type character varying(256) NOT NULL,
    checksum character varying(32) NOT NULL,
    version integer NOT NULL,
    size bigint NOT NULL,
    created_at timestamp without time zone NOT NULL
) PARTITION BY LIST (LEFT(blob_id::text, 1));

CREATE TABLE public.blob_versions_a PARTITION OF public.blob_versions FOR VALUES IN ('a');
CREATE TABLE public.blob_versions_b PARTITION OF public.blob_versions FOR VALUES IN ('b');
CREATE TABLE public.blob_versions_c PARTITION OF public.blob_versions FOR VALUES IN ('c');
CREATE TABLE public.blob_versions_d PARTITION OF public.blob_versions FOR VALUES IN ('d');
CREATE TABLE public.blob_versions_e PARTITION OF public.blob_versions FOR VALUES IN ('e');
CREATE TABLE public.blob_versions_f PARTITION OF public.blob_versions FOR VALUES IN ('f');
CREATE TABLE public.blob_versions_g PARTITION OF public.blob_versions FOR VALUES IN ('g');
CREATE TABLE public.blob_versions_h PARTITION OF public.blob_versions FOR VALUES IN ('h');
CREATE TABLE public.blob_versions_i PARTITION OF public.blob_versions FOR VALUES IN ('i');
CREATE TABLE public.blob_versions_j PARTITION OF public.blob_versions FOR VALUES IN ('j');
CREATE TABLE public.blob_versions_k PARTITION OF public.blob_versions FOR VALUES IN ('k');
CREATE TABLE public.blob_versions_l PARTITION OF public.blob_versions FOR VALUES IN ('l');
CREATE TABLE public.blob_versions_m PARTITION OF public.blob_versions FOR VALUES IN ('m');
CREATE TABLE public.blob_versions_n PARTITION OF public.blob_versions FOR VALUES IN ('n');
CREATE TABLE public.blob_versions_o PARTITION OF public.blob_versions FOR VALUES IN ('o');
CREATE TABLE public.blob_versions_p PARTITION OF public.blob_versions FOR VALUES IN ('p');
CREATE TABLE public.blob_versions_q PARTITION OF public.blob_versions FOR VALUES IN ('q');
CREATE TABLE public.blob_versions_r PARTITION OF public.blob_versions FOR VALUES IN ('r');
CREATE TABLE public.blob_versions_s PARTITION OF public.blob_versions FOR VALUES IN ('s');
CREATE TABLE public.blob_versions_t PARTITION OF public.blob_versions FOR VALUES IN ('t');
CREATE TABLE public.blob_versions_u PARTITION OF public.blob_versions FOR VALUES IN ('u');
CREATE TABLE public.blob_versions_v PARTITION OF public.blob_versions FOR VALUES IN ('v');
CREATE TABLE public.blob_versions_w PARTITION OF public.blob_versions FOR VALUES IN ('w');
CREATE TABLE public.blob_versions_x PARTITION OF public.blob_versions FOR VALUES IN ('x');
CREATE TABLE public.blob_versions_y PARTITION OF public.blob_versions FOR VALUES IN ('y');
CREATE TABLE public.blob_versions_z PARTITION OF public.blob_versions FOR VALUES IN ('z');
CREATE TABLE public.blob_versions_0 PARTITION OF public.blob_versions FOR VALUES IN ('0');
CREATE TABLE public.blob_versions_1 PARTITION OF public.blob_versions FOR VALUES IN ('1');
CREATE TABLE public.blob_versions_2 PARTITION OF public.blob_versions FOR VALUES IN ('2');
CREATE TABLE public.blob_versions_3 PARTITION OF public.blob_versions FOR VALUES IN ('3');
CREATE TABLE public.blob_versions_4 PARTITION OF public.blob_versions FOR VALUES IN ('4');
CREATE TABLE public.blob_versions_5 PARTITION OF public.blob_versions FOR VALUES IN ('5');
CREATE TABLE public.blob_versions_6 PARTITION OF public.blob_versions FOR VALUES IN ('6');
CREATE TABLE public.blob_versions_7 PARTITION OF public.blob_versions FOR VALUES IN ('7');
CREATE TABLE public.blob_versions_8 PARTITION OF public.blob_versions FOR VALUES IN ('8');
CREATE TABLE public.blob_versions_9 PARTITION OF public.blob_versions FOR VALUES IN ('9');

You can achieve that without writing a single line of code or a loop: just grab the English alphabet and use Vim, or any editor that allows editing multiple lines at once. It's that simple:

10 blob version partitions

That's actually all that we need for our schema; we just have to hope that we've set everything up correctly, which we'll test promptly. Furthermore, remember that at this point we have just a vague idea of how we expect things to work - how far this scales remains to be tested!

Running our migrations

So, how do we handle all of these migrations? We'll be using dbmate, a language/platform agnostic migration tool that supports a whole bunch of different databases. There are good solutions for various languages and frameworks (Rails has its own, Django has its own, Java has tools like Flyway and Liquibase), however there's something really nice about having a solution that you can use across every project and tech stack.

We have a directory named db/migrations that dbmate reads and tries to execute, recording the results in a schema_migrations table. We can create a new migration by running the container (or the tool locally - it's a static executable) and issuing the new migration command, which will give us a new file in said folder, named after the name we give it plus an automatically generated timestamp.

It's about as easy as it gets, since it doesn't really force you to think much about your application versions and how those map to the migrations:

winpty docker run --rm -it --mount type=bind,source="$(pwd)"/db/migrations,target=/workspace/db/migrations docker-dev.registry.kronis.dev:443/dbmate bash
./dbmate new migration_name

(note: this image isn't public, you might want to build your own, or just use the executable from the link)

As for applying the actual migrations against any given database instance, I just have a Docker Compose file that launches the tool in an automatic migration execution mode:

version: '3.7'
services:
  pidgeot_dbmate:
    image: docker-dev.registry.kronis.dev:443/dbmate
    volumes:
      - ./db/migrations:/workspace/db/migrations
    environment:
      - DATABASE_URL=postgres://pidgeot_pg_user:pidgeot_pg_password@pidgeot_postgres:5432/pidgeot_pg_database?sslmode=disable
    deploy:
      resources:
        limits:
          memory: 512M
          cpus: '1.00'
      restart_policy:
        condition: on-failure
        delay: 60s

Actually, the rest of the stack is also runnable locally, including both an instance of PostgreSQL and, later, MinIO as well. The idea is to run only the Java application outside of containers, and even that only because debugging outside of containers can be a little easier, and startup can shave a second or ten off here and there.

Regardless, after executing our migrations, we get the following:

11 migration execution

As expected, the schema migration table is also filled out:

12 migration results

It doesn't give us that many details and isn't as fancy as something like Flyway, but then again, it doesn't really need to be, because the end result is the same for now. I'd say that using migrations to go from a bare PostgreSQL instance (or any other RDBMS, really) to whatever schema your application needs is definitely a best practice:

13 generated schema

I do have to admit that the schema ends up looking... a bit weird. For example, we can have a FOREIGN KEY from blobs to blob_groups because the groups table isn't partitioned, but we cannot make blob_versions point at blobs, because we do use partitions there. Also, in our case we've chosen not to couple our documents table with blobs, because that's just an example of what the data structure in another microservice somewhere might look like.

Either way, once you get to the point of having to partition your data, you might no longer be able to rely on foreign keys to ensure data consistency and prevent orphaned data, or to use something like the CASCADE option for deleting data (assuming that you don't want to use soft deletes, which you really should).

Then again, depending on how everything is set up, this approach can still be what you're looking for, due to scalability concerns. How much these actually matter, we'll test later with millions of documents. Just remember that you don't always have to overcomplicate things with partitions, unless you're sure about them being a good solution for your problem.

Testing our database partitions

But first, let's check whether everything actually works, shall we?

Let's add a test client:

INSERT INTO clients(name, location, email, website) 
VALUES ('CloudCat', 'Riga', 'cloudcat@kronis.dev', 'https://kronis.dev');

Which works with no issues, UUID is generated and all:

14 insert client

Let's add a test document for them:

INSERT INTO documents(client_id, code, name, created_at, valid_until)
VALUES (
    (SELECT id FROM clients WHERE name = 'CloudCat'),
    'example_receipt',
    'This is an example receipt',
    NOW(),
    NOW() + INTERVAL '1 day'
);

Which also works:

15 insert document

So far, so good, now let's move over to an example blob. First, we'll need a group:

INSERT INTO blob_groups(code, name)
VALUES('blob_grp_1', 'Blob group #1');

Which works as expected:

16 blob group

And now, let's test some of the blob insertion logic. If we've done everything correctly, the data should be inserted in the underlying partitioned tables, depending on the year:

INSERT INTO blobs(blob_group_id, document_id, name, created_at)
VALUES(
    (SELECT id FROM blob_groups WHERE code = 'blob_grp_1'),
    (SELECT id FROM documents WHERE code = 'example_receipt'),
    'Test blob #1',
    TO_TIMESTAMP('2021.05.04 13:44:00', 'YYYY.MM.DD HH24:MI:ss')
);
INSERT INTO blobs(blob_group_id, document_id, name, created_at)
VALUES(
    (SELECT id FROM blob_groups WHERE code = 'blob_grp_1'),
    (SELECT id FROM documents WHERE code = 'example_receipt'),
    'Test blob #2',
    TO_TIMESTAMP('2022.06.05 14:45:00', 'YYYY.MM.DD HH24:MI:ss')
);
INSERT INTO blobs(blob_group_id, document_id, name, created_at)
VALUES(
    (SELECT id FROM blob_groups WHERE code = 'blob_grp_1'),
    (SELECT id FROM documents WHERE code = 'example_receipt'),
    'Test blob #3',
    TO_TIMESTAMP('2023.03.02 15:34:00', 'YYYY.MM.DD HH24:MI:ss')
);

Of course, the first thing we'll see is that the data is actually there, if we select everything from the main table:

17 blobs

But what if we want to check what's going on with the partitions? Luckily, the underlying tables are just that: tables that we can SELECT data from. For example, we can do the following:

SELECT * FROM blobs_2022;

And we'll get the record we expect to be there, because its created_at timestamp's year matches the partition key:

18 data in partition

You can use either the main table for selecting your data, or the partition table directly, when you are sure that the data you need is in it. Sadly, there don't appear to be many options outside of procedural database languages for dynamically choosing particular tables yourself (unless you want to do SQL query generation app side), so sometimes you'll just need to count on the database knowing what's best for you.
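If you did want to do that app side, the mapping from an ID to its partition is trivial - a hypothetical sketch (the function is made up for illustration; the table naming matches the schema above):

```python
def blob_versions_partition(blob_id: str) -> str:
    """Map a blob UUID (in textual form) to the blob_versions partition
    that holds its versions, based on the first character."""
    first = blob_id[0].lower()
    if first not in "0123456789abcdef":
        raise ValueError(f"not a valid UUID string: {blob_id}")
    return f"blob_versions_{first}"

print(blob_versions_partition("06a95de9-f3c4-4293-9231-ceac34d90e29"))
# prints "blob_versions_0"
```

Most of the time, though, letting the planner prune partitions based on your WHERE clause is the saner option.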

We are also going to do the same with our blob versions, creating a few:

INSERT INTO blob_versions(blob_id, filename, mime_type, checksum, version, size, created_at)
VALUES(
    '06a95de9-f3c4-4293-9231-ceac34d90e29',
    'test_file_1.txt',
    'text/plain',
    'b98a59e5c7ae38e76eae58d39ee7711d',
    1,
    223,
    NOW() + INTERVAL '1 second'
);
INSERT INTO blob_versions(blob_id, filename, mime_type, checksum, version, size, created_at)
VALUES(
    '06a95de9-f3c4-4293-9231-ceac34d90e29',
    'test_file_1_final.txt',
    'text/plain',
    'b98a59e5c7ae38e76eae58d39ee7711d',
    2,
    228,
    NOW() + INTERVAL '2 seconds'
);
INSERT INTO blob_versions(blob_id, filename, mime_type, checksum, version, size, created_at)
VALUES(
    '06a95de9-f3c4-4293-9231-ceac34d90e29',
    'test_file_1_final(1).txt',
    'text/plain',
    'b98a59e5c7ae38e76eae58d39ee7711d',
    3,
    234,
    NOW() + INTERVAL '3 seconds'
);
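In a real flow, the checksum and size values for inserts like these would be computed from the actual file contents - a quick sketch of how that could look (MD5 is an assumption on my part, chosen because it matches the 32-character checksum column in the schema):

```python
import hashlib

def blob_checksum_and_size(data: bytes) -> tuple:
    """Compute values for the checksum and size columns of blob_versions:
    an MD5 hex digest (32 characters, matching character varying(32))
    and the byte length of the file contents."""
    return hashlib.md5(data).hexdigest(), len(data)

checksum, size = blob_checksum_and_size(b"Example file contents")
assert len(checksum) == 32
print(checksum, size)
```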

I'm faking the checksums and size values here since we don't have a backing data store yet, but this should be enough to check whether the data is there as expected:

19 blob versions

Would you look at that - we get a few nice versions of our blob! Now, let's check whether the partitions work. If everything's set up correctly, the first letter of the blob_id should correspond to the partition, which in our case is 06a95de9-f3c4-4293-9231-ceac34d90e29. So, let's check the partition table:

SELECT * FROM blob_versions_0;

Eureka! Here they are:

20 blob version partitions

You might notice that all of them are in the very same partition, which is exactly what we want in this particular case: we won't have to turn to all of the partitions to find the records for a particular blob. Instead, it will be like everything we need is in a single table, at least in this context.

It's a little bit like how some partition their SaaS schemas based on customer IDs. Of course, I'm not saying that database partitioning is a perfect solution, but partitioning, or even splitting the database up entirely, has worked out pretty well for companies like GitLab, allowing them to scale further without a complete rewrite of their apps.

With our database being mostly ready, we should think about the blob store and then integrate everything into a test program, so let's do just that.

Setting up our blob store

As for the actual solution for storing the file data, my choice of something S3 compatible is driven by the fact that, whilst I've seen people roll their own blob storage solutions with varying degrees of success, it's generally better to use solutions that are as widespread (at least API-wise) as you can get.

These choices are battle tested and supported by a large, vibrant community - both in regards to the features and documentation that are available, as well as client SDKs and libraries in almost any language you can imagine.

Some might also wonder about storing blobs in the database directly, since most databases have some sort of binary data type. However, I'm here to tell you: consider whether that's a good idea! The amount of actual file data that you need to persist will usually be a few orders of magnitude bigger than the metadata, and downloading the file contents can be rather IO intensive.

Often you'll want to keep the metadata storage small and fast almost all the time, while letting a separate system handle the uploads/downloads, however fast or slow those may be. Of course, you should also consider whether you're okay with building a distributed system and everything that comes with that - such as your database successfully writing the metadata, but the MinIO layer failing to upload the file.
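That failure mode is worth sketching out. One common mitigation is to write the metadata in a pending state, upload the bytes, and only then mark the version as available - here's a hypothetical sketch with stubbed-out storage calls (none of this is actual Pidgeot code, and the function names are made up):

```python
def store_blob_version(insert_pending, upload_file, mark_available, data: bytes):
    """Write a metadata row in a 'pending' state, upload the bytes, and
    only then mark the version available. If the upload fails, the row
    stays 'pending' so a periodic cleanup job can retry or remove it,
    instead of metadata pointing at a file that never reached the store."""
    version_id = insert_pending(data)  # 1. metadata row, state = pending
    upload_file(version_id, data)      # 2. bytes to the blob store (may fail)
    mark_available(version_id)         # 3. only now visible to readers
    return version_id

# Stubbed usage: record the call order instead of hitting real systems
calls = []
version = store_blob_version(
    insert_pending=lambda data: calls.append("insert") or "v1",
    upload_file=lambda vid, data: calls.append("upload"),
    mark_available=lambda vid: calls.append("available"),
    data=b"invoice bytes",
)
print(version, calls)
```

It's not a real distributed transaction, but it at least makes the inconsistent state detectable rather than silent.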

Regardless, for viewing things in our MinIO instance, we'll be using Cyberduck with the MinIO profile that they provide. Again, while software like this isn't strictly necessary, I figured that I might offer it up as a nice option for poking around.

Setting up the actual DB schema took a little bit of effort and needed an explanation or two. So we'd expect our actual blob store to be similarly complicated, right? Well, no, not really. This is all that you need to get up and running with it, from my Docker Compose file:

version: '3.7'
services:
  pidgeot_minio:
    image: docker-prod.registry.kronis.dev:443/minio
    ports:
      - 9000:9000
      - 9001:9001
    volumes:
      - pidgeot_minio_data:/data
    environment:
      - MINIO_ROOT_USER=pidgeot_min_user
      - MINIO_ROOT_PASSWORD=pidgeot_min_password
      - MINIO_SERVER_ACCESS_KEY=pidgeot_min_access_key
      - MINIO_SERVER_SECRET_KEY=pidgeot_min_secret_key
    deploy:
      resources:
        limits:
          memory: 512M
          cpus: '1.00'

We just need to indicate how we'll connect to our instance of MinIO (which will be largely S3 compatible) and then we can do the rest in our Java code (or another language, whichever we prefer). Even then, all we'll need to do is create a few buckets to store the data in, a little bit like partitions, and then the instance will be usable. I'll actually do that in the next part, because I need to explain a bit about our Java application first, but it's rather simple.
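One detail worth knowing before creating those buckets: S3 (and thus MinIO) is picky about bucket names. A rough validation sketch of the common rules - 3 to 63 characters, lowercase letters, digits and hyphens, starting and ending with a letter or digit (dots are technically also allowed in S3, but are best avoided):

```python
import re

# Common S3 bucket naming rules: 3-63 characters, lowercase letters,
# digits and hyphens, starting and ending with a letter or digit
BUCKET_NAME_PATTERN = re.compile(r"^[a-z0-9][a-z0-9-]{1,61}[a-z0-9]$")

def is_valid_bucket_name(name: str) -> bool:
    return bool(BUCKET_NAME_PATTERN.match(name))

assert is_valid_bucket_name("pidgeot-blobs-2022")
assert not is_valid_bucket_name("Pidgeot_Blobs")  # uppercase/underscores rejected
```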

Testing our blob store

If you really wanted to, you could do the same inside of Cyberduck, even including file uploads:

21 Cyberduck compatibility

Under the hood there is some clever logic and there are optimizations to let us handle quite a few files, but the actual API is dead simple, so we get lots of great software built around it. I'd say that S3 owes its success to just how generic and widely usable it is. A little bit like SFTP for the masses, with included file download functionality, should you need it.

Also, did I mention that MinIO also comes with an easy to use web interface?

23 MinIO web interface

I've seen some bespoke in-house blob storage systems in my professional career so far, but nothing has been as good as being able to pick up an off-the-shelf solution that does 90% of what you need out of the box. Developers should suffer from the "Not Invented Here" way of thinking a bit less and should embrace open source projects like this when possible.

You can actually copy the URL of a file and download it directly, if you want to make the resources public. Or, you know, protect them and return an error message if the request isn't signed appropriately (by default nothing is exposed publicly unless you explicitly configure it that way):

22 access denied

This is perhaps one of the more annoying things about the solution itself, because you might need to mess around with the Host headers to allow files to be downloaded properly when you're running in containers. Other software out there lets you easily specify whatever you want as the external URL for the instance, and I catch myself wishing that MinIO had something similar to a MINIO_PUBLIC_URL option, but that doesn't appear to be the case.

But hey, at that point it's a matter of configuration, and with the rest of the system still left to build, I might put solving it off - maybe even until after these blog posts go live. Priorities and all!

Summary

So far, we've settled on using PostgreSQL for storing our metadata and MinIO for actually storing the uploaded files. Database partitioning wasn't too hard to get working, albeit it might be nice to have a little more information about it available. Either way, the tech stack seems both pretty stable and boring enough to let me focus on building things here.

In the next part we'll write a web API for retrieving metadata as needed, as well as most of the logic for persisting data in the database, in addition to actually generating lots of test data, so we can see how well all of this works at scale!

You can read the next part here: (2/4) Pidgeot: a system for millions of documents; The Application