(2/4) Pidgeot: a system for millions of documents; The Application

You can read the previous part here: (1/4) Pidgeot: a system for millions of documents; The Data

In the previous part, we settled on PostgreSQL and MinIO for our document storage and versioning solution, with the aim of creating something that's capable of storing millions of documents with very little in the way of issues. But now, let's create an API that would realistically interact with the database and the blob storage solution!

Technology choices

Here, I think I'll go for Java with Dropwizard, a pretty nice alternative to Spring in my eyes: it's lightweight and simple to get up and running with.

Many might reach for Spring Boot, and while it's a good enterprise solution, it's also a bit of a heavyweight that's full of "magic", at least in my opinion. In contrast, Dropwizard runs on Jetty and lets you actually pick how much complexity you want in your project, based on your needs, although it might not cover 100% of them unless you feel comfortable stitching libraries together.

There are more projects like it in the category of "micro frameworks", as opposed to "platform frameworks" around which you'd build your entire business or system. I'd say that in many cases the smaller options are good enough for both lightweight APIs and entire ERP systems, and personally, building systems out of many smaller (and simpler) components feels worthwhile.

Of course, I'm obligated to say that Java is a good choice for the data generation that I'll do later (though I could say the same for .NET), but in general most of the languages out there would be okay for what I'm trying to do here.

First, we create a new project, following their official instructions, which lets us use a Maven archetype for generating a starter project, something that's delightfully simple:

24 create new project

Then, we create a configuration class that corresponds to our configuration YAML file (it would be nicer to rely on environment variables for configuration, but we can do that later):

25 Dropwizard configuration
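
The class itself isn't shown in full here, but a minimal sketch of the pattern (assuming Dropwizard 2.x and the PidgeotApiConfiguration name that shows up later in the code; the environmentName field is purely hypothetical) would look something like this:

import com.fasterxml.jackson.annotation.JsonProperty;
import io.dropwizard.Configuration;

// A minimal sketch of a Dropwizard configuration class: every field is bound
// to a key in the configuration YAML file through Jackson.
public class PidgeotApiConfiguration extends Configuration {

    // hypothetical example field, just to show the YAML-to-field mapping pattern
    private String environmentName = "local";

    @JsonProperty
    public String getEnvironmentName() {
        return environmentName;
    }

    @JsonProperty
    public void setEnvironmentName(String environmentName) {
        this.environmentName = environmentName;
    }
}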

And for a very basic application, that's actually all that we need to build it:

26 build

We can then promptly run it in a mode that only validates its configuration, something that a lot of applications out there could use more of (i.e. a separate command that doesn't attempt actual database connections or run any scheduled processes):

27 check configuration

And after that we can also run the application in server mode:

28 run server

All we needed to get up and running was a few minutes of following the official instructions and a few versioned IntelliJ IDEA run profiles (though that's just for convenience, this would work through the CLI as well):

29 run profiles

After that, we can start adding dependencies and implementing logic to populate our solution with some data, before testing its performance.

Integrating Dropwizard with PostgreSQL through JDBI3

First, we'll probably want to add a few libraries to be able to access PostgreSQL and also interact with the MinIO server.

For the former, Dropwizard already provides an integration with JDBI3, which should be good enough for our needs, so we can add it to the project:

<dependency>
    <groupId>org.postgresql</groupId>
    <artifactId>postgresql</artifactId>
    <version>42.5.0</version>
</dependency>
<dependency>
    <groupId>io.dropwizard</groupId>
    <artifactId>dropwizard-jdbi3</artifactId>
</dependency>

You'll notice that we include the PostgreSQL driver separately from JDBI3, because that's the kind of decoupling we get - we could technically swap out the PostgreSQL instance for MariaDB or another database of our choosing, as long as the actual application code is compatible with it.

Truthfully, that never works out quite so smoothly in practice, apart from some of the more popular and established databases, but it's nice to have the option, or at least to use a common library for interacting with the DB and simply pick which DB that will be at the start of the project.

We follow the instructions to set it up with PostgreSQL, which means specifying the connection values in the configuration file:

database:
  driverClass: org.postgresql.Driver
  user: pidgeot_pg_user
  password: pidgeot_pg_password
  url: jdbc:postgresql://localhost:5432/pidgeot_pg_database
  properties:
    charSet: UTF-8
  maxWaitForConnection: 10s
  validationQuery: "/* Pidgeot API Health Check */ SELECT 1"
  validationQueryTimeout: 10s
  minSize: 8
  maxSize: 32
  checkConnectionWhileIdle: false
  evictionInterval: 30s
  minIdleTime: 1 minute
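
On the Java side, the wiring is just a few lines in the Application subclass, following the dropwizard-jdbi3 documentation: the database block above gets bound to a DataSourceFactory on the configuration class, and a JdbiFactory turns it into a Jdbi instance. A rough sketch, assuming the configuration exposes that block through a getDataSourceFactory() getter:

@Override
public void run(PidgeotApiConfiguration configuration, Environment environment) {
    // io.dropwizard.jdbi3.JdbiFactory builds a managed, health-checked Jdbi
    // instance from the "database" block of the YAML configuration; the last
    // argument is just the name used for metrics and the health check.
    final JdbiFactory factory = new JdbiFactory();
    final Jdbi jdbi = factory.build(environment, configuration.getDataSourceFactory(), "postgresql");
    // SQL Object DAOs (shown below) can then be created on demand
    final ClientDao clientDao = jdbi.onDemand(ClientDao.class);
}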

Sadly, there is no codegen that I could find, so we can't just point at the DB schema and generate a package of objects for interacting with it. Luckily, writing our own SQL Objects isn't too hard, though I'd still consider JPA over it in some cases.

Basically, first we create a data object, an entity that matches what we'll find in the database:

public class Client {
    public Client() {
    }

    private String id;
    private String name;
    private String location;
    private String email;
    private String website;

    ...
}

Then, we create a mapper for when we want to read whole objects from the database, not just work with raw ResultSet instances:

public class ClientMapper implements RowMapper<Client> {

    public ClientMapper() {
    }

    @Override
    public Client map(ResultSet rs, StatementContext ctx) throws SQLException {
        Client client = new Client();
        client.setId(rs.getString("id"));
        client.setName(rs.getString("name"));
        client.setLocation(rs.getString("location"));
        client.setEmail(rs.getString("email"));
        client.setWebsite(rs.getString("website"));
        return client;
    }
}

And after that, we create a Dao, which is just an interface with SQL snippets in annotations.

While there are benefits to using something like Hibernate, this approach lets us work with SQL directly and keeps the layers of abstraction to a minimum, for example:

public interface ClientDao {

    @SqlQuery("""
        SELECT COUNT(1)
        FROM clients
    """)
    Long countAll();

    @SqlUpdate("""
        INSERT INTO clients(name, location, email, website) 
        VALUES(:client.name, :client.location, :client.email, :client.website)
    """)
    @GetGeneratedKeys("id")
    String insert(
            @BindBean("client") Client client
    );

    @SqlUpdate("""
        UPDATE clients 
        SET 
            name = :client.name, 
            location = :client.location, 
            email = :client.email, 
            website = :client.website 
        WHERE id = :client.id::uuid
    """)
    void update(
            @BindBean("client") Client client
    );

    @SqlUpdate("""
        DELETE FROM clients
        WHERE id = :id::uuid
    """)
    void delete(
            @Bind("id") String id
    );

    @SqlQuery("""
        SELECT *
        FROM clients
        WHERE id = :id::uuid
    """)
    @RegisterRowMapper(ClientMapper.class)
    Client findById(
            @Bind("id") String id
    );

    @SqlQuery("""
        SELECT *
        FROM clients
        WHERE name ILIKE '%' || :name || '%'
    """)
    @RegisterRowMapper(ClientMapper.class)
    List<Client> searchByName(
            @Bind("name") String name
    );

}

Thanks to Java text blocks, we can finally use normal multi-line strings like in other languages, so mixing SQL with the methods that call it becomes rather easy! You can also just copy these snippets and paste them into pgAdmin or another tool of your choice, as opposed to generating all of your SQL dynamically, which would make debugging harder.

Plus, thanks to the IntelliJ IDEA integration, we can actually see whether our SQL is written correctly, which might help us catch glaring issues like columns that don't exist or invalid SQL:

32 IntelliJ integration

We also temporarily add some code for when the app starts up, to test whether this works with our local database:

final ClientDao clientDao = jdbi.onDemand(ClientDao.class);
Client newClient = new Client(
        null,
        "TestCo",
        "Riga",
        "test-co@kronis.dev",
        "https://test-co.kronis.dev"
);
logger.info("Inserting new test client: {}", newClient);
String newClientID = clientDao.insert(newClient);
logger.info("Inserted UUID: {}", newClientID);

Client savedNewClient = clientDao.findById(newClientID);
logger.info("Retrieved by ID: {}", savedNewClient);
savedNewClient.setName("CatCo");
savedNewClient.setEmail("cat-co@kronis.dev");
savedNewClient.setWebsite("https://cat-co.kronis.dev");
logger.info("Updating test client: {}", savedNewClient);
clientDao.update(savedNewClient);

Long clientCount = clientDao.countAll();
logger.info("There are: {} clients in the database...", clientCount);

List<Client> clients = clientDao.searchByName("Cat");
logger.debug("Current clients after search:");
for (Client client : clients) {
    logger.debug("{}", client);
}

This allows us to INSERT and UPDATE some data, get scalar values (e.g. COUNT) and also return entire lists of data from the DB, which will be mapped into objects:

33 results after test

Which matches what we end up with in our database:

34 database contents

Sadly, I have to do the same for all of the tables, though thankfully we don't have too many of those. I do add some search operations as well, for the sake of later testing whether an application benefits from the DB partitions.

Integrating Dropwizard with MinIO through its client library

For the second system that we want to integrate, MinIO, we'll just add the MinIO client library to our pom.xml file and get going with that:

<dependency>
    <groupId>io.minio</groupId>
    <artifactId>minio</artifactId>
    <version>8.4.5</version>
</dependency>

Then, we can write a rather small class for initializing our buckets, i.e. groups of data, in our case also "partitioned" by year:

30 bucket initialization
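
For reference, a rough sketch of what such bucket initialization can look like with the MinIO Java client (the endpoint, credentials and year range here are placeholders; the real class in the project may differ in the details):

public void initializeBuckets() throws Exception {
    // The client itself only needs the endpoint and credentials.
    MinioClient minioClient = MinioClient.builder()
            .endpoint("http://localhost:9000")
            .credentials("pidgeot_minio_user", "pidgeot_minio_password")
            .build();
    // One bucket per year, named after the year, matching how the blobs table
    // is partitioned; only create buckets that don't exist yet, so that
    // application restarts stay idempotent.
    for (int year = 2010; year <= 2030; year++) {
        String bucketName = String.valueOf(year);
        if (!minioClient.bucketExists(BucketExistsArgs.builder().bucket(bucketName).build())) {
            minioClient.makeBucket(MakeBucketArgs.builder().bucket(bucketName).build());
        }
    }
}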

Cyberduck also shows them, which means that our initialization was successful:

31 created buckets

This is actually a bit like what you might do with RabbitMQ - start your application and let it configure the attached solution based on its own unique demands. If I wanted to, of course, I could also extract this into a separate tool or process, like I did with dbmate, but this feels like a simple enough domain to work with in code.

Generally, any sort of automation like this, where your application tells the attached resources what configuration and setup it expects and does as much of that work as possible itself, is a good idea. Mostly because then you don't need to bother much with manual setup instructions and can launch new environments in a matter of minutes.

After that, all we need to do is use the API as needed, though the API itself is somewhat fluid, so one type of request might look a bit different from another. As an example, here's how you can upload an arbitrary file from an InputStream instance:

public void uploadFile(String bucket, String object, InputStream inputStream)
    throws IOException, ServerException, InsufficientDataException, ErrorResponseException,
    NoSuchAlgorithmException, InvalidKeyException, InvalidResponseException, XmlParserException,
    InternalException {
    minioClient.putObject(
            PutObjectArgs.builder()
                    .bucket(bucket)
                    .object(object)
                    .stream(inputStream, -1, ObjectWriteArgs.MIN_MULTIPART_SIZE)
                    .build()
    );
}
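
As a usage sketch, uploading a small in-memory "file" through the method above could look something like the following (the bucket and object names are placeholders, and the logger is whatever the surrounding class uses):

// Hypothetical usage of uploadFile() with an in-memory text file.
byte[] contents = "Hello, MinIO!".getBytes(StandardCharsets.UTF_8);
try (InputStream inputStream = new ByteArrayInputStream(contents)) {
    uploadFile("2022", "hello.txt", inputStream);
} catch (Exception e) {
    logger.error("Failed to upload the test file to MinIO!", e);
}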

You can mess around with the settings and some of the additional parameters, but MinIO and other S3 implementations are generally reasonably easy to use!

Making a web API for testing how it works

So, with most of the logic in place, we can start thinking about how we'll preview the data and modify it. The easiest way is to set up a RESTful API (or a bit more loosely defined JSON API, since we won't try to create all of the possible deeply nested routes) by defining a few resource classes:

@Path("/api/clients")
@Produces(MediaType.APPLICATION_JSON)
public class ClientResource {

    @GET
    @Path("/count")
    @Timed
    public Long getCount() {
        ClientDao clientDao = PidgeotDependencyContainer.INSTANCE.getClientDao();
        return clientDao.countAll();
    }

    @GET
    @Timed
    public List<Client> get(@QueryParam("name") Optional<String> name) {
        ClientDao clientDao = PidgeotDependencyContainer.INSTANCE.getClientDao();
        if (name.isPresent()) {
            return clientDao.searchByName(name.get());
        } else {
            return clientDao.findAll();
        }
    }

    @POST
    @Timed
    public Client create(Client client) {
        ClientDao clientDao = PidgeotDependencyContainer.INSTANCE.getClientDao();
        String savedId = clientDao.insert(client);
        return clientDao.findById(savedId);
    }

    @PUT
    @Timed
    public Client update(Client client) {
        ClientDao clientDao = PidgeotDependencyContainer.INSTANCE.getClientDao();
        clientDao.update(client);
        return clientDao.findById(client.getId());
    }

    @GET
    @Path("/{clientId}")
    @Timed
    public Client getById(@PathParam("clientId") String clientId) {
        ClientDao clientDao = PidgeotDependencyContainer.INSTANCE.getClientDao();
        return clientDao.findById(clientId);
    }

    @DELETE
    @Path("/{clientId}")
    @Timed
    public Response delete(@PathParam("clientId") String clientId) {
        ClientDao clientDao = PidgeotDependencyContainer.INSTANCE.getClientDao();
        clientDao.delete(clientId);
        return Response.status(Response.Status.OK).build();
    }

}

You'll notice that I'm not using dependency injection, as previously stated, and instead am choosing to pull what I need out of my dependency container. Some might prefer to pass the dependencies in manually, either through the constructor (e.g. during resource registration) or through the actual method parameters, but personally, this is good enough for a quickly written system.

Testing isn't as impossible in such a setup as some might suggest, since you can just set up whatever you need in your dependency container yourself. Of course, this also isn't suitable for larger systems (think hundreds of resources - it'd become a mess then).
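
For context, the dependency container itself is nothing fancy; a sketch of the general idea (here as an enum singleton, though the exact implementation in the project may differ) could be:

// A sketch of the dependency container idea: a singleton that holds initialized
// instances and hands them out through getters. The application's run() method
// would fill it once during startup (e.g. with the jdbi.onDemand(...) proxies),
// and the resources then pull out whatever they need.
public enum PidgeotDependencyContainer {
    INSTANCE;

    private ClientDao clientDao;
    // ...plus fields and getters/setters for the other DAOs, the MinIO client
    // wrapper, the configuration and so on

    public ClientDao getClientDao() {
        return clientDao;
    }

    public void setClientDao(ClientDao clientDao) {
        this.clientDao = clientDao;
    }
}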

That said, we're keeping things as simple as possible. Dropwizard also lets us register all of these resources with just a few lines in its configuration:

JerseyEnvironment jersey = environment.jersey();
jersey.register(new ClientResource());
jersey.register(new DocumentResource());
jersey.register(new ClientDocumentResource());
...

After startup, I'll get the following message listing the available paths, something that, once again, is immensely useful to have, both to tell whether your paths are defined correctly and to figure out which ones are actually available:

35 jersey paths

(note that I can access documents both under a particular client and directly when we just have an ID; the same goes for other paths as well, to allow us to make simpler requests with fewer parameters)

Opening one of those URLs in the browser also gives me the corresponding data:

http://localhost:8080/api/clients

36 clients

We might also think about paging, but frankly, I don't quite care about that just yet. There are also quite a few endpoints to actually add, much like there were for the Dao logic:

37 all resources

We'll mostly use these to make sure that the initial data generation works. You see, first I intend to generate 10-20 records for each table, then fan out into thousands and eventually millions, should the hardware available to me be able to handle that.

It should help me find the limitations of such a setup nicely, but first I obviously need to make sure that my data generation works as expected, for which even inspecting JSON contents of the API responses will be sufficient.

Normally we'd also care about some sort of an authentication mechanism for our APIs, but this is not one of those cases - using Java instead of Node/Python slows us down enough already, so let's not overcomplicate things.

One of the things that would be nice to add later is an OpenAPI spec, or some framework integration so I can generate one automatically.

Being able to test the web API more easily would be really nice, though in other cases Postman on its own can be enough, even if importing the OpenAPI spec to get a full description of your API in Postman is nicer still.

Another interesting thing about the actual API is that in some endpoints you'll see additional paths which allow selecting data by the year:

http://localhost:8080/api/blobs
http://localhost:8080/api/blobs/by-year/{year}

or

http://localhost:8080/api/blobs/{blobId}
http://localhost:8080/api/blobs/by-year/{year}/{blobId}

or

http://localhost:8080/api/blob-groups/{blobGroupId}/blobs/{blobId}/blob-versions
http://localhost:8080/api/blob-groups/{blobGroupId}/blobs/by-year/{year}/{blobId}/blob-versions

I was sleepy when I wrote these, so putting the year in the beginning didn't occur to me, mostly because it'd mess with the alphabetical sorting of all of these paths.

Either way, the idea here is that PostgreSQL should be smart enough to detect when our data fits the partition key. We did partition the blobs table by the year:

PARTITION BY LIST (EXTRACT(YEAR FROM created_at))

I wonder whether, when we do a very similar select, we'll only hit the corresponding partition without checking the others:

SELECT *
FROM blobs
WHERE EXTRACT(YEAR FROM created_at) = :year
    AND id = :id::uuid

versus just fetching by the ID on its own:

SELECT *
FROM blobs
WHERE id = :id::uuid

(for that, you'll need to see the latter parts of this series of blog posts)

This should allow us to test that nicely. On the other hand, the blob_versions table is partitioned by the blob_id, so it should always work when we try to select some data for a particular blob. Either way, we now have enough endpoints to test whatever theory we might have:

http://localhost:8080/api/blob-groups
http://localhost:8080/api/blob-groups/by-code/{code}
http://localhost:8080/api/blob-groups/count
http://localhost:8080/api/blob-groups/{blobGroupId}
http://localhost:8080/api/blob-groups/{blobGroupId}/blobs
http://localhost:8080/api/blob-groups/{blobGroupId}/blobs/by-year/{year}
http://localhost:8080/api/blob-groups/{blobGroupId}/blobs/by-year/{year}/{blobId}
http://localhost:8080/api/blob-groups/{blobGroupId}/blobs/by-year/{year}/{blobId}/blob-versions
http://localhost:8080/api/blob-groups/{blobGroupId}/blobs/by-year/{year}/{blobId}/blob-versions/by-version/{version}
http://localhost:8080/api/blob-groups/{blobGroupId}/blobs/by-year/{year}/{blobId}/blob-versions/count
http://localhost:8080/api/blob-groups/{blobGroupId}/blobs/by-year/{year}/{blobId}/blob-versions/latest
http://localhost:8080/api/blob-groups/{blobGroupId}/blobs/by-year/{year}/{blobId}/blob-versions/{blobId}
http://localhost:8080/api/blob-groups/{blobGroupId}/blobs/{blobId}
http://localhost:8080/api/blob-groups/{blobGroupId}/blobs/{blobId}/blob-versions
http://localhost:8080/api/blob-groups/{blobGroupId}/blobs/{blobId}/blob-versions/by-version/{version}
http://localhost:8080/api/blob-groups/{blobGroupId}/blobs/{blobId}/blob-versions/count
http://localhost:8080/api/blob-groups/{blobGroupId}/blobs/{blobId}/blob-versions/latest
http://localhost:8080/api/blob-groups/{blobGroupId}/blobs/{blobId}/blob-versions/{blobId}
http://localhost:8080/api/blob-versions
http://localhost:8080/api/blob-versions/count
http://localhost:8080/api/blob-versions/{blobVesionId}
http://localhost:8080/api/blobs
http://localhost:8080/api/blobs/by-year/{year}
http://localhost:8080/api/blobs/by-year/{year}/count
http://localhost:8080/api/blobs/by-year/{year}/{blobId}
http://localhost:8080/api/blobs/count
http://localhost:8080/api/blobs/{blobId}
http://localhost:8080/api/clients
http://localhost:8080/api/clients/count
http://localhost:8080/api/clients/{clientId}
http://localhost:8080/api/clients/{clientId}/documents
http://localhost:8080/api/clients/{clientId}/documents/{documentId}
http://localhost:8080/api/clients/{clientId}/documents/{documentId}/blobs
http://localhost:8080/api/clients/{clientId}/documents/{documentId}/blobs/by-year/{year}
http://localhost:8080/api/clients/{clientId}/documents/{documentId}/blobs/by-year/{year}/{blobId}
http://localhost:8080/api/clients/{clientId}/documents/{documentId}/blobs/{blobId}
http://localhost:8080/api/documents
http://localhost:8080/api/documents/count
http://localhost:8080/api/documents/{documentId}

But how do we actually get the data? With a bit of Java code, of course!

Generating the test data

The workload that I want to test calls for the following structure:

  • we're going to generate some vague blob groups
  • we're going to generate some clients
  • these clients are going to have documents (for various years)
  • each of those documents is going to have blobs, which are basically attachments
  • those blobs, in turn, will have multiple versions each

All of these amounts should also be customizable, so I can first run locally with a smaller amount of data, but eventually scale out to something more substantial once I acquire the hardware. For that, I've created some configuration that lets me tell the application itself to do this upon startup:

38 test configuration

Here you can notice additional configuration for local testing, such as enabling/disabling MinIO uploads: locally I have my Docker containers running off of an HDD and not the boot SSD, so actually writing the files would slow everything down, including the database!
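
The getters used by the generation code below hint at what such a configuration class needs to expose; a sketch of roughly what it might contain (the getter names match the code, while the exact set of fields and the defaults are assumptions on my part):

import com.fasterxml.jackson.annotation.JsonProperty;

// A sketch of the test data configuration; the getter names match what the
// generation code uses, the defaults and remaining fields are assumptions.
public class PidgeotTestDataConfiguration {

    @JsonProperty
    private Long blobGroupCount = 10L;
    @JsonProperty
    private Long clientCount = 10L;
    @JsonProperty
    private Long documentCountForClients = 10L;
    @JsonProperty
    private Long blobCountPerDocument = 2L;
    @JsonProperty
    private Long blobVersionCountPerBlob = 2L;
    @JsonProperty
    private Boolean actuallyUploadToMinIO = true;

    public Long getBlobGroupCount() { return blobGroupCount; }
    public Long getClientCount() { return clientCount; }
    public Long getDocumentCountForClients() { return documentCountForClients; }
    public Long getBlobCountPerDocument() { return blobCountPerDocument; }
    public Long getBlobVersionCountPerBlob() { return blobVersionCountPerBlob; }
    public Boolean getActuallyUploadToMinIO() { return actuallyUploadToMinIO; }
}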

After that, it's just a matter of adding something to actually generate the data, which in my case is just a basic class, given that the dependency container already has everything I'll need for this:

/**
 * The entrypoint for generating all of the test data.
 * This SHOULD be able to handle millions of rows.
 */
private void generateAllTestData() {
    logger.info("Generating all test data...");
    PidgeotTestDataConfiguration testDataConfiguration = PidgeotDependencyContainer.INSTANCE.getPidgeotApiConfiguration().getPidgeotTestDataConfiguration();
    PidgeotTestDataCounter pidgeotTestDataCounter = PidgeotDependencyContainer.INSTANCE.getPidgeotTestDataCounter();
    pidgeotTestDataCounter.resetGlobalCounters();
    List<String> blobGroupIds = generateBlobGroupData(testDataConfiguration, pidgeotTestDataCounter);
    generateClientData(testDataConfiguration, pidgeotTestDataCounter, blobGroupIds);
    logger.info("Finished generating all test data...");
}

The actual code is basically just lots of different integer streams with the amount of data that we need to generate, as well as some global mutable counters for figuring out how far along we are, for example:

/**
 * Generates all the clients and then their related data.
 */
private void generateClientData(PidgeotTestDataConfiguration testDataConfiguration, PidgeotTestDataCounter pidgeotTestDataCounter, List<String> blobGroupIds) {
    ClientDao clientDao = PidgeotDependencyContainer.INSTANCE.getClientDao();
    Long clientCount = testDataConfiguration.getClientCount();
    logger.info("Generating data for: {} clients...", clientCount);
    LongStream.range(0, clientCount).forEach(streamIndex -> {
        long currentIndex = pidgeotTestDataCounter.getClientCounter().getAndIncrement();
        logger.info("Generating data for client: {} / {}", currentIndex + 1, clientCount);
        Client client = new Client(
                null,
                String.format("Test Client #%d", currentIndex),
                TestDataLocations.getRandomLocation(),
                String.format("client-%d@kronis.dev", currentIndex),
                String.format("https://client-%d.kronis.dev", currentIndex)
        );
        String clientId = clientDao.insert(client);
        generateDocumentData(testDataConfiguration, pidgeotTestDataCounter, clientId, blobGroupIds);
    });
}

Perhaps the cleverest bit of code is the one simulating file uploads, which we achieve with a bunch of InputStream instances created from strings, standing in for small test files (storage is also limited on the MinIO side, after all - I can't have a million files that are 10 MB each):

/**
 * We are going to create versions for blobs, the table with most of the data!
 * No longer does logging, because useless lower on the call stack.
 * This also interacts with MinIO!
 */
private void generateVersionsForBlob(PidgeotTestDataConfiguration testDataConfiguration, PidgeotTestDataCounter pidgeotTestDataCounter, LocalDateTime blobCreatedAt, String blobId) {
    BlobVersionDao blobVersionDao = PidgeotDependencyContainer.INSTANCE.getBlobVersionDao();
    PidgeotMinIOClient minIOClient = PidgeotDependencyContainer.INSTANCE.getMinIOClient();
    Long blobVersionCount = testDataConfiguration.getBlobVersionCountPerBlob();
    Mutable<LocalDateTime> currentVersionCreatedAt = new MutableObject<>(blobCreatedAt);
    LongStream.range(0, blobVersionCount).forEach(streamIndex -> {
        long currentIndex = pidgeotTestDataCounter.getBlobVersionCounter().getAndIncrement();
        TestDataTimes.addAFewSecondsToTime(currentVersionCreatedAt);
        String fileContentsText = String.format("This is a test file #%d for blob: %s", currentIndex, blobId);
        byte[] fileContentsTextBytes = fileContentsText.getBytes(StandardCharsets.UTF_8);
        String filename = String.format("test-file-%d.txt", currentIndex);
        BlobVersion blobVersion = new BlobVersion(
                null,
                blobId,
                filename,
                "text/plain",
                TestFiles.getMD5Hex(fileContentsTextBytes),
                streamIndex,
                (long) fileContentsTextBytes.length,
                currentVersionCreatedAt.getValue()
        );
        String blobVersionId = blobVersionDao.insert(blobVersion);
        uploadToMinIOForBlobVersion(minIOClient, currentVersionCreatedAt, fileContentsTextBytes, filename);
    });
}

/**
 * If we want to actually persist data to MinIO, then we can do so.
 * This can also be disabled in the configuration for performance reasons, if needed.
 */
private void uploadToMinIOForBlobVersion(PidgeotMinIOClient minIOClient, Mutable<LocalDateTime> currentVersionCreatedAt, byte[] fileContentsTextBytes, String filename) {
    PidgeotTestDataConfiguration pidgeotTestDataConfiguration = PidgeotDependencyContainer.INSTANCE.getPidgeotApiConfiguration().getPidgeotTestDataConfiguration();
    Boolean shouldUpload = pidgeotTestDataConfiguration.getActuallyUploadToMinIO();
    if (!shouldUpload) {
        return;
    }

    String bucketName = String.valueOf(currentVersionCreatedAt.getValue().getYear());
    InputStream fileContents = TestFiles.textContentsToInputStream(fileContentsTextBytes);
    try {
        minIOClient.uploadFile(bucketName, filename, fileContents);
    } catch (Exception e) {
        logger.error("Failed to upload file with bucket name: {} and filename: {} to MinIO!", bucketName, filename, e);
    }
}
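
The TestFiles helpers referenced above aren't shown in the post, but they presumably boil down to something like this sketch (getMD5Hex producing the stored checksum and textContentsToInputStream just wrapping the bytes):

import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// A sketch of the TestFiles helpers: an MD5 hex digest for the stored checksum
// and a ByteArrayInputStream for simulating file uploads from in-memory strings.
public class TestFiles {

    public static String getMD5Hex(byte[] contents) {
        try {
            MessageDigest digest = MessageDigest.getInstance("MD5");
            byte[] hash = digest.digest(contents);
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException("MD5 is not available in this JVM!", e);
        }
    }

    public static InputStream textContentsToInputStream(byte[] contents) {
        return new ByteArrayInputStream(contents);
    }
}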

In addition, since our logic depends on some network IO, I also went with a thread pool solution to parallelize some of the data initialization, given the need for this to work quickly:

/**
 * Generates all the documents and their related data.
 * No longer does logging, because useless lower on the call stack.
 */
private void generateDocumentData(PidgeotTestDataConfiguration testDataConfiguration, PidgeotTestDataCounter pidgeotTestDataCounter, String clientId, List<String> blobGroupIds) {
    DocumentDao documentDao = PidgeotDependencyContainer.INSTANCE.getDocumentDao();
    PidgeotTestDataThreadPool pidgeotTestDataThreadPool = PidgeotDependencyContainer.INSTANCE.getPidgeotTestDataThreadPool();
    ExecutorService threadPool = pidgeotTestDataThreadPool.getThreadPool();
    Long documentCount = testDataConfiguration.getDocumentCountForClients();
    CountDownLatch countDownLatch = new CountDownLatch(documentCount.intValue());
    LongStream.range(0, documentCount).forEach(streamIndex -> {
        threadPool.submit(() -> {
            try {
                long currentIndex = pidgeotTestDataCounter.getDocumentCounter().getAndIncrement();
                LocalDateTime createdAt = TestDataTimes.getRandomYearRandomTime();
                LocalDateTime validUntil = TestDataTimes.addRandomLongDurationForTime(createdAt);
                Document document = new Document(
                        null,
                        clientId,
                        String.format("document-code-%d", currentIndex),
                        String.format("Test Document #%d", currentIndex),
                        createdAt,
                        validUntil
                );
                String documentId = documentDao.insert(document);
                generateBlobsForDocument(testDataConfiguration, pidgeotTestDataCounter, blobGroupIds, documentId, createdAt, validUntil);
            } catch (Exception e) {
                logger.error("Failed to generate document data!", e); // a rethrow from a submitted task would just get swallowed
            } finally {
                countDownLatch.countDown(); // regardless of whether successful or not
            }
        });
    });
    // we're going to wait for all the documents before proceeding, to not overload the thread pool
    try {
        countDownLatch.await();
    } catch (InterruptedException e) {
        throw new RuntimeException(e);
    }
    logger.info("Finished generating ALL documents for client, total documents so far: {}, total blobs so far: {}, total blob versions so far: {}",
            pidgeotTestDataCounter.getDocumentCounter().getValue(),
            pidgeotTestDataCounter.getBlobCounter().getValue(),
            pidgeotTestDataCounter.getBlobVersionCounter().getValue());
}

You'll notice me using a CountDownLatch here, to not overwhelm the thread pool. The choice that I made was to generate all of the documents for each client in parallel, whereas the iteration through the clients is done sequentially. This means that all of the blob and blob version generation will take place on one of the threads in the thread pool, per document.

This should keep the amount of tasks that the thread pool needs to execute reasonable, while also respecting the fact that most of the CPUs that I have available to me (or can afford to rent) will have a somewhat limited amount of cores. In short, too much parallelization wouldn't make sense, because of the resource limitations.
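
The thread pool wrapper itself doesn't need to be anything more than a fixed-size ExecutorService; a sketch of what it might look like (the pool size and the way it gets configured are assumptions on my part):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// A sketch of the thread pool wrapper used by the data generation: a fixed-size
// pool keeps the number of concurrent PostgreSQL/MinIO calls bounded.
public class PidgeotTestDataThreadPool {

    private final ExecutorService threadPool;

    public PidgeotTestDataThreadPool(int threadCount) {
        this.threadPool = Executors.newFixedThreadPool(threadCount);
    }

    public ExecutorService getThreadPool() {
        return threadPool;
    }

    public void shutdown() {
        threadPool.shutdown();
    }
}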

The actual reason for needing this functionality is that uploads and interaction with the database are currently blocking operations - while we wait for MinIO or PostgreSQL to process our requests, the threads just sit there, even though not a lot of actual work is being done.

This could be partially solved by something like the non-blocking/async database drivers that the Eclipse Vert.x project has attempted to implement, but in the Java space the technology is still somewhat immature, so using it would be a risk. Using a bunch of threads is less efficient, resource-wise, but it's a passable solution.

Regardless, after a little while, I successfully got a bunch of data both into my database and into my MinIO instance, with somewhat sparse console output along the way:

39 generate test data

(you'll also notice small inconsistencies with the counters; this was before adding some synchronization)

So, the good news is that I can now get as much data in PostgreSQL as needed:

40 generated data in DB

(here you can see that the data amounts are correct, according to the configuration)

The same applies to having data in my S3-compatible blob store, even though it's just a number of comparatively small text files:

41 generated data in cyberduck

A brief look at how I generate this data also tells me how it would scale: seemingly minor changes to the parameters increase the amount of data by an order of magnitude:

42 how this scales

Which, ultimately, is my goal.

Testing with increasingly higher amounts of data reveals that, time-wise, this generation scales nearly linearly:

  • generating 10'000 records takes 71 seconds (7.10 seconds per 1000 records)
  • generating 100'000 records takes 859 seconds (8.59 seconds per 1000 records)

So I should be okay to scale this upwards, to test whether selecting data remains similarly quick once the tables grow to a sufficient size. Before that, though, I need to package the application up into something that can run on a beefier server, maybe also put some TLS in front of it, as well as at least basic auth (assuming a public server), all of which I might handle on the server itself.

Summary

So far, writing a Dropwizard application in Java has been a decent experience. It's not the quickest stack when it comes to actually writing code, but it is reasonably performant at runtime, and the ecosystem is pretty mature and stable for the most part.

I also enjoyed the fact that Dropwizard actually let me manually initialize any dependencies that I might have, in whatever ways I wanted to do that. While using my dependency container idea isn't exactly a best practice, neither is @Autowired on fields in Spring Boot, yet people still sometimes use those for convenience.

One thing I absolutely won't deny, however, is that the application looked rather different by the time I finished writing it than I thought it would when I started out. Sometimes the framework or other libraries enforce a certain structure upon you, whereas other times your requirements change - when it comes to actually generating millions of records, I'll scale this whole thing horizontally!

You can read the next part here: (3/4) Pidgeot: a system for millions of documents; Deploying and Scaling
