Alex Lunkov

Posted on Feb 15, 2024

Storing a Java collection in a file storage

#java #programming #discuss #api

If you deal with large datasets stored in a Java collection, let's say java.util.List, sooner or later you will run into a situation where the allocated memory is not enough to hold all your data. The collection may even contain a small number of elements, with many of them taking up a lot of memory. Sometimes a process must accumulate a large dataset and there is no way to process the data in chunks that fit into the allocated memory. There are multiple ways to resolve the issue, from storing the data in a file to temporarily storing it in Redis.

We ran into a similar issue in a system that has been in development for many years, where it is not so simple to change how data is accumulated in memory and used. A Java collection is full of large objects, and there are multiple threads in the same app for that specific activity. Of course, there are a few instances of the service that processes the data; sometimes a dataset is small, sometimes it is large, and that is really frustrating because it is not simple to choose a scaling model. In such cases I usually say that it is simpler, faster, and more reliable to rewrite rather than "fine-tune", but this time we do fine-tuning :) Let's leave behind the curtain why a system operates on data that does not fit into the allocated memory 🤠 it happens. Sometimes we can change an implementation in a more robust way, but sometimes we need to find a compromise with the existing solution.

We implemented a small library: the data of a Java collection is stored in the file system instead of RAM, and a convenient, Java Stream-like interface is provided for operating on the data.

Okay, the library is FStream. Its central class is FCollection, which is similar to java.util.List but with a reduced number of methods. With FCollection you can add new items, sort them with a comparator, iterate over the elements, and create an instance of FStream, which is also a reduced version of Java Stream. With FStream you can apply sequential operations to the elements of a collection.

Example

In the following code snippet, all data is stored in a file located in a temporary directory. Data is written to the file system immediately as it is added, but it is still possible to operate on the items through FStream.

FCollection<SomeClassName> collection = FCollection.create();

// add elements to a collection
collection.add(instance);

// iterate elements of a collection
Iterator<SomeClassName> i = collection.iterator();
while (i.hasNext()) {
    consumer.accept(i.next());
}

// also iterates over all elements in a collection
collection.forEach(this::consumer);

// create a new collection
FCollection<AnotherClassName> collection2 = collection.stream()
        .filter(o -> o.isActive())
        .map(this::convert)
        .sort((o1, o2) -> o1.compareTo(o2))
        .collect();

// destroy collections' data in a file storage
collection.close();
collection2.close();

How it works

Create a collection

When a collection is created, for instance with the create method, a new file is created in the /tmp directory, or in a custom directory if one is specified.

FCollection<SomeClassName> collection = FCollection.create();

Add items in a collection

Adding a new item to a collection consists of serializing the item and writing it to the collection's file in the file storage. By default, serialization is done with FJdkSerializer, but it is possible to use a custom serializer. Customization is described below.
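
To illustrate the "serialize, then write to the file" step, here is a minimal sketch that assumes a simple record layout of a 4-byte length header followed by the serialized payload. RecordFileSketch and its methods are hypothetical names used only for this illustration; they are not part of FStream, and the library's real file format may differ.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.RandomAccessFile;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not FStream code: stores records as [int length][payload].
final class RecordFileSketch {

    // Serialize one item with standard JDK serialization and append it to the file.
    static void append(Path file, Serializable item) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(item);
        }
        byte[] payload = buffer.toByteArray();
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "rw")) {
            raf.seek(raf.length());        // move to the end of the file
            raf.writeInt(payload.length);  // header: payload size in bytes
            raf.write(payload);            // body: the serialized item
        }
    }

    // Read every record back in insertion order (collected into a list for the demo only).
    static List<Object> readAll(Path file) throws IOException, ClassNotFoundException {
        List<Object> items = new ArrayList<>();
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            while (raf.getFilePointer() < raf.length()) {
                byte[] payload = new byte[raf.readInt()];
                raf.readFully(payload);
                try (ObjectInputStream in =
                         new ObjectInputStream(new ByteArrayInputStream(payload))) {
                    items.add(in.readObject());
                }
            }
        }
        return items;
    }

    public static void main(String[] args) throws Exception {
        Path file = Files.createTempFile("records", ".bin");
        try {
            append(file, "first");
            append(file, 42);
            System.out.println(readAll(file));
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

The length header is what lets a reader find record boundaries without deserializing anything. A real implementation would keep the file open between writes instead of reopening it for every item.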

// add elements to a collection
collection.add(instance);

Apply operations on a collection

The approach here is the same as with Java Stream: a developer specifies, in a functional style, the operations to apply to each element of a collection. As a result, a new collection is created and stored in the file storage.

FCollection<AnotherClassName> collection2 = collection.stream()
        .filter(o -> o.isActive())
        .map(this::convert)
        .sort((o1, o2) -> o1.compareTo(o2))
        .collect();

Customization

So far it is possible to specify where to store the temporary data of a collection and to assign a custom serializer to a collection. A serializer must implement the FSerializer interface. After that, a collection can be created with a builder.

FCollection<String> c = 
        FCollection.builder()
        .serializer(new CustomSerializer())
        .storageLocation("/your/location")
        .build();

Want to try it out?

Visit project's GitHub repository: https://github.com/alex-53-8/fstream

Top comments (7)

Nils • Feb 17 '24

Thanks for sharing this library, I really like the idea behind it to store big amounts of data on disk instead of in-memory in a simple way during processing.

I've been playing around for a little while with the usage example program and the code in general and - if you don't mind - I'd like to share some detailed feedback. There are some speed optimizations that might be useful and I also came across some thoughts on Java Collection integration.
As I don't want to seem rude by just dropping several snippets of code here in the comments out of the blue, I'd kindly ask beforehand if it's okay to do so.

Alex Lunkov • Feb 17 '24

It is true that the library needs optimization for speed; writing to file storage is definitely always slower than storing in memory.

I welcome improvements and thoughts, please share your opinion :) it is really interesting to me!

Nils • Feb 17 '24

The execution time of the example program dropped significantly on my machine when adding the following changes:

Implementing FOutputStream.write(byte[]) to forward the given data directly to raf.write(byte[])
Implementing FOutputStream.write(byte[], int, int) to forward the given data to raf.write(byte[], int, int)
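
As a rough sketch of the two changes listed above: the default OutputStream.write(byte[], int, int) calls write(int) once per byte, so forwarding whole arrays to the RandomAccessFile avoids one file write per byte. RafOutputStream is a hypothetical stand-in for FOutputStream here; the real class in the library has more state than this.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.RandomAccessFile;

// Hypothetical stand-in for FOutputStream: an OutputStream backed by a RandomAccessFile.
final class RafOutputStream extends OutputStream {
    private final RandomAccessFile raf;

    RafOutputStream(RandomAccessFile raf) {
        this.raf = raf;
    }

    // Single-byte write, the only method OutputStream requires.
    @Override
    public void write(int b) throws IOException {
        raf.write(b);
    }

    // Forward the whole array in one call instead of byte by byte.
    @Override
    public void write(byte[] b) throws IOException {
        raf.write(b);
    }

    // Forward the requested slice in one call instead of byte by byte.
    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        raf.write(b, off, len);
    }
}
```

This matters most when an ObjectOutputStream or similar wrapper hands over large buffers at once.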

A similar approach works for FInputStream by implementing read(byte[], int, int) but here a little more logic needs to be added to correspond with what you've implemented in read():

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        raf.seek(pos);
        int remaining = (int) (endOfBlock - pos);
        if (remaining <= 0) {
            return -1;
        }
        // write into the caller's buffer starting at 'off', not at index 0
        int readCount = raf.read(b, off, Math.min(remaining, len));
        if (readCount > 0) {
            pos += readCount;
        }
        return readCount;
    }

And of course I've also looked into options for providing a (more or less) compatible Java Collections integration.
Similar to the unmodifiable collections that Java itself offers, one could implement only the "read" methods of Collection and reject every write call, either silently or with an exception. The only missing "read" methods, size() and isEmpty(), could be implemented by seeking through the raf in FFileStorage, jumping from header to header (ignoring the real data in between) and counting the jumps. And isEmpty() might be something like return (raf.length() == 0L);.
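
A minimal sketch of that header-to-header counting, assuming each record is stored as a 4-byte length header followed by its payload. The actual layout used by FFileStorage may be different, and RecordCounter is a hypothetical name, so treat this only as an illustration of the idea:

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical helper: counts records laid out as [int length][payload].
final class RecordCounter {

    // Jump from header to header without reading any payload bytes.
    static long count(RandomAccessFile raf) throws IOException {
        long fileLength = raf.length();
        long pos = 0;
        long records = 0;
        while (pos < fileLength) {
            raf.seek(pos);
            int payloadLength = raf.readInt();          // read only the header
            pos += (long) Integer.BYTES + payloadLength; // skip over the payload
            records++;
        }
        return records;
    }

    // An empty file means an empty collection.
    static boolean isEmpty(RandomAccessFile raf) throws IOException {
        return raf.length() == 0L;
    }
}
```

Counting this way is still linear in the number of records, so caching the count while items are added would probably be cheaper if size() is called often.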
And because FCollection (and the underlying FFileCollection) even offers a Java Iterator implementation, a "real" Stream could be easily derived:

    public Stream<T> asJavaStream() {
        return StreamSupport.stream(Spliterators.spliteratorUnknownSize(this.iterator(), Spliterator.ORDERED), false);
    }

Also, I had some trouble starting the program under MS Windows because the default storage path /tmp resolves to C:\tmp, which usually does not exist, so the file cannot be created and a manual configuration must be used. Maybe the logic of java.io.File.createTempFile(...) could be somehow incorporated for the default case to handle the selection of the temporary directory.
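
One possible shape for such a default, sketched with java.nio.file.Files.createTempFile, which uses the platform's temporary directory when no directory is given. StorageLocationSketch, newStorageFile and the "fcollection-" prefix are hypothetical names for this illustration and do not come from the library:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: picks the file that backs a collection.
final class StorageLocationSketch {

    // customDir may be null, meaning "use the system temporary directory".
    static Path newStorageFile(String customDir) throws IOException {
        if (customDir == null) {
            // Resolves to the platform default (java.io.tmpdir) on any OS.
            return Files.createTempFile("fcollection-", ".bin");
        }
        // Create the custom directory if it does not exist yet.
        Path dir = Files.createDirectories(Paths.get(customDir));
        return Files.createTempFile(dir, "fcollection-", ".bin");
    }
}
```

With this approach a hard-coded /tmp is never needed, and each collection still gets a unique file name.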

Alex Lunkov • Feb 19 '24

Hi Nils
Thank you for your suggestions,

It is a really good catch about modifying the output/input stream classes - that increases performance. I will also replace the "/tmp" default with the system-defined temporary directory.

It could seem that implementing the "Collection" interface in "FCollection" should be simple - indeed, there are not so many methods to implement - but I would avoid using the default implementation for creating a stream out of an iterator, as it will lead to accumulating all data in memory again, especially for sorting. When I find a way to properly implement the Stream interface for FStream, I believe I will also implement the Collection interface.

Alex Lunkov • Feb 15 '24

The main reason is the complexity of adding such functionality to the existing Java Collections - we would need not only to override the basic operations of collections: add, remove, iterate (all of them would need to support storing and retrieving data in/from a file system), but also to do very complicated work to support all Java "streams" methods so that they read and write data in file storage instead of RAM. That would be a really huge effort :) So far FStream supports a limited number of Java stream methods, only what we needed for solving our tasks, but we will probably extend the list of supported operations in the future and finally implement the Collection and Stream interfaces in our library.

Arnaud Dagnelies • Feb 17 '24

Reminds me of github.com/dagnelies/FileMap I wrote long ago. It's focused on maps rather than lists though.