loading...
Cover image for Fetching Millions of Rows with Streams in Node.js

Fetching Millions of Rows with Streams in Node.js

_patrickgod profile image Patrick God Originally published at patrickgod.com ・3 min read

Have you ever had to face the challenge of fetching several million rows of data with lots of columns from a database and display them on the web?

Well, I had to do this recently. Although I was not convinced that this would make any sense at all, I still had to do it.

Anyways, here's the solution.

But first, the technology stack: Node.js, Sequelize & MariaDB. The client doesn't matter, because at times the data was almost 4 GB big, so Chrome crashed anyway eventually.

Sequelize was the first big problem.

The solution to getting such big data results is streaming. Receiving that data with one big call led to Node crashing. So streaming is the answer, and Sequelize did not support streaming at that point.

Well, I was able to manually add streaming to Sequelize, but the service call takes ages in the end.

Here's a snippet of a usual Sequelize call:

await sequelize.authenticate();
const result = await sequelize.query(sql, { type: sequelize.QueryTypes.SELECT });
res.send(result);

That's it. Of course, some parts are missing. Like the whole database configuration and the actual definition of the get() call (e.g. where does res come from?). But I think, you get the idea.

When you run this code, the result is simple. Node crashes. You could give node more memory with --max-old-space-size=8000, for instance, but that's not a real solution.

As already mentioned, you could kind of force Sequelize to stream the data. Now, how would that look like?

var Readable = stream.Readable;
var i = 1;
var s = new Readable({
    async read(size) {
        const result = await sequelize.query(
            sql + ` LIMIT 1000000 OFFSET ${(i - 1) * 1000000}`, { type: sequelize.QueryTypes.SELECT });
        this.push(JSON.stringify(result));
        i++;
        if (i === 5) {
            this.push(null);
        }
    }
});
s.pipe(res);

In this example, I knew the number of rows I would get back from the database, hence the line with if (i === 5). It was just a test. You have to send null to end the stream. You could, of course, get the count of the whole result first and modify the code accordingly.

The whole idea behind this is to make smaller database calls and return the chunks with the help of the stream. This works, Node does not crash, but it still takes ages - almost 10 minutes for 3.5 GB.

Streaming with Sequelize

What's the alternative?

The MariaDB Node.js connector.

That's how a usual query would look like:

const mariadb = require('mariadb');
const pool = mariadb.createPool({ host: "HOST", user: "USER", password: "PASSWORD", port: 3308, database: "DATABASE", connectionLimit: 5 });
let conn = await pool.getConnection();
const result = await conn.query(sql);
res.send(result);

It's much faster. But let me jump right to the streaming code:

let conn = await pool.getConnection();
const queryStream = conn.queryStream(sql);
const ps = new stream.PassThrough();
const transformStream = new stream.Transform({
    objectMode: true,
    transform: function transformer(chunk, encoding, callback) {
        callback(null, JSON.stringify(chunk));
    }
});
stream.pipeline(
    queryStream,
    transformStream,
    ps,
    (err) => {
        if (err) {
            console.log(err)
            return res.sendStatus(400);
        }
    })
ps.pipe(res);

This may look a bit cryptic, but what happens here is, that you create a pipeline where you put stuff through. First, the queryStream which is the result of the database query. Then the transformStream to send the stringified chunks (only strings and buffers are allowed here, hence stringifying the object). And finally the PassThrough and a function for an error case.

With ps.pipe(res) you stream the result to the client.

And here's the result:
Streaming with MariaDB Node.js connector

Under 4 minutes for the same data and you won't even notice that Node needs a bit RAM.

So, if you're challenged with a similar task, think about streaming the data.

Or you convince your client, that this kind of requirement is unrealistic for the web.

P.S. Pagination was not an option. We needed the whole data at once.

Image created by brgfx on freepik.com.


But wait, there’s more!

Posted on by:

_patrickgod profile

Patrick God

@_patrickgod

Into code as long as I can remember. First games, then web, now both. Located in the sweet Taunus-region in Germany. Always eager to learn, create and teach something new.

Discussion

pic
Editor guide
 

What does the initial request from the browser look like? Is it just a single GET or long-poll or does this set up an event stream (aka server-sent events)? My first thought is the browser will timeout your request and there's going to be no way to prevent it so I'm curious to know how you keep that connection up

 

Hi Jordan,

Thanks for your question. Well, Chrome's default timeout did not get hit. And regarding Node, I changed the timeout with req.setTimeout(2400000);.
Hope this answers your question.

Take care,
Patrick

 

Great topic! Have you tested on a production server? Did the response time considerably increase?

 

Thanks Austin.

Haven't tested it yet on production, I'm afraid. This was just a little study. Replacing Sequelize completely would be more complex right now. But maybe we'll switch to the MariaDB connector on some places.

 

Good approach! Works as use Datagrams in java