Fetching Millions of Rows with Streams in Node.js

#tutorial #node #webdev #beginners

Have you ever had to face the challenge of fetching several million rows of data with lots of columns from a database and display them on the web?

Well, I had to do this recently. Although I was not convinced that this would make any sense at all, I still had to do it.

Anyways, here's the solution.

But first, the technology stack: Node.js, Sequelize & MariaDB. The client doesn't matter, because at times the data was almost 4 GB big, so Chrome crashed anyway eventually.

Sequelize was the first big problem.

The solution to getting such big data results is streaming. Receiving that data with one big call led to Node crashing. So streaming is the answer, and Sequelize did not support streaming at that point.

Well, I was able to manually add streaming to Sequelize, but the service call takes ages in the end.

Here's a snippet of a usual Sequelize call:



await sequelize.authenticate();
const result = await sequelize.query(sql, { type: sequelize.QueryTypes.SELECT });
res.send(result);

That's it. Of course, some parts are missing. Like the whole database configuration and the actual definition of the get() call (e.g. where does res come from?). But I think, you get the idea.

When you run this code, the result is simple. Node crashes. You could give node more memory with --max-old-space-size=8000, for instance, but that's not a real solution.

As already mentioned, you could kind of force Sequelize to stream the data. Now, how would that look like?



var Readable = stream.Readable;
var i = 1;
var s = new Readable({
    async read(size) {
        const result = await sequelize.query(
            sql + ` LIMIT 1000000 OFFSET ${(i - 1) * 1000000}`, { type: sequelize.QueryTypes.SELECT });
        this.push(JSON.stringify(result));
        i++;
        if (i === 5) {
            this.push(null);
        }
    }
});
s.pipe(res);

In this example, I knew the number of rows I would get back from the database, hence the line with if (i === 5). It was just a test. You have to send null to end the stream. You could, of course, get the count of the whole result first and modify the code accordingly.

The whole idea behind this is to make smaller database calls and return the chunks with the help of the stream. This works, Node does not crash, but it still takes ages - almost 10 minutes for 3.5 GB.

What's the alternative?

The MariaDB Node.js connector.

That's how a usual query would look like:



const mariadb = require('mariadb');
const pool = mariadb.createPool({ host: "HOST", user: "USER", password: "PASSWORD", port: 3308, database: "DATABASE", connectionLimit: 5 });
let conn = await pool.getConnection();
const result = await conn.query(sql);
res.send(result);

It's much faster. But let me jump right to the streaming code:



let conn = await pool.getConnection();
const queryStream = conn.queryStream(sql);
const ps = new stream.PassThrough();
const transformStream = new stream.Transform({
    objectMode: true,
    transform: function transformer(chunk, encoding, callback) {
        callback(null, JSON.stringify(chunk));
    }
});
stream.pipeline(
    queryStream,
    transformStream,
    ps,
    (err) => {
        if (err) {
            console.log(err)
            return res.sendStatus(400);
        }
    })
ps.pipe(res);

This may look a bit cryptic, but what happens here is, that you create a pipeline where you put stuff through. First, the queryStream which is the result of the database query. Then the transformStream to send the stringified chunks (only strings and buffers are allowed here, hence stringifying the object). And finally the PassThrough and a function for an error case.

With ps.pipe(res) you stream the result to the client.

And here's the result:

Under 4 minutes for the same data and you won't even notice that Node needs a bit RAM.

So, if you're challenged with a similar task, think about streaming the data.

Or you convince your client, that this kind of requirement is unrealistic for the web.

P.S. Pagination was not an option. We needed the whole data at once.

Image created by brgfx on freepik.com.

But wait, there’s more!

Let’s connect on Twitter, YouTube, LinkedIn or here on dev.to.
Get the 5 Software Developer’s Career Hacks for free.
Enjoy more valuable articles for your developer life and career on patrickgod.com.

Top comments (4)

Patrick God • Mar 6 '20

Hi Jordan,

Thanks for your question. Well, Chrome's default timeout did not get hit. And regarding Node, I changed the timeout with req.setTimeout(2400000);.
Hope this answers your question.

Take care,
Patrick

Austin Felipe • Feb 13 '20

Great topic! Have you tested on a production server? Did the response time considerably increase?

Patrick God • Feb 14 '20

Thanks Austin.

Haven't tested it yet on production, I'm afraid. This was just a little study. Replacing Sequelize completely would be more complex right now. But maybe we'll switch to the MariaDB connector on some places.