
Fetching Millions of Rows with Streams in Node.js

Patrick God ・ Originally published at patrickgod.com ・ 3 min read

Have you ever faced the challenge of fetching several million rows of data, with lots of columns, from a database and displaying them on the web?

Well, I had to do this recently. Although I was not convinced that this would make any sense at all, I still had to do it.

Anyway, here's the solution.

But first, the technology stack: Node.js, Sequelize & MariaDB. The client doesn't matter much, because at times the data was almost 4 GB in size, so Chrome eventually crashed anyway.

Sequelize was the first big problem.

The solution to handling such big result sets is streaming. Fetching all of that data with one big call made Node crash. So streaming is the answer, but Sequelize did not support streaming at that point.

Well, I was able to add streaming to Sequelize manually, but in the end the service call still takes ages.

Here's a snippet of a usual Sequelize call:

// Run one big SELECT and send the entire result at once
await sequelize.authenticate();
const result = await sequelize.query(sql, { type: sequelize.QueryTypes.SELECT });
res.send(result);

That's it. Of course, some parts are missing, like the whole database configuration and the actual definition of the get() call (e.g. where does res come from?). But I think you get the idea.

When you run this code, the result is simple: Node crashes. You could give Node more memory with --max-old-space-size=8000, for instance, but that's not a real solution.

As already mentioned, you could kind of force Sequelize to stream the data. Now, what would that look like?

const { Readable } = require('stream');

let i = 1;
const s = new Readable({
    async read(size) {
        // Fetch the next chunk of one million rows
        const result = await sequelize.query(
            sql + ` LIMIT 1000000 OFFSET ${(i - 1) * 1000000}`,
            { type: sequelize.QueryTypes.SELECT });
        this.push(JSON.stringify(result));
        i++;
        if (i === 5) {
            // Pushing null signals the end of the stream
            this.push(null);
        }
    }
});
s.pipe(res);

In this example, I knew the number of rows I would get back from the database, hence the check if (i === 5). It was just a test. You have to push null to end the stream. You could, of course, get the count of the whole result first and modify the code accordingly.

The whole idea behind this is to make smaller database calls and return the chunks with the help of the stream. This works, Node does not crash, but it still takes ages - almost 10 minutes for 3.5 GB.
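The count-first variant could be sketched like this (a minimal sketch; chunkCount and the commented COUNT(*) query are illustrative names I made up, not code from the original service):

```javascript
// Hypothetical sketch: query the total row count once, then derive how
// many LIMIT/OFFSET chunks the Readable has to push before it pushes null.
const CHUNK_SIZE = 1000000;

function chunkCount(totalRows) {
    // Number of LIMIT/OFFSET reads needed to cover all rows
    return Math.ceil(totalRows / CHUNK_SIZE);
}

// The count itself could come from a wrapped COUNT(*) query, e.g.:
//   const [{ total }] = await sequelize.query(
//       `SELECT COUNT(*) AS total FROM (${sql}) AS sub`,
//       { type: sequelize.QueryTypes.SELECT });
//   const chunks = chunkCount(total);

console.log(chunkCount(3500000)); // 4 chunks of one million rows each
```

Inside the Readable's read(), you would then push null once the chunk counter reaches that value instead of hard-coding 5.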

[Screenshot: Streaming with Sequelize]

What's the alternative?

The MariaDB Node.js connector.

This is how a usual query looks:

const mariadb = require('mariadb');

const pool = mariadb.createPool({
    host: "HOST",
    user: "USER",
    password: "PASSWORD",
    port: 3308,
    database: "DATABASE",
    connectionLimit: 5
});

let conn = await pool.getConnection();
const result = await conn.query(sql);
res.send(result);

It's much faster. But let me jump right to the streaming code:

const stream = require('stream');

let conn = await pool.getConnection();

// queryStream emits one row object at a time instead of buffering everything
const queryStream = conn.queryStream(sql);

const ps = new stream.PassThrough();

// Turn each row object into a JSON string (streams only accept strings and buffers)
const transformStream = new stream.Transform({
    objectMode: true,
    transform: function transformer(chunk, encoding, callback) {
        callback(null, JSON.stringify(chunk));
    }
});

stream.pipeline(
    queryStream,
    transformStream,
    ps,
    (err) => {
        if (err) {
            console.log(err);
            return res.sendStatus(400);
        }
    });

ps.pipe(res);

This may look a bit cryptic, but what happens here is that you create a pipeline that everything passes through. First the queryStream, which is the result of the database query. Then the transformStream, which forwards the stringified chunks (only strings and buffers are allowed here, hence stringifying the objects). And finally the PassThrough and a callback for the error case.

With ps.pipe(res) you stream the result to the client.

And here's the result:

[Screenshot: Streaming with MariaDB Node.js connector]

Under 4 minutes for the same data, and you will hardly notice that Node needs a bit of RAM.

So, if you're challenged with a similar task, think about streaming the data.

Or you convince your client that this kind of requirement is unrealistic for the web.

P.S. Pagination was not an option. We needed the whole data at once.

Image created by brgfx on freepik.com.



Discussion

Jordan Brennan

What does the initial request from the browser look like? Is it just a single GET, a long-poll, or does this set up an event stream (aka server-sent events)? My first thought is that the browser will time out your request and there will be no way to prevent it, so I'm curious how you keep that connection up.

Patrick God (Author)

Hi Jordan,

Thanks for your question. Well, Chrome's default timeout did not get hit. And regarding Node, I changed the timeout with req.setTimeout(2400000);.
Hope this answers your question.

Take care,
Patrick

Austin Felipe

Great topic! Have you tested it on a production server? Did the response time increase considerably?

Patrick God (Author)

Thanks Austin.

Haven't tested it on production yet, I'm afraid. This was just a little study. Replacing Sequelize completely would be too complex right now. But maybe we'll switch to the MariaDB connector in some places.

Juan Rodriguez

Good approach! Works like using Datagrams in Java.