Ali Amjad

Posted on Jun 11, 2022

MySQL's Data Streaming : What is it & How it works ?

#mysql #node #database #datascience

In the article, I will be discussing one of the most interesting features of MySQL that you gonna need all the time and discuss how it works in the background. So let's begin 😋

I - Introduction to Data Streaming

So let's begin with What is Data? , it's information that has been translated into a form that is efficient for movement or processing. Relative to today's computers and transmission media, data is information converted into binary digital form.
Based on statistics, the amount of text data created every day across the globe is more than 18 billion. So data is being created and transferred between computers all across the globe all the time.
Data can be transferred in multiple methods, to understand the data streaming transfer, we need to understand the traditional way and learn why we need something like streaming.

A common way to transfer a file is that all the packets must arrive at the destination to be reassembled and then reach their destination, an example is when you send an image and you need all bits of this image to display it thus it must be delivered first.

However, if you are waiting for a video to load, are you waiting for the full video to download and play? Well not all of us, that's why we have streaming services like Youtube, Netflix, and others and you can start playing any of them right away, and that's where a streaming idea comes to play.

What makes streams unique, is that instead of a program reading a file into memory all at once like in the traditional way, streams read chunks of data piece by piece, processing its content without keeping it all in memory.
Instead of waiting for the 100 MB video to load, we can get it to chunk by chunk and load each 1MB as they are consumed from the stream and start displaying it right away.

II - What do we mean by Database's Data Streaming?

As we have a clear understanding of what we mean by data streaming, and how it will help us in the database world is our next topic.

Let's say we have a table from SQL that has some data in it ( roughly 1000 ) and you make a select statement to retrieve them for some O(N) calculation, it's fast enough that you don't feel any delay for your report.

However, let's say we have 1B rows ... your select statement takes enough time and your O ( N ) calculation is taking extra time as it has to wait for the rows to be retrieved and then start going through all of them to do the calculations.

Now the fun part, how can we improve this? Yes, you got it right, Let's see how streams for this select statement will help us.
Instead of waiting for 1 Billion rows, let's get a row one by one so when a row is fetched by the DB, We make some calculations on it right away as they are a chunk of this data and we process chunk by chunk, and then we send it back to the user before receiving next chunk.

By the end of this fetching, you will have a 1B calculated rows that are sent back to the user and the user didn't wait at all, and We only optimized an O ( N ) calculation, this will hugely improve your more complex calculations.

Database Streaming will reduce the user waiting time and optimize your calculations on the huge amount of data.

Note: This is the explanation for our Article only as we get deeper in the next sections, Database streaming also has a huge impact on the Data science of creating Pipelines and data lakes but that's for another article

III - Cases you can solve with MySQL's Data Streaming

Streaming your database data can have many applications. I usually use it for calculations and reports required by the business that requires calculations on Billions of rows and must be done in a fast way.

Maybe you are creating an Excel File for big data, or a PDF for a huge report then inserting one into them as they are fetched is way faster.

You may want to have a video player and store the binary data in a database and you can stream it back to the user. You may have a gallery that images fetched from DB, you can display an image by image or much more applications.

If you are a data scientist and trying to create a Pipeline by yourself to migrate data between two databases, then you can stream them daily to be in sync, Or if you are looking for a data lake to change the data then you can modify chunk by chunk while streaming the data.

In short, you can improve your current structure to be much faster for any case.

IV - Introduction to Knex.js (NPM package for Node.js)

Before coding our streamer, we will be using the Knex.js library as our query builder.

Knex.js (You know they also pronounce the "K" in the front? lol) is a "batteries included" SQL query builder for PostgreSQL, CockroachDB, MSSQL, MySQL, MariaDB, SQLite3, Better-SQLite3, Oracle, and Amazon Redshift designed to be flexible, portable, and fun to use.

They provide a beautiful way to use SQL in your Node.js, You can refer to their official documentation to know more about this amazing product.

You can check the documentation Here: Knex.js Documentation
and You can check the Streaming Documentation: Knex.js Stream Documentation

One of the Knex.js features is Streams for MySQL, You can stream your query and modify it with javascript in a very easy way.

However, Knex uses streams that are provided by original Mysql npm for node.js which is made by Felix Geisendörfer and his amazing team that made it easier for Knex to use their streaming into their library. We will discuss how the Mysql package achieved this in later sections.

V - Introduction to Streaming in Node.js

One last thing worth mentioning is the node.js Streaming module, for the implementation we will be doing in the next section, node.js streaming has also its role in it as well as the functionality from MySQL provides, so let's briefly explain what is node.js stream?

The Stream module is a native module that is shipped by default in Node.js. The Stream is an instance of the EventEmitter class, which handles events asynchronously in Node.js. Due to their superclass, streams are inherently event-based.

There are 4 types of streams in Node.js:

Writable: Used to write data sequentially
Readable: Used to read data sequentially
Duplex: Used to both read and write data sequentially
Transform: Where data can be modified when writing or reading. Take compression for an example, with a stream like this you can write compressed data and read decompressed data.

This is briefly about the node.js streaming module, for more information you can read all about streams at Node.js Official documentation.

VI - Implementation of MySQL's Data Streaming with Node.js

In this section, we will be coding our streamer with the Knex.js Package, so let's begin right away.

As a side note, Node.js and Knex.js package basic usage knowledge is required because I will be focusing on the stream only throughout the coding.

First, I will be creating a file called "stream.js" with an async function called "sample" that gonna be our sample in this article.



const database = require('./database/connection');


 async function sample(){

    console.log("Hi , This is a sample function");

 }

 sample();

we have knex mysql connection at top , and i can run this file with "node stream.js"

Then , I will be creating a table from SQL so that we can write queries on .. i will quickly write migration for it and add some test data into it.



CREATE TABLE `sample` (
  `id` int unsigned NOT NULL AUTO_INCREMENT,
  `name` varchar(255) NOT NULL DEFAULT '',
  PRIMARY KEY (`id`)
)

I did add some dummy data into this database , around 3000 records



SELECT COUNT(*) FROM sample; //returns 3000

Now i will use Knex.js Stream function to modify each one as they are fetched



const database = require("./database/connection");

async function sample() {
  console.log("Started At :", new Date().toISOString());
  await database("sample")
    .select()
    .stream((stream) => {
      stream.on("data", (row) => {
        console.log(row.name, `At : ${new Date().toISOString()}`);
      });
    });
    console.log("Ended At :", new Date().toISOString());
}

sample();

Running this code will print the rows as they are fetched from the DB with the exact date start and end date, you can make the difference between them.

What we did, we simply write a select * for this table, and then we used the .stream function provided by knex.js, then we are listening on "data" which indicates when each row arrived, there are other events like on Error to handle the error occurrence in the stream.

This is an example of the output :

VII - How MySQL's Data Streaming works and How it's implemented with Node.js? 😱

Finally, let's discuss how this works in the background and how this stream works behind the scenes.

First, Knex.js is a Query builder that only provides a stream interface, in other meaning that they have added another layer on the top of a feature to make it easier to be used by programmers and that's why it's kinda difficult to know how it works behind the scenes from Knex.js Documentation.

Originally Stream feature comes from the original MySQL-node client package which knex.js is depending on for MySQL. the MySQL package does provide a brief on how it's working in their documentation, You can read it from the MySQL stream.

However to this point we have explained, it's unclear if it's native MySQL functionality or something made possible with node.js only.

So let's dive deeper to see how this part is coded and we could get a lead from there.

Mysql for node.js is an open-source package, so you can visit how it's made through their GitHub repository , after wandering around for some time, you will get some leads from their implementation.

if you look into this file where the code relies on , you see they have used a MySql's Text Protocol called "COM_QUERY" to make this work, so let's dive more deeply into what's this protocol do?

COM_QUERY (SELECT statement message parsing) is one of the MySQL communication Text Protocols, let's focus on what do we mean by Text Protocol and let's compare it to Binary Protocol :
The difference is really whether the protocol is oriented around data structures or text strings, for example, HTTP is a text protocol, even though when it sends a jpeg image, it just sends the raw bytes, not a text encoding of them.
So basically by Text protocols in MySQL, we can send and receive data without any encodings and the benefit of Com Query is that we can parse the text to extract our needs.
You can get a list of MySql's Communication Protocols, and a List of MySQL Text Protocols at their official Documentation.

Back to COM_QUERY, Let's get into more advanced details on how it works :

Step 1) Client Command or Client Side:

A COM_QUERY is used to send the server a text-based query that is executed immediately, in other meaning that when you provide this "Select *" and chain it to the Stream function, it will send this query to the server and then start executing right away.

Step 2) The MySQL server responds with 4 packet possibilities

1- If there is an execution error, such as SQL syntax error, return the ERR package

2- If the execution is successful, but no data is found, return the OK package

3- If the client executes load data local infile 'filename' into table

, LOCAL_INFILE_REQUEST is returned.

4- If a result set ( means there was some data ) is returned, the delivered package is Resultset.

Step 3) Let's focus on the Result Set

The Result set means the Data is Found, which means we are getting a sequence of packets.

The Result Set is the combination of two parts that we receive,

First is the column definitions, which contain information about the columns and the data types and schema details in general.
Second is the Rows, Each Row is a packet and we get the rows as packets.

The structure of the ResultSet response packet is as follows:

1 - The first packet: is the column length packet.

2 - Followed by n field description packets, each field description is a packet, After all, field information descriptions, an EOF packet or OK packet will be sent as the separator between the field definition and the data (Row).then the OK packet is returned

3 - Next is the line data packet, one data packet per line, including the packet header and the message body.

4 - The final end packet, which may also be an EOF or OK packet

In short and more generally speaking, after Com Query has been executed ... we receive a sequence of data packets that we each of them is a row.

So let's focus on the MySQL package and how they have used that, after providing a string query, they will send a com query to the server ... they receive the row packets and for each packet, they parse it to extract the row that is fetched and they use Node.js streams to emit them as events right away and you listen on "Data" to get the rows.

Well, that was a lot of theoretical explanation xD, But now we have a better understanding that it's working because of a MySql protocol.

VIII - Database Data Streaming with Other Databases and Languages

As for the Databases, I am not sure how many other databases support this but my research and playground was only MySQL.

As for other Languages than Node.js, this is only something made friendlier with the node.js streams module ... this is achievable with other languages as well, I have seen Java examples doing the same thing.

IX - Conclusion

In Article we discussed A lot About Streams, Now you should have some level of understanding of what streams are in databases. We have provided some examples and a deep explanation of what happens in the background.
Please refer back to MySQL Documentation or any links provided above to learn more about this.

X - Thank You

Thanks for reading this article, this is one of the Articles I enjoyed when I was writing it.
hopefully, you gained new information when you read through it.

You can Go through my Account for more articles.

DEV Community

MySQL's Data Streaming : What is it & How it works ?

Top comments (0)

Read next

AI System Learns How Objects Interact in Images to Generate New Scenes with Same Relationships

AI Breakthrough: Context Analysis Boosts Visual Puzzle-Solving Accuracy to 76%

Node.js + ioredis + elasticache

Integrating Google Calendar API in Node.JS: A Guide to Event Creation and Meeting Scheduling