Our website is growing, which is of course a very welcome trend, but there is a downside to it: the data in our database grows substantially, too. We changed our processes to periodically archive the data from our largest tables out of the main database, but that did not solve everything; in particular, our logical backup / restore procedures were becoming increasingly cumbersome.
Recently, we took a closer look at the problem, aiming to make creating our logical database dumps and restoring them fast again. And sure enough, there were places to optimize radically! Most of the changes had one thing in common: MySQL Shell.
Note: this post assumes a running MySQL server, simply because that is what we use ourselves. It should work for Percona Server, too; for others (MariaDB), it’s left for you to try…
What we use logical database dumps for
Logical dumps (i.e. exporting the data into a set of conveniently formatted text files that can be later imported back again) serve two main purposes in our project:
- together with binary logs, they serve as a "backup" backup data store in case the main restoration process (via ZFS snapshots) fails for some reason,
- they allow our developers to work with a subset of production data on their local machines.
The two purposes have naturally quite opposite requirements; for example, we definitely need a complete backup in case of a serious database incident but, at the same time, we are fine with just a reasonable subset of the data for local developer imports (we don’t want to waste resources on our developers’ machines). We used to deal with this issue by exporting several different database dumps with mysqldump, one for each purpose, e.g. one full backup and another partial one for the developer imports. This, of course, made our export runs even longer.
But we don’t have to do any of this any more thanks to MySQL Shell.
What is MySQL Shell
Compared to MySQL server itself, the Shell is a relatively new tool. In general, it is an advanced MySQL client and code editor. It supports scripting the server via SQL, Python and JavaScript out of the box. And it also comes with various utilities for dumping and importing data that we leveraged.
The biggest advantage of MySQL Shell dumps, compared to older approaches, is that the tool was designed with high performance in mind from the beginning:
- It uses multiple threads to export / import the data in parallel.
- It optionally divides larger tables into chunks of data that can also be imported in parallel.
- It can defer creating indexes until after the table data is imported.
- It compresses the exported data using zstd, a very fast compression algorithm by Facebook.
- A big benefit for us is also that the Shell allows importing a subset of the data from the dump selectively.
We will talk about all this below. A very nice intro to the performance of the Shell dumps / imports is also in this series of posts by Kenny Gryp. The benchmarks show that MySQL Shell outperforms all other solutions (specifically mysqldump, mysqlpump and mydumper) in most situations.
Installing MySQL Shell
There are official binaries available for download at the MySQL site, so if you use one of the supported platforms, the easiest option is just to grab the binary. For Ubuntu / Debian Linux users, there is also an official APT repository, as well as a YUM repository for RedHat / Oracle Linux. The repositories include the Shell and have the benefit of automatic version updates. For some other Linux distributions, such as Arch Linux, there are source packages to compile (note, though, that the compilation can take a few hours…). macOS and Windows binaries are available, too.
A particular limitation is that the Linux binaries only support Intel-compatible 64-bit architectures. Currently, you will not be able to install the Shell on ARM processors, so unfortunately it is a no-go on ARM-based Linux cloud instances.
Be sure to install the latest version of the Shell available. The tool is currently under heavy development and new features are added at a high pace.
Setting up the Shell
We recommend going through the options and configuring the Shell after installation. Shell options are stored in $HOME/.mysqlsh/options.json. For example, we have something like this in our Shell-ready accounts:
{
"history.autoSave": "true",
"history.maxSize": "10000",
"defaultMode": "sql",
"pager": "less -SFw"
}
- The first two lines switch on command history auto-saving.
- The next line ensures that the Shell will be started in the SQL mode, ready to accept SQL queries.
- And, finally, we set up a nice pager for the results, with horizontal scrolling, paging only longer outputs and some movement highlighting.
Also, if you like, don’t forget to set up the Shell prompt; we found it highly customizable and truly helpful.
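By the way, editing the JSON file by hand is not strictly necessary: the same settings can be persisted from inside an interactive Shell session with the \option command (shown here for two of the options above; the --persist flag writes them into options.json):

```
\option --persist defaultMode sql
\option --persist history.autoSave true
```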
Making a database dump
Our main backup script is now as simple as it gets. We run it periodically from our CRON:
#!/bin/bash
# Main script for exporting logical production db backups.
# It uses MySQL Shell to make an instance dump.
current_time=$(date +"%FT%T")
backup_dir="/mnt/backups/nere-db-backup-${current_time}"
# perform a full instance backup
mysqlsh -h localhost --js <<EOF
util.dumpInstance('${backup_dir}',
{ bytesPerChunk: '100M',
threads: 8,
compatibility: ['strip_definers'] })
EOF
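Scheduling it is then a single crontab entry; the path, schedule and log file below are made up for illustration:

```
# run the logical backup every night at 02:30
30 2 * * * /usr/local/bin/db-backup.sh >> /var/log/db-backup.log 2>&1
```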
First, it determines where to store the database dump; we use a simple timestamp-based naming scheme for the backups.
The main part is the MySQL Shell run. It uses the util.dumpInstance() utility to make a dump of all databases on our MySQL server instance. The tool creates a directory and exports all database objects into various files in it.
The options we use to create the dump are as follows:
- bytesPerChunk: We opt to divide larger tables into approximately 100-megabyte chunks of data (that is the size before compression). Chunks allow the import utility to load even a single table with multiple threads, i.e. much faster.
- threads: We use 8 threads to make the dump and recommend testing which number works best for you. Usually it’s a compromise between high export speed and the smallest possible negative impact on your production server.
- strip_definers: This compatibility option strips the DEFINER clause from views and other objects in the dump. This allows re-creating them even without having the same users and GRANTs on the target MySQL instance.
When the user making the dump has sufficient permissions, MySQL Shell uses a special sequence of locks to minimize locking issues during the export. Details can be found in the documentation under the consistent option; let us just state here that since we run the dump under the root account, we have never had any locking issues related to the dumps, even under heavier production load.
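If you are unsure whether your backup account has everything it needs, the dump utility accepts a dryRun option that performs all its checks (including the compatibility checks) without actually writing any table data. A quick sanity check from the Shell’s JavaScript mode might look like this (the output directory name is arbitrary):

```
util.dumpInstance('/tmp/dump-test', { dryRun: true })
```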
OK, how fast is the dump run?
This is the summary statistics for one of our exports made during normal server load:
Duration: 00:13:43s
Schemas dumped: 4
Tables dumped: 281
Uncompressed data size: 316.47 GB
Compressed data size: 23.02 GB
Compression ratio: 13.7
Rows written: 266414294
Bytes written: 23.02 GB
Average uncompressed throughput: 384.43 MB/s
Average compressed throughput: 27.97 MB/s
MySQL Shell exported the ~266 million rows in our 4 databases comprising ~300 GB of data in about 14 minutes. The compressed dump itself takes ~23 GB. That’s not bad!
Partial imports of the logical dumps
To allow our developers to load part of the production database into their local development MySQL instances, we need to be able to partially load data from the full backup. With the util.loadDump() utility that MySQL Shell provides, it is actually quite easy. The tool allows loading the whole dump, a single database, or even a particular table.
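For example, restoring a single table from a dump is a one-liner in the Shell’s JavaScript mode (the dump path and table name here are made up):

```
util.loadDump('/mnt/backups/some-dump',
              { includeTables: [ 'production_db.users' ] })
```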
Since some of the tables in our database are quite huge and the historical data in them is not that important for local developer instances, we took the Shell options one step further and crafted an import script that allows us to load only a given number of the latest chunks (the freshest data) of such tables. To explain that, we need to say a few words about how the MySQL Shell dumps are organized in files.
A brief dive into the dump files structure
MySQL Shell dumps are directories with the following types of files in them:
- dump metadata files - @.done.json, @.json and similar: these files contain information about all the schemas, their tables and the chunks that the dump contains. They also store the options that the dump was exported with, as well as SQL commands to re-create the database user accounts if needed.
- database metadata files - database_name.json, database_name.sql: these two files provide further information about the particular schema in the dump and a DDL SQL command with its definition.
- table metadata files - database_name@table_name.json, database_name@table_name.sql: again, these two files provide the metadata about the particular table as well as the SQL command to recreate the table structure. A similar pattern is used for other objects (views etc.), too.
- (chunked) table data files - database_name@table_name@@chunk_number.tsv.zst, database_name@table_name@@chunk_number.tsv.zst.idx: finally, these files contain the table data in the TSV format, compressed with zstd. If the data cannot be represented as simple text, it is Base64-encoded. (We are not sure what the .idx files are for; they seem to contain the length of the data in the main TSV file, encoded in a binary format.)
The nice thing about this structure is that it is easy to guess what each file does and everything is human-readable: it’s all just SQL, JSON and tab-separated values.
Partial dump data downloads
Before we can import the data into a local MySQL instance, we need to download the relevant part of the dump to the local machine. Our import script uses rsync with a set of pre-configured --include and --exclude options under the hood to do that. Overall, we typically download only about 2 GB of data out of the 23 GB dump.
After battling the partial imports for a while, we arrived at the following important requirements that MySQL Shell has in order to successfully load some data:
- All data files for the given table or schema (also mentioned in the metadata files) must exist in the local dump copy.
- All metadata files for the given table or schema must also be present.
- And finally, the metadata files for the whole dump must be always present in the local dump copy.
So, for example, if we wanted to import table T from database D, the tool needs to find all chunked TSV data files for table D.T, plus the metadata files for table D.T, plus the metadata for the whole dump. If we wanted to import the whole schema D, all data for the D.* tables, their metadata, the metadata of schema D and the whole dump metadata must be present. If the Shell does not find the required files, it considers the dump incomplete and unfortunately just says "No data loaded" and exits without further hints.
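To make this concrete, a minimal local copy for importing a single table D.T (hypothetical names) would contain at least these files, plus any other @.* metadata files present in the dump:

```
@.done.json           # whole-dump metadata
@.json
D.json                # schema metadata
D.sql
D@T.json              # table metadata
D@T.sql
D@T@@0.tsv.zst        # table data chunk(s)
D@T@@0.tsv.zst.idx
```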
Hacking the Shell for "latest chunks" imports
The above information would already be sufficient if all we wanted was to import a subset of tables from a schema with all their data, ignoring all other tables. We could rsync the relevant data and metadata files from the server and call the loadDump Shell utility with the includeTables option to load them.
However, for some huge tables we actually needed to load just the latest data, so we took a different approach and diverted a bit from the standard usage of the import tool.
The idea behind it is simple: what if we made the tool believe it has the full backup available? Perhaps we could download only those data chunks that we really need to import and fake the rest? And that is exactly how our partial imports script works.
Luckily, MySQL Shell does not cross-check the size of the table data files against its metadata records. Thus, we can create "fake" empty data files for those table chunks that we don’t want to import and then call the Shell loadDump utility to load the "whole" database.
Creating a zstd-compressed empty TSV file is easy:
echo -n '' | zstd > database_name@table_name@@chunk_number.tsv.zst
We also make up the accompanying .idx file (although we are not sure it’s actually needed) using binary zero values, like this:
echo -n -e '\x0\x0\x0\x0\x0\x0\x0\x0' \
> database_name@table_name@@chunk_number.tsv.zst.idx
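Putting the two commands together, a minimal sketch of the faking step might look like this (file names and the listing are hypothetical): given a listing of the full dump taken on the server, create an empty placeholder for every file that is missing from the local copy.

```shell
#!/bin/bash
set -e

dump_dir="local-dump"
mkdir -p "$dump_dir"

# toy listing standing in for the real server-side file listing
printf '%s\n' \
  'production_db@logs@@0.tsv.zst' \
  'production_db@logs@@0.tsv.zst.idx' > listing.txt

while IFS= read -r f; do
  # skip files we actually downloaded
  [ -e "$dump_dir/$f" ] && continue
  case "$f" in
    *.idx) # 8 binary zero bytes, mirroring the single-file command
           echo -n -e '\x0\x0\x0\x0\x0\x0\x0\x0' > "$dump_dir/$f" ;;
    *)     # an empty zstd-compressed TSV
           echo -n '' | zstd -q > "$dump_dir/$f" ;;
  esac
done < listing.txt
```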
So, how does the partial imports script work?
In principle, the script that performs partial imports of our dumps works as follows:
- It finds the latest backup directory on the server.
- It calculates which files it needs to download, based on a config file that lists the tables to exclude completely and the tables we only want the latest chunks of data from.
- It downloads the files using rsync. Now we have an incomplete local copy of the dump, with some of the data / metadata files missing. Such a dump cannot be loaded by the Shell yet.
- It grabs a full file listing of the original backup on the server and, for each file missing locally, creates a "fake" empty file using one of the commands mentioned above. Now we have a "complete" local copy of the dump, with some of the data / metadata files empty.
- It DROPs all objects (tables, views, etc.) that will be re-imported. MySQL Shell requires all loaded objects to be non-existent in the target database instance.
- Finally, it calls the MySQL Shell loadDump utility to import "the whole" database. It uses several optimizations while doing this that we’ll mention below.
The final command to load the data dump in MySQL Shell is similar to this:
mysqlsh -h localhost -u db_user --js <<EOF
util.loadDump('${local_backup_dir}',
{ threads: 4,
includeSchemas: [ 'production_db' ],
skipBinlog: true,
deferTableIndexes: 'all',
analyzeTables: 'on',
progressFile: '' })
EOF
Some of the options are particularly interesting here:
- threads - Again, the number of parallel threads is something to experiment with; it will differ for each machine.
- includeSchemas - We tell MySQL Shell to import only the given schema from our local "full backup".
- skipBinlog - This skips writing the import SQL queries to the binary log, should anyone have it enabled. It prevents unnecessarily slowing down the import and taking up disk space.
- deferTableIndexes - We observed the highest import speeds when we made the Shell create indexes after loading the data using this option.
- analyzeTables - This option makes the Shell call ANALYZE TABLE for all tables after the import is done. It should speed up using the re-imported tables a bit and does not slow down the import process much.
- progressFile - We don’t support temporarily interrupting the import process, so we don’t need to track its progress.
Speed optimizations during the data import
We’ve already mentioned some of the speed optimizations, but let’s recap them all:
Partial imports - we never download and import data that we actually don’t need on the target database instance, as we explained above.
Deferring index creation - we found that creating indexes after loading the data speeds up the whole import about 2×. But this depends on the table / index structure, so we always recommend testing both variants.
Disabling binary logs - all our developers have them disabled anyway but if not, they are switched off during the import procedure.
Disabling the Redo log - this is something we haven’t talked about yet. The Redo log is a disk-based log used during MySQL server crash recovery. Disabling it during the import speeds up the process tremendously, about 2-3×. But beware! Disabling this log makes your whole database server vulnerable to an irrecoverable loss of data if the server crashes. You should never use this option if you have important data on your database server, and you should always have a backup ready when using it.
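The toggle itself is a server-wide SQL statement, available since MySQL 8.0.21; a sketch of wrapping an import with it might look like this:

```sql
-- development instances only: data is lost irrecoverably on a crash!
ALTER INSTANCE DISABLE INNODB REDO_LOG;
-- ... run the util.loadDump() import here ...
ALTER INSTANCE ENABLE INNODB REDO_LOG;
```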
Summary
With all these optimizations in place, our typical local development import (~2 GB of compressed data in ~180 tables) runs in under 10 minutes on my own machine, excluding the download time. It creates ~15 GB of data and indexes in the MySQL server data directory.
We feel we have probably reached the limits of what is currently technically possible in logical dump speed optimization. Both the export and the import run in multiple parallel threads, and the throughput of each of them is maximized. The future of the local developer imports then lies in further reducing the amount of data imported. We are very happy with the current solution, as it also opens up totally new possibilities for the future, such as data anonymization during the imports.
Overall, we think the MySQL Shell utilities are great, and they now form an important part of our backup / restore strategy.
If you like reading stuff like this, you might want to follow us on Twitter.
Top comments (4)
Thanks
Thank you! This is the info I needed (which files I need to download from dumpInstance to load only some schemas/tables), and it took me forever to find it.
Nice article and nice read!
Have a question though. I assume your database tables have foreign key relations with each other. When you partially import data into your local development mysql instances, how did you resolve the issues with foreign key constraints?
For example, if table A foreign key references table B, taking the latest chunks of table A & B may not guarantee that all the necessary foreign key referenced rows exist in table B.
Do you have any tips for handling this when importing data into local development databases?
I have to make a confession here - we don't use foreign keys in our db, we rely on Rails validations only, so I never had to think about it but I can absolutely see the issue and thanks for the good question!
Two options come to my mind:
- You could amend the DDL SQL commands in the dump (the database@table.sql files inside the backup directory) to remove the constraint(s) just before starting the re-import. That way you could be able to selectively switch off the referential integrity where needed. This sure is a compromise but in a development db it might not be that much of an issue.
- You could analyze the chunk data files (the database@table@chunk.tsv files) before re-import to reveal the actual relationships between table chunks. E.g. if the last record of chunk 10 of table A references row 1234 from table B, you might be able to find which chunk of table B has that row and ensure importing table B up to that chunk. This sounds quite tedious and assumes the referential data mostly just grows and is not updated in wilder ways, though…

So, no easy solution here, I guess. Good luck and please get back to me if you manage to make it a success!