DuckDB: an embedded DB for data wrangling

#python #datascience #duckdb #sql

Last week, ThoughtWorks released it's latest edition of the Technology Radar. One of the new entries to the platform section was DuckDB. This new DB sounded interesting, so I decided to check it out.

What is DuckDB

ThoughtWorks describes it as

DuckDB is an embedded, columnar database for data science and analytical workloads. Analysts spend significant time cleaning and visualizing data locally before scaling it to servers. Although databases have been around for decades, most of them are designed for client-server use cases and therefore not suitable for local interactive queries. To work around this limitation analysts usually end up using in-memory data-processing tools such as Pandas or data.table. Although these tools are effective, they do limit the scope of analysis to the volume of data that can fit in memory. We feel DuckDB neatly fills this gap in tooling with an embedded columnar engine that is optimized for analytics on local, larger-than-memory data sets.

Similar to SQLite, it's a relational database, that supports SQL, without the necessity of installing and managing an SQL server. Additionally, it is optimized to be super-fast, even with large datasets, that don't fit in memory.

Test drive

Test data creation

To test a database, first you need some data. So I created a python script and used Faker to create the following CSV files:

persons.csv (10.000 rows)
id,name,street,city,email
1,Ronald Montgomery,300 Smith Heights Apt. 722,Shannonview,arellanotyler@ramirez.com

books.csv (10.000 rows)
1,978-0-541-64306-5,Exclusive systemic knowledge user,1,27.31

orderItems  (1.000.000 rows)
id,person_id,book_id,quantity,date
1,7001,47034,3,2020-08-16

Installation of DuckDB

In order to use it, you have to install the DuckDB library. This is done using pip install duckdb==0.2.2

The test

For the test, I defined the following task: Create a CSV file, that contains the total amount of the sold books (quantity * price) per person category.

This is the code to solve this task

import duckdb
from time import time
start = time()

# Connect to database. 
# If no filename is specified, the db will be created in memory
conn = duckdb.connect()

# Create tables and load data from CSV files
conn.execute("CREATE TABLE persons as Select * from read_csv_auto ('persons.csv')")
conn.execute("CREATE TABLE books as Select * from  read_csv_auto ('books.csv')")
conn.execute("CREATE TABLE orderItems as Select * from  read_csv_auto ('orderItems.csv')")

# Execute the query to get the result and use copy to export it as CSV file
conn.execute("""copy (SELECT category, round(sum(quantity * price), 2) amount FROM orderItems 
inner Join persons on person_id = persons.id 
inner Join books on book_id = books.id
group by category
order by category) to 'result.csv' (HEADER)""")

# Print execution time
print("Executed in ", time() - start)

The execution time is around 2 seconds on my PC and the result file looks like this:

category,amount
1,13203562.05
2,13120658.42
3,12378199.17
4,12183193.4
5,13450846.14
6,13111841.91
7,12438200.33
8,12750379.26
9,12881481.69
10,12118417.6

Summary

So what do I think about DuckDB after this quick test? I have to say, I really like it. I've worked with SQL for a long time and thanks to DuckDB, I can reuse this skill to wrangle with data. I can work in memory and seamless switch to using a database file, if the data exceeds memory.

What do you think? Ready to give DuckDB a try? BTW: It also plays nice with pandas too.