Last week, ThoughtWorks released it's latest edition of the Technology Radar. One of the new entries to the platform section was DuckDB. This new DB sounded interesting, so I decided to check it out.
What is DuckDB
ThoughtWorks describes it as
DuckDB is an embedded, columnar database for data science and analytical workloads. Analysts spend significant time cleaning and visualizing data locally before scaling it to servers. Although databases have been around for decades, most of them are designed for client-server use cases and therefore not suitable for local interactive queries. To work around this limitation analysts usually end up using in-memory data-processing tools such as Pandas or data.table. Although these tools are effective, they do limit the scope of analysis to the volume of data that can fit in memory. We feel DuckDB neatly fills this gap in tooling with an embedded columnar engine that is optimized for analytics on local, larger-than-memory data sets.
Similar to SQLite, it's a relational database, that supports SQL, without the necessity of installing and managing an SQL server. Additionally, it is optimized to be super-fast, even with large datasets, that don't fit in memory.
Test data creation
To test a database, first you need some data. So I created a python script and used Faker to create the following CSV files:
persons.csv (10.000 rows) id,name,street,city,email 1,Ronald Montgomery,300 Smith Heights Apt. 722,Shannonview,email@example.com books.csv (10.000 rows) 1,978-0-541-64306-5,Exclusive systemic knowledge user,1,27.31 orderItems (1.000.000 rows) id,person_id,book_id,quantity,date 1,7001,47034,3,2020-08-16
Installation of DuckDB
In order to use it, you have to install the DuckDB library. This is done using
pip install duckdb==0.2.2
For the test, I defined the following task: Create a CSV file, that contains the total amount of the sold books (quantity * price) per person category.
This is the code to solve this task
import duckdb from time import time start = time() # Connect to database. # If no filename is specified, the db will be created in memory conn = duckdb.connect() # Create tables and load data from CSV files conn.execute("CREATE TABLE persons as Select * from read_csv_auto ('persons.csv')") conn.execute("CREATE TABLE books as Select * from read_csv_auto ('books.csv')") conn.execute("CREATE TABLE orderItems as Select * from read_csv_auto ('orderItems.csv')") # Execute the query to get the result and use copy to export it as CSV file conn.execute("""copy (SELECT category, round(sum(quantity * price), 2) amount FROM orderItems inner Join persons on person_id = persons.id inner Join books on book_id = books.id group by category order by category) to 'result.csv' (HEADER)""") # Print execution time print("Executed in ", time() - start)
The execution time is around 2 seconds on my PC and the result file looks like this:
category,amount 1,13203562.05 2,13120658.42 3,12378199.17 4,12183193.4 5,13450846.14 6,13111841.91 7,12438200.33 8,12750379.26 9,12881481.69 10,12118417.6
So what do I think about DuckDB after this quick test? I have to say, I really like it. I've worked with SQL for a long time and thanks to DuckDB, I can reuse this skill to wrangle with data. I can work in memory and seamless switch to using a database file, if the data exceeds memory.
What do you think? Ready to give DuckDB a try? BTW: It also plays nice with pandas too.
Top comments (0)