This post is an adaptation of the one I originally published in the Orchest blog. Enjoy!
This post is the first part of our series “SQL on Python”, in which we will explore different Python libraries that help you manipulate and query your data using SQL or a SQL-inspired syntax.
Why SQL, after all?
SQL (short for Structured Query Language, standardized as ISO/IEC 9075) was originally designed in the 1970s for managing relational databases, but nowadays it is used for analytics workloads as well.
SQL has lots of benefits for analytics, to name a few:
- It’s easy to pick up: SQL is a domain-specific language, rather than a general-purpose language, and as such it has more limited scope and fewer syntax elements to learn.
- It’s everywhere: SQL is a family of query languages available in many systems, all of which share a common core. Once you learn one SQL dialect (PostgreSQL, SQL Server, Google Standard SQL, and so on), you can transfer most of your skills to the others.
- It’s fast: SQL is a statically typed language, which allows query planning systems to perform sophisticated optimizations. This, along with decades of accumulated knowledge about relational databases, gives SQL implementations performance that is difficult to beat.
However, if you are used to the Python or R ecosystems (pandas, Polars, data.table, dplyr), you are probably spoiled by how easy it is to download a CSV or Parquet file from somewhere, launch a Python or R process, read it, and start querying and manipulating it.
Comparatively, this bootstrapping process is a bit more tedious with SQL: assuming you have, say, a local PostgreSQL database up and running and a CSV file, you would need to create a table with the appropriate schema, import the data using COPY, and hope that there are no inconsistencies, missing data, or weird date formats. If the file happened to be Parquet, you would need to work a bit more.
To try to make the process a bit more lightweight, you could try to convert your CSV or Parquet to SQLite, a widely available, in-process SQL database. However, SQLite was designed with transactional use cases in mind, and therefore might not scale well with some analytical workloads.
In summary: SQL is appealing, but the boilerplate not so much. What if you could run SQL for your analytics workloads without having to configure a database, just by importing a module in your Python or R process, and make your queries blazing fast? What if, rather than having to choose between Python or SQL, you could use both?
https://twitter.com/anyfactor/status/1551650476651081729
Enter DuckDB
DuckDB is an open source (MIT) high-performance, in-process SQL database for analytics. It is a relatively new project (the first public release was in June 2019), but got tremendously popular in a short period of time.
DuckDB’s popularity is growing (we like this image so much)
DuckDB can read data from different sources:
- From CSV or Parquet files
- From pandas DataFrame or Arrow Table objects in the process memory
- From PostgreSQL tables (by reading the binary data directly!)
Some DuckDB operations have out-of-core capabilities (similar to Vaex or the new streaming mode of Polars), which means it can process data that is larger than RAM!
Finally, DuckDB offers some additions on top of standard SQL that make it very pleasant to use, for example friendlier error messages or, behold, trailing commas!
Trying out DuckDB
For this example, we will use a dataset containing all mentions of climate change on Reddit before September 2022 obtained from Kaggle. Our generic goal is to understand the sentiment of these mentions.
I have published an Orchest pipeline that contains all the necessary files so you can run these code snippets on JupyterLab easily: the first step downloads the data using your Kaggle API key, and the second step performs some exploratory analysis.
First steps with DuckDB
You can install DuckDB with conda/mamba or pip:
mamba install -y "python-duckdb=0.5.1"
# Or, alternatively, with pip
# pip install "duckdb==0.5.1"
The first step to start using DuckDB is creating a connection object. This mimics the Python Database API 2.0, which is also implemented by other projects like sqlite3 and psycopg2:
import duckdb
conn = duckdb.connect()
By default, duckdb.connect() will return a connection to an in-memory database, which is perfectly fine for reading data from external files. In fact, you can run a SQL query directly on the CSV file straight away!
In [3]: conn.execute("""
...: SELECT COUNT(*)
...: FROM '/data/reddit-climate/the-reddit-climate-change-dataset-comments.csv'
...: """).fetchall()
Out[3]: [(4600698,)]
As you can see, the comments CSV file contains 4.6+ million rows. This took about 50 seconds on an Orchest instance though, which is not very impressive for just a COUNT(*) operation. What about converting the CSV to Parquet, as we did in my blog post about Arrow? This time, we can use DuckDB for that:
import os

csv_files = !ls /data/reddit-climate/*.csv
for filename in csv_files:
    print(f"Reading {filename}...")
    destination_file = os.path.splitext(filename)[0] + ".parquet"
    if os.path.isfile(destination_file):
        continue
    conn.execute(f"""
        COPY (SELECT * FROM '{filename}')
        TO '{destination_file}' (FORMAT 'parquet')
    """)
And now, let’s repeat the query on the Parquet file:
In [8]: %%timeit
...: conn.execute("""
...: SELECT COUNT(*)
...: FROM '/data/reddit-climate/the-reddit-climate-change-dataset-comments.parquet'
...: """).fetchall()
234 ms ± 12.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Approximately a 200x speedup over the same operation using CSV! That is a better baseline for running the rest of the queries.
Querying Parquet files with DuckDB
Since you will be referring to the same file several times, it’s a good moment to create a view for it. This will allow you to query the Parquet file without copying all the data to memory:
conn.execute("""
CREATE VIEW comments AS
SELECT * FROM '/data/reddit-climate/the-reddit-climate-change-dataset-comments.parquet'
""")
Next, let’s find out which subreddits had the most number of comments about climate change:
In [11]: conn.query("""
...: SELECT
...: "subreddit.name" AS subreddit_name,
...: COUNT(*) AS num_comments,
...: FROM comments
...: GROUP BY subreddit_name
...: ORDER BY num_comments DESC
...: LIMIT 10
...: """).fetchall()
Out[11]:
[('politics', 370018),
('worldnews', 351195),
('askreddit', 259848),
('collapse', 94696),
('news', 94558),
('futurology', 89945),
('science', 71453),
('environment', 70444),
('canada', 66813),
('australia', 60239)]
Unsurprisingly, /r/politics, /r/worldnews, and /r/collapse were among the subreddits with the largest number of comments about climate change.
What about the overall sentiment of those comments?
In [12]: conn.query("""
...: SELECT
...: AVG(sentiment) AS average_sentiment,
...: STDDEV(sentiment) AS stddev_sentiment,
...: FROM comments
...: """).fetchall()
Out[12]: [(-0.005827451977706203, 0.6581439484369691)]
In [13]: conn.query("""
...: SELECT
...: "subreddit.name" AS subreddit_name,
...: COUNT(*) AS num_comments,
...: AVG(sentiment) AS average_sentiment,
...: STDDEV(sentiment) AS stddev_sentiment,
...: FROM comments
...: WHERE subreddit_name IN (
...: SELECT "subreddit.name" AS subreddit_name
...: FROM comments
...: GROUP BY subreddit_name
...: ORDER BY COUNT(*) DESC
...: LIMIT 10
...: )
...: GROUP BY subreddit_name
...: ORDER BY num_comments DESC
...: """).fetchall()
Out[13]:
[('politics', 370018, -0.018118589649651674, 0.6600297061408),
('worldnews', 351195, -0.058001587387908435, 0.6405990095462681),
('askreddit', 259848, -0.068637218639235, 0.6089748718101456),
('collapse', 94696, -0.1332661626390419, 0.6667106776062662),
('news', 94558, -0.09367126059175682, 0.6276134461239258),
('futurology', 89945, 0.0018637489115630797, 0.6506820198836241),
('science', 71453, 0.04588216852922973, 0.6248484283076333),
('environment', 70444, -0.015670189810189843, 0.6467846578160414),
('canada', 66813, 0.021118244331091468, 0.6408319443539487),
('australia', 60239, -0.021869519296548085, 0.6405803819103508)]
While the overall sentiment is slightly negative (with a large standard deviation), some subreddits like /r/askreddit and /r/collapse exhibited a sentiment more negative than average. Others like /r/science and /r/canada were slightly positive.
Most interestingly, all these queries ran in about 2 seconds!
DuckDB also has integration with Jupyter through the ipython-sql extension and the DuckDB SQLAlchemy driver, which allows you to query your data using an even more compact syntax:
In [1]: %load_ext sql
In [2]: %sql duckdb:///:memory:
In [3]: %%sql
...: CREATE VIEW comments AS
...: SELECT * FROM '/data/reddit-climate/the-reddit-climate-change-dataset-comments.parquet'
...:
...:
* duckdb:///:memory:
Done.
Out[3]: []
In [4]: %sql SELECT COUNT(*) FROM comments
* duckdb:///:memory:
Done.
Out[4]: [(4600698,)]
Interoperability with Python dataframe libraries
Did you notice how we were using conn.execute() all the time? As we said above, this method follows the widely used Python DBAPI 2.0. However, DuckDB can return richer objects by using conn.query() instead:
rel = conn.query("""
SELECT
"subreddit.name" AS subreddit_name,
COUNT(*) AS num_comments,
FROM comments
GROUP BY subreddit_name
ORDER BY num_comments DESC
LIMIT 10
""")
This method returns an instance of DuckDBPyRelation, which can be pretty-printed in Jupyter:
In [5]: type(rel)
Out[5]: duckdb.DuckDBPyRelation
In [6]: rel
Out[6]:
---------------------
--- Relation Tree ---
---------------------
Subquery
---------------------
-- Result Columns --
---------------------
- subreddit_name (VARCHAR)
- num_comments (BIGINT)
---------------------
-- Result Preview --
---------------------
subreddit_name num_comments
VARCHAR BIGINT
[ Rows: 10]
politics 370018
worldnews 351195
askreddit 259848
collapse 94696
news 94558
futurology 89945
science 71453
environment 70444
canada 66813
australia 60239
Moreover, you can efficiently retrieve the data from this relation and convert it to several Python objects:
- A dictionary of masked NumPy arrays, using .fetchnumpy()
- A pandas DataFrame, using .df() or its aliases .fetchdf() and .fetch_df()
- An Arrow Table, using .arrow() or .fetch_arrow_table()
- An Arrow record batch reader, using .fetch_record_batch(chunk_size)
Therefore, you can easily convert query results to a pandas DataFrame, and also a Polars one (since you can pass an Arrow table directly):
In [6]: rel.df() # pandas
Out[6]:
subreddit_name num_comments
0 politics 370018
1 worldnews 351195
2 askreddit 259848
3 collapse 94696
4 news 94558
5 futurology 89945
6 science 71453
7 environment 70444
8 canada 66813
9 australia 60239
In [7]: import polars as pl
In [8]: data = rel.arrow() # Arrow data
In [9]: pl.DataFrame(data) # Polars
Out[9]: shape: (10, 2)
┌────────────────┬──────────────┐
│ subreddit_name ┆ num_comments │
│ --- ┆ --- │
│ str ┆ i64 │
╞════════════════╪══════════════╡
│ politics ┆ 370018 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ worldnews ┆ 351195 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ askreddit ┆ 259848 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ collapse ┆ 94696 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ... ┆ ... │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ science ┆ 71453 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ environment ┆ 70444 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ canada ┆ 66813 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ australia ┆ 60239 │
└────────────────┴──────────────┘
Note: Result objects returned by conn.execute() also have these methods, but they consume the data when called, and are therefore not as convenient.
Conversely, you can transfer data from pandas or Arrow to DuckDB. Or, more precisely: you can use DuckDB to query pandas or Arrow objects that live in memory! Moreover, DuckDB can read local variables without you having to do anything:
In [13]: df_most_comments = rel.df()
In [14]: df_most_comments.head() # pandas
Out[14]:
subreddit_name num_comments
0 politics 370018
1 worldnews 351195
2 askreddit 259848
3 collapse 94696
4 news 94558
In [15]: conn.execute("""
...: SELECT subreddit_name
...: FROM df_most_comments -- Sorcery!
...: LIMIT 5
...: """).fetchall()
Out[15]: [('politics',), ('worldnews',), ('askreddit',), ('collapse',), ('news',)]
You can also manually register a compatible object with a given name:
In [17]: conn.register("most_comments_arrow", data)
Out[17]: <duckdb.DuckDBPyConnection at 0x7f9be41434f0>
In [18]: conn.execute("""
...: SELECT subreddit_name
...: FROM most_comments_arrow
...: LIMIT 5
...: """).fetchall()
Out[18]: [('politics',), ('worldnews',), ('askreddit',), ('collapse',), ('news',)]
Or, using the %sql magic as before:
In [21]: %sql output << SELECT subreddit_name FROM df_most_comments LIMIT 5
* duckdb:///:memory:
Done.
Returning data to local variable output
In [22]: output.DataFrame() # pandas
Out[22]:
subreddit_name
0 politics
1 worldnews
2 askreddit
3 collapse
4 news
In other words: you can transparently go back and forth between DuckDB and your favourite Python dataframe library. Cool!
Other features
Without going into too much detail, there are a few more interesting things about DuckDB you should check out:
- Relational API: Apart from executing your SQL queries, DuckDBPyRelation objects have some basic filtering and aggregation methods. For example, you can do things like:
In [27]: rel.filter("num_comments > 100000").order("subreddit_name").df()
Out[27]:
subreddit_name num_comments
0 askreddit 259848
1 politics 370018
2 worldnews 351195
The documentation is still a work in progress, but the DuckDB team will likely expand it in the future!
- The CLI: DuckDB has a command-line client you can use directly from your terminal, without even launching a Python or Jupyter interpreter:
$ ./duckdb -c '
> SELECT "subreddit.name" AS subreddit_name,
> COUNT(*) AS num_comments
> FROM "/data/reddit-climate/the-reddit-climate-change-dataset-comments.parquet"
> GROUP BY subreddit_name
> ORDER BY num_comments DESC
> LIMIT 10
> '
┌────────────────┬──────────────┐
│ subreddit_name │ num_comments │
├────────────────┼──────────────┤
│ politics │ 370018 │
│ worldnews │ 351195 │
│ askreddit │ 259848 │
│ collapse │ 94696 │
│ news │ 94558 │
│ futurology │ 89945 │
│ science │ 71453 │
│ environment │ 70444 │
│ canada │ 66813 │
│ australia │ 60239 │
└────────────────┴──────────────┘
Should you use DuckDB?
Although DuckDB is a wonderful piece of technology, “there is no silver bullet”, and there are cases in which you might want to use something else. The project homepage itself hints at some of them:
- For transactional workloads, you might want to use SQLite, or a more sophisticated transactional database like PostgreSQL. Remember, DuckDB was created for analytics!
- When several people need to read or write the same data, a data warehouse might make more sense.
Other than that, as you saw above, if you are looking for a lightweight and fast solution for in-process analytics, and you want to leverage both your general-purpose language of choice (Python, R, others) and SQL, DuckDB might be exactly what you want.
In upcoming articles of this series we will describe some more alternatives you might find interesting. Stay tuned!
Thanks to Alex Monahan and Elliana May for reviewing early drafts of this blog post. All remaining errors are my own.