This post is an adaptation of the one I originally published in the Orchest blog. Since I wrote it, the Vaex project has become inactive, and more modern alternatives, like Polars, have gained support for out-of-core processing. I still wanted to copy it here for historical purposes and to give continuity to the series.
Vaex is an open-source Python library that provides lazy, out-of-core dataframes "to visualize and explore big tabular datasets". Initially, it was a GUI tool created to visualize the massive Gaia star catalog, and later on, it evolved to become a powerful data manipulation library. But what do "lazy" and "out-of-core" mean in this context?
- "Lazy" means that certain operations are not executed immediately: instead, their evaluation is delayed until the result is explicitly requested or an aggregation is performed.
- "Out-of-core" refers to a set of techniques that allow the user to manipulate data that is larger than the available RAM by reading chunks from the disk. This happens transparently.
Vaex is not the only lazy, out-of-core dataframe library: other projects like Dask also apply these concepts, although in a very different way. We will cover Dask in a future post!
(If you're unsure about the pronunciation: the maintainers say /vəks/, with a short, neutral vowel sound.)
You can install Vaex with conda/mamba or pip:
```bash
mamba install -y "vaex=4.9"
# Or, alternatively
pip install "vaex==4.9"
```
vaex is actually a metapackage, so you might want to pick exactly which parts of it you are interested in. For this tutorial, you will only need vaex-core and vaex-viz:
```bash
mamba install -y "vaex-core=4.9" vaex-viz
# Or, alternatively
pip install "vaex-core==4.9" vaex-viz
```
You can find more detailed installation instructions in the official documentation.
Let's first load the most recent .parquet file of our dataset:
```python
import vaex

df_2018 = vaex.open("/data/airline-delays/2018.parquet")
```
The type of the returned object is vaex.dataframe.DataFrameLocal, which is the "Base class for DataFrames that work with local file/data" (docs). You will notice that the open call finishes almost immediately, even though the dataframe has more than 7 million rows:
```
In : len(df_2018)
Out: 7213446
```
The reason is that, with binary file formats, Vaex uses memory-mapping, hence allowing you to manipulate datasets that are larger than RAM (similar to what PyArrow does, as covered in the first part of the series).
Vaex supports several binary file formats (Feather, Parquet, and some domain-specific formats like HDF5 and FITS) as well as text-based formats (CSV, JSON, ASCII). However, the latter cannot be memory-mapped, and therefore you need to be a bit more careful when using them:
- If the data fits in memory, call the vaex.open function as normal, taking into account that the whole dataset will be loaded.
- If the data does not fit in memory, pass convert=True and a chunk_size parameter to open so that Vaex converts the data to a binary format behind the scenes, using extra disk space in the process.
Now, back to our dataframe: to display more information about it, you can call the .info() method, which will show:
- Number of rows
- Name and type information of the columns
- First five and last five rows of the dataframe
.describe() will work like its pandas counterpart and display a summary of statistical information for every column.
As you can see, Vaex, like pandas, leverages the interactive capabilities of Jupyter to display the information in a visually appealing way.
Like in pandas, you can access a specific column of a Vaex dataframe using square-bracket indexing:
```
In : df_2018["OP_CARRIER"]  # Or df_2018.col.OP_CARRIER
Out: Expression = OP_CARRIER
Length: 7,213,446 dtype: string (column)
----------------------------------------
      0  UA
      1  UA
      2  UA
      3  UA
      4  UA
    ...
7213441  AA
7213442  AA
7213443  AA
7213444  AA
7213445  AA
```
But unlike pandas, this returns an Expression: a representation of a computation step that has not been evaluated yet. Expressions are one of the power features of Vaex, since they allow you to chain several operations one after another in a lazy way, hence without actually computing them:
```python
# Normal subtraction of two columns
df_2018["ACTUAL_ELAPSED_TIME"] - df_2018["CRS_ELAPSED_TIME"]

# Creation of an equivalent expression by using indexing
df_2018["ACTUAL_ELAPSED_TIME - CRS_ELAPSED_TIME"]

# NumPy functions also create expressions
import numpy as np
np.sqrt(df_2018["DISTANCE"])

# Or they can be used inside the indexing itself!
df_2018["sqrt(DISTANCE)"]
```
There are two ways to fully evaluate an expression:
- Requesting an in-memory representation, by retrieving the .values property (which will return a NumPy array) or calling .to_pandas_series(). Make sure the data fits in RAM!
- Performing an aggregation (like .mean() , .max() , and so forth), like this:
```
In : df_2018["CANCELLED"].sum()
Out: array(116584.)
```
For more complex expressions and filters, Vaex supports displaying interactive progress bars:
```
In : # We exclude cancelled flights and flights that departed on time
...: # or early to compute the mean delay time
...: delayed = df_2018[(df_2018["CANCELLED"] == 0.0) & (df_2018["DEP_DELAY"] > 0)]
...: delayed["DEP_DELAY"].mean(progress="rich")  # Fancy progress bar!
mean ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 01.05s
└── vaex.agg.mean('DEP_DELAY') ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 01.05s
    ├── vaex.agg.sum('DEP_DELAY') ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 01.04s
    └── vaex.agg.count('DEP_DELAY') ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 01.04s
Out: array(38.18255826)
```
Now that you know how it's done, let's read all the .parquet files at once and perform the same computation:
```
In : df = vaex.open("/data/airline-delays/*.parquet")

In : len(df)
Out: 61556964

In : delayed = df[(df["CANCELLED"] == 0.0) & (df["DEP_DELAY"] > 0)]

In : delayed["DEP_DELAY"].mean(progress="rich")
mean ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 09.03s
└── vaex.agg.mean('DEP_DELAY') ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 09.03s
    ├── vaex.agg.sum('DEP_DELAY') ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 09.03s
    └── vaex.agg.count('DEP_DELAY') ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 09.03s
Out: array(32.15845342)
```
Vaex crunches more than 60 million rows in less than 10 seconds 🔥
Finally, Vaex has some built-in visualization capabilities that intelligently wrap its aggregation and binning functions to offer a higher-level interface:
Fast histogram of 60+ million rows
One of the interesting differences between Vaex and pandas is that the concept of the index does not exist. This means that, when converting data from pandas using vaex.from_pandas, you will need to decide whether to include the index as a normal column (passing copy_index=True) or discard it entirely (copy_index=False, the default behavior).
Similarly, Vaex does not have an equivalent of the .loc accessor, so to filter by rows and columns you will have to chain several indexing operations. This is not a problem, though, because Vaex does not copy the data along the way.
If you compare Vaex and pandas running times in an interactive setting (for example, on JupyterLab), you might observe that sometimes Vaex seems slightly slower. However, if you perform proper microbenchmarking, you will notice that Vaex is actually faster than pandas, even for data that fits in RAM:
```
In : %%timeit -n1 -r1
...: df = vaex.open("/data/airline-delays/2018.parquet")
...: delayed = df[(df["CANCELLED"] == 0.0) & (df["DEP_DELAY"] > 0)]
...: mean_delayed = delayed["DEP_DELAY"].mean()
...:
1.31 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

In : %%timeit -n1 -r1
...: df_pandas = pd.read_parquet("/data/airline-delays/2018.parquet")
...: delayed_pandas = df_pandas.loc[(df_pandas["CANCELLED"] == 0.0) & (df_pandas["DEP_DELAY"] > 0)]
...: mean_delayed_pandas = delayed_pandas["DEP_DELAY"].mean()
...:
3.81 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
```
And this is without taking into account that reading all the data with pandas in a RAM-constrained environment is simply impossible:
```python
# Equivalent of
# df = vaex.open("/data/airline-delays/*.parquet")

# Jupyter shortcut to read all files
fnames = !ls /data/airline-delays/*.parquet

# Loads everything in memory,
# and therefore it might blow up!
df = pd.concat([
    pd.read_parquet(fname) for fname in fnames
])
```
Moral of the story: benchmarking is difficult!
In addition to what we saw in this short introduction, Vaex has powerful visualization capabilities, as well as some features that are not found in other libraries that can prove useful if you are working with scientific data, like propagation of uncertainties, just-in-time compilation of math-heavy expressions, and much more.
If you already have your data in a supported binary format on disk, or if it's in a text-based format and you have enough space to convert it, Vaex is an excellent solution to process it in an efficient way. As you can see, Vaex mimics the pandas API, but it is not based on it and it deviates in some small places. On the other hand, it is a younger library with a smaller user base, and some parts of the documentation might not be as complete; fortunately, the maintainers compensate by responding quickly to issues and feedback.
- Use Vaex if you have large amounts of data on disk that don't fit into memory, if you want to leverage fast visualization capabilities, if you have a scientific use case, or if you are not worried about learning an API slightly different from pandas.
- Don't use Vaex if you are looking for solutions to quickly migrate a large pandas codebase, or if you are on a storage-constrained environment reading lots of data from the network or in CSV format.
In upcoming articles of this series we will describe some more alternatives you might find interesting. Stay tuned!