DEV Community

Juan Luis Cano Rodríguez

Posted on • Originally published at orchest.io

Out-of-core processing with Vaex

This post is an adaptation of the one I originally published in the Orchest blog. Since I wrote it the Vaex project has become inactive, and more modern alternatives, like Polars, have gained support for out-of-core processing. I still wanted to copy it here for historical purposes and to give continuity to the series.

Vaex: a new approach to dataframes

Vaex is an open-source Python library that provides lazy, out-of-core dataframes "to visualize and explore big tabular datasets". Initially, it was a GUI tool created to visualize the massive Gaia star catalog, and later on, it evolved to become a powerful data manipulation library. But what do "lazy" and "out-of-core" mean in this context?

  • "Lazy" means that certain operations are not executed immediately: their evaluation is postponed until the result is explicitly requested or an aggregation is performed.
  • "Out-of-core" refers to a set of techniques that allow the user to manipulate data that is larger than the available RAM by reading chunks from the disk. This happens transparently.
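To make the out-of-core idea concrete, here is a toy, stdlib-only sketch (not how Vaex is implemented) that sums a column of numbers by streaming fixed-size chunks from disk, so memory usage stays bounded no matter how large the file is:

```python
import os
import tempfile

def chunked_sum(path, chunk_size=4):
    """Sum newline-separated integers without loading the whole file."""
    total = 0
    with open(path) as f:
        while True:
            lines = [f.readline() for _ in range(chunk_size)]
            lines = [line for line in lines if line.strip()]
            if not lines:
                break  # end of file reached
            total += sum(int(line) for line in lines)
    return total

# Write a small demo file; in a real out-of-core scenario
# this file would be larger than the available RAM.
path = os.path.join(tempfile.mkdtemp(), "numbers.txt")
with open(path, "w") as f:
    f.write("\n".join(str(i) for i in range(1, 101)))

print(chunked_sum(path))  # 5050
```

Only `chunk_size` lines ever sit in memory at once; that is the essence of out-of-core processing.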

Vaex is not the only lazy, out-of-core dataframe library: other projects like Dask also apply these concepts, although in a very different way. We will cover Dask in a future post!

(If you're unsure about the pronunciation, the maintainers call it /vəks/, that is, with a short, neutral vowel sound.)

Vaex logo

First steps with Vaex

For this example we will use a dataset of Airline Delay and Cancellation Data from 2009 to 2018 obtained from Kaggle, which itself originates from the U.S. Department of Transportation.

You can install Vaex with conda/mamba or pip:

mamba install -y "vaex=4.9"  
# Or, alternatively  
pip install "vaex==4.9"

vaex  is actually a metapackage, so you might want to pick exactly which parts of it you are interested in. For this tutorial, you will only need vaex-core and vaex-viz:

mamba install -y "vaex-core=4.9" vaex-viz  
# Or, alternatively  
pip install "vaex-core==4.9" vaex-viz

You can find more detailed installation instructions in the official documentation.

Let's first load the most recent .parquet file of our dataset:

import vaex  
df_2018 = vaex.open("/data/airline-delays/2018.parquet")

The type of the returned object is vaex.dataframe.DataFrameLocal, which is the "Base class for DataFrames that work with local file/data" (docs). You will notice that the open call finishes almost immediately, even though the dataframe has more than 7 million rows:

In [2]: len(df_2018)
Out[2]: 7213446

The reason is that, with binary file formats, Vaex uses memory-mapping, hence allowing you to manipulate datasets that are larger than RAM (similar to what PyArrow does, as covered in the first part of the series).
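As a rough illustration of why memory-mapping makes `open` so fast, here is a sketch using Python's standard mmap module (Vaex's actual machinery is more sophisticated): mapping the file is nearly instantaneous, and only the pages you actually touch get read from disk:

```python
import mmap
import os
import tempfile

# Create a demo file; imagine it being tens of gigabytes.
path = os.path.join(tempfile.mkdtemp(), "data.bin")
with open(path, "wb") as f:
    f.write(bytes(range(256)) * 1000)  # 256 kB of repeating bytes

with open(path, "rb") as f:
    # Map the file into the address space: no bulk read happens here.
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Only the pages backing this slice are actually loaded.
    chunk = mm[100_000:100_010]
    mm.close()

print(list(chunk))
```

Because the byte at offset `p` equals `p % 256`, the slice starting at 100 000 contains the values 160 through 169, retrieved without ever reading the whole file.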

Vaex supports several binary file formats (Feather, Parquet, HDF5, and domain-specific formats like FITS) as well as text-based formats (CSV, JSON, ASCII). However, the latter cannot be memory-mapped, and therefore you need to be a bit more careful when using them:

  • If the data fits in memory, call the vaex.open function as normal, taking into account that the whole dataset will be loaded.
  • If the data does not fit in memory, pass convert=True and a chunk_size  parameter to open so that Vaex converts the data to a binary format behind the scenes, using extra disk space in the process.
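Conceptually, that `convert=True` path boils down to streaming the text file in chunks and writing each chunk out to a binary file, so the full dataset never sits in memory at once. A stdlib-only sketch of the idea (the binary layout here is made up for illustration; Vaex writes a proper columnar format behind the scenes):

```python
import csv
import itertools
import os
import struct
import tempfile

def csv_to_binary(csv_path, bin_path, chunk_size=2):
    """Stream a one-column CSV of floats into packed binary, chunk by chunk."""
    with open(csv_path) as src, open(bin_path, "wb") as dst:
        reader = csv.reader(src)
        next(reader)  # skip the header row
        while True:
            chunk = list(itertools.islice(reader, chunk_size))
            if not chunk:
                break
            # Pack this chunk as 8-byte doubles and append it to the file.
            dst.write(struct.pack(f"{len(chunk)}d", *(float(r[0]) for r in chunk)))

tmp = tempfile.mkdtemp()
csv_path = os.path.join(tmp, "delays.csv")
bin_path = os.path.join(tmp, "delays.bin")
with open(csv_path, "w") as f:
    f.write("DEP_DELAY\n12.0\n-3.5\n40.0\n")

csv_to_binary(csv_path, bin_path)
with open(bin_path, "rb") as f:
    values = struct.unpack("3d", f.read())
print(values)  # (12.0, -3.5, 40.0)
```

The extra disk space mentioned above is exactly this second, binary copy of the data, which can then be memory-mapped on subsequent runs.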

Now, back to our dataframe: to display more information about it, you can call the .info()  method, which will show:

  • Number of rows
  • Name and type information of the columns
  • First five and last five rows of the dataframe

Beginning of .info() result

In addition, .describe() will work like its pandas counterpart and display a summary of statistical information of every column:

Output of .describe() result

As you can see, Vaex, like pandas, leverages the interactive capabilities of Jupyter to display the information in a visually appealing way.

The power of expressions

Like in pandas, you can select a specific column of a Vaex dataframe using square-bracket indexing:

In [10]: df_2018["OP_CARRIER"]  # Or df_2018.col.OP_CARRIER  
Out[10]: Expression = OP_CARRIER  
Length: 7,213,446 dtype: string (column)  
----------------------------------------  
     0  UA  
     1  UA  
     2  UA  
     3  UA  
     4  UA  
   ...  
7213441  AA  
7213442  AA  
7213443  AA  
7213444  AA  
7213445  AA

But unlike pandas, this returns an Expression: a representation of a computation step that has not been evaluated yet. Expressions are one of Vaex's most powerful features, since they allow you to chain several operations one after another lazily, that is, without actually computing them:

# Normal subtraction of two columns  
df_2018["ACTUAL_ELAPSED_TIME"] - df_2018["CRS_ELAPSED_TIME"]  

# Creation of an equivalent expression by using indexing  
df_2018["ACTUAL_ELAPSED_TIME - CRS_ELAPSED_TIME"]  

# NumPy functions also create expressions  
import numpy as np  

np.sqrt(df_2018["DISTANCE"])  

# Or they can be used inside the indexing itself!  
df_2018["sqrt(DISTANCE)"]
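As a hypothetical sketch of how such lazy expressions could work under the hood (plain Python, not Vaex's actual implementation), consider an object that merely records operations and only iterates over the data when an aggregation is requested:

```python
class Expression:
    """Toy lazy expression: stores a function, evaluates only on demand."""

    def __init__(self, data, func):
        self._data = data
        self._func = func

    def __sub__(self, other):
        # Chaining builds a new Expression; nothing is computed yet.
        return Expression(
            self._data,
            lambda row: self._func(row) - other._func(row),
        )

    def mean(self):
        # Aggregations trigger the actual evaluation over the data.
        values = [self._func(row) for row in self._data]
        return sum(values) / len(values)

rows = [{"actual": 130, "scheduled": 120}, {"actual": 95, "scheduled": 100}]
actual = Expression(rows, lambda r: r["actual"])
scheduled = Expression(rows, lambda r: r["scheduled"])

delay = actual - scheduled  # still lazy: no loop has run yet
print(delay.mean())         # evaluation happens here -> 2.5
```

The subtraction costs essentially nothing; all the work is deferred until `.mean()` forces a pass over the rows.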

There are two ways to fully evaluate an expression:

  • Requesting an in-memory representation, by retrieving the .values  property (which will return a NumPy array) or calling .to_pandas_series() . Make sure the data fits in RAM!
  • Performing an aggregation (like .mean() , .max() , and so forth), like this:
In [12]: df_2018["CANCELLED"].sum()  
Out[12]: array(116584.)

For more complex expressions and filters, Vaex supports displaying interactive progress bars:

In [15]: # We exclude "CANCELLED" flights and ones that arrived on time  
   ...: # to compute the mean delay time  
   ...: delayed = df_2018[(df_2018["CANCELLED"] == 0.0) & (df_2018["DEP_DELAY"] > 0)]  
   ...: delayed["DEP_DELAY"].mean(progress="rich")  # Fancy progress bar!  
 mean                                          ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 01.05s  
 └── vaex.agg.mean('DEP_DELAY')                ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 01.05s  
     ├── vaex.agg.sum('DEP_DELAY')             ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 01.04s[1]  
     └── vaex.agg.count('DEP_DELAY')           ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 01.04s[1]  
Out[15]: array(38.18255826)

Now that you know how it's done, let's read all the .parquet  files at once and perform the same computation:

In [16]: df = vaex.open("/data/airline-delays/*.parquet")  
  
In [17]: len(df)  
Out[17]: 61556964  

In [18]: delayed = df[(df["CANCELLED"] == 0.0) & (df["DEP_DELAY"] > 0)]  

In [19]: delayed["DEP_DELAY"].mean(progress="rich")  
 mean                                          ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 09.03s  
 └── vaex.agg.mean('DEP_DELAY')                ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 09.03s  
     ├── vaex.agg.sum('DEP_DELAY')             ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 09.03s[1]  
     └── vaex.agg.count('DEP_DELAY')           ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 09.03s[1]  
Out[19]: array(32.15845342)

Vaex crunches more than 60 million rows in less than 10 seconds 🔥
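The progress trees above also hint at how the mean is computed: as independent sum and count aggregations over chunks of the data, combined at the end. A minimal sketch of that combine step in plain Python:

```python
def chunked_mean(chunks):
    """Combine per-chunk partial sums and counts into a global mean."""
    total, count = 0.0, 0
    for chunk in chunks:
        total += sum(chunk)  # analogous to vaex.agg.sum, per chunk
        count += len(chunk)  # analogous to vaex.agg.count, per chunk
    return total / count

# The chunks could be streamed from disk; here they are in-memory lists.
chunks = [[10.0, 20.0], [30.0], [40.0, 50.0, 60.0]]
print(chunked_mean(chunks))  # 35.0
```

Because sum and count decompose over chunks, the mean never requires all the data in memory at once, and the per-chunk work parallelizes naturally.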

Finally, Vaex has some built-in visualization capabilities that intelligently wrap its aggregation and binning functions to offer a higher-level interface:

Fast histogram of 60+ million rows

Some differences with pandas

One of the interesting differences between Vaex and pandas is that the concept of the index does not exist. This means that, when converting data from pandas using vaex.from_pandas, you will need to decide whether to include the index as a normal column (passing copy_index=True) or discard it entirely (copy_index=False, the default behavior).

On the other hand, Vaex does not have an equivalent of the .loc accessor, so to filter by rows and columns you will have to chain several indexing operations. This is not a problem though, because Vaex does not copy the data.

A note on performance

If you compare Vaex and pandas running times in an interactive setting (for example, on JupyterLab), you might observe that Vaex is sometimes slightly slower. However, if you perform proper microbenchmarking, you will notice that Vaex is actually faster than pandas, even for data that fits in RAM:

In [20]: %%timeit -n1 -r1  
   ...: df = vaex.open("/data/airline-delays/2018.parquet")  
   ...: delayed = df[(df["CANCELLED"] == 0.0) & (df["DEP_DELAY"] > 0)]  
   ...: mean_delayed = delayed["DEP_DELAY"].mean()  
   ...:  
1.31 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)  

In [21]: %%timeit -n1 -r1  
   ...: df_pandas = pd.read_parquet("/data/airline-delays/2018.parquet")  
   ...: delayed_pandas = df_pandas.loc[(df_pandas["CANCELLED"] == 0.0) & (df_pandas["DEP_DELAY"] > 0)]  
   ...: mean_delayed_pandas = delayed_pandas["DEP_DELAY"].mean()  
   ...:  
3.81 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

And this is without taking into account that reading all the data with pandas in a RAM-constrained environment is simply impossible:

# Equivalent of  
# df = vaex.open("/data/airline-delays/*.parquet")  

# Jupyter shortcut to read all files  
fnames = !ls /data/airline-delays/*.parquet  

# Loads everything in memory,  
# and therefore it might blow up!  
df = pd.concat([  
   pd.read_parquet(fname) for fname in fnames  
])

Moral of the story: benchmarking is difficult!

Should you use Vaex?

In addition to what we saw in this short introduction, Vaex has powerful visualization capabilities, as well as some features that are not found in other libraries that can prove useful if you are working with scientific data, like propagation of uncertainties, just-in-time compilation of math-heavy expressions, and much more.

If you already have your data in a supported binary format on disk, or if it's in a text-based format and you have enough space to convert it, Vaex is an excellent solution for processing it efficiently. As you can see, Vaex mimics the pandas API, but it is not based on it and deviates from it in some small places. On the other hand, it is a younger library with a smaller user base, and some parts of the documentation might not be as complete. Fortunately, the maintainers compensate by responding quickly to issues and feedback.

In summary:

  • Use Vaex if you have large amounts of data on disk that don't fit into memory, if you want to leverage fast visualization capabilities, if you have a scientific use case, or if you are not worried about learning an API slightly different from pandas.
  • Don't use Vaex if you are looking for solutions to quickly migrate a large pandas codebase, or if you are on a storage-constrained environment reading lots of data from the network or in CSV format.

In upcoming articles of this series we will describe some more alternatives you might find interesting. Stay tuned!
