A NumPy array is a Python object that stores data in a contiguous C-array buffer. The excellent performance of these arrays comes not only from this compact representation, but also from the ability of many arrays to share "views" of the same buffer. NumPy makes frequent use of such "no-copy" array operations, producing derived arrays without copying underlying data buffers. By taking full advantage of NumPy's efficiency, the StaticFrame DataFrame library offers orders-of-magnitude better performance than Pandas for many common operations.
No-Copy Operations with NumPy Arrays
The phrase "no-copy" describes an operation on a container (here, an array or a DataFrame) where a new instance is created, but underlying data is referenced, not copied. While some new memory is allocated for the instance, the size is generally insignificant compared to a potentially very large amount of underlying data.
NumPy makes no-copy operations the primary way of working with arrays. When you slice a NumPy array, you get a new array that shares the data from which it was sliced. Slicing an array is a no-copy operation. Extraordinary performance is gained by not having to copy already-allocated contiguous buffers, but instead just storing offsets and strides into that data.
For example, the difference between slicing an array of 100,000 integers (~0.1 µs), and slicing and then copying the same array (~10 µs), is two orders of magnitude.
>>> import numpy as np
>>> data = np.arange(100_000)
>>> %timeit data[:50_000]
123 ns ± 0.565 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
>>> %timeit data[:50_000].copy()
13.1 µs ± 48.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
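As a quick check, np.shares_memory() confirms that the slice references the same buffer rather than newly allocated memory:
>>> np.shares_memory(data, data[:50_000])
True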
We can illustrate how this works by examining two attributes of NumPy arrays. The flags attribute displays details of how the array's memory is being referenced. The base attribute, if set, provides a handle to the array that actually holds the buffer this array references.
In the example below, we create an array, take a slice, and look at the flags of the slice. We see that, for the slice, OWNDATA is False, and that the base of the slice is the original array (they have the same object identifier).
>>> a1 = np.arange(12)
>>> a1
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
>>> a2 = a1[:6]
>>> a2.flags
C_CONTIGUOUS : True
F_CONTIGUOUS : True
OWNDATA : False
WRITEABLE : True
ALIGNED : True
WRITEBACKIFCOPY : False
UPDATEIFCOPY : False
>>> id(a1), id(a2.base)
(140506320732848, 140506320732848)
These derived arrays are "views" of the original array. Views can only be taken under certain conditions, such as when reshaping, transposing, or slicing an array.
For example, after reshaping the initial 1D array into a 2D array, OWNDATA is False, showing that it still references the original array's data.
>>> a3 = a1.reshape(3,4)
>>> a3
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
>>> a3.flags
C_CONTIGUOUS : True
F_CONTIGUOUS : False
OWNDATA : False
WRITEABLE : True
ALIGNED : True
WRITEBACKIFCOPY : False
UPDATEIFCOPY : False
>>> id(a3.base), id(a1)
(140506320732848, 140506320732848)
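Transposing works the same way: the transposed array is also a view, and NumPy collapses the chain of views so that its base is the array that actually owns the data.
>>> a5 = a3.T
>>> a5.flags['OWNDATA']
False
>>> a5.base is a1
True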
Both horizontal and vertical slices of this 2D array similarly result in arrays that simply reference the original array's data. Again, OWNDATA is False, and the base of the slice is the original array.
>>> a4 = a3[:, 2]
>>> a4
array([ 2, 6, 10])
>>> a4.flags
C_CONTIGUOUS : False
F_CONTIGUOUS : False
OWNDATA : False
WRITEABLE : True
ALIGNED : True
WRITEBACKIFCOPY : False
UPDATEIFCOPY : False
>>> id(a1), id(a4.base)
(140506320732848, 140506320732848)
While creating lightweight views of shared memory buffers offers significant performance advantages, there is a risk: mutating any one of those arrays will mutate all of them. As shown below, assigning -1 to an element of our most-derived array is reflected in every associated array.
>>> a4[0] = -1
>>> a4
array([-1, 6, 10])
>>> a3
array([[ 0,  1, -1,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
>>> a2
array([ 0, 1, -1, 3, 4, 5])
>>> a1
array([ 0, 1, -1, 3, 4, 5, 6, 7, 8, 9, 10, 11])
Side-effects like this should concern you. Passing around views of shared buffers to clients that can mutate those buffers can lead to serious flaws. There are two solutions to this problem.
One option is for the caller to make explicit "defensive" copies every time a new array is created. This removes the performance advantage of sharing views but ensures that mutating an array does not lead to unexpected side effects.
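For example, copying a slice produces an array that owns its own buffer, so mutating the copy cannot affect the original:
>>> a6 = a1[:6].copy()
>>> a6.base is None
True
>>> np.shares_memory(a1, a6)
False
>>> a6[0] = 99
>>> a1[0]
0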
Another option, requiring no sacrifice in performance, is to make the array immutable. By doing so, views of arrays can be shared without concern of mutation causing unexpected side effects.
A NumPy array can easily be made immutable by setting the writeable flag to False on the flags interface. After setting this value, the flags display shows WRITEABLE as False, and attempting to mutate this array results in an exception.
>>> a1.flags.writeable = False
>>> a1.flags
C_CONTIGUOUS : True
F_CONTIGUOUS : True
OWNDATA : True
WRITEABLE : False
ALIGNED : True
WRITEBACKIFCOPY : False
UPDATEIFCOPY : False
>>> a1[0] = -1
Traceback (most recent call last):
File "<console>", line 1, in <module>
ValueError: assignment destination is read-only
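Views taken from the now read-only array inherit the read-only flag, so they can be shared without concern:
>>> a7 = a1[:3]
>>> a7.flags['WRITEABLE']
False
>>> a7[0] = -1
Traceback (most recent call last):
File "<console>", line 1, in <module>
ValueError: assignment destination is read-only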
By embracing immutable views of NumPy arrays, we get the best possible performance with no risk of side effects.
The Advantages of No-Copy DataFrame Operations
This insight, that an immutable-array-based data model offers the best performance with minimal risk, was foundational to the creation of the StaticFrame DataFrame library. As StaticFrame (like Pandas) manages data stored in NumPy arrays, embracing the use of array views (without having to make defensive copies) offers significant performance advantages. Without an immutable data model, Pandas cannot make such use of array views.
StaticFrame is not yet always faster than Pandas: Pandas has very performant operations for joins and other specialized transformations. But when leveraging no-copy array operations, StaticFrame can be a lot faster.
To compare performance, we will use the FrameFixtures library to create two DataFrames of 10,000 rows by 1,000 columns of heterogeneous types. For both, we can convert the StaticFrame Frame into a Pandas DataFrame.
>>> import static_frame as sf
>>> import pandas as pd
>>> sf.__version__, pd.__version__
('0.9.21', '1.5.1')
>>> import frame_fixtures as ff
>>> f1 = ff.parse('s(10_000,1000)|v(int,int,str,float)')
>>> df1 = f1.to_pandas()
>>> f2 = ff.parse('s(10_000,1000)|v(int,bool,bool,float)')
>>> df2 = f2.to_pandas()
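As a quick check of the immutable model (illustrative only, not part of the benchmarks that follow), the arrays StaticFrame exposes through the values property are flagged read-only:
>>> f1[0].values.flags.writeable
False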
A simple example of the advantage of a no-copy operation is renaming an axis. With Pandas, all underlying data is defensively copied. With StaticFrame, all underlying data is re-used; only lightweight outer containers have to be created. StaticFrame (~0.04 ms) is over three orders of magnitude faster than Pandas (~170 ms).
>>> %timeit f1.rename(index='foo')
35.8 µs ± 496 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>> %timeit df1.rename_axis('foo')
167 ms ± 4.72 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Given a DataFrame, it is often necessary to make a column into the index. When Pandas does this, it has to copy the column data to the index, as well as copy all the underlying data. StaticFrame can re-use a view of the column in the index, as well as re-use all of the underlying data. StaticFrame (~1 ms) is two orders of magnitude faster than Pandas (~170 ms).
>>> %timeit f1.set_index(0)
1.25 ms ± 23.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit df1.set_index(0, drop=False)
166 ms ± 3.52 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Extracting a subset of columns from a DataFrame is another common operation. For StaticFrame, this is a no-copy operation: the returned DataFrame simply holds views to the column data in the original DataFrame. StaticFrame (~25 µs) can do this more than an order of magnitude faster than Pandas (~730 µs).
>>> %timeit f1[[10, 50, 100, 500]]
25.4 µs ± 471 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>> %timeit df1[[10, 50, 100, 500]]
729 µs ± 4.14 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
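We can verify the no-copy behavior directly: a column extracted from the derived Frame shares memory with the same column in the original (a quick check, separate from the timing above):
>>> import numpy as np
>>> f3 = f1[[10, 50, 100, 500]]
>>> np.shares_memory(f3[10].values, f1[10].values)
True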
It is common to concatenate two or more DataFrames. If they have the same index, and we concatenate them horizontally, StaticFrame can re-use all the underlying data of the inputs, making this form of concatenation a no-copy operation. StaticFrame (~1 ms) can do this two orders of magnitude faster than Pandas (~100 ms).
>>> %timeit sf.Frame.from_concat((f1, f2), axis=1, columns=sf.IndexAutoFactory)
1.16 ms ± 50.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit pd.concat((df1, df2), axis=1)
102 ms ± 14.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Conclusion
NumPy is designed to take advantage of sharing views of data. Because Pandas permits in-place mutation, it cannot make optimal use of NumPy views. As StaticFrame is built on an immutable data model, side-effect mutation risk is eliminated and no-copy operations are embraced, providing a significant performance advantage.