Numpy is all about math. That is what you will hear from most people. Although numpy is an essential package to do mathematical computations, some people are still out of the loop as to how it can be used. This post aims to clarify how you can make the most out of Numpy.
This is going to be a multi-part series since I'm going to cover different types of mathematics here, so it can't all fit in one post in a reasonable amount of reading time.
pip install numpy
will install numpy for you. You can also get it from a scientific distribution like Anaconda but I think that approach is overkill if you're only going to use numpy.
Basic element of numpy
The most fundamental data structure inside numpy is the multi-dimensional array, of type ndarray
. This allows you to work with arrays with more than one dimension (or axis as numpy calls them). In normal python the closest you get to this is the list, which has only one dimension (axis). Not only do a decent number of mathematical operations need more than one dimension, for example matrix operations, you may also need to operate on several numbers in parallel, like if you have audio samples, or imported statistical data.
Because of the intended use cases of multi-dimensional arrays, numpy's ndarray
s must have a rectangular size. They will have the same number of elements in each column, same number of elements in each row, and same number of elements for any other dimensions. All elements must have the same data type, so you can't mix floats and integers in a single array for example. The one exception is that there can be arrays of generic python objects. Obviously these are not efficient to operate on.
Fast but high-level
When implementing any algorithm or computation you usually have two choices. Implement it in a fast low-level language (such as C) but you can't use any convenience functions because those are lacking. Or you can use a high-level language (such as Python) that provides convenience functions but at the expense of speed.
But because of the way Python is designed, it is possible to compile modules written in the C language and run them directly in Python, with complete Python bindings. This gives you the best of both worlds, and is the approach that Numpy took. Therefore, you can do things like write c = a * b
to do matrix multiplication in Python but suffer almost no performance penalty because the operation is written in the C language.
In other words, numpy can vectorize (aka broadcast) mathematical operations with little runtime latency.
Important array properties
The following properties of ndarray
(aka array
) are some of the most used properties in all of numpy, because they tell basic information about the array. Assuming a
is a numpy.array
(or numpy.ndarray
):
-
a.ndim
: The number of dimensions (axes) in the array. -
a.shape
: The shape of the array, returned as a list. The list as the size of each dimension. -
a.size
: Total number of elements. Equal to the product of the elements ofa.shape
. -
a.dtype
: The datatype of the array. This is a type object. There are also types corresponding to fixed-size number types, provided by numpy, such asnumpy.int32
,numpy.int16
, andnumpy.float64
. -
a.itemsize
: The size of an element ofa.dtype
. In C/C++ this corresponds tosizeof(data_type)
.
The following property is used internally by numpy:
-
a.data
: The raw data in the array. We usually won't need to use this property.
Creating ndarray
s
The simplest way to create an ndarray
is to pass a list to ndarray
's constructor, like numpy.array([1.0, 2.4, 4.2])
. Remember that (see end of article). This creates a one-dimensional arrray.array
is an alias for ndarray
To create a 2D array, pass a list of lists like numpy.array([(1.5,2,3), (4,5,6)])
. It doesn't matter if you use tuples or lists to create the array.
You can tell numpy that you want to create an array of complex numbers by passing dtype=complex
, for example numpy.array( [ [1,2], [3,4] ], dtype=complex )
. The purpose of the dtype
keyword argument is to specify the type of the array.
You must specify the array in square brackets (or parentheses, as a tuple). If you just make a list without square brackets, the array creation will fail.
To create an array of zeroes with a predefined size, use numpy.zeros()
and pass the array dimensions in a tuple. There is also np.ones()
for an array with all ones. An array with undefined values can be made with np.empty()
:
>>> np.empty( (3,4) )
array([[6.94831542e-310, 2.66185574e-316, 4.94065646e-324,
4.94065646e-324],
[4.94065646e-324, 0.00000000e+000, 0.00000000e+000,
0.00000000e+000],
[0.00000000e+000, 0.00000000e+000, 0.00000000e+000,
0.00000000e+000]])
Printing arrays
The way numpy arrays are printed is described in its documentation:
- the last axis is printed from left to right,
- the second-to-last is printed from top to bottom,
- the rest are also printed from top to bottom, with each slice separated from the next by an empty line.
You can change the shape of the array by using np.resize()
, or the array's resize()
method. These take a tuple of dimensions that partitions the elements first as one very long list, strolling along each element and giving them new indices in the new array shape. They don't return anything. These resize functions are destructive, as in, they change the shape of the argument. if you just want to return a new shaped array without modifying its argument then use np.reshape()
or the array's reshape()
method.
One of the resize/reshape dimensions can be -1, which means calculate the size of that dimension using the size of the array and the new sizes of the other dimensions.
Numpy only prints the first few and last few elements of large arrays (specifically, the first and last elements of rows and columns). It won't print the middle elements of large arrays, whether the are rows, columns, or other dimensions.
>>> print(np.arange(10000).reshape(100,100))
[[ 0 1 2 ... 97 98 99]
[ 100 101 102 ... 197 198 199]
[ 200 201 202 ... 297 298 299]
...
[9700 9701 9702 ... 9797 9798 9799]
[9800 9801 9802 ... 9897 9898 9899]
[9900 9901 9902 ... 9997 9998 9999]]
If your adrenaline is pumping high and you want numpy to print every single value in the array (in which case you're either very brave or just have a really fast computer processor 😉), you can change numpy's printing options to do this. Import the sys
module first, and then use set_printoptions()
to set the threshold to the maximum. In code, np.set_printoptions(threshold=sys.maxsize)
.
A convenient feature of numpy arrays is that when you print them, they format the array in a way that makes the shape intuitive to see. Not only that, it can also make ranges of numbers as arrays, just like the built-in range()
function. np.arange()
will do the trick for integer numbers:
np.arange( 10, 30, 5 )
array([10, 15, 20, 25])
>>> np.arange( 0, 2, 0.3 )
array([ 0. , 0.3, 0.6, 0.9, 1.2, 1.5, 1.8])
As you can see, floating-point arguments can be passed to arange()
, but it is much better to use np.linspace()
to make a range of floating-point arguments. Unlike arange()
, linspace()
can determine the number of elements in the sequence because that is passed to it as an argument, while arange()
relies on a step size. This becomes crucial when dealing with ranges of floats because internal floating-point processor precision is finite, and that causes arithmetic with division results to be slightly imprecise. To see what I mean, run this example code:
>>> 1/10 + 1/5
0.30000000000000004 # Incorrect
>>> 3/10
0.3
This is not a formatting error. This is a widespread arithmetic problem not limited to Python. Most other languages such as Javascript and C++ are vulnerable to this. That's why step sizes are not very efficient for floats.
Anyway, this is how you would use linspace()
in practice:
>>> from numpy import pi
>>> np.linspace( 0, 2, 9 ) # 9 numbers from 0 to 2
array([ 0. , 0.25, 0.5 , 0.75, 1. , 1.25, 1.5 , 1.75, 2. ])
>>> x = np.linspace( 0, 2*pi, 100 ) # useful to evaluate function at lots of points
>>> f = np.sin(x)
The formatting of some of the numbers is long but makes even spacing. One other feature this code snippet just demonstrated is the ability to run a function on all the elements of the array at once. This becomes very powerful for computing things in parallel, since no loops are necessary; they are handled internally in numpy.
Basic array operations
Nothing fancy here, just the normal operations you would expect from numerical arrays:
>>> a = np.array( [20,30,40,50] )
>>> b = np.arange( 4 )
>>> b
array([0, 1, 2, 3])
>>> c = a-b
>>> c
array([20, 29, 38, 47])
>>> b**2 # Note there is exponentiation.
array([0, 1, 4, 9])
>>> 10*np.sin(a)
array([ 9.12945251, -9.88031624, 7.4511316 , -2.62374854])
>>> a<35
array([ True, True, False, False])
Also there are arithmetical operations directly under the np
namespace, np.add()
, np.subtract()
, np.multiply()
, np.divide()
, np.remainder()
. These are not ndarray methods.
>>> np.array([5,2,60,40,2]) % np.array([4,9,52,9,2])
array([1, 2, 8, 4, 0])
Matrix multiplication is the only operation in this section that will look exotic to most people. It's denoted with the @
operator in Python 3.5+, or with the dot()
method in all versions of Python 3 (the versions supported by current Numpy releases, to clarify).
>>> A = np.array( [[1,1],
... [0,1]] )
>>> B = np.array( [[2,0],
... [3,4]] )
>>> A * B # element-wise product
array([[2, 0],
[0, 4]])
>>> A @ B # matrix product
array([[5, 4],
[3, 4]])
>>> A.dot(B) # another matrix product
array([[5, 4],
[3, 4]])
Operation with assignment (e.g. +=, *=) don't implicitly cast between types. Running <float-array> += <int-array>
will fail with an error. Outside of operations with assignment, assigning a new variable the arithmetical result of an operation involving two other arrays of different types will create an array using a type with enough precision to store the result. An example describes this best:
>>> a = np.ones(3, dtype=np.int32)
>>> b = np.linspace(0,pi,3)
>>> b.dtype.name
'float64'
>>> c = a+b
>>> c
array([ 1. , 2.57079633, 4.14159265])
>>> c.dtype.name
'float64'
>>> d = np.exp(c*1j)
>>> d
array([ 0.54030231+0.84147098j, -0.84147098+0.54030231j,
-0.54030231-0.84147098j])
>>> d.dtype.name
'complex128'
sum()
, min()
, and max()
methods are also present and do what you would expect. But with no arguments, they compute the global result across the entire array. The axis
keyword argument applies the function across the specified axis/dimension.
>>> a = np.random.random((2,3))
>>> a
array([[ 0.18626021, 0.34556073, 0.39676747],
[ 0.53881673, 0.41919451, 0.6852195 ]])
>>> a.sum()
2.5718191614547998 # It summed all of the elements.
>>> a.min()
0.1862602113776709
>>> a.max()
0.6852195003967595
>>> a.sum(axis=1)) # 1 denotes rows, use 0 for columns, use other numbers for higher dimensions
[0.92858841 1.64323074]
>>> a.min(axis=1)
[0.18626021 0.41919451]
>>> a.max(axis=1)
[0.39676747 0.6852195 ]
>>> a.cumsum(axis=1) # cumulative sum along each row
[[0.18626021 0.53182094 0.92858841]
[0.53881673 0.95801124 1.64323074]]
You also have element-wise trigonometric functions (sin
, cos
, tan
among others) and pretty much every other math function available in C. (The C standard library is not known for having extensive math functions.) These are informally called "universal functions" (ufunc).
Indexing arrays
(One-dimensional) arrays can be indexed like regular python lists (indices to the right being higher dimensions, sort of like C) and there is nothing more for me to add here. Arrays with more dimensions can have each axis indexed like a range.
>>> def f(x,y):
... return 10*x+y
...
>>> b = np.fromfunction(f,(5,4),dtype=int)
>>> b
array([[ 0, 1, 2, 3],
[10, 11, 12, 13],
[20, 21, 22, 23],
[30, 31, 32, 33],
[40, 41, 42, 43]])
>>> b[2,3]
23
>>> b[1:3, : ] # each column in the second and third row of b
array([[10, 11, 12, 13],
[20, 21, 22, 23]])
As you can see, np.fromfunction()
takes a function, and applies it to an array of numbers, the elements in each dimension successively increasing from zero to the axis size.
There is one important caveat about indexing multi-dimensional arrays. Missing axes in the range are treated as if they were :
, i.e. entire slices. As an example, b[-1]
is equivalent to b[-1,:]
.
Insert three ellipses in the middle of an index to specify the indexes in between as whole slices. As an example from the documentation, assuming an array x
has 5 dimensions then:
- x[1,2,...] is equivalent to x[1,2,:,:,:],
- x[...,3] to x[:,:,:,:,3] and
- x[4,...,5,:] to x[4,:,:,5,:].
One last indexing operation covered in this section is iterating. A standard for row in b:
language construct iterates over the first axis:
>>> for row in b:
... print(row)
...
[0 1 2 3]
[10 11 12 13]
[20 21 22 23]
[30 31 32 33]
[40 41 42 43]
If the example made things confusing then it would be helpful to not think of indexing in terms of rows and columns anymore, and solely as number of dimensions. For a 3 dimensional array of shape (5,6,7) for example:
- Dimension one, 5 is the length of the "depth", "volume" dimension
- Dimension two, 6 is the length of the "height", column dimension
- Dimension three, 7 is the length of the "width", row dimension.
If the shape was just (5,6), so 2D array, then 5 would be the height and 6 the width.
This example will iterate over every element in the array:
>>> for element in b.flat:
... print(element)
...
0
1
2
3
10
11
12
13
20
21
22
23
30
31
32
33
40
41
42
43
The flat
property shown above is just a generator. If you attempt to expend it with a comprehension, you will end up with a python list (or an ndarray of a single generator object), not an ndarray. Use the ravel()
method to make a long flat one-dimensional array.
Array concatenation
Arrays can be stacked together horizontally (hstack()
method) or vertically (vstack()
method). There is also column_stack()
which will stack arrays along the second-to-last dimension (column).
# Mixing some IPython shell in some examples
In [1]: a = np.floor(10*np.random.random((2,3,2)))
In [2]: b = np.floor(10*np.random.random((2,3,2)))
In [3]: a
Out[3]:
array([[[0., 3.],
[2., 3.]],
[[3., 9.],
[6., 4.]]])
# [0., 2.] column will concatenate with [4., 5., 4.] column below
In [4]: b
Out[4]:
array([[[4., 3.],
[5., 2.],
[4., 2.]],
[[4., 6.],
[7., 1.],
[8., 0.]]])
In [5]: np.column_stack((a,b))
Out[5]:
array([[[0., 3.],
[2., 3.],
[4., 3.],
[5., 2.],
[4., 2.]],
[[3., 9.],
[6., 4.],
[4., 6.],
[7., 1.],
[8., 0.]]])
There is also np.ma.rowstack()
which concatenates arrays across the first dimension (not necessarily rows, as what is seen as "rows" in formatted output is actually the last dimension):
In [1]: np.ma.row_stack((np.zeros((3)),np.ones((3)))).shape
Out[1]: (2, 3)
In [2]: np.ma.row_stack((np.zeros((3,3)),np.ones((3,3)))).shape
Out[2]: (6, 3)
In [3]: np.ma.row_stack((np.zeros((1,3)),np.ones((1,3)))).shape
Out[3]: (2, 3)
In [4]: np.ma.row_stack((np.zeros((1,3,3)),np.ones((1,3,3)))).shape
Out[4]: (2, 3, 3)
Last there is np.r_()
and it's slightly more elaborate cousin np.c_()
. r_
makes a row out of a bunch of numbers and ranges:
In [1]: np.r_[1:4,0,4:9:2]
Out[1]: array([1, 2, 3, 0, 4, 6, 8]
And c_
makes a column out of ndarrays (not ranges):
In [1]: np.c_[np.array([[1,2]]), 0, 0, np.array([[4,5,6]])]
Out[1]: array([[1, 2, 0, 0, 4, 5, 6]])
Splitting arrays
The np.split(a, indices, axis)
function splits an array across a specified axis at the given indices, there may be multiple indices passed (but they should all be in order). The np.hsplit()
allows you to split an array across the second axis (axis=1
), and np.vsplit()
splits across the first axis (axis=0
). Again, first and second axes might not correspond to rows and columns. Column and row are second-to-last and last axes respectively.
>>> a = np.floor(10*np.random.random((2,12)))
>>> a
array([[ 9., 5., 6., 3., 6., 8., 0., 7., 9., 7., 2., 7.],
[ 1., 4., 9., 2., 2., 1., 0., 6., 2., 2., 4., 0.]])
>>> np.hsplit(a,(3,4)) # Split a after the third and the fourth column
[array([[ 9., 5., 6.],
[ 1., 4., 9.]]), array([[ 3.],
[ 2.]]), array([[ 6., 8., 0., 7., 9., 7., 2., 7.],
[ 2., 1., 0., 6., 2., 2., 4., 0.]])]
>>> a = np.floor(10*np.random.random((12,2)))
>>> a
array([[8., 6.],
[9., 9.],
[4., 1.],
[0., 7.],
[3., 8.],
[7., 7.],
[9., 2.],
[2., 6.],
[0., 9.],
[2., 1.],
[7., 8.],
[0., 3.]])
>>> np.vsplit(a,(3,4))
[array([[8., 6.],
[9., 9.],
[4., 1.]]),
array([[0., 7.]]),
array([[3., 8.],
[7., 7.],
[9., 2.],
[2., 6.],
[0., 9.],
[2., 1.],
[7., 8.],
[0., 3.]])]
What you just saw above is splitting by indices. All of these can also split using sections, so it makes resulting arrays of length shape[axis] % section
. np.array_split()
will actually make less sections than that if the axis doesn't divide the section length equally, while np.split()
won't let you pass equally dividing sections at all (and is otherwise the same as np.array_split()
).
Copying arrays
Just assigning a variable an array, or passing it as a function argument, does not copy it. There are two different types of copies that can be done on ndarrays. They are shallow copy and deep copy. The shallow copy copies the numpy object but not the elements. This means that you can resize a shallow copy without affecting the object copied from, but if you change the elements inside, the elements will be changed inside both objects.
>>> a = np.arange(12)
>>> c = a.view()
>>> c is a
False
>>> c.base is a # c is a view (shallow copy) of the data owned by a
True
>>> c.flags.owndata
False
>>>
>>> c.shape = 2,6 # a's shape doesn't change
>>> a.shape
(3, 4)
>>> c[0,4] = 1234 # a's data changes
>>> a
array([[ 0, 1, 2, 3],
[1234, 5, 6, 7],
[ 8, 9, 10, 11]])
>>> d = a.copy() # a new array object with new data is created (deep copy)
>>> d is a
False
>>> d.base is a # d doesn't share anything with a
False
>>> d[0,0] = 9999
>>> a
array([[ 0, 10, 10, 3],
[1234, 10, 10, 7],
[ 8, 10, 10, 11]])
Deep copying is useful when you have a slice of the copied object. It points to the original object so if you want to get rid of the original object because it's too large, you can make a deep copy of the slice, which shouldn't be an expensive operation. If you just make a shallow copy or no copy of the array, then the array data won't actually be deleted by del
.
And we're done
We haven't yet scratched the surface of numpy, people. But we got a long read. There will be more numpy guides here in the future.
Much material was sourced from the numpy manual.
If you see anything incorrect here please let me know so I can fix it.
Image by Arek Socha from Pixabay
Corrections
I realized that ndarray
is a different method from array
. array
will create all the arrays described above, but ndarray
actually takes the shape of the array to create, instead of values to fill the array with. Because no values are specified to ndarray
, arrays created with ndarray
are initialized with garbage values. Sorry for the confusion this caused.
Top comments (0)