Python is one of the most popular programming languages today. It’s an interpreted and highlevel language with elegant and readable syntax. However, Python is generally significantly slower than Java, C#, and especially C, C++, or Fortran. Sometimes performance issues and bottlenecks might seriously affect the usability of applications.
Fortunately, in most cases, there are solutions to improve the performance of Python programs. There are choices developers can take to improve the speed of their code. For example, the general advice is to use optimized Python builtin or thirdparty routines, usually written in C or Cython. Besides, it’s faster to work with local variables than with globals, so it’s a good practice to copy a global variable to a local before the loop. And so on.
Finally, there’s always the possibility to write own Python functions in C, C++, or Cython, call them from the application and replace Python bottleneck routines. But this is usually an extreme solution and is rarely necessary for practice.
Often performance issues arise when using Python loops, especially with a large number of iterations. There is a number of useful tricks to improve your code and make it run faster, but that’s beyond the scope here.
This article compares the performance of several approaches when summing two sequences elementwise:
 Using the while loop
 Using the for loop
 Using the for loop with list comprehensions
 Using the thirdparty library
numpy
However, performance isn’t the only concern when developing software. Moreover, according to Donald Knuth in The Art of Computer Programming, “premature optimization is the root of all evil (or at least most of it) in programming”. After all, “readability counts”, as stated in the Zen of Python by Tim Peters.
Problem Statement
We’ll try to sum two sequences elementwise. In other words, we’ll take two sequences (list or arrays) of the same size and create the third one with the elements obtained by adding the corresponding elements from the inputs.
Preparation
We’ll import Python builtin package random and generate a list r that contains 100.000 pseudorandom numbers from 0 to 99 (inclusive):
import random
r = [random.randrange(100) for _ in range(100_000)]
We’ll also use the thirdparty package numpy
, so let’s import it:
import numpy as np
And we’re ready to go!
Simple Loops
Let’s first see some simple Python loops in action.
Using Pure Python
We’ll start with two lists with 1.000 elements each. The integer variable n represents the length of each list. The lists x and y are obtained by randomly choosing n elements from r:
n = 1_000
x, y = random.sample(r, n), random.sample(r, n)
Let’s see what is the time needed to get a new list z with n elements, each of which is the sum of the corresponding elements from x and y.
We’ll test the performance of the while loop first:
%%timeit
i, z = 0, []
while i < n:
z.append(x[i] + y[i])
i += 1
The output is:
160 µs ± 1.44 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Please, note that the output of timeit
depends on many factors and might be different each time.
The for loop in Python is better optimized for the cases like this, that is to iterate over collections, iterators, generators, and so on. Let’s see how it works:
%%timeit
z = []
for i in range(n):
z.append(x[i] + y[i])
The output is:
122 µs ± 188 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In this case, the for loop is faster, but also more elegant compared to while.
List comprehensions are very similar to the ordinary for loops. They are suitable for simple cases (like this one). In addition to being more compact, they are usually slightly faster since some overhead is removed:
%%timeit
z = [x[i] + y[i] for i in range(n)
The output is:
87.2 µs ± 490 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Please, have in mind that you can’t apply list comprehensions in all cases when you need loops. Some more complex situations require the ordinary for or even while loops.
Using Python with NumPy
numpy
is a thirdparty Python library often used for numerical computations. It’s especially suitable to manipulate arrays. It offers a number of useful routines to work with arrays, but also allows writing compact and elegant code without loops.
Actually, the loops, as well as other performancecritical operations, are implemented in numpy
on the lower level. That allows numpy
routines to be much faster compared to pure Python code. One more advantage is the way numpy
handles variables and types.
Let’s first use the lists of Python integers x and y to create corresponding numpy
arrays of 64bit integers:
x_, y_ = np.array(x, dtype=np.int64), np.array(y, dtype=np.int64)
Summing two numpy
arrays x_ and y_ elementwise is as easy as x_ + y_. But let’s check the performance:
%%timeit
z = x_ + y_
The output is:
1.03 µs ± 5.09 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
That’s almost 85 times faster than when we used list comprehensions. And the code is extremely simple and elegant. numpy
arrays can be a much better choice for working with large arrays. Performance benefits are generally greater when data is bigger.
It might be even better. If we’re OK with using 32bit integers instead of 64bit, we can save both memory and time in some cases:
x_, y_ = np.array(x, dtype=np.int32), np.array(y, dtype=np.int32)
We can add these two arrays as before:
%%timeit
z = x_ + y_
The output is:
814 ns ± 5.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
The results obtained when n is larger, that is 10_000 and 100_000 are shown in the table below. They show the same relationships, as in this case, with even higher performance boost when using numpy
.
Nested Loops
Let’s now compare the nested Python loops.
Using Pure Python
We’ll work with two lists called x and y again. Each of them will contain 100 inner lists with 1.000 pseudorandom integer elements. Thus, x and y will actually represent the matrices with 100 rows and 1.000 columns:
m, n = 100, 1_000
x = [random.sample(r, n) for _ in range(m)]
y = [random.sample(r, n) for _ in range(m)]
Let’s see the performance of adding them using two nested while loops:
%%timeit
i, z = 0, []
while i < m:
j, z_ = 0, []
while j < n:
z_.append(x[i][j] + y[i][j])
j += 1
z.append(z_)
i += 1
The output is:
19.7 ms ± 271 µs per loop (mean ± std. dev. of 7 runs, 100 loops each
Again, we can get some performance improvement with the nested for loops:
%%timeit
z = []
for i in range(m):
z_ = []
for j in range(n):
z_.append(x[i][j] + y[i][j])
z.append(z_)
The output is:
16.4 ms ± 303 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In some cases, the nested for loops can be used with lists comprehensions, enabling an additional benefit:
%%timeit
z = [[x[i][j] + y[i][j] for j in range(n)] for i in range(m)]
The output is:
12.1 ms ± 99.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
We can see that in the case of nested loops, list comprehensions are faster than the ordinary for loops, which are faster than while.
In this case, we have 100.000 (100×1.000) integer elements in each list. This example is slightly slower than the one with 100.000 elements and a single loop. That’s the conclusion for all three approaches (list comprehensions, ordinary for, and while loops).
Using Python with NumPy
numpy
is great to work with multidimensional arrays. Let’s use x and y to create corresponding numpy
arrays of 64bit integers:
x_, y_ = np.array(x, dtype=np.int64), np.array(y, dtype=np.int64)
And let’s check the performance:
%%timeit
z = x_ + y_
The output is:
69.9 µs ± 909 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
That’s about 173 times faster than list comprehensions. But it might be even faster if we use 32bit integers:
x_, y_ = np.array(x, dtype=np.int32), np.array(y, dtype=np.int32)
The performance check is done as before:
%%timeit
z = x_ + y_
The output is:
34.3 µs ± 44.6 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
which is twice faster than with 64bit integers.
Results Summary
This table summarizes the obtained results:
Number of elements, m × n  1×1.000  1×10.000  1×100.000  100×1.000 

Python while

160 μs  1.66 ms  17.0 ms  19.7 ms 
Python for

122 μs  1.24 ms  13.4 ms  16.4 ms 
Python for with list comprehension 
87.2 μs  894 μs  8.92 ms  12.1 ms 
numpy with 64bit integers 
1.03 μs  6.36 μs  71.9 μs  69.9 μs 
numpy with 32bit integers 
814 ns  2.87 μs  34.1 μs  34.3 μs 
Conclusions
This article compares the performance of Python loops when adding two lists or arrays elementwise. The results show that list comprehensions were faster than the ordinary for loop, which was faster than the while loop. The simple loops were slightly faster than the nested loops in all three cases.
numpy
offers the routines and operators that can substantially reduce the amount of code and increase the speed of execution. It’s especially useful when working with single and multidimensional arrays.
Please, have in mind that the conclusions or relations among the results obtained here are not applicable, valid, or useful in all cases! They are presented for illustration. The proper way to handle inefficiencies is to discover the bottlenecks and perform own tests.
Discussion
Not really measuring loop performance. The while and for loops are measuring 'append' performance, and the numpy loop fails to include the overhead of constructing the numpy arrays, and getting the results out again.
That said, interesting article, certainly illustrates the idea.
Append is too slow, must compare allocating the full list before the loop
Do the times for
numpy
take into account the time required to copy the Python data to the numpy object? If not, what are the results that way?Numpy code you tried is vectorized so it runs faster. You could also try cython or numba to check the improvement in execution time.