## DEV Community

João M.C. Teixeira


# Python multiprocessing VS Numpy vectorization - one example

Today, I came across a very nice article about parallelization and multiprocessing using Python's `Pool`, together with a new tool named Ray that seamlessly distributes `Pool` over a cluster.

The author used a Monte-Carlo estimation of pi as an example of a procedure that can be parallelized with `Pool` and Ray. The author then compares the performance of the different setups: a single core, local multicore with `Pool`, and finally a cluster-distributed run with Ray. The best performance came, unsurprisingly, from the cluster with Ray:

```
sample: 10_000_000_000
pi ~= 3.141599
Finished in 131.37s
```

I perfectly understand that the pi calculation example was chosen purely for demonstration purposes. However, I found it a very nice opportunity to practice vectorization principles with Numpy. Please go to the original post and read the raw Monte-Carlo sampling implementation using `for` loops.
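For readers who don't click through, the loop-based approach is along these lines (a sketch of the idea, not the original author's exact code):

```python
import random

def estimate_pi(samples):
    """Plain-Python Monte-Carlo estimate of pi using a for loop."""
    inside = 0
    for _ in range(samples):
        # draw a random point in the [-1, 1] square
        x = random.uniform(-1, 1)
        y = random.uniform(-1, 1)
        # count it if it falls inside the unit circle
        if x * x + y * y <= 1:
            inside += 1
    # area of circle / area of square = pi / 4
    return 4 * inside / samples

print(estimate_pi(1_000_000))  # roughly 3.14
```

The ratio of points landing inside the unit circle to the total converges to pi/4, hence the factor of 4.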

How would you convert that example into a vectorized Numpy-based implementation?

This is how I solved it; you are very welcome to comment and help me improve it further.

```python
import time

import numpy as np

def perform(size):
    # draw `size` (x, y) points uniformly from the [-1, 1) square
    xxyy = np.random.uniform(-1, 1, size=(size, 2))
    # distance of each point from the origin
    norm = np.linalg.norm(xxyy, axis=1)
    # points that fall inside the unit circle
    inside = norm <= 1
    return np.sum(inside)

start = time.time()
sample = 10_000_000_000
times = 20
insiders = 0
size = sample // times
print('size: ', size)
for _ in range(times):
    insiders += perform(size)

pi = 4 * insiders / sample
print("pi ~= {}".format(pi))
print("Finished in: {:.2f}s".format(time.time() - start))
```

Using a single core on my laptop, I got:

```
pi ~= 3.1415858588
Finished in: 236.04s
```
```
CPU~Quad core Intel Core i7-8550U (-MT-MCP-)
speed/max~800/4000 MHz
Kernel~4.15.0-99-generic x86_64
Mem~7178.7/32050.2MB
HDD~2250.5GB(56.6% used)
Procs~300
Client~Shell
inxi~2.3.56
```

Notice that the calculation is split into several chunks (`size = sample // times`) so that my laptop won't run out of memory when allocating such large arrays. This splitting does not significantly affect the performance of the calculation.
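A possible refinement I have not benchmarked on the numbers above: since `x**2 + y**2 <= 1` is equivalent to `sqrt(x**2 + y**2) <= 1`, the square root inside `np.linalg.norm` can be skipped entirely. A sketch of that variant:

```python
import numpy as np

def perform_sq(size):
    # same idea as perform(), but compares squared distances:
    # x^2 + y^2 <= 1 is equivalent to norm((x, y)) <= 1,
    # so the sqrt inside np.linalg.norm is unnecessary
    xxyy = np.random.uniform(-1, 1, size=(size, 2))
    return np.count_nonzero((xxyy ** 2).sum(axis=1) <= 1.0)
```

The counting logic is unchanged, so `perform_sq` can replace `perform` in the loop above as a drop-in.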

What do you think?
Cheers,

Umberto Giuriato

Would using a list comprehension instead of the for loop improve the performance?

João M.C. Teixeira

The computational cost here is indeed in generating the random numbers and managing the large arrays. In the end, list comprehensions are also `for` loops. In my hands now, calculating `1_000_000_000` samples:

```python
# (...)
for _ in range(times):
    insiders += perform(size)

# gives
pi ~= 3.14154544
Finished in: 40.36s
```

while,

```python
# (...)
insiders = sum(perform(size) for i in range(times))

# gives
size:  25000000
pi ~= 3.141650028
Finished in: 41.85s
```

Cheers,