Python multiprocessing vs. NumPy vectorization: one example

#python #multiprocessing #numpy #vectorization

Today, I came across a very nice article about parallelization and multiprocessing using Python's Pool and a new tool named Ray, which seamlessly distributes Pool over a cluster.

The author used a Monte Carlo estimate of pi as an example of a procedure that can be parallelized with Pool and Ray, and then compared the performance of the different setups: a single core, local multicore with Pool, and finally a cluster-distributed run with Ray. The best performance was, unsurprisingly, on the cluster with Ray:

sample: 10_000_000_000
pi ~= 3.141599
Finished in 131.37s

I perfectly understand that the pi calculation example was chosen purely for demonstration purposes. However, I saw it as a very nice opportunity to practice vectorization principles with NumPy. Please go to the original post and read the raw Monte Carlo sampling implementation using for loops.
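
For readers who do not have the original post at hand, here is a minimal sketch of what a loop-based Monte Carlo estimate of pi can look like. It is not the original author's code: the function name perform_loop and the seed parameter are my own assumptions, added for illustration. The idea is that points drawn uniformly in the square from -1 to 1 fall inside the unit circle with probability pi/4 (circle area pi over square area 4), so 4 times the observed fraction approximates pi.

```python
import random


def perform_loop(size, seed=None):
    """Count random points in the square [-1, 1] x [-1, 1] that land in the unit circle.

    `seed` is a hypothetical parameter, added only to make runs repeatable.
    """
    rng = random.Random(seed)
    inside = 0
    for _ in range(size):
        x = rng.uniform(-1, 1)
        y = rng.uniform(-1, 1)
        # The point is inside the circle if its squared distance to the origin is <= 1.
        if x * x + y * y <= 1:
            inside += 1
    return inside


if __name__ == "__main__":
    sample = 100_000  # kept small on purpose; a pure-Python loop is slow
    pi = 4 * perform_loop(sample, seed=0) / sample
    print("pi ~= {}".format(pi))
```

Every point costs one pass through the Python interpreter loop, which is the part a vectorized version tries to remove.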

How would you convert that example into a vectorized Numpy-based implementation?

Try it out for yourself first before reading the answer below.

SPOILER ALERT!! YOU ARE ABOUT TO READ THE ANSWER!

This is how I solved it. You are very welcome to comment and help me improve it further.

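import time
import numpy as np


def perform(size):
    xxyy = np.random.uniform(-1, 1, size=(size, 2))  # points in the square [-1, 1) x [-1, 1)
    norm = np.linalg.norm(xxyy, axis=1)  # distance of each point from the origin
    inside = norm <= 1  # True for points within the unit circle
    return np.sum(inside)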


start = time.time()
sample = 10_000_000_000
times = 20
insiders = 0
size = sample // times
print('size: ', size)
for _ in range(times):
    insiders += perform(size)

pi = 4 * insiders / sample
print("pi ~= {}".format(pi))
print("Finished in: {:.2f}s".format(time.time()-start))

Using a single core on my laptop, I got:

pi ~= 3.1415858588
Finished in: 236.04s

CPU~Quad core Intel Core i7-8550U (-MT-MCP-)
speed/max~800/4000 MHz
Kernel~4.15.0-99-generic x86_64
Mem~7178.7/32050.2MB
HDD~2250.5GB(56.6% used)
Procs~300
Client~Shell
inxi~2.3.56

Note that the sampling is split into several batches (the times variable) so that my laptop does not run out of memory when creating such large arrays. However, that does not significantly affect the performance of the calculation.
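
To make the memory argument concrete, here is a rough back-of-the-envelope sketch. The helper batch_bytes is a hypothetical name of mine, not part of the code above. It only adds up the three arrays that perform creates explicitly (the float64 points, the float64 norms, and the boolean mask), so treat the result as a lower bound: NumPy may allocate extra temporaries internally.

```python
import numpy as np


def batch_bytes(size):
    """Lower bound on the bytes held by the arrays created for one batch."""
    float_bytes = np.dtype(np.float64).itemsize  # 8 bytes per value
    bool_bytes = np.dtype(np.bool_).itemsize     # 1 byte per value
    points = size * 2 * float_bytes  # xxyy: shape (size, 2), float64
    norms = size * float_bytes       # norm: shape (size,), float64
    mask = size * bool_bytes         # inside: shape (size,), bool
    return points + norms + mask


sample = 10_000_000_000
for times in (1, 20, 100):
    size = sample // times
    print("times={:>3}  size={:>14,}  ~{:.1f} GB per batch".format(
        times, size, batch_bytes(size) / 1e9))
```

With times = 20, each batch needs at least about 12.5 GB, which fits in the 32 GB of the laptop listed above, while a single batch of all 10_000_000_000 points would need roughly 250 GB.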

What do you think?
Cheers,

Top comments (2)

Umberto Giuriato • May 17 '20

Would using a list comprehension instead of the for loop improve the performance?

João M.C. Teixeira • May 17 '20

The computational cost here is indeed in generating the random numbers and managing the large arrays. In the end, list comprehensions are also for loops. On my machine, calculating 1_000_000_000 samples:

# (...)
for _ in range(times):
    insiders += perform(size)

# gives
pi ~= 3.14154544
Finished in: 40.36s

while,

# (...)
insiders = sum(perform(size) for i in range(times))

# gives
size:  25000000
pi ~= 3.141650028
Finished in: 41.85s
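
One way to check where the time goes is to time the sampling step separately from the norm-and-count step. The sketch below does that for a single small batch; time_steps is a hypothetical helper name, and the numbers it prints will depend on the machine and the NumPy version, so no expected output is shown.

```python
import time

import numpy as np


def time_steps(size):
    """Return (seconds spent sampling, seconds spent on norm + count, points inside)."""
    t0 = time.perf_counter()
    xxyy = np.random.uniform(-1, 1, size=(size, 2))
    t1 = time.perf_counter()
    inside = np.sum(np.linalg.norm(xxyy, axis=1) <= 1)
    t2 = time.perf_counter()
    return t1 - t0, t2 - t1, int(inside)


size = 1_000_000  # small batch, just to compare the two steps
sampling, counting, inside = time_steps(size)
print("sampling:     {:.4f}s".format(sampling))
print("norm + count: {:.4f}s".format(counting))
print("pi ~= {}".format(4 * inside / size))
```

Either way, the Python-level loop over the batches runs only a handful of times, so swapping it for a comprehension or generator expression leaves the heavy work untouched.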

Cheers,