Today, I came across a very nice article about parallelization and multiprocessing using Python `Pool` and a new piece of software named Ray, which seamlessly distributes `Pool` over a cluster.

The author used a Monte-Carlo implementation to calculate pi as an example of a procedure that can be parallelized using `Pool` and Ray. The author then compares the performance of all the different possibilities: from a single core in a computer, to local multicore with `Pool`, to, finally, a cluster-distributed approach with Ray. The best performance was, unsurprisingly, on the cluster with Ray, giving:

```
sample: 10_000_000_000
pi ~= 3.141599
Finished in 131.37s
```

I perfectly understand that the pi calculation example was chosen purely for demonstration purposes. However, I found it a very nice opportunity to use it as **an example to practice vectorization principles with Numpy**. Please go to the original post and read the raw Monte-Carlo sampling implementation using `for` loops.
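For reference, a loop-based Monte-Carlo estimate typically looks like the sketch below. It is not the original article's code, only a minimal reconstruction of the general idea; the function name `estimate_pi_loop` and the optional seed are assumptions made for illustration.

```python
import random


def estimate_pi_loop(n_samples, seed=None):
    """Estimate pi by drawing points one at a time in the square [-1, 1] x [-1, 1]."""
    rng = random.Random(seed)  # the seed only makes runs repeatable
    inside = 0
    for _ in range(n_samples):
        x = rng.uniform(-1, 1)
        y = rng.uniform(-1, 1)
        # Count the point if it falls inside the unit circle.
        if x * x + y * y <= 1:
            inside += 1
    # Circle area / square area = pi / 4.
    return 4 * inside / n_samples


print(estimate_pi_loop(100_000, seed=0))
```

Every sample costs one pass through the Python interpreter loop, which is exactly the part that vectorization tries to remove.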

**How would you convert that example into a vectorized Numpy-based implementation?**

Try it out for yourself first before reading the answer below.

**SPOILER ALERT!! YOU ARE ABOUT TO READ THE ANSWER!**

This is how I solved it. You are very welcome to comment and help me improve it further.

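```
import time

import numpy as np


def perform(size):
    # Draw `size` points uniformly in the square [-1, 1) x [-1, 1).
    xxyy = np.random.uniform(-1, 1, size=(size, 2))
    # Distance of each point from the origin.
    norm = np.linalg.norm(xxyy, axis=1)
    # Points inside the unit circle.
    inside = norm <= 1
    return np.sum(inside)


start = time.time()
sample = 10_000_000_000
times = 20  # number of chunks, so each array fits in memory
insiders = 0
size = sample // times
print('size: ', size)
for _ in range(times):
    insiders += perform(size)
pi = 4 * insiders / sample
print("pi ~= {}".format(pi))
print("Finished in: {:.2f}s".format(time.time()-start))
```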

Using a single core on my laptop, I got:

```
pi ~= 3.1415858588
Finished in: 236.04s
```

```
CPU~Quad core Intel Core i7-8550U (-MT-MCP-)
speed/max~800/4000 MHz
Kernel~4.15.0-99-generic x86_64
Mem~7178.7/32050.2MB
HDD~2250.5GB(56.6% used)
Procs~300
Client~Shell
inxi~2.3.56
```
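One possible refinement of `perform` above, offered as an untimed sketch rather than a measured improvement: comparing squared distances avoids the square root, and a seeded `np.random.default_rng` generator makes small test runs repeatable. The name `perform_squared` and the seed value are assumptions for illustration.

```python
import numpy as np


def perform_squared(size, rng):
    """Count points of a uniform sample in [-1, 1)^2 that fall inside the unit circle."""
    xy = rng.uniform(-1.0, 1.0, size=(size, 2))
    # x^2 + y^2 <= 1 is equivalent to sqrt(x^2 + y^2) <= 1, so no square root is needed.
    inside = np.sum(xy * xy, axis=1) <= 1.0
    return int(np.count_nonzero(inside))


rng = np.random.default_rng(0)  # seeded generator, an assumption for repeatability
n = 1_000_000
print("pi ~= {}".format(4 * perform_squared(n, rng) / n))
```

Whether this is actually faster would have to be checked on the machine at hand.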

Notice that the calculation is split into `times` chunks of `size` samples each, so that my laptop won't run out of memory when creating such large arrays. However, that does not significantly affect the performance of the calculation.
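A rough back-of-the-envelope estimate shows why the chunking matters. The sketch below assumes NumPy's default `float64` (8 bytes per value) and counts only the coordinate array, the norm array, and the boolean mask created in `perform(size)`; temporary buffers inside `np.linalg.norm` are ignored, and the helper name `chunk_memory_gb` is hypothetical.

```python
def chunk_memory_gb(size, bytes_per_float=8):
    """Rough memory needed by one call to perform(size), in GB (10**9 bytes)."""
    coords = size * 2 * bytes_per_float  # the (size, 2) array of x, y pairs
    norms = size * bytes_per_float       # one norm per point
    mask = size * 1                      # boolean mask, 1 byte per point
    return (coords + norms + mask) / 1e9


sample = 10_000_000_000
for times in (1, 20):
    print(times, "chunk(s):", chunk_memory_gb(sample // times), "GB per chunk")
```

Under these assumptions, a single chunk would need about 250 GB, while `times = 20` brings each chunk down to about 12.5 GB, which fits in the 32 GB of RAM listed above.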

What do you think?

Cheers,

## Top comments (2)

Would using a list comprehension instead of the for loop improve the performance?

The computational cost here is indeed in generating the random numbers and managing the large arrays. In the end, list comprehensions are also for loops. In my hands now, calculating `1_000_000_000` samples:

while,

Cheers,
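The comparison discussed in this thread can be sketched as follows. This is a minimal, scaled-down version that assumes the same `perform` function as in the post; the helper names `with_for_loop` and `with_list_comprehension` and the small sample size are illustrative choices. No timings are quoted here, since they depend on the machine.

```python
import time

import numpy as np


def perform(size):
    # Same idea as in the post: count uniform points inside the unit circle.
    xxyy = np.random.uniform(-1, 1, size=(size, 2))
    return int(np.sum(np.linalg.norm(xxyy, axis=1) <= 1))


def with_for_loop(size, times):
    insiders = 0
    for _ in range(times):
        insiders += perform(size)
    return insiders


def with_list_comprehension(size, times):
    # Still one Python-level iteration per chunk, just written differently.
    return sum([perform(size) for _ in range(times)])


size, times = 1_000_000, 10
for fn in (with_for_loop, with_list_comprehension):
    start = time.perf_counter()
    insiders = fn(size, times)
    elapsed = time.perf_counter() - start
    pi = 4 * insiders / (size * times)
    print("{}: pi ~= {:.5f}, {:.2f}s".format(fn.__name__, pi, elapsed))
```

Both variants call `perform` the same number of times, so any difference should be small compared with the time spent generating and reducing the arrays.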