The other day, I helped a friend with some Project Euler problems. Those are little programming puzzles that involve maths and coding. I am really quite fond of them, and I recommend you to try some if you want to hone your skills and you are not afraid of trying something a bit difficult.

One thing that comes up in some of these challenges is needing prime numbers. As the challenges are all about writing code, it means needing to implement a prime number generator.

A prime number is a natural number that is only divisible by 1 and itself. Some examples are: 2, 3, 7, 11, 13, 17, and 19.

There are quite a few strategies that can be used to generate them, and in fact quite a few more efficient than the one I'm going to discuss here. The point of this is not to find the best strategy, but rather to see how it can be implemented well and optimized.

## Generating prime numbers up to one million

This is perhaps the most naive solution for generating prime numbers up to on million in Python. In this solution, we simply iterate through numbers up to one million, and then try to see if they are divisible by anything other than 1 and itself.

```
primes = []
for candidate in range(1000000):
is_prime = True
for divisor in range(2, candidate - 1):
if candidate % divisor == 0:
is_prime = False
break
if is_prime:
primes.append(candidate)
```

This solution is incredibly inefficient, it takes forever to finish. We can do a lot better than this. There is three optimizations that we can implement:

- We don't even need to check even numbers, because we know that they are divisible by two. So we start our primes list with
`[2]`

and then skip over all even candidate numbers in the first place. This means less candidates to try. - All positive natural numbers are either prime or the product of (zero) or more primes. This means that instead of checking all candidate primes for divisibility by
`[2..candidate-1]`

, we can simply check them for divisibility by the primes we have already found. This means less division tests. - We only need to check for divisibility by primes up to
`sqrt(candidate)`

, because for a candidate`c`

to be divisible by`a`

, it means there exists a solution for the formula $c = a \cdot b$ . If a is larger than the square root,`b`

must be smaller than the square root. For this reason, it is sufficient to check for divisors up to the square root. This also means that we need less division tests.

With these optimisations in place, our code looks like this:

```
from math import sqrt
primes = [2]
maximum = 1000000
for num in range(3, maximum, 2):
is_prime = True
square_root = sqrt(num)
for prime in primes:
if num % prime == 0:
is_prime = False
break
if prime > square_root:
break
if is_prime:
primes.append(num)
```

The runtime of this solution improved to only taking about 940ms on my machine, demonstrating that these optimisations have massively improved the implementation.

## Implementing it in Rust

I will translate the algorithm into Rust here to show you what it would look like and how well it performs.

I have to say that this is kind of cheating, because Rust is a compiled language, which means that the compiler has optimization opportunities that an interpreted language such as Python does not have.

Creating a new, empty Rust project with `cargo new primes`

, we can translate the Python version into Rust, only making use of iterators to sift through the list of primes to try dividing.

```
fn main() {
let mut primes = vec![2];
let maximum: u64 = 1000000;
for candidate in 3..maximum {
let square_root = (candidate as f64).sqrt() as u64 + 1;
let is_prime = primes
.iter()
.take_while(|p| p <= &&square_root)
.all(|p| candidate % p != 0);
if is_prime {
primes.push(candidate);
}
}
}
```

What we do here is very similar to what we did in the Python version:

- We start out with an array of primes, pre-filled with the first prime,
- We iterate through the numbers 3 up to our maximum,
- For every candidate, we determine the square root. If none of the primes up to the square root of the candidate are a divisor, we consider the candidate to be prime and append it to the list.

One difference to note here: running this in release mode (using `cargo run --release`

) is so fast that it only takes about 30ms to generate one million primes. In the time that Python can generate on million primes, Rust can generate 20 million primes. This is to be expected, because Rust can take advantage of compilation.

## Introducing Parallelism

These solutions so far have been lacking in one aspect: they have not been able to take advantage of multiple CPU cores. How hard would it be to take the current Rust solution and introduce some parallelism?

The rayon crate in Rust is designed to help with exactly that: making it easy to write code in a multi-threaded way. I have never worked with it, so let's give it a go.

We can get started by adding rayon to the list of dependencies:

```
cargo add rayon
```

Now, rayon gives us some interesting primitives to work with. For example, any Rust code that uses iterators can be transformed into running in parallel by replacing `.iter()`

with `.par_iter()`

. Behind the scenes, rayon handles starting a thread pool, dividing up the work, and waiting for the threads to finish.

For example, we could replace the iterator-based code that we use to check if a number is divisible by some primes:

```
use rayon::prelude::*;
let is_prime = primes
.par_iter()
.filter(|c| c < &&square_root)
.all(|p| candidate % p != 0);
```

However, doing this is actually bad for performance. This iteration walks the list of primes linearly, and as such it is actually very fast already (good locality, friendly for CPU caches). Breaking this down to run on multiple threads does more harm than good, and brings our runtime from 30ms up to 17s. Can we find a better optimization opportunity?

We have to try to break down the algorithm into bigger chunks. So rather than breaking down the loop where we test individual divisors, we can break it down into testing different candidate primes in multiple threads.

One thing we must fix before doing that is dealing with mutability. Our list of primes is used for two purposes: we use it to check candidate numbers for being prime by checking divisibility (read-only usage) and we append new primes to the end of it (read-write usage). The issue here is that when we work multi-threaded, it becomes problematic to use the same data in both a read-only and read-write manner. The easy solution to that is splitting that up into two separate variables: the `primes`

array is now only used read-only, and we introduce a new `new_primes`

array that we append new primes to. Occasionally we append the `new_primes`

array onto `primes`

.

Here is what all of this looks like together. Note that I've had to import rayon's prelude (importing that gives us access to all of rayon's iterator adapters) and we are using `par_extend()`

, which will extend an array in parallel. So in this implementation, we will test different candidates for primality at the same time.

```
use rayon::prelude::*;
fn main() {
let mut primes = vec![2];
let maximum: u64 = 1_000_000;
loop {
let mut new_primes = vec![];
let last_prime = *primes.last().unwrap();
if last_prime > maximum {
break;
}
new_primes.par_extend((last_prime..last_prime*2)
.into_par_iter()
.filter(|candidate| {
let square_root = (*candidate as f64).sqrt() as u64 + 1;
primes
.iter()
.take_while(|p| p <= &&square_root)
.all(|p| candidate % p != 0)
})
);
primes.append(&mut new_primes);
}
}
```

This implementation cuts down the time it takes to compute primes up to one million down to about 11ms on my machine.

## Results

Here are the raw results from the rest runs:

Algorithm | Language | Time (relative change) |
---|---|---|

Naive (1 million) | Python | 24m7s |

Better (1 million) | Python | 940ms (153936% faster) |

Better (1 million) | Rust | 30ms (3133% faster) |

Parallel (1 million) | Rust | 11ms (273% faster) |

We have learned that it is quite easy to generate prime numbers in both Python and Rust. Generating them quickly takes a little bit of experimentation. Rayon helps a lot in making things multi-threaded, it's almost magic. But we still had to identify the best places to introduce parallelism, and we had to change the algorithm slightly.

Just for fun, I decided to see how many primes I could generate with the Rayon implementation. After running it for 2m25s, it went up to the prime 5222275627, which I thought was quite impressive. Running it for longer, about 30m (close to how long the naive Python implementation took to generate primes up to one million), it was able to come up with the prime 41778204911.

It was a lot of fun to figure out how to generate primes quickly. I hope that you learned something, maybe you will find a use-case for rayon yourself. It was a pleasure to work with.

## Top comments (0)