If we put aside SIMD extensions (SSE, MMX, etc.), locking into a specific CPU architecture, and making changes to existing code, and we don't test perfectly parallelizable tasks on a high-end server with 88 cores and hyper-threading... will Mojo still be 68,000 times faster?
My first takeaway - Mojo is a completely different language. You can't start typing standard Python in a `.mojo` file and expect Mojo to compile it right away.
Second takeaway - there are no arrays/lists :) I had to create my own implementation:
```mojo
struct Array[T: AnyType]:
    var data: Pointer[T]
    var size: Int

    fn __init__(inout self, size: Int):
        self.size = size
        self.data = Pointer[T].alloc(self.size)

    fn __getitem__(self, i: Int) -> T:
        return self.data.load(i)

    fn __setitem__(self, i: Int, value: T):
        self.data.store(i, value)

    fn __moveinit__(inout self, owned existing: Self):
        self.data = existing.data
        self.size = existing.size

    fn __del__(owned self):
        self.data.free()
```
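For readers coming from Python, here is a rough, purely illustrative analogue of that struct backed by a raw `ctypes` buffer. Note the limits of the analogy: CPython reclaims the buffer automatically, so the Mojo `__del__` (manual free) and `__moveinit__` (ownership transfer) have no real counterpart here.

```python
import ctypes

class Array:
    """Rough Python analogue of the Mojo Array struct above.

    Illustrative only: the buffer is freed by CPython's memory
    management, so there is no equivalent of Mojo's __del__ /
    __moveinit__ (manual free / ownership transfer).
    """
    def __init__(self, size):
        self.size = size
        # Zero-initialised block of `size` 64-bit integers.
        self.data = (ctypes.c_int64 * size)()

    def __getitem__(self, i):
        return self.data[i]

    def __setitem__(self, i, value):
        self.data[i] = value
```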
3rd takeaway - there's a Rust-like ownership memory-management model (no garbage collector). I had to use the `^` transfer operator and implement `Array.__moveinit__()` so that an array created in a function could be returned:
```mojo
fn mandelbrot() -> Array[Int]:
    let output = Array[Int](width * height)
    for h in range(height):
        let cy = min_y + h * scaley
        for w in range(width):
            let cx = min_x + w * scalex
            let i = mandelbrot_0(ComplexFloat64(cx, cy))
            output[width * h + w] = i
    return output^  # transfer ownership
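For comparison, a rough Python equivalent of the same function, using Python's built-in `complex` type in place of `ComplexFloat64`. The constants (image size, bounds, iteration limit) are illustrative, not the ones from the benchmark:

```python
# Hypothetical Python counterpart of the Mojo mandelbrot() above.
# All constants are illustrative; the post does not list the real values.
MAX_ITERS = 200
min_x, max_x = -2.0, 0.6
min_y, max_y = -1.5, 1.5
width, height = 96, 96
scalex = (max_x - min_x) / width
scaley = (max_y - min_y) / height

def mandelbrot_0(c):
    # Escape-time iteration: count steps until |z| > 2 (or give up).
    z = 0j
    for i in range(MAX_ITERS):
        z = z * z + c
        if abs(z) > 2:
            return i
    return MAX_ITERS

def mandelbrot():
    output = [0] * (width * height)
    for h in range(height):
        cy = min_y + h * scaley
        for w in range(width):
            cx = min_x + w * scalex
            output[width * h + w] = mandelbrot_0(complex(cx, cy))
    return output
```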
4th - there's out-of-the-box interop with Python, though you need some manual imports before you can touch `.py` files or use PyPI modules.
5th - it's still in early development and many things are still missing (docs, types, community), and it currently works only on Intel CPUs under Linux. Yet it is very fast :)
I took the Mandelbrot example from the Mojo blog and reimplemented it in 4 flavors:
- `mandelbrot.py` - baseline implementation
- `mandelbrot.🔥` - a translation of the Python code into Mojo. It was not as straightforward as expected:
  - no arrays; the `Tensor` structure can't be used as an alternative as long as you don't want to use SIMD
  - different type names, e.g., `Int` is written with an uppercase letter
  - issues interoperating with Python's `time` module, so I used Mojo's alternative
  - the `let` keyword creates immutable variables
  - ownership/transferring of the return value via the `^` operator
- `mandelbrot_numba.py` - all I did was clone the Python file, import Numba, and put 2 `@njit` decorators on the functions
- `mandelbrot_numba_prange.py` - although I didn't intend to add parallelization to this mixture, it was so easy with Numba (far fewer steps than with Mojo) that I couldn't resist. I simply changed the decorators (`parallel=True`) and replaced 2 `range()` calls with `prange()`
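The Numba changes really are that small. A sketch of the pattern (the kernel below is illustrative, not the benchmark code; the decorator fallback lets the snippet run even where Numba is not installed):

```python
# Sketch of the Numba pattern described above. The @njit decorator is
# Numba's standard API; the except branch substitutes a no-op decorator
# so the code also runs without Numba installed.
try:
    from numba import njit
except ImportError:
    def njit(*args, **kwargs):  # no-op stand-in when Numba is absent
        def wrap(f):
            return f
        return wrap

MAX_ITERS = 200

@njit()
def escape_count(cx, cy):
    # Classic escape-time loop operating on real/imaginary parts.
    x = y = 0.0
    for i in range(MAX_ITERS):
        x, y = x * x - y * y + cx, 2.0 * x * y + cy
        if x * x + y * y > 4.0:
            return i
    return MAX_ITERS

# The prange variant changes exactly two things per function:
#   @njit(parallel=True)          instead of @njit()
#   for h in prange(height): ...  instead of range(height)
# (prange is imported from numba and parallelises the outer loop.)
```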
I tested these files (`python3 'file name'` and `mojo 'file name'`) on the same Linux VM:
- Ubuntu 20.04.3 LTS, 64-bit, Intel Core i5-8257U @ 1.4GHz x 2, VMWare Workstation Player 17.0.1
| Language/variant | Time (seconds) | x Baseline |
|---|---|---|
| Python + Numba | 0.68 | x 15.9 |
| Python + Numba (fastmath)* | 0.64 | x 16.9 |
| Python + Numba (prange) | 0.38 | x 28.4 |
| Mojo* | 0.32 | x 33.8 |
(*) When checking the results, I noticed minor discrepancies in the produced sets. Apparently, Mojo and Numba with the `fastmath` flag use different floating-point behaviour: for some corner cases, rounding/comparisons differed for the same input parameters.
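Such discrepancies are consistent with what fast-math style optimisations permit: the compiler may reorder floating-point operations, and floating-point addition is not associative, so the last bits of a result can change, which is enough to flip a threshold comparison in a corner case. A toy illustration in plain Python:

```python
# Floating-point addition is not associative. A compiler allowed to
# reorder operations (as under fastmath) can therefore produce values
# that differ in the last bits for the same inputs.
a, b, c = 0.1, 0.2, 0.3
left = (a + b) + c
right = a + (b + c)
print(left == right)  # False
print(left, right)
```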
I had a misconception about Mojo: that it could be a drop-in substitute for Python, i.e., if you had some Python code you wanted to accelerate, you could copy and paste it into a Mojo project and be covered. That turned out not to be the case.
At the same time, Numba is the right tool for quick performance wins with an existing Python code base.
As pointed out by a colleague of mine, the Python implementation used NumPy, which is a bit of cheating, and he suggested using Python without third-party libraries...
This pure version gave 0.29 seconds without utilising Numba's parallelisation and 0.19 seconds with `prange()`. This is the best result!
As pointed out by a different person, while the NumPy and Mojo versions used a separate class/struct for handling complex numbers, the variant without NumPy operated on the two numbers (real and imaginary parts) directly. That is clearly an optimisation trick and doesn't help code readability.
Hence the 3rd version of the Python implementation. This time it defines a `Complex` class with 3 operations and uses its instances to handle the complex-number math.
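The post doesn't say which three operations that `Complex` class defines. Assuming addition, multiplication, and a squared-magnitude check (the three things the escape-time loop needs), a minimal version could look like this:

```python
class Complex:
    """Minimal complex-number class. The three operations chosen here
    (add, multiply, squared magnitude) are an assumption -- the post
    does not list which three operations its Complex class defines."""

    def __init__(self, re, im):
        self.re = re
        self.im = im

    def __add__(self, other):
        return Complex(self.re + other.re, self.im + other.im)

    def __mul__(self, other):
        # (a+bi)(c+di) = (ac - bd) + (ad + bc)i
        return Complex(self.re * other.re - self.im * other.im,
                       self.re * other.im + self.im * other.re)

    def abs2(self):
        # Squared magnitude: avoids a sqrt in the escape test
        # (compare z.abs2() > 4 instead of abs(z) > 2).
        return self.re * self.re + self.im * self.im
```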
And here are the numbers:
- Pure Python - ~27 minutes (1,672 seconds)
- Numba, `@njit()` - 8.3 seconds (a ~200x boost)
- Numba, `@njit()` and `prange()` - 4.3 seconds