Abolo Samuel Isaac
Multiprocessing in Python (Part 1)

Many of you have probably been in a situation where you need to carry out multiple tasks, or repeat the same action on multiple items, like doing your homework, or even something as small as doing your laundry. Everything is so much easier when we can do several things at once, like having multiple washing machines for our laundry, or five people working on our homework.

The same principle applies to computing. There are times when we have lots of data and want to perform the same action on every item. The problem is that the action takes time, and we have lots of data, so running it item by item slows our program down.

import time

def our_function():
    print("Processing stuff...")
    time.sleep(5)  # simulate a task that takes 5 seconds
    print("Done")

def normal_linear_method():

    # Run the task three times, one after the other
    our_function()
    our_function()
    our_function()

normal_linear_method()
# Time taken: about 15 seconds

Let’s assume it takes exactly 5 seconds to complete the action or function on the data. If we have 100 units of data to process, it's going to take us 500 seconds, which is about 8 minutes of our time. What if I told you there was a way we could speed things up from 8 minutes back to our unit time of 5 seconds?

Multithreading in Python

The first technique we will use to solve our problem is something called multithreading. Multithreading works by rapidly switching context (basically the state of the task being worked on at the moment) so that an illusion of parallel processing is achieved. This concept is also known as concurrency. In CPython, the Global Interpreter Lock (GIL) allows only one thread to execute Python bytecode at a time, which is why threads shine when tasks spend most of their time waiting rather than computing.

# Example of task speed-up using multithreading

from threading import Thread
import time

def using_multithreading():

    # Our threads (our_function is the 5-second task defined earlier)
    t1 = Thread(target=our_function)
    t2 = Thread(target=our_function)
    t3 = Thread(target=our_function)

    # Starting our threads
    t1.start()
    t2.start()
    t3.start()

    # We join the threads so our main thread
    # waits for them to complete before terminating

    t1.join()
    t2.join()
    t3.join()

using_multithreading()
# Time taken: about 5 seconds
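Since threads share the same memory, one simple way to collect their output is to have each thread write into a shared list. A minimal sketch (the worker function and the numbers here are illustrative, not from the example above):

```python
# Threads share memory, so they can write results into a shared list.
from threading import Thread

results = []

def worker(x):
    # list.append is atomic in CPython, so no lock is needed here
    results.append(x * 2)

threads = [Thread(target=worker, args=(i,)) for i in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results))  # [0, 2, 4, 6, 8]
```

This convenience is a key difference from processes, which do not share memory and need explicit channels to pass results back.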

Multiprocessing in Python

The second technique we will use to solve our problem is multiprocessing. While multithreading in Python relies on context switching, multiprocessing runs processes truly in parallel. Each process gets its own copy of the program's memory and its own Python interpreter, and can run on a separate CPU core.

# Example of task speed up using multiprocessing
import time
from multiprocessing import Process

def using_multiprocessing():
    # Our processes
    p1 = Process(target=our_function)
    p2 = Process(target=our_function)

    # Starting our processes
    p1.start()
    p2.start()
    p1.join()
    p2.join()

if __name__ == '__main__':

    start = time.perf_counter()
    using_multiprocessing()
    stop = time.perf_counter()

    print("Time taken {}".format(stop-start))

Multiprocessing vs Multithreading: Parallelism vs Concurrency

Both multiprocessing and multithreading come in handy. The question is: when should we use which?

  • We use multithreading for IO-bound operations, like reading data from a file, or polling data from a server.
  • We use multiprocessing for CPU-bound operations, like image processing, training a machine learning model, big data processing, etc.
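To see the difference for yourself, here is a small sketch (the countdown function and the numbers are illustrative) that runs the same CPU-bound task with a thread pool and a process pool. On a multi-core machine, the process pool should finish noticeably faster, because the GIL prevents threads from computing in parallel:

```python
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def count_down(n):
    # Pure computation: this is CPU-bound work
    while n > 0:
        n -= 1
    return n

if __name__ == "__main__":
    work = [5_000_000] * 4

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=4) as ex:
        list(ex.map(count_down, work))
    print("threads  :", time.perf_counter() - start)

    start = time.perf_counter()
    with ProcessPoolExecutor(max_workers=4) as ex:
        list(ex.map(count_down, work))
    print("processes:", time.perf_counter() - start)
```

If you swap `count_down` for a `time.sleep` call, the thread pool does just as well, which is exactly the IO-bound case.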

Running multiple processes at once

There are times when we want to run a function on a sequence of data. Say we have a list of 100 units of data, and we would like to apply our function to all of them in parallel or concurrently. There are different approaches we can take:

Approach 1: iteratively create processes and start them

In this approach, we use a loop to create a process for each unit of data and start them all. The problem with this approach is that we can't easily get the output of the processes back in the main process.

import time
from multiprocessing import Process

# A stand-in task and dataset (illustrative)
def operation(x):
    time.sleep(5)
    return x * 2

data = list(range(100))

def multiple_processes():

    # Spawn a process for each unit of data
    processes = [
        Process(target=operation, args=(x,)) 
        for x in data
    ]

    for process in processes:
        # Iteratively start all processes
        process.start()

    for process in processes:
        process.join()

    return 

if __name__ == '__main__':

    start = time.perf_counter()
    multiple_processes()
    stop = time.perf_counter()

    print("Time taken {}".format(stop-start))
    # time taken: about 8 seconds
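If we do want output back from raw Process objects, one option is to pass a multiprocessing.Queue that each worker writes into. A sketch, with an illustrative worker function:

```python
from multiprocessing import Process, Queue

def worker(x, queue):
    # Send the result back to the parent through the queue
    queue.put(x * 2)

if __name__ == '__main__':
    data = [1, 2, 3, 4]
    queue = Queue()
    processes = [Process(target=worker, args=(x, queue)) for x in data]
    for p in processes:
        p.start()
    # Drain the queue before joining, to avoid blocking on a full queue
    results = [queue.get() for _ in data]
    for p in processes:
        p.join()
    print(sorted(results))  # [2, 4, 6, 8]
```

This works, but the bookkeeping adds up quickly, which is what the pool-based approaches below make easier.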

Approach 2: The ProcessPoolExecutor

In this approach, we use something called a pool, which is an easier and neater way to manage our computing resources. Although this is slower than spawning all the processes at once (the pool only runs a few workers at a time), it's much neater and lets us use the output of those processes in our main process.

# Using multiprocessing with ProcessPoolExecutor
import time
from concurrent.futures import \
    ProcessPoolExecutor, as_completed

# A stand-in task and dataset (illustrative)
def operation(x):
    time.sleep(5)
    return x * 2

data = list(range(100))


def multiple_processes_pooling():

    with ProcessPoolExecutor() as executor:
        process_futures = [
            executor.submit(operation, x) 
            for x in data
        ]
        results = [
            p.result() 
            for p in 
            as_completed(process_futures)
        ]

        print(results)


if __name__ == '__main__':

    start = time.perf_counter()
    multiple_processes_pooling()
    stop = time.perf_counter()

    print("Time taken {}".format(stop-start))
    # time taken: about 50 seconds
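One thing to keep in mind: as_completed yields futures in the order they finish, not the order they were submitted, and max_workers caps how many processes run at once. A small sketch with an illustrative task:

```python
import time
from concurrent.futures import ProcessPoolExecutor, as_completed

def task(x):
    time.sleep(0.1 * x)  # larger inputs finish later
    return x * 2

if __name__ == '__main__':
    data = [3, 1, 2]
    # Only two processes run at a time; the third waits for a free worker
    with ProcessPoolExecutor(max_workers=2) as executor:
        futures = [executor.submit(task, x) for x in data]
        # Results arrive in completion order, not submission order
        results = [f.result() for f in as_completed(futures)]
    print(sorted(results))  # [2, 4, 6]
```

If the output order matters, either keep a mapping from future to input, or use executor.map, shown next.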

Approach 3: ProcessPoolExecutor().map

In this approach, instead of submitting tasks to our pool executor one by one, we use the executor.map method to submit all of the data in the list at once. It returns the results of all the completed tasks, in the same order as the input.

import time
from concurrent.futures import ProcessPoolExecutor

# A stand-in task and dataset (illustrative)
def operation(x):
    time.sleep(5)
    return x * 2

data = list(range(100))

# Using the executor.map
def pooling_map():

    with ProcessPoolExecutor() as executor:
        results = executor.map(operation, data)

        print([res for res in results])

if __name__ == '__main__':

    start = time.perf_counter()
    pooling_map()
    stop = time.perf_counter()

    print("Time taken {}".format(stop-start))
    # time taken: about 50 seconds
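executor.map also accepts a chunksize parameter, which batches several items to each worker at once to cut down on inter-process overhead. A sketch with an illustrative task (names and sizes are not from the article):

```python
from concurrent.futures import ProcessPoolExecutor

def double(x):
    return x * 2

if __name__ == '__main__':
    data = list(range(10))
    with ProcessPoolExecutor() as executor:
        # chunksize=5 ships items to workers in batches of 5,
        # and results come back in input order
        results = list(executor.map(double, data, chunksize=5))
    print(results)  # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```

For very long lists of cheap tasks, a larger chunksize can make a noticeable difference, since each batch costs only one round of pickling and IPC.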

Very Important to remember

If you look at the time outputs, you'll notice that the time taken isn't exactly the unit time. Four main factors affect this.

  • The computer in use affects the timing, as do other programs running on your PC. This code was tested on a 7th-generation Intel Core i5.

  • It takes some time (typically milliseconds) for our program to set up our processes and start them.

  • When there are more processes than CPU cores, the operating system queues the pending processes and schedules them for us.

  • And finally, it takes some time for our program to properly shut the processes down.

That being said, it's important to note that we should only use multiprocessing when there's a lot of data and the operation takes a long time to complete.
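As a rough illustration of that overhead, here is a sketch (the task and sizes are illustrative) where the work per item is so small that a plain loop beats the process pool:

```python
import time
from concurrent.futures import ProcessPoolExecutor

def tiny(x):
    return x + 1  # almost no work per item

if __name__ == '__main__':
    data = list(range(1000))

    start = time.perf_counter()
    loop_results = [tiny(x) for x in data]
    print("plain loop  :", time.perf_counter() - start)

    start = time.perf_counter()
    with ProcessPoolExecutor() as executor:
        pool_results = list(executor.map(tiny, data))
    print("process pool:", time.perf_counter() - start)
    # The pool is slower here: starting workers and shipping data
    # between processes costs more than the work itself.
```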

Conclusion

  • Multiprocessing and Multithreading help us to speed up our programs.

  • Multiprocessing is most suited for CPU-bound operations, like machine learning and data processing.

  • Multithreading is most suited for IO-bound operations, like communicating with servers, or the file system.

  • Multiprocessing is not a magic wand; don't use it unless you have to, or it could actually slow down your code.
