The Python Oracle

How do I parallelize a simple Python loop?

--

Music by Eric Matyas
https://www.soundimage.org
Track title: Ocean Floor

--

Chapters
00:00 Question
00:50 Accepted answer (Score 283)
01:30 Answer 2 (Score 165)
03:54 Answer 3 (Score 104)
04:27 Answer 4 (Score 89)
09:33 Thank you

--

Full question
https://stackoverflow.com/questions/9786...

Accepted answer links:
[multiprocessing]: http://docs.python.org/library/multiproc...

Answer 2 links:
https://blog.dominodatalab.com/simple-pa.../
[asyncio]: https://docs.python.org/3/library/asynci...

Answer 3 links:
[joblib]: https://joblib.readthedocs.io/en/latest/...

Answer 4 links:
[here]: https://docs.python.org/3/library/asynci...
[image]: https://i.stack.imgur.com/ruy9N.png
[image]: https://i.stack.imgur.com/W9XF5.png
[image]: https://i.stack.imgur.com/3NnBC.png
[image]: https://i.stack.imgur.com/abaC6.png

--

Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...

--

Tags
#python #parallelprocessing

#avk47



ACCEPTED ANSWER

Score 307


The CPython implementation currently has a global interpreter lock (GIL) that prevents threads of the same interpreter from concurrently executing Python code. This means CPython threads are useful for concurrent I/O-bound workloads, but usually not for CPU-bound workloads. The naming calc_stuff() indicates that your workload is CPU-bound, so you want to use multiple processes here (which is often the better solution for CPU-bound workloads anyway, regardless of the GIL).

There are two easy ways of creating a process pool in the Python standard library. The first one is the multiprocessing module, which can be used like this:

import multiprocessing

pool = multiprocessing.Pool(4)
out1, out2, out3 = zip(*pool.map(calc_stuff, range(0, 10 * offset, offset)))

Note that this won't work in the interactive interpreter due to the way multiprocessing is implemented.

The second way to create a process pool is concurrent.futures.ProcessPoolExecutor:

import concurrent.futures

with concurrent.futures.ProcessPoolExecutor() as pool:
    out1, out2, out3 = zip(*pool.map(calc_stuff, range(0, 10 * offset, offset)))

This uses the multiprocessing module under the hood, so it behaves identically to the first version.
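
Since the question's calc_stuff and offset aren't shown here, here is a minimal self-contained sketch (the calc_stuff body and the offset value are placeholders, not the original code) showing both standard-library pools together with the if __name__ == '__main__' guard that multiprocessing needs when the script is run normally rather than interactively:

import multiprocessing
from concurrent.futures import ProcessPoolExecutor

def calc_stuff(parameter):
    # placeholder for the question's CPU-bound computation
    return parameter, parameter ** 2, parameter ** 3

if __name__ == '__main__':  # lets worker processes import this module without re-running the pool setup
    offset = 1
    params = range(0, 10 * offset, offset)

    # Option 1: multiprocessing.Pool
    with multiprocessing.Pool(4) as pool:
        out1, out2, out3 = zip(*pool.map(calc_stuff, params))

    # Option 2: concurrent.futures.ProcessPoolExecutor (same machinery underneath)
    with ProcessPoolExecutor() as pool:
        out1, out2, out3 = zip(*pool.map(calc_stuff, params))

    print(out1, out2, out3)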




ANSWER 2

Score 199


from joblib import Parallel, delayed
def process(i):
    return i * i
    
results = Parallel(n_jobs=2)(delayed(process)(i) for i in range(10))
print(results)  # prints [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

The above works beautifully on my machine (Ubuntu; the joblib package was pre-installed there, but it can be installed via pip install joblib).

Taken from https://blog.dominodatalab.com/simple-parallelization/


Edit on Mar 31, 2021: On joblib, multiprocessing, threading and asyncio

  • joblib in the above code uses import multiprocessing under the hood (and thus multiple processes, which is typically the best way to run CPU work across cores - because of the GIL)
  • You can let joblib use multiple threads instead of multiple processes, but this (or using import threading directly) is only beneficial if the threads spend considerable time on I/O (e.g. read/write to disk, send an HTTP request). For I/O work, the GIL does not block the execution of another thread (see the thread-based sketch after the timing example below)
  • Since Python 3.7, as an alternative to threading, you can parallelise work with asyncio, but the same advice applies as for import threading (though in contrast to the latter, only one thread will be used; on the plus side, asyncio has a lot of nice features that are helpful for async programming)
  • Using multiple processes incurs overhead. Think about it: typically, each process needs to initialise/load everything you need to run your calculation. You need to check for yourself whether the above code snippet improves your wall time. Here is another one, for which I confirmed that joblib produces better results:
import time
from joblib import Parallel, delayed

def countdown(n):
    while n>0:
        n -= 1
    return n


t = time.time()
for _ in range(20):
    print(countdown(10**7), end=" ")
print(time.time() - t)  
# takes ~10.5 seconds on medium sized Macbook Pro


t = time.time()
results = Parallel(n_jobs=2)(delayed(countdown)(10**7) for _ in range(20))
print(results)
print(time.time() - t)
# takes ~6.3 seconds on medium sized Macbook Pro
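
To illustrate the thread-based option from the bullet points above, here is a minimal sketch, assuming a reasonably recent joblib (for prefer="threads") and using time.sleep as a stand-in for real disk or network waits:

import time
from joblib import Parallel, delayed

def fake_io(i):
    time.sleep(0.5)  # stand-in for an I/O wait (disk read, HTTP request, ...)
    return i

t = time.time()
# prefer="threads" keeps the work in threads of a single process; that is fine here
# because the GIL is released while each thread is waiting on (simulated) I/O
results = Parallel(n_jobs=4, prefer="threads")(delayed(fake_io)(i) for i in range(8))
print(results)
print(time.time() - t)  # roughly 1 second instead of ~4 seconds sequentially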



ANSWER 3

Score 106


To parallelize a simple for loop, joblib adds a lot of value over raw use of multiprocessing. Not only the short syntax, but also things like transparent batching of iterations when they are very fast (to remove the per-task overhead) and capturing the traceback of the child process, to have better error reporting.
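
As a rough sketch of the batching point, assuming the default batch_size='auto' and a deliberately tiny function chosen only for illustration:

from joblib import Parallel, delayed

def tiny(i):
    return i * i  # each call is far cheaper than dispatching it to a worker on its own

# With batch_size='auto' (the default), joblib groups many fast iterations into one
# dispatched task, so the inter-process overhead is paid once per batch rather than per call.
results = Parallel(n_jobs=4, batch_size='auto')(delayed(tiny)(i) for i in range(100_000))
print(len(results), results[:5])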

Disclaimer: I am the original author of joblib.




ANSWER 4

Score 80


What's the easiest way to parallelize this code?

Use a PoolExecutor from concurrent.futures. Compare the original code with this, side by side. First, the most concise way to approach this is with executor.map:

...
with ProcessPoolExecutor() as executor:
    for out1, out2, out3 in executor.map(calc_stuff, parameters):
        ...

or broken down by submitting each call individually:

...
with ThreadPoolExecutor() as executor:
    futures = []
    for parameter in parameters:
        futures.append(executor.submit(calc_stuff, parameter))

    for future in futures:
        out1, out2, out3 = future.result() # this will block
        ...

Leaving the context signals the executor to free up resources.

You can use threads or processes and use the exact same interface.

A working example

Here is working example code that demonstrates both executors on a processor-intensive and an I/O-bound workload:

Put this in a file - futuretest.py:

from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
from time import time
from http.client import HTTPSConnection

def processor_intensive(arg):
    def fib(n): # recursive, processor intensive calculation (avoid n > 36)
        return fib(n-1) + fib(n-2) if n > 1 else n
    start = time()
    result = fib(arg)
    return time() - start, result

def io_bound(arg):
    start = time()
    con = HTTPSConnection(arg)
    con.request('GET', '/')
    result = con.getresponse().getcode()
    return time() - start, result

def manager(PoolExecutor, calc_stuff):
    if calc_stuff is io_bound:
        inputs = ('python.org', 'stackoverflow.com', 'stackexchange.com',
                  'noaa.gov', 'parler.com', 'aaronhall.dev')
    else:
        inputs = range(25, 32)
    timings, results = list(), list()
    start = time()
    with PoolExecutor() as executor:
        for timing, result in executor.map(calc_stuff, inputs):
            # put results into correct output list:
            timings.append(timing), results.append(result)
    finish = time()
    print(f'{calc_stuff.__name__}, {PoolExecutor.__name__}')
    print(f'wall time to execute: {finish-start}')
    print(f'total of timings for each call: {sum(timings)}')
    print(f'time saved by parallelizing: {sum(timings) - (finish-start)}')
    print(dict(zip(inputs, results)), end = '\n\n')

def main():
    for computation in (processor_intensive, io_bound):
        for pool_executor in (ProcessPoolExecutor, ThreadPoolExecutor):
            manager(pool_executor, calc_stuff=computation)

if __name__ == '__main__':
    main()

And here's the output for one run of python -m futuretest:

processor_intensive, ProcessPoolExecutor
wall time to execute: 0.7326343059539795
total of timings for each call: 1.8033506870269775
time saved by parallelizing: 1.070716381072998
{25: 75025, 26: 121393, 27: 196418, 28: 317811, 29: 514229, 30: 832040, 31: 1346269}

processor_intensive, ThreadPoolExecutor
wall time to execute: 1.190223217010498
total of timings for each call: 3.3561410903930664
time saved by parallelizing: 2.1659178733825684
{25: 75025, 26: 121393, 27: 196418, 28: 317811, 29: 514229, 30: 832040, 31: 1346269}

io_bound, ProcessPoolExecutor
wall time to execute: 0.533886194229126
total of timings for each call: 1.2977914810180664
time saved by parallelizing: 0.7639052867889404
{'python.org': 301, 'stackoverflow.com': 200, 'stackexchange.com': 200, 'noaa.gov': 301, 'parler.com': 200, 'aaronhall.dev': 200}

io_bound, ThreadPoolExecutor
wall time to execute: 0.38941240310668945
total of timings for each call: 1.6049387454986572
time saved by parallelizing: 1.2155263423919678
{'python.org': 301, 'stackoverflow.com': 200, 'stackexchange.com': 200, 'noaa.gov': 301, 'parler.com': 200, 'aaronhall.dev': 200}

Processor-intensive analysis

When performing processor-intensive calculations in Python, expect the ProcessPoolExecutor to be more performant than the ThreadPoolExecutor.

Due to the Global Interpreter Lock (a.k.a. the GIL), threads cannot use multiple processors, so expect the time for each calculation and the wall time (elapsed real time) to be greater.

IO-bound analysis

On the other hand, when performing IO-bound operations, expect ThreadPoolExecutor to be more performant than ProcessPoolExecutor.

Python's threads are real OS threads. They can be put to sleep by the operating system and reawakened when the data they are waiting for arrives.

Final thoughts

I suspect that multiprocessing will be slower on Windows, since Windows doesn't support forking, so each new process takes extra time to launch.

You can nest multiple threads inside multiple processes, but it's recommended not to use multiple threads to spin off multiple processes.
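
A minimal sketch of that direction (thread pools nested inside process workers; the URL chunks and the status-check helper are illustrative assumptions, not part of the original answer):

from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
from urllib.request import urlopen

def fetch_status(url):
    # I/O-bound part, handled by threads inside each worker process
    with urlopen(url, timeout=10) as response:
        return url, response.status

def process_chunk(urls):
    # each process runs its own small thread pool for its chunk of URLs
    with ThreadPoolExecutor(max_workers=4) as threads:
        return list(threads.map(fetch_status, urls))

if __name__ == '__main__':
    chunks = [
        ['https://python.org', 'https://stackoverflow.com'],
        ['https://stackexchange.com', 'https://pypi.org'],
    ]
    with ProcessPoolExecutor() as processes:
        for chunk_result in processes.map(process_chunk, chunks):
            print(chunk_result)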

If faced with a heavy processing problem in Python, you can trivially scale with additional processes - but not so much with threading.