Python multiprocessing memory usage
Rise to the top 3% as a developer or hire one of them at Toptal: https://topt.al/25cXVn
--------------------------------------------------
Music by Eric Matyas
https://www.soundimage.org
Track title: Puzzle Game Looping
--
Chapters
00:00 Python Multiprocessing Memory Usage
01:14 Accepted Answer Score 34
02:22 Answer 2 Score 2
02:57 Thank you
--
Full question
https://stackoverflow.com/questions/1474...
--
Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...
--
Tags
#python #linux #memorymanagement #multiprocessing
#avk47
ACCEPTED ANSWER
Score 34
The multiprocessing module is effectively based on the fork system call which creates a copy of the current process. Since you are loading the huge data before you fork (or create the multiprocessing.Process), the child process inherits a copy of the data.
However, if the operating system you are running on implements COW (copy-on-write), there will only actually be one copy of the data in physical memory unless you modify the data in either the parent or child process (both parent and child will share the same physical memory pages, albeit in different virtual address spaces); and even then, additional memory will only be allocated for the changes (in pagesize increments).
You can avoid this situation by creating (and starting) the multiprocessing.Process before you load your huge data. Then the additional memory allocated when you load the data in the parent will not be reflected in the child process.
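A minimal sketch of that ordering (hypothetical worker; assumes a fork-based start method such as the Linux default):

import multiprocessing as mp

def worker(res_queue):
    # This child was forked before the parent loaded anything big,
    # so its memory footprint stays small.
    res_queue.put("child done")

if __name__ == "__main__":
    q = mp.Queue()
    p = mp.Process(target=worker, args=(q,))
    p.start()                        # start the child first...
    huge_data = list(range(10**7))   # ...then load the data in the parent only
    print(q.get())
    p.join()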
Edit: incorporating @Janne Karila's comment, as it is so relevant: "Note also that every Python object contains a reference count that is modified whenever the object is accessed. So, just reading a data structure can cause COW to copy."
ANSWER 2
Score 2
As recommended in the documentation (see "Explicitly pass resources to child processes"), I would make the large data explicitly global. If copy-on-write (COW) is available and you fork new processes (note that on macOS, spawn is the default nowadays), the data is available in the child processes:
def loadHugeData():
    global data
    data = list(range(1000))  # placeholder: load the huge structure into the module-level global

def processHugeData(res_queue):
    global data
    for item in data:
        result = item  # placeholder for the real per-item computation
        res_queue.put(result)
    res_queue.put("END")
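A hedged usage sketch of how these functions could be wired together (assumes a fork start method so the global is inherited via COW; the consumer loop is hypothetical):

import multiprocessing as mp

if __name__ == "__main__":
    loadHugeData()                    # populate the global in the parent first
    q = mp.Queue()
    p = mp.Process(target=processHugeData, args=(q,))
    p.start()                         # the forked child sees `data` without copying its pages
    while (msg := q.get()) != "END":
        print(msg)                    # consume each result as it arrives
    p.join()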
But keep in mind that the Python data structures themselves still get copied (touching an object writes to its reference count). You would need to use more low-level data types, such as numpy arrays, because of the Python GIL.
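As a hedged illustration of that point (assumes Linux, where fork is the default start method): a single numpy array is one Python object with one reference count, so reading it in the child only dirties the page holding the array header, while the large numeric buffer stays shared with the parent.

import multiprocessing as mp
import numpy as np

data = np.arange(50_000_000, dtype=np.float64)  # ~400 MB in a single buffer

def worker(res_queue):
    # Summing only reads the buffer; under COW, reads do not copy pages,
    # and only the small ndarray header's refcount gets written.
    res_queue.put(float(data.sum()))

if __name__ == "__main__":
    q = mp.Queue()
    p = mp.Process(target=worker, args=(q,))
    p.start()
    print(q.get())
    p.join()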