Skip to content Skip to sidebar Skip to footer

Multiprocessing Global Variable Memory Copying

I am running a program which loads 20 GB data to the memory at first. Then I will do N (> 1000) independent tasks where each of them may use (read only) part of the 20 GB data.

Solution 1:

In linux, forked processes have a copy-on-write view of the parent address space. forking is light-weight and the same program runs in both the parent and the child, except that the child takes a different execution path. As a small exmample,

import os
var = "unchanged"
pid = os.fork()
if pid:
    print('parent:', os.getpid(), var)
    os.waitpid(pid, 0)
else:
    print('child:', os.getpid(), var)
    var = "changed"

# show parent and child views
print(os.getpid(), var)

Results in

parent: 22642 unchanged
child: 22643 unchanged
22643 changed
22642 unchanged

Applying this to multiprocessing, in this example I load data into a global variable. Since python pickles the data sent to the process pool, I make sure it pickles something small like an index and have the worker get the global data itself.

import multiprocessing as mp
import os

my_big_data = "well, bigger than this"

def worker(index):
    """get char in big data"""
    return my_big_data[index]

if __name__ == "__main__":
    pool = mp.Pool(os.cpu_count())
    for c in pool.imap_unordered(worker, range(len(my_big_data)), chunksize=1):
        print(c)

Windows does not have a fork-and-exec model for running programs. It has to start a new instance of the python interpreter and clone all relevant data to the child. This is a heavy lift!


Post a Comment for "Multiprocessing Global Variable Memory Copying"