Skip to content Skip to sidebar Skip to footer

Python Parallel Processing To Unzip Files

I'm new to parallel processing in python. I have a piece of code below, that walks through all directories and unzips all tar.gz files. However, it takes quite a bit of time. impor

Solution 1:

It's a pretty easy to create a pool to do the work. Just pull the extractor out into a separate worker.

import tarfile
import gzip
import os
import multiprocessing as mp

def unziptar(fullpath):
    """worker unzips one file"""
    print 'extracting... {}'.format(fullpath)
    tar = tarfile.open(fullpath, 'r:gz')
    tar.extractall(os.path.dirname(fullpath))
    tar.close()

def fanout_unziptar(path):
    """create pool to extract all"""
    my_files = []
    for root, dirs, files in os.walk(path):
        for i in files:
            if i.endswith("tar.gz"):
                my_files.append(os.path.join(root, i))

    pool = mp.Pool(min(mp.cpu_count(), len(my_files))) # number of workers
    pool.map(unziptar, my_files, chunksize=1)
    pool.close()


if __name__=="__main__":
    path = 'C://path_to_folder'
    fanout_unziptar(path)
    print 'tar.gz extraction has completed'

Post a Comment for "Python Parallel Processing To Unzip Files"