
Spacy With Joblib Library Generates _pickle.PicklingError: Could Not Pickle The Task To Send It To The Workers

I have a large list of sentences (~7 million), and I want to extract the nouns from them. I used the joblib library to parallelize the extraction process, like in the following: impor…

Solution 1:

Q: What is the problem in my code?

Well, most probably the issue comes not from the code, but from the "hidden" processing that takes place once n_jobs directs (and joblib internally orchestrates) the preparation of that many exact copies of the main process, so that they can work independently of one another (effectively escaping GIL-locking and mapping the multiple process flows onto physical hardware resources).

This step is responsible for making copies of all Python objects and is known to use pickle for doing so. The pickle module is known for its historical principal limitations on what can be pickled and what cannot.

The error message confirms this:

TypeError: self.c_map cannot be converted to a Python object for pickling

One may try a trick: supply Mike McKerns' dill module instead of pickle and test whether your "problematic" Python objects can be pickled with this module without throwing this error.

dill has the same API signatures, so a pure import dill as pickle may help, leaving all the other code the same.
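
A minimal sanity check of that idea, assuming dill is installed (pip install dill); the lambda here just stands in for whatever object refuses to pickle:

import pickle
import dill

f = lambda x: x * 2          # stdlib pickle cannot serialize lambdas

try:
    pickle.dumps(f)          # raises PicklingError
except Exception as e:
    print('pickle failed:', e)

payload = dill.dumps(f)      # dill handles it
print(dill.loads(payload)(21))   # -> 42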

I had the same problems with large models being distributed to and back from multiple processes, and dill was the way to go. Performance also increased.

Bonus: dill allows you to save / restore the full Python interpreter state!

This was a cool side-effect of finding dill: once import dill as pickle was done, pickle.dump_session( <aFile> ) saves a complete stateful copy of the Python interpreter session. This can be restored if needed (post-crash restores, a trained and optimised ML model saved / restored statefully, an incremental-learning ML model saved statefully and re-distributed for remote restores across deployed user bases, etc.).
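
For illustration, a minimal sketch of that feature ('session.pkl' is an arbitrary file name):

import dill

x = 42                              # ... any state built up in __main__ ...
dill.dump_session('session.pkl')    # snapshot the whole interpreter session

# later, possibly in a fresh interpreter:
import dill
dill.load_session('session.pkl')
print(x)                            # the state is back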

Solution 2:

Same issue here. I solved it by changing the backend from loky to threading in Parallel.
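
A minimal, self-contained sketch of that change (square is just a stand-in for the real worker function). The threading backend sidesteps pickling entirely because workers share memory with the parent process; the trade-off is that CPU-bound work still runs under the GIL:

from joblib import Parallel, delayed

def square(x):
    return x * x

# backend='threading' shares memory, so arguments and results are never pickled
results = Parallel(n_jobs=-1, backend='threading')(
    delayed(square)(i) for i in range(10)
)
print(results)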

Solution 3:

An additional answer to my own question:

I didn't find a solution for joblib with spaCy. Instead, to parallelize the process, I found that spaCy offers something called a pipeline (nlp.pipe), which can parse a large number of documents with multiple threads.

I applied it with the same example above:

import time
import spacy

# the model name below is an assumption; load whichever English model you have installed
nlp = spacy.load('en_core_web_sm')

class nouns:

    def get_nouns(self, sentences):
        start = time.time()
        # n_threads was honoured by spaCy v1; see the version note after this example
        docs = nlp.pipe(sentences, n_threads=-1)
        result = [' '.join(token.text for token in doc
                           if token.tag_ in ['NN', 'NNP', 'NNS', 'NNPS'])
                  for doc in docs]
        print('Time Elapsed {} ms'.format((time.time() - start) * 1000))
        print(result)


if __name__ == '__main__':
    sentences = ['we went to the school yesterday',
                 'The weather is really cold',
                 'Can we catch the dog?',
                 'How old are you John?',
                 'I like diving and swimming',
                 'Can the world become united?']
    obj = nouns()
    obj.get_nouns(sentences)
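
Note: n_threads was only honoured by spaCy v1; it became a no-op in v2 and was removed in v3, where (to the best of my knowledge) the multiprocessing equivalent is n_process:

docs = nlp.pipe(sentences, n_process=-1)  # spaCy v3: one worker process per core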

Solution 4:

I had a similar problem with parallelizing lemmatization, but with another library, pymystem3.

from joblib import Parallel, delayed
from pymystem3 import Mystem
from tqdm import tqdm

mystem = Mystem()  # initialized at module level -- this is what breaks pickling

def preprocess_text(text):
   ...
   tokens = mystem.lemmatize(text)
   ...
   text = " ".join(tokens)
   return text

data_set = Parallel(n_jobs=-1)(delayed(preprocess_text)(article) for article in tqdm(articles))

The solution was to move the initialization inside the function:

def preprocess_text(text):
   ...
   mystem = Mystem()  # created inside the worker, so the object never has to be pickled
   tokens = mystem.lemmatize(text)
   ...
   text = " ".join(tokens)
   return text

I suspect you could try the same with nlp = spacy.load; a sketch of that idea follows.
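
A hedged sketch of that idea applied to spaCy (the model name and the per-process cache are my additions, not the original poster's code); the cache means each worker process loads the model only once instead of on every call:

import spacy
from joblib import Parallel, delayed

_nlp = None  # per-process cache, so each worker loads the model only once

def extract_nouns(sentence):
    global _nlp
    if _nlp is None:
        _nlp = spacy.load('en_core_web_sm')  # assumed model name
    doc = _nlp(sentence)
    return ' '.join(t.text for t in doc
                    if t.tag_ in ['NN', 'NNP', 'NNS', 'NNPS'])

if __name__ == '__main__':
    sentences = ['The weather is really cold', 'Can we catch the dog?']
    print(Parallel(n_jobs=-1)(delayed(extract_nouns)(s) for s in sentences))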

Solution 5:

Just want to add my two cents: use @staticmethod instead of a plain class method and spare the auto-injected self object, to avoid accidentally serializing a whole framework, as happened in my case (Flask). Frameworks do a lot of behind-the-scenes injection, which blows up the serialization dependencies.
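
A minimal sketch of that pattern (Worker and process are illustrative names):

from joblib import Parallel, delayed

class Worker:
    # @staticmethod: no `self` is captured, so joblib pickles only the
    # function and its argument, not the instance and everything it references
    @staticmethod
    def process(item):
        return item.upper()

if __name__ == '__main__':
    results = Parallel(n_jobs=2)(
        delayed(Worker.process)(s) for s in ['a', 'b', 'c']
    )
    print(results)  # ['A', 'B', 'C']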
