Spacy With Joblib Library Generates _pickle.PicklingError: Could Not Pickle The Task To Send It To The Workers
Solution 1:
Q: What is the problem in my code?
Well, most probably the issue comes not from your code, but from the "hidden" processing that happens once n_jobs directs (and joblib internally orchestrates) the preparation of that many exact copies of the main process, so that they can work independently of one another (effectively escaping GIL-locking and mapping the multiple process-flows onto physical hardware resources).
This step is responsible for making copies of all Python objects and uses pickle to do so. The pickle module has well-known, long-standing limitations on what can and cannot be pickled.
The error message confirms this:
TypeError: self.c_map cannot be converted to a Python object for pickling
One may try a trick: supply Mike McKerns' dill module instead of pickle and test whether your "problematic" Python objects get pickled by it without throwing this error. dill has the same API signatures, so a plain import dill as pickle may help while leaving all the other code the same.
I had the same problem with large models being distributed into and back from multiple processes, and dill was the way to go. Performance also increased.
Bonus:
dill allows you to save / restore the full Python interpreter state!
This was a cool side-effect of finding dill: once import dill as pickle was done, pickle.dump_session( <aFile> ) will save a complete, stateful copy of the Python interpreter session. This can be restored if needed (post-crash restores, a trained and optimised ML-model statefully saved / restored, an incremental-learning ML-model statefully saved and re-distributed for remote restores across deployed user-bases, etc.)
Solution 2:
Same issue. I solved it by changing the backend from loky to threading in Parallel.
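A minimal sketch of that change, assuming the usual joblib pattern (the function and inputs are placeholders). The threading backend shares memory with the parent process, so nothing needs to be pickled, at the price of staying under the GIL for CPU-bound work:

from joblib import Parallel, delayed

def work(x):                      # placeholder for the real task
    return x * x

results = Parallel(n_jobs=4, backend='threading')(
    delayed(work)(i) for i in range(10))
print(results)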
Solution 3:
An additional answer to my own question:
I didn't find a solution for joblib with spaCy, but to parallelize the processing I found that spaCy provides a pipeline method, nlp.pipe, which can parse a large number of documents with multiple workers.
I applied it with the same example above:
import time
import spacy

nlp = spacy.load('en_core_web_sm')   # model name assumed; the original did not show the load

class nouns:
    def get_nouns(self, sentences):
        start = time.time()
        # nlp.pipe streams the texts through the pipeline in batches;
        # the original used n_threads=-1 (spaCy <= v2); in spaCy v3 the
        # equivalent knob is n_process
        docs = nlp.pipe(sentences, n_process=-1)
        result = [' '.join([token.text for token in doc
                            if token.tag_ in ['NN', 'NNP', 'NNS', 'NNPS']])
                  for doc in docs]
        print('Time Elapsed {} ms'.format((time.time() - start) * 1000))
        print(result)

if __name__ == '__main__':
    sentences = ['we went to the school yesterday',
                 'The weather is really cold',
                 'Can we catch the dog?',
                 'How old are you John?',
                 'I like diving and swimming',
                 'Can the world become united?']
    obj = nouns()
    obj.get_nouns(sentences)
Solution 4:
I had a similar problem parallelizing lemmatization, but with another library, pymystem3.
from pymystem3 import Mystem
from joblib import Parallel, delayed
from tqdm import tqdm

mystem = Mystem()                 # module-level instance: joblib must pickle it

def preprocess_text(text):
    ...
    tokens = mystem.lemmatize(text)
    ...
    text = " ".join(tokens)
    return text

data_set = Parallel(n_jobs=-1)(delayed(preprocess_text)(article)
                               for article in tqdm(articles))
The solution was to move the initialization into the function:
def preprocess_text(text):
    ...
    mystem = Mystem()             # initialized inside the worker, so it is never pickled
    tokens = mystem.lemmatize(text)
    ...
    text = " ".join(tokens)
    return text
I suspect you could try the same with nlp = spacy.load(...); a sketch follows below.
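A hedged sketch of that idea (the model name 'en_core_web_sm' and the helper name are assumptions; per-call loading is simple but slow, so in practice you would load once per worker or per batch):

import spacy
from joblib import Parallel, delayed

def extract_nouns(sentence):
    nlp = spacy.load('en_core_web_sm')   # created inside the worker, never pickled
    doc = nlp(sentence)
    return ' '.join(t.text for t in doc
                    if t.tag_ in ('NN', 'NNP', 'NNS', 'NNPS'))

results = Parallel(n_jobs=2)(delayed(extract_nouns)(s)
                             for s in ['The dog chased the cat',
                                       'John likes swimming'])
print(results)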
Solution 5:
Just want to add my two cents: use @staticmethod for your class method and spare the auto-injected self object, to prevent accidentally serializing a whole framework, as happened in my case (Flask). Frameworks do a lot of behind-the-scenes injection, which blows up the serialization dependencies.
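A minimal sketch of the @staticmethod advice (class and method names are illustrative): without the self argument, joblib only has to pickle the function and its inputs, not the enclosing object graph.

from joblib import Parallel, delayed

class TextJobs:
    @staticmethod
    def tokenize(text):           # no self, so no object graph gets pickled
        return text.split()

results = Parallel(n_jobs=2)(delayed(TextJobs.tokenize)(t)
                             for t in ['a b c', 'd e f'])
print(results)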