
Python Multiprocessing: Broken Pipe Exception After Increasing Pool Size

This is the exception I get. All I did was increase the pool count.

Code:

def parse(url):
    r = request.get(url)

POOL_COUNT = 75
with Pool(POOL_COUNT) as p:
    result = p.map(parse, links)

Solution 1:

I was seeing the Broken Pipe exception too, but my case was more complicated.

One reason that increasing the pool size alone can lead to this exception is that too many concurrent requests consume too much memory. The process can then segfault, especially if you have a small swap.

Edit 1: I believe it's caused by memory usage. Too many pool workers used up too much memory, and the pipe finally broke. It's very hard to debug; I limited my own pool size to 4, since I have little RAM and big packages to process.
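To make that advice concrete, here is a minimal sketch of capping the pool size (the values, the links list, and the placeholder parse body are illustrative, not from the original question):

    import os
    from multiprocessing import Pool

    def parse(url):
        # Placeholder for the real requests.get() call
        return url.upper()

    if __name__ == "__main__":
        # Cap the pool at a small fixed size (or the CPU count, whichever is
        # smaller) so total worker memory stays bounded.
        POOL_SIZE = min(4, os.cpu_count() or 1)
        links = ["http://example.org/%d" % i for i in range(20)]
        with Pool(POOL_SIZE) as p:
            result = p.map(parse, links)
        print(len(result))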

Solution 2:

This simple version of your code works perfectly here with any value of POOL_COUNT:

from multiprocessing import Pool

def parse(url):
    r = url
    print(r)

POOL_COUNT = 90
with Pool(processes=POOL_COUNT) as p:
    links = [str(i) for i in range(POOL_COUNT)]
    result = p.map(parse, links)

Doesn't it? So the problem should be in the requests part; maybe it needs a sleep?
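For what the "maybe needs a sleep" suggestion could look like, here is a hedged sketch that throttles each worker before it fetches; the delay value is illustrative, and the actual requests.get() call is left as a comment so the snippet runs offline:

    import time

    def parse(url):
        # Sleep briefly before each fetch so 75+ workers don't all open
        # connections at the same instant. 0.1 s is an arbitrary example.
        time.sleep(0.1)
        # r = requests.get(url)  # the real fetch would go here
        return url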

Solution 3:

I tried to reproduce this on an AWS t2.small instance (2 GB RAM, as you described) with the following script (note that you missed an s in requests.get(), assuming you are using the requests library, and the return statement was also missing):

from multiprocessing import Pool
import requests

def parse(url):
    a = requests.get(url)
    if a.status_code != 200:
        print(a)
    return a.text

POOL_COUNT = 120
links = ['http://example.org/' for i in range(1000)]
with Pool(POOL_COUNT) as p:
    result = p.map(parse, links)
print(result)

Sadly, I didn't run into the same issue as you did.

From the stack trace you posted it seems that the problem is in launching the parse function, not in the requests module itself. It looks like the main process cannot send data to one of the launched processes.

Anyway: this operation is not CPU-bound. The bottleneck is the network (most probably the remote server's connection limit), so you are much better off using multithreading. It will most probably also be faster, because multiprocessing.Pool.map needs to communicate between processes: the return value of parse has to be pickled and then sent back to the main process.

To try threads instead of processes, simply do from multiprocessing.pool import ThreadPool and replace Pool with ThreadPool in your code.
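The swap can be sketched as follows; the parse body here is a stand-in for the real requests.get(url).text call so the example runs offline, and the links list is illustrative:

    from multiprocessing.pool import ThreadPool

    def parse(url):
        # Stand-in for requests.get(url).text so the example runs offline
        return "fetched:" + url

    POOL_COUNT = 75
    links = ["http://example.org/%d" % i for i in range(10)]
    # ThreadPool has the same map() interface as Pool, but uses threads,
    # so return values don't need to be pickled between processes.
    with ThreadPool(POOL_COUNT) as p:
        result = p.map(parse, links)
    print(result[0])

Because threads share memory, this also sidesteps the pipe that broke in the original traceback.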
