The Scrapy-redis Program Does Not Close Automatically
Solution 1:
scrapy-redis will always wait for new URLs to be pushed into the Redis queue. When the queue is empty, the spider goes into an idle state and waits for new URLs. That's what I used to close my spider once the queue is empty.
When the spider is idle (when it is doing nothing), I check whether there is still something left in the Redis queue. If not, I close the spider with close_spider. The following code is located in the spider class:
@classmethod
def from_crawler(cls, crawler, *args, **kwargs):
    from_crawler = super(SerpSpider, cls).from_crawler
    spider = from_crawler(crawler, *args, **kwargs)
    crawler.signals.connect(spider.idle, signal=scrapy.signals.spider_idle)
    return spider

def idle(self):
    if self.q.llen(self.redis_key) <= 0:
        self.crawler.engine.close_spider(self, reason='finished')
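To show where these hooks fit, here is a minimal self-contained sketch of such a spider. It assumes a scrapy-redis RedisSpider subclass, where the connected Redis client is exposed as self.server and redis_key names the Redis list the spider reads from; the class name, key, and parse callback are placeholders, not the original poster's exact code.

import scrapy
from scrapy_redis.spiders import RedisSpider


class SerpSpider(RedisSpider):
    name = 'serp'
    redis_key = 'serp:start_urls'  # Redis list the spider pops start URLs from

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(SerpSpider, cls).from_crawler(crawler, *args, **kwargs)
        # Call spider.idle() every time the engine fires the spider_idle signal.
        crawler.signals.connect(spider.idle, signal=scrapy.signals.spider_idle)
        return spider

    def idle(self):
        # self.server is the Redis client scrapy-redis attaches to the spider.
        # If the start-URL list is empty, shut the spider down.
        if self.server.llen(self.redis_key) <= 0:
            self.crawler.engine.close_spider(self, reason='finished')

    def parse(self, response):
        yield {'url': response.url}

The engine.close_spider call shuts the spider down cleanly with the given reason, the same way a normal crawl finishes.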
Solution 2:
Well, scrapy-redis is made to stay open, always waiting for more URLs to be pushed into the Redis queue, but if you want to close it you can do it with a pipeline, like this:
class TestPipeline(object):

    def __init__(self, crawler):
        self.crawler = crawler
        self.redis_db = None
        self.redis_len = 0

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def open_spider(self, spider):
        self.redis_len = len(spider.server.keys('your_redis_key'))

    def process_item(self, item, spider):
        self.redis_len -= 1
        if self.redis_len <= 0:
            self.crawler.engine.close_spider(spider, 'No more items in redis queue')
        return item
I will explain how it works: in open_spider the pipeline gets the total number of keys in the Redis queue, and in process_item it decrements the redis_len variable; when it reaches zero, it sends a close signal on the last item.
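As with any item pipeline, this only runs if it is registered in the project's settings.py; the dotted path and priority below are example values for a hypothetical project layout, not names from the original answer.

# settings.py (example values -- adjust the dotted path to your project)
ITEM_PIPELINES = {
    'myproject.pipelines.TestPipeline': 300,
}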