
How To Run A Scrapy Scraper Multiple Times, Simultaneously, On Different Input Websites And Write To Different Output Files?

Does anyone know how I could run the same Scrapy scraper over 200 times on different websites, each with its own output file? Usually in Scrapy, you indicate the output file when you run the spider from the command line.

Solution 1:

There are multiple ways to do this:

  • Create a pipeline that writes the items to a configurable location, e.g. run scrapy crawl myspider -a output_filename=output_file.txt. output_filename is added as an argument to the spider, and you can then access it from a pipeline like this:

    class MyPipeline(object):
        def process_item(self, item, spider):
            filename = spider.output_filename
            # now do your magic with filename (e.g. append the item to that file)
            return item
  • You can also run Scrapy from a Python script and handle the output items yourself (a sketch of both approaches follows below).
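To make both suggestions concrete, here is a minimal sketch. The spider name MySpider, the website spider argument, the project layout, the file names, and the JSON-lines output format are all assumptions, not part of the answer above. The pipeline writes each item to the file named by output_filename, and the driver script schedules one crawl per website inside a single CrawlerProcess so they all run concurrently:

    # pipelines.py -- a fuller version of the pipeline sketched above
    import json

    class MyPipeline(object):
        def open_spider(self, spider):
            # one output file per run, named via the output_filename spider argument
            self.file = open(spider.output_filename, "w")

        def close_spider(self, spider):
            self.file.close()

        def process_item(self, item, spider):
            # write each item as one JSON object per line
            self.file.write(json.dumps(dict(item)) + "\n")
            return item

    # run_all.py -- schedule one crawl per website from a single Python script
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    from myproject.spiders.myspider import MySpider  # hypothetical project layout

    websites = {
        "https://example.com": "output/example_com.jl",
        "https://example.org": "output/example_org.jl",
    }

    process = CrawlerProcess(get_project_settings())

    for url, outfile in websites.items():
        # spider arguments, exactly like -a website=... -a output_filename=...
        process.crawl(MySpider, website=url, output_filename=outfile)

    process.start()  # runs every scheduled crawl concurrently in one reactor

Remember to enable the pipeline via ITEM_PIPELINES in settings.py; running everything in one process means all 200+ crawls share Scrapy's concurrency settings.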

Solution 2:

I'm doing a similar thing. Here is what I have done:

  1. Write the crawler as you normally would, but make sure to implement feed exports. I have the feed export push the results directly to an S3 bucket. Also, I recommend that you accept the website as a command line parameter to the script. (Example here)
  2. Set up scrapyd to run your spider
  3. Package and deploy your spider to scrapyd using scrapyd-client
  4. Now, with your list of websites, simply issue a single curl command per URL to your scrapyd process (a sketch follows below).
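For concreteness, here is a minimal sketch of steps 1 and 4; the project name myproject, the spider name myspider, the S3 bucket, and the websites.txt input file are assumptions, not details from the answer. The spider accepts the website as an argument and uses feed exports to push results to S3:

    # myspider.py -- step 1: accept the website as a spider argument and
    # export the feed straight to S3 (s3:// feed URIs require botocore)
    import scrapy

    class MySpider(scrapy.Spider):
        name = "myspider"

        custom_settings = {
            "FEEDS": {
                # %(name)s and %(time)s are expanded by Scrapy for each run
                "s3://my-scrape-bucket/%(name)s/%(time)s.json": {"format": "json"},
            },
        }

        def __init__(self, website=None, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.start_urls = [website] if website else []

        def parse(self, response):
            # placeholder "shallow scrape": record only the page title
            yield {"url": response.url, "title": response.css("title::text").get()}

After deploying with scrapyd-client (steps 2 and 3), each website is scheduled with one HTTP request to scrapyd's schedule.json endpoint; the loop below is the Python equivalent of issuing one curl command per URL:

    # schedule_all.py -- step 4: one schedule.json request per website,
    # equivalent to:
    #   curl http://localhost:6800/schedule.json -d project=myproject \
    #        -d spider=myspider -d website=https://example.com
    import requests

    SCRAPYD = "http://localhost:6800/schedule.json"

    with open("websites.txt") as f:          # assumed input: one URL per line
        websites = [line.strip() for line in f if line.strip()]

    for website in websites:
        resp = requests.post(SCRAPYD, data={
            "project": "myproject",
            "spider": "myspider",
            "website": website,              # forwarded to the spider as an argument
        })
        print(website, resp.json().get("jobid"))

scrapyd queues the jobs and runs up to its configured max_proc spiders in parallel, so the scheduling script returns immediately while the crawls proceed.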

I've used the above strategy to shallow scrape two million domains, and I did it in less than 5 days.
