How To Run A Scrapy Scraper Multiple Times, Simultaneously, On Different Input Websites And Write To Different Output Files?
Does anyone know how I could run the same Scrapy scraper over 200 times on different websites, each with their respective output files? Usually in Scrapy, you indicate the output file when you run the spider from the command line.
Solution 1:
There are multiple ways to do this:

Create a pipeline that writes the items, with the output configured through a spider argument, e.g. running `scrapy crawl myspider -a output_filename=output_file.txt`. `output_filename` is added as an argument to the spider, and you can then access it from a pipeline:

```python
class MyPipeline(object):
    def process_item(self, item, spider):
        filename = spider.output_filename
        # now do your magic with filename
```
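To make the "do your magic" step concrete, here is a minimal sketch of such a pipeline that opens one file per spider run and writes each item to it; the attribute name `output_filename` matches the `-a` argument above, while the JSON-lines format is only an assumption for illustration:

```python
import json


class MyPipeline(object):
    def open_spider(self, spider):
        # output_filename is set by: scrapy crawl myspider -a output_filename=output_file.txt
        self.file = open(spider.output_filename, "w")

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # append every scraped item to this run's own output file, one JSON object per line
        self.file.write(json.dumps(dict(item)) + "\n")
        return item
```

Remember to enable the pipeline under `ITEM_PIPELINES` in `settings.py`.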
Alternatively, you can run Scrapy from within a Python script and handle the output items yourself.
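A rough sketch of that approach uses Scrapy's `CrawlerProcess` to schedule one crawl per website from a single script; the spider class, the `start_url`/`output_filename` arguments, and the website list below are assumptions for illustration:

```python
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from myproject.spiders.myspider import MySpider  # hypothetical spider class

# hypothetical (url, output file) pairs; in practice this could be 200+ entries read from a file
websites = [
    ("https://example.com", "output/example_com.jl"),
    ("https://example.org", "output/example_org.jl"),
]

process = CrawlerProcess(get_project_settings())
for url, output_file in websites:
    # keyword arguments are passed to the spider, just like -a on the command line
    process.crawl(MySpider, start_url=url, output_filename=output_file)

process.start()  # blocks here; all scheduled crawls run concurrently in one reactor
```

This assumes the spider builds its `start_urls` from the `start_url` argument, and the pipeline from above picks up `output_filename`.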
Solution 2:
I'm doing a similar thing. Here is what I have done:
- Write the crawler as you normally would, but make sure to implement feed exports. I have the feed export push the results directly to an S3 bucket. I also recommend accepting the website as a command-line argument to the spider (see the sketch after this list).
- Set up scrapyd to run your spider.
- Package and deploy your spider to scrapyd using scrapyd-client.
- Now, with your list of websites, simply issue one curl request per URL against your scrapyd process (a scripted version is also sketched below).
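As a sketch of the first step, a spider that takes the website as a `-a site=...` argument and pushes its feed straight to S3 might look like the following. The spider name, bucket, and selector are assumptions; the `FEEDS` setting requires Scrapy 2.1+ (older versions use `FEED_URI`/`FEED_FORMAT`), and S3 feed storage needs botocore plus AWS credentials configured:

```python
import scrapy


class SiteSpider(scrapy.Spider):
    name = "site_spider"

    # one output object per site, pushed directly to S3 by the feed exporter;
    # %(site)s is filled in from the spider attribute of the same name
    custom_settings = {
        "FEEDS": {
            "s3://my-bucket/crawls/%(site)s.jl": {"format": "jsonlines"},
        },
    }

    def __init__(self, site=None, *args, **kwargs):
        super(SiteSpider, self).__init__(*args, **kwargs)
        # the website arrives as: scrapy crawl site_spider -a site=example.com
        self.site = site
        self.start_urls = ["https://%s/" % site]

    def parse(self, response):
        # shallow scrape: just capture the landing page title
        yield {"site": self.site, "title": response.css("title::text").get()}
```

Run it locally with `scrapy crawl site_spider -a site=example.com` before packaging it for scrapyd.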
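For the last step, scrapyd's `schedule.json` endpoint accepts the project name, the spider name, and any extra fields as spider arguments, so each website needs just one POST. The answer does this with curl; the sketch below does the same thing from Python with the requests library instead, reading a hypothetical `sites.txt` with one domain per line:

```python
import requests

SCRAPYD_URL = "http://localhost:6800/schedule.json"  # scrapyd's default address

with open("sites.txt") as f:
    sites = [line.strip() for line in f if line.strip()]

for site in sites:
    # every extra form field is handed to the spider as a -a argument, so "site" reaches SiteSpider
    resp = requests.post(SCRAPYD_URL, data={
        "project": "myproject",   # the project name used when deploying with scrapyd-client
        "spider": "site_spider",
        "site": site,
    })
    print(site, resp.json().get("jobid"))
```

scrapyd queues the scheduled jobs and runs them in parallel up to its `max_proc` setting, which is what lets hundreds of sites crawl simultaneously.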
I've used the above strategy to shallow scrape two million domains, and I did it in less than 5 days.