How To Run A Scrapy Scraper Multiple Times, Simultaneously, On Different Input Websites And Write To Different Output Files?
Does anyone know how I could run the same Scrapy scraper over 200 times on different websites, each with their respective output files? Usually in Scrapy, you indicate the output file when you run the spider from the command line.
Solution 1:
There are multiple ways to do this:

Create a pipeline that writes the items, with the output configured through a spider argument, e.g. running `scrapy crawl myspider -a output_filename=output_file.txt`. `output_filename` is added as an argument to the spider, and you can then access it from a pipeline:

```python
class MyPipeline(object):
    def process_item(self, item, spider):
        filename = spider.output_filename
        # now do your magic with filename
```
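To make the "do your magic" step concrete, here is a minimal sketch of such a pipeline that opens one file per spider run and writes each item to it; the attribute name `output_filename` matches the `-a` argument above, while the JSON-lines format is only an assumption for illustration:

```python
import json


class MyPipeline(object):
    def open_spider(self, spider):
        # output_filename is set by: scrapy crawl myspider -a output_filename=output_file.txt
        self.file = open(spider.output_filename, "w")

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # append every scraped item to this run's own output file, one JSON object per line
        self.file.write(json.dumps(dict(item)) + "\n")
        return item
```

Remember to enable the pipeline under `ITEM_PIPELINES` in `settings.py`.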
Alternatively, you can run Scrapy from within a Python script and handle the output items yourself.
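A rough sketch of that approach uses Scrapy's `CrawlerProcess` to schedule one crawl per website from a single script; the spider class, the `start_url`/`output_filename` arguments, and the website list below are assumptions for illustration:

```python
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from myproject.spiders.myspider import MySpider  # hypothetical spider class

# hypothetical (url, output file) pairs; in practice this could be 200+ entries read from a file
websites = [
    ("https://example.com", "output/example_com.jl"),
    ("https://example.org", "output/example_org.jl"),
]

process = CrawlerProcess(get_project_settings())
for url, output_file in websites:
    # keyword arguments are passed to the spider, just like -a on the command line
    process.crawl(MySpider, start_url=url, output_filename=output_file)

process.start()  # blocks here; all scheduled crawls run concurrently in one reactor
```

This assumes the spider builds its `start_urls` from the `start_url` argument, and the pipeline from above picks up `output_filename`.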
Solution 2:
I'm doing a similar thing. Here is what I have done:
- Write the crawler as you normally would, but make sure to implement feed exports. I have the feed export push the results directly to an S3 bucket. I also recommend accepting the website as a command-line argument to the spider (see the sketch after this list).
- Set up scrapyd to run your spider.
- Package and deploy your spider to scrapyd using scrapyd-client.
- Now, with your list of websites, simply issue one curl request per URL against your scrapyd process (a scripted version is also sketched below).
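As a sketch of the first step, a spider that takes the website as a `-a site=...` argument and pushes its feed straight to S3 might look like the following. The spider name, bucket, and selector are assumptions; the `FEEDS` setting requires Scrapy 2.1+ (older versions use `FEED_URI`/`FEED_FORMAT`), and S3 feed storage needs botocore plus AWS credentials configured:

```python
import scrapy


class SiteSpider(scrapy.Spider):
    name = "site_spider"

    # one output object per site, pushed directly to S3 by the feed exporter;
    # %(site)s is filled in from the spider attribute of the same name
    custom_settings = {
        "FEEDS": {
            "s3://my-bucket/crawls/%(site)s.jl": {"format": "jsonlines"},
        },
    }

    def __init__(self, site=None, *args, **kwargs):
        super(SiteSpider, self).__init__(*args, **kwargs)
        # the website arrives as: scrapy crawl site_spider -a site=example.com
        self.site = site
        self.start_urls = ["https://%s/" % site]

    def parse(self, response):
        # shallow scrape: just capture the landing page title
        yield {"site": self.site, "title": response.css("title::text").get()}
```

Run it locally with `scrapy crawl site_spider -a site=example.com` before packaging it for scrapyd.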
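For the last step, scrapyd's `schedule.json` endpoint accepts the project name, the spider name, and any extra fields as spider arguments, so each website needs just one POST. The answer does this with curl; the sketch below does the same thing from Python with the requests library instead, reading a hypothetical `sites.txt` with one domain per line:

```python
import requests

SCRAPYD_URL = "http://localhost:6800/schedule.json"  # scrapyd's default address

with open("sites.txt") as f:
    sites = [line.strip() for line in f if line.strip()]

for site in sites:
    # every extra form field is handed to the spider as a -a argument, so "site" reaches SiteSpider
    resp = requests.post(SCRAPYD_URL, data={
        "project": "myproject",   # the project name used when deploying with scrapyd-client
        "spider": "site_spider",
        "site": site,
    })
    print(site, resp.json().get("jobid"))
```

scrapyd queues the scheduled jobs and runs them in parallel up to its `max_proc` setting, which is what lets hundreds of sites crawl simultaneously.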
I've used the above strategy to shallow scrape two million domains, and I did it in less than 5 days.