Scrapy With Multiple Pages
I have created a simple Scrapy project in which I get the total page number from the initial site example.com/full. Now I need to scrape all the pages starting from example.com/pa
Solution 1:
You need to look for the 'next page' link and keep following it for as long as it appears on the page.
# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request


class SanetSpider(scrapy.Spider):
    name = 'sanet'
    allowed_domains = ['sanet.st']
    start_urls = ['https://sanet.st/full/']

    def parse(self, response):
        yield {
            # Extract the "results X - Y from Z" header of the current page.
            'result': response.xpath('//h3[@class="posts-results"]/text()').extract_first()
        }

        # next_page = /page-{}/ where {} is the page number.
        next_page = response.xpath('//a[@data-tip="Next page"]/@href').extract_first()

        # If a next page exists, resolve it to an absolute URL
        # (https://sanet.st/page-{}/) and recall parse() on it.
        if next_page:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(url=next_page, callback=self.parse)
If you run this code with the "-o sanet.json" option, you will get the following result.
scrapy runspider sanet.py -o sanet.json
[
{"result": "results 1 - 15 from 651"},
{"result": "results 16 - 30 from 651"},
{"result": "results 31 - 45 from 651"},
...
etc.
...
{"result": "results 631 - 645 from 651"},
{"result": "results 646 - 651 from 651"}
]
Solution 2:
from scrapy.http import Request

def parse(self, response):
    # The highest page number is shown in the sixth item of the pagination list.
    total_pages = response.xpath("//body/section/div/section/div/div/ul/li[6]/a/text()").extract_first()
    # extract_first() returns a string, so cast it to int;
    # the +1 makes range() include the last page.
    urls = ('https://example.com/page-{}'.format(i) for i in range(1, int(total_pages) + 1))
    for url in urls:
        yield Request(url, callback=self.parse_page)

def parse_page(self, response):
    # do the stuff
    pass
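For context, here is a minimal sketch of a complete spider built around this fragment. The spider name, start URL, and the empty parse_page() body are assumptions added for illustration; the pagination XPath and URL pattern are carried over from the fragment above.

import scrapy
from scrapy.http import Request


class PagesSpider(scrapy.Spider):  # hypothetical name
    name = 'pages'
    start_urls = ['https://example.com/full']

    def parse(self, response):
        # Read the last page number from the pagination bar (selector assumed from the fragment above).
        total_pages = response.xpath("//body/section/div/section/div/div/ul/li[6]/a/text()").extract_first()
        # Request every page up front; Scrapy downloads them concurrently.
        for i in range(1, int(total_pages) + 1):
            yield Request('https://example.com/page-{}'.format(i), callback=self.parse_page)

    def parse_page(self, response):
        # Extract the items of interest from each page here.
        pass

Note the trade-off against Solution 1: all page requests are issued at once rather than chained one by one, which is faster but only works when the total page count is available up front.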
Solution 3:
An alternative way, as shown in the Scrapy tutorial, is to use yield response.follow(url, callback=self.parse_page),
which supports relative URLs directly, so no response.urljoin() call is needed.
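As a rough sketch, Solution 1 rewritten with response.follow() might look like this; the spider name is hypothetical, and the selectors are carried over from Solution 1.

import scrapy


class SanetFollowSpider(scrapy.Spider):  # hypothetical variant of Solution 1's spider
    name = 'sanet_follow'
    allowed_domains = ['sanet.st']
    start_urls = ['https://sanet.st/full/']

    def parse(self, response):
        yield {
            'result': response.xpath('//h3[@class="posts-results"]/text()').extract_first()
        }
        # response.follow() resolves the relative href itself,
        # so the response.urljoin() step from Solution 1 goes away.
        next_page = response.xpath('//a[@data-tip="Next page"]/@href').extract_first()
        if next_page:
            yield response.follow(next_page, callback=self.parse)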