Scrapy Yield A Request, Parse In The Callback, But Use The Info In The Original Function
So I'm trying to test some webpages in Scrapy. My idea is to yield a Request to the URLs that satisfy the condition, count the number of certain items on the page, and then use that info back in the original function.
Solution 1:
Take a look at the inline_requests package, which should let you achieve this.
Another solution is to not insist on returning the result from the original method (filter_categories in your case), but instead use request chaining via the meta attribute of requests and return the result from the last parse method in the chain (test_page in your case).
Solution 2:
If I understood you correctly: you want to yield a scrapy.Request to URLs for which the condition is True. Am I right? Here is an example of it:
def parse(self, response):
    if self.test_page(response):
        item = Item()
        item['url'] = 'xpath or css'
        yield item
    if condition:
        yield Request(url=new_link, callback=self.parse, dont_filter=True)

def test_page(self, link):
    ... parse the response ...
    return True/False depending
If you give more info, I'll try to help more.
Here is part of my code:
def parse(self, response):
    if 'tag' in response.url:
        return self.parse_tag(response)
    if 'company' in response.url:
        return self.parse_company(response)

def parse_tag(self, response):
    try:
        news_list = response.xpath("..//div[contains(@class, 'block block-thumb ')]")
        company = response.meta['company']
        for i in news_list:
            item = Item()
            item['date'] = i.xpath("./div/div/time/@datetime").extract_first()
            item['title'] = i.xpath("./div/h2/a/text()").extract_first()
            item['description'] = i.xpath("./div/p//text()").extract_first()
            item['url'] = i.xpath("./div/h2/a/@href").extract_first()
            item.update(self.get_common_items(company))
            item['post_id'] = response.meta['post_id']
            if item['title']:
                yield scrapy.Request(item['url'], callback=self.parse_tags, meta={'item': item})
        has_next = response.xpath("//div[contains(@class, 'river-nav')]//li[contains(@class, 'next')]/a/@href").extract_first()
        if has_next:
            next_url = 'https://example.com' + has_next + '/'
            yield scrapy.Request(next_url, callback=self.parse_tag, meta=response.meta)
    except Exception:  # the except clause was cut off in the original snippet
        pass

def parse_tags(self, response):
    item = response.meta['item']
    item['tags'] = response.xpath(".//div[@class='accordion recirc-accordion']//ul//li[not(contains(@class, 'active'))]//a/text()").extract()
    yield item
Solution 3:
you can use:
response.meta
response.body
yield from a function
to refactor your spider