Scrapy Spider Not Returning Any Results

June 27, 2023

This is my first attempt at writing a spider, so please bear with me if I have not done it properly. Here is the link to the website I am trying to extract data from: http://www.4icu.org/i

Solution 1:

As stated in the comments on the question, there are some issues with your code.

First of all, you do not need two methods: your parse method requests the same URL that is already listed in start_urls, and Scrapy passes the response for each start URL to parse by default.

To get some information from the site, try the following code:

    def parse(self, response):
        for tr in response.xpath('//div[@class="section group"][5]/div[@class="col span_2_of_2"][1]/table//tr'):
            # only rows that contain a td with class "i" hold the data
            if tr.xpath(".//td[@class='i']"):
                name = tr.xpath('./td[1]/a/text()').extract()[0]
                location = tr.xpath('./td[2]//text()').extract()[0]
                print(name, location)

and adjust it to your needs to fill your item (or items).
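The row-filtering logic can be tried offline before running a crawl. The following is only a sketch: it uses the standard library's ElementTree instead of Scrapy selectors, and the markup, the names in it, and the helper rows_to_items are hypothetical stand-ins, not the real page. ElementTree has no text() step, so the location is taken from the joined text of the second cell.

```python
import xml.etree.ElementTree as ET

# Hypothetical rows imitating the table structure targeted by the XPath above.
SAMPLE = """
<table>
  <tr><th>University</th><th>Location</th></tr>
  <tr><td class="i"><a href="#">Example University</a></td><td>Example City</td></tr>
  <tr><td class="i"><a href="#">Sample Institute</a></td><td>Sample Town</td></tr>
</table>
"""


def rows_to_items(table):
    """Collect one dict per data row, skipping rows without a td of class "i"."""
    items = []
    for tr in table.findall(".//tr"):
        # Same test as in the spider: keep only rows containing td[@class='i']
        if tr.find(".//td[@class='i']") is None:
            continue
        name = tr.find("./td[1]/a").text
        location = "".join(tr.find("./td[2]").itertext()).strip()
        items.append({"name": name, "location": location})
    return items


items = rows_to_items(ET.fromstring(SAMPLE))
print(items)
```

In the spider itself, each dict would become an item that parse yields instead of being appended to a list.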

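Note that the XPath contains no tbody. Your browser displays an additional tbody element in the table, but that element is not present in the HTML Scrapy downloads. This means you cannot always take what you see in the browser at face value; check it against the response Scrapy actually receives.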
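The effect of the missing tbody can be reproduced on a small scale. This is a sketch under assumptions: the markup is a hypothetical, simplified stand-in for the raw HTML a server sends, and the standard library's ElementTree is used in place of Scrapy selectors so that it runs without any installation.

```python
import xml.etree.ElementTree as ET

# Hypothetical raw markup: the rows sit directly under <table>, with no <tbody>.
RAW_HTML = """
<html><body>
  <table>
    <tr><th>University</th><th>Location</th></tr>
    <tr><td class="i"><a href="#">Example University</a></td><td>Example City</td></tr>
  </table>
</body></html>
"""

root = ET.fromstring(RAW_HTML)

# A path copied from a browser inspector usually includes tbody ...
rows_with_tbody = root.findall(".//table/tbody/tr")
# ... while a path that ignores the intermediate levels matches the raw markup.
rows_without_tbody = root.findall(".//table//tr")

print(len(rows_with_tbody))     # 0
print(len(rows_without_tbody))  # 2
```

Using table//tr, as the XPath in this answer does, matches the rows whether or not a tbody is present.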

Solution 2:

Here is the working code:

    import scrapy


    class CollegesItem(scrapy.Item):
        # define the fields for your item here
        name = scrapy.Field()
        location = scrapy.Field()


    class CollegesSpider(scrapy.Spider):
        name = 'colleges'
        allowed_domains = ["4icu.org"]
        start_urls = ('http://www.4icu.org/in/',)

        def parse(self, response):
            # called automatically with the response for each URL in start_urls
            for tr in response.xpath('//div[@class="section group"][5]/div[@class="col span_2_of_2"][1]/table//tr'):
                # only rows that contain a td with class "i" hold the data
                if tr.xpath(".//td[@class='i']"):
                    item = CollegesItem()
                    item['name'] = tr.xpath('./td[1]/a/text()').extract()[0]
                    item['location'] = tr.xpath('./td[2]//text()').extract()[0]
                    yield item

After running the spider with the command

    scrapy crawl colleges -o mait.json

the following is a snippet of the results:

    [{"name": "Indian Institute of Technology Bombay", "location": "Mumbai"},
    {"name": "Indian Institute of Technology Madras", "location": "Chennai"},
    {"name": "University of Delhi", "location": "Delhi"},
    {"name": "Indian Institute of Technology Kanpur", "location": "Kanpur"},
    {"name": "Anna University", "location": "Chennai"},
    {"name": "Indian Institute of Technology Delhi", "location": "New Delhi"},
    {"name": "Manipal University", "location": "Manipal ..."},
    {"name": "Indian Institute of Technology Kharagpur", "location": "Kharagpur"},
    {"name": "Indian Institute of Science", "location": "Bangalore"},
    {"name": "Panjab University", "location": "Chandigarh"},
    {"name": "National Institute of Technology, Tiruchirappalli", "location": "Tiruchirappalli"}, .........
