
Not Able To Extract Nested Table Body With Pandas From Webpage

I am trying to extract nested table from the url 'http://gsa.nic.in/report/janDhan.html' using pandas with code: import pandas as pd url ='http://gsa.nic.in/report/janDhan.html' ta

Solution 1:

The table is populated by JavaScript, so it is not in the HTML that pandas fetches. You can confirm this by viewing the page source in your browser and searching for a value that appears in the table, such as "PRADESH."

The solution is to use a library such as requests-html or Selenium to load the JavaScript-rendered page, and then parse the resulting HTML with pandas.
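The same check can be scripted instead of done by hand in the browser. The following is a minimal sketch using only the standard library; the helper names `fetch_raw_html` and `appears_in_html` are made up for illustration and are not part of pandas or any other library.

```python
import urllib.request


def fetch_raw_html(url, timeout=30):
    """Download the page exactly as the server sends it (no JavaScript is run)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def appears_in_html(html, needle):
    """Case-insensitive check for a value that is visible in the rendered table."""
    return needle.lower() in html.lower()


# Usage (needs network access):
# html = fetch_raw_html("http://gsa.nic.in/report/janDhan.html")
# print(appears_in_html(html, "PRADESH"))
```

If the check comes back `False` for a value you can see in the browser, the rows are most likely being filled in after the page loads, which is the situation described here.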

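from io import StringIO

import pandas as pd
from requests_html import HTMLSession

url = 'http://gsa.nic.in/report/janDhan.html'

s = HTMLSession()
r = s.get(url)
r.html.render()  # runs the page's JavaScript (downloads Chromium on first use)

# r.html is an HTML object; read_html needs the rendered markup itself
table = pd.read_html(StringIO(r.html.html))[3]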
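For the Selenium route mentioned above, one possible approach is to load the page and poll until a known table value shows up in the page source. This is only a sketch: `page_source_when_ready` is a hypothetical helper, the choice of "PRADESH" as the readiness marker is an assumption borrowed from the check suggested earlier, and the commented usage assumes Chrome and Selenium are installed.

```python
import time


def page_source_when_ready(driver, url, needle, timeout=30.0, poll=0.5):
    """Open url and return the page source once `needle` shows up in it.

    `driver` is any object with Selenium's WebDriver interface
    (a `get(url)` method and a `page_source` attribute).
    """
    driver.get(url)
    deadline = time.monotonic() + timeout
    while True:
        html = driver.page_source
        if needle in html:
            return html
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{needle!r} did not appear within {timeout} s")
        time.sleep(poll)


# Usage with Selenium and headless Chrome (both must be installed):
# from io import StringIO
# import pandas as pd
# from selenium import webdriver
#
# options = webdriver.ChromeOptions()
# options.add_argument("--headless=new")
# driver = webdriver.Chrome(options=options)
# try:
#     html = page_source_when_ready(
#         driver, "http://gsa.nic.in/report/janDhan.html", "PRADESH")
# finally:
#     driver.quit()
# tables = pd.read_html(StringIO(html))
```

Polling on a value from the table, rather than sleeping for a fixed time, avoids parsing the page before the script has filled in the rows.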

Solution 2:

As Eric pointed out, the table is populated by JavaScript.

However, it is quite easy to intercept the API call the page makes internally by using Chrome's developer tools.

Go to the Network tab and filter by XHR, and you will find the endpoint the page is calling, which is

http://gsa.nic.in/gsaservice/services/service.svc/gsastatereport?schemecode=PMJDY

Then a simple script like this will get you the data nicely formatted:

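import json
import pandas as pd
import requests


r = requests.get('http://gsa.nic.in/gsaservice/services/service.svc/gsastatereport?schemecode=PMJDY')
# The 'd' field is itself a JSON-encoded string, so it needs a second decode
data = json.loads(r.json()['d'])
# The rows sit under the 'data' key of the first element
pd.DataFrame(data[0]['data'])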

   LGDStateCode  StateName          totalSaturatedVillage  villageSaturatedTillDate  TotalBeneficiaries  TotalBeneficiariesRegisteredTillDate  Saturation
0  28            ANDHRA PRADESH     305                    305                       27238               27238                                 100.00
1  12            ARUNACHAL PRADESH  299                    283                       42331               39999                                 94.49
2  18            ASSAM              3042                   2375                      648815              621878                                95.85
3  10            BIHAR              635                    544                       92356               90131                                 97.5
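The script decodes JSON twice because the response wraps the real payload as a string under the key `d`. The sketch below walks through that unwrapping on a small stand-in payload, so it runs without network access. The sample is hypothetical: its shape follows the script above, but the field types (strings here) and the helper name `extract_rows` are assumptions.

```python
import json


def extract_rows(response_text):
    """Unwrap the two JSON layers and return the list of row dicts."""
    outer = json.loads(response_text)  # {"d": "<JSON encoded as a string>"}
    inner = json.loads(outer["d"])     # second decode: the string becomes a list
    return inner[0]["data"]            # the first element holds the rows under "data"


# Hypothetical sample shaped like the response handled by the script above
sample_rows = [
    {"LGDStateCode": "28", "StateName": "ANDHRA PRADESH", "Saturation": "100.00"},
]
sample_text = json.dumps({"d": json.dumps([{"data": sample_rows}])})

rows = extract_rows(sample_text)
print(rows[0]["StateName"])
```

The returned list of dicts can be passed straight to `pd.DataFrame(rows)`. If the real service returns the numeric fields as strings, they would still need converting, for example with `pd.to_numeric`.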
