Scraping Excel From Website Using Python With _dopostback Link Url Hidden
For last few days I am trying to scrap the following website (link pasted below) which has a few excels and pdfs available in a table. I am able to do it for the home page successf
Solution 1:
For aspx sites, you need to look for things like __EVENTTARGET
, __EVENTVALIDATION
etc.. and post those parameters with each request, this will get all the pages and using requests with bs4:
import requests
from bs4 import BeautifulSoup
from urlparse import urljoin # python 3 use from urllib.parse import urljoin # All the keys need values set bar __EVENTTARGET, that stays the same.
data = {
"__EVENTTARGET": "gvFlex",
"__VIEWSTATE": "",
"__VIEWSTATEGENERATOR": "",
"__VIEWSTATEENCRYPTED": "",
"__EVENTVALIDATION": ""}
defvalidate(soup, data):
for k in data:
# update post values in data.if k != "__EVENTTARGET":
data[k] = soup.select_one("#{}".format(k))["value"]
defget_all_excel():
base = "http://accord.fairfactories.org/ffcweb/Web"
url = "http://accord.fairfactories.org/ffcweb/Web/ManageSuppliers/InspectionReportsEnglish.aspx"with requests.Session() as s:
# Add a user agent for each subsequent request.
s.headers.update({"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:47.0) Gecko/20100101 Firefox/47.0"})
r = s.get(url)
bs = BeautifulSoup(r.content, "lxml")
# get links from initial page.for xcl in bs.select("a[id*=CAP]"):
yield urljoin(base, xcl["href"])
# need to re-validate the post data in our dict for each request.
validate(bs, data)
last = bs.select_one("a[href*=Page$Last]")
i = 2# keep going until the last page button is not visiblewhile last:
# Increase the counter to set the target to the next page
data["__EVENTARGUMENT"] = "Page${}".format(i)
r = s.post(url, data=data)
bs = BeautifulSoup(r.content, "lxml")
for xcl in bs.select("a[id*=CAP]"):
yield urljoin(base, xcl["href"])
last = bs.select_one("a[href*=Page$Last]")
# again re-validate for next request
validate(bs, data)
i += 1for x in (get_all_excel()):
print(x)
If we run it on the first three pages, you can see we get the data you want:
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9965
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9552
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10650
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=11969
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10086
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10905
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10840
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9229
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=11310
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9178
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9614
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9734
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10063
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10871
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9468
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9799
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9278
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=12252
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9342
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9966
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=11595
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9652
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10271
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10365
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10087
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9967
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=11740
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=12375
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=11643
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10952
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=12013
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9810
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10953
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10038
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9664
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=12256
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9262
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9210
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9968
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9811
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=11610
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9455
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=11899
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10273
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9766
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9969
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10088
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10366
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9393
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9813
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=11795
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9814
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=11273
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=12187
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10954
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9556
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=11709
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9676
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10251
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10602
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10089
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9908
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10358
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9469
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=11333
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9238
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9816
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9817
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10736
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10622
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9394
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9818
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10592
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9395
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=11271
Post a Comment for "Scraping Excel From Website Using Python With _dopostback Link Url Hidden"