What Is A Simple Way To Extract The List Of Urls On A Webpage Using Python?
I want to create a simple web crawler for fun. I need the web crawler to get a list of all links on one page. Does the Python standard library have any built-in functions that would make this easier?
Solution 1:
This is actually very simple with BeautifulSoup.
from BeautifulSoup import BeautifulSoup
[element['href'] for element in BeautifulSoup(document_contents).findAll('a', href=True)]
# [u'http://example.com/', u'/example', ...]
One last thing: you can use urlparse.urljoin to make all URLs absolute. If you need the link text, you can use something like element.contents[0].
And here's how you might tie it all together:
import urllib2
import urlparse
from BeautifulSoup import BeautifulSoup
def get_all_link_targets(url):
    return [urlparse.urljoin(url, tag['href']) for tag in
            BeautifulSoup(urllib2.urlopen(url)).findAll('a', href=True)]
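If you also want the link text mentioned above, a small variation of that function can return (link text, absolute URL) pairs. This is only a sketch under the same Python 2 / BeautifulSoup 3 setup; the name get_all_links_with_text is made up here, and anchors without a plain text child just get an empty string:
import urllib2
import urlparse
from BeautifulSoup import BeautifulSoup

def get_all_links_with_text(url):
    soup = BeautifulSoup(urllib2.urlopen(url))
    links = []
    for tag in soup.findAll('a', href=True):
        # tag.contents[0] is the link text when the anchor wraps plain text
        text = tag.contents[0] if tag.contents else ''
        links.append((text, urlparse.urljoin(url, tag['href'])))
    return links

for text, target in get_all_links_with_text('http://www.python.org'):
    print text, '=>', target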
Solution 2:
There's an article on using HTMLParser to get URLs from <a> tags on a webpage.
The code is this:
from HTMLParser import HTMLParser
from urllib2 import urlopen

class Spider(HTMLParser):
    def __init__(self, url):
        HTMLParser.__init__(self)
        req = urlopen(url)
        self.feed(req.read())

    def handle_starttag(self, tag, attrs):
        if tag == 'a' and attrs:
            print "Found link => %s" % attrs[0][1]

Spider('http://www.python.org')
If you ran that script, you'd get output like this:
rafe@linux-7o1q:~> python crawler.py
Found link => /
Found link => #left-hand-navigation
Found link => #content-body
Found link => /search
Found link => /about/
Found link => /news/
Found link => /doc/
Found link => /download/
Found link => /community/
Found link => /psf/
Found link => /dev/
Found link => /about/help/
Found link => http://pypi.python.org/pypi
Found link => /download/releases/2.7/
Found link => http://docs.python.org/
Found link => /ftp/python/2.7/python-2.7.msi
Found link => /ftp/python/2.7/Python-2.7.tar.bz2
...
Found link => /channews.rdf
Found link => /about/website
Found link => http://www.xs4all.com/
Found link => http://www.timparkin.co.uk/
Found link => /psf/
Found link => /about/legal
You can then use a regex to distinguish between absolute and relative URLs.
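For example, one way to make that split is to check whether a link carries a scheme. This is just a sketch using the standard library's urlparse rather than a hand-written regex, and the helper name is_absolute is my own:
import urlparse

def is_absolute(link):
    # Absolute URLs start with a scheme such as http or https; relative ones do not.
    return bool(urlparse.urlparse(link).scheme)

links = ['/about/', 'http://pypi.python.org/pypi', '#content-body']
print [l for l in links if is_absolute(l)]      # absolute URLs
print [l for l in links if not is_absolute(l)]  # relative URLs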
Solution 3:
A solution using the libxml2 bindings:
import urllib
import libxml2

url = 'http://www.python.org'  # page to scan for links
parse_opts = libxml2.HTML_PARSE_RECOVER + \
             libxml2.HTML_PARSE_NOERROR + \
             libxml2.HTML_PARSE_NOWARNING
doc = libxml2.htmlReadDoc(urllib.urlopen(url).read(), '', None, parse_opts)
print [i.getContent() for i in doc.xpathNewContext().xpathEval("//a/@href")]
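And as in Solution 1, you can run those hrefs through urlparse.urljoin if you want absolute URLs back. A rough sketch along those lines (libxml2_links is just an illustrative name):
import urllib
import urlparse
import libxml2

def libxml2_links(url):
    # Parse the page leniently, pull every <a href>, and resolve it against the page URL.
    opts = (libxml2.HTML_PARSE_RECOVER +
            libxml2.HTML_PARSE_NOERROR +
            libxml2.HTML_PARSE_NOWARNING)
    doc = libxml2.htmlReadDoc(urllib.urlopen(url).read(), '', None, opts)
    hrefs = [node.getContent() for node in doc.xpathNewContext().xpathEval("//a/@href")]
    return [urlparse.urljoin(url, href) for href in hrefs]

print libxml2_links('http://www.python.org')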