
What Is A Simple Way To Extract The List Of Urls On A Webpage Using Python?

I want to create a simple web crawler for fun. I need the web crawler to get a list of all the links on one page. Does the Python library have any built-in functions that would make this easier?

Solution 1:

This is actually very simple with BeautifulSoup.

from BeautifulSoup import BeautifulSoup

[element['href'] for element in BeautifulSoup(document_contents).findAll('a', href=True)]

# [u'http://example.com/', u'/example', ...]

One last thing: you can use urlparse.urljoin to make all URLs absolute. If you need the link text, you can use something like element.contents[0].

And here's how you might tie it all together:

import urllib2
import urlparse
from BeautifulSoup import BeautifulSoup

def get_all_link_targets(url):
    return [urlparse.urljoin(url, tag['href']) for tag in
            BeautifulSoup(urllib2.urlopen(url)).findAll('a', href=True)]
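If you also want the visible link text alongside each URL, a minimal sketch along the same lines (BeautifulSoup 3 on Python 2, as above; the function name is just for illustration) could look like this:

from BeautifulSoup import BeautifulSoup

def get_links_with_text(document_contents):
    # Pair each href with the anchor's text; element.string is None when the
    # tag contains nested markup, so fall back to an empty string there.
    return [(element['href'], element.string or '')
            for element in BeautifulSoup(document_contents).findAll('a', href=True)]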

Solution 2:

There's an article on using HTMLParser to get URLs from <a> tags on a webpage.

The code is this:

from HTMLParser import HTMLParser
from urllib2 import urlopen

class Spider(HTMLParser):
    def __init__(self, url):
        HTMLParser.__init__(self)
        req = urlopen(url)
        self.feed(req.read())

    def handle_starttag(self, tag, attrs):
        if tag == 'a' and attrs:
            print "Found link => %s" % attrs[0][1]

Spider('http://www.python.org')

If you ran that script, you'd get output like this:

rafe@linux-7o1q:~> python crawler.py
Found link => /
Found link => #left-hand-navigation
Found link => #content-body
Found link => /search
Found link => /about/
Found link => /news/
Found link => /doc/
Found link => /download/
Found link => /community/
Found link => /psf/
Found link => /dev/
Found link => /about/help/
Found link => http://pypi.python.org/pypi
Found link => /download/releases/2.7/
Found link => http://docs.python.org/
Found link => /ftp/python/2.7/python-2.7.msi
Found link => /ftp/python/2.7/Python-2.7.tar.bz2
Found link => /download/releases/3.1.2/
Found link => http://docs.python.org/3.1/
Found link => /ftp/python/3.1.2/python-3.1.2.msi
Found link => /ftp/python/3.1.2/Python-3.1.2.tar.bz2
Found link => /community/jobs/
Found link => /community/merchandise/
Found link => margin-top:1.5em
Found link => margin-top:1.5em
Found link => margin-top:1.5em
Found link => color:#D58228; margin-top:1.5em
Found link => /psf/donations/
Found link => http://wiki.python.org/moin/Languages
Found link => http://wiki.python.org/moin/Languages
Found link => http://www.google.com/calendar/ical/b6v58qvojllt0i6ql654r1vh00%40group.calendar.google.com/public/basic.ics
Found link => http://wiki.python.org/moin/Python2orPython3
Found link => http://pypi.python.org/pypi
Found link => /3kpoll
Found link => /about/success/usa/
Found link => reference
Found link => reference
Found link => reference
Found link => reference
Found link => reference
Found link => reference
Found link => /about/quotes
Found link => http://wiki.python.org/moin/WebProgramming
Found link => http://wiki.python.org/moin/CgiScripts
Found link => http://www.zope.org/
Found link => http://www.djangoproject.com/
Found link => http://www.turbogears.org/
Found link => http://wiki.python.org/moin/PythonXml
Found link => http://wiki.python.org/moin/DatabaseProgramming/
Found link => http://www.egenix.com/files/python/mxODBC.html
Found link => http://sourceforge.net/projects/mysql-python
Found link => http://wiki.python.org/moin/GuiProgramming
Found link => http://wiki.python.org/moin/WxPython
Found link => http://wiki.python.org/moin/TkInter
Found link => http://wiki.python.org/moin/PyGtk
Found link => http://wiki.python.org/moin/PyQt
Found link => http://wiki.python.org/moin/NumericAndScientific
Found link => http://www.pasteur.fr/recherche/unites/sis/formation/python/index.html
Found link => http://www.pentangle.net/python/handbook/
Found link => /community/sigs/current/edu-sig
Found link => http://www.openbookproject.net/pybiblio/
Found link => http://osl.iu.edu/~lums/swc/
Found link => /about/apps
Found link => http://docs.python.org/howto/sockets.html
Found link => http://twistedmatrix.com/trac/
Found link => /about/apps
Found link => http://buildbot.net/trac
Found link => http://www.edgewall.com/trac/
Found link => http://roundup.sourceforge.net/
Found link => http://wiki.python.org/moin/IntegratedDevelopmentEnvironments
Found link => /about/apps
Found link => http://www.pygame.org/news.html
Found link => http://www.alobbs.com/pykyra
Found link => http://www.vrplumber.com/py3d.py
Found link => /about/apps
Found link => reference external
Found link => reference external
Found link => reference external
Found link => reference external
Found link => reference external
Found link => reference external
Found link => reference external
Found link => reference external
Found link => reference external
Found link => reference external
Found link => reference external
Found link => reference external
Found link => reference external
Found link => /channews.rdf
Found link => /about/website
Found link => http://www.xs4all.com/
Found link => http://www.timparkin.co.uk/
Found link => /psf/
Found link => /about/legal

You can then use a regex to distinguish between absolute and relative URLs.
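Note that attrs[0][1] simply grabs whatever attribute comes first, which is why the output above includes values like "reference" and "margin-top:1.5em". A rough sketch of that regex idea (hypothetical LinkCollector class, same Python 2 setup) that only looks at href values and then splits absolute from relative URLs:

import re
from HTMLParser import HTMLParser
from urllib2 import urlopen

class LinkCollector(HTMLParser):
    # Like Spider above, but stores href values instead of printing them.
    def __init__(self, url):
        HTMLParser.__init__(self)
        self.links = []
        self.feed(urlopen(url).read())

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':  # ignore class/style attributes
                    self.links.append(value)

collector = LinkCollector('http://www.python.org')
is_absolute = re.compile(r'^[a-zA-Z][a-zA-Z0-9+.-]*://').match
absolute = [link for link in collector.links if is_absolute(link)]
relative = [link for link in collector.links if not is_absolute(link)]
print "%d absolute, %d relative links found" % (len(absolute), len(relative))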

Solution 3:

Here's a solution using libxml2.

import urllib
import libxml2

# 'url' must point at the page to scan, e.g. url = 'http://www.python.org'
parse_opts = libxml2.HTML_PARSE_RECOVER + \
             libxml2.HTML_PARSE_NOERROR + \
             libxml2.HTML_PARSE_NOWARNING
doc = libxml2.htmlReadDoc(urllib.urlopen(url).read(), '', None, parse_opts)
print [i.getContent() for i in doc.xpathNewContext().xpathEval("//a/@href")]
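As in Solution 1, you could pass those hrefs through urlparse.urljoin to make them absolute. A sketch wrapping the snippet above into a function (get_absolute_links is just an illustrative name):

import urllib
import urlparse
import libxml2

def get_absolute_links(url):
    parse_opts = (libxml2.HTML_PARSE_RECOVER +
                  libxml2.HTML_PARSE_NOERROR +
                  libxml2.HTML_PARSE_NOWARNING)
    doc = libxml2.htmlReadDoc(urllib.urlopen(url).read(), '', None, parse_opts)
    hrefs = [node.getContent() for node in doc.xpathNewContext().xpathEval("//a/@href")]
    # Resolve relative links ('/about/', '#content-body', ...) against the page URL.
    return [urlparse.urljoin(url, href) for href in hrefs]

print get_absolute_links('http://www.python.org')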
