
How To Scrape Subcategories And Pages In Categories Of A Category Wikipedia Page Using Python

So I'm trying to scrape all the subcategories and pages under the category header of the Category page: 'Category: Class-based programming languages' found at: https://en.wikipedi

Solution 1:

After more research, I found an answer to my own question. Using the urllib.request and json libraries, I requested the category's members from the Wikipedia API in JSON format and printed their titles. Here's the code I used to get the subcategories:

import json
import urllib.request

# Ask the MediaWiki API for the subcategories (cmtype=subcat) of the category.
pages = urllib.request.urlopen("https://en.wikipedia.org/w/api.php?action=query&list=categorymembers&cmtitle=Category:Class-based%20programming%20languages&format=json&cmlimit=500&cmtype=subcat")
data = json.load(pages)
query = data['query']
category = query['categorymembers']
for x in category:
    print(x['title'])

You can do the same thing for the pages in the category by changing cmtype=subcat to cmtype=page in the URL. Thanks to Nemo for trying to help me out!
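As a rough sketch of how the same request could be generalised to both member types, the snippet below builds the query URL from parameters and follows the API's cmcontinue token when a category has more members than one batch returns. The helper names (build_url, fetch_json, category_members) and the User-Agent string are made up for illustration and are not part of the original answer.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"


def build_url(category, member_type, continue_token=None):
    """Build a categorymembers query URL; member_type is 'page' or 'subcat'."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmtype": member_type,
        "cmlimit": "500",
        "format": "json",
    }
    if continue_token:
        params["cmcontinue"] = continue_token
    return API_URL + "?" + urllib.parse.urlencode(params)


def fetch_json(url):
    """Download and decode one API response (performs a network request)."""
    # The User-Agent value is a placeholder assumption; replace it with your own.
    request = urllib.request.Request(
        url, headers={"User-Agent": "category-members-example/0.1"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def category_members(category, member_type, fetch=fetch_json):
    """Return every member title, following 'cmcontinue' across batches."""
    titles = []
    token = None
    while True:
        data = fetch(build_url(category, member_type, token))
        titles.extend(m["title"] for m in data["query"]["categorymembers"])
        # The API includes a 'continue' object only when more results remain.
        token = data.get("continue", {}).get("cmcontinue")
        if not token:
            return titles
```

Calling category_members("Category:Class-based programming languages", "page") would then return the page titles, and passing "subcat" instead would return the subcategories. The fetch parameter exists only so the download step can be swapped out, for example when testing without network access.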

Solution 2:

import requests
from lxml import html

wiki_page = requests.get('https://en.wikipedia.org/wiki/Category:Class-based_programming_languages')
tree = html.fromstring(wiki_page.content)

To build your intuition for how to use this, right-click on, say, 'C++' and click 'Inspect'. The panel on the right will highlight

<a class="CategoryTreeLabel CategoryTreeLabelNs14 CategoryTreeLabelCategory" href="/wiki/Category:C%2B%2B">C++</a>

Right click on this, and click 'copy xpath'. For C++ this will give you

//*[@id="mw-subcategories"]/div/ul[1]/li/div/div[1]/a

Similarly, under the pages, for 'ActionScript' we get

//*[@id="mw-pages"]/div/div/div[1]/ul/li[1]/a

So if you're looking for all the subcategory/page names, you could do, for example

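# Select the link text inside each list item; a bare /text() on the container
# would return only the container's own (mostly whitespace) text nodes.
pages = tree.xpath('//*[@id="mw-pages"]//li/a/text()')
subcategories = tree.xpath('//*[@id="mw-subcategories"]//li//a/text()')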
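To check the selection logic offline, here is a small sketch that swaps lxml and XPath for the standard library's html.parser. It collects the text of links that sit inside list items of a container with a given id, which is the same idea as selecting link text under mw-pages or mw-subcategories. The class name, the helper function and the SAMPLE markup are all invented for illustration. The sample is a simplified stand-in for the real page, and the depth counting assumes every non-void tag is properly closed.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not change the depth count.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class SectionLinkCollector(HTMLParser):
    """Collect the text of <a> elements inside <li> items of one container."""

    def __init__(self, container_id):
        super().__init__()
        self.container_id = container_id
        self.depth = 0       # nesting depth inside the container (0 = outside)
        self.li_depth = 0    # number of <li> elements currently open
        self.in_link = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
            if tag == "li":
                self.li_depth += 1
            elif tag == "a" and self.li_depth:
                self.in_link = True
                self.titles.append("")
        elif dict(attrs).get("id") == self.container_id:
            self.depth = 1  # entered the container itself

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        if tag == "a":
            self.in_link = False
        elif tag == "li" and self.li_depth:
            self.li_depth -= 1
        self.depth -= 1

    def handle_data(self, data):
        if self.in_link:
            self.titles[-1] += data


def link_titles(html_text, container_id):
    parser = SectionLinkCollector(container_id)
    parser.feed(html_text)
    parser.close()
    return parser.titles


# Simplified, made-up stand-in for the category page markup.
SAMPLE = """
<div id="mw-subcategories"><ul>
  <li><div><a href="/wiki/Category:C%2B%2B">C++</a></div></li>
</ul></div>
<div id="mw-pages"><ul>
  <li><a href="/wiki/ActionScript">ActionScript</a></li>
  <li><a href="/wiki/Java_(programming_language)">Java (programming language)</a></li>
</ul></div>
"""

pages = link_titles(SAMPLE, "mw-pages")
subcategories = link_titles(SAMPLE, "mw-subcategories")
```

On real pages the lxml approach is more forgiving of unbalanced markup, so this version is best treated as a dependency-free way to experiment with the idea.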

For more information see here and here
