
Unable To Get All The Data Including Links From A Tr Tag

I've written a script in Python to get data from some HTML elements inside a table. So far I have managed to pick out some of the data within a tr tag. My goal is to get the data (inc

Solution 1:

You can use either bs4 or regular expressions:

bs4:

from bs4 import BeautifulSoup as soup
# content is the HTML string holding the <tr> markup
s = soup(content, 'lxml')
# Read text and href from the same tag; href=True skips anchors without an href
new_data = [(a.text, a['href']) for a in s.find_all('a', href=True)]

Output:

[('Charles Hard Townes', '/wiki/Charles_Hard_Townes'), ('Nikolay Basov', '/wiki/Nikolay_Basov'), ('Alexander Prokhorov', '/wiki/Alexander_Prokhorov'), ('Dorothy Hodgkin', '/wiki/Dorothy_Hodgkin'), ('Konrad Emil Bloch', '/wiki/Konrad_Emil_Bloch'), ('Feodor Felix Konrad Lynen', '/wiki/Feodor_Felix_Konrad_Lynen'), ('Jean-Paul Sartre', '/wiki/Jean-Paul_Sartre'), ('[D]', '#endnote_Note1D'), ('Martin Luther King, Jr.', '/wiki/Martin_Luther_King,_Jr.')]

Regex:

import re
# Capture href and title from the same <a> tag so the pairs cannot drift apart;
# anchors without a title (such as the [D] footnote link) are skipped
final_data = re.findall(r'<a href="([^"]*)"[^>]*title="([^"]*)"', content)

Output:

[('/wiki/Charles_Hard_Townes', 'Charles Hard Townes'), ('/wiki/Nikolay_Basov', 'Nikolay Basov'), ('/wiki/Alexander_Prokhorov', 'Alexander Prokhorov'), ('/wiki/Dorothy_Hodgkin', 'Dorothy Hodgkin'), ('/wiki/Konrad_Emil_Bloch', 'Konrad Emil Bloch'), ('/wiki/Feodor_Felix_Konrad_Lynen', 'Feodor Felix Konrad Lynen'), ('/wiki/Jean-Paul_Sartre', 'Jean-Paul Sartre'), ('/wiki/Martin_Luther_King,_Jr.', 'Martin Luther King, Jr.')]
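
If you would rather not install anything and still want to avoid regular expressions, the standard library's html.parser can probably collect the same pairs. This is only a sketch: LinkCollector is a name made up for this example, and sample is a shortened piece of the row from the question, not the full markup.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (text, href) pairs for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (text, href) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = href
                self._text = []

    def handle_data(self, data):
        # Only keep text that appears while an <a href=...> is open
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text), self._href))
            self._href = None


# Shortened sample in the same shape as the question's row (assumption:
# the real input would be the whole <tr> string)
sample = (
    '<tr><td><span class="fn"><a href="/wiki/Jean-Paul_Sartre" '
    'title="Jean-Paul Sartre">Jean-Paul Sartre</a></span>'
    '<sup><a href="#endnote_Note1D">[D]</a></sup></td></tr>'
)

collector = LinkCollector()
collector.feed(sample)
collector.close()
print(collector.links)
```

Because text and href are read from the same tag, the pairs stay aligned even when a link has no title attribute.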

Solution 2:

This modified code returns each link's text together with its href:

from bs4 import BeautifulSoup

content="""
<tr>
    <td align="center">1964</td>
    <td><span class="sortkey">Townes, Charles Hard</span><span class="vcard"><span class="fn"><a href="/wiki/Charles_Hard_Townes" class="mw-redirect" title="Charles Hard Townes">Charles Hard Townes</a></span></span>;<br>
    <span class="sortkey">Basov, Nikolay</span><span class="vcard"><span class="fn"><a href="/wiki/Nikolay_Basov" title="Nikolay Basov">Nikolay Basov</a></span></span>;<br>
    <span class="sortkey">Prokhorov, Alexander</span><span class="vcard"><span class="fn"><a href="/wiki/Alexander_Prokhorov" title="Alexander Prokhorov">Alexander Prokhorov</a></span></span></td>
    <td><span class="sortkey">Hodgkin, Dorothy</span><span class="vcard"><span class="fn"><a href="/wiki/Dorothy_Hodgkin" title="Dorothy Hodgkin">Dorothy Hodgkin</a></span></span></td>
    <td><span class="sortkey">Bloch, Konrad Emil</span><span class="vcard"><span class="fn"><a href="/wiki/Konrad_Emil_Bloch" title="Konrad Emil Bloch">Konrad Emil Bloch</a></span></span>;<br>
    <span class="sortkey">Lynen, Feodor Felix Konrad</span><span class="vcard"><span class="fn"><a href="/wiki/Feodor_Felix_Konrad_Lynen" class="mw-redirect" title="Feodor Felix Konrad Lynen">Feodor Felix Konrad Lynen</a></span></span></td>
    <td><span class="sortkey">Sartre, Jean-Paul</span><span class="vcard"><span class="fn"><a href="/wiki/Jean-Paul_Sartre" title="Jean-Paul Sartre">Jean-Paul Sartre</a></span></span><sup class="reference" id="ref_Note1D"><a href="#endnote_Note1D">[D]</a></sup></td>
    <td><span class="sortkey">King, Jr., Martin Luther</span><span class="vcard"><span class="fn"><a href="/wiki/Martin_Luther_King,_Jr." class="mw-redirect" title="Martin Luther King, Jr.">Martin Luther King, Jr.</a></span></span></td>
    <td align="center"></td>
</tr>
"""
soup = BeautifulSoup(content, "lxml")
for row in soup.select('tr'):
    # ".fn a" matches only the name links, so footnote anchors such as [D] are skipped
    links = [[a.text, a.get('href')] for a in row.select(".fn a")]
    print(links)

Output:

[['Charles Hard Townes', '/wiki/Charles_Hard_Townes'], ['Nikolay Basov', '/wiki/Nikolay_Basov'], ['Alexander Prokhorov', '/wiki/Alexander_Prokhorov'], ['Dorothy Hodgkin', '/wiki/Dorothy_Hodgkin'], ['Konrad Emil Bloch', '/wiki/Konrad_Emil_Bloch'], ['Feodor Felix Konrad Lynen', '/wiki/Feodor_Felix_Konrad_Lynen'], ['Jean-Paul Sartre', '/wiki/Jean-Paul_Sartre'], ['Martin Luther King, Jr.', '/wiki/Martin_Luther_King,_Jr.']]

Solution 3:

Slightly simpler: no need to select the table rows separately.

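from bs4 import BeautifulSoup

# content is the HTML string defined in Solution 2
soup = BeautifulSoup(content, "lxml")
# A single descendant selector reaches the name links inside every row
for link in soup.select('tr .fn a'):
    print(link['href'])
    print(link.text)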
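
The hrefs that come back are relative (/wiki/...) or bare fragments (#endnote_...), so they usually need to be resolved against the page address before they can be fetched. A minimal sketch with urllib.parse.urljoin follows; BASE_URL is an assumption made up for this example, since the post never says which page the row came from.

```python
from urllib.parse import urljoin

# Assumed page URL (hypothetical): the original post does not name the page
BASE_URL = "https://en.wikipedia.org/wiki/Example_page"

# Two hrefs taken from the outputs above: one path, one fragment
hrefs = ["/wiki/Charles_Hard_Townes", "#endnote_Note1D"]

# urljoin replaces the path for "/..." hrefs and appends "#..." fragments
absolute = [urljoin(BASE_URL, href) for href in hrefs]
print(absolute)
```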
