
Return Empty Bracket [ ] When Web Scraping

I am trying to print all the titles on nytimes.com. I used the requests and BeautifulSoup modules, but I get empty brackets in the end: the returned result is [ ]. How can I fix this problem?

Solution 1:

I am assuming that you are trying to retrieve the headlines from nytimes.com. Doing title = soup.find_all("span", {'class':'balancedHeadline'}) will not get you your results. The <span> tag you find with the browser's element inspector is often misleading, because it may be inserted by JavaScript after the page loads and so may not exist in the HTML that requests downloads. What you have to do is look at the source code of the page and find the tags wrapped around the title.
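To see why the result is empty, here is a minimal sketch on a made-up HTML snippet. The markup, the variable names, and the sample headline are assumptions for illustration, not the real nytimes.com source. The headline only exists inside a <script> tag, so the search for the <span> should find nothing, while the search over <script> tags finds one match:

```python
from bs4 import BeautifulSoup

# Hypothetical static HTML, as requests might receive it: the headline text
# is only present inside a <script> tag, not inside any <span>.
raw_html = """
<html><body>
<div id="app"></div>
<script>window.__preloadedData = {"initialState": {"a1": {"promotionalHeadline": "Sample Headline"}}};</script>
</body></html>
"""

soup = BeautifulSoup(raw_html, "html.parser")

# The <span> shown by the browser inspector is not in this HTML.
spans = soup.find_all("span", {"class": "balancedHeadline"})
print(spans)

# The data is still there, just inside a <script> tag.
scripts_with_data = [
    s for s in soup.find_all("script") if "preloadedData" in (s.string or "")
]
print(len(scripts_with_data))
```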

For nytimes it's a little tricky because the headlines are wrapped in a <script> tag with a lot of junk inside. Hence, what you can do is "clean" it first and then deserialize the string by converting it into a Python dictionary object.
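The "clean and deserialize" step can be sketched on its own, using only the standard library. The function name parse_preloaded_data and the sample string are hypothetical, and the sample is far shorter than the real script content. This variant strips a trailing ";" only when it is the last character, which is a small change from splitting on the last ";" anywhere in the text:

```python
import json


def parse_preloaded_data(script_text):
    """Turn 'window.__preloadedData = {...};' into a Python dict."""
    # Drop everything up to and including the first '='.
    json_str = script_text.split("=", 1)[1].strip()
    # Drop the trailing ';' if there is one.
    if json_str.endswith(";"):
        json_str = json_str[:-1]
    return json.loads(json_str)


# Hypothetical script content with the same overall shape.
sample = 'window.__preloadedData = {"initialState": {"a1": {"promotionalHeadline": "Sample Headline"}}};'
data = parse_preloaded_data(sample)
print(type(data).__name__)
print(data["initialState"]["a1"]["promotionalHeadline"])
```

This assumes the assigned value is valid JSON. If the page embeds JavaScript-only values there, json.loads will raise an error and the text would need more cleaning first.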

import json

import requests
from bs4 import BeautifulSoup

url = "https://www.nytimes.com/"
r = requests.get(url)

r_html = r.text
soup = BeautifulSoup(r_html, "html.parser")

# Find the <script> tag that holds the preloaded page data.
data = None
for script in soup.find_all('script'):
    if 'preloadedData' in script.text:
        json_str = script.text
        json_str = json_str.split('=', 1)[1].strip()  # remove "window.__preloadedData = "
        json_str = json_str.rsplit(';', 1)[0]  # remove trailing ;
        data = json.loads(json_str)
        break

# Print every non-empty promotional headline.
if data is not None:
    for key, value in data['initialState'].items():
        try:
            if value['promotionalHeadline'] != "":
                print(value['promotionalHeadline'])
        except (KeyError, TypeError):
            # This entry has no headline, or is not a dictionary.
            continue
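
The headline loop can also be written without try/except. This is only a sketch: the function name collect_headlines and the sample data are made up, and the sample merely imitates the shape of the 'initialState' dictionary used above.

```python
def collect_headlines(initial_state):
    """Return the non-empty promotionalHeadline values found in initial_state."""
    headlines = []
    for value in initial_state.values():
        # Entries that are not dictionaries cannot hold a headline; skip them.
        if not isinstance(value, dict):
            continue
        headline = value.get("promotionalHeadline")
        if headline:
            headlines.append(headline)
    return headlines


# Hypothetical data imitating the shape of the 'initialState' dictionary.
sample_state = {
    "a1": {"promotionalHeadline": "First Headline"},
    "a2": {"promotionalHeadline": ""},
    "a3": {"summary": "no headline here"},
    "a4": "not a dict",
}
print(collect_headlines(sample_state))
```

Returning a list instead of printing inside the loop makes it easier to reuse or check the result later.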

Output:

Jeffrey Epstein Autopsy Results Conclude He Hanged Himself
Trump and Netanyahu Put Bipartisan Support for Israel at Risk
Congresswoman Rejects Israel’s Offer of a West Bank Visit
In Tlaib’s Ancestral Village, a Grandmother Weathers a Global Political Storm
Cathay Chief’s Resignation Shows China’s Power Over Hong Kong Unrest
Trump Administration Approves Fighter Jet Sales to Taiwan
Peace Road Map for Afghanistan Will Let Taliban Negotiate Women’s Rights
Debate Flares Over Afghanistan as Trump Considers Troop Withdrawal
In El Paso, Hundreds Show Up to Mourn a Woman They Didn’t Know
Is Slavery’s Legacy in the Power Dynamics of Sports?
Listen: ‘Modern Love’ Podcast
‘The Interpreter’
If You Think Trump Is Helping Israel, You’re a Fool
First They Came for the Black Feminists
How Women Can Escape the Likability Trap
With Trump as President, the World Is Spiraling Into Chaos
To Understand Hong Kong, Don’t Think About Tiananmen
The Abrupt End of My Big-Girl Summer
From Trump Boom to Trump Gloom
What Are Trump and Netanyahu Afraid Of?
King Bibi Bows Before a Tweet
Ebola Could Be Eradicated — But Only if the World Works Together
The Online Mob Came for Me. What Happened to the Reckoning?
A German TV Star Takes On Bullies
Why Is Hollywood So Scared of Climate Change?
Solving Medical Mysteries With Your Help: Now on Netflix

Solution 2:

title = soup.find_all("span", "balanceHeadline")

Replace it with:

title = soup.find_all("span", {'class':'balanceHeadline'})
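
As far as I know, Beautiful Soup treats a string passed as the second argument of find_all as a CSS class, so the two calls above should match the same tags. A quick way to check this is the sketch below, run on a made-up snippet that does contain the <span> (the markup is hypothetical):

```python
from bs4 import BeautifulSoup

# Hypothetical markup that does contain the <span>.
html = '<span class="balanceHeadline">Sample Headline</span><span class="other">x</span>'
soup = BeautifulSoup(html, "html.parser")

by_string = soup.find_all("span", "balanceHeadline")
by_dict = soup.find_all("span", {"class": "balanceHeadline"})

print(by_string == by_dict)
print([tag.get_text() for tag in by_dict])
```

If both forms give the same result here, then an empty [ ] on the real page more likely means the <span> is not in the downloaded HTML at all, or that the class name is spelled differently.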
