
BeautifulSoup: Parse Every HTML File in a Folder (Web Scraping)

My task is to read every HTML file in a directory. The condition is to check whether each file contains the tags (1) OO (2) QQ

Solution 1:

Your write call is nested inside the for loop, which is why multiple lines end up in index.txt. Move the write out of the loop and collect the text of each participant in a variable, parti_names, like this:

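import re

# soup and find_participant come from the code in the question
participants = soup.find(find_participant)
parti_names = ""
for parti in participants.find_next_siblings("p"):
    # stop at the first paragraph containing a <strong> tag that matches "Operator"
    # (newer Beautiful Soup versions name this argument string= instead of text=)
    if parti.find("strong", text=re.compile(r"(Operator)")):
        break
    # collect the names here instead of writing them to the file
    parti_names += parti.get_text(strip=True) + ","
    print(parti.get_text(strip=True))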

# write a single line per file, after the loop has finished
with open('index.txt', 'a+') as indexFile:
    indexFile.write(filename + ', ' + title.get_text(strip=True) + ticker.get_text(strip=True) + ', ' + d_date.get_text(strip=True) + ', ' + parti_names + '\n')
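
To see why the position of the write matters, here is a minimal sketch that uses plain strings in place of the parsed p tags. The sample names and the helper build_index_line are made up for illustration and are not part of the original code. The loop only collects text, and the file is written once afterwards, so each input file contributes exactly one line.

```python
import os
import tempfile


def build_index_line(filename, paragraphs, stop_marker="Operator"):
    """Collect paragraph texts up to the stop marker and return one index line."""
    parti_names = ""
    for text in paragraphs:
        if stop_marker in text:
            break
        parti_names += text + ","
    return filename + ", " + parti_names + "\n"


# Hypothetical sample data standing in for parti.get_text(strip=True) results.
sample = ["Jane Doe - CEO", "John Roe - CFO", "Operator", "Not a participant"]

# A temporary folder keeps the demo from touching a real index.txt.
index_path = os.path.join(tempfile.mkdtemp(), "index.txt")
with open(index_path, "a") as index_file:
    # One write per input file, outside the collection loop.
    index_file.write(build_index_line("100107-.html", sample))

with open(index_path) as index_file:
    lines = index_file.readlines()
print(lines)
```

If the write were moved back inside the for loop, the same file would get one line per participant instead of one line per HTML file.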

Update:

To get just the file name from a full path, you can use basename:

from os.path import basename

# you can call it directly with basename
print(basename("C:/Users/.../output/100107-.html"))

Output:

100107-.html
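
Since the goal is to process every HTML file in a folder, basename combines naturally with glob. The following is a rough sketch rather than part of the original answer: the function name list_html_files and the temporary demo folder are assumptions for illustration.

```python
import glob
import os
import tempfile
from os.path import basename


def list_html_files(folder):
    """Return the sorted full paths of all .html files directly inside folder."""
    return sorted(glob.glob(os.path.join(folder, "*.html")))


# Build a throwaway folder with two HTML files and one unrelated file.
folder = tempfile.mkdtemp()
for name in ("a.html", "b.html", "notes.txt"):
    with open(os.path.join(folder, name), "w") as handle:
        handle.write("<html><body><p>demo</p></body></html>")

# Only the .html files are picked up; basename strips the folder part.
names = [basename(path) for path in list_html_files(folder)]
print(names)
```

Inside such a loop, each path could be opened and handed to BeautifulSoup, with basename(path) used as the filename field written to index.txt.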
