BeautifulSoup: Parse Every HTML File in a Folder (Web Scraping)

Question

My task is to read every HTML file in a directory. The condition is to check whether each file contains the tags (1) OO (2) QQ
Solution 1:

Your write call is nested inside the for loop, which is why multiple lines are written to index.txt. Move the write out of the loop and accumulate the participant text in a variable, parti_names, like this:

    # soup, find_participant, filename, title, ticker and d_date come from the question's code
    participants = soup.find(find_participant)
    parti_names = ""
    for parti in participants.find_next_siblings("p"):
        # stop at the paragraph that introduces the Operator
        if parti.find("strong", text=re.compile(r"(Operator)")):
            break
        parti_names += parti.get_text(strip=True) + ","
        print(parti.get_text(strip=True))

    # write once per file, after the loop has collected every participant
    indexFile = open('index.txt', 'a+')
    indexFile.write(filename + ', ' + title.get_text(strip=True) + ticker.get_text(strip=True) + ', ' + d_date.get_text(strip=True) + ', ' + parti_names + '\n')
    indexFile.close()
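The accumulate-then-write pattern above can also be sketched without BeautifulSoup, using plain strings in place of the parsed paragraphs. This is only an illustration under stated assumptions: collect_participants and append_index_line are hypothetical helper names that do not appear in the original code, and joining with ",".join drops the trailing comma that the += version leaves behind.

```python
def collect_participants(paragraphs, stop_marker="Operator"):
    """Join paragraph texts with commas, stopping at the first one that contains stop_marker."""
    names = []
    for text in paragraphs:
        if stop_marker in text:
            break  # same role as the `break` on the Operator paragraph
        names.append(text.strip())
    return ",".join(names)


def append_index_line(index_path, filename, fields, paragraphs):
    """Append exactly one line per file to the index and return that line (hypothetical helper)."""
    parts = [filename] + list(fields) + [collect_participants(paragraphs)]
    line = ", ".join(parts)
    # the write happens once, outside any loop over paragraphs
    with open(index_path, "a+") as index_file:
        index_file.write(line + "\n")
    return line
```

Calling append_index_line once for each HTML file yields one row per file in index.txt, however many participant paragraphs the file contains.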

Update:

You can use basename to get the file name:

    from os.path import basename

    # you can call it directly with basename
    print(basename("C:/Users/.../output/100107-.html"))

Output:

100107-.html
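To apply basename to every HTML file in a folder, one option is to combine it with the standard-library glob module. The following is a minimal sketch, assuming the files sit directly in one folder and end in .html; list_html_files is a hypothetical name, not part of the original answer.

```python
import glob
import os
from os.path import basename


def list_html_files(folder):
    """Return the sorted base names of all .html files directly inside folder."""
    pattern = os.path.join(folder, "*.html")
    return sorted(basename(path) for path in glob.glob(pattern))
```

Each returned name can then serve as the filename field written to index.txt.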
