BeautifulSoup: Parse Every HTML File in a Folder (Web Scraping)
My task is to read every HTML file in a directory. The condition is to check whether each file contains the tags (1) OO and (2) QQ.
Solution 1:
Your write call is nested inside the for loop, which is why you write multiple lines to your index.txt. Move the write out of the loop and collect all the participant text in a variable, parti_names, like this:
import re

participants = soup.find(find_participant)
parti_names = ""
for parti in participants.find_next_siblings("p"):
    # Stop at the paragraph that introduces the Operator.
    if parti.find("strong", text=re.compile(r"(Operator)")):
        break
    parti_names += parti.get_text(strip=True) + ","
    print(parti.get_text(strip=True))

# Write once per file, after the loop has collected every participant.
with open('index.txt', 'a+') as indexFile:
    indexFile.write(filename + ', ' + title.get_text(strip=True) + ticker.get_text(strip=True) + ', ' + d_date.get_text(strip=True) + ', ' + parti_names + '\n')
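If it helps to see the pattern on its own, here is a minimal sketch with the BeautifulSoup calls swapped for plain strings, so it runs without your HTML files. The function names collect_participants and index_line and the sample values are made up for illustration; they are not from your script. It collects inside the loop, stops at the Operator paragraph, and builds exactly one line per file. It joins the names with ",".join, which also avoids the trailing comma that the += version leaves behind.

```python
def collect_participants(paragraphs, stop_word="Operator"):
    """Join paragraph texts with commas, stopping at the first one that contains stop_word."""
    names = []
    for text in paragraphs:
        if stop_word in text:
            break
        names.append(text.strip())
    return ",".join(names)


def index_line(filename, fields, participants):
    """Build the single index line for one file."""
    return ", ".join([filename] + fields + [participants]) + "\n"


# Hypothetical sample data standing in for the text of the <p> siblings.
paragraphs = ["Jane Doe - CEO", "John Roe - CFO", "Operator", "Good morning."]
line = index_line(
    "100107-.html",
    ["Example Corp", "2010-01-07"],
    collect_participants(paragraphs),
)
print(line, end="")
```

In the real script you would call the write once with a line like this, after the loop over the siblings has finished.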
Update:
You can work with basename to get the file name:
from os.path import basename
# you can call it directly with basename
print(basename("C:/Users/.../output/100107-.html"))
Output:
100107-.html
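For the "every HTML file in a directory" part, one way to get the list of files is glob together with basename. This is only a sketch: the helper name html_files is hypothetical, and the demo creates a throwaway folder with made-up file names so that it runs anywhere. In your case you would pass your own output folder instead.

```python
import glob
import os
import tempfile


def html_files(folder):
    """Return the paths of all .html files directly inside folder, sorted by name."""
    return sorted(glob.glob(os.path.join(folder, "*.html")))


# Demo on a temporary directory; the file names are placeholders.
with tempfile.TemporaryDirectory() as folder:
    for name in ("100107-.html", "100108-.html", "notes.txt"):
        with open(os.path.join(folder, name), "w") as fh:
            fh.write("<html></html>")

    # basename strips the directory part, leaving only the file name.
    names = [os.path.basename(path) for path in html_files(folder)]
    print(names)
```

Each path returned by html_files can then be opened and handed to BeautifulSoup, with basename(path) used as the filename column in index.txt.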