
Iterate Through Multiple Files And Append Text From Html Using Beautiful Soup

I have a directory of downloaded HTML files (46 of them) and I am attempting to iterate through each of them, read their contents, strip the HTML, and append only the text into a text file.

Solution 1:

You are not actually reading the HTML file; this should work:

soup = BeautifulSoup(open(webpage, 'r').read(), 'lxml')
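Dropping that one-liner into the full loop the question describes might look like the sketch below. The `collect_text` helper and file names are illustrative, not from the original, and it uses the stdlib `"html.parser"` backend so it runs even without `lxml` installed:

```python
import glob
import os
import tempfile

from bs4 import BeautifulSoup


def collect_text(directory, out_path):
    """Read every .html file in `directory`, strip the markup with
    BeautifulSoup, and append the plain text to `out_path`."""
    for page in sorted(glob.glob(os.path.join(directory, "*.html"))):
        with open(page, "r", encoding="utf-8") as fh:
            soup = BeautifulSoup(fh.read(), "html.parser")
        with open(out_path, "a", encoding="utf-8") as out:
            out.write(soup.get_text(separator=" ", strip=True) + "\n")


# Tiny demo: two generated files standing in for the 46 downloads.
tmp = tempfile.mkdtemp()
for i, body in enumerate(["<p>first page</p>", "<p>second page</p>"]):
    with open(os.path.join(tmp, f"page{i}.html"), "w", encoding="utf-8") as fh:
        fh.write(f"<html><body>{body}</body></html>")

collect_text(tmp, os.path.join(tmp, "all_text.txt"))
```

`get_text(separator=" ", strip=True)` collapses the markup into whitespace-separated text, which matches what the question asks for.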

Solution 2:

If you want to use lxml.html directly, here is a modified version of some code I've been using for a project. If you want to grab all the text, just don't filter by tag. There may be a way to do it without iterating, but I don't know of one. It saves the data as Unicode, so you will have to take that into account when opening the file.

import os
import glob

import lxml.html

path = '/'

# Whatever tags you want to pull text from.
visible_text_tags = ['p', 'li', 'td', 'h1', 'h2', 'h3', 'h4',
                     'h5', 'h6', 'a', 'div', 'span']

for infile in glob.glob(os.path.join(path, "*.html")):
    doc = lxml.html.parse(infile)

    file_text = []

    for element in doc.iter():  # Iterate once through the entire document
        try:  # Grab tag name and text (+ tail text)
            tag = element.tag
            text = element.text
            tail = element.tail
        except AttributeError:
            continue

        words = None  # text words split to list
        if tail:  # combine text and tail
            text = text + " " + tail if text else tail
        if text:  # lowercase and split to list
            words = text.lower().split()

        if tag in visible_text_tags:
            if words:
                file_text.append(' '.join(words))

    with open('example.txt', 'a', encoding='utf-8') as myfile:
        myfile.write(' '.join(file_text))
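On the "may be a way to do it without iterating" point: lxml elements do provide `itertext()`, which walks the tree once and yields every text and tail fragment in document order, so no per-tag filtering or manual `text`/`tail` handling is needed. A minimal sketch on an inline document (the sample HTML string is illustrative):

```python
import lxml.html

html = "<html><body><h1>Title</h1><p>Some <b>bold</b> text.</p></body></html>"
root = lxml.html.fromstring(html)

# itertext() yields each text/tail fragment in document order;
# strip and drop the whitespace-only ones, then rejoin.
all_text = ' '.join(t.strip() for t in root.itertext() if t.strip())
```

This grabs all text regardless of tag, so it is a shortcut only when you do not need the `visible_text_tags` filter above.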
