
Delete Html Element If It Contains A Certain Amount Of Numeric Characters

To transform an HTML-formatted file into a plain text file with Python, I need to delete every table whose text contains more than 40% numeric characters. Specifi

Solution 1:

As I suggested in the comments, I wouldn't use regex to parse and process HTML in code. Instead, you could use a Python library built for this purpose, such as BeautifulSoup.

Here is an example of how to use it:

#!/usr/bin/python
try:
    from BeautifulSoup import BeautifulSoup
except ImportError:
    from bs4 import BeautifulSoup
html = """<html><head>Heading</head><body attr1='val1'><div class='container'><div id='class'>Something here</div><div>Something else</div><table style="width:100%"><tr><th>Firstname</th><th>Lastname</th><th>Age</th></tr><tr><td>Jill</td><td>Smith</td><td>50</td></tr><tr><td>Eve</td><td>Jackson</td><td>94</td></tr></table></div></body></html>"""
parsed_html = BeautifulSoup(html, 'html.parser')
# Print only the text of the first table, without any tags.
print(parsed_html.body.find('table').text)

So you could end up with code like this (just to give you an idea):

#!/usr/bin/python
import re
try:
    from BeautifulSoup import BeautifulSoup
except ImportError:
    from bs4 import BeautifulSoup


def tablereplace(table):
    # Strip the tags so that only the visible text is counted.
    table = re.sub('<[^>]*>', ' ', str(table))
    numeric = sum(c.isdigit() for c in table)
    print('numeric: ' + str(numeric))
    alphabetic = sum(c.isalpha() for c in table)
    print('alpha: ' + str(alphabetic))
    try:
        ratio = numeric / float(numeric + alphabetic)
        print('ratio: ' + str(ratio))
    except ZeroDivisionError:
        # Neither digits nor letters: treat the table as numeric.
        ratio = 1
    # True means the table is more than 40% numeric.
    if ratio > 0.4:
        return True
    else:
        return False

table = """<table style="width:100%">
  <tr>
    <th>Firstname</th>
    <th>Lastname</th> 
    <th>Age</th>
  </tr>
  <tr>
    <td>3241424134213424214321342424214321412</td>
    <td>213423423234242142134214124214214124124</td> 
    <td>213424214234242</td>
  </tr>
  <tr>
    <td>124234412342142414</td>
    <td>1423424214324214</td> 
    <td>2134242141242341241</td>
  </tr>
</table>
"""

if tablereplace(table):
    print('replace table')
    parsed_html = BeautifulSoup(table, 'html.parser')
    rawdata = parsed_html.find('table').text
    print(rawdata)

UPDATE: Note that this single line of your code already strips away all HTML tags, as you know, since you use it for counting letters and digits:

table = re.sub('<[^>]*>', ' ', str(table))

But it's not safe, because you could also have < or > inside the text or attributes of your tags, or the HTML could be malformed or badly nested.
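To make that risk concrete, here is a minimal sketch that compares the regex with the standard library's html.parser on a cell whose attribute value contains a ">". The TextExtractor class, the two helper functions and the sample markup are made up for this illustration; they are not part of the code above.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects only the text nodes of a document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def text_via_regex(html):
    # Same pattern as above: drops everything from '<' to the next '>'.
    return re.sub('<[^>]*>', ' ', html)


def text_via_parser(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return ' '.join(parser.chunks)


# Made-up sample: the attribute value itself contains a '>'.
sample = '<td title="a > b">42</td>'
print(repr(text_via_regex(sample)))
print(repr(text_via_parser(sample)))
```

The regex stops at the first ">" it sees, so part of the attribute leaks into the "text" and adds a letter to the count, while the parser returns only the cell content.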

I left it as it is because it works for this example. But consider using BeautifulSoup for all HTML handling.
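As a rough sketch of what that could look like, the following uses BeautifulSoup both to read the table text and to remove the table. The function names numeric_ratio and drop_numeric_tables, the threshold parameter and the sample input are my own choices for illustration, and the sketch assumes bs4 is installed and that tables are not nested inside each other.

```python
from bs4 import BeautifulSoup


def numeric_ratio(text):
    """Share of digits among all digits and letters in text."""
    numeric = sum(c.isdigit() for c in text)
    alphabetic = sum(c.isalpha() for c in text)
    total = numeric + alphabetic
    # Assumption: text without any letters or digits counts as numeric,
    # mirroring the ZeroDivisionError branch above.
    return numeric / total if total else 1.0


def drop_numeric_tables(html, threshold=0.4):
    """Return the plain text of html without its mostly numeric tables."""
    soup = BeautifulSoup(html, 'html.parser')
    # Assumption: tables are not nested inside other tables.
    for table in soup.find_all('table'):
        if numeric_ratio(table.get_text()) > threshold:
            table.decompose()  # remove the table and its children from the tree
    return soup.get_text(separator=' ', strip=True)


# Made-up input: the first table is mostly digits, the second mostly letters.
example = (
    '<p>Intro</p>'
    '<table><tr><td>123456</td><td>ab</td></tr></table>'
    '<table><tr><td>Name</td><td>Jill</td><td>50</td></tr></table>'
)
print(drop_numeric_tables(example))
```

Because the ratio is computed on table.get_text(), tag names and attribute values never enter the count, so no tag-stripping regex is needed.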

Solution 2:

Thank you for your replies so far!

After intensive research, I found the cause of the mysterious deletion of the whole match. It seemed that the function only considered the first 150 or so characters of the match. However, if you specify table = table.group(0), the whole match is processed. group(0) makes the big difference here.
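A likely explanation, offered as a sketch rather than as a confirmed diagnosis: when the replacement passed to re.sub is a function, it receives a match object rather than a string, and str() of a match object gives a short description of the match instead of the matched text. The names inspect_match and seen, and the sample input, are made up for this demonstration.

```python
import re

seen = []


def inspect_match(match):
    # re.sub calls this once per match and passes a match object, not a str.
    seen.append({
        'is_str': isinstance(match, str),
        'str_len': len(str(match)),       # length of the short description
        'text_len': len(match.group(0)),  # length of the full matched text
    })
    return ''  # delete the match, as in the question


# Made-up input: one table holding 200 digits.
html = 'before <table><tr><td>' + '1' * 200 + '</td></tr></table> after'
result = re.sub(r'<table.+?</table>', inspect_match, html,
                flags=re.IGNORECASE | re.DOTALL)

print(seen[0])
print(result)
```

Counting characters in str(match) therefore measures the description, not the table, which is why calling group(0) first changes the outcome.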

Below you can find my updated script that works properly (it also includes some other minor changes):

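import re


def tablereplace(table):
    # re.sub passes a match object; group(0) is the full matched text.
    table = table.group(0)
    table = re.sub('<[^>]*>', '\n', table)
    numeric = sum(c.isdigit() for c in table)
    alphabetic = sum(c.isalpha() for c in table)
    try:
        # float() keeps this a true division on Python 2 as well.
        ratio = numeric / float(numeric + alphabetic)
    except ZeroDivisionError:
        ratio = 1
    if ratio > 0.4:
        # Mostly numeric: replace the whole table with nothing.
        return ''
    else:
        # Otherwise keep the table's text (tags replaced by newlines).
        return table


# rawtext is expected to hold the HTML document as a string.
rawtext = re.sub(r'<table.+?</table>', tablereplace, rawtext, flags=re.IGNORECASE | re.DOTALL)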
