Delete Html Element If It Contains A Certain Amount Of Numeric Characters
Solution 1:
As I suggested you in comments, I wouldn't use regex to parse and use HTML in code. For example you could use a python library build up for this purpose like BeautifulSoup.
Here an example on how to use it
#!/usr/bin/python
try:
from BeautifulSoup import BeautifulSoup
except ImportError:
from bs4 import BeautifulSoup
html = """<html><head>Heading</head><bodyattr1='val1'><divclass='container'><divid='class'>Something here</div><div>Something else</div><tablestyle="width:100%"><tr><th>Firstname</th><th>Lastname</th><th>Age</th></tr><tr><td>Jill</td><td>Smith</td><td>50</td></tr><tr><td>Eve</td><td>Jackson</td><td>94</td></tr></table></div></body></html>"""
parsed_html = BeautifulSoup(html, 'html.parser')
print parsed_html.body.find('table').text
So you could end up with a code like that (just to give you an idea)
#!/usr/bin/pythonimport re
try:
from BeautifulSoup import BeautifulSoup
except ImportError:
from bs4 import BeautifulSoup
deftablereplace(table):
table = re.sub('<[^>]*>', ' ', str(table))
numeric = sum(c.isdigit() for c in table)
print('numeric: ' + str(numeric))
alphabetic = sum(c.isalpha() for c in table)
print('alpha: ' + str(alphabetic))
try:
ratio = numeric / float(numeric + alphabetic)
print('ratio: '+ str(ratio))
except ZeroDivisionError as err:
ratio = 1if ratio > 0.4:
returnTrueelse:
returnFalse
table = """<table style="width:100%">
<tr>
<th>Firstname</th>
<th>Lastname</th>
<th>Age</th>
</tr>
<tr>
<td>3241424134213424214321342424214321412</td>
<td>213423423234242142134214124214214124124</td>
<td>213424214234242</td>
</tr>
<tr>
<td>124234412342142414</td>
<td>1423424214324214</td>
<td>2134242141242341241</td>
</tr>
</table>
"""if tablereplace(table):
print'replace table'
parsed_html = BeautifulSoup(table, 'html.parser')
rawdata = parsed_html.find('table').text
print rawdata
UPDATE: Anyway just this line of your code strips away all HTML tags, as you will know 'cause you are using it for char/digit counting purpose
table = re.sub('<[^>]*>', ' ', str(table))
But it's not safe, because you could also have <> inside the text of your tags or the HTML could be shattered or misplaced
I left it where it is because for the example it's working. But consider to use BeautifulSoup for all HTML management.
Solution 2:
Thank you for your replies so far!
After intensive research, I found the solution to the mysterious deletion of the whole match. It seemed that the function only considered the first 150 or so characters of the match. However, if you specify table = table.group(0), the whole match is processed. group(0) accounts for the big difference here.
Below you can find my updated script thats works properly (also includes some other minor changes):
def tablereplace(table):
table= table.group(0)
table= re.sub('<[^>]*>', '\n', table)
numeric=sum(c.isdigit() for c intable)
alphabetic =sum(c.isalpha() for c intable)
try:
ratio =numeric/ (numeric+ alphabetic)
except ArithmeticError:
ratio =1else:
pass
if ratio >0.4:
emptystring =''return emptystring
else:
returntable
rawtext = re.sub('<table.+?<\/table>', tablereplace, rawtext, flags=re.IGNORECASE|re.DOTALL)
Post a Comment for "Delete Html Element If It Contains A Certain Amount Of Numeric Characters"