Python Regex - Stripping Out Html Tags And Formatting Characters From Inner Html
I'm dealing with single HTML strings like this >> s = 'u>\n Some text
Solution 1:
If I understand you right, you're looking to take this input:
u><br/>\n Some text <br/><br/><u
And receive this output:
\n Some text
This is done simply enough by only caring about what comes between the two inward-pointing brackets. We want:
- A right-bracket > (so we know where to begin)
- Some text \n Some text (the content) which does not contain a left-bracket
- A left-bracket < (so we know where to end)
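As a rough sketch, the three parts listed above can be written out one per line using re.VERBOSE, which lets a pattern carry comments. The names INNER_TEXT, sample, and inner are only illustrative and are not part of the original answer.

```python
import re

# The three parts of the pattern, spelled out with re.VERBOSE.
# (INNER_TEXT is a hypothetical name chosen for this sketch.)
INNER_TEXT = re.compile(r"""
    >          # a right-bracket: where the content begins
    ([^<]+)    # the content: one or more characters that are not a left-bracket
    <          # a left-bracket: where the content ends
""", re.VERBOSE)

sample = 'u><br/>\n Some text <br/><br/><u'
match = INNER_TEXT.search(sample)
inner = match.group(1)  # the captured content between the brackets
```

Because [^<]+ requires at least one character, back-to-back tags such as <br/><br/> contribute nothing, and only the run that actually holds text is captured.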
You want:
>>> import re
>>> s = 'u><br/>\n Some text <br/><br/><u'
>>> re.search(r'>([^<]+)<', s)
<_sre.SRE_Match object; span=(6, 20), match='>\n Some text <'>
(The captured group can be accessed via .group(1).)
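One thing worth guarding against: re.search returns None when nothing matches, so calling .group(1) directly would raise an AttributeError on a string with no text between tags. A minimal sketch of a guarded version follows; the helper name first_inner_text is hypothetical.

```python
import re

def first_inner_text(fragment):
    """Return the first run of text found between '>' and '<', or None."""
    match = re.search(r'>([^<]+)<', fragment)
    # re.search returns None when nothing matches, so check before .group(1).
    return match.group(1) if match else None

text = first_inner_text('u><br/>\n Some text <br/><br/><u')
nothing = first_inner_text('<br/><br/>')  # only tags, no text in between
```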
Additionally, you may want to use re.findall if you expect there to be multiple matches per line:
>>> re.findall(r'>([^<]+)<', s)
['\n Some text ']
EDIT: To address the comment: If you have multiple matches and you want to connect them into a single string (effectively removing all HTML-like tag things), do:
>>> s = 'nbsp;<br><br>Some text.<br>Some \n more text.<br'
>>> ' '.join(re.findall(r'>([^<]+)<', s))
'Some text. Some \n more text.'
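If the embedded \n and other stray whitespace should go as well, one possible follow-up is to split the joined result on whitespace and re-join it with single spaces. This is only a sketch under that assumption; strip_tags is a hypothetical helper name, not something from the original answer.

```python
import re

def strip_tags(fragment):
    """Join the text found between tags and collapse runs of whitespace."""
    pieces = re.findall(r'>([^<]+)<', fragment)
    # str.split() with no arguments splits on any whitespace, including '\n',
    # so re-joining the words with single spaces removes the newlines too.
    return ' '.join(' '.join(pieces).split())

cleaned = strip_tags('nbsp;<br><br>Some text.<br>Some \n more text.<br')
```

Note that anything before the first > or after the last < is dropped (the leading nbsp; here), which suits fragments like these but would lose text in a string that starts or ends with plain content.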