Python Regex - Stripping Out Html Tags And Formatting Characters From Inner Html

I'm dealing with single HTML strings like this >> s = 'u>\n Some text

Solution 1:

If I understand you right, you're looking to take this input:

u><br/>\n                                    Some text <br/><br/><u

And receive this output:

\n                                    Some text 

This is done simply enough by only caring about what comes between the two inward-pointing brackets. We want:

  • A right-bracket > (so we know where to begin)
  • Some text, here \n Some text (the content), which does not contain a left-bracket
  • A left-bracket < (so we know where to end)

You want:

>>> import re
>>> s = 'u><br/>\n                                    Some text <br/><br/><u'
>>> re.search(r'>([^<]+)<', s)
<_sre.SRE_Match object; span=(6, 55), match='>\n                                    Some text >

(The captured group can be accessed via .group(1).)
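
As a minimal sketch of reading the captured group: re.search returns None when nothing matches, so it is worth checking before calling .group(1). The helper name first_inner_text is made up for illustration, and the sample string uses fewer spaces than the one above.

```python
import re

# Same pattern as above: '>' then one or more non-'<' characters, then '<'.
INNER_TEXT = re.compile(r'>([^<]+)<')

def first_inner_text(s):
    """Return the first run of text between '>' and '<', or None if there is none."""
    match = INNER_TEXT.search(s)
    return match.group(1) if match else None

s = 'u><br/>\n    Some text <br/><br/><u'
print(repr(first_inner_text(s)))       # '\n    Some text '
print(first_inner_text('<br/><br/>'))  # None
```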

Additionally, you may want to use re.findall if you expect there to be multiple matches per line:

>>> re.findall(r'>([^<]+)<', s)
['\n                                    Some text ']

EDIT: To address the comment: if you have multiple matches and want to join them into a single string (effectively removing all the HTML-like tags), do:

>>> s = 'nbsp;<br><br>Some text.<br>Some \n more text.<br'
>>> ' '.join(re.findall(r'>([^<]+)<', s))
'Some text. Some \n more text.'
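
If the leftover newlines and runs of spaces are also unwanted, one possible follow-up is to collapse whitespace after joining. This is only a sketch and not part of the answer above: strip_tags is a hypothetical helper name, and it assumes simple tag fragments like the strings shown here, not arbitrary HTML.

```python
import re

def strip_tags(s):
    """Join the text found between '>' and '<' and collapse whitespace."""
    pieces = re.findall(r'>([^<]+)<', s)
    # str.split() with no arguments splits on any run of whitespace,
    # so newlines and repeated spaces collapse to single spaces.
    return ' '.join(' '.join(pieces).split())

s = 'nbsp;<br><br>Some text.<br>Some \n more text.<br'
print(strip_tags(s))  # Some text. Some more text.
```

Note that the pattern only captures text that has a > on its left and a < on its right, so a fragment after the last tag (or before the first one) is not picked up.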
