Parsing A Non-escaped Apostrophe In A Single-quoted Attribute Value With BeautifulSoup

Question

From a webpage, I want to get all the links and title strings. I use BeautifulSoup 4 for scraping. The links on the webpage look like this:

Solution 1:

The problem is not BeautifulSoup but your HTML, which is invalid. According to the HTML specification, a single-quoted attribute value has the following syntax:

The attribute name, followed by zero or more space characters, followed by a single U+003D EQUALS SIGN character, followed by zero or more space characters, followed by a single U+0027 APOSTROPHE character ('), followed by the attribute value, which, in addition to the requirements given above for attribute values, must not contain any literal U+0027 APOSTROPHE characters ('), and finally followed by a second single U+0027 APOSTROPHE character (').

While all of the parsers supported by BeautifulSoup will try to parse the invalid HTML in your question, none of them will do what you want:

>>> BeautifulSoup(src, "html.parser")

<a href="http://www.example1.com" title="A small secret for better estimates #4/16/2014 8:10:30 AM"> Example 1 </a>
<a #4="" 2014="" 4:36:07="" 9="" am'="" href="http://www.example2.com" make="" me="" t="" think="" title="Don"> Example 2</a>

>>> BeautifulSoup(src, "lxml")

<html><body><a href="http://www.example1.com" title="A small secret for better estimates #4/16/2014 8:10:30 AM"> Example 1 </a>
<a am="" href="http://www.example2.com" make="" me="" t="" think="" title="Don"> Example 2</a>
</body></html>

>>> BeautifulSoup(src, "html5lib")

<html><head></head><body><a href="http://www.example1.com" title="A small secret for better estimates #4/16/2014 8:10:30 AM"> Example 1 </a>
<a #4="" 2014="" 4:36:07="" 9="" am'="" href="http://www.example2.com" make="" me="" t="" think="" title="Don"> Example 2</a>
</body></html>

Neither will any modern browser:

Firefox
Firefox developer tools

Chrome
Chrome developer tools

IE 11
IE11 developer tools

If you want to represent an apostrophe inside a single-quoted attribute value, you need to use the ' character entity reference:

>>> BeautifulSoup("""
... <a href='http://www.example1.com' title='A small secret for better estimates #4/16/2014 8:10:30 AM'> Example 1 </a>
... <a href='http://www.example2.com' title='Don&apos;t make me think #4/9/2014 4:36:07 AM'> Example 2</a>
... """)

<html><body><a href="http://www.example1.com" title="A small secret for better estimates #4/16/2014 8:10:30 AM"> Example 1 </a>
<a href="http://www.example2.com" title="Don't make me think #4/9/2014 4:36:07 AM"> Example 2</a>
</body></html>

Alternatively, you can use a double-quoted attribute value:

>>> BeautifulSoup("""
... <a href='http://www.example1.com' title='A small secret for better estimates #4/16/2014 8:10:30 AM'> Example 1 </a>
... <a href='http://www.example2.com' title="Don't make me think #4/9/2014 4:36:07 AM"> Example 2</a>
... """)

<html><body><a href="http://www.example1.com" title="A small secret for better estimates #4/16/2014 8:10:30 AM"> Example 1 </a>
<a href="http://www.example2.com" title="Don't make me think #4/9/2014 4:36:07 AM"> Example 2</a>
</body></html>

Learn Python Tutorials

Parsing A Non-escaped Apostrophe In A Single-quoted Attribute Value With BeautifulSoup

Solution 1:

Post a Comment for "Parsing A Non-escaped Apostrophe In A Single-quoted Attribute Value With BeautifulSoup"