What Is The Difference In Accessing Cloudflare Website Using Chromedriver/chrome In Normal/headless Mode Through Selenium Python

February 15, 2024

I have a question about --headless mode in Python Selenium for Chrome. Code from selenium import webdriver from selenium.webdriver.common.desired_capabilities import DesiredCapab

Solution 1:

It's the HTTP User-Agent header that Cloudflare doesn't like.

To get around this issue, simply change your user-agent chrome option (below code is for Selenium in Python):

option.add_argument('--headless')
option.add_argument("user-agent=Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36")
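
A slightly fuller sketch of the same idea follows. Instead of hard-coding a user-agent, it derives one from the string the headless browser reports by dropping the "Headless" marker. The helper names strip_headless_marker and build_headless_options are illustrative and not part of Selenium, and overriding the user-agent alone may not be enough if the site checks other signals.

```python
def strip_headless_marker(user_agent):
    """Return the user-agent with 'HeadlessChrome' rewritten to 'Chrome'."""
    return user_agent.replace("HeadlessChrome", "Chrome")


def build_headless_options(user_agent):
    """Build ChromeOptions for headless mode with an overridden user-agent.

    Selenium is imported lazily so the helper above stays usable without it.
    """
    from selenium import webdriver  # assumes Selenium is installed

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    options.add_argument(f"user-agent={strip_headless_marker(user_agent)}")
    return options


# Example input: a user-agent as reported by a headless Chrome session.
headless_ua = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 "
    "(KHTML, like Gecko) HeadlessChrome/83.0.4103.61 Safari/537.36"
)
print(strip_headless_marker(headless_ua))
```

The result of build_headless_options(headless_ua) would be passed as webdriver.Chrome(options=...).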

Solution 2:

I tested using this server-side script:

<?php
echo "<pre><code>";
var_dump($_SERVER);
echo "</code></pre>";
?>
<script>
    var el = document.getElementsByTagName('code')[0];
    for(var prop in window.navigator){
        var str = JSON.stringify(window.navigator[prop])
        el.innerHTML = el.innerHTML + "window.navigator." + prop + " = " + str + "\n";
    }
    var skip_props = ['parent', 'top', 'frames', 'self', 'window'];
    for(var prop in window){
        if (skip_props.indexOf(prop) > -1) { continue; }
        el.innerHTML = el.innerHTML + "window." + prop + " = ";
        var str = JSON.stringify(window[prop])
        el.innerHTML = el.innerHTML + str + "\n";
    }
</script>

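I loaded this page using ChromeDriver, with and without --headless, and printed the output using print(driver.find_element_by_tag_name('code').text) (in current Selenium 4 releases the equivalent call is driver.find_element(By.TAG_NAME, 'code').text). I then diffed both outputs. Here are the differences I found (normal vs. headless):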

HTTP Accept-Language header: en-US,en;q=0.9 vs en-US
HTTP User-Agent header: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36 vs Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/83.0.4103.61 Safari/537.36 (Note the HeadlessChrome mention in the second string.)
JavaScript window.navigator.plugins: {"0":{"0":{}},"1":{"0":{}},"2":{"0":{},"1":{}}} vs {}
JavaScript window.navigator.mimeTypes: {"0":{},"1":{},"2":{},"3":{}} vs {}
JavaScript window.outerWidth: 1367 vs 0
JavaScript window.outerHeight: 641 vs 0
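
The comparison step described above can be sketched with Python's difflib. This is a minimal illustration, assuming the two page dumps have already been captured as strings; the function name diff_dumps and the sample values are made up for the example.

```python
import difflib


def diff_dumps(normal, headless):
    """Return only the lines that differ between two text dumps.

    Lines removed from the normal dump start with '-', lines added in the
    headless dump start with '+'.
    """
    lines = list(
        difflib.unified_diff(
            normal.splitlines(),
            headless.splitlines(),
            fromfile="normal",
            tofile="headless",
            lineterm="",
            n=0,  # no context lines, only the changes
        )
    )
    # The first two lines are the '---' / '+++' file headers; skip them,
    # then keep change lines and drop the '@@' hunk markers.
    return [line for line in lines[2:] if line.startswith(("+", "-"))]


# Hypothetical excerpts of the two dumps.
normal_dump = "window.outerWidth = 1367\nwindow.outerHeight = 641\nwindow.name = \"\""
headless_dump = "window.outerWidth = 0\nwindow.outerHeight = 0\nwindow.name = \"\""

for line in diff_dumps(normal_dump, headless_dump):
    print(line)
```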

Of note: the Python script you posted is missing a few lines that hide the navigator.webdriver property (without them, it is trivial for the server to detect that you are using WebDriver):

driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
    "source": """
    Object.defineProperty(navigator, 'webdriver', {
      get: () => undefined
    })
  """
})
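
Because the script is registered for new documents, it has to be installed before the first driver.get() call to affect that page. The sketch below wraps the command in a helper and adds a check that reads the property back. The function names hide_webdriver_flag and webdriver_flag_hidden are illustrative, not Selenium APIs; the only driver methods used are execute_cdp_cmd and execute_script.

```python
HIDE_WEBDRIVER_JS = """
Object.defineProperty(navigator, 'webdriver', {
  get: () => undefined
})
"""


def hide_webdriver_flag(driver):
    """Register the script so it runs at the start of every new document."""
    return driver.execute_cdp_cmd(
        "Page.addScriptToEvaluateOnNewDocument",
        {"source": HIDE_WEBDRIVER_JS},
    )


def webdriver_flag_hidden(driver):
    """Return True if navigator.webdriver reads as undefined on the current page."""
    # JavaScript 'undefined' comes back to Python as None.
    return driver.execute_script("return navigator.webdriver") is None
```

Typical order of calls: create the driver, call hide_webdriver_flag(driver), then driver.get(url), then optionally webdriver_flag_hidden(driver) to confirm.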

Solution 3:

I took your code, removed the optional arguments and added a few arguments to execute the test as follows:

Code Block:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = webdriver.ChromeOptions() 
options.add_argument("start-maximized")
options.add_argument("--headless")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(options=options, executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
driver.get("https://www.manta.com/c/mm2956g/mashuda-contractors")
print(driver.page_source)
driver.quit()

Console Output:

<html class="js" lang="en-US" style="opacity: 1; visibility: visible;"><!--<![endif]--><head><title>Access denied | www.manta.com used Cloudflare to restrict access</title><meta charset="UTF-8"><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta http-equiv="X-UA-Compatible" content="IE=Edge,chrome=1"><meta name="robots" content="noindex, nofollow"><meta name="viewport" content="width=device-width,initial-scale=1,maximum-scale=1"><link rel="stylesheet" id="cf_styles-css" href="/cdn-cgi/styles/cf.errors.css" type="text/css" media="screen,projection"><!--[if lt IE 9]><link rel="stylesheet" id='cf_styles-ie-css' href="/cdn-cgi/styles/cf.errors.ie.css" type="text/css" media="screen,projection" /><![endif]--><style type="text/css">body{margin:0;padding:0}</style><!--[if gte IE 10]><!--><script type="text/javascript" src="/cdn-cgi/scripts/zepto.min.js"></script><!--<![endif]--><!--[if gte IE 10]><!--><script type="text/javascript" src="/cdn-cgi/scripts/cf.common.js"></script><!--<![endif]--></head><body><div id="cf-wrapper"><div class="cf-alert cf-alert-error cf-cookie-error" id="cookie-alert" data-translate="enable_cookies">Please enable cookies.</div><div id="cf-error-details" class="cf-error-details-wrapper"><div class="cf-wrapper cf-header cf-error-overview"><h1><span class="cf-error-type" data-translate="error">Error</span><span class="cf-error-code">1020</span><small class="heading-ray-id">Ray ID: 53fd7c2fca12d5fc • 2019-12-04 11:36:52 UTC</small></h1><h2 class="cf-subheadline">Access denied</h2></div><!-- /.header --><section></section><!-- spacer --><div class="cf-section cf-wrapper"><div class="cf-columns two"><div class="cf-column"><h2 data-translate="what_happened">What happened?</h2><p>This website is using a security service to protect itself from online attacks.</p></div></div></div><!-- /.section --><div class="cf-error-footer cf-wrapper"><p><span class="cf-footer-item">Cloudflare Ray ID: 
<strong>53fd7c2fca12d5fc</strong></span><span class="cf-footer-separator">•</span><span class="cf-footer-item"><span>Your IP</span>: 123.201.54.43</span><span class="cf-footer-separator">•</span><span class="cf-footer-item"><span>Performance &amp; security by</span><a href="https://www.cloudflare.com/5xx-error-landing?utm_source=error_footer" id="brand_link" target="_blank">Cloudflare</a></span></p></div><!-- /.error-footer --></div><!-- /#cf-error-details --></div><!-- /#cf-wrapper --><script type="text/javascript">window._cf_translation = {};


</script></body></html>

Analysis

The extracted page source makes it clear that, with the --headless argument, you land on a page with:

Heading: Access denied | www.manta.com used Cloudflare to restrict access.
Some information: What happened?: This website is using a security service to protect itself from online attacks.

Conclusion

The browsing context, i.e. the Chrome browser session, is detected as a bot and the navigation is blocked.
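
The same check can be automated instead of reading the page source by eye. The sketch below looks for the two markers visible in the output above (a title mentioning Cloudflare and a cf-error-code element). The function name cloudflare_block_info is illustrative, and Cloudflare may change this markup at any time, so treat it as a heuristic.

```python
import re


def cloudflare_block_info(page_source):
    """Return (title, error_code) if the HTML looks like a Cloudflare block page.

    Returns None when the markers are not found.
    """
    title = re.search(r"<title>(.*?)</title>", page_source, re.S | re.I)
    code = re.search(r'class="cf-error-code"[^>]*>\s*(\d+)', page_source)
    if title and code and "Cloudflare" in title.group(1):
        return title.group(1).strip(), int(code.group(1))
    return None


# Shortened excerpt of the block page shown above.
sample = (
    "<html><head><title>Access denied | www.manta.com used Cloudflare "
    "to restrict access</title></head><body>"
    '<span class="cf-error-code">1020</span></body></html>'
)
print(cloudflare_block_info(sample))
```

In the script above, this would be called as cloudflare_block_info(driver.page_source) right after driver.get(...).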

Solution 4:

Cloudflare aims to block bots. It assumes that headless browsers are used by data scrapers, so it blocks them. From Cloudflare's "What is Data Scraping?":

"A headless browser is a type of web browser, much like Chrome or Firefox, but it doesn’t have a visual user interface by default, allowing it to move much faster than a typical web browser. By essentially running at the level of a command line, a headless browser is able to avoid rendering entire web applications. Data scrapers write bots that use headless browsers to request data more quickly, as there is no human viewing each page being scraped."

Solution 5:

When scraping a Cloudflare-protected website, here is the list of things you need to do:

Ensure you are sending headers identical (and in the same order) to what the browser sends
Ensure you are using a non-datacenter IP address range
And if it still does not work, like in my case...

I encountered the same issue when scraping an e-commerce website (guess dot com). Changing the header order didn't fix it for me. My conclusion: apparently, Cloudflare analyzes the TLS fingerprint of the request and returns a 403 (error 1020) when the fingerprint matches Node.js, Python, or curl, which are commonly used for scraping. The solution is to emulate the fingerprint of a popular browser, and the most obvious way is to use Puppeteer with the puppeteer-extra stealth plugin. And it worked! But Puppeteer was not fast enough for my use case (to put it mildly: Puppeteer is resource-hungry and sluggish), so I had to build a utility that uses BoringSSL (the SSL library used by Chrome). Since compiling C/C++ code and deciphering the cryptic build errors of a TLS library is no fun for most web developers, I wrapped it as an API server, which you can try here: https://rapidapi.com/restyler/api/scrapeninja

Read more on how Cloudflare analyzes TLS: https://blog.cloudflare.com/monsters-in-the-middleboxes/
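
For the first item in the checklist above (headers in the same order as the browser), here is a minimal sketch using Python's standard http.client, which sends headers in the order putheader() is called. The header list and the function name send_ordered_get are hypothetical; copy the real names, values, and order from your own browser's network inspector. As described above, header order alone may not help, because the TLS fingerprint of a Python client still differs from a browser's.

```python
import http.client

# Hypothetical header list for illustration only.
BROWSER_HEADERS = [
    ("Host", "example.com"),
    ("User-Agent", "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 "
                   "(KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36"),
    ("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"),
    ("Accept-Language", "en-US,en;q=0.9"),
]


def send_ordered_get(conn, path, headers):
    """Send a GET request whose headers go out exactly in the given order."""
    # skip_host / skip_accept_encoding stop http.client from inserting its
    # own Host and Accept-Encoding headers ahead of ours.
    conn.putrequest("GET", path, skip_host=True, skip_accept_encoding=True)
    for name, value in headers:
        conn.putheader(name, value)
    conn.endheaders()
    return conn.getresponse()
```

Usage would look like conn = http.client.HTTPSConnection("example.com") followed by send_ordered_get(conn, "/", BROWSER_HEADERS).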
