Currently, I'm attempting to scrape a few retail websites to compare prices using Selenium and BeautifulSoup. However, there's a particular page Selenium won't load. The browser opens, but the page remains blank.
I've tried googling to check whether there are certain websites Selenium has difficulty accessing, but turned up nothing. I was using geckodriver for Firefox, and switched to ChromeDriver to make sure the problem wasn't specific to one browser.
Code:
URL = "https://shop.coles.com.au/a/national/everything/browse/pantry/breakfast?pageNumber=1"
browser = webdriver.Chrome(executable_path="C:\ChromeDriver\chromedriver")
browser.get(URL)
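A diagnostic worth trying (my sketch, not part of the original post): an explicit wait rules out slow JavaScript rendering before concluding the page is blank. The CSS selector below is hypothetical; substitute an element the page actually renders:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

URL = "https://shop.coles.com.au/a/national/everything/browse/pantry/breakfast?pageNumber=1"
browser = webdriver.Chrome(executable_path=r"C:\ChromeDriver\chromedriver")
browser.get(URL)
try:
    # Hypothetical selector: wait up to 20 s for a product tile to appear
    WebDriverWait(browser, 20).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "div.product"))
    )
    print(browser.page_source[:300])  # inspect what actually rendered
except Exception:
    print("Nothing rendered within 20 seconds; possibly bot detection")

If the page source shows a Cloudflare-style challenge rather than an empty document, the undetected-chromedriver approach discussed below may apply here as well.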
Related
I am working on a wxPython project and used webbrowser (the convenient web-browser controller). Is it possible to get a list of the URLs which are open in my web browser, for example in Google Chrome?
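For what it's worth, one approach I know of (an assumption about the setup, not something stated in the question): if Chrome is launched with --remote-debugging-port=9222, its DevTools HTTP endpoint lists the open tabs and their URLs:

import json
from urllib.request import urlopen

# Hedged sketch: assumes Chrome was started with --remote-debugging-port=9222
with urlopen("http://localhost:9222/json") as resp:
    targets = json.load(resp)

for target in targets:
    if target.get("type") == "page":  # skip extension and service-worker targets
        print(target.get("url"))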
I would like to use chromedriver to scrape some stories from fanfiction.net.
I tried the following:
from selenium import webdriver
import time

path = r'D:\chromedriver\chromedriver.exe'  # raw string avoids backslash escapes
browser = webdriver.Chrome(path)

url1 = 'https://www.fanfiction.net/s/8832472'
url2 = 'https://www.fanfiction.net/s/5218118'
browser.get(url1)
time.sleep(5)
browser.get(url2)
The first link opens (sometimes I have to wait 5 seconds). When I want to load the second URL, Cloudflare intervenes and wants me to solve CAPTCHAs, which are not solvable; at least Cloudflare does not accept my solutions.
This also happens if I enter the links manually in chromedriver (i.e., in the GUI). However, if I do the same in normal Chrome, everything works just fine (I do not even get the waiting period on the first link), even in private mode with all cookies deleted. I could reproduce this on several machines.
Now my question: my intuition was that chromedriver is just the normal Chrome browser, except that it allows itself to be controlled. What is the difference from normal Chrome, how does Cloudflare distinguish the two, and how can I mask my chromedriver as normal Chrome? (I do not intend to load many pages in a very short time, so it should not look like a bot.)
I hope my question is clear.
This error message...
...implies that Cloudflare has detected your requests to the website as coming from an automated bot and is consequently denying you access to the application.
Solution
In these cases, a potential solution is to use undetected-chromedriver to initialize the Chrome browsing context.
undetected-chromedriver is an optimized Selenium ChromeDriver patch which does not trigger anti-bot services like Distil Networks / Imperva / DataDome / Botprotect.io. It automatically downloads the driver binary and patches it.
Code Block:
import undetected_chromedriver as uc
from selenium import webdriver
import time

options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
driver = uc.Chrome(options=options)  # patched ChromeDriver that evades common bot detection

url1 = 'https://www.fanfiction.net/s/8832472'
url2 = 'https://www.fanfiction.net/s/5218118'
driver.get(url1)
time.sleep(5)
driver.get(url2)
References
You can find a couple of relevant detailed discussions in:
Selenium app redirect to Cloudflare page when hosted on Heroku
How to bypass being rate limited ..HTML Error 1015 using Python
I'm using Selenium & NodeJS to automate download tests on Chrome.
I noticed that Chrome's download protection behaves differently when clicking a link or redirecting to a URL automatically versus actually typing the URL in the address bar and pressing ENTER (Chrome's protection doesn't mark some files as blocked when the URL is actually typed).
What I have tried so far, while still getting blocks for some files:
- driver.get(url)
- driver.executeScript to redirect to the URL
- driver.executeScript to create an A element and click it
- opening a new tab and then driver.get(url)
Is there a way to imitate typing in the address bar and pressing ENTER with Selenium?
Selenium does not support sending keys to the browser address bar, unfortunately.
Someone suggested a solution using the win32com.client library here (note that win32com.client is a Python API, not a Node.js one).
I haven't tried this, but I found a similar question on this thread.
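For reference, a minimal sketch of that idea in Python (my sketch, not the linked answer's code), assuming pywin32 is installed and a visible Chrome window; the URL is a placeholder:

import time
import win32com.client

shell = win32com.client.Dispatch("WScript.Shell")
shell.AppActivate("Google Chrome")               # focus the Chrome window by title
time.sleep(1)
shell.SendKeys("^l")                             # Ctrl+L focuses the address bar
shell.SendKeys("https://example.com/file.zip")   # placeholder URL, typed as keystrokes
shell.SendKeys("{ENTER}")                        # navigate as if the user typed it

Because the keystrokes are generated at the OS level rather than through ChromeDriver, Chrome should treat the navigation as user-typed input.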
I am trying to scrape a website that renders all the data with JS. It has pages of tables; however, you can only reach a given page via the search box or by clicking an arrow to move to the next page. It is impossible to access a particular page by URL.
I need to change the proxy on each page. If I reload the webdriver, I must re-execute all the searches to reach, e.g., the 124102nd page, which is very time- and computationally intensive.
Could anyone help me with this?
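One possibility worth sketching (my suggestion, not from the original thread): the selenium-wire package extends Selenium and, as far as I know, lets you swap the upstream proxy at runtime via its driver.proxy attribute, so the browser session and your position in the pagination survive the proxy change. The proxy URLs and the arrow's selector below are placeholders:

import time
from seleniumwire import webdriver  # assumes the selenium-wire package is installed
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com/tables")  # placeholder for the real site

proxies = [
    "http://user:pass@proxy1.example.com:8080",  # placeholder proxies
    "http://user:pass@proxy2.example.com:8080",
]

for proxy in proxies:
    # Route subsequent requests through a new proxy; the session (and your
    # position at page N) is preserved because the browser never restarts.
    driver.proxy = {"http": proxy, "https": proxy}
    driver.find_element(By.CSS_SELECTOR, "a.next-page").click()  # hypothetical arrow
    time.sleep(2)  # give the JS-rendered table time to load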
I'm working on a browser extension (compatible with Chrome, FF, Opera, and Edge), and I'm trying to figure out how to associate requests with domains outside of the current page. For example, when you go to google.com, a lot of requests occur to domains other than google.com, such as gstatic.com.
An extension like NoScript shows all of the domains that a page requested and lets you allow or deny each one. I'm trying to get similar functionality.
Is this something that can be done in the content script, or is there some way to keep state information in the background script that I can then display in the popup? Obviously it's possible; I'm just not seeing which callback I can use.