I would like to use chromedriver to scrape some stories from fanfiction.net.
I tried the following:
from selenium import webdriver
import time
path = r'D:\chromedriver\chromedriver.exe'  # raw string so the backslashes are not treated as escapes
browser = webdriver.Chrome(path)
url1 = 'https://www.fanfiction.net/s/8832472'
url2 = 'https://www.fanfiction.net/s/5218118'
browser.get(url1)
time.sleep(5)
browser.get(url2)
The first link opens (sometimes I have to wait 5 seconds). When I want to load the second URL, Cloudflare intervenes and asks me to solve CAPTCHAs - which are not solvable, or at least Cloudflare does not accept my solutions.
This also happens if I enter the links manually in the chromedriver-controlled window (i.e. in the GUI). However, if I do the same in normal Chrome, everything works just fine (I do not even get the waiting period on the first link) - even in private mode with all cookies deleted. I could reproduce this on several machines.
Now my question: my intuition was that chromedriver is just the normal Chrome browser, with the addition that it can be controlled programmatically. What is the difference from normal Chrome, how does Cloudflare distinguish the two, and how can I make my chromedriver session look like normal Chrome? (I do not intend to load many pages in a short time, so it should not look like a bot.)
I hope my question is clear.
This error message...
...implies that Cloudflare has detected your requests to the website as coming from an automated bot and is consequently denying you access to the application.
Solution
In these cases a potential solution would be to use undetected-chromedriver to initialize the Chrome browsing context.
undetected-chromedriver is an optimized Selenium Chromedriver patch which does not trigger anti-bot services like Distill Network / Imperva / DataDome / Botprotect.io. It automatically downloads the driver binary and patches it.
Code Block:
import undetected_chromedriver as uc
from selenium import webdriver
import time

options = webdriver.ChromeOptions()
options.add_argument("start-maximized")

# uc.Chrome() downloads and patches the matching chromedriver automatically
driver = uc.Chrome(options=options)

url1 = 'https://www.fanfiction.net/s/8832472'
url2 = 'https://www.fanfiction.net/s/5218118'
driver.get(url1)
time.sleep(5)
driver.get(url2)
References
You can find a couple of relevant detailed discussions in:
Selenium app redirect to Cloudflare page when hosted on Heroku
How to bypass being rate limited ..HTML Error 1015 using Python
Related
So I wonder which ports Selenium uses with ChromeDriver - for example, a range like 1500-3000?
I'm trying to run Selenium while using NordVPN, but it says that it couldn't find any free port, so I'm looking for the list of ports Selenium/ChromeDriver uses, to whitelist them in NordVPN and be able to run Selenium while using NordVPN.
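There isn't a fixed range: ChromeDriver listens on 9515 by default when started standalone, but Selenium normally launches it on a random free ephemeral port. A minimal sketch of picking one known port and pinning the driver to it so it can be whitelisted in the VPN client (the Selenium lines are commented out and assume the Selenium 4 Service API with chromedriver on PATH - an assumption, not something from the question):

```python
import socket

def free_port():
    # Ask the OS for a free ephemeral port by binding to port 0.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]

port = free_port()
print(port)

# Pin ChromeDriver to that known port so it can be whitelisted:
# from selenium import webdriver
# from selenium.webdriver.chrome.service import Service
# driver = webdriver.Chrome(service=Service(port=port))
```

Note that Chrome itself also opens its own DevTools port, so whitelisting the driver port alone may not be enough.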
You can use the Chrome DevTools Protocol. Try the steps below:
Add the path of the Chrome executable to the environment variable PATH.
Launch Chrome with a custom flag, and open a port for remote debugging
Please make sure the path to Chrome's executable is added to the PATH environment variable. You can check it by running the command chrome.exe (on Windows) or Google Chrome (on Mac); it should launch the Chrome browser.
If you get a similar message as below that means Chrome is not added to your system’s path:
'chrome' is not recognized as an internal or external command,
operable program or batch file.
If this is the case, search for how to add Chrome to PATH.
Launch browser with custom flags
To make Chrome open a port for remote debugging, launch it with custom flags:
chrome.exe --remote-debugging-port=9222 --user-data-dir="C:\selenum\ChromeProfile"
For the --remote-debugging-port value you can specify any port that is open.
For the --user-data-dir flag you need to pass a directory where a new Chrome profile will be created. It is there just to make sure Chrome launches with a separate profile and doesn't pollute your default profile.
You can now use the browser manually: navigate to as many pages as you like and perform actions. Once you need your automation code to take over, run your automation script. You just need to modify your Selenium script so that Selenium connects to that already-opened browser.
You can verify if Chrome is launched in the right way:
Launch a new browser window normally (without any flag), and navigate to http://127.0.0.1:9222
Confirm that you see the Google homepage referenced in the second browser window
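The same check can be done programmatically: Chrome's remote-debugging endpoint exposes a JSON list of open targets. A small sketch, assuming Chrome was started with --remote-debugging-port=9222 as above (the actual call is commented out so the sketch runs even when no Chrome is listening):

```python
import json
from urllib.request import urlopen

def devtools_targets(port=9222):
    # Returns the list of open pages/targets exposed by Chrome's
    # remote-debugging endpoint; raises URLError if nothing is listening.
    with urlopen(f"http://127.0.0.1:{port}/json") as resp:
        return json.load(resp)

# targets = devtools_targets()
# print([t.get("url") for t in targets])
```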
Launch browser with options
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_experimental_option("debuggerAddress", "127.0.0.1:9222")

# Change the chromedriver path accordingly
chrome_driver = r"C:\chromedriver.exe"
driver = webdriver.Chrome(chrome_driver, options=chrome_options)
print(driver.title)
The address 127.0.0.1 denotes your localhost. We have supplied the same port, i.e. 9222, that we used when launching Chrome with the --remote-debugging-port flag. While you can use any port, you need to make sure it is open and available to use.
I'm using Selenium & NodeJS to automate download tests on Chrome.
I noticed that Chrome's download protection behaves differently when clicking a link or redirecting to a URL automatically vs. actually typing the URL in the address bar and pressing ENTER (Chrome's protection doesn't mark some files as blocked when the URL is actually typed).
What I have tried so far (some files still get blocked):
driver.get(url)
driver.executeScript to redirect to the URL
driver.executeScript to create an A element and click on it
opening a new tab and then driver.get(url)
Is there a way to imitate the address bar typing and ENTER clicking with Selenium?
Selenium does not support sending keys to the browser address bar, unfortunately.
Someone suggested a solution with win32com.client library for Node.js here
I haven't tried this but found a similar question on this thread.
Currently, I'm attempting to scrape a few retail websites to compare prices using Selenium and BeautifulSoup. However, there's a particular page Selenium won't load. The browser opens, but the page remains blank.
I've tried googling to check if there are certain websites Selenium has difficulty accessing, but turned up nothing. I was using the Gecko driver for Firefox, and changed to the Chromedriver to make sure there wasn't an issue with the particular browser.
Code:
from selenium import webdriver

URL = "https://shop.coles.com.au/a/national/everything/browse/pantry/breakfast?pageNumber=1"
browser = webdriver.Chrome(executable_path=r"C:\ChromeDriver\chromedriver")
browser.get(URL)
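A blank page with a spinning loader is often a sign of bot detection on the site's side rather than a Selenium problem (an assumption here, not something confirmed in the question). One frequently suggested mitigation is launching Chrome with flags that hide the most obvious automation markers - a sketch, not a guaranteed fix; the browser-launch lines are commented out so the snippet runs without a browser installed:

```python
# Flags often suggested to reduce obvious automation fingerprints
# (assumption: the blank page is caused by bot detection).
flags = [
    "--disable-blink-features=AutomationControlled",
    "--start-maximized",
]

# from selenium import webdriver
# options = webdriver.ChromeOptions()
# for flag in flags:
#     options.add_argument(flag)
# # Also drop the "controlled by automated test software" infobar:
# options.add_experimental_option("excludeSwitches", ["enable-automation"])
# browser = webdriver.Chrome(options=options)
# browser.get(URL)
print(flags)
```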
I want to make a script that opens a URL with Python, stays on the website for a few seconds, and then repeats this over and over, to increase website traffic.
With Tor and the requests library in Python I wrote the script below, and I configured Tor to change IP every 5 seconds:
import requests
import time

url = 'https://google.com'
proxy = {'http': 'socks5://127.0.0.1:9150'}

while True:
    print(requests.get(url, proxies=proxy).text)
    time.sleep(5)
But when I checked my Google Analytics and Alexa accounts, I noticed that the traffic generated by this script had no effect.
I wonder how I can generate traffic for a website that actually registers, such that tools like Google Analytics can't tell the traffic is fake.
It won't help at all. See how Alexa traffic rankings are determined. Metrics are collected from a panel of users with certain browser extensions installed that report their browsing habits, or by Alexa Javascript code you install on your site.
Given those metrics for collection, visiting your site with Tor and Python code won't have any impact on your ranking.
This is because you are sending a bare web request.
If I ran that code and removed the .text, it would give me a 200 response code.
You are just checking whether the website is online, via a proxy.
So instead of doing it with a Python request, you can use Selenium. Selenium launches a real browser and can execute mouse events programmatically; you can also use timed delays to put some variation between clicks so it won't look like a bot. You can check how to do it in detail here: https://ervidhayak.com/blog/detail/increase-traffic-on-website-with-python-selenium
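A minimal sketch of that idea: launch a real browser so the analytics JavaScript actually executes, and dwell a randomized number of seconds per visit. The Selenium lines are commented out (chromedriver on PATH is assumed) so the sketch runs as-is:

```python
import random
import time

URL = "https://google.com"       # target page from the question
dwell = random.uniform(10, 30)   # randomized dwell time, in seconds

# from selenium import webdriver
# driver = webdriver.Chrome()
# driver.get(URL)
# time.sleep(dwell)              # stay on the page so analytics fires
# driver.quit()
```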
Until about a week ago, my website was working fine. Since Chrome version 54, I can't get it to load. The HTTPS request doesn't get any response and shows a status of "(canceled)". It loads just fine in Chromium, Firefox, Safari, and even Chrome 53. Chrome's developer tools don't give any helpful information - see the image.
Here is what it looks like in Chromium:
(You'll note that the second image shows the subdomain www. That's because, when the naked domain loads properly, it redirects to the subdomain.)
I tried modifying my server code (Node, Express) to print a message upon receipt of each request, and it doesn't even print when I visit the site in Chrome (54.0.2840.71 (64-bit)). It does print when I visit in Chromium (53.0.2785.143 (64-bit)).
I even tried using a different computer. Same thing - fails in Chrome, succeeds in Safari.
What could make it behave like this? I don't know where to begin troubleshooting this.
I don't really understand the behaviour, but I found a way to fix it in my app. I was using the NPM module spdy in place of Node's built-in https module to serve my app over https. Switching back to the built-in module solved the problem. (It's a simple change - the APIs are compatible.) I don't know whether spdy consistently has this issue in Chrome 54, but I've wasted too much time on this issue, so I will leave further investigation as an exercise for the archaeologist who next digs up this answer.