ChromeDriver User Data configuration with Selenium - python-3.x

I have this simple function with Selenium:

from selenium import webdriver

def config():
    path = r'C:\Users\George\Desktop\Bot\User Data'
    options = webdriver.ChromeOptions()
    options.add_argument('user-data-dir=' + path)
    options.add_experimental_option("excludeSwitches", ["enable-automation"])
    options.add_experimental_option('useAutomationExtension', False)
    options.add_argument('--disable-blink-features=AutomationControlled')
    driver = webdriver.Chrome(executable_path=r'C:\Users\George\Desktop\Bot\chromedriver.exe', chrome_options=options)
    return driver
This works fine and as expected. But I am making a simple bot to do tasks on a website that requires sessions, cookies, and so on.
So what I have to do every now and then is copy the real browser User Data from:
C:\Users\George\AppData\Local\Google\Chrome\User Data
to my random location:
C:\Users\George\Desktop\Bot\User Data
And it works fine. The reason I am doing this is that I want to avoid logging in, and to keep the browser I work with independent of the bot's driver browser, which I usually leave in the background and never open while it is doing my work.
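Incidentally, the copy step itself can be scripted. A minimal sketch with shutil (assuming Chrome is fully closed so none of the profile files are locked; the paths are the ones from above):

import shutil

src = r'C:\Users\George\AppData\Local\Google\Chrome\User Data'
dst = r'C:\Users\George\Desktop\Bot\User Data'

# refresh the bot's copy of the real profile
shutil.rmtree(dst, ignore_errors=True)  # copytree needs a non-existing target
shutil.copytree(src, dst)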
Question
Is there a way that I can automatically get the current session instead of copying files (e.g. get one open tab with all information attached)?
Is there a better way to do this? (I am sure there are plenty of better ways.)
Thanks for the help. Any suggestions, links, or blogs are much appreciated.

You won't need to copy the real browser User Data from:
C:\Users\George\AppData\Local\Google\Chrome\User Data
to:
C:\Users\George\Desktop\Bot\User Data
every now and then if you point user-data-dir to the real google-chrome User Data directory as follows:
def config():
    options = webdriver.ChromeOptions()
    # on the next line, replace the random location with the actual browser data location
    options.add_argument(r"--user-data-dir=C:\Users\George\AppData\Local\Google\Chrome\User Data")
    options.add_experimental_option("excludeSwitches", ["enable-automation"])
    options.add_experimental_option('useAutomationExtension', False)
    options.add_argument('--disable-blink-features=AutomationControlled')
    driver = webdriver.Chrome(executable_path=r'C:\Users\George\Desktop\Bot\chromedriver.exe', chrome_options=options)
    return driver
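One caveat worth noting with this approach: Chrome locks the user data directory while it is running, so the real profile can only be used by one instance at a time. As far as I know you will need to close your regular browser before the bot starts, otherwise ChromeDriver will fail to create a session against that directory.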
References
You can find a couple of relevant detailed discussions in:
Selenium: Point towards default Chrome session

Related

Automatically accepting the Google consent pop-up

I am a newbie to Selenium. I am trying to click the 'Agree' button automatically on the Google search consent pop-up using the Selenium driver interface. I am using Python and the code is:

driver.get("https://www.google.com")
driver.find_element_by_xpath('//*[@id="introAgreeButton"]').click()

But this is not working. Can someone help me?
(screenshot: Google's automatic consent pop-up)
This worked for me:

import time
from selenium import webdriver
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_argument('disable-infobars')  # attempting to disable infobar
browser = webdriver.Chrome(options=options,
                           executable_path=r'/usr/local/bin/chromedriver')
browser.get('http://www.google.com')
# the consent dialog lives in an iframe served from consent.google.com
browser.switch_to.frame(browser.find_element_by_xpath(
    "//iframe[contains(@src, 'consent.google.com')]"))
time.sleep(5)
browser.find_element_by_xpath('//*[@id="introAgreeButton"]/span/span').click()

An equivalent locator for the same button:

browser.find_element(By.XPATH, "//form//div[@role='button' and @id='introAgreeButton']").click()
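If the fixed time.sleep(5) turns out to be flaky, a variant with explicit waits may be more robust. This is just a sketch using the same locators as above (which may change whenever Google updates the consent page):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Chrome(executable_path=r'/usr/local/bin/chromedriver')
browser.get('http://www.google.com')

wait = WebDriverWait(browser, 10)
# wait for the consent iframe and switch into it in one step
wait.until(EC.frame_to_be_available_and_switch_to_it(
    (By.XPATH, "//iframe[contains(@src, 'consent.google.com')]")))
# wait for the Agree button to be clickable instead of sleeping
wait.until(EC.element_to_be_clickable((By.ID, "introAgreeButton"))).click()
browser.switch_to.default_content()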

Python: Log into a website using Selenium?

I want to login to the following website:
https://www.investing.com/equities/oil---gas-dev-historical-data
Here is what I have tried so far:
from selenium import webdriver
profile = webdriver.FirefoxProfile()
profile.set_preference("browser.download.folderList", 2)
profile.set_preference("browser.download.manager.showWhenStarting", False)
profile.set_preference("browser.download.dir", './ogdc.csv')
profile.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/x-gzip")
driver = webdriver.Firefox(firefox_profile=profile)
driver.get('https://www.investing.com/equities/oil---gas-dev-historical-data')
driver.find_element_by_class_name("login bold")
driver.find_element_by_id('Email').send_keys('myemail')
driver.find_element_by_id('Password').send_keys('mypass')
driver.find_element_by_partial_link_text("Download Data").click()
But I get the following exception:
NoSuchElementException
How can I log in to the above website?
You need to click Sign In first to bring up the login popup; try the following code:

driver.get('https://www.investing.com/equities/oil---gas-dev-historical-data')
driver.find_element_by_css_selector("a[class*='login']").click()
driver.find_element_by_id('loginFormUser_email').send_keys('myemail')
driver.find_element_by_id('loginForm_password').send_keys('mypass')
driver.find_element_by_xpath("//div[@id='loginEmailSigning']//following-sibling::a[@class='newButton orange']").click()
From what I can tell, you need to click the Sign In button ("login bold") in order to open the popup. It is not part of the initial DOM and needs some time to show up, so basically wait for the email and password fields to become visible.
You should be able to use an explicit wait for this, as sketched below.
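For instance, something along these lines, reusing the locators from the answer above and assuming driver is the Firefox driver from the question (a sketch, untested against the live page):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(driver, 10)
driver.get('https://www.investing.com/equities/oil---gas-dev-historical-data')

# open the login popup, then wait for its fields to render before typing
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "a[class*='login']"))).click()
wait.until(EC.visibility_of_element_located((By.ID, 'loginFormUser_email'))).send_keys('myemail')
driver.find_element_by_id('loginForm_password').send_keys('mypass')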

Change proxy in chromedriver for scraping purposes

I'm scraping Bet365, probably one of the trickiest websites I've encountered, with Selenium and Chrome.
The issue with this page is that, even though my scraper sleeps so that it in no way runs faster than a human could, at some point, sometimes, it blocks my IP for a random amount of time (between half an hour and 2 hours).
So I'm looking into proxies to change my IP and resume my scraping, and here is where I'm kind of stuck trying to decide how to approach this.
I've used 2 different free IP providers, as follows.
https://gimmeproxy.com
I wasn't able to make this one work at first (I was even emailing their support about it), but what I have, which does work after the edit below, is as follows:
import requests

api = "MY_API_KEY"  # with the free plan I can ask for an IP 240 times a day
adder = "&post=true&supportsHttps=true&maxCheckPeriod=3600"
url = "https://gimmeproxy.com/api/getProxy?"
r = requests.get(url=url, params=adder)

...and I got no answer, just a 404 Not Found. Edit: it works now, my bad; I was simply never sending the API key:

apik = "api_key={}".format(api)
r = requests.get(url=url, params=apik + adder)
My second approach is through another site, sslproxies.
With this one, you scrape the page and get a list of 100 IPs, theoretically checked and working. So I've set up a loop that tries a random IP from that list and, if it doesn't work, deletes it from the list and tries again. This approach works when trying to open Bet365.
for n in range(1, 100):
    proxy_index = random.randint(0, len(proxies) - 1)
    proxi = proxies[proxy_index]
    PROXY = proxi['ip'] + ':' + proxi['port']
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument('--proxy-server={}'.format(PROXY))
    url = "https://www.bet365.es"
    try:
        browser = webdriver.Chrome(path, options=chrome_options)
        browser.get(url)
        WebDriverWait(browser, 10)...  # no need to post the whole condition
        break
    except:
        del proxies[proxy_index]
        browser.quit()
Well, with this one I succeeded in opening Bet365, and I'm still checking, but I think this webdriver is going to be much slower than my original one with no proxy.
So, my question is: is it expected that scraping through a proxy will be much slower, or does it depend on the proxy used? If so, does anyone recommend a different (and surely better) approach?
I don't see any significant issue in either your approach or your code block. However, another approach would be to make use of the proxies from the Free Proxy List, which are re-verified regularly (see the Last Checked column).
As a solution you can write a script to grab all the proxies available and create a list dynamically every time you initialize your program. The following program invokes proxies from the list one by one until a successful proxied connection is established, verified through the page title of https://www.bet365.es containing the text bet365. An exception may arise when the free proxy your program grabbed is overloaded with users trying to route their traffic through it.
Code Block:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# first session: scrape the proxy list from sslproxies.org
driver = webdriver.Chrome(executable_path=r'C:\WebDrivers\chromedriver.exe')
driver.get("https://sslproxies.org/")
driver.execute_script("return arguments[0].scrollIntoView(true);", WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//table[@class='table table-striped table-bordered dataTable']//th[contains(., 'IP Address')]"))))
ips = [my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.XPATH, "//table[@class='table table-striped table-bordered dataTable']//tbody//tr[@role='row']/td[position() = 1]")))]
ports = [my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.XPATH, "//table[@class='table table-striped table-bordered dataTable']//tbody//tr[@role='row']/td[position() = 2]")))]
driver.quit()

proxies = []
for i in range(0, len(ips)):
    proxies.append(ips[i] + ':' + ports[i])
print(proxies)

for i in range(0, len(proxies)):
    try:
        print("Proxy selected: {}".format(proxies[i]))
        options = webdriver.ChromeOptions()
        options.add_argument('--proxy-server={}'.format(proxies[i]))
        driver = webdriver.Chrome(options=options, executable_path=r'C:\WebDrivers\chromedriver.exe')
        driver.get("https://www.bet365.es")
        # wait until the page title confirms we actually reached bet365 through the proxy
        WebDriverWait(driver, 20).until(EC.title_contains("bet365"))
        # Do your scraping here
        break
    except Exception:
        driver.quit()
        print("Proxy failed; trying the next one")
Console Output:
['190.7.158.58:39871', '175.139.179.65:54980', '186.225.45.146:45672', '185.41.99.100:41258', '43.230.157.153:52986', '182.23.32.66:30898', '36.37.160.253:31450', '93.170.15.214:56305', '36.67.223.67:43628', '78.26.172.44:52490', '36.83.135.183:3128', '34.74.180.144:3128', '206.189.122.177:3128', '103.194.192.42:55546', '70.102.86.204:8080', '117.254.216.97:23500', '171.100.221.137:8080', '125.166.176.153:8080', '185.146.112.24:8080', '35.237.104.97:3128']
Proxy selected: 190.7.158.58:39871
Proxy selected: 175.139.179.65:54980
Proxy selected: 186.225.45.146:45672
Proxy selected: 185.41.99.100:41258
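On the speed question: a free proxy will always add latency, but a lot of the wall-clock time in the loop above is spent starting a full Chrome instance only to find out the proxy is dead. One way to cut that down, sketched here with requests (the timeout value is an arbitrary choice, not something from the answer above), is to cheaply pre-check each proxy before handing it to ChromeDriver:

import requests

def working_proxies(proxies, timeout=5):
    """Return only the 'ip:port' entries that answer a cheap HTTP request."""
    alive = []
    for proxy in proxies:
        try:
            requests.get("https://www.bet365.es",
                         proxies={"http": proxy, "https": proxy},
                         timeout=timeout)
            alive.append(proxy)
        except requests.RequestException:
            pass  # dead or overloaded proxy; skip it
    return alive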

Selenium - Logging Into Facebook Causes Web Driver to Stall Indefinitely

I have a simple program that logs into Facebook and gets 3 urls:
from selenium import webdriver

def setup_driver():
    prefs = {"profile.default_content_setting_values.notifications": 2}
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_experimental_option("prefs", prefs)
    chrome_options.add_argument('--headless')
    chrome_options.add_argument('--no-sandbox')
    driver = webdriver.Chrome(executable_path="./chromedriver_linux",
                              chrome_options=chrome_options)
    return driver

def log_into_facebook(driver):
    driver.get("https://www.facebook.com/")
    email_field = driver.find_element_by_id("email")
    email_field.send_keys("<MY EMAIL ADDRESS>")
    password_field = driver.find_element_by_id("pass")
    password_field.send_keys("<MY FB PASSWORD>")
    driver.find_element_by_id("loginbutton").click()

if __name__ == "__main__":
    driver = setup_driver()
    log_into_facebook(driver)
    print("before getting url 1")
    driver.get('https://facebook.com/2172111592857876')
    print("before getting url 2")
    driver.get('https://www.facebook.com/beaverconfessions/posts/2265225733546461')
    print("before getting url 3")
    driver.get('https://www.facebook.com/beaverconfessions/posts/640487179353666')
    print("finished getting 3 urls")
On my local machine, this program runs fine. However, on my AWS EC2 instance it makes the instance unusable: the Python script hangs after "before getting url 2" is printed to the console, and while it hangs the instance becomes so slow that other programs on it stop working properly. I have to forcefully close the program with Ctrl-C before the instance becomes responsive again. However, if I comment out log_into_facebook(driver), the program runs fine.
I would try to get a stack trace, but the program doesn't actually crash; it just never reaches "before getting url 3".
It is worth noting that I was previously getting "invalid session id" errors with a program similar to this one (it also logged into Facebook and then called driver.get several times).
Update: Removing the --no-sandbox option from the webdriver seemed to fix the problem. I'm not sure why. I originally had this option in place because I was previously getting an "unable to fix open pages" error, and I read that --no-sandbox would fix that error:
chrome_options.add_argument('--no-sandbox')
Roymunson reports that the appropriate way to fix the hanging problem is to avoid specifying the --no-sandbox option in the webdriver.
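For reference, a sketch of setup_driver with the flag removed. The extra --disable-dev-shm-usage argument is my own suggestion, not something from the original poster: it is commonly recommended for headless Chrome on memory-constrained machines such as small EC2 instances, where /dev/shm can run out of space:

def setup_driver():
    prefs = {"profile.default_content_setting_values.notifications": 2}
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_experimental_option("prefs", prefs)
    chrome_options.add_argument('--headless')
    # --no-sandbox removed: it caused the driver to stall on EC2
    chrome_options.add_argument('--disable-dev-shm-usage')  # suggestion: write to disk instead of /dev/shm
    return webdriver.Chrome(executable_path="./chromedriver_linux",
                            chrome_options=chrome_options)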

"Eager" Page Load Strategy workaround for Chromedriver Selenium in Python

I want to speed up page loads in Selenium because I don't need anything more than the HTML (I am trying to scrape all the links using BeautifulSoup). Using PageLoadStrategy.NONE doesn't work for scraping all the links, and Chrome no longer supports PageLoadStrategy.EAGER. Does anyone know of a workaround to get PageLoadStrategy.EAGER in Python?
ChromeDriver is the standalone server which implements WebDriver's wire protocol for Chromium. Chrome and Chromium are still in the process of implementing and moving to the W3C standard. Currently ChromeDriver is available for Chrome on Android and Chrome on Desktop (Mac, Linux, Windows and ChromeOS).
As per the current WebDriver W3C Editor's Draft, the table of page load strategies links the pageLoadStrategy capability keyword to a page loading strategy state and shows the document readiness state that corresponds to it: normal corresponds to complete, eager corresponds to interactive, and none does not wait for any readiness state.
However, if you observe the current implementation of ChromeDriver, Chrome DevTools does take into account the following document.readyState values:
document.readyState == 'complete'
document.readyState == 'interactive'
Here is a sample relevant log:
[1517231304.270][DEBUG]: DEVTOOLS COMMAND Runtime.evaluate (id=11) {
"expression": "var isLoaded = document.readyState == 'complete' || document.readyState == 'interactive';if (isLoaded) { var frame = document.createElement('iframe'); frame.name = 'chromedriver dummy frame'; ..."
}
As per WebDriver Status you will find the list of all WebDriver commands and their current support in ChromeDriver based on what is in the WebDriver Specification. Once the implementation is complete in all aspects, PageLoadStrategy.EAGER is bound to be functionally present within ChromeDriver.
You can only use normal or none as the pageLoadStrategy in ChromeDriver for now. So either choose none and handle everything yourself, or wait for the page load as it normally happens; a workaround along the first line is sketched below.
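In the meantime, one possible workaround (a sketch, not an official eager implementation) is to start with pageLoadStrategy set to none and then block manually until document.readyState reaches 'interactive', which is essentially what eager would give you:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

capabilities = DesiredCapabilities.CHROME.copy()
capabilities["pageLoadStrategy"] = "none"  # driver.get() returns immediately

driver = webdriver.Chrome(desired_capabilities=capabilities)
driver.get("https://example.com")

# emulate "eager": wait until the DOM is parsed, without waiting for subresources
WebDriverWait(driver, 20).until(
    lambda d: d.execute_script("return document.readyState") in ("interactive", "complete"))
html = driver.page_source  # now safe to hand off to BeautifulSoup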
