I have a simple program that logs into Facebook and gets 3 urls:
from selenium import webdriver

def setup_driver():
    prefs = {"profile.default_content_setting_values.notifications": 2}
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_experimental_option("prefs", prefs)
    chrome_options.add_argument('--headless')
    chrome_options.add_argument('--no-sandbox')
    driver = webdriver.Chrome(executable_path="./chromedriver_linux",
                              chrome_options=chrome_options)
    return driver

def log_into_facebook(driver):
    driver.get("https://www.facebook.com/")
    email_field = driver.find_element_by_id("email")
    email_field.send_keys("<MY EMAIL ADDRESS>")
    password_field = driver.find_element_by_id("pass")
    password_field.send_keys("<MY FB PASSWORD>")
    driver.find_element_by_id("loginbutton").click()

if __name__ == "__main__":
    driver = setup_driver()
    log_into_facebook(driver)
    print("before getting url 1")
    driver.get('https://facebook.com/2172111592857876')
    print("before getting url 2")
    driver.get('https://www.facebook.com/beaverconfessions/posts/2265225733546461')
    print("before getting url 3")
    driver.get('https://www.facebook.com/beaverconfessions/posts/640487179353666')
    print("finished getting 3 urls")
On my local machine, this program runs fine. However, on my AWS EC2 instance, it makes the instance unusable: the Python script hangs after "before getting url 2" is printed to the console. While the script is hanging, the EC2 instance becomes so slow that other programs on it also stop working properly, and I have to kill the script with Ctrl-C before the instance becomes responsive again. However, if I comment out log_into_facebook(driver), the program runs fine.
I would try to get a stack trace, but the program doesn't actually crash; it just never reaches "before getting url 3".
It is worth noting that I was previously getting "invalid session id" errors with a similar program (it also logged into Facebook and then called driver.get several times).
Update: Removing the --no-sandbox option from the webdriver seemed to fix the problem. I'm not sure why. I originally had this option in place because I was previously having a "unable to fix open pages" error, and I read that "--no-sandbox" would fix the error.
chrome_options.add_argument('--no-sandbox')
Roymunson reports that the appropriate way to fix the hanging problem is:
Avoid specifying the --no-sandbox option in the webdriver.
I'm crawling some web pages for my research.
I want to inject the JavaScript code below whenever the browser is redirected to another page:
window.alert = function() {};
I tried to inject it using WebDriverWait, so that Selenium would execute the code as soon as the driver is redirected to a new page, but it doesn't work.
while (some conditions):
    try:
        WebDriverWait(browser, 5).until(
            lambda driver: original_url != browser.current_url)
        browser.execute_script("window.alert = function() {};")
    except:
        pass  # do sth
    original_url = browser.current_url
It seems that the driver executes the JavaScript code only after the page has loaded, because the alert raised by the redirected page still shows.
Chrome 14+ blocks alerts inside onunload (https://stackoverflow.com/a/7080331/3368011)
But, I think the following questions may help you:
JavaScript before leaving the page
How to call a function before leaving page with Javascript
I solved my problem another way.
I tried again and again with browser.switch_to_alert, but it didn't work. Then I found that it is deprecated, so it does not work correctly. Instead, I check for an alert and dismiss it every second with the following code:
while *some_condition* :
    try:
        Alert(browser).dismiss()
    except:
        print("no alert")
        continue
This works fine on Windows 10 with Python 3.7.4.
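The answer says the alert is checked every second, but the loop as posted has no pause. A minimal sketch of the same idea with an explicit interval follows. The helper name `dismiss_alerts` and its parameters are hypothetical; it only assumes that `browser.switch_to.alert` raises when no alert is open, as Selenium does with `NoAlertPresentException`.

```python
import time


def dismiss_alerts(browser, should_continue, interval=1.0,
                   no_alert_errors=(Exception,)):
    """Dismiss any open alert once per `interval` seconds.

    `should_continue` is a zero-argument callable standing in for the
    answer's *some_condition*. Returns how many alerts were dismissed.
    """
    dismissed = 0
    while should_continue():
        try:
            # switch_to.alert raises if no alert is currently open.
            browser.switch_to.alert.dismiss()
            dismissed += 1
        except no_alert_errors:
            pass  # nothing to dismiss on this tick
        time.sleep(interval)
    return dismissed
```

With Selenium, one would probably pass `no_alert_errors=(NoAlertPresentException,)` from `selenium.common.exceptions` rather than catching every exception, so that unrelated failures are not silently swallowed.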
I'm scraping Bet365, probably one of the trickiest websites I've encountered, with Selenium and Chrome.
The issue with this page is that, even though my scraper sleeps between actions so that it never runs faster than a human could, the site sometimes blocks my IP for a random amount of time (between half an hour and 2 hours).
So I'm looking into proxies to change my IP and resume scraping, and this is where I'm stuck trying to decide how to approach it.
I've used 2 different free IP providers, as follows:
https://gimmeproxy.com
I wasn't able to make this one work and I'm emailing their support, but what I have, which should work, is as follows:
import requests
api="MY_API_KEY" #with the free plan I can ask 240 times a day for an IP
adder="&post=true&supportsHttps=true&maxCheckPeriod=3600"
url="https://gimmeproxy.com/api/getProxy?"
r=requests.get(url=url,params=adder)
Edited version:
apik="api_key={}".format(api)
r=requests.get(url=url,params=apik+adder)
And I get no answer: 404 error, not found. (Edit: it now works, my bad.)
My second approach is through another site, sslproxy.
With this one, you scrape the page and get a list of 100 IPs, theoretically checked and working. So I've set up a loop in which I try a random IP from that list; if it doesn't work, the loop deletes it from the list and tries again. This approach works when trying to open Bet365.
for n in range(1, 100):
    proxy_index=random.randint(0, len(proxies) - 1)
    proxi=proxies[proxy_index]
    PROXY=proxi['ip']+':'+proxi['port']
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument('--proxy-server={}'.format(PROXY))
    url="https://www.bet365.es"
    try:
        browser=webdriver.Chrome(path,options=chrome_options)
        browser.get(url)
        WebDriverWait(browser,10)..... #no need to post the whole condition
        break
    except:
        del proxies[proxy_index]
        browser.quit()
Well, with this one I succeeded in opening Bet365, and I'm still checking, but I think this webdriver is going to be much slower than my original one with no proxy.
So my question is: is it expected that scraping through a proxy is much slower, or does it depend on the proxy used? If so, does anyone recommend a different (or, surely, better) approach?
I don't see any significant issue in either your approach or your code block. However, another approach would be to make use of all the proxies listed in the Free Proxy List, whose Last Checked column shows how recently each one was verified.
As a solution, you can write a script that grabs all the available proxies and builds the list dynamically every time you initialize your program. The following program tries the proxies from the list one by one until a proxied connection is established and verified by checking that the page title of https://www.bet365.es contains the text bet365. An exception may arise because the free proxy your program grabbed is overloaded with users trying to route their traffic through it.
Code Block:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Collect the IP and port columns from the proxy table
driver = webdriver.Chrome(executable_path=r'C:\WebDrivers\chromedriver.exe')
driver.get("https://sslproxies.org/")
driver.execute_script("return arguments[0].scrollIntoView(true);", WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//table[@class='table table-striped table-bordered dataTable']//th[contains(., 'IP Address')]"))))
ips = [my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.XPATH, "//table[@class='table table-striped table-bordered dataTable']//tbody//tr[@role='row']/td[position() = 1]")))]
ports = [my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.XPATH, "//table[@class='table table-striped table-bordered dataTable']//tbody//tr[@role='row']/td[position() = 2]")))]
driver.quit()
proxies = []
for i in range(0, len(ips)):
    proxies.append(ips[i]+':'+ports[i])
print(proxies)
# Try each proxy in turn until the target page loads through it
for i in range(0, len(proxies)):
    try:
        print("Proxy selected: {}".format(proxies[i]))
        options = webdriver.ChromeOptions()
        options.add_argument('--proxy-server={}'.format(proxies[i]))
        driver = webdriver.Chrome(options=options, executable_path=r'C:\WebDrivers\chromedriver.exe')
        driver.get("https://www.bet365.es")
        # title_contains returns True once the title contains "bet365";
        # until() raises TimeoutException after 20 seconds otherwise
        if WebDriverWait(driver, 20).until(EC.title_contains("bet365")):
            # Do your scraping here
            break
    except Exception:
        driver.quit()
print("Proxy was Invoked")
Console Output:
['190.7.158.58:39871', '175.139.179.65:54980', '186.225.45.146:45672', '185.41.99.100:41258', '43.230.157.153:52986', '182.23.32.66:30898', '36.37.160.253:31450', '93.170.15.214:56305', '36.67.223.67:43628', '78.26.172.44:52490', '36.83.135.183:3128', '34.74.180.144:3128', '206.189.122.177:3128', '103.194.192.42:55546', '70.102.86.204:8080', '117.254.216.97:23500', '171.100.221.137:8080', '125.166.176.153:8080', '185.146.112.24:8080', '35.237.104.97:3128']
Proxy selected: 190.7.158.58:39871
Proxy selected: 175.139.179.65:54980
Proxy selected: 186.225.45.146:45672
Proxy selected: 185.41.99.100:41258
I'm coding my first Telegram bot, and now I have to serve multiple users at the same time.
This code is just a small part of it, but it should show how I'm trying to use multithreading with Selenium:
class MessageCounter(telepot.helper.ChatHandler):
    def __init__(self, *args, **kwargs):
        super(MessageCounter, self).__init__(*args, **kwargs)

    def on_chat_message(self, msg):
        content_type, chat_type, chat_id = telepot.glance(msg)
        chat_id = str(chat_id)
        browser = browserSelenium.start_browser(chat_id)
        userIsLogged = igLogin.checkAlreadyLoggedIn(browser, chat_id)
        print(userIsLogged)

TOKEN = "***"
bot = telepot.DelegatorBot(TOKEN, [
    pave_event_space()(
        per_chat_id(), create_open, MessageCounter, timeout=10),
])
MessageLoop(bot).run_as_thread()
while 1:
    time.sleep(10)
When the bot receives any message, it starts a Selenium session by calling this function:
def start_browser(chat_id):
    global browser
    try:
        browser.get('https://www.google.com')
        #igLogin.checkAlreadyLoggedIn(browser)
        #links = telegram.getLinks(24)
        #instagramLikes(browser, links)
    except Exception as e:
        print("type error: " + str(e))
        print('No such session! starting webDivers!')
        sleep(3)
        # CLIENT CONNECTION !!
        chrome_options = Options()
        chrome_options.add_argument('user-data-dir=/home/ale/botTelegram/users/'+ chat_id +'/cookies')
        chrome_options.add_argument('--no-sandbox')
        chrome_options.add_argument('--lang=en')
        print("Starting WebDrivers")
        browser = webdriver.Chrome(options=chrome_options)
        start_browser(chat_id)
    return browser
Then this one checks whether the user is logged in:
def checkAlreadyLoggedIn(browser, chat_id):
    browser.get('https://www.instagram.com/instagram/')
    try:
        WebDriverWait(browser, 5).until(EC.element_to_be_clickable(
            (By.XPATH, instagramClicks.buttonGoToProfile))).click()
        print('User already Logged')
        return True
    except:
        print('User not Logged')
        userLogged = login(browser, chat_id)
        return userLogged
If the user is not logged in, it tries to log the user in with a username and password.
So, basically, when I write to the bot from one account everything works fine, but if I write to the bot from two different accounts it opens two browsers and controls only one of them.
What I mean is that, for example, one window stays on the Google page while the other one receives every command twice, so when it has to type the username, it types it two times.
How can I interact with multiple sessions?
WebDriver is not thread-safe. Having said that, if you can serialise access to the underlying driver instance, you can share a reference across more than one thread. This is not advisable, but you can always instantiate one WebDriver instance for each thread.
Strictly speaking, the thread-safety issue isn't in your code but in the actual browser bindings. They all assume there will only be one command at a time (i.e. like a real user). On the other hand, you can always instantiate one WebDriver instance for each thread, which will launch multiple browsing tabs/windows. Up to this point, your program seems fine.
Now, different threads can be run on the same WebDriver, but then the results of the tests would not be what you expect. The reason is that when you use multi-threading to run different tests on different tabs/windows, a little bit of thread-safety coding is required; otherwise the actions you perform, like click() or send_keys(), will go to the tab/window that currently has the focus, regardless of the thread you expect to be running. This essentially means all the tests will run simultaneously on the tab/window that has focus rather than on the intended one.
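As an illustration of "one WebDriver per thread plus serialised access", here is a minimal sketch that keeps one driver and one lock per chat instead of a single shared driver. The posted `start_browser` stores its driver in one module-level `browser` variable shared by all chats, which is consistent with the symptom described. `DriverRegistry` and `factory` are hypothetical names; `factory` stands for whatever creates a Chrome instance for a given `chat_id`.

```python
import threading


class DriverRegistry:
    """Keep one driver per chat_id and serialise commands per driver."""

    def __init__(self, factory):
        self._factory = factory   # hypothetical callable: chat_id -> new driver
        self._entries = {}        # chat_id -> (driver, lock)
        self._guard = threading.Lock()

    def _entry(self, chat_id):
        # The guard makes "look up or create" atomic across threads.
        with self._guard:
            if chat_id not in self._entries:
                self._entries[chat_id] = (self._factory(chat_id),
                                          threading.Lock())
            return self._entries[chat_id]

    def run(self, chat_id, action):
        """Call action(driver) while holding that chat's own lock."""
        driver, lock = self._entry(chat_id)
        with lock:
            return action(driver)
```

A handler could then call something like `registry.run(chat_id, lambda d: igLogin.checkAlreadyLoggedIn(d, chat_id))`, so that commands for one chat never reach another chat's browser. Creating the driver while holding the guard keeps the sketch short, at the cost of starting browsers one at a time.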
Reference
You can find a relevant detailed discussion in:
Chrome crashes after several hours while multiprocessing using Selenium through Python
I have pyppeteer code that browses around. Let's assume it only clicks on a tags.
It runs fine on my local Windows machine, but breaks whenever I run it remotely on a Linux server. Same conda env, same code.
The relevant part of my code, simplified, looks like:
async def act(self):
    element = self.element

    async def get_action():
        tag_name = await self.page.evaluate(
            'elem => { return elem.tagName.toLowerCase(); }',
            element)
        action = None
        if tag_name == 'a':
            action = element.click()
        else:
            action = async_pass()
        return action

    async def get_action_future():
        # gather syntax based on:
        # https://miyakogi.github.io/pyppeteer/reference.html#pyppeteer.page.Page.click
        action = await get_action()
        future_action = asyncio.gather(
            action,
            asyncio.sleep(0.001),  # dirty, dirty work-around, doesn't work nicely otherwise
        )
        waited_future = await asyncio.shield(future_action)
        if waited_future[0] is None:
            await self.page.waitForNavigation(self.wait_options)
        return None

    await get_action_future()
It runs fine on my Windows machine.
When I start it on a Linux machine, it starts off OK, whether there's navigation or not. Then, after a few navigation clicks, I get a timeout, followed by another error:
Error encountered: Navigation Timeout Exceeded: 20000 ms exceeded.
# then I trigger the element selector and the act method again, wrapped in try/except
Error encountered: Protocol Error (Runtime.callFunctionOn): Session closed. Most likely the page has been closed.
I've been stuck on this problem for a while and would appreciate any help!
My environment includes:
python=3.6, pyppeteer=0.0.25.
BTW:
I noticed that this question has a similar error. BUT, the error is different (Protocol error (Page.navigate): Target closed instead of Protocol Error (Runtime.callFunctionOn)), as well as the environment (node.js, Puppeteer, etc.).
It seems that pyppeteer is no longer maintained. I was experiencing the same issue and migrated to Selenium, which is working well.
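For readers staying with pyppeteer, the documentation page linked in the question's code comment starts the navigation wait together with the click, rather than after the click has finished. A minimal sketch of that pattern follows. The helper name `click_and_wait_for_navigation` is hypothetical, and nothing in this thread establishes whether it resolves the timeout seen on Linux.

```python
import asyncio


async def click_and_wait_for_navigation(page, element, wait_options=None):
    """Click `element` and wait for the navigation it triggers.

    `page` and `element` are assumed to behave like a pyppeteer Page and
    ElementHandle: both methods used here are coroutines.
    """
    # Both coroutines are scheduled together, so the wait is already
    # registered when the click triggers the navigation.
    navigation, _ = await asyncio.gather(
        page.waitForNavigation(wait_options or {}),
        element.click(),
    )
    return navigation
```

In the question's `act` method this would replace the separate `gather` and later `waitForNavigation` calls for the `a` tag case; other tags would still need the no-op branch.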
I want to get the current url when I am running Selenium.
I looked at this stackoverflow page: How do I get current URL in Selenium Webdriver 2 Python?
and tried the things posted but it's not working. I am attaching my code below:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
# launch Firefox
driver = webdriver.Firefox()
url1='https://poshmark.com/search?'
# open the search page in a window
driver.get(url1)
xpath='//input[@id="user-search-box"]'
searchBox=driver.find_element_by_xpath(xpath)
brand="freepeople"
style="top"
searchBox.send_keys(' '.join([brand,"sequin",style]))
# equivalent of hitting the Enter key
searchBox.send_keys(Keys.ENTER)
print(driver.current_url)
My code prints https://poshmark.com/search? but it should print https://poshmark.com/search?query=freepeople+sequin+top&type=listings&department=Women because that is the page Selenium navigates to.
The issue is that there is no delay between your searchBox.send_keys(Keys.ENTER) and print(driver.current_url).
There needs to be some delay so that the statement can pick up the URL change. If your code fires before the URL has actually changed, it gives you the old URL.
The workaround would be to add time.sleep(1) to wait for 1 second. A hard-coded sleep is not a good option, though. You should do one of the following:
Keep polling the URL and wait for it to change
Wait for an object that you know will appear when the new page loads
Instead of using Keys.ENTER, simulate the operation with a .click() on the search button, if one is available
Usually, when you use the click method in Selenium, it takes care of page changes, so you don't see such issues. Here you press a key using Selenium, which doesn't do any kind of waiting for the page load. That is why you see the issue in the first place.
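The first option, polling the URL until it changes, could be sketched as follows. The function name `wait_for_url_change` and its parameters are hypothetical; the only thing it assumes about `driver` is a `current_url` attribute, which Selenium WebDriver objects provide.

```python
import time


def wait_for_url_change(driver, old_url, timeout=10.0, poll_interval=0.25):
    """Poll driver.current_url until it differs from old_url.

    Returns the new URL, or raises TimeoutError after `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        current = driver.current_url
        if current != old_url:
            return current
        if time.monotonic() >= deadline:
            raise TimeoutError(
                "URL did not change from {!r} within {}s".format(old_url, timeout))
        time.sleep(poll_interval)
```

In the question's script this would mean saving `old_url = driver.current_url` before `searchBox.send_keys(Keys.ENTER)` and printing the value returned by `wait_for_url_change(driver, old_url)`. Selenium's `expected_conditions` module also offers `url_changes`, which can be passed to `WebDriverWait(...).until(...)` for the same effect.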
I had the same issue and came up with a solution that uses Selenium's standard explicit wait (see how explicit waits work in the documentation).
Here is my solution:
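from contextlib import contextmanager
from selenium.webdriver.support.ui import WebDriverWait

class UrlHasChanged:
    def __init__(self, old_url):
        self.old_url = old_url

    def __call__(self, driver):
        # True once the browser has moved away from the saved URL
        return driver.current_url != self.old_url

@contextmanager
def url_change(driver):
    current_url = driver.current_url
    yield
    WebDriverWait(driver, 10).until(UrlHasChanged(current_url))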
Explanation:
First, I created my own wait condition (see here) that takes old_url as a parameter (the URL from before the action) and checks whether it still equals current_url after the action. It returns False when both URLs are the same and True otherwise.
Then I created a context manager to wrap the action I want to perform: it saves the URL before the action runs and afterwards uses WebDriverWait with the wait condition created above.
With this solution I can reuse the function for any action that changes the URL and wait for the change, like this:
with url_change(driver):
    login_panel.login_user(normal_user['username'], new_password)
assert driver.current_url == dashboard.url
It is safe because WebDriverWait(driver, 10).until(UrlHasChanged(current_url)) waits until the current URL changes, and after 10 seconds it stops waiting by throwing an exception.
What do you think about this?
I fixed this problem by reading the button's href and then calling driver.get(hreflink) instead of clicking the button. click() was not working for me!