I'm scraping Bet365, probably one of the most tricky websites I've encountered, with selenium and Chrome.
The issue with this page is that, even though my scraper takes sleeps so in no way it runs faster of what a human could, at some point, sometimes, it blocks my ip from a random amount of time (between half and 2 hours).
So, I'm looking into proxies to change my IP and resume my scraping. And here is where I'm kind of stuck trying to decide how to approach this
I've used 2 different free ip providers as follows
https://gimmeproxy.com
I wasn't able to make this one work, I'm emailing their support, but what I have, which should work is as follows
import requests
api="MY_API_KEY" #with the free plan I can ask 240 times a day for an IP
adder="&post=true&supportsHttps=true&maxCheckPeriod=3600"
url="https://gimmeproxy.com/api/getProxy?"
r=requests.get(url=url,params=adder)
THIS IS EDITED
apik="api_key={}".format(api)
r=requests.get(url=url,params=apik+adder)
aaand I get no answer. 404 error not found. NOW WORKS, MY BAD
My second approach is through this other site sslproxy
With this one, you scrape the page, and you get a list of 100 IPs, theoretically checked and working. So, I've set up a loop in which I try a random IP from that list, and if it doesn't work deletes it from the list and tries again. This approach works hen trying to open Bet365.
for n in range(1, 100):
proxy_index=random.randint(0, len(proxies) - 1)
proxi=proxies[proxy_index]
PROXY=proxi['ip']+':'+proxi['port']
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--proxy-server={}'.format(PROXY))
url="https://www.bet365.es"
try:
browser=webdriver.Chrome(path,options=chrome_options)
browser.get(url)
WebDriverWait(browser,10)..... #no need to post the whole condition
break
except:
del proxies[proxy_index]
browser.quit()
Well, with this one I succeded on trying to open Bet365, and I'm still checking, but I think this webdriver is going to be much slower than my original one, with no proxy.
So, my question is, is it expected that using proxy the scraping is going to be much slower, or does it depend on the proxy used? If so, does anyone recommed a different (or better, surely) approach?
I don't see any significant issue either in your approach or your code block. However, another approach would be to make use of all the proxies marked with in the Last Checked column which gets updated within the Free Proxy List.
As a solution you can write a script to grab all the proxies available and create a List dynamically every time you initialize your program. The following program will invoke a proxy from the Proxy List one by one until a successful proxied connection is established and verified through the Page Title of https://www.bet365.es to contain the text bet365. An exception may arise because the free proxy which your program grabbed was overloaded with users trying to get their proxy traffic through.
Code Block:
driver.get("https://sslproxies.org/")
driver.execute_script("return arguments[0].scrollIntoView(true);", WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//table[#class='table table-striped table-bordered dataTable']//th[contains(., 'IP Address')]"))))
ips = [my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.XPATH, "//table[#class='table table-striped table-bordered dataTable']//tbody//tr[#role='row']/td[position() = 1]")))]
ports = [my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.XPATH, "//table[#class='table table-striped table-bordered dataTable']//tbody//tr[#role='row']/td[position() = 2]")))]
driver.quit()
proxies = []
for i in range(0, len(ips)):
proxies.append(ips[i]+':'+ports[i])
print(proxies)
for i in range(0, len(proxies)):
try:
print("Proxy selected: {}".format(proxies[i]))
options = webdriver.ChromeOptions()
options.add_argument('--proxy-server={}'.format(proxies[i]))
driver = webdriver.Chrome(options=options, executable_path=r'C:\WebDrivers\chromedriver.exe')
driver.get("https://www.bet365.es")
if "Proxy Type" in WebDriverWait(driver, 20).until(EC.title_contains("bet365")):
# Do your scrapping here
break
except Exception:
driver.quit()
print("Proxy was Invoked")
Console Output:
['190.7.158.58:39871', '175.139.179.65:54980', '186.225.45.146:45672', '185.41.99.100:41258', '43.230.157.153:52986', '182.23.32.66:30898', '36.37.160.253:31450', '93.170.15.214:56305', '36.67.223.67:43628', '78.26.172.44:52490', '36.83.135.183:3128', '34.74.180.144:3128', '206.189.122.177:3128', '103.194.192.42:55546', '70.102.86.204:8080', '117.254.216.97:23500', '171.100.221.137:8080', '125.166.176.153:8080', '185.146.112.24:8080', '35.237.104.97:3128']
Proxy selected: 190.7.158.58:39871
Proxy selected: 175.139.179.65:54980
Proxy selected: 186.225.45.146:45672
Proxy selected: 185.41.99.100:41258
Related
I have a simple program that logs into Facebook and gets 3 urls:
def setup_driver():
prefs = {"profile.default_content_setting_values.notifications": 2}
chrome_options = webdriver.ChromeOptions()
chrome_options.add_experimental_option("prefs", prefs)
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
driver = webdriver.Chrome(executable_path="./chromedriver_linux",
chrome_options=chrome_options)
return driver
def log_into_facebook(driver):
driver.get("https://www.facebook.com/")
email_field = driver.find_element_by_id("email")
email_field.send_keys("<MY EMAIL ADDRESS>")
password_field = driver.find_element_by_id("pass")
password_field.send_keys("<MY FB PASSWORD>")
driver.find_element_by_id("loginbutton").click()
if __name__ == "__main__":
driver = setup_driver()
log_into_facebook(driver)
print("before getting url 1")
driver.get('https://facebook.com/2172111592857876')
print("before getting url 2")
#Stackoverflow is breaking indentation
driver.get('https://www.facebook.com/beaverconfessions/posts/2265225733546461')
print("before getting url 3")
driver.get('https://www.facebook.com/beaverconfessions/posts/640487179353666')
print("finished getting 3 urls")
On my local machine, this program runs fine. However, on my AWS EC2 instance, this program makes my instance unusable (the Python script will hang/stall after "before getting url 2" is printed to the console. While the script is hanging, the EC2 instance will become so slow that other programs on the instance will also stop working properly. I need to forcefully close the program with Ctrl-C in order for the instance to start being responsive again.). However, if I comment out log_into_facebook(driver), then the program runs fine.
I would try to get an stacktrace, but the program doesn't actually crash, rather it just never reaches "before getting url 3".
It is worth nothing, previously I was getting "invalid session id" errors with a program that was similar to this (it also logged into Facebook and then called driver.get several times).
Update: Removing the --no-sandbox option from the webdriver seemed to fix the problem. I'm not sure why. I originally had this option in place because I was previously having a "unable to fix open pages" error, and I read that "--no-sandbox" would fix the error.
chrome_options.add_argument('--no-sandbox')
Roymunson reports that the appropriate way to fix the hanging problem is:
Avoid specifying the --no-sandbox option in the webdriver.
I'm coding my first telegram bot, but now I have to serve multiple user at the same time.
This code it's just a little part, but it should help me to use multithread with selenium
class MessageCounter(telepot.helper.ChatHandler):
def __init__(self, *args, **kwargs):
super(MessageCounter, self).__init__(*args, **kwargs)
def on_chat_message(self, msg):
content_type, chat_type, chat_id = telepot.glance(msg)
chat_id = str(chat_id)
browser = browserSelenium.start_browser(chat_id)
userIsLogged = igLogin.checkAlreadyLoggedIn(browser, chat_id)
print(userIsLogged)
TOKEN = "***"
bot = telepot.DelegatorBot(TOKEN, [
pave_event_space()(
per_chat_id(), create_open, MessageCounter, timeout=10),
])
MessageLoop(bot).run_as_thread()
while 1:
time.sleep(10)
when the bot recive any message it starts a selenium session calling this function:
def start_browser(chat_id):
global browser
try:
browser.get('https://www.google.com')
#igLogin.checkAlreadyLoggedIn(browser)
#links = telegram.getLinks(24)
#instagramLikes(browser, links)
except Exception as e:
print("type error: " + str(e))
print('No such session! starting webDivers!')
sleep(3)
# CLIENT CONNECTION !!
chrome_options = Options()
chrome_options.add_argument('user-data-dir=/home/ale/botTelegram/users/'+ chat_id +'/cookies')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--lang=en')
print("Starting WebDrivers")
browser = webdriver.Chrome(options=chrome_options)
start_browser(chat_id)
return browser
and then this one check if the user is logged:
def checkAlreadyLoggedIn(browser, chat_id):
browser.get('https://www.instagram.com/instagram/')
try:
WebDriverWait(browser, 5).until(EC.element_to_be_clickable(
(By.XPATH, instagramClicks.buttonGoToProfile))).click()
print('User already Logged')
return True
except:
print('User not Logged')
userLogged = login(browser, chat_id)
return userLogged
and if the user is not logged it try to log the user in whit username and password
so, basically, when I write at the bot with one account everithing works fine, but if I write to the bot from two different account it opens two browser, but it controll just one.
What I mean it's that for example, one window remain over the google page, and then the other one recive two times the comand, so, even when it has to write the username, it writes the username two times
How can I interract with multiple sessions?
WebDriver is not thread-safe. Having said that, if you can serialise access to the underlying driver instance, you can share a reference in more than one thread. This is not advisable. But you can always instantiate one WebDriver instance for each thread.
Ideally the issue of thread-safety isn't in your code but in the actual browser bindings. They all assume there will only be one command at a time (e.g. like a real user). But on the other hand you can always instantiate one WebDriver instance for each thread which will launch multiple browsing tabs/windows. Till this point it seems your program is perfect.
Now, different threads can be run on same Webdriver, but then the results of the tests would not be what you expect. The reason behind is, when you use multi-threading to run different tests on different tabs/windows a little bit of thread safety coding is required or else the actions you will perform like click() or send_keys() will go to the opened tab/window that is currently having the focus regardless of the thread you expect to be running. Which essentially means all the test will run simultaneously on the same tab/window that has focus but not on the intended tab/window.
Reference
You can find a relevant detailed discussion in:
Chrome crashes after several hours while multiprocessing using Selenium through Python
I had a code snippet that I tried a couple of weeks ago, that was getting the HTML of the page no problem (I was learning about NLP and wanted to try a couple of things with the titles). However, now when I perform a request with proxies (proxies=session.proxies) I get404. When I omit using the proxies everything is fine (that means that I am using my own IP, headers in both cases are identical)... Can someone help me with what is going on here? I am positive that the IPs are blocked. When I use some free proxies from the Internet everything is fine. But they are super unstable, and so it's not possible to do anything. I have also looked at the ips that this code snippet produces:
with Controller.from_port(port=9051) as controller:
controller.authenticate('my_pass')
controller.signal(Signal.NEWNYM)
time.sleep(controller.get_newnym_wait())
2 ips worked (Netherlands and France) from like 20 (Austria, UK, Netherlands (but starts with different two leading digits), Liberia, France (also different two leading digis)) that I have tried so far. Is it possible to tell tor to use proxies from specific countries? I guess that wouldn't help me much. If only I could cycle through the working IPs, but I have read somewhere that that is not possible.
Here is the code that preceeds the above code snippet:
import requests
global session
session = requests.session()
session.proxies = {}
session.proxies['http'] = 'socks5h://localhost:9050'
session.proxies['https'] = 'socks5h://localhost:9050'
url = 'https://www.reuters.com/finance/stocks/company-news/AAPL.OQ?date=12222016'
html = requests.get(url, timeout=(120, 120), proxies=session.proxies)
print(html)
Tried changing user agents to no avail.
Also note, that for the above to work, you need this.
Running a super simple webserver that, for the most part, just needs to strip out the query info from requests from a 3rd party script and return some files:
def run(server_class=HTTPServer, handler_class=MyHandler, port=xxx):
server_address = ('', port)
httpd = server_class(server_address, handler_class)
print ('Starting httpd...')
httpd.serve_forever()
One object that the server must return is an HTML page that then loads some javascript, which then loads three.js, which then proceeds to load a ton of objects from this same server.
Which works, but after reloading 1-5 times, it would usually result in WinError[10053] and the server locking up! After that, connections would be rejected or timeout. Not sure if this is due to too many requests, or something to do with Three.js's load function's connections.
This took hours and I couldn't find a specific solution, so I'll post an answer down below. Feel free to chime in other answers
By multi-threading the server/handler, I've been able to handle at least one user. It may still be a throughput problem, but this has been enough for now:
class ThreadingHTTPServer(socketserver.ThreadingMixIn, HTTPServer):
pass
def run(server_class=ThreadingHTTPServer, handler_class=MyHandler, port=xxx):
server_address = ('', port)
httpd = server_class(server_address, handler_class)
print ('Starting httpd...')
httpd.serve_forever()
Some other things I tried/failed:
Editing self.path and calling SimpleHTTPRequestHandler.do_GET(self) //Couldn't get path changing to affect the simple handler
allow_reuse_address = True/False // No effect
manually setting close_connection = True //No effect
some header play //No effect
Hope this saves someone!
I want to get the current url when I am running Selenium.
I looked at this stackoverflow page: How do I get current URL in Selenium Webdriver 2 Python?
and tried the things posted but it's not working. I am attaching my code below:
from selenium import webdriver
#launch firefox
driver = webdriver.Firefox()
url1='https://poshmark.com/search?'
# search in a window a window
driver.get(url1)
xpath='//input[#id="user-search-box"]'
searchBox=driver.find_element_by_xpath(xpath)
brand="freepeople"
style="top"
searchBox.send_keys(' '.join([brand,"sequin",style]))
from selenium.webdriver.common.keys import Keys
#EQUIValent of hitting enter key
searchBox.send_keys(Keys.ENTER)
print(driver.current_url)
my code prints https://poshmark.com/search? but it should print: https://poshmark.com/search?query=freepeople+sequin+top&type=listings&department=Women because that is what selenium goes to.
The issue is that there is no lag between your searchBox.send_keys(Keys.ENTER) and print(driver.current_url).
There should be some time lag, so that the statement can pick the url change. If your code fires before url has actually changed, it gives you old url only.
The workaround would be to add time.sleep(1) to wait for 1 second. A hard coded sleep is not a good option though. You should do one of the following
Keep polling url and wait for the change to happen or the url
Wait for a object that you know would appear when the new page comes
Instead of using Keys.Enter simulate the operation using a .click() on search button if it is available
Usually when you use click method in selenium it takes cared of the page changes, so you don't see such issues. Here you press a key using selenium, which doesn't do any kind of waiting for page load. That is why you see the issue in the first place
I had the same issue and I came up with solution that uses default explicit wait (see how explicit wait works in documentation).
Here is my solution
class UrlHasChanged:
def __init__(self, old_url):
self.old_url = old_url
def __call__(self, driver):
return driver.current_url != self.old_url:
#contextmanager
def url_change(driver):
current_url = driver.current_url
yield
WebDriverWait(driver, 10).until(UrlHasChanged(current_url))
Explanation:
At first, I created my own wait condition (see here) that takes old_url as a parameter (url from before action was made) and checks whether old url is the same like current_url after some action. It returns false when both urls are the same and true otherwise.
Then, I created context manager to wrap action that I wanted to make, and I saved url before action was made, and after that I used WebDriverWait with created before wait condition.
Thanks to that solution I can now reuse this function with any action that changes url to wait for the change like that:
with url_change(driver):
login_panel.login_user(normal_user['username'], new_password)
assert driver.current_url == dashboard.url
It is safe because WebDriverWait(driver, 10).until(UrlHasChanged(current_url)) waits until current url will change and after 10 seconds it will stop waiting by throwing an exception.
What do you think about this?
I fixed this problem by clicking on the button by using href. Then do driver.get(hreflink). Click() was not working for me!