Hope you all are fine.
I'm opening a link using this code:
from selenium import webdriver

def load_url(url):
    driver = webdriver.Chrome()
    driver.get(url)

def scrap_matches(urls):
    for url in urls:
        load_url(url)
Use Case:
I'm using a list of urls and passing them one by one to the load_url function.
I'm using multithreading, i.e. there are 4 threads and each thread works on its own list of urls, like this:
import threading

# part_1 and part_2 are the lists.
t1 = threading.Thread(target=scrap_matches, args=(part_1,))
t2 = threading.Thread(target=scrap_matches, args=(part_2,))
t1.start()
t2.start()
t1.join()
t2.join()
Sometimes an empty Chrome tab opens without loading any of the page's content, even though the link is present in Chrome's address bar. When I refresh that tab manually, the website loads successfully.
I need to avoid this situation. I have searched for a solution but could not find a satisfactory answer.
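One workaround, since a manual refresh fixes it, is to check from code whether the page actually rendered and refresh it when it did not. A rough sketch (the retry count and the "body has text" check are my assumptions, not part of the original code):

from selenium import webdriver
from selenium.common.exceptions import WebDriverException

def load_url(url, retries=3):
    driver = webdriver.Chrome()
    driver.get(url)
    for _ in range(retries):
        try:
            # If the body has rendered some text, assume the page loaded.
            if driver.find_element_by_tag_name("body").text.strip():
                break
        except WebDriverException:
            pass
        # Blank tab: refresh it from code, just like doing it manually.
        driver.refresh()
    return driver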
Related
I have a list of about 13,000 websites. From each of those links, I intend to scrape information one by one using Python, Beautiful Soup and Selenium.
For most websites, the scraping process works fine. However, Selenium occasionally encounters a problem with a specific link. For instance, it gave the following error message when loading one of them:
WebDriverException: Message: unknown error: net::ERR_SSL_BAD_RECORD_MAC_ALERT (Session info: chrome=90.0.4430.93)
When I went to the driver and reloaded the page manually, it worked fine. Unfortunately, the error stopped the whole scraping process. When I run the process again, I want to prevent this from happening another time.
Here is the first part of the loop I use to scrape the links:
for house in all_nd:
    if str(requests.head(house)) == '<Response [200]>':
        driver.get(house)
        house_html = driver.page_source
        house_soup = BeautifulSoup(house_html)
Here, all_nd is a Python 3 list of URL strings pointing to houses and apartments. They all start with 'https://'.
Question
How do I ensure that the scraping process is not stopped by a (temporary) error from a website? How do I skip to the next link in the list and continue the for loop?
You should use try-except and, in case of an exception, continue to the next iteration.
for house in all_nd:
    try:
        if str(requests.head(house)) == '<Response [200]>':
            driver.get(house)
            house_html = driver.page_source
            house_soup = BeautifulSoup(house_html)
    except Exception:
        # Skip this link and continue with the next one.
        continue
I looked up the Selenium Python documentation, and it allows one to take screenshots of an element. I tried the following code and it worked for small pages (around 3-4 actual A4 pages when you print them):
from selenium import webdriver
from selenium.webdriver import FirefoxOptions

firefox_profile = webdriver.FirefoxProfile()
firefox_profile.set_preference("browser.privatebrowsing.autostart", True)

# Configure options for Firefox webdriver
options = FirefoxOptions()
options.add_argument('--headless')

# Initialise Firefox webdriver
driver = webdriver.Firefox(firefox_profile=firefox_profile, options=options)
driver.maximize_window()
driver.get(url)
driver.find_element_by_tag_name("body").screenshot("career.png")
driver.close()
When I try it with url="https://waitbutwhy.com/2020/03/my-morning.html", it gives the screenshot of the entire page, as expected. But when I try it with url="https://waitbutwhy.com/2018/04/picking-career.html", almost half of the page is not rendered in the screenshot (the image is too large to upload here), even though the "body" tag does extend all the way down in the original HTML.
I have tried using both implicit and explicit waits (set to 10s, which is more than enough for a browser to load all contents, comments and discussion section included), but that has not improved the screenshot capability. Just to be sure that selenium was in fact loading the web page properly, I tried loading without the headless flag, and once the webpage was completely loaded, I ran driver.find_element_by_tag_name("body").screenshot("career.png"). The screenshot was again half-blank.
It seems that there might be some memory constraints put on the screenshot method (although I couldn't find any), or the logic behind the screenshot method itself is flawed. I can't figure it out though. I simply want to take the screenshot of the entire "body" element (preferably in a headless environment).
You may try this code; you just need to install a package from the command prompt first using pip install Selenium-Screenshot.
import time
from selenium import webdriver
from Screenshot import Screenshot_Clipping
driver = webdriver.Chrome()
driver.maximize_window()
driver.implicitly_wait(10)
driver.get("https://waitbutwhy.com/2020/03/my-morning.html")
obj = Screenshot_Clipping.Screenshot()
img_loc = obj.full_Screenshot(driver, save_path=r'.', image_name='capture.png')
print(img_loc)
time.sleep(5)
driver.close()
The result is a full-page screenshot saved as capture.png; you just need to zoom in on the saved image.
Hope this works for you!
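An alternative that avoids the extra package is to grow the browser window to the full document height before taking the element screenshot. This is only a sketch, not the package's method: the 1920 px width is arbitrary and it assumes document.body.scrollHeight reports the real page height.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')  # the window can exceed the screen size in headless mode
driver = webdriver.Chrome(options=options)
driver.get("https://waitbutwhy.com/2018/04/picking-career.html")

# Resize the viewport to the full document height so the body is not clipped.
total_height = driver.execute_script("return document.body.scrollHeight")
driver.set_window_size(1920, total_height)

driver.find_element_by_tag_name("body").screenshot("career.png")
driver.quit()

Whether this captures everything still depends on the page's layout; very long pages can hit renderer limits either way.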
I am developing a Python application to send some images via WhatsApp, but when I try to attach the image, the file path that gets typed is broken. Does anyone know what is happening?
import time
import pyautogui
from bs4 import BeautifulSoup

html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')

AbreAnexo = driver.find_element_by_css_selector('span[data-icon="clip"]')
AbreAnexo.click()
AbreImagem = driver.find_element_by_css_selector('button[class="Ijb1Q"]')
AbreImagem.click()

pyautogui.typewrite("C:\\Users\\f_teicar\\Documents\\Lanchonete\\001-Cardapio.png", interval=0.02)
time.sleep(5)
pyautogui.press('enter')
time.sleep(5)
pyautogui.press('enter')
What I expect it to type is C:\Users\f_teicar\Documents\Lanchonete\001-Cardapio.png,
but what it actually types is car\Documents\Lanchonete\001-Cardapio.png.
Especially on external web sites, it can take a while for the page to load. This means that the next step (or part of it) might be ignored because the page isn't ready to receive further operations from the Selenium client.
time.sleep(n), where n is the number of seconds to wait, is a quick way of waiting for the page to load, but if loading takes a bit longer than the time you specify it will fail, and if the page loads much faster you waste time. So I use a function that waits for the page, like this:
from contextlib import contextmanager
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support.expected_conditions import staleness_of

MAX_WAIT = 30  # example default timeout in seconds; adjust to your pages

@contextmanager
def wait_for_page_load(timeout=MAX_WAIT):
    """Wait for a new page that isn't the old page."""
    # assumes `driver` is your Selenium webdriver, defined elsewhere
    old_page = driver.find_element_by_tag_name('html')
    yield
    WebDriverWait(driver, timeout).until(staleness_of(old_page))
To call the function, use something like:
with wait_for_page_load():
    AbreImagem.click()
where the second line is anything that causes a new page to load. Note that this procedure depends on the presence of the <html> tag in the old page, which is usually pretty reliable.
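Applied to the WhatsApp snippet above, that means waiting for the attach (clip) icon to be clickable rather than sleeping for a fixed time. A rough sketch, assuming driver is the webdriver from the question and reusing its span[data-icon="clip"] selector; the 20-second timeout is arbitrary:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait until the attachment (clip) icon is clickable instead of sleeping a fixed time.
wait = WebDriverWait(driver, 20)
AbreAnexo = wait.until(
    EC.element_to_be_clickable((By.CSS_SELECTOR, 'span[data-icon="clip"]'))
)
AbreAnexo.click()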
I am using Python to scrape some data from a website with Selenium and Beautiful Soup. The page has buttons you can click which change the data displayed in the tables, but this is all handled by the JavaScript on the page; the page URL does not change.
Selenium successfully renders the JavaScript on page load, but it keeps using the previous state (before the clicks), therefore scraping the same data instead of the new data.
I tried following the solutions given on Obey The Testing Goat, but it always seemed to time out and the state never turned stale. I've tried waiting 10 seconds manually with time.sleep, which should be more than enough for the state to refresh. I've tried using WebDriverWait to wait until the old page turned stale. I've tried looking through the Selenium documentation for possible solutions. The code below attempts to use the solution presented on that website, but it simply times out no matter the timeout value.
from contextlib import contextmanager
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support.expected_conditions import staleness_of

class MySeleniumTest():
    # assumes self.browser is a selenium webdriver
    def __init__(self, browser, soup):
        self.browser = browser
        self.soup = soup

    @contextmanager
    def wait_for_page_load(self, timeout=30):
        old_page = self.browser.find_element_by_tag_name('html')
        yield
        WebDriverWait(self.browser, timeout).until(staleness_of(old_page))

    def tryChangingState(self):
        with self.wait_for_page_load(timeout=20):
            og_state = self.soup
            tab = self.browser.find_element_by_link_text('Breakfast')
            tab.click()
            tab = self.browser.find_element_by_link_text('Lunch')
            tab.click()
            new_state = self.soup
            # check if the HTML code has changed
            print(og_state != new_state)

# create tester object
tester = MySeleniumTest(browser, soup)
# try changing state after clicking on a button
tester.tryChangingState()
I'm not sure whether I'm using it correctly. I also tried opening a new with self.wait_for_page_load(timeout=20): block after the first click and putting the rest of the code inside it, but that did not work either. I would expect og_state != new_state to be true, implying the HTML changed, but the actual result is false.
Original poster here. I found the reason for the issue. The state was being updated in Selenium, but since I was using Beautiful Soup for parsing, the Beautiful Soup object was still using the source code from the previous Selenium web driver state. By updating the soup object each time the page was clicked, the scraper was able to successfully gather the new data.
I updated the soup object by simply calling soup = BeautifulSoup(browser.page_source, 'lxml') after each click.
In other words, I didn't need to worry about the state of the Selenium web driver; it was simply a matter of updating the source code the parser was reading.
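A minimal sketch of that fix, assuming browser is the Selenium webdriver from the question (the click_and_parse helper name is just illustrative):

from bs4 import BeautifulSoup

def click_and_parse(browser, link_text):
    browser.find_element_by_link_text(link_text).click()
    # Rebuild the soup from the current page source so it reflects the new state.
    return BeautifulSoup(browser.page_source, 'lxml')

breakfast_soup = click_and_parse(browser, 'Breakfast')
lunch_soup = click_and_parse(browser, 'Lunch')
print(breakfast_soup != lunch_soup)  # True if the tables actually changed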
Good afternoon :) I'm having a problem with my Python 3 / GTK 3 application and the Selenium WebDriver (ChromeDriver). I'm also using Linux, if it matters.
Basically, the user presses a button to start the Selenium WebDriver automation, and as the automation process runs, it SHOULD give feedback to the user in the GUI (see Content.content_liststore.append(list(item)) and LogBox.log_text_buffer).
However, nothing is added to the content_liststore until after fb_driver.close() is done. In the meantime, the Gtk window just hangs.
Now, I've been looking into multithreading in the hope of keeping the GUI responsive to this feedback, but I've also been reading that Selenium doesn't like multithreading (though I presume that refers to running multiple browsers/tabs, which this is not).
So, my question is: is multithreading the go-to fix for getting this to work?
# ELSE IF, FACEBOOK COOKIES DO NOT EXIST, PROCEED TO LOGIN PAGE
elif os.stat('facebook_cookies').st_size == 0:
    while True:
        try:  # look for element, if not found, refresh the webpage
            assert "Facebook" in fb_driver.title
            login_elem = fb_driver.find_element_by_id("m_login_email")
            login_elem.send_keys(facebook_username)
            login_elem = fb_driver.find_element_by_id("m_login_password")
            login_elem.send_keys(facebook_password)
            login_elem.send_keys(Keys.RETURN)
        except ElementNotVisibleException:
            fb_driver.refresh()
            StatusBar.status_bar.push(StatusBar.context_id, "m_login_password element not found, trying again...")
            ProblemsLog.text_buffer.set_text("Facebook has hidden the password field, refreshing page...")
        else:
            query_elem = fb_driver.find_element_by_name("query")
            query_elem.send_keys(target)
            query_elem.send_keys(Keys.RETURN)
            break
    m_facebook_url_remove = "query="
    m_facebook_url = fb_driver.current_url.split(m_facebook_url_remove, 1)[1]  # Remove string before "query="
    facebook_url = "https://www.facebook.com/search/top/?q=" + m_facebook_url  # Merge left-over string with the desktop url
    StatusBar.status_bar.push(StatusBar.context_id, "Facebook elements found")
    fb_title = fb_driver.title
    fb_contents = [(target_name.title(), "Facebook", facebook_url)]
    for item in fb_contents:
        Content.content_liststore.append(list(item))
        #progress_bar.set_fraction(0.10)
    LogBox.log_text_buffer.set_text("Facebook Search Complete")
    with open('facebook_cookies', 'wb') as filehandler:
        pickle.dump(fb_driver.get_cookies(), filehandler)
    fb_driver.close()
I've considered that it might not work because of the while loop, but another piece of code doesn't have a loop and does exactly the same thing: it waits for Selenium to finish before adding content to the GUI.
Additionally, the user can select multiple websites to do this with, so the application can first go to Facebook (do its business, then close), go to LinkedIn (do its business, then close) and so forth. It still waits for all the Selenium code to finish before adding anything to the Gtk GUI.
I really hope that makes sense! Thank you :)
Your question contains the answer you are looking for. Have a read here: https://wiki.gnome.org/Projects/PyGObject/Threading
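In short: run the Selenium work in a background thread and hand every GUI update back to the GTK main loop with GLib.idle_add. A rough sketch, reusing the widget names from the question; the handler name is illustrative:

import threading
from gi.repository import GLib

def run_automation():
    # ... all the fb_driver / Selenium work from the question goes here ...
    # Never touch GTK widgets directly from this thread; hand the update
    # to the main loop instead, so the window stays responsive.
    GLib.idle_add(LogBox.log_text_buffer.set_text, "Facebook Search Complete")

def on_start_button_clicked(button):
    # Start the scraping in the background when the user presses the button.
    threading.Thread(target=run_automation, daemon=True).start()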