Broken string on Windows search screen - python-3.x

I am developing a Python application to send some images via WhatsApp, but when I try to attach the image the typed file path comes out broken. Does anyone know what is happening?
import time
import pyautogui
from bs4 import BeautifulSoup

# `driver` is an already-open Selenium session on WhatsApp Web
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
# open the attachment menu (clip icon), then the image picker
AbreAnexo = driver.find_element_by_css_selector('span[data-icon="clip"]')
AbreAnexo.click()
AbreImagem = driver.find_element_by_css_selector('button[class="Ijb1Q"]')
AbreImagem.click()
# type the file path into the Windows file dialog and confirm
pyautogui.typewrite("C:\\Users\\f_teicar\\Documents\\Lanchonete\\001-Cardapio.png", interval=0.02)
time.sleep(5)
pyautogui.press('enter')
time.sleep(5)
pyautogui.press('enter')
The expected output is C:\Users\f_teicar\Documents\Lanchonete\001-Cardapio.png,
but the actual output is it car\Documents\Lanchonete\001-Cardapio.png

Especially on external websites, it takes a while for the page to load. This means that the next step (or part of it) might be ignored because the page isn't ready to receive further operations from the Selenium client.
time.sleep(n), where n is the number of seconds to wait, is a quick way of waiting for the page to load, but if the page takes a bit longer than the time you specify, the next step will fail, and if it loads much faster, you waste time. So I use a function like this to wait for the page:
from contextlib import contextmanager
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support.expected_conditions import staleness_of

MAX_WAIT = 30  # assumed default timeout in seconds; adjust to taste

@contextmanager
def wait_for_page_load(timeout=MAX_WAIT):
    """Wait for a new page that isn't the old page."""
    # `driver` is the webdriver instance from the question code
    old_page = driver.find_element_by_tag_name('html')
    yield
    WebDriverWait(driver, timeout).until(staleness_of(old_page))
To call the function, use something like
with wait_for_page_load():
    AbreImagem.click()
where the second line is anything that causes a new page to load. Note that this procedure depends on the presence of the <html> tag in the old page, which is usually pretty reliable.

Related

Sometimes Chrome Web Driver is not loading link using Selenium Python

Hope you all are fine. I'm opening a link using this code:
from selenium import webdriver

def load_url(url):
    driver = webdriver.Chrome()
    driver.get(url)

def scrap_matches(urls):
    for url in urls:
        load_url(url)
Use Case:
I'm using a list of URLs and passing them one by one to the load_url function.
I'm using multithreading, i.e. there are 4 threads and each thread works on its own list of URLs, like this:
import threading

# part_1 and part_2 are the lists.
t1 = threading.Thread(target=scrap_matches, args=(part_1,))
t2 = threading.Thread(target=scrap_matches, args=(part_2,))
t1.start()
t2.start()
t1.join()
t2.join()
Sometimes an empty Chrome tab opens without loading any content of the page. The link is present in Chrome's address bar. When I refresh that tab manually, the website loads successfully.
I need to avoid this situation. I have searched about it but could not find any satisfactory answer.
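One possible way to guard against this, sketched below on the assumption that a single refresh is enough to recover the page (the 15-second timeout is just an example), is to check document.readyState after driver.get() and refresh the tab if the page never finishes loading:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException

def load_url(url, timeout=15):
    driver = webdriver.Chrome()
    driver.get(url)
    try:
        # wait until the browser reports the document as fully loaded
        WebDriverWait(driver, timeout).until(
            lambda d: d.execute_script("return document.readyState") == "complete")
    except TimeoutException:
        # assumption: one refresh recovers the blank tab, as it does manually
        driver.refresh()
    return driver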

Selenium: Edge webdriver not waiting for page to load before executing next step (python)

I am writing some tests using selenium with Python. So far my suite works perfectly with Chrome and Firefox. However, the same code is not working when I try with the Edge (EdgeHTML). I am using the latest version at the time of writing which is release 17134, version: 6.17134. My tests are running on Windows 10.
The problem is that Edge is not waiting for the page to load. As part of every test, a login is first performed. The credentials are entered and the form submitted. Firefox and Chrome will then wait for the page we are redirected to, to load. However, with Edge, the next code is executed as soon as the login submit button is clicked, which of course results in a failed test.
Is this a bug with Edge? It seems a bit too fundamental to be the case. Does the browser need to be configured in a certain manner? I cannot see anything in the documentation.
This is the code that is run, with the last statement resulting in a redirect once we have logged in:
self.driver.find_element_by_id("login-email").send_keys(username)
self.driver.find_element_by_id("login-password").send_keys(password)
self.driver.find_element_by_id("login-messenger").click()
Edge decides it does not need to wait and will then execute the next code which is to navigate to a protected page. The code is:
send_page = SendPage(driver)
send_page.load_page()
More concisely:
self.driver.find_element_by_id("login-messenger").click()
# should wait now for the login redirect before executing the line below but it does not!
self.driver.get(BasePage.base_url + self.uri)
I can probably perform a workaround by waiting for an element on the expected page to be present, thus making Edge wait. This does not feel like the right thing to do. I certainly don't want to have to keep making invasive changes just for Edge.
Any advice please on what I should do?
Is this a bug with Edge? It seems a bit too fundamental to be the case. Does the browser need to be configured in a certain manner? I cannot see anything in the documentation.
No, I don't think it is a bug in the Edge browser. Because of differences in browser performance, Edge may simply take more time to load the page.
Generally, we can use time.sleep(secs), WebDriverWait(), or implicitly_wait() to wait for the page to load.
The code block below shows how to wait for a page load to complete. It uses a timeout and waits for a specific element to appear on the page (you need an element id).
If the page loads in time, it prints "Page loaded"; if the timeout period (in seconds) passes first, it prints the timeout error.
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get('https://pythonbasics.org')
timeout = 3
try:
    element_present = EC.presence_of_element_located((By.ID, 'main'))
    WebDriverWait(driver, timeout).until(element_present)
except TimeoutException:
    print("Timed out waiting for page to load")
else:
    print("Page loaded")
For more detailed information, please check the following articles.
Wait until page is loaded with Selenium WebDriver for Python
How to wait for elements in Python Selenium WebDriver

Selenium still using previous state of page even after clicking a button on a page. How to update to the state of the browser/HTML code?

I am using python to scrape some data from a website in combination with selenium and Beautiful Soup. This page has buttons you can click which change the data displayed in the tables, but this is all handled by the javascript in the page. The page url does not change.
Selenium successfully renders the JavaScript on the page on load, but it continues using the previous state (before the clicks), therefore scraping the same data instead of the new data.
I tried following the solutions given on Obey The Testing Goat, but they always seemed to time out and the old state never turned stale. I've tried waiting 10 seconds manually with time.sleep to give the state a chance to refresh. I've tried using WebDriverWait to wait until the old page turned stale. I've tried looking through the Selenium documentation for possible solutions. The code presented below attempts to use the solution presented on that website, but it simply times out no matter the timeout value.
from contextlib import contextmanager
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support.expected_conditions import staleness_of

class MySeleniumTest():
    # assumes self.browser is a selenium webdriver
    def __init__(self, browser, soup):
        self.browser = browser
        self.soup = soup

    @contextmanager
    def wait_for_page_load(self, timeout=30):
        old_page = self.browser.find_element_by_tag_name('html')
        yield
        WebDriverWait(self.browser, timeout).until(staleness_of(old_page))

    def tryChangingState(self):
        with self.wait_for_page_load(timeout=20):
            og_state = self.soup
            tab = self.browser.find_element_by_link_text('Breakfast')
            tab.click()
            tab = self.browser.find_element_by_link_text('Lunch')
            tab.click()
            new_state = self.soup
            # check if the HTML code has changed
            print(og_state != new_state)

# create tester object
tester = MySeleniumTest(browser, soup)
# try changing state by clicking on buttons
tester.tryChangingState()
I'm not sure if I'm using it in the correct way or not. I also tried creating a new with self.wait_for_page_load(timeout=20): after the first click and putting the rest of the code within that, but this also did not work. I would expect og_state != new_state to result in True, implying the HTML changed, but the actual result is False.
Original poster here. I found the reason for the issue. The state was being updated in Selenium, but since I was using Beautiful Soup for parsing, the Beautiful Soup object was still using the source code from the previous Selenium web driver state. By updating the soup object each time the page was clicked, the scraper was able to successfully gather the new data.
I updated the soup object by simply calling soup = BeautifulSoup(browser.page_source, 'lxml')
In other words, I didn't need to worry about the state of the selenium web driver, it was simply an issue of updating the source code the parser was reading.
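Applied to the example above, the fix is just to re-parse the page source after each click, roughly like this (a sketch; 'lxml' needs the lxml package installed):
from bs4 import BeautifulSoup

tab = browser.find_element_by_link_text('Breakfast')
tab.click()
# rebuild the soup from the current DOM so it reflects the post-click state
soup = BeautifulSoup(browser.page_source, 'lxml')

tab = browser.find_element_by_link_text('Lunch')
tab.click()
soup = BeautifulSoup(browser.page_source, 'lxml')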

Selenium Webdriver saving corrupt jpeg

Below is a script that opens a URL, saves the image as a JPEG file, and also saves some HTML attribute (i.e. the Accession Number) as the file name. The script runs but saves corrupted images; they are 210 bytes with no preview. When I try to open them, the error message suggests the file is damaged.
The reason I am saving the images instead of doing a direct request is to get around the site's security measures; it doesn't seem to allow web scraping. My colleague who tested the script below on Windows got a robot check request (just once at the beginning of the loop) before the images successfully downloaded. I do not get this check from the site, so I believe my script is actually pulling the robot check instead of the webpage, as it hasn't allowed me to manually bypass the check. I'd appreciate help addressing this issue, perhaps by forcing the robot check when the script opens the first URL.
Dependencies
I am using Python 3.6 on MacOS. If anyone testing this for me is also using Mac and is using Selenium for the first time, please note that a file called "Install Certificates.command" first needs to be executed before you can access anything. Otherwise, it will throw a "Certificate_Verify_Failed" error. Easy to search in Finder.
Download for Selenium ChromeDriver utilized below: https://chromedriver.storage.googleapis.com/index.html?path=2.41/
import time
import urllib.request

import requests
from bs4 import BeautifulSoup
from selenium import webdriver

urls = ['https://www.metmuseum.org/art/collection/search/483452',
        'https://www.metmuseum.org/art/collection/search/460833',
        'https://www.metmuseum.org/art/collection/search/551844']

# Set up Selenium Webdriver
options = webdriver.ChromeOptions()
# options.add_argument('--ignore-certificate-errors')
options.add_argument("--test-type")
driver = webdriver.Chrome(executable_path="/Users/user/Desktop/chromedriver", chrome_options=options)

for link in urls:
    # Load page and pull HTML file
    driver.get(link)
    time.sleep(2)
    soup = BeautifulSoup(driver.page_source, 'lxml')
    # Find details (e.g. Accession Number)
    details = soup.find_all('dl', attrs={'class': 'artwork__tombstone--row'})
    for d in details:
        if 'Accession Number' in d.find('dt').text:
            acc_no = d.find('dd').text
    pic_link = soup.find('img', attrs={'id': 'artwork__image', 'class': 'artwork__image'})['src']
    urllib.request.urlretrieve(pic_link, '/Users/user/Desktop/images/{}.jpg'.format(acc_no))
    time.sleep(2)
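One thing that may be worth trying, sketched below and not verified against this site, is to download the image with requests while reusing the cookies from the Selenium session, so the image request is less likely to be served the robot check that a cookie-less urllib.request.urlretrieve call can hit:
# inside the loop, after pic_link has been found
session = requests.Session()
for cookie in driver.get_cookies():
    session.cookies.set(cookie['name'], cookie['value'])
resp = session.get(pic_link)
if resp.status_code == 200:
    with open('/Users/user/Desktop/images/{}.jpg'.format(acc_no), 'wb') as f:
        f.write(resp.content)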

My script produces javascript stuff instead of valid links

Running my script, I get items like "javascript:getDetail(19978)" as the href. The number in parentheses, if concatenated with "https://www.aopa.org/airports/4M3/business/", produces a valid-looking link. However, clicking on these newly created links takes me to a different page, not the one I reach by clicking the link on the original page. How can I get the original links instead of "javascript:getDetail(19978)"? The search should be performed by typing "All" in the search box.
The code I've tried with:
from selenium import webdriver
import time

link = "https://www.aopa.org/airports/4M3/business/"
driver = webdriver.Chrome()
driver.get("https://www.aopa.org/learntofly/school/")
driver.find_element_by_id('searchTerm').send_keys('All')
driver.find_element_by_id('btnSearch').click()
time.sleep(5)
for pro in driver.find_elements_by_xpath('//td/a'):
    print(pro.get_attribute("href"))
driver.quit()
Code to create new links with the base url I pasted in my description:
from selenium import webdriver
import time

link = "https://www.aopa.org/airports/4M3/business/"
driver = webdriver.Chrome()
driver.get("https://www.aopa.org/learntofly/school/")
driver.find_element_by_id('searchTerm').send_keys('All')
driver.find_element_by_id('btnSearch').click()
time.sleep(5)
for item in driver.find_elements_by_xpath('//td/a'):
    fresh = item.get_attribute("href").replace("javascript:getDetail(", "")
    print(link + fresh.replace(")", ""))
driver.quit()
However, these newly created links lead me to different destinations.
FYI, the original links are embedded within elements like the one below:
<td>GOLD DUST FLYING SERVICE, INC.</td>
Clicking a link makes an XHR request. The page actually stays the same, but data received as JSON is rendered in place of the previous content.
If you want to open the raw data inside an HTML page, you might try something like
# To get list of entries as ["19978", "30360", ... ]
entries = [a.get_attribute('href').split("(")[1].split(")")[0] for a in driver.find_elements_by_xpath('//td/a')]
url = "https://www.aopa.org/learntofly/school/wsSearch.cfm?method=schoolDetail&businessId="
for entry in entries:
    driver.get(url + entry)
    print(driver.page_source)
You might also use requests to get each JSON response, without rendering the data in the browser:
import requests
for entry in entries:
    print(requests.get(url + entry).json())
If you look at how getDetail() is implemented in the source code and explore the "network" tab when you click each of the search result links, you may see that there are multiple XHR requests issued for a result and there is some client-side logic executed to form a search result page.
If you don't want to dive into replicating all the logic happening to form each of the search result pages - simply go back and forth between the search results page and a single search result page:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
wait = WebDriverWait(driver, 10)
driver.get("https://www.aopa.org/learntofly/school/")
driver.find_element_by_id('searchTerm').send_keys('All')
driver.find_element_by_id('btnSearch').click()

# wait for search results to be visible
table = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "#searchResults table")))

for index in range(len(table.find_elements_by_css_selector('td a[href*=getDetail]'))):
    wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "#searchResults table")))
    # get the next link
    link = table.find_elements_by_css_selector('td a[href*=getDetail]')[index]
    link_text = link.text
    link.click()
    print(link_text)
    # TODO: get details
    # go back
    back_link = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "#schoolDetail a[href*=backToList]")))
    driver.execute_script("arguments[0].click();", back_link)

driver.quit()
Note the use of Explicit Waits instead of hardcoded "sleeps".
It may actually make sense to avoid using selenium here altogether and approach the problem "headlessly" - doing HTTP requests (via requests module) to the website's API.

Resources