How to scroll down the Google Jobs page using Selenium in Python (python-3.x)

I am trying to scroll down the job posts using the lines below, but they only sometimes scroll all the way to the end; other times they don't.
from selenium.webdriver.common.keys import Keys
import time

html = driver.find_element_by_tag_name('html')
time.sleep(5)
html.send_keys(Keys.END)
Can anyone suggest how to scroll down to the end? Please find the link below.
https://www.google.com/search?q=upsc+jobs+in+india&rlz=1C1CHBF_enIN869IN869&oq=upsc+jo&aqs=chrome.1.69i57j0i433i512j0i131i433i512j0i512l3j0i131i433i512l2j0i512j0i433i512&sourceid=chrome&ie=UTF-8&ibp=htl;jobs&sa=X&sqi=2&ved=2ahUKEwjR27GN_qPzAhX4ppUCHb_0B_QQkd0GegQIORAB#fpstate=tldetail&sxsrf=AOaemvIxuJXh3if0tw7ezZfjkXRe5DSxsA:1632911697417&htivrt=jobs&htidocid=N1_BNfzt8n8auXjGAAAAAA%3D%3D

The more you scroll, the more data you get; it's a dynamically loaded web site. I have hardcoded 50 as a dummy number; you can use 100 or any other number for that matter.
You can use this sample code:
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome(driver_path)  # driver_path: location of your chromedriver
driver.maximize_window()
driver.implicitly_wait(30)
driver.get("https://www.google.com/search?q=upsc+jobs+in+india&rlz=1C1CHBF_enIN869IN869&oq=upsc+jo&aqs=chrome.1.69i57j0i433i512j0i131i433i512j0i512l3j0i131i433i512l2j0i512j0i433i512&sourceid=chrome&ie=UTF-8&ibp=htl;jobs&sa=X&sqi=2&ved=2ahUKEwjR27GN_qPzAhX4ppUCHb_0B_QQkd0GegQIORAB#fpstate=tldetail&sxsrf=AOaemvIxuJXh3if0tw7ezZfjkXRe5DSxsA:1632911697417&htivrt=jobs&htidocid=N1_BNfzt8n8auXjGAAAAAA%3D%3D")
for j in range(1, 51):
    # scroll each job heading into view, which triggers the lazy loading
    element = driver.find_element(By.XPATH, f"(//div[@role='heading'])[{j}]")
    driver.execute_script("arguments[0].scrollIntoView(true);", element)

You can also try this to scroll till the end.
driver.get("https://www.google.com/search?q=upsc+jobs+in+india&rlz=1C1CHBF_enIN869IN869&oq=upsc+jo&aqs=chrome.1.69i57j0i433i512j0i131i433i512j0i512l3j0i131i433i512l2j0i512j0i433i512&sourceid=chrome&ie=UTF-8&ibp=htl;jobs&sa=X&sqi=2&ved=2ahUKEwjR27GN_qPzAhX4ppUCHb_0B_QQkd0GegQIORAB#fpstate=tldetail&sxsrf=AOaemvIxuJXh3if0tw7ezZfjkXRe5DSxsA:1632911697417&htivrt=jobs&htidocid=N1_BNfzt8n8auXjGAAAAAA%3D%3D")
i = 0
try:
while True:
options = driver.find_elements_by_xpath("//div[#role='treeitem']")
driver.execute_script("arguments[0].scrollIntoView(true);",options[i])
i+=1
time.sleep(.5)
except:
pass
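A third option is a generic sketch, not tied to the Google Jobs markup: keep scrolling until the page height stops growing. Note that the Google Jobs list scrolls inside its own panel, so you may need to run the same script against that container element rather than the document body.
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # give the lazy-loaded results time to render
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        # the page stopped growing, so we have reached the end
        break
    last_height = new_height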

Related

How to long press (Press and Hold) mouse left key using only Selenium in Python

I am trying to scrape some review data from the Walmart site using Selenium in Python, but it redirects to a human-verification page. After inspecting the 'Press & Hold' button, somehow when I find the element it comes out as an [object HTMLIFrameElement], not as a web element. The element appears randomly inside any one of 10 iframes; this can be checked with a loop, but ultimately we can't take any action in Selenium without a web element.
Though this verification also occurs as a popup, I was trying to solve it for this page first. Somehow I located the position of this button using the div as a web element.
from selenium.webdriver import ActionChains

actions = ActionChains(driver)
iframe = driver.find_element_by_xpath("//div[@id='px-captcha']")
frame_x = iframe.location['x']
frame_y = iframe.location['y']
actions.move_to_element(iframe).move_by_offset(frame_x - 550, frame_y + 70).perform()
If I perform a context_click() (right click), it is visible that the mouse position is in the middle of the button.
Now, if I could press and hold the left mouse button for a while, I guess this verification could be cleared. For this I tried click() and click_and_hold(), and also the key_down methods (pressing Ctrl and Enter does the same as a long press), but got no response, since these methods release the buttons and can't hold them. I tried
actions.move_to_element(iframe).move_by_offset(frame_x - 550, frame_y + 70).click_and_hold().pause(20).perform()
actions.move_to_element(iframe).move_by_offset(frame_x - 550, frame_y + 70).key_down(Keys.CONTROL).key_down(Keys.ENTER).pause(20).perform()
.....and so many ways! How can I solve it using Selenium?
Here's my makeshift solution. The key is to release after 10 seconds and click again. This is how I was able to trick the captcha into thinking I held it for just the right amount of time (in my experiments the captcha's hold-down time is randomized, and 10 seconds is enough to fully complete it).
element = driver.find_element_by_css_selector('#px-captcha')
action = ActionChains(driver)
action.click_and_hold(element)
action.perform()
time.sleep(10)
action.release(element)
action.perform()
time.sleep(0.2)
# release once more in case the first release was swallowed
action.release(element)
action.perform()
Here is the full code to handle the Press & Hold captcha case. I have added code that uses the captcha box size so the click & hold automatically lands in the middle of the captcha that needs to be verified.
import os
import time
from random import randint
from selenium import webdriver
from selenium.webdriver import ActionChains

chromedriver = r"C:\Program Files\Python39\Scripts\chromedriver"
os.environ["webdriver.chrome.driver"] = chromedriver
driver = webdriver.Chrome(chromedriver)
url = "{Your URL}"
driver.get(url)
time.sleep(randint(2, 3))
element = driver.find_element_by_xpath("//div[@id='px-captcha']")
action = ActionChains(driver)
frame_x = element.location['x']
frame_y = element.location['y']
# print('x: ', frame_x)
# print('y: ', frame_y)
# print('size box: ', element.size)
# print('x max click: ', frame_x + element.size['width'])
# print('y max click: ', frame_y + element.size['height'])
# offsets for move_to_element_with_offset are relative to the element's
# top-left corner, so half the width/height targets the middle of the box
x_move = element.size['width'] * 0.5
y_move = element.size['height'] * 0.5
action.move_to_element_with_offset(element, x_move, y_move).click_and_hold().perform()
time.sleep(10)
action.release(element)
action.perform()
time.sleep(0.2)
# release once more in case the first release was swallowed
action.release(element)
action.perform()
'Press & Hold' button uses 10 iframes,
Random one iframe is visible, other 9 ifame is hidden,
The iframe has cross-domain and cannot get element with javascript.
'Press and hold' button complete the verification speed is also random.
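As an aside, although the iframe contents are cross-domain, the iframe elements themselves live in the parent DOM, so you can at least find whichever one is currently visible. A small sketch, assuming the '#px-captcha' container id from the question and `driver` as your WebDriver instance:
from selenium.webdriver.common.by import By

# the iframe elements are in the parent DOM; only their contents are off-limits
frames = driver.find_elements(By.CSS_SELECTOR, "#px-captcha iframe")
visible_frame = next((f for f in frames if f.is_displayed()), None)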
I used feature matching with FLANN:
Get a verified captcha image: https://i.stack.imgur.com/ADzMT.png
Use a Selenium screenshot to capture the captcha element image. Don't use an OS screenshot, because you need to compare like-for-like screenshots.
Check whether the page contains the captcha, then press and hold.
Loop the following operations at 0.5-second intervals: compare the pre-prepared captcha with the captcha on the page, until they match or a timeout is reached.
Matched captcha: https://i.stack.imgur.com/xCqhy.jpg
import os
import time
from io import BytesIO

import cv2
import numpy as np
from PIL import Image
from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

def solve_blocked(self, retry=3):
    '''
    Solve blocked
    (Cross-domain iframe cannot get elements temporarily)
    Simulate the mouse press and hold to complete the verification
    '''
    if not retry:
        return False
    element = None
    try:
        element = WebDriverWait(self.browser, 15).until(
            EC.presence_of_element_located((By.ID, 'px-captcha')))
        # Wait for the px-captcha element styles to fully load
        time.sleep(0.5)
    except BaseException:
        self.logger.info('px-captcha element not found')
        return
    self.logger.info(f'solve blocked: {self.browser.current_url}, {retry} retries remaining')
    template = cv2.imread(os.path.join(settings.TPL_DIR, 'captcha.png'), 0)
    # minimum number of good feature matches required
    MIN_MATCH_COUNT = 8
    if element:
        self.logger.info('start press and hold')
        ActionChains(self.browser).click_and_hold(element).perform()
        start_time = time.time()
        while True:
            # timeout
            if time.time() - start_time > 20:
                break
            x, y = element.location['x'], element.location['y']
            width, height = element.size.get('width'), element.size.get('height')
            left = x * self.pixelRatio
            top = y * self.pixelRatio
            right = (x + width) * self.pixelRatio
            bottom = (y + height) * self.pixelRatio
            # full-page screenshot
            png = self.browser.get_screenshot_as_png()
            im = Image.open(BytesIO(png))
            # crop out the px-captcha region
            im = im.crop((left, top, right, bottom))
            target = cv2.cvtColor(np.asarray(im), cv2.COLOR_RGB2BGR)
            # initiate SIFT detector
            sift = cv2.SIFT_create()
            # find the keypoints and descriptors with SIFT
            kp1, des1 = sift.detectAndCompute(template, None)
            kp2, des2 = sift.detectAndCompute(target, None)
            # set up the FLANN matcher
            FLANN_INDEX_KDTREE = 0
            index_params = dict(algorithm=FLANN_INDEX_KDTREE, trees=5)
            search_params = dict(checks=50)
            flann = cv2.FlannBasedMatcher(index_params, search_params)
            matches = flann.knnMatch(des1, des2, k=2)
            # keep the good matches per Lowe's ratio test
            # (discard matches whose distance ratio exceeds 0.7)
            good = [m for m, n in matches if m.distance < 0.7 * n.distance]
            self.logger.info("matches found - %d/%d" % (len(good), MIN_MATCH_COUNT))
            if len(good) >= MIN_MATCH_COUNT:
                self.logger.info('release button')
                ActionChains(self.browser).release(element).perform()
                return
            time.sleep(0.5)
    time.sleep(1)
    retry -= 1
    self.solve_blocked(retry)
One can also use undetected-chromedriver. It is quite simple to integrate; you can follow the git link below for setup:
https://github.com/ultrafunkamsterdam/undetected-chromedriver
import undetected_chromedriver as uc
driver = uc.Chrome()
driver.get('https://www.example.com')
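If you need browser options, undetected-chromedriver exposes a ChromeOptions wrapper that works like the regular one. A minimal sketch; the flag shown is only an example:
import undetected_chromedriver as uc

options = uc.ChromeOptions()
options.add_argument('--start-maximized')  # example flag, not required
driver = uc.Chrome(options=options)
driver.get('https://www.example.com')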

Python/Selenium - How to parse the URL and click next page?

I am trying to parse the hrefs and the titles of all articles from https://www.weforum.org/agenda/archive/covid-19, but I also want to pull the information from the following pages.
My code only pulls the current page; the click() to go to the next page is not working.
driver.get("https://www.weforum.org/agenda/archive/covid-19")
links =[]
titles = []
while True:
for elem in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, '.tout__link'))):
links.append(elem.get_attribute('href'))
titles.append(elem.text)
try:
WebDriverWait(driver,5).until(EC.presence_of_element_located((By.CSS_SELECTOR, ".pagination__nav-text"))).click()
WebDriverWait(driver,5).until(EC.staleness_of(elem))
except:
break
Can anyone help me with the issue? Thank you!
The class name 'pagination__nav-text' is not unique. As per the design, the code clicks the first element found, which is the "Prev" link, so you would not see it working.
Can you try this approach instead:
driver.get("https://www.weforum.org/agenda/archive/covid-19")
wait = WebDriverWait(driver,10)
links =[]
titles = []
while True:
for elem in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, '.tout__link'))):
links.append(elem.get_attribute('href'))
titles.append(elem.text)
try:
print('trying to click next')
WebDriverWait(driver,5).until(EC.presence_of_element_located((By.XPATH,"//div[#class='pagination__nav-text' and contains(text(),'Next')]"))).click()
WebDriverWait(driver,5).until(EC.staleness_of(elem))
except:
break
print(links)
print(titles)
driver.quit()

Accessing the next page using selenium

First, I had never used Selenium until yesterday. I was able to scrape the target table correctly after many attempts.
I am currently trying to scrape the tables on sequential pages. It works sometimes, and other times it fails immediately. I have spent hours searching Google and Stack Overflow, but I have not solved my problem. I am sure the answer is something simple, but after 8 hours I need to ask the Selenium experts.
My target url is: RedHat Security Advisories
If there is a question on Stack Overflow that answers my problem, please let me know and I will do more research and testing.
Here are some of the items that I have tried:
Example 1:
page_number = 0
while True:
    try:
        page_number += 1
        browser.execute_script(
            "return arguments[0].scrollIntoView(true);",
            WebDriverWait(browser, 30).until(EC.element_to_be_clickable((By.XPATH,
                f'//*[@id="jumpPoint"]/div[3]/div/div/div[2]/div/div[2]/dir-pagination-controls/ul/li[{page_number}]'))))
        browser.find_element_by_xpath(
            f'//*[@id="jumpPoint"]/div[3]/div/div/div[2]/div/div[2]/dir-pagination-controls/ul/li[{page_number}]').click()
        print(f"Navigating to page {page_number}")
        # I added this because my connection was
        # being terminated by RedHat
        time.sleep(20)
    except (TimeoutException, WebDriverException) as e:
        print("Last page reached")
        break
    except Exception as e:
        print(e)
        break
Example 2:
page_number = 0
while True:
    try:
        page_number += 1
        browser.execute_script(
            "return arguments[0].scrollIntoView(true);",
            WebDriverWait(browser, 30).until(EC.element_to_be_clickable((By.XPATH,
                '//*[@id="jumpPoint"]/div[3]/div/div/div[2]/div/div[2]/dir-pagination-controls/ul/li[12]'))))
        browser.find_element_by_xpath(
            '//*[@id="jumpPoint"]/div[3]/div/div/div[2]/div/div[2]/dir-pagination-controls/ul/li[12]').click()
        print(f"Navigating to page {page_number}")
        # I added this because my connection was
        # being terminated by RedHat
        time.sleep(20)
    except (TimeoutException, WebDriverException) as e:
        print("Last page reached")
        break
    except Exception as e:
        print(e)
        break
You can use the below logic.
lastPage = WebDriverWait(driver, 120).until(EC.element_to_be_clickable((By.XPATH,
    "(//ul[starts-with(@class,'pagination hidden-xs ng-scope')]/li[starts-with(@ng-repeat,'pageNumber')])[last()]")))
driver.find_element_by_css_selector("i.web-icon-plus").click()
pages = lastPage.text
pages = '5'  # hardcoded for testing; drop this line to use the real last-page number
for pNumber in range(1, int(pages)):
    currentPage = WebDriverWait(driver, 30).until(EC.element_to_be_clickable((By.XPATH,
        "//ul[starts-with(@class,'pagination hidden-xs ng-scope')]//a[.='" + str(pNumber) + "']")))
    print("===============================================")
    print("Current Page : " + currentPage.text)
    currentPage.location_once_scrolled_into_view  # scroll the page link into view
    currentPage.click()
    WebDriverWait(driver, 120).until_not(EC.element_to_be_clickable((By.CSS_SELECTOR, "#loading")))
    # getting the rows here
    rows = driver.find_elements_by_xpath("//table[starts-with(@class,'cve-table')]/tbody/tr")
    for row in rows:
        print(row.text)  # printing all row data; if you want cell data, update the logic accordingly
    time.sleep(randint(1, 5))  # this step is optional
I believe you can read the data directly via the URL instead of trying pagination; this will lead to fewer sync issues, which may be why the script is failing.
Use this XPath to get the total number of pages for the security-updates table:
//*[@id="jumpPoint"]/div[3]/div/div/div[2]/div/div[2]/dir-pagination-controls/ul/li[11]
Run a loop up to the page count from the step above. Inside the loop, pass the page number into the URL below and send a GET request:
https://access.redhat.com/security/security-updates/#/security-advisories?q=&p=page_number&sort=portal_publication_date%20desc&rows=10&portal_advisory_type=Security%20Advisory&documentKind=PortalProduct
Wait for the page to load, then read the data from the table populated on the page. This process will run for the full pagination count; a sketch of the loop follows below.
In case you find that the site has blocked you, you can refresh the page with the same page_number.
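Here is a minimal sketch of that direct-URL loop, assuming the query parameters above still work and reusing the cve-table XPath from the earlier answer; the page count is hardcoded here as a stand-in and should really be read from the pagination control first:
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

base = ("https://access.redhat.com/security/security-updates/#/security-advisories"
        "?q=&p={}&sort=portal_publication_date%20desc&rows=10"
        "&portal_advisory_type=Security%20Advisory&documentKind=PortalProduct")
driver = webdriver.Chrome()
total_pages = 5  # assumption: read this from the pagination control instead
for page_number in range(1, total_pages + 1):
    driver.get(base.format(page_number))
    # wait until the advisories table has been populated
    rows = WebDriverWait(driver, 30).until(EC.presence_of_all_elements_located(
        (By.XPATH, "//table[starts-with(@class,'cve-table')]/tbody/tr")))
    for row in rows:
        print(row.text)
    time.sleep(1)  # be gentle so the site does not terminate the connection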

How to automate the crawling without hardcoding any number to it?

I've written a script using Python with Selenium to scrape the names of restaurants from a webpage. It works great if I hardcode the amount I want to parse. The page lazy-loads, displaying 40 names with each scroll, and my script can handle that. The only thing I would like to improve is that I do not wish to hardcode the number; rather, I want the script to detect by itself how many there are and parse them all. Hope there is someone to help. Here is the code:
from selenium import webdriver
import time

driver = webdriver.Chrome()
driver.get('https://www.yellowpages.ca/search/si/1/pizza/Toronto')
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(3)
    links = [posts.text for posts in driver.find_elements_by_xpath("//div[@itemprop='itemListElement']//h3[@itemprop='name']/a")]
    if len(links) == 240:
        break
for link in links:
    print(link)
driver.quit()
You can check whether the number of links has changed since the last iteration:
num_of_links = -1
num = 0
while num != num_of_links:
    num_of_links = num
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(3)
    links = [posts.text for posts in driver.find_elements_by_xpath("//div[@itemprop='itemListElement']//h3[@itemprop='name']/a")]
    num = len(links)

Selenium - clicking a button

I am trying to pull out the names of all courses offered by Lynda.com together with the subject, so that an entry on my list appears as '2D Drawing -- Project Soane: Recover a Lost Monument with BIM with Paul F. Aubin'. So I am trying to write a script that goes to each subject on http://www.lynda.com/sitemap/categories and pulls out the list of courses. I have already managed to get Selenium to go from one subject to another and pull the courses. My only problem is the 'See X more courses' button that reveals the rest of the courses. Sometimes you have to click it a couple of times; that's why I used a while loop. But Selenium doesn't seem to execute this click. Does anyone know why?
This is my code:
from selenium import webdriver

url = 'http://www.lynda.com/sitemap/categories'
mydriver = webdriver.Chrome()
mydriver.get(url)
course_list = []
for a in [1, 2, 3]:
    for b in range(1, 73):
        mydriver.find_element_by_xpath('//*[@id="main-content"]/div[2]/div[3]/div[%d]/ul/li[%d]/a' % (a, b)).click()
        while True:
            # click the button 'See more results' as long as it's available
            try:
                mydriver.find_element_by_xpath('//*[@id="main-content"]/div[1]/div[3]/button').click()
            except:
                break
        subject = mydriver.find_element_by_tag_name('h1')  # pull out the subject
        courses = mydriver.find_elements_by_tag_name('h3')  # pull out the courses
        for course in courses:
            course_list.append(str(subject.text) + " -- " + str(course.text))
        # go back to the initial site
        mydriver.get(url)
Scroll to element before clicking:
see_more_results = browser.find_element_by_css_selector('button[class*=see-more-results]')
browser.execute_script('return arguments[0].scrollIntoView()', see_more_results)
see_more_results.click()
One solution for how to repeat these actions could be:
def get_number_of_courses():
    return len(browser.find_elements_by_css_selector('.course-list > li'))

number_of_courses = get_number_of_courses()
while True:
    try:
        button = browser.find_element_by_css_selector(CSS_SELECTOR)
        browser.execute_script('return arguments[0].scrollIntoView()', button)
        button.click()
        # wait until more courses have been loaded after the click
        while True:
            new_number_of_courses = get_number_of_courses()
            if new_number_of_courses > number_of_courses:
                number_of_courses = new_number_of_courses
                break
            time.sleep(0.5)  # avoid busy-waiting while the page loads
    except:
        break
Caveat: it's always better to use the built-in explicit waits than while True (see the sketch after the link):
http://www.seleniumhq.org/docs/04_webdriver_advanced.jsp#explicit-waits
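For example, the "wait until more courses have loaded" idea can be expressed as an explicit wait. A sketch reusing the '.course-list > li' selector from the answer above; WebDriverWait accepts any callable that takes the driver:
from selenium.webdriver.support.ui import WebDriverWait

def more_courses_loaded(previous_count):
    # custom wait condition: True once the course count has grown
    def _predicate(browser):
        return len(browser.find_elements_by_css_selector('.course-list > li')) > previous_count
    return _predicate

# replaces the inner while True loop: block up to 10 seconds until the
# course count grows, then continue
number_of_courses = get_number_of_courses()
WebDriverWait(browser, 10).until(more_courses_loaded(number_of_courses))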
The problem is that you're calling a method that finds an element by class name but passing it an XPath. If you're sure the XPath is correct, simply change the method to find_element_by_xpath.
A recommendation, if you allow: try to stay away from these long XPaths and go through some tutorials on how to write efficient XPath expressions.
