Web scraping data from Glassdoor using Selenium - python-3.x

I need some help running this code (https://github.com/PlayingNumbers/ds_salary_proj/blob/master/glassdoor_scraper.py) in order to scrape job offer data from Glassdoor.
Here's the code snippet:
from selenium.common.exceptions import NoSuchElementException, ElementClickInterceptedException
from selenium import webdriver
import time
import pandas as pd
options = webdriver.ChromeOptions()
#Uncomment the line below if you'd like to scrape without a new Chrome window every time.
#options.add_argument('headless')
#Change the path to where chromedriver is in your home folder.
driver = webdriver.Chrome(executable_path=path, options=options)
driver.set_window_size(1120, 1000)
url = "https://www.glassdoor.com/Job/jobs.htm?suggestCount=0&suggestChosen=false&clickSource=searchBtn&typedKeyword="+'data scientist'+"&sc.keyword="+'data scientist'+"&locT=&locId=&jobType="
#url = 'https://www.glassdoor.com/Job/jobs.htm?sc.keyword="' + keyword + '"&locT=C&locId=1147401&locKeyword=San%20Francisco,%20CA&jobType=all&fromAge=-1&minSalary=0&includeNoSalaryJobs=true&radius=100&cityId=-1&minRating=0.0&industryId=-1&sgocId=-1&seniorityType=all&companyId=-1&employerSizes=0&applicationType=0&remoteWorkType=0'
driver.get(url)
#Let the page load. Change this number based on your internet speed.
#Or, wait until the webpage is loaded, instead of hardcoding it.
time.sleep(5)
#Test for the "Sign Up" prompt and get rid of it.
try:
    driver.find_element_by_class_name("selected").click()
except ElementClickInterceptedException:
    pass
time.sleep(.1)
try:
    driver.find_element_by_css_selector('[alt="Close"]').click()  #clicking on the X.
    print(' x out worked')
except NoSuchElementException:
    print(' x out failed')
    pass
#Going through each job on this page
job_buttons = driver.find_elements_by_class_name("jl")
But I'm getting an empty list:
job_buttons
[]

Your problem is a wrong except argument.
With driver.find_element_by_class_name("selected").click() you are trying to click a non-existent element: there is no element matching the "selected" class name on that page. This raises a NoSuchElementException, as you can see for yourself, while you are trying to catch an ElementClickInterceptedException.
To fix this you should use the correct locator, or at least the correct argument in the except clause.
Like this:
try:
    driver.find_element_by_class_name("selected").click()
except NoSuchElementException:
    pass
Or even:
try:
    driver.find_element_by_class_name("selected").click()
except:
    pass
I'm not sure which elements you want to get into job_buttons.
The search results containing all the details for each job can be found with this:
job_buttons = driver.find_elements_by_css_selector("li.react-job-listing")
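Since the job cards are rendered asynchronously, an explicit wait is also more reliable than the hardcoded time.sleep(5). A minimal sketch of that idea (assuming the same li.react-job-listing selector as above):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for at least one job card to be present,
# then return all of them.
job_buttons = WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, "li.react-job-listing")))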

Related

How to correctly scroll a page to download each of the available zip files using Selenium in Python?

I'm trying to download all of the zip files from this page that don't end with the word 'CHECKSUM'. So far I have managed to write code that is supposed to do that, but it's not working as expected. Here it is:
import time
import numpy as np
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
opt = Options() #the variable that will store the selenium options
opt.add_experimental_option("debuggerAddress", "localhost:9222") #this allows bulk-dozer to take control of your Chrome Browser in DevTools mode.
s = Service(r'C:\Users\ResetStoreX\AppData\Local\Programs\Python\Python39\Scripts\chromedriver.exe') #Use the chrome driver located at the corresponding path
driver = webdriver.Chrome(service=s, options=opt) #execute the chromedriver.exe with the previous conditions
#Why using MarkPrices: https://support.btse.com/en/support/solutions/articles/43000557589-index-price-and-mark-price#:~:text=Index%20Price%20is%20an%20important,of%20cryptocurrencies%20on%20major%20exchanges.&text=Mark%20Price%20is%20the%20price,be%20fair%20and%20manipulation%20resistant.
time.sleep(2)
if driver.current_url == 'https://data.binance.vision/?prefix=data/futures/um/daily/markPriceKlines/ALICEUSDT/1h/':
    number = 2  #initialize an int variable to 2 because the desired web elements in this page start from 2
    counter = 0
    while number <= np.size(driver.find_elements(By.XPATH, '//*[@id="listing"]/tr')):  #iterate over the tbody array
        data_file_name = driver.find_element(By.XPATH, f'//*[@id="listing"]/tr[{number}]/td[1]/a').text
        if data_file_name.endswith('CHECKSUM') == False:
            current_data_file = driver.find_element(By.XPATH, f'//*[@id="listing"]/tr[{number}]/td[1]/a')
            element_position = current_data_file.location
            y_position = str(element_position.get('y'))
            driver.execute_script(f"window.scrollBy(0,{y_position})", "")  #scroll down the page to know what's being added
            current_data_file.click()
            print(f'saving {data_file_name}')
            time.sleep(0.5)
            counter += 1
        number += 1
    print(counter)
My problem occurs at the 20th element (ALICEUSDT-1h-2022-02-04.zip.CHECKSUM): the program stops and throws errors like the one below:
ElementClickInterceptedException: element click intercepted: Element
is not clickable at point (418, 1294)
Or this other one with a negative position:
ElementClickInterceptedException: element click intercepted: Element
is not clickable at point (418, -1221)
So, I would like to know how I could improve the code above to handle these errors. I know it has everything to do with the scrolling, but I ran out of ideas after trying y_position = str(element_position.get('y')+100), which kept producing the same errors.
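For what it's worth, a common workaround for this class of error (a sketch, not tested against this exact page) is to let the browser scroll the element into view itself instead of computing the y offset by hand, and to fall back to a JavaScript click if something still overlaps the link:
from selenium.common.exceptions import ElementClickInterceptedException

current_data_file = driver.find_element(By.XPATH, f'//*[@id="listing"]/tr[{number}]/td[1]/a')
# Let the browser center the element in the viewport; no manual y math needed.
driver.execute_script("arguments[0].scrollIntoView({block: 'center'});", current_data_file)
time.sleep(0.2)  # brief pause for the scroll to settle
try:
    current_data_file.click()
except ElementClickInterceptedException:
    # A JavaScript click ignores whatever element is overlapping the link.
    driver.execute_script("arguments[0].click();", current_data_file)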

Stuck in loop <> Code doesn't want to pull anything except row 1

I am stuck in a loop and I don't know what to change to make my code work normally.
The problem is with the CSV file: my file contains a list of domains (freedommortgage.com, google.com, amd.com, etc.). When I run the code, everything is fine at the start, but then it keeps sending me the same result over and over:
the monthly total visits to freedommortgage.com is 1.10M
So here is my code:
import csv
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import urllib.request
from captcha2upload import CaptchaUpload
import time

# setting the firefox driver
def init_driver():
    driver = webdriver.Firefox(executable_path=r'C:\Users\muki\Desktop\similarweb_scrapper-master\geckodriver.exe')
    driver.implicitly_wait(10)
    return driver

# solving the captcha (with 2captcha.com)
def captcha_solver(driver):
    captcha_src = driver.find_element_by_id('recaptcha_challenge_image').get_attribute("src")
    urllib.request.urlretrieve(captcha_src, "captcha.jpg")  # urllib.urlretrieve exists only in Python 2
    captcha = CaptchaUpload("4cfd308fd703d40291a7e250d743ca84")  # 2captcha API KEY
    captcha_answer = captcha.solve("captcha.jpg")
    wait = WebDriverWait(driver, 10)
    captcha_input_box = wait.until(
        EC.presence_of_element_located((By.ID, "recaptcha_response_field")))
    captcha_input_box.send_keys(captcha_answer)
    driver.implicitly_wait(10)
    captcha_input_box.submit()

# inputting the domain in the SimilarWeb search box and finding the necessary values
def lookup(driver, domain, short_method):
    # short method - inputting the domain in the url
    if short_method:
        driver.get("https://www.similarweb.com/website/" + domain)
    else:
        driver.get("https://www.similarweb.com")
    attempt = 0
    # trying 3 times before quitting (due to a second refresh by the website that clears the search box)
    while attempt < 1:
        try:
            captcha_body_page = driver.find_elements_by_class_name("block-page")
            driver.implicitly_wait(10)
            if captcha_body_page:
                print("Captcha ahead, solving the captcha, it may take a few seconds")
                captcha_solver(driver)
                print("Captcha solved! the program will continue shortly")
                time.sleep(20)  # to prevent a second refresh affecting the upcoming element finding after the captcha is solved
            # for the normal method, inputting the domain in the search box instead of the url
            if not short_method:
                input_element = driver.find_element_by_id("js-swSearch-input")
                input_element.click()
                input_element.send_keys(domain)
                input_element.submit()
            wait = WebDriverWait(driver, 10)
            time.sleep(10)
            total_visits = wait.until(
                EC.presence_of_element_located((By.XPATH, "//span[@class='engagementInfo-valueNumber js-countValue']")))
            total_visits_line = "the monthly total visits to %s is %s" % (domain, total_visits.text)
            time.sleep(10)
            print('\n' + total_visits_line)
        except TimeoutException:
            print("Box or Button or Element not found in similarweb while checking %s" % domain)
            attempt += 1
            print("attempt number %d... trying again" % attempt)

# main
if __name__ == "__main__":
    with open('bigdomains.csv', 'rt') as f:
        reader = csv.reader(f)
        driver = init_driver()
        for row in reader:
            domain = row[0]
            lookup(driver, domain, True)  # the user needs to pass True or False; True activates the
                                          # short method, False takes the normal method
(Sorry for the long block of code, but I have to present everything, even though the focus is on the LAST PART of the code.)
My question is simple:
Why does it keep taking the domain from row number 1, and ignoring row 2, row 3, row 4, etc.?
The time delay has to be 10 or more to avoid captcha issues on this website.
If anyone wants to try running this, you have to edit the name of the CSV file and have a few domains in it, in the format google.com (not www.google.com), of course.
Looks like you're always accessing the same index every time with:
domain = row[0]
Index 0 is the first item, hence why you keep getting the same value.
This post explains an alternative way to use a for loop in Python.
Accessing the index in 'for' loops?
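A possible reading of the symptom, consistent with the answer above: if all of the domains sit on a single line of the CSV, csv.reader yields one long row, so row[0] is always the first domain. A hedged sketch that visits every cell of every row instead (file name and lookup() taken from the question):
import csv

with open('bigdomains.csv', 'rt') as f:
    for row in csv.reader(f):
        for domain in row:        # iterate over the cells, not just row[0]
            domain = domain.strip()
            if domain:            # skip empty trailing cells
                lookup(driver, domain, True)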

python3: 'More' button clickable on the 1st page but NOT clickable on the 2nd page

This is an extended question on how to click the 'More' button on a webpage.
Below is my previous question, which one person kindly answered.
Since I'm not that familiar with the 'find element by class name' function, I just added that person's revised code to my existing code, so my revised code may not be efficient (my apologies).
Python click 'More' button is not working
The situation is that there are two types of 'More' buttons. The 1st one is in the property description part and the 2nd one is in the text reviews part. If you click just one 'More' button on any of the reviews, all the reviews will be expanded so that you can see the full review text.
The issue I run into is that I can click the 'More' button for the reviews on the 1st page, but not for the reviews on the 2nd page.
Below is the error message I get, but my code still runs (it doesn't stop when it hits an error).
Message: no such element: Unable to locate element: {"method":"tag name","selector":"span"}
Based on my understanding, there is an entry class and a corresponding span for every review. I don't understand why it says Python can't find it.
import time  # needed for the time.sleep calls below
from selenium import webdriver
from selenium.webdriver import ActionChains
from bs4 import BeautifulSoup

review_list=[]
review_appended_list=[]
review_list_v2=[]
review_appended_list_v2=[]
listed_reviews=[]
listed_reviews_v2=[]
listed_reviews_total=[]
listed_reviews_total_v2=[]
final_list=[]

#Incognito Mode
option = webdriver.ChromeOptions()
option.add_argument("--incognito")

#Open Chrome
driver = webdriver.Chrome(executable_path="C:/Users/chromedriver.exe", options=option)

#url I want to visit (I'm going to loop over multiple listings but for simplicity, I just added one listing url).
lists = ['https://www.tripadvisor.com/VacationRentalReview-g30196-d6386734-Hot_51st_St_Walk_to_Mueller_2BDR_Modern_sleeps_7-Austin_Texas.html']

for k in lists:
    driver.get(k)
    time.sleep(3)
    #click 'More' on description part.
    link = driver.find_element_by_link_text('More')
    try:
        ActionChains(driver).move_to_element(link)
        time.sleep(1)  # time to move to link
        link.click()
        time.sleep(1)  # time to update HTML
    except Exception as ex:
        print(ex)
    time.sleep(3)
    # first "More" shows text in all reviews - there is no need to search other "More"
    try:
        first_entry = driver.find_element_by_class_name('entry')
        more = first_entry.find_element_by_tag_name('span')
        #more = first_entry.find_element_by_link_text('More')
    except Exception as ex:
        print(ex)
    try:
        ActionChains(driver).move_to_element(more)
        time.sleep(1)  # time to move to link
        more.click()
        time.sleep(1)  # time to update HTML
    except Exception as ex:
        print(ex)
    #begin parsing html and scraping data.
    html = driver.page_source
    soup = BeautifulSoup(html, "html.parser")
    listing = soup.find_all("div", class_="review-container")
    all_reviews = driver.find_elements_by_class_name('wrap')
    for review in all_reviews:
        all_entries = review.find_elements_by_class_name('partial_entry')
        if all_entries:
            review_list = [all_entries[0].text]
            review_appended_list.extend([review_list])
    for i in range(len(listing)):
        review_id = listing[i]["data-reviewid"]
        listing_v1 = soup.find_all("div", class_="rating reviewItemInline")
        rating = listing_v1[i].span["class"][1]
        review_date = listing_v1[i].find("span", class_="ratingDate relativeDate")
        review_date_detail = review_date["title"]
        listed_reviews = [review_id, review_date_detail, rating[7:8]]
        listed_reviews.extend([k])
        listed_reviews_total.append(listed_reviews)
    for a, b in zip(listed_reviews_total, review_appended_list):
        final_list.append(a + b)
    #loop over from the 2nd page of the reviews for the same listing.
    for j in range(5, 20, 5):
        url_1 = '-'.join(k.split('-', 3)[:3])
        url_2 = '-'.join(k.split('-', 3)[3:4])
        middle = "-or%d-" % j
        final_k = url_1 + middle + url_2
        driver.get(final_k)
        time.sleep(3)
        link = driver.find_element_by_link_text('More')
        try:
            ActionChains(driver).move_to_element(link)
            time.sleep(1)  # time to move to link
            link.click()
            time.sleep(1)  # time to update HTML
        except Exception as ex:
            print(ex)
        # first "More" shows text in all reviews - there is no need to search other "More"
        try:
            first_entry = driver.find_element_by_class_name('entry')
            more = first_entry.find_element_by_tag_name('span')
        except Exception as ex:
            print(ex)
        try:
            ActionChains(driver).move_to_element(more)
            time.sleep(2)  # time to move to link
            more.click()
            time.sleep(2)  # time to update HTML
        except Exception as ex:
            print(ex)
        html = driver.page_source
        soup = BeautifulSoup(html, "html.parser")
        listing = soup.find_all("div", class_="review-container")
        all_reviews = driver.find_elements_by_class_name('wrap')
        for review in all_reviews:
            all_entries = review.find_elements_by_class_name('partial_entry')
            if all_entries:
                #print('--- review ---')
                #print(all_entries[0].text)
                #print('--- end ---')
                review_list_v2 = [all_entries[0].text]
                #print(review_list)
                review_appended_list_v2.extend([review_list_v2])
                #print(review_appended_list)
        for i in range(len(listing)):
            review_id = listing[i]["data-reviewid"]
            #print review_id
            listing_v1 = soup.find_all("div", class_="rating reviewItemInline")
            rating = listing_v1[i].span["class"][1]
            review_date = listing_v1[i].find("span", class_="ratingDate relativeDate")
            review_date_detail = review_date["title"]
            listed_reviews_v2 = [review_id, review_date_detail, rating[7:8]]
            listed_reviews_v2.extend([k])
            listed_reviews_total_v2.append(listed_reviews_v2)
        for a, b in zip(listed_reviews_total_v2, review_appended_list_v2):
            final_list.append(a + b)
        print(final_list)
        if len(listing) != 5:
            break
How can I enable clicking the 'More' button on the 2nd and subsequent pages, so that I can scrape the full text reviews?
Edited Below:
The error messages I get are these two lines:
Message: no such element: Unable to locate element: {"method":"tag name","selector":"span"}
Message: stale element reference: element is not attached to the page document
I guess my whole code still runs because I used try and except? Usually when Python runs into an error, it stops running.
Try it like this:
driver.execute_script("""
arguments[0].click()
""", link)

Selenium: do setUpClass(), tearDownClass() and the @classmethod decorator let all tests share a single browser instance?

I'm new to Selenium. Here are my two test files: the first one includes 2 test cases, and when run it opens only 1 Chrome session for both tests.
The second one includes 3 test cases, but it opens 1 Chrome session for each test.
From the book, since I use the @classmethod decorator for setUpClass() and tearDownClass() to set them at the class level, there should be only 1 browser session for all tests in a file. Please correct me if my understanding is wrong...
-> the first file (searchtests_with_class_methods.py)
import unittest
from selenium import webdriver

class SearchTest(unittest.TestCase):

    @classmethod
    def setUpClass(cls):
        # create a new Chrome session
        cls.driver = webdriver.Chrome()
        cls.driver.implicitly_wait(30)
        cls.driver.maximize_window()
        # navigate to the application home page
        cls.driver.get("http://demo-store.seleniumacademy.com/")
        # ?don't know why need this title here
        cls.driver.title

    def test_search_by_category(self):
        # get the search textbox
        self.search_field = self.driver.find_element_by_name("q")
        self.search_field.clear()
        # enter search keyword and submit
        self.search_field.send_keys("phones")
        self.search_field.submit()
        # get all the anchor elements which have product name displayed
        # currently on result page using find_elements_by_xpath method
        products = self.driver.find_elements_by_xpath("//h2[@class='product-name']/a")
        self.assertEqual(3, len(products))

    def test_search_by_name(self):
        # get the search textbox
        self.search_field = self.driver.find_element_by_name("q")
        self.search_field.clear()
        # enter search keyword and submit
        self.search_field.send_keys("salt shaker")
        self.search_field.submit()
        # get all the anchor elements which have product name displayed
        # currently on result page using find_elements_by_xpath method
        products = self.driver.find_elements_by_xpath("//h2[@class='product-name']/a")
        self.assertEqual(1, len(products))

    @classmethod
    def tearDownClass(cls):
        # close the browser window
        cls.driver.quit()

if __name__ == '__main__':
    unittest.main(verbosity=2)
-> the second file (homepagetests.py)
import unittest
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from builtins import classmethod

class HomePageTest(unittest.TestCase):

    @classmethod
    def setUp(cls):
        # create a new Chrome session
        cls.driver = webdriver.Chrome()
        cls.driver.implicitly_wait(30)
        cls.driver.maximize_window()
        # navigate to the application home page
        cls.driver.get("http://demo-store.seleniumacademy.com/")

    def test_search_field(self):
        # check search field exists on Home page
        self.assertTrue(self.is_element_present(By.NAME, "q"))

    def test_language_option(self):
        # check language options dropdown on Home page
        self.assertTrue(self.is_element_present(By.ID, "select-language"))

    def test_shopping_cart_empty_message(self):
        # check content of My Shopping Cart block on Home page
        shopping_cart_icon = self.driver.\
            find_element_by_css_selector("div.header-minicart span.icon")
        shopping_cart_icon.click()
        shopping_cart_status = self.driver.\
            find_element_by_css_selector("p.empty").text
        self.assertEqual("You have no items in your shopping cart.",
                         shopping_cart_status)
        close_button = self.driver.\
            find_element_by_css_selector("div.minicart-wrapper a.close")
        close_button.click()

    @classmethod
    def tearDown(cls):
        # close the browser window
        cls.driver.quit()

    def is_element_present(self, how, what):
        """
        Utility method to check presence of an element on page
        :params how: By locator type
        :params what: locator value
        """
        try:
            self.driver.find_element(by=how, value=what)
        except NoSuchElementException as e:
            return False
        return True

if __name__ == '__main__':
    unittest.main(verbosity=2)
I'm using Python 3.7.1, Selenium 3.141.0 and Chrome 72.0.3626.121 on macOS 10.13.6.
I'm confused by this behavior... could you help?
Today I found out what the problem is: there is a typo in the second file, where def tearDown(cls): should be def tearDownClass(cls): since I'm using the @classmethod decorator. What a stupid mistake... Finally all tests pass with only one browser session.
I didn't delete this question in case someone runs into the same issue in the future.
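For reference, a minimal sketch of the corrected class-level fixtures (note that the setUp hook presumably needs the same renaming to setUpClass, otherwise unittest still creates a fresh browser before every test):
import unittest
from selenium import webdriver

class HomePageTest(unittest.TestCase):

    @classmethod
    def setUpClass(cls):
        # runs once for the whole class: one shared browser session
        cls.driver = webdriver.Chrome()
        cls.driver.implicitly_wait(30)
        cls.driver.get("http://demo-store.seleniumacademy.com/")

    @classmethod
    def tearDownClass(cls):
        # runs once after the last test in the class
        cls.driver.quit()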

Selenium web scraping in Python can't read .text of elements

I am trying to scrape reviews from the Verizon website, and I found the XPath of the reviews by doing Inspect on the webpage. I am executing the code below, but review.text doesn't seem to work perfectly all the time. I get the correct text sometimes, and sometimes it just prints the error message.
Not sure what I am doing wrong...
from selenium import webdriver
url = 'https://www.verizonwireless.com/smartphones/samsung-galaxy-s7/'
browser = webdriver.Chrome(executable_path='/Users/userName/PycharmProjects/Verizon/chromedriver')
browser.get(url)
reviews = []
xp = '//*[@id="BVRRContainer"]/div/div/div/div/div[3]/div/ul/li[2]/a/span[2]'
# read first ten pages of reviews ==>
for j in range(10):
    reviews.extend(browser.find_elements_by_xpath('//*[@id="BVRRContainer"]/div/div/div/div/ol/li[*]/div/div[1]'
                                                  '/div/div[2]/div/div/div[1]/p'))
    try:
        next = browser.find_element_by_xpath(xp)
        next.click()
    except:
        print(j, "error clicking")
# Print reviews ===>
for i, review in enumerate(reviews):
    try:
        print(review.text)
    except:
        print("Error in:", review)
You should improve the logic of your code. Note that you cannot get the text of elements from the first page after redirecting to the next page - you need to get the text before clicking the "Next" button.
Try to use the code below instead:
from selenium import webdriver
from selenium.common.exceptions import WebDriverException
import time
url = 'https://www.verizonwireless.com/smartphones/samsung-galaxy-s7/'
browser = webdriver.Chrome()
browser.get(url)
reviews = []
xp = '//a[span[@class="bv-content-btn-pages-next"]]'
# read first ten pages of reviews ==>
for i in range(10):
    for review in browser.find_elements_by_xpath('//div[@class="bv-content-summary-body-text"]/p'):
        reviews.append(review.text)
    try:
        next = browser.find_element_by_xpath(xp)
        next.location_once_scrolled_into_view
        time.sleep(0.5)  # to wait until scrolled down to the "Next" button
        next.click()
        time.sleep(2)  # to wait for the page "autoscrolling" to the first review + until the modal window disappeared
    except WebDriverException:
        print("error clicking")
for review in reviews:
    print(review)
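As a possible refinement (not part of the original answer): the fixed time.sleep pauses could be replaced with explicit waits, which block only as long as they have to. A sketch assuming the same "Next" button XPath as above:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(browser, 10)
# Wait until the "Next" button is actually clickable instead of sleeping.
next_btn = wait.until(EC.element_to_be_clickable(
    (By.XPATH, '//a[span[@class="bv-content-btn-pages-next"]]')))
next_btn.click()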
