Glassdoor scraping using Selenium - python-3.x

I'm trying to scrape Glassdoor using the code given here
https://github.com/PlayingNumbers/ds_salary_proj/blob/master/glassdoor_scraper.py
When I execute the code there are no errors and the website opens, but then nothing happens. I think they have changed the tags on the website. I've tried updating the tags, but it's still not working.
Here's the code snippet:
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException, ElementClickInterceptedException
import time

def get_jobs(keyword, num_jobs, verbose, path, slp_time):
    options = webdriver.ChromeOptions()
    driver = webdriver.Chrome(executable_path=path, options=options)
    driver.set_window_size(1120, 1000)

    url = 'https://www.glassdoor.com/Job/jobs.htm?sc.keyword="' + keyword + '"&locT=C&locId=1147401&locKeyword=San%20Francisco,%20CA&jobType=all&fromAge=-1&minSalary=0&includeNoSalaryJobs=true&radius=100&cityId=-1&minRating=0.0&industryId=-1&sgocId=-1&seniorityType=all&companyId=-1&employerSizes=0&applicationType=0&remoteWorkType=0'
    driver.get(url)
    jobs = []

    while len(jobs) < num_jobs:
        time.sleep(slp_time)

        # Dismiss the "Sign Up" prompt if it gets in the way.
        try:
            driver.find_element_by_class_name("selected").click()
        except ElementClickInterceptedException:
            pass

        time.sleep(.1)

        try:
            driver.find_element_by_css_selector('[alt="Close"]').click()
            print(' x out worked')
        except NoSuchElementException:
            print('x out failed')
            pass
You can find the whole code in the link given above.
Any help would be greatly appreciated!

Can you check the URL generated by
url = 'https://www.glassdoor.com/Job/jobs.htm?sc.keyword="' + keyword + '"&locT=C&locId=1147401&locKeyword=San%20Francisco,%20CA&jobType=all&fromAge=-1&minSalary=0&includeNoSalaryJobs=true&radius=100&cityId=-1&minRating=0.0&industryId=-1&sgocId=-1&seniorityType=all&companyId=-1&employerSizes=0&applicationType=0&remoteWorkType=0'
and verify manually whether results are displayed on that page?
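For a quick manual check, here is a minimal sketch (the helper name build_search_url and the use of quote_plus are my additions, not part of the original script) that prints the generated URL so it can be pasted into a regular browser:

from urllib.parse import quote_plus

def build_search_url(keyword):
    # Mirrors the query string from the script above; quote_plus URL-encodes the
    # keyword, which the original plain string concatenation does not do.
    return ('https://www.glassdoor.com/Job/jobs.htm?sc.keyword="' + quote_plus(keyword) +
            '"&locT=C&locId=1147401&locKeyword=San%20Francisco,%20CA&jobType=all&fromAge=-1'
            '&minSalary=0&includeNoSalaryJobs=true&radius=100&cityId=-1&minRating=0.0'
            '&industryId=-1&sgocId=-1&seniorityType=all&companyId=-1&employerSizes=0'
            '&applicationType=0&remoteWorkType=0')

print(build_search_url('data scientist'))  # paste the printed URL into a browser and inspect the results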

Related

web scraping data from glassdoor using selenium

Please, I need some help running this code (https://github.com/PlayingNumbers/ds_salary_proj/blob/master/glassdoor_scraper.py) in order to scrape job offer data from Glassdoor.
Here's the code snippet:
from selenium.common.exceptions import NoSuchElementException, ElementClickInterceptedException
from selenium import webdriver
import time
import pandas as pd

options = webdriver.ChromeOptions()

#Uncomment the line below if you'd like to scrape without a new Chrome window every time.
#options.add_argument('headless')

#Change the path to where chromedriver is in your home folder.
driver = webdriver.Chrome(executable_path=path, options=options)
driver.set_window_size(1120, 1000)

url = "https://www.glassdoor.com/Job/jobs.htm?suggestCount=0&suggestChosen=false&clickSource=searchBtn&typedKeyword="+'data scientist'+"&sc.keyword="+'data scientist'+"&locT=&locId=&jobType="
#url = 'https://www.glassdoor.com/Job/jobs.htm?sc.keyword="' + keyword + '"&locT=C&locId=1147401&locKeyword=San%20Francisco,%20CA&jobType=all&fromAge=-1&minSalary=0&includeNoSalaryJobs=true&radius=100&cityId=-1&minRating=0.0&industryId=-1&sgocId=-1&seniorityType=all&companyId=-1&employerSizes=0&applicationType=0&remoteWorkType=0'
driver.get(url)

#Let the page load. Change this number based on your internet speed.
#Or, wait until the webpage is loaded, instead of hardcoding it.
time.sleep(5)

#Test for the "Sign Up" prompt and get rid of it.
try:
    driver.find_element_by_class_name("selected").click()
except NoSuchElementException:
    pass

time.sleep(.1)

try:
    driver.find_element_by_css_selector('[alt="Close"]').click()  #clicking to the X.
    print(' x out worked')
except NoSuchElementException:
    print(' x out failed')
    pass

#Going through each job in this page
job_buttons = driver.find_elements_by_class_name("jl")
I'm getting an empty list:
job_buttons
[]
Your problem is the wrong except argument.
With driver.find_element_by_class_name("selected").click() you are trying to click a non-existent element: there is no element matching the "selected" class name on that page. This raises a NoSuchElementException, while you are only catching ElementClickInterceptedException, as you can see for yourself.
To fix this you should use a correct locator, or at least catch the correct exception.
Like this:
try:
    driver.find_element_by_class_name("selected").click()
except NoSuchElementException:
    pass
Or even:
try:
    driver.find_element_by_class_name("selected").click()
except:
    pass
I'm not sure which elements you want to collect in job_buttons.
The search results containing the details for each job can be found with this:
job_buttons = driver.find_elements_by_css_selector("li.react-job-listing")
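Putting the corrected except clause and the li.react-job-listing locator together, a minimal end-to-end sketch might look like the following. The explicit WebDriverWait is my addition, and the selectors assume the page structure described above, which may have changed since:

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()  # assumes chromedriver is on PATH
url = ("https://www.glassdoor.com/Job/jobs.htm?suggestCount=0&suggestChosen=false"
       "&clickSource=searchBtn&typedKeyword=data scientist&sc.keyword=data scientist"
       "&locT=&locId=&jobType=")
driver.get(url)

# Dismiss the sign-up prompt if it appears; otherwise just carry on.
try:
    driver.find_element_by_css_selector('[alt="Close"]').click()
except NoSuchElementException:
    pass

# Wait for the job listings to render instead of sleeping for a fixed time.
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "li.react-job-listing")))
job_buttons = driver.find_elements_by_css_selector("li.react-job-listing")
print(len(job_buttons))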

Button click works inconsistently (hangs) on Chrome Webdriver while executing from Python Selenium

I am using Selenium to automate a few browser actions on a particular website and I am using the below set of tools to achieve it.
Python 3.8
Selenium
Chrome Web Driver 79.0
Chrome 79.0
The task is simply to fill in a form and then click its submit button, and this works most of the time, except that sometimes it just won't, no matter what. Filling in the form goes smoothly, but when the submit button is clicked Chrome just hangs there forever. There's no error on the console whatsoever, and googling this issue shows it is a common occurrence; almost all of the solutions out there are workarounds rather than actual fixes, and I have tried almost all of them to no avail. There's an issue on the Selenium GitHub page as well, which the maintainers weren't very interested in and closed. How do I even go about resolving this? At this point I am out of ideas. Any help would be appreciated. Thanks.
Below is the source code that I am trying to execute.
import time
from selenium import webdriver
import os
import csv
from datetime import datetime

url = 'https://www.nseindia.com/products/content/equities/equities/eq_security.htm'
xpath_get_data_button = '//*[@id="get"]'
xpath_download_link = '/html/body/div[2]/div[3]/div[2]/div[1]/div[3]/div/div[3]/div[1]/span[2]/a'
nse_list_file = 'nse_list.csv'
wait_time = 5
wait_time_long = 10
start_year = 2000
stop_year = 2019
curr_year = start_year

browser = webdriver.Chrome("E:/software/chromedriver_win32/chromedriver.exe")
browser.get(url)
time.sleep(wait_time)

with open(nse_list_file, 'r') as file:
    reader = csv.reader(file)
    for row in reader:
        scrip = row[0]
        year_registered = datetime.strptime(row[1], '%d-%m-%Y').year
        if year_registered > start_year:
            curr_year = year_registered
        else:
            curr_year = start_year
        try:
            browser.find_element_by_class_name('reporttitle1').clear()
            browser.find_element_by_class_name('reporttitle1').send_keys(scrip)
            browser.find_element_by_id('rdDateToDate').click()
            while curr_year <= stop_year:
                from_date = '01-01-' + str(curr_year)
                to_date = '31-12-' + str(curr_year)
                browser.find_element_by_id('fromDate').clear()
                browser.find_element_by_id('fromDate').send_keys(from_date)
                browser.find_element_by_id('toDate').clear()
                browser.find_element_by_id('toDate').send_keys(to_date)
                time.sleep(wait_time)
                browser.find_element_by_xpath(xpath_get_data_button).click()
                time.sleep(wait_time_long)
                download_link_element = browser.find_element_by_xpath(xpath_download_link).click()
                curr_year = curr_year + 1
        except Exception as ex:
            print('Could not find download link')
            print(str(ex))
        if os.path.isfile("stop_loading"):
            break

browser.quit()
print('DONE')
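No definitive fix is recorded in this thread, but one commonly suggested mitigation (my addition, not from the original post) is to cap how long the driver waits for a page load and to click the button only once it is reported clickable, so that a hang surfaces as a catchable timeout instead of blocking forever. A rough sketch, reusing the URL and element ID from the script above:

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Chrome("E:/software/chromedriver_win32/chromedriver.exe")
browser.set_page_load_timeout(30)  # abort navigations that never finish loading
browser.get('https://www.nseindia.com/products/content/equities/equities/eq_security.htm')

wait = WebDriverWait(browser, 10)
try:
    get_data_button = wait.until(
        EC.element_to_be_clickable((By.XPATH, '//*[@id="get"]')))
    get_data_button.click()
except TimeoutException:
    # Either the button never became clickable or the navigation it triggered timed out.
    print('Get Data click did not complete in time')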

Check a variable for NoneType and break a while loop

I am very new to programming and started teaching myself web scraping with Python.
I am scraping player data from multiple pages of a site, and I built a while loop that scrapes a 'next' button's href to get to the next player's page.
Everything works fine except for breaking the while loop after the last available player. The 'next' button grays out and has no link behind it, so at that point I want to stop the iteration and save everything to a CSV.
My script looks like this:
#name base url and first page to start
BaseUrl = #url
PageUrl = #also url

while True:
    #scraping tables
    try:
        # retrieve link for 'next' player in order
        link = soup.find(attrs={"class": "go_to_next_player"}).get('href')
        # join base url and new link href
        PageUrl = BaseUrl + link
        if link is None:
            break
    except IndexError as e:
        print(e)
        break

#writing to csv
I thought I could check whether the retrieved href is empty by testing 'is None' and breaking, but I get this error on the line PageUrl = BaseUrl + link:
TypeError: must be str, not NoneType
Help would be greatly appreciated! I am very new to this, so please disregard my beginner code.
You can check whether link is None before doing any operations with it, and break the loop if it is:
if link is not None:
    PageUrl = BaseUrl + link
else:
    break
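In the context of the question's loop (soup, BaseUrl and PageUrl as defined there), a minimal sketch with the check moved before the concatenation, which also guards against find() returning no tag at all, could look like this:

while True:
    # ...scrape the tables on the current page...
    tag = soup.find(attrs={"class": "go_to_next_player"})
    link = tag.get('href') if tag is not None else None
    if link is None:
        break  # greyed-out 'next' button: no href, so this was the last player
    PageUrl = BaseUrl + link  # safe: link is known to be a string here
# ...write everything to csv...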

Login Script not Working (Python, Selenium+PhantomJs)

Hey, so I'm attempting to sign in to Twitter via Selenium, but my script only fills in the username field and nothing else. I've tried finding the elements via XPath, but still nothing. No errors or exceptions are raised either, so I'm pretty clueless. Here's the code:
from selenium import webdriver
import time

def login():
    """login script"""
    service_args = [
        '--proxy=52.183.30.241:8888',
        '--proxy-type=http',
        '--ignore-ssl-errors=true'
    ]
    execpath = r'C:\Users\Ben\AppData\Roaming\npm\node_modules\phantomjs-prebuilt\lib\phantom\bin\phantomjs.exe'
    driver = webdriver.PhantomJS(service_args=service_args, executable_path=execpath)
    driver.set_window_size(1024, 768)
    driver.get('https://twitter.com/login')

    elem = driver.find_element_by_class_name('js-username-field')
    elem.send_keys('username')
    driver.implicitly_wait(5)

    elem = driver.find_element_by_class_name('js-password-field')
    elem.send_keys('password')

    elem = driver.find_element_by_css_selector('button.submit.btn.primary-btn').click()
    time.sleep(5)
    driver.save_screenshot('C:\\Users\\Ben\\Pictures\\screen.png')
Here's the screenshot it gives me.
Any help would be appreciated!
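No answer is recorded for this one, but a common first step (my suggestion, not from the original post) is to wait explicitly for the password field to become visible before typing, since the login form renders in stages and PhantomJS can be slow to settle. A rough sketch against the driver from the snippet above:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(driver, 10)

# Wait until the password field is actually visible before typing into it.
password = wait.until(
    EC.visibility_of_element_located((By.CLASS_NAME, 'js-password-field')))
password.send_keys('password')

# Likewise, only click the submit button once it is reported clickable.
wait.until(EC.element_to_be_clickable(
    (By.CSS_SELECTOR, 'button.submit.btn.primary-btn'))).click()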

Selenium web scraping in python cant read .text of elements

I am trying to scrape reviews from the Verizon website, and I found the XPath of the reviews by inspecting the webpage. I am executing the code below, but review.text doesn't seem to work reliably: sometimes I get the correct text, and sometimes it just prints the "Error in" message.
Not sure what I am doing wrong.
from selenium import webdriver

url = 'https://www.verizonwireless.com/smartphones/samsung-galaxy-s7/'
browser = webdriver.Chrome(executable_path='/Users/userName/PycharmProjects/Verizon/chromedriver')
browser.get(url)

reviews = []
xp = '//*[@id="BVRRContainer"]/div/div/div/div/div[3]/div/ul/li[2]/a/span[2]'

# read first ten pages of reviews ==>
for j in range(10):
    reviews.extend(browser.find_elements_by_xpath('//*[@id="BVRRContainer"]/div/div/div/div/ol/li[*]/div/div[1]'
                                                  '/div/div[2]/div/div/div[1]/p'))
    try:
        next = browser.find_element_by_xpath(xp)
        next.click()
    except:
        print(j, "error clicking")

# Print reviews ===>
for i, review in enumerate(reviews):
    try:
        print(review.text)
    except:
        print("Error in:", review)
You should improve the logic of your code. Note that you cannot get the text of elements found on the first page after navigating to the next page: you need to get the text before clicking the "Next" button.
Try the code below instead:
from selenium import webdriver
from selenium.common.exceptions import WebDriverException
import time

url = 'https://www.verizonwireless.com/smartphones/samsung-galaxy-s7/'
browser = webdriver.Chrome()
browser.get(url)

reviews = []
xp = '//a[span[@class="bv-content-btn-pages-next"]]'

# read first ten pages of reviews ==>
for i in range(10):
    for review in browser.find_elements_by_xpath('//div[@class="bv-content-summary-body-text"]/p'):
        reviews.append(review.text)
    try:
        next = browser.find_element_by_xpath(xp)
        next.location_once_scrolled_into_view
        time.sleep(0.5)  # To wait until scrolled down to "Next" button
        next.click()
        time.sleep(2)  # To wait for page "autoscrolling" to first review + until modal window disappeared
    except WebDriverException:
        print("error clicking")

for review in reviews:
    print(review)
