Web Scraping with Selenium, No results with XPath - python-3.x

I'm trying to get the data from https://openaq.org/#/location/Algiers?_k=nv8w8w, but it always returns a null value.
def getCardDetails(country, url):
    local_df = pd.DataFrame(columns=['country', 'card_url', 'general', 'country_link', 'city', 'PM2.5', 'date', 'hour'])
    pm = None
    date = None
    hour = None
    general = None
    city = None
    country_link = None
    try:
        # wait = WebDriverWait(driver, 3)
        # wait.until(EC.presence_of_element_located((By.ID, 'location-fold-stats')))
        time.sleep(2)
        # Using XPath we are getting the full text of the sibling that comes
        # after the text containing "PM2.5". We will split the full text to
        # generate variables for our Data Frame such as "pm", "date" & "hour".
        try:
            print("inn")
            pm_date = driver.find_element(By.XPATH, '//dt[text() = "PM2.5"]/following-sibling::dd[1]').text
            # Scraping pollution details from each location page
            # and splitting them to save in the relevant variables
            text = pm_date.split('µg/m³ at ')
            print("nn", pm_date)
            pm = float(text[0])
            full_date = text[1].split(' ')
            date = full_date[0]
            hour = full_date[1]
This is my first time using Selenium for web scraping. I'd like to know how XPath works and what the issue is here.

Your XPath is correct. To get the value from a dynamic element, you need to induce WebDriverWait() and wait for visibility_of_element_located():
print(WebDriverWait(driver,10).until(EC.visibility_of_element_located((By.XPATH,'//dt[text() = "PM2.5"]/following-sibling::dd[1]'))).text)
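For completeness, the one-liner above assumes the usual support imports are already in place. A minimal sketch with them spelled out (same locator, nothing else changed):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the dd element following the "PM2.5" dt to become visible
pm_date = WebDriverWait(driver, 10).until(
    EC.visibility_of_element_located(
        (By.XPATH, '//dt[text() = "PM2.5"]/following-sibling::dd[1]')
    )
).text
print(pm_date)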

Related

Switching to a pop-up window that cannot be found via window_handles (Selenium/Python)

I dug up my old code, which I used for scraping Scopus. It was created while I was learning programming. Now a window pops up on the Scopus site that I can't detect using window_handles.
(Screenshot: Scopus pop-up window)
import openpyxl
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.select import Select
import time
import pandas as pd
from openpyxl import load_workbook

DOI = []
TITLE = []
NUM_AUTHORS = []
NUM_AFFILIATIONS = []
AUTHORS = []
YEAR = []
JOURNAL = []
DOCUMENT_TYPE = []
COUNTRIES = []
SEARCH = []
DATES = []

chrome_driver_path = "C:\Development\chromedriver.exe"
driver = webdriver.Chrome(executable_path=chrome_driver_path)
driver.get("https://www.scopus.com/search/form.uri?display=basic#basic")

# searching details
search = input("Search documents: \n")
SEARCH.append(search)
date = input("Do you want to specify dates?(Yes/No)")
if date.capitalize() == "Yes":
    driver.find_element(By.CLASS_NAME, 'flex-grow-1').send_keys(search)
    driver.find_element(By.XPATH,
                        "/html/body/div/div/div[1]/div[2]/div/div[3]/div/div[2]/div["
                        "2]/micro-ui/scopus-homepage/div/div/els-tab/els-tab-panel[1]/div/form/div[2]/div[1]/button["
                        "2]/span[2]").click()
    time.sleep(1)
    starting_date = input("Put starting year.")
    to_date = input("Put end date.")
    DATES.append(starting_date)
    DATES.append(to_date)
    drop_menu_from = Select(driver.find_element(By.XPATH,
                                                "/html/body/div/div/div[1]/div[2]/div/div[3]/div/div[2]/div["
                                                "2]/micro-ui/scopus-homepage/div/div/els-tab/els-tab-panel["
                                                "1]/div/form/div[2]/div[1]/els-select/div/label/select"))
    drop_menu_from.select_by_visible_text(starting_date)
    drop_menu_to = Select(driver.find_element(By.XPATH,
                                              "/html/body/div/div/div[1]/div[2]/div/div[3]/div/div[2]/div["
                                              "2]/micro-ui/scopus-homepage/div/div/els-tab/els-tab-panel["
                                              "1]/div/form/div[2]/div[2]/els-select/div/label/select"))
    drop_menu_to.select_by_visible_text(to_date)
    driver.find_element(By.XPATH,
                        '/html/body/div/div/div[1]/div[2]/div/div[3]/div/div[2]/div['
                        '2]/micro-ui/scopus-homepage/div/div/els-tab/els-tab-panel[1]/div/form/div[4]/div['
                        '2]/button/span[1]').click()
else:
    DATES = ["XXX", "YYY"]
    driver.find_element(By.CLASS_NAME, 'flex-grow-1').send_keys(search)
    driver.find_element(By.XPATH,
                        "/html/body/div/div/div[1]/div[2]/div/div[3]/div/div[2]/div["
                        "2]/micro-ui/scopus-homepage/div/div/els-tab/els-tab-panel[1]/div/form/div[2]/div["
                        "2]/button").click()

time.sleep(2)
doc_num = int(driver.find_element(By.XPATH,
                                  "/html/body/div[1]/div/div[1]/div/div/div[3]/form/div[1]/div/header/h1/span[1]").text.replace(
    ",", ""))
time.sleep(5)
driver.find_element((By.XPATH, "/html/body/div[11]/div[2]/div[1]/div[4]/div/div[2]/button")).click()
This is what the beginning of the code looks like. The last line,
driver.find_element((By.XPATH, "/html/body/div[11]/div[2]/div[1]/div[4]/div/div[2]/button")).click()
should find and click the dismiss button, but I do not know how to handle it.
I have tried finding the element with driver.find_element and checking whether the pop-up window can be detected and handled via window_handles.
Actually, that is not a popup, because its code is contained in the HTML of the page itself. Popups are either browser prompts (not contained in the HTML) or other browser windows (which have their own HTML).
I suggest targeting the button by the text it contains; in this case we look for a button whose text is exactly "Dismiss":
driver.find_element(By.XPATH, '//button[text()="Dismiss"]').click()
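If the banner is injected a moment after the page loads, a plain find_element call can still race it. Wrapping the same locator in an explicit wait is usually enough (a sketch, assuming the button text is exactly "Dismiss" and that the standard support modules are imported):

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the Dismiss button to become clickable, then click it
WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.XPATH, '//button[text()="Dismiss"]'))
).click()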

How to extract multiple text elements using Selenium in Python

I am scrolling the Google Jobs page to extract multiple company names, but I am getting only 2 records.
Can anyone suggest how to tweak the code below to get all of the company names that appear next to the word 'via', as shown in the image below?
driver.get("https://www.google.com/search?q=bank+jobs+in+india&rlz=1C1CHBF_enIN869IN869&oq=upsc+jo&aqs=chrome.1.69i57j0i433i512j0i131i433i512j0i512l3j0i131i433i512l2j0i512j0i433i512&sourceid=chrome&ie=UTF-8&ibp=htl;jobs&sa=X&sqi=2&ved=2ahUKEwjR27GN_qPzAhX4ppUCHb_0B_QQkd0GegQIORAB#fpstate=tldetail&sxsrf=AOaemvIxuJXh3if0tw7ezZfjkXRe5DSxsA:1632911697417&htivrt=jobs&htidocid=hr3yUBTZAssve05hAAAAAA%3D%3D")
name = []
cnt = 0
try:
while True:
element = driver.find_elements_by_xpath("//div[#role='treeitem']")
driver.execute_script("arguments[0].scrollIntoView(true);", element[cnt])
time.sleep(2)
try:
nam = driver.find_element_by_xpath("//div[contains(#class, 'oNwCmf')]").text
nam1 = nam.split("\nvia ")[1]
name.append(nam1.split("\n")[0])
except:
name.append("")
cnt=cnt+1
except:
pass
Try it like this:
Get the name nam using the WebElement element[cnt] (instead of searching with driver). Since we are now finding an element within another element, add a dot at the start of the XPath. This will get the name of that particular element.
try:
    while True:
        element = driver.find_elements_by_xpath("//div[@role='treeitem']")
        driver.execute_script("arguments[0].scrollIntoView(true);", element[cnt])
        time.sleep(2)
        try:
            # finds the name within that particular element[cnt]; the leading dot makes the XPath relative
            nam = element[cnt].find_element_by_xpath(".//div[contains(@class, 'oNwCmf')]").text
            print(nam)
            nam1 = nam.split("\nvia ")[1]
            name.append(nam1.split("\n")[0])
        except:
            name.append("")
        cnt = cnt + 1
except:
    pass
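For reference, a slightly tidier variant of the same loop (an untested sketch; the 'oNwCmf' class and the treeitem role are taken from the code above and may change on Google's side) that stops explicitly instead of relying on the outer bare except:

import time
from selenium.common.exceptions import NoSuchElementException

name = []
cnt = 0
while True:
    cards = driver.find_elements_by_xpath("//div[@role='treeitem']")
    if cnt >= len(cards):   # nothing new was loaded after the last scroll, so stop
        break
    driver.execute_script("arguments[0].scrollIntoView(true);", cards[cnt])
    time.sleep(2)           # give the lazy-loaded results time to render
    try:
        # relative XPath ('.'): search only inside the current card
        text = cards[cnt].find_element_by_xpath(".//div[contains(@class, 'oNwCmf')]").text
        name.append(text.split("\nvia ")[1].split("\n")[0])
    except (NoSuchElementException, IndexError):
        name.append("")     # keep the list length aligned with the cards
    cnt += 1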

Web scraping with Python fails to load the URL on button.click()

The CSV file contains the names of the countries used. However, after Argentina it fails to retrieve the URL and returns an empty string.
country,country_url
Afghanistan,https://openaq.org/#/locations?parameters=pm25&countries=AF&_k=tomib2
Algeria,https://openaq.org/#/locations?parameters=pm25&countries=DZ&_k=dcc8ra
Andorra,https://openaq.org/#/locations?parameters=pm25&countries=AD&_k=crspt2
Antigua and Barbuda,https://openaq.org/#/locations?parameters=pm25&countries=AG&_k=l5x5he
Argentina,https://openaq.org/#/locations?parameters=pm25&countries=AR&_k=962zxt
Australia,
Austria,
Bahrain,
Bangladesh,
The country.csv looks like this:
Afghanistan,Algeria,Andorra,Antigua and Barbuda,Argentina,Australia,Austria,Bahrain,Bangladesh,Belgium,Bermuda,Bosnia and Herzegovina,Brazil,
The code used is:
driver = webdriver.Chrome(options=options, executable_path=driver_path)
url = 'https://openaq.org/#/locations?parameters=pm25&_k=ggmrvm'
driver.get(url)
time.sleep(2)

# This block opens the .csv file that we created at the first stage
# The .csv file includes the names of countries
with open('1Countries.csv', newline='') as f:
    reader = csv.reader(f)
    list_of_countries = list(reader)
list_of_countries = list_of_countries[0]
print(list_of_countries)  # printing the list of countries

# Let's create a Data Frame of the country & country_url
df = pd.DataFrame(columns=['country', 'country_url'])

# With this loop we are generating urls for each country page
for country in list_of_countries[:92]:
    try:
        path = ('//span[contains(text(),' + '\"' + country + '\"' + ')]')
        # "path" is used to filter each country on the website by
        # iterating over country names.
        next_button = driver.find_element_by_xpath(path)
        next_button.click()
        # Using "button.click" we get to the page of the next country
        time.sleep(2)
        country_url = (driver.current_url)
        # "country_url" is used to get the url of the current page
        next_button.click()
    except:
        country_url = None
    d = [{'country': country, 'country_url': country_url}]
    df = df.append(d)
I've tried increasing the sleep time, but I'm not sure what is causing this.
The challenge you face is that the country list is scrollable:
It's no coincidence that your code stops working at the point where the remaining countries are no longer displayed.
The solution is relatively easy: you need to scroll the element into view. I made a quick test with your code to confirm it works. I removed the CSV part, hard-coded a country that's further down the list, and added the parts that scroll it into view:
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
import time

def ScrollIntoView(element):
    actions = ActionChains(driver)
    actions.move_to_element(element).perform()

url = 'https://openaq.org/#/locations?parameters=pm25&_k=ggmrvm'
driver = webdriver.Chrome()
driver.get(url)
driver.implicitly_wait(10)

country = 'Bermuda'
path = ('//span[contains(text(),' + '\"' + country + '\"' + ')]')
next_button = driver.find_element_by_xpath(path)
ScrollIntoView(next_button)  # added this
next_button.click()
time.sleep(2)
country_url = (driver.current_url)
print(country_url)  # added this
next_button.click()
This is the output from the print:
https://openaq.org/#/locations?parameters=pm25&countries=BM&_k=7sp499
Are you happy to merge that into your solution? (Just say if you need more support.)
If it helps, the reason you didn't notice this yourself is that the try block was masking a NotInteractableException. Have a look at how to handle errors here.
try statements are great and useful, but it's also good to track when errors occur so you can fix them later. Borrowing some code from that link, you can try something like this in your except clause:
except:
    print("Unexpected error:", sys.exc_info()[0])

Is it possible to move through an HTML table and grab the data within it using BeautifulSoup4?

So for a project, I'm working on creating an API to interface with my school's course finder, and I'm struggling to grab the data from the HTML table it stores the data in without using Selenium. I was able to pull the HTML data initially using Selenium, but my instructor says he would prefer that I use the BeautifulSoup4 and MechanicalSoup libraries. I got as far as submitting a search and grabbing the HTML table the data is stored in. I'm not sure how to iterate through the data stored in the HTML table as I did with my Selenium code below.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
from selenium.webdriver.chrome.options import Options

Chrome_Options = Options()
Chrome_Options.add_argument("--headless")  # allows the program to run without opening a Chrome window

driver = webdriver.Chrome()
driver.get("https://winnet.wartburg.edu/coursefinder/")  # sets the Selenium driver

select = Select(driver.find_element_by_id("ctl00_ContentPlaceHolder1_FormView1_DropDownList_Term"))
term_options = select.options
# for index in range(0, len(term_options) - 1):
#     select.select_by_index(index)

lst = []

DeptSelect = Select(driver.find_element_by_id("ctl00_ContentPlaceHolder1_FormView1_DropDownList_Department"))
DeptSelect.select_by_visible_text("History")  # finds the desired department

search = driver.find_element_by_name("ctl00$ContentPlaceHolder1$FormView1$Button_FindNow")
search.click()  # sends the query

table_id = driver.find_element_by_id("ctl00_ContentPlaceHolder1_GridView1")
rows = table_id.find_elements_by_tag_name("tr")
for row in rows:  # creates a list of lists containing our data
    col_lst = []
    col = row.find_elements_by_tag_name("td")
    for data in col:
        lst.append(data.text)

def chunk(l, n):  # generator that partitions our list neatly
    print("chunking...")
    for i in range(0, len(l), n):
        yield l[i:i + n]

n = 16  # each list contains 16 items regardless of contents or search
uberlist = list(chunk(lst, n))  # call the chunk fn to partition the list

with open('class_data.txt', 'w') as handler:  # output of scraped data
    print("writing file...")
    for listitem in uberlist:
        handler.write('%s\n' % listitem)

driver.close()  # ends and closes Selenium control over the browser
This is my Soup code, and I'm wondering how I can take the data from the HTML in a similar way to what I did above with Selenium.
import mechanicalsoup
import requests
from lxml import html
from lxml import etree
import pandas as pd

def text(elt):
    return elt.text_content().replace(u'\xa0', u' ')

# This will use MechanicalSoup to grab the form, submit it and find the data table
browser = mechanicalsoup.StatefulBrowser()
winnet = "http://winnet.wartburg.edu/coursefinder/"
browser.open(winnet)

Searchform = browser.select_form()
Searchform.choose_submit('ctl00$ContentPlaceHolder1$FormView1$Button_FindNow')
response1 = browser.submit_selected()  # this progresses to the second form

dataURL = browser.get_url()  # get the URL of the second form w/ data
dataURL2 = 'https://winnet.wartburg.edu/coursefinder/Results.aspx'
pageContent = requests.get(dataURL2)
tree = html.fromstring(pageContent.content)

dataTable = tree.xpath('//*[@id="ctl00_ContentPlaceHolder1_GridView1"]')

rows = []  # initialize a collection of rows
for row in dataTable[0].xpath(".//tr")[1:]:  # add new rows to the collection
    rows.append([cell.text_content().strip() for cell in row.xpath(".//td")])

df = pd.DataFrame(rows)  # load the collection into a dataframe
print(df)

# XPath to the table
# //*[@id="ctl00_ContentPlaceHolder1_GridView1"]
# //*[@id="ctl00_ContentPlaceHolder1_GridView1"]/tbody
Turns out I was passing the wrong thing when using MechanicalSoup. I was able to pass the new page's contents to a variable called table and have it use .find('table') to retrieve the table HTML rather than the full page's HTML. From there I just used table.get_text().split('\n') to make what is essentially a giant list of all of the rows.
I also dabbled with setting form filters, which worked as well.
import mechanicalsoup
from bs4 import BeautifulSoup

# Sets the StatefulBrowser object to winnet, then grabs the form
browser = mechanicalsoup.StatefulBrowser()
winnet = "http://winnet.wartburg.edu/coursefinder/"
browser.open(winnet)
Searchform = browser.select_form()

# Selects the submit button and sets the filter options.
Searchform.choose_submit('ctl00$ContentPlaceHolder1$FormView1$Button_FindNow')
Searchform.set('ctl00$ContentPlaceHolder1$FormView1$TextBox_keyword', "")  # keyword searches by class title; passing a string searches for that string, ignoring any stored nonsense in the page
# ACxxx course codes have 3 spaces after them, THIS IS REQUIRED, except for the 'All' value used when not searching by a department.
Searchform.set("ctl00$ContentPlaceHolder1$FormView1$DropDownList_Department", 'All')  # the department list takes course codes as inputs and displays them as the full name
Searchform.set("ctl00$ContentPlaceHolder1$FormView1$DropDownList_Term", "2020 Winter Term")  # the term dropdown takes a string value that is exactly the term date
Searchform.set('ctl00$ContentPlaceHolder1$FormView1$DropDownList_MeetingTime', 'all')  # takes the week class time as a string; need to retrieve the list of options from the page
Searchform.set('ctl00$ContentPlaceHolder1$FormView1$DropDownList_EssentialEd', 'none')  # takes a small string signalling the EE requirement, or 'all' or 'none'; 'none' selects no option and 'all' selects all courses with an EE
Searchform.set('ctl00$ContentPlaceHolder1$FormView1$DropDownList_CulturalDiversity', 'none')  # Cultural Diversity: takes none, C, D or all
Searchform.set('ctl00$ContentPlaceHolder1$FormView1$DropDownList_WritingIntensive', 'none')  # options are none or WI
Searchform.set('ctl00$ContentPlaceHolder1$FormView1$DropDownList_PassFail', 'none')  # Pass/Fail takes 'none' or 'PF'
Searchform.set('ctl00$ContentPlaceHolder1$FormView1$CheckBox_OpenCourses', False)  # checkbox: True or False
Searchform.set('ctl00$ContentPlaceHolder1$FormView1$DropDownList_Instructor', '0')  # 0 is for none selected, otherwise it is a string of numbers (instructor ID?)

# Submits the page and grabs the results.
browser.submit_selected()  # submits the form and retrieves the results
table = browser.get_current_page().find('table')  # finds the results table
print(type(table))
rows = table.get_text().split('\n')  # list of all class rows split by \n
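If you later want the result as structured columns rather than one long text list, pandas can usually parse the same table tag directly (a sketch; it assumes pandas plus lxml or html5lib are installed, and that the table's header row supplies the column names):

import pandas as pd

# `table` is the bs4 Tag found above; read_html parses its markup into a DataFrame
courses_df = pd.read_html(str(table))[0]
print(courses_df.head())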

Python 3.7 Issue with append function

I'm learning Python and decided to adapt code from an example that scrapes Craigslist data to look at car prices: https://towardsdatascience.com/web-scraping-craigslist-a-complete-tutorial-c41cea4f4981
I created a Jupyter notebook and modified the code for my use. I reproduced the same error when running the code in Spyder with Python 3.7.
I'm running into an issue at line 116:
File "C:/Users/UserM/Documents/GitHub/learning/Spyder Python Craigslist Scrape Untitled0.py", line 116
post_prices.append(post_price)
I receive a "SyntaxError: invalid syntax".
Any help appreciated. Thanks.
# -*- coding: utf-8 -*-
"""
Created on Wed Oct 2 12:26:06 2019
"""
#import get to call a get request on the site
from requests import get

#get the first page of the Chicago car prices
response = get('https://chicago.craigslist.org/search/cto?bundleDuplicates=1') #eliminate duplicates and show owner only sales

from bs4 import BeautifulSoup
html_soup = BeautifulSoup(response.text, 'html.parser')

#get the macro-container for the housing posts
posts = html_soup.find_all('li', class_= 'result-row')
print(type(posts)) #to double check that I got a ResultSet
print(len(posts)) #to double check I got 120 (elements/page)

#grab the first post
post_one = posts[0]

#grab the price of the first post
post_one_price = post_one.a.text
post_one_price.strip()

#grab the time of the post in datetime format to save on cleaning efforts
post_one_time = post_one.find('time', class_= 'result-date')
post_one_datetime = post_one_time['datetime']

#title is a and that class, link is grabbing the href attribute of that variable
post_one_title = post_one.find('a', class_='result-title hdrlnk')
post_one_link = post_one_title['href']

#easy to grab the post title by taking the text element of the title variable
post_one_title_text = post_one_title.text

#the neighborhood is grabbed by finding the span class 'result-hood' and pulling the text element from that
post_one_hood = posts[0].find('span', class_='result-hood').text

#the price is grabbed by finding the span class 'result-price' and pulling the text element from that
post_one_hood = posts[0].find('span', class_='result-price').text

#build out the loop
from time import sleep
import re
from random import randint #avoid throttling by not sending too many requests one after the other
from warnings import warn
from time import time
from IPython.core.display import clear_output
import numpy as np

#find the total number of posts to find the limit of the pagination
results_num = html_soup.find('div', class_= 'search-legend')
results_total = int(results_num.find('span', class_='totalcount').text) #pulled the total count of posts as the upper bound of the pages array

#each page has 119 posts so each new page is defined as follows: s=120, s=240, s=360, and so on. So we need to step in size 120 in the np.arange function
pages = np.arange(0, results_total+1, 120)

iterations = 0

post_timing = []
post_hoods = []
post_title_texts = []
post_links = []
post_prices = []

for page in pages:
    #get request
    response = get("https://chicago.craigslist.org/search/cto?bundleDuplicates=1"
                   + "s=" #the parameter for defining the page number
                   + str(page) #the page number in the pages array from earlier
                   + "&hasPic=1"
                   + "&availabilityMode=0")
    sleep(randint(1,5))

    #throw warning for status codes that are not 200
    if response.status_code != 200:
        warn('Request: {}; Status code: {}'.format(requests, response.status_code))

    #define the html text
    page_html = BeautifulSoup(response.text, 'html.parser')

    #define the posts
    posts = html_soup.find_all('li', class_= 'result-row')

    #extract data item-wise
    for post in posts:
        if post.find('span', class_ = 'result-hood') is not None:
            #posting date
            #grab the datetime element 0 for date and 1 for time
            post_datetime = post.find('time', class_= 'result-date')['datetime']
            post_timing.append(post_datetime)

            #neighborhoods
            post_hood = post.find('span', class_= 'result-hood').text
            post_hoods.append(post_hood)

            #title text
            post_title = post.find('a', class_='result-title hdrlnk')
            post_title_text = post_title.text
            post_title_texts.append(post_title_text)

            #post link
            post_link = post_title['href']
            post_links.append(post_link)

            #removes the \n whitespace from each side, removes the currency symbol, and turns it into an int
            #test removed: post_price = int(post.a.text.strip().replace("$", ""))
            post_price = int(float((post.a.text.strip().replace("$", ""))) #does this work??
            post_prices.append(post_price)

    iterations += 1
    print("Page " + str(iterations) + " scraped successfully!")

print("\n")
print("Scrape complete!")

import pandas as pd
eb_apts = pd.DataFrame({'posted': post_timing,
                        'neighborhood': post_hoods,
                        'post title': post_title_texts,
                        'URL': post_links,
                        'price': post_prices})
print(eb_apts.info())
eb_apts.head(10)
Welcome to StackOverflow. Usually when you see syntax errors in already working code, it means that you've either messed up indentation, forgot to terminate a string somewhere, or missed a closing bracket.
You can tell this when a line of what looks to be ok code is throwing you a syntax error. This is because the line before isn't ended properly and the interpreter is giving you hints around where to look.
In this case, you're short a parenthesis on the line before:
post_price = int(float((post.a.text.strip().replace("$", "")))
should be
post_price = int(float((post.a.text.strip().replace("$", ""))))
or delete the extra parenthesis after float:
post_price = int(float(post.a.text.strip().replace("$", "")))
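As a side note, some Craigslist posts have an empty or non-numeric price, in which case even the corrected line raises a ValueError. A defensive variant (a sketch only; it records None instead of crashing, which you may or may not want):

raw_price = post.a.text.strip().replace("$", "").replace(",", "")
try:
    post_price = int(float(raw_price))
except ValueError:  # e.g. an empty string or "free"
    post_price = None
post_prices.append(post_price)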
