I am scrolling a Google Jobs page to extract multiple company names, but I'm only getting 2 records.
Can anyone suggest how to tweak the code below so that it gets all of the company names that appear next to the word 'via', as shown in the image below?
driver.get("https://www.google.com/search?q=bank+jobs+in+india&rlz=1C1CHBF_enIN869IN869&oq=upsc+jo&aqs=chrome.1.69i57j0i433i512j0i131i433i512j0i512l3j0i131i433i512l2j0i512j0i433i512&sourceid=chrome&ie=UTF-8&ibp=htl;jobs&sa=X&sqi=2&ved=2ahUKEwjR27GN_qPzAhX4ppUCHb_0B_QQkd0GegQIORAB#fpstate=tldetail&sxsrf=AOaemvIxuJXh3if0tw7ezZfjkXRe5DSxsA:1632911697417&htivrt=jobs&htidocid=hr3yUBTZAssve05hAAAAAA%3D%3D")
name = []
cnt = 0
try:
    while True:
        element = driver.find_elements_by_xpath("//div[@role='treeitem']")
        driver.execute_script("arguments[0].scrollIntoView(true);", element[cnt])
        time.sleep(2)
        try:
            nam = driver.find_element_by_xpath("//div[contains(@class, 'oNwCmf')]").text
            nam1 = nam.split("\nvia ")[1]
            name.append(nam1.split("\n")[0])
        except:
            name.append("")
        cnt = cnt + 1
except:
    pass
Try it like this:
Get the name nam from the WebElement element[cnt] (instead of searching with driver). Since we are now finding an element within an element, add a dot at the start of the XPath. This returns the name belonging to that particular element.
try:
    while True:
        element = driver.find_elements_by_xpath("//div[@role='treeitem']")
        driver.execute_script("arguments[0].scrollIntoView(true);", element[cnt])
        time.sleep(2)
        try:
            # find the name within that particular element[cnt]; the leading dot scopes the XPath to it
            nam = element[cnt].find_element_by_xpath(".//div[contains(@class, 'oNwCmf')]").text
            print(nam)
            nam1 = nam.split("\nvia ")[1]
            name.append(nam1.split("\n")[0])
        except:
            name.append("")
        cnt = cnt + 1
except:
    pass
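If you are on a recent Selenium 4 release, the find_element_by_xpath / find_elements_by_xpath helpers have been removed. A minimal sketch of the same lookups inside the loop above, using the current locator API (the XPaths and the cnt counter are unchanged from the code above):

from selenium.webdriver.common.by import By

# Selenium 4 locator API equivalents of the two lookups in the loop above
element = driver.find_elements(By.XPATH, "//div[@role='treeitem']")
driver.execute_script("arguments[0].scrollIntoView(true);", element[cnt])
nam = element[cnt].find_element(By.XPATH, ".//div[contains(@class, 'oNwCmf')]").text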
I'm using Python with Selenium to access a webpage, and I want to keep checking whether an element's text exists; only once it exists should the loop stop. I came up with code like this:
while True:
    try:
        myElem = WebDriverWait(driver, 0).until(EC.presence_of_element_located((By.XPATH, '//*[@id="EP"]/ol/li/li')))
        if myElem.text == "HELLO":
            print("Found!")
    except TimeoutException:
        print("Not Found!")
        continue
    break
Now, the main issue: instead of 1 element, I need to check 3 elements. If any one of the elements is found, print the found element and stop the loop. How can I achieve this?
WebDriverWait(driver, 0).until(EC.presence_of_element_located((By.XPATH, '//*[@id="EP"]/ol/li/li')))
WebDriverWait(driver, 0).until(EC.presence_of_element_located((By.XPATH, '//*[@id="EP"]/div[2]/p[2]/p[2]')))
WebDriverWait(driver, 0).until(EC.presence_of_element_located((By.XPATH, '//*[@class="class20"]')))
python: 3.11.1, selenium: 4.8.0
Put the try...except... in a function that takes the element's XPath and the expected text as parameters and returns True if the text is found, otherwise False. Then loop and check whether any of the three calls returned True.
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException

def check_presence(xpath, txt):
    try:
        myElem = WebDriverWait(driver, 0).until(EC.presence_of_element_located((By.XPATH, xpath)))
        if myElem.text == txt:
            print("Found!", txt)
            return True
        else:
            print(f'Element exists but text "{txt}" does not match')
    except TimeoutException:
        print("Not Found!", txt)
    return False

while True:
    a = check_presence('//*[@id="EP"]/ol/li/li', "HELLO")
    b = check_presence('//*[@id="EP"]/div[2]/p[2]/p[2]', "HI")
    c = check_presence('//*[@class="class20"]', "CIAO")
    if any([a, b, c]):
        break
The easiest way to do this is to combine the different XPaths into one and check for the existence of "HELLO" in the XPath itself.
//*[@id="EP"]/ol/li/li[text()="HELLO"] | //*[@class="class20"][text()="HELLO"] | //*[@id="EP"]/div[2]/p[2]/p[2][text()="HELLO"]
If any element is returned, you've found a desired element. You put this in a wait with a reasonable timeout (30s?) and you're done.
try:
    myElem = WebDriverWait(driver, 30).until(EC.visibility_of_element_located((By.XPATH, '//*[@id="EP"]/ol/li/li[text()="HELLO"] | //*[@class="class20"][text()="HELLO"] | //*[@id="EP"]/div[2]/p[2]/p[2][text()="HELLO"]')))
    print("Found!")
except TimeoutException:
    print("Not Found!")
NOTE: I don't think //*[@id="EP"]/ol/li/li and //*[@id="EP"]/div[2]/p[2]/p[2] are valid locators. You can't have nested LI or P tags in valid HTML. You might want to check those again and update them.
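Another option on Selenium 4 (the question mentions 4.8.0) is EC.any_of, which waits until any one of several expected conditions passes. A minimal sketch reusing the text-filtering XPaths from the answer above:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException

try:
    # any_of returns the result of the first condition that succeeds
    found = WebDriverWait(driver, 30).until(EC.any_of(
        EC.presence_of_element_located((By.XPATH, '//*[@id="EP"]/ol/li/li[text()="HELLO"]')),
        EC.presence_of_element_located((By.XPATH, '//*[@id="EP"]/div[2]/p[2]/p[2][text()="HELLO"]')),
        EC.presence_of_element_located((By.XPATH, '//*[@class="class20"][text()="HELLO"]')),
    ))
    print("Found!", found.text)
except TimeoutException:
    print("Not Found!")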
The CSV file contains the names of the countries used. However, after Argentina it fails to retrieve the URL and returns an empty string.
country,country_url
Afghanistan,https://openaq.org/#/locations?parameters=pm25&countries=AF&_k=tomib2
Algeria,https://openaq.org/#/locations?parameters=pm25&countries=DZ&_k=dcc8ra
Andorra,https://openaq.org/#/locations?parameters=pm25&countries=AD&_k=crspt2
Antigua and Barbuda,https://openaq.org/#/locations?parameters=pm25&countries=AG&_k=l5x5he
Argentina,https://openaq.org/#/locations?parameters=pm25&countries=AR&_k=962zxt
Australia,
Austria,
Bahrain,
Bangladesh,
The country.csv looks like this:
Afghanistan,Algeria,Andorra,Antigua and Barbuda,Argentina,Australia,Austria,Bahrain,Bangladesh,Belgium,Bermuda,Bosnia and Herzegovina,Brazil,
The code used is:
driver = webdriver.Chrome(options=options, executable_path=driver_path)

url = 'https://openaq.org/#/locations?parameters=pm25&_k=ggmrvm'
driver.get(url)
time.sleep(2)

# Open the .csv file that we created at the first stage;
# it contains the names of the countries.
with open('1Countries.csv', newline='') as f:
    reader = csv.reader(f)
    list_of_countries = list(reader)
list_of_countries = list_of_countries[0]
print(list_of_countries)  # printing the list of countries

# Create a DataFrame of country & country_url
df = pd.DataFrame(columns=['country', 'country_url'])

# Generate the URL for each country page
for country in list_of_countries[:92]:
    try:
        # "path" filters each country on the website by iterating over the country names
        path = ('//span[contains(text(),' + '\"' + country + '\"' + ')]')
        next_button = driver.find_element_by_xpath(path)
        # Clicking the span opens the page for that country
        next_button.click()
        time.sleep(2)
        # "country_url" is the URL of the current (filtered) page
        country_url = driver.current_url
        # Click again to clear the filter before the next country
        next_button.click()
    except:
        country_url = None
    d = [{'country': country, 'country_url': country_url}]
    df = df.append(d)
I've tried increasing the sleep time, but I'm not sure what is causing this.
The challenge you face is that the country list is scrollable: it's no coincidence that your code stops working at the point where the countries are no longer displayed.
It's a relatively easy fix - you need to scroll the element into view. I made a quick test with your code to confirm it works: I removed the CSV part, hard-coded a country that's further down the list, and added the parts that scroll it into view:
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
import time

def ScrollIntoView(element):
    actions = ActionChains(driver)
    actions.move_to_element(element).perform()

url = 'https://openaq.org/#/locations?parameters=pm25&_k=ggmrvm'
driver = webdriver.Chrome()
driver.get(url)
driver.implicitly_wait(10)

country = 'Bermuda'
path = ('//span[contains(text(),' + '\"' + country + '\"' + ')]')
next_button = driver.find_element_by_xpath(path)
ScrollIntoView(next_button)  # added this
next_button.click()
time.sleep(2)
country_url = driver.current_url
print(country_url)  # added this
next_button.click()
This is the output from the print:
https://openaq.org/#/locations?parameters=pm25&countries=BM&_k=7sp499
Are you happy to merge that into your solution? (Just say if you need more support.)
If it helps, the reason you didn't notice this yourself is that the bare try/except was masking an ElementNotInteractableException. Have a look at how to handle errors here.
try statements are great and useful - but it's also good to track when exceptions occur so you can fix them later. Borrowing some code from that link, you can try something like this in your except block:
except:
    print("Unexpected error:", sys.exc_info()[0])  # requires "import sys" at the top of the script
I'm trying to get the data from https://openaq.org/#/location/Algiers?_k=nv8w8w, but it always returns a null value.
def getCardDetails(country, url):
    local_df = pd.DataFrame(columns=['country', 'card_url', 'general', 'country_link', 'city', 'PM2.5', 'date', 'hour'])
    pm = None
    date = None
    hour = None
    general = None
    city = None
    country_link = None
    try:
        # wait = WebDriverWait(driver, 3)
        # wait.until(EC.presence_of_element_located((By.ID, 'location-fold-stats')))
        time.sleep(2)
        # Using XPath we get the full text of the sibling that comes
        # after the text containing "PM2.5". We split the full text to
        # generate variables for our DataFrame such as "pm", "date" & "hour".
        try:
            print("inn")
            # Scraping pollution details from each location page
            # and splitting them to save in the relevant variables
            pm_date = driver.find_element(By.XPATH, '//dt[text() = "PM2.5"]/following-sibling::dd[1]').text
            text = pm_date.split('µg/m³ at ')
            print("nn", pm_date)
            pm = float(text[0])
            full_date = text[1].split(' ')
            date = full_date[0]
            hour = full_date[1]
This is my first time using Selenium for web scraping. I'd like to know how XPath works and what the issue is here.
Your XPath is correct. To get the value from a dynamic element you need to induce WebDriverWait() and wait for visibility_of_element_located():
print(WebDriverWait(driver,10).until(EC.visibility_of_element_located((By.XPATH,'//dt[text() = "PM2.5"]/following-sibling::dd[1]'))).text)
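For completeness, that one-liner assumes the usual Selenium wait-related imports, e.g.:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By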
I am trying to parse the hrefs and the titles of all articles from https://www.weforum.org/agenda/archive/covid-19, but I also want to pull the information from the following pages.
My code only pulls the current page; the click() to go to the next page is not working.
driver.get("https://www.weforum.org/agenda/archive/covid-19")
links =[]
titles = []
while True:
for elem in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, '.tout__link'))):
links.append(elem.get_attribute('href'))
titles.append(elem.text)
try:
WebDriverWait(driver,5).until(EC.presence_of_element_located((By.CSS_SELECTOR, ".pagination__nav-text"))).click()
WebDriverWait(driver,5).until(EC.staleness_of(elem))
except:
break
Can anyone help me with the issue? Thank you!
The class name 'pagination__nav-text' is not unique. As per the page design, your code clicks on the first element found, which is the "Prev" link, so you would not see it advance.
Can you try this approach instead?
driver.get("https://www.weforum.org/agenda/archive/covid-19")
wait = WebDriverWait(driver,10)
links =[]
titles = []
while True:
for elem in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, '.tout__link'))):
links.append(elem.get_attribute('href'))
titles.append(elem.text)
try:
print('trying to click next')
WebDriverWait(driver,5).until(EC.presence_of_element_located((By.XPATH,"//div[#class='pagination__nav-text' and contains(text(),'Next')]"))).click()
WebDriverWait(driver,5).until(EC.staleness_of(elem))
except:
break
print(links)
print(titles)
driver.quit()
As part of my Ph.D. research, I am scraping numerous webpages and searching for keywords within the scraped results.
This is how I do it thus far:
import requests
import pandas as pd

# load data as a pandas DataFrame with column df.url
df = pd.read_excel('sample.xls', header=0)

# define keyword search function
def contains_keywords(link, keywords):
    try:
        output = requests.get(link).text
        return int(any(x in output for x in keywords))
    except:
        return "Wrong/Missing URL"

# define the relevant keywords
mykeywords = ('for', 'bar')

# store search results in new column 'results'
df['results'] = df.url.apply(lambda l: contains_keywords(l, mykeywords))
This works just fine. I only have one problem: the list of relevant keywords mykeywords changes frequently, whilst the webpages stay the same. Running the code takes a long time, since I issue the requests over and over.
I have two questions:
(1) Is there a way to store the results of requests.get(link).text?
(2) And if so, how do I search within the saved file(s) to produce the same result as with the current script?
As always, thank you for your time and help! /R
You can download the content of the URLs and save them in separate files in a directory (e.g. 'links'):
import os
import requests

def get_link(url):
    file_name = os.path.join('/path/to/links', url.replace('/', '_').replace(':', '_'))
    try:
        r = requests.get(url)
    except Exception as e:
        print("Failed to get " + url)
    else:
        with open(file_name, 'w') as f:
            f.write(r.text)
Then modify the contains_keywords function to read local files, so you won't have to use requests every time you run the script.
def contains_keywords(link, keywords):
    file_name = os.path.join('/path/to/links', link.replace('/', '_').replace(':', '_'))
    try:
        with open(file_name) as f:
            output = f.read()
        return int(any(x in output for x in keywords))
    except Exception as e:
        print("Can't access file: {}\n{}".format(file_name, e))
        return "Wrong/Missing URL"
Edit: I just added a try-except block in get_link and used an absolute path for file_name.
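To tie the two pieces together, here is a minimal sketch of the overall flow (assuming the same '/path/to/links' directory and the df.url column from the question; get_link and contains_keywords are the functions defined above):

import pandas as pd

df = pd.read_excel('sample.xls', header=0)

# Step 1: download each page once (re-run only when the URL list changes)
for url in df.url:
    get_link(url)

# Step 2: keyword searches now run against the local copies, so changing
# mykeywords no longer triggers new HTTP requests
mykeywords = ('for', 'bar')
df['results'] = df.url.apply(lambda l: contains_keywords(l, mykeywords))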