I am trying to parse the hrefs and the titles of all articles from https://www.weforum.org/agenda/archive/covid-19 but I also want to pull information on the next page.
My code can only pull the current page but is not working on click() next page.
driver.get("https://www.weforum.org/agenda/archive/covid-19")
links =[]
titles = []
while True:
for elem in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, '.tout__link'))):
links.append(elem.get_attribute('href'))
titles.append(elem.text)
try:
WebDriverWait(driver,5).until(EC.presence_of_element_located((By.CSS_SELECTOR, ".pagination__nav-text"))).click()
WebDriverWait(driver,5).until(EC.staleness_of(elem))
except:
break
Can anyone help me with the issue? Thank you!
The class name 'pagination__nav-text' is not unique. As per the design, it clicks on the first found element which is "Prev" link. so you would not see that working.
Can you try with this approach,
driver.get("https://www.weforum.org/agenda/archive/covid-19")
wait = WebDriverWait(driver,10)
links =[]
titles = []
while True:
for elem in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, '.tout__link'))):
links.append(elem.get_attribute('href'))
titles.append(elem.text)
try:
print('trying to click next')
WebDriverWait(driver,5).until(EC.presence_of_element_located((By.XPATH,"//div[#class='pagination__nav-text' and contains(text(),'Next')]"))).click()
WebDriverWait(driver,5).until(EC.staleness_of(elem))
except:
break
print(links)
print(titles)
driver.quit()
Related
I am trying to scroll down the job posts using below lines, but it will give sometime correct results to scroll down to the end and sometimes it won't.
html = driver.find_element_by_tag_name('html')
time.sleep(5)
html.send_keys(Keys.END)
Can anyone suggest me how to scroll down to the end, please find the link and screenshot below.
https://www.google.com/search?q=upsc+jobs+in+india&rlz=1C1CHBF_enIN869IN869&oq=upsc+jo&aqs=chrome.1.69i57j0i433i512j0i131i433i512j0i512l3j0i131i433i512l2j0i512j0i433i512&sourceid=chrome&ie=UTF-8&ibp=htl;jobs&sa=X&sqi=2&ved=2ahUKEwjR27GN_qPzAhX4ppUCHb_0B_QQkd0GegQIORAB#fpstate=tldetail&sxsrf=AOaemvIxuJXh3if0tw7ezZfjkXRe5DSxsA:1632911697417&htivrt=jobs&htidocid=N1_BNfzt8n8auXjGAAAAAA%3D%3D
The more you scroll the more data you get, basically it's a dynamic web site. I have hardcoded 50 as a dummy number, you can have 100 or any other number for this matter.
You can use the sample code :
driver = webdriver.Chrome(driver_path)
driver.maximize_window()
driver.implicitly_wait(30)
driver.get("https://www.google.com/search?q=upsc+jobs+in+india&rlz=1C1CHBF_enIN869IN869&oq=upsc+jo&aqs=chrome.1.69i57j0i433i512j0i131i433i512j0i512l3j0i131i433i512l2j0i512j0i433i512&sourceid=chrome&ie=UTF-8&ibp=htl;jobs&sa=X&sqi=2&ved=2ahUKEwjR27GN_qPzAhX4ppUCHb_0B_QQkd0GegQIORAB#fpstate=tldetail&sxsrf=AOaemvIxuJXh3if0tw7ezZfjkXRe5DSxsA:1632911697417&htivrt=jobs&htidocid=N1_BNfzt8n8auXjGAAAAAA%3D%3D")
j = 1
for i in range(50):
element = driver.find_element(By.XPATH, f"(//div[#role='heading'])[{j}]")
driver.execute_script("arguments[0].scrollIntoView(true);", element)
j = j + 1
You can also try this to scroll till the end.
driver.get("https://www.google.com/search?q=upsc+jobs+in+india&rlz=1C1CHBF_enIN869IN869&oq=upsc+jo&aqs=chrome.1.69i57j0i433i512j0i131i433i512j0i512l3j0i131i433i512l2j0i512j0i433i512&sourceid=chrome&ie=UTF-8&ibp=htl;jobs&sa=X&sqi=2&ved=2ahUKEwjR27GN_qPzAhX4ppUCHb_0B_QQkd0GegQIORAB#fpstate=tldetail&sxsrf=AOaemvIxuJXh3if0tw7ezZfjkXRe5DSxsA:1632911697417&htivrt=jobs&htidocid=N1_BNfzt8n8auXjGAAAAAA%3D%3D")
i = 0
try:
while True:
options = driver.find_elements_by_xpath("//div[#role='treeitem']")
driver.execute_script("arguments[0].scrollIntoView(true);",options[i])
i+=1
time.sleep(.5)
except:
pass
I want help in scraping infinite scrolling pages. For now, I have entered pageNumber = 100, which helps me in getting the name from 100 pages.
But I want to crawl all the pages till the end. As the page has infinite scrolling and being new to scrapy I am unable to do the same. I am trying this for the past 2 days.
class StorySpider(scrapy.Spider):
name = 'story-spider'
start_urls = ['https://www.storytel.com/in/en/categories/3-Crime?pageNumber=100']
def parse(self, response):
for quote in response.css('div.gridBookTitle'):
item = {
'name': quote.css('a::attr(href)').extract_first()
}
yield item
The original link is https://www.storytel.com/in/en/categories/1-Children. I see that the pageNumber variable is inside script tag, if it helps to find the solution.
Any help would be appreciated. Thanks in advance!!
If you search the XPath like <link rel="next" href=''>
you will found the pagination option. With the help of you can add the pagination code.
here is some example of the pagination page.
next_page = xpath of pagination
if len(next_page) !=0:
next_page_url = main_url.join(next_page
yield scrapy.Request(next_page_url, callback=self.parse)
It will helps you.
This is the extended question on how to click 'More' button on a webpage.
Below is my previous question and one person kindly answered for it.
Since I'm not that familiar with 'find element by class name' function, I just added that person's revised code on my existing code. So my revised code would not be efficient (my apology).
Python click 'More' button is not working
The situation is, there are two types of 'More' button. 1st one is in the property description part and the 2nd one is in the text reviews part. If you click only one 'More' button from any of the reviews, reviews will be expanded so that you can see the full text reviews.
The issue I run into is that I can click 'More' button for the reviews that are in the 1st page but not clickable for the reviews in the 2nd page.
Below is the error message I get but my code still runs (without stopping once it sees an error).
Message:
no such element: Unable to locate element: {"method":"tag name","selector":"span"}
Based on my understanding, there is entry class and corresponding span for every review. I don't understand why it says python can't find it.
from selenium import webdriver
from selenium.webdriver import ActionChains
from bs4 import BeautifulSoup
review_list=[]
review_appended_list=[]
review_list_v2=[]
review_appended_list_v2=[]
listed_reviews=[]
listed_reviews_v2=[]
listed_reviews_total=[]
listed_reviews_total_v2=[]
final_list=[]
#Incognito Mode
option = webdriver.ChromeOptions()
option.add_argument("--incognito")
#Open Chrome
driver=webdriver.Chrome(executable_path="C:/Users/chromedriver.exe",options=option)
#url I want to visit (I'm going to loop over multiple listings but for simplicity, I just added one listing url).
lists = ['https://www.tripadvisor.com/VacationRentalReview-g30196-d6386734-Hot_51st_St_Walk_to_Mueller_2BDR_Modern_sleeps_7-Austin_Texas.html']
for k in lists:
driver.get(k)
time.sleep(3)
#click 'More' on description part.
link = driver.find_element_by_link_text('More')
try:
ActionChains(driver).move_to_element(link)
time.sleep(1) # time to move to link
link.click()
time.sleep(1) # time to update HTML
except Exception as ex:
print(ex)
time.sleep(3)
# first "More" shows text in all reviews - there is no need to search other "More"
try:
first_entry = driver.find_element_by_class_name('entry')
more = first_entry.find_element_by_tag_name('span')
#more = first_entry.find_element_by_link_text('More')
except Exception as ex:
print(ex)
try:
ActionChains(driver).move_to_element(more)
time.sleep(1) # time to move to link
more.click()
time.sleep(1) # time to update HTML
except Exception as ex:
print(ex)
#begin parsing html and scraping data.
html =driver.page_source
soup=BeautifulSoup(html,"html.parser")
listing=soup.find_all("div", class_="review-container")
all_reviews = driver.find_elements_by_class_name('wrap')
for review in all_reviews:
all_entries = review.find_elements_by_class_name('partial_entry')
if all_entries:
review_list=[all_entries[0].text]
review_appended_list.extend([review_list])
for i in range(len(listing)):
review_id=listing[i]["data-reviewid"]
listing_v1=soup.find_all("div", class_="rating reviewItemInline")
rating=listing_v1[i].span["class"][1]
review_date=listing_v1[i].find("span", class_="ratingDate relativeDate")
review_date_detail=review_date["title"]
listed_reviews=[review_id, review_date_detail, rating[7:8]]
listed_reviews.extend([k])
listed_reviews_total.append(listed_reviews)
for a,b in zip (listed_reviews_total,review_appended_list):
final_list.append(a+b)
#loop over from the 2nd page of the reviews for the same listing.
for j in range(5,20,5):
url_1='-'.join(k.split('-',3)[:3])
url_2='-'.join(k.split('-',3)[3:4])
middle="-or%d-" % j
final_k=url_1+middle+url_2
driver.get(final_k)
time.sleep(3)
link = driver.find_element_by_link_text('More')
try:
ActionChains(driver).move_to_element(link)
time.sleep(1) # time to move to link
link.click()
time.sleep(1) # time to update HTML
except Exception as ex:
print(ex)
# first "More" shows text in all reviews - there is no need to search other "More"
try:
first_entry = driver.find_element_by_class_name('entry')
more = first_entry.find_element_by_tag_name('span')
except Exception as ex:
print(ex)
try:
ActionChains(driver).move_to_element(more)
time.sleep(2) # time to move to link
more.click()
time.sleep(2) # time to update HTML
except Exception as ex:
print(ex)
html =driver.page_source
soup=BeautifulSoup(html,"html.parser")
listing=soup.find_all("div", class_="review-container")
all_reviews = driver.find_elements_by_class_name('wrap')
for review in all_reviews:
all_entries = review.find_elements_by_class_name('partial_entry')
if all_entries:
#print('--- review ---')
#print(all_entries[0].text)
#print('--- end ---')
review_list_v2=[all_entries[0].text]
#print (review_list)
review_appended_list_v2.extend([review_list_v2])
#print (review_appended_list)
for i in range(len(listing)):
review_id=listing[i]["data-reviewid"]
#print review_id
listing_v1=soup.find_all("div", class_="rating reviewItemInline")
rating=listing_v1[i].span["class"][1]
review_date=listing_v1[i].find("span", class_="ratingDate relativeDate")
review_date_detail=review_date["title"]
listed_reviews_v2=[review_id, review_date_detail, rating[7:8]]
listed_reviews_v2.extend([k])
listed_reviews_total_v2.append(listed_reviews_v2)
for a,b in zip (listed_reviews_total_v2,review_appended_list_v2):
final_list.append(a+b)
print (final_list)
if len(listing) !=5:
break
How to enable clicking 'More' button for the 2nd and rest of the pages? so that I can scrape the full text reviews?
Edited Below:
The error messages I get are these two lines.
Message: no such element: Unable to locate element: {"method":"tag name","selector":"span"}
Message: stale element reference: element is not attached to the page document
I guess my whole codes still run because I used try and except function? Usually when python runs into an error, it stops running.
Try it like:
driver.execute_script("""
arguments[0].click()
""", link)
I am trying to scrap reviews from verizon website and I found the xpath of reviews by doing inspect on webpage. I am executing below code but this review.text doesnt seems to be working perfectly all the time. I get correct text sometimes and sometimes it just prints Error in message -
Not sure , what am I doing wrong..
from selenium import webdriver
url = 'https://www.verizonwireless.com/smartphones/samsung-galaxy-s7/'
browser = webdriver.Chrome(executable_path='/Users/userName/PycharmProjects/Verizon/chromedriver')
browser.get(url)
reviews = []
xp = '//*[#id="BVRRContainer"]/div/div/div/div/div[3]/div/ul/li[2]/a/span[2]'
# read first ten pages of reviews ==>
for j in range(10):
reviews.extend(browser.find_elements_by_xpath('//*[#id="BVRRContainer"]/div/div/div/div/ol/li[*]/div/div[1]'
'/div/div[2]/div/div/div[1]/p'))
try:
next = browser.find_element_by_xpath(xp)
next.click()
except:
print(j,"error clicking")
# Print reviews ===>
for i, review in enumerate(reviews):
try:
print(review.text)
except:
print("Error in :" review)
You should improve the logic of your code. Note, that you cannot get text of elements from the first page after redirection to next page- you need to get text before clicking "Next" button.
Try to use below code instead:
from selenium import webdriver
from selenium.common.exceptions import WebDriverException
import time
url = 'https://www.verizonwireless.com/smartphones/samsung-galaxy-s7/'
browser = webdriver.Chrome()
browser.get(url)
reviews = []
xp = '//a[span[#class="bv-content-btn-pages-next"]]'
# read first ten pages of reviews ==>
for i in range(10):
for review in browser.find_elements_by_xpath('//div[#class="bv-content-summary-body-text"]/p'):
reviews.append(review.text)
try:
next = browser.find_element_by_xpath(xp)
next.location_once_scrolled_into_view
time.sleep(0.5) # To wait until scrolled down to "Next" button
next.click()
time.sleep(2) # To wait for page "autoscrolling" to first review + until modal window dissapeared
except WebDriverException:
print("error clicking")
for review in reviews:
print(review)
I am trying to pull out the names of all courses offered by Lynda.com together with the subject so that it appears on my list as '2D Drawing -- Project Soane: Recover a Lost Monument with BIM with Paul F. Aubin'. So I am trying to write a script that will go to each subject on http://www.lynda.com/sitemap/categories and pull out the list of courses. I already managed to get Selenium to go from one subject to another and pull the courses. My only problem is that there is a button 'See X more courses' to see the rest of the courses. Sometimes you have to click it couple of times that´s why I used while loop. But selenium doesn´t seem to execute this click. Does anyone know why?
This is my code:
from selenium import webdriver
url = 'http://www.lynda.com/sitemap/categories'
mydriver = webdriver.Chrome()
mydriver.get(url)
course_list = []
for a in [1,2,3]:
for b in range(1,73):
mydriver.find_element_by_xpath('//*[#id="main-content"]/div[2]/div[3]/div[%d]/ul/li[%d]/a' % (a,b)).click()
while True:
#click the button 'See more results' as long as it´s available
try:
mydriver.find_element_by_xpath('//*[#id="main-content"]/div[1]/div[3]/button').click()
except:
break
subject = mydriver.find_element_by_tag_name('h1') # pull out the subject
courses = mydriver.find_elements_by_tag_name('h3') # pull out the courses
for course in courses:
course_list.append(str(subject.text)+" -- " + str(course.text))
# go back to the initial site
mydriver.get(url)
Scroll to element before clicking:
see_more_results = browser.find_element_by_css_selector('button[class*=see-more-results]')
browser.execute_script('return arguments[0].scrollIntoView()', see_more_results)
see_more_results.click()
One solution how to repeat these actions could be:
def get_number_of_courses():
return len(browser.find_elements_by_css_selector('.course-list > li'))
number_of_courses = get_number_of_courses()
while True:
try:
button = browser.find_element_by_css_selector(CSS_SELECTOR)
browser.execute_script('return arguments[0].scrollIntoView()', button)
button.click()
while True:
new_number_of_courses = get_number_of_courses()
if (new_number_of_courses > number_of_courses):
number_of_courses = new_number_of_courses
break
except:
break
Caveat: it's always better to use build-in explicit wait than while True:
http://www.seleniumhq.org/docs/04_webdriver_advanced.jsp#explicit-waits
The problem is that you're calling a method to find element by class name, but you're passing a xpath. if you're sure this is the correct xpath you'll simply need to change to method to 'find_element_by_xpath'.
A recommendation if you allow: Try to stay away from these long xpaths and go through some tutorials on how to write efficient xpath for example.