First, I had never used Selenium until yesterday. After many attempts I was able to scrape the target table correctly.
I am currently trying to scrape the tables on sequential pages. Sometimes it works and other times it fails immediately. I have spent hours searching Google and Stack Overflow, but I have not solved my problem. I am sure the answer is something simple, but after 8 hours I need to ask the Selenium experts.
My target URL is: RedHat Security Advisories
If there is a question on Stack Overflow that already answers my problem, please let me know and I will do my own research and testing.
Here are some of the things I have tried:
Example 1:
page_number = 0
while True:
    try:
        page_number += 1
        browser.execute_script(
            "return arguments[0].scrollIntoView(true);",
            WebDriverWait(browser, 30).until(EC.element_to_be_clickable((
                By.XPATH,
                f'//*[@id="jumpPoint"]/div[3]/div/div/div[2]/div/div[2]/dir-pagination-controls/ul/li[{page_number}]'))))
        browser.find_element_by_xpath(
            f'//*[@id="jumpPoint"]/div[3]/div/div/div[2]/div/div[2]/dir-pagination-controls/ul/li[{page_number}]').click()
        print(f"Navigating to page {page_number}")
        # I added this because my connection was
        # being terminated by RedHat
        time.sleep(20)
    except (TimeoutException, WebDriverException) as e:
        print("Last page reached")
        break
    except Exception as e:
        print(e)
        break
Example 2:
page_number = 0
while True:
    try:
        page_number += 1
        browser.execute_script(
            "return arguments[0].scrollIntoView(true);",
            WebDriverWait(browser, 30).until(EC.element_to_be_clickable((
                By.XPATH,
                '//*[@id="jumpPoint"]/div[3]/div/div/div[2]/div/div[2]/dir-pagination-controls/ul/li[12]'))))
        browser.find_element_by_xpath(
            '//*[@id="jumpPoint"]/div[3]/div/div/div[2]/div/div[2]/dir-pagination-controls/ul/li[12]').click()
        print(f"Navigating to page {page_number}")
        # I added this because my connection was
        # being terminated by RedHat
        time.sleep(20)
    except (TimeoutException, WebDriverException) as e:
        print("Last page reached")
        break
    except Exception as e:
        print(e)
        break
You can use the logic below.
from random import randint
import time

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

lastPage = WebDriverWait(driver, 120).until(EC.element_to_be_clickable(
    (By.XPATH, "(//ul[starts-with(@class,'pagination hidden-xs ng-scope')]/li[starts-with(@ng-repeat,'pageNumber')])[last()]")))
driver.find_element_by_css_selector("i.web-icon-plus").click()
pages = lastPage.text
pages = '5'  # overrides the detected last page; remove this line to iterate over all pages
for pNumber in range(1, int(pages)):
    currentPage = WebDriverWait(driver, 30).until(EC.element_to_be_clickable(
        (By.XPATH, "//ul[starts-with(@class,'pagination hidden-xs ng-scope')]//a[.='" + str(pNumber) + "']")))
    print("===============================================")
    print("Current Page : " + currentPage.text)
    currentPage.location_once_scrolled_into_view
    currentPage.click()
    WebDriverWait(driver, 120).until_not(EC.element_to_be_clickable((By.CSS_SELECTOR, "#loading")))
    # print row data here
    rows = driver.find_elements_by_xpath("//table[starts-with(@class,'cve-table')]/tbody/tr")  # <== getting rows here
    for row in rows:
        print(row.text)  # <== printing all row data; if you want cell data, update the logic accordingly
    time.sleep(randint(1, 5))  # <== this step is optional
I believe you can read the data directly via the URL instead of clicking through the pagination; this avoids the synchronization issues that are probably making your script fail.
1. Use this XPath to get the total number of pages for the security-updates table:
//*[@id="jumpPoint"]/div[3]/div/div/div[2]/div/div[2]/dir-pagination-controls/ul/li[11]
2. Run a loop up to the page count obtained in step 1.
3. Inside the loop, put the page number into the URL below and send a GET request:
https://access.redhat.com/security/security-updates/#/security-advisories?q=&p=page_number&sort=portal_publication_date%20desc&rows=10&portal_advisory_type=Security%20Advisory&documentKind=PortalProduct
4. Wait for the page to load.
5. Read the data from the table populated on the page.
This process runs until the pagination count is reached (see the sketch after this list). In case you hit an error indicating that the site has blocked you, you can refresh the page with the same page_number.
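Here is a minimal sketch of that approach. It is untested against the live site and assumes browser is an existing WebDriver instance, that the cve-table XPath from the answer above still matches the rows, and that the total page count has already been read from the pagination control (hard-coded below for illustration).
import time

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

BASE_URL = ("https://access.redhat.com/security/security-updates/#/security-advisories"
            "?q=&p={page}&sort=portal_publication_date%20desc&rows=10"
            "&portal_advisory_type=Security%20Advisory&documentKind=PortalProduct")

total_pages = 5  # replace with the value scraped from the pagination control

for page in range(1, total_pages + 1):
    browser.get(BASE_URL.format(page=page))
    # Wait until at least one table row is present before reading the table.
    WebDriverWait(browser, 30).until(EC.presence_of_all_elements_located(
        (By.XPATH, "//table[starts-with(@class,'cve-table')]/tbody/tr")))
    rows = browser.find_elements(By.XPATH, "//table[starts-with(@class,'cve-table')]/tbody/tr")
    for row in rows:
        print(row.text)
    time.sleep(2)  # small pause so the site is not hammered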
I am trying to scroll down the job posts using the lines below, but sometimes it correctly scrolls to the end and sometimes it doesn't.
html = driver.find_element_by_tag_name('html')
time.sleep(5)
html.send_keys(Keys.END)
Can anyone suggest how to scroll down to the end? Please find the link below.
https://www.google.com/search?q=upsc+jobs+in+india&rlz=1C1CHBF_enIN869IN869&oq=upsc+jo&aqs=chrome.1.69i57j0i433i512j0i131i433i512j0i512l3j0i131i433i512l2j0i512j0i433i512&sourceid=chrome&ie=UTF-8&ibp=htl;jobs&sa=X&sqi=2&ved=2ahUKEwjR27GN_qPzAhX4ppUCHb_0B_QQkd0GegQIORAB#fpstate=tldetail&sxsrf=AOaemvIxuJXh3if0tw7ezZfjkXRe5DSxsA:1632911697417&htivrt=jobs&htidocid=N1_BNfzt8n8auXjGAAAAAA%3D%3D
The more you scroll, the more data you get; it's basically a dynamic website. I have hardcoded 50 as a dummy number; you can use 100 or any other number for that matter.
You can use this sample code:
driver = webdriver.Chrome(driver_path)
driver.maximize_window()
driver.implicitly_wait(30)
driver.get("https://www.google.com/search?q=upsc+jobs+in+india&rlz=1C1CHBF_enIN869IN869&oq=upsc+jo&aqs=chrome.1.69i57j0i433i512j0i131i433i512j0i512l3j0i131i433i512l2j0i512j0i433i512&sourceid=chrome&ie=UTF-8&ibp=htl;jobs&sa=X&sqi=2&ved=2ahUKEwjR27GN_qPzAhX4ppUCHb_0B_QQkd0GegQIORAB#fpstate=tldetail&sxsrf=AOaemvIxuJXh3if0tw7ezZfjkXRe5DSxsA:1632911697417&htivrt=jobs&htidocid=N1_BNfzt8n8auXjGAAAAAA%3D%3D")
j = 1
for i in range(50):
    element = driver.find_element(By.XPATH, f"(//div[@role='heading'])[{j}]")
    driver.execute_script("arguments[0].scrollIntoView(true);", element)
    j = j + 1
You can also try this to scroll till the end.
driver.get("https://www.google.com/search?q=upsc+jobs+in+india&rlz=1C1CHBF_enIN869IN869&oq=upsc+jo&aqs=chrome.1.69i57j0i433i512j0i131i433i512j0i512l3j0i131i433i512l2j0i512j0i433i512&sourceid=chrome&ie=UTF-8&ibp=htl;jobs&sa=X&sqi=2&ved=2ahUKEwjR27GN_qPzAhX4ppUCHb_0B_QQkd0GegQIORAB#fpstate=tldetail&sxsrf=AOaemvIxuJXh3if0tw7ezZfjkXRe5DSxsA:1632911697417&htivrt=jobs&htidocid=N1_BNfzt8n8auXjGAAAAAA%3D%3D")
i = 0
try:
while True:
options = driver.find_elements_by_xpath("//div[#role='treeitem']")
driver.execute_script("arguments[0].scrollIntoView(true);",options[i])
i+=1
time.sleep(.5)
except:
pass
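If scrolling element by element is flaky, a common alternative pattern is to scroll to the bottom repeatedly until the document height stops growing. This is only a sketch: it assumes the results are loaded in the main document, whereas the Google Jobs panel may use an inner scrollable container, in which case you would have to scroll that element's scrollTop/scrollHeight instead.
import time

last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # give the page time to load more results
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break  # no new content was loaded, so we are at the end
    last_height = new_height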
I made an Instagram bot about a year ago.
From time to time Instagram has changed its code structure, but I have always managed to modify the bot so that it keeps working.
A couple of weeks ago, however, Instagram changed drastically.
I have made some changes, but I will skip the Following part for the moment.
Currently I am facing issues with selecting the next image after liking one.
for hashtag in hashtag_list:
    tag += 1
    webdriver.get('https://www.instagram.com/explore/tags/' + hashtag_list[tag] + '/')
    sleep(5)
    first_thumbnail = webdriver.find_element_by_xpath('//*[@id="react-root"]/section/main/article/div[1]/div/div/div[1]/div[1]/a/div')
    first_thumbnail.click()
    sleep(randint(10, 15))
    try:
        for x in range(1, 30):
            # Liking the picture
            like_button = webdriver.find_element_by_xpath('//*[@aria-label="Like"]')
            like_button.click()
            likes += 1
            sleep(5)
            # Comments and tracker
            comm_prob = randint(1, 11)
            print('{}_{}: {}'.format(hashtag, x, comm_prob))
            if comm_prob > 7:
                comments += 1
                #webdriver.find_element_by_xpath('/html/body/div[5]/div[2]/div/article/div[3]/section[1]/span[2]/button').click()
                comment_button = webdriver.find_element_by_xpath('//*[@aria-label="Comment"]')
                comment_button.click()
                comment_box = webdriver.find_element_by_xpath('/html/body/div[5]/div[2]/div/article/div[3]/section[3]/div/form/textarea')
                if (comm_prob < 7):
                    comment_box.send_keys('Really cool :D!')
                    sleep(5)
                elif (comm_prob > 6) and (comm_prob < 9):
                    comment_box.send_keys('Interesting work!')
                    sleep(5)
                elif comm_prob == 9:
                    comment_box.send_keys('Nice gallery!')
                    sleep(5)
                elif comm_prob == 10:
                    comment_box.send_keys('Cool view!')
                    sleep(5)
                elif comm_prob == 11:
                    comment_box.send_keys('Wonderful view :)')
                    sleep(5)
                # Enter to post comment
                comment_box.send_keys(Keys.ENTER)
                sleep(3)
            followed += 1
            nxt = webdriver.find_element_by_link_text('Next')
            nxt.click()
            sleep(2)
            nxt = webdriver.find_element_by_link_text('Next')
            nxt.click()
            sleep(2)
            nxt = webdriver.find_element_by_link_text('Next')
            nxt.click()
            sleep(2)
    # some hashtag stops refreshing photos (it may happen sometimes), it continues to the next
    except:
        continue
I have managed to make it work again using the code above. Right now it likes a picture and then skips one. I tried removing one of the "Next" blocks, but then it won't move to the next picture at all... well, better than nothing.
I guess your logical bug is here:
if webdriver.find_element_by_xpath('/html/body/div[5]/div[2]/div/article/header/div[2]/div[1]/div/span/a').text != 'x':
The /html/body/div[5]/div[2]/div/article/header/div[2]/div[1]/div/span/a locator matches the opened thumbnail's user name, but you are comparing it with the string 'x'.
Since the user name never equals 'x', you will always enter this block and keep liking and unliking the same post, the first one, in an endless loop.
Moreover, there is an else case for the above if.
It checks the clicks counter. Since you always take the if branch and only increase the clicks counter there, you will never click with webdriver.find_element_by_xpath('/html/body/div[5]/div[1]/div/div/a').click(); instead you will try to perform the click with webdriver.find_element_by_xpath('/html/body/div[5]/div[1]/div/div/a[2]').click().
However, that locator is wrong; there is no such element. So nothing will be clicked, you will stay inside the first opened thumbnail, and you will click its "Like" button there endlessly.
In addition, I would recommend never using automatically generated locators like the ones you are using here; they are extremely unreliable. You have to learn how to write proper locators.
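For example, a relative locator anchored on a stable attribute or on visible text is usually far more reliable than an absolute /html/body/... path. A small illustrative sketch follows; the selectors are only examples (Instagram's markup changes frequently), and driver stands for your WebDriver instance.
from selenium.webdriver.common.by import By

# Brittle, auto-generated absolute XPath (avoid):
#   /html/body/div[5]/div[1]/div/div/a[2]

# More robust alternatives, anchored on a stable attribute or visible text:
like_button = driver.find_element(By.XPATH, "//*[@aria-label='Like']")
next_button = driver.find_element(By.XPATH, "//a[contains(., 'Next')]")
next_button.click()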
I am scraping a list of websites for images using Selenium WebDriver + Scrapy, but the "next" button on each website has a different class/div name. How can I automatically find the next pages on different sites to scrape?
Here is the most basic structure for pagination:
# Skeleton only: next_button must be (re)located on every iteration with a site-specific selector.
page_count = 1
while next_button.is_enabled():
    # Increase page_count by 1 on each iteration
    page_count += 1
    try:
        next_button.click()
    except NoSuchElementException:
        # Stop the loop if no more pages are available
        break
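Since every site names its "next" control differently, one pragmatic approach is to try a handful of common patterns until one matches. The following is only a rough sketch under that assumption; the candidate selectors are an example set, not an exhaustive or guaranteed list.
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

def find_next_button(driver):
    """Try a few common 'next page' patterns; return the first match or None."""
    candidates = [
        (By.XPATH, "//a[@rel='next']"),
        (By.XPATH, "//a[contains(translate(., 'NEXT', 'next'), 'next')]"),
        (By.CSS_SELECTOR, "a.next, li.next > a, a[aria-label='Next']"),
    ]
    for by, selector in candidates:
        try:
            return driver.find_element(by, selector)
        except NoSuchElementException:
            continue
    return None

# Usage: keep paginating until no recognizable "next" button is left.
# button = find_next_button(driver)
# while button is not None and button.is_enabled():
#     button.click()
#     button = find_next_button(driver)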
This is an extended question about how to click a "More" button on a webpage.
Below is my previous question, which one person kindly answered.
Since I'm not that familiar with the "find element by class name" function, I just added that person's revised code to my existing code, so my revised code may not be efficient (my apologies).
Python click 'More' button is not working
The situation is that there are two kinds of "More" button: the first is in the property description part and the second is in the text reviews part. If you click just one "More" button in any of the reviews, all the reviews expand so that you can see the full review text.
The issue I run into is that I can click the "More" button for the reviews on the 1st page, but it is not clickable for the reviews on the 2nd page.
Below is the error message I get, although my code keeps running (it does not stop when it hits the error):
Message: no such element: Unable to locate element: {"method":"tag name","selector":"span"}
Based on my understanding, there is an "entry" class and a corresponding span for every review, so I don't understand why Python says it can't find it.
from selenium import webdriver
from selenium.webdriver import ActionChains
from bs4 import BeautifulSoup
import time  # needed for the time.sleep() calls below

review_list=[]
review_appended_list=[]
review_list_v2=[]
review_appended_list_v2=[]
listed_reviews=[]
listed_reviews_v2=[]
listed_reviews_total=[]
listed_reviews_total_v2=[]
final_list=[]

#Incognito Mode
option = webdriver.ChromeOptions()
option.add_argument("--incognito")

#Open Chrome
driver = webdriver.Chrome(executable_path="C:/Users/chromedriver.exe", options=option)

#url I want to visit (I'm going to loop over multiple listings but for simplicity, I just added one listing url).
lists = ['https://www.tripadvisor.com/VacationRentalReview-g30196-d6386734-Hot_51st_St_Walk_to_Mueller_2BDR_Modern_sleeps_7-Austin_Texas.html']

for k in lists:
    driver.get(k)
    time.sleep(3)

    #click 'More' on description part.
    link = driver.find_element_by_link_text('More')
    try:
        ActionChains(driver).move_to_element(link)
        time.sleep(1)  # time to move to link
        link.click()
        time.sleep(1)  # time to update HTML
    except Exception as ex:
        print(ex)

    time.sleep(3)

    # first "More" shows text in all reviews - there is no need to search other "More"
    try:
        first_entry = driver.find_element_by_class_name('entry')
        more = first_entry.find_element_by_tag_name('span')
        #more = first_entry.find_element_by_link_text('More')
    except Exception as ex:
        print(ex)

    try:
        ActionChains(driver).move_to_element(more)
        time.sleep(1)  # time to move to link
        more.click()
        time.sleep(1)  # time to update HTML
    except Exception as ex:
        print(ex)

    #begin parsing html and scraping data.
    html = driver.page_source
    soup = BeautifulSoup(html, "html.parser")
    listing = soup.find_all("div", class_="review-container")

    all_reviews = driver.find_elements_by_class_name('wrap')
    for review in all_reviews:
        all_entries = review.find_elements_by_class_name('partial_entry')
        if all_entries:
            review_list = [all_entries[0].text]
            review_appended_list.extend([review_list])

    for i in range(len(listing)):
        review_id = listing[i]["data-reviewid"]
        listing_v1 = soup.find_all("div", class_="rating reviewItemInline")
        rating = listing_v1[i].span["class"][1]
        review_date = listing_v1[i].find("span", class_="ratingDate relativeDate")
        review_date_detail = review_date["title"]
        listed_reviews = [review_id, review_date_detail, rating[7:8]]
        listed_reviews.extend([k])
        listed_reviews_total.append(listed_reviews)

    for a, b in zip(listed_reviews_total, review_appended_list):
        final_list.append(a + b)

    #loop over from the 2nd page of the reviews for the same listing.
    for j in range(5, 20, 5):
        url_1 = '-'.join(k.split('-', 3)[:3])
        url_2 = '-'.join(k.split('-', 3)[3:4])
        middle = "-or%d-" % j
        final_k = url_1 + middle + url_2
        driver.get(final_k)
        time.sleep(3)

        link = driver.find_element_by_link_text('More')
        try:
            ActionChains(driver).move_to_element(link)
            time.sleep(1)  # time to move to link
            link.click()
            time.sleep(1)  # time to update HTML
        except Exception as ex:
            print(ex)

        # first "More" shows text in all reviews - there is no need to search other "More"
        try:
            first_entry = driver.find_element_by_class_name('entry')
            more = first_entry.find_element_by_tag_name('span')
        except Exception as ex:
            print(ex)

        try:
            ActionChains(driver).move_to_element(more)
            time.sleep(2)  # time to move to link
            more.click()
            time.sleep(2)  # time to update HTML
        except Exception as ex:
            print(ex)

        html = driver.page_source
        soup = BeautifulSoup(html, "html.parser")
        listing = soup.find_all("div", class_="review-container")

        all_reviews = driver.find_elements_by_class_name('wrap')
        for review in all_reviews:
            all_entries = review.find_elements_by_class_name('partial_entry')
            if all_entries:
                #print('--- review ---')
                #print(all_entries[0].text)
                #print('--- end ---')
                review_list_v2 = [all_entries[0].text]
                #print (review_list)
                review_appended_list_v2.extend([review_list_v2])
                #print (review_appended_list)

        for i in range(len(listing)):
            review_id = listing[i]["data-reviewid"]
            #print review_id
            listing_v1 = soup.find_all("div", class_="rating reviewItemInline")
            rating = listing_v1[i].span["class"][1]
            review_date = listing_v1[i].find("span", class_="ratingDate relativeDate")
            review_date_detail = review_date["title"]
            listed_reviews_v2 = [review_id, review_date_detail, rating[7:8]]
            listed_reviews_v2.extend([k])
            listed_reviews_total_v2.append(listed_reviews_v2)

        for a, b in zip(listed_reviews_total_v2, review_appended_list_v2):
            final_list.append(a + b)

        print(final_list)

        if len(listing) != 5:
            break
How can I enable clicking the "More" button on the 2nd and subsequent pages so that I can scrape the full text of the reviews?
Edit:
The error messages I get are these two lines:
Message: no such element: Unable to locate element: {"method":"tag name","selector":"span"}
Message: stale element reference: element is not attached to the page document
I guess my whole script keeps running because I used try/except? Usually when Python runs into an error, it stops running.
Try clicking it with JavaScript instead:
driver.execute_script("""
arguments[0].click()
""", link)
I am very new to programming and started teaching myself web scraping with Python.
I am scraping player data from multiple pages of a site and built a while loop which scrapes the href of a "next" button to get to the next player's page.
Everything is working out fine except breaking the while loop after the last available player. The "next" button grays out and has no link behind it, so at that point I want to stop the iteration and save everything to a CSV.
My script looks like this:
#name base url and first page to start
BaseUrl = #url
PageUrl = #also url

while True:
    #scraping tables

    try:
        # retrieve link for 'next' player in order
        link = soup.find(attrs={"class": "go_to_next_player"}).get('href')
        # join base url and new link href
        PageUrl = BaseUrl + link
        if link is None:
            break
    except IndexError as e:
        print(e)
        break

#writing to csv
I thought I could check whether the retrieved href is empty by testing "is None" and breaking, but I get this error on the line PageUrl = BaseUrl + link:
TypeError: must be str, not NoneType
Help would be greatly appreciated! I am very new to this, so please disregard my beginner code.
You can check if link is None before doing any operations with it, and then break the loop:
if link is not None:
    PageUrl = BaseUrl + link
else:
    break
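In context, the corrected loop could look roughly like the sketch below; the table scraping and CSV writing are left as comments because they are not shown in the question, and the None check also covers the case where the "next" element itself is missing.
while True:
    # ... scrape the tables on the current page here ...

    next_button = soup.find(attrs={"class": "go_to_next_player"})
    link = next_button.get('href') if next_button else None

    if link is None:
        # last player reached: the 'next' button has no href (or is absent)
        break

    # safe: link is guaranteed to be a string here
    PageUrl = BaseUrl + link

# ... write everything to the csv here ...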