Running my script I get items like "javascript:getDetail(19978)" as the href. If I concatenate the number in parentheses with "https://www.aopa.org/airports/4M3/business/", it produces a valid-looking link. However, clicking one of these newly created links takes me to a different page than the one I reach by clicking the link on the original page. How can I get the original links instead of "javascript:getDetail(19978)"? The search should be performed by typing "All" into the search box.
The code I've tried with:
from selenium import webdriver
import time
link = "https://www.aopa.org/airports/4M3/business/"
driver = webdriver.Chrome()
driver.get("https://www.aopa.org/learntofly/school/")
driver.find_element_by_id('searchTerm').send_keys('All')
driver.find_element_by_id('btnSearch').click()
time.sleep(5)
for pro in driver.find_elements_by_xpath('//td/a'):
    print(pro.get_attribute("href"))
driver.quit()
Code to create new links with the base url I pasted in my description:
from selenium import webdriver
import time
link = "https://www.aopa.org/airports/4M3/business/"
driver = webdriver.Chrome()
driver.get("https://www.aopa.org/learntofly/school/")
driver.find_element_by_id('searchTerm').send_keys('All')
driver.find_element_by_id('btnSearch').click()
time.sleep(5)
for item in driver.find_elements_by_xpath('//td/a'):
    fresh = item.get_attribute("href").replace("javascript:getDetail(", "")
    print(link + fresh.replace(")", ""))
driver.quit()
However, these newly created links lead me to different destinations.
FYI, the original links are embedded within elements like the one below:
<td><a href="javascript:getDetail(19978)">GOLD DUST FLYING SERVICE, INC.</a></td>
Clicking a link triggers an XHR. The page actually stays the same, but the data received as JSON is rendered in place of the previous content.
If you want to open the raw data as an HTML page, you might try something like
# To get list of entries as ["19978", "30360", ... ]
entries = [a.get_attribute('href').split("(")[1].split(")")[0] for a in driver.find_elements_by_xpath('//td/a')]
url = "https://www.aopa.org/learntofly/school/wsSearch.cfm?method=schoolDetail&businessId="
for entry in entries:
    driver.get(url + entry)
    print(driver.page_source)
You might also use requests to get each JSON response, as in
import requests
for entry in entries:
    print(requests.get(url + entry).json())
without rendering the data in a browser.
If you look at how getDetail() is implemented in the page source and explore the "Network" tab when you click each of the search result links, you can see that multiple XHR requests are issued per result and some client-side logic is executed to build a search result page.
If you don't want to dive into replicating all of the logic that builds each search result page, simply go back and forth between the search results page and each individual result page:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()
wait = WebDriverWait(driver, 10)
driver.get("https://www.aopa.org/learntofly/school/")
driver.find_element_by_id('searchTerm').send_keys('All')
driver.find_element_by_id('btnSearch').click()
# wait for search results to be visible
table = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "#searchResults table")))
for index in range(len(table.find_elements_by_css_selector('td a[href*=getDetail]'))):
    # re-locate the results table on every iteration to avoid stale element references
    table = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "#searchResults table")))
    # get the next link
    link = table.find_elements_by_css_selector('td a[href*=getDetail]')[index]
    link_text = link.text
    link.click()
    print(link_text)
    # TODO: get details
    # go back
    back_link = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "#schoolDetail a[href*=backToList]")))
    driver.execute_script("arguments[0].click();", back_link)
driver.quit()
Note the use of Explicit Waits instead of hardcoded "sleeps".
It may actually make sense to avoid using selenium here altogether and approach the problem "headlessly" - doing HTTP requests (via requests module) to the website's API.
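As a rough illustration of that headless approach, here is a minimal sketch. It assumes the wsSearch.cfm detail endpoint shown above returns JSON for a given businessId; the IDs themselves would still need to be collected first, for example from the scraped href values.
import requests

# Minimal sketch, assuming the wsSearch.cfm endpoint above returns JSON.
# The business ID below is a placeholder taken from the question.
url = "https://www.aopa.org/learntofly/school/wsSearch.cfm?method=schoolDetail&businessId="
business_ids = ["19978"]

session = requests.Session()
for business_id in business_ids:
    response = session.get(url + business_id)
    response.raise_for_status()
    print(response.json())  # each response is assumed to hold one school's detail record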
I am a newbie to web scraping using Selenium and I am scraping seetickets.us.
My scraper works as follows.
sign in
search for events
click on each event
scrape data
come back
click on next event
repeat
Now the problem is that some of the events do not contain certain elements. For example,
this event: https://wl.seetickets.us/event/Beta-Hi-Fi/484490?afflky=WorldCafeLive
does not contain a pricing table,
but this one does:
https://www.seetickets.us/event/Wake-Up-Daisy-1100AM/477633
so I have used try/except blocks:
try:
    find element
except:
    return none
but if it doesn't find the element in the try block, it takes 5 seconds to reach the except block because I have used
driver.implicitly_wait(5)
Now, if a page is missing several of these elements, Selenium takes a very long time to scrape it.
I have thousands of pages to scrape. What can be done to speed up the process?
Thanks
To speed up web scraping using Selenium:
Remove implicitly_wait() entirely.
Use WebDriverWait to synchronize the webdriver instance with the browser, waiting for any of the following element states:
presence_of_element_located()
visibility_of_element_located()
element_to_be_clickable()
Your effective code block will be:
try:
    element = WebDriverWait(driver, 3).until(EC.visibility_of_element_located((By.ID, "input")))
    print("Element is visible")
except TimeoutException:
    print("Element is not visible")
Note: You have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
Instead of an implicit wait, try to use an explicit wait, but apply it only to the search for the main container, to wait for the content to be loaded. For all inner elements, use find_element with no waits.
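For illustration, here is a minimal sketch of that pattern, assuming driver is your existing webdriver instance; the container ID "eventDetail" and the inner selector are placeholders, not the real locators from the site:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException

# Wait once for the main container of the event page (placeholder ID)
container = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "eventDetail"))
)
# Inner elements are looked up with plain find_element calls and no waits,
# so a missing pricing table fails immediately instead of after several seconds
try:
    pricing_table = container.find_element_by_css_selector("table.pricing")  # placeholder selector
except NoSuchElementException:
    pricing_table = None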
P.S. It's always better to share your real code instead of pseudo-code
Instead of using an implicit wait and waiting for each individual element, only wait for the full page load, for example by waiting for the h1 tag, which indicates the page has loaded, then proceed with extraction.
# wait for page load
try:
    pageLoadCheck = WebDriverWait(driver, 5).until(EC.presence_of_element_located((By.XPATH, "(//h1)[1]"))).get_attribute("textContent").strip()
    # extract data without any wait once the page is loaded
    try:
        dataOne = driver.find_element_by_xpath("((//h1/following-sibling::div)[1]//a[contains(@href,'tel:')])[1]").get_attribute("textContent").strip()
    except:
        dataOne = ''
except Exception as e:
    print(e)
So I get the messages from this element:
<pre class="_3Gy8WZD53wWAE41lr57by3 ">Sleep</pre>
My code:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
PATH = 'C:\\Users\\User\\Desktop\\chromedriver.exe'
driver = webdriver.Chrome(PATH)
driver.get('https://www.reddit.com')
time.sleep(80) # TIME TO LOGIN IN
search = driver.find_element_by_class_name('_3Gy8WZD53wWAE41lr57by3 ')
print(driver.find_element_by_xpath(".//pre").text) # *LET'S CALL THIS 'S'*
And everything works, kinda. When I print 's', it prints out the last message from that chat.
Note that whenever someone enters a message, it will appear under that class: '_3Gy8WZD53wWAE41lr57by3'
My goal is to print out the first message from that chat.
I would suggest 2 changes to your code which'll save you major frustration:
Avoid hardcoded sleep calls; instead, wait for the presence of elements. This allows your program to wait as little as possible for the page you're trying to load.
Utilize CSS selectors instead of XPath: you have much finer control over accessing elements, and your code becomes more robust and flexible.
In terms of execution, here's how that looks:
Wait up to 80 seconds for login:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
# Get the page, now the user will need to log in
driver.get('https://www.reddit.com')
# Wait until the page is loaded, up to 80 seconds
try:
    element = WebDriverWait(driver, 80).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "pre._3Gy8WZD53wWAE41lr57by3"))
    )
except TimeoutException:
    print("You didn't log in, shutting down program")
    driver.quit()
# continue as normal here
Utilize css selectors to find your messages:
# I personally like to always use the plural form of this function
# since, if it fails, it returns an empty list. The single form of
# this function results in an error if no results are found
# NOTE: utilize reddit's class for comments, you may need to change the css selector
all_messages = driver.find_elements_by_css_selector('pre._3Gy8WZD53wWAE41lr57by3')
# You can now access the first and last elements from this:
first_message = all_messages[0].text
last_message = all_messages[-1].text
# Alternatively, if you are concerned about memory usage from potentially large
# lists of messages, use the CSS pseudo-classes 'first-of-type' / 'last-of-type'
# NOTE: take the first item of the list only if the list is non-empty,
# otherwise fall back to None
first_message = driver.find_elements_by_css_selector('pre._3Gy8WZD53wWAE41lr57by3:first-of-type')
first_message = first_message[0] if first_message else None
last_message = driver.find_elements_by_css_selector('pre._3Gy8WZD53wWAE41lr57by3:last-of-type')
last_message = last_message[0] if last_message else None
I hope this provides an immediate solution but also some fundamentals on how to optimize your web scraping moving forward.
I want to scrape the comments off this page using beautifulsoup - https://www.x....s.com/video_id/the-suburl
The comments are loaded on click via JavaScript. The comments are paginated and each page loads comments on click too. I wish to fetch all comments. For each comment, I want to get the poster's profile URL, the comment text, the number of likes, the number of dislikes, and the time posted (as stated on the page).
The comments can be a list of dictionaries.
How do I go about this?
This script will print all comments found on the page:
import json
import requests
from bs4 import BeautifulSoup
url = 'https://www.x......com/video_id/gggjggjj/'
video_id = url.rsplit('/', maxsplit=2)[-2].replace('video', '')
u = 'https://www.x......com/threads/video/ggggjggl/{video_id}/0/0'.format(video_id=video_id)
comments = requests.post(u, data={'load_all':1}).json()
for id_ in comments['posts']['ids']:
    print(comments['posts']['posts'][id_]['date'])
    print(comments['posts']['posts'][id_]['name'])
    print(comments['posts']['posts'][id_]['url'])
    print(BeautifulSoup(comments['posts']['posts'][id_]['message'], 'html.parser').get_text())
    # ...etc.
    print('-'*80)
This can be done with Selenium. Selenium emulates a browser. Depending on your preference you can use the Chrome driver (chromedriver) or the Firefox driver (geckodriver).
Here is a link on how to install the chrome webdriver:
http://jonathansoma.com/lede/foundations-2018/classes/selenium/selenium-windows-install/
Then in your code here is how you would set it up:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
# this part may change depending on where you installed the webdriver.
# You may have to define the path to the driver.
# For me my driver is in C:/bin so I do not need to define the path
chrome_options = Options()
# use '--start-maximized' instead if you want the browser window to open
chrome_options.add_argument('--headless')
driver = webdriver.Chrome(options=chrome_options)
driver.get(your_url)
html = driver.page_source  # gets the rendered HTML from the driver
Selenium has several functions that you can use to perform actions such as clicking on elements on the page. Once you find an element with Selenium you can use the .click() method to interact with it.
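As a rough sketch, clicking a "load comments" style button and then parsing the updated page source might look like the following; the selector button.load-comments is a placeholder and would need to be replaced with the real one from the page:
from bs4 import BeautifulSoup

# The selector below is a placeholder; inspect the page to find the real one
load_button = driver.find_element_by_css_selector('button.load-comments')
load_button.click()
# After the click, re-read the page source and parse it with Beautiful Soup
soup = BeautifulSoup(driver.page_source, 'html.parser')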
Let me know if this helps
I am using python to scrape some data from a website in combination with selenium and Beautiful Soup. This page has buttons you can click which change the data displayed in the tables, but this is all handled by the javascript in the page. The page url does not change.
Selenium successfully renders the JavaScript on the page on load, but it continues using the previous state (before the clicks), therefore scraping the same data instead of the new data.
I tried following the solutions given on Obey The Testing Goat, but it always seemed to time out and never reported the state as stale. I've tried waiting 10 seconds manually with time.sleep so the state could refresh in the meantime. I've tried using WebDriverWait to wait until the old page turned stale. I've tried looking through the Selenium documentation for possible solutions. The code presented below attempts to use the solution presented on that website, but it simply times out no matter the timeout value.
from contextlib import contextmanager
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support.expected_conditions import staleness_of
class MySeleniumTest():
    # assumes self.browser is a selenium webdriver
    def __init__(self, browser, soup):
        self.browser = browser
        self.soup = soup
    @contextmanager
    def wait_for_page_load(self, timeout=30):
        old_page = self.browser.find_element_by_tag_name('html')
        yield
        WebDriverWait(self.browser, timeout).until(staleness_of(old_page))
    def tryChangingState(self):
        with self.wait_for_page_load(timeout=20):
            og_state = self.soup
            tab = self.browser.find_element_by_link_text('Breakfast')
            tab.click()
            tab = self.browser.find_element_by_link_text('Lunch')
            tab.click()
            new_state = self.soup
            # check if the HTML code has changed
            print(og_state != new_state)
# create tester object
tester = MySeleniumTest(browser, soup)
# try changing state by clicking on the buttons
tester.tryChangingState()
I'm not sure if I'm using it correctly. I also tried creating a new with self.wait_for_page_load(timeout=20): block after the first click and putting the rest of the code within it, but this did not work either. I would expect og_state != new_state to result in True, implying the HTML changed, but the actual result is False.
Original poster here. I found the reason for the issue. The state was being updated in Selenium, but since I was using Beautiful Soup for parsing, the Beautiful Soup object was still using the source code from the previous Selenium webdriver state. By updating the soup object each time the page was clicked, the scraper was able to successfully gather the new data.
I updated the soup object by simply calling soup = BeautifulSoup(browser.page_source, 'lxml')
In other words, I didn't need to worry about the state of the selenium web driver, it was simply an issue of updating the source code the parser was reading.
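A minimal sketch of that fix, assuming browser is the existing webdriver and the Breakfast/Lunch link texts from the question:
from bs4 import BeautifulSoup

# Click a tab, then rebuild the soup from the driver's current page source
tab = browser.find_element_by_link_text('Breakfast')
tab.click()
soup = BeautifulSoup(browser.page_source, 'lxml')  # re-parse after every interaction
# ... scrape the breakfast data from soup ...
tab = browser.find_element_by_link_text('Lunch')
tab.click()
soup = BeautifulSoup(browser.page_source, 'lxml')  # re-parse again for the new state
# ... scrape the lunch data from soup ...
In practice you may still need an explicit wait between the click and the re-parse so the new content has time to render.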
I've recently been trying to learn Selenium and found a website that just ignores my attempts to find a particular element by ID, name, or XPath. The website is here:
https://www.creditview.pl/PL/Creditview.htm
I am trying to select the first text field, the one labeled Uzytkownik.
I am trying to find it using several methods:
from selenium import webdriver
browser = webdriver.Chrome()
site = "https://www.creditview.pl/pl/creditview.htm"
browser.get(site)
login_txt = browser.find_element_by_xpath(r"/html//input[@id='ud_username']")
login_txt2 = browser.find_element_by_id("ud_username")
login_txt3 = browser.find_element_by_name("ud_username")
No matter what I try I keep getting:
selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element:
as if the element wasn't there at all.
I have suspected that the little frame containing the field might be an iframe and tried switching to various frames with no luck. I also tried to check whether the element isn't somehow hidden from my code (a hidden element). Nothing seems to work, or I am making some newbie mistake and the answer is right in front of me. Finally I was able to select another element on the site and sent several TAB keys to move the cursor to the desired position, but it feels like cheating.
Can someone please show me how to find the element? I literally can't sleep because of this issue :)
Given that your element is there, you still need to wait for the element to be loaded/visible/clickable, etc. You can do that using Selenium's expected conditions (EC).
from selenium import webdriver
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.common.by import By
my_XPATH = r"/html//input[@id='ud_username']"
wait_time = 10  # define maximum time to wait in seconds
driver = webdriver.Chrome()
site = "https://www.creditview.pl/pl/creditview.htm"
driver.get(site)
try:
    my_element = WebDriverWait(driver, wait_time).until(EC.presence_of_element_located((By.XPATH, my_XPATH)))
except:
    print("element not found after %d seconds" % (wait_time))