Speed up web scraping with Selenium - python-3.x

I am a newbie to web scraping using Selenium and I am scraping seetickets.us.
My scraper works as follows:
sign in
search for events
click on each event
scrape data
come back
click on next event
repeat
Now the problem is that some of the events do not contain certain elements. For example,
this event: https://wl.seetickets.us/event/Beta-Hi-Fi/484490?afflky=WorldCafeLive
does not contain a pricing table,
but this one does:
https://www.seetickets.us/event/Wake-Up-Daisy-1100AM/477633
so I have used try/except blocks:
try:
    find element
except:
    return None
but if it does not find the element in the try block, it takes 5 seconds to reach the except block, because I have used
driver.implicitly_wait(5)
Now, if a page does not contain several of these elements, Selenium takes a very long time to scrape that page.
I have thousands of pages to scrape. What can I do to speed up the process?
Thanks

To speed up web scraping using Selenium:
Remove implicitly_wait() entirely.
Induce WebDriverWait to synchronise the WebDriver instance with the browser for any of the following element states:
presence_of_element_located()
visibility_of_element_located()
element_to_be_clickable()
Your effective code block will be:
try:
    element = WebDriverWait(driver, 3).until(EC.visibility_of_element_located((By.ID, "input")))
    print("Element is visible")
except TimeoutException:
    print("Element is not visible")
Note: You have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
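As a rough sketch of how this could look in the event-scraping loop from the question, assuming a hypothetical pricing-table locator (the real selector on seetickets.us will differ):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

def scrape_optional(driver, locator, timeout=2):
    """Return the element's text if it appears within `timeout` seconds, else None."""
    try:
        element = WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located(locator))
        return element.text
    except TimeoutException:
        return None

driver = webdriver.Chrome()
# No implicit wait is set, so an absent element only costs the short
# per-element timeout instead of 5 seconds on every missing element.
driver.get("https://www.seetickets.us/event/Wake-Up-Daisy-1100AM/477633")
pricing = scrape_optional(driver, (By.CSS_SELECTOR, "table.pricing"))  # hypothetical locator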

Instead of an implicit wait, try to use an explicit wait, but apply it only to the search for the main container, to wait for the content to be loaded. For all inner elements apply find_element with no waits.
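A minimal sketch of that pattern, assuming a hypothetical container selector for the event page:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException, TimeoutException

driver = webdriver.Chrome()
driver.get("https://www.seetickets.us/event/Wake-Up-Daisy-1100AM/477633")

# Wait once for the main content container (hypothetical selector).
try:
    container = WebDriverWait(driver, 5).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "div.event-details")))
except TimeoutException:
    container = None

# Inner elements are looked up with no wait at all, so a missing
# pricing table costs practically no time.
price_table = None
if container is not None:
    try:
        price_table = container.find_element(By.CSS_SELECTOR, "table.pricing")
    except NoSuchElementException:
        pass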
P.S. It's always better to share your real code instead of pseudo-code

Instead of using an implicit wait and waiting for each individual element, wait only for the full page to load, for example by waiting for the h1 tag, which indicates the page has loaded, and then proceed with the extraction.
# wait for page load
try:
    pageLoadCheck = WebDriverWait(driver, 5).until(
        EC.presence_of_element_located((By.XPATH, "(//h1)[1]"))).get_attribute("textContent").strip()
    # extract data without any wait once the page is loaded
    try:
        dataOne = driver.find_element_by_xpath("((//h1/following-sibling::div)[1]//a[contains(@href,'tel:')])[1]").get_attribute("textContent").strip()
    except:
        dataOne = ''
except Exception as e:
    print(e)

Related

Can Selenium detect when webpage finishes loading in Python3?

In Python I wrote:
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=chrome_options)
driver.get(url)
try:
    WebDriverWait(driver, 10).until(
        lambda driver: driver.execute_script('return document.readyState') == 'complete')
except se.TimeoutException:
    return False
# Start Parsing
Even though I have waited for readyState, for some websites, when I parse the page I see that there is no checkbox. But if I add time.sleep(5) before parsing, for the same website I see that there is a checkbox.
My question is, how can I have a general solution that works with the majority of websites? I can't just write time.sleep(5), as some websites might need much more time and some might finish within 0.001 seconds (which would have a bad impact on performance...).
I just want to simulate a real browser and not handle anything before the refresh button appears again (which means everything has loaded).
Ideally, web applications accessed through get() return control to the WebDriver only when document.readyState equals complete. So unless the AUT (Application Under Test) behaves otherwise, the following line of code is typically an overhead:
WebDriverWait(driver, 10).until(lambda driver: driver.execute_script('return document.readyState') == 'complete')
However, as per your test requirements you can configure the pageLoadStrategy as either of the following (a minimal sketch follows below):
none
eager
normal
You can find a detailed discussion in What is the correct syntax checking the .readyState of a website in Selenium Python
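A minimal sketch of setting the strategy, assuming Selenium 4 style options; with "eager", get() returns once the DOM is ready, without waiting for images and subresources, so content rendered later by JavaScript still needs an explicit wait:
from selenium import webdriver

options = webdriver.ChromeOptions()
options.page_load_strategy = "eager"  # "normal" (default), "eager" or "none"

driver = webdriver.Chrome(options=options)
driver.get("https://www.example.com")
# get() now returns after DOMContentLoaded; anything fetched later via
# JavaScript still has to be waited for with WebDriverWait.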
At this point, it is to be noted that using time.sleep(secs) without any specific condition to wait for defeats the purpose of automation and should be avoided at any cost.
Solution
The generic approach that would work with all the websites is to induce WebDriverWait as per the prevailing test scenario. As an example (a rough sketch for the checkbox case follows the list below):
To wait for the presence of an element you need to invoke the expected_conditions of presence_of_element_located()
To wait for the visibility of an element you need to invoke the expected_conditions of visibility_of_element_located()
To wait for the element to be visible, enabled and interactable such that you can click it you need to invoke the expected_conditions of element_to_be_clickable()
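A rough sketch for the checkbox from the question; the locator is a placeholder, since the actual page markup is not shown:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

driver = webdriver.Chrome()
driver.get("https://www.example.com")  # placeholder URL; the real page is not shown in the question

try:
    # Placeholder locator: wait for the checkbox that is rendered by
    # JavaScript after document.readyState is already "complete".
    checkbox = WebDriverWait(driver, 10).until(
        EC.element_to_be_clickable((By.CSS_SELECTOR, "input[type='checkbox']")))
except TimeoutException:
    checkbox = None  # the checkbox never appeared within 10 seconds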

How do I print the last message from a Reddit message group using Selenium

So I get the messages from this line:
<pre class="_3Gy8WZD53wWAE41lr57by3 ">Sleep</pre>
My code:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
PATH = 'C:\\Users\\User\\Desktop\\chromedriver.exe'
driver = webdriver.Chrome(PATH)
driver.get('https://www.reddit.com')
time.sleep(80) # TIME TO LOGIN IN
search = driver.find_element_by_class_name('_3Gy8WZD53wWAE41lr57by3 ')
print(driver.find_element_by_xpath(".//pre").text) # *LET'S CALL THIS 'S'*
And everything works, kinda. When I print 's', it prints out the last message from that chat.
Note that whenever someone enters a message, it will be under the class '_3Gy8WZD53wWAE41lr57by3 '.
My goal is to print out the first message from that chat.
I had to edit it twice because of some mistakes that I had made
I would suggest 2 changes to your code which'll save you major frustration:
Avoid explicit sleep calls, instead, wait for presence of elements. This will allow your program to wait as little time as possible for the page you're trying to load.
Utilize CSS selectors instead of XPath: you have much finer control over accessing elements, plus your code becomes more robust and flexible.
In terms of execution, here's how that looks:
Wait up to 80 seconds for login:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
# Get the page, now the user will need to log in
driver.get('https://www.reddit.com')
# Wait until the page is loaded, up to 80 seconds
try:
    element = WebDriverWait(driver, 80).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "pre._3Gy8WZD53wWAE41lr57by3"))
    )
except TimeoutException:
    print("You didn't log in, shutting down program")
    driver.quit()
# continue as normal here
Utilize css selectors to find your messages:
# I personally like to always use the plural form of this function
# since, if it fails, it returns an empty list. The single form of
# this function results in an error if no results are found
# NOTE: utilize reddit's class for comments, you may need to change the css selector
all_messages = driver.find_elements_by_css_selector('pre._3Gy8WZD53wWAE41lr57by3')
# You can now access the first and last elements from this:
first_message = all_messages[0].text
last_message = all_messages[-1].text
# Alternatively, if you are concerned about memory usage from potentially large
# lists of messages, use the css pseudo-classes 'first-of-type' / 'last-of-type'
# NOTE: take the first item only if the list is non-empty, otherwise
# fall back to None
first_message = driver.find_elements_by_css_selector('pre._3Gy8WZD53wWAE41lr57by3:first-of-type')
first_message = first_message[0] if first_message else None
last_message = driver.find_elements_by_css_selector('pre._3Gy8WZD53wWAE41lr57by3:last-of-type')
last_message = last_message[0] if last_message else None
I hope this provides an immediate solution but also some fundamentals on how to optimize your web scraping moving forward.

Which method should I use for selenium.webdriver.support.expected_conditions object when waiting for Ajax based card details to load?

I'm trying to spider a page for links with a specific CSS class with Selenium for Python 3. For some reason it just stops when it should loop through again.
def spider_me_links(driver, max_pages, links):
    page = 1  # NOTE: Change this to start with a different page.
    while page <= max_pages:
        url = "https://www.example.com/home/?sort=title&p=" + str(page)
        driver.get(url)
        # Timeout after 2 seconds, and duration 5 seconds between polls.
        wait = WebDriverWait(driver, 120, 5000)
        wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, 'card-details')))
        # Obtain source text
        source_code = driver.page_source
        soup = BeautifulSoup(source_code, 'lxml')
        print("findAll:", len(soup.findAll('a', {'class': 'card-details'})))  # returns 12 at every loop iteration.
        links += soup.findAll('a', {'class': 'card-details'})
        page += 1
The two lines I think I have wrong are the following:
# Timeout after 2 seconds, and duration 5 seconds between polls.
wait = WebDriverWait(driver, 120, 5000)
wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, 'card-details')))
Because at that point I'm waiting for content to be loaded dynamically with Ajax, and the content loads fine. If I don't use the function to load it and don't run the above two lines, I'm able to grab the <a> tags, but if I put them in the loop it just gets stuck.
I looked at the documentation for the selenium.webdriver.support.expected_conditions class (the EC object in my code above), and I'm fairly unsure about which method I should use to make sure the content has been loaded before scraping it with BS4.
Usually credit card names and credit card numbers reside within <frame> / <iframe> elements.
To focus on those elements, you have to:
Induce WebDriverWait for the desired frame to be available and switch to it.
You can use either of the following Locator Strategies:
Using ID:
WebDriverWait(driver, 20).until(EC.frame_to_be_available_and_switch_to_it((By.ID,"iframe_id")))
Using NAME:
WebDriverWait(driver, 20).until(EC.frame_to_be_available_and_switch_to_it((By.NAME,"iframe_name")))
Using CLASS_NAME:
WebDriverWait(driver, 20).until(EC.frame_to_be_available_and_switch_to_it((By.CLASS_NAME,"iframe_classname")))
Using CSS_SELECTOR:
WebDriverWait(driver, 20).until(EC.frame_to_be_available_and_switch_to_it((By.CSS_SELECTOR,"iframe_css")))
Using XPATH:
WebDriverWait(driver, 20).until(EC.frame_to_be_available_and_switch_to_it((By.XPATH,"iframe_xpath")))
Note: You have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
Reference
You can find a couple of relevant discussions in:
Ways to deal with #document under iframe
Switch to an iframe through Selenium and python
Unable to locate element of credit card number using selenium python
NoSuchElementException: Message: Unable to locate element while trying to click on the button VISA through Selenium and Python
Your selector "means" that you want to select element with tag name 'card-details' while you need to select element with #class='card-details'
Try either
(By.CSS_SELECTOR, '.card-details')
or
(By.CLASS_NAME, 'card-details')
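As a rough sketch of how the corrected locator could fit into the loop from the question (the poll frequency is also set back to seconds here, which is an assumption about the original intent):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

def spider_me_links(driver, max_pages, links):
    page = 1
    while page <= max_pages:
        driver.get("https://www.example.com/home/?sort=title&p=" + str(page))
        # Timeout after 120 seconds, polling every 5 seconds.
        wait = WebDriverWait(driver, 120, poll_frequency=5)
        # '.card-details' selects by class; bare 'card-details' would select a tag name.
        wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, '.card-details')))
        soup = BeautifulSoup(driver.page_source, 'lxml')
        links += soup.findAll('a', {'class': 'card-details'})
        page += 1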
I ended up using:
wait.until(EC.presence_of_element_located((By.CLASS_NAME, 'card-details')))
And it appears to have worked.

ElementNotVisibleException: Message: element not interactable error while trying to click on the top video in a youtube search

I cannot seem to find a way to click on the right element in order to get the url I am looking for. In essence I am trying to click on the top video in a youtube search (the most highly ranked returned video).
How to resolve ElementNotInteractableException: Element is not visible in Selenium webdriver?
-> This is for Java but it pointed me in the right direction (knowing I needed to execute JavaScript)
http://www.teachmeselenium.com/2018/04/17/python-selenium-interacting-with-the-browser-executing-javascript-through-javascriptexecutor/
-> This shows me how I should try to execute the javascript in python.
I have also seen countless articles about waits but they do not solve my problem.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time

wrds = ["Vivaldi four seasons", "The Beatles twist and shout", "50 cent heat"]  # Random list of songs

driver = webdriver.Chrome()
for i in wrds:
    driver.get("http://www.youtube.com")
    elem = driver.find_element_by_id("search")
    elem.send_keys(i)
    elem.send_keys(Keys.RETURN)
    time.sleep(5)
    driver.execute_script("arguments[0].click()", driver.find_element_by_id('video-title'))  # THIS CLICKS ON WRONG VIDEO
    # elem = driver.find_element_by_id("video-title").click()  # THIS FAILS
    time.sleep(5)
    url = driver.current_url
driver.close()
I get an ElementNotVisibleException: Message: element not interactable error when I do not execute any JavaScript (even though it has actually worked before, it is just nowhere near robust). When I do execute the JavaScript it clicks on the wrong videos.
I have tried all types of waits, "Explicit" and "Implicit", but this did not work.
I am quite sure I need to execute some JavaScript but I don't know how.
You were almost there. You need to induce WebDriverWait for the element to be clickable and you can use the following solution:
Code Block:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
wrds = ["Vivaldi four seasons", "The Beatles twist and shout", "50 cent heat"]
kwrd = ["Vivaldi", "Beatles", "50"]
options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_argument("disable-infobars")
options.add_argument("--disable-extensions")
driver = webdriver.Chrome(chrome_options=options, executable_path=r'C:\WebDrivers\chromedriver.exe')
for i, j in zip(wrds, kwrd):
    driver.get("https://www.youtube.com/")
    WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "input#search"))).send_keys(i)
    driver.find_element_by_css_selector("button.style-scope.ytd-searchbox#search-icon-legacy").click()
    WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "h3.title-and-badge.style-scope.ytd-video-renderer a"))).click()
    WebDriverWait(driver, 10).until(EC.title_contains(j))
    print(driver.current_url)
driver.quit()
That's one of the reasons you should never use a JavaScript click: Selenium WebDriver is designed to simulate what a real user can do. A real user can't click an invisible element on the page, but you can click it through JavaScript. If you search for elements by that id, video-title, it matches 53 videos in total. But I don't know which one you want to click. You may need to match that element some other way (not by id).
I will give you an idea how to click that element but you need to find out the index first before you click.
driver.find_element_by_xpath("(//*[@id='video-title'])[1]").click()
If the first one is invisible, then try index [2], then [3], and figure out which index clicks the desired element. Or, to target the exact element, we may try to locate it some other way.
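A rough sketch of that index-probing idea, assuming the id stays video-title and that clicking a hidden element raises an interactability error:
from selenium.common.exceptions import (ElementNotInteractableException,
                                        ElementNotVisibleException,
                                        NoSuchElementException)

def click_first_visible(driver, max_candidates=10):
    """Probe //*[@id='video-title'] matches until one accepts the click; XPath indexes are 1-based."""
    for index in range(1, max_candidates + 1):
        try:
            driver.find_element_by_xpath("(//*[@id='video-title'])[%d]" % index).click()
            return index  # clicked successfully, stop probing
        except (ElementNotInteractableException, ElementNotVisibleException, NoSuchElementException):
            continue  # hidden or missing, try the next index
    return None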

My script produces javascript stuff instead of valid links

Running my script I get items such as "javascript:getDetail(19978)" as the href. The number in parentheses, if concatenated with "https://www.aopa.org/airports/4M3/business/", produces a valid link. However, clicking on these newly created links I can see that they take me to a different page, not the one reached by clicking the link on the original page. How can I get the original links instead of "javascript:getDetail(19978)"? The search should be made by writing "All" in the search box.
The code I've tried with:
from selenium import webdriver
import time

link = "https://www.aopa.org/airports/4M3/business/"

driver = webdriver.Chrome()
driver.get("https://www.aopa.org/learntofly/school/")
driver.find_element_by_id('searchTerm').send_keys('All')
driver.find_element_by_id('btnSearch').click()
time.sleep(5)
for pro in driver.find_elements_by_xpath('//td/a'):
    print(pro.get_attribute("href"))
driver.quit()
Code to create new links with the base url I pasted in my description:
from selenium import webdriver
import time

link = "https://www.aopa.org/airports/4M3/business/"

driver = webdriver.Chrome()
driver.get("https://www.aopa.org/learntofly/school/")
driver.find_element_by_id('searchTerm').send_keys('All')
driver.find_element_by_id('btnSearch').click()
time.sleep(5)
for item in driver.find_elements_by_xpath('//td/a'):
    fresh = item.get_attribute("href").replace("javascript:getDetail(", "")
    print(link + fresh.replace(")", ""))
driver.quit()
However, these newly created links lead me to different destinations.
FYI, the original links are embedded within elements like the one below:
<td><a href="javascript:getDetail(19978)">GOLD DUST FLYING SERVICE, INC.</a></td>
Clicking a link makes an XHR request. The page actually remains the same, but the data received as JSON is rendered in place of the previous content.
If you want to open raw data inside an HTML page you might try something like
# To get list of entries as ["19978", "30360", ... ]
entries = [a.get_attribute('href').split("(")[1].split(")")[0] for a in driver.find_elements_by_xpath('//td/a')]

url = "https://www.aopa.org/learntofly/school/wsSearch.cfm?method=schoolDetail&businessId="
for entry in entries:
    driver.get(url + entry)
    print(driver.page_source)
You also might use requests to get each JSON response as
import requests

for entry in entries:
    print(requests.get(url + entry).json())
without rendering data in browser
If you look at how getDetail() is implemented in the source code and explore the "Network" tab when you click each of the search result links, you can see that multiple XHR requests are issued for a result and that some client-side logic is executed to form a search result page.
If you don't want to dive into replicating all the logic that forms each of the search result pages, simply go back and forth between the search results page and a single search result page:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()
wait = WebDriverWait(driver, 10)

driver.get("https://www.aopa.org/learntofly/school/")
driver.find_element_by_id('searchTerm').send_keys('All')
driver.find_element_by_id('btnSearch').click()

# wait for search results to be visible
table = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "#searchResults table")))

for index in range(len(table.find_elements_by_css_selector('td a[href*=getDetail]'))):
    wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "#searchResults table")))

    # get the next link
    link = table.find_elements_by_css_selector('td a[href*=getDetail]')[index]
    link_text = link.text
    link.click()

    print(link_text)
    # TODO: get details

    # go back
    back_link = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "#schoolDetail a[href*=backToList]")))
    driver.execute_script("arguments[0].click();", back_link)

driver.quit()
Note the use of Explicit Waits instead of hardcoded "sleeps".
It may actually make sense to avoid using selenium here altogether and approach the problem "headlessly" - doing HTTP requests (via requests module) to the website's API.
