This question is related to my previous two: Inducing WebDriverWait for specific elements and Inconsistency in scraping through <div>'s in Selenium.
I am scraping all of the Air Jordan sneakers off of https://www.grailed.com/. The feed is an infinitely scrolling list of sneakers, and I am using Selenium WebDriver to scrape the data. My problem is that the images for the shoes seem to take a while to load, so the script throws a lot of errors. I have found the pattern in the XPaths of the images. The XPath to the first image is
/html/body/div[3]/div[6]/div[3]/div[3]/div[2]/div[2]/div[1]/a/div[2]/img, and the second is /html/body/div[3]/div[6]/div[3]/div[3]/div[2]/div[2]/div[2]/a/div[2]/img, etc.
The paths follow a linear sequence in which the second-to-last div index increases by one each time. To handle this I put the following in my loop (only relevant code is included).
i = 1
while len(sneakers) < sneaker_count:
    # Scroll down to bottom
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    # Get sneakers currently on page and add to sneakers list
    feed = driver.find_elements_by_class_name('feed-item')
    for item in feed:
        xpath = "/html/body/div[3]/div[6]/div[3]/div[3]/div[2]/div[2]/div[" + str(i) + "]/a/div[2]/img"
        img = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, xpath)))
        i += 1
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height
The issue is that after about the 5th pair of shoes, the wait statement times out; it seems that the XPath passed in after that pair is not recognized. I used the copy XPath feature in Firefox Developer to check the XPath, and it looks identical to the XPath I pass in when I print it. I use ChromeDriver with Selenium, but I don't think that's relevant. Does anyone know why the XPaths stop being recognized even though they seem identical?
UPDATE: Using an XPath checker add-on for Chrome, I can see that it detects the XPaths for items 1-4 but often stops detecting them after 6. When I check the XPath (in both Chrome and Firefox Developer mode), it still looks identical, but the "CSS and Xpath checker" add-on still does not detect it. This is a huge mystery to me.
I found the problem. The xpath was fine, but after the first 4-5 elements, the images are lazy-loaded. This means that a different solution must be reached in order to scrape these images. It's not that they take too long to load, it's that they just load placeholders in the HTML.
I am using headless Firefox on Selenium and XPath Helper to identify insanely long paths to elements.
When the page initially loads, I can use XPath Helper to find the xpath of any element of interest, and selenium can find the element when given the xpath.
However, several buttons that I need to interact with on the page open menus when pressed that are either small or take up the whole "screen". No matter their size, these containers are overlaid on the original page, and although I can find their xpaths using XPath Helper, when I try to use those xpaths to find the elements using selenium, they can't be found.
I've checked, and there's no iframe funny business happening. I'm a bit stumped as to what could be happening. My guess is that the page's source code is being dynamically changed after I press the buttons that open the menu containers and when I call find_element_by_xpath on new elements in the containers, the original source is being searched, instead of the new source. Could that be it?
Any other ideas?
As a workaround, I can get around this issue by sending keystrokes to the body of the page, but I feel this solution is rather brittle and likely to fail. Would be a much more robust solution to actually specify all elements.
EDIT:
With selenium I can find the export button, but not the menu it opens.
Here is the code for the export button itself:
The element of interest for me is "Customize Export" which I have not been able to find using selenium. Here is the code for this element:
Notice the very top line of this last image (cdk-overlay-container)
Now, when I refresh the page and do NOT click the export button, the cdk-overlay-container section of the code is empty:
This suggests that my hypothesis is correct: when the page initially loads, the "Customize Export" button is nowhere in the source code and appears only after "Export" is clicked, and Selenium uses only the original source code, not the dynamically generated code that appears after clicking "Export", to find elements.
Selenium could find the dynamic content after doing
driver.execute_script("return document.body.innerHTML")
WebDriverWait is what you need to use to wait for elements to reach a certain condition. Here is an example that waits for each element to be clickable before clicking it, with a timeout of 5 seconds:
wait = WebDriverWait(driver, 5)
button = wait.until(EC.element_to_be_clickable((By.XPATH, 'button xpath')))
button.click()
wait.until(EC.element_to_be_clickable((By.XPATH, 'menu xpath'))).click()
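To make the idea behind an explicit wait concrete, here is a minimal sketch of the polling loop: call a condition repeatedly until it returns something truthy or a deadline passes. This is a simplified illustration, not Selenium's actual implementation; `wait_until` and `fake_find_menu` are hypothetical names, and `LookupError` stands in for Selenium's "element not found" exception.

```python
import time


def wait_until(condition, timeout=5.0, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Raises TimeoutError if nothing truthy comes back within `timeout` seconds.
    A LookupError from the condition is treated as "not there yet".
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            result = condition()
            if result:
                return result
        except LookupError:
            pass  # element not present yet; keep polling
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)


# Hypothetical stand-in for an overlay menu that only exists on the third lookup.
attempts = {"count": 0}


def fake_find_menu():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise LookupError("menu not rendered yet")
    return "Customize Export"


menu = wait_until(fake_find_menu, timeout=2.0, poll=0.01)
print(menu)
```

The point of the sketch is that the lookup is repeated against the live page on every poll, which is why content added after a click can still be found once it appears.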
"identify insanely long paths"
is an anti-pattern. Try writing the XPath or CSS selector yourself instead of relying on XPath Helper.
Update:
wait = WebDriverWait(driver, 10)
export_buttons = wait.until(EC.presence_of_all_elements_located((By.XPATH, '//button[contains(@class, "mat-menu-trigger") and contains(., "Export")]')))
print("Export button count: ", len(export_buttons))
export_button = wait.until(EC.element_to_be_clickable((By.XPATH, '//button[contains(@class, "mat-menu-trigger") and contains(., "Export")]')))
export_button.click()
cus_export_buttons = wait.until(EC.presence_of_all_elements_located((By.XPATH, '//button[contains(@class, "mat-menu-item") and contains(., "Customize Export")]')))
print("Customize Export button count: ", len(cus_export_buttons))
I'm trying to run the code:
for j in range(1, 13):
    driver.find_element_by_xpath('//*[@id="gateway-page"]/body/table/tbody/tr[3]/td[2]/table/tbody/tr[2]/td[2]/table/tbody/tr/td/table/tbody/tr[2]/td/div/div[2]/ul/li['+str(j)+']').click()
    time.sleep(3)
to click every matching element on this website. But it skips some elements every time, even though clicking them worked when I tried each one separately rather than in the for loop. Any idea why this happens?
The problem seems to be with /ul/li['+str(j)+']: you are clicking the <li> tag, while the actual link resides inside it. That is why the link sometimes does not receive the click, and no error is raised, because the link is wrapped inside the <li> tag.
Try to locate the actual link tag instead, using the code below. I have tested it on my system. Hope this helps.
driver.get('http://catalog.sps.cuny.edu/preview_program.php?catoid=2&poid=607')
driver.implicitly_wait(10)
links = driver.find_elements_by_xpath("//div//h2[contains(.,'Electives')]/..//ul/li//span/a")
for link in links:
    link.click()
    time.sleep(3)
Looking at the XPath, it seems you are trying to click the Elective options on that website. I think you have stored the text of all electives in a str array and are using the loop to click on each one.
I suggest another approach: store all the elective elements in a list, then iterate over them and click each one, e.g.
elements = driver.find_elements_by_xpath('//*[@id="gateway-page"]/body/table/tbody/tr[3]/td[2]/table/tbody/tr[2]/td[2]/table/tbody/tr/td/table/tbody/tr[2]/td/div/div[2]/ul/li')
for element in elements:
    element.click()
    time.sleep(3)
Probable problems in your solution
You are storing the names of the electives in an array. If there is any typo, the XPath becomes invalid.
Your loop runs j from 1 to 12, because range(1, 13) excludes 13. XPath positions start at 1, so li[1] is already the first elective; check instead that the list really contains 12 items, otherwise the later lookups will fail.
Also, each elective expands after it is clicked, so consider scrolling if an element is not found.
Suggestion:
Also, use relative XPaths instead of absolute ones. Relative XPaths are more stable.
Happy Coding~
Before marking as duplicate, please consider that I have already looked through many related stack overflow posts, as well as websites and articles. I have not found a solution yet.
This question is a follow up to this question here Selenium Webdriver not finding XPATH despite seemingly identical strings. I determined the problem did not in fact come from the xpath method by updating the code to work in a more elegant manner:
for item in feed:
    img_div = item.find_element_by_class_name('listing-cover-photo ')
    img = WebDriverWait(img_div, 10).until(
        EC.visibility_of_element_located((By.TAG_NAME, 'img')))
This works for the first 5 or so elements, but after that it times out. By getting the inner HTML of img_div and printing it, I found that for the elements that time out, there is a div with class "lazyload-placeholder" instead of the image I want. This led me to search for how to scrape lazy-loaded elements, but I could not find an answer. As you can see, I am using a WebDriverWait to give the image time to load; I also tried a site-wide wait call and a time.sleep call. Waiting does not seem to fix it. I am looking for the easiest way to handle these lazy-loaded images, preferably in Selenium, but if there are other libraries or products I can use in tandem with the Selenium code I already have, that would be great. Any help is appreciated.
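The check described above (printing the inner HTML and looking for a placeholder) can be sketched with the standard library alone. This is only an illustration: the two HTML fragments are made up to mimic the loaded and not-yet-loaded cases, and `CoverPhotoScanner` and `scan_cover` are hypothetical names.

```python
from html.parser import HTMLParser


class CoverPhotoScanner(HTMLParser):
    """Records image sources and lazy-load placeholders in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.image_sources = []
        self.placeholder_count = 0

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if tag == "img" and attributes.get("src"):
            self.image_sources.append(attributes["src"])
        elif "lazyload-placeholder" in classes:
            self.placeholder_count += 1


def scan_cover(inner_html):
    """Return (list of img src values, number of placeholders) for a fragment."""
    scanner = CoverPhotoScanner()
    scanner.feed(inner_html)
    scanner.close()
    return scanner.image_sources, scanner.placeholder_count


# Made-up fragments that mimic the two cases described above.
loaded = '<div class="listing-cover-photo"><img src="https://example.com/shoe.jpg"></div>'
pending = '<div class="listing-cover-photo"><div class="lazyload-placeholder"></div></div>'

print(scan_cover(loaded))
print(scan_cover(pending))
```

With Selenium, the fragment would come from something like img_div.get_attribute("innerHTML"); a non-zero placeholder count means the item has not been scrolled into view yet, so waiting alone will not produce the image.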
Your images will only load when they're scrolled into view. It's such a common requirement that the Selenium Python docs have it in their FAQ. Adapting from this answer, the below script will scroll down the page before scraping the images.
driver.get("https://www.grailed.com/categories/footwear")
SCROLL_PAUSE_TIME = 0.5
i = 0
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(SCROLL_PAUSE_TIME)
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height
    i += 1
    if i == 5:
        break
driver.implicitly_wait(10)
shoe_images = driver.find_elements(By.CSS_SELECTOR, 'div.listing-cover-photo img')
print(len(shoe_images))
In the interest of not scrolling through shoes (seemingly) forever, I have added a break after 5 iterations; you're free to remove the i variable, and it will then scroll down for as long as it can.
The implicit wait is there to allow any remaining images that are still loading to catch up.
A test run yielded 82 images. I confirmed that it had scraped everything on the page by using the selector in Chrome's DevTools, which highlighted 82 matches. You'll see a different number depending on how many images you allow to load.
C# sample
var img = Driver.FindElement(By.TagName("img")); // find the lazy-loaded img
Actions actions = new Actions(Driver);
actions.MoveToElement(img); // scroll to the img
actions.Perform();
var imageUrl = img.GetAttribute("src"); // read src
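A rough Python counterpart to the C# sample might look like the sketch below. It scrolls the listing's container into view with JavaScript rather than with Actions, then looks for the img inside it. `read_lazy_image_src` is a hypothetical helper, the fixed pause is a crude assumption (an explicit wait would be more robust), and the function only relies on `execute_script`, `find_element`, and `get_attribute`, so it does not import Selenium itself.

```python
import time


def read_lazy_image_src(driver, container, pause=0.5):
    """Scroll `container` into view, then return the src of the <img> inside it."""
    # Bring the container into the viewport so the page swaps in the real image.
    driver.execute_script(
        "arguments[0].scrollIntoView({block: 'center'});", container
    )
    time.sleep(pause)  # crude pause to let the lazy loader run
    # "css selector" is the string value of selenium's By.CSS_SELECTOR
    image = container.find_element("css selector", "img")
    return image.get_attribute("src")
```

In the scraping loop from the question this would be called as read_lazy_image_src(driver, img_div) for each feed item.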
I'm checking a python library: requests-html. Looks interesting, easy and clear scraping. However, I'm not sure how to render a page with infinite scrolling.
From their documentation I understand that I should render a page with special attribute (scrolldown). I'm trying but I do not know how exactly. I know how to use selenium to handle infinite scroll, but I wonder whether it is possible with requests-html.
from requests_html import HTML, HTMLSession
session = HTMLSession()
page1 = session.get(url1)
page1.html.render(scrolldown=5, sleep=3)
html = HTML(html=page1.text)
noticeName = html.find('h2.noticeName')
for element in noticeName:
    print(element.text)
It finds 10 of the 13 elements. Those 10 are the ones visible without scrolling (that is, without the new content that infinite scroll loads).
scrolldown=5 means scroll 5 pixel down, is your monitor that small?? or vm height that small?? now give it a bigger value like height of the screen with sleep or 2000 or 5000 without sleep
And it will not give you uniquely next elements, it will give you exactly all elements from the starting.
I will add some sample code soon.
I hope you've solved this already, but I'll post this for any other curious souls.
In most cases, if you want to infinite scroll, scrolldown needs to be a large value because it is based on the number of times requests_html will send a "page down" request in Chromium.
According to the docs:
scrolldown – Integer, if provided, of how many times to page down.
However, the requests_html uses the pyppeteer library which sends a page down as a key press. This means that if you are on a page that blocks the page down keys or simply doesn't infinite scroll using only the key presses, you will need a different solution.
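As a minimal sketch of how this could be applied to the question's code: render with a larger scrolldown count and then search the rendered page1.html directly. One likely reason the original snippet only finds the initial items is that it rebuilds an HTML object from page1.text, which is the raw response body from before rendering. `find_after_scrolling` is a hypothetical helper, and the default values for `page_downs` and `pause` are guesses to tune per site.

```python
def find_after_scrolling(session, url, selector, page_downs=20, pause=1):
    """Render `url`, page down `page_downs` times, and return the matching texts."""
    response = session.get(url)
    # render() loads the page in headless Chromium and updates response.html in place.
    response.html.render(scrolldown=page_downs, sleep=pause)
    # Search the rendered response.html, not response.text (the pre-render body).
    return [element.text for element in response.html.find(selector)]


# Usage with requests-html installed (url1 as in the question):
#     from requests_html import HTMLSession
#     for title in find_after_scrolling(HTMLSession(), url1, "h2.noticeName"):
#         print(title)
```

If the site ignores page-down key presses, this will still return only the initial items, as noted above.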
Alternative solution (in Javascript)
Documentation: requests_html (archived)
I am trying to scrape content of a page.
Let's say this is the page:
http://finance.yahoo.com/quote/AAPL/key-statistics?p=AAPL
I know I need to use Selenium to get the data I want.
I found this example from Stackoverflow that shows how to do it:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
driver = webdriver.Chrome()
driver.maximize_window()
driver.get("http://finance.yahoo.com/quote/AAPL/profile?p=AAPL")
# wait for the Full Time Employees to be visible
wait = WebDriverWait(driver, 10)
employees = wait.until(EC.visibility_of_element_located((By.XPATH, "//span[. = 'Full Time Employees']/following-sibling::strong")))
print(employees.text)
driver.close()
My question is this:
In the above example to find Full Time Employees the code that has been used is:
employees = wait.until(EC.visibility_of_element_located((By.XPATH, "//span[. = 'Full Time Employees']/following-sibling::strong")))
How did the author work out that they needed to use:
"//span[. = 'Full Time Employees']/following-sibling::strong"
To find the number of employees.
For my example page: http://finance.yahoo.com/quote/AAPL/key-statistics?p=AAPL how can I find for example Trailing P/E?
Can you please tell me the steps you took to find this? I do right click and choose Inspect, but then what shall I do?
A picture is worth a thousand words.
In web dev. tools (F12) you do the following steps:
Choose Elements tab
Press Element Selector button
With that button pressed you click an element in the main browser window.
In the DOM-elements window you right-click that highlighted element.
A context menu appears; choose Copy.
Choose Copy XPath in the submenu. Now you have that element's XPath on the clipboard.
NOTE!
The browser composes an element's XPath using its own algorithm. The result might not be what you expect or what fits your code, so you need to understand XPath itself and be comfortable writing it.
See what xpath the Chrome browser has issued for Trailing P/E:
//*[@id="main-0-Quote-Proxy"]/section/div[2]/section/div/section/div[2]/div[1]/div[1]/div/table/tbody/tr[3]/td[1]/span
'//h3[contains(., "Valuation Measures")]/following-sibling::div[1]//tr[3]'
Here is an answer to all of your questions.
It is best to work through some XPath tutorials and practice on your own; then you will be able to decide which expression to use.
There are many sites for this. You can start Here or Here
Now, to your query:
Suppose I am using the following XPath to locate the element
//h3/span[text()='Financial Highlights']/../preceding-sibling::div//tr[3]/td/span
To find Trailing P/E on your page, you will definitely want a unique XPath that won't change. If you try to find it using FirePath, it shows a lengthy XPath.
So you look for an alternative: find another element (maybe a sibling, child, or ancestor of your element) and locate your element relative to it.
In my case, I first find the Financial Highlights text, which I can locate using //h3/span[text()='Financial Highlights']
Then I move to its parent tag, which is h3, using /..
The Trailing P/E element is in the node just above the current one, so I move to that node using /preceding-sibling::div
And finally I find the element inside that <div> with //tr[3]/td/span
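The steps above can be written out as a short sketch that assembles the locator piece by piece, which makes it easier to see what each part contributes. The variable names `steps` and `xpath` are only for illustration; the pieces themselves are the ones from this answer.

```python
# Each entry is one step of the walk described above.
steps = [
    "//h3/span[text()='Financial Highlights']",  # 1. anchor on stable, visible text
    "/..",                                        # 2. step up to the parent <h3>
    "/preceding-sibling::div",                    # 3. move to a <div> before that <h3>
    "//tr[3]/td/span",                            # 4. pick the cell inside that <div>
]

xpath = "".join(steps)
print(xpath)
```

The resulting string can be passed to By.XPATH in the WebDriverWait call from the question in place of the Full Time Employees locator.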