Before marking as duplicate, please consider that I have already looked through many related stack overflow posts, as well as websites and articles. I have not found a solution yet.
This question is a follow up to this question here Selenium Webdriver not finding XPATH despite seemingly identical strings. I determined the problem did not in fact come from the xpath method by updating the code to work in a more elegant manner:
for item in feed:
img_div = item.find_element_by_class_name('listing-cover-photo ')
img = WebDriverWait(img_div, 10).until(
EC.visibility_of_element_located((By.TAG_NAME, 'img')))
This works for the first 5ish elements. But after that it times out, by getting the inner html of the img_div and printing it, I found that for elements that time out, instead of the image I want there is a div with class "lazyload-placeholder". This led me to scraping lazy-loaded elements, but there was no answer that I could find. As you can see, I am using a WebDriverWait to try and give it time to load, but I also tried a site-wide wait call, as well as a time.sleep call. Waiting does not seem to fix it. I am looking for the easiest way to handle these lazy-loaded images, preferably in Selenium, but if there are other libraries or products I can use in tandem with the Selenium code I already have, that would be great. Any help is appreciated.
Your images will only load when they're scrolled into view. It's such a common requirement that the Selenium Python docs have it in their FAQ. Adapting from this answer, the below script will scroll down the page before scraping the images.
driver.get("https://www.grailed.com/categories/footwear")
SCROLL_PAUSE_TIME = 0.5
i = 0
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(SCROLL_PAUSE_TIME)
new_height = driver.execute_script("return document.body.scrollHeight")
if new_height == last_height:
break
last_height = new_height
i += 1
if i == 5:
break
driver.implicitly_wait(10)
shoe_images = driver.find_elements(By.CSS_SELECTOR, 'div.listing-cover-photo img')
print(len(shoe_images))
In the interest of not scrolling through shoes (seemingly) forever, I have added in a break after 5 iterations, however, you're free to remove the i variable and it will scroll down for as long as it can.
The implicit wait is there to allow catchup for any remaining images that are still loading in.
A test run yielded 82 images, I confirmed that it had scraped all on the page by using Chrome's DevTools selector which highlighted 82. You'll see a different number based on how many images you allow to load.
C# sample
var img= Driver.FindElement(By.TagName("img"));// find lazy-load img
Actions actions = new Actions(Driver);
actions.MoveToElement(img); // scroll to img
actions.Perform();
var imageUrl = img.GetAttribute("src");// ready src
Related
I'm trying to scrape a website for an about 2 days, but scrolling down to get more elements is the problem. I've almost checked every javascript code in stackoverflow to do that, but none of them worked.
For example :
window.scrollTo(1, 1000)
window.scrollTo(0, document.body.scrollHeight);
arguments[0].scrollIntoView(true);
I even used this article, but it didn't work either.
I checked the network to see if I can find the API to use requests to get more elements, but I couldn't find it.
All I want to do is to get more elements, so is there a way to do that?
Try this approach to scroll page down:
from selenium.webdriver.common.keys import Keys
driver.find_element('xpath', '//body').send_keys(Keys.END)
For me the following worked (move_to_element):
from selenium.webdriver import ActionChains
your_link = driver.find_elements(By.XPATH, "//*[contains(#class, 'ClassName')]"
action = ActionChains(driver)
action.move_to_element(your_link).perform()
I am using headless Firefox on Selenium and XPath Helper to identify insanely long paths to elements.
When the page initially loads, I can use XPath Helper to find the xpath of any element of interest, and selenium can find the element when given the xpath.
However, several buttons that I need to interact with on the page open menus when pressed that are either small or take up the whole "screen". No matter their size, these containers are overlaid on the original page, and although I can find their xpaths using XPath Helper, when I try to use those xpaths to find the elements using selenium, they can't be found.
I've checked, and there's no iframe funny business happening. I'm a bit stumped as to what could be happening. My guess is that the page's source code is being dynamically changed after I press the buttons that open the menu containers and when I call find_element_by_xpath on new elements in the containers, the original source is being searched, instead of the new source. Could that be it?
Any other ideas?
As a workaround, I can get around this issue by sending keystrokes to the body of the page, but I feel this solution is rather brittle and likely to fail. Would be a much more robust solution to actually specify all elements.
EDIT:
With selenium I can find the export button, but not the menu it opens.
Here is the code for the export button itself:
The element of interest for me is "Customize Export" which I have not been able to find using selenium. Here is the code for this element:
Notice the very top line of this last image (cdk-overlay-container)
Now, when I refresh the page and do NOT click the export button, the cdk-overlay-container section of the code is empty:
This suggests my that my hypothesis is correct -- that when the page loads initially, the "Customize Export" button is nowhere in the source code, but appears only after "Export" is clicked, and that selenium is using the original source code only --not the dynamically generated code that appears after clicking "Export" -- to find elements
Selenium could find the dynamic content after doing
driver.execute_script("return document.body.innerHTML")
The WebDriverWait is what you need to use to wait for a certain condition of elements. Here is an example of waiting for the elements to be clickable before the click with a timeout in 5 seconds:
wait = WebDriverWait(driver, 5)
button = wait.until(EC.element_to_be_clickable((By.XPATH, 'button xpath')))
button.click()
wait.until(EC.element_to_be_clickable((By.XPATH, 'menu xpath'))).click()
identify insanely long paths
is an anti pattern. You can try to not use XPath Helper and find xpath or selector yourself.
Update:
wait = WebDriverWait(driver, 10)
export_buttons = wait.until(EC. presence_of_all_elements_located((By.XPATH, '//button[contains(#class, "mat-menu-trigger") and contains(.="Export")]')))
print("Export button count: ", len(export_buttons))
export_button = wait.until(EC.element_to_be_clickable((By.XPATH, '//button[contains(#class, "mat-menu-trigger") and contains(.="Export")]')))
export_button.click()
cus_export_buttons = wait.until(EC. presence_of_all_elements_located((By.XPATH, '//button[contains(#class, "mat-menu-item") and contains(.="Customize Export")]')))
print("Customize Export button count: ", len(cus_export_buttons))
This question is related to my previous two: Inducing WebDriverWait for specific elements and Inconsistency in scraping through <div>'s in Selenium.
I am scraping all of the Air Jordan sneakers off of https://www.grailed.com/. The feed is an infinitely scrolling list of sneakers and I am using Selenium webdriver to scrape the data. My problem is that the images for the shoes seem to take a while to load, so it throws a lot of errors. I have found the pattern in the xpath's of the images. The xpath to the first image is
/html/body/div[3]/div[6]/div[3]/div[3]/div[2]/div[2]/div[1]/a/div[2]/img, and the second is /html/body/div[3]/div[6]/div[3]/div[3]/div[2]/div[2]/div[2]/a/div[2]/img etc.
It follows this linear sequences where the second to last div index increases by one each time. To handle this I put the following in my loop (only relevant code is included).
i = 1
while len(sneakers) < sneaker_count:
# Scroll down to bottom
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
# Get sneakers currently on page and add to sneakers list
feed = driver.find_elements_by_class_name('feed-item')
for item in feed:
xpath = "/html/body/div[3]/div[6]/div[3]/div[3]/div[2]/div[2]/div[" + str(i) + "]/a/div[2]/img"
img = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, xpath)))
i += 1
new_height = driver.execute_script("return document.body.scrollHeight")
if new_height == last_height:
break
last_height = new_height
The issue is, after about the 5th pair of shoes, the wait statement times out, it seems that the xpath passed in after that pair of shoes is not recognized. I used FireFox Developer to check the xpath using the copy xpath feature, and it seems identical to the passed in xpath when I print it. I use ChromeDriver w/Selenium but I don't think that's relevant. Does anyone know why the xpath's stop being recognized even though they seem identical?
UPDATE: So using an Xpath checker add-on to Chrome, it detects xpaths for items 1-4, but often stops detecting them after 6. When I check the xpath (both on Chrome and FireFox Developer mode, the xpath still looks identical, but it doesn't detect them when I use the "CSS and Xpath checker" it still doesn't seem to come out. This is a huge mystery to me.
I found the problem. The xpath was fine, but after the first 4-5 elements, the images are lazy-loaded. This means that a different solution must be reached in order to scrape these images. It's not that they take too long to load, it's that they just load placeholders in the HTML.
I'm trying to download a link within a table after navigating to a page using Python3 + selenium. After using selenium to click through links, and inspecting the elements on the latest-loaded page, I can see that my element is within a frame called "contents". When I try to access this frame, however, upon calling:
DRIVER.switch_to_frame("contents")
I get the following error:
selenium.common.exceptions.NoSuchFrameException: Message: contents
To gain more context I applied an experiment. Both the page that I loaded using DRIVER.get(URL) and the one to which I navigated had a frame called "menu".
I could call DRIVER.switch_to_frame("menu") on the first page, but not on the second.
DRIVER = webdriver.Chrome(CHROME_DRIVER)
DRIVER.get(SITE_URL)
DRIVER.switch_to_frame("contents") # This works
target_element = DRIVER.find_element_by_name(LINK)
target_element.click()
time.sleep(5)
DRIVER.switch_to_frame("menu")
target_element = DRIVER.find_element_by_name(LINK2)
target_element.click()
target_element = DRIVER.find_element_by_name(LINK3)
target_element.click()
DRIVER.switch_to_frame("contents") # Code breaks here.
target_element = DRIVER.find_element_by_xpath(REF)
target_element.click()
print("Program complete.")
I expect the code to find the xpath reference for the link in the "contents" frame. Instead, when attempt to switch to the "contents" frame, python run-time errors and cannot find "contents".
selenium.common.exceptions.NoSuchFrameException: Message: contents
this is because you are staying in child level of iframe which is 'menu' so inside that it can't able to find the iframe 'contents'.
First Switch back to the parent frame which is "contents", by using
DRIVER.switch_to.default_content()
and then try to go to the 'contents' iframe and perform actions, Now it should work.
Since contents appears to be a top level frame try going back to the top before selecting the frame:
DRIVER.switch_to.default_content()
DRIVER.switch_to.frame("contents")
As a thumb rule whenever switching frames you need to induce WebDriverWait for frame_to_be_available_and_switch_to_it() and you need to:
Induce WebDriverWait for the desired frame to be available and switch to it.
Induce WebDriverWait for the desired element to be clickable.
You can use the following solution:
target_element.click()
WebDriverWait(driver, 20).until(EC.frame_to_be_available_and_switch_to_it((By.XPATH,"frame_xpath")))
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "REF"))).click()
You can find a detailed discussion on frame switching in How can I select a html element no matter what frame it is in in selenium?
Here you can find a relevant discussion on Ways to deal with #document under iframe
I'm checking a python library: requests-html. Looks interesting, easy and clear scraping. However, I'm not sure how to render a page with infinite scrolling.
From their documentation I understand that I should render a page with special attribute (scrolldown). I'm trying but I do not know how exactly. I know how to use selenium to handle infinite scroll, but I wonder whether it is possible with requests-html.
from requests_html import HTML, HTMLSession
page1 = session.get(url1)
page1.html.render( scrolldown=5,sleep=3)
html = HTML(html=page1.text)
noticeName = html.find('h2.noticeName')
for element in noticeName:
print(element.text)
It finds 10 elements from 13. 10 is visible without scrolling (and loading new content because of infinite scroll).
scrolldown=5 means scroll 5 pixel down, is your monitor that small?? or vm height that small?? now give it a bigger value like height of the screen with sleep or 2000 or 5000 without sleep
And it will not give you uniquely next elements, it will give you exactly all elements from the starting.
I will add some sample code soon.
I hope you've solved this already, but I'll post this for any other curious souls.
In most cases, if you want to infinite scroll, scrolldown needs to be a large value because it is based on the number of times requests_html will send a "page down" request in Chromium.
According to the docs:
scrolldown – Integer, if provided, of how many times to page down.
However, the requests_html uses the pyppeteer library which sends a page down as a key press. This means that if you are on a page that blocks the page down keys or simply doesn't infinite scroll using only the key presses, you will need a different solution.
Alternative solution (in Javascript)
Documentation: requests_html (archived)