Selenium doesn't get all the href from a web page - python-3.x

I am trying to get all the href links from https://search.yhd.com/c0-0-1003817/ (the ones that lead to the specific products), but although my code runs, it only gets 30 links. I don't know why this is happening. Could you help me, please?
I've been working with Selenium (Python 3.7), but previously I also tried to get the links with Beautiful Soup. That didn't work either.
from selenium import webdriver
import time
import requests
import pandas as pd

def getListingLinks(link):
    # Open the driver
    driver = webdriver.Chrome()
    driver.get(link)
    time.sleep(3)

    # Save the links
    listing_links = []
    links = driver.find_elements_by_xpath('//a[@class="img"]')
    for link in links:
        listing_links.append(str(link.get_attribute('href')))
    driver.close()
    return listing_links

imported = getListingLinks("https://search.yhd.com/c0-0-1003817/")
I should get 60 links, but I am only managing to get 30 with my code.

At initial load, the page contains only 30 images/links; only when you scroll down does it load all 60 items. You need to do the following:
def getListingLinks(link):
    # Open the driver
    driver = webdriver.Chrome()
    driver.maximize_window()
    driver.get(link)
    time.sleep(3)

    # Scroll down: repeated to ensure it reaches the bottom and all items are loaded
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(3)
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(3)

    # Save the links
    listing_links = []
    links = driver.find_elements_by_xpath('//a[@class="img"]')
    for link in links:
        listing_links.append(str(link.get_attribute('href')))
    driver.close()
    return listing_links

imported = getListingLinks("https://search.yhd.com/c0-0-1003817/")
print(len(imported))  # Output: 60
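A possible refinement, untested against this site: instead of the fixed time.sleep(3) after each scroll inside getListingLinks, an explicit wait can poll until the expected 60 product anchors are present, using the same XPath as above.

from selenium.webdriver.support.ui import WebDriverWait

# Poll (for up to 20 seconds) until at least 60 product anchors have been rendered,
# instead of sleeping for a fixed 3 seconds after each scroll.
WebDriverWait(driver, 20).until(
    lambda d: len(d.find_elements_by_xpath('//a[@class="img"]')) >= 60)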


Selenium can't find a CSS selector

Selenium catches a NoSuchElementException after retrieving exactly 9 entries from the website. I think the problem might be that the page content doesn't have enough time to load, but I'm not sure.
I've written the code following this YouTube tutorial (nineteenth minute).
import requests
import json
import re
from bs4 import BeautifulSoup
from selenium import webdriver
import time

driver = webdriver.Chrome()
URL = 'https://www.alibaba.com//trade/search?fsb=y&IndexArea=product_en&CatId=&SearchText=white+hoodie'
time.sleep(1)
driver.get(URL)
driver.execute_script('window.scrollTo(0, document.body.scrollHeight)')
time.sleep(2)
driver.execute_script('window.scrollTo(0, document.body.scrollHeight)')
time.sleep(2)

items = driver.find_elements_by_css_selector('.J-offer-wrapper')
num = 1
for i in items:
    print(num)
    product_name = i.find_element_by_css_selector('h4').text
    price = i.find_element_by_css_selector('.elements-offer-price-normal').text
    time.sleep(0.5)
    num += 1
    print(price, product_name)
#driver.close()
If you have a clue why Selenium stops at the 10th entry and how to overcome this issue, please share.
You are getting that because the 10th item is not like the rest. It's an ad thingy and not a hoodie as you've searched for. I suspect you'd want to exclude this so you are left only with the results you are actually interested in.
All you need to do is change the way you identify items (this is just one of the options):
items = driver.find_elements_by_css_selector('.img-switcher-parent')
You also need to update the error handling as below:
for i in items:
    print(num)
    try:
        product_name = i.find_element_by_css_selector('h4').text
    except:
        product_name = ''
    try:
        price = i.find_element_by_css_selector('.elements-offer-price-normal').text
    except:
        price = ''
    time.sleep(0.5)
    num += 1
    print(price, product_name)

How to Download webpage as .mhtml

I am able to successfully open a URL and save the resultant page as a .html file. However, I am unable to determine how to download and save a .mhtml (Web Page, Single File).
My code is:
import urllib.parse, time
from urllib.parse import urlparse
import urllib.request
url = ('https://www.example.com')
encoded_url = urllib.parse.quote(url, safe='')
print(encoded_url)
base_url = ("https://translate.google.co.uk/translate?sl=auto&tl=en&u=")
translation_url = base_url+encoded_url
print(translation_url)
req = urllib.request.Request(translation_url, headers={'User-Agent': 'Mozilla/6.0'})
print(req)
response = urllib.request.urlopen(req)
time.sleep(15)
print(response)
webContent = response.read()
print(webContent)
f = open('GoogleTranslated.html', 'wb')
f.write(webContent)
print(f)
f.close()
I have tried to use wget using the details captured in this question:
How to download a webpage (mhtml format) using wget in python, but the details are incomplete (or I am simply unable to understand them).
Any suggestions would be helpful at this stage.
Did you try using Selenium with a Chrome Webdriver to save the page?
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.expected_conditions import visibility_of_element_located
from selenium.webdriver.support.ui import WebDriverWait
import pyautogui
URL = 'https://en.wikipedia.org/wiki/Python_(programming_language)'
FILE_NAME = ''
# open page with selenium
# (first need to download Chrome webdriver, or a firefox webdriver, etc)
driver = webdriver.Chrome()
driver.get(URL)
# wait until body is loaded
WebDriverWait(driver, 60).until(visibility_of_element_located((By.TAG_NAME, 'body')))
time.sleep(1)
# open 'Save as...' to save html and assets
pyautogui.hotkey('ctrl', 's')
time.sleep(1)
if FILE_NAME != '':
    pyautogui.typewrite(FILE_NAME)
pyautogui.hotkey('enter')
I have a better solution, which does not involve any manual operation and lets you specify the path for the .mhtml file. I learned this from a Chinese blog. The key idea is to use a Chrome DevTools command.
The code is shown below as an example.
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://www.qq.com/')
# Execute a Chrome DevTools command to obtain the mhtml snapshot
res = driver.execute_cdp_cmd('Page.captureSnapshot', {})
# Write the file locally
with open('./store/qq.mhtml', 'w', newline='') as f:
    f.write(res['data'])
driver.quit()
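One practical caveat with the snippet above: open() raises FileNotFoundError if the ./store directory does not already exist, so it may be worth creating it first:

import os

os.makedirs('./store', exist_ok=True)  # ensure the target directory exists before writing the snapshot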
Hope this will help!
More details are in the Chrome DevTools Protocol documentation.
To save as mhtml, you need to add the argument '--save-page-as-mhtml':
options = webdriver.ChromeOptions()
options.add_argument('--save-page-as-mhtml')
driver = webdriver.Chrome(options=options)
I wrote it just the way it was. Sorry if it's wrong.
I created a class so you can reuse it. The usage example is in the three lines at the bottom.
Also, you can change the number of seconds to sleep as you like.
Incidentally, non-English keyboards such as Japanese and Hangul keyboards are also supported.
import chromedriver_binary
from selenium import webdriver
import pyautogui
import pyperclip
import uuid
import time

class DonwloadMhtml(webdriver.Chrome):
    def __init__(self):
        super().__init__()
        self._first_save = True
        time.sleep(2)

    def save_page(self, url, filename=None):
        self.get(url)
        time.sleep(3)
        # open 'Save as...' to save html and assets
        pyautogui.hotkey('ctrl', 's')
        time.sleep(1)
        # paste the file name via the clipboard so non-English names also work
        if filename is None:
            pyperclip.copy(str(uuid.uuid4()))
        else:
            pyperclip.copy(filename)
        time.sleep(1)
        pyautogui.hotkey('ctrl', 'v')
        time.sleep(2)
        # extra dialog navigation that is only needed on the first save
        if self._first_save:
            pyautogui.hotkey('tab')
            time.sleep(1)
            pyautogui.press('down')
            time.sleep(1)
            pyautogui.press('up')
            time.sleep(1)
            pyautogui.hotkey('enter')
            time.sleep(1)
            self._first_save = False
        pyautogui.hotkey('enter')
        time.sleep(1)

# example
dm = DonwloadMhtml()
dm.save_page('https://en.wikipedia.org/wiki/Python_(programming_language)', 'wikipedia_python')  # creates a file named "wikipedia_python.mhtml"
dm.save_page('https://www.python.org/')  # file named randomly based on uuid4
python3.8.10
selenium==4.4.3

Selenium finds only a fraction of href links

I am trying to get all the product URLs from this webpage, but I managed to get only a fraction of them.
My first attempt was to scrape the webpage with Beautiful Soup, but then I realized Selenium would be better, as I needed to click the "Show more" button several times. I also added code to scroll down the page, as I thought that was the problem, but the result didn't change.
import time
from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

def getListingLinks(link):
    # Open the driver
    driver = webdriver.Chrome(executable_path="")
    driver.maximize_window()
    driver.get(link)
    time.sleep(3)

    # scroll down: repeated to ensure it reaches the bottom and all items are loaded
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(3)
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(3)

    listing_links = []
    while True:
        try:
            driver.execute_script("return arguments[0].scrollIntoView(true);", WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, '//*[@id="main-content"]/div[2]/div[2]/div[4]/button'))))
            driver.execute_script("arguments[0].click();", WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "#main-content > div:nth-child(2) > div.main-column > div.btn-wrapper.center > button"))))
            print("Button clicked")
            links = driver.find_elements_by_class_name('fop-contentWrapper')
            for link in links:
                algo = link.find_element_by_css_selector('.fop-contentWrapper a').get_attribute('href')
                print(algo)
                listing_links.append(str(algo))
        except:
            print("No more Buttons")
            break
    driver.close()
    return listing_links

fresh_food = getListingLinks("https://www.ocado.com/browse/fresh-20002")
print(len(fresh_food))  # Output: 228
As you can see, I get 228 URLs, while I would like to get 5605 links, which is the actual number of products on the webpage according to Ocado. I believe I have a problem with the order of my code, but I can't find the proper order. I would sincerely appreciate any help.
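One possible reordering, sketched below and untested against the live site (the selectors and the Selenium 3 style API are taken from the question): exhaust the "Show more" button first, and only then collect the links in a single pass, so every product is on the page before anything is counted and nothing is appended twice.

import time
from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

def getListingLinks(link):
    driver = webdriver.Chrome()  # or pass executable_path as in the question
    driver.maximize_window()
    driver.get(link)
    time.sleep(3)

    # 1. Keep clicking "Show more" until the button can no longer be found
    while True:
        try:
            button = WebDriverWait(driver, 10).until(EC.element_to_be_clickable(
                (By.CSS_SELECTOR, "#main-content > div:nth-child(2) > div.main-column > div.btn-wrapper.center > button")))
            driver.execute_script("arguments[0].scrollIntoView(true);", button)
            driver.execute_script("arguments[0].click();", button)
            time.sleep(2)  # give the next batch of products time to render
        except Exception:
            break  # no more "Show more" button

    # 2. Only now walk the full product list once
    listing_links = []
    for product in driver.find_elements_by_class_name('fop-contentWrapper'):
        listing_links.append(product.find_element_by_css_selector('a').get_attribute('href'))

    driver.close()
    return listing_links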

How can I fix encoding problems without a metric-ton of .replace()? Python3 Chrome-Driver BS4?

The print() command prints the scraped website perfectly to the IDLE shell. However, write/writelines/print will not write to a file without throwing many encoding errors or producing garbled super-geek-squad text.
I tried various forms of .encode(encoding='...', errors='...') to no avail.
When I tried many different encodings, they would turn into super-geek-squad formats or multiple ?'s inside the text file.
If I wanted to spend 10 years doing .replace('...','...'), as shown in the code where text is defined, I can get this to work completely:
#! python3
import os
import os.path
from os import path
import requests
import bs4 as BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.chrome.options import Options

def Close():
    driver.stop_client()
    driver.close()
    driver.quit()

CHROMEDRIVER_PATH = 'E:\Downloads\chromedriver_win32\chromedriver.exe'

# start raw html
NovelName = 'Novel/Isekai-Maou-to-Shoukan-Shoujo-Dorei-Majutsu'
BaseURL = 'https://novelplanet.com'
url = '%(U)s/%(N)s' % {'U': BaseURL, "N": NovelName}

options = Options()
options.add_experimental_option("excludeSwitches", ["ignore-certificate-errors"])
#options.add_argument("--headless") # Runs Chrome in headless mode.
#options.add_argument('--no-sandbox') # Bypass OS security model
#options.add_argument('--disable-gpu') # applicable to windows os only
options.add_argument('start-maximized')
options.add_argument('disable-infobars')
#options.add_argument("--disable-extensions")

driver = webdriver.Chrome(CHROMEDRIVER_PATH, options=options)
driver.get(url)

# wait for title not to be equal to "Please wait 5 seconds..."
wait = WebDriverWait(driver, 10)
wait.until(lambda driver: driver.title != "Please wait 5 seconds...")
soup = BeautifulSoup.BeautifulSoup(driver.page_source, 'html.parser')
# End raw html

# Start get first chapter html coded
i = 0
for chapterLink in soup.find_all(class_='rowChapter'):
    i += 1
cLink = chapterLink.find('a').contents[0].strip()
print(driver.title)
# end get first chapter html coded

# start navigate to first chapter
link = driver.find_element_by_link_text(cLink)
link.click()
# end navigate to first chapter

# start copy of chapter and add to a file
def CopyChapter():
    wait = WebDriverWait(driver, 10)
    wait.until(lambda driver: driver.title != "Please wait 5 seconds...")
    print(driver.title)
    soup = BeautifulSoup.BeautifulSoup(driver.page_source, 'html.parser')
    readables = soup.find(id='divReadContent')
    text = readables.text.strip().replace('混','').replace('魔','').replace('族','').replace('デ','').replace('イ','').replace('ー','').replace('マ','').replace('ン','').replace('☆','').replace('ッ','Uh').replace('『','[').replace('』',']').replace('“','"').replace('”','"').replace('…','...').replace('ー','-').replace('○','0').replace('×','x').replace('《',' <<').replace('》','>> ').replace('「','"').replace('」','"')
    name = driver.title
    file_name = (name.replace('Read ', "").replace(' - NovelPlanet', "") + '.txt')
    print(file_name)
    #print(text) # <-- This shows the correct text in the shell with no errors
    with open(file_name, 'a+') as file:
        print(text, file=file) # <- this never works without a bunch of .replace() where text is defined
    global lastURL
    lastURL = driver.current_url
    NextChapter()
# end copy of chapter and add to a file

# start goto next chapter if it exists, then return to copy chapter, else Close()
def NextChapter():
    soup = BeautifulSoup.BeautifulSoup(driver.page_source, 'html.parser')
    a = 0
    main = soup.find(class_='wrapper')
    for container in main.find_all(class_='container'):
        a += 1
    row = container.find(class_='row')
    b = 0
    for chapterLink in row.find_all(class_='4u 12u(small)'):
        b += 1
    cLink = chapterLink.find('a').contents[0].strip()
    link = driver.find_element_by_link_text(cLink)
    link.click()
    wait = WebDriverWait(driver, 10)
    wait.until(lambda driver: driver.title != "Please wait 5 seconds...")
    global currentURL
    currentURL = driver.current_url
    if currentURL != lastURL:
        CopyChapter()
    else:
        print('Finished!!!')
        Close()
# end goto next chapter if exists then return to copy chapter else Close()

CopyChapter()
#EOF
Expected results would have the text file output exactly the same as the IDLE print(text), with absolutely no changes. Then I would be able to test whether every chapter gets copied for offline viewing and that it stops at the last chapter posted.
At the current time, unless I keep adding more and more .replace() calls for every novel and chapter, this won't ever work properly. I wouldn't mind manually removing the ad descriptions by using .replace(), but if there is also a better way to do that, how can it be done?
Windows 10
Python 3.7.0
There was some reason for os and os.path in an earlier version of this script but now I don't remember if it is still needed or not.
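A likely culprit, offered as an assumption rather than a verified fix: on Windows, open() defaults to the cp1252 codec, which cannot encode the Japanese and other full-width characters in the scraped text, so the write fails unless those characters are stripped with .replace(). Opening the file explicitly as UTF-8 usually lets the text be written unchanged:

# Assumed fix (not verified against the site): write the chapter as UTF-8 so
# Windows' default cp1252 codec is never used. This replaces the open() call in
# CopyChapter(); file_name and text are the variables already defined there.
with open(file_name, 'a+', encoding='utf-8', errors='replace') as file:
    print(text, file=file)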

Scroll down pages with selenium and python

This is the problem:
I am using Selenium to download all the successful projects from this webpage ("https://www.rockethub.com/projects"). The URL does not change if I click on any button.
I'm interested in successful projects, so I click on the Status button and then on Successful.
Once on this page I need to scroll down repeatedly to make other URLs appear.
Here is the problem: so far I have not been able to scroll down the page.
This is my code:
from selenium.webdriver import Firefox
from selenium import webdriver
url="https://www.rockethub.com/projects"
link=[]
wd = webdriver.Firefox()
wd.get(url)
next_button = wd.find_element_by_link_text('Status')
next_button.click()
next_but = wd.find_element_by_link_text('Successful')
next_but.click()
wd.execute_script("window.scrollTo(0, document.body.scrollHeight);")
Any idea on how to solve this?
Thanks
Giangi
Since the content is updated dynamically, you need to wait for a change of the content before executing the next step:
class element_is_not(object):
    """An expectation for checking that the element returned by
    the locator is not equal to a given element.
    """
    def __init__(self, locator, element):
        self.locator = locator
        self.element = element

    def __call__(self, driver):
        new_element = driver.find_element(*self.locator)
        return new_element if self.element != new_element else None
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
driver = webdriver.Firefox()
wait = WebDriverWait(driver, 10)
driver.get("https://www.rockethub.com/projects")
# get the last box
by_last_box = (By.CSS_SELECTOR, '.project-box:last-of-type')
last_box = wait.until(element_is_not(by_last_box, None))
# click on menu Status > Successful
driver.find_element_by_link_text('Status').click()
driver.find_element_by_link_text('Successful').click()
# wait for a new box to be added
last_box = wait.until(element_is_not(by_last_box, last_box))
# scroll down the page
driver.execute_script("window.scrollTo(0, document.documentElement.scrollHeight);")
# wait for a new box to be added
last_box = wait.until(element_is_not(by_last_box, last_box))
Run wd.execute_script("window.scrollTo(0, document.body.scrollHeight);") in a loop, since each time the script is executed only a certain amount of new data is retrieved.
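A minimal sketch of that loop, assuming the same wd Firefox driver as in the question: keep scrolling and stop once document.body.scrollHeight stops growing, meaning no new projects were appended.

import time

last_height = wd.execute_script("return document.body.scrollHeight")
while True:
    wd.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # give the page time to append the next batch of projects
    new_height = wd.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break  # nothing new was loaded, so we have reached the bottom
    last_height = new_height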
If you are just looking to retrieve all the successful projects at once and not interested in simulating the scrolling down to the page, then look at this answer, it may help.
