Selenium can't find a CSS selector - python-3.x

Selenium raises a NoSuchElementException after retrieving exactly 9 entries from the website. I think the problem might be that the page contents don't have enough time to load, but I'm not sure.
I've written the code following this YouTube tutorial (nineteenth minute).
import requests
import json
import re
from bs4 import BeautifulSoup
from selenium import webdriver
import time

driver = webdriver.Chrome()
URL = 'https://www.alibaba.com//trade/search?fsb=y&IndexArea=product_en&CatId=&SearchText=white+hoodie'
time.sleep(1)
driver.get(URL)
driver.execute_script('window.scrollTo(0, document.body.scrollHeight)')
time.sleep(2)
driver.execute_script('window.scrollTo(0, document.body.scrollHeight)')
time.sleep(2)

items = driver.find_elements_by_css_selector('.J-offer-wrapper')
num = 1
for i in items:
    print(num)
    product_name = i.find_element_by_css_selector('h4').text
    price = i.find_element_by_css_selector('.elements-offer-price-normal').text
    time.sleep(0.5)
    num += 1
    print(price, product_name)
#driver.close()
If you have a clue why Selenium stops at the 10th entry and how to overcome this issue, please share.

You are getting that because the 10th item is not like the rest. It's an ad thingy and not a hoodie as you've searched for. I suspect you'd want to exclude this so you are left only with the results you are actually interested in.
All you need to do is change the way you identify items (this is just one of the options):
items = driver.find_elements_by_css_selector('.img-switcher-parent')

You also need to add error handling, as shown below:
for i in items:
    print(num)
    try:
        product_name = i.find_element_by_css_selector('h4').text
    except:
        product_name = ''
    try:
        price = i.find_element_by_css_selector('.elements-offer-price-normal').text
    except:
        price = ''
    time.sleep(0.5)
    num += 1
    print(price, product_name)
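If you still suspect a timing issue, an explicit wait is more robust than fixed sleeps. A minimal sketch, assuming the '.img-switcher-parent' selector suggested above; the inner selectors are taken unchanged from the original code:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://www.alibaba.com//trade/search?fsb=y&IndexArea=product_en&CatId=&SearchText=white+hoodie')

# Wait up to 10 seconds for at least one result card to be present.
wait = WebDriverWait(driver, 10)
items = wait.until(EC.presence_of_all_elements_located(
    (By.CSS_SELECTOR, '.img-switcher-parent')))

for num, i in enumerate(items, start=1):
    try:
        product_name = i.find_element(By.CSS_SELECTOR, 'h4').text
    except Exception:
        product_name = ''
    try:
        price = i.find_element(By.CSS_SELECTOR, '.elements-offer-price-normal').text
    except Exception:
        price = ''
    print(num, price, product_name)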

Related

Add Image/thumbnail in my python dataframe

I am working on a project where I need to create a movie database.
I have created my database and imported the links from IMDb that redirect you to each movie's webpage. I would also like to add the main image/thumbnail of each movie so that I can then use the CSV in Power BI.
However, I did not manage to do it:
I have tried this:
import requests
from bs4 import BeautifulSoup
import numpy as np

images = []
for i in df_database_url['Url Film']:
    r = requests.get(i)
    soup = BeautifulSoup(r.content, "html.parser")
    images.append(image_url)
But my goal is to have a column that includes the thumbnail for each movie.
Assuming that i is an IMDb movie URL (the kind that starts with https://www.imdb.com/title), you can target the script tag that seems to contain most of the main information for the movie; you can get that with
# import json
image_url = json.loads(soup.select_one('script[type="application/ld+json"]').text)['image']
or, if we're more cautious:
# import json
scCont = [s.text for s in soup.select('script[type="application/ld+json"]') if '"image"' in s.text]
if scCont:
    try:
        scCont = json.loads(scCont[0])
        if 'image' not in scCont:
            image_url = None
            print('No image found for', i)
        else:
            image_url = scCont['image']
    except Exception as e:
        image_url = None
        print('Could not parse movie info for', i, '\n', str(e))
else:
    image_url = None
    print('Could not find script with movie info for', i)
(and you can get the trailer thumbnail with scCont['trailer']['thumbnailUrl'])
This way, instead of raising an error when anything on the path to the expected info is unavailable, it will just set image_url to None; if you want it to halt and raise an error in such cases, use the first version.
and then after the loop you can add in the column with something like
df_database_url['image_urls'] = images
(you probably know that...)
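Putting it together, a rough sketch of the full loop (assuming df_database_url already exists with a 'Url Film' column, as in the question, and condensing the cautious parsing above into the simple one-liner):

import json
import requests
from bs4 import BeautifulSoup

images = []
for i in df_database_url['Url Film']:
    r = requests.get(i)
    soup = BeautifulSoup(r.content, "html.parser")
    script = soup.select_one('script[type="application/ld+json"]')
    # Fall back to None if the page has no JSON-LD block or no 'image' key.
    image_url = None
    if script:
        try:
            image_url = json.loads(script.text).get('image')
        except Exception:
            pass
    images.append(image_url)

# One thumbnail URL (or None) per row, in the same order as the URLs.
df_database_url['image_urls'] = images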

Can't scroll through a div more than once | Selenium | Python

When I run this, it only manages to scroll down once and throws a "Message: element not interactable" error (it's supposed to scroll twice). When I tried to run it in a loop (with a try/except to ignore the error) and scrolled around manually, it would keep pushing me back up to a specific position. That's strange, because I'm using arrow keys here, not a move-to-element:
ActionChains(driver).move_to_element(driver.sl.find_element_by_id('my-id')).perform()
I've tried giving everything more time to load with sleep, hovering over the element and clicking it to make it interactable, and using other scrolling methods such as driver.execute_script("window.scrollTo(0, Y)").
I'm very lost at this point and don't know what to do.
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys
from selenium import webdriver
from datetime import date
from datetime import datetime
from time import sleep
from random import *
import random, json, selenium, os.path, os
driver = webdriver.Chrome('/Users/apple/Downloads/chromedriver')
driver.maximize_window()
driver.get('https://instagram.com')
sleep(7)
username_form = driver.find_element_by_xpath('/html/body/div[1]/section/main/article/div[2]/div[1]/div/form/div/div[1]/div/label/input')
username_form.clear()
username_form.send_keys('ENTER INSTA USER HERE')
password_form = driver.find_element_by_xpath('/html/body/div[1]/section/main/article/div[2]/div[1]/div/form/div/div[2]/div/label/input')
password_form.clear()
password_form.send_keys('ENTER INSTA PASS HERE')
button_click = driver.find_element_by_xpath('/html/body/div[1]/section/main/article/div[2]/div[1]/div/form/div/div[3]/button')
try:
    button_click.click()
except:
    driver.execute_script("arguments[0].click();", button_click)
sleep(4)
driver.get('https://instagram.com/p/CQ_sfAeFl5s/')
sleep(4)
like_meter = driver.find_element_by_class_name('zV_Nj')
like_meter.click()
sleep(1)
try:
    scroll_zone = driver.find_element_by_xpath('/html/body/div[5]/div/div/div[2]/div/div')
except:
    scroll_zone = driver.find_element_by_xpath('/html/body/div[4]/div/div/div[2]/div/div')
scroll_zone.click()
sleep(0.5)
hover = ActionChains(driver).move_to_element(scroll_zone)
hover.perform()
sleep(0.5)
scroll_zone.send_keys(Keys.ARROW_DOWN)
scroll_zone.send_keys(Keys.ARROW_DOWN)
If you want to scroll through the list of people who liked that post, you can do this:
like_meter = driver.find_element_by_class_name('zV_Nj')
like_meter.click()
sleep(1)
elem = driver.find_element_by_css_selector("div[role='dialog'] div[style*='padding']")
for n in range(10):
    driver.execute_script("arguments[0].scrollTop += 20", elem)
The range of 10 iterations and the 20-pixel scroll step can be adjusted according to your needs.
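If you don't know in advance how many entries the list has, one option is to keep scrolling the dialog until its scroll position stops changing. A rough sketch that reuses the elem located above (Instagram's class names and layout change often, so treat the selectors as placeholders):

from time import sleep

last_pos = -1
while True:
    # Scroll the likes dialog down by one visible "page".
    driver.execute_script("arguments[0].scrollTop += arguments[0].clientHeight", elem)
    sleep(1)  # give newly loaded entries time to render
    pos = driver.execute_script("return arguments[0].scrollTop", elem)
    if pos == last_pos:
        # The position no longer changes, so we've reached the bottom.
        break
    last_pos = pos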

os.listdir() won't go after nine files

Hello, I am trying to make a program that automatically goes to imgur, enters the name you typed, and downloads the top 10 images. Everything is working except the os part: when I try to do os.listdir(), after nine files it won't show any more files. I tried googling and found nothing. If you see something that I messed up, please tell me. Thanks in advance. Sorry for bad grammar.
Here is the code sample:
#! python3
import requests, os, sys
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

os.chdir('imgur/')
browser = webdriver.Chrome(executable_path=r'C:\Users\{YOUR USERNAME}\Downloads\chromedriver.exe')
browser.get('https://imgur.com/')
browser.maximize_window()
search_bar = browser.find_element_by_tag_name('input')
search_bar.send_keys('happy')
search_bar.send_keys(Keys.ENTER)
pictures = browser.find_elements_by_tag_name('img')
for i in range(1, 11):
    res = requests.get(pictures[i].get_attribute('src'))
    try:
        res.raise_for_status()
    except:
        print("Link doesn't exist")
    if os.listdir() == []:
        picture = open('picture1.png', 'wb')
    else:
        picture = open('picture' + str(int(os.listdir()[-1][7:-4]) + 1) + '.png', 'wb')
    print(os.listdir())
    for chunk in res.iter_content(100000):
        picture.write(chunk)
    picture.close()
os.listdir(".")  # you missed adding the path
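As a side note (not part of the answer above), the filename arithmetic can also misbehave once you pass nine files: os.listdir() returns names in arbitrary order, and even when sorted, 'picture10.png' comes before 'picture2.png', so os.listdir()[-1] may not be the newest file. A hedged sketch of picking the next number without relying on list order:

import os, re

def next_picture_name(directory='.'):
    # Collect the numeric suffixes of existing 'pictureN.png' files.
    nums = [int(m.group(1))
            for name in os.listdir(directory)
            for m in [re.fullmatch(r'picture(\d+)\.png', name)]
            if m]
    return 'picture{}.png'.format(max(nums) + 1 if nums else 1)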

Selenium doesn't get all the href from a web page

I am trying to get all the href links from https://search.yhd.com/c0-0-1003817/ (the ones that lead to the specific products), but although my code runs, it only gets 30 links. I don't know why this is happening. Could you help me, please?
I've been working with Selenium (Python 3.7), but previously I also tried to get the links with Beautiful Soup. That didn't work either.
from selenium import webdriver
import time
import requests
import pandas as pd

def getListingLinks(link):
    # Open the driver
    driver = webdriver.Chrome()
    driver.get(link)
    time.sleep(3)
    # Save the links
    listing_links = []
    links = driver.find_elements_by_xpath('//a[@class="img"]')
    for link in links:
        listing_links.append(str(link.get_attribute('href')))
    driver.close()
    return listing_links

imported = getListingLinks("https://search.yhd.com/c0-0-1003817/")
I should get 60 links, but I am only managing to get 30 with my code.
At initial load, the page contains only 30 images/links; only when you scroll down does it load all 60 items. You need to do the following:
def getListingLinks(link):
    # Open the driver
    driver = webdriver.Chrome()
    driver.maximize_window()
    driver.get(link)
    time.sleep(3)
    # scroll down: repeated to ensure it reaches the bottom and all items are loaded
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(3)
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(3)
    # Save the links
    listing_links = []
    links = driver.find_elements_by_xpath('//a[@class="img"]')
    for link in links:
        listing_links.append(str(link.get_attribute('href')))
    driver.close()
    return listing_links

imported = getListingLinks("https://search.yhd.com/c0-0-1003817/")
print(len(imported))  ## Output: 60
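If you'd rather not rely on fixed sleeps, a variant of the same idea is to keep scrolling until the number of matched links stops growing. A rough sketch under the same assumptions (same imports and XPath as above, and the page lazy-loads items on scroll):

def getListingLinks(link):
    driver = webdriver.Chrome()
    driver.maximize_window()
    driver.get(link)
    time.sleep(3)

    previous_count = -1
    while True:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(2)  # wait for lazy-loaded items to render
        links = driver.find_elements_by_xpath('//a[@class="img"]')
        if len(links) == previous_count:
            break  # no new items appeared, stop scrolling
        previous_count = len(links)

    listing_links = [str(a.get_attribute('href')) for a in links]
    driver.close()
    return listing_links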

optimise scraping and requesting web page

How should I optimise my time in making requests
link = ['http://youtube.com/watch?v=JfLt7ia_mLg',
        'http://youtube.com/watch?v=RiYRxPWQnbE',
        'http://youtube.com/watch?v=tC7pBOPgqic',
        'http://youtube.com/watch?v=3EXl9xl8yOk',
        'http://youtube.com/watch?v=3vb1yIBXjlM',
        'http://youtube.com/watch?v=8UBY0N9fWtk',
        'http://youtube.com/watch?v=uRPf9uDplD8',
        'http://youtube.com/watch?v=Coattwt5iyg',
        'http://youtube.com/watch?v=WaprDDYFpjE',
        'http://youtube.com/watch?v=Pm5B-iRlZfI',
        'http://youtube.com/watch?v=op3hW7tSYCE',
        'http://youtube.com/watch?v=ogYN9bbU8bs',
        'http://youtube.com/watch?v=ObF8Wz4X4Jg',
        'http://youtube.com/watch?v=x1el0wiePt4',
        'http://youtube.com/watch?v=kkeMYeAIcXg',
        'http://youtube.com/watch?v=zUdfNvqmTOY',
        'http://youtube.com/watch?v=0ONtIsEaTGE',
        'http://youtube.com/watch?v=7QedW6FcHgQ',
        'http://youtube.com/watch?v=Sb33c9e1XbY']
I have a list of 15-20 links from the first page of a YouTube search result. The task is to get the likes, dislikes, and view count from each video URL, and for that this is what I did:
def parse(url, i, arr):
    req = requests.get(url)
    soup = bs4.BeautifulSoup(req.text, "lxml")  #, 'html5lib')
    try:
        likes = int(soup.find("button", attrs={"title": "I like this"}).getText().__str__().replace(",", ""))
    except:
        likes = 0
    try:
        dislikes = int(soup.find("button", attrs={"title": "I dislike this"}).getText().__str__().replace(",", ""))
    except:
        dislikes = 0
    try:
        view = int(soup.find("div", attrs={"class": "watch-view-count"}).getText().__str__().split()[0].replace(",", ""))
    except:
        view = 0
    arr[i] = (likes, dislikes, view, url)
    time.sleep(0.3)

def parse_list(link):
    arr = len(link) * [0]
    threadarr = len(link) * [0]
    import threading
    a = time.clock()
    for i in range(len(link)):
        threadarr[i] = threading.Thread(target=parse, args=(link[i], i, arr))
        threadarr[i].start()
    for i in range(len(link)):
        threadarr[i].join()
    print(time.clock() - a)
    return arr

arr = parse_list(link)
Now I am getting the populated result array in about 6 seconds. Is there any faster way I can get my array (arr) so that it takes much less than 6 seconds?
The first 4 elements of my array look like this, so you get a rough idea:
[(105, 11, 2836, 'http://youtube.com/watch?v=JfLt7ia_mLg'),
(32, 18, 5420, 'http://youtube.com/watch?v=RiYRxPWQnbE'),
(45, 3, 7988, 'http://youtube.com/watch?v=tC7pBOPgqic'),
(106, 38, 4968, 'http://youtube.com/watch?v=3EXl9xl8yOk')]
Thanks in advance :)
I would use a multiprocessing Pool object for this particular case.
import requests
import bs4
from multiprocessing import Pool, cpu_count

links = [
    'http://youtube.com/watch?v=JfLt7ia_mLg',
    'http://youtube.com/watch?v=RiYRxPWQnbE',
    'http://youtube.com/watch?v=tC7pBOPgqic',
    'http://youtube.com/watch?v=3EXl9xl8yOk'
]

def parse_url(url):
    req = requests.get(url)
    soup = bs4.BeautifulSoup(req.text, "lxml")  #, 'html5lib'
    try:
        likes = int(soup.find("button", attrs={"title": "I like this"}).getText().__str__().replace(",", ""))
    except:
        likes = 0
    try:
        dislikes = int(soup.find("button", attrs={"title": "I dislike this"}).getText().__str__().replace(",", ""))
    except:
        dislikes = 0
    try:
        view = int(soup.find("div", attrs={"class": "watch-view-count"}).getText().__str__().split()[0].replace(",", ""))
    except:
        view = 0
    return (likes, dislikes, view, url)

pool = Pool(cpu_count())  # number of processes
data = pool.map(parse_url, links)  # this is where your results are
This is cleaner as you only have one function to write and you end up with exactly the same results.
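Since the work is network-bound rather than CPU-bound, a thread pool is another option worth trying; concurrent.futures keeps it just as short. A sketch reusing the parse_url function defined above:

from concurrent.futures import ThreadPoolExecutor

# Threads work well here because the workers spend most of their time
# waiting on HTTP responses rather than using the CPU.
with ThreadPoolExecutor(max_workers=10) as executor:
    data = list(executor.map(parse_url, links))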
This is not a workaround, but it spares your script the try/except blocks, which do slow the operation down somewhat.
import requests
from bs4 import BeautifulSoup

for url in links:
    response = requests.get(url).text
    soup = BeautifulSoup(response, "html.parser")
    for item in soup.select("div#watch-header"):
        view = item.select("div.watch-view-count")[0].text
        likes = item.select("button[title~='like'] span.yt-uix-button-content")[0].text
        dislikes = item.select("button[title~='dislike'] span.yt-uix-button-content")[0].text
        print(view, likes, dislikes)
