The webpage shows that there are 702 Comments.
I wrote a function get_total_youtube_comments(url); much of the code is copied from a project on GitHub.
def get_total_youtube_comments(url):
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.keys import Keys
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.common.by import By
    import time

    options = webdriver.ChromeOptions()
    options.add_argument('--no-sandbox')
    options.add_argument('--disable-dev-shm-usage')
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options, executable_path='/usr/bin/chromedriver')
    wait = WebDriverWait(driver, 60)
    driver.get(url)

    SCROLL_PAUSE_TIME = 2
    CYCLES = 7
    html = driver.find_element_by_tag_name('html')
    html.send_keys(Keys.PAGE_DOWN)
    html.send_keys(Keys.PAGE_DOWN)
    time.sleep(SCROLL_PAUSE_TIME * 3)
    for i in range(CYCLES):
        html.send_keys(Keys.END)
        time.sleep(SCROLL_PAUSE_TIME)
    comment_elems = driver.find_elements_by_xpath('//*[@id="content-text"]')
    all_comments = [elem.text for elem in comment_elems]
    return all_comments
Let's try to parse all comments on a sample webpage: https://www.youtube.com/watch?v=N0lxfilGfak.
url='https://www.youtube.com/watch?v=N0lxfilGfak'
comments = get_total_youtube_comments(url)
It gets some comments, but only a small part of all of them.
len(comments)
60
60 is much less than 702. How can I get all the comments on YouTube with Selenium?
@supputuri, I can extract all comments with your code.
comments_list = driver.find_elements_by_xpath("//*[@id='content-text']")
len(comments_list)
709
print(driver.find_element_by_xpath("//h2[@id='count']").text)
717 Comments
comments_list[-1].text
'mistake at 23:11 \nin NOT it should return false if x is true.'
comments_list[0].text
'Got a question on the topic? Please share it in the comment section below and our experts will answer it for you. For Edureka Python Course curriculum, Visit our Website: Use code "YOUTUBE20" to get Flat 20% off on this training.'
Why is the comment count 709 instead of the 717 shown on the page?
You are getting a limited number of comments because YouTube loads the comments as you keep scrolling down. There are around 394 comments left on that video; you first have to make sure all the comments are loaded and then also expand all the View Replies links so that you reach the maximum comment count.
Note: I was able to get 700 comments using the below lines of code.
# get the last comment
lastEle = driver.find_element_by_xpath("(//*[@id='content-text'])[last()]")
# scroll to the last comment currently loaded
lastEle.location_once_scrolled_into_view
# wait until the comments loading is done
WebDriverWait(driver, 30).until(EC.invisibility_of_element((By.CSS_SELECTOR, "div.active.style-scope.paper-spinner")))
# load all comments
while lastEle != driver.find_element_by_xpath("(//*[@id='content-text'])[last()]"):
    lastEle = driver.find_element_by_xpath("(//*[@id='content-text'])[last()]")
    driver.find_element_by_xpath("(//*[@id='content-text'])[last()]").location_once_scrolled_into_view
    time.sleep(2)
    WebDriverWait(driver, 30).until(EC.invisibility_of_element((By.CSS_SELECTOR, "div.active.style-scope.paper-spinner")))
# open all replies
for reply in driver.find_elements_by_xpath("//*[@id='replies']//paper-button[@class='style-scope ytd-button-renderer'][contains(.,'View')]"):
    reply.location_once_scrolled_into_view
    driver.execute_script("arguments[0].click()", reply)
    time.sleep(5)
    WebDriverWait(driver, 30).until(
        EC.invisibility_of_element((By.CSS_SELECTOR, "div.active.style-scope.paper-spinner")))
# print the total number of comments
print(len(driver.find_elements_by_xpath("//*[@id='content-text']")))
There are a couple of things:
The WebElements within the website https://www.youtube.com/ are dynamic, so the comments are dynamically rendered as well.
Within the webpage https://www.youtube.com/watch?v=N0lxfilGfak the comments don't render unless the user scrolls the following element into the viewport.
The comments are within:
<!--css-build:shady-->
which implies that Polymer CSS Builder is used to apply Polymer's CSS Mixin shim and ShadyDOM scoping, so some runtime work is still done to convert CSS selectors under the default settings.
Considering the above mentioned factors here's a solution to retrieve all the comments:
Code Block:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, NoSuchElementException, ElementClickInterceptedException, WebDriverException
import time

options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(options=options, executable_path=r'C:\WebDrivers\chromedriver.exe')
driver.get('https://www.youtube.com/watch?v=N0lxfilGfak')
driver.execute_script("return scrollBy(0, 400);")
subscribe = WebDriverWait(driver, 60).until(EC.visibility_of_element_located((By.XPATH, "//yt-formatted-string[text()='Subscribe']")))
driver.execute_script("arguments[0].scrollIntoView(true);", subscribe)
comments = []
my_length = len(WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//yt-formatted-string[@class='style-scope ytd-comment-renderer' and @id='content-text'][@slot='content']"))))
while True:
    try:
        driver.execute_script("window.scrollBy(0,800)")
        time.sleep(5)
        comments.append([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//yt-formatted-string[@class='style-scope ytd-comment-renderer' and @id='content-text'][@slot='content']")))])
    except TimeoutException:
        driver.quit()
        break
print(comments)
If you don't have to use Selenium, I would recommend you look at the Google/YouTube API.
https://developers.google.com/youtube/v3/getting-started
Example:
https://www.googleapis.com/youtube/v3/commentThreads?key=YourAPIKey&textFormat=plainText&part=snippet&videoId=N0lxfilGfak&maxResults=100
This returns the first 100 results along with a nextPageToken that you can append to the next request to get the following 100 results.
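For illustration, here is a minimal sketch of that pagination loop (an assumption on my part: it uses the requests library, and the response field names come from the YouTube Data API v3 commentThreads resource):

import requests

def fetch_all_top_level_comments(api_key, video_id):
    # page through commentThreads until no nextPageToken is returned
    url = "https://www.googleapis.com/youtube/v3/commentThreads"
    params = {
        "key": api_key,            # your API key
        "textFormat": "plainText",
        "part": "snippet",
        "videoId": video_id,
        "maxResults": 100,
    }
    comments = []
    while True:
        data = requests.get(url, params=params).json()
        for item in data.get("items", []):
            comments.append(item["snippet"]["topLevelComment"]["snippet"]["textDisplay"])
        token = data.get("nextPageToken")
        if not token:
            return comments
        params["pageToken"] = token  # ask for the next page of results

comments = fetch_all_top_level_comments("YourAPIKey", "N0lxfilGfak")
print(len(comments))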
I'm not familiar with Python, but I'll tell you the steps that I would take to get all comments.
First of all, in your code I think the main issue is with
CYCLES = 7
According to this, you will only scroll 7 times, pausing 2 seconds each time. Since you are already successfully grabbing 60 comments, fixing this condition should solve your issue.
I assume you don't have any issue finding elements on a website using locators.
You need to get the total comment count into a variable as an int (in your case, let's say COMMENTS = 715).
Define another variable called VISIBLECOUNTS = 0.
Then use a while loop to keep scrolling while COMMENTS > VISIBLECOUNTS.
The code might look like this (really sorry if there are syntax issues):
# python-selenium commands to scroll until all comments are counted
COMMENTS = 715  # 715 is just a sample value; it will change with the total comment count
VISIBLECOUNTS = 0
SCROLL_PAUSE_TIME = 2
while VISIBLECOUNTS < COMMENTS:
    html.send_keys(Keys.END)
    time.sleep(SCROLL_PAUSE_TIME)
    VISIBLECOUNTS = len(driver.find_elements_by_xpath('//ytm-comment-thread-renderer'))
With this, you keep scrolling down until VISIBLECOUNTS equals COMMENTS. Then you can grab all the comments, as all of them share the same element tag, ytm-comment-thread-renderer.
Since I'm not familiar with Python, I'll add the commands to get the comment counts in JS; you can try them in your browser and convert them into Python commands.
Run the queries below in your console and check.
To get total comments count
var comments = document.querySelector(".comment-section-header-text").innerText.split(" ")
// We get the text value "Comments • 715", split it by spaces, and take the last value
Number(comments[comments.length - 1])
// Then convert the string "715" to an int; you just need to do the same in python-selenium
To get active comments count
$x("//ytm-comment-thread-renderer").length
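For reference, a rough python-selenium translation of those two console queries might look like this (a sketch; it assumes the selectors used above are present on the page you load):

# total comments count: "Comments • 715" -> 715
total_text = driver.find_element_by_css_selector(".comment-section-header-text").text
COMMENTS = int(total_text.split(" ")[-1])
# active (currently loaded) comments count
VISIBLECOUNTS = len(driver.find_elements_by_xpath("//ytm-comment-thread-renderer"))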
Note: if it's hard to extract the values, you can still use the Selenium JS executor and do the scrolling with JS until all the comments are visible. But I guess it's not hard to do in Python since the logic is the same.
I'm really sorry about not being able to add the solution in Python myself, but I hope this helped.
Cheers.
The first thing you need to do is scroll down the video page to load all comments:
$actualHeight = 0;
$nextHeight = 0;
while (true) {
    try {
        $nextHeight += 10;
        $actualHeight = $this->driver->executeScript('return document.documentElement.scrollHeight;');
        if ($nextHeight >= ($actualHeight - 50)) break;
        $this->driver->executeScript("window.scrollTo(0, $nextHeight);");
        $this->driver->manage()->timeouts()->implicitlyWait = 10;
    } catch (Exception $e) {
        break;
    }
}
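That example is php-webdriver; a minimal Python equivalent of the same scroll-until-bottom loop (a sketch, assuming a driver is already on the video page) could be:

import time

# scroll down in small steps until we are near the bottom of the document
next_height = 0
while True:
    next_height += 10
    actual_height = driver.execute_script("return document.documentElement.scrollHeight;")
    if next_height >= actual_height - 50:
        break  # close enough to the bottom
    driver.execute_script("window.scrollTo(0, %d);" % next_height)
    time.sleep(0.1)  # small pause in place of the implicit wait used above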
Related
I wrote a Python script that goes to a site and interacts with some dropdowns. It works perfectly fine if, after I run the script, I quickly make the browser instance full screen so that the elements are in view. If I don't do that, I get the error "Element could not be scrolled into view".
Here is my script:
from selenium import webdriver
driver = webdriver.Firefox()
driver.get("https://example.com")
driver.implicitly_wait(5)
yearbtn = driver.find_element("id", "dropdown_year")
yearbtn.click()
year = driver.find_element("css selector", '#dropdown_ul_year li:nth-child(5)')
year.click()
makebtn = driver.find_element("id", "dropdown_make")
makebtn.click()
make = driver.find_element("css selector", '#dropdown_ul_make li:nth-child(2)')
make.click()
modelbtn = driver.find_element("id", "dropdown_model")
modelbtn.click()
model = driver.find_element("css selector", '#dropdown_ul_model li:nth-child(2)')
model.click()
trimbtn = driver.find_element("id", "dropdown_trim")
trimbtn.click()
trim = driver.find_element("css selector", '#dropdown_ul_trim li:nth-child(2)')
trim.click()
vehicle = driver.find_element("css selector", '#vehiclecontainer > div > p')
vdata = driver.find_element("css selector", '.top-sect .tow-row:nth-child(2)')
print("--------------")
print("Your Vehicle: " + vehicle.text)
print("Vehicle Data: " + vdata.text)
print("--------------")
print("")
driver.close()
Like I said, it works fine if I make the browser full-screen (or manually scroll) so that the elements in question are in view. It finds the element, so what's the issue here? I've tried both Firefox and Chrome.
Generally Firefox opens in maximized mode. In case it doesn't for a specific version of the OS, the best practice is to open it in maximized mode as follows:
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
options = Options()
options.add_argument("start-maximized")
driver = webdriver.Firefox(options=options)
However, in some cases the desired element may not get rendered in due course, so you need to induce WebDriverWait for visibility_of_element_located(), and you can use either of the following locator strategies:
vehicle = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "#vehiclecontainer > div > p")))
vdata = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, ".top-sect .tow-row:nth-child(2)")))
print("--------------")
print("Your Vehicle: " + vehicle.text)
print("Vehicle Data: " + vdata.text)
print("--------------")
print("")
Note: You have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
You can find a relevant discussion in How to retrieve the text of a WebElement using Selenium - Python
I'm not sure what all is happening on the page in question. I use a generic method to take care of common actions like click(), find(), etc. that contain a wait, scroll, and whatever else is useful for that action. I've put an example below. It waits for the element to be clickable (always a best practice), scrolls to the element, and then clicks it. You may want to tweak it to your specific use but it will get you pointed in the right direction.
def click(self, locator):
    element = WebDriverWait(driver, 20).until(EC.element_to_be_clickable(locator))
    driver.execute_script("arguments[0].scrollIntoView();", element)
    element.click()
I would suggest you try this and then convert all of your clicks to use this method and see if that helps, e.g.
click((By.ID, "dropdown_year"))
or
locator = (By.ID, "dropdown_year")
click(locator)
Good evening to all. I am new to Python and was doing one of the exercises in a Udemy course where we are tasked to prepare a program using WebDriver that gets internet speed test results from speedtest.net. I was stuck until I found that someone gave a solution using:
WebDriverWait(self.driver, 50).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div[data-result-id*='true']")))
My question is how and where this value for the CSS selector can be found on the website. Please explain, and also give as much insight as you can regarding Selenium/WebDriver and CSS in Python.
# Libraries
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time

# Constants
PROMISED_DOWN = 150
PROMISED_UP = 10
TWITTER_EMAIL = "my email"
TWITTER_PASSWORD = "my password"

# Class
class InternetSpeedTwitterBot:
    def __init__(self):
        self.s = Service('D:\Python Related Documents and Programsweb development folder\chromedriver_win32\chromedriver.exe')
        self.driver = webdriver.Chrome(service=self.s)
        self.down = None
        self.up = None

    def get_internet_speed(self):
        self.driver.get("https://www.speedtest.net/")
        time.sleep(5)
        go_button = self.driver.find_element(By.CSS_SELECTOR, '.start-button a')
        go_button.click()
        # speed_download = self.driver.find_element(By.CSS_SELECTOR, ".download-speed")
        speed_download = self.driver.find_element(By.XPATH, '//*[@id="container"]/div/div[3]/div/div/div/div[2]/div[3]/div[3]/div/div[3]/div/div/div[2]/div[1]/div[1]/div/div[2]/span')
        WebDriverWait(self.driver, 50).until(
            EC.visibility_of_element_located((By.CSS_SELECTOR, "div[data-result-id*='true']"))
        )
        speed_results = self.driver.find_elements(
            By.CSS_SELECTOR, ".result-container-speed span.result-data-large.number.result-data-value"
        )
        self.down, self.up = (float(result.text) for result in speed_results)
        print(f"Down Speed: {self.down}, Up Speed: {self.up}")
        # go_button.click()

    def tweet_at_provider(self):
        pass

# Object creation
bot = InternetSpeedTwitterBot()

# Calling methods
bot.get_internet_speed()
Here is what I do to find elements:
Open the inspect/dev tools in your browser (I use Chrome) using F12 or right click --> Inspect.
Click on the Elements tab.
Then press Ctrl + F in the Elements panel and a new search bar will appear.
In that bar you can search for elements on your page by XPath or CSS.
Click the picker button located at the top left of your dev tools, then click any element on the page and it will be located in the Elements panel.
If you write an XPath or CSS selector in the search bar, the matching element will be highlighted on the screen; that way you will know which selector you can use.
Some doc:
How to find elements by CSS selector?: Small guide for CSS
How to find elements by XPATH selector?: Small guide for xpath
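Once you have verified a selector in that search bar, the exact same string works in Selenium. A small sketch tying the two together (the CSS selector is the one from the question; the XPath is just an equivalent attribute match):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# the string you tested with Ctrl + F in the Elements tab works unchanged here
result = WebDriverWait(driver, 50).until(
    EC.visibility_of_element_located((By.CSS_SELECTOR, "div[data-result-id*='true']"))
)
# an equivalent XPath form of the same attribute match
result = driver.find_element(By.XPATH, "//div[contains(@data-result-id, 'true')]")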
I need your help. I'm working on a Telegram bot which sends me all the sales from Amazon.
It works well, but this function doesn't work properly; I always get the same error, which blocks the script:
imgs_str = img_div.img.get('data-a-dynamic-image') # a string in Json format
AttributeError: 'NoneType' object has no attribute 'img'
def take_image(soup):
    img_div = soup.find(id="imgTagWrapperId")
    imgs_str = img_div.img.get('data-a-dynamic-image')  # a string in JSON format
    # convert to a dictionary
    imgs_dict = json.loads(imgs_str)
    # each key in the dictionary is a link to an image, and the value shows the size (print the whole dictionary to inspect)
    num_element = 0
    first_link = list(imgs_dict.keys())[num_element]
    return first_link
I still don't understand how to solve this issue.
Thanks to all!
From the looks of the error, soup.find didn't find anything.
Have you tried using images = soup.findAll("img", {"id": "imgTagWrapperId"})?
This will return a list.
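Whatever locator you use, the AttributeError itself means soup.find(id="imgTagWrapperId") returned None, so a defensive version of the original function could guard for that (a sketch; the element may simply be missing from the fetched HTML):

import json

def take_image(soup):
    img_div = soup.find(id="imgTagWrapperId")
    if img_div is None or img_div.img is None:
        return None  # element not in this HTML: layout changed or content is JS-rendered
    imgs_str = img_div.img.get('data-a-dynamic-image')
    if not imgs_str:
        return None
    imgs_dict = json.loads(imgs_str)  # keys are image links, values are sizes
    return next(iter(imgs_dict), None)  # first image link, if any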
Images are not inserted into the HTML page, they are linked from it, so you need to wait until they are loaded. Here I will give you two options:
1) (not recommended, since there may be a margin of error) simply wait until the image is loaded; for this you can use time.sleep().
2) (recommended) I would rather use Selenium WebDriver. You also have to wait when you use Selenium, but the good thing is that Selenium has a dedicated mechanism for this job.
I will show how to do it with Selenium:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException

browser = webdriver.Chrome()
browser.get("url")
delay = 3  # seconds
try:
    myElem = WebDriverWait(browser, delay).until(EC.presence_of_element_located((By.ID, 'imgTagWrapperId')))  # wait for the element you want to find
    print("Page is ready!")
except TimeoutException:
    print("Loading took too much time!")
More documentation:
Code example for way 1
Q/A for way 2
I'm trying to get the course information on http://bulletin.iit.edu/graduate/colleges/science/applied-mathematics/master-data-science/#programrequirementstext.
In my code, I tried to first click on each course, next get the description in the bubble, and then close the bubble as it may overlay on top of other course links.
My problem is that I couldn't get the description in the bubble and some course links were still skipped though I tried to avoid it by closing the bubble.
Any idea about how to do this? Thanks in advance!
info = []
driver = webdriver.Chrome()
driver.get('http://bulletin.iit.edu/graduate/colleges/science/applied-mathematics/master-data-science/#programrequirementstext')
for i in range(1, 3):
    for j in range(2, 46):
        try:
            driver.find_element_by_xpath('//*[@id="programrequirementstextcontainer"]/table[' + str(i) + ']/tbody/tr[' + str(j) + ']/td[1]/a').click()
            info.append(driver.find_elements_by_xpath('/html/body/div[8]/div[3]/div/div')[0].text)
            driver.find_element_by_xpath('//*[@id="lfjsbubbleclose"]').click()
            time.sleep(3)
        except:
            pass
Not sure why you have put static ranges in the for loops, even though not every combination of the i and j indices in your XPath finds an element on the page.
I would suggest finding all the elements on your webpage with a single locator and looping through them to get the descriptions from the bubbles.
Use the code below:
course_list = driver.find_elements_by_css_selector("table.sc_courselist a.bubblelink.code")
wait = WebDriverWait(driver, 20)
for course in course_list:
    try:
        print("grabbing info of course : ", course.text)
        course.click()
        wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div.courseblockdesc")))
        info.append(driver.find_element_by_css_selector('div.courseblockdesc>p').text)
        wait.until(EC.visibility_of_element_located((By.ID, "lfjsbubbleclose")))
        driver.find_element_by_id('lfjsbubbleclose').click()
    except:
        print("error while grabbing info")
print(info)
As it requires some time to load the content in the bubble, you should introduce an explicit wait in your script until the bubble content is completely visible, and only then grab it.
Import the packages below to use waits in the above code:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
Please note, this code grabs all the course descriptions from the bubbles. Let me know if you are looking for something specific rather than all of them.
To load the bubble, the website makes an ajax call.
import requests
from bs4 import BeautifulSoup
def course(course_code):
    data = {"page": "getcourse.rjs", "code": course_code}
    res = requests.get("http://bulletin.iit.edu/ribbit/index.cgi", params=data)  # send as query-string parameters
    soup = BeautifulSoup(res.text, "lxml")
    result = {}
    result["description"] = soup.find("div", class_="courseblockdesc").text.strip()
    result["title"] = soup.find("div", class_="coursetitle").text.strip()
    return result
Output for course("CS 522")
{'description': 'Continued exploration of data mining algorithms. More sophisticated algorithms such as support vector machines will be studied in detail. Students will continuously study new contributions to the field. A large project will be required that encourages students to push the limits of existing data mining techniques.',
 'title': 'Advanced Data Mining'}
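To fetch every course this way, you could first scrape the course codes from the bulletin page and feed them to course(). A sketch (it assumes the a.bubblelink.code links used in the Selenium answer above hold the course codes):

page = requests.get("http://bulletin.iit.edu/graduate/colleges/science/applied-mathematics/master-data-science/")
page_soup = BeautifulSoup(page.text, "lxml")
# course codes may contain a non-breaking space, e.g. "CS\xa0522"
codes = [a.text.replace('\xa0', ' ').strip() for a in page_soup.select("a.bubblelink.code")]
info = [course(code) for code in codes]  # one ribbit request per course
print(len(info), info[0]["title"])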
Hi, I am trying to scrape this website. I originally was using BeautifulSoup, and that was fine for getting certain elements (sector, name, etc.), but I am not able to use it to get the financial data. Below I have copied some of the page source; the "—" should in this case be 0.0663. I believe I am trying to scrape JavaScript-rendered content, and none of the solutions I have looked at have worked for me. I was wondering if someone could help me crack this.
Although I would be grateful if someone could post some working code, I would also really appreciate being pointed in the right direction to understand what to look for in the HTML, so I know what I need to do and how to get it.
URL: https://www.tradingview.com/symbols/LSE-TSCO/
HTML:
<span class="tv-widget-fundamentals__label apply-overflow-tooltip">
    Return on Equity (TTM)
</span>
<span class="tv-widget-fundamentals__value apply-overflow-tooltip">
    —
</span>
Python Code:
url = "https://www.tradingview.com/symbols/LSE-TSCO/"
options = webdriver.ChromeOptions()
options.add_argument('--headless')
driver = webdriver.Chrome(ChromeDriverManager().install(), options=options)  # pass the options so headless takes effect
driver.get(url)
html = driver.page_source
To get the equity value, induce WebDriverWait(), wait for visibility_of_element_located(), and use the XPath below.
driver.get(url)
print(WebDriverWait(driver,10).until(EC.visibility_of_element_located((By.XPATH,"//span[contains(.,'Return on Equity (TTM)')]/following-sibling::span[1]"))).text)
You need to import the libraries below.
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
You can get the return on equity using XPath:
equity = driver.find_element_by_xpath('/html/body/div[2]/div[4]/div/div/div/div/div/div[2]/div[2]/div[2]/div/div[2]/div[1]/div/div/div[1]/div[3]/div[3]/span[2]').text
print(equity)
The issue here is not with the element being present or not, but with the time the page takes to load. The page looks very heavy with all those dynamic graphs. Even before the page is fully loaded, the DOM starts to be created and default values take their place.
WebDriverWait with find_element_* works when the element is currently not present but will take a certain time to appear. In your context, it is present from the start, so adding it won't do much. This is also why you get '—': the element is present with its default value.
To fix this or reduce the issue, you can add code to wait until the document readyState is completed
Something like this can be used:
def wait_for_page_ready_state(driver):
    wait = WebDriverWait(driver, 20)

    def _ready_state_script(driver):
        return driver.execute_async_script(
            """
            var callback = arguments[arguments.length - 1];
            callback(document.readyState);
            """) == 'complete'

    wait.until(_ready_state_script)
wait_for_page_ready_state(driver)
Then, since you brought bs4 into play, this is where I would use it:
import re
from bs4 import BeautifulSoup

financials = {}
for el in BeautifulSoup(driver.page_source, "lxml").find_all('div', {"class": "tv-widget-fundamentals__row"}):
    try:
        key = re.sub(r'\s+', ' ', el.find('span', {"class": "tv-widget-fundamentals__label "
                                                            "apply-overflow-tooltip"}).text.strip())
        value = re.sub(r'\s+', ' ', el.find('span', {"class": "tv-widget-fundamentals__value"}).text.strip())
        financials[key] = value
    except AttributeError:
        pass
This will give you every value you need from the financial card.
You can now print the value you need:
print(financials['Return on Equity (TTM)'])
Output:
'0.0663'
Of course you can do the above with Selenium as well, but I wanted to build on what you had started with.
Note that this does not guarantee that the proper value is always returned. It did in my case, but since you know the default value, you could add a while loop until the default changes.
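For completeness, a sketch of the same extraction done purely in Selenium (assuming the same class names as above):

from selenium.common.exceptions import NoSuchElementException

financials = {}
for row in driver.find_elements_by_css_selector("div.tv-widget-fundamentals__row"):
    try:
        key = row.find_element_by_css_selector("span.tv-widget-fundamentals__label").text.strip()
        value = row.find_element_by_css_selector("span.tv-widget-fundamentals__value").text.strip()
        financials[key] = value
    except NoSuchElementException:
        pass  # row without a label/value pair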
[EDIT]
After running my code in a loop, I was hitting the default value about 1 in 5 times. One way to work around it is to create a method and loop until a threshold is reached. In my experience, ~90% of the values get updated with digits; when it fails with the default value, all the other values are also at '—'. So one approach is to use a threshold (e.g. 50%) and only return the values once it is reached.
def get_financial_card_values(default_value='—', threshold=.5):
    financials = {}
    while True:
        for el in BeautifulSoup(driver.page_source, "lxml").find_all('div', {"class": "tv-widget-fundamentals__row"}):
            try:
                key = re.sub(r'\s+', ' ', el.find('span', {"class": "tv-widget-fundamentals__label "
                                                                    "apply-overflow-tooltip"}).text.strip())
                value = re.sub(r'\s+', ' ', el.find('span', {"class": "tv-widget-fundamentals__value"}).text.strip())
                financials[key] = value
            except AttributeError:
                pass
        number_of_updated_values = [value for value in financials.values() if value != default_value]
        if len(number_of_updated_values) / len(financials) > threshold:
            return financials
With this method, I was able to always retrieve the values you are expecting. Note that if the values never change (a site issue), you will be in a loop forever, so you might want to use a timer instead of while True. I just want to point this out, but I don't think it will happen.
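A sketch of that timer variant (my assumption: a simple deadline that returns whatever was collected when time runs out):

import re
import time
from bs4 import BeautifulSoup

def get_financial_card_values_with_deadline(driver, default_value='—', threshold=.5, timeout=30):
    # same threshold loop as above, but with a deadline so it cannot spin forever
    deadline = time.monotonic() + timeout
    financials = {}
    while time.monotonic() < deadline:
        for el in BeautifulSoup(driver.page_source, "lxml").find_all('div', {"class": "tv-widget-fundamentals__row"}):
            try:
                key = re.sub(r'\s+', ' ', el.find('span', {"class": "tv-widget-fundamentals__label apply-overflow-tooltip"}).text.strip())
                value = re.sub(r'\s+', ' ', el.find('span', {"class": "tv-widget-fundamentals__value"}).text.strip())
                financials[key] = value
            except AttributeError:
                pass
        updated = [v for v in financials.values() if v != default_value]
        if financials and len(updated) / len(financials) > threshold:
            return financials
        time.sleep(1)  # brief pause before re-reading page_source
    return financials  # best effort if the threshold was never reached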