How to scrape data off a page loaded via JavaScript - python-3.x

I want to scrape the comments off this page using BeautifulSoup - https://www.x....s.com/video_id/the-suburl
The comments are loaded on click via JavaScript, and they are paginated: each page of comments is also loaded on click. I want to fetch all the comments, and for each comment get the poster's profile URL, the comment text, the number of likes, the number of dislikes, and the time posted (as stated on the page).
The comments can be a list of dictionaries.
How do I go about this?

This script will print all comments found on the page:
import json
import requests
from bs4 import BeautifulSoup

url = 'https://www.x......com/video_id/gggjggjj/'
# The video id is the second-to-last path segment, with the 'video' prefix stripped.
video_id = url.rsplit('/', maxsplit=2)[-2].replace('video', '')
# Endpoint that returns the comment threads for a video as JSON.
u = 'https://www.x......com/threads/video/ggggjggl/{video_id}/0/0'.format(video_id=video_id)

comments = requests.post(u, data={'load_all': 1}).json()

for id_ in comments['posts']['ids']:
    print(comments['posts']['posts'][id_]['date'])
    print(comments['posts']['posts'][id_]['name'])
    print(comments['posts']['posts'][id_]['url'])
    print(BeautifulSoup(comments['posts']['posts'][id_]['message'], 'html.parser').get_text())
    # ...etc.
    print('-' * 80)
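If you want the comments as the list of dictionaries you described, you can collect the same fields into dicts instead of printing them. A minimal sketch built on the response above (the keys for likes and dislikes are not visible in this response, so the names used below are hypothetical placeholders you would need to check against the actual JSON):
all_comments = []
for id_ in comments['posts']['ids']:
    post = comments['posts']['posts'][id_]
    all_comments.append({
        'profile_url': post['url'],
        'comment': BeautifulSoup(post['message'], 'html.parser').get_text(),
        'time_posted': post['date'],
        # 'likes': post.get('votes_up'),       # hypothetical key - inspect the JSON
        # 'dislikes': post.get('votes_down'),  # hypothetical key - inspect the JSON
    })

print(json.dumps(all_comments, indent=2))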

This can be done with Selenium, which automates a real browser. Depending on your preference you can use the Chrome driver (chromedriver) or the Firefox driver (geckodriver).
Here is a link on how to install the chrome webdriver:
http://jonathansoma.com/lede/foundations-2018/classes/selenium/selenium-windows-install/
Then in your code here is how you would set it up:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# This part may change depending on where you installed the webdriver.
# You may have to define the path to the driver.
# For me the driver is in C:/bin, so I do not need to define the path.
chrome_options = Options()
# Run without opening a browser window; drop this argument (or use
# '--start-maximized') if you want the window to open.
chrome_options.add_argument('--headless')

driver = webdriver.Chrome(options=chrome_options)
driver.get(your_url)
html = driver.page_source  # the rendered HTML of the page
Selenium has several functions you can use to perform actions such as clicking elements on the page. Once you find an element with Selenium, you can use the .click() method to interact with it.
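For example, to click a button that loads the comments you could do something like this (the CSS selector here is just a placeholder - replace it with the real one from the page):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(driver, 10)
# Wait until the (placeholder) "load comments" button is clickable, then click it.
load_button = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, '.show-comments')))
load_button.click()

# The comments should now be part of the rendered HTML.
html = driver.page_source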
Let me know if this helps

Related

How to find an element with Selenium in Python?

import os
import selenium
from selenium import webdriver
import time
browser = webdriver.Chrome()
browser.get('https://www.skysports.com/champions-league-fixtures')
time.sleep(7) #So page loads completely
teamnames = browser.find_element_by_tag("span")
print(teamnames.text)
It seems the find_element attribute has changed in Selenium :/
I also want to find all <img> elements (image URLs) on another local website; I'd appreciate your help with that too.
Replace teamnames = browser.find_element_by_tag("span")
with teamnames = browser.find_element_by_tag_name("span")
Try find_elements instead of find_element, because in the DOM there will usually be multiple elements with the same tag.
Example:
browser.find_elements_by_tag_name('span')
Also, note that it returns a list of elements, which you need to iterate over to access their properties.
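For the original question that would look something like this (printing the text of every span on the fixtures page):
teamnames = browser.find_elements_by_tag_name("span")
for span in teamnames:
    # skip empty spans so the output stays readable
    if span.text.strip():
        print(span.text)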
It seems Selenium made some changes in the new version:
from selenium.webdriver.common.by import By
browser = webdriver.Firefox()
browser.get('url')
browser.find_element(by=By.CSS_SELECTOR, value='')
You can also use: By.ID, By.NAME, By.XPATH, By.LINK_TEXT, By.PARTIAL_LINK_TEXT, By.TAG_NAME, By.CLASS_NAME or By.CSS_SELECTOR.
I used these in Python 3.10 and it's working just fine.
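For the <img> part of the question, the same new-style API can collect all image URLs on a page. A small sketch (assuming the images expose their URL in a plain src attribute):
from selenium.webdriver.common.by import By

# Collect the URL of every <img> element on the page.
image_urls = [img.get_attribute('src') for img in browser.find_elements(By.TAG_NAME, 'img')]
print(image_urls)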

Selenium webdriver python element screenshot not working properly

I looked up the Selenium Python documentation; it allows you to take a screenshot of an element. I tried the following code and it worked for small pages (around 3-4 actual A4 pages when you print them):
from selenium import webdriver
from selenium.webdriver import FirefoxOptions

firefox_profile = webdriver.FirefoxProfile()
firefox_profile.set_preference("browser.privatebrowsing.autostart", True)

# Configure options for Firefox webdriver
options = FirefoxOptions()
options.add_argument('--headless')

# Initialise Firefox webdriver
driver = webdriver.Firefox(firefox_profile=firefox_profile, options=options)
driver.maximize_window()
driver.get(url)
driver.find_element_by_tag_name("body").screenshot("career.png")
driver.close()
When I try it with url="https://waitbutwhy.com/2020/03/my-morning.html", it gives the screenshot of the entire page, as expected. But when I try it with url="https://waitbutwhy.com/2018/04/picking-career.html", almost half of the page is not rendered in the screenshot (the image is too large to upload here), even though the "body" tag does extend all the way down in the original HTML.
I have tried using both implicit and explicit waits (set to 10s, which is more than enough for a browser to load all contents, comments and discussion section included), but that has not improved the screenshot capability. Just to be sure that selenium was in fact loading the web page properly, I tried loading without the headless flag, and once the webpage was completely loaded, I ran driver.find_element_by_tag_name("body").screenshot("career.png"). The screenshot was again half-blank.
It seems that there might be some memory constraints put on the screenshot method (although I couldn't find any), or the logic behind the screenshot method itself is flawed. I can't figure it out though. I simply want to take the screenshot of the entire "body" element (preferably in a headless environment).
You may try this code; you just need to install a package first, from the command prompt, with pip install Selenium-Screenshot:
import time
from selenium import webdriver
from Screenshot import Screenshot_Clipping

driver = webdriver.Chrome()
driver.maximize_window()
driver.implicitly_wait(10)
driver.get("https://waitbutwhy.com/2020/03/my-morning.html")

obj = Screenshot_Clipping.Screenshot()
img_loc = obj.full_Screenshot(driver, save_path=r'.', image_name='capture.png')
print(img_loc)

time.sleep(5)
driver.close()
The result is a full-page capture; you may just need to zoom in on the saved screenshot.
Hope this works for you!
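If you would rather avoid the extra package, a common workaround (a sketch only - not tested against these particular pages) is to resize the headless window to the full document height before taking the body screenshot, so nothing gets clipped to the viewport:
from selenium import webdriver
from selenium.webdriver import FirefoxOptions

options = FirefoxOptions()
options.add_argument('--headless')
driver = webdriver.Firefox(options=options)
driver.get("https://waitbutwhy.com/2018/04/picking-career.html")

# Resize the window to the full size of the document so the body
# screenshot is not clipped to the current viewport.
width = driver.execute_script("return document.body.scrollWidth")
height = driver.execute_script("return document.body.scrollHeight")
driver.set_window_size(width, height)

driver.find_element_by_tag_name("body").screenshot("career.png")
driver.quit()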

Unable to click on one of the items in a drop down list using selenium (python)

I am unable to click on the 'Search photos' button on Flickr.
I have tried the following:
sp = browser.find_element_by_partial_link_text('/search/?text=tennis%20shoes')
sp.click()
sp = browser.find_element_by_name('Select photos')
sp.click()
searchPhotos = browser.find_element_by_class_name('Search photos')
searchPhotos.click()
browser.find_element_by_xpath("//class[#name='Search photos']").click()
But none of them seem to work. I am learning how to do this, including how to use xpath, so maybe I am not using it correctly. Any advice to point me in the right direction?
EDIT: full section of code to answer comment below:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
profile = webdriver.FirefoxProfile()
profile.set_preference("browser.download.folderList", 2)
profile.set_preference("browser.download.manager.showWhenStarting", False)
profile.set_preference("browser.download.dir", '/Users/home/Box/Temp-to delete')
profile.set_preference("browser.helperApps.neverAsk.saveToDisk", 'png/jpg')
browser = webdriver.Firefox(firefox_profile=profile, executable_path='/usr/local/bin/geckodriver')
browser.get('https://www.flickr.com/')
searchBar = browser.find_element_by_css_selector('#search-field')
searchBar.send_keys(searchTerm)
browser.find_element_by_xpath(".//*[#data-track='autosuggestNavigate_searchPhotos']").click()
Using firefox 72.0.2 (64-bit), python3, geckodriver v0.26.0
The path used in your XPath won't work. Try this one: .//*[@data-track='autosuggestNavigate_searchPhotos'].
The .// tells Selenium to search anywhere in the DOM. The asterisk (*) makes Selenium look for any element (no matter whether it is a div, an li or any other HTML tag). It then checks which element has the data-track attribute with the value autosuggestNavigate_searchPhotos. Since there is only one element like this, we are fine.
I advise reading more about XPath and practising a bit; you may start here
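If the click is still flaky because the suggestion has not rendered yet, you could wrap it in an explicit wait. A sketch using the same XPath:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(browser, 10)
# Wait for the "Search photos" suggestion to become clickable before clicking it.
search_photos = wait.until(EC.element_to_be_clickable(
    (By.XPATH, ".//*[@data-track='autosuggestNavigate_searchPhotos']")))
search_photos.click()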
Solved it! I just had to hit ENTER for the photo results page to show. Here is the single line of code I changed:
searchBar.send_keys(searchTerm, Keys.ENTER)

Selenium Webdriver saving corrupt jpeg

Below is a script that opens a URL, saves the image as a JPEG file, and also saves an HTML attribute (the Accession Number) as the file name. The script runs but saves corrupted images: each file is about 210 bytes with no preview, and when I try to open one, the error message suggests the file is damaged.
The reason I am saving the images through the browser instead of making a direct request is to get around the site's security measures; it doesn't seem to allow web scraping. My colleague, who tested the script below on Windows, got a robot-check request (just once, at the beginning of the loop) before the images downloaded successfully. I do not get this check from the site, so I believe my script is actually pulling the robot check instead of the webpage, as it hasn't allowed me to manually bypass the check. I'd appreciate help addressing this issue, perhaps by forcing the robot check when the script opens the first URL.
Dependencies
I am using Python 3.6 on MacOS. If anyone testing this for me is also using Mac and is using Selenium for the first time, please note that a file called "Install Certificates.command" first needs to be executed before you can access anything. Otherwise, it will throw a "Certificate_Verify_Failed" error. Easy to search in Finder.
Download for Selenium ChromeDriver utilized below: https://chromedriver.storage.googleapis.com/index.html?path=2.41/
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
import urllib.request
import time

urls = ['https://www.metmuseum.org/art/collection/search/483452',
        'https://www.metmuseum.org/art/collection/search/460833',
        'https://www.metmuseum.org/art/collection/search/551844']

# Set up Selenium Webdriver
options = webdriver.ChromeOptions()
# options.add_argument('--ignore-certificate-errors')
options.add_argument("--test-type")
driver = webdriver.Chrome(executable_path="/Users/user/Desktop/chromedriver", chrome_options=options)

for link in urls:
    # Load page and pull the HTML
    driver.get(link)
    time.sleep(2)
    soup = BeautifulSoup(driver.page_source, 'lxml')

    # Find details (e.g. Accession Number)
    details = soup.find_all('dl', attrs={'class': 'artwork__tombstone--row'})
    for d in details:
        if 'Accession Number' in d.find('dt').text:
            acc_no = d.find('dd').text

    # Locate the image URL and save it using the Accession Number as the file name
    pic_link = soup.find('img', attrs={'id': 'artwork__image', 'class': 'artwork__image'})['src']
    urllib.request.urlretrieve(pic_link, '/Users/user/Desktop/images/{}.jpg'.format(acc_no))
    time.sleep(2)
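The ~210-byte files are a hint that urllib.request.urlretrieve is receiving an error or robot-check response rather than the image, because it opens a fresh connection with none of the browser session's cookies or headers. One possible workaround (a sketch, assuming the image host honours the browser session's cookies) is to download with requests while reusing the driver's cookies:
import requests

# Reuse the Selenium session's cookies and User-Agent so the image
# request looks like it comes from the same browser.
session = requests.Session()
for cookie in driver.get_cookies():
    session.cookies.set(cookie['name'], cookie['value'])
session.headers['User-Agent'] = driver.execute_script("return navigator.userAgent")

response = session.get(pic_link)
if response.ok:
    with open('/Users/user/Desktop/images/{}.jpg'.format(acc_no), 'wb') as f:
        f.write(response.content)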

My script produces JavaScript calls instead of valid links

Running my script I get items like "javascript:getDetail(19978)" as the href. The number in parentheses, if concatenated with "https://www.aopa.org/airports/4M3/business/", produces a valid link. However, clicking these newly created links takes me to a different page, not the one I reach by clicking the link on the original page. How can I get the original links instead of "javascript:getDetail(19978)"? The search should be performed by typing "All" into the search box.
The code I've tried:
from selenium import webdriver
import time

link = "https://www.aopa.org/airports/4M3/business/"
driver = webdriver.Chrome()
driver.get("https://www.aopa.org/learntofly/school/")
driver.find_element_by_id('searchTerm').send_keys('All')
driver.find_element_by_id('btnSearch').click()
time.sleep(5)

for pro in driver.find_elements_by_xpath('//td/a'):
    print(pro.get_attribute("href"))

driver.quit()
Code to create the new links with the base URL I pasted in my description:
from selenium import webdriver
import time

link = "https://www.aopa.org/airports/4M3/business/"
driver = webdriver.Chrome()
driver.get("https://www.aopa.org/learntofly/school/")
driver.find_element_by_id('searchTerm').send_keys('All')
driver.find_element_by_id('btnSearch').click()
time.sleep(5)

for item in driver.find_elements_by_xpath('//td/a'):
    fresh = item.get_attribute("href").replace("javascript:getDetail(", "")
    print(link + fresh.replace(")", ""))

driver.quit()
However, these newly created links lead me to different destinations.
FYI, the original links are embedded within elements like the one below:
<td><a href="javascript:getDetail(19978)">GOLD DUST FLYING SERVICE, INC.</a></td>
Clicking a link issues an XHR. The page actually stays the same, but the data received as JSON is rendered in place of the previous content.
If you want to open the raw data inside an HTML page you might try something like:
# To get the list of entries as ["19978", "30360", ... ]
entries = [a.get_attribute('href').split("(")[1].split(")")[0]
           for a in driver.find_elements_by_xpath('//td/a')]

url = "https://www.aopa.org/learntofly/school/wsSearch.cfm?method=schoolDetail&businessId="
for entry in entries:
    driver.get(url + entry)
    print(driver.page_source)
You might also use requests to get each JSON response, as in
import requests

for entry in entries:
    print(requests.get(url + entry).json())
without rendering the data in a browser.
If you look at how getDetail() is implemented in the source code and explore the "network" tab when you click each of the search result links, you may see that there are multiple XHR requests issued for a result and there is some client-side logic executed to form a search result page.
If you don't want to dive into replicating all the logic happening to form each of the search result pages - simply go back and forth between the search results page and a single search result page:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
wait = WebDriverWait(driver, 10)

driver.get("https://www.aopa.org/learntofly/school/")
driver.find_element_by_id('searchTerm').send_keys('All')
driver.find_element_by_id('btnSearch').click()

# wait for search results to be visible
table = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "#searchResults table")))

for index in range(len(table.find_elements_by_css_selector('td a[href*=getDetail]'))):
    # re-locate the results table each time to avoid a stale element reference
    table = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "#searchResults table")))

    # get the next link
    link = table.find_elements_by_css_selector('td a[href*=getDetail]')[index]
    link_text = link.text
    link.click()

    print(link_text)

    # TODO: get details

    # go back
    back_link = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "#schoolDetail a[href*=backToList]")))
    driver.execute_script("arguments[0].click();", back_link)

driver.quit()
Note the use of Explicit Waits instead of hardcoded "sleeps".
It may actually make sense to avoid using selenium here altogether and approach the problem "headlessly" - doing HTTP requests (via requests module) to the website's API.
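A rough sketch of that requests-only approach, reusing the schoolDetail endpoint shown above (the endpoint and parameters for the search step itself would still have to be discovered from the "network" tab, so the list of businessId values is assumed to be known already):
import requests

# Assumption: businessId values already collected, e.g. parsed out of the
# href="javascript:getDetail(...)" links or taken from the search XHR.
entries = ["19978"]

detail_url = "https://www.aopa.org/learntofly/school/wsSearch.cfm?method=schoolDetail&businessId="

with requests.Session() as session:
    session.headers["User-Agent"] = "Mozilla/5.0"  # plain browser-like header
    for entry in entries:
        print(session.get(detail_url + entry).json())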
