Selenium Webdriver saving corrupt jpeg - python-3.x

Below is a script that opens a URL, saves the image as a JPEG file, and uses an HTML attribute value (the Accession Number) as the file name. The script runs but saves corrupted images: each file is 210 bytes with no preview, and trying to open one produces an error saying the file is damaged.
The reason I am saving the images instead of making a direct request is to get around the site's security measures; it doesn't seem to allow web scraping. A colleague who tested the script below on Windows got a robot check (just once, at the beginning of the loop) before the images downloaded successfully. I do not get this check, so I believe my script is actually pulling the robot check page instead of the webpage, since it has never let me bypass the check manually. I'd appreciate help addressing this, perhaps by forcing the robot check when the script opens the first URL.
Dependencies
I am using Python 3.6 on MacOS. If anyone testing this for me is also using Mac and is using Selenium for the first time, please note that a file called "Install Certificates.command" first needs to be executed before you can access anything. Otherwise, it will throw a "Certificate_Verify_Failed" error. Easy to search in Finder.
Download for Selenium ChromeDriver utilized below: https://chromedriver.storage.googleapis.com/index.html?path=2.41/
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
import urllib.request
import time

urls = ['https://www.metmuseum.org/art/collection/search/483452',
        'https://www.metmuseum.org/art/collection/search/460833',
        'https://www.metmuseum.org/art/collection/search/551844']

# Set up Selenium Webdriver
options = webdriver.ChromeOptions()
# options.add_argument('--ignore-certificate-errors')
options.add_argument("--test-type")
driver = webdriver.Chrome(executable_path="/Users/user/Desktop/chromedriver", chrome_options=options)

for link in urls:
    # Load page and pull HTML
    driver.get(link)
    time.sleep(2)
    soup = BeautifulSoup(driver.page_source, 'lxml')

    # Find details (e.g. Accession Number)
    details = soup.find_all('dl', attrs={'class': 'artwork__tombstone--row'})
    for d in details:
        if 'Accession Number' in d.find('dt').text:
            acc_no = d.find('dd').text

    pic_link = soup.find('img', attrs={'id': 'artwork__image', 'class': 'artwork__image'})['src']
    urllib.request.urlretrieve(pic_link, '/Users/user/Desktop/images/{}.jpg'.format(acc_no))
    time.sleep(2)
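One thing worth trying (a rough sketch, not a verified fix): the 210-byte files suggest urlretrieve is receiving a block page, because it opens a brand-new connection with no cookies and a Python User-Agent. Reusing the cookies from the Selenium session and a browser-like User-Agent with requests may return the actual JPEG; this would replace the urlretrieve call inside the loop.

import requests

session = requests.Session()
# Copy the cookies Selenium collected (including any set after passing a robot check)
for cookie in driver.get_cookies():
    session.cookies.set(cookie['name'], cookie['value'])

headers = {
    'User-Agent': driver.execute_script("return navigator.userAgent;"),
    'Referer': link,  # the collection page the image came from
}
resp = session.get(pic_link, headers=headers, timeout=30)
# Only write the file if an image actually came back, not an HTML error page
if resp.ok and resp.headers.get('Content-Type', '').startswith('image'):
    with open('/Users/user/Desktop/images/{}.jpg'.format(acc_no), 'wb') as f:
        f.write(resp.content)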

Related

Can't download embedded file even after switching to iframe using selenium and python

I have automated data selection with Selenium, but the site displays the report inside an iframe, and I want to download the xlsx files that are part of that iframe.
I can't download the files iteratively, and no URL is generated for downloading the file.
cycle = Select(browser.find_element(By.ID,"CycleId"))
for i in range(1, 3):
    cycle.select_by_index(i)
    browser.execute_script("BindReportData();")
    WebDriverWait(browser, 120).until(EC.frame_to_be_available_and_switch_to_it((By.XPATH, "//*[@id='report']/iframe")))
    WebDriverWait(browser, 120).until(EC.element_to_be_clickable((By.ID, "ReportViewer1_ctl05_ctl04_ctl00_Button")))
    browser.switch_to.default_content()
    time.sleep(60)
    iframe_var = browser.find_element(By.TAG_NAME, 'iframe')
    browser.switch_to.frame(iframe_var)
    but = browser.find_element(By.ID, 'ReportViewer1_ctl05_ctl04_ctl00_Menu')
    options = but.find_elements(By.TAG_NAME, 'a')
    browser.execute_script(options[1].get_attribute('onclick'))
    browser.switch_to.default_content()
The site is https://soilhealth.dac.gov.in/PublicReports/GridFormNSVW. I want to iterate through all states, districts, blocks, villages and cycles to download the files.
I have already tried switching to the iframe to locate the download element, and I used WebDriverWait to avoid the stale element exception.
It sometimes still shows a stale element exception or a timeout error.
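In case it helps, here is a minimal sketch of one way to handle the stale element problem: re-locate the select element on every attempt instead of reusing a stored reference. The element ID is taken from the snippet above; the retry count is an arbitrary choice.

from selenium.common.exceptions import StaleElementReferenceException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select, WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def select_cycle(browser, index, attempts=3):
    # Look the <select> up fresh on each try so a stale reference is simply re-found
    for _ in range(attempts):
        try:
            cycle = Select(browser.find_element(By.ID, "CycleId"))
            cycle.select_by_index(index)
            return
        except StaleElementReferenceException:
            WebDriverWait(browser, 10).until(
                EC.presence_of_element_located((By.ID, "CycleId")))
    raise RuntimeError("CycleId kept going stale after {} attempts".format(attempts))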

Can't download image from a website with Selenium; it gives 403 error

I was trying to scrape pictures of a certain character on Pixiv with Selenium, but when I tried to download one, it gave me a 403 error. I tried using the requests module with the src link to download it directly and it gave me the same error. I tried opening a new tab with the src link and got the same error. Is there a way to download an image from Pixiv? I was planning something a bit larger than just downloading a single image, but I am stuck on this. I did set the user-agent, as this thread suggested, but either it didn't work or I did something wrong.
This is the image I tried to download: https://www.pixiv.net/en/artworks/93284987.
from selenium import webdriver
driver = webdriver.Chrome()
driver.get('https://i.pximg.net/img-original/img/2021/10/07/19/41/28/93284987_p0.jpg')
I've skipped the code that gets the src here, but what happens is that I do get the src link to the image, yet I can't access that page to download it. I don't know if I actually need to go to the page, but I can't do anything with the src either. I tried several methods but nothing works. Can someone help me?
Seems to download just fine without any headers for me.
import requests
import shutil
url = 'https://i.pximg.net/img-master/img/2021/10/07/19/41/28/93284987_p0_master1200.jpg'
response = requests.get(url, stream=True)
local_filename = url.split('/')[-1]
with open(local_filename, 'wb') as out_file:
    shutil.copyfileobj(response.raw, out_file)
del response
print(local_filename)
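If you need the img-original URL from the question rather than the img-master one, it usually also requires a Referer header pointing back at pixiv.net. That is an assumption about the CDN's check rather than something verified here, but it is cheap to try:

import requests

url = 'https://i.pximg.net/img-original/img/2021/10/07/19/41/28/93284987_p0.jpg'
headers = {
    'Referer': 'https://www.pixiv.net/',  # pximg commonly rejects requests without this
    'User-Agent': 'Mozilla/5.0',
}
response = requests.get(url, headers=headers, stream=True)
response.raise_for_status()
with open(url.split('/')[-1], 'wb') as out_file:
    for chunk in response.iter_content(chunk_size=8192):
        out_file.write(chunk)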

Selenium webdriver python element screenshot not working properly

I looked up Selenium python documentation and it allows one to take screenshots of an element. I tried the following code and it worked for small pages (around 3-4 actual A4 pages when you print them):
from selenium import webdriver
from selenium.webdriver import FirefoxOptions

firefox_profile = webdriver.FirefoxProfile()
firefox_profile.set_preference("browser.privatebrowsing.autostart", True)

# Configure options for Firefox webdriver
options = FirefoxOptions()
options.add_argument('--headless')

# Initialise Firefox webdriver
driver = webdriver.Firefox(firefox_profile=firefox_profile, options=options)
driver.maximize_window()
driver.get(url)  # url is set to one of the pages mentioned below
driver.find_element_by_tag_name("body").screenshot("career.png")
driver.close()
When I try it with url="https://waitbutwhy.com/2020/03/my-morning.html", it gives the screenshot of the entire page, as expected. But when I try it with url="https://waitbutwhy.com/2018/04/picking-career.html", almost half of the page is not rendered in the screenshot (the image is too large to upload here), even though the "body" tag does extend all the way down in the original HTML.
I have tried using both implicit and explicit waits (set to 10s, which is more than enough for a browser to load all contents, comments and discussion section included), but that has not improved the screenshot capability. Just to be sure that selenium was in fact loading the web page properly, I tried loading without the headless flag, and once the webpage was completely loaded, I ran driver.find_element_by_tag_name("body").screenshot("career.png"). The screenshot was again half-blank.
It seems that there might be some memory constraints put on the screenshot method (although I couldn't find any), or the logic behind the screenshot method itself is flawed. I can't figure it out though. I simply want to take the screenshot of the entire "body" element (preferably in a headless environment).
You may try this code; you just need to install a package first from the command prompt with pip install Selenium-Screenshot
import time
from selenium import webdriver
from Screenshot import Screenshot_Clipping
driver = webdriver.Chrome()
driver.maximize_window()
driver.implicitly_wait(10)
driver.get("https://waitbutwhy.com/2020/03/my-morning.html")
obj = Screenshot_Clipping.Screenshot()
img_loc = obj.full_Screenshot(driver, save_path=r'.', image_name='capture.png')
print(img_loc)
time.sleep(5)
driver.close()
The result comes out fine; you just need to zoom in on the saved screenshot.
Hope this works for you!
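If you would rather stay with Firefox and avoid an extra package, Selenium 4's Firefox driver exposes a native full-page screenshot. A minimal sketch, assuming geckodriver and Selenium 4 are installed:

from selenium import webdriver
from selenium.webdriver import FirefoxOptions

options = FirefoxOptions()
options.add_argument('--headless')
driver = webdriver.Firefox(options=options)
driver.get("https://waitbutwhy.com/2018/04/picking-career.html")
driver.implicitly_wait(10)
# Firefox-only helper in Selenium 4; it captures the whole page, not just the viewport
driver.get_full_page_screenshot_as_file("career.png")
driver.quit()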

How to scrape data off a page loaded via Javascript

I want to scrape the comments off this page using beautifulsoup - https://www.x....s.com/video_id/the-suburl
The comments are loaded on click via Javascript, and they are paginated, with each page also loading comments on click. I wish to fetch all comments. For each comment, I want the poster's profile URL, the comment text, the number of likes, the number of dislikes, and the time posted (as stated on the page).
The comments can be a list of dictionaries.
How do I go about this?
This script will print all comments found on the page:
import json
import requests
from bs4 import BeautifulSoup
url = 'https://www.x......com/video_id/gggjggjj/'
video_id = url.rsplit('/', maxsplit=2)[-2].replace('video', '')
u = 'https://www.x......com/threads/video/ggggjggl/{video_id}/0/0'.format(video_id=video_id)
comments = requests.post(u, data={'load_all':1}).json()
for id_ in comments['posts']['ids']:
    print(comments['posts']['posts'][id_]['date'])
    print(comments['posts']['posts'][id_]['name'])
    print(comments['posts']['posts'][id_]['url'])
    print(BeautifulSoup(comments['posts']['posts'][id_]['message'], 'html.parser').get_text())
    # ...etc.
    print('-' * 80)
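Since you mentioned wanting the comments as a list of dictionaries, here is a small follow-up sketch that collects them instead of printing. It only uses the JSON keys shown above; likes and dislikes would need to be checked in the actual response.

results = []
for id_ in comments['posts']['ids']:
    post = comments['posts']['posts'][id_]
    results.append({
        'profile_url': post['url'],
        'poster': post['name'],
        'posted': post['date'],
        'comment': BeautifulSoup(post['message'], 'html.parser').get_text(),
    })
print(json.dumps(results, indent=2))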
This can be done with Selenium. Selenium emulates a browser. Depending on your preference, you can use the Chrome driver or the Firefox driver, which is geckodriver.
Here is a link on how to install the chrome webdriver:
http://jonathansoma.com/lede/foundations-2018/classes/selenium/selenium-windows-install/
Then in your code here is how you would set it up:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
# this part may change depending on where you installed the webdriver.
# You may have to define the path to the driver.
# For me my driver is in C:/bin so I do not need to define the path
chrome_options = Options()
# or '-start maximized' if you want the browser window to open
chrome_options.add_argument('--headless')
driver = webdriver.Chrome(options=chrome_options)
driver.get(your_url)
html = driver.page_source # downloads the html from the driver
Selenium has several functions that you can use to perform certain actions such as click on elements on the page. Once you find an element with selenium you can use the .click() method to interact with the element.
Let me know if this helps
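For example, the click-then-parse flow could look roughly like this; the selector for the "load comments" control is a placeholder that you would need to replace with the real one from the page:

from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(driver, 10)
# Placeholder selector -- replace with the element that loads the comments
load_button = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, '.show-comments')))
load_button.click()
soup = BeautifulSoup(driver.page_source, 'html.parser')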

My script produces javascript calls instead of valid links

Running my script I get items like "javascript:getDetail(19978)" as the href. If the number in parentheses is concatenated with "https://www.aopa.org/airports/4M3/business/", it produces a valid-looking link. However, clicking these newly created links takes me to a different page, not the one reached by clicking the link on the original page. How can I get the original links instead of "javascript:getDetail(19978)"? The search should be made by typing "All" into the search box.
The code I've tried with:
from selenium import webdriver
import time
link = "https://www.aopa.org/airports/4M3/business/"
driver = webdriver.Chrome()
driver.get("https://www.aopa.org/learntofly/school/")
driver.find_element_by_id('searchTerm').send_keys('All')
driver.find_element_by_id('btnSearch').click()
time.sleep(5)
for pro in driver.find_elements_by_xpath('//td/a'):
    print(pro.get_attribute("href"))
driver.quit()
Code to create new links with the base url I pasted in my description:
from selenium import webdriver
import time
link = "https://www.aopa.org/airports/4M3/business/"
driver = webdriver.Chrome()
driver.get("https://www.aopa.org/learntofly/school/")
driver.find_element_by_id('searchTerm').send_keys('All')
driver.find_element_by_id('btnSearch').click()
time.sleep(5)
for item in driver.find_elements_by_xpath('//td/a'):
    fresh = item.get_attribute("href").replace("javascript:getDetail(", "")
    print(link + fresh.replace(")", ""))
driver.quit()
However, these newly created links lead me to different destinations.
FYI, the original links are embedded within elements like the one below:
<td><a href="javascript:getDetail(19978)">GOLD DUST FLYING SERVICE, INC.</a></td>
Clicking a link fires an XHR request. The page itself stays the same, but data received as JSON is rendered in place of the previous content.
If you want to open the raw data inside an HTML page, you might try something like:
# To get list of entries as ["19978", "30360", ... ]
entries = [a.get_attribute('href').split("(")[1].split(")")[0] for a in driver.find_elements_by_xpath('//td/a')]
url = "https://www.aopa.org/learntofly/school/wsSearch.cfm?method=schoolDetail&businessId="
for entry in entries:
    driver.get(url + entry)
    print(driver.page_source)
You might also use requests to get each JSON response, as
import requests
for entry in entries:
    print(requests.get(url + entry).json())
without rendering the data in a browser.
If you look at how getDetail() is implemented in the source code and explore the "network" tab when you click each of the search result links, you may see that there are multiple XHR requests issued for a result and there is some client-side logic executed to form a search result page.
If you don't want to dive into replicating all the logic happening to form each of the search result pages - simply go back and forth between the search results page and a single search result page:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()
wait = WebDriverWait(driver, 10)
driver.get("https://www.aopa.org/learntofly/school/")
driver.find_element_by_id('searchTerm').send_keys('All')
driver.find_element_by_id('btnSearch').click()
# wait for search results to be visible
table = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "#searchResults table")))
for index in range(len(table.find_elements_by_css_selector('td a[href*=getDetail]'))):
    # re-locate the results table after navigating back, otherwise the reference goes stale
    table = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "#searchResults table")))
    # get the next link
    link = table.find_elements_by_css_selector('td a[href*=getDetail]')[index]
    link_text = link.text
    link.click()
    print(link_text)
    # TODO: get details
    # go back
    back_link = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "#schoolDetail a[href*=backToList]")))
    driver.execute_script("arguments[0].click();", back_link)
driver.quit()
Note the use of Explicit Waits instead of hardcoded "sleeps".
It may actually make sense to avoid using selenium here altogether and approach the problem "headlessly" - doing HTTP requests (via requests module) to the website's API.
