I'm trying to access a set of links and download them using the simplified print format on Firefox using Selenium. My current code downloads the webpage as a pdf in the original form but I need to get it in the simplified form.
Here is my current code snippet which downloads the pdf in the original format.
from time import sleep
from helium import start_firefox
from selenium.webdriver import FirefoxOptions
options = FirefoxOptions()
options.add_argument("--start-maximized")
options.set_preference("print.always_print_silent", True)
options.set_preference("print.printer_Mozilla_Save_to_PDF.print_to_file", True)
options.set_preference("print_printer", "Mozilla Save to PDF")
options.set_preference("print.use_simplify_page", True) # Does not seem to download in the simplified form
driver = start_firefox("https://www.hsph.harvard.edu/nutritionsource/selenium/", options=options, headless=True)
driver.execute_script("window.print();")
sleep(10) # Found that a little wait is needed for the print to be rendered otherwise the file will be corrupted
driver.quit()
The format I'm trying to get can be seen by opening the link (https://www.hsph.harvard.edu/nutritionsource/selenium/) in Firefox and using the print option; under "Format", select "Simplified".
Is there any way this can be done in the required format?
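One workaround worth trying if the print.use_simplify_page preference keeps being ignored: load the page through Firefox's Reader Mode (about:reader?url=...) and print that instead. This is only a sketch under the assumption that the Reader Mode rendering is an acceptable stand-in for the "Simplified" print format; I have not confirmed they match exactly.
from time import sleep
from urllib.parse import quote
from helium import start_firefox
from selenium.webdriver import FirefoxOptions
options = FirefoxOptions()
options.set_preference("print.always_print_silent", True)
options.set_preference("print.printer_Mozilla_Save_to_PDF.print_to_file", True)
options.set_preference("print_printer", "Mozilla Save to PDF")
url = "https://www.hsph.harvard.edu/nutritionsource/selenium/"
# about:reader renders Firefox's Reader Mode view of the page; printing it is
# assumed (not guaranteed) to approximate the "Simplified" print output
driver = start_firefox("about:reader?url=" + quote(url, safe=""), options=options, headless=True)
sleep(5)  # give Reader Mode time to render the article
driver.execute_script("window.print();")
sleep(10)  # wait for the PDF to finish rendering
driver.quit()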
I looked up Selenium python documentation and it allows one to take screenshots of an element. I tried the following code and it worked for small pages (around 3-4 actual A4 pages when you print them):
from selenium import webdriver
from selenium.webdriver import FirefoxOptions
firefox_profile = webdriver.FirefoxProfile()
firefox_profile.set_preference("browser.privatebrowsing.autostart", True)
# Configure options for Firefox webdriver
options = FirefoxOptions()
options.add_argument('--headless')
# Initialise Firefox webdriver
driver = webdriver.Firefox(firefox_profile=firefox_profile, options=options)
driver.maximize_window()
driver.get(url)
driver.find_element_by_tag_name("body").screenshot("career.png")
driver.close()
When I try it with url="https://waitbutwhy.com/2020/03/my-morning.html", it gives the screenshot of the entire page, as expected. But when I try it with url="https://waitbutwhy.com/2018/04/picking-career.html", almost half of the page is not rendered in the screenshot (the image is too large to upload here), even though the "body" tag does extend all the way down in the original HTML.
I have tried using both implicit and explicit waits (set to 10s, which is more than enough for a browser to load all contents, comments and discussion section included), but that has not improved the screenshot capability. Just to be sure that selenium was in fact loading the web page properly, I tried loading without the headless flag, and once the webpage was completely loaded, I ran driver.find_element_by_tag_name("body").screenshot("career.png"). The screenshot was again half-blank.
It seems that there might be some memory constraints put on the screenshot method (although I couldn't find any), or the logic behind the screenshot method itself is flawed. I can't figure it out though. I simply want to take the screenshot of the entire "body" element (preferably in a headless environment).
You may try the code below; note that you first need to install a package from the command prompt using pip install Selenium-Screenshot.
import time
from selenium import webdriver
from Screenshot import Screenshot_Clipping
driver = webdriver.Chrome()
driver.maximize_window()
driver.implicitly_wait(10)
driver.get("https://waitbutwhy.com/2020/03/my-morning.html")
obj = Screenshot_Clipping.Screenshot()
img_loc = obj.full_Screenshot(driver, save_path=r'.', image_name='capture.png')
print(img_loc)
time.sleep(5)
driver.close()
The result comes out as a capture of the full page; you just need to zoom in on the saved screenshot.
Hope this works for you!
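If you would rather stay on Firefox (as in your original snippet) and can use Selenium 4, the Firefox driver has a built-in full-page screenshot method that captures beyond the viewport; a minimal sketch, assuming geckodriver and Selenium 4 are installed:
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
options = Options()
options.add_argument('--headless')
driver = webdriver.Firefox(options=options)
driver.get("https://waitbutwhy.com/2018/04/picking-career.html")
# Firefox-only in Selenium 4: captures the whole page, not just the visible viewport
driver.save_full_page_screenshot("career.png")
driver.quit()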
I want to scrape the comments off this page using beautifulsoup - https://www.x....s.com/video_id/the-suburl
The comments are loaded on click via JavaScript, and they are paginated, with each page also loading its comments on click. I wish to fetch all comments; for each comment I want to get the poster's profile URL, the comment text, the number of likes, the number of dislikes, and the time posted (as stated on the page).
The comments can be a list of dictionaries.
How do I go about this?
This script will print all comments found on the page:
import json
import requests
from bs4 import BeautifulSoup
url = 'https://www.x......com/video_id/gggjggjj/'
video_id = url.rsplit('/', maxsplit=2)[-2].replace('video', '')
u = 'https://www.x......com/threads/video/ggggjggl/{video_id}/0/0'.format(video_id=video_id)
comments = requests.post(u, data={'load_all':1}).json()
for id_ in comments['posts']['ids']:
    print(comments['posts']['posts'][id_]['date'])
    print(comments['posts']['posts'][id_]['name'])
    print(comments['posts']['posts'][id_]['url'])
    print(BeautifulSoup(comments['posts']['posts'][id_]['message'], 'html.parser').get_text())
    # ...etc.
    print('-'*80)
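To collect the comments as the list of dictionaries you described, you can build them in the same loop. Note that the 'likes' and 'dislikes' keys below are assumptions about the JSON the endpoint returns and may need adjusting once you inspect a real response:
all_comments = []
for id_ in comments['posts']['ids']:
    post = comments['posts']['posts'][id_]
    all_comments.append({
        'profile_url': post['url'],
        'comment': BeautifulSoup(post['message'], 'html.parser').get_text(),
        'likes': post.get('likes'),        # assumed key name
        'dislikes': post.get('dislikes'),  # assumed key name
        'posted': post['date'],
    })
print(json.dumps(all_comments, indent=2))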
This can be done with Selenium, which emulates a browser. Depending on your preference you can use the Chrome driver (chromedriver) or the Firefox driver (geckodriver).
Here is a link on how to install the chrome webdriver:
http://jonathansoma.com/lede/foundations-2018/classes/selenium/selenium-windows-install/
Then in your code here is how you would set it up:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
# this part may change depending on where you installed the webdriver.
# You may have to define the path to the driver.
# For me my driver is in C:/bin so I do not need to define the path
chrome_options = Options()
# or '--start-maximized' if you want the browser window to open
chrome_options.add_argument('--headless')
driver = webdriver.Chrome(options=chrome_options)
driver.get(your_url)
html = driver.page_source # downloads the html from the driver
Selenium has several functions you can use to perform actions such as clicking elements on the page. Once you find an element with Selenium, you can use its .click() method to interact with it.
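For example, here is a rough sketch of clicking a "load more comments" element and then handing the rendered HTML to BeautifulSoup. The CSS selector below is hypothetical; inspect the page with the browser's dev tools to find the real locator for the comments button:
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
chrome_options = Options()
chrome_options.add_argument('--headless')
driver = webdriver.Chrome(options=chrome_options)
driver.get(your_url)
# hypothetical locator: replace with the element that actually loads the comments
load_more = driver.find_element_by_css_selector('.load-more-comments')
load_more.click()
time.sleep(2)  # crude wait for the comments to render; an explicit wait is better
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()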
Let me know if this helps
Objective:
Automate the following process:
1. Open a particular web page, fill in the information in the search box, and submit.
2. From the search results, click on the first result and download the PDF.
Work done:
As a first step towards this objective I have written the code below. It works fine, but it opens the download pop-up; until I can get rid of that, I cannot automate the process further. I have searched for many solutions, but none has worked.
For instance, this solution is hard for me to understand and I think it has more to do with Java than Python. I changed the Firefox profile as suggested by many. This one roughly matches my case, though not exactly, and I haven't tried it as there is not much difference. This one also talks about changing the Firefox profile, but that doesn't work.
My code is below:
import selenium.webdriver as webdriver
import selenium.webdriver.support.ui as ui
from time import sleep
import time
import wget
from wget import download
import os
#set firefox Profile
profile = webdriver.FirefoxProfile()
profile.set_preference('browser.download.folderList', 2)
profile.set_preference('browser.download.manager.showWhenStarting', False)
profile.set_preference("browser.download.manager.showAlertOnComplete", False)
profile.set_preference('browser.download.dir', os.getcwd())
profile.set_preference('browser.helperApps.neverAsk.saveToDisk', 'application/pdf')
#set variable driver to open firefox
driver = webdriver.Firefox(profile)
#set variable webpage to open the expected URL
webpage = r"https://documents.un.org/prod/ods.nsf/home.xsp" # edit me
#set variable to enter in search box
searchterm = "A/HRC/41/23" # edit me
#open the webpage with get command
driver.get(webpage)
#find the element "symbol", insert data and click submit.
symbolBox = driver.find_element_by_id("view:_id1:_id2:txtSymbol")
symbolBox.send_keys(searchterm)
submit = driver.find_element_by_id("view:_id1:_id2:btnRefine")
submit.click()
#the list of search results opens up and the first result is clicked by copying its id element
downloadPage = driver.find_element_by_id("view:_id1:_id2:cbMain:_id135:rptResults:0:linkURL")
downloadPage.click()
#switch windows, with sleep time
window_before = driver.window_handles[0]
window_after = driver.window_handles[1]
time.sleep(10)
driver.switch_to.window(window_after)
#the actual download of the pdf page
theDownload = driver.find_element_by_id("download")
theDownload.click()
Please guide.
The "Selections" popup is not a different window/tab, it's just an HTML popup. You can tell this because if you right click on the dialog, you will see the normal context menu. You just need to make your "Language" and "File type(s)" selections and click the "Download selected" button.
What I'm trying to do: I want to scrape a web page to get the amount of a financial transaction from a PDF file that is loaded with javascript from a website. Example website: http://www.nebraskadeedsonline.us/document.aspx?g5savSPtTDnumMn1bRBWoKqN6Gu65tBhDE9%2fVs5YdPg=
When I click the 'View Document' button, the PDF file loads into my browser's window (I'm using Google Chrome). I can right-click on the PDF and save it to my computer, but I want to automate that process by having Selenium (or a similar package) download the file and then process it for OCR.
If I can get it saved, I will be able to do the OCR part (I hope). I just can't get the file saved.
From here, I found and modified this code:
def download_pdf(lnk):
    from selenium import webdriver
    from time import sleep
    options = webdriver.ChromeOptions()
    download_folder = "C:\\Users\\rickc\\Documents\\Scraper2\\screenshots\\"
    profile = {"plugins.plugins_list": [{"enabled": False,
                                         "name": "Chrome PDF Viewer"}],
               "download.default_directory": download_folder,
               "download.extensions_to_open": ""}
    options.add_experimental_option("prefs", profile)
    print("Downloading file from link: {}".format(lnk))
    driver = webdriver.Chrome(chrome_options=options)
    driver.get(lnk)
    filename = lnk.split("/")[3].split(".aspx")[0] + ".pdf"
    print("File: {}".format(filename))
    print("Status: Download Complete.")
    print("Folder: {}".format(download_folder))
    driver.close()
download_pdf('http://www.nebraskadeedsonline.us/document.aspx?g5savSPtTDnumMn1bRBWoKqN6Gu65tBhDE9fVs5YdPg=')
But it isn't working. My old college professor once said, "If you've spent more than two hours on the problem and haven't made headway, it's time to look for help elsewhere." So I'm looking for help.
Other info: The link above will take you to a web page, but you can't access the PDF document until you click on the 'View Document' button. I've tried using Selenium's driver.find_element_by_id('btnDocument').click() to make things happen, but it just loads the page and doesn't do anything with it.
You can download the PDF using the requests and BeautifulSoup libraries. In the code below, replace /Users/../aaa.pdf with the full path where the document should be saved:
import requests
from bs4 import BeautifulSoup
url = 'http://www.nebraskadeedsonline.us/document.aspx?g5savSPtTDnumMn1bRBWoKqN6Gu65tBhDE9%2fVs5YdPg='
response = requests.post(url)
page = BeautifulSoup(response.text, "html.parser")
VIEWSTATE = page.select_one("#__VIEWSTATE").attrs["value"]
VIEWSTATEGENERATOR = page.select_one("#__VIEWSTATEGENERATOR").attrs["value"]
EVENTVALIDATION = page.select_one("#__EVENTVALIDATION").attrs["value"]
btnDocument = page.select_one("[name=btnDocument]").attrs["value"]
data = {
    '__VIEWSTATE': VIEWSTATE,
    '__VIEWSTATEGENERATOR': VIEWSTATEGENERATOR,
    '__EVENTVALIDATION': EVENTVALIDATION,
    'btnDocument': btnDocument
}
response = requests.post(url, data=data)
with open('/Users/../aaa.pdf', 'wb') as f:
    f.write(response.content)
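As a sanity check you may want to verify that the second response really is a PDF before writing it; this assumes the server sets the Content-Type header correctly:
#if the POST did not return the document, the body will be HTML rather than a PDF
content_type = response.headers.get('Content-Type', '').lower()
if not content_type.startswith('application/pdf'):
    print('Unexpected response type:', content_type)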
Below is a script that opens a URL, saves the image as a JPEG file, and also saves some html attribute (i.e. the Accession Number) as the file name. The script runs but saves corrupted images; size = 210 bytes with no preview. When I try to open them, the error message suggests the file is damaged.
The reason I am saving the images instead of making a direct request is to get around the site's security measures; it doesn't seem to allow web scraping. My colleague who tested the script below on Windows got a robot-check request (just once, at the beginning of the loop) before the images downloaded successfully. I do not get this check from the site, so I believe my script is actually pulling the robot check instead of the webpage, since it hasn't allowed me to manually bypass the check. I'd appreciate help addressing this issue, perhaps by forcing the robot check when the script opens the first URL.
Dependencies
I am using Python 3.6 on macOS. If anyone testing this for me is also on a Mac and is using Selenium for the first time, please note that a file called "Install Certificates.command" needs to be executed first before you can access anything; otherwise it will throw a "Certificate_Verify_Failed" error. It is easy to find in Finder.
Download for Selenium ChromeDriver utilized below: https://chromedriver.storage.googleapis.com/index.html?path=2.41/
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
import urllib.request
import time
urls = ['https://www.metmuseum.org/art/collection/search/483452',
'https://www.metmuseum.org/art/collection/search/460833',
'https://www.metmuseum.org/art/collection/search/551844']
#Set up Selenium Webdriver
options = webdriver.ChromeOptions()
#options.add_argument('--ignore-certificate-errors')
options.add_argument("--test-type")
driver = webdriver.Chrome(executable_path="/Users/user/Desktop/chromedriver", chrome_options=options)
for link in urls:
    #Load page and pull HTML File
    driver.get(link)
    time.sleep(2)
    soup = BeautifulSoup(driver.page_source, 'lxml')
    #Find details (e.g. Accession Number)
    details = soup.find_all('dl', attrs={'class':'artwork__tombstone--row'})
    for d in details:
        if 'Accession Number' in d.find('dt').text:
            acc_no = d.find('dd').text
    pic_link = soup.find('img', attrs={'id':'artwork__image', 'class':'artwork__image'})['src']
    urllib.request.urlretrieve(pic_link, '/Users/user/Desktop/images/{}.jpg'.format(acc_no))
    time.sleep(2)
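If urlretrieve keeps saving a ~210-byte file, the image request is most likely being rejected because it arrives without the browser's cookies or a browser-like User-Agent. Here is a sketch of reusing the Selenium session's cookies with requests instead; it is untested against this site, so treat it as an assumption about what the robot check looks for, and it would replace the urlretrieve call inside the loop:
headers = {'User-Agent': 'Mozilla/5.0'}  # present a browser-like User-Agent
session = requests.Session()
# copy the cookies Selenium accumulated (including any robot-check clearance)
for cookie in driver.get_cookies():
    session.cookies.set(cookie['name'], cookie['value'])
resp = session.get(pic_link, headers=headers)
with open('/Users/user/Desktop/images/{}.jpg'.format(acc_no), 'wb') as f:
    f.write(resp.content)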