How to get the download link to a file in the Firefox downloads window via Selenium and Python 3 - python-3.x

I can't get the link from the webpage; it's generated automatically using JS. But I can get the Firefox download window after clicking on the href (it's a JS script that returns an href).
How can I get the link in this window using Selenium? If I can't do that, is there any other way to get the link (there is no explicit link in the HTML DOM)?

from selenium import webdriver
profile = webdriver.FirefoxProfile()
profile.set_preference('browser.download.folderList', 2) # 2 means custom location
profile.set_preference('browser.download.manager.showWhenStarting', False)
profile.set_preference('browser.download.dir', '/tmp') # location is tmp
profile.set_preference('browser.helperApps.neverAsk.saveToDisk', 'text/csv') # auto-save CSV files without the download prompt
browser = webdriver.Firefox(profile)
browser.get("yourwebsite")
element = browser.find_element_by_id('yourLocator')
href = element.get_attribute("href")
Now you have the URL in href.
Use the line below to navigate to it:
browser.get(href)
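Since the profile above auto-saves text/csv files to /tmp, a minimal sketch for waiting until the download actually finishes might look like this (the .csv extension matches the MIME type configured above; Firefox marks in-progress downloads with a .part suffix):
import os
import time

def wait_for_download(directory='/tmp', timeout=30):
    # poll the download directory until a finished CSV appears
    deadline = time.time() + timeout
    while time.time() < deadline:
        files = os.listdir(directory)
        done = [f for f in files
                if f.endswith('.csv') and f + '.part' not in files]
        if done:
            return os.path.join(directory, done[0])
        time.sleep(0.5)
    raise TimeoutError('Download did not finish in time')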

You can go for the following approach:
Get the WebElement's href attribute using the WebElement.get_attribute() function:
href = your_element.get_attribute("href")
Use the WebDriver.execute_script() function to evaluate the JavaScript and return the real URL:
url = driver.execute_script("return " + href + ";")
Now you should be able to use the urllib or requests library to download the file. If your website requires authentication, don't forget to obtain the cookies from the browser instance and add the relevant Cookie header to the request that downloads the file.
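For example, a minimal sketch of that cookie hand-off, reusing the browser instance and the url from above (the output filename is a placeholder):
import requests

# copy the browser's session cookies into a requests session
session = requests.Session()
for cookie in browser.get_cookies():
    session.cookies.set(cookie['name'], cookie['value'])

# download the file the evaluated href resolved to
response = session.get(url, stream=True)
response.raise_for_status()
with open('/tmp/downloaded_file.csv', 'wb') as f:  # placeholder filename
    for chunk in response.iter_content(chunk_size=8192):
        f.write(chunk)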

Related

Python + Mechanize - Emulate Javascript button click using POST?

I'm trying to automate filling in a car insurance quote form on a site
(following the same format as the site URL, let's call it: "https://secure.examplesite.com/css/car/step1#noBack").
I'm stuck on the rego section: once the rego has been added, a button needs to be clicked to perform the search, and it seems this is JavaScript-heavy; I know Mechanize can't handle this. I'm not versed in JavaScript, but I can see that when clicking the button, a POST request is made to this URL: "https://secure.examplesite.com/css/car/step1/searchVehicleByRegNo".
How can I emulate this POST request in Mechanize to run the JavaScript, so I can see and interact with the response? Or is this not possible? Should I consider bs4/requests/RoboBrowser instead? I'm only ~4 months into learning! Thanks
# Mechanize test
import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)   # ignore robots
br.set_handle_refresh(False)  # can sometimes hang without this
res = br.open("https://secure.examplesite.com/css/car/step1#noBack")
br.select_form(id="quoteCollectForm")
br.set_all_readonly(False)    # allow everything to be written to
controlDict = {}
# List all form controls
for control in br.form.controls:
    controlDict[control.name] = control.value
    print("type = %s, name = %s, value = %s" % (control.type, control.name, control.value))
# Enter rego, e.g. "example"
br.form["vehicle.searchRegNo"] = "example"
# Now for control name = vehicle.searchRegNo, value = example
# BUT now how do I click the button?? Simulate the POST? The POST url is formatted like:
# https://secure.examplesite.com/css/car/step1/searchVehicleByRegNo
Javascript POST
Solved my own problem.
Steps:
Open the dev tools in the browser
Go to the Network tab and clear it
Interact with the form element (in my case the car rego finder)
Click on the event that occurs from the interaction
Copy the exact URL, request header data, and payload
I used Postman to quickly test the request; the responses were correct / the same as the web form, and I found the relevant headers
In Postman, convert the request to Python requests code
Now I can interact completely with the form.
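As an illustration, the exported request typically boils down to something like the sketch below; the header values here are placeholders, and the form field name is taken from the Mechanize snippet above - copy the real values from the dev tools / Postman:
import requests

url = "https://secure.examplesite.com/css/car/step1/searchVehicleByRegNo"
# placeholder headers - copy the real ones from the dev tools network tab
headers = {
    "User-Agent": "Mozilla/5.0",
    "Referer": "https://secure.examplesite.com/css/car/step1",
}
payload = {"vehicle.searchRegNo": "example"}

session = requests.Session()
# visit the form page first so the session picks up any cookies
session.get("https://secure.examplesite.com/css/car/step1")
response = session.post(url, headers=headers, data=payload)
print(response.status_code)
print(response.text)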

How can I retrieve data from a web page with a loading screen?

I am using the requests library to retrieve data from nitrotype.com/racer/insert_name_here about a user's progress using the following code:
import requests
base_url = 'https://www.nitrotype.com/racer/'
name = 'test'
url = base_url + name
page = requests.get(url)
print(page.text)
However, my problem is that this retrieves the data from the loading screen; I want the data that appears after the loading screen has finished.
Is it possible to do this, and how?
This is likely because of dynamic loading, and it can easily be handled by using Selenium or pyppeteer.
In my example, I have used pyppeteer to spawn a browser and load the JavaScript so that I can obtain the required information.
Example:
import asyncio
import pyppeteer

async def main():
    # launches a chromium browser; can use chrome instead of chromium as well
    browser = await pyppeteer.launch(headless=False)
    # creates a blank page
    page = await browser.newPage()
    # navigates to the requested page and runs the dynamic code on the site
    await page.goto('https://www.nitrotype.com/racer/tupac')
    # grabs the html content of the page
    cont = await page.content()
    # closes the browser now that the content has been captured
    await browser.close()
    return cont

# prints the html code of the user profile: tupac
print(asyncio.get_event_loop().run_until_complete(main()))
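As a follow-up, since cont is the fully rendered HTML, it can be parsed like any static page, e.g. with BeautifulSoup; the selector below is a placeholder, so inspect the rendered page for the real one:
import asyncio
from bs4 import BeautifulSoup

# reuses main() from the snippet above
html = asyncio.get_event_loop().run_until_complete(main())
soup = BeautifulSoup(html, 'html.parser')
# placeholder selector - inspect the rendered profile page for the real one
for stat in soup.select('.profile-stat'):
    print(stat.get_text(strip=True))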

How to scrape image/file from web page in Python?

I am trying to use Python 3.7.4 to back up pictures on a blog site, e.g.
http://s2.sinaimg.cn/mw690/001H6t4Fzy7zgC0WLXb01&690
If I enter the above address in the Firefox address bar, the file is shown correctly.
If I use the following code to download the picture, the server always redirects to a default picture:
from requests import get  # just to try different methods
from urllib.request import urlopen
from urllib.parse import urlsplit, urlunsplit, quote

# hard-coded address is randomly selected for debug purposes
origPict = 'http://s2.sinaimg.cn/mw690/001H6t4Fzy7zgC0WLXb01&690'
p = urlsplit(origPict)
newP = quote(p.path)
origPict = urlunsplit([p.scheme, p.netloc, newP, p.query, p.fragment])
try:
    #url_file = urlopen(origPict)
    #u = url_file.geturl()
    url_file = get(origPict)
    u = url_file.url
    if u != origPict:
        raise Exception('Failed to get picture ' + origPict)
    ...
Any clue why requests.get or urllib.urlopen doesn't like the '&' in the url?
Update: Thanks to Artur's comments, I realize the question is not about the request itself, but about the site's protection mechanism: JS, cookies, or something else in the webpage feeds something back to the server that lets it judge whether the request comes from a scraper. So now the question turns to how to scrape an image from a web page, which is more complex than simply downloading the image from a url.
It's not about the & symbol, but about redirection. Try adding the parameter allow_redirects=False to get; it should be okay.
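A small sketch of what that looks like, with the redirect surfaced so you can see where the server sends you (the Referer value is an assumption - hotlink protection often checks it, so adjust it to the real blog page):
import requests

origPict = 'http://s2.sinaimg.cn/mw690/001H6t4Fzy7zgC0WLXb01&690'
r = requests.get(origPict, allow_redirects=False)
# a 30x status means the server tried to redirect; the target is in the Location header
print(r.status_code, r.headers.get('Location'))

# assumption: sending a Referer from the hosting blog may satisfy hotlink protection
r = requests.get(origPict, headers={'Referer': 'http://blog.sina.com.cn/'},
                 allow_redirects=False)
print(r.status_code)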

Python - Web scraping using HTML tags

I am trying to scrape a web page to list out the jobs posted at the URL: https://careers.microsoft.com/us/en/search-results?rk=l-hyderabad
Refer to the web-page inspect screenshot for details.
Following is observed through a web-page inspect:
Each job listed is in an HTML li with class="jobs-list-item". The li contains the following HTML tags and data in a parent div:
data-ph-at-job-title-text="Software Engineer II",
data-ph-at-job-category-text="Engineering",
data-ph-at-job-post-date-text="2018-03-19T16:33:00".
The 1st child div within the parent div, with class="information", has the HTML with the URL:
href="https://careers.microsoft.com/us/en/job/406138/Software-Engineer-II"
The 3rd child div, with class="description au-target", within the parent div has the short job description.
My requirement is to extract below information for each job
Job Title
Job Category
Job Post Date
Job Post Time
Job URL
Job Short Description
I have tried the following Python code to scrape the webpage, but I am unable to extract the required information:
import requests
from bs4 import BeautifulSoup

def ms_jobs():
    url = 'https://careers.microsoft.com/us/en/search-results?rk=l-hyderabad'
    resp = requests.get(url)
    if resp.status_code == 200:
        print("Successfully opened the web page")
        soup = BeautifulSoup(resp.text, 'html.parser')
        print(soup)
    else:
        print("Error")

ms_jobs()
If you want to do this via requests, you need to reverse engineer the site. Open the dev tools in Chrome, select the Network tab, and fill out the form.
This will show you how the site loads the data. If you dig into the site, you'll see that it grabs the data by doing a POST to this endpoint: https://careers.microsoft.com/widgets. It also shows you the payload that the site uses. The site uses cookies, so all you have to do is create a session that keeps the cookie, get one, and copy/paste the payload.
This way you'll be able to extract the same json data that the JavaScript fetches to populate the site dynamically.
Below is a working example of what that would look like. All that's left is to parse out the json as you see fit.
import requests
from pprint import pprint

# create a session to grab a cookie from the site
session = requests.Session()
r = session.get("https://careers.microsoft.com/us/en/")

# these params are the ones that the dev tools show the site sets when using the website form
payload = {
    "lang": "en_us",
    "deviceType": "desktop",
    "country": "us",
    "ddoKey": "refineSearch",
    "sortBy": "",
    "subsearch": "",
    "from": 0,
    "jobs": "true",
    "counts": "true",
    "all_fields": ["country", "state", "city", "category", "employmentType", "requisitionRoleType", "educationLevel"],
    "pageName": "search-results",
    "size": 20,
    "keywords": "",
    "global": "true",
    "selected_fields": {"city": ["Hyderabad"], "country": ["India"]},
    "sort": "null",
    "locationData": {}
}

# this is the endpoint the site uses to fetch json
url = "https://careers.microsoft.com/widgets"
r = session.post(url, json=payload)
data = r.json()
job_list = data['refineSearch']['data']['jobs']
# job_list will hold 20 jobs (you can set the "size" parameter in the payload to a
# higher number if you please - I tested 100, and that returned 100 jobs)
job = job_list[0]
pprint(job)
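For instance, a hedged sketch of pulling a few fields out of each job dict; the key names below are guesses, so pprint one job as above to confirm the real ones:
# the key names below are guesses - pprint(job) above shows the real ones
for job in job_list:
    print(job.get('title'), '|', job.get('category'), '|', job.get('postedDate'))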
Cheers.

Selenium firefox webdriver: set download.dir for a pdf

I tried several solutions from around the web; nothing really works, or they are simply outdated.
Here is my webdriver profile:
let firefox = require('selenium-webdriver/firefox');
let profile = new firefox.Profile();
profile.setPreference("pdfjs.disabled", true);
profile.setPreference("browser.download.dir", 'C:\\MYDIR');
profile.setPreference("browser.helperApps.neverAsk.saveToDisk", "application/pdf","application/x-pdf", "application/acrobat", "applications/vnd.pdf", "text/pdf", "text/x-pdf", "application/vnd.cups-pdf");
I simply want to download a file and set the destination path. It looks like browser.download.dir is ignored.
That's the way I download the file:
function _getDoc(i){
    driver.sleep(1000)
        .then(function(){
            driver.get(`http://mysite/pdf_showcase/${i}`);
            driver.wait(until.titleIs('here my pdf'), 5000);
        });
}

for(let i = 1; i < 5; i++){
    _getDoc(i);
}
The page contains an iframe with a pdf. I can gather its src attribute, but with the iframe and pdfjs.disabled=true, simply visiting the page with driver.get() triggers the download (so I'm OK with it).
The only problem is that the download dir is ignored and the file is saved in Firefox's default download dir.
Side question: if I wrap _getDoc() in a for loop over that parameter i, how can I be sure I won't flood the server? If I use the same driver instance (just like everyone usually does), are the requests sequential?
