How to scrape image/file from web page in Python? - python-3.x

I am trying to use Python 3.7.4 to back up pictures from a blog site, e.g.
http://s2.sinaimg.cn/mw690/001H6t4Fzy7zgC0WLXb01&690
If I enter the above address in the Firefox address bar, the picture is shown correctly.
If I use the following code to download the picture, the server always redirects to a default picture:
from requests import get # just to try different methods
from urllib.request import urlopen
from urllib.parse import urlsplit, urlunsplit, quote
# hard-coded address is randomly selected for debug purpose.
origPict = 'http://s2.sinaimg.cn/mw690/001H6t4Fzy7zgC0WLXb01&690'
p = urlsplit(origPict)
newP = quote(p.path)
origPict = urlunsplit([p.scheme, p.netloc, newP, p.query, p.fragment])
try:
    # url_file = urlopen(origPict)
    # u = url_file.geturl()
    url_file = get(origPict)
    u = url_file.url
    if u != origPict:
        raise Exception('Failed to get picture ' + origPict)
...
Any clue why requests.get or urllib.request.urlopen doesn't like '&' in the URL?
Update: Thanks to Artur's comments, I realize the problem is not the request itself but the site's protection mechanism: JavaScript, cookies, or something else in the web page reports back to the server so that it can judge whether a request comes from a scraper. So the question now becomes how to scrape an image from a web page, which is more complex than simply downloading an image from a URL.

It's not about the & symbol but about redirection. Try passing the parameter allow_redirects=False to get; it should be okay.
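A minimal sketch of that suggestion, assuming the requests library is installed. The helper name fetch_without_redirects and the 10-second timeout are my own choices, not something from the question. It sends a single GET and reports where the server wanted to redirect instead of following it:

```python
import requests


def fetch_without_redirects(url, timeout=10):
    """Send one GET request and report a redirect instead of following it.

    Returns a (status_code, location) tuple; location is None when the
    server did not answer with a redirect.
    """
    response = requests.get(url, allow_redirects=False, timeout=timeout)
    if response.is_redirect:
        # 3xx answer: the Location header names the redirect target.
        return response.status_code, response.headers.get("Location")
    return response.status_code, None
```

Calling fetch_without_redirects('http://s2.sinaimg.cn/mw690/001H6t4Fzy7zgC0WLXb01&690') and getting back a 3xx status with a Location value would suggest that the server itself is redirecting the request, rather than the & being mishandled on the client side.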

Related

Python + Mechanize - Emulate Javascript button click using POST?

I'm trying to automate filling in a car insurance quote form on a site
(following the same format as the site URL, let's call it "https://secure.examplesite.com/css/car/step1#noBack").
I'm stuck on the rego section: once the rego has been entered, a button needs to be clicked to perform the search. This seems to be driven by heavy JavaScript, which I know Mechanize can't handle. I'm not versed in JavaScript, but I can see that clicking the button makes a POST request to this URL: "https://secure.examplesite.com/css/car/step1/searchVehicleByRegNo". Please see the image also.
How can I emulate this POST request in Mechanize to run the JavaScript, so that I can see and interact with the response? Or is this not possible? Should I consider bs4/requests/robobrowser instead? I'm only ~4 months into learning! Thanks
# Mechanize test
import mechanize
br = mechanize.Browser()
br.set_handle_robots(False) # ignore robots
br.set_handle_refresh(False) # can sometimes hang without this
res = br.open("https://secure.examplesite.com/css/car/step1#noBack")
br.select_form(id = "quoteCollectForm")
br.set_all_readonly(False) # allow everything to be written to
controlDict = {}
# List all form controls
for control in br.form.controls:
    controlDict[control.name] = control.value
    print("type = %s, name = %s, value = %s" % (control.type, control.name, control.value))
# Enter Rego etc "example"
br.form["vehicle.searchRegNo"] = "example"
# Now for control name = vehicle.searchRegNo, value = example
# BUT Now how do I click the button?? Simulate POST? The post url is formatted like:
# https://secure.examplesite.com/css/car/step1/searchVehicleByRegNo
[Image: Javascript POST]
Solved my own problem.
Steps:
1. Open dev tools in the browser.
2. Go to the Network tab and clear it.
3. Interact with the form element (in my case, the car rego finder).
4. Click on the request that appears as a result of the interaction.
5. Copy the exact URL, request header data, and payload.
6. I used Postman to quickly test the request; the responses matched those of the web form, and I found the relevant headers.
7. In Postman, convert the request to Python requests code.
Now I can interact completely with the form.
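For illustration, the code produced by the last step might look roughly like the sketch below. The URL and the field name vehicle.searchRegNo come from the question. The header values, the form-encoded payload shape, and the function names are placeholders I made up, so replace them with whatever dev tools actually shows for your request.

```python
import requests

SEARCH_URL = "https://secure.examplesite.com/css/car/step1/searchVehicleByRegNo"


def build_search_request(rego):
    """Build (but do not send) the POST that the button click triggers."""
    # Placeholder headers: copy the real ones from the Network tab.
    headers = {
        "X-Requested-With": "XMLHttpRequest",
        "Referer": "https://secure.examplesite.com/css/car/step1",
    }
    # Assumed payload shape: a form-encoded field named like the form control.
    payload = {"vehicle.searchRegNo": rego}
    return requests.Request("POST", SEARCH_URL, headers=headers, data=payload)


def send_search(session, rego, timeout=10):
    """Send the request through a session so cookies set earlier are reused."""
    prepared = session.prepare_request(build_search_request(rego))
    return session.send(prepared, timeout=timeout)
```

Keeping the build step separate from the send step makes it possible to inspect the prepared method, URL, headers and body and compare them with what the browser sent before hitting the server.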

how to target specific input for posting with requests or requests_html

I need to fill in a single input here:
https://outline.com/
When I view the page source (Ctrl+U in Firefox), the id is hidden, by some JS effect I think. In my opinion the id is 'source', but how can I see it and target it for posting?
I tried this:
import requests, requests_html
import json
url_out = "https://outline.com/"
url_target = "https://www.whatyouwanthere"
r = requests.post(url_out, data=json.dumps(url_target))
Is it possible to avoid Selenium for this task?
You were pretty far from the actual solution, but there are definitely more ways to do this.
This works:
import requests, json
url_out = "http://outline-server-788914346.us-west-2.elb.amazonaws.com/article"
url_target = "https://www.google.com"
r = requests.get(url = url_out, params = {'source_url': url_target})
data = r.json()
print(data)
I was able to find this by watching the network activity in devtools and taking a look at the app.js file in use on their site. The form on the main page just queries an API with the article URL and the API is what generates the short url on their service.
Honestly, I'm not sure if this breaks their ToS or not, you may want to check with them on that.

how to properly call a REST-API with an * in the URL

I searched the internet (and Stack Overflow :D) for an answer to the following question and found none that I understood.
Background:
We want to use a Python script to connect our company's CMDB with our AWX/Ansible infrastructure.
The CMDB has a REST API which supports a (halfway) proper export.
I'm currently stuck on implementing the correct API call.
I can call the API itself and authenticate, but I can't apply the filter I need to get the right results.
The filter is applied by including the following string in the URL (more in the code example below):
Label LIKE "host*"
It seems that Python has a problem with the *.
error message:
InvalidURL(f"URL can't contain control characters. {url!r} "
I found some bug reports saying there is an issue in some Python versions, but I'm way too new to properly understand whether this affects me here :D
Python version used: 3.7.4
PS: let's see if I can get the markup right :D
I varied the called URL to determine where exactly the problem occurs.
It only occurs when I use the SQL-like filter part.
This part is essential, since I just want our "hosts" to be returned and not the whole CMDB.
#import the required classes and such
from http.client import HTTPConnection
import json
#create a HTTP connection client
client = HTTPConnection("cmdb.example.company")
#basic auth and some header details
headers = {'Content-Type': 'application/json',
           'Authorization': 'Basic my-auth-token'}
#working API call
client.request('GET', '/cmdb/rest/hosts?attributes=Label,Keywords,Tag,Description&limit=10', headers=headers)
#broken API call returns - InvalidURL(f"URL can't contain control characters. {url!r} "
client.request('GET', '/cmdb/rest/hosts?filter=Label LIKE "host*"&attributes=Label,Keywords,Tag,Description&limit=10', headers=headers)
#check and convert the response into a readable (JSON) format
response = client.getresponse()
data = response.read()
#debugging print - show that the returned data is bytes?!
print(data)
#convert the returned data into json
my_json = data.decode('utf8').replace("'", '"')
data = json.loads(my_json)
#only return the data part from the JSON and ignore the meta-overhead
text = json.dumps(data["data"], sort_keys=True, indent=4)
print(text)
So, I want to know how to properly call the API with the described filter and resolve the error shown.
Can you give me an example I can try, or pinpoint a beginner's mistake I made?
Am I affected by the mentioned Python bug regarding URLs with * in them?
thanks for helping me out :)
Soooo, I found my beginner's mistake myself:
I used the URL from my browser, and my browser automatically encodes the special characters within the URL.
I found the following piece of code in a Python 3 URL encoding guide and modified the string to fit my needs :)
import urllib.parse
query = ' "host*"'
urllib.parse.quote(query)
Result: '%20%22host%2A%22'
%20 = " " (space)
%22 = '"' (double quote)
%2A = "*"
so the final code looks somewhat like this:
#broken API call returns - InvalidURL(f"URL can't contain control characters. {url!r} "
#unencoded filter part: filter=Label LIKE "host*"
client.request('GET', '/cmdb/rest/hosts?filter=Label LIKE "host*"&attributes=Label,Keywords,Tag,Description&limit=10', headers=headers)
#fixed API call
#encoded filter part: filter=Label%20LIKE%20%22host%2A%22
client.request('GET', '/cmdb/rest/hosts?filter=Label%20LIKE%20%22host%2A%22&attributes=Label,Keywords,Tag,Description&limit=10', headers=headers)
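Instead of encoding the filter by hand, the query string can be built from plain values with the standard library. This is only a sketch: the parameter names and values are taken from the call above, and the variable names are mine. As far as I can tell, the check in http.client rejects spaces and control characters, so it is the unencoded spaces in the filter that trigger InvalidURL rather than the * itself.

```python
from urllib.parse import quote, urlencode

# Plain, human-readable parameters as they appear in the question.
params = {
    "filter": 'Label LIKE "host*"',
    "attributes": "Label,Keywords,Tag,Description",
    "limit": 10,
}

# quote_via=quote encodes spaces as %20 (the default quote_plus would use +);
# safe="," keeps the commas in the attribute list readable.
query = urlencode(params, quote_via=quote, safe=",")
path = "/cmdb/rest/hosts?" + query
print(path)
```

The resulting path can then be passed to client.request('GET', path, headers=headers) in place of the hand-encoded string.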

Python Requests Refresh

I'm trying to use python's requests library to log in to a website. It's a pretty simple code, and you can really get the gist of requests just by going on its website. I, however, want to check if I'm successfully logged in via the url. The problem I've encountered is when I initiate the post requests and give it (the variable p) a url, whether the html has changed or not I'm still passed the same url when I type print(p.url). Is there any way for me to refresh the browser or update the url to whatever it's currently set at?
(I can add a line for checking the url against itself later, but for now I just want to get the correct url)
#!/usr/bin/env python3
import requests
payload = {'login': 'USERNAME',
           'password': 'PASSWORD'}
with requests.Session() as s:
    p = s.post('WEBSITE', data=payload)
    # print(p.text)
    print(p.url)
Using python-requests may not be as complex as you think. It automatically follows the redirect of your post (or of session.get()).
Here, the session.post() method returns a response object:
r = s.post('website', data=payload)
which means r.url is the current URL you are looking for.
If you still want to refresh the current page, just use:
s.get(r.url)
To verify whether you have logged in successfully, one solution is to first log in with your browser and see what the resulting page looks like.
Based on the title or content of the returned web page (i.e., the content in r.text), you can then judge whether the login worked.
BTW, python-requests is a great library, enjoy it.
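The checks described above could be put together roughly like this. It is a sketch under assumptions: the function name, the returned dictionary, and the idea of a success_marker (some text that only appears on the page once you are logged in) are mine, and the right marker depends entirely on the site.

```python
import requests


def login(session, login_url, payload, success_marker):
    """Post the login form and summarise where the session ended up."""
    response = session.post(login_url, data=payload)
    return {
        # URL of the final response, after any redirects were followed.
        "final_url": response.url,
        # URLs of the intermediate responses; empty if nothing redirected.
        "redirects": [r.url for r in response.history],
        # True when the marker text shows up in the returned page.
        "logged_in": success_marker in response.text,
    }
```

It would be used as login(s, 'WEBSITE', payload, 'some text from the logged-in page') inside the with requests.Session() as s: block. If final_url is the same as the login URL and redirects is empty, the server answered the POST directly without redirecting, which is a separate question from whether the login succeeded.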

Python Requests: Use * as wildcard part of URL

Let's say I want to get a zip file (and then extract it) from a specific URL.
I want to be able to use a wildcard character in the URL like this:
https://www.urlgoeshere.domain/+*+-one.zip
instead of this:
https://www.urlgoeshere.domain/two-one.zip
Here's an example of the code I'm using (URL is contrived):
import requests, zipfile, io
year='2016'
monthnum='01'
month='Jan'
zip_file_url='https://www.urlgoeshere.domain/two-one.zip'
r = requests.get(zip_file_url, stream=True)
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall()
Thanks in advance!
HTTP does not work that way. You must use the exact URL in order to request a page from the server.
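Since the server only answers exact URLs, any wildcard has to be resolved on the client side. One way to do that, sketched below, assumes you can get a list of candidate file names from somewhere (for example an index page or a known naming scheme). The function name and the sample names are hypothetical, and the domain is the contrived one from the question.

```python
from fnmatch import fnmatchcase
from urllib.parse import urljoin

BASE_URL = "https://www.urlgoeshere.domain/"


def matching_urls(file_names, pattern, base_url=BASE_URL):
    """Return the exact URLs whose file name matches a shell-style pattern."""
    return [urljoin(base_url, name)
            for name in file_names
            if fnmatchcase(name, pattern)]


# Hypothetical list of names; in practice this has to come from the site.
names = ["two-one.zip", "three-one.zip", "two-two.zip"]
for url in matching_urls(names, "*-one.zip"):
    print(url)
```

Each URL that comes back is an exact address and can be passed to requests.get(zip_file_url, stream=True) as in the question.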
I'm not sure if this helps you, but Flask has a feature that works similarly to what you require. Here's a working example:
@app.route('/categories/<int:category_id>')
def categoryDisplay(category_id):
    ''' Display a category's items
    '''
    # Get category and its items via queries on session
    category = session.query(Category).filter_by(id=category_id).one()
    items = session.query(Item).filter_by(category_id=category.id)
    # Display items using Flask HTML Templates
    return render_template('category.html', category=category, items=items,
                           editItem=editItem, deleteItem=deleteItem, logged_in=check_logged_in())
The route decorator tells the web server to call that function when a URL like */categories/(1/2/3/4/232...) is accessed. I'm not sure, but I think you could do the same with the name of your zip file as a string. See here (project.py) for more details.
