Download content-disposition from http response header (Python 3) - python-3.x

I'm looking for a little help here. I've been using Requests in Python to access a website. I'm able to access the site and get a response header, but I'm not exactly sure how to download the zip file named in the Content-Disposition header. I'm guessing this isn't something Requests can handle, or at least I can't find any info on it. How do I access the file and save it?
'Content-disposition': 'attachment;filename=MT0376_DealerPrice.zip'

Using urllib instead of requests, because it's a lighter library:
import urllib.request
req = urllib.request.Request(url, method='HEAD')
r = urllib.request.urlopen(req)
print(r.info().get_filename())
Example:
In[1]: urllib.request.urlopen(urllib.request.Request('https://httpbin.org/response-headers?content-disposition=%20attachment%3Bfilename%3D%22example.csv%22', method='HEAD')).info().get_filename()
Out[1]: 'example.csv'

What you want to do is access response.content. You may want to add stricter checks that the response really contains the file.
A little example snippet:
import requests

response = requests.post(url, files={'name': 'filename.bin'}, timeout=60, stream=True)
if response.status_code == 200:
    if response.headers.get('Content-Disposition'):
        print("Got file in response")
        print("Writing file to filename.bin")
        with open("filename.bin", 'wb') as f:
            f.write(response.content)
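As a sketch of that check, the filename can also be pulled out of the Content-Disposition value itself rather than hard-coded. The regex below is a simplified assumption: real-world headers can also carry quoted values or RFC 5987 `filename*=` parameters, which it does not handle.

```python
import re

def filename_from_disposition(disposition):
    """Extract the filename parameter from a Content-Disposition value."""
    match = re.search(r'filename="?([^";]+)"?', disposition)
    return match.group(1) if match else None

print(filename_from_disposition('attachment;filename=MT0376_DealerPrice.zip'))
# MT0376_DealerPrice.zip
```

With requests, `filename_from_disposition(response.headers.get('Content-Disposition', ''))` then gives a name to save `response.content` under.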

Related

Python Web Scraping - After Authentication - Traverse - Download from URL with no Extension

In Chrome, I log in to the website.
I then inspect the site, go to the Network tab, and clear out the existing entries.
I then click on a link and grab the request headers, which store the cookie.
In Python, I store those headers in a dictionary and use it for the rest of the code.
import requests
from bs4 import BeautifulSoup as soup

def site_session(url, header_dict):
    session = requests.Session()
    t = session.get(url, headers=header_dict)
    return soup(t.text, 'html.parser')

site = site_session('https://website.com', headers)
# Scrape the site as usual until I reach a file I can't download...
This is a video file, but the URL has no extension.
"https://website.sub.com/download"
Clicking on this link opens the save dialog and I can save it, but not in Python.
Examining the Network, it appears it redirects to another url in which I was able to scrape.
"https://website.sub.com/download/file.mp4?jibberish-jibberish-jibberish
Trying to shorten it to just "https://website.sub.com/download/file.mp4" does not open.
with open(shortened_url, 'wb') as file:
    response = requests.get(shortened_url)
    file.write(response.content)
I've tried both the full url and shortened url and receive:
OSError: [Errno 22] Invalid argument.
Any help with this would be awesome!
Thanks!
I had to use the full url with the query and include the headers.
# Read the header of the first URL to get the file URL
fake_url = requests.head(url, allow_redirects=True, headers=headers)
# Get the real file URL
real_url = requests.get(fake_url.url, headers=headers)
# strip out the name from the url here since it's a loop
# Open a file on the PC and write the contents of real_url
with open(stripped_name, 'wb') as file:
    file.write(real_url.content)
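The "strip out the name" step in the comment can be sketched with the standard library: take the last path segment of the URL and ignore the query string. `name_from_url` is an illustrative helper name, not part of the original code.

```python
from urllib.parse import urlsplit
import posixpath

def name_from_url(url):
    """Last path segment of the URL, with the query string stripped."""
    return posixpath.basename(urlsplit(url).path)

print(name_from_url("https://website.sub.com/download/file.mp4?some-query"))
# file.mp4
```

This avoids the OSError above, which came from passing the whole URL (query string and all) to open() as a filename.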

Can't download image from a website with Selenium; it gives 403 error

I was trying to scrape pictures of a certain character on Pixiv with Selenium, but when I tried to download one, it gave me a 403 error. I tried using the requests module with the src link to download it directly, and it gave me the error. I tried opening a new tab with the src link and got the same error. Is there a way to download an image from Pixiv? I was planning something a bit larger than just downloading a single image, but I am stuck on this. I did set the user-agent, as this thread suggested, but either it didn't work or I did something wrong.
This is the image I tried to download: https://www.pixiv.net/en/artworks/93284987.
from selenium import webdriver
driver = webdriver.Chrome()
driver.get('https://i.pximg.net/img-original/img/2021/10/07/19/41/28/93284987_p0.jpg')
I skipped the code to get it here, but what happens is that I get the src link to the image, but I can't access the page to download it. I don't know if I need to actually go to the page, but I can't do anything with the src either. I tried several methods but nothing works. Can someone help me?
It seems to download just fine without any headers for me:
import requests
import shutil

url = 'https://i.pximg.net/img-master/img/2021/10/07/19/41/28/93284987_p0_master1200.jpg'
response = requests.get(url, stream=True)
local_filename = url.split('/')[-1]
with open(local_filename, 'wb') as out_file:
    shutil.copyfileobj(response.raw, out_file)
del response
print(local_filename)
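If the 403 still appears (for example on the img-original URL from the question), a common cause is a missing Referer header on i.pximg.net requests. The sketch below is an assumption about the host's hotlink checks, not documented Pixiv behavior; the header values and `download_image` name are illustrative.

```python
import requests

# Assumed fix: the image host rejects hotlinked requests, so send a Referer
# pointing back at pixiv.net along with a browser-like User-Agent.
PIXIV_HEADERS = {
    "User-Agent": "Mozilla/5.0",
    "Referer": "https://www.pixiv.net/",
}

def download_image(url, path):
    """Stream an image URL to disk, raising on 403/404 instead of saving junk."""
    response = requests.get(url, headers=PIXIV_HEADERS, stream=True)
    response.raise_for_status()
    with open(path, "wb") as out_file:
        for chunk in response.iter_content(chunk_size=8192):
            out_file.write(chunk)
```

`raise_for_status()` makes a blocked request fail loudly rather than writing an error page to disk.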

How do I scrape data from this specific website?

I am trying to get some data out of this website.
http://asphaltoilmarket.com/index.php/state-index-tracker/
I am trying to get the data using the following code, but it times out.
import requests
asphalt_r = requests.get('http://asphaltoilmarket.com/index.php/state-index-tracker/')
This website opens with no problems in the browser, and I can get data from other websites (with a different structure) using this code, but it does not work with this website. I am not sure what changes I need to make.
Also, I could download the data in Excel and in another tool (Alteryx) that uses GET via curl.
They likely don't want you to scrape their site.
The response code is a quick indication of that.
>>> import requests
>>> asphalt_r = requests.get('http://asphaltoilmarket.com/index.php/state-index-tracker/')
>>> asphalt_r
<Response [406]>
406 = Not Acceptable
>>> asphalt_r = requests.get('http://asphaltoilmarket.com/index.php/state-index-tracker/', headers={"User-Agent": "curl/7.54"})
>>> asphalt_r
<Response [200]>
Read and follow their AUP & Terms of Service.
Working does not equal permission.

Unable to get values from a GET request

I am trying to scrape a website to extract values.
I can see text in the response, but none of the values shown on the website.
How do I get the values, please?
I have used basic code from Stack Overflow as a test to explore the data. The code is posted below. It works on other sites, but not on this one.
import requests

url = 'https://www.ishares.com/uk/individual/en/products/253741/'
data = requests.get(url).text
with open('F:\\webScrapeFolder\\out.txt', 'w') as f:
    print(data.encode("utf-8"), file=f)
print('--- end ---')
There is no error message.
The file is written correctly.
However, I do not see any of the numbers.
Try it with this URL instead:
url = "https://www.ishares.com/uk/individual/en/products/253741/?switchLocale=y&siteEntryPassthrough=true"
and check the response. If you still cannot get what is expected, please describe in more detail what is needed.
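Those extra query parameters bypass the site's locale/entry splash page, which is what the first request was actually receiving. A sketch of the same thing with requests' `params` argument (shown here without sending the request, just to confirm the URL that would be fetched):

```python
import requests

url = "https://www.ishares.com/uk/individual/en/products/253741/"
# The splash-page bypass from the answer, expressed as query parameters.
params = {"switchLocale": "y", "siteEntryPassthrough": "true"}

# Build the request without sending it, to show the final URL requests would use.
prepared = requests.Request("GET", url, params=params).prepare()
print(prepared.url)
```

`requests.get(url, params=params).text` would then fetch the real page. Whether the numbers appear in that HTML still depends on the page being server-rendered rather than filled in by JavaScript.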

Downloading files using Python requests

I am writing a script to download files from Slack using the Slack API and the requests library in Python. Every file I download comes out the same size (80 KB) and corrupted.
Here is my code:
import os
import shutil
import requests

def download_file(url, out):
    try:
        os.stat(out)
    except OSError:
        os.mkdir(out)
    local_filename = out + '\\' + url.split('/')[-1]
    print('outputting to file: %s' % local_filename)
    response = requests.get(url, stream=True)
    with open(local_filename, 'wb') as f:
        response.raw.decode_content = True
        shutil.copyfileobj(response.raw, f)
    return local_filename
I have tried various different methods posted throughout SO to download the files but have not been successful. I have also checked the URL's I am getting from the Slack API and they are correct since I can paste them in my browser and download the file.
Any help would be appreciated!
I figured out my problem. Since I am using the file's private URL from the Slack API file object, I needed to include an Authorization header carrying my token in the request. To do this with requests:
response = requests.get(url, stream=True, headers={'Authorization': 'Bearer ' + my_token})
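Putting the fix together with the download function above, a sketch might look like the following. `auth_header` and `download_slack_file` are illustrative names, and the token is whatever your Slack app provides; without the header, Slack returns an HTML login page, which explains the identical 80 KB corrupted files.

```python
import os
import requests

def auth_header(token):
    """Bearer-token header Slack expects for url_private downloads."""
    return {"Authorization": "Bearer " + token}

def download_slack_file(url, out_dir, token):
    """Stream a private Slack file URL to out_dir, authenticating with the token."""
    os.makedirs(out_dir, exist_ok=True)
    local_filename = os.path.join(out_dir, url.split("/")[-1])
    response = requests.get(url, headers=auth_header(token), stream=True)
    response.raise_for_status()
    with open(local_filename, "wb") as f:
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)
    return local_filename
```

`raise_for_status()` turns an expired or rejected token into an immediate error instead of a silently corrupted file.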
