In Chrome, I log in to a website.
I then open DevTools, go to the Network tab and clear out the existing requests.
I then click on a link and grab the request headers, which include the cookie.
In Python, I store those headers in a dictionary and use it for the rest of the code.
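The dictionary is just the copied request headers, roughly like this (the values below are placeholders, not real ones):

headers = {
    'User-Agent': 'Mozilla/5.0 ...',                 # copied from the request in DevTools
    'Cookie': 'sessionid=PLACEHOLDER; other=VALUE',  # placeholder, use the cookie value Chrome shows
}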
import requests
from bs4 import BeautifulSoup as soup

def site_session(url, header_dict):
    session = requests.Session()
    t = session.get(url, headers=header_dict)
    return soup(t.text, 'html.parser')

site = site_session('https://website.com', headers)
# Scrape the site as usual until I reach a file I can't download.
This is a video file, but it has no extension.
"https://website.sub.com/download"
Clicking on this link in the browser opens the save dialog and I can save it. But not in Python.
Examining the Network tab, it appears to redirect to another URL, which I was able to scrape:
"https://website.sub.com/download/file.mp4?jibberish-jibberish-jibberish"
Shortening it to just "https://website.sub.com/download/file.mp4" does not open anything.
with open(shortened_url, 'wb') as file:
    response = requests.get(shortened_url)
    file.write(response.content)
I've tried both the full url and shortened url and receive:
OSError: [Errno 22] Invalid argument.
Any help with this would be awesome!
Thanks!
I had to use the full URL with the query string and include the headers.
# Read the headers of the first URL to get the redirected file URL
fake_url = requests.head(url, allow_redirects=True, headers=headers)
# Get the real file URL (and its contents)
real_url = requests.get(fake_url.url, headers=headers)
# Strip the name out of the URL here since it's a loop, e.g.:
stripped_name = fake_url.url.split('/')[-1].split('?')[0]
# Open a file on the PC and write the contents of real_url
with open(stripped_name, 'wb') as file:
    file.write(real_url.content)
When I visit the URL below in a browser, it automatically downloads a CSV. As the contents are updated daily, I want to write a Python command to get the latest file each time.
I've tried wget, requests and urllib.request - all without luck.
url = 'https://coronavirus.data.gov.uk/api/v1/data?filters=areaType=overview&structure=%7B%22areaType%22:%22areaType%22,%22areaName%22:%22areaName%22,%22areaCode%22:%22areaCode%22,%22date%22:%22date%22,%22newPeopleVaccinatedFirstDoseByPublishDate%22:%22newPeopleVaccinatedFirstDoseByPublishDate%22,%22cumPeopleVaccinatedFirstDoseByPublishDate%22:%22cumPeopleVaccinatedFirstDoseByPublishDate%22%7D&format=csv'
Anyone got any ideas? TIA
This works just fine for me:
import requests
url = 'https://coronavirus.data.gov.uk/api/v1/data?filters=areaType=overview&structure=%7B%22areaType%22:%22areaType%22,%22areaName%22:%22areaName%22,%22areaCode%22:%22areaCode%22,%22date%22:%22date%22,%22newPeopleVaccinatedFirstDoseByPublishDate%22:%22newPeopleVaccinatedFirstDoseByPublishDate%22,%22cumPeopleVaccinatedFirstDoseByPublishDate%22:%22cumPeopleVaccinatedFirstDoseByPublishDate%22%7D&format=csv'
r = requests.get(url)
with open("uk_data.csv", "wb") as f:
f.write(r.content)
The content is a bytes object, so you need to open the file in binary mode.
Some time ago I set up a web scraper using BS4 that logs the price of a whisky each day.
import requests
from bs4 import BeautifulSoup

def getPrice() -> float:
    try:
        URL = "https://www.thewhiskyexchange.com/p/2940/suntory-yamazaki-12-year-old"
        website = requests.get(URL)
    except:
        print("ERROR requesting Price")
    try:
        soup = BeautifulSoup(website.content, 'html.parser')
        price = str(soup.find("p", class_="product-action__price").next)
        price = float(price[1::])
        return price
    except:
        print("ERROR parsing Price")
This worked as intended. The request contained the complete website and the correct value was extracted.
I am now trying to scrape other sites for data on other whiskies, this time using Scrapy.
I tried the following URLs:
https://www.thegrandwhiskyauction.com/past-auctions/q-macallan/180-per-page/relevance
https://www.ebay.de/sch/i.html?_sacat=0&LH_Complete=1&_udlo=&_udhi=&_samilow=&_samihi=&_sadis=10&_fpos=&LH_SALE_CURRENCY=0&_sop=12&_dmd=1&_fosrp=1&_nkw=macallan&rt=nc
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "whisky"

    def start_requests(self):
        user_agent = 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36'
        urls = [
            'https://www.thegrandwhiskyauction.com/past-auctions/q-macallan/180-per-page/relevance',
        ]
        for url in urls:
            # pass the user agent so the request does not look like a default Scrapy bot
            yield scrapy.Request(url=url, headers={'User-Agent': user_agent}, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = f'whisky-{page}.html'
        # data = response.css('.itemDetails').getall()
        with open(filename, 'wb') as f:
            f.write(response.body)
I just customized the basic example from the tutorial to create the quick prototype above.
However, it did not return the complete page. The body of the response was missing several tags, and in particular the content I was looking for.
I tried to solve this with BS4 again like this:
import requests
from bs4 import BeautifulSoup
URL = "https://www.thegrandwhiskyauction.com/past-auctions/q-macallan/180-per-page/relevance"
website = requests.get(URL)
soup = BeautifulSoup(website.content, 'html.parser')
with open("whiskeySoup.html", 'w') as f:
f.write(str(soup.body))
To my surprise, this produced the same result: the response body did not contain the complete page and was missing all the data I was looking for.
I also included a user-agent header since I learned that some sites recognize requests from bots and spiders and do not deliver all their data. However, this did not solve the problem.
I am unable to figure out or debug why the data requested from those URLs is incomplete.
Is there a way to solve this using Scrapy?
A lot of websites rely heavily on JavaScript to generate the final HTML page. When you send a request, the server returns HTML containing some script; browsers like Chrome and Firefox execute that JavaScript, and only then does the final HTML you see appear. Scrapy, requests and similar libraries do not execute JavaScript, so the HTML the crawler sees is different from the page you see in the browser.
If you want to see the website the way the crawler sees it (the HTML of the page as received by the crawler), you can run 'scrapy view {url}', which opens that page in your browser; 'scrapy fetch {url}' prints the HTML the crawler received. When working with Scrapy it is a good idea to open the URL in the shell ('scrapy shell {url}'), test your extraction logic there with the xpath or css methods (response.css('some_css').css('again_some_css')), and then add that logic to your final crawler. To see the response you got in the shell, just type view(response) and it will open the received response in the browser. I hope that is clear. If you need the JavaScript to be executed before processing the response, you can use Selenium, which drives a real or headless browser, or Splash, which is a lightweight web browser with JavaScript rendering; Selenium is pretty easy to use.
Edit 1: For the first URL, go to the scrapy shell and check the CSS path div.bidPrice::text. You will see there is no HTML inside it; the content is generated dynamically.
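If you go the Selenium route, a minimal sketch could look like the following. It assumes Chrome and a matching chromedriver are installed, and the '.itemDetails' selector is just the one commented out in the spider above, so it may need adjusting:

import time
from selenium import webdriver
from bs4 import BeautifulSoup

options = webdriver.ChromeOptions()
options.add_argument('--headless')     # run without opening a browser window
driver = webdriver.Chrome(options=options)

driver.get('https://www.thegrandwhiskyauction.com/past-auctions/q-macallan/180-per-page/relevance')
time.sleep(5)                          # crude wait for the JavaScript to populate the page
html = driver.page_source              # HTML after the scripts have run
driver.quit()

soup = BeautifulSoup(html, 'html.parser')
items = soup.select('.itemDetails')    # selector taken from the spider above, may need adjusting
print(len(items))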
I have written code for a web scraper program as follows (in Python):
import requests, bs4  # you may need to install these first: pip install requests beautifulsoup4

link_list = []

res = requests.get('https://scholar.google.com/scholar?start=0&q=traffic+light+reinforcement+learning&hl=en&as_sdt=1,5&as_ylo=2019&as_vis=1')
soup = bs4.BeautifulSoup(res.text, 'html.parser')
for link in soup.find_all('a'):
    href = link.get('href')
    if href and 'pdf' in href:
        link_list.append(href)

for x in range(1, 100):
    i = str(x * 10)
    url = f'https://scholar.google.com/scholar?start={i}&q=traffic+light+reinforcement+learning&hl=en&as_sdt=1,5&as_ylo=2019&as_vis=1'
    res_2 = requests.get(url)
    soup = bs4.BeautifulSoup(res_2.text, 'html.parser')
    for link in soup.find_all('a'):
        href = link.get('href')
        if href and 'pdf' in href:
            link_list.append(href)

if link_list:
    for x in range(0, len(link_list)):
        res_3 = requests.get(link_list[x])
        # The first argument of open() is a file path that is available only on my computer.
        # Set it to something accessible on your computer, e.g. Something\something\something\{x}.pdf,
        # but do not change the last part!
        with open(f'/Users/atharvanaik/Desktop/Cursed/{x}.pdf', 'wb') as f:
            f.write(res_3.content)
        print(x)
else:
    print('sorry, unavailable')
For context, I am trying to bulk-download PDFs from a Google Scholar search instead of doing it manually.
I manage to download the vast majority of the PDFs, but some of them, when I try opening them, give me this message:
"It may be damaged or use a file format that Preview doesn’t recognise."
As seen in the above code, I am using requests to download the content and write it to the file. Is there a way to work around this?
I am writing a script to download files from Slack using the Slack API and the requests library in Python. Every file I download comes out the same size (80 KB) and is corrupted.
Here is my code:
import os
import shutil
import requests

def download_file(url, out):
    try:
        os.stat(out)
    except:
        os.mkdir(out)
    local_filename = out + '\\' + url.split('/')[-1]
    print('outputting to file: %s' % local_filename)
    response = requests.get(url, stream=True)
    with open(local_filename, 'wb') as f:
        response.raw.decode_content = True
        shutil.copyfileobj(response.raw, f)
    return local_filename
I have tried various methods posted throughout SO to download the files but have not been successful. I have also checked the URLs I am getting from the Slack API; they are correct, since I can paste them into my browser and download the file.
Any help would be appreciated!
I figured out my problem. Since I am downloading from the file's private URL in the Slack API file object, I needed to include an Authorization header with my token in addition to the basic request. To do this with requests:
response = requests.get(url, stream=True, headers={'Authorization': 'Bearer ' + my_token})
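Folding that header into the download_file function above would look roughly like this (my_token is assumed to be a Slack token that has access to the file):

import os
import shutil
import requests

def download_file(url, out, my_token):
    # create the output directory if it does not exist yet
    if not os.path.isdir(out):
        os.mkdir(out)
    local_filename = os.path.join(out, url.split('/')[-1])
    print('outputting to file: %s' % local_filename)
    # the Authorization header is what makes Slack return the real file instead of an error page
    response = requests.get(url, stream=True, headers={'Authorization': 'Bearer ' + my_token})
    with open(local_filename, 'wb') as f:
        response.raw.decode_content = True
        shutil.copyfileobj(response.raw, f)
    return local_filename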
I'm looking for a little help here. I've been using requests in Python to gain access to a website. I'm able to access the website and get a response header, but I'm not exactly sure how to download the zip file named in the Content-disposition header. I'm guessing this isn't something requests can handle, or at least I can't seem to find any info on it. How do I gain access to the file and save it?
'Content-disposition': 'attachment;filename=MT0376_DealerPrice.zip'
Using urllib instead of requests because it's a lighter library:
import urllib.request

req = urllib.request.Request(url, method='HEAD')
r = urllib.request.urlopen(req)
print(r.info().get_filename())
Example:
In[1]: urllib.request.urlopen(urllib.request.Request('https://httpbin.org/response-headers?content-disposition=%20attachment%3Bfilename%3D%22example.csv%22', method='HEAD')).info().get_filename()
Out[1]: 'example.csv'
What you want to do is access response.content. You may want to add better checks that the response really contains the file.
A little example snippet:
response = requests.post(url, files={'name': 'filename.bin'}, timeout=60, stream=True)
if response.status_code == 200:
    if response.headers.get('Content-Disposition'):
        print("Got file in response")
        print("Writing file to filename.bin")
        with open("filename.bin", 'wb') as f:
            f.write(response.content)