python download file into memory and handle broken links - python-3.x

I'm using the following code to download a file into memory:
import requests

if 'login_before_download' in obj.keys():
    with requests.Session() as session:
        session.post(obj['login_before_download'], verify=False)
        request = session.get(obj['download_link'], allow_redirects=True)
else:
    request = requests.get(obj['download_link'], allow_redirects=True)

print("downloaded {} into memory".format(obj['download_link']))
file_content = request.content
obj is a dict that contains the download_link and another key that indicates whether I need to log in to the page to create a cookie.
The problem with my code is that if the URL is broken and there isn't any file to download, I still get the HTML content of the page instead of identifying that the download failed.
Is there any way to identify that the file wasn't downloaded?

I found the following solution at this URL:
import requests

def is_downloadable(url):
    """
    Does the url contain a downloadable resource
    """
    h = requests.head(url, allow_redirects=True)
    header = h.headers
    content_type = header.get('content-type')
    if 'text' in content_type.lower():
        return False
    if 'html' in content_type.lower():
        return False
    return True

print(is_downloadable('https://www.youtube.com/watch?v=9bZkp7q19f0'))
# >> False
print(is_downloadable('http://google.com/favicon.ico'))
# >> True
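
Building on that, a minimal sketch for the login flow above (assuming the server returns a proper 4xx/5xx status for broken links and a text/html Content-Type for error pages) is to combine raise_for_status() with the Content-Type check:

import requests

def download_into_memory(obj):
    with requests.Session() as session:
        if 'login_before_download' in obj:
            session.post(obj['login_before_download'], verify=False)
        response = session.get(obj['download_link'], allow_redirects=True)
    # Raise an exception for 4xx/5xx responses (broken or moved links)
    response.raise_for_status()
    content_type = response.headers.get('content-type', '')
    if 'text' in content_type.lower() or 'html' in content_type.lower():
        raise ValueError("Got an HTML page instead of a file: {}".format(obj['download_link']))
    return response.content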

Related

Why can't I get all contributors of a GitHub repository using the GitHub API?

The following is my code, and every time it returns only 60 contributors:
import requests

base_url = "https://api.github.com"
owner = "http4s"
repo = "http4s"

# Set the headers for the request
headers = {
    "Accept": "application/vnd.github+json",
    "Authorization": "token xxxxxx"
}

url = f"{base_url}/repos/{owner}/{repo}/contributors?per_page=100"
more_pages = True

# Initialize a set to store the contributors
contributors = set()

while more_pages:
    # Send the request to the API endpoint
    response = requests.get(url, headers=headers)
    # Check the status code of the response
    if response.status_code == 200:
        # If the request is successful, add the list of contributors to the `contributors` set
        contributors.update([contributor["login"] for contributor in response.json()])
        # Check the `Link` header in the response to see if there are more pages to fetch
        link_header = response.headers.get("Link")
        if link_header:
            # The `Link` header has the format `<url>; rel="name"`, so we split it on `;` to get the URL
            next_url = link_header.split(";")[0][1:-1]
            # Check if the `rel` parameter is "next" to determine if there are more pages to fetch
            if "rel=\"next\"" in link_header:
                url = next_url
            else:
                more_pages = False
        else:
            more_pages = False
    else:
        # If the request is not successful, print the status code and the error message
        print(f"Failed to get contributors: {response.status_code} {response.text}")
        more_pages = False
Can someone tell me how I can optimize my code?
I have tried many ways to fetch more contributors, but every time it returns only 60 different contributors. I want to get the full list of contributors.
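
A sketch of one likely fix, assuming the repository really does have more than the reported 60 contributors: requests already parses the Link header into response.links, so the next-page URL can be read from there instead of splitting the raw header by hand (on later pages the first URL in that header is typically the "prev" link, not "next"):

import requests

def get_contributors(owner, repo, token):
    """Collect every contributor login by following the paginated Link header."""
    url = f"https://api.github.com/repos/{owner}/{repo}/contributors?per_page=100"
    headers = {
        "Accept": "application/vnd.github+json",
        "Authorization": f"token {token}",
    }
    contributors = set()
    while url:
        response = requests.get(url, headers=headers)
        response.raise_for_status()
        contributors.update(contributor["login"] for contributor in response.json())
        # requests exposes the parsed Link header as a dict, e.g. response.links["next"]["url"]
        url = response.links.get("next", {}).get("url")
    return contributors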

How do I use the Mindee API with Python 3?

I'm just playing about with code and am interested in parsing receipts to text (the end goal is a CSV file). I came across this tutorial on the Mindee API where they also provide code to run the parsing. However, I keep getting the errors below when attempting to parse.
import requests

url = "https://api.mindee.net/v1/products/mindee/expense_receipts/v3/predict"

# Here they mention to specify the PATH to the file/files, which is here as per my windows10 path.
with open("/Users/test/PycharmProjects/PythonCrashCourse", "rb") as myfile:
    files = {"IMG_5800.jpg": myfile}
    headers = {"Authorization": "Token asdasd21321"}
    response = requests.post(url, files=files, headers=headers)
    print(response.text)

PermissionError: [Errno 13] Permission denied: '/Users/test/PycharmProjects/PythonCrashCourse'
Why is permission denied, when I am an admin and have full permissions enabled on the file itself?
I have also tried modifying the code and running the version below:
import requests

url = "https://api.mindee.net/v1/products/mindee/expense_receipts/v3/predict"
imageFile = "IMG_5800.jpg"  # File is in the current directory
files = {"file": open(imageFile, "rb")}
headers = {"Authorization": "Token a4342343c925a"}
response = requests.post(url, files=files, headers=headers)
print(response.text)

# output
{"api_request":{"error":{"code":"BadRequest","details":{"document":["Missing data for required field."],"file":["Unknown field."]},"message":"Invalid fields in form"},"resources":[],"status":"failure","status_code":400,"url":"http://api.mindee.net/v1/products/mindee/expense_receipts/v3/predict"}}

Process finished with exit code 0
Status code 400 suggests something has gone wrong with the request syntax. Unfortunately, I am stuck and simply want the API to parse my receipt. Any ideas on what is going wrong, please?
Desired output:
get results from receipt in text format/json from Mindee API
References Used:
https://medium.com/mindeeapi/extract-receipt-data-with-mindees-api-using-python-7ee7303f4b6d tutorial on Mindee API
https://platform.mindee.com/products/mindee/expense_receipts?setup=default#documentation
From the error message, it was stated that the document field was missing.
I'm glad you found the solution to this.
However, following the documentation, there is improved code: the authentication header X-Inferuser-Token has been deprecated.
You can try doing this instead:
import requests

url = "https://api.mindee.net/v1/products/mindee/expense_receipts/v3/predict"

with open("./IMG_5800.jpg", "rb") as myfile:
    files = {"document": myfile}
    headers = {"Authorization": "Token my-api-key-here"}
    response = requests.post(url, files=files, headers=headers)
    print(response.text)
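
As a small follow-up sketch, you can fail fast instead of printing the raw text; the api_request/error keys here are taken from the error output shown above, so the shape of a successful response may differ:

    result = response.json()
    if not response.ok:
        # The error payload above nests the details under api_request -> error
        raise RuntimeError(result["api_request"]["error"])
    print(result)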
After brushing up on file path formats (https://www.codegrepper.com/code-examples/html/HTML+file+path), I realised the path I used was wrong and that I should have used the correct format whether I am on Windows or Mac.
To resolve my issue, I changed the path to go one directory up to where the image file is when running my code.
with open("./IMG_5800.jpg", "rb") as myfile: #modified here to go 1 directory up to where the image file is hosted
files = {"file": myfile}
headers = {"X-Inferuser-Token": "Token my-api-key-here"}
response = requests.post(url, files=files, headers=headers)

Downloading files by crawling sub-URLs in python

I am trying to download documents (mainly in pdf) from a large number of web links like the following:
https://projects.worldbank.org/en/projects-operations/document-detail/P167897?type=projects
https://projects.worldbank.org/en/projects-operations/document-detail/P173997?type=projects
https://projects.worldbank.org/en/projects-operations/document-detail/P166309?type=projects
However, the PDF files are not directly accessible from these links. One needs to click on sub-URLs to access the PDFs. Is there any way to crawl the sub-URLs and download all the related files from them? I am trying with the following code but have not had any success so far, specifically for the URLs listed here.
Please let me know if you need any further clarifications. I would be happy to do so. Thank you.
from simplified_scrapy import Spider, SimplifiedDoc, SimplifiedMain, utils

class MySpider(Spider):
    name = 'download_pdf'
    allowed_domains = ["www.worldbank.org"]
    start_urls = [
        "https://projects.worldbank.org/en/projects-operations/document-detail/P167897?type=projects",
        "https://projects.worldbank.org/en/projects-operations/document-detail/P173997?type=projects",
        "https://projects.worldbank.org/en/projects-operations/document-detail/P166309?type=projects"
    ]  # Entry page

    def afterResponse(self, response, url, error=None, extra=None):
        if not extra:
            print("The version of library simplified_scrapy is too old, please update.")
            SimplifiedMain.setRunFlag(False)
            return
        try:
            path = './pdfs'
            # create folder start
            srcUrl = extra.get('srcUrl')
            if srcUrl:
                index = srcUrl.find('year/')
                year = ''
                if index > 0:
                    year = srcUrl[index + 5:]
                    index = year.find('?')
                    if index > 0:
                        path = path + year[:index]
            utils.createDir(path)
            # create folder end

            path = path + url[url.rindex('/'):]
            index = path.find('?')
            if index > 0:
                path = path[:index]
            flag = utils.saveResponseAsFile(response, path, fileType="pdf")
            if flag:
                return None
            else:  # If it's not a pdf, leave it to the frame
                return Spider.afterResponse(self, response, url, error, extra)
        except Exception as err:
            print(err)

    def extract(self, url, html, models, modelNames):
        doc = SimplifiedDoc(html)
        lst = doc.selects('div.list >a').contains("documents/", attr="href")
        if not lst:
            lst = doc.selects('div.hidden-md hidden-lg >a')
        urls = []
        for a in lst:
            a["url"] = utils.absoluteUrl(url.url, a["href"])
            # Set root url start
            a["srcUrl"] = url.get('srcUrl')
            if not a['srcUrl']:
                a["srcUrl"] = url.url
            # Set root url end
            urls.append(a)
        return {"Urls": urls}

    # Download again by resetting the URL. Called when you want to download again.
    def resetUrl(self):
        Spider.clearUrl(self)
        Spider.resetUrlsTest(self)

SimplifiedMain.startThread(MySpider())  # Start download
There's an API endpoint that contains the entire response you see on the website along with... the URL to the document PDF. :D
So, you can query the API, get the URLs, and finally fetch the documents.
Here's how:
import requests

pids = ["P167897", "P173997", "P166309"]

for pid in pids:
    end_point = f"https://search.worldbank.org/api/v2/wds?" \
                f"format=json&includepublicdocs=1&" \
                f"fl=docna,lang,docty,repnb,docdt,doc_authr,available_in&" \
                f"os=0&rows=20&proid={pid}&apilang=en"
    documents = requests.get(end_point).json()["documents"]
    for document_data in documents.values():
        try:
            pdf_url = document_data["pdfurl"]
            print(f"Fetching: {pdf_url}")
            with open(pdf_url.rsplit("/")[-1], "wb") as pdf:
                pdf.write(requests.get(pdf_url).content)
        except KeyError:
            continue
Output: (fully downloaded .pdf files)
Fetching: http://documents.worldbank.org/curated/en/106981614570591392/pdf/Official-Documents-Grant-Agreement-for-Additional-Financing-Grant-TF0B4694.pdf
Fetching: http://documents.worldbank.org/curated/en/331341614570579132/pdf/Official-Documents-First-Restatement-to-the-Disbursement-Letter-for-Grant-D6810-SL-and-for-Additional-Financing-Grant-TF0B4694.pdf
Fetching: http://documents.worldbank.org/curated/en/387211614570564353/pdf/Official-Documents-Amendment-to-the-Financing-Agreement-for-Grant-D6810-SL.pdf
Fetching: http://documents.worldbank.org/curated/en/799541612993594209/pdf/Sierra-Leone-AFRICA-WEST-P167897-Sierra-Leone-Free-Education-Project-Procurement-Plan.pdf
Fetching: http://documents.worldbank.org/curated/en/310641612199201329/pdf/Disclosable-Version-of-the-ISR-Sierra-Leone-Free-Education-Project-P167897-Sequence-No-02.pdf
and more ...
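
If a project has more than 20 documents, the endpoint above appears to page through results with the os and rows parameters; here is a rough sketch under the assumption that os is a zero-based offset and rows the page size:

import requests

def fetch_all_documents(pid, page_size=20, max_pages=50):
    """Collect document records for one project across result pages.

    A rough sketch: it assumes `os` is a zero-based offset and `rows` the
    page size, as the endpoint above suggests; `max_pages` is a safety cap.
    """
    records = {}
    for page in range(max_pages):
        end_point = (
            "https://search.worldbank.org/api/v2/wds?"
            f"format=json&includepublicdocs=1&"
            f"os={page * page_size}&rows={page_size}&proid={pid}&apilang=en"
        )
        documents = requests.get(end_point).json().get("documents", {})
        new_keys = set(documents) - set(records)
        if not new_keys:
            break  # no new records, so the last page has been reached
        records.update({key: documents[key] for key in new_keys})
    return records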

Unable to get the response in POST method in Python

I am facing a unique problem.
The following is my code.
import requests
import json

url = 'ABCD.com'
cookies = {'cookies': 'xyz'}

r = requests.post(url, cookies=cookies)
print(r.status_code)
json_data = json.loads(r.text)
print("Printing = ", json_data)
When I use the URL and cookie in the Postman tool with a POST request, I get a JSON response. But when I use the above code with the POST request method in Python, I get:
404
Printing = {'title': 'init', 'description': "Error: couldn't find a device with id: xxxxxxxxx in ABCD: d1"}
But when I use the following code, i.e. with the GET request method:
url = 'ABCD.com'
cookies = {'cookies': 'xyz'}

r = requests.get(url, cookies=cookies)
print(r.status_code)
json_data = json.loads(r.text)
print("Printing = ", json_data)
I get
200
Printing = {'apiVersion': '0.4.0'}
I am not sure why the POST method works and returns a JSON response in the Postman tool, but does not work when I try it in Python. I use the latest Python, 3.6.4.
I finally found what was wrong; the following is the correct way:
url = 'ABCD.com'
cookies = {'cookies': 'xyz'}

# Send the cookie explicitly in the Cookie header (header values must be strings)
cookie_header = '; '.join(f'{name}={value}' for name, value in cookies.items())
r = requests.post(url, headers={'Cookie': cookie_header})
print(r.status_code)
json_data = json.loads(r.text)
print("Printing = ", json_data)
The web page was expecting the cookie in the headers, and with this I got the response correctly.

Allow user to download ZIP from Django view

My main task is to have the user press a Download button and download file "A.zip" from the query directory.
The reason I have an elif request.POST..... is that I have another condition checking whether the "Execute" button was pressed. This Execute button runs a script. Both POST actions work, and dir_file is C:\Data\Folder.
I followed and read many tutorials and responses about how to download a file from Django, and I cannot figure out why my simple code does not download a file.
What am I missing? The code does not return any errors. Does anybody have any documentation that can explain what I am doing wrong?
I am expecting an automatic download of the file, but it does not occur.
elif request.POST['action'] == 'Download':
    query = request.POST['q']
    dir_file = query + "A.zip"
    zip_file = open(dir_file, 'rb')
    response = HttpResponse(zip_file, content_type='application/zip')
    response['Content-Disposition'] = 'attachment; filename=%s' % 'foo_zip'
    zip_file.close()
I found the answer.
After reading through a lot of documentation about this, I realised I had left out the most important aspect of this feature, which is the URL.
Basically, the function download_zip is called by the POST and runs the view where the zip is downloaded.
Here is what I ended up doing:
elif request.POST['action'] == 'Download':
    return HttpResponseRedirect('/App/download')
Created a view:
def download_zip(request):
    zip_path = root + "A.zip"
    zip_file = open(zip_path, 'rb')
    response = HttpResponse(zip_file, content_type='application/zip')
    response['Content-Disposition'] = 'attachment; filename=%s' % 'A.zip'
    response['Content-Length'] = os.path.getsize(zip_path)
    zip_file.close()
    return response
Finally in urls.py:
url(r'^download/$', views.download_zip, name='download_zip'),
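
As a side note, a minimal sketch of the same view using Django's FileResponse (available with these arguments on Django 2.1+), which sets the Content-Disposition and Content-Length headers for you; zip_path is assumed to point at A.zip as above:

from django.http import FileResponse

def download_zip(request):
    zip_path = root + "A.zip"  # assumed location of the archive, as above
    # FileResponse streams the file and closes it once the response is finished
    return FileResponse(open(zip_path, 'rb'), as_attachment=True, filename='A.zip')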
