File partially downloaded with urllib.request.urlretrieve - python-3.x

I have this code, which tries to retrieve a file from a GitHub repository.
import os
import tarfile
from six.moves import urllib
import urllib.request

DOWNLOAD_ROOT = "https://github.com/ageron/handson-ml/tree/master/"
HOUSING_PATH = os.path.join("datasets", "housing").replace("\\","/")
print(HOUSING_PATH)
HOUSING_URL = DOWNLOAD_ROOT + HOUSING_PATH
print(HOUSING_URL)
print(os.getcwd())

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    if not os.path.isdir(housing_path):
        os.makedirs(housing_path)
    tgz_path = os.path.join(housing_path, "housing.tgz").replace("\\","/")
    print(tgz_path)
    urllib.request.urlretrieve(housing_url, tgz_path)
    housing_tgz = tarfile.open(tgz_path)
    housing_tgz.extractall(path=housing_path)
    housing_tgz.close()

fetch_housing_data()
After executing the code I got this error: ReadError: file could not be opened successfully. I compared the actual file size with the size of the file downloaded by this code and found that the file is only partially downloaded.
So is there any way to download the whole file? Thanks in advance.

I finally found the problem. It was the link I was using to retrieve the file. I didn't know that for GitHub repositories the RAW link must be used, together with the file name (leaving out the file name gives a 404 error). The tree URL returns GitHub's HTML page for the directory rather than the archive itself, which is why the saved file was smaller than expected and tarfile could not open it.
So a small modification is needed in the code posted in my question.
Change the link from
DOWNLOAD_ROOT = "https://github.com/ageron/handson-ml/tree/master/"
to this:
DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml/master/"
And change this
HOUSING_URL = DOWNLOAD_ROOT + HOUSING_PATH
to
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"  # the actual file name is needed
Thank you!
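A minimal sketch of how the fix could be made more robust: build the raw URL from the root and the file path, then check that whatever was saved really is a tar archive before extracting it. The helper names raw_file_url and looks_like_tarball are made up for this illustration; only the URL and the file path come from the posts above.

```python
import os
import tarfile

# Raw root taken from the answer above.
RAW_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml/master/"

def raw_file_url(relative_path, root=RAW_ROOT):
    """Join the raw root and a repository-relative file path (hypothetical helper)."""
    return root.rstrip("/") + "/" + relative_path.lstrip("/")

def looks_like_tarball(path):
    """Return True only if the file exists and tarfile can read it (hypothetical helper)."""
    return os.path.isfile(path) and tarfile.is_tarfile(path)

# Usage sketch; the network call is left commented out on purpose:
# import urllib.request
# url = raw_file_url("datasets/housing/housing.tgz")
# urllib.request.urlretrieve(url, "housing.tgz")
# if not looks_like_tarball("housing.tgz"):
#     raise RuntimeError("Downloaded file is not a tar archive; check the URL")
```

A check like this turns the confusing ReadError into a message that points at the URL, which is where the mistake was in this case.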

Related

I want to download files with Python using wget (FTP), but an error occurred. Please help me download them

I want to download the "*_ice.nc" files from an FTP server, so I wrote the following.
Libraries:
import wget
import math
import re
from urllib import request
Address and file list:
url = "ftp://ftp.hycom.org/datasets/GLBy0.08/expt_93.0/data/hindcasts/2021/" #url
html = request.urlopen(url) #open url
html_contents = str(html.read().decode("cp949"))
url_list = re.findall(r"(ftp)(.+)(_ice.nc)", html_contents)
Download loop:
for url in url_list: #loop
    url_full="".join(url) #tuple to string
    file_name=url_full.split("/")[-1]
    print('\nDownloading ' + file_name)
    wget.download(url_full) #down with wget
But an error message like this occurred:
(ValueError: unknown url type: 'ftp%20%20%20%20%20%20ftp%20%20%20%20%20%20382663848%20Jan%2002%20%202021%20hycom_GLBy0.08_930_2021010112_t000_ice.nc')
Could I get some help?
After decoding,
ftp%20%20%20%20%20%20ftp%20%20%20%20%20%20382663848%20Jan%2002%20%202021%20hycom_GLBy0.08_930_2021010112_t000_ice.nc
is
ftp ftp 382663848 Jan 02 2021 hycom_GLBy0.08_930_2021010112_t000_ice.nc
which is clearly not a legal FTP address: the regular expression matched the owner and group columns of the directory listing instead of a URL. You need to alter your code so that it produces
ftp://ftp.hycom.org/datasets/GLBy0.08/expt_93.0/data/hindcasts/2021/hycom_GLBy0.08_930_2021010112_t000_ice.nc
I suggest temporarily replacing wget.download(url_full) with print(url_full), changing the code until it prints the desired output, and then reverting to wget.download(url_full).
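One way to get there, as a sketch: pick only the file name out of each listing line and prepend the directory URL. The function name ice_file_urls is made up for this illustration, and the exact layout of the listing text is an assumption based on the decoded error message above.

```python
import re

# Directory URL taken from the question.
BASE_URL = "ftp://ftp.hycom.org/datasets/GLBy0.08/expt_93.0/data/hindcasts/2021/"

def ice_file_urls(listing_text, base_url=BASE_URL):
    """Return full URLs for every *_ice.nc file name found in an FTP listing.

    Hypothetical helper: it assumes each file name is a whitespace-free
    token, as in the listing line shown in the error message.
    """
    # \S+ grabs only the file-name token, not the owner/size/date columns.
    names = re.findall(r"(\S+_ice\.nc)(?!\S)", listing_text)
    return [base_url + name for name in names]

# Usage sketch: print first, download once the output looks right.
# for url_full in ice_file_urls(html_contents):
#     print(url_full)
#     # wget.download(url_full)
```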

Cannot import name 'Instalysis' from 'instagramy'

I cannot figure out what this error requires. Any ideas for a Python newbie? All prerequisites are installed; this is Python 3.9, 64-bit.
Details: "ADO.NET: Python script error.
ImportError: cannot import name 'Instalysis' from 'instagramy' (C:\Python\Python39\lib\site-packages\instagramy\__init__.py)
"
Here's the test script I'm running:
from instagramy import Instalysis
# Instagram user_id of ipl teams
teams = ["chennaiipl", "mumbaiindians",
         "royalchallengersbangalore", "kkriders",
         "delhicapitals", "sunrisershyd",
         "kxipofficial"]
data = Instalysis(teams)
# return the dataframe
data_frame = data.analyis()
data_frame
The installed instagramy package doesn't export anything named 'Instalysis'. It does export this:
from instagramy import InstagramUser
This will give you all of the info. Also, I saw that you had parentheses at data_frame = data.analyis()
You will receive an error if you keep them, as it isn't a callable function.
Have a look at the instagramy page on PyPI.
Hope this helps!
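When an ImportError like this appears, it can help to list what the installed version of a package actually exports before trusting a tutorial. The sketch below uses only the standard library; public_names and has_name are made-up helper names, and the usage lines assume instagramy is installed.

```python
import importlib

def public_names(module_name):
    """List the public attribute names of an installed module (hypothetical helper)."""
    module = importlib.import_module(module_name)
    return sorted(name for name in dir(module) if not name.startswith("_"))

def has_name(module_name, attr):
    """Return True if the module exposes the given attribute (hypothetical helper)."""
    return hasattr(importlib.import_module(module_name), attr)

# Usage sketch (requires the package to be installed):
# print(public_names("instagramy"))
# print(has_name("instagramy", "Instalysis"))
```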

Google cloud function with wand stopped working

I have set up 3 Google Cloud Storage buckets and 3 functions (one for each bucket) that trigger when a PDF file is uploaded to a bucket. The functions convert the PDF to a PNG image and do further processing.
When I try to create a 4th bucket and a similar function, strangely it does not work. Even if I copy one of the existing 3 functions, it still does not work and I get this error:
Traceback (most recent call last):
  File "/env/local/lib/python3.7/site-packages/google/cloud/functions_v1beta2/worker.py", line 333, in run_background_function
    _function_handler.invoke_user_function(event_object)
  File "/env/local/lib/python3.7/site-packages/google/cloud/functions_v1beta2/worker.py", line 199, in invoke_user_function
    return call_user_function(request_or_event)
  File "/env/local/lib/python3.7/site-packages/google/cloud/functions_v1beta2/worker.py", line 196, in call_user_function
    event_context.Context(**request_or_event.context))
  File "/user_code/main.py", line 27, in pdf_to_img
    with Image(filename=tmp_pdf, resolution=300) as image:
  File "/env/local/lib/python3.7/site-packages/wand/image.py", line 2874, in __init__
    self.read(filename=filename, resolution=resolution)
  File "/env/local/lib/python3.7/site-packages/wand/image.py", line 2952, in read
    self.raise_exception()
  File "/env/local/lib/python3.7/site-packages/wand/resource.py", line 222, in raise_exception
    raise e
wand.exceptions.PolicyError: not authorized `/tmp/tmphm3hiezy' @ error/constitute.c/ReadImage/412
It baffles me why the same functions work on the existing buckets but not on the new one.
UPDATE:
Even this does not work (I get a "cache resources exhausted" error):
In requirements.txt:
google-cloud-storage
wand
In main.py:
import tempfile
from google.cloud import storage
from wand.image import Image

storage_client = storage.Client()

def pdf_to_img(data, context):
    file_data = data
    pdf = file_data['name']
    if pdf.startswith('v-'):
        return
    bucket_name = file_data['bucket']
    blob = storage_client.bucket(bucket_name).get_blob(pdf)
    _, tmp_pdf = tempfile.mkstemp()
    _, tmp_png = tempfile.mkstemp()
    tmp_png = tmp_png+".png"
    blob.download_to_filename(tmp_pdf)
    with Image(filename=tmp_pdf) as image:
        image.save(filename=tmp_png)
    print("Image created")
    new_file_name = "v-"+pdf.split('.')[0]+".png"
    blob.bucket.blob(new_file_name).upload_from_filename(tmp_png)
The code above is supposed to just create an image copy of the file that is uploaded to the bucket.
Because the vulnerability has been fixed in Ghostscript but not updated in ImageMagick, the workaround for converting PDFs to images in Google Cloud Functions is to use this ghostscript wrapper and directly request the PDF conversion to png from Ghostscript (bypassing ImageMagick).
requirements.txt
google-cloud-storage
ghostscript==0.6
main.py
import locale
import tempfile
import ghostscript
from google.cloud import storage

storage_client = storage.Client()

def pdf_to_img(data, context):
    file_data = data
    pdf = file_data['name']
    if pdf.startswith('v-'):
        return
    bucket_name = file_data['bucket']
    blob = storage_client.bucket(bucket_name).get_blob(pdf)
    _, tmp_pdf = tempfile.mkstemp()
    _, tmp_png = tempfile.mkstemp()
    tmp_png = tmp_png+".png"
    blob.download_to_filename(tmp_pdf)
    # create a temp folder based on temp_local_filename
    # use ghostscript to export the pdf into pages as pngs in the temp dir
    args = [
        "pdf2png", # actual value doesn't matter
        "-dSAFER",
        "-sDEVICE=pngalpha",
        "-o", tmp_png,
        "-r300", tmp_pdf
    ]
    # the above arguments have to be bytes, encode them
    encoding = locale.getpreferredencoding()
    args = [a.encode(encoding) for a in args]
    # run the request through ghostscript
    ghostscript.Ghostscript(*args)
    print("Image created")
    new_file_name = "v-"+pdf.split('.')[0]+".png"
    blob.bucket.blob(new_file_name).upload_from_filename(tmp_png)
Anyway, this gets you around the issue and keeps all the processing in GCF. Hope it helps. Your code works for single-page PDFs, though. My use case was multipage PDF conversion; the Ghostscript code and solution are in this question.
This actually seems to be a show stopper for any ImageMagick functionality that uses the PDF format. Similar code that we deployed on Google App Engine via a custom Docker image fails with the same "not authorized" error.
I am not sure how to edit the policy.xml file on GAE or GCF, but a line there has to be changed to:
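For the multipage case mentioned here, a sketch of how the argument list might change: Ghostscript writes one file per page when the output file name contains a %d-style placeholder. The function name build_gs_args and the page-%03d.png naming are made up for this illustration; the flags are the same ones used in the answer above.

```python
import os

def build_gs_args(pdf_path, out_dir, dpi=300):
    """Build a Ghostscript argument list that writes one PNG per page.

    Hypothetical helper: the output pattern "page-%03d.png" is an
    assumption; Ghostscript substitutes the page number for %03d.
    """
    out_pattern = os.path.join(out_dir, "page-%03d.png")
    return [
        "pdf2png",            # program name placeholder, value doesn't matter
        "-dSAFER",
        "-sDEVICE=pngalpha",
        "-o", out_pattern,    # one output file per page
        "-r{}".format(dpi),
        pdf_path,
    ]

# Usage sketch with the ghostscript wrapper from the answer above:
# import locale, ghostscript
# args = [a.encode(locale.getpreferredencoding())
#         for a in build_gs_args(tmp_pdf, tmp_dir)]
# ghostscript.Ghostscript(*args)
```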
<policy domain="coder" rights="read|write" pattern="PDF" />
@Dustin: Do you have a bug link where we can see the progress?
Update:
I fixed it on my Google app engine container by adding a line in docker image. This directly changes the policy.xml file content after imagemagick gets installed.
RUN sed -i 's/rights="none"/rights="read|write"/g' /etc/ImageMagick-6/policy.xml
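The sed command above flips every rights="none" entry in policy.xml, not only the PDF one. A narrower variant can be sketched in Python as a string transformation; allow_pdf_coder is a made-up name, and the attribute order (domain, rights, pattern) is an assumption based on the policy line quoted above.

```python
import re

# Matches only the PDF coder entry, assuming the attribute order
# domain / rights / pattern used in the line quoted above.
_PDF_POLICY = re.compile(
    r'(<policy\s+domain="coder"\s+rights=")none("\s+pattern="PDF"\s*/>)'
)

def allow_pdf_coder(policy_xml):
    """Return policy_xml with the PDF coder rights set to read|write.

    Hypothetical helper: other rights="none" entries are left untouched.
    """
    return _PDF_POLICY.sub(r'\g<1>read|write\g<2>', policy_xml)

# Usage sketch (path taken from the sed command above):
# path = "/etc/ImageMagick-6/policy.xml"
# with open(path) as f:
#     text = f.read()
# with open(path, "w") as f:
#     f.write(allow_pdf_coder(text))
```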
This is an upstream bug in Ubuntu, we are working on a workaround for App Engine and Cloud Functions.
While we wait for the issue to be resolved in Ubuntu, I followed @DustinIngram's suggestion and created a virtual machine in Compute Engine with an ImageMagick installation. The downside is that I now have a second API that my API in App Engine has to call, just to generate the images. Having said that, it's working fine for me. This is my setup:
Main API:
When a pdf file is uploaded to Cloud Storage, I call the following:
response = requests.post('http://xx.xxx.xxx.xxx:5000/makeimages', data=data)
Where data is a JSON string with the format {"file_name": file_name}
On the API that is running on the VM, the POST request gets processed as follows:
@app.route('/makeimages', methods=['POST'])
def pdf_to_jpg():
    file_name = request.form['file_name']
    blob = storage_client.bucket(bucket_name).get_blob(file_name)
    _, temp_local_filename = tempfile.mkstemp()
    temp_local_filename_jpeg = temp_local_filename + '.jpg'
    # Download file from bucket.
    blob.download_to_filename(temp_local_filename)
    print('Image ' + file_name + ' was downloaded to ' + temp_local_filename)
    with Image(filename=temp_local_filename, resolution=300) as img:
        pg_num = 0
        image_files = {}
        image_files['pages'] = []
        # Convert each page to JPEG and upload it next to the source PDF.
        for img_page in img.sequence:
            img_page_2 = Image(image=img_page)
            img_page_2.format = 'jpeg'
            img_page_2.compression_quality = 70
            img_page_2.save(filename=temp_local_filename_jpeg)
            new_file_name = file_name.replace('.pdf', 'p') + str(pg_num) + '.jpg'
            new_blob = blob.bucket.blob(new_file_name)
            new_blob.upload_from_filename(temp_local_filename_jpeg)
            print('Page ' + str(pg_num) + ' was saved as ' + new_file_name)
            image_files['pages'].append({'page': pg_num, 'file_name': new_file_name})
            pg_num += 1
    try:
        os.remove(temp_local_filename)
    except (ValueError, PermissionError):
        print('Could not delete the temp file!')
    return jsonify(image_files)
This will download the pdf from Cloud Storage, create an image for each page, and save them back to cloud storage. The API will then return a JSON file with the list of image files created.
So, not the most elegant solution, but at least I don't need to convert the files manually.

Mitmproxy how to launch from script, and save dumps to file

I am trying to figure out a way to launch mitmproxy from a Python script (which I have done) and save any traffic to a dump file (which I need help with).
From googling, looking at mitmproxy's GitHub issues and reading example code, this is what I have so far:
from mitmproxy import proxy, options
from mitmproxy.tools.dump import DumpMaster
from mitmproxy.addons import core

class AddHeader:
    def __init__(self):
        self.num = 0

    def response(self, flow):
        self.num = self.num + 1
        print(self.num)
        flow.response.headers["count"] = str(self.num)

addons = [
    AddHeader()
]
opts = options.Options(listen_host='127.0.0.1', listen_port=8080)
pconf = proxy.config.ProxyConfig(opts)
m = DumpMaster(None)
m.server = proxy.server.ProxyServer(pconf)
# print(m.addons)
m.addons.add(addons)
print(m.addons)
# m.addons.add(core.Core())
try:
    m.run()
except KeyboardInterrupt:
    m.shutdown()
The issue is that this raises AttributeError: No such option: body_size_limit, which seems to be mitigated with master.addons.add(core.Core), but this core addon already exists in DumpMaster, so that triggers a different error.
Inspecting the addons currently loaded by DumpMaster, I do see that the save-to-file addon is loaded, but I am not clear how to access it so that any traffic going through the proxy, whether request, response, WebSocket or TCP, is written to a dump file.
Thanks!
Here is a redacted list of the addons that are loaded:
<mitmproxy.addons.streambodies.StreamBodies object at 0x111542da0>
<mitmproxy.addons.save.Save object at 0x111542dd8>
<mitmproxy.addons.upstream_auth.UpstreamAuth object at 0x111542e10>
Just add these 2 lines after opts = options.Options(listen_host='127.0.0.1', listen_port=8080):
opts.add_option("body_size_limit", int, 0, "")
opts.add_option("keep_host_header", bool, True, "")
Your code snippet already runs a working proxy. However, the option for dumping the recorded traffic into a file at runtime (save_stream_file) belongs to the Save addon, which is loaded by default when the DumpMaster instance is created. Therefore, you need to set the save_stream_file option after creating the DumpMaster instance. It took me a while to figure out as well, but this worked for me, saving the output stream to a file named traffic_stream:
from mitmproxy import proxy, options
from mitmproxy.tools.dump import DumpMaster

opts = options.Options(listen_port=8081)
opts.add_option("body_size_limit", int, 0, "")
pconf = proxy.config.ProxyConfig(opts)
m = DumpMaster(None)
m.server = proxy.server.ProxyServer(pconf)
m.options.set('save_stream_file=traffic_stream')
try:
    m.run()
except KeyboardInterrupt:
    m.shutdown()
Hope it works for you as well!

How to get the default application mapped to a file extension in Windows using Python

I would like to query Windows using a file extension as a parameter (e.g. ".jpg") and be returned the path of whatever app windows has configured as the default application for this file type.
Ideally the solution would look something like this:
from stackoverflow import get_default_windows_app
default_app = get_default_windows_app(".jpg")
print(default_app)
"c:\path\to\default\application\application.exe"
I have been investigating the winreg built-in library, which gives access to the Windows registry, but I'm having trouble understanding the registry's structure and the documentation is quite complex.
I'm running Windows 10 and Python 3.6.
Does anyone have any ideas to help?
The registry isn't a simple well-structured database. The Windows shell executor has some pretty complex logic to it. But for the simple cases, this should do the trick:
import shlex
import winreg

def get_default_windows_app(suffix):
    class_root = winreg.QueryValue(winreg.HKEY_CLASSES_ROOT, suffix)
    with winreg.OpenKey(winreg.HKEY_CLASSES_ROOT, r'{}\shell\open\command'.format(class_root)) as key:
        command = winreg.QueryValueEx(key, '')[0]
        return shlex.split(command)[0]

>>> get_default_windows_app('.pptx')
'C:\\Program Files\\Microsoft Office 15\\Root\\Office15\\POWERPNT.EXE'
Some error handling should definitely be added, though.
I added some improvements to the nice code by Hetzroni in order to handle more cases:
import os
import shlex
import winreg

def get_default_windows_app(ext):
    try:  # Look up the user's choice (UserChoice\ProgId) first
        with winreg.OpenKey(winreg.HKEY_CURRENT_USER, r'SOFTWARE\Microsoft\Windows\CurrentVersion\Explorer\FileExts\{}\UserChoice'.format(ext)) as key:
            progid = winreg.QueryValueEx(key, 'ProgId')[0]
        with winreg.OpenKey(winreg.HKEY_CURRENT_USER, r'SOFTWARE\Classes\{}\shell\open\command'.format(progid)) as key:
            path = winreg.QueryValueEx(key, '')[0]
    except OSError:  # UserChoice\ProgId not found
        try:
            class_root = winreg.QueryValue(winreg.HKEY_CLASSES_ROOT, ext)
            if not class_root:  # No reference from ext
                class_root = ext  # Try direct lookup from ext
            with winreg.OpenKey(winreg.HKEY_CLASSES_ROOT, r'{}\shell\open\command'.format(class_root)) as key:
                path = winreg.QueryValueEx(key, '')[0]
        except OSError:  # Ext not found
            path = None
    # Path clean up, if any
    if path:  # Path found
        path = os.path.expandvars(path)  # Expand env vars, e.g. %SystemRoot% for ext .txt
        path = shlex.split(path, posix=False)[0]  # posix False for Windows operation
        path = path.strip('"')  # Strip quotes
    # Return
    return path
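The path clean-up step can be tried without touching the registry. The sketch below isolates it in a function that takes a command string of the kind stored under shell\open\command; the name executable_from_command and the sample command strings are made up for this illustration.

```python
import os
import shlex

def executable_from_command(command):
    """Extract the executable path from a registry open-command string.

    Hypothetical helper mirroring the clean-up step above.
    """
    if not command:
        return None
    # On Windows this expands %SystemRoot% and similar variables.
    expanded = os.path.expandvars(command)
    # posix=False keeps backslashes intact and leaves quotes on the token.
    first = shlex.split(expanded, posix=False)[0]
    return first.strip('"')

# Usage sketch with made-up command strings:
# executable_from_command('"C:\\Program Files\\App\\app.exe" "%1"')
# executable_from_command('C:\\Windows\\system32\\NOTEPAD.EXE %1')
```

Keeping this part separate makes it possible to test the quoting and backslash handling on any platform, while the winreg lookups stay Windows-only.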

Resources