File downloaded larger than original

I'm working on a small Python 3 server and want to download an SQLite database from it. The downloaded file, however, is larger than the original: the original is 108K, while the download is 247K. I've tried this many times, always with the same result. I also compared the SHA-256 checksums of the two files, and they differ.
Here is my downloader.py file:
import cgi
import os
print('Content-Type: application/octet-stream')
print('Content-Disposition: attachment; filename="Library.db"\n')
db = os.path.realpath('..') + '/Library.db'
with open(db,'rb') as file:
    print(file.read())
Thanks in advance!
EDIT:
I also tried running the script directly and redirecting its output:
$ ./downloader > file
The resulting file is also 247K.

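Well, I've finally found the solution. The problem (which I didn't see at first) was that the server sent plain text to the client: print(file.read()) writes the text representation of the bytes object (b'...' with escape sequences such as \x00) rather than the raw bytes, which is why the output was larger. Here is one way to send binary data: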
import cgi
import os
import shutil
import sys
print('Content-Type: application/octet-stream; file="Library.db"')
print('Content-Disposition: attachment; filename="Library.db"\n')
# Flush the text-mode headers before writing raw bytes to the underlying buffer
sys.stdout.flush()
db = os.path.realpath('..') + '/Library.db'
with open(db,'rb') as file:
    # Copy the file to stdout as bytes, without any text conversion
    shutil.copyfileobj(file, sys.stdout.buffer)
But if someone has a better syntax, I would be glad to see it! Thank you!
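To illustrate where the extra bytes come from, here is a small sketch that compares the text form print() produces for a bytes object with a raw binary copy. The sample data and the in-memory buffers are stand-ins chosen for the example; they are not the Library.db file from the question.

```python
import io
import shutil

# Stand-in for the binary file contents (assumption: any data containing
# non-printable byte values shows the same effect).
data = bytes(range(256)) * 4

# print(data) writes the repr of the bytes object, e.g. b'\x00\x01...',
# so each non-printable byte becomes a multi-character escape sequence.
as_text = repr(data).encode("ascii")

# copyfileobj moves the raw bytes unchanged.
raw_copy = io.BytesIO()
shutil.copyfileobj(io.BytesIO(data), raw_copy)

print(len(data), len(as_text), len(raw_copy.getvalue()))
```

The text form is noticeably longer than the data, while the binary copy has exactly the original length.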

Related

Download file from website directly into Linux directory - Python

If I click the button manually, the browser starts downloading a 2GB CSV file onto my computer, but I want to automate this.
This is the download link:
https://data.cityofnewyork.us/api/views/bnx9-e6tj/rows.csv?accessType=DOWNLOAD
The issue: when I use either the requests or the pandas library, it just hangs, and I have no idea whether the file is being downloaded or not.
My goal is to:
know whether the file is being downloaded, and
have the CSV downloaded to a specified directory, i.e.
~/mydirectory
Can someone provide the code to do this?

Try this:
import requests
URL = "https://data.cityofnewyork.us/api/views/bnx9-e6tj/rows.csv?accessType=DOWNLOAD"
# Note: response.content holds the whole response body in memory
response = requests.get(URL)
# Write inside a with block so the file is flushed and closed
with open("/mydirectory/downloaded_file.csv", "wb") as f:
    f.write(response.content)
print('Download Complete')
Or you could do it this way and have a progress bar:
import wget
wget.download('https://data.cityofnewyork.us/api/views/bnx9-e6tj/rows.csv?accessType=DOWNLOAD')
The output will look like this:
11% [........ ] 73728 / 633847
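For a file of this size, another option is to stream the response to disk in chunks and report progress along the way. The following is a minimal sketch using only the standard library; the helper names copy_with_progress and download are hypothetical, and the 1 MiB default chunk size is an arbitrary choice.

```python
import io
import urllib.request


def copy_with_progress(src, dst, total=None, chunk_size=1024 * 1024, report=print):
    """Copy src to dst in chunks, reporting the running byte count."""
    done = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk)
        done += len(chunk)
        if total:
            report(f"{done} / {total} bytes ({100 * done // total}%)")
        else:
            report(f"{done} bytes")
    return done


def download(url, dest_path):
    """Stream url to dest_path without holding the whole body in memory."""
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out:
        # The server may not send Content-Length; then only a byte count is shown.
        length = response.headers.get("Content-Length")
        return copy_with_progress(response, out, int(length) if length else None)


# Small in-memory demonstration (no network access needed):
source = io.BytesIO(b"x" * 2500)
target = io.BytesIO()
copied = copy_with_progress(source, target, total=2500, chunk_size=1000)
```

To save into ~/mydirectory, the destination would need to be expanded first, for example with os.path.expanduser("~/mydirectory/rows.csv"), because open() does not expand the tilde itself.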

Matplotlib created a temporary config/cache directory

Matplotlib created a temporary config/cache directory at /var/www/.config/matplotlib because the default path (/tmp/matplotlib-b33qbx_v) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
This is the message I'm getting in the error.log file, along with a 504 Gateway Timeout error in the browser.
Can someone please help me resolve this issue?

Please check:
https://github.com/pyinstaller/pyinstaller/issues/617
I run matplotlib from the webserver and use:
os.environ['MPLCONFIGDIR'] = '/opt/myapplication/.config/matplotlib'
This directory should be writable by the web server user (e.g. www-data).

Setting the variable before importing matplotlib works for me:
import os
os.environ['MPLCONFIGDIR'] = os.getcwd() + "/configs/"
import matplotlib
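Both answers come down to the same two requirements: the directory must be writable by the process, and the variable must be set before matplotlib is imported. A small sketch of that setup follows; the helper name use_mpl_config_dir is hypothetical, and the temporary directory is only a stand-in for the application's real config directory.

```python
import os
import tempfile


def use_mpl_config_dir(path):
    """Create path if needed, check it is writable, and export MPLCONFIGDIR."""
    os.makedirs(path, exist_ok=True)
    if not os.access(path, os.W_OK):
        raise PermissionError(f"{path} is not writable by this process")
    os.environ["MPLCONFIGDIR"] = path
    return path


# Assumption for the demo: a fresh temporary directory stands in for the
# application's real config directory.
config_dir = use_mpl_config_dir(os.path.join(tempfile.mkdtemp(), "matplotlib"))

# Only after this point should the application run `import matplotlib`.
```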

open large gzip file (~1gb) in python

I am a beginner in Python. I have written a few lines of code to open a large gzip file (~1 GB) and extract some content, but I am getting a memory-related error. Could somebody please guide me on how to open the gzip file with limited memory? Below is the part of the code that throws the error.
import os
import gzip
with gzip.open("test.gz","rb") as peak:
    for line in peak:
        file_content = line.read().decode("utf-8")
        print(file_content)
Error: File "/software/anaconda3/lib/python3.7/gzip.py", line 276, in read
return self._buffer.read(size)

I tried to recreate your issue but was unable to. Using fallocate I created a big file, then gzipped it, and hit no error in Python:
$ fallocate -l 2G tempfile.img
$ gzip tempfile.img
$ ipython
>>> import gzip
>>> with gzip.open('tempfile.img.gz', 'rb') as fIn:
...     content = fIn.read()
If you hit an exception, it should have some name like OSError or something more specific. My guess is that you have a 32-bit installation of Python which would impose memory limits in the gigabyte range. This SO thread covers a way to check if you're running 32- or 64-bit.
If you post the name of the exception or a reproducible example, then I can update this answer.
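If the goal is to process the file without loading all of it, one option is to iterate over the gzip file object line by line. Note that each line produced by iterating over a file opened in "rb" mode is already a bytes object, which has no read() method. The sketch below opens the file in text mode instead; the sample archive it builds is an assumption made so that the example is self-contained.

```python
import gzip
import os
import tempfile

# Build a small sample archive so the sketch is self-contained
# (assumption: the real input is a line-oriented text file like test.gz).
path = os.path.join(tempfile.mkdtemp(), "sample.gz")
with gzip.open(path, "wt", encoding="utf-8") as out:
    for i in range(1000):
        out.write(f"record {i}\n")

# "rt" decodes to str while reading; iterating yields one line at a time,
# so the whole decompressed file is never held in memory at once.
line_count = 0
with gzip.open(path, "rt", encoding="utf-8") as peak:
    for line in peak:
        line_count += 1  # process the line here instead of collecting it

print(line_count)
```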

Creating a Spark RDD from a file located in Google Drive using Python on Colab.Research.Google

I have been successful in running a Python 3 / Spark 2.2.1 program on Google's Colab.Research platform:
!apt-get update
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://apache.osuosl.org/spark/spark-2.2.1/spark-2.2.1-bin-hadoop2.7.tgz
!tar xf spark-2.2.1-bin-hadoop2.7.tgz
!pip install -q findspark
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.2.1-bin-hadoop2.7"
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
This works perfectly when I upload text files from my local computer to the Unix VM using
from google.colab import files
datafile = files.upload()
and read them as follows :
textRDD = spark.read.text('hobbit.txt').rdd
So far so good.
My problem starts when I try to read a file that is stored in my Google Drive Colab directory.
Following the instructions, I authenticated the user and created a Drive service
from google.colab import auth
auth.authenticate_user()
from googleapiclient.discovery import build
drive_service = build('drive', 'v3')
after which I have been able to access the file lying in the drive as follows :
file_id = '1RELUMtExjMTSfoWF765Hr8JwNCSL7AgH'
import io
from googleapiclient.http import MediaIoBaseDownload
request = drive_service.files().get_media(fileId=file_id)
downloaded = io.BytesIO()
downloader = MediaIoBaseDownload(downloaded, request)
done = False
while done is False:
    # _ is a placeholder for a progress object that we ignore.
    # (Our file is small, so we skip reporting progress.)
    _, done = downloader.next_chunk()
downloaded.seek(0)
print('Downloaded file contents are: {}'.format(downloaded.read()))
Downloaded file contents are: b'The king beneath the mountain\r\nThe king of ......
Even this works perfectly:
downloaded.seek(0)
print(downloaded.read().decode('utf-8'))
and gets the data
The king beneath the mountain
The king of carven stone
The lord of silver fountain ...
Where things finally go wrong is when I try to take this data and put it into a Spark RDD
downloaded.seek(0)
tRDD = spark.read.text(downloaded.read().decode('utf-8'))
and I get the error:
AnalysisException: 'Path does not exist: file:/content/The king beneath the mountain\ ....
Evidently, I am not using the correct method or parameters to read the file into Spark. I have tried quite a few of the methods described
I would be very grateful if someone could help me figure out how to read this file for subsequent processing.

A complete solution to this problem is available in another StackOverflow question that is available at this URL.
Here is the notebook where this solution is demonstrated.
I have tested it and it works!

It seems that spark.read.text expects a file name, but you are giving it the file content instead. You can try either of these:
save the content to a file, then pass the file name
pass just downloaded instead of downloaded.read().decode('utf-8')
You can also simplify downloading from Google Drive with pydrive. I gave an example here.
https://gist.github.com/korakot/d56c925ff3eccb86ea5a16726a70b224
Downloading is just
fid = drive.ListFile({'q':"title='hobbit.txt'"}).GetList()[0]['id']
f = drive.CreateFile({'id': fid})
f.GetContentFile('hobbit.txt')
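The first option above (save the content to a file, then pass the name) could look roughly like the sketch below. The helper name save_buffer and the temporary directory are hypothetical, and the BytesIO object is a stand-in for the downloaded buffer from the question. The Spark call is left as a comment because it needs the notebook's spark session.

```python
import io
import os
import tempfile


def save_buffer(buffer, path):
    """Write the full contents of an in-memory buffer to path and return path."""
    buffer.seek(0)
    with open(path, "wb") as out:
        out.write(buffer.read())
    return path


# Stand-in for the `downloaded` BytesIO object from the question.
downloaded = io.BytesIO(b"The king beneath the mountain\r\nThe king of carven stone\r\n")
local_path = save_buffer(downloaded, os.path.join(tempfile.mkdtemp(), "hobbit.txt"))

# In the notebook, the path (not the text) would then go to Spark:
# tRDD = spark.read.text(local_path).rdd
```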

Py2exe: Embed static files in exe file itself and access them

I found a way to add files to library.zip via: Extend py2exe to copy files to the zipfile where pkg_resources can load them.
I can access my file when library.zip is not included in the exe.
I added a file, text.txt, to the directory foo/media inside library.zip.
I use this code:
import pkg_resources
import zipfile
from cStringIO import StringIO
my_data = pkg_resources.resource_string(__name__,"library.zip")
filezip = StringIO(my_data)
zip = zipfile.ZipFile(filezip)
data = zip.read("foo/media/text.txt")
I tried to use pkg_resources, but I think I am misunderstanding something, because I could just as well open "library.zip" directly.
My question is: how can I do this when library.zip is embedded in the exe?
Best Regards
Jean-Michel

I cobbled together a reasonably neat solution to this, but it doesn't use pkg_resources.
I need to distribute productivity tools as standalone EXEs, that is, all bundled into the one .exe file. I also need to send out notifications when these tools are used, which I do via the Logging API, using file-based configuration. I embed the logging.cfg file to make it harder to effectively switch off these notifications, i.e. by deleting the loose file... which would probably break the app anyway.
So the following are the interesting bits from my setup.py:
LOGGING_CFG = open('main/resources/logging.cfg').read()
setup(
    name='productivity-tool',
    ...
    # py2exe extras
    console=[{'script': productivity_tool.__file__.replace('.pyc', '.py'),
              'other_resources': [(u'LOGGINGCFG', 1, LOGGING_CFG)]}],
    zipfile=None,
    options={'py2exe': {'bundle_files': 1, 'dll_excludes': ['w9xpopen.exe']}},
)
Then in the startup code for productivity_tool.py:
from win32api import LoadResource
from StringIO import StringIO
from logging.config import fileConfig
...
if __name__ == '__main__':
    if is_exe():
        logging_cfg = StringIO(LoadResource(0, u'LOGGINGCFG', 1))
    else:
        logging_cfg = 'main/resources/logging.cfg'
    fileConfig(logging_cfg)
    ...
Works a treat!!!

Thank you, but I found the solution:
my_data = pkg_resources.resource_stream("__main__",sys.executable) # get lib.zip file
zip = zipfile.ZipFile(my_data)
data = zip.read("foo/media/doc.pdf") # get my data on lib.zip
file = open(output_name, 'wb')
file.write(data) # write it on a file
file.close()
Best Regards
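Opening the executable itself as a zip file can work because Python's zipfile module locates the archive from the end of the file and tolerates leading data in front of it. The sketch below imitates that situation in memory with Python 3. It assumes, as the use of sys.executable above suggests, that the archive is appended to the executable; the stub bytes and file contents are made up for the example.

```python
import io
import zipfile

# Build a small archive in memory (stand-in for library.zip).
archive = io.BytesIO()
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("foo/media/text.txt", "hello from the archive")

# Simulate an executable with the archive appended: arbitrary leading
# bytes (a hypothetical stub) followed by the zip data.
bundled = io.BytesIO(b"MZ-fake-exe-stub" * 64 + archive.getvalue())

# zipfile finds the central directory at the end and adjusts the offsets.
with zipfile.ZipFile(bundled) as zf:
    data = zf.read("foo/media/text.txt")

print(data)
```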

You shouldn't be using pkg_resources to retrieve the library.zip file. You should use it to retrieve the added resource.
Suppose you have the following project structure:
setup.py
foo/
    __init__.py
    bar.py
    media/
        image.jpg
You would use resource_string (or, preferably, resource_stream) to access image.jpg:
img = pkg_resources.resource_string(__name__, 'media/image.jpg')
That should "just work". At least it did when I bundled my media files in the EXE. (Sorry, I've since left the company where I was using py2exe, so don't have a working example to draw on.)
You could also try using pkg_resources.resource_filename(), but I don't think that works under py2exe.
