open large gzip file (~1gb) in python - python-3.x

I am a beginner in Python and trying to learn it. I have written a few lines of code to open a large gzip file (~1 GB) and want to extract some content, but I am getting a memory-related error. Could somebody please guide me on how to open the gzip file with limited memory? I have included the part of the code that is throwing the error.
import os
import gzip
with gzip.open("test.gz", "rb") as peak:
    for line in peak:
        file_content = line.read().decode("utf-8")
        print(file_content)
Error: File "/software/anaconda3/lib/python3.7/gzip.py", line 276, in read
    return self._buffer.read(size)

I am trying to recreate your issue but am unable to. Using fallocate I create a big file, then gzip it, but hit no error in Python:
$ fallocate -l 2G tempfile.img
$ gzip tempfile.img
$ ipython
>>> import gzip
>>> with gzip.open('tempfile.img.gz', 'rb') as fIn:
...     content = fIn.read()
If you hit an exception, it should have some name like OSError or something more specific. My guess is that you have a 32-bit installation of Python which would impose memory limits in the gigabyte range. This SO thread covers a way to check if you're running 32- or 64-bit.
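For a quick check without leaving the interpreter, this standard recipe prints the pointer width of the running build:
import struct

# Pointer size in bits: 32 for a 32-bit interpreter, 64 for a 64-bit one.
print(struct.calcsize('P') * 8)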
If you post the name of the exception or a reproducible example, then I can update this answer.
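In the meantime, if memory is the constraint, here is a minimal sketch (assuming the archive contains UTF-8 text) that streams the file line by line instead of reading it all at once:
import gzip

# Open in text mode and iterate; only one decoded line is held in
# memory at a time, so the full ~1 GB file is never loaded at once.
with gzip.open("test.gz", "rt", encoding="utf-8") as peak:
    for line in peak:
        print(line, end="")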

Related

FastText Error! ValueError: (file-name) cannot be opened for training

I have installed the fasttext module in Python and loaded the model 'cc.en.300.bin'.
I already put the data frame into the format fasttext expects, and then generated the files:
train.to_csv(" ecomm.train", columns=['Category_description'], index=False, header=False)
test.to_csv("ecom.test", columns=['Category_description'], index=False, header=False)
The files were created successfully! Then when I run this code:
import fasttext
mod= fasttext.train_supervised(input='ecomm.train')
I get this error:
Traceback (most recent call last):
  File "/Users/rosie/Documents/ProGraMinG/Python/pythonProject/FastText/FastText_overview.py", line 97, in <module>
    mod= fasttext.train_supervised(input='ecomm.train')
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/fasttext/FastText.py", line 533, in train_supervised
    fasttext.train(ft.f, a)
ValueError: ecomm.train cannot be opened for training!
{ UPDATE } !!!
Used both the isfile() and exists() functions to check if the file exists:
import os

path = 'Users/rosie/Documents/ProGraMinG/Python/pythonProject/FastText/ecomm.train'
check_file = os.path.isfile(path)
print("isfile method ", check_file)
check_file = os.path.exists(path)
print("exists method ", check_file)
Both methods return 'False'.
I also checked whether the file is readable:
doc = open(' ecomm.train', 'r')
print('checking if the file is readable', doc.readable())
However, it returned 'True', so now I'm confused. As for the size of 'ecomm.train', it is 29.4 MB.
Are you sure the file is readable, at the simple (local) path 'ecomm.train', from your Python process, given its current working directory?
For example, try specifying the file as its full absolute path – on MacOS probably something like /Users/yourusername/yourdirectory/etc/etc/ecomm.train. If that works, the problem was that your Python code's effective directory wasn't what you expected.
Alternatively, if the process that wrote the file was in some way a different user than the later process trying to read it, there might be permission errors.
Totally separate from fasttext, you could check, from the same code that's about to try fasttext operations, whether the file is readable (via either the local path or the absolute path) using a recipe like the one in this other answer: https://stackoverflow.com/a/44213239/130288
Even if it fails, it might give a more-explanatory error.
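For instance, a minimal sketch of such a check (the relative path is the one your training call used):
import os

# Show the directory the process actually runs in, then probe the
# same relative path that fasttext was given.
print("cwd:", os.getcwd())
path = 'ecomm.train'
print("exists:", os.path.exists(path), "->", os.path.abspath(path))

# Reading a few bytes proves the file is readable before training.
if os.path.exists(path):
    with open(path, 'rb') as f:
        print("first bytes:", f.read(16))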

shutil.rmtree() error when trying to remove NFS-mounted directory

Attempting to execute shutil.rmtree(path) on a directory managed by NFS consistently fails. Below you can see that os.rmdir(path) within shutil.rmtree(path) causes the exception. Is there a more robust way for me to achieve the expected result?
It appears that it removes all of the files, yet a hidden .nfs file remains in the directory for a short amount of time. I'm guessing that the process from which I'm calling rmtree has an open file handle to one of the files inside the directory, which, when deleted, apparently causes NFS to write a new hidden file. That would cause os.rmdir to fail on attempting to remove a non-empty directory.
Traceback (most recent call last):
  File "/home/me/pre3/lib/python3.6/shutil.py", line 484, in rmtree
    onerror(os.rmdir, path, sys.exc_info())
  File "/home/me/pre3/lib/python3.6/shutil.py", line 482, in rmtree
    os.rmdir(path)
OSError: [Errno 39] Directory not empty:
NFS details:
$ nfsstat -m
/home/me/nfs from XXX.YYY.ZZZ:/mnt/path/to/nfs
Flags: rw,relatime,vers=3,rsize=131072,wsize=131072,namlen=255,hard,proto=tcp,timeo=50,retrans=2,sec=sys,mountaddr=REDACTED,mountvers=3,mountport=832,mountproto=udp,local_lock=none,addr=REDACTED
I'm using Python 3.6.6 on Ubuntu 16.04.
If the Python logging module is logging to the target output directory, it will keep an open file handle there. A workaround is to call logging.shutdown() first, then call shutil.rmtree(path). This is not a general answer to the broader question, however.
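A minimal sketch of that workaround (path stands for your target directory):
import logging
import shutil

# Flush and close every logging handler so no log file in the target
# tree is held open, then remove the directory.
logging.shutdown()
shutil.rmtree(path)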
You could try defining an error handler function to be passed to the onerror arg for shutil.rmtree: https://docs.python.org/3/library/shutil.html#shutil.rmtree
def handle_rmtree_err(function, path, excinfo):
    ...

shutil.rmtree(my_path, onerror=handle_rmtree_err)
There are all sorts of reasons why a process may be holding onto a file, so I can't tell you what the error handler should do exactly.
If you haven't figured out what is holding onto the file, try $ lsof | grep .nfsXXXX.
If all else fails you could time.sleep(secs) and retry shutil.rmtree.
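For example, a minimal retry sketch (the attempt count and delay are arbitrary choices), assuming the stale .nfs file disappears once the owning process releases its handle:
import shutil
import time

def rmtree_with_retry(path, attempts=5, delay=1.0):
    # Stale .nfsXXXX files often vanish shortly after the process
    # holding them releases its handle, so retry a few times.
    for attempt in range(attempts):
        try:
            shutil.rmtree(path)
            return
        except OSError:
            if attempt == attempts - 1:
                raise
            time.sleep(delay)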

Why does draw() in pygraphviz/agraph not work on the server (but locally)?

I have a Python app using Pygraphviz that works fine locally, but on the server the draw function throws an error. It happens in make_svg. The following lines are the relevant part of the errors I get. (The full traceback is here.)
File "/path/to/app/utils/make_svg.py", line 17, in make_svg
    prog='dot'
File "/path/to/pygraphviz/agraph.py", line 1477, in draw
    fh = self._get_fh(path, 'w+b')
File "/path/to/pygraphviz/agraph.py", line 1506, in _get_fh
    fh = open(path, mode=mode)
FileNotFoundError: [Errno 2] No such file or directory: 'app/svg_files/nope.svg'
Logging type(g) gives <class 'pygraphviz.agraph.AGraph'> as expected.
I work in a virtualenv in a mod_wsgi 4.6.5/Python3.7 environment on a Webfaction server.
Locally I use a virtualenv with Python 3.5.
The version of Pygraphviz is 1.3.1. (First I had 1.5 on the server. The error was exactly the same, except for the line numbers.)
What can I do?
The same error is described in this bug report from last year.
I don't get which directory I am supposed to create. svg_files exists and has rights 777.
The draw function at the end of make_svg should create the SVG. (And at the end of extract_coordinates_from_svg the file is removed again.) The file name is a hash created in connected_dag (svg_name).
On the server, app/svg_files apparently does not refer to the same place as it does locally.
I defined the path unambiguously, and now it works:
import os

file_path = '{grandparent}/svg_files/{name}.svg'.format(
    grandparent=os.path.dirname(os.path.dirname(__file__)),
    name=name
)
g.draw(file_path, prog='dot')
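For reference, an equivalent sketch with pathlib (g and name come from the surrounding code):
from pathlib import Path

# Anchor the path to this module's grandparent directory instead of
# the process's working directory.
file_path = Path(__file__).resolve().parent.parent / 'svg_files' / '{}.svg'.format(name)
g.draw(str(file_path), prog='dot')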

Create directory with path longer than 260 characters in Python

This is a very similar question to this one, but for Python instead of PowerShell. It was also discussed here and here, but no working solutions were posted.
So, is there a way to create a directory in Python that bypasses the 260-character limit on Windows? I tried multiple ways of prepending \\?\, but could not make it work.
In particular, the following most obvious code
path = f'\\\\?\\C:\\{"a"*300}.txt'
open(path, 'w')
fails with an error
OSError: [Errno 22] Invalid argument: '\\\\?\\C:\\aaaaa<...>aaaa.txt'
Thanks to eryksun, I realized that I was trying to create a file with too long a name. After some experiments, this is how one can create a path that exceeds 260 characters on Windows (provided the file system allows it):
from pathlib import Path
folder = Path('//?/c:/') / ('a'*100) / ('b'*100)
file = folder / ('c' * 100)
folder.mkdir(parents=True, exist_ok=True)
file.open('w')
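The same idea works as a sketch with os.makedirs (note that \\?\ paths must be absolute and use backslashes):
import os

# Build a > 260-character directory path with the \\?\ prefix; each
# component stays under the per-component limit (~255 characters).
long_dir = '\\\\?\\C:\\' + ('a' * 100) + '\\' + ('b' * 100)
os.makedirs(long_dir, exist_ok=True)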

File downloaded larger than original

I'm working on a little Python 3 server and I want to download an SQLite database from this server. But when I tried, I discovered that the downloaded file is larger than the original: the original file size is 108K, while the downloaded file size is 247K. I've tried this many times, and each time I had the same result. I also checked the sha256 sums, which differ.
Here is my downloader.py file:
import cgi
import os

print('Content-Type: application/octet-stream')
print('Content-Disposition: attachment; filename="Library.db"\n')

db = os.path.realpath('..') + '/Library.db'
with open(db, 'rb') as file:
    print(file.read())
Thanks in advance!
EDIT:
I tried this:
$ ./downloader > file
The resulting file's size is also 247K.
Well, I've finally found the solution. The problem (which I didn't see at first) was that the server sent plain text to the client: print(file.read()) writes the textual representation of a bytes object (b'...' with backslash escapes) rather than the raw bytes, which is why the download came out larger than the original. Here is one way to send the actual binary data:
import cgi
import os
import shutil
import sys

print('Content-Type: application/octet-stream; file="Library.db"')
print('Content-Disposition: attachment; filename="Library.db"\n')
sys.stdout.flush()

db = os.path.realpath('..') + '/Library.db'
with open(db, 'rb') as file:
    shutil.copyfileobj(file, sys.stdout.buffer)
But if someone has a better syntax, I would be glad to see it! Thank you!
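As a sanity check after the fix, a small sketch (the file names are placeholders) that compares digests the same way you did with sha256:
import hashlib

def sha256sum(path):
    # Hash in chunks so large files do not need to fit in memory.
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(8192), b''):
            h.update(chunk)
    return h.hexdigest()

print(sha256sum('Library.db'))
print(sha256sum('downloaded_Library.db'))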
