Cleaning up the Garbage Output File - python-3.x

I am pretty new to Python and tried to read a jpg file and write it out as a simple practice exercise. The file is not huge, only 142 KB, but when I used buffer chunks of 50000 bytes to read and write it to a new.jpg file, it gave a limited-space error, ate up all of my remaining 4GB of space on the C:\ drive on my desktop, and never released that memory. How can I see and free up the memory used by Python? Here is my code:
def main():
    buffersize = 50000
    infile = open('olives.jpg', 'rb')
    outfile = open('new.jpg', 'wb')
    buffer = infile.read(buffersize)
    while(len(buffer)):
        outfile.write(buffer)
        print('.', end = '')
        infile.read(buffersize)
    print()
    print('Done')

if __name__ == "__main__": main()
Please let me know how I can free this memory, as my C:\ drive is short on space.
Thank you!

In your loop, you never change buffer. You need to assign the result of the read, or you just keep writing the same original buffer contents over and over and the loop never ends. Change the final line of this while loop:
while(len(buffer)):
    outfile.write(buffer)
    print('.', end = '')
    infile.read(buffersize)  # <-- Not assigning result
to:
buffer = infile.read(buffersize)
Also, for style points, change while(len(buffer)): to just while buffer:; bytes objects (like all sequences) are truthy when non-empty and falsy when empty, and Python doesn't require parentheses around the condition of an if or while unless they're needed for something like grouping tests.
As for "freeing the memory" (meaning "cleaning up the garbage output file"), just delete it. At the command line, change to the working directory you ran the script in and run del new.jpg, or navigate there in the file browser and delete it. If you have no idea where the file is, rerunning your script after fixing the infinite loop will truncate the output file before writing to it (because your mode included w), which also solves the problem.
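For completeness, here is one corrected version of the copy loop, a minimal sketch rather than the only way to write it, using with so both files are closed even if an error occurs:

def main():
    buffersize = 50000
    with open('olives.jpg', 'rb') as infile, open('new.jpg', 'wb') as outfile:
        buffer = infile.read(buffersize)
        while buffer:
            outfile.write(buffer)
            print('.', end='')
            buffer = infile.read(buffersize)  # reassign so the loop can finish
    print()
    print('Done')

if __name__ == "__main__":
    main()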

Related

Why would a python script keep running after the output is generated (strange behavior)?

Background: The purpose of this script is to take eight very large (~7GB) FASTQ files, subsample each, and concatenate each subsample into one "master" FASTQ file. The resulting file is about 60GB. Each file is subsampled to 120,000,000 lines.
The issue: The basic purpose of this script is to output a huge file. I have print statements & time stamps in my code so I know that it goes through the entire script, processes the input files and creates the output files. After I see the final print statement, I go to my directory and see that the output file has been generated, it's the correct size, and it was last modified a while ago, despite the fact that the script is still running. At this point, however, the code has still not finished running, and it will actually stall there for about 2-3 hours before I can enter anything into my terminal again.
My code is behaving like it gets stuck on the last line of the script even after it's finished creating the output file.
I'm hoping someone might be able to identify what's causing this weird behavior. Below is a dummy version of what my script does:
import random
import itertools

infile1 = "sample1_1.fastq"
inFile2 = "sample1_2.fastq"

with open(infile1, 'r') as file_1:
    f1 = file_1.read()

with open(inFile2, 'r') as file_2:
    f2 = file_2.read()

fastq1 = f1.split('\n')
fastq2 = f2.split('\n')

def subsampleFASTQ(compile1, compile2):
    random.seed(42)
    random_1 = random.sample(compile1, 30000000)
    random.seed(42)
    random_2 = random.sample(compile2, 30000000)
    return random_1, random_2

combo1, combo2 = subsampleFASTQ(fastq1, fastq2)

with open('sampleout_1.fastq', 'w') as out1:
    out1.write('\n'.join(str(i) for i in combo1))

with open('sampleout_2.fastq', 'w') as out2:
    out2.write('\n'.join(str(i) for i in combo2))
My ideas of what it could be:
File size is causing some slowness
There is some background process running in this script that won't let it finish (but I have no idea how to debug that -- any resources would be appreciated; a small diagnostic sketch follows below)
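One way to probe the second idea is sketched below. This is a diagnostic aid added here for illustration (it is not part of the original post and not a fix); it reuses the variable names from the dummy script above, timestamps the end of the run, and explicitly drops the multi-gigabyte objects so you can see whether the remaining hours are really spent releasing them during interpreter teardown:

import time

print(time.strftime('%H:%M:%S'), 'output files written')
# Explicitly release the large in-memory structures built above
del f1, f2, fastq1, fastq2, combo1, combo2
print(time.strftime('%H:%M:%S'), 'large objects released; script about to exit')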

Why does zlib decompression break after an http request is reinitiated?

I have a python script that "streams" a very large gzip file using urllib3 and feeds it into a zlib.decompressobj. This zlib decompression object is configured to read gzip compression. If this initial http connection is interrupted then the zlib.decompressobj begins to throw errors after the connection is "resumed". See my source code below if you want to cut to the chase.
These errors occur despite the fact that the script initiates a new http connection with a Range header set to the number of bytes already read, so the download resumes from the point reached when the connection was broken. I believe this arbitrary resume point is the source of my problem.
If I don't try to decompress the chunks of data being read in by urllib3, but instead just write them to a file, everything works just fine. Without trying to decompress the stream everything works even when there is an interruption. The completed archive is valid, it is the same size as one downloaded by a browser and the MD5 hash of the .gz file is the same as if I had downloaded the file directly with Chrome.
On the other hand, if I try to decompress the chunks of data coming in after the interruption, even with the Range header specified, the zlib library throws all kinds of errors. The most recent was Error -3 while decompressing data: invalid block type
Additional Notes:
The site that I am using has the Accept-Range flag set to bytes meaning that I am able to submit modified Range headers to the server.
I am not using the requests library in this script as it ultimately manages urllib3. I am instead using urllib3 directly in an attempt to cut out the middle man.
This script is an oversimplification of my ultimate goal, which is to stream the compressed data directly from where it is hosted, enrich it, and store it in a MySQL database on the local network.
I am heavily resource constrained inside of the docker container where this processing will occur.
The genesis of this question is present in a question I asked almost 3 weeks ago: requests.iter_content() thinks file is complete but it's not
The most common problem I am encountering with the urllib3 (and requests) library is the IncompleteRead(self._fp_bytes_read, self.length_remaining) error.
This error only appears if the urllib3 library has been patched to raise an exception when an incomplete read occurs.
My best guess:
I am guessing that the break in the data stream being fed to zlib.decompressobj causes zlib to somehow lose context and start attempting to decompress the data again at an odd location. Sometimes it will resume, but the data stream is garbled, making me believe the byte offset used in the new Range header lands in the middle of some bytes that are then incorrectly interpreted as headers. I do not know how to counteract this, and I have been trying to solve it for several weeks. The fact that the data are still valid when downloaded whole (without being decompressed before completion), even when an interruption occurs, makes me believe that some "loss of context" within zlib is the cause.
Source Code: (Has been updated to include a "buffer")
This code is a little bit slapped together so forgive me. Also, this target gzip file is quite a lot smaller than the actual file I will be using. Additionally, the target file in this example will no longer be available from Rapid7 in about a month's time. You may choose to substitute a different .gz file if that suits you.
import urllib3
import certifi
import inspect
import os
import time
import zlib

def patch_urllib3():
    """Set urllib3's enforce_content_length to True by default."""
    previous_init = urllib3.HTTPResponse.__init__
    def new_init(self, *args, **kwargs):
        previous_init(self, *args, enforce_content_length=True, **kwargs)
    urllib3.HTTPResponse.__init__ = new_init

#Patch the urllib3 module to throw an exception for IncompleteRead
patch_urllib3()

#Set the target URL
url = "https://opendata.rapid7.com/sonar.http/2021-11-27-1638020044-http_get_8899.json.gz"

#Set the local filename
local_filename = '2021-11-27-1638020044-http_get_8899_script.json.gz'

#Configure the PoolManager to handle https (I think...)
http = urllib3.PoolManager(ca_certs=certifi.where())

#Initiate start bytes at 0 then update as download occurs
sum_bytes_read = 0
session_bytes_read = 0
total_bytes_read = 0

#Dummy variable to silence console output from file write
writer = 0

#Set zlib window bits to 16 bits for gzip decompression
decompressor = zlib.decompressobj(zlib.MAX_WBITS | 16)

#Build a buffer list
buf_list = []
i = 0

while True:
    print("Building request. Bytes read:", total_bytes_read)
    resp = http.request(
        'GET',
        url,
        timeout=urllib3.Timeout(connect=15, read=40),
        preload_content=False)

    print("Setting headers.")
    #This header should cause the request to resume at "total_bytes_read"
    resp.headers['Range'] = 'bytes=%s' % (total_bytes_read)

    print("Local filename:", local_filename)

    #If the file already exists then append to it
    if os.path.exists(local_filename):
        print("File already exists.")
        try:
            print("Starting appended download.")
            with open(local_filename, 'ab') as f:
                for chunk in resp.stream(2048):
                    buf_list.append(chunk)
                    #Use i to offset the chunk being read from the "buffer"
                    #I.e. load 3 chunks (0, 1, 2) into the buffer list before starting to read from it
                    if i > 2:
                        buffered_chunk = buf_list.pop(0)
                        writer = f.write(buffered_chunk)
                        #Comment out the below line to stop the error from occurring.
                        #File download should complete successfully even if interrupted when the following line is commented out.
                        decompressed_chunk = decompressor.decompress(buffered_chunk)
                    #Increment i so that the buffer list will fill before reading from it
                    i = i + 1
                    session_bytes_read = resp._fp_bytes_read
                    #Sum bytes read is an updated value that isn't stored. It is only used for console print
                    sum_bytes_read = total_bytes_read + session_bytes_read
                    print("[+] Bytes read:", str(format(sum_bytes_read, ",")), end='\r')
                print("\nAppended download complete.")
                break
        except Exception as e:
            print(e)
            #Add the current session bytes to the total bytes read each time the loop needs to repeat
            total_bytes_read = total_bytes_read + session_bytes_read
            print("Bytes Read:", total_bytes_read)
            #Mod the total_bytes back to the nearest chunk size so it can be re-requested
            total_bytes_read = total_bytes_read - (total_bytes_read % 2048) - 2048
            print("Rounded bytes Read:", total_bytes_read)
            #Pop the last entry off of the buffer since it may be incomplete
            buf_list.pop()
            #Reset i so that the buffer has to be rebuilt
            i = 0
            print("Sleeping for 30 seconds before re-attempt...")
            time.sleep(30)

    #If the file doesn't already exist then write to it directly
    else:
        print("File does not exist.")
        try:
            print("Starting initial download.")
            with open(local_filename, 'wb') as f:
                for chunk in resp.stream(2048):
                    buf_list.append(chunk)
                    #Use i to offset the chunk being read from the "buffer"
                    #I.e. load 3 chunks (0, 1, 2) into the buffer list before starting to read from it
                    if i > 2:
                        buffered_chunk = buf_list.pop(0)
                        #print("Buffered Chunk", str(i - 2), "-", buffered_chunk)
                        writer = f.write(buffered_chunk)
                        decompressed_chunk = decompressor.decompress(buffered_chunk)
                    #Increment i so that the buffer list will fill before reading from it
                    i = i + 1
                    session_bytes_read = resp._fp_bytes_read
                    print("[+] Bytes read:", str(format(session_bytes_read, ",")), end='\r')
                print("\nInitial download complete.")
                break
        except Exception as e:
            print(e)
            #Set the total bytes read equal to the session bytes since this is the first failure
            total_bytes_read = session_bytes_read
            print("Bytes Read:", total_bytes_read)
            #Mod the total_bytes back to the nearest chunk size so it can be re-requested
            total_bytes_read = total_bytes_read - (total_bytes_read % 2048) - 2048
            print("Rounded bytes Read:", total_bytes_read)
            #Pop the last entry off of the buffer since it may be incomplete
            buf_list.pop()
            #Reset i so that the buffer has to be rebuilt
            i = 0
            print("Sleeping for 30 seconds before re-attempt...")
            time.sleep(30)

    print("Looping...")

#Finish writing from buffer into file
#BE SURE TO OPEN IN "APPEND" mode with "ab" or you will overwrite the start of the file
f = open(local_filename, 'ab')
print("[+] Finishing write from buffer.")
while not len(buf_list) == 0:
    buffered_chunk = buf_list.pop(0)
    writer = f.write(buffered_chunk)
    decompressed_chunk = decompressor.decompress(buffered_chunk)

#Flush and close the file
f.flush()
f.close()
resp.release_conn()
Reproducing the error
To reproduce the error perform the following actions:
Run the script and let the download start
Be sure that the decompressed_chunk=decompressor.decompress(buffered_chunk) line is not commented out
Turn off your network connection until an exception is raised
Turn your network connection back on immediately.
If the decompressor.decompress(buffered_chunk) line is removed from the script, it will download the file and the data can be successfully decompressed from the file itself. However, if that line is present and an interruption occurs, the zlib library will not be able to continue decompressing the data stream. I need to decompress the data stream because I cannot store the actual file I am trying to use.
Is there some way to prevent this from occurring? I have now attempted to add a "buffer" list that stores the chunks; the script discards the last chunk after a failure and moves back to a point in the file that preceded the "failed" chunk. I am able to re-establish the connection and even pull back all the data correctly but even with a "buffer" my ability to decompress the stream is interrupted. I must not be smoothly recovering the data back to the buffer somehow.
Visualization:
I put this together very quickly in an attempt to better describe what I am trying to do...
I bet Mark Adler is hiding out there somewhere...
r+b doesn't append. You would need to use ab for that. It appears that on the re-try, you are reading the entire gzip file again from the start. With r+b, that file is written correctly to your output file, by overwriting what was read before.
However, you are feeding the initial read to the decompressor, and then the start of the file again. Not surprisingly, the decompressor then soon detects invalid compressed data.
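Building on that diagnosis, here is a minimal sketch (not the answer's code, and assuming the server honors Range requests) of a resume strategy in which the decompressor only ever sees each compressed byte once. Note that the Range header has to be sent with the request itself; assigning to resp.headers after the response arrives has no effect on what the server sends:

import time
import zlib
import urllib3
import certifi

url = "https://example.com/very-large-file.json.gz"   # placeholder URL
http = urllib3.PoolManager(ca_certs=certifi.where())
decompressor = zlib.decompressobj(zlib.MAX_WBITS | 16)
bytes_fed = 0   # compressed bytes already handed to the decompressor

while True:
    resp = http.request(
        'GET', url,
        headers={'Range': 'bytes=%d-' % bytes_fed},
        preload_content=False,
        timeout=urllib3.Timeout(connect=15, read=40))
    try:
        for chunk in resp.stream(2048):
            data = decompressor.decompress(chunk)
            bytes_fed += len(chunk)
            # ... enrich `data` and store it (e.g. in MySQL) here ...
        break                      # stream finished without interruption
    except Exception as exc:
        print("Interrupted:", exc)
        time.sleep(30)             # back off, then resume from bytes_fed

resp.release_conn()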

Problem with trying to encrypt .txts files in python

Ok, so I made this simple encryption program (I am new to Python) that encrypts every file in the directory it is run from (for learning purposes).
Here's the code:
import os
import time

txts = list()
for i in os.listdir():
    if i.endswith('.txt'):
        txts.append(i)

for i in txts:
    filer = open(i, 'r')
    st = str()
    for j in filer.read():
        st += chr(ord(j)+2)
    filew = open(i,'w')
    filew.write(st)

def decrypt():
    for i in txts:
        filer = open(i, 'r')
        st = str()
        for j in filer.read():
            st += chr(ord(j)-2)
        filew = open(i,'w')
        filew.write(st)
So my problem is: it encrypts every single .txt file in the directory except the last one, always. The last file always gets overwritten with nothing, unlike all the others, no matter which .txt file is last. I've checked the txts list and all the .txt files in the directory, but the last file just doesn't want to get encrypted. Let's say I put abcd in the file; after my program runs, there won't be a single thing in it.
When you put the "encryption" code in a function, the file objects returned by open go out of scope and are garbage collected. Part of that process for file objects is flushing write buffers and closing the files.
According to the documentation:
Warning: Calling f.write() without using the with keyword or calling f.close() might result in the arguments of f.write() not being completely written to the disk, even if the program exits successfully.
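For reference, here is a minimal sketch of the same shift "encryption" written with with-blocks, so every file is flushed and closed deterministically. This is one way to apply the warning above; it is not the original poster's code:

import os

def shift_files(delta):
    for name in os.listdir():
        if not name.endswith('.txt'):
            continue
        with open(name, 'r') as src:    # closed automatically when the block ends
            shifted = ''.join(chr(ord(c) + delta) for c in src.read())
        with open(name, 'w') as dst:    # write buffer flushed and file closed on exit
            dst.write(shifted)

shift_files(2)     # "encrypt"
# shift_files(-2)  # "decrypt"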

shutil.copyfileobj but without headers or skip first line [duplicate]

I am currently using the below code to import 6,000 csv files (with headers) and export them into a single csv file (with a single header row).
#import csv files from folder
path =r'data/US/market/merged_data'
allFiles = glob.glob(path + "/*.csv")
stockstats_data = pd.DataFrame()
list_ = []
for file_ in allFiles:
df = pd.read_csv(file_,index_col=None,)
list_.append(df)
stockstats_data = pd.concat(list_)
print(file_ + " has been imported.")
This code works fine, but it is slow. It can take up to 2 days to process.
I was given a single line script for Terminal command line that does the same (but with no headers). This script takes 20 seconds.
for f in *.csv; do cat "`pwd`/$f" | tail -n +2 >> merged.csv; done
Does anyone know how I can speed up the first Python script? To cut the time down, I have thought about not importing it into a DataFrame and just concatenating the CSVs, but I cannot figure it out.
Thanks.
If you don't need the CSV in memory, just copying from input to output, it'll be a lot cheaper to avoid parsing at all, and copy without building up in memory:
import shutil
import glob

#import csv files from folder
path = r'data/US/market/merged_data'
allFiles = glob.glob(path + "/*.csv")
allFiles.sort()  # glob lacks reliable ordering, so impose your own if output order matters
with open('someoutputfile.csv', 'wb') as outfile:
    for i, fname in enumerate(allFiles):
        with open(fname, 'rb') as infile:
            if i != 0:
                infile.readline()  # Throw away header on all but first file
            # Block copy rest of file from input to output without parsing
            shutil.copyfileobj(infile, outfile)
        print(fname + " has been imported.")
That's it; shutil.copyfileobj handles copying the data efficiently, dramatically reducing the Python-level work needed to parse and reserialize. Don't omit the allFiles.sort()!†
This assumes all the CSV files have the same format, encoding, line endings, etc., the encoding encodes such that newlines appear as a single byte equivalent to ASCII \n and it's the last byte in the character (so ASCII and all ASCII superset encodings work, as does UTF-16-BE and UTF-32-BE, but not UTF-16-LE and UTF-32-LE) and the header doesn't contain embedded newlines, but if that's the case, it's a lot faster than the alternatives.
For the cases where the encoding's version of a newline doesn't look enough like an ASCII newline, or where the input files are in one encoding, and the output file should be in a different encoding, you can add the work of encoding and decoding without adding CSV parsing/serializing work, with (adding a from io import open if on Python 2, to get Python 3-like efficient encoding-aware file objects, and defining known_input_encoding to some string representing the known encoding for input files, e.g. known_input_encoding = 'utf-16-le', and optionally a different encoding for output files):
# Other imports and setup code prior to first with unchanged from before

# Perform encoding to chosen output encoding, disabling line-ending
# translation to avoid conflicting with CSV dialect, matching raw binary behavior
with open('someoutputfile.csv', 'w', encoding=output_encoding, newline='') as outfile:
    for i, fname in enumerate(allFiles):
        # Decode with known encoding, disabling line-ending translation
        # for same reasons as above
        with open(fname, encoding=known_input_encoding, newline='') as infile:
            if i != 0:
                infile.readline()  # Throw away header on all but first file
            # Block copy rest of file from input to output without parsing,
            # just letting the file object decode from input and encode to output
            shutil.copyfileobj(infile, outfile)
        print(fname + " has been imported.")
This is still much faster than involving the csv module, especially in modern Python (where the io module has undergone greater and greater optimization, to the point where the cost of decoding and reencoding is pretty minor, especially next to the cost of performing I/O in the first place). It's also a good validity check for self-checking encodings (e.g. the UTF family) even if the encoding is not supposed to change; if the data doesn't match the assumed self-checking encoding, it's highly unlikely to decode validly, so you'll get an exception rather than silent misbehavior.
Because some of the duplicates linked here are looking for an even faster solution than copyfileobj, some options:
The only succinct, reasonably portable option is to continue using copyfileobj and explicitly pass a non-default length parameter, e.g. shutil.copyfileobj(infile, outfile, 1 << 20) (1 << 20 is 1 MiB, a number which shutil has switched to for plain shutil.copyfile calls on Windows due to superior performance).
Still portable, but only works for binary files and not succinct, would be to copy the underlying code copyfile uses on Windows, which uses a reusable bytearray buffer with a larger size than copyfileobj's default (1 MiB, rather than 64 KiB), removing some allocation overhead that copyfileobj can't fully avoid for large buffers. You'd replace shutil.copyfileobj(infile, outfile) with (3.8+'s walrus operator, :=, used for brevity) the following code adapted from CPython 3.10's implementation of shutil._copyfileobj_readinto (which you could always use directly if you don't mind using non-public APIs):
buf_length = 1 << 20  # 1 MiB buffer; tweak to preference
# Using a memoryview gets zero copy performance when short reads occur
with memoryview(bytearray(buf_length)) as mv:
    while n := infile.readinto(mv):
        if n < buf_length:
            with mv[:n] as smv:
                outfile.write(smv)
        else:
            outfile.write(mv)
Non-portably, if you can (in any way you feel like) determine the precise length of the header, and you know it will not change by even a byte in any other file, you can write the header directly, then use OS-specific calls similar to what shutil.copyfile uses under the hood to copy the non-header portion of each file, using OS-specific APIs that can do the work with a single system call (regardless of file size) and avoid extra data copies (by pushing all the work to in-kernel or even within file-system operations, removing copies to and from user space) e.g.:
a. On Linux kernel 2.6.33 and higher (and any other OS that allows the sendfile(2) system call to work between open files), you can replace the .readline() and copyfileobj calls with:
filesize = os.fstat(infile.fileno()).st_size # Get underlying file's size
os.sendfile(outfile.fileno(), infile.fileno(), header_len_bytes, filesize - header_len_bytes)
To make it signal resilient, it may be necessary to check the return value from sendfile, and track the number of bytes sent plus skipped and the number remaining, looping until you've copied them all (these are low-level system calls; they can be interrupted). A sketch of such a loop appears after this list.
b. On any system Python 3.8+ built with glibc >= 2.27 (or on Linux kernel 4.5+), where the files are all on the same filesystem, you can replace sendfile with copy_file_range:
filesize = os.fstat(infile.fileno()).st_size # Get underlying file's size
os.copy_file_range(infile.fileno(), outfile.fileno(), filesize - header_len_bytes, header_len_bytes)
With similar caveats about checking for copying fewer bytes than expected and retrying.
c. On OSX/macOS, you can use the completely undocumented, and therefore even less portable/stable API shutil.copyfile uses, posix._fcopyfile for a similar purpose, with something like (completely untested, and really, don't do this; it's likely to break across even minor Python releases):
infile.seek(header_len_bytes) # Skip past header
posix._fcopyfile(infile.fileno(), outfile.fileno(), posix._COPYFILE_DATA)
which assumes fcopyfile pays attention to the seek position (docs aren't 100% on this) and, as noted, is not only macOS-specific, but uses undocumented CPython internals that could change in any release.
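As a concrete illustration of the retry loop mentioned in option (a), here is a hedged sketch (Linux-only, untested, assuming header_len_bytes is exact and that infile/outfile are the open binary file objects from the copy loop):

import os

offset = header_len_bytes
remaining = os.fstat(infile.fileno()).st_size - header_len_bytes
while remaining > 0:
    sent = os.sendfile(outfile.fileno(), infile.fileno(), offset, remaining)
    if sent == 0:        # unexpected EOF; stop rather than loop forever
        break
    offset += sent
    remaining -= sent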
† An aside on sorting the results of glob: That allFiles.sort() call should not be omitted; glob imposes no ordering on the results, and for reproducible results, you'll want to impose some ordering (it wouldn't be great if the same files, with the same names and data, produced an output file in a different order simply because in-between runs, a file got moved out of the directory, then back in, and changed the native iteration order). Without the sort call, this code (and all other Python+glob module answers) will not reliably read from a directory containing a.csv and b.csv in alphabetical (or any other useful) order; it'll vary by OS, file system, and often the entire history of file creation/deletion in the directory in question. This has broken stuff before in the real world, see details at A Code Glitch May Have Caused Errors In More Than 100 Published Studies.
Are you required to do this in Python? If you are open to doing this entirely in shell, all you'd need to do is first cat the header row from a randomly selected input .csv file into merged.csv before running your one-liner:
cat a-randomly-selected-csv-file.csv | head -n1 > merged.csv
for f in *.csv; do cat "`pwd`/$f" | tail -n +2 >> merged.csv; done
You don't need pandas for this, just the simple csv module would work fine.
import csv

df_out_filename = 'df_out.csv'
write_headers = True
# allFiles is the same glob.glob(...) list from the question
with open(df_out_filename, 'w', newline='') as fout:   # Python 3: text mode with newline=''
    writer = csv.writer(fout)
    for filename in allFiles:
        with open(filename, newline='') as fin:
            reader = csv.reader(fin)
            headers = next(reader)
            if write_headers:
                write_headers = False  # Only write headers once.
                writer.writerow(headers)
            writer.writerows(reader)  # Write all remaining rows.
Here's a simpler approach using pandas (though I am not sure how it will help with RAM usage):
import pandas as pd
import glob

path = r'data/US/market/merged_data'
allFiles = glob.glob(path + "/*.csv")
stockstats_data = pd.DataFrame()
list_ = []
for file_ in allFiles:
    df = pd.read_csv(file_)
    stockstats_data = pd.concat((df, stockstats_data), axis=0)

Output data from subprocess command line by line

I am trying to read a large data file (millions of rows, in a very specific format) using a pre-built (in C) routine. I want to then yield the results of this, line by line, via a generator function.
I can read the file OK, but whereas just running:
<command> <filename>
directly in Linux prints the results line by line as it finds them, I've had no luck replicating this within my generator function. It seems to output the entire lot as a single string that I need to split on newlines, and of course everything then needs reading before I can yield line 1.
This code will read the file, no problem:
import subprocess
import config

file_cmd = '<command> <filename>'
for rec in subprocess.check_output([file_cmd], shell=True).decode(config.ENCODING).split('\n'):
    yield rec
(ENCODING is set in config.py to iso-8859-1 - it's a Swedish site)
The code I have works, in that it gives me the data, but in doing so, it tries to hold the whole lot in memory. I have larger files than this to process which are likely to blow the available memory, so this isn't an option.
I've played around with bufsize on Popen, but not had any success (and also, I can't decode or split after the Popen, though I guess the fact I need to split right now is actually my problem!).
I think I have this working now, so will answer my own question in the event somebody else is looking for this later ...
import shlex
import subprocess

# Inside the generator function:
proc = subprocess.Popen(shlex.split(file_cmd), stdout=subprocess.PIPE)
while True:
    output = proc.stdout.readline()
    if output == b'' and proc.poll() is not None:
        break
    if output:
        yield output.decode(config.ENCODING).strip()
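A slightly more compact variant of the same idea, offered as a sketch rather than a drop-in replacement, iterates the pipe directly and reaps the child process when the stream ends:

import shlex
import subprocess

def read_records(file_cmd, encoding):
    proc = subprocess.Popen(shlex.split(file_cmd), stdout=subprocess.PIPE)
    try:
        for raw_line in proc.stdout:                 # yields lines as they arrive
            yield raw_line.decode(encoding).rstrip('\n')
    finally:
        proc.stdout.close()
        proc.wait()                                  # reap the child process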
