Read numpy data from GZip file over the network - python-3.x

I am attempting to download the MNIST dataset and decode it without writing it to disk (mostly for fun).
request_stream = urlopen('http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz')
zip_file = GzipFile(fileobj=request_stream, mode='rb')
with zip_file as fd:
    magic, numberOfItems = struct.unpack('>ii', fd.read(8))
    rows, cols = struct.unpack('>II', fd.read(8))
    images = np.fromfile(fd, dtype='uint8')  # < here be dragons
    images = images.reshape((numberOfItems, rows, cols))
    return images
This code fails with OSError: obtaining file position failed, an error that seems to be ungoogleable. What could the problem be?

The problem seems to be that what gzip and similar modules provide aren't real file objects (unsurprisingly); numpy's fromfile tries to read through the actual FILE* pointer, so this cannot work.
If it's OK to read the entire file into memory (which it might not be), this can be worked around by reading all non-header information into a bytearray and deserializing from that:
rows, cols = struct.unpack('>II', fd.read(8))
b = bytearray(fd.read())
images = np.frombuffer(b, dtype='uint8')
images = images.reshape((numberOfItems, rows, cols))
return images
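For completeness, here is a minimal self-contained sketch of that workaround, with the imports the snippets above leave out; the two-stage header read follows the IDX layout used in the question, and the function name load_mnist_images is just for illustration:
import struct
from gzip import GzipFile
from urllib.request import urlopen
import numpy as np

def load_mnist_images(url):
    # Decompress the gzip stream on the fly, without touching the disk
    with GzipFile(fileobj=urlopen(url), mode='rb') as fd:
        magic, number_of_items = struct.unpack('>II', fd.read(8))
        rows, cols = struct.unpack('>II', fd.read(8))
        # Read the rest into memory and deserialize from the buffer,
        # since np.fromfile needs a real file object
        images = np.frombuffer(bytearray(fd.read()), dtype='uint8')
        return images.reshape((number_of_items, rows, cols))

images = load_mnist_images('http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz')
print(images.shape)  # (10000, 28, 28) for the t10k file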

Related

Repairing corrupted JPEG images from character replacement

Recently I got some corrupted JPEG images after mistakenly running the command:
~$> sed -i 's/;/_/g' *
After that, in the working directory and its subdirectories, every 0x3b byte in the JPEG images became 0x5f. Viewer apps display the images as corrupted, as shown below:
corrupted image sample
I could not identify which bytes should be recovered, and when I tried to check the images for warning/error flags with toolkits such as ExifTool, they just return OK, since the corrupted JPEGs are not so badly broken that a viewer refuses to open them.
The images need to be repaired, since there is no backup copy of them, but I don't know where to start. Simply replacing 0x5f with 0x3b again is not effective, since the number of cases is too large (2^n, I guess, where there are n candidate 0x5f bytes) for a trial-and-error approach. I've started parsing the Huffman table in the JPEG header, hoping to identify where the Huffman-coded stream conflicts with the binary data, but I'm not sure this will work.
How can I recover the images in this situation? I appreciate your help.
There appear to be 57 occurrences of 0x5f in your corrupted image. If you can't find a better way, you could maybe "eyeball" the effects of replacing the incorrect bytes in the image fairly quickly like this:
open the image in binary mode and read it all with JPEG = open('PdQpR.jpg','rb').read()
use offsets = [m.start() for m in re.finditer(b'_', JPEG)] to find the byte offsets of the 57 occurrences
display the image with cv2.imdecode() and cv2.imshow() and then enter a loop accepting keypresses with cv2.waitKey()
p = move to previous one of 57 occurrences
n = move to next one of 57 occurrences
SPACE = toggle between 0x5f and 0x3b
s = save current state
q = quit
I had a quick attempt at this but haven't had much success using it yet:
#!/usr/bin/env python3
import cv2
import re
import numpy as np

# Load the image as a mutable byte buffer
filename = 'PdQpR.jpg'
JPEG = open(filename, 'rb').read()
JPEG = bytearray(JPEG)

# Find the byte offsets of all the underscores
offsets = [m.start() for m in re.finditer(b'_', JPEG)]
N = len(offsets)
index = 0

while True:
    # Show user which entry we are at
    print(f'{index}/{N}: n=next, p=previous, space=toggle, q=quit')
    # Decode and display the JPEG
    im = cv2.imdecode(np.frombuffer(JPEG, dtype=np.uint8), cv2.IMREAD_COLOR)
    cv2.imshow(filename, im)
    key = cv2.waitKey(0)
    # n = next offset
    if key == ord('n'):
        index = (index + 1) % N
        continue
    # p = previous offset
    if key == ord('p'):
        index = index - 1
        if index < 0:
            index = N - 1
        continue
    # q = Quit
    if key == ord('q'):
        break
    # space = toggle between underscore and semicolon
    if key == ord(' '):
        if JPEG[offsets[index]] == ord('_'):
            print(f'{index}/{N}: Toggling to ;')
            JPEG[offsets[index]] = ord(';')
        else:
            print(f'{index}/{N}: Toggling to _')
            JPEG[offsets[index]] = ord('_')
        continue
Note: Toggling some bytes between '_' and ';' results in illegal images and error messages from cv2.imdecode() and/or cv2.imshow(). Ideally you would wrap these inside a try/except and back out the last change if they occur. I didn't do that, yet.
Note: I didn't implement the save function; it would just be something like open('corrected.jpg', 'wb').write(JPEG).
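As a rough sketch of the backout idea from the first note (the helper names toggled and try_toggle are mine, not part of the script above): cv2.imdecode generally returns None when it cannot decode the buffer, so you can test the decode result and revert the last toggle on failure.
def toggled(value):
    # Swap between underscore (0x5f) and semicolon (0x3b)
    return ord(';') if value == ord('_') else ord('_')

def try_toggle(JPEG, offset):
    # Toggle one byte, but back the change out if the JPEG no longer decodes
    old = JPEG[offset]
    JPEG[offset] = toggled(old)
    im = cv2.imdecode(np.frombuffer(JPEG, dtype=np.uint8), cv2.IMREAD_COLOR)
    if im is None:
        JPEG[offset] = old  # revert the edit
        return None
    return im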

Magick convert through subprocess, Converting tiff images to pdf increases the size by 20 times

I tried using density but it didn't help. The original TIFF image is 459 KB, but when it gets converted to PDF the size grows to 8,446 KB.
from subprocess import Popen, PIPE

commands = ['magick', 'convert']
commands.extend(waiting_list["images"][2:])
commands.append('-adjoin')
commands.append(combinedFormPathOutput)
process = Popen(commands, stdout=PIPE, stderr=PIPE, shell=True)
process.communicate()
https://drive.google.com/file/d/14V3vKRcyyEx1U23nVC13DDyxGAYOpH-6/view?usp=sharing
It's not the above code but the PIL code below that is causing the size to increase:
from PIL import Image, ImageSequence

images = []
filepath = 'Multi_document_Tiff.tiff'
image = Image.open(filepath)
if filepath.endswith('.tiff'):
    imagepath = filepath.replace('.tiff', '.pdf')
    for i, page in enumerate(ImageSequence.Iterator(image)):
        page = page.convert("RGB")
        images.append(page)
    if len(images) == 1:
        images[0].save(imagepath)
    else:
        images[0].save(imagepath, save_all=True, append_images=images[1:])
image.close()
When I run
convert Multi_document_Tiff.tiff -adjoin Multi_document.pdf
I get a 473,881-byte PDF that contains the 10 pages of the TIFF. If I run
convert Multi_document_Tiff.tiff Multi_document_Tiff.tiff Multi_document_Tiff.tiff -adjoin Multi_document.pdf
I get a 1,420,906-byte PDF that contains 30 pages (three copies of your TIFF).
So obviously, if you pass several input files to ImageMagick, it will coalesce them into the output file.
Your code does:
commands.extend(waiting_list["images"][2:])
So it seems it is passing a list of files to ImageMagick, and the output should be the accumulation of all these files, which can be a lot bigger than the first file alone.
So:
did you check the content of the output PDF?
did you check the list of files which is actually passed?
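A quick way to check the second point, as a sketch reusing the asker's names waiting_list and combinedFormPathOutput, is to print the argument list and the size of each input file before launching ImageMagick:
import os
from subprocess import Popen, PIPE

commands = ['magick', 'convert']
commands.extend(waiting_list["images"][2:])
commands.append('-adjoin')
commands.append(combinedFormPathOutput)

# Inspect exactly what gets passed to ImageMagick
print(commands)
for path in waiting_list["images"][2:]:
    print(path, os.path.getsize(path), 'bytes')

# shell=True is unnecessary when the command is already a list of arguments
process = Popen(commands, stdout=PIPE, stderr=PIPE)
stdout, stderr = process.communicate()
print(stderr.decode(errors='replace'))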

Unpack binary file contents; modify value; then pack contents to new binary file

I have a binary file and limited knowledge of its structure. I'd like to unpack the contents of the file, change a value, and then re-pack the modified contents into a new binary file. If I can complete the unpacking successfully, I can certainly modify one of the values, and I believe I will then be able to handle the re-packing to create a new binary file. However, I am having trouble completing the unpacking. This is what I have so far:
import struct

image = None
one = two = three = four = five = 0
with open(my_file, 'rb') as fil:
    one = struct.unpack('i', fil.read(4))[0]
    two = struct.unpack('i', fil.read(4))[0]
    three = struct.unpack('d', fil.read(8))[0]
    four = struct.unpack('d', fil.read(8))[0]
    five = struct.unpack('iiii', fil.read(16))
    image = fil.read(920)
When I set a breakpoint below the section of code displayed above, I can see that the type of the image variable above is <class 'bytes'>. The type of fil is <class 'io.BufferedReader'>. How can I unpack the data in this image variable?
The recommendation from @Stanislav directly led me to the solution to this problem. Ultimately, I did not need struct unpack/pack to reach my goal. The code below roughly illustrates the solution.
with open(my_file, 'rb') as fil:
    data = bytearray(fil.read())

mylist = list(data)
mylist[8] = mylist[8] + 2    # modify some fields
mylist[9] = mylist[9] + 2
mylist[16] = mylist[16] + 3
data = bytearray(mylist)

another_file = open("other_file.bin", "wb")
another_file.write(data)
another_file.close()
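For reference, the struct-based route the question originally asked about would look roughly like this; the header layout (two int32s, two doubles, four int32s, then 920 bytes of image data) is taken from the reads in the question, the explicit '<' is an assumption to avoid alignment padding, and the modified field and value are arbitrary examples:
import struct

HEADER_FMT = '<iiddiiii'                    # 2 x int32, 2 x double, 4 x int32
HEADER_SIZE = struct.calcsize(HEADER_FMT)   # 40 bytes

with open(my_file, 'rb') as fil:
    fields = list(struct.unpack(HEADER_FMT, fil.read(HEADER_SIZE)))
    image = fil.read(920)

fields[2] += 1.5    # e.g. bump the first double field

with open('other_file.bin', 'wb') as out:
    out.write(struct.pack(HEADER_FMT, *fields))
    out.write(image)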

os.listdir reads images randomly making bounding box training difficult

The os.listdir(path) call reads image filenames from a folder in a random order. I have saved a CSV file with bounding box information for the images in the folder sequentially. I assumed os.listdir would return the images sequentially so that my CSV file could also be read sequentially during training.
I have tried sorted(os.listdir(path)) but it did not help. I could not find any other function or code to read the images sequentially from a folder. I named the images frame1.jpg, frame2.jpg, etc.
PATH = os.getcwd()
# Define data path
data_path = PATH + '/frames'
data_dir_list = sorted(os.listdir(data_path))
print(data_dir_list)
img_data_list = []
for dataset in data_dir_list:
    img_list = sorted(os.listdir(data_path + '/' + dataset))
    print('Loaded the images of dataset-' + '{}\n'.format(dataset))
    for img in sorted(img_list):
        input_img = cv2.imread(data_path + '/' + dataset + '/' + img)
        input_img = cv2.cvtColor(input_img, cv2.COLOR_BGR2GRAY)
        input_img1 = input_img
        #input_img_resize = cv2.resize(input_img, (512, 512))
        img_data_list.append(input_img1)
img_data = np.array(img_data_list)
img_data = img_data.astype('float32')
img_data /= 255
As per Python docs, os.listdir() returns filenames in an arbitrary order. It just maps to an underlying operating system call that will return filenames in whatever order is presumably most efficient based on filesystem design.
It's just a standard list of strings, so sorted() should work in the way you're using it. Are the sequence numbers in your filenames zero-padded so that lexicographic sorting works for the more than 10 images you're presumably using (otherwise frame10.jpg sorts before frame2.jpg)? What's the random order you're seeing from sorted(os.listdir(...))?
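If renaming the files with zero-padded numbers isn't an option, a natural-sort key that extracts the frame number from each name also works; a small sketch (the helper frame_number is mine, applied to the data_path and dataset variables from the question):
import os
import re

def frame_number(filename):
    # Pull the integer out of names like 'frame12.jpg' so they sort numerically
    match = re.search(r'\d+', filename)
    return int(match.group()) if match else -1

img_list = sorted(os.listdir(data_path + '/' + dataset), key=frame_number)
print(img_list[:5])  # e.g. ['frame1.jpg', 'frame2.jpg', 'frame3.jpg', ...]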

Why does pandas python use disk space

I have a PC with two disks:
110GB SSD
1TB HDD
There is around 18 GB free on the SSD.
When I run the Python code below, it "uses" all the space on my SSD (I end up having only 1 GB free). The code iterates over all SAS files in a folder, performs a group-by operation, and appends the results of each file to one big dataframe.
import pandas as pd
import os
import datetime
import numpy as np

# The function GetDailyPricePoints does the following:
# 1. Imports the file
# 2. Creates the "price" variable
# 3. Performs a group by
# 4. Decodes byte variables and converts salesdate to date type (if needed)
def GetDailyPricePoints(inpath, infile):
    intable = pd.read_sas(filepath_or_buffer=os.path.join(inpath, infile))
    # Create price column
    intable.loc[intable['quantity'] != 0, 'price'] = intable['salesvalue'] / intable['quantity']
    intable['price'] = round(intable['price'].fillna(0.0), 0)
    # Create outtable
    outtable = intable.groupby(["salesdate", "storecode", "price", "barcode"]).agg({'key_row': 'count', 'salesvalue': 'sum', 'quantity': 'sum'}).reset_index().rename(columns={'key_row': 'Baskets', 'salesvalue': 'Sales', 'quantity': 'Quantity'})
    # Fix byte values and salesdate column
    for column in outtable:
        if not column in list(outtable.select_dtypes(include=[np.number]).columns.values):  # loop over non-numeric columns
            outtable[column] = outtable[column].where(outtable[column].apply(type) != bytes, outtable[column].str.decode('utf-8'))
        elif column == 'salesdate':  # numeric column named salesdate
            outtable[column] = pd.to_timedelta(outtable[column], unit='D') + pd.Timestamp('1960-1-1')
    return outtable

inpath = r'C:\Users\admin\Desktop\Transactions'
outpath = os.getcwd() + '\Export'
outfile = 'DailyPricePoints'
dirs = os.listdir(inpath)
outtable = pd.DataFrame()

# Loop through the SAS files in the folder
for file in dirs:
    if file[-9:] == '.sas7bdat':
        outtable.append(GetDailyPricePoints(inpath, file, decimals))
I would like to understand what exactly is using the disk space. I would also like to change the path where this temporary work is saved to a path on my HDD.
You are copying all of the data into RAM; you don't have enough of it in this case, so the system spills over into the page file / virtual memory. The only ways to fix this are to get more memory, or to not store everything in one big dataframe, e.g. write each file's result to its own pickle with outtable.to_pickle('somefile.pkl').
However, if you insist on storing everything in one large csv, you can append to a csv by passing a file object as the first argument:
out = open('out.csv', 'a')
outtable.to_csv(out, index = False)
doing the .to_csv() step within your loop.
Also, the .append() method for dataframes does not modify the dataframe in place, but instead returns a new dataframe (unlike the method with lists). So your last block of code probably isn't doing what you're expecting.
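Putting those two points together, here is a sketch of a corrected loop that writes each file's result straight to one CSV (header only on the first write) instead of accumulating everything in RAM; to_csv(..., mode='a') is used here instead of an explicit file handle, and the extra decimals argument from the question is dropped since GetDailyPricePoints takes two arguments:
outcsv = os.path.join(outpath, outfile + '.csv')
first = True
for file in dirs:
    if file.endswith('.sas7bdat'):
        result = GetDailyPricePoints(inpath, file)
        # Append to one CSV on disk rather than growing a dataframe in memory
        result.to_csv(outcsv, mode='a', header=first, index=False)
        first = False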
