How to read binary-protobuf gz files in Spark / Spark Streaming? - apache-spark

I need to read a .gz file from local disk, HDFS, or Kafka, then decompress and parse it. Does anyone have experience with this?
The same question applies to other archive types, such as bin.tar.gz.

You can use sc.binaryFiles to read binary files and do whatever you like with the content bytes.
As for tar.gz, see Read whole text files from a compression in Spark

This is what I did:
1. Read the binary data:
import gzip
import io
data = sc.binaryFiles(path)
2. Extract the content:
def ungzip(df):
    compressed_file = io.BytesIO(df)
    decompressed_file = gzip.GzipFile(fileobj=compressed_file)
    return decompressed_file.read()
data = data.map(lambda x: (x[0], ungzip(x[1])))
3. Parse the messages:
class NotEnoughDataException(Exception):
    pass
def _VarintDecoder(mask):
    def DecodeVarint(buffer, pos):
        result = 0
        shift = 0
        while 1:
            if pos > len(buffer) - 1:
                raise NotEnoughDataException("Not enough data to decode varint")
            # Python 3: indexing bytes yields an int (on Python 2 use ord(buffer[pos]))
            b = buffer[pos]
            result |= ((b & 0x7f) << shift)
            pos += 1
            if not (b & 0x80):
                result &= mask
                return (result, pos)
            shift += 7
            if shift >= 64:
                raise ValueError('Too many bytes when decoding varint.')
    return DecodeVarint
def parse_binary(data):
    # data is a (path, decompressed_bytes) pair
    decoder = _VarintDecoder((1 << 64) - 1)
    size, pos = 0, 0
    messages = []
    try:
        while 1:
            # each message is prefixed with its length, encoded as a varint
            size, pos = decoder(data[1], pos)
            messages.append((data[0], data[1][pos:pos + size]))
            pos += size
    except (NotEnoughDataException, ValueError):
        # stop at the end of the buffer or at a malformed varint
        return messages
data = data.flatMap(parse_binary)
After this you have your protobuf messages, one per row, and you can apply your protobuf parsing function to them in parallel.
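To check the decompress-and-split logic locally before running it on a cluster, a small round trip without Spark can help. This is only a sketch: it assumes the file is a gzip stream of records that are each prefixed with a varint length, and encode_varint and split_delimited are hypothetical helper names, not part of any library.

```python
import gzip


def encode_varint(value):
    """Encode a non-negative int as a protobuf-style base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def split_delimited(buffer):
    """Split a buffer of varint-length-prefixed records into payloads."""
    messages = []
    pos = 0
    while pos < len(buffer):
        size, shift = 0, 0
        while True:
            if pos >= len(buffer):
                raise ValueError("truncated varint")
            byte = buffer[pos]
            pos += 1
            size |= (byte & 0x7F) << shift
            if not byte & 0x80:
                break
            shift += 7
        if pos + size > len(buffer):
            raise ValueError("truncated record")
        messages.append(bytes(buffer[pos:pos + size]))
        pos += size
    return messages


# Build a gzip blob of three length-prefixed payloads and read it back.
payloads = [b"first", b"", b"x" * 300]
raw = b"".join(encode_varint(len(p)) + p for p in payloads)
restored = split_delimited(gzip.decompress(gzip.compress(raw)))
print(restored == payloads)
```

Each payload returned by split_delimited would then be handed to the ParseFromString method of your generated message class.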

Related

Python: stream a tarfile to S3 using multipart upload

I would like to create a .tar file in an S3 bucket from Python code running in an AWS Lambda function. Lambda functions are very memory- and disk- constrained. I want to create a .tar file that contains multiple files that are too large to fit in the Lambda function's memory or disk space.
Using "S3 multipart upload," it is possible to upload a large file by uploading chunks of 5MB or more in size. I have this figured out and working. What I need to figure out is how to manage a buffer of bytes in memory that won't grow past the limits of the Lambda function's runtime environment.
I think the solution is to create an io.BytesIO() object and manage both a read pointer and a write pointer. I can then write into the buffer (from files that I want to add to the .tar file) and every time the buffer exceeds some limit (like 5MB) I can read off a chunk of data and send another file part to S3.
What I haven't quite wrapped my head around is how to truncate the part of the buffer that has been read and is no longer needed in memory. I need to trim the head of the buffer, not the tail, so the truncate() function of BytesIO won't work for me.
Is the 'correct' solution to create a new BytesIO buffer, populating it with the contents of the existing buffer from the read pointer to the end of the buffer, when I truncate? Is there a better way to truncate the head of the BytesIO buffer? Is there a better solution than using BytesIO?
For the random Google-r who stumbles onto this question six years in the future and thinks, "man, that describes my problem exactly!", here's what I came up with:
import io
import struct
from tarfile import BLOCKSIZE
# This class was designed to write a .tar file to S3 using multipart upload
# in a memory- and disk-constrained environment, such as AWS Lambda Functions.
#
# Much of this code is copied or adapted from the Python source code tarfile.py
# file at https://github.com/python/cpython/blob/3.10/Lib/tarfile.py
#
# No warranties expressed or implied. Your mileage may vary. Lather, rinse, repeat
class StreamingTarFileWriter:
    # Various constants from tarfile.py that we need
    GNU_FORMAT = 1
    NUL = b"\0"
    BLOCKSIZE = 512
    RECORDSIZE = BLOCKSIZE * 20
    class MemoryByteStream:
        def __init__(self, bufferFullCallback = None, bufferFullByteCount = 0):
            self.buf = io.BytesIO()
            self.readPointer = 0
            self.writePointer = 0
            self.bufferFullCallback = bufferFullCallback
            self.bufferFullByteCount = bufferFullByteCount
        def write(self, buf: bytes):
            self.buf.seek(self.writePointer)
            self.writePointer += self.buf.write(buf)
            bytesAvailableToRead = self.writePointer - self.readPointer
            if self.bufferFullByteCount > 0 and bytesAvailableToRead > self.bufferFullByteCount:
                if self.bufferFullCallback:
                    self.bufferFullCallback(self, bytesAvailableToRead)
        def read(self, byteCount = None):
            self.buf.seek(self.readPointer)
            if byteCount:
                chunk = self.buf.read(byteCount)
            else:
                chunk = self.buf.read()
            self.readPointer += len(chunk)
            self._truncate()
            return chunk
        def size(self):
            return self.writePointer - self.readPointer
        def _truncate(self):
            # Drop the bytes already read by copying the unread tail into a new buffer
            self.buf.seek(self.readPointer)
            self.buf = io.BytesIO(self.buf.read())
            self.readPointer = 0
            self.writePointer = self.buf.seek(0, 2)
    def stn(self, s, length, encoding, errors):
        # Convert a string to a null-terminated bytes object.
        s = s.encode(encoding, errors)
        return s[:length] + (length - len(s)) * self.NUL
    def itn(self, n, digits=8, format=GNU_FORMAT):
        # Convert a python number to a number field.
        # POSIX 1003.1-1988 requires numbers to be encoded as a string of
        # octal digits followed by a null-byte, this allows values up to
        # (8**(digits-1))-1. GNU tar allows storing numbers greater than
        # that if necessary. A leading 0o200 or 0o377 byte indicate this
        # particular encoding, the following digits-1 bytes are a big-endian
        # base-256 representation. This allows values up to (256**(digits-1))-1.
        # A 0o200 byte indicates a positive number, a 0o377 byte a negative
        # number.
        n = int(n)
        if 0 <= n < 8 ** (digits - 1):
            s = bytes("%0*o" % (digits - 1, n), "ascii") + self.NUL
        elif format == self.GNU_FORMAT and -256 ** (digits - 1) <= n < 256 ** (digits - 1):
            if n >= 0:
                s = bytearray([0o200])
            else:
                s = bytearray([0o377])
                n = 256 ** digits + n
            for i in range(digits - 1):
                s.insert(1, n & 0o377)
                n >>= 8
        else:
            raise ValueError("overflow in number field")
        return s
    def calc_chksums(self, buf):
        """Calculate the checksum for a member's header by summing up all
        characters except for the chksum field which is treated as if
        it was filled with spaces. According to the GNU tar sources,
        some tars (Sun and NeXT) calculate chksum with signed char,
        which will be different if there are chars in the buffer with
        the high bit set. So we calculate two checksums, unsigned and
        signed.
        """
        unsigned_chksum = 256 + sum(struct.unpack_from("148B8x356B", buf))
        signed_chksum = 256 + sum(struct.unpack_from("148b8x356b", buf))
        return unsigned_chksum, signed_chksum
    def __init__(self, bufferFullCallback = None, bufferFullByteCount = 0):
        self.buf = self.MemoryByteStream(bufferFullCallback, bufferFullByteCount)
        self.expectedFileSize = 0
        self.fileBytesWritten = 0
        self.offset = 0
    def addFileRecord(self, filename, filesize):
        REGTYPE = b"0" # regular file
        encoding = "utf-8"
        LENGTH_NAME = 100
        GNU_MAGIC = b"ustar  \0" # magic gnu tar string: "ustar", two spaces, NUL (8 bytes)
        errors = "surrogateescape"
        # Copied from TarInfo.tobuf()
        tarinfo = {
            "name": filename,
            "mode": 0o644,
            "uid": 0,
            "gid": 0,
            "size": filesize,
            "mtime": 0,
            "chksum": 0,
            "type": REGTYPE,
            "linkname": "",
            "uname": "",
            "gname": "",
            "devmajor": 0,
            "devminor": 0,
            "magic": GNU_MAGIC
        }
        buf = b""
        if len(tarinfo["name"].encode(encoding, errors)) > LENGTH_NAME:
            raise Exception("Filename is too long for tar file header.")
        devmajor = self.stn("", 8, encoding, errors)
        devminor = self.stn("", 8, encoding, errors)
        parts = [
            self.stn(tarinfo.get("name", ""), 100, encoding, errors),
            self.itn(tarinfo.get("mode", 0) & 0o7777, 8, self.GNU_FORMAT),
            self.itn(tarinfo.get("uid", 0), 8, self.GNU_FORMAT),
            self.itn(tarinfo.get("gid", 0), 8, self.GNU_FORMAT),
            self.itn(tarinfo.get("size", 0), 12, self.GNU_FORMAT),
            self.itn(tarinfo.get("mtime", 0), 12, self.GNU_FORMAT),
            b" " * 8, # checksum field: 8 spaces until the real checksum is known
            tarinfo.get("type", REGTYPE),
            self.stn(tarinfo.get("linkname", ""), 100, encoding, errors),
            tarinfo.get("magic", GNU_MAGIC),
            self.stn(tarinfo.get("uname", ""), 32, encoding, errors),
            self.stn(tarinfo.get("gname", ""), 32, encoding, errors),
            devmajor,
            devminor,
            self.stn(tarinfo.get("prefix", ""), 155, encoding, errors)
        ]
        buf = struct.pack("%ds" % BLOCKSIZE, b"".join(parts))
        chksum = self.calc_chksums(buf[-BLOCKSIZE:])[0]
        # Overwrite the first 7 bytes of the checksum field (header offset 148)
        buf = buf[:-364] + bytes("%06o\0" % chksum, "ascii") + buf[-357:]
        self.buf.write(buf)
        self.expectedFileSize = filesize
        self.fileBytesWritten = 0
        self.offset += len(buf)
    def addFileData(self, buf):
        self.buf.write(buf)
        self.fileBytesWritten += len(buf)
        self.offset += len(buf)
    def completeFileRecord(self):
        if self.fileBytesWritten != self.expectedFileSize:
            raise Exception(f"Expected {self.expectedFileSize:,} bytes but {self.fileBytesWritten:,} were written.")
        # Pad the file data with NULs up to the next 512-byte block boundary
        blocks, remainder = divmod(self.fileBytesWritten, BLOCKSIZE)
        if remainder > 0:
            self.buf.write(self.NUL * (BLOCKSIZE - remainder))
            self.offset += BLOCKSIZE - remainder
    def completeTarFile(self):
        # Two empty blocks mark the end of the archive, then pad to a full record
        self.buf.write(self.NUL * (BLOCKSIZE * 2))
        self.offset += (BLOCKSIZE * 2)
        blocks, remainder = divmod(self.offset, self.RECORDSIZE)
        if remainder > 0:
            self.buf.write(self.NUL * (self.RECORDSIZE - remainder))
An example use of the class is:
import random
OUTPUT_CHUNK_SIZE = 1024 * 1024 * 5
f_out = open("test.tar", "wb")
def get_file_block(blockNum):
    # Builds one 512-byte block: "block_" + 10 characters + 31 * 16 characters
    block = f"block_{blockNum:010,}"
    block += "0123456789abcdef" * 31
    return bytes(block, 'ascii')
def buffer_full_callback(x: StreamingTarFileWriter.MemoryByteStream, bytesAvailable: int):
    while x.size() > OUTPUT_CHUNK_SIZE:
        buf = x.read(OUTPUT_CHUNK_SIZE)
        # This is where you would write the chunk to S3
        f_out.write(buf)
x = StreamingTarFileWriter(buffer_full_callback, OUTPUT_CHUNK_SIZE)
numFiles = random.randint(3,8)
print(f"Creating {numFiles:,} files.")
for fileIdx in range(numFiles):
    minSize = 1025 # 1kB plus 1 byte
    maxSize = 10 * 1024 * 1024 * 1024 + 5 # 10GB plus 5 bytes
    numBytes = random.randint(minSize, maxSize)
    print(f"Creating file {str(fileIdx)} with {numBytes:,} bytes.")
    blocks,remainder = divmod(numBytes, 512)
    x.addFileRecord(f"File{str(fileIdx)}", numBytes)
    for block in range(blocks):
        x.addFileData(get_file_block(block))
    x.addFileData(bytes(("X" * remainder), 'ascii'))
    x.completeFileRecord()
# Write the end-of-archive blocks, flush what is still buffered, and close the output
x.completeTarFile()
f_out.write(x.buf.read())
f_out.close()
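On the narrower question of trimming the head of the buffer: one alternative to rebuilding a BytesIO on every read is a bytearray, which supports deleting a leading slice in place. The sketch below is only an illustration of that idea, and HeadTrimBuffer is a hypothetical name, not something from the standard library.

```python
class HeadTrimBuffer:
    """FIFO byte buffer: append at the tail, consume from the head."""

    def __init__(self):
        self._data = bytearray()

    def write(self, chunk):
        # Appending to a bytearray extends it in place.
        self._data += chunk
        return len(chunk)

    def read(self, count=None):
        if count is None:
            count = len(self._data)
        chunk = bytes(self._data[:count])
        del self._data[:count]  # drop the consumed head in place
        return chunk

    def __len__(self):
        return len(self._data)


buf = HeadTrimBuffer()
buf.write(b"abcdef")
first = buf.read(4)   # consumes the first four bytes
buf.write(b"gh")
rest = buf.read()     # consumes everything that is left
print(first, rest, len(buf))
```

The buffer never holds more than the unread bytes, so its peak size is bounded by how much is written between reads.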

How to generate all possible binary strings from a clue effectively?

I can get a clue in the form of a list (e.g. [1,3,1]) and the length of the string (e.g. 8) and I generate all possible strings given the clue. That is:
01011101
10111010
10111001
10011101
Having three groups of 1s separated by one or more 0s of length given by the clue (in that order).
The clue specifies lengths of groups of 1s separated by at least one 0. The order of these groups must follow the order in the clue list.
My approach would be to use recursion, where each call tries to insert a specific group of 1s in the string (in the order of the clue list). It uses a for-loop to place it in all possible indices of the string and recursively calls itself for each of these placements with a clue = clue[1:] and size = size - clue[0].
How can I do that effectively in Python?
I would just use combinations_with_replacement to generate all possible ways of distributing the spare zeros, and build the answers from those.
from itertools import combinations_with_replacement
from collections import Counter
def generate_answers(clue, length):
    segs = len(clue) + 1 # segment indices that a zero can be placed
    excess_zeros = length - sum(clue) - (segs - 2) # number of zeros that can be moved around
    for comb in combinations_with_replacement(range(segs), excess_zeros):
        count = Counter(comb) # number of zeros to be added to each segment
        for i in range(1, len(clue)):
            count[i] += 1 # add the zeros required to separate the ones
        output = ''
        for i in range(segs): # build string
            output += '0' * count[i]
            if i < len(clue):
                output += '1' * clue[i]
        print(output)
clue = [1, 3, 1]
length = 8
generate_answers(clue, length)
Output:
01011101
10011101
10111001
10111010
This is another way to do it (recursively) without external libraries:
def generate_answers(clue, size):
    if len(clue) == 0:
        return [[0 for _ in range(size)]]
    groups = gen_groups(clue)
    min_len = sum(clue) + len(clue) - 1
    free_spaces = size - min_len
    result = recursive_combinations(len(groups) + 1, free_spaces)
    solution = []
    for res in result:
        in_progress = []
        for i in range(len(groups)):
            in_progress += [0 for _ in range(res[i])]
            in_progress += groups[i]
        in_progress += [0 for _ in range(res[-1])]
        solution.append(in_progress)
    return solution
def gen_groups(clue):
    # Each group is its run of 1s followed by the mandatory separating 0,
    # except the last group, which has no trailing 0.
    result = []
    for elem in clue:
        in_progress = []
        for i in range(elem):
            in_progress.append(1)
        in_progress.append(0)
        result.append(in_progress)
    if len(result) > 0:
        result[-1].pop()
    return result
def recursive_combinations(fields, zeroes):
    # All ways to distribute `zeroes` spare zeros over `fields` gaps
    if fields <= 0 or zeroes < 0:
        return []
    if fields == 1:
        return [[zeroes]]
    solution = []
    for i in range(zeroes + 1):
        result = recursive_combinations(fields - 1, zeroes - i)
        solution += [[i] + res for res in result]
    return solution
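The recursion described in the question (place the first group at every feasible index, then recurse on the rest of the clue and the remaining length) can also be written as a generator, so the strings are produced lazily. This is a sketch rather than a tuned implementation, and placements is a hypothetical name.

```python
def placements(clue, size):
    """Yield every string of `size` bits whose runs of 1s match `clue`."""
    if not clue:
        yield "0" * size
        return
    first, rest = clue[0], clue[1:]
    # Space the remaining groups need: their lengths plus one separator each.
    tail_min = sum(rest) + len(rest)
    # If the clue cannot fit, the range is empty and nothing is yielded.
    for start in range(size - first - tail_min + 1):
        head = "0" * start + "1" * first
        if rest:
            head += "0"  # mandatory separator before the next group
        for tail in placements(rest, size - len(head)):
            yield head + tail


for answer in placements([1, 3, 1], 8):
    print(answer)
```

Because the start index is capped by the space the remaining groups need, no dead-end branches are explored.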

Python: Fastest way to subtract elements of datasets of HDF5 file?

Here is one interesting problem.
Input: Input is two arrays (Nx4, sorted in column-2) stored in datasets-1 and 2 in HDF5 file (input.h5). N is huge (originally belonging to 10 GB of file, hence stored in HDF5 file).
Output: Subtracting each column-2 element of dataset-2 from dataset-1, such that the difference (delta) is between +/-4000. Eventually saving this info in dset of a new HDF5 file. I need to refer to this new file back-and-forth, hence HDF5 not a text file.
Concern: I initially used .append method but that crashed the execution for 10GBs input. So, I am now using dset.resize method (and would like to stick to it preferably). I am also using binary search as I was told in one of my last posts. So now, although the script seems to be working for large (10 GBs) of datasets, it is quite slow! The subtraction (for/while) loop is possibly the culprit! Any suggestions on how I can make this fast? I aim to use the fastest approach (and possibly the simplest, since I am a beginner).
import numpy as np
import time
import h5py
import sys
import csv
f_r = h5py.File('input.h5', 'r+')
dset1 = f_r.get('dataset_1')
dset2 = f_r.get('dataset_2')
r1,c1 = dset1.shape
r2,c2 = dset2.shape
left, right, count = 0,0,0
W = 4000 # Window half-width
n = 1
# **********************************************
# HDF5 Out Creation
# **********************************************
f_w = h5py.File('data.h5', 'w')
d1 = np.zeros(shape=(0, 4))
dset = f_w.create_dataset('dataset_1', data=d1, maxshape=(None, None), chunks=True)
for j in range(r1):
    e1 = dset1[j,1]
    # move left pointer so that it is within -W of e1
    while left < r2 and dset2[left,1] - e1 <= -W:
        left += 1
    # move right pointer so that it is outside of +W
    while right < r2 and dset2[right,1] - e1 <= W:
        right += 1
    for i in range(left, right):
        delta = e1 - dset2[i,1]
        dset.resize(dset.shape[0] + n, axis=0)
        dset[count, 0:4] = [count, dset1[j,1], dset2[i,1], delta]
        count += 1
print("\nFinal shape of dataset created: " + str(dset.shape))
f_w.close()
EDIT (Aug 8, chunking the HDF5 file as suggested by @kcw78)
@kcw78: I tried chunking as well. The following works well for small files (<100 MB), but the computation time grows enormously once I work with GBs of data. Can something be improved in my code to make it fast?
My suspicion is that the for j loop is computationally expensive and may be the reason. Any suggestions?
filename = 'file.h5'
with h5py.File(filename, 'r') as fid:
    chunks1 = fid["dataset_1"][:, :]
with h5py.File(filename, 'r') as fid:
    chunks2 = fid["dataset_2"][:, :]
print(chunks1.shape, chunks2.shape) # shape is (13900,4) and (138676,4)
count = 0
W = 4000 # Window half-width
# **********************************************
# HDF5-Out Creation
# **********************************************
f_w = h5py.File('data.h5', 'w')
d1 = np.zeros(shape=(0, 4))
dset = f_w.create_dataset('dataset_1', data=d1, maxshape=(None, None), chunks=True)
# chunk size to read from first/second dataset
size1 = 34850
size2 = 34669
# save "n" no. of subtracted values in dset
n = 10**4
u = 0
fill_index = 0
for c in range(4): # read 4 chunks of dataset-1 one-by-one
    h = c * size1
    chunk1 = chunks1[h:(h + size1)]
    for d in range(4): # read chunks of dataset-2
        g = d * size2
        chunk2 = chunks2[g:(g + size2)]
        r2 = chunk2.shape[0]
        left, right = 0, 0
        for j in range(chunk1.shape[0]): # grab col.2 values from dataset-1
            e1 = chunk1[j, 1]
            while left < r2 and chunk2[left, 1] - e1 <= -W:
                left += 1
            # move right pointer so that it is outside of +W
            while right < r2 and chunk2[right, 1] - e1 <= W:
                right += 1
            for i in range(left, right):
                if chunk1[j, 0] < 8193 and chunk2[i, 0] < 8193:
                    e2 = chunk2[i, 1]
                    delta = e1 - e2 # subtract col.2 values
                    count += 1
                    if fill_index == (n):
                        dset.resize(dset.shape[0] + n, axis=0)
                        dset[u:(u + n), 0:4] = [count, e1, e1, delta]
                        u = u * n
                        fill_index = 0
                    fill_index += 1
        del chunk2
    del chunk1
f_w.close()
print(count) # these are (no. of) subtracted values such that the difference is between +/- 4000
EDIT (Jul 31)
I tried reading in chunks and even using memory mapping. It is efficient if I do not perform any subtraction and just go through the chunks. The "for j in range(m):" loop is the inefficient part, probably because I am grabbing each value of the chunk from file-1 one at a time. This is when I am just subtracting and not saving the difference. Can you think of better logic or a better implementation to replace "for j in range(m):"?
size1 = 100_000_0
size2 = 100_000_0
filename = ["file-1.txt", "file-2.txt"]
chunks1 = pd.read_csv(filename[0], chunksize=size1,
                      names=['c1', 'c2', 'lt', 'rt'])
fp1 = np.memmap('newfile1.dat', dtype='float64', mode='w+', shape=(size1,4))
fp2 = np.memmap('newfile2.dat', dtype='float64', mode='w+', shape=(size2,4))
for chunk1 in chunks1: # grab chunks from file-1
    m, _ = chunk1.shape
    fp1[0:m,:] = chunk1
    chunks2 = pd.read_csv(filename[1], chunksize=size2,
                          names=['ch', 'tmstp', 'lt', 'rt'])
    for chunk2 in chunks2: # grab chunks from file-2
        k, _ = chunk2.shape
        fp2[0:k, :] = chunk2
        for j in range(m): # Grabbing values from file-1's chunk
            e1 = fp1[j,1]
            delta_mat = e1 - fp2 # just a test, actually e1 should be subtracted from col-2 of fp2, not the whole fp2
            count += 1
        fp2.flush()
        a += k
    fp1.flush()
    del chunks2
    i += m
    prog_count += m
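One direction that usually removes the per-element Python loop is to let NumPy find the window bounds for a whole batch at once with np.searchsorted, since both columns are sorted. The following is a sketch under assumptions, not a drop-in replacement: windowed_differences is a hypothetical helper, it expects the two sorted columns as in-memory 1-D arrays (for example a slice such as dset1[start:stop, 1]), and it uses the same window as the loops above (dataset-2 value minus e1 greater than -W and at most +W).

```python
import numpy as np


def windowed_differences(col1, col2, half_width):
    """Return rows (e1, e2, e1 - e2) for every pair inside the window.

    Both inputs must be 1-D arrays sorted in ascending order.
    """
    col1 = np.asarray(col1)
    col2 = np.asarray(col2)
    # For each e1 the matching slice of col2 is [left, right),
    # i.e. e1 - W < e2 <= e1 + W, as in the two while loops.
    left = np.searchsorted(col2, col1 - half_width, side="right")
    right = np.searchsorted(col2, col1 + half_width, side="right")
    counts = right - left
    # Repeat each e1 once per match and build the matching col2 indices.
    rep1 = np.repeat(col1, counts)
    starts = np.repeat(left, counts)
    group_start = np.repeat(np.cumsum(counts) - counts, counts)
    offsets = np.arange(counts.sum()) - group_start
    matched2 = col2[starts + offsets]
    return np.column_stack((rep1, matched2, rep1 - matched2))


rows = windowed_differences([10, 20], [5, 12, 18, 30], 5)
print(rows)
```

The number of output rows can be far larger than the inputs, so for GB-sized data it would make sense to feed col1 in batches and append each batch to the output dataset with a single resize, instead of resizing once per row.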

Python Image Compression

I am using the Pillow library of Python to read in image files. How can I compress and decompress using Huffman encoding? Here is an instruction:
You have been given a set of example images and your goal is to compress them as much as possible without losing any perceptible information; upon decompression they should appear identical to the original images. Images are essentially stored as a series of points of color, where each point is represented as a combination of red, green, and blue (rgb). Each component of the rgb value ranges between 0-255, so for example: (100, 0, 200) would represent a shade of purple. Using a fixed-length encoding, each component of the rgb value requires 8 bits to encode (2^8 = 256), meaning that the entire rgb value requires 24 bits to encode. You could use a compression algorithm like Huffman encoding to reduce the number of bits needed for more common values and thereby reduce the total number of bits needed to encode your image.
# For my current code I just read the image, get all the rgb and build the tree
from PIL import Image
import sys, string
import copy
codes = {}
def sortFreq(freqs):
    letters = freqs.keys()
    tuples = []
    for let in letters:
        tuples.append((freqs[let], let)) # append takes a single (freq, value) tuple
    tuples.sort()
    return tuples
def buildTree(tuples):
    while len(tuples) > 1:
        leastTwo = tuple(tuples[0:2]) # get the 2 to combine
        theRest = tuples[2:] # all the others
        combFreq = leastTwo[0][0] + leastTwo[1][0] # the branch points freq
        tuples = theRest + [(combFreq, leastTwo)] # add branch point to the end
        # sort by frequency only: on a tie, Python 3 cannot compare a leaf string with a branch tuple
        tuples.sort(key=lambda t: t[0])
    return tuples[0] # Return the single tree inside the list
def trimTree(tree):
    # Trim the freq counters off, leaving just the letters
    p = tree[1] # ignore freq count in [0]
    if type(p) == type(""):
        return p # if just a leaf, return it
    else:
        return (trimTree(p[0]), trimTree(p[1])) # trim left then right and recombine
def assignCodes(node, pat=''):
    global codes
    if type(node) == type(""):
        codes[node] = pat # A leaf. Set its code
    else:
        assignCodes(node[0], pat+"0") # Branch point. Do the left branch
        assignCodes(node[1], pat+"1") # then do the right branch.
dictionary = {}
table = {}
image = Image.open('fall.bmp')
#image.show()
width, height = image.size
px = image.load()
totalpixel = width*height
print("Total pixel: " + str(totalpixel))
# Count how often each 0-255 component value occurs across all pixels
for x in range(width):
    for y in range(height):
        # print(px[x, y])
        for i in range(3):
            if dictionary.get(str(px[x, y][i])) is None:
                dictionary[str(px[x, y][i])] = 1
            else:
                dictionary[str(px[x, y][i])] = dictionary[str(px[x, y][i])] + 1
table = copy.deepcopy(dictionary)
#combination = len(dictionary)
#for value in table:
#    table[value] = table[value] / (totalpixel * combination) * 100
#print(table)
print(dictionary)
sortdic = sortFreq(dictionary)
tree = buildTree(sortdic)
trim = trimTree(tree)
print(trim)
assignCodes(trim)
print(codes)
The class HuffmanCoding takes complete path of the text file to be compressed as parameter. (as its data members store data specific to the input file).
The compress() function returns the path of the output compressed file.
The function decompress() requires path of the file to be decompressed. (and decompress() is to be called from the same object created for compression, so as to get code mapping from its data members)
import heapq
import os
class HeapNode:
    def __init__(self, char, freq):
        self.char = char
        self.freq = freq
        self.left = None
        self.right = None
    # Python 3 ignores __cmp__; heapq needs __lt__ to order nodes by frequency
    def __lt__(self, other):
        return self.freq < other.freq
    def __eq__(self, other):
        if other is None or not isinstance(other, HeapNode):
            return False
        return self.freq == other.freq
class HuffmanCoding:
    def __init__(self, path):
        self.path = path
        self.heap = []
        self.codes = {}
        self.reverse_mapping = {}
    # functions for compression:
    def make_frequency_dict(self, text):
        frequency = {}
        for character in text:
            if not character in frequency:
                frequency[character] = 0
            frequency[character] += 1
        return frequency
    def make_heap(self, frequency):
        for key in frequency:
            node = HeapNode(key, frequency[key])
            heapq.heappush(self.heap, node)
    def merge_nodes(self):
        # Repeatedly merge the two least frequent nodes until one tree remains
        while(len(self.heap)>1):
            node1 = heapq.heappop(self.heap)
            node2 = heapq.heappop(self.heap)
            merged = HeapNode(None, node1.freq + node2.freq)
            merged.left = node1
            merged.right = node2
            heapq.heappush(self.heap, merged)
    def make_codes_helper(self, root, current_code):
        if(root is None):
            return
        if(root.char is not None):
            self.codes[root.char] = current_code
            self.reverse_mapping[current_code] = root.char
            return
        self.make_codes_helper(root.left, current_code + "0")
        self.make_codes_helper(root.right, current_code + "1")
    def make_codes(self):
        root = heapq.heappop(self.heap)
        current_code = ""
        self.make_codes_helper(root, current_code)
    def get_encoded_text(self, text):
        encoded_text = ""
        for character in text:
            encoded_text += self.codes[character]
        return encoded_text
    def pad_encoded_text(self, encoded_text):
        # Pad to a multiple of 8 bits and store the pad length in the first byte
        extra_padding = 8 - len(encoded_text) % 8
        for i in range(extra_padding):
            encoded_text += "0"
        padded_info = "{0:08b}".format(extra_padding)
        encoded_text = padded_info + encoded_text
        return encoded_text
    def get_byte_array(self, padded_encoded_text):
        if(len(padded_encoded_text) % 8 != 0):
            print("Encoded text not padded properly")
            exit(0)
        b = bytearray()
        for i in range(0, len(padded_encoded_text), 8):
            byte = padded_encoded_text[i:i+8]
            b.append(int(byte, 2))
        return b
    def compress(self):
        filename, file_extension = os.path.splitext(self.path)
        output_path = filename + ".bin"
        with open(self.path, 'r+') as file, open(output_path, 'wb') as output:
            text = file.read()
            text = text.rstrip()
            frequency = self.make_frequency_dict(text)
            self.make_heap(frequency)
            self.merge_nodes()
            self.make_codes()
            encoded_text = self.get_encoded_text(text)
            padded_encoded_text = self.pad_encoded_text(encoded_text)
            b = self.get_byte_array(padded_encoded_text)
            output.write(bytes(b))
        print("Compressed")
        return output_path
    """ functions for decompression: """
    def remove_padding(self, padded_encoded_text):
        padded_info = padded_encoded_text[:8]
        extra_padding = int(padded_info, 2)
        padded_encoded_text = padded_encoded_text[8:]
        encoded_text = padded_encoded_text[:-1*extra_padding]
        return encoded_text
    def decode_text(self, encoded_text):
        current_code = ""
        decoded_text = ""
        for bit in encoded_text:
            current_code += bit
            if(current_code in self.reverse_mapping):
                character = self.reverse_mapping[current_code]
                decoded_text += character
                current_code = ""
        return decoded_text
    def decompress(self, input_path):
        filename, file_extension = os.path.splitext(self.path)
        output_path = filename + "_decompressed" + ".txt"
        with open(input_path, 'rb') as file, open(output_path, 'w') as output:
            bit_string = ""
            byte = file.read(1)
            # In binary mode read() returns b"" at end of file, never the str ""
            while(len(byte) > 0):
                byte = ord(byte)
                bits = bin(byte)[2:].rjust(8, '0')
                bit_string += bits
                byte = file.read(1)
            encoded_text = self.remove_padding(bit_string)
            decompressed_text = self.decode_text(encoded_text)
            output.write(decompressed_text)
        print("Decompressed")
        return output_path
Running the program:
Save the above code, in a file huffman.py.
Create a sample text file. Or download a sample file from sample.txt (right click, save as)
Save the code below, in the same directory as the above code, and Run this python code (edit the path variable below before running. initialize it to text file path)
UseHuffman.py
from huffman import HuffmanCoding
#input file path
path = "/home/ubuntu/Downloads/sample.txt"
h = HuffmanCoding(path)
output_path = h.compress()
h.decompress(output_path)
The compressed .bin file and the decompressed file are both saved in the same directory as of the input file.
Result
On running on the above linked sample text file:
Initial Size: 715.3 kB
Compressed file Size: 394.0 kB
Plus, the decompressed file comes out to be exactly the same as the original file, without any data loss.
And that is all for Huffman Coding implementation, with compression and decompression. This was fun to code.
The above program requires the decompression function to be run on the same object that created the compressed file, because the code mapping is stored in its data members. Compression and decompression can also be made independent if the mapping info is stored in the compressed file itself, at the beginning. Then, during decompression, we first read the mapping info from the file and use it to decompress the rest of the file.
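A minimal sketch of that idea follows, working on raw bytes so it would also apply to the rgb component values from the question. It is an illustration under assumptions rather than the author's implementation: the function names and the file layout (header length, bit count, JSON code table, packed bits) are made up for this example.

```python
import heapq
import json
import struct
from collections import Counter


def build_codes(data):
    """Return {byte_value: bit_string} for the bytes in `data`."""
    freq = Counter(data)
    if len(freq) == 1:  # a lone symbol still needs a one-bit code
        return {next(iter(freq)): "0"}
    # Heap entries: (frequency, tie_breaker, {symbol: code_so_far}).
    # The tie breaker is the smallest symbol in the subtree, so it is unique.
    heap = [(n, sym, {sym: ""}) for sym, n in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        n1, t1, left = heapq.heappop(heap)
        n2, t2, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (n1 + n2, min(t1, t2), merged))
    return heap[0][2]


def compress(data):
    """Pack `data` together with its code table so it decodes on its own."""
    if not data:
        return struct.pack(">IQ", 0, 0)
    codes = build_codes(data)
    bits = "".join(codes[b] for b in data)
    header = json.dumps({str(s): c for s, c in codes.items()}).encode("ascii")
    padded = bits + "0" * (-len(bits) % 8)
    body = int(padded, 2).to_bytes(len(padded) // 8, "big")
    # Layout: header length, bit count, JSON code table, packed bits.
    return struct.pack(">IQ", len(header), len(bits)) + header + body


def decompress(blob):
    """Rebuild the original bytes using only what is stored in `blob`."""
    header_len, bit_count = struct.unpack_from(">IQ", blob)
    offset = struct.calcsize(">IQ")
    if bit_count == 0:
        return b""
    table = json.loads(blob[offset:offset + header_len])
    decode = {code: int(sym) for sym, code in table.items()}
    body = blob[offset + header_len:]
    bits = bin(int.from_bytes(body, "big"))[2:].zfill(len(body) * 8)[:bit_count]
    out, current = bytearray(), ""
    for bit in bits:
        current += bit
        if current in decode:
            out.append(decode[current])
            current = ""
    return bytes(out)


sample = bytes([100, 0, 200] * 50 + [7])
print(decompress(compress(sample)) == sample)
```

Storing the exact bit count in the header avoids the separate padding byte, and a JSON table is easy to inspect, though a more compact table format would save space on small inputs.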

Python - binary to num and num to binary - Wrong Output

While working on some complex project, I came across an interesting bug:
Code reads the file, converts binary to integers, writes to the file.
Other fellow reads this file and converts integers to binary and writes to a file.
Ideally, input file and converted files should be same. But that is not happening.
Please find the code below:
# read file -> convert to binary -> binary to num -> write file
def bits(f):
    byte = (ord(b) for b in f.read())
    for b in byte:
        bstr = []
        for i in range(8):
            bstr.append( (b >> i) & 1)
        yield bstr
def binaryToNum(S):
    bits = len(S)
    if (S==''): return 0
    elif (S[0] == '0'): return binaryToNum(S[1:])
    elif (S[0] == '1'): return ((2**(bits-1))) + binaryToNum(S[1:])
bstr = []
for b in bits(open('input_test', 'r')):
    bstr.append(b)
dstr = ''
for i in bstr:
    b_num = str(binaryToNum(''.join(str(e) for e in i))).zfill(6)
    dstr = dstr + b_num
ter = open('im1', 'w')
for item in dstr:
    ter.write(item)
ter.close()
This part seems correct, I checked manually for a-z, A-Z and 0-9
The code on other machine does this:
def readDecDataFromFile(filename):
    data = []
    with open(filename) as f:
        data = data + f.readlines()
    chunks, chunk_size = len(data[0]), 6
    return [ data[0][i:i+chunk_size] for i in range(0, chunks, chunk_size) ]
def numToBinary(N):
    return str(int(bin(int(N))[2:]))
ddata = readDecDataFromFile('im1')
bytes = []
for d in ddata:
    bits = numToBinary(d)
    bytes.append(int(bits[::-1], 2).to_bytes(1, 'little'))
f = open('orig_input', 'wb')
for b in bytes:
    f.write(b)
f.close()
And here is the output:
input_test: my name is XYZ
orig_input: my7ameisY-
bytes list in last code yields:
[b'm', b'y', b'\x01', b'7', b'a', b'm', b'e', b'\x01', b'i', b's', b'\x01', b'\x0b', b'Y', b'-', b'\x05']
What could be the potential error?
Two modifications are required.
1. In the bits function, the bits are currently read least significant bit first (little-endian order). To read them most significant bit first, iterate over
reversed(range(8))
instead of range(8).
2. When converting from bits to bytes at write time, the bit string is reversed. That is not needed, so the code changes from
bytes.append(int(bits[::-1], 2).to_bytes(1, 'little'))
to
bytes.append(int(bits, 2).to_bytes(1, 'little'))
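With both changes applied, the round trip can be checked in a few lines. The helper names below are hypothetical and the sketch works on bytes in memory rather than on files. Note that bin() drops leading zeros, which is why reversing its output does not undo an eight-bit reversal.

```python
def byte_to_bits(value):
    """Return the 8 bits of `value`, most significant bit first."""
    return [(value >> i) & 1 for i in reversed(range(8))]


def bits_to_num(bit_list):
    """Interpret a list of bits (MSB first) as an integer."""
    return int("".join(str(b) for b in bit_list), 2)


def encode(data, width=6):
    """Write each byte as a zero-padded decimal field of `width` digits."""
    return "".join(str(bits_to_num(byte_to_bits(b))).zfill(width) for b in data)


def decode(text, width=6):
    """Read the fixed-width decimal fields back into bytes."""
    return bytes(int(text[i:i + width]) for i in range(0, len(text), width))


original = b"my name is XYZ"
encoded = encode(original)
print(encoded[:12])
print(decode(encoded) == original)
```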
