How to append chunks of a pickled pandas DataFrame - python-3.x

I have pickled a pandas DataFrame on my server and I'm sending it over a socket connection. I can receive the data, but I can't seem to append the chunks back together into the original DataFrame format, which is all I'm trying to achieve. I have a feeling it's the way I'm appending, since the result turns into a list because of data = [], but I tried an empty pandas DataFrame and that didn't work either, so I'm a bit lost as to how I'll append these values.
data = []
FATPACKET = 0
bytelength = self.s.recv(BUFFERSIZE)
length = int(pickle.loads(bytelength))
print(length)
ammo = 0
while True:
    print("Getting Data....")
    packet = self.s.recv(1100)
    FATPACKET = int(sys.getsizeof(packet))
    ammo += FATPACKET
    print(str(FATPACKET) + ' Got this much of data out of ' + str(length))
    print("Getting Data.....")
    data.append(packet)
    print(ammo)
    if not ammo > length:
        break
print(data)
unpickled = pickle.loads(data)
self.s.close()
print("Closing Connection!")
print(unpickled)
When I try this code, I'm constantly running into this:
TypeError: a bytes-like object is required, not 'list'
or I run into this:
_pickle.UnpicklingError: invalid load key, '\x00'
which is the first couple of bytes of my pickled DataFrame. Sorry, this is my first time messing around with the pickle module, so I'm not very knowledgeable.

It would help if we could also see exactly what you're doing on the sending end. However, it's apparent that you have several problems.
First, in the initial recv, it's obvious you intended that to only obtain the initial pickle object you used to encode the length of the remaining bytes. However, that recv might also receive an initial segment of the remaining bytes (or even all of the remaining bytes, depending on how large that is). So how much of it should you give to the initial pickle.loads?
You would be better off creating a fixed length field to contain the size of the remaining data. That is often done with the struct module. On the sending side:
import struct
# Pickle the data to be sent
data = pickle.dumps(python_obj)
data_len = len(data)
# Pack data length into 4 bytes, encoded in network byte order
data_len_to_send = struct.pack('!I', data_len)
# Send exactly 4 bytes (due to 'I'), then send data
conn.sendall(data_len_to_send)
conn.sendall(data)
On the receiving side, as the exception said, pickle.loads takes a byte string not a list. So part of solving this will be to concatenate all the list elements into a single byte string before calling loads:
unpickled = pickle.loads(b''.join(data))
Other issues on receiving side: use len(packet) to get the buffer size. sys.getsizeof provides the internal memory used by the bytes object which includes unspecified interpreter overhead and isn't what you need here.
After recv, the first thing you should do is check for an empty buffer, which indicates end-of-stream (len(packet) == 0, packet == b'', or simply not packet). That would happen, for example, if the sender got killed before completing the send (or the network link goes down, or there's some bug on the sender side, etc.). Otherwise, if the connection ends prematurely, your program will never reach the break, and hence it will sit in a very tight infinite loop.
So, altogether you could do something like this:
# First, obtain the fixed-size content length
buf = b''
while len(buf) < 4:
    tbuf = self.s.recv(4 - len(buf))
    if tbuf == b'':
        raise RuntimeError("Lost connection with peer")
    buf += tbuf

# Decode (unpack) the length (note that unpack returns a tuple)
len_to_recv = struct.unpack('!I', buf)[0]

data = []
len_recvd = 0
while len_recvd < len_to_recv:
    buf = self.s.recv(min(len_to_recv - len_recvd, BUFFERSIZE))
    if buf == b'':
        raise RuntimeError("Lost connection with peer")
    data.append(buf)
    len_recvd += len(buf)

unpickled_obj = pickle.loads(b''.join(data))
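Putting both sides together, here is a minimal, runnable sketch of this length-prefix framing, using socket.socketpair so sender and receiver live in one process and a plain dict standing in for the pickled DataFrame (the helper names send_msg, recv_exact, and recv_msg are illustrative, not part of any library):

```python
import pickle
import socket
import struct

def send_msg(conn, obj):
    # Pickle the object, then prefix it with a 4-byte length in network byte order.
    data = pickle.dumps(obj)
    conn.sendall(struct.pack('!I', len(data)) + data)

def recv_exact(conn, n):
    # Keep calling recv until exactly n bytes have arrived.
    chunks = []
    remaining = n
    while remaining > 0:
        buf = conn.recv(remaining)
        if not buf:
            raise RuntimeError("Lost connection with peer")
        chunks.append(buf)
        remaining -= len(buf)
    return b''.join(chunks)

def recv_msg(conn):
    # Read the 4-byte header first, then exactly that many payload bytes.
    (length,) = struct.unpack('!I', recv_exact(conn, 4))
    return pickle.loads(recv_exact(conn, length))

# Demonstrate a round trip over a connected socket pair.
left, right = socket.socketpair()
send_msg(left, {'rows': [1, 2, 3]})
print(recv_msg(right))  # {'rows': [1, 2, 3]}
left.close()
right.close()
```

The same send_msg/recv_msg pair works unchanged for a pickled DataFrame, as long as the message fits in 4 bytes' worth of length (about 4 GiB).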

Related

Python 3: for x in bytes as words

I have a python3 script which reads data into a buffer with
fp = open("filename", 'rb')
data = fp.read(count)
I don't fully understand (even after reading the documentation) what read() returns. It appears to be some kind of binary data which is iterable. But it is not a list.
Confusingly, elsewhere in the script, lists are used for binary data.
frames = []
# then later... inside a loop
for ...
data = b''.join(frames)
Regardless... I want to iterate over the object returned by read() in units of word (aka 2 byte blocks)
At the moment the script contains this for loop
for c in data:
# do something
Is it possible to change c such that this loop iterates over words (2 byte blocks) rather than individual bytes?
I cannot use read() in a loop to read 2 bytes at a time.
We can explicitly read (up to) n bytes from a file in binary mode with .read(n) (just as it would read n Unicode code points from a file opened in text mode). This is a blocking call and will only read fewer bytes at the end of the file.
We can use the two-argument form of iter to build an iterator that repeatedly calls a callable:
>>> help(iter)
Help on built-in function iter in module builtins:
iter(...)
iter(iterable) -> iterator
iter(callable, sentinel) -> iterator
Get an iterator from an object. In the first form, the argument must
supply its own iterator, or be a sequence.
In the second form, the callable is called until it returns the sentinel.
read at the end of the file will start returning empty results and not raise an exception, so we can use that for our sentinel.
Putting it together, we get:
for pair in iter(lambda: fp.read(2), b''):
    ...  # pair holds the next (up to) two bytes
Inside the loop, we will get bytes objects that represent two bytes of data. You should check the documentation to understand how to work with these.
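A runnable sketch of this pattern, using io.BytesIO in place of the file opened with open("filename", 'rb'):

```python
import io

# io.BytesIO stands in for a file opened in binary mode.
fp = io.BytesIO(b"\x00\x01\x02\x03\x04")

# iter(callable, sentinel) calls fp.read(2) until it returns b'' (end of file).
words = []
for pair in iter(lambda: fp.read(2), b""):
    words.append(pair)

print(words)  # [b'\x00\x01', b'\x02\x03', b'\x04']
```

Note the final element can be a single byte when the data length is odd.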
When reading a file in binary mode, a bytes object is returned, which is one of the standard Python builtins. In general, its representation in code looks like that of a string, except that it is prefixed with b, as in b" ". When you try printing it, each byte may be displayed either as an escape like \x** (where ** are two hex digits corresponding to the byte's value from 0 to 255) or directly as a single printable ASCII character with that code point. You can read more about this, and about the methods of bytes (similar to those of strings), in the bytes docs.
There already seems to be a very popular question on stack overflow about how to iterate over a bytes object. The currently accepted answer gives this example for creating a list of individual bytes in the bytes object :
L = [bytes_obj[i:i+1] for i in range(len(bytes_obj))]
I suppose that modifying it like this will work for you :
L = [bytes_obj[i:i+2] for i in range(0, len(bytes_obj), 2)]
For example :
by = b"\x00\x01\x02\x03\x04\x05\x06"
# The object returned by file.read() is also bytes, like the one above
words = [by[i:i+2] for i in range(0, len(by), 2)]
print(words)
# Output --> [b'\x00\x01', b'\x02\x03', b'\x04\x05', b'\x06']
Or create a generator that yields words in the same way if your list is likely to be too large to efficiently store at once:
def get_words(bytesobject):
    for i in range(0, len(bytesobject), 2):
        yield bytesobject[i:i+2]
In the most simple literal sense, something like this gives you a two byte at a time loop.
with open("/etc/passwd", "rb") as f:
    w = f.read(2)
    while len(w) > 0:
        print(w)
        w = f.read(2)
As for what you are getting from read(), it's a bytes object, because you have specified 'b' as an option to open. I think a more Pythonic way to express it would be via an iterator or generator.

Python tuple data byte to string easy way?

I'm using socket and struct to receive and unpack a bytes message received over TCP/IP. I'm getting a tuple which contains numeric data as well as bytes, in the order defined by the contract.
The example data is as below.
Example:
Receive buffer data from TCP/IP:
buffer = sock.recv(61)
Unpack the bytes into the predefined struct format:
tup_data = struct.unpack("<lll8s17sh22s", buffer)
tup_data
(61, 12000, 4000, b'msg\x00\x00\x00\x00\x00', b'anther msg\x00\x00\x00\x00\x00\x00\x00', 4, b'yet another msg\x00\x00\x00\x00\x00\x00\x00')
Since the data is streaming heavily and execution time matters, I don't want to load the CPU by using any looping or isinstance() calls.
Since the location of bytes are defined, so I'm currently using as
processed_data = (*tup_data[:3],
                  tup_data[3].strip(b"\x00").decode(),
                  tup_data[4].strip(b"\x00").decode(),
                  tup_data[5],
                  tup_data[6].strip(b"\x00").decode())
processed_data
(61, 12000, 4000, 'msg', 'anther msg', 4, 'yet another msg')
Is there any magic way to convert bytes into required string at one shot as the location of bytes are known...??
Since you're using struct.unpack for unpacking your buffer, the format characters (see the format-characters chart in the struct docs) mean you can't get str objects as your output directly. Therefore you should either strip the extra \x00 bytes at the source or just use a generator expression like the following to reformat the items that are instances of bytes.
In [12]: tuple(i.strip(b'\x00').decode() if isinstance(i, bytes) else i for i in tup_data)
Out[12]: (61, 12000, 4000, 'msg', 'anther msg', 4, 'yet another msg')
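A self-contained round trip of the format above, with hypothetical field values matching the question's example (no socket needed, since struct.pack can build the same 61-byte buffer locally):

```python
import struct

# Build a buffer like the one described in the question.
# The fixed-width string fields (8s, 17s, 22s) are padded with NUL bytes by pack.
buffer = struct.pack("<lll8s17sh22s", 61, 12000, 4000,
                     b"msg", b"anther msg", 4, b"yet another msg")

tup_data = struct.unpack("<lll8s17sh22s", buffer)

# Strip the NUL padding and decode only the bytes fields.
processed = tuple(i.strip(b"\x00").decode() if isinstance(i, bytes) else i
                  for i in tup_data)
print(processed)  # (61, 12000, 4000, 'msg', 'anther msg', 4, 'yet another msg')
```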

Why is this error appearing?

AttributeError: 'builtin_function_or_method' object has no attribute 'encode'
I'm trying to make a text-to-code converter as an example for an assignment, and this is some code based on what I found in my research:
import binascii
text = input('Message Input: ')
data = binascii.b2a_base64.encode(text)
text = binascii.a2b_base64.encode(data)
print (text), "<=>", repr(data)
data = binascii.b2a_uu(text)
text = binascii.a2b_uu(data)
print (text), "<=>", repr(data)
data = binascii.b2a_hqx(text)
text = binascii.a2b_hqx(data)
print (text), "<=>", repr(data)
Can anyone help me get it working? It's supposed to take an input and then convert it into hex and other encodings and display those...
I am using Python 3.6 but I am also a little out of practice...
TL;DR:
data = binascii.b2a_base64(text.encode())
text = binascii.a2b_base64(data).decode()
print (text, "<=>", repr(data))
You've hit a common problem in Python 3: str objects vs bytes objects. A bytes object contains a sequence of bytes; one byte can hold any number from 0 to 255. Usually those numbers are translated through the ASCII table into characters like English letters. In Python you should generally use bytes for working with binary data.
On the other hand, a str object contains a sequence of code points. One code point usually represents one character printed on your screen when you call print. Internally it is a sequence of bytes, so the Chinese symbol 的 is stored as a 3-byte-long sequence.
Now to your problem. The function requires a bytes object as input, but you've got a str object from input(). To convert str into bytes you have to call the str.encode() method on the str object.
data = binascii.b2a_base64(text.encode())
Your original call binascii.b2a_base64.encode(text) means: call the method encode of the object binascii.b2a_base64 with the parameter text.
The function binascii.b2a_base64 returns a bytes object containing the original input encoded with the base64 algorithm. To get back the original str from the encoded data, you have to call this:
# Take base64 encoded data and return it decoded as bytes object
decoded_data = binascii.a2b_base64(data)
# Convert bytes object into str
text = decoded_data.decode()
It can be written as one line
decoded_data = binascii.a2b_base64(data).decode()
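For reference, here is the whole base64 round trip as one runnable snippet, with a fixed string standing in for the input() call (the message text is hypothetical):

```python
import binascii

text = "hello assignment"  # stands in for input('Message Input: ')

# str -> bytes, then base64-encode; b2a_base64 returns bytes with a trailing newline.
data = binascii.b2a_base64(text.encode())

# Decode back: base64 bytes -> raw bytes -> str.
round_tripped = binascii.a2b_base64(data).decode()

print(round_tripped, "<=>", repr(data))
```

The same encode-before/decode-after pattern applies to the b2a_uu and a2b_uu calls in the question.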
WARNING: Your print calls use Python 2 style; in Python 3, print (text), "<=>", repr(data) prints only text and discards the rest (only in the interactive console would the trailing tuple be echoed).

Python 3- check if buffered out bytes form a valid char

I am porting some code from Python 2.7 to 3.4.2, and I am stuck on the bytes vs. str complication.
I read this 3rd point in the wolf's answer
Exactly n bytes may cause a break between logical multi-byte characters (such as \r\n in binary mode and, I think, a multi-byte character in Unicode) or some underlying data structure not known to you;
So, when I buffer-read a file (say, 1 byte each time) and the very first character happens to be a 6-byte Unicode character, how do I figure out how many more bytes need to be read? Because if I do not read the complete character, it will be skipped from processing, as the next read(x) will read x bytes relative to the last position (i.e., halfway through the character's byte representation).
I tried the following approach:
import sys, os

def getBlocks(inputFile, chunk_size=1024):
    while True:
        try:
            data = inputFile.read(chunk_size)
            if data:
                yield data
            else:
                break
        except IOError as strerror:
            print(strerror)
            break

def isValid(someletter):
    try:
        someletter.decode('utf-8', 'strict')
        return True
    except UnicodeDecodeError:
        return False

def main(src):
    aLetter = bytearray()
    with open(src, 'rb') as f:
        for aBlock in getBlocks(f, 1):
            aLetter.extend(aBlock)
            if isValid(aLetter):
                # print("char is now a valid one") # just for acknowledgement
                pass  # do more
            else:
                aLetter.extend(getBlocks(f, 1))
Questions:
1. Am I doomed if I try fileHandle.seek(-ve_value_here, 1)?
2. Python must have something built-in to deal with this; what is it?
3. How can I really test whether the program meets its purpose of ensuring complete characters are read (right now I have only simple English files)?
4. How can I determine the best chunk_size to make the program faster? I mean, reading 1024 bytes where the first 1023 bytes are 1-byte characters and the last is a 6-byter leaves me with the only option of reading 1 byte each time.
Note: I can't prefer buffered reading, as I do not know the range of input file sizes in advance.
The answer to #2 will solve most of your issues. Use an IncrementalDecoder via codecs.getincrementaldecoder. The decoder maintains state and only outputs fully decoded sequences:
#!python3
import codecs
import sys

byte_string = '\u5000\u5001\u5002'.encode('utf8')

# Get the UTF-8 incremental decoder.
decoder_factory = codecs.getincrementaldecoder('utf8')
decoder_instance = decoder_factory()

# Simple example, read two bytes at a time from the byte string.
result = ''
for i in range(0, len(byte_string), 2):
    chunk = byte_string[i:i+2]
    result += decoder_instance.decode(chunk)
    print('chunk={} state={} result={}'.format(chunk, decoder_instance.getstate(), ascii(result)))
result += decoder_instance.decode(b'', final=True)
print(ascii(result))
Output:
chunk=b'\xe5\x80' state=(b'\xe5\x80', 0) result=''
chunk=b'\x80\xe5' state=(b'\xe5', 0) result='\u5000'
chunk=b'\x80\x81' state=(b'', 0) result='\u5000\u5001'
chunk=b'\xe5\x80' state=(b'\xe5\x80', 0) result='\u5000\u5001'
chunk=b'\x82' state=(b'', 0) result='\u5000\u5001\u5002'
'\u5000\u5001\u5002'
Note after the first two bytes are processed the internal decoder state just buffers them and appends no characters to the result. The next two complete a character and leave one in the internal state. The last call with no additional data and final=True just flushes the buffer. It will raise an exception if there is an incomplete character pending.
Now you can read your file in whatever chunk size you want, pass them all through the decoder and be sure that you only have complete code points.
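As a sketch of the worst case from question #4, feeding the decoder a single byte at a time still produces only complete characters (the sample string is illustrative):

```python
import codecs

# UTF-8 incremental decoder; incomplete sequences stay buffered inside it.
decoder = codecs.getincrementaldecoder('utf8')()

stream = '我是美国人'.encode('utf8')  # 15 bytes: five 3-byte characters

result = ''
for i in range(len(stream)):
    # One byte per call; decode() returns '' until a character completes.
    result += decoder.decode(stream[i:i+1])
result += decoder.decode(b'', final=True)  # flush; raises if bytes are pending

print(result)  # 我是美国人
```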
Note that with Python 3, you can simply open the file and declare the encoding. The chunks you read will then be Unicode code points, processed by an IncrementalDecoder internally:
input.txt (saved in UTF-8 without BOM)
我是美国人。
Normal text.
code
with open('input.txt', encoding='utf8') as f:
    while True:
        data = f.read(2) # reads 2 Unicode code points, not bytes.
        if not data: break
        print(ascii(data))
Result:
'\u6211\u662f'
'\u7f8e\u56fd'
'\u4eba\u3002'
'\nN'
'or'
'ma'
'l '
'te'
'xt'
'.'

Write binary data to file in python3

I've been having a LOT of trouble with this, and the other questions don't seem to be what I'm looking for. So basically I have a list of bytes obtained from:
bytes = struct.pack('I',4)
bList = list(bytes)
# bList ends up being [0,0,0,4]
# Perform some operation that switches position of bytes in list, etc
So now I want to write this to a file
f = open('/path/to/file', 'wb')
for i in range(0, len(bList)):
    f.write(bList[i])
But I keep getting the error
TypeError: 'int' does not support the buffer interface
I've also tried writing:
bytes(bList[i]) # Seems to write the incorrect number.
str(bList[i]).encode() # Seems to just write the string value instead of byte
Oh boy, I had to jump through hoops to solve this. So basically I had to instead do
bList = bytes()
bList += struct.pack('I', 4)
# Perform whatever byte operations I need to
byteList = []
# I know, there's probably a list comprehension to do this more elegantly
for i in range(0, len(bList)):
    byteList.append(bList[i])
f.write(bytes(byteList))
So bytes() can take a list of integer byte values (even though they're represented in decimal form in the list) and convert it to a proper bytes object.
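In fact the append loop isn't needed at all, since bytes() accepts any iterable of ints 0-255 directly. A runnable sketch of the shorter round trip (the temp file is just for demonstration):

```python
import os
import struct
import tempfile

packed = struct.pack('I', 4)
bList = list(packed)  # integers 0-255; byte order depends on the platform

# ... rearrange bList here if needed ...

# bytes() turns the list of ints straight back into a writable bytes object.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as f:
    f.write(bytes(bList))

with open(path, 'rb') as f:
    data = f.read()
os.remove(path)

print(data == packed)  # True
```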
