python-snappy streaming data in a loop to a client - python-3.x

I would like to send multiple compressed arrays from a server to a client using python snappy, but I cannot get it to work after the first array. Here is a snippet for what is happening:
(sock is just the network socket that these are communicating through)
Server:
for i in range(n):  # number of arrays to send
    val = items[i][1]  # this is the array
    y = (json.dumps(val)).encode('utf-8')
    b = io.BytesIO(y)
    # snappy.stream_compress requires a file-like object as input, as far as I know.
    with b as in_file:
        with sock as out_file:
            snappy.stream_compress(in_file, out_file)
Client:
for i in range(n):  # same n as before
    data = ''
    b = io.BytesIO()
    # snappy.stream_decompress requires a file-like object to write to, as far as I know
    snappy.stream_decompress(sock, b)
    data = b.getvalue().decode('utf-8')
    val = json.loads(data)
val = json.loads(data) works only on the first iteration; after that it stops working. When I do print(data), only the first iteration prints anything. I've verified that the server flushes and sends all the data, so I believe the problem is in how I receive it.
I could not find a different way to do this. I searched, and the only thing I could find is this post, which led me to what I currently have.
Any suggestions or comments?

with doesn't do what you think; refer to its documentation. It calls sock.__exit__() after the block has executed, which closes the socket, and that's not what you intended.
# what you wrote
with b as in_file:
    with sock as out_file:
        snappy.stream_compress(in_file, out_file)

# what you meant
snappy.stream_compress(b, sock)
By the way: the line data = '' is redundant because data is reassigned anyway.

Adding to @paul-scharnofske's answer:
Likewise, on the receiving side: stream_decompress doesn't stop until end-of-file, which means it will read until the socket is closed. So if you send multiple separately compressed chunks, it will read all of them before finishing, which is not what you intend. Bottom line: you need to add "framing" around each chunk so that the receiving end knows where one chunk ends and the next begins. One way to do that (see the sketch after the two lists below) is the following. For each array to be sent:
Create an io.BytesIO object with the JSON-encoded input, as you're doing now
Create a second io.BytesIO object for the compressed output
Call stream_compress with the two BytesIO objects (you can write into a BytesIO in addition to reading from it)
Obtain the len of the output object
Send the length encoded as a 32-bit integer, say, with struct.pack("!I", length)
Send the output object
On the receiving side, reverse the process. For each array:
Read 4 bytes (the length)
Create a BytesIO object. Receive exactly length bytes, writing those bytes to the object
Create a second BytesIO object
Pass the received object as input and the second object as output to stream_decompress
json-decode the resulting output object
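A minimal sketch of this framing scheme, assuming a connected socket; the helper names send_array, recv_exactly and recv_array are illustrative, not from the original post:

import io
import json
import struct
import snappy

def send_array(sock, val):
    # Compress the JSON-encoded array into an in-memory buffer first,
    # so the compressed length is known before anything is sent.
    raw = io.BytesIO(json.dumps(val).encode('utf-8'))
    compressed = io.BytesIO()
    snappy.stream_compress(raw, compressed)
    payload = compressed.getvalue()
    # 4-byte big-endian length prefix, then the compressed payload.
    sock.sendall(struct.pack("!I", len(payload)) + payload)

def recv_exactly(sock, n):
    # Keep calling recv() until exactly n bytes have arrived.
    buf = b''
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("socket closed before the full message arrived")
        buf += chunk
    return buf

def recv_array(sock):
    (length,) = struct.unpack("!I", recv_exactly(sock, 4))
    compressed = io.BytesIO(recv_exactly(sock, length))
    decompressed = io.BytesIO()
    snappy.stream_decompress(compressed, decompressed)
    return json.loads(decompressed.getvalue().decode('utf-8'))

With framing like this, the server can call send_array once per array and the client can call recv_array the same number of times, without either side relying on the socket being closed to mark the end of a chunk.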

Related

How to send a pickled object across a server with encoding? Python 3

I want to send a pickled, encoded version of an Account object to my server, then decode it at the server end and reinstate it as an object with the corresponding data. However, I am unsure how to convert it back from a string to the bytes data type and then to an object.
On the clients end, this is essentially what happens:
command = 'insert account'
tempAccount = Account('Isabel', 'password')
pickledAcc = pickle.dumps(tempAccount)
clientCommand = f"{command},{pickledAcc}".encode('utf-8')
client.send(clientCommand)
However, on the server's side, it receives an empty string as the pickledAcc part.
I have simplified my code a bit, but I think the essentials are there; if you need more I can give it. I should also mention that I have used the proper length etiquette, i.e. sending a message beforehand to notify the server how long this message will be. And all of my server infrastructure works properly :)
Basically, I just need to know whether it is possible to encode the pickled Account object to send it, or whether doing so will never work and is a bad idea.
The problem with the format line is that the f-string inserts the __repr__ of pickledAcc instead of the actual bytes, which will not give you the result you want:
for example:
command = "test"
pickledAcc = pickle.dumps("test_data")
clientCommand = f"{command},{pickledAcc}".encode('utf-8')
Now clientCommand will contain:
b"test,b'\\x80\\x03X\\t\\x00\\x00\\x00test_dataq\\x00.'"
As you can see, the textual representation of the bytes object (the leading b'...') was encoded to UTF-8, not the pickled data itself.
To solve this, convert the pickled data into a text-safe form (for example base64) before embedding it in the command string, and decode it again on the server, as shown below.
Hope that helps.
Client side:
import base64
##--snip--##
pickledAcc = base64.b64encode(pickledAcc).decode()
clientCommand = f"{command},{pickledAcc}".encode('utf-8')
client.send(clientCommand)
Server Side:
import base64
##--snip--##
pickledAcc = base64.b64decode(pickledAcc)
pickledAcc = pickle.loads(pickledAcc)
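For completeness, a small sketch of how the server side might split the received message before the base64 decode; the comma framing matches the client code above, but the helper name handle_message is an assumption:

import base64
import pickle

def handle_message(raw: bytes):
    # raw is what the server received, e.g. b"insert account,gASV..."
    command, payload = raw.decode('utf-8').split(',', 1)
    account = pickle.loads(base64.b64decode(payload))
    return command, account

Keep in mind that unpickling data received over a network is only safe when the client is fully trusted, since pickle can execute arbitrary code during loading.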

Separating header from the rest of the dataset

I am reading in a csv file and then trying to separate the header from the rest of the file.
The hn variable is the read-in file without the first line.
hn_header is supposed to be the first row in the dataset.
If I define just one of these two variables, the code works. If I define both of them, the one written later does not contain any data. How is that possible?
from csv import reader
opened_file = open("hacker_news.csv")
read_file = reader(opened_file)
hn = list(read_file)[1:] #this should contain all rows except the header
hn_header = list(read_file)[0] # this should be the header
print(hn[:5]) #works
print(len(hn_header)) #empty list, does not contain the header
The CSV reader can only iterate through the file once, which it does the first time you convert it to a list. To avoid needing to iterate through multiple times, you can save the list to a variable.
hn_list = list(read_file)
hn = hn_list[1:]
hn_header = hn_list[0]
Or you can split up the file using extended iterable unpacking
hn_header, *hn = list(read_file)
Just change the line below in your code; no additional steps are needed: read_file = list(reader(opened_file)). With that change, your code runs as expected.
The reader object is an iterator, and by definition iterator objects can only be consumed once; when they're done iterating, you don't get any more out of them.
For more detail, see the question "Why can I only use a reader object once?"; the explanation above is taken from it.
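A short illustration of the exhausted-iterator behaviour, assuming the same hacker_news.csv file as in the question:

from csv import reader

with open("hacker_news.csv") as opened_file:
    read_file = reader(opened_file)
    first_pass = list(read_file)   # consumes the whole iterator
    second_pass = list(read_file)  # iterator already exhausted, so this is []

print(len(first_pass) > 0)   # True
print(second_pass)           # []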

How to maintain multiple stream positions in a Python stream

I'd like to use 2 stream pointers within a stream, and position the 2 pointers at different positions. How do I make a copy of the first stream, so that the copy doesn't mirror the state of the first stream, from this point in time?
In particular, I'm interested in streams of the type io.BytesIO()
import io
stream1 = open("Input.jpg", "rb")
stream2 = stream1
print('A', stream1.tell(), stream2.tell())
stream1.seek(10)
print('B', stream1.tell(), stream2.tell())
My goal is to see output of
A 0 0
B 10 0
However, I see
A 0 0
B 10 10
@varela
Thanks for the response. Unfortunately, this doesn't work when the stream doesn't have a file descriptor (which can happen if we don't open a file). For example, instead of stream1 = open("Input.jpg", "rb"):
stream1 = io.BytesIO()
image.save(stream1, format='JPEG')
Any suggestions on how to handle this case?
Thanks.
You can open the file twice:
stream1 = open("Input.jpg", "rb")
stream2 = open("Input.jpg", "rb")
Then they will be independent. When you do stream2 = stream1 you just copy the object reference, which doesn't create any new object.
You need to remember to close both file objects as well.
Duplicating a file descriptor is usually not needed. It is possible with low-level system operations, but I wouldn't recommend it unless you really have a use case for it. Example:
import os

# return integer file handles (os.O_BINARY is Windows-only; omit it on POSIX)
fd1 = os.open("Input.jpg", os.O_BINARY | os.O_RDONLY)
fd2 = os.dup(fd1)

# you can convert them to file objects if required
stream1 = os.fdopen(fd1, 'rb')
stream2 = os.fdopen(fd2, 'rb')
Here some use cases when os.dup makes sense to use: dup2 / dup - why would I need to duplicate a file descriptor?
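Regarding the follow-up about streams without a file descriptor (such as io.BytesIO): one possible approach, not from the original thread, is to take the underlying bytes once and wrap them in two separate BytesIO objects, each with its own position. Continuing from the in-memory JPEG example in the comment above:

import io

data = stream1.getvalue()      # the JPEG bytes written by image.save(stream1, format='JPEG')
stream_a = io.BytesIO(data)    # independent position 1
stream_b = io.BytesIO(data)    # independent position 2

stream_a.seek(10)
print(stream_a.tell(), stream_b.tell())  # 10 0

Both wrappers read the same data but track their positions independently.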

Reading file and getting values from a file. It shows only first one and others are empty

I am reading a file using with open in Python and then doing all the other operations inside the with block. When calling the function, only the first operation inside the block produces values; the others are empty. I can make this work with a different approach such as readlines, but I did not find out why this one does not work. I thought the reason might be that the file was closed, but with open takes care of that. Could anyone please suggest what's wrong?
def read_datafile(filename):
    with open(filename, 'r') as f:
        a = [lines.split("\n")[0] for number, lines in enumerate(f) if number == 2]
        b = [lines.split("\n")[0] for number, lines in enumerate(f) if number == 3]
        c = [lines.split("\n")[0] for number, lines in enumerate(f) if number == 2]
    return a, b, c

read_datafile('data_file_name')
I only get values for a; the others are empty. When a is commented out, I get a value for b and the others are empty.
Updates
The file looks like this:
-0.6908270760153553 -0.4493128078936575 0.5090918714784820
0.6908270760153551 -0.2172871921063448 0.5090918714784820
-0.0000000000000000 0.6666999999999987 0.4597549674638203
0.3097856229862140 -0.1259623621214220 0.5475896447896115
0.6902143770137859 0.4593623621214192 0.5475896447896115
The construct
with open(filename) as handle:
    a = [line for line in handle if condition]
    b = [line for line in handle]
will always return an empty b because the iterator in a already consumed all the data from the open filehandle. Once you reach the end of a stream, additional attempts to read anything will simply return nothing.
If the input is seekable, you can rewind it and read all the same lines again; or you can close it (explicitly, or implicitly by leaving the with block) and open it again - but a much more efficient solution is to read it just once, and pick the lines you actually want from memory. Remember that reading a byte off a disk can easily take several orders of magnitude more time than reading a byte from memory. And keep in mind that the data you read could come from a source which is not seekable, such as standard output from another process, or a client on the other side of a network connection.
def read_datafile(filename):
    with open(filename, 'r') as f:
        lines = [line for line in f]
    a = lines[2]
    b = lines[3]
    c = lines[2]
    return a, b, c
If the file could be too large to fit into memory at once, you end up with a different set of problems. Perhaps in this scenario, where you only seem to want a few lines from the beginning, only read that many lines into memory in the first place.
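A sketch of that idea using itertools.islice, assuming only the first few lines are ever needed (this variant is not from the original answer):

from itertools import islice

def read_datafile(filename):
    with open(filename, 'r') as f:
        # Read at most the first four lines; the rest of the file is never loaded.
        head = list(islice(f, 4))
    a = head[2]
    b = head[3]
    c = head[2]
    return a, b, c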
What exactly are you trying to do with this script? The lines variable in your comprehensions may not contain what you want: it refers to a single line at a time, because enumerate(f) iterates over the file line by line.

Reportlab PDF creating with python duplicating text

I am trying to automate the production of PDFs by reading data from a pandas data frame and writing it onto a page of an existing PDF form using PyPDF2 and reportlab. The main meat of the program is here:
def pdfOperations(row, bp):
    packet = io.BytesIO()
    can = canvas.Canvas(packet, pagesize=letter)
    createText(row, can)
    packet.seek(0)
    new_pdf = PdfFileReader(packet)
    textPage = new_pdf.getPage(0)
    secondPage = bp.getPage(1)
    secondPage.mergePage(textPage)
    assemblePDF(frontPage, secondPage, row)
    del packet, can, new_pdf, textPage, secondPage

def main():
    df = openData()
    bp = readPDF()
    frontPage = bp.getPage(0)
    for ind in df.index:
        row = df.loc[ind]
        pdfOperations(row, bp)
This works fine for the first row of data and the first PDF generated, but for the subsequent ones the text accumulates: the second PDF contains text from both the first and the second iteration. I thought garbage collection would take care of all the in-memory changes, but that does not seem to be happening. Anyone know why?
I even tried forcing the objects to be deleted after the function has run its course, but no luck...
You read bp only once, before the loop. Then in the loop, you obtain its second page via getPage(1) and merge stuff onto it. But since it's always the same object (bp), each iteration merges onto the same page, so all the merges done before add up.
While I can't find any way to create a "deepcopy" of a page in PyPDF2's docs, it should work to just create a new bp object for each iteration.
Somewhere in readPDF you presumably open your template PDF as a binary stream and pass it to PdfFileReader. Instead, you could read the raw data into a variable once:
with open(filename, "rb") as f:
    bp_bin = f.read()
And from that, create a new PdfFileReader instance for each loop iteration:
for ind in df.index:
    row = df.loc[ind]
    # wrap the raw bytes in a fresh buffer; PdfFileReader expects a file-like object
    bp = PdfFileReader(io.BytesIO(bp_bin))
    pdfOperations(row, bp)
This should "reset" secondPage every time without any additional file I/O overhead. Only the parsing is done again each iteration, but depending on the file size and contents, that cost may be low enough to live with.
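Putting the suggestion together, a sketch of how main() might look; openData and pdfOperations are taken on faith from the question, "template.pdf" is an assumed filename, and frontPage is refreshed along with bp on every iteration:

import io
from PyPDF2 import PdfFileReader

def main():
    global frontPage                        # pdfOperations in the question reads frontPage
    df = openData()                         # data-frame helper from the question
    with open("template.pdf", "rb") as f:   # assumed template filename
        bp_bin = f.read()                   # read the template bytes once

    for ind in df.index:
        row = df.loc[ind]
        # fresh reader per iteration, so merges never accumulate on a shared page
        bp = PdfFileReader(io.BytesIO(bp_bin))
        frontPage = bp.getPage(0)
        pdfOperations(row, bp)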
