I am porting some code from Python 2.7 to 3.4.2, and I am stuck on the bytes vs. string complication.
I read this third point in the wolf's answer:
Exactly n bytes may cause a break between logical multi-byte characters (such as \r\n in binary mode and, I think, a multi-byte character in Unicode) or some underlying data structure not known to you;
So, when I buffer-read a file (say, 1 byte each time) and the very first character happens to be a 6-byte Unicode character, how do I figure out how many more bytes to read? If I do not read up to the complete character, it will be skipped from processing, since the next read(x) reads x bytes relative to the last position (i.e. from halfway through the character's byte sequence).
I tried the following approach:
import sys, os

def getBlocks(inputFile, chunk_size=1024):
    while True:
        try:
            data = inputFile.read(chunk_size)
            if data:
                yield data
            else:
                break
        except IOError as strerror:
            print(strerror)
            break

def isValid(someletter):
    try:
        someletter.decode('utf-8', 'strict')
        return True
    except UnicodeDecodeError:
        return False

def main(src):
    aLetter = bytearray()
    with open(src, 'rb') as f:
        for aBlock in getBlocks(f, 1):
            aLetter.extend(aBlock)
            if isValid(aLetter):
                # print("char is now a valid one")  # just for acknowledgement
                # do more
                pass
            else:
                aLetter.extend(getBlocks(f, 1))
Questions:
Am I doomed if I try fileHandle.seek(some_negative_value, 1) to step back?
Python must have something built in to deal with this; what is it?
How can I really test that the program meets its purpose of ensuring complete characters are read (right now I have only simple English files)?
How can I determine the best chunk_size to make the program faster? I mean, reading 1024 bytes where the first 1023 bytes are 1-byte characters and the last one is a 6-byte character leaves me with the only option of reading 1 byte each time.
Note: I can't rely on buffered reading, as I do not know the range of input file sizes in advance.
The answer to #2 will solve most of your issues. Use an IncrementalDecoder via codecs.getincrementaldecoder. The decoder maintains state and only outputs fully decoded sequences:
#!python3
import codecs
import sys
byte_string = '\u5000\u5001\u5002'.encode('utf8')
# Get the UTF-8 incremental decoder.
decoder_factory = codecs.getincrementaldecoder('utf8')
decoder_instance = decoder_factory()
# Simple example, read two bytes at a time from the byte string.
result = ''
for i in range(0, len(byte_string), 2):
    chunk = byte_string[i:i+2]
    result += decoder_instance.decode(chunk)
    print('chunk={} state={} result={}'.format(chunk, decoder_instance.getstate(), ascii(result)))
result += decoder_instance.decode(b'',final=True)
print(ascii(result))
Output:
chunk=b'\xe5\x80' state=(b'\xe5\x80', 0) result=''
chunk=b'\x80\xe5' state=(b'\xe5', 0) result='\u5000'
chunk=b'\x80\x81' state=(b'', 0) result='\u5000\u5001'
chunk=b'\xe5\x80' state=(b'\xe5\x80', 0) result='\u5000\u5001'
chunk=b'\x82' state=(b'', 0) result='\u5000\u5001\u5002'
'\u5000\u5001\u5002'
Note that after the first two bytes are processed, the internal decoder state just buffers them and appends no characters to the result. The next two complete a character and leave one byte in the internal state. The last call, with no additional data and final=True, flushes the buffer; it raises an exception if an incomplete character is pending.
Now you can read your file in whatever chunk size you want, pass them all through the decoder and be sure that you only have complete code points.
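For example, a chunked file reader built on the incremental decoder might look like this (a sketch; the function name, defaults, and docstring are my own, not from the question):

```python
import codecs

def decoded_chunks(path, chunk_size=1024):
    """Yield str chunks from a UTF-8 file, never splitting a code point."""
    decoder = codecs.getincrementaldecoder('utf8')()
    with open(path, 'rb') as f:
        while True:
            data = f.read(chunk_size)
            if not data:
                break
            text = decoder.decode(data)
            if text:  # may be empty if we stopped mid-character
                yield text
        # final=True raises UnicodeDecodeError if bytes are still pending
        tail = decoder.decode(b'', final=True)
        if tail:
            yield tail
```

Even with chunk_size=1 this yields only whole characters, since the decoder buffers partial sequences internally.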
Note that with Python 3 you can simply open the file in text mode and declare the encoding. The chunks you read will then be Unicode code points; the file object uses an IncrementalDecoder internally:
input.txt (saved in UTF-8 without BOM)
我是美国人。
Normal text.
code
with open('input.txt', encoding='utf8') as f:
    while True:
        data = f.read(2)  # reads 2 Unicode code points, not bytes
        if not data:
            break
        print(ascii(data))
Result:
'\u6211\u662f'
'\u7f8e\u56fd'
'\u4eba\u3002'
'\nN'
'or'
'ma'
'l '
'te'
'xt'
'.'
Related
I have a python3 script which reads data into a buffer with
fp = open("filename", 'rb')
data = fp.read(count)
I don't fully understand (even after reading the documentation) what read() returns. It appears to be some kind of binary data which is iterable, but it is not a list.
Confusingly, elsewhere in the script, lists are used for binary data.
frames = []
# then later... inside a loop
for ...
data = b''.join(frames)
Regardless... I want to iterate over the object returned by read() in units of word (aka 2 byte blocks)
At the moment the script contains this for loop
for c in data:
    # do something
Is it possible to change c such that this loop iterates over words (2 byte blocks) rather than individual bytes?
I cannot use read() in a loop to read 2 bytes at a time.
We can explicitly read (up to) n bytes from a file in binary mode with .read(n) (just as it would read n Unicode code points from a file opened in text mode). This is a blocking call and will only read fewer bytes at the end of the file.
We can use the two-argument form of iter to build an iterator that repeatedly calls a callable:
>>> help(iter)
Help on built-in function iter in module builtins:
iter(...)
iter(iterable) -> iterator
iter(callable, sentinel) -> iterator
Get an iterator from an object. In the first form, the argument must
supply its own iterator, or be a sequence.
In the second form, the callable is called until it returns the sentinel.
read at the end of the file starts returning empty results rather than raising an exception, so we can use b'' as our sentinel.
Putting it together, we get:
for pair in iter(lambda: fp.read(2), b''):
    ...  # process the two-byte chunk here
Inside the loop we get bytes objects representing two bytes of data (the final chunk may be a single byte if the file length is odd). You should check the documentation to understand how to work with these.
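If those two-byte chunks are meant to be numeric words (an assumption about your data, not something the question states), one way to interpret each pair is int.from_bytes, choosing the byte order yourself:

```python
data = b'\x00\x01\x02\x03'  # stand-in for the result of fp.read()

for i in range(0, len(data), 2):
    pair = data[i:i+2]
    value = int.from_bytes(pair, byteorder='big')  # or 'little'
    print(value)  # prints 1, then 515 (0x0203)
```

struct.unpack('>H', pair) would do the same for exactly-two-byte chunks, but int.from_bytes also tolerates a short final chunk.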
When reading a file in binary mode, a bytes object is returned, which is one of the standard Python builtins. Its representation in code looks like that of a string, except that it is prefixed with b, as in b"...". When you print it, each byte is displayed either as an escape of the form \x** (where ** are two hex digits giving the byte's value from 0 to 255) or directly as a single printable ASCII character with that code point. You can read more about bytes and its methods (similar to those of strings) in the bytes docs.
There already is a very popular question on Stack Overflow about how to iterate over a bytes object. The currently accepted answer gives this example for creating a list of individual bytes from the bytes object:
L = [bytes_obj[i:i+1] for i in range(len(bytes_obj))]
I suppose that modifying it like this will work for you:
L = [bytes_obj[i:i+2] for i in range(0, len(bytes_obj), 2)]
For example:
by = b"\x00\x01\x02\x03\x04\x05\x06"
# The object returned by file.read() is also bytes, like the one above
words = [by[i:i+2] for i in range(0, len(by), 2)]
print(words)
# Output --> [b'\x00\x01', b'\x02\x03', b'\x04\x05', b'\x06']
Or create a generator that yields words in the same way if your list is likely to be too large to efficiently store at once:
def get_words(bytesobject):
    for i in range(0, len(bytesobject), 2):
        yield bytesobject[i:i+2]
In the simplest literal sense, something like this gives you a two-bytes-at-a-time loop:
with open("/etc/passwd", "rb") as f:
    w = f.read(2)
    while len(w) > 0:
        print(w)
        w = f.read(2)
As for what you are getting from read(): it's a bytes object, because you have specified 'b' as an option to open().
I think a more Pythonic way to express it would be via an iterator or generator.
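A sketch of that generator approach (the function name and default are my own):

```python
def read_words(path, size=2):
    """Yield successive `size`-byte blocks from a binary file."""
    with open(path, 'rb') as f:
        # iter(callable, sentinel): call f.read(size) until it returns b''
        for block in iter(lambda: f.read(size), b''):
            yield block
```

Note the final block may be shorter than size when the file length is not a multiple of it.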
I have pickled a pandas DataFrame on my server and I'm sending it over a socket connection. I can receive the data, but I can't seem to append the chunks back together into the original DataFrame format; that's all I'm trying to achieve. I have a feeling it's the way I'm appending, as it's turning into a list because of data = [], but I tried an empty pandas DataFrame and that didn't work either, so I'm a bit lost as to how to append these values.
data = []
FATPACKET = 0
bytelength = self.s.recv(BUFFERSIZE)
length = int(pickle.loads(bytelength))
print(length)
ammo = 0
while True:
    print("Getting Data....")
    packet = self.s.recv(1100)
    FATPACKET = int(sys.getsizeof(packet))
    ammo += FATPACKET
    print(str(FATPACKET) + ' Got this much of data out of ' + str(length))
    print("Getting Data.....")
    data.append(packet)
    print(ammo)
    if not ammo > length:
        break
print(data)
unpickled = pickle.loads(data)
self.s.close()
print("Closing Connection!")
print(unpickled)
When I try this code, I constantly run into this:
TypeError: a bytes-like object is required, not 'list'
or I run into this:
_pickle.UnpicklingError: invalid load key, '\x00'
which is the first couple of digits of my pickled DataFrame. Sorry, this is my first time messing around with the pickle module, so I'm not very knowledgeable.
It would help if we could also see exactly what you're doing on the sending end. However, it's apparent that you have several problems.
First, in the initial recv, it's obvious you intended that to only obtain the initial pickle object you used to encode the length of the remaining bytes. However, that recv might also receive an initial segment of the remaining bytes (or even all of the remaining bytes, depending on how large that is). So how much of it should you give to the initial pickle.loads?
You would be better off creating a fixed length field to contain the size of the remaining data. That is often done with the struct module. On the sending side:
import struct
# Pickle the data to be sent
data = pickle.dumps(python_obj)
data_len = len(data)
# Pack data length into 4 bytes, encoded in network byte order
data_len_to_send = struct.pack('!I', data_len)
# Send exactly 4 bytes (due to 'I'), then send data
conn.sendall(data_len_to_send)
conn.sendall(data)
On the receiving side, as the exception said, pickle.loads takes a byte string not a list. So part of solving this will be to concatenate all the list elements into a single byte string before calling loads:
unpickled = pickle.loads(b''.join(data))
Other issues on receiving side: use len(packet) to get the buffer size. sys.getsizeof provides the internal memory used by the bytes object which includes unspecified interpreter overhead and isn't what you need here.
After recv, the first thing you should do is check for an empty buffer, which indicates end-of-stream (len(packet) == 0, packet == b'', or simply not packet). That would happen, for example, if the sender got killed before completing the send (or the network link went down, or there is a bug on the sender side, etc.). Otherwise, if the connection ends prematurely, your program will never reach the break and will spin in a very tight infinite loop.
So, altogether you could do something like this:
# First, obtain the fixed-size content length
buf = b''
while len(buf) < 4:
    tbuf = self.s.recv(4 - len(buf))
    if tbuf == b'':
        raise RuntimeError("Lost connection with peer")
    buf += tbuf

# Decode (unpack) the length (note that unpack returns a tuple)
len_to_recv = struct.unpack('!I', buf)[0]

data = []
len_recvd = 0
while len_recvd < len_to_recv:
    buf = self.s.recv(min(len_to_recv - len_recvd, BUFFERSIZE))
    if buf == b'':
        raise RuntimeError("Lost connection with peer")
    data.append(buf)
    len_recvd += len(buf)

unpickled_obj = pickle.loads(b''.join(data))
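The two receive loops can also be factored into one helper; this is a sketch using a plain sock variable in place of your self.s:

```python
import pickle
import struct

def recv_exactly(sock, n):
    """Read exactly n bytes from sock, or raise if the peer disconnects."""
    chunks = []
    remaining = n
    while remaining > 0:
        buf = sock.recv(remaining)
        if not buf:
            raise RuntimeError("Lost connection with peer")
        chunks.append(buf)
        remaining -= len(buf)
    return b''.join(chunks)

def recv_pickled(sock):
    # 4-byte length prefix in network byte order, then the pickled payload
    (length,) = struct.unpack('!I', recv_exactly(sock, 4))
    return pickle.loads(recv_exactly(sock, length))
```

This pairs with the sendall code above: the sender writes the packed length followed by the payload, and the receiver calls recv_pickled(sock) once per message.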
I'm currently learning to use Python for binary files. I came across this code in the book I'm reading:
FILENAME = 'pc_rose_copy.txt'

def display_contents(filename):
    fp = open(filename, 'rb')
    print(fp.read())
    fp.close()

def encrypt(filename):
    fp = open(filename, 'r+b')
    text = fp.read()
    fp.seek(0)
    for c in text:
        if c <= 128:
            fp.write(bytes([c+128]))
        else:
            fp.write(bytes([c-128]))
    fp.close()

display_contents(FILENAME)
encrypt(FILENAME)
display_contents(FILENAME)
I've several doubts regarding this code for which I can't find an answer in the book:
1) In line 13 (if c <= 128): since the file was opened in binary mode, each c is read as an integer byte value (i.e., is that equivalent to if ord(c) <= 128, had the file not been opened in binary mode)?
2) If so, then what's the point of checking whether any character's value is higher than 128, since this is a .txt with a passage from Romeo and Juliet?
3) This point is more of a curiosity, so pardon the naivety. I know it doesn't apply in this case, but say the script encounters a c with a byte value of 128 and so adds 128 to it. What would a value of 256 look like as a byte -- would it be 11111111 00000001?
What's really happening is that the script is toggling the most significant bit of every byte. This is equivalent to adding/subtracting 128 to each byte. You can see this by looking at the file contents before/after running the script (xxd -b file.txt on linux or mac will let you see the exact bits/bytes).
Here's a run on some sample text:
File Contents Before:
11110000 10011111 10011000 10000100 00001010
File Contents After:
01110000 00011111 00011000 00000100 10001010
Running the script twice (or any even number of times) restores the original text by toggling all of the high bits back to the original values.
Question / Answer:
1) If the file is ASCII-encoded, yes. e.g. for a file abc\n, the values of c are 97, 98, 99, and 10 (newline). You can verify this by adding print(c) inside the loop. This script will also work* on non-ASCII encoded files (the example above is UTF-8).
2) So that we can flip the bits. Even if we were only handling ASCII files (which isn't guaranteed), the bytes we get from encrypting ASCII files will be larger than 128, since we've added 128 to each byte. So we still need to handle that case in order to decrypt our own files.
3) As is, the script crashes, because bytes() requires values in the range 0 <= x < 256 (see documentation). You can create a file that breaks the script with echo -n -e '\x80\x80\x80' > 128.txt. The script should be using < instead to handle this case properly.
* Except for 3)
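Incidentally, since adding or subtracting 128 just toggles the most significant bit, the whole branchy loop can be expressed as an XOR (a sketch, not the book's code; it also sidesteps the c <= 128 boundary bug):

```python
def toggle_high_bits(data):
    """Flip the most significant bit of every byte in a bytes object."""
    return bytes(b ^ 0x80 for b in data)
```

Applying it twice restores the input, which is exactly the encrypt-then-decrypt behavior described above.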
I think that the encrypt function is also meant to be a decrypt function.
encrypt goes from a text file to a binary file with only high bytes, but the else clause is for going back from high bytes to text. I think that if you added an extra encrypt(FILENAME) you'd get the original file back.
c cannot really be 128 in an ASCII text file; the highest value there would be 126 (~), with 127 being the DEL "character". But if c were 128, adding 128 would give 0 as a byte (wrap-around), since we work modulo 256; in C this would be the case (for unsigned char).
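To make the wrap-around concrete, here is a sketch of a per-byte encrypt step that works modulo 256, the way unsigned char arithmetic does in C (my own variant, not the book's code):

```python
def encrypt_byte(c):
    """Add 128 modulo 256: flips the high bit and can never overflow bytes()."""
    return (c + 128) % 256
```

Applied twice it restores the original value, and unlike the book's version it never hands bytes() a value of 256 or more.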
I am having trouble determining when I have reached the end of a file in Python with file.readline:
fi = open('myfile.txt', 'r')
line = fi.readline()
if line == EOF:  # or something similar
    dosomething()
c = fp.read()
if c is None:
will not work, because then I would lose the data on the next line; and if a line only has a carriage return, I would miss an empty line.
I have looked at dozens of related posts, and they all just use the inherent loops that break when they are done. I am not looping, so this doesn't work for me. Also, I have file sizes in the GBs with hundreds of thousands of lines; a script could spend days processing a file. So I need to know how to tell when I am at the end of a file in Python 3. Any help is appreciated. Thank you!
I ran into this same exact problem. My specific issue was iterating over two files, where the shorter one was only supposed to read a line on specific reads of the longer file.
As some have mentioned here, the natural Pythonic way to iterate line by line is to, well, just iterate. My solution to stick with this 'naturalness' was to utilize the iterator protocol of a file manually. Something like this:
with open('myfile') as lines:
    try:
        while True:  # just to fake a lot of readlines and hit the end
            current = next(lines)
    except StopIteration:
        print('EOF!')
You can of course embellish this with your own IOWrapper class, but this was enough for me. Just replace all calls to readline with calls to next, and don't forget to catch the StopIteration.
The simplest way to check whether you've reached EOF with fi.readline() is to check the truthiness of the return value:
line = fi.readline()
if not line:
    dosomething()  # EOF reached
Reasoning
According to the official documentation
f.readline() reads a single line from the file; a newline character (\n) is left at the end of the string, and is only omitted on the last line of the file if the file doesn’t end in a newline. This makes the return value unambiguous; if f.readline() returns an empty string, the end of the file has been reached, while a blank line is represented by '\n', a string containing only a single newline.
and the only falsy string in python is the empty string ('').
You can use the output of the tell() function to determine if the last readline changed the current position of the stream.
fi = open('myfile.txt', 'r')
pos = fi.tell()
while True:
    li = fi.readline()
    newpos = fi.tell()
    if newpos == pos:  # stream position hasn't changed -> EOF
        break
    else:
        pos = newpos
According to the Python Tutorial:
f.tell() returns an integer giving the file object’s current position in the file represented as number of bytes from the beginning of the file when in binary mode and an opaque number when in text mode.
...
In text files (those opened without a b in the mode string), only seeks relative to the beginning of the file are allowed (the exception being seeking to the very file end with seek(0, 2)) and the only valid offset values are those returned from the f.tell(), or zero.
Since the values returned from tell() can be passed to seek(), they have to be unique (even if we can't guarantee what they correspond to). Therefore, if the value of tell() before and after a readline() is unchanged, the stream position is unchanged, and EOF has been reached (barring some other I/O error, of course). Reading an empty line reads at least the newline and advances the stream position.
This is a demonstrative example using f.tell() and f.read() with chunks of data.
Assuming my input.txt file contains:
hello
hi
hoo
foo
bar
Test:
with open('input.txt', 'r') as f:
    # Read chunk of data
    chunk = 4
    while True:
        line = f.read(chunk)
        if not line:
            line = "i've read Nothing"
            print("EOF reached. What i read when i reach EOF:", line)
            break
        else:
            print('Read: {} at position: {}'.format(line.replace('\n', ''), f.tell()))
Will output:
Read: hell at position: 4
Read: ohi at position: 9
Read: hoo at position: 14
Read: foo at position: 19
Read: bar at position: 24
EOF reached. What i read when i reach EOF: i've read Nothing
with open(FILE_PATH, 'r') as fi:
    for line in iter(fi.readline, ''):
        parse(line)
New to this python thing.
A little while ago I saved off output from an external device connected to a serial port, as I would not be able to keep the device. I read the data at the serial port as bytes, with the intention of creating an emulator for that device.
The data was saved one 'byte' per line to a file as example extract below.
b'\x9a'
b'X'
b'}'
b'}'
b'x'
b'\x8c'
I would like to read in each line from the data capture and append what would have been the original byte to a byte array.
I have tried various append() and concatenation (+=) operations on a bytearray, but the above lines are Python string objects, so these operations fail.
Is there an easy way (a built-in way?) to add each of the original byte values of these lines to a byte array?
Thanks.
M
Update
I came across the .encode() string method and have created a crude function to meet my needs.
def string2byte(str):
    # the 'byte' string will consist of 5, 6 or 8 characters including the newline
    # as in b',' or b'\r' or b'\x0A'
    if len(str) == 5:
        return str[2].encode('Latin-1')
    elif len(str) == 6:
        return str[3].encode('Latin-1')
    else:
        return str[4:6].encode('Latin-1')
...well, it is functional.
If anyone knows of a more elegant solution, perhaps you would be kind enough to post it.
b'\x9a' is the literal representation of the byte 0x9a in Python source code. If your file literally contains these seven characters b'\x9a', that is wasteful, because the value could have been saved using a single byte. You can convert each such line back to a byte using ast.literal_eval():
import ast
with open('input') as file:
    a = b"".join(map(ast.literal_eval, file))  # assumes one byte literal per line
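For example, with the sample lines from the question (an in-memory file stands in for your capture file here):

```python
import ast
import io

# simulate a capture file containing one byte literal per line
capture = io.StringIO("b'\\x9a'\nb'X'\nb'}'\n")
data = b''.join(map(ast.literal_eval, capture))
print(data)  # prints b'\x9aX}'
```

ast.literal_eval tolerates the trailing newline on each line, and unlike eval() it only accepts literals, so a stray expression in the file raises an error instead of executing.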