Detect the type of strings

I want to create a Python script that produces byte strings which can be written to a file.
So the problem is something like this:
import struct

packed_int = struct.pack('>I', 1234)
result = packed_int + data  # 'data' may be str or bytes, depending on the caller
In Python 3 this raises a TypeError when data is a str, because str and bytes cannot be concatenated.
So I solved the problem with the following code:
data = data.encode('utf-8')
packed_int = struct.pack('>I', 1234)
result = packed_int + data
Now if data is already bytes, this fails instead, because bytes objects have no encode method. So I did this:
if type(data) is not bytes:
    data = data.encode('utf-8')
The final problem is that this last snippet does not work in Python 2.
How can I solve the problem without checking for python version?

You could use hasattr for duck typing:
if hasattr(data, 'encode'):
    data = data.encode('utf-8')
Or you can do something like this, which is what the six compatibility library does:
import sys
PY3 = sys.version_info[0] == 3
binary_type = (bytes if PY3 else str)
text_type = (str if PY3 else unicode)
# ...
if isinstance(data, text_type):
    data = data.encode('utf-8')
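For illustration, here is a minimal sketch that folds the isinstance check into the original struct example (the ensure_bytes name is just a placeholder):

import struct
import sys

PY3 = sys.version_info[0] == 3
text_type = str if PY3 else unicode  # 'unicode' only exists on Python 2; the else branch never runs on Python 3

def ensure_bytes(data, encoding='utf-8'):
    # Encode text to bytes; pass byte strings through unchanged.
    if isinstance(data, text_type):
        return data.encode(encoding)
    return data

packed_int = struct.pack('>I', 1234)
result = packed_int + ensure_bytes('payload')   # text input gets encoded
result = packed_int + ensure_bytes(b'payload')  # bytes input is untouched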


Python: re.sub gives different output when unicode character is fetched from a string variable containing data from pandas dataframe

The following works:
import re
text = "I\u2019m happy"
text_p = text
text_p = re.sub("[\u2019]","'",text_p)
print(text_p)
Output: I'm happy
This doesn't work:
import re
import pandas as pd

training_data = pd.read_csv('train.txt')
text = training_data['tweet_text'][0]  # assume this returns the string "I\u2019m happy"
text_p = text
text_p = re.sub("[\u2019]","'",text_p)
print(text_p)
Output: I\u2019m happy
I tried running your code and got I'm happy returned from both the string and the list item when passing each into re.sub(...) as outlined in your question.
If you're just looking to parse (decode) the unicode characters, you probably don't need to be using re. Something like the code below could be used to handle the characters without having to run re against each possibility.
text = training_data['tweet_text'][0]
if type(text) == str:  # if value is str, encode to a utf-8 byte string, then decode back to str
    text = text.encode()
    text = text.decode()
elif type(text) == bytes:  # elif value is bytes, just decode to str
    text = text.decode()
else:  # else print to console if value is neither str nor bytes
    print("Value not recognised as str or bytes!")

Problem with decoding bytes to text and comparison?

I do not understand why, when you open a document in binary mode with the open function, decode it to text, and compare it to a variable containing exactly the same text, Python says they are different. This only happens when the decoded text of the document has line breaks.
example:
o = open('New.py','rb')
t = o.read().decode()
x = '''this is a
message for test'''
if t == x:
    print('true')
else:
    print('false')
Although the decoded text t and the text of x look exactly the same, Python sees them as different and prints false.
I have really tried to find the difference in many ways, but I still don't understand how they differ or how I can make t equal x.
It's because the line breaks are still part of the string (represented as \n) even if you don't see them.
import binascii
o = open('new.py','rb')
t = o.read().decode()
print(binascii.hexlify(t.encode()))
# b'7468697320697320610a6d65737361676520666f7220746573740a'
x = '''this is a
message for test'''
print(binascii.hexlify(x.encode()))
# b'7468697320697320610a6d65737361676520666f722074657374'
Here, 0x0a at the end of t is the byte representation of the newline.
To make them the same, you need to strip the trailing whitespace and newline. Assuming new.py looks like this (same as the value of x):
this is a
message for test
Then just do this:
o = open('new.py','rb')
t = o.read().decode().strip()
x = '''this is a
message for test'''
if t == x:
    print('true')
else:
    print('false')
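If only the trailing newline should go (for instance when leading or internal whitespace must be preserved exactly), rstrip with an explicit character set is a slightly safer variant; a minimal sketch:

with open('new.py', 'rb') as o:
    t = o.read().decode().rstrip('\n')  # drop only trailing newlines, keep other whitespace
x = '''this is a
message for test'''
print(t == x)  # True, assuming the file is x plus a final newline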

Replace slice of string python with string of different size, but maintain structure

So today I was working on a function that removes any quoted strings from a chunk of data and replaces them with format placeholders instead ({0}, {1}, etc.).
I ran into a problem: the output was becoming completely scrambled, with a {1} landing in a seemingly random place.
I later found out that this happens because replacing slices of the list changes its length, so the spans of the earlier re matches no longer line up (it only worked for the first iteration).
Gathering the strings worked perfectly, as expected, so this is most certainly not a problem with re.
I've read about mutable sequences and a bunch of other things as well, but was not able to find anything on this.
What I think I need is something like str.replace that can take slices instead of a substring.
Here is my code:
import re

def rm_strings_from_data(data):
    regex = re.compile(r'"(.*?)"')
    s = regex.finditer(data)
    list_data = list(data)
    val = 0
    strings = []
    for i in s:
        string = i.group()
        start, end = i.span()
        strings.append(string)
        list_data[start:end] = '{%d}' % val  # changes the list length, shifting later spans
        val += 1
    print(strings, ''.join(list_data), sep='\n\n')

if __name__ == '__main__':
    rm_strings_from_data('[hi="hello!" thing="a thing!" other="other thing"]')
I get:
['"hello!"', '"a thing!"', '"other thing"']
[hi={0} thing="a th{1}r="other thing{2}
I would like the output:
['"hello!"', '"a thing!"', '"other thing"']
[hi={0} thing={1} other={2}]
Any help would be appreciated. Thanks for your time :)
Why not match both key=value parts using regex capture groups like this: (\w+?)=(".*?")
Then it becomes very easy to assemble the lists as needed.
Sample Code:
import re

def rm_strings_from_data(data):
    regex = re.compile(r'(\w+?)=(".*?")')
    matches = regex.finditer(data)
    strings = []
    list_data = []
    for matchNum, match in enumerate(matches):
        strings.append(match.group(2))
        list_data.append(match.group(1) + '={' + str(matchNum) + '}')
    print(strings, '[' + ' '.join(list_data) + ']', sep='\n\n')

if __name__ == '__main__':
    rm_strings_from_data('[hi="hello!" thing="a thing!" other="other thing"]')
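For completeness, the index-shifting problem in the original approach can also be avoided without capture groups: re.sub accepts a replacement function and builds the result string in a single pass, so earlier substitutions never move later match positions. A minimal sketch along those lines (names are illustrative):

import re

def rm_strings_from_data(data):
    strings = []

    def repl(match):
        strings.append(match.group())       # remember the quoted string
        return '{%d}' % (len(strings) - 1)  # emit its placeholder
    # re.sub builds a new string in one pass, so replacements of a
    # different length cannot shift the spans of later matches.
    return strings, re.sub(r'"(.*?)"', repl, data)

strings, cleaned = rm_strings_from_data('[hi="hello!" thing="a thing!" other="other thing"]')
print(strings)  # ['"hello!"', '"a thing!"', '"other thing"']
print(cleaned)  # [hi={0} thing={1} other={2}]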

How to wrap a Python text stream to replace strings on the fly?

Given how convoluted my solution seems to be, I am probably doing it all wrong.
Basically, I am trying to replace strings on the fly in a text stream (e.g. open('filename', 'r') or io.StringIO(text)). The context is that I'm trying to let pandas.read_csv() handle "Infinity" as "inf" instead of choking on it.
I do not want to slurp the whole file into memory (it can be big, and even if the resulting DataFrame will live in memory, there is no need to hold the whole text file too). Efficiency is a concern, so I'd like to keep using read(size) as the main way to get text in (not readline, which is quite a bit slower). The difficulty comes from the cases where read() returns a block of text that ends in the middle of one of the strings to be replaced.
Anyway, below is what I've got so far. It handles the conditions I've thrown at it so far (lines longer than size, search strings at the boundary of some read block), but I'm wondering if there is something simpler.
Oh, BTW, I don't handle anything other than calls to read().
import io

class ReplaceIOFile(io.TextIOBase):
    def __init__(self, iobuffer, old_list, new_list):
        self.iobuffer = iobuffer
        self.old_list = old_list
        self.new_list = new_list
        self.buf0 = ''
        self.buf1 = ''
        self.sub_has_more = True

    def read(self, size=None):
        if size is None:
            size = 2**16
        while len(self.buf0) < size and self.sub_has_more:
            eol = 0
            while eol <= 0:
                txt = self.iobuffer.read(size)
                self.buf1 += txt
                if len(txt) < size:
                    self.sub_has_more = False
                    eol = len(self.buf1) + 1
                else:
                    eol = self.buf1.rfind('\n') + 1
            txt, self.buf1 = self.buf1[:eol], self.buf1[eol:]
            for old, new in zip(self.old_list, self.new_list):
                txt = txt.replace(old, new)
            self.buf0 += txt
        val, self.buf0 = self.buf0[:size], self.buf0[size:]
        return val
Example:
text = """\
name,val
a,1.0
b,2.0
e,+Infinity
f,-inf
"""
size = 4 # or whatever -- I tried 1,2,4,10,100,2**16
with ReplaceIOFile(io.StringIO(text), ['Infinity'], ['inf']) as f:
    while True:
        buf = f.read(size)
        print(buf, end='')
        if len(buf) < size:
            break
Output:
name,val
a,1.0
b,2.0
e,+inf
f,-inf
So for my application:
# x = pd.read_csv(io.StringIO(text), dtype=dict(val=np.float64)) ## crashes
x = pd.read_csv(ReplaceIOFile(io.StringIO(text), ['Infinity'], ['inf']), dtype=dict(val=np.float64))
Out:
  name       val
0    a  1.000000
1    b  2.000000
2    e       inf
3    f      -inf
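If wrapping the stream turns out to be more machinery than the job needs, one simpler route may be worth noting: plain float() already accepts 'Infinity' and '+Infinity', so the converters parameter of read_csv can do the conversion per value. A minimal sketch, assuming the same text variable as above (the converted column is parsed in Python rather than by pandas' C parser, so this trades some speed for simplicity):

import io
import pandas as pd

x = pd.read_csv(io.StringIO(text), converters=dict(val=float))  # float('+Infinity') == inf
print(x)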

Python3 write string as binary

For a Python 3 programming assignment I have to work with Huffman coding. It's simple enough to generate the correct codes, which result in a long string of 0's and 1's.
Now my problem is actually writing this string out as binary and not as text. I attempted to do this:
result = "01010101 ... " #really long string of 0's and 1's
filewrt = open(output_file, "wb") #appending b to w should write as binary, should it not?
filewrt.write(result)
filewrt.close()
However, I'm still getting a large text file of 0 and 1 characters. How do I fix this?
EDIT: It seems I simply don't understand how to represent an arbitrary bit in Python 3.
Based on this SO question I devised this ugly monstrosity:
for char in result:
    filewrt.write(bytes(int(char, 2)))
Instead of getting anywhere close to working, it output a zeroed file that was twice as large as my input file. (bytes(n) with an integer argument creates n zero bytes, which explains the zeros.) Can someone please explain to me how to represent binary data arbitrarily? And in the context of creating a Huffman tree, how do I go about concatenating or joining bits based on their leaf locations, if I should not use a string to do so?
def intToTextBytes(n, stLen=0):
    bs = b''
    while n > 0:
        bs = bytes([n & 0xff]) + bs
        n >>= 8
    return bs.rjust(stLen, b'\x00')

num = 0b01010101111111111111110000000000000011111111111111
bs = intToTextBytes(num)
print(bs)
open(output_file, "wb").write(bs)
EDIT: A more complicated, but faster (about 3 times) way:
from math import log, ceil

intToTextBytes = lambda n, stLen=0: bytes([
    (n >> (i << 3)) & 0xff for i in range(int(ceil(log(n, 256))) - 1, -1, -1)
]).rjust(stLen, b'\x00')
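For the original bit-string use case, Python 3's int.to_bytes offers the same conversion without a hand-rolled loop. A minimal sketch (the bit string and file name are placeholders); note that a real decoder would also need to know len(bits), since up to 7 high-order pad bits are added:

bits = "0101010111111111111111000000000000001111111111111100"  # example bit string
n = int(bits, 2)                  # leading zero bits are dropped here...
num_bytes = (len(bits) + 7) // 8  # ...but restored by padding to this length
with open("encoded.bin", "wb") as f:
    f.write(n.to_bytes(num_bytes, "big"))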
