Python 3 byte string subscription - python-3.x

In Python 3, I am trying to use a byte string to handle some 8-bit character strings. I find that a byte string does not necessarily behave like a string. With a subscript, it returns a number instead of a byte string of length 1.
In [243]: s=b'hello'
In [244]: s[1]
Out[244]: 101
In [245]: s[1:2]
Out[245]: b'e'
This makes it really difficult to iterate over it. For example, this code works with a string but fails for a byte string.
In [260]: d = {b'e': b'E', b'h': b'H', b'l': b'L', b'o': b'O'}
In [261]: list(map(d.get, s))
Out[261]: [None, None, None, None, None]
This breaks some code from Python 2. I also find this irregularity really inconvenient. Does anyone have any insight into what's going on with byte strings?

Byte strings store byte values in the range 0-255. The repr of bytes is just a convenience to view them, but they are storing data not text. Observe:
>>> x=bytes([104,101,108,108,111])
>>> x
b'hello'
>>> x[0]
104
>>> x[1]
101
>>> list(x)
[104, 101, 108, 108, 111]
Use strings for text. If starting with bytes, decode them appropriately:
>>> s=b'hello'.decode('ascii')
>>> d = dict(zip('hello','HELLO'))
>>> list(map(d.get,s))
['H', 'E', 'L', 'L', 'O']
But if you want to work with bytes:
>>> d=dict(zip(b'hello',b'HELLO'))
>>> d
{104: 72, 108: 76, 101: 69, 111: 79}
>>> list(map(d.get,b'hello'))
[72, 69, 76, 76, 79]
>>> bytes(map(d.get,b'hello'))
b'HELLO'

You can simply decode the byte string, get the element you want, and encode it back:
s=b'hello'
t = s.decode()
print(t[1]) # This gives a str of length 1 (Python has no separate char type)
print(t[1].encode()) # This gives a bytes object of length 1
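Note that decode() with no argument assumes UTF-8, which can raise an error or combine several bytes into one character when values above 127 are present. For arbitrary 8-bit data, here is a minimal sketch assuming the Latin-1 codec is acceptable for your data: it maps every byte value 0-255 to the code point with the same number, so indexing gives one character per byte and the round trip is lossless.

```python
# Every possible byte value, to show that nothing is lost.
data = bytes(range(256))

# Latin-1 maps byte n to code point n, one character per byte.
text = data.decode('latin-1')

# Indexing the str gives a length-1 str, which encodes back to one byte.
second = text[1].encode('latin-1')

# Encoding the whole string restores the original bytes exactly.
round_trip = text.encode('latin-1')

print(len(text), second, round_trip == data)
```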

Related

HuggingFace - tokenizers - Lower case with input ids

Is it possible to lowercase given input ids without decoding and then encoding again?
For example:
tokenizer = AutoTokenizer.from_pretrained('roberta-base')
text1 = tokenizer.decode([713, 16, 10, 3645, 4])
print(text1)
>>> This is a sentence.
text2 = tokenizer.decode([9226, 16, 10, 3645, 4])
print(text2)
>>> this is a sentence.
I would like to know if there is a fast way to convert the id 713 to 9226 without decoding, lowercasing, and then encoding again.
Thanks,
Shon

python3: bytes vs bytearray, and converting to and from strings

I'd like to understand about python3's bytes and bytearray classes. I've seen documentation on them, but not a comprehensive description of their differences and how they interact with string objects.
bytes and bytearrays are similar...
python3's bytes and bytearray classes both hold arrays of bytes, where each byte can take on a value between 0 and 255. The primary difference is that a bytes object is immutable, meaning that once created, you cannot modify its elements. By contrast, a bytearray object allows you to modify its elements.
Both bytes and bytearray can be built from a string with an explicit encoding, and both provide a decode method to convert back to a string.
bytes and encoding strings
A bytes object can be constructed in a few different ways:
>>> bytes(5)
b'\x00\x00\x00\x00\x00'
>>> bytes([116, 117, 118])
b'tuv'
>>> b'tuv'
b'tuv'
>>> bytes('tuv')
TypeError: string argument without an encoding
>>> bytes('tuv', 'utf-8')
b'tuv'
>>> 'tuv'.encode('utf-8')
b'tuv'
>>> 'tuv'.encode('utf-16')
b'\xff\xfet\x00u\x00v\x00'
>>> 'tuv'.encode('utf-16-le')
b't\x00u\x00v\x00'
Note the difference between the last two: 'utf-16' specifies a generic UTF-16
encoding, so its encoded form includes a two-byte "byte order mark" (BOM) preamble,
here [0xff, 0xfe] because the machine is little-endian. When specifying an explicit
ordering of 'utf-16-le' as in the latter example, the encoded form omits the
byte order mark.
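The BOM behaviour can be checked with the constants in the standard codecs module. This is a small sketch; the variable names are illustrative, and the native byte order is read from sys.byteorder rather than assumed.

```python
import codecs
import sys

text = 'tuv'

# Generic 'utf-16' prepends a BOM in the machine's native byte order.
with_bom = text.encode('utf-16')
native_bom = codecs.BOM_UTF16_LE if sys.byteorder == 'little' else codecs.BOM_UTF16_BE

# An explicit byte order produces no BOM at all.
without_bom = text.encode('utf-16-le')

# The generic decoder reads the BOM, uses it to pick the byte order, and drops it.
decoded = (codecs.BOM_UTF16_LE + without_bom).decode('utf-16')

print(with_bom.startswith(native_bom), without_bom, decoded)
```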
Because a bytes object is immutable, attempting to change one of its elements
results in an error:
>>> a = bytes('tuv', 'utf-8')
>>> a
b'tuv'
>>> a[0] = 115
TypeError: 'bytes' object does not support item assignment
bytearray and encoding strings
Like bytes, a bytearray can be constructed in a number of ways:
>>> bytearray(5)
bytearray(b'\x00\x00\x00\x00\x00')
>>> bytearray([1, 2, 3])
bytearray(b'\x01\x02\x03')
>>> bytearray('tuv')
TypeError: string argument without an encoding
>>> bytearray('tuv', 'utf-8')
bytearray(b'tuv')
>>> bytearray('tuv', 'utf-16')
bytearray(b'\xff\xfet\x00u\x00v\x00')
>>> bytearray('tuv', 'utf-16-le')
bytearray(b't\x00u\x00v\x00')
Because a bytearray is mutable, you can modify its elements:
>>> a = bytearray('tuv', 'utf-8')
>>> a
bytearray(b'tuv')
>>> a[0]=115
>>> a
bytearray(b'suv')
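Beyond single-item assignment, a bytearray supports the usual mutable-sequence operations. A brief sketch with illustrative names, ending with a conversion to an immutable bytes snapshot:

```python
buf = bytearray(b'tuv')

# Replace one element; the value must be an int in range(256).
buf[0] = ord('s')

# Grow the buffer in place.
buf.extend(b'wx')

# Slice assignment takes any bytes-like object.
buf[1:3] = b'UV'

# bytes(buf) copies the current contents into an immutable object.
frozen = bytes(buf)

print(buf, frozen)
```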
appending bytes and bytearrays
bytes and bytearray objects may be concatenated with the + operator:
>>> a = bytes(3)
>>> a
b'\x00\x00\x00'
>>> b = bytearray(4)
>>> b
bytearray(b'\x00\x00\x00\x00')
>>> a+b
b'\x00\x00\x00\x00\x00\x00\x00'
>>> b+a
bytearray(b'\x00\x00\x00\x00\x00\x00\x00')
Note that the concatenated result takes on the type of the first operand, so a+b produces a bytes object and b+a produces a bytearray.
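When the result type matters, it can be made explicit instead of depending on operand order. In this sketch (names are illustrative), b''.join always returns bytes, while += on a bytearray accumulates in place:

```python
a = bytes(3)
b = bytearray(4)

# The left operand decides the type of the result.
left_bytes = a + b
left_bytearray = b + a

# join on a bytes separator returns bytes whatever the item types are.
joined = b''.join([a, b])

# In-place accumulation keeps the bytearray type.
acc = bytearray()
acc += a
acc += b

print(type(left_bytes).__name__, type(left_bytearray).__name__)
print(type(joined).__name__, len(acc))
```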
converting bytes and bytearray objects into strings
bytes and bytearray objects can be converted to strings using the decode method. You must pass the same encoding that was used to produce the bytes; a different one can raise an error or silently yield the wrong text. For example:
>>> a = bytes('tuv', 'utf-8')
>>> a
b'tuv'
>>> a.decode('utf-8')
'tuv'
>>> b = bytearray('tuv', 'utf-16-le')
>>> b
bytearray(b't\x00u\x00v\x00')
>>> b.decode('utf-16-le')
'tuv'
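To illustrate what can happen when the codec passed to decode differs from the one used to encode, here is a short sketch with illustrative names. It also shows the standard errors='replace' option, which substitutes U+FFFD for undecodable bytes instead of raising.

```python
# 'é' (U+00E9) becomes two bytes in UTF-8.
data = '\u00e9'.encode('utf-8')

# Decoding with the wrong codec succeeds but yields two wrong characters.
wrong = data.decode('latin-1')

# Some byte sequences are simply invalid in the target codec.
try:
    b'\xff'.decode('utf-8')
    failed = False
except UnicodeDecodeError:
    failed = True

# errors='replace' swaps each undecodable byte for U+FFFD.
replaced = b'\xff'.decode('utf-8', errors='replace')

print(data, repr(wrong), failed, repr(replaced))
```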

Convert Stringed Bytes Back Into Bytes

I'm working on a project which saves some bytes as a string, but I can't seem to figure out how to get the string back to actual bytes!
I've got this string:
"b'\x80\x03]q\x00(X\r\x00\x00\x00My First Noteq\x01X\x0e\x00\x00\x00My Second Noteq\x02e.'"
As you can see, the type() function on the data returns a string instead of bytes:
<class 'str'>
How can I convert this string back to bytes?
Any help is appreciated, Thanks!
Try:
x="b'\x80\x03]q\x00(X\r\x00\x00\x00My First Noteq\x01X\x0e\x00\x00\x00My Second Noteq\x02e.'"
y=x[2:-1].encode("latin-1")
>>> print(y)
b'\x80\x03]q\x00(X\r\x00\x00\x00My First Noteq\x01X\x0e\x00\x00\x00My Second Noteq\x02e.'
>>> print(type(y))
<class 'bytes'>
The string is just the bytes converted to a regular string without an encoding, so it carries the redundant b'...' wrapper; drop it and encode the rest. Use "latin-1" rather than "utf-8" here: Latin-1 maps each code point from 0 to 255 to the single byte with the same value, whereas UTF-8 would turn '\x80' into the two bytes b'\xc2\x80' and change the data.
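If the saved string instead holds the literal text of the repr, with the backslash escapes stored as ordinary characters (which is what str() on a bytes object produces), slicing off the wrapper is not enough. A minimal sketch for that case, assuming the string is a well-formed bytes literal, uses ast.literal_eval from the standard library; the sample value here is shortened and illustrative.

```python
import ast

original = b'\x80\x03]q\x00(X\r\x00\x00\x00My First Noteq\x01'

# str() on bytes gives the repr text, e.g. "b'\\x80\\x03]q...'".
stored = str(original)

# literal_eval parses Python literals only; it does not run arbitrary code.
restored = ast.literal_eval(stored)

print(type(stored).__name__, type(restored).__name__, restored == original)
```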
in python 3:
>>> a=b'\x00\x00\x00\x00\x07\x80\x00\x03'
>>> b = list(a)
>>> b
[0, 0, 0, 0, 7, 128, 0, 3]
>>> c = bytes(b)
>>> c
b'\x00\x00\x00\x00\x07\x80\x00\x03'

Python 3 convert int to dictionary

I want to convert this
(554, 334, 24, 15)
to
['554', '334', '24', '15']
If there is a similar question already, sorry, I didn't find one.
print(list(map(str, (554, 334, 24, 15))))
Result: ['554', '334', '24', '15']
map() applies the same function to each element of the tuple, in this case converting each int to a string; list() then collects the results into a list, which is printed.
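The same conversion can be written as a list comprehension, which some readers find easier to scan. A small sketch with illustrative names, plus a join in case a single comma-separated string is what is actually wanted:

```python
values = (554, 334, 24, 15)

# Convert each int to its decimal string form.
as_strings = [str(n) for n in values]

# Optionally combine the pieces into one string.
joined = ', '.join(as_strings)

print(as_strings)
print(joined)
```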

encode, length of character and width of display in python3

>>> line="你好".encode("gbk").rjust(10)
>>> print(line)
b'      \xc4\xe3\xba\xc3'
>>> print(line.decode("gbk"))
      你好
>>> print("你好".rjust(10))
        你好
>>> len("你好".rjust(10))
10
>>> len(line.decode("gbk"))
8
>>> len("你好".encode("gbk").rjust(10).decode("gbk"))
8
It is strange that len("你好".rjust(10)) is 10 but len("你好".encode("gbk").rjust(10).decode("gbk")) is 8; encoding and decoding seems to shrink the width by two characters.
What you are seeing is the difference between bytes and code points. When you take the len of a bytes object, you get the number of bytes. When you take the len of a str object, you get the number of Unicode code points.
line is a bytes object, composed of 10 bytes:
>>> line
b'      \xc4\xe3\xba\xc3'
>>> list(line)
[32, 32, 32, 32, 32, 32, 196, 227, 186, 195]
>>> len(line)
10
When you decode the bytes to a str, the str is composed of 8 code points:
>>> line.decode("gbk")
'      你好'
>>> list(line.decode("gbk"))
[' ', ' ', ' ', ' ', ' ', ' ', '你', '好']
>>> len(line.decode("gbk"))
8
The two bytes b'\xc4\xe3' get decoded to one code point:
>>> b'\xc4\xe3'.decode('gbk')
'你'
And the same goes for b'\xba\xc3'.
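The byte count depends on the encoding, while the code point count does not. The following sketch, with illustrative names, compares the two and reproduces the 8 versus 10 result from the question:

```python
text = '你好'

# Number of bytes needed for the same two code points in each encoding.
sizes = {enc: len(text.encode(enc)) for enc in ('gbk', 'utf-8', 'utf-16-le')}

# Padding the bytes counts bytes: 4 encoded bytes plus 6 spaces.
padded_bytes = text.encode('gbk').rjust(10).decode('gbk')

# Padding the str counts code points: 2 characters plus 8 spaces.
padded_str = text.rjust(10)

print(len(text), sizes)
print(len(padded_bytes), len(padded_str))
```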
Note that code points are not exactly the same as characters. A code point might be a combining accent mark, for example:
>>> print(u'a\u0300')
à
>>> len(u'a\u0300')
2
Some combining marks can be composed with another code point to form one code point. Indeed, that's the case with the example above:
>>> import unicodedata as UD
>>> UD.normalize('NFKC', 'a\u0300')
'à'
>>> len(UD.normalize('NFKC', 'a\u0300'))
1
However, not all combining marks can be so composed:
>>> UD.normalize('NFKC', 'a\u030b')
'a̋'
>>> len(UD.normalize('NFKC', 'a\u030b'))
2
So even if you normalize, you cannot assume that the number of characters you see is the number of code points in the str.
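For aligning text on screen, what usually matters is display width rather than byte or code point count. The sketch below is only an approximation: it assumes wide and fullwidth East Asian characters take two columns and combining marks take none, which holds for many terminals but not all. The helpers display_width and rjust_display are hypothetical names, not standard library functions.

```python
import unicodedata

def display_width(text):
    """Estimate how many terminal columns text occupies."""
    width = 0
    for ch in text:
        # Combining marks are drawn over the previous character.
        if unicodedata.combining(ch):
            continue
        # 'W' (wide) and 'F' (fullwidth) characters usually take two columns.
        width += 2 if unicodedata.east_asian_width(ch) in ('W', 'F') else 1
    return width

def rjust_display(text, width, fill=' '):
    """Right-justify text to a target display width, not a code point count."""
    return fill * max(0, width - display_width(text)) + text

print(display_width('你好'), display_width('a\u0300'))
print(repr(rjust_display('你好', 10)))
```

With this estimate, '你好' is padded with 6 spaces to fill 10 columns, which matches the visual result of padding the GBK bytes in the question.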

Resources