I'd like to understand Python 3's bytes and bytearray classes. I've seen documentation on them, but not a comprehensive description of their differences and how they interact with string objects.
bytes and bytearrays are similar...
python3's bytes and bytearray classes both hold arrays of bytes, where each byte can take on a value between 0 and 255. The primary difference is that a bytes object is immutable, meaning that once created, you cannot modify its elements. By contrast, a bytearray object allows you to modify its elements.
Both bytes and bytearray can be built from a string plus an encoding, and both provide a decode method that turns their contents back into a string.
bytes and encoding strings
A bytes object can be constructed in a few different ways:
>>> bytes(5)
b'\x00\x00\x00\x00\x00'
>>> bytes([116, 117, 118])
b'tuv'
>>> b'tuv'
b'tuv'
>>> bytes('tuv')
TypeError: string argument without an encoding
>>> bytes('tuv', 'utf-8')
b'tuv'
>>> 'tuv'.encode('utf-8')
b'tuv'
>>> 'tuv'.encode('utf-16')
b'\xff\xfet\x00u\x00v\x00'
>>> 'tuv'.encode('utf-16-le')
b't\x00u\x00v\x00'
Note the difference between the last two: 'utf-16' specifies a generic UTF-16
encoding, so its encoded form starts with a two-byte "byte order mark" (BOM),
here [0xff, 0xfe] for little-endian. When you specify an explicit byte order such
as 'utf-16-le', as in the latter example, the encoded form omits the byte order mark.
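A minimal sketch of the BOM behaviour described above. It assumes only the standard codecs module; the variable names are illustrative. The generic codec writes the BOM in the machine's native byte order, so the check accepts either order rather than hard-coding one.

```python
import codecs

text = 'tuv'

generic = text.encode('utf-16')       # BOM followed by the encoded text
explicit = text.encode('utf-16-le')   # no BOM, byte order fixed by the codec name

# The BOM is b'\xff\xfe' on little-endian hardware, b'\xfe\xff' on big-endian.
bom = generic[:2]
print(bom in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))  # True
print(len(generic) - len(explicit))                       # 2

# Decoding with the generic codec reads the BOM and then discards it.
print(generic.decode('utf-16'))                           # tuv

# Decoding with an explicit byte order keeps the BOM as the code point U+FEFF.
order = 'utf-16-le' if bom == codecs.BOM_UTF16_LE else 'utf-16-be'
kept = generic.decode(order)
print(kept == '\ufeff' + text)                            # True
```

So if you encode with 'utf-16', decode with 'utf-16' as well; mixing it with an explicit byte order leaves a stray U+FEFF at the start of the string.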
Because a bytes object is immutable, attempting to change one of its elements
results in an error:
>>> a = bytes('tuv', 'utf-8')
>>> a
b'tuv'
>>> a[0] = 115
TypeError: 'bytes' object does not support item assignment
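If you need a changed copy of a bytes object, you have to build a new one. The sketch below shows two ways that should work; the names b, c and scratch are illustrative only.

```python
a = bytes('tuv', 'utf-8')

# Option 1: slice and concatenate; this builds a brand-new bytes object.
b = b's' + a[1:]

# Option 2: copy into a bytearray, edit in place, then freeze the result.
scratch = bytearray(a)
scratch[0] = 115          # 115 is the byte value of 's'
c = bytes(scratch)

print(a)  # b'tuv'  (the original is untouched)
print(b)  # b'suv'
print(c)  # b'suv'
```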
bytearray and encoding strings
Like bytes, a bytearray can be constructed in a number of ways:
>>> bytearray(5)
bytearray(b'\x00\x00\x00\x00\x00')
>>> bytearray([1, 2, 3])
bytearray(b'\x01\x02\x03')
>>> bytearray('tuv')
TypeError: string argument without an encoding
>>> bytearray('tuv', 'utf-8')
bytearray(b'tuv')
>>> bytearray('tuv', 'utf-16')
bytearray(b'\xff\xfet\x00u\x00v\x00')
>>> bytearray('tuv', 'utf-16-le')
bytearray(b't\x00u\x00v\x00')
Because a bytearray is mutable, you can modify its elements:
>>> a = bytearray('tuv', 'utf-8')
>>> a
bytearray(b'tuv')
>>> a[0]=115
>>> a
bytearray(b'suv')
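Item assignment is only one of the in-place operations a bytearray supports. As a rough sketch (the buffer name buf is illustrative), it also behaves much like a list of small integers:

```python
buf = bytearray(b'tuv')

buf.append(119)          # append takes a single int in range(256); 119 is 'w'
buf.extend(b'xy')        # extend takes any iterable of ints, including bytes
buf[0:1] = b'st'         # slice assignment may change the length
del buf[-1]              # remove the last byte

print(buf)               # bytearray(b'stuvwx')
print(len(buf))          # 6

# Values outside 0..255 are rejected.
try:
    buf.append(256)
    rejected = False
except ValueError:
    rejected = True
print(rejected)          # True
```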
appending bytes and bytearrays
bytes and bytearray objects may be concatenated with the + operator:
>>> a = bytes(3)
>>> a
b'\x00\x00\x00'
>>> b = bytearray(4)
>>> b
bytearray(b'\x00\x00\x00\x00')
>>> a+b
b'\x00\x00\x00\x00\x00\x00\x00'
>>> b+a
bytearray(b'\x00\x00\x00\x00\x00\x00\x00')
Note that the concatenated result takes on the type of the first operand, so a+b produces a bytes object and b+a produces a bytearray.
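The type rule can be checked directly, and the same sketch shows a related difference: += extends a bytearray in place but rebinds a bytes name to a new object. The names alias and original are illustrative.

```python
a = bytes(3)
b = bytearray(4)

print(type(a + b).__name__)   # bytes
print(type(b + a).__name__)   # bytearray

# += on a bytearray extends the existing object in place...
alias = b
b += a
print(alias is b, len(alias))                 # True 7

# ...whereas += on a bytes name rebinds it to a new object.
original = a
a += b'\x01'
print(original is a, len(original), len(a))   # False 3 4
```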
converting bytes and bytearray objects into strings
bytes and bytearray objects can be converted to strings with the decode method. decode does not know how the bytes were produced, so you must pass the same encoding that was used to encode them. For example:
>>> a = bytes('tuv', 'utf-8')
>>> a
b'tuv'
>>> a.decode('utf-8')
'tuv'
>>> b = bytearray('tuv', 'utf-16-le')
>>> b
bytearray(b't\x00u\x00v\x00')
>>> b.decode('utf-16-le')
'tuv'
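What happens when the decoding does not match the encoding depends on the bytes involved. The sketch below, with illustrative variable names, shows the two outcomes I would expect: silently wrong text, or a UnicodeDecodeError.

```python
data = 'tuv'.encode('utf-16-le')   # b't\x00u\x00v\x00'

# Wrong codec, but every byte happens to be valid UTF-8: no error, wrong text.
wrong = data.decode('utf-8')
print(repr(wrong))                 # 't\x00u\x00v\x00'

# Wrong codec with a byte that is invalid in it: decode raises.
try:
    b'\xff\xfe'.decode('utf-8')
    raised = False
except UnicodeDecodeError as exc:
    raised = True
    print('decode failed:', exc.reason)

# The errors argument swaps the exception for the replacement character U+FFFD.
patched = b'\xfftuv'.decode('utf-8', errors='replace')
print(patched == '\ufffdtuv')      # True
```

The first case is the more dangerous one, because nothing signals that the result is wrong.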
Related
I'm working on a project which saves some bytes as a string, but I can't figure out how to turn the string back into actual bytes!
I've got this string:
"b'\x80\x03]q\x00(X\r\x00\x00\x00My First Noteq\x01X\x0e\x00\x00\x00My Second Noteq\x02e.'"
As you can see, the type() function on the data returns a string instead of bytes:
<class 'str'>
How can I convert this string back to bytes?
Any help is appreciated, Thanks!
Try:
x="b'\x80\x03]q\x00(X\r\x00\x00\x00My First Noteq\x01X\x0e\x00\x00\x00My Second Noteq\x02e.'"
y=x[2:-1].encode("latin-1")  # latin-1 maps code points 0-255 to the same byte values; utf-8 would turn \x80 into \xc2\x80
>>> print(y)
b'\x80\x03]q\x00(X\r\x00\x00\x00My First Noteq\x01X\x0e\x00\x00\x00My Second Noteq\x02e.'
>>> print(type(y))
<class 'bytes'>
Your data is a bytes object that was converted to a regular string without decoding, so it still carries the b'...' wrapper from the bytes repr. Drop the wrapper and encode what is left.
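If the string was produced by calling str() on a bytes object (an assumption; the question does not say how it was saved), it contains literal backslash escapes rather than raw characters. In that case ast.literal_eval from the standard library can parse it back. The payload below is a shortened stand-in for the data in the question.

```python
import ast

original = b'\x80\x03]q\x00(X\r\x00\x00\x00My First Note'

# str() of a bytes object yields its repr: the text contains a literal
# backslash, 'x', '8', '0', ... rather than the raw byte 0x80.
as_text = str(original)

# literal_eval parses the text as a Python literal and rebuilds the bytes.
restored = ast.literal_eval(as_text)

print(type(restored).__name__)   # bytes
print(restored == original)      # True
```

literal_eval only accepts Python literals, so it is safer than eval for data read from a file. Storing the bytes in binary mode in the first place would avoid the round trip entirely.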
In Python 3, a bytes object converts to a list of ints and back:
>>> a=b'\x00\x00\x00\x00\x07\x80\x00\x03'
>>> b = list(a)
>>> b
[0, 0, 0, 0, 7, 128, 0, 3]
>>> c = bytes(b)
>>> c
b'\x00\x00\x00\x00\x07\x80\x00\x03'
>>>
What does the line if Enquiry(lis1).size: do? Can .size be used directly on the function call like this, and if so, what does the parameter lis1 receive in the definition def Enquiry(lis1):? Please elaborate, because I am a beginner in Python.
import numpy
def Enquiry(lis1):
    return numpy.array(lis1)
lis1 = []
if Enquiry(lis1).size:
    print("Not Empty")
else:
    print("Empty")
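For a numpy array, size is an attribute holding the total number of elements in the array. The parameter lis1 receives whatever object you pass in, here the empty list; numpy.array accepts any Python object exposing the array interface, including (nested) sequences. And yes, you can use .size directly on the call Enquiry(lis1): the call is evaluated first and returns a numpy.ndarray, and .size is then read from that object. An empty array has size 0, which is falsy, so the else branch runs.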
sample usage:
>>> import numpy
>>> v = numpy.array([1, 2, 3])
>>> v.size
3
>>> dir(v)
[..., 'shape', 'size', 'sort', ...]
>>>
>>> getattr(v, 'size')
3
The function signature (use help(numpy.array) to see it):
Help on built-in function array in module numpy.core.multiarray:
array(...)
    array(object, dtype=None, copy=True, order='K', subok=False, ndmin=0)
    Create an array.
    Parameters
    ----------
    object : array_like
        An array, any object exposing the array interface, an object whose
        __array__ method returns an array, or any (nested) sequence.
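To make the emptiness check concrete, here is a small sketch that assumes NumPy is installed. The lowercase enquiry function is a stand-in for the Enquiry function in the question.

```python
import numpy

def enquiry(values):
    """Convert a sequence to a NumPy array (stand-in for Enquiry above)."""
    return numpy.array(values)

print(enquiry([]).size)                       # 0
print(enquiry([1, 2, 3]).size)                # 3
print(enquiry([[1, 2, 3], [4, 5, 6]]).size)   # 6  (total elements, not rows)

# For a plain list, truthiness already answers "is it empty?".
print(bool([]), bool([0]))                    # False True
```

If all you have is a list, if lis1: is enough and avoids building an array just to test for emptiness.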
I have an np.array startIdx originating from a list of tuples consisting of integer and float fields:
>>> startIdx, someInt, someFloat = np.array(resultList).T
>>> startIdx
array([0.0, 111.0, 333.0]) # 10 to a few 100 positive values of the order of 100 to 10000
>>> round(startIdx[2])
333.0 # oops
>>> help(round)
Round [...] returns an int when called with one argument, otherwise the same type as the number.
>>> round(np.pi)
3
>>> round(np.pi, 2) # the optional argument is the number of decimal digits
3.14
>>> round([0.0, 111.0, 333.0][2]) # to test whether it's specific for numpy arrays.
333
The float currently works (as index into numpy arrays) but yields a warning:
VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
I could avoid the conversion from tuples to arrays (and int to float) by collecting my results in a grossly oversized record array (with an int field 'startIdx').
I could use something like int(. + 0.1), which is also ugly. Would int(round(.)) or even int(.) safely yield correct results?
In [70]: startIdx=np.array([0.0, 111.0, 333.0])
In [71]: startIdx
Out[71]: array([ 0., 111., 333.])
If you need an integer array, use astype:
In [72]: startIdx.astype(int)
Out[72]: array([ 0, 111, 333])
not round:
In [73]: np.round(startIdx)
Out[73]: array([ 0., 111., 333.])
np.array(resultList) produces a float dtype array because some values are float. arr=np.array(resultList, dtype='i,i,f') should produce a structured array with integer and float fields, assuming resultList is indeed a list of tuples.
startIdx = arr['f0']
should then be an integer dtype array.
I expect the memory use of the structured array to be the same as for the float one.
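On the question of whether int(.) is safe: int() truncates toward zero (as does astype(int)), so it is only safe when the float is exactly integral. The plain-Python sketch below, with made-up sample values, shows the difference from rounding first, and checks that integers in the range mentioned in the question survive a trip through float unchanged.

```python
exact = 333.0
slightly_low = 332.9999999     # hypothetical value carrying some rounding error

print(int(exact), int(round(exact)))                 # 333 333
print(int(slightly_low), int(round(slightly_low)))   # 332 333

# Integers of this magnitude are represented exactly as floats,
# so converting int -> float -> int loses nothing.
lossless = all(int(float(n)) == n for n in range(20001))
print(lossless)                                      # True
```

So if the values started out as ints and were only upcast by np.array, truncation gives the right answer; int(round(.)) is the safer choice when the floats came out of arithmetic.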
>>> line="你好".encode("gbk").rjust(10)
>>> print(line)
b'      \xc4\xe3\xba\xc3'
>>> print(line.decode("gbk"))
      你好
>>> print("你好".rjust(10))
        你好
>>> len("你好".rjust(10))
10
>>> len(line.decode("gbk"))
8
>>> len("你好".encode("gbk").rjust(10).decode("gbk"))
8
It seems strange that len("你好".rjust(10)) is 10 while len("你好".encode("gbk").rjust(10).decode("gbk")) is 8. How can an encode/decode round trip shrink the width by two characters?
What you are seeing is the difference between bytes and code points. When you take the len of a bytes object, you get the number of bytes. When you take the len of a str object, you get the number of Unicode code points.
line is a bytes object, composed of 10 bytes:
>>> line
b'      \xc4\xe3\xba\xc3'
>>> list(line)
[32, 32, 32, 32, 32, 32, 196, 227, 186, 195]
>>> len(line)
10
When you decode the bytes to a str, the str is composed of 8 code points:
>>> line.decode("gbk")
'      你好'
>>> list(line.decode("gbk"))
[' ', ' ', ' ', ' ', ' ', ' ', '你', '好']
>>> len(line.decode("gbk"))
8
The two bytes b'\xc4\xe3' get decoded to one code point:
>>> b'\xc4\xe3'.decode('gbk')
'你'
And the same goes for b'\xba\xc3'.
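The byte count depends on the encoding, while the code point count does not. A short sketch, reusing the string from the question with illustrative variable names:

```python
text = '你好'

print(len(text))                    # 2  code points
print(len(text.encode('gbk')))      # 4  bytes, two per character
print(len(text.encode('utf-8')))    # 6  bytes, three per character

# Padding before encoding counts code points; padding after counts bytes.
padded_str = text.rjust(10)
padded_bytes = text.encode('gbk').rjust(10)
print(len(padded_str), len(padded_bytes.decode('gbk')))   # 10 8
```

That is why the padded bytes decode to only 8 code points: rjust added 6 bytes of padding to a 4-byte value, not 8 characters to a 2-character string.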
Note that code points are not exactly the same as characters. A code point might be a combining accent mark, for example:
>>> print(u'a\u0300')
à
>>> len(u'a\u0300')
2
Some combining marks can be composed with another code point to form one code point. Indeed, that's the case with the example above:
>>> import unicodedata as UD
>>> UD.normalize('NFKC', 'a\u0300')
'à'
>>> len(UD.normalize('NFKC', 'a\u0300'))
1
However, not all combining marks can be so composed:
>>> UD.normalize('NFKC', 'a\u030b')
'a̋'
>>> len(UD.normalize('NFKC', 'a\u030b'))
2
So even if you normalize, you cannot assume that the number of characters you see is the number of code points in the str.
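One rough way to get closer to the visible character count is to skip combining marks when counting. The helper below is hypothetical and only an approximation: it relies on unicodedata.combining and does not handle every case (emoji sequences, for instance, need full grapheme cluster segmentation).

```python
import unicodedata

def count_base_code_points(text):
    """Count code points whose canonical combining class is 0 (hypothetical helper)."""
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)

print(len('a\u0300'), count_base_code_points('a\u0300'))   # 2 1
print(len('a\u030b'), count_base_code_points('a\u030b'))   # 2 1
print(len('你好'), count_base_code_points('你好'))           # 2 2
```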
In Python, I am trying to use a byte string to handle some 8-bit character data. I find that a byte string does not necessarily behave like a string. With a subscript, it returns a number instead of a byte string of length 1.
In [243]: s=b'hello'
In [244]: s[1]
Out[244]: 101
In [245]: s[1:2]
Out[245]: b'e'
This makes it really difficult to iterate over it. For example, this code works with a string but fails for a byte string.
In [260]: d = {b'e': b'E', b'h': b'H', b'l': b'L', b'o': b'O'}
In [261]: list(map(d.get, s))
Out[261]: [None, None, None, None, None]
This breaks some code from Python 2. I also find this irregularity really inconvenient. Does anyone have any insight into what's going on with byte strings?
Byte strings store byte values in the range 0-255. The repr of bytes is just a convenience to view them, but they are storing data not text. Observe:
>>> x=bytes([104,101,108,108,111])
>>> x
b'hello'
>>> x[0]
104
>>> x[1]
101
>>> list(x)
[104, 101, 108, 108, 111]
Use strings for text. If starting with bytes, decode them appropriately:
>>> s=b'hello'.decode('ascii')
>>> d = dict(zip('hello','HELLO'))
>>> list(map(d.get,s))
['H', 'E', 'L', 'L', 'O']
But if you want to work with bytes:
>>> d=dict(zip(b'hello',b'HELLO'))
>>> d
{104: 72, 108: 76, 101: 69, 111: 79}
>>> list(map(d.get,b'hello'))
[72, 69, 76, 76, 79]
>>> bytes(map(d.get,b'hello'))
b'HELLO'
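For byte-for-byte substitutions like this one, the built-in translation table may be simpler than a dict. A minimal sketch using bytes.maketrans and translate, with an illustrative table name:

```python
# Map each byte in the first argument to the byte at the same position
# in the second; all other byte values are left unchanged.
table = bytes.maketrans(b'helo', b'HELO')

upper = b'hello'.translate(table)
print(upper)                                  # b'HELLO'

# bytearray offers the same method and returns a bytearray.
print(bytearray(b'hello').translate(table))   # bytearray(b'HELLO')
```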
You can simply decode the bytes, pick the element you want, and encode it back:
s=b'hello'
t = s.decode()
print(t[1]) # This gives a str of length 1 (Python has no separate char type)
print(t[1].encode()) # This gives a bytes object of length 1
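If you would rather stay in bytes and keep the dictionary from the question, slicing gives length-1 bytes objects where indexing gives ints. A sketch with an illustrative list named pieces:

```python
s = b'hello'
d = {b'e': b'E', b'h': b'H', b'l': b'L', b'o': b'O'}

# Iterating a bytes object yields ints, so slice out one byte at a time instead.
pieces = [s[i:i + 1] for i in range(len(s))]
print(pieces)              # [b'h', b'e', b'l', b'l', b'o']

result = b''.join(d[p] for p in pieces)
print(result)              # b'HELLO'
```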