>>> line="你好".encode("gbk").rjust(10)
>>> print(line)
b'      \xc4\xe3\xba\xc3'
>>> print(line.decode("gbk"))
      你好
>>> print("你好".rjust(10))
        你好
>>> len("你好".rjust(10))
10
>>> len(line.decode("gbk"))
8
>>> len("你好".encode("gbk").rjust(10).decode("gbk"))
8
It seems strange that len("你好".rjust(10)) is 10 but len("你好".encode("gbk").rjust(10).decode("gbk")) is 8. Why does encoding, padding, and then decoding make the string two characters shorter?
What you are seeing is the difference between bytes and code points. When you take the len of a bytes object, you get the number of bytes. When you take the len of a str object, you get the number of Unicode code points.
line is a bytes object, composed of 10 bytes:
>>> line
b'      \xc4\xe3\xba\xc3'
>>> list(line)
[32, 32, 32, 32, 32, 32, 196, 227, 186, 195]
>>> len(line)
10
When you decode the bytes to a str, the str is composed of 8 code points:
>>> line.decode("gbk")
'      你好'
>>> list(line.decode("gbk"))
[' ', ' ', ' ', ' ', ' ', ' ', '你', '好']
>>> len(line.decode("gbk"))
8
The two bytes b'\xc4\xe3' get decoded to one code point:
>>> b'\xc4\xe3'.decode('gbk')
'你'
And the same goes for b'\xba\xc3'.
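The order of padding and encoding matters for the same reason. A minimal sketch (the variable names are illustrative, not from the question) comparing padding the str with padding the bytes:

```python
text = "你好"

# Pad first, then encode: rjust counts code points, so 8 spaces are added.
padded_then_encoded = text.rjust(10).encode("gbk")

# Encode first, then pad: rjust counts bytes, so only 6 spaces are added,
# because the two characters already occupy 4 bytes in GBK.
encoded_then_padded = text.encode("gbk").rjust(10)

print(len(padded_then_encoded))                # 12 bytes: 8 spaces + 4 bytes
print(len(encoded_then_padded))                # 10 bytes: 6 spaces + 4 bytes
print(len(encoded_then_padded.decode("gbk")))  # 8 code points: 6 spaces + 2
```

If the goal is a str of 10 code points, pad the str before encoding rather than padding the encoded bytes.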
Note that code points are not exactly the same as characters. A code point might be a combining accent mark, for example:
>>> print(u'a\u0300')
à
>>> len(u'a\u0300')
2
Some combining marks can be composed with another code point to form one code point. Indeed, that's the case with the example above:
>>> import unicodedata as UD
>>> UD.normalize('NFKC', 'a\u0300')
'à'
>>> len(UD.normalize('NFKC', 'a\u0300'))
1
However, not all combining marks can be so composed:
>>> UD.normalize('NFKC', 'a\u030b')
'a̋'
>>> len(UD.normalize('NFKC', 'a\u030b'))
2
So even if you normalize, you cannot assume that the number of characters you see is the number of code points in the str.
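If a rough count of visible characters is enough, one option is to skip combining marks when counting. The sketch below is only an approximation under that assumption: count_base_code_points is a hypothetical helper name, and it does not perform full grapheme-cluster segmentation (for example, it does not handle emoji joined with zero-width joiners).

```python
import unicodedata

def count_base_code_points(s):
    """Count code points that are not combining marks (a rough estimate)."""
    # unicodedata.combining returns 0 for non-combining code points.
    return sum(1 for ch in s if unicodedata.combining(ch) == 0)

print(count_base_code_points('a\u0300'))  # 1, although len() is 2
print(count_base_code_points('a\u030b'))  # 1, although len() is 2
print(count_base_code_points('你好'))      # 2
```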
I currently have an array of values and an Awkward Array of integer indices. I want an Awkward Array with the same structure, but where each integer is replaced by the element of "values" at that index. For instance:
values = ak.Array(np.random.rand(100))
arr = ak.Array((np.random.randint(0, 100, 33), np.random.randint(0, 100, 125)))
I want something like values[arr], but that gives the following error:
>>> values[arr]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Anaconda3\lib\site-packages\awkward\highlevel.py", line 943, in __getitem__
    return ak._util.wrap(self._layout[where], self._behavior)
ValueError: cannot fit jagged slice with length 2 into RegularArray of size 100
If I run it with a loop, I get back what I want:
>>> values = ([values[i] for i in arr])
>>> values
[<Array [0.842, 0.578, 0.159, ... 0.726, 0.702] type='33 * float64'>, <Array [0.509, 0.45, 0.202, ... 0.906, 0.367] type='125 * float64'>]
Is there another way to do this, or is this it? I'm afraid it'll be too slow for my application.
Thanks!
If you're trying to avoid Python for loops for performance, note that the first line casts a NumPy array as Awkward with ak.from_numpy (no loop, very fast):
>>> values = ak.Array(np.random.rand(100))
but the second line iterates over data in Python (has a slow loop):
>>> arr = ak.Array((np.random.randint(0, 100, 33), np.random.randint(0, 100, 125)))
because a tuple of two NumPy arrays is not a NumPy array. It's a generic iterable, and the constructor falls back to ak.from_iter.
On your main question, arr can't slice values because arr is a jagged array and values is not:
>>> values
<Array [0.272, 0.121, 0.167, ... 0.152, 0.514] type='100 * float64'>
>>> arr
<Array [[15, 24, 9, 42, ... 35, 75, 20, 10]] type='2 * var * int64'>
Note the types: values has type 100 * float64 and arr has type 2 * var * int64. There's no rule for values[arr].
Since it looks like you want to slice values with arr[0] and then arr[1] (from your list comprehension), it could be done in a vectorized way by duplicating values for each element of arr, then slicing.
>>> # The np.newaxis is to give values a length-1 dimension before concatenating.
>>> duplicated = ak.concatenate([values[np.newaxis]] * 2)
>>> duplicated
<Array [[0.272, 0.121, ... 0.152, 0.514]] type='2 * 100 * float64'>
Now duplicated has length 2 and one level of nesting, just like arr, so arr can slice it. The resulting array also has length 2, but the length of each sublist is the length of each sublist in arr, rather than 100.
>>> duplicated[arr]
<Array [[0.225, 0.812, ... 0.779, 0.665]] type='2 * var * float64'>
>>> ak.num(duplicated[arr])
<Array [33, 125] type='2 * int64'>
If you're scaling up from 2 such lists to a large number, then this would eat up a lot of memory. Then again, the size of the output of this operation would also scale as "length of values" × "length of arr". If this "2" is not going to scale up (if it will be at most thousands, not millions or more), then I wouldn't worry about the speed of the Python for loop. Python scales well for thousands, but not billions (depending, of course, on the size of the things being scaled!).
I'd like to understand Python 3's bytes and bytearray classes. I've seen documentation on them, but not a comprehensive description of their differences and how they interact with string objects.
bytes and bytearrays are similar...
Python 3's bytes and bytearray classes both hold arrays of bytes, where each byte can take on a value between 0 and 255. The primary difference is that a bytes object is immutable, meaning that once created, you cannot modify its elements. By contrast, a bytearray object allows you to modify its elements.
Both bytes and bytearray provide functions to encode and decode strings.
bytes and encoding strings
A bytes object can be constructed in a few different ways:
>>> bytes(5)
b'\x00\x00\x00\x00\x00'
>>> bytes([116, 117, 118])
b'tuv'
>>> b'tuv'
b'tuv'
>>> bytes('tuv')
TypeError: string argument without an encoding
>>> bytes('tuv', 'utf-8')
b'tuv'
>>> 'tuv'.encode('utf-8')
b'tuv'
>>> 'tuv'.encode('utf-16')
b'\xff\xfet\x00u\x00v\x00'
>>> 'tuv'.encode('utf-16-le')
b't\x00u\x00v\x00'
Note the difference between the last two: 'utf-16' specifies a generic UTF-16
encoding, so its encoded form includes a two-byte "byte order mark" (BOM)
preamble, here [0xff, 0xfe] because the machine is little-endian. When you
specify an explicit byte order, such as 'utf-16-le' in the latter example, the
encoded form omits the byte order mark.
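The effect of the byte order mark on decoding can be checked with the standard codecs module. A minimal sketch (the variable names are illustrative):

```python
import codecs

generic = "tuv".encode("utf-16")      # BOM + code units in native byte order
explicit = "tuv".encode("utf-16-le")  # no BOM, little-endian code units

# The BOM is one of two fixed two-byte sequences.
print(generic[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))  # True

# Decoding with the generic codec consumes the BOM.
print(generic.decode("utf-16"))       # tuv

# Decoding BOM-prefixed data with an explicit codec keeps it as U+FEFF.
with_bom = codecs.BOM_UTF16_LE + explicit
print(repr(with_bom.decode("utf-16-le")))  # '\ufefftuv'
```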
Because a bytes object is immutable, attempting to change one of its elements
results in an error:
>>> a = bytes('tuv', 'utf-8')
>>> a
b'tuv'
>>> a[0] = 115
TypeError: 'bytes' object does not support item assignment
bytearray and encoding strings
Like bytes, a bytearray can be constructed in a number of ways:
>>> bytearray(5)
bytearray(b'\x00\x00\x00\x00\x00')
>>> bytearray([1, 2, 3])
bytearray(b'\x01\x02\x03')
>>> bytearray('tuv')
TypeError: string argument without an encoding
>>> bytearray('tuv', 'utf-8')
bytearray(b'tuv')
>>> bytearray('tuv', 'utf-16')
bytearray(b'\xff\xfet\x00u\x00v\x00')
>>> bytearray('tuv', 'utf-16-le')
bytearray(b't\x00u\x00v\x00')
Because a bytearray is mutable, you can modify its elements:
>>> a = bytearray('tuv', 'utf-8')
>>> a
bytearray(b'tuv')
>>> a[0]=115
>>> a
bytearray(b'suv')
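Item assignment is only one of the in-place operations a bytearray supports. A short sketch of a few others, with illustrative variable names:

```python
buf = bytearray(b"tuv")
buf[0] = ord("s")    # item assignment takes an int in range(256)
buf.append(0x21)     # append a single byte ('!')
buf.extend(b"??")    # append several bytes at once
buf[1:3] = b"UV"     # slice assignment takes a bytes-like object
print(buf)           # bytearray(b'sUV!??')

frozen = bytes(buf)  # immutable copy; later edits to buf don't affect it
buf.clear()
print(frozen)        # b'sUV!??'
```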
appending bytes and bytearrays
bytes and bytearray objects may be concatenated with the + operator:
>>> a = bytes(3)
>>> a
b'\x00\x00\x00'
>>> b = bytearray(4)
>>> b
bytearray(b'\x00\x00\x00\x00')
>>> a+b
b'\x00\x00\x00\x00\x00\x00\x00'
>>> b+a
bytearray(b'\x00\x00\x00\x00\x00\x00\x00')
Note that the concatenated result takes on the type of the first argument, so a+b produces a bytes object and b+a produces a bytearray.
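The same rule can be checked directly, alongside two related ways of combining byte sequences (join and in-place +=). A minimal sketch:

```python
a = bytes(3)
b = bytearray(4)

print(type(a + b).__name__)             # bytes
print(type(b + a).__name__)             # bytearray
# join returns the type of the separator it is called on.
print(type(b"".join([a, b])).__name__)  # bytes

b += a   # in-place concatenation extends the bytearray
print(len(b))                           # 7
```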
converting bytes and bytearray objects into strings
bytes and bytearray objects can be converted to strings using the decode method. You must decode with the same encoding that was used to encode the data. For example:
>>> a = bytes('tuv', 'utf-8')
>>> a
b'tuv'
>>> a.decode('utf-8')
'tuv'
>>> b = bytearray('tuv', 'utf-16-le')
>>> b
bytearray(b't\x00u\x00v\x00')
>>> b.decode('utf-16-le')
'tuv'
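Decoding with a different encoding than the one used for encoding either fails or silently produces the wrong text. A small sketch of both outcomes, using the optional errors argument of decode (variable names are illustrative):

```python
data = "tuv".encode("utf-16-le")  # b't\x00u\x00v\x00'

# A mismatched codec may "succeed" and silently return the wrong text.
wrong = data.decode("utf-8")
print(repr(wrong))                # 't\x00u\x00v\x00'

# Or it may fail outright when the bytes are invalid in that encoding.
failed = False
try:
    b"\xc4\xe3".decode("utf-8")
except UnicodeDecodeError as exc:
    failed = True
    print("decode failed:", exc.reason)

# The errors argument trades the exception for replacement characters.
replaced = b"\xc4\xe3".decode("utf-8", errors="replace")
print(replaced)
```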
I'm working on a project which saves some bytes as a string, but I can't seem to figure out how to convert the string back to actual bytes!
I've got this string:
"b'\x80\x03]q\x00(X\r\x00\x00\x00My First Noteq\x01X\x0e\x00\x00\x00My Second Noteq\x02e.'"
As you can see, the type() function on the data returns a string instead of bytes:
<class 'str'>
How can I convert this string back to bytes?
Any help is appreciated, Thanks!
Try:
>>> x = "b'\x80\x03]q\x00(X\r\x00\x00\x00My First Noteq\x01X\x0e\x00\x00\x00My Second Noteq\x02e.'"
>>> y = x[2:-1].encode("latin-1")
>>> print(y)
b'\x80\x03]q\x00(X\r\x00\x00\x00My First Noteq\x01X\x0e\x00\x00\x00My Second Noteq\x02e.'
>>> print(type(y))
<class 'bytes'>
Your bytes were converted to a regular string without an encoding, so the string still carries the b'...' wrapper from the bytes repr. Drop those characters with the slice and encode what is left. Note that "latin-1" maps each code point below 256 to the single byte with the same value; encoding with "utf-8" instead would turn '\x80' into the two bytes b'\xc2\x80' and corrupt the data.
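Which conversion is right depends on what the string actually contains. The sketch below covers two cases under stated assumptions: a string holding the repr of the bytes (with literal backslashes), and a string holding one code point per original byte. The sample data is a shortened, illustrative version of the string in the question.

```python
import ast

original = b"\x80\x03]q\x00(X\r\x00\x00\x00My First Note"

# Case 1: the string is the repr of the bytes, e.g. produced by repr(original).
# It contains literal backslashes, so parse it as a Python literal.
as_repr = repr(original)
print(ast.literal_eval(as_repr) == original)         # True

# Case 2: the string holds one code point per original byte.
# latin-1 maps code points 0-255 to the same byte values.
as_code_points = original.decode("latin-1")
print(as_code_points.encode("latin-1") == original)  # True
```

ast.literal_eval only evaluates Python literals, so it is a safer choice than eval for strings read from a file.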
In Python 3:
>>> a=b'\x00\x00\x00\x00\x07\x80\x00\x03'
>>> b = list(a)
>>> b
[0, 0, 0, 0, 7, 128, 0, 3]
>>> c = bytes(b)
>>> c
b'\x00\x00\x00\x00\x07\x80\x00\x03'
>>>
I have a base64 string and I'm trying to figure out what it was, but I can't see anything. What am I doing wrong? Is this
>>> import base64
>>> b = base64.b64decode("FAAAAAMAAAAGAAAACQAAAAwAAAA=")
>>> b
b'\x14\x00\x00\x00\x03\x00\x00\x00\x06\x00\x00\x00\t\x00\x00\x00\x0c\x00\x00\x00'
>>> print(b.decode("utf16"))
>>> print(b.decode("utf8"))
>>>
If it is Base64 encoding, then it is neither UTF-16 nor UTF-8. Have a look at RFC 3548. The Base 64 encoding is described on page 4 of the document.
Actually, the very purpose is different. The UTF-x encodings encode a Unicode string into a binary stream; the abstract string is the decoded form. Base64 and similar encodings, on the other hand, encode the original binary data into a stream of selected ASCII characters so that the binary content can be transferred through channels, such as e-mail, that accept only text. The binary data is the decoded, original form.
In your case, it looks as if a series of 32-bit integers was transferred: 20, 3, 6, 9, and 12.
Updated later to answer the comment below: How I got the values...
b'\x14\x00\x00\x00\x03\x00\x00\x00\x06\x00\x00\x00\t\x00\x00\x00\x0c\x00\x00\x00'
The b prefix says that the literal has a value of the bytes type. A bytes object is a sequence of small integers, each one byte wide, that is, from 0 to 255. When it is displayed as a literal, a byte that has no printable ASCII character is shown in hexadecimal notation: \x followed by two hexadecimal digits. The \t is the representation of the tab character, which has the ordinal value 9.
However, you can also convert it to the list of integers:
>>> list(b)
[20, 0, 0, 0, 3, 0, 0, 0, 6, 0, 0, 0, 9, 0, 0, 0, 12, 0, 0, 0]
Now it is more apparent. The zeros are padding: each value is small enough to fit in a single byte but is stored in four. The byte order (least significant byte first) reflects the little-endian layout of the machine that produced the data. So, written as five 32-bit integers in hexadecimal, the values are:
00000014 00000003 00000006 00000009 0000000c
Which is:
20 3 6 9 12
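The same five values can be recovered programmatically. A minimal sketch using the standard struct module, assuming the data really is five little-endian signed 32-bit integers (that interpretation is a guess from the byte pattern, not something the data itself states):

```python
import base64
import struct

raw = base64.b64decode("FAAAAAMAAAAGAAAACQAAAAwAAAA=")

# "<5i" = little-endian, five signed 32-bit integers (an assumption about the data).
values = struct.unpack("<5i", raw)
print(values)  # (20, 3, 6, 9, 12)

# The same result without struct, 4 bytes at a time.
manual = [int.from_bytes(raw[i:i + 4], "little") for i in range(0, len(raw), 4)]
print(manual)  # [20, 3, 6, 9, 12]
```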
In other words, the b'\x14\x00\x00\x00\x03\x00\x00\x00\x06\x00\x00\x00\t\x00\x00\x00\x0c\x00\x00\x00' is actually not a string. It is a bytes literal that captures the value of 5 * 4 bytes. The bytes is a sequence of small integers, not of characters. It is more apparent when you try:
>>> for value in b:
...     print(value)
...
20
0
0
0
3
0
0
0
6
0
0
0
9
0
0
0
12
0
0
0
>>> type(b)
<class 'bytes'>
>>> type(b[0])
<class 'int'>
>>>
In Python, I am trying to use a byte string to handle some 8-bit character data. I find that a byte string does not necessarily behave in a string-like way. With a subscript, it returns a number instead of a byte string of length 1.
In [243]: s=b'hello'
In [244]: s[1]
Out[244]: 101
In [245]: s[1:2]
Out[245]: b'e'
This makes it really difficult to iterate over. For example, this code works with a string but fails for a byte string.
In [260]: d = {b'e': b'E', b'h': b'H', b'l': b'L', b'o': b'O'}
In [261]: list(map(d.get, s))
Out[261]: [None, None, None, None, None]
This breaks some code from Python 2. I also find this irregularity really inconvenient. Does anyone have any insight into what's going on with byte strings?
Byte strings store byte values in the range 0-255. The repr of bytes is just a convenience to view them, but they are storing data not text. Observe:
>>> x=bytes([104,101,108,108,111])
>>> x
b'hello'
>>> x[0]
104
>>> x[1]
101
>>> list(x)
[104, 101, 108, 108, 111]
Use strings for text. If starting with bytes, decode them appropriately:
>>> s=b'hello'.decode('ascii')
>>> d = dict(zip('hello','HELLO'))
>>> list(map(d.get,s))
['H', 'E', 'L', 'L', 'O']
But if you want to work with bytes:
>>> d=dict(zip(b'hello',b'HELLO'))
>>> d
{104: 72, 108: 76, 101: 69, 111: 79}
>>> list(map(d.get,b'hello'))
[72, 69, 76, 76, 79]
>>> bytes(map(d.get,b'hello'))
b'HELLO'
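Two other options keep the work in bytes: iterate over length-1 slices so that a dict keyed by bytes still matches, or use a translation table. A short sketch reusing s and d from the question (pieces and table are illustrative names):

```python
s = b'hello'
d = {b'e': b'E', b'h': b'H', b'l': b'L', b'o': b'O'}

# Iterate over length-1 slices instead of integers so the dict keys match.
pieces = [s[i:i + 1] for i in range(len(s))]
print(list(map(d.get, pieces)))  # [b'H', b'E', b'L', b'L', b'O']

# For byte-for-byte substitution, a translation table avoids the dict entirely.
table = bytes.maketrans(b'helo', b'HELO')
print(s.translate(table))        # b'HELLO'
```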
You can simply decode the byte string, pick the element you want, and encode it back:
s=b'hello'
t = s.decode()
print(t[1]) # This gives a str of length 1
print(t[1].encode()) # This gives a bytes object of length 1