Why is passing bytes to class str constructor special?

The official Python 3 docs say this about passing bytes to the single-argument constructor for class str:
Passing a bytes object to str() without the encoding or errors
arguments falls under the first case of returning the informal string
representation (see also the -b command-line option to Python).
Ref: https://docs.python.org/3/library/stdtypes.html#str
informal string representation -> Huh?
Using the Python console (REPL), I see the following weirdness:
>>> ''
''
>>> b''
b''
>>> str()
''
>>> str('')
''
>>> str(b'')
"b''" # What the heck is this?
>>> str(b'abc')
"b'abc'"
>>> "x" + str(b'')
"xb''" # Woah.
(The question title can be improved -- I'm struggling to find a better one. Please help to clarify.)

The concept behind str seems to be that it returns a "nicely printable" string, usually in a human understandable form. The documentation actually uses the phrase "nicely printable":
If neither encoding nor errors is given, str(object) returns
object.__str__(), which is the “informal” or nicely printable string
representation of object. For string objects, this is the string
itself. If object does not have a __str__() method, then str() falls
back to returning repr(object).
With that in mind, note that str of a tuple or list produces string versions such as:
>>> str( (1, 2) )
'(1, 2)'
>>> str( [1, 3, 5] )
'[1, 3, 5]'
Python considers the above to be the "nicely printable" form for these objects. With that as background, the following seems a bit more reasonable:
>>> str(b'abc')
"b'abc'"
With no encoding provided, the bytes b'abc' are just bytes, not characters. Thus, str falls back to the "nicely printable" form and the six character string b'abc' is nicely printable.
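To make the difference concrete, here is a small sketch (the variable names are only for illustration) that contrasts the one-argument call with a call that supplies an encoding:

```python
data = b'abc'

# One-argument form: no encoding given, so str() returns the
# printable representation of the bytes object, prefix and quotes included.
as_repr = str(data)
print(as_repr)         # b'abc'
print(len(as_repr))    # 6

# With an encoding, str() decodes the bytes into text instead.
as_text = str(data, 'utf-8')
print(as_text)         # abc

# This is equivalent to calling decode() on the bytes object.
print(as_text == data.decode('utf-8'))   # True
```

The -b command-line option mentioned in the docs makes Python issue a warning for the one-argument form, which can help to find places where a decode was forgotten.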

Related

Cannot decode using utf-8 after encoding with utf-8

I had to store data as utf-8, and now when I want to fetch the data and decode('utf-8') it, it simply does not work. Consider the line below as an example:
\x0d\x0a\xd8\xb3\xd8\xa7\xd9\x82\xdb\x8c\xe2\x80\x8c\xd9\x86\xd8\xa7\xd9\x85\xd9\x87
You can simply copy the line below to convert the string above to the human readable format:
b"\x0d\x0a\xd8\xb3\xd8\xa7\xd9\x82\xdb\x8c\xe2\x80\x8c\xd9\x86\xd8\xa7\xd9\x85\xd9\x87".decode("utf-8")
However, I could not find a way to convert the string to a bytestring without corrupting it. I tried the following methods, but all of them failed:
.decode("utf-8")
.decode()
.bytes()
So far I have not found a solution on SO or elsewhere. I appreciate any help.

\x0d\x0a\xd8\xb3\xd8\xa7\xd9\x82\xdb\x8c\xe2\x80\x8c\xd9\x86\xd8\xa7\xd9\x85\xd9\x87
b'\x0d\x0a\xd8\xb3\xd8\xa7\xd9\x82\xdb\x8c\xe2\x80\x8c\xd9\x86\xd8\xa7\xd9\x85\xd9\x87'
The above lines (both given in the question) are particular instances of String and Bytes literals (respectively):
\xhh: Character with hex value hh (notes 2, 3)
Note 2: Unlike in Standard C, exactly two hex digits are required.
Note 3: In a bytes literal, hexadecimal and octal escapes denote the byte with the given value. In a string literal, these escapes denote a Unicode character with the given value.
Let's check the string defined in such a way (inside Python prompt):
>>> xstr = "\x0d\x0a\xd8\xb3\xd8\xa7\xd9\x82\xdb\x8c\xe2\x80\x8c\xd9\x86\xd8\xa7\xd9\x85\xd9\x87"
>>> xstr
'\r\nØ³Ø§Ù\x82Û\x8câ\x80\x8cÙ\x86Ø§Ù\x85Ù\x87'
>>> print( xstr)
Ø³Ø§ÙÛâÙØ§Ù
Ù
>>>
Apparently, the print(xstr) output does not resemble a word in any known language. However, all of its characters belong (by definition) to the Unicode range r'[\u0000-\u00ff]', i.e. the first 256 characters of Unicode, and voila: that range is exactly iso-8859-1, aka 'latin1'.
We need to get an encoded version of the xstr string as a bytes object, e.g. using the str.encode method or the built-in bytes() function, and then decode those bytes as UTF-8:
print( bytes(xstr,'latin1').decode()); print(xstr.encode("latin1").decode())
ساقی‌نامه
ساقی‌نامه
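The round trip can be spelled out step by step. The sketch below uses the string from the question; the second half is an assumption on my part, covering the case where the data holds literal backslash-x text (for example, read from a file) instead of real escape sequences:

```python
# The mojibake string from the question: every code point is below U+0100.
xstr = "\x0d\x0a\xd8\xb3\xd8\xa7\xd9\x82\xdb\x8c\xe2\x80\x8c\xd9\x86\xd8\xa7\xd9\x85\xd9\x87"

# latin1 maps code points 0-255 to the identical byte values,
# so encoding with it recovers the original UTF-8 bytes unchanged.
raw_bytes = xstr.encode("latin1")
fixed = raw_bytes.decode("utf-8")
print(fixed.strip())

# Assumed variant: the text contains literal backslashes (note the r prefix).
literal = r"\xd8\xb3\xd8\xa7\xd9\x82\xdb\x8c"
fixed_literal = (
    literal.encode("ascii")        # text -> bytes that contain backslashes
    .decode("unicode_escape")      # interpret the \xhh escapes -> mojibake str
    .encode("latin1")              # mojibake str -> original UTF-8 bytes
    .decode("utf-8")               # original bytes -> readable text
)
print(fixed_literal)
```

Which of the two forms applies depends on how the data was stored, so it is worth checking len() or repr() of the string first.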

Convert string with "\x" character to float

I'm converting strings to floats using float(x). However, for some reason, one of the strings is "71.2\x0060". I've tried following this answer, but it does not remove the byte character:
>>> s = "71.2\x0060"
>>> "".join([x for x in s if ord(x) < 127])
'71.2\x0060'
Other methods I've tried are:
>>> s.split("\\x")
['71.2\x0060']
>>> s.split("\x")
ValueError: invalid \x escape
I'm not sure why this string is not formatted correctly, but I'd like to get as much precision from this string and move on.

Going off of wim's comment, the answer might be this:
>>> s.split("\x00")
['71.2', '60']
So I should do:
>>> float(s.split("\x00")[0])
71.2
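A slightly more defensive variant of the same idea, as a sketch: str.partition splits at the first NUL only and also works when the string contains no NUL at all. Whether the trailing "60" is garbage or meaningful is an assumption that only the data source can confirm.

```python
s = "71.2\x0060"   # "71.2", one NUL character, then "60"

# Keep only the text before the first NUL character.
head, sep, tail = s.partition("\x00")
value = float(head)

print(value)        # 71.2
print(repr(tail))   # '60'
```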

Unfortunately the POSIX group \p{XDigit} does not exist in the re module. To remove the hex control characters with regular expressions anyway, you can try the following.
import re
re.sub(r'[\x00-\x1F]', r'', '71.2\x0060') # or:
re.sub(r'\\x[0-9a-fA-F]{2}', r'', r'71.2\x0060')
Output:
'71.260'
'71.260'
r means raw. Take a look at the control characters up to hex 1F in the ASCII table: https://www.torsten-horn.de/techdocs/ascii.htm
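The two calls above work on two different inputs, which is easy to miss. The following sketch (variable names are only for illustration) shows the difference and one side effect of removing the character instead of splitting on it:

```python
import re

escaped = '71.2\x0060'   # contains one real NUL character
raw = r'71.2\x0060'      # backslash, x, 0, 0 are four literal characters

print(len(escaped), len(raw))    # 7 10

# Each pattern matches only the form it was written for.
clean_escaped = re.sub(r'[\x00-\x1F]', '', escaped)
clean_raw = re.sub(r'\\x[0-9a-fA-F]{2}', '', raw)
print(clean_escaped, clean_raw)  # 71.260 71.260

# Removing the NUL glues both halves together, so the number changes
# compared with taking only the part before the NUL (71.2).
print(float(clean_escaped))      # 71.26
```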

Is there any way to get the direct hexadecimal value in bytes instead of getting string value?

In Python 3.5 I need to convert a string to an IPFIX-supported field value for a UDP packet. When I send the string bytes as a UDP packet, I am unable to recover the string data again. Wireshark reports "Malformed data".
I found that IPFIX supports only ASCII for strings, so I converted the ASCII value to hex and then into bytes. But when converting hex "4B" to a byte, I do not get my hex value in bytes; instead I get the string in bytes ("K").
I tried the following in the Python console. I need the exact byte that I entered, but for b'\x4B', instead of '\x4B' I am getting 'K'. I am using Python 3.5.
b'\x4B'
b'K'
Code: "K".encode("ascii")
Actual output: b'K'
Expected output: b'\x4B'

There are multiple ways to do this:
1. The hex method (python 3.5 and up)
>>> 'K'.encode('ascii').hex()
'4b' # type str
2. Using binascii
>>> import binascii
>>> binascii.hexlify('K'.encode('ascii'))
b'4b' # type bytes
3. Using str.format
>>> ''.join('{:02x}'.format(x) for x in 'K'.encode('ascii'))
'4b' # type str
4. Using format
>>> ''.join(format(x, '02x') for x in 'K'.encode('ascii'))
'4b' # type str
Note: The methods that use format are not very efficient.
If you really care about the \x you will have to use format, eg:
>>> print(''.join('\\x{:02x}'.format(x) for x in 'K'.encode('ascii')))
\x4b
>>> print(''.join('\\x{:02x}'.format(x) for x in 'KK'.encode('ascii')))
\x4b\x4b
If you care about uppercase then you can use X instead of x, eg:
>>> ''.join('{:02X}'.format(x) for x in 'K'.encode('ascii'))
'4B'
>>> ''.join(format(x, '02X') for x in 'K'.encode('ascii'))
'4B'
Uppercase and with \x:
>>> print(''.join('\\x{:02X}'.format(x) for x in 'Hello'.encode('ascii')))
\x48\x65\x6C\x6C\x6F
If you want bytes instead of str then just encode the output to ascii again:
>>> print(''.join('\\x{:02X}'.format(x) for x in 'Hello'.encode('ascii')).encode('ascii'))
b'\\x48\\x65\\x6C\\x6C\\x6F'
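It may also help to see that b'K' and b'\x4B' are the same single byte, and that only the way Python displays it differs. A short sketch to check this in the console:

```python
a = b'\x4B'
b = 'K'.encode('ascii')

# Both literals describe the same one-byte value.
print(a == b)       # True
print(a[0])         # 75
print(hex(a[0]))    # 0x4b
print(a.hex())      # 4b

# The escaped text produced with format is a different, longer value:
# a backslash, "x", "4" and "B" as four separate ASCII bytes.
escaped_text = '\\x4B'.encode('ascii')
print(len(a), len(escaped_text))   # 1 4
```

So if the goal is to put the byte 0x4B into a packet, 'K'.encode('ascii') already does that; the hex strings above are only useful for display or logging.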

unpack syntax in python 3

I am trying to convert hex numbers into decimals using unpack.
When I use:
from struct import *
unpack("<H",b"\xe2\x07")
The output is: 2018, which is what I want.
The thing is I have my hex data in a list as a string in the form of:
asd = ['e2','07']
My question is: is there a simple way of using unpack without the backslashes and the x? Something like this:
unpack("<H","e207")
I know this doesn't work but I hope you get the idea.
For clarification I know I could get the data in the form of b'\x11' in the list but then it's interpreted as ASCII, which I don't want, that's why I have it in the format I showed.

You have hex-encoded data, in a text object. So, to go back to raw hex bytes, you can decode the text string. Please note that this is not the usual convention in Python 3.x (generally, text strings are already decoded).
>>> import codecs
>>> codecs.decode('e207', 'hex')
b'\xe2\x07'
A convenience function for the same thing:
>>> bytes.fromhex('e207')
b'\xe2\x07'
Now you can struct.unpack those bytes. Putting it all together:
>>> import codecs, struct
>>> asd = ['e2','07']
>>> text = ''.join(asd)
>>> encoded = codecs.decode(text, 'hex')
>>> struct.unpack("<H", encoded)
(2018,)
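The same steps as a standalone script, using bytes.fromhex instead of codecs. The int.from_bytes line is an alternative I am adding for comparison; it is not part of the answer above.

```python
import struct

asd = ['e2', '07']

# Join the hex digits and convert them to the raw bytes they describe.
raw = bytes.fromhex(''.join(asd))
print(raw)       # b'\xe2\x07'

# "<H" reads one little-endian unsigned 16-bit integer.
# unpack always returns a tuple, hence the one-element unpacking.
(value,) = struct.unpack('<H', raw)
print(value)     # 2018

# Alternative without struct for a single integer.
value2 = int.from_bytes(raw, byteorder='little')
print(value2)    # 2018
```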

Convert hexadecimal to normal string

I'm using Python 3.3.2 and I want to convert a hex value to a string.
This is my code:
junk = "\x41" * 50 # A
eip = pack("<L", 0x0015FCC4)
buffer = junk + eip
I've tried using
>>> binascii.unhexlify("4142")
b'AB'
... but I want the output "AB", not "b'AB'". What can I do?
Edit:
buffer = junk + binascii.unhexlify(eip).decode('ascii')
binascii.Error: Non-hexadecimal digit found
The problem is I can't concatenate junk + eip.
Thank you.

The b prefix denotes a bytes object, i.e. a string of bytes. To convert it into a string, use the decode method.
>>> import binascii
>>> type(binascii.unhexlify(b"4142"))
<class 'bytes'>
>>> binascii.unhexlify(b"4142").decode('ascii')
'AB'
This results in a string, which is a string of unicode characters.
Edit:
If you want to work purely with binary data, don't do decode, stick with using the bytes type, so in your edited example:
>>> from struct import pack
>>> #- junk = "\x41" * 50 # A
>>> junk = b"\x41" * 50 # A
>>> eip = pack("<L", 0x0015FCC4)
>>> buffer = junk + eip
>>> buffer
b'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA\xc4\xfc\x15\x00'
Note the b in b"\x41", which marks the literal as a byte string (the standard string type in Python 2), i.e. literally a string of bytes rather than a string of Unicode characters. The two are completely different things.
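As a sketch of why the unhexlify call in the edited question fails: the output of pack is already raw bytes, not hex digits, so there is nothing to unhexlify. The example below reuses the names from the question.

```python
import binascii
from struct import pack

junk = b"\x41" * 50             # 50 bytes, each 0x41 ("A")
eip = pack("<L", 0x0015FCC4)    # 4 raw bytes, little-endian
buffer = junk + eip             # bytes + bytes concatenates fine
print(len(buffer))              # 54

# hexlify turns raw bytes into hex digits, and unhexlify reverses that.
hex_digits = binascii.hexlify(eip)
print(hex_digits)                              # b'c4fc1500'
print(binascii.unhexlify(hex_digits) == eip)   # True

# decode() only makes sense for bytes that represent text.
print(junk[:2].decode("ascii"))                # AA
```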

That's just a literal representation. Don't worry about the b, as it's not actually part of the string itself.
See What does the 'b' character do in front of a string literal?
