Convert hexadecimal to normal string - string

I'm using Python 3.3.2 and I want to convert a hex string to a regular string.
This is my code:
junk = "\x41" * 50 # A
eip = pack("<L", 0x0015FCC4)
buffer = junk + eip
I've tried using
>>> binascii.unhexlify("4142")
b'AB'
... but I want the output "AB", not "b'AB'". What can I do?
Edit:
buffer = junk + binascii.unhexlify(eip).decode('ascii')
binascii.Error: Non-hexadecimal digit found
The problem is I can't concatenate junk + eip.
Thank you.

The b prefix denotes a bytes object, i.e. a sequence of bytes. To convert it into a string, use the decode method.
>>> type(binascii.unhexlify(b"4142"))
<class 'bytes'>
>>> binascii.unhexlify(b"4142").decode('ascii')
'AB'
This results in a str, which is a string of Unicode characters.
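As a side note, the same hex-to-text conversion can be sketched with built-ins only. This assumes the hex text decodes to plain ASCII; bytes.fromhex is used here as an alternative to binascii.unhexlify, and the variable names are purely illustrative:

```python
# Hex text -> bytes -> str, using only built-ins.
hex_text = "4142"                   # illustrative input

raw = bytes.fromhex(hex_text)       # a bytes object
text = raw.decode("ascii")          # a str object

print(type(raw).__name__, raw)      # bytes b'AB'
print(type(text).__name__, text)    # str AB
```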
Edit:
If you want to work purely with binary data, don't decode; stick with the bytes type. In your edited example:
>>> #- junk = "\x41" * 50 # A
>>> junk = b"\x41" * 50 # A
>>> eip = pack("<L", 0x0015FCC4)
>>> buffer = junk + eip
>>> buffer
b'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA\xc4\xfc\x15\x00'
Note the b in b"\x41", which marks the literal as a bytes object (the standard string type in Python 2), i.e. a sequence of bytes rather than a string of Unicode characters. These are two completely different things.
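A minimal sketch of why the edited attempt failed and why the bytes-only version works. It assumes the same junk and eip values as the question; the explanation of the unhexlify error is an inference from the error message, not something the asker confirmed:

```python
import binascii
from struct import pack

junk = b"\x41" * 50               # 50 bytes of 'A'
eip = pack("<L", 0x0015FCC4)      # 4 raw bytes, little-endian

# eip already holds raw bytes, not hex digits, so unhexlify(eip)
# has nothing valid to parse. hexlify goes the other way.
print(eip)                        # b'\xc4\xfc\x15\x00'
print(binascii.hexlify(eip))      # b'c4fc1500'

buffer = junk + eip               # bytes + bytes concatenates fine
print(len(buffer))                # 54

try:
    "\x41" * 50 + eip             # str + bytes is refused
except TypeError as exc:
    print("TypeError:", exc)
```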

That's just a literal representation. Don't worry about the b, as it's not actually part of the string itself.
See What does the 'b' character do in front of a string literal?

Related

Cannot decode using utf-8 after encoding with utf-8

In one situation I had to store data as utf-8, and now when I fetch the data and call decode('utf-8') on it, it simply does not work. Consider the line below as an example:
\x0d\x0a\xd8\xb3\xd8\xa7\xd9\x82\xdb\x8c\xe2\x80\x8c\xd9\x86\xd8\xa7\xd9\x85\xd9\x87
You can simply copy the line below to convert the string above to a human-readable format:
b"\x0d\x0a\xd8\xb3\xd8\xa7\xd9\x82\xdb\x8c\xe2\x80\x8c\xd9\x86\xd8\xa7\xd9\x85\xd9\x87".decode("utf-8")
However, I could not find a way to convert the string to a bytestring without corrupting it. I tried the following methods, but all of them failed:
.decode("utf-8")
.decode()
.bytes()
So far I have not found a solution on SO or elsewhere. I'd appreciate any help.
\x0d\x0a\xd8\xb3\xd8\xa7\xd9\x82\xdb\x8c\xe2\x80\x8c\xd9\x86\xd8\xa7\xd9\x85\xd9\x87
b'\x0d\x0a\xd8\xb3\xd8\xa7\xd9\x82\xdb\x8c\xe2\x80\x8c\xd9\x86\xd8\xa7\xd9\x85\xd9\x87'
The above lines (both given in the question) are particular instances of String and Bytes literals (respectively):
\xhh    Character with hex value hh    (2, 3)
(2) Unlike in Standard C, exactly two hex digits are required.
(3) In a bytes literal, hexadecimal and octal escapes denote the byte with the given value. In a string literal, these escapes denote a Unicode character with the given value.
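The difference described in that note can be seen directly. This is a small sketch using the first two escapes from the question's data:

```python
s = "\xd8\xb3"     # str literal: two characters, U+00D8 and U+00B3
b = b"\xd8\xb3"    # bytes literal: two bytes, 0xD8 and 0xB3

print([hex(ord(ch)) for ch in s])    # ['0xd8', '0xb3']
print(list(b))                       # [216, 179]

# Only the bytes object can be decoded as UTF-8, and the two bytes
# together form a single character.
decoded = b.decode("utf-8")
print(len(decoded), hex(ord(decoded)))   # 1 0x633
```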
Let's check the string defined in such a way (inside Python prompt):
>>> xstr = "\x0d\x0a\xd8\xb3\xd8\xa7\xd9\x82\xdb\x8c\xe2\x80\x8c\xd9\x86\xd8\xa7\xd9\x85\xd9\x87"
>>> xstr
'\r\nØ³Ø§Ù\x82Û\x8câ\x80\x8cÙ\x86Ø§Ù\x85Ù\x87'
>>> print( xstr)
Ø³Ø§ÙÛâÙØ§Ù
Ù
>>>
The print( xstr) output does not resemble a word in any known language. However, all of its characters belong (by definition) to the Unicode range r'[\u0000-\u00ff]', i.e. the first 256 Unicode code points, and voilà: that is exactly iso-8859-1, aka 'latin1'.
We need to get an encoded version of the xstr string as a bytes object, e.g. using the str.encode method or the built-in bytes() function, and then decode those bytes as utf-8 (the default):
print( bytes(xstr,'latin1').decode()); print(xstr.encode("latin1").decode())
ساقی‌نامه
ساقی‌نامه
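The same repair can be wrapped in two small helpers. This is only a sketch: the function names are hypothetical, and the second helper assumes the stored text contains literal backslash escapes (for example, read from a file or database) and is otherwise pure ASCII:

```python
def recover_utf8(text):
    """Re-interpret a str whose code points are really UTF-8 byte values."""
    # latin-1 maps code points 0..255 to the identical byte values.
    return text.encode("latin-1").decode("utf-8")


def recover_utf8_from_escapes(escaped):
    """Same repair for text holding literal backslash-x escapes."""
    # unicode_escape turns the four characters \xd8 into U+00D8.
    unescaped = escaped.encode("ascii").decode("unicode_escape")
    return recover_utf8(unescaped)


print(recover_utf8("\xd8\xb3\xd8\xa7\xd9\x82\xdb\x8c"))
print(recover_utf8_from_escapes(r"\xd8\xb3\xd8\xa7\xd9\x82\xdb\x8c"))
```

Both calls should print the first four letters of the word from the question.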

Python3 How to get raw bytes string without encode?

I want to get a string of the original bytes (assembly code) without converting to another encoding. As the content of the bytes is shellcode, I do not need to encode it and want to write it directly as raw bytes.
To simplify, I want to convert "b'\xb7\x00\x00\x00'" to "\xb7\x00\x00\x00" and get the string representation of the raw bytes.
For example:
>> byte_code = b'\xb7\x00\x00\x00\x05\x00\x00\x00\x95\x00\x00\x00\x00\x00\x00\x00'
>> uc_str = str(byte_code)[2:-1]
>> print(byte_code, uc_str)
b'\xb7\x00\x00\x00\x05\x00\x00\x00\x95\x00\x00\x00\x00\x00\x00\x00' \xb7\x00\x00\x00\x05\x00\x00\x00\x95\x00\x00\x00\x00\x00\x00\x00
Currently I have only two ugly methods:
>> uc_str = str(byte_code)[2:-1]
>> uc_str = "".join('\\x{:02x}'.format(c) for c in byte_code)
Raw bytes usage:
>> my_template = "const char byte_code[] = 'TPL'"
>> uc_str = str(byte_code)[2:-1]
>> my_code = my_template.replace("TPL", uc_str)
# then write my_code to xx.h
Is there any pythonic way to do this?
Your first method is broken, because any bytes that can be represented as printable ASCII will be, for example:
>>> str(b'\x00\x20\x41\x42\x43\x20\x00')[2:-1]
'\\x00 ABC \\x00'
The second method is actually okay. Since this feature appears to be missing from stdlib I've published all-escapes which provides it.
pip install all-escapes
Example usage:
>>> b"\xb7\x00\x00\x00".decode("all-escapes")
'\\xb7\\x00\\x00\\x00'
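For comparison, the question's second method can be wrapped in a small stdlib-only helper. This is a sketch: the function name and the header line are hypothetical, and the header uses double quotes on the assumption that the target is a C string literal:

```python
def to_hex_escapes(data):
    """Render every byte as a backslash-x escape, printable ASCII included."""
    return "".join("\\x{:02x}".format(byte) for byte in data)


byte_code = b"\xb7\x00\x00\x00 ABC"
escaped = to_hex_escapes(byte_code)
print(escaped)    # \xb7\x00\x00\x00\x20\x41\x42\x43

# Hypothetical header line built from the escaped text.
print('const char byte_code[] = "{}";'.format(escaped))
```

Unlike str(byte_code)[2:-1], this escapes the space and the letters as well, so the output never depends on which bytes happen to be printable.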
I came across this trying to do something similar with some SNMP code.
byte_code = b'\xb7\x00\x00\x00\x05\x00\x00\x00\x95\x00\x00\x00\x00\x00\x00\x00'
text = byte_code.decode('raw_unicode_escape')
writer_func(text)
It worked to send an SNMP hex string as an OctetString when there was no helper support for hex.
See also the standard-encodings and bytes.decode documentation, and the SNMP Set Types for anyone working with SNMP.
The basic conversion between bytes and str is this:
>>> b"abc".decode()
'abc'
>>>
or :
>>> sb = b"abc"
>>> s = sb.decode()
>>> s
'abc'
>>>
The inverse is :
>>> "abc".encode()
b'abc'
>>>
or :
>>> s="abc"
>>> sb=s.encode()
>>> sb
b'abc'
>>>
In your case, you should use the errors argument:
>>> b"\xb7".decode(errors="replace")
'�'
>>>
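The choice of error handler matters here. The sketch below compares three handlers on a short sample; note that errors="replace" discards the original byte value, while "backslashreplace" (accepted when decoding since Python 3.5) keeps it as text. The sample bytes are illustrative:

```python
data = b"\xb7\x00"

try:
    data.decode("utf-8")                  # default is errors="strict"
except UnicodeDecodeError as exc:
    print("strict:", exc.reason)

# U+FFFD stands in for the bad byte; the value 0xb7 is lost.
replaced = data.decode("utf-8", errors="replace")

# The bad byte becomes the four characters \xb7; valid bytes decode normally.
escaped = data.decode("utf-8", errors="backslashreplace")

print(ascii(replaced))
print(ascii(escaped))
```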

bytes() initializer adding an additional byte?

I initialize a utf-8 encoded string in python3:
bytes('\xc2', encoding="utf-8", errors="strict")
but on writing it out I get two bytes!
>>> s = bytes('\xc2', encoding="utf-8", errors="strict")
>>> s
b'\xc3\x82'
Where is this additional byte coming from? Why should I not be able to encode any hex value up to 254 (I can understand that 255 is potentially reserved to extend to utf-16)?
The Unicode codepoint "\xc2" (which can also be written as "Â"), is two bytes long when encoded with the utf-8 encoding. If you were expecting it to be the single byte b'\xc2', you probably want to use a different encoding, such as "latin-1":
>>> s = bytes("\xc2", encoding="latin-1", errors="strict")
>>> s
b'\xc2'
If you are really creating "\xc2" directly with a literal, though, there's no need to mess around with the bytes constructor to turn it into a bytes instance. Just use the b prefix on the literal to create the bytes directly:
s = b"\xc2"
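A short sketch of where the extra byte comes from: UTF-8 uses a single byte only for code points up to U+007F, so anything from U+0080 upward takes at least two bytes, whereas latin-1 keeps every code point up to U+00FF in one byte. Variable names are illustrative:

```python
ch = "\xc2"                        # one character, U+00C2

as_utf8 = ch.encode("utf-8")
as_latin1 = ch.encode("latin-1")
print(as_utf8, as_latin1)          # b'\xc3\x82' b'\xc2'

# The one-byte range of UTF-8 ends at U+007F.
print(len("\x7f".encode("utf-8")), len("\x80".encode("utf-8")))   # 1 2

# Building the single byte from its integer value also works.
print(bytes([0xC2]))               # b'\xc2'
```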

Convert string of encoded escape sequences to Unicode in Python 3

I'm attempting to decode a Python string containing a series of Shift-JIS escape sequences. When I create a bytes literal containing the sequences, I can use decode('shift-jis') to get the expected result.
>>> seq = b'\201u\202\240\202\246\202\244\202\242\202\250\201v'
>>> seq.decode("shift-jis")
'「あえういお」'
The problem is that the sequences are passed in as a plain Python string. When I use str.encode, the sequence is interpreted as Unicode and extra bytes of \xc2 are inserted:
>>> seq = "\201u\202\240\202\246\202\244\202\242\202\250\201v"
>>> str.encode(seq)
b'\xc2\x81u\xc2\x82\xc2\xa0\xc2\x82\xc2\xa6\xc2\x82\xc2\xa4\xc2\x82\xc2\xa2\xc2\x82\xc2\xa8\xc2\x81v'
Is there a way to directly convert a Python string containing encoded escape sequences into a bytes literal, in the same way as placing a b in front of a string produces a bytes literal with the escaped characters?
str.encode defaults to the utf-8 encoding, hence you get the utf-8 \xc2 prefixes. (Check Wikipedia for details if you want.) What you want instead is for codepoints 0 to 255 to be turned into bytes 0 to 255; in other words, the same data in an object of a different class. Latin-1 does this.
>>> seqb = seq.encode('latin-1')
>>> seqb.decode('shift-jis')
'「あえういお」'
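The claim that Latin-1 maps code points 0 to 255 onto the same byte values can be checked directly. This is a sketch with illustrative names; the short sample reuses the bracket and first-letter escapes from the question:

```python
all_bytes = bytes(range(256))

# Decoding with latin-1 gives one character per byte, same numeric value.
as_text = all_bytes.decode("latin-1")
print([ord(ch) for ch in as_text] == list(range(256)))   # True

# Encoding with latin-1 restores the original bytes exactly.
print(as_text.encode("latin-1") == all_bytes)            # True

# Applied to a short Shift-JIS sample held in a str:
seq = "\201u\202\240\201v"
print(seq.encode("latin-1").decode("shift-jis"))         # 「あ」
```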

python 3, unicode conversion, two \u0000 as one character

My python3 script receives strings from a C++ program via a pipe.
The strings are encoded as Unicode code point escapes, and I need to decode them correctly.
For example, consider a string that contains Cyrillic symbols: 'тест test'
Encoding this string in python3 with print('тест test'.encode()) gives b'\xd1\x82\xd0\xb5\xd1\x81\xd1\x82 test'
The C++ program encodes the same string as: b'\u00D1\u0082\u00D0\u00B5\u00D1\u0081\u00D1\u0082 test'
The encoded strings look very similar: python3 uses \x (2 hex digits) and the C++ program uses \u (4 hex digits).
But I can't figure out how to convert b'\u00D1\u0082\u00D0\u00B5\u00D1\u0081\u00D1\u0082 test' to 'тест test'.
The main problem: python3 treats b'\u00D1\u0082\u00D0\u00B5\u00D1\u0081\u00D1\u0082' as an 8-character string, but it contains only 4 characters.
If the string you receive from C++ is the following in Python:
s = b'\u00D1\u0082\u00D0\u00B5\u00D1\u0081\u00D1\u0082 test'
Then this will decode it:
result = s.decode('unicode-escape').encode('latin1').decode('utf8')
print(result)
Output:
тест test
The first stage converts the byte string received into a Unicode string:
>>> s1 = s.decode('unicode-escape')
>>> s1
'Ñ\x82ÐµÑ\x81Ñ\x82 test'
Unfortunately, the Unicode codepoints are really UTF-8 byte values. The latin1 encoding is a 1:1 mapping of the first 256 Unicode codepoints, so encoding with this codec converts the codepoints back to byte values in a byte string:
>>> s2 = s1.encode('latin1')
>>> s2
b'\xd1\x82\xd0\xb5\xd1\x81\xd1\x82 test'
Now the byte string can be decoded to the correct Unicode string:
>>> s3 = s2.decode('utf8')
>>> s3
'тест test'
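The three stages can be combined into one helper. This is a sketch: the function name is hypothetical, and it assumes every escape in the input carries a UTF-8 byte value in the range 0 to 255. The sample uses an rb'' literal so that the backslashes are kept verbatim without an invalid-escape warning on newer Python versions:

```python
def decode_escaped_utf8(raw):
    """Decode bytes holding \\uXXXX escapes whose values are UTF-8 bytes."""
    text = raw.decode("unicode-escape")    # escapes -> code points 0..255
    utf8_bytes = text.encode("latin-1")    # code points -> same byte values
    return utf8_bytes.decode("utf-8")      # bytes -> the intended text


received = rb"\u00D1\u0082\u00D0\u00B5\u00D1\u0081\u00D1\u0082 test"
print(decode_escaped_utf8(received))       # тест test
```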
