Convert string to bytes literally - string

I want to take a string such as:
'\\xeb\\x4d'
and turn it into:
b'\xeb\x4d'
If I do:
bytes('\\xeb\\x4d', 'utf-8')
I get:
b'\\xeb\\x4d'
I need something that does the following:
something('\\xeb\\x4d') == b'\xeb\x4d'

>>> a = '\\xeb\\x4d' # a Unicode string
>>> a.encode('latin1') # get a byte string
b'\\xeb\\x4d'
>>> a.encode('latin1').decode('unicode_escape') # unescape, get a Unicode string
'ëM'
>>> a.encode('latin1').decode('unicode_escape').encode('latin1') # get a byte string
b'\xebM'
>>> a.encode('latin1').decode('unicode_escape').encode('latin1') == b'\xeb\x4d'
True
Note that latin1 maps the first 256 code points of Unicode one-to-one onto byte values, so encoding any of those code points with latin1 gives a byte whose value equals the code point.
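The chain above can be wrapped in a small helper. This is only a sketch: the name unescape_to_bytes is made up for illustration, and it assumes the input contains only characters and escapes that resolve to code points below 256. Anything higher (for example a \u2019 escape) will raise UnicodeEncodeError at the final latin1 step.

```python
def unescape_to_bytes(text):
    """Turn a string holding literal backslash escapes into the bytes they name.

    Hypothetical helper; assumes every escape resolves to a code point < 256.
    """
    # str -> bytes (latin1 keeps each character as one byte),
    # then interpret the escapes, then map code points back to byte values.
    return text.encode('latin1').decode('unicode_escape').encode('latin1')


result = unescape_to_bytes('\\xeb\\x4d')
print(result == b'\xeb\x4d')
```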

a = '\\xeb\\x4d'
a = bytes(a, 'utf-8')
a = a.decode('unicode_escape').encode('latin1')
gives
b'\xebM'
because
'\x4d' == 'M'

Related

Python: re.sub gives different output when unicode character is fetched from a string variable containing data from pandas dataframe

The following works:
import re
text = "I\u2019m happy"
text_p = text
text_p = re.sub("[\u2019]","'",text_p)
print(text_p)
Output: I'm happy
This doesn't work:
import re
import pandas as pd
training_data = pd.read_csv('train.txt')
text = training_data['tweet_text'][0] # Assume that this returns a string "I\u2019m happy"
text_p = text
text_p = re.sub("[\u2019]","'",text_p)
print(text_p)
Output: I\u2019m happy
I tried running your code and got I'm happy returned from both the string and the list item when passing each into re.sub(...) as outlined in your question.
If you're just looking to parse (decode) the unicode characters you probably don't need to be using re. Something like the below could be used to parse the unicode characters without having to run re to check each possibility.
text = training_data['tweet_text'][0]
if isinstance(text, str):
    # if value is str then encode to utf-8 byte string then decode back to str
    text = text.encode()
    text = text.decode()
elif isinstance(text, bytes):
    # elif value is bytes just decode to str
    text = text.decode()
else:
    # else print out to console if value is neither str nor bytes
    print("Value not recognised as str or bytes!")
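One possible explanation for the difference, offered here as an assumption since the file itself is not shown: the CSV may store the six literal characters \u2019 rather than the apostrophe character. A plain UTF-8 encode/decode round trip leaves such text unchanged, so the escapes have to be interpreted explicitly. The sketch below uses a hypothetical helper, unescape_unicode, and a hard-coded sample string in place of the dataframe value.

```python
import re


def unescape_unicode(text):
    """Replace each literal \\uXXXX sequence with the character it names.

    Hypothetical helper; only handles four-digit \\u escapes.
    """
    return re.sub(
        r'\\u([0-9a-fA-F]{4})',
        lambda m: chr(int(m.group(1), 16)),
        text,
    )


raw = 'I\\u2019m happy'          # stand-in for training_data['tweet_text'][0]
decoded = unescape_unicode(raw)  # now contains the real U+2019 character
cleaned = decoded.replace('\u2019', "'")
print(cleaned)
```

Using a targeted regex rather than the unicode_escape codec leaves any genuine non-ASCII characters in the tweet untouched.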

How to convert string into fixed number of bytes?

I want to create an 8 bytes sized variable that will include my string.
byte = 8_bytes_variable
str = 'hello'
# Put str inside byte while byte still remains of size 8 bytes.
You can pad the string first by adding spaces to the beginning of it. Here I assume that each character takes 1 byte when encoded. (Chinese characters, for example, take more.)
str = 'hello'
if len(str.encode('utf-8')) > 8:
    print("This is not possible!")
else:
    str2 = '{0: >8}'.format(str)  # adds the needed spaces to the beginning of str
    byte = str2.encode('utf-8')
In order to get the original string later, you can use lstrip():
str2 = byte.decode()
str = str2.lstrip()
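An alternative sketch pads the encoded bytes rather than the string, so the result is exactly 8 bytes even when some characters take more than one byte. The helper names and the choice of a NUL pad byte are assumptions made for illustration; the pad byte must not occur at the end of the real data.

```python
def to_fixed_bytes(text, size=8, pad=b'\x00'):
    """Encode text as UTF-8 and right-pad it to exactly `size` bytes."""
    data = text.encode('utf-8')
    if len(data) > size:
        raise ValueError("encoded text does not fit in %d bytes" % size)
    return data.ljust(size, pad)


def from_fixed_bytes(data, pad=b'\x00'):
    """Strip the trailing padding and decode back to str."""
    return data.rstrip(pad).decode('utf-8')


packed = to_fixed_bytes('hello')
print(packed, len(packed))
print(from_fixed_bytes(packed))
```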

Can I print symbols in hex with more than 2 characters?

I am using Python 3.5.2
I know that with print("\x00") (where 0 is an ASCII character) I can print symbols with hex format. But how can I print number 500,000 (in hex: 7A120) when print("\x00") takes only 2 characters?
To print a constant hexadecimal expression, you can prefix the number with 0x, and it will resolve to an int with the equivalent base 10 value, like so:
>>> print(0x7A120)
500000
If you want to print the value of a string of hexadecimal digits, use int:
>>> a = "7A120"
>>> print(int(a, 16))
500000
The second argument to int is the base to parse the string from, in this case base 16 (hex).
To print an integer in hexadecimal format, use the format operator, %:
>>> a = 0x7A120
>>> print("%x" % a)
7a120
You can change the "x" in "%x" to uppercase to print a through f in uppercase:
>>> b = 0xABCDEF
>>> print("%x" % b)
abcdef
>>> print("%X" % b)
ABCDEF
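For completeness, here is a short sketch of other built-in ways to get the same hex text. All of these exist in Python 3.5; the variable names are just for illustration.

```python
n = 500000

with_prefix = hex(n)               # lowercase, with the 0x prefix
upper = format(n, 'X')             # uppercase digits, no prefix
padded = '{:08x}'.format(n)        # zero-padded to 8 digits
round_trip = int(upper, 16)        # parse the hex text back to an int

print(with_prefix, upper, padded, round_trip)
```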

Python - part of str [urllib3 data]

I'm trying to delete 2 chars from the start of the str and 1 char from the end
import urllib3
target_url="www.klimi.hys.cz/nalada.txt"
http = urllib3.PoolManager()
r = http.request('GET', target_url)
print(r.status)
print(r.data)
print()
And Output is
200
b'smutne'
I need the output to be only "smutne", without the " b' " and " ' "
When you have bytes, you need to decode them into a string using the appropriate encoding. For example, if you have ASCII characters as bytes, you can do:
>>> foo = b'mystring'
>>> print(foo)
b'mystring'
>>> print(foo.decode('ascii'))
mystring
Or, more commonly, you probably have UTF-8-encoded text (UTF-8 is a superset of ASCII, so plain ASCII bytes decode the same way):
>>> print(foo.decode('utf-8'))
mystring
This also works if the text contains accented characters and the like.
More on Python encoding/decoding here: https://docs.python.org/3/howto/unicode.html
In the particular case of urllib3, r.data returns bytes that you'll need to decode in order to use as a string if that's what you want.
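As a sketch of that last step without making a network call, the snippet below uses a hard-coded bytes value in place of r.data. The helper name body_to_text and the default of UTF-8 are assumptions; the real encoding depends on what the server sends.

```python
def body_to_text(data, encoding='utf-8'):
    """Decode a response body (bytes) to str.

    errors='replace' substitutes U+FFFD for undecodable bytes instead of raising.
    """
    return data.decode(encoding, errors='replace')


sample = b'smutne'            # stand-in for r.data
text = body_to_text(sample)
print(text)
```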

Python3 adding an extra byte to the byte string

import binascii
import struct

file_1 = r'res\test.png'
with open(file_1, 'rb') as file_1_:
    file_1_read = file_1_.read()
file_1_hex = binascii.hexlify(file_1_read)
print('Hexlifying test.png..')
pack = "test.packet"
file_1_size_bytes = len(file_1_read)
print("test.png is", file_1_size_bytes, "bytes.")
struct.pack('i', file_1_size_bytes)  # note: this result is discarded
file_1_size_bytes_hex = binascii.hexlify(struct.pack('>i', file_1_size_bytes))
print("Hexlifyed length - (", file_1_size_bytes_hex, ").")
with open(pack, 'ab') as header_1_:
    header_1_.write(binascii.unhexlify(file_1_size_bytes_hex))
    print("(", binascii.unhexlify(file_1_size_bytes_hex), ")")
with open(pack, 'ab') as header_head_1:
    header_head_1.write(binascii.unhexlify("0000020000000D007200650073002F00000074006500730074002E0070006E006700000000"))
    print("Header part 1 added.")
So this writes the unhexlified form of "0000020000000D007200650073002F00000074006500730074002E0070006E006700000000(00)" to the pack.
There's an extra "00" byte at the end. This is messing up everything I'm trying to do, because the packet's length is referred back to when loading it, and I have about 13 extra "00" bytes at the end of each string I write to the file. So in turn my file is 13 bytes longer than it should be. Not to mention that the header's byte length isn't being read properly because the padding is off by 1 byte.
You seem to be saying that binascii.unhexlify does not really condense the input string. I have trouble believing that. Here is a minimal complete runnable example and the output I get with 3.4.2 on Win 7.
import binascii
import io
b = binascii.unhexlify(
"000000030000000100000000000000040041004E0049004D00000000000000")
print(b) # bytes
bf = io.BytesIO()
bf.write(b)
print(bf.getvalue())
>>>
b'\x00\x00\x00\x03\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x04\x00A\x00N\x00I\x00M\x00\x00\x00\x00\x00\x00\x00'
b'\x00\x00\x00\x03\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x04\x00A\x00N\x00I\x00M\x00\x00\x00\x00\x00\x00\x00'
Unhexlify has converted each pair of hex characters to the byte expected.
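A quick way to check this on your own data is to compare lengths: the output of unhexlify should always be exactly half as long as the hex text. The sketch below uses a short made-up hex string and a made-up size value. It also shows that struct.pack already returns the raw bytes, so the hexlify/unhexlify round trip around it is not needed.

```python
import binascii
import struct

sample_hex = "0000020041004E00"            # made-up sample, 16 hex digits
raw = binascii.unhexlify(sample_hex)
print(len(sample_hex), len(raw))           # hex digit count vs byte count

size_field = struct.pack('>i', 1234)       # big-endian 4-byte signed int
print(size_field, binascii.hexlify(size_field))
```

If the lengths match here but the file still ends up longer than expected, the extra bytes are probably being written somewhere else, for instance by appending ('ab') to a file left over from an earlier run.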

Resources