Convert unicode code points to alphabetical strings in Python - python-3.x

I have a list of unicode code points like
<U+041D><U+041C><U+0418><U+0426> <U+041D><U+0435><U+0439><U+0440><U+043E><U+0445><U+0438><U+0440><U+0443><U+0440><U+0433><U+0438><U+0438> <U+0438><U+043C>. <U+041D>,<U+041D>.<U+0411><U+0443><U+0440><U+0434><U+0435><U+043D><U+043A><U+043E>
How do I convert these to a regular string? They represent a particular string, like 'hello'.

Use regular expressions to identify unicode points. Extract the hexadecimal number from each point, convert it to decimal, and map to the character with that code:
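import re

def decoder(match):
    code = match.group(1)  # The code point in hex
    return chr(int(code, 16))

uni = "<U+041D><U+041C><U+0418><U+0426> <U+041D><U+0435><U+0439><U+0440><U+043E><U+0445><U+0438><U+0440><U+0443><U+0440><U+0433><U+0438><U+0438> <U+0438><U+043C>. <U+041D>,<U+041D>.<U+0411><U+0443><U+0440><U+0434><U+0435><U+043D><U+043A><U+043E>"  # Your string
re.sub(r"<U\+([0-9a-fA-F]+)>", decoder, uni)
# 'НМИЦ Нейрохирургии им. Н,Н.Бурденко'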

Related

Cannot decode using utf-8 after encoding with utf-8

I had to store data as utf-8, and now when I fetch the data and call decode('utf-8') on it, it simply does not work. Consider the line below as an example:
\x0d\x0a\xd8\xb3\xd8\xa7\xd9\x82\xdb\x8c\xe2\x80\x8c\xd9\x86\xd8\xa7\xd9\x85\xd9\x87
You can simply copy the line below to convert the string above to the human readable format:
b"\x0d\x0a\xd8\xb3\xd8\xa7\xd9\x82\xdb\x8c\xe2\x80\x8c\xd9\x86\xd8\xa7\xd9\x85\xd9\x87".decode("utf-8")
However, I could not find a way to convert the string to a bytestring without corrupting it. I tried the following methods, but all of them failed:
.decode("utf-8")
.decode()
.bytes()
So far I have not found a solution on SO or anywhere else. I'd appreciate any help.
\x0d\x0a\xd8\xb3\xd8\xa7\xd9\x82\xdb\x8c\xe2\x80\x8c\xd9\x86\xd8\xa7\xd9\x85\xd9\x87
b'\x0d\x0a\xd8\xb3\xd8\xa7\xd9\x82\xdb\x8c\xe2\x80\x8c\xd9\x86\xd8\xa7\xd9\x85\xd9\x87'
The above lines (both given in the question) are particular instances of String and Bytes literals (respectively):
\xhh    Character with hex value hh (notes 2, 3)
Note 2: Unlike in Standard C, exactly two hex digits are required.
Note 3: In a bytes literal, hexadecimal and octal escapes denote the byte with the given value. In a string literal, these escapes denote a Unicode character with the given value.
Let's check the string defined in such a way (inside Python prompt):
>>> xstr = "\x0d\x0a\xd8\xb3\xd8\xa7\xd9\x82\xdb\x8c\xe2\x80\x8c\xd9\x86\xd8\xa7\xd9\x85\xd9\x87"
>>> xstr
'\r\nساÙ\x82Û\x8câ\x80\x8cÙ\x86اÙ\x85Ù\x87'
>>> print( xstr)
ساÙÛâÙاÙ
Ù
>>>
The print(xstr) output does not resemble a word in any known language. However, all of its characters fall (by definition) in the Unicode range r'[\u0000-\u00ff]', i.e. the first 256 Unicode code points, and those map one-to-one onto the bytes of iso-8859-1, aka 'latin1'.
So we need an encoded version of the xstr string as a bytes object, e.g. using the str.encode method or the built-in bytes() function with 'latin1', and then decode those bytes as UTF-8 (the default for bytes.decode):
print( bytes(xstr,'latin1').decode()); print(xstr.encode("latin1").decode())
ساقی‌نامه
ساقی‌نامه
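The same round trip can be wrapped in a small function. This is only a sketch: the helper name repair_utf8_read_as_latin1 is made up for illustration, and it assumes every character of the input is in the range U+0000 to U+00FF (otherwise the latin1 encode step raises UnicodeEncodeError).

```python
# Hypothetical helper; the technique is the latin1 round trip described above.
def repair_utf8_read_as_latin1(text: str) -> str:
    # Each code point N (0..255) becomes the single byte N.
    raw = text.encode("latin1")
    # Those bytes are the real UTF-8 data, so decode them as such.
    return raw.decode("utf-8")

# The string literal from the question.
xstr = "\x0d\x0a\xd8\xb3\xd8\xa7\xd9\x82\xdb\x8c\xe2\x80\x8c\xd9\x86\xd8\xa7\xd9\x85\xd9\x87"
fixed = repair_utf8_read_as_latin1(xstr)

# Show the repaired text as escapes so the result is visible on any terminal.
print(fixed.encode("unicode_escape").decode("ascii"))
```

The repaired string still starts with the '\r\n' from the original data, followed by nine characters (one of them the zero-width non-joiner U+200C).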

Converting 16-digit hexadecimal string into double value in Python 3

I want to convert 16-digit hexadecimal numbers into doubles. I actually did the reverse of this before which worked fine:
import struct

def double_to_hex(doublein):
    # Reinterpret the 8 bytes of the double as an unsigned 64-bit integer and format it as hex
    return hex(struct.unpack('<Q', struct.pack('<d', doublein))[0])

for i in modified_list:
    encoded_list.append(double_to_hex(i))
modified_list.clear()
encoded_msg = ''.join(encoded_list).replace('0x', '')
encoded_list.clear()
print_command('encode', encoded_msg)
And now I want to sort of do the reverse. I tried this without success:
from textwrap import wrap
import struct
import binascii

MESSAGE = 'c030a85d52ae57eac0129263c4fffc34'
# Split the message into chunks of 16 hex digits (8 bytes, i.e. one double each)
MSGLIST = wrap(MESSAGE, 16)
doubles = []
print(MSGLIST)
for msg in MSGLIST:
    doubles.append(struct.unpack('d', binascii.unhexlify(msg)))
print(doubles)
However, when I run this, I get crazy values, which are of course not what I put in:
[(-1.8561629252326087e+204,), (1.8922789420412524e-53,)]
Were your original numbers -16.657673995556173 and -4.642958715557189?
If so, then the problem is that your hex strings contain the big-endian (most-significant byte first) representation of the double, but the 'd' format string in your unpack call specifies conversion using your system's native format, which happens to be little-endian (least-significant byte first). The result is that unpack reads and processes the bytes of the unhexlify'ed string from the wrong end. Unsurprisingly, that will produce the wrong value.
To fix, do one of:
convert the hex string into little-endian format (reverse the bytes, so c030a85d52ae57ea becomes ea57ae525da830c0) before passing it to binascii.unhexlify, or
reverse the bytes produced by unhexlify (change binascii.unhexlify(msg) to binascii.unhexlify(msg)[::-1]) before you pass them to unpack, or
tell unpack to do the conversion using big-endian order (replace the format string 'd' with '>d')
I'd go with the last one, replacing the format string.
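A minimal sketch of that last option, using the message from the question. The function name hex_to_doubles is just for illustration, and it assumes every double was written as exactly 16 hex digits.

```python
import struct

MESSAGE = "c030a85d52ae57eac0129263c4fffc34"

def hex_to_doubles(message: str) -> list:
    """Decode a string of concatenated 16-hex-digit, big-endian doubles."""
    doubles = []
    for start in range(0, len(message), 16):
        chunk = message[start:start + 16]
        # '>d' = one IEEE 754 double, most-significant byte first.
        (value,) = struct.unpack(">d", bytes.fromhex(chunk))
        doubles.append(value)
    return doubles

doubles = hex_to_doubles(MESSAGE)
print(doubles)
```

Note that struct.unpack always returns a tuple, which is why the question's output shows values like (x,); unpacking it with (value,) = ... gives plain floats.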

How to convert a string with unicode characters in a string with utf-8 hex characters?

I'm trying (without success) to convert the following string (it has the ł = {LATIN SMALL LETTER L WITH STROKE} character written as a Unicode escape):
Marta Ga\u0142szewska
in the following utf-8 hex form:
Marta Ga%C5%82uszewska
How can I achieve that conversion using Python and store the result in a variable, like variable = "Marta Ga%C5%82uszewska"?
For URL-encoding, you want urllib.parse.quote:
import urllib.parse
s = "Marta Ga\u0142szewska"
q = urllib.parse.quote(s)
# q == 'Marta%20Ga%C5%82szewska'
If you prefer + to %20, you can use quote_plus:
q = urllib.parse.quote_plus(s)
# q == 'Marta+Ga%C5%82szewska'
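To see where %C5%82 comes from: quote encodes the string as UTF-8 by default and percent-escapes each byte that is not URL-safe, and unquote reverses it. A short sketch (the variable names are only for illustration):

```python
import urllib.parse

s = "Marta Ga\u0142szewska"

# U+0142 is the two bytes C5 82 in UTF-8, hence %C5%82.
only_l = urllib.parse.quote("\u0142")

quoted = urllib.parse.quote(s)
# unquote decodes the percent-escapes back, again assuming UTF-8 by default.
restored = urllib.parse.unquote(quoted)

print(only_l, quoted)
```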

How to replace hex value in a string

While importing data from a flat file, I noticed some embedded hex-values in the string (<0x00>, <0x01>).
I want to replace them with specific characters, but am unable to do so. Removing them won't work either.
What it looks like in the exported flat file: https://i.imgur.com/7MQpoMH.png
Another example: https://i.imgur.com/3ZUSGIr.png
This is what I've tried:
(Note that <0x01> represents a non-editable entity. It's not recognized here.)
import io
with io.open('1.txt', 'r+', encoding="utf-8") as p:
    s = p.read()
# included in case it bears any significance
import re
import binascii
s = "Some string with hex: <0x01>"
s = s.encode('latin1').decode('utf-8')
# throws e.g.: >>> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 114: invalid start byte
s = re.sub(r'<0x01>', r'.', s)
s = re.sub(r'\\0x01', r'.', s)
s = re.sub(r'\\\\0x01', r'.', s)
s = s.replace('\0x01', '.')
s = s.replace('<0x01>', '.')
s = s.replace('0x01', '.')
or something along these lines in hopes to get a grasp of it while iterating through the whole string:
for x in s:
    try:
        base64.encodebytes(x)
        base64.decodebytes(x)
        s.strip(binascii.unhexlify(x))
        s.decode('utf-8')
        s.encode('latin1').decode('utf-8')
    except:
        pass
Nothing seems to get the job done.
I'd expect the characters to be replaceable with the methods I've dug up, but they are not. What am I missing?
NB: I have to preserve umlauts (äöüÄÖÜ)
-- edit:
Could I introduce the hex-values in the first place when exporting? If so, is there a way to avoid that?
with io.open('out.txt', 'w', encoding="utf-8") as temp:
    temp.write(s)
Judging from the images, these are actually control characters.
Your editor displays them in this greyed-out way showing you the value of the bytes using hex notation.
You don't have the characters "0x01" in your data, but really a single byte with the value 1, so unhexlify and friends won't help.
In Python, these characters can be produced in string literals with escape sequences using the notation \xHH, with two hexadecimal digits.
The fragment from the first image is probably equal to the following string:
"sich z\x01 B. irgendeine"
Your attempts to remove them were close.
s = s.replace('\x01', '.') should work.
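If the file contains several different control characters (the question mentions both <0x00> and <0x01>), one option is to replace the whole range in one pass. This is a sketch, not a tested fix for the original file: the helper name and the choice to keep tab, newline and carriage return are assumptions. Umlauts are untouched because they are not in the control range.

```python
import re

# C0 control characters except \t (0x09), \n (0x0a) and \r (0x0d), plus DEL (0x7f).
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")

def replace_control_chars(text, replacement="."):
    """Replace every matched control character with `replacement`."""
    return CONTROL_CHARS.sub(replacement, text)

# Made-up sample resembling the fragment above, with umlauts added.
sample = "sich z\x01 B. irgendeine \x00 Gr\u00fc\u00dfe\n"
cleaned = replace_control_chars(sample)
print(repr(cleaned))
```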

How do I convert a numeric string to its corresponding Unicode character?

Given a string such as "2764", how can I programmatically convert it to "\u2764"? Is there a built-in function that will let me convert a standard string to its Unicode-escaped equivalent?
http://docs.python.org/3/library/functions.html#chr
>>> chr(int('2764', 16))
'❤'
First convert your string to the number that is intended. Then convert it to the corresponding character.
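Those two steps as a small function, plus the reverse direction in case the escaped text form is what is wanted. The name hex_to_char is made up for this sketch, and it assumes the input is a valid hexadecimal code point.

```python
def hex_to_char(hex_digits: str) -> str:
    # Step 1: parse the hex digits as an integer. Step 2: map it to a character.
    return chr(int(hex_digits, 16))

heart = hex_to_char("2764")

# Going the other way: the six-character text backslash-u-2-7-6-4.
escaped = heart.encode("unicode_escape").decode("ascii")
print(escaped)
```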
