I am trying to print some text containing emoji, given in the form text = "\\ud83d\\ude04\\n\\u3082\\u3042", as:
# my expected output
# a new line after the emoji, then the Japanese characters
>>>😄
もあ
I have read a question about this, but it only solves part of the problem:
Best and clean way to Encode Emojis (Python) from text file
I followed the code mentioned in that post, and I got the result below:
emoji_text = "\\ud83d\\ude04\\n\\u3082\\u3042".encode("latin_1")
output = (emoji_text
          .decode("raw_unicode_escape")
          .encode('utf-16', 'surrogatepass')
          .decode('utf-16')
          )
print(output)
>>>😄\nもあ
# it prints \n instead of a new line
Therefore, I would like to ask: how can I convert the escape sequences \n, \t, \b, etc. while converting the emoji and text?
Using unicode_escape instead of raw_unicode_escape will decode the \n as well. Though if there is a reason you used raw_unicode_escape in the first place, perhaps this will not be suitable?
Your choice to encode into "latin-1" is vaguely odd, but perhaps there is a reason for that, too. Perhaps you should encode into "ascii" and be prepared to cope with any possible fallout.
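For instance, here is a minimal sketch of that suggestion, keeping the rest of the pipeline from the question unchanged:
emoji_text = "\\ud83d\\ude04\\n\\u3082\\u3042"
output = (emoji_text
          .encode("ascii")            # or "latin_1", as in the question
          .decode("unicode_escape")   # decodes \uXXXX and \n, \t, \b alike
          .encode('utf-16', 'surrogatepass')
          .decode('utf-16')
          )
print(output)
# 😄
# もあ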
Related
I cannot seem to get this single line of code to output anything except BAD REQUEST Unknown Emoji
await dBot.add_reaction(message,"\\U00000031")
I cannot find any reason online why this shouldn't work. What wonderful noob mistake am I making?
The string you're using isn't an escaped Unicode character, but an escaped backslash followed by the letter U and eight digit characters. You probably want only one backslash in the literal, which will let Python parse the literal into a single character as you seem to intend. I'm still not sure that will do what you expect, though, since "\U00000031" is the character '1', not an emoji.
From your comment below, it sounds like the emoji you actually want is composed of two Unicode codepoints. The first is just the normal '1' character I discussed above, which you don't need any escapes to write. The second character is U+20E3 ('COMBINING ENCLOSING KEYCAP'), which can be written in a Python literal as '\u20E3' or '\U000020E3'. This puts a keyboard key image around whatever the previous character was, so the sequence "1\u20e3" will render as 1⃣ (which my browser doesn't handle too well, but yours might). I don't know for sure, but I'd be fairly confident that Discord would accept that, if it supports the 1-key emoji you're looking for at all (which I expect it does).
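As a sketch, using the same add_reaction call from the question (untested against Discord itself):
keycap_one = "1\u20e3"  # '1' followed by U+20E3 COMBINING ENCLOSING KEYCAP
await dBot.add_reaction(message, keycap_one)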
I have the following text in a json file:
"\u00d7\u0090\u00d7\u0097\u00d7\u0095\u00d7\u0096\u00d7\u00aa
\u00d7\u00a4\u00d7\u0095\u00d7\u009c\u00d7\u0092"
which represents the text "אחוזת פולג" in Hebrew.
No matter which encoding/decoding I use, I don't seem to get it right with Python 3.
If, for example, I try:
text = "\u00d7\u0090\u00d7\u0097\u00d7\u0095\u00d7\u0096\u00d7\u00aa \u00d7\u00a4\u00d7\u0095\u00d7\u009c\u00d7\u0092".encode('unicode-escape')
print(text)
I get that text is:
b'\\xd7\\x90\\xd7\\x97\\xd7\\x95\\xd7\\x96\\xd7\\xaa \\xd7\\xa4\\xd7\\x95\\xd7\\x9c\\xd7\\x92'
which as bytes is almost the correct text. If I were able to remove just one backslash and turn
b'\\xd7\\x90\\xd7\\x97\\xd7\\x95\\xd7\\x96\\xd7\\xaa \\xd7\\xa4\\xd7\\x95\\xd7\\x9c\\xd7\\x92'
into
text = b'\xd7\x90\xd7\x97\xd7\x95\xd7\x96\xd7\xaa \xd7\xa4\xd7\x95\xd7\x9c\xd7\x92'
(note how I changed each double backslash to a single backslash), then
text.decode('utf-8')
would yield the correct text in Hebrew.
But I am struggling to do so and couldn't manage to write a piece of code that does it for me (rather than doing it manually, as I just showed).
Any help is much appreciated.
This string does not "represent" Hebrew text (at least not as unicode code points, UTF-16, UTF-8, or in any well-known way at all). Instead, it represents a sequence of UTF-16 code units, and this sequence consists mostly of multiplication signs, currency signs, and some weird control characters.
It looks like the original character data has been encoded and decoded several times with some strange combination of encodings.
Assuming that this is what literally is saved in your JSON file:
"\u00d7\u0090\u00d7\u0097\u00d7\u0095\u00d7\u0096\u00d7\u00aa \u00d7\u00a4\u00d7\u0095\u00d7\u009c\u00d7\u0092"
you can recover the Hebrew text as follows:
(jsonInput                        # the raw text, still containing the \u00d7... escapes
 .encode('latin-1')               # to bytes, one byte per character
 .decode('raw_unicode_escape')    # interpret the \uXXXX escapes
 .encode('latin-1')               # back to bytes: now genuine UTF-8 data
 .decode('utf-8')                 # decode the UTF-8 into Hebrew text
 )
For the above example, it gives:
'אחוזת פולג'
If you are using a JSON deserializer to read in the data, then you should of course omit the .encode('latin-1').decode('raw_unicode_escape') steps, because the JSON deserializer would already interpret the escape sequences for you. That is, after the text element is loaded by the JSON deserializer, it should be sufficient to just encode it as latin-1 and then decode it as utf-8. This works because latin-1 (ISO-8859-1) is an 8-bit character encoding that corresponds exactly to the first 256 code points of Unicode, whereas your strangely broken text encodes each byte of the UTF-8 encoding as an ASCII escape of a UTF-16 code unit.
I'm not sure what you can do if your JSON contains both the broken escape sequences and valid text at the same time; it might be that the latin-1 round-trip no longer works properly. Please don't apply this transformation to your JSON file unless the JSON itself contains only ASCII; otherwise it would only make everything worse.
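As a sketch of the deserializer case (the JSON snippet and the "name" key here are invented for illustration):
import json

raw = r'{"name": "\u00d7\u0090\u00d7\u0097\u00d7\u0095\u00d7\u0096\u00d7\u00aa \u00d7\u00a4\u00d7\u0095\u00d7\u009c\u00d7\u0092"}'
data = json.loads(raw)                                  # json.loads already interprets the \uXXXX escapes
fixed = data["name"].encode('latin-1').decode('utf-8')  # latin-1 round-trip, then decode as UTF-8
print(fixed)  # אחוזת פולג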
I am trying to read a text file which has unicode prefixes (u'...') and escape sequences (\n, \uXXXX) in the text; here is an example:
(u'B9781437714227000962', u'Definition\u2014Human papillomavirus
(HPV)\u2013related proliferation of the vaginal mucosa that leads to
extensive, full-thickness loss of maturation of the vaginal
epithelium.\n')
How can I remove these unicode escapes using Python 3 on Linux?
To remove the unicode escape sequences (or better, to translate them) in Python 3:
a.encode('utf-8').decode('unicode_escape')
The decode step translates the unicode escape sequences into the corresponding unicode characters. Unfortunately such (un-)escaping does not work directly on strings, so you need to encode the string first before decoding it.
But as pointed out in the question comments, you have a serialized document. Try to deserialize it with the correct tools, and the unicode "unescaping" will come for free.
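For example, if each line of the file really is a Python-literal tuple like the one shown, ast.literal_eval can parse it safely (a sketch under that assumption):
import ast

line = "(u'B9781437714227000962', u'Definition\\u2014Human papillomavirus (HPV)\\u2013related proliferation\\n')"
doc_id, text = ast.literal_eval(line)  # the \uXXXX and \n escapes are translated during parsing
print(text)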
I ran into an issue when writing the header of a text file in python 3.
I have a header that contains unicode AND new line characters. The following is a minimum working example:
with open('my_log.txt', 'wb') as my_file:
    str_1 = '\u2588\u2588\u2588\u2588\u2588\n\u2588\u2588\u2588\u2588\u2588'
    str_2 = 'regular ascii\nregular ascii'
    my_file.write(str_1.encode('utf8'))
    my_file.write(bytes(str_2, 'UTF-8'))
The above works, except the output file does not have the new lines (it basically looks like I replaced '\n' with ''). Like the following:
██████████regular asciiregular ascii
I was expecting:
█████
█████
regular ascii
regular ascii
I have tried replacing '\n' with u'\u000A' and other characters based on similar questions - but I get the same result.
An additional, and probably related, question: I know I am making my life harder with the above encoding and byte methods. Still getting used to unicode in py3 so any advice regarding that would be great, thanks!
EDIT
Based on Ignacio's response and some more research: The following seems to produce the desired results (basically converting from '\n' to '\r\n' and ensuring the encoding is correct on all the lines):
with open('my_log.txt', 'wb') as my_file:
    str_1 = '\u2588\u2588\u2588\u2588\u2588\r\n\u2588\u2588\u2588\u2588\u2588'
    str_2 = '\r\nregular ascii\r\nregular ascii'
    my_file.write(str_1.encode('utf8'))
    my_file.write(str_2.encode('utf8'))
Since you mentioned wanting advice using Unicode on Python 3...
You are probably using Windows since the \n isn't working correctly for you in binary mode. Linux uses \n line endings for text, but Windows uses \r\n.
Open the file in text mode and declare the encoding you want, then just write the Unicode strings. Below is an example that includes different escape codes for Unicode:
#coding:utf8
str_1 = '''\
\u2588\N{FULL BLOCK}\U00002588█
regular ascii'''
with open('my_log.txt', 'w', encoding='utf8') as my_file:
    my_file.write(str_1)
You can use a four-digit escape \uxxxx, an eight-digit escape \Uxxxxxxxx, or a named escape \N{CHARACTER NAME}. Unicode characters can also be used directly in the source file as long as the #coding: declaration is present and the source file is saved in the declared encoding.
Note that the default source encoding for Python 3 is utf8 so the declaration I used above is optional, but on Python 2 the default is ascii. The source encoding does not have to match the encoding used to open a file.
Use w or wt for writing text (t is the default). On Windows \n will translate to \r\n in text mode.
'wb'
The file is opened in binary mode, so \n isn't translated into the native newline format. If you open the file in a text editor that doesn't treat a bare LF as a line break, all the text will appear on a single line. Either open the file in text mode with an appropriate encoding, or translate the newlines manually before writing.
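A minimal sketch of the manual translation, assuming Windows-style \r\n endings are wanted:
text = '\u2588\u2588\u2588\u2588\u2588\nregular ascii'
with open('my_log.txt', 'wb') as my_file:
    my_file.write(text.replace('\n', '\r\n').encode('utf8'))  # translate newlines by hand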
I'm writing scripts to clean up unicode text files (stored as UTF-8), and I chose to use Python 3.x (3.2) rather than the more popular 2.x because 3.x is supposed to default to UTF-8. Maybe I'm doing something wrong, but it seems that the print statement, at least, still is not defaulting to UTF-8. If I try to print a string (msg below is a string) that contains special characters, I still get a UnicodeEncodeError like this:
print(label, msg)
... in encode
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u0968' in position
38: character maps to <undefined>
If I use the encode() method first (which does nicely default to UTF-8), I can avoid the error:
print(label, msg.encode())
This also works for printing objects or lists containing unicode strings--something I often have to do when debugging--since str() seems to default to UTF-8. But do I really need to remember to use print(str(myobj).encode()) every single time I want to do a print(myobj) ? If so, I suppose I could try to wrap it with my own function, but I'm not confident about handling all the argument permutations that print() supports.
Also, my script loads regular expressions from a file and applies them one by one. Before applying encode(), I was able to print something fairly legible to the console:
msg = 'Applying regex {} of {}: {}'.format(i, len(regexes), regex._findstr)
print(msg)
Applying regex 5 of 15: ^\\ge[0-9]*\b([ ]+[0-9]+\.)?[ ]*
However, this crashes if the regex includes literal unicode characters, so I applied encode() to the string first. But now the regexes are very hard to read on-screen (and I suspect I may have similar trouble if I try to write code that saves these regexes back to disk):
msg = 'Applying regex {} of {}: {}'.format(i, len(regexes), regex._findstr)
print(msg.encode())
b'Applying regex 5 of 15: ^\\\\ge[0-9]*\\b([ ]+[0-9]+\\.)?[ ]*'
I'm not very experienced yet in Python, so I may be misunderstanding. Any explanations or links to tutorials (for Python 3.x; most of what I see online is for 2.x) would be much appreciated.
print doesn't default to any encoding, it just uses whatever encoding the output device (like a console) claims to support. Your console encoding appears to be non-unicode, so print tries to encode your unicode strings in that encoding, and fails. The easiest way to get around this is to tell the console to use utf8 (like export LC_ALL=en_US.UTF-8 on unix systems).
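On Python 3.7 and later you can also reconfigure the stream itself (a sketch; note the original question used 3.2, which predates this method):
import sys

sys.stdout.reconfigure(encoding='utf-8')  # requires Python 3.7+
print('\u0968')  # prints ३ instead of raising UnicodeEncodeError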
An easier way to proceed in general is to use unicode strings everywhere inside your script, and encoded data only when you interact with the "outside" world, that is, when you have input to decode or output to encode.
To do so, every time you read something, use decode; every time you output something, use encode.
For your regexes, use the re.UNICODE flag.
I know that this doesn't exactly answer your question point by point, but I think that applying such a methodology should keep you safe from encoding issues.
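A small sketch of that pattern, with a hypothetical regexes.txt file:
import re

# Decode once, at the input boundary.
with open('regexes.txt', 'rb') as f:
    patterns = f.read().decode('utf-8').splitlines()
regexes = [re.compile(p, re.UNICODE) for p in patterns]

# Work with unicode strings throughout; encode only at the output boundary.
with open('log.txt', 'wb') as f:
    f.write('Loaded {} regexes\n'.format(len(regexes)).encode('utf-8'))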