.encode("utf-8") does not seem to support my emoji? - python-3.x

Sorry, this is likely a simple solution. In a server response, I am trying to return an emoji. Right now, I have this line:
return [b"Hello World " “👋”.encode(“utf-8”)]
However, I get the following error:
return [b"Hello World " “👋”.encode(“utf-8”)]
^
SyntaxError: invalid character in identifier
What I'd like to see is: Hello World 👋

The problem is that a byte string b'...' cannot contain a character which does not fit into a byte. But you don't need a byte string here anyway; encode will convert a string to bytes - that's what it does.
Try
return ["Hello World “👋”".encode("utf-8")]
The quotes in your question were wacky; I assume you want curly quotes around the emoji and honest-to-GvR quotes around the Python string literals.
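As a quick check of this point, here is a minimal sketch (the variable names are illustrative, not from the question) that encodes a normal string containing the emoji and confirms the round trip:

```python
# A str may contain any Unicode character; encode() turns it into bytes.
message = "Hello World 👋"
payload = message.encode("utf-8")

# The emoji takes four bytes in UTF-8, so the payload is longer than the str.
print(len(message), len(payload))

# Decoding with the same codec gives back the original text.
assert payload.decode("utf-8") == message
```

The last four bytes of payload are the UTF-8 encoding of the emoji (F0 9F 91 8B).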

In Python, you can put (almost) any Unicode character in a string literal.
Also, you can use most Unicode letters in identifiers, e.g. if you think it's appropriate to define a variable α (Greek lower-case alpha).
But you can't use "fancy quotes" for delimiting string literals.
Look carefully: the double quotes around the emoji (and also around utf-8) aren't straight ASCII quotes, but rather typographical ones – the ones word processors substitute when typing a double quote in text.
Make sure you use a proper programming editor or an IDE for coding.
Then the line will look like this:
return [b"Hello World " "👋".encode("utf-8")]
Edit:
I realise that this still doesn't work: you cannot mix byte string and Unicode string literals (even though here the Unicode literal gets converted to bytes later).
Instead, you have to concatenate them with the + operator:
return [b"Hello World " + "👋".encode("utf-8")]
Or use a single string literal, like tripleee suggested.

There are several problems. Your code:
return [b"Hello World " “👋”.encode(“utf-8”)]
You are using “ and ” (twice each) instead of the proper double quote character ("). You should use a proper code editor, or disable automatic character conversion. Also take care when copying and pasting.
As you can see from the error, the problem is not the emoji in the string but an invalid identifier (it is a syntax error): the curly quotes are unknown characters outside any string.
But if you correct it to:
return [b"Hello World " "👋".encode("utf-8")]
you still get an error: SyntaxError: cannot mix bytes and nonbytes literals.
This is because Python tries to merge adjacent string literals before calling encode, but one is a b-string and the other is a normal string.
So you could use one of the following:
return [b"Hello World " + "👋".encode("utf-8")] # the + forces the order of operations
or either of the following two (which are equivalent):
return ["Hello World 👋".encode("utf-8")]
return [b"Hello World \xf0\x9f\x91\x8b"]
Python 3 uses UTF-8 as the default source encoding, so your editor will already save the emoji as UTF-8. That does not let you type it directly inside a b-string, though: bytes literals may contain only ASCII characters, so b"Hello World 👋" is rejected with a SyntaxError. Either encode a normal string or spell out the UTF-8 bytes with \x escapes.
Note: the source encoding only affects how the file is read. One could manually force the source code into another encoding, but in that case you will probably already have problems saving a file that contains the emoji.
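The spellings discussed in this thread can be checked directly. The sketch below is only an illustration (the labels and the candidates dictionary are made up for this example): it compiles each expression separately so that a SyntaxError in one does not stop the others.

```python
# Each candidate is compiled on its own, so a SyntaxError can be caught per spelling.
candidates = {
    "adjacent bytes and str literals": 'b"Hello World " "👋".encode("utf-8")',
    "explicit + concatenation": 'b"Hello World " + "👋".encode("utf-8")',
    "single str literal, then encode": '"Hello World 👋".encode("utf-8")',
    "emoji inside a bytes literal": 'b"Hello World 👋"',
}

results = {}
for label, source in candidates.items():
    try:
        results[label] = eval(compile(source, "<candidate>", "eval"))
    except SyntaxError as exc:
        results[label] = f"SyntaxError: {exc.msg}"

for label, outcome in results.items():
    print(f"{label}: {outcome!r}")
```

On Python 3, only the explicit concatenation and the single encoded string produce bytes; the other two are rejected by the compiler.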

Related

create a string with quotes and back slash using python

I would like to create a string in the following exact format:
"\"\"\"\nThis is a beautiful world.\n"
But the code:
test ="\"\"\"\nthis is a beautiful world.\n"
test
gives the output:
'"""\nthis is a beautiful world.\n'
Please help me get the exact text.
My string test should look exactly like the string it was initialized with, but after initialization it gives the output shown above. I want to concatenate the test string to another string.
The symbol "\" is called an escape character in most programming languages. It is used to add symbols to a string that would otherwise be hard to include. For example, to put a double quote into a string, we write \" in the string:
a = "he said, \"hello\" to me"
This gives the output:
he said, "hello" to me
Here, the "\" tells the interpreter to treat the following character literally, where it would otherwise raise an error.
To include a backslash in a string, add an extra backslash in front of it:
a = "\\"
Here, the value of a is \.
If you still haven't been able to understand it, try this tutorial.
For your code, try this:
test = "\\\"\\\"\\\"\\nThis is a beautiful world.\\n"
And if you also want the double quotes at the ends:
test = "\"\\\"\\\"\\\"\\nThis is a beautiful world.\\n\""
The first thing to note is, that just typing in the variable name when running python interactively returns the canonical string representation of the object and not (necessarily) the plain value of the object.
For strings this means (among other things) that quotes are added around the output (in your example the outermost single quotes) and any newlines are replaced with "\n".
This means that, although your output does show "\n" the actual string contains a newline character in its place.
To check what a string looks like, you should use the print() function to, well, print it.
>>> test = "\"\"\"\nthis is a beautiful world."
>>> test
'"""\nthis is a beautiful world.'
>>> print(test)
"""
this is a beautiful world.
>>>
Also, when running the code from a file, lines just containing variable names will not result in any output.
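The difference between the echoed representation and the printed value can also be seen in a script, where repr() has to be called explicitly. A small sketch using the string from the question:

```python
test = "\"\"\"\nthis is a beautiful world.\n"

# repr() is what the interactive prompt echoes: quotes added, newlines shown as \n.
print(repr(test))

# print() writes the actual characters, so the newline really breaks the line.
print(test)

# The string holds real newline characters and no backslashes at all.
assert "\\" not in test
assert test.count("\n") == 2
```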
To answer the question
There are a few ways to handle that.
Assuming that the desired output is
\"\"\"\nThis is a beautiful world.\n
i.e. the outermost double quotes are not supposed to be part of the string, the options are:
While using double quotes ("…") to denote strings: escape any \ or " by prepending it with \:
>>> test ="\\\"\\\"\\\"\\nthis is a beautiful world.\\n"
>>> print(test)
\"\"\"\nthis is a beautiful world.\n
Within regular strings, \ is used to designate control characters. For example, \n is interpreted as a newline and \b as a backspace. If you need a \ in a string, you actually need to write two: \\.
If you are usually using "…" for string notation, this allows for a more consistent coding style but it is (especially in this case) quite ugly and might be hard to understand at a glance.
As your string contains a lot of " characters, just use single quotes ('…') to designate the string. This removes the need to escape ":
>>> test = '\\"\\"\\"\\nthis is a beautiful world.\\n'
>>> print(test)
\"\"\"\nthis is a beautiful world.\n
This is less consistent (if "…" is usually used for strings), but it allows the code to be quite a bit closer to the desired output.
Use raw strings (r'…' or r"…") to disable the interpretation of control characters and allow the use of " within the string:
>>> test = r'\"\"\"\nthis is a beautiful world.\n'
>>> print(test)
\"\"\"\nthis is a beautiful world.\n
or even
>>> test = r"\"\"\"\nthis is a beautiful world.\n"
>>> print(test)
\"\"\"\nthis is a beautiful world.\n
This allows the code to be identical to the desired output, but it has some limitations when it comes to freely mixing " and ' within a single string as it is not possible to escape the quotation marks within the string without also adding \ to the string output. This can be seen in the second example, where we use \" to escape the double quote within r"…" in the code but where the \s are still present in the output. While this works well in this specific case, I would recommend against using \' within r'…' or \" within r"…" to avoid confusion.
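To confirm that the different notations describe the same string, they can be compared directly. This is only a sketch; the variable names are illustrative.

```python
escaped_double = "\\\"\\\"\\\"\\nthis is a beautiful world.\\n"
escaped_single = '\\"\\"\\"\\nthis is a beautiful world.\\n'
raw_single = r'\"\"\"\nthis is a beautiful world.\n'
raw_double = r"\"\"\"\nthis is a beautiful world.\n"

# All four literals produce exactly the same characters.
assert escaped_double == escaped_single == raw_single == raw_double

# Every backslash is a literal character here, so the string has no newline.
assert "\n" not in raw_single
print(raw_single)
```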

Catching String Literal Errors using Flex-Lexer

I have managed to effectively match valid string literals in my Flex program, but I would also like to match unterminated string literals and string literals with bad escape sequences.
For example, my string literals are matched using simple regex as such:
\"(\\.|[^\\"])*\"
Then I tried to match the case where a string literal starts with a ", followed by some text, and then \n. This is incorrect syntax for my lexer, and I'd like to catch it and report an error.
My current regex I came up with for that is this:
\"(\\.|[^\\"])*\n
Which does correctly catch the error, but then seems to eat up the rest of the tokens, because there's no output after that.
Additionally, I was also looking to have a special case error for when an unterminated string literal had an invalid escape sequence. For example:
"some text \
int abc
So my question boils down to, is there something wrong with my current way of matching string literals that's affecting my ability to catch these errors, or is my pattern matching unnecessarily consuming tokens? It's also possible I have no idea what I'm doing!
Some examples of strings:
"a correct string literal"
"an unterminated string literal
"an unterminated string literal with escape \
All string literals are single-line and follow the form:
"(.*)"\n
The correct flex pattern for string literals is (see below for escape sequences):
\"(\\(.|\n)|[^\\"\n])*\"
This differs from your pattern in that it allows a newline after an escape character (which is technically a splice, rather than part of the syntax of string literals [Note 1]), and bans newlines otherwise. That has to be done explicitly because [^...] includes a newline unless \n is part of the list of characters to be rejected. Only . implicitly bans newlines.
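The effect of a character class that admits newlines can be reproduced outside flex. The sketch below uses Python's re module purely as an approximation (it is a backtracking engine, not flex, and the sample input is made up), but it shows the same greedy behaviour for the pattern from the question:

```python
import re

source = '"an unterminated string literal\nint abc;\nint def;\n'

# The error pattern from the question: [^\\"] also accepts newlines.
original = re.compile(r'"(\\.|[^\\"])*\n')
# The same pattern with \n excluded from the character class.
fixed = re.compile(r'"(\\.|[^\\"\n])*\n')

print(repr(original.match(source).group()))  # swallows all three lines
print(repr(fixed.match(source).group()))     # stops at the first newline
```

With the newline allowed inside the class, the match runs on to the last newline before the next quote or the end of input, which is why the following tokens disappear.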
To match incorrect string literals, you only need the same pattern without the terminating ":
\"(\\(.|\n)|[^\\"\n])*
You don't need to worry about that pattern matching correct string literals, because flex always chooses the longest match, and the match with the terminating quote is guaranteed to be longer.
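The longest-match argument can be illustrated with a small Python sketch. It is an emulation under stated assumptions, not flex itself: the two patterns are transcribed to Python's re syntax, and classify is a hypothetical helper that picks the longer of the two matches, as flex would.

```python
import re

# Python renderings of the two flex patterns above (an approximation, not flex itself).
VALID = re.compile(r'"(\\(.|\n)|[^\\"\n])*"')
UNTERMINATED = re.compile(r'"(\\(.|\n)|[^\\"\n])*')

def classify(text, pos=0):
    """Return (kind, lexeme) for the literal at pos, preferring the longest match."""
    best = None
    for kind, pattern in (("valid", VALID), ("unterminated", UNTERMINATED)):
        match = pattern.match(text, pos)
        if match and (best is None or len(match.group()) > len(best[1])):
            best = (kind, match.group())
    return best

print(classify('"a correct string literal"\n'))
print(classify('"an unterminated string literal\nint abc\n'))
```

For a correct literal both patterns match, but the valid one is longer by the closing quote, so it wins. For an unterminated literal only the error pattern matches, and it stops before the newline.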
If you want to be more accurate about escape characters, you would need something like:
\"(\\([abfnrtv'"?\\\n]|[0-7]{1,3}|x[[:xdigit:]]+|u[[:xdigit:]]{4}|U[[:xdigit:]]{8})|[^\\"\n])*\"
You can use the same technique to match errors, but you might want to distinguish between unterminated quote errors and invalid escape errors, which you can do by using two error patterns:
\"(\\([abfnrtv'"?\\\n]|[0-7]{1,3}|x[[:xdigit:]]+|u[[:xdigit:]]{4}|U[[:xdigit:]]{8})|[^\\"\n])*\" { /* Valid string */ }
\"(\\([abfnrtv'"?\\\n]|[0-7]{1,3}|x[[:xdigit:]]+|u[[:xdigit:]]{4}|U[[:xdigit:]]{8})|[^\\"\n])*/\\ { /* Invalid escape sequence */ }
\"(\\([abfnrtv'"?\\\n]|[0-7]{1,3}|x[[:xdigit:]]+|u[[:xdigit:]]{4}|U[[:xdigit:]]{8})|[^\\"\n])* { /* Missing terminating quote */ }
Notes
A "splice" is a backslash at the end of a line. You commonly see these only in definitions of long macros, but C allows a splice anywhere: the backslash and the following newline are just deleted from the program text, so a splice can even be placed in the middle of an identifier or multicharacter operator. (But don't do that!)
Using splices to continue strings over multiple lines is not good style; it's better to use string concatenation. But the C standard does allow it.
However, splices are removed before tokenisation starts, which means that you cannot backslash escape a backslash at the end of a line:
"This is a string literal which includes a \\
t tab, with a splice in the middle of the escape."
Please don't use that in production code, either :-)

Python3 - Convert unicode literals string to unicode string

From command line parameters (sys.argv) I receive string of unicode literals like this: '\u041f\u0440\u0438\u0432\u0435\u0442\u0021'
For example this script uni.py:
import sys
print(sys.argv[1])
command line:
python uni.py \u041f\u0440\u0438\u0432\u0435\u0442\u0021
output:
\u041f\u0440\u0438\u0432\u0435\u0442\u0021
I want to convert it to unicode string 'Привет!'
You don't have to convert it to Unicode, because it already is Unicode. In Python 3.x, strings are Unicode by default. You only have to convert them (to or from bytes) when you want to read or write bytes, for example, when writing to a file.
If you just print the string, you'll get the correct result, assuming your terminal supports the characters.
print('\u041f\u0440\u0438\u0432\u0435\u0442\u0021')
This will print:
Привет!
UPDATE
After updating your question it became clear to me that the mentioned string is not really a string literal (or unicode literal), but input from the command line. In that case you could use the "unicode-escape" encoding to get the result you want.
Note that encoding works from Unicode to bytes, and decoding works from bytes to Unicode. In this case you want a transformation from Unicode to Unicode, so you have to add a "dummy" encoding step using latin-1, which transparently converts Unicode code points (up to U+00FF) to bytes, before decoding with unicode-escape.
The following code will print the correct result for your example:
text = sys.argv[1].encode('latin-1').decode('unicode-escape')
print(text)
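The same conversion can be tried without a command line. In this sketch a raw string stands in for sys.argv[1] (an assumption made so that the example is self-contained):

```python
# Stand-in for sys.argv[1]: a raw string, so the backslashes are literal characters.
arg = r"\u041f\u0440\u0438\u0432\u0435\u0442\u0021"

# latin-1 maps code points 0-255 to single bytes; unicode-escape then
# interprets the \uXXXX sequences in those bytes.
text = arg.encode("latin-1").decode("unicode-escape")
print(text)
assert text == "Привет!"

# Caveat: input that already contains characters above U+00FF cannot be latin-1 encoded.
try:
    "Привет".encode("latin-1")
except UnicodeEncodeError as exc:
    print("latin-1 cannot encode:", exc.object[exc.start:exc.end])
```

The caveat at the end matters if the argument can mix escape sequences with real non-Latin characters.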
UPDATE 2
Alternatively, you could use ast.literal_eval() to parse the string from the input. However, this method expects a proper Python literal, including the quotes. You could do something like this to solve it:
text = ast.literal_eval("'" + sys.argv[1] + "'")
But note that this would break if you had a quote as part of your input string. I think it's a bit of a hack, since the method is probably not intended for the purpose you use it for. The unicode-escape approach is simpler and more robust. However, what the best solution is depends on what you're building.
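A short sketch of the ast.literal_eval() variant, including the failure on a quote in the input. parse_escapes is a hypothetical helper name, and the raw string again stands in for sys.argv[1]:

```python
import ast

def parse_escapes(raw):
    """Hypothetical helper: wrap raw in quotes and let ast.literal_eval decode it."""
    return ast.literal_eval("'" + raw + "'")

print(parse_escapes(r"\u041f\u0440\u0438\u0432\u0435\u0442\u0021"))

# A quote inside the input ends the literal early, which is the breakage described above.
try:
    parse_escapes("it's broken")
    failed = False
except SyntaxError:
    failed = True
print("quote in input failed:", failed)
```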

Python 3: unescape literal hex, unicode and python escapes

What is the proper way to do the following:
accept any unicode string (this gets read in from a file that contains utf-8 encoded strings)
if the string contains literal escapes (sequences) of the form \xA0\xB0 (so in Python this will show as \\xA0\\xB0) replace them with the actual characters or with some fallback (e.g. a space) if this is an invalid code
if the string contains literal unicode character escapes of the form \u0008 (so in Python this will show as \\u0008) replace with the actual character represented by that code
if the string contains Python string escapes of the form \n or \t (so in Python this will show as \\n and \\t), replace with the actual character represented by that code
As a bonus, also replace all HTML entities of the form &#a0;
all actual unicode characters should remain unchanged
Basically, something that converts back all the rubbish one sometimes gets when people have created a file in the wrong way and it now contains the literal escapes instead of the code represented by those escapes.
This has GOT to be a solved problem, but all I can find are rather clumsy and incomplete solutions.
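One possible starting point, offered only as a rough sketch and not as a complete solution: a regular expression that expands the three kinds of literal escapes listed above, followed by html.unescape() from the standard library for the entities. The names ESCAPE, SIMPLE and unescape_text are hypothetical, and the sketch does not handle doubled backslashes or other Python escapes.

```python
import html
import re

# Matches a literal backslash followed by xHH, uHHHH, n or t.
ESCAPE = re.compile(r"\\(x[0-9A-Fa-f]{2}|u[0-9A-Fa-f]{4}|[nt])")
SIMPLE = {"n": "\n", "t": "\t"}

def unescape_text(text, fallback=" "):
    """Hypothetical helper: expand literal escapes, then HTML entities."""
    def expand(match):
        body = match.group(1)
        if body in SIMPLE:
            return SIMPLE[body]
        code = int(body[1:], 16)
        # Lone surrogates are not valid characters on their own; use the fallback.
        return fallback if 0xD800 <= code <= 0xDFFF else chr(code)
    return html.unescape(ESCAPE.sub(expand, text))

print(unescape_text(r"caf\xe9\tna\u00efve &amp; &#xa0;done"))
```

Real Unicode characters pass through unchanged because neither the pattern nor html.unescape() touches them.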

Replace character with a safe character and vice-versa

Here's my problem:
I need to store sentences "somewhere" (it doesn't matter where).
The sentences must not contain spaces.
When I extract the sentences from that "somewhere", I need to restore the spaces.
So, before storing the sentence "I am happy" I could replace the spaces with a safe character, such as &. In C#:
theString.Replace(' ', '&');
This would yield 'I&am&happy'.
And when retrieving the sentence, I would do the reverse:
theString.Replace('&', ' ');
But what if the original sentence already contains the '&' character?
Say I did the same thing with the sentence 'I am happy & healthy'. With the design above, the string would come back as 'I am happy   healthy', since the '&' char has been replaced with a space.
(Of course, I could change the & character to a more unlikely symbol, such as ¤, but I want this to be bulletproof.)
I used to know how to solve this, but I forgot how.
Any ideas?
Thanks!
Fredrik
Maybe you can use url encoding (percent encoding) as an inspiration.
Characters that are not valid in a url are escaped by writing %XX where XX is a numeric code that represents the character. The % sign itself can also be escaped in the same way, so that way you never run into problems when translating it back to the original string.
There are probably other similar encodings, and for your own application you can use an & just as well as a %, but by using an existing encoding like this, you can probably also find existing functions to do the encoding and decoding for you.
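As an illustration of the idea (in Python rather than C#, and only as a sketch), the standard library's urllib.parse.quote and unquote perform exactly this kind of reversible escaping:

```python
from urllib.parse import quote, unquote

sentence = "I am happy & healthy"

# safe="" asks quote() to escape every reserved character, including "/".
stored = quote(sentence, safe="")
print(stored)

# No spaces remain in the stored form, and decoding restores the original.
assert " " not in stored
assert unquote(stored) == sentence

# A literal percent sign survives too, because "%" itself is escaped as %25.
tricky = "100% sure & 50%20 off"
assert unquote(quote(tricky, safe="")) == tricky
```

Because the escape character is itself escaped, the original sentence may contain any character, including the one used for escaping.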
