Python: Printing unicode characters beyond FFFF - string

On Python 3 printing unicode characters can be printed like this:
print('\uFFFF')
But how can I print higher unicode characters like 001FFFFF? print('\u001FFFFF') will just print 001F as unicode character and then 4 times F. Trying to use print('\u001F\uFFFF') will result in 2 unicode characters instead of the wanted one. Is it possible to print somehow the unicode character 001FFFFF in Python 3?

Use an upper-case U.
print('\U001FFFFF')

There is another way in Python 3, using the built-in function chr(i), which
Return the string representing a character whose Unicode code point is
the integer i.
and
The valid range for the argument is from 0 through 1,114,111 (0x10FFFF in base 16).
so there are no limitation for the hex digit value.
print(chr(97))
print(chr(0xFFFF))
print(chr(0x10080))

Related

Why the string in unicode form is not equal to its unicode code point value?

We can get the string 你's unicode code point value:
u'你'.encode('unicode-escape')
b'\\u4f60'
Why the string in unicode form is not equal to its unicode code point value?
u'你' == u'\x4f\x60'
False
u'你' == u'\\u4f60'
False
It is, but your comparison strings are not correct to compare. The first one is two separate characters of a single byte, and the second one has the backslash escaped, meaning that it is the literal 6 characters \u4f60.
u'你' == u"\u4f60"
True
The encoded byte string has the two backslashes since the encoding escapes it, making it not equivalent even if turned back into a string unless you decode it with unicode-escape as well.
Side note, the u is default in python 3.

How to get single backslash instead of double backslash with encode("unicode-escape")?

Get unicode point of character Ä.
Python3 version.
>>> str="Ä"
>>> str.encode("unicode-escape")
b'\\xc4'
How to get the single backslash format b'\xc4' instead of b'\\xc4' as my output ?
It's not entirely clear to me what you want, so I'll give you a few options.
Get the (Unicode) code point of a character as an integer:
>>> ord('Ä')
196
Display the integer in hex notation:
>>> hex(ord('Ä'))
'0xc4'
or with string formatting:
>>> '{:X}'.format(ord('Ä'))
'C4'
However, you talk about backslashes and show the bytestring b'\xc4'.
This is the Latin-1 encoding of 'Ä' (all characters with a Unicode codepoint below 256 can be encoded with Latin-1, and their byte value equals the Unicode codepoint).
>>> 'Ä'.encode('latin-1')
b'\xc4'
This is a bytestring of length 1.
It is displayed in a way in which you could type this character, ie. using an escape sequence with backslash-x and a two-digit hex number.
The "unicode-escape" codec produces these four ASCII characters (\, x, c 4), but not as str, but as a bytes object (because str.encode() returns bytes by definition).
To get a backslash in a str/bytes literal, you need to type two backslashes, so the representation form also uses two backslashes:
>>> 'Ä'.encode('unicode-escape')
b'\\xc4'
The "unicode-escape" codec is very Python-specific and I don't see a lot of applications; maybe if you want to write your own pickle protocol or parse fragments of Python source code.

How can I convert a character code to a string character in Lua?

How can I convert a character code to a string character in Lua?
E.g.
d = 48
-- this is what I want
str_d = "0"
You are looking for string.char:
string.char (···)
Receives zero or more integers. Returns a string with length equal to the number of arguments, in which each character has the internal numerical code equal to its corresponding argument.
Note that numerical codes are not necessarily portable across platforms.
For your example:
local d = 48
local str_d = string.char(d) -- str_d == "0"
For ASCII characters, you can use string.char.
For UTF-8 strings, you can use utf8.char(introduced in Lua 5.3) to get a character from its code point.
print(utf8.char(48)) -- 0
print(utf8.char(29790)) -- 瑞

AS3 - "\u2605" NOT the same as "\\u"+"2605"?

Trying to make a textfield where people write the unicode without the backslash. I want to add the backslash after they typed it. So the user types u2605 and the code converts it to "\u2605", i then convert this to a unicode character and insert it in textflow.
My code:
this works:
span.text = publicFunctions.htmlUnescape(he.encode("\u2605"))
this doesn't work:
span.text = publicFunctions.htmlUnescape(he.encode("\\u"+"2605"))
how to make a string that acts as a unicode string?
Tried all sorts of things, escape(unescape()), convert to number, "\u", "\u" ... nothing helps.
trace("\u2605" == "\u"+"2605") ... will return false. So will
trace("\u2605" == "\u"+"2605")
"\u2605" is a string with a single character, the character with the code point 2605, while "\\u" + "2605" is a string with 6 characters (the backslash, the u and the four digit number).
If you want to construct a unicode character from just the four digits, you should be able to use String.fromCharCode. The thing is just that the escape sequence uses a hexadecimal number, while the method obviously takes a decimal number. So if the user enters a hexadecimal string, you will have to convert that first:
trace(String.fromCharCode(parseInt('2605', 16)) == '\u2605'));
That's an interesting issue! I don't think you can concatenate a string literal and achieve what you're trying to do. The relevant character escaping happens when the string literal is originally formed, which means that you need the whole sequence together in the first place.
But you should be able to take the user-supplied number and dynamically generate a Unicode string with String.fromCharCode(...).
http://help.adobe.com/en_US/FlashPlatform/reference/actionscript/3/String.html#fromCharCode()

Pulling valid data from bytestring in Python 3

Given the following bytestring, how can I remove any characters matching \xFF, and create a list object from what's left (by splitting on removed areas)?
b"\x07\x00\x00\x00~\x10\x00pts/5\x00\x00/5\x00\x00user\x00\x00"
Desired result:
["~", "pts/5", "/5", "user"]
The above string is just an example - I'd like to remove any \x.. (non-decoded) bytes.
I'm using Python 3.2.3, and would prefer to use standard libraries only.
>>> a = b"\x07\x00\x00\x00~\x10\x00pts/5\x00\x00/5\x00\x00user\x00\x00"
>>> import re
>>> re.findall(rb"[^\x00-\x1f\x7f-\xff]+", a)
[b'~', b'pts/5', b'/5', b'user']
The results are still bytes objects. If you want the results to be strings:
>>> [i.decode("ascii") for i in re.findall(rb"[^\x00-\x1f\x7f-\xff]+", a)]
['~', 'pts/5', '/5', 'user']
Explanation:
[^\x00-\x1f\x7f-\xff]+ matches one or more (+) characters that are not in the range ([^...]) between ASCII 0 and 31 (\x00-\x1F) or between ASCII 127 and 255 (\x7f-\xff).
Be aware that this approach only works if the "embedded texts" are ASCII. It will remove all extended alphabetic characters (like ä, é, € etc.) from strings encoded in an 8-bit codepage like latin-1, and it will effectively destroy strings encoded in UTF-8 and other Unicode encodings because those do contain byte values between 0 and 31/127 and 255 as parts of their character codes.
Of course, you can always manually fine-tune the exact ranges you want to remove according to the example given in this answer.

Resources