When I put non-alphanumeric symbols into the browser address bar, they get URL-encoded. For example, http://ru2.php.net/manual-lookup.php?pattern=привет turns into http://ru2.php.net/manual-lookup.php?pattern=%EF%F0%E8%E2%E5%F2.
The question is: what do those percent-prefixed pairs of hex digits mean?
They are the bytes of the Windows-1251 encoding of the Cyrillic text. Since there are only six of them, they can't be UTF-8: in UTF-8 it would take 12 bytes, two per character, to encode 6 Cyrillic characters.
The code chart for CP1251 can be found here: http://en.wikipedia.org/wiki/Windows-1251.
Just as 20 is the hex code for a space, each Cyrillic character has its own numeric value in CP1251, expressible as two hex digits.
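For illustration, Python's urllib.parse can reproduce the exact escaping from the question once it is told to use CP1251 (a minimal sketch; quote and unquote are standard library functions, the rest is mine):

from urllib.parse import quote, unquote

encoded = quote('привет', encoding='cp1251')   # one %XX escape per CP1251 byte
print(encoded)                                 # %EF%F0%E8%E2%E5%F2
print(unquote(encoded, encoding='cp1251'))     # привет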
A long, long time ago, before the world's scripts were supported, text files were all ASCII. Nowadays we have text in the world's scripts.
I would like to ask: if I open a text file in a hex editor, is there a way to tell whether its encoding is ASCII or UTF-8?
UTF-8 is backwards compatible with ASCII: an ASCII text file is also a UTF-8 text file.
If a file contains any byte whose hex value starts with 8 through F (i.e. any byte of 0x80 or above), it's not ASCII.
If a file is not ASCII, it may be UTF-8 if every byte that starts with C, D, E, or F is followed by one to three bytes that start with 8, 9, A, or B. If any of these bytes appears in any other context it's not UTF-8.
There are a few more requirements for valid UTF-8, but they are harder to glean with a hex editor. See https://en.m.wikipedia.org/wiki/UTF-8
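Those extra requirements are easier to delegate than to check by eye: Python's strict UTF-8 decoder enforces all of them, so a sketch of the heuristic above might look like this (classify is my own hypothetical helper, not a standard API):

def classify(data: bytes) -> str:
    # No byte with a high hex digit of 8 through F means plain ASCII.
    if all(b < 0x80 for b in data):
        return "ASCII"
    # A strict UTF-8 decode verifies the lead/continuation byte structure.
    try:
        data.decode("utf-8")
        return "UTF-8 (non-ASCII)"
    except UnicodeDecodeError:
        return "neither ASCII nor UTF-8"

print(classify(b"hello"))                    # ASCII
print(classify("привет".encode("utf-8")))    # UTF-8 (non-ASCII)
print(classify("привет".encode("cp1251")))   # neither ASCII nor UTF-8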
I've tried all kinds of combinations of encode/decode with the options 'surrogatepass' and 'surrogateescape', to no avail. I'm not sure what format this is in (it might even be a bug in AutoIt), but I know for a fact the information is in there, because at least one online UTF decoder got it right. On the online converter website, I specified the input file as UTF-8 and the output as UTF-16, and the output was as expected.
This issue is called mojibake, and your specific case occurs if you have a text stream that was encoded with UTF-8, and you decode it with Windows-1252 (which is a superset of ISO 8859-1).
So, as you have already found out, you have to decode this file with UTF-8, rather than with the default encoding of Python (which appears to be Windows-1252 in your case).
Let's see why these specific garbled characters appear in your example, namely:
É at the place of É
é at the place of é
è at the place of è
The following table summarises what's going on:

Character   UTF-8 bytes   Those bytes decoded as Windows-1252
É           C3 89         É (Ã + ‰)
é           C3 A9         é (Ã + ©)
è           C3 A8         è (Ã + ¨)
All of É, é, and è are non-ASCII characters, and UTF-8 encodes each of them as a 2-byte sequence.
For example, the UTF-8 code for É is:
11000011 10001001
On the other hand, Windows-1252 is an 8-bit encoding, that is, it encodes every character of its character set to 8 bits, i.e. one byte.
So, if you now decode the bit sequence 11000011 10001001 with Windows-1252, then Windows-1252 interprets this as two 1-byte codes, each representing a separate character, rather than a 2-byte code representing a single character:
The first byte 11000011 (C3 in hexadecimal) happens to be the Windows-1252 code of the character à (Unicode code point U+00C3).
The second byte 10001001 (89 in hexadecimal) happens to be the Windows-1252 code of the character ‰ (Unicode code point U+2030).
You can look up these mappings in the Windows-1252 code chart.
So, that's why your decoding renders É instead of É. The same goes for the other non-ASCII characters, é and è.
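You can reproduce the effect, and its fix, in a couple of lines. A minimal Python sketch (the sample word is hypothetical):

text = "Éléphant"
garbled = text.encode("utf-8").decode("cp1252")
print(garbled)   # Éléphant -- each 2-byte UTF-8 sequence shows up as two Windows-1252 characters

# Reversing the mistake recovers the original text:
print(garbled.encode("cp1252").decode("utf-8"))   # Éléphant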
My issue was with reading the file. I solved it by specifying encoding='utf-8' in the options for open():
open(filePath, 'r', encoding='utf-8')
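If you are curious what default your platform was using instead, Python exposes it (assuming Python 3, where open() falls back to the locale's preferred encoding):

import locale
print(locale.getpreferredencoding(False))   # e.g. 'cp1252' on many Western-locale Windows systems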
How do I get the REAL length of a string with Cyrillic symbols in Lua?
If I use string.len("HELLO.") I get 6.
But with string.len("ПРИВЕТ") I get 12 (same with the # operator).
So the number of symbols didn't change, but we get different numbers...
That's because Cyrillic symbols take two bytes each, while English letters, digits, and so on take one.
I want to know how to get the right, real length of the string (6 in both samples).
Can anyone help?
string.len and # count bytes, not characters.
In Lua 5.3+, use utf8.len.
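For comparison, Python 3 makes the same byte/character distinction explicit, which may help illustrate what utf8.len fixes: len of a str counts code points, while len of its UTF-8 bytes matches what Lua reports:

s = "ПРИВЕТ"
print(len(s))                  # 6  -- code points, like Lua's utf8.len
print(len(s.encode("utf-8")))  # 12 -- raw bytes, like Lua's string.len and #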
'{FileTitle}' === '{FileTitle}'
// false
There's a space between the first string's last e and the }.
'{FileTitle}'.length
// 12
'{FileTitle}'.length
// 11
There is a Unicode character with code 8203 (U+200B) between those two characters. It is a zero-width space. Have a look at the corresponding Wikipedia article for more info.
This is a great example of a sometimes nasty problem :-)
If I copy your code into TextWrangler, I see the space. If I choose "Hex Dump", I see the hex bytes 0B 20. Considering the little-endian context (thanks to @axiac), this means the character 0x200B, decimal 8203.
For information about specific Unicode characters, use this: http://unicode-table.com/de/search/?q=8203 You'll see the description "Zero Width Space".
As for how this character got into your code, one can only guess. Option one: you typed it in your editor unwittingly with some key combination. Option two: you copied it over from a rich-text document as a stowaway. Option three: it got there through some botched multibyte string operation.
A related problem is U+00A0 (byte 0xA0 in Latin-1 and Windows-1252), the no-break space. It cannot be distinguished from a normal space by eye, but sometimes causes compiler syntax errors that are hard to track down.
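A quick way to unmask such stowaways, sketched here in Python (the \u200b escape recreates the situation explicitly):

s = "{FileTitle\u200b}"   # zero-width space inserted on purpose for the demo
print(len(s))             # 12 -- one more than the 11 visible characters
print(ascii(s))           # '{FileTitle\u200b}' -- ascii() makes every non-ASCII character visible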
This one has no space:
'{FileTitle}' === '{FileTitle}'
You just used a different encoding.
I need to get the first char of a text variable. I achieve this with one of the following simple methods:
string.sub(someText,1,1)
or
someText:sub(1,1)
If I do the following, I expect to get 'ñ' as the first letter. However, the result of either of the sub methods is 'Ã'
local someText = 'ñññññññ'
print('Test whole: '..someText)
print('first char: '..someText:sub(1,1))
print('first char with .sub: '..string.sub(someText,1,1))
Here are the results from the console:
2014-03-02 09:08:47.959 Corona Simulator[1701:507] Test whole: ñññññññ
2014-03-02 09:08:47.960 Corona Simulator[1701:507] first char: Ã
2014-03-02 09:08:47.960 Corona Simulator[1701:507] first char with .sub: Ã
It seems like the string.sub() function is encoding the returned value in UTF-8. Just for kicks I tried using the utf8_decode() function that's provided by Corona SDK. It was not successful. The simulator indicated that the function expected a number but got nil instead.
I also searched the web to see if anyone else had run into this issue. I found that there is a fair amount of discussion of Lua, Corona, Unicode, and UTF-8, but I did not come across anything that addresses this specific problem.
Lua strings are 8-bit clean, which means a Lua string is a stream of bytes. The UTF-8 character ñ is encoded as the two bytes C3 B1, but someText:sub(1,1) returns only the first byte, C3, which renders as à when displayed as Latin-1/Windows-1252.
In UTF-8, all characters in the ASCII range have the same representation as in ASCII, that is, a single byte smaller than 128. All other code points are encoded as a sequence of bytes where the first byte is in the range 194-244 and the continuation bytes are in the range 128-191.
Because of this, you can use the pattern ".[\128-\191]*" to match a single UTF-8 code point (not grapheme):
for c in "ñññññññ":gmatch(".[\128-\191]*") do -- pretend the first string is in NFC
print(c)
end
Output:
ñ
ñ
ñ
ñ
ñ
ñ
ñ
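The same byte-range idea can be written as a regex over the raw bytes; here is an illustrative Python analogue of the Lua pattern (not a Corona API):

import re

data = "ñññññññ".encode("utf-8")
# One lead byte (ASCII or 194-244) followed by any continuation bytes (128-191).
for match in re.finditer(rb"[\x00-\x7f\xc2-\xf4][\x80-\xbf]*", data):
    print(match.group().decode("utf-8"))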
Regarding the character set used:
Just know what requirements you bake into your own code, and make sure those are actually satisfied.
There are various typical requirements:
ASCII-compatible (i.e. any byte < 128 represents an ASCII character and all ASCII characters are represented as themselves)
Fixed-Size vs Variable-Width (maybe self-synchronizing) Character Set
No embedded 0-bytes
Write your code so that it relies on as few of those requirements as possible, and document the ones it does rely on.
Regarding "match a single UTF-8 character": be sure what you mean by "character". Is it a glyph or a code point? AFAIK you need full Unicode tables for glyph matching. Do you actually have to get to that level at all?