I need to get the first char of a text variable. I achieve this with one of the following simple methods:
string.sub(someText,1,1)
or
someText:sub(1,1)
If I do the following, I expect to get 'ñ' as the first letter. However, the result of either of the sub methods is 'Ã'
local someText = 'ñññññññ'
print('Test whole: '..someText)
print('first char: '..someText:sub(1,1))
print('first char with .sub: '..string.sub(someText,1,1))
Here are the results from the console:
2014-03-02 09:08:47.959 Corona Simulator[1701:507] Test whole: ñññññññ
2014-03-02 09:08:47.960 Corona Simulator[1701:507] first char: Ã
2014-03-02 09:08:47.960 Corona Simulator[1701:507] first char with .sub: Ã
It seems like the string.sub() function is encoding the returned value in UTF-8. Just for kicks I tried using the utf8_decode() function that's provided by Corona SDK. It was not successful. The simulator indicated that the function expected a number but got nil instead.
I also searched the web to see if anyone else had run into this issue. There is a fair amount of discussion on Lua, Corona, Unicode and UTF-8, but I did not come across anything that addresses this specific problem.
Lua strings are 8-bit clean, which means strings in Lua are streams of bytes. The UTF-8 encoding of the character ñ is two bytes long, but someText:sub(1,1) returns only the first of those bytes.
In UTF-8, all characters in the ASCII range have the same representation as in ASCII, that is, a single byte smaller than 128. All other code points are encoded as a sequence of bytes in which the first byte is in the range 194-244 and the continuation bytes are in the range 128-191.
Because of this, you can use the pattern ".[\128-\191]*" to match a single UTF-8 code point (not grapheme); in particular, someText:match(".[\128-\191]*") returns the whole first code point instead of the first byte:
for c in ("ñññññññ"):gmatch(".[\128-\191]*") do -- pretend the string is in NFC
  print(c)
end
Output:
ñ
ñ
ñ
ñ
ñ
ñ
ñ
Regarding the character set used:
Just know what requirements you bake into your own code, and make sure those are actually satisfied.
There are various typical requirements:
ASCII-compatible (aka any byte < 128 represents an ASCII character and all ASCII characters are represented as themselves)
Fixed-Size vs Variable-Width (maybe self-synchronizing) Character Set
No embedded 0-bytes
Write your code so it relies on as few of those requirements as possible, and document the ones it does rely on.
Regarding matching a single UTF-8 character: be sure what you mean by "UTF-8 character". Is it a glyph or a code point? AFAIK you need full Unicode tables for glyph matching. Do you actually have to go to this level at all? (See the sketch below.)
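A quick way to see the code-point-versus-glyph distinction, sketched here in Python for brevity (the strings are illustrative only):

import unicodedata

s = "n\u0303"                                 # 'n' + U+0303 COMBINING TILDE; renders as the single glyph ñ
print(len(s))                                 # 2 code points
print(len(s.encode("utf-8")))                 # 3 bytes in UTF-8
print(len(unicodedata.normalize("NFC", s)))   # 1 code point (U+00F1) after composition

The gmatch pattern above would report two matches for the decomposed form, which is why the example "pretends" the string is in NFC.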
Related
I have a large Excel file which I read with pandas.read_excel(). I guess it was compiled from various sources and contains some badly encoded characters.
For instance, a string foo which should be für is printed as fãâ¼r and has foo.__repr__()=='fã\x83â¼r'.
Some other string bar which should be française is printed as franãâ§aise with bar.__repr__()=='franã\x83â§aise'.
And another one baz which should also be française is printed as franãâaise with baz.__repr__()=='franã\x83â\x87aise'.
Same goes for ñ with a __repr__ of ã\x83â\x91, and so on.
What would be the best way to sanitize this input? Or is there a way to avoid the problem altogether?
Also, is there a way to simply search for such characters in my data? A .contains('\x') fails as there is nothing after the unicode escape (SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: truncated \xXX escape)
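One common repair for this kind of double mojibake, sketched under the assumption that the UTF-8 bytes were mis-decoded as Latin-1 twice:

bad = "f\u00c3\x83\u00c2\u00bcr"                  # repr: 'fÃ\x83Â¼r'; should be 'für'
once = bad.encode("latin-1").decode("utf-8")      # 'fÃ¼r' -- one layer of mojibake peeled off
fixed = once.encode("latin-1").decode("utf-8")    # 'für'
print(fixed)

The ftfy library (ftfy.fix_text) automates this kind of guesswork; and to search for suspects, a regex character class such as [\x80-\xff] with Series.str.contains works where a bare '\x' escape does not.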
I generated a list of letters in Excel, from character codes 1 to 66535.
I am trying to get back the code point by using the function "UNICODE". However, Excel returns #VALUE! for character codes from 55296 to 56319.
Please advise if there is any other function that can return proper Unicode code points.
Thank you.
The range you are listing is a special range in Unicode: surrogates.
So they have Unicode code points, but the problem is that you cannot put a lone one in text: Windows uses UTF-16 (formerly UCS-2) as its internal encoding, so to represent code points above 65535 it uses two surrogates, one in the range 0xD800-0xDBFF (the high surrogate) and a second in the range 0xDC00-0xDFFF (the low surrogate). By combining these two, you can address all Unicode code points.
But you should never have a single surrogate (or a mismatched surrogate, e.g. a high surrogate not followed by a low surrogate, or a low surrogate not preceded by a high surrogate).
So just skip such codes, or better, use them in pairs to get the characters above 65535.
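A small sketch (in Python, whose strings expose code points directly) of how a surrogate pair combines into one code point:

high, low = 0xD83D, 0xDE00                        # the UTF-16 pair for U+1F600
cp = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
print(hex(cp), chr(cp))                           # 0x1f600 😀
# A lone surrogate is not a character: chr(0xD800).encode("utf-8") raises UnicodeEncodeError.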
Note: not every Unicode character can be written with a single code point; many characters require combining several code points (there is a whole category of "combining characters" in Unicode). E.g. the zero with an oblique line is rendered with two code points: the normal zero and a variant selector. Precomposed accented characters are also very limited (often just one accent per character), and that is without going into more complex scripts.
I am aware of string escape sequences, which can be used to insert special characters; for example there is the \u---- escape, where each - is a hexadecimal digit, which can be used to insert any character by its UTF-16 code. (Except emoji, which take up two of these "characters" that aren't really characters on their own, because emoji are single code points above U+FFFF; and except the emoji that are a pictograph followed by U+FE0F.) Anyway, what is the function of \0 in a string?
I have tried searching on Google and Stack Overflow and even W3Schools' JavaScript String lesson and could not find anything.
This is the escape sequence for the null character. More details: https://en.wikipedia.org/wiki/Null_character
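A quick illustration (in Python here, but the escape behaves the same way inside JavaScript string literals): the null character is an ordinary character in the string, not a terminator.

s = "a\0b"
print(len(s))       # 3 -- '\0' counts as one real character
print(s.encode())   # b'a\x00b'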
I've tried all kinds of combinations of encode/decode with the options 'surrogatepass' and 'surrogateescape', to no avail. I'm not sure what format this is in (it might even be a bug in AutoIt), but I know for a fact the information is in there, because at least one online UTF decoder got it right. On the online converter website, I specified the file as UTF-8 and the output as UTF-16, and the output was as expected.
This issue is called mojibake, and your specific case occurs if you have a text stream that was encoded with UTF-8, and you decode it with Windows-1252 (which is a superset of ISO 8859-1).
So, as you have already found out, you have to decode this file with UTF-8, rather than with the default encoding of Python (which appears to be Windows-1252 in your case).
Let's see why these specific garbled characters appear in your example, namely:
Ã‰ at the place of É
Ã© at the place of é
Ã¨ at the place of è
The following table summarises what's going on:

Character   UTF-8 bytes (hex)   Decoded as Windows-1252
É           C3 89               Ã‰
é           C3 A9               Ã©
è           C3 A8               Ã¨
All of É, é, and è are non-ASCII characters, and UTF-8 encodes each of them as a 2-byte sequence.
For example, the UTF-8 code for É is:
11000011 10001001
On the other hand, Windows-1252 is an 8-bit encoding, that is, it encodes every character of its character set to 8 bits, i.e. one byte.
So, if you now decode the bit sequence 11000011 10001001 with Windows-1252, then Windows-1252 interprets this as two 1-byte codes, each representing a separate character, rather than a 2-byte code representing a single character:
The first byte 11000011 (C3 in hexadecimal) happens to be the Windows-1252 code of the character Ã (Unicode code point U+00C3).
The second byte 10001001 (89 in hexadecimal) happens to be the Windows-1252 code of the character ‰ (Unicode code point U+2030).
You can look up these mappings in the Windows-1252 code page table.
So, that's why your decoding renders Ã‰ instead of É. Idem for the other non-ASCII characters é and è.
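The effect is easy to reproduce; a minimal sketch in Python:

print("É".encode("utf-8"))                    # b'\xc3\x89'
print("É".encode("utf-8").decode("cp1252"))   # 'Ã‰'
print("é".encode("utf-8").decode("cp1252"))   # 'Ã©'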
My issue was during the file reading. I solved it by specifying encoding='utf-8' in the options for open().
open(filePath, 'r', encoding='utf-8')
In Linux, I am converting UTF-8 to ISO-8859-1 file using the following command:
iconv -f UTF-8 -t ISO-8859-1//TRANSLIT input.txt > out.txt
After conversion, when I open the out.txt
¿Quién Gómez is translated to ¿Quien Gomez.
Why are é and ó and others not translated correctly?
There are (at least) two ways to represent the accented letter é in Unicode: as a single code point U+00E9, LATIN SMALL LETTER E WITH ACUTE, and as a two-character sequence e (U+0065) followed by U+0301, COMBINING ACUTE ACCENT.
Your input file uses the latter encoding, which iconv apparently is unable to translate to Latin-1 (ISO-8859-1). With the //TRANSLIT suffix, it passes through the unaccented e unmodified and drops the combining character.
You'll probably need to convert the input so it doesn't use combining characters, replacing the sequence U+0065 U+0301 by the single code point U+00E9 (two bytes in UTF-8, one byte in Latin-1). Either that, or arrange for whatever generates your input file to use that precomposed form in the first place.
So that's the problem; I don't currently know exactly how to correct it.
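One way to do that conversion, sketched in Python as an untested suggestion (input.txt is the file from the question; input_nfc.txt is a made-up output name): normalize the text to NFC, which replaces e + U+0301 with the precomposed U+00E9, then run iconv on the result.

import unicodedata

with open("input.txt", encoding="utf-8") as f:
    text = f.read()
with open("input_nfc.txt", "w", encoding="utf-8") as f:
    f.write(unicodedata.normalize("NFC", text))   # e + COMBINING ACUTE -> é (U+00E9)

After that, iconv -f UTF-8 -t ISO-8859-1 input_nfc.txt should find precomposed characters it can map; ICU's uconv tool can reportedly combine both steps with its -x any-nfc transform.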
Keith, you are right. I found the answer from Sergiusz Wolicki on the Oracle Community. Here I am quoting his answer word for word, for anybody who may have this problem. "The problem is that your data is stored in the Unicode decomposed form, which is legal but seldom used for Western European languages. 'é' is stored as 'e' (0x65 = U+0065) plus combining acute accent (0xcc, 0x81 = U+0301). Most simple conversion tools, including standard Oracle client/server conversion, do not account for this and do not convert a decomposed character into a pre-composed character from ISO 8859-1. They try to convert each of the two codes independently, yielding 'e' plus some replacement for the accent character, which does not exist in ISO 8859-1. You see the result correctly in SQL Developer because there is no conversion involved and SQL Developer rendering code is capable of combining the two codes into one character, as expected.
As 'é' and 'ó' have pre-composed forms available in both Unicode and ISO 8859-1, the workaround is to add the COMPOSE function to your query. Thus, set NLS_LANG as I advised previously and add COMPOSE around the column expressions in your query." Thank you very much, Keith