Lua string length (Cyrillic in UTF-8)

How do I get the real length of a string with Cyrillic symbols in Lua?
If I use string.len("HELLO."), I get 6.
But with string.len("ПРИВЕТ") I get 12 (same with the "#" operator).
So the number of symbols didn't change, but we get different numbers...
That's because Cyrillic symbols take two bytes each, while English letters, digits, etc. take one.
I want to know how to get the correct length of the string (6 in both examples).
Can anyone help?

string.len and # count bytes, not chars.
In Lua 5.3+, use utf8.len.
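As a minimal sketch of the difference (the helper name utf8_length is made up for illustration, and the source file is assumed to be saved as UTF-8), the following uses utf8.len where it exists and otherwise counts every byte that is not a UTF-8 continuation byte:

```lua
-- Hypothetical helper: number of UTF-8 code points in s.
-- Assumes s is valid UTF-8 and this file is saved as UTF-8.
local function utf8_length(s)
  -- Lua 5.3+ ships the utf8 library; utf8.len returns nil for invalid UTF-8.
  if utf8 and utf8.len then
    return utf8.len(s)
  end
  -- Fallback for Lua 5.1/5.2/LuaJIT: count the bytes that are NOT
  -- continuation bytes (128-191); each such byte starts one code point.
  local _, count = s:gsub("[^\128-\191]", "")
  return count
end

print(#"ПРИВЕТ")              -- 12 (bytes)
print(utf8_length("ПРИВЕТ"))  -- 6 (code points)
print(utf8_length("HELLO."))  -- 6
```

The fallback does not validate its input, so it is only meaningful for strings that are already well-formed UTF-8.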

Related

How to get unicode of characters from 55296 to 56319 in Excel

I generated a list of letters in Excel, from character codes 1 to 66535.
I am trying to get the Unicode values back using the function "UNICODE". However, Excel returns #VALUE! for character codes from 55296 to 56319.
Please advise if there is any other function that can return the proper Unicode values.
Thank you.
The range you are listing is a special range in Unicode: surrogates.
So they do have Unicode code points, but the problem is that you cannot have them on their own in a text. Windows uses UCS-2/UTF-16 as its internal encoding, and to represent code points above 65535 it uses two surrogates: one in the range 0xD800-0xDBFF (high surrogate) followed by one in the range 0xDC00-0xDFFF (low surrogate). By combining these two, you can reach all Unicode code points.
Because of that, you should never have a single surrogate or a mismatched one (e.g. a high surrogate not followed by a low surrogate, or a low surrogate not preceded by a high surrogate).
So just skip such codes. Or better, use them correctly in pairs to get characters above 65535.
Note: you cannot represent every Unicode character with just one code point. Many characters require combining several code points (there is a whole category of "combining characters" in Unicode). E.g. the zero with an oblique line is rendered with two code points: the normal zero and a variation selector. Also, precomposed accented characters are very limited (and often with just one accent per character). And that is without going into more complex scripts.
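The surrogate arithmetic can be sketched in a few lines. This is only an illustration written in Lua (not an Excel formula), and the helper names combine_surrogates and is_surrogate are invented for the example:

```lua
-- Hypothetical helper: combine a UTF-16 surrogate pair into one code point.
local function combine_surrogates(high, low)
  assert(high >= 0xD800 and high <= 0xDBFF, "not a high surrogate")
  assert(low >= 0xDC00 and low <= 0xDFFF, "not a low surrogate")
  -- Each surrogate carries 10 bits; the pair addresses U+10000 and above.
  return 0x10000 + (high - 0xD800) * 0x400 + (low - 0xDC00)
end

-- Hypothetical helper: true for code units that cannot stand alone in text.
local function is_surrogate(code)
  return code >= 0xD800 and code <= 0xDFFF
end

print(combine_surrogates(0xD83D, 0xDE00))  -- 128512, i.e. U+1F600
print(is_surrogate(55296))                 -- true
print(is_surrogate(65))                    -- false
```

A loop over codes 1 to 65535 could call is_surrogate to skip the range that has no standalone character.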

What is \0 in a string?

I am aware of string escape sequences, which can be used to insert special characters. For example, there is the \u---- escape, where each - is a hexadecimal digit, which can be used to insert any character by its UTF-16 code. The exception is emoji, which take up two "characters" that aren't really characters, because the emoji are actually just UTF-32 characters. And then there are the emoji that are a pictograph followed by U+FE0F. Anyway, what is the function of \0 in a string?
I have tried searching on Google, Stack Overflow and even W3Schools' JavaScript String lesson and could not find anything.
This is the escape sequence for the null character. More details: https://en.wikipedia.org/wiki/Null_character

Literal Strings [Lua 5.1]

So I started to learn Lua (5.1) and I saw that thing called literal strings. I have no idea what these do. The manual says \a is a bell, but when I type
print('hello\athere')
the IDE prints a weird square with 'bel' written on it.
So if someone could help me and explain every one of them (the literal strings), that would be really helpful.
P.S. I use Sublime Text 3.
Only ASCII characters between 0x20 and 0x7E are printable. How other characters, including '\a' and '\b', are displayed is up to the implementation.
'\a', ASCII 7 (BEL), is designed to raise an alert. A typical terminal makes an audible or visible alert when it outputs '\a'. Your IDE chose to show a placeholder instead of alerting. That's OK, since it's up to the implementation.
Such sequences are called "escape sequences", and are found in many different languages. They are used to encode non-printable characters such as newlines in literal (hardcoded) strings.
Lua supports the following escape sequences:
\a: Bell
\b: Backspace
\f: Form feed
\n: Newline
\r: Carriage return
\t: Tab
\v: Vertical tab
\\: Backslash
\": Double quote
\': Single quote
\ddd: Decimal value (ddd is up to three decimal digits, not octal as in C)
\xNN: Hex value (Lua 5.2+/LuaJIT, NN is two hex digits)
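A quick way to see what these escapes produce is to print their byte values with string.byte. The sketch below sticks to escapes that exist in Lua 5.1; note that Lua reads the numeric escape as decimal, so "\65" is "A":

```lua
-- string.byte returns the numeric code of the first character.
print(string.byte("\a"))  -- 7  (BEL)
print(string.byte("\b"))  -- 8  (backspace)
print(string.byte("\t"))  -- 9  (tab)
print(string.byte("\n"))  -- 10 (newline)
print(string.byte("\v"))  -- 11 (vertical tab)
print(string.byte("\f"))  -- 12 (form feed)
print(string.byte("\r"))  -- 13 (carriage return)
print("\65\066\67")       -- ABC: the digits are decimal character codes
print(#"a\0b")            -- 3: \0 embeds a zero byte in the string
```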
A literal is nothing more than a value written directly in the code, e.g. 'some text'.
'\a' is something different: a special "char" that is used to output a sound (it used the PC speaker some aeons ago).

string.sub issue with non-English characters

I need to get the first char of a text variable. I achieve this with one of the following simple methods:
string.sub(someText,1,1)
or
someText:sub(1,1)
If I do the following, I expect to get 'ñ' as the first letter. However, the result of either of the sub methods is 'Ã'
local someText = 'ñññññññ'
print('Test whole: '..someText)
print('first char: '..someText:sub(1,1))
print('first char with .sub: '..string.sub(someText,1,1))
Here are the results from the console:
2014-03-02 09:08:47.959 Corona Simulator[1701:507] Test whole: ñññññññ
2014-03-02 09:08:47.960 Corona Simulator[1701:507] first char: Ã
2014-03-02 09:08:47.960 Corona Simulator[1701:507] first char with .sub: Ã
It seems like the string.sub() function is encoding the returned value in UTF-8. Just for kicks I tried using the utf8_decode() function that's provided by Corona SDK. It was not successful. The simulator indicated that the function expected a number but got nil instead.
I also searched the web to see if anyone else had run into this issue. I found that there is a fair amount of discussion on Lua, Corona, Unicode and UTF-8, but I did not come across anything that addresses this specific problem.
Lua strings are 8-bit clean, which means strings in Lua are a stream of bytes. The UTF-8 character ñ takes two bytes (0xC3 0xB1), but someText:sub(1,1) returns only the first byte, 0xC3, which the console displays as 'Ã'.
In UTF-8, all characters in the ASCII range have the same representation as in ASCII, that is, a single byte smaller than 128. Every other code point is encoded as a sequence of bytes where the first byte is in the range 194-244 and the continuation bytes are in the range 128-191.
Because of this, you can use the pattern ".[\128-\191]*" to match a single UTF-8 CodePoint (not Grapheme):
for c in "ñññññññ":gmatch(".[\128-\191]*") do -- pretend the first string is in NFC
    print(c)
end
Output:
ñ
ñ
ñ
ñ
ñ
ñ
ñ
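Building on that pattern, a replacement for sub(1,1) could look like the sketch below. The helper name utf8_first is made up for illustration, and it assumes the input is valid UTF-8 and that the source file is saved as UTF-8:

```lua
-- Hypothetical helper: return the first n UTF-8 code points of s.
-- Assumes s is valid UTF-8; it does not handle combining sequences.
local function utf8_first(s, n)
  local parts = {}
  for c in s:gmatch(".[\128-\191]*") do
    if #parts == n then break end
    parts[#parts + 1] = c
  end
  return table.concat(parts)
end

local someText = "ñññññññ"
print(#someText:sub(1, 1))      -- 1: only the lead byte of the first ñ
print(someText:sub(1, 2))       -- ñ: both bytes of the first character
print(utf8_first(someText, 1))  -- ñ
print(utf8_first("añb", 2))     -- añ
```

In Lua 5.3+ the same result can be had with utf8.offset to find the byte position where the next character starts, which avoids building a table.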
Regarding the used character set:
Just know what requirements you bake into your own code, and make sure those are actually satisfied.
There are various typical requirements:
ASCII-compatible (aka any byte < 128 represents an ASCII character and all ASCII characters are represented as themselves)
Fixed-Size vs Variable-Width (maybe self-synchronizing) Character Set
No embedded 0-bytes
Write your code so that it depends on as few of those requirements as possible, and document the ones it does depend on.
"Match a single UTF-8 character": be sure what you mean by a UTF-8 character. Is it a glyph or a code point? AFAIK you need full Unicode tables for glyph matching. Do you actually have to get to this level at all?

url-encoding in browser address bar

When I put some non-alphanumeric symbols in the browser address bar, they get URL-encoded. For example, http://ru2.php.net/manual-lookup.php?pattern=привет turns into http://ru2.php.net/manual-lookup.php?pattern=%EF%F0%E8%E2%E5%F2.
The question is: what do those two percent-prefixed hex digits mean?
They are bytes of the Windows-1251 encoding of Cyrillic. Since there are only six of them, they can't be UTF-8, since it takes 12 bytes of UTF-8 for 6 chars of Cyrillic.
The code chart for CP1251 can be found here: http://en.wikipedia.org/wiki/Windows-1251.
Just as %20 stands for a space (20 is its hex code), each of the Cyrillic characters has its numeric value, expressible as two hex digits.
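To make the byte-to-hex mapping concrete, here is a small sketch in Lua. The helper percent_encode_all is hypothetical and encodes every byte, whereas a browser leaves letters and digits as they are; the byte values are written as decimal escapes so the example does not depend on the file's encoding:

```lua
-- Hypothetical helper: write every byte of s as %XX (uppercase hex).
local function percent_encode_all(s)
  -- The extra parentheses drop gsub's second return value (the match count).
  return (s:gsub(".", function(c)
    return string.format("%%%02X", string.byte(c))
  end))
end

-- "привет" as Windows-1251 bytes (one byte per letter).
local cp1251 = "\239\240\232\226\229\242"
-- The same word as UTF-8 bytes (two bytes per letter).
local utf8_text = "\208\191\209\128\208\184\208\178\208\181\209\130"

print(percent_encode_all(cp1251))  -- %EF%F0%E8%E2%E5%F2
print(#cp1251, #utf8_text)         -- 6 and 12
print(percent_encode_all(" "))     -- %20
```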
