How to isolate non-English words separated by spaces in Lua?

I have this string
"Hello there, this is some line-aa."
how to slice it into an array like this?
Hello
there,
this
is
some
line-aa.
This is what I have tried so far:
function sliceSpaces(arg)
  local list = {}
  for k in arg:gmatch("%w+") do
    print(k)
    table.insert(list, k)
  end
  return list
end
local sentence = "مرحبا يا اخوتي"
print("sliceSpaces")
print(sliceSpaces(sentence))
This code works for English text but not for Arabic. How can I make it work for Arabic too?

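Lua strings are sequences of bytes, not Unicode characters. The pattern %w matches alphanumeric characters as classified by the current C locale, which by default means ASCII letters and digits only, so the bytes of UTF-8 encoded Arabic text never match.
Instead, use %S+ to match runs of non-whitespace characters: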
for k in arg:gmatch("%S+") do
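The same byte-oriented behaviour can be reproduced outside Lua. Below is a minimal sketch in Python (not Lua), assuming the text is UTF-8 encoded. Python's bytes patterns treat \w as ASCII-only, much like %w in Lua's default locale, while \S matches any non-whitespace byte. The helper name split_on_whitespace is made up for illustration.

```python
import re

def split_on_whitespace(data: bytes) -> list:
    """Split UTF-8 bytes into runs of non-whitespace bytes, decoded to str."""
    return [chunk.decode("utf-8") for chunk in re.findall(rb"\S+", data)]

sentence = "مرحبا يا اخوتي".encode("utf-8")

# An ASCII-only word class finds nothing in Arabic UTF-8 bytes.
ascii_words = re.findall(rb"\w+", sentence)

# Matching on non-whitespace keeps each multi-byte word intact.
words = split_on_whitespace(sentence)
print(ascii_words)
print(words)
```

Because only ASCII whitespace acts as a separator, every multi-byte character stays inside its word.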


How to convert accented strings to regular strings in Erlang?

I want to convert some city names with accented characters to normal strings. For example:
<<"Sosúa">> to <<"Sosua">>
<<"Luperón">> to <<"Luperon">>
Any leads on how to do this?
Apply Unicode Canonical Decomposition (NFD) with unicode:characters_to_nfd_binary/1 to rewrite characters like ó as two code points: o (U+006F) followed by a separate combining acute accent (U+0301).
Use re:replace/4 with the regexp \p{Mn} to remove all such combining diacritics (non-spacing marks), like U+0301 above.
Optional: apply Unicode Canonical Composition (NFC) to recompose any remaining decomposed code points.
String = "Luperón",
{ok, Re} = re:compile("\\p{Mn}", [unicode]),
Output = unicode:characters_to_nfc_binary(
    re:replace(
        unicode:characters_to_nfd_binary(String),
        Re,
        "",
        [global]
    )
),
Output.
Equivalent for Elixir, for reference and information (as it is also based on Erlang's unicode module):
string = "Luperón"
output =
  Regex.replace(~R<\p{Mn}>u, string |> :unicode.characters_to_nfd_binary(), "")
  |> :unicode.characters_to_nfc_binary()
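For comparison, the same three steps (decompose, drop non-spacing marks, recompose) can be sketched in Python with the standard unicodedata module. This is an illustration of the technique, not part of the Erlang answer, and the function name strip_accents is hypothetical.

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Remove combining diacritics (general category Mn) from text."""
    # Step 1: NFD splits e.g. 'ó' into 'o' + U+0301.
    decomposed = unicodedata.normalize("NFD", text)
    # Step 2: drop every non-spacing mark.
    without_marks = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    # Step 3 (optional): recompose whatever is left.
    return unicodedata.normalize("NFC", without_marks)

print(strip_accents("Sosúa"))
print(strip_accents("Luperón"))
```

Letters that are not base-plus-mark compositions (for example ø or ß) have no decomposition and pass through unchanged.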

Python: lower() method generates wrong letter in a string

text = 'ÇEKİM GÜNÜ KALİTESİNİ DÜZENLERLSE'
sentence = text.split(' ')
print(sentence)
if "ÇEKİM" in sentence:
    print("yes-1")
print(" ")
sentence_ = text.lower().split(' ')
print(sentence_)
if "çekim" in sentence_:
    print("yes-2")
>> output:
['ÇEKİM', 'GÜNÜ', 'KALİTESİNİ', 'DÜZENLERLSE']
yes-1
['çeki̇m', 'günü', 'kali̇tesi̇ni̇', 'düzenlerlse']
I have a problem with a string. I split a sentence into a list of words, and when I check for the word "ÇEKİM" in that list, I find it (prints yes-1). However, when I lowercase the sentence first and search for "çekim", I cannot find it in the list, because lower() changes the letter "i". What is the reason (encoding/decoding)? Why does the lower() method change the string beyond lowercasing it? By the way, it is a Turkish word. Upper: ÇEKİM, lower: çekim.
Turkish i and English i are treated differently. The capital of Turkish i is İ, while the capital of English i is I. To differentiate them, Unicode has rules for converting to lower and upper case. Python's lower() is not locale-aware, so İ lowercases to i followed by a combining dot above. Converting that back to upper case leaves the characters in decomposed form, so a proper comparison needs to normalize the strings to a standard form first; you can't compare a decomposed form to a composed form. Note the differences in the strings below:
#coding:utf8
import unicodedata as ud
def dump_names(s):
    print('string:',s)
    for c in s:
        print(f'U+{ord(c):04X} {ud.name(c)}')
turkish_i = 'İ'
dump_names(turkish_i)
dump_names(turkish_i.lower())
dump_names(turkish_i.lower().upper())
dump_names(ud.normalize('NFC',turkish_i.lower().upper()))
string: İ
U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE
string: i̇
U+0069 LATIN SMALL LETTER I
U+0307 COMBINING DOT ABOVE
string: İ
U+0049 LATIN CAPITAL LETTER I
U+0307 COMBINING DOT ABOVE
string: İ
U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE
Some terminals also have display issues. My system displays 'çeki̇m' with the dot over the m, not the i. For example, in the Chrome browser, the following displays correctly:
>>> s = 'ÇEKİM'
>>> s.lower()
'çeki̇m'
But one of my editors renders the combining dot over the m instead of the i. Something like this appears to be what the OP is seeing. The following comparison will work:
if "çeki\N{COMBINING DOT ABOVE}m" in sentence_:
    print("yes-2")
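If the goal is to get a plain "çekim" back, one common workaround is to map the two Turkish-specific capitals by hand before calling lower(). The following is a minimal sketch under the assumption that the whole text is Turkish (an English "I" would also become "ı"); the names TURKISH_LOWER and turkish_lower are hypothetical.

```python
# Map the two Turkish-specific capitals before the generic lower() runs:
# İ (U+0130) -> i, and I (U+0049) -> ı (U+0131).
TURKISH_LOWER = {0x0130: "i", 0x0049: "\u0131"}

def turkish_lower(text: str) -> str:
    """Lowercase text using Turkish rules for the dotted and dotless i."""
    return text.translate(TURKISH_LOWER).lower()

text = "ÇEKİM GÜNÜ KALİTESİNİ DÜZENLERLSE"
words = turkish_lower(text).split(" ")
print(words)
```

With this, the lowercased İ no longer carries a combining dot, so a plain membership test against "çekim" can succeed without spelling out the combining character.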

How to find all occurrences of an Arabic character with its diacritics?

I am trying to find the occurrences of an Arabic character with its harakat in a string, such as "رَّ" in "بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ".
Arabic characters can take harakat. For example, "ر" is the base Arabic character, but with harakat it can look like this: "رَّ". I am using Python 3 to count the occurrences of the character with a specific harakat but could not do it. I have tried a for loop and tried converting the string to Unicode, without success.
str = "مرة رجل حكيم قال بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ"
i=0
for s in str:
    if s == "رَّ":
        i = i + 1
print(i)
Expected output is 2 but 0 is what I get.
len("رَّ") returns 3, which means the glyph is represented by three code points (the base letter plus two combining marks). Your loop checks a single code point at a time and so never finds a match.
You need to be looking for substrings, which is exactly what .count() is for.
i = str.count('رَّ')
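One caveat: count() compares code points exactly, so the combining marks must appear in the same order in the search string and in the text. A minimal sketch, assuming the marks may have been typed in either order, normalizes both sides to NFC first, which puts combining marks into canonical order. The helper name count_with_marks is hypothetical.

```python
import unicodedata

def count_with_marks(text: str, target: str) -> int:
    """Count target in text after putting combining marks in canonical order."""
    text_nfc = unicodedata.normalize("NFC", text)
    target_nfc = unicodedata.normalize("NFC", target)
    return text_nfc.count(target_nfc)

# Reh followed by shadda then fatha, and reh followed by fatha then shadda.
shadda_first = "\u0631\u0651\u064e"
fatha_first = "\u0631\u064e\u0651"
sample = shadda_first + " " + fatha_first

# Plain count() only matches the identical code point ordering.
exact = sample.count(shadda_first)
# After normalization both orderings are counted.
normalized = count_with_marks(sample, shadda_first)
print(len(shadda_first), exact, normalized)
```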

How to split sentence including punctuation

If I had the sentence sentence = 'There is light!' and I was to split this sentence with mysentence = sentence.split(), how would I have the output as 'There, is, light, !' of print(mysentence)? What I specifically wanted to do was split the sentence including all punctuation, or just a list of selected punctuation. I got some code but the program is recognizing the characters in the word, not the word.
out = "".join(c for c in punct1 if c not in ('!','.',':'))
out2 = "".join(c for c in punct2 if c not in ('!','.',':'))
out3 = "".join(c for c in punct3 if c not in ('!','.',':'))
How would I make this work on whole words rather than on each character in a word? The output of "Hello how are you?" should become "Hello, how, are, you, ?". Is there any way of doing this?
You may use the \w+|[^\w\s] regex with re.findall to get those chunks:
\w+|[^\w\s]
Pattern details:
\w+ - 1 or more word chars (letters, digits or underscores)
| - or
[^\w\s] - 1 char other than word / whitespace
Python demo:
import re
p = re.compile(r'\w+|[^\w\s]')
s = "There is light!"
print(p.findall(s))
NOTE: If you want to treat an underscore as punctuation, you need to use something like [a-zA-Z0-9]+|[^A-Za-z0-9\s] pattern.
UPDATE (after comments)
To make sure you match an apostrophe as part of the words, add (?:'\w+)* or (?:'\w+)? to the \w+ in the pattern above:
import re
p = re.compile(r"\w+(?:'\w+)*|[^\w\s]")
s = "There is light!? I'm a human"
print(p.findall(s))
The (?:'\w+)* matches zero or more occurrences (with ? instead of *, it matches one or zero) of an apostrophe followed by one or more word characters.
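The question also asks about splitting on just a selected list of punctuation. A minimal sketch of that variant, assuming a non-empty string of punctuation characters, builds the character class from the selection with re.escape. The function name split_keeping and its default punctuation set are hypothetical.

```python
import re

def split_keeping(sentence: str, punctuation: str = "!.:?") -> list:
    """Split on whitespace, emitting each selected punctuation mark as its own token."""
    selected = re.escape(punctuation)
    # Either a run of characters that are neither whitespace nor selected
    # punctuation, or a single selected punctuation character.
    pattern = re.compile(r"[^\s{0}]+|[{0}]".format(selected))
    return pattern.findall(sentence)

print(split_keeping("Hello how are you?"))
print(split_keeping("There is light!"))
```

Characters outside the selection, such as the apostrophe in "I'm", stay attached to their word.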

AS3 - "\u2605" NOT the same as "\\u"+"2605"?

I am trying to make a text field where people type a Unicode code point without the backslash, and I want to add the backslash after they have typed it. So the user types u2605 and the code converts it to "\u2605"; I then convert this to a Unicode character and insert it in the textflow.
My code:
this works:
span.text = publicFunctions.htmlUnescape(he.encode("\u2605"))
this doesn't work:
span.text = publicFunctions.htmlUnescape(he.encode("\\u"+"2605"))
how to make a string that acts as a unicode string?
Tried all sorts of things: escape(unescape()), converting to a number, "\\u", "\u"... nothing helps.
trace("\u2605" == "\\u"+"2605") ... will return false. So will
trace("\u2605" == "\u"+"2605")
"\u2605" is a string with a single character, the character with the code point U+2605, while "\\u" + "2605" is a string with 6 characters (the backslash, the u and the four hex digits).
If you want to construct a unicode character from just the four digits, you should be able to use String.fromCharCode. The thing is just that the escape sequence uses a hexadecimal number, while the method obviously takes a decimal number. So if the user enters a hexadecimal string, you will have to convert that first:
trace(String.fromCharCode(parseInt('2605', 16)) == '\u2605');
That's an interesting issue! I don't think you can concatenate a string literal and achieve what you're trying to do. The relevant character escaping happens when the string literal is originally formed, which means that you need the whole sequence together in the first place.
But you should be able to take the user-supplied number and dynamically generate a Unicode string with String.fromCharCode(...).
http://help.adobe.com/en_US/FlashPlatform/reference/actionscript/3/String.html#fromCharCode()
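The same idea, parsing the hex digits and building the character from its code point, can be sketched in Python (not ActionScript), where chr plays the role of String.fromCharCode. The function name char_from_user_input is hypothetical, and the sketch assumes the input is an optional leading "u" followed by valid hex digits.

```python
def char_from_user_input(code: str) -> str:
    """Turn input such as 'u2605' or '2605' into the character U+2605."""
    hex_digits = code.strip().lstrip("uU")
    # The escape sequence is hexadecimal, so parse with base 16.
    return chr(int(hex_digits, 16))

star = char_from_user_input("u2605")
# Concatenating a literal backslash-u with digits is just six characters.
six_chars = "\\u" + "2605"
print(star, len(six_chars))
```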
