Remove non-ASCII characters in Octave - string

I'm trying to remove non-ASCII characters read from a data file using Octave, but I can't make it work. I tried getting the codes of these "weird" characters, and they seem to be random. An example string is this:
asdqwФЕДЕРАЛЬ234НОЕ234 АГЕНТСqewwqedasТВО ПasdsadО ОБРАasdasdЗОВАНИЮ
Госудаsadasdsagwfрственная акадеasdмия профессиональной п
Do you have any suggestions on how I can remove the non-ASCII characters from this string? Or better yet, how can I determine whether a given string contains non-ASCII characters?
Thanks in advance!

To remove all characters outside the ASCII range 0..127 decimal, use
a = "asdqwФЕДЕРАЛЬ234НОЕ234 АГЕНТСqewwqedasТВО ПasdsadО ОБРАasdasdЗОВАНИЮ Госудаsadasdsagwfрственная акадеasdмия профессиональной п";
a(! isascii (a)) = []
which gives
a = asdqw234234 qewwqedas asdsad asdasd sadasdsagwf asd
and if you just want to check whether there are any non-ASCII characters:
any (! isascii("foobar"))
ans = 0
any (! isascii("foobaröäüß"))
ans = 1

Related

How to find all occurrences of an Arabic character with its harakat?

I am trying to find the occurrences of an Arabic character together with its harakat in a string, such as "رَّ" in "بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ".
Arabic characters can take harakat: for example, "ر" is the bare Arabic character, but with harakat it can look like "رَّ". I am using Python 3 to find occurrences of the character with a specific harakat, but could not do it. I have tried a for loop and tried converting the string to Unicode, but neither worked.
str = "مرة رجل حكيم قال بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ"
i = 0
for s in str:
    if s == "رَّ":
        i = i + 1
print(i)
The expected output is 2, but I get 0.
len("رَّ") returns 3, which means the glyph is represented by three characters. Your loop checks a single character at a time and so never finds a match.
You need to be looking for substrings, which is exactly what .count() is for.
i = str.count('رَّ')
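For completeness, a minimal sketch of that fix using the question's own string (assuming the combining marks in the search string appear in the same order as they do in the text):
text = "مرة رجل حكيم قال بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ"
target = "رَّ"  # the base letter plus its combining harakat; len(target) == 3
print(text.count(target))  # counts substring occurrences; the question expects 2 here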

Confusion regarding UTF8 substring length

Can someone please help me deal with byte-order mark (BOM) bytes versus UTF8 characters in the first line of an XHTML file?
Using Python 3.5, I opened the XHTML file as UTF8 text:
inputTopicFile = open(inputFileName, "rt", encoding="utf8")
A hex editor shows that the first line of that UTF-8-encoded XHTML file begins with the three-byte UTF-8 BOM EF BB BF.
I wanted to remove the UTF-8 BOM, which I assumed occupied the first three character positions of the string. So I tried this:
firstLine = firstLine[3:]
That didn't work: the characters <? were no longer present at the start of the resulting line.
So I did this experiment:
for charPos in range(0, 3):
    print("charPos {0} == {1}".format(charPos, firstLine[charPos]))
Which printed:
charPos 0 ==
charPos 1 == <
charPos 2 == ?
I then added .encode to that loop as follows:
for charPos in range(0, 3):
    print("charPos {0} == {1}".format(charPos, firstLine[charPos].encode('utf8')))
Which gave me:
charPos 0 == b'\xef\xbb\xbf'
charPos 1 == b'<'
charPos 2 == b'?'
Evidently Python 3 somehow "knows" that the 3-byte BOM is a single unit of non-character data? Meaning that one cannot process the first three 8-bit bytes in the line as if they were UTF-8 characters?
At this point I know that I can get what I want by specifying firstLine = firstLine[1:], but it seems wrong to do it that way(?)
So what's the correct way to discard the first three BOM bytes in a UTF8 string on the way to working with only the UTF8 characters?
EDIT: The solution, per the comment made by Anthony Sottile, turned out to be as simple as using encoding="utf-8-sig" when I opened the source XHTML file:
inputTopicFile = open(inputFileName, "rt", encoding="utf-8-sig")
That strips out the BOM. Voila!
As you mentioned in your edit, you can open the file with the utf-8-sig encoding, but to answer your question of why it was behaving this way:
Python 3 distinguishes between byte strings (the ones with the b prefix) and character strings (without the b prefix), and prefers to use character strings whenever possible. A byte string works with the actual bytes; a character string works with Unicode codepoints. The BOM is a single codepoint, U+FEFF, so in a regular string Python 3 will treat it as a single character (because it is a single character). When you call encode, you turn the character string into a byte string.
Thus the results you were seeing are exactly what you should have: Python 3 does know what counts as a single character, which is all it sees until you call encode.
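A small sketch illustrating this (the XML line below is made up for the example): in a character string the BOM is one character, but it becomes three bytes once encoded.
bom_line = "\ufeff<?xml version='1.0'?>"  # hypothetical first line that still contains a BOM
print(repr(bom_line[0]))           # '\ufeff' -> one character, the codepoint U+FEFF
print(bom_line[0].encode('utf8'))  # b'\xef\xbb\xbf' -> the same character as three bytes
print(bom_line[1:3])               # '<?' -> dropping a single character removes the BOM
Opening the file with encoding="utf-8-sig", as in your edit, does that stripping for you.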

How to isolate non-English words separated by spaces in Lua?

I have this string
"Hello there, this is some line-aa."
How do I split it into an array like this?
Hello
there,
this
is
some
line-aa.
This is what I have tried so far:
function sliceSpaces(arg)
    local list = {}
    for k in arg:gmatch("%w+") do
        print(k)
        table.insert(list, k)
    end
    return list
end
local sentence = "مرحبا يا اخوتي"
print("sliceSpaces")
print(sliceSpaces(sentence))
This code works for English text, but not for Arabic. How can I make it work for Arabic too?
Lua strings are sequences of bytes, not Unicode characters. The pattern %w matches alphanumeric characters, but it applies to ASCII only.
Instead, use %S to match a non-whitespace character:
for k in arg:gmatch("%S+") do

Finding mean of ASCII values in a string in MATLAB

The string I am given is as follows:
scrap1 =
a le h
ke fd
zyq b
ner i
You'll notice there are two blanks, each a space (ASCII 32), in each row. I need to find the mean ASCII value in each column without taking the spaces (32) into account. First I would convert with double(scrap1), but then how do I find the mean while ignoring the spaces?
If it's only the ASCII 32 you want to omit:
d = double(scrap1);
result = mean(d(d~=32)); %// logical indexing to remove unwanted value, then mean
You can remove the spaces in the string with scrap1(scrap1 == ' ') = '';, which replaces every space in the input with an empty string. Then you can do the conversion to double and average the result.
Probably, you can use a regular expression to find the spaces and ignore them, e.g. with the pattern '\s':
findSpace = regexp(scrap1, '\s')
% I am not sure this is the best approach; it is just what comes to mind, but you can read more about regexp by typing doc regexp.

Python: Printing unicode characters beyond FFFF

On Python 3, Unicode characters can be printed like this:
print('\uFFFF')
But how can I print higher Unicode characters like 001FFFFF? print('\u001FFFFF') will just print the character U+001F followed by four literal Fs. Trying print('\u001F\uFFFF') results in two characters instead of the one I want. Is it possible to somehow print the Unicode character 001FFFFF in Python 3?
Use an upper-case U.
print('\U001FFFFF')
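Note that the eight-digit \U escape is the key here, and that the code point must fall within Unicode's range 0..0x10FFFF, which U+1FFFFF itself exceeds. A quick check with a valid code point above FFFF, chosen here just for illustration:
print('\U0001F600')  # a supplementary-plane character via the eight-digit \U escape
print(chr(0x1F600))  # the same character via chr()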
There is another way in Python 3: the built-in function chr(i), which returns "the string representing a character whose Unicode code point is the integer i", and whose "valid range for the argument is from 0 through 1,114,111 (0x10FFFF in base 16)". So you are not limited to four hex digits.
print(chr(97))
print(chr(0xFFFF))
print(chr(0x10080))
