How to replace a multibyte character in Java

I want to convert a multibyte space. Is there any way I can convert it into a normal character?
String queryTerm = "DX- zzzz";
queryTerm = queryTerm.replaceAll("\\s", "AND");
System.out.println(queryTerm);
String queryTermWithNormalSpace = "DX- zzzz";
queryTermWithNormalSpace = queryTermWithNormalSpace.replaceAll("\\s+", "AND");
System.out.println(queryTermWithNormalSpace);
The first queryTerm contains a multibyte space (U+3000), so \s does not match it and no replacement happens; the second string, with a normal space, gives me the output with the replacement.
What would be the regular expression to do this?

Java 7:
queryTerm = Pattern.compile("\\s", Pattern.UNICODE_CHARACTER_CLASS).matcher(queryTerm).replaceAll("AND");
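For reference, here is a minimal self-contained sketch of the approach above. The class and method names (IdeographicSpaceDemo, replaceWhitespace) are made up for illustration, and the input is written with the escape \u3000 so that the ideographic space is visible in the source:

```java
import java.util.regex.Pattern;

public class IdeographicSpaceDemo {
    // With UNICODE_CHARACTER_CLASS, \s follows the Unicode White_Space
    // property, which includes the ideographic space U+3000.
    private static final Pattern UNICODE_WS =
            Pattern.compile("\\s", Pattern.UNICODE_CHARACTER_CLASS);

    // Hypothetical helper: replaces every Unicode whitespace character.
    static String replaceWhitespace(String input, String replacement) {
        return UNICODE_WS.matcher(input).replaceAll(replacement);
    }

    public static void main(String[] args) {
        String queryTerm = "DX-\u3000zzzz";
        // Default \s is ASCII-only, so the string comes back unchanged.
        System.out.println(queryTerm.replaceAll("\\s", "AND"));
        // Unicode-aware \s matches U+3000.
        System.out.println(replaceWhitespace(queryTerm, "AND"));
        // The embedded flag (?U) switches on the same behaviour inline.
        System.out.println(queryTerm.replaceAll("(?U)\\s", "AND"));
    }
}
```

Precompiling the pattern is only a convenience here; the inline (?U) form should behave the same way if you prefer to stay with String.replaceAll.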

Related

Removing special characters from a string In a Groovy Script

I am looking to remove special characters from a string using Groovy. I'm nearly there, but it is removing the white spaces that are already in place, which I want to keep. I only want to remove the special characters (and not leave a whitespace in their place). I am running the code below on the PostCode L&65$$ OBH
def removespecialpostcodce = PostCode.replaceAll("[^a-zA-Z0-9]+","")
log.info removespecialpostcodce
Currently it returns L65OBH but I am looking for it to return L65 OBH
Can anyone help?
Use the code below:
PostCode.replaceAll("[^a-zA-Z0-9 ]+","")
instead of
PostCode.replaceAll("[^a-zA-Z0-9]+","")
To remove all special characters from a String, you can use a negated character class:
String str = "..\\.-._./-^+* ".replaceAll("[^A-Za-z0-9]","");
System.out.println("str: <"+str+">");
output:
str: <>
To keep the spaces in the text, add a space to the character class:
String str = "..\\.-._./-^+* ".replaceAll("[^A-Za-z0-9 ]","");
System.out.println("str: <"+str+">");
output:
str: < >
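Applied to the postcode from the question, the same idea can be sketched in plain Java (Groovy's replaceAll on a String is the Java method). The class and method names are made up for illustration:

```java
public class PostCodeCleaner {
    // Removes everything except ASCII letters, digits and the space character.
    static String stripSpecial(String s) {
        return s.replaceAll("[^a-zA-Z0-9 ]+", "");
    }

    public static void main(String[] args) {
        // The existing space between the two halves is preserved.
        System.out.println(stripSpecial("L&65$$ OBH")); // L65 OBH
    }
}
```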

Remove non ASCII characters in octave

I'm trying to remove non-ASCII characters read from a data file using Octave, but I can't make it work. I tried getting the character codes of these "weird" characters, and they do have seemingly random codes. An example string is this:
asdqwФЕДЕРАЛЬ234НОЕ234 АГЕНТСqewwqedasТВО ПasdsadО ОБРАasdasdЗОВАНИЮ
Госудаsadasdsagwfрственная акадеasdмия профессиональной п
Do you have any suggestions on how I can remove the non-ASCII characters from this string? Or better yet, how can I determine whether a given string contains non-ASCII characters?
Thanks in advance!
To remove all characters outside the ASCII range 0..127 (decimal), use
a = "asdqwФЕДЕРАЛЬ234НОЕ234 АГЕНТСqewwqedasТВО ПasdsadО ОБРАasdasdЗОВАНИЮ Госудаsadasdsagwfрственная акадеasdмия профессиональной п";
a(! isascii (a)) = []
which gives
a = asdqw234234 qewwqedas asdsad asdasd sadasdsagwf asd
and if you just want to check whether there are any non-ASCII chars:
any (! isascii("foobar"))
ans = 0
any (! isascii("foobaröäüß"))
ans = 1
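For comparison, the same two operations (strip and detect) can be sketched in Java. This is not Octave code; the class and method names are made up for illustration, and the test strings are short stand-ins written with Unicode escapes:

```java
public class AsciiFilter {
    // True if any UTF-16 code unit lies outside the ASCII range 0..127.
    static boolean hasNonAscii(String s) {
        return s.chars().anyMatch(c -> c > 127);
    }

    // Drops every character outside the ASCII range 0..127.
    static String stripNonAscii(String s) {
        return s.replaceAll("[^\\x00-\\x7F]", "");
    }

    public static void main(String[] args) {
        String mixed = "asdqw\u0424\u0415\u0414234"; // Latin text with Cyrillic in between
        System.out.println(stripNonAscii(mixed));    // asdqw234
        System.out.println(hasNonAscii("foobar"));   // false
        System.out.println(hasNonAscii(mixed));      // true
    }
}
```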

Lua gmatch odd characters (Slovak alphabet)

I am trying to extract the characters from a string of a word in Slovak. For example, the word for "TURTLE" is "KORYTNAČKA". However, it skips over the "Č" character when I try to extract it from the string:
local str = "KORYTNAČKA"
for c in str:gmatch("%a") do print(c) end
--result: K,O,R,Y,T,N,A,K,A
I am reading this page and I have also tried just pasting in the string itself as a set, but it comes up with something weird:
local str = "KORYTNAČKA"
for c in str:gmatch("["..str.."]") do print(c) end
--result: K,O,R,Y,T,N,A,Ä,Œ,K,A
Anyone know how to solve this?
Lua is 8-bit clean: a string can hold any byte values, but the string library treats every byte as one character. The pattern "%a" therefore matches single-byte letters only, and the two bytes that encode "Č" in UTF-8 are skipped, so the result is not what you expected.
The pattern "["..str.."]" appears to work because the set then contains every byte of the string, including the bytes of the multibyte character, so each of those bytes is matched individually.
If the string is UTF-8 encoded, you can use the pattern "[\0-\x7F\xC2-\xF4][\x80-\xBF]*" to match a single UTF-8 byte sequence (one character) in Lua 5.2, like this:
local str = "KORYTNAČKA"
for c in str:gmatch("[\0-\x7F\xC2-\xF4][\x80-\xBF]*") do
print(c)
end
In Lua 5.1 (which is the version Corona SDK is using), use this:
local str = "KORYTNAČKA"
for c in str:gmatch("[%z\1-\127\194-\244][\128-\191]*") do
print(c)
end
For details about this pattern, see Equivalent pattern to “[\0-\x7F\xC2-\xF4][\x80-\xBF]*” in Lua 5.1.
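To illustrate what the pattern does at the byte level, here is a rough Java sketch (not Lua) that splits a UTF-8 byte array at the same boundaries: a new character starts at every byte that is not a continuation byte (0x80-0xBF). The class and method names are hypothetical, and the sketch assumes the input is valid UTF-8:

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class Utf8Split {
    // Splits a string into its UTF-8 byte sequences, one per character.
    static List<String> split(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        List<String> chars = new ArrayList<>();
        int start = 0;
        for (int i = 1; i <= bytes.length; i++) {
            // Continuation bytes have the bit pattern 10xxxxxx (0x80-0xBF);
            // any other byte, or the end of input, closes the current sequence.
            boolean atBoundary = i == bytes.length || (bytes[i] & 0xC0) != 0x80;
            if (atBoundary) {
                chars.add(new String(bytes, start, i - start, StandardCharsets.UTF_8));
                start = i;
            }
        }
        return chars;
    }

    public static void main(String[] args) {
        // U+010C is the letter C with caron; it takes two bytes (C4 8C) in UTF-8.
        for (String c : split("KORYTNA\u010CKA")) {
            System.out.println(c);
        }
    }
}
```

The word has 10 characters but 11 bytes, which is why a byte-by-byte match reports the letter as two separate "characters".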
Lua has no built-in treatment for Unicode strings. You can see that Ä,Œ are the two bytes of the UTF-8 encoding of the single character Č.
Yu Hao already provided a sample solution, but for more details here is a good source.
I've tested this solution and found it working properly in Lua 5.1 (reserve link). You could extract individual characters using the utf8sub function; see the sample.
string.gmatch(str, "[%z\1-\127\192-\253][\128-\191]*")
Use the utf8 plugin, then replace string.gmatch with utf8.gmatch.
Example (tested on Win7, it works for me)
yourfilename.lua
local utf8 = require( "plugin.utf8" )
for c in utf8.gmatch( "KORYTNAČKA", "%a" ) do print(c) end
and
build.settings
settings =
{
plugins =
{
["plugin.utf8"] =
{
publisherId = "com.coronalabs"
},
},
}
Read more :
Introducing the UTF-8 string plugin,
Documentation for utf8 plugin,
Lua String Manipulation,
string.gmatch().
Have a nice day:)

Finding mean of ascii values in a string MATLAB

The string I am given is as follows:
scrap1 =
a le h
ke fd
zyq b
ner i
You'll notice there are 2 blank spaces (ASCII 32) in each row. I need to find the mean ASCII value in each column without taking the spaces (32) into account. So first I would convert to numbers with double(scrap1), but then how do I find the mean without taking the spaces into account?
If it's only the ASCII 32 you want to omit:
d = double(scrap1);
result = mean(d(d~=32)); %// logical indexing to remove unwanted value, then mean
You can remove the spaces from the string with scrap1(scrap1 == ' ') = ''; which deletes every space in the input. Then you can do the conversion to double and average the result. See here for other methods.
You can probably use a regular expression ('\s') to find the spaces and ignore them:
findSpace = regexp(scrap1, '\s', 'ignore')
% I am not sure about the 'ignore' option; this is just what comes to mind. You can read more about regexp by typing doc regexp.
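The question asks for a mean per column, so here is a hedged sketch of that calculation in Java rather than MATLAB. The class name, method name and the two-row sample matrix are made up for illustration (they are not the scrap1 data), and the sketch assumes all rows have the same length:

```java
import java.util.Arrays;

public class ColumnMeans {
    // Mean character code per column, skipping spaces (code 32).
    // A column that contains only spaces yields NaN.
    static double[] columnMeans(String[] rows) {
        int cols = rows.length == 0 ? 0 : rows[0].length();
        double[] means = new double[cols];
        for (int c = 0; c < cols; c++) {
            long sum = 0;
            int count = 0;
            for (String row : rows) {
                char ch = row.charAt(c);
                if (ch != ' ') {
                    sum += ch;
                    count++;
                }
            }
            means[c] = count == 0 ? Double.NaN : (double) sum / count;
        }
        return means;
    }

    public static void main(String[] args) {
        String[] sample = {"ab c", "d ef"}; // illustrative data only
        System.out.println(Arrays.toString(columnMeans(sample)));
        // [98.5, 98.0, 101.0, 100.5]
    }
}
```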

AS3 - "\u2605" NOT the same as "\\u"+"2605"?

I'm trying to make a text field where people type the Unicode code without the backslash. I want to add the backslash after they have typed it. So the user types u2605 and the code converts it to "\u2605"; I then convert this to a Unicode character and insert it in the textflow.
My code:
this works:
span.text = publicFunctions.htmlUnescape(he.encode("\u2605"))
this doesn't work:
span.text = publicFunctions.htmlUnescape(he.encode("\\u"+"2605"))
How do I make a string that acts as a Unicode string?
Tried all sorts of things, escape(unescape()), convert to number, "\u", "\u" ... nothing helps.
trace("\u2605" == "\u"+"2605") ... will return false. So will
trace("\u2605" == "\\u"+"2605")
"\u2605" is a string with a single character, the character with the code point U+2605, while "\\u" + "2605" is a string with 6 characters (the backslash, the u and the four digits).
If you want to construct a Unicode character from just the four digits, you should be able to use String.fromCharCode. The escape sequence uses a hexadecimal number, while the method takes a numeric character code, so if the user enters a hexadecimal string, you will have to convert it first with parseInt and radix 16:
trace(String.fromCharCode(parseInt('2605', 16)) == '\u2605');
That's an interesting issue! I don't think you can concatenate a string literal and achieve what you're trying to do. The relevant character escaping happens when the string literal is originally formed, which means that you need the whole sequence together in the first place.
But you should be able to take the user-supplied number and dynamically generate a Unicode string with String.fromCharCode(...).
http://help.adobe.com/en_US/FlashPlatform/reference/actionscript/3/String.html#fromCharCode()
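The same distinction exists in Java, so the idea can be sketched there as well (this is not ActionScript; Character.toChars plays the role of String.fromCharCode). The class and method names are hypothetical, and accepting an optional leading "u" is an assumption about the input format described in the question:

```java
public class CodePointFromHex {
    // Converts user input such as "2605" or "u2605" into the character
    // with that hexadecimal code point.
    static String fromHex(String input) {
        boolean hasPrefix = input.startsWith("u") || input.startsWith("U");
        String hex = hasPrefix ? input.substring(1) : input;
        int codePoint = Integer.parseInt(hex, 16); // hex digits to number
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        // The escape in a literal is resolved by the compiler: one character.
        System.out.println("\u2605".length());                   // 1
        // An escaped backslash followed by text stays as six characters.
        System.out.println("\\u2605".length());                  // 6
        // Building the character at run time from the digits.
        System.out.println(fromHex("u2605").equals("\u2605"));   // true
    }
}
```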
