Replacing "invisible" special characters with something legible - string

About twelve years ago, I wrote a small VB.NET application that loads strings from files. These strings may contain one or more of the following characters: à, è, é, ì, ò, ù, ä, ö. The application uses a special custom font (JazzText Extended) that does not have those special characters. Yet, I somehow managed to make the application display words correctly in that font, and twelve years later, I have no idea how - thanks for not leaving a line of comment, past me!
The program has the following routine:
Private Sub SetWord(ByVal word() As String)
Dim nword(3) As String
nword(0) = word(0)
nword(1) = word(1)
nword(2) = word(2)
For i As Integer = 0 To 2
nword(i) = nword(i).Replace("à", "")
nword(i) = nword(i).Replace("é", "")
nword(i) = nword(i).Replace("è", "")
nword(i) = nword(i).Replace("ì", "ê")
nword(i) = nword(i).Replace("ò", "")
nword(i) = nword(i).Replace("ù", "")
nword(i) = nword(i).Replace("ä", "")
nword(i) = nword(i).Replace("ö", "")
Next
lblItaWord.Text = nword(0).ToUpper
lblEngWord.Text = nword(1).ToUpper
lblFinWord.Text = nword(2).ToUpper
End Sub
It takes an array that contains three words and, for each of them, checks whether it contains any of the special characters. If it does, it replaces them with... something; it then makes the words all caps and assigns each of them to one of three labels.
In Visual Studio, the replacement characters look like empty strings. I had to put the cursor in between the quotation marks to realise that it was in fact not an empty string and there was an invisible character there. Here on SO... I'm not sure what you'll see. You might see just a square, or some other weird character. (The ê character is an exception, it seems to display in the same way everywhere.)
If you copy and paste any of the invisible/square characters into Google and search for it, you'll get a different representation that uses two characters; for example, the first one translates to ‡. Using this pair in place of the invisible/square character in the Replace method does not produce the correct result. FYI, the encoding I use to read the files (the default one used by IO.StreamReader if you don't specify any encoding) works fine: if I use a more standard font, all special characters display correctly without using the SetWord sub at all.
Now, I have absolutely no idea how those characters, whatever they may be, manage to make the app display the words correctly when the font I use does not have those characters. I have no idea how I found out about this trick, either. Right now, my problem is that I would like to replace those squares/invisible characters with something intelligible, and I have no idea how. Any ideas?
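One way to see what the "invisible" characters really are is to print their code points rather than their glyphs. The sketch below is in Python rather than VB.NET, purely for illustration. `describe` is a hypothetical helper, and the sample character U+0087 is an assumption: it is the character whose UTF-8 bytes read as ‡ when misinterpreted as Windows-1252, which matches the two-character form mentioned above, but the characters in the actual source file may be different.

```python
import unicodedata


def describe(text):
    """Return one 'U+XXXX NAME' entry per character so invisible ones become legible."""
    return [
        "U+%04X %s" % (ord(ch), unicodedata.name(ch, "<no name>"))
        for ch in text
    ]


# Assumed sample: U+0087 is a C1 control character with no visible glyph in most fonts.
sample = "\u0087"
print(describe(sample))

# Its UTF-8 bytes, misread as Windows-1252, give a two-character form.
mojibake = sample.encode("utf-8").decode("cp1252")
print(ascii(mojibake))
```

Once the code points are known, the replacement strings can be written with explicit character codes instead of pasted invisible literals, which keeps the source readable.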

Related

How are the double-quotes managed when I put this code into a loop?

On SO, I was just given two answers that both work when called a single time. Now I want to put them in a loop and loop over several rows of data. However, I'm having a heck of a time getting the code correct. I suspect it has to do with how I'm handling the double quotes.
The stand alone code lines are as follows.
Var = ActiveSheet.Evaluate("And(A1:F1)") and
Var = Application.WorksheetFunction.And(Range("A1:F1"))
for the first example I tried:
for i = 2 to 20
Var = ActiveSheet.Evaluate("And(A & i & :F & i)")
Next i
This produces "Error 2015"
for the second:
for i = 2 to 20
Var = Application.WorksheetFunction.And(Range("A" & i & ":F" & i))
Next i
This produces a line of red code
What am I doing wrong?
The Visual Basic Editor is making this harder than it should be, because its default syntax highlighting is making string literals the same color as identifiers:
You can change that under Tools/Options, and make Identifier Text a different color - here teal:
Now string literals are still black, but now identifiers look visually distinctive:
What you want to make sure of is that your variables are syntax-highlighted like identifiers - so they're teal, not black - like in your second example:
Contrast with your first attempt, where i doesn't get syntax-highlighted as the identifier it should be:
And since you know that i is a VBA variable and you want VBA to concatenate its value into this string, then i being syntax-highlighted as any other string literal (and not as an identifier) is your visual cue that something's off!
Compare to @JNevill's fixed version:
With Identifier Text having a different syntax highlighting than string literals in the editor, it becomes much easier to quickly locate a variable that's accidentally inside a string literal.
That first snippet isn't working, because ActiveSheet.Evaluate takes its parameter and gives it to Excel's expression evaluation engine, ...which has no idea what to do with this i. Variable i only exists in the execution context of the VBA code: only VBA code can evaluate its value.
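The same pitfall shows up in any language that builds an expression as text for another engine to evaluate. As a rough illustration in Python (not VBA; `build_formula` is a hypothetical helper), the loop variable has to be concatenated in from outside the quotes, otherwise the evaluator only ever sees the letter i:

```python
def build_formula(row):
    """Build the text that would be handed to the spreadsheet's evaluator for one row."""
    return "And(A" + str(row) + ":F" + str(row) + ")"


# Variable name trapped inside the literal: the text never changes with the loop.
broken = ["And(A & i & :F & i)" for i in range(2, 5)]

# Variable concatenated from outside the quotes: each row gets its own range.
fixed = [build_formula(i) for i in range(2, 5)]

print(broken)
print(fixed)
```

Printing the string before passing it to the evaluator is a quick way to check that the variable actually made it into the expression.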

Reading from a string using sscanf in Matlab

I'm trying to read a string in a specific format
<ahref="/teams/spain/real-sociedad-de-futbol/2028/">RealSociedad</a>
this is one example of string and what I want to extract is the name of the team.
I've tried something like this,
houseteam = sscanf(str, '%s');
but it does not work, why?
You can use regexprep, like you did in your post above, to do this. Even though your post asks about sscanf, your comments indicate that you'd like to see this done using regexprep. You would need two nested regexprep calls, and you can retrieve the team name (i.e. RealSociedad) like so, given that str is in the format that you have provided:
str = '<ahref="/teams/spain/real-sociedad-de-futbol/2028/">RealSociedad</a>';
houseteam = regexprep(regexprep(str, '^<a(.*)">', ''), '</a>$', '')
This looks very intimidating, but let's break this up. First, look at this statement:
regexprep(str, '^<a(.*)">', '')
How regexprep works is you specify the string you want to analyze, the pattern you are searching for, then what you want to replace this pattern with. The pattern we are looking for is:
^<a(.*)">
This says you are looking for patterns where the string starts with <a. After this, the (.*)"> performs a greedy match: we want to find the longest sequence of characters until we reach the characters ">. As such, what the regular expression will match is the following string:
<ahref="/teams/spain/real-sociedad-de-futbol/2028/">
We then replace this with a blank string. As such, the output of the first regexprep call will be this:
RealSociedad</a>
We want to get rid of the </a> string, and so we would make another regexprep call where we look for the </a> at the end of the string, then replace this with the blank string yet again. The pattern you are looking for is thus:
</a>$
The dollar sign ($) symbolizes that this pattern should appear at the end of the string. If we find such a pattern, we will replace it with the blank string. Therefore, what we get in the end is:
RealSociedad
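For comparison, the same two-step substitution can be sketched in Python with `re.sub`, using the patterns from the explanation above unchanged. This is only an illustration in a different language; `team_name` is a hypothetical helper and the sample string is the one reconstructed from the matched and remaining parts shown above.

```python
import re

sample = '<ahref="/teams/spain/real-sociedad-de-futbol/2028/">RealSociedad</a>'


def team_name(text):
    """Strip the opening tag at the start, then the closing tag at the end."""
    without_open = re.sub(r'^<a(.*)">', '', text)   # greedy: runs up to the last ">
    return re.sub(r'</a>$', '', without_open)        # $ anchors the match at the end


print(team_name(sample))
```

Because `(.*)` is greedy, a string containing more than one `">` would lose everything up to the last one, so this only suits input shaped like the example.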
Found a solution. So, %s stops when it finds a space.
str = regexprep(str, '<', ' <');
str = regexprep(str, '>', '> ');
houseteam = sscanf(str, '%*s %s %*s');
This puts a space on each side of my desired string, so the middle %s picks it up.
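The whitespace trick carries over to other languages as well. A minimal Python sketch of the same idea follows; `second_token` is a hypothetical name, and the approach assumes the text between the tags contains no spaces of its own:

```python
def second_token(text):
    """Pad the angle brackets with spaces, split on whitespace, keep the middle piece."""
    spaced = text.replace("<", " <").replace(">", "> ")
    tokens = spaced.split()
    return tokens[1]


sample = '<ahref="/teams/spain/real-sociedad-de-futbol/2028/">RealSociedad</a>'
print(second_token(sample))
```

A team name made of several words would be split into several tokens, in which case the tokens between the first and the last would have to be joined again.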

Tell if specific char in string is a long char or a short char

Be prepared, this is one of those hard questions.
In Farsi (Persian), the letter ی, which sounds like y or i, is written in 4 different shapes according to its place in the word. I'll call ی YA from now on for simplicity.
take a look at this image
All YA characters are painted in red. In the first word, YA is attached to its previous character (the one to its right; in Farsi we write from right to left) and is free at the end, whereas the last YA (3rd word, left-most red char) is free on both the left and the right.
With that long story told, I want to find out whether a part of a string ends with long YA (YA without points) or short YA (YA with two points beneath it).
i.e. تحصیلداری (the 3rd word) ends with long YA but تحصیـ, which is a part of the 3rd word, does not end with short YA.
Question: How can I tell which Unicode code point تحصیلداری ends with? I just have a simple string, "تحصیلداری"; how can I convert its characters to Unicode code points?
I tried the unicodes
string unicodes = "";
foreach (char c in "تحصیلداری")
{
unicodes += c+" "+((int)c).ToString() + Environment.NewLine;
}
MessageBox.Show(unicodes);
result :
But at the end of the day, unfortunately, all YAs have the same Unicode code point.
Bad news: YA was just an example, a real one though. There are also a dozen other characters like YA that take different appearances.
Additional info:
Using this useful link about Unicode, I found the code points of the different YAs
We solved a similar problem as follows:
We had a core banking application whose customer sub-system needed full-text search on customers' names, family names, fathers' names, etc.
Different encodings, legacy migrated data, keyboard layouts and Farsi fonts made the search process inaccurate.
We overcame the problem by replacing the problematic characters with standard ones and saving the standardized string for search purposes.
After several iterations, we arrived at the replacement below, which may come in handy:
Formula="UPPER(REPLACE(REPLACE(REPLACE
(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE
(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE
(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE
(REPLACE(REPLACE(REPLACE(REPLACE
(REPLACE(FirsName || LastName || FatherName,
chr(32),''),
chr(13),''),
chr(9),''),
chr(10),''),
'-',''),
'-',''),
'آ','ا'),
'أ', 'ا'),
'ئ', 'ي'),
'ي', 'ي'),
'ك', 'ک'),
'آإئؤةي','اايوهي'),
'ء',''),
'شأل','شاال'),
'ا.','اله'),
'.',''),
'الله','اله'),
'ؤ','و'),
'إ','ا'),
'ة','ه'),
' ا لله','اله'),
'ا لله','اله'),
' ا لله','اله'))"
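The single-character part of that chain can be sketched as a translation table. The Python version below is an illustration only, not the banking application's code: `CHAR_MAP` and `search_key` are hypothetical names, the multi-character rules (such as the ones for الله) are left out, and the YEH mappings are omitted because the exact source and target characters are not clear from the formula as shown.

```python
# Single-character rules taken from the REPLACE chain; None means "delete".
CHAR_MAP = {
    "\u0622": "\u0627",  # alef with madda -> alef
    "\u0623": "\u0627",  # alef with hamza above -> alef
    "\u0625": "\u0627",  # alef with hamza below -> alef
    "\u0624": "\u0648",  # waw with hamza -> waw
    "\u0629": "\u0647",  # teh marbuta -> heh
    "\u0643": "\u06A9",  # Arabic kaf -> Persian keheh
    "\u0621": None,      # standalone hamza
    " ": None, "\t": None, "\r": None, "\n": None,  # chr(32), chr(9), chr(13), chr(10)
    "-": None, ".": None,
}
TABLE = str.maketrans(CHAR_MAP)


def search_key(*parts):
    """Concatenate the name parts and normalize them into one comparable key."""
    return "".join(parts).translate(TABLE).upper()


print(ascii(search_key("\u0643\u0622", " \u0625-")))
```

The same key function would be applied both when storing the names and when normalizing the search term, so that the two sides always compare in the same form.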
Although there are several different YEHs in Unicode, note that all presentation forms of the Farsi YEH are the same Unicode character, with code 0x06CC. You cannot determine the presentation form from the code point.
But you can reach your goal by checking which characters come before and after the YEH.
You can also use Fardis to see Unicode codes of strings.
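Checking the neighbouring characters could look roughly like the Python sketch below. It is a simplified assumption-laden illustration, not a full implementation of Unicode's Arabic shaping rules: `letter_form`, `RIGHT_JOINING` and `LETTER_RANGES` are hypothetical names, the list of letters that never connect to the following letter is partial (enough for common Persian text), and diacritics and zero-width joiners are not handled. In these terms, the "long" YA without points corresponds to the isolated and final forms, and the "short" YA with two points to the initial and medial forms.

```python
# Letters that connect only to the letter before them, never to the one after.
RIGHT_JOINING = set(
    "\u0622\u0623\u0624\u0625\u0627"  # alef forms and waw with hamza
    "\u0629\u062F\u0630\u0631\u0632"  # teh marbuta, dal, thal, reh, zain
    "\u0648\u0698"                    # waw, jeh
)

# Rough ranges of joining letters (tatweel included, standalone hamza excluded).
LETTER_RANGES = [("\u0622", "\u063A"), ("\u0640", "\u064A"), ("\u0671", "\u06D3")]


def is_joining_letter(ch):
    """True if ch is a letter that takes part in cursive joining (simplified)."""
    return any(low <= ch <= high for low, high in LETTER_RANGES)


def letter_form(text, index):
    """Classify the shape the letter at text[index] takes from its neighbours."""
    ch = text[index]
    prev = text[index - 1] if index > 0 else ""
    nxt = text[index + 1] if index + 1 < len(text) else ""
    joined_before = (
        prev != "" and is_joining_letter(prev) and prev not in RIGHT_JOINING
        and is_joining_letter(ch)
    )
    joined_after = (
        nxt != "" and is_joining_letter(nxt)
        and is_joining_letter(ch) and ch not in RIGHT_JOINING
    )
    if joined_before and joined_after:
        return "medial"
    if joined_before:
        return "final"
    if joined_after:
        return "initial"
    return "isolated"


# The 3rd word from the question, spelled with explicit code points.
word = "\u062A\u062D\u0635\u06CC\u0644\u062F\u0627\u0631\u06CC"
print(letter_form(word, len(word) - 1))  # last YA, preceded by reh
print(letter_form(word, 3))              # YA in the middle of the word
```

Since the classification depends only on the neighbours, the same function would work for the other letters that change appearance, not just YA.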

Split string by first delimiter

I have a column with a long list of folder and file names. The folders and file names vary. I want to extract the file name from the column into another column, but I'm struggling to do this in Excel.
Example of column data (files and folders altered to hide details that should not be public):
c:\data\1\nc2\media\ss\system media\ne\d - wnd enging works v5.swf
c:\data\1\nc2\media\ss\special campaigns\samns dec 2012\trainerv5.swf
C:\Local\Messages\17362~000000001~20131231235910~4.MUF
c:\data\1\nc2\media\ss\system media\tl\nd - tfl statusv4.swf
c:\data\1\nc2\media\ss\system media\core\ss_bagage v2.swf
I know I should be able to search from the right to the first occurrence of "\" but I can't figure out the syntax.
Many thanks
UPDATE:
The formula =RIGHT(B2,LEN(B2)-SEARCH("\",B2,1)) should work, but it shows incorrect results. If I change it to search for "." instead, it pulls out the file extension, so there is a key item I'm missing.
=RIGHT(A1,LEN(A1)-FIND("~",SUBSTITUTE(A1,"\","~",LEN(A1)-LEN(SUBSTITUTE(A1,"\","")))))
Copy it into any column, say B, and drag it down; you're done.
VBA is a more efficient option if you have many files to parse. Create a module and add the below:
Function GetFileName(file As String) As String
    ' Late-bound FileSystemObject, so no project reference is required
    Dim fso As Object
    Set fso = CreateObject("Scripting.FileSystemObject")
    GetFileName = fso.GetFileName(file)
End Function
There are several different ways to get the text following the last slash in a string, including the following formula. In this example, H15 is the cell containing the string to search. If it can't find a slash, it returns the "-" (dash) character.
=iferror(RIGHT(H15,LEN(H15)-SEARCH("|",SUBSTITUTE(H15,"/","|",LEN(H15)-LEN(SUBSTITUTE(H15,"/",""))))),"-")
The formula first finds the number of slashes in the string. LEN gives the total length of the string, and LEN applied after SUBSTITUTE has removed every slash gives the length without them; the difference is the number of slashes.
Then, you substitute a marker character (I used "|") for the last slash. By searching for the marker, you find where the bit after the slash starts. The total length of the string minus the marker's position tells you how many characters to take from the right, which you then do.
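The steps above can be traced in a few lines of Python. This is only an illustrative sketch of the formula's logic in another language: `last_segment` is a hypothetical name, and the separator is a parameter so the same code covers both "\" and "/".

```python
def last_segment(text, sep="\\"):
    """Return the text after the last separator, or None if there is none."""
    # Step 1: count the separators (LEN minus LEN after SUBSTITUTE removes them).
    count = len(text) - len(text.replace(sep, ""))
    if count == 0:
        return None
    # Step 2: walk to the last occurrence, which is what SUBSTITUTE's
    # instance-number argument selects when it places the marker.
    position = -1
    for _ in range(count):
        position = text.find(sep, position + 1)
    # Step 3: keep everything to the right of that position (RIGHT).
    return text[position + 1:]


path = r"c:\data\1\nc2\media\ss\system media\core\ss_bagage v2.swf"
print(last_segment(path))
```

Searching for the first separator instead of the last one, as a plain SEARCH from the left does, would return almost the whole path rather than the file name.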
If you need more generic string parsing and are willing to use a little bit of VBA, you can use the split function as suggested by Jamie Bull in his answer to this question on SuperUser.
His function will use any character you choose to split the string into segments and return whichever segment you choose.
I've copied Jamie's function here for convenient reference:
Function STR_SPLIT(str, sep, n) As String
Dim V() As String
V = Split(str, sep)
STR_SPLIT = V(n - 1)
End Function
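For readers who want to try the idea outside Excel, a rough Python counterpart of that function might look like this (`str_split` is a hypothetical name; like the VBA version it counts segments from 1):

```python
def str_split(text, sep, n):
    """Return the n-th (1-based) segment of text after splitting on sep."""
    segments = text.split(sep)
    return segments[n - 1]


path = r"c:\data\1\nc2\media\ss\system media\core\ss_bagage v2.swf"
segments = path.split("\\")
# The file name is the last segment, so n is the number of segments.
print(str_split(path, "\\", len(segments)))
```

Because the caller has to know which segment number holds the file name, this suits paths of fixed depth better than paths of varying depth.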

Eggplant/Sensetalk parsing and separating a string with capitalized words

I'm in need of the ability to parse and separate a text string using Sensetalk (the scripting language the Eggplant GUI tester uses). What I'd like to be able to do is provide the code a text string:
Put "MyTextIsHere" into exampleString
And then have spaces inserted before every capital letter save for the first, so the following is then stored in exampleString:
"My Text Is Here"
I basically want to separate the string into the words it contains. After searching the documentation and the web, I'm no closer to finding a solution to this (I agree, it would be far easier in a different language - alas, not my choice).
Thank you in advance to anyone who can provide some insight!
See question at http://www.testplant.com/phpBB2/viewtopic.php?t=2192.
With credit to Pamela at TestPlant forums:
set startingString to "HereAreMyWords"
set myRange to 2 to the number of characters in startingString // The range to iterate over: every character except the first
Put the first character in startingString into endString // The first character isn't included in the repeat loop, so you have to put it in separately
repeat with each character myletter of characters myRange of startingString
if charToNum(myLetter) is between 65 and 90 // if the character's unicode number is between 65-90...
Put space after endString
end if
Put myLetter after endString
end repeat
put endString
or you could do it this way:
Put "MyTextIsHere" into exampleString
repeat with each char of chars 2 to last of exampleString by reference
if it is an uppercase then put space before it
end repeat
put exampleString
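For reference, the logic of the first script can be written out in Python as below. This is a sketch in a different language, not SenseTalk, and `split_capitalized` is a hypothetical name; it may still help when checking the expected result for a given input.

```python
def split_capitalized(text):
    """Insert a space before every uppercase letter except the first character."""
    result = text[:1]              # the first character is copied as-is
    for ch in text[1:]:
        if ch.isupper():           # a capital letter starts a new word
            result += " "
        result += ch
    return result


print(split_capitalized("MyTextIsHere"))
```

Note that consecutive capitals, as in an acronym, would each get their own space with this approach, just as they would in the scripts above.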
