Identify invalid characters in text based cell - excel

I recently inherited a VBA macro that needs validation logic added to it. I need to determine whether any characters in a text-based cell are non-ASCII (i.e. have a value > 0x7F). The cells may contain some control characters (particularly linefeeds) that need to be retained, so the CLEAN function does not work for this validation. I have tried the ISTEXT function, but found that it treats UTF-8 character sequences as valid text, which I don't want.
I don't need to actually manipulate the string; I just want to display an error to the user who runs the macro, telling them that there are invalid (non-ASCII) characters in a specific cell.

If you want a technically pure approach, you might try a regular expression. The code below creates the VBScript RegExp object with CreateObject (late binding), so no library reference is required; if you prefer early binding, add a reference to "Microsoft VBScript Regular Expressions 5.5" instead. It looks a little complex, but regular expressions are powerful and you will have a valuable tool for future use.
Function IsTooHigh(c As String) As Boolean
    ' Late-bound VBScript regular expression object
    Dim RegEx As Object
    Set RegEx = CreateObject("vbscript.regexp")
    With RegEx
        .Global = True
        .MultiLine = True
        ' Match any single character outside the range 0x00-0x7F
        .Pattern = "[^\x00-\x7F]"
    End With
    IsTooHigh = RegEx.Test(c)
End Function
This function returns TRUE if any character in string c is not (^) in the range 0 (x00) to 127 (x7F).
You can search for "regular expression" plus whatever you need it to do and adapt the answer from almost any language: basic pattern syntax such as character classes is largely portable, although each regex engine has its own dialect and the more advanced features differ.
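To illustrate how portable that character class is, here is the same pattern in Python rather than VBA. This is only a sketch: the names NON_ASCII, has_non_ascii and first_non_ascii are made up for the example and are not part of the macro above.

```python
import re

# Same character class as the VBA pattern: anything outside 0x00-0x7F.
NON_ASCII = re.compile(r"[^\x00-\x7F]")


def has_non_ascii(text: str) -> bool:
    """Return True if text contains at least one non-ASCII character."""
    return NON_ASCII.search(text) is not None


def first_non_ascii(text: str):
    """Return (index, character) of the first offender, or None."""
    match = NON_ASCII.search(text)
    if match is None:
        return None
    return match.start(), match.group()


if __name__ == "__main__":
    # Linefeeds are plain ASCII (0x0A), so they do not trigger the check.
    print(has_non_ascii("line one\nline two"))
    print(first_non_ascii("caf\u00e9"))
```

Reporting the position of the first offending character, as first_non_ascii does, may make the error message shown to the user more useful than a plain TRUE/FALSE.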

The Asc(character) function converts a character to its character code in the system's ANSI code page; AscW(character) returns its Unicode code unit instead.
Hex(Asc(character)) gives that same code as a hexadecimal string.
Once you've done that, you can compare the value against 127 to determine whether the data is bad and raise an error if required.
Here's some sample code:
http://www.freevbcode.com/ShowCode.asp?ID=4486

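Function IsGoodAscii(aString As String) As Boolean
    Dim i As Long
    Dim code As Long
    For i = 1 To Len(aString)
        ' AscW returns the Unicode code unit; Asc would map characters
        ' outside the ANSI code page to "?" (63) and miss them.
        code = AscW(Mid$(aString, i, 1))
        ' AscW returns a negative Integer for code units above &H7FFF
        If code < 0 Or code > 127 Then
            IsGoodAscii = False
            Exit Function
        End If
    Next i
    IsGoodAscii = True
End Function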

Related

Replacing "invisible" special characters with something legible

About twelve years ago, I wrote a small VB.NET application that loads strings from files. These strings may contain one or more of the following characters: à, è, é, ì, ò, ù, ä, ö. The application uses a special custom font (JazzText Extended) that does not have those special characters. Yet, I somehow managed to make the application display words correctly in that font, and twelve years later, I have no idea how - thanks for not leaving a line of comment, past me!
The program has the following routine:
Private Sub SetWord(ByVal word() As String)
Dim nword(3) As String
nword(0) = word(0)
nword(1) = word(1)
nword(2) = word(2)
For i As Integer = 0 To 2
nword(i) = nword(i).Replace("à", "")
nword(i) = nword(i).Replace("é", "")
nword(i) = nword(i).Replace("è", "")
nword(i) = nword(i).Replace("ì", "ê")
nword(i) = nword(i).Replace("ò", "")
nword(i) = nword(i).Replace("ù", "")
nword(i) = nword(i).Replace("ä", "")
nword(i) = nword(i).Replace("ö", "")
Next
lblItaWord.Text = nword(0).ToUpper
lblEngWord.Text = nword(1).ToUpper
lblFinWord.Text = nword(2).ToUpper
End Sub
What it does is, it takes an array that contains three words, and for each of those three words, it looks if it contains any of the special characters. If it does, it replaces them with... something, makes the words all caps, and then assigns each of them to one of three labels.
In Visual Studio, the replacement characters look like empty strings. I had to put the cursor in between the quotation marks to realise that it was in fact not an empty string and there was an invisible character there. Here on SO... I'm not sure what you'll see. You might see just a square, or some other weird character. (The ê character is an exception, it seems to display in the same way everywhere.)
If you copypaste any of the invisible/square characters to Google and search for it, you'll get a different representation that uses two characters—for example, the first one translates to ‡. Using this pair in place of the invisible/square character in the Replace method does not produce the correct result. FYI, the encoding I use to read the files (the default one used by IO.StreamReader if you don't specify any encoding) works fine: if I use a more standard font, all special characters display correctly without using the SetWord sub at all.
Now, I have absolutely no idea how those characters, whatever they may be, manage to make the app display the words correctly when the font I use does not have those characters. I have no idea how I found out about this trick, either. Right now, my problem is that I would like to replace those squares/invisible characters with something intelligible, and I have no idea how. Any ideas?
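One way to find out what an invisible character actually is would be to print its code point instead of looking at its glyph. Below is a rough sketch in Python (not VB.NET); describe_chars is a hypothetical helper. The second half rests on a guess: that the first replacement character is the control character U+0087, because the two-character pair mentioned above is exactly what U+0087 looks like when its UTF-8 bytes are decoded as Windows-1252. The question alone cannot confirm that.

```python
import unicodedata


def describe_chars(text):
    """Return one 'U+XXXX category' label per character (hypothetical helper)."""
    return [f"U+{ord(ch):04X} {unicodedata.category(ch)}" for ch in text]


# A letter followed by an invisible C1 control character.
print(describe_chars("a\u0087"))

# Assumption: the invisible character is U+0087. Its UTF-8 encoding is the
# two bytes C2 87, which Windows-1252 displays as two visible characters.
mojibake = "\u0087".encode("utf-8").decode("cp1252")
print([f"U+{ord(ch):04X}" for ch in mojibake])
```

If that guess holds, the replacement strings could be written with explicit character codes rather than pasted invisible literals, which would make the routine readable again.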

How to put text validation for a special character on a particular cell in Excel?

In a particular cell in Excel, I have to add validation so that a user cannot enter text values containing special characters like "-", ",", "|", "/", ... with the exception of "_" (underscore).
I have written a custom formula for this and it works, but it has a limitation and doesn't solve my problem completely.
Here is a formula:
=ISNUMBER(FIND("_",A1))
When a user enters text with some other character like "," or "-" in it, the formula throws a validation error, as intended.
But if a user enters only text without any special character, it also throws an error and the user is not able to enter the text.
What I want is this: plain text should be allowed, and if the text contains a special character, only "_" should be allowed.
example:
allowed: "StackOverflow", "Stack_Overflow"
not allowed: "Stack-Overflow", "Stack, Overflow" or any other special character.
For a more complex test that will only allow these characters [A-Za-z0-9_] use Regular Expressions.
This pattern ^[A-Za-z0-9_]+$ will only allow …
[A-Z] capital letters
[a-z] lower case letters
[0-9] numbers
[_] underscores
… where any of them occur at least once or more. Any other character is not allowed. See and test at: https://regex101.com/r/DnQrAq/1
Option Explicit

Public Function SpecialValidate(ByVal InputValue As String) As Boolean
    Const RegExPattern As String = "^[A-Za-z0-9_]+$"
    Dim RegEx As Object
    Set RegEx = CreateObject("vbscript.regexp")
    With RegEx
        .Pattern = RegExPattern
        ' MultiLine must stay False so that ^ and $ anchor the whole value;
        ' otherwise a cell with a line break passes if any one line is valid.
        .MultiLine = False
        SpecialValidate = .Test(InputValue)
    End With
End Function
You can then use this as a formula, =SpecialValidate(A2), to validate any cell value; it returns TRUE for valid input and FALSE otherwise.
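For anyone who wants to try the whitelist pattern outside Excel first, here is a small sketch in Python. The function name special_validate is hypothetical; re.fullmatch plays the role of the ^ and $ anchors.

```python
import re

# Whitelist: one or more letters, digits or underscores, nothing else.
ALLOWED = re.compile(r"[A-Za-z0-9_]+")


def special_validate(value: str) -> bool:
    """Return True only if every character of value is in the whitelist."""
    return ALLOWED.fullmatch(value) is not None


if __name__ == "__main__":
    for sample in ["StackOverflow", "Stack_Overflow", "Stack-Overflow", "Stack, Overflow"]:
        print(sample, special_validate(sample))
```

Note that an empty value is rejected because the pattern requires at least one character; change + to * if empty cells should pass.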
This assumes that a single character in the excluded group is also not valid:
=SUM(ISNUMBER(FIND("-",A1)),ISNUMBER(FIND(",",A1)),ISNUMBER(FIND("|",A1)),ISNUMBER(FIND("/",A1)))=0
Since array constants cannot be included in a validation formula, you can list each excluded character separately.
With a complete list of what should be included or excluded, a more compact formula might be possible.
You could use:
=IF(IF(LEN(C1)-LEN(SUBSTITUTE(C1,"-",""))>0,1,0)+IF(LEN(C1)-LEN(SUBSTITUTE(C1,",",""))>0,1,0)+IF(LEN(C1)-LEN(SUBSTITUTE(C1,"/",""))>0,1,0)+IF(LEN(C1)-LEN(SUBSTITUTE(C1,"|",""))>0,1,0)>0,"Invalid Character","Correct")

Reading from a string using sscanf in Matlab

I'm trying to read a string in a specific format:
<ahref="/teams/spain/real-sociedad-de-futbol/2028/">RealSociedad</a>
This is one example of the string, and what I want to extract is the name of the team.
I've tried something like this,
houseteam = sscanf(str, '%s');
but it does not work. Why?
You can use regexprep for this. Although your post asks about sscanf, your comments suggest you'd like to see it done with regexprep. It takes two nested regexprep calls to retrieve the team name (i.e. RealSociedad), given that str is in the format that you have provided:
str = '<ahref="/teams/spain/real-sociedad-de-futbol/2028/">RealSociedad</a>';
houseteam = regexprep(regexprep(str, '^<a(.*)">', ''), '</a>$', '')
This looks very intimidating, but let's break this up. First, look at this statement:
regexprep(str, '^<a(.*)">', '')
How regexprep works is you specify the string you want to analyze, the pattern you are searching for, then what you want to replace this pattern with. The pattern we are looking for is:
^<a(.*)">
This says you are looking for a match at the beginning of the string that starts with <a. After this, (.*)"> performs a greedy match: it consumes the longest possible sequence of characters that still ends with the characters ">. As such, what the regular expression will match is the following string:
<ahref="/teams/spain/real-sociedad-de-futbol/2028/">
We then replace this with a blank string. As such, the output of the first regexprep call will be this:
RealSociedad</a>
We want to get rid of the </a> string, and so we would make another regexprep call where we look for the </a> at the end of the string, then replace this with the blank string yet again. The pattern you are looking for is thus:
</a>$
The dollar sign ($) symbolizes that this pattern should appear at the end of the string. If we find such a pattern, we will replace it with the blank string. Therefore, what we get in the end is:
RealSociedad
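The same two-step removal can be sketched in Python with re.sub, which plays the role of regexprep here. The function name strip_anchor is made up for the example, and the input is assumed to have the form of the matched opening tag followed by RealSociedad</a>, as in the walkthrough above.

```python
import re


def strip_anchor(html: str) -> str:
    """Remove a leading <a ...> tag and a trailing </a> (hypothetical helper)."""
    # Step 1: greedy match from the start of the string up to the last '">'.
    without_open = re.sub(r'^<a(.*)">', "", html)
    # Step 2: drop the closing tag, but only at the very end of the string.
    return re.sub(r"</a>$", "", without_open)


s = '<ahref="/teams/spain/real-sociedad-de-futbol/2028/">RealSociedad</a>'
print(strip_anchor(s))
```

Because the first pattern is greedy, it would also swallow part of the team name if the name itself contained the sequence ">; for input like the sample that is not a concern.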
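Found a solution. %s stops when it finds whitespace, so I put spaces around the tags first:
str = regexprep(str, '<', ' <');
str = regexprep(str, '>', '> ');
houseteam = sscanf(str, '%*s %s %*s');
This puts spaces around my desired string, so sscanf can skip the first token (%*s), read the team name (%s) and skip the last token (%*s).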
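The whitespace trick carries over to other languages as well. A rough Python equivalent, assuming the input consists of an opening tag, the name and a closing tag with no spaces inside any of them, could look like this (team_name is a hypothetical function name):

```python
def team_name(html: str) -> str:
    """Pad the angle brackets with spaces, then pick the middle token."""
    padded = html.replace("<", " <").replace(">", "> ")
    tokens = padded.split()  # split on any run of whitespace
    # tokens[0] is the opening tag, tokens[1] the name, tokens[2] the closing tag
    return tokens[1]


s = '<ahref="/teams/spain/real-sociedad-de-futbol/2028/">RealSociedad</a>'
print(team_name(s))
```

As with %s, this only returns the first word if the name itself contains a space.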

C# 4.0 function to check for first four characters in the string

I need to validate a code name.
My string can have values like the ones below:
string[] tests = { "C000. ", "C010. ", "C020. ", "C030. ", "CA00. ", "C0B0. ", "C00C. " };
So my function needs to validate below conditions:
It should start with C
After that next 3 characters should be numeric before .
Rest it can be anything.
So in above string values, only ["C000.", "C010.", "C020.", "C030."] are valid ones.
EDIT:
Below is the code I tried:
if (nameObject.Title.StartsWith(String.Format("^[C][0-9]{3}$",nameObject.Title)))
I'd suggest a regex, for example (written off the top of my head, may need work):
string s = "C030.";
// ^ anchors the match to the start of the string
Regex reg = new Regex(@"^C[0-9]{3}\.");
bool isMatch = reg.IsMatch(s);
This regex should do the trick (the ^ makes sure the code name starts with C):
Regex.IsMatch(input, @"^C[0-9]{3}\..*")
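To check the rule against the sample values from the question, here is a quick sketch in Python rather than C#. The names CODE_NAME and is_valid_code are made up for the example; re.match only matches at the start of the string, so no explicit ^ is needed.

```python
import re

# "C", exactly three digits, then a literal dot; anything may follow.
CODE_NAME = re.compile(r"C[0-9]{3}\.")


def is_valid_code(name: str) -> bool:
    """Return True if name starts with C, three digits and a dot."""
    return CODE_NAME.match(name) is not None


samples = ["C000. ", "C010. ", "C020. ", "C030. ", "CA00. ", "C0B0. ", "C00C. "]
valid = [s for s in samples if is_valid_code(s)]
print(valid)
```

Only the first four samples pass, which is the result the question asks for.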
Check out http://www.techotopia.com/index.php/Working_with_Strings_in_C_Sharp
for a quick tutorial on (among other things) individual access of string elements, so you can test each element for your criteria.
If you think your criteria may change, using regular expressions gives you maximum flexibility (but is more runtime intensive than regular string-element evaluation). In your case, it may be overkill, IMHO.

Split string by first delimiter

I have a column with a long list of folder and file names. The folders and file names vary. I want to extract the file name from the column into another column, but I am struggling to do this in Excel.
Example of column data (file and folder names altered to hide details that should not be public):
c:\data\1\nc2\media\ss\system media\ne\d - wnd enging works v5.swf
c:\data\1\nc2\media\ss\special campaigns\samns dec 2012\trainerv5.swf
C:\Local\Messages\17362~000000001~20131231235910~4.MUF
c:\data\1\nc2\media\ss\system media\tl\nd - tfl statusv4.swf
c:\data\1\nc2\media\ss\system media\core\ss_bagage v2.swf
I know I should be able to search from the right to the first occurrence of "\", but I can't figure out the syntax.
Many thanks
UPDATE:
The formula =RIGHT(B2,LEN(B2)-SEARCH("\",B2,1)) should work, but it shows incorrect results. If I change it to search for ".", it pulls out the file extension, so there is a key item I'm missing.
=RIGHT(A1,LEN(A1)-FIND("~",SUBSTITUTE(A1,"\","~",LEN(A1)-LEN(SUBSTITUTE(A1,"\","")))))
Copy it into any column, say B, and drag it down. You are done.
VBA is another option if you have many files to parse. Create a module and add the code below:
Function GetFileName(file As String) As String
    ' Late-bound FileSystemObject; no library reference needed
    Dim fso As Object
    Set fso = CreateObject("Scripting.FileSystemObject")
    GetFileName = fso.GetFileName(file)
End Function
There are several different ways to get the text following the last slash in a string, including the following formula. In this example, H15 is the cell containing the string to search. If it can't find a slash, it returns the "-" (dash) character.
=IFERROR(RIGHT(H15,LEN(H15)-SEARCH("|",SUBSTITUTE(H15,"/","|",LEN(H15)-LEN(SUBSTITUTE(H15,"/",""))))),"-")
The formula first finds the number of slashes in the string: LEN gives the total length of the string, and LEN applied to the result of SUBSTITUTE with the slashes removed gives the length without them. The difference is the number of slashes.
Then you substitute a marker character (I used "|") for the last slash. Searching for the marker tells you where the part after the slash starts. The total length of the string minus the marker's position is the number of characters to take from the right.
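The steps of that formula can be traced one by one in a short Python sketch. The function name text_after_last and its parameters are hypothetical, and it assumes a single-character separator and a marker that does not otherwise occur in the text (the same assumption the formula makes).

```python
def text_after_last(text: str, sep: str = "/", marker: str = "|", default: str = "-") -> str:
    """Mirror the LEN/SUBSTITUTE/SEARCH/RIGHT steps of the formula."""
    # LEN(text) - LEN(SUBSTITUTE(text, sep, "")): number of separators
    count = len(text) - len(text.replace(sep, ""))
    if count == 0:
        return default  # IFERROR branch: no separator found

    # SUBSTITUTE(text, sep, marker, count): replace only the last separator
    head, _, tail = text.rpartition(sep)
    marked = head + marker + tail

    # SEARCH(marker, marked): 1-based position of the marker
    position = marked.find(marker) + 1

    # RIGHT(text, LEN(text) - position): everything after the marker
    right_count = len(text) - position
    return text[len(text) - right_count:]


path = r"c:\data\1\nc2\media\ss\system media\core\ss_bagage v2.swf"
print(text_after_last(path, sep="\\"))
```

In real Python code text.rpartition(sep)[2] alone would do the job; the longer version is only there to show what each part of the formula contributes.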
If you need more generic string parsing and are willing to use a little bit of VBA, you can use the split function as suggested by Jamie Bull in his answer to this question on SuperUser.
His function will use any character you choose to split the string into segments and return whichever segment you choose.
I've copied Jamie's function here for convenient reference:
Function STR_SPLIT(str, sep, n) As String
Dim V() As String
V = Split(str, sep)
STR_SPLIT = V(n - 1)
End Function
