Identify and extract noun and modifier - excel

Any idea how to Identify and extract noun and modifier using VBA (excel)
Example:
ball valve 2in for green pump with gasket
Should be: ball valve
Any help will be appreciated

There are several approaches, depending on the kind of sentence you expect. In your example, the two words you want to extract are at the beginning of the sentence and separated by spaces. If you expect this to always be the case, you could use something as simple as
Function getNoun(ByVal sentence As String) As String
    Dim pos1 As Long, pos2 As Long
    pos1 = InStr(1, sentence, " ") 'find the first space
    If pos1 <= 0 Then
        getNoun = sentence 'if no space, then assume there is only the noun
        Exit Function
    End If
    pos2 = InStr(pos1 + 1, sentence, " ") 'find the second space
    If pos2 <= 0 Then
        getNoun = sentence 'if no second space, then assume there is only the noun and modifier
        Exit Function
    End If
    getNoun = Left(sentence, pos2 - 1) 'if there are two or more spaces, get all chars before the second one
End Function
Tests in immediate window:
? getNoun("ball valve 2in for green pump with gasket")
ball valve
? getNoun("ball valve")
ball valve
? getNoun("ball")
ball
If your scenario is more complex and you need to use specific criteria to determine which words are the desired noun and qualifier, you would probably find use for the Regex COM class (see this topic for example).
EDIT: Based on the comments, I understand that positions are variable, and that it is acceptable to use the MS Word thesaurus as a reference. If the code will run in Microsoft Word, the following function will tell you whether or not a word is a noun:
Function is_noun(ByVal wrd As String) As Boolean
    Dim s As Object, l As Variant, i As Long
    is_noun = False
    Set s = SynonymInfo(wrd)
    If s.MeaningCount <> 0 Then
        l = s.PartOfSpeechList 'only read the list when the word was found
        For i = LBound(l) To UBound(l)
            If l(i) = wdNoun Then
                is_noun = True
                Exit For
            End If
        Next
    End If
End Function
If you are not running on MS Word (your tags suggest MS Excel) but MS Word is installed in the target system, then you can adapt the above code to use MS Word COM automation object.
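A minimal sketch of that adaptation for Excel, assuming MS Word is installed and late binding is acceptable. The name IsNounViaWord is hypothetical, and wdNoun is redefined locally (value 1 in Word's WdPartOfSpeech enumeration) because Word's constants are not available without a reference to the Word library:

```vba
' Hypothetical Excel-side variant of is_noun using Word through COM automation.
Function IsNounViaWord(ByVal wrd As String) As Boolean
    Const wdNoun As Long = 1        ' WdPartOfSpeech.wdNoun, redefined for late binding
    Static wordApp As Object        ' reused across calls; starting Word is slow
    Dim info As Object, parts As Variant, i As Long

    If wordApp Is Nothing Then Set wordApp = CreateObject("Word.Application")

    Set info = wordApp.SynonymInfo(wrd)
    If info.MeaningCount > 0 Then
        parts = info.PartOfSpeechList
        For i = LBound(parts) To UBound(parts)
            If parts(i) = wdNoun Then
                IsNounViaWord = True
                Exit For
            End If
        Next i
    End If
End Function
```

The Static variable keeps one hidden Word instance alive between calls, so a real macro would probably want to call wordApp.Quit somewhere once it has finished.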
Then you can extract the first noun, and the next word - if any -, from a sentence, with something like this
Function getNoun(ByVal sentence As String) As String
    Dim wrds() As String
    Dim i As Long
    getNoun = ""
    wrds = Split(sentence)
    For i = LBound(wrds) To UBound(wrds)
        If is_noun(wrds(i)) Then
            getNoun = wrds(i)
            If i < UBound(wrds) Then
                getNoun = getNoun & " " & wrds(i + 1)
            End If
            Exit Function
        End If
    Next
End Function
Note, however, that this approach blindly trusts MS Word's word database, and you may get odd results if your sentences contain, for example, words that can be a verb or a noun depending on context. Also, the example above uses the default language of your MS Word setup (a different language, if installed, can be selected by passing a language parameter to SynonymInfo).

Related

Tokenize Myanmar names, typed using Pyidaungsu Unicode font, in MS Excel, using VBA UDF

I searched everywhere for a way to sort Myanmar names, typed up in Pyidaungsu unicode font, by the last consonant in MS Excel.
Doing the same in English is relatively easy using Excel's builtin formulae/functions.
But it is hard for Myanmar names in Burmese, because they do not require white space between words, and the first, middle and last names are not as distinct as in, e.g., John W. Smith.
In a Myanmar name, eg. အောင်မြင့်မြတ်=Aung Myint Myat, there is no distinct first/last name and no white space is required if it is written in Myanmar font!
Thus, it is pretty hard to find the boundaries between words, i.e., where အောင် starts and ends, where မြင့် starts and ends, and so on!
So I need a VBA UDF to be able to tokenize Myanmar names!
After much searching and reading through NLP literature, a lot of which I don't really understand, I realized that the Myanmar font, Pyidaungsu, has a character binding method in which the consonant comes first in each word, followed by its diacritics (I may be wrong about the terminology).
So if only I could place a delimiter/separator just before each consonant, I should be able to tokenize each word!
Fortunately, that observation lets me write VBA code like:
Const kagyi = 4096
Const ah = 4129 '+9 to include ou
Const athat = 4154
Const shiftF = 4153 'for typing something under something
Const witecha = 4140
Const moutcha = 4139
'Return a tokenized Myanmar String
Function MMRTokenizer(target As Range) As String
    Dim ch As String
    Dim returnString As String
    Dim charCounter As Integer
    Dim previousChIsAthat As Boolean
    Dim shiftFfound As Boolean
    Dim previousCharAt As Long
    If target.Cells.CountLarge > 1 Then MMRTokenizer = ">1Cell!": Exit Function
    returnString = "": previousChIsAthat = False: shiftFfound = False: previousCharAt = Len(target.Value) + 1
    If target.CountLarge = 1 Then
        If target.Value <> "" Then
            For charCounter = Len(target.Value) To 1 Step -1
                ch = Mid(target.Value, charCounter, 1)
                If AscW(ch) <> shiftF Then
                    If Not shiftFfound Or AscW(ch) = athat Then
                        If AscW(ch) <> athat Then
                            If AscW(ch) >= kagyi And AscW(ch) < ah + 9 Then
                                If Not previousChIsAthat Then
                                    returnString = Mid(target.Value, charCounter, previousCharAt - charCounter) & IIf(Len(returnString) > 0, "|", "") & returnString
                                    previousCharAt = charCounter
                                Else
                                    previousChIsAthat = False
                                End If
                            Else
                                If AscW(ch) = witecha Or AscW(ch) = moutcha Then
                                    previousChIsAthat = False
                                End If
                            End If
                        Else
                            previousChIsAthat = True
                            If shiftFfound Then shiftFfound = False
                        End If
                    Else
                        shiftFfound = False
                        If previousChIsAthat Then previousChIsAthat = False
                    End If
                Else
                    shiftFfound = True
                End If
            Next charCounter
        End If
    End If
    MMRTokenizer = returnString
End Function
In theory, it should be pretty simple, since I am not using any NLP or ML methods, only some string manipulation.
I take each character of the name from the right (starting from the left may also work), move left until I find a consonant, place a separator to the left of it, and repeat the process until the left-most character is reached.
The caveat is that a consonant is sometimes part of a consonant-plus-diacritic combination (pretty common in Burmese). For example, အောင် looks like ‌ေ+အ+ ာ+င+်, but when it is entered with the Windows Burmese keyboard (Visual Order), the Pyidaungsu font stores it as အ+‌ေ+ာ+င+်. The rightmost two characters are င+်, where င is the consonant called nga and ် is the diacritic called Athat.
In such cases, we skip over that consonant (when that specific diacritic sits immediately to its right), since it does not start a new word according to the Burmese way of spelling.
I used the ChrW and AscW functions because Myanmar script cannot be rendered in the VBIDE (even after tweaking the Regional settings), so I am forced to check Unicode character codes instead of comparing Burmese characters directly.
The above is just the gist of how the whole thing works.
Further details are available on my GitHub.
After tokenizing like this, we get something like အောင်|မြင့်|မြတ်, which is now easy to split or reverse using built-in Excel formulae to become မြတ်|မြင့်|အောင်, so that it can be sorted by the last word (or last name) or separated into last name and first name!
NB: This whole tokenization process could probably be achieved with a combination of formulae in Excel, especially in Excel 365 (where arrays spill without CSE), but I hope the benefits of the UDF versus the complexity and effort of a formula-only solution are easy to see in this case.
I admit that the above code may not be the most elegant, but it is a working proof of concept, so use it at your own risk; bugs can be reported on my GitHub mentioned above.
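The reversal step mentioned above can also be done in VBA rather than with formulae. This is only a sketch; ReverseTokens is a hypothetical helper name, and the "|" default simply mirrors the separator that MMRTokenizer emits:

```vba
' Hypothetical helper: reverse the order of tokens in a delimited string,
' e.g. ReverseTokens("a|b|c") returns "c|b|a".
Function ReverseTokens(ByVal tokenized As String, _
                       Optional ByVal delimiter As String = "|") As String
    Dim parts() As String
    Dim i As Long
    Dim result As String

    If Len(tokenized) = 0 Then Exit Function   ' nothing to reverse

    parts = Split(tokenized, delimiter)
    For i = UBound(parts) To LBound(parts) Step -1
        If Len(result) > 0 Then result = result & delimiter
        result = result & parts(i)
    Next i
    ReverseTokens = result
End Function
```

In a worksheet it could be combined with the tokenizer as =ReverseTokens(MMRTokenizer(A2)), and the result column used as the sort key.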

Excel FindJobCode's problems

I am new to VBA and have the code below to find job numbers in a description.
However, I have 3 problems with it:
1. If the 1st character is a lowercase letter such as "s" or "m", it shows an error.
2. I cannot solve Example3: the result shows "M3045.67," but all I need is "M3045.67", without the comma.
3. I don't know why the code Range("E2").Value = "Overhead" after Else fails to run in Example5.
For problem 3, I could get the result "Overhead" before I added the 2nd criterion. Is something wrong there? Please help, thanks.
P.S. The looping will be added after solving the above questions.
Sub FindCode()
    'Example1 : G5012.123 Management Fee / Get Result = G5012.123
    'Example2 : G3045.67 Management Fee / Get Result = G3045.67
    'Example3 : M3045.67, S7066 Rental Fee / Get Result = M3045.67,
    'Example4 : P9876-123A Car Park / Get Result = P9876
    'Example5 : A4 paper / Get result = Overhead
    'Criteria1 : 1st Character = G / S / M / P
    If Left(Range("A2"), 1) = "G" Or Left(Range("A2"), 1) = "S" Or Left(Range("A2"), 1) = "M" Or Left(Range("A2"), 1) = "P" Then
        'Criteria2 : 2nd-5th Character = Number only
        If IsNumeric(Mid(Range("A2"), 2, 4)) Then
            'Get string before "space"
            Range("E2").Value = Left(Range("A2"), InStr(1, Range("A2"), " ") - 1)
        Else
            'If not beginning from Crit 1&2, show "Overhead"
            Range("E2").Value = "Overhead"
        End If
    End If
    'If start from "P", get first 5 string
    If Left(Range("A2"), 1) = "P" And IsNumeric(Mid(Range("A2"), 2, 4)) Then
        Range("E2").Value = Left(Range("A2"), 5)
    Else
    End If
End Sub
The function below will extract the job number and return it to the procedure that called it.
Function JobCode(Cell As Range) As String
    ' 303
    'Example1 : G5012.123 Management Fee / Get Result = G5012.123
    'Example2 : G3045.67 Management Fee / Get Result = G3045.67
    'Example3 : M3045.67, S7066 Rental Fee / Get Result = M3045.67
    'Example4 : P9876-123A Car Park / Get Result = P9876
    'Example5 : A4 paper / Get result = Overhead
    Dim Fun As String ' function return value
    Dim Txt As String ' Text to extract number from
    ' Minimize the number of times your code reads from the sheet because it's slow
    Txt = Cell.Value ' actually, it's Cells(2, 1)
    ' Criteria1 : 1st Character = G / S / M / P (case-insensitive)
    ' The Len test keeps an empty cell out: InStr("GSMP", "") would return 1
    If Len(Txt) > 0 And InStr("GSMP", UCase(Left(Txt, 1))) > 0 Then
        Txt = Split(Txt)(0) ' split on blank, take first element
        ' Criteria2 : 2nd-5th Character = Number only
        ' Isnumeric(Mid("A4", 2, 4)) = true, hence the extra length test
        If (Len(Txt) >= 5) And (IsNumeric(Mid(Txt, 2, 4))) Then
            Fun = Replace(Txt, ",", "") ' drop the trailing comma in example 3
            Fun = Split(Fun, "-")(0) ' discard "-123A" in example 4
        End If
    End If
    ' If no job number was extracted, show "Overhead"
    If Len(Fun) = 0 Then Fun = "Overhead"
    JobCode = Fun
End Function
The setup as a function, rather than a sub, is typical for this sort of search. In my trials I had your 5 examples in A2:A6 and called them in a loop, giving a different cell to the function on each loop. Very likely, this is what you are angling for, too. This is the calling procedure I used for testing.
Sub Test_JobCode()
    ' 303
    Dim R As Long
    For R = 2 To Cells(Rows.Count, "A").End(xlUp).Row
        ' I urge you not to use syntax for addressing ranges when addressing cells
        Debug.Print JobCode(Cells(R, "A")) ' actually, it's Cells(2, 1)
    Next R
End Sub
Of course, instead of Debug.Print JobCode(Cells(R, "A")) you could also have Cells(R, "B").Value = JobCode(Cells(R, "A"))
Your Else statement didn't work because of a logical error. The Else belongs to the inner If (criterion 2), so it is only reached when the first letter is G, S, M or P; "A4 paper" already fails criterion 1, so nothing is written at all. "Overhead" should not mean "criteria 1 and 2 aren't met at this point" but "all previous efforts failed", which is slightly broader in meaning. On top of that, IsNumeric(Mid("A4", 2, 4)) = True, so that test does not fail where you might expect it to.
In rough terms, the code first checks whether the first letter qualifies the entry for examination (and returns "Overhead" if it doesn't). Then the text is split into words, and only the first one is considered. If it is too short or not numeric, no job code is extracted, which results in "Overhead" in the next step. If this test is passed, the result is modified: the trailing comma is removed (if it exists), and anything appended with a hyphen is removed (if it exists). I'm not sure you actually want the latter; if not, you can simply remove that line, or add more modifications at that point.
What you are trying to do is FAR easier with regular expression matching and replacing, so I recommend enabling that library of functions. The best news is that, once two small wrapper functions are in place, you can invoke them in Excel formulas and do not need to write any further Visual Basic for Applications code.
To enable Regular Expressions as Excel functions:
Step 1: Enable the Regular Expression library in VBA.
A. In the Visual Basic for Applications window (where you enter VBA code) find the Tools menu and
select it, then select the References... entry in the sub-menu.
B. A dialogue box will appear listing the possible "Available References:" in alphabetical order.
Scroll down to find the entry "Microsoft VBScript Regular Expressions 5.5".
C. Check the checkbox on that line and press the OK button.
Step 2: Create function calls. In the Visual Basic for Applications window select Insert..Module. Then paste the following VBA code into the blank window that comes up:
' Some function wrappers to make the VBScript RegExp reference Library useful in both VBA code and in Excel & Access formulas
'
Private rg As RegExp 'All of the input data to control the RegExp parsing
' RegExp object contains 3 Boolean options that correspond to the 'i', 'g', and 'm' options in Unix-flavored regexp
' IgnoreCase - pretty self-evident. True means [A-Z] matches lowercase letters and vice versa, false means it won't
' IsGlobal - True means after the first match has been processed, continue on from the current point in DataString and look to process more matches. False means stop after first match is processed.
' MultiLine - False means ^ and $ match only Start and End of DataString, True means they match embedded newlines. This provides an option to process line-by-line when Global is true also.
'
' Returns true/false: does DataString match pattern? IsGlobal=True makes no sense here
Public Function RegExpMatch(DataString As String, Pattern As String, Optional IgnoreCase As Boolean = True, Optional IsGlobal As Boolean = False, Optional MultiLine As Boolean = False) As Boolean
If rg Is Nothing Then Set rg = New RegExp
rg.IgnoreCase = IgnoreCase
rg.Global = IsGlobal
rg.MultiLine = MultiLine
rg.Pattern = Pattern
RegExpMatch = rg.Test(DataString)
End Function
'
' Find <pattern> in <DataString>, replace with <ReplacePattern>
' Default IsGlobal=True means replace all matching occurrences. Call with False to replace only first occurrence.
'
Public Function RegExpReplace(DataString As String, Pattern As String, ReplacePattern As String, Optional IgnoreCase As Boolean = True, Optional IsGlobal As Boolean = True, Optional MultiLine As Boolean = False) As String
If rg Is Nothing Then Set rg = New RegExp
rg.IgnoreCase = IgnoreCase
rg.Global = IsGlobal
rg.MultiLine = MultiLine
rg.Pattern = Pattern
RegExpReplace = rg.Replace(DataString, ReplacePattern)
End Function
Now you can call RegExpMatch and RegExpReplace in Excel formulas, and we can start to think about how to solve your particular problem. To be a match, your string must start with G, S, M, or P. In a regular expression that is ^[GSMP], where the caret (^) anchors the match at the beginning of the string and [GSMP] accepts a G, S, M or P in the next position. Any matching string must then have a run of digits. Code that as \d+, where \d means one digit and + is a modifier that means one or more of them. Then there may or may not be a dot followed by some more digits. This is a little more complicated: you code it as (\.\d+)? because the dot is a special character in regular expressions and \. accepts a literal dot. It is followed by \d+, one or more digits, and the whole expression is enclosed in parentheses and followed by a ?, which means that what is in the parentheses can appear once or not at all. Finally comes the rest of the line, and we don't really care what is in it. We code .*$ for zero or more characters (any) followed by the end of the line. That all goes together as ^[GSMP]\d+(\.\d+)?.*$.
Putting that pattern into our RegExpReplace call:
=RegExpReplace(A2,"^([GSMP]\d+(\.\d+)?).*$","$1")
We wrapped the part we are interested in keeping in parentheses because the "$1" in the replacement pattern says to use whatever was found inside the first set of parentheses.
This works for all your examples but the last one, which is your else clause in your logic. We can fix that by testing whether the pattern matched using RegExpMatch:
=IF(RegExpMatch(A2,"^([GSMP]\d+(\.\d+)?).*$"),RegExpReplace(A2,"^([GSMP]\d+(\.\d+)?).*$","$1"),"Overhead")
This gives the results you are looking for and you have also gained a powerful text manipulation tool to solve future problems.
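If a single VBA function is preferred over the two-formula combination, the same idea can be sketched with late binding, so no library reference is needed. JobCodeRx is a hypothetical name, and the \d{4} quantifier is an assumption taken from the question's "2nd-5th character = number only" rule rather than from the \d+ used in the formulas above:

```vba
' Hypothetical single-call variant: return the job code, or "Overhead" if none.
Function JobCodeRx(ByVal txt As String) As String
    Dim rx As Object, found As Object

    Set rx = CreateObject("VBScript.RegExp")
    rx.IgnoreCase = True                       ' accept "s7066" as well as "S7066"
    rx.Pattern = "^[GSMP]\d{4}(\.\d+)?"        ' letter, four digits, optional .digits

    Set found = rx.Execute(txt)
    If found.Count > 0 Then
        JobCodeRx = found(0).Value             ' text of the first (only) match
    Else
        JobCodeRx = "Overhead"
    End If
End Function
```

With the five examples from the question this returns G5012.123, G3045.67, M3045.67, P9876 and Overhead. A lowercase code is returned as typed; wrapping the result in UCase would normalize it if that is wanted.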

Replace Whole Words from a Predefined List

I currently have code that reviews an equipment description field in order to standardize entries. That is, whatever is found in COL A is replaced with COL B.
I want to post the answer back to a new clean description column (that will work OK, no dramas on that section, but I don't need any messages etc, and this may be doing 100,000+ descriptions at a time, so looking for efficient coding).
However, when it applies the Replace function, it also replaces parts of words instead of distinct whole words, no matter how I sort the words on the Dictionary tab.
** 99 times out of a hundred there are no preceding or trailing spaces in Col A entries, but there are rare occasions...
Description Examples:
AIR COMPRESSOR
LEVEL GAUGE OIL SEPARATOR GAS COMPRESSOR
PRESS CTRL VV
PRESSURE GAUGE FLAME FRONT
PRESS as part of word becomes PRESSURE, e.g.:
COL A:            COL B:
COMPRESSSOR       COMPRESSOR
PRESSURE          PRESSURE
PRESSURE GAUGE    PRESSURE GAUGE
PRESS             PRESSURE
AIR COMPRESSOR    AIR COMPRESSOR
I think I'm very close to getting this right, but I can't figure out how to adjust to make it run and replace whole words only - I think it is the order of where I have stuff, but not 100% sure, or if something is missing.
I would greatly appreciate your help with this.
Thanks, Wendy
Function CleanUntil(original As String, targetReduction As Integer)
Dim newString As String
newString = original
Dim targetLength As Integer
targetLength = Len(original) - targetReduction
Dim rowCounter As Integer
rowCounter = 2
Dim CleanSheet As Worksheet
Set CleanSheet = ActiveWorkbook.Sheets("Dictionary")
Dim word As String
Dim cleanword As String
' Coding for replacement of WHOLE words - with a regular expression using a pattern with the \b marker (for the word boundary) before and after word
Dim RgExp As Object
Set re = CreateObject("VBScript.RegExp")
With RgExp
.Global = True
'.IgnoreCase = True 'True if search is case insensitive. False otherwise
End With
'Loop through each word until we reach the target length (or other value noted), or run out of clean words to apply
'While Len(newString) > 1 (this line will do ALL descriptions - confirmed)
'While Len(newString) > targetLength (this line will only do to target length)
While Len(newString) > 1
word = CleanSheet.Cells(rowCounter, 1).Value
cleanword = CleanSheet.Cells(rowCounter, 2).Value
RgExp.Pattern = "\b" & word & "\b"
If (word = "") Then
CleanUntil = newString
Exit Function
End If
' TODO: Make sure it is replacing whole words and not just portions of words
' newString = Replace(newString, word, cleanword) ' This line works if no RgExp applied, but finds part words.
newString = RgExp.Replace(newString, word, cleanword)
rowCounter = rowCounter + 1
Wend
' Once word find/replace finished, set close out loop for RgExp Object with word boundaries.
Set RgExp = Nothing
' Finally return the cleaned string as clean as we could get it, based on dictionary
CleanUntil = newString
End Function
NB: I would strongly recommend adding a reference to the Microsoft VBScript Regular Expressions 5.5 library (via Tools -> References...). This will give you strong typing and Intellisense on the RegExp object.
Dim RgExp As New RegExp
If I understand correctly, you can find the entries that need to be replaced using a regular expression; the regular expression only matches entries where the value in A is a complete word.
But when you try to replace with the VBA Replace function, it replaces even partial words in the text. And using the RegExp.Replace method has no effect -- the string always remains the same.
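The partial-word replacements come from the VBA Replace function, which knows nothing about word boundaries. The RegExp version fails for two other reasons: the object is assigned to re (Set re = ...) while the rest of the code uses RgExp, and RegExp.Replace takes only two arguments, the source string and the replacement text, because the search pattern comes from the Pattern property. Wrapping the word in a group with ( ) is optional, but it keeps the matched word available as $1 in the replacement text: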
RgExp.Pattern = "\b(" & word & ")\b"
' ...
newString = RgExp.Replace(newString, cleanword)
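If you want a hyphen to count as part of a word rather than as a boundary, you might replace \b with explicit tests that exclude word characters and the hyphen. The leading boundary character is captured and put back, and the trailing one is only looked at, not consumed:
RgExp.Pattern = "(^|[^\w-])" & word & "(?=[^\w-]|$)"
newString = RgExp.Replace(newString, "$1" & cleanword)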
Reference:
Replace method
Introduction to the VBScript regular expression library
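Putting the pieces together, a whole-word replacement loop could look roughly like the sketch below. CleanDescription and its dictValues parameter are hypothetical names; the sketch assumes the dictionary has been read once into a two-column array (which matters when 100,000+ descriptions are processed) and that the dictionary words contain no regular expression metacharacters:

```vba
' Hypothetical rework of CleanUntil: replace whole words only.
' dictValues is a 2-D array (column 1 = word to find, column 2 = clean word),
' e.g. read once with: dictValues = Sheets("Dictionary").Range("A2:B500").Value
' (the range address is an assumption).
Function CleanDescription(ByVal original As String, dictValues As Variant) As String
    Dim rx As Object
    Dim r As Long
    Dim word As String
    Dim result As String

    Set rx = CreateObject("VBScript.RegExp")
    rx.Global = True          ' replace every occurrence, not just the first
    rx.IgnoreCase = True

    result = original
    For r = LBound(dictValues, 1) To UBound(dictValues, 1)
        word = Trim$(CStr(dictValues(r, 1)))   ' tolerate stray spaces in COL A
        If Len(word) = 0 Then Exit For         ' first blank row ends the dictionary
        rx.Pattern = "\b" & word & "\b"        ' \b = word boundary on both sides
        result = rx.Replace(result, CStr(dictValues(r, 2)))
    Next r

    CleanDescription = result
End Function
```

With a PRESS to PRESSURE entry, "PRESS CTRL VV" becomes "PRESSURE CTRL VV" while "AIR COMPRESSOR" is left alone, because the PRESS inside COMPRESSOR is not preceded by a word boundary.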

VBA: Add Carriage Return + Line Feed at the start of Uppercase phrase

I have cells that contain various information.
In these cells, there are multiple Uppercase phrases.
I would like to be able to split the contents of the cell by adding the CHAR(13) + CHAR(10) Carriage return - linefeed combination
to the start of each new Uppercase phrase.
The only consistency is that the multiple Uppercase phrases begin after a period (.) and before open parenthesis "("
Example:
- Add CRLF to start of PERSUADER
- Add CRLF to start of RIVER JEWEL
- Add CRLF to start of TAHITIAN DANCER
- Add CRLF to start of AMBLEVE
- Add CRLF to start of GINA'S HOPE
NOTE:
There are multiple periods (.) in the text.
I have highlighted the text in red for a visual purpose only (normal text/font during import).
I am OK with either formula, UDF or VBA sub.
TEXT
PERSUADER (1) won by a margin first up at Kyneton. Bit of authority about her performance there and with the stable finding form it's easy to see her going right on with that. Ran really well when placed at Caulfield second-up last prep and that rates well against these. RIVER JEWEL (2) has been racing well at big odds. I have to like the form lines that she brings back in class now. Shapes as a key danger. TAHITIAN DANCER (5) will run well. She was okay without a lot of room at Flemington last time. AMBLEVE (13) is winning and can measure up while GINA'S HOPE (11) wasn't too far from River Jewel at Flemington and ties in as a hope off that form line.
I was able to extract with this function - but not able to manipulate the data in the cell
This is my code so far:
Function UpperCaseWords(ByVal S As String) As String
Dim X As Long, Words() As String
Const OkayPunctuation As String = ",."";:'&,-?!"
For X = 1 To Len(OkayPunctuation)
S = Replace(S, Mid(OkayPunctuation, X, 1), " ")
Next
Words = Split(WorksheetFunction.Trim(S))
For X = 0 To UBound(Words)
If Words(X) Like "*[!A-Z]*" Then Words(X) = ""
Next
UpperCaseWords = Trim(Join(Words))
End Function
Your description is not the same as your examples.
None of your examples start after a dot.
Most start after a dot-space except
PERSUADER starts at the start of the string
GINA'S HOPE starts after a space
I incorporated those rules into a regular expression, but, since your upper case words can include punctuation, for brevity I just looked for
- words that excluded lower case letters and digits
- words at least three characters long
If that is not sufficient in your real data, the regex can easily be made more specific:
Option Explicit
Function upperCaseWords(S As String) As String
Dim RE As Object
Set RE = CreateObject("vbscript.regexp")
With RE
.Global = True
.MultiLine = True
.Pattern = "^|\s(\b[^a-z0-9]+\b\s*\()"
upperCaseWords = .Replace(S, vbCrLf & "$1")
End With
End Function
as per your wording
The only consistency is that the multiple Uppercase phrases begin
after a period (.) and before open parenthesis "("
this should do:
Function UpperCaseWords(ByVal s As String) As String
    Dim w As Variant
    Dim s1 As String
    For Each w In Split(s, ". ")
        If InStr(w, "(") > 0 Then w = vbCrLf & w
        'Split drops the ". " delimiters, so put them back between the pieces
        s1 = s1 & IIf(Len(s1) > 0, ". ", "") & w
    Next
    UpperCaseWords = s1
End Function
Since the OP accepts a formula solution, here is a formula answer.
Assume the data is in A1.
In B1, enter this formula and copy it across until the result is blank:
=TRIM(RIGHT(SUBSTITUTE(TRIM(MID(SUBSTITUTE(SUBSTITUTE(" (. "&$A1," while ",". ")," (",REPT(" ",700)),COLUMN(A1)*700,700))&" ",". ",REPT(" ",300)),300))

VBA Trim leaving leading white space

I'm trying to compare strings in a macro and the data isn't always entered consistently. The difference comes down to the amount of leading white space (ie " test" vs. "test" vs. " test")
For my macro the three strings in the example should be equivalent. However I can't use Replace, as any spaces in the middle of the string (ex. "test one two three") should be retained. I had thought that was what Trim was supposed to do (as well as removing all trailing spaces). But when I use Trim on the strings, I don't see a difference, and I'm definitely left with white space at the front of the string.
So A) What does Trim really do in VBA? B) Is there a built in function for what I'm trying to do, or will I just need to write a function?
Thanks!
So as Gary's Student alluded to, the character wasn't 32. It was in fact 160. Being the simple man I am, I consider white space to be white space. In line with that view, I created the following function, which removes ALL leading and trailing characters that don't actually display to the human eye (i.e. anything that is not a special character or alphanumeric). That function is below:
Function TrueTrim(v As String) As String
    Dim out As String
    out = v
    'Chop off the first character so long as it's white space.
    'Len(out) is tested first so an all-blank string never reaches AscW("").
    Do While Len(out) > 0
        If Not IsBlankChar(Left(out, 1)) Then Exit Do
        out = Mid(out, 2)
    Loop
    'Chop off the last character so long as it's white space
    Do While Len(out) > 0
        If Not IsBlankChar(Right(out, 1)) Then Exit Do
        out = Left(out, Len(out) - 1)
    Loop
    'Capture result for return
    TrueTrim = out
End Function
Private Function IsBlankChar(ch As String) As Boolean
    'Characters that don't output something the human eye can see,
    'based on http://www.gtwiki.org/mwiki/?title=VB_Chr_Values
    Const bad As String = "||127||129||141||143||144||160||173||"
    Dim code As Long
    code = AscW(ch) And &HFFFF& 'AscW returns negative values above 32767
    IsBlankChar = code < 33 Or InStr(1, bad, "||" & code & "||") <> 0
End Function
TRIM() will remove all leading spaces
Sub demo()
Dim s As String
s = " test "
s2 = Trim(s)
msg = ""
For i = 1 To Len(s2)
msg = msg & i & vbTab & Mid(s2, i, 1) & vbCrLf
Next i
MsgBox msg
End Sub
It is possible your data has characters that are not visible, but are not spaces either.
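Without seeing your code it is hard to know, but you could also use the Application.WorksheetFunction.Clean() method, which removes non-printable characters, in conjunction with the Trim() method. Note that Clean() only handles character codes 0 to 31, so a non-breaking space (160) would still have to be replaced separately.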
MSDN Reference page for WorksheetFunction.Clean()
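To find out which invisible character is actually in the data, it can help to list the character codes of a suspicious string. The helper below is a small sketch; CharCodes is a hypothetical name:

```vba
' Hypothetical diagnostic: list the Unicode code of every character in s,
' separated by spaces, so non-space "blanks" such as 160 become visible.
Function CharCodes(ByVal s As String) As String
    Dim i As Long
    Dim result As String

    For i = 1 To Len(s)
        If i > 1 Then result = result & " "
        ' AscW is signed; mask it so codes above 32767 are not shown as negatives
        result = result & (AscW(Mid$(s, i, 1)) And &HFFFF&)
    Next i
    CharCodes = result
End Function
```

For example, ? CharCodes(ChrW(160) & "test") in the Immediate window prints 160 116 101 115 116, which shows a non-breaking space where an ordinary space would have printed 32.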
Why don't you try using the Instr function instead? Something like this
Function Comp2Strings(str1 As String, str2 As String) As Boolean
If InStr(str1, str2) <> 0 Or InStr(str2, str1) <> 0 Then
Comp2Strings = True
Else
Comp2Strings = False
End If
End Function
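Basically you are checking whether string1 contains string2 or string2 contains string1, so you don't have to trim the data. Be aware that this treats any substring as a match: "test" also matches "test one", and an empty string matches everything.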
VBA's Trim function is limited to dealing with spaces. It will remove spaces at the start and end of your string.
In order to deal with things like newlines and tabs, I've always imported the Microsoft VBScript RegEx library and used it to replace whitespace characters.
In your VBA window, go to Tools, References, then find Microsoft VBScript Regular Expressions 5.5. Check it and hit OK.
Then you can create a fairly simple function to trim all white space, not just spaces.
Private Function TrimEx(ByVal stringToClean As String) As String
    'ByVal, so the caller's variable is not modified as a side effect
    Dim re As New RegExp
    ' Matches any whitespace at start of string
    re.Pattern = "^\s*"
    stringToClean = re.Replace(stringToClean, "")
    ' Matches any whitespace at end of string
    re.Pattern = "\s*$"
    stringToClean = re.Replace(stringToClean, "")
    TrimEx = stringToClean
End Function
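As far as I know, \s in the VBScript engine covers only space, tab, form feed, vertical tab and the line-break characters, so it would not catch the character 160 (non-breaking space) reported elsewhere in this thread. A hedged variant that adds it explicitly, using late binding so no reference is required (TrimNbsp is a hypothetical name):

```vba
' Hypothetical variant of TrimEx that also strips non-breaking spaces (code 160)
' from both ends. Interior spaces are left untouched.
Function TrimNbsp(ByVal s As String) As String
    Dim rx As Object

    Set rx = CreateObject("VBScript.RegExp")
    rx.Global = True                          ' handle the leading and the trailing run
    rx.Pattern = "^[\s\xA0]+|[\s\xA0]+$"      ' \xA0 = character code 160
    TrimNbsp = rx.Replace(s, "")
End Function
```

Because MultiLine is left at its default of False, ^ and $ only match the very start and end of the string, so embedded line breaks in the middle of the text are kept.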
Non-printables divide different lines of a Web page. I replaced them with X, Y and Z respectively.
Debug.Print Trim(Mid("X test ", 2)) ' first place counts as 2 in VBA
Debug.Print Trim(Mid("XY test ", 3)) ' second place counts as 3 in VBA
Debug.Print Trim(Mid("X Y Z test ", 2)) ' more rounds needed :)
Programmers often prefer to work on one large block of text, since it can be chopped up neatly with built-in tools (InStr, Mid, Left, and others). Text gathered from several child elements (i.e. taking .textContent versus .innerText) may contain several non-printable characters to cope with, yet DOM and regex are not for beginners. Addressing the sub-elements precisely for their inner text (child elements one by one!) may help to avoid non-printable characters.
