Quickly remove unnecessary whitespace from a (very large) string - excel

I'm working with very large (45,000,000+ character) strings in VBA, and I need to remove superfluous whitespace.
One space (aka, ASCII Code 32) is okay but any sections with two or more consecutive spaces should be reduced to only one.
I found a similar question here, although that OP's definition of a "very long string" was only 39,000 characters. The accepted answer was a loop using Replace:
Function MyTrim(s As String) As String
Do While InStr(s, " ") > 0
s = Replace$(s, " ", " ")
Loop
MyTrim = Trim$(s)
End Function
I tried this method and it was "worked", but was painfully slow:
Len In: 44930886
Len Out: 35322469
Runtime: 247.6 seconds
Is there a faster way to remove whitespace from a "very large" string?

I suspect the performance problem is due to creating a very large number of large intermediate strings. So, any method that does things without creating intermediate strings or with much fewer would perform better.
A Regex replace has a good chance of that.
Option Explicit
Sub Test(ByVal text As String)
Static Regex As Object
If Regex Is Nothing Then
Set Regex = CreateObject("VBScript.RegExp")
Regex.Global = True
Regex.MultiLine = True
End If
Regex.Pattern = " +" ' space, one or more times
Dim result As String: result = Regex.Replace(text, " ")
Debug.Print Len(result), Left(result, 20)
End Sub
With an input string of 45 million characters takes about a second.
Runner:
Sub Main()
Const ForReading As Integer = 1
Const FormatUTF16 As Integer = -1 ' aka TriStateTrue
Dim fso As Object: Set fso = CreateObject("Scripting.FileSystemObject")
Dim file As Object: Set file = fso.OpenTextFile("C:\ProgramData\test.txt", ForReading, False, FormatUTF16)
Dim text As String: text = file.ReadAll()
Set file = Nothing
Set fso = Nothing
Debug.Print Len(text), Left(text, 20)
Test (text)
End Sub
Test data creator (C#):
var substring = "××\n× ×× ";
var text = String.Join("", Enumerable.Repeat(substring, 45_000_000 / substring.Length));
var encoding = new UnicodeEncoding(false, false);
File.WriteAllText(#"C:\ProgramData\test.txt", text, encoding);
BTW—Since VBA (VB4, Java, JavaScript, C#, VB, …) uses UTF-16, the space character is the one UTF-16 code unit ChrW(32). (Any similarity to or comparison with ASCII, is unnecessary mental gymnastics, and if put into code as ANSI [Chr(32)], unnecessary conversion behind the scenes, with different behavior for different machines, users and times.)

In VBA, the size of a String is limited to approximately 2 Billion Characters. The "Replace-Loop" method above took 247 seconds for a 45 Million character string, which is over 4 minutes.
Theoretically, that means a 2 Billion character string would take at least 3 hours — if it even finished without crashing — so it's not exactly practical.
Excel has a built-in worksheet function Trim which is not the same as VBA's Trim function.
Worksheet function Trim removes all spaces from text except for single spaces between words.
The problem is that Trim, like all functions called with Application.WorksheetFunction, has a size limit of 32,767 characters, and this [unfortunately] applies even when calling the function from VBA with a string that's not even in a cell.
However, we can still use the function if we use it to loop through our "gigantic string" in sections, like this:
EDIT: Don't even bother with this crap (my function, below)! See the RegEx answer above.
Function bigTrim(strIn As String) As String
Const maxLen = 32766
Dim loops As Long, x As Long
loops = Int(Len(strIn) / maxLen)
If (Len(strIn) / maxLen) <> loops Then loops = loops + 1
For x = 1 To loops
bigTrim = bigTrim & _
Application.WorksheetFunction.Trim(Mid(strIn, _
((x - 1) * maxLen) + 1, maxLen))
Next x
End Function
Running this function on the same string that was used with the "Replace-Loop" method yielded much better results:
Len In: 44930886
Len Out: 35321845
Runtime: 33.6 seconds
That's more than 7x faster than the "Replace-Loop" method, and managed to remove 624 spaces that were somehow missed by the other method.
(I though about looking into why the first method missed characters, but since I know my string isn't missing anything, and the point of this exercise was to save time, that would be silly!) ☺

Related

Tokenize Myanmar names, typed using Pyidaungsu Unicode font, in MS Excel, using VBA UDF

I searched everywhere for a way to sort Myanmar names, typed up in Pyidaungsu unicode font, by the last consonant in MS Excel.
Doing the same in English is relatively easy using Excel's builtin formulae/functions.
But it is hard for Myanmar names in Burmese because Myanmar names do not require a white space between each word and the first, middle and last names are not that distinct as in, eg. John W. Smith.
In a Myanmar name, eg. အောင်မြင့်မြတ်=Aung Myint Myat, there is no distinct first/last name and no white space is required if it is written in Myanmar font!
Thus, it is pretty hard to find the word boundary between each word, i.e, where အောင် starts and ends and မြင့် starts and ends etc. and so on!
So I need a VBA UDF to be able to tokenize Myanmar names!
After much searching and reading through NLP literature, a lot of which I don't really understand, I realized that the Myanmar font, Pyidaungsu by name, has a character binding method where all Myanmar characters: consonants and diacritics were bound together like: the consonants come first for each word, followed by diacritics (or may be I am wrong about how it is called).
So if only I could place a delimiter/separator just before each consonant, I should be able to tokenize each word!
Fortunately, it helps me write VBA code like:
Const kagyi = 4096
Const ah = 4129 '+9 to include ou
Const athat = 4154
Const shiftF = 4153 'for typing something under something
Const witecha = 4140
Const moutcha = 4139
'Return a tokenized Myanmar String
Function MMRTokenizer(target As Range) As String
Dim ch As String
Dim returnString As String
Dim charCounter As Integer
Dim previousChIsAthat As Boolean
Dim shiftFfound As Boolean
Dim previousCharAt As Long
If target.Cells.CountLarge > 1 Then MMRTokenizer = ">1Cell!": Exit Function
returnString = "": previousChIsAthat = False: shiftFfound = False: previousCharAt = Len(target.Value) + 1
If target.CountLarge = 1 Then
If target.Value <> "" Then
For charCounter = Len(target.Value) To 1 Step -1
ch = Mid(target.Value, charCounter, 1)
If AscW(ch) <> shiftF Then
If Not shiftFfound Or AscW(ch) = athat Then
If AscW(ch) <> athat Then
If AscW(ch) >= kagyi And AscW(ch) < ah + 9 Then
If Not previousChIsAthat Then
returnString = Mid(target.Value, charCounter, previousCharAt - charCounter) & IIf(Len(returnString) > 0, "|", "") & returnString
previousCharAt = charCounter
Else
previousChIsAthat = False
End If
Else
If AscW(ch) = witecha Or AscW(ch) = moutcha Then
previousChIsAthat = False
End If
End If
Else
previousChIsAthat = True
If shiftFfound Then shiftFfound = False
End If
Else
shiftFfound = False
If previousChIsAthat Then previousChIsAthat = False
End If
Else
shiftFfound = True
End If
Next charCounter
End If
End If
MMRTokenizer = returnString
End Function
In theory, it should be pretty simple since I am not using any NLP or ML methods but employed some string manipulations only.
I took out each character of the name/word from the right (it may be ok to start from the left) then go left until I found a consonant and place a separator/delimiter to the left of it and then keep going left and repeating the same process until the left-most character is reached.
The caveat here is, that, sometimes, there could be a consonant, which in Myanmar language is part of a combination of a consonant and a diacritic (pretty common behavior), eg. in အောင်=‌ေ+အ+ ာ+င+် though it looks like that way, the Pyidaungsu font bound it like အ+‌ေ+ာ+င+် ,if it were entered using Windows Burmese keyboard (Visual Order), the rightmost two, င+် where င=consonant called nga and ် =diacritic called Athat.
In such cases, we just skip over that renegade consonant (if we encountered that specific diacritic just right of it) as it should not be counted as such, according the Burmese way of spelling words.
I used chrW and ascW functions because Myanmar font cannot be rendered in VBIDE (even after tweaking in the Regional settings) and thus, I am forced to check the unicode character codes instead of directly comparing Burmese characters.
Above is just a gist of how the whole thing works.
Further details are available on my GitHub.
After we tokenized like above, we got something like: အောင်|မြင့်|မြတ် which is now pretty easy to be splitted up or reversed using builtin Excel formulae to become မြတ်|မြင့်|အောင် so that it can now be sorted by the last word (or last name) or separated into a last name/first name basis!
NB: This whole tokenization process could/may be achieved by using a combination of various formulae in Excel as nothing is impossible, especially in Excel365 (where arrays just spill without CSE), IMHO, however, I hope that we can easily see the benefits vs. complexity and effort in this case.
I, hereby, admit that the above code may not be the most elegant, but, it is a proven-working proof-of-concept tool, so employ it at your own risk but bugs can be reported to my GitHub provided above.

How do i get part of string after a special character?

I have a column where i pickup increasing numbering values, and their format is xx_yy
so the first is 1_0, second 1_1 and so forth, no we are at 23_31
I want to get the right side of the string, and i am already getting the left side correctly.
using
newActionId = Left(lastActionID, (Application.WorksheetFunction.Find("_", lastActionID, 1) - 1))
i wish to do the following, human writing below
nextSubid = entire stringvalue AFTER special character "_"
I tried just switching left to right, didnt go so well, do you have a suggestion?
You can use Split function to get the relevant text.
Syntax: Split(expression, [ delimiter, [ limit, [ compare ]]])
Option Explicit
Sub Sample()
Dim id As String
Dim beforeSplChr As String
Dim afterSplChr As String
id = "23_31"
beforeSplChr = Split(id, "_")(0)
afterSplChr = Split(id, "_")(1)
Debug.Print beforeSplChr
Debug.Print afterSplChr
End Sub
Another way
Debug.Print Left(id, (InStrRev(id, "_", -1) - 1)) '<~~ Left Part
Debug.Print Right(id, (InStrRev(id, "_", -1) - 1)) '<~~ Right Part
Even though Siddharth Rout has given what can probably be considered a better answer here, I felt that this was worth adding:
To get the second part of the string using your original method, you would want to use the Mid function in place of Left, rather than trying to use Right.
Mid(string, start, [ length ])
Returns length characters from string, starting at the start position
If length is omitted, then will return characters from the start position until the end of the string
newActionId = Mid(lastActionID, Application.WorksheetFunction.Find("_", lastActionID, 1) + 1)
Just for fun (Split is the way to go here), an alternative way using regular expressions:
Sub Test()
Dim str As String: str = "23_31"
With CreateObject("VBScript.RegExp")
.Global = True
.Pattern = "\d+"
Debug.Print .Execute(str)(0) 'Left Part
Debug.Print .Execute(str)(1) 'Right Part
End With
End Sub
Btw, as per my comment, your first value could also be achieved through:
Debug.Print Val(str)
Split function of string is very usefull for this type of query.
Like:
String s = "23_34";
String left = s.split("_")[0];
String right = s.split("_")[1];
Or you can also use combination of indexOf and substring method together.
String left = s.substring(0,s.indexOf('_')+1)
String right = s.substring(s.indexOf('_'));

Replace Whole Words from a Predefined List

I currently have coding which will review an equipment description field, the aim of which to standardize entries. That is - whatever is found in COL A, replace with COL B
I want to post the answer back to a new clean description column (that will work OK, no dramas on that section, but I don't need any messages etc, and this may be doing 100,000+ descriptions at a time, so looking for efficient coding).
However when it applies the Replace function, it also replaces part words, instead of distinct whole words, no matter how I sort the words on the Dictionary tab.
** 99 times out of a hundred there are no preceding or trailing spaces in Col A entries, but there are rare occasions...
Description Examples:
AIR COMPRESSOR
LEVEL GAUGE OIL SEPARATOR GAS COMPRESSOR
PRESS CTRL VV
PRESSURE GAUGE FLAME FRONT
PRESS as part of word becomes PRESSURE, e.g.:
COL A: COL B:
COMPRESSSOR COMPRESSOR
PRESSURE PRESSURE
PRESSURE GAUGE PRESSURE GAUGE
PRESS PRESSURE
AIR COMPRESSOR AIR COMPRESSOR
I think I'm very close to getting this right, but I can't figure out how to adjust to make it run and replace whole words only - I think it is the order of where I have stuff, but not 100% sure, or if something is missing.
I would greatly appreciate your help with this.
Thanks, Wendy
Function CleanUntil(original As String, targetReduction As Integer)
Dim newString As String
newString = original
Dim targetLength As Integer
targetLength = Len(original) - targetReduction
Dim rowCounter As Integer
rowCounter = 2
Dim CleanSheet As Worksheet
Set CleanSheet = ActiveWorkbook.Sheets("Dictionary")
Dim word As String
Dim cleanword As String
' Coding for replacement of WHOLE words - with a regular expression using a pattern with the \b marker (for the word boundary) before and after word
Dim RgExp As Object
Set re = CreateObject("VBScript.RegExp")
With RgExp
.Global = True
'.IgnoreCase = True 'True if search is case insensitive. False otherwise
End With
'Loop through each word until we reach the target length (or other value noted), or run out of clean words to apply
'While Len(newString) > 1 (this line will do ALL descriptions - confirmed)
'While Len(newString) > targetLength (this line will only do to target length)
While Len(newString) > 1
word = CleanSheet.Cells(rowCounter, 1).Value
cleanword = CleanSheet.Cells(rowCounter, 2).Value
RgExp.Pattern = "\b" & word & "\b"
If (word = "") Then
CleanUntil = newString
Exit Function
End If
' TODO: Make sure it is replacing whole words and not just portions of words
' newString = Replace(newString, word, cleanword) ' This line works if no RgExp applied, but finds part words.
newString = RgExp.Replace(newString, word, cleanword)
rowCounter = rowCounter + 1
Wend
' Once word find/replace finished, set close out loop for RgExp Object with word boundaries.
Set RgExp = Nothing
' Finally return the cleaned string as clean as we could get it, based on dictionary
CleanUntil = newString
End Function
NB: I would strongly recommend adding a reference to the Microsoft VBScript Regular Expressions 5.5 library (via Tools -> References...). This will give you strong typing and Intellisense on the RegExp object.
Dim RgExp As New RegExp
If I understand correctly, you can find the entries that need to be replaced using a regular expression; the regular expression only matches entries where the value in A is a complete word.
But when you try to replace with the VBA Replace function, it replaces even partial words in the text. And using the RegExp.Replace method has no effect -- the string always remains the same.
This is a quirk of the regular expression engine used in VBA. You cannot replace a complete match; you can only replace something which has been captured in a group, using ( ).
RgExp.Pattern = "\b(" & word & ")\b"
' ...
newString = RgExp.Replace(newString, cleanword)
If you want to exclude the hyphen from the boundary characters, you might be able to use a negative pattern which excludes any word characters or the hyphen:
RgExp.Pattern = "[^\w-](" & word & ")[^w-]"
Reference:
Replace method
Introduction to the VBScript regular expression library

How to break a text block up so that it will display only One Word on each line

I am importing longer form text into a Unity program. I need one word of the longer text to be displayed on each line...
Thanks
The problem with working with large blocks of text in Word is that operations like Find and Replace can only be performed with Find text strings of 255 characters or less without causing an error. Once you import your text and assign it to a string variable, you can use Len() to determine the length of the string and then use Left() Mid() and Right() to breakup the larger string into shorter chunks of 250 characters each. Here's some code I wrote for just a find and replace situation:
With Selection.Find
y = Len(Selection.Text)
Select Case y
Case Is <= 250
x = 1
.Text = stFound
.Execute Replace:=wdReplaceAll
Case Is <= 500
Dim stFound2 As String
x = 2
z = Len(stFound) - 250
stFound1 = Left(stFound, 250)
stFound2 = Right(stFound, z)
Case Is <= 750
Dim stFound2 As String
Dim stFound3 As String
x = 3
stFound1 = Left(stFound, 250)
stFound2 = Mid(stFound, 251, 249)
stFound3 = Right(stFound, Len(stFound) - 500)
End Select
End With
I then used a For Next loop to run a Find and Replace on each string.
In your situation, it's going to be important to not break up the strings in the middle of a word. To do this you can use the InStr() function to find the position of spaces within your string and then break up the text according to where the spaces are. I wouldn't try using the Split() function on the raw text as depending on the size of the string you could run into a Subscript Out of Range error.
Once the text is chunked down into useable pieces, use the Split() function to send each word to an array and then run the following code to put each word on it's own line or paragraph:
Dim stTxt as String
dim stWord as String
dim stArr() as String
dim x as long
stTxt = 'One of your text strings
stArr() = Split(stTxt)
For x = LBound(stArr()) to UBound(stArr())
stWord = stArr(x) & "^p"
Selection.Typetext stWord
Next
After a little more research, I determined that the 255 character limit to text strings only affects some functions, not all. So I took a 17,335 character (including spaces) Word document and ran Split() on it to create an Array. There were no errors and the resulting array had a UBound of 2690.
So the next question is what kind of text is being imported into Word and what size is it. Is it just a list of words separated by spaces, or another delimiter? Does it contain any punctuation? If it's just a list of words separated by spaces or another delimiter such as a comma or semicolon, the Split() function will sort the words into an Array, at least up to 17,000 characters. More testing would be required for a larger text block. If the text contains punctuation, you would have to process the text to remove the unwanted punctuation which can be done with a Wildcard Find and Replace as long as the Find string is <= 255 characters. But if all you have are words and spaces or some other delimiter, using Split() to separate each word into an array element would work and then just run code as in the second half of my previous example:
For x = LBound(stArr()) to UBound(stArr())
stWord = stArr(x) & "^p"
Selection.Typetext stWord
Next

VBA Trim leaving leading white space

I'm trying to compare strings in a macro and the data isn't always entered consistently. The difference comes down to the amount of leading white space (ie " test" vs. "test" vs. " test")
For my macro the three strings in the example should be equivalent. However I can't use Replace, as any spaces in the middle of the string (ex. "test one two three") should be retained. I had thought that was what Trim was supposed to do (as well as removing all trailing spaces). But when I use Trim on the strings, I don't see a difference, and I'm definitely left with white space at the front of the string.
So A) What does Trim really do in VBA? B) Is there a built in function for what I'm trying to do, or will I just need to write a function?
Thanks!
So as Gary's Student aluded to, the character wasn't 32. It was in fact 160. Now me being the simple man I am, white space is white space. So in line with that view I created the following function that will remove ALL Unicode characters that don't actual display to the human eye (i.e. non-special character, non-alphanumeric). That function is below:
Function TrueTrim(v As String) As String
Dim out As String
Dim bad As String
bad = "||127||129||141||143||144||160||173||" 'Characters that don't output something
'the human eye can see based on http://www.gtwiki.org/mwiki/?title=VB_Chr_Values
out = v
'Chop off the first character so long as it's white space
If v <> "" Then
Do While AscW(Left(out, 1)) < 33 Or InStr(1, bad, "||" & AscW(Left(out, 1)) & "||") <> 0 'Left(out, 1) = " " Or Left(out, 1) = Chr(9) Or Left(out, 1) = Chr(160)
out = Right(out, Len(out) - 1)
Loop
'Chop off the last character so long as it's white space
Do While AscW(Right(out, 1)) < 33 Or InStr(1, bad, "||" & AscW(Right(out, 1)) & "||") <> 0 'Right(out, 1) = " " Or Right(out, 1) = Chr(9) Or Right(out, 1) = Chr(160)
out = Left(out, Len(out) - 1)
Loop
End If 'else out = "" and there's no processing to be done
'Capture result for return
TrueTrim = out
End Function
TRIM() will remove all leading spaces
Sub demo()
Dim s As String
s = " test "
s2 = Trim(s)
msg = ""
For i = 1 To Len(s2)
msg = msg & i & vbTab & Mid(s2, i, 1) & vbCrLf
Next i
MsgBox msg
End Sub
It is possible your data has characters that are not visible, but are not spaces either.
Without seeing your code it is hard to know, but you could also use the Application.WorksheetFunction.Clean() method in conjunction with the Trim() method which removes non-printable characters.
MSDN Reference page for WorksheetFunction.Clean()
Why don't you try using the Instr function instead? Something like this
Function Comp2Strings(str1 As String, str2 As String) As Boolean
If InStr(str1, str2) <> 0 Or InStr(str2, str1) <> 0 Then
Comp2Strings = True
Else
Comp2Strings = False
End If
End Function
Basically you are checking if string1 contains string2 or string2 contains string1. This will always work, and you dont have to trim the data.
VBA's Trim function is limited to dealing with spaces. It will remove spaces at the start and end of your string.
In order to deal with things like newlines and tabs, I've always imported the Microsoft VBScript RegEx library and used it to replace whitespace characters.
In your VBA window, go to Tools, References, the find Microsoft VBScript Regular Expressions 5.5. Check it and hit OK.
Then you can create a fairly simple function to trim all white space, not just spaces.
Private Function TrimEx(stringToClean As String)
Dim re As New RegExp
' Matches any whitespace at start of string
re.Pattern = "^\s*"
stringToClean = re.Replace(stringToClean, "")
' Matches any whitespace at end of string
re.Pattern = "\s*$"
stringToClean = re.Replace(stringToClean, "")
TrimEx = stringToClean
End Function
Non-printables divide different lines of a Web page. I replaced them with X, Y and Z respectively.
Debug.Print Trim(Mid("X test ", 2)) ' first place counts as 2 in VBA
Debug.Print Trim(Mid("XY test ", 3)) ' second place counts as 3 in VBA
Debug.Print Trim(Mid("X Y Z test ", 2)) ' more rounds needed :)
Programmers prefer large text as may neatly be chopped with built in tools (inSTR, Mid, Left, and others). Use of text from several children (i.e taking .textContent versus .innerText) may result several non-printables to cope with, yet DOM and REGEX are not for beginners. Addressing sub-elements for inner text precisely (child elements one-by-one !) may help evading non-printable characters.

Resources