How to break a text block up so that it will display only One Word on each line - text

I am importing longer form text into a Unity program. I need one word of the longer text to be displayed on each line...
Thanks

The problem with working with large blocks of text in Word is that operations like Find and Replace can only be performed with Find text strings of 255 characters or less without causing an error. Once you import your text and assign it to a string variable, you can use Len() to determine the length of the string and then use Left() Mid() and Right() to breakup the larger string into shorter chunks of 250 characters each. Here's some code I wrote for just a find and replace situation:
With Selection.Find
y = Len(Selection.Text)
Select Case y
Case Is <= 250
x = 1
.Text = stFound
.Execute Replace:=wdReplaceAll
Case Is <= 500
Dim stFound2 As String
x = 2
z = Len(stFound) - 250
stFound1 = Left(stFound, 250)
stFound2 = Right(stFound, z)
Case Is <= 750
Dim stFound2 As String
Dim stFound3 As String
x = 3
stFound1 = Left(stFound, 250)
stFound2 = Mid(stFound, 251, 249)
stFound3 = Right(stFound, Len(stFound) - 500)
End Select
End With
I then used a For Next loop to run a Find and Replace on each string.
In your situation, it's going to be important to not break up the strings in the middle of a word. To do this you can use the InStr() function to find the position of spaces within your string and then break up the text according to where the spaces are. I wouldn't try using the Split() function on the raw text as depending on the size of the string you could run into a Subscript Out of Range error.
Once the text is chunked down into useable pieces, use the Split() function to send each word to an array and then run the following code to put each word on it's own line or paragraph:
Dim stTxt as String
dim stWord as String
dim stArr() as String
dim x as long
stTxt = 'One of your text strings
stArr() = Split(stTxt)
For x = LBound(stArr()) to UBound(stArr())
stWord = stArr(x) & "^p"
Selection.Typetext stWord
Next

After a little more research, I determined that the 255 character limit to text strings only affects some functions, not all. So I took a 17,335 character (including spaces) Word document and ran Split() on it to create an Array. There were no errors and the resulting array had a UBound of 2690.
So the next question is what kind of text is being imported into Word and what size is it. Is it just a list of words separated by spaces, or another delimiter? Does it contain any punctuation? If it's just a list of words separated by spaces or another delimiter such as a comma or semicolon, the Split() function will sort the words into an Array, at least up to 17,000 characters. More testing would be required for a larger text block. If the text contains punctuation, you would have to process the text to remove the unwanted punctuation which can be done with a Wildcard Find and Replace as long as the Find string is <= 255 characters. But if all you have are words and spaces or some other delimiter, using Split() to separate each word into an array element would work and then just run code as in the second half of my previous example:
For x = LBound(stArr()) to UBound(stArr())
stWord = stArr(x) & "^p"
Selection.Typetext stWord
Next

Related

Replace Whole Words from a Predefined List

I currently have coding which will review an equipment description field, the aim of which to standardize entries. That is - whatever is found in COL A, replace with COL B
I want to post the answer back to a new clean description column (that will work OK, no dramas on that section, but I don't need any messages etc, and this may be doing 100,000+ descriptions at a time, so looking for efficient coding).
However when it applies the Replace function, it also replaces part words, instead of distinct whole words, no matter how I sort the words on the Dictionary tab.
** 99 times out of a hundred there are no preceding or trailing spaces in Col A entries, but there are rare occasions...
Description Examples:
AIR COMPRESSOR
LEVEL GAUGE OIL SEPARATOR GAS COMPRESSOR
PRESS CTRL VV
PRESSURE GAUGE FLAME FRONT
PRESS as part of word becomes PRESSURE, e.g.:
COL A: COL B:
COMPRESSSOR COMPRESSOR
PRESSURE PRESSURE
PRESSURE GAUGE PRESSURE GAUGE
PRESS PRESSURE
AIR COMPRESSOR AIR COMPRESSOR
I think I'm very close to getting this right, but I can't figure out how to adjust to make it run and replace whole words only - I think it is the order of where I have stuff, but not 100% sure, or if something is missing.
I would greatly appreciate your help with this.
Thanks, Wendy
Function CleanUntil(original As String, targetReduction As Integer)
Dim newString As String
newString = original
Dim targetLength As Integer
targetLength = Len(original) - targetReduction
Dim rowCounter As Integer
rowCounter = 2
Dim CleanSheet As Worksheet
Set CleanSheet = ActiveWorkbook.Sheets("Dictionary")
Dim word As String
Dim cleanword As String
' Coding for replacement of WHOLE words - with a regular expression using a pattern with the \b marker (for the word boundary) before and after word
Dim RgExp As Object
Set re = CreateObject("VBScript.RegExp")
With RgExp
.Global = True
'.IgnoreCase = True 'True if search is case insensitive. False otherwise
End With
'Loop through each word until we reach the target length (or other value noted), or run out of clean words to apply
'While Len(newString) > 1 (this line will do ALL descriptions - confirmed)
'While Len(newString) > targetLength (this line will only do to target length)
While Len(newString) > 1
word = CleanSheet.Cells(rowCounter, 1).Value
cleanword = CleanSheet.Cells(rowCounter, 2).Value
RgExp.Pattern = "\b" & word & "\b"
If (word = "") Then
CleanUntil = newString
Exit Function
End If
' TODO: Make sure it is replacing whole words and not just portions of words
' newString = Replace(newString, word, cleanword) ' This line works if no RgExp applied, but finds part words.
newString = RgExp.Replace(newString, word, cleanword)
rowCounter = rowCounter + 1
Wend
' Once word find/replace finished, set close out loop for RgExp Object with word boundaries.
Set RgExp = Nothing
' Finally return the cleaned string as clean as we could get it, based on dictionary
CleanUntil = newString
End Function
NB: I would strongly recommend adding a reference to the Microsoft VBScript Regular Expressions 5.5 library (via Tools -> References...). This will give you strong typing and Intellisense on the RegExp object.
Dim RgExp As New RegExp
If I understand correctly, you can find the entries that need to be replaced using a regular expression; the regular expression only matches entries where the value in A is a complete word.
But when you try to replace with the VBA Replace function, it replaces even partial words in the text. And using the RegExp.Replace method has no effect -- the string always remains the same.
This is a quirk of the regular expression engine used in VBA. You cannot replace a complete match; you can only replace something which has been captured in a group, using ( ).
RgExp.Pattern = "\b(" & word & ")\b"
' ...
newString = RgExp.Replace(newString, cleanword)
If you want to exclude the hyphen from the boundary characters, you might be able to use a negative pattern which excludes any word characters or the hyphen:
RgExp.Pattern = "[^\w-](" & word & ")[^w-]"
Reference:
Replace method
Introduction to the VBScript regular expression library

VBA: Add Carriage Return + Line Feed at the start of Uppercase phrase

I have cells that contain various information.
In these cells, there are multiple Uppercase phrases.
I would like to be able to split the contents of the cell by adding the CHAR(13) + CHAR(10) Carriage return - linefeed combination
to the start of each new Uppercase phrase.
The only consistency is that the multiple Uppercase phrases begin after a period (.) and before open parenthesis "("
Example:
- Add CRLF to start of PERSUADER
- Add CRLF to start of RIVER JEWEL
- Add CRLF to start of TAHITIAN DANCER
- Add CRLF to start of AMBLEVE
- Add CRLF to start of GINA'S HOPE
NOTE:
There are multiple periods (.) in the text.
I have highlighted the text in red for a visual purpose only (normal text/font during import).
I am OK with either formula, UDF or VBA sub.
TEXT
PERSUADER (1) won by a margin first up at Kyneton. Bit of authority about her performance there and with the stable finding form it's easy to see her going right on with that. Ran really well when placed at Caulfield second-up last prep and that rates well against these. RIVER JEWEL (2) has been racing well at big odds. I have to like the form lines that she brings back in class now. Shapes as a key danger. TAHITIAN DANCER (5) will run well. She was okay without a lot of room at Flemington last time. AMBLEVE (13) is winning and can measure up while GINA'S HOPE (11) wasn't too far from River Jewel at Flemington and ties in as a hope off that form line.
I was able to extract with this function - but not able to manipulate the data in the cell
This is my code so far:
Function UpperCaseWords(ByVal S As String) As String
Dim X As Long, Words() As String
Const OkayPunctuation As String = ",."";:'&,-?!"
For X = 1 To Len(OkayPunctuation)
S = Replace(S, Mid(OkayPunctuation, X, 1), " ")
Next
Words = Split(WorksheetFunction.Trim(S))
For X = 0 To UBound(Words)
If Words(X) Like "*[!A-Z]*" Then Words(X) = ""
Next
UpperCaseWords = Trim(Join(Words))
End Function
Your description is not the same as your examples.
None of your examples start after a dot.
Most start after a dot-space except
PERSUADER starts at the start of the string
GINA'S HOPE starts after a space
I incorporated those rules into a regular expression, but, since your upper case words can include punctuation, for brevity I just looked for
- words that excluded lower case letters and digits
- words at least three characters long
If that is not sufficient in your real data, the regex can easily be made more specific:
Option Explicit
Function upperCaseWords(S As String) As String
Dim RE As Object
Set RE = CreateObject("vbscript.regexp")
With RE
.Global = True
.MultiLine = True
.Pattern = "^|\s(\b[^a-z0-9]+\b\s*\()"
upperCaseWords = .Replace(S, vbCrLf & "$1")
End With
End Function
as per your wording
The only consistency is that the multiple Uppercase phrases begin
after a period (.) and before open parenthesis "("
this should do:
Function UpperCaseWords(ByVal s As String) As String
Dim w As Variant
Dim s1 As String
For Each w In Split(s, ". ")
If InStr(w, "(") Then w = Chr(13) + Chr(10) & w
s1 = s1 & w
Next
UpperCaseWords = s1
End Function
Since the OP accepted the formula solution, and here is a formula answer .
Assume data put in A1
In B1, enter formula and copied across until blank :
=TRIM(RIGHT(SUBSTITUTE(TRIM(MID(SUBSTITUTE(SUBSTITUTE(" (. "&$A1," while ",". ")," (",REPT(" ",700)),COLUMN(A1)*700,700))&" ",". ",REPT(" ",300)),300))

VBA: How to find the values after a "#" symbol in a string

I am trying to set the letters after a # symbol to a variable.
For example, x = #BAL
I want to set y = BAL
Or x = #NE
I want y = NE
I am using VBA.
Split() in my opinion is the easiest way to do it:
Dim myStr As String
myStr = "#BAL"
If InStr(, myStr, "#") > 0 Then '<-- Check for your string to not throw error
MsgBox Split(myStr, "#")(1)
End If
As wisely pointed out by Scott Craner, you should check to ensure the string contains the value, which he checks in this comment by doing: y = Split(x,"#")(ubound(Split(x,"#")). Another way you can do it is using InStr(): If InStr(, x, "#") > 0 Then...
The (1) will take everything after the first instance of the character you are looking for. If you were to have used (0), then this would have taken everything before the #.
Similar but different example:
Dim myStr As String
myStr = "#BAL#TEST"
MsgBox Split(myStr, "#")(2)
The message box would have returned TEST because you used (2), and this was the second instance of your # character.
Then you can even split them into an array:
Dim myStr As String, splitArr() As String
myStr = "#BAL#TEST"
splitArr = Split(myStr, "#") '< -- don't append the collection number this time
MsgBox SplitArr(1) '< -- This would return "BAL"
MsgBox SplitArr(2) '< -- This would return "TEST"
If you are looking for additional reading, here is more from the MSDN:
Split Function
Description Returns a zero-based, one-dimensional array containing a specified number of substrings. SyntaxSplit( expression [ ,delimiter [ ,limit [ ,compare ]]] ) The Split function syntax has thesenamed arguments:
expression
Required. String expression containing substrings and delimiters. If expression is a zero-length string(""), Split returns an empty array, that is, an array with no elements and no data.
delimiter
Optional. String character used to identify substring limits. If omitted, the space character (" ") is assumed to be the delimiter. If delimiter is a zero-length string, a single-element array containing the entire expression string is returned.
limit
Optional. Number of substrings to be returned; -1 indicates that all substrings are returned.
compare
Optional. Numeric value indicating the kind of comparison to use when evaluating substrings. See Settings section for values.
You can do the following to get the substring after the # symbol.
x = "#BAL"
y = Right(x,len(x)-InStr(x,"#"))
Where x can be any string, with characters before or after the # symbol.

Quickly remove unnecessary whitespace from a (very large) string

I'm working with very large (45,000,000+ character) strings in VBA, and I need to remove superfluous whitespace.
One space (aka, ASCII Code 32) is okay but any sections with two or more consecutive spaces should be reduced to only one.
I found a similar question here, although that OP's definition of a "very long string" was only 39,000 characters. The accepted answer was a loop using Replace:
Function MyTrim(s As String) As String
Do While InStr(s, " ") > 0
s = Replace$(s, " ", " ")
Loop
MyTrim = Trim$(s)
End Function
I tried this method and it was "worked", but was painfully slow:
Len In: 44930886
Len Out: 35322469
Runtime: 247.6 seconds
Is there a faster way to remove whitespace from a "very large" string?
I suspect the performance problem is due to creating a very large number of large intermediate strings. So, any method that does things without creating intermediate strings or with much fewer would perform better.
A Regex replace has a good chance of that.
Option Explicit
Sub Test(ByVal text As String)
Static Regex As Object
If Regex Is Nothing Then
Set Regex = CreateObject("VBScript.RegExp")
Regex.Global = True
Regex.MultiLine = True
End If
Regex.Pattern = " +" ' space, one or more times
Dim result As String: result = Regex.Replace(text, " ")
Debug.Print Len(result), Left(result, 20)
End Sub
With an input string of 45 million characters takes about a second.
Runner:
Sub Main()
Const ForReading As Integer = 1
Const FormatUTF16 As Integer = -1 ' aka TriStateTrue
Dim fso As Object: Set fso = CreateObject("Scripting.FileSystemObject")
Dim file As Object: Set file = fso.OpenTextFile("C:\ProgramData\test.txt", ForReading, False, FormatUTF16)
Dim text As String: text = file.ReadAll()
Set file = Nothing
Set fso = Nothing
Debug.Print Len(text), Left(text, 20)
Test (text)
End Sub
Test data creator (C#):
var substring = "××\n× ×× ";
var text = String.Join("", Enumerable.Repeat(substring, 45_000_000 / substring.Length));
var encoding = new UnicodeEncoding(false, false);
File.WriteAllText(#"C:\ProgramData\test.txt", text, encoding);
BTW—Since VBA (VB4, Java, JavaScript, C#, VB, …) uses UTF-16, the space character is the one UTF-16 code unit ChrW(32). (Any similarity to or comparison with ASCII, is unnecessary mental gymnastics, and if put into code as ANSI [Chr(32)], unnecessary conversion behind the scenes, with different behavior for different machines, users and times.)
In VBA, the size of a String is limited to approximately 2 Billion Characters. The "Replace-Loop" method above took 247 seconds for a 45 Million character string, which is over 4 minutes.
Theoretically, that means a 2 Billion character string would take at least 3 hours — if it even finished without crashing — so it's not exactly practical.
Excel has a built-in worksheet function Trim which is not the same as VBA's Trim function.
Worksheet function Trim removes all spaces from text except for single spaces between words.
The problem is that Trim, like all functions called with Application.WorksheetFunction, has a size limit of 32,767 characters, and this [unfortunately] applies even when calling the function from VBA with a string that's not even in a cell.
However, we can still use the function if we use it to loop through our "gigantic string" in sections, like this:
EDIT: Don't even bother with this crap (my function, below)! See the RegEx answer above.
Function bigTrim(strIn As String) As String
Const maxLen = 32766
Dim loops As Long, x As Long
loops = Int(Len(strIn) / maxLen)
If (Len(strIn) / maxLen) <> loops Then loops = loops + 1
For x = 1 To loops
bigTrim = bigTrim & _
Application.WorksheetFunction.Trim(Mid(strIn, _
((x - 1) * maxLen) + 1, maxLen))
Next x
End Function
Running this function on the same string that was used with the "Replace-Loop" method yielded much better results:
Len In: 44930886
Len Out: 35321845
Runtime: 33.6 seconds
That's more than 7x faster than the "Replace-Loop" method, and managed to remove 624 spaces that were somehow missed by the other method.
(I though about looking into why the first method missed characters, but since I know my string isn't missing anything, and the point of this exercise was to save time, that would be silly!) ☺

How to split string after x characters and on on last space character in vb

What i want to do is, after X amount of characters, split the string on last whitespace, and then after the next X amount of characters repeat.
Till now i have something like this , using regex but i get error on "Function"
Dim input = "This is a long sentence with more than 18 letters."
Dim output = Regex.Split(input, "(.{1,18})(?:\s|$)").Where(Function(x) x.Length > 0).ToList()

Resources