How to split words in a string, with whitespaces using VBA?

I am trying to Split() a string that contains words and numbers separated by several kinds of whitespace, such as <tab> and <newline> characters.
My approach was to normalise all whitespace into single spaces and then Split() the result into a string array in VBA.
I found a similar question, but its answer only collapses runs of spaces into a single space. I need a variant that also handles other whitespace (ASCII characters below 32).
I tried running this routine and I get run-time error 13 (Type mismatch):
Public Function SplitRe(Text As String, Pattern As String, Optional IgnoreCase As Boolean) As String()
    Static re As Object
    If re Is Nothing Then
        Set re = CreateObject("VBScript.RegExp")
        re.Global = True
        re.MultiLine = True
    End If
    re.IgnoreCase = IgnoreCase
    re.Pattern = Pattern
    SplitRe = Strings.Split(re.Replace(Text, ChrW(-1)), ChrW(-1))
End Function
Here is the string I am trying to split().
<tab>6219<tab><nl>
Changes to Pilot Ops Dashboard<nl>
Daniel Son<tab>Gatehouse<tab>Port of Collinsville<nl>
Medium<nl>
Support<tab>waiting<tab><nl>
30 minutes ago<nl>
8 seconds ago<nl>
5 minutes<nl>
What I am after is a string array of the following (each line is an element in the array):
6219
Changes to Pilot Ops Dashboard
Daniel Son
Gatehouse
Port of Collinsville
Medium
Support
waiting
30 minutes ago
8 seconds ago
5 minutes

You could try the following pattern and function:
Function SplitRE(s As String) As String()
    With CreateObject("vbscript.regexp")
        .Global = True
        .Pattern = "(?:^[\t\n]+|[\t\n]*(?!.)|[\t\n]*([\t\n]))"
        SplitRE = Split(Replace(.Replace(s, "$1"), Chr(9), Chr(10)), Chr(10))
    End With
End Function
The pattern (?:^[\t\n]+|[\t\n]*(?!.)|[\t\n]*([\t\n])) would match:
(?: - Open non-capture group;
^[\t\n]+ - 1+ tab or newline characters after the start-of-line anchor. This removes these leading characters, since unfortunately \A is not available in VBA;
| - Or;
[\t\n]*(?!.) - 0+ tab or newline characters followed by a negative lookahead asserting that the position is not followed by another character (note that . does not match a newline). This removes these trailing characters, since unfortunately \Z is not available to us in VBA;
| - Or;
[\t\n]*([\t\n]) - 0+ tab or newline characters, with a capture group holding the last of them;
) - Close non-capture group.
We then replace all matches with the content of the capture group, and replace any remaining tab characters with newlines before splitting on newline characters.
The above may be a bit verbose but this would prevent empty elements in your array. Example on how to invoke:
Sub test()
    Dim arr() As String
    Dim str As String: str = Replace(Replace("<tab>6219<tab><nl>Changes to Pilot Ops Dashboard<nl>" & _
        "Daniel Son<tab>Gatehouse<tab>Port of Collinsville<nl>Medium<nl>Support<tab>waiting" & _
        "<tab><nl>30 minutes ago<nl>8 seconds ago<nl>5 minutes<nl>", "<tab>", Chr(9)), "<nl>", Chr(10))
    arr = SplitRE(str)
End Sub
Don't mind the Replace() functions; I just had to re-create your input string. The above should yield the eleven-element array requested in the question.
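As a simpler alternative to the pattern above, here is a minimal sketch that treats every run of control characters (character codes 0 to 31, as asked in the question) as a single delimiter. The function name SplitOnControlChars is hypothetical and not part of the original answer, and the sketch assumes that no array element should itself contain a control character.

```vba
Option Explicit

' Hypothetical helper (not from the original answer).
' Collapses each run of control characters (codes 0-31) into one line feed,
' trims a leading/trailing line feed, then splits on line feeds.
Public Function SplitOnControlChars(ByVal s As String) As String()
    Dim cleaned As String
    With CreateObject("VBScript.RegExp")
        .Global = True
        .Pattern = "[\x00-\x1F]+"
        cleaned = .Replace(s, vbLf)
    End With
    ' Remove the delimiter at either end so Split() yields no empty elements
    If Left$(cleaned, 1) = vbLf Then cleaned = Mid$(cleaned, 2)
    If Right$(cleaned, 1) = vbLf Then cleaned = Left$(cleaned, Len(cleaned) - 1)
    SplitOnControlChars = Split(cleaned, vbLf)
End Function

Public Sub DemoSplitOnControlChars()
    Dim parts() As String, i As Long
    parts = SplitOnControlChars(vbTab & "6219" & vbTab & vbLf & _
                                "Daniel Son" & vbTab & "Gatehouse" & vbLf)
    ' Lists the elements in the Immediate window
    For i = LBound(parts) To UBound(parts)
        Debug.Print i, parts(i)
    Next i
End Sub
```

Because an ordinary space is character code 32, multi-word elements such as "Daniel Son" stay intact.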

Related

Replace Numbers followed by Non-Printable Characters Chr(25) into Numbers followed by double quotation

I need to replace non-printable characters with a double quotation mark (").
The problem is that the bad character, Chr(25), can appear once or twice after a number, and can even appear after a number followed by a double quotation mark.
If I use Excel's CLEAN function, it removes every Chr(25) instead of replacing it:
Range("C2") = WorksheetFunction.Clean(Range("B2"))
I also tried the VBA Replace function, but again the problem is the count and position of the non-printable characters:
Range("C2") = Replace(Range("B2"), Chr(25) & Chr(25), """")
'If Chr(25) is single, this code will replace and add again
In advance, grateful for all your help.

I'd suggest a regular expression to catch and replace these characters:
Function RegexReplace(s As String) As String
    With CreateObject("vbscript.regexp")
        .Pattern = "(\d+)[""']*\u0019+"
        .Global = True
        RegexReplace = .Replace(s, "$1""")
    End With
End Function
The pattern means:
(\d+) - Capture any 1+ digits in a capture group;
["']* - Match 0+ double/single quotes;
\u0019+ - Match 1+ 'END OF MEDIUM' characters.
Replace with $1" means to input the captured digits followed by a double quote.
Formula in B1:
=RegexReplace(A1)
Note: If you don't need to specify the digits, you could leave (\d+) out and just use ["']*\u0019+ with a simple replacement of a single ".
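The variant mentioned in the note can be sketched as follows. The function name ReplaceEndOfMedium is hypothetical; the pattern and the single-quote replacement are the ones described in the note above.

```vba
Option Explicit

' Hypothetical variant of RegexReplace without the digit capture group.
' Any run of Chr(25), together with quotes directly before it,
' is replaced by exactly one double quote.
Public Function ReplaceEndOfMedium(ByVal s As String) As String
    With CreateObject("VBScript.RegExp")
        .Global = True
        .Pattern = "[""']*\u0019+"
        ReplaceEndOfMedium = .Replace(s, """")
    End With
End Function

Public Sub DemoReplaceEndOfMedium()
    ' A number followed by two Chr(25) characters
    Debug.Print ReplaceEndOfMedium("12" & Chr(25) & Chr(25))
    ' A number followed by a double quote and one Chr(25)
    Debug.Print ReplaceEndOfMedium("12""" & Chr(25))
End Sub
```

Unlike the version with (\d+), this one also rewrites Chr(25) characters that do not follow a number, so it only fits if that is acceptable for your data.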

Excel: Find and Replace without also grabbing the beginning of another word

I'm currently working on shortening a large Excel sheet using Find/Replace. I'm finding all instances of strings like ", Inc.", ", Co.", " LLC", etc. and replacing them with nothing (i.e. removing them). The problem I am having is that I can't do similar searches for " Inc", ", Inc", ", Co", etc., because that would also remove the beginnings of words like ", Inc"orporated and ", Co"mpany.
Is there a blank character, or something I can do in VBA, that would let me find/replace only items with nothing after them (i.e. finding ", Co" without also catching ", Co"rporation)?

In VBA you can use Regular Expressions to ensure that there are "word boundaries" before and after the abbreviation you are trying to remove. You can also remove extraneous spaces that might appear, depending on the original string.
Function remAbbrevs(S As String, ParamArray abbrevs()) As String
    Dim RE As Object
    Dim sPat As String
    sPat = "\s*\b(?:" & Join(abbrevs, "|") & ")\b\.?"
    Set RE = CreateObject("vbscript.regexp")
    With RE
        .Global = True
        .IgnoreCase = False
        .Pattern = sPat
        remAbbrevs = .Replace(S, "")
    End With
End Function
For arguments to this function you can enter a series of abbreviations. The function creates an appropriate regex to use.
For example, I entered the following formula and filled down:
=remAbbrevs(A1,"Inc","Corp")
Explanation of the regex:
\s*\b(?:Inc|Corp)\b\.?
Options: Case sensitive
Match a single character that is a “whitespace character” \s*
    Between zero and unlimited times, as many times as possible, giving back as needed (greedy) *
Assert position at a word boundary \b
Match the regular expression below (?:Inc|Corp)
    Match this alternative Inc
        Match the character string “Inc” literally Inc
    Or match this alternative Corp
        Match the character string “Corp” literally Corp
Assert position at a word boundary \b
Match the character “.” literally \.?
    Between zero and one times, as many times as possible, giving back as needed (greedy) ?
Created with RegexBuddy
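The question also wants the comma in front of the abbreviation removed (", Inc", ", Co"). A minimal sketch of that, assuming a fixed list of abbreviations, adds an optional leading comma to the same word-boundary idea. The function name StripSuffixes and the hard-coded list Inc|Co|LLC are illustrative assumptions, not part of the answer above.

```vba
Option Explicit

' Hypothetical helper: removes ", Inc", " Co.", " LLC" and similar,
' but leaves longer words such as "Incorporated" or "Company" alone
' because \b requires the abbreviation to end at a word boundary.
Public Function StripSuffixes(ByVal s As String) As String
    With CreateObject("VBScript.RegExp")
        .Global = True
        .IgnoreCase = False
        .Pattern = ",?\s*\b(?:Inc|Co|LLC)\b\.?"
        StripSuffixes = .Replace(s, "")
    End With
End Function

Public Sub DemoStripSuffixes()
    Debug.Print StripSuffixes("Acme, Inc.")          ' abbreviation and comma removed
    Debug.Print StripSuffixes("Acme, Incorporated")  ' left unchanged
    Debug.Print StripSuffixes("Widget Company")      ' left unchanged
End Sub
```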

Excel: Delete Last Word In String IF it is a number

Sorry if this has been asked before, but I can't find any answers anywhere that do what I need... :(
Basically I have a bunch of product titles, some with a number (SKU) at the end and some without.
All SKUs end with a number. In pseudocode: IF the last character of the last word is a number, remove the last word (which isn't actually a word but a number).
Here is an example: "Medium Racer Back Top And Short Lounge Set 7091" (I want to delete the trailing number, 7091 here, if the last character is a number).
As always, thanks in advance for any help :)
D

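You could use a regex to remove the trailing number. Note that the pattern used in the code below, "[^\s]\d+$", requires at least two trailing digits and leaves the preceding space in place; use the pattern "\s\d+$" to remove that white space as well.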
Option Explicit
Public Sub test()
    Debug.Print GetString("Medium Racer Back Top And Short Lounge Set 7091")
End Sub
Public Function GetString(ByVal inputString As String) As Variant
    With CreateObject("vbscript.regexp")
        .Global = True
        .MultiLine = True
        .IgnoreCase = True
        .Pattern = "[^\s]\d+$" '<== use "\s\d+$" to remove the white space before the number as well
        If .test(inputString) Then
            GetString = .Replace(inputString, vbNullString)
        Else
            GetString = inputString
        End If
    End With
End Function
Used as User Defined Function in sheet:
Regex:
[^\s]\d+$
Match a single character not present in the list below [^\s]
\s matches any whitespace character (equal to [\r\n\t\f\v ])
\d+ matches a digit (equal to [0-9])
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
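The question asks to drop the last word whenever its last character is a digit, which is slightly broader than a purely numeric ending. A minimal sketch under that reading follows; the function name RemoveTrailingSku and the pattern \s+\S*\d$ are assumptions, not part of the answer above.

```vba
Option Explicit

' Hypothetical helper: removes the last word, together with the white space
' before it, when that word ends in a digit. Other strings are returned as-is.
Public Function RemoveTrailingSku(ByVal title As String) As String
    With CreateObject("VBScript.RegExp")
        .Global = False
        .MultiLine = False
        ' \s+  white space before the last word
        ' \S*  the last word itself, up to ...
        ' \d$  ... a final digit at the very end of the string
        .Pattern = "\s+\S*\d$"
        RemoveTrailingSku = .Replace(title, vbNullString)
    End With
End Function

Public Sub DemoRemoveTrailingSku()
    Debug.Print RemoveTrailingSku("Medium Racer Back Top And Short Lounge Set 7091")
    Debug.Print RemoveTrailingSku("Medium Racer Back Top And Short Lounge Set")
End Sub
```

Since Replace returns the input unchanged when nothing matches, no separate Test call is needed here.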

excel formula find part number in file path text string

I have an extract of all the files on a network drive, and some of the file names contain a part number in the format 0000-000000-00. With 600,000+ path names in this file, I'm trying to figure out how to extract the part numbers from the path names. I think a MID formula might work, but I am at a loss as to how to tell it to find anything matching the part number format 0000-000000-00 and extract only those 14 characters from the path.
input looks like this
c:\users\stuff\folder_name\1234-000001-01_ baskets_1.pdf
c:\users\stuff\folder_name\1234-000001-02_ baskets_2.pdf
c:\users\stuff\folder_name\1234-000001-03_ baskets_3.pdf
c:\users\stuff\folder_name\1234-000030-01_ tree_30.pdf
c:\users\stuff\folder_name\random text_1234-000030-02_ tree_30.pdf
c:\users\stuff\folder_name\more random stuff_1234-000030-02_ tree_30.pdf
output I'm hoping for
1234-000001-01
1234-000001-02
1234-000001-03
1234-000030-01

Since you have a pattern we can exploit, use this:
=MID(A1,SEARCH("????-??????-??",A1),14)
This finds the start of the pattern and returns the 14 characters beginning at that position.

You wanted a formula but a UDF could also be used to apply a regex to get the pattern (a little overkill in this instance but worth being aware of):
Option Explicit
Public Sub GetCustomString()
    Dim i As Long, tests()
    tests = Array("c:\users\stuff\folder_name\1234-000001-01_ baskets_1.pdf", _
                  "c:\users\stuff\folder_name\1234-000001-02_ baskets_2.pdf", _
                  "c:\users\stuff\folder_name\1234-000001-03_ baskets_3.pdf", _
                  "c:\users\stuff\folder_name\1234-000030-01_ tree_30.pdf", _
                  "c:\users\stuff\folder_name\random text_1234-000030-02_ tree_30.pdf", _
                  "c:\users\stuff\folder_name\more random stuff_1234-000030-02_ tree_30.pdf")
    For i = LBound(tests) To UBound(tests)
        Debug.Print GetString(tests(i))
    Next
End Sub
Public Function GetString(ByVal inputString As String) As String
    Dim re As Object
    Set re = CreateObject("VBScript.RegExp")
    With re
        .Global = True
        .MultiLine = True
        .IgnoreCase = False
        .Pattern = "\d{4}-\d{6}-\d{2}"
        If .test(inputString) Then
            GetString = .Execute(inputString)(0)
        Else
            GetString = vbNullString
        End If
    End With
End Function
Using UDF in sheet:
Pattern: \d{4}-\d{6}-\d{2}
Explanation:
\d{4} matches a digit (equal to [0-9])
{4} Quantifier — Matches exactly 4 times
"-" matches the character - literally (case sensitive)
\d{6} matches a digit (equal to [0-9])
{6} Quantifier — Matches exactly 6 times
"-" matches the character - literally (case sensitive)
\d{2} matches a digit (equal to [0-9])
{2} Quantifier — Matches exactly 2 times
Global pattern flags:
g modifier: global. All matches (don't return after first match)
m modifier: multi line. Causes ^ and $ to match the begin/end of each line (not only begin/end of string)

Search for a specific number (in inch) where the number isn't part of a larger expression

I want to use an Excel formula that returns the correct index in the cell when the text in the cell contains the term '2"' (two inch). This is possible with the search function.
The catch is that I only want to find instances where it's actually '2"', not cases where you have other expressions such as '1/2"' or '12"'. See the image below for an example to clarify where search works and where it doesn't.

I think a VBA solution using Regular Expressions will be easiest in order to be able to return measurements like 1 1/2".
To enter this User Defined Function (UDF), alt-F11 opens the Visual Basic Editor.
Ensure your project is highlighted in the Project Explorer window.
Then, from the top menu, select Insert/Module and
paste the code below into the window that opens.
To use this User Defined Function (UDF), enter a formula like
=FindMeasure(A1,$E$1)
in some cell, where E1 contains a value like 2" or 1 1/2"
Option Explicit
Function FindMeasure(sSearch As String, ByVal sMeasure As String)
    Dim RE As Object, MC As Object, SM As Variant
    Dim sPat As String
    sPat = "\D(\s+)" & sMeasure & "|^" & sMeasure
    Set RE = CreateObject("vbscript.regexp")
    With RE
        .Global = False
        .MultiLine = True
        .Pattern = sPat
    End With
    If RE.test(sSearch) = True Then
        Set MC = RE.Execute(sSearch)
        SM = MC(0).submatches(0)
        FindMeasure = MC(0).firstindex + Len(SM) + IIf(Len(SM) > 0, 2, 1)
    Else
        FindMeasure = 0
    End If
End Function
EDIT: Reviewing my answer reveals that under certain circumstances, incorrect results will be returned.
If there is a "word" preceding the measurement which ends with a digit, the routine will fail to recognize the measurement. This can be avoided by ensuring that there is at least one non-digit in the string preceding the measurement (by modifying the regex). However, if the entire word consists of digits, the measurement will not be recognized.
If the line starts with a SPACE, the measurement will not be recognized. This can be corrected by modifying both the code and the regex to account for that possibility.
If the cell containing the measurement, or the cell containing the string, is blank, then the result will be incorrect. This can be avoided by testing for those conditions, by modifying the code.
Modified Code
Option Explicit
Function FindMeasure(sSearch As String, ByVal sMeasure As String)
    Dim RE As Object, MC As Object, SM As Variant
    Dim sPat As String
    sPat = "(\S*\D\S*\s+)" & sMeasure & "|(^\s*)" & sMeasure
    Set RE = CreateObject("vbscript.regexp")
    With RE
        .Global = False
        .MultiLine = True
        .Pattern = sPat
    End With
    If RE.test(sSearch) = True And _
       Len(sSearch) > 0 And _
       Len(sMeasure) > 0 Then
        Set MC = RE.Execute(sSearch)
        SM = MC(0).submatches(0) & MC(0).submatches(1)
        FindMeasure = MC(0).firstindex + Len(SM) + 1
    Else
        FindMeasure = 0
    End If
End Function
Explanation of Regex with sMeasure = 2"
(\S*\D\S*\s+)2"|(^\s*)2"
Options: Case insensitive; ^$ match at line breaks
Match this alternative (\S*\D\S*\s+)2"
    Match the regex below and capture its match into backreference number 1 (\S*\D\S*\s+)
        Match a single character that is NOT a “whitespace character” \S*
            Between zero and unlimited times, as many times as possible, giving back as needed (greedy) *
        Match a single character that is NOT a “digit” \D
        Match a single character that is NOT a “whitespace character” \S*
            Between zero and unlimited times, as many times as possible, giving back as needed (greedy) *
        Match a single character that is a “whitespace character” \s+
            Between one and unlimited times, as many times as possible, giving back as needed (greedy) +
    Match the character string “2"” literally 2"
Or match this alternative (^\s*)2"
    Match the regex below and capture its match into backreference number 2 (^\s*)
        Assert position at the beginning of a line ^
        Match a single character that is a “whitespace character” \s*
            Between zero and unlimited times, as many times as possible, giving back as needed (greedy) *
    Match the character string “2"” literally 2"
Created with RegexBuddy
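A shorter variant of the same idea can be sketched by stating the rule negatively: the measurement counts only when it is not directly preceded by a digit or a slash. This is a different rule from the one FindMeasure implements (which requires white space in front), and it assumes sMeasure contains no regex metacharacters. The function name FindStandalone is hypothetical.

```vba
Option Explicit

' Hypothetical helper: returns the 1-based position of sMeasure in sSearch
' when it is not directly preceded by a digit or "/", otherwise 0.
Public Function FindStandalone(ByVal sSearch As String, ByVal sMeasure As String) As Long
    Dim mc As Object
    ' Blank inputs give 0 (the default value of a Long function)
    If Len(sSearch) = 0 Or Len(sMeasure) = 0 Then Exit Function
    With CreateObject("VBScript.RegExp")
        .Global = False
        ' Group 1 is either the start of the string or one character
        ' that is neither a digit nor a slash
        .Pattern = "(^|[^0-9/])" & sMeasure
        Set mc = .Execute(sSearch)
    End With
    If mc.Count > 0 Then
        ' FirstIndex is zero-based; skip the character captured by group 1
        FindStandalone = mc(0).FirstIndex + Len(mc(0).SubMatches(0)) + 1
    End If
End Function

Public Sub DemoFindStandalone()
    Debug.Print FindStandalone("1/2"" and 2""", "2""")  ' skips the 2" inside 1/2"
    Debug.Print FindStandalone("12"" pipe", "2""")      ' no standalone 2"
End Sub
```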

Use:
=SEARCH(" 2"""," " & A1)-1
There are three small tricks in this formula:
we search for {space}2"
we place a blank at the start of the string
we account for the blank by subtracting one from the position
EDIT#1:
This may be better. With data in A3 and the string in B2 try:
=IFERROR(IF(LEFT(A3,2)=$B$2,1,SEARCH(" " & $B$2, A3)+1),0)

If I'm reading this right, the result should be "OK" when any of the following holds, and zero otherwise:
1) the string starts with 2" (followed by a space)
2) there is a 2" in the middle of the string (with a space on each side)
3) the string ends with 2" (preceded by a space)
If those are the requirements, this formula should work:
=IF(OR(LEFT(A2,3)="2"" ",ISNUMBER(SEARCH(" 2"" ",A2)),RIGHT(A2,3)=" 2"""),"OK",0)
-- or, depending on your requirements, you may have to use this for the second row --
=IF(OR(LEFT(A2,2)="2""",ISNUMBER(SEARCH(" 2"" ",A2)),RIGHT(A2,3)=" 2"""),"OK",0)
