Extract Excel VBA filenames from string - excel

I want to extract file names from a string. Both the length of the string and the length of the file name vary.
This must be done with VBA.
String:
href ist gleich: "abc/db://test.pdf|0|100">Bsp.:
Desired result:
test.pdf
I do not know how to proceed.
It would also be nice if the script could extract multiple file names from one string.
For example:
String:
href ist gleich: "abc//db://test.t.pdf|0|100" "db://test1.pdf|0|100">Bsp.
Desired result:
test.t.pdf test1.pdf

Sub testExtractFileName()
    Debug.Print extractFileName("file://D:/ETVGI_556/Carconfigurator_file/carconf_d.pdf", "//")
    Debug.Print extractFileName("abc//db://test.t.pdf|0|100")
    Debug.Print extractFileName("db://test1.pdf|0|100")
End Sub
Function extractFileName(initString As String, Optional delim As String) As String
    Dim necString As String
    ' Keep everything up to and including the first ".pdf"
    necString = Left(initString, InStr(initString, ".pdf") + 3)
    ' Keep what follows the last delimiter ("/" when delim is omitted)
    necString = Right(necString, Len(necString) - InStrRev(necString, _
        IIf(delim <> "", delim, "/")) - IIf(delim <> "", Len(delim) - 1, 0))
    extractFileName = necString
End Function
The only requirement is that "//" always appears in front of the file name in the initial string, and of course that the file extension is always .pdf. If another extension is required, the function can easily be adapted.
The function returns everything after the last "//" (the full name) if the optional second parameter is "//", or just the file name (without path) if the parameter is omitted.

One option is a pattern that matches the preceding / and then captures, in a group, one or more word characters (\w+) followed by .pdf.
Your value is in capturing group 1.
/(\w+\.pdf)
If you want a broader match than \w, you can extend what you do want to match with a character class, or use a negated character class [^...] to match any character except the ones listed.
In this case the negated character class [^/|"\s] matches any character except /, |, " or a whitespace character \s:
/([^/|"\s]+\.pdf)
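To compare the two patterns side by side, here is a minimal sketch in Python (chosen only because it is easy to run; the thread itself is about VBA). It applies both patterns to the second sample string from the question. The variable names are illustrative.

```python
import re

# Second sample string from the question.
text = 'href ist gleich: "abc//db://test.t.pdf|0|100" "db://test1.pdf|0|100">Bsp.'

# Narrow pattern: only word characters are allowed in the file name,
# so a name containing an extra dot (test.t.pdf) is not matched.
narrow = re.findall(r'/(\w+\.pdf)', text)

# Broader pattern: anything except /, |, a double quote or whitespace.
broad = re.findall(r'/([^/|"\s]+\.pdf)', text)

print(narrow)  # ['test1.pdf']
print(broad)   # ['test.t.pdf', 'test1.pdf']
```

The narrow pattern misses test.t.pdf because \w does not match the dot inside the name, which is the case the negated character class is meant to cover.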

Try this and edit it according to your needs. At least it was designed for two of your examples.
Dim sStringToFormat As String
Dim i As Integer
Dim vSplit As Variant
Dim colFileNames As Collection
Dim sFormattedString As String
Set colFileNames = New Collection
sStringToFormat = "href ist gleich: ""abc//db://test.t.pdf|0|100"" ""db://test1.pdf|0|100"">Bsp."
vSplit = Split(sStringToFormat, "/")
For i = LBound(vSplit) To UBound(vSplit)
    If InStr(vSplit(i), ".") > 0 Then
        sFormattedString = Split(vSplit(i), "|")(0)
        sFormattedString = Split(sFormattedString, "<")(0)
        sFormattedString = Split(sFormattedString, ">")(0)
        colFileNames.Add sFormattedString
    End If
Next i

Related

Get every word ending with dot using Regex/VBA

I am using excel 2019 and I am trying to extract from a bunch of messed up text cells any (up to 5) word ending with dot that comes after a ].
This is a sample of the text I am trying to parse/clean
some text [asred.] ost. |Monday - Ribben (ult.) lot. ac, sino. other maybe long text; collan.
I expect to get this:
ost. ult. lot. sino. collan.
I am using this Function found somewhere on the internet which appears to do the job:
Public Function RegExtract(Txt As String, Pattern As String) As String
    With CreateObject("vbscript.regexp")
        '.Global = True
        .Pattern = Pattern
        If .test(Txt) Then
            RegExtract = .Execute(Txt)(0)
        Else
            RegExtract = "No match found"
        End If
    End With
End Function
and I call it from an empty cell:
=RegExtract(D2; "([\]])(\s\w+[.]){0,5}")
It's the first time I am using regexp, so I might have done terrible things in the eyes of an expert.
So this is my expression: ([\]])(\s\w+[.]){0,5}
Right now it returns only
] ost.
Which is much more than I was expecting to be able to do on my first approach to regex, but:
1) I am not able to get rid of the first ], which is needed to find the place where my useful bits start inside the text block, since \K does not work in Excel. I might "find and replace" it later as a smart barbarian, but I'd like to know the way to do it cleanly, if any clean way exists :)
2) I don't understand how iterators work to get all my "up to 5 occurrences": I was expecting that {0,5} after the second group meant exactly: "repeat the previous group again until the end of the text block (or until you manage to do it 5 times)".
Thank you for your time :)
--Added after JdvD accepted answer for the records--
I am using this pattern to get all the words ending with dot, after the FIRST occurrence of the closing bracket.
^.*?\]|(\w+\.\s?)|.
This one (without the question mark) instead gets all the words ending with dot, after the LAST occurrence of the closing bracket.
^.*\]|(\w+\.\s?)|.
I was even missing something in my regExtract function: I needed to store the matches into an array through a for loop and then output this array as a string.
I was wrongly assuming that the regex engine was already storing matches as a unique string.
The correct RegExtract function to extract EVERY match is the following:
Public Function RegExtract(Txt As String, Pattern As String) As String
    Dim rMatch As Object, arrayMatches(), i As Long
    With CreateObject("vbscript.regexp")
        .Global = True
        .Pattern = Pattern
        If .Test(Txt) Then
            For Each rMatch In .Execute(Txt)
                If Not IsEmpty(rMatch.SubMatches(0)) Then
                    ReDim Preserve arrayMatches(i)
                    arrayMatches(i) = rMatch.SubMatches(0)
                    i = i + 1
                End If
            Next
            RegExtract = Join(arrayMatches, " ")
        Else
            RegExtract = "No match found"
        End If
    End With
End Function
RegexMatch:
In addition to the answer given by @RonRosenfeld, one could apply what some refer to as 'The Best Regex Trick Ever': first match what you don't want, then match what you do want in a capture group. For example:
^.*\]|(\w+\.)
In short, this means:
^.*\] - Match 0+ (greedy) characters from the start of the string up to the last occurrence of a closing square bracket;
| - Or;
(\w+\.) - Capture group holding 1+ (greedy) word characters followed by a dot.
Here is how it could work in a UDF:
Sub Test()
    Dim s As String: s = "some text [asred.] ost. |Monday - Ribben (ult.) lot. ac, sino. other maybe long text; collan. "
    Debug.Print RegExtract(s, "^.*\]|(\w+\.)")
End Sub
'------
'The above Sub would invoke the below function as an example.
'But you could also invoke this through: `=RegExtract(A1,"^.*\]|(\w+\.)")`
'on your sheet.
'------
Public Function RegExtract(Txt As String, Pattern As String) As String
    Dim rMatch As Object, arrayMatches(), i As Long
    With CreateObject("vbscript.regexp")
        .Global = True
        .Pattern = Pattern
        If .Test(Txt) Then
            For Each rMatch In .Execute(Txt)
                If Not IsEmpty(rMatch.SubMatches(0)) Then
                    ReDim Preserve arrayMatches(i)
                    arrayMatches(i) = rMatch.SubMatches(0)
                    i = i + 1
                End If
            Next
            RegExtract = Join(arrayMatches, " ")
        Else
            RegExtract = "No match found"
        End If
    End With
End Function
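The "match what you don't want, capture what you do want" idea can also be tried outside Excel. The following is a minimal sketch in Python rather than VBA, under the assumption that Python's re module treats this particular pattern the same way; the helper name extract_after_last_bracket is made up for illustration.

```python
import re

def extract_after_last_bracket(text):
    """Hypothetical helper: collect words ending in a dot after the last ']'."""
    pattern = re.compile(r'^.*\]|(\w+\.)')
    # group(1) is None whenever the first ("unwanted") alternative matched,
    # so those matches are simply skipped.
    return [m.group(1) for m in pattern.finditer(text) if m.group(1) is not None]

sample = ("some text [asred.] ost. |Monday - Ribben (ult.) lot. ac, sino. "
          "other maybe long text; collan.")
result = " ".join(extract_after_last_bracket(sample))
print(result)  # ost. ult. lot. sino. collan.
```

The skipped match is the one that swallows everything up to the closing bracket, which is why asred. does not show up in the result.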
RegexReplace:
Depending on your desired output, you could also use a replace function. You'd have to match any remaining character with another alternative for that. For example:
^.*\]|(\w+\.\s?)|.
In short, we added a third alternative that matches any single character. A second small addition is an optional whitespace character \s? at the end of the second alternative.
Sub Test()
    Dim s As String: s = "some text [asred.] ost. |Monday - Ribben (ult.) lot. ac, sino. other maybe long text; collan. "
    Debug.Print RegReplace(s, "^.*\]|(\w+\.\s?)|.", "$1")
End Sub
'------
'There are now 3 parameters to pass to the UDF: String, Pattern and Replacement.
'------
Public Function RegReplace(Txt As String, Pattern As String, Replacement As String) As String
    With CreateObject("vbscript.regexp")
        .Global = True
        .Pattern = Pattern
        RegReplace = Trim(.Replace(Txt, Replacement))
    End With
End Function
Note that I used Trim() to remove possible trailing spaces.
Both RegexMatch and RegexReplace would currently return a single string to clean the input but the former does give you the option to deal with the array in the arrayMatches() variable.
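As a quick check of the replace variant, here is a minimal sketch in Python (not VBA). It uses a replacement function instead of "$1", so the handling of an unmatched group is explicit. The result is what Python's re module produces; I have not verified it against VBScript.RegExp.

```python
import re

sample = ("some text [asred.] ost. |Monday - Ribben (ult.) lot. ac, sino. "
          "other maybe long text; collan. ")

# Keep group 1 when the second alternative matched, otherwise drop the match.
cleaned = re.sub(r'^.*\]|(\w+\.\s?)|.', lambda m: m.group(1) or "", sample).strip()
print(cleaned)  # ost. ult.lot. sino. collan.
```

Note that in this run "ult." is followed by ")" rather than whitespace, so the optional \s? captures nothing and "ult." ends up joined directly to "lot.". If evenly spaced output matters, collecting the matches and joining them with a space is the more predictable route.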
There is a method to return all the matches in a string starting after a certain pattern. But I can't recall it at this time.
In the meantime, it seems the simplest would be to remove everything prior to the first ], and then apply Regex to the remainder.
For example:
Option Explicit
' Early binding: requires a reference to "Microsoft VBScript Regular Expressions 5.5"
Sub findit()
    Const str As String = "some text [asred.] ost. |Monday - Ribben (ult.) lot. ac, sino. other maybe long text; collan."
    Dim RE As RegExp, MC As MatchCollection, M As Match
    Dim S As String
    Dim sOutput As String
    ' Drop everything before the first "]"
    S = Mid(str, InStr(str, "]"))
    Set RE = New RegExp
    With RE
        .Pattern = "\w+(?=\.)"
        .Global = True
        If .Test(S) = True Then
            Set MC = .Execute(S)
            For Each M In MC
                sOutput = sOutput & vbLf & M
            Next M
        End If
    End With
    MsgBox Mid(sOutput, 2)
End Sub
You could certainly limit the number of matches to 5 by using a counter instead of the For Each loop.
You can use the following regex:
([a-zA-Z]+)\.
Let me explain a little bit.
[a-zA-Z] - matches a single letter from a to z or A to Z.
+ - repeats the preceding character class one or more times, so the match continues until it reaches something that is not a letter.
\. - matches the literal . at the end of the word.
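A minimal sketch of this pattern in Python (used here only as a convenient way to run it; the question is about VBA), applied to the sample text from the question:

```python
import re

sample = ("some text [asred.] ost. |Monday - Ribben (ult.) lot. ac, sino. "
          "other maybe long text; collan.")

# findall returns the content of the single capture group, i.e. the word without the dot.
words = re.findall(r'([a-zA-Z]+)\.', sample)
print(words)  # ['asred', 'ost', 'ult', 'lot', 'sino', 'collan']
```

Because the pattern knows nothing about the closing bracket, it also returns asred, which the question wanted to exclude. The text before the ] would still have to be removed first.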

Remove Certain Characters from a String using UDF

I have a column which contain cells that have some list of alphanumeric number system as follows:
4A(4,5,6,7,8,9); 4B(4,5,7,8); 3A(1,2,3); 3B(1,2,3), 3C(1,2)
On a cell next to it, I use a UDF function to get rid of special characters "(),;" in order to leave the data as
4A456789 4B4578 3A123 3B123 3C12
Function RemoveSpecial(Str As String) As String
    Dim SpecialChars As String
    Dim i As Long
    SpecialChars = "(),;-abcdefghijklmnopqrstuvwxyz"
    For i = 1 To Len(SpecialChars)
        Str = Replace$(Str, Mid$(SpecialChars, i, 1), "")
    Next
    RemoveSpecial = Str
End Function
For the most part this works well. However, on certain occasions, the cell would contain an unorthodox pattern such as when a space is included between the 4A and the parenthesized items:
4A (4,5,6,7,8,9);
or when a text appears inside the parenthesis (including two spaces on each side):
4A (4,5, skip 8,9);
or a space appears between the first two characters:
4 A(4,5,6)
How would you fix this so that the random spaces are removed, except the ones that delimit the actual combinations of data?
One strategy would be to substitute the patterns you want to keep before eliminating the "special" characters, then restore the desired patterns.
From your sample data, it looks like you want to keep a space only if it follows ); or ),
Something like this:
Function RemoveSpecial(Data As Variant) As Variant
    Dim SpecialChars As String
    Dim KeepStr As Variant, PlaceHolder As Variant, ReplaceStr As Variant
    Dim i As Long
    Dim DataStr As String
    SpecialChars = " (),;-abcdefghijklmnopqrstuvwxyz"
    KeepStr = Array("); ", "), ")
    PlaceHolder = Array("~0~", "~1~") ' choose a PlaceHolder that won't appear in the data
    ReplaceStr = Array(" ", " ")
    DataStr = Data
    For i = LBound(KeepStr) To UBound(KeepStr)
        DataStr = Replace$(DataStr, KeepStr(i), PlaceHolder(i))
    Next
    For i = 1 To Len(SpecialChars)
        DataStr = Replace$(DataStr, Mid$(SpecialChars, i, 1), vbNullString)
    Next
    For i = LBound(KeepStr) To UBound(KeepStr)
        DataStr = Replace$(DataStr, PlaceHolder(i), ReplaceStr(i))
    Next
    RemoveSpecial = Application.Trim(DataStr)
End Function
Another strategy would be regular expressions (RegEx)
It looks like a regular expression could come in handy here, for example:
Function RemoveSpecial(Str As String) As String
    With CreateObject("vbscript.regexp")
        .Global = True
        .Pattern = "\)[;,]( )|[^A-Z\d]+"
        RemoveSpecial = .Replace(Str, "$1")
    End With
End Function
I have used the regular expression:
\)[;,]( )|[^A-Z\d]+
The way this works is to apply a form of what some would call "The best regex trick ever!"
\)[;,]( ) - Match a literal (escaped) closing parenthesis, then either a comma or a semicolon, and capture the space character that follows in our 1st capture group.
| - Or use the following alternative:
[^A-Z\d]+ - 1+ characters that are not in the given character class, i.e. anything other than an uppercase letter or a digit.
EDIT:
In case you have values like 4A; or 4A, (without parentheses), you can use:
(?:([A-Z])|\))[;,]( )|[^A-Z\d]+
And replace with $1$2.
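To see how the first pattern behaves on the irregular inputs from the question, here is a minimal sketch in Python (not the VBA UDF itself). The function name remove_special is illustrative, and a replacement function stands in for "$1".

```python
import re

def remove_special(value):
    """Hypothetical Python counterpart of the RemoveSpecial UDF."""
    # Keep the space captured after ");" or "),"; drop every other run of
    # characters that are not uppercase letters or digits.
    return re.sub(r'\)[;,]( )|[^A-Z\d]+', lambda m: m.group(1) or "", value)

print(remove_special("4A(4,5,6,7,8,9); 4B(4,5,7,8); 3A(1,2,3); 3B(1,2,3), 3C(1,2)"))
# 4A456789 4B4578 3A123 3B123 3C12
print(remove_special("4A (4,5, skip 8,9);"))  # 4A4589
print(remove_special("4 A(4,5,6)"))           # 4A456
```

The stray spaces and the lowercase word skip fall into the second alternative and disappear, while the separating space survives only where it directly follows ); or ),.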

Split String New Line After 3 Space in VB.net

I have a problem splitting a string onto new lines in VB.NET.
Right now I can split it at every single space. I want to start a new line after every three words instead.
Dim s As String = "SOMETHING BIGGER THAN YOUR DREAM"
Dim words As String() = s.Split(New Char() {" "c})
For Each word As String In words
    Console.WriteLine(word)
Next
Output:
SOMETHING
BIGGER
THAN
YOUR
DREAM
Desired output:
SOMETHING BIGGER THAN
YOUR DREAM
Another alternative, in addition to the existing answers, might be:
Dim separator As Char = CChar(" ")
Dim sArr As String() = "SOMETHING BIGGER THAN YOUR DREAM".Split(separator)
Dim indexOfSplit As Integer = 3
Dim sFinal As String = Join(sArr.Take(indexOfSplit).ToArray, separator) & vbNewLine &
Join(sArr.Skip(indexOfSplit).ToArray, separator)
Console.WriteLine(sFinal)
You can split your input string, then loop over the array of parts and add them to a StringBuilder object.
When the number of parts read so far is a multiple of a defined value (wordsPerLine, here), you append vbNewLine after the current part; otherwise you append a space.
When the loop completes, print the content of the StringBuilder to the Console:
' Requires Imports System.Text for StringBuilder
Dim input As String = "SOMETHING BIGGER THAN YOUR DREAM, NOT MORE THAN YOUR ACCOUNT BALANCE"
Dim wordsPerLine As Integer = 3
Dim wordsCounter As Integer = 1
Dim sb As StringBuilder = New StringBuilder()
For Each word As String In input.Split()
    sb.Append(word & If(wordsCounter Mod wordsPerLine = 0, vbNewLine, " "))
    wordsCounter += 1
Next
Console.WriteLine(sb.ToString())
Prints:
SOMETHING BIGGER THAN
YOUR DREAM, NOT
MORE THAN YOUR
ACCOUNT BALANCE
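The same grouping idea can be sketched without a counter by slicing the word list in steps of wordsPerLine. This is a minimal Python sketch, not VB.NET, and wrap_words is a made-up name.

```python
def wrap_words(text, words_per_line=3):
    """Hypothetical helper: put at most words_per_line words on each line."""
    words = text.split()
    lines = [
        " ".join(words[i:i + words_per_line])
        for i in range(0, len(words), words_per_line)
    ]
    return "\n".join(lines)

wrapped = wrap_words("SOMETHING BIGGER THAN YOUR DREAM, NOT MORE THAN YOUR ACCOUNT BALANCE")
print(wrapped)
```

Joining per line avoids the trailing space that the append-as-you-go approach leaves after the last word.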
Instead of using split, you might capture 3 words in a capturing group and match the trailing whitespace chars.
In the replacement use the group followed by a newline.
Pattern
(\S+(?:\s+\S+){2})\s*
That will match:
( Capture group 1
\S+ Match 1+ non whitespace chars
(?:\s+\S+){2} Repeat 2 times matching 1+ whitespace chars and 1+ non whitespace chars
) Close group 1
\s* Match trailing whitespace chars
Example code
Dim s As String = "SOMETHING BIGGER THAN YOUR DREAM"
Dim output As String = Regex.Replace(s, "(\S+(?:\s+\S+){2})\s*", "$1" + Environment.NewLine)
Console.WriteLine(output)
Output
SOMETHING BIGGER THAN
YOUR DREAM
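For a quick check of the pattern outside .NET, here is a minimal sketch in Python. It assumes Python's re module handles this pattern like the .NET engine does, which holds for the simple constructs used here.

```python
import re

pattern = re.compile(r'(\S+(?:\s+\S+){2})\s*')

def break_after_three_words(text):
    """Hypothetical helper: put a newline after every complete group of three words."""
    return pattern.sub(lambda m: m.group(1) + "\n", text)

five_words = break_after_three_words("SOMETHING BIGGER THAN YOUR DREAM")
six_words = break_after_three_words("A B C D E F")
print(repr(five_words))  # 'SOMETHING BIGGER THAN\nYOUR DREAM'
print(repr(six_words))   # 'A B C\nD E F\n'
```

When the word count is an exact multiple of three, the last group is also followed by a newline, so the result may need trimming.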
String.Join has an overload that will help you.
The first parameter is the separator string to place between the elements of your array.
The second parameter is the array you wish to join.
The third parameter is the starting index; for the first line of your desired output this is the element at index 0.
The fourth parameter is the number of elements to use; for the first line we want three array elements.
Private Sub OPCode()
    Dim s As String = "SOMETHING BIGGER THAN YOUR DREAM"
    Dim words As String() = s.Split(New Char() {" "c})
    Dim line1 As String = String.Join(" ", words, 0, 3)
    Console.WriteLine(line1)
    Dim line2 As String = String.Join(" ", words, 3, words.Length - 3)
    Console.WriteLine(line2)
End Sub

How do i get part of string after a special character?

I have a column where I pick up increasing numbering values in the format xx_yy,
so the first is 1_0, the second 1_1 and so forth; now we are at 23_31.
I want to get the right side of the string. I am already getting the left side correctly,
using
newActionId = Left(lastActionID, (Application.WorksheetFunction.Find("_", lastActionID, 1) - 1))
I wish to do the following (in plain words):
nextSubid = entire string value AFTER the special character "_"
I tried just switching Left to Right, which didn't go so well. Do you have a suggestion?
You can use Split function to get the relevant text.
Syntax: Split(expression, [ delimiter, [ limit, [ compare ]]])
Option Explicit
Sub Sample()
    Dim id As String
    Dim beforeSplChr As String
    Dim afterSplChr As String
    id = "23_31"
    beforeSplChr = Split(id, "_")(0)
    afterSplChr = Split(id, "_")(1)
    Debug.Print beforeSplChr
    Debug.Print afterSplChr
End Sub
Another way:
Debug.Print Left(id, (InStrRev(id, "_", -1) - 1)) '<~~ Left Part
Debug.Print Right(id, Len(id) - InStrRev(id, "_", -1)) '<~~ Right Part (Right takes a character count, not a position)
Even though Siddharth Rout has given what can probably be considered a better answer here, I felt that this was worth adding:
To get the second part of the string using your original method, use the Mid function in place of Left, rather than trying to use Right.
Mid(string, start, [ length ])
Returns length characters from string, starting at the start position.
If length is omitted, it returns all characters from the start position to the end of the string.
nextSubid = Mid(lastActionID, Application.WorksheetFunction.Find("_", lastActionID, 1) + 1)
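The position arithmetic behind Left and Mid is easy to get wrong by one, so here is a minimal sketch of the same split in Python (not VBA). Note that Python indexes from 0 whereas Find and Mid count from 1. The variable names mirror the question but are otherwise illustrative.

```python
last_action_id = "23_31"

# Index-based: everything before the delimiter, and everything after it.
pos = last_action_id.find("_")              # 0-based position of "_"
new_action_id = last_action_id[:pos]        # like Left(s, pos_1based - 1)
next_sub_id = last_action_id[pos + 1:]      # like Mid(s, pos_1based + 1)

# The same result without manual index arithmetic.
left, _, right = last_action_id.partition("_")

print(new_action_id, next_sub_id)  # 23 31
```

The right-hand part has to start one character after the delimiter. Taking a fixed number of characters from the right only works while both halves happen to have the same length.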
Just for fun (Split is the way to go here), an alternative way using regular expressions:
Sub Test()
    Dim str As String: str = "23_31"
    With CreateObject("VBScript.RegExp")
        .Global = True
        .Pattern = "\d+"
        Debug.Print .Execute(str)(0) 'Left Part
        Debug.Print .Execute(str)(1) 'Right Part
    End With
End Sub
Btw, as per my comment, your first value could also be achieved through:
Debug.Print Val(str)
The split function of a string is very useful for this type of query.
For example (Java):
String s = "23_34";
String left = s.split("_")[0];
String right = s.split("_")[1];
Or you can use a combination of the indexOf and substring methods:
String left = s.substring(0, s.indexOf('_'));
String right = s.substring(s.indexOf('_') + 1);

Split a long string by using a complete word

I'm splitting a long string like shown below wherever it finds 'END' keyword:
string_end.Split(New String() {"END"}, StringSplitOptions.None)
This would perfectly split the string into multiple parts wherever it finds 'END' . But the problem arises when a string contains the word 'RECOMMENDED'. It would split it as 'RECOMM' and 'ED'. I want it to split by searching for the whole word, so that words like 'RECOMMENDED' stays as it is. Kindly help.
C# and VB.NET codes would suffice.
You can use Regex.Split to solve this:
Dim rgx As New Regex("\bEND\b")
Dim input As String = "RECOMMENDED AND THE END OF A STRING END"
Dim result() As String = rgx.Split(input)
'Output:
'-----------------------------
'result = {Length=3}
'(0) = "RECOMMENDED AND THE "
'(1) = " OF A STRING "
'(2) = ""
The metacharacter \b is an anchor like the caret and the dollar sign. It matches at a position that is called a "word boundary". This match is zero-length.
There are three different positions that qualify as word boundaries:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
source: http://www.regular-expressions.info/wordboundaries.html
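The word-boundary behaviour described above can be checked with a minimal sketch in Python (the answer itself uses .NET's Regex.Split; Python's re.split is assumed to behave the same for this simple pattern). The helper name is illustrative.

```python
import re

def split_on_whole_word_end(text):
    """Hypothetical helper: split on END only where it stands as a whole word."""
    return re.split(r'\bEND\b', text)

parts = split_on_whole_word_end("RECOMMENDED AND THE END OF A STRING END")
print(parts)  # ['RECOMMENDED AND THE ', ' OF A STRING ', '']

# END at the start of the string or followed by punctuation is still a whole word.
print(split_on_whole_word_end("END TEST"))               # ['', ' TEST']
print(split_on_whole_word_end("TEST END, HELLO WORLD"))  # ['TEST ', ', HELLO WORLD']
```

RECOMMENDED stays intact because the END inside it has word characters on both sides, so no word boundary exists there.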
Why shouldn't you simply add spaces around the delimiter in String.Split?
Dim input As String = "RECOMMENDED AND THE END OF A STRING END"
Dim res() As String = input.Split(New String() {" END "}, StringSplitOptions.None)
'Output:
'----------------------------
'res = {Length=2}
'(0) = "RECOMMENDED AND THE"
'(1) = "OF A STRING END"
With this code the split only works when the word END is surrounded by a space on both sides. But the word can also be next to another character, or sit at the beginning or end of the string:
END TEST - doesn't work
TEST END - doesn't work
TEST END, HELLO WORLD - doesn't work
...
If you're simply looking for the word END, then add a white-space before and after it:
string_end.Split(New String() {" END "}, StringSplitOptions.None)
That said, I don't recommend using Split to decide whether to include or exclude a line.
A cleaner solution would simply be:
For Each line As String In Lines
    If line.Contains(" END ") Then
        'Do Stuff
    End If
Next
You could also use Regex, but that depends on whether you are familiar with it.
Private _pattern As New Regex("\bEND\b", RegexOptions.Compiled)
For Each line As String In Lines
    Dim matches As MatchCollection = _pattern.Matches(line)
    If matches.Count <> 0 Then
        'Do Stuff
    End If
Next
