Regular Expression VBA for multiple rows/table? - excel

I have a text (already stored in a String variable, if you want).
The text is structured as follows:
( 124314 ) GSK67SJ/11 ADS SDK
blah blah blah blah blah
blah blah blah
blah blah blah
( 298 ) 2KEER/98 EOR PRT
blah blah blah
blah blah blah blah blah
etc.
The number of spaces between the words is variable.
The value in brackets is variable, as is the length of the alphanumeric
group (this one always ends with "/" and then two digits).
The "blah blah" text at the end can be split across an unknown number
of lines, each one with a variable number of characters.
The last two groups of letters are always 3 letters each. A newline
("\n") follows immediately after them, without spaces.
The list contains anywhere from 0 to N elements.
For each element I have to store the number, the first 3-letter group, the second 3-letter group, and the "blah blah" text in 4 columns of an Excel file.
Let's say that the columns are A, B, C, D. The result should be as follows (from A1):
124314 | ADS | SDK | blah blah blah.....
298 | EOR | PRT | blah blah.....
.........
Any help would be greatly appreciated

I managed to solve it:
Dim RegX As VBScript_RegExp_55.RegExp 'Remember to add a reference to "Microsoft VBScript Regular Expressions 5.5"
Dim Mats As Object
Dim TextFiltered As String 'Must already hold the source text before the Replace call below
Dim counter As Integer
Set RegX = New VBScript_RegExp_55.RegExp
With RegX
    .Global = True
    .MultiLine = True
    'Collapse each run of 2+ whitespace characters into one space. This undoes the annoying splitting of the "blah blah" text into different lines, APART from the line breaks right before a "( number )" header
    .Pattern = "[\s]{2,}(?!\(\s+(\d+)\s+\))"
    TextFiltered = .Replace(TextFiltered, " ") 'You could also write [\r\n] instead of [\s], but [\s] also collapses all the repeated spaces in one hit
End With
With RegX 'This is the pattern you're looking for; each pair of brackets is a group you can retrieve from the results
    .Pattern = "\(\s+(\d+)\s+\)(\s+\w+/[0-9]{2}\s+)([A-Z]{3})\s+([A-Z]{3})\s+(.+)" 'I think you can remove the "+" from the "\s"
    Set Mats = .Execute(TextFiltered)
End With
For counter = 0 To Mats.Count - 1 'SubMatches gives you the elements one by one (124314, ADS, SDK, etc.); SubMatches(1) is the alphanumeric group, which is skipped here
    MsgBox Mats(counter).SubMatches(0) & " " & Mats(counter).SubMatches(2) & " " & Mats(counter).SubMatches(3) & " " & Mats(counter).SubMatches(4)
Next
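For comparison, the same extraction can be sketched in Python with the standard re module. This is only an illustration under assumptions: the names RECORD, parse_records and sample are made up here, every record is assumed to start at the beginning of a line with "( number )", and the rows are returned as tuples instead of being written to a worksheet.

```python
import re

# One record: a "( number ) CODE/NN AAA BBB" header line, followed by
# free text that runs up to the next header line or the end of the input.
RECORD = re.compile(
    r"^\(\s*(\d+)\s*\)\s+\w+/\d{2}\s+([A-Z]{3})\s+([A-Z]{3})[ \t]*\r?\n"
    r"(.*?)(?=^\(\s*\d+\s*\)|\Z)",
    re.S | re.M,
)


def parse_records(text):
    """Return one (number, first, second, body) tuple per record."""
    rows = []
    for match in RECORD.finditer(text):
        number, first, second, body = match.groups()
        # Join the wrapped "blah blah" lines into a single line.
        rows.append((number, first, second, " ".join(body.split())))
    return rows


# Hypothetical sample input, shaped like the text in the question.
sample = "\n".join([
    "( 124314 )   GSK67SJ/11    ADS   SDK",
    "blah blah blah blah blah",
    "blah blah blah",
    "blah blah blah",
    "( 298 )  2KEER/98   EOR  PRT",
    "blah blah blah",
    "blah blah blah blah blah",
])

rows = parse_records(sample)
for row in rows:
    print(" | ".join(row))
```

Each tuple corresponds to one row of columns A to D; writing them to Excel is left out of the sketch.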

Related

Regex for getting multiple words after a delimiter

I have been trying to get the separate groups from the below string using regex in PCRE:
drop = blah blah blah something keep = bar foo nlah aaaa rename = (a=b d=e) obs=4 where = (foo > 45 and bar == 35)
Groups I am trying to make is like:
1. drop = blah blah blah something
2. keep = bar foo nlah aaaa
3. rename = (a=b d=e)
4. obs=4
5. where = (foo > 45 and bar == 35)
I have written a regex using recursion, but the recursion only partially works: after drop it selects just the first 3 words (blah blah blah) and not the 4th one. I have looked through various Stack Overflow questions and have also tried a positive lookahead, but this is the closest I could get, and now I am stuck because I cannot see what I am doing wrong.
The regex that I have written: (?i)(drop|keep|where|rename|obs)\s*=\s*((\w+|\d+)(\s+\w+)(?4)|(\((.*?)\)))
Same can be seen here: RegEx Demo.
Any help on this or understanding what I am doing wrong is appreciated.
You could use the newer regex module with DEFINE:
(?(DEFINE)
(?<key>\w+)
(?<sep>\s*=\s*)
(?<value>(?:(?!(?&key)(?&sep))[^()=])+)
(?<par>\((?:[^()]+|(?&par))+\))
)
(?P<k>(?&key))(?&sep)(?P<v>(?:(?&value)|(?&par)))
See a demo on regex101.com.
In Python this could be:
import regex as re
data = """
drop = blah blah blah something keep = bar foo nlah aaaa rename = (a=b d=e) obs=4 where = (foo > 45 and bar == 35)
"""
rx = re.compile(r'''
(?(DEFINE)
(?<key>\w+)
(?<sep>\s*=\s*)
(?<value>(?:(?!(?&key)(?&sep))[^()=])+)
(?<par>\((?:[^()]+|(?&par))+\))
)
(?P<k>(?&key))(?&sep)(?P<v>(?:(?&value)|(?&par)))''', re.X)
result = {m.group('k').strip(): m.group('v').strip()
          for m in rx.finditer(data)}
print(result)
And yields
{'drop': 'blah blah blah something', 'keep': 'bar foo nlah aaaa', 'rename': '(a=b d=e)', 'obs': '4', 'where': '(foo > 45 and bar == 35)'}
You can use a branch reset group solution:
(?i)\b(drop|keep|where|rename|obs)\s*=\s*(?|(\w+(?:\s+\w+)*)(?=\s+\w+\s+=|$)|\((.*?)\))
See the PCRE regex demo
Details
(?i) - case insensitive mode on
\b - a word boundary
(drop|keep|where|rename|obs) - Group 1: any of the words in the group
\s*=\s* - a = char enclosed with 0+ whitespace chars
(?| - start of a branch reset group:
(\w+(?:\s+\w+)*) - Group 2: one or more word chars followed with zero or more repetitions of one or more whitespaces and one or more word chars
(?=\s+\w+\s+=|$) - up to one or more whitespaces, one or more word chars, one or more whitespaces, and =, or end of string
| - or
\((.*?)\) - (, then Group 2 capturing any zero or more chars other than line break chars, as few as possible and then )
) - end of the branch reset group.
See Python demo:
import regex
pattern = r"(?i)\b(drop|keep|where|rename|obs)\s*=\s*(?|(\w+(?:\s+\w+)*)(?=\s+\w+\s+=|$)|\((.*?)\))"
text = "drop = blah blah blah something keep = bar foo nlah aaaa rename = (a=b d=e) obs=4 where = (foo > 45 and bar == 35)"
print( [x.group() for x in regex.finditer(pattern, text)] )
# => ['drop = blah blah blah something', 'keep = bar foo nlah aaaa', 'rename = (a=b d=e)', 'obs=4', 'where = (foo > 45 and bar == 35)']
print( regex.findall(pattern, text) )
# => [('drop', 'blah blah blah something'), ('keep', 'bar foo nlah aaaa'), ('rename', 'a=b d=e'), ('obs', '4'), ('where', 'foo > 45 and bar == 35')]
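If installing the third-party regex module is not an option, a similar split can be sketched with the standard library re module, which has neither DEFINE nor branch reset groups. This is a simplified variant, not either of the patterns above: the name PAIR is made up here, and the sketch assumes that parentheses are never nested and that an unparenthesized value contains no "=", "(" or ")".

```python
import re

PAIR = re.compile(
    r"(\w+)\s*=\s*"            # key, then "=" with optional spaces around it
    r"(\([^()]*\)|[^()=]+?)"   # value: "( ... )" or a lazy run of plain text
    r"(?=\s+\w+\s*=|\s*$)"     # stop before the next "key =" or at the end
)

text = ("drop = blah blah blah something keep = bar foo nlah aaaa "
        "rename = (a=b d=e) obs=4 where = (foo > 45 and bar == 35)")

result = {m.group(1): m.group(2) for m in PAIR.finditer(text)}
print(result)
```

The lazy value branch grows one character at a time until the lookahead sees the next key, which is why all four words after drop are kept.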

Split String New Line After 3 Space in VB.net

I have a problem splitting a string onto new lines in VB.NET.
Right now I can split it on every single space, but I want to start a new line only after every third space.
Dim s As String = "SOMETHING BIGGER THAN YOUR DREAM"
Dim words As String() = s.Split(New Char() {" "c})
For Each word As String In words
Console.WriteLine(word)
Next
Output:
SOMETHING
BIGGER
THAN
YOUR
DREAM
Desired output:
SOMETHING BIGGER THAN
YOUR DREAM
Another alternative, in addition to the existing efficient answers, might be:
Dim separator As Char = CChar(" ")
Dim sArr As String() = "SOMETHING BIGGER THAN YOUR DREAM".Split(separator)
Dim indexOfSplit As Integer = 3
Dim sFinal As String = Join(sArr.Take(indexOfSplit).ToArray, separator) & vbNewLine &
Join(sArr.Skip(indexOfSplit).ToArray, separator)
Console.WriteLine(sFinal)
You can split your input string, then loop the array of parts generated and add them to a StringBuilder object.
When you have read a number of parts that is multiple of a defined value, (wordsPerLine, here), you append vbNewLine to the current part.
When the loop completes, print the content of the StringBuilder to the Console:
Dim input As String = "SOMETHING BIGGER THAN YOUR DREAM, NOT MORE THAN YOUR ACCOUNT BALANCE"
Dim wordsPerLine As Integer = 3
Dim wordsCounter As Integer = 1
Dim sb As StringBuilder = New StringBuilder() ' Requires Imports System.Text
For Each word As String In input.Split()
sb.Append(word & If(wordsCounter Mod wordsPerLine = 0, vbNewLine, " "))
wordsCounter += 1
Next
Console.WriteLine(sb.ToString())
Prints:
SOMETHING BIGGER THAN
YOUR DREAM, NOT
MORE THAN YOUR
ACCOUNT BALANCE
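The same counting idea can be sketched in Python by slicing the word list into groups. The function name wrap_words and its words_per_line parameter are hypothetical names chosen for this illustration.

```python
def wrap_words(text, words_per_line=3):
    """Put at most words_per_line words on each output line."""
    words = text.split()
    lines = [
        " ".join(words[start:start + words_per_line])
        for start in range(0, len(words), words_per_line)
    ]
    return "\n".join(lines)


wrapped = wrap_words(
    "SOMETHING BIGGER THAN YOUR DREAM, NOT MORE THAN YOUR ACCOUNT BALANCE"
)
print(wrapped)
```

Unlike appending a separator after every word, joining the groups leaves no trailing space or newline at the end of the result.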
Instead of using split, you might capture 3 words in a capturing group and match the trailing whitespace chars.
In the replacement use the group followed by a newline.
Pattern
(\S+(?:\s+\S+){2})\s*
That will match:
( Capture group 1
\S+ Match 1+ non whitespace chars
(?:\s+\S+){2} Repeat 2 times matching 1+ whitespace chars and 1+ non whitespace chars
) Close group 1
\s* Match trailing whitespace chars
.NET Regex demo | VB.NET demo
Example code
Dim s As String = "SOMETHING BIGGER THAN YOUR DREAM"
Dim output As String = Regex.Replace(s, "(\S+(?:\s+\S+){2})\s*", "$1" + Environment.NewLine)
Console.WriteLine(output)
Output
SOMETHING BIGGER THAN
YOUR DREAM
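The same pattern carries over unchanged to Python's re module, as a rough sketch. Only the replacement syntax differs: \1 instead of $1. The variable names are made up for this illustration.

```python
import re

s = "SOMETHING BIGGER THAN YOUR DREAM"

# Capture three words, swallow the whitespace after them,
# and put a newline in its place.
output = re.sub(r"(\S+(?:\s+\S+){2})\s*", r"\1\n", s)
print(output)
```

When the word count is an exact multiple of three, the result ends with a newline, so a final rstrip("\n") may be wanted.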
String.Join has an overload that will help you.
First parameter is the character to use between elements of your array.
Second parameter is the array you wish to join.
Third parameter is the starting position, for the first line in your desired output this would be the element at index 0.
Fourth parameter is the length to use, for the first line we want three array elements.
Private Sub OPCode()
Dim s As String = "SOMETHING BIGGER THAN YOUR DREAM"
Dim words As String() = s.Split(New Char() {" "c})
Dim line1 As String = String.Join(" ", words, 0, 3)
Console.WriteLine(line1)
Dim line2 As String = String.Join(" ", words, 3, words.Length - 3)
Console.WriteLine(line2)
End Sub

How can I include a list inside a string

I want to include a list variable inside a string. Like this:
aStr = """
blah blah blah
blah = ["one","two","three"]
blah blah
"""
I tried these:
l = ["one","two","three"]
aStr = """
blah blah blah
blah = %s
blah blah
"""%str(l)
OR
l = ["one","two","three"]
aStr = """
blah blah blah
blah = """+str(l)+"""
blah blah
"""
They didn't work unfortunately
Both of your snippets work almost the way you mention. The only difference is that the line that includes your list has spaces after each comma. The attached code gives exactly the output you wanted.
l = ["one","two","three"]
aStr = """
blah blah blah
blah = %s
blah blah
"""% ('["'+'","'.join(l)+'"]')
So what does that do?
The most interesting part here is this one '","'.join(l). The .join() method takes the string that is passed before the dot and joins every element of the list separated by that string.
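As a small runnable illustration of the join approach, the sketch below builds the list text and also shows json.dumps from the standard library as an alternative way to get double-quoted items without spaces. The variable names are made up here, and the json.dumps variant is an addition rather than part of the answer above.

```python
import json

items = ["one", "two", "three"]

# Approach from the answer: glue the items together with '","'.
joined = '["' + '","'.join(items) + '"]'

# Alternative: JSON output uses double quotes; the separators argument
# removes the space that is normally placed after each comma.
dumped = json.dumps(items, separators=(",", ":"))

template = """
blah blah blah
blah = %s
blah blah
"""
a_str = template % joined
print(a_str)
```

Both joined and dumped hold ["one","two","three"] here; json.dumps would additionally escape any quotes inside the items.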
Other ways of doing this:
l = [1,2,3]
"a "+str(l)+" b"
"a {} b".format(l)
"a {name} b".format(name=l)
f"a {l} b"
If you just want a string that includes the list, you could do:
s1 = "blah blah blah"
l = [1,2,3]
s2 = "blah blah blah"
s3 = s1 + str(l) + s2
How about using an f-string?
l = ["one","two","three"]
strQ = f"""
blah blah blah
blah = {l}
blah blah"""
This will give:
"\nblah blah blah\nblah = ['one', 'two', 'three']\nblah blah"
This works in Python 3.6 and later, but not in earlier versions.

Unwanted new-line when plugging a string variable into a RichTextBox

I'm generating a string variable from other string variables, which are themselves filled in on a button press and take their values from textboxes / user input. At the same time, the combined string is loaded into a RichTextBox. While I do purposely use one vbLf in my combined string, I am getting an additional new line at a point in the string where there is not supposed to be one. How can I avoid this?
Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
numb = numbtxt.Text
year = yeartxt.Text
author = authortxt.Text
pages = pagestxt.Text
pgnum = pgnumtxt.Text
item1 = item1txt.Text
item2 = item2txt.Text
format = numb + ") """ + item1 + """" + vbLf + item2 + _
" - " + author + " (" + year + ") " + pages + " " + pgnum
rtf.Text = format
End Sub
I am expecting this:
https://i.imgur.com/5q5MME2.png
but I am getting this:
https://i.imgur.com/dCMAIqA.png
Any help would be very much so appreciated.
Please check the content of your variable item2 containing the URL. I guess that this variable contains a line break at the end. Try printing this variable with a leading and a trailing character around it.
If so, you can either replace the new line character in item2 or try to substring the content.
To remove any unwanted characters from the input strings, you can trim them, or do a character replacement. For example, to remove CR/LF from the end of strings (untested, from memory):
numb = numbtxt.Text.TrimEnd(ControlChars.Cr, ControlChars.Lf)
or to remove all occurrences:
year = yeartxt.Text.Replace(vbCrLf, String.Empty)
or:
year = yeartxt.Text.Replace(vbCr, String.Empty).Replace(vbLf, String.Empty)
etc. YMMV depending on the actual CR/LF character arrangements and language settings!
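The diagnosis and both fixes can be sketched in Python as follows. The sample value in item2 is hypothetical, and repr is used here as the "leading and trailing character" trick because it makes an invisible line break show up as \r\n.

```python
# Hypothetical input with a stray line break at the end.
item2 = "Some pasted title\r\n"

# Diagnose: repr() makes hidden control characters visible.
print(repr(item2))

# Fix 1: remove CR/LF characters only from the end of the string.
trimmed = item2.rstrip("\r\n")

# Fix 2: remove every CR and LF, wherever they occur.
cleaned = item2.replace("\r", "").replace("\n", "")

print("[" + trimmed + "]")
```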

How to remove any trailing numbers from a string?

Sample inputs:
"Hi there how are you"
"What is the #1 pizza place in NYC?"
"Dominoes is number 1"
"Blah blah 123123"
"More blah 12321 123123 123132"
Expected output:
"Hi there how are you"
"What is the #1 pizza place in NYC?"
"Dominoes is number"
"Blah blah"
"More blah"
I'm thinking it's a 2 step process:
Split the entire string into characters, one row per character (including spaces), in reverse order.
Loop through, and for each one, if it's a space or a number, skip it; otherwise add it to the start of another array.
And I should end up with the desired result.
I can think of a few quick and dirty ways, but this needs to perform fairly well, as it's a trigger that runs on a busy table, so I thought I'd throw it out to the T-SQL pros.
Any suggestions?
This solution should be a bit more efficient because it first checks to see if the string contains a number, then it checks to see if the string ends in a number.
CREATE FUNCTION dbo.trim_ending_numbers(@columnvalue AS VARCHAR(100)) RETURNS VARCHAR(100)
BEGIN
    --This will make the query more efficient by first checking to see if it contains any numbers at all
    IF @columnvalue NOT LIKE '%[0-9]%'
        RETURN @columnvalue
    DECLARE @counter INT
    SET @counter = LEN(@columnvalue)
    IF ISNUMERIC(SUBSTRING(@columnvalue,@counter,1)) = 0
        RETURN @columnvalue
    WHILE ISNUMERIC(SUBSTRING(@columnvalue,@counter,1)) = 1 OR SUBSTRING(@columnvalue,@counter,1) = ' '
    BEGIN
        SET @counter = @counter -1
        IF @counter < 0
            BREAK
    END
    SET @columnvalue = SUBSTRING(@columnvalue,0,@counter+1)
    RETURN @columnvalue
END
If you run
SELECT dbo.trim_ending_numbers('More blah 12321 123123 123132')
It will return
'More blah'
A loop on a busy table is very unlikely to perform adequately. Use REVERSE and PATINDEX to find the first non-digit, begin a SUBSTRING there, then REVERSE the result. This will be plenty fast, with no loops.
Your examples imply that you also don't want to match spaces.
DECLARE @t TABLE (s NVARCHAR(500))
INSERT INTO @t (s)
VALUES
('Hi there how are you'),('What is the #1 pizza place in NYC?'),('Dominoes is number 1'),('Blah blah 123123'),('More blah 12321 123123 123132')
select s
, reverse(s) as beginning
, patindex('%[^0-9 ]%',reverse(s)) as progress
, substring(reverse(s),patindex('%[^0-9 ]%',reverse(s)), 1+len(s)-patindex('%[^0-9 ]%',reverse(s))) as [more progress]
, reverse(substring(reverse(s),patindex('%[^0-9 ]%',reverse(s)), 1+len(s)-patindex('%[^0-9 ]%',reverse(s)))) as SOLUTION
from @t
Final answer:
reverse( substring( reverse( @s ), patindex( '%[^0-9 ]%', reverse( @s ) ), 1 + len( @s ) - patindex( '%[^0-9 ]%', reverse( @s ) ) ) )
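To make the reverse-and-scan logic easier to follow, here is a rough Python sketch of the same idea. The function name trim_trailing_numbers is made up here, and returning an empty string for input that consists only of digits and spaces is an assumption of this sketch, not something taken from the T-SQL above.

```python
def trim_trailing_numbers(s):
    """Drop trailing digits and spaces by scanning the reversed string."""
    for offset, ch in enumerate(reversed(s)):
        # First character (from the end) that is neither a digit nor a
        # space: the same role as PATINDEX('%[^0-9 ]%', REVERSE(s)).
        if ch not in "0123456789 ":
            return s[:len(s) - offset]
    return ""  # assumption: nothing but digits and spaces


samples = [
    "Hi there how are you",
    "What is the #1 pizza place in NYC?",
    "Dominoes is number 1",
    "Blah blah 123123",
    "More blah 12321 123123 123132",
]
results = [trim_trailing_numbers(s) for s in samples]
for line in results:
    print(line)
```

In Python the built-in s.rstrip("0123456789 ") gives the same result in one call; the explicit scan is shown only because it mirrors the SQL expression step by step.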
I believe that the below query is fast and useful
select reverse(substring(reverse(colA),PATINDEX('%[0-9][a-z]%',reverse(colA))+1,
len(colA)-PATINDEX('%[0-9][a-z]%',reverse(colA))))
from TBLA
--DECLARE @String VARCHAR(100) = 'the fat cat sat on the mat'
--DECLARE @String VARCHAR(100) = 'the fat cat 2 sat33 on4 the mat'
--DECLARE @String VARCHAR(100) = 'the fat cat sat on the mat1'
--DECLARE @String VARCHAR(100) = '2121'
DECLARE @String VARCHAR(100) = 'the fat cat 2 2 2 2 sat on the mat2121'
DECLARE @Answer NVARCHAR(MAX),
    @Index INTEGER = LEN(@String),
    @Character CHAR,
    @IncorrectCharacterIndex SMALLINT
-- Start from the end, going to the front.
WHILE @Index > 0 BEGIN
    -- Get each character, starting from the end
    SET @Character = SUBSTRING(@String, @Index, 1)
    -- Regex check.
    SET @IncorrectCharacterIndex = PATINDEX('%[A-Za-z-]%', @Character)
    -- Is there a match? We're lucky here because it will either match on index 1 or not (index 0)
    IF (@IncorrectCharacterIndex != 0)
    BEGIN
        -- We have a legit character.
        SET @Answer = SUBSTRING(@String, 0, @Index + 1)
        SET @Index = 0
    END
    ELSE
        SET @Index = @Index - 1 -- No match, lets go back one index slot.
END
PRINT LTRIM(RTRIM(@Answer))
NOTE: I've included a dash in the valid regex match.
Thanks for all the contributions which were very helpful. To go further and extract off JUST the trailing number:
, substring(s, 2 + len(s) - patindex('%[^0-9 ]%',reverse(s)), 99) as numeric_suffix
I needed to sort on the number suffix, so I had to restrict the pattern to numerics and, to get around numbers of different lengths sorting as text (i.e. I wanted 2 to sort before 19), cast the result:
,cast(substring(s, 2 + len(s) - patindex('%[^0-9]%',reverse(s)),99) as integer) as numeric_suffix
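The suffix extraction and the numeric sort can be sketched in Python like this. The function name numeric_suffix and the sample names are hypothetical, and returning None when there is no trailing number is a choice made for this sketch.

```python
import re


def numeric_suffix(s):
    """Return the trailing run of digits as an int, or None if absent."""
    match = re.search(r"(\d+)$", s)
    return int(match.group(1)) if match else None


# Hypothetical values that all end in a number.
names = ["item 19", "item 2", "item 100"]

# Sorting on the integer value puts 2 before 19, unlike a text sort.
by_number = sorted(names, key=numeric_suffix)
print(by_number)
```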

Resources