Regex for getting multiple words after a delimiter

Regex for getting multiple words after a delimiter - python-3.x

I have been trying to get the separate groups from the below string using regex in PCRE:
drop = blah blah blah something keep = bar foo nlah aaaa rename = (a=b d=e) obs=4 where = (foo > 45 and bar == 35)
The groups I am trying to get are:
1. drop = blah blah blah something
2. keep = bar foo nlah aaaa
3. rename = (a=b d=e)
4. obs=4
5. where = (foo > 45 and bar == 35)
I have written a regex using recursion, but the recursion only partially works: after drop it selects just the first three words (blah blah blah) and not the fourth. I have looked through various Stack Overflow questions and also tried a positive lookahead, but this is the closest I could get, and I cannot see what I am doing wrong.
The regex that I have written: (?i)(drop|keep|where|rename|obs)\s*=\s*((\w+|\d+)(\s+\w+)(?4)|(\((.*?)\)))
Same can be seen here: RegEx Demo.
Any help on this or understanding what I am doing wrong is appreciated.

You could use the newer regex module with DEFINE:
(?(DEFINE)
(?<key>\w+)
(?<sep>\s*=\s*)
(?<value>(?:(?!(?&key)(?&sep))[^()=])+)
(?<par>\((?:[^()]+|(?&par))+\))
)
(?P<k>(?&key))(?&sep)(?P<v>(?:(?&value)|(?&par)))
See a demo on regex101.com.
In Python this could be:
import regex as re
data = """
drop = blah blah blah something keep = bar foo nlah aaaa rename = (a=b d=e) obs=4 where = (foo > 45 and bar == 35)
"""
rx = re.compile(r'''
(?(DEFINE)
(?<key>\w+)
(?<sep>\s*=\s*)
(?<value>(?:(?!(?&key)(?&sep))[^()=])+)
(?<par>\((?:[^()]+|(?&par))+\))
)
(?P<k>(?&key))(?&sep)(?P<v>(?:(?&value)|(?&par)))''', re.X)
result = {m.group('k').strip(): m.group('v').strip()
for m in rx.finditer(data)}
print(result)
And yields
{'drop': 'blah blah blah something', 'keep': 'bar foo nlah aaaa', 'rename': '(a=b d=e)', 'obs': '4', 'where': '(foo > 45 and bar == 35)'}

You can use a branch reset group solution:
(?i)\b(drop|keep|where|rename|obs)\s*=\s*(?|(\w+(?:\s+\w+)*)(?=\s+\w+\s+=|$)|\((.*?)\))
See the PCRE regex demo
Details
(?i) - case insensitive mode on
\b - a word boundary
(drop|keep|where|rename|obs) - Group 1: any of the words in the group
\s*=\s* - a = char enclosed with 0+ whitespace chars
(?| - start of a branch reset group:
(\w+(?:\s+\w+)*) - Group 2: one or more word chars followed with zero or more repetitions of one or more whitespaces and one or more word chars
(?=\s+\w+\s+=|$) - up to one or more whitespaces, one or more word chars, one or more whitespaces, and =, or end of string
| - or
\((.*?)\) - (, then Group 2 capturing any zero or more chars other than line break chars, as few as possible and then )
) - end of the branch reset group.
See Python demo:
import regex
pattern = r"(?i)\b(drop|keep|where|rename|obs)\s*=\s*(?|(\w+(?:\s+\w+)*)(?=\s+\w+\s+=|$)|\((.*?)\))"
text = "drop = blah blah blah something keep = bar foo nlah aaaa rename = (a=b d=e) obs=4 where = (foo > 45 and bar == 35)"
print( [x.group() for x in regex.finditer(pattern, text)] )
# => ['drop = blah blah blah something', 'keep = bar foo nlah aaaa', 'rename = (a=b d=e)', 'obs=4', 'where = (foo > 45 and bar == 35)']
print( regex.findall(pattern, text) )
# => [('drop', 'blah blah blah something'), ('keep', 'bar foo nlah aaaa'), ('rename', 'a=b d=e'), ('obs', '4'), ('where', 'foo > 45 and bar == 35')]
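Both answers rely on the third-party regex module, since (?(DEFINE)...) and branch reset groups are not available in the standard library's re. If installing it is not an option, something along these lines might work with re alone. This is only a sketch: the names KEYS, PAIR_RE and parse_pairs are made up for illustration, parenthesised values are assumed not to nest, and the parentheses are kept in the captured value.

```python
import re

# Hypothetical names; the key list comes from the question.
KEYS = ("drop", "keep", "where", "rename", "obs")

PAIR_RE = re.compile(
    r"\b(" + "|".join(KEYS) + r")\s*=\s*"
    r"(\([^()]*\)"                              # either a (non-nested) parenthesised value
    r"|\w+(?:\s+\w+)*?(?=\s+\w+\s*=|\s*$))",    # or words, up to the next "key =" or the end
    re.IGNORECASE,
)


def parse_pairs(text):
    """Return a dict mapping each key to its raw value."""
    return dict(PAIR_RE.findall(text))


text = ("drop = blah blah blah something keep = bar foo nlah aaaa "
        "rename = (a=b d=e) obs=4 where = (foo > 45 and bar == 35)")
print(parse_pairs(text))
```

The lazy (?:\s+\w+)*? grows the value one word at a time until the lookahead sees the next key followed by =, which is what keeps the fourth word (something) in the drop group.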

Related

How can I include a list inside a string

I want to include a list variable inside a string. Like this:
aStr = """
blah blah blah
blah = ["one","two","three"]
blah blah
"""
I tried these:
l = ["one","two","three"]
aStr = """
blah blah blah
blah = %s
blah blah
"""%str(l)
OR
l = ["one","two","three"]
aStr = """
blah blah blah
blah = """+str(l)+"""
blah blah
"""
They didn't work unfortunately

Both of your snippets work almost the way you mention. The only difference is that the line that includes your list has spaces after each comma. The attached code gives exactly the output you wanted.
l = ["one","two","three"]
aStr = """
blah blah blah
blah = %s
blah blah
"""% ('["'+'","'.join(l)+'"]')
So what does that do?
The most interesting part here is this one '","'.join(l). The .join() method takes the string that is passed before the dot and joins every element of the list separated by that string.
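To make the difference concrete, here is a small comparison of str(l) with the join approach. The json.dumps line is not part of the answer; it is an alternative from the standard library that should give the same compact, double-quoted form when passed separators without spaces.

```python
import json

l = ["one", "two", "three"]

default = str(l)                                  # single quotes, space after each comma
joined = '["' + '","'.join(l) + '"]'              # the join approach from the answer
compact = json.dumps(l, separators=(",", ":"))    # alternative, not from the answer

print(default)
print(joined)
print(compact)
```

One practical difference: json.dumps also escapes any double quotes inside the items, which the plain join does not.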
Other ways of doing this:
l = [1,2,3]
"a "+str(l)+" b"
"a {} b".format(l)
"a {name} b".format(name=l)
f"a {l} b"

If you just want a string that includes the list, you could do:
s1 = "blah blah blah"
l = [1,2,3]
s2 = "blah blah blah"
s3 = s1 + str(l) + s2

How about using an f-string?
strQ =f"""
blah blah blah
blah = {l}
blah blah"""
With l = ["one","two","three"] from the question, this gives:
"\nblah blah blah\nblah = ['one', 'two', 'three']\nblah blah"
This works in Python 3.6 and later, but not in earlier versions.

Search for multiple keywords and corresponding index

I have strings like:
a = "currency is like gbp"
a= "currency blah blah euro"
a= "currency is equivalent to usd" .....
I want to substring or slice the above string wherever I found any of "gbp" , "euro" or "usd".
Not Working:
i = a.find("gbp") or a.find("euro") or a.find("usd")
a = a[i:]
Can do:
x = a.find('gbp')
y = a.find('euro')
z = a.find('usd')
But then I need to check which of them is greater than -1 and use that variable to slice the string which will be too much code.
Also, in my original example I have 10+ currencies so want a scalable solution.
Summary:
Want to slice/substring the main sentence from any of the words found till the end

You could try something like:
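currency_array = ['gbp', 'euro', 'usd']
# str.find returns -1 on a miss, so max() picks the position of the currency that does occur
index = max(a.find(currency) for currency in currency_array)
# note: if no currency occurs at all, index is -1 and a[-1:] is just the last character
print(a[index:])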

Use regex for such purposes:
import re
a = "currency is like gbp currency"
print(re.findall(r'((?:gbp|euro|usd).*)', a))
# ['gbp currency']
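For the 10+ currencies mentioned in the question, the alternation can be built from a list instead of being typed by hand. A minimal sketch, assuming the names CURRENCIES, CURRENCY_RE and slice_from_currency (all hypothetical) and assuming that None is an acceptable result when nothing matches:

```python
import re

# Hypothetical list; extend it with the remaining currencies.
CURRENCIES = ["gbp", "euro", "usd"]

# re.escape keeps any special characters in a currency code literal.
CURRENCY_RE = re.compile("|".join(re.escape(c) for c in CURRENCIES))


def slice_from_currency(sentence):
    """Return the sentence from the first currency found to the end, or None."""
    m = CURRENCY_RE.search(sentence)
    return sentence[m.start():] if m else None


print(slice_from_currency("currency is like gbp"))
print(slice_from_currency("no match here"))
```

Unlike the max(find(...)) approach, this slices from the earliest currency in the sentence and makes the no-match case explicit.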

Regular Expression VBA for multiple rows/table?

I have a text (already stored in a String variable, if you want).
The text is structured as follows:
( 124314 ) GSK67SJ/11 ADS SDK
blah blah blah blah blah
blah blah blah
blah blah blah
( 298 ) 2KEER/98 EOR PRT
blah blah blah
blah blah blah blah blah
etc.
The number of empty spaces between the words is variable;
The value in brackets is variable, as is the length of the alphanumeric group (this one always ends with "/" and then two numbers);
The text "blah blah" at the end can be split across an unknown number of lines, each one with a variable number of characters;
The last two groups of letters are always 3 letters each, and are immediately followed by a newline ("\n"), without spaces;
The list contains 0 to N elements.
For each of them I have to store the number, the first 3-letters, the second 3-letters, and the "blah blah" in 4 columns of an Excel file.
Let's say that the columns are A, B, C, D. The result should be as follow (from A1):
124314 | ADS | SDK | blah blah blah.....
298 | EOR | PRT | blah blah.....
.........
Any help would be greatly appreciated

I managed to solve it
Dim RegX As VBScript_RegExp_55.RegExp 'Remember to reference it...
Dim Mats As Object
Dim TextFiltered As String
Dim counter As Integer
Set RegX = New VBScript_RegExp_55.RegExp
With RegX
.Global = True
.MultiLine = True
.Pattern = "[\s]{2,}(?!\(\s+(\d+)\s+\))" 'This removes the annoying line splits inside the "blah blah" text, apart from the ones before "( number )"
TextFiltered = .Replace(TextFiltered, " ") ' You could also write [\r\n] instead of [\s] but in that way you eliminate all the spaces in one hit
End With
With RegX 'This is the pattern you're looking for, the brackets mean different elements you could retrieve from the array of the results
.Pattern = "\(\s+(\d+)\s+\)(\s+\w+/[0-9]{2}\s+)([A-Z]{3})\s+([A-Z]{3})\s+(.+)" 'I think you can remove the "+" from the "\s"
Set Mats = .Execute(TextFiltered)
End With
For counter = 0 To Mats.Count - 1 'SubMatches is what will give you the various elements one by one (124314, ADS, SDK, etc)
MsgBox Mats(counter).SubMatches(0) & " " & Mats(counter).SubMatches(2) & " " & Mats(counter).SubMatches(3) & " " & Mats(counter).SubMatches(4)
Next
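For comparison, the same extraction can be sketched in Python. This is not a translation of the VBA above: it uses a single pattern that reads each "( number )" header line and then takes everything up to the next header, collapsing the whitespace afterwards. RECORD_RE and parse_records are hypothetical names, and the sample text is adapted from the question.

```python
import re

RECORD_RE = re.compile(
    r"\(\s*(\d+)\s*\)\s+\S+/\d{2}\s+([A-Z]{3})\s+([A-Z]{3})\s*\n"  # header line
    r"(.*?)"                                                        # free text, possibly multi-line
    r"(?=^\(\s*\d+\s*\)|\Z)",                                       # stop at next header or end
    re.DOTALL | re.MULTILINE,
)


def parse_records(text):
    """Return (number, first code, second code, text) tuples, one per record."""
    rows = []
    for m in RECORD_RE.finditer(text):
        number, first, second, body = m.groups()
        rows.append((number, first, second, " ".join(body.split())))
    return rows


sample = (
    "( 124314 )   GSK67SJ/11  ADS   SDK\n"
    "blah blah blah blah blah\n"
    "blah blah blah\n"
    "( 298 )  2KEER/98 EOR PRT\n"
    "blah blah blah\n"
    "blah blah blah blah blah\n"
)
for row in parse_records(sample):
    print(" | ".join(row))
```

Each tuple maps onto columns A to D of the target sheet; writing them to Excel is left out here.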

Insert quoted and unquoted parts of string in table

I've been working on this part of a saycommand system which is supposed to separate parts of a string and put them in a table which is sent to a function, which is queried at the beginning of the string. This would look like, for example, !save 1 or !teleport 0 1, or !tell 5 "a private message".
I would like this string to turn into a table:
[[1 2 word 2 9 'more words' 1 "and more" "1 2 34"]]
(Every non-quoted part of the string gets its own key, and the quoted parts get grouped into a key)
1 = 1
2 = 2
3 = word
4 = 2
5 = 9
6 = more words
7 = 1
8 = and more
9 = 1 2 34
I've tried doing this with Lua pattern, but I'm stuck trying to find out how to capture both quoted and unquoted pieces of the string. I've tried a lot of things, but nothing helped.
My current pattern attempts look like this:
a, d = '1 2 word 2 9 "more words" 1 "and more" "1 2 34"" ', {}
-- previous attempts
--[[
This one captures quotes
a:gsub('(["\'])(.-)%1', function(a, b) table.insert(d, b) end)
This one captures some values and butchered quotes,
which might have to do with spaces in the string
a:gsub('(["%s])(.-)%1', function(a, b) table.insert(d, b) end)
This one captures every value, but doesn't take care of quotes
a:gsub('(%w+)', function(a) table.insert(d, a) end)
This one tries making %s inside of quotes into underscores to
ignore them there, but it doesn't work
a = a:gsub('([%w"\']+)', '%1_')
a:gsub('(["\'_])(.-)%1', function(a, b) table.insert(d, b) end)
a:gsub('([%w_]+)', function(a) table.insert(d, a) end)
This one was a wild attempt at cracking it, but no success
a:gsub('["\']([^"\']-)["\'%s]', function(a) table.insert(d, a) end)
--]]
-- This one adds spaces, which would later be trimmed off, to test
-- whether it helped with the butchered strings, but it doesn't
a = a:gsub('(%w)(%s)(%w)', '%1%2%2%3')
a:gsub('(["\'%s])(.-)%1', function(a, b) table.insert(d, b) end)
for k, v in pairs(d) do
print(k..' = '..v)
end
This would not be needed for simple commands, but a more complex one like !tell 1 2 3 4 5 "a private message sent to five people" does need it, first to check if it's sent to multiple people and next to find out what the message is.
Further down the line I want to add commands like !give 1 2 3 "component:material_iron:weapontype" "food:calories", which is supposed to add two items to three different people, would benefit greatly from such a system.
If this is impossible in Lua pattern, I'll try doing it with for loops and such, but I really feel like I'm missing something obvious. Am I overthinking this?

You cannot process quoted strings with Lua patterns. You need to parse the string explicitly, as in the code below.
function split(s)
local t={}
local n=0
local b,e=0,0
while true do
b,e=s:find("%s*",e+1)
b=e+1
if b>#s then break end
n=n+1
if s:sub(b,b)=="'" then
b,e=s:find(".-'",b+1)
t[n]=s:sub(b,e-1)
elseif s:sub(b,b)=='"' then
b,e=s:find('.-"',b+1)
t[n]=s:sub(b,e-1)
else
b,e=s:find("%S+",b)
t[n]=s:sub(b,e)
end
end
return t
end
s=[[1 2 word 2 9 'more words' 1 "and more" "1 2 34"]]
print(s)
t=split(s)
for k,v in ipairs(t) do
print(k,v)
end
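The scanning logic of split is easier to follow with indentation, so here is the same idea sketched in Python: skip whitespace, then read either a quoted run or a run of non-space characters. The name split_args is made up, and treating an unterminated quote as running to the end of the string is an assumption of this sketch, not something the Lua code does.

```python
def split_args(s):
    """Split on whitespace, keeping '...' and "..." runs together (quotes dropped)."""
    tokens = []
    i, n = 0, len(s)
    while i < n:
        if s[i].isspace():
            i += 1
            continue
        if s[i] in "'\"":
            # Quoted token: read up to the matching quote character.
            close = s.find(s[i], i + 1)
            if close == -1:
                close = n  # assumption: an unterminated quote takes the rest
            tokens.append(s[i + 1:close])
            i = close + 1
        else:
            # Bare token: read up to the next whitespace.
            j = i
            while j < n and not s[j].isspace():
                j += 1
            tokens.append(s[i:j])
            i = j
    return tokens


print(split_args("""1 2 word 2 9 'more words' 1 "and more" "1 2 34\""""))
```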

Lua string patterns, and regex for that matter, generally aren't well suited to parsing that involves varying nesting levels or balanced tokens such as parentheses ( ). But there is another tool available to Lua that is powerful enough to deal with that requirement: LPeg.
The LPeg syntax is a bit archaic and takes some getting used to, so I'll use the lpeg re module instead to make it easier to digest. Keep in mind that anything you can do in one form of the syntax you can also express in the other.
I'll start by defining the grammar for parsing your format description:
local re = require 're'
local cmdgrammar =
[[
saycmd <- '!' cmd extra
cmd <- %a%w+
extra <- (singlequote / doublequote / unquote / .)*
unquote <- %w+
singlequote <- "'" (unquote / %s)* "'"
doublequote <- '"' (unquote / %s)* '"'
]]
Next, compile the grammar and use it to match some of your test examples:
local cmd_parser = re.compile(cmdgrammar)
local saytest =
{
[[!save 1 2 word 2 9 'more words' 1 "and more" "1 2 34"]],
[[!tell 5 "a private message"]],
[[!teleport 0 1]],
[[!say 'another private message' 42 "foo bar" baz]],
}
There are currently no captures in the grammar so re.match returns the last character position in the string it was able to match up to + 1. That means a successful parse will return the full character count of the string + 1 and therefore is a valid instance of your grammar.
for _, test in ipairs(saytest) do
assert(cmd_parser:match(test) == #test + 1)
end
Now comes the interesting part. Once the grammar works as desired, you can add captures that automatically extract the results you want into a Lua table with relatively little effort. Here's the final grammar spec + table captures:
local cmdgrammar =
[[
saycmd <- '!' {| {:cmd: cmd :} {:extra: extra :} |}
cmd <- %a%w+
extra <- {| (singlequote / doublequote / { unquote } / .)* |}
unquote <- %w+
singlequote <- "'" { (unquote / %s)* } "'"
doublequote <- '"' { (unquote / %s)* } '"'
]]
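After recompiling the parser with this grammar, running the tests again and dumping the re.match results (dump stands for a table pretty-printer that is not shown here):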
for i, test in ipairs(saytest) do
print(i .. ':')
dump(cmd_parser:match(test))
end
You should get output similar to:
lua say.lua
1:
{
extra = {
"1",
"2",
"word",
"2",
"9",
"more words",
"1",
"and more",
"1 2 34"
},
cmd = "save"
}
2:
{
extra = {
"5",
"a private message"
},
cmd = "tell"
}
3:
{
extra = {
"0",
"1"
},
cmd = "teleport"
}
4:
{
extra = {
"another private message",
"42",
"foo bar",
"baz"
},
cmd = "say"
}
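The resulting shape, a command name plus a list of arguments, can also be reproduced in Python for comparison. The sketch below swaps in a different technique: the standard library's shlex.split instead of a PEG grammar. parse_saycmd is a hypothetical name, and unlike the grammar it does not check that the command starts with a letter.

```python
import shlex


def parse_saycmd(line):
    """Parse '!cmd arg "quoted arg" ...' into {'cmd': ..., 'extra': [...]}.

    Returns None when the line is not a command. shlex.split raises
    ValueError if a quote is left unterminated.
    """
    if not line.startswith("!"):
        return None
    parts = shlex.split(line[1:])
    if not parts:
        return None
    return {"cmd": parts[0], "extra": parts[1:]}


for test in ('!tell 5 "a private message"',
             "!teleport 0 1",
             """!say 'another private message' 42 "foo bar" baz"""):
    print(parse_saycmd(test))
```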

How to remove any trailing numbers from a string?

Sample inputs:
"Hi there how are you"
"What is the #1 pizza place in NYC?"
"Dominoes is number 1"
"Blah blah 123123"
"More blah 12321 123123 123132"
Expected output:
"Hi there how are you"
"What is the #1 pizza place in NYC?"
"Dominoes is number"
"Blah blah"
"More blah"
I'm thinking it's a 2 step process:
Split the entire string into characters, one row per character (including spaces), in reverse order
Loop through, and for each one if it's a space or a number, skip, otherwise add to the start of another array.
And i should end up with the desired result.
I can think of a few quick and dirty ways, but this needs to perform fairly well, as it's a trigger that runs on a busy table, so thought i'd throw it out to the T-SQL pros.
Any suggestions?

This solution should be a bit more efficient because it first checks to see if the string contains a number, then it checks to see if the string ends in a number.
CREATE FUNCTION dbo.trim_ending_numbers(@columnvalue AS VARCHAR(100)) RETURNS VARCHAR(100)
BEGIN
--This will make the query more efficient by first checking to see if it contains any numbers at all
IF @columnvalue NOT LIKE '%[0-9]%'
RETURN @columnvalue
DECLARE @counter INT
SET @counter = LEN(@columnvalue)
IF ISNUMERIC(SUBSTRING(@columnvalue,@counter,1)) = 0
RETURN @columnvalue
WHILE ISNUMERIC(SUBSTRING(@columnvalue,@counter,1)) = 1 OR SUBSTRING(@columnvalue,@counter,1) = ' '
BEGIN
SET @counter = @counter -1
IF @counter < 0
BREAK
END
SET @columnvalue = SUBSTRING(@columnvalue,0,@counter+1)
RETURN @columnvalue
END
If you run
SELECT dbo.trim_ending_numbers('More blah 12321 123123 123132')
It will return
'More blah'

A loop on a busy table is very unlikely to perform adequately. Use REVERSE and PATINDEX to find the first non-digit, begin a SUBSTRING there, then REVERSE the result. Even with no loops, this will still be plenty slow.
Your examples imply that you also don't want to match spaces.
DECLARE @t TABLE (s NVARCHAR(500))
INSERT INTO @t (s)
VALUES
('Hi there how are you'),('What is the #1 pizza place in NYC?'),('Dominoes is number 1'),('Blah blah 123123'),('More blah 12321 123123 123132')
select s
, reverse(s) as beginning
, patindex('%[^0-9 ]%',reverse(s)) as progress
, substring(reverse(s),patindex('%[^0-9 ]%',reverse(s)), 1+len(s)-patindex('%[^0-9 ]%',reverse(s))) as [more progress]
, reverse(substring(reverse(s),patindex('%[^0-9 ]%',reverse(s)), 1+len(s)-patindex('%[^0-9 ]%',reverse(s)))) as SOLUTION
from @t
Final answer:
reverse( substring( reverse( @s ), patindex( '%[^0-9 ]%', reverse( @s ) ), 1 + len( @s ) - patindex( '%[^0-9 ]%', reverse( @s ) ) ) )
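The logic behind the REVERSE/PATINDEX expression is to find the last character that is neither a digit nor a space and keep everything up to it. Outside the database, the same idea can be sketched in a few lines of Python as a way to check expected results. trim_trailing_numbers is a hypothetical name, and returning an empty string for input made up only of digits and spaces is an assumption of this sketch.

```python
def trim_trailing_numbers(s):
    """Drop any run of ASCII digits and spaces from the end of s."""
    end = len(s)
    # Walk back from the end while the character is a digit or a space.
    while end > 0 and s[end - 1] in "0123456789 ":
        end -= 1
    return s[:end]


samples = [
    "Hi there how are you",
    "What is the #1 pizza place in NYC?",
    "Dominoes is number 1",
    "Blah blah 123123",
    "More blah 12321 123123 123132",
]
for sample in samples:
    print(repr(trim_trailing_numbers(sample)))
```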

I believe that the below query is fast and useful
select reverse(substring(reverse(colA),PATINDEX('%[0-9][a-z]%',reverse(colA))+1,
len(colA)-PATINDEX('%[0-9][a-z]%',reverse(colA))))
from TBLA

--DECLARE @String VARCHAR(100) = 'the fat cat sat on the mat'
--DECLARE @String VARCHAR(100) = 'the fat cat 2 sat33 on4 the mat'
--DECLARE @String VARCHAR(100) = 'the fat cat sat on the mat1'
--DECLARE @String VARCHAR(100) = '2121'
DECLARE @String VARCHAR(100) = 'the fat cat 2 2 2 2 sat on the mat2121'
DECLARE @Answer NVARCHAR(MAX),
@Index INTEGER = LEN(@String),
@Character CHAR,
@IncorrectCharacterIndex SMALLINT
-- Start from the end, going to the front.
WHILE @Index > 0 BEGIN
-- Get each character, starting from the end
SET @Character = SUBSTRING(@String, @Index, 1)
-- Regex check.
SET @IncorrectCharacterIndex = PATINDEX('%[A-Za-z-]%', @Character)
-- Is there a match? We're lucky here because it will either match on index 1 or not (index 0)
IF (@IncorrectCharacterIndex != 0)
BEGIN
-- We have a legit character.
SET @Answer = SUBSTRING(@String, 0, @Index + 1)
SET @Index = 0
END
ELSE
SET @Index = @Index - 1 -- No match, lets go back one index slot.
END
PRINT LTRIM(RTRIM(@Answer))
NOTE: I've included a dash in the valid regex match.

Thanks for all the contributions which were very helpful. To go further and extract off JUST the trailing number:
, substring(s, 2 + len(s) - patindex('%[^0-9 ]%',reverse(s)), 99) as numeric_suffix
I needed to sort on the number suffix so had to restrict the pattern to numerics and to get around numbers of different lengths sorting as text (ie I wanted 2 to sort before 19) cast the result:
,cast(substring(s, 2 + len(s) - patindex('%[^0-9]%',reverse(s)),99) as integer) as numeric_suffix
