Why does the regular expression (\d{n,m})* not work as expected?

Why does r'^ABC (\d{1,4})*\.txt$' match 'ABC 00040.txt'?
As expected, r'^ABC( \d{1,4})*\.txt$', with the preceding space moved inside the parentheses, does not match 'ABC 00040.txt'.
Thanks in advance.
Python code:
import re
abs_name = r'ABC 00040.txt'
pattern = r'^ABC (\d{1,4})*\.txt$'
res = re.search(pattern, abs_name, re.IGNORECASE)
if res:
    print(res.group())
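
A likely explanation, sketched here rather than taken from the thread: the * repeats the whole group, so (\d{1,4})* accepts any number of digits, consumed in chunks of one to four. The name `strict` and the assumption that exactly one run of 1 to 4 digits was intended are mine.

```python
import re

name = 'ABC 00040.txt'

# The starred group may repeat: '0004' is consumed first, then '0'.
loose = re.search(r'^ABC (\d{1,4})*\.txt$', name)
print(loose.group())   # the whole name matches
print(loose.group(1))  # only the last repetition of the group is kept

# With the space inside the group, every repetition must begin with a space,
# so five consecutive digits cannot be split into valid repetitions.
spaced = re.search(r'^ABC( \d{1,4})*\.txt$', name)
print(spaced)

# Assumed intent: exactly one run of 1 to 4 digits, so the star is dropped.
strict = re.compile(r'^ABC (\d{1,4})\.txt$')
print(bool(strict.search('ABC 0040.txt')), bool(strict.search(name)))
```

If several space-separated numbers are allowed, the second pattern from the question (space inside the group) is probably the one to keep.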

Related

Split string with commas while keeping numeric parts

I'm using the following function to insert commas right before capital letters, as long as the capital is not preceded by a blank space.
def func(x):
    y = re.findall(r'[A-Z][^A-Z\s]+(?:\s+\S[^A-Z\s]*)*', x)
    return ','.join(y)
However, when I apply it to the following string, it drops the part with the numbers.
Input = '49ersRiders Mapple'
Output = 'Riders Mapple'
I tried the following code, but now it removes the 'ers' part.
def test(x):
    y = re.findall(r'\d+[A-Z]*|[A-Z][^A-Z\s]+(?:\s+\S[^A-Z\s]*)*', x)
    return ','.join(y)
Output = '49,Riders Mapple'
The output I'm looking for is this:
'49ers,Riders Mapple'
Is it possible to add this indication to my regex?
Thanks in advance

Maybe naive, but why not use re.sub:
import re
def func(x):
    return re.sub(r'(?<!\s)([A-Z])', r',\1', x)
inp = '49ersRiders Mapple'
out = func(inp)
print(out)
# Output
49ers,Riders Mapple
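
One edge case worth checking, shown as a hedged variant rather than a correction: with the negative lookbehind (?<!\s), a capital at the very start of the string also gets a comma in front of it. Assuming that is not wanted, a positive lookbehind for a non-whitespace character avoids it.

```python
import re

def func(x):
    # (?<=\S) requires a non-whitespace character right before the capital,
    # so a capital at the very start of the string is left alone.
    return re.sub(r'(?<=\S)([A-Z])', r',\1', x)

print(func('49ersRiders Mapple'))  # 49ers,Riders Mapple
print(func('Riders Mapple'))       # Riders Mapple
```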

Here is a re.findall approach:
import re
inp = "49ersRiders"
output = ','.join(re.findall('(?:[A-Z]|[0-9])[^A-Z]+', inp))
print(output)  # 49ers,Riders
The regex pattern used here says to match:
(?:
    [A-Z]    a leading uppercase letter (try to find this first)
    |        OR
    [0-9]    a leading digit (fallback for no uppercase)
)
[^A-Z]+      one or more characters that are not capital letters

How to get values using regex in Python

Here is my sample code:
import re
string = '[P-123,SHA-123]'
pattern = re.compile(r"^\[(?P<curve>).*\]$", re.MULTILINE | re.IGNORECASE)
result = pattern.search(string)
print(result)
Expected output:
P-123

If you want to match that data format:
^\[(?P<curve>[A-Z]-\d+),[A-Z]+-\d+]\Z
Explanation
^               Start of string
\[              Match [
(?P<curve>      Named capture group curve
  [A-Z]-\d+     Match a single uppercase char, - and 1+ digits
)               Close group
,[A-Z]+-\d+     Match a comma, 1+ uppercase chars, - and 1+ digits
]               Match ]
\Z              End of string (or use $ if a trailing newline is allowed)
The value is in the named capturing group curve. You could also use re.match instead of re.search, since the pattern has to match from the start of the string anyway.
Example code
import re
string = '[P-123,SHA-123]'
pattern = re.compile(r"\[(?P<curve>[A-Z]-\d+),[A-Z]+-\d+]\Z", re.MULTILINE | re.IGNORECASE)
result = pattern.match(string)
print(result.group("curve"))
Output
P-123

string = '[P-123,SHA-123]'
pattern = re.compile(r"(P.\d*)", re.MULTILINE | re.IGNORECASE)
result = pattern.search(string)
print(result[1])

You can try the regex \W([A-Z]-[0-9]*), which extracts a capital letter followed by - and then digits:
import re
string = '[P-123,SHA-123]'
pattern = re.compile(r"\W([A-Z]-[0-9]*)", re.MULTILINE | re.IGNORECASE)
result = pattern.search(string).group(1)
print(result)
Output
P-123

Replace slice of string python with string of different size, but maintain structure

So today I was working on a function that removes any quoted strings from a chunk of data and replaces them with format fields instead ({0}, {1}, etc.).
I ran into a problem: the output came out completely scrambled, with a {1} landing in a seemingly random place.
I later found out that this happens because replacing slices in the list changes its length, so the spans of the earlier re matches no longer line up (it only works for the first iteration).
Gathering the strings works perfectly, so this is most certainly not a problem with re.
I've read about mutable sequences and a bunch of other things as well, but was not able to find anything on this.
What I think I need is something like str.replace that takes slices instead of a substring.
Here is my code:
import re
def rm_strings_from_data(data):
    regex = re.compile(r'"(.*?)"')
    s = regex.finditer(data)
    list_data = list(data)
    val = 0
    strings = []
    for i in s:
        string = i.group()
        start, end = i.span()
        strings.append(string)
        list_data[start:end] = '{%d}' % val
        val += 1
    print(strings, ''.join(list_data), sep='\n\n')
if __name__ == '__main__':
    rm_strings_from_data('[hi="hello!" thing="a thing!" other="other thing"]')
I get:
['"hello!"', '"a thing!"', '"other thing"']
[hi={0} thing="a th{1}r="other thing{2}
I would like the output:
['"hello!"', '"a thing!"', '"other thing"']
[hi={0} thing={1} other={2}]
Any help would be appreciated. Thanks for your time :)

Why not match both parts of each key=value pair using regex capture groups, like this: (\w+?)=(".*?")
Then it becomes very easy to assemble the lists as needed.
Sample Code:
import re
def rm_strings_from_data(data):
    regex = re.compile(r'(\w+?)=(".*?")')
    matches = regex.finditer(data)
    strings = []
    list_data = []
    for match_num, match in enumerate(matches):
        # enumerate starts at 0, which gives the {0}, {1}, ... numbering asked for
        strings.append(match.group(2))
        list_data.append(match.group(1) + '={' + str(match_num) + '}')
    print(strings, '[' + ' '.join(list_data) + ']', sep='\n\n')
if __name__ == '__main__':
    rm_strings_from_data('[hi="hello!" thing="a thing!" other="other thing"]')
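
Another option, offered as a sketch rather than as what either poster used: re.sub accepts a function as the replacement, so each quoted string can be swapped for its placeholder in a single pass and no slice offsets need adjusting. The helper name `placeholder` and the choice to return the results instead of printing them are assumptions for illustration.

```python
import re

def rm_strings_from_data(data):
    strings = []

    def placeholder(match):
        # Record the quoted string and return the format field that replaces it.
        strings.append(match.group())
        return '{%d}' % (len(strings) - 1)

    template = re.sub(r'"(.*?)"', placeholder, data)
    return strings, template

strings, template = rm_strings_from_data(
    '[hi="hello!" thing="a thing!" other="other thing"]')
print(strings, template, sep='\n\n')
```

Because re.sub builds a new string, text outside the quoted parts is preserved exactly, whatever its layout.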

Optional capture of balanced brackets in Lua

Let's say I have lines of the form:
int[4] height
char c
char[50] userName
char[50+foo("bar")] userSchool
As you see, the bracketed expression is optional.
Can I parse these strings using Lua's string.match() ?
The following pattern works for lines that contain brackets:
line = "int[4] height"
print(line:match('^(%w+)(%b[])%s+(%w+)$'))
But is there a pattern that can handle also the optional brackets? The following does not work:
line = "char c"
print(line:match('^(%w+)(%b[]?)%s+(%w+)$'))
Can the pattern be written in another way to solve this?

Unlike in regular expressions, ? in a Lua pattern applies only to a single character class; it cannot make %b[] or a capture optional.
You can use the or operator to do the job like this:
line:match('^(%w+)(%b[])%s+(%w+)$') or line:match('^(%w+)%s+(%w+)$')
One small problem is that Lua keeps only the first result of each operand in an or expression. Depending on your needs, use an if statement, or make the entire string the first capture, like this:
print(line:match('^((%w+)(%b[])%s+(%w+))$') or line:match('^((%w+)%s+(%w+))$'))

LPeg may be more appropriate for your case, especially if you plan to expand your grammar.
local re = require're'
local p = re.compile( [[
prog <- stmt* -> set
stmt <- S { type } S { name }
type <- name bexp ?
bexp <- '[' ([^][] / bexp)* ']'
name <- %w+
S <- %s*
]], {set = function(...)
local t, args = {}, {...}
for i=1, #args, 2 do t[args[i+1]] = args[i] end
return t
end})
local s = [[
int[4] height
char c
char[50] userName
char[50+foo("bar")] userSchool
]]
for k, v in pairs(p:match(s)) do print(k .. ' = ' .. v) end
--[[
c = char
userSchool = char[50+foo("bar")]
height = int[4]
userName = char[50]
--]]

Find string between two substrings [duplicate]

How do I find a string between two substrings ('123STRINGabc' -> 'STRING')?
My current method is like this:
>>> start = 'asdf=5;'
>>> end = '123jasd'
>>> s = 'asdf=5;iwantthis123jasd'
>>> print((s.split(start))[1].split(end)[0])
iwantthis
However, this seems very inefficient and un-pythonic. What is a better way to do something like this?
Forgot to mention:
The string might not start and end with start and end. They may have more characters before and after.

import re
s = 'asdf=5;iwantthis123jasd'
result = re.search('asdf=5;(.*)123jasd', s)
print(result.group(1))

s = "123123STRINGabcabc"
def find_between(s, first, last):
    try:
        start = s.index(first) + len(first)
        end = s.index(last, start)
        return s[start:end]
    except ValueError:
        return ""
def find_between_r(s, first, last):
    try:
        start = s.rindex(first) + len(first)
        end = s.rindex(last, start)
        return s[start:end]
    except ValueError:
        return ""
print(find_between(s, "123", "abc"))
print(find_between_r(s, "123", "abc"))
gives:
123STRING
STRINGabc
I thought it should be noted: depending on what behavior you need, you can mix index and rindex calls or go with one of the above versions (the choice is loosely analogous to choosing between the regex groups (.*?) and (.*)).
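
To make the comparison with regex groups concrete, here is a small sketch on the same string. It is an illustration of lazy versus greedy matching, not a claim that the regexes reproduce both functions exactly.

```python
import re

s = "123123STRINGabcabc"

# Lazy group: starts after the first "123" and stops at the first "abc".
lazy = re.search(r'123(.*?)abc', s).group(1)

# Greedy group: starts after the first "123" and stops at the last "abc".
greedy = re.search(r'123(.*)abc', s).group(1)

print(lazy)    # 123STRING
print(greedy)  # 123STRINGabc
```

The lazy result equals what find_between returns. The greedy result differs from find_between_r, because a regex search still begins at the leftmost "123", whereas rindex begins at the rightmost one.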

start = 'asdf=5;'
end = '123jasd'
s = 'asdf=5;iwantthis123jasd'
print(s[s.find(start)+len(start):s.rfind(end)])
gives
iwantthis

s[len(start):-len(end)]

String formatting adds some flexibility to what Nikolaus Gradwohl suggested. start and end can now be amended as desired.
import re
s = 'asdf=5;iwantthis123jasd'
start = 'asdf=5;'
end = '123jasd'
result = re.search('%s(.*)%s' % (start, end), s).group(1)
print(result)
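
One caveat, shown as a hedged sketch: if start or end contains regex metacharacters, plain string formatting changes the meaning of the pattern. The sample string and the bracket markers below are hypothetical; re.escape is the standard way to make such markers literal.

```python
import re

s = 'total [price=42] today'
start, end = '[', ']'  # hypothetical markers that are regex metacharacters

# Unescaped, '%s(.*)%s' becomes '[(.*)]', which is a character class,
# not "text between brackets", so the pattern has no capture group at all.
unescaped = re.compile('%s(.*)%s' % (start, end))
print(unescaped.groups)

# Escaping the markers makes them match literally.
pattern = '%s(.*)%s' % (re.escape(start), re.escape(end))
result = re.search(pattern, s).group(1)
print(result)
```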

Just converting the OP's own solution into an answer:
def find_between(s, start, end):
    return (s.split(start))[1].split(end)[0]

If you don't want to import anything, try the string method .index():
text = 'I want to find a string between two substrings'
left = 'find a '
right = 'between two'
# Output: 'string'
print(text[text.index(left)+len(left):text.index(right)])

source = 'your token _here0#df and maybe _here1#df or maybe _here2#df'
start_sep = '_'
end_sep = '#df'
result = []
tmp = source.split(start_sep)
for par in tmp:
    if end_sep in par:
        result.append(par.split(end_sep)[0])
print(result)
This should print:
['here0', 'here1', 'here2']
A regex would be neater, but it requires importing re, and you may prefer to stick with plain string methods.

Here is one way to do it:
_, _, rest = s.partition(start)
result, _, _ = rest.partition(end)
print(result)
Another way, using a regexp:
import re
print(re.findall(re.escape(start) + "(.*)" + re.escape(end), s)[0])
or
print(re.search(re.escape(start) + "(.*)" + re.escape(end), s).group(1))

Here is a function I wrote that returns a list of the string(s) found between string1 and string2.
def GetListOfSubstrings(stringSubject, string1, string2):
    MyList = []
    intstart = 0
    strlength = len(stringSubject)
    continueloop = 1
    while intstart < strlength and continueloop == 1:
        intindex1 = stringSubject.find(string1, intstart)
        if intindex1 != -1:  # The substring was found, let's proceed
            intindex1 = intindex1 + len(string1)
            intindex2 = stringSubject.find(string2, intindex1)
            if intindex2 != -1:
                subsequence = stringSubject[intindex1:intindex2]
                MyList.append(subsequence)
                intstart = intindex2 + len(string2)
            else:
                continueloop = 0
        else:
            continueloop = 0
    return MyList
# Usage example
mystring = "s123y123o123pp123y6"
List = GetListOfSubstrings(mystring, "1", "y68")
for x in range(0, len(List)):
    print(List[x])
output: (nothing is printed, because "y68" never occurs in the string)
mystring = "s123y123o123pp123y6"
List = GetListOfSubstrings(mystring, "1", "3")
for x in range(0, len(List)):
    print(List[x])
output:
2
2
2
2
mystring = "s123y123o123pp123y6"
List = GetListOfSubstrings(mystring, "1", "y")
for x in range(0, len(List)):
    print(List[x])
output:
23
23o123pp123
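
For comparison, a shorter sketch that should give the same lists for these inputs, using re.findall with a lazy group. The function name list_between is made up for illustration. Note one difference: without re.DOTALL, the . does not match newlines, so a match never spans lines.

```python
import re

def list_between(subject, left, right):
    # re.escape makes the markers literal; (.*?) is lazy, so each match
    # stops at the first occurrence of `right` after `left`.
    pattern = re.escape(left) + r'(.*?)' + re.escape(right)
    return re.findall(pattern, subject)

mystring = "s123y123o123pp123y6"
print(list_between(mystring, "1", "3"))
print(list_between(mystring, "1", "y"))
print(list_between(mystring, "1", "y68"))
```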

To extract STRING, try:
myString = '123STRINGabc'
startString = '123'
endString = 'abc'
mySubString=myString[myString.find(startString)+len(startString):myString.find(endString)]

You can simply use this code or copy the function below. All neatly in one line.
def substring(whole, sub1, sub2):
    return whole[whole.index(sub1) : whole.index(sub2)]
If you run the function as follows:
print(substring("5+(5*2)+2", "(", ")"))
you will get the output:
(5*2
rather than
5*2
If you want both substrings included in the output, the code must look like this (the + 1 assumes single-character substrings):
    return whole[whole.index(sub1) : whole.index(sub2) + 1]
But if you don't want the substrings in the output, the + 1 must be on the first value:
    return whole[whole.index(sub1) + 1 : whole.index(sub2)]
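
The + 1 only works for one-character markers. A minimal sketch that generalizes it, assuming the markers should be excluded from the result: skip len(sub1) characters and search for sub2 only after that point. The name substring_between is hypothetical.

```python
def substring_between(whole, sub1, sub2):
    # Start right after the first marker, whatever its length.
    start = whole.index(sub1) + len(sub1)
    # Look for the closing marker only after the opening one,
    # which also makes identical markers work.
    end = whole.index(sub2, start)
    return whole[start:end]

print(substring_between("5+(5*2)+2", "(", ")"))
print(substring_between("a<<x>>b", "<<", ">>"))
print(substring_between("at $A NEWT?$ now", "$", "$"))
```

Like str.index itself, this raises ValueError when either marker is missing.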

These solutions assume the start string and the end string are different. Here is a solution I use on an entire file when the initial and final indicators are the same, assuming the file is read using readlines():
def extractstring(line, flag='$'):
    if flag in line:  # $ is the flag
        dex1 = line.index(flag)
        subline = line[dex1+1:-1]  # skip the flag (+1); -1 drops the last character of the line
        dex2 = subline.index(flag)
        string = subline[0:dex2].strip()  # does not include last flag, strip whitespace
        return(string)
Example:
lines = ['asdf 1qr3 qtqay 45q at $A NEWT?$ asdfa afeasd',
         'afafoaltat $I GOT BETTER!$ derpity derp derp']
for line in lines:
    string = extractstring(line, flag='$')
    print(string)
Gives:
A NEWT?
I GOT BETTER!

This is essentially cji's answer - Jul 30 '10 at 5:58.
I changed the try/except structure for a little more clarity on what was causing the exception.
import sys
def find_between(inputStr, firstSubstr, lastSubstr):
    '''
    find between firstSubstr and lastSubstr in inputStr STARTING FROM THE LEFT
    http://stackoverflow.com/questions/3368969/find-string-between-two-substrings
    above also has a func that does this FROM THE RIGHT
    '''
    start, end = (-1, -1)
    try:
        start = inputStr.index(firstSubstr) + len(firstSubstr)
    except ValueError:
        print(' ValueError: ', end=' ')
        print("firstSubstr=%s - " % (firstSubstr), end=' ')
        print(sys.exc_info()[1])
    try:
        end = inputStr.index(lastSubstr, start)
    except ValueError:
        print(' ValueError: ', end=' ')
        print("lastSubstr=%s - " % (lastSubstr), end=' ')
        print(sys.exc_info()[1])
    return inputStr[start:end]

from timeit import timeit
from re import search, DOTALL
def partition_find(string, start, end):
    return string.partition(start)[2].rpartition(end)[0]
def re_find(string, start, end):
    # applying re.escape to start and end would be safer
    return search(start + '(.*)' + end, string, DOTALL).group(1)
def index_find(string, start, end):
    return string[string.find(start) + len(start):string.rfind(end)]
# The wikitext of the "Alan Turing law" article from English Wikipedia
# https://en.wikipedia.org/w/index.php?title=Alan_Turing_law&action=edit&oldid=763725886
string = """..."""
start = '==Proposals=='
end = '==Rival bills=='
assert index_find(string, start, end) \
    == partition_find(string, start, end) \
    == re_find(string, start, end)
print('index_find', timeit(
    'index_find(string, start, end)',
    globals=globals(),
    number=100_000,
))
print('partition_find', timeit(
    'partition_find(string, start, end)',
    globals=globals(),
    number=100_000,
))
print('re_find', timeit(
    're_find(string, start, end)',
    globals=globals(),
    number=100_000,
))
Result:
index_find 0.35047444528454114
partition_find 0.5327825636197754
re_find 7.552149639286381
re_find was more than 20 times slower than index_find in this example.

My method would be something like this:
find the index of the start string in s => i
find the index of the end string in s, searching from i+len(start) => j
substring = the characters from i+len(start) to j-1 (inclusive)

I posted this before as a code snippet on Daniweb (Python 2):
# picking up piece of string between separators
# function using partition, like partition, but drops the separators
def between(left, right, s):
    before, _, a = s.partition(left)
    a, _, after = a.partition(right)
    return before, a, after
s = "bla bla blaa <a>data</a> lsdjfasdjöf (important notice) 'Daniweb forum' tcha tcha tchaa"
print between('<a>','</a>',s)
print between('(',')',s)
print between("'","'",s)
""" Output:
('bla bla blaa ', 'data', " lsdjfasdj\xc3\xb6f (important notice) 'Daniweb forum' tcha tcha tchaa")
('bla bla blaa <a>data</a> lsdjfasdj\xc3\xb6f ', 'important notice', " 'Daniweb forum' tcha tcha tchaa")
('bla bla blaa <a>data</a> lsdjfasdj\xc3\xb6f (important notice) ', 'Daniweb forum', ' tcha tcha tchaa')
"""

Parsing text with delimiters from different email platforms posed a larger version of this problem. The placeholders generally have a START and a STOP marker. Delimiter characters that double as regex wildcards kept choking my regex. The problem with split is mentioned here and elsewhere: the delimiter itself is consumed. It occurred to me to use replace() to give split() something else to consume. Chunk of code:
nuke = '~~~'
start = '|*'
stop = '*|'
julien = (textIn.replace(start,nuke + start).replace(stop,stop + nuke).split(nuke))
keep = [chunk for chunk in julien if start in chunk and stop in chunk]
logging.info('keep: %s',keep)
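
A runnable sketch of the same trick, with a hypothetical sample string standing in for textIn. It also shows a regex alternative, on the assumption that the earlier regex trouble came from unescaped | and * characters; re.escape makes them literal.

```python
import re

text_in = 'Hi |*FNAME*|, your code is |*CODE*|.'  # hypothetical sample input
nuke, start, stop = '~~~', '|*', '*|'

# replace()/split() approach from above: pad each delimiter with a throwaway
# separator so split() consumes the separator instead of the delimiter.
pieces = text_in.replace(start, nuke + start).replace(stop, stop + nuke).split(nuke)
keep = [chunk for chunk in pieces if start in chunk and stop in chunk]

# Regex alternative: escape the delimiters and match lazily between them.
pattern = re.escape(start) + r'.*?' + re.escape(stop)
keep_re = re.findall(pattern, text_in)

print(keep)
print(keep_re)
```

The replace()/split() version relies on the separator ('~~~' here) never occurring in the input.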

Following on from Nikolaus Gradwohl's answer, I needed to get the version number (i.e., 0.0.2) between 'ui:' and '-' from the file content below (filename: docker-compose.yml):
version: '3.1'
services:
ui:
image: repo-pkg.dev.io:21/website/ui:0.0.2-QA1
#network_mode: host
ports:
- 443:9999
ulimits:
nofile:test
and this is how it worked for me (Python script):
import re
with open('docker-compose.yml', 'r') as f:
    lines = f.read()
result = re.search('ui:(.*)-', lines)
print(result.group(1))
Result:
0.0.2

This seems much more straightforward to me:
import re
s = 'asdf=5;iwantthis123jasd'
x= re.search('iwantthis',s)
print(s[x.start():x.end()])
