re.MULTILINE flag is interfering with the end of line $ operator - python-3.x

Sorry if this is a duplicate/basic question, I couldn't find any similar questions.
I have the following multiline string
my_txt = """
foo.exe\n
bar.exec\n
abab.exe\n
"""
(The newlines aren't actually written in my code, I put them there for clarity).
I want to match every file that ends with a .exe, (not .exec).
My regex was initially:
my_reg = re.compile(".+[.](?=exe$)")
my_matches = my_reg.finditer(my_txt)
I hoped that it would first find every character, go back until it found the ., and then check if the characters exe and a newline followed.
Only one match was found, and that was:
abab.exe.
I tried to mess around a bit, and changed the first line:
my_reg = re.compile(".+[.](?=exe$)",flags=re.MULTILINE).
This time, it successfully ran, returning
foo.
abab.
I thought re.MULTILINE wasn't supposed to interfere with the $ operator, or am I wrong about the $ operator/misusing something?
Thanks in advance!

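You do need the multiline flag; otherwise $ only matches at the very end of your input (or just before a final trailing newline). The extension is missing from your matches because a lookahead checks for exe without consuming it. You just need to match exe instead of using a lookahead: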
my_reg = re.compile(".+[.]exe$", re.MULTILINE)
Output:
['foo.exe', 'abab.exe']
If you are trying to match the filename without the extension, you can put the period inside the lookahead:
my_reg = re.compile(r".+(?=\.exe$)", re.MULTILINE)
Output:
['foo', 'abab']
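To see the effect of the flag side by side, here is a minimal sketch that runs the patterns from this answer against the question's string. The variable names (without_flag, full_names, stems) are only illustrative.

```python
import re

# The string from the question, with the newlines written out explicitly.
my_txt = "\nfoo.exe\nbar.exec\nabab.exe\n"

# Without re.MULTILINE, $ only matches at the end of the whole string
# (or just before its final newline), so only the last line can match.
without_flag = re.findall(r".+[.]exe$", my_txt)

# With re.MULTILINE, $ also matches at the end of every line.
full_names = re.findall(r".+[.]exe$", my_txt, re.MULTILINE)

# Moving the period into the lookahead leaves only the name in the match.
stems = re.findall(r".+(?=\.exe$)", my_txt, re.MULTILINE)

print(without_flag)
print(full_names)
print(stems)
```

bar.exec is rejected in every case because the character after exe is c rather than an end of line.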

Related

How to get the content after a string using regex in python

I have the following string:
A5697[2:10] = {ravi, rageev, raghav, smith};
I want the content after "A5697[2:10] =". So, my output should be:
{ravi, rageev, raghav, smith};
This is my code:
print(re.search(r'(?<=A\d+\[.*\] =\s).*', line).group())
But this gives an error:
sre_constants.error: look-behind requires fixed-width pattern
Can anyone help to solve this issue? I would prefer to use regex.
You can try re.sub, as below. Since you have given only one data point, I am assuming all the other data points follow a similar pattern.
import re
text = "A5697[2:10] = {ravi, rageev, raghav, smith}"
re.sub(r'(A\d+\[\d+:\d+\]\s+=\s+)(.+)', r'\2', text)
returns,
'{ravi, rageev, raghav, smith}'
re.sub replaces the entire match with the contents of the second capturing group, which captures everything after '= '.
Simply replace the bits you don't want:
print(re.sub(r'A\d[^=]*= *', '', line))
See demo here: https://rextester.com/NSG17655
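If you would rather keep re.search, as in the question, one option is to drop the look-behind and capture the wanted part in a group instead. This is only a sketch: it assumes every line has the same "name[range] = value" shape as the single example given.

```python
import re

line = "A5697[2:10] = {ravi, rageev, raghav, smith};"

# Match the prefix normally (no look-behind, so variable width is fine)
# and capture everything after the "=" in group 1.
match = re.search(r'A\d+\[[^\]]*\]\s*=\s*(.*)', line)

# Guard against lines that do not match at all.
content = match.group(1) if match else None
print(content)
```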

multiple variable in python regex

I have seen several related posts and several forums to find an answer for my question, but nothing has come up to what I need.
I am trying to use variables instead of hard-coded values in a regex that searches for any one of several words in a line.
However, I only get the desired result if I don't use variables.
<http://www.somesite.com/software/sub/a1#Msoffice>
<http://www.somesite.com/software/sub1/a1#vlc>
<http://www.somesite.com/software/sub2/a2#dell>
<http://www.somesite.com/software/sub3/a3#Notepad>
re.search(r"\#Msoffice|#vlc|#Notepad", line)
This regex will return the line which has #Msoffice OR #vlc OR #Notepad.
I tried defining a single variable using re.escape and that worked absolutely fine. However, I have tried many combinations using | and , (pipe and comma) with no success.
Is there any way I can specify #Msoffice, #vlc and #Notepad in different variables, so that I can change them later?
Thanks in advance!!
If I understood you correctly, you'd like to insert variables into your regex.
You are using a raw string (r' ') to make the regex more readable. An f-string (f' ') lets you insert any variable using {your_var}, so you can construct your regex as you like:
var1 = '#Msoffice'
var2 = '#vlc'
var3 = '#Notepad'
re.search(f'{var1}|{var2}|{var3}', line)
The most annoying issue is that, without the r prefix, every backslash in the pattern has to be doubled (for example \\d instead of \d).
Hope it helped
import re
lines = ["<http://www.somesite.com/software/sub/a1#Msoffice>",
"<http://www.somesite.com/software/sub1/a1#vlc>",
"<http://www.somesite.com/software/sub2/a2#dell>",
"<http://www.somesite.com/software/sub3/a3#Notepad>"]
for line in lines:
    if re.search(r'\b(?:\#{}|\#{}|\#{})\b'.format('Msoffice', 'vlc', 'Notepad'), line):
        print(line)
Output :
<http://www.somesite.com/software/sub/a1#Msoffice>
<http://www.somesite.com/software/sub1/a1#vlc>
<http://www.somesite.com/software/sub3/a3#Notepad>
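Since the question mentions re.escape, another way to combine several variables is to escape each one and join them with |. The sketch below assumes the keywords live in a list (the names keywords, pattern and matching are made up for illustration), which makes it easy to change them later.

```python
import re

lines = ["<http://www.somesite.com/software/sub/a1#Msoffice>",
         "<http://www.somesite.com/software/sub1/a1#vlc>",
         "<http://www.somesite.com/software/sub2/a2#dell>",
         "<http://www.somesite.com/software/sub3/a3#Notepad>"]

# Hypothetical variables holding the values to search for.
keywords = ['#Msoffice', '#vlc', '#Notepad']

# re.escape protects any regex metacharacters in the values;
# '|'.join builds the alternation between them.
pattern = re.compile('|'.join(re.escape(k) for k in keywords))

matching = [line for line in lines if pattern.search(line)]
for line in matching:
    print(line)
```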

set function with file- python3

I have a text file with the following content:
Credit
Debit
21/12/2017
09:10:00
I have written Python code to convert the text into a set and discard \n.
with open('text_file_name', 'r') as file1:
    same = set(file1)
print (same)
print (same.discard('\n'))
For the first print statement, print (same), I get the correct result:
{'Credit\n','Debit\n','21/12/2017\n','09:10:00\n'}
But for the second print statement, print (same.discard('\n')), I get
None.
Can anybody help me figure out why I am getting None? I am using same.discard('\n') to discard \n from the set.
Note:
I am trying to understand the discard function with respect to set.
The discard method only removes an element that is equal to its argument. Your set doesn't contain the bare string '\n' (only strings that end with it), so there is nothing to discard. What you are looking for is a map that strips the \n from each element, like so:
set(map(lambda x: x.rstrip('\n'), same))
which will return {'Credit', 'Debit', '09:10:00', '21/12/2017'} as the set. This works by using the map builtin, which applies its first argument to each element of the set. Here the first argument is lambda x: x.rstrip('\n'), which removes any trailing \n characters from each string.
discard removes the given element from the set only if it is present.
In addition, the method doesn't return a value (which is why printing its result shows None), because it modifies the set it is called on in place.
with open('text_file_name', 'r') as file1:
    same = set(file1)
print (same)
# Drop the trailing newline where there is one; keep other elements unchanged.
same = {elem[:-1] if elem.endswith('\n') else elem for elem in same}
print (same)
There are 4 elements in the set, and none of them are newline.
It would be more usual to use a list in this case: a list preserves order, while a set is not guaranteed to, and a set also discards duplicate lines. Perhaps you have your reasons.
You seem to be looking for rstrip('\n'). Consider processing the file in this way:
s = set()  # note: {} would create an empty dict, which has no add()
with open('text_file_name') as file1:
    for line in file1:
        s.add(line.rstrip('\n'))
s.discard('Credit')
print(s) # This displays 3 elements, without trailing newlines.
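The behaviour discussed in these answers can be checked without a file. The following sketch stands in a hard-coded list for the file's lines (an assumption made for illustration) and shows that discard returns None, leaves the set alone when the element is absent, and modifies it in place when the element is present.

```python
# Stand-in for iterating over the file: each line keeps its newline.
lines = ['Credit\n', 'Debit\n', '21/12/2017\n', '09:10:00\n']
same = set(lines)

# '\n' on its own is not an element, so nothing is removed,
# and discard() returns None either way.
result = same.discard('\n')
size_after_discard = len(same)

# Strip the newline from each element to build a cleaned set.
cleaned = {line.rstrip('\n') for line in same}

# Now 'Credit' really is an element, so it is removed in place.
cleaned.discard('Credit')

print(result, size_after_discard, cleaned)
```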

str.format places last variable first in print

The purpose of this script is to parse a text file (sys.argv[1]), extract certain strings, and print them in columns. I start by printing the header. Then I open the file, and scan through it, line by line. I make sure that the line has a specific start or contains a specific string, then I use regex to extract the specific value.
The matching and extraction work fine.
My final print statement doesn't work properly.
import re
import sys
print("{}\t{}\t{}\t{}\t{}".format("#query", "target", "e-value",
                                  "identity(%)", "score"))
with open(sys.argv[1], 'r') as blastR:
    for line in blastR:
        if line.startswith("Query="):
            queryIDMatch = re.match('Query= (([^ ])+)', line)
            queryID = queryIDMatch.group(1)
            queryID.rstrip
        if line[0] == '>':
            targetMatch = re.match('> (([^ ])+)', line)
            target = targetMatch.group(1)
            target.rstrip
        if "Score = " in line:
            eValue = re.search(r'Expect = (([^ ])+)', line)
            trueEvalue = eValue.group(1)
            trueEvalue = trueEvalue[:-1]
            trueEvalue.rstrip()
            print('{0}\t{1}\t{2}'.format(queryID, target, trueEvalue), end='')
The problem occurs when I try to print the columns. When I print the first 2 columns, it works as expected (except that it's still printing new lines):
#query target e-value identity(%) score
YAL002W Paxin1_129011
YAL003W Paxin1_167503
YAL005C Paxin1_162475
YAL005C Paxin1_167442
The 3rd column is a number in scientific notation like 2e-34
But when I add the 3rd column, eValue, it breaks down:
#query target e-value identity(%) score
YAL002W Paxin1_129011
4e-43YAL003W Paxin1_167503
1e-55YAL005C Paxin1_162475
0.0YAL005C Paxin1_167442
0.0YAL005C Paxin1_73182
I have removed all newlines, as far as I know, using the rstrip() method.
At least three problems:
1) queryID.rstrip and target.rstrip are missing the call parentheses (), so the method is never actually called.
2) Something like trueEvalue.rstrip() doesn't mutate the string (strings are immutable); you would need
trueEvalue = trueEvalue.rstrip()
if you want to keep the change.
3) This might be a problem, but without seeing your data I can't be 100% sure. The r in rstrip stands for "right". If trueEvalue is 4e-43\n, then it is true that trueEvalue.rstrip() would be free of newlines. But your values seem to be something like \n4e-43. If you simply use .strip(), newlines will be removed from both sides.
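The three points above can be seen in a few lines. This is a small sketch with made-up sample values, since the real input file isn't shown.

```python
value = "4e-43\n"

value.rstrip      # point 1: only names the method; nothing is called
value.rstrip()    # point 2: the stripped copy is created and thrown away
unchanged = value  # still ends with the newline

stripped = value.rstrip()  # keep the result by assigning it

# Point 3: rstrip() only touches the right-hand side,
# while strip() removes whitespace from both ends.
leading = "\n4e-43"
right_only = leading.rstrip()
both_sides = leading.strip()

print(repr(unchanged), repr(stripped), repr(right_only), repr(both_sides))
```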

Lua pattern to stop when end of line

I need help with a Lua pattern that should stop reading at a line break.
My code:
function getusers(file)
  local list, close = {}
  local user, value = string.match(file,"(UserName=)(.*)")
  print(value)
  f:close()
end
f = assert(io.open('file2.ini', "r"))
local t = f:read("*all")
getusers(t)
--file2.ini--
user=a
UserName=Tom
Password=xyz
UserName=Jane
Output of script using file2.ini:
Tom
Password=xyz
UserName=Jane
How to get the pattern to stop after it reaches the end of line?
You can use the pattern
"(UserName=)(.-)\n"
Note that besides the extra \n, the lazy modifier - is used instead of *.
As @lhf points out, make sure the file ends with a newline. I think you can append a \n to the string manually before matching.

Resources