Python - how to find string and remove string plus next x characters - python-3.x

I have the following string:
mystr = '(string_to_delete_20221012_11-36) keep this (string_to_delete_20221016_22-22) keep this (string_to_delete_20221017_20-55) keep this'
I want to delete every entry of the form (string_to_deletexxxxxxxxxxxxxxx), including the trailing space.
In pseudocode, I need something like this:
If you find the string (string_to_delete, remove it together with the timestamp, the closing parenthesis and the trailing space, e.g. delete (string_to_delete_20221012_11-36) and the space after it.
I would use a list comprehension, but since not all of the text is inside parentheses, I cannot see how to build the list with string.split().
Is this something that needs regular expressions?

This seems like a good place for a regex:
import re
pattern = r'\(string_to_delete_.*?\)\s*'
mystr = '(string_to_delete_20221012_11-36) keep this (string_to_delete_20221016_22-22) keep this (string_to_delete_20221017_20-55) keep this'
for match in re.findall(pattern, mystr):
    mystr = mystr.replace(match, '', 1)  # replace the first occurrence of the matched text with an empty string
print(mystr)
This results in:
>> keep this keep this keep this
Brief regex breakdown of \(string_to_delete_.*?\)\s*
\( matches a literal left parenthesis (the escape is needed)
string_to_delete_ matches that text literally
.*? matches zero or more characters, as few as possible (lazy), so it stops at the first closing parenthesis
\) matches the closing parenthesis
\s* matches zero or more whitespace characters after it
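The findall/replace loop can probably be collapsed into a single re.sub call, which replaces every non-overlapping match in one pass. A minimal sketch reusing the pattern from the answer; the names cleaned and strict_pattern are only illustrative, and the stricter pattern assumes every timestamp has the same shape as in the sample string (8 digits, an underscore, then HH-MM):

```python
import re

mystr = ('(string_to_delete_20221012_11-36) keep this '
         '(string_to_delete_20221016_22-22) keep this '
         '(string_to_delete_20221017_20-55) keep this')

# Same pattern as in the answer: lazy match up to the first ')', then any whitespace.
pattern = re.compile(r'\(string_to_delete_.*?\)\s*')
cleaned = pattern.sub('', mystr)
print(cleaned)  # keep this keep this keep this

# Assumption: timestamps always look like 20221012_11-36.
strict_pattern = re.compile(r'\(string_to_delete_\d{8}_\d{2}-\d{2}\)\s*')
cleaned_strict = strict_pattern.sub('', mystr)
print(cleaned_strict)  # keep this keep this keep this
```

The stricter variant leaves the text untouched if a parenthesised entry does not carry a timestamp in that format, which may or may not be what you want.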

Related

Python regular expression to remove only specific square brackets and question mark in a string by leaving other square brackets unchanged

I am trying to remove only a specific pattern from a string using a regex. The pattern includes braces and question marks, and all other braces and question marks should be left unchanged.
Here is the code I tried:
import re
string = "aaaa{v?}a $1?{23ru{n?}kkkk"
pattern = '[{?}]'
replace = ''
new_string = re.sub(pattern, replace, string)
print(new_string)
It generates the output below:
"aaaava $123runkkkk"
I want the output to look like this:
"aaaava $1?{23runkkkk"
Notice that the braces ({}) and the question mark (?) are removed only where they appear in the form {v?} or {n?}, and the letter inside is kept.
The braces and question marks everywhere else are unchanged.
You can use
re.sub(r'{([a-z])\?}', r'\1', text)
See the regex demo. Details:
{ - a { char
([a-z]) - Group 1 (\1 refers to this group text from the replacement pattern): any lowercase ASCII letter
\? - a ? char
} - a } char.
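Applied to the string from the question, the suggested substitution can be checked like this. It is a small sketch; the braces are escaped here only for readability, since Python's re module also accepts them unescaped in this position:

```python
import re

string = "aaaa{v?}a $1?{23ru{n?}kkkk"

# Match '{', one lowercase letter (captured as group 1), '?', '}' and keep only the letter.
new_string = re.sub(r'\{([a-z])\?\}', r'\1', string)
print(new_string)  # aaaava $1?{23runkkkk
```

The lone "?" after "$1" and the unmatched "{" before "23" are left alone because neither is part of a complete {letter?} group.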

Replace matched substring using re.sub

Is there a way to replace the matched pattern substring using a single re.sub() call?
I would like to avoid applying a string replace method to the output of the current re.sub().
Input = "/J&L/LK/Tac1_1/shareloc.pdf"
Current output using re.sub("[^0-9_]", "", Input): "1_1"
Desired output from a single re.sub call: "1.1"
According to the documentation, re.sub is defined as
re.sub(pattern, repl, string, count=0, flags=0)
If repl is a function, it is called for every non-overlapping occurrence of pattern.
That said, if you pass a lambda function, you can keep the code on one line. Also remember that the text matched by the whole pattern is available on the match object as m[0] (group 0).
I removed _ from the character class so that it is matched as well and can be replaced with a dot.
import re
txt = "/J&L/LK/Tac1_1/shareloc.pdf"
# '_' becomes '.', every other non-digit is removed; compare with ==, not `is`
x = re.sub("[^0-9]", lambda m: '.' if m[0] == '_' else '', txt)
print(x)
There is no way to make a string replacement pattern in Python re.sub choose between two possible replacement strings, as re.sub has no conditional replacement construct. So either use a callable as the replacement argument or use another workaround.
It looks like you only expect one match of <DIGITS>_<DIGITS> in the input string. In this case, you can use
import re
text = "/J&L/LK/Tac1_1/shareloc.pdf"
print( re.sub(r'^.*?(\d+)_(\d+).*', r'\1.\2', text, flags=re.S) )
# => 1.1
See the Python demo. See the regex demo. Details:
^ - start of string
.*? - zero or more chars as few as possible
(\d+) - Group 1: one or more digits
_ - a _ char
(\d+) - Group 2: one or more digits
.* - zero or more chars as many as possible.
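Note that re.sub returns the input unchanged when the pattern does not match. If that case needs to be detected, one option is to swap re.sub for re.search and build the result from the groups. This is a different technique from the single re.sub call above, shown only as a sketch; the helper name digits_with_dot is hypothetical:

```python
import re

def digits_with_dot(text):
    """Return '<digits>.<digits>' for the first '<digits>_<digits>' found, else None."""
    match = re.search(r'(\d+)_(\d+)', text)
    if match is None:
        return None
    return f"{match.group(1)}.{match.group(2)}"

print(digits_with_dot("/J&L/LK/Tac1_1/shareloc.pdf"))  # 1.1
print(digits_with_dot("/no/underscored/digits.pdf"))   # None
```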

find regex expression based character match

I have a list of strings something like this:
a=['bukt/id=gdhf/year=989/month=98/day=12/hgjhg.csv','bukt/id=76fhfh/year=989/month=08/day=128/hkngjhg.csv']
The ids are unique. I want an output list that looks like this:
output_list = ['bukt/id=gdhf/','bukt/id=76fhfh/']
So basically I need a regular expression that matches any id and removes the rest of the string.
How can I do that in the most efficient way, considering that the input list has more than 100K entries?
import re
rgx = re.compile(r'(bukt/id=[a-zA-Z0-9]+/).+')
output_list = [rgx.search(s).group(1) for s in a]  # a is the input list from the question
The result is in group 1. The pattern captures "bukt/id=", followed by one or more alphanumeric characters and a slash, and discards the rest.
There's no need for regex, you can just split your string on /, discard everything after the second / and then join again with /:
a=['bukt/id=gdhf/year=989/month=98/day=12/hgjhg.csv','bukt/id=76fhfh/year=989/month=08/day=128/hkngjhg.csv']
out = ['/'.join(u.split('/')[:2]) for u in a]
print(out)
Output:
['bukt/id=gdhf', 'bukt/id=76fhfh']
If you want the trailing /, just add an empty string to the end of the split array:
out = ['/'.join(u.split('/')[:2] + ['']) for u in a]
Output:
['bukt/id=gdhf/', 'bukt/id=76fhfh/']
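Since only the first two path components are needed, the split can be limited with the maxsplit argument so that the rest of each string is not broken up. A minimal sketch under that assumption; the helper name id_prefix is hypothetical, and whether this is measurably faster on 100K entries would need to be timed on real data:

```python
a = ['bukt/id=gdhf/year=989/month=98/day=12/hgjhg.csv',
     'bukt/id=76fhfh/year=989/month=08/day=128/hkngjhg.csv']

def id_prefix(path):
    # Split at most twice: ['bukt', 'id=...', '<rest of the path>'].
    head = path.split('/', 2)[:2]
    # Re-join the two kept components and restore the trailing slash.
    return '/'.join(head) + '/'

output_list = [id_prefix(u) for u in a]
print(output_list)  # ['bukt/id=gdhf/', 'bukt/id=76fhfh/']
```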

Wrong matching regex

I'm using the re module to compile my regex, which looks like this:
"(^~\w+?[ & ~\w+?]*?$)"
I compile it using pattern = re.compile(regex) and then use re.findall(pattern, string) to check whether the given string matches and, if it does, to get the group.
String that I'm matching is "v1 V ~v2_ V ~~v3".
I'd expect no match, but it reports that the string matches the regular expression. I suspect that \w+ matches whitespace, so that it matches the whole string, but I could not confirm that in the documentation. What am I missing?
Here is a minimal reproducible example:
import re
test_string = "v1 V ~v2_ V ~~v3"
regex = r"(^~*\w+?[ & ~*\w+?]*?$)"  # raw string, so \w is passed to re unchanged
pattern = re.compile(regex)
for elem in pattern.findall(test_string):
    print(elem)
If you expect no match, I think your problem is the [ & ~*\w+?]* part.
Square brackets define a character class, which matches a single character from the set: in this case a space, &, ~, *, +, ?, or any word character. The asterisk (*) after the brackets then allows zero or more such characters.
If what you wanted was to match the sub-regex & ~*\w+? zero or more times, use parentheses.
So I would say that you wanted this regex: (^~*\w+?( & ~*\w+?)*?$) (just replace the brackets with parentheses).
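The difference between the two forms can be checked directly. In this sketch the outer capturing group is dropped and a non-capturing group (?:...) is used, so that only a yes/no answer is inspected; the variable names and the third test string are made up for illustration:

```python
import re

test_string = "v1 V ~v2_ V ~~v3"

# Character class: any mix of spaces, '&', '~', '*', '+', '?' and word characters.
bracket_regex = r"^~*\w+?[ & ~*\w+?]*?$"
# Group: the literal sequence ' & ', optional '~'s, then a word, repeated.
group_regex = r"^~*\w+?(?: & ~*\w+?)*?$"

print(bool(re.search(bracket_regex, test_string)))        # True
print(bool(re.search(group_regex, test_string)))          # False
print(bool(re.search(group_regex, "v1 & ~v2_ & ~~v3")))   # True
```

So \w never matches whitespace; the space is accepted only because it is listed inside the character class.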

Python List Formatting and Updation

I have a list, e.g. a = ["dgbbgfbjhffbjjddvj/n//n//n' "]
How do I remove the trailing newlines, i.e. all the /n sequences, along with the extra single quote at the end?
Expected result = ["dfgjhgjjhgfjjfgg"] (I typed it randomly)
You can use the string rstrip() method.
Usage:
str.rstrip([chars])
where chars is the set of characters to trim; whitespace is the default when no argument is provided.
Example:
a = ['Return a copy of the string\n', 'with trailing characters removed\n\n']
[i.rstrip('\n') for i in a]
result:
['Return a copy of the string', 'with trailing characters removed']
more about strip():
https://www.tutorialspoint.com/python3/string_rstrip.htm
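Note that the argument to rstrip() is a set of characters, not a suffix. A sketch applied to the list from the question, assuming the "/n" there is a literal slash followed by the letter n rather than a real newline; the regex variant and the name trailing_junk are only one possible alternative when the text itself may end in one of those characters:

```python
import re

a = ["dgbbgfbjhffbjjddvj/n//n//n' "]

# Strip any trailing run of '/', 'n', "'" and space characters.
cleaned = [s.rstrip("/n' ") for s in a]
print(cleaned)  # ['dgbbgfbjhffbjjddvj']

# Caveat: a real trailing 'n' in the text is stripped as well.
print("queen/n".rstrip("/n"))  # quee

# Alternative: remove only whole '/n' style sequences, an optional quote and spaces.
trailing_junk = re.compile(r"(?:/+n)*'?\s*$")
cleaned_re = [trailing_junk.sub('', s) for s in a]
print(cleaned_re)                      # ['dgbbgfbjhffbjjddvj']
print(trailing_junk.sub('', "queen/n"))  # queen
```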
