regex for blank string - python-3.x

I have a string as:
s=
"(2021-06-29T10:53:42.647Z) [Denis]: hi
(2021-06-29T10:54:53.693Z) [Nicholas]: TA FOR SHOWING
(2021-06-29T11:58:29.053Z) [Nicholas]: how are you bane
(2021-06-29T11:58:29.053Z) [Nicholas]:
(2021-06-29T11:58:29.053Z) [Nicholas]: #END_REMOTE#
(2021-06-30T08:07:42.029Z) [Denis]: VAL 01JUL2021
(2021-06-30T08:07:42.029Z) [Denis]: ##ENDED AT 08:07 GMT##"
I want to extract the text from it. Expected output as:
comments=['hi','TA FOR SHOWING','how are you bane',' ','#END_REMOTE#','VAL 01JUL2021','##ENDED AT 08:07 GMT##']
What I have tried is:
comments=re.findall(r']:\s+(.*?)\n',s)
The regex works well, but I'm not able to get the blank message back as ''.

Instead of .*?, you can match any character except ] inside the capture group. If you also want to match the value on the last line, assert the end of the line with $ (together with re.MULTILINE) instead of requiring a newline with \n.
Note that \s can match a newline, and so can the negated character class [^]]*.
]:\s+([^]]*)$
import re
regex = r"]:\s+([^]]*)$"
s = ("(2021-06-29T10:53:42.647Z) [Denis]: hi\n"
"(2021-06-29T10:54:53.693Z) [Nicholas]: TA FOR SHOWING\n"
"(2021-06-29T11:58:29.053Z) [Nicholas]: how are you bane \n"
"(2021-06-29T11:58:29.053Z) [Nicholas]: \n"
"(2021-06-29T11:58:29.053Z) [Nicholas]: #END_REMOTE#\n"
"(2021-06-30T08:07:42.029Z) [Denis]: VAL 01JUL2021\n"
"(2021-06-30T08:07:42.029Z) [Denis]: ##ENDED AT 08:07 GMT##")
print(re.findall(regex, s, re.MULTILINE))
Output
['hi', 'TA FOR SHOWING', 'how are you bane ', '', '#END_REMOTE#', 'VAL 01JUL2021', '##ENDED AT 08:07 GMT##']
If you don't want to cross lines:
]:[^\S\n]+([^]\n]*)$
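A quick check of the line-bounded variant, as a sketch on a shortened version of the sample (the names `log` and `pattern` are placeholders, not from the answer):

```python
import re

# Shortened sample in the same format as the question's log.
log = (
    "(2021-06-29T11:58:29.053Z) [Nicholas]: how are you bane\n"
    "(2021-06-29T11:58:29.053Z) [Nicholas]: \n"
    "(2021-06-29T11:58:29.053Z) [Nicholas]: #END_REMOTE#"
)

# [^\S\n] matches any whitespace except a newline, and [^]\n] excludes
# the newline as well, so a match can never spill into the next line.
pattern = r"]:[^\S\n]+([^]\n]*)$"

comments = re.findall(pattern, log, re.MULTILINE)
print(comments)  # ['how are you bane', '', '#END_REMOTE#']
```

The blank message comes back as '' because the single space after the colon is consumed by [^\S\n]+ and the group is allowed to match nothing.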

You could collect everything after the colon into a list from capture group 1.
re.findall(r'(?m):[ \t]+(.*?)[ \t]*$',s)
Then loop over the list, replacing each empty element with a single space.
>>> import re
>>>
>>> s= """
... (2021-06-29T10:53:42.647Z) [Denis]: hi
... (2021-06-29T10:54:53.693Z) [Nicholas]: TA FOR SHOWING
... (2021-06-29T11:58:29.053Z) [Nicholas]: how are you bane
... (2021-06-29T11:58:29.053Z) [Nicholas]:
... (2021-06-29T11:58:29.053Z) [Nicholas]: #END_REMOTE#
... (2021-06-30T08:07:42.029Z) [Denis]: VAL 01JUL2021
... (2021-06-30T08:07:42.029Z) [Denis]: ##ENDED AT 08:07 GMT##
... """
>>>
>>> talk = [re.sub('^$', ' ', w) for w in re.findall(r'(?m):[ \t]+(.*?)[ \t]*$',s)]
>>> print(talk)
['hi', 'TA FOR SHOWING', 'how are you bane', ' ', '#END_REMOTE#', 'VAL 01JUL2021', '##ENDED AT 08:07 GMT##']
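The empty-string substitution can also be folded into the comprehension without a second regex. A minimal sketch on a shortened sample (the name `log` is a placeholder, and the `or` idiom is a substitute for the `re.sub('^$', ...)` step above, not part of the answer):

```python
import re

# Shortened sample in the question's format; the middle message is blank.
log = (
    "(2021-06-29T10:53:42.647Z) [Denis]: hi\n"
    "(2021-06-29T11:58:29.053Z) [Nicholas]: \n"
    "(2021-06-30T08:07:42.029Z) [Denis]: ##ENDED AT 08:07 GMT##"
)

# An empty capture is falsy, so `w or " "` swaps it for a single space.
talk = [w or " " for w in re.findall(r"(?m):[ \t]+(.*?)[ \t]*$", log)]
print(talk)  # ['hi', ' ', '##ENDED AT 08:07 GMT##']
```

The colons inside the timestamps are skipped because the pattern requires at least one space or tab right after the colon.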

Is this what you want?
comments = re.findall(r']:\s(.*?)\n',s)
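If there is always exactly one space after the colon, use \s instead of \s+. \s+ matches one or more whitespace characters, including a newline, so on a blank message it also consumes the line break and the group captures the next line instead. Note that this pattern still requires a trailing \n, so a last line without one is not matched.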
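The difference between \s+ and \s is easiest to see on a small input. A sketch with a made-up three-line sample (the timestamps and names are shortened placeholders):

```python
import re

# Hypothetical sample; the middle message is blank (one space, then newline).
s = "(t1) [A]: hi\n(t2) [B]: \n(t3) [B]: bye\n"

# \s+ swallows the space AND the newline after the blank message,
# so the lazy group captures the entire following line.
with_plus = re.findall(r"]:\s+(.*?)\n", s)

# A single \s stops before the newline, so the group captures ''.
with_single = re.findall(r"]:\s(.*?)\n", s)

print(with_plus)    # ['hi', '(t3) [B]: bye']
print(with_single)  # ['hi', '', 'bye']
```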

With the samples you have shown, please try the following regex.
^\(\d{4}-\d{2}-\d{2}T(?:\d{2}:){2}\d{2}\.\d{3}Z\)\s+\[[^]]*\]:\s+([^)]*)$
Explanation: Adding detailed explanation for above.
^\(\d{4}-\d{2}-\d{2} ##Matching from starting of line ( followed by 4 digits-2 digits- 2 digits here.
T(?:\d{2}:){2} ##Matching T followed by a non-capturing group which is matching 2 digits followed by colon 2 times.
\d{2}\.\d{3}Z\)\s+ ##Matching 2 digits followed by dot followed by 3 digits Z and ) followed by space(s).
\[[^]]*\]:\s+ ##Matching literal [ till first occurrence of ] followed by ] colon and space(s).
([^)]*)$ ##Creating 1st capturing group, which matches any characters other than `)` up to the end of the line.
With Python 3.x:
import re
regex = r"^\(\d{4}-\d{2}-\d{2}T(?:\d{2}:){2}\d{2}\.\d{3}Z\)\s+\[[^]]*\]:\s+([^)]*)$"
varVal = ("(2021-06-29T10:53:42.647Z) [Denis]: hi\n"
"(2021-06-29T10:54:53.693Z) [Nicholas]: TA FOR SHOWING\n"
"(2021-06-29T11:58:29.053Z) [Nicholas]: how are you bane \n"
"(2021-06-29T11:58:29.053Z) [Nicholas]: \n"
"(2021-06-29T11:58:29.053Z) [Nicholas]: #END_REMOTE#\n"
"(2021-06-30T08:07:42.029Z) [Denis]: VAL 01JUL2021\n"
"(2021-06-30T08:07:42.029Z) [Denis]: ##ENDED AT 08:07 GMT##")
print(re.findall(regex, varVal, re.MULTILINE))
Output will be as follows with samples shown by OP:
['hi', 'TA FOR SHOWING', 'how are you bane ', '', '#END_REMOTE#', 'VAL 01JUL2021', '##ENDED AT 08:07 GMT##']

Related

Remove leading dollar sign from data and improve current solution

I have string like so:
"Job 1233:name_uuid (table n_Cars_1234567$20220316) done. Records: 24, with errors: 0."
I'd like to retrieve the date from the table name. So far I use:
"\$[0-9]+"
but this yields $20220316. How do I get only the date, without $?
I'd also like to get the table name: n_Cars_12345678$20220316
So far I have this:
pattern_table_info = "\(([^\)]+)\)"
pattern_table_name = "(?<=table ).*"
table_info = re.search(pattern_table_info, message).group(1)
table = re.search(pattern_table_name, table_info).group(0)
However, I'd like a simpler solution. How can I improve this?
EDIT:
Actually the table name should be:
n_Cars_12345678
So everything before the "$" sign and after "table"...how can this part of the string be retrieved?
You can use a regex with two capturing groups:
table\s+([^()]*)\$([0-9]+)
See the regex demo. Details:
table - a word
\s+ - one or more whitespaces
([^()]*) - Group 1: zero or more chars other than ( and )
\$ - a $ char
([0-9]+) - Group 2: one or more digits.
See the Python demo:
import re
text = "Job 1233:name_uuid (table n_Cars_1234567$20220316) done. Records: 24, with errors: 0."
rx = r"table\s+([^()]*)\$([0-9]+)"
m = re.search(rx, text)
if m:
    print(m.group(1))
    print(m.group(2))
Output:
n_Cars_1234567
20220316
You can write a single pattern with 2 capture groups:
\(table (\w+\$(\d+))\)
The pattern matches:
\(table
( Capture group 1
\w+\$ match 1+ word characters and $
(\d+) Capture group 2, match 1+ digits
) Close group 1
\) Match )
See a Regex demo and a Python demo.
import re
s = "Job 1233:name_uuid (table n_Cars_1234567$20220316) done. Records: 24, with errors: 0."
m = re.search(r"\(table (\w+\$(\d+))\)", s)
if m:
    print(m.group(1))
    print(m.group(2))
Output
n_Cars_1234567$20220316
20220316
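Following the edit to the question (table name before the $, date after it), the two parts can also be pulled out with named groups. This is a sketch only: the group names `table` and `date` are made up for illustration, and treating the digits as a YYYYMMDD date is an assumption based on the sample value.

```python
import re
from datetime import datetime

message = "Job 1233:name_uuid (table n_Cars_1234567$20220316) done. Records: 24, with errors: 0."

# `table` is the run of word characters between "table " and "$";
# `date` is the run of digits between "$" and the closing parenthesis.
m = re.search(r"\(table (?P<table>\w+)\$(?P<date>\d+)\)", message)
if m:
    table = m["table"]
    # Assumption: the digits encode a date as YYYYMMDD.
    day = datetime.strptime(m["date"], "%Y%m%d").date()
    print(table)  # n_Cars_1234567
    print(day)    # 2022-03-16
```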

extract substring from large string

I have a string as:
string="(2021-07-02 01:00:00 AM BST)
---
syl.hs has joined the conversation
(2021-07-02 01:00:23 AM BST)
---
e.wang
Good Morning
How're you?
(2021-07-02 01:05:11 AM BST)
---
wk.wang
Hi, I'm Good.
(2021-07-02 01:08:01 AM BST)
---
perter.derrek
we got the update on work.
It will get complete by next week.
(2021-07-15 08:59:41 PM BST)
---
ad.ft has left the conversation
---
* * *"
I want to extract the conversation text only (text in between name and timestamp) expected output as:
comments=['Good Morning How're you?','Hi, I'm Good.','we got the update on work.It will get complete by next week.']
What I have tried is:
comments=re.findall(r'---\s*\n(.(?:\n(?!(?:(\s\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}\s*[AP]M\s+GMT\s*)\w+\s*\n)?---).))',string)
You could use a single capture group:
^---\s*\n(?!.* has (?:joined|left) the conversation|\* \* \*)\S.*((?:\n(?!\(\d|---).*)*)
The pattern matches:
^ Start of a line (the pattern is used with re.M)
---\s*\n Match --- optional whitespace chars and a newline
(?!.* has (?:joined|left) the conversation|\* \* \*) Assert that the next line does not contain a "has joined the conversation" or "has left the conversation" part, and does not start with * * *
\S.* Match at least a non whitespace char at the start of the line and the rest of the line
( Capture group 1 (this will be returned by re.findall)
(?:\n(?!\(\d|---).*)* Match all following lines that do not start with ( and a digit, or with ---
) Close group 1
See a regex demo and a Python demo.
Example
import re
# s holds the conversation text from the question (named `string` there)
pattern = r"^---\s*\n(?!.* has (?:joined|left) the conversation|\* \* \*)\S.*((?:\n(?!\(\d|---).*)*)"
result = [m.strip() for m in re.findall(pattern, s, re.M) if m]
print(result)
Output
["Good Morning\nHow're you?", "Hi, I'm Good.", 'we got the update on work. \nIt will get complete by next week.']
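The question's expected output has the lines of each message joined on one line. One way to get that from the captures, sketched on a shortened copy of the sample (the whitespace-collapsing step is an addition, not part of the answer above):

```python
import re

# Shortened copy of the question's conversation.
string = (
    "(2021-07-02 01:00:00 AM BST)\n---\nsyl.hs has joined the conversation\n"
    "(2021-07-02 01:00:23 AM BST)\n---\ne.wang\nGood Morning\nHow're you?\n"
    "(2021-07-02 01:05:11 AM BST)\n---\nwk.wang\nHi, I'm Good.\n"
    "---\n* * *"
)

pattern = r"^---\s*\n(?!.* has (?:joined|left) the conversation|\* \* \*)\S.*((?:\n(?!\(\d|---).*)*)"

# str.split() with no arguments splits on any run of whitespace, so
# re-joining with " " turns embedded newlines into single spaces.
comments = [" ".join(m.split()) for m in re.findall(pattern, string, re.M) if m]
print(comments)  # ["Good Morning How're you?", "Hi, I'm Good."]
```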
I've assumed:
The text of interest begins after a block of three lines: a line containing a timestamp, followed by the line "---", which may be padded to the right with spaces, followed by a line comprised of a string of letters containing one period which is neither at the beginning nor end of that string and that string may be padded on the right with spaces.
The block of text of interest may contain blank lines, a blank line being a string that contains nothing other than spaces and a line terminator.
The last line of the block of text of interest cannot be a blank line.
I believe the following regular expression (with the multiline (m) and case-insensitive (i) flags set) meets these requirements.
^\(\d{4}\-\d{2}\-\d{2} .*\) *\r?\n-{3} *\r?\n[a-z]+\.[a-z]+ *\r?\n((?:.*[^ (\n].*\r?\n| *\r?\n(?=(?: *\r?\n)*(?!\(\d{4}\-\d{2}\-\d{2} .*\)).*[^ (\n]))*)
The blocks of lines of interest are contained in capture group 1.
The elements of the expression are as follows.
^\(\d{4}\-\d{2}\-\d{2} .*\) *\r?\n # match timestamp line
-{3} *\r?\n # match 3-hyphen line
[a-z]+\.[a-z]+ *\r?\n # match name
( # begin capture group 1
(?: # begin non-capture group (a)
.*[^ (\n].*\r?\n # match a non-blank line
| # or
\ *\r?\n # match a blank line
(?= # begin a positive lookahead
(?: # begin non-capture group (b)
\ *\r?\n # match a blank line
)* # end non-capture group b and execute 0+ times
(?! # begin a negative lookahead
\(\d{4}\-\d{2}\-\d{2} .*\) # match timestamp line
) # end negative lookahead
.*[^ (\n] # match a non-blank line
) # end positive lookahead
)* # end non-capture group a and execute 0+ times
) # end capture group 1
Here is a self-documenting regex that will strip leading and trailing whitespace:
(?x)(?m)(?s) # re.X, re.M, re.S (DOTALL)
(?: # start of non capturing group
^\(\d{4}-\d{2}-\d{2}\ \d{2}:\d{2}:\d{2}\ [AP]M\ BST\)\s*\r?\n # date and time
(?!---\s*\r?\nad\.ft has) # next lines are not the ---\nad.ft etc.
---\s*\r?\n # --- line
[\w.]+\s*\r?\n # name line
\s* # skip leading whitespace
) # end of non-capture group
# The following is capture group 1. Match characters until you get to the next date-time:
((?:(?!\s*\r?\n\(\d{4}-\d{2}-\d{2}\ \d{2}:\d{2}:\d{2}\ [AP]M\ BST\)).)*)# skip trailing whitespace
import re
string = """(2021-07-02 01:00:00 AM BST)
---
syl.hs has joined the conversation
(2021-07-02 01:00:23 AM BST)
---
e.wang
Good Morning
How're you?
(2021-07-02 01:05:11 AM BST)
---
wk.wang
Hi, I'm Good.
(2021-07-02 01:08:01 AM BST)
---
perter.derrek
we got the update on work.
It will get complete by next week.
(2021-07-15 08:59:41 PM BST)
---
ad.ft has left the conversation
---
* * *"""
regex = r'''(?x)(?m)(?s) # re.X, re.M, re.S (DOTALL)
(?: # start of non capturing group
^\(\d{4}-\d{2}-\d{2}\ \d{2}:\d{2}:\d{2}\ [AP]M\ BST\)\s*\r?\n # date and time
(?!---\s*\r?\nad\.ft has) # next lines are not the ---\nad.ft etc.
---\s*\r?\n # --- line
[\w.]+\s*\r?\n # name line
\s* # skip leading whitespace
) # end of non-capture group
# The following is capture group 1. Match characters until you get to the next date-time:
((?:(?!\s*\r?\n\(\d{4}-\d{2}-\d{2}\ \d{2}:\d{2}:\d{2}\ [AP]M\ BST\)).)*)# skip trailing whitespace
'''
matches = re.findall(regex, string)
print(matches)
Prints:
["Good Morning\nHow're you?", "Hi, I'm Good.", 'we got the update on work.\nIt will get complete by next week.']

Replace $$ or more with single space using Regex in python

In the following list of strings I want to replace every run of two or more $ characters with a single space.
E.g. if I have $$, it becomes one space, and if there are $$$$ or more, they also become just one space.
I am using the following regex, but I'm not sure it serves the purpose:
regex_pattern = r"['$$']{2,}?"
Following is the test string list:
['1', 'Patna City $$$$ $$$$$$$$View Details', 'Serial No:$$$$5$$$$ $$$$Deed No:$$$$5$$$$ $$$$Token No:$$$$7$$$$ $$$$Reg Year:2020', 'Anil Kumar Singh Alias Anil Kumar$$$$$$$$Executant$$$$$$$$Late. Harinandan Singh$$$$$$$$$$$$Md. Shahzad Ahmad$$$$$$$$Claimant$$$$$$$$Late. Md. Serajuddin', 'Anil Kumar Singh Alias Anil Kumar', 'Executant', 'Late. Harinandan Singh', 'Md. Shahzad Ahmad', 'Claimant', 'Late. Md. Serajuddin', 'Circle:Patna City Mauja: $$$$ $$$$Khata : na$$$$ $$$$Plot :2497 Area(in Decimal):1.5002 Land Type :Res. Branch Road Land Value :1520000 MVR Value :1000000', 'Circle:Patna City Mauja: $$$$ $$$$Khata : na$$$$ $$$$Plot :2497 Area(in Decimal):1.5002 Land Type :Res. Branch Road Land Value :1520000 MVR Value :1000000']
About "I am using the following regex but I'm not sure if it serves the purpose":
The pattern ['$$']{2,}? can be written as ['$]{2,}? and matches 2 or more chars being either ' or $ in a non greedy way.
Your pattern currently gets the right matches, as there are no parts present like '' or $'
As the pattern is non-greedy, it matches only 2 characters at a time, so it will not match all 3 characters in $$$
You could write the pattern matching 2 or more dollar signs without making it non greedy so the odd number of $ will also be matched:
regex_pattern = r"\${2,}"
In the replacement use a space.
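The effect of the non-greedy quantifier shows up once the match is replaced. A small sketch with a made-up sample string (`sample`, `lazy` and `greedy` are placeholder names):

```python
import re

# Hypothetical sample with a run of four and a run of three dollar signs.
sample = "No:$$$$5$$$Deed"

# Lazy {2,}? stops after two characters: "$$$$" is replaced pair by
# pair (two spaces), and the third "$" of "$$$" is left behind.
lazy = re.sub(r"['$]{2,}?", " ", sample)

# Greedy \${2,} consumes each whole run in a single match.
greedy = re.sub(r"\${2,}", " ", sample)

print(repr(lazy))    # 'No:  5 $Deed'
print(repr(greedy))  # 'No: 5 Deed'
```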
Is this what you need?:
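import re
# data is assumed to be the list of test strings from the question.
# Build a new list: rebinding d inside a for loop would leave data unchanged.
data = [re.sub(r'\${2,}', ' ', d) for d in data]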

How to replace backslash followed by 2 letters with empty string in Python?

Below is the test string; we need to replace the '\xa' with ''
'FF 6 VRV AVENUE SUBRAMANIYAM PALAYAM PinCode:-\xa0641034'
I was using the following line in Python to do this, but to no avail:
new_str = str.replace(r'\\xa', '')
The output is the same:
'FF 6 VRV AVENUE SUBRAMANIYAM PALAYAM PinCode:-\xa0641034'
I think you are trying to replace the Unicode character '\xa0' (a non-breaking space):
s = 'FF 6 VRV AVENUE SUBRAMANIYAM PALAYAM PinCode:-\xa0641034'
s = s.replace('\xa0', '')
print(s)
#'FF 6 VRV AVENUE SUBRAMANIYAM PALAYAM PinCode:-641034'
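As for why the original attempt changed nothing: in the string literal, \xa0 is an escape for a single character, so there is no backslash in the data to search for. A short sketch that checks this (the name `cleaned` is a placeholder):

```python
import unicodedata

s = 'FF 6 VRV AVENUE SUBRAMANIYAM PALAYAM PinCode:-\xa0641034'

# '\xa0' is one character (U+00A0), not a backslash followed by "xa0".
print(len('\xa0'))               # 1
print(unicodedata.name('\xa0'))  # NO-BREAK SPACE

# The raw string r'\\xa' is the four characters \ \ x a, which do not
# occur anywhere in s, so replace() finds nothing to change.
print(r'\\xa' in s)              # False

cleaned = s.replace('\xa0', '')
print(cleaned)  # FF 6 VRV AVENUE SUBRAMANIYAM PALAYAM PinCode:-641034
```

If the pin code should stay separated from the hyphen, replacing with ' ' instead of '' is an alternative.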

How to fix my RE to get my expected arguments of group

I'm new to Python and am learning regex at the moment.
I wrote a regex that is supposed to find all phone numbers.
I think I did it right, but my code doesn't seem to be working correctly.
phoneRegex = re.compile(r'''(
(\d{2,3}|\(\d(2,3)\))? # first 2-3 digits
(\s|-|\.)? # -
(\d{3,4}) # second 3-4 digits
(\s|-|\.) # -
(\d{4}) # last 4 digits.
(\s*(ext|x|ext.)\s*(\d{3,4}))? # extension
)''', re.VERBOSE)
phoneRegex.findall('010 1234 5678 ext1234')
I am working through the automatetheboringstuff tutorials and have read the regex chapter three times.
If there is something minor that I should have read or considered, sorry for my haste; I have spent roughly 2 hours on this, and I would be happy to get any suggested reading material and help.
Thanks in advance.
Expected result:
[('010 1234 5678 ext1234', '010', ' ', '1234', ' ', '5678', ' ext1234', 'ext', '1234')]
Actual result:
[('010 1234 5678 ext1234', '010', '', ' ', '1234', ' ', '5678', ' ext1234', 'ext', '1234')]
What is the 3rd item ('') and where did it come from?
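A likely explanation, offered as a sketch rather than a confirmed answer: `\(\d(2,3)\)` uses parentheses where the quantifier `{2,3}` was presumably intended. That makes `(2,3)` an extra capturing group (the 3rd one), and since it takes no part in this match, findall reports it as ''. The names `buggy` and `fixed` below are placeholders; the only change between the two patterns is `(2,3)` to `{2,3}`.

```python
import re

# Pattern as posted: (2,3) on the first inner line is a capturing group.
buggy = re.compile(r'''(
    (\d{2,3}|\(\d(2,3)\))?          # first 2-3 digits
    (\s|-|\.)?                      # separator
    (\d{3,4})                       # second 3-4 digits
    (\s|-|\.)                       # separator
    (\d{4})                         # last 4 digits
    (\s*(ext|x|ext.)\s*(\d{3,4}))?  # extension
    )''', re.VERBOSE)

# Same pattern with the quantifier written as {2,3}.
fixed = re.compile(r'''(
    (\d{2,3}|\(\d{2,3}\))?          # first 2-3 digits
    (\s|-|\.)?                      # separator
    (\d{3,4})                       # second 3-4 digits
    (\s|-|\.)                       # separator
    (\d{4})                         # last 4 digits
    (\s*(ext|x|ext.)\s*(\d{3,4}))?  # extension
    )''', re.VERBOSE)

# Number of capturing groups in each pattern.
print(buggy.groups, fixed.groups)  # 10 9

print(fixed.findall('010 1234 5678 ext1234'))
# [('010 1234 5678 ext1234', '010', ' ', '1234', ' ', '5678', ' ext1234', 'ext', '1234')]
```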
