How to fix my RE to get my expected arguments of group - python-3.x

I m new learner to python and learning Regex at this moment.
I made a Regex that is supposed to find all phone numbers.
I think I did it right but my code doesn't seem to be working correctly.
phoneRegex = re.compile(r'''(
(\d{2,3}|\(\d(2,3)\))? # first 2-3 digits
(\s|-|\.)? # -
(\d{3,4}) # second 3-4 digits
(\s|-|\.) # -
(\d{4}) # last 4 digits.
(\s*(ext|x|ext.)\s*(\d{3,4}))? # extension
)''', re.VERBOSE)
phoneRegex.findall('010 1234 5678 ext1234')
I am working on automatetheboringstuff tutorials, and read the Regex chapter through for 3 times.
If there are some minor things that I should read or consider, sorry for my hasty, but I spent roughly 2hrs, and I am happy to any of your suggesting reading materials and help.
I appreciate in advance.
Expected result:
[('010 1234 5678 ext1234', '010', ' ', '1234', ' ', '5678', ' ext1234', 'ext', '1234')]
Actual result:
[('010 1234 5678 ext1234', '010', '', ' ', '1234', ' ', '5678', ' ext1234', 'ext', '1234')]
what is the 3rd thing ('') and where did it come from?

Related

Replace $$ or more with single spaceusing Regex in python

In the following list of string i want to remove $$ or more with only one space.
eg- if i have $$ then one space character or if there are $$$$ or more then also only 1 space is to be replaced.
I am using the following regex but i'm not sure if it serves the purpose
regex_pattern = r"['$$']{2,}?"
Following is the test string list:
['1', 'Patna City $$$$ $$$$$$$$View Details', 'Serial No:$$$$5$$$$ $$$$Deed No:$$$$5$$$$ $$$$Token No:$$$$7$$$$ $$$$Reg Year:2020', 'Anil Kumar Singh Alias Anil Kumar$$$$$$$$Executant$$$$$$$$Late. Harinandan Singh$$$$$$$$$$$$Md. Shahzad Ahmad$$$$$$$$Claimant$$$$$$$$Late. Md. Serajuddin', 'Anil Kumar Singh Alias Anil Kumar', 'Executant', 'Late. Harinandan Singh', 'Md. Shahzad Ahmad', 'Claimant', 'Late. Md. Serajuddin', 'Circle:Patna City Mauja: $$$$ $$$$Khata : na$$$$ $$$$Plot :2497 Area(in Decimal):1.5002 Land Type :Res. Branch Road Land Value :1520000 MVR Value :1000000', 'Circle:Patna City Mauja: $$$$ $$$$Khata : na$$$$ $$$$Plot :2497 Area(in Decimal):1.5002 Land Type :Res. Branch Road Land Value :1520000 MVR Value :1000000']
About
I am using the following regex but i'm not sure if it serves the
purpose
The pattern ['$$']{2,}? can be written as ['$]{2,}? and matches 2 or more chars being either ' or $ in a non greedy way.
Your pattern currently get the right matches, as there are no parts present like '' or $'
As the pattern is non greedy, it will only match 2 chars and will not match all 3 characters in $$$
You could write the pattern matching 2 or more dollar signs without making it non greedy so the odd number of $ will also be matched:
regex_pattern = r"\${2,}"
In the replacement use a space.
Is this what you need?:
import re
for d in data:
d = re.sub(r'\${2,}', ' ', d)

regex to match paragraph in between 2 substrings

I have a string look like this:
string=""
( 2021-07-10 01:24:55 PM GMT )TEST
---
Badminton is a racquet sport played using racquets to hit a shuttlecock across
a net. Although it may be played with larger teams, the most common forms of
the game are "singles" (with one player per side) and "doubles" (with two
players per side).
( 2021-07-10 01:27:55 PM GMT )PATRICKWARR
---
Good morning, I am doing well. And you?
---
---
* * *""
I am trying to split the String up into parts as:
text=['Badminton is a racquet sport played using racquets to hit a
shuttlecock across a net. Although it may be played with larger teams,
the most common forms of the game are "singles" (with one player per
side) and "doubles" (with two players per side).','Good morning, I am
doing well. And you?']
What I have tried as:
text=re.findall(r'\( \d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2} PM GMT \)\w+ [\S\n]--- .*',string)
I'm not able get how to extract multiple lines.
You can use
(?m)^\(\s*\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}\s*[AP]M\s+GMT\s*\)\w+\s*\n---\s*\n(.*(?:\n(?!(?:\(\s*\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}\s*[AP]M\s+GMT\s*\)\w+\s*\n)?---).*)*)
See the regex demo. Details:
^ - start of line
{left_rx} - left boundary
--- - three hyphens
\s*\n - zero or more whitespaces and then an LF char
(.*(?:\n(?!(?:{left_rx})?---).*)*) - Group 1:
.* - zero or more chars other than line break chars as many as possible
(?:\n(?!(?:{left_rx})?---).*)* - zero or more (even empty, due to .*) lines that do not start with the (optional) left boundary pattern followed with ---
The boundary pattern defined in left_rx is \(\s*\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}\s*[AP]M\s+GMT\s*\)\w+\s*\n, it is basically the same as the original, I used \s* to match any zero or more whitespaces or \s+ to match one or more whitespaces between "words".
See the Python demo:
import re
text = '''string=""\n( 2021-07-10 01:24:55 PM GMT )TEST \n--- \nBadminton is a racquet sport played using racquets to hit a shuttlecock across\na net. Although it may be played with larger teams, the most common forms of\nthe game are "singles" (with one player per side) and "doubles" (with two\nplayers per side). \n \n \n\n \n\n( 2021-07-10 01:27:55 PM GMT )PATRICKWARR \n--- \nGood morning, I am doing well. And you? \n \n \n\n \n \n \n--- \n \n \n \n \n--- \n \n* * *""'''
left_rx = r"\(\s*\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}\s*[AP]M\s+GMT\s*\)\w+\s*\n"
rx = re.compile(fr"^{left_rx}---\s*\n(.*(?:\n(?!(?:{left_rx})?---).*)*)", re.M)
print ( [x.strip().replace('\n', ' ') for x in rx.findall(text)] )
Output:
['Badminton is a racquet sport played using racquets to hit a shuttlecock across a net. Although it may be played with larger teams, the most common forms of the game are "singles" (with one player per side) and "doubles" (with two players per side).', 'Good morning, I am doing well. And you?']
One of the approaches:
import re
# Replace all \n with ''
string = string.replace('\n', '')
# Replace the date string '( 2021-07-10 01:27:55 PM GMT )PATRICKWARR ' and string like '* * *' with ''
string = re.sub(r"\(\s*\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2} [AP]M GMT\s*\)\w+|\*+", '', string)
data = string.split('---')
data = [item.strip() for item in data if item.strip()]
print (data)
Output:
['Badminton is a racquet sport played using racquets to hit a shuttlecock acrossa net. Although it may be played with larger teams, the most common forms ofthe game are "singles" (with one player per side) and "doubles" (with twoplayers per side).', 'Good morning, I am doing well. And you?']

How to extract several timestamp pairs from a list in Python

I have extracted all timestamps from a transcript file. The output looks like this:
('[, 00:00:03,950, 00:00:06,840, 00:00:06,840, 00:00:09,180, 00:00:09,180, '
'00:00:10,830, 00:00:10,830, 00:00:14,070, 00:00:14,070, 00:00:16,890, '
'00:00:16,890, 00:00:19,080, 00:00:19,080, 00:00:21,590, 00:00:21,590, '
'00:00:24,030, 00:00:24,030, 00:00:26,910, 00:00:26,910, 00:00:29,640, '
'00:00:29,640, 00:00:31,920, 00:00:31,920, 00:00:35,850, 00:00:35,850, '
'00:00:38,629, 00:00:38,629, 00:00:40,859, 00:00:40,859, 00:00:43,170, '
'00:00:43,170, 00:00:45,570, 00:00:45,570, 00:00:48,859, 00:00:48,859, '
'00:00:52,019, 00:00:52,019, 00:00:54,449, 00:00:54,449, 00:00:57,210, '
'00:00:57,210, 00:00:59,519, 00:00:59,519, 00:01:02,690, 00:01:02,690, '
'00:01:05,820, 00:01:05,820, 00:01:08,549, 00:01:08,549, 00:01:10,490, '
'00:01:10,490, 00:01:13,409, 00:01:13,409, 00:01:16,409, 00:01:16,409, '
'00:01:18,149, 00:01:18,149, 00:01:20,340, 00:01:20,340, 00:01:22,649, '
'00:01:22,649, 00:01:26,159, 00:01:26,159, 00:01:28,740, 00:01:28,740, '
'00:01:30,810, 00:01:30,810, 00:01:33,719, 00:01:33,719, 00:01:36,990, '
'00:01:36,990, 00:01:39,119, 00:01:39,119, 00:01:41,759, 00:01:41,759, '
'00:01:43,799, 00:01:43,799, 00:01:46,619, 00:01:46,619, 00:01:49,140, '
'00:01:49,140, 00:01:51,240, 00:01:51,240, 00:01:53,759, 00:01:53,759, '
'00:01:56,460, 00:01:56,460, 00:01:58,740, 00:01:58,740, 00:02:01,640, '
'00:02:01,640, 00:02:04,409, 00:02:04,409, 00:02:07,229, 00:02:07,229, '
'00:02:09,380, 00:02:09,380, 00:02:12,060, 00:02:12,060, 00:02:14,840, ]')
In this output, there are always timestamp pairs, i.e. always 2 consecutive timestamps belong together, for example: 00:00:03,950 and 00:00:06,840, 00:00:06,840 and 00:00:09,180, etc.
Now, I want to extract all these timestamp pairs separately so that the output looks like this:
00:00:03,950 - 00:00:06,840
00:00:06,840 - 00:00:09,180
00:00:09,180 - 00:00:10,830
etc.
For now, I have the following (very inconvenient) solution for my problem:
# get first part of first timestamp
a = res_timestamps[2:15]
print(dedent(a))
# get second part of first timestamp
b = res_timestamps[17:29]
print(b)
# combine timestamp parts
c = a + ' - ' + b
print(dedent(c))
Of course, this is very bad since I cannot extract the indices manually for all transcripts. Trying to use a loop has not worked yet because each item is not a timestamp but a single character.
Is there an elegant solution for my problem?
I appreciate any help or tip.
Thank you very much in advance!
Regex to the rescue!
A solution that works perfectly on your example data:
import re
from pprint import pprint
pprint(re.findall(r"(\d{2}:\d{2}:\d{2},\d{3}), (\d{2}:\d{2}:\d{2},\d{3})", your_data))
This prints:
[('00:00:03,950', '00:00:06,840'),
('00:00:06,840', '00:00:09,180'),
('00:00:09,180', '00:00:10,830'),
('00:00:10,830', '00:00:14,070'),
('00:00:14,070', '00:00:16,890'),
('00:00:16,890', '00:00:19,080'),
('00:00:19,080', '00:00:21,590'),
('00:00:21,590', '00:00:24,030'),
('00:00:24,030', '00:00:26,910'),
('00:00:26,910', '00:00:29,640'),
('00:00:29,640', '00:00:31,920'),
('00:00:31,920', '00:00:35,850'),
('00:00:35,850', '00:00:38,629'),
('00:00:38,629', '00:00:40,859'),
('00:00:40,859', '00:00:43,170'),
('00:00:43,170', '00:00:45,570'),
('00:00:45,570', '00:00:48,859'),
('00:00:48,859', '00:00:52,019'),
('00:00:52,019', '00:00:54,449'),
('00:00:54,449', '00:00:57,210'),
('00:00:57,210', '00:00:59,519'),
('00:00:59,519', '00:01:02,690'),
('00:01:02,690', '00:01:05,820'),
('00:01:05,820', '00:01:08,549'),
('00:01:08,549', '00:01:10,490'),
('00:01:10,490', '00:01:13,409'),
('00:01:13,409', '00:01:16,409'),
('00:01:16,409', '00:01:18,149'),
('00:01:18,149', '00:01:20,340'),
('00:01:20,340', '00:01:22,649'),
('00:01:22,649', '00:01:26,159'),
('00:01:26,159', '00:01:28,740'),
('00:01:28,740', '00:01:30,810'),
('00:01:30,810', '00:01:33,719'),
('00:01:33,719', '00:01:36,990'),
('00:01:36,990', '00:01:39,119'),
('00:01:39,119', '00:01:41,759'),
('00:01:41,759', '00:01:43,799'),
('00:01:43,799', '00:01:46,619'),
('00:01:46,619', '00:01:49,140'),
('00:01:49,140', '00:01:51,240'),
('00:01:51,240', '00:01:53,759'),
('00:01:53,759', '00:01:56,460'),
('00:01:56,460', '00:01:58,740'),
('00:01:58,740', '00:02:01,640'),
('00:02:01,640', '00:02:04,409'),
('00:02:04,409', '00:02:07,229'),
('00:02:07,229', '00:02:09,380'),
('00:02:09,380', '00:02:12,060'),
('00:02:12,060', '00:02:14,840')]
You could output this in your desired format like so:
for start, end in timestamps:
print(f"{start} - {end}")
Here's a solution without regular expressions
Clean the string, and split on ', ' to create a list
Use string slicing to select the odd and even values and zip them together.
# give data as your string
# convert data into a list by removing end brackets and spaces, and splitting
data = data.replace('[, ', '').replace(', ]', '').split(', ')
# use list slicing and zip the two components
combinations = list(zip(data[::2], data[1::2]))
# print the first 5
print(combinations[:5])
[out]:
[('00:00:03,950', '00:00:06,840'),
('00:00:06,840', '00:00:09,180'),
('00:00:09,180', '00:00:10,830'),
('00:00:10,830', '00:00:14,070'),
('00:00:14,070', '00:00:16,890')]

After a seperator there is a key in loop. How to keep it?

---------------------------
CompanyID: 000000000000
Pizza: 2 3.15 6.30
spaghetti: 1 7 7
ribye: 2 40 80
---------------------------
CompanyID: 000000000001
burger: 1 3.15 6.30
spaghetti: 1 7 7
ribye: 2 40 80
--------------------------
I'm doing a for loop over a list of lines. Every line is an item of a list. I need to keep the companyID while looking for a user input.
While this is printing the variable x=True. I cant take company ID to print it.
a='-'
for line in lines:
if a in line:
companyID= next(line)
if product in line:
x=True
TypeError: 'str' object is not an iterator
You can use your line seperator to identify when new data starts. Once you see the line with "----" then you can start collecing info in a new dictionary. for each line take its key and value by splitting on ":" and create the entry in the dictionary.
When you see the next "----" line you know thats the end of the data for this company so then do your check to see if they have the product and if so print the company id from the dictionary.
line_seperator_char = '-'
company_data = {}
product = 'burger'
with open('data.dat') as lines:
for line in lines:
line = line.rstrip()
if line.startswith(line_seperator_char):
if product in company_data:
print(f'{company_data["CompanyID"]} contains the product {product}')
company_data = {}
else:
key, value = line.split(':')
company_data[key] = value
OUTPUT
000000000001 contains the product burger
No it doesnt run. Could you explain what does "[1] means near split()[1]?
Another try that doesnt run is
y=[]
y=lines[1].split(' ')
for line in lines:
y=line.split(' ')
if len(y[1])==10:
companyID=y[1]
if product in line:
x=True
Thanks for the answers.Something that finally worked in my case was that:
y=[]
y=line[1].split(' ')
a='-'
for line in lines:
if line.startswith("CompanyID:"):
y=line.split(' ')
companyID=y[1]
if product in line:
x=True

parsing dates from strings

I have a list of strings in python like this
['AM_B0_D0.0_2016-04-01T010000.flac.h5',
'AM_B0_D3.7_2016-04-13T215000.flac.h5',
'AM_B0_D10.3_2017-03-17T110000.flac.h5',
'AM_B0_D0.7_2016-10-21T104000.flac.h5',
'AM_B0_D4.4_2016-08-05T151000.flac.h5',
'AM_B0_D0.0_2016-04-01T010000.flac.h5',
'AM_B0_D3.7_2016-04-13T215000.flac.h5',
'AM_B0_D10.3_2017-03-17T110000.flac.h5',
'AM_B0_D0.7_2016-10-21T104000.flac.h5',
'AM_B0_D4.4_2016-08-05T151000.flac.h5']
I want to parse only the date and time (for example, 2016-08-05 15:10:00 )from these strings.
So far I used a for loop like the one below but it's very time consuming, is there a better way to do this?
for files in glob.glob("AM_B0_*.flac.h5"):
if files[11]=='_':
year=files[12:16]
month=files[17:19]
day= files[20:22]
hour=files[23:25]
minute=files[25:27]
second=files[27:29]
tindex=pd.date_range(start= '%d-%02d-%02d %02d:%02d:%02d' %(int(year),int(month), int(day), int(hour), int(minute), int(second)), periods=60, freq='10S')
else:
year=files[11:15]
month=files[16:18]
day= files[19:21]
hour=files[22:24]
minute=files[24:26]
second=files[26:28]
tindex=pd.date_range(start= '%d-%02d-%02d %02d:%02d:%02d' %(int(year), int(month), int(day), int(hour), int(minute), int(second)), periods=60, freq='10S')
Try this (based on the 2nd last '-', no need of if-else case):
filesall = ['AM_B0_D0.0_2016-04-01T010000.flac.h5',
'AM_B0_D3.7_2016-04-13T215000.flac.h5',
'AM_B0_D10.3_2017-03-17T110000.flac.h5',
'AM_B0_D0.7_2016-10-21T104000.flac.h5',
'AM_B0_D4.4_2016-08-05T151000.flac.h5',
'AM_B0_D0.0_2016-04-01T010000.flac.h5',
'AM_B0_D3.7_2016-04-13T215000.flac.h5',
'AM_B0_D10.3_2017-03-17T110000.flac.h5',
'AM_B0_D0.7_2016-10-21T104000.flac.h5',
'AM_B0_D4.4_2016-08-05T151000.flac.h5']
def find_second_last(text, pattern):
return text.rfind(pattern, 0, text.rfind(pattern))
for files in filesall:
start = find_second_last(files,'-') - 4 # from yyyy- part
timepart = (files[start:start+17]).replace("T"," ")
#insert 2 ':'s
timepart = timepart[:13] + ':' + timepart[13:15] + ':' +timepart[15:]
# print(timepart)
tindex=pd.date_range(start= timepart, periods=60, freq='10S')
In Place of using file[11] as hard coded go for last or 2nd last index of _ then use your code then you don't have to write 2 times same code. Or use regex to parse the string.

Resources