How to identify number near word using regex - python-3.x

I need to identify the numbers that follow keywords such as number:, no:, etc.
Tried:
import re
matchstring="Sales Quote"
string_lst = ['number:', 'No:','no:','number','No : ']
x=""" Sentence1: Sales Quote number 36886DJ9 is entered
Sentence2: SALES QUOTE No: 89745DFD is entered
Sentence3: Sales Quote No : 7964KL is entered
Sentence4: SALES QUOTE NUMBER:879654DF is entered
Sentence5: salesquote no: 9874656LD is entered"""
documentnumber= re.findall(r"(?:(?<="+matchstring+ '|'.join(string_lst)+r')) [\w\d-]',x,flags=re.IGNORECASE)
print(documentnumber)
Required soln:36886DJ9,89745DFD,7964KL,879654DF,9874656LD
Is there any solution?

Actually, your solution is very close. You just need some missing parentheses and a check for optional whitespace (the pattern pieces are also written as raw strings so that \s, \w and \d reach the regex engine unchanged):
documentnumber = re.findall(r"(?:(?<=" + matchstring + r").*?(?:" + '|'.join(string_lst) + r')\s?)([\w\d-]*)', x, re.IGNORECASE)
However, this won't match the last one (9874656LD) because of the missing whitespace between "sales" and "quote". If you want to build that part the same way as the rest of the pattern, replace the lookbehind with a non-capturing group and join the words with \s?:
documentnumber = re.findall(r"(?:(?:" + r"\s?".join(matchstring.split()) + r").*?(?:" + '|'.join(string_lst) + r')\s?)([\w\d-]*)', x, re.IGNORECASE)
Output:
['36886DJ9', '89745DFD', '7964KL', '879654DF', '9874656LD']
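The same idea can be sketched as a small helper. This is only an illustration: the name build_pattern is hypothetical, and it assumes the phrase and keywords should be treated as literal text, so they are passed through re.escape. The keywords are also sorted longest first so that "number:" is tried before "number".

```python
import re

matchstring = "Sales Quote"
keywords = ["number:", "No:", "no:", "number", "No : "]
text = """ Sentence1: Sales Quote number 36886DJ9 is entered
Sentence2: SALES QUOTE No: 89745DFD is entered
Sentence3: Sales Quote No : 7964KL is entered
Sentence4: SALES QUOTE NUMBER:879654DF is entered
Sentence5: salesquote no: 9874656LD is entered"""


def build_pattern(phrase, keywords):
    """Hypothetical helper: phrase, then a keyword, then the captured number."""
    # Allow an optional whitespace character between the words of the phrase
    phrase_part = r"\s?".join(re.escape(word) for word in phrase.split())
    # Escape each keyword and try the longest alternatives first
    ordered = sorted(keywords, key=len, reverse=True)
    keyword_part = "|".join(re.escape(k) for k in ordered)
    return re.compile(
        rf"{phrase_part}.*?(?:{keyword_part})\s?([\w-]+)", re.IGNORECASE
    )


document_numbers = build_pattern(matchstring, keywords).findall(text)
print(document_numbers)
```

Escaping matters as soon as a keyword contains a regex metacharacter such as "." or "(", which the plain '|'.join would otherwise pass to the regex engine as syntax.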

Related

Split a big text file into multiple smaller one on set parameter of regex

I have a large text file looking like:
....
sdsdsd
..........
asdfhjgjksdfk dfkaskk sdkfk skddkf skdf sdk ssaaa akskdf sdksdfsdf ksdf sd kkkkallwow.
sdsdllla lsldlsd lsldlalllLlsdd asdd. sdlsllall asdsdlallOEFOOASllsdl lsdlla.
slldlllasdlsd.ss;sdsdasdas.
......
ddss
................
asdfhjgjksdfk ddjafjijjjj.dfsdfsdfsdfsi dfodoof ooosdfow oaosofoodf aosolflldlfl , dskdkkfkdsa asddf;akkdfkdkk . sdlsllall asdsdlallOEFOOASllsdl lsdlla.
slldlllasdlsd.ss;sdsdasdas.
.....
xxxx
.......
asdfghjkl
I want to split the text file into multiple small text files at each occurrence of ..... [multiple period markers] and save them as .txt on my system, like this:
group1_sdsdsd.txt
....
sdsdsd
..........
asdfhjgjksdfk dfkaskk sdkfk skddkf skdf sdk ssaaa akskdf sdksdfsdf ksdf sd kkkkallwow.
sdsdllla lsldlsd lsldlalllLlsdd asdd. sdlsllall asdsdlallOEFOOASllsdl lsdlla.
slldlllasdlsd.ss;sdsdasdas.
group1_ddss.txt
ddss
................
asdfhjgjksdfk ddjafjijjjj.dfsdfsdfsdfsi dfodoof ooosdfow oaosofoodf aosolflldlfl , dskdkkfkdsa asddf;akkdfkdkk . sdlsllall asdsdlallOEFOOASllsdl lsdlla.
slldlllasdlsd.ss;sdsdasdas.
and
group1_xxxx.txt
.....
xxxx
.......
asdfghjkl
I figured this could be done with a regex along the following lines
txt =re.sub(r'(([^\w\s])\2+)', r' ', txt).strip() #for letters more than 2 times
but not able to figure out completely.
The saved text files should be named group1_sdsdsd.txt, group1_ddss.txt and group1_xxxx.txt [group1 being the identifier of the specific big text file; I have multiple big text files, need to do the same on all of them, and want to know which big file each part came from].
If the rows of dots always sit on their own lines, you can match each section as a separate part with a pattern like:
^\.{3,}\n(\S+)\n\.{3,}(?:\n(?!\.{3,}\n\S+\n\.{3,}).*)*
Explanation
^ Start of a line (the pattern is used with re.MULTILINE)
\.{3,}\n Match 3 or more dots and a newline
(\S+)\n Capture 1+ non whitespace chars in group 1 for the filename and match a newline
\.{3,} Match 3 or more dots
(?: Non capture group to repeat as a whole part
\n Match a newline
(?!\.{3,}\n\S+\n\.{3,}) Negative lookahead, assert that from the current position we are not looking at a pattern that matches the dots with a filename in between
.* Match the whole line
)* Close the non capture group and optionally repeat it
Then you can use re.finditer to loop the matches, and use the group 1 value as part of the filename.
See a regex demo and a Python demo with the separate parts.
Example code
import re
pattern = r"^\.{3,}\n(\S+)\n\.{3,}(?:\n(?!\.{3,}\n\S+\n\.{3,}).*)*"
s = "....your data here"
your_path = "/your/path/"
# Each match is one section; group 1 holds the name between the two dotted lines
for match in re.finditer(pattern, s, re.MULTILINE):
    # The with block closes the file even if the write fails
    with open(your_path + "group1_{}.txt".format(match.group(1)), "w") as f:
        f.write(match.group())
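For a version that can be run without touching real files, here is a minimal sketch that writes the parts into a temporary directory. The function name split_sections, the shortened sample text, and the use of a temporary directory are assumptions made for the illustration; the pattern is the one given above.

```python
import re
import tempfile
from pathlib import Path

SECTION = re.compile(
    r"^\.{3,}\n(\S+)\n\.{3,}(?:\n(?!\.{3,}\n\S+\n\.{3,}).*)*", re.MULTILINE
)


def split_sections(text, prefix, out_dir):
    """Write one <prefix>_<name>.txt file per section; return the file names."""
    out_dir = Path(out_dir)
    written = []
    for match in SECTION.finditer(text):
        target = out_dir / f"{prefix}_{match.group(1)}.txt"
        target.write_text(match.group() + "\n")
        written.append(target.name)
    return written


# Shortened stand-in for the big text file from the question
sample = (
    "....\nsdsdsd\n..........\nfirst body line.\nsecond body line.\n"
    "......\nddss\n................\nanother body.\n"
    ".....\nxxxx\n.......\nasdfghjkl\n"
)

with tempfile.TemporaryDirectory() as tmp:
    names = split_sections(sample, "group1", tmp)
    first_file = (Path(tmp) / names[0]).read_text()

print(names)
```

Passing the prefix as an argument makes it easy to reuse the function for each big file with its own identifier (group1, group2, and so on).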

Remove leading dollar sign from data and improve current solution

I have string like so:
"Job 1233:name_uuid (table n_Cars_1234567$20220316) done. Records: 24, with errors: 0."
I'd like to retrieve the date from the table name. So far I use:
"\$[0-9]+"
but this yields $20220316. How do I get only the date, without $?
I'd also like to get the table name: n_Cars_12345678$20220316
So far I have this:
pattern_table_info = "\(([^\)]+)\)"
pattern_table_name = "(?<=table ).*"
table_info = re.search(pattern_table_info, message).group(1)
table = re.search(pattern_table_name, table_info).group(0)
However, I'd like to have a simpler solution. How can I improve this?
EDIT:
Actually the table name should be:
n_Cars_12345678
So everything before the "$" sign and after "table"...how can this part of the string be retrieved?
You can use a regex with two capturing groups:
table\s+([^()]*)\$([0-9]+)
See the regex demo. Details:
table - a word
\s+ - one or more whitespaces
([^()]*) - Group 1: zero or more chars other than ( and )
\$ - a $ char
([0-9]+) - Group 2: one or more digits.
See the Python demo:
import re
text = "Job 1233:name_uuid (table n_Cars_1234567$20220316) done. Records: 24, with errors: 0."
rx = r"table\s+([^()]*)\$([0-9]+)"
m = re.search(rx, text)
if m:
    print(m.group(1))
    print(m.group(2))
Output:
n_Cars_1234567
20220316
You can write a single pattern with 2 capture groups:
\(table (\w+\$(\d+))\)
The pattern matches:
\(table Match (table followed by a space
( Capture group 1
\w+\$ Match 1+ word characters and $
(\d+) Capture group 2, match 1+ digits
) Close group 1
\) Match )
See a Regex demo and a Python demo.
import re
s = "Job 1233:name_uuid (table n_Cars_1234567$20220316) done. Records: 24, with errors: 0."
m = re.search(r"\(table (\w+\$(\d+))\)", s)
if m:
    print(m.group(1))
    print(m.group(2))
Output
n_Cars_1234567$20220316
20220316
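As a variation on the two answers, the following sketch uses named groups and converts the captured digits into a date object. The function name parse_job_message and the assumption that the date is always eight digits in YYYYMMDD form are mine, not something the question confirms.

```python
import re
from datetime import datetime

# Assumes the table reference always looks like "(table <name>$<YYYYMMDD>)"
TABLE_RE = re.compile(r"\(table\s+(?P<name>[^()$]+)\$(?P<date>\d{8})\)")


def parse_job_message(message):
    """Return (table_name, date), or None when no table reference is found."""
    m = TABLE_RE.search(message)
    if m is None:
        return None
    table_date = datetime.strptime(m.group("date"), "%Y%m%d").date()
    return m.group("name"), table_date


message = "Job 1233:name_uuid (table n_Cars_1234567$20220316) done. Records: 24, with errors: 0."
result = parse_job_message(message)
print(result)
```

Named groups keep the calling code readable, and strptime rejects digit runs that are not valid dates.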

Replace $$ or more with a single space using Regex in python

In the following list of strings I want to replace every run of $$ or more with a single space.
E.g. if I have $$, it becomes one space character, and if there are $$$$ or more, they are also replaced by only 1 space.
I am using the following regex but I'm not sure if it serves the purpose
regex_pattern = r"['$$']{2,}?"
Following is the test string list:
['1', 'Patna City $$$$ $$$$$$$$View Details', 'Serial No:$$$$5$$$$ $$$$Deed No:$$$$5$$$$ $$$$Token No:$$$$7$$$$ $$$$Reg Year:2020', 'Anil Kumar Singh Alias Anil Kumar$$$$$$$$Executant$$$$$$$$Late. Harinandan Singh$$$$$$$$$$$$Md. Shahzad Ahmad$$$$$$$$Claimant$$$$$$$$Late. Md. Serajuddin', 'Anil Kumar Singh Alias Anil Kumar', 'Executant', 'Late. Harinandan Singh', 'Md. Shahzad Ahmad', 'Claimant', 'Late. Md. Serajuddin', 'Circle:Patna City Mauja: $$$$ $$$$Khata : na$$$$ $$$$Plot :2497 Area(in Decimal):1.5002 Land Type :Res. Branch Road Land Value :1520000 MVR Value :1000000', 'Circle:Patna City Mauja: $$$$ $$$$Khata : na$$$$ $$$$Plot :2497 Area(in Decimal):1.5002 Land Type :Res. Branch Road Land Value :1520000 MVR Value :1000000']
About "I am using the following regex but i'm not sure if it serves the purpose":
The pattern ['$$']{2,}? can be written as ['$]{2,}? and matches 2 or more chars being either ' or $ in a non greedy way.
Your pattern currently matches the right characters, as the data contains no parts like '' or $'
But as the quantifier is non-greedy, each match is only 2 chars long: $$$$ is found as two separate matches (so replacing each match with a space gives two spaces), and the third character of $$$ is not matched at all.
You could instead match 2 or more dollar signs greedily, so that a whole run, including one with an odd number of $, is a single match:
regex_pattern = r"\${2,}"
In the replacement use a space.
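To make the difference concrete, here is a small comparison of the original lazy pattern and the greedy one. The sample strings are shortened from the question's list, except "a$$$b", which is made up to show the odd-length case.

```python
import re

samples = [
    "Serial No:$$$$5$$$$",
    "a$$$b",  # made-up sample with an odd number of $
    "Late. Harinandan Singh$$$$$$$$$$$$Md. Shahzad Ahmad",
]

lazy = re.compile(r"['$]{2,}?")   # pattern from the question, simplified
greedy = re.compile(r"\${2,}")    # suggested pattern

# Each lazy match is exactly 2 chars, so a long run yields several spaces
lazy_result = [lazy.sub(" ", s) for s in samples]
# Each greedy match covers a whole run, so every run yields one space
greedy_result = [greedy.sub(" ", s) for s in samples]

for before, l, g in zip(samples, lazy_result, greedy_result):
    print(repr(before), "->", repr(l), "vs", repr(g))
```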
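Is this what you need?
import re
# data is your list of strings; build a new list, because reassigning the loop variable would not change the list
data = [re.sub(r'\${2,}', ' ', d) for d in data]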

How to extract several timestamp pairs from a list in Python

I have extracted all timestamps from a transcript file. The output looks like this:
('[, 00:00:03,950, 00:00:06,840, 00:00:06,840, 00:00:09,180, 00:00:09,180, '
'00:00:10,830, 00:00:10,830, 00:00:14,070, 00:00:14,070, 00:00:16,890, '
'00:00:16,890, 00:00:19,080, 00:00:19,080, 00:00:21,590, 00:00:21,590, '
'00:00:24,030, 00:00:24,030, 00:00:26,910, 00:00:26,910, 00:00:29,640, '
'00:00:29,640, 00:00:31,920, 00:00:31,920, 00:00:35,850, 00:00:35,850, '
'00:00:38,629, 00:00:38,629, 00:00:40,859, 00:00:40,859, 00:00:43,170, '
'00:00:43,170, 00:00:45,570, 00:00:45,570, 00:00:48,859, 00:00:48,859, '
'00:00:52,019, 00:00:52,019, 00:00:54,449, 00:00:54,449, 00:00:57,210, '
'00:00:57,210, 00:00:59,519, 00:00:59,519, 00:01:02,690, 00:01:02,690, '
'00:01:05,820, 00:01:05,820, 00:01:08,549, 00:01:08,549, 00:01:10,490, '
'00:01:10,490, 00:01:13,409, 00:01:13,409, 00:01:16,409, 00:01:16,409, '
'00:01:18,149, 00:01:18,149, 00:01:20,340, 00:01:20,340, 00:01:22,649, '
'00:01:22,649, 00:01:26,159, 00:01:26,159, 00:01:28,740, 00:01:28,740, '
'00:01:30,810, 00:01:30,810, 00:01:33,719, 00:01:33,719, 00:01:36,990, '
'00:01:36,990, 00:01:39,119, 00:01:39,119, 00:01:41,759, 00:01:41,759, '
'00:01:43,799, 00:01:43,799, 00:01:46,619, 00:01:46,619, 00:01:49,140, '
'00:01:49,140, 00:01:51,240, 00:01:51,240, 00:01:53,759, 00:01:53,759, '
'00:01:56,460, 00:01:56,460, 00:01:58,740, 00:01:58,740, 00:02:01,640, '
'00:02:01,640, 00:02:04,409, 00:02:04,409, 00:02:07,229, 00:02:07,229, '
'00:02:09,380, 00:02:09,380, 00:02:12,060, 00:02:12,060, 00:02:14,840, ]')
In this output, there are always timestamp pairs, i.e. always 2 consecutive timestamps belong together, for example: 00:00:03,950 and 00:00:06,840, 00:00:06,840 and 00:00:09,180, etc.
Now, I want to extract all these timestamp pairs separately so that the output looks like this:
00:00:03,950 - 00:00:06,840
00:00:06,840 - 00:00:09,180
00:00:09,180 - 00:00:10,830
etc.
For now, I have the following (very inconvenient) solution for my problem:
# get first part of first timestamp
a = res_timestamps[2:15]
print(dedent(a))
# get second part of first timestamp
b = res_timestamps[17:29]
print(b)
# combine timestamp parts
c = a + ' - ' + b
print(dedent(c))
Of course, this is very bad since I cannot extract the indices manually for all transcripts. Trying to use a loop has not worked yet because each item is not a timestamp but a single character.
Is there an elegant solution for my problem?
I appreciate any help or tip.
Thank you very much in advance!
Regex to the rescue!
A solution that works perfectly on your example data:
import re
from pprint import pprint
# your_data is the string holding all the timestamps
timestamps = re.findall(r"(\d{2}:\d{2}:\d{2},\d{3}), (\d{2}:\d{2}:\d{2},\d{3})", your_data)
pprint(timestamps)
This prints:
[('00:00:03,950', '00:00:06,840'),
('00:00:06,840', '00:00:09,180'),
('00:00:09,180', '00:00:10,830'),
('00:00:10,830', '00:00:14,070'),
('00:00:14,070', '00:00:16,890'),
('00:00:16,890', '00:00:19,080'),
('00:00:19,080', '00:00:21,590'),
('00:00:21,590', '00:00:24,030'),
('00:00:24,030', '00:00:26,910'),
('00:00:26,910', '00:00:29,640'),
('00:00:29,640', '00:00:31,920'),
('00:00:31,920', '00:00:35,850'),
('00:00:35,850', '00:00:38,629'),
('00:00:38,629', '00:00:40,859'),
('00:00:40,859', '00:00:43,170'),
('00:00:43,170', '00:00:45,570'),
('00:00:45,570', '00:00:48,859'),
('00:00:48,859', '00:00:52,019'),
('00:00:52,019', '00:00:54,449'),
('00:00:54,449', '00:00:57,210'),
('00:00:57,210', '00:00:59,519'),
('00:00:59,519', '00:01:02,690'),
('00:01:02,690', '00:01:05,820'),
('00:01:05,820', '00:01:08,549'),
('00:01:08,549', '00:01:10,490'),
('00:01:10,490', '00:01:13,409'),
('00:01:13,409', '00:01:16,409'),
('00:01:16,409', '00:01:18,149'),
('00:01:18,149', '00:01:20,340'),
('00:01:20,340', '00:01:22,649'),
('00:01:22,649', '00:01:26,159'),
('00:01:26,159', '00:01:28,740'),
('00:01:28,740', '00:01:30,810'),
('00:01:30,810', '00:01:33,719'),
('00:01:33,719', '00:01:36,990'),
('00:01:36,990', '00:01:39,119'),
('00:01:39,119', '00:01:41,759'),
('00:01:41,759', '00:01:43,799'),
('00:01:43,799', '00:01:46,619'),
('00:01:46,619', '00:01:49,140'),
('00:01:49,140', '00:01:51,240'),
('00:01:51,240', '00:01:53,759'),
('00:01:53,759', '00:01:56,460'),
('00:01:56,460', '00:01:58,740'),
('00:01:58,740', '00:02:01,640'),
('00:02:01,640', '00:02:04,409'),
('00:02:04,409', '00:02:07,229'),
('00:02:07,229', '00:02:09,380'),
('00:02:09,380', '00:02:12,060'),
('00:02:12,060', '00:02:14,840')]
You could output this in your desired format like so:
for start, end in timestamps:
    print(f"{start} - {end}")
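Put together, the regex approach could look like the following self-contained sketch. The input is cut down to the first six timestamps of the question's data, and the names TIMESTAMP and PAIR are only there to avoid repeating the sub-pattern.

```python
import re

TIMESTAMP = r"\d{2}:\d{2}:\d{2},\d{3}"
# Two timestamps separated by ", " form one pair
PAIR = re.compile(rf"({TIMESTAMP}), ({TIMESTAMP})")

# Shortened version of the string from the question
res_timestamps = (
    "[, 00:00:03,950, 00:00:06,840, 00:00:06,840, 00:00:09,180, "
    "00:00:09,180, 00:00:10,830, ]"
)

# findall returns non-overlapping matches, so the timestamps pair up in order
lines = [f"{start} - {end}" for start, end in PAIR.findall(res_timestamps)]
print("\n".join(lines))
```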
Here's a solution without regular expressions
Clean the string, and split on ', ' to create a list
Use string slicing to select the odd and even values and zip them together.
# give data as your string
# convert data into a list by removing end brackets and spaces, and splitting
data = data.replace('[, ', '').replace(', ]', '').split(', ')
# use list slicing and zip the two components
combinations = list(zip(data[::2], data[1::2]))
# print the first 5
print(combinations[:5])
[out]:
[('00:00:03,950', '00:00:06,840'),
('00:00:06,840', '00:00:09,180'),
('00:00:09,180', '00:00:10,830'),
('00:00:10,830', '00:00:14,070'),
('00:00:14,070', '00:00:16,890')]

parsing dates from strings

I have a list of strings in python like this
['AM_B0_D0.0_2016-04-01T010000.flac.h5',
'AM_B0_D3.7_2016-04-13T215000.flac.h5',
'AM_B0_D10.3_2017-03-17T110000.flac.h5',
'AM_B0_D0.7_2016-10-21T104000.flac.h5',
'AM_B0_D4.4_2016-08-05T151000.flac.h5',
'AM_B0_D0.0_2016-04-01T010000.flac.h5',
'AM_B0_D3.7_2016-04-13T215000.flac.h5',
'AM_B0_D10.3_2017-03-17T110000.flac.h5',
'AM_B0_D0.7_2016-10-21T104000.flac.h5',
'AM_B0_D4.4_2016-08-05T151000.flac.h5']
I want to parse only the date and time (for example, 2016-08-05 15:10:00 )from these strings.
So far I used a for loop like the one below but it's very time consuming, is there a better way to do this?
for files in glob.glob("AM_B0_*.flac.h5"):
    if files[11]=='_':
        year=files[12:16]
        month=files[17:19]
        day= files[20:22]
        hour=files[23:25]
        minute=files[25:27]
        second=files[27:29]
        tindex=pd.date_range(start= '%d-%02d-%02d %02d:%02d:%02d' %(int(year),int(month), int(day), int(hour), int(minute), int(second)), periods=60, freq='10S')
    else:
        year=files[11:15]
        month=files[16:18]
        day= files[19:21]
        hour=files[22:24]
        minute=files[24:26]
        second=files[26:28]
        tindex=pd.date_range(start= '%d-%02d-%02d %02d:%02d:%02d' %(int(year), int(month), int(day), int(hour), int(minute), int(second)), periods=60, freq='10S')
Try this (based on the 2nd last '-', no need of if-else case):
filesall = ['AM_B0_D0.0_2016-04-01T010000.flac.h5',
'AM_B0_D3.7_2016-04-13T215000.flac.h5',
'AM_B0_D10.3_2017-03-17T110000.flac.h5',
'AM_B0_D0.7_2016-10-21T104000.flac.h5',
'AM_B0_D4.4_2016-08-05T151000.flac.h5',
'AM_B0_D0.0_2016-04-01T010000.flac.h5',
'AM_B0_D3.7_2016-04-13T215000.flac.h5',
'AM_B0_D10.3_2017-03-17T110000.flac.h5',
'AM_B0_D0.7_2016-10-21T104000.flac.h5',
'AM_B0_D4.4_2016-08-05T151000.flac.h5']
import pandas as pd
def find_second_last(text, pattern):
    # index of the second-to-last occurrence of pattern in text
    return text.rfind(pattern, 0, text.rfind(pattern))
for files in filesall:
    start = find_second_last(files, '-') - 4  # back up to the start of the yyyy- part
    timepart = files[start:start+17].replace("T", " ")
    # insert the two ':' separators into the hhmmss part
    timepart = timepart[:13] + ':' + timepart[13:15] + ':' + timepart[15:]
    # print(timepart)
    tindex = pd.date_range(start=timepart, periods=60, freq='10S')
Instead of hard-coding files[11], look up the index of the last or second-to-last _ and then apply your slicing relative to it; that way you don't have to write the same code twice. Or use a regex to parse the string.
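The regex suggestion could be sketched as follows. It assumes every file name contains exactly one stamp of the form YYYY-MM-DDTHHMMSS, as in the sample list; the function name parse_timestamp is hypothetical.

```python
import re
from datetime import datetime

# Assumed stamp layout: date with dashes, a literal T, then six digits for hhmmss
STAMP = re.compile(r"\d{4}-\d{2}-\d{2}T\d{6}")


def parse_timestamp(filename):
    """Extract the stamp from a file name and return it as a datetime."""
    m = STAMP.search(filename)
    if m is None:
        raise ValueError(f"no timestamp found in {filename!r}")
    return datetime.strptime(m.group(), "%Y-%m-%dT%H%M%S")


filenames = [
    "AM_B0_D0.0_2016-04-01T010000.flac.h5",
    "AM_B0_D3.7_2016-04-13T215000.flac.h5",
    "AM_B0_D10.3_2017-03-17T110000.flac.h5",
    "AM_B0_D0.7_2016-10-21T104000.flac.h5",
    "AM_B0_D4.4_2016-08-05T151000.flac.h5",
]

parsed = [parse_timestamp(name) for name in filenames]
for stamp in parsed:
    print(stamp)
```

Because the search does not depend on character positions, the varying width of the D0.0 / D10.3 part no longer needs an if-else. The resulting datetime objects can be passed as start to pd.date_range.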
