How to identify number near word using regex - python-3.x
I need to identify numbers near keywords such as number:, no:, etc.
Tried:
import re
matchstring="Sales Quote"
string_lst = ['number:', 'No:','no:','number','No : ']
x=""" Sentence1: Sales Quote number 36886DJ9 is entered
Sentence2: SALES QUOTE No: 89745DFD is entered
Sentence3: Sales Quote No : 7964KL is entered
Sentence4: SALES QUOTE NUMBER:879654DF is entered
Sentence5: salesquote no: 9874656LD is entered"""
documentnumber= re.findall(r"(?:(?<="+matchstring+ '|'.join(string_lst)+r')) [\w\d-]',x,flags=re.IGNORECASE)
print(documentnumber)
Required solution: 36886DJ9, 89745DFD, 7964KL, 879654DF, 9874656LD
Is there any solution?
Actually your solution is very close. You just need to add the missing parentheses and check for optional whitespace:
documentnumber = re.findall(r"(?:(?<=" + matchstring + r").*?(?:" + '|'.join(string_lst) + r')\s?)([\w\d-]*)', x, re.IGNORECASE)
However this won't match the last one (9874656LD) because of the missing whitespace between "sales" and "quote". If you want to build it in the same way as the rest of the pattern, replace the lookbehind with a non-capturing group and join the words with \s?:
documentnumber = re.findall(r"(?:(?:" + r"\s?".join(matchstring.split()) + r").*?(?:" + '|'.join(string_lst) + r')\s?)([\w\d-]*)', x, re.IGNORECASE)
Output:
['36886DJ9', '89745DFD', '7964KL', '879654DF', '9874656LD']
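As a variation (not from the answer above), the match string and the keyword list can also be folded into one hand-written alternation with a capture group; a minimal sketch, assuming the same five test sentences:

```python
import re

x = """ Sentence1: Sales Quote number 36886DJ9 is entered
Sentence2: SALES QUOTE No: 89745DFD is entered
Sentence3: Sales Quote No : 7964KL is entered
Sentence4: SALES QUOTE NUMBER:879654DF is entered
Sentence5: salesquote no: 9874656LD is entered"""

# "sales" and "quote" may be joined, and "number"/"no" may be followed by
# an optional ":" with optional spaces around it
pattern = r"sales\s?quote\s+(?:number|no)\s*:?\s*([\w-]+)"
print(re.findall(pattern, x, flags=re.IGNORECASE))
```

This avoids string concatenation entirely, at the cost of hard-coding the keyword variants into the pattern.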
Related
Split a big text file into multiple smaller one on set parameter of regex
I have a large text file looking like:

....
sdsdsd
..........
asdfhjgjksdfk dfkaskk sdkfk skddkf skdf sdk ssaaa akskdf sdksdfsdf ksdf sd kkkkallwow. sdsdllla lsldlsd lsldlalllLlsdd asdd. sdlsllall asdsdlallOEFOOASllsdl lsdlla. slldlllasdlsd.ss;sdsdasdas.
......
ddss
................
asdfhjgjksdfk ddjafjijjjj.dfsdfsdfsdfsi dfodoof ooosdfow oaosofoodf aosolflldlfl , dskdkkfkdsa asddf;akkdfkdkk . sdlsllall asdsdlallOEFOOASllsdl lsdlla. slldlllasdlsd.ss;sdsdasdas.
.....
xxxx
.......
asdfghjkl

I want to split the text file into multiple small text files on occurrences of ..... [multiple period markers], and save them as .txt files named after the word between the period markers: group1_sdsdsd.txt (the first block), group1_ddss.txt (the second block) and group1_xxxx.txt (the third block). group1 is the identifier for the specific big text file, as I have multiple bigger text files and need to do the same on all of them to know which big file I am splitting.

I have figured that something of the following sort can be done with a regex:

txt = re.sub(r'(([^\w\s])\2+)', r' ', txt).strip()  # for characters repeated more than 2 times

but I am not able to figure it out completely.
If you want to match the sections where the filename sits on its own line between lines of multiple dots, you might use a pattern like:

^\.{3,}\n(\S+)\n\.{3,}(?:\n(?!\.{3,}\n\S+\n\.{3,}).*)*

Explanation:

^ Start of line
\.{3,}\n Match 3 or more dots and a newline
(\S+)\n Capture 1+ non-whitespace chars in group 1 for the filename and match a newline
\.{3,} Match 3 or more dots
(?: Non-capturing group to repeat as a whole part
\n Match a newline
(?!\.{3,}\n\S+\n\.{3,}) Negative lookahead, assert that from the current position we are not looking at dots with a filename in between
.* Match the whole line
)* Close the non-capturing group and optionally repeat it

Then you can use re.finditer to loop over the matches and use the group 1 value as part of the filename.

Example code:

import re

pattern = r"^\.{3,}\n(\S+)\n\.{3,}(?:\n(?!\.{3,}\n\S+\n\.{3,}).*)*"
s = ("....your data here")
your_path = "/your/path/"

matches = re.finditer(pattern, s, re.MULTILINE)
for matchNum, match in enumerate(matches, start=1):
    with open(your_path + "group1_{}".format(match.group(1)), "w") as f:
        f.write(match.group())
Remove leading dollar sign from data and improve current solution
I have a string like so:

"Job 1233:name_uuid (table n_Cars_1234567$20220316) done. Records: 24, with errors: 0."

I'd like to retrieve the date from the table name. So far I use "\$[0-9]+", but this yields $20220316. How do I get only the date, without $?

I'd also like to get the table name, n_Cars_1234567$20220316. So far I have this:

pattern_table_info = "\(([^\)]+)\)"
pattern_table_name = "(?<=table ).*"
table_info = re.search(pattern_table_info, message).group(1)
table = re.search(pattern_table_name, table_info).group(0)

However, I'd like a simpler solution. How can I improve this?

EDIT: Actually the table name should be n_Cars_1234567, i.e. everything after "table" and before the "$" sign. How can this part of the string be retrieved?
You can use a regex with two capturing groups:

table\s+([^()]*)\$([0-9]+)

Details:

table - a word
\s+ - one or more whitespaces
([^()]*) - Group 1: zero or more chars other than ( and )
\$ - a $ char
([0-9]+) - Group 2: one or more digits

See the Python demo:

import re

text = "Job 1233:name_uuid (table n_Cars_1234567$20220316) done. Records: 24, with errors: 0."
rx = r"table\s+([^()]*)\$([0-9]+)"
m = re.search(rx, text)
if m:
    print(m.group(1))
    print(m.group(2))

Output:

n_Cars_1234567
20220316
You can write a single pattern with 2 capture groups:

\(table (\w+\$(\d+))\)

The pattern matches:

\(table ( Capture group 1
\w+\$ Match 1+ word characters and $
(\d+) Capture group 2, match 1+ digits
) Close group 1
\) Match )

Python demo:

import re

s = "Job 1233:name_uuid (table n_Cars_1234567$20220316) done. Records: 24, with errors: 0."
m = re.search(r"\(table (\w+\$(\d+))\)", s)
if m:
    print(m.group(1))
    print(m.group(2))

Output:

n_Cars_1234567$20220316
20220316
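If regex feels heavyweight here, plain string slicing also works on this fixed message shape; a small regex-free sketch (my own variation, not from the answers above), assuming the "(table ...)" part always appears exactly once:

```python
text = "Job 1233:name_uuid (table n_Cars_1234567$20220316) done. Records: 24, with errors: 0."

# take the text between "(table " and ")", then split the name on "$"
inner = text.split("(table ", 1)[1].split(")", 1)[0]
name, _, date = inner.partition("$")

print(name)  # n_Cars_1234567
print(date)  # 20220316
```

partition("$") returns the name, the separator, and the date in one call, so no pattern compilation is needed.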
Replace $$ or more with single space using Regex in python
In the following list of strings I want to replace $$ or more with only one space. E.g. if I have $$, then one space character; if there are $$$$ or more, then also only 1 space is to be substituted. I am using the following regex, but I'm not sure if it serves the purpose:

regex_pattern = r"['$$']{2,}?"

Following is the test string list:

['1', 'Patna City $$$$ $$$$$$$$View Details', 'Serial No:$$$$5$$$$ $$$$Deed No:$$$$5$$$$ $$$$Token No:$$$$7$$$$ $$$$Reg Year:2020', 'Anil Kumar Singh Alias Anil Kumar$$$$$$$$Executant$$$$$$$$Late. Harinandan Singh$$$$$$$$$$$$Md. Shahzad Ahmad$$$$$$$$Claimant$$$$$$$$Late. Md. Serajuddin', 'Anil Kumar Singh Alias Anil Kumar', 'Executant', 'Late. Harinandan Singh', 'Md. Shahzad Ahmad', 'Claimant', 'Late. Md. Serajuddin', 'Circle:Patna City Mauja: $$$$ $$$$Khata : na$$$$ $$$$Plot :2497 Area(in Decimal):1.5002 Land Type :Res. Branch Road Land Value :1520000 MVR Value :1000000', 'Circle:Patna City Mauja: $$$$ $$$$Khata : na$$$$ $$$$Plot :2497 Area(in Decimal):1.5002 Land Type :Res. Branch Road Land Value :1520000 MVR Value :1000000']
About "I am using the following regex but I'm not sure if it serves the purpose": the pattern ['$$']{2,}? can be written as ['$]{2,}? and matches 2 or more chars being either ' or $, in a non-greedy way.

Your pattern currently gets the right matches, as there are no parts like '' or $' present. But as the pattern is non-greedy, it will only match 2 chars and will not match all 3 characters in $$$.

You could write the pattern matching 2 or more dollar signs without making it non-greedy, so an odd number of $ signs will also be matched in full:

regex_pattern = r"\${2,}"

In the replacement, use a space.
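To illustrate the suggested pattern, here is a small sketch on a fragment of the question's own test data:

```python
import re

# a fragment of the test string list from the question
s = 'Anil Kumar Singh$$$$$$$$Executant$$$$$$$$Late. Harinandan Singh'

# every run of 2 or more "$" collapses to a single space; because the
# quantifier is greedy, odd-length runs such as "$$$" are consumed whole
cleaned = re.sub(r"\${2,}", " ", s)
print(cleaned)  # Anil Kumar Singh Executant Late. Harinandan Singh
```

Note that a run like "$$$$ $$$$" (dollar runs separated by a literal space) still leaves multiple spaces behind; if a single space is wanted there too, the pattern could be widened to match the intervening whitespace as well.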
Is this what you need?:

import re

# note: rebinding the loop variable (for d in data: d = re.sub(...)) would
# not change the list, so build a new list instead
data = [re.sub(r'\${2,}', ' ', d) for d in data]
How to extract several timestamp pairs from a list in Python
I have extracted all timestamps from a transcript file. The output looks like this: ('[, 00:00:03,950, 00:00:06,840, 00:00:06,840, 00:00:09,180, 00:00:09,180, ' '00:00:10,830, 00:00:10,830, 00:00:14,070, 00:00:14,070, 00:00:16,890, ' '00:00:16,890, 00:00:19,080, 00:00:19,080, 00:00:21,590, 00:00:21,590, ' '00:00:24,030, 00:00:24,030, 00:00:26,910, 00:00:26,910, 00:00:29,640, ' '00:00:29,640, 00:00:31,920, 00:00:31,920, 00:00:35,850, 00:00:35,850, ' '00:00:38,629, 00:00:38,629, 00:00:40,859, 00:00:40,859, 00:00:43,170, ' '00:00:43,170, 00:00:45,570, 00:00:45,570, 00:00:48,859, 00:00:48,859, ' '00:00:52,019, 00:00:52,019, 00:00:54,449, 00:00:54,449, 00:00:57,210, ' '00:00:57,210, 00:00:59,519, 00:00:59,519, 00:01:02,690, 00:01:02,690, ' '00:01:05,820, 00:01:05,820, 00:01:08,549, 00:01:08,549, 00:01:10,490, ' '00:01:10,490, 00:01:13,409, 00:01:13,409, 00:01:16,409, 00:01:16,409, ' '00:01:18,149, 00:01:18,149, 00:01:20,340, 00:01:20,340, 00:01:22,649, ' '00:01:22,649, 00:01:26,159, 00:01:26,159, 00:01:28,740, 00:01:28,740, ' '00:01:30,810, 00:01:30,810, 00:01:33,719, 00:01:33,719, 00:01:36,990, ' '00:01:36,990, 00:01:39,119, 00:01:39,119, 00:01:41,759, 00:01:41,759, ' '00:01:43,799, 00:01:43,799, 00:01:46,619, 00:01:46,619, 00:01:49,140, ' '00:01:49,140, 00:01:51,240, 00:01:51,240, 00:01:53,759, 00:01:53,759, ' '00:01:56,460, 00:01:56,460, 00:01:58,740, 00:01:58,740, 00:02:01,640, ' '00:02:01,640, 00:02:04,409, 00:02:04,409, 00:02:07,229, 00:02:07,229, ' '00:02:09,380, 00:02:09,380, 00:02:12,060, 00:02:12,060, 00:02:14,840, ]') In this output, there are always timestamp pairs, i.e. always 2 consecutive timestamps belong together, for example: 00:00:03,950 and 00:00:06,840, 00:00:06,840 and 00:00:09,180, etc. Now, I want to extract all these timestamp pairs separately so that the output looks like this: 00:00:03,950 - 00:00:06,840 00:00:06,840 - 00:00:09,180 00:00:09,180 - 00:00:10,830 etc. 
For now, I have the following (very inconvenient) solution for my problem:

# get first part of first timestamp
a = res_timestamps[2:15]
print(dedent(a))

# get second part of first timestamp
b = res_timestamps[17:29]
print(b)

# combine timestamp parts
c = a + ' - ' + b
print(dedent(c))

Of course, this is very bad, since I cannot extract the indices manually for all transcripts. Trying to use a loop has not worked yet because each item is not a timestamp but a single character. Is there an elegant solution for my problem? I appreciate any help or tip. Thank you very much in advance!
Regex to the rescue! A solution that works on your example data:

import re
from pprint import pprint

timestamps = re.findall(r"(\d{2}:\d{2}:\d{2},\d{3}), (\d{2}:\d{2}:\d{2},\d{3})", your_data)
pprint(timestamps)

This prints:

[('00:00:03,950', '00:00:06,840'),
 ('00:00:06,840', '00:00:09,180'),
 ('00:00:09,180', '00:00:10,830'),
 ('00:00:10,830', '00:00:14,070'),
 ('00:00:14,070', '00:00:16,890'),
 ('00:00:16,890', '00:00:19,080'),
 ('00:00:19,080', '00:00:21,590'),
 ('00:00:21,590', '00:00:24,030'),
 ('00:00:24,030', '00:00:26,910'),
 ('00:00:26,910', '00:00:29,640'),
 ('00:00:29,640', '00:00:31,920'),
 ('00:00:31,920', '00:00:35,850'),
 ('00:00:35,850', '00:00:38,629'),
 ('00:00:38,629', '00:00:40,859'),
 ('00:00:40,859', '00:00:43,170'),
 ('00:00:43,170', '00:00:45,570'),
 ('00:00:45,570', '00:00:48,859'),
 ('00:00:48,859', '00:00:52,019'),
 ('00:00:52,019', '00:00:54,449'),
 ('00:00:54,449', '00:00:57,210'),
 ('00:00:57,210', '00:00:59,519'),
 ('00:00:59,519', '00:01:02,690'),
 ('00:01:02,690', '00:01:05,820'),
 ('00:01:05,820', '00:01:08,549'),
 ('00:01:08,549', '00:01:10,490'),
 ('00:01:10,490', '00:01:13,409'),
 ('00:01:13,409', '00:01:16,409'),
 ('00:01:16,409', '00:01:18,149'),
 ('00:01:18,149', '00:01:20,340'),
 ('00:01:20,340', '00:01:22,649'),
 ('00:01:22,649', '00:01:26,159'),
 ('00:01:26,159', '00:01:28,740'),
 ('00:01:28,740', '00:01:30,810'),
 ('00:01:30,810', '00:01:33,719'),
 ('00:01:33,719', '00:01:36,990'),
 ('00:01:36,990', '00:01:39,119'),
 ('00:01:39,119', '00:01:41,759'),
 ('00:01:41,759', '00:01:43,799'),
 ('00:01:43,799', '00:01:46,619'),
 ('00:01:46,619', '00:01:49,140'),
 ('00:01:49,140', '00:01:51,240'),
 ('00:01:51,240', '00:01:53,759'),
 ('00:01:53,759', '00:01:56,460'),
 ('00:01:56,460', '00:01:58,740'),
 ('00:01:58,740', '00:02:01,640'),
 ('00:02:01,640', '00:02:04,409'),
 ('00:02:04,409', '00:02:07,229'),
 ('00:02:07,229', '00:02:09,380'),
 ('00:02:09,380', '00:02:12,060'),
 ('00:02:12,060', '00:02:14,840')]

You could output this in your desired format like so:

for start, end in timestamps:
    print(f"{start} - {end}")
Here's a solution without regular expressions:

Clean the string and split on ', ' to create a list.
Use string slicing to select the odd and even values and zip them together.

# given data as your string
# convert data into a list by removing the end brackets and spaces, and splitting
data = data.replace('[, ', '').replace(', ]', '').split(', ')

# use list slicing and zip the two components
combinations = list(zip(data[::2], data[1::2]))

# print the first 5
print(combinations[:5])

[out]:

[('00:00:03,950', '00:00:06,840'), ('00:00:06,840', '00:00:09,180'), ('00:00:09,180', '00:00:10,830'), ('00:00:10,830', '00:00:14,070'), ('00:00:14,070', '00:00:16,890')]
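Whichever approach produces the pairs, the "start - end" lines the question asks for can be built with a short comprehension; a sketch using the first few pairs from the output above:

```python
# pairs as produced by either the regex or the zip-slicing approach
combinations = [('00:00:03,950', '00:00:06,840'),
                ('00:00:06,840', '00:00:09,180'),
                ('00:00:09,180', '00:00:10,830')]

# one "start - end" string per pair
lines = [f"{start} - {end}" for start, end in combinations]
print("\n".join(lines))
```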
parsing dates from strings
I have a list of strings in Python like this:

['AM_B0_D0.0_2016-04-01T010000.flac.h5', 'AM_B0_D3.7_2016-04-13T215000.flac.h5', 'AM_B0_D10.3_2017-03-17T110000.flac.h5', 'AM_B0_D0.7_2016-10-21T104000.flac.h5', 'AM_B0_D4.4_2016-08-05T151000.flac.h5', 'AM_B0_D0.0_2016-04-01T010000.flac.h5', 'AM_B0_D3.7_2016-04-13T215000.flac.h5', 'AM_B0_D10.3_2017-03-17T110000.flac.h5', 'AM_B0_D0.7_2016-10-21T104000.flac.h5', 'AM_B0_D4.4_2016-08-05T151000.flac.h5']

I want to parse only the date and time (for example, 2016-08-05 15:10:00) from these strings. So far I have used a for loop like the one below, but it's very time consuming. Is there a better way to do this?

import glob
import pandas as pd

for files in glob.glob("AM_B0_*.flac.h5"):
    if files[11] == '_':
        year = files[12:16]
        month = files[17:19]
        day = files[20:22]
        hour = files[23:25]
        minute = files[25:27]
        second = files[27:29]
    else:
        year = files[11:15]
        month = files[16:18]
        day = files[19:21]
        hour = files[22:24]
        minute = files[24:26]
        second = files[26:28]
    tindex = pd.date_range(start='%d-%02d-%02d %02d:%02d:%02d' % (int(year), int(month), int(day), int(hour), int(minute), int(second)), periods=60, freq='10S')
Try this (based on the 2nd-last '-'; no need for the if-else case):

import pandas as pd

filesall = ['AM_B0_D0.0_2016-04-01T010000.flac.h5', 'AM_B0_D3.7_2016-04-13T215000.flac.h5', 'AM_B0_D10.3_2017-03-17T110000.flac.h5', 'AM_B0_D0.7_2016-10-21T104000.flac.h5', 'AM_B0_D4.4_2016-08-05T151000.flac.h5', 'AM_B0_D0.0_2016-04-01T010000.flac.h5', 'AM_B0_D3.7_2016-04-13T215000.flac.h5', 'AM_B0_D10.3_2017-03-17T110000.flac.h5', 'AM_B0_D0.7_2016-10-21T104000.flac.h5', 'AM_B0_D4.4_2016-08-05T151000.flac.h5']

def find_second_last(text, pattern):
    return text.rfind(pattern, 0, text.rfind(pattern))

for files in filesall:
    start = find_second_last(files, '-') - 4  # back up to the yyyy- part
    timepart = (files[start:start + 17]).replace("T", " ")
    # insert the two ':' separators into the compact time
    timepart = timepart[:13] + ':' + timepart[13:15] + ':' + timepart[15:]
    # print(timepart)
    tindex = pd.date_range(start=timepart, periods=60, freq='10S')
Instead of hard-coding files[11], find the last or 2nd-last index of '_'; then you can apply your code once instead of writing the same block twice. Or use a regex to parse the string.
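The regex suggestion above can be sketched out as follows: pull the yyyy-mm-ddThhmmss part out of each filename and hand it to datetime.strptime (the compact pattern and helper names here are my own, not from the answers above):

```python
import re
from datetime import datetime

# two filenames from the question's list
files = ['AM_B0_D0.0_2016-04-01T010000.flac.h5',
         'AM_B0_D10.3_2017-03-17T110000.flac.h5']

# date part (yyyy-mm-dd) and compact time part (hhmmss) around the "T"
pattern = re.compile(r"(\d{4}-\d{2}-\d{2})T(\d{6})")

parsed = []
for name in files:
    m = pattern.search(name)
    # %H%M%S parses the compact time directly, no manual slicing needed
    parsed.append(datetime.strptime(m.group(1) + m.group(2), "%Y-%m-%d%H%M%S"))

print(parsed[0])  # 2016-04-01 01:00:00
```

Each parsed datetime can then be passed straight to pd.date_range(start=parsed[i], periods=60, freq='10S') as in the question.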