How to extract the same parts of file names with different patterns? - python-3.x

I have the following file name formats:
2020-01-05-ABC1111_001.jpg
2020_02_06_B444444_MN_004.jpg
2020_03_20_KUKU44223222-STAFF_005.jpg
2020-04-03-LULU4444211-MN_018.jpg
Most (99%) of the files are of the following format:
2020_04_03_LULU4444211_018.jpg
For those I just use rsplit("_", 2) and get what I need: the first part is the date, the second is an ID (optionally followed by MN or STAFF), and the last is the page number.
How can I build a good regex or split function to split the names into date, id, and page?
From all the above examples I would like to get:
{
"2020-01-05-ABC1111_001.jpg": {"date": 2020-01-05, "id": ABC1111, "page_num": 1},
"2020_02_06_B444444_MN_004.jpg": {"date": 2020_02_06, "id": B444444, "page_num": 4},
"2020_03_20_KUKU44223222-STAFF_005.jpg": {"date": 2020_03_20, "id": KUKU44223222, "page_num": 5},
"2020-04-03-LULU4444211-MN_018.jpg": {"date": 2020-04-03, "id": LULU4444211, "page_num": 18}
}
I have tried rsplit, and I know there is an annotation option plus a spaCy NER model, but maybe there is a simpler way to do it?
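For example, rsplit breaks down on the less common formats:
name = "2020_02_06_B444444_MN_004.jpg"
print(name.rsplit("_", 2))
# ['2020_02_06_B444444', 'MN', '004.jpg'] - the ID stays glued to the date and MN is mistaken for the ID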

You might use code like
import re
strings = ['2020-01-05-ABC1111_001.jpg','2020_02_06_B444444_MN_004.jpg','2020_03_20_KUKU44223222-STAFF_005.jpg','2020-04-03-LULU4444211-MN_018.jpg']
rx = re.compile(r'(?P<date>\d{4}[-_]\d{2}[-_]\d{2})[-_](?P<id>[^_-]+)(?:[_-](?:MN|STAFF))?[_-](?P<page_num>\d+)')
d = {}
for s in strings:
    m = rx.search(s)
    if m:
        d[s] = m.groupdict()
print(d)
See the Python demo, yielding
{'2020-01-05-ABC1111_001.jpg': {'date': '2020-01-05', 'id': 'ABC1111', 'page_num': '001'}, '2020_02_06_B444444_MN_004.jpg': {'date': '2020_02_06', 'id': 'B444444', 'page_num': '004'}, '2020_03_20_KUKU44223222-STAFF_005.jpg': {'date': '2020_03_20', 'id': 'KUKU44223222', 'page_num': '005'}, '2020-04-03-LULU4444211-MN_018.jpg': {'date': '2020-04-03', 'id': 'LULU4444211', 'page_num': '018'}}
Note that the regex contains named capturing groups, so you can access .groupdict() after a match is found; the pattern looks like
(?P<date>\d{4}[-_]\d{2}[-_]\d{2})[-_](?P<id>[^_-]+)(?:[_-](?:MN|STAFF))?[_-](?P<page_num>\d+)
See the regex demo.
Regex details
(?P<date>\d{4}[-_]\d{2}[-_]\d{2}) - Group "date": 4 digits, _ or -, 2 digits, _ or - and then again 2 digits
[-_] - a hyphen or underscore
(?P<id>[^_-]+) - Group "id": 1 or more chars other than - and _
(?:[_-](?:MN|STAFF))? - an optional non-capturing group matching - or _ and then MN or STAFF
[_-] - a - or _
(?P<page_num>\d+) - Group "page_num": 1 or more digits.
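If you need the exact shape shown in the question (page_num as an integer), a small post-processing step on groupdict() is enough; a minimal sketch, assuming the rx pattern above (parse_name is just an illustrative helper name):
import re

rx = re.compile(r'(?P<date>\d{4}[-_]\d{2}[-_]\d{2})[-_](?P<id>[^_-]+)(?:[_-](?:MN|STAFF))?[_-](?P<page_num>\d+)')

def parse_name(name):
    # Returns None when the name does not follow any of the known formats.
    m = rx.search(name)
    if not m:
        return None
    info = m.groupdict()
    info['page_num'] = int(info['page_num'])  # '001' -> 1
    return info

print(parse_name('2020-01-05-ABC1111_001.jpg'))
# {'date': '2020-01-05', 'id': 'ABC1111', 'page_num': 1}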

Regexp:
(\d{4}[-_]\d{2}[-_]\d{2})[-_](.*)[-_](\d+)\.[a-zA-Z]+
It contains three regexp groups:
date
id
page number
Explanation:
(\d{4}[-_]\d{2}[-_]\d{2}) # date (yyyy-mm-dd or yyyy_mm_dd) - group 1
[-_] # separator (dash or underscore)
(.*) # id (any characters) - group 2
[-_] # separator
(\d+) # page number - group 3
\.[a-zA-Z]+ # file extension
Demo: https://regex101.com/r/IPF7QE/1
You can read groups in Python this way:
if match := re.search(regexp, text_line, re.IGNORECASE):
    date = match.group(1)
    id = match.group(2)
    page_number = match.group(3)
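For completeness, a quick run of this pattern over two of the sample names (note that the greedy group 2 keeps a trailing MN/STAFF suffix in the id, so you may still want to strip it afterwards):
import re

regexp = r"(\d{4}[-_]\d{2}[-_]\d{2})[-_](.*)[-_](\d+)\.[a-zA-Z]+"
for text_line in ["2020-01-05-ABC1111_001.jpg", "2020_02_06_B444444_MN_004.jpg"]:
    if match := re.search(regexp, text_line, re.IGNORECASE):
        print(match.group(1), match.group(2), match.group(3))
# 2020-01-05 ABC1111 001
# 2020_02_06 B444444_MN 004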

Related

Best way to handle element of dict that has multiple key/value pairs inside it

[{'id': 2, 'Registered Address': 'Line 1: 1 Any Street Line 2: Any locale City: Any City Region / State: Any Region Postcode / Zip code: BA2 2SA Country: GB Jurisdiction: Any Jurisdiction'}]
I have the above read into a dataframe and that is the output so far. The issue is I need to break out the individual elements - due to names of places etc the values may or may not have spaces in them - looking at the above my keys are Line 1, Line 2, City, Region / State, Postcode / Zip, Country, Jurisdiction.
Output required for the "Registered Address" key is its keys and values:
"Line 1": "1 Any Street"
"Line 2": "Any locale"
"City": "Any City"
"Region / State": "Any Region"
"Postcode / Zip code": "BA2 2SA"
"Country": "GB"
"Jurisdiction": "Any Jurisdiction"
I am just struggling to find a way to get to the end result. I have tried to pop it out and use urllib.parse but fell short - is anyone able to point me in the right direction please?
I tried to write code that generalizes your question, but there were some limitations regarding your data format. Anyway, I would do this:
def address_spliter(my_data, my_keys):
    address_data = my_data[0]['Registered Address']
    key_address = {}
    for i, k in enumerate(my_keys):
        if k == 'Jurisdiction':
            # Last key: take everything after "Jurisdiction:".
            key_address[k] = address_data.split('Jurisdiction:')[1].removeprefix(' ').removesuffix(' ')
        else:
            key_address[k] = address_data.split(k)[1].split(my_keys[i + 1])[0].removeprefix(' ').removesuffix(' ')
    return key_address
where you can call this function with data like this:
my_data = [{'id': 2, 'Registered Address': 'Line 1: 1 Any Street Line 2: Any locale City: Any City Region / State: Any Region Postcode / Zip code: BA2 2SA Country: GB Jurisdiction: Any Jurisdiction'}]
and
my_keys = ['Line 1:','Line 2:','City:', 'Region / State:', 'Postcode / Zip code:', 'Country:', 'Jurisdiction']
As you can see, it will only work if the sequence of keys is not changed. But you can work around this idea and adapt it to your problem accordingly if it doesn't go as expected.
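If the set of labels is fixed, another option (just a sketch, not the approach above) is to split the string on the labels themselves with re.split, which also gives keys without the trailing colons:
import re

my_data = [{'id': 2, 'Registered Address': 'Line 1: 1 Any Street Line 2: Any locale City: Any City Region / State: Any Region Postcode / Zip code: BA2 2SA Country: GB Jurisdiction: Any Jurisdiction'}]
labels = ['Line 1', 'Line 2', 'City', 'Region / State', 'Postcode / Zip code', 'Country', 'Jurisdiction']

# Split on "<label>:" while keeping each label via the capturing group.
pattern = '(' + '|'.join(re.escape(l) for l in labels) + '):'
parts = re.split(pattern, my_data[0]['Registered Address'])
# parts alternates label/value: ['', 'Line 1', ' 1 Any Street ', 'Line 2', ' Any locale ', ...]
address = {parts[i]: parts[i + 1].strip() for i in range(1, len(parts), 2)}
print(address)
# {'Line 1': '1 Any Street', 'Line 2': 'Any locale', 'City': 'Any City', ...}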

Spacy matching priority

I'm looking to create a physics pattern library with spacy:
I want to detect time and speed patterns. My aim is to stay flexible with those patterns.
import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')
time_pattern = [
    [
        {'LIKE_NUM': True, 'OP': '?'},
        {'LOWER': {'IN': ['time', 's', 'h', 'min']}},
        {'LOWER': {'IN': ['maximum', 'minimum', 'min', 'max']}, 'OP': '?'}
    ]
]
speed_pattern = [
    [
        {'LIKE_NUM': True, 'OP': '?'},
        {'LOWER': {'IN': ['km', 'm']}},
        {'IS_PUNCT': True},
        {'LOWER': {'IN': ['h', 'hour', 's', 'min']}}
    ]
]
matcher = Matcher(nlp.vocab, validate=True)
matcher.add("SPEED", speed_pattern)
matcher.add("TIME", time_pattern)
doc = nlp("a certain time, more about 23 min, can't get above 25 km/h")
for id_match, start, end in matcher(doc):
    match_label = nlp.vocab[id_match].text
    print(match_label, '<--', doc[start:end])
So far my code returns this collection of matches:
TIME <-- time
TIME <-- 23 min
TIME <-- min
SPEED <-- 25 km/h
SPEED <-- km/h
TIME <-- h
I want the matcher to match only once, and to match "23 min" rather than "min". I would also like the matcher not to match an element that was already matched (for example "h" should not be matched because it was already matched in "km/h").
You can try adding greedy="LONGEST" to matcher.add() to return only the longest (or FIRST) matches:
matcher.add("SPEED", speed_pattern, greedy="LONGEST")
matcher.add("TIME", time_pattern, greedy="LONGEST")
But note that this doesn't handle overlaps across different match IDs:
TIME <-- 23 min
TIME <-- time
TIME <-- h
SPEED <-- 25 km/h
If you want to filter all of the matches, you can use matcher(doc, as_spans=True) to get the matches directly as spans and then use spacy.util.filter_spans to filter the whole list of spans to a list of non-overlapping spans with the longest spans preferred: https://spacy.io/api/top-level#util.filter_spans
[time, 23 min, 25 km/h]
You can use the as_spans=True option with spacy.matcher.Matcher (introduced in spaCy v3.0):
matches = matcher(doc, as_spans=True)
for span in spacy.util.filter_spans(matches):
    print(span.label_, "->", span.text)
From the documentation:
Instead of tuples, return a list of Span objects of the matches, with the match_id assigned as the span label. Defaults to False.
See the Python demo:
import spacy
from spacy.tokens.doc import Doc
from spacy.matcher import Matcher
nlp = spacy.load('en_core_web_sm')
time_pattern = [
    [
        {'LIKE_NUM': True, 'OP': '?'},
        {'LOWER': {'IN': ['time', 's', 'h', 'min']}},
        {'LOWER': {'IN': ['maximum', 'minimum', 'min', 'max']}, 'OP': '?'}
    ]
]
speed_pattern = [
    [
        {'LIKE_NUM': True, 'OP': '?'},
        {'LOWER': {'IN': ['km', 'm']}},
        {'IS_PUNCT': True},
        {'LOWER': {'IN': ['h', 'hour', 's', 'min']}}
    ]
]
matcher = Matcher(nlp.vocab, validate=True)
matcher.add("SPEED", speed_pattern)
matcher.add("TIME", time_pattern)
doc = nlp("a certain time, more about 23 min, can't get above 25 km/h")
matches = matcher(doc, as_spans=True)
for span in spacy.util.filter_spans(matches):
    print(span.label_, "->", span.text)
Output:
TIME -> time
TIME -> 23 min
SPEED -> 25 km/h

Remove leading dollar sign from data and improve current solution

I have string like so:
"Job 1233:name_uuid (table n_Cars_1234567$20220316) done. Records: 24, with errors: 0."
I'd like to retrieve the date from the table name. So far I use:
"\$[0-9]+"
but this yields $20220316. How do I get only the date, without $?
I'd also like to get the table name: n_Cars_12345678$20220316
So far I have this:
pattern_table_info = "\(([^\)]+)\)"
pattern_table_name = "(?<=table ).*"
table_info = re.search(pattern_table_info, message).group(1)
table = re.search(pattern_table_name, table_info).group(0)
However, I'd like a simpler solution; how can I improve this?
EDIT:
Actually the table name should be:
n_Cars_12345678
So, everything before the "$" sign and after "table" - how can this part of the string be retrieved?
You can use a regex with two capturing groups:
table\s+([^()]*)\$([0-9]+)
See the regex demo. Details:
table - the literal word table
\s+ - one or more whitespaces
([^()]*) - Group 1: zero or more chars other than ( and )
\$ - a $ char
([0-9]+) - Group 2: one or more digits.
See the Python demo:
import re
text = "Job 1233:name_uuid (table n_Cars_1234567$20220316) done. Records: 24, with errors: 0."
rx = r"table\s+([^()]*)\$([0-9]+)"
m = re.search(rx, text)
if m:
    print(m.group(1))
    print(m.group(2))
Output:
n_Cars_1234567
20220316
You can write a single pattern with 2 capture groups:
\(table (\w+\$(\d+))\)
The pattern matches:
\(table  - match (table and the following space
( - start of capture group 1
\w+\$ - match 1+ word characters and $
(\d+) - capture group 2, match 1+ digits
) - close group 1
\) - match )
See a Regex demo and a Python demo.
import re
s = "Job 1233:name_uuid (table n_Cars_1234567$20220316) done. Records: 24, with errors: 0."
m = re.search(r"\(table (\w+\$(\d+))\)", s)
if m:
    print(m.group(1))
    print(m.group(2))
Output
n_Cars_1234567$20220316
20220316
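If you prefer named groups, a small variation on the patterns above (the group names table and date are my own choice) also covers the edit, i.e. the table name without the $ part:
import re

s = "Job 1233:name_uuid (table n_Cars_1234567$20220316) done. Records: 24, with errors: 0."
# Everything between "table " and "$" is the name; the digits after "$" are the date.
m = re.search(r"table\s+(?P<table>[^()$]+)\$(?P<date>\d+)", s)
if m:
    print(m.group("table"))  # n_Cars_1234567
    print(m.group("date"))   # 20220316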

How to apply recursion to this problem and solve it

The Problem is:-
Given a digit string, return all possible letter combinations that the number could represent, according to the buttons on a telephone keypad.
The returned strings must be lexicographically sorted.
Example-1 :-
Input : “23”
Output : ["ad", "ae", "af", "bd", "be", "bf", "cd", "ce", "cf"]
Example-2 :-
Input : “9”
Output: [“w”, “x”, “y”, “z”]
Example-3 :-
Input : “246”
Output : ["agm", "agn", "ago", "ahm", ..., "cho", "cim", "cin" "cio"] {27 elements}
I've squeezed my brain on this and tried a lot, but I'm not getting past this part. What I've tried is to use a recursive function that zips the individual letters of each digit with the other letters and uses itertools.combinations() over it, but I'm unable to complete the function and can't get further.
What I've tried is :-
times, str_res = 0, ""
def getval(lst, times):
    if times == len(lst) - 1:
        for i in lst[times]:
            yield i
    else:
        for i in lst[times]:
            yield i + getval(lst, times + 1)
dct = {"2": ("a","b","c"), "3": ("d","e","f"), "4": ("g","h","i"),
       "5": ("j","k","l"), "6": ("m","n","o"), "7": ("p","q","r","s"),
       "8": ("t","u","v"), "9": ("w","x","y","z"), "1": ("")}
str1, res = "23", []
if len(str1) == 1:
    print(dct[str1[0]])
else:
    temp = [dct[i] for i in str1]
    str_res = getval(temp, times)
    print(str_res)
Please suggest your ideas on this problem or on completing the function.
It's not itertools.combinations that you need, it's itertools.product.
from itertools import product

def all_letter_comb(s, dct):
    for p in product(*map(dct.get, s)):
        yield ''.join(p)

dct = {"2": ("a","b","c"), "3": ("d","e","f"), "4": ("g","h","i"),
       "5": ("j","k","l"), "6": ("m","n","o"), "7": ("p","q","r","s"),
       "8": ("t","u","v"), "9": ("w","x","y","z"), "1": ("")}
for s in ['23', '9', '246']:
    print(s)
    print(list(all_letter_comb(s, dct)))
    print()
Output:
23
['ad', 'ae', 'af', 'bd', 'be', 'bf', 'cd', 'ce', 'cf']
9
['w', 'x', 'y', 'z']
246
['agm', 'agn', 'ago', 'ahm', 'ahn', 'aho', 'aim', 'ain', 'aio', 'bgm', 'bgn', 'bgo', 'bhm', 'bhn', 'bho', 'bim', 'bin', 'bio', 'cgm', 'cgn', 'cgo', 'chm', 'chn', 'cho', 'cim', 'cin', 'cio']
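Since the question explicitly asks about recursion, here is a minimal recursive sketch (my own variant, not required for the itertools solution above) that builds the same results:
def letter_combinations(digits, dct):
    # Base case: no digits left -> a single empty combination.
    if not digits:
        return ['']
    first, rest = digits[0], digits[1:]
    # Prepend each letter of the first digit to every combination of the remaining digits.
    return [letter + tail
            for letter in dct[first]
            for tail in letter_combinations(rest, dct)]

dct = {"2": ("a","b","c"), "3": ("d","e","f"), "4": ("g","h","i"),
       "5": ("j","k","l"), "6": ("m","n","o"), "7": ("p","q","r","s"),
       "8": ("t","u","v"), "9": ("w","x","y","z")}
print(letter_combinations("23", dct))
# ['ad', 'ae', 'af', 'bd', 'be', 'bf', 'cd', 'ce', 'cf']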
If I am not wrong, this is a LeetCode problem; you can find multiple answers there.

Assigning values to imported variables from excel

I need to import an Excel document into Mathematica which has 2000 compounds in it, with each compound having 6 numerical constants assigned to it. The end goal is to type a compound name into Mathematica and have the 6 numerical constants be output. So far my code is:
t = Import["Titles.txt.", {"Text", "Lines"}] (imports compound names)
n = Import["NA.txt.", "List"] (imports the 6 values for each compound)
n[[2]] (outputs the second compounds 6 values)
Instead of n[[#]], I would like to know how to type in a compound name from the imported compound names and have its 6 values be output.
I'm not sure if I understand your question - you have two text files, rather than an Excel file, for example, and it's not clear what the data looks like. But there are probably plenty of ways to do this. Here's a suggestion (it might not be the best way):
Let's assume that you've got all your data into a table (a list of lists):
pt = {
{"Hydrogen", "H", 1, 1.0079, -259, -253, 0.09, 0.14, 1776, 1, 13.5984},
{"Helium", "He", 2, 4.0026, -272, -269, 0, 0, 1895, 18, 24.5874},
{"Lithium" , "Li", 3, 6.941, 180, 1347, 0.53, 0, 1817, 1, 5.3917}
}
To find the information associated with a particular string:
Cases[pt, {"Helium", rest__} -> rest]
{"He", 2, 4.0026, -272, -269, 0, 0, 1895, 18, 24.5874}
where the pattern rest__ holds everything that was found after "Helium".
To look for the second item:
Cases[pt, {_, "Li", rest__} -> rest]
{2, 4.0026, -272, -269, 0, 0, 1895, 18, 24.5874}
If you add more information to the patterns, you have more flexibility in how you choose elements from the table:
Cases[pt, {name_, symbol_, aNumber_, aWeight_, mp_, bp_, density_,
crust_, discovered_, rest__}
/; discovered > 1850 -> {name, symbol, discovered}]
{{"Helium", "He", 1895}}
For something interactive, you could knock up a Manipulate:
elements = pt[[All, 1]];
headings = {"symbol", "aNumber", "aWeight", "mp", "bp", "density", "crust", "discovered", "group", "ion"};
Manipulate[
Column[{
elements[[x]],
TableForm[{
headings, Cases[pt, {elements[[x]], rest__} -> rest]}]}],
{x, 1, Length[elements], 1}]
