Automatic pattern matching within a list - python-3.x

I am trying to write a small script to group strings with similar patterns together. The following is my program snippet, which works, but is a little inaccurate.
lst = ["report-2020.10.13", "report-2020.12.12", "analytics-2020.12.14", "sales-cda87", "analytics-2020.11.21", "sales-vu7sa"]
final = []
for pat in lst:
    pat = pat[:len(pat) // 2]
    ils = []
    for pat2 in lst:
        if pat2.startswith(pat):
            ils.append(pat2)
    final.append(tuple(ils))
finalls = list(set(final))
for f in finalls:
    print(f)
Also, I want the exact string pattern that groups the strings. For example, from the list ["rep-10-01", "rep-10-02", "rep-11-06"] I want "rep-" as the pattern.
Are there any improvements required? Or are there any libraries/modules that can help me with both the first and the second problem?
Thanks in advance.

Does this work as you expected?
from collections import defaultdict
res = defaultdict(str)
lst = ["report-2020.10.13", "report-2020.12.12", "analytics-2020.12.14",
"sales-cda87", "analytics-2020.11.21", "sales-vu7sa"]
#ll = ['rep-10-01', 'rep-10-02', 'rep-11-06']
for pat in lst:
    pattern = pat.split('-')
    # print(pattern[0])  # real pattern - e.g. report, sales, analytics
    res[pattern[0]] += pat + ', '
print(res)
Output:
defaultdict(<class 'str'>, {'report': 'report-2020.10.13, report-2020.12.12, ', 'analytics': 'analytics-2020.12.14, analytics-2020.11.21, ', 'sales': 'sales-cda87, sales-vu7sa, '})
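For the second part of the question (recovering the exact shared pattern, e.g. "rep-"), one sketch is to group by the text before the first separator and then take the longest common prefix of each group, trimmed back to the last separator. `group_by_prefix` is a hypothetical helper name, and this assumes a single `-`-separated naming scheme:

```python
from os.path import commonprefix

def group_by_prefix(names, sep='-'):
    """Group names by the text before the first separator, then report
    the longest common prefix of each group, trimmed to the last separator."""
    groups = {}
    for name in names:
        groups.setdefault(name.split(sep)[0], []).append(name)
    result = {}
    for members in groups.values():
        prefix = commonprefix(members)
        # trim back to the last separator so e.g. 'rep-1' becomes 'rep-'
        cut = prefix.rfind(sep)
        if cut != -1:
            prefix = prefix[:cut + 1]
        result[prefix] = members
    return result

print(group_by_prefix(["rep-10-01", "rep-10-02", "rep-11-06"]))
# {'rep-': ['rep-10-01', 'rep-10-02', 'rep-11-06']}
```

`commonprefix` works on any list of strings, not just paths, which makes it a convenient fit here.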


Elements within a list of lists

I have the below-mentioned list:
a= ['1234,5678\n','90123,45678\n']
The expected output I'm working towards is this:
op = [['1234','5678'],['90123','45678']]
Basically a list of lists with individual elements referring to a particular column.
Using the below-mentioned code, I get the following output:
a = ['1234,5678\n','90123,45678\n']
new_list = []
for element in a:
    # remove newlines
    new_list.append(element.splitlines())
print(new_list)
Output: [['1234,5678'], ['90123,45678']]
Any direction regarding this would be much appreciated.
Check this:
a = ['1234,5678\n','90123,45678\n']
new_list = []
for element in a:
    # remove newlines, then split on commas
    new_list.append(element.strip("\n").split(","))
print(new_list)
Try this:
a = [i.strip("\n").split(",") for i in a]
Since the strings in your input list appear to follow the CSV format, you can use csv.reader to parse them:
import csv
list(csv.reader(a))
This returns:
[['1234', '5678'], ['90123', '45678']]
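One advantage of `csv.reader` over a plain `split(',')` is that it also handles quoted fields that contain commas; a small sketch (the quoted value is made up for illustration):

```python
import csv

# the second list entry contains a quoted field with an embedded comma
rows = ['1234,5678\n', '"90,123",45678\n']
print(list(csv.reader(rows)))
# [['1234', '5678'], ['90,123', '45678']]
```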

match "id" between underscores

I'm a newbie with regexps and have a little task. I have to write a function which takes a DataFrame and returns a filtered list of column names:
import re
import pandas as pd

def get_ids(df: pd.DataFrame, other_id_vars: list = None) -> list:
    pattern = re.compile('_id_|_id|id_')
    list_ids = [col for col in df.columns if pattern.search(col)]
    if other_id_vars is not None:
        list_ids.extend(other_id_vars)
    return list(set(list_ids))
I need to filter a list of words with an id pattern (_id|_id_|id_), like so:
#from this
['subs_id', 'play_id_game', 'video', 'fluid', 'id_serv']
#into this
['subs_id', 'play_id_game', 'id_serv']
but I don't like the variant mentioned above. Do you have any better ideas?
Try: (?:_|^)id(?:_|$)
Explanation:
(?:...) - non-capturing group
_|^ - alternation: match an underscore _ or ^, the beginning of the string
id - match id literally
_|$ - alternation: match an underscore _ or $, the end of the string
To exclude id from possible results try (?:^id_|_id$|_id_)
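To sketch how the suggested pattern behaves on the sample list (using Python's `re` module):

```python
import re

words = ['subs_id', 'play_id_game', 'video', 'fluid', 'id_serv']
# 'id' must be flanked by an underscore or a string boundary,
# so 'video' and 'fluid' do not match
pattern = re.compile(r'(?:_|^)id(?:_|$)')
print([w for w in words if pattern.search(w)])
# ['subs_id', 'play_id_game', 'id_serv']
```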
You can do something like this: split each value of the list on underscores and collect the values that contain an id token into a new list:
lst = ['subs_id', 'play_id_game', 'video', 'fluid', 'id_serv']
new_lst = []
for value in lst:
    formatted_val = value.split('_')
    for info in formatted_val:
        if info == 'id' or info == 'lid' or info == 'idl':
            new_lst.append(value)
print(new_lst)

retain only 1 item in a list per unique prefix

I have an example situation where I have a list as follows:
test = ['a-nyc','a-chi','b-sf','c-dal','a-phx','c-la']
the items in this list are naturally ordered in some way, and the objective is to keep the first encountered value for each prefix, e.g. the desired result is a list as follows:
['a-nyc', 'b-sf', 'c-dal']
is there a handy way of doing this?
It looks like this can be done this way:
newl = []
prel = []
for i in range(len(test)):
    if test[i].split('-')[0] not in prel:
        newl.append(test[i])
    prel.append(test[i].split('-')[0])
but I'm not sure if there is a more Pythonic solution.
Yes, you can also try the following:
test = ['a-nyc', 'a-chi', 'b-sf', 'c-dal', 'a-phx', 'c-la']
prefix = []
newlist = []
for i in test:
    if i.split('-')[0] not in prefix:
        prefix.append(i.split('-')[0])
        newlist.append(i)
print(newlist)
If you have any questions about this, let me know.
Thank you.
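As a possibly more compact variant (my own suggestion, not from either answer above): `dict.setdefault` keeps only the first value seen per prefix, and dicts preserve insertion order in Python 3.7+:

```python
test = ['a-nyc', 'a-chi', 'b-sf', 'c-dal', 'a-phx', 'c-la']
first_by_prefix = {}
for item in test:
    # setdefault stores the value only the first time a prefix is seen
    first_by_prefix.setdefault(item.split('-')[0], item)
print(list(first_by_prefix.values()))
# ['a-nyc', 'b-sf', 'c-dal']
```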

Making a dictionary from a list and a dictionary

I am trying to create a dictionary of codes that I can use for queries and selections. Let's say I have a dictionary of state names and corresponding FIPS codes:
statedict ={'Alabama': '01', 'Alaska':'02', 'Arizona': '04',... 'Wyoming': '56'}
And then I have a list of FIPS codes that I have pulled in from a Map Server request:
fipslist = ['02121', '01034', '56139', '04187', '02003', '04023', '02118']
I want to combine the keys from the dictionary with the list items, matching on the first two characters of each code (e.g. all codes beginning with 01 map to 'Alabama', etc.). My end goal is something like this:
fipsdict ={'Alabama': ['01034'], 'Alaska':['02121', '02003','02118'], 'Arizona': ['04187', '04023'],... 'Wyoming': ['56139']}
I tried to set it up similar to this, but it's not working quite correctly. Any suggestions?
fipsdict = {}
tempList = []
for items in fipslist:
    for k, v in statedict:
        if item[:2] == v in statedict:
            fipsdict[k] = statedict[v]
            fipsdict[v] = tempList.extend(item)
A one-liner with a nested comprehension:
>>> {k:[n for n in fipslist if n[:2]==v] for k,v in statedict.items()}
{'Alabama': ['01034'],
'Alaska': ['02121', '02003', '02118'],
'Arizona': ['04187', '04023'],
'Wyoming': ['56139']}
You will have to create a new list to hold matching fips codes for each state. Below is the code that should work for your case.
state_to_matching_fips_map = {}
for state, two_digit_fips in statedict.items():
    matching_fips = []
    for fips in fipslist:
        if fips[:2] == two_digit_fips:
            matching_fips.append(fips)
    state_to_matching_fips_map[state] = matching_fips
>>> print(state_to_matching_fips_map)
{'Alabama': ['01034'], 'Arizona': ['04187', '04023'], 'Alaska': ['02121', '02003', '02118'], 'Wyoming': ['56139']}
For both proposed solutions I need a reversed state dictionary (I assume that each state has exactly one 2-digit code):
reverse_state_dict = {v: k for k,v in statedict.items()}
An approach based on defaultdict:
from collections import defaultdict
fipsdict = defaultdict(list)
for f in fipslist:
    fipsdict[reverse_state_dict[f[:2]]].append(f)
An approach based on groupby and dictionary comprehension:
from itertools import groupby
{reverse_state_dict[k]: list(v) for k, v
 in groupby(sorted(fipslist), key=lambda x: x[:2])}
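Putting the `defaultdict` approach together with the reversed dictionary on a reduced sample (only the four states listed explicitly in the question; the full mapping is elided there):

```python
from collections import defaultdict

statedict = {'Alabama': '01', 'Alaska': '02', 'Arizona': '04', 'Wyoming': '56'}
fipslist = ['02121', '01034', '56139', '04187', '02003', '04023', '02118']

# invert the mapping: two-digit code -> state name
reverse_state_dict = {v: k for k, v in statedict.items()}

fipsdict = defaultdict(list)
for f in fipslist:
    fipsdict[reverse_state_dict[f[:2]]].append(f)

print(dict(fipsdict))
# {'Alaska': ['02121', '02003', '02118'], 'Alabama': ['01034'],
#  'Wyoming': ['56139'], 'Arizona': ['04187', '04023']}
```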

Data filtering code in Pandas taking lot of time to run

I am executing the below code in Python. It's taking some time to run. Is there something I am doing wrong?
Is there a better way to do the same thing?
y = list(word)
words = y
similar = [[item[0] for item in model.wv.most_similar(word) if item[1] > 0.7] for word in words]
similarity_matrix = pd.DataFrame({'Root_Word': words, 'Similar_Words': similar})
similarity_matrix = similarity_matrix[['Root_Word', 'Similar_Words']]
similarity_matrix['Unlist_Root'] = similarity_matrix['Root_Word'].apply(lambda x: ', '.join(x))
similarity_matrix['Unlist_Similar'] = similarity_matrix['Similar_Words'].apply(lambda x: ', '.join(x))
similarity_matrix = similarity_matrix.drop(['Root_Word', 'Similar_Words'], axis=1)
similarity_matrix.columns = ['Root_Word', 'Similar_Words']
It is not possible to determine what is going on in the following line as there is not enough data provided (I do not know what model is):
similar = [[item[0] for item in model.wv.most_similar(word) if item[1] > 0.7] for word in words]
The second line below does not seem necessary as you create a DataFrame similarity_matrix with only two columns:
similarity_matrix = pd.DataFrame({'Root_Word': words, 'Similar_Words': similar})
# This below does not do anything
similarity_matrix = similarity_matrix[['Root_Word', 'Similar_Words']]
The apply method is not very fast. Try using vectorized methods already implemented in pandas as shown below. Here is a useful link about this topic.
similarity_matrix['Unlist_Root'] = similarity_matrix['Root_Word'].apply(lambda x: ', '.join(x))
# will be faster like this:
similarity_matrix['Unlist_Root'] = similarity_matrix['Root_Word'].str.join(', ')
Similarly:
similarity_matrix['Unlist_Similar'] = similarity_matrix['Similar_Words'].apply(lambda x: ', '.join(x))
# will be faster like this:
similarity_matrix['Unlist_Similar'] = similarity_matrix['Similar_Words'].str.join(', ')
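A small sketch of the `str.join` equivalence on toy data (the column name and values are illustrative, not from the question):

```python
import pandas as pd

df = pd.DataFrame({'Similar_Words': [['alpha', 'beta'], ['gamma']]})
# Series.str.join joins the list in each cell, element-wise
print(df['Similar_Words'].str.join(', ').tolist())
# ['alpha, beta', 'gamma']
```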
The rest of the code can't run much faster.
If you provide more data/info, we can help you more than that...
