I'm a newbie with regexes and have a little task. I have to write a function which takes a DataFrame and returns a filtered list of column names:
import re
import pandas as pd

def get_ids(df: pd.DataFrame, other_id_vars: list = None) -> list:
    pattern = re.compile('_id_|_id|id_')
    list_ids = [col for col in df.columns if pattern.search(col)]
    if other_id_vars is not None:
        list_ids.extend(other_id_vars)
    return list(set(list_ids))
I need to filter a list of words by an id pattern (_id|_id_|id_), like so:
#from this
['subs_id', 'play_id_game', 'video', 'fluid', 'id_serv']
#into this
['subs_id', 'play_id_game', 'id_serv']
but I don't like the variant above. Do you have any better ideas?
Try: (?:_|^)id(?:_|$)
Explanation:
(?:...) - non-capturing group
_|^ - alternation: match an underscore _ or ^, the start of the string
id - match id literally
_|$ - alternation: match an underscore _ or $, the end of the string
To exclude a bare id from the possible matches, try (?:^id_|_id$|_id_)
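For example, a quick sketch applying that pattern to the sample list from the question (using a plain Python list here rather than DataFrame columns):
import re

pattern = re.compile(r'(?:_|^)id(?:_|$)')
cols = ['subs_id', 'play_id_game', 'video', 'fluid', 'id_serv']
print([c for c in cols if pattern.search(c)])
# ['subs_id', 'play_id_game', 'id_serv']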
You can do something like this: break each value of the list up on underscores and collect the values that contain an id part into a separate new list:
lst = ['subs_id', 'play_id_game', 'video', 'fluid', 'id_serv']

new_lst = []
for value in lst:
    formatted_val = value.split('_')
    for info in formatted_val:
        if info == 'id':
            new_lst.append(value)
            break

print(new_lst)
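Run against the sample list, this prints ['subs_id', 'play_id_game', 'id_serv'], matching the desired output.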
Related
I have a list of strings. Each string has the same length/number of characters in the format
xyzw01.ext or xyzv02.ext, etc.
For example
list 1: ['ABCJ01.ext','CDEJ02.ext','ADEJ01.ext','CDEJ01.ext','ABCJ02.ext','CDEJ03.ext']
list 2: ['ABCJ01.ext','ADEJ01.ext','CDEJ01.ext','RPNJ01.ext','PLEJ01.ext']
From these lists I would like to build new lists containing only the strings with the highest number for each prefix.
So from list 1 I would like to get
['ADEJ01.ext','ABCJ02.ext','CDEJ03.ext']
while for list 2 I would like to get the same list since all numbers are 01.
Is there a "simple" way of achieving this?
You can use defaultdict and max
from collections import defaultdict

def fun(lst):
    res = defaultdict(list)
    for x in lst:
        # group by the 4-character prefix
        res[x[:4]].append(x)
    # keep the entry with the largest two-digit number from each group
    return [max(res[x], key=lambda s: s[4:6]) for x in res]
lst = ['ABCJ01.ext','CDEJ02.ext','ADEJ01.ext','CDEJ01.ext','ABCJ02.ext','CDEJ03.ext']
lst2 = ['ABCJ01.ext','ADEJ01.ext','CDEJ01.ext','RPNJ01.ext','PLEJ01.ext']
print(fun(lst))
print(fun(lst2))
Output:
['ABCJ02.ext', 'CDEJ03.ext', 'ADEJ01.ext']
['ABCJ01.ext', 'ADEJ01.ext', 'CDEJ01.ext', 'RPNJ01.ext', 'PLEJ01.ext']
The easiest way is probably to use an intermediate data structure, like a dict - sort the list items into buckets based on the first part of their names, and then take the maximum number for each bucket. We can just use the built-in max() without a key, since as-given lexicographic sorting works to find the largest. If that's not sufficient, you could use more regex to take the number out of the item and use it as the key instead.
import re

def filter_list(lst):
    prefixes = {}
    for item in lst:
        # use regex to isolate the non-numeric characters at the start of the string
        prefix = re.match(r'^([^0-9]*)', item).group(1)
        # make a bucket based on each prefix, and put the item in it
        prefixes.setdefault(prefix, [])
        prefixes[prefix].append(item)
    # take the maximum item from each bucket
    return [max(value) for value in prefixes.values()]
>>> a = ['ABCJ01.ext','CDEJ02.ext','ADEJ01.ext','CDEJ01.ext','ABCJ02.ext','CDEJ03.ext']
>>> b = ['ABCJ01.ext','ADEJ01.ext','CDEJ01.ext','RPNJ01.ext','PLEJ01.ext']
>>> filter_list(a)
['ABCJ02.ext', 'CDEJ03.ext', 'ADEJ01.ext']
>>> filter_list(b)
['ABCJ01.ext', 'ADEJ01.ext', 'CDEJ01.ext', 'RPNJ01.ext', 'PLEJ01.ext']
In Python 3.7+, this should preserve the order of the list based on the first occurrence of each prefix (i.e. CDEJ03.ext will precede ADEJ01.ext in the output because CDEJ02.ext precedes it in the input).
To get the output in the exact same order as the original list, you'd want to explicitly reassign the key instead of using .setdefault(), perhaps with a pattern like prefixes[prefix] = prefixes[prefix] if prefix in prefixes else [].
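A minimal sketch of that variant, assuming the same structure and re import as filter_list above (the name filter_list_ordered is just for illustration):
def filter_list_ordered(lst):
    prefixes = {}
    for item in lst:
        prefix = re.match(r'^([^0-9]*)', item).group(1)
        # explicit reassignment instead of .setdefault()
        prefixes[prefix] = prefixes[prefix] if prefix in prefixes else []
        prefixes[prefix].append(item)
    return [max(value) for value in prefixes.values()]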
I am trying to write a small script to group strings with similar patterns together. The following is my program snippet, which is working fine, but a little inaccurate.
lst = ["report-2020.10.13", "report-2020.12.12", "analytics-2020.12.14", "sales-cda87", "analytics-2020.11.21", "sales-vu7sa"]
final = []
for pat in lst:
    pat = pat[:len(pat) // 2]
    ils = []
    for pat2 in lst:
        if pat2.startswith(pat):
            ils.append(pat2)
    final.append(tuple(ils))
finalls = list(set(final))
for f in finalls:
    print(f)
Also, I want the exact string pattern that groups the string. For example, from string list ["rep-10-01", "rep-10-02", "rep-11-06"] I want "rep-" as a pattern.
Are there any improvements required? Or are there any libraries/modules that can help me out with the first as well as the second problem?
Thanks in advance.
Does this work as you expected:
from collections import defaultdict

res = defaultdict(str)
lst = ["report-2020.10.13", "report-2020.12.12", "analytics-2020.12.14",
       "sales-cda87", "analytics-2020.11.21", "sales-vu7sa"]
#ll = ['rep-10-01', 'rep-10-02', 'rep-11-06']
for pat in lst:
    pattern = pat.split('-')
    #print(pattern[0])  # real pattern - e.g. report, sales, analytics
    res[pattern[0]] += pat + ', '
print(res)
Output:
defaultdict(<class 'str'>, {'report': 'report-2020.10.13, report-2020.12.12, ', 'analytics': 'analytics-2020.12.14, analytics-2020.11.21, ', 'sales': 'sales-cda87, sales-vu7sa, '})
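If you also want the literal shared prefix for each group (the second part of the question, e.g. 'rep-' from ['rep-10-01', 'rep-10-02', 'rep-11-06']), one possible sketch uses os.path.commonprefix, which works on any list of strings. Note it returns the longest common character prefix ('rep-1' here), so you may want to trim back to the last separator:
import os

group = ['rep-10-01', 'rep-10-02', 'rep-11-06']
common = os.path.commonprefix(group)        # 'rep-1'
# trim back to the last '-' to get a cleaner pattern
pattern = common[:common.rfind('-') + 1]    # 'rep-'
print(pattern)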
Please help me complete this piece of code. Let me know if any other detail is required.
Thanks in advance!
Given: a column 'PROD_NAME' from a pandas dataframe of string type (e.g. Smiths Crinkle Cut Chips Chicken g), and a list of certain words (['Chip', 'Chips', etc.]).
To do: if none of the words from the list is contained in the strings of the dataframe objects, we drop the whole row. Basically we're removing unnecessary products from a dataframe.
This is what data looks like:
Here's my code:
# create a function to Keep only those products which have
# chip, chips, doritos, dorito, pringle, Pringles, Chps, chp, in their name
def onlyChips(df, *cols):
    temp = []
    chips = ['Chip', 'Chips', 'Doritos', 'Dorito', 'Pringle', 'Pringles', 'Chps', 'Chp']
    copy = cp.deepcopy(df)
    for col in [*cols]:
        for i in range(len(copy[col])):
            for item in chips:
                if item not in copy[col][i]:
                    flag = False
                else:
                    flag = True
                    break
            # drop only those strings which don't have any match from the chips list,
            # i.e. if flag never became True
            if not flag:
                # drop the whole row
    return <new created dataframe>
new = onlyChips(df_txn, 'PROD_NAME')
Filter the rows instead of deleting them. Create a boolean mask for each row. Use str.contains on each column you need to search and see if any of the columns match the given criteria row-wise. Filter the rows if not.
search_cols = ['PROD_NAME']
mask = df[search_cols].apply(lambda x: x.str.contains('|'.join(chips))).any(axis=1)
df = df[mask]
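If the product names don't always match the list's capitalization, str.contains also accepts case=False; a small tweak to the mask above (assuming the same chips list from the question):
mask = df[search_cols].apply(
    lambda x: x.str.contains('|'.join(chips), case=False)
).any(axis=1)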
I have a pandas df column which contains some text. I want to compare each word of this text with the elements of a list, and if there is a match, add that word to a new column. I am able to extract the matches using a loop (although that is not ideal), but when the text has no match I am not able to append None. For example:
python list:
bodyparts = ['thumb', 'back', 'elbow', 'shoulder', 'ankle', 'hamstring', 'knee']
Also, the following expression does the job only partially, appending 0 or 1 if there is a match or no match respectively:
input_file_1['bodyparts'] = input_file_1['Description'].apply(lambda x: sum(i in bodyparts for i in x.split()))
Can I use any other expression which actually appends the matched word?
expected output
bodyparts
thumb
back
elbow
none
actual output
1
1
1
0
I think that this will do the job.
bodyparts = ['thumb', 'back', 'elbow', 'shoulder', 'ankle', 'hamstring', 'knee']

def search_bodyparts(s, bodyparts):
    # collect every body part that occurs in the description string
    found_bodyparts = [bodypart for bodypart in bodyparts if bodypart in s]
    if len(found_bodyparts) > 0:
        return ', '.join(found_bodyparts)
    else:
        return None

df['bodyparts'] = df['Description'].apply(lambda x: search_bodyparts(x, bodyparts))
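A quick usage sketch on a made-up DataFrame (the Description values below are purely illustrative):
import pandas as pd

df = pd.DataFrame({'Description': ['sore thumb', 'lower back pain', 'elbow strain', 'headache']})
df['bodyparts'] = df['Description'].apply(lambda x: search_bodyparts(x, bodyparts))
print(df['bodyparts'].tolist())
# ['thumb', 'back', 'elbow', None]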
I am trying to invert an italian-english dictionary using the code that follows.
Some terms have one translation, while others have multiple possibilities. If an entry has multiple translations I iterate through each word, adding it to english-italian dict (if not already present).
If there is a single translation it should not iterate, but as I have written the code, it does. Also, only the last translation of a term with multiple translations is added to the dictionary. I cannot figure out how to rewrite the code to solve what should be a really simple task.
from collections import defaultdict

def invertdict():
    source_dict = {'paramezzale (s.m.)': ['hog', 'keelson', 'inner keel'], 'vento (s.m.)': 'wind'}
    english_dict = defaultdict(list)
    for parola, words in source_dict.items():
        if len(words) > 1:  # more than one translation?
            for word in words:  # if true, iterate through each word
                word = str(word).strip(' ')
                print(word)
        else:  # only one translation, don't iterate!!
            word = str(words).strip(' ')
            print(word)
        if word in english_dict.keys():  # check to see if the term already exists
            if english_dict[word] != parola:  # check that the Italian is not present
                #english_dict[word] = [english_dict[word], parola]
                english_dict[word].append(parola).strip('')
        else:
            english_dict[word] = parola.strip(' ')
    print(len(english_dict))
    for key, value in english_dict.items():
        print(key, value)

invertdict()
When this code is run, I get :
hog
keelson
inner keel
w
i
n
d
2
inner keel paramezzale (s.m.)
d vento (s.m.)
instead of
hog: paramezzale, keelson: paramezzale, inner keel: paramezzale, wind: vento
It would be easier to use lists everywhere in the dictionary, like:
source_dict = {'many translations': ['a', 'b'], 'one translation': ['c']}
Then you need 2 nested loops. Right now you're not always running the inner loop.
for italian_word, english_words in source_dict.items():
    for english_word in english_words:
        # print, add to english dict, etc.
If you can't change the source_dict format, you need to check the type explicitly. I would transform the single item into a list.
for italian_word, item in source_dict.items():
    if not isinstance(item, list):
        item = [item]
Full code:
from collections import defaultdict

source_dict = {'paramezzale (s.m.)': ['hog', 'keelson', 'inner keel'], 'vento (s.m.)': ['wind']}
english_dict = defaultdict(list)

for parola, words in source_dict.items():
    for word in words:
        word = str(word).strip(' ')
        # add to the list if not already present
        # english_dict is a defaultdict(list) so we can use .append directly
        if parola not in english_dict[word]:
            english_dict[word].append(parola)
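For reference, printing the result then gives the inverted mapping the question asks for:
print(dict(english_dict))
# {'hog': ['paramezzale (s.m.)'], 'keelson': ['paramezzale (s.m.)'],
#  'inner keel': ['paramezzale (s.m.)'], 'wind': ['vento (s.m.)']}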