Python: pandas count row-wise string matches across specific variables - python-3.x

I have a dataset that I am manipulating with Python's pandas library. The data frame has one string variable that is the ID of interest. I have a set of other variables with a shared prefix (e.g. name_). For each row, I want to count the number of variables in this set whose value equals the ID.
Question: Is there a pandas one-liner that I can use?
Here is an example dataset:
import pandas as pd
df = pd.DataFrame({
    'ID_var': ['ab?c', 'xyzyy', 'ab?c', 'ghi55'],
    'name_01': ['def55', 'abc', 'ab?c', 'def'],
    'name_02': ['ab?c', 'jkl123', 'ab?c', 'ghi55'],
    'name_03': ['ghi55', 'mn_o', 'ab?c', 'ghi55'],
    'not_name': [0, 1, 2, 3],
    'other_str': ['str1', 'str2', 'str3', 'str'],
})
For each row, I want to count the number of times the variables with the prefix name_ equal ID_var. So the desired output is:
import pandas as pd
df_final = pd.DataFrame({
    'ID_var': ['ab?c', 'xyzyy', 'ab?c', 'ghi55'],
    'name_01': ['def55', 'abc', 'ab?c', 'def'],
    'name_02': ['ab?c', 'jkl123', 'ab?c', 'ghi55'],
    'name_03': ['ghi55', 'mn_o', 'ab?c', 'ghi55'],
    'not_name': [0, 1, 2, 3],
    'other_str': ['str1', 'str2', 'str3', 'str'],
    'total_name': [1, 0, 3, 2],
})
I haven't been able to find this elsewhere on SO. I suspect there is a way I can use Series.str.contains, but I am not sure how. Thank you in advance.

Here's a pandas one-liner, a bit faster on my machine than @Timeless's answer. The filter call subsets to just the columns that start with name_, then .eq compares those values to the ID_var column column-wise (axis=0), and .sum(axis=1) counts the matches per row.
df['total_name'] = df.filter(regex='^name_').eq(df['ID_var'], axis=0).sum(axis=1)
Timing:
%%timeit
df['total_name'] = df.filter(regex='^name_').eq(df['ID_var'], axis=0).sum(axis=1)
# 196 µs ± 1.41 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
Compared to:
%%timeit
df["total_name"] = [(df.iloc[i, 0] == df.iloc[i, 1:]).sum() for i in range(len(df))]
# 353 µs ± 6.61 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Question: Is there a pandas 1 liner that I can use?
One option is to use a list comprehension with iloc:
df["total_name"] = [(df.iloc[i, 0] == df.iloc[i, 1:]).sum() for i in range(len(df))]
Output:
print(df)
ID_var name_01 name_02 name_03 not_name other_str total_name
0 ab?c def55 ab?c ghi55 0 str1 1
1 xyzyy abc jkl123 mn_o 1 str2 0
2 ab?c ab?c ab?c ab?c 2 str3 3
3 ghi55 def ghi55 ghi55 3 str 2
# %%timeit: 547 µs ± 12.4 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
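Note that the list comprehension above compares ID_var against every other column (df.iloc[i, 1:]), including not_name and other_str; it only produces the right count here because those columns never happen to equal the ID. A safer variant (a sketch, restricting the comparison to the name_ columns via filter):

```python
import pandas as pd

df = pd.DataFrame({
    'ID_var': ['ab?c', 'xyzyy', 'ab?c', 'ghi55'],
    'name_01': ['def55', 'abc', 'ab?c', 'def'],
    'name_02': ['ab?c', 'jkl123', 'ab?c', 'ghi55'],
    'name_03': ['ghi55', 'mn_o', 'ab?c', 'ghi55'],
    'not_name': [0, 1, 2, 3],
    'other_str': ['str1', 'str2', 'str3', 'str'],
})

# compare each row's ID only against the name_ columns, so a stray match
# in an unrelated column can never inflate the count
name_cols = df.filter(regex='^name_')
df['total_name'] = [(name_cols.iloc[i] == df['ID_var'].iloc[i]).sum()
                    for i in range(len(df))]
```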

Related

Pythonic way for calculating complex terms in Pandas (values bigger or equal to a number divided by the length of a list)

I have the following dataframe:
simple_list = [[3.0, [1.1, 2.2, 3.3, 4.4, 5.5]]]
simple_list.append([0.25, [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]])
df4 = pd.DataFrame(simple_list, columns=['col1', 'col2'])
I want to create a new column called new_col, containing the following calculation:
the number of elements in col2 that are greater than or equal to the number in col1, divided by the length of the list in col2.
i.e.,
the first value in new_col will be 0.6 (there are 3 numbers bigger than 3.0, and 5 is the length of the list);
the second value in new_col will be 0.6667 (there are 4 numbers bigger than 0.25, and 6 is the length of the list).
Use DataFrame.explode with DataFrame.eval to compare the columns, then take the mean per index level:
df4['new'] = df4.explode('col2').eval('col1 < col2').mean(level=0)
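Note that Series.mean(level=0) was deprecated and later removed (pandas 2.0); on recent versions the equivalent is a groupby on the index level. A minimal sketch (using a plain comparison instead of eval, since explode leaves col2 as object dtype):

```python
import pandas as pd

simple_list = [[3.0, [1.1, 2.2, 3.3, 4.4, 5.5]],
               [0.25, [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]]]
df4 = pd.DataFrame(simple_list, columns=['col1', 'col2'])

# explode repeats the index once per list element; grouping by the index
# level then averages the boolean comparison back to one value per row
ex = df4.explode('col2')
df4['new'] = (ex['col1'] < ex['col2'].astype(float)).groupby(level=0).mean()
```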
Or convert the lists to a DataFrame and, before taking the mean, mask the positions that are missing in df1:
df1 = pd.DataFrame(df4['col2'].tolist(), index=df4.index)
df4['new'] = df1.gt(df4['col1'], axis=0).mask(df1.isna()).mean(axis=1)
Slower solutions:
It is also possible to use a list comprehension, converting each list to a numpy array:
import numpy as np
df4['new'] = [(np.array(b) > a).mean() for a, b in df4[['col1', 'col2']].to_numpy()]
Another idea with DataFrame.apply:
df4['new'] = df4.apply(lambda x: (np.array(x['col2']) > x['col1']).mean(), axis=1)
print(df4)
col1 col2 new
0 3.00 [1.1, 2.2, 3.3, 4.4, 5.5] 0.600000
1 0.25 [0.1, 0.2, 0.3, 0.4, 0.5, 0.6] 0.666667
Performance:
df4=pd.DataFrame(simple_list,columns=['col1','col2'])
df4 = pd.concat([df4] * 10000, ignore_index=True)
In [262]: %%timeit
...: df1 = pd.DataFrame(df4['col2'].tolist(), index=df4.index)
...: df4['new'] = df1.gt(df4['col1'], axis=0).mask(df1.isna()).mean(axis=1)
...:
40.9 ms ± 3.03 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [263]: %timeit df4.explode('col2').eval('col1 < col2').mean(level=0)
97.2 ms ± 13.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [264]: %timeit [(np.array(b) > a).mean() for a, b in df4[['col1','col2']].to_numpy()]
305 ms ± 12 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [265]: %timeit df4.apply(lambda x: (np.array(x['col2']) > x['col1']).mean(), axis=1)
1.23 s ± 32.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

replace all values in a range of columns based on a condition

How can I replace values in multiple columns based on a condition?
Suppose I have a df looking something like this:
df = pd.DataFrame({'A': [1,2,3,4], 'C': [1,2,3,4], 'B': [3,4,6,6]})
With numpy I can change the value of a column based on a condition like this:
df['A'] = np.where((df['B'] < 5), '-', df['A'])
But how can I change the values of many columns based on a condition? I thought I could do something like below, but it's not working.
df[['A','C']] = np.where((df['B'] < 5), '-', df[['A', 'C']])
I could do a loop, but that does not feel very pythonic/pandas-like:
cols = ['A', 'C']
for col in cols:
    df[col] = np.where((df['B'] < 5), '-', df[col])
One idea is to use DataFrame.mask:
df[['A','C']] = df[['A', 'C']].mask(df['B'] < 5, '-')
print(df)
A C B
0 - - 3
1 - - 4
2 3 3 6
3 4 4 6
Alternative solution with DataFrame.loc:
df.loc[df['B'] < 5, ['A','C']] = '-'
print(df)
A C B
0 - - 3
1 - - 4
2 3 3 6
3 4 4 6
Solution with numpy.where and broadcasting mask:
df[['A','C']] = np.where((df['B'] < 5)[:, None], '-', df[['A', 'C']])
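On recent pandas versions, multi-dimensional indexing on a Series like (df['B'] < 5)[:, None] no longer works (it was deprecated in 1.0 and later removed), so materialize the mask as a numpy array first. A sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4], 'C': [1, 2, 3, 4], 'B': [3, 4, 6, 6]})

# .to_numpy() gives a 1-D bool array; [:, None] reshapes it to (4, 1)
# so it broadcasts across both target columns
mask = (df['B'] < 5).to_numpy()[:, None]
df[['A', 'C']] = np.where(mask, '-', df[['A', 'C']])
```

Note that np.where promotes the untouched integers to strings here, since one branch is a string.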
Performance with mixed values (numeric replaced by a string):
df = pd.DataFrame({'A': [1,2,3,4], 'C': [1,2,3,4], 'B': [3,4,6,6]})
# 400k rows
df = pd.concat([df] * 100000, ignore_index=True)
In [217]: %timeit df[['A','C']] = df[['A', 'C']].mask(df['B'] < 5, '-')
171 ms ± 13.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [219]: %timeit df[['A','C']] = np.where((df['B'] < 5)[:, None], '-', df[['A', 'C']])
72.5 ms ± 11.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [221]: %timeit df.loc[df['B'] < 5, ['A','C']] = '-'
27.8 ms ± 533 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Performance if replacing with a numeric value:
df = pd.DataFrame({'A': [1,2,3,4], 'C': [1,2,3,4], 'B': [3,4,6,6]})
df = pd.concat([df] * 100000, ignore_index=True)
In [229]: %timeit df[['A','C']] = df[['A', 'C']].mask(df['B'] < 5, 0)
187 ms ± 4.24 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [231]: %timeit df[['A','C']] = np.where((df['B'] < 5)[:, None], 0, df[['A', 'C']])
20.8 ms ± 455 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [233]: %timeit df.loc[df['B'] < 5, ['A','C']] = 0
61.3 ms ± 1.06 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

parsing a pandas dataframe column from a dictionary data form into new columns for each dictionary key

In Python 3 with pandas, imagine there is a dataframe df with a column x:
df = pd.DataFrame([
    {'x': '{"a":"1","b":"2","c":"3"}'},
    {'x': '{"a":"2","b":"3","c":"4"}'},
])
The column x holds data that looks like a dictionary. How can I parse it into a new dataframe, so that each key becomes a new column?
The desired output dataframe is like
x,a,b,c
'{"a":"1","b":"2","c":"3"}',1,2,3
'{"a":"2","b":"3","c":"4"}',2,3,4
None of the solutions in this post seem to work in this case:
parsing a dictionary in a pandas dataframe cell into new row cells (new columns)
df1 = pd.DataFrame(df.loc[:, 'x'].values.tolist())
print(df1)
This results in the same dataframe; it didn't split the column into one column per key.
Any 2 cents?
Thanks!
You can map json.loads and convert the result to a dataframe, like:
import json
df1 = pd.DataFrame(df['x'].map(json.loads).tolist(), index=df.index)
print(df1)
a b c
0 1 2 3
1 2 3 4
This benchmarks faster than evaluating via ast; below is the benchmark for 40K rows:
m = pd.concat([df] * 20000, ignore_index=True)
%%timeit
import json
df1 = pd.DataFrame(m['x'].map(json.loads).tolist(), index=m.index)
# 256 ms ± 18.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
import ast
df1 = pd.DataFrame(m['x'].map(ast.literal_eval).tolist(), index=m.index)
# 1.32 s ± 136 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
import ast
df1 = pd.DataFrame(m['x'].apply(ast.literal_eval).tolist(), index=m.index)
# 1.34 s ± 71.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Because the values are string representations of dictionaries, they first need to be parsed into dicts:
import ast, json
# performance on repeated sample data; real data may differ
m = pd.concat([df] * 20000, ignore_index=True)
In [98]: %timeit pd.DataFrame([json.loads(x) for x in m['x']], index=m.index)
206 ms ± 1.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
#anky_91 solution
In [99]: %timeit pd.DataFrame(m['x'].map(json.loads).tolist(),index=m.index)
210 ms ± 11.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [100]: %timeit pd.DataFrame(m['x'].map(ast.literal_eval).tolist(),index=m.index)
903 ms ± 12.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [101]: %timeit pd.DataFrame(m['x'].apply(ast.literal_eval).tolist(),index=m.index)
893 ms ± 2.15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
print(df1)
a b c
0 1 2 3
1 2 3 4
Finally, to append to the original:
df = df.join(df1)
print(df)
x a b c
0 {"a":"1","b":"2","c":"3"} 1 2 3
1 {"a":"2","b":"3","c":"4"} 2 3 4
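On recent pandas, pd.json_normalize offers another route to the same result; a sketch (note .tolist() turns the mapped Series into the list of dicts json_normalize expects):

```python
import json
import pandas as pd

df = pd.DataFrame([
    {'x': '{"a":"1","b":"2","c":"3"}'},
    {'x': '{"a":"2","b":"3","c":"4"}'},
])

# parse each JSON string into a dict, then flatten the dicts into columns
df1 = pd.json_normalize(df['x'].map(json.loads).tolist())
df1.index = df.index  # realign before joining back
df = df.join(df1)
```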

Optimizing pandas operation for column intersection

I have a DataFrame with 2 columns (event and events). The event column contains a particular event ID and the events column contains a list of event IDs.
Example:
df
event events
'a' ['x','y','abc','a']
'b' ['x','y','c','a']
'c' ['a','c']
'd' ['b']
I want to create another column (eventoccured) indicating whether event is in events.
eventoccured
1
0
1
0
I am currently using:
df['eventoccured'] = df.apply(lambda x: x['event'] in x['events'], axis=1)
which gives the desired result but is slow; I want a faster solution.
Thanks
One idea is to use a list comprehension:
# 40k rows
df = pd.concat([df] * 10000, ignore_index=True)
In [217]: %timeit df['eventoccured']= df.apply(lambda x: x['event'] in x['events'], axis=1)
1.15 s ± 36.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [218]: %timeit df['eventoccured1'] = [x in y for x, y in zip(df['event'], df['events'])]
15.2 ms ± 135 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
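An alternative sketch uses DataFrame.explode plus a grouped any(); it is usually slower than the zip list comprehension above but stays entirely within pandas operations:

```python
import pandas as pd

df = pd.DataFrame({
    'event': ['a', 'b', 'c', 'd'],
    'events': [['x', 'y', 'abc', 'a'], ['x', 'y', 'c', 'a'], ['a', 'c'], ['b']],
})

# explode repeats the index once per list element; comparing the two
# columns and taking any() per original index flags rows with a match
ex = df.explode('events')
df['eventoccured'] = (ex['event'] == ex['events']).groupby(level=0).any().astype(int)
```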

converting pandas dataframe into dictionary?

I have a pandas dataframe named news_dataset, where column ID is an article ID and column Content is the article content (large text), given as:
ID Content
17283 WASHINGTON — Congressional Republicans have...
17284 After the bullet shells get counted, the blood...
17285 When Walt Disney’s “Bambi” opened in 1942, cri...
17286 Death may be the great equalizer, but it isn’t...
17287 SEOUL, South Korea — North Korea’s leader, ...
Now, I want to convert the pandas dataframe into a dictionary where ID is the key and Content is the value. At first I did something like:
dd = {}
for i in news_dataset['ID']:
    for j in news_dataset['Content']:
        dd[j] = i
This piece of code is pathetic and takes far too long (> 4 minutes) to run. So, after checking for some better approaches (on Stack Overflow), what I finally did is:
id_array = []
content_array = []
for id_num in news_dataset['ID']:
    id_array.append(id_num)
for content in news_dataset['Content']:
    content_array.append(content)
news_dict = dict(zip(id_array, content_array))
This code takes nearly 15 seconds to get executed.
What I want to ask is:
i) What is wrong with the first code, and why does it take so long to run?
ii) Is a for loop inside another for loop the wrong way to iterate over large text data?
iii) What is the right way to create the dictionary with a single query?
Generally, loops in pandas should be avoided when a vectorized alternative exists. As for (i): the nested loops run len(ID) × len(Content) iterations, i.e. O(n²), and the inner assignment dd[j] = i also inverts the mapping, using Content as the key and overwriting the value with every ID.
You can set the index to column ID and call Series.to_dict:
news_dict=news_dataset.set_index('ID')['Content'].to_dict()
Or zip:
news_dict=dict(zip(news_dataset['ID'],news_dataset['Content']))
#alternative
#news_dict=dict(zip(news_dataset['ID'].values, news_dataset['Content'].values))
Performance:
np.random.seed(1425)
# 1000-row sample
news_dataset = pd.DataFrame({'ID': np.arange(1000),
                             'Content': np.random.choice(list('abcdef'), size=1000)})
# print(news_dataset)
In [98]: %%timeit
...: dd={}
...: for i in news_dataset['ID']:
...: for j in news_dataset['Content']:
...: dd[j]=i
...:
61.7 ms ± 2.39 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [99]: %%timeit
...: id_array=[]
...: content_array=[]
...: for id_num in news_dataset['ID']:
...: id_array.append(id_num)
...: for content in news_dataset['Content']:
...: content_array.append(content)
...: news_dict=dict(zip(id_array,content_array))
...:
251 µs ± 3.14 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [100]: %%timeit
...: news_dict=news_dataset.set_index('ID')['Content'].to_dict()
...:
584 µs ± 9.69 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [101]: %%timeit
...: news_dict=dict(zip(news_dataset['ID'],news_dataset['Content']))
...:
106 µs ± 3.94 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [102]: %%timeit
...: news_dict=dict(zip(news_dataset['ID'].values, news_dataset['Content'].values))
...:
122 µs ± 891 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
