Pythonic way for calculating complex terms in Pandas (values bigger or equal to a number divided by the length of a list) - python-3.x

I have the following dataframe:
simple_list=[[3.0, [1.1, 2.2, 3.3, 4.4, 5.5]]]
simple_list.append([0.25, [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]])
df4=pd.DataFrame(simple_list,columns=['col1','col2'])
I want to create a new column called new_col, holding the following calculation:
The number of elements in col2 that are greater than or equal to the number in col1, divided by the length of the list in col2.
i.e.,
first value in new_col will be: 0.6 (there are 3 numbers bigger than 3.0, and 5 is the length of this list)
second value in new_col will be: 0.6667 (there are 4 numbers bigger than 0.25, and 6 is the length of this list).

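Use DataFrame.explode with DataFrame.eval to compare the columns, then take the mean per index level:
df4['new'] = df4.explode('col2').eval('col1 < col2').mean(level=0)
In pandas 2.0 and later, where the level argument of mean was removed, group by the index level instead:
df4['new'] = df4.explode('col2').eval('col1 < col2').groupby(level=0).mean()
Or convert the lists to a DataFrame and, before taking the mean, mask the positions that are missing in df1: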
df1 = pd.DataFrame(df4['col2'].tolist(), index=df4.index)
df4['new'] = df1.gt(df4['col1'], axis=0).mask(df1.isna()).mean(axis=1)
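To see why the missing positions have to be excluded, here is a minimal sketch on the question's sample data. It uses an explicit hits / lengths division instead of mask, which is an assumed equivalent formulation chosen for illustration; the names greater, naive, hits and lengths are hypothetical.

```python
import pandas as pd

simple_list = [[3.0, [1.1, 2.2, 3.3, 4.4, 5.5]],
               [0.25, [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]]]
df4 = pd.DataFrame(simple_list, columns=['col1', 'col2'])

# Shorter lists are padded with NaN on the right when expanded to columns.
df1 = pd.DataFrame(df4['col2'].tolist(), index=df4.index)

# NaN > x evaluates to False, so padded cells count as "not greater".
greater = df1.gt(df4['col1'], axis=0)

# Naive mean divides by the padded width (6) for every row.
naive = greater.mean(axis=1)

# Dividing by the number of real (non-NaN) elements gives the intended ratio.
hits = greater.sum(axis=1)
lengths = df1.notna().sum(axis=1)
correct = hits / lengths

print(naive.tolist())
print(correct.tolist())
```

For the first row the naive mean is 3/6 rather than 3/5, which is the error the mask step avoids.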
Slower solutions:
It is also possible to use a list comprehension that converts each list to a NumPy array:
df4['new'] = [(np.array(b) > a).mean() for a, b in df4[['col1','col2']].to_numpy()]
Another idea with DataFrame.apply:
df4['new'] = df4.apply(lambda x: (np.array(x['col2']) > x['col1']).mean(), axis=1)
print (df4)
   col1                            col2       new
0  3.00       [1.1, 2.2, 3.3, 4.4, 5.5]  0.600000
1  0.25  [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]  0.666667
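The question asks for "bigger or equal", while the snippets above use a strict comparison; on the sample data both give the same result. A minimal sketch of the inclusive variant, assuming a hypothetical helper fraction_at_least and the column name new_col from the question:

```python
import numpy as np
import pandas as pd

simple_list = [[3.0, [1.1, 2.2, 3.3, 4.4, 5.5]],
               [0.25, [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]]]
df4 = pd.DataFrame(simple_list, columns=['col1', 'col2'])

def fraction_at_least(threshold, values):
    """Share of elements in `values` that are >= `threshold` (hypothetical helper)."""
    arr = np.asarray(values, dtype=float)
    return float((arr >= threshold).mean())

# One pass over both columns; each row is handled independently.
df4['new_col'] = [fraction_at_least(a, b)
                  for a, b in zip(df4['col1'], df4['col2'])]
print(df4)
```

The two comparisons only differ when a list contains a value exactly equal to the number in col1.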
Performance:
df4=pd.DataFrame(simple_list,columns=['col1','col2'])
df4 = pd.concat([df4] * 10000, ignore_index=True)
In [262]: %%timeit
...: df1 = pd.DataFrame(df4['col2'].tolist(), index=df4.index)
...: df4['new'] = df1.gt(df4['col1'], axis=0).mask(df1.isna()).mean(axis=1)
...:
40.9 ms ± 3.03 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [263]: %timeit df4.explode('col2').eval('col1 < col2').mean(level=0)
97.2 ms ± 13.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [264]: %timeit [(np.array(b) > a).mean() for a, b in df4[['col1','col2']].to_numpy()]
305 ms ± 12 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [265]: %timeit df4.apply(lambda x: (np.array(x['col2']) > x['col1']).mean(), axis=1)
1.23 s ± 32.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Related

How to make a calculation in a pandas dataframe depending on a value of a certain column

I have this dataframe and I want to make a calculation depending on a condition, like below:
count prep result
0 10 100
10 100 100
I want to create a new column, evaluated, that is:
if df['count']==0:
    df['evaluated'] = df['result'] / df['prep']
else:
    df['evaluated'] = df['result'] / df['count']
expected result is:
count prep result evaluated
0 10 100 10
100 10 100 1
What's the best way to do it? My real dataframe has 30k rows.
You can use where or mask:
df['evaluated'] = df['result'].div(df['prep'].where(df['count'].eq(0), df['count']))
Or:
df['evaluated'] = df['result'].div(df['count'].mask(df['count'].eq(0), df['prep']))
Output (assuming there was an error in the provided input):
count prep result evaluated
0 0 10 100 10.0
1 100 10 100 1.0
You can also use np.where from numpy to do that:
df['evaluated'] = np.where(df['count'] == 0,
                           df['result'] / df['prep'],   # == 0
                           df['result'] / df['count'])  # != 0
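A self-contained sketch comparing the where and np.where variants, assuming the corrected input from the output above (count is 0 in the first row and 100 in the second). The names divisor, via_where and via_np are hypothetical.

```python
import numpy as np
import pandas as pd

# Assumed input, matching the corrected table in the answer.
df = pd.DataFrame({'count': [0, 100], 'prep': [10, 10], 'result': [100, 100]})

# where: keep prep where count == 0, otherwise take count, then divide once.
divisor = df['prep'].where(df['count'].eq(0), df['count'])
via_where = df['result'].div(divisor)

# np.where: both divisions are computed for every row, then one is selected.
# result / count is inf where count == 0, but that value is never picked.
via_np = np.where(df['count'] == 0,
                  df['result'] / df['prep'],
                  df['result'] / df['count'])

df['evaluated'] = via_where
print(df)
```

Note that np.where returns a NumPy array rather than a Series; assigning it to a column works because the lengths match.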
Performance (not really significant) over 30k rows:
>>> %timeit df['result'].div(df['prep'].where(df['count'].eq(0), df['count']))
652 µs ± 12.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit df['result'].div(df['count'].mask(df['count'].eq(0), df['prep']))
638 µs ± 1.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit np.where(df['count'] == 0, df['result'] / df['prep'], df['result'] / df['count'])
462 µs ± 1.63 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Pandas using map or apply to make a new column from adjustments using a dictionary

I have data from a sporting event, and I know there is a bias at each home arena that I want to adjust for. I have already created a dictionary where the arena is the key and the value is the adjustment I want to make.
So for each row, I want to take the home team, get the adjustment, and then subtract that from the distance column. I have the following code but I cannot seem to get it working.
#Making the dictionary, this is working properly
teams = df.home_team.unique().tolist()
adj_shot_dict = {}
for team in teams:
    df_temp = df[df.home_team == team]
    average = round(df_temp.event_distance.mean(),2)
    adj_shot_dict[team] = average
def make_adjustment(df):
    team = df.home_team
    distance = df.event_distance
    adj_dist = distance - adj_shot_dict[team]
    return adj_dist
df['adj_dist'] = df['event_distance'].apply(make_adjustment)
IIUC, you already have the dict and simply want to subtract the adj_shot_dict value from the event_distance column:
df['adj_dist'] = df['event_distance'] - df['home_team'].map(adj_shot_dict)
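The question's attempt fails because Series.apply passes each scalar distance to the function, not a row, so there is no home_team to look up. A minimal sketch on made-up sample data (the three rows below are an assumption for illustration) showing the map-based version next to a row-wise equivalent:

```python
import pandas as pd

# Hypothetical sample data, not from the question.
df = pd.DataFrame({'home_team': ['team1', 'team1', 'team2'],
                   'event_distance': [10.0, 20.0, 50.0]})

# Same dictionary as the question's loop builds: team -> rounded mean distance.
adj_shot_dict = (df.groupby('home_team')['event_distance']
                   .mean().round(2).to_dict())

# Vectorized: translate each team to its adjustment, then subtract.
df['adj_dist'] = df['event_distance'] - df['home_team'].map(adj_shot_dict)

# Row-wise equivalent of what the question intended (slower, axis=1 is required).
row_wise = df.apply(
    lambda row: row['event_distance'] - adj_shot_dict[row['home_team']],
    axis=1)

print(df)
```

Teams missing from the dictionary would produce NaN with map, whereas the row-wise lookup would raise a KeyError.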
Old answer
Group by home_team, compute the average of event_distance, then subtract the result from event_distance:
df['adj_dist'] = df['event_distance'] \
    - df.groupby('home_team')['event_distance'] \
        .transform('mean').round(2)
# OR (transform keeps the original index, so the result aligns on assignment)
df['adj_dist'] = df.groupby('home_team')['event_distance'] \
    .transform(lambda x: x - x.mean().round(2))
Performance
>>> len(df)
60000
>>> df.sample(5)
home_team event_distance
5 team3 60
4 team2 50
1 team2 20
1 team2 20
0 team1 10
def loop():
    teams = df.home_team.unique().tolist()
    adj_shot_dict = {}
    for team in teams:
        df_temp = df[df.home_team == team]
        average = round(df_temp.event_distance.mean(),2)
        adj_shot_dict[team] = average
def loop2():
    df.groupby('home_team')['event_distance'].transform('mean').round(2)
>>> %timeit loop()
13.5 ms ± 194 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit loop2()
3.62 ms ± 167 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# Total process
>>> %timeit df['event_distance'] - df.groupby('home_team')['event_distance'].transform('mean').round(2)
3.7 ms ± 21.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

replace all values in range of columns based on condition

How can I replace values in multiple columns based on a condition?
Suppose I have a df looking something like this:
df = pd.DataFrame({'A': [1,2,3,4], 'C': [1,2,3,4], 'B': [3,4,6,6]})
With numpy I can change the value of a column based on a condition like this:
df['A'] = np.where((df['B'] < 5), '-', df['A'])
But how can I change the value of many columns based on a condition? Thought I could do something like below but that's not working.
df[['A','C']] = np.where((df['B'] < 5), '-', df[['A', 'C']])
I could use a loop, but that does not feel very pythonic/pandas:
cols = ['A', 'C']
for col in cols:
    df[col] = np.where((df['B'] < 5), '-', df[col])
One idea is to use DataFrame.mask:
df[['A','C']] = df[['A', 'C']].mask(df['B'] < 5, '-')
print (df)
A C B
0 - - 3
1 - - 4
2 3 3 6
3 4 4 6
Alternative solution with DataFrame.loc:
df.loc[df['B'] < 5, ['A','C']] = '-'
print (df)
A C B
0 - - 3
1 - - 4
2 3 3 6
3 4 4 6
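Solution with numpy.where and a broadcast mask (the mask is converted with to_numpy() first, because pandas 2.0 and later no longer support indexing a Series with [:, None]):
df[['A','C']] = np.where((df['B'] < 5).to_numpy()[:, None], '-', df[['A', 'C']])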
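A minimal sketch checking that the three approaches agree, using a numeric replacement (0) so the columns keep a numeric dtype. The helper make_df and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

def make_df():
    """Fresh copy of the question's sample frame (hypothetical helper)."""
    return pd.DataFrame({'A': [1, 2, 3, 4], 'C': [1, 2, 3, 4], 'B': [3, 4, 6, 6]})

cols = ['A', 'C']

# 1) DataFrame.mask: replace where the row condition is True.
df_mask = make_df()
df_mask[cols] = df_mask[cols].mask(df_mask['B'] < 5, 0)

# 2) DataFrame.loc: assign in place to the selected rows and columns.
df_loc = make_df()
df_loc.loc[df_loc['B'] < 5, cols] = 0

# 3) numpy.where: the (4, 1) condition broadcasts across both columns.
df_np = make_df()
cond = (df_np['B'] < 5).to_numpy()[:, None]
df_np[cols] = np.where(cond, 0, df_np[cols].to_numpy())

print(df_np)
```

With a string replacement such as '-', numpy.where builds a string array, so the untouched numbers come back as strings; mask and loc leave them as numbers in an object column.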
Performance with mixed values (numeric and string):
df = pd.DataFrame({'A': [1,2,3,4], 'C': [1,2,3,4], 'B': [3,4,6,6]})
#400k rows
df = pd.concat([df] * 100000, ignore_index=True)
In [217]: %timeit df[['A','C']] = df[['A', 'C']].mask(df['B'] < 5, '-')
171 ms ± 13.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [219]: %timeit df[['A','C']] = np.where((df['B'] < 5)[:, None], '-', df[['A', 'C']])
72.5 ms ± 11.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [221]: %timeit df.loc[df['B'] < 5, ['A','C']] = '-'
27.8 ms ± 533 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Performance when replacing with a numeric value:
df = pd.DataFrame({'A': [1,2,3,4], 'C': [1,2,3,4], 'B': [3,4,6,6]})
df = pd.concat([df] * 100000, ignore_index=True)
In [229]: %timeit df[['A','C']] = df[['A', 'C']].mask(df['B'] < 5, 0)
187 ms ± 4.24 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [231]: %timeit df[['A','C']] = np.where((df['B'] < 5)[:, None], 0, df[['A', 'C']])
20.8 ms ± 455 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [233]: %timeit df.loc[df['B'] < 5, ['A','C']] = 0
61.3 ms ± 1.06 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

parsing a panda dataframe column from a dictionary data form into new columns for each dictionary key

In python 3, pandas. Imagine there is a dataframe df with a column x
df=pd.DataFrame(
[
{'x':'{"a":"1","b":"2","c":"3"}'},
{'x':'{"a":"2","b":"3","c":"4"}'}
]
)
The column x has data which looks like a dictionary. I wonder how I can parse it into a new dataframe, so that each key becomes a new column?
The desired output dataframe is like
x,a,b,c
'{"a":"1","b":"2","c":"3"}',1,2,3
'{"a":"2","b":"3","c":"4"}',2,3,4
None of the solutions in this post seem to work in this case:
parsing a dictionary in a pandas dataframe cell into new row cells (new columns)
df1=pd.DataFrame(df.loc[:,'x'].values.tolist())
print(df1)
The result is the same dataframe; it didn't separate the column into one column per key.
Any 2 cents?
Thanks!
You can also map json.loads and convert the result to a dataframe:
import json
df1 = pd.DataFrame(df['x'].map(json.loads).tolist(),index=df.index)
print(df1)
a b c
0 1 2 3
1 2 3 4
This tests faster than evaluating via ast. Below is the benchmark for 40K rows:
m = pd.concat([df]*20000,ignore_index=True)
%%timeit
import json
df1 = pd.DataFrame(m['x'].map(json.loads).tolist(),index=m.index)
#256 ms ± 18.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
import ast
df1 = pd.DataFrame(m['x'].map(ast.literal_eval).tolist(),index=m.index)
#1.32 s ± 136 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
import ast
df1 = pd.DataFrame(m['x'].apply(ast.literal_eval).tolist(),index=m.index)
#1.34 s ± 71.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Because the values are string representations of dictionaries, they must first be converted to dictionaries:
import ast, json
#performance for repeated sample data, in real data should be different
m = pd.concat([df]*20000,ignore_index=True)
In [98]: %timeit pd.DataFrame([json.loads(x) for x in m['x']], index=m.index)
206 ms ± 1.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
#anky_91 solution
In [99]: %timeit pd.DataFrame(m['x'].map(json.loads).tolist(),index=m.index)
210 ms ± 11.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [100]: %timeit pd.DataFrame(m['x'].map(ast.literal_eval).tolist(),index=m.index)
903 ms ± 12.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [101]: %timeit pd.DataFrame(m['x'].apply(ast.literal_eval).tolist(),index=m.index)
893 ms ± 2.15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
print(df1)
a b c
0 1 2 3
1 2 3 4
Finally, join the result to the original:
df = df.join(df1)
print(df)
x a b c
0 {"a":"1","b":"2","c":"3"} 1 2 3
1 {"a":"2","b":"3","c":"4"} 2 3 4
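Putting the steps together, here is a minimal end-to-end sketch on the question's sample data. The astype(int) step is an assumption added for illustration: json.loads leaves the values as strings, while the desired output shows numbers. The names parsed and out are hypothetical.

```python
import json
import pandas as pd

df = pd.DataFrame([{'x': '{"a":"1","b":"2","c":"3"}'},
                   {'x': '{"a":"2","b":"3","c":"4"}'}])

# Parse each JSON string into a dict; keys become columns.
parsed = pd.DataFrame([json.loads(s) for s in df['x']], index=df.index)

# Assumption: every value is an integer stored as a string.
parsed = parsed.astype(int)

# Keep the original column and add the parsed ones, aligned on the index.
out = df.join(parsed)
print(out)
```

json.loads only works when the strings are valid JSON (double-quoted keys and values); strings in Python dict syntax with single quotes would need ast.literal_eval instead.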

Optimizing pandas operation for column intersection

I have a DataFrame with 2 columns (event and events). The event column contains a particular event ID and the events column contains a list of event IDs.
Example:
df
event events
'a' ['x','y','abc','a']
'b' ['x','y','c','a']
'c' ['a','c']
'd' ['b']
I want to create another column (eventoccured) indicating whether event is in events.
eventoccured
1
0
1
0
I am currently using
df['eventoccured']= df.apply(lambda x: x['event'] in x['events'], axis=1)
which gives the desired result but is slow. I want a faster solution for this.
Thanks
One idea is to use a list comprehension:
#40k rows
df = pd.concat([df] * 10000, ignore_index=True)
In [217]: %timeit df['eventoccured']= df.apply(lambda x: x['event'] in x['events'], axis=1)
1.15 s ± 36.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [218]: %timeit df['eventoccured1'] = [x in y for x, y in zip(df['event'], df['events'])]
15.2 ms ± 135 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
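A self-contained sketch of the list comprehension on the question's sample data. The astype(int) conversion is an addition for illustration, since the desired output shows 1/0 rather than True/False.

```python
import pandas as pd

df = pd.DataFrame({
    'event': ['a', 'b', 'c', 'd'],
    'events': [['x', 'y', 'abc', 'a'],
               ['x', 'y', 'c', 'a'],
               ['a', 'c'],
               ['b']],
})

# zip walks both columns in lockstep; `in` tests list membership per row.
flags = [x in y for x, y in zip(df['event'], df['events'])]

# Convert booleans to 1/0 to match the desired output.
df['eventoccured'] = pd.Series(flags, index=df.index).astype(int)
print(df)
```

The comprehension avoids building a Series object for every row, which is where most of the time in apply with axis=1 goes.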

Resources