Optimizing pandas operation for column intersection - python-3.x

I have a DataFrame with 2 columns (event and events) . Event column contains a particular eventid and events column contain list of event Ids.
Example :-
df
event events
'a' ['x','y','abc','a']
'b' ['x','y','c','a']
'c' ['a','c']
'd' ['b']
I want to create another column(eventoccured) indicating whether event isin events.
eventoccured
1
0
1
0
I am currently using
df['eventoccured']= df.apply(lambda x: x['event'] in x['events'], axis=1)
which gives the desired result but is slow, I want a faster solution for this.
Thanks

One idea is use list comprehension:
#40k rows
df = pd.concat([df] * 10000, ignore_index=True)
In [217]: %timeit df['eventoccured']= df.apply(lambda x: x['event'] in x['events'], axis=1)
1.15 s ± 36.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [218]: %timeit df['eventoccured1'] = [x in y for x, y in zip(df['event'], df['events'])]
15.2 ms ± 135 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Related

How to make a calculation in a pandas daframe depending on a value of a certain column

I have this dataframe and I want to make a calculation depending on a condition, like below:
count prep result
0 10 100
10 100 100
I wanto to create a new column evaluated that is:
if df['count']==0:
df['evaluated'] = df['result'] / df['prep']
else:
df['evaluated'] = df['result'] / df['count']
expected result is:
count prep result evaluated
0 10 100 10
100 10 100 1
What's the best way to do it? My real dataframe has 30k rows.
You can use where or mask:
df['evaluated'] = df['result'].div(df['prep'].where(df['count'].eq(0), df['count']))
Or:
df['evaluated'] = df['result'].div(df['count'].mask(df['count'].eq(0), df['prep']))
Output (assuming there was an error in the provided input):
count prep result evaluated
0 0 10 100 10.0
1 100 10 100 1.0
You can also use np.where from numpy to do that:
df['evaluated'] = np.where(df['count'] == 0,
df['result'] / df['prep'], # == 0
df['result'] / df['count']) # != 0
Performance (not really significant) over 30k rows:
>>> %timeit df['result'].div(df['prep'].where(df['count'].eq(0), df['count']))
652 µs ± 12.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit df['result'].div(df['count'].mask(df['count'].eq(0), df['prep']))
638 µs ± 1.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit np.where(df['count'] == 0, df['result'] / df['prep'], df['result'] / df['count'])
462 µs ± 1.63 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Pandas using map or apply to make a new column from adjustments using a dictionary

I have data from a sporting event and the knowledge is that there is a bias at each home arena that I want to make adjustments for. I have already created a dictionary where the arena is the key and the value is the adjustment I want to make.
So for each row, I want to take the home team, get the adjustment, and then subtract that from the distance column. I have the following code but I cannot seem to get it working.
#Making the dictionary, this is working properly
teams = df.home_team.unique().tolist()
adj_shot_dict = {}
for team in teams:
df_temp = df[df.home_team == team]
average = round(df_temp.event_distance.mean(),2)
adj_shot_dict[team] = average
def make_adjustment(df):
team = df.home_team
distance = df.event_distance
adj_dist = distance - adj_shot_dict[team]
return adj_dist
df['adj_dist'] = df['event_distance'].apply(make_adjustment)
IIUC, you already have the dict and you want simply subtract adj_shot_dict to event_distance column:
df['adj_dist'] = df['event_distance'] - df['home_team'].map(adj_shot_dict)
Old answer
Group by home_team, compute the average of event_distance then subtract the result to event_distance:
df['adj_dist'] = df['event_distance'] \
- df.groupby('home_team')['event_distance'] \
.transform('mean').round(2)
# OR
df['adj_dist'] = df.groupby('home_team')['event_distance'] \
.apply(lambda x: x - x.mean().round(2))
Performance
>>> len(df)
60000
>>> df.sample(5)
home_team event_distance
5 team3 60
4 team2 50
1 team2 20
1 team2 20
0 team1 10
def loop():
teams = df.home_team.unique().tolist()
adj_shot_dict = {}
for team in teams:
df_temp = df[df.home_team == team]
average = round(df_temp.event_distance.mean(),2)
adj_shot_dict[team] = average
def loop2():
df.groupby('home_team')['event_distance'].transform('mean').round(2)
>>> %timeit loop()
13.5 ms ± 194 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit loop2()
3.62 ms ± 167 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# Total process
>>> %timeit df['event_distance'] - df.groupby('home_team')['event_distance'].transform('mean').round(2)
3.7 ms ± 21.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Pythonic way for calculating complex terms in Pandas (values bigger or equal to a number divided by the length of a list)

I have the following dataframe:
simple_list=[[3.0, [1.1, 2.2, 3.3, 4.4, 5.5]]]
simple_list.append([0.25, [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]])
df4=pd.DataFrame(simple_list,columns=['col1','col2'])
I want create a new column called new_col, in which there's the following calculation:
The number of occurrences of elements in col2 that are bigger or equal than the given number in col1, divided by the length of the list in col2.
i.e.,
first value in new_col will be: 0.6 (there are 3 numbers bigger than 3.0, and 5 is the length of this list)
second value in new_col will be: 0.6667 (there are 4 numbers bigger than 0.25, and 6 is the length of this list).
Use DataFrame.squeeze with DataFrame.eval for compare columns and then mean per index:
df4['new'] = df4.explode('col2').eval('col1 < col2').mean(level=0)
Or convert lists to DataFrame and before mean create missing values by df1:
df1 = pd.DataFrame(df4['col2'].tolist(), index=df4.index)
df4['new'] = df1.gt(df4['col1'], axis=0).mask(df1.isna()).mean(axis=1)
Slowier solutions:
Or is possible use list comprehension with convert list to numpy array:
df4['new'] = [(np.array(b) > a).mean() for a, b in df4[['col1','col2']].to_numpy()]
Another idea with DataFrame.apply:
df4['new'] = df4.apply(lambda x: (np.array(x['col2']) > x['col1']).mean(), axis=1)
print (df4)
col1 col2 new
0 3.00 [1.1, 2.2, 3.3, 4.4, 5.5] 0.600000
1 0.25 [0.1, 0.2, 0.3, 0.4, 0.5, 0.6] 0.666667
Perfromance:
df4=pd.DataFrame(simple_list,columns=['col1','col2'])
df4 = pd.concat([df4] * 10000, ignore_index=True)
In [262]: %%timeit
...: df1 = pd.DataFrame(df4['col2'].tolist(), index=df4.index)
...: df4['new'] = df1.gt(df4['col1'], axis=0).mask(df1.isna()).mean(axis=1)
...:
40.9 ms ± 3.03 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [263]: %timeit df4.explode('col2').eval('col1 < col2').mean(level=0)
97.2 ms ± 13.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [264]: %timeit [(np.array(b) > a).mean() for a, b in df4[['col1','col2']].to_numpy()]
305 ms ± 12 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [265]: %timeit df4.apply(lambda x: (np.array(x['col2']) > x['col1']).mean(), axis=1)
1.23 s ± 32.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

parsing a panda dataframe column from a dictionary data form into new columns for each dictionary key

In python 3, pandas. Imagine there is a dataframe df with a column x
df=pd.DataFrame(
[
{'x':'{"a":"1","b":"2","c":"3"}'},
{'x':'{"a":"2","b":"3","c":"4"}'}
]
)
The column x has data which looks like a dictionary. Wonder how can I parse them into a new dataframe, so each key here becomes a new column?
The desired output dataframe is like
x,a,b,c
'{"a":"1","b":"2","c":"3"}',1,2,3
'{"a":"2","b":"3","c":"4"}',2,3,4
None of the solution in this post seems to work in this case
parsing a dictionary in a pandas dataframe cell into new row cells (new columns)
df1=pd.DataFrame(df.loc[:,'x'].values.tolist())
print(df1)
result the same dataframe. didn't separate the column into each key per column
Any 2 cents?
Thanks!
You can also map json.loads and convert to a dataframe like;
import json
df1 = pd.DataFrame(df['x'].map(json.loads).tolist(),index=df.index)
print(df1)
a b c
0 1 2 3
1 2 3 4
this tests to be faster than evaluating via ast , below is the benchmark for 40K rows:
m = pd.concat([df]*20000,ignore_index=True)
%%timeit
import json
df1 = pd.DataFrame(m['x'].map(json.loads).tolist(),index=m.index)
#256 ms ± 18.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
import ast
df1 = pd.DataFrame(m['x'].map(ast.literal_eval).tolist(),index=m.index)
#1.32 s ± 136 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
import ast
df1 = pd.DataFrame(m['x'].apply(ast.literal_eval).tolist(),index=m.index)
#1.34 s ± 71.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Because string repr of dictionaries is necessary convert values to dictionaries:
import ast, json
#performance for repeated sample data, in real data should be different
m = pd.concat([df]*20000,ignore_index=True)
In [98]: %timeit pd.DataFrame([json.loads(x) for x in m['x']], index=m.index)
206 ms ± 1.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
#anky_91 solution
In [99]: %timeit pd.DataFrame(m['x'].map(json.loads).tolist(),index=m.index)
210 ms ± 11.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [100]: %timeit pd.DataFrame(m['x'].map(ast.literal_eval).tolist(),index=m.index)
903 ms ± 12.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [101]: %timeit pd.DataFrame(m['x'].apply(ast.literal_eval).tolist(),index=m.index)
893 ms ± 2.15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
print(df1)
a b c
0 1 2 3
1 2 3 4
Last for append to original:
df = df.join(df1)
print(df)
x a b c
0 {"a":"1","b":"2","c":"3"} 1 2 3
1 {"a":"2","b":"3","c":"4"} 2 3 4

converting pandas dataframe into dictionary??

I have a pandas dataframe as news_datasetwhere column id is an article ID and column Content is Article content (large text). Given as,
ID Content
17283 WASHINGTON — Congressional Republicans have...
17284 After the bullet shells get counted, the blood...
17285 When Walt Disney’s “Bambi” opened in 1942, cri...
17286 Death may be the great equalizer, but it isn’t...
17287 SEOUL, South Korea — North Korea’s leader, ...
Now, all I want to convert pandas dataframe into dictionary such as ID would be a key and Content will the value. Basically, what I have done at first something like,
dd={}
for i in news_dataset['ID']:
for j in news_dataset['Content']:
dd[j]=i
This piece of code is pathetic and taking so much time(> 4 minutes) to get processed. So, after checking for some better approaches(stackoverflow). What I have finally did is,
id_array=[]
content_array=[]
for id_num in news_dataset['ID']:
id_array.append(id_num)
for content in news_dataset['Content']:
content_array.append(content)
news_dict=dict(zip(id_array,content_array))
This code takes nearly 15 seconds to get executed.
What I want to ask is,
i) what's wrong in first code and why it take so much time to get processed?
ii) Does using for loop inside another for loop is wrong way to do iterations when it comes to large text data?
iii) what would be right way to create dictionary using for loop within single piece of query?
I think generally loops in pandas should be avoid if exist some non loop, obviously vectorized alternatives.
You can create index by column ID and call Series.to_dict:
news_dict=news_dataset.set_index('ID')['Content'].to_dict()
Or zip:
news_dict=dict(zip(news_dataset['ID'],news_dataset['Content']))
#alternative
#news_dict=dict(zip(news_dataset['ID'].values, news_dataset['Content'].values))
Performance:
np.random.seed(1425)
#1000rows sample
news_dataset = pd.DataFrame({'ID':np.arange(1000),
'Content':np.random.choice(list('abcdef'), size=1000)})
#print (news_dataset)
In [98]: %%timeit
...: dd={}
...: for i in news_dataset['ID']:
...: for j in news_dataset['Content']:
...: dd[j]=i
...:
61.7 ms ± 2.39 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [99]: %%timeit
...: id_array=[]
...: content_array=[]
...: for id_num in news_dataset['ID']:
...: id_array.append(id_num)
...: for content in news_dataset['Content']:
...: content_array.append(content)
...: news_dict=dict(zip(id_array,content_array))
...:
251 µs ± 3.14 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [100]: %%timeit
...: news_dict=news_dataset.set_index('ID')['Content'].to_dict()
...:
584 µs ± 9.69 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [101]: %%timeit
...: news_dict=dict(zip(news_dataset['ID'],news_dataset['Content']))
...:
106 µs ± 3.94 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [102]: %%timeit
...: news_dict=dict(zip(news_dataset['ID'].values, news_dataset['Content'].values))
...:
122 µs ± 891 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Resources