How to make a calculation in a pandas dataframe depending on the value of a certain column - python-3.x

I have this dataframe and I want to make a calculation depending on a condition, like below:
count  prep  result
    0    10     100
   10   100     100
I want to create a new column evaluated that is:
if df['count'] == 0:
    df['evaluated'] = df['result'] / df['prep']
else:
    df['evaluated'] = df['result'] / df['count']
expected result is:
count  prep  result  evaluated
    0    10     100         10
  100    10     100          1
What's the best way to do it? My real dataframe has 30k rows.

You can use where (which keeps values where the condition is True and replaces the others) or mask (the inverse):
df['evaluated'] = df['result'].div(df['prep'].where(df['count'].eq(0), df['count']))
Or:
df['evaluated'] = df['result'].div(df['count'].mask(df['count'].eq(0), df['prep']))
Output (assuming there was an error in the provided input):
count prep result evaluated
0 0 10 100 10.0
1 100 10 100 1.0

You can also use np.where from numpy to do that:
import numpy as np

df['evaluated'] = np.where(df['count'] == 0,
                           df['result'] / df['prep'],    # where count == 0
                           df['result'] / df['count'])   # where count != 0
Performance over 30k rows (the differences are not hugely significant):
>>> %timeit df['result'].div(df['prep'].where(df['count'].eq(0), df['count']))
652 µs ± 12.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit df['result'].div(df['count'].mask(df['count'].eq(0), df['prep']))
638 µs ± 1.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit np.where(df['count'] == 0, df['result'] / df['prep'], df['result'] / df['count'])
462 µs ± 1.63 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
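For reference, here is a minimal self-contained script tying the pieces together, using the corrected input shown in the output above:
import numpy as np
import pandas as pd

df = pd.DataFrame({'count': [0, 100], 'prep': [10, 10], 'result': [100, 100]})

# divide by 'prep' where count is 0, otherwise divide by 'count'
df['evaluated'] = np.where(df['count'] == 0,
                           df['result'] / df['prep'],
                           df['result'] / df['count'])
print(df)
#    count  prep  result  evaluated
# 0      0    10     100       10.0
# 1    100    10     100        1.0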

Related

Pandas using map or apply to make a new column from adjustments using a dictionary

I have data from a sporting event, and I know that there is a bias at each home arena that I want to adjust for. I have already created a dictionary where the arena is the key and the value is the adjustment I want to make.
So for each row, I want to take the home team, look up the adjustment, and then subtract it from the distance column. I have the following code but I cannot seem to get it working.
# Making the dictionary, this is working properly
teams = df.home_team.unique().tolist()
adj_shot_dict = {}
for team in teams:
    df_temp = df[df.home_team == team]
    average = round(df_temp.event_distance.mean(), 2)
    adj_shot_dict[team] = average

def make_adjustment(df):
    team = df.home_team
    distance = df.event_distance
    adj_dist = distance - adj_shot_dict[team]
    return adj_dist

df['adj_dist'] = df['event_distance'].apply(make_adjustment)
IIUC, you already have the dict and you simply want to subtract the mapped adj_shot_dict values from the event_distance column:
df['adj_dist'] = df['event_distance'] - df['home_team'].map(adj_shot_dict)
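A minimal sketch of that one-liner, with sample teams and distances made up for illustration:
import pandas as pd

# hypothetical sample data mirroring the question's columns
df = pd.DataFrame({
    'home_team': ['team1', 'team2', 'team2'],
    'event_distance': [10.0, 20.0, 50.0],
})
adj_shot_dict = {'team1': 10.0, 'team2': 35.0}

# map() translates each home_team to its adjustment; the subtraction is element-wise
df['adj_dist'] = df['event_distance'] - df['home_team'].map(adj_shot_dict)
print(df)
#   home_team  event_distance  adj_dist
# 0     team1            10.0       0.0
# 1     team2            20.0     -15.0
# 2     team2            50.0      15.0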
Old answer
Group by home_team, compute the average of event_distance, then subtract the result from event_distance:
df['adj_dist'] = df['event_distance'] \
    - df.groupby('home_team')['event_distance'] \
        .transform('mean').round(2)

# OR
df['adj_dist'] = df.groupby('home_team')['event_distance'] \
    .apply(lambda x: x - x.mean().round(2))
Performance
>>> len(df)
60000
>>> df.sample(5)
  home_team  event_distance
5     team3              60
4     team2              50
1     team2              20
1     team2              20
0     team1              10
def loop():
    teams = df.home_team.unique().tolist()
    adj_shot_dict = {}
    for team in teams:
        df_temp = df[df.home_team == team]
        average = round(df_temp.event_distance.mean(), 2)
        adj_shot_dict[team] = average

def loop2():
    df.groupby('home_team')['event_distance'].transform('mean').round(2)
>>> %timeit loop()
13.5 ms ± 194 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit loop2()
3.62 ms ± 167 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# Total process
>>> %timeit df['event_distance'] - df.groupby('home_team')['event_distance'].transform('mean').round(2)
3.7 ms ± 21.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

How can I reduce the execution time of Python code

In this code I'm calculating the difference between the square of the sum of n numbers and the sum of the squares of n numbers.
Example: n=3, (1+2+3)^2 - (1^2+2^2+3^2) = 36 - 14 = 22
def sum_square_diff(num):
    sum1 = 0
    sum2 = 0
    for i in range(1, num + 1):
        sum1 += i**2
        sum2 += i
    sum2 = sum2**2
    diff = sum2 - sum1
    return diff

if __name__ == "__main__":
    n = int(input())
    for i in range(n):
        num = int(input())
        result = sum_square_diff(num)
        print(result)
This code is correct but it takes too much time to complete execution.
In the first place, the formula that you want to compute has a closed-form representation: the sum of the first n integers is n(n+1)/2 and the sum of their squares is n(n+1)(2n+1)/6, so there is no need for any loops:
n*n*(n+1)*(n+1)/4 - n*(n+1)*(2*n+1)/6
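As a sketch, the closed form wrapped in a function. Both terms are exact integers (n(n+1) is always even, so its square divides by 4; n(n+1)(2n+1) always divides by 6), so integer division keeps the result exact:
def sum_square_diff_closed(n):
    # (1 + 2 + ... + n)^2 - (1^2 + 2^2 + ... + n^2), computed in O(1)
    return n * n * (n + 1) * (n + 1) // 4 - n * (n + 1) * (2 * n + 1) // 6

print(sum_square_diff_closed(3))  # 22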
But if you insist, you can get >3x speedup by using numpy instead of raw Python:
import numpy as np

def sum_square_diff1(num):
    x = np.arange(1, num + 1)
    return x.sum()**2 - (x**2).sum()
In [7]: %timeit sum_square_diff(100)
19.6 µs ± 435 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [8]: %timeit sum_square_diff1(100)
5.61 µs ± 26.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

parsing a pandas dataframe column with dictionary-formatted data into new columns for each dictionary key

In Python 3 with pandas, imagine there is a dataframe df with a column x:
df = pd.DataFrame(
    [
        {'x': '{"a":"1","b":"2","c":"3"}'},
        {'x': '{"a":"2","b":"3","c":"4"}'}
    ]
)
The column x holds data which looks like a dictionary. I wonder how I can parse it into a new dataframe, so that each key becomes a new column.
The desired output dataframe is like
x,a,b,c
'{"a":"1","b":"2","c":"3"}',1,2,3
'{"a":"2","b":"3","c":"4"}',2,3,4
None of the solutions in this post seem to work in this case:
parsing a dictionary in a pandas dataframe cell into new row cells (new columns)
df1 = pd.DataFrame(df.loc[:, 'x'].values.tolist())
print(df1)
results in the same dataframe; it didn't separate the column into one column per key.
Any 2 cents?
Thanks!
You can also map json.loads and convert to a dataframe like:
import json
df1 = pd.DataFrame(df['x'].map(json.loads).tolist(), index=df.index)
print(df1)
a b c
0 1 2 3
1 2 3 4
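Note that json.loads keeps the values as strings here (the JSON payloads are "1", "2", ...), even though they print like numbers; if numeric columns are wanted, a cast is one more line:
df1 = df1.astype(int)  # convert the string values to integers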
This tests faster than evaluating via ast; below is a benchmark for 40k rows:
m = pd.concat([df]*20000,ignore_index=True)
%%timeit
import json
df1 = pd.DataFrame(m['x'].map(json.loads).tolist(),index=m.index)
#256 ms ± 18.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
import ast
df1 = pd.DataFrame(m['x'].map(ast.literal_eval).tolist(),index=m.index)
#1.32 s ± 136 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
import ast
df1 = pd.DataFrame(m['x'].apply(ast.literal_eval).tolist(),index=m.index)
#1.34 s ± 71.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Because the column holds the string repr of dictionaries, it is necessary to convert the values to dictionaries first:
import ast, json
# performance measured on repeated sample data; real data may behave differently
m = pd.concat([df]*20000,ignore_index=True)
In [98]: %timeit pd.DataFrame([json.loads(x) for x in m['x']], index=m.index)
206 ms ± 1.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
#anky_91 solution
In [99]: %timeit pd.DataFrame(m['x'].map(json.loads).tolist(),index=m.index)
210 ms ± 11.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [100]: %timeit pd.DataFrame(m['x'].map(ast.literal_eval).tolist(),index=m.index)
903 ms ± 12.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [101]: %timeit pd.DataFrame(m['x'].apply(ast.literal_eval).tolist(),index=m.index)
893 ms ± 2.15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
print(df1)
a b c
0 1 2 3
1 2 3 4
Last, to append to the original:
df = df.join(df1)
print(df)
x a b c
0 {"a":"1","b":"2","c":"3"} 1 2 3
1 {"a":"2","b":"3","c":"4"} 2 3 4

Optimizing pandas operation for column intersection

I have a DataFrame with two columns (event and events). The event column contains a particular event id and the events column contains a list of event ids.
Example:
df
event events
'a' ['x','y','abc','a']
'b' ['x','y','c','a']
'c' ['a','c']
'd' ['b']
I want to create another column (eventoccured) indicating whether event is in events.
eventoccured
1
0
1
0
I am currently using
df['eventoccured']= df.apply(lambda x: x['event'] in x['events'], axis=1)
which gives the desired result but is slow; I want a faster solution for this.
Thanks
One idea is to use a list comprehension:
#40k rows
df = pd.concat([df] * 10000, ignore_index=True)
In [217]: %timeit df['eventoccured']= df.apply(lambda x: x['event'] in x['events'], axis=1)
1.15 s ± 36.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [218]: %timeit df['eventoccured1'] = [x in y for x, y in zip(df['event'], df['events'])]
15.2 ms ± 135 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
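Put together as a runnable sketch with the question's sample data; note the comprehension yields booleans, so a cast via int() matches the desired 1/0 output:
import pandas as pd

df = pd.DataFrame({
    'event': ['a', 'b', 'c', 'd'],
    'events': [['x', 'y', 'abc', 'a'], ['x', 'y', 'c', 'a'], ['a', 'c'], ['b']],
})

# membership test per row, casting True/False to 1/0
df['eventoccured'] = [int(x in y) for x, y in zip(df['event'], df['events'])]
print(df['eventoccured'].tolist())  # [1, 0, 1, 0]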

What's the most concise way to iterate over a list by pairs in Python?

I've got the following brute-force option that allows me to iterate over points:
# [x1, y1, x2, y2, ..., xn, yn]
coords = [1, 1, 2, 2, 3, 3]
# The goal is to operate with (x, y) within for loop
for (x, y) in zip(coords[::2], coords[1::2]):
    # do something with (x, y) as a point
Is there a more concise / efficient way to do it?
(coords -> items)
Short Answer
If you want your items grouped with a specific length of 2, then
zip(items[::2], items[1::2])
is one of the best compromises in terms of speed and clarity.
If you can afford an extra line, you can get a bit (or, for larger inputs, a lot) more efficiency by using iterators:
it = iter(items)
zip(it, it)
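To see why the two-line version works, note that both arguments to zip() are the same iterator, so each output tuple consumes two consecutive items:
items = [1, 1, 2, 2, 3, 3]
it = iter(items)
print(list(zip(it, it)))  # [(1, 1), (2, 2), (3, 3)]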
Long Answer
(EDIT: added a method that avoids zip())
You could achieve this in a number of ways.
For convenience, I write those as functions that can be benchmarked.
Also I will leave the size of the group as a parameter n (which, in your case, is 2)
import itertools

def grouping1(items, n=2):
    return zip(*tuple(items[i::n] for i in range(n)))

def grouping2(items, n=2):
    return zip(*tuple(itertools.islice(items, i, None, n) for i in range(n)))

def grouping3(items, n=2):
    # the slice must step by n, otherwise consecutive groups would overlap
    for j in range(len(items) // n):
        yield items[n * j:n * j + n]

def grouping4(items, n=2):
    return zip(*([iter(items)] * n))

def grouping5(items, n=2):
    it = iter(items)
    while True:
        result = []
        for _ in range(n):
            try:
                tmp = next(it)
            except StopIteration:
                break
            else:
                result.append(tmp)
        if len(result) == n:
            yield result
        else:
            break
Benchmarking these with a relatively short list gives:
short = list(range(10))
%timeit [x for x in grouping1(short)]
# 1.33 µs ± 9.82 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit [x for x in grouping2(short)]
# 1.51 µs ± 16.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit [x for x in grouping3(short)]
# 1.14 µs ± 28.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit [x for x in grouping4(short)]
# 639 ns ± 7.56 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit [x for x in grouping5(short)]
# 3.37 µs ± 16.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
For medium sized inputs:
medium = list(range(1000))
%timeit [x for x in grouping1(medium)]
# 21.9 µs ± 466 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit [x for x in grouping2(medium)]
# 25.2 µs ± 257 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit [x for x in grouping3(medium)]
# 65.6 µs ± 233 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit [x for x in grouping4(medium)]
# 18.3 µs ± 114 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit [x for x in grouping5(medium)]
# 257 µs ± 2.88 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
For larger inputs:
large = list(range(1000000))
%timeit [x for x in grouping1(large)]
# 49.7 ms ± 840 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit [x for x in grouping2(large)]
# 37.5 ms ± 42.4 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit [x for x in grouping3(large)]
# 84.4 ms ± 736 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit [x for x in grouping4(large)]
# 31.6 ms ± 85.7 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit [x for x in grouping5(large)]
# 274 ms ± 2.89 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
As far as efficiency goes, grouping4() seems to be the fastest, closely followed by grouping1() or grouping3() (depending on the size of the input).
In your case, grouping1() seems a good compromise between speed and clarity, unless you are willing to wrap it up in a function.
Note that grouping4() requires you to use the same iterator multiple times and:
zip(iter(items), iter(items))
would NOT work.
If you want more control over uneven grouping i.e. when the len(items) is not divisible by n, you could replace zip with itertools.zip_longest() from the standard library.
Note also that grouping4() is essentially the grouper() recipe from the official itertools documentation.
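For reference, that grouper() recipe from the itertools documentation looks essentially like this; fillvalue pads the last group when len(items) is not divisible by n:
from itertools import zip_longest

def grouper(iterable, n, fillvalue=None):
    # collect data into fixed-length chunks or blocks
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

print(list(grouper([1, 1, 2, 2, 3], 2)))  # [(1, 1), (2, 2), (3, None)]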
You can use iter(object) and next(iterator, default) with a known default to exit your loop:
coords = [1, 1, 2, 2, 3, 3]
it = iter(coords)
while it:  # an iterator is always truthy; the break below is what ends the loop
    x = next(it, None)
    y = next(it, None)
    if x is None or y is None:
        break
    # do something with your pairs
    print(x, y)
Output:
1 1
2 2
3 3
