Pandas: optimize lookup of values - python-3.x

I have two dataframes simresult and skipper.
Simresult is in the format:
player_id 0 1 2 3 ... 5 6 7 8 9
0 41843 16.38 10.97 0.37 15.38 ... 23.78 38.08 47.21 26.14 11.94
1 17648 30.14 41.08 1.92 0.18 ... 0.00 39.01 41.08 8.54 22.95
2 16500 19.72 14.38 12.99 3.74 ... 17.72 32.70 3.34 0.50 28.66
3 36727 8.96 30.07 28.45 0.00 ... 0.00 20.48 0.00 27.64 2.15
4 10512 46.59 16.94 1.54 0.00 ... 5.89 0.00 18.33 11.24 0.71
.. ... ... ... ... ... ... ... ... ... ... ...
skipper is:
0 1 2 3 ... 6 7 8 9
0 449732 15862 1200230 10261 ... 814904 14622 16129 2500366
1 14759 10325 17769 16163 ... 10934 15862 17851 21540
2 14542 10934 2038972 2038976 ... 2039175 15181 35356 14622
3 11446 35356 10261 17851 ... 538921 16282 14542 34672
4 15806 21540 894408 37481 ... 15862 480013 15860 814065
.. ... ... ... ... ... ... ... ... ...
where each of those cells is a player_id.
I have some code that takes the player_ids from skipper, looks up each one's value in a given column of simresult, and sums across the row, giving me a list of lists.
i.e. there are 10 ids per row in skipper; for each row, the final output gives me the sum of those 10 ids' values as looked up in the first simulation column, then the second column, etc.
This code works exactly as intended.
import numpy as np

size = skipper.shape[0]
runs = simresult.shape[1] - 1

def getScores(r):
    slist = []
    for s in range(0, size):
        z = 0
        for l in range(0, 10):
            y = skipper.iloc[s, l]  # player_id to look up
            x = simresult.loc[simresult['player_id'] == y, r].iloc[0]
            z += x
        z = round(z, 2)
        slist.append(z)
    return slist

scores = np.empty((runs, 0)).tolist()
for r in range(0, runs):
    scores[r] = getScores(r)
Output after I take the list of lists and put it into another df:
0 1 2 3 ... 6 7 8 9
0 73.90 128.52 106.79 151.37 ... 126.81 115.88 139.87 115.43
1 93.51 135.78 130.62 130.72 ... 136.51 133.90 179.96 207.32
2 112.65 174.84 131.14 103.40 ... 159.58 85.81 116.59 158.88
3 83.37 129.34 117.12 64.50 ... 175.28 69.81 142.74 110.61
4 80.90 142.68 90.23 128.25 ... 51.04 141.63 134.30 95.76
.. ... ... ... ... ... ... ... ... ...
My question: Is there a more efficient way to code this lookup nested loop?
Simplified simresult:
Simplified skipper:
0 1 2 3 ... 6 7 8 9
0 449732 15862 1200230 10261 ... 814904 14622 16129 2500366
Simplified output:
0 1 2 3 ... 6 7 8 9
0 73.90 128.52 106.79 151.37 ... 126.81 115.88 139.87 115.43

Use the magic of replace
rep_df = df.set_index('player_id')
out = skip.replace(rep_df)

def getScores(c):
    rep_df = simresult.set_index('player_id')
    out = round(skipper.replace(rep_df[c]).sum(axis=1), 2).to_list()
    return out
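A usage sketch of the new function (assuming the same runs counter defined above):
scores = [getScores(c) for c in range(runs)]
As a design note, the set_index('player_id') inside getScores is rebuilt on every call; hoisting it out and reusing one rep_df would shave a little more time.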
Thanks to @BENY for the suggestion. Doing the replace/lookup on one column of the simresult df at a time and then summing in place gets me exactly what I want -- at a significantly faster processing time.

Related

Adding rows in dataframe based on changing column condition

I have a dataframe that looks like this.
name Datetime col_3 col_4
8 'Name 1' 2017-01-02T00:00:00 160 1600
9 'Name 1' 2017-01-02T00:00:00 160 1600
10 'Name 1' 2017-01-03T00:00:00 160 1800
.. ... ... ... ...
150 'Name 2' 2004-10-13T00:00:00 160 1600
151 'Name 2' 2004-10-14T00:00:00 160 1600
152 'Name 2' 2004-10-15T00:00:00 160 1800
.. ... ... ... ...
435 'Name 3' 2009-01-02T00:00:00 160 1600
436 'Name 3' 2009-01-02T00:00:00 170 1500
437 'Name 3' 2009-01-03T00:00:00 160 1800
.. ... ... ... ...
Essentially, I want to delete the 'name' column and add a row each time the 'Name-#' field changes, containing only that 'Name-#':
Datetime col_3 col_4
7 'Name 1'
8 2017-01-02T00:00:00 160 1600
9 2017-01-02T00:00:00 160 1600
.. ... ... ... ...
149 'Name 2'
150 2004-10-13T00:00:00 160 1600
151 2004-10-14T00:00:00 160 1600
.. ... ... ... ...
435 'Name 3'
436 2009-01-02T00:00:00 170 1500
437 2009-01-03T00:00:00 160 1800
.. ... ... ... ...
I know how to add rows once the name column changes, but I need to automate the process of adding the 'Name-#' field into the Datetime column, so that different data of the same style can be put through the code. Any help would be much appreciated.
Thanks!
I think what you are after is groupby
df.groupby('name')
so you could do
for name, dfsub in df.groupby('name'):
    ...
This would allow you to work on each group individually
An example
import pandas as pd

df = pd.DataFrame({
    'Name': ['a', 'a', 'a', 'b', 'b', 'b', 'b', 'c', 'c', 'd', 'd', 'd'],
    'B': [5, 5, 6, 7, 5, 6, 6, 7, 7, 6, 7, 7],
    'C': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
})
giving a dataframe
Name B C
0 a 5 1
1 a 5 1
2 a 6 1
3 b 7 1
4 b 5 1
5 b 6 1
6 b 6 1
7 c 7 1
8 c 7 1
9 d 6 1
10 d 7 1
11 d 7 1
Now we can just look at the output of a groupby. Iterating over a groupby yields two things: first the group name, then the subset of the dataframe containing that group's data.
for name, dfsub in df.groupby('Name'):
    print("Name is :" + name)
    dfsub1 = dfsub.drop('Name', axis=1)
    print(dfsub1)
    print()  # blank line for clarity
and this gives
Name is :a
B C
0 5 1
1 5 1
2 6 1
Name is :b
B C
3 7 1
4 5 1
5 6 1
6 6 1
Name is :c
B C
7 7 1
8 7 1
Name is :d
B C
9 6 1
10 7 1
11 7 1
where you get the name you are dealing with, then the dataframe dfsub that contains just the data that you are looking at.
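If the goal is the single stitched-together frame the asker showed, with a name-only separator row before each block, one possible assembly is the sketch below (using the toy df built above; parking the name in the first remaining column is an assumption):
pieces = []
for name, dfsub in df.groupby('Name'):
    pieces.append(pd.DataFrame({'B': [name]}))  # separator row holding only the name
    pieces.append(dfsub.drop('Name', axis=1))   # the group's data without the name column
result = pd.concat(pieces, ignore_index=True)   # missing cells in separator rows become NaN
print(result)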

Year On Year Growth Using Pandas - Traverse N rows Back

I have a lot of parameters on which I have to calculate the year on year growth.
Type 2006-Q1 2006-Q2 2006-Q3 2006-Q4 2007-Q1 2007-Q2 2007-Q3 2007-Q4 2008-Q1 2008-Q2 2008-Q3 2008-Q4
MonMkt_IntRt 3.44 3.60 3.99 4.40 4.61 4.73 5.11 4.97 4.92 4.89 5.29 4.51
RtlVol 97.08 97.94 98.25 99.15 99.63 100.29 100.71 101.18 102.04 101.56 101.05 99.49
IntRt 4.44 5.60 6.99 7.40 8.61 9.73 9.11 9.97 9.92 9.89 7.29 9.51
GMR 9.08 9.94 9.25 9.15 9.63 10.29 10.71 10.18 10.04 10.56 10.05 9.49
I need to calculate the growth, i.e. in column 2007-Q1 I need to find the growth from 2006-Q1. The formula is (2007-Q1 / 2006-Q1) - 1.
I have gone through the link below and tried to code
Calculating year over year growth by group in Pandas
df = pd.read_csv('c:/Econometric/EconoModel.csv')
df.set_index('Type', inplace=True)
df.sort_index(axis=1, inplace=True)
df_t = df.T
df_output = (df_t / df_t.shift(4)) - 1
The output is as below
Type          2006-Q1 2006-Q2 2006-Q3 2006-Q4 2007-Q1 2007-Q2 2007-Q3 2007-Q4 2008-Q1 2008-Q2 2008-Q3 2008-Q4
MonMkt_IntRt      NaN     NaN     NaN     NaN  0.3398  0.3159  0.2806  0.1285  0.0661  0.0340  0.0363 -0.0912
RtlVol            NaN     NaN     NaN     NaN  0.0261  0.0240  0.0249  0.0204  0.0242  0.0126  0.0033 -0.0166
IntRt             NaN     NaN     NaN     NaN  0.6666  0.5375  0.3919  0.2310  0.1579  0.0195  0.0856 -0.2688
GMR               NaN     NaN     NaN     NaN  0.0077 -0.0310  0.1124  0.1704  0.0571 -0.0240 -0.0140 -0.0127
Use iloc to shift data slices. See an example on a test df (the demo below shifts 3 columns back; the real quarterly case would shift 4).
df= pd.DataFrame({i:[0+i,1+i,2+i] for i in range(0,12)})
print(df)
0 1 2 3 4 5 6 7 8 9 10 11
0 0 1 2 3 4 5 6 7 8 9 10 11
1 1 2 3 4 5 6 7 8 9 10 11 12
2 2 3 4 5 6 7 8 9 10 11 12 13
df.iloc[:,list(range(3,12))] = df.iloc[:,list(range(3,12))].values/ df.iloc[:,list(range(0,9))].values - 1
print(df)
   0  1  2    3    4     5     6     7         8         9        10        11
0  0  1  2  inf  3.0  1.50  1.00  0.75  0.600000  0.500000  0.428571  0.375000
1  1  2  3  3.0  1.5  1.00  0.75  0.60  0.500000  0.428571  0.375000  0.333333
2  2  3  4  1.5  1.0  0.75  0.60  0.50  0.428571  0.375000  0.333333  0.300000
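Applied to the quarterly question, the same slicing idea might look like this (a sketch assuming Type is already the index, columns are sorted chronologically, and the year-on-year shift is 4 columns):
df.iloc[:, 4:] = df.iloc[:, 4:].values / df.iloc[:, :-4].values - 1
The first four quarter columns keep their raw values, since there is no prior year to compare against.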
I could not find any issue with your code.
I simply added axis=1 to the dataframe.shift() method, since you are doing a column-wise comparison.
I have executed the following code and it gives the result you expected.
def getSampleDataframe():
    df_economy_model = pd.DataFrame({
        'Type': ['MonMkt_IntRt', 'RtlVol', 'IntRt', 'GMR'],
        '2006-Q1': [3.44, 97.08, 4.44, 9.08],
        '2006-Q2': [3.6, 97.94, 5.6, 9.94],
        '2006-Q3': [3.99, 98.25, 6.99, 9.25],
        '2006-Q4': [4.4, 99.15, 7.4, 9.15],
        '2007-Q1': [4.61, 99.63, 8.61, 9.63],
        '2007-Q2': [4.73, 100.29, 9.73, 10.29],
        '2007-Q3': [5.11, 100.71, 9.11, 10.71],
        '2007-Q4': [4.97, 101.18, 9.97, 10.18],
        '2008-Q1': [4.92, 102.04, 9.92, 10.04],
        '2008-Q2': [4.89, 101.56, 9.89, 10.56],
        '2008-Q3': [5.29, 101.05, 7.29, 10.05],
        '2008-Q4': [4.51, 99.49, 9.51, 9.49]
    })  # Your data
    return df_economy_model
df_cd_americas = getSampleDataframe()
df_cd_americas.set_index('Type', inplace=True)
df_yearly_growth = (df_cd_americas / df_cd_americas.shift(4, axis=1)) - 1
print(df_cd_americas)
print(df_yearly_growth)
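For what it's worth, this shift-and-divide pattern is what pct_change wraps, so the growth line should be expressible in one call (assuming a pandas version that passes axis through to shift):
df_yearly_growth = df_cd_americas.pct_change(periods=4, axis=1)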

Iterate over rows in a data frame create a new column then adding more columns based on the new column

I have a data frame as below:
Date Quantity
2019-04-25 100
2019-04-26 148
2019-04-27 124
The output I need takes the quantity difference between two consecutive dates, spreads it evenly over 24 hours, and creates 23 columns, each adding one more hourly increment to the previous column, such as below:
Date Quantity Hour-1 Hour-2 ....Hour-23
2019-04-25 100 102 104 .... 146
2019-04-26 148 147 146 .... 123
2019-04-27 124
I'm trying to iterate over a loop but it's not working; my code is below:
for i in df.index:
    diff = (df.get_value(i+1, 'Quantity') - df.get_value(i, 'Quantity')) / 24
    for j in range(24):
        df[i, [1+j]] = df[i, [j]] * (1 + diff)
I did some research but I have not found how to create columns like above iteratively. I hope you could help me. Thank you in advance.
IIUC, using resample and interpolate, then pivoting the output:
s = df.set_index('Date').resample('1 H').interpolate()
s = pd.pivot_table(s, index=s.index.date, columns=s.groupby(s.index.date).cumcount(), values=['Quantity'], aggfunc='mean')
s.columns = s.columns.droplevel(0)
s
Out[93]:
0 1 2 3 ... 20 21 22 23
2019-04-25 100.0 102.0 104.0 106.0 ... 140.0 142.0 144.0 146.0
2019-04-26 148.0 147.0 146.0 145.0 ... 128.0 127.0 126.0 125.0
2019-04-27 124.0 NaN NaN NaN ... NaN NaN NaN NaN
[3 rows x 24 columns]
If I have understood the question correctly, here is a for-loop approach:
list_of_values = []
for i, row in df.iterrows():
    if i < len(df) - 1:  # every row except the last has a next-day value
        qty = row['Quantity']
        qty_2 = df.at[i+1, 'Quantity']
        diff = (qty_2 - qty) / 24
        list_of_values.append(diff)
    else:
        list_of_values.append(0)
df['diff'] = list_of_values
df['diff'] = list_of_values
Output:
Date Quantity diff
2019-04-25 100 2
2019-04-26 148 -1
2019-04-27 124 0
Now create the columns required, i.e.
df['Hour-1'] = df['Quantity'] + df['diff']
df['Hour-2'] = df['Quantity'] + 2*df['diff']
...
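Since the pattern is uniform, the 23 columns can be generated in a short loop (a sketch building on the diff column created above):
for h in range(1, 24):
    df['Hour-' + str(h)] = df['Quantity'] + h * df['diff']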
There are other approaches which will work way better.

How to expand Python Pandas Dataframe in linearly spaced increments

Beginner question:
I have a pandas dataframe that looks like this:
x1 y1 x2 y2
0 0 2 2
10 10 12 12
and I want to expand that dataframe by half units along the x and y coordinates to look like this:
x1 y1 x2 y2 Interpolated_X Interpolated_Y
0 0 2 2 0 0
0 0 2 2 0.5 0.5
0 0 2 2 1 1
0 0 2 2 1.5 1.5
0 0 2 2 2 2
10 10 12 12 10 10
10 10 12 12 10.5 10.5
10 10 12 12 11 11
10 10 12 12 11.5 11.5
10 10 12 12 12 12
Any help would be much appreciated.
The cleanest way I know to expand rows like this is through groupby.apply. It may be faster to use something like itertuples, but the code will be a little more complicated (keep that in mind if your data set is larger).
groupby the index, which will send each row to my apply function (your index has to be unique for each row; if it's not, just run reset_index). I can return a DataFrame from my apply, therefore we can expand from one row to multiple rows.
Caveat: your x2-x1 and y2-y1 distances must be the same or this won't work.
import pandas as pd
import numpy as np

def expand(row):
    row = row.iloc[0]  # groupby passes a dataframe, so this gets a reference to its first and only row
    xsteps = np.arange(row.x1, row.x2 + .5, .5)  # create step arrays
    ysteps = np.arange(row.y1, row.y2 + .5, .5)
    return (pd.DataFrame([row] * len(xsteps))  # lists expand by multiplying: [val] * 3 = [val, val, val]
            .assign(int_x=xsteps, int_y=ysteps))

(df.groupby(df.index)                  # "group" on each row
   .apply(expand)                      # send each row to the expand function
   .reset_index(level=1, drop=True))   # groupby gives us an extra index level we don't want
starting df
x1 y1 x2 y2
0 0 2 2
10 10 12 12
ending df
x1 y1 x2 y2 int_x int_y
0 0 0 2 2 0.0 0.0
0 0 0 2 2 0.5 0.5
0 0 0 2 2 1.0 1.0
0 0 0 2 2 1.5 1.5
0 0 0 2 2 2.0 2.0
1 10 10 12 12 10.0 10.0
1 10 10 12 12 10.5 10.5
1 10 10 12 12 11.0 11.0
1 10 10 12 12 11.5 11.5
1 10 10 12 12 12.0 12.0
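A possible loop-free alternative (a sketch with the same caveat as above: equal x and y spans and a fixed 0.5 step; index.repeat duplicates each row by its number of interpolation points):
import numpy as np
import pandas as pd

steps = ((df.x2 - df.x1) / 0.5).astype(int) + 1              # interpolation points per row
out = df.loc[df.index.repeat(steps)].reset_index(drop=True)  # duplicate each row steps times
out['Interpolated_X'] = np.concatenate(
    [np.arange(a, b + 0.25, 0.5) for a, b in zip(df.x1, df.x2)])
out['Interpolated_Y'] = np.concatenate(
    [np.arange(a, b + 0.25, 0.5) for a, b in zip(df.y1, df.y2)])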

Cannot assign row in pandas.Dataframe

I am trying to calculate the mean of the rows of a DataFrame which have the same value on a specified column col. However I'm stuck at assigning a row of the pandas DataFrame.
Here's my code:
def code(data, col):
    """Finds the average of all rows that share the same value in column col.
    Returns a new pandas.DataFrame with the data.
    """
    values = pd.unique(data[col])
    rows = len(values)
    res = pd.DataFrame(np.zeros(shape=(rows, len(data.columns))), columns=data.columns)
    for i, v in enumerate(values):
        e = data[data[col] == v].mean().to_frame().transpose()
        res[i:i+1] = e
    return res
The problem is that the code only works for the first row and puts NaN values in the next rows. I have checked the value of e and confirmed it is good, so there is a problem with the assignment res[i:i+1] = e. I have also tried res.iloc[i] = e but I get "ValueError: Incompatible indexer with Series". Is there an alternate way to do this? It seems very straightforward and I'm baffled why it doesn't work...
E.g:
wdata
Out[78]:
Die Subsite Algorithm Vt1 It1 Ignd
0 1 0 0 0.0 -2.320000e-07 -4.862400e-08
1 1 0 0 0.1 -1.000000e-04 1.000000e-04
2 1 0 0 0.2 -1.000000e-03 1.000000e-03
3 1 0 0 0.3 -1.000000e-02 1.000000e-02
4 1 1 1 0.0 3.554000e-07 -2.012000e-07
5 1 2 2 0.0 5.353000e-08 -1.684000e-07
6 1 3 3 0.0 9.369400e-08 -2.121400e-08
7 1 4 4 0.0 3.286200e-08 -2.093600e-08
8 1 5 5 0.0 8.978600e-08 -3.262000e-07
9 1 6 6 0.0 3.624800e-08 -2.507600e-08
10 1 7 7 0.0 2.957000e-08 -1.993200e-08
11 1 8 8 0.0 7.732600e-08 -3.773200e-08
12 1 9 9 0.0 9.300000e-08 -3.521200e-08
13 1 10 10 0.0 8.468000e-09 -6.990000e-09
14 1 11 11 0.0 1.434200e-11 -1.200000e-11
15 2 0 0 0.0 8.118000e-11 -5.254000e-11
16 2 1 1 0.0 9.322000e-11 -1.359200e-10
17 2 2 2 0.0 1.944000e-10 -2.409400e-10
18 2 3 3 0.0 7.756000e-11 -8.556000e-11
19 2 4 4 0.0 1.260000e-11 -8.618000e-12
20 2 5 5 0.0 7.122000e-12 -1.402000e-13
21 2 6 6 0.0 6.224000e-11 -2.760000e-11
22 2 7 7 0.0 1.133400e-08 -6.566000e-09
23 2 8 8 0.0 6.600000e-13 -1.808000e-11
24 2 9 9 0.0 6.861000e-08 -4.063400e-08
25 2 10 10 0.0 2.743800e-10 -1.336000e-10
Expected output:
Die Subsite Algorithm Vt1 It1 Ignd
0 1 4.4 4.4 0.04 -0.00074 0.00074
0 2 5.5 5.5 0 6.792247e-09 -4.023330e-09
Instead, what i get is:
Die Subsite Algorithm Vt1 It1 Ignd
0 1 4.4 4.4 0.04 -0.00074 0.00074
0 NaN NaN NaN NaN NaN NaN
For example, this code results in:
In[81]: wdata[wdata['Die'] == 2].mean().to_frame().transpose()
Out[81]:
Die Subsite Algorithm Vt1 It1 Ignd
0 2 5.5 5.5 0 6.792247e-09 -4.023330e-09
For me this works (your assignment fails because pandas aligns on the index: e keeps index 0, so for i > 0 nothing lines up and you get NaN):
def code(data, col):
    """Finds the average of all rows that share the same value in column col.
    Returns a new pandas.DataFrame with the data.
    """
    values = pd.unique(data[col])
    res = pd.DataFrame(columns=data.columns)
    for i, v in enumerate(values):
        e = data[data[col] == v].mean()
        res.loc[i, :] = e
    return res
col = 'Die'
print (code(data, col))
Die Subsite Algorithm Vt1 It1 Ignd
0 1 4.4 4.4 0.04 -0.000739957 0.000739939
1 2 5 5 0 7.34067e-09 -4.35482e-09
but groupby with aggregate mean gives the same output:
print (data.groupby(col, as_index=False).mean())
Die Subsite Algorithm Vt1 It1 Ignd
0 1 4.4 4.4 0.04 -7.399575e-04 7.399392e-04
1 2 5.0 5.0 0.00 7.340669e-09 -4.354818e-09
A few minutes after I posted the question, I solved it by adding .values to e:
e = data[data[col] == v].mean().to_frame().transpose().values
However, it turns out that what I wanted to do is already done by pandas. Thanks MaxU!
df.groupby(col).mean()
