Sum of last values of a dataframe column - python-3.x

I would like to create a new column whose values are the sum of the last 14 values of column atr1. How can I do it?
I tried
col = a.columns.get_loc('atr1')
a['atrsum'] = a.iloc[-14:,col].sum()
But I get only a single fixed value repeated down the new column. The dataframe is below for reference.
time open high low close volume atr1
0 1620518400000 1.6206 1.8330 1.5726 1.7663 8.830913e+08 NaN
1 1620604800000 1.7662 1.8243 1.5170 1.6423 7.123049e+08 0.3073
2 1620691200000 1.6418 1.7791 1.5954 1.7632 5.243267e+08 0.1837
3 1620777600000 1.7633 1.8210 1.5462 1.5694 5.997101e+08 0.2748
4 1620864000000 1.5669 1.9719 1.5000 1.9296 1.567655e+09 0.4719
... ... ... ... ... ... ... ...
360 1651622400000 0.7712 0.8992 0.7677 0.8985 2.566498e+08 0.1315
361 1651708800000 0.8986 0.9058 0.7716 0.7884 3.649706e+08 0.1342
362 1651795200000 0.7884 0.7997 0.7625 0.7832 2.440587e+08 0.0372
363 1651881600000 0.7832 0.7858 0.7467 0.7604 1.268089e+08 0.0391
364 1651968000000 0.7605 0.7663 0.7254 0.7403 1.751395e+08 0.0409

Use the expanding function of pandas. 14 is the minimum number of periods (days), followed by the column you would like to sum:
a.expanding(14)['atr1'].sum()
I must be missing something in the question, my apologies.
I just used the data you shared, with the 2 previous days, and this is the result:
df['atrsum'] = df['atr1'].expanding(2).sum()
id time open high low close volume atr1 atrsum
0 0 1620518400000 1.6206 1.8330 1.5726 1.7663 8.830913e+08 NaN NaN
1 1 1620604800000 1.7662 1.8243 1.5170 1.6423 7.123049e+08 0.3073 NaN
2 2 1620691200000 1.6418 1.7791 1.5954 1.7632 5.243267e+08 0.1837 0.4910
3 3 1620777600000 1.7633 1.8210 1.5462 1.5694 5.997101e+08 0.2748 0.7658
4 4 1620864000000 1.5669 1.9719 1.5000 1.9296 1.567655e+09 0.4719 1.2377
5 360 1651622400000 0.7712 0.8992 0.7677 0.8985 2.566498e+08 0.1315 1.3692
6 361 1651708800000 0.8986 0.9058 0.7716 0.7884 3.649706e+08 0.1342 1.5034
7 362 1651795200000 0.7884 0.7997 0.7625 0.7832 2.440587e+08 0.0372 1.5406
8 363 1651881600000 0.7832 0.7858 0.7467 0.7604 1.268089e+08 0.0391 1.5797
9 364 1651968000000 0.7605 0.7663 0.7254 0.7403 1.751395e+08 0.0409 1.6206
Result with rolling sum:
df['atrsum'] = df['atr1'].rolling(2).sum()
id time open high low close volume atr1 atrsum
0 0 1620518400000 1.6206 1.8330 1.5726 1.7663 8.830913e+08 NaN NaN
1 1 1620604800000 1.7662 1.8243 1.5170 1.6423 7.123049e+08 0.3073 NaN
2 2 1620691200000 1.6418 1.7791 1.5954 1.7632 5.243267e+08 0.1837 0.4910
3 3 1620777600000 1.7633 1.8210 1.5462 1.5694 5.997101e+08 0.2748 0.4585
4 4 1620864000000 1.5669 1.9719 1.5000 1.9296 1.567655e+09 0.4719 0.7467
5 360 1651622400000 0.7712 0.8992 0.7677 0.8985 2.566498e+08 0.1315 0.6034
6 361 1651708800000 0.8986 0.9058 0.7716 0.7884 3.649706e+08 0.1342 0.2657
7 362 1651795200000 0.7884 0.7997 0.7625 0.7832 2.440587e+08 0.0372 0.1714
8 363 1651881600000 0.7832 0.7858 0.7467 0.7604 1.268089e+08 0.0391 0.0763
9 364 1651968000000 0.7605 0.7663 0.7254 0.7403 1.751395e+08 0.0409 0.0800

At least for me, the answer from Naveed did not work, but I found a different way:
a['atrsum'] = a['atr1'].rolling(window=14).apply(sum).dropna()
This gives me the result.
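Equivalently, `rolling(...).sum()` computes the trailing-window sum directly, without a Python-level `apply`; a minimal sketch with a window of 3 and hypothetical values for brevity:

```python
import pandas as pd

# Toy stand-in for the atr1 column (hypothetical values)
a = pd.DataFrame({"atr1": [0.1, 0.2, 0.3, 0.4, 0.5]})

# Trailing sum over the last 3 rows; the first 2 rows have no full window
a["atrsum"] = a["atr1"].rolling(window=3).sum()
```

With window=14 on the real data, the first 13 rows of atrsum are NaN by design.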

Related

winsorize does not affect the outlier

I have this set of data in a DataFrame :
        data  winsor_data
0       1660         1660
1        600          600
2         50           50
3    3173.55      3173.55
4         30           30
5        120          120
6       7.84         7.84
7       1660         1660
8       33.3         33.3
9    2069.49      2069.49
10        42           42
11    384.29       384.29
12      1660         1660
13   1338.57      1338.57
14    200000       200000
15      1760         1760
The value at index 14 (200000) is clearly an outlier.
from scipy.stats.mstats import winsorize
dfdailyIncome['winsor_data'] = winsorize(df['data'], limits=(0,0.95))
I do not understand why the outlier is not clipped. Maybe it has something to do with the way the quantiles are calculated.
I think you are misinterpreting the limits parameter.
If you want to cut the top 10 percent of your values, you need:
dfdailyIncome['winsor_data'] = winsorize(df['data'], limits=[0, 0.1])
In your example you cut the top 95 percent of your data.
Hint: even if you used winsorize(df['data'], limits=[0, 0.05]), your data would stay the same, because 5 percent of fewer than 20 values rounds down to zero values.
See the example from here for further explanation: scipy.stats.mstats.winsorize
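As a sketch of the corrected call on the data from the question: with 16 values, `limits=[0, 0.1]` clips `int(16 * 0.1) = 1` value from the top, so the 200000 outlier is replaced by the next-largest value:

```python
import numpy as np
from scipy.stats.mstats import winsorize

data = np.array([1660, 600, 50, 3173.55, 30, 120, 7.84, 1660,
                 33.3, 2069.49, 42, 384.29, 1660, 1338.57, 200000, 1760])

# Clip the top 10% of values: int(16 * 0.1) = 1 value is replaced
# with the next-largest value in the array
w = winsorize(data, limits=[0, 0.1])

# With limits=[0, 0.05], int(16 * 0.05) = 0 values would be clipped,
# so the data would stay unchanged, as noted in the hint above
```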

for loop incorrectly plotting boxplot on same figure

I have the code below, where I'm trying to create a separate figure with a boxplot for each column in the list. When I run this code, it plots all three boxplots on top of each other in the same figure. If I instead change it to a histogram, it works perfectly, creating a separate figure for each plot. Can someone please let me know how to fix this? I've also included some sample data below.
Code:
for i in ['Fresh', 'Milk', 'Grocery']:
    data_df.boxplot(column=i)
Data:
print(data_df[:10])
Channel Region Fresh Milk Grocery Frozen Detergents_Paper \
0 2 3 12669 9656 7561 214 2674
1 2 3 7057 9810 9568 1762 3293
2 2 3 6353 8808 7684 2405 3516
3 1 3 13265 1196 4221 6404 507
4 2 3 22615 5410 7198 3915 1777
5 2 3 9413 8259 5126 666 1795
6 2 3 12126 3199 6975 480 3140
7 2 3 7579 4956 9426 1669 3321
8 1 3 5963 3648 6192 425 1716
9 2 3 6006 11093 18881 1159 7425
Delicatessen
0 1338
1 1776
2 7844
3 1788
4 5185
5 1451
6 545
7 2566
8 750
9 2098
You can try this (DataFrame.boxplot draws on the current axes, which is why the loop keeps drawing into one figure):
import matplotlib.pyplot as plt
data_df[['Fresh','Milk','Grocery']].plot.box(subplots=True)
plt.tight_layout()
Output:
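If three truly separate figures (rather than subplots in one figure) are the goal, another option is to open a new figure on each pass of the loop; a sketch with a few rows of the question's sample data:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for this sketch
import matplotlib.pyplot as plt
import pandas as pd

data_df = pd.DataFrame({"Fresh": [12669, 7057, 6353],
                        "Milk": [9656, 9810, 8808],
                        "Grocery": [7561, 9568, 7684]})

for col in ["Fresh", "Milk", "Grocery"]:
    plt.figure()  # start a fresh figure so the boxplots don't stack
    data_df.boxplot(column=col)
```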

Year On Year Growth Using Pandas - Traverse N rows Back

I have a lot of parameters for which I have to calculate the year-on-year growth.
Type 2006-Q1 2006-Q2 2006-Q3 2006-Q4 2007-Q1 2007-Q2 2007-Q3 2007-Q4 2008-Q1 2008-Q2 2008-Q3 2008-Q4
MonMkt_IntRt 3.44 3.60 3.99 4.40 4.61 4.73 5.11 4.97 4.92 4.89 5.29 4.51
RtlVol 97.08 97.94 98.25 99.15 99.63 100.29 100.71 101.18 102.04 101.56 101.05 99.49
IntRt 4.44 5.60 6.99 7.40 8.61 9.73 9.11 9.97 9.92 9.89 7.29 9.51
GMR 9.08 9.94 9.25 9.15 9.63 10.29 10.71 10.18 10.04 10.56 10.05 9.49
I need to calculate the growth, i.e. in column 2007-Q1 I need to find the growth from 2006-Q1. The formula is (2007-Q1 / 2006-Q1) - 1.
I have gone through the link below and tried to write the code:
Calculating year over year growth by group in Pandas
df = pd.read_csv('c:/Econometric/EconoModel.csv')
df.set_index('Type', inplace=True)
df.sort_index(axis=1, inplace=True)
df_t = df.T
df_output = (df_t / df_t.shift(4)) - 1
The output is as below
Type          2006-Q1  2006-Q2  2006-Q3  2006-Q4  2007-Q1  2007-Q2  2007-Q3  2007-Q4  2008-Q1  2008-Q2  2008-Q3  2008-Q4
MonMkt_IntRt      NaN      NaN      NaN      NaN   0.3398   0.3159   0.2806   0.1285   0.0661   0.0340   0.0363  -0.0912
RtlVol            NaN      NaN      NaN      NaN   0.0261   0.0240   0.0249   0.0204   0.0242   0.0126   0.0033  -0.0166
IntRt             NaN      NaN      NaN      NaN   0.6666   0.5375   0.3919   0.2310   0.1579   0.0195   0.0856  -0.2688
GMR               NaN      NaN      NaN      NaN   0.0077  -0.031    0.1124   0.1704   0.0571  -0.024   -0.014   -0.0127
Use iloc to shift the data slices. See an example on a test dataframe:
df= pd.DataFrame({i:[0+i,1+i,2+i] for i in range(0,12)})
print(df)
0 1 2 3 4 5 6 7 8 9 10 11
0 0 1 2 3 4 5 6 7 8 9 10 11
1 1 2 3 4 5 6 7 8 9 10 11 12
2 2 3 4 5 6 7 8 9 10 11 12 13
df.iloc[:,list(range(3,12))] = df.iloc[:,list(range(3,12))].values/ df.iloc[:,list(range(0,9))].values - 1
print(df)
   0  1  2    3    4     5     6     7         8         9        10        11
0  0  1  2  inf  3.0  1.50  1.00  0.75  0.600000  0.500000  0.428571  0.375000
1  1  2  3  3.0  1.5  1.00  0.75  0.60  0.500000  0.428571  0.375000  0.333333
2  2  3  4  1.5  1.0  0.75  0.60  0.50  0.428571  0.375000  0.333333  0.300000
I could not find any issue with your code.
Simply add axis=1 to the dataframe.shift() method, since you are comparing across columns.
I have executed the following code and it gives the result you expected.
def getSampleDataframe():
    df_economy_model = pd.DataFrame(
        {
            'Type': ['MonMkt_IntRt', 'RtlVol', 'IntRt', 'GMR'],
            '2006-Q1': [3.44, 97.08, 4.44, 9.08],
            '2006-Q2': [3.6, 97.94, 5.6, 9.94],
            '2006-Q3': [3.99, 98.25, 6.99, 9.25],
            '2006-Q4': [4.4, 99.15, 7.4, 9.15],
            '2007-Q1': [4.61, 99.63, 8.61, 9.63],
            '2007-Q2': [4.73, 100.29, 9.73, 10.29],
            '2007-Q3': [5.11, 100.71, 9.11, 10.71],
            '2007-Q4': [4.97, 101.18, 9.97, 10.18],
            '2008-Q1': [4.92, 102.04, 9.92, 10.04],
            '2008-Q2': [4.89, 101.56, 9.89, 10.56],
            '2008-Q3': [5.29, 101.05, 7.29, 10.05],
            '2008-Q4': [4.51, 99.49, 9.51, 9.49]
        })  # Your data
    return df_economy_model

df_cd_americas = getSampleDataframe()
df_cd_americas.set_index('Type', inplace=True)
df_yearly_growth = (df_cd_americas / df_cd_americas.shift(4, axis=1)) - 1
print(df_cd_americas)
print(df_yearly_growth)
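For reference, pandas' built-in `pct_change` with `periods=4` along the columns computes the same `(value / value four columns back) - 1` in one call; a sketch on one row of the question's data:

```python
import pandas as pd

df = pd.DataFrame([[3.44, 3.60, 3.99, 4.40, 4.61, 4.73]],
                  columns=["2006-Q1", "2006-Q2", "2006-Q3",
                           "2006-Q4", "2007-Q1", "2007-Q2"],
                  index=["MonMkt_IntRt"])

# Growth versus the same quarter one year (4 columns) earlier;
# the first four quarters have no prior-year value and come out NaN
growth = df.pct_change(periods=4, axis="columns")
```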

Iterate over rows in a data frame, create a new column, then add more columns based on the new column

I have a data frame as below:
Date Quantity
2019-04-25 100
2019-04-26 148
2019-04-27 124
The output that I need takes the quantity difference between two consecutive dates, averaged over 24 hours, and creates 23 columns in which the hourly quantity difference is added to the previous column, such as below:
Date Quantity Hour-1 Hour-2 ....Hour-23
2019-04-25 100 102 104 .... 146
2019-04-26 148 147 146 .... 123
2019-04-27 124
I'm trying to iterate in a loop, but it's not working. My code is below:
for i in df.index:
    diff = (df.get_value(i+1,'Quantity') - df.get_value(i,'Quantity'))/24
    for j in range(24):
        df[i,[1+j]] = df.[i,[j]]*(1+diff)
I did some research but have not found how to create columns like the above iteratively. I hope you can help me. Thank you in advance.
IIUC, using resample and interpolate, then pivoting the output:
s = df.set_index('Date').resample('1H').interpolate()
s = pd.pivot_table(s, index=s.index.date, columns=s.groupby(s.index.date).cumcount(), values=['Quantity'], aggfunc='mean')
s.columns = s.columns.droplevel(0)
s
Out[93]:
0 1 2 3 ... 20 21 22 23
2019-04-25 100.0 102.0 104.0 106.0 ... 140.0 142.0 144.0 146.0
2019-04-26 148.0 147.0 146.0 145.0 ... 128.0 127.0 126.0 125.0
2019-04-27 124.0 NaN NaN NaN ... NaN NaN NaN NaN
[3 rows x 24 columns]
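A self-contained sketch of just the resample-and-interpolate step on the question's data, to show where the hourly values come from (using a Timedelta rule to sidestep the 'H'/'h' alias differences between pandas versions):

```python
import pandas as pd

df = pd.DataFrame({"Date": pd.to_datetime(["2019-04-25", "2019-04-26",
                                           "2019-04-27"]),
                   "Quantity": [100, 148, 124]})

# Upsample to one row per hour and fill the gaps linearly:
# each hour on the first day adds (148 - 100) / 24 = 2
s = df.set_index("Date").resample(pd.Timedelta(hours=1)).interpolate()
```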
If I have understood the question correctly, a for-loop approach:
list_of_values = []
for i, row in df.iterrows():
    if i < len(df) - 1:
        qty = row['Quantity']
        qty_2 = df.at[i+1, 'Quantity']
        diff = (qty_2 - qty)/24
        list_of_values.append(diff)
    else:
        list_of_values.append(0)
df['diff'] = list_of_values
Output:
Date Quantity diff
2019-04-25 100 2
2019-04-26 148 -1
2019-04-27 124 0
Now create the required columns, i.e.
df['Hour-1'] = df['Quantity'] + df['diff']
df['Hour-2'] = df['Quantity'] + 2*df['diff']
...and so on up to Hour-23.
There are other approaches that will work much better.
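The hand-written `Hour-1`, `Hour-2`, … assignments can be generated in a short loop; a sketch on a toy frame with the `diff` column already computed as above:

```python
import pandas as pd

# Toy frame mirroring the intermediate result with the diff column
df = pd.DataFrame({"Date": ["2019-04-25", "2019-04-26", "2019-04-27"],
                   "Quantity": [100, 148, 124],
                   "diff": [2, -1, 0]})

# Each hourly column adds one more step of the per-hour difference
for h in range(1, 24):
    df[f"Hour-{h}"] = df["Quantity"] + h * df["diff"]
```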

Apply Python lambda : if condition giving syntax error

This is my data set
fake_abalone2
  Sex  Length  Diameter  Height  Whole Weight  Shucked Weight  Viscera Weight  Shell Weight  Rings
0 M 0.455 0.365 0.095 0.5140 0.2245 0.1010 0.1500 15
1 M 0.350 0.265 0.090 0.2255 0.0995 0.0485 0.0700 7
2 F 0.530 0.420 0.135 0.6770 0.2565 0.1415 0.2100 9
3 M 0.440 0.365 0.125 0.5160 0.2155 0.1140 0.1550 10
4 K 0.330 0.255 0.080 0.2050 0.0895 0.0395 0.0550 7
5 K 0.425 0.300 0.095 0.3515 0.1410 0.0775 0.1200 8
I am getting a syntax error while using the following method; please help me out.
I want the value in the "Sex" column to change depending on the "Rings" column. If the "Rings" value is less than 10, the corresponding "Sex" value should be changed to 'K'; otherwise, no change should be made to the "Sex" column.
fake_abalone2["sex"]=fake_abalone2["Rings"].apply(lambda x:"K" if x<10)
File "", line 1
fake_abalone2["sex"]=fake_abalone2["Rings"].apply(lambda x:"K" if x<10)
SyntaxError: invalid syntax
The following method works perfectly:
df1["Sex"] = df1.apply(lambda x: "K" if x.Rings < 10 else x["Sex"], axis=1)
df1 is the dataframe
  Sex  Length  Diameter  Height  Whole weight  Shucked weight  Viscera weight  Shell weight  Rings
0 M 0.455 0.365 0.095 0.5140 0.2245 0.1010 0.1500 15
1 K 0.350 0.265 0.090 0.2255 0.0995 0.0485 0.0700 7
2 K 0.530 0.420 0.135 0.6770 0.2565 0.1415 0.2100 9
3 M 0.440 0.365 0.125 0.5160 0.2155 0.1140 0.1550 10
4 K 0.330 0.255 0.080 0.2050 0.0895 0.0395 0.0550 7
5 K 0.425 0.300 0.095 0.3515 0.1410 0.0775 0.1200 8
6 F 0.530 0.415 0.150 0.7775 0.2370 0.1415 0.3300 20
You can use numpy instead of a lambda function.
Import numpy with import numpy as np;
then you can use the following to replace the values:
fake_abalone2['Sex'] = np.where(fake_abalone2['Rings']<10, 'K', fake_abalone2['Sex'])
The main problem is the output of the lambda function:
.apply(lambda x: "K" if x < 10)
A conditional expression must also produce a value when the condition is false, so you need an else branch:
.apply(lambda x: "K" if x < 10 else None)
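A third option, avoiding both `apply` and numpy: boolean-mask assignment with `.loc`, which overwrites only the matching rows and leaves the rest untouched; a sketch on toy data:

```python
import pandas as pd

df1 = pd.DataFrame({"Sex": ["M", "M", "F", "M"],
                    "Rings": [15, 7, 9, 10]})

# Rows with Rings < 10 get Sex = 'K'; all other rows keep their value
df1.loc[df1["Rings"] < 10, "Sex"] = "K"
```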
