Year On Year Growth Using Pandas - Traverse N rows Back

I have a lot of parameters for which I have to calculate year-on-year growth.
Type 2006-Q1 2006-Q2 2006-Q3 2006-Q4 2007-Q1 2007-Q2 2007-Q3 2007-Q4 2008-Q1 2008-Q2 2008-Q3 2008-Q4
MonMkt_IntRt 3.44 3.60 3.99 4.40 4.61 4.73 5.11 4.97 4.92 4.89 5.29 4.51
RtlVol 97.08 97.94 98.25 99.15 99.63 100.29 100.71 101.18 102.04 101.56 101.05 99.49
IntRt 4.44 5.60 6.99 7.40 8.61 9.73 9.11 9.97 9.92 9.89 7.29 9.51
GMR 9.08 9.94 9.25 9.15 9.63 10.29 10.71 10.18 10.04 10.56 10.05 9.49
I need to calculate the growth, i.e. in column 2007-Q1 I need to find the growth from 2006-Q1. The formula is (2007-Q1 / 2006-Q1) - 1; for example, for MonMkt_IntRt this is 4.61/3.44 - 1 ≈ 0.34.
I have gone through the link below and tried to code it:
Calculating year over year growth by group in Pandas
df = pd.read_csv('c:/Econometric/EconoModel.csv')
df.set_index('Type',inplace=True)
df.sort_index(axis=1, inplace=True)
df_t = df.T
df_output = (df_t / df_t.shift(4)) - 1
The output is as below:
Type 2006-Q1 2006-Q2 2006-Q3 2006-Q4 2007-Q1 2007-Q2 2007-Q3 2007-Q4 2008-Q1 2008-Q2 2008-Q3 2008-Q4
MonMkt_IntRt NaN NaN NaN NaN 0.3398 0.3159 0.2806 0.1285 0.0661 0.0340 0.0363 -0.0912
RtlVol NaN NaN NaN NaN 0.0261 0.0240 0.0249 0.0204 0.0242 0.0126 0.0033 -0.0166
IntRt NaN NaN NaN NaN 0.6666 0.5375 0.3919 0.2310 0.1579 0.0195 0.0856 -0.2688
GMR NaN NaN NaN NaN 0.0077 -0.031 0.1124 0.1704 0.0571 -0.024 -0.014 -0.0127

Use iloc to shift data slices. See an example on a test df:
import pandas as pd

df = pd.DataFrame({i: [0 + i, 1 + i, 2 + i] for i in range(12)})
print(df)
0 1 2 3 4 5 6 7 8 9 10 11
0 0 1 2 3 4 5 6 7 8 9 10 11
1 1 2 3 4 5 6 7 8 9 10 11 12
2 2 3 4 5 6 7 8 9 10 11 12 13
df.iloc[:, 3:12] = df.iloc[:, 3:12].values / df.iloc[:, 0:9].values - 1
print(df)
0 1 2 3 4 5 6 7 8 9 10 11
0 0 1 2 inf 3.0 1.50 1.00 0.75 0.600000 0.500000 0.428571 0.375000
1 1 2 3 3.0 1.5 1.00 0.75 0.60 0.500000 0.428571 0.375000 0.333333
2 2 3 4 1.5 1.0 0.75 0.60 0.50 0.428571 0.375000 0.333333 0.300000
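The inf in column 3 comes from dividing by the zero in column 0. If that is unwanted, one option (a small hedged sketch) is to convert infinities to NaN afterwards:
import numpy as np

# turn +/-inf produced by division by zero into NaN
df = df.replace([np.inf, -np.inf], np.nan)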

I could not find any issue with your code.
I simply added axis=1 to the DataFrame.shift() call, since you are comparing columns.
I executed the following code and it gives the result you expected:
import pandas as pd

def getSampleDataframe():
    df_economy_model = pd.DataFrame(
        {
            'Type': ['MonMkt_IntRt', 'RtlVol', 'IntRt', 'GMR'],
            '2006-Q1': [3.44, 97.08, 4.44, 9.08],
            '2006-Q2': [3.6, 97.94, 5.6, 9.94],
            '2006-Q3': [3.99, 98.25, 6.99, 9.25],
            '2006-Q4': [4.4, 99.15, 7.4, 9.15],
            '2007-Q1': [4.61, 99.63, 8.61, 9.63],
            '2007-Q2': [4.73, 100.29, 9.73, 10.29],
            '2007-Q3': [5.11, 100.71, 9.11, 10.71],
            '2007-Q4': [4.97, 101.18, 9.97, 10.18],
            '2008-Q1': [4.92, 102.04, 9.92, 10.04],
            '2008-Q2': [4.89, 101.56, 9.89, 10.56],
            '2008-Q3': [5.29, 101.05, 7.29, 10.05],
            '2008-Q4': [4.51, 99.49, 9.51, 9.49]
        })  # Your data
    return df_economy_model

df_cd_americas = getSampleDataframe()
df_cd_americas.set_index('Type', inplace=True)
df_yearly_growth = (df_cd_americas / df_cd_americas.shift(4, axis=1)) - 1
print(df_cd_americas)
print(df_yearly_growth)
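For reference, pandas can also express the same year-over-year formula in a single call with pct_change, which computes value / shifted - 1 directly. A hedged one-liner, assuming the Type-indexed df_cd_americas from the answer above:
# shift 4 columns back (one year of quarters) and compute the growth ratio
df_yearly_growth = df_cd_americas.pct_change(periods=4, axis=1)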

Related

Nodal/Label assignment according to the load at a terminal in pandas data frame python

Input data frame with the boarding requirement in a transport-allocation setting:
import pandas as pd
data1 = {
'N_Id': [0,1,2,2,2,2,4,5,5,5,10,10,11,11,13,13,13,13],
'N_Name': ['Destiny','N_Area','Pstation','Pstation','Pstation','Pstation','A_Area','T_depot','T_depot','T_depot','B_colony','B_colony','c_colony','c_colony','Z_colony','Z_colony','Z_colony','Z_colony'],
'Geocode': ['9.994798,0.206728','2.994798,0.206728','2.989385,0.201941','2.989385,0.201941','2.989385,0.201941','2.989385,0.201941','3.006515,0.247101','3.000568,0.256468','3.000568,0.256468','3.000568,0.256468','3.049031,0.167278','3.049031,0.167278','4.049031,0.167278','4.049031,0.167278','5.2,0.7','5.2,0.7','5.2,0.7','5.2,0.7'],
'pcode': ['None','M023','L123','M0232','L1234','K0324','M0137','M01368','M01369','K07249','M01375','F04509','F04609','F04610','F1','F2','F3','F4']
}
df1 = pd.DataFrame.from_dict(data1)
output
df1
Out[45]:
N_Id N_Name Geocode pcode
0 0 Destiny 9.994798,0.206728 None
1 1 N_Area 2.994798,0.206728 M023
2 2 Pstation 2.989385,0.201941 L123
3 2 Pstation 2.989385,0.201941 M0232
4 2 Pstation 2.989385,0.201941 L1234
5 2 Pstation 2.989385,0.201941 K0324
6 4 A_Area 3.006515,0.247101 M0137
7 5 T_depot 3.000568,0.256468 M01368
8 5 T_depot 3.000568,0.256468 M01369
9 5 T_depot 3.000568,0.256468 K07249
10 10 B_colony 3.049031,0.167278 M01375
11 10 B_colony 3.049031,0.167278 F04509
12 11 c_colony 4.049031,0.167278 F04609
13 11 c_colony 4.049031,0.167278 F04610
14 13 Z_colony 5.2,0.7 F1
15 13 Z_colony 5.2,0.7 F2
16 13 Z_colony 5.2,0.7 F3
17 13 Z_colony 5.2,0.7 F4
The second data frame has the route plan, to be loaded according to the terminal/nodal load:
data2= {
'BusID': ['V1','V1','V1','V4','V4','V4','V5','V5','V5','V0','v100'],
'Tcap': [4,4,4,8,8,8,12,12,12,12,8],
'Terminal_Load':[1,2,1,2,1,1,2,1,1,0,4],
'N_Id': [1,2,5,2,5,10,11,4,10,0,13],
'N_Name': ['N_Area','Pstation','T_depot','Pstation','T_depot','B_colony','c_colony','A_Area','B_colony','Destiny','Z_colony'],
}
df2 = pd.DataFrame.from_dict(data2)
output of the second data frame.
df2
Out[46]:
BusID Tcap Terminal_Load N_Id N_Name
0 V1 4 1 1 N_Area
1 V1 4 2 2 Pstation
2 V1 4 1 5 T_depot
3 V4 8 2 2 Pstation
4 V4 8 1 5 T_depot
5 V4 8 1 10 B_colony
6 V5 12 2 11 c_colony
7 V5 12 1 4 A_Area
8 V5 12 1 10 B_colony
9 V0 12 0 0 Destiny
10 v100 8 4 13 Z_colony
Required DataFrame format:
data3 = {
'BusID': ['V1','V1','V1','V4','V4','V4','V5','V5','V5','V0','v100'],
'Tcap': [4,4,4,8,8,8,12,12,12,12,8],
'Terminal_Load':[1,2,1,2,1,1,2,1,1,0,4],
'N_Id': [1,2,5,2,5,10,11,4,10,0,13],
'N_Name': ['N_Area','Pstation','T_depot','Pstation','T_depot','B_colony','c_colony','A_Area','B_colony','Destiny','Z_colony'],
'Pcode': ['M023','L123,M0232','M01368','L1234,K0324','M01369','M01375','F04609,F04610','M0137','F04509','','F1,F2,F3,F4'],
}
df3 = pd.DataFrame.from_dict(data3)
Required output:
df3
Out[47]:
BusID Tcap Terminal_Load N_Id N_Name Pcode
0 V1 4 1 1 N_Area M023
1 V1 4 2 2 Pstation L123,M0232
2 V1 4 1 5 T_depot M01368
3 V4 8 2 2 Pstation L1234,K0324
4 V4 8 1 5 T_depot M01369
5 V4 8 1 10 B_colony M01375
6 V5 12 2 11 c_colony F04609,F04610
7 V5 12 1 4 A_Area M0137
8 V5 12 1 10 B_colony F04509
9 V0 12 0 0 Destiny
10 v100 8 4 13 Z_colony F1,F2,F3,F4
The required DataFrame is a combination of the first two: I need to assign the Pcode values based on the Terminal_Load requirement.
The load at a terminal is given by its Terminal_Load value. For example, at node ID 1 the load is 1, so one pcode is listed; in the second row the terminal load is 2 at node ID 2, so two pcodes are assigned there. The terminal load is 0 at node 0.
I am requesting a pandas DataFrame solution, or any non-pandas implementation, e.g. some labelled pop/queue method or any algorithm that allocates against capacity and, after allocating, removes the label from the queue.
First aggregate the pcodes into lists, as in the previous solution, and add them to df2 as a new column:
s = (df1.groupby(['N_Id', 'N_Name'])['pcode']
        .agg(list)
        .rename('Pcode'))
df = df2.join(s, on=['N_Id', 'N_Name'])
Then the lists need to be split per group by the Terminal_Load values, as in the linked solution (the list is identical within each group, so the first one is selected with iat[0]); the only extra step is a cumulative sum of the loads, starting at 0:
def f(x):
    indices = [0] + x['Terminal_Load'].cumsum().tolist()
    s = x['Pcode'].iat[0]
    # https://stackoverflow.com/a/10851479/2901002
    x['Pcode'] = [','.join(s[indices[i]:indices[i+1]]) for i in range(len(indices)-1)]
    return x

df = df.groupby(['N_Id', 'N_Name']).apply(f)
print (df)
BusID Tcap Terminal_Load N_Id N_Name Pcode
0 V1 4 1 1 N_Area M023
1 V1 4 2 2 Pstation L123,M0232
2 V1 4 1 5 T_depot M01368
3 V4 8 2 2 Pstation L1234,K0324
4 V4 8 1 5 T_depot M01369
5 V4 8 1 10 B_colony M01375
6 V5 12 2 11 c_colony F04609,F04610
7 V5 12 1 4 A_Area M0137
8 V5 12 1 10 B_colony F04509
9 V0 12 0 0 Destiny
10 v100 8 4 13 Z_colony F1,F2,F3,F4
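Since the question also asks for a non-pandas, pop-style implementation, here is a minimal sketch using collections.deque, assuming the df1/df2 frames defined above: pcodes are queued per node in df1 order, and each row of df2 pops Terminal_Load labels off its node's queue, so every label is removed once allocated.
from collections import deque

# queue the pcodes per node id, preserving df1 order
queues = {}
for n_id, pcode in zip(df1['N_Id'], df1['pcode']):
    queues.setdefault(n_id, deque()).append(pcode)

# each df2 row consumes Terminal_Load labels from its node's queue;
# popleft removes the label so it cannot be allocated twice
df2['Pcode'] = [
    ','.join(queues[n_id].popleft() for _ in range(load))
    for n_id, load in zip(df2['N_Id'], df2['Terminal_Load'])
]
print(df2)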

Groupby count of non NaN of another column and a specific calculation of the same columns in pandas

I have a data frame as shown below:
ID Class Score1 Score2 Name
1 A 9 7 Xavi
2 B 7 8 Alba
3 A 10 8 Messi
4 A 8 10 Neymar
5 A 7 8 Mbappe
6 C 4 6 Silva
7 C 3 2 Pique
8 B 5 7 Ramos
9 B 6 7 Serge
10 C 8 5 Ayala
11 A NaN 4 Casilas
12 A NaN 4 De_Gea
13 B NaN 2 Seaman
14 C NaN 7 Chilavert
15 B NaN 3 Courtous
From the above, I would like to calculate, for each Class, the number of players with Score1 less than or equal to 6, along with the count of non-NaN rows (Class-wise).
Expected output:
Class Total_Number Count_Non_NaN Score1_less_than_6_# Avg_score1
A 6 4 0 8.5
B 5 3 2 6
C 4 3 2 5
I tried the code below:
df2 = df.groupby('Class').agg(Total_Number=('Score1', 'size'),
                              Score1_less_than_6=('Score1', lambda x: x.between(0, 6).sum()),
                              Avg_score1=('Score1', 'mean'))
df2 = df2.reset_index()
df2
Groupby and aggregate using a dictionary
df['s'] = df['Score1'].le(6)
df.groupby('Class').agg(**{'total_number': ('Score1', 'size'),
                           'count_non_nan': ('Score1', 'count'),
                           'score1_less_than_six': ('s', 'sum'),
                           'avg_score1': ('Score1', 'mean')})
total_number count_non_nan score1_less_than_six avg_score1
Class
A 6 4 0 8.5
B 5 3 2 6.0
C 4 3 2 5.0
Try:
x = df.groupby("Class", as_index=False).agg(
    Total_Number=("Class", "count"),
    Count_Non_NaN=("Score1", lambda x: x.notna().sum()),
    Score1_less_than_6=("Score1", lambda x: (x <= 6).sum()),
    Avg_score1=("Score1", "mean"),
)
print(x)
Prints:
Class Total_Number Count_Non_NaN Score1_less_than_6 Avg_score1
0 A 6 4.0 0.0 8.5
1 B 5 3.0 2.0 6.0
2 C 4 3.0 2.0 5.0
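If the float dtype of the count columns in this output is undesirable (the lambda aggregations can upcast them), a short hedged follow-up cast restores integers; the column names match the aggregation above:
# cast the float count columns back to int
x = x.astype({"Count_Non_NaN": int, "Score1_less_than_6": int})
print(x)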

identify common column values across two different sized dataframes in pandas

I have two dataframes of different row and column sizes. I want to compare the two and create new columns in df2 based on whether values exist in df1. First, an example (I think you can copy/paste this text into a .csv to import); df1 looks like this:
subject block target dist1 dist2 dist3
7 1 doorlock candleholder01 jar03 stroller
7 2 glassescase clownfish kangaroo ram
7 3 badger chocolatefonduedish hosenozzle toycar04
7 4 hyena crocodile pig toad
7 1 scooter cormorant lizard rockbass
df2 like this:
subject image
7 acorn
7 chainsaw
7 doorlock
7 stroller
7 bathtub
7 clownfish
7 bagtie
7 birdie
7 witchhat
7 crocodile
7 honeybee
7 electricitymeter
7 flowerwreath
7 jar03
7 camera02a
and what I'd like to achieve is this:
subject image present type block
7 acorn 0 NA NA
7 chainsaw 0 NA NA
7 doorlock 1 target 1
7 stroller 1 dist3 1
7 bathtub 0 NA NA
7 clownfish 1 dist1 2
7 bagtie 0 NA NA
7 birdie 0 NA NA
7 witchhat 0 NA NA
7 crocodile 1 dist1 4
7 honeybee 0 NA NA
7 electricitymeter 0 NA NA
7 flowerwreath 0 NA NA
7 jar03 1 dist2 1
7 camera02a 0 NA NA
Specifically, I would like to identify, from the 4 columns in df1 ('target', 'dist1', 'dist2', 'dist3'), which values exist in the 'image' column of df2, and then (1) generate a column (boolean or 0/1) in df2 indicating whether that value exists in df1, (2) generate a second column in df2 with the name of the column in which that item exists in df1 (i.e. 'target', 'dist1', ...), and finally (3) generate a column in df2 with the df1 'block' value from which that item came from, if any.
I hope this is clear. I'd also like some ideas on how to handle the cases that don't match: should I code these as NaN or just empty strings? The thing is, I will probably be groupby()'ing later, and I had some problems with groupby() when the df contained missing values.
You can do this by using melt on df1 and then merging:
df1 = df1.melt(id_vars=['subject', 'block'], var_name='type', value_name='image')
df2['present'] = df2['image'].isin(df1['image']).astype(int)
pd.merge(df2, df1[['image', 'type', 'block']], on='image', how='left')
subject image present type block
0 7 acorn 0 NaN NaN
1 7 chainsaw 0 NaN NaN
2 7 doorlock 1 target 1.0
3 7 stroller 1 dist3 1.0
4 7 bathtub 0 NaN NaN
5 7 clownfish 1 dist1 2.0
6 7 bagtie 0 NaN NaN
7 7 birdie 0 NaN NaN
8 7 witchhat 0 NaN NaN
9 7 crocodile 1 dist1 4.0
10 7 honeybee 0 NaN NaN
11 7 electricitymeter 0 NaN NaN
12 7 flowerwreath 0 NaN NaN
13 7 jar03 1 dist2 1.0
14 7 camera02a 0 NaN NaN
As for the missing values, I would keep them as NaN. pandas is pretty powerful in terms of working with missing data, so may as well take advantage of this.
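On the groupby() concern specifically: since pandas 1.1 you can keep rows whose key is NaN as their own group instead of having them dropped. A small hedged sketch, assuming the merge result above is assigned to a variable:
merged = pd.merge(df2, df1[['image', 'type', 'block']], on='image', how='left')
# dropna=False keeps the NaN 'type' rows visible as a group (pandas >= 1.1)
print(merged.groupby('type', dropna=False).size())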

How do I calculate the probability of every value in a dataframe column quickly in Python?

I want to calculate the probability of every value in a dataframe column according to the column's own distribution. For example, my data looks like this:
data
0 1
1 1
2 2
3 3
4 2
5 2
6 7
7 8
8 3
9 4
10 1
And the output I expect like this:
data pro
0 1 0.155015
1 1 0.155015
2 2 0.181213
3 3 0.157379
4 2 0.181213
5 2 0.181213
6 7 0.048717
7 8 0.044892
8 3 0.157379
9 4 0.106164
10 1 0.155015
I also referred to another question (How to compute the probability ...) and adapted the example above from it. My code is as follows:
import pandas as pd
import scipy.stats

samples = [1, 1, 2, 3, 2, 2, 7, 8, 3, 4, 1]
samples = pd.DataFrame(samples, columns=['data'])
print(samples)
kde = scipy.stats.gaussian_kde(samples['data'].tolist())
samples['pro'] = kde.pdf(samples['data'].tolist())
print(samples)
But what I can't stand is that if my column is too long, the operation becomes slow. Is there a better way to do it in pandas? Thanks in advance.
"Its own distribution" does not have to mean a KDE. You can use value_counts with normalize=True:
df.assign(pro=df.data.map(df.data.value_counts(normalize=True)))
data pro
0 1 0.272727
1 1 0.272727
2 2 0.272727
3 3 0.181818
4 2 0.272727
5 2 0.272727
6 7 0.090909
7 8 0.090909
8 3 0.181818
9 4 0.090909
10 1 0.272727
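If the KDE probabilities are what you actually need and the column is long, a hedged way to speed things up is to evaluate the density once per unique value and map the results back onto the column:
import pandas as pd
import scipy.stats

samples = pd.DataFrame({'data': [1, 1, 2, 3, 2, 2, 7, 8, 3, 4, 1]})
kde = scipy.stats.gaussian_kde(samples['data'])
# evaluate the pdf only on the unique values, then broadcast via map
uniq = samples['data'].unique()
samples['pro'] = samples['data'].map(dict(zip(uniq, kde.pdf(uniq))))
print(samples)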

Cannot assign row in pandas.Dataframe

I am trying to calculate the mean of the rows of a DataFrame which have the same value on a specified column col. However I'm stuck at assigning a row of the pandas DataFrame.
Here's my code:
import numpy as np
import pandas as pd

def code(data, col):
    """ Finds average value of all rows that have identical col values from column col.
    Returns new pandas.DataFrame with the data.
    """
    values = pd.unique(data[col])
    rows = len(values)
    res = pd.DataFrame(np.zeros(shape=(rows, len(data.columns))), columns=data.columns)
    for i, v in enumerate(values):
        e = data[data[col] == v].mean().to_frame().transpose()
        res[i:i+1] = e
    return res
The problem is that the code only works for the first row and puts NaN values in the next rows. I have checked the value of e and confirmed it is good, so there is a problem with the assignment res[i:i+1] = e. I have also tried res.iloc[i] = e, but I get "ValueError: Incompatible indexer with Series". Is there an alternative way to do this? It seems very straightforward and I'm baffled that it doesn't work.
E.g.:
wdata
Out[78]:
Die Subsite Algorithm Vt1 It1 Ignd
0 1 0 0 0.0 -2.320000e-07 -4.862400e-08
1 1 0 0 0.1 -1.000000e-04 1.000000e-04
2 1 0 0 0.2 -1.000000e-03 1.000000e-03
3 1 0 0 0.3 -1.000000e-02 1.000000e-02
4 1 1 1 0.0 3.554000e-07 -2.012000e-07
5 1 2 2 0.0 5.353000e-08 -1.684000e-07
6 1 3 3 0.0 9.369400e-08 -2.121400e-08
7 1 4 4 0.0 3.286200e-08 -2.093600e-08
8 1 5 5 0.0 8.978600e-08 -3.262000e-07
9 1 6 6 0.0 3.624800e-08 -2.507600e-08
10 1 7 7 0.0 2.957000e-08 -1.993200e-08
11 1 8 8 0.0 7.732600e-08 -3.773200e-08
12 1 9 9 0.0 9.300000e-08 -3.521200e-08
13 1 10 10 0.0 8.468000e-09 -6.990000e-09
14 1 11 11 0.0 1.434200e-11 -1.200000e-11
15 2 0 0 0.0 8.118000e-11 -5.254000e-11
16 2 1 1 0.0 9.322000e-11 -1.359200e-10
17 2 2 2 0.0 1.944000e-10 -2.409400e-10
18 2 3 3 0.0 7.756000e-11 -8.556000e-11
19 2 4 4 0.0 1.260000e-11 -8.618000e-12
20 2 5 5 0.0 7.122000e-12 -1.402000e-13
21 2 6 6 0.0 6.224000e-11 -2.760000e-11
22 2 7 7 0.0 1.133400e-08 -6.566000e-09
23 2 8 8 0.0 6.600000e-13 -1.808000e-11
24 2 9 9 0.0 6.861000e-08 -4.063400e-08
25 2 10 10 0.0 2.743800e-10 -1.336000e-10
Expected output:
Die Subsite Algorithm Vt1 It1 Ignd
0 1 4.4 4.4 0.04 -0.00074 0.00074
0 2 5.5 5.5 0 6.792247e-09 -4.023330e-09
Instead, what i get is:
Die Subsite Algorithm Vt1 It1 Ignd
0 1 4.4 4.4 0.04 -0.00074 0.00074
0 NaN NaN NaN NaN NaN NaN
For example, this code results in:
In[81]: wdata[wdata['Die'] == 2].mean().to_frame().transpose()
Out[81]:
Die Subsite Algorithm Vt1 It1 Ignd
0 2 5.5 5.5 0 6.792247e-09 -4.023330e-09
This works for me:
def code(data, col):
    """ Finds average value of all rows that have identical col values from column col.
    Returns new pandas.DataFrame with the data.
    """
    values = pd.unique(data[col])
    rows = len(values)
    res = pd.DataFrame(columns=data.columns)
    for i, v in enumerate(values):
        e = data[data[col] == v].mean()
        res.loc[i, :] = e
    return res

col = 'Die'
print (code(data, col))
col = 'Die'
print (code(data, col))
Die Subsite Algorithm Vt1 It1 Ignd
0 1 4.4 4.4 0.04 -0.000739957 0.000739939
1 2 5 5 0 7.34067e-09 -4.35482e-09
but groupby with mean aggregation gives the same output:
print (data.groupby(col, as_index=False).mean())
Die Subsite Algorithm Vt1 It1 Ignd
0 1 4.4 4.4 0.04 -7.399575e-04 7.399392e-04
1 2 5.0 5.0 0.00 7.340669e-09 -4.354818e-09
A few minutes after I posted the question I solved it by adding .values to e:
e = data[data[col] == v].mean().to_frame().transpose().values
The .values strips the index from e, so the assignment becomes positional; without it, res[i:i+1] = e aligns on row labels, and since e always has index 0, only the first row ever matched. However, it turns out that what I wanted to do is already built into pandas. Thanks MaxU!
df.groupby(col).mean()
