Adding rows in dataframe based on changing column condition - python-3.x

I have a dataframe that looks like this.
name Datetime col_3 col_4
8 'Name 1' 2017-01-02T00:00:00 160 1600
9 'Name 1' 2017-01-02T00:00:00 160 1600
10 'Name 1' 2017-01-03T00:00:00 160 1800
.. ... ... ... ...
150 'Name 2' 2004-10-13T00:00:00 160 1600
151 'Name 2' 2004-10-14T00:00:00 160 1600
152 'Name 2' 2004-10-15T00:00:00 160 1800
.. ... ... ... ...
435 'Name 3' 2009-01-02T00:00:00 160 1600
436 'Name 3' 2009-01-02T00:00:00 170 1500
437 'Name 3' 2009-01-03T00:00:00 160 1800
.. ... ... ... ...
Essentially, I want to delete the 'name' column and add a row each time the name changes, containing only that name:
Datetime col_3 col_4
7 'Name 1'
8 2017-01-02T00:00:00 160 1600
9 2017-01-02T00:00:00 160 1600
.. ... ... ... ...
149 'Name 2'
150 2004-10-13T00:00:00 160 1600
151 2004-10-14T00:00:00 160 1600
.. ... ... ... ...
435 'Name 3'
436 2009-01-02T00:00:00 170 1500
437 2009-01-03T00:00:00 160 1800
.. ... ... ... ...
I know how to add rows once the name column changes, but I need to automate adding the name into the Datetime column, so that different data of the same shape can be put through the code. Any help would be much appreciated.
Thanks!

I think what you are after is groupby:
df.groupby('name')
so you could do
for name, dfsub in df.groupby('name'):
    ...
This would allow you to work on each group individually.
An example
import pandas as pd
df = pd.DataFrame({
    'Name': ['a','a','a','b','b','b','b','c','c','d','d','d'],
    'B': [5,5,6,7,5,6,6,7,7,6,7,7],
    'C': [1,1,1,1,1,1,1,1,1,1,1,1]
})
giving a dataframe
Name B C
0 a 5 1
1 a 5 1
2 a 6 1
3 b 7 1
4 b 5 1
5 b 6 1
6 b 6 1
7 c 7 1
8 c 7 1
9 d 6 1
10 d 7 1
11 d 7 1
Now we can just look at the output of a groupby. Iterating over a groupby yields two things: first the group name, and second the subset of the dataframe belonging to that group.
for name, dfsub in df.groupby('Name'):
    print("Name is :" + name)
    dfsub1 = dfsub.drop('Name', axis=1)
    print(dfsub1)
    print()  # new line for clarity
and this gives
Name is :a
B C
0 5 1
1 5 1
2 6 1
Name is :b
B C
3 7 1
4 5 1
5 6 1
6 6 1
Name is :c
B C
7 7 1
8 7 1
Name is :d
B C
9 6 1
10 7 1
11 7 1
where you get the name you are dealing with, then the dataframe dfsub that contains just the data that you are looking at.
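Building on that, here is a minimal sketch of the original goal (sample data abbreviated from the question): before each group, insert a marker row carrying the group name in the Datetime column, then drop the name column. The marker-row construction is one possible interpretation of the asker's layout, not the only one.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Name 1', 'Name 1', 'Name 2', 'Name 2'],
    'Datetime': ['2017-01-02', '2017-01-03', '2004-10-13', '2004-10-14'],
    'col_3': [160, 160, 160, 160],
    'col_4': [1600, 1800, 1600, 1600],
})

pieces = []
for name, dfsub in df.groupby('name', sort=False):
    # marker row: the group name goes in the Datetime column,
    # the remaining cells are left empty (NaN)
    pieces.append(pd.DataFrame({'Datetime': [name]}))
    pieces.append(dfsub.drop('name', axis=1))

out = pd.concat(pieces, ignore_index=True)
print(out)
```

The concat fills the missing col_3/col_4 cells of each marker row with NaN, which prints as blanks in the style of the asker's expected output.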

Related

Merge pandas data frame based on specific conditions

I have a df as shown below
df1:
ID Job Salary
1 A 100
2 B 200
3 B 20
4 C 150
5 A 500
6 A 600
7 A 200
8 B 150
df2:
ID Type Status Age
1 2 P 23
2 1 P 28
8 1 F 33
4 3 P 48
14 1 F 23
11 2 P 28
16 2 F 23
41 3 P 38
df3:
ID T_Type Amount
1 K 20
2 L -50
1 K 30
3 K 5
1 K 100
2 L -50
1 L -30
25 K 500
1 K 20
4 L -80
19 K 30
2 K -5
Explanation about the data:
ID is the primary key of df1.
ID is the primary key of df2.
df3 does not have any primary key.
From the above, I would like to prepare the dataframes below.
1. IDs which are in df1 and df2.
Expected output1:
ID Job Salary
1 A 100
2 B 200
4 C 150
8 B 150
2. IDs which are there in df1 and not in df2.
output2:
ID Job Salary
3 B 20
5 A 500
6 A 600
7 A 200
3. IDs which are there in df1 and df3.
output3:
ID Job Salary
1 A 100
2 B 200
3 B 20
4 C 150
4. IDs which are there in df1 and not in df3.
output4:
ID Job Salary
5 A 500
6 A 600
7 A 200
8 B 150
>>> # 1. IDs which are in df1 and df2.
>>> df1[df1['ID'].isin(df2['ID'])]
ID Job Salary
0 1 A 100
1 2 B 200
3 4 C 150
7 8 B 150
>>> # 2. IDs which are there in df1 and not in df2
>>> df1[~df1['ID'].isin(df2['ID'])]
ID Job Salary
2 3 B 20
4 5 A 500
5 6 A 600
6 7 A 200
>>> # 3. IDs which are there in df1 and df3
>>> df1[df1['ID'].isin(df3['ID'])]
ID Job Salary
0 1 A 100
1 2 B 200
2 3 B 20
3 4 C 150
>>> # 4. IDs which are there in df1 and not in df3.
>>> df1[~df1['ID'].isin(df3['ID'])]
ID Job Salary
4 5 A 500
5 6 A 600
6 7 A 200
7 8 B 150
Actually, your expected results aren't merges but selections, based on whether df1.ID is (or is not) in the ID column of the second DataFrame.
To get your expected results, run the following commands:
result_1 = df1[df1.ID.isin(df2.ID)]
result_2 = df1[~df1.ID.isin(df2.ID)]
result_3 = df1[df1.ID.isin(df3.ID)]
result_4 = df1[~df1.ID.isin(df3.ID)]
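Since the question's title asks about merging: for completeness, the same selections can also be phrased as an inner join and an anti-join via merge(indicator=True). This is an alternative sketch, not the answer's method; data abbreviated from the question.

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3, 4, 5, 6, 7, 8],
                    'Job': list('ABBCAAAB'),
                    'Salary': [100, 200, 20, 150, 500, 600, 200, 150]})
df2 = pd.DataFrame({'ID': [1, 2, 8, 4, 14, 11, 16, 41]})

# inner join on IDs reproduces result_1; drop_duplicates guards against
# duplicated IDs in df2 multiplying rows of df1
result_1 = df1.merge(df2[['ID']].drop_duplicates(), on='ID')

# anti-join: keep rows whose merge indicator says they came from df1 only
m = df1.merge(df2[['ID']].drop_duplicates(), on='ID',
              how='left', indicator=True)
result_2 = m[m['_merge'] == 'left_only'].drop(columns='_merge')
```

The isin approach is simpler here; merge pays off when you also need columns from the second frame.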

loops application in dataframe to find output

I have the following data:
data = {'A':[1,2,3,4,5],'B':[10,20,233,29,2],'C':[10,20,3040,230,238],...................}
and
df = pd.DataFrame(data)
In this manner I have 20 columns with 5 numerical entries in each column. I want a new column whose values follow this logic:
0 A[0]*B[0] + A[0]*C[0] + A[0]*D[0] + ...
1 A[1]*B[1] + A[1]*C[1] + A[1]*D[1] + ...
2 A[2]*B[2] + A[2]*C[2] + A[2]*D[2] + ...
I tried the following, but I cannot manually write out all 20 columns, so I want a loop that produces the desired output:
lst = []
for i in range(0, 5):
    j = df.A[i]*df.B[i] + df.A[i]*df.C[i] + .......
    lst.append(j)
A potential solution is the following. I am only taking the example you posted, but it works fine for more columns. Your data is df:
A B C
0 1 10 10
1 2 20 20
2 3 233 3040
3 4 29 230
4 5 2 238
You can create a new column, D by first subsetting your dataframe
add = df.loc[:, df.columns != 'A']
and then computing column A times the row-wise sum of the remaining columns (since A*B + A*C + ... = A*(B + C + ...)):
df['D'] = df['A']*add.sum(axis=1)
which returns
A B C D
0 1 10 10 20
1 2 20 20 80
2 3 233 3040 9819
3 4 29 230 1036
4 5 2 238 1200
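The answer above, collected into a self-contained snippet (data from the question); the same two lines scale to any number of columns without the manual loop:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                   'B': [10, 20, 233, 29, 2],
                   'C': [10, 20, 3040, 230, 238]})

# A*B + A*C + ... == A * (B + C + ...), so sum every column except A
# row-wise and multiply by A
add = df.loc[:, df.columns != 'A']
df['D'] = df['A'] * add.sum(axis=1)
print(df['D'].tolist())  # [20, 80, 9819, 1036, 1200]
```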

How to do cumulative mean and count in an easy way

I have following dataframe in pandas
data = {'call_put':['C', 'C', 'P','C', 'P'], 'price':[10,20,30,40,50], 'qty':[11,12,11,14,9]}
df = pd.DataFrame(data)
df['amt'] = df.price*df.qty
call_put price qty amt
0 C 10 11 110
1 C 20 12 240
2 P 30 11 330
3 C 40 14 560
4 P 50 9 450
I want output something like the following, where cummcount and cummmedian are computed per call_put value ('C' or 'P') and cummsum runs over all rows:
call_put price qty amt cummcount cummmedian cummsum
C 10 11 110 1 110 110
C 20 12 240 2 175 ((110+240)/2 ) 350
P 30 11 330 1 330 680
C 40 14 560 3 303.33 (110+240+560)/3 1240
P 50 9 450 2 390 ((330+450)/2) 1690
Can it be done in some easy way without creating additional dataframes and functions?
create a grouped object named g and use df.assign to add the new columns. Note the reset_index(level=0, drop=True): dropping only the group level keeps the expanding mean aligned with the original row order (a plain reset_index(drop=True) would scramble it across groups):
g = df.groupby('call_put')
final = df.assign(cum_count=g.cumcount().add(1),
                  cummedian=g['amt'].expanding().mean().reset_index(level=0, drop=True),
                  cum_sum=df.amt.cumsum())
  call_put  price  qty  amt  cum_count   cummedian  cum_sum
0        C     10   11  110          1  110.000000      110
1        C     20   12  240          2  175.000000      350
2        P     30   11  330          1  330.000000      680
3        C     40   14  560          3  303.333333     1240
4        P     50    9  450          2  390.000000     1690
Note: for P, the cummedian at the last row is 390 since (330+450)/2 = 390.
For cum_count look at df.groupby.cumcount(), for cummedian check how expanding() works, and for cum_sum check df.cumsum().
IIUC, this should work:
df['cumcount'] = df.groupby('call_put').cumcount().add(1)
df['cummedian'] = df.groupby('call_put')['amt'].expanding().mean().reset_index(level=0, drop=True)
df['cumsum'] = df.amt.cumsum()
Thanks, the following solution is fine:
g = df.groupby('call_put')
final = df.assign(cum_count=g.cumcount().add(1),
                  cummedian=g['amt'].expanding().mean().reset_index(level=0, drop=True),
                  cum_sum=df.amt.cumsum())
If I run the following without drop=True:
g['amt'].expanding().mean().reset_index()
why is the output showing level_1?
call_put level_1 amt
0 C 0 110.000000
1 C 1 175.000000
2 C 3 303.333333
3 P 2 330.000000
4 P 4 390.000000
g['amt'].expanding().mean().reset_index(drop=True)
0 110.000000
1 175.000000
2 303.333333
3 330.000000
4 390.000000
Name: amt, dtype: float64
Can you please explain in more detail? How do you add one more condition to the groupby clause?
g = df.groupby('call_put', 'price' < 50)
TypeError: '<' not supported between instances of 'str' and 'int'
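Regarding that follow-up: groupby does not take a condition as a second positional argument, which is why the comparison of the string 'price' with 50 raises a TypeError. Filter the frame first, then group. A sketch, assuming the intent is to keep only rows with price below 50:

```python
import pandas as pd

df = pd.DataFrame({'call_put': ['C', 'C', 'P', 'C', 'P'],
                   'price': [10, 20, 30, 40, 50],
                   'qty': [11, 12, 11, 14, 9]})
df['amt'] = df.price * df.qty

# filter the rows first, then group: only rows with price < 50
# ever reach the groupby
g = df[df.price < 50].groupby('call_put')
sizes = g.size().to_dict()
print(sizes)  # {'C': 3, 'P': 1}
```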

Sum of next n rows in python

I have a dataframe which is grouped at the product, store, day_id level. Say it looks like the below; I need to create a column with a rolling sum.
prod store day_id visits
111 123 1 2
111 123 2 3
111 123 3 1
111 123 4 0
111 123 5 1
111 123 6 0
111 123 7 1
111 123 8 1
111 123 9 2
need to create a dataframe as below
prod store day_id visits rolling_4_sum cond
111 123 1 2 6 1
111 123 2 3 5 1
111 123 3 1 2 1
111 123 4 0 2 1
111 123 5 1 4 0
111 123 6 0 4 0
111 123 7 1 NA 0
111 123 8 1 NA 0
111 123 9 2 NA 0
I am looking to create a cond column that recursively checks a condition: say, if rolling_4_sum is greater than 5, then make the next 4 rows 1; else do nothing, i.e. even if the condition is not met, retain what was already filled before. Do this check for each row up to the 7th row.
How can I achieve this using Python? I am trying
d1['rolling_4_sum'] = d1.groupby(['prod', 'store']).visits.rolling(4).sum()
but getting an error.
The rolling sums can be formed with the rolling method:
df['rolling_4_sum'] = df.visits.rolling(4).sum().shift(-3)
rolling(4).sum() labels each window's sum at its right edge; the shift by -3 moves it to the left edge of the window, which is where you apparently want the sums placed.
Next, the condition on the rolling sums:
df['cond'] = 0
for k in range(1, 4):
    df.loc[df.rolling_4_sum.shift(k) < 7, 'cond'] = 1
A new column is inserted and filled with 0; then for each k = 1, 2, 3, look k steps back; if the sum there is less than 7, set the condition to 1.
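The asker's own groupby attempt errors because groupby(...).rolling(...) returns a result indexed by the group keys plus the original index, which cannot be assigned straight back to a column. transform sidesteps that by keeping the result aligned with the original index. A sketch of the per-group forward-looking sum; the 4-row window covering the current row plus the next three is an assumption read off the expected table:

```python
import pandas as pd

d1 = pd.DataFrame({'prod': [111] * 9, 'store': [123] * 9,
                   'day_id': list(range(1, 10)),
                   'visits': [2, 3, 1, 0, 1, 0, 1, 1, 2]})

# transform keeps the result aligned with the original index;
# rolling(4).sum() labels each sum at the window's right edge, so
# shift(-3) moves it to the left edge (current row + next three rows),
# leaving NaN on the last three rows of each group
d1['rolling_4_sum'] = (d1.groupby(['prod', 'store'])['visits']
                         .transform(lambda s: s.rolling(4).sum().shift(-3)))
```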

how to add a new column in dataframe which divides multiple columns and finds the maximum value

This may be a really simple question, but I am new to Python 3. I have a dataframe with multiple columns, and I would like to add a new column to the existing dataframe which does the following calculation:
New Column = Max((Column A/Column B), (Column C/Column D), (Column E/Column F))
I can compute a max with the following code, but wanted to check how I can do the division along with it.
df['Max'] = df[['Column A','Column B','Column C', 'Column D', 'Column E', 'Column F']].max(axis=1)
Column A Column B Column C Column D Column E Column F Max
3600 36000 22 11 3200 3200 36000
2300 2300 13 26 1100 1200 2300
1300 13000 15 33 1000 1000 13000
Thanks
You can div the df by itself by slicing the columns in steps and then take the max (.iloc replaces the long-removed .ix indexer):
In [105]:
df['Max'] = df.iloc[:, ::2].div(df.iloc[:, 1::2].values).max(axis=1)
df
df
Out[105]:
Column A Column B Column C Column D Column E Column F Max
0 3600 36000 22 11 3200 3200 2
1 2300 2300 13 26 1100 1200 1
2 1300 13000 15 33 1000 1000 1
Here are the intermediate values:
In [108]:
df.iloc[:, ::2].div(df.iloc[:, 1::2].values)
Out[108]:
Column A Column C Column E
0 0.1 2.000000 1.000000
1 1.0 0.500000 0.916667
2 0.1 0.454545 1.000000
You can try something like the following:
df['Max'] = df.apply(lambda v: max(v['A'] / v['B'].astype(float), v['C'] / v['D'].astype(float), v['E'] / v['F'].astype(float)), axis=1)
Example
In [14]: df
Out[14]:
A B C D E F
0 1 11 1 11 12 98
1 2 22 2 22 67 1
2 3 33 3 33 23 4
3 4 44 4 44 11 10
In [15]: df['Max'] = df.apply(lambda v: max(v['A'] / v['B'].astype(float), v['C'] /
v['D'].astype(float), v['E'] / v['F'].astype(float)), axis=1)
In [16]: df
Out[16]:
A B C D E F Max
0 1 11 1 11 12 98 0.122449
1 2 22 2 22 67 1 67.000000
2 3 33 3 33 23 4 5.750000
3 4 44 4 44 11 10 1.100000
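The first answer's vectorized slicing, condensed into a self-contained snippet on the second answer's data (column names A-F as above); this avoids the row-wise apply entirely:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [11, 22, 33, 44],
                   'C': [1, 2, 3, 4], 'D': [11, 22, 33, 44],
                   'E': [12, 67, 23, 11], 'F': [98, 1, 4, 10]})

# divide each even-positioned column by its odd-positioned neighbour
# (A/B, C/D, E/F), then take the row-wise maximum of the ratios
ratios = df.iloc[:, ::2].values / df.iloc[:, 1::2].values
df['Max'] = ratios.max(axis=1)
```

This relies only on the columns being laid out as numerator/denominator pairs, the same assumption the first answer makes.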
