Create a column in pandas dataframes based on conditionals - python-3.x

I have a pandas dataframe as below:
import pandas as pd
import numpy as np
import datetime
# Initialise the data as a dict of lists.
data = {'month': [2,3,4,5,6,7,2,3,6,5],
        'flag': ["A","A","A","A","A","A","B","B","B","B"],
        'month1': [4,4,7,15,11,13,6,5,6,5],
        'value': [100,20,50,10,65,86,24,12,1000,200]
        }
# Create DataFrame
df = pd.DataFrame(data)
# Print the output.
df
month flag month1 value
0 2 A 4 100
1 3 A 4 20
2 4 A 7 50
3 5 A 15 10
4 6 A 11 65
5 7 A 13 86
6 2 B 6 24
7 3 B 5 12
8 6 B 6 1000
9 5 B 5 200
Now, within each flag group, I want to apply the logic below to every row:
1) Create a column "final" and set it to 0.
2) For each row, if month1 <= max(month) of the group, find the row where month equals this row's month1 and add this row's value to that row's "final". For example:
Indexes 0 to 5 form one group (flag = 'A').
The max of the month column for group A is 7.
For the first row (index 0, month 2), month1 is 4, which is less than 7, so go to month 4 (index 2) and update its "final" to 100 (0, the current "final" value, plus 100, the value from the original row).
Repeat this step for each row in the group.
Expected output:
month flag month1 value Final
0 2 A 4 100 0
1 3 A 4 20 0
2 4 A 7 50 120
3 5 A 15 10 0
4 6 A 11 65 0
5 7 A 13 86 50
6 2 B 6 24 0
7 3 B 5 12 0
8 6 B 6 1000 1024
9 5 B 5 200 212

Define the following functions:
A function to be applied to each row (in the current group):
def fn(row, tbl, maxMonth):
    # Sum value over the rows of this group whose month1 equals this row's month.
    return tbl[tbl.month1 == row.month].value.sum()
A function to be applied to each group:
def fnGrp(grp):
    return grp.apply(fn, axis=1, tbl=grp, maxMonth=grp.month.max())
Then, to compute the final column, group df by flag, apply fnGrp to each group,
and save the result in the final column:
df['final'] = df.groupby('flag').apply(fnGrp).reset_index(level=0, drop=True)
The result (df with added column) is:
month flag month1 value final
0 2 A 4 100 0
1 3 A 4 20 0
2 4 A 7 50 120
3 5 A 15 10 0
4 6 A 11 65 0
5 7 A 13 86 50
6 2 B 6 24 0
7 3 B 5 12 0
8 6 B 6 1000 1024
9 5 B 5 200 212

You can group by 'flag' and 'month1', sum 'value', then merge the result back into df on 'flag' and 'month' and fill the missing values with 0:
new_df = df.merge(df.groupby(['flag', 'month1'])[['value']].sum(),
                  left_on=['flag','month'], right_index=True,
                  how='left', suffixes=('','_final'))\
           .fillna({'value_final':0})
print (new_df)
month flag month1 value value_final
0 2 A 4 100 0.0
1 3 A 4 20 0.0
2 4 A 7 50 120.0
3 5 A 15 10 0.0
4 6 A 11 65 0.0
5 7 A 13 86 50.0
6 2 B 6 24 0.0
7 3 B 5 12 0.0
8 6 B 6 1000 1024.0
9 5 B 5 200 212.0
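For comparison, the same lookup can be sketched without a merge by reindexing the grouped sums on each row's (flag, month) pair. This is a minimal sketch, assuming the sample data from the question; the names sums and keys are illustrative only. Like the answers above, it does not test month1 <= max(month) explicitly: a month1 beyond the group's largest month simply matches no row, so it contributes nothing.

```python
import pandas as pd

df = pd.DataFrame({
    'month':  [2, 3, 4, 5, 6, 7, 2, 3, 6, 5],
    'flag':   list('AAAAAABBBB'),
    'month1': [4, 4, 7, 15, 11, 13, 6, 5, 6, 5],
    'value':  [100, 20, 50, 10, 65, 86, 24, 12, 1000, 200],
})

# Total value sent to each (flag, month1) target.
sums = df.groupby(['flag', 'month1'])['value'].sum()

# Look up each row's own (flag, month) pair; pairs that receive nothing become 0.
keys = pd.MultiIndex.from_frame(df[['flag', 'month']])
df['final'] = sums.reindex(keys).fillna(0).astype(int).to_numpy()

print(df)
```

The cast with astype(int) keeps the column as integers, whereas the merge plus fillna version yields floats.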

Related

Adding timepoints to a multirow dataframe based on ID and date

As the title says it, my dataframe looks as follows:
ID  Follow up month  Value-x  value -y
1   0                12       12
1   0                11       14
2   0                10       11
2   3                11       0
2   0                12       1
1   3                13       12
2   3                11       5
I want to add another column called Timepoint, which would make the table look as follows:
ID  Follow up month  Value-x  value -y  Timepoint
1   0                12       12        1
1   0                11       14        1
2   0                10       11        1
2   3                11       0         2
2   0                12       1         1
1   3                13       12        2
2   3                11       5         2
So far I have tried to group the rows by their ID and follow-up month and then assign a timepoint using cumcount. This didn't give me the result I wanted. Any help on how to handle this would be appreciated.
From your table I can only infer that you want to create the Timepoint column based on the corresponding values in Follow up month, which will look like:
from io import StringIO
import pandas as pd
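wt = StringIO("""ID  Follow up month  Value-x  value -y
1  0  12  12
1  0  11  14
2  0  10  11
2  3  11  0
2  0  12  1
1  3  13  12
2  3  11  5""")
# Columns are separated by two or more spaces, since the names themselves contain single spaces.
df = pd.read_csv(wt, sep=r'\s\s+', engine='python')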
df['Timepoint'] = df['Follow up month'].apply(lambda x: 1 if x==0 else 2)
df
Output:
ID Follow up month Value-x value -y Timepoint
0 1 0 12 12 1
1 1 0 11 14 1
2 2 0 10 11 1
3 2 3 11 0 2
4 2 0 12 1 1
5 1 3 13 12 2
6 2 3 11 5 2
Edit
Based on your comment, this should be what you want:
def timepoint(s):
    if not s.isin([0]).any() and s.iloc[0] == 3:
        return 1
    else:
        return s.apply(lambda x: 1 if x==0 else 2)
df['Timepoint'] = df.groupby('ID')['Follow up month'].transform(timepoint)
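If the timepoint is meant to be the order of the distinct follow-up months within each ID (an assumption; the thread does not confirm it), a dense rank per group gives the same numbers as the expected table and also yields 1 for an ID whose only follow-up month is 3. A minimal sketch on the question's data:

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 1, 2, 2, 2, 1, 2],
    'Follow up month': [0, 0, 0, 3, 0, 3, 3],
    'Value-x': [12, 11, 10, 11, 12, 13, 11],
    'value -y': [12, 14, 11, 0, 1, 12, 5],
})

# Dense rank: the smallest month in each ID gets 1, the next distinct month 2, and so on.
df['Timepoint'] = (
    df.groupby('ID')['Follow up month']
      .rank(method='dense')
      .astype(int)
)

print(df)
```

Unlike the lambda above, this would also number a third or fourth distinct follow-up month (3, 4, ...) instead of mapping everything non-zero to 2.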

How to get rows when specific column value is continous for certain number of rows

I want to extract rows when the column x value remains the same for more than five consecutive rows.
x x2
0 5 5
1 4 5
2 10 6
3 10 5
4 10 6
5 10 78
6 10 89
7 10 78
8 10 98
9 10 8
10 10 56
11 60 45
12 10 65
Desired_output:
x x2
0 10 6
1 10 5
2 10 6
3 10 78
4 10 89
5 10 78
6 10 98
7 10 8
8 10 56
You can use .shift + .cumsum to identify the blocks of consecutive rows where column x value remains same, then group the dataframe on these blocks and transform using count to identify the groups which have greater than 5 consecutive same values in x:
b = df['x'].ne(df['x'].shift()).cumsum()
df_out = df[df['x'].groupby(b).transform('count').gt(5)]
Details:
>>> b
0 1
1 2
2 3
3 3
4 3
5 3
6 3
7 3
8 3
9 3
10 3
11 4
12 5
Name: x, dtype: int64
>>> df_out
x x2
2 10 6
3 10 5
4 10 6
5 10 78
6 10 89
7 10 78
8 10 98
9 10 8
10 10 56
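You can use shift to compare each row with the previous one, then take the cumulative sum of the matches to check whether the number of repeats exceeds 5. Group that on x and transform with any, then mask with (c|c.shift(-1)) to drop the rows that are not part of a run. Note that the cumulative sum runs over the whole column rather than per block, so this relies on the sample containing a single long run.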
c = df['x'].eq(df['x'].shift())
out = df[c.cumsum().gt(5).groupby(df['x']).transform('any') & (c|c.shift(-1))]
print(out)
x x2
2 10 6
3 10 5
4 10 6
5 10 78
6 10 89
7 10 78
8 10 98
9 10 8
10 10 56
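The block-labelling idea (shift, compare, cumsum) can be wrapped in a small helper so the column and the minimum run length become parameters. This is only a sketch; the function name rows_in_long_runs and its arguments are made up for illustration, and "more than five" is expressed as min_len=6.

```python
import pandas as pd

def rows_in_long_runs(frame, column, min_len):
    """Keep rows that belong to a run of at least min_len equal consecutive values."""
    # A new block starts whenever the value differs from the previous row.
    block = frame[column].ne(frame[column].shift()).cumsum()
    # Length of the block each row belongs to.
    run_len = frame[column].groupby(block).transform('size')
    return frame[run_len >= min_len]

df = pd.DataFrame({
    'x':  [5, 4, 10, 10, 10, 10, 10, 10, 10, 10, 10, 60, 10],
    'x2': [5, 5, 6, 5, 6, 78, 89, 78, 98, 8, 56, 45, 65],
})

out = rows_in_long_runs(df, 'x', min_len=6)
print(out)
```

Because the length is computed per block, the trailing single 10 at index 12 is excluded even though it shares its value with the long run.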

How to save rows when value change in column python

I have a DataFrame with two columns, ID and Value1. I want to select rows around each point where the value of column Value1 changes: the 3 rows before the change, the 3 rows after it, and the change-point row itself.
df=pd.DataFrame({'ID':[1,3,4,6,7,8,90,23,56,78,90,34,56,78,89,34,56],'Value1':[0,0,0,0,0,2,2,2,2,0,0,0,1,1,1,1,1]})
ID Value1
0 1 0
1 3 0
2 4 0
3 6 0
4 7 0
5 8 2
6 90 2
7 23 2
8 56 2
9 78 0
10 90 0
11 34 0
12 56 1
13 78 1
14 89 1
15 34 1
16 56 1
output:
ID Value1
0 4 0
1 6 0
2 7 0
3 8 2
4 90 2
5 23 2
6 90 2
7 23 2
8 56 2
9 78 0
10 90 0
11 34 0
IIUC,
import pandas as pd
df=pd.DataFrame({'ID':[1,3,4,6,7,8,90,23,56,78,90,34,56,78,89,34,56],'Value1':[0,0,0,0,0,2,2,2,2,0,0,0,1,1,1,1,1]})
df = df.reset_index(drop=True) #index needs to start from zero for solution
ind = list(set([val for i in df[df['Value1'].diff()!=0].index for val in range(i-3, i+4) if i>0 and val>=0]))
# diff gives column wise differencing. combined it with nested list and
# finally, list(set()) to drop any duplicates in index values
df[df.index.isin(ind)]
ID Value1
2 4 0
3 6 0
4 7 0
5 8 2
6 90 2
7 23 2
8 56 2
9 78 0
10 90 0
11 34 0
12 56 1
13 78 1
14 89 1
15 34 1
If you want to retain duplicate occurrences, drop the list(set()) wrapper around the list comprehension.
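The same window selection can also be sketched with a boolean mask, which works on positions rather than index labels and so does not need the index to start from zero. This is a minimal sketch on the question's data; the names changed and keep are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 3, 4, 6, 7, 8, 90, 23, 56, 78, 90, 34, 56, 78, 89, 34, 56],
    'Value1': [0, 0, 0, 0, 0, 2, 2, 2, 2, 0, 0, 0, 1, 1, 1, 1, 1],
})

# True where Value1 differs from the previous row; the first row is not a change.
changed = df['Value1'].ne(df['Value1'].shift()).to_numpy(copy=True)
changed[0] = False

# Mark 3 rows before, the change row, and 3 rows after each change point.
keep = np.zeros(len(df), dtype=bool)
for pos in np.flatnonzero(changed):
    keep[max(pos - 3, 0):pos + 4] = True

out = df[keep]
print(out)
```

Overlapping windows are merged automatically, because each window only sets entries of the mask to True.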

How to write Python code that does cumprod for forward 2 periods with groupby

I want to calculate the return, RET, as the cumulative product over 2 periods (the current and the next period) within each id group.
df['RET'] = df.groupby('id')['trt1m1'].rolling(2,min_periods=2).apply(lambda x:x.prod()).reset_index(0,drop=True)
Expected Result:
id datadate trt1m1 RET
1 20051231 1 2
1 20060131 2 6
1 20060228 3 12
1 20060331 4 16
1 20060430 4 20
1 20060531 5 Nan
2 20061031 10 110
2 20061130 11 165
2 20061231 15 300
2 20070131 20 420
2 20070228 21 Nan
Actual Result:
id datadate trt1m1 RET
1 20051231 1 Nan
1 20060131 2 2
1 20060228 3 6
1 20060331 4 12
1 20060430 4 16
1 20060531 5 20
2 20061031 10 Nan
2 20061130 11 110
2 20061231 15 165
2 20070131 20 300
2 20070228 21 420
The code I used calculates the cumprod over the trailing 2 periods instead of the forward 2 periods.
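One way to get a forward-looking two-period product, sketched here on the data from the tables above, is to multiply each value by the next value within the same id, using a grouped shift(-1). This is a suggestion under the assumption that the rows are already sorted by datadate within each id.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1] * 6 + [2] * 5,
    'datadate': [20051231, 20060131, 20060228, 20060331, 20060430, 20060531,
                 20061031, 20061130, 20061231, 20070131, 20070228],
    'trt1m1': [1, 2, 3, 4, 4, 5, 10, 11, 15, 20, 21],
})

# Next period's value within each id; the last row of every id has no successor.
next_val = df.groupby('id')['trt1m1'].shift(-1)

# Current value times next value; NaN where there is no next period.
df['RET'] = df['trt1m1'] * next_val

print(df)
```

The last row of each id is NaN because there is no next period to multiply with, which matches the expected result.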

find running total on every 7th day in pandas

I have data like this. The first column is the number of days from a starting point. The second column is the value generated on that day.
For example, after day 1 I get $5, after day 2 I get $3, and so on. Some days may have no revenue (for example, day 4), so the day numbers are not consecutive.
data =pd.DataFrame({'day':[1,2,3,5,6,7,8,9,10,11,14,15,17,18,19],
'value':[5,3,7,8,9,4,6,5,2,8,6,7,9,5,2]})
I want to find the total value for every 7-day window.
The output should look like this:
day value
7 36
14 27
21 23
I am using a loop to achieve this. Is there a better, more Pythonic way of doing this?
rows = []
sum_value = 0
for index, row in data.iterrows():
    sum_value += row['value']
    if row['day'] % 7 == 0:
        # Collect plain dicts; DataFrame.append was removed in pandas 2.0.
        rows.append({'day': row['day'], 'sum_value': sum_value})
        sum_value = 0
df = pd.DataFrame(rows)
print(df)
Also, how can I find the sum of the previous 7 days' values at each day (each row)?
Expected output:
day value
1 5
2 8
3 15
5 23
6 32
7 36
8 37
9 39
10 34
and so on...
I hope I did the calculation right. It is basically a running total of the previous 7 days of values. It would be easier if no days were missing from the day column.
Use groupby with a helper Series built by subtracting 1 from day and integer-dividing by 7, then aggregate with last for day and sum for value:
df = data.groupby((data['day'] - 1) // 7 , as_index=False).agg({'day':'last', 'value':'sum'})
print (df)
day value
0 7 36
1 14 27
2 19 23
Details:
print ((data['day'] - 1) // 7)
0 0
1 0
2 0
3 0
4 0
5 0
6 1
7 1
8 1
9 1
10 1
11 2
12 2
13 2
14 2
Name: day, dtype: int64
A similar solution, if you need the day column to show multiples of 7:
df = data.groupby((data['day'] - 1) // 7)['value'].sum().reset_index()
df['day'] = (df['day'] + 1) * 7
print (df)
day value
0 7 36
1 14 27
2 21 23
EDIT: You need rolling with sum, but first it is necessary to add the missing days with reindex, which requires unique values in the day column.
idx = np.arange(data['day'].min(), data['day'].max() + 1)
df = data.set_index('day').reindex(idx).rolling(7, min_periods=1).sum()
df = df[df.index.isin(data['day'])]
print (df)
value
day
1 5.0
2 8.0
3 15.0
5 23.0
6 32.0
7 36.0
8 37.0
9 39.0
10 34.0
11 42.0
14 27.0
15 28.0
17 30.0
18 27.0
19 29.0
If you get:
ValueError: cannot reindex from a duplicate axis
it means there are duplicate day values, and the solution is to aggregate with sum first:
#duplicated day 1
data =pd.DataFrame({'day':[1,1,3,5,6,7,8,9,10,11,14,15,17,18,19],
'value':[5,3,7,8,9,4,6,5,2,8,6,7,9,5,2]})
idx = np.arange(data['day'].min(), data['day'].max() + 1)
df = data.groupby('day')['value'].sum().reindex(idx).rolling(7, min_periods=1).sum()
df = df[df.index.isin(data['day'])]
print (df)
day
1 8.0
3 15.0
5 23.0
6 32.0
7 36.0
8 34.0
9 39.0
10 34.0
11 42.0
14 27.0
15 28.0
17 30.0
18 27.0
19 29.0
Name: value, dtype: float64
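As an alternative that avoids filling in the missing days, the 7-day running total can be sketched with a cumulative sum and np.searchsorted. This is a minimal sketch on the original (non-duplicated) data, assuming day is sorted and unique; the names csum, start and sum_7d are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'day': [1, 2, 3, 5, 6, 7, 8, 9, 10, 11, 14, 15, 17, 18, 19],
                     'value': [5, 3, 7, 8, 9, 4, 6, 5, 2, 8, 6, 7, 9, 5, 2]})

days = data['day'].to_numpy()

# csum[k] is the total of the first k values (csum[0] == 0).
csum = np.concatenate([[0], data['value'].cumsum().to_numpy()])

# For each row, the number of rows whose day is <= day - 7 (outside the window).
start = np.searchsorted(days, days - 7, side='right')

# Window total = everything up to this row minus everything before the window.
data['sum_7d'] = csum[1:] - csum[start]

print(data)
```

For day 8, for example, one row (day 1) falls outside the window, so the total is 42 - 5 = 37, the same value the reindex plus rolling approach produces.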
