Create a column in pandas dataframes based on conditionals

I have a pandas dataframe as below:
import pandas as pd
import numpy as np
import datetime
# Initialise the data as a dict of lists.
data = {'month' :[2,3,4,5,6,7,2,3,6,5],
        'flag': ["A","A","A","A","A","A","B","B","B","B"],
        'month1' :[4,4,7,15,11,13,6,5,6,5],
        'value' :[100,20,50,10,65,86,24,12,1000,200]
        }
# Create DataFrame
df = pd.DataFrame(data)
# Print the output.
df
month flag month1 value
0 2 A 4 100
1 3 A 4 20
2 4 A 7 50
3 5 A 15 10
4 6 A 11 65
5 7 A 13 86
6 2 B 6 24
7 3 B 5 12
8 6 B 6 1000
9 5 B 5 200
Now, for each group of rows sharing the same flag, I want to apply the logic below:
1) Create a column "final" and set it to 0.
2) For each row, if month1 <= max(month) of the group, find the row in the same group where month == month1 and add this row's value to that row's "final". For example:
Indexes 0 to 5 form one group (flag = 'A').
The max of the month column for group A is 7.
For the first row (index 0, month 2), month1 is 4, which is less than 7, so go to the row with month 4 (index 2) and update its "final" to 100 (0, the current "final" value, plus 100, the value from the original row).
Repeat this step for every row in the group.
Expected output:
month flag month1 value Final
0 2 A 4 100 0
1 3 A 4 20 0
2 4 A 7 50 120
3 5 A 15 10 0
4 6 A 11 65 0
5 7 A 13 86 50
6 2 B 6 24 0
7 3 B 5 12 0
8 6 B 6 1000 1024
9 5 B 5 200 212
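The steps described above can be sketched as a direct loop. This is only a literal reading of the stated rules, not an efficient solution; the names `final` and `max_month` are illustrative, and the sketch assumes each value is added to every row of the same flag whose month equals month1.

```python
import pandas as pd

df = pd.DataFrame({
    'month':  [2, 3, 4, 5, 6, 7, 2, 3, 6, 5],
    'flag':   ["A", "A", "A", "A", "A", "A", "B", "B", "B", "B"],
    'month1': [4, 4, 7, 15, 11, 13, 6, 5, 6, 5],
    'value':  [100, 20, 50, 10, 65, 86, 24, 12, 1000, 200],
})

# Step 1: start every row's "final" at 0 (kept in a separate Series while looping).
final = pd.Series(0, index=df.index)

# Step 2: within each flag, push each row's value to the row whose month == month1.
for flag, grp in df.groupby('flag'):
    max_month = grp['month'].max()
    for _, row in grp.iterrows():
        if row['month1'] <= max_month:
            target = (df['flag'] == flag) & (df['month'] == row['month1'])
            final.loc[target] += row['value']

df['final'] = final
print(df)
```

On the sample data this should reproduce the expected Final column, but the nested loop will be slow on large frames.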

Define the following functions:
A function to be applied to each row (in the current group):
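def fn(row, tbl, maxMonth):
    # Sum the values of the rows in this group whose month1 points at this row's month.
    # maxMonth is not used: a match on an existing month already implies month1 <= max(month).
    return tbl[tbl.month1 == row.month].value.sum()
A function to be applied to each group:
def fnGrp(grp):
    return grp.apply(fn, axis=1, tbl=grp, maxMonth=grp.month.max())
Then, to compute the final column, group df by flag, apply
fnGrp to each group, and save the result in the final column: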
df['final'] = df.groupby('flag').apply(fnGrp).reset_index(level=0, drop=True)
The result (df with added column) is:
month flag month1 value final
0 2 A 4 100 0
1 3 A 4 20 0
2 4 A 7 50 120
3 5 A 15 10 0
4 6 A 11 65 0
5 7 A 13 86 50
6 2 B 6 24 0
7 3 B 5 12 0
8 6 B 6 1000 1024
9 5 B 5 200 212

You can group by 'flag' and 'month1' and sum 'value', then merge the result with df on 'flag' and 'month', filling rows without a match with 0:
new_df = df.merge(df.groupby(['flag', 'month1'])[['value']].sum(),
                  left_on=['flag','month'], right_index=True,
                  how='left', suffixes=('','_final'))\
           .fillna({'value_final':0})
print (new_df)
month flag month1 value value_final
0 2 A 4 100 0.0
1 3 A 4 20 0.0
2 4 A 7 50 120.0
3 5 A 15 10 0.0
4 6 A 11 65 0.0
5 7 A 13 86 50.0
6 2 B 6 24 0.0
7 3 B 5 12 0.0
8 6 B 6 1000 1024.0
9 5 B 5 200 212.0
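The left merge introduces NaN for unmatched rows, which is why value_final comes back as float. A possible variant, sketched here under the assumption that a plain lookup is enough, reindexes the grouped sums by each row's (flag, month) pair with `fill_value=0`, so the column should stay integer. The names `sums` and `keys` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'month':  [2, 3, 4, 5, 6, 7, 2, 3, 6, 5],
    'flag':   ["A", "A", "A", "A", "A", "A", "B", "B", "B", "B"],
    'month1': [4, 4, 7, 15, 11, 13, 6, 5, 6, 5],
    'value':  [100, 20, 50, 10, 65, 86, 24, 12, 1000, 200],
})

# Total 'value' for every (flag, month1) pair.
sums = df.groupby(['flag', 'month1'])['value'].sum()

# Look up each row's (flag, month) pair in those totals; pairs without a match get 0.
keys = pd.MultiIndex.from_frame(df[['flag', 'month']])
df['final'] = sums.reindex(keys, fill_value=0).to_numpy()
print(df)
```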

Related

Adding timepoints to a multirow dataframe based on ID and date

As the title says it, my dataframe looks as follows:
ID  Follow up month  Value-x  value -y
1   0                12       12
1   0                11       14
2   0                10       11
2   3                11       0
2   0                12       1
1   3                13       12
2   3                11       5
I want to add another column called Timepoint, which would make the table look as follows:
ID  Follow up month  Value-x  value -y  Timepoint
1   0                12       12        1
1   0                11       14        1
2   0                10       11        1
2   3                11       0         2
2   0                12       1         1
1   3                13       12        2
2   3                11       5         2
So far I have tried grouping the rows by ID and follow-up month and then assigning a timepoint using cumcount, but this didn't give me the result I wanted. Any help on how to handle this would be appreciated.

From your table I can only infer that you want to create the Timepoint column based on the corresponding values in Follow up month, which will look like:
from io import StringIO
import pandas as pd
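# Columns are separated by two or more spaces, because some column names contain single spaces.
wt = StringIO("""ID  Follow up month  Value-x  value -y
1  0  12  12
1  0  11  14
2  0  10  11
2  3  11  0
2  0  12  1
1  3  13  12
2  3  11  5""")
# Raw string for the regex separator; regex separators need the python parsing engine.
df = pd.read_csv(wt, sep=r'\s\s+', engine='python')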
df['Timepoint'] = df['Follow up month'].apply(lambda x: 1 if x==0 else 2)
df
Output:
ID Follow up month Value-x value -y Timepoint
0 1 0 12 12 1
1 1 0 11 14 1
2 2 0 10 11 1
3 2 3 11 0 2
4 2 0 12 1 1
5 1 3 13 12 2
6 2 3 11 5 2
Edit
Based on your comment, this should be what you want:
def timepoint(s):
    if not s.isin([0]).any() and s.iloc[0] == 3:
        return 1
    else:
        return s.apply(lambda x: 1 if x==0 else 2)
df['Timepoint'] = df.groupby('ID')['Follow up month'].transform(timepoint)
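If the intent is that Timepoint numbers the distinct follow-up months within each ID in ascending order (an assumption, since the question does not state the rule explicitly), a dense rank per group gives the same result on the sample data. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 1, 2, 2, 2, 1, 2],
    'Follow up month': [0, 0, 0, 3, 0, 3, 3],
    'Value-x': [12, 11, 10, 11, 12, 13, 11],
    'value -y': [12, 14, 11, 0, 1, 12, 5],
})

# Dense rank: the smallest follow-up month of each ID gets 1, the next distinct one 2, ...
df['Timepoint'] = (df.groupby('ID')['Follow up month']
                     .rank(method='dense')
                     .astype(int))
print(df)
```

Under this reading, an ID whose only follow-up month is 3 also gets Timepoint 1, which matches the special case handled in the edit above.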

How to get rows when specific column value is continuous for certain number of rows

I want to extract rows when the column x value remains the same for more than five consecutive rows.
x x2
0 5 5
1 4 5
2 10 6
3 10 5
4 10 6
5 10 78
6 10 89
7 10 78
8 10 98
9 10 8
10 10 56
11 60 45
12 10 65
Desired_output:
x x2
0 10 6
1 10 5
2 10 6
3 10 78
4 10 89
5 10 78
6 10 98
7 10 8
8 10 56

You can use .shift + .cumsum to identify the blocks of consecutive rows where column x value remains same, then group the dataframe on these blocks and transform using count to identify the groups which have greater than 5 consecutive same values in x:
b = df['x'].ne(df['x'].shift()).cumsum()
df_out = df[df['x'].groupby(b).transform('count').gt(5)]
Details:
>>> b
0 1
1 2
2 3
3 3
4 3
5 3
6 3
7 3
8 3
9 3
10 3
11 4
12 5
Name: x, dtype: int64
>>> df_out
x x2
2 10 6
3 10 5
4 10 6
5 10 78
6 10 89
7 10 78
8 10 98
9 10 8
10 10 56
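The same shift + cumsum idea can be wrapped in a small helper. This is a sketch: the function name `rows_in_long_runs` and its parameters are made up for illustration, and "more than five" is expressed as a minimum run length of 6.

```python
import pandas as pd

def rows_in_long_runs(df, col, min_len):
    """Keep rows that belong to a run of at least `min_len` equal consecutive values in `col`."""
    # A new run starts wherever the value differs from the previous row.
    run_id = df[col].ne(df[col].shift()).cumsum()
    # Length of the run each row belongs to.
    run_len = run_id.groupby(run_id).transform('size')
    return df[run_len >= min_len]

df = pd.DataFrame({
    'x':  [5, 4, 10, 10, 10, 10, 10, 10, 10, 10, 10, 60, 10],
    'x2': [5, 5, 6, 5, 6, 78, 89, 78, 98, 8, 56, 45, 65],
})
out = rows_in_long_runs(df, 'x', min_len=6)
print(out)
```

Add `.reset_index(drop=True)` to the result if the output should be renumbered from 0, as in the desired output of the question.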

You can use shift to compare each row with the previous one, take the cumulative sum of the matches to check whether the number of repeats exceeds 5, group on x and transform with any, and finally mask with (c | c.shift(-1)) to drop the rows that are not part of a run:
c = df['x'].eq(df['x'].shift())
out = df[c.cumsum().gt(5).groupby(df['x']).transform('any') & (c|c.shift(-1))]
print(out)
x x2
2 10 6
3 10 5
4 10 6
5 10 78
6 10 89
7 10 78
8 10 98
9 10 8
10 10 56

How to save rows when value change in column python

I have a DataFrame with two columns, ID and Value1. I want to select the rows around each point where the value of column Value1 changes: the 3 rows before the change, the 3 rows after it, and the change-point row itself.
df=pd.DataFrame({'ID':[1,3,4,6,7,8,90,23,56,78,90,34,56,78,89,34,56],'Value1':[0,0,0,0,0,2,2,2,2,0,0,0,1,1,1,1,1]})
ID Value1
0 1 0
1 3 0
2 4 0
3 6 0
4 7 0
5 8 2
6 90 2
7 23 2
8 56 2
9 78 0
10 90 0
11 34 0
12 56 1
13 78 1
14 89 1
15 34 1
16 56 1
output:
ID Value1
0 4 0
1 6 0
2 7 0
3 8 2
4 90 2
5 23 2
6 90 2
7 23 2
8 56 2
9 78 0
10 90 0
11 34 0

IIUC,
import numpy as np
import pandas as pd
df=pd.DataFrame({'ID':[1,3,4,6,7,8,90,23,56,78,90,34,56,78,89,34,56],'Value1':[0,0,0,0,0,2,2,2,2,0,0,0,1,1,1,1,1]})
df = df.reset_index(drop=True) # the index needs to start from zero for this solution
ind = list(set([val for i in df[df['Value1'].diff()!=0].index for val in range(i-3, i+4) if i>0 and val>=0]))
# diff gives the difference between consecutive rows of the column; the nested
# comprehension collects the 3 rows on either side of each change, and
# list(set()) drops any duplicates in the index values
df[df.index.isin(ind)]
ID Value1
2 4 0
3 6 0
4 7 0
5 8 2
6 90 2
7 23 2
8 56 2
9 78 0
10 90 0
11 34 0
12 56 1
13 78 1
14 89 1
15 34 1
If you want to retain occurrences of duplicates, drop the list(set()) function over the list
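The window around each change point can also be expressed as a boolean mask instead of a list of index values. The sketch below assumes the same rule (3 rows before, the change row, 3 rows after) and uses a centered rolling window of 7 rows; the names `changed` and `near_change` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 3, 4, 6, 7, 8, 90, 23, 56, 78, 90, 34, 56, 78, 89, 34, 56],
    'Value1': [0, 0, 0, 0, 0, 2, 2, 2, 2, 0, 0, 0, 1, 1, 1, 1, 1],
})

# True where Value1 differs from the previous row; the first row is not a change.
changed = df['Value1'].ne(df['Value1'].shift())
changed.iloc[0] = False

# A row is kept if any change point lies within 3 rows before or after it.
near_change = (changed.astype(int)
                      .rolling(window=7, center=True, min_periods=1)
                      .max()
                      .astype(bool))
out = df[near_change]
print(out)
```

Because the mask works on positions, it does not depend on the index starting from zero.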

How to write Python code that does cumprod for forward 2 periods with groupby

I want to calculate the return, RET, as the cumulative product over 2 periods (the current and the next period), grouped by id.
df['RET'] = df.groupby('id')['trt1m1'].rolling(2,min_periods=2).apply(lambda x:x.prod()).reset_index(0,drop=True)
Expected Result:
id datadate trt1m1 RET
1 20051231 1 2
1 20060131 2 6
1 20060228 3 12
1 20060331 4 16
1 20060430 4 20
1 20060531 5 NaN
2 20061031 10 110
2 20061130 11 165
2 20061231 15 300
2 20070131 20 420
2 20070228 21 NaN
Actual Result:
id datadate trt1m1 RET
1 20051231 1 NaN
1 20060131 2 2
1 20060228 3 6
1 20060331 4 12
1 20060430 4 16
1 20060531 5 20
2 20061031 10 NaN
2 20061130 11 110
2 20061231 15 165
2 20070131 20 300
2 20070228 21 420
The code I used calculates the cumulative product over the trailing 2 periods instead of the forward 2 periods.
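One way to get the forward-looking product, sketched under the assumption that the rows are already sorted by datadate within each id, is to multiply each value by the next value of the same group obtained with `groupby(...).shift(-1)`. The last row of each id has no next period and therefore becomes NaN.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'datadate': [20051231, 20060131, 20060228, 20060331, 20060430, 20060531,
                 20061031, 20061130, 20061231, 20070131, 20070228],
    'trt1m1': [1, 2, 3, 4, 4, 5, 10, 11, 15, 20, 21],
})

# Current value times the next value within the same id (NaN on each group's last row).
df['RET'] = df['trt1m1'] * df.groupby('id')['trt1m1'].shift(-1)
print(df)
```

For a window longer than 2 periods, the same idea extends by multiplying further shifted columns (`shift(-2)`, and so on).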

find running total on every 7th day in pandas

I have data like this. The first column is the number of days from a starting point. The second column is the value generated on that day.
For example, after 1 day I get $5, after the 2nd day I get $3, and so on. Some days, such as the 4th day, have no revenue, so the day numbers are not consecutive.
data =pd.DataFrame({'day':[1,2,3,5,6,7,8,9,10,11,14,15,17,18,19],
                    'value':[5,3,7,8,9,4,6,5,2,8,6,7,9,5,2]})
I want to find the total value for every 7-day window.
The output should look like:
day value
7 36
14 27
21 23
I am using a loop to achieve this. Is there a better, more Pythonic way of doing this?
rows = []
sum_value = 0
for index, row in data.iterrows():
    sum_value += row['value']
    if row['day'] % 7 == 0:
        # DataFrame.append was removed in pandas 2.0, so collect the rows in a list instead
        rows.append({'day': row['day'], 'sum_value': sum_value})
        sum_value = 0
df = pd.DataFrame(rows)
print(df)
Also, how can I find the sum of the previous 7 days' values at each day (each row)?
Expected output:
day value
1 5
2 8
3 15
5 23
6 32
7 36
8 37
9 39
10 34
and so on...
I hope I did the calculation right. It is basically a running total of the previous 7 days of values. It would be easier if there were no missing numbers in the day column.

Use groupby with a helper Series built by subtracting 1 from day and integer-dividing by 7, then aggregate with last for day and sum for value:
df = data.groupby((data['day'] - 1) // 7 , as_index=False).agg({'day':'last', 'value':'sum'})
print (df)
day value
0 7 36
1 14 27
2 19 23
Details:
print ((data['day'] - 1) // 7)
0 0
1 0
2 0
3 0
4 0
5 0
6 1
7 1
8 1
9 1
10 1
11 2
12 2
13 2
14 2
Name: day, dtype: int64
A similar solution if you need the day column to show the end of each 7-day window (a multiple of 7):
df = data.groupby((data['day'] - 1) // 7)['value'].sum().reset_index()
df['day'] = (df['day'] + 1) * 7
print (df)
day value
0 7 36
1 14 27
2 21 23
EDIT: For the second part you need rolling with sum, but first the missing days have to be added with reindex, which requires unique values in the day column.
idx = np.arange(data['day'].min(), data['day'].max() + 1)
df = data.set_index('day').reindex(idx).rolling(7, min_periods=1).sum()
df = df[df.index.isin(data['day'])]
print (df)
value
day
1 5.0
2 8.0
3 15.0
5 23.0
6 32.0
7 36.0
8 37.0
9 39.0
10 34.0
11 42.0
14 27.0
15 28.0
17 30.0
18 27.0
19 29.0
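The same running total can be sketched without filling in the missing days, by turning the day numbers into dates and using a time-based rolling window. The anchor date below is arbitrary and purely hypothetical (only the spacing between days matters), and the column name `sum_prev_7` is illustrative. A `'7D'` window covers the current day and the six days before it.

```python
import pandas as pd

data = pd.DataFrame({'day': [1, 2, 3, 5, 6, 7, 8, 9, 10, 11, 14, 15, 17, 18, 19],
                     'value': [5, 3, 7, 8, 9, 4, 6, 5, 2, 8, 6, 7, 9, 5, 2]})

# Hypothetical anchor date: day n becomes anchor + n days.
anchor = pd.Timestamp('2000-01-01')
dates = anchor + pd.to_timedelta(data['day'], unit='D')

# Series of values indexed by date (the index must be sorted for a time-based window).
s = pd.Series(data['value'].to_numpy(), index=pd.DatetimeIndex(dates))

# Sum over the 7 calendar days ending at each row's day.
data['sum_prev_7'] = s.rolling('7D').sum().to_numpy()
print(data)
```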
If you get:
ValueError: cannot reindex from a duplicate axis
it means there are duplicate day values, and the solution is to aggregate them with sum first:
#duplicated day 1
data =pd.DataFrame({'day':[1,1,3,5,6,7,8,9,10,11,14,15,17,18,19],
'value':[5,3,7,8,9,4,6,5,2,8,6,7,9,5,2]})
idx = np.arange(data['day'].min(), data['day'].max() + 1)
df = data.groupby('day')['value'].sum().reindex(idx).rolling(7, min_periods=1).sum()
df = df[df.index.isin(data['day'])]
print (df)
day
1 8.0
3 15.0
5 23.0
6 32.0
7 36.0
8 34.0
9 39.0
10 34.0
11 42.0
14 27.0
15 28.0
17 30.0
18 27.0
19 29.0
Name: value, dtype: float64
