Find the number of days gap for a specific ID when it has flag X in another column - python-3.x

I want to calculate the number of days between consecutive rows where the 'flag' column equals 'X', within the same ID.
The dataframe that I have:
ID Date flag
1 1-1-2020 X
1 10-1-2020 null
1 15-1-2020 X
2 1-2-2020 X
2 10-2-2020 X
2 18-2-2020 X
3 15-2-2020 null
The dataframe I want:
ID Date flag no_of_days
1 1-1-2020 X 14
1 10-1-2020 null null
1 15-1-2020 X null
2 1-2-2020 X 9
2 10-2-2020 X 8
2 18-2-2020 X null
3 15-2-2020 null null
Thanks in advance.

First filter the rows with X using boolean indexing, then subtract the shifted column per group with DataFrameGroupBy.shift, and finally convert the timedeltas to days with Series.dt.days:
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df['new'] = df[df['flag'].eq('X')].groupby('ID')['Date'].shift(-1).sub(df['Date']).dt.days
print (df)
ID Date flag new
0 1 2020-01-01 X 14.0
1 1 2020-01-10 NaN NaN
2 1 2020-01-15 X NaN
3 2 2020-02-01 X 9.0
4 2 2020-02-10 X 8.0
5 2 2020-02-18 X NaN
6 3 2020-02-15 NaN NaN
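For reference, the answer above can be run end to end; the DataFrame constructor below is my reconstruction of the sample data, not part of the original answer:

```python
import pandas as pd

# Reconstructed sample data (assumed from the question)
df = pd.DataFrame({
    "ID":   [1, 1, 1, 2, 2, 2, 3],
    "Date": ["1-1-2020", "10-1-2020", "15-1-2020",
             "1-2-2020", "10-2-2020", "18-2-2020", "15-2-2020"],
    "flag": ["X", None, "X", "X", "X", "X", None],
})

df["Date"] = pd.to_datetime(df["Date"], dayfirst=True)
# Keep only the X rows, shift the next X date up within each ID,
# subtract the current date, and convert the timedelta to days
df["new"] = (df[df["flag"].eq("X")]
             .groupby("ID")["Date"]
             .shift(-1)
             .sub(df["Date"])
             .dt.days)
```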

Related

Divide values of rows based on condition which are of running count

Sample of the table for one id; multiple ids exist in the original df.
id legend date running_count
101 X 24-07-2021 3
101 Y 24-07-2021 5
101 X 25-07-2021 4
101 Y 25-07-2021 6
I want to create a new column that divides the running_count values on the basis of id, legend and date: (X/Y) for the date 24-07-2021 for a particular id, and so on.
How shall I perform the calculation?
If X, Y always appear in the same order for each id, it is possible to use:
df['new'] = df['running_count'].div(df.groupby(['id','date'])['running_count'].shift(-1))
print (df)
id legend date running_count new
0 101 X 24-07-2021 3 0.600000
1 101 Y 24-07-2021 5 NaN
2 101 X 25-07-2021 4 0.666667
3 101 Y 25-07-2021 6 NaN
If it is possible to change the output shape:
df1 = df.pivot(index=['id','date'], columns='legend', values='running_count')
df1['new'] = df1['X'].div(df1['Y'])
df1 = df1.reset_index()
print (df1)
legend id date X Y new
0 101 24-07-2021 3 5 0.600000
1 101 25-07-2021 4 6 0.666667
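A runnable sketch of the pivot variant, with the sample table rebuilt by hand (the constructor is an assumption; pivot with a list-like index needs pandas >= 1.1):

```python
import pandas as pd

# Reconstructed sample data (assumed from the question)
df = pd.DataFrame({
    "id": [101, 101, 101, 101],
    "legend": ["X", "Y", "X", "Y"],
    "date": ["24-07-2021", "24-07-2021", "25-07-2021", "25-07-2021"],
    "running_count": [3, 5, 4, 6],
})

# One row per (id, date), with X and Y as columns, then divide
df1 = df.pivot(index=["id", "date"], columns="legend", values="running_count")
df1["new"] = df1["X"].div(df1["Y"])
df1 = df1.reset_index()
```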

Join with column having the max sequence number

I have a margin table
item margin
0 a 3
1 b 4
2 c 5
and an item table
item sequence
0 a 1
1 a 2
2 a 3
3 b 1
4 b 2
5 c 1
6 c 2
7 c 3
I want to join the two tables so that the margin is only joined to the row with the maximum sequence number; the desired outcome is
item sequence margin
0 a 1 NaN
1 a 2 NaN
2 a 3 3.0
3 b 1 NaN
4 b 2 4.0
5 c 1 NaN
6 c 2 NaN
7 c 3 5.0
How to achieve this?
Below is the code for margin and item table
import pandas as pd
df_margin=pd.DataFrame({"item":["a","b","c"],"margin":[3,4,5]})
df_item=pd.DataFrame({"item":["a","a","a","b","b","c","c","c"],"sequence":[1,2,3,1,2,1,2,3]})
One option would be to merge then replace extra values with NaN via Series.where:
new_df = df_item.merge(df_margin)
new_df['margin'] = new_df['margin'].where(
new_df.groupby('item')['sequence'].transform('max').eq(new_df['sequence'])
)
Or with loc (np.nan needs numpy imported):
import numpy as np

new_df = df_item.merge(df_margin)
new_df.loc[new_df.groupby('item')['sequence']
           .transform('max').ne(new_df['sequence']), 'margin'] = np.nan
Another option would be to assign a temporary column to both frames: in df_item it is True where sequence is maximal, and in df_margin it is True everywhere. Then merge with how='outer' and drop the temporary column:
new_df = (
    df_item.assign(
        t=df_item
        .groupby('item')['sequence']
        .transform('max')
        .eq(df_item['sequence'])
    ).merge(df_margin.assign(t=True), how='outer').drop(columns='t')
)
Both produce new_df:
item sequence margin
0 a 1 NaN
1 a 2 NaN
2 a 3 3.0
3 b 1 NaN
4 b 2 4.0
5 c 1 NaN
6 c 2 NaN
7 c 3 5.0
You could do:
df_item.merge(df_item.groupby('item')['sequence'].max()
              .reset_index().merge(df_margin), 'left')
item sequence margin
0 a 1 NaN
1 a 2 NaN
2 a 3 3.0
3 b 1 NaN
4 b 2 4.0
5 c 1 NaN
6 c 2 NaN
7 c 3 5.0
Breakdown:
df_new = df_item.groupby('item')['sequence'].max().reset_index().merge(df_margin)
df_item.merge(df_new, 'left')
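The Series.where approach from the first answer, assembled into one runnable snippet using the constructors given in the question:

```python
import pandas as pd

df_margin = pd.DataFrame({"item": ["a", "b", "c"], "margin": [3, 4, 5]})
df_item = pd.DataFrame({"item": ["a", "a", "a", "b", "b", "c", "c", "c"],
                        "sequence": [1, 2, 3, 1, 2, 1, 2, 3]})

new_df = df_item.merge(df_margin)
# Keep margin only where sequence is the per-item maximum
is_max = new_df.groupby("item")["sequence"].transform("max").eq(new_df["sequence"])
new_df["margin"] = new_df["margin"].where(is_max)
```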

pandas groupby and widen dataframe with ordered columns

I have a long-form dataframe that contains multiple samples and time points for each subject. The number of samples and time points can vary, and the days between time points can also vary:
test_df = pd.DataFrame({"subject_id":[1,1,1,2,2,3],
"sample":["A", "B", "C", "D", "E", "F"],
"timepoint":[19,11,8,6,2,12],
"time_order":[3,2,1,2,1,1]
})
subject_id sample timepoint time_order
0 1 A 19 3
1 1 B 11 2
2 1 C 8 1
3 2 D 6 2
4 2 E 2 1
5 3 F 12 1
I need to figure out a way to generalize grouping this dataframe by subject_id and putting all samples and time points on the same row, in time order.
DESIRED OUTPUT:
subject_id sample1 timepoint1 sample2 timepoint2 sample3 timepoint3
0 1 C 8 B 11 A 19
1 2 E 2 D 6 null null
5 3 F 12 null null null null
Pivot gets me close, but I'm stuck on how to proceed from there:
test_df = test_df.pivot(index=['subject_id', 'sample'],
columns='time_order', values='timepoint')
Use DataFrame.set_index with DataFrame.unstack for pivoting, sort the MultiIndex in the columns, flatten it, and finally convert subject_id back to a column:
df = (test_df.set_index(['subject_id', 'time_order'])
.unstack()
.sort_index(level=[1,0], axis=1))
df.columns = df.columns.map(lambda x: f'{x[0]}{x[1]}')
df = df.reset_index()
print (df)
subject_id sample1 timepoint1 sample2 timepoint2 sample3 timepoint3
0 1 C 8.0 B 11.0 A 19.0
1 2 E 2.0 D 6.0 NaN NaN
2 3 F 12.0 NaN NaN NaN NaN
a = test_df.iloc[:, :3].groupby('subject_id').last().add_suffix('1')
b = test_df.iloc[:, :3].groupby('subject_id').nth(-2).add_suffix('2')
c = test_df.iloc[:, :3].groupby('subject_id').nth(-3).add_suffix('3')
pd.concat([a, b, c], axis=1)
sample1 timepoint1 sample2 timepoint2 sample3 timepoint3
subject_id
1 C 8 B 11.0 A 19.0
2 E 2 D 6.0 NaN NaN
3 F 12 NaN NaN NaN NaN
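The unstack answer, stitched together as a self-contained script (the column-flattening lambda is rewritten as a comprehension; the behavior is unchanged):

```python
import pandas as pd

test_df = pd.DataFrame({"subject_id": [1, 1, 1, 2, 2, 3],
                        "sample": ["A", "B", "C", "D", "E", "F"],
                        "timepoint": [19, 11, 8, 6, 2, 12],
                        "time_order": [3, 2, 1, 2, 1, 1]})

wide = (test_df.set_index(["subject_id", "time_order"])
        .unstack()
        .sort_index(level=[1, 0], axis=1))  # order columns by time_order first
# Flatten ('sample', 1) -> 'sample1', ('timepoint', 1) -> 'timepoint1', ...
wide.columns = [f"{name}{order}" for name, order in wide.columns]
wide = wide.reset_index()
```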

Create Python function to look for ANY NULL value in a group

I am trying to write a function that will check a specified column for nulls, within a group in a dataframe. The example dataframe has two columns, ID and VALUE. Multiple rows exist per ID. I want to know if ANY of the rows for a particular ID have a NULL value in VALUE.
I have tried building the function with iterrows().
df = pd.DataFrame({'ID':[1,2,2,3,3,3],
                   'VALUE':[50,None,30,20,10,None]})

def nullValue(col):
    for i, row in col.iterrows():
        if ['VALUE'] is None:
            return 1
        else:
            return 0

df2 = df.groupby('ID').apply(nullValue)
df2.columns = ['ID','VALUE','isNULL']
df2
I am expecting to retrieve a dataframe with three columns: ID, VALUE, and isNULL. If any row in a grouped ID has a null, all of the rows for that ID should have a 1 under isNULL.
Example:
ID VALUE isNULL
1 50.0 0
2 NaN 1
2 30.0 1
3 20.0 1
3 10.0 1
3 NaN 1
A quick solution, borrowed partially from this answer is to use groupby with transform:
df = pd.DataFrame({'ID':[1,2,2,3,3,3,3],
'VALUE':[50,None,None,30,20,10,None]})
df['isNULL'] = (df.VALUE.isnull().groupby([df['ID']]).transform('sum') > 0).astype(int)
Out[51]:
ID VALUE isNULL
0 1 50.0 0
1 2 NaN 1
2 2 NaN 1
3 3 30.0 1
4 3 20.0 1
5 3 10.0 1
6 3 NaN 1
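An equivalent formulation of the accepted idea uses transform('any') instead of summing and comparing to zero; this variant is my own suggestion, not taken from the answer above:

```python
import pandas as pd

df = pd.DataFrame({"ID": [1, 2, 2, 3, 3, 3],
                   "VALUE": [50, None, 30, 20, 10, None]})

# True for every row of an ID if any VALUE in that ID is missing
df["isNULL"] = (df["VALUE"].isna()
                .groupby(df["ID"])
                .transform("any")
                .astype(int))
```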

Pandas dataframe: Number of dates in group prior to row date

I would like to add a column in a dataframe that contains for each group G the number of distinct observations in variable x that happened before time t.
Note: t is in datetime format and missing values in the data are possible but can be ignored. The same x can appear multiple times in a group but then it is assigned the same date. The time assigned to x is not the same across groups.
I hope this example helps:
Input:
Group x t
1 a 2013-11-01
1 b 2015-04-03
1 b 2015-04-03
1 c NaT
2 a 2017-03-01
2 c 2013-11-06
2 d 2015-04-26
2 d 2015-04-26
2 d 2015-04-26
2 b NaT
Output:
Group x t Number of unique x before time t
1 a 2013-11-01 0
1 b 2015-04-03 1
1 b 2015-04-03 1
1 c NaT NaN
2 a 2017-03-01 2
2 c 2013-11-06 0
2 d 2015-04-26 1
2 d 2015-04-26 1
2 d 2015-04-26 1
2 b NaT NaN
The dataset is quite large, so I wonder if there is any vectorized way to do this (e.g. using groupby).
Many Thanks
Here's another method.
The initial sort makes it so fillna will work later on.
Create df2, which calculates the unique number of days within each group before that date.
Merge the num_b4 count back to the original df. fillna then takes care of the duplicated dates (the sort ensures this happens properly)
Dates with NaT were placed at the end for the cumsum so just reset them to NaN
If you want to reorder at the end to the original order, just sort the index df.sort_index(inplace=True)
import pandas as pd
import numpy as np
df['t'] = pd.to_datetime(df.t)  # convert before sorting so dates order chronologically
df = df.sort_values(by=['Group', 't'])
df2 = df[df.t.notnull()].drop_duplicates().copy()
df2['temp'] = 1
df2['num_b4'] = df2.groupby('Group').temp.cumsum() - 1
df = df.merge(df2[['num_b4']], left_index=True, right_index=True, how='left')
df['num_b4'] = df['num_b4'].ffill()
df.loc[df.t.isnull(), 'num_b4'] = np.nan
# Group x t num_b4
#0 1 a 2013-11-01 0.0
#1 1 b 2015-04-03 1.0
#2 1 b 2015-04-03 1.0
#3 1 c NaT NaN
#5 2 c 2013-11-06 0.0
#6 2 d 2015-04-26 1.0
#7 2 d 2015-04-26 1.0
#8 2 d 2015-04-26 1.0
#4 2 a 2017-03-01 2.0
#9 2 b NaT NaN
IIUC, for the new cases you want to change a single line in the above code.
# df2 = df2.drop_duplicates()
df2 = df2.drop_duplicates(['Group', 't'])
With that, the same day that has multiple x values assigned to it does not cause the number of observations to increment. See the output for Group 3 below, in which I added 4 rows to your initial data.
Group x t
3 a 2015-04-03
3 b 2015-04-03
3 c 2015-04-03
3 c 2015-04-04
## Apply the Code changing the drop_duplicates() line
Group x t num_b4
0 1 a 2013-11-01 0.0
1 1 b 2015-04-03 1.0
2 1 b 2015-04-03 1.0
3 1 c NaT NaN
5 2 c 2013-11-06 0.0
6 2 d 2015-04-26 1.0
7 2 d 2015-04-26 1.0
8 2 d 2015-04-26 1.0
4 2 a 2017-03-01 2.0
9 2 b NaT NaN
10 3 a 2015-04-03 0.0
11 3 b 2015-04-03 0.0
12 3 c 2015-04-03 0.0
13 3 c 2015-04-04 1.0
You can do it like this, with a custom function that uses merge to do a self-join, then groupby and nunique to count unique values:
def countunique(x):
    df_out = (x.merge(x, on='Group')
               .query('x_x != x_y and t_y < t_x')
               .groupby(['x_x', 't_x'])['x_y'].nunique()
               .reset_index())
    result = x.merge(df_out, left_on=['x', 't'],
                     right_on=['x_x', 't_x'],
                     how='left')
    result = result[['Group', 'x', 't', 'x_y']]
    result.loc[result.t.notnull(), 'x_y'] = result.loc[result.t.notnull(), 'x_y'].fillna(0)
    return result.rename(columns={'x_y': 'No of unique x before t'})

df.groupby('Group', group_keys=False).apply(countunique)
Output:
Group x t No of unique x before t
0 1 a 2013-11-01 0.0
1 1 b 2015-04-03 1.0
2 1 b 2015-04-03 1.0
3 1 c NaT NaN
0 2 a 2017-03-01 2.0
1 2 c 2013-11-06 0.0
2 2 d 2015-04-26 1.0
3 2 d 2015-04-26 1.0
4 2 d 2015-04-26 1.0
5 2 b NaT NaN
Explanation:
For each group,
Perform a self-join using merge on 'Group'.
Filter the result of the self-join to keep only times before the current record.
Use groupby with nunique to count only unique values of x from the self-join.
Merge the count of x back to the original dataframe, keeping all rows with how='left'.
Fill NaN values with zero where the record has a time.
Rename the column headings.
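As a side note: because every x within a group maps to a single date, the whole computation collapses to a dense rank of t within each group, minus one. This is my own alternative, not taken from either answer; NaT rows stay NaN automatically:

```python
import pandas as pd

# Reconstructed sample data (assumed from the question)
df = pd.DataFrame({
    "Group": [1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
    "x": ["a", "b", "b", "c", "a", "c", "d", "d", "d", "b"],
    "t": ["2013-11-01", "2015-04-03", "2015-04-03", None,
          "2017-03-01", "2013-11-06", "2015-04-26", "2015-04-26", "2015-04-26", None],
})
df["t"] = pd.to_datetime(df["t"])

# Dense rank of the date within each group: 0 distinct dates precede the
# earliest date, 1 precedes the second-earliest, and so on
df["num_b4"] = df.groupby("Group")["t"].rank(method="dense") - 1
```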
