calculate difference between consecutive date records at an ID level - python-3.x

I have a dataframe as
col 1 col 2
A 2020-07-13
A 2020-07-15
A 2020-07-18
A 2020-07-19
B 2020-07-13
B 2020-07-19
C 2020-07-13
C 2020-07-18
I want it to become the following in a new dataframe
col_3 diff_btw_1st_2nd_date diff_btw_2nd_3rd_date diff_btw_3rd_4th_date
A 2 3 1
B 6 NaN NaN
C 5 NaN NaN
I tried a groupby at the col 1 level, but I'm not getting the intended result. Can anyone help?

Use GroupBy.cumcount to create a counter per col 1 group and reshape with DataFrame.set_index plus Series.unstack, then apply DataFrame.diff, drop the first all-NaN column with DataFrame.iloc, convert the timedeltas to days with Series.dt.days in every column, and rename the columns with DataFrame.add_prefix:
import pandas as pd

df['col 2'] = pd.to_datetime(df['col 2'])
df = (df.set_index(['col 1', df.groupby('col 1').cumcount()])['col 2']
        .unstack()
        .diff(axis=1)
        .iloc[:, 1:]                 # drop the first, all-NaN column
        .apply(lambda x: x.dt.days)  # convert timedeltas to day counts
        .add_prefix('diff_')
        .reset_index())
print(df)
col 1 diff_1 diff_2 diff_3
0 A 2 3.0 1.0
1 B 6 NaN NaN
2 C 5 NaN NaN
Or use DataFrameGroupBy.diff with a counter column created by DataFrame.assign, drop the NaN rows in c1 with DataFrame.dropna, and reshape with DataFrame.pivot:
df['col 2'] = pd.to_datetime(df['col 2'])
df = (df.assign(g=df.groupby('col 1').cumcount(),
                c1=df.groupby('col 1')['col 2'].diff().dt.days)
        .dropna(subset=['c1'])
        .pivot(index='col 1', columns='g', values='c1')  # keyword args required in pandas 2.x
        .add_prefix('diff_')
        .rename_axis(None, axis=1)
        .reset_index())
print(df)
col 1 diff_1 diff_2 diff_3
0 A 2.0 3.0 1.0
1 B 6.0 NaN NaN
2 C 5.0 NaN NaN
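If you need the exact headers from the question, a rename afterwards would do. A minimal sketch, assuming the reshaped frame above and at most three gaps per ID:
# hypothetical renaming to match the asker's requested headers
wanted = ['diff_btw_1st_2nd_date', 'diff_btw_2nd_3rd_date', 'diff_btw_3rd_4th_date']
df.columns = ['col_3'] + wanted[:df.shape[1] - 1]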

You can assign a cumcount number grouped by col 1, and pivot the table using that cumcount number.
Solution
df["col 2"] = pd.to_datetime(df["col 2"])
# 1. compute date difference in days using diff() and dt accessor
df["diff"] = df.groupby(["col 1"])["col 2"].diff().dt.days
# 2. assign cumcount for pivoting
df["cumcount"] = df.groupby("col 1").cumcount()
# 3. partial transpose, discarding the first difference in nan
df2 = df[["col 1", "diff", "cumcount"]]\
.pivot(index="col 1", columns="cumcount")\
.drop(columns=[("diff", 0)])
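A variant of step 3 that avoids the MultiIndex columns by passing values explicitly (a sketch; the rename below works with either form):
# passing values= yields flat columns named by the cumcount values
df2 = (df[["col 1", "diff", "cumcount"]]
       .pivot(index="col 1", columns="cumcount", values="diff")
       .drop(columns=[0]))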
Result
# replace column names for readability
df2.columns = [f"d{i+2}-d{i+1}" for i in range(len(df2.columns))]
print(df2)
d2-d1 d3-d2 d4-d3
col 1
A 2.0 3.0 1.0
B 6.0 NaN NaN
C 5.0 NaN NaN
df after assigning cumcount looks like this:
print(df)
col 1 col 2 diff cumcount
0 A 2020-07-13 NaN 0
1 A 2020-07-15 2.0 1
2 A 2020-07-18 3.0 2
3 A 2020-07-19 1.0 3
4 B 2020-07-13 NaN 0
5 B 2020-07-19 6.0 1
6 C 2020-07-13 NaN 0
7 C 2020-07-18 5.0 1

Related

Create a new dataframe from specific columns

I have a dataframe and I want to use columns to create new rows in a new dataframe.
>>> df_1
mix_id ngs phr d mp1 mp2 mp1_wt mp2_wt mp1_phr mp2_phr
2 M01 SBR2353 100.0 NaN MES/HPD SBR2353 0.253731 0.746269 25.373134 74.626866
3 M02 SBR2054 80.0 NaN TDAE SBR2054 0.264706 0.735294 21.176471 58.823529
I would like to have a dataframe like this.
>>> df_2
mix_id ngs phr d
1 M01 MES/HPD 25.373134 NaN
2 M01 SBR2353 74.626866 NaN
3 M02 TDAE 21.176471 NaN
4 M02 SBR2054 58.823529 NaN
IIUC
you can use pd.wide_to_long; it does however need the repeating columns to have a number as a suffix. So the first part of the solution just renames the columns to move the number to the end:
import re

# rename e.g. mp1_wt -> mp_wt1 so the numeric suffix sits at the end,
# which is what pd.wide_to_long expects
df.columns = ([col for col in df.columns[:6]]
              + [re.sub(r'\d', '', col) + re.search(r'(\d)', col).group(0)
                 for col in df.columns[6:]])
df2 = (pd.wide_to_long(df, stubnames=['mp', 'mp_wt', 'mp_phr'],
                       i=['mix_id', 'ngs', 'd'], j='val')
         .reset_index().drop(columns='val'))
df2.drop(columns=['ngs', 'phr', 'mp_wt'], inplace=True)
df2.rename(columns={'mp': 'ngs', 'mp_phr': 'phr'}, inplace=True)
df2
mix_id d ngs phr
0 M01 NaN MES/HPD 25.373134
1 M01 NaN SBR2353 74.626866
2 M02 NaN TDAE 21.176471
3 M02 NaN SBR2054 58.823529
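An alternative that avoids the renaming step is to stack the two (name, phr) column pairs by hand. A minimal sketch, assuming the df_1 frame from the question:
# concatenate the two column pairs into long form
import pandas as pd

pairs = [('mp1', 'mp1_phr'), ('mp2', 'mp2_phr')]
df_2 = (pd.concat([df_1[['mix_id', 'd', ngs, phr]]
                     .rename(columns={ngs: 'ngs', phr: 'phr'})
                   for ngs, phr in pairs], ignore_index=True)
          .sort_values('mix_id', ignore_index=True)
          [['mix_id', 'ngs', 'phr', 'd']])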

Join with column having the max sequence number

I have a margin table
item margin
0 a 3
1 b 4
2 c 5
and an item table
item sequence
0 a 1
1 a 2
2 a 3
3 b 1
4 b 2
5 c 1
6 c 2
7 c 3
I want to join the two tables so that the margin is only joined to the product with the maximum sequence number; the desired outcome is
item sequence margin
0 a 1 NaN
1 a 2 NaN
2 a 3 3.0
3 b 1 NaN
4 b 2 4.0
5 c 1 NaN
6 c 2 NaN
7 c 3 5.0
How to achieve this?
Below is the code for the margin and item tables:
import pandas as pd
df_margin = pd.DataFrame({"item": ["a", "b", "c"], "margin": [3, 4, 5]})
df_item = pd.DataFrame({"item": ["a", "a", "a", "b", "b", "c", "c", "c"],
                        "sequence": [1, 2, 3, 1, 2, 1, 2, 3]})
One option would be to merge, then replace the extra values with NaN via Series.where:
new_df = df_item.merge(df_margin)
new_df['margin'] = new_df['margin'].where(
    new_df.groupby('item')['sequence'].transform('max').eq(new_df['sequence'])
)
Or with loc:
import numpy as np

new_df = df_item.merge(df_margin)
new_df.loc[new_df.groupby('item')['sequence']
                 .transform('max').ne(new_df['sequence']), 'margin'] = np.nan
Another option would be to assign a temp column to both frames: in df_item it is True where the sequence is maximal, in df_margin it is True everywhere. Then merge with how='outer' and drop the temp column:
new_df = (
    df_item.assign(
        t=df_item
        .groupby('item')['sequence']
        .transform('max')
        .eq(df_item['sequence'])
    ).merge(df_margin.assign(t=True), how='outer').drop(columns='t')
)
All of these produce new_df:
item sequence margin
0 a 1 NaN
1 a 2 NaN
2 a 3 3.0
3 b 1 NaN
4 b 2 4.0
5 c 1 NaN
6 c 2 NaN
7 c 3 5.0
You could do:
df_item.merge(df_item.groupby('item')['sequence'].max()
                     .reset_index().merge(df_margin), how='left')
item sequence margin
0 a 1 NaN
1 a 2 NaN
2 a 3 3.0
3 b 1 NaN
4 b 2 4.0
5 c 1 NaN
6 c 2 NaN
7 c 3 5.0
Breakdown:
df_new = df_item.groupby('item')['sequence'].max().reset_index().merge(df_margin)
df_item.merge(df_new, how='left')
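Yet another route is GroupBy.idxmax; a sketch that writes the margin onto only each item's max-sequence row:
# locate each item's max-sequence row, then map the margin onto just those rows
idx = df_item.groupby('item')['sequence'].idxmax()
new_df = df_item.copy()
new_df.loc[idx, 'margin'] = new_df.loc[idx, 'item'].map(
    df_margin.set_index('item')['margin'])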

pandas groupby and widen dataframe with ordered columns

I have a long form dataframe that contains multiple samples and time points for each subject. The number of samples and time points can vary, and the days between time points can also vary:
test_df = pd.DataFrame({"subject_id": [1, 1, 1, 2, 2, 3],
                        "sample": ["A", "B", "C", "D", "E", "F"],
                        "timepoint": [19, 11, 8, 6, 2, 12],
                        "time_order": [3, 2, 1, 2, 1, 1]})
subject_id sample timepoint time_order
0 1 A 19 3
1 1 B 11 2
2 1 C 8 1
3 2 D 6 2
4 2 E 2 1
5 3 F 12 1
I need to figure out a way to generalize grouping this dataframe by subject_id and putting all samples and time points on the same row, in time order.
DESIRED OUTPUT:
subject_id sample1 timepoint1 sample2 timepoint2 sample3 timepoint3
0 1 C 8 B 11 A 19
1 2 E 2 D 6 null null
5 3 F 12 null null null null
Pivot gets me close, but I'm stuck on how to proceed from there:
test_df = test_df.pivot(index=['subject_id', 'sample'],
                        columns='time_order', values='timepoint')
Use DataFrame.set_index with DataFrame.unstack for pivoting, sort the MultiIndex columns, flatten them, and finally convert subject_id back to a column:
df = (test_df.set_index(['subject_id', 'time_order'])
             .unstack()
             .sort_index(level=[1, 0], axis=1))
df.columns = df.columns.map(lambda x: f'{x[0]}{x[1]}')
df = df.reset_index()
print(df)
subject_id sample1 timepoint1 sample2 timepoint2 sample3 timepoint3
0 1 C 8.0 B 11.0 A 19.0
1 2 E 2.0 D 6.0 NaN NaN
2 3 F 12.0 NaN NaN NaN NaN
a = test_df.iloc[:, :3].groupby('subject_id').last().add_suffix('1')
b = test_df.iloc[:, :3].groupby('subject_id').nth(-2).add_suffix('2')
c = test_df.iloc[:, :3].groupby('subject_id').nth(-3).add_suffix('3')
pd.concat([a, b, c], axis=1)
sample1 timepoint1 sample2 timepoint2 sample3 timepoint3
subject_id
1 C 8 B 11.0 A 19.0
2 E 2 D 6.0 NaN NaN
3 F 12 NaN NaN NaN NaN
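Note that the nth-based approach depends on the rows arriving in descending time_order within each subject (and GroupBy.nth no longer aggregates one row per group as of pandas 2.0). A version-agnostic sketch that labels each row's position explicitly before pivoting:
# rank rows inside each subject, then pivot on that rank
tmp = test_df.sort_values(['subject_id', 'time_order']).copy()
tmp['pos'] = tmp.groupby('subject_id').cumcount() + 1
out = (tmp.pivot(index='subject_id', columns='pos',
                 values=['sample', 'timepoint'])
          .sort_index(level=[1, 0], axis=1))
out.columns = [f'{name}{pos}' for name, pos in out.columns]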

Copying columns that have NaN values in them and adding a prefix

I have x number of columns that contain NaN values. With the following code I can check that:
for index, value in df.items():
    if value.isnull().values.any():
this shows me, with Boolean values, which columns have NaN. If True, I need to create a new column whose name is the prefix 'Interpolation' plus the name of that column. So, to make it clear: if the column named 'XXX' has NaN, I need to create a new column named 'Interpolation XXX'. Any ideas how to do this?
Something like this:
In [80]: df = pd.DataFrame({'XXX':[1,2,np.nan,4], 'YYY':[1,2,3,4], 'ZZZ':[1,np.nan, np.nan, 4]})
In [81]: df
Out[81]:
XXX YYY ZZZ
0 1.0 1 1.0
1 2.0 2 NaN
2 NaN 3 NaN
3 4.0 4 4.0
In [92]: nan_cols = df.columns[df.isna().any()].tolist()
In [94]: for col in df.columns:
    ...:     if col in nan_cols:
    ...:         df['Interpolation ' + col] = df[col]
    ...:
In [95]: df
Out[95]:
XXX YYY ZZZ Interpolation XXX Interpolation ZZZ
0 1.0 1 1.0 1.0 1.0
1 2.0 2 NaN 2.0 NaN
2 NaN 3 NaN NaN NaN
3 4.0 4 4.0 4.0 4.0
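A more compact equivalent, as a sketch using DataFrame.assign with a dict comprehension:
# one-shot version of the loop above
nan_cols = df.columns[df.isna().any()]
df = df.assign(**{'Interpolation ' + c: df[c] for c in nan_cols})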

DataFrame difference between rows based on multiple columns

I am trying to calculate the difference between rows based on multiple columns. The dataset is very large, so I am pasting dummy data below that describes the problem:
I want to calculate the daily difference in weight at a pet + name level. So far I have only come up with concatenating these columns and creating a MultiIndex based on the new column and the date column, but I think there should be a better way. In the real dataset I have more than 3 columns to use when calculating the row difference.
import numpy as np

df['pet_name'] = df.pet + df.name
df.set_index(['pet_name', 'date'], inplace=True)
df.sort_index(inplace=True)
df['diffs'] = np.nan
for idx in df.index.levels[0]:
    df.diffs[idx] = df.weight[idx].diff()
Based on your description, you can try groupby:
df['pet_name'] = df.pet + df.name
df.groupby('pet_name')['weight'].diff()
Use groupby with 2 columns:
df.groupby(['pet', 'name'])['weight'].diff()
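Grouping by the two columns directly also sidesteps key collisions that plain string concatenation can create: ('ab', 'c') and ('a', 'bc') both concatenate to 'abc'. A tiny demonstration with hypothetical data:
# concatenated keys collide; two-column groupby keys stay distinct
import pandas as pd

tmp = pd.DataFrame({'pet': ['ab', 'a'], 'name': ['c', 'bc'],
                    'weight': [1.0, 2.0]})
print((tmp.pet + tmp.name).nunique())        # 1 -> the keys collide
print(tmp.groupby(['pet', 'name']).ngroups)  # 2 -> distinct groups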
All together:
# convert dates to datetimes
df['date'] = pd.to_datetime(df['date'])
# sort
df = df.sort_values(['pet', 'name', 'date'])
# get differences per (pet, name, date) group
df['diffs'] = df.groupby(['pet', 'name', 'date'])['weight'].diff()
Sample:
np.random.seed(123)
N = 100
L = list('abc')
df = pd.DataFrame({'pet': np.random.choice(L, N),
                   'name': np.random.choice(L, N),
                   'date': pd.Series(pd.date_range('2015-01-01', periods=int(N/10)))
                            .sample(N, replace=True),
                   'weight': np.random.rand(N)})
df['date'] = pd.to_datetime(df['date'])
df = df.sort_values(['pet', 'name', 'date'])
df['diffs'] = df.groupby(['pet', 'name', 'date'])['weight'].diff()
df['pet_name'] = df.pet + df.name
df = df.sort_values(['pet_name', 'date'])
df['diffs1'] = df.groupby(['pet_name', 'date'])['weight'].diff()
print(df.head(20))
date name pet weight diffs pet_name diffs1
1 2015-01-02 a a 0.105446 NaN aa NaN
2 2015-01-03 a a 0.845533 NaN aa NaN
2 2015-01-03 a a 0.980582 0.135049 aa 0.135049
2 2015-01-03 a a 0.443368 -0.537214 aa -0.537214
3 2015-01-04 a a 0.375186 NaN aa NaN
6 2015-01-07 a a 0.715601 NaN aa NaN
7 2015-01-08 a a 0.047340 NaN aa NaN
9 2015-01-10 a a 0.236600 NaN aa NaN
0 2015-01-01 b a 0.777162 NaN ab NaN
2 2015-01-03 b a 0.871683 NaN ab NaN
3 2015-01-04 b a 0.988329 NaN ab NaN
4 2015-01-05 b a 0.918397 NaN ab NaN
4 2015-01-05 b a 0.016119 -0.902279 ab -0.902279
5 2015-01-06 b a 0.095530 NaN ab NaN
5 2015-01-06 b a 0.894978 0.799449 ab 0.799449
5 2015-01-06 b a 0.365719 -0.529259 ab -0.529259
5 2015-01-06 b a 0.887593 0.521874 ab 0.521874
7 2015-01-08 b a 0.792299 NaN ab NaN
7 2015-01-08 b a 0.313669 -0.478630 ab -0.478630
7 2015-01-08 b a 0.281235 -0.032434 ab -0.032434
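If the goal is strictly the day-over-day change per pet + name (rather than the within-day differences sampled above), a hedged variant drops date from the grouping keys; a sketch, assuming the sample frame above:
# day-over-day change per (pet, name); assumes rows sorted by date
df = df.sort_values(['pet', 'name', 'date'])
df['day_diffs'] = df.groupby(['pet', 'name'])['weight'].diff()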
