creating new column values depending on other column values in a dataframe - python-3.x

I have a data frame and a snippet of it is given below.
import pandas as pd

data = {'ID': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C'],
        'Date': ['03/25/2021', '03/25/2021', '03/27/2021', '03/29/2021',
                 '03/10/2021', '03/11/2021', '03/15/2021', '03/16/2021',
                 '03/21/2021', '03/25/2021']}
df = pd.DataFrame(data)
I am looking for a final result structured as described below.
The explanation: for each ID, study_date starts from the first date and ends on the last date, and the missing dates in the middle have to be filled in. If a date was missing from the original dataframe, the 'missing_date' column holds 1, otherwise 0. The study day column numbers the days in order from the first day to the last.
I tried some stuff but I have been stuck on this for a while now. Any help is greatly appreciated.
Thanks.

Try:
def fn(x):
    dr = pd.date_range(x["Date"].min(), x["Date"].max())
    out = pd.DataFrame({"Date": dr}, index=range(1, len(dr) + 1))
    out["Missing_Date"] = (~out["Date"].isin(x["Date"])).astype(int)
    return out

# if the "Date" column is not converted yet:
df["Date"] = pd.to_datetime(df["Date"])

x = (
    df.groupby("ID")
    .apply(fn)
    .reset_index()
    .rename(columns={"level_1": "StudyDay"})
)
print(x)
Prints:
ID StudyDay Date Missing_Date
0 A 1 2021-03-25 0
1 A 2 2021-03-26 1
2 A 3 2021-03-27 0
3 A 4 2021-03-28 1
4 A 5 2021-03-29 0
5 B 1 2021-03-10 0
6 B 2 2021-03-11 0
7 B 3 2021-03-12 1
8 B 4 2021-03-13 1
9 B 5 2021-03-14 1
10 B 6 2021-03-15 0
11 B 7 2021-03-16 0
12 C 1 2021-03-21 0
13 C 2 2021-03-22 1
14 C 3 2021-03-23 1
15 C 4 2021-03-24 1
16 C 5 2021-03-25 0
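A sketch of an alternative without the custom apply, assuming the df from the question: build the full per-ID calendar first, then merge the original rows back in to flag the missing days (the present helper column is just a marker introduced here).
import pandas as pd

df["Date"] = pd.to_datetime(df["Date"])

# one row per (ID, calendar day) between each ID's first and last date
full = (
    df.groupby("ID")["Date"]
    .apply(lambda s: pd.Series(pd.date_range(s.min(), s.max())))
    .rename("Date")
    .reset_index(level=0)
    .reset_index(drop=True)
)
full["StudyDay"] = full.groupby("ID").cumcount() + 1

# left-merge the original rows back in; days with no match were missing
full = full.merge(df.drop_duplicates().assign(present=1), on=["ID", "Date"], how="left")
full["Missing_Date"] = full["present"].isna().astype(int)
full = full.drop(columns="present")
print(full)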

Related

How to replenish a data frame based on another one?

Given two data frames: one contains a column of repeated values ('a', in this case); the other records what each of those values corresponds to (here, some 'd' values). How do I efficiently add a new column to the first data frame whose values are looked up, via an existing column, from the rule recorded in the other data frame? Here is example code that works, but really slowly:
import pandas as pd
import numpy as np
d1 = pd.DataFrame(np.asarray([[1, 2, 3], [2, 4, 5], [3, 4, 5], [2, 1, 4], [3, 4, 5]]), columns=['a', 'b', 'c'])
d2 = pd.DataFrame(np.asarray([[1, 7], [2, 8], [3, 11]]), columns=['a', 'd'])
d = np.empty((d1.shape[0],))
for i in range(d1.shape[0]):
    temp = d2.loc[d2['a'] == d1.at[i, 'a']]
    d[i] = temp['d'].array[0]
d1['d'] = d
This is the original d1:
a b c
0 1 2 3
1 2 4 5
2 3 4 5
3 2 1 4
4 3 4 5
This is d2:
a d
0 1 7
1 2 8
2 3 11
This is the resulting d1:
a b c d
0 1 2 3 7
1 2 4 5 8
2 3 4 5 11
3 2 1 4 8
4 3 4 5 11
You're probably looking for pd.merge.
In your case, d1 = d1.merge(d2, on=['a'], how='left') should do the trick.
Another way is to use map, which looks up only the values you need:
d1['d'] = d1['a'].map(d2.set_index('a')['d'])
d1
Output:
a b c d
0 1 2 3 7
1 2 4 5 8
2 3 4 5 11
3 2 1 4 8
4 3 4 5 11
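For completeness, a self-contained sketch (using the frames from the question) showing that both approaches replace the slow loop and agree:
import numpy as np
import pandas as pd

d1 = pd.DataFrame(np.asarray([[1, 2, 3], [2, 4, 5], [3, 4, 5], [2, 1, 4], [3, 4, 5]]),
                  columns=['a', 'b', 'c'])
d2 = pd.DataFrame(np.asarray([[1, 7], [2, 8], [3, 11]]), columns=['a', 'd'])

# vectorized join: matches every row of d1 against d2 on 'a'
merged = d1.merge(d2, on=['a'], how='left')

# lookup-style: a Series indexed by 'a' acts as the mapping table
mapped = d1.assign(d=d1['a'].map(d2.set_index('a')['d']))

assert merged.equals(mapped)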

How to check value change in column

My dataframe has three columns: value, ID and distance. Whenever the ID column changes from 2 to some other value, I want to count the rows of that non-2 run, record the first and last entries of the value column within it, and also save the corresponding first entry of the distance column.
df=pd.DataFrame({'value':[3,4,7,8,11,20,15,20,15,16],'ID':[2,2,8,8,8,2,2,2,5,5],'distance':[0,0,1,0,0,0,0,0,0,0]})
print(df)
value ID distance
0 3 2 0
1 4 2 0
2 7 8 1
3 8 8 0
4 11 8 0
5 20 2 0
6 15 2 0
7 20 2 0
8 15 5 0
9 16 5 0
required results:
df_out=pd.DataFrame({'rows_Count':[3,2],'value_first':[7,15],'value_last':[11,16],'distance_first':[1,0]})
print(df_out)
rows_Count value_first value_last distance_first
0 3 7 11 1
1 2 15 16 0
Use:
#compare with 2
m = df['ID'].eq(2)
#filter out rows before the first 2 (not needed in the sample data, but possible in real data)
df = df[m.cumsum().ne(0)]
#create unique group ids for the non-2 runs, add missing values back by reindex
s = m.ne(m.shift()).cumsum()[~m].reindex(df.index)
#aggregate with helper s Series
df1 = df.groupby(s).agg({'ID':'size', 'value':['first','last'], 'distance':'first'})
#flatten MultiIndex
df1.columns = df1.columns.map('_'.join)
df1 = df1.reset_index(drop=True)
print (df1)
ID_size value_first value_last distance_first
0 3 7 11 1
1 2 15 16 0
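To see what the helper does, printing s for the sample data shows each non-2 run labelled with its own group id, while the ID == 2 rows stay NaN and are therefore dropped by groupby (roughly, it prints):
print (s)
0    NaN
1    NaN
2    2.0
3    2.0
4    2.0
5    NaN
6    NaN
7    NaN
8    4.0
9    4.0
Name: ID, dtype: float64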
Verify with changed data (where the data no longer starts with the 2 group):
df = pd.DataFrame({'value':[3,4,7,8,11,20,15,20,15,16],
                   'ID':[1,7,8,8,8,2,2,2,5,5],
                   'distance':[0,0,1,0,0,0,0,0,0,0]})
print(df)
value ID distance
0 3 1 0 <- changed ID
1 4 7 0 <- changed ID
2 7 8 1
3 8 8 0
4 11 8 0
5 20 2 0
6 15 2 0
7 20 2 0
8 15 5 0
9 16 5 0
#compare with 2
m = df['ID'].eq(2)
#filter out rows before the first 2 (needed here, because the data no longer starts with 2)
df = df[m.cumsum().ne(0)]
#create unique group ids for the non-2 runs, add missing values back by reindex
s = m.ne(m.shift()).cumsum()[~m].reindex(df.index)
#aggregate with helper s Series
df1 = df.groupby(s).agg({'ID':'size', 'value':['first','last'], 'distance':'first'})
#flatten MultiIndex
df1.columns = df1.columns.map('_'.join)
df1 = df1.reset_index(drop=True)
print (df1)
ID_size value_first value_last distance_first
0 2 15 16 0
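As a side note, on pandas 0.25+ named aggregation produces the requested column names directly and skips the MultiIndex flattening step; a sketch using the same helper s:
df1 = (df.groupby(s)
         .agg(rows_Count=('ID', 'size'),
              value_first=('value', 'first'),
              value_last=('value', 'last'),
              distance_first=('distance', 'first'))
         .reset_index(drop=True))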

Get value from another dataframe column based on condition

I have a dataframe like below:
>>> df1
a b
0 [1, 2, 3] 10
1 [4, 5, 6] 20
2 [7, 8] 30
and another like:
>>> df2
a
0 1
1 2
2 3
3 4
4 5
I need to create column 'c' in df2 from column 'b' of df1 whenever the 'a' value of df2 appears in the 'a' column of df1. In df1, each entry of column 'a' is a list.
I tried to adapt the approach from the following URL, but have gotten nothing so far:
https://medium.com/#Imaadmkhan1/using-pandas-to-create-a-conditional-column-by-selecting-multiple-columns-in-two-different-b50886fabb7d
The expected result is:
>>> df2
a c
0 1 10
1 2 10
2 3 10
3 4 20
4 5 20
Use Series.map with a dictionary built by flattening the lists in df1:
d = {c: b for a, b in zip(df1['a'], df1['b']) for c in a}
print (d)
{1: 10, 2: 10, 3: 10, 4: 20, 5: 20, 6: 20, 7: 30, 8: 30}
df2['new'] = df2['a'].map(d)
print (df2)
a new
0 1 10
1 2 10
2 3 10
3 4 20
4 5 20
EDIT: I think the problem is that column a mixes plain integers with lists, so the solution is to test for that with if/else when building the dictionary:
d = {}
for a, b in zip(df1['a'], df1['b']):
    if isinstance(a, list):
        for c in a:
            d[c] = b
    else:
        d[a] = b
df2['new'] = df2['a'].map(d)
Use:
m = pd.DataFrame({'a': np.concatenate(df1.a.values), 'b': df1.b.repeat(df1.a.str.len())})
df2.merge(m, on='a')
a b
0 1 10
1 2 10
2 3 10
3 4 20
4 5 20
First we unnest the lists in df1 into rows, then we merge the two frames on column a:
df1 = df1.set_index('b').a.apply(pd.Series).stack().reset_index(level=0).rename(columns={0:'a'})
print(df1, '\n')
df_final = df2.merge(df1, on='a')
print(df_final)
b a
0 10 1.0
1 10 2.0
2 10 3.0
0 20 4.0
1 20 5.0
2 20 6.0
0 30 7.0
1 30 8.0
a b
0 1 10
1 2 10
2 3 10
3 4 20
4 5 20
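On pandas 0.25+ the unnesting can also be written with DataFrame.explode; a minimal sketch, assuming the frames from the question:
import pandas as pd

df1 = pd.DataFrame({'a': [[1, 2, 3], [4, 5, 6], [7, 8]], 'b': [10, 20, 30]})
df2 = pd.DataFrame({'a': [1, 2, 3, 4, 5]})

flat = df1.explode('a')                       # one row per list element
flat['a'] = flat['a'].astype(df2['a'].dtype)  # explode leaves 'a' as object; align dtypes before merging
print(df2.merge(flat, on='a').rename(columns={'b': 'c'}))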

In Python Pandas using cumsum with groupby and reset of cumsum when value is 0

I'm rather new at Python.
I am trying to compute a cumulative sum per client to count the consecutive months of inactivity (flag: 1 or 0). The cumulative sum of the 1's therefore needs to reset whenever a 0 occurs, and also whenever a new client starts. See the example below, where column a holds the clients and column b the dates.
After some research, I found the question 'Cumsum reset at NaN' and 'In Python Pandas using cumsum with groupby'. I assume that I kind of need to put them together.
Adapting the code from 'Cumsum reset at NaN' so that it resets at 0 is successful:
cumsum = v.cumsum().fillna(method='pad')
reset = -cumsum[v.isnull() !=0].diff().fillna(cumsum)
result = v.where(v.notnull(), reset).cumsum()
However, I can't manage to add the groupby: the count just keeps running across clients...
So, a dataset would be like this:
import pandas as pd
df = pd.DataFrame({'a' : [1,1,1,1,1,1,1,2,2,2,2,2,2,2],
                   'b' : [1/15,2/15,3/15,4/15,5/15,6/15,1/15,2/15,3/15,4/15,5/15,6/15,7/15,8/15],
                   'c' : [1,0,1,0,1,1,0,1,1,0,1,1,1,1]})
this should result in a dataframe with the columns a, b, c and d with
'd' : [1,0,1,0,1,2,0,1,2,0,1,2,3,4]
Please note that I have a very large dataset, so calculation time is really important.
Thank you for helping me
Use groupby.apply and cumsum after identifying the contiguous runs within each group, then groupby.cumcount to count up within each run (adding 1 afterwards).
Multiplying by the original column acts as an AND, cancelling all the zeros and keeping only the positive runs.
df['d'] = df.groupby('a')['c'].apply(
    lambda x: x * (x.groupby((x != x.shift()).cumsum()).cumcount() + 1)
)
print(df['d'])
0 1
1 0
2 1
3 0
4 1
5 2
6 0
7 1
8 2
9 0
10 1
11 2
12 3
13 4
Name: d, dtype: int64
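The inner (x != x.shift()).cumsum() is what makes the count restart: it assigns a fresh run id whenever c changes, so cumcount starts over in each run. For client a == 1 it looks like this:
import pandas as pd

x = pd.Series([1, 0, 1, 0, 1, 1, 0])       # the c values of client a == 1
print((x != x.shift()).cumsum().tolist())  # [1, 2, 3, 4, 5, 5, 6]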
Another way would be to apply a function after series.expanding on the groupby object, which evaluates the series from the first index up to the current index.
reduce is then used to apply a function of two arguments cumulatively to the items of that window, reducing it to a single value.
from functools import reduce
df.groupby('a')['c'].expanding() \
    .apply(lambda i: reduce(lambda x, y: x + 1 if y == 1 else 0, i, 0))
a
1 0 1.0
1 0.0
2 1.0
3 0.0
4 1.0
5 2.0
6 0.0
2 7 1.0
8 2.0
9 0.0
10 1.0
11 2.0
12 3.0
13 4.0
Name: c, dtype: float64
Timings:
%%timeit
df.groupby('a')['c'].apply(lambda x: x * (x.groupby((x != x.shift()).cumsum()).cumcount() + 1))
100 loops, best of 3: 3.35 ms per loop
%%timeit
df.groupby('a')['c'].expanding().apply(lambda s: reduce(lambda x, y: x+1 if y==1 else 0, s, 0))
1000 loops, best of 3: 1.63 ms per loop
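Since the question stresses a very large dataset, note that expanding().apply re-scans every prefix of each group, so its cost grows quadratically even though it wins on this tiny frame. A fully vectorized sketch that resets on both zeros and client changes, avoiding Python-level lambdas entirely:
# start a new run whenever c changes or a new client begins
chg = df['c'].ne(df['c'].shift()) | df['a'].ne(df['a'].shift())
# count up within each run; multiplying by c zeroes out the inactive runs
df['d'] = df['c'] * (df.groupby(chg.cumsum()).cumcount() + 1)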
I think you need a custom function with groupby:
#row with index 6 changed to 1 for better testing
df = pd.DataFrame({'a' : [1,1,1,1,1,1,1,2,2,2,2,2,2,2],
                   'b' : [1/15,2/15,3/15,4/15,5/15,6/15,1/15,2/15,3/15,4/15,5/15,6/15,7/15,8/15],
                   'c' : [1,0,1,0,1,1,1,1,1,0,1,1,1,1],
                   'd' : [1,0,1,0,1,2,3,1,2,0,1,2,3,4]})
print (df)
a b c d
0 1 0.066667 1 1
1 1 0.133333 0 0
2 1 0.200000 1 1
3 1 0.266667 0 0
4 1 0.333333 1 1
5 1 0.400000 1 2
6 1 0.066667 1 3
7 2 0.133333 1 1
8 2 0.200000 1 2
9 2 0.266667 0 0
10 2 0.333333 1 1
11 2 0.400000 1 2
12 2 0.466667 1 3
13 2 0.533333 1 4
def f(x):
    # flag rows where c == 1, leaving the rest NaN (.ix is removed in modern pandas, so use .loc)
    x.loc[x.c == 1, 'e'] = 1
    a = x.e.notnull()
    # cumulative count of consecutive flags, resetting after each gap
    x.e = a.cumsum() - a.cumsum().where(~a).ffill().fillna(0).astype(int)
    return x
print (df.groupby('a').apply(f))
a b c d e
0 1 0.066667 1 1 1
1 1 0.133333 0 0 0
2 1 0.200000 1 1 1
3 1 0.266667 0 0 0
4 1 0.333333 1 1 1
5 1 0.400000 1 2 2
6 1 0.066667 1 3 3
7 2 0.133333 1 1 1
8 2 0.200000 1 2 2
9 2 0.266667 0 0 0
10 2 0.333333 1 1 1
11 2 0.400000 1 2 2
12 2 0.466667 1 3 3
13 2 0.533333 1 4 4

How to sum columns in pandas and add the result into a new row?

In this code I want to sum each column and add it as a new row.
It does the sum but it does not show the new row.
df = pd.DataFrame(g, columns=('AWA', 'REM', 'S1', 'S2'))
df['xSujeto'] = df.sum(axis=1)
xEstado = df.sum(axis=0)
df.append(xEstado, ignore_index=True)
df
I think you can use loc:
df = pd.DataFrame({'AWA':[1,2,3],
'REM':[4,5,6],
'S1':[7,8,9],
'S2':[1,3,5]})
#add 1 to last index value
print (df.index[-1] + 1)
3
df.loc[df.index[-1] + 1] = df.sum()
print (df)
AWA REM S1 S2
0 1 4 7 1
1 2 5 8 3
2 3 6 9 5
3 6 15 24 9
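A small variant of the loc approach, if a labelled total row is preferred (the 'Total' label is just an example):
df.loc['Total'] = df.sum()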
Or use append, as suggested in a comment by Nickil Maveli:
xEstado = df.sum()
df = df.append(xEstado, ignore_index=True)
print (df)
AWA REM S1 S2
0 1 4 7 1
1 2 5 8 3
2 3 6 9 5
3 6 15 24 9
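Note that DataFrame.append was removed in pandas 2.0; on current versions the loc approach above still works, and the append line can be replaced with concat:
df = pd.concat([df, xEstado.to_frame().T], ignore_index=True)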
