How to identify a sequence and index number before a particular sequence occurs for the first time - python-3.x

I have a dataframe in pandas, an example of which is provided below:
Person appear_1 appear_2 appear_3 appear_4 appear_5 appear_6
A 1 0 0 1 0 1
B 1 1 0 0 1 0
C 1 0 1 1 0 0
D 1 1 1 1 0 0
As you can see, 1 and 0 occur randomly in different columns. It would be helpful if anyone could suggest Python code to count the number of times 1 occurs before the first occurrence of the sequence 1, 0, 0. For example, for member A, the first double-zero event occurs at appear_2 and appear_3, so the duration will be 1. Similarly, for member B, the first double-zero event occurs at appear_3 and appear_4, so a total of two 1s occur before it. The 1 that is part of the 1, 0, 0 sequence is included in the count, because the 1 indicates that a person started the process and 0, 0 indicates his/her absence for two consecutive appearances after initiating it. The resulting table should have a new column 'duration', something like this:
Person appear_1 appear_2 appear_3 appear_4 appear_5 appear_6 duration
A 1 0 0 1 0 1 1
B 1 1 0 0 1 0 2
C 1 0 1 1 0 0 3
D 1 1 1 1 0 0 4
Thank you in advance.

A little logic here: first take a rolling sum over two adjacent columns and find where it equals 0, then apply cumprod. Once the rolling sum hits 0, the cumulative product stays 0 from there on, so we only need to sum, for each row, the values where the product is not 0.
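s = df.iloc[:, 1:]
# roll over the columns by transposing (rolling(..., axis=1) is deprecated in recent pandas)
s1 = s.T.rolling(2, min_periods=1).sum().T.cumprod(axis=1)
s.mask(s1 == 0).sum(axis=1)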
Out[37]:
0 1.0
1 2.0
2 3.0
3 4.0
dtype: float64
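
The same window-sum and cumprod idea can also be sketched directly in NumPy, without `rolling`. This is only a minimal sketch: the frame is rebuilt from the expected output in the question (so row D is 1 1 1 1 0 0), and the names `vals`, `pair_sum` and `alive` are illustrative, not part of the answer above.

```python
import numpy as np
import pandas as pd

# Sample rebuilt from the question's expected output (row D = 1 1 1 1 0 0).
df = pd.DataFrame(
    [["A", 1, 0, 0, 1, 0, 1],
     ["B", 1, 1, 0, 0, 1, 0],
     ["C", 1, 0, 1, 1, 0, 0],
     ["D", 1, 1, 1, 1, 0, 0]],
    columns=["Person"] + [f"appear_{i}" for i in range(1, 7)],
)

vals = df.filter(like="appear_").to_numpy()

# Window-2 sum: each cell plus its left neighbour (the first column stands alone).
pair_sum = vals.copy()
pair_sum[:, 1:] += vals[:, :-1]

# The cumulative product of "window is non-zero" drops to 0 at the second
# zero of the first "0 0" and stays 0 for the rest of the row.
alive = (pair_sum > 0).cumprod(axis=1).astype(bool)

# Sum only the cells that come before that point.
df["duration"] = (vals * alive).sum(axis=1)
print(df)
```

Taking the product of booleans rather than of the sums themselves keeps the intermediate values from growing on long rows.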

My logic is to compare each position with the next one. If both are 0, the mask is True at that location. A cumsum along axis=1 then stays 0 for every location before the first True. Finally, comparing the cumsum to 0 keeps only the positions that appear before the double 0, and those are summed. For this logic to work, I need to handle the case where a double 0 comes first in the row, as in 'D': 0, 0, 1, 1, 0, 0. Your sample doesn't have this case, but I expect the real data would.
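cols = ['appear_1', 'appear_2', 'appear_3', 'appear_4', 'appear_5', 'appear_6']
df1 = df[cols]
# True from the first 1 in each row onward
m = df1[df1.eq(1)].ffill(axis=1).notna()
# True where a 0 appears after that first 1 (leading 0s are back-filled with 1)
df2 = df1[m].bfill(axis=1).eq(0)
# True where a 0 is followed by another 0 (a 0 in the last column also counts, via fill_value=True)
m2 = df2 & df2.shift(-1, axis=1, fill_value=True)
# sum only the cells before the first double 0
df['duration'] = df1[m2.cumsum(axis=1) == 0].sum(axis=1)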
Out[100]:
Person appear_1 appear_2 appear_3 appear_4 appear_5 appear_6 duration
0 A 1 0 0 1 0 1 1.0
1 B 1 1 0 0 1 0 2.0
2 C 1 0 1 1 0 0 3.0
3 D 1 1 1 1 0 0 4.0
Changing your sample to include the special case where the first elements are 0.
Update: added case E, where all appear_x are 1.
Sample (df_n):
Person appear_1 appear_2 appear_3 appear_4 appear_5 appear_6
0 A 1 0 0 1 0 1
1 B 1 1 0 0 1 0
2 C 1 0 1 1 0 0
3 D 0 0 1 1 0 0
4 E 1 1 1 1 1 1
cols = ['appear_1', 'appear_2', 'appear_3', 'appear_4', 'appear_5', 'appear_6']
df1 = df_n[cols]
m = df1[df1.eq(1)].ffill(axis=1).notna()
df2 = df1[m].bfill(axis=1).eq(0)
m2 = df2 & df2.shift(-1, axis=1, fill_value=True)
df_n['duration'] = df1[m2.cumsum(axis=1) == 0].sum(axis=1)
Out[503]:
Person appear_1 appear_2 appear_3 appear_4 appear_5 appear_6 duration
0 A 1 0 0 1 0 1 1.0
1 B 1 1 0 0 1 0 2.0
2 C 1 0 1 1 0 0 3.0
3 D 0 0 1 1 0 0 2.0
4 E 1 1 1 1 1 1 6.0
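
To cross-check the edge cases (leading zeros in D, all ones in E), a plain row-by-row version can serve as a reference. This is a sketch, not the answer's method: `duration_reference` is a hypothetical helper name, and it is much slower than the vectorised masks on large frames.

```python
import pandas as pd

df_n = pd.DataFrame(
    [["A", 1, 0, 0, 1, 0, 1],
     ["B", 1, 1, 0, 0, 1, 0],
     ["C", 1, 0, 1, 1, 0, 0],
     ["D", 0, 0, 1, 1, 0, 0],
     ["E", 1, 1, 1, 1, 1, 1]],
    columns=["Person"] + [f"appear_{i}" for i in range(1, 7)],
)


def duration_reference(row):
    """Count the 1s seen before the first "0 0" that follows a 1."""
    vals = [int(v) for v in row]
    seen_one = False
    ones = 0
    for i, cur in enumerate(vals):
        if cur == 1:
            seen_one = True
            ones += 1
        elif seen_one and i + 1 < len(vals) and vals[i + 1] == 0:
            break  # first double 0 after the process started
    return ones


cols = [c for c in df_n.columns if c.startswith("appear_")]
expected = df_n[cols].apply(duration_reference, axis=1)
print(expected.tolist())
```

Comparing `expected` with the 'duration' column on real data is a cheap way to confirm the masks behave as intended.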

Related

How can I change the values of columns based on the values from other columns?

Here is the table before cleaning:
name date time_lag1 time_lag2 time_lag3 lags
a 2000/5/3 1 0 1 time_lag1
a 2000/5/10 1 1 0 time_lag2
a 2000/5/17 1 1 1 time_lag3
b 2000/5/3 0 1 0 time_lag1
c 2000/5/3 0 0 0 time_lag1
The logic is simple: each name has several dates, and each date corresponds to a value in "lags". What I am trying to do is match the column names "time_lag1", "time_lag2", ..., "time_lagn" against the values in the "lags" column. For example, the first value of "time_lag1" should be 1 because the column name "time_lag1" equals the corresponding value of "lags", which is also "time_lag1". However, I don't know why the values in the other columns and rows come out incorrect.
My thought is:
# time_lag columns are not following a trend, so it can be lag_time4 as well.
time_list = ['time_lag1','time_lag2','lag_time4'...]
for col in time_list:
    if col == df['lags'].values:
        df.col == 1
    else:
        df.col == 0
I don't know why the code I tried is not working.
Here is the table I am trying to get:
name date time_lag1 time_lag2 time_lag3 lags
a 2000/5/3 1 0 0 time_lag1
a 2000/5/10 0 1 0 time_lag2
a 2000/5/17 0 0 1 time_lag3
b 2000/5/3 1 0 0 time_lag1
c 2000/5/3 1 0 0 time_lag1
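The simplest approach is to recalculate them from scratch with pandas.get_dummies and update the dataframe (dtype=int keeps the result as 0/1, since recent pandas versions return booleans by default):
df.update(pd.get_dummies(df['lags'], dtype=int))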
Output:
name date time_lag1 time_lag2 time_lag3 lags
0 a 2000/5/3 1 0 0 time_lag1
1 a 2000/5/10 0 1 0 time_lag2
2 a 2000/5/17 0 0 1 time_lag3
3 b 2000/5/3 1 0 0 time_lag1
4 c 2000/5/3 1 0 0 time_lag1
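
For comparison with the loop in the question, here is a minimal sketch that rebuilds each indicator column by comparing "lags" with the column name. It assumes the list of indicator columns is known up front (`time_cols` is a name introduced here for illustration). Unlike the get_dummies update, it also resets a column whose name never appears in "lags".

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["a", "a", "a", "b", "c"],
    "date": ["2000/5/3", "2000/5/10", "2000/5/17", "2000/5/3", "2000/5/3"],
    "time_lag1": [1, 1, 1, 0, 0],
    "time_lag2": [0, 1, 1, 1, 0],
    "time_lag3": [1, 0, 1, 0, 0],
    "lags": ["time_lag1", "time_lag2", "time_lag3", "time_lag1", "time_lag1"],
})

# Assumed: the indicator columns to rebuild, listed explicitly.
time_cols = ["time_lag1", "time_lag2", "time_lag3"]

for col in time_cols:
    # Element-wise comparison gives one boolean per row; assign it back with "=".
    df[col] = (df["lags"] == col).astype(int)

print(df)
```

The loop in the question fails for two reasons visible here: `df.col` looks up a column literally named "col", and `==` compares instead of assigning.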

How to return all rows that have equal number of values of 0 and 1?

I have a dataframe with 50 columns, each containing either 0 or 1. How do I return all rows that have an equal number (a tie) of 0s and 1s (25 "0" and 25 "1")?
An example with 4 columns:
A B C D
1 1 0 0
1 1 1 0
1 0 1 0
0 0 0 0
Based on the above example, it should return the first and the third rows.
A B C D
1 1 0 0
1 0 1 0
Because the values are only 0 and 1, a row has as many 1s as 0s exactly when its mean is 0.5 (two 1s out of four columns in your example). So, please try
df[df.mean(1).eq(0.5)]
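
A quick check of that expression on the four-column example, as a sketch. The count-based form is included as an equivalent alternative because it states the "half of the columns" condition with integers only; the names `by_mean` and `by_count` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 1, 1, 0],
    "B": [1, 1, 0, 0],
    "C": [0, 1, 1, 0],
    "D": [0, 0, 0, 0],
})

# Mean of a 0/1 row is the share of 1s, so 0.5 means a tie.
by_mean = df[df.mean(axis=1).eq(0.5)]

# Same condition without floating point: twice the number of 1s equals the column count.
by_count = df[df.sum(axis=1) * 2 == df.shape[1]]

print(by_mean)
```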

How to identify where a particular sequence in a row occurs for the first time

I have a dataframe in pandas, an example of which is provided below:
Person appear_1 appear_2 appear_3 appear_4 appear_5 appear_6
A 1 0 0 1 0 0
B 1 1 0 0 1 0
C 1 0 1 1 0 0
D 0 0 1 0 0 1
E 1 1 1 1 1 1
As you can see, 1 and 0 occur randomly in different columns. It would be helpful if anyone could suggest Python code to find the column number where the 1 0 0 pattern occurs for the first time. For example, for member A, the first 1 0 0 pattern starts at appear_1, so the first occurrence will be 1. Similarly, for member B, the first 1 0 0 pattern starts at appear_2, so the first occurrence will be at column 2. The resulting table should have a new column named 'first_occurrence'. If no 1 0 0 pattern occurs (as in row E), the value in the first_occurrence column will be the sum of the number of 1s in that row. The resulting table should look something like this:
Person appear_1 appear_2 appear_3 appear_4 appear_5 appear_6 first_occurrence
A 1 0 0 1 0 0 1
B 1 1 0 0 1 0 2
C 1 0 1 1 0 0 4
D 0 0 1 0 0 1 3
E 1 1 1 1 1 1 6
Thank you in advance.
I try not to reinvent the wheel, so I am building on my answer to your previous question. On top of that answer, you additionally need idxmax, np.where, and get_indexer:
import numpy as np

cols = ['appear_1', 'appear_2', 'appear_3', 'appear_4', 'appear_5', 'appear_6']
df1 = df[cols]
m = df1[df1.eq(1)].ffill(axis=1).notna()
df2 = df1[m].bfill(axis=1).eq(0)
m2 = df2 & df2.shift(-1, axis=1, fill_value=True)
# the 0-based index of the first 0 in the pair equals the 1-based column of the 1 before it
df['first_occurrence'] = np.where(m2.any(axis=1),
                                  df1.columns.get_indexer(m2.idxmax(axis=1)),
                                  df1.shape[1])
Out[540]:
Person appear_1 appear_2 appear_3 appear_4 appear_5 appear_6 first_occurrence
0 A 1 0 0 1 0 0 1
1 B 1 1 0 0 1 0 2
2 C 1 0 1 1 0 0 4
3 D 0 0 1 0 0 1 3
4 E 1 1 1 1 1 1 6
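
As a cross-check, the rule from the question can also be written row by row. This is a sketch with a hypothetical helper, `first_occurrence_reference`; it follows the question's fallback literally (the sum of the 1s in the row when no 1 0 0 pattern exists), which coincides with the column count used above only for a row of all 1s.

```python
import pandas as pd

df = pd.DataFrame(
    [["A", 1, 0, 0, 1, 0, 0],
     ["B", 1, 1, 0, 0, 1, 0],
     ["C", 1, 0, 1, 1, 0, 0],
     ["D", 0, 0, 1, 0, 0, 1],
     ["E", 1, 1, 1, 1, 1, 1]],
    columns=["Person"] + [f"appear_{i}" for i in range(1, 7)],
)


def first_occurrence_reference(row):
    """1-based position of the 1 that starts the first 1 0 0, else the sum of 1s."""
    vals = [int(v) for v in row]
    for i in range(len(vals) - 2):
        if vals[i:i + 3] == [1, 0, 0]:
            return i + 1
    return sum(vals)  # fallback stated in the question


cols = [c for c in df.columns if c.startswith("appear_")]
expected = df[cols].apply(first_occurrence_reference, axis=1)
print(expected.tolist())
```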

Pattern identification and sequence detection

I have a dataset 'df' that looks something like this:
MEMBER seen_1 seen_2 seen_3 seen_4 seen_5 seen_6
A 1 0 0 1 0 1
B 1 1 0 0 1 0
C 1 1 1 0 0 1
D 0 0 1 0 0 1
As you can see, there are several rows of ones and zeros. Can anyone suggest Python code to count the number of times 1 occurs consecutively before the first occurrence of the sequence 1, 0, 0? For example, for member A, the first double-zero event occurs at seen_2 and seen_3, so the event will be 1. Similarly, for member B, the first double-zero event occurs at seen_3 and seen_4, so two 1s occur before it. The resulting table should have a new column 'event', something like this:
MEMBER seen_1 seen_2 seen_3 seen_4 seen_5 seen_6 event
A 1 0 0 1 0 1 1
B 1 1 0 0 1 0 2
C 1 1 1 0 0 1 3
D 0 0 1 0 0 1 1
My approach:
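import numpy as np

df = df.set_index('MEMBER')
# count the 1s on each row since the last 0
s = (df.stack()
       .groupby(['MEMBER', df.eq(0).cumsum(axis=1).stack()])
       .cumsum().unstack()
    )
# mask of the zeros:
u = s.eq(0)
# look for the first 1 0 0
idx = (~u &
       u.shift(-1, axis=1, fill_value=False) &
       u.shift(-2, axis=1, fill_value=False)).idxmax(axis=1)
# look up the run length at that column for each row
# (DataFrame.lookup was removed in pandas 2.0, so index the underlying array instead)
rows = np.arange(len(s))
df['event'] = pd.Series(s.to_numpy()[rows, s.columns.get_indexer(idx)], index=s.index)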
Test data:
MEMBER seen_1 seen_2 seen_3 seen_4 seen_5 seen_6
0 A 1 0 1 0 0 1
1 B 1 1 0 0 1 0
2 C 1 1 1 0 0 1
3 D 0 0 1 0 0 1
4 E 1 0 1 1 0 0
Output:
MEMBER seen_1 seen_2 seen_3 seen_4 seen_5 seen_6 event
0 A 1 0 1 0 0 1 1
1 B 1 1 0 0 1 0 2
2 C 1 1 1 0 0 1 3
3 D 0 0 1 0 0 1 1
4 E 1 0 1 1 0 0 2
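
The same rule, written row by row, can be used to sanity-check the result on the test data. This is a sketch with a hypothetical helper, `event_reference`; returning 0 when a row contains no 1 0 0 pattern is an assumption made here, since the question does not cover that case.

```python
import pandas as pd

df = pd.DataFrame(
    [["A", 1, 0, 1, 0, 0, 1],
     ["B", 1, 1, 0, 0, 1, 0],
     ["C", 1, 1, 1, 0, 0, 1],
     ["D", 0, 0, 1, 0, 0, 1],
     ["E", 1, 0, 1, 1, 0, 0]],
    columns=["MEMBER"] + [f"seen_{i}" for i in range(1, 7)],
)


def event_reference(row):
    """Length of the run of consecutive 1s that ends in the first 1 0 0."""
    vals = [int(v) for v in row]
    for i in range(len(vals) - 2):
        if vals[i:i + 3] == [1, 0, 0]:
            run = 0
            # walk left from the 1 that starts the pattern
            while i >= 0 and vals[i] == 1:
                run += 1
                i -= 1
            return run
    return 0  # assumption: no pattern found


cols = [c for c in df.columns if c.startswith("seen_")]
expected = df[cols].apply(event_reference, axis=1)
print(expected.tolist())
```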

List column name having value greater than zero

I have following dataframe
A | B | C | D
1 0 2 1
0 1 1 0
0 0 0 1
I want to add a new column that lists, for each row, every value greater than zero together with its column name:
A | B | C | D | New
1 0 2 1 A-1, C-2, D-1
0 1 1 0 B-1, C-1
0 0 0 1 D-1
We can use mask and stack
# sum(level=0) was removed in pandas 2.0, so group on the row index and join instead
s = (df.mask(df == 0).stack()
       .astype(int).astype(str)
       .reset_index(level=1).apply('-'.join, axis=1)
       .groupby(level=0).agg(','.join))
df['New']=s
df
Out[170]:
A B C D New
0 1 0 2 1 A-1,C-2,D-1
1 0 1 1 0 B-1,C-1
2 0 0 0 1 D-1
Combine the column names with the df values that are not zero and then filter out the None values.
df = pd.read_clipboard()
arrays = np.where(df!=0, df.columns.values + '-' + df.values.astype('str'), None)
new = []
for array in arrays:
    new.append(list(filter(None, array)))
df['New'] = new
df
Out[1]:
A B C D New
0 1 0 2 1 [A-1, C-2, D-1]
1 0 1 1 0 [B-1, C-1]
2 0 0 0 1 [D-1]
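
If a single comma-separated string per row is wanted, as in the question's expected output, a list comprehension is another option. This is a minimal sketch on the sample data; `value_cols` is a name introduced here so the new column is not included in its own input.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 0, 0], "B": [0, 1, 0], "C": [2, 1, 0], "D": [1, 0, 1]})

value_cols = list(df.columns)  # capture the columns before adding "New"

# For each row, keep "column-value" for every value greater than zero.
df["New"] = [
    ", ".join(f"{col}-{val}" for col, val in zip(value_cols, row) if val > 0)
    for row in df[value_cols].to_numpy()
]
print(df)
```

It loops in Python, so it trades some speed for readability on very large frames.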
