Pattern identification and sequence detection - python-3.x

I have a dataset 'df' that looks something like this:
MEMBER seen_1 seen_2 seen_3 seen_4 seen_5 seen_6
A 1 0 0 1 0 1
B 1 1 0 0 1 0
C 1 1 1 0 0 1
D 0 0 1 0 0 1
As you can see, each row is a sequence of ones and zeros. Can anyone suggest Python code that counts how many consecutive 1s occur immediately before the first occurrence of the pattern 1, 0, 0? For example, for member A the first double-zero event occurs at seen_2 and seen_3, so the event value is 1. Similarly, for member B the first double-zero event occurs at seen_3 and seen_4, and two 1s occur before it, so the event value is 2. The resulting table should have a new column 'event', like this:
MEMBER seen_1 seen_2 seen_3 seen_4 seen_5 seen_6 event
A 1 0 0 1 0 1 1
B 1 1 0 0 1 0 2
C 1 1 1 0 0 1 3
D 0 0 1 0 0 1 1

My approach:
df = df.set_index('MEMBER')
# count 1 on each rows since the last 0
s = (df.stack()
.groupby(['MEMBER', df.eq(0).cumsum(1).stack()])
.cumsum().unstack()
)
# mask of the zeros:
u = s.eq(0)
# look for the first 1 0 0
idx = (~u &
u.shift(-1, axis=1, fill_value=False) &
u.shift(-2, axis=1, fill_value=False) ).idxmax(1)
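# look up the count at the matched column of each row
# (DataFrame.lookup was removed in pandas 2.0, so index with .at instead)
df['event'] = [s.at[member, col] for member, col in idx.items()]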
Test data:
MEMBER seen_1 seen_2 seen_3 seen_4 seen_5 seen_6
0 A 1 0 1 0 0 1
1 B 1 1 0 0 1 0
2 C 1 1 1 0 0 1
3 D 0 0 1 0 0 1
4 E 1 0 1 1 0 0
Output:
MEMBER seen_1 seen_2 seen_3 seen_4 seen_5 seen_6 event
0 A 1 0 1 0 0 1 1
1 B 1 1 0 0 1 0 2
2 C 1 1 1 0 0 1 3
3 D 0 0 1 0 0 1 1
4 E 1 0 1 1 0 0 2
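To cross-check the vectorized result on a small sample, here is a minimal plain-Python sketch of the same rule. The function name event_count and the choice to return None when no 1, 0, 0 pattern exists are my assumptions; the question does not say what should happen in that case.

```python
def event_count(row):
    """Length of the run of consecutive 1s that ends at the first 1, 0, 0 pattern.

    Returns None when the row has no 1, 0, 0 pattern (an assumption).
    """
    streak = 0  # consecutive 1s seen so far; every 0 resets it
    for i, value in enumerate(row):
        streak = streak + 1 if value == 1 else 0
        # a 1 followed directly by two 0s closes the first event
        if value == 1 and list(row[i + 1:i + 3]) == [0, 0]:
            return streak
    return None


# the test data from the answer above
rows = {
    "A": [1, 0, 1, 0, 0, 1],
    "B": [1, 1, 0, 0, 1, 0],
    "C": [1, 1, 1, 0, 0, 1],
    "D": [0, 0, 1, 0, 0, 1],
    "E": [1, 0, 1, 1, 0, 0],
}
events = {member: event_count(values) for member, values in rows.items()}
print(events)
```

A loop like this is much slower than the pandas version on large frames, so it is meant only as a reference to compare against.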

Related

pandas groupby; counting overlapping columns

I have a DataFrame that looks like this:
ID A B C D
6234 1 0 1 0
3417 1 0 0 0
9954 0 1 0 0
4369 0 0 0 1
6281 1 0 1 0
And I want to group it so as to make it look like this:
ID
A B C D
1 0 0 0 3
1 0 1 0 2
0 1 0 0 1
0 0 1 0 2
0 0 0 1 1
I have been using the following code, which has not gotten me very far.
import pandas as pd
data = [[6234,1,0,1,0],
[3417,1,0,0,0],
[9954,0,1,0,0],
[4369,0,0,0,1],
[6281,1,0,1,0]]
DF1 = pd.DataFrame(data, columns = ['ID','A','B','C','D'])
DF2 = DF1.groupby(['A','B','C','D']).count()
I would appreciate any insight that anyone might have to offer.
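One possible reading of the desired output is that each pattern should count every row that has a 1 in all the columns where the pattern has a 1, so that 1 0 0 0 also counts the 1 0 1 0 rows. A plain groupby on the four columns counts exact matches only, which would explain why it does not produce the table above. The sketch below assumes that reading; the patterns list is copied from the desired output and the helper name covers is made up for illustration.

```python
data = [
    [6234, 1, 0, 1, 0],
    [3417, 1, 0, 0, 0],
    [9954, 0, 1, 0, 0],
    [4369, 0, 0, 0, 1],
    [6281, 1, 0, 1, 0],
]

# patterns taken from the desired output table (columns A, B, C, D)
patterns = [
    (1, 0, 0, 0),
    (1, 0, 1, 0),
    (0, 1, 0, 0),
    (0, 0, 1, 0),
    (0, 0, 0, 1),
]


def covers(row_flags, pattern):
    """True when the row has a 1 in every column where the pattern has a 1."""
    return all(flag == 1 for flag, wanted in zip(row_flags, pattern) if wanted == 1)


# row[1:] skips the ID column
counts = {p: sum(covers(row[1:], p) for row in data) for p in patterns}
for pattern, n in counts.items():
    print(*pattern, n)
```

If exact matches are what is wanted instead, the counts for 1 0 0 0 and 0 0 1 0 would be different from the table in the question.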

How to identify where a particular sequence in a row occurs for the first time

I have a dataframe in pandas, an example of which is provided below:
Person appear_1 appear_2 appear_3 appear_4 appear_5 appear_6
A 1 0 0 1 0 0
B 1 1 0 0 1 0
C 1 0 1 1 0 0
D 0 0 1 0 0 1
E 1 1 1 1 1 1
As you can see, 1s and 0s occur in no particular order across the columns. It would be helpful if anyone could suggest Python code that finds the column number where the 1 0 0 pattern occurs for the first time. For example, for member A the first 1 0 0 pattern starts at appear_1, so the first occurrence is 1. Similarly, for member B the first 1 0 0 pattern starts at appear_2, so the first occurrence is column 2. The resulting table should have a new column named 'first_occurrence'. If no 1 0 0 pattern occurs (as in row E), the value in the first_occurrence column should be the number of 1s in that row. The resulting table should look something like this:
Person appear_1 appear_2 appear_3 appear_4 appear_5 appear_6 first_occurrence
A 1 0 0 1 0 0 1
B 1 1 0 0 1 0 2
C 1 0 1 1 0 0 4
D 0 0 1 0 0 1 3
E 1 1 1 1 1 1 6
Thank you in advance.
I try not to reinvent the wheel, so I am building on my answer to your previous question. On top of that answer, you additionally need idxmax, np.where, and get_indexer:
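import numpy as np

cols = ['appear_1', 'appear_2', 'appear_3', 'appear_4', 'appear_5', 'appear_6']
df1 = df[cols]
# True from the first 1 in each row onwards
m = df1[df1.eq(1)].ffill(axis=1).notna()
# True where a 0 appears after the first 1
df2 = df1[m].bfill(axis=1).eq(0)
# True where a 0 is followed by another 0 (or by the end of the row)
m2 = df2 & df2.shift(-1, axis=1, fill_value=True)
# position of the first double zero, or the number of columns if there is none
df['first_occurrence'] = np.where(m2.any(axis=1), df1.columns.get_indexer(m2.idxmax(axis=1)),
                                  df1.shape[1])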
Out[540]:
Person appear_1 appear_2 appear_3 appear_4 appear_5 appear_6 first_occurrence
0 A 1 0 0 1 0 0 1
1 B 1 1 0 0 1 0 2
2 C 1 0 1 1 0 0 4
3 D 0 0 1 0 0 1 3
4 E 1 1 1 1 1 1 6
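For comparison, here is a minimal plain-Python sketch that follows the wording of the question literally: it returns the 1-based position of the 1 that starts the first 1, 0, 0 pattern, and the number of 1s in the row when the pattern never occurs. The function name first_occurrence is my own. Note that the fallback differs from the pandas code above, which uses the number of columns; the two agree on row E but not on a row such as 1 1 0 1 0 1.

```python
def first_occurrence(row):
    """1-based position of the 1 that starts the first 1, 0, 0 pattern.

    Falls back to the number of 1s in the row when the pattern is absent.
    """
    for i in range(len(row) - 2):
        if tuple(row[i:i + 3]) == (1, 0, 0):
            return i + 1
    return sum(row)


sample = {
    "A": [1, 0, 0, 1, 0, 0],
    "B": [1, 1, 0, 0, 1, 0],
    "C": [1, 0, 1, 1, 0, 0],
    "D": [0, 0, 1, 0, 0, 1],
    "E": [1, 1, 1, 1, 1, 1],
}
result = {person: first_occurrence(values) for person, values in sample.items()}
print(result)
```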

How to identify a sequence and index number before a particular sequence occurs for the first time

I have a dataframe in pandas, an example of which is provided below:
Person appear_1 appear_2 appear_3 appear_4 appear_5 appear_6
A 1 0 0 1 0 1
B 1 1 0 0 1 0
C 1 0 1 1 0 0
D 1 1 1 1 0 0
As you can see, 1s and 0s occur in no particular order across the columns. It would be helpful if anyone could suggest Python code that counts the number of 1s that occur before the first occurrence of the pattern 1, 0, 0. For example, for member A the first double-zero event occurs at appear_2 and appear_3, so the duration is 1. Similarly, for member B the first double-zero event occurs at appear_3 and appear_4, so a total of two 1s occur before it. The 1 that is part of the 1, 0, 0 sequence is included in the count, because the 1 indicates that a person started the process and 0, 0 indicates their absence for two consecutive appearances after starting it. The resulting table should have a new column 'duration', like this:
Person appear_1 appear_2 appear_3 appear_4 appear_5 appear_6 duration
A 1 0 0 1 0 1 1
B 1 1 0 0 1 0 2
C 1 0 1 1 0 0 3
D 1 1 1 1 0 0 4
Thank you in advance.
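A little logic here: first take a rolling sum over windows of two columns, which is 0 exactly where two consecutive values are both 0. Then take the cumulative product of those sums along each row; once it hits a 0 it stays 0 for the rest of the row. Finally, sum the original values at the positions where the cumulative product is still non-zero.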
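s = df.iloc[:, 1:]
# rolling(axis=1) is deprecated in recent pandas versions, so roll over the transposed frame
s1 = s.T.rolling(2, min_periods=1).sum().T.cumprod(axis=1)
# hide everything from the first double zero onwards, then count the remaining 1s
s.mask(s1 == 0).sum(axis=1)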
Out[37]:
0 1.0
1 2.0
2 3.0
3 4.0
dtype: float64
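To make the rolling-sum and cumprod steps easier to follow, here is a rough plain-Python trace of the same idea for a single row. The function name duration_via_cumprod is made up for this sketch. It also shows one consequence of min_periods=1: a row that starts with 0 gets a first window sum of 0, so the product is 0 from the start and the result is 0.

```python
from itertools import accumulate
from operator import mul


def duration_via_cumprod(row):
    # window sum over the previous and current value; the first window holds one value
    window = [row[0]] + [prev + cur for prev, cur in zip(row, row[1:])]
    # running product: drops to 0 at the first window that sums to 0 and stays there
    alive = list(accumulate(window, mul))
    # add up the values at positions where the product is still non-zero
    return sum(value for value, product in zip(row, alive) if product != 0)


sample = {
    "A": [1, 0, 0, 1, 0, 1],
    "B": [1, 1, 0, 0, 1, 0],
    "C": [1, 0, 1, 1, 0, 0],
    "D": [1, 1, 1, 1, 0, 0],
}
durations = {person: duration_via_cumprod(values) for person, values in sample.items()}
print(durations)
```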
My logic compares each position with the next one. If both are 0, the mask is True at that location. I then take the cumulative sum of the mask along axis=1, so every location before the first True has a cumulative sum of 0. Finally, I keep only the positions where the cumulative sum is 0, that is, the positions before the first double 0, and sum them. For this to work I need to handle rows where a double 0 comes first, such as 'D': 0, 0, 1, 1, 0, 0. Your sample doesn't have this case, but I expect the real data will.
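cols = ['appear_1', 'appear_2', 'appear_3', 'appear_4', 'appear_5', 'appear_6']
df1 = df[cols]
# True from the first 1 in each row onwards
m = df1[df1.eq(1)].ffill(axis=1).notna()
# True where a 0 appears after the first 1
df2 = df1[m].bfill(axis=1).eq(0)
# True where a 0 is followed by another 0 (or by the end of the row)
m2 = df2 & df2.shift(-1, axis=1, fill_value=True)
# sum only the values that come before the first double zero
df['duration'] = df1[m2.cumsum(axis=1) == 0].sum(axis=1)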
Out[100]:
Person appear_1 appear_2 appear_3 appear_4 appear_5 appear_6 duration
0 A 1 0 0 1 0 1 1.0
1 B 1 1 0 0 1 0 2.0
2 C 1 0 1 1 0 0 3.0
3 D 1 1 1 1 0 0 4.0
Here is your sample changed to include the special case where the first elements are 0.
Update: add case E where all appear_x are 1.
Sample (df_n):
Person appear_1 appear_2 appear_3 appear_4 appear_5 appear_6
0 A 1 0 0 1 0 1
1 B 1 1 0 0 1 0
2 C 1 0 1 1 0 0
3 D 0 0 1 1 0 0
4 E 1 1 1 1 1 1
cols = ['appear_1', 'appear_2', 'appear_3', 'appear_4', 'appear_5', 'appear_6']
df1 = df_n[cols]
# same steps as before, with axis passed by keyword (required by pandas 2.x for ffill/bfill)
m = df1[df1.eq(1)].ffill(axis=1).notna()
df2 = df1[m].bfill(axis=1).eq(0)
m2 = df2 & df2.shift(-1, axis=1, fill_value=True)
df_n['duration'] = df1[m2.cumsum(axis=1) == 0].sum(axis=1)
Out[503]:
Person appear_1 appear_2 appear_3 appear_4 appear_5 appear_6 duration
0 A 1 0 0 1 0 1 1.0
1 B 1 1 0 0 1 0 2.0
2 C 1 0 1 1 0 0 3.0
3 D 0 0 1 1 0 0 2.0
4 E 1 1 1 1 1 1 6.0
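As a sanity check for the df_n sample, the same rule can be written as a short loop: ignore leading zeros, count 1s, and stop at the first 0 that is directly followed by another 0. This is only a sketch under that reading of the answer; the function name duration is mine.

```python
def duration(row):
    """Number of 1s before the first 0, 0 pair that comes after the first 1."""
    seen_one = False
    total = 0
    for i, value in enumerate(row):
        if value == 1:
            seen_one = True
            total += 1
        elif seen_one and list(row[i + 1:i + 2]) == [0]:
            # a 0 followed by another 0, after the process has started
            break
    return total


sample_n = {
    "A": [1, 0, 0, 1, 0, 1],
    "B": [1, 1, 0, 0, 1, 0],
    "C": [1, 0, 1, 1, 0, 0],
    "D": [0, 0, 1, 1, 0, 0],
    "E": [1, 1, 1, 1, 1, 1],
}
durations_n = {person: duration(values) for person, values in sample_n.items()}
print(durations_n)
```

Unlike the pandas output, the values here are integers rather than floats, because nothing is masked to NaN along the way.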

how to remove outermost logic?

How can I remove the outermost logic from a formula? For example:
Input (column D, result):
AND(OR(A,B),C)
Output (column E, binary number):
OR(A,B)
A B C result(D) after extract(E)
0 0 0 0 0
0 0 1 0 0
0 1 0 0 1
0 1 1 1 1
1 0 0 0 1
1 0 1 1 1
1 1 0 0 1
1 1 1 1 1
I tried this in Excel:
=IF(NOT(AND(D2,C2))=TRUE,1,0)
but it does not remove the outermost logic. These are the formulas behind each column:
A B C result(D) after extract(E) my attempt
0 0 0 =IF(AND(OR(A2,B2),C2)=TRUE,1,0) =IF(OR(A2,B2)=TRUE,1,0) =IF(NOT(AND(D2,C2))=TRUE,1,0)
0 0 1 =IF(AND(OR(A3,B3),C3)=TRUE,1,0) =IF(OR(A3,B3)=TRUE,1,0) =IF(NOT(AND(D3,C3))=TRUE,1,0)
0 1 0 =IF(AND(OR(A4,B4),C4)=TRUE,1,0) =IF(OR(A4,B4)=TRUE,1,0) =IF(NOT(AND(D4,C4))=TRUE,1,0)
0 1 1 =IF(AND(OR(A5,B5),C5)=TRUE,1,0) =IF(OR(A5,B5)=TRUE,1,0) =IF(NOT(AND(D5,C5))=TRUE,1,0)
1 0 0 =IF(AND(OR(A6,B6),C6)=TRUE,1,0) =IF(OR(A6,B6)=TRUE,1,0) =IF(NOT(AND(D6,C6))=TRUE,1,0)
1 0 1 =IF(AND(OR(A7,B7),C7)=TRUE,1,0) =IF(OR(A7,B7)=TRUE,1,0) =IF(NOT(AND(D7,C7))=TRUE,1,0)
1 1 0 =IF(AND(OR(A8,B8),C8)=TRUE,1,0) =IF(OR(A8,B8)=TRUE,1,0) =IF(NOT(AND(D8,C8))=TRUE,1,0)
1 1 1 =IF(AND(OR(A9,B9),C9)=TRUE,1,0) =IF(OR(A9,B9)=TRUE,1,0) =IF(NOT(AND(D9,C9))=TRUE,1,0)
By "remove the outermost logic", I assume you want to remove the IF function.
One thing to note is that in a formula like =IF(AND(OR(A2,B2),C2)=TRUE,1,0) you never need the =TRUE test. =IF(AND(OR(A2,B2),C2),1,0) will work exactly the same.
There are a couple of ways to convert a boolean (i.e. a TRUE/FALSE value) into an integer without an explicit IF. One is --AND(OR(A2,B2),C2). Another is INT(AND(OR(A2,B2),C2)).
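If the question is instead read as "can column E, OR(A,B), be computed from columns C and D alone?", a quick truth-table check suggests it cannot. That reading is my assumption, not something the question states. The sketch below rebuilds the table from the question in Python, using int() on a boolean in the same spirit as the conversions above, and groups the possible E values by the (C, D) pair they occur with.

```python
from itertools import product

rows = []
for a, b, c in product((0, 1), repeat=3):
    d = int((a or b) and c)  # column D: AND(OR(A,B),C)
    e = int(a or b)          # column E: OR(A,B)
    rows.append((a, b, c, d, e))

# collect every E value that occurs for each (C, D) combination
seen = {}
for a, b, c, d, e in rows:
    seen.setdefault((c, d), set()).add(e)

# combinations of C and D that do not determine E uniquely
ambiguous = {key: values for key, values in seen.items() if len(values) > 1}
print(ambiguous)
```

When C is 0, D is 0 whatever A and B are, so E can be either 0 or 1; only the original inputs A and B recover it in that case.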

Matlab string operation

I have converted a string to binary as follows
message='hello my name is kamran';
messagebin=dec2bin(message);
Is there any method for storing it in array?
I am not really sure of what you want to do here, but if you need to concatenate the rows of the binary representation (which is a matrix of numchars times bits_per_char), this is the code:
message = 'hello my name is kamran';
messagebin = dec2bin(double(message));
linearmessagebin = reshape(messagebin',1,numel(messagebin));
Please note that the double conversion returns the ASCII codes of the characters. I do not have access to a Matlab installation here, but Octave, for example, complains about the code you provided in the original question.
NOTE
As it was kindly pointed out to me, you have to transpose the messagebin before "serializing" it, in order to have the correct result.
If you want the result as numeric matrix, try:
>> str = 'hello world';
>> b = dec2bin(double(str),8) - '0'
b =
0 1 1 0 1 0 0 0
0 1 1 0 0 1 0 1
0 1 1 0 1 1 0 0
0 1 1 0 1 1 0 0
0 1 1 0 1 1 1 1
0 0 1 0 0 0 0 0
0 1 1 1 0 1 1 1
0 1 1 0 1 1 1 1
0 1 1 1 0 0 1 0
0 1 1 0 1 1 0 0
0 1 1 0 0 1 0 0
Each row corresponds to a character. You can easily reshape it into a sequence of 0s and 1s.
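For readers who want to check the matrix outside Matlab, here is a rough Python equivalent of the steps above. It is only an illustration: it builds one row of eight 0/1 integers per character and then flattens the rows in order, which corresponds to transposing before reshaping in Matlab. The variable names are my own.

```python
text = "hello world"

# one row of eight 0/1 integers per character, like dec2bin(double(str), 8) - '0'
bits = [[int(digit) for digit in format(ord(char), "08b")] for char in text]

# flatten row by row into a single sequence of 0s and 1s
flat = [bit for row in bits for bit in row]

for row in bits:
    print(*row)
```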
