Pattern identification in a dataset using Python (python-3.x)

I have a dataframe that looks something like this:
empl_ID  day_1  day_2  day_3  day_4  day_5  day_6  day_7  day_8  day_9  day_10
      1      1      1      1      1      1      1      0      1      1       1
      2      0      0      1      1      1      1      1      1      1       0
      3      0      1      0      0      1      1      1      1      1       1
      4      1      0      1      0      1      1      1      0      1       0
      5      1      0      0      1      1      1      1      1      1       1
      6      0      0      0      0      1      1      1      1      1       1
As we can see, there are 6 employees, and a 1 indicates presence on that day. I want to write Python code that detects 2 consecutive absences (i.e., the pattern 0, 0 on day i, day i+1) within a time frame of 6 days starting from the day the person begins employment.
For example, employee 1 begins work in column day_1, which holds his first 1. So, if we do not observe any consecutive 0, 0 in columns day_1 to day_6, that record should be labeled '0'. The same holds for employee 2 (cols: day_3 to day_8), employee 4 (cols: day_1 to day_6) and employee 6 (cols: day_5 to day_10), so they are labeled '0'.
However, employee 3 (cols: day_2 to day_7) and employee 5 (cols: day_1 to day_6) do contain a 0, 0 pattern within the 6-day window starting from their first 1, and thus should be labeled '1'.
It would be really helpful if someone could help me write code to achieve the above objective. Thanks in advance!
The result should look something like this:
empl_ID  day_1  day_2  day_3  day_4  day_5  day_6  day_7  day_8  day_9  day_10  label
      1      1      1      1      1      1      1      0      1      1       1      0
      2      0      0      1      1      1      1      1      1      1       0      0
      3      0      1      0      0      1      1      1      1      1       1      1
      4      1      0      1      0      1      1      1      0      1       0      0
      5      1      0      0      1      1      1      1      1      1       1      1
      6      0      0      0      0      1      1      1      1      1       1      0

Check with idxmax and a loop with shift:
s = df.set_index('empl_ID')
# column position of each employee's first 1 (idxmax returns the first maximum)
idx = s.columns.get_indexer(s.idxmax(axis=1))
# within each 6-day window, a 0 that equals the previous day's value is the second 0 of a 0, 0 pair
l = [(s.iloc[t, x:y].eq(s.iloc[t, x:y].shift()) & s.iloc[t, x:y].eq(0)).any()
     for t, x, y in zip(df.index, idx, idx + 6)]
df['Label'] = l
df
empl_ID day_1 day_2 day_3 day_4 ... day_7 day_8 day_9 day_10 Label
0 1 1 1 1 1 ... 0 1 1 1 False
1 2 0 0 1 1 ... 1 1 1 0 False
2 3 0 1 0 0 ... 1 1 1 1 True
3 4 1 0 1 0 ... 1 0 1 0 False
4 5 1 0 0 1 ... 1 1 1 1 True
5 6 0 0 0 0 ... 1 1 1 1 False
[6 rows x 12 columns]
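For readability, the same check can also be written as a plain per-row helper, which produces 0/1 labels directly (a sketch, not the answer above; has_double_absence is a made-up name, and it assumes every employee has at least one 1):

def has_double_absence(row, window=6):
    # row holds the day_1 .. day_10 values for one employee, in order
    days = row.tolist()
    start = days.index(1)               # first day of presence (assumes a 1 exists)
    vals = days[start:start + window]
    # scan every adjacent pair of days inside the window for a 0, 0 pattern
    return int(any(a == 0 and b == 0 for a, b in zip(vals, vals[1:])))

day_cols = [c for c in df.columns if c.startswith('day_')]
df['label'] = df[day_cols].apply(has_double_absence, axis=1)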

Related

ValueError in clustering evaluation: Expected 2D array, got 1D array instead

import numpy as np
from sklearn.cluster import KMeans

df_2D = df[['sepal-length', 'petal-length']]  # df is assumed to hold the iris measurements
df_2D = np.array(df_2D)
k_means_2D_model = KMeans(n_clusters=3, max_iter=1000).fit(df_2D)
Error:
ValueError: Expected 2D array, got 1D array instead:
array=[0 1 0 0 2 2 1 2 1 1 0 1 0 2 0 2 1 2 0 2 2 1 0 1 2 0 0 2 1 0 2 0 1 1 0 0 2
1 2 1 0 0 1 1 1 1 2 1 2 2 0 2 2 2 0 1 1 1 1 0 1 2 1 2 1 2 2 1 2 1 0 0 2 1
1 0 2 0 0 1 1 0 1 2 1 1 0 1 0 1 0 0 0 2 1 1 1 2 0 2 0 0 2 2 0 0 1 2 2 1 0
2 2 1 1 2 0 0 2 2 2 0 0 1 0 1 0 2 2 2 0 0 0 0 2 1 2 2 2 2 1 1 1 2 2 0 0 0
1 0].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
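The array in the traceback contains cluster labels (0/1/2), which suggests a 1D label array is being passed to a step that expects a 2D feature matrix. As the message itself says, reshaping fixes the shape mismatch (a sketch; labels is a hypothetical stand-in for the offending 1D array):

import numpy as np

labels = np.array([0, 1, 0, 0, 2])          # hypothetical 1D array like the one in the error
as_single_feature = labels.reshape(-1, 1)   # shape (5, 1): one feature per row
as_single_sample = labels.reshape(1, -1)    # shape (1, 5): one sample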

make the lower half of an n*n list zero without using any functions in python

I tried to solve it using 2 for loops and an if statement, but I was unable to get the desired output.
INPUT-
1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1
thislist = [1]*10
thislist = [thislist]*10
print(thislist)
for i in range(10):
    for j in range(10):
        print(thislist[i][j], end=" ")
    print()
print()
for i in range(10):
    for j in range(10):
        if i > j:
            thislist[i][j] = 0
for i in range(10):
    for j in range(10):
        print(thislist[i][j], end=" ")
    print()
This was the output I got:
0 0 0 0 0 0 0 0 0 1
0 0 0 0 0 0 0 0 0 1
0 0 0 0 0 0 0 0 0 1
0 0 0 0 0 0 0 0 0 1
0 0 0 0 0 0 0 0 0 1
0 0 0 0 0 0 0 0 0 1
0 0 0 0 0 0 0 0 0 1
0 0 0 0 0 0 0 0 0 1
0 0 0 0 0 0 0 0 0 1
0 0 0 0 0 0 0 0 0 1
But when I made the list using the method below, I got the desired output.
thislist = [[1,1,1,1,1,1,1,1,1,1],
            [1,1,1,1,1,1,1,1,1,1],
            [1,1,1,1,1,1,1,1,1,1],
            [1,1,1,1,1,1,1,1,1,1],
            [1,1,1,1,1,1,1,1,1,1],
            [1,1,1,1,1,1,1,1,1,1],
            [1,1,1,1,1,1,1,1,1,1],
            [1,1,1,1,1,1,1,1,1,1],
            [1,1,1,1,1,1,1,1,1,1],
            [1,1,1,1,1,1,1,1,1,1]]
print(thislist)
for i in range(10):
    for j in range(10):
        if i > j:
            thislist[i][j] = 0
for i in range(10):
    for j in range(10):
        print(thislist[i][j], end=" ")
    print()
Note: this is the desired output:
1 1 1 1 1 1 1 1 1 1
0 1 1 1 1 1 1 1 1 1
0 0 1 1 1 1 1 1 1 1
0 0 0 1 1 1 1 1 1 1
0 0 0 0 1 1 1 1 1 1
0 0 0 0 0 1 1 1 1 1
0 0 0 0 0 0 1 1 1 1
0 0 0 0 0 0 0 1 1 1
0 0 0 0 0 0 0 0 1 1
0 0 0 0 0 0 0 0 0 1
Can someone explain what the difference between the two versions is?
As you pointed out, the problem comes from the way you created your list of lists. In your first example, you do something like this:
list1 = [1]*10
list_of_list1 = [list1]*10
list_of_list1 is actually a list of ten references to the same list1 object; no copies are made at all. So when you modify a value through one row of list_of_list1, the modification shows up in every row.
What you want instead is a separate list per row (for nested structures, the related concepts to search for are shallow copy and deep copy).
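A two-line demonstration of the aliasing:

row = [1, 1]
grid = [row] * 2     # two references to the same list object
grid[0][0] = 0
print(grid)          # [[0, 1], [0, 1]] -- "both" rows changed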
In the meantime, you can simply build each row as its own list:
thislist = []
for row in range(10):
    list1 = [1]*10
    thislist.append(list1)
But I usually go with numpy when it is available.
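For instance, numpy can zero the lower half directly (a short sketch using np.triu, which keeps the upper triangle including the main diagonal):

import numpy as np

a = np.ones((10, 10), dtype=int)
a = np.triu(a)   # everything below the main diagonal becomes 0
print(a)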

Vectorized way of using the previous row value based on the condition

I have a pandas dataframe as below. I want to apply the following condition:
if column 'A' is 1, update the value of column 'F' with the previous value of 'F'. This can be done by iterating row by row, but that is not an efficient way of doing it; I want a vectorized method.
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 0, 0, 0, 1, 0, 0],
                   'C': [1, 1, 1, 0, 0, 0, 1, 1, 1],
                   'D': [1, 1, 1, 0, 0, 0, 1, 1, 1],
                   'F': [2, 0, 0, 0, 0, 1, 1, 1, 1]})
df
A C D F
0 1 1 1 2
1 1 1 1 0
2 1 1 1 0
3 0 0 0 0
4 0 0 0 0
5 0 0 0 1
6 1 1 1 1
7 0 1 1 1
8 0 1 1 1
My desired output:
A C D F
0 1 1 1 2
1 1 1 1 2
2 1 1 1 2
3 0 0 0 0
4 0 0 0 0
5 0 0 0 1
6 1 1 1 1
7 0 1 1 1
8 0 1 1 1
I tried the code below, but it does not work: when I use shift, it does not see the updated previous row.
df['F'] = df.groupby(['A'])['F'].shift(1)
df
A C D F
0 1 1 1 NaN
1 1 1 1 2.0
2 1 1 1 0.0
3 0 0 0 NaN
4 0 0 0 0.0
5 0 0 0 0.0
6 1 1 1 0.0
7 0 1 1 1.0
8 0 1 1 1.0
transform('first')
df.F.groupby(df.A.rsub(1).cumsum()).transform('first')
0 2
1 2
2 2
3 0
4 0
5 1
6 1
7 1
8 1
Name: F, dtype: int64
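To see why this works, inspect the grouper: df.A.rsub(1).cumsum() computes 1 - A and cumulatively sums it, so the counter increments at every 0 in A. Each run of 1s therefore falls into the same group as the row that precedes it (or the very first row), and transform('first') broadcasts that row's F across the run:

print(df.A.rsub(1).cumsum().tolist())
# [0, 0, 0, 1, 2, 3, 3, 4, 5]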
Assign to column 'F'
df.assign(F=df.F.groupby(df.A.rsub(1).cumsum()).transform('first'))
A C D F
0 1 1 1 2
1 1 1 1 2
2 1 1 1 2
3 0 0 0 0
4 0 0 0 0
5 0 0 0 1
6 1 1 1 1
7 0 1 1 1
8 0 1 1 1
We can also do it without groupby:
# mark the first row of each run of 1s in A
where = df['A'].eq(1) & df['A'].ne(df['A'].shift())
# keep F only on those rows, forward-fill across the run,
# then restore the original F wherever A is not 1
df['F'] = df['F'].where(where).ffill().mask(df['A'].ne(1), df['F'])
print(df)
A C D F
0 1 1 1 2.0
1 1 1 1 2.0
2 1 1 1 2.0
3 0 0 0 0.0
4 0 0 0 0.0
5 0 0 0 1.0
6 1 1 1 1.0
7 0 1 1 1.0
8 0 1 1 1.0

Sorting dataframe and creating new columns based on the rank of element

I have the following dataframe:
import pandas as pd
df = pd.DataFrame(
    {
        'id': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
        'name': ['A', 'B', 'C', 'D', 'A', 'B', 'C', 'D', 'A', 'B', 'C', 'D'],
        'Value': [1, 2, 3, 4, 5, 6, 0, 2, 4, 6, 3, 5]
    },
    columns=['name', 'id', 'Value'])
I can sort the data using id and value as shown below:
df.sort_values(['id','Value'],ascending = [True,False])
The printed table appears as follows:
name id Value
D 1 4
C 1 3
B 1 2
A 1 1
B 2 6
A 2 5
D 2 2
C 2 0
B 3 6
D 3 5
A 3 4
C 3 3
I would like to create 4 new columns (Rank1, Rank2, Rank3, Rank4): if an element's Value is the highest within its id, Rank1 is set to 1, else 0; if it is the second highest, Rank2 is set to 1, else 0.
The same goes for Rank3 and Rank4.
How could I do that?
Thanks.
Use (on the dataframe sorted as above):
df = df.join(pd.get_dummies(df.groupby('id').cumcount().add(1)).add_prefix('Rank'))
print (df)
name id Value Rank1 Rank2 Rank3 Rank4
3 D 1 4 1 0 0 0
2 C 1 3 0 1 0 0
1 B 1 2 0 0 1 0
0 A 1 1 0 0 0 1
5 B 2 6 1 0 0 0
4 A 2 5 0 1 0 0
7 D 2 2 0 0 1 0
6 C 2 0 0 0 0 1
9 B 3 6 1 0 0 0
11 D 3 5 0 1 0 0
8 A 3 4 0 0 1 0
10 C 3 3 0 0 0 1
Details:
For a counter per group use GroupBy.cumcount, then add 1:
print (df.groupby('id').cumcount().add(1))
3 1
2 2
1 3
0 4
5 1
4 2
7 3
6 4
9 1
11 2
8 3
10 4
dtype: int64
For indicator columns use get_dummies with add_prefix:
print (pd.get_dummies(df.groupby('id').cumcount().add(1)).add_prefix('Rank'))
Rank1 Rank2 Rank3 Rank4
3 1 0 0 0
2 0 1 0 0
1 0 0 1 0
0 0 0 0 1
5 1 0 0 0
4 0 1 0 0
7 0 0 1 0
6 0 0 0 1
9 1 0 0 0
11 0 1 0 0
8 0 0 1 0
10 0 0 0 1
This does not require a prior sort
import numpy as np

df.join(
    pd.get_dummies(
        df.groupby('id').Value.apply(np.argsort).rsub(4)
    ).add_prefix('Rank')
)
name id Value Rank1 Rank2 Rank3 Rank4
0 D 1 4 1 0 0 0
1 C 1 3 0 1 0 0
2 B 1 2 0 0 1 0
3 A 1 1 0 0 0 1
4 B 2 6 1 0 0 0
5 A 2 5 0 1 0 0
6 D 2 2 0 0 1 0
7 C 2 0 0 0 0 1
8 B 3 6 1 0 0 0
9 D 3 5 0 1 0 0
10 A 3 4 0 0 1 0
11 C 3 3 0 0 0 1
More dynamic (without hard-coding the group size):
df.join(
    pd.get_dummies(
        df.groupby('id').Value.apply(lambda x: len(x) - np.argsort(x))
    ).add_prefix('Rank')
)
name id Value Rank1 Rank2 Rank3 Rank4
0 D 1 4 1 0 0 0
1 C 1 3 0 1 0 0
2 B 1 2 0 0 1 0
3 A 1 1 0 0 0 1
4 B 2 6 1 0 0 0
5 A 2 5 0 1 0 0
6 D 2 2 0 0 1 0
7 C 2 0 0 0 0 1
8 B 3 6 1 0 0 0
9 D 3 5 0 1 0 0
10 A 3 4 0 0 1 0
11 C 3 3 0 0 0 1
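One caveat: np.argsort returns the positions that would sort the values, not the ranks themselves, so the two snippets above only line up with the Rank columns when each group already happens to be in sorted order. A rank-based variant (a sketch, not part of the answers above) is independent of row order:

# rank 1 = highest Value within each id group; ties are broken by position
ranks = df.groupby('id')['Value'].rank(ascending=False, method='first').astype(int)
df.join(pd.get_dummies(ranks).astype(int).add_prefix('Rank'))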

Can I use the join command to join two files that match on different columns?

I have two files, one with three columns (file 1):
AX-76297970 24 1000227
AX-76297974 24 1000999
AX-76297977 24 1001279
AX-76297978 24 1001552
AX-76297979 24 1001892
AX-76297985 24 1002443
AX-76297989 24 1002815
AX-76297993 24 1003894
AX-76297994 24 1004444
and another with several columns (file 2):
24 991 3 2 51.39 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 -5 1 1 1 1 1 1 1 1 1
24 1000227 4 1 35496.64 0 0 0.077 0 0 0.077 0 0 0 0 0.308 0 0 0 0 -5 0 0 0 0 0 0 0 0
24 1068 3 4 257.06 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 -5 1 1 1 1 1 1 1 1 1
24 1002443 4 2 66.67 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 -5 1 1 1 1 1 0.95 1 1 1
24 1094 3 4 98.21 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1
What I want to do is join these two files on column 3 of file 1 and column 2 of file 2, to get as output all the columns of file 2, like this:
24 1000227 4 1 35496.64 0 0 0.077 0 0 0.077 0 0 0 0 0.308 0 0 0 0 -5 0 0 0 0 0 0 0 0
24 1002443 4 2 66.67 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 -5 1 1 1 1 1 0.95 1 1 1
If you have a solution, can you please explain it in detail so I can adapt it to different columns?
Thanks in advance.
Something like
join -1 3 -2 2 file1 file2
should do it:
-1 3 tells join to use column three (3) of the first file (-1)
-2 2 tells join to use column two (2) of the second file (-2)
Note that join expects both inputs to be sorted on their join fields, so you may need to run the files through sort first. You may also need to specify the separator; -t takes a single character, so for tab-separated files use (in bash, where $'\t' expands to a literal tab):
join -t $'\t' -1 3 -2 2 file1 file2
Have a look at the man page for the join command.
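For comparison, the same filtering can be done in pandas (a sketch, assuming whitespace-separated files named file1 and file2):

import pandas as pd

f1 = pd.read_csv('file1', sep=r'\s+', header=None)
f2 = pd.read_csv('file2', sep=r'\s+', header=None)
# keep the rows of file 2 whose second column (index 1) appears
# in the third column (index 2) of file 1
out = f2[f2[1].isin(f1[2])]
print(out.to_string(index=False, header=False))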
