Pattern identification in a dataset using Python (python-3.x)

I have a dataframe that looks something like this:
empl_ID  day_1  day_2  day_3  day_4  day_5  day_6  day_7  day_8  day_9  day_10
      1      1      1      1      1      1      1      0      1      1       1
      2      0      0      1      1      1      1      1      1      1       0
      3      0      1      0      0      1      1      1      1      1       1
      4      1      0      1      0      1      1      1      0      1       0
      5      1      0      0      1      1      1      1      1      1       1
      6      0      0      0      0      1      1      1      1      1       1
As we can see, there are 6 employees, and a 1 indicates presence on that day. I want to write Python code that detects 2 consecutive absences (i.e., the pattern 0, 0 on day i, day i+1) within a time frame of 6 days starting from the day the person begins employment.
For example, employee 1 begins work in column day_1, which holds his first 1. So, if we do not observe any consecutive 0, 0 in columns day_1 to day_6, that record should be labeled '0'. The same holds for employee 2 (cols: day_3 to day_8), employee 4 (cols: day_1 to day_6) and employee 6 (cols: day_5 to day_10), so they are labeled '0'.
However, employee 3 (cols: day_2 to day_7) and employee 5 (cols: day_1 to day_6) do contain a 0, 0 pattern within the 6-day window starting from their first 1, and thus should be labeled '1'.
It would be really helpful if someone could help me write code to achieve the above objective. Thanks in advance!
The result should look something like this:
empl_ID  day_1  day_2  day_3  day_4  day_5  day_6  day_7  day_8  day_9  day_10  label
      1      1      1      1      1      1      1      0      1      1       1      0
      2      0      0      1      1      1      1      1      1      1       0      0
      3      0      1      0      0      1      1      1      1      1       1      1
      4      1      0      1      0      1      1      1      0      1       0      0
      5      1      0      0      1      1      1      1      1      1       1      1
      6      0      0      0      0      1      1      1      1      1       1      0

Check with idxmax and a loop with shift:
s = df.set_index('empl_ID')
# column position of each employee's first 1 (idxmax returns the first maximum)
idx = s.columns.get_indexer(s.idxmax(axis=1))
# within each 6-day window, a 0 that equals the previous day's value is the second 0 of a 0, 0 pair
l = [(s.iloc[t, x:y].eq(s.iloc[t, x:y].shift()) & s.iloc[t, x:y].eq(0)).any()
     for t, x, y in zip(df.index, idx, idx + 6)]
df['Label'] = l
df
empl_ID day_1 day_2 day_3 day_4 ... day_7 day_8 day_9 day_10 Label
0 1 1 1 1 1 ... 0 1 1 1 False
1 2 0 0 1 1 ... 1 1 1 0 False
2 3 0 1 0 0 ... 1 1 1 1 True
3 4 1 0 1 0 ... 1 0 1 0 False
4 5 1 0 0 1 ... 1 1 1 1 True
5 6 0 0 0 0 ... 1 1 1 1 False
[6 rows x 12 columns]
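For readability, the same check can also be written as a plain per-row helper, which produces 0/1 labels directly (a sketch, not the answer above; has_double_absence is a made-up name, and it assumes every employee has at least one 1):

def has_double_absence(row, window=6):
    # row holds the day_1 .. day_10 values for one employee, in order
    days = row.tolist()
    start = days.index(1)               # first day of presence (assumes a 1 exists)
    vals = days[start:start + window]
    # scan every adjacent pair of days inside the window for a 0, 0 pattern
    return int(any(a == 0 and b == 0 for a, b in zip(vals, vals[1:])))

day_cols = [c for c in df.columns if c.startswith('day_')]
df['label'] = df[day_cols].apply(has_double_absence, axis=1)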

Related

ValueError in clustering evaluation: Expected 2D array, got 1D array instead

import numpy as np
from sklearn.cluster import KMeans

df_2D = df[['sepal-length', 'petal-length']]  # df is assumed to hold the iris measurements
df_2D = np.array(df_2D)
k_means_2D_model = KMeans(n_clusters=3, max_iter=1000).fit(df_2D)
Error:
ValueError: Expected 2D array, got 1D array instead:
array=[0 1 0 0 2 2 1 2 1 1 0 1 0 2 0 2 1 2 0 2 2 1 0 1 2 0 0 2 1 0 2 0 1 1 0 0 2
1 2 1 0 0 1 1 1 1 2 1 2 2 0 2 2 2 0 1 1 1 1 0 1 2 1 2 1 2 2 1 2 1 0 0 2 1
1 0 2 0 0 1 1 0 1 2 1 1 0 1 0 1 0 0 0 2 1 1 1 2 0 2 0 0 2 2 0 0 1 2 2 1 0
2 2 1 1 2 0 0 2 2 2 0 0 1 0 1 0 2 2 2 0 0 0 0 2 1 2 2 2 2 1 1 1 2 2 0 0 0
1 0].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
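The array in the traceback contains cluster labels (0/1/2), which suggests a 1D label array is being passed to a step that expects a 2D feature matrix. As the message itself says, reshaping fixes the shape mismatch (a sketch; labels is a hypothetical stand-in for the offending 1D array):

import numpy as np

labels = np.array([0, 1, 0, 0, 2])          # hypothetical 1D array like the one in the error
as_single_feature = labels.reshape(-1, 1)   # shape (5, 1): one feature per row
as_single_sample = labels.reshape(1, -1)    # shape (1, 5): one sample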

make the lower half of an n*n list zero without using any functions in python

I tried to solve it using 2 for loops and an if statement, but I was unable to get the desired output.
INPUT-
1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1
thislist = [1]*10
thislist = [thislist]*10
print(thislist)
for i in range(10):
    for j in range(10):
        print(thislist[i][j], end=" ")
    print()
print()
for i in range(10):
    for j in range(10):
        if i > j:
            thislist[i][j] = 0
for i in range(10):
    for j in range(10):
        print(thislist[i][j], end=" ")
    print()
This was the output I got:
0 0 0 0 0 0 0 0 0 1
0 0 0 0 0 0 0 0 0 1
0 0 0 0 0 0 0 0 0 1
0 0 0 0 0 0 0 0 0 1
0 0 0 0 0 0 0 0 0 1
0 0 0 0 0 0 0 0 0 1
0 0 0 0 0 0 0 0 0 1
0 0 0 0 0 0 0 0 0 1
0 0 0 0 0 0 0 0 0 1
0 0 0 0 0 0 0 0 0 1
But when I made the list using the method below, I got the desired output.
thislist = [[1,1,1,1,1,1,1,1,1,1],
            [1,1,1,1,1,1,1,1,1,1],
            [1,1,1,1,1,1,1,1,1,1],
            [1,1,1,1,1,1,1,1,1,1],
            [1,1,1,1,1,1,1,1,1,1],
            [1,1,1,1,1,1,1,1,1,1],
            [1,1,1,1,1,1,1,1,1,1],
            [1,1,1,1,1,1,1,1,1,1],
            [1,1,1,1,1,1,1,1,1,1],
            [1,1,1,1,1,1,1,1,1,1]]
print(thislist)
for i in range(10):
    for j in range(10):
        if i > j:
            thislist[i][j] = 0
for i in range(10):
    for j in range(10):
        print(thislist[i][j], end=" ")
    print()
Note: this is the desired output:
1 1 1 1 1 1 1 1 1 1
0 1 1 1 1 1 1 1 1 1
0 0 1 1 1 1 1 1 1 1
0 0 0 1 1 1 1 1 1 1
0 0 0 0 1 1 1 1 1 1
0 0 0 0 0 1 1 1 1 1
0 0 0 0 0 0 1 1 1 1
0 0 0 0 0 0 0 1 1 1
0 0 0 0 0 0 0 0 1 1
0 0 0 0 0 0 0 0 0 1
Can someone explain what the difference between the two versions is?
As you pointed out, the problem comes from the way you created your list of lists. In your first example, you do something like this:
list1 = [1]*10
list_of_list1 = [list1]*10
list_of_list1 is actually a list of ten references to the same list1 object; no copies are made at all. So when you modify a value through one row of list_of_list1, the modification shows up in every row.
What you want instead is a separate list per row (for nested structures, the related concepts to search for are shallow copy and deep copy).
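A two-line demonstration of the aliasing:

row = [1, 1]
grid = [row] * 2     # two references to the same list object
grid[0][0] = 0
print(grid)          # [[0, 1], [0, 1]] -- "both" rows changed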
In the meantime, you can simply build each row as its own list:
thislist = []
for row in range(10):
    list1 = [1]*10
    thislist.append(list1)
But I usually go with numpy when it is available.
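For instance, numpy can zero the lower half directly (a short sketch using np.triu, which keeps the upper triangle including the main diagonal):

import numpy as np

a = np.ones((10, 10), dtype=int)
a = np.triu(a)   # everything below the main diagonal becomes 0
print(a)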

Vectorized way of using the previous row value based on the condition

I have a pandas dataframe as below. I want to apply the following condition:
if column 'A' is 1, update the value of column 'F' with the previous value of 'F'. This can be done by iterating row by row, but that is not an efficient way of doing it; I want a vectorized method.
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 0, 0, 0, 1, 0, 0],
                   'C': [1, 1, 1, 0, 0, 0, 1, 1, 1],
                   'D': [1, 1, 1, 0, 0, 0, 1, 1, 1],
                   'F': [2, 0, 0, 0, 0, 1, 1, 1, 1]})
df
A C D F
0 1 1 1 2
1 1 1 1 0
2 1 1 1 0
3 0 0 0 0
4 0 0 0 0
5 0 0 0 1
6 1 1 1 1
7 0 1 1 1
8 0 1 1 1
My desired output:
A C D F
0 1 1 1 2
1 1 1 1 2
2 1 1 1 2
3 0 0 0 0
4 0 0 0 0
5 0 0 0 1
6 1 1 1 1
7 0 1 1 1
8 0 1 1 1
I tried the code below, but it does not work: when I use shift, it does not see the updated previous row.
df['F'] = df.groupby(['A'])['F'].shift(1)
df
A C D F
0 1 1 1 NaN
1 1 1 1 2.0
2 1 1 1 0.0
3 0 0 0 NaN
4 0 0 0 0.0
5 0 0 0 0.0
6 1 1 1 0.0
7 0 1 1 1.0
8 0 1 1 1.0
transform('first')
df.F.groupby(df.A.rsub(1).cumsum()).transform('first')
0 2
1 2
2 2
3 0
4 0
5 1
6 1
7 1
8 1
Name: F, dtype: int64
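To see why this works, inspect the grouper: df.A.rsub(1).cumsum() computes 1 - A and cumulatively sums it, so the counter increments at every 0 in A. Each run of 1s therefore falls into the same group as the row that precedes it (or the very first row), and transform('first') broadcasts that row's F across the run:

print(df.A.rsub(1).cumsum().tolist())
# [0, 0, 0, 1, 2, 3, 3, 4, 5]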
Assign to column 'F'
df.assign(F=df.F.groupby(df.A.rsub(1).cumsum()).transform('first'))
A C D F
0 1 1 1 2
1 1 1 1 2
2 1 1 1 2
3 0 0 0 0
4 0 0 0 0
5 0 0 0 1
6 1 1 1 1
7 0 1 1 1
8 0 1 1 1
We can also do it without groupby:
# mark the first row of each run of 1s in A
where = df['A'].eq(1) & df['A'].ne(df['A'].shift())
# keep F only on those rows, forward-fill across the run,
# then restore the original F wherever A is not 1
df['F'] = df['F'].where(where).ffill().mask(df['A'].ne(1), df['F'])
print(df)
A C D F
0 1 1 1 2.0
1 1 1 1 2.0
2 1 1 1 2.0
3 0 0 0 0.0
4 0 0 0 0.0
5 0 0 0 1.0
6 1 1 1 1.0
7 0 1 1 1.0
8 0 1 1 1.0

Sorting dataframe and creating new columns based on the rank of element

I have the following dataframe:
import pandas as pd
df = pd.DataFrame(
    {
        'id': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
        'name': ['A', 'B', 'C', 'D', 'A', 'B', 'C', 'D', 'A', 'B', 'C', 'D'],
        'Value': [1, 2, 3, 4, 5, 6, 0, 2, 4, 6, 3, 5]
    },
    columns=['name', 'id', 'Value'])
I can sort the data using id and value as shown below:
df.sort_values(['id','Value'],ascending = [True,False])
The printed table appears as follows:
name id Value
D 1 4
C 1 3
B 1 2
A 1 1
B 2 6
A 2 5
D 2 2
C 2 0
B 3 6
D 3 5
A 3 4
C 3 3
I would like to create 4 new columns (Rank1, Rank2, Rank3, Rank4): if an element's Value is the highest within its id, Rank1 is set to 1, else 0; if it is the second highest, Rank2 is set to 1, else 0.
The same goes for Rank3 and Rank4.
How could I do that?
Thanks.
Use (on the dataframe sorted as above):
df = df.join(pd.get_dummies(df.groupby('id').cumcount().add(1)).add_prefix('Rank'))
print (df)
name id Value Rank1 Rank2 Rank3 Rank4
3 D 1 4 1 0 0 0
2 C 1 3 0 1 0 0
1 B 1 2 0 0 1 0
0 A 1 1 0 0 0 1
5 B 2 6 1 0 0 0
4 A 2 5 0 1 0 0
7 D 2 2 0 0 1 0
6 C 2 0 0 0 0 1
9 B 3 6 1 0 0 0
11 D 3 5 0 1 0 0
8 A 3 4 0 0 1 0
10 C 3 3 0 0 0 1
Details:
For a counter per group use GroupBy.cumcount, then add 1:
print (df.groupby('id').cumcount().add(1))
3 1
2 2
1 3
0 4
5 1
4 2
7 3
6 4
9 1
11 2
8 3
10 4
dtype: int64
For indicator columns use get_dummies with add_prefix:
print (pd.get_dummies(df.groupby('id').cumcount().add(1)).add_prefix('Rank'))
Rank1 Rank2 Rank3 Rank4
3 1 0 0 0
2 0 1 0 0
1 0 0 1 0
0 0 0 0 1
5 1 0 0 0
4 0 1 0 0
7 0 0 1 0
6 0 0 0 1
9 1 0 0 0
11 0 1 0 0
8 0 0 1 0
10 0 0 0 1
This does not require a prior sort
import numpy as np

df.join(
    pd.get_dummies(
        df.groupby('id').Value.apply(np.argsort).rsub(4)
    ).add_prefix('Rank')
)
name id Value Rank1 Rank2 Rank3 Rank4
0 D 1 4 1 0 0 0
1 C 1 3 0 1 0 0
2 B 1 2 0 0 1 0
3 A 1 1 0 0 0 1
4 B 2 6 1 0 0 0
5 A 2 5 0 1 0 0
6 D 2 2 0 0 1 0
7 C 2 0 0 0 0 1
8 B 3 6 1 0 0 0
9 D 3 5 0 1 0 0
10 A 3 4 0 0 1 0
11 C 3 3 0 0 0 1
More dynamic (without hard-coding the group size):
df.join(
    pd.get_dummies(
        df.groupby('id').Value.apply(lambda x: len(x) - np.argsort(x))
    ).add_prefix('Rank')
)
name id Value Rank1 Rank2 Rank3 Rank4
0 D 1 4 1 0 0 0
1 C 1 3 0 1 0 0
2 B 1 2 0 0 1 0
3 A 1 1 0 0 0 1
4 B 2 6 1 0 0 0
5 A 2 5 0 1 0 0
6 D 2 2 0 0 1 0
7 C 2 0 0 0 0 1
8 B 3 6 1 0 0 0
9 D 3 5 0 1 0 0
10 A 3 4 0 0 1 0
11 C 3 3 0 0 0 1
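One caveat: np.argsort returns the positions that would sort the values, not the ranks themselves, so the two snippets above only line up with the Rank columns when each group already happens to be in sorted order. A rank-based variant (a sketch, not part of the answers above) is independent of row order:

# rank 1 = highest Value within each id group; ties are broken by position
ranks = df.groupby('id')['Value'].rank(ascending=False, method='first').astype(int)
df.join(pd.get_dummies(ranks).astype(int).add_prefix('Rank'))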

Can I use the join command to join two files that match on different columns?

I have two files, one with three columns (file 1):
AX-76297970 24 1000227
AX-76297974 24 1000999
AX-76297977 24 1001279
AX-76297978 24 1001552
AX-76297979 24 1001892
AX-76297985 24 1002443
AX-76297989 24 1002815
AX-76297993 24 1003894
AX-76297994 24 1004444
and another with several columns (file 2):
24 991 3 2 51.39 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 -5 1 1 1 1 1 1 1 1 1
24 1000227 4 1 35496.64 0 0 0.077 0 0 0.077 0 0 0 0 0.308 0 0 0 0 -5 0 0 0 0 0 0 0 0
24 1068 3 4 257.06 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 -5 1 1 1 1 1 1 1 1 1
24 1002443 4 2 66.67 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 -5 1 1 1 1 1 0.95 1 1 1
24 1094 3 4 98.21 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1
What I want to do is join these two files on column 3 of file 1 and column 2 of file 2, to get as output all the columns of file 2, like this:
24 1000227 4 1 35496.64 0 0 0.077 0 0 0.077 0 0 0 0 0.308 0 0 0 0 -5 0 0 0 0 0 0 0 0
24 1002443 4 2 66.67 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 -5 1 1 1 1 1 0.95 1 1 1
If you have a solution, can you please explain it in detail so I can adapt it to different columns?
Thanks in advance.
Something like
join -1 3 -2 2 file1 file2
should do it:
-1 3 tells join to use column three (3) of the first file (-1)
-2 2 tells join to use column two (2) of the second file (-2)
Note that join expects both inputs to be sorted on their join fields, so you may need to run the files through sort first. You may also need to specify the separator; -t takes a single character, so for tab-separated files use (in bash, where $'\t' expands to a literal tab):
join -t $'\t' -1 3 -2 2 file1 file2
Have a look at the man page for the join command.
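For comparison, the same filtering can be done in pandas (a sketch, assuming whitespace-separated files named file1 and file2):

import pandas as pd

f1 = pd.read_csv('file1', sep=r'\s+', header=None)
f2 = pd.read_csv('file2', sep=r'\s+', header=None)
# keep the rows of file 2 whose second column (index 1) appears
# in the third column (index 2) of file 1
out = f2[f2[1].isin(f1[2])]
print(out.to_string(index=False, header=False))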
