List column names having values greater than zero

I have the following dataframe:
A | B | C | D
1 0 2 1
0 1 1 0
0 0 0 1
I want to add a new column that lists, for each row, every column whose value is greater than zero, together with that value:
A | B | C | D | New
1 0 2 1 A-1, C-2, D-1
0 1 1 0 B-1, C-1
0 0 0 1 D-1

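We can use mask and stack:
s = (df.mask(df == 0).stack().dropna()   # keep only the non-zero cells
       .astype(int).astype(str)
       .reset_index(level=1).apply('-'.join, axis=1)   # "column-value" per cell
       .groupby(level=0).agg(','.join))   # one string per row; sum(level=0) was removed in pandas 2.0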
df['New']=s
df
Out[170]:
A B C D New
0 1 0 2 1 A-1,C-2,D-1
1 0 1 1 0 B-1,C-1
2 0 0 0 1 D-1

Combine the column names with the df values that are not zero and then filter out the None values.
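import numpy as np
import pandas as pd

df = pd.read_clipboard()
# "column-value" where the cell is non-zero, None elsewhere
arrays = np.where(df != 0, df.columns.values + '-' + df.values.astype('str'), None)
# drop the None placeholders row by row
df['New'] = [list(filter(None, array)) for array in arrays]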
df
Out[1]:
A B C D New
0 1 0 2 1 [A-1, C-2, D-1]
1 0 1 1 0 [B-1, C-1]
2 0 0 0 1 [D-1]
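The second answer stores a list in each cell. If the comma-separated string from the question is wanted instead, a plain row-wise apply is one option. This is only a minimal sketch: the helper name label_positive is made up for illustration, and the frame is rebuilt by hand from the question's sample.

```python
import pandas as pd

# sample data rebuilt from the question
df = pd.DataFrame({"A": [1, 0, 0], "B": [0, 1, 0], "C": [2, 1, 0], "D": [1, 0, 1]})

def label_positive(row):
    # hypothetical helper: keep "column-value" pairs only for values > 0
    return ", ".join(f"{col}-{val}" for col, val in row.items() if val > 0)

# apply to the original value columns only, one row at a time
df["New"] = df[["A", "B", "C", "D"]].apply(label_positive, axis=1)
print(df)
```

A row-wise apply is usually slower than the vectorised answers on large frames, but it is easy to read and to adapt.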

Related

Dataframe conversion

I am trying to convert a data frame into a 1,0 matrix format
data = pd.DataFrame({'Val1':['A','B','B'],
'Val2':['C','A','D'],
'Val3':['E','F','C'],
'Comb':['Comb1','Comb2','Comb3']})
data:
Val1 Val2 Val3 Comb
0 A C E Comb1
1 B A F Comb2
2 B D C Comb3
What I need is to convert it to the data frame below:
Comb A C D E B F
0 Comb1 1 1 0 1 0 0
1 Comb2 1 0 0 0 1 1
2 Comb3 0 1 1 0 1 0
I was able to do it with a for loop, but as my dataframe grows, the processing time increases. Is there a better way to do it?
header = set(data[['Val1','Val2','Val3']].values.ravel())
matrix = pd.DataFrame(columns=header)
for i in range(data.shape[0]):
    # one dict per row: each letter present in the row maps to 1
    temp_dict = {data["Val1"].iloc[i]:1, data["Val2"].iloc[i]:1, data["Val3"].iloc[i]:1}
    # DataFrame.append was removed in pandas 2.0, so this loop needs an older pandas
    matrix = matrix.append(temp_dict, ignore_index=True)
matrix = matrix.loc[:, matrix.columns.notnull()]
matrix = matrix.fillna(0)
matrix = pd.merge(data[["Comb"]],matrix, left_index=True, right_index=True, how= 'outer')
Thanks!
There may be a better solution, but this is what came to my mind: convert each row to a dictionary of "present" letters, build a Series from the dictionary, and combine the Series into a dataframe.
data.loc[:, 'Val1':'Val3'].apply(lambda row:
pd.Series({letter: 1 for letter in row}), axis=1)\
.fillna(0).astype(int).join(data.Comb)
# A B C D E F Comb
#0 1 0 1 0 1 0 Comb1
#1 1 1 0 0 0 1 Comb2
#2 0 1 1 1 0 0 Comb3
There are probably multiple ways to solve this; I used pd.crosstab for it:
import pandas as pd
data = pd.DataFrame({'Val1':['A','B','B'],
'Val2':['C','A','D'],
'Val3':['E','F','C'],
'Comb':['Comb1','Comb2','Comb3']})
data["lst"] = data[['Val1', 'Val2', 'Val3']].values.tolist()
data = data.explode("lst")
print(pd.crosstab(data["Comb"], data["lst"]))
Out[20]:
lst A B C D E F
Comb
Comb1 1 0 1 0 1 0
Comb2 1 1 0 0 0 1
Comb3 0 1 1 1 0 0
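This should also work: one-hot encode the three columns without a prefix, then sum the columns that share a name. groupby(axis=1) is deprecated in recent pandas, so the grouping is done on the transpose:
pd.get_dummies(data.set_index('Comb'), columns=['Val1','Val2','Val3'], prefix="", prefix_sep="", dtype=int).T.groupby(level=0).sum().T.reset_index()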
Here's another way (sum(level=0) was removed in pandas 2.0, so group on the index level instead):
data.melt('Comb').set_index('Comb')['value'].str.get_dummies().groupby(level=0).sum().reset_index()
Output:
Comb A B C D E F
0 Comb1 1 0 1 0 1 0
1 Comb2 1 1 0 0 0 1
2 Comb3 0 1 1 1 0 0
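One thing the answers above leave open is what happens when a letter repeats within a row: the counting approaches would then produce a 2 instead of a 1. The sample data has no repeats, so this is an assumption about larger inputs. A hedged sketch that melts to long form, cross-tabulates, and clips the counts so the result stays binary:

```python
import pandas as pd

data = pd.DataFrame({'Val1': ['A', 'B', 'B'],
                     'Val2': ['C', 'A', 'D'],
                     'Val3': ['E', 'F', 'C'],
                     'Comb': ['Comb1', 'Comb2', 'Comb3']})

# long form: one (Comb, letter) pair per row
long = data.melt(id_vars='Comb', value_name='letter')

# count pairs, then cap at 1 in case a letter repeats within a Comb
matrix = pd.crosstab(long['Comb'], long['letter']).clip(upper=1).reset_index()
matrix.columns.name = None  # drop the leftover 'letter' axis label
print(matrix)
```

Because nothing is looped over in Python, this should scale better than appending row by row.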

Merge multiple binary encoded rows into one in pandas dataframe

I have a pandas.DataFrame that looks like this:
A B C D E F
0 0 1 0 0 0
1 1 0 0 0 0
2 0 1 0 0 0
3 0 0 0 1 0
4 0 0 1 0 0
Several rows share a 1 in the same column, and each row contains only one 1. I want to merge the rows so that the resulting DataFrame consists of only one row that combines all the 1s of the dataframe, like this:
A B C D E F
0 1 1 1 1 0
Is there a smart, easy way to do this with pandas?
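Use DataFrame.sum, then compare for greater or equal with Series.ge, and finally convert to 0/1 with Series.astype (Series.view, which older pandas versions allowed here, is deprecated since pandas 2.2):
s = df.sum().ge(1).astype('i1')
Another idea, if the values are only 0 and 1, is to use DataFrame.any and convert the boolean mask to 0/1:
s = df.any().astype('i1')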
print (s)
A 1
B 1
C 1
D 1
E 1
F 0
dtype: int8
We can do
df.sum().ge(1).astype(int)
Out[316]:
A 1
B 1
C 1
D 1
E 1
F 0
dtype: int32
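Both answers return a Series, while the question asks for a one-row DataFrame. A small sketch of that last step, using made-up sample data with one 1 per row (the exact layout of the question's table is ambiguous, so these values are an assumption):

```python
import pandas as pd

# assumed sample: six columns, exactly one 1 in each row
df = pd.DataFrame(
    [[0, 0, 1, 0, 0, 0],
     [1, 0, 0, 0, 0, 0],
     [0, 1, 0, 0, 0, 0],
     [0, 0, 0, 0, 1, 0],
     [0, 0, 0, 1, 0, 0]],
    columns=list("ABCDEF"),
)

# column-wise "any", back to 0/1, then turn the Series into a single row
merged = df.any().astype(int).to_frame().T
print(merged)
```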

How to identify where a particular sequence in a row occurs for the first time

I have a dataframe in pandas, an example of which is provided below:
Person appear_1 appear_2 appear_3 appear_4 appear_5 appear_6
A 1 0 0 1 0 0
B 1 1 0 0 1 0
C 1 0 1 1 0 0
D 0 0 1 0 0 1
E 1 1 1 1 1 1
As you can see, 1 and 0 occur randomly in different columns. I would like Python code that finds the column number where the 1 0 0 pattern occurs for the first time. For example, for member A the first 1 0 0 pattern starts at appear_1, so the first occurrence is 1. Similarly, for member B the first 1 0 0 pattern starts at appear_2, so the first occurrence is 2. The resulting table should have a new column named 'first_occurrence'. If no 1 0 0 pattern occurs (as in row E), the value in that column should be the number of 1s in that row. The resulting table should look something like this:
Person appear_1 appear_2 appear_3 appear_4 appear_5 appear_6 first_occurrence
A 1 0 0 1 0 0 1
B 1 1 0 0 1 0 2
C 1 0 1 1 0 0 4
D 0 0 1 0 0 1 3
E 1 1 1 1 1 1 6
Thank you in advance.
I try not to reinvent the wheel, so this builds on my answer to the previous question. On top of that answer, you additionally need idxmax, np.where, and get_indexer:
import numpy as np

cols = ['appear_1', 'appear_2', 'appear_3', 'appear_4', 'appear_5', 'appear_6']
df1 = df[cols]
m = df1[df1.eq(1)].ffill(axis=1).notna()   # True from the first 1 onward
df2 = df1[m].bfill(axis=1).eq(0)   # zeros that come after the first 1
m2 = df2 & df2.shift(-1, axis=1, fill_value=True)   # a zero followed by a zero (or by the end of the row)
df['first_occurrence'] = np.where(m2.any(axis=1), df1.columns.get_indexer(m2.idxmax(axis=1)),
                                  df1.shape[1])
Out[540]:
Person appear_1 appear_2 appear_3 appear_4 appear_5 appear_6 first_occurrence
0 A 1 0 0 1 0 0 1
1 B 1 1 0 0 1 0 2
2 C 1 0 1 1 0 0 4
3 D 0 0 1 0 0 1 3
4 E 1 1 1 1 1 1 6
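For checking a vectorised result on small inputs, a plain-Python scan of each row can serve as a reference. This is a sketch rather than the answer's method: the function name first_occurrence is invented here, and it follows the question's wording literally, falling back to the count of 1s when no 1 0 0 pattern exists.

```python
import pandas as pd

df = pd.DataFrame({
    "Person":   list("ABCDE"),
    "appear_1": [1, 1, 1, 0, 1],
    "appear_2": [0, 1, 0, 0, 1],
    "appear_3": [0, 0, 1, 1, 1],
    "appear_4": [1, 0, 1, 0, 1],
    "appear_5": [0, 1, 0, 0, 1],
    "appear_6": [0, 0, 0, 1, 1],
})
cols = [c for c in df.columns if c.startswith("appear_")]

def first_occurrence(values):
    # hypothetical helper: 1-based position where the first 1 0 0 starts
    for i in range(len(values) - 2):
        if values[i:i + 3] == [1, 0, 0]:
            return i + 1
    # no pattern found: fall back to the number of 1s in the row
    return sum(values)

df["first_occurrence"] = [first_occurrence(row) for row in df[cols].to_numpy().tolist()]
print(df)
```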

Pattern identification and sequence detection

I have a dataset 'df' that looks something like this:
MEMBER seen_1 seen_2 seen_3 seen_4 seen_5 seen_6
A 1 0 0 1 0 1
B 1 1 0 0 1 0
C 1 1 1 0 0 1
D 0 0 1 0 0 1
As you can see, there are several rows of ones and zeros. Can anyone suggest Python code that counts how many 1s occur consecutively right before the first occurrence of 1, 0, 0 in that order? For example, for member A the first double-zero event occurs at seen_2 and seen_3, so the event is 1. Similarly, for member B the first double-zero event occurs at seen_3 and seen_4, and two 1s occur before it, so the event is 2. The resulting table should have a new column 'event', something like this:
MEMBER seen_1 seen_2 seen_3 seen_4 seen_5 seen_6 event
A 1 0 0 1 0 1 1
B 1 1 0 0 1 0 2
C 1 1 1 0 0 1 3
D 0 0 1 0 0 1 1
My approach:
df = df.set_index('MEMBER')
# count 1 on each rows since the last 0
s = (df.stack()
.groupby(['MEMBER', df.eq(0).cumsum(1).stack()])
.cumsum().unstack()
)
# mask of the zeros:
u = s.eq(0)
# look for the first 1 0 0
idx = (~u &
u.shift(-1, axis=1, fill_value=False) &
u.shift(-2, axis=1, fill_value=False) ).idxmax(1)
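# look up the run length at that column for each row
# (DataFrame.lookup was removed in pandas 2.0, so index cell by cell)
df['event'] = [s.at[member, col] for member, col in idx.items()]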
Test data:
MEMBER seen_1 seen_2 seen_3 seen_4 seen_5 seen_6
0 A 1 0 1 0 0 1
1 B 1 1 0 0 1 0
2 C 1 1 1 0 0 1
3 D 0 0 1 0 0 1
4 E 1 0 1 1 0 0
Output:
MEMBER seen_1 seen_2 seen_3 seen_4 seen_5 seen_6 event
0 A 1 0 1 0 0 1 1
1 B 1 1 0 0 1 0 2
2 C 1 1 1 0 0 1 3
3 D 0 0 1 0 0 1 1
4 E 1 0 1 1 0 0 2
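The same rule can be written as a simple loop, which is handy for verifying the output above on a few rows. This is an illustrative sketch: the name ones_before_first_100 is made up, and returning 0 when a row has no 1 0 0 pattern is an arbitrary choice, since the question does not cover that case.

```python
import pandas as pd

# the answer's test data
df = pd.DataFrame({
    "MEMBER": list("ABCDE"),
    "seen_1": [1, 1, 1, 0, 1],
    "seen_2": [0, 1, 1, 0, 0],
    "seen_3": [1, 0, 1, 1, 1],
    "seen_4": [0, 0, 0, 0, 1],
    "seen_5": [0, 1, 0, 0, 0],
    "seen_6": [1, 0, 1, 1, 0],
})
cols = [c for c in df.columns if c.startswith("seen_")]

def ones_before_first_100(values):
    # hypothetical helper: length of the run of 1s that ends where the first 1 0 0 starts
    run = 0
    for i, v in enumerate(values):
        run = run + 1 if v == 1 else 0
        if v == 1 and values[i + 1:i + 3] == [0, 0]:
            return run
    return 0  # assumption: no pattern in the row

df["event"] = [ones_before_first_100(row) for row in df[cols].to_numpy().tolist()]
print(df)
```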

Comparing two different sized pandas Dataframes and to find the row index with equal values

I need some help comparing two pandas dataframes.
I have two dataframes.
The first dataframe is
df1 =
a b c d
0 1 1 1 1
1 0 1 0 1
2 0 0 0 1
3 1 1 1 1
4 1 0 1 0
5 1 1 1 0
6 0 0 1 0
7 0 1 0 1
and the second dataframe is
df2 =
a b c d
0 1 1 1 1
1 1 0 1 0
2 0 0 1 0
I want to find the row indices of dataframe 1 (df1) whose entire row matches one of the rows in dataframe 2 (df2). My expected result would be
0
3
4
6
The indices above do not need to be in order; all I want is the matching indices of dataframe 1 (df1).
Is there a way without using for loop?
Thanks
Tommy
You can use merge:
df1.merge(df2,indicator=True,how='left').loc[lambda x : x['_merge']=='both'].index
Out[459]: Int64Index([0, 3, 4, 6], dtype='int64')
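The merge above returns positions in the merged frame, which coincide with df1's labels only when df1 has the default 0..n-1 index and df2 has no duplicate rows. If either of those might not hold (an assumption about other inputs, not about the question's data), one possible variant carries the original index through the merge as a column:

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 0, 0, 1, 1, 1, 0, 0],
                    "b": [1, 1, 0, 1, 0, 1, 0, 1],
                    "c": [1, 0, 0, 1, 1, 1, 1, 0],
                    "d": [1, 1, 1, 1, 0, 0, 0, 1]})
df2 = pd.DataFrame({"a": [1, 1, 0],
                    "b": [1, 0, 0],
                    "c": [1, 1, 1],
                    "d": [1, 0, 0]})

# reset_index() keeps df1's labels in an "index" column;
# drop_duplicates() stops repeated df2 rows from multiplying matches
matches = (df1.reset_index()
              .merge(df2.drop_duplicates(), on=list(df2.columns), how="inner")["index"]
              .tolist())
print(matches)
```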
