Combine columns in pandas to create a new column - python-3.x

Hello, I am working on a pandas DataFrame and want to create a new column by combining multiple columns and applying a condition to them. I am looking for a smart way to do it.
Suppose the DataFrame looks like this:
A B C D
1 0 0 0
0 1 0 0
0 0 1 0
1 0 1 0
1 1 1 0
0 0 1 1
My output column should be as below
A B C D Output_col
1 0 0 0 A
0 1 0 0 B
0 0 1 0 C
1 0 1 0 A_C
1 1 1 0 A_B_C
0 0 1 1 C_D
I can certainly achieve this using the code below, but then I have to repeat it for every column.
test['Output_col'] = test.A.apply(lambda x: 'A' if x > 0 else 0)
I was wondering if there is a way to achieve this without applying it to every column separately, in case I have a very large number of columns.
Thanks in advance !!

Use DataFrame.apply + join.
With axis=1, each row is passed to the lambda as a Series x, so x.index holds the column names. Boolean indexing with Series.eq then keeps only the columns whose value is 1:
test['Output_col']=test.apply(lambda x: '_'.join(x.index[x.eq(1)]),axis=1)
print(test)
A B C D Output_col
0 1 0 0 0 A
1 0 1 0 0 B
2 0 0 1 0 C
3 1 0 1 0 A_C
4 1 1 1 0 A_B_C
5 0 0 1 1 C_D
To apply only a list of columns:
my_list_columns=['enter element of your list']
test['Output_col']=test[my_list_columns].apply(lambda x: '_'.join(x.index[x.eq(1)]),axis=1)
print(test)
To handle rows where all columns are 0:
my_list_columns=['A','B','C','D']
df['Output_col']=df[my_list_columns].apply(lambda x: '_'.join(x.index[x.eq(1)]) if x.eq(1).any() else 'no_value',axis=1)
print(df)
A B C D Output_col
0 1 0 0 0 A
1 0 0 0 0 no_value
2 0 0 1 0 C
3 1 0 1 0 A_C
4 1 0 1 0 A_C
5 0 0 1 1 C_D
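The apply + join approach above can be put together as a small self-contained sketch. The helper name join_flagged and its sep and empty parameters are made up for illustration; the data is the example frame from the question.

```python
import pandas as pd

# Example frame from the question
test = pd.DataFrame(
    {
        "A": [1, 0, 0, 1, 1, 0],
        "B": [0, 1, 0, 0, 1, 0],
        "C": [0, 0, 1, 1, 1, 1],
        "D": [0, 0, 0, 0, 0, 1],
    }
)


def join_flagged(row, sep="_", empty="no_value"):
    """Join the labels of the entries equal to 1 (hypothetical helper)."""
    names = row.index[row.eq(1)]  # column names where the row holds 1
    return sep.join(names) if len(names) else empty


flag_cols = ["A", "B", "C", "D"]  # restrict to the columns that hold flags
test["Output_col"] = test[flag_cols].apply(join_flagged, axis=1)
print(test)
```

Selecting flag_cols first keeps the new column itself out of the row that is passed to the helper if the line is run a second time.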

Edit: for a subset of columns (using the second method shown further down, the one without str.replace)
cols = ['A', 'B']
df1 = df[cols]
s = df1.columns + '-'
df['Output_col'] = df1.dot(s).str[:-1]
Out[54]:
A B C D Output_col
0 1 0 0 0 A
1 0 1 0 0 B
2 0 0 1 0
3 1 0 1 0 A
4 1 1 1 0 A-B
5 0 0 1 1
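Try this combination of str.replace and dot (regex=True is required in pandas 2.0 and later, where str.replace treats the pattern as a literal string by default):
df['Output_col'] = df.dot(df.columns).str.replace(r'(?<!^)(?!$)','-',regex=True)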
Out[32]:
A B C D Output_col
0 1 0 0 0 A
1 0 1 0 0 B
2 0 0 1 0 C
3 1 0 1 0 A-C
4 1 1 1 0 A-B-C
5 0 0 1 1 C-D
If you feel uneasy with the regex pattern, you may try this way without using str.replace:
s = df.columns + '-'
df['Output_col'] = df.dot(s).str[:-1]
Out[50]:
A B C D Output_col
0 1 0 0 0 A
1 0 1 0 0 B
2 0 0 1 0 C
3 1 0 1 0 A-C
4 1 1 1 0 A-B-C
5 0 0 1 1 C-D
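The dot-based answers rely on Python's string repetition: 1 * 'A-' is 'A-' and 0 * 'A-' is the empty string, so summing the products across a row concatenates only the flagged names. A plain-Python sketch of what happens for each row (the function name dot_row is made up for illustration, and a non-empty separator is assumed):

```python
columns = ["A", "B", "C", "D"]
rows = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
]


def dot_row(flags, names, sep="-"):
    """Mimic one row of df.dot(df.columns + sep).str[:-1] (hypothetical helper)."""
    # flag * text repeats the text flag times: once for 1, not at all for 0
    joined = "".join(flag * (name + sep) for flag, name in zip(flags, names))
    return joined[:-len(sep)]  # drop the trailing separator


outputs = [dot_row(row, columns) for row in rows]
print(outputs)
```

This also shows why the trick only works as intended for 0/1 data: a value of 2 would repeat the column name twice.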

This builds on a solution provided by @Jezrael:
df['Output_col'] = df.dot(df.columns.str.cat(['_']*len(df.columns),sep='')).str.strip('_')
A B C D Output_col
0 1 0 0 0 A
1 0 1 0 0 B
2 0 0 1 0 C
3 1 0 1 0 A_C
4 1 1 1 0 A_B_C
5 0 0 1 1 C_D

Related

Returning column header corresponding to matched value in separate sheet

Sheet 1
Name  Gender
w     0
e     1
r     2
t     4
y     6
u     2
i     NoMatch
q     1
w     1
e     1
r     2
Sheet 2 - Note sheet 2 has 2 "w" under Name column
Name  Male 1  Female 2  other 3  other 4  other 5  Donotknow 6
w     0       0         0        0        0        0
w     1       0         0        0        0        0
a     0       0         0        0        0        1
q     1       0         0        0        0        0
r     0       1         0        0        0        0
e     1       0         0        0        0        0
t     0       0         0        1        1        0
y     0       0         0        0        0        1
u     0       1         0        0        0        0
I am using this formula in Sheet 1 under Gender:
=IFERROR(FILTER({1,2,3,4,5,6},INDEX(Sheet2!$B$2:$G$10,MATCH(A2,Sheet2!$A$2:$A$10,0),0)=1),"NoMatch")
If you can live with the fact that a zero stands for 'No Match', then try:
Formula in B1:
=BYROW(A2:A12,LAMBDA(a,MIN(IF((D2:D10=a)*E2:J10,SEQUENCE(,6),""))))
If not, then change it to:
=LET(X,BYROW(A2:A12,LAMBDA(a,MIN(IF((D2:D10=a)*E2:J10,SEQUENCE(,6),"")))),IF(X,X,"No Match"))

Returning column header corresponding to matched value - follow up

Sheet 1
Name  Gender
w     1
e     1
r     2
t     4
y     6
u     2
i     NoMatch
q     1
w     1
e     1
r     2
Sheet 2
Name  Male 1  Female 2  other 3  other 4  other 5  Donotknow 6
w     1       0         0        0        0        0
a     0       0         0        0        0        1
q     1       0         0        0        0        0
r     0       1         0        0        0        0
e     1       0         0        0        0        0
t     0       0         0        1        1        0
y     0       0         0        0        0        1
u     0       1         0        0        0        0
I am using this formula in Sheet 1 under Gender:
=IFERROR(INDEX({1,2,3,4,5,6},MATCH(1,INDEX(Sheet2!$B$2:$G$9,MATCH(A2,Sheet2!$A$2:$A$9,0),0),0)),"No Match")
What if I have 2 matches? For example, for t the columns other 4 and other 5 both contain 1. I would like the output to be 4;5 in the same cell.
How can I modify the formula to reflect that?
I am using an older version of Excel, so the FILTER function is not available.

make lower half of a n*n list zero without using any functions in python

I tried to solve it using 2 for loops and an if statement, but I was unable to get the desired output.
INPUT-
1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1
thislist=[1]*10
thislist=[thislist]*10
print(thislist)
for i in range(10):
    for j in range(10):
        print(thislist[i][j], end = " ")
    print()
print()
for i in range(10):
    for j in range(10):
        if i>j:
            thislist[i][j]=0
for i in range(10):
    for j in range(10):
        print(thislist[i][j], end = " ")
    print()
This was the output I got:
0 0 0 0 0 0 0 0 0 1
0 0 0 0 0 0 0 0 0 1
0 0 0 0 0 0 0 0 0 1
0 0 0 0 0 0 0 0 0 1
0 0 0 0 0 0 0 0 0 1
0 0 0 0 0 0 0 0 0 1
0 0 0 0 0 0 0 0 0 1
0 0 0 0 0 0 0 0 0 1
0 0 0 0 0 0 0 0 0 1
0 0 0 0 0 0 0 0 0 1
But when I built the list using the method below, I got the desired output.
thislist=[[1,1,1,1,1,1,1,1,1,1],
          [1,1,1,1,1,1,1,1,1,1],
          [1,1,1,1,1,1,1,1,1,1],
          [1,1,1,1,1,1,1,1,1,1],
          [1,1,1,1,1,1,1,1,1,1],
          [1,1,1,1,1,1,1,1,1,1],
          [1,1,1,1,1,1,1,1,1,1],
          [1,1,1,1,1,1,1,1,1,1],
          [1,1,1,1,1,1,1,1,1,1],
          [1,1,1,1,1,1,1,1,1,1]]
print(thislist)
for i in range(10):
    for j in range(10):
        if i>j:
            thislist[i][j]=0
for i in range(10):
    for j in range(10):
        print(thislist[i][j], end = " ")
    print()
Note: this is the desired output:
1 1 1 1 1 1 1 1 1 1
0 1 1 1 1 1 1 1 1 1
0 0 1 1 1 1 1 1 1 1
0 0 0 1 1 1 1 1 1 1
0 0 0 0 1 1 1 1 1 1
0 0 0 0 0 1 1 1 1 1
0 0 0 0 0 0 1 1 1 1
0 0 0 0 0 0 0 1 1 1
0 0 0 0 0 0 0 0 1 1
0 0 0 0 0 0 0 0 0 1
Can someone explain the difference between the above 2 code snippets?
As you pointed out, the problem comes from the way you created your list of lists. In your first example, you do something like this:
list1 = [1]*10
list_of_list1=[list1]*10
list_of_list1 does not contain ten separate lists; it contains ten references to the same list1 object. So if you modify a value through one row, the modification shows up in every row of list_of_list1.
To get independent rows, each row has to be its own list object. You might want to search for more information on shallow and deep copies in Python.
In the meantime, you can simply try this:
thislist = []
for row in range(10):
    list1 = [1]*10
    thislist.append(list1)
But I usually go with numpy when it is available.
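The difference between the two ways of building the grid can be checked directly with the is operator. The following is a minimal sketch on a 4x4 grid; the variable names are chosen for illustration only.

```python
n = 4

# [row] * n repeats a reference to ONE list object n times
row = [1] * n
aliased = [row] * n

# A comprehension evaluates [1] * n once per row, giving n separate lists
independent = [[1] * n for _ in range(n)]

print(aliased[0] is aliased[1])          # True: both rows are the same object
print(independent[0] is independent[1])  # False: the rows are distinct objects

# Zero everything below the main diagonal
for i in range(n):
    for j in range(n):
        if i > j:
            independent[i][j] = 0

for r in independent:
    print(*r)
```

With the aliased version, the same loop would write every zero into the single shared row, which is why all rows in the question's output look identical.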

Pattern identification in a dataset using python

I have a dataframe that looks something like this:
empl_ID day_1 day_2 day_3 day_4 day_5 day_6 day_7 day_8 day_9 day_10
1 1 1 1 1 1 1 0 1 1 1
2 0 0 1 1 1 1 1 1 1 0
3 0 1 0 0 1 1 1 1 1 1
4 1 0 1 0 1 1 1 0 1 0
5 1 0 0 1 1 1 1 1 1 1
6 0 0 0 0 1 1 1 1 1 1
As we can see, we have 6 employees, and a value of 1 indicates presence on that day. I want to write Python code that detects 2 consecutive absences, i.e. the pattern 0, 0 on day i and day i+1, within a 6-day window starting from the day the person begins employment.
For example, employee 1 begins work in the day_1 column, which is the first appearance of 1. Since columns day_1 to day_6 contain no consecutive 0, 0, that record should be labeled '0'. The same holds for employee 2 (cols: day_3 to day_8), employee 4 (cols: day_1 to day_6) and employee 6 (cols: day_5 to day_10), which will also be labeled '0'.
However, employee 3 (cols: day_2 to day_7) and employee 5 (cols: day_1 to day_6) do contain a 0, 0 pattern within their respective time-frames, counted from their first presence of 1, and will thus be labeled '1'.
It would be really helpful if someone could help me in formulating a code to achieve the above objective. Thanks in advance!
The result should look something like this:
empl_ID day_1 day_2 day_3 day_4 day_5 day_6 day_7 day_8 day_9 day_10 label
1 1 1 1 1 1 1 0 1 1 1 0
2 0 0 1 1 1 1 1 1 1 0 0
3 0 1 0 0 1 1 1 1 1 1 1
4 1 0 1 0 1 1 1 0 1 0 0
5 1 0 0 1 1 1 1 1 1 1 1
6 0 0 0 0 1 1 1 1 1 1 0
Check with idxmax and a for loop with shift
s=df.set_index('empl_ID')
idx=s.columns.get_indexer(s.idxmax(1))
l=[(s.iloc[t, x :y].eq(s.iloc[t, x :y].shift())&s.iloc[t, x :y].eq(0)).any() for t , x ,y in zip(df.index,idx,idx+6)]
df['Label']=l
df
empl_ID day_1 day_2 day_3 day_4 ... day_7 day_8 day_9 day_10 Label
0 1 1 1 1 1 ... 0 1 1 1 False
1 2 0 0 1 1 ... 1 1 1 0 False
2 3 0 1 0 0 ... 1 1 1 1 True
3 4 1 0 1 0 ... 1 0 1 0 False
4 5 1 0 0 1 ... 1 1 1 1 True
5 6 0 0 0 0 ... 1 1 1 1 False
[6 rows x 12 columns]
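The same check can also be written as a small row-wise function, which makes the window logic explicit. This is only a sketch: the function name has_double_absence and its window parameter are made up for illustration, and returning 0 for an employee who is never present is an assumption the question does not cover.

```python
import pandas as pd

day_cols = [f"day_{i}" for i in range(1, 11)]
df = pd.DataFrame(
    [
        [1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1],
        [2, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0],
        [3, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1],
        [4, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0],
        [5, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1],
        [6, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
    ],
    columns=["empl_ID"] + day_cols,
)


def has_double_absence(days, window=6):
    """Return 1 if two consecutive 0s occur within `window` days of the first 1."""
    values = list(days)
    if 1 not in values:
        return 0  # assumption: never present means nothing to flag
    start = values.index(1)                # first day of presence
    span = values[start:start + window]    # the 6-day window
    # Compare each day with the next one inside the window
    return int(any(a == 0 and b == 0 for a, b in zip(span, span[1:])))


df["label"] = df[day_cols].apply(has_double_absence, axis=1)
print(df[["empl_ID", "label"]])
```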

Sorting dataframe and creating new columns based on the rank of element

I have the following dataframe:
import pandas as pd
df = pd.DataFrame(
    {
        'id': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
        'name': ['A', 'B', 'C', 'D', 'A', 'B', 'C', 'D', 'A', 'B', 'C', 'D'],
        'Value': [1, 2, 3, 4, 5, 6, 0, 2, 4, 6, 3, 5]
    },
    columns=['name','id','Value'])
I can sort the data using id and value as shown below:
df.sort_values(['id','Value'],ascending = [True,False])
The printed table looks as follows:
name id Value
D 1 4
C 1 3
B 1 2
A 1 1
B 2 6
A 2 5
D 2 2
C 2 0
B 3 6
D 3 5
A 3 4
C 3 3
I would like to create 4 new columns (Rank1, Rank2, Rank3, Rank4). If the element in the name column has the highest value within its id, Rank1 should be 1, else 0. If it has the second highest value, Rank2 should be 1, else 0.
The same applies to Rank3 and Rank4.
How could I do that?
Thanks.
Zep
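Use the following on the DataFrame sorted as in the question (cumcount numbers the rows in their current order within each id):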
df = df.join(pd.get_dummies(df.groupby('id').cumcount().add(1)).add_prefix('Rank'))
print (df)
name id Value Rank1 Rank2 Rank3 Rank4
3 D 1 4 1 0 0 0
2 C 1 3 0 1 0 0
1 B 1 2 0 0 1 0
0 A 1 1 0 0 0 1
5 B 2 6 1 0 0 0
4 A 2 5 0 1 0 0
7 D 2 2 0 0 1 0
6 C 2 0 0 0 0 1
9 B 3 6 1 0 0 0
11 D 3 5 0 1 0 0
8 A 3 4 0 0 1 0
10 C 3 3 0 0 0 1
Details:
For count per groups use GroupBy.cumcount, then add 1:
print (df.groupby('id').cumcount().add(1))
3 1
2 2
1 3
0 4
5 1
4 2
7 3
6 4
9 1
11 2
8 3
10 4
dtype: int64
For indicator columns use get_dummies with add_prefix:
print (pd.get_dummies(df.groupby('id').cumcount().add(1)).add_prefix('Rank'))
Rank1 Rank2 Rank3 Rank4
3 1 0 0 0
2 0 1 0 0
1 0 0 1 0
0 0 0 0 1
5 1 0 0 0
4 0 1 0 0
7 0 0 1 0
6 0 0 0 1
9 1 0 0 0
11 0 1 0 0
8 0 0 1 0
10 0 0 0 1
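This variant uses np.argsort per group instead of relying on a prior sort. Note that np.argsort returns the positions that would sort each group, which coincide with ranks only for some orderings, so verify the result on unsorted data:
import numpy as np
df.join(
    pd.get_dummies(
        df.groupby('id').Value.apply(np.argsort).rsub(4)
    ).add_prefix('Rank')
)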
name id Value Rank1 Rank2 Rank3 Rank4
0 D 1 4 1 0 0 0
1 C 1 3 0 1 0 0
2 B 1 2 0 0 1 0
3 A 1 1 0 0 0 1
4 B 2 6 1 0 0 0
5 A 2 5 0 1 0 0
6 D 2 2 0 0 1 0
7 C 2 0 0 0 0 1
8 B 3 6 1 0 0 0
9 D 3 5 0 1 0 0
10 A 3 4 0 0 1 0
11 C 3 3 0 0 0 1
More dynamic
df.join(
    pd.get_dummies(
        df.groupby('id').Value.apply(lambda x: len(x) - np.argsort(x))
    ).add_prefix('Rank')
)
name id Value Rank1 Rank2 Rank3 Rank4
0 D 1 4 1 0 0 0
1 C 1 3 0 1 0 0
2 B 1 2 0 0 1 0
3 A 1 1 0 0 0 1
4 B 2 6 1 0 0 0
5 A 2 5 0 1 0 0
6 D 2 2 0 0 1 0
7 C 2 0 0 0 0 1
8 B 3 6 1 0 0 0
9 D 3 5 0 1 0 0
10 A 3 4 0 0 1 0
11 C 3 3 0 0 0 1
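If the rows are not sorted beforehand, one option is to compute the rank within each id directly. The sketch below swaps in GroupBy.rank, which is not what the answers above use; method='first' is an assumed tie-breaking choice, and dtype=int is passed so the indicators print as 0/1 rather than booleans.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "id": [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
        "name": ["A", "B", "C", "D", "A", "B", "C", "D", "A", "B", "C", "D"],
        "Value": [1, 2, 3, 4, 5, 6, 0, 2, 4, 6, 3, 5],
    },
    columns=["name", "id", "Value"],
)

# Rank 1 = highest Value within each id; ties are broken by row order
rank = (
    df.groupby("id")["Value"]
    .rank(method="first", ascending=False)
    .astype(int)
)

# One indicator column per rank, aligned to df by index
dummies = pd.get_dummies(rank, dtype=int).add_prefix("Rank")
result = df.join(dummies)
print(result)
```

Because the rank is aligned by index, the original row order of df is preserved in result.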
