How to subset a DataFrame by only a column having multiple entries?

I have a pandas DataFrame df that looks like this:
0 1
C1 V1
C2 V1
C3 V1
C4 V2
C5 V3
C6 V3
C7 V4
I wish to subset df to only those rows whose value in column 1 appears more than once, the desired output being:
0 1
C1 V1
C2 V1
C3 V1
C5 V3
C6 V3
How do I do this?

I think you need boolean indexing with a mask created by DataFrame.duplicated with keep=False, which marks all duplicates as True:
print (df.columns)
Index(['0', '1'], dtype='object')
mask = df.duplicated('1', keep=False)
#another solution with Series.duplicated
#mask = df['1'].duplicated(keep=False)
print (mask)
0 True
1 True
2 True
3 False
4 True
5 True
6 False
dtype: bool
print (df[mask])
0 1
0 C1 V1
1 C2 V1
2 C3 V1
4 C5 V3
5 C6 V3
If the column names are integers, the solution is the same, just with integer labels:
print (df.columns)
Int64Index([0, 1], dtype='int64')
mask = df.duplicated(1, keep=False)
#another solution with Series.duplicated
#mask = df[1].duplicated(keep=False)
print (mask)
0 True
1 True
2 True
3 False
4 True
5 True
6 False
dtype: bool
print (df[mask])
0 1
0 C1 V1
1 C2 V1
2 C3 V1
4 C5 V3
5 C6 V3
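A different way to get the same subset (a sketch, not part of the answer above) is to count how often each value of column 1 occurs and keep rows where that count is greater than 1, for example with groupby and transform('size'); column names here follow the string-label variant:
import pandas as pd

df = pd.DataFrame({'0': ['C1', 'C2', 'C3', 'C4', 'C5', 'C6', 'C7'],
                   '1': ['V1', 'V1', 'V1', 'V2', 'V3', 'V3', 'V4']})

# size of each value's group in column '1', aligned back to the rows
counts = df.groupby('1')['1'].transform('size')
print (df[counts > 1])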

Related

How to reorganize/restructure values in a dataframe with no column header by referring to a master dataframe in python?

Master Dataframe:
B     D     E
b1    d1    e1
b2    d2    e2
b3    d3
      d4
      d5
Dataframe with no column name:
b1
d3    e1
d2
b2    e2
e1
d5    e1
How do I convert the dataframe above to something like the table below (with column names) by referring to the master dataframe?
B     D     E
b1
      d3    e1
      d2
b2          e2
            e1
      d5    e1
Thank you in advance for your help!
One way would be to make a mapping dict, then reindex each row:
# Mapping dict
d = {}
for k, v in df.to_dict("list").items():
    d.update(**dict.fromkeys(set(v) - {np.nan}, k))
# or pandas approach
d = df.melt().dropna().set_index("value")["variable"].to_dict()
def reorganize(ser):
    data = [i for i in ser if pd.notna(i)]
    ind = [d.get(i, i) for i in data]
    return pd.Series(data, index=ind)
df2.apply(reorganize, axis=1)
Output:
B D E
0 b1 NaN NaN
1 NaN d3 e1
2 NaN d2 NaN
3 b2 NaN e2
4 NaN NaN e1
5 NaN d5 e1
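For reference, a minimal self-contained version of the approach above (a sketch; it assumes the master frame is df and the unlabeled frame is df2, padded with NaN as in the question):
import numpy as np
import pandas as pd

# master frame: each column lists its valid values, padded with NaN
df = pd.DataFrame({"B": ["b1", "b2", "b3", np.nan, np.nan],
                   "D": ["d1", "d2", "d3", "d4", "d5"],
                   "E": ["e1", "e2", np.nan, np.nan, np.nan]})

# frame with no column names
df2 = pd.DataFrame([["b1", np.nan], ["d3", "e1"], ["d2", np.nan],
                    ["b2", "e2"], ["e1", np.nan], ["d5", "e1"]])

# value -> column-name mapping built from the master frame
d = df.melt().dropna().set_index("value")["variable"].to_dict()

def reorganize(ser):
    data = [i for i in ser if pd.notna(i)]
    return pd.Series(data, index=[d.get(i, i) for i in data])

print (df2.apply(reorganize, axis=1))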
It's not a beautiful answer, but I think I was able to do it by using .loc. I don't think you need to use Master Dataframe.
import pandas as pd
df = pd.DataFrame({'col1': ['b1', 'd3', 'd2', 'b2', 'e1', 'd5'],
'col2': ['', 'e1', '', 'e2', '', 'e1']},
columns=['col1', 'col2'])
df
# col1 col2
# 0 b1
# 1 d3 e1
# 2 d2
# 3 b2 e2
# 4 e1
# 5 d5 e1
df_reshaped = pd.DataFrame()
for index, row in df.iterrows():
    for col in df.columns:
        i = row[col]
        j = i[0] if i != '' else ''
        if j != '':
            df_reshaped.loc[index, j] = i
df_reshaped.columns = df_reshaped.columns.str.upper()
df_reshaped
# B D E
# 0 b1 NaN NaN
# 1 NaN d3 e1
# 2 NaN d2 NaN
# 3 b2 NaN e2
# 4 NaN NaN e1
# 5 NaN d5 e1

Ungroup pandas dataframe column values separated by comma

Hello, I have a grouped pandas DataFrame whose column values are comma-separated strings, and I want to ungroup it. The DataFrame looks like this:
col1       col2        name  exams
0,0,0      0,0,0       A1    exm1,exm2,exm3
0,1,0,20   0,0,2,20    A2    exm1,exm2,exm4,exm5
0,0,0,30   0,0,20,20   A3    exm1,exm2,exm3,exm5
The output I want:
col1 col2 name exam
0 0 A1 exm1
0 0 A1 exm2
0 0 A1 exm3
0 0 A2 exm1
1 0 A2 exm2
0 2 A2 exm4
20 20 A2 exm5
..............
30 20 A3 exm5
I tried the approach from Split (explode) pandas dataframe string entry to separate rows, but was not able to work out a proper approach. Can anyone please suggest how I can get this output?
Try with explode; note that explode is a new function added in pandas 0.25.0:
df[['col1','col2','exams']]=df[['col1','col2','exams']].apply(lambda x : x.str.split(','))
df = df.join(pd.concat([df.pop(x).explode() for x in ['col1','col2','exams']],axis=1))
Out[62]:
name col1 col2 exams
0 A1 0 0 exm1
0 A1 0 0 exm2
0 A1 0 0 exm3
1 A2 0 0 exm1
1 A2 1 0 exm2
1 A2 0 2 exm4
1 A2 20 20 exm5
2 A3 0 0 exm1
2 A3 0 0 exm2
2 A3 0 20 exm3
2 A3 30 20 exm5
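If you are on a newer pandas (1.3 or later, to the best of my knowledge), DataFrame.explode also accepts a list of columns, so the pop/concat step can be skipped; a self-contained sketch that rebuilds the example input:
import pandas as pd

df = pd.DataFrame({'col1': ['0,0,0', '0,1,0,20', '0,0,0,30'],
                   'col2': ['0,0,0', '0,0,2,20', '0,0,20,20'],
                   'name': ['A1', 'A2', 'A3'],
                   'exams': ['exm1,exm2,exm3',
                             'exm1,exm2,exm4,exm5',
                             'exm1,exm2,exm3,exm5']})

cols = ['col1', 'col2', 'exams']
df[cols] = df[cols].apply(lambda x: x.str.split(','))

# explode all three list columns at once (lists must have equal lengths per row)
print (df.explode(cols, ignore_index=True))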

Adding new Column in Dataframe and updating row values as other columns name based on condition

I have a dataframe with columns a, c1, c2, c3, c4.
df =
a c1 c2 c3 c4
P1 1 0 0 0
P2 0 0 0 1
P3 1 0 0 0
P4 0 1 0 0
On above df, I want to do following operations:
Add a new column main, whose value is the name of the column that contains 1 for that row.
For example, the first row will have 'c1' in its main column; similarly, the second row will have 'c4'.
The resulting df will look like below:
df =
a c1 c2 c3 c4 main
P1 1 0 0 0 c1
P2 0 0 0 1 c4
P3 1 0 0 0 c1
P4 0 1 0 0 c2
I am new to python and dataframes. Please help.
Use DataFrame.dot for matrix multiplication:
If a is the first column, omit it by indexing:
df['main'] = df.iloc[:, 1:].dot(df.columns[1:])
#if possible multiple 1 per row
#df['main'] = df.iloc[:, 1:].dot(df.columns[1:] + ',').str.rstrip(',')
print (df)
a c1 c2 c3 c4 main
0 P1 1 0 0 0 c1
1 P2 0 0 0 1 c4
2 P3 1 0 0 0 c1
3 P4 0 1 0 0 c2
If a is index:
df['main'] = df.dot(df.columns)
#if possible multiple 1 per row
#df['main'] = df.dot(df.columns + ',').str.rstrip(',')
print (df)
c1 c2 c3 c4 main
a
P1 1 0 0 0 c1
P2 0 0 0 1 c4
P3 1 0 0 0 c1
P4 0 1 0 0 c2
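If every row is guaranteed to contain exactly one 1, another common option (a sketch, not part of the answer above) is DataFrame.idxmax, which returns the label of the first column holding the row maximum:
import pandas as pd

df = pd.DataFrame({'a': ['P1', 'P2', 'P3', 'P4'],
                   'c1': [1, 0, 1, 0],
                   'c2': [0, 0, 0, 1],
                   'c3': [0, 0, 0, 0],
                   'c4': [0, 1, 0, 0]})

# skip the first column 'a' and take the column label of the per-row maximum
df['main'] = df.iloc[:, 1:].idxmax(axis=1)
print (df)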

How can I create a frequency matrix using all columns?

Let's say that I have a dataset that contains 5 binary columns and 2 rows.
It looks like this:
c1 c2 c3 c4 c5
r1 0 1 0 1 0
r2 1 1 1 1 0
I want to create a matrix that gives, for each pair of columns, the number of rows in which both columns are 1. Kind of like a confusion matrix.
My desired output is:
c1 c2 c3 c4 c5
c1 - 1 1 1 0
c2 1 - 1 2 0
c3 1 1 - 1 0
c4 1 2 1 - 0
I have used pandas crosstab, but it only gives the desired output when using 2 columns. I want to use all of the columns.
Use dot:
df.T.dot(df)
# same as
# df.T # df
c1 c2 c3 c4 c5
c1 1 1 1 1 0
c2 1 2 1 2 0
c3 1 1 1 1 0
c4 1 2 1 2 0
c5 0 0 0 0 0
You can use np.fill_diagonal to make the diagonal zero
d = df.T.dot(df)
np.fill_diagonal(d.to_numpy(), 0)
d
c1 c2 c3 c4 c5
c1 0 1 1 1 0
c2 1 0 1 2 0
c3 1 1 0 1 0
c4 1 2 1 0 0
c5 0 0 0 0 0
And as long as we're using Numpy, you could go all the way...
a = df.to_numpy()
b = a.T # a
np.fill_diagonal(b, 0)
pd.DataFrame(b, df.columns, df.columns)
c1 c2 c3 c4 c5
c1 0 1 1 1 0
c2 1 0 1 2 0
c3 1 1 0 1 0
c4 1 2 1 0 0
c5 0 0 0 0 0
A way of using melt and merge with groupby
s=df.reset_index().melt('index').loc[lambda x : x.value==1]
s.merge(s,on='index').query('variable_x!=variable_y').groupby(['variable_x','variable_y'])['value_x'].sum().unstack(fill_value=0)
Out[32]:
variable_y c1 c2 c3 c4
variable_x
c1 0 1 1 1
c2 1 0 1 2
c3 1 1 0 1
c4 1 2 1 0
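For reference, a self-contained sketch of the dot-product approach above, rebuilding the two example rows from the question:
import numpy as np
import pandas as pd

df = pd.DataFrame([[0, 1, 0, 1, 0],
                   [1, 1, 1, 1, 0]],
                  index=['r1', 'r2'],
                  columns=['c1', 'c2', 'c3', 'c4', 'c5'])

# entry (i, j) counts the rows where columns i and j are both 1
co = df.T.dot(df)
np.fill_diagonal(co.to_numpy(), 0)   # zero out the self-co-occurrence diagonal
print (co)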

Change index of crosstab in pandas dataframe

I have the following DataFrame df. I retrieved a subset of df without NaN values.
#df is:
DT RE FE SE C_Step
0 D1 E1 F1 S1 poor
1 D2 E3 F2 NaN UNKNOWN
2 D1 E3 NaN S2 UNKNOWN
3 D1 NaN F1 S1 poor
4 D2 NaN F1 S2 poor
5 D2 E3 NaN S1 fair
6 D1 E3 F1 S2 fair
7 D2 E2 F1 S1 UNKNOWN
8 D2 E2 F1 S1 fair
9 D2 E2 NaN NaN good
10 D2 E2 F1 S1 UNKNOWN
11 D1 E3 F2 S1 UNKNOWN
12 D2 E1 F2 S2 UNKNOWN
13 D2 E1 F1 S2 poor
14 D2 E3 F1 S1 fair
15 D1 E3 F1 S2 UNKNOWN
df_subset = df[~(df.iloc[:, 0:4].isnull().any(1))]
print(df_subset)
#df_subset is:
DT RE FE SE C_Step
0 D1 E1 F1 S1 poor
6 D1 E3 F1 S2 fair
7 D2 E2 F1 S1 UNKNOWN
8 D2 E2 F1 S1 fair
10 D2 E2 F1 S1 UNKNOWN
11 D1 E3 F2 S1 UNKNOWN
12 D2 E1 F2 S2 UNKNOWN
13 D2 E1 F1 S2 poor
14 D2 E3 F1 S1 fair
15 D1 E3 F1 S2 UNKNOWN
After this, I create a cross-tab from each of df and df_subset, with 'C_Step' as the index and 'RE' as the columns.
Cross-tab from df:
c1 = pd.crosstab([df.C_Step],[df.RE],dropna=True)
print(c1)
RE E1 E2 E3
C_Step
UNKNOWN 1 2 4
fair 0 1 3
good 0 1 0
poor 2 0 0
Cross tab from df_subset:
c1 = pd.crosstab([df_subset.C_Step],[df_subset.RE],dropna=False)
print(c1)
RE E1 E2 E3
C_Step
UNKNOWN 1 2 2
fair 0 1 2
poor 2 0 0
Question: the indexes of the two cross-tabs differ. How can I make the index of the cross-tab generated from df_subset the same as the one from df? The category 'good' is missing in the cross-tab of df_subset.
The desired cross-tab of df_subset is:
RE E1 E2 E3
C_Step
UNKNOWN 1 2 2
fair 0 1 2
good 0 0 0
poor 2 0 0
Use reindex with parameter fill_value=0:
c2 = pd.crosstab([df_subset.C_Step], [df_subset.RE], dropna=False)
c2 = c2.reindex(c1.index, fill_value=0)
print(c2)
RE E1 E2 E3
C_Step
UNKNOWN 1 2 2
fair 0 1 2
good 0 0 0
poor 2 0 0
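If you would rather not build c1 first, a variant (a sketch, assuming df and df_subset as defined in the question) is to take the full set of categories directly from the original column and reindex with that:
# all C_Step categories present in the full frame, in sorted order
all_steps = sorted(df['C_Step'].unique())

c2 = pd.crosstab([df_subset.C_Step], [df_subset.RE], dropna=False)
c2 = c2.reindex(all_steps, fill_value=0)
print(c2)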
