I have a dataframe df and I want to fill values in a column based on a condition applied to other columns.
Structure of the df (after ID there are some more columns, elided as "..."; values in those, like Q3 in row 4, should be ignored):
ID  ...  col1  col2  col3  col4
1        A1    A1    A1    A1
2              G3          D5
3        R6          L4
4   Q3
5              M5    N8
I want to create two new columns called 'final_col' and 'status', where 'final_col' takes its value from col1, col2, col3 or col4, depending on which of those columns has the first non-blank (not null/NaN) value.
The column 'status' holds the name of that column.
Expected Output:
ID  ...  col1  col2  col3  col4  final_col  status
1        A1    A1    A1    A1    A1         col1
2              G3          D5    G3         col2
3        R6          L4          R6         col1
4   Q3                           Not_found  Not_found
5              M5    N8          M5         col2
I know how to do this in Excel with nested IFs, assuming col1 is in column 'A'.
In the first row of 'final_col':
=IF(A2<>"",A2,IF(B2<>"",B2,IF(C2<>"",C2,IF(D2<>"",D2,"Not_found"))))
For column 'status'
=IF(A2<>"","col1",IF(B2<>"","col2",IF(C2<>"","col3",IF(D2<>"","col4","Not_found"))))
P.S.: Please use column names in your solution, not positional indices, because the structure of the dataframe may vary (the order of the columns).
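(For reference, the answers below work from this minimal reproduction of the frame, with the blanks as NaN; the elided middle columns and row 3's L4 are left out, matching the printouts that follow:)
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID':   [1, 2, 3, 4, 5],
                   'col1': ['A1', np.nan, 'R6', np.nan, np.nan],
                   'col2': ['A1', 'G3', np.nan, np.nan, 'M5'],
                   'col3': ['A1', np.nan, np.nan, np.nan, 'N8'],
                   'col4': ['A1', 'D5', np.nan, np.nan, np.nan]})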
You can use first_valid_index. If all values in some row in columns col1 to col4 can be NaN, use:
print(df)
   ID col1 col2 col3 col4
0   1   A1   A1   A1   A1
1   2  NaN   G3  NaN   D5
2   3   R6  NaN  NaN  NaN
3   4  NaN  NaN  NaN  NaN
4   5  NaN   M5   N8  NaN
def f1(x):
    # name of the first column holding a non-NaN value, or 'Not_found'
    if x.first_valid_index() is None:
        return 'Not_found'
    else:
        return str(x.first_valid_index())

def f2(x):
    # value of the first column holding a non-NaN value, or 'Not_found'
    if x.first_valid_index() is None:
        return 'Not_found'
    else:
        return x[x.first_valid_index()]

# .ix was removed in pandas 1.0; take the columns from 'col1' onwards positionally
sub = df.iloc[:, df.columns.get_loc('col1'):]
df['status'] = sub.apply(f1, axis=1)
df['final_col'] = sub.apply(f2, axis=1)
print(df)
   ID col1 col2 col3 col4     status  final_col
0   1   A1   A1   A1   A1       col1         A1
1   2  NaN   G3  NaN   D5       col2         G3
2   3   R6  NaN  NaN  NaN       col1         R6
3   4  NaN  NaN  NaN  NaN  Not_found  Not_found
4   5  NaN   M5   N8  NaN       col2         M5
You could use first_valid_index:
In [105]: df
Out[105]:
   ID col1 col2 col3 col4
0   1   A1   A1   A1   A1
1   2  NaN   G3  NaN   D5
2   3   R6  NaN  NaN  NaN
3   4  NaN  NaN  NaN  NaN
4   5  NaN   M5   N8  NaN
df['status'] = df.iloc[:, 1:].apply(lambda x: x.first_valid_index(), axis=1)
df['final_col'] = df.iloc[:, 1:].apply(lambda x: x[x['status']] if x['status'] is not None else 'Not found', axis=1)
df['status'] = df['status'].fillna('Not found')
In [129]: df
Out[129]:
   ID col1 col2 col3 col4     status  final_col
0   1   A1   A1   A1   A1       col1         A1
1   2  NaN   G3  NaN   D5       col2         G3
2   3   R6  NaN  NaN  NaN       col1         R6
3   4  NaN  NaN  NaN  NaN  Not found  Not found
4   5  NaN   M5   N8  NaN       col2         M5
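Both answers call a Python function once per row; on larger frames a vectorized sketch of the same idea may be preferable (assuming the reproduction above; idxmax picks the first True per row of the notna mask, and the bfill trick also appears in an answer further below):
import numpy as np

cols = ['col1', 'col2', 'col3', 'col4']  # explicit names, as the question asks
mask = df[cols].notna()
# first column label that is non-NaN; guard rows that are all NaN
df['status'] = np.where(mask.any(axis=1), mask.idxmax(axis=1), 'Not_found')
# back-filling along the row leaves the first non-blank value in the first column
df['final_col'] = df[cols].bfill(axis=1).iloc[:, 0].fillna('Not_found')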
I'd like to reshape a dataframe so that its first column is used to group the other columns under an additional header row.
Initial dataframe
import pandas as pd

df = pd.DataFrame(
    {
        'col1': ['A', 'A', 'A', 'B', 'B', 'B'],
        'col2': [1, 2, 3, 4, 5, 6],
        'col3': [1, 2, 3, 4, 5, 6],
        'col4': [1, 2, 3, 4, 5, 6],
        'colx': [1, 2, 3, 4, 5, 6]
    }
)
Trial:
Using pd.pivot() I can create something close, but it does not match my expected output; the grouping seems to be flipped:
df.pivot(columns='col1', values=['col2','col3','col4','colx'])
     col2      col3      col4      colx
col1    A    B    A    B    A    B    A    B
0     1.0  NaN  1.0  NaN  1.0  NaN  1.0  NaN
1     2.0  NaN  2.0  NaN  2.0  NaN  2.0  NaN
2     3.0  NaN  3.0  NaN  3.0  NaN  3.0  NaN
3     NaN  4.0  NaN  4.0  NaN  4.0  NaN  4.0
4     NaN  5.0  NaN  5.0  NaN  5.0  NaN  5.0
5     NaN  6.0  NaN  6.0  NaN  6.0  NaN  6.0
Expected output:
     A                   B
  col2 col3 col4 colx col2 col3 col4 colx
0    1    1    1    1    4    4    4    4
1    2    2    2    2    5    5    5    5
2    3    3    3    3    6    6    6    6
Create a counter column with GroupBy.cumcount, then use DataFrame.pivot, swap the levels of the column MultiIndex with DataFrame.swaplevel, sort it, and finally remove the index and column names with DataFrame.rename_axis:
df = (df.assign(g=df.groupby('col1').cumcount())
        .pivot(index='g', columns='col1')
        .swaplevel(0, 1, axis=1)
        .sort_index(axis=1)
        .rename_axis(index=None, columns=[None, None]))
print(df)
     A                   B
  col2 col3 col4 colx col2 col3 col4 colx
0    1    1    1    1    4    4    4    4
1    2    2    2    2    5    5    5    5
2    3    3    3    3    6    6    6    6
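The counter g is what aligns the two groups on a shared row index; on the original df (before the pivot) it just enumerates the rows within each col1 group, which you can check with:
print(df.groupby('col1').cumcount().tolist())  # [0, 1, 2, 0, 1, 2]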
As an alternative to the classical pivot, you can concat the output of groupby with a dictionary comprehension, ensuring alignment with reset_index:
out = pd.concat({k: d.drop(columns='col1').reset_index(drop=True)
                 for k, d in df.groupby('col1')}, axis=1)
Output:
     A                   B
  col2 col3 col4 colx col2 col3 col4 colx
0    1    1    1    1    4    4    4    4
1    2    2    2    2    5    5    5    5
2    3    3    3    3    6    6    6    6
I have a pandas dataframe as below:
import numpy as np
import pandas as pd

data = {'A': [1, 2, 3],
        'B': [2, 17, 17],
        'C1': ["C1", np.nan, np.nan],
        'C2': ["C2", "C2", np.nan]}
# Create DataFrame
df = pd.DataFrame(data)
Dataframe:
   A   B   C1   C2
0  1   2   C1   C2
1  2  17  NaN   C2
2  3  17  NaN  NaN
I am creating a column "C" based on the logic and code below:
if any of the C columns (C1, C2, C3, ...) has a value, "C" takes the value(s) from those columns.
df['C'] = df.filter(regex=r'C\d+').stack().groupby(level=0).agg(','.join)
Result:
   A   B   C1   C2      C
0  1   2   C1   C2  C1,C2
1  2  17  NaN   C2     C2
2  3  17  NaN  NaN    NaN
Now, I want to apply the following logic:
if "C" holds more than one value (say C1, C2) for any row, create a new row and append the 2nd value. So I want my output to look like this:
   A   B   C1   C2    C
0  1   2   C1   C2   C1
0  1   2   C1   C2   C2
1  2  17  NaN   C2   C2
2  3  17  NaN  NaN  NaN
We can do it with explode, then concat:
s = (df.filter(regex=r'C\d+')
       .stack()
       .groupby(level=0)
       .agg(list)
       .explode()
       .to_frame('C')
       .join(df))
s = pd.concat([s, df[~df.index.isin(s.index)]],
              axis=0, join='outer', ignore_index=True, sort=False)
s
Out[62]:
     C  A   B   C1   C2
0   C1  1   2   C1   C2
1   C2  1   2   C1   C2
2   C2  2  17  NaN   C2
3  NaN  3  17  NaN  NaN
You could do:
df.merge(df.melt(['A', 'B'], value_name='C').dropna().drop('variable', axis=1),
         how='left')
   A   B   C1   C2    C
0  1   2   C1   C2   C1
1  1   2   C1   C2   C2
2  2  17  NaN   C2   C2
3  3  17  NaN  NaN  NaN
You can just use df.explode(...), try:
# note: aggregate into a list, not a string, so explode can split it
df['C'] = df.filter(regex=r'C\d+').stack().groupby(level=0).agg(list)
df = df.explode("C")
Outputs:
   A   B   C1   C2    C
0  1   2   C1   C2   C1
0  1   2   C1   C2   C2
1  2  17  NaN   C2   C2
2  3  17  NaN  NaN  NaN
I have a pandas dataframe and I would like to add a row at the end showing the average of each column; however, due to NaN values in Col2, Col3, and Col4, the mean function does not return the correct column averages. How can I fix this?
  Col1  Col2  Col3  Col4
1    A    11    10   NaN
2    B    14   NaN    15
3    C    45    16     0
4    D   NaN    16   NaN
5    E    12    23     5
P.S. This is the dataframe after getting the average (df.loc["mean"] = df.mean()):
     Col1  Col2  Col3  Col4
1       A    11    10   NaN
2       B    14   NaN    15
3       C    45    16     0
4       D   NaN    16   NaN
5       E    12    23     5
Mean  NaN   NaN   NaN   NaN
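(A minimal reproduction for the answers below; as the first answer diagnoses, it assumes the numeric columns arrived as strings:)
import numpy as np
import pandas as pd

df = pd.DataFrame({'Col1': list('ABCDE'),
                   'Col2': ['11', '14', '45', np.nan, '12'],
                   'Col3': ['10', np.nan, '16', '16', '23'],
                   'Col4': [np.nan, '15', '0', np.nan, '5']},
                  index=[1, 2, 3, 4, 5])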
The problem is that the columns contain not numbers but their string representations, so first convert them to numeric with DataFrame.astype:
cols = ['Col2','Col3','Col4']
df[cols] = df[cols].astype(float)
df.loc["mean"] = df.mean()
print(df)
     Col1  Col2   Col3       Col4
1       A  11.0  10.00        NaN
2       B  14.0    NaN  15.000000
3       C  45.0  16.00   0.000000
4       D   NaN  16.00        NaN
5       E  12.0  23.00   5.000000
mean  NaN  20.5  16.25   6.666667
Or, if there are some non-numeric values, use to_numeric with errors='coerce':
cols = ['Col2','Col3','Col4']
df[cols] = df[cols].apply(lambda x: pd.to_numeric(x, errors='coerce'))
df.loc["mean"] = df.mean()
You can make sure NaNs are skipped with skipna=True (the default) when calculating the mean, then append the result as a row; DataFrame.append was removed in pandas 2.0, so pd.concat is used here:
# assumes Col2..Col4 are already numeric (see the conversion above)
mean_row = df.mean(axis=0, skipna=True, numeric_only=True).rename('Mean')
df = pd.concat([df, mean_row.to_frame().T])
print(df)
     Col1  Col2   Col3       Col4
0       A  11.0  10.00        NaN
1       B  14.0    NaN  15.000000
2       C  45.0  16.00   0.000000
3       D   NaN  16.00        NaN
4       E  12.0  23.00   5.000000
Mean  NaN  20.5  16.25   6.666667
I have a pandas dataframe as below:
import numpy as np
import pandas as pd

data = {'A': [1, 2, 3],
        'B': [2, 17, 17],
        'C1': ["C1", np.nan, np.nan],
        'C2': [np.nan, "C2", np.nan]}
# Create DataFrame
df = pd.DataFrame(data)
df
   A   B   C1   C2
0  1   2   C1  NaN
1  2  17  NaN   C2
2  3  17  NaN  NaN
I want to create a column "C" based on "C1" and "C2" (there could also be "C4", "C5", and so on). If any of the C columns has a value, "C" takes the value from that column. My output in this case should look like this:
   A   B   C1   C2    C
0  1   2   C1  NaN   C1
1  2  17  NaN   C2   C2
2  3  17  NaN  NaN  NaN
Try this:
df1 = df.filter(regex=r'^C\d+')
df['C'] = df1[df1.isin(df1.columns)].bfill(axis=1).iloc[:, 0]
Out[117]:
   A   B   C1   C2    C
0  1   2   C1  NaN   C1
1  2  17  NaN   C2   C2
2  3  17  NaN  NaN  NaN
If you want to strictly compare values against their own column name, use eq instead of isin, as follows:
df['C'] = df1[df1.eq(df1.columns, axis=1)].bfill(axis=1).iloc[:, 0]
IIUC
df['C']=df.filter(like='C').bfill(axis=1).iloc[:,0]
df
   A   B   C1   C2    C
0  1   2   C1  NaN   C1
1  2  17  NaN   C2   C2
2  3  17  NaN  NaN  NaN
IIUC,
we can filter your columns by the C prefix, then aggregate the values with an agg call:
df['C'] = df.filter(regex=r'C\d+').stack().groupby(level=0).agg(','.join)
print(df)
   A   B   C1   C2    C
0  1   2   C1  NaN   C1
1  2  17  NaN   C2   C2
2  3  17  NaN  NaN  NaN
I have a pandas table with data in only some of the intersections between the rows and columns. See below:
      col1  col2  col3  col4  col5
row1     1
row2                 1     1
row3     1
row4                       1     1
row5           1
I want to sort the columns so that the columns that have a 1 in the intersection with row1 come first, then the columns intersecting row2, and so on. Like below:
      col1  col3  col4  col5  col2
row1     1
row2           1     1
row3     1
row4                 1     1
row5                             1
Thank you for any suggestions.
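(A minimal reproduction, assuming the empty cells are NaN, which the answer below also assumes:)
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, np.nan, 1, np.nan, np.nan],
                   'col2': [np.nan, np.nan, np.nan, np.nan, 1],
                   'col3': [np.nan, 1, np.nan, np.nan, np.nan],
                   'col4': [np.nan, 1, np.nan, 1, np.nan],
                   'col5': [np.nan, np.nan, np.nan, 1, np.nan]},
                  index=['row1', 'row2', 'row3', 'row4', 'row5'])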
If those empty cells are NaN, you can just use idxmax() on notnull():
orders = df.notnull().agg(['any', 'idxmax']).T
col_orders = orders.sort_values(['any', 'idxmax'],
                                ascending=[False, True]).index
df[col_orders]
Output:
      col1  col3  col4  col5  col2
row1   1.0   NaN   NaN   NaN   NaN
row2   NaN   1.0   1.0   NaN   NaN
row3   1.0   NaN   NaN   NaN   NaN
row4   NaN   NaN   1.0   1.0   NaN
row5   NaN   NaN   NaN   NaN   1.0
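An equivalent positional variant, in case you prefer NumPy (a sketch against the same df; all-NaN columns, if any existed, would be pushed to the end):
import numpy as np

mask = df.notna()
# position of the first non-null row per column; offset all-NaN columns past the end
key = mask.to_numpy().argmax(axis=0) + np.where(mask.any(), 0, len(df))
df = df.iloc[:, np.argsort(key, kind='stable')]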