I am trying to create new columns a and b from the following dataframe:
a_x b_x a_y b_y
0 13.67 0.0 13.67 0.0
1 13.42 0.0 13.42 0.0
2 13.52 1.0 13.17 1.0
3 13.61 1.0 13.11 1.0
4 12.68 1.0 13.06 1.0
5 12.70 1.0 12.93 1.0
6 13.60 1.0 NaN NaN
7 12.89 1.0 NaN NaN
8 11.68 1.0 NaN NaN
9 NaN NaN 8.87 0.0
10 NaN NaN 8.77 0.0
11 NaN NaN 7.97 0.0
If b_x or b_y is 0.0 (in this case they have the same value when both exist), then a_x and a_y share the same value, so I take either of them for the new columns a and b. If b_x or b_y is 1.0, a_x and a_y differ, so I take the mean of a_x and a_y as a and either of b_x and b_y as b.
If only one of the pairs (a_x, b_x) or (a_y, b_y) is not null, I take the existing values as a and b.
My expected result looks like this:
a_x b_x a_y b_y a b
0 13.67 0.0 13.67 0.0 13.670 0
1 13.42 0.0 13.42 0.0 13.420 0
2 13.52 1.0 13.17 1.0 13.345 1
3 13.61 1.0 13.11 1.0 13.360 1
4 12.68 1.0 13.06 1.0 12.870 1
5 12.70 1.0 12.93 1.0 12.815 1
6 13.60 1.0 NaN NaN 13.600 1
7 12.89 1.0 NaN NaN 12.890 1
8 11.68 1.0 NaN NaN 11.680 1
9 NaN NaN 8.87 0.0 8.870 0
10 NaN NaN 8.77 0.0 8.770 0
11 NaN NaN 7.97 0.0 7.970 0
How can I get the result above? Thank you.
Use:
#filter all a and b columns
b = df.filter(like='b')
a = df.filter(like='a')
#test if at least one 0 or 1 value
m1 = b.eq(0).any(axis=1)
m2 = b.eq(1).any(axis=1)
#get means of a columns
a1 = a.mean(axis=1)
#forward fill missing values and select last column
b1 = b.ffill(axis=1).iloc[:, -1]
a2 = a.ffill(axis=1).iloc[:, -1]
#new Dataframe with 2 conditions
df1 = pd.DataFrame(np.select([m1, m2], [[a2, b1], [a1, b1]]), index=['a','b']).T
#join to original
df = df.join(df1)
print (df)
a_x b_x a_y b_y a b
0 13.67 0.0 13.67 0.0 13.670 0.0
1 13.42 0.0 13.42 0.0 13.420 0.0
2 13.52 1.0 13.17 1.0 13.345 1.0
3 13.61 1.0 13.11 1.0 13.360 1.0
4 12.68 1.0 13.06 1.0 12.870 1.0
5 12.70 1.0 12.93 1.0 12.815 1.0
6 13.60 1.0 NaN NaN 13.600 1.0
7 12.89 1.0 NaN NaN 12.890 1.0
8 11.68 1.0 NaN NaN 11.680 1.0
9 NaN NaN 8.87 0.0 8.870 0.0
10 NaN NaN 8.77 0.0 8.770 0.0
11 NaN NaN 7.97 0.0 7.970 0.0
However, I think the solution can be simplified: the mean can be used for both conditions, because the mean of two identical values equals either value:
#filter all a and b columns
b = df.filter(like='b')
a = df.filter(like='a')
#mean of the a columns (NaN values are skipped)
a1 = a.mean(axis=1)
#forward fill missing values and select last column
b1 = b.ffill(axis=1).iloc[:, -1]
df['a'] = a1
df['b'] = b1
print (df)
a_x b_x a_y b_y a b
0 13.67 0.0 13.67 0.0 13.670 0.0
1 13.42 0.0 13.42 0.0 13.420 0.0
2 13.52 1.0 13.17 1.0 13.345 1.0
3 13.61 1.0 13.11 1.0 13.360 1.0
4 12.68 1.0 13.06 1.0 12.870 1.0
5 12.70 1.0 12.93 1.0 12.815 1.0
6 13.60 1.0 NaN NaN 13.600 1.0
7 12.89 1.0 NaN NaN 12.890 1.0
8 11.68 1.0 NaN NaN 11.680 1.0
9 NaN NaN 8.87 0.0 8.870 0.0
10 NaN NaN 8.77 0.0 8.770 0.0
11 NaN NaN 7.97 0.0 7.970 0.0
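For anyone who wants to run the simplified approach on its own, here is a minimal sketch. It rebuilds four rows from the question, and, as an assumption of mine rather than part of the answer, it selects columns with the anchored patterns regex='^a_' and regex='^b_', because like='a' matches any column whose name merely contains that letter.

```python
import numpy as np
import pandas as pd

# Four rows taken from the question's frame (original rows 0, 2, 6 and 9)
df = pd.DataFrame({
    'a_x': [13.67, 13.52, 13.60, np.nan],
    'b_x': [0.0, 1.0, 1.0, np.nan],
    'a_y': [13.67, 13.17, np.nan, 8.87],
    'b_y': [0.0, 1.0, np.nan, 0.0],
})

# Anchored regexes (an assumption, not part of the original answer)
a = df.filter(regex='^a_')
b = df.filter(regex='^b_')

# mean skips NaN, so a single existing value is returned unchanged
df['a'] = a.mean(axis=1)
# forward fill across the b columns, then take the last column
df['b'] = b.ffill(axis=1).iloc[:, -1]

print(df)
```

Selecting a and b before assigning the new columns keeps the freshly added a and b out of the filters.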
I have a dataframe with columns A, B, C and D. I would like to drop every row that contains NaN, but only where columns D and C contain the value 0.
Would anyone be able to help me with this issue?
Thanks & Best Regards
Michael
Use boolean indexing with inverted mask by ~:
np.random.seed(2021)
df = pd.DataFrame(np.random.choice([1,0,np.nan], size=(10, 4)), columns=list('ABCD'))
print (df)
A B C D
0 1.0 0.0 0.0 1.0
1 0.0 NaN NaN 1.0
2 NaN 0.0 0.0 0.0
3 1.0 1.0 NaN NaN
4 NaN NaN 0.0 0.0
5 0.0 NaN 0.0 1.0
6 0.0 NaN NaN 1.0
7 0.0 1.0 NaN NaN
8 1.0 0.0 1.0 0.0
9 0.0 NaN NaN NaN
To remove rows where both D and C are 0 and at least one column is NaN, use DataFrame.all to test whether both values are 0, and chain it with & (bitwise AND) to DataFrame.any, which tests whether at least one value is NaN as detected by DataFrame.isna:
m = df[['D','C']].eq(0).all(axis=1) & df.isna().any(axis=1)
df1 = df[~m]
print (df1)
A B C D
0 1.0 0.0 0.0 1.0
1 0.0 NaN NaN 1.0
3 1.0 1.0 NaN NaN
5 0.0 NaN 0.0 1.0
6 0.0 NaN NaN 1.0
7 0.0 1.0 NaN NaN
8 1.0 0.0 1.0 0.0
9 0.0 NaN NaN NaN
An alternative without ~ for inverting: every condition is negated instead (eq becomes ne, all becomes any, isna becomes notna) and & is changed to | for bitwise OR:
m = df[['D','C']].ne(0).any(axis=1) | df.notna().all(axis=1)
df1 = df[m]
print (df1)
A B C D
0 1.0 0.0 0.0 1.0
1 0.0 NaN NaN 1.0
3 1.0 1.0 NaN NaN
5 0.0 NaN 0.0 1.0
6 0.0 NaN NaN 1.0
7 0.0 1.0 NaN NaN
8 1.0 0.0 1.0 0.0
9 0.0 NaN NaN NaN
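The two masks are related by De Morgan's law, not (p and q) == (not p) or (not q). A small sketch to check this, using five rows copied from the sample frame printed above; the names drop and keep are mine:

```python
import numpy as np
import pandas as pd

# Rows 0, 2, 4, 8 and 9 of the sample frame printed above
df = pd.DataFrame(
    [[1.0, 0.0, 0.0, 1.0],
     [np.nan, 0.0, 0.0, 0.0],
     [np.nan, np.nan, 0.0, 0.0],
     [1.0, 0.0, 1.0, 0.0],
     [0.0, np.nan, np.nan, np.nan]],
    columns=list('ABCD'), index=[0, 2, 4, 8, 9])

# Rows to drop: D and C are both 0 AND the row has at least one NaN
drop = df[['D', 'C']].eq(0).all(axis=1) & df.isna().any(axis=1)
# Rows to keep: D or C is not 0 OR the row has no NaN at all
keep = df[['D', 'C']].ne(0).any(axis=1) | df.notna().all(axis=1)

# The keep mask is exactly the inverted drop mask
assert keep.equals(~drop)
print(df[keep].index.tolist())
```

Rows 2 and 4 are dropped here, the same rows that disappear from the outputs above.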
I'm reading this data from an excel file:
a b
0 x y x y
1 0 1 2 3
2 0 1 2 3
3 0 1 2 3
4 0 1 2 3
5 0 1 2 3
For each of the categories a and b (a.k.a. samples), there are two columns of x and y values. I want to convert this Excel data into a dataframe that looks like this (concatenating the data from samples a and b vertically):
sample x y
0 a 0.0 1.0
1 a 0.0 1.0
2 a 0.0 1.0
3 a 0.0 1.0
4 a 0.0 1.0
5 b 2.0 3.0
6 b 2.0 3.0
7 b 2.0 3.0
8 b 2.0 3.0
9 b 2.0 3.0
I've written the following code:
x=np.arange(0,4,2) # create a variable that allows to select even columns
sample_df=pd.DataFrame() # create an empty dataFrame
for i in x: # looping through the excel data
    sample = pd.read_excel(xls2, usecols=[i,i], nrows=0, header=0)
    values_df= pd.read_excel(xls2, usecols=[i,i+1], nrows=5, header=1)
    values_df.insert(loc=0, column='sample', value=sample.columns[0])
    sample_df=pd.concat([sample_df, values_df], ignore_index=True)
display(sample_df)
But this is the output I obtain:
sample x y x.1 y.1
0 a 0.0 1.0 NaN NaN
1 a 0.0 1.0 NaN NaN
2 a 0.0 1.0 NaN NaN
3 a 0.0 1.0 NaN NaN
4 a 0.0 1.0 NaN NaN
5 b NaN NaN 2.0 3.0
6 b NaN NaN 2.0 3.0
7 b NaN NaN 2.0 3.0
8 b NaN NaN 2.0 3.0
9 b NaN NaN 2.0 3.0
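As the output shows, the second pair of columns arrives named x.1 and y.1, so concat cannot align it with x and y. One possible way around this, sketched under the assumption that the sheet is loaded once with both header rows (for example pd.read_excel(xls2, header=[0, 1])) so that the columns form a two-level MultiIndex of (sample, coordinate). The frame below is built by hand to stand in for that result, and the names wide, parts and long_df are illustrative only.

```python
import pandas as pd

# Stand-in for the sheet read with a two-row header (assumption)
columns = pd.MultiIndex.from_product([['a', 'b'], ['x', 'y']])
wide = pd.DataFrame([[0, 1, 2, 3]] * 5, columns=columns)

# Select each sample's x/y block, tag it with the sample name,
# and stack the blocks vertically
parts = [
    wide[sample].assign(sample=sample)
    for sample in wide.columns.get_level_values(0).unique()
]
long_df = pd.concat(parts, ignore_index=True)[['sample', 'x', 'y']]

print(long_df)
```

Because every block has the same x and y column names after selection, concat lines them up instead of creating extra columns.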
I want to add a median row to the top. Based on this Stack Overflow answer, I do the following:
pd.concat([df.median(),df],axis=0, ignore_index=True)
Shape of DF: 50000 x 226
Shape expected: 50001 x 226
Shape of modified DF: 500213 x 227 ???
What am I doing wrong? I am unable to understand what is going on?
Maybe what you want is this:
dfn = pd.concat([df.median().to_frame().T, df], ignore_index=True)
create some sample data:
df = pd.DataFrame(np.arange(20).reshape(4,5), columns= list('ABCDE'))
dfn = pd.concat([df.median().to_frame().T, df])
df
A B C D E
0 0 1 2 3 4
1 5 6 7 8 9
2 10 11 12 13 14
3 15 16 17 18 19
df.median().to_frame().T
A B C D E
0 7.5 8.5 9.5 10.5 11.5
dfn
A B C D E
0 7.5 8.5 9.5 10.5 11.5
0 0.0 1.0 2.0 3.0 4.0
1 5.0 6.0 7.0 8.0 9.0
2 10.0 11.0 12.0 13.0 14.0
3 15.0 16.0 17.0 18.0 19.0
df.median() is a Series whose index is A, B, C, D, E, so when you concat df.median() with df, those labels become new rows and the values land in a new column 0:
pd.concat([df.median(),df], axis=0)
0 A B C D E
A 7.5 NaN NaN NaN NaN NaN
B 8.5 NaN NaN NaN NaN NaN
C 9.5 NaN NaN NaN NaN NaN
D 10.5 NaN NaN NaN NaN NaN
E 11.5 NaN NaN NaN NaN NaN
0 NaN 0.0 1.0 2.0 3.0 4.0
1 NaN 5.0 6.0 7.0 8.0 9.0
2 NaN 10.0 11.0 12.0 13.0 14.0
3 NaN 15.0 16.0 17.0 18.0 19.0
pd.concat([df.median(),df],axis=0, ignore_index=True)
This code prepends df.median() as it is, but df.median() is a Series, not a DataFrame, so it is not added as a single row. Convert the Series to a one-row DataFrame by adding
.to_frame().T
to your code, which then becomes
pd.concat([df.median().to_frame().T,df],axis=0, ignore_index=True)
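A quick way to see the difference is to compare shapes. This sketch reuses the 4 x 5 sample frame from the answer above; the names wrong and right are mine.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape(4, 5), columns=list('ABCDE'))

# Series + DataFrame: the Series contributes one row per label
# and its values end up in an extra column
wrong = pd.concat([df.median(), df], axis=0, ignore_index=True)

# One-row DataFrame + DataFrame: columns align, exactly one row is added
right = pd.concat([df.median().to_frame().T, df], axis=0, ignore_index=True)

print(wrong.shape)  # (9, 6)
print(right.shape)  # (5, 5)
```

With the Series, the result has n_rows + n_columns rows and n_columns + 1 columns, which is where the extra column reported in the question comes from.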
My credit_scoring.csv looks like this. How can I organise it into 14 columns, each holding its corresponding values?
Seniority;Home;Time;Age;Marital;Records;Job;Expenses;Income;Assets;Debt;Amount;Price;Status
0 9.0;1.0;60.0;30.0;0.0;1.0;1.0;73.0;129.0;0.0;0...
1 17.0;1.0;60.0;58.0;1.0;1.0;0.0;48.0;131.0;0.0;...
2 10.0;0.0;36.0;46.0;0.0;2.0;1.0;90.0;200.0;3000...
3 0.0;1.0;60.0;24.0;1.0;1.0;0.0;63.0;182.0;2500....
4 0.0;1.0;36.0;26.0;1.0;1.0;0.0;46.0;107.0;0.0;0...
. .................................................
. .................................................
. .................................................
. .................................................
You can simply use read_csv() with sep=';'.
Your example data isn't great, but I tried to make the most of it.
I saved it as a.csv, and here is the code:
In [1]: import pandas as pd
In [2]: pd.read_csv('a.csv', sep=';')
Out[2]:
Seniority Home Time Age Marital Records Job Expenses Income Assets Debt Amount Price Status
0 9.0 1.0 60.0 30.0 0.0 1.0 1.0 73.0 129.0 0.0 0.0 NaN NaN NaN
1 17.0 1.0 60.0 58.0 1.0 1.0 0.0 48.0 131.0 0.0 NaN NaN NaN NaN
2 10.0 0.0 36.0 46.0 0.0 2.0 1.0 90.0 200.0 3000.0 NaN NaN NaN NaN
3 0.0 1.0 60.0 24.0 1.0 1.0 0.0 63.0 182.0 2500.0 NaN NaN NaN NaN
4 0.0 1.0 36.0 26.0 1.0 1.0 0.0 46.0 107.0 0.0 0.0 NaN NaN NaN
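The same call can be tried without a file on disk. A minimal sketch, using only the first four fields of the first two rows shown in the question, with io.StringIO standing in for the real file:

```python
import io
import pandas as pd

# First four fields of the first two rows shown in the question;
# io.StringIO stands in for the real CSV file
raw = io.StringIO(
    "Seniority;Home;Time;Age\n"
    "9.0;1.0;60.0;30.0\n"
    "17.0;1.0;60.0;58.0\n"
)

# sep=';' tells pandas that fields are separated by semicolons
df = pd.read_csv(raw, sep=';')

print(df.columns.tolist())
print(df.shape)
```

Without sep=';', pandas looks for commas, finds none, and puts each whole line into a single column, which is what the question shows.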
We can simply calculate the mean along an axis:
import numpy as np
import pandas as pd
df=pd.DataFrame({'A':[1,1,0,1,0,1,1,0,1,1,1],
'b':[1,1,0,1,0,1,1,0,1,1,1],
'c':[1,1,0,1,0,1,1,0,1,1,1]})
# max_of_three columns
mean= np.max(df.mean(axis=1))
How can I do the same thing with a rolling mean?
I tried 1:
# max_of_three columns
mean=df.rolling(2).mean(axis=1)
got this error:
UnsupportedFunctionCall: numpy operations are not valid with window objects. Use .rolling(...).mean() instead
I tried 2:
def tt(x):
    x=pd.DataFrame(x)
    b1=np.max(x.mean(axis=1))
    return b1
# max_of_three columns
mean=df.rolling(2).apply(tt,raw=True)
But this gives me three columns in the result, whereas there should be one value for each moving window.
Where am I making a mistake? Or is there a more efficient way to do this?
You can pass the axis argument to rolling:
df.rolling(2, axis=0).mean()
>>> A b c
0 NaN NaN NaN
1 1.0 1.0 1.0
2 0.5 0.5 0.5
3 0.5 0.5 0.5
4 0.5 0.5 0.5
5 0.5 0.5 0.5
6 1.0 1.0 1.0
7 0.5 0.5 0.5
8 0.5 0.5 0.5
9 1.0 1.0 1.0
10 1.0 1.0 1.0
r = df.rolling(2, axis=1).mean()
r
>>> A b c
0 NaN 1.0 1.0
1 NaN 1.0 1.0
2 NaN 0.0 0.0
3 NaN 1.0 1.0
4 NaN 0.0 0.0
5 NaN 1.0 1.0
6 NaN 1.0 1.0
7 NaN 0.0 0.0
8 NaN 1.0 1.0
9 NaN 1.0 1.0
10 NaN 1.0 1.0
r.max()
>>> A NaN
b 1.0
c 1.0
dtype: float64
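If the goal is one value per moving window, as the question asks, one reading is: take the mean across the three columns for each row, then take the maximum over each window of two consecutive rows. That reading is my assumption, not something the question states outright. The sketch also avoids the axis argument of rolling, which pandas deprecated in version 2.1, by computing the across-columns window on the transposed frame instead.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1],
                   'b': [1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1],
                   'c': [1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1]})

# Mean across columns for each row, then the max within each 2-row window
row_mean = df.mean(axis=1)
window_max = row_mean.rolling(2).max()

# Same result as rolling(2, axis=1).mean(), without the axis keyword:
# transpose, roll over rows, transpose back
across_columns = df.T.rolling(2).mean().T

print(window_max)
```

window_max is a single Series, so each window yields one value; the first entry is NaN because the first window is incomplete.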