I'm reading this data from an Excel file:
a b
0 x y x y
1 0 1 2 3
2 0 1 2 3
3 0 1 2 3
4 0 1 2 3
5 0 1 2 3
For each of the categories a and b (a.k.a. samples), there are two columns of x and y values. I want to convert this Excel data into a dataframe that looks like this (vertically concatenating the data from samples a and b):
sample x y
0 a 0.0 1.0
1 a 0.0 1.0
2 a 0.0 1.0
3 a 0.0 1.0
4 a 0.0 1.0
5 b 2.0 3.0
6 b 2.0 3.0
7 b 2.0 3.0
8 b 2.0 3.0
9 b 2.0 3.0
I've written the following code:
x = np.arange(0, 4, 2)  # positions of the even columns (0 and 2), one per sample
sample_df = pd.DataFrame()  # create an empty DataFrame
for i in x:  # loop through the Excel data, one sample at a time
    sample = pd.read_excel(xls2, usecols=[i, i], nrows=0, header=0)
    values_df = pd.read_excel(xls2, usecols=[i, i+1], nrows=5, header=1)
    values_df.insert(loc=0, column='sample', value=sample.columns[0])
    sample_df = pd.concat([sample_df, values_df], ignore_index=True)
display(sample_df)
But this is the output I obtain:
sample x y x.1 y.1
0 a 0.0 1.0 NaN NaN
1 a 0.0 1.0 NaN NaN
2 a 0.0 1.0 NaN NaN
3 a 0.0 1.0 NaN NaN
4 a 0.0 1.0 NaN NaN
5 b NaN NaN 2.0 3.0
6 b NaN NaN 2.0 3.0
7 b NaN NaN 2.0 3.0
8 b NaN NaN 2.0 3.0
9 b NaN NaN 2.0 3.0
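One possible reading, offered as a sketch rather than a definitive answer: the x.1 and y.1 labels look like pandas de-duplicating the repeated x and y headers, so the two chunks no longer share column names and concat puts them side by side. If the sheet is instead read once with a two-row header (for example header=[0, 1], assuming the sample names then end up as the top level of the columns), each sample can be sliced out under the same x/y names. The frame `wide` below is a hand-built stand-in for such a read, not the actual file.

```python
import pandas as pd

# Stand-in for the sheet read with a two-level header (assumption:
# the top level holds the sample name, the second level holds x / y).
columns = pd.MultiIndex.from_product([["a", "b"], ["x", "y"]])
wide = pd.DataFrame([[0, 1, 2, 3]] * 5, columns=columns)

# Slice out each sample's x/y block; the dict keys become an index level.
samples = wide.columns.get_level_values(0).unique()
blocks = {name: wide[name] for name in samples}

sample_df = (
    pd.concat(blocks, names=["sample"])
    .reset_index(level="sample")   # move the sample name into a column
    .reset_index(drop=True)        # renumber the rows 0..n-1
)
print(sample_df)
```

Because every block carries the plain x and y labels, the rows stack vertically instead of spreading into extra columns.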
I have a dataframe of time series data, in which data reporting starts at different times (columns) for different observation units (rows). Prior to the first reported datapoint for each unit, the dataframe contains NaN values, e.g.
0 1 2 3 4 ...
A NaN NaN 4 5 6 ...
B NaN 7 8 NaN 10...
C NaN 2 11 24 17...
I want to replace the leading (left-side) NaN values with 0, but only the leading ones (i.e. leaving the internal missing ones as NaN). So the result on the example above would be:
0 1 2 3 4 ...
A 0 0 4 5 6 ...
B 0 7 8 NaN 10...
C 0 2 11 24 17...
(Note the retained NaN for row B col 3)
I could iterate through the dataframe row-by-row, identify the first index of a non-NaN value in each row, and replace everything left of that with 0. But is there a way to do this as a whole-array operation?
Use notna + cumsum along the rows; cells where the running sum is still zero are the leading NaN:
df[df.notna().cumsum(1) == 0] = 0
df
0 1 2 3 4
A 0.0 0.0 4 5.0 6
B 0.0 7.0 8 NaN 10
C 0.0 2.0 11 24.0 17
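To make the intermediate step visible, here is a small self-contained sketch of the same idea on the example data. The names `seen`, `leading` and `filled` are only illustrative, and `DataFrame.mask` is used instead of in-place assignment so the original frame stays untouched.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[np.nan, np.nan, 4, 5, 6],
     [np.nan, 7, 8, np.nan, 10],
     [np.nan, 2, 11, 24, 17]],
    index=list("ABC"),
)

# Running count of non-missing values along each row.
seen = df.notna().cumsum(axis=1)

# A count of zero means no real value has appeared yet: a leading NaN.
leading = seen.eq(0)

# Replace only the flagged cells; the internal NaN in row B is kept.
filled = df.mask(leading, 0)
print(seen)
print(filled)
```

For row B the running count is 0, 1, 2, 2, 3, so only the first cell is flagged and the NaN in column 3 stays as it is.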
Here is another way, using cumprod() and apply():
s = df.isna().cumprod(axis=1).sum(axis=1)
df.apply(lambda x: x.fillna(0,limit = s.loc[x.name]),axis=1)
Output:
0 1 2 3 4
A 0.0 0.0 4.0 5.0 6.0
B 0.0 7.0 8.0 NaN 10.0
C 0.0 2.0 11.0 24.0 17.0
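The cumprod idea can also be used without apply, which may be worth considering when some rows have no leading NaN at all: in that case the per-row limit above would be 0, which, as far as I know, fillna does not accept. The following is a sketch under that assumption; `still_leading` is a made-up name.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[np.nan, np.nan, 4, 5],
     [1, np.nan, 3, 4]],      # row B starts with a value: no leading NaN
    index=list("AB"),
)

# 1 while every cell so far has been NaN, 0 from the first real value on.
still_leading = df.isna().astype(int).cumprod(axis=1).astype(bool)

# Fill only the cells that are still part of the leading run.
filled = df.mask(still_leading, 0)
print(filled)
```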
I have a dataframe with columns A, B, C and D. I would like to drop every row that contains NaN, but only where columns D and C both contain the value 0.
Eg:
Would anyone be able to help me with this issue?
Thanks & Best Regards
Michael
Use boolean indexing with a mask inverted by ~:
np.random.seed(2021)
df = pd.DataFrame(np.random.choice([1,0,np.nan], size=(10, 4)), columns=list('ABCD'))
print (df)
A B C D
0 1.0 0.0 0.0 1.0
1 0.0 NaN NaN 1.0
2 NaN 0.0 0.0 0.0
3 1.0 1.0 NaN NaN
4 NaN NaN 0.0 0.0
5 0.0 NaN 0.0 1.0
6 0.0 NaN NaN 1.0
7 0.0 1.0 NaN NaN
8 1.0 0.0 1.0 0.0
9 0.0 NaN NaN NaN
To remove rows where both D and C are 0 and at least one column is NaN, use DataFrame.all to test whether both values are 0, and chain it with & (bitwise AND) to DataFrame.any, which tests whether at least one value is NaN as flagged by DataFrame.isna:
m = df[['D','C']].eq(0).all(axis=1) & df.isna().any(axis=1)
df1 = df[~m]
print (df1)
A B C D
0 1.0 0.0 0.0 1.0
1 0.0 NaN NaN 1.0
3 1.0 1.0 NaN NaN
5 0.0 NaN 0.0 1.0
6 0.0 NaN NaN 1.0
7 0.0 1.0 NaN NaN
8 1.0 0.0 1.0 0.0
9 0.0 NaN NaN NaN
An alternative without ~ for inverting: every condition is negated instead (eq to ne, all to any, isna to notna) and & is changed to | for bitwise OR:
m = df[['D','C']].ne(0).any(axis=1) | df.notna().all(axis=1)
df1 = df[m]
print (df1)
A B C D
0 1.0 0.0 0.0 1.0
1 0.0 NaN NaN 1.0
3 1.0 1.0 NaN NaN
5 0.0 NaN 0.0 1.0
6 0.0 NaN NaN 1.0
7 0.0 1.0 NaN NaN
8 1.0 0.0 1.0 0.0
9 0.0 NaN NaN NaN
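The two masks are related by De Morgan's law: not (P and Q) equals (not P) or (not Q). A quick way to convince yourself is to build both on a tiny frame and compare them. The data below is made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1, np.nan, np.nan, 0],
    "B": [0, 0, np.nan, 1],
    "C": [0, 0, 0, np.nan],
    "D": [1, 0, 0, np.nan],
})

# Rows to drop: D and C are both 0 AND the row has at least one NaN.
drop = df[["D", "C"]].eq(0).all(axis=1) & df.isna().any(axis=1)

# Rows to keep: D or C is not 0 OR the row has no NaN at all.
keep = df[["D", "C"]].ne(0).any(axis=1) | df.notna().all(axis=1)

print((~drop).equals(keep))
print(df[keep])
```

Note that NaN compares as "not equal to 0", so a row where D and C are both NaN is kept by either mask.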
I want to add a median row to the top. Based on this stack answer I do the following:
pd.concat([df.median(),df],axis=0, ignore_index=True)
Shape of DF: 50000 x 226
Shape expected: 50001 x 226
Shape of modified DF: 500213 x 227 ???
What am I doing wrong? I am unable to understand what is going on?
Maybe this is what you wanted:
dfn = pd.concat([df.median().to_frame().T, df], ignore_index=True)
Create some sample data:
df = pd.DataFrame(np.arange(20).reshape(4,5), columns= list('ABCDE'))
dfn = pd.concat([df.median().to_frame().T, df])
df
A B C D E
0 0 1 2 3 4
1 5 6 7 8 9
2 10 11 12 13 14
3 15 16 17 18 19
df.median().to_frame().T
A B C D E
0 7.5 8.5 9.5 10.5 11.5
dfn
A B C D E
0 7.5 8.5 9.5 10.5 11.5
0 0.0 1.0 2.0 3.0 4.0
1 5.0 6.0 7.0 8.0 9.0
2 10.0 11.0 12.0 13.0 14.0
3 15.0 16.0 17.0 18.0 19.0
df.median() is a Series whose index is A, B, C, D, E, so when you concat df.median() with df, the result is:
pd.concat([df.median(),df], axis=0)
0 A B C D E
A 7.5 NaN NaN NaN NaN NaN
B 8.5 NaN NaN NaN NaN NaN
C 9.5 NaN NaN NaN NaN NaN
D 10.5 NaN NaN NaN NaN NaN
E 11.5 NaN NaN NaN NaN NaN
0 NaN 0.0 1.0 2.0 3.0 4.0
1 NaN 5.0 6.0 7.0 8.0 9.0
2 NaN 10.0 11.0 12.0 13.0 14.0
3 NaN 15.0 16.0 17.0 18.0 19.0
pd.concat([df.median(),df],axis=0, ignore_index=True)
This code does compute the medians, but df.median() is a Series, not a DataFrame, so it is not added as a single row. Convert the Series to a one-row DataFrame by appending
.to_frame().T
so that your code becomes
pd.concat([df.median().to_frame().T,df],axis=0, ignore_index=True)
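The shapes make the difference easy to see. This is a small check on a 4 x 5 frame; the names `med`, `med_row`, `bad` and `good` are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape(4, 5), columns=list("ABCDE"))

med = df.median()            # Series indexed by column name
med_row = med.to_frame().T   # 1 x 5 DataFrame with columns A..E

# The Series contributes one row per column label plus an extra column.
bad = pd.concat([med, df], axis=0, ignore_index=True)
# The one-row frame lines up with the existing columns.
good = pd.concat([med_row, df], axis=0, ignore_index=True)

print(med.shape, med_row.shape)
print(bad.shape, good.shape)
```

With the Series, the row count grows by the number of columns and one extra column appears, which is the same pattern as the unexpected 227 columns in the question.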
I'm trying to derive new columns a and b from the following dataframe:
a_x b_x a_y b_y
0 13.67 0.0 13.67 0.0
1 13.42 0.0 13.42 0.0
2 13.52 1.0 13.17 1.0
3 13.61 1.0 13.11 1.0
4 12.68 1.0 13.06 1.0
5 12.70 1.0 12.93 1.0
6 13.60 1.0 NaN NaN
7 12.89 1.0 NaN NaN
8 11.68 1.0 NaN NaN
9 NaN NaN 8.87 0.0
10 NaN NaN 8.77 0.0
11 NaN NaN 7.97 0.0
If b_x or b_y is 0.0 (when both exist they have the same value), then a_x and a_y share the same value, so I take either of them for the new columns a and b. If b_x or b_y is 1.0, a_x and a_y differ, so I use the mean of a_x and a_y as a and take either b_x or b_y as b.
If only one of the pairs (a_x, b_x) or (a_y, b_y) is non-null, I take the existing values as a and b.
My expected result looks like this:
a_x b_x a_y b_y a b
0 13.67 0.0 13.67 0.0 13.670 0
1 13.42 0.0 13.42 0.0 13.420 0
2 13.52 1.0 13.17 1.0 13.345 1
3 13.61 1.0 13.11 1.0 13.360 1
4 12.68 1.0 13.06 1.0 12.870 1
5 12.70 1.0 12.93 1.0 12.815 1
6 13.60 1.0 NaN NaN 13.600 1
7 12.89 1.0 NaN NaN 12.890 1
8 11.68 1.0 NaN NaN 11.680 1
9 NaN NaN 8.87 0.0 8.870 0
10 NaN NaN 8.77 0.0 8.770 0
11 NaN NaN 7.97 0.0 7.970 0
How can I get the result above? Thank you.
Use:
#filter all a and b columns
b = df.filter(like='b')
a = df.filter(like='a')
#test if at least one 0 or 1 value
m1 = b.eq(0).any(axis=1)
m2 = b.eq(1).any(axis=1)
#get means of a columns
a1 = a.mean(axis=1)
#forward fill missing values and select the last column
b1 = b.ffill(axis=1).iloc[:, -1]
a2 = a.ffill(axis=1).iloc[:, -1]
#new Dataframe with 2 conditions
df1 = pd.DataFrame(np.select([m1, m2], [[a2, b1], [a1, b1]]), index=['a','b']).T
#join to original
df = df.join(df1)
print (df)
a_x b_x a_y b_y a b
0 13.67 0.0 13.67 0.0 13.670 0.0
1 13.42 0.0 13.42 0.0 13.420 0.0
2 13.52 1.0 13.17 1.0 13.345 1.0
3 13.61 1.0 13.11 1.0 13.360 1.0
4 12.68 1.0 13.06 1.0 12.870 1.0
5 12.70 1.0 12.93 1.0 12.815 1.0
6 13.60 1.0 NaN NaN 13.600 1.0
7 12.89 1.0 NaN NaN 12.890 1.0
8 11.68 1.0 NaN NaN 11.680 1.0
9 NaN NaN 8.87 0.0 8.870 0.0
10 NaN NaN 8.77 0.0 8.770 0.0
11 NaN NaN 7.97 0.0 7.970 0.0
But I think the solution can be simplified, because the mean can be used for both conditions (the mean of identical values is the same as the first value):
#filter all a and b columns
b = df.filter(like='b')
a = df.filter(like='a')
#row-wise mean of the a columns (NaN is skipped)
a1 = a.mean(axis=1)
#forward fill missing values and select the last column
b1 = b.ffill(axis=1).iloc[:, -1]
df['a'] = a1
df['b'] = b1
print (df)
a_x b_x a_y b_y a b
0 13.67 0.0 13.67 0.0 13.670 0.0
1 13.42 0.0 13.42 0.0 13.420 0.0
2 13.52 1.0 13.17 1.0 13.345 1.0
3 13.61 1.0 13.11 1.0 13.360 1.0
4 12.68 1.0 13.06 1.0 12.870 1.0
5 12.70 1.0 12.93 1.0 12.815 1.0
6 13.60 1.0 NaN NaN 13.600 1.0
7 12.89 1.0 NaN NaN 12.890 1.0
8 11.68 1.0 NaN NaN 11.680 1.0
9 NaN NaN 8.87 0.0 8.870 0.0
10 NaN NaN 8.77 0.0 8.770 0.0
11 NaN NaN 7.97 0.0 7.970 0.0
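The simplification relies on the row-wise mean skipping NaN by default, so a row with a single value gets that value back unchanged. Here is a reduced sketch with four representative rows; the explicit column lists and the bfill variant for b are my own choices, not part of the answer above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a_x": [13.67, 13.52, 13.60, np.nan],
    "b_x": [0.0, 1.0, 1.0, np.nan],
    "a_y": [13.67, 13.17, np.nan, 8.87],
    "b_y": [0.0, 1.0, np.nan, 0.0],
})

a_cols = df[["a_x", "a_y"]]
b_cols = df[["b_x", "b_y"]]

# mean(axis=1) skips NaN by default, so a lone value is returned as-is.
df["a"] = a_cols.mean(axis=1)
# Back fill along the row and take the first column: the first non-missing b.
df["b"] = b_cols.bfill(axis=1).iloc[:, 0]
print(df)
```

Selecting the columns by explicit name also avoids filter(like='a') picking up the new a column if the code is run a second time.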
I'm facing a simple task, but I cannot solve it. There is a table in df:
Date X1 X2
02.03.2019 2 2
03.03.2019 1 1
04.03.2019 2 3
05.03.2019 1 12
06.03.2019 2 2
07.03.2019 3 3
08.03.2019 4 1
09.03.2019 1 2
For the rows where Date < 05.03.2019, I need to set X1=NaN and X2=NaN:
Date X1 X2
02.03.2019 NaN NaN
03.03.2019 NaN NaN
04.03.2019 NaN NaN
05.03.2019 1 12
06.03.2019 2 2
07.03.2019 3 3
08.03.2019 4 1
09.03.2019 1 2
First convert the Date column to datetimes, then set the values with DataFrame.loc:
df['Date'] = pd.to_datetime(df['Date'], format='%d.%m.%Y')
df.loc[df['Date'] < '2019-03-05', ['X1','X2']] = np.nan
print (df)
Date X1 X2
0 2019-03-02 NaN NaN
1 2019-03-03 NaN NaN
2 2019-03-04 NaN NaN
3 2019-03-05 1.0 12.0
4 2019-03-06 2.0 2.0
5 2019-03-07 3.0 3.0
6 2019-03-08 4.0 1.0
7 2019-03-09 1.0 2.0
If there is a DatetimeIndex:
df.index = pd.to_datetime(df.index, format='%d.%m.%Y')
#label slicing includes the end point, so the slice stops at 2019-03-04
df.loc[:'2019-03-04'] = np.nan
print (df)
X1 X2
Date
2019-03-02 NaN NaN
2019-03-03 NaN NaN
2019-03-04 NaN NaN
2019-03-05 1.0 12.0
2019-03-06 2.0 2.0
2019-03-07 3.0 3.0
2019-03-08 4.0 1.0
2019-03-09 1.0 2.0
Or:
df.index = pd.to_datetime(df.index, format='%d.%m.%Y')
df.loc[df.index < '2019-03-05'] = np.nan
Don't use this solution; it is just another possible approach (-: (mask affects all columns, so Date has to be restored with combine_first):
df.mask(df.Date < '05.03.2019').combine_first(df[['Date']])
Date X1 X2
0 02.03.2019 NaN NaN
1 03.03.2019 NaN NaN
2 04.03.2019 NaN NaN
3 05.03.2019 1.0 12.0
4 06.03.2019 2.0 2.0
5 07.03.2019 3.0 3.0
6 08.03.2019 4.0 1.0
7 09.03.2019 1.0 2.0
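One caveat with comparing the unparsed Date strings, as in the last snippet: the comparison is lexicographic, so it only happens to work while all dates share the same month and year. A small sketch with a made-up February date shows the difference.

```python
import pandas as pd

dates = pd.Series(["10.02.2019", "04.03.2019", "05.03.2019"])

# Comparing the raw strings goes character by character.
as_text = dates < "05.03.2019"

# Parsing first compares actual points in time.
parsed = pd.to_datetime(dates, format="%d.%m.%Y")
as_dates = parsed < pd.Timestamp("2019-03-05")

print(as_text.tolist())   # [False, True, False]
print(as_dates.tolist())  # [True, True, False]
```

10 February is earlier than 5 March, but as text "10..." sorts after "05...", which is why converting with pd.to_datetime first is the safer route.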