I have a DataFrame with some NaN values. In this DataFrame there are some rows with all NaN values. When I apply sum function on these rows, it is returning zero instead of NaN. Code is as follows:
df = pd.DataFrame(np.random.randint(10,60,size=(5,3)),
index = ['a','c','e','f','h'],
columns = ['One','Two','Three'])
df = df.reindex(index=['a','b','c','d','e','f','g','h'])
print(df.loc['b'].sum())
Any Suggestion?
the sum function takes the NaN values as 0.
if you want the result of the sum of NaN values to be NaN:
df.loc['b'].sum(min_count=1)
Output:
nan
if you apply to all rows(
after using reindex) you will get the following:
df.sum(axis=1,min_count=1)
a 137.0
b NaN
c 79.0
d NaN
e 132.0
f 95.0
g NaN
h 81.0
dtype: float64
if you now modify a NaN value of a row:
df.at['b','One']=0
print(df)
One Two Three
a 54.0 20.0 29.0
b 0.0 NaN NaN
c 13.0 24.0 27.0
d NaN NaN NaN
e 28.0 53.0 25.0
f 46.0 55.0 50.0
g NaN NaN NaN
h 47.0 26.0 48.0
df.sum(axis=1,min_count=1)
a 103.0
b 0.0
c 64.0
d NaN
e 106.0
f 151.0
g NaN
h 121.0
dtype: float64
as you can see now the result of row b is 0
Related
I have a this Dataframe
a
b
c
d
e
f
g
h
o
1
1
nan
nan
nan
nan
nan
nan
o
nan
nan
2
2
nan
nan
nan
nan
o
nan
nan
nan
nan
3
3
nan
nan
o
nan
nan
nan
nan
nan
nan
4
4
I want to conversion this DataFrame
a
b
c
d
e
f
g
h
o
1
1
2
2
3
3
4
4
How to make code use pandas..? I try numpy diagonal but it is failed
Another possible solution:
a = df.values.flatten()
pd.DataFrame(a[~np.isnan(a)].reshape(-1,df.shape[1]), columns=df.columns)
Output:
a b c d e f g h
0 1.0 1.0 2.0 2.0 3.0 3.0 4.0 4.0
Use:
df = df.replace('nan', np.nan).groupby(level=0).first()
This question already has answers here:
How to merge multiple dataframes
(13 answers)
Closed 1 year ago.
I have a 3 dfs as shown below
df1:
ID March_Number March_Amount
A 10 200
B 4 300
C 2 100
df2:
ID Feb_Number Feb_Amount
A 1 100
B 8 500
E 4 400
F 8 100
H 4 200
df3:
ID Jan_Number Jan_Amount
A 6 800
H 3 500
B 1 50
G 8 100
I tried below code and worked well.
df_outer = pd.merge(df1, df2, on='ID', how='outer')
df_outer = pd.merge(df_outer , df3, on='ID', how='outer')
But would like to pass all df together and merge at a short. I tried below code with error as shown.
df_outer = pd.merge(df1, df2, df3, on='ID', how='outer')
please guide me, how to merge if I have 12 months of data. i.e I have to merge 12 dfs.
Error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-32-a63627da7233> in <module>
----> 1 df_outer = pd.merge(df1, df2, df3, on='ID', how='outer')
TypeError: merge() got multiple values for argument 'how'
Expected output:
ID March_Number March_Amount Feb_Number Feb_Amount Jan_Number Jan_Amount
A 10.0 200.0 1.0 100.0 6.0 800.0
B 4.0 300.0 8.0 500.0 1.0 50.0
C 2.0 100.0 NaN NaN NaN NaN
E NaN NaN 4.0 400.0 NaN NaN
F NaN NaN 8.0 100.0 NaN NaN
H NaN NaN 4.0 200.0 3.0 500.0
G NaN NaN NaN NaN 8.0 100.0
We can create a list of dfs in this case dfl which we want to merge and then we can merge them together.
We can add as many dfs as we want in dfl=[df1, df2, df3,..., dfn]
from functools import reduce
dfl=[df1, df2, df3]
df_merged = reduce(lambda left,right: pd.merge(left,right,on=['ID'],
how='outer'), dfl)
Output
ID March_Number March_Amount Feb_Number Feb_Amount Jan_Number Jan_Amount
0 A 10.0 200.0 1.0 0.0 6.0 800.0
1 B 4.0 300.0 8.0 500.0 1.0 50.0
2 C 2.0 100.0 NaN NaN NaN NaN
3 E NaN NaN 4.0 400.0 NaN NaN
4 F NaN NaN 8.0 0.0 NaN NaN
5 H NaN NaN 4.0 200.0 3.0 500.0
6 G NaN NaN NaN NaN 8.0 100.0
I want to add a median row to the top. Based on this stack answer I do the following:
pd.concat([df.median(),df],axis=0, ignore_index=True)
Shape of DF: 50000 x 226
Shape expected: 50001 x 226
Shape of modified DF: 500213 x 227 ???
What am I doing wrong? I am unable to understand what is going on?
Maybe what you wanted is like this:
dfn = pd.concat([df.median().to_frame().T, df], ignore_index=True)
create some sample data:
df = pd.DataFrame(np.arange(20).reshape(4,5), columns= list('ABCDE'))
dfn = pd.concat([df.median().to_frame().T, df])
df
A B C D E
0 0 1 2 3 4
1 5 6 7 8 9
2 10 11 12 13 14
3 15 16 17 18 19
df.median().to_frame().T
A B C D E
0 7.5 8.5 9.5 10.5 11.5
dfn
A B C D E
0 7.5 8.5 9.5 10.5 11.5
0 0.0 1.0 2.0 3.0 4.0
1 5.0 6.0 7.0 8.0 9.0
2 10.0 11.0 12.0 13.0 14.0
3 15.0 16.0 17.0 18.0 19.0
df.median() is an Series, with row index of A, B, C, D, E, so when you concat df.median() with df, the result is that:
pd.concat([df.median(),df], axis=0)
0 A B C D E
A 7.5 NaN NaN NaN NaN NaN
B 8.5 NaN NaN NaN NaN NaN
C 9.5 NaN NaN NaN NaN NaN
D 10.5 NaN NaN NaN NaN NaN
E 11.5 NaN NaN NaN NaN NaN
0 NaN 0.0 1.0 2.0 3.0 4.0
1 NaN 5.0 6.0 7.0 8.0 9.0
2 NaN 10.0 11.0 12.0 13.0 14.0
3 NaN 15.0 16.0 17.0 18.0 19.0
pd.concat([df.median(),df],axis=0, ignore_index=True)
this code creates a row for you but that is not a DataFrame it is a Series. So you want to convert the series to DataFrame
so you can use
.to_frame().T
to your code then your code become
pd.concat([df.median().to_frame().T,df],axis=0, ignore_index=True)
I am trying to do some transformations and kind of stuck. Hopefully somebody, can help me out here.
l0 a b c d e f
l1 1 2 1 2 1 2 1 2 1 2 1 2
0 NaN NaN NaN NaN 93.4 NaN NaN NaN NaN NaN 19.0 28.9
1 NaN 9.0 NaN NaN 43.5 32.0 NaN NaN NaN NaN NaN 3.4
2 NaN 5.0 NaN NaN 93.3 83.6 NaN NaN NaN NaN 59.5 28.2
3 NaN 19.6 NaN NaN 72.8 47.4 NaN NaN NaN NaN 31.5 67.2
4 NaN NaN NaN NaN NaN 62.5 NaN NaN NaN NaN NaN 1.8
I have a dataframe, (shown above), and as u can see that, there are multiple 'NaN' with an multiindex column. Selecting the columns along level = 0 (i.e. l0)
I would like to drop the entire column if all are NaN. so, in this case the column's
l0 = ['b', 'd', 'e'] # drop-cols
should be dropped from the Dataframe
l0 a c f
l1 1 2 1 2 1 2
0 NaN NaN 93.4 NaN 19.0 28.9
1 NaN 9.0 43.5 32.0 NaN 3.4
2 NaN 5.0 93.3 83.6 59.5 28.2
3 NaN 19.6 72.8 47.4 31.5 67.2
4 NaN NaN NaN 62.5 NaN 1.8
This will give me the dataframe (as shown above). I would like to then slide values along the rows if all the entries before are null (or swap values between adjacent cols). e.g. Looking at index = 0 i.e. first row.
l0 a c f
l1 1 2 1 2 1 2
0 NaN NaN 93.4 NaN 19.0 28.9
Since, all the values in col - a are null.
I would like to slide / swap values first b/w col - a and col - c.
and then receprocate the same for columns along the right-side i.e. replace entries in col-c with col-f and make all entries in col-f, NaN giving me
l0 a c f
l1 1 2 1 2 1 2
0 93.4 NaN 19.0 28.9 NaN NaN
This is really to save memory for processing and storing information, as interchainging labels ['a', 'b', 'c'...] does not change the meaning of the data.
EDIT: Any Idea's for (2)
I have managed to solve (1) with the following code:
for c in df.columns.get_level_values(0).unique():
if df[c].isna().all().all():
df = df.drop(columns=[c])
df
You can do with all
s=df.isnull().all(level=0,axis=1).all()
df.drop(s.index[s],axis=1,level=0)
Out[55]:
a c f
1 2 1 2 1 2
l1
0 NaN NaN 93.4 NaN 19.0 28.9
1 NaN 9.0 43.5 32.0 NaN 3.4
2 NaN 5.0 93.3 83.6 59.5 28.2
3 NaN 19.6 72.8 47.4 31.5 67.2
4 NaN NaN NaN 62.5 NaN 1.8
groupby and filter
df.groupby(axis=1, level=0).filter(lambda d: ~d.isna().all().all())
a c f
1 2 1 2 1 2
0 NaN NaN 93.4 NaN 19.0 28.9
1 NaN 9.0 43.5 32.0 NaN 3.4
2 NaN 5.0 93.3 83.6 59.5 28.2
3 NaN 19.6 72.8 47.4 31.5 67.2
4 NaN NaN NaN 62.5 NaN 1.8
A little bit shorter
df.groupby(axis=1, level=0).filter(lambda d: ~np.all(d.isna()))
I have a dataframe called ref(first dataframe) with columns c1, c2 ,c3 and c4.
ref= pd.DataFrame([[1,3,.3,7],[0,4,.5,4.5],[2,5,.6,3]], columns=['c1','c2','c3','c4'])
print(ref)
c1 c2 c3 c4
0 1 3 0.3 7.0
1 0 4 0.5 4.5
2 2 5 0.6 3.0
I wanted to create a new column i.e, c5 ( second dataframe) that has all the values from columns c1,c2,c3 and c4.
I tried concat, merge columns but i cannot get it work.
Please let me know if you have a solutions?
You can use unstack for creating Series from DataFrame and then concat to original:
print (pd.concat([ref, ref.unstack().reset_index(drop=True).rename('c5')], axis=1))
c1 c2 c3 c4 c5
0 1.0 3.0 0.3 7.0 1.0
1 0.0 4.0 0.5 4.5 0.0
2 2.0 5.0 0.6 3.0 2.0
3 NaN NaN NaN NaN 3.0
4 NaN NaN NaN NaN 4.0
5 NaN NaN NaN NaN 5.0
6 NaN NaN NaN NaN 0.3
7 NaN NaN NaN NaN 0.5
8 NaN NaN NaN NaN 0.6
9 NaN NaN NaN NaN 7.0
10 NaN NaN NaN NaN 4.5
11 NaN NaN NaN NaN 3.0
Alternative solution for creating Series is convert df to numpy array by values and then reshape by ravel:
print (pd.concat([ref, pd.Series(ref.values.ravel('F'), name='c5')], axis=1))
c1 c2 c3 c4 c5
0 1.0 3.0 0.3 7.0 1.0
1 0.0 4.0 0.5 4.5 0.0
2 2.0 5.0 0.6 3.0 2.0
3 NaN NaN NaN NaN 3.0
4 NaN NaN NaN NaN 4.0
5 NaN NaN NaN NaN 5.0
6 NaN NaN NaN NaN 0.3
7 NaN NaN NaN NaN 0.5
8 NaN NaN NaN NaN 0.6
9 NaN NaN NaN NaN 7.0
10 NaN NaN NaN NaN 4.5
11 NaN NaN NaN NaN 3.0
using join + ravel('F')
ref.join(pd.Series(ref.values.ravel('F')).to_frame('c5'), how='right')
using join + T.ravel()
ref.join(pd.Series(ref.values.T.ravel()).to_frame('c5'), how='right')
pd.concat + T.stack() + rename
pd.concat([ref, ref.T.stack().reset_index(drop=True).rename('c5')], axis=1)
way too many transposes + append
ref.T.append(ref.T.stack().reset_index(drop=True).rename('c5')).T
combine_first + ravel('F') <--- my favorite
ref.combine_first(pd.Series(ref.values.ravel('F')).to_frame('c5'))
All yield
c1 c2 c3 c4 c5
0 1.0 3.0 0.3 7.0 1.0
1 0.0 4.0 0.5 4.5 0.0
2 2.0 5.0 0.6 3.0 2.0
3 NaN NaN NaN NaN 3.0
4 NaN NaN NaN NaN 4.0
5 NaN NaN NaN NaN 5.0
6 NaN NaN NaN NaN 0.3
7 NaN NaN NaN NaN 0.5
8 NaN NaN NaN NaN 0.6
9 NaN NaN NaN NaN 7.0
10 NaN NaN NaN NaN 4.5
11 NaN NaN NaN NaN 3.0
use the list(zip()) as follows:
d=list(zip(df1.c1,df1.c2,df1.c3,df1.c4))
df2['c5']=pd.Series(d)
try this one, works as you expected
import numpy as np
import pandas as pd
df = pd.DataFrame([[1,2,3,4],[2,3,4,5],[3,4,5,6]], columns=['c1','c2','c3','c4'])
print(df)
r = len(df['c1'])
c = len(list(df))
ndata = list(df.c1) + list(df.c2) + list(df.c3) + list(df.c4)
r = len(ndata) - r
t = r*c
dfnan = pd.DataFrame(np.reshape([np.nan]*t, (r,c)), columns=list(df))
df = df.append(dfnan)
df['c5'] = ndata
print(df)
output is below
This could be a fast option and maybe you can use it inside a loop.
import numpy as np
import pandas as pd
df = pd.DataFrame([[1,2,3,4],[2,3,4,5],[3,4,5,6]], columns=['c1','c2','c3','c4'])
df['c5'] = df.iloc[:,0].astype(str) + df.iloc[:,1].astype(str) + df.iloc[:,2].astype(str) + df.iloc[:,3].astype(str)
Greetings