I have this DataFrame:
     a    b    c    d    e    f    g    h
o    1    1  nan  nan  nan  nan  nan  nan
o  nan  nan    2    2  nan  nan  nan  nan
o  nan  nan  nan  nan    3    3  nan  nan
o  nan  nan  nan  nan  nan  nan    4    4
I want to convert it into this DataFrame:
   a  b  c  d  e  f  g  h
o  1  1  2  2  3  3  4  4
How can I do this with pandas? I tried numpy's diagonal, but it failed.
Another possible solution:
a = df.values.flatten()
pd.DataFrame(a[~np.isnan(a)].reshape(-1,df.shape[1]), columns=df.columns)
Output:
     a    b    c    d    e    f    g    h
0  1.0  1.0  2.0  2.0  3.0  3.0  4.0  4.0
Use:
df = df.replace('nan', np.nan).groupby(level=0).first()
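The groupby approach can be checked on a small frame. This is only a minimal sketch: it assumes the four rows all share the index label o, as in the question, and the way df is built here is my own reconstruction rather than the asker's code.

```python
import numpy as np
import pandas as pd

nan = np.nan
# Assumed reconstruction of the question's frame: four rows, all labelled 'o'
df = pd.DataFrame(
    [[1, 1, nan, nan, nan, nan, nan, nan],
     [nan, nan, 2, 2, nan, nan, nan, nan],
     [nan, nan, nan, nan, 3, 3, nan, nan],
     [nan, nan, nan, nan, nan, nan, 4, 4]],
    index=['o', 'o', 'o', 'o'],
    columns=list('abcdefgh'),
)

# first() keeps the first non-NaN value of each column within each index group
result = df.groupby(level=0).first()
print(result)
```

Because every column holds exactly one non-NaN value per index label, the four rows collapse into a single row. If a column had several non-NaN values for the same label, only the first one would be kept.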
I am trying to apply a function on a grouped dataset. For that I have this Pandas dataframe:
test_df = pd.DataFrame({
'A':list('aabdee'),
'AA':['2020-03-22', '2020-03-22', '2020-03-29', '2020-03-22','2020-03-22', '2020-03-29'],
'B':[1,0.5,4,5,5,4],
'C':[7,8,9,4,2,3],
'D':[1,3,1,7,1,1],
'E':[5,3,6,9,2,4]
})
And I want to apply a Zscore to each column (grouped by the variables A and AA). So I did:
numeric_columns = test_df.select_dtypes(np.number)
test_df.groupby(['A', 'AA'])[numeric_columns.columns].apply(stats.zscore)
But then I get a lot of errors, like this:
Series.name must be a hashable type
and this:
RuntimeWarning: invalid value encountered in true_divide
return (a - mns) / sstd
For me, GroupBy.transform works:
numeric_columns = test_df.select_dtypes(np.number)
c = numeric_columns.columns
test_df[c] = test_df.groupby(['A', 'AA'])[c].transform(stats.zscore)
print (test_df)
A AA B C D E
0 a 2020-03-22 1.0 -1.0 -1.0 1.0
1 a 2020-03-22 -1.0 1.0 1.0 -1.0
2 b 2020-03-29 NaN NaN NaN NaN
3 d 2020-03-22 NaN NaN NaN NaN
4 e 2020-03-22 NaN NaN NaN NaN
5 e 2020-03-29 NaN NaN NaN NaN
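EDIT: Most groups contain only one row, so their standard deviation is 0 and the z-score comes out as NaN. You can see this by printing each group:
for g, df in test_df.groupby(['A', 'AA']):
    print (df)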
A AA B C D E
0 a 2020-03-22 1.0 7 1 5
1 a 2020-03-22 0.5 8 3 3
A AA B C D E
2 b 2020-03-29 4.0 9 1 6
A AA B C D E
3 d 2020-03-22 5.0 4 7 9
A AA B C D E
4 e 2020-03-22 5.0 2 1 2
A AA B C D E
5 e 2020-03-29 4.0 3 1 4
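To make the source of the NaN values explicit, here is a minimal sketch that uses a hand-written z-score instead of scipy (the helper zscore and the group_sizes column are my own additions for illustration, and only columns B and C are kept to keep it short):

```python
import pandas as pd

test_df = pd.DataFrame({
    'A': list('aabdee'),
    'AA': ['2020-03-22', '2020-03-22', '2020-03-29',
           '2020-03-22', '2020-03-22', '2020-03-29'],
    'B': [1, 0.5, 4, 5, 5, 4],
    'C': [7, 8, 9, 4, 2, 3],
})

def zscore(s):
    # Population z-score (ddof=0), the same convention scipy.stats.zscore uses by default
    return (s - s.mean()) / s.std(ddof=0)

cols = ['B', 'C']
# transform returns a result aligned with the original rows
z = test_df.groupby(['A', 'AA'])[cols].transform(zscore)
# number of rows in each row's group
group_sizes = test_df.groupby(['A', 'AA'])['B'].transform('size')
print(z.assign(group_size=group_sizes))
```

Only the group ('a', '2020-03-22') has two rows, so it is the only one with a defined z-score. Every single-row group divides 0 by 0 and yields NaN.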
I have a DataFrame with some NaN values, including some rows where every value is NaN. When I apply sum to these rows, it returns zero instead of NaN. The code is as follows:
df = pd.DataFrame(np.random.randint(10,60,size=(5,3)),
index = ['a','c','e','f','h'],
columns = ['One','Two','Three'])
df = df.reindex(index=['a','b','c','d','e','f','g','h'])
print(df.loc['b'].sum())
Any suggestions?
By default, sum skips NaN values, so an all-NaN row sums to 0.
If you want the sum of an all-NaN row to be NaN, pass min_count=1:
df.loc['b'].sum(min_count=1)
Output:
nan
If you apply this to all rows (after using reindex), you get the following:
df.sum(axis=1,min_count=1)
a 137.0
b NaN
c 79.0
d NaN
e 132.0
f 95.0
g NaN
h 81.0
dtype: float64
If you now replace one NaN value in a row:
df.at['b','One']=0
print(df)
One Two Three
a 54.0 20.0 29.0
b 0.0 NaN NaN
c 13.0 24.0 27.0
d NaN NaN NaN
e 28.0 53.0 25.0
f 46.0 55.0 50.0
g NaN NaN NaN
h 47.0 26.0 48.0
df.sum(axis=1,min_count=1)
a 103.0
b 0.0
c 64.0
d NaN
e 106.0
f 151.0
g NaN
h 121.0
dtype: float64
As you can see, the result for row b is now 0, because the row contains at least one non-NaN value.
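Since the numbers above come from random data, here is a small deterministic sketch of the same behaviour (the frame and the names default_sum and strict_sum are made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'One': [1.0, np.nan, 0.0],
                   'Two': [2.0, np.nan, np.nan]},
                  index=['a', 'b', 'c'])

# Default: NaN is skipped, so the all-NaN row 'b' sums to 0
default_sum = df.sum(axis=1)
# min_count=1: at least one non-NaN value is required, otherwise the result is NaN
strict_sum = df.sum(axis=1, min_count=1)
print(strict_sum)
```

Row c still sums to 0.0 with min_count=1 because it has one valid value, while row b becomes NaN.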
If I have a DataFrame like this:
A B C
Nan 1.0 0.0
1.0 Nan 1.0
1.0 0.0 Nan
I want to create a new column in the DataFrame that shows which columns in each row contain NaN values.
A B C Col4
Nan 1.0 Nan A,C
1.0 Nan 1.0 B
1.0 Nan Nan B,C
Any help?
Build a boolean mask with DataFrame.isna, use DataFrame.dot with the column names (each followed by a comma) to join the names of the NaN columns, and finally strip the trailing comma with Series.str.rstrip:
df['col4'] = df.isna().dot(df.columns + ',').str.rstrip(',')
#if values are strings Nan
#df['col4'] = df.eq('Nan').dot(df.columns + ',').str.rstrip(',')
print (df)
A B C col4
0 NaN 1.0 NaN A,C
1 1.0 NaN 1.0 B
2 1.0 NaN NaN B,C
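A self-contained sketch of why the dot trick works, assuming real NaN values (the frame is rebuilt from the expected output above; mask and labels are illustrative names):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [np.nan, 1.0, 1.0],
                   'B': [1.0, np.nan, np.nan],
                   'C': [np.nan, 1.0, np.nan]})

# True where a value is missing
mask = df.isna()
# True * 'A,' gives 'A,' and False * 'A,' gives '', so the dot product
# concatenates the names of the columns that are NaN in each row
labels = mask.dot(mask.columns + ',').str.rstrip(',')
print(labels.tolist())
```

The mask is computed before the new column is added, so the new column itself never takes part in the check.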
Naive approach:
def f(r):
    ret=[]
    if(r['A']=='Nan'): ret.append('A')
    if(r['B']=='Nan'): ret.append('B')
    if(r['C']=='Nan'): ret.append('C')
    return ','.join(ret)
df['D'] = df.apply(f, axis=1)
print(df)
A B C
0 Nan 1.0 Nan
1 1.0 Nan 1.0
2 1.0 Nan Nan
A B C D
0 Nan 1.0 Nan A,C
1 1.0 Nan 1.0 B
2 1.0 Nan Nan B,C
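I tested this on the string 'Nan'. For real missing values, test with pd.isna(r['A']) instead, because a comparison such as r['A'] == np.nan is always False.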
Let us assume this is my DataFrame
City State Country
Name
A NYC NaN NaN
B NaN NaN USA
C NYC NY NaN
D 601009 NaN NaN
E NYC AZ NaN
F 000001 NaN NaN
G NaN NaN NaN
How do I get the rows that have NaN in both State and Country?
I'm looking for the following output:
City State Country
Name
A NYC NaN NaN
D 601009 NaN NaN
F 000001 NaN NaN
G NaN NaN NaN
Thanks a bunch!
Use isnull:
In [133]: wd[wd['Country'].isnull() & wd['State'].isnull()]
Out[133]:
City State Country
Name
A NYC NaN NaN
D 601009 NaN NaN
F 000001 NaN NaN
G NaN NaN NaN
or
In [135]: wd[wd[['State', 'Country']].isnull().all(axis=1)]
Out[135]:
City State Country
Name
A NYC NaN NaN
D 601009 NaN NaN
F 000001 NaN NaN
G NaN NaN NaN
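For a quick check outside IPython, here is a minimal sketch on a cut-down version of the frame (only four of the rows are rebuilt, and both_missing is an illustrative name). isna is the same method as isnull under another name:

```python
import numpy as np
import pandas as pd

wd = pd.DataFrame(
    {'City': ['NYC', np.nan, 'NYC', np.nan],
     'State': [np.nan, np.nan, 'NY', np.nan],
     'Country': [np.nan, 'USA', np.nan, np.nan]},
    index=pd.Index(['A', 'B', 'C', 'G'], name='Name'),
)

# Keep rows where every one of the listed columns is NaN
both_missing = wd[wd[['State', 'Country']].isna().all(axis=1)]
print(both_missing)
```

The all(axis=1) form scales better than chaining & when more columns need to be checked, since only the column list changes.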
I have a DataFrame called ref (the first DataFrame) with columns c1, c2, c3 and c4.
ref= pd.DataFrame([[1,3,.3,7],[0,4,.5,4.5],[2,5,.6,3]], columns=['c1','c2','c3','c4'])
print(ref)
c1 c2 c3 c4
0 1 3 0.3 7.0
1 0 4 0.5 4.5
2 2 5 0.6 3.0
I want to create a new column, c5 (in a second DataFrame), that contains all the values from columns c1, c2, c3 and c4.
I tried concat and merging columns, but I cannot get it to work.
Please let me know if you have a solution.
You can use unstack to create a Series from the DataFrame and then concat it to the original:
print (pd.concat([ref, ref.unstack().reset_index(drop=True).rename('c5')], axis=1))
c1 c2 c3 c4 c5
0 1.0 3.0 0.3 7.0 1.0
1 0.0 4.0 0.5 4.5 0.0
2 2.0 5.0 0.6 3.0 2.0
3 NaN NaN NaN NaN 3.0
4 NaN NaN NaN NaN 4.0
5 NaN NaN NaN NaN 5.0
6 NaN NaN NaN NaN 0.3
7 NaN NaN NaN NaN 0.5
8 NaN NaN NaN NaN 0.6
9 NaN NaN NaN NaN 7.0
10 NaN NaN NaN NaN 4.5
11 NaN NaN NaN NaN 3.0
An alternative way to create the Series is to convert the DataFrame to a numpy array with values and then flatten it column by column with ravel('F'):
print (pd.concat([ref, pd.Series(ref.values.ravel('F'), name='c5')], axis=1))
c1 c2 c3 c4 c5
0 1.0 3.0 0.3 7.0 1.0
1 0.0 4.0 0.5 4.5 0.0
2 2.0 5.0 0.6 3.0 2.0
3 NaN NaN NaN NaN 3.0
4 NaN NaN NaN NaN 4.0
5 NaN NaN NaN NaN 5.0
6 NaN NaN NaN NaN 0.3
7 NaN NaN NaN NaN 0.5
8 NaN NaN NaN NaN 0.6
9 NaN NaN NaN NaN 7.0
10 NaN NaN NaN NaN 4.5
11 NaN NaN NaN NaN 3.0
using join + ravel('F')
ref.join(pd.Series(ref.values.ravel('F')).to_frame('c5'), how='right')
using join + T.ravel()
ref.join(pd.Series(ref.values.T.ravel()).to_frame('c5'), how='right')
pd.concat + T.stack() + rename
pd.concat([ref, ref.T.stack().reset_index(drop=True).rename('c5')], axis=1)
way too many transposes + append (DataFrame.append was removed in pandas 2.0, so this one needs an older pandas)
ref.T.append(ref.T.stack().reset_index(drop=True).rename('c5')).T
combine_first + ravel('F') <--- my favorite
ref.combine_first(pd.Series(ref.values.ravel('F')).to_frame('c5'))
All yield
c1 c2 c3 c4 c5
0 1.0 3.0 0.3 7.0 1.0
1 0.0 4.0 0.5 4.5 0.0
2 2.0 5.0 0.6 3.0 2.0
3 NaN NaN NaN NaN 3.0
4 NaN NaN NaN NaN 4.0
5 NaN NaN NaN NaN 5.0
6 NaN NaN NaN NaN 0.3
7 NaN NaN NaN NaN 0.5
8 NaN NaN NaN NaN 0.6
9 NaN NaN NaN NaN 7.0
10 NaN NaN NaN NaN 4.5
11 NaN NaN NaN NaN 3.0
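As a runnable check of the ravel('F') variant, here is a minimal sketch (it uses to_numpy() in place of values, which is my substitution; c5 and out are illustrative names):

```python
import pandas as pd

ref = pd.DataFrame([[1, 3, .3, 7], [0, 4, .5, 4.5], [2, 5, .6, 3]],
                   columns=['c1', 'c2', 'c3', 'c4'])

# ravel('F') flattens in column-major order: all of c1, then c2, c3, c4
c5 = pd.Series(ref.to_numpy().ravel('F'), name='c5')
# concat aligns on the index, so rows 3..11 of c1..c4 are filled with NaN
out = pd.concat([ref, c5], axis=1)
print(out)
```

The order matters here: a plain ravel() would flatten row by row and interleave the columns instead of stacking them one after another.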
Use list(zip()) as follows (this stores the values of each row as one tuple in c5):
d=list(zip(df1.c1,df1.c2,df1.c3,df1.c4))
df2['c5']=pd.Series(d)
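Try this one; it works as you expected:
import numpy as np
import pandas as pd
df = pd.DataFrame([[1,2,3,4],[2,3,4,5],[3,4,5,6]], columns=['c1','c2','c3','c4'])
print(df)
r = len(df['c1'])   # number of original rows
c = len(list(df))   # number of columns
# collect all values into one list, column by column
ndata = list(df.c1) + list(df.c2) + list(df.c3) + list(df.c4)
r = len(ndata) - r  # number of NaN rows to add
t = r*c
dfnan = pd.DataFrame(np.reshape([np.nan]*t, (r,c)), columns=list(df))
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df = pd.concat([df, dfnan], ignore_index=True)
df['c5'] = ndata
print(df)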
output is below
This could be a fast option if you want the values of each row joined into one string, and maybe you can use it inside a loop.
import numpy as np
import pandas as pd
df = pd.DataFrame([[1,2,3,4],[2,3,4,5],[3,4,5,6]], columns=['c1','c2','c3','c4'])
df['c5'] = df.iloc[:,0].astype(str) + df.iloc[:,1].astype(str) + df.iloc[:,2].astype(str) + df.iloc[:,3].astype(str)
Greetings