Apply Zscore after groupby in Python - python-3.x

I am trying to apply a function on a grouped dataset. For that I have this Pandas dataframe:
import numpy as np
import pandas as pd
from scipy import stats

test_df = pd.DataFrame({
    'A': list('aabdee'),
    'AA': ['2020-03-22', '2020-03-22', '2020-03-29',
           '2020-03-22', '2020-03-22', '2020-03-29'],
    'B': [1, 0.5, 4, 5, 5, 4],
    'C': [7, 8, 9, 4, 2, 3],
    'D': [1, 3, 1, 7, 1, 1],
    'E': [5, 3, 6, 9, 2, 4]
})
And I want to apply a Zscore to each column (grouped by the variables A and AA). So I did:
numeric_columns = test_df.select_dtypes(np.number)
test_df.groupby(['A', 'AA'])[numeric_columns.columns].apply(stats.zscore)
But I get several errors, such as:
Series.name must be a hashable type
and this:
RuntimeWarning: invalid value encountered in true_divide
return (a - mns) / sstd

GroupBy.transform works here:
numeric_columns = test_df.select_dtypes(np.number)
c = numeric_columns.columns
test_df[c] = test_df.groupby(['A', 'AA'])[c].transform(stats.zscore)
print (test_df)
A AA B C D E
0 a 2020-03-22 1.0 -1.0 -1.0 1.0
1 a 2020-03-22 -1.0 1.0 1.0 -1.0
2 b 2020-03-29 NaN NaN NaN NaN
3 d 2020-03-22 NaN NaN NaN NaN
4 e 2020-03-22 NaN NaN NaN NaN
5 e 2020-03-29 NaN NaN NaN NaN
EDIT:
c = numeric_columns.columns
for g, df in test_df.groupby(['A', 'AA']):
    print(df)
A AA B C D E
0 a 2020-03-22 1.0 7 1 5
1 a 2020-03-22 0.5 8 3 3
A AA B C D E
2 b 2020-03-29 4.0 9 1 6
A AA B C D E
3 d 2020-03-22 5.0 4 7 9
A AA B C D E
4 e 2020-03-22 5.0 2 1 2
A AA B C D E
5 e 2020-03-29 4.0 3 1 4
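The NaN rows in the transform output come from the single-row groups shown above: zscore divides by the group's standard deviation, which is 0 for a single value, hence the true_divide warning and NaN. A minimal sketch of one way to handle it (assuming you want single-row groups filled with 0 rather than NaN), using a plain lambda in place of scipy so it is self-contained:

```python
import numpy as np
import pandas as pd

test_df = pd.DataFrame({
    'A': list('aabdee'),
    'AA': ['2020-03-22', '2020-03-22', '2020-03-29',
           '2020-03-22', '2020-03-22', '2020-03-29'],
    'B': [1, 0.5, 4, 5, 5, 4],
    'C': [7, 8, 9, 4, 2, 3],
    'D': [1, 3, 1, 7, 1, 1],
    'E': [5, 3, 6, 9, 2, 4],
})

c = test_df.select_dtypes(np.number).columns
# (s - s.mean()) / s.std(ddof=0) is the same formula stats.zscore uses;
# single-row groups produce 0/0 = NaN, which fillna(0) then replaces
test_df[c] = (test_df.groupby(['A', 'AA'])[c]
                     .transform(lambda s: (s - s.mean()) / s.std(ddof=0))
                     .fillna(0))
print(test_df)
```

Whether 0 is the right fill value depends on the downstream use; dropping single-row groups first is the other common choice.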

Related

How to convert a matrix to a one-row DataFrame

I have this DataFrame:
     a    b    c    d    e    f    g    h
o    1    1  nan  nan  nan  nan  nan  nan
o  nan  nan    2    2  nan  nan  nan  nan
o  nan  nan  nan  nan    3    3  nan  nan
o  nan  nan  nan  nan  nan  nan    4    4
I want to convert it to this DataFrame:
   a  b  c  d  e  f  g  h
o  1  1  2  2  3  3  4  4
How can I do this with pandas? I tried numpy diagonal but it failed.
Another possible solution:
a = df.values.flatten()
pd.DataFrame(a[~np.isnan(a)].reshape(-1,df.shape[1]), columns=df.columns)
Output:
a b c d e f g h
0 1.0 1.0 2.0 2.0 3.0 3.0 4.0 4.0
Use:
df = df.replace('nan', np.nan).groupby(level=0).first()
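Put together, a runnable sketch of the groupby approach (assuming numeric values and the repeated 'o' index label from the question; with string 'nan' cells you would add the replace step first):

```python
import numpy as np
import pandas as pd

cols = list('abcdefgh')
df = pd.DataFrame(
    [[1, 1, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
     [np.nan, np.nan, 2, 2, np.nan, np.nan, np.nan, np.nan],
     [np.nan, np.nan, np.nan, np.nan, 3, 3, np.nan, np.nan],
     [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, 4, 4]],
    index=['o'] * 4, columns=cols)

# first() returns the first non-NaN value per column, so grouping on the
# identical index labels collapses the block-diagonal rows into one row
out = df.groupby(level=0).first()
```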

How to read data from excel and concatenate columns vertically?

I'm reading this data from an excel file:
a b
0 x y x y
1 0 1 2 3
2 0 1 2 3
3 0 1 2 3
4 0 1 2 3
5 0 1 2 3
For each of the a and b categories (a.k.a. samples), there are two columns of x and y values. I want to convert this excel data into a dataframe that looks like this (concatenating the data from samples a and b vertically):
sample x y
0 a 0.0 1.0
1 a 0.0 1.0
2 a 0.0 1.0
3 a 0.0 1.0
4 a 0.0 1.0
5 b 2.0 3.0
6 b 2.0 3.0
7 b 2.0 3.0
8 b 2.0 3.0
9 b 2.0 3.0
I've written the following code:
x = np.arange(0, 4, 2)  # selects the even (first) column of each sample
sample_df = pd.DataFrame()  # empty accumulator DataFrame
for i in x:  # loop over the samples in the excel data
    sample = pd.read_excel(xls2, usecols=[i, i], nrows=0, header=0)
    values_df = pd.read_excel(xls2, usecols=[i, i+1], nrows=5, header=1)
    values_df.insert(loc=0, column='sample', value=sample.columns[0])
    sample_df = pd.concat([sample_df, values_df], ignore_index=True)
display(sample_df)
But, this is the Output I obtain:
sample x y x.1 y.1
0 a 0.0 1.0 NaN NaN
1 a 0.0 1.0 NaN NaN
2 a 0.0 1.0 NaN NaN
3 a 0.0 1.0 NaN NaN
4 a 0.0 1.0 NaN NaN
5 b NaN NaN 2.0 3.0
6 b NaN NaN 2.0 3.0
7 b NaN NaN 2.0 3.0
8 b NaN NaN 2.0 3.0
9 b NaN NaN 2.0 3.0
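One likely cause (a guess, since the workbook itself isn't available): pandas suffixes duplicate headers, so the second sample comes back with columns x.1/y.1, which no longer align with x/y when concatenating. Renaming each chunk to plain x/y right after the second read_excel call, before the insert, should make the rows stack. Sketched here on an in-memory chunk standing in for the excel read:

```python
import pandas as pd

# stand-in for the second read_excel chunk: duplicate headers get a ".1" suffix
values_df = pd.DataFrame({'x.1': [2.0] * 5, 'y.1': [3.0] * 5})

# normalize the column names so every chunk aligns on 'x' and 'y'
values_df.columns = ['x', 'y']
values_df.insert(loc=0, column='sample', value='b')
```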

How sum function works with NaN element?

I have a DataFrame with some NaN values. In this DataFrame there are some rows with all NaN values. When I apply sum function on these rows, it is returning zero instead of NaN. Code is as follows:
df = pd.DataFrame(np.random.randint(10,60,size=(5,3)),
index = ['a','c','e','f','h'],
columns = ['One','Two','Three'])
df = df.reindex(index=['a','b','c','d','e','f','g','h'])
print(df.loc['b'].sum())
Any Suggestion?
By default, sum treats NaN values as 0.
If you want the sum of an all-NaN selection to be NaN instead, use min_count:
df.loc['b'].sum(min_count=1)
Output:
nan
If you apply it to all rows (after using reindex) you get the following:
df.sum(axis=1,min_count=1)
a 137.0
b NaN
c 79.0
d NaN
e 132.0
f 95.0
g NaN
h 81.0
dtype: float64
if you now modify a NaN value of a row:
df.at['b','One']=0
print(df)
One Two Three
a 54.0 20.0 29.0
b 0.0 NaN NaN
c 13.0 24.0 27.0
d NaN NaN NaN
e 28.0 53.0 25.0
f 46.0 55.0 50.0
g NaN NaN NaN
h 47.0 26.0 48.0
df.sum(axis=1,min_count=1)
a 103.0
b 0.0
c 64.0
d NaN
e 106.0
f 151.0
g NaN
h 121.0
dtype: float64
As you can see, the result for row b is now 0, because it has at least one non-NaN value.

How to create a new column containing names of columns that are Nan with pandas?

If I have a dataframe like this:
A B C
Nan 1.0 Nan
1.0 Nan 1.0
1.0 Nan Nan
I want to create a new column that indicates, for each row, which columns contain NaN values.
A B C Col4
Nan 1.0 Nan A,C
1.0 Nan 1.0 B
1.0 Nan Nan B,C
Any help?
Compare with DataFrame.isna and use DataFrame.dot with the column names, then strip the trailing comma with Series.str.rstrip:
df['col4'] = df.isna().dot(df.columns + ',').str.rstrip(',')
#if values are strings Nan
#df['col4'] = df.eq('Nan').dot(df.columns + ',').str.rstrip(',')
print (df)
A B C col4
0 NaN 1.0 NaN A,C
1 1.0 NaN 1.0 B
2 1.0 NaN NaN B,C
Naive approach:
def f(r):
    ret = []
    if r['A'] == 'Nan': ret.append('A')
    if r['B'] == 'Nan': ret.append('B')
    if r['C'] == 'Nan': ret.append('C')
    return ','.join(ret)
df['D'] = df.apply(f, axis=1)
print(df)
A B C
0 Nan 1.0 Nan
1 1.0 Nan 1.0
2 1.0 Nan Nan
A B C D
0 Nan 1.0 Nan A,C
1 1.0 Nan 1.0 B
2 1.0 Nan Nan B,C
I tested on strings but you can replace that with np.nan.
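For reference, a self-contained version of the isna().dot approach with real np.nan values (column name Col4 as in the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [np.nan, 1.0, 1.0],
                   'B': [1.0, np.nan, np.nan],
                   'C': [np.nan, 1.0, np.nan]})

# boolean mask of missing cells, matrix-multiplied with 'name,' strings,
# then the trailing comma stripped off
df['Col4'] = df.isna().dot(df.columns + ',').str.rstrip(',')
```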

Stack two pandas dataframes with different columns, keeping source dataframe as column, also

I have a couple of toy dataframes I can stack using df.append, but I need to keep the source dataframes as a column, as well. I can't seem to find anything about how to do that. Here's what I do have:
d2005 = pd.DataFrame({"A": [1,2,3,4], "B": [2,4,5,6], "C": [3,5,7,8],
"G": [7,8,9,10]})
d2006 = pd.DataFrame({"A": [2,1,4,5], "B": [3,1,5,6], "D": ["a","c","d","e"],
"F": [7,8,10,12]})
d2005
A B C G
0 1 2 3 7
1 2 4 5 8
2 3 5 7 9
3 4 6 8 10
d2006
A B D F
0 2 3 a 7
1 1 1 c 8
2 4 5 d 10
3 5 6 e 12
Then I can stack them like this:
d_combined = d2005.append(d2006, ignore_index = True, sort = True)
d_combined
A B C D F G
0 1 2 3.0 NaN NaN 7.0
1 2 4 5.0 NaN NaN 8.0
2 3 5 7.0 NaN NaN 9.0
3 4 6 8.0 NaN NaN 10.0
4 2 3 NaN a 7.0 NaN
5 1 1 NaN c 8.0 NaN
6 4 5 NaN d 10.0 NaN
7 5 6 NaN e 12.0 NaN
But what I really need is another column with the source dataframe added to the right end of d_combined. Something like this:
A B C D G F From
0 1 2 3.0 NaN 7.0 NaN d2005
1 2 4 5.0 NaN 8.0 NaN d2005
2 3 5 7.0 NaN 9.0 NaN d2005
3 4 6 8.0 NaN 10.0 NaN d2005
4 2 3 NaN a NaN 7.0 d2006
5 1 1 NaN c NaN 8.0 d2006
6 4 5 NaN d NaN 10.0 d2006
7 5 6 NaN e NaN 12.0 d2006
Hopefully someone has a quick trick they can share.
Thanks.
This gets what you want but there should be a more elegant way:
df_list = [d2005, d2006]
name_list = ['2005', '2006']
for df, name in zip(df_list, name_list):
    df['from'] = name
Then
d_combined = d2005.append(d2006, ignore_index=True)
d_combined
A B C D F G from
0 1 2 3.0 NaN NaN 7.0 2005
1 2 4 5.0 NaN NaN 8.0 2005
2 3 5 7.0 NaN NaN 9.0 2005
3 4 6 8.0 NaN NaN 10.0 2005
4 2 3 NaN a 7.0 NaN 2006
5 1 1 NaN c 8.0 NaN 2006
6 4 5 NaN d 10.0 NaN 2006
7 5 6 NaN e 12.0 NaN 2006
Alternatively, you can set df.name at the time of creation of the df and use it in the for loop.
d2005 = pd.DataFrame({"A": [1,2,3,4], "B": [2,4,5,6], "C": [3,5,7,8],
"G": [7,8,9,10]} )
d2005.name = 2005
d2006 = pd.DataFrame({"A": [2,1,4,5], "B": [3,1,5,6], "D": ["a","c","d","e"],
"F": [7,8,10,12]})
d2006.name = 2006
df_list = [d2005, d2006]
for df in df_list:
    df['from'] = df.name
I believe this can be simply achieved by adding the From column to the original dataframes itself.
So effectively,
d2005 = pd.DataFrame({"A": [1,2,3,4], "B": [2,4,5,6], "C": [3,5,7,8],
"G": [7,8,9,10]})
d2006 = pd.DataFrame({"A": [2,1,4,5], "B": [3,1,5,6], "D": ["a","c","d","e"],
"F": [7,8,10,12]})
Then,
d2005['From'] = 'd2005'
d2006['From'] = 'd2006'
And then you append,
d_combined = d2005.append(d2006, ignore_index = True, sort = True)
gives you:
   A  B    C    D     F   From     G
0  1  2  3.0  NaN   NaN  d2005   7.0
1  2  4  5.0  NaN   NaN  d2005   8.0
2  3  5  7.0  NaN   NaN  d2005   9.0
3  4  6  8.0  NaN   NaN  d2005  10.0
4  2  3  NaN    a   7.0  d2006   NaN
5  1  1  NaN    c   8.0  d2006   NaN
6  4  5  NaN    d  10.0  d2006   NaN
7  5  6  NaN    e  12.0  d2006   NaN
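A note on API drift: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0. pd.concat with the keys parameter produces the same stacked frame and records the source label without modifying the input frames; a sketch:

```python
import pandas as pd

d2005 = pd.DataFrame({"A": [1, 2, 3, 4], "B": [2, 4, 5, 6], "C": [3, 5, 7, 8],
                      "G": [7, 8, 9, 10]})
d2006 = pd.DataFrame({"A": [2, 1, 4, 5], "B": [3, 1, 5, 6], "D": ["a", "c", "d", "e"],
                      "F": [7, 8, 10, 12]})

# keys= tags each input frame; the tag lands in the outer index level,
# and reset_index then turns that level into a regular 'From' column
d_combined = (pd.concat([d2005, d2006], keys=["d2005", "d2006"], sort=True)
                .rename_axis(["From", None])
                .reset_index(level=0)
                .reset_index(drop=True))
```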