Including NaN values in a function applied to a pandas GroupBy object

I would like to calculate the mean of replicate measurements and return NaN when one or both replicates have a NaN value. I am aware that groupby excludes NaN values, but it took me some time to realize apply was doing the same thing. Below is an example of my code. It only returns NaN when both replicates have missing data. In this example I would like it to return NaN for Sample 1, Assay 2. Instead, it behaves as if I had applied np.nanmean and returns the one non-missing value, 27.0. Any ideas on a strategy to include NaN values in the function I am applying?
In[4]: import pandas as pd
In[5]: import numpy as np
In[6]: df = pd.DataFrame({'Sample ID': ['Sample 1', 'Sample 1', 'Sample 1', 'Sample 1', 'Sample 2', 'Sample 2', 'Sample 2', 'Sample 2'],
                          'Assay': ['Assay 1', 'Assay 1', 'Assay 2', 'Assay 2', 'Assay 1', 'Assay 1', 'Assay 2', 'Assay 2'],
                          'Replicate': [1, 2, 1, 2, 1, 2, 1, 2],
                          'Value': [34.0, 30.0, 27.0, np.nan, 16.0, 18.0, np.nan, np.nan]})
In[7]: df
Out[7]:
Sample ID Assay Replicate Value
0 Sample 1 Assay 1 1 34.0
1 Sample 1 Assay 1 2 30.0
2 Sample 1 Assay 2 1 27.0
3 Sample 1 Assay 2 2 NaN
4 Sample 2 Assay 1 1 16.0
5 Sample 2 Assay 1 2 18.0
6 Sample 2 Assay 2 1 NaN
7 Sample 2 Assay 2 2 NaN
In[9]: Group = df.groupby(['Sample ID', 'Assay'])
In[10]: df2 = Group['Value'].aggregate(np.mean).unstack()
In[11]: df2
Out[11]:
Assay Assay 1 Assay 2
Sample ID
Sample 1 32.0 27.0
Sample 2 17.0 NaN

I think the issue lies in the conversion that happens when np.mean executes on each group. From the np.mean documentation:
Array containing numbers whose mean is desired. If a is not an array,
a conversion is attempted.
I was able to make it work by doing the conversion manually, defining a function that calls np.mean on the underlying array:
def aggregate_func(series):
    return np.mean(series.values)
and using that function on the aggregate call like so:
df2 = Group['Value'].aggregate(aggregate_func).unstack()
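An alternative that stays entirely in pandas, a minimal sketch assuming the same Group object as above: Series.mean accepts a skipna flag, so disabling it inside apply propagates the NaN for any group with missing data.
# Series.mean skips NaN by default; skipna=False makes any group containing NaN return NaN
df2 = Group['Value'].apply(lambda s: s.mean(skipna=False)).unstack()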
Another option is np.average, which behaves the same as np.mean when you don't provide the optional weights parameter, and whose conversion of the Series works as expected.
Using it gave me the expected result:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Sample ID': ['Sample 1', 'Sample 1', 'Sample 1', 'Sample 1', 'Sample 2', 'Sample 2', 'Sample 2', 'Sample 2'],
                   'Assay': ['Assay 1', 'Assay 1', 'Assay 2', 'Assay 2', 'Assay 1', 'Assay 1', 'Assay 2', 'Assay 2'],
                   'Replicate': [1, 2, 1, 2, 1, 2, 1, 2],
                   'Value': [34.0, 30.0, 27.0, np.nan, 16.0, 18.0, np.nan, np.nan]})
Group = df.groupby(['Sample ID', 'Assay'])
df2 = Group['Value'].aggregate(np.average).unstack()
which results in
Assay Assay 1 Assay 2
Sample ID
Sample 1 32.0 NaN
Sample 2 17.0 NaN

Related

Python: create a new column and copy the value from another row that is a swap of the current row

I have a dataframe which has 3 columns:
import pandas as pd
d = {'A': ['left', 'right', 'east', 'west', 'south', 'north'], 'B': ['right', 'left', 'west', 'east', 'north', 'south'], 'VALUE': [0, 1, 2, 3, 4, 5]}
df = pd.DataFrame(d)
Dataframe looks like this:
A B VALUE
left right 0
right left 1
east west 2
west east 3
south north 4
north south 5
I am trying to create a new column VALUE_2 which should contain the value from the swapped row in the same Dataframe.
E.g. the left - right value is 0 and the right - left value is 1, and I want the swapped values in the new column like this:
A B VALUE VALUE_2
left right 0 1
right left 1 0
east west 2 3
west east 3 2
south north 4 5
north south 5 4
I tried:
for row_num, record in df.iterrows():
    A = df['A'][row_num]
    B = df['B'][row_num]
    if pd.Series([record['A'] == B, record['B'] == A]).all():
        df['VALUE_2'] = df['VALUE']
I'm stuck here; inputs will be highly appreciated.
Use map with a Series built by indexing VALUE by column B:
df['VALUE_2'] = df['A'].map(df.set_index('B')['VALUE'])
print (df)
A B VALUE VALUE_2
0 left right 0 1
1 right left 1 0
2 east west 2 3
3 west east 3 2
4 south north 4 5
5 north south 5 4
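To see why this works, here is the intermediate mapping Series (a small sketch; it assumes the values in B are unique, so each A value finds exactly one partner row):
lookup = df.set_index('B')['VALUE']
print(lookup)
B
right    0
left     1
west     2
east     3
north    4
south    5
Name: VALUE, dtype: int64
Mapping df['A'] through this Series fetches, for each row, the VALUE of the row whose B equals the current row's A, i.e. the swapped row.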
Just a more verbose answer:
import pandas as pd
d = {'A': ['left', 'right', 'east', 'west', 'south', 'north'], 'B': ['right', 'left', 'west', 'east', 'north', 'south'], 'VALUE': [0, 1, 2, 3, 4, 5]}
df = pd.DataFrame(d)
pdf = pd.DataFrame([])
for idx, item in df.iterrows():
    indx = list(df['B']).index(str(df['A'][idx]))
    pdf = pdf.append(pd.DataFrame({'VALUE_2': df.iloc[indx][2]}, index=[0]), ignore_index=True)
print(pdf)
data = pd.concat([df, pdf], axis=1)
print(data)

Is there a better way to get, as a list, the labels of those features (columns) without a single missing value in a data frame?

I am playing with a data set from Kaggle. I would like to get, as a list, all the column labels (features) of those features without a single missing value. I have done it (I think so), but I wonder if there is a better way to do it. Here is my code; result is a list of those features without a single missing value:
import pandas as pd
data = pd.read_csv(r'C:\Users\.kaggle\house-prices\train.csv')
result = data.isnull().sum(axis=0)[data.isnull().sum(axis=0) == 0].index.tolist()
For example, if I run the following code:
d = {'Feature 1': [None, 1, 2, None],
     'Feature 2': [4, 5, 5, 6],
     'Feature 3': [7, 7, 8, 9]}
df = pd.DataFrame(data=d)
print(df)
print(df.isnull().sum(axis=0)[df.isnull().sum(axis=0) == 0].index.tolist())
I will get the following result:
Feature 1 Feature 2 Feature 3
0 NaN 4 7
1 1.0 5 7
2 2.0 5 8
3 NaN 6 9
['Feature 2', 'Feature 3']
Use dropna and convert the column names to a list:
print (df.dropna(axis=1).columns.tolist())
['Feature 2', 'Feature 3']
Detail:
print (df.dropna(axis=1))
Feature 2 Feature 3
0 4 7
1 5 7
2 5 8
3 6 9
notnull + all
df.notnull().all().loc[lambda x : x].index.tolist()
Out[449]: ['Feature 2', 'Feature 3']
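A close variant, assuming you only want the labels and not a dropped copy of the frame, is to index the columns directly with the boolean mask:
print(df.columns[df.notnull().all()].tolist())
['Feature 2', 'Feature 3']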

Adding additional text to the hovertext label

I've searched for some time now and I can't seem to find a related question. There are similar questions, but nothing that gets to the heart of what I am trying to do with the code I have.
I am trying to add additional text to the hovertext in plotly. Here is my code so far:
import pandas as pd
import numpy as np
import cufflinks  # DataFrame.iplot is provided by cufflinks
from plotly.offline import *
init_notebook_mode(connected=True)
graph1 = merged.groupby(['study_arm', 'visit_label'])['mjsn'].mean().unstack('study_arm')
graph1.iplot(mode='lines+markers',
             symbol=['diamond-open', 'square-open', 'circle-dot', 'hexagon-open'],
             size=8, colorscale='dark2', yTitle='ytitle', xTitle='xtitle',
             title='Title of Graph',
             hoverformat='.2f')
hv = merged.groupby(['study_arm', 'visit_label']).size()
print(hv)
Note: 'merged' is the dataframe and is shown in the sample data below.
The code above gives the following output (hovering over a timepoint shows some information for each trace).
My question is: how can I get the subject count from the hv table into the hovertext for each trace at each timepoint, preferably on a second line of the hovertext that reads 'N=x', where x is the subject count for that group?
Here is a sample of the dummy dataset I used to create this graph and table:
subject_number visit_label mjsn study_arm
20001 Day 1 0 B
20001 Month 06 0.4 B
20001 Month 12 0.2 B
20003 Day 1 0 B
20003 Month 06 -0.9 B
20003 Month 12 -0.7 B
20005 Day 1 0 C
20005 Month 06 0.1 C
20005 Month 12 -0.1 C
20007 Day 1 0 D
20007 Month 06 0 D
20007 Month 12 -0.3 D
20008 Day 1 0 C
20008 Month 06 -0.3 C
20008 Month 12 -0.1 C
20010 Day 1 0 A
20010 Month 06 -0.6 A
20010 Month 12 -0.4 A
You want to set the text or hovertext element for each chart/trace; either will work here, though there are subtle differences in when each is shown. You may also want to change the hoverinfo element, whose options are 'x', 'y', 'none', 'text', and 'all'. The Plotly documentation on text and annotations covers the details. In addition, to get the count of cases at each time period I did two different groupby operations and concatenated the results together.
Example using your dataframe:
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)
import pandas as pd
df = pd.DataFrame({
    'subject_number': [20001, 20001, 20001, 20003, 20003, 20003, 20005, 20005,
                       20005, 20007, 20007, 20007, 20008, 20008, 20008, 20010,
                       20010, 20010],
    'visit_label': ['Day 1', 'Month 6', 'Month 12', 'Day 1', 'Month 6',
                    'Month 12', 'Day 1', 'Month 6', 'Month 12', 'Day 1',
                    'Month 6', 'Month 12', 'Day 1', 'Month 6', 'Month 12',
                    'Day 1', 'Month 6', 'Month 12'],
    'mjsn': [0, 0.4, 0.2, 0, -0.9, -0.7, 0, 0.1, -0.1, 0, 0, -0.3, 0, -0.3,
             -0.1, 0, -0.6, -0.4],
    'study_arm': ['B', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'D', 'D', 'D',
                  'C', 'C', 'C', 'A', 'A', 'A']
})
grouped = df.groupby(['study_arm', 'visit_label'])
tmp1 = grouped.mean()['mjsn']
tmp2 = grouped.count()['subject_number']
output_df = pd.concat([tmp1, tmp2], axis=1)
data = []
for study in output_df.index.get_level_values(0).unique():
    trace = go.Scatter(
        x=output_df.loc[study, :].index,
        y=output_df.loc[study, 'mjsn'],
        hovertext=['mjsn:{0}<br>subject:{1}'.format(x, int(y))
                   for x, y in zip(output_df.loc[study, 'mjsn'],
                                   output_df.loc[study, 'subject_number'])],
        mode='lines+markers',
        hoverinfo='text'
    )
    data += [trace]
iplot(data)
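If you also want the chart and axis titles from the cufflinks version back, a small sketch, reusing the placeholder titles from the question, is to wrap the traces in a Figure with a Layout:
layout = go.Layout(title='Title of Graph',
                   xaxis=dict(title='xtitle'),
                   yaxis=dict(title='ytitle'))
iplot(go.Figure(data=data, layout=layout))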

Pandas: How to build a column based on another column which is indexed by another one?

I have the dataframe presented below. I tried the solution shown below it, but I am not sure it is a good one.
import pandas as pd
import numpy as np

def creatingDataFrame():
    raw_data = {'code': [1, 2, 3, 2, 3, 3],
                'Region': ['A', 'A', 'C', 'B', 'A', 'B'],
                'var-A': [2, 4, 6, 4, 6, 6],
                'var-B': [20, 30, 40, 50, 10, 20],
                'var-C': [3, 4, 5, 1, 2, 3]}
    df = pd.DataFrame(raw_data, columns=['code', 'Region', 'var-A', 'var-B', 'var-C'])
    return df

if __name__ == "__main__":
    df = creatingDataFrame()
    df['var'] = (np.where(df['Region'] == 'A', 1.0, 0.0) * df['var-A']
                 + np.where(df['Region'] == 'B', 1.0, 0.0) * df['var-B']
                 + np.where(df['Region'] == 'C', 1.0, 0.0) * df['var-C'])
I want the variable 'var' to take the value of column 'var-A', 'var-B' or 'var-C' depending on the region given in column 'Region'.
The result must be
df['var']
Out[50]:
0 2.0
1 4.0
2 5.0
3 50.0
4 6.0
5 20.0
Name: var, dtype: float64
You can try lookup:
df.columns=df.columns.str.split('-').str[-1]
df
Out[255]:
code Region A B C
0 1 A 2 20 3
1 2 A 4 30 4
2 3 C 6 40 5
3 2 B 4 50 1
4 3 A 6 10 2
5 3 B 6 20 3
df.lookup(df.index,df.Region)
Out[256]: array([ 2, 4, 5, 50, 6, 20], dtype=int64)
#df['var']=df.lookup(df.index,df.Region)
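Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0. A sketch of the replacement suggested in the pandas docs, using factorize plus NumPy indexing (this assumes the columns have already been renamed to match the Region labels, as above):
import numpy as np
idx, cols = pd.factorize(df['Region'])
df['var'] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]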

Updating values in a pandas dataframe using another dataframe

I have an existing pandas Dataframe with the following format:
import pandas as pd

sample_dict = {'ID': [100, 200, 300], 'a': [1, 2, 3], 'b': [.1, .2, .3], 'c': [4, 5, 6], 'd': [.4, .5, .6]}
df_sample = pd.DataFrame(sample_dict)
Now, I want to update df_sample using another dataframe that looks like this:
sample_update = {'ID': [100, 300], 'a': [3, 2], 'b': [.4, .2], 'c': [2, 5], 'd': [.7, .1]}
df_updater = pd.DataFrame(sample_update)
The rule for the update is this:
For columns a and c, just add the values from a and c in df_updater.
For column b, it depends on the updated value of a. Let's say the update function is b = old_b + (new_b / updated_a).
For column d, the rule is similar to that of column b, except that it depends on the values of the updated c and the new d.
Here is the desired output:
new = {'ID': [100, 200, 300], 'a': [4, 2, 5], 'b': [.233333, .2, .33999999], 'c': [6, 5, 11], 'd': [.51666666, .5, .609090]}
df_new = pd.DataFrame(new)
My actual problem is a slightly more complicated version of this, but I think this example is enough to solve it. Also, in my real DataFrame I have more columns following the same rules, so I would like this method to loop over the columns if possible. Thanks!
You can use functions merge, add and div:
df = pd.merge(df_sample,df_updater,on='ID', how='left')
df[['a','c']] = df[['a_y','c_y']].add(df[['a_x','c_x']].values, fill_value=0)
df['b'] = df['b_x'].add(df['b_y'].div(df.a_y), fill_value=0)
df['d'] = df['c_x'].add(df['d_y'].div(df.c_y), fill_value=0)
print (df)
ID a_x b_x c_x d_x a_y b_y c_y d_y a c b d
0 100 1 0.1 4 0.4 3.0 0.4 2.0 0.7 4.0 6.0 0.233333 4.35
1 200 2 0.2 5 0.5 NaN NaN NaN NaN 2.0 5.0 0.200000 5.00
2 300 3 0.3 6 0.6 2.0 0.2 5.0 0.1 5.0 11.0 0.400000 6.02
print (df[['a','b','c','d']])
a b c d
0 4.0 0.233333 6.0 4.35
1 2.0 0.200000 5.0 5.00
2 5.0 0.400000 11.0 6.02
Instead of merge it is also possible to use concat:
df = pd.concat([df_sample.set_index('ID'), df_updater.set_index('ID')], axis=1, keys=('_x', '_y'))
df.columns = [''.join((col[1], col[0])) for col in df.columns]
df.reset_index(inplace=True)
print (df)
ID a_x b_x c_x d_x a_y b_y c_y d_y
0 100 1 0.1 4 0.4 3.0 0.4 2.0 0.7
1 200 2 0.2 5 0.5 NaN NaN NaN NaN
2 300 3 0.3 6 0.6 2.0 0.2 5.0 0.1
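Since the question asks about looping over more columns that follow the same rules, here is a minimal sketch generalizing the merge approach to a list of (sum column, ratio column) pairs. The pairing below and the exact update formula (which mirrors the rule applied to column b above) are assumptions to adjust for the real data:
import pandas as pd

pairs = [('a', 'b'), ('c', 'd')]  # assumed (sum column, ratio column) pairs

df = pd.merge(df_sample, df_updater, on='ID', how='left', suffixes=('_x', '_y'))
for add_col, ratio_col in pairs:
    # updated sum column: old value plus new value (rows with no update keep the old value)
    df[add_col] = df[add_col + '_x'].add(df[add_col + '_y'], fill_value=0)
    # ratio column: old value plus new value divided by the new sum value
    df[ratio_col] = df[ratio_col + '_x'].add(
        df[ratio_col + '_y'].div(df[add_col + '_y']), fill_value=0)
df = df[['ID', 'a', 'b', 'c', 'd']]
print(df)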
