Adding additional text to the hovertext label - python-3.x

I've searched for some time now and I can't seem to find a related question. There are similar questions, but nothing that gets to the heart of what I am trying to do with the code I have.
I am trying to add additional text to the hovertext in plotly. Here is my code so far:
import pandas as pd
import numpy as np
import cufflinks  # the DataFrame .iplot() method used below comes from cufflinks
from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True)

graph1 = merged.groupby(['study_arm', 'visit_label'])['mjsn'].mean().unstack('study_arm')
graph1.iplot(mode='lines+markers',
             symbol=['diamond-open', 'square-open', 'circle-dot', 'hexagon-open'],
             size=8, colorscale='dark2', yTitle='ytitle', xTitle='xtitle',
             title='Title of Graph',
             hoverformat='.2f')

hv = merged.groupby(['study_arm', 'visit_label']).size()
print(hv)
Note: 'merged' is the dataframe and is shown in the sample data below.
The code above gives the following output (note: hovering over a timepoint shows some information for each trace; I took a picture to show what that looks like).
My question is: how can I get the subject count from the table into the hovertext of each trace at each timepoint, preferably on a second line that reads 'N=x', where x is the subject count from the table under the graph in the picture?
Here is a sample of the dummy dataset I used to create this graph and table:
subject_number visit_label mjsn study_arm
20001 Day 1 0 B
20001 Month 06 0.4 B
20001 Month 12 0.2 B
20003 Day 1 0 B
20003 Month 06 -0.9 B
20003 Month 12 -0.7 B
20005 Day 1 0 C
20005 Month 06 0.1 C
20005 Month 12 -0.1 C
20007 Day 1 0 D
20007 Month 06 0 D
20007 Month 12 -0.3 D
20008 Day 1 0 C
20008 Month 06 -0.3 C
20008 Month 12 -0.1 C
20010 Day 1 0 A
20010 Month 06 -0.6 A
20010 Month 12 -0.4 A

You want to set the text or hovertext element for each trace; both will work here, and the reason you may need both can be seen here. You may also want to change the hoverinfo element, which accepts any '+'-joined combination of 'x', 'y', 'z', 'text', and 'name', or one of 'all', 'none', and 'skip'. Additional resources are: text and annotations, docs, and python example. In addition, to get the count of cases at each timepoint, I ran two different groupby aggregations (mean and count) and concatenated the results.
Example using your dataframe:
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)
import pandas as pd

df = pd.DataFrame({
    'subject_number': [20001, 20001, 20001, 20003, 20003, 20003, 20005, 20005,
                       20005, 20007, 20007, 20007, 20008, 20008, 20008, 20010,
                       20010, 20010],
    'visit_label': ['Day 1', 'Month 6', 'Month 12', 'Day 1', 'Month 6',
                    'Month 12', 'Day 1', 'Month 6', 'Month 12', 'Day 1',
                    'Month 6', 'Month 12', 'Day 1', 'Month 6', 'Month 12',
                    'Day 1', 'Month 6', 'Month 12'],
    'mjsn': [0, 0.4, 0.2, 0, -0.9, -0.7, 0, 0.1, -0.1, 0, 0, -0.3, 0, -0.3,
             -0.1, 0, -0.6, -0.4],
    'study_arm': ['B', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'D', 'D', 'D',
                  'C', 'C', 'C', 'A', 'A', 'A']
})

# mean mjsn and subject count per (study_arm, visit_label)
grouped = df.groupby(['study_arm', 'visit_label'])
tmp1 = grouped.mean()["mjsn"]
tmp2 = grouped.count()["subject_number"]
output_df = pd.concat([tmp1, tmp2], axis=1)

# one trace per study arm, with the subject count on the second hovertext line
data = []
for study in output_df.index.get_level_values(0).unique():
    trace = go.Scatter(
        x=output_df.loc[study, :].index,
        y=output_df.loc[study, "mjsn"],
        hovertext=["mjsn:{0}<br>subject:{1}".format(x, int(y))
                   for x, y in zip(output_df.loc[study, "mjsn"],
                                   output_df.loc[study, "subject_number"])],
        mode='lines+markers',
        hoverinfo='text'
    )
    data += [trace]

iplot(data)
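If you also want the title and axis labels from the cufflinks call in the question, one option (a sketch, not part of the original answer) is to wrap the traces in a Figure with an explicit Layout:
layout = go.Layout(title='Title of Graph',
                   xaxis=dict(title='xtitle'),
                   yaxis=dict(title='ytitle'))
iplot(go.Figure(data=data, layout=layout))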
Similar SO questions are here and here

Related

Pandas Dataframe: Reduce the value of 'Days' by 1 if the corresponding 'Year' is a leap year

If 'Days' is greater than e.g. 10 and the corresponding 'Year' is a leap year, then reduce 'Days' by 1, but only in that particular row. I tried some operations but couldn't do it; I am new to pandas and would appreciate any help.
sample data:
data = [['1', '2005'], ['2', '2006'], ['3', '2008'], ['50', '2009'], ['70', '2008']]
df = pd.DataFrame(data, columns=['Days', 'Year'])
I want 'Days' of row 5 to become 69 and everything else to remain the same.
In [98]: import calendar

In [99]: data = [['1', '2005'], ['2', '2006'], ['3', '2008'], ['50', '2009'], ['70', '2008']]
    ...: df = pd.DataFrame(data, columns=['Days', 'Year'])

In [100]: df = df.astype(int)

In [102]: df["New_Days"] = df.apply(lambda x: x["Days"] - 1 if (x["Days"] > 10 and calendar.isleap(x["Year"])) else x["Days"], axis=1)

In [103]: df
Out[103]:
   Days  Year  New_Days
0     1  2005         1
1     2  2006         2
2     3  2008         3
3    50  2009        50
4    70  2008        69
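The same logic can also be vectorized with a boolean mask instead of a row-wise apply (a sketch under the same assumptions, not from the original answer):
import calendar
import pandas as pd

data = [['1', '2005'], ['2', '2006'], ['3', '2008'], ['50', '2009'], ['70', '2008']]
df = pd.DataFrame(data, columns=['Days', 'Year']).astype(int)

# True where Days > 10 and the year is a leap year
mask = (df['Days'] > 10) & df['Year'].map(calendar.isleap)

# subtract 1 only on the masked rows (True casts to 1, False to 0)
df['New_Days'] = df['Days'] - mask.astype(int)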

pandas: move values to the corresponding column based on the value of another column

I'm trying to move the f1_am, f2_am, and f3_am values to the corresponding column based on the values of f1_ty, f2_ty, and f3_ty.
I started by adding new columns to the dataframe for the unique values of the _ty columns using sets, but I'm still trying to figure out how to move the _am values to where they belong.
I looked at the groupby and pivot options, but the results were baffling.
I would appreciate some guidance.
Below is the code.
import pandas as pd
import numpy as np

data = {
    'mem_id': ['A', 'B', 'C', 'A', 'B', 'C'],
    'date_inf': ['01/01/2019', '01/01/2019', '01/01/2019', '02/01/2019',
                 '02/01/2019', '02/01/2019'],
    'f1_ty': ['ABC', 'ABC', 'ABC', 'ABC', 'GHI', 'GHI'],
    'f1_am': [100, 20, 57, 44, 15, 10],
    'f2_ty': ['DEF', 'DEF', 'DEF', 'GHI', 'ABC', 'XYZ'],
    'f2_am': [20, 30, 45, 66, 14, 21],
    'f3_ty': ['XYZ', 'GHI', 'OPQ', 'OPQ', 'XYZ', 'DEF'],
    'f3_am': [20, 30, 45, 66, 14, 21]
}
df = pd.DataFrame(data)

# distinct values across the _ty columns, using sets
distinct_values = sorted(list(set(df['f1_ty']) | set(df['f2_ty']) | set(df['f3_ty'])))

# add the distinct values as new columns in the DataFrame
new_df = df.reindex(columns=np.append(df.columns.values, distinct_values))
So this would be my starting point and my desired result.
Here is a try (thanks for the interesting problem): rename the columns to make them compatible with wide_to_long(), then unstack() while dropping the extra levels:
# move the f-number to the end of each name: f1_ty -> tyf1, f1_am -> amf1
m = df.set_index(['mem_id', 'date_inf']).rename(columns=lambda x: ''.join(x.split('_')[::-1]))

# reshape the ty/am pairs to long form, then pivot the ty values out to columns
n = (pd.wide_to_long(m.reset_index(), ['tyf', 'amf'], ['mem_id', 'date_inf'], 'v')
       .droplevel(-1).set_index('tyf', append=True).unstack(fill_value=0)
       .reindex(m.index))

final = n.droplevel(0, axis=1).rename_axis(None, axis=1).reset_index()
print(final)
  mem_id    date_inf  ABC  DEF  GHI  OPQ  XYZ
0      A  01/01/2019  100   20    0    0   20
1      B  01/01/2019   20   30   30    0    0
2      C  01/01/2019   57   45    0   45    0
3      A  02/01/2019   44    0   66   66    0
4      B  02/01/2019   14    0   15    0   14
5      C  02/01/2019    0   21   10    0   21
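A more explicit route to the same table (a sketch, not from the original answer) is to stack the three ty/am pairs into long form with concat and pivot back to wide; note the rows come out sorted by mem_id and date_inf rather than in the original order:
import pandas as pd

# stack the three (ty, am) pairs on top of each other
long_parts = [
    df[['mem_id', 'date_inf', f'f{i}_ty', f'f{i}_am']]
      .rename(columns={f'f{i}_ty': 'ty', f'f{i}_am': 'am'})
    for i in (1, 2, 3)
]
long = pd.concat(long_parts, ignore_index=True)

# pivot the ty values out to columns; each (mem_id, date_inf, ty) is unique
# here, so pivot_table's default aggregation never actually combines rows
wide = (long.pivot_table(index=['mem_id', 'date_inf'], columns='ty',
                         values='am', fill_value=0)
            .rename_axis(None, axis=1)
            .reset_index())
print(wide)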

find out percentage of duplicates

I have the following data:
id date A Area Price Hol
0 1 2019-01-01 No 80 200 No
1 2 2019-01-02 Yes 100 300 Yes
2 3 2019-01-03 Yes 100 300 Yes
3 4 2019-01-04 No 50 100 No
4 5 2019-01-05 No 20 50 No
5 1 2019-01-01 No 80 200 No
I want to find out duplicates (for the same id).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'id': [1, 2, 3, 4, 5, 1],
                   'date': ['2019-01-01', '2019-01-02', '2019-01-03',
                            '2019-01-04', '2019-01-05', '2019-01-01'],
                   'A': ['No', 'Yes', 'Yes', 'No', 'No', 'No'],
                   'Area': [80, 100, 100, 50, 20, 80],
                   'Price': [200, 300, 300, 100, 50, 200],
                   'Hol': ['No', 'Yes', 'Yes', 'No', 'No', 'No']})
df['date'] = pd.to_datetime(df['date'])
fig, ax = plt.subplots(figsize=(15, 7))
df.groupby(['A', 'Area', 'Price', 'Hol'])['id'].value_counts().plot(ax=ax)
I can see that I have one duplicate (for id 1, all the entries are the same).
Now, I want to find out what percentage those duplicates represent in the whole dataset.
I can't find a way to express this, since I am already using value_counts() to find the duplicates, and I can't do something like:
df.groupby(['A', 'Area', 'Price', 'Hol'])['id'].value_counts().size()
percentage = (test / test.groupby(level=0).sum()) * 100
I believe you need DataFrame.duplicated with Series.value_counts:
percentage = df.duplicated(keep=False).value_counts(normalize=True) * 100
print (percentage)
False    66.666667
True     33.333333
dtype: float64
Is duplicated what you need?
df.duplicated(keep=False).mean()
Out[107]: 0.3333333333333333
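Note that keep=False flags every occurrence of a duplicated row, so both id 1 rows count here (2 of 6, i.e. 33.3%). If you only want the redundant copies, the default keep='first' marks just the repeats (a quick check, not part of the original answers):
df.duplicated().mean()  # 1/6 ≈ 0.1667: one redundant row out of six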

Python: create new column and copy value from other row which is a swap of current row

I have a dataframe which has 3 columns:
import pandas as pd
d = {'A': ['left', 'right', 'east', 'west', 'south', 'north'],
     'B': ['right', 'left', 'west', 'east', 'north', 'south'],
     'VALUE': [0, 1, 2, 3, 4, 5]}
df = pd.DataFrame(d)
Dataframe looks like this:
A B VALUE
left right 0
right left 1
east west 2
west east 3
south north 4
north south 5
I am trying to create a new column VALUE_2 which should contain the value from the swapped row in the same Dataframe.
Eg: right - left value is 0, left - right value is 1 and I want the swapped values in the new column like this:
A B VALUE VALUE_2
left right 0 1
right left 1 0
east west 2 3
west east 3 2
south north 4 5
north south 5 4
I tried:
for row_num, record in df.iterrows():
    A = df['A'][row_num]
    B = df['B'][row_num]
    if pd.Series([record['A'] == B, record['B'] == A]).all():
        df['VALUE_2'] = df['VALUE']
I'm stuck here; inputs will be highly appreciated.
Use map with a Series built by set_index('B'), so each value in A looks up the VALUE of the row where it appears in B:
df['VALUE_2'] = df['A'].map(df.set_index('B')['VALUE'])
print (df)
A B VALUE VALUE_2
0 left right 0 1
1 right left 1 0
2 east west 2 3
3 west east 3 2
4 south north 4 5
5 north south 5 4
Just a more verbose answer:
import pandas as pd

d = {'A': ['left', 'right', 'east', 'west', 'south', 'north'],
     'B': ['right', 'left', 'west', 'east', 'north', 'south'],
     'VALUE': [0, 1, 2, 3, 4, 5]}
df = pd.DataFrame(d)

pdf = pd.DataFrame([])
for idx, item in df.iterrows():
    # position of the swapped row: where column B equals this row's A
    indx = list(df['B']).index(str(df['A'][idx]))
    # df.iloc[indx][2] is the VALUE of the swapped row
    # (note: DataFrame.append was removed in pandas 2.0; use pd.concat there)
    pdf = pdf.append(pd.DataFrame({'VALUE_2': df.iloc[indx][2]}, index=[0]),
                     ignore_index=True)
print(pdf)

data = pd.concat([df, pdf], axis=1)
print(data)
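Another vectorized option (a sketch, assuming every (A, B) pair has exactly one swapped counterpart) is a self-merge on the swapped key columns:
# match each row's (A, B) against every row's (B, A)
swapped = df.merge(df, how='left', left_on=['A', 'B'], right_on=['B', 'A'],
                   suffixes=('', '_2'))
df['VALUE_2'] = swapped['VALUE_2']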

Including NaN values in function applied to Pandas GroupBy object

I would like to calculate the mean of replicate measurements and return NaN when one or both replicates have an NaN value. I am aware that groupby excludes NaN values, but it took me some time to realize apply was doing the same thing. Below is an example of my code. It only returns NaN when both replicates have missing data; in this example I would like it to return NaN for Sample 1, Assay 2. Instead, it behaves as if I had applied np.nanmean and returns the one non-NaN element, 27.0. Any ideas on a strategy to include NaN values in the function I am applying?
In[4]: import pandas as pd
In[5]: import numpy as np
In[6]: df = pd.DataFrame({'Sample ID': ['Sample 1', 'Sample 1', 'Sample 1', 'Sample 1',
                                        'Sample 2', 'Sample 2', 'Sample 2', 'Sample 2'],
                          'Assay': ['Assay 1', 'Assay 1', 'Assay 2', 'Assay 2',
                                    'Assay 1', 'Assay 1', 'Assay 2', 'Assay 2'],
                          'Replicate': [1, 2, 1, 2, 1, 2, 1, 2],
                          'Value': [34.0, 30.0, 27.0, np.nan, 16.0, 18.0, np.nan, np.nan]})
In[7]: df
Out[7]:
  Sample ID    Assay  Replicate  Value
0  Sample 1  Assay 1          1   34.0
1  Sample 1  Assay 1          2   30.0
2  Sample 1  Assay 2          1   27.0
3  Sample 1  Assay 2          2    NaN
4  Sample 2  Assay 1          1   16.0
5  Sample 2  Assay 1          2   18.0
6  Sample 2  Assay 2          1    NaN
7  Sample 2  Assay 2          2    NaN
In[9]: Group = df.groupby(['Sample ID', 'Assay'])
In[10]: df2 = Group['Value'].aggregate(np.mean).unstack()
In[11]: df2
Out[11]:
Assay      Assay 1  Assay 2
Sample ID
Sample 1      32.0     27.0
Sample 2      17.0      NaN
I think the issue is in the conversion that happens when the mean function executes.
From the documentation:
Array containing numbers whose mean is desired. If a is not an array,
a conversion is attempted.
I was able to make it work by doing the conversion manually, defining a function that calls np.mean on the underlying array:
def aggregate_func(serie):
    return np.mean(serie.values)
and using that function on the aggregate call like so:
df2 = Group['Value'].aggregate(aggregate_func).unstack()
Another option is np.average, which behaves the same as np.mean if you don't provide the optional weights parameter, but for which the conversion works as expected.
Using it gave me the expected result:
import pandas as pd
import numpy as np

df = pd.DataFrame({'Sample ID': ['Sample 1', 'Sample 1', 'Sample 1', 'Sample 1',
                                 'Sample 2', 'Sample 2', 'Sample 2', 'Sample 2'],
                   'Assay': ['Assay 1', 'Assay 1', 'Assay 2', 'Assay 2',
                             'Assay 1', 'Assay 1', 'Assay 2', 'Assay 2'],
                   'Replicate': [1, 2, 1, 2, 1, 2, 1, 2],
                   'Value': [34.0, 30.0, 27.0, np.nan, 16.0, 18.0, np.nan, np.nan]})
Group = df.groupby(['Sample ID', 'Assay'])
df2 = Group['Value'].aggregate(np.average).unstack()
results in
Assay      Assay 1  Assay 2
Sample ID
Sample 1      32.0      NaN
Sample 2      17.0      NaN
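For completeness, pandas' own Series.mean accepts a skipna flag, so the same result can be had without going through NumPy (a sketch, not part of the original answers):
# skipna=False propagates NaN instead of silently dropping it
df2 = Group['Value'].agg(lambda s: s.mean(skipna=False)).unstack()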
