I am attempting to create a heatmap and retain only the top 5 samples based on the mean of the relative abundance. I am able to sort the heatmap properly, but I can't figure out how to retain only the top 5, in this case samples c, e, b, y, a. I am pasting a subset of the df with the image. I've tried myriad permutations of the "Top K Items Tutorial" link at the altair-viz website. I'd prefer to use altair to do the filtering if possible, as opposed to filtering the df itself in the python code.
Dataframe:
,Lowest_Taxon,p1,p2,p3,p4,p5,p6,p7
0,a,0.03241281,0.0,0.467738067,3.14456785,0.589519651,13.5744323,0.0
1,b,0.680669,9.315121951,2.848404893,13.99058458,2.139737991,16.60779366,7.574639383
2,c,40.65862829,1.244878049,71.01223315,4.82197541,83.18777293,0.0,0.0
3,d,0.0,0.0,0.0,0.548471137,0.272925764,0.925147183,0.0
4,e,0.090755867,13.81853659,5.205085152,27.75721011,1.703056769,19.6691898,12.27775914
5,f,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,g,0.187994295,0.027317073,0.0,0.0,0.0,0.02242781,0.0
7,h,0.16854661,0.534634146,1.217318302,7.271813154,1.73580786,0.57751612,0.57027843
8,i,0.142616362,2.528780488,1.163348525,0.34279446,0.0,0.0,0.0
9,j,1.711396344,0.694634146,0.251858959,4.273504274,0.087336245,1.721334455,0.899027172
10,k,0.0,1.475121951,0.0,0.0,0.0,5.573310906,0.0
11,l,0.194476857,0.253658537,1.517150396,2.413273002,0.949781659,5.147182506,1.650452868
12,m,0.0,1.736585366,0.0,0.063988299,0.0,8.42724979,0.623951694
13,n,0.0,0.0,0.0,0.0,0.0,0.0,0.0
14,o,4.68689226,0.12097561,0.0,0.0,0.0,0.0,0.0
15,p,0.0,0.885853659,0.0,0.0,0.0,0.913933277,0.046964106
16,q,0.252819914,0.050731707,0.023986568,0.0,0.087336245,0.0,0.0
17,r,0.0,0.0,0.0,0.0,0.0,0.0,0.0
18,s,0.0,0.0,0.0,0.0,0.0,0.0,0.0
19,t,0.0,0.0,0.0,0.0,0.0,0.0,0.0
20,u,0.0,0.058536585,0.089949628,0.356506239,0.0,0.285954584,1.17410265
21,v,0.0,0.0,0.0,0.0,0.0,0.0,0.0
22,w,0.0,0.0,0.0,0.0,0.0,0.0,0.0
23,x,1.471541553,2.396097561,0.593667546,0.278806161,0.065502183,0.280347631,0.952700436
24,y,0.0,0.32,0.0,0.461629873,0.0,7.804878049,18.38980208
25,z,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Code block:
import pandas as pd
import numpy as np
import altair as alt
from vega_datasets import data
from altair_saver import save
# Read in the file and fill empty cells with zero
# (raw string so that "\t" in the path is not read as a tab character)
df = pd.read_excel(r"path\to\df").fillna(0)
doNotMelt = df.drop(df.iloc[:,1:], axis=1)
df_melted = pd.melt(df, id_vars = doNotMelt, var_name = 'SampleID', value_name = 'Relative_abundance')
# Tell altair to plot as many rows as is necessary
alt.data_transformers.disable_max_rows()
alt.Chart(df_melted).mark_rect().encode(
    alt.X('SampleID:N'),
    alt.Y('Lowest_Taxon:N', sort=alt.EncodingSortField(field='Relative_abundance', op='mean', order='descending')),
    alt.Color('Relative_abundance:Q')
)
If you know what you want to show is the entries with c, e, b, y and a (and it will not change later) you could simply apply a transform_filter on the field Lowest_Taxon.
If you want to calculate on the spot which ones make it into the top five, it needs a bit more effort, i.e. a combination of joinaggregate, window and filter transforms.
I paste an example of each below. By the way, I converted the original data that you pasted into a csv file, which the code snippets import. You can make it easier for others to use your pandas toy data by providing it as a dict, which can then be read directly in the code.
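As a minimal sketch of the dict suggestion: the first three rows and first three sample columns of the question's table, written as a plain dict. The values are copied from the question and the subset is chosen only for brevity. Passing the column name to id_vars directly is an alternative to the doNotMelt helper, not what the snippets in this thread do.

```python
import pandas as pd

# Subset of the question's table (taxa a-c, samples p1-p3), written as a dict
# so the example can be pasted and run without an external file.
toy = {
    "Lowest_Taxon": ["a", "b", "c"],
    "p1": [0.03241281, 0.680669, 40.65862829],
    "p2": [0.0, 9.315121951, 1.244878049],
    "p3": [0.467738067, 2.848404893, 71.01223315],
}
df = pd.DataFrame(toy)

# Long format: one row per (taxon, sample) pair, as the heatmap expects.
df_melted = df.melt(
    id_vars="Lowest_Taxon",
    var_name="SampleID",
    value_name="Relative_abundance",
)
print(df_melted.shape)  # (9, 3)
```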
Simple approach:
import pandas as pd
import altair as alt
import numpy as np
alt.data_transformers.disable_max_rows()
df = pd.read_csv('df.csv', index_col=0)
doNotMelt = df.drop(df.iloc[:,1:], axis=1)
df_melted = pd.melt(df, id_vars = doNotMelt, var_name = 'SampleID', value_name = 'Relative_abundance')
alt.Chart(df_melted).mark_rect().encode(
    alt.X('SampleID:N'),
    alt.Y('Lowest_Taxon:N', sort=alt.EncodingSortField(field='Relative_abundance', op='mean', order='descending')),
    alt.Color('Relative_abundance:Q')
).transform_filter(
    alt.FieldOneOfPredicate(field='Lowest_Taxon', oneOf=['c', 'e', 'b', 'y', 'a'])
)
Flexible approach:
Set n to the number of top entries you want to see.
import pandas as pd
import altair as alt
import numpy as np
alt.data_transformers.disable_max_rows()
df = pd.read_csv('df.csv', index_col=0)
doNotMelt = df.drop(df.iloc[:,1:], axis=1)
df_melted = pd.melt(df, id_vars = doNotMelt, var_name = 'SampleID', value_name = 'Relative_abundance')
n = 5 # number of entries to display
alt.Chart(df_melted).mark_rect().encode(
    alt.X('SampleID:N'),
    alt.Y('Lowest_Taxon:N', sort=alt.EncodingSortField(field='Relative_abundance', op='mean', order='descending')),
    alt.Color('Relative_abundance:Q')
).transform_joinaggregate(
    # attach each taxon's mean and its number of valid samples to every row
    mean_rel_ab = 'mean(Relative_abundance)',
    count_of_samples = 'valid(Relative_abundance)',
    groupby = ['Lowest_Taxon']
).transform_window(
    # rank all rows by the taxon mean; rows of the same taxon tie
    rank='rank(mean_rel_ab)',
    sort=[alt.SortField('mean_rel_ab', order='descending')],
    frame = [None, None]
).transform_filter(
    # keep the rows of the n best-ranked taxa
    (alt.datum.rank <= (n-1) * alt.datum.count_of_samples + 1)
)
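To see where the threshold (n-1) * count_of_samples + 1 comes from: after the joinaggregate, every row of a taxon carries the same mean, so the rows of one taxon tie and the next taxon's rank jumps ahead by the group size. With 7 samples per taxon the ranks are 1, 8, 15, 22, 29, and so on. The pandas sketch below reproduces that arithmetic outside Altair. It assumes that the rank window operation behaves like pandas' method='min' ranking, and it uses made-up values, so treat it as an illustration of the idea rather than of Altair's internals.

```python
import pandas as pd

# Made-up long-format data: 4 taxa x 3 samples (illustrative values only).
long_df = pd.DataFrame({
    "Lowest_Taxon": list("aaabbbcccddd"),
    "Relative_abundance": [1, 2, 3, 10, 20, 30, 5, 5, 5, 0, 0, 0],
})
n = 2  # number of taxa to keep

grouped = long_df.groupby("Lowest_Taxon")["Relative_abundance"]
# Same role as transform_joinaggregate: broadcast group statistics to each row.
long_df["mean_rel_ab"] = grouped.transform("mean")
long_df["count_of_samples"] = grouped.transform("count")

# Same role as transform_window with rank: tied rows share the lowest rank.
long_df["rank"] = long_df["mean_rel_ab"].rank(method="min", ascending=False)

# Same threshold as the transform_filter expression.
top = long_df[long_df["rank"] <= (n - 1) * long_df["count_of_samples"] + 1]
print(sorted(top["Lowest_Taxon"].unique()))  # ['b', 'c']
```

Note that the threshold only selects exactly n taxa when every taxon has the same number of valid samples, which holds for the data in the question.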
I want to create a loop that pulls data from Google Trends via PyTrends. I need to go through a lot of keywords, but Google Trends only allows comparing five keywords at a time, so I iterate through the keywords one by one and collect the results in a pandas dataframe. However, something is off.
I do get data, but in the resulting dataframe the values are shifted into different rows and padded with "NaN" values.
Instead of 62 rows I get 372 rows (mostly "NaN").
from pytrends.request import TrendReq
import pandas as pd
pytrend = TrendReq()
kw_list = ['cool', 'fun', 'big','house', 'phone', 'garden']
df1 = pd.DataFrame()
for i in kw_list:
    kw_list = i
    pytrend.build_payload([kw_list], timeframe='2015-10-14 2015-12-14', geo='FR')
    df1 = df1.append(pytrend.interest_over_time())
print(df1.head())
I want to have one coherent dataframe with the columns 'cool', 'fun', 'big', 'house', 'phone', 'garden' and their respective values in each column on the same row, e.g. a dataframe with 62 rows and 6 columns.
I'm probably going to lose some rep points because this is an old question (oldish, anyway), but I was struggling with the same problem and I solved it like this:
import pytrends
import pandas as pd
from pytrends.request import TrendReq
pytrend = TrendReq()
kw_list = ['cool', 'fun', 'big','house', 'phone', 'garden']
df_gtrends_kw = {}
df_gtrends = pd.DataFrame()
for kw in kw_list:
    pytrend.build_payload(kw_list = [kw], timeframe='today 12-m')
    df_gtrends_kw[kw] = pytrend.interest_by_region(resolution='COUNTRY')
df_gtrends = pd.concat([df_gtrends_kw[key] for key in kw_list], join = 'inner', axis = 1)
According to the official doc, one has to specify the axis along which the dataframes are glued together; in this case the columns (axis = 1), since every dataframe has the same index.
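The row count in the question follows from stacking: six single-keyword frames of 62 rows each give 372 rows, with NaN wherever a row belongs to a different keyword. The sketch below uses synthetic frames as stand-ins for the pytrends results (no request is made; fake_interest and its random values are invented for illustration) to show the difference between stacking rows and joining columns.

```python
import numpy as np
import pandas as pd

kw_list = ["cool", "fun", "big", "house", "phone", "garden"]
# 2015-10-14 through 2015-12-14 inclusive is 62 days.
dates = pd.date_range("2015-10-14", "2015-12-14", freq="D", name="date")
rng = np.random.default_rng(0)

def fake_interest(kw):
    """Hypothetical stand-in for one single-keyword result frame."""
    return pd.DataFrame({kw: rng.integers(0, 101, len(dates))}, index=dates)

frames = [fake_interest(kw) for kw in kw_list]

# Stacking rows (what appending in a loop does): 6 * 62 rows, mostly NaN.
stacked = pd.concat(frames, axis=0)
print(stacked.shape)  # (372, 6)

# Joining columns on the shared date index: one row per date.
wide = pd.concat(frames, axis=1, join="inner")
print(wide.shape)  # (62, 6)
```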
Is it possible to add Tooltips to a Timeseries chart?
In the simplified code example below, I want to see a single column name ('a','b' or 'c') when the mouse hovers over the relevant line.
Instead, a "???" is displayed and ALL three lines get a tooltip (rather than just the one I'm hovering over).
Per the documentation (http://docs.bokeh.org/en/latest/docs/user_guide/tools.html#hovertool), field names starting with "@" are interpreted as columns on the data source.
How can I display the 'columns' from a pandas DataFrame in the tooltip?
Or, if the high level TimeSeries interface doesn't support this, any clues for using the lower level interfaces to do the same thing? (line? multi_line?) or convert the DataFrame into a different format (ColumnDataSource?)
For bonus credit, how should the "$x" be formatted to display the date as a date?
thanks in advance
import pandas as pd
import numpy as np
from bokeh.charts import TimeSeries
from bokeh.models import HoverTool
from bokeh.plotting import show
toy_df = pd.DataFrame(data=np.random.rand(5,3), columns = ('a', 'b' ,'c'), index = pd.DatetimeIndex(start='01-01-2015',periods=5, freq='d'))
p = TimeSeries(toy_df, tools='hover')
hover = p.select(dict(type=HoverTool))
hover.tooltips = [
    ("Series", "@columns"),
    ("Date", "$x"),
    ("Value", "$y"),
]
show(p)
Below is what I came up with.
It's not pretty but it works.
I'm still new to Bokeh (and Python, for that matter), so if anyone wants to suggest a better way to do this, please feel free.
import pandas as pd
import numpy as np
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.plotting import figure, show
toy_df = pd.DataFrame(data=np.random.rand(5,3), columns = ('a', 'b' ,'c'), index = pd.DatetimeIndex(start='01-01-2015',periods=5, freq='d'))
_tools_to_show = 'box_zoom,pan,save,hover,resize,reset,tap,wheel_zoom'
p = figure(width=1200, height=900, x_axis_type="datetime", tools=_tools_to_show)
# FIRST plot ALL lines (This is a hack to get it working, why can't i pass in a dataframe to multi_line?)
# It's not pretty but it works.
# what I want to do!: p.multi_line(df)
ts_list_of_list = []
for i in range(0,len(toy_df.columns)):
    ts_list_of_list.append(toy_df.index.T)
vals_list_of_list = toy_df.values.T.tolist()
# Define colors because otherwise multi_line will use blue for all lines...
cols_to_use = ['Black', 'Red', 'Lime']
p.multi_line(ts_list_of_list, vals_list_of_list, line_color=cols_to_use)
# THEN put scatter one at a time on top of each one to get tooltips (HACK! lines with tooltips not yet supported by Bokeh?)
for (name, series) in toy_df.iteritems():
    # need to repmat the name to be same dimension as index
    name_for_display = np.tile(name, [len(toy_df.index),1])
    source = ColumnDataSource({'x': toy_df.index, 'y': series.values, 'series_name': name_for_display, 'Date': toy_df.index.format()})
    # trouble formatting x as a date string, so pre-formatting and using an extra column. It's not pretty but it works.
    p.scatter('x', 'y', source = source, fill_alpha=0, line_alpha=0.3, line_color="grey")
hover = p.select(dict(type=HoverTool))
hover.tooltips = [("Series", "@series_name"), ("Date", "@Date"), ("Value", "@y{0.00%}"),]
hover.mode = 'mouse'
show(p)
I'm not familiar with pandas, so I just use Python lists to show a small example of how to add tooltips to multiple lines, show the series names, and display the date/time properly. Below is the result.
Thanks to @bs123's answer and @tterry's answer in Bokeh Plotting: Enable tooltips for only some glyphs
my result
# -*- coding: utf-8 -*-
from bokeh.plotting import figure, output_file, show, ColumnDataSource
from bokeh.models import HoverTool
from datetime import datetime
dateX_str = ['2016-11-14','2016-11-15','2016-11-16']
# convert the datetime strings to python datetime objects
dateX = [datetime.strptime(i, "%Y-%m-%d") for i in dateX_str]
v1= [10,13,5]
v2 = [8,4,14]
v3= [14,9,6]
v = [v1,v2,v3]
names = ['v1','v2','v3']
colors = ['red','blue','yellow']
output_file('example.html',title = 'example of add tooltips to multi_timeseries')
tools_to_show = 'hover,box_zoom,pan,save,resize,reset,wheel_zoom'
p = figure(x_axis_type="datetime", tools=tools_to_show)
# to show tooltips for multiple lines, use a ColumnDataSource to define the data source of each glyph
# the key is to use the same column names in the data source of every glyph,
# so the tooltip is added once to the figure instead of once per glyph
# plot each timeseries line glyph
for i in range(3):
    # bokeh can't show datetime objects in the tooltip properly, so we use strings instead
    source = ColumnDataSource(data={
        'dateX': dateX, # python datetime object as X axis
        'v': v[i],
        'dateX_str': dateX_str, # string of datetime for display in tooltip
        'name': [names[i] for n in range(3)]
    })
    p.line('dateX', 'v',source=source,legend=names[i],color = colors[i])
    circle = p.circle('dateX', 'v',source=source, fill_color="white", size=8, legend=names[i],color = colors[i])
    # to avoid some strange behavior (as shown in the picture at the end), only add the circle glyph to the renderers of the hover tool
    # so the tooltip only takes effect on the circle glyph
    p.tools[0].renderers.append(circle)
# show the tooltip
hover = p.select(dict(type=HoverTool))
hover.tooltips = [("value", "@v"), ("name", "@name"), ("date", "@dateX_str")]
hover.mode = 'mouse'
show(p)
Tooltips with some strange behavior: two tips displayed at the same time.
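The central idea in the comments of the code above is that every series gets its own data source but with identical column names, so one tooltip specification serves all lines. Below is a plain-Python sketch of just that data preparation, with no Bokeh calls (the helper name make_columns is made up for illustration). Each returned dict has the shape that would be handed to ColumnDataSource(data=...).

```python
from datetime import datetime

date_strs = ["2016-11-14", "2016-11-15", "2016-11-16"]
dates = [datetime.strptime(s, "%Y-%m-%d") for s in date_strs]
values = {"v1": [10, 13, 5], "v2": [8, 4, 14], "v3": [14, 9, 6]}

def make_columns(name, series):
    """Hypothetical helper: build one column dict per series.

    Every dict uses the same keys, so tooltip fields such as
    @v, @name and @dateX_str resolve for whichever line is hovered.
    """
    return {
        "dateX": dates,                # datetime objects for the x axis
        "v": series,                   # the y values
        "dateX_str": date_strs,        # preformatted dates for display
        "name": [name] * len(series),  # series name repeated per point
    }

sources = [make_columns(name, series) for name, series in values.items()]
print(sources[1]["name"])  # ['v2', 'v2', 'v2']
```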
Here is my solution. I inspected the data source of the glyph renderer to see which column names it contains. Then I used those names in the hover tooltips. You can see the resulting plot here.
import numpy as np
import pandas as pd
from bokeh.charts import TimeSeries
from bokeh.models import HoverTool
from bokeh.plotting import show
toy_df = pd.DataFrame(data=np.random.rand(5,3), columns = ('a', 'b' ,'c'), index = pd.DatetimeIndex(start='01-01-2015',periods=5, freq='d'))
# Bokeh displays dates as numbers, so convert them to strings to show them correctly
toy_df.index = toy_df.index.astype(str)
p = TimeSeries(toy_df, tools='hover')
# The next 3 lines inspect the column names on the glyph's data source, to reference them with @name in the hover tool
#glyph_renderers = p.select(dict(type=GlyphRenderer))
#bar_source = glyph_renderers[0].data_source
#print(bar_source.data) #Here we can inspect names to call on hover
hover = p.select(dict(type=HoverTool))
hover.tooltips = [
    ("Series", "@series"),
    ("Date", "@x_values"),
    ("Value", "@y_values"),
]
show(p)
The original poster's code doesn't work with the latest pandas (the DatetimeIndex constructor has changed), but HoverTool now supports a formatters attribute that lets you specify a format as a strftime string. Something like
fig.add_tool(HoverTool(
    tooltips=[
        ('time', '@index{%Y-%m-%d}')
    ],
    formatters={
        '@index': 'datetime'
    }
))
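On the pandas side, pd.date_range is the usual replacement for passing start, periods and freq to the DatetimeIndex constructor. The sketch below rebuilds the toy frame that way (same shape and column names as in the question; the seed is an arbitrary choice added for repeatability) and previews what the %Y-%m-%d pattern produces, using pandas' own strftime rather than Bokeh.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # arbitrary seed, only for repeatability
toy_df = pd.DataFrame(
    data=rng.random((5, 3)),
    columns=["a", "b", "c"],
    index=pd.date_range(start="2015-01-01", periods=5, freq="D"),
)

# The index holds real datetimes, which is what a 'datetime' formatter expects.
print(toy_df.index.dtype)

# The same strftime pattern as in the tooltip, applied with pandas.
print(toy_df.index.strftime("%Y-%m-%d").tolist())
```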