Altair/Vega-Lite heatmap: Filter top k - altair

I am attempting to create a heatmap and retain only the top 5 samples based on the mean of the relative abundance. I am able to sort the heatmap properly, but I can't figure out how to retain only the top 5, in this case samples c, e, b, y, a. I am pasting a subset of the df with the image. I've tried myriad permutations of the "Top K Items Tutorial" link at the altair-viz website. I'd prefer to use altair to do the filtering if possible, as opposed to filtering the df itself in the python code.
Dataframe:
,Lowest_Taxon,p1,p2,p3,p4,p5,p6,p7
0,a,0.03241281,0.0,0.467738067,3.14456785,0.589519651,13.5744323,0.0
1,b,0.680669,9.315121951,2.848404893,13.99058458,2.139737991,16.60779366,7.574639383
2,c,40.65862829,1.244878049,71.01223315,4.82197541,83.18777293,0.0,0.0
3,d,0.0,0.0,0.0,0.548471137,0.272925764,0.925147183,0.0
4,e,0.090755867,13.81853659,5.205085152,27.75721011,1.703056769,19.6691898,12.27775914
5,f,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,g,0.187994295,0.027317073,0.0,0.0,0.0,0.02242781,0.0
7,h,0.16854661,0.534634146,1.217318302,7.271813154,1.73580786,0.57751612,0.57027843
8,i,0.142616362,2.528780488,1.163348525,0.34279446,0.0,0.0,0.0
9,j,1.711396344,0.694634146,0.251858959,4.273504274,0.087336245,1.721334455,0.899027172
10,k,0.0,1.475121951,0.0,0.0,0.0,5.573310906,0.0
11,l,0.194476857,0.253658537,1.517150396,2.413273002,0.949781659,5.147182506,1.650452868
12,m,0.0,1.736585366,0.0,0.063988299,0.0,8.42724979,0.623951694
13,n,0.0,0.0,0.0,0.0,0.0,0.0,0.0
14,o,4.68689226,0.12097561,0.0,0.0,0.0,0.0,0.0
15,p,0.0,0.885853659,0.0,0.0,0.0,0.913933277,0.046964106
16,q,0.252819914,0.050731707,0.023986568,0.0,0.087336245,0.0,0.0
17,r,0.0,0.0,0.0,0.0,0.0,0.0,0.0
18,s,0.0,0.0,0.0,0.0,0.0,0.0,0.0
19,t,0.0,0.0,0.0,0.0,0.0,0.0,0.0
20,u,0.0,0.058536585,0.089949628,0.356506239,0.0,0.285954584,1.17410265
21,v,0.0,0.0,0.0,0.0,0.0,0.0,0.0
22,w,0.0,0.0,0.0,0.0,0.0,0.0,0.0
23,x,1.471541553,2.396097561,0.593667546,0.278806161,0.065502183,0.280347631,0.952700436
24,y,0.0,0.32,0.0,0.461629873,0.0,7.804878049,18.38980208
25,z,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Code block:
import pandas as pd
import numpy as np
import altair as alt
from vega_datasets import data
from altair_saver import save
# Read in the file and fill empty cells with zero
df = pd.read_excel(r"path\to\df")
doNotMelt = df.drop(df.iloc[:,1:], axis=1)
df_melted = pd.melt(df, id_vars = doNotMelt, var_name = 'SampleID', value_name = 'Relative_abundance')
# Tell altair to plot as many rows as is necessary
alt.data_transformers.disable_max_rows()
alt.Chart(df_melted).mark_rect().encode(
alt.X('SampleID:N'),
alt.Y('Lowest_Taxon:N', sort=alt.EncodingSortField(field='Relative_abundance', op='mean', order='descending')),
alt.Color('Relative_abundance:Q')
)

If you know what you want to show is the entries with c, e, b, y and a (and it will not change later) you could simply apply a transform_filter on the field Lowest_Taxon.
If you want to calculate on the spot which ones make it into the top five, it needs a bit more effort, i.e. a combination of joinaggregate, window and filter transforms.
I paste an example of each below. By the way, I converted the original data that you pasted into a csv file, which is imported by the code snippets. You can make it easier for others to use your pandas toy data by providing it as a dict, which can then be read directly in the code.
Simple approach:
import pandas as pd
import altair as alt
import numpy as np
alt.data_transformers.disable_max_rows()
df = pd.read_csv('df.csv', index_col=0)
doNotMelt = df.drop(df.iloc[:,1:], axis=1)
df_melted = pd.melt(df, id_vars = doNotMelt, var_name = 'SampleID', value_name = 'Relative_abundance')
alt.Chart(df_melted).mark_rect().encode(
alt.X('SampleID:N'),
alt.Y('Lowest_Taxon:N', sort=alt.EncodingSortField(field='Relative_abundance', op='mean', order='descending')),
alt.Color('Relative_abundance:Q')
).transform_filter(
alt.FieldOneOfPredicate(field='Lowest_Taxon', oneOf=['c', 'e', 'b', 'y', 'a'])
)
Flexible approach:
Set n to the number of top entries you want to see.
import pandas as pd
import altair as alt
import numpy as np
alt.data_transformers.disable_max_rows()
df = pd.read_csv('df.csv', index_col=0)
doNotMelt = df.drop(df.iloc[:,1:], axis=1)
df_melted = pd.melt(df, id_vars = doNotMelt, var_name = 'SampleID', value_name = 'Relative_abundance')
n = 5 # number of entries to display
alt.Chart(df_melted).mark_rect().encode(
alt.X('SampleID:N'),
alt.Y('Lowest_Taxon:N', sort=alt.EncodingSortField(field='Relative_abundance', op='mean', order='descending')),
alt.Color('Relative_abundance:Q')
).transform_joinaggregate(
mean_rel_ab = 'mean(Relative_abundance)',
count_of_samples = 'valid(Relative_abundance)',
groupby = ['Lowest_Taxon']
).transform_window(
rank='rank(mean_rel_ab)',
sort=[alt.SortField('mean_rel_ab', order='descending')],
frame = [None, None]
).transform_filter(
(alt.datum.rank <= (n-1) * alt.datum.count_of_samples + 1)
)
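The filter threshold in the flexible approach relies on how rank treats ties: every row of a taxon carries the same mean, so all of its rows share one rank and the next taxon's rank skips ahead by the number of samples. The sketch below reproduces that arithmetic in pandas on a small made-up table (the toy values are assumptions, not the data from the question), using rank(method="min") as a stand-in for the Vega-Lite rank operation.

```python
import pandas as pd

# Hypothetical toy data: 3 taxa x 2 samples, already in long ("melted") form.
toy = pd.DataFrame({
    "Lowest_Taxon": ["a", "a", "b", "b", "c", "c"],
    "SampleID": ["p1", "p2"] * 3,
    "Relative_abundance": [1.0, 3.0, 10.0, 20.0, 5.0, 7.0],
})
n = 2  # number of top taxa to keep

# Equivalent of transform_joinaggregate: attach per-taxon aggregates to every row.
grouped = toy.groupby("Lowest_Taxon")["Relative_abundance"]
toy["mean_rel_ab"] = grouped.transform("mean")
toy["count_of_samples"] = grouped.transform("count")

# Equivalent of transform_window with rank: tied rows share the lowest rank,
# so the taxa get ranks 1, 3, 5 when each taxon has 2 rows.
toy["rank"] = toy["mean_rel_ab"].rank(method="min", ascending=False)

# Equivalent of transform_filter: the n-th taxon has rank (n-1)*count + 1.
kept = toy[toy["rank"] <= (n - 1) * toy["count_of_samples"] + 1]
top_taxa = sorted(kept["Lowest_Taxon"].unique())
print(top_taxa)  # ['b', 'c']
```

Note that this threshold assumes every taxon has the same number of rows, which holds for the melted frame in the question.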

Related

Changing the values of a dict to lowercase (values are color codes) to be accepted as a color parameter in plotly.graph_objects

So, I'm trying to use the colors from the dictionary 'disaster_type' to draw the markers in a geoscatter plot depending on the type of disaster.
Basically, I want to represent the natural disasters in the graphic with their color codes, e.g. if it's a volcanic activity, paint it 'orange'. I also want to change the size of the marker depending on the magnitude of the disaster, but that's for another day.
here's the link of the dataset: https://www.kaggle.com/datasets/brsdincer/all-natural-disasters-19002021-eosdis
import plotly.graph_objects as go
import pandas as pd
import plotly as plt
df = pd.read_csv('1900_2021_DISASTERS - main.csv')
df.head()
df.tail()
disaster_set = {disaster for disaster in df['Disaster Type']}
disaster_type = {'Storm':'aliceblue',
'Volcanic activity':'orange',
'Flood':'royalblue',
'Mass movement (dry)':'darkorange',
'Landslide':'#C76114',
'Extreme temperature':'#FF0000',
'Animal accident':'gray55',
'Glacial lake outburst':'#7D9EC0',
'Earthquake':'#CD8C95',
'Insect infestation':'#EEE8AA',
'Wildfire':' #FFFF00',
'Fog':'#00E5EE',
'Drought':'#FFEFD5',
'Epidemic':'#00CD66 ',
'Impact':'#FF6347'}
# disaster_type_lower = {(k, v.lower()) for k, v in disaster_type.items()}
# print(disaster_type_lower)
# for values in disaster_type.values():
# disaster_type[values] = disaster_type.lowercase()
fig = go.Figure(data=go.Scattergeo(
lon = df['Longitude'],
lat = df['Latitude'],
text = df['Country'],
mode = 'markers',
marker_color = disaster_type_.values()
)
)
fig.show()
I can't figure out how. I've left in comments after the dict how I tried to do that.
It changes them to lowercase, but now I don't know how to get them... My brain is completely melted.
It's a simple case of pandas map.
I found data on Kaggle that appears to be the same as yours, so I have used that.
One type (Extreme temperature) is unmapped, so I used fillna("red") to avoid any errors.
gray55 gave me an error, so I replaced it with its RGB equivalent.
import kaggle.cli
import sys
import pandas as pd
from zipfile import ZipFile
import urllib.parse
import plotly.graph_objects as go
# fmt: off
# download data set
url = "https://www.kaggle.com/brsdincer/all-natural-disasters-19002021-eosdis"
sys.argv = [sys.argv[0]] + f"datasets download {urllib.parse.urlparse(url).path[1:]}".split(" ")
kaggle.cli.main()
zfile = ZipFile(f'{urllib.parse.urlparse(url).path.split("/")[-1]}.zip')
dfs = {f.filename: pd.read_csv(zfile.open(f)) for f in zfile.infolist()}
# fmt: on
df = dfs["DISASTERS/1970-2021_DISASTERS.xlsx - emdat data.csv"]
disaster_type = {
"Storm": "aliceblue",
"Volcanic activity": "orange",
"Flood": "royalblue",
"Mass movement (dry)": "darkorange",
"Landslide": "#C76114",
"Extreme temperature": "#FF0000",
"Animal accident": "#8c8c8c", # gray55
"Glacial lake outburst": "#7D9EC0",
"Earthquake": "#CD8C95",
"Insect infestation": "#EEE8AA",
"Wildfire": " #FFFF00",
"Fog": "#00E5EE",
"Drought": "#FFEFD5",
"Epidemic": "#00CD66 ",
"Impact": "#FF6347",
}
fig = go.Figure(
data=go.Scattergeo(
lon=df["Longitude"],
lat=df["Latitude"],
text=df["Country"],
mode="markers",
marker_color=df["Disaster Type"].map(disaster_type).fillna("red"),
)
)
fig.show()
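The core of the answer is the map-then-fillna step, which can be tried without downloading the dataset. A minimal sketch on a hand-made Series (the three disaster names and the shortened dict are assumptions for illustration):

```python
import pandas as pd

# Hypothetical, shortened color lookup.
disaster_colors = {"Flood": "royalblue", "Volcanic activity": "orange"}

# Stand-in for df["Disaster Type"]; "Meteor" has no entry in the dict.
types = pd.Series(["Flood", "Volcanic activity", "Meteor"], name="Disaster Type")

# map() looks each value up in the dict and yields NaN for missing keys;
# fillna() then supplies a fallback color for those rows.
marker_colors = types.map(disaster_colors).fillna("red")
print(marker_colors.tolist())  # ['royalblue', 'orange', 'red']
```

The result has one color per row, which is the shape marker_color expects, unlike disaster_type.values(), which has one color per dict entry.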

finding latest trip information from a large data frame

I have a requirement:
I have a dataframe "df_input" with 20M rows of trip details. The columns are "vehicle-no", "geolocation", "start", "end".
For each vehicle number there are multiple rows, each with a different geolocation for a different trip.
Now I want to create a new dataframe df_final that contains only the first record for each vehicle-no. How can I do that efficiently?
I used something like the code below, which takes more than 5 hours to complete:
import dfply as dp
from dfply import X
output_df_columns = ["vehicle-no","start", "end", "geolocations"]
df_final = pd.DataFrame(columns = output_df_columns) #create empty dataframe
unique_vehicle_no = list(df_input["vehicle-no"].unique())
df_input.sort_values(["start"],inplace=True)
for each_vehicle in unique_vehicle_no:
    df_temp = (df_input >> dp.mask(X.vehicle-no == each_vehicle))
    df_final = df_final.append(df_temp.head(1),ignore_index=True, sort=False)
I think this will work out
import pandas as pd
import numpy as np
df_input=pd.DataFrame(np.random.randint(10,size=(1000,3)),columns=['Geolocation','start','end'])
df_input['vehicle_number']=np.random.randint(100,size=(1000))
print(df_input.shape)
print(df_input['vehicle_number'].nunique())
df_final=df_input.groupby('vehicle_number').apply(lambda x : x.head(1)).reset_index(drop=True)
print(df_final['vehicle_number'].nunique())
print(df_final.shape)
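A possible alternative to groupby/apply, sketched here on a made-up five-row frame, is to sort once by "start" and keep the first row per vehicle with drop_duplicates. This avoids calling a Python lambda per group; whether it is fast enough for 20M rows is untested here.

```python
import pandas as pd

# Hypothetical miniature of df_input, using the column names from the question.
df_input = pd.DataFrame({
    "vehicle-no": ["V1", "V2", "V1", "V2", "V3"],
    "geolocation": ["g1", "g2", "g3", "g4", "g5"],
    "start": [3, 5, 1, 2, 4],
    "end": [4, 6, 2, 3, 5],
})

# Sort by trip start, then keep only the earliest row for each vehicle.
df_final = (
    df_input.sort_values("start")
    .drop_duplicates(subset="vehicle-no", keep="first")
    .reset_index(drop=True)
)
print(df_final)
```

Using keep="last" instead would select the latest trip per vehicle.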

Finding Specific word in a pandas column and assigning to a new column and replicate the row

I am trying to find specific words in a pandas column and assign them to a new column; a cell may contain two or more of the words. Once I find them, I want to replicate the row, creating one row for each word found.
import pandas as pd
import numpy as np
import re
wizard=pd.read_excel(r'C:\Python\L\Book1.xlsx'
,sheet_name='Sheet1'
, header=0)
test_set = {'941', '942',}
test_set2={'MN','OK','33/3305'}
wizard['ZTYPE'] = wizard['Comment'].apply(lambda x: any(i in test_set for i in x.split()))
wizard['ZJURIS']=wizard['Comment'].apply(lambda x: any(i in test_set2 for i in x.split()))
wizard_new = pd.DataFrame(np.repeat(wizard.values,3,axis=0))
wizard_new.columns = wizard.columns
wizard_new.head()
I am getting True and False, but I am unable to split the matches out.
Above is how the sample data looks. I need to find anything like '33/3305'; the year could be entered as '19' or '2019', and the quarter could be entered as 'Q1', '1Q', 'Q 1' or '1 Q', in addition to the words in my test sets.
ZJURIS = dict(list(itertools.chain(*[[(y_, x) for y_ in y] for x, y in wizard.comment()])))
def to_category(x):
    for w in x.lower().split(" "):
        if w in ZJURIS:
            return ZJURIS[w]
    return None
Finally, apply the method on the column and save the result to a new one:
wizard["ZJURIS"] = wizard["comment"].apply(to_category)
I tried the above solution, but it did not work.
Any suggestions on how to get the code to work?
Sample data.
data={ 'ID':['351362278576','351539320880','351582465214','351609744560','351708198604'],
'BU':['SBS','MAS','NAS','ET','SBS'],
'Comment':['940/941/w2-W3NYSIT/SUI33/3305/2019/1q','OK SUI 2Q19','941 - 3Q2019NJ SIT - 3Q2019NJ SUI/SDI - 3Q2019','IL,SUI,2016Q4,2017Q1,2017Q2','1Q2019 PA 39/5659 39/2476','UT SIT 1Q19-3Q19']
}
df = pd.DataFrame(data)
The attached output is based on the sample data set.
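One way to get the matched words themselves rather than True/False, and then one row per word, is to collect the matches into a list and call DataFrame.explode. This is only a sketch on invented comments (the three rows and the helper find_words are hypothetical); it matches plain substrings, so a short code like 'OK' would also match inside a longer word.

```python
import re

import pandas as pd

# Hypothetical sample rows, loosely modelled on the comments in the question.
wizard = pd.DataFrame({
    "ID": ["1", "2", "3"],
    "Comment": ["940/941/w2 SUI 33/3305", "OK SUI 2Q19 942 941", "UT SIT 1Q19"],
})
test_set = {"941", "942"}


def find_words(comment, words):
    """Return every word from `words` that occurs in `comment`, in order of appearance."""
    # Longest alternatives first, so a longer code is not shadowed by a shorter one.
    pattern = "|".join(re.escape(w) for w in sorted(words, key=len, reverse=True))
    return re.findall(pattern, comment)


# One list of matches per row (empty list when nothing matches).
wizard["ZTYPE"] = wizard["Comment"].apply(lambda c: find_words(c, test_set))

# explode() repeats the row once per list element; empty lists become NaN.
wizard_new = wizard.explode("ZTYPE").reset_index(drop=True)
print(wizard_new[["ID", "ZTYPE"]])
```

The same call with test_set2 would fill a ZJURIS column; years and quarters would need their own regular expressions.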

Why is plot returning "ValueError: could not convert string to float:" when a dataframe column of floats is being passed to the plot function?

I am trying to plot a dataframe I have created from an excel spreadsheet using either matplotlib or matplotlib and pandas ie. df.plot. However, python keeps returning a cannot convert string to float error. This is confusing since when I print the column of the dataframe it appears to be all float values.
I've tried printing the values of the dataframe column and using the pandas.plot syntax. I've also tried saving the column to a new variable.
import pandas as pd
from matplotlib import pyplot as plt
import glob
import openpyxl
import math
from openpyxl.utils.dataframe import dataframe_to_rows
from openpyxl.styles import Border, Side, Alignment
import seaborn as sns
import itertools
directory = r'E:\some directory'
#QA_directory = directory + '**/*COPY.xlsx'
wb = openpyxl.load_workbook(directory + '\\Calcs\\' + "excel file.xlsx", data_only = True)
plt.figure(figsize=(16,9))
axes = plt.axes()
plt.title('Drag Amplification', fontsize = 16)
plt.xlabel('Time (s)', fontsize = 14)
plt.ylabel('Cf', fontsize = 14)
d = pd.DataFrame()
n=[]
for sheets in wb.sheetnames:
    if '2_1' in sheets and '2%' not in sheets and '44%' not in sheets:
        name = sheets[:8]
        print(name)
        ws = wb[sheets]
        data = ws.values
        cols = next(data)[1:]
        data = list(data)
        idx = [r[0] for r in data]
        data = (itertools.islice(r, 1, None) for r in data)
        df = pd.DataFrame(data, index=idx, columns=cols)
        df = df.dropna()
        #x = df['x/l']
        #y = df.Cf
        print(df.columns)
        print(df.Cf.values)
        x=df['x/l'].values
        plt.plot(x, df.Cf.values)
        """x = [wb[sheets].cell(row=row,column=1).value for row in range(1,2000) if wb[sheets].cell(row=row,column=1).value]
        print(x)
        Cf = [wb[sheets].cell(row=row,column=6).value for row in range(1,2000) if wb[sheets].cell(row=row,column=1).value]
        d[name+ 'x'] = pd.DataFrame(x)
        d[name + '_Cf'] = pd.Series(Cf, index=d.index)
        print(name)"""
        print(df)
plt.show()
I'm expecting a plot of line graphs with the values of x/l on the x axis and Cf on the y axis, with a line for each of the relevant sheets in the workbook. Any insights as to why I am getting this error would be appreciated!
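A column built from raw worksheet values can print like floats while still holding some strings. One way to check, sketched on an invented three-row frame (the values are assumptions, and this is a possible cause rather than a confirmed diagnosis of the workbook above), is to count the Python types in the column and coerce with pd.to_numeric:

```python
import pandas as pd

# Hypothetical column mixing real floats with strings, as can happen
# when cell values are read straight from a worksheet.
df = pd.DataFrame({"x/l": [0.0, 0.5, 1.0], "Cf": [0.12, "0.15", "n/a"]})

# Count the Python types actually stored in the column.
print(df["Cf"].map(type).value_counts())

# Convert what can be converted; anything unparseable becomes NaN.
cf_numeric = pd.to_numeric(df["Cf"], errors="coerce")

# Rows that had a value but could not be parsed as a number.
bad_rows = df[cf_numeric.isna() & df["Cf"].notna()]
print(bad_rows)
```

If bad_rows is empty, plotting cf_numeric instead of the raw column should avoid the string-to-float conversion inside matplotlib.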

In Bokeh, how do I add tooltips to a Timeseries chart (hover tool)?

Is it possible to add Tooltips to a Timeseries chart?
In the simplified code example below, I want to see a single column name ('a','b' or 'c') when the mouse hovers over the relevant line.
Instead, a "???" is displayed and ALL three lines get a tooltip (rather than just the one I'm hovering over).
Per the documentation (http://docs.bokeh.org/en/latest/docs/user_guide/tools.html#hovertool), field names starting with "@" are interpreted as columns on the data source.
How can I display the 'columns' from a pandas DataFrame in the tooltip?
Or, if the high level TimeSeries interface doesn't support this, any clues for using the lower level interfaces to do the same thing? (line? multi_line?) or convert the DataFrame into a different format (ColumnDataSource?)
For bonus credit, how should the "$x" be formatted to display the date as a date?
thanks in advance
import pandas as pd
import numpy as np
from bokeh.charts import TimeSeries
from bokeh.models import HoverTool
from bokeh.plotting import show
toy_df = pd.DataFrame(data=np.random.rand(5,3), columns = ('a', 'b' ,'c'), index = pd.DatetimeIndex(start='01-01-2015',periods=5, freq='d'))
p = TimeSeries(toy_df, tools='hover')
hover = p.select(dict(type=HoverTool))
hover.tooltips = [
("Series", "@columns"),
("Date", "$x"),
("Value", "$y"),
]
show(p)
Below is what I came up with.
It's not pretty but it works.
I'm still new to Bokeh (and Python, for that matter), so if anyone wants to suggest a better way to do this, please feel free.
import pandas as pd
import numpy as np
from bokeh.charts import TimeSeries
from bokeh.models import HoverTool
from bokeh.plotting import figure, show, ColumnDataSource
toy_df = pd.DataFrame(data=np.random.rand(5,3), columns = ('a', 'b' ,'c'), index = pd.DatetimeIndex(start='01-01-2015',periods=5, freq='d'))
_tools_to_show = 'box_zoom,pan,save,hover,resize,reset,tap,wheel_zoom'
p = figure(width=1200, height=900, x_axis_type="datetime", tools=_tools_to_show)
# FIRST plot ALL lines (This is a hack to get it working, why can't i pass in a dataframe to multi_line?)
# It's not pretty but it works.
# what I want to do!: p.multi_line(df)
ts_list_of_list = []
for i in range(0,len(toy_df.columns)):
    ts_list_of_list.append(toy_df.index.T)
vals_list_of_list = toy_df.values.T.tolist()
# Define colors because otherwise multi_line will use blue for all lines...
cols_to_use = ['Black', 'Red', 'Lime']
p.multi_line(ts_list_of_list, vals_list_of_list, line_color=cols_to_use)
# THEN put scatter one at a time on top of each one to get tool tips (HACK! lines with tooltips not yet supported by Bokeh?)
for (name, series) in toy_df.iteritems():
    # need to repmat the name to be same dimension as index
    name_for_display = np.tile(name, [len(toy_df.index),1])
    source = ColumnDataSource({'x': toy_df.index, 'y': series.values, 'series_name': name_for_display, 'Date': toy_df.index.format()})
    # trouble formatting x as a date string, so pre-formatting and using an extra column. It's not pretty but it works.
    p.scatter('x', 'y', source = source, fill_alpha=0, line_alpha=0.3, line_color="grey")
hover = p.select(dict(type=HoverTool))
hover.tooltips = [("Series", "@series_name"), ("Date", "@Date"), ("Value", "@y{0.00%}"),]
hover.mode = 'mouse'
show(p)
I'm not familiar with Pandas, so I just use Python lists to show a complete example of how to add tooltips to multiple lines, show the series names, and properly display the date/time. Below is the result.
Thanks to @bs123's answer and @tterry's answer in Bokeh Plotting: Enable tooltips for only some glyphs
my result
# -*- coding: utf-8 -*-
from bokeh.plotting import figure, output_file, show, ColumnDataSource
from bokeh.models import HoverTool
from datetime import datetime
dateX_str = ['2016-11-14','2016-11-15','2016-11-16']
#conver the string of datetime to python datetime object
dateX = [datetime.strptime(i, "%Y-%m-%d") for i in dateX_str]
v1= [10,13,5]
v2 = [8,4,14]
v3= [14,9,6]
v = [v1,v2,v3]
names = ['v1','v2','v3']
colors = ['red','blue','yellow']
output_file('example.html',title = 'example of add tooltips to multi_timeseries')
tools_to_show = 'hover,box_zoom,pan,save,resize,reset,wheel_zoom'
p = figure(x_axis_type="datetime", tools=tools_to_show)
#to show the tooltip for multi_lines,you need use the ColumnDataSource which define the data source of glyph
#the key is to use the same column name for each data source of the glyph
#so you don't have to add tooltip for each glyph,the tooltip is added to the figure
#plot each timeseries line glyph
for i in range(3):
    # bokeh can't show datetime object in tooltip properly,so we use string instead
    source = ColumnDataSource(data={
        'dateX': dateX, # python datetime object as X axis
        'v': v[i],
        'dateX_str': dateX_str, #string of datetime for display in tooltip
        'name': [names[i] for n in range(3)]
    })
    p.line('dateX', 'v',source=source,legend=names[i],color = colors[i])
    circle = p.circle('dateX', 'v',source=source, fill_color="white", size=8, legend=names[i],color = colors[i])
    #to avoid some strange behavior(as shown in the picture at the end), only add the circle glyph to the renderers of the hover tool
    #so tooltip only takes effect on circle glyph
    p.tools[0].renderers.append(circle)
# show the tooltip
hover = p.select(dict(type=HoverTool))
hover.tooltips = [("value", "@v"), ("name", "@name"), ("date", "@dateX_str")]
hover.mode = 'mouse'
show(p)
Tooltips with some strange behavior: two tips displayed at the same time
Here is my solution. I inspected the glyph renderer's data source to see what the column names on it are. Then I used those names in the hover tooltips. You can see the resulting plot here.
import pandas as pd
import numpy as np
from bokeh.charts import TimeSeries
from bokeh.models import HoverTool
from bokeh.plotting import show
toy_df = pd.DataFrame(data=np.random.rand(5,3), columns = ('a', 'b' ,'c'), index = pd.DatetimeIndex(start='01-01-2015',periods=5, freq='d'))
#Bokeh displays dates as numbers, so convert to string to show them correctly
toy_df.index = toy_df.index.astype(str)
p = TimeSeries(toy_df, tools='hover')
#Next 3 lines are to inspect the column names on the glyph so they can be called with @name on hover
#glyph_renderers = p.select(dict(type=GlyphRenderer))
#bar_source = glyph_renderers[0].data_source
#print(bar_source.data) #Here we can inspect names to call on hover
hover = p.select(dict(type=HoverTool))
hover.tooltips = [
("Series", "@series"),
("Date", "@x_values"),
("Value", "@y_values"),
]
show(p)
The original poster's code doesn't work with the latest pandas (the DatetimeIndex constructor has changed), but HoverTool now supports a formatters attribute that lets you specify a format as a strftime string. Something like:
fig.add_tools(HoverTool(
tooltips=[
('time', '@index{%Y-%m-%d}')
],
formatters={
'@index': 'datetime'
}
))
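To put the formatters idea together with the original toy DataFrame, here is a minimal sketch that assumes a recent Bokeh release (where bokeh.charts no longer exists) and builds the index with pd.date_range. The colors and the use of the glyph name to carry the column name are choices made for this illustration, not something taken from the answers above.

```python
import numpy as np
import pandas as pd
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.plotting import figure, show

toy_df = pd.DataFrame(
    np.random.rand(5, 3),
    columns=["a", "b", "c"],
    index=pd.date_range(start="2015-01-01", periods=5, freq="D"),
)

# An unnamed DataFrame index is stored in the source under the column name "index".
source = ColumnDataSource(toy_df)

p = figure(x_axis_type="datetime")
for col, color in zip(toy_df.columns, ["black", "red", "green"]):
    # name=col lets the tooltip refer to the hovered line through $name.
    p.line("index", col, source=source, name=col, line_color=color)

hover = HoverTool(
    tooltips=[
        ("Series", "$name"),             # name of the glyph under the cursor
        ("Date", "@index{%Y-%m-%d}"),    # formatted by the datetime formatter below
        ("Value", "@$name"),             # column whose name equals the glyph name
    ],
    formatters={"@index": "datetime"},
    mode="mouse",
)
p.add_tools(hover)

# show(p)  # uncomment to render the plot in a browser
```

With mode="mouse", only the line under the cursor should produce a tooltip, which addresses the "all three lines get a tooltip" symptom from the question.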
