Use a second column for faceting in Altair - altair

Using Altair, I'm trying to plot some data from a DataFrame:
plot_N50 = alt.Chart(data).mark_boxplot(opacity=0.5).encode(
    y=alt.Y('N50', scale=alt.Scale(domain=[0, 35000], clamp=True), axis=alt.Axis(tickCount=9)),
    color=alt.Color('Assembler', scale=alt.Scale(scheme='turbo'), legend=None),
    column=alt.Column('Assembler:N',
        title="",
        header=alt.Header(labelAngle=-45, labelOrient='bottom', labelPadding=-5)
    ),
    row=alt.Row('Amplicon:N',
        title="",
        sort='descending',
    ),
).configure_axis(
    grid=False,
    labelFontSize=12,
    titleFontSize=12
).configure_view(
    stroke=None
).properties(
    height=height, width=width
)
This works fine so far and produces the following plot:
[Image: N50 plot]
Would it also be possible to use a second column field (the one currently in the row definition, Amplicon) to get the two plots side by side? In the DataFrame, the Amplicon column only has two distinct values. I'm happy to provide any further information.
Thanks in advance

Related

Pandas Crosstab Plot Top N Elements

I have created a Pandas Crosstab with some data I am currently working with.
contingencyTable = pd.crosstab(index=dfBoolean['ColA'],
                               columns=dfBoolean['Category'],
                               margins=True)
Both the primaryDistribution and the categoricalColumnName columns have a TON of categories, far too many to plot nicely on a single graph. Is it possible to select only the top n categories (say, 5) to plot from the crosstab? Thanks!
plt.rcParams.update({'font.size': 12})
bPlot = contingencyTable.plot.bar(rot=0, figsize=(9.2, 5.7))
Here is some sample data:
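The sample data is not reproduced here, so the following is only a minimal sketch on a small hypothetical frame. It builds the crosstab without margins (so the 'All' totals are not ranked along with the real categories), ranks the columns by their total count, and keeps the n largest. The values in dfBoolean and the name top_table are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the question's dfBoolean.
dfBoolean = pd.DataFrame({
    'ColA': ['a', 'b', 'a', 'a', 'b', 'a', 'b', 'b', 'a', 'a', 'b'],
    'Category': ['x'] * 5 + ['y'] * 3 + ['z'] * 2 + ['w'],
})

n = 2  # number of categories to keep

# Crosstab without margins, so the totals row/column is not part of the ranking.
contingencyTable = pd.crosstab(index=dfBoolean['ColA'],
                               columns=dfBoolean['Category'])

# Rank the columns by their total count and keep the n most frequent ones.
top_columns = contingencyTable.sum(axis=0).nlargest(n).index
top_table = contingencyTable[top_columns]
```

top_table can then be plotted as before, for example with top_table.plot.bar(rot=0, figsize=(9.2, 5.7)). The same ranking along the rows would use contingencyTable.sum(axis=1) together with .loc.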

How to group-by twice, preserve original columns, and plot

I have the following data set (only a sample is shown):
I want to find the most impactful exercise per area and then plot it via Seaborn barplot.
I use the following code to do so.
# Create Dataset Using Only Area, Exercise and Impact Level Categories
CA_data = Data[['area', 'exercise', 'impact level']]
# Compute Mean Impact Level per Exercise per Area
mean_il_CA = CA_data.groupby(['area', 'exercise'])['impact level'].mean().reset_index()
mean_il_CA_hello = mean_il_CA.groupby('area')['impact level'].max().reset_index()
# Plot
cx = sns.barplot(x="impact level", y="area", data=mean_il_CA_hello)
plt.title('Most Impactful Exercises Considering Area')
plt.show()
The resulting dataset is:
This means that when I plot, only the area label appears on the y axis, NOT 'area label' + 'exercise label' as I would like.
How do I reinsert the 'exercise' column into my final dataset?
How do I get both the name of the area and the exercise on the y axis?
The problem of losing the values of 'exercise' when taking the maximum per 'area' can be solved by keeping the MultiIndex (i.e. not calling reset_index after the first groupby) and using .transform to create a boolean mask that selects the full rows of mean_il_CA containing the maximum 'impact level' value per 'area'. This solution is based on the code provided in this answer by unutbu. The full labels for the bar chart can be created by concatenating the labels of 'area' and 'exercise'.
Here is an example using the titanic dataset from the seaborn package. The variables 'class', 'embark_town', and 'fare' are used in place of 'area', 'exercise', and 'impact level'. The two categorical variables each contain three unique values: 'First', 'Second', 'Third' for 'class', and 'Cherbourg', 'Queenstown', 'Southampton' for 'embark_town'.
import pandas as pd # v 1.2.5
import seaborn as sns # v 0.11.1
df = sns.load_dataset('titanic')
data = df[['class', 'embark_town', 'fare']]
data.head()
data_mean = data.groupby(['class', 'embark_town'])['fare'].mean()
data_mean
# Select max values in each class and create concatenated labels
mask_max = data_mean.groupby(level=0).transform(lambda x: x == x.max())
data_mean_max = data_mean[mask_max].reset_index()
data_mean_max['class, embark_town'] = data_mean_max['class'].astype(str) + ', ' \
+ data_mean_max['embark_town']
data_mean_max
# Draw seaborn bar chart
# With data= given, x and y can simply name the columns
sns.barplot(data=data_mean_max,
            x='fare',
            y='class, embark_town')
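For reference, here is a sketch of the same idea applied to the column names from the question. The values in Data are hypothetical, and the mask is written as a comparison against transform('max') rather than a lambda, which is an equivalent spelling of the approach above and not a different method.

```python
import pandas as pd

# Hypothetical sample using the question's column names.
Data = pd.DataFrame({
    'area': ['legs', 'legs', 'legs', 'arms', 'arms', 'arms'],
    'exercise': ['squat', 'lunge', 'squat', 'curl', 'dip', 'dip'],
    'impact level': [3, 5, 4, 2, 4, 5],
})

# Mean impact level per (area, exercise); keep the MultiIndex.
mean_il_CA = Data.groupby(['area', 'exercise'])['impact level'].mean()

# Boolean mask: True where a row holds the maximum mean of its area.
area_max = mean_il_CA.groupby(level='area').transform('max')
mask = mean_il_CA == area_max

# Keep the full rows, then build the combined label for the y axis.
best = mean_il_CA[mask].reset_index()
best['label'] = best['area'] + ', ' + best['exercise']
```

The result can be passed to seaborn as sns.barplot(x='impact level', y='label', data=best). If two exercises tie for the maximum within an area, both rows are kept.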

Is it possible to sort the columns of an Altair grouped bar chart based on the value of one of the categories?

I have the following chart -
I'd like to be able to sort the columns (NOT the individual bars within a group; I know how to do that already), i.e. order the 3 sub-charts, if you will, based on the value of whichever category (a, b or c) I choose.
I tried using alt.SortField and alt.EncodeSortField. They move the charts around a bit, but the resulting order does not follow the chosen category when I change it.
Code -
import altair as alt
import pandas as pd
dummy = pd.DataFrame({
    'place': ['Asia', 'Antarctica', 'Africa', 'Antarctica', 'Asia', 'Africa', 'Africa', 'Antarctica', 'Asia'],
    'category': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'],
    'value': [5, 2, 3, 4, 3, 5, 6, 9, 5],
})
alt.Chart(dummy).mark_bar().encode(
    x=alt.X('category'),
    y='value',
    column=alt.Column('place:N', sort=alt.SortField(field='value', order='descending')),
    color='category',
)
I know that alt.Column('place:N', sort=alt.SortField(field='value', order='descending')), doesn't seem correct, since I am not targeting any category, so I tried x=alt.X('category', sort=alt.SortField(field='c', order='descending')), too, but it doesn't work either.
Expected Output (assuming descending order)-
If I want to order by 'c', then middle column should be first, followed by left and finally right column.
It already seems ordered by 'b'.
If I want to order by 'a', then right column should be first, followed by left and finally middle column.
This is a bit involved, but you can do this with a series of transforms:
1. a Calculate Transform to select the value you want to sort on
2. a Join-Aggregate Transform with argmax to join the desired values to each group
3. another Calculate Transform to pull out the specific field within this result that you would like to sort by
It looks like this, first sorting by "c":
import altair as alt
import pandas as pd
dummy = pd.DataFrame({
    'place': ['Asia', 'Antarctica', 'Africa', 'Antarctica', 'Asia', 'Africa', 'Africa', 'Antarctica', 'Asia'],
    'category': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'],
    'value': [5, 2, 3, 4, 3, 5, 6, 9, 5],
})
alt.Chart(dummy).transform_calculate(
    key="datum.category == 'c'"
).transform_joinaggregate(
    sort_key="argmax(key)", groupby=['place']
).transform_calculate(
    sort_val='datum.sort_key.value'
).mark_bar().encode(
    x=alt.X('category'),
    y='value',
    column=alt.Column('place:N', sort=alt.SortField("sort_val", order="descending")),
    color='category',
)
Then sorting by "a":
alt.Chart(dummy).transform_calculate(
    key="datum.category == 'a'"
).transform_joinaggregate(
    sort_key="argmax(key)", groupby=['place']
).transform_calculate(
    sort_val='datum.sort_key.value'
).mark_bar().encode(
    x=alt.X('category'),
    y='value',
    column=alt.Column('place:N', sort=alt.SortField("sort_val", order="descending")),
    color='category',
)
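As a cross-check on the expected orderings, or as a simpler alternative when the data already lives in pandas, the column order can also be computed outside the chart and handed to Altair as an explicit list. This is a different technique from the transform chain above, and the helper name place_order is hypothetical.

```python
import pandas as pd

dummy = pd.DataFrame({
    'place': ['Asia', 'Antarctica', 'Africa', 'Antarctica', 'Asia',
              'Africa', 'Africa', 'Antarctica', 'Asia'],
    'category': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'],
    'value': [5, 2, 3, 4, 3, 5, 6, 9, 5],
})


def place_order(df, category, descending=True):
    """Return the places ordered by their value in the chosen category."""
    subset = df[df['category'] == category]
    ordered = subset.sort_values('value', ascending=not descending)
    return ordered['place'].tolist()


order_c = place_order(dummy, 'c')
order_a = place_order(dummy, 'a')
```

The resulting list can be passed as the sort argument, e.g. alt.Column('place:N', sort=order_c). Unlike the transform-based version, the order is fixed when the chart is built rather than computed inside the Vega-Lite specification.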

How can I put two bars of distinct series next to each other in the same chart?

Let's assume our data frame has two series of type integer: estimated_value and sell_price.
I want to have two bars next to each other in the same bar chart.
The left one shows average(estimated_value) and the right one shows average(sell_price).
They shall share the same axis.
I thought this would be a very common use case but I could not find any example in the docs. All the examples use 'colour' or 'column' to group bars.
I've tried using y2 but it seems to simply erase the difference to y1 instead of adding a second series.
Then I tried using a layeredChart but this puts both bars on top of each other instead of next to each other.
It sounds like you have wide-form data rather than long-form data. The difference is discussed in Long-form vs. Wide-form data.
Once you've transformed your data to long-form, you can use standard encodings to achieve this result. Here's how it might look, using some example data:
import altair as alt
import pandas as pd
data = pd.DataFrame({
    'estimated_value': [500, 600, 700, 800, 900],
    'sell_price': [550, 610, 690, 810, 950]
})
alt.Chart(data).transform_fold(
    ['estimated_value', 'sell_price'], as_=['category', 'price']
).mark_bar().encode(
    y='category:N',
    x='average(price):Q',
)
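To make the wide-form versus long-form distinction concrete, the reshape can also be done in pandas before charting. The sketch below uses DataFrame.melt on the same example data; the names long_form and averages are only illustrative.

```python
import pandas as pd

# Wide form: one column per series.
data = pd.DataFrame({
    'estimated_value': [500, 600, 700, 800, 900],
    'sell_price': [550, 610, 690, 810, 950],
})

# Long form: one row per (category, price) observation.
long_form = data.melt(var_name='category', value_name='price')

# The two bar heights the chart would show.
averages = long_form.groupby('category')['price'].mean()
```

With long_form as the chart data, the encodings y='category:N' and x='average(price):Q' work directly and transform_fold is no longer needed.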

SNS catplot (Box plot) selecting only 5 data point to be displayed based on their mean

I have a seaborn catplot (box plot) (click on the link below). For each time window on the x axis there are multiple box plots, each corresponding to one ID. How can I write the code so that, for every time window, only the 5 IDs with the highest mean in that window are displayed? Thank you!
sns.catplot(x='time_window', hue='ID', y='Time (ms)', data=mo_finaldf, kind="box", showfliers=False)
I have found an answer to my question. Basically, seaborn and Matplotlib do not have any settings for these and you have to split your dataframe on your own. What I have done is a groupby followed by a SQL join. Hope it helps anyone who has encountered the same problem in the future.
# Named aggregation; the nested-dict form of agg raises an error in pandas 1.0 and later
df_to_join = mo_finaldf.groupby(['time_window', 'ID'])['time'].agg(Mean='mean', var='var')\
    .sort_values(by='Mean', ascending=False).sort_index(level='time_window', sort_remaining=False)
highest_5_mean = df_to_join.groupby(['time_window']).head(5).copy()
highest_5_mean.reset_index(inplace=True)
highest_5_mean.rename(columns={'time': 'Mean'}, inplace=True)
dataset_filtered = pd.merge(mo_finaldf, highest_5_mean, how='inner', left_on=['time_window', 'tap'],
right_on=['time_window', 'ID'])
sns.catplot(x='time_window', hue='tap', y='time', data=dataset_filtered, kind="box",
showfliers=False)
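A compact, self-contained sketch of the same groupby-then-join idea follows. The data is hypothetical, a single ID column is used throughout (the snippet above mixes 'ID' and 'tap'), and n is set to 2 only to keep the example small.

```python
import pandas as pd

# Hypothetical measurements: three IDs observed in two time windows.
mo_finaldf = pd.DataFrame({
    'time_window': [1] * 6 + [2] * 6,
    'ID': ['a', 'a', 'b', 'b', 'c', 'c'] * 2,
    'time': [10, 12, 30, 32, 20, 22, 50, 52, 5, 7, 40, 42],
})

n = 2  # number of IDs to keep per time window (5 in the question)

# Mean time per (time_window, ID).
mean_time = (mo_finaldf.groupby(['time_window', 'ID'])['time']
             .mean().rename('mean_time').reset_index())

# Sort by mean, then take the first n rows of each time window.
top_ids = (mean_time.sort_values('mean_time', ascending=False)
           .groupby('time_window').head(n)[['time_window', 'ID']])

# Inner join keeps only the raw rows that belong to a top-n ID.
dataset_filtered = mo_finaldf.merge(top_ids, on=['time_window', 'ID'], how='inner')
```

dataset_filtered can then be plotted with sns.catplot(x='time_window', hue='ID', y='time', data=dataset_filtered, kind='box', showfliers=False).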
