specify color in seaborn catplot - python-3.x

I would like to specify the color of particular observations using seaborn catplot. In a made up exemple:
import seaborn as sns
import random as r
name_list=['pepe','Fabrice','jim','Michael']
country_list=['spain','France','uk','Uruguay']
favourite_color=['green','blue','red','white']
df=pd.DataFrame({'name':[r.choice(name_list) for n in range(100)],
'country':[r.choice(country_list) for n in range(100)],
'fav_color':[r.choice(favourite_color) for n in range(100)],
'score':np.random.rand(100),
})
sns.catplot(x='fav_color',
y='score',
col='country',
col_wrap=2,
data=df,
kind='swarm')
I would like to colour (or mark in another distinctive way, it could be the marker) all the observations with the name 'pepe'. How I could do that ? The other colors I dont mind, it would be better if they are all the same.

You can achieve the results you want by adding a boolean column to the dataframe and using it as the hue parameter of the catplot() call. This way you will obtain the results with two colours (one for pepe observations and one for the rest). Results can be seen here:
Also, the parameter legend=False should be set since otherwise the legend for is_pepe will appear on the side.
The code would be as below:
df['is_pepe'] = df['name'] == 'pepe'
ax = sns.catplot(x='fav_color',
y='score',
col='country',
col_wrap=2,
data=df,
kind='swarm',
hue='is_pepe',
legend=False)
Furthermore, you can specify the two colours you want for the two kinds of observations (pepe and not-pepe) using the parameter palette and the top-level function sns.color_palette() as shown below:
ax = sns.catplot(x='fav_color',
y='score',
col='country',
col_wrap=2,
data=df,
kind='swarm',
hue='is_pepe',
legend=False,
palette=sns.color_palette(['green', 'blue']))
Obtaining this results:

Related

How do the factors in factor_cmap in Bokeh work?

I am trying to construct a grouped vertical bar chart in Bokeh from a pandas dataframe. I'm struggling with understanding the use of factor_cmap and how the color mapping works with this function. There's an example in the documentation (https://docs.bokeh.org/en/latest/docs/user_guide/categorical.html#pandas) that was helpful to follow, here:
from bokeh.io import output_file, show
from bokeh.palettes import Spectral5
from bokeh.plotting import figure
from bokeh.sampledata.autompg import autompg_clean as df
from bokeh.transform import factor_cmap
output_file("bar_pandas_groupby_nested.html")
df.cyl = df.cyl.astype(str)
df.yr = df.yr.astype(str)
group = df.groupby(by=['cyl', 'mfr'])
index_cmap = factor_cmap('cyl_mfr', palette=Spectral5, factors=sorted(df.cyl.unique()), end=1)
p = figure(plot_width=800, plot_height=300, title="Mean MPG by # Cylinders and Manufacturer",
x_range=group, toolbar_location=None, tooltips=[("MPG", "#mpg_mean"), ("Cyl, Mfr", "#cyl_mfr")])
p.vbar(x='cyl_mfr', top='mpg_mean', width=1, source=group,
line_color="white", fill_color=index_cmap, )
p.y_range.start = 0
p.x_range.range_padding = 0.05
p.xgrid.grid_line_color = None
p.xaxis.axis_label = "Manufacturer grouped by # Cylinders"
p.xaxis.major_label_orientation = 1.2
p.outline_line_color = None
show(p)
This yields the following (again, a screen shot from the documentation):
Grouped Vbar output
I understand how factor_cmap is working here, I think. The index for the dataframe has multiple factors and we're only taking the first by slicing (as seen with the end = 1). But when I try to instead set coloring based on the second index level, mfr, (setting start = 1 , end = 2) , the index mapping breaks and I get this. I based this change on my assumption that the factors were hierarchical and I needed to slice them to get the second level.
I think I must be thinking about the indexing with these categorical factors wrong, but I'm not sure what I'm doing wrong. How do I get a categorical mapper to color by the second level of the factor? I assumed the format of the factors was ('cyl', 'mfr') but maybe that assumption is wrong?
Here's the documentation for factor_cmap, although it wasn't very helpful: https://docs.bokeh.org/en/latest/docs/reference/transform.html#bokeh.transform.factor_cmap .
If you mean you are trying this:
index_cmap = factor_cmap('cyl_mfr',
palette=Spectral5,
factors=sorted(df.cyl.unique()),
start=1, end=2)
Then there are at least two issues:
2 is out of bounds for the length of the list of sub-factors ('cyl', 'mfr'). You would just want start=1 and leave end with its default value of None (which means to the end of the list, as usual for any Python slice).
In this specific case, with start=1 that means "colormap based on mfr sub-factors of the values", but you are still configuring the cololormapper with the cylinders as the factors for the map:
factors=sorted(df.cyl.unique())
When the colormapper goes to look up a value with mfr="mazda" in the mapping, it does not find anything (because you only put cylinder values in the mapping) so it gets shaded the default color grey (as expected).
So you could do something like this:
index_cmap = factor_cmap('cyl_mfr',
palette=Spectral5,
factors=sorted(df.mfr.unique()),
start=1)
Which "works" modulo the fact that there are way more manufacturer values than there are colors in the Spectral5 palette:
In the real situation you'll need to make sure you use a palette as least as big as the number of (sub-)factors that you configure.

How to add traces in plotly.express

I am very new to python and plotly.express, and I find it very confusing...
I am trying to use the principle of adding different traces to my figure, using example code shown here https://plotly.com/python/line-charts/, Line Plot Modes, #Create traces.
BUT I get my data from a .CSV file.
import plotly.express as px
import plotly as plotly
import plotly.graph_objs as go
import pandas as pd
data = pd.read_csv(r"C:\Users\x.csv")
fig = px.scatter(data, x="Time", y="OD", color="C-source", size="C:A 1 ratio")
fig = px.line(data, x="Time", y="OD", color="C-source")
fig.show()
The above lines produces scatter/line plots with the correct data, but the data is mixed together. I have data from 2 different sources marked by a column named "Strain" in my .csv file that I would like the chart to reflect.
Is the traces option a possible way to do it, or is there another way?
You can add traces using an Express plot by using .select_traces(). Something like:
fig.add_traces(
list(px.line(...).select_traces())
)
Note the need to convert to list, since .select_traces() returns a generator.
It looks like you probably want the lines with the scatter dots as well on a single plot?
You're setting fig to equal px.scatter() and then setting (changing) it to equal px.line(). When set to line, the scatter plot is overwritten.
You're already importing graph objects so you can use add_trace with go, something like this:
fig.add_trace(go.Scatter(x=data["Time"], y=data["OD"], mode='markers', marker=dict(color=data["C-source"], size=data["C:A 1 ratio"])))
Depending on how your data is set up, you may need to add each C-source separately doing something like:
x=data.query("C-source=='Term'")["Time"], ... , name='Term'`
Here's a few references with examples and options you can use to set up your scatter:
Scatter plot examples  
Marker styles  
Scatter arguments and attributes
You can use the apporach stated in Plotly: How to combine scatter and line plots using Plotly Express?
fig3 = go.Figure(data=fig1.data + fig2.data)
or a more convenient and scalable approach:
fig1.data and fig2.data are common tuples that hold all the info needed for a plot and the + just concatenates them.
# this will hold all figures until they are combined
all_figures = []
# data_collection: dictionary with Pandas dataframes
for df_label in data_collection:
df = data_collection[df_label]
fig = px.line(df, x='Date', y=['Value'])
all_figures.append(fig)
import operator
import functools
# now you can concatenate all the data tuples
# by using the programmatic add operator
fig3 = go.Figure(data=functools.reduce(operator.add, [_.data for _ in all_figures]))
fig3.show()
thanks for taking the time to help me out. I ended up with two solutions that worked, of which using "facet_col" to divide the plot into two subplots (1 for each strain) was the most simple solution.
https://plotly.com/python/axes/
Thanks. this worked for me also where Fig_Set_B is a list of scatter plots
# create a tuple of first line plots in first 6 plots from plot set Fig_Set_B`
fig_combined = go.Figure(data= tuple(Fig_Set_B[x].data[0] for x in range(6)) )
fig_combined.show()

Using RGB values control individual data points matplotlib

I'm trying to be able to control the colour of an individual data point using a corresponding rgb tuple. I've tried looping through the data set and plotting individual data points however I get the same effect as the code I have below; all that happens is it refuses to produce a graph.
This is an example of the data type I'm working with
Any tips?
import matplotlib.pyplot as plt
y=[(0.200,0.1100,0.520)]
for i in range(4):
y.append(y)
plt.plot([1,2,3,4], [3,4,5,2],c=y)
plt.show()
One problem is that you are appending the list to the new list. Instead, try appending the tuple to the list. Moreover, you need to use scatter plot for the color argument which contains rgb tuple for each point. However, in oyur case, I see only a single color for all the scatter points.
tup=(0.200,0.1100,0.520)
y = []
for i in range(4):
y.append(tup)
plt.scatter([1,2,3,4], [3,4,5,2], c=y)
A rather short version to your code is using a list comprehension
tup=(0.200,0.1100,0.520)
y = [tup for _ in range(4)]
plt.scatter([1,2,3,4], [3,4,5,2], c=y)

Creating a structured grid of subplots with Seaborn FacetGrid

My attempt to use FacetGrid in Seaborn does not produces the expected results.
Moreover, I would like to control the white space in the grid.
My data and code is the following:
toy.to_json()
'{"has_cus_id_but_not_acc_id":{"0":0,"1":0,"2":0,"3":0,"4":0,"5":0,"6":0,"7":0,"8":0,"9":0,"10":0,"11":0,"12":0,"13":0,"14":0,"15":0,"16":0,"17":0,"18":1,"19":0,"20":0,"21":0,"22":1,"23":0,"24":0,"25":1,"26":0,"27":1,"28":0,"29":1,"30":0,"31":1,"32":0,"33":1,"34":0,"35":1,"36":0,"37":1,"38":0,"39":0,"40":1,"41":1,"42":0,"43":1,"44":0,"45":1,"46":0,"47":1,"48":0,"49":1,"50":0,"51":1,"52":0,"53":1,"54":0,"55":1,"56":0,"57":1,"58":0,"59":1,"60":0,"61":1,"62":0,"63":1,"64":0,"65":1,"66":0,"67":1,"68":0,"69":1,"70":0,"71":1,"72":0,"73":1,"74":0,"75":1,"76":0,"77":0,"78":1,"79":0,"80":1,"81":0,"82":0,"83":1,"84":0,"85":1},"reg_year":{"0":2014.0,"1":2014.0,"2":2014.0,"3":2014.0,"4":2014.0,"5":2014.0,"6":2014.0,"7":2014.0,"8":2015.0,"9":2015.0,"10":2015.0,"11":2015.0,"12":2015.0,"13":2015.0,"14":2015.0,"15":2015.0,"16":2015.0,"17":2016.0,"18":2016.0,"19":2016.0,"20":2016.0,"21":2016.0,"22":2016.0,"23":2016.0,"24":2016.0,"25":2016.0,"26":2016.0,"27":2016.0,"28":2016.0,"29":2016.0,"30":2016.0,"31":2016.0,"32":2016.0,"33":2016.0,"34":2016.0,"35":2016.0,"36":2016.0,"37":2016.0,"38":2017.0,"39":2017.0,"40":2017.0,"41":2017.0,"42":2017.0,"43":2017.0,"44":2017.0,"45":2017.0,"46":2017.0,"47":2017.0,"48":2017.0,"49":2017.0,"50":2017.0,"51":2017.0,"52":2017.0,"53":2017.0,"54":2017.0,"55":2017.0,"56":2017.0,"57":2017.0,"58":2017.0,"59":2017.0,"60":2018.0,"61":2018.0,"62":2018.0,"63":2018.0,"64":2018.0,"65":2018.0,"66":2018.0,"67":2018.0,"68":2018.0,"69":2018.0,"70":2018.0,"71":2018.0,"72":2018.0,"73":2018.0,"74":2018.0,"75":2018.0,"76":2018.0,"77":2018.0,"78":2018.0,"79":2018.0,"80":2018.0,"81":2018.0,"82":2019.0,"83":2019.0,"84":2019.0,"85":2019.0},"reg_month":{"0":3.0,"1":5.0,"2":6.0,"3":7.0,"4":9.0,"5":10.0,"6":11.0,"7":12.0,"8":1.0,"9":3.0,"10":5.0,"11":6.0,"12":7.0,"13":8.0,"14":9.0,"15":11.0,"16":12.0,"17":1.0,"18":1.0,"19":2.0,"20":3.0,"21":4.0,"22":4.0,"23":5.0,"24":6.0,"25":6.0,"26":7.0,"27":7.0,"28":8.0,"29":8.0,"30":9.0,"31":9.0,"32":10.0,"33":10.0,"34":11.0,"35":11.0,"36":12.0,"37":12.0,"38":1.0,"39":2.0,"40":2.0,"41":3.0,"42":4.0,"43":4.0,"44":5.0,"45":5.0,"46":6.0,"47":6.0,"48":7.0,"49":7.0,"50":8.0,"51":8.0,"52":9.0,"53":9.0,"54":10.0,"55":10.0,"56":11.0,"57":11.0,"58":12.0,"59":12.0,"60":1.0,"61":1.0,"62":2.0,"63":2.0,"64":3.0,"65":3.0,"66":4.0,"67":4.0,"68":5.0,"69":5.0,"70":6.0,"71":6.0,"72":7.0,"73":7.0,"74":8.0,"75":8.0,"76":9.0,"77":10.0,"78":10.0,"79":11.0,"80":11.0,"81":12.0,"82":1.0,"83":1.0,"84":2.0,"85":2.0},"Total_Revenue":{"0":35852.02,"1":2623.97,"2":3526.67,"3":21466.71,"4":72784.1200000003,"5":103921.2899999999,"6":10852.87,"7":16522.07,"8":7443.76,"9":68962.1600000002,"10":10956.38,"11":193856.8799999985,"12":110766.6099999997,"13":123861.8599999987,"14":2722.34,"15":303488.6900000007,"16":6876.58,"17":17729.5,"18":4687.93,"19":26914.06,"20":2228.12,"21":15708.93,"22":859.58,"23":19164.89,"24":163164.4799999995,"25":33180.7300000001,"26":10033.01,"27":1114.48,"28":462613.2900000042,"29":9822.95,"30":70901.4400000003,"31":22370.29,"32":46711.8900000002,"33":2335.02,"34":7259.28,"35":11.83,"36":13590.51,"37":7677.77,"38":282.01,"39":358522.7900000003,"40":5844.0,"41":7027.28,"42":1908.71,"43":4032.35,"44":11072.6,"45":3973.15,"46":30706.23,"47":2644.13,"48":23831.75,"49":670.12,"50":6949.54,"51":4687.7,"52":9672.69,"53":7333.01,"54":12814.33,"55":689.39,"56":6962.86,"57":2283.16,"58":1259.5,"59":224.84,"60":12812.12,"61":247.68,"62":25452.65,"63":1245.02,"64":24211.36,"65":5255.25,"66":28402.76,"67":9148.55,"68":14822.61,"69":345.37,"70":12408.13,"71":989.93,"72":10601.33,"73":730.32,"74":169020.5000000001,"75":697.54,"76":3862038.6799997138,"77":6148750.9899984254,"78":194.06,"79":2379382.4500000761,"80":1174.11,"81":1729567.9000000793,"82":889650.029999995,"83":95.8,"84":415996.6999999974,"85":654.78}}'
g = sns.FacetGrid(toy, col='has_cus_id_but_not_acc_id', hue='reg_year')
g.map(sns.barplot, 'reg_month', 'Total_Revenue')
g.add_legend();
If I use bar in pyplot I get this:
g = sns.FacetGrid(toy, col='has_cus_id_but_not_acc_id', hue='reg_year')
g.map(plt.bar, 'reg_month', 'Total_Revenue')
g.add_legend();
Again, I would like to be able to define the white space of the grid.
In addition I would not like to have the bars stacked one over the other but rather one next to the other.
Some values of the year 2018 are really large compared to the any of the values where has_cus_id_but_not_acc_id is 1. Hence the right plot is almost empty. It might make sense to use a logarithmic scale.
Now you have 6 years, so each month would need to show 6 bars next to each other. That will make bars pretty small and does not let the chart be easily readable. Still it's possible.
The following does not use seaborn, but pandas and matplotlib:
import matplotlib.pyplot as plt
import pandas as pd
toy = '{"has_cus_id_but_not_acc_id":{"0":0,"1":0,"2":0,"3":0,"4":0,"5":0,"6":0,"7":0,"8":0,"9":0,"10":0,"11":0,"12":0,"13":0,"14":0,"15":0,"16":0,"17":0,"18":1,"19":0,"20":0,"21":0,"22":1,"23":0,"24":0,"25":1,"26":0,"27":1,"28":0,"29":1,"30":0,"31":1,"32":0,"33":1,"34":0,"35":1,"36":0,"37":1,"38":0,"39":0,"40":1,"41":1,"42":0,"43":1,"44":0,"45":1,"46":0,"47":1,"48":0,"49":1,"50":0,"51":1,"52":0,"53":1,"54":0,"55":1,"56":0,"57":1,"58":0,"59":1,"60":0,"61":1,"62":0,"63":1,"64":0,"65":1,"66":0,"67":1,"68":0,"69":1,"70":0,"71":1,"72":0,"73":1,"74":0,"75":1,"76":0,"77":0,"78":1,"79":0,"80":1,"81":0,"82":0,"83":1,"84":0,"85":1},"reg_year":{"0":2014.0,"1":2014.0,"2":2014.0,"3":2014.0,"4":2014.0,"5":2014.0,"6":2014.0,"7":2014.0,"8":2015.0,"9":2015.0,"10":2015.0,"11":2015.0,"12":2015.0,"13":2015.0,"14":2015.0,"15":2015.0,"16":2015.0,"17":2016.0,"18":2016.0,"19":2016.0,"20":2016.0,"21":2016.0,"22":2016.0,"23":2016.0,"24":2016.0,"25":2016.0,"26":2016.0,"27":2016.0,"28":2016.0,"29":2016.0,"30":2016.0,"31":2016.0,"32":2016.0,"33":2016.0,"34":2016.0,"35":2016.0,"36":2016.0,"37":2016.0,"38":2017.0,"39":2017.0,"40":2017.0,"41":2017.0,"42":2017.0,"43":2017.0,"44":2017.0,"45":2017.0,"46":2017.0,"47":2017.0,"48":2017.0,"49":2017.0,"50":2017.0,"51":2017.0,"52":2017.0,"53":2017.0,"54":2017.0,"55":2017.0,"56":2017.0,"57":2017.0,"58":2017.0,"59":2017.0,"60":2018.0,"61":2018.0,"62":2018.0,"63":2018.0,"64":2018.0,"65":2018.0,"66":2018.0,"67":2018.0,"68":2018.0,"69":2018.0,"70":2018.0,"71":2018.0,"72":2018.0,"73":2018.0,"74":2018.0,"75":2018.0,"76":2018.0,"77":2018.0,"78":2018.0,"79":2018.0,"80":2018.0,"81":2018.0,"82":2019.0,"83":2019.0,"84":2019.0,"85":2019.0},"reg_month":{"0":3.0,"1":5.0,"2":6.0,"3":7.0,"4":9.0,"5":10.0,"6":11.0,"7":12.0,"8":1.0,"9":3.0,"10":5.0,"11":6.0,"12":7.0,"13":8.0,"14":9.0,"15":11.0,"16":12.0,"17":1.0,"18":1.0,"19":2.0,"20":3.0,"21":4.0,"22":4.0,"23":5.0,"24":6.0,"25":6.0,"26":7.0,"27":7.0,"28":8.0,"29":8.0,"30":9.0,"31":9.0,"32":10.0,"33":10.0,"34":11.0,"35":11.0,"36":12.0,"37":12.0,"38":1.0,"39":2.0,"40":2.0,"41":3.0,"42":4.0,"43":4.0,"44":5.0,"45":5.0,"46":6.0,"47":6.0,"48":7.0,"49":7.0,"50":8.0,"51":8.0,"52":9.0,"53":9.0,"54":10.0,"55":10.0,"56":11.0,"57":11.0,"58":12.0,"59":12.0,"60":1.0,"61":1.0,"62":2.0,"63":2.0,"64":3.0,"65":3.0,"66":4.0,"67":4.0,"68":5.0,"69":5.0,"70":6.0,"71":6.0,"72":7.0,"73":7.0,"74":8.0,"75":8.0,"76":9.0,"77":10.0,"78":10.0,"79":11.0,"80":11.0,"81":12.0,"82":1.0,"83":1.0,"84":2.0,"85":2.0},"Total_Revenue":{"0":35852.02,"1":2623.97,"2":3526.67,"3":21466.71,"4":72784.1200000003,"5":103921.2899999999,"6":10852.87,"7":16522.07,"8":7443.76,"9":68962.1600000002,"10":10956.38,"11":193856.8799999985,"12":110766.6099999997,"13":123861.8599999987,"14":2722.34,"15":303488.6900000007,"16":6876.58,"17":17729.5,"18":4687.93,"19":26914.06,"20":2228.12,"21":15708.93,"22":859.58,"23":19164.89,"24":163164.4799999995,"25":33180.7300000001,"26":10033.01,"27":1114.48,"28":462613.2900000042,"29":9822.95,"30":70901.4400000003,"31":22370.29,"32":46711.8900000002,"33":2335.02,"34":7259.28,"35":11.83,"36":13590.51,"37":7677.77,"38":282.01,"39":358522.7900000003,"40":5844.0,"41":7027.28,"42":1908.71,"43":4032.35,"44":11072.6,"45":3973.15,"46":30706.23,"47":2644.13,"48":23831.75,"49":670.12,"50":6949.54,"51":4687.7,"52":9672.69,"53":7333.01,"54":12814.33,"55":689.39,"56":6962.86,"57":2283.16,"58":1259.5,"59":224.84,"60":12812.12,"61":247.68,"62":25452.65,"63":1245.02,"64":24211.36,"65":5255.25,"66":28402.76,"67":9148.55,"68":14822.61,"69":345.37,"70":12408.13,"71":989.93,"72":10601.33,"73":730.32,"74":169020.5000000001,"75":697.54,"76":3862038.6799997138,"77":6148750.9899984254,"78":194.06,"79":2379382.4500000761,"80":1174.11,"81":1729567.9000000793,"82":889650.029999995,"83":95.8,"84":415996.6999999974,"85":654.78}}'
df = pd.read_json(toy)
df['reg_year'].astype(int)
u = df["has_cus_id_but_not_acc_id"].unique()
y = df['reg_year'].unique()
fig, axes = plt.subplots(1,len(u), sharey=True)
axes[0].set_yscale("log")
for ax, (n, grp) in zip(axes.flat, df.groupby("has_cus_id_but_not_acc_id")):
piv = grp.pivot('reg_month', 'reg_year', 'Total_Revenue')
empty = pd.DataFrame(index=range(1,12), columns=y)
empty.combine_first(piv).plot.bar(ax=ax, width=0.8, legend=False)
axes[1].legend()
plt.show()

Concatenating multiple barplots in seaborn

My data-frame contains the following column headers: subject, Group, MASQ_GDA, MASQ_AA, MASQ_GDD, MASQ_AD
I was successfully able to plot one of them using a bar plot with the following specifications:
bar_plot = sns.barplot(x="Group", y='MASQ_GDA', units="subject", ci = 68, hue="Group", data=demo_masq)
However, I am attempting to create several of such bar plot side by side. Might anyone know how I can accomplish this, for each plot to contain the remaining 3 variables (MASQ_AA, MASQ_GDD, MASQ_AD). Here is an example of what I am trying to achieve.
If you look in the documentation for sns.barplot(), you will see that the function accepts a parameter ax= allowing you to tell seaborn which Axes object to use to plot the result
ax : matplotlib Axes, optional
Axes object to draw the plot onto, otherwise uses the current Axes.
Therefore, the simple way to obtain the desired output is to create the Axes beforehand, and then calling sns.barplot() with the corresponding ax parameter
fig, axs = plt.subplots(1,4) # create 4 subplots on 1 row
for ax,col in zip(axs,["MASQ_GDA", "MASQ_AA", "MASQ_GDD", "MASQ_AD"]):
sns.barplot(x="Group", y=col, units="subject", ci = 68, hue="Group", data=demo_masq, ax=ax) # <- notice ax= argument
Another option, and maybe an option that is more in line with the philosophy of seaborn is to use a FacetGrid. This would allow you to automatically create the required number of subplots depending on the number of categories in your dataset. However, it requires to reshape your dataframe so that the content of your MASQ_* columns are on a single column, with a new column showing what category each value corresponds to.

Resources