Create density plots of a continuous field by a categorical field

I have the code below, which overlays a density curve on a histogram. It does this for the 'Fresh' field in my data, which is a continuous field. I would like to create similar plots, one for each unique value in the 'Channel' field. For example, in pandas I would create histograms similar to what I'm trying to accomplish with:
data_df.hist(column='Fresh', by='Channel')
Can anyone suggest how to do something similar for the seaborn code below?
Code:
import seaborn as sns
sns.distplot(data_df['Fresh'], hist=True, kde=True,
             bins=int(data_df.shape[0]/5), color='darkblue',
             hist_kws={'edgecolor': 'black'},
             kde_kws={'linewidth': 4})
Data:
   Channel  Fresh
0        2  12669
1        2   7057
2        2   6353
3        1  13265
4        2  22615
5        2   9413
6        2  12126
7        2   7579
8        1   5963
9        2   6006

I think the Seaborn way is to create a FacetGrid, and then to map an axis-level plotting function onto it. In your case:
g = sns.FacetGrid(data_df, col='Channel', margin_titles=True)
g.map(sns.distplot,
      'Fresh',
      bins=int(data_df.shape[0]/5),
      color='darkblue',
      hist_kws={'edgecolor': 'black'},
      kde_kws={'linewidth': 4});
Check out the docs for more: https://seaborn.pydata.org/tutorial/axis_grids.html

Alternatively, you can group your DataFrame by Channel and then plot the two groups in separate subplots:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
data_df = pd.DataFrame({'Channel': [2, 2, 2, 1, 2, 2, 2, 2, 1, 2],
                        'Fresh': [12669, 7057, 6353, 13265, 22615,
                                  9413, 12126, 7579, 5963, 6006]})
df1 = data_df.groupby('Channel')
fig, axes = plt.subplots(nrows=1, ncols=len(df1), figsize=(10, 3))
# one subplot per group; df is the group key (the Channel value)
for ax, df in zip(axes.flatten(), df1.groups):
    sns.distplot(df1.get_group(df)['Fresh'], hist=True, kde=True,
                 bins=int(data_df.shape[0]/5), color='darkblue',
                 hist_kws={'edgecolor': 'black'},
                 kde_kws={'linewidth': 4}, ax=ax)
plt.tight_layout()

Related

How to plot histogram subplots for each group

When I run the following code, I get 4 different histograms separated by groups. How can I achieve the same type of visualization with 4 different sns.distplot() also separated by their groups?
df = pd.DataFrame({
    "group": [1, 1, 2, 2, 3, 3, 4, 4],
    "similarity": [0.1, 0.2, 0.35, 0.6, 0.7, 0.25, 0.15, 0.55]
})
df['similarity'].hist(by=df['group'])
seaborn is a high-level api for matplotlib, and pandas uses matplotlib as the default plotting backend.
Since seaborn v0.11, sns.distplot is deprecated, and, as per the Warning in the documentation, it is not recommended to directly use FacetGrid.
sns.distplot is replaced by the axes-level function sns.histplot, and the figure-level function sns.displot.
Also see seaborn histplot and displot output doesn't match
It is easy to produce a plot, but not necessarily to produce the correct plot, unless you are aware of the different parameter defaults for each api.
Note the difference between common_bins as True and False.
Tested in python 3.10, pandas 1.4.2, matplotlib 3.5.1, seaborn 0.11.2
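To get a feel for what common_bins changes, it can help to compare bin edges computed over all the data with bin edges computed per group. The sketch below uses np.histogram_bin_edges as a stand-in for illustration; it is not seaborn's internal code, just the same idea.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "group": [1, 1, 2, 2, 3, 3, 4, 4],
    "similarity": [0.1, 0.2, 0.35, 0.6, 0.7, 0.25, 0.15, 0.55],
})

# Shared edges: one set of 4 bins spanning the range of the whole column
shared_edges = np.histogram_bin_edges(df["similarity"], bins=4)

# Per-group edges: each group gets 4 bins spanning only its own range
per_group_edges = {
    key: np.histogram_bin_edges(values, bins=4)
    for key, values in df.groupby("group")["similarity"]
}

print("shared:", shared_edges)
for key, edges in per_group_edges.items():
    print("group", key, edges)
```

With shared edges every facet uses the same x positions for its bars, which is what makes the panels directly comparable; with per-group edges the bar widths differ from panel to panel.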
common_bins=False
import seaborn as sns
# plot
g = sns.displot(data=df, x='similarity', col='group', col_wrap=2, common_bins=False, height=4)
common_bins=True (bins=4)
sns.displot, and pandas.DataFrame.plot with kind='hist' and bins=4 produce the same plot.
g = sns.displot(data=df, x='similarity', col='group', col_wrap=2, common_bins=True, bins=4, height=4)
# reshape the dataframe to a wide format
dfp = df.pivot(columns='group', values='similarity')
axes = dfp.plot(kind='hist', subplots=True, layout=(2, 2), figsize=(9, 9), ec='k', bins=4, sharey=True)
You can use FacetGrid from seaborn:
import seaborn as sns
g = sns.FacetGrid(data=df, col='group', col_wrap=2)
g.map(sns.histplot, 'similarity')

Plot multiple histograms with seaborn

I have a data frame with 36 columns. I want to plot histograms for each feature in one go (6x6) using seaborn, basically reproducing df.hist() but with seaborn. My code below shows the plot for only the first feature; all the others come out empty.
Test dataframe:
df = pd.DataFrame(np.random.randint(0,100,size=(100, 36)), columns=range(0,36))
My code:
import matplotlib.pyplot as plt
import seaborn as sns
# plot
f, axes = plt.subplots(6, 6, figsize=(20, 20), sharex=True)
for feature in df.columns:
    sns.distplot(df[feature], color="skyblue", ax=axes[0, 0])
I guess it would make sense to loop over the axes and features simultaneously.
f, axes = plt.subplots(6, 6, figsize=(20, 20), sharex=True)
for ax, feature in zip(axes.flat, df.columns):
    sns.distplot(df[feature], color="skyblue", ax=ax)
NumPy arrays are flattened row by row, i.e. you would get features 0 to 5 in the first row, features 6 to 11 in the second row, etc.
If this is not what you want, you can define the index for the axes array manually:
f, axes = plt.subplots(6, 6, figsize=(20, 20), sharex=True)
for i, feature in enumerate(df.columns):
    sns.distplot(df[feature], color="skyblue", ax=axes[i % 6, i // 6])
The above fills the subplots column by column.
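If the index arithmetic is hard to picture, a small integer grid can stand in for the axes array. In this sketch each cell of slots stores its row-major position, so the printed order shows which subplot each feature would land in. The names slots, row_major, col_major and manual are only for illustration.

```python
import numpy as np

nrows, ncols = 6, 6

# Stand-in for the axes array: each cell stores its row-major slot number
slots = np.arange(nrows * ncols).reshape(nrows, ncols)

# Order visited by zip(axes.flat, df.columns): left to right, then next row
row_major = slots.ravel()

# Transposing first walks down each column before moving to the next one
col_major = slots.T.ravel()

# The manual index axes[i % 6, i // 6] visits the cells in the same order
manual = np.array([slots[i % nrows, i // nrows] for i in range(nrows * ncols)])

print(row_major[:7])
print(col_major[:7])
print(np.array_equal(col_major, manual))
```

So zip(axes.T.flat, df.columns) should be an alternative to the modulo arithmetic when you want the column-by-column fill.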

Seaborn barplot with two y-axis

Consider the following pandas DataFrame:
  labels  values_a  values_b  values_x  values_y
0  date1         1         3       150       170
1  date2         2         6       200       180
It is easy to plot this with Seaborn (see the example code below). However, because of the big difference between values_a/values_b and values_x/values_y, the bars for values_a and values_b are barely visible (the dataset above is just a sample; in my real dataset the difference is even bigger). Therefore, I would like to use two y-axes: one for values_a/values_b and one for values_x/values_y. I tried plt.twinx() to get a second axis, but the plot shows only two bars per label, for values_x and values_y, even though there are now two y-axes with the right scaling. :) Do you have an idea how to fix that and get four bars for each label, with the values_a/values_b bars relating to the left y-axis and the values_x/values_y bars relating to the right y-axis?
Thanks in advance!
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
columns = ["labels", "values_a", "values_b", "values_x", "values_y"]
test_data = pd.DataFrame.from_records([("date1", 1, 3, 150, 170),
                                       ("date2", 2, 6, 200, 180)],
                                      columns=columns)
# working example but with unreadable values_a and values_b
test_data_melted = pd.melt(test_data, id_vars=columns[0],
                           var_name="source", value_name="value_numbers")
g = sns.barplot(x=columns[0], y="value_numbers", hue="source",
                data=test_data_melted)
plt.show()
# values_a and values_b are not displayed
values1_melted = pd.melt(test_data, id_vars=columns[0],
                         value_vars=["values_a", "values_b"],
                         var_name="source1", value_name="value_numbers1")
values2_melted = pd.melt(test_data, id_vars=columns[0],
                         value_vars=["values_x", "values_y"],
                         var_name="source2", value_name="value_numbers2")
g1 = sns.barplot(x=columns[0], y="value_numbers1", hue="source1",
                 data=values1_melted)
ax2 = plt.twinx()
g2 = sns.barplot(x=columns[0], y="value_numbers2", hue="source2",
                 data=values2_melted, ax=ax2)
plt.show()
This is probably best suited for multiple sub-plots, but if you are truly set on a single plot, you can scale the data before plotting, create another axis and then modify the tick values.
Sample Data
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import numpy as np
columns = ["labels", "values_a", "values_b", "values_x", "values_y"]
test_data = pd.DataFrame.from_records([("date1", 1, 3, 150, 170),
                                       ("date2", 2, 6, 200, 180)],
                                      columns=columns)
test_data_melted = pd.melt(test_data, id_vars=columns[0],
                           var_name="source", value_name="value_numbers")
Code:
# Scale the data; this is just a simple example of how you might determine the scaling
mask = test_data_melted.source.isin(['values_a', 'values_b'])
scale = int(test_data_melted[~mask].value_numbers.mean()
            / test_data_melted[mask].value_numbers.mean())
test_data_melted.loc[mask, 'value_numbers'] = test_data_melted.loc[mask, 'value_numbers'] * scale
# Plot
fig, ax1 = plt.subplots()
g = sns.barplot(x=columns[0], y="value_numbers", hue="source",
                data=test_data_melted, ax=ax1)
# Create a second y-axis with the scaled ticks
ax1.set_ylabel('X and Y')
ax2 = ax1.twinx()
# Ensure ticks occur at the same positions, then modify labels
# (fix the tick positions first so the new labels stay attached to them)
ax2.set_yticks(ax1.get_yticks())
ax2.set_ylim(ax1.get_ylim())
ax2.set_yticklabels(np.round(ax1.get_yticks() / scale, 1))
ax2.set_ylabel('A and B')
plt.show()
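For comparison, the multiple-subplot route mentioned at the start of this answer might look like the sketch below. It uses the pandas bar plot on the wide-format data rather than seaborn, which is my own choice to keep it short; each subplot gets its own independent y-axis, so no rescaling is needed.

```python
import matplotlib.pyplot as plt
import pandas as pd

columns = ["labels", "values_a", "values_b", "values_x", "values_y"]
test_data = pd.DataFrame.from_records(
    [("date1", 1, 3, 150, 170), ("date2", 2, 6, 200, 180)],
    columns=columns,
)

# One subplot per value range, each with its own y-axis scale
fig, (ax_small, ax_large) = plt.subplots(ncols=2, figsize=(9, 4))
test_data.plot.bar(x="labels", y=["values_a", "values_b"], ax=ax_small, rot=0)
test_data.plot.bar(x="labels", y=["values_x", "values_y"], ax=ax_large, rot=0)
ax_small.set_ylabel("A and B")
ax_large.set_ylabel("X and Y")
fig.tight_layout()
```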

day of the week as X in Seaborn plot

I have a dataset with clicks and impressions, which I aggregated by day of week using groupby and agg:
df2=df.groupby('day_of_week',as_index=False, sort=True, group_keys=True).agg({'Clicks':'sum','Impressions':'sum'})
Then I tried to plot them using subplots:
f, axes = plt.subplots(2, 2, figsize=(7, 7))
sns.countplot(data=df2['Clicks'], x=df2['day_of_week'],ax=axes[0, 0])
sns.countplot(data=df2["Impressions"], x=df2['day_of_week'],ax=axes[0, 1])
But instead of using the day of week as X, the plot used the values in Clicks and Impressions. Is there a way to force X to be the day of the week, with the values on Y? Thanks.
Full code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")
df=pd.read_csv('data/data_clean.csv')
df2=df.groupby('day_of_week',as_index=False, sort=True, group_keys=True).agg({'Clicks':'sum','Impressions':'sum'})
f, axes = plt.subplots(2, 2, figsize=(7, 7))
sns.countplot(data=df2['Clicks'], x=df2['day_of_week'],ax=axes[0, 0])
sns.countplot(data=df2["Impressions"], x=df2['day_of_week'],ax=axes[0, 1])
plt.show()
Fake Data:
day_of_week  Clicks  Impressions
          0     100         2000
          1     400         4000
          2     300         3500
          3     200         2000
          4     100         1000
          5      50          500
          6      10          150
I was able to find the answer with seaborn with Peter's guidance.
The correct plotting code is
sns.barplot(x=df2['day_of_week'], y=df2['Clicks'], color="skyblue", ax=axes[0, 0])
sns.barplot(x=df2['day_of_week'], y=df2['Impressions'], color="olive", ax=axes[0, 1])
It seems seaborn by default would take the first variable as X instead of Y.
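The same fix can also be written with data=df2 and column names, which makes the x/y roles explicit. This is a self-contained sketch using the fake data from the question; the 1x2 layout is my simplification of the original 2x2 grid.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Fake, already-aggregated data from the question (one row per weekday)
df2 = pd.DataFrame({
    "day_of_week": [0, 1, 2, 3, 4, 5, 6],
    "Clicks": [100, 400, 300, 200, 100, 50, 10],
    "Impressions": [2000, 4000, 3500, 2000, 1000, 500, 150],
})

fig, axes = plt.subplots(1, 2, figsize=(9, 4))

# barplot draws the y values as bar heights; countplot would count rows instead
sns.barplot(data=df2, x="day_of_week", y="Clicks", color="skyblue", ax=axes[0])
sns.barplot(data=df2, x="day_of_week", y="Impressions", color="olive", ax=axes[1])
fig.tight_layout()
```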
Based on the seaborn docs, I think countplot expects a long-form dataframe such as your df, not the pre-aggregated df2 that you built and pass in your question. countplot does the counting for you.
However, your df2 is ready for a pandas bar plot:
df2.plot(kind='bar', y=['Impressions', 'Clicks'])

Second y-axis and overlapping labeling?

I am using Python for a simple time-series analysis of calorie intake. I am plotting the time series and the rolling mean/std over time.
Here is how I do it:
## packages & libraries
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
from pandas import Series, DataFrame
## import data and set time series structure
data = pd.read_csv('time_series_calories.csv', parse_dates={'dates': ['year','month','day']}, index_col=0)
## check ts for stationarity
from statsmodels.tsa.stattools import adfuller
def test_stationarity(timeseries):
    # Determine rolling statistics
    # (pd.rolling_mean and pd.rolling_std no longer exist in current pandas)
    rolmean = timeseries.rolling(window=14).mean()
    rolstd = timeseries.rolling(window=14).std()
    # Plot rolling statistics
    orig = plt.plot(timeseries, color='blue', label='Original')
    mean = plt.plot(rolmean, color='red', label='Rolling Mean')
    std = plt.plot(rolstd, color='black', label='Rolling Std')
    plt.legend(loc='best')
    plt.title('Rolling Mean & Standard Deviation')
    plt.show()
The plot doesn't look good, since the rolling std distorts the scale of variation and the x-axis labelling is messed up. I have two questions: (1) How can I plot the rolling std on a second y-axis? (2) How can I fix the overlapping x-axis labels?
EDIT
With your help I managed to improve the plot.
But how do I get the legend sorted out?
1) A second (twin) axis can be created with ax2 = ax1.twinx(). Is this what you needed?
2) I believe there are several older answers to this question. According to them, the easiest way is probably to use plt.xticks(rotation=70), plt.setp(ax.xaxis.get_majorticklabels(), rotation=70), or fig.autofmt_xdate().
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
ax.plot([1, 2, 3, 4, 5], [1, 2, 3, 4, 5])
plt.xticks(rotation=70) # Either this
ax.set_xticks([1, 2, 3, 4, 5])
ax.set_xticklabels(['aaaaaaaaaaaaaaaa','bbbbbbbbbbbbbbbbbb','cccccccccccccccccc','ddddddddddddddddddd','eeeeeeeeeeeeeeeeee'])
# fig.autofmt_xdate() # or this
# plt.setp( ax.xaxis.get_majorticklabels(), rotation=70 ) # or this works
fig.tight_layout()
plt.show()
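Putting both points together for the rolling statistics from the question could look roughly like this. The data here is synthetic (random numbers standing in for the calorie CSV, which I don't have), and the axis labels are my own guesses.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic daily series standing in for the calorie data (an assumption)
rng = np.random.default_rng(0)
dates = pd.date_range("2016-01-01", periods=120, freq="D")
timeseries = pd.Series(2000 + rng.normal(0, 150, size=len(dates)), index=dates)

rolmean = timeseries.rolling(window=14).mean()
rolstd = timeseries.rolling(window=14).std()

fig, ax1 = plt.subplots()
ax1.plot(timeseries, color="blue", label="Original")
ax1.plot(rolmean, color="red", label="Rolling Mean")
ax1.set_ylabel("Calories")

# Second y-axis sharing the same x-axis, used only for the rolling std
ax2 = ax1.twinx()
ax2.plot(rolstd, color="black", label="Rolling Std")
ax2.set_ylabel("Rolling Std")

fig.autofmt_xdate()  # rotate the date labels so they do not overlap
fig.tight_layout()
```

In a script you would finish with plt.show(). The first 13 values of each rolling series are NaN because the 14-day window is not yet full, so those lines start a little later than the original.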
Answer to Edit
One way to combine lines from different axes into one legend is to create some fake plots on the axis that should hold the legend:
ax1.plot(something, 'r--')  # one plot into ax1 (something is a placeholder for your data)
ax2.plot(something_else, 'gx')  # another into ax2 (also a placeholder)
# create two empty plots into ax1
ax1.plot([], [], 'r--', label='Line 1 from ax1')  # empty fake plot with the same line/marker style as the first line you want in the legend
ax1.plot([], [], 'gx', label='Line 2 from ax2')  # empty fake plot for line 2
ax1.legend()
In my silly example it is probably better to label the original plot in ax1, but I hope you get the idea. The important thing is to create the "legend-plots" with the same line and marker settings as the original plots. Note that the fake-plots will not be plotted since there is no data to plot.
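A different approach from the fake plots above, which I think is also common, is to collect the handles and labels from both axes and pass them to a single legend call. A minimal sketch with made-up data:

```python
import matplotlib.pyplot as plt

fig, ax1 = plt.subplots()
ax2 = ax1.twinx()

# Made-up data; the labels are what will appear in the legend
ax1.plot([1, 2, 3], [10, 20, 15], "r--", label="Line 1 from ax1")
ax2.plot([1, 2, 3], [0.5, 0.2, 0.9], "gx", label="Line 2 from ax2")

# Gather the labelled artists from each axis and build one legend from them
handles1, labels1 = ax1.get_legend_handles_labels()
handles2, labels2 = ax2.get_legend_handles_labels()
legend = ax1.legend(handles1 + handles2, labels1 + labels2, loc="best")
```

This avoids duplicating the line styles by hand, since the legend entries are taken from the real plotted lines.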
