Day of the week as X in Seaborn plot

I have a dataset with clicks and impressions. I aggregated them by day of week using groupby and agg:
df2=df.groupby('day_of_week',as_index=False, sort=True, group_keys=True).agg({'Clicks':'sum','Impressions':'sum'})
Then I tried to plot them using subplots:
f, axes = plt.subplots(2, 2, figsize=(7, 7))
sns.countplot(data=df2['Clicks'], x=df2['day_of_week'],ax=axes[0, 0])
sns.countplot(data=df2["Impressions"], x=df2['day_of_week'],ax=axes[0, 1])
But instead of using the day of week as X, the plot used the values in Clicks and Impressions. Is there a way to force X to be the day of the week, with the values on Y? Thanks.
Full code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")
df=pd.read_csv('data/data_clean.csv')
df2=df.groupby('day_of_week',as_index=False, sort=True, group_keys=True).agg({'Clicks':'sum','Impressions':'sum'})
f, axes = plt.subplots(2, 2, figsize=(7, 7))
sns.countplot(data=df2['Clicks'], x=df2['day_of_week'],ax=axes[0, 0])
sns.countplot(data=df2["Impressions"], x=df2['day_of_week'],ax=axes[0, 1])
plt.show()
Fake data:
day_of_week,Clicks,Impressions
0,100,2000
1,400,4000
2,300,3500
3,200,2000
4,100,1000
5,50,500
6,10,150

I was able to find the answer with seaborn with Peter's guidance.
The correct plotting code is
sns.barplot( x=df2['day_of_week'],y=df2['Clicks'] , color="skyblue", ax=axes[0, 0])
sns.barplot( x=df2['day_of_week'],y=df2['Impressions'] , color="olive", ax=axes[0, 1])
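It seems the problem was the choice of function rather than the argument order: countplot draws the number of rows in each category instead of a value column, so it cannot show the pre-aggregated sums. barplot plots the y column against the x categories.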
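As a minimal sketch of the corrected approach, using the fake data from the question: the frame is passed through data= and the columns are named as strings, which is one common style and not the only valid one. The 1x2 layout is an assumption made for brevity.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# The fake data from the question, already aggregated per day of week.
df2 = pd.DataFrame({
    "day_of_week": [0, 1, 2, 3, 4, 5, 6],
    "Clicks": [100, 400, 300, 200, 100, 50, 10],
    "Impressions": [2000, 4000, 3500, 2000, 1000, 500, 150],
})

fig, axes = plt.subplots(1, 2, figsize=(9, 4))

# data= takes the whole frame; x= and y= name columns in it.
sns.barplot(data=df2, x="day_of_week", y="Clicks", color="skyblue", ax=axes[0])
sns.barplot(data=df2, x="day_of_week", y="Impressions", color="olive", ax=axes[1])

fig.tight_layout()
```

Call plt.show() afterwards to display the figure in an interactive session.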

Based on the seaborn docs, I think countplot expects a long-form dataframe such as your df, not the pre-aggregated df2 that you built and pass in your question. countplot does the counting for you.
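To illustrate the difference, here is a small sketch of countplot on long-form data. The events frame is hypothetical (one row per recorded click), since the original df is not shown.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical long-form data: one row per recorded click event.
events = pd.DataFrame({"day_of_week": [0, 0, 1, 1, 1, 2]})

fig, ax = plt.subplots()
# countplot tallies the rows per category itself; no y column is supplied.
sns.countplot(data=events, x="day_of_week", ax=ax)

# The same tally computed directly with pandas, for comparison.
counts = events["day_of_week"].value_counts().sort_index()
```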
However, your df2 is ready for a pandas bar plot:
df2.plot(kind='bar', y=['Impressions', 'Clicks'])

Related

Python - Add new curve from a df into existing lineplot

I created a plot using seaborn based on a DataFrame.
Now I would like to add a new curve from another DataFrame to the plot created previously.
This is the code of my plot:
tline = sns.lineplot(x='reads', y='time', data=df, hue='method', style='method', markers=True, dashes=False, ax=axs[0, 0])
tline.set_xlabel('Numero di reads')
tline.set_ylabel ('Time [s]')
tline.legend(loc='lower right')
tline.set_yscale('log')
tline.autoscale(enable=True, axis='x')
tline.autoscale(enable=True, axis='y')
Now I have another DataFrame with the same columns as the first one. How can I add it as a new curve with a custom entry in the legend?
This is the structure of the DataFrame:
Dataset  Method  Reads     Time   Peak-memory
14M      Set     14000000  7.33   1035204
20K      Set     200000    0.38   107464
200K     Set     20000     0.07   42936
2M       Set     28428648  16.09  2347740
28M      Set     2000000   1.41   240240
I suggest using matplotlib's object-oriented interface, like this:
import numpy as np
from matplotlib import pyplot as plt
import pandas as pd
import seaborn as sns
# generate sample data
time_column = np.arange(10)
data_column1 = np.random.randint(0, 10, 10)
data_column2 = np.random.randint(0, 10, 10)
# store in pandas dfs
df1 = pd.DataFrame(zip(time_column, data_column1), columns=['Time', 'Data'])
df2 = pd.DataFrame(zip(time_column, data_column2), columns=['Time', 'Data'])
f, ax = plt.subplots()
# pass x and y by keyword; recent seaborn versions reject them as positional arguments
sns.lineplot(x=df1.Time, y=df1.Data, label='foo', ax=ax)
sns.lineplot(x=df2.Time, y=df2.Data, label='bar', ax=ax)
ax.legend()
plt.show()
The important thing is that both line plots are drawn on the same subplot (ax in this case).

Need to force overlapping for seaborn's heatmap and kdeplot

I'm trying to combine seaborn's heatmap and kdeplot in one figure, but so far the result is not very promising since I cannot find a way to make them overlap. As a result, the heatmap is just squeezed to the left side of the figure.
I think the reason is that seaborn doesn't seem to recognize the x-axis as the same one in two charts (see picture below), although the data points are exactly the same. The only difference is that for heatmap I needed to pivot them, while for the kdeplot pivoting is not needed.
Therefore, the data for both axes come from the same dataset but in different forms, as can be seen in the code below.
The dataset sample looks something like this:
X Y Z
7,75 280 52,73
3,25 340 54,19
5,75 340 53,61
2,5 180 54,67
3 340 53,66
1,75 340 54,81
4,5 380 55,18
4 240 56,49
4,75 380 55,17
4,25 180 55,40
2 420 56,42
2,25 380 54,90
My code:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
f, ax = plt.subplots(figsize=(11, 9), dpi=300)
plt.tick_params(bottom='on')
# dataset is just a pandas frame with data
X1 = dataset.iloc[:, :3].pivot("X", "Y", "Z")
X2 = dataset.iloc[:, :2]
ax = sns.heatmap(X1, cmap="Spectral")
ax.invert_yaxis()
ax2 = plt.twinx()
sns.kdeplot(X2.iloc[:, 1], X2.iloc[:, 0], ax=ax2, zorder=2)
ax.axis('tight')
plt.show()
Please help me with placing kdeplot on top of the heatmap. Ideally, I would like my final plot to look something like this:
Any tips or hints will be greatly appreciated!
The question can be a bit hard to understand, because the dataset can't be "just some data". The X and Y values need to lie on a very regular grid. No X,Y combination can be repeated, but not all values appear. The kdeplot will then show where the used values of X,Y are concentrated.
Such a dataset can be simulated by first generating dummy data for a full grid and then taking a subset.
Now, a seaborn heatmap uses categorical X and Y axes. Such axes are very hard to align with the kdeplot. To obtain a similar heatmap with numerical axes, ax.pcolor() can be used.
from matplotlib import pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
xs = np.arange(2, 10, 0.25)
ys = np.arange(150, 400, 10)
# first create a dummy dataset over a full grid
dataset = pd.DataFrame({'X': np.repeat(xs, len(ys)),
'Y': np.tile(ys, len(xs)),
'Z': np.random.uniform(50, 60, len(xs) * len(ys))})
# take a random subset of the rows
dataset = dataset.sample(200)
fig, ax = plt.subplots(figsize=(11, 9), dpi=300)
# keyword arguments are required by recent pandas versions
X1 = dataset.pivot(index="X", columns="Y", values="Z")
collection = ax.pcolor(X1.columns, X1.index, X1, shading='nearest', cmap="Spectral")
plt.colorbar(collection, ax=ax, pad=0.02)
# default, cut=3, which causes a lot of surrounding whitespace
sns.kdeplot(x=dataset["Y"], y=dataset["X"], cut=1.5, ax=ax)
fig.tight_layout()
plt.show()
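The difference between the categorical heatmap axes and the numerical pcolor axes can be seen by comparing the axis limits. The sketch below uses a tiny hypothetical grid. The heatmap limits follow the number of cells, while the pcolor limits follow the data values, which is what a kdeplot on the same Axes needs.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Small hypothetical grid, pivoted the same way as in the answer.
grid = pd.DataFrame({"X": np.repeat([2.0, 2.5, 3.0], 2),
                     "Y": np.tile([150, 160], 3),
                     "Z": np.arange(6.0)})
table = grid.pivot(index="X", columns="Y", values="Z")

fig, (ax_cat, ax_num) = plt.subplots(1, 2, figsize=(8, 3))

# heatmap draws cell i between i and i + 1 and only labels it with the value.
sns.heatmap(table, ax=ax_cat, cbar=False)

# pcolor places the cells at the actual Y (columns) and X (index) values.
ax_num.pcolor(table.columns.to_numpy(), table.index.to_numpy(),
              table.to_numpy(), shading="nearest")

heatmap_xlim = ax_cat.get_xlim()   # spans the number of columns
numeric_xlim = ax_num.get_xlim()   # spans the range of Y values
```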

create density plots of continuous field by categorical field

I have the code below which overlays a density curve on a histogram. It does this for the ‘Fresh’ field in my data, which is a continuous field. I would like to create similar plots filtering by the unique values in the ‘Channel’ field. For example in pandas to create histograms similar to what I'm trying to accomplish I would use:
data_df.hist(column='Fresh', by='Channel')
Can anyone suggest how to do something similar for the seaborn code below?
code:
import seaborn as sns
sns.distplot(data_df['Fresh'], hist=True, kde=True,
             bins=int(data_df.shape[0]/5), color='darkblue',
             hist_kws={'edgecolor': 'black'},
             kde_kws={'linewidth': 4})
data
Channel Fresh
0 2 12669
1 2 7057
2 2 6353
3 1 13265
4 2 22615
5 2 9413
6 2 12126
7 2 7579
8 1 5963
9 2 6006
I think the Seaborn way is to create a FacetGrid, and then to map an axis-level plotting function onto it. In your case:
g = sns.FacetGrid(data_df, col='Channel', margin_titles=True)
g.map(sns.distplot,
      'Fresh',
      bins=int(data_df.shape[0]/5),
      color='darkblue',
      hist_kws={'edgecolor': 'black'},
      kde_kws={'linewidth': 4})
Check out the docs for more: https://seaborn.pydata.org/tutorial/axis_grids.html
Alternatively, you can group your DataFrame by Channel and then plot the groups in separate subplots:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
data_df = pd.DataFrame({'Channel': [2, 2, 2, 1, 2, 2, 2, 2, 1, 2],
                        'Fresh': [12669, 7057, 6353, 13265, 22615,
                                  9413, 12126, 7579, 5963, 6006]})
df1 = data_df.groupby('Channel')
fig, axes = plt.subplots(nrows=1, ncols=len(df1), figsize=(10, 3))
# df1.groups maps each Channel value to its row labels; iterate over the keys
for ax, key in zip(axes.flatten(), df1.groups):
    sns.distplot(df1.get_group(key)['Fresh'], hist=True, kde=True,
                 bins=int(data_df.shape[0]/5), color='darkblue',
                 hist_kws={'edgecolor': 'black'},
                 kde_kws={'linewidth': 4}, ax=ax)
plt.tight_layout()

How do I plot the dataframe with only 2 columns(text and int)?

index reviews label
0 0 i admit the great majority of... 1
1 1 take a low budget inexperienced ... 0
2 2 everybody has seen back to th... 1
3 3 doris day was an icon of b... 0
4 4 after a series of silly fun ... 0
I have a dataframe of movie reviews, and I've predicted the label column (1 = positive, 0 = negative review) using kmeans.labels_. How do I visualise/plot the above?
Desired output: scatter plot of 1's and 0's
Code tried :
colors = ['red', 'blue']
pred_colors = [colors[label] for label in km.labels_]
import matplotlib.pyplot as plt
%matplotlib inline
plt.scatter(x='index',y='label',c=pred_colors)
Output: Plot with a red dot at center
This plot comes from:
http://www3.ntu.edu.sg/home/ehchua/programming/webprogramming/Python4_DataAnalysis.html
You do not have values to plot on the x-axis, so we can simply use the index.
The reviews could be added to data as another column.
import pandas as pd
from matplotlib import pyplot as plt
data = [1,0,1,0,0]
df = pd.DataFrame(data, index=range(5), columns=['label'])
#
# line plot
#df.reset_index().plot(x='index', y='label') # turn index into column for plotting on x-axis
#
# scatter plot
ax1 = df.reset_index().plot.scatter(x='index', y='label', c='DarkBlue')
#
plt.tight_layout() # helps prevent labels from being cropped
plt.show()
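To keep the red/blue colouring from the question, a sketch along the following lines may work. It assumes the label list stands in for km.labels_, which is not available here.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Labels copied from the five sample rows; km.labels_ would take their place.
df = pd.DataFrame({"label": [1, 0, 1, 0, 0]})

colors = ["red", "blue"]
pred_colors = [colors[label] for label in df["label"]]

fig, ax = plt.subplots()
# Pass the actual values: bare strings such as 'index' are only looked up
# as column names when a data= argument is supplied as well.
ax.scatter(df.index, df["label"], c=pred_colors)
ax.set_xlabel("index")
ax.set_ylabel("label")
ax.set_yticks([0, 1])
```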

Seaborn barplot with two y-axis

Consider the following pandas DataFrame:
labels values_a values_b values_x values_y
0 date1 1 3 150 170
1 date2 2 6 200 180
It is easy to plot this with Seaborn (see the example code below). However, because of the large difference between values_a/values_b and values_x/values_y, the bars for values_a and values_b are barely visible (the dataset above is just a sample; in my real dataset the difference is even bigger). Therefore, I would like to use two y-axes: one for values_a/values_b and one for values_x/values_y. I tried plt.twinx() to get a second axis, but the plot shows only the two bars for values_x and values_y, even though there are two y-axes with the right scaling. :) Do you have an idea how to fix that and get four bars per label, with the values_a/values_b bars on the left y-axis and the values_x/values_y bars on the right y-axis?
Thanks in advance!
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
columns = ["labels", "values_a", "values_b", "values_x", "values_y"]
test_data = pd.DataFrame.from_records([("date1", 1, 3, 150, 170),\
("date2", 2, 6, 200, 180)],\
columns=columns)
# working example but with unreadable values_a and values_b
test_data_melted = pd.melt(test_data, id_vars=columns[0],\
var_name="source", value_name="value_numbers")
g = sns.barplot(x=columns[0], y="value_numbers", hue="source",\
data=test_data_melted)
plt.show()
# values_a and values_b are not displayed
values1_melted = pd.melt(test_data, id_vars=columns[0],\
value_vars=["values_a", "values_b"],\
var_name="source1", value_name="value_numbers1")
values2_melted = pd.melt(test_data, id_vars=columns[0],\
value_vars=["values_x", "values_y"],\
var_name="source2", value_name="value_numbers2")
g1 = sns.barplot(x=columns[0], y="value_numbers1", hue="source1",\
data=values1_melted)
ax2 = plt.twinx()
g2 = sns.barplot(x=columns[0], y="value_numbers2", hue="source2",\
data=values2_melted, ax=ax2)
plt.show()
This is probably best suited for multiple sub-plots, but if you are truly set on a single plot, you can scale the data before plotting, create another axis and then modify the tick values.
Sample Data
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import numpy as np
columns = ["labels", "values_a", "values_b", "values_x", "values_y"]
test_data = pd.DataFrame.from_records([("date1", 1, 3, 150, 170),\
("date2", 2, 6, 200, 180)],\
columns=columns)
test_data_melted = pd.melt(test_data, id_vars=columns[0],\
var_name="source", value_name="value_numbers")
Code:
# Scale the data, just a simple example of how you might determine the scaling
mask = test_data_melted.source.isin(['values_a', 'values_b'])
scale = int(test_data_melted[~mask].value_numbers.mean()
/test_data_melted[mask].value_numbers.mean())
test_data_melted.loc[mask, 'value_numbers'] = test_data_melted.loc[mask, 'value_numbers']*scale
# Plot
fig, ax1 = plt.subplots()
g = sns.barplot(x=columns[0], y="value_numbers", hue="source",\
data=test_data_melted, ax=ax1)
# Create a second y-axis with the scaled ticks
ax1.set_ylabel('X and Y')
ax2 = ax1.twinx()
# Put ticks at the same positions on both axes, then relabel the second one
ax2.set_yticks(ax1.get_yticks())
ax2.set_ylim(ax1.get_ylim())
ax2.set_yticklabels(np.round(ax1.get_yticks()/scale,1))
ax2.set_ylabel('A and B')
plt.show()
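For comparison, the multiple-subplots option mentioned at the start of this answer could be sketched as follows. The grouping of columns into two panels is an assumption based on the question. Each panel gets its own y-axis, so no rescaling is needed.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

columns = ["labels", "values_a", "values_b", "values_x", "values_y"]
test_data = pd.DataFrame.from_records(
    [("date1", 1, 3, 150, 170), ("date2", 2, 6, 200, 180)], columns=columns)

# One subplot per group of columns, each with its own y-axis scale.
groups = [["values_a", "values_b"], ["values_x", "values_y"]]
fig, axes = plt.subplots(1, 2, figsize=(9, 4))
for ax, value_vars in zip(axes, groups):
    melted = test_data.melt(id_vars="labels", value_vars=value_vars,
                            var_name="source", value_name="value_numbers")
    sns.barplot(data=melted, x="labels", y="value_numbers", hue="source", ax=ax)
fig.tight_layout()
```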
