Matplotlib Annotate using values from DataFrame - python-3.x

I'm trying to annotate a chart to include the plotted values of the x-axis as well as additional information from the DataFrame. I am able to annotate the values from the x-axis but not sure how I can add additional information from the data frame. In my example below I am annotating the x-axis which are the values from the Completion column but also want to add the Completed and Participants values from the DataFrame.
For example the Running Completion is 20% but I want my annotation to show the Completed and Participants values in the format - 20% (2/10). Below is sample code that can reproduce my scenario as well as current and desired results. Any help is appreciated.
Code:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
mydict = {
'Event': ['Running', 'Swimming', 'Biking', 'Hiking'],
'Completed': [2, 4, 3, 7],
'Participants': [10, 20, 35, 10]}
df = pd.DataFrame(mydict).set_index('Event')
df = df.assign(Completion=(df.Completed/df.Participants) * 100)
print(df)
plt.subplots(figsize=(5, 3))
ax = sns.barplot(x=df.Completion, y=df.index, color="cyan", orient='h')
for i in ax.patches:
    ax.text(i.get_width() + .4,
            i.get_y() + .67,
            str(round((i.get_width()), 2)) + '%', fontsize=10)
plt.tight_layout()
plt.show()
DataFrame:
Completed Participants Completion
Event
Running 2 10 20.000000
Swimming 4 20 20.000000
Biking 3 35 8.571429
Hiking 7 10 70.000000
Current Output:
Desired Output:

Loop through the columns Completed and Participants as well when you annotate:
for (c,p), i in zip(df[["Completed","Participants"]].values, ax.patches):
    ax.text(i.get_width() + .4,
            i.get_y() + .67,
            str(round((i.get_width()), 2)) + '%' + f" ({c}/{p})", fontsize=10)
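The same idea can be sketched without seaborn: build the label strings from the DataFrame first, then attach one label per bar. This is only an illustration under a few assumptions that are not in the original answer: it uses plain matplotlib `barh`, a headless `Agg` backend, two-decimal formatting, and the variable name `labels`.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "Event": ["Running", "Swimming", "Biking", "Hiking"],
    "Completed": [2, 4, 3, 7],
    "Participants": [10, 20, 35, 10],
}).set_index("Event")
df["Completion"] = df["Completed"] / df["Participants"] * 100

# One label per row, in the same order as the bars are drawn.
labels = [
    f"{pct:.2f}% ({c}/{p})"
    for pct, c, p in zip(df["Completion"], df["Completed"], df["Participants"])
]

fig, ax = plt.subplots(figsize=(5, 3))
bars = ax.barh(list(df.index), df["Completion"], color="cyan")
for bar, label in zip(bars, labels):
    # Place the text just right of the bar end, vertically centred on the bar.
    ax.text(bar.get_width() + 0.4, bar.get_y() + bar.get_height() / 2,
            label, va="center", fontsize=10)
ax.invert_yaxis()  # first event at the top, as in the seaborn plot
fig.tight_layout()
```

Centring the text on `bar.get_y() + bar.get_height() / 2` avoids the hand-tuned `.67` offset, which depends on the bar thickness.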

Related

pandas: draw plot using dict and labels on top of each bar

I am trying to plot a graph from a dict, which works fine but I also have a similar dict with values that I intend to write on top of each bar.
This works fine for plotting the graph:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
mpl.rcParams['axes.formatter.useoffset'] = False
df = pd.DataFrame([population_dct])
df.sum().sort_values(ascending=False).plot.bar(color='b')
plt.savefig("temp_fig.png")
Where the population_dct is:
{'pak': 210, 'afg': 182, 'ban': 94, 'ind': 32, 'aus': 14, 'usa': 345, 'nz': 571, 'col': 47, 'iran': 2}
Now I have another dict, called counter_dct:
{'pak': 1.12134, 'afg': 32.4522, 'ban': 3.44, 'ind': 1.123, 'aus': 4.22, 'usa': 9.44343, 'nz': 57.12121, 'col': 2.447, 'iran': 27.5}
I need the second dict items to be shown on top of each bar from the previous graph.
What I tried:
df = pd.DataFrame([population_dct])
df.sum().sort_values(ascending=False).plot.bar(color='g')
for i, v in enumerate(counter_dct.values()):
    plt.text(v, i, " " + str(v), color='blue', va='center', fontweight='bold')
This has two issues:
counter_dct.values() messes up the sequence of values
The values are shown at the bottom of each graph with poor alignment
Perhaps there's a better way to achieve this?
Since you are drawing the bars in descending order,
you first need to sort population_dct in descending order by value:
temp_dct = dict(sorted(population_dct.items(), key=lambda x: x[1], reverse=True))
Then iterate over temp_dct and look up each key's label in counter_dct:
counter = 0  # x position of the current bar
for key, val in temp_dct.items():
    top_val = counter_dct[key]
    plt.text(x=counter, y=val + 2, s=f"{top_val}", fontdict=dict(fontsize=11))
    counter += 1
plt.xticks(rotation=45, ha='right')
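The ordering problem can also be handled in pandas by reindexing the labels to the sorted bar order. The following is a minimal sketch rather than the answerer's code; the names `heights` and `labels`, the `Agg` backend, and the explicit `ax.bar` call are assumptions made for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

population_dct = {'pak': 210, 'afg': 182, 'ban': 94, 'ind': 32, 'aus': 14,
                  'usa': 345, 'nz': 571, 'col': 47, 'iran': 2}
counter_dct = {'pak': 1.12134, 'afg': 32.4522, 'ban': 3.44, 'ind': 1.123, 'aus': 4.22,
               'usa': 9.44343, 'nz': 57.12121, 'col': 2.447, 'iran': 27.5}

# Bar heights in descending order; labels reordered to match by key.
heights = pd.Series(population_dct).sort_values(ascending=False)
labels = pd.Series(counter_dct).reindex(heights.index)

fig, ax = plt.subplots()
positions = range(len(heights))
ax.bar(positions, heights.to_numpy(), color="g")
ax.set_xticks(positions)
ax.set_xticklabels(list(heights.index), rotation=45, ha="right")

for x, (key, height) in enumerate(heights.items()):
    # ha="center" and va="bottom" put the text directly on top of the bar.
    ax.text(x, height, f"{labels[key]}", ha="center", va="bottom", fontsize=11)
fig.tight_layout()
```

Because the labels are looked up by key, the result does not depend on the insertion order of either dict.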

Plot Histogram on different axes

I am reading CSV file:
Notation Level RFResult PRIResult PDResult Total Result
AAA 1 1.23 0 2 3.23
AAA 1 3.4 1 0 4.4
BBB 2 0.26 1 1.42 2.68
BBB 2 0.73 1 1.3 3.03
CCC 3 0.30 0 2.73 3.03
DDD 4 0.25 1 1.50 2.75
AAA 5 0.25 1 1.50 2.75
FFF 6 0.26 1 1.42 2.68
...
...
Here is the code
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('home\NewFiles\Files.csv')
Notation = df['Notation']
Level = df['Level']
RFResult = df['RFResult']
PRIResult = df['PRIResult']
PDResult = df['PDResult']
fig, axes = plt.subplots(nrows=7, ncols=1)
ax1, ax2, ax3, ax4, ax5, ax6, ax7 = axes.flatten()
n_bins = 13
ax1.hist(df['Total'], n_bins, histtype='bar')  # Currently this shows all Total Results in one plot
plt.show()
I want to show the Total Result of each Level on its own axes, as follows:
ax1 will show Level 1 Total Result
ax2 will show Level 2 Total Result
ax3 will show Level 3 Total Result
ax4 will show Level 4 Total Result
ax5 will show Level 5 Total Result
ax6 will show Level 6 Total Result
ax7 will show Level 7 Total Result
You can select a filtered part of a dataframe just by indexing: df[df['Level'] == level]['Total']. You can loop through the axes using for ax in axes.flatten(). To also get the index, use for ind, ax in enumerate(axes.flatten()). Note that Python starts counting from 0, so adding 1 to the index gives the level.
Note that backslashes in a normal string literal start escape sequences. A raw string (r-string) keeps them literal: r'home\NewFiles\Files.csv'.
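As a small illustration of the raw-string point, the sketch below compares the raw literal with its doubled-backslash equivalent. The variable names are made up for the example, and `PureWindowsPath` is used only to show how the path splits. In Python 3 the unprefixed literal from the question would not even compile, because `\N` is read as the start of a named Unicode escape.

```python
from pathlib import PureWindowsPath

# Raw string: backslashes are kept exactly as typed.
raw_path = r"home\NewFiles\Files.csv"

# Equivalent normal string: every backslash has to be doubled.
escaped_path = "home\\NewFiles\\Files.csv"

same = raw_path == escaped_path

# Splitting on Windows separators shows the three intended components.
parts = PureWindowsPath(raw_path).parts
```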
The default ylim is from 0 to the maximum bar height, plus some padding. This can be changed for each ax separately. In the example below a list of ymax values is used to show the principle.
ax.grid(True, axis='both') turns the grid on for that ax. Instead of 'both', 'x' or 'y' can be used to draw grid lines for only that axis. A grid line is drawn for each tick value. (The example below tries to use little space, so only a few gridlines are visible.)
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
N = 1000
df = pd.DataFrame({'Level': np.random.randint(1, 6, N), 'Total': np.random.uniform(1, 5, N)})
fig, axes = plt.subplots(nrows=5, ncols=1, sharex=True)
ymax_per_level = [27, 29, 28, 26, 27]
for ind, (ax, lev_ymax) in enumerate(zip(axes.flatten(), ymax_per_level)):
    level = ind + 1
    n_bins = 13
    ax.hist(df[df['Level'] == level]['Total'], bins=n_bins, histtype='bar')
    ax.set_ylabel(f'TL={level}') # to add the level in the ylabel
    ax.set_ylim(0, lev_ymax)
    ax.grid(True, axis='both')
plt.show()
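The hard-coded ymax_per_level list only demonstrates the principle. A per-level limit could instead be derived from the data, as in the rough sketch below. It assumes that `np.histogram` with an integer `bins` bins the data the same way as `ax.hist`, uses an arbitrary padding of 2, and seeds the random generator so the run is repeatable; none of these choices come from the answer.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable
N = 1000
df = pd.DataFrame({'Level': rng.integers(1, 6, N),      # levels 1..5
                   'Total': rng.uniform(1, 5, N)})

n_bins = 13
padding = 2  # arbitrary headroom above the tallest bar

ymax_per_level = {}
for level, group in df.groupby('Level'):
    # Count the values per bin for this level only.
    counts, _ = np.histogram(group['Total'], bins=n_bins)
    ymax_per_level[int(level)] = int(counts.max()) + padding
```

Each value could then be passed to `ax.set_ylim(0, ymax_per_level[level])` inside the plotting loop.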
PS: A stacked histogram with custom legend and custom vertical lines could be created as:
import matplotlib.pyplot as plt
from matplotlib.patches import Patch
import pandas as pd
import numpy as np
N = 1000
df = pd.DataFrame({'Level': np.random.randint(1, 6, N),
                   'RFResult': np.random.uniform(1, 5, N),
                   'PRIResult': np.random.uniform(1, 5, N),
                   'PDResult': np.random.uniform(1, 5, N)})
df['Total'] = df['RFResult'] + df['PRIResult'] + df['PDResult']
fig, axes = plt.subplots(nrows=5, ncols=1, sharex=True)
colors = ['crimson', 'limegreen', 'dodgerblue']
column_names = ['RFResult', 'PRIResult', 'PDResult']
level_vertical_line = [1, 2, 3, 4, 5]
for level, (ax, vertical_line) in enumerate(zip(axes.flatten(), level_vertical_line), start=1):
    n_bins = 13
    level_data = df[df['Level'] == level][column_names].to_numpy()
    # vertical_line = level_data.mean()
    ax.hist(level_data, bins=n_bins,
            histtype='bar', stacked=True, color=colors)
    ax.axvline(vertical_line, color='gold', ls=':', lw=2)
    ax.set_ylabel(f'TL={level}') # to add the level in the ylabel
    ax.margins(x=0.01)
    ax.grid(True, axis='both')
legend_handles = [Patch(color=color) for color in colors]
axes[0].legend(legend_handles, column_names, ncol=len(column_names), loc='lower center', bbox_to_anchor=(0.5, 1.02))
plt.show()

Is there a way to filter dimension in holoviews Sankey diagram

I am trying to show migration between locations in a Sankey diagram in HoloViews, but I can't find a way to add a dropdown-type filter. Sankey does not let me list more key dimensions than the ones being plotted. I expected that to work, because other HoloViews elements give me a dropdown menu by automatically grouping the data by every key dimension I did not assign to the element.
import pandas as pd
import numpy as np
import holoviews as hv
from holoviews import opts
hv.extension('bokeh')
df = pd.DataFrame({'from': ["a", "b", "c", "a", "b", "c"],
'to': ["d", "d", "e", "e", "e", "d"],
'number': [10, 2, 1, 8, 2, 2],
'year': [2018, 2018, 2018, 2017, 2017, 2017]})
df
from to number year
0 a d 10 2018
1 b d 2 2018
2 c e 1 2018
3 a e 8 2017
4 b e 2 2017
5 c d 2 2017
Now in HoloViews, I add the year column to kdims because I want the dropdown to filter by year:
kdims = ["from", "to", "year"]
vdims = ["number"]
sankey = hv.Sankey(df, kdims=kdims, vdims=vdims)
sankey.opts(label_position='left', edge_color='to', node_padding=30, node_color='number', cmap='tab20')
returning:
ValueError: kdims: list length must be between 2 and 2 (inclusive)
Without the third key dimension the Sankey diagram works as expected, but then there is no interactive filter:
Here are 2 ways of solving your problem:
1) Turn your dataframe into a holoviews dataset and turn that into a Sankey plot:
Since 'year' is the third key dimension in the code below, it is used as the dimension for the slider. The first two ('from' and 'to') are used as the key dimensions of the Sankey plot.
hv_ds = hv.Dataset(
    data=df,
    kdims=['from', 'to', 'year'],
    vdims=['number'],
)
hv_ds.to(hv.Sankey)
2) Or, create a dictionary of Sankey plots per year and put those into a holomap:
sankey_dict = {
    year: hv.Sankey(df[df.year == year])
    for year in df.year.unique()
}
holo = hv.HoloMap(sankey_dict, kdims='year')
Both solutions create a holomap:
http://holoviews.org/reference/containers/bokeh/HoloMap.html
Resulting plot + slider:
I've tested this on:
hvplot 0.5.2
holoviews 1.12.5 and holoviews 1.13
jupyterlab 1.2.4
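The per-year split behind option 2 can be inspected with plain pandas before any HoloViews object is built. The sketch below is only an illustration: `frames_by_year` is a hypothetical name, and dropping the year column from each slice is an assumption, not something the answer does.

```python
import pandas as pd

df = pd.DataFrame({'from': ["a", "b", "c", "a", "b", "c"],
                   'to': ["d", "d", "e", "e", "e", "d"],
                   'number': [10, 2, 1, 8, 2, 2],
                   'year': [2018, 2018, 2018, 2017, 2017, 2017]})

# One edge table per year: source, target and flow value only.
frames_by_year = {
    int(year): group[['from', 'to', 'number']].reset_index(drop=True)
    for year, group in df.groupby('year')
}
```

Each value of the dict is the kind of table that would be handed to one Sankey element, keyed by the year shown on the slider.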

create density plots of continuous field by categorical field

I have the code below which overlays a density curve on a histogram. It does this for the ‘Fresh’ field in my data, which is a continuous field. I would like to create similar plots filtering by the unique values in the ‘Channel’ field. For example in pandas to create histograms similar to what I'm trying to accomplish I would use:
data_df.hist(column='Fresh', by='Channel')
Can anyone suggest how to do something similar for the seaborn code below?
code:
import seaborn as sns
sns.distplot(data_df['Fresh'], hist=True, kde=True,
             bins=int(data_df.shape[0]/5), color = 'darkblue',
             hist_kws={'edgecolor':'black'},
             kde_kws={'linewidth': 4})
data
Channel Fresh
0 2 12669
1 2 7057
2 2 6353
3 1 13265
4 2 22615
5 2 9413
6 2 12126
7 2 7579
8 1 5963
9 2 6006
I think the Seaborn way is to create a FacetGrid, and then to map an axis-level plotting function onto it. In your case:
g = sns.FacetGrid(data_df, col='Channel', margin_titles=True)
g.map(sns.distplot,
      'Fresh',
      bins=int(data_df.shape[0]/5),
      color='darkblue',
      hist_kws={'edgecolor': 'black'},
      kde_kws={'linewidth': 4});
Check out the docs for more: https://seaborn.pydata.org/tutorial/axis_grids.html
Alternatively, you can group your DataFrame by Channel and then plot the two groups in different subplots:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
data_df = pd.DataFrame({'Channel': [2, 2, 2, 1, 2, 2, 2, 2, 1, 2],
'Fresh': [12669, 7057, 6353, 13265, 22615,
9413, 12126, 7579, 5963,6006]})
df1 = data_df.groupby('Channel')
fig, axes = plt.subplots(nrows=1, ncols=len(df1), figsize=(10, 3))
for ax, df in zip(axes.flatten(), df1.groups):
    sns.distplot(df1.get_group(df)['Fresh'], hist=True, kde=True,
                 bins=int(data_df.shape[0]/5), color = 'darkblue',
                 hist_kws={'edgecolor':'black'},
                 kde_kws={'linewidth': 4}, ax=ax)
plt.tight_layout()
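For readers without seaborn, or on a release where distplot is deprecated, the group-then-plot idea can be sketched with matplotlib and NumPy alone. The density curve here is a hand-rolled Gaussian kernel estimate with a Scott's-rule bandwidth, so it is a substitute technique and not what seaborn computes internally. The helper `gaussian_kde_1d`, the evaluation grid, and the bin count of 5 are all assumptions made for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def gaussian_kde_1d(samples, grid):
    """Hypothetical helper: Gaussian kernel density estimate on `grid`."""
    samples = np.asarray(samples, dtype=float)
    n = samples.size
    bandwidth = samples.std(ddof=1) * n ** (-1 / 5)  # Scott's rule in one dimension
    z = (grid[:, None] - samples[None, :]) / bandwidth
    return np.exp(-0.5 * z ** 2).sum(axis=1) / (n * bandwidth * np.sqrt(2 * np.pi))


data_df = pd.DataFrame({'Channel': [2, 2, 2, 1, 2, 2, 2, 2, 1, 2],
                        'Fresh': [12669, 7057, 6353, 13265, 22615,
                                  9413, 12126, 7579, 5963, 6006]})

grid = np.linspace(-20000, 50000, 500)  # wide enough to cover both groups' tails
groups = data_df.groupby('Channel')['Fresh']

fig, axes = plt.subplots(nrows=1, ncols=groups.ngroups, figsize=(10, 3), squeeze=False)
densities = {}
for ax, (channel, values) in zip(axes.flatten(), groups):
    densities[int(channel)] = gaussian_kde_1d(values, grid)
    # density=True rescales the histogram so it is comparable with the curve.
    ax.hist(values, bins=5, density=True, color='darkblue', edgecolor='black', alpha=0.5)
    ax.plot(grid, densities[int(channel)], color='darkblue', linewidth=4)
    ax.set_title(f'Channel = {channel}')
fig.tight_layout()
```

With only two rows in Channel 1 the estimate is very rough; the sketch is meant to show the per-group structure, not a statistically meaningful density.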

day of the week as X in Seaborn plot

I have a dataset with clicks and impressions, which I aggregated by day of week using groupby and agg:
df2=df.groupby('day_of_week',as_index=False, sort=True, group_keys=True).agg({'Clicks':'sum','Impressions':'sum'})
Then I tried to plot them using subplots:
f, axes = plt.subplots(2, 2, figsize=(7, 7))
sns.countplot(data=df2['Clicks'], x=df2['day_of_week'],ax=axes[0, 0])
sns.countplot(data=df2["Impressions"], x=df2['day_of_week'],ax=axes[0, 1])
but instead of using the day of week as X, the plot used the values in Clicks and Impressions. Is there a way to force X to be the day of the week, with the values on Y? Thanks.
Full code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")
df=pd.read_csv('data/data_clean.csv')
df2=df.groupby('day_of_week',as_index=False, sort=True, group_keys=True).agg({'Clicks':'sum','Impressions':'sum'})
f, axes = plt.subplots(2, 2, figsize=(7, 7))
sns.countplot(data=df2['Clicks'], x=df2['day_of_week'],ax=axes[0, 0])
sns.countplot(data=df2["Impressions"], x=df2['day_of_week'],ax=axes[0, 1])
plt.show()
Fake Data:
day_of_week Clicks Impressions
0 100 2000
1 400 4000
2 300 3500
3 200 2000
4 100 1000
5 50 500
6 10 150
I was able to find the answer with seaborn with Peter's guidance.
The correct plotting code is
sns.barplot( x=df2['day_of_week'],y=df2['Clicks'] , color="skyblue", ax=axes[0, 0])
sns.barplot( x=df2['day_of_week'],y=df2['Impressions'] , color="olive", ax=axes[0, 1])
It seems seaborn by default would take the first variable as X instead of Y.
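The same x/y assignment can be written with plain matplotlib, which makes it explicit that the day of week goes on the horizontal axis and the aggregated value on the vertical one. This is a minimal sketch using the fake data from the question; the `Agg` backend and the 1x2 layout are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Pre-aggregated data, one row per day of week.
df2 = pd.DataFrame({'day_of_week': [0, 1, 2, 3, 4, 5, 6],
                    'Clicks': [100, 400, 300, 200, 100, 50, 10],
                    'Impressions': [2000, 4000, 3500, 2000, 1000, 500, 150]})

fig, axes = plt.subplots(1, 2, figsize=(7, 3.5))
for ax, column, color in zip(axes, ['Clicks', 'Impressions'], ['skyblue', 'olive']):
    # x is the category (day), the bar height is the aggregated value.
    ax.bar(df2['day_of_week'], df2[column], color=color)
    ax.set_xlabel('day_of_week')
    ax.set_ylabel(column)
fig.tight_layout()
```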
Based on the seaborn docs, I think countplot expects a long-form dataframe such as your df, not the pre-aggregated df2 that you built and pass in your question. countplot does the counting for you.
However, your df2 is ready for a pandas bar plot:
df2.plot(kind='bar', y=['Impressions', 'Clicks'])
Result:
