Say I have an example dataframe below:
Division    Home Corners  Away Corners
Bundesliga  5             3
Bundesliga  5             5
EPL         7             4
EPL         3             2
League 1    10            6
Serie A     3             3
Serie A     8             2
League 1    3             1
I want to create a boxplot of corners per game grouped by division, with the home corners and away corners shown separately on the same figure, similar to what the "hue" keyword accomplishes. How do I do that?
Use seaborn.boxplot after reshaping the data to long form with pandas.DataFrame.stack:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
data = {'Division': ['Bundesliga', 'Bundesliga', 'EPL', 'EPL', 'League 1', 'Serie A', 'Serie A', 'League 1'],
'Home Corners': [5, 5, 7, 3, 10, 3, 8, 3],
'Away Corners': [3, 5, 4, 2, 6, 3, 2, 1]}
df = pd.DataFrame(data)
# convert the data to a long format
df.set_index('Division', inplace=True)
dfl = df.stack().reset_index().rename(columns={'level_1': 'corners', 0: 'val'})
# plot
# recent seaborn releases require x and y to be passed as keywords
sns.boxplot(x='corners', y='val', data=dfl, hue='Division')
plt.legend(title='Division', bbox_to_anchor=(1.05, 1), loc='upper left')
You can melt the original data and use sns.boxplot:
sns.boxplot(data=df.melt('Division', var_name='Home/Away', value_name='Corners'),
x='Division', y='Corners',hue='Home/Away')
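To see what the reshaping step hands to seaborn, here is a minimal sketch of the melt call on the sample data. The name long_df is introduced here only for illustration, and the column names are written without trailing spaces.

```python
import pandas as pd

df = pd.DataFrame({
    "Division": ["Bundesliga", "Bundesliga", "EPL", "EPL",
                 "League 1", "Serie A", "Serie A", "League 1"],
    "Home Corners": [5, 5, 7, 3, 10, 3, 8, 3],
    "Away Corners": [3, 5, 4, 2, 6, 3, 2, 1],
})

# One row per (game, side): the two corner columns are folded into a
# single value column plus a label column that can be passed as hue.
long_df = df.melt("Division", var_name="Home/Away", value_name="Corners")
print(long_df.head())
```

Each original row becomes two rows, so the 8 games turn into 16 observations, and 'Home/Away' plays the role that hue expects.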
I'm trying to plot data from two dataframes in the same figure. The problem is that I'm using calendar dates for my x axis, and pandas apparently does not like this. The code below shows a minimal example of what I'm trying to do. There are two datasets, each with a numeric value associated with calendar dates. The dates in the second data frame come after the dates in the first. I want to plot both in the same figure with correct dates and different line colors. The problem is that the pandas.DataFrame.plot method starts both dataframes at the same point on the x axis, which makes the chart useless.
import pandas as pd
import matplotlib.pyplot as plt
df1 = pd.DataFrame({'date': ['2020-03-10', '2020-03-11', '2020-03-12', '2020-03-13', '2020-03-14', '2020-03-15'],
'number': [1, 2, 3, 4, 5, 6]})
df2 = pd.DataFrame({'date': ['2020-03-16', '2020-03-17', '2020-03-18', '2020-03-19'],
'number': [7, 6, 5, 4]})
ax = df1.plot(x='date', y='number', label='beginning')
df2.plot(x='date', y='number', label='ending', ax=ax)
plt.show()
The figure created looks like this:
Is there any way I can fix this? Could I also get dates to be shown in the x-axis tilted so they're also more legible?
You need to cast 'date' to datetime dtype using pd.to_datetime:
import pandas as pd
import matplotlib.pyplot as plt
df1 = pd.DataFrame({'date': ['2020-03-10', '2020-03-11', '2020-03-12', '2020-03-13', '2020-03-14', '2020-03-15'],
'number': [1, 2, 3, 4, 5, 6]})
df2 = pd.DataFrame({'date': ['2020-03-16', '2020-03-17', '2020-03-18', '2020-03-19'],
'number': [7, 6, 5, 4]})
df1['date'] = pd.to_datetime(df1['date'])
df2['date'] = pd.to_datetime(df2['date'])
ax = df1.plot(x='date', y='number', label='beginning')
df2.plot(x='date', y='number', label='ending', ax=ax)
plt.show()
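For the second part of the question (tilted date labels), one option is Figure.autofmt_xdate. The sketch below is an alternative that plots with matplotlib directly rather than through DataFrame.plot; the 45 degree angle is an arbitrary choice.

```python
import pandas as pd
import matplotlib.pyplot as plt

df1 = pd.DataFrame({'date': ['2020-03-10', '2020-03-11', '2020-03-12',
                             '2020-03-13', '2020-03-14', '2020-03-15'],
                    'number': [1, 2, 3, 4, 5, 6]})
df2 = pd.DataFrame({'date': ['2020-03-16', '2020-03-17', '2020-03-18', '2020-03-19'],
                    'number': [7, 6, 5, 4]})

# Convert the date strings to real datetimes so both frames share one time axis.
for frame in (df1, df2):
    frame['date'] = pd.to_datetime(frame['date'])

fig, ax = plt.subplots()
ax.plot(df1['date'], df1['number'], label='beginning')
ax.plot(df2['date'], df2['number'], label='ending')
ax.legend()

# Rotate and right-align the x tick labels so the dates do not overlap.
fig.autofmt_xdate(rotation=45)
# plt.show()  # uncomment when running interactively
```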
I have a dataframe in the form below:
data = [['M',0],['F',0],['M',1], ['M',1],['M',1],['F',1],['M',0], ['M',1],['M',0],['F',1],['M',0], ['M',0]]
df = pd.DataFrame(data,columns=['Gender','label'])
print (df)
Gender label
0 M 0
1 F 0
2 M 1
3 M 1
4 M 1
5 F 1
6 M 0
7 M 1
8 M 0
9 F 1
10 M 0
11 M 0
I am trying to create a stacked bar chart that shows percentages as annotations on the chart.
Code below to create stacked bar chart:
df.groupby('Gender')['label']\
.value_counts()\
.unstack(level=1)\
.plot.bar(stacked=True)
I am not sure how to get percentages on the chart.
Thanks in advance.
I can offer you this solution:
I have created a new DataFrame, df2, that contains the percentages to be drawn.
The values of df2 have been reordered so that they correspond to the index i, which runs over the bars.
This allows each value to be drawn in the right place.
get_xy returns the x and y coordinates of the bottom-left corner of each bar.
get_width returns the horizontal extent of each bar (its length in this horizontal chart).
get_height returns the vertical extent of each bar (its thickness in this horizontal chart).
A loop draws the percentages, one bar per iteration. The center of each bar is its corner plus half the width and half the height. kx and ky slightly correct the position.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# %matplotlib inline  # IPython/Jupyter magic; uncomment when running in a notebook
data = [['M', 0], ['F', 0], ['M',1 ], ['M', 1], ['M', 1], ['F', 1], ['M', 0], ['M', 1], ['M', 0], ['F', 1], ['M', 0], ['M', 0]]
df = pd.DataFrame(data,columns=['Gender','label'])
F_Serie = df.groupby('Gender')['label'].value_counts()['F']
M_Serie = df.groupby('Gender')['label'].value_counts()['M']
M_Serie = M_Serie*(100/M_Serie.sum())
F_Serie = F_Serie*(100/F_Serie.sum())
df2 = pd.DataFrame(np.array([list(F_Serie), list(M_Serie)]), index = ['F', 'M'], columns = [0, 1])
ax = df.groupby('Gender')['label'].value_counts().unstack(level=1).plot.barh(stacked=True, figsize=(10, 6))
# Set txt
kx = -0.3
ky = -0.02
values = []
for key in df2.values:
values = values + list(key)
# ordering the values
val = values[1:3]
values.pop(1)
values.pop(1)
values = val + values
for i,rec in enumerate(ax.patches):
ax.text(rec.get_xy()[0]+rec.get_width()/2+kx,rec.get_xy()[1]+rec.get_height()/2+ky,'{:.1%}'.format(values[i]/100), fontsize=12, color='black')
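The manual reordering above depends on the order in which value_counts happens to return the rows. As a hedged alternative, the sketch below computes the row percentages with pd.crosstab and lets Axes.bar_label place the text. It assumes matplotlib 3.4 or newer (where bar_label exists); the names counts and pct are introduced here for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

data = [['M', 0], ['F', 0], ['M', 1], ['M', 1], ['M', 1], ['F', 1],
        ['M', 0], ['M', 1], ['M', 0], ['F', 1], ['M', 0], ['M', 0]]
df = pd.DataFrame(data, columns=['Gender', 'label'])

# Counts per (Gender, label), then each row rescaled to percentages.
counts = pd.crosstab(df['Gender'], df['label'])
pct = counts.div(counts.sum(axis=1), axis=0) * 100

ax = counts.plot.barh(stacked=True, figsize=(10, 6))

# Each container holds the bars of one label column, in row (Gender) order,
# so the matching percentage column supplies the text for those bars.
for container, col in zip(ax.containers, counts.columns):
    labels = [f'{value:.1f}%' for value in pct[col]]
    ax.bar_label(container, labels=labels, label_type='center')
# plt.show()  # uncomment when running interactively
```

Because the percentages are looked up by column rather than by position in a flattened list, the labels stay attached to the right bars if the data changes.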
I have a dataframe x2 with two columns. I am trying to plot it, but I don't get any xticks.
data:
bins pp
0 (0, 1] 0.155463
1 (1, 2] 1.528947
2 (2, 3] 2.436064
3 (3, 4] 3.507811
4 (4, 5] 4.377849
5 (5, 6] 5.538044
6 (6, 7] 6.577340
7 (7, 8] 7.510983
8 (8, 9] 8.520378
9 (9, 10] 9.721899
I tried the code below. The result is fine, except that the x-axis ticks are blank. I want the values of the bins column to be shown on the x-axis.
x2.plot(x='bins',y=['pp'])
x2.dtypes
Out[141]:
bins category
pp float64
The following is to show that this problem should not occur with pandas 0.24.1 or higher.
import numpy as np
import pandas as pd
print(pd.__version__) # prints 0.24.2
import matplotlib.pyplot as plt
df = pd.DataFrame({"Age" : np.random.rayleigh(30, size=300)})
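# name the index and the counts explicitly so the column names do not depend on the pandas version
s = pd.cut(df["Age"], bins=np.arange(0,91,10)).value_counts().sort_index().rename_axis("bins").reset_index(name="count")
s.plot(x='bins', y="count")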
plt.show()
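If upgrading pandas is not an option, a workaround is to set the tick labels by hand. This is only a sketch: it rebuilds a frame shaped like x2 with pd.cut (an assumption about how the bins column was created) and plots against integer positions with matplotlib directly.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Rebuild a frame like x2: a categorical column of intervals plus the values.
x2 = pd.DataFrame({
    'bins': pd.cut(pd.Series(range(1, 11)), bins=list(range(0, 11))),
    'pp': [0.155463, 1.528947, 2.436064, 3.507811, 4.377849,
           5.538044, 6.577340, 7.510983, 8.520378, 9.721899],
})

positions = list(range(len(x2)))
labels = [str(interval) for interval in x2['bins']]

fig, ax = plt.subplots()
ax.plot(positions, x2['pp'], label='pp')
# Put one tick at each row position and label it with the interval text.
ax.set_xticks(positions)
ax.set_xticklabels(labels, rotation=45)
ax.legend()
# plt.show()  # uncomment when running interactively
```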
I have the following data frame my_df:
my_1 my_2 my_3
--------------------------------
0 5 7 4
1 3 5 13
2 1 2 8
3 12 9 9
4 6 1 2
I want to make a plot where x-axis is categorical values with my_1, my_2, and my_3. y-axis is integer. For each column in my_df, I want to plot all its 5 values at x = my_i. What kind of plot should I use in matplotlib? Thanks!
You could make a bar chart:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'my_1': [5, 3, 1, 12, 6], 'my_2': [7, 5, 2, 9, 1], 'my_3': [4, 13, 8, 9, 2]})
df.T.plot(kind='bar')
plt.show()
or a scatter plot:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'my_1': [5, 3, 1, 12, 6], 'my_2': [7, 5, 2, 9, 1], 'my_3': [4, 13, 8, 9, 2]})
fig, ax = plt.subplots()
cols = np.arange(len(df.columns))
x = np.repeat(cols, len(df))
y = df.values.ravel(order='F')
color = np.tile(np.arange(len(df)), len(df.columns))
scatter = ax.scatter(x, y, s=150, c=color)
ax.set_xticks(cols)
ax.set_xticklabels(df.columns)
cbar = plt.colorbar(scatter)
cbar.set_ticks(np.arange(len(df)))
plt.show()
Just for fun, here is how to make the same scatter plot using Pandas' df.plot:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'my_1': [5, 3, 1, 12, 6], 'my_2': [7, 5, 2, 9, 1], 'my_3': [4, 13, 8, 9, 2]})
columns = df.columns
index = df.index
df = df.stack()
df.index.names = ['color', 'column']
df = df.rename('y').reset_index()
df['x'] = pd.Categorical(df['column']).codes
ax = df.plot(kind='scatter', x='x', y='y', c='color', colorbar=True,
cmap='viridis', s=150)
ax.set_xticks(np.arange(len(columns)))
ax.set_xticklabels(columns)
cbar = ax.collections[-1].colorbar
cbar.set_ticks(index)
plt.show()
Unfortunately, it requires quite a bit of DataFrame manipulation just to call
df.plot and then there are some extra matplotlib calls needed to set the tick
marks on the scatter plot and colorbar. Since Pandas is not saving effort here,
I would go with the first (NumPy/matplotlib) approach shown above.
I have time series data which are multi-indexed on (Year, Month) as seen here:
print(df.index)
print(df)
MultiIndex(levels=[[2016, 2017], [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]],
labels=[[0, 0, 0, 0, 0, 0, 0, 0], [2, 3, 4, 5, 6, 7, 8, 9]],
names=['Year', 'Month'])
Value
Year Month
2016 3 65.018150
4 63.130035
5 71.071254
6 72.127967
7 67.357795
8 66.639228
9 64.815232
10 68.387698
I want to do very basic linear regression on these time series data. Because pandas.DataFrame.plot does not do any regression, I intend to use Seaborn to do my plotting.
I attempted to do this by using lmplot:
sns.lmplot(x=("Year", "Month"), y="Value", data=df, fit_reg=True)
but I get an error:
TypeError: '>' not supported between instances of 'str' and 'tuple'
This is particularly interesting to me because all elements in df.index.levels[:] are of type numpy.int64, all elements in df.index.labels[:] are of type numpy.int8.
Why am I receiving this error? How can I resolve it?
You can use reset_index to turn the dataframe's index into columns. Plotting DataFrame columns is then straightforward with seaborn.
As I guess the reason to use lmplot would be to show different regressions for different years (otherwise a regplot may be better suited), the "Year" column can be used as hue.
import numpy as np
import pandas as pd
import seaborn as sns  # seaborn.apionly was removed from later seaborn releases
import matplotlib.pyplot as plt
iterables = [[2016, 2017], [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]]
index = pd.MultiIndex.from_product(iterables, names=['Year', 'Month'])
df = pd.DataFrame({"values":np.random.rand(24)}, index=index)
df2 = df.reset_index() # or, df.reset_index(inplace=True) if df is not required otherwise
g = sns.lmplot(x="Month", y="values", data=df2, hue="Year")
plt.show()
Consider the following approach:
df['x'] = df.index.get_level_values(0) + df.index.get_level_values(1)/100
yields:
In [49]: df
Out[49]:
Value x
Year Month
2016 3 65.018150 2016.03
4 63.130035 2016.04
5 71.071254 2016.05
6 72.127967 2016.06
7 67.357795 2016.07
8 66.639228 2016.08
9 64.815232 2016.09
10 68.387698 2016.10
Let's prepare the x-tick labels:
labels = df.index.get_level_values(0).astype(str) + '-' + \
df.index.get_level_values(1).astype(str).str.zfill(2)
sns.lmplot(x='x', y='Value', data=df, fit_reg=True)
ax = plt.gca()
# pin the ticks to the data positions so each label lines up with its point
ax.set_xticks(df['x'])
ax.set_xticklabels(labels)
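One caveat with Year + Month/100 is that the spacing is uneven across a year boundary (2016.12 to 2017.01 is a much larger step than 2016.11 to 2016.12), which distorts a fitted line once the data spans more than one year. A possible variation, sketched below with made-up values, is to count months elapsed instead. The column name t and the sample data are assumptions for illustration, and np.polyfit stands in for the regression that seaborn would draw.

```python
import numpy as np
import pandas as pd

# Made-up data on a (Year, Month) MultiIndex spanning two years.
index = pd.MultiIndex.from_product([[2016, 2017], range(1, 13)],
                                   names=['Year', 'Month'])
df = pd.DataFrame({'Value': np.linspace(60.0, 71.5, len(index))}, index=index)

year = df.index.get_level_values('Year')
month = df.index.get_level_values('Month')

# Months elapsed since January of the first year: evenly spaced in time.
df['t'] = np.asarray((year - year.min()) * 12 + (month - 1))

# Ordinary least-squares line through (t, Value).
slope, intercept = np.polyfit(df['t'], df['Value'], deg=1)
print(slope, intercept)
```

The same t column can be passed to seaborn, for example sns.regplot(x='t', y='Value', data=df), with tick labels built as in the answer above.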