Stacked bar plot from Dataframe using groupby - python-3.x

I have the following dataframe and I am trying to create a stacked bar plot
import os
from pprint import pprint
import matplotlib.pyplot as plt
import pandas as pd
def classify_data():
    race = ['race1','race1','race1','race1','race2','race2','race2', 'race2']
    qualifier = ['last','first','first','first','last','last','first','first']
    participant = ['rat','rat','cat','cat','rat','dog','dog','dog']
    df = pd.DataFrame(
        {'race':race,
         'qualifier':qualifier,
         'participant':participant
        }
    )
    pprint(df)
    df2 = df.groupby(['race','qualifier'])['race'].count().unstack('qualifier').fillna(0)
    df2[['first','last']].plot(kind='bar', stacked=True)
    plt.show()
classify_data()
I managed to obtain the following plot, but I want to create two plots from my dataframe.
One plot containing the following data for the qualifier 'last':
Race1 rat 1
Race1 cat 0
Race1 dog 0
Race2 rat 1
Race2 dog 1
Race2 cat 0
So the first bar plot would have two bars, with each bar color-coded by the count of each participant.
Likewise, a second plot for qualifier 'first':
EDIT:
Race1 rat 1
Race1 cat 2
Race1 dog 0
Race2 rat 0
Race2 dog 2
Race2 cat 0
From the original dataframe, I have to create the two dataframes above in order to draw the stacked plots.
I am not sure how to use the groupby function to get the count of 'participant' for each 'qualifier' within a given 'race'.
EDIT 2: For qualifier 'last', the desired plot would look like this (blue for rat, red for dog).
For qualifier 'first':
Could someone suggest how to proceed from here?

IIUC, this is what you want:
df2 = (df.groupby(['race','qualifier','participant'])
         .size()
         .unstack(level=-1)
         .reset_index()
      )
fig,axes = plt.subplots(1,2,figsize=(12,6),sharey=True)
for ax,q in zip(axes.ravel(),['first','last']):
    tmp_df = df2[df2.qualifier.eq(q)]
    tmp_df.plot.bar(x='race', ax=ax, stacked=True)
Output:
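To check the numbers behind each plot, the two count tables from the question can also be built directly. This is a minimal sketch that swaps in pd.crosstab for the groupby/unstack chain; the helper name counts_for and the fixed column order are assumptions made for illustration, not part of the original answer.

```python
import pandas as pd

# Data copied from the question.
df = pd.DataFrame({
    'race': ['race1', 'race1', 'race1', 'race1', 'race2', 'race2', 'race2', 'race2'],
    'qualifier': ['last', 'first', 'first', 'first', 'last', 'last', 'first', 'first'],
    'participant': ['rat', 'rat', 'cat', 'cat', 'rat', 'dog', 'dog', 'dog'],
})

def counts_for(qualifier):
    """Hypothetical helper: race x participant counts for one qualifier."""
    subset = df[df['qualifier'] == qualifier]
    table = pd.crosstab(subset['race'], subset['participant'])
    # Add participants that never occur for this qualifier as zero columns.
    return table.reindex(columns=['cat', 'dog', 'rat'], fill_value=0)

last_counts = counts_for('last')
first_counts = counts_for('first')
print(last_counts)
print(first_counts)
```

Each table can then be drawn with something like last_counts.plot.bar(stacked=True). A race with no rows at all for a given qualifier would be missing from that table, so it may need a reindex on the rows as well.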

Related

How to plot Histogram on specific data

I am reading CSV file:
Notation Level RFResult PRIResult PDResult Total Result
AAA 1 1.23 0 2 3.23
AAA 1 3.4 1 0 4.4
BBB 2 0.26 1 1.42 2.68
BBB 2 0.73 1 1.3 3.03
CCC 3 0.30 0 2.73 3.03
DDD 4 0.25 1 1.50 2.75
...
...
Here is the code
import pandas as pd
df = pd.read_csv(r'home\NewFiles\Files.csv')
Notation = df['Notation']
Level = df['Level']
RFResult = df['RFResult']
PRIResult = df['PRIResult']
PDResult = df['PDResult']
df.groupby('Level').plot(kind='bar')
The above code gives me four different figures. I want to change a few things:
I don't want to show the Level and Total Result bars in the graph. How do I remove them?
Also, how do I label the x-axis and y-axis and set the title of each plot? I want the title of each plot to be its level number.
To plot use the following code...
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv(r'home\NewFiles\Files.csv')
plt.hist((df['RFResult'],df['PRIResult'],df['PDResult']),bins=10)
plt.title('Level Number')
plt.xlabel('Label name')
plt.ylabel('Label name')
plt.show()
You can do:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv(r'home\NewFiles\Files.csv')
df.plot(kind='hist', y = ['RFResult', 'PRIResult', 'PDResult'], bins=20)
plt.title('level numbers')
plt.xlabel('X-Label')
plt.ylabel('Y-Label')
Remember the plot is called by pandas, but is based on matplotlib. So you can pass additional arguments!
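Neither answer keeps the one-figure-per-level layout from the question. A possible sketch of that variant follows: it loops over the groups, plots only the three result columns, and uses the level number as the title. The sample rows are copied from the question, and the axis label texts and the axes_by_level dictionary are placeholders chosen for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Sample rows from the question; the real data would come from pd.read_csv.
df = pd.DataFrame({
    'Notation': ['AAA', 'AAA', 'BBB', 'BBB', 'CCC', 'DDD'],
    'Level': [1, 1, 2, 2, 3, 4],
    'RFResult': [1.23, 3.4, 0.26, 0.73, 0.30, 0.25],
    'PRIResult': [0, 1, 1, 1, 0, 1],
    'PDResult': [2, 0, 1.42, 1.3, 2.73, 1.50],
    'Total Result': [3.23, 4.4, 2.68, 3.03, 3.03, 2.75],
})

# Only these columns are drawn, so Level and Total Result get no bars.
result_cols = ['RFResult', 'PRIResult', 'PDResult']

axes_by_level = {}
for level, group in df.groupby('Level'):
    # Each call creates a new figure for this level.
    ax = group[result_cols].plot(kind='bar', title=f'Level {level}')
    ax.set_xlabel('Row index')   # placeholder label
    ax.set_ylabel('Result')      # placeholder label
    axes_by_level[level] = ax
```

Calling plt.show() afterwards displays the four figures.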

Python pandas sort dataframe by enum class values

If I have enum class:
from enum import Enum
class Colors(Enum):
    RED = 1
    ORANGE = 2
    GREEN = 3
And if I have a dataframe in which one column is color (the values can also be in lowercase):
>>> import pandas as pd
>>> df = pd.DataFrame({'X':['A', 'B', 'C', 'A'], 'color' : ['GREEN', 'RED', 'ORANGE', 'ORANGE']})
>>> df
X color
0 A GREEN
1 B RED
2 C ORANGE
3 A ORANGE
How can I make the color column a categorical type that respects the Colors class values, and sort the dataframe by "color" and "X" (ascending)?
For example, the dataframe above should be sorted as:
X, color
--------
B, RED
A, ORANGE
C, ORANGE
A, GREEN
Combination of this answer and this one: use a pd.Categorical to sort by the Colors class (with a slight edit to change its str):
from enum import Enum
import pandas as pd
df = pd.DataFrame({'X':['A', 'B', 'C', 'A'], 'color' : ['GREEN', 'RED', 'ORANGE', 'ORANGE']})
class Colors(Enum):
    RED = 1
    ORANGE = 2
    GREEN = 3
    def __str__(self):
        return self.name
df['color'] = pd.Categorical(df['color'], [str(i) for i in Colors], ordered=True)
df = df.sort_values(['color','X'])
Result:
X color
1 B RED
3 A ORANGE
2 C ORANGE
0 A GREEN
Use getattr:
df["value"] = df["color"].apply(lambda x: getattr(Colors, x).value)
df.sort_values(by=['value',"X"])
Output:
X color value
1 B RED 1
3 A ORANGE 2
2 C ORANGE 2
0 A GREEN 3
In one line (and without creation of value column):
df.loc[pd.concat([df["X"], df["color"].apply(lambda x: getattr(Colors, x).value)], axis=1).sort_values(by=['color', "X"]).index]
Output:
X color
1 B RED
3 A ORANGE
2 C ORANGE
0 A GREEN
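The question mentions that the color strings may also be lowercase, which none of the snippets above handle. A minimal sketch of one way to cope with that is to normalize the strings to uppercase before building the categorical; the mixed-case sample values here are invented for illustration, and the approach overwrites the original casing in the column.

```python
from enum import Enum
import pandas as pd

class Colors(Enum):
    RED = 1
    ORANGE = 2
    GREEN = 3

# Same frame as in the question, but with mixed-case color strings (assumed).
df = pd.DataFrame({'X': ['A', 'B', 'C', 'A'],
                   'color': ['green', 'RED', 'Orange', 'ORANGE']})

# Category order follows the enum values, not alphabetical order.
order = [c.name for c in sorted(Colors, key=lambda c: c.value)]

# Uppercase the strings so they match the enum member names.
df['color'] = pd.Categorical(df['color'].str.upper(), categories=order, ordered=True)
result = df.sort_values(['color', 'X'])
print(result)
```

Strings that do not match any member name become NaN in the categorical column, which makes unexpected values easy to spot.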

How to split a dataframe and plot some columns

I have a dataframe with 990 rows and 7 columns. I want to make an X vs. Y line graph, breaking the line every 22 rows.
I think that splitting the dataframe and then plotting it would be a good way, but I don't get good results.
max_rows = 22
dataframes = []
while len(Co1new) > max_rows:
    top = Co1new[:max_rows]
    dataframes.append(top)
    Co1new = Co1new[max_rows:]
else:
    dataframes.append(Co1new)
for grafico in dataframes:
    AC = plt.plot(grafico)
AC = plt.xlabel('Frequency (Hz)')
AC = plt.ylabel("Temperature (K)")
plt.show()
The code runs, but it does not plot the right columns.
Here is some reduced data; in this case it should be split every four rows:
df = pd.DataFrame({
'col1':[2.17073,2.14109,2.16052,2.81882,2.29713,2.26273,2.26479,2.7643,2.5444,2.5027,2.52532,2.6778],
'col2':[10,100,1000,10000,10,100,1000,10000,10,100,1000,10000],
'col3':[2.17169E-4,2.15889E-4,2.10526E-4,1.53785E-4,2.09867E-4,2.07583E-4,2.01699E-4,1.56658E-4,1.94864E-4,1.92924E-4,1.87634E-4,1.58252E-4]})
One way I can think of is to add a new column with labels for every 22 records. See below
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn
seaborn.set(style='ticks')
"""
Assuming the index is numeric and is from [0-990)
this will return an integer for every 22 records
"""
Co1new['subset'] = 'S' + np.floor_divide(Co1new.index, 22).astype(str)
Out (for the reduced data above, grouped every 4 rows instead of 22):
col1 col2 col3 subset
0 2.17073 10 0.000217 S0
1 2.14109 100 0.000216 S0
2 2.16052 1000 0.000211 S0
3 2.81882 10000 0.000154 S0
4 2.29713 10 0.000210 S1
5 2.26273 100 0.000208 S1
6 2.26479 1000 0.000202 S1
7 2.76434 10000 0.000157 S1
8 2.54445 10 0.000195 S2
9 2.50270 100 0.000193 S2
10 2.52532 1000 0.000188 S2
11 2.67780 10000 0.000158 S2
You can then use seaborn.pairplot to plot your data pairwise and use Co1new['subset'] as legend.
seaborn.pairplot(Co1new, hue='subset')
Or, if you absolutely need line charts, you can plot your data one pair of columns at a time. Here is col1 vs. col3:
seaborn.lineplot(x='col1', y='col3', hue='subset', data=Co1new)
Using @SIA's answer:
df['groups'] = np.floor_divide(df.index, 4).astype(str)
import plotly.express as px
fig = px.line(df, x="col1", y="col2", color='groups')
fig.show()
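The same grouping idea also works with plain matplotlib, without seaborn or plotly. The following is a sketch on the reduced data with a block size of 4. It assumes col2 is the frequency and col1 the temperature, which the question does not state, and chunk_size and the "block" legend labels are names chosen for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Reduced data from the question.
df = pd.DataFrame({
    'col1': [2.17073, 2.14109, 2.16052, 2.81882, 2.29713, 2.26273,
             2.26479, 2.7643, 2.5444, 2.5027, 2.52532, 2.6778],
    'col2': [10, 100, 1000, 10000, 10, 100, 1000, 10000, 10, 100, 1000, 10000],
    'col3': [2.17169E-4, 2.15889E-4, 2.10526E-4, 1.53785E-4, 2.09867E-4, 2.07583E-4,
             2.01699E-4, 1.56658E-4, 1.94864E-4, 1.92924E-4, 1.87634E-4, 1.58252E-4]})

chunk_size = 4  # would be 22 for the full 990-row dataframe

# Position-based block labels, so this works whatever the index looks like.
blocks = np.arange(len(df)) // chunk_size

fig, ax = plt.subplots()
for label, chunk in df.groupby(blocks):
    # One separate line per block, so consecutive blocks are not joined.
    ax.plot(chunk['col2'], chunk['col1'], marker='o', label=f'block {label}')
ax.set_xscale('log')  # col2 spans several decades
ax.set_xlabel('Frequency (Hz)')
ax.set_ylabel('Temperature (K)')
ax.legend()
```

Selecting the x and y columns explicitly also avoids the original problem, where plt.plot(grafico) drew every column against the index.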

How to draw venn diagram from a dummy variable in Python Matplotlib_venn?

I have the following code to draw the venn diagram.
import numpy as np
import pandas as pd
import matplotlib_venn as vplt
x = np.random.randint(2, size=(10,3))
df = pd.DataFrame(x, columns=['A', 'B','C'])
print(df)
v = vplt.venn3(subsets=(1,1,1,1,1,1,1))
and the output looks like this:
I actually want to compute the numbers passed to subsets from the data set. How can I do that? Or is there an easier way to make this Venn diagram directly from the dataset?
I also want to draw a box around it and annotate the remaining area as the people for whom A, B and C are all 0, then calculate the percentage of people in each circle and use it as the label. I am not sure how to achieve this.
Background of the Problem:
I have a dataset of more than 500 observations and these three columns are recorded from one variable where multiple choices can be chosen as answers.
I want to visualize the data in a graph that shows how many people chose the 1st option, the 2nd, etc., as well as how many chose both the 1st and 2nd, the 1st and 3rd, etc.
Use numpy.argwhere to get the indices of the 1s in each column and plot the resulting sets:
In [85]: df
Out[85]:
A B C
0 0 1 1
1 1 1 0
2 1 1 0
3 0 0 1
4 1 1 0
5 1 1 0
6 0 0 0
7 0 0 0
8 1 1 0
9 1 0 0
In [86]: sets = [set(np.argwhere(v).ravel()) for k,v in df.items()]
...: venn3(sets, df.columns)
...: plt.show()
Note: if you want to draw an additional box with the number of items not in any of the categories, add these lines:
In [87]: ax = plt.gca()
In [88]: xmin, _, ymin, _ = ax.axes.axis('on')
In [89]: ax.text(xmin, ymin, (df == 0).all(1).sum(), ha='left', va='bottom')
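The question also asks for the subset counts and percentages themselves. A sketch of how they could be computed from the dummy columns follows, using the sample frame printed above. It relies on the subset order matplotlib_venn documents for venn3, namely (Abc, aBc, ABc, abC, AbC, aBC, ABC); the variable names are chosen for illustration.

```python
import pandas as pd

# Sample frame from the answer above.
df = pd.DataFrame({
    'A': [0, 1, 1, 0, 1, 1, 0, 0, 1, 1],
    'B': [1, 1, 1, 0, 1, 1, 0, 0, 1, 0],
    'C': [1, 0, 0, 1, 0, 0, 0, 0, 0, 0],
})

# Encode each row as a three-character string such as '110' (in A and B only).
codes = df['A'].astype(str) + df['B'].astype(str) + df['C'].astype(str)
counts = codes.value_counts()

# Region order expected by venn3: Abc, aBc, ABc, abC, AbC, aBC, ABC.
order = ['100', '010', '110', '001', '101', '011', '111']
subsets = tuple(int(counts.get(code, 0)) for code in order)

# Rows where A, B and C are all 0 fall outside every circle.
outside = int(counts.get('000', 0))

# Share of all respondents in each region, in percent.
percentages = {code: 100 * n / len(df) for code, n in zip(order, subsets)}
print(subsets, outside, percentages)
```

The tuple can then be passed as venn3(subsets=subsets, set_labels=('A', 'B', 'C')), and the percentages can replace the count labels if desired.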

Change the bar item name in Pandas

I have a test excel file like:
df = pd.DataFrame({'name':list('abcdefg'),
'age':[10,20,5,23,58,4,6]})
print (df)
name age
0 a 10
1 b 20
2 c 5
3 d 23
4 e 58
5 f 4
6 g 6
I use Pandas and matplotlib to read and plot it:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
excel_file = 'test.xlsx'
df = pd.read_excel(excel_file, sheet_name=0)
df.plot(kind="bar")
plt.show()
the result shows:
It uses the index number as the item name. How can I change it to the name stored in the name column?
You can specify columns for x and y values in plot.bar:
df.plot(x='name', y='age', kind="bar")
Or create Series first by DataFrame.set_index and select age column:
df.set_index('name')['age'].plot(kind="bar")
#if multiple columns
#df.set_index('name').plot(kind="bar")
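As a quick check that the first option really puts the names on the axis, the tick labels can be read back from the returned Axes. This is a small sketch built on the inline sample frame rather than the Excel file; the ylabel text and legend=False are choices made for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Inline sample data from the question instead of test.xlsx.
df = pd.DataFrame({'name': list('abcdefg'),
                   'age': [10, 20, 5, 23, 58, 4, 6]})

# x='name' makes pandas use that column for the bar labels.
ax = df.plot(x='name', y='age', kind='bar', legend=False)
ax.set_ylabel('age')

# Read the labels back from the axis to confirm they are the names.
tick_labels = [tick.get_text() for tick in ax.get_xticklabels()]
print(tick_labels)
```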
