Difficulty grouping barchart using Python, Pandas and Matplotlib - python-3.x

I am having difficulty getting plot.bar to group the bars together the way I have them grouped in the dataframe. The dataframe returns the grouped data correctly; however, the bar graph shows a separate bar for every line in the dataframe. Ideally, the code below should group 3-6 bars together for each department (Dept X should have bars grouped together for each type, with the count of true/false on the Y axis).
Dataframe:
dname   Type  purchased
Dept X  0     False         141
              True          270
        1     False        2020
              True         2604
        2     False        2023
              True         1047
Code:
import psycopg2
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
##connection and query data removed
df = pd.merge(df_departments[["id", "dname"]], df_widgets[["department", "widgetid", "purchased","Type"]], how='inner', left_on='id', right_on='department')
df.set_index(['dname'], inplace=True)
dx=df.groupby(['dname', 'Type','purchased'])['widgetid'].size()
dx.plot.bar(x='dname', y='widgetid', rot=90)

I can't be sure without a more reproducible example, but try unstacking the innermost level of the MultiIndex of dx before plotting:
dx.unstack().plot.bar(rot=90)
I expect this to work because when plotting a DataFrame, each column becomes a legend entry and each row becomes a category on the horizontal axis. The x and y arguments are dropped here because, after unstacking, dname and widgetid are not columns of the result, so passing them would most likely raise a KeyError.
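Here is a minimal sketch of the unstacking idea on made-up data. The department names, counts, and the second department ("Dept Y") are assumptions for illustration only, since the real query results are not shown. It also shows a variant that moves both Type and purchased into the columns, which gives one group of bars per department.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import pandas as pd

# Hypothetical stand-in for the merged dataframe from the question
df = pd.DataFrame({
    "dname": ["Dept X"] * 6 + ["Dept Y"] * 4,
    "Type": [0, 0, 1, 1, 2, 2, 0, 0, 1, 1],
    "purchased": [False, True] * 5,
    "widgetid": range(10),
})

dx = df.groupby(["dname", "Type", "purchased"])["widgetid"].size()

# Innermost level (purchased) becomes the columns: one group per (dname, Type)
wide = dx.unstack()
ax = wide.plot.bar(rot=90)

# Both Type and purchased become columns: one group per department
per_dept = dx.unstack(["Type", "purchased"])
ax2 = per_dept.plot.bar(rot=90)
```

Which variant fits depends on whether the groups on the x axis should be (department, type) pairs or departments alone.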

Related

How to create a scatter plot where values are across multiple columns?

I have a dataframe in Pandas in which the rows are observations at different times and each column is a size bin where the values represent the number of particles observed for that size bin. So it looks like the following:
       bin1  bin2  bin3  bin4  bin5
Time1    50   200    30    40     5
Time2    60    60    40   420   700
Time3    34   200    30    67    43
I would like to use plotly/cufflinks to create a scatterplot in which the x axis will be each size bin, and the y axis will be the values in each size bin. There will be three colors, one for each observation.
As I'm more experienced in Matlab, I tried indexing the values using iloc (note the example below is just trying to plot one observation):
df.iplot(kind="scatter",theme="white",x=df.columns, y=df.iloc[1,:])
But I just get a key error: 0 message.
Is it possible to use indexing when choosing x and y values in Pandas?
Rather than indexing, I think you need to better understand how pandas and matplotlib interact with each other.
Let's go step by step for your case:
1. As the pandas.DataFrame.plot documentation says, the plotted series is a column. You have the series in the rows, so you need to transpose your dataframe.
2. To create a scatterplot, you need both x and y coordinates in different columns, but you are missing the x column, so you also need to create a column with the x values in the transposed dataframe.
3. Apparently pandas does not change color by default with consecutive calls to plot (matplotlib does), so you need to pick a color map and pass a color argument; otherwise all points will have the same color.
Here is a working example:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Here I copied your data into a data.txt text file and imported it into pandas as a CSV.
# You may have a different way to get your data.
df = pd.read_csv('data.txt', sep=r'\s+', engine='python')
# I assume there is a column named 'time', which is set as the index, as you show in your post.
# set_index returns a new dataframe, so the result has to be assigned.
df = df.set_index('time')
tdf = df.transpose()  # transpose the dataframe so that each observation is a column
# Create the x values. I go from 1 to 5, but they can be different.
tdf['xval'] = np.arange(1, len(tdf) + 1)
# Choose a colormap and make a list of colors to be used, one per observation.
datacols = [datacol for datacol in tdf.columns if datacol != 'xval']
colormap = plt.cm.rainbow
colors = [colormap(i) for i in np.linspace(0, 1, len(datacols))]
# Make an empty plot; the columns will be added to the axes in the loop.
fig, axes = plt.subplots(1, 1)
for i, cl in enumerate(datacols):
    tdf.plot(x='xval', y=cl, kind="scatter", ax=axes, color=colors[i])
plt.show()
This plots the following image:
Here is a tutorial on picking colors in matplotlib.
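For comparison, here is a self-contained sketch that builds the sample table inline instead of reading data.txt, and that calls matplotlib's ax.scatter directly rather than DataFrame.plot. That is a different technique from the answer above: matplotlib advances its color cycle on each scatter call, so no explicit color list is needed. The names tdf and x are just illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame(
    [[50, 200, 30, 40, 5],
     [60, 60, 40, 420, 700],
     [34, 200, 30, 67, 43]],
    index=["Time1", "Time2", "Time3"],
    columns=["bin1", "bin2", "bin3", "bin4", "bin5"],
)

tdf = df.T                        # one column per observation
x = np.arange(1, len(tdf) + 1)    # positions 1..5 for the five bins

fig, ax = plt.subplots()
for name in tdf.columns:
    ax.scatter(x, tdf[name], label=name)  # a new color for each call

ax.set_xticks(x)
ax.set_xticklabels(tdf.index)     # label the positions with the bin names
ax.legend()
```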

P-value normal test for multiple rows

I got the following simple code to calculate normality over an array:
import numpy as np
import pandas as pd
import scipy.stats as stats
# raw string, so that \f in the path is not read as a form-feed escape
df = pd.read_excel(r"directory\file.xlsx")
x = df.iloc[:, 1:].values.flatten()
stats.normaltest(x, axis=None)
This nicely gives me a p-value and a statistic.
The only thing I want now is to add 2 columns to the file with this p-value and statistic and, if I have multiple rows, to do it for all the rows (calculate the p-value and statistic for each row and add 2 columns containing these values).
Can someone help?
If you want to calculate normaltest row-wise, you should not flatten your data into x; use axis=1 instead, such as:
df = pd.DataFrame(np.random.random(105).reshape(5,21)) # to generate data
# calculate normaltest row-wise without the first column like you
df['stat'] ,df['p'] = stats.normaltest(df.iloc[:,1:],axis=1)
Then df contains two columns, 'stat' and 'p', with the values you are looking for, if I understand correctly.
Note: to be able to perform normaltest, you need at least 8 values (in my experience), so you need at least 8 columns in df.iloc[:,1:]; otherwise it will raise an error. Even then, it would be better to have more than 20 values in each row.
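Here is a minimal, self-contained sketch of the row-wise test on random data. The 5 x 21 shape mirrors the example above; the data itself is made up, and the values are extracted before the new columns are added so that they cannot leak into the test.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((5, 21)))   # made-up data: 5 rows, 21 columns

# Skip the first column, as in the question: 20 values per row
values = df.iloc[:, 1:].to_numpy()

# axis=1 runs one test per row; the result carries one statistic and one p-value per row
result = stats.normaltest(values, axis=1)
df["stat"] = result.statistic
df["p"] = result.pvalue
```

The resulting dataframe could then be written back with df.to_excel, assuming an Excel writer engine such as openpyxl is installed.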

Is it possible to explicitly set the order of the stacks in a matplotlib stackplot?

I want to explicitly set the order of the stacks in a Matplotlib stackplot. Here is an example:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
np.random.seed(1)
df = pd.DataFrame(np.random.randint(0,100,size=(100,4)),columns=list('ABCD'))
df.plot(kind='area',stacked=True,figsize=(20,10));
This produces the following image:
The last row of the dataframe from:
df.tail(1)
is:
     A   B   C   D
99  16  30  84  57
Here is what I want to achieve:
I want to re-order the plot of the stacks such that the stacks are plotted from the bottom up A, B, D, C i.e. the columns ordered from the bottom up, by the order of their increasing values in the last row of the df.
So far, I have tried re-ordering explicitly the columns in the df before plotting:
df[['A','B','D','C']].plot(kind='area',stacked=True,figsize=(20,10))
but this produces exactly the same chart as above.
Thank you for any help here!
The graphs are not the same. Look at the areas beneath the red graph for a particular x. The shapes of the green and blue shaded areas differ between the two graphs.
And now,
df[['A','B','D','C']].plot(kind='area',stacked=True,figsize=(20,10))
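Here is a sketch of deriving the column order from the last row instead of typing it by hand. The random data is a stand-in, and its last row is pinned to the values quoted in the question so that the resulting order is known; order is just an illustrative name.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame(rng.integers(0, 100, size=(100, 4)), columns=list("ABCD"))

# Pin the last row to the values quoted in the question (A=16, B=30, C=84, D=57)
df.iloc[-1] = [16, 30, 84, 57]

# Columns sorted by increasing value in the last row: bottom of the stack first
order = df.iloc[-1].sort_values().index.tolist()

ax = df[order].plot(kind="area", stacked=True, figsize=(20, 10))
```

Since the areas are stacked in column order, reordering the columns before plotting is what changes the stacking.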

How can I plot a group by boxplot in pandas dropping unused categories?

To keep a long story short: how can I plot a grouped boxplot from a category column in pandas and show only the categories present in the subset, instead of all possible categories?
[reproducible example]
I have a pandas dataframe with a factor column, and I want to plot a boxplot. If I plot by the factor, it is OK. If I take a subset and plot the boxplots by the factor, it is also OK, and only factors present in the subset are plotted. But if I have set the column as a category, then all categories are plotted in the boxplot even if they are not present.
- Create the dataframe
import pandas as pd
import numpy as np
x = ['A']*150 + ['B']*150 + ['C']*150 + ['D']*150 + ['E']*150 + ['F']*150
y = np.random.randn(900)
z = ['X']*450 + ['Y']*450
df = pd.DataFrame({'Letter':x, 'N':y, 'type':z})
print(df.head())
print(df.tail())
- Plot by factor
df.boxplot(by='Letter')
- Plot a subset (only categories in the subset are plotted, but sorted alphabetically, not in the wanted order)
df[df['type']=='X'].boxplot(by='Letter')
- Convert the factor to a category and plot the subset in order to have the set ordered: All categories are plotted even if they are missing from the subset. The good part is that they are in "wanted_sort_order"
df['Letter2'] = df['Letter'].copy()
df['Letter2'] = df['Letter2'].astype('category')
# set a category in order to sort the factor in specific order
# (assigned back, because the inplace argument is not available in recent pandas versions)
df['Letter2'] = df['Letter2'].cat.set_categories(df['Letter2'].drop_duplicates().tolist()[::-1])
df[df['type']=='X'].boxplot(by='Letter2')
After creating the DataFrame (first code block), try the following:
df['Letter2'] = pd.Categorical(df['Letter'], list('BAC'))
df[df['type']=='X'].boxplot(by='Letter2')
Result:
What pd.Categorical does here is simply set to NaN whatever isn't in your category list (the second parameter); .boxplot() ignores those rows and plots only the categories you are looking for.
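An alternative sketch, not part of the answer above: keep the full ordered category list, then call Series.cat.remove_unused_categories() on the subset before plotting. This avoids listing the present categories by hand while keeping the wanted order. The names wanted_order and subset are illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import numpy as np
import pandas as pd

# Same construction as in the question
letters = ['A'] * 150 + ['B'] * 150 + ['C'] * 150 + ['D'] * 150 + ['E'] * 150 + ['F'] * 150
df = pd.DataFrame({'Letter': letters,
                   'N': np.random.randn(900),
                   'type': ['X'] * 450 + ['Y'] * 450})

# Reverse order of first appearance, as in the question: F, E, D, C, B, A
wanted_order = df['Letter'].drop_duplicates().tolist()[::-1]
df['Letter2'] = pd.Categorical(df['Letter'], categories=wanted_order)

# Take the subset, then drop the categories that do not occur in it
subset = df[df['type'] == 'X'].copy()
subset['Letter2'] = subset['Letter2'].cat.remove_unused_categories()

subset.boxplot(column='N', by='Letter2')
```

The remaining categories keep their relative order, so the boxes should appear as C, B, A.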

Pandas Matplotlib Line Graph

Given the following data frame:
import pandas as pd
import numpy as np
df = pd.DataFrame(
{'YYYYMM':[201603,201503,201403,201303,201603,201503,201403,201303],
'Count':[5,6,2,7,4,7,8,9],
'Group':['A','A','A','A','B','B','B','B']})
df
   Count Group  YYYYMM
0      5     A  201603
1      6     A  201503
2      2     A  201403
3      7     A  201303
4      4     B  201603
5      7     B  201503
6      8     B  201403
7      9     B  201303
I need to generate a line graph with one line per group with a summary table at the bottom. Something like this:
I need each instance of 'YYYYMM' to be treated like a year by Pandas/Matplotlib.
So far, this seems to help, but I'm not sure if it will do the trick:
df['YYYYMM']=df['YYYYMM'].astype(str).str[:-2].astype(np.int64)
Then, I did this to pivot the data:
t=df.pivot_table(df,index=['YYYYMM'],columns=['Group'],aggfunc=np.sum)
       Count
Group      A  B
YYYYMM
2013       7  9
2014       2  8
2015       6  7
2016       5  4
Then, I tried to plot it:
import matplotlib.pyplot as plt
%matplotlib inline
fig, ax = plt.subplots(1,1)
t.plot(table=t,ax=ax)
...and this happened:
I'd like to do the following:
1. remove all lines (borders) from the table at the bottom
2. remove the jumbled text in the table
3. remove the x axis tick labels (it should just show the years for tick labels)
I can clean up the rest myself (remove legend and borders, etc..).
Thanks in advance!
I may not have fully understood what you mean by 1., since you are showing the table lines in your reference. I have also not understood whether you want to transpose the table.
What you may be looking for is:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.DataFrame(
{'YYYYMM':[201603,201503,201403,201303,201603,201503,201403,201303],
'Count':[5,6,2,7,4,7,8,9],
'Group':['A','A','A','A','B','B','B','B']})
df['YYYYMM']=df['YYYYMM'].astype(str).str[:-2].astype(int)
t=pd.pivot_table(df, values='Count', index='YYYYMM',columns='Group',aggfunc='sum')
t.index.name = None
fig, ax = plt.subplots(1,1)
t.plot(table=t,ax=ax)
ax.xaxis.set_major_formatter(plt.NullFormatter())
plt.tick_params(
    axis='x',           # changes apply to the x-axis
    which='both',       # both major and minor ticks are affected
    bottom=False,       # ticks along the bottom edge are off
    top=False,          # ticks along the top edge are off
    labelbottom=False)  # labels along the bottom edge are off
plt.show()
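The data preparation can be checked on its own, without plotting. The sketch below uses integer division to drop the month digits, which is a swapped-in alternative to the string slicing used above, and stores the result in a new column named Year (an illustrative name, not from the question).

```python
import pandas as pd

df = pd.DataFrame(
    {'YYYYMM': [201603, 201503, 201403, 201303, 201603, 201503, 201403, 201303],
     'Count': [5, 6, 2, 7, 4, 7, 8, 9],
     'Group': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B']})

# 201603 // 100 == 2016: integer division removes the two month digits
df['Year'] = df['YYYYMM'] // 100

# One row per year, one column per group, summing Count within each cell
t = df.pivot_table(values='Count', index='Year', columns='Group', aggfunc='sum')
print(t)
```

The values match the pivot table shown in the question, for example 7 for group A in 2013 and 4 for group B in 2016.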
