How can I plot a categorical vs categorical plot? - python-3.x

I want to compare the category counts in one column against the category counts in a second column. I have two columns:
1. Max_glu_serum with categories: None, Norm, <200, <300.
2. Readmitted with categories: No, <30, >30.
I want a plot that shows the count for each pair of categories, for example how many patients had max_glu_serum = '<300' and were readmitted in '>30' days.
I tried the following code:
sns.catplot(y=train_data_wmis['max_glu_serum'],
hue=train_data_wmis['readmitted'],
kind="count",
palette="pastel", edgecolor=".6", dropna=True)
but it throws the following error:
TypeError Traceback (most recent call last)
<ipython-input-384-1be2c9032203> in <module>
----> 1 sns.catplot(y=train_data_wmis['max_glu_serum'], hue=train_data_wmis['readmitted'], kind="count", palette="pastel", edgecolor=".6", dropna=True)
F:\Anaconda3\lib\site-packages\seaborn\categorical.py in catplot(x, y, hue, data, row, col, col_wrap, estimator, ci, n_boot, units, order, hue_order, row_order, col_order, kind, height, aspect, orient, color, palette, legend, legend_out, sharex, sharey, margin_titles, facet_kws, **kwargs)
3750
3751 # Initialize the facets
-> 3752 g = FacetGrid(**facet_kws)
3753
3754 # Draw the plot onto the facets
F:\Anaconda3\lib\site-packages\seaborn\axisgrid.py in __init__(self, data, row, col, hue, col_wrap, sharex, sharey, height, aspect, palette, row_order, col_order, hue_order, hue_kws, dropna, legend_out, despine, margin_titles, xlim, ylim, subplot_kws, gridspec_kws, size)
255 # Make a boolean mask that is True anywhere there is an NA
256 # value in one of the faceting variables, but only if dropna is True
--> 257 none_na = np.zeros(len(data), np.bool)
258 if dropna:
259 row_na = none_na if row is None else data[row].isnull()
TypeError: object of type 'NoneType' has no len()
Can someone help me, please!

I tried a couple of things and finally found one solution to the above problem. I defined the following function:
def plot_stack(column_1, column_2):
    # Cross-tabulate the two categorical columns into a table of counts
    plot_stck = pd.crosstab(index=column_1, columns=column_2)
    # Draw one stacked bar per category of column_1
    plot_stck.plot(kind='bar', figsize=(8, 8), stacked=True)
    return
Then,
plot_stack(train_data_wmis['max_glu_serum'], train_data_wmis['readmitted'])
Output:
Stacked Plot of 'max_glu_serum' and 'readmitted'
Please comment if a better solution is available via Seaborn. Thanks.
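For a Seaborn version, here is a minimal sketch. The traceback above ends in len(data) with data being None, which suggests passing the frame through data= and naming the columns as strings rather than passing the Series directly. The small train_data_wmis frame below is made up for illustration; pd.crosstab is used next to the plot to read off the exact count for one pair of categories.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import pandas as pd
import seaborn as sns

# Made-up stand-in for the real train_data_wmis frame (assumption).
train_data_wmis = pd.DataFrame({
    "max_glu_serum": ["None", "Norm", "<300", "<300", "<200", "<300", "None"],
    "readmitted":    ["No",   ">30",  ">30",  "<30",  "No",   ">30",  ">30"],
})

# Pass the frame via data= and refer to the columns by name.
ax = sns.countplot(data=train_data_wmis, y="max_glu_serum",
                   hue="readmitted", palette="pastel", edgecolor=".6")

# The same counts as a table; one cell answers "how many '<300' with '>30'".
counts = pd.crosstab(train_data_wmis["max_glu_serum"],
                     train_data_wmis["readmitted"])
n_300_gt30 = counts.loc["<300", ">30"]
```

sns.catplot(..., kind="count") takes the same data=, y= and hue= arguments if a FacetGrid is preferred over a single Axes.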

Related

Plotting and modeling text data with swarmplot on python

I have a CSV file and I want to use the seaborn library's swarmplot to plot the relation between two selected columns.
This is a sample of rows from the CSV file that I am working with:
SeriesCode Year DESCRIPTION
21 IC.FRM.CORR.ZS YR2004 The sample was drawn from the manufacturing sector only.
38 SP.ADO.TFRT YR2010 Interpolated using data for 2007 and 2012.
10 SP.ADO.TFRT YR2000 Interpolated using data for 1997 and 2002.
18 IC.FRM.CORR.ZS YR2003 The sample was drawn from the manufacturing sector only.
32 IC.TAX.METG YR2007 The sample was drawn from the manufacturing sector only.
28 SP.ADO.TFRT YR2006 Interpolated using data for 2002 and 2007.
And I have this piece of code:
import re
import pandas
df1=pandas.read_csv("./Jobs_csv/JobsSeries-Time.csv")
ifcz=df1[df1['SeriesCode'].str.contains("IC.FRM.CORR.ZS",flags=re.IGNORECASE,regex=True)].DESCRIPTION
ify=df1[df1['SeriesCode'].str.contains("IC.FRM.CORR.ZS",flags=re.IGNORECASE,regex=True)].Year
sb.swarmplot(x="ifcz", y="ify", data=df1)
But whenever I run it
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-68-3c5933ceba52> in <module>()
----> 1 sb.swarmplot(x="ifcz", y="ify", data=df1)
/home/user/.local/lib/python3.6/site-packages/seaborn/categorical.py in swarmplot(x, y, hue, data, order, hue_order, dodge, orient, color, palette, size, edgecolor, linewidth, ax, **kwargs)
2975
2976 plotter = _SwarmPlotter(x, y, hue, data, order, hue_order,
-> 2977 dodge, orient, color, palette)
2978 if ax is None:
2979 ax = plt.gca()
/home/user/.local/lib/python3.6/site-packages/seaborn/categorical.py in __init__(self, x, y, hue, data, order, hue_order, dodge, orient, color, palette)
1214 dodge, orient, color, palette):
1215 """Initialize the plotter."""
-> 1216 self.establish_variables(x, y, hue, data, orient, order, hue_order)
1217 self.establish_colors(color, palette, 1)
1218
/home/user/.local/lib/python3.6/site-packages/seaborn/categorical.py in establish_variables(self, x, y, hue, data, orient, order, hue_order, units)
150 if isinstance(var, str):
151 err = "Could not interpret input '{}'".format(var)
--> 152 raise ValueError(err)
153
154 # Figure out the plotting orientation
ValueError: Could not interpret input 'ifcz'
I get these errors. I don't know why it gives this error or how I can fix it, and I have become unsure whether swarmplot is the right plot for this. If you think swarmplot shouldn't be used here, can you name other plots that could model this?
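The message "Could not interpret input 'ifcz'" indicates that seaborn looked for a column called ifcz in df1: when data= is given, strings passed as x and y are treated as column names, not as names of Python variables. A minimal sketch of one way around it, assuming the sample rows from the question, is to filter the frame first and then pass real column names. The year_num column is a hypothetical helper added so that one axis is numeric, and regex=False is used because "." in a regular expression matches any character.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import pandas as pd
import seaborn as sns

# Sample rows copied from the question (descriptions shortened to the two that occur).
drawn = "The sample was drawn from the manufacturing sector only."
df1 = pd.DataFrame({
    "SeriesCode": ["IC.FRM.CORR.ZS", "SP.ADO.TFRT", "SP.ADO.TFRT",
                   "IC.FRM.CORR.ZS", "IC.TAX.METG", "SP.ADO.TFRT"],
    "Year": ["YR2004", "YR2010", "YR2000", "YR2003", "YR2007", "YR2006"],
    "DESCRIPTION": [drawn,
                    "Interpolated using data for 2007 and 2012.",
                    "Interpolated using data for 1997 and 2002.",
                    drawn, drawn,
                    "Interpolated using data for 2002 and 2007."],
})

# Literal, case-insensitive match on the series code.
mask = df1["SeriesCode"].str.contains("IC.FRM.CORR.ZS", case=False, regex=False)
subset = df1[mask].copy()

# Hypothetical helper column: "YR2004" -> 2004, so the y axis is numeric.
subset["year_num"] = subset["Year"].str.replace("YR", "", regex=False).astype(int)

# x and y now name columns that actually exist in the frame passed as data=.
ax = sns.swarmplot(data=subset, x="SeriesCode", y="year_num")

# For text-versus-text columns, a table of counts is an alternative to a plot.
table = pd.crosstab(df1["SeriesCode"], df1["Year"])
```

Because DESCRIPTION and Year are both text, a count table or a count plot may suit this data better than a swarm plot, which places points along a numeric axis.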

KeyError: "['cut'] not found in axis"

I am working with thousands of lines of data trying to narrow a search for certain grains. To do this, I have an 'Asset' column with about 20 different values, of which I need to receive the sum of all of the lines in the adjacent column 'Load'.
I would like to cut the unnecessary rows out of my data set. To start, I relabeled all of the extra assets as 'cut' (as shown in the example below) so that I could manage one .drop command. Here is how it is coded:
df14['Asset'] = df14["Asset"].str.replace('BEANS', 'cut')
df14.drop("cut", axis=0)
set(df14['Asset'])
This is the error I have received:
KeyError Traceback (most recent call last)
<ipython-input-593-40006512df80> in <module>
----> 1 df14.drop("cut", axis=0)
2 set(df14['Asset'])
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py in drop(self, labels, axis, index, columns, level, inplace, errors)
4100 level=level,
4101 inplace=inplace,
-> 4102 errors=errors,
4103 )
4104
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\generic.py in drop(self, labels, axis, index, columns, level, inplace, errors)
3912 for axis, labels in axes.items():
3913 if labels is not None:
-> 3914 obj = obj._drop_axis(labels, axis, level=level, errors=errors)
3915
3916 if inplace:
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\generic.py in _drop_axis(self, labels, axis, level, errors)
3944 new_axis = axis.drop(labels, level=level, errors=errors)
3945 else:
-> 3946 new_axis = axis.drop(labels, errors=errors)
3947 result = self.reindex(**{axis_name: new_axis})
3948
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\indexes\base.py in drop(self, labels, errors)
5338 if mask.any():
5339 if errors != "ignore":
-> 5340 raise KeyError("{} not found in axis".format(labels[mask]))
5341 indexer = indexer[~mask]
5342 return self.delete(indexer)
KeyError: "['cut'] not found in axis"
I have tried several commands to remove said lines, like:
df14.drop(["cut"], inplace = True)
df14[~df14['Asset'].isin(to_drop)]
df14[df14['Asset'].str.contains('cut', na = True)]
All of them give the same result.
When I code
df14 = df14[~df14["Asset"].str.contains('BEANS')]
It does not remove the Load number, which is the next column over, from my final calculations.
Is it possible to remove all rows of data with a certain label so I can trim from 20 assets to 7 assets?
Thank you
DataFrame.drop works on labels, either column-wise or row-wise: you give a column name to drop a column, or an index label to drop a row. axis=0 means it looks in the index. Since you don't have an index label named "cut", it raises the error.
I recommend doing it by:
df = df.loc[df['Asset'] != 'cut']
I believe that df14.drop("cut", axis=0) is failing because it is looking for the value "cut" in the index of df14. You could potentially set the Asset column as the index (see the pandas documentation on drop for how), but I think a better solution might be something along the lines of
df14 = df14.query('Asset != "cut"')
I can't say whether this is the fastest solution; I usually work with small-ish datasets, so I have not had to worry about performance much.
This should do the job.
Here you are basically selecting all rows other than 'cut'
df14 = df14.loc[df14['Asset'] != 'cut']
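The approaches in these answers can be compared on a small example. This is only a sketch: the asset names other than BEANS and all Load numbers are invented, and the keep list stands in for the 7 assets the question wants to retain.

```python
import pandas as pd

# Invented stand-in for df14 (assumption).
df14 = pd.DataFrame({
    "Asset": ["BEANS", "CORN", "WHEAT", "BEANS", "CORN"],
    "Load":  [10, 20, 30, 40, 50],
})

# drop() looks labels up in the index, so hand it the index labels
# of the rows whose Asset matches, not the Asset value itself.
dropped = df14.drop(df14.index[df14["Asset"] == "BEANS"])

# Boolean mask: keep every row whose Asset is not "BEANS".
masked = df14.loc[df14["Asset"] != "BEANS"]

# With many unwanted assets, listing the ones to keep is often shorter.
keep = ["CORN", "WHEAT"]  # hypothetical keep list
kept = df14[df14["Asset"].isin(keep)]

# Sum Load per remaining asset; the removed rows no longer contribute.
load_by_asset = kept.groupby("Asset")["Load"].sum()
```

Each of these returns a new frame, so the result has to be assigned back (for example df14 = kept) before the sums are computed; otherwise the removed rows are still in df14.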

User Warning: The following kwargs were not used by contour: 'label', 'color'

I'm trying to create a comparison plot using Seaborn's PairGrid on my dataset. My data set has 6 columns that I am trying to plot using scatter() in the .map_upper step of the PairGrid, which I'm applying to the entire dataframe. Here is a quick peek at my dataframe object; the 'year' column is set as the dataframe's index.
Here are the data types of my dataframe comp_pct_chg_df:
year object
Amsterdam float64
Barcelona float64
Kingston float64
Milan float64
Philadelphia float64
Global float64
dtype: object
Here is my erroneous code below:
# Creating a comparison plot (.PairGrid()) of all my cities' and global data's average percent change in temperature
# Set up my figure by naming it 'pct_chg_yrly_fig', then call PairGrid on the DataFrame
pct_chg_yrly_fig = sns.PairGrid(comp_pct_chg_df.dropna())
# Using map_upper we can specify what the upper triangle will look like.
pct_chg_yrly_fig.map_upper(plt.scatter,color='purple')
# We can also define the lower triangle in the figure, including the plot type (KDE) or the color map (BluePurple)
pct_chg_yrly_fig.map_lower(sns.kdeplot,cmap='cool_d')
# Finally we'll define the diagonal as a series of histogram plots of the yearly average percent change in temperature
pct_chg_yrly_fig.map_diag(plt.hist,histtype='step',linewidth=3,bins=30)
# Adding a legend
pct_chg_yrly_fig.add_legend()
Some of the visualizations do plot, like the one from .map_lower(), which turned out great. However, I'd like the scatter plot drawn by .map_upper() to show each city in a different color. Right now it's monochromatic, and it is hard to tell which data points belong to which city. Lastly, my .map_diag() doesn't plot at all. I don't know what I'm doing wrong. I've looked at the ValueError message I received (shown below) and tried changing dozens of **kwargs, label and color specifically, to no avail. Help would be greatly appreciated.
Here is the ValueError msg I'm receiving:
ValueError Traceback (most recent call last)
<ipython-input-38-3fcf1b69d4ef> in <module>()
11
12 # Finally we'll define the diagonal as a series of histogram plots of the yearly average percent change in temperature
---> 13 pct_chg_yrly_fig.map_diag(plt.hist,histtype='step',linewidth=3,bins=30)
14
15 # Adding a legend
~/anaconda3/lib/python3.6/site-packages/seaborn/axisgrid.py in map_diag(self, func, **kwargs)
1361
1362 if "histtype" in kwargs:
-> 1363 func(vals, color=color, **kwargs)
1364 else:
1365 func(vals, color=color, histtype="barstacked", **kwargs)
~/anaconda3/lib/python3.6/site-packages/matplotlib/pyplot.py in hist(x, bins, range, density, weights, cumulative, bottom, histtype, align, orientation, rwidth, log, color, label, stacked, normed, hold, data, **kwargs)
3023 histtype=histtype, align=align, orientation=orientation,
3024 rwidth=rwidth, log=log, color=color, label=label,
-> 3025 stacked=stacked, normed=normed, data=data, **kwargs)
3026 finally:
3027 ax._hold = washold
~/anaconda3/lib/python3.6/site-packages/matplotlib/__init__.py in inner(ax, *args, **kwargs)
1715 warnings.warn(msg % (label_namer, func.__name__),
1716 RuntimeWarning, stacklevel=2)
-> 1717 return func(ax, *args, **kwargs)
1718 pre_doc = inner.__doc__
1719 if pre_doc is None:
~/anaconda3/lib/python3.6/site-packages/matplotlib/axes/_axes.py in hist(***failed resolving arguments***)
6137 color = mcolors.to_rgba_array(color)
6138 if len(color) != nx:
-> 6139 raise ValueError("color kwarg must have one color per dataset")
6140
6141 # If bins are not specified either explicitly or via range,
ValueError: color kwarg must have one color per dataset
I also noticed that my index, the year object, is plotting out in the upper left corner of my PairGrid. It looks like a bunch of vertical lines plotted next to one another. Not sure why it’s plotting but could it be because the values ( years 1743 - 2015) end in ‘.0’? I noticed this when I put the data frame together (and I don’t know how to drop it... Python newb here) so I changed the year column’s data type from float64 to string and set it as my index. I thought doing this would make my index ‘unworkable’ meaning even though the values are numbers, the data type is set to string so no calculations could be done on them? Am I missing something here?
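On the year column showing up in the grid: PairGrid accepts a vars argument that restricts which columns are plotted, so the year can stay in the frame without getting its own row and column. The sketch below uses random numbers in place of the real temperature data and only three of the columns, both of which are assumptions for illustration. Whether it also clears the ValueError depends on the installed seaborn and matplotlib versions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Random stand-in for comp_pct_chg_df (assumption), with year stored as "1975.0"-style text.
rng = np.random.default_rng(0)
cities = ["Amsterdam", "Barcelona", "Global"]
comp_pct_chg_df = pd.DataFrame(rng.normal(size=(40, 3)), columns=cities)
comp_pct_chg_df.insert(0, "year", [str(float(y)) for y in range(1975, 2015)])

# Drop the trailing ".0" by going through float to int.
comp_pct_chg_df["year"] = comp_pct_chg_df["year"].astype(float).astype(int)

# vars= limits the grid to the listed columns, so year is left out.
g = sns.PairGrid(comp_pct_chg_df.dropna(), vars=cities)
g.map_upper(plt.scatter, color="purple")
g.map_diag(plt.hist, histtype="step", linewidth=3, bins=30)
```

In a pair grid each panel compares two cities, so a single point belongs to a pair of columns rather than to one city. Coloring by city would need the data in long form (one row per year and city) with the city as a hue column.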

Make subplots of the histogram in pandas dataframe using matplotlib library?

I have the following tab-separated data:
CHROM ms02g:PI num_Vars_by_PI range_of_PI total_haplotypes total_Vars
1 1,2 60,6 2820,81 2 66
2 9,8,10,7,11 94,78,10,69,25 89910,1102167,600,1621365,636 5 276
3 5,3,4,6 6,12,14,17 908,394,759,115656 4 49
4 17,18,22,16,19,21,20 22,11,3,16,7,12,6 1463,171,149,256,157,388,195 7 77
5 13,15,12,14 56,25,96,107 2600821,858,5666,1792 4 284
7 24,26,29,25,27,23,30,28,31 12,31,19,6,12,23,9,37,25 968,3353,489,116,523,1933,823,2655,331 9 174
8 33,32 53,35 1603,2991338 2 88
I am using this code to build histogram plots with a subplot for each CHROM:
with open(outputdir + '/' + 'hap_size_byVar_' + soi + '_' + prefix + '.png', 'wb') as fig_initial:
    fig, ax = plt.subplots(nrows=len(hap_stats), sharex=True)
    for i, data in hap_stats.iterrows():
        # first convert data to list of integers
        data_i = [int(x) for x in data['num_Vars_by_PI'].split(',')]
        ax[i].hist(data_i, label=str(data['CHROM']), alpha=0.5)
        ax[i].legend()
    plt.xlabel('size of the haplotype (number of variants)')
    plt.ylabel('frequency of the haplotypes')
    plt.suptitle('histogram of size of the haplotype (number of variants) \n'
                 'for each chromosome')
    plt.savefig(fig_initial)
Everything is fine except two problems:
1. The Y-label 'frequency of the haplotypes' is not positioned properly in the output plot.
2. When the data contain only one row (see data below), the subplots cannot be created and I get a TypeError, even though it should be possible to make the subplot with only one index.
Dataframe with only one line of data:
CHROM ms02g:PI num_Vars_by_PI range_of_PI total_haplotypes total_Vars
2 9,8,10,7,11 94,78,10,69,25 89910,1102167,600,1621365,636 5 276
TypeError :
Traceback (most recent call last):
File "phase-Extender.py", line 1806, in <module>
main()
File "phase-Extender.py", line 502, in main
compute_haplotype_stats(initial_haplotype, soi, prefix='initial')
File "phase-Extender.py", line 1719, in compute_haplotype_stats
ax[i].hist(data_i, label=str(data['CHROM']), alpha=0.5)
TypeError: 'AxesSubplot' object does not support indexing
How can I fix these two issues ?
Your first problem comes from the fact that you are using plt.ylabel() at the end of your loop. pyplot functions act on the current active axes object, which, in this case, is the last one created by subplots(). If you want your label to be centered over your subplots, the easiest might be to create a text object centered vertically in the figure.
# replace plt.ylabel('frequency of the haplotypes') with:
fig.text(.02, .5, 'frequency of the haplotypes', ha='center', va='center', rotation='vertical')
You can play around with the x-position (0.02) until you find a position you're happy with. The coordinates are figure coordinates: (0,0) is bottom left and (1,1) is top right. Using 0.5 as the y-position ensures the label is vertically centered in the figure.
The second problem is due to the fact that, when nrows=1, plt.subplots() directly returns the Axes object instead of an array of axes. There are two ways to work around this:
1 - test whether you have only one line, and then replace ax with a list:
fig, ax = plt.subplots(nrows=len(hap_stats), sharex=True)
if len(hap_stats)==1:
    ax = [ax]
(...)
2 - use the option squeeze=False in your call to plt.subplots(). As explained in the documentation, this option forces subplots() to always return a 2D array. Therefore you'll have to slightly modify how you index your axes:
fig, ax = plt.subplots(nrows=len(hap_stats), sharex=True, squeeze=False)
for i, data in hap_stats.iterrows():
    (...)
    ax[i,0].hist(data_i, label=str(data['CHROM']), alpha=0.5)
    (...)
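Both fixes can be put together in one short sketch, using the single-row data from the question. Using enumerate() for the subplot position is an addition not found in the answer: iterrows() yields index labels, which only match the subplot position when the index runs 0..n-1.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# The one-row case from the question (only the columns used here).
hap_stats = pd.DataFrame({
    "CHROM": [2],
    "num_Vars_by_PI": ["94,78,10,69,25"],
})

# squeeze=False always returns a 2D array of axes, even for a single row.
fig, ax = plt.subplots(nrows=len(hap_stats), sharex=True, squeeze=False)

# pos is the subplot position, independent of the frame's index labels.
for pos, (_, row) in enumerate(hap_stats.iterrows()):
    data_i = [int(x) for x in row["num_Vars_by_PI"].split(",")]
    ax[pos, 0].hist(data_i, label=str(row["CHROM"]), alpha=0.5)
    ax[pos, 0].legend()

ax[-1, 0].set_xlabel("size of the haplotype (number of variants)")

# One shared y-label, centered vertically in figure coordinates.
fig.text(0.02, 0.5, "frequency of the haplotypes",
         ha="center", va="center", rotation="vertical")
fig.suptitle("histogram of size of the haplotype (number of variants)\n"
             "for each chromosome")
```

The same code runs unchanged for the seven-row table, since ax keeps its (nrows, 1) shape in both cases.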

matplotlib boxplot with split y-axis

I would like to make a box plot with data similar to this
d = {'Education': [1,1,1,1,2,2,2,2,2,3,3,3,3,4,4,4,4],
'Hours absent': [3, 100,5,7,2,128,4,6,7,1,2,118,2,4,136,1,1]}
df = pd.DataFrame(data=d)
df.head()
This works beautifully:
df.boxplot(column=['Hours absent'] , by=['Education'])
plt.ylim(0, 140)
plt.show()
But the outliers are far away, therefore I would like to split the y-axis.
But here the boxplot commands "column" and "by" are not accepted anymore. So instead of splitting the data by education, I only get one merged data point.
This is my code:
dfnew = df[['Hours absent', 'Education']]  # In reality I take the different columns from a much bigger dataset
fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True)
ax1.boxplot(dfnew['Hours absent'])
ax1.set_ylim(40, 140)
ax2.boxplot(dfnew['Hours absent'])
ax2.set_ylim(0, 40)
ax1.spines['bottom'].set_visible(False)
ax2.spines['top'].set_visible(False)
ax1.xaxis.tick_top()
ax1.tick_params(labeltop=False) # don't put tick labels at the top
ax2.xaxis.tick_bottom()
d = .015 # how big to make the diagonal lines in axes coordinates
# arguments to pass to plot, just so we don't keep repeating them
kwargs = dict(transform=ax1.transAxes, color='k', clip_on=False)
ax1.plot((-d, +d), (-d, +d), **kwargs) # top-left diagonal
ax1.plot((1 - d, 1 + d), (-d, +d), **kwargs) # top-right diagonal
kwargs.update(transform=ax2.transAxes) # switch to the bottom axes
ax2.plot((-d, +d), (1 - d, 1 + d), **kwargs) # bottom-left diagonal
ax2.plot((1 - d, 1 + d), (1 - d, 1 + d), **kwargs) # bottom-right diagonal
plt.show()
These are the things I tried (I always changed this both for the first and second subplot) and the errors I got.
ax1.boxplot(dfnew['Hours absent'],dfnew['Education'])
# The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
ax1.boxplot(column=dfnew['Hours absent'], by=dfnew['Education'])
# boxplot() got an unexpected keyword argument 'column'
ax1.boxplot(dfnew['Hours absent'], by=dfnew['Education'])
# boxplot() got an unexpected keyword argument 'by'
I also tried to convert data into array for y axis and list for x axis:
data = df[['Hours absent']].as_matrix()
labels= list(df['Education'])
print(labels)
print(len(data))
print(len(labels))
print(type(data))
print(type(labels))
And I substituted in the plot command like this:
ax1.boxplot(x=data, labels=labels)
ax2.boxplot(x=data, labels=labels)
Now the error is ValueError: Dimensions of labels and X must be compatible.
But they are both 17 long, I don't understand what is going wrong here.
You are overcomplicating this: the code for breaking the y-axis is independent of the code for plotting the boxplot. Nothing keeps you from using df.boxplot. It will add some labels and titles you do not want, but that is easy to fix.
df.boxplot(column='Hours absent', by='Education', ax=ax1)
ax1.set_xlabel('')
ax1.set_ylim(ymin=90)
df.boxplot(column='Hours absent', by='Education', ax=ax2)
ax2.set_title('')
ax2.set_ylim(ymax=50)
fig.subplots_adjust(top=0.87)
Of course you can also use matplotlib's boxplot, as long as you provide the parameters it needs. According to the docstring, it will make "a box and whisker plot for each column of x or each vector in sequence x", which means you have to do the "by" part yourself.
grouper = df.groupby('Education')['Hours absent']
x = [grouper.get_group(k) for k in grouper.groups]
ax1.boxplot(x)
ax1.set_ylim(ymin=90)
ax2.boxplot(x)
ax2.set_ylim(ymax=50)
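Put together with the split-axis setup from the question, the groupby variant might look like the sketch below. Converting each group to a NumPy array and setting the tick labels by hand are choices made here for illustration, not part of the answer above.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

d = {'Education': [1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4],
     'Hours absent': [3, 100, 5, 7, 2, 128, 4, 6, 7, 1, 2, 118, 2, 4, 136, 1, 1]}
df = pd.DataFrame(data=d)

# Do the "by" part manually: one array of values per education level.
grouped = df.groupby('Education')['Hours absent']
keys = list(grouped.groups)
x = [grouped.get_group(k).to_numpy() for k in keys]

# Draw the same boxes twice and show a different y-range in each panel.
fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True)
ax1.boxplot(x)
ax1.set_ylim(bottom=90)   # upper panel: outliers only
ax2.boxplot(x)
ax2.set_ylim(top=50)      # lower panel: the boxes themselves

# Hide the facing spines so the two panels read as one broken axis.
ax1.spines['bottom'].set_visible(False)
ax2.spines['top'].set_visible(False)
ax1.tick_params(bottom=False, labelbottom=False)

# boxplot places the boxes at positions 1..n; label them with the group keys.
ax2.set_xticks(range(1, len(keys) + 1))
ax2.set_xticklabels(keys)
ax2.set_xlabel('Education')
```

The diagonal break marks from the question's code can be added on top of this unchanged, since they only depend on ax1 and ax2.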
