Plotting and modeling text data with swarmplot in Python - python-3.x

I have a CSV file and I want to use the seaborn library's swarmplot to plot the relation between two selected columns.
This is a sample of rows from the CSV file that I am working with:
SeriesCode Year DESCRIPTION
21 IC.FRM.CORR.ZS YR2004 The sample was drawn from the manufacturing sector only.
38 SP.ADO.TFRT YR2010 Interpolated using data for 2007 and 2012.
10 SP.ADO.TFRT YR2000 Interpolated using data for 1997 and 2002.
18 IC.FRM.CORR.ZS YR2003 The sample was drawn from the manufacturing sector only.
32 IC.TAX.METG YR2007 The sample was drawn from the manufacturing sector only.
28 SP.ADO.TFRT YR2006 Interpolated using data for 2002 and 2007.
And I have this piece of code:
import re
import pandas
import seaborn as sb
df1=pandas.read_csv("./Jobs_csv/JobsSeries-Time.csv")
ifcz=df1[df1['SeriesCode'].str.contains("IC.FRM.CORR.ZS",flags=re.IGNORECASE,regex=True)].DESCRIPTION
ify=df1[df1['SeriesCode'].str.contains("IC.FRM.CORR.ZS",flags=re.IGNORECASE,regex=True)].Year
sb.swarmplot(x="ifcz", y="ify", data=df1)
But whenever I run it, I get the following:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-68-3c5933ceba52> in <module>()
----> 1 sb.swarmplot(x="ifcz", y="ify", data=df1)
/home/user/.local/lib/python3.6/site-packages/seaborn/categorical.py in swarmplot(x, y, hue, data, order, hue_order, dodge, orient, color, palette, size, edgecolor, linewidth, ax, **kwargs)
2975
2976 plotter = _SwarmPlotter(x, y, hue, data, order, hue_order,
-> 2977 dodge, orient, color, palette)
2978 if ax is None:
2979 ax = plt.gca()
/home/user/.local/lib/python3.6/site-packages/seaborn/categorical.py in __init__(self, x, y, hue, data, order, hue_order, dodge, orient, color, palette)
1214 dodge, orient, color, palette):
1215 """Initialize the plotter."""
-> 1216 self.establish_variables(x, y, hue, data, orient, order, hue_order)
1217 self.establish_colors(color, palette, 1)
1218
/home/user/.local/lib/python3.6/site-packages/seaborn/categorical.py in establish_variables(self, x, y, hue, data, orient, order, hue_order, units)
150 if isinstance(var, str):
151 err = "Could not interpret input '{}'".format(var)
--> 152 raise ValueError(err)
153
154 # Figure out the plotting orientation
ValueError: Could not interpret input 'ifcz'
I don't know why it gives this error or how I can fix it. I have also become unsure whether swarmplot is the right tool for this. If you think the problem is that swarmplot shouldn't be used here, can you name other plots that would model this data?
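A possible explanation, offered as a sketch rather than a definitive answer: when `data=` is given, seaborn treats string values of `x` and `y` as column names in that DataFrame, and `ifcz` and `ify` are local variables, not columns of `df1`. The example below filters the rows first and then passes real column names. The inline DataFrame is a stand-in built from the sample rows above, and `YearNum` is a hypothetical helper column added so that one axis is numeric, which older seaborn versions in particular expect for a swarm plot.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import pandas as pd
import seaborn as sb

# Stand-in for JobsSeries-Time.csv, built from the sample rows in the question
df1 = pd.DataFrame({
    "SeriesCode": ["IC.FRM.CORR.ZS", "SP.ADO.TFRT", "SP.ADO.TFRT",
                   "IC.FRM.CORR.ZS", "IC.TAX.METG", "SP.ADO.TFRT"],
    "Year": ["YR2004", "YR2010", "YR2000", "YR2003", "YR2007", "YR2006"],
    "DESCRIPTION": [
        "The sample was drawn from the manufacturing sector only.",
        "Interpolated using data for 2007 and 2012.",
        "Interpolated using data for 1997 and 2002.",
        "The sample was drawn from the manufacturing sector only.",
        "The sample was drawn from the manufacturing sector only.",
        "Interpolated using data for 2002 and 2007.",
    ],
})

# Literal, case-insensitive match (regex=False so "." is not a wildcard)
mask = df1["SeriesCode"].str.contains("IC.FRM.CORR.ZS", case=False, regex=False)
subset = df1[mask].copy()

# Hypothetical helper column: "YR2004" -> 2004, giving a numeric axis
subset["YearNum"] = subset["Year"].str.replace("YR", "", regex=False).astype(int)

# x and y now name columns that actually exist in the DataFrame passed as data
ax = sb.swarmplot(x="DESCRIPTION", y="YearNum", data=subset)
```

If the goal is only to count how many notes of each kind appear per year, a count-based plot such as `sb.countplot` may fit this kind of text data better than a swarm plot.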

Related

How do I pass the values to Catboost?

I'm trying to work with CatBoost and I'm stuck on a problem. I have a dataframe with 28 columns, 2 of which are categorical. The numerical columns contain both whole and fractional numbers, as well as some 0.00 values that are genuine zeros (like 1-1=0) rather than missing values.
I'm trying to run this:
train_cl = cb.Pool(data=ret_df.iloc[:580000, :-1], label=ret_df.iloc[:580000, -1], cat_features=cats)
evl_cl = cb.Pool(data=ret_df.iloc[580000:, :-1], label=ret_df.iloc[580000:, -1], cat_features=cats)
But I get this error:
---------------------------------------------------------------------------
CatBoostError Traceback (most recent call last)
<ipython-input-112-a515b0ab357b> in <module>
1 train_cl = cb.Pool(data=ret_df.iloc[:580000, :-1], label=ret_df.iloc[:580000, -1], cat_features=cats)
----> 2 evl_cl = cb.Pool(data=ret_df.iloc[580000:, :-1], label=ret_df.iloc[580000:, -1], cat_features=cats)
~\AppData\Local\Programs\Python\Python36\lib\site-packages\catboost\core.py in __init__(self, data, label, cat_features, text_features, embedding_features, column_description, pairs, delimiter, has_header, ignore_csv_quoting, weight, group_id, group_weight, subgroup_id, pairs_weight, baseline, feature_names, thread_count, log_cout, log_cerr)
615 )
616
--> 617 self._init(data, label, cat_features, text_features, embedding_features, pairs, weight, group_id, group_weight, subgroup_id, pairs_weight, baseline, feature_names, thread_count)
618 super(Pool, self).__init__()
619
~\AppData\Local\Programs\Python\Python36\lib\site-packages\catboost\core.py in _init(self, data, label, cat_features, text_features, embedding_features, pairs, weight, group_id, group_weight, subgroup_id, pairs_weight, baseline, feature_names, thread_count)
1081 if label is not None:
1082 self._check_label_type(label)
-> 1083 self._check_label_empty(label)
1084 label = self._label_if_pandas_to_numpy(label)
1085 if len(np.shape(label)) == 1:
~\AppData\Local\Programs\Python\Python36\lib\site-packages\catboost\core.py in _check_label_empty(self, label)
723 """
724 if len(label) == 0:
--> 725 raise CatBoostError("Labels variable is empty.")
726
727 def _check_label_shape(self, label, samples_count):
CatBoostError: Labels variable is empty.
I've googled this problem but found nothing. My hypothesis is that the 0.00 values are the cause, but I do not know how to solve this because I can't replace these values with anything else.
Please help me!
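One thing worth checking, judging only from the traceback: `_check_label_empty` raises when `len(label) == 0`, and a positional slice such as `ret_df.iloc[580000:, -1]` is empty whenever the frame has 580000 rows or fewer. Zero values in the features would not by themselves make the label empty. The sketch below uses a small made-up frame in place of `ret_df`, and `split_by_fraction` is a hypothetical helper, not part of CatBoost.

```python
import pandas as pd


def split_by_fraction(df, train_fraction=0.8):
    """Split rows by position so the cut point scales with the frame length."""
    split_at = int(len(df) * train_fraction)
    train_part, eval_part = df.iloc[:split_at], df.iloc[split_at:]
    if len(train_part) == 0 or len(eval_part) == 0:
        raise ValueError("Split produced an empty part; check len(df).")
    return train_part, eval_part


# Made-up stand-in for ret_df; the last column plays the role of the label
ret_df = pd.DataFrame({
    "f1": [0.0, 1.5, 2.0, 0.0, 3.25],
    "cat": ["a", "b", "c", "a", "b"],
    "target": [0, 1, 0, 1, 1],
})

# A hard-coded offset beyond the end of the frame silently yields no rows
too_far = ret_df.iloc[580000:, -1]
print(len(ret_df), len(too_far))

# Deriving the cut point from the length keeps both parts non-empty
train_df, eval_df = split_by_fraction(ret_df)
print(len(train_df), len(eval_df))
```

Printing `len(ret_df)` before building the two pools should show quickly whether the second slice can contain any rows at all.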

How can I plot a categorical vs categorical plot?

I want to compare the counts of the categories in one column against the categories in a second column. I have two columns:
1. Max_glu_serum with categories: None, Norm, <200, <300.
2. Readmitted with categories: No, <30, >30.
I want a plot that lets me check, for example, the count of '<300' with '>30', i.e., how many patients had max_glu_serum = >300 and were readmitted in '>30' days.
I tried the following code:
sns.catplot(y=train_data_wmis['max_glu_serum'],
hue=train_data_wmis['readmitted'],
kind="count",
palette="pastel", edgecolor=".6", dropna=True)
but it throws the following error:
TypeError Traceback (most recent call last)
<ipython-input-384-1be2c9032203> in <module>
----> 1 sns.catplot(y=train_data_wmis['max_glu_serum'], hue=train_data_wmis['readmitted'], kind="count", palette="pastel", edgecolor=".6", dropna=True)
F:\Anaconda3\lib\site-packages\seaborn\categorical.py in catplot(x, y, hue, data, row, col, col_wrap, estimator, ci, n_boot, units, order, hue_order, row_order, col_order, kind, height, aspect, orient, color, palette, legend, legend_out, sharex, sharey, margin_titles, facet_kws, **kwargs)
3750
3751 # Initialize the facets
-> 3752 g = FacetGrid(**facet_kws)
3753
3754 # Draw the plot onto the facets
F:\Anaconda3\lib\site-packages\seaborn\axisgrid.py in __init__(self, data, row, col, hue, col_wrap, sharex, sharey, height, aspect, palette, row_order, col_order, hue_order, hue_kws, dropna, legend_out, despine, margin_titles, xlim, ylim, subplot_kws, gridspec_kws, size)
255 # Make a boolean mask that is True anywhere there is an NA
256 # value in one of the faceting variables, but only if dropna is True
--> 257 none_na = np.zeros(len(data), np.bool)
258 if dropna:
259 row_na = none_na if row is None else data[row].isnull()
TypeError: object of type 'NoneType' has no len()
Can someone help me, please!
I tried a couple of things and finally found one solution to the above problem. I defined the following function:
def plot_stack(column_1, column_2):
    plot_stck=pd.crosstab(index=column_1, columns=column_2)
    plot_stck.plot(kind='bar', figsize=(8,8), stacked=True)
    return
Then,
plot_stack(train_data_wmis['max_glu_serum'], train_data_wmis['readmitted'])
Output:
Stacked Plot of 'max_glu_serum' and 'readmitted'
Please comment if a better solution is available via Seaborn. Thanks.
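As a possible Seaborn variant: the TypeError above comes from `FacetGrid` calling `len(data)` while `data` is `None`, so passing the DataFrame through `data=` and naming the columns as strings may avoid it. The sketch below uses the axes-level `sns.countplot` on made-up rows (the values are invented for illustration, with category labels taken from the list in the question), and reads one specific count from a crosstab.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import pandas as pd
import seaborn as sns

# Made-up rows standing in for train_data_wmis
train_data_wmis = pd.DataFrame({
    "max_glu_serum": ["None", "Norm", "<300", "<300", "<200", "<300", "None", "Norm"],
    "readmitted":    ["No",   "<30",  ">30",  ">30",  "No",   "No",   ">30",  "No"],
})

# Exact count for one pair of categories, read from the contingency table
counts = pd.crosstab(train_data_wmis["max_glu_serum"], train_data_wmis["readmitted"])
n_pair = counts.loc["<300", ">30"]
print(n_pair)

# Grouped count plot: one bar per (max_glu_serum, readmitted) combination
ax = sns.countplot(data=train_data_wmis, y="max_glu_serum", hue="readmitted",
                   palette="pastel", edgecolor=".6")
```

The crosstab gives the exact numbers, while the count plot shows the same table as grouped bars instead of stacked ones.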

Don't understand error message (basic sklearn command)

I'm new to Python and programming in general, and I wanted to practice a little with linear regression in one variable.
I'm currently following the tutorial at this link:
https://www.youtube.com/watch?v=8jazNUpO3lQ&list=PLeo1K3hjS3uvCeTYTeyfe0-rN5r8zn9rw&index=2
and I am doing exactly what he is doing.
However, I ran into an error when running the code shown below
(for simplicity, I put '--' in front of lines that are output; I used Jupyter Notebook).
At the end I got a long traceback when running 'reg.predict(3300)'.
I don't understand what went wrong.
Can someone help me out?
Cheers!
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model
df = pd.read_csv("homeprices.csv")
df
--area price
0 2600 550000
1 3000 565000
2 3200 610000
3 3600 680000
4 4000 725000
%matplotlib inline
plt.xlabel('area(sqr ft)')
plt.ylabel('price(US$)')
plt.scatter(df.area, df.price, color='red', marker = '+')
--<matplotlib.collections.PathCollection at 0x2e823ce66a0>
reg = linear_model.LinearRegression()
reg.fit(df[['area']],df.price)
--LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
normalize=False)
reg.predict(3300)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-16-ad5a8409ff75> in <module>
----> 1 reg.predict(3300)
~\Anaconda3\lib\site-packages\sklearn\linear_model\base.py in predict(self, X)
211 Returns predicted values.
212 """
--> 213 return self._decision_function(X)
214
215 _preprocess_data = staticmethod(_preprocess_data)
~\Anaconda3\lib\site-packages\sklearn\linear_model\base.py in _decision_function(self, X)
194 check_is_fitted(self, "coef_")
195
--> 196 X = check_array(X, accept_sparse=['csr', 'csc', 'coo'])
197 return safe_sparse_dot(X, self.coef_.T,
198 dense_output=True) + self.intercept_
~\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
543 "Reshape your data either using array.reshape(-1, 1) if "
544 "your data has a single feature or array.reshape(1, -1) "
--> 545 "if it contains a single sample.".format(array))
546 # If input is 1D raise error
547 if array.ndim == 1:
ValueError: Expected 2D array, got scalar array instead:
array=3300.
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
Try reg.predict([[3300]]). The API used to allow a scalar value, but now you need to pass a 2D array.
reg.fit(df[['area']],df.price)
In the call above, X is passed as a 2D array (df[['area']]), so reg.predict needs a 2D array for X as well. Hence,
reg.predict([[3300]])
The error message itself says "Expected 2D array, got scalar array instead", so change the call to:
reg.predict([[3300]])
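For completeness, here is a small self-contained sketch of the fix, using the five rows printed in the question in place of homeprices.csv. The variable names `new_homes` and `prediction` are only illustrative. Passing a one-row DataFrame with the same column name is one way to supply the 2D input; `reg.predict([[3300]])` is equivalent in shape.

```python
import pandas as pd
from sklearn import linear_model

# The five rows shown in the question, in place of homeprices.csv
df = pd.DataFrame({
    "area": [2600, 3000, 3200, 3600, 4000],
    "price": [550000, 565000, 610000, 680000, 725000],
})

reg = linear_model.LinearRegression()
reg.fit(df[["area"]], df["price"])  # X is 2D: shape (n_samples, n_features)

# predict() expects the same 2D layout: one row per sample, one column per feature
new_homes = pd.DataFrame({"area": [3300]})
prediction = reg.predict(new_homes)
print(prediction)
```

The result is a 1D array with one predicted price per input row, so several areas can be predicted at once by adding more rows to `new_homes`.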

Problems with seaborn (ndim)

I am using seaborn to plot a very simple data set. Here is what I do:
import seaborn as sns
import pandas as pd
df = pd.read_excel('myfile.xlsx')
sns.set(style="white")
g = sns.PairGrid(df, diag_sharey=False)
g.map_lower(sns.kdeplot)
g.map_upper(sns.scatterplot)
g.map_diag(sns.kdeplot, lw=3)
I get the following error: AttributeError: 'NoneType' object has no attribute 'ndim'. Weirdly, the plot is only partly drawn (see below).
Any idea why that is the case and what I can do to solve the issue?
EDIT:
The dataframe has the following attributes:
plan_change int64
user_login float64
new_act_ratio float64
on_time int64
Unfortunately, I cannot upload the data set. However, I can say that plotting other seaborn graphs works just fine.
The full error message is the following:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-16-2dbc61abd2bd> in <module>()
3 g = sns.PairGrid(df, diag_sharey=False)
4 g.map_lower(sns.kdeplot)
----> 5 g.map_upper(sns.scatterplot)
6 g.map_diag(sns.kdeplot, lw=3)
7
/anaconda/lib/python3.5/site-packages/seaborn/axisgrid.py in map_upper(self, func, **kwargs)
1488 color = self.palette[k] if kw_color is None else kw_color
1489 func(data_k[x_var], data_k[y_var], label=label_k,
-> 1490 color=color, **kwargs)
1491
1492 self._clean_axis(ax)
/anaconda/lib/python3.5/site-packages/seaborn/relational.py in scatterplot(x, y, hue, style, size, data, palette, hue_order, hue_norm, sizes, size_order, size_norm, markers, style_order, x_bins, y_bins, units, estimator, ci, n_boot, alpha, x_jitter, y_jitter, legend, ax, **kwargs)
1333 x_bins=x_bins, y_bins=y_bins,
1334 estimator=estimator, ci=ci, n_boot=n_boot,
-> 1335 alpha=alpha, x_jitter=x_jitter, y_jitter=y_jitter, legend=legend,
1336 )
1337
/anaconda/lib/python3.5/site-packages/seaborn/relational.py in __init__(self, x, y, hue, size, style, data, palette, hue_order, hue_norm, sizes, size_order, size_norm, dashes, markers, style_order, x_bins, y_bins, units, estimator, ci, n_boot, alpha, x_jitter, y_jitter, legend)
850
851 plot_data = self.establish_variables(
--> 852 x, y, hue, size, style, units, data
853 )
854
/anaconda/lib/python3.5/site-packages/seaborn/relational.py in establish_variables(self, x, y, hue, size, style, units, data)
155 units=units
156 )
--> 157 plot_data = pd.DataFrame(plot_data)
158
159 # Option 3:
/anaconda/lib/python3.5/site-packages/pandas/core/frame.py in __init__(self, data, index, columns, dtype, copy)
264 dtype=dtype, copy=copy)
265 elif isinstance(data, dict):
--> 266 mgr = self._init_dict(data, index, columns, dtype=dtype)
267 elif isinstance(data, ma.MaskedArray):
268 import numpy.ma.mrecords as mrecords
/anaconda/lib/python3.5/site-packages/pandas/core/frame.py in _init_dict(self, data, index, columns, dtype)
400 arrays = [data[k] for k in keys]
401
--> 402 return _arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
403
404 def _init_ndarray(self, values, index, columns, dtype=None, copy=False):
/anaconda/lib/python3.5/site-packages/pandas/core/frame.py in _arrays_to_mgr(arrays, arr_names, index, columns, dtype)
5382
5383 # don't force copy because getting jammed in an ndarray anyway
-> 5384 arrays = _homogenize(arrays, index, dtype)
5385
5386 # from BlockManager perspective
/anaconda/lib/python3.5/site-packages/pandas/core/frame.py in _homogenize(data, index, dtype)
5693 v = lib.fast_multiget(v, oindex.values, default=NA)
5694 v = _sanitize_array(v, index, dtype=dtype, copy=False,
-> 5695 raise_cast_failure=False)
5696
5697 homogenized.append(v)
/anaconda/lib/python3.5/site-packages/pandas/core/series.py in _sanitize_array(data, index, dtype, copy, raise_cast_failure)
2917
2918 # scalar like
-> 2919 if subarr.ndim == 0:
2920 if isinstance(data, list): # pragma: no cover
2921 subarr = np.array(data, dtype=object)
AttributeError: 'NoneType' object has no attribute 'ndim'

User Warning: The following kwargs were not used by contour: 'label', 'color'

I'm trying to create a comparison plot using Seaborn's PairGrid on my dataset. The dataset has 6 columns that I am trying to plot with scatter() in the .map_upper() step of the PairGrid, which I'm applying to the entire dataframe. Here is a quick peek at my dataframe object; the 'year' object is set as the dataframe's index.
Here are the data types of my dataframe comp_pct_chg_df:
year object
Amsterdam float64
Barcelona float64
Kingston float64
Milan float64
Philadelphia float64
Global float64
dtype: object
Here is my erroneous code below:
# Creating a comparison plot (.PairGrid()) of all my cities' and global data's average percent change in temperature
# Set up my figure by naming it 'pct_chg_yrly_fig', then call PairGrid on the DataFrame
pct_chg_yrly_fig = sns.PairGrid(comp_pct_chg_df.dropna())
# Using map_upper we can specify what the upper triangle will look like.
pct_chg_yrly_fig.map_upper(plt.scatter,color='purple')
# We can also define the lower triangle in the figure, including the plot type (KDE) or the color map (BluePurple)
pct_chg_yrly_fig.map_lower(sns.kdeplot,cmap='cool_d')
# Finally we'll define the diagonal as a series of histogram plots of the yearly average percent change in temperature
pct_chg_yrly_fig.map_diag(plt.hist,histtype='step',linewidth=3,bins=30)
# Adding a legend
pct_chg_yrly_fig.add_legend()
Some of the visualizations do plot, like the .map_lower() call, which turned out great. However, I'd like each city to appear in a different color in the scatter plots drawn by .map_upper(). Right now it's monochromatic, and it's hard to tell which data points belong to which city. Lastly, my .map_diag() doesn't plot at all. I don't know what I'm doing wrong. I've read the ValueError message I received (below) and tried changing dozens of **kwargs, label and color specifically, to no avail. Help would be greatly appreciated.
Here is the ValueError message I'm receiving:
ValueError Traceback (most recent call last)
<ipython-input-38-3fcf1b69d4ef> in <module>()
11
12 # Finally we'll define the diagonal as a series of histogram plots of the yearly average percent change in temperature
---> 13 pct_chg_yrly_fig.map_diag(plt.hist,histtype='step',linewidth=3,bins=30)
14
15 # Adding a legend
~/anaconda3/lib/python3.6/site-packages/seaborn/axisgrid.py in map_diag(self, func, **kwargs)
1361
1362 if "histtype" in kwargs:
-> 1363 func(vals, color=color, **kwargs)
1364 else:
1365 func(vals, color=color, histtype="barstacked", **kwargs)
~/anaconda3/lib/python3.6/site-packages/matplotlib/pyplot.py in hist(x, bins, range, density, weights, cumulative, bottom, histtype, align, orientation, rwidth, log, color, label, stacked, normed, hold, data, **kwargs)
3023 histtype=histtype, align=align, orientation=orientation,
3024 rwidth=rwidth, log=log, color=color, label=label,
-> 3025 stacked=stacked, normed=normed, data=data, **kwargs)
3026 finally:
3027 ax._hold = washold
~/anaconda3/lib/python3.6/site-packages/matplotlib/__init__.py in inner(ax, *args, **kwargs)
1715 warnings.warn(msg % (label_namer, func.__name__),
1716 RuntimeWarning, stacklevel=2)
-> 1717 return func(ax, *args, **kwargs)
1718 pre_doc = inner.__doc__
1719 if pre_doc is None:
~/anaconda3/lib/python3.6/site-packages/matplotlib/axes/_axes.py in hist(***failed resolving arguments***)
6137 color = mcolors.to_rgba_array(color)
6138 if len(color) != nx:
-> 6139 raise ValueError("color kwarg must have one color per dataset")
6140
6141 # If bins are not specified either explicitly or via range,
ValueError: color kwarg must have one color per dataset
I also noticed that my index, the year object, is being plotted in the upper left corner of my PairGrid. It looks like a bunch of vertical lines next to one another. I'm not sure why it's plotting, but could it be because the values (years 1743 - 2015) end in '.0'? I noticed this when I put the dataframe together (and I don't know how to drop it... Python newb here), so I changed the year column's data type from float64 to string and set it as my index. I thought this would make the index 'unworkable', meaning that even though the values are numbers, the data type is string, so no calculations could be done on them. Am I missing something here?
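A hedged guess at what is going on: the dtypes listing shows `year` as an object column, so it may still be part of the grid, and a column of strings handed to `plt.hist` can be treated as many separate datasets. That would be consistent with both the `color kwarg` error and the stripe of vertical lines. One way to test this is to restrict the grid to the city columns with `vars=`. The frame below is random stand-in data, not the real temperatures, and only three of the cities are included to keep it short.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Random stand-in data; the real comp_pct_chg_df is not available here
rng = np.random.default_rng(0)
cities = ["Amsterdam", "Barcelona", "Kingston"]
comp_pct_chg_df = pd.DataFrame(rng.normal(size=(40, 3)), columns=cities)

# Mimic the string 'year' column described in the question (values ending in '.0')
comp_pct_chg_df.insert(0, "year", [str(float(y)) for y in range(1976, 2016)])

# vars= limits the grid to the listed columns, so 'year' is never plotted
pct_chg_yrly_fig = sns.PairGrid(comp_pct_chg_df.dropna(), vars=cities)
pct_chg_yrly_fig.map_upper(plt.scatter, color="purple")
pct_chg_yrly_fig.map_diag(plt.hist, histtype="step", linewidth=3, bins=30)
```

If the grid draws cleanly once `year` is excluded, that points to the string column as the trigger. Note that each off-diagonal panel compares two cities, so a single point in a panel belongs to both cities at once rather than to one of them.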
