Using huber scale and location estimator in statsmodel

I want to use the Huber simultaneous scale and mean estimator documented here: http://www.statsmodels.org/dev/generated/statsmodels.robust.scale.Huber.html but I get the following error:
In [1]: from statsmodels.robust.scale import huber
In [2]: huber([1,2,1000,3265,454])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-2-80c7d73a4467> in <module>()
----> 1 huber([1,2,1000,3265,454])
/usr/local/lib/python3.5/dist-packages/statsmodels/robust/scale.py in __call__(self, a, mu, initscale, axis)
132 scale = tools.unsqueeze(scale, axis, a.shape)
133 mu = tools.unsqueeze(mu, axis, a.shape)
--> 134 return self._estimate_both(a, scale, mu, axis, est_mu, n)
135
136 def _estimate_both(self, a, scale, mu, axis, est_mu, n):
/usr/local/lib/python3.5/dist-packages/statsmodels/robust/scale.py in _estimate_both(self, a, scale, mu, axis, est_mu, n)
176 else:
177 return nmu.squeeze(), nscale.squeeze()
--> 178 raise ValueError('joint estimation of location and scale failed to converge in %d iterations' % self.maxiter)
179
180 huber = Huber()
ValueError: joint estimation of location and scale failed to converge in 30 iterations
The weird thing is that it depends on the input:
In [3]: huber([1,2,1000,3265])
Out[3]: (array(1067.0), array(1744.3785635989168))
Is it a bug or did I do something wrong here ?
Thanks
EDIT: I knew about the tol and maxiter parameters. What you suggest works in that case, but here is an example where it doesn't:
In [1]: a=[4.3498776644415429, 16.549773154535362, 4.6335866963356445, 8.2581784707468771, 1.3508951981036594, 1.2918098244960199, 5.734
...: 9939516388453, 0.41663442483143953, 4.5632532990486077, 8.1020487048604473, 1.3823829480004797, 1.7848176927929804, 4.3058348043
...: 423473, 0.9427710734983884, 0.95646846668018171, 0.75309469901235238, 8.4689505489677011, 0.77420558084543778, 0.765060223824508
...: 45, 1.5673666392992407, 1.4109878442590897, 0.45592078018861532, 4.71748181503082, 0.65942167325205436, 0.19099796838644958, 1.0
...: 979997466466069, 4.8145761128848106, 0.75417363824157768, 5.0723603274833362, 0.30627007428414721, 4.8178689054947981, 1.5383475
...: 959362511, 0.7971041296695851, 4.689826268915076, 8.6704498595703274, 0.56825576954483947, 8.0383098149129708, 0.394000842811084
...: 22, 0.89827542590321019, 8.5160701523615785, 9.0413284666560934, 1.3590549231652516, 8.355489609767794, 4.2413169378427682, 4.84
...: 97143419119348, 4.8566372637376292, 0.80979444214378904, 0.26613505510736446, 1.1525345100417608, 4.9784132426823824, 1.07663603
...: 91211101, 1.9604545887151259, 0.77151237419054963, 1.2302626325699455, 0.846912462599126, 0.85852710339862037, 0.380355420248302
...: 99, 4.7586522644359093, 0.46796412732813891, 0.52933680009769146, 5.2521765047159708, 0.71915381047435945, 1.3502865819436387, 0
...: .76647272458736559, 1.1206637428992841, 0.72560665950851866, 4.4248008256265781, 4.7984989298357457, 1.0696617588880453, 0.71104
...: 701759920497, 0.46986438176394463, 0.71008686283792688, 0.40698839770374351, 1.0015132141773508, 1.3825224746094535, 0.932562703
...: 04709066, 8.8896053101317687, 0.64148877800521564, 0.69250319745644506, 4.7187793763802919, 5.0620089438920939, 5.17105647739872
...: 33, 9.5341720525579809, 0.43052713463119635, 0.79288845392647533, 0.51059695992994469, 0.48295891743804287, 0.93370512281086504,
...: 1.7493284310512855, 0.62744557356984221, 5.0965146009791704, 0.12615625248684664, 1.1064189602023351, 0.33183381198282491, 4.90
...: 32450273833179, 0.90296573725985785, 1.2885647882049298, 0.84669066664867576, 1.1481783837280477, 0.94784483590946278, 9.8019240
...: 792478755, 0.91501030105202807, 0.57121190468293803, 5.5511993201050887, 0.66054793663263078, 9.6626055869916065, 5.262806161853
...: 6908, 9.5905100705465696, 0.70369230764306401, 8.9747551552440186, 1.572014845182425, 1.9571634928868149, 0.62030418652298325, 0
...: .3395356767840213, 0.48287760518144929, 4.7937042347984198, 0.74251393675618682, 0.87369567300592954, 4.5381205696031586, 5.2673
...: 192797619084]
In [2]: from statsmodels.robust.scale import huber, Huber
In [3]: Huber(maxiter=10000,tol=1e-1)(a)
/usr/lib/python3.6/site-packages/statsmodels/robust/scale.py:168: RuntimeWarning: invalid value encountered in sqrt
/ (n * self.gamma - (a.shape[axis] - card) * self.c**2))
/usr/lib/python3.6/site-packages/statsmodels/robust/scale.py:164: RuntimeWarning: invalid value encountered in less_equal
subset = np.less_equal(np.fabs((a - mu)/scale), self.c)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-3-4b9929ff84bb> in <module>()
----> 1 Huber(maxiter=10000,tol=1e-1)(a)
/usr/lib/python3.6/site-packages/statsmodels/robust/scale.py in __call__(self, a, mu, initscale, axis)
132 scale = tools.unsqueeze(scale, axis, a.shape)
133 mu = tools.unsqueeze(mu, axis, a.shape)
--> 134 return self._estimate_both(a, scale, mu, axis, est_mu, n)
135
136 def _estimate_both(self, a, scale, mu, axis, est_mu, n):
/usr/lib/python3.6/site-packages/statsmodels/robust/scale.py in _estimate_both(self, a, scale, mu, axis, est_mu, n)
176 else:
177 return nmu.squeeze(), nscale.squeeze()
--> 178 raise ValueError('joint estimation of location and scale failed to converge in %d iterations' % self.maxiter)
179
180 huber = Huber()
ValueError: joint estimation of location and scale failed to converge in 10000 iterations
Sorry, this was my original error but because the "a" is long, I tried to recreate the error with a smaller array. In this case, I don't think maxiter and tol are to blame.

The number of iterations allowed, maxiter, can be changed when using the Huber class.
For example, this works:
>>> from statsmodels.robust.scale import huber, Huber
>>> Huber(maxiter=200)([1,2,1000,3265,454])
(array(925.6483958529737), array(1497.0624070525248))
It is also possible to change the threshold parameter for the norm function (the c argument of Huber) when using the class. In very small samples like this one, the estimate might be very sensitive to the threshold parameter.
As an alternative we can use the RLM model and regress on a constant. Both the thresholds and the algorithm are different, but it should produce similar robust results. In the new example the estimate for the scale lies between the standard deviation and the robust MAD, while the location estimate lies between the median and the mean.
>>> import numpy as np
>>> from statsmodels.robust import norms, scale
>>> from statsmodels.robust.robust_linear_model import RLM
>>> res = RLM(a, np.ones(len(a)), M=norms.HuberT(t=1.5)).fit(scale_est=scale.HuberScale(d=1.5))
>>> res.params, res.scale
(array([ 2.47711987]), 2.5218278029435406)
>>> np.median(a), scale.mad(a)
(1.1503564468849041, 0.98954533464908301)
>>> np.mean(a), np.std(a)
(2.8650886010542269, 3.0657561979615977)
The resulting weights show that some of the high values are downweighted
>>> widx = np.argsort(res.weights)
>>> np.asarray(a)[widx[:10]]
array([ 16.54977315, 9.80192408, 9.66260559, 9.59051007,
9.53417205, 9.04132847, 8.97475516, 8.88960531,
8.67044986, 8.51607015])
I am not familiar with the details of the implementation of the Huber joint mean-scale estimator.
One possible reason for the convergence failure is that the distribution of the values is bunched in 3 groups with one extra outlier at 16, visible when plotting the histogram. This could result in a convergence cycle with the iterative solver where the third group is either included or excluded. But that is just a guess.
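The median and MAD used for comparison above are not iterative, so they cannot fail to converge. Below is a minimal NumPy-only sketch of both. The helper name median_and_mad is hypothetical, and the normalization constant is an assumption meant to mirror what scale.mad does with its default settings.

```python
import numpy as np

# 0.75 quantile of the standard normal distribution; dividing by it makes
# the MAD comparable to a standard deviation for Gaussian data.
NORMAL_Q75 = 0.6744897501960817


def median_and_mad(values):
    """Return (median, normalized median absolute deviation) of a 1-D sample."""
    values = np.asarray(values, dtype=float)
    center = np.median(values)
    # Median of absolute deviations from the median, rescaled
    mad = np.median(np.abs(values - center)) / NORMAL_Q75
    return float(center), float(mad)


# The small sample from the question
center, spread = median_and_mad([1, 2, 1000, 3265, 454])
print(center, spread)
```

A helper like this could serve as a fallback inside an except ValueError: block around the Huber call, at the cost of using a different estimator when the joint estimation does not converge.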

Related

I keep getting "TypeError: only integer scalar arrays can be converted to a scalar index" while using custom-defined metric in KNeighborsClassifier

I am using a custom-defined metric in SKlearn's KNeighborsClassifier. Here's my code:
def chi_squared(x,y):
    return np.divide(np.square(np.subtract(x,y)), np.sum(x,y))
The function above implements the chi-squared distance. I used NumPy functions because, according to the scikit-learn docs, a metric function takes two one-dimensional NumPy arrays.
I have passed the chi_squared function as an argument to KNeighborsClassifier().
knn = KNeighborsClassifier(algorithm='ball_tree', metric=chi_squared)
However, I keep getting following error:
TypeError Traceback (most recent call last)
<ipython-input-29-d2a365ebb538> in <module>
4
5 knn = KNeighborsClassifier(algorithm='ball_tree', metric=chi_squared)
----> 6 knn.fit(X_train, Y_train)
7 predictions = knn.predict(X_test)
8 print(accuracy_score(Y_test, predictions))
~/.local/lib/python3.8/site-packages/sklearn/neighbors/_classification.py in fit(self, X, y)
177 The fitted k-nearest neighbors classifier.
178 """
--> 179 return self._fit(X, y)
180
181 def predict(self, X):
~/.local/lib/python3.8/site-packages/sklearn/neighbors/_base.py in _fit(self, X, y)
497
498 if self._fit_method == 'ball_tree':
--> 499 self._tree = BallTree(X, self.leaf_size,
500 metric=self.effective_metric_,
501 **self.effective_metric_params_)
sklearn/neighbors/_binary_tree.pxi in sklearn.neighbors._ball_tree.BinaryTree.__init__()
sklearn/neighbors/_binary_tree.pxi in sklearn.neighbors._ball_tree.BinaryTree._recursive_build()
sklearn/neighbors/_ball_tree.pyx in sklearn.neighbors._ball_tree.init_node()
sklearn/neighbors/_binary_tree.pxi in sklearn.neighbors._ball_tree.BinaryTree.rdist()
sklearn/neighbors/_dist_metrics.pyx in sklearn.neighbors._dist_metrics.DistanceMetric.rdist()
sklearn/neighbors/_dist_metrics.pyx in sklearn.neighbors._dist_metrics.PyFuncDistance.dist()
sklearn/neighbors/_dist_metrics.pyx in sklearn.neighbors._dist_metrics.PyFuncDistance._dist()
<ipython-input-29-d2a365ebb538> in chi_squared(x, y)
1 def chi_squared(x,y):
----> 2 return np.divide(np.square(np.subtract(x,y)), np.sum(x,y))
3
4
5 knn = KNeighborsClassifier(algorithm='ball_tree', metric=chi_squared)
<__array_function__ internals> in sum(*args, **kwargs)
~/.local/lib/python3.8/site-packages/numpy/core/fromnumeric.py in sum(a, axis, dtype, out, keepdims, initial, where)
2239 return res
2240
-> 2241 return _wrapreduction(a, np.add, 'sum', axis, dtype, out, keepdims=keepdims,
2242 initial=initial, where=where)
2243
~/.local/lib/python3.8/site-packages/numpy/core/fromnumeric.py in _wrapreduction(obj, ufunc, method, axis, dtype, out, **kwargs)
85 return reduction(axis=axis, out=out, **passkwargs)
86
---> 87 return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
88
89
TypeError: only integer scalar arrays can be converted to a scalar index

I can reproduce your error message with:
In [173]: x=np.arange(3); y=np.array([2,3,4])
In [174]: np.sum(x,y)
Traceback (most recent call last):
File "<ipython-input-174-1a1a267ebd82>", line 1, in <module>
np.sum(x,y)
File "<__array_function__ internals>", line 5, in sum
File "/usr/local/lib/python3.8/dist-packages/numpy/core/fromnumeric.py", line 2247, in sum
return _wrapreduction(a, np.add, 'sum', axis, dtype, out, keepdims=keepdims,
File "/usr/local/lib/python3.8/dist-packages/numpy/core/fromnumeric.py", line 87, in _wrapreduction
return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
TypeError: only integer scalar arrays can be converted to a scalar index
Correct use(s) of np.sum:
In [175]: np.sum(x)
Out[175]: 3
In [177]: np.sum(np.arange(6).reshape(2,3), axis=0)
Out[177]: array([3, 5, 7])
In [178]: np.sum(np.arange(6).reshape(2,3), 0)
Out[178]: array([3, 5, 7])
(Re)read the np.sum docs if necessary! Its second positional argument is axis, so np.sum(x, y) tries to use the array y as an axis.
Using np.add instead of np.sum:
In [179]: np.add(x,y)
Out[179]: array([2, 4, 6])
In [180]: x+y
Out[180]: array([2, 4, 6])
The following should be equivalent:
np.divide(np.square(np.subtract(x,y)), np.add(x,y))
(x-y)**2/(x+y)
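Note that the expressions above return an array, while a callable metric is expected to return a single number for each pair of vectors. A minimal sketch of a scalar version follows, using the formula from the question. The eps guard against division by zero is an assumption added here, not part of the original function.

```python
import numpy as np


def chi_squared(x, y, eps=1e-12):
    """Chi-squared style distance between two 1-D vectors, returned as a scalar."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Elementwise (x - y)^2 / (x + y), then summed into one number.
    # eps (an assumption) avoids dividing by zero when x + y == 0.
    return float(np.sum((x - y) ** 2 / (x + y + eps)))


d = chi_squared([0, 1, 2], [2, 3, 4])
print(d)
```

A function of this shape could then be passed as metric=chi_squared to KNeighborsClassifier, assuming the features are non-negative so that the denominator stays positive.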

How to do clustering with k-means algorithm for an imported data set with proper scaling of both axis

I'm new to data science, Python and Jupyter Notebook, and I'm currently studying how to do k-means clustering on a data set. I came across ways in which data can be entered directly
Data = {'x': [25,34,22,27,33,33,31,22,35,34,67,54,57,43,50,57,59,52,65,47,49,48,35,33,44,45,38,43,51,46],
'y': [79,51,53,78,59,74,73,57,69,75,51,32,40,47,53,36,35,58,59,50,25,20,14,12,20,5,29,27,8,7]
}
df = DataFrame(Data,columns=['x','y'])
and generated with make_blobs
data = make_blobs(n_samples=200, n_features=2, centers=4, cluster_std=1.6, random_state=50)
but I would like to know how to write proper code that imports a CSV file from my computer and runs k-means with scaling. Thank you in advance; I could not find relevant blogs to help me.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
from sklearn.cluster import KMeans
data=pd.read_csv("C:/Users/Dulangi/Downloads/winequality-red.csv")
data
data["alcohol"]=data["alcohol"]/data["alcohol"].max()
data["quality"]=data["quality"]/data["quality"].max()
plt.scatter(data["alcohol"],data['quality'])
plt.xlabel("alcohol")
plt.ylabel('quality')
plt.show()
x=data.copy()
kmeans=KMeans(2)
kmeans.fit(x)
clusters=x.copy()
clusters['cluster_pred']=kmeans.fit_predict(x)
plt.scatter(clusters["alcohol"],clusters['quality'],c=clusters['cluster_pred'],cmap='rainbow')
plt.xlabel("alcohol")
plt.ylabel('quality')
plt.show()
from sklearn import preprocessing
x_scaled=preprocessing.scale(x)
#x_scaled
wcss=[]
for i in range(1,30):
    kmeans=KMeans(i)
    kmeans.fit(x_scaled)
wcss.append(kmeans.inertia_)
wcss
plt.plot(range(1,30),wcss)
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
This is what I tried, and below is the error I got:
ValueError Traceback (most recent call last)
<ipython-input-12-d4955ce8615e> in <module>
39
40
---> 41 plt.plot(range(1,30),wcss)
42 plt.xlabel('Number of clusters')
43 plt.ylabel('WCSS')
~\Anaconda3\lib\site-packages\matplotlib\pyplot.py in plot(scalex, scaley, data, *args, **kwargs)
2787 return gca().plot(
2788 *args, scalex=scalex, scaley=scaley, **({"data": data} if data
-> 2789 is not None else {}), **kwargs)
2790
2791
~\Anaconda3\lib\site-packages\matplotlib\axes\_axes.py in plot(self, scalex, scaley, data, *args, **kwargs)
1664 """
1665 kwargs = cbook.normalize_kwargs(kwargs, mlines.Line2D._alias_map)
-> 1666 lines = [*self._get_lines(*args, data=data, **kwargs)]
1667 for line in lines:
1668 self.add_line(line)
~\Anaconda3\lib\site-packages\matplotlib\axes\_base.py in __call__(self, *args, **kwargs)
223 this += args[0],
224 args = args[1:]
--> 225 yield from self._plot_args(this, kwargs)
226
227 def get_next_color(self):
~\Anaconda3\lib\site-packages\matplotlib\axes\_base.py in _plot_args(self, tup, kwargs)
389 x, y = index_of(tup[-1])
390
--> 391 x, y = self._xy_from_xy(x, y)
392
393 if self.command == 'plot':
~\Anaconda3\lib\site-packages\matplotlib\axes\_base.py in _xy_from_xy(self, x, y)
268 if x.shape[0] != y.shape[0]:
269 raise ValueError("x and y must have same first dimension, but "
--> 270 "have shapes {} and {}".format(x.shape, y.shape))
271 if x.ndim > 2 or y.ndim > 2:
272 raise ValueError("x and y can be no greater than 2-D, but have "
ValueError: x and y must have same first dimension, but have shapes (29,) and (1,)

You can easily do this with scikit-learn:
import pandas as pd
data=pd.read_csv('myfile.csv')
df=pd.DataFrame(data,index=None)
df.head()
Check if rows contain any null values
df.isnull().sum()
Drop all the rows with null values, if any
df.dropna(inplace=True)
Normalize data
Normalize the data with MinMax scaling provided by sklearn
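from sklearn import preprocessing
# This example assumes a single non-numeric column named 'title' that is also the last column; adapt to your data
minmax_processed = preprocessing.MinMaxScaler().fit_transform(df.drop('title',axis=1))
df_numeric_scaled = pd.DataFrame(minmax_processed, index=df.index, columns=df.columns[:-1])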
df_numeric_scaled.head()
from sklearn.cluster import KMeans
Apply K-Means Clustering
What k to choose?
Let's fit models with 1 to 19 clusters on our data and take a look at the corresponding score values.
Nc = range(1, 20)
kmeans = [KMeans(n_clusters=i) for i in Nc]
score = [kmeans[i].fit(df_numeric_scaled).score(df_numeric_scaled) for i in range(len(kmeans))]
These score values are the negative of the within-cluster sum of squared distances (the inertia), so they are never positive. A score closer to 0 means the observations lie closer to their cluster centers; a large negative value indicates that they are far from them.
Based on these score values, we plot an elbow curve to decide which number of clusters is optimal. Note that we are dealing with a tradeoff between the number of clusters (hence the computation required) and the relative accuracy.
import matplotlib.pyplot as pl
pl.plot(Nc,score)
pl.xlabel('Number of Clusters')
pl.ylabel('Score')
pl.title('Elbow Curve')
pl.show()
Fit K-Means for clustering with k=5
kmeans = KMeans(n_clusters=5)
kmeans.fit(df_numeric_scaled)
df['cluster'] = kmeans.labels_
df.head()
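The shapes (29,) and (1,) in the question's error suggest that wcss received only one value, which would happen if wcss.append ran once after the for loop instead of inside it. Here is a minimal sketch of the scaling and elbow loop. The synthetic array is a stand-in for the CSV columns (an assumption), and n_init and random_state are illustrative settings.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for two CSV columns (assumption). With a real file, e.g.:
#   X = pd.read_csv("your_file.csv")[["alcohol", "quality"]].to_numpy()
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(10.0, 1.0, size=120),   # plays the role of "alcohol"
    rng.integers(3, 9, size=120),      # plays the role of "quality"
]).astype(float)

# Put both axes on the same scale (zero mean, unit variance)
x_scaled = StandardScaler().fit_transform(X)

ks = range(1, 10)
wcss = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0)
    km.fit(x_scaled)
    wcss.append(km.inertia_)  # inside the loop: one value per k

print(len(list(ks)), len(wcss))
```

With wcss holding one entry per value of k, plt.plot(ks, wcss) receives x and y of equal length.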

ValueError: Number of priors must match number of classes

I want to run my Python 3 code on Ubuntu, and I also want to understand the problem so that I can handle it in future.
There seems to be some problem with the imported library function.
## sample code
import numpy as np
x = np.array([[-1,-1],[-2,-1],[-3,-2],[1,1],[2,1],[3,2]])
y = np.array([1,1,1,2,2,2])
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB(x, y)
clf = clf.fit(x,y) ###showing error on compiling
print(clf.predict([[-2,1]]))
## output shown
Traceback (most recent call last):
File "naive.py", line 7, in <module>
clf = clf.fit(x,y)
File "/home/abhihsek/.local/lib/python3.6/site-packages/sklearn/naive_bayes.py", line 192, in fit
sample_weight=sample_weight)
File "/home/abhihsek/.local/lib/python3.6/site-packages/sklearn/naive_bayes.py", line 371, in _partial_fit
raise ValueError('Number of priors must match number of'
ValueError: Number of priors must match number of classes.
## code of library function line 192
190 X, y = check_X_y(X, y)
191 return self._partial_fit(X, y, np.unique(y), _refit=True,
192 sample_weight=sample_weight)
## code of library function line 371
369 # Check that the provide prior match the number of classes
370 if len(priors) != n_classes:
371 raise ValueError('Number of priors must match number of'
372 ' classes.')
373 # Check that the sum is 1

As @Suvan Pandey mentioned, the code won't give any error if you write clf = GaussianNB() instead of clf = GaussianNB(x, y).
If we look at the GaussianNB class then the __init__() can take these parameters:
def __init__(self, priors=None, var_smoothing=1e-9): # <-- these have a default value
    self.priors = priors
    self.var_smoothing = var_smoothing
The documentation about the two parameters:
priors – Prior probabilities of the classes. If specified the priors are not adjusted according to the data.
var_smoothing – Portion of the largest variance of all features that is added to variances for calculation stability.
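Your x and y are arrays holding the training data, so they don't fit the parameters of __init__(...): passed positionally, x is taken as priors and y as var_smoothing. Since x has 6 rows while y contains only 2 classes, the check on the number of priors fails.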
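A minimal sketch of the corrected call, with the data passed to fit() only. The query point [-0.8, -1] and the equal priors are illustrative choices, not taken from the question.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

x = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
y = np.array([1, 1, 1, 2, 2, 2])

# Training data goes to fit(), not to the constructor
clf = GaussianNB().fit(x, y)
pred_default = clf.predict([[-0.8, -1]])

# If priors are wanted, pass one probability per class (here 2 classes),
# summing to 1, as a keyword argument
clf_priors = GaussianNB(priors=[0.5, 0.5]).fit(x, y)
pred_priors = clf_priors.predict([[-0.8, -1]])

print(pred_default, pred_priors)
```

Passing priors by keyword keeps the intent explicit and avoids binding data arrays to constructor parameters by accident.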

Difference in use of ** and pow function

While attempting to write a cost function for linear regression, an error arises when replacing ** with the pow function in cost_function:
Original cost function
def cost_function(x,y,theta):
    m = np.size(y)
    j = (1/(2*m))*np.sum(np.power(np.matmul(x,theta)-y, 2))
    return j
Cost function giving the error:
def cost_function(x,y,theta):
    m = np.size(y)
    j = (1/(2*m))*np.sum((np.matmul(x,theta)-y)**2)
    return j
Gradient Descent
def gradient_descent(x,y,theta,learn_rate,iters):
    x = np.mat(x);y = np.mat(y); theta= np.mat(theta);
    m = np.size(y)
    j_hist = np.zeros(iters)
    for i in range(0,iters):
        temp = theta - (learn_rate/m)*(x.T*(x*theta-y))
        theta = temp
        j_hist[i] = cost_function(x,y,theta)
    return (theta),j_hist
Variable values
theta = np.zeros((2,1))
learn_rate = 0.01
iters = 1000
x is a (97,2) matrix
y is a (97,1) matrix
On its own, the cost function is calculated fine, with a value of 32.0727.
The error arises when the same function is used in gradient descent.
The error I am getting is LinAlgError: Last 2 dimensions of the array must be square

First let's distinguish between pow, ** and np.power. pow is the Python function, that according to docs is equivalent to ** when used with 2 arguments.
Second, you apply np.mat to the arrays, making np.matrix objects. According to its docs:
It has certain special operators, such as *
(matrix multiplication) and ** (matrix power).
matrix power:
In [475]: np.mat([[1,2],[3,4]])**2
Out[475]:
matrix([[ 7, 10],
[15, 22]])
Elementwise square:
In [476]: np.array([[1,2],[3,4]])**2
Out[476]:
array([[ 1, 4],
[ 9, 16]])
In [477]: np.power(np.mat([[1,2],[3,4]]),2)
Out[477]:
matrix([[ 1, 4],
[ 9, 16]])
Matrix power:
In [478]: arr = np.array([[1,2],[3,4]])
In [479]: arr@arr # np.matmul
Out[479]:
array([[ 7, 10],
[15, 22]])
With a non-square matrix:
In [480]: np.power(np.mat([[1,2]]),2)
Out[480]: matrix([[1, 4]]) # elementwise
Attempting to do matrix_power on a non-square matrix:
In [481]: np.mat([[1,2]])**2
---------------------------------------------------------------------------
LinAlgError Traceback (most recent call last)
<ipython-input-481-18e19d5a9d6c> in <module>()
----> 1 np.mat([[1,2]])**2
/usr/local/lib/python3.6/dist-packages/numpy/matrixlib/defmatrix.py in __pow__(self, other)
226
227 def __pow__(self, other):
--> 228 return matrix_power(self, other)
229
230 def __ipow__(self, other):
/usr/local/lib/python3.6/dist-packages/numpy/linalg/linalg.py in matrix_power(a, n)
600 a = asanyarray(a)
601 _assertRankAtLeast2(a)
--> 602 _assertNdSquareness(a)
603
604 try:
/usr/local/lib/python3.6/dist-packages/numpy/linalg/linalg.py in _assertNdSquareness(*arrays)
213 m, n = a.shape[-2:]
214 if m != n:
--> 215 raise LinAlgError('Last 2 dimensions of the array must be square')
216
217 def _assertFinite(*arrays):
LinAlgError: Last 2 dimensions of the array must be square
Note that the whole traceback lists matrix_power. That's why we often ask to see the whole traceback.
Why are you setting x, y and theta to np.mat? The cost_function uses matmul. With that function, and its @ operator, there are few(er) good reasons for using np.matrix.
Despite the subject line, you did not try to use pow. That confused me and at least one other commentator. I tried to find a np.pow or a scipy version.
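One way to make the cost function independent of whether it receives arrays or np.matrix objects is to convert the inputs to plain arrays first, so that ** is always elementwise. This is a minimal sketch under that assumption; the three-row data set is made up for illustration.

```python
import numpy as np


def cost_function(x, y, theta):
    # np.asarray turns np.matrix inputs into plain ndarrays,
    # so ** below is an elementwise square, never a matrix power.
    x, y, theta = (np.asarray(v, dtype=float) for v in (x, y, theta))
    m = y.size
    residual = x @ theta - y          # shape (m, 1)
    return np.sum(residual ** 2) / (2 * m)


# Made-up data: a column of ones plus one feature
x = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([[1.0], [2.0], [3.0]])
theta = np.zeros((2, 1))

cost_arrays = cost_function(x, y, theta)
cost_matrices = cost_function(np.asmatrix(x), np.asmatrix(y), np.asmatrix(theta))
print(cost_arrays, cost_matrices)
```

Both calls give the same value, because the non-square (m, 1) residual is squared elementwise in either case.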

Set thresholds in PySpark multinomial logistic regression

I would like to perform a multinomial logistic regression but I can't set threshold and thresholds parameters correctly. Consider the following DF:
from pyspark.ml.linalg import DenseVector
test_train_df = (
sqlc
.createDataFrame([(0, DenseVector([-1.0, 1.2, 0.7])),
(0, DenseVector([3.1, -2.0, -2.9])),
(1, DenseVector([1.0, 0.8, 0.3])),
(1, DenseVector([4.2, 1.4, -1.7])),
(0, DenseVector([-1.9, 2.5, -2.3])),
(2, DenseVector([2.6, -0.2, 0.2])),
(1, DenseVector([0.3, -3.4, 1.8])),
(2, DenseVector([-1.0, -3.5, 4.7]))],
['label', 'features'])
)
My label has 3 classes, so I have to set thresholds (plural, which default is None) rather than threshold (singular, which default is 0.5). Then I write:
from pyspark.ml import classification as cl
test_logit_abst = (
cl.LogisticRegression()
.setFamily('multinomial')
.setThresholds([.5, .5, .5])
)
Then I would like to fit the model on my DF:
test_logit = test_logit_abst.fit(test_train_df)
but when executing this last command I get an error:
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
~/anaconda3/lib/python3.6/site-packages/pyspark/sql/utils.py in deco(*a, **kw)
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
~/anaconda3/lib/python3.6/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
318 "An error occurred while calling {0}{1}{2}.\n".
--> 319 format(target_id, ".", name), value)
320 else:
Py4JJavaError: An error occurred while calling o3769.fit.
: java.lang.IllegalArgumentException: requirement failed: Logistic Regression found inconsistent values for threshold and thresholds. Param threshold is set (0.5), indicating binary classification, but Param thresholds is set with length 3. Clear one Param value to fix this problem.
During handling of the above exception, another exception occurred:
IllegalArgumentException Traceback (most recent call last)
<ipython-input-211-8f3443f41b6b> in <module>()
----> 1 test_logit = test_logit_abst.fit(test_train_df)
~/anaconda3/lib/python3.6/site-packages/pyspark/ml/base.py in fit(self, dataset, params)
62 return self.copy(params)._fit(dataset)
63 else:
---> 64 return self._fit(dataset)
65 else:
66 raise ValueError("Params must be either a param map or a list/tuple of param maps, "
~/anaconda3/lib/python3.6/site-packages/pyspark/ml/wrapper.py in _fit(self, dataset)
263
264 def _fit(self, dataset):
--> 265 java_model = self._fit_java(dataset)
266 return self._create_model(java_model)
267
~/anaconda3/lib/python3.6/site-packages/pyspark/ml/wrapper.py in _fit_java(self, dataset)
260 """
261 self._transfer_params_to_java()
--> 262 return self._java_obj.fit(dataset._jdf)
263
264 def _fit(self, dataset):
~/anaconda3/lib/python3.6/site-packages/py4j/java_gateway.py in __call__(self, *args)
1131 answer = self.gateway_client.send_command(command)
1132 return_value = get_return_value(
-> 1133 answer, self.gateway_client, self.target_id, self.name)
1134
1135 for temp_arg in temp_args:
~/anaconda3/lib/python3.6/site-packages/pyspark/sql/utils.py in deco(*a, **kw)
77 raise QueryExecutionException(s.split(': ', 1)[1], stackTrace)
78 if s.startswith('java.lang.IllegalArgumentException: '):
---> 79 raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
80 raise
81 return deco
IllegalArgumentException: 'requirement failed: Logistic Regression found inconsistent values for threshold and thresholds. Param threshold is set (0.5), indicating binary classification, but Param thresholds is set with length 3. Clear one Param value to fix this problem.'
The error says threshold is set. This looks strange, as the documentation says that setting thresholds (plural) clears threshold (singular), so that the value 0.5 should be deleted.
So, how to clear threshold since no clearThreshold() exists?
In order to achieve this I tried to clear threshold this way:
logit_abst = (
cl.LogisticRegression()
.setFamily('multinomial')
.setThresholds([.5, .5, .5])
.setThreshold(None)
)
This time the fit command works, I even obtain the model intercept and coefficients:
test_logit.interceptVector
DenseVector([65.6445, 31.6369, -97.2814])
test_logit.coefficientMatrix
DenseMatrix(3, 3, [-76.4534, -19.4797, -79.4949, 12.3659, 4.642, 4.1057, 64.0876, 14.8377, 75.3892], 1)
But if I try to get thresholds (plural) from test_logit_abst I get an error:
test_logit_abst.getThresholds()
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-214-fc1c8617ce80> in <module>()
----> 1 test_logit_abst.getThresholds()
~/anaconda3/lib/python3.6/site-packages/pyspark/ml/classification.py in getThresholds(self)
363 if not self.isSet(self.thresholds) and self.isSet(self.threshold):
364 t = self.getOrDefault(self.threshold)
--> 365 return [1.0-t, t]
366 else:
367 return self.getOrDefault(self.thresholds)
TypeError: unsupported operand type(s) for -: 'float' and 'NoneType'
What does this mean?
As a further detail, curiously (and incomprehensibly to me) inverting the order of the parameters settings produces the first error I posted above:
logit_abst = (
cl.LogisticRegression()
.setFamily('multinomial')
.setThreshold(None)
.setThresholds([.5, .5, .5])
)
Why does changing the order of the "set" instructions change the output as well?

It is a messy situation indeed...
The short answer is:
setThresholds (plural) not clearing the threshold (singular) seems to be a bug
For multinomial classification (i.e. number of classes > 2), setThresholds does not do what you expect (and arguably you don't need it)
If all you need is having some "thresholds" in the "default" value of 0.5, you don't have a problem - simply don't use any relevant argument or setThresholds statement
If you really need to apply different decision thresholds to different classes in multinomial classification, you will have to do it manually, by post-processing the respective probabilities, i.e. the probability column in the transformed dataframe (it works OK though with setThreshold(s) for binary classification)
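The manual post-processing mentioned in the last point can be sketched with NumPy on probabilities taken from the probability column. The decision rule used here (pick the class with the largest probability-to-threshold ratio) is one possible choice and an assumption of this sketch; predict_with_thresholds is a hypothetical helper and the probabilities are made up.

```python
import numpy as np


def predict_with_thresholds(probabilities, thresholds):
    """Pick, for each row, the class with the largest probability / threshold ratio.

    Thresholds must be strictly positive; a higher threshold makes a class
    harder to predict.
    """
    probs = np.asarray(probabilities, dtype=float)
    t = np.asarray(thresholds, dtype=float)
    return np.argmax(probs / t, axis=1)


# Made-up class probabilities for three rows and three classes
probs = np.array([
    [0.20, 0.00, 0.80],
    [0.45, 0.05, 0.50],
    [0.10, 0.60, 0.30],
])

plain = np.argmax(probs, axis=1)                           # highest probability wins
adjusted = predict_with_thresholds(probs, [0.3, 0.3, 0.8])
print(plain, adjusted)
```

In the second row, class 2 has the highest probability (0.50), but its high threshold shifts the decision to class 0.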
And now for the long answer...
Let's start with binary classification, adapting the toy data from the docs:
spark.version
# u'2.2.0'
from pyspark.ml.classification import LogisticRegression
from pyspark.sql import Row
from pyspark.ml.linalg import Vectors
bdf = sc.parallelize([
    Row(label=1.0, features=Vectors.dense(0.0, 5.0)),
    Row(label=0.0, features=Vectors.dense(1.0, 2.0)),
    Row(label=1.0, features=Vectors.dense(2.0, 1.0)),
    Row(label=0.0, features=Vectors.dense(3.0, 3.0))]).toDF()
blor = LogisticRegression(threshold=0.7, thresholds=[0.3, 0.7])
We don't need to set thresholds (plural) here - threshold=0.7 is enough, but it will be useful when illustrating the differences with setThreshold below.
blorModel = blor.fit(bdf) # works OK
blor.getThreshold()
# 0.7
blor.getThresholds()
# [0.3, 0.7]
blorModel.transform(bdf).show(truncate=False) # transform the training data
Here is the result:
+---------+-----+------------------------------------------+----------------------------------------+----------+
|features |label|rawPrediction |probability |prediction|
+---------+-----+------------------------------------------+----------------------------------------+----------+
|[0.0,5.0]|1.0 |[-1.138455151184087,1.138455151184087] |[0.242604109995602,0.757395890004398] |1.0 |
|[1.0,2.0]|0.0 |[-0.6056346859838877,0.6056346859838877] |[0.35305562698104337,0.6469443730189567]|0.0 |
|[2.0,1.0]|1.0 |[0.26586039040308496,-0.26586039040308496]|[0.5660763559614698,0.4339236440385302] |0.0 |
|[3.0,3.0]|0.0 |[1.6453673835702176,-1.6453673835702176] |[0.8382639556951765,0.16173604430482344]|0.0 |
+---------+-----+------------------------------------------+----------------------------------------+----------+
What is the meaning of thresholds=[0.3, 0.7]? The answer lies in the 2nd row, where the prediction is 0.0 despite the fact that the probability is higher for 1.0 (0.65): 0.65 is indeed higher than 0.35, but it is lower than the threshold we have set for this class (0.7), hence the row is not classified as such.
Let's now try the seemingly identical operation, but with setThreshold(s) instead:
blor2 = (LogisticRegression()
.setThreshold(0.7)
.setThresholds([0.3, 0.7]) ) # works OK
blorModel2 = blor2.fit(bdf)
[...]
IllegalArgumentException: u'requirement failed: Logistic Regression getThreshold found inconsistent values for threshold (0.5) and thresholds (equivalent to 0.7)'
Nice, eh?
setThresholds (plural) seems indeed to have cleared our value of threshold (0.7) set in the previous line, as claimed in the docs, but it seemingly did so only to restore it to its default value of 0.5...
Omitting .setThreshold(0.7) gives the first error you report yourself (not shown).
Inverting the order of the parameter settings resolves the issue (!!!) and, moreover, renders both getThreshold (singular) and getThresholds (plural) operational (in contrast with your case):
blor2 = (LogisticRegression()
.setThresholds([0.3, 0.7])
.setThreshold(0.7) )
blorModel2 = blor2.fit(bdf) # works OK
blor2.getThreshold()
# 0.7
blor2.getThresholds()
# [0.30000000000000004, 0.7]
Let's move now to the multinomial case; we'll stick again to the example in the docs, with data from the Spark Github repo (they should also be available locally, in your $SPARK_HOME/data/mllib/sample_multiclass_classification_data.txt, but I am working on a Databricks notebook); it is a 3-class case, with labels in {0.0, 1.0, 2.0}.
data_path ="/FileStore/tables/sample_multiclass_classification_data.txt"
mdf = spark.read.format("libsvm").load(data_path)
Similarly with the binary case above, where the elements of our thresholds (plural) sum up to 1, let's ask for a threshold of 0.8 for class 2:
mlor = (LogisticRegression()
.setFamily("multinomial")
.setThresholds([0, 0.2, 0.8])
.setThreshold(0.8) )
mlorModel= mlor.fit(mdf) # works OK
mlor.getThreshold()
# 0.8
mlor.getThresholds()
# [0.19999999999999996, 0.8]
Looks fine, but let's ask for a prediction in the (training) dataset:
mlorModel.transform(mdf).show(truncate=False)
I have singled out only one row - it should be the 2nd from the end of the full output:
+-----+----------------------------------------------------+---------------------------------------------------------+---------------------------------------------------------------+----------+
|label|features |rawPrediction |probability |prediction|
+-----+----------------------------------------------------+---------------------------------------------------------+---------------------------------------------------------------+----------+
[...]
|0.0 |(4,[0,1,2,3],[0.111111,-0.333333,0.38983,0.166667]) |[36.67790353804905,-74.71196613173531,38.034062593686244]|[0.20486526556822454,8.619113376801409E-50,0.7951347344317755] |2.0 |
[...]
+-----+----------------------------------------------------+---------------------------------------------------------+---------------------------------------------------------------+----------+
Scrolling to the right, you'll see that despite the fact that the probability for class 2.0 here (0.795) is below the threshold we have set (0.8), the row is indeed predicted as 2.0, in contrast with the binary case demonstrated above...
So, what to do? Simply remove all the threshold-related statements; you don't need them - even setFamily is unnecessary, as the algorithm will detect by itself that you have more than 2 classes. This will give identical results with the above:
mlor = LogisticRegression() # works OK - no family, no threshold(s)
To summarize:
1. In both the binary & multinomial cases, what is actually returned by the algorithm is a vector of probabilities of length equal to the number of classes, with elements summing up to 1.
2. In the binary case only, Spark allows you to go one step further: instead of naively selecting the highest-probability class as the prediction, you can apply a user-defined threshold; this setting might be useful e.g. in cases with imbalanced data.
3. This threshold(s) setting has actually no effect in the multinomial case, where Spark will always return as prediction the class with the highest probability.
Despite the mess in the documentation (about which I have argued elsewhere) and the possibility of some bugs, let me say about (3) that this design choice is not unjustifiable; as it has been nicely argued elsewhere (emphasis in the original):
the statistical component of your exercise ends when you output a probability for each class of your new sample. Choosing a threshold beyond which you classify a new observation as 1 vs. 0 is not part of the statistics any more. It is part of the decision component.
Although the above argument was made for the binary case, it fully holds for the multinomial one, too...
