How to use OneVsRestClassifier with SVC for multilabel problems? - svm

I'm using OneVsRestClassifier for multilabel classification. It works with LinearSVC, but when I apply it to SVC, the following error appears:
classifier = OneVsRestClassifier(SVC(class_weight='balanced'))
classifier.fit(X1, y1)
y2 = classifier.predict(X2)
Traceback (most recent call last):
...
File "/usr/local/lib/python2.7/dist-packages/sklearn/multiclass.py", line 219, in predict
return predict_ovr(self.estimators_, self.label_binarizer_, X)
File "/usr/local/lib/python2.7/dist-packages/sklearn/multiclass.py", line 93, in predict_ovr
Y = np.array([_predict_binary(e, X) for e in estimators])
File "/usr/local/lib/python2.7/dist-packages/sklearn/multiclass.py", line 66, in _predict_binary
score = estimator.predict_proba(X)[:, 1]
File "/usr/local/lib/python2.7/dist-packages/sklearn/svm/base.py", line 490, in predict_proba
"probability estimates must be enabled to use this method")
NotImplementedError: probability estimates must be enabled to use this method</code>
Does anybody know what is it?

This is a bug. The OneVsRestClassifier calls the predict_proba method when it finds one, but the one on SVC does not actually work unless you construct it with probability=True to get Platt scaling (which I don't actually encourage).
The reason that it works for LinearSVC is that that class does not have a predict_proba, so OvR backs off to the decision_function method.

Related

Bugs when fitting Multi label text classification models

I am now trying to fit a classification model for a Multi label text classification problem.
I have a train set X_train that contains list of cleaned text, like
["I am constructing Markov chains with to states and inferring
transition probabilities empirically by simply counting how many
times I saw each transition in my raw data",
"I know the chips only of the players of my table and mine obviously I
also know the total number of chips the max and min amount chips the
players have and the average stackIs it possible to make an
approximation of my probability of winningI have,
...]
and a train multiple tags set y corresponding to each text in X_train, like
[['hypothesis-testing', 'statistical-significance', 'markov-process'],
['probability', 'normal-distribution', 'games'],
...]
Now I want to fit a model that could predict the tags in a text set X_test that has same format as X_train.
I have used the MultiLabelBinarizer to convert the tags and used TfidfVectorizer to convert the cleaned text in train set.
multilabel_binarizer = MultiLabelBinarizer()
multilabel_binarizer.fit(y)
Y = multilabel_binarizer.transform(y)
vectorizer = TfidfVectorizer(stop_words = stopWordList)
vectorizer.fit(X_train)
x_train = vectorizer.transform(X_train)
But when I try to fit the model I always get bugs.I have tried OneVsRestClassifier and LogisticRegression.
When I fit a OneVsRestClassifier model I got bugs like
Traceback (most recent call last):
File "/opt/conda/envs/data3/lib/python3.6/socketserver.py", line 317, in _handle_request_noblock
self.process_request(request, client_address)
File "/opt/conda/envs/data3/lib/python3.6/socketserver.py", line 348, in process_request
self.finish_request(request, client_address)
File "/opt/conda/envs/data3/lib/python3.6/socketserver.py", line 361, in finish_request
self.RequestHandlerClass(request, client_address, self)
File "/opt/conda/envs/data3/lib/python3.6/socketserver.py", line 696, in __init__
self.handle()
File "/usr/local/spark/python/pyspark/accumulators.py", line 268, in handle
poll(accum_updates)
File "/usr/local/spark/python/pyspark/accumulators.py", line 241, in poll
if func():
File "/usr/local/spark/python/pyspark/accumulators.py", line 245, in accum_updates
num_updates = read_int(self.rfile)
File "/usr/local/spark/python/pyspark/serializers.py", line 714, in read_int
raise EOFError
EOFError
When I fit a LogisticRegression model I got bugs like
/opt/conda/envs/data3/lib/python3.6/site-packages/sklearn/linear_model/sag.py:326: ConvergenceWarning: The max_iter was reached which means the coef_ did not converge
"the coef_ did not converge", ConvergenceWarning)
Anyone knows where the problem is and how to solve this? Many thanks.
OneVsRestClassifier fits one classifier per class. You need to tell it which type of classifier you want (for example Losgistic regression).
The following code works for me:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
classifier = OneVsRestClassifier(LogisticRegression())
classifier.fit(x_train, Y)
X_test= ["I play with Markov chains"]
x_test = vectorizer.transform(X_test)
classifier.predict(x_test)
output: array([[0, 1, 1, 0, 0, 1]])

Constrained optimization for word2vec giving TypeError

I want to run word2vec model in tensorflow with bounds on embeddings so that they are within unit distance from the origin. That is for every coordinate x in Target and Context matrices -1 <= x <= 1.
I found this nice interface that allows one to do this. However, when I make the changes to the original code, I get the following error.
File "/home/sivanov/PycharmProjects/hypemb/tf_word2vec/word2vec_original.py", line 379, in build_graph
self.optimize(loss)
File "/home/sivanov/PycharmProjects/hypemb/tf_word2vec/word2vec_original.py", line 297, in optimize
method = 'SLSQP')
File "/home/sivanov/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/opt/python/training/external_optimizer.py", line 126, in __init__
self._packed_var = self._pack(self._vars)
File "/home/sivanov/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/opt/python/training/external_optimizer.py", line 259, in _pack
return array_ops.concat(flattened, 0)
File "/home/sivanov/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/array_ops.py", line 1066, in concat
name=name)
File "/home/sivanov/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 493, in _concat_v2
name=name)
File "/home/sivanov/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 463, in apply_op
raise TypeError("%s that don't all match." % prefix) TypeError: Tensors in list passed to 'values' of 'ConcatV2' Op have types [float32, float32, float32, int32] that don't all match.
So, I expect that the matrices that I try to bound have mismatch on type. However, I initialize them with uniform distribution, so I'm not sure where I can get int32 type.
emb = tf.Variable(
tf.random_uniform(
[opts.vocab_size, opts.emb_dim], -init_width, init_width),
name="emb")
self._emb = emb
sm_w_t = tf.Variable(
tf.random_uniform(
[opts.vocab_size, opts.emb_dim], -init_width, init_width),
name="sm_w_t", dtype=tf.float32)
self._sm_w_t = sm_w_t
In addition to the aforementioned lines, I also added the following snippets:
In optimize function:
optimizer = tf.contrib.opt.ScipyOptimizerInterface(loss, options={'max_iter': 5},
var_to_bounds = {self._emb: (-1., 1.), self._sm_w_t: (-1., 1.)},
method = 'SLSQP')
self._optimizer = optimizer
In _thrain_thread_body function:
self._optimizer.minimize(self._session)
Full code can be found word2vec code can be found here, and I call this code by running python word2vec_test.py (with first pre-compiled word2vec_ops.so file; instructions can be found here).
How can I perform optimization of word2vec embeddings with constraints on the coordinates?

Missing method NeuralNet.train_split() in lasagne

I am learning to deal with python and lasagne. I have following installed on my pc:
python 3.4.3
theano 0.9.0
lasagne 0.2.dev1
and also six, scipy and numpy. I call net.fit(), and the stacktrace tries to call train_split(X, y, self), which, I guess, should split the samples into training set and validation set (both the inputs X as well as the outputs Y).
But there is no method like train_split(X, y, self) , there is only a float field train_split - I assume, the ratio between training and validation set sizes. Then I get following error:
Traceback (most recent call last):
File "...\workspaces\python\cnn\dl_tutorial\lasagne\Test.py", line
72, in
net = net1.fit(X[0:10,:,:,:],y[0:10])
File "...\Python34\lib\site-packages\nolearn\lasagne\base.py", line
544, in fit
self.train_loop(X, y, epochs=epochs)
File "...\Python34\lib\site-packages\nolearn\lasagne\base.py", line
554, in train_loop
X_train, X_valid, y_train, y_valid = self.train_split(X, y, self)
TypeError: 'float' object is not callable
What could be wrong or missing? Any suggestions? Thank you very much.
SOLVED
in previous versions, the input parameter train_split has been a number, that was used by the same-named method. In nolearn 0.6.0, it's a callable object, that can implement its own logic to split the data. So instead of providing a float number to the input parameter train_split, I have to provide a callable instance (the default one is TrainSplit), that will be executed in each training epoch.

scikit-learn: building a learning curve with SVC

I'm trying to graph a learning curve using the SVC classifier. The dataset is kinda skewed, about 150, 1000, 1000, 1000 and 150 in size. I'm running into problem with fitting the estimator:
File "/Users/carrier24sg/.virtualenvs/ml/lib/python2.7/site-packages/sklearn/learning_curve.py", line 135, in learning_curve
for train, test in cv for n_train_samples in train_sizes_abs)
File "/Users/carrier24sg/.virtualenvs/ml/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 644, in __call__
self.dispatch(function, args, kwargs)
File "/Users/carrier24sg/.virtualenvs/ml/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 391, in dispatch
job = ImmediateApply(func, args, kwargs)
File "/Users/carrier24sg/.virtualenvs/ml/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 129, in __init__
self.results = func(*args, **kwargs)
File "/Users/carrier24sg/.virtualenvs/ml/lib/python2.7/site-packages/sklearn/cross_validation.py", line 1233, in _fit_and_score
estimator.fit(X_train, y_train, **fit_params)
File "/Users/carrier24sg/.virtualenvs/ml/lib/python2.7/site-packages/sklearn/svm/base.py", line 140, in fit
X = atleast2d_or_csr(X, dtype=np.float64, order='C')
File "/Users/carrier24sg/.virtualenvs/ml/lib/python2.7/site-packages/sklearn/svm/base.py", line 450, in _validate_targets
% len(cls))
ValueError: The number of classes has to be greater than one; got 1
My code
df = pd.read_csv('../resources/problem2_processed_validate.csv')
data, label = preprocess_text(df)
cv = StratifiedKFold(label, 10)
plt = plot_learning_curve(estimator=SVC(), title="Learning curve", X=data, y=label.values, cv
train_sizes, train_scores, test_scores = learning_curve(
estimator, data, y=label, cv=cv, train_sizes=np.linspace(.1, 1.0, 5))
Even though I use stratified sampling, I still run into this error. I believe its because the learning curve code doesn't perform stratification when incrementing dataset size, and I've got all similar class labels at one step.
How should I resolve this??
You could use StratifiedShuffleSplit instead of StratifiedKFold, and then write the learning curve loop yourself, creating a new CV object at each iteration. StratifiedShuffleSplit allows you to specify a train_size and a test_size which you can increment as you create your learning curve. As long as you let train_size be greater than the number of classes, it will be able to stratify.
You are right. learning_curve doesn't perform stratification when creating a smaller data set, it just takes the first bit of the data. Lines 134-136 in learning_curve.py say
train[:n_train_samples] for n_train_samples in train_sizes_abs
You can shuffle your data in advance, so that the slice train[:n_train_samples] may (but is not guaranteed to) include data points from all classes. If you are willing to do some more work, what #eickenberg proposed will work.
PS This sounds like something that should be included in sklearn. If you do end up writing that code, please send a pull request on github

How to handle NaNs returned from 'roc_curve' before passing to 'auc'?

I am using 'roc_curve' from the metrics model in scikit-learn. The example shows that 'roc_curve' should be called before 'auc' similar to:
fpr, tpr, thresholds = metrics.roc_curve(y, pred, pos_label=2)
and then:
metrics.auc(fpr, tpr)
However the following error is returned:
Traceback (most recent call last): File "analysis.py", line 207, in <module>
r = metrics.auc(fpr, tpr) File "/apps/anaconda/1.6.0/lib/python2.7/site-packages/sklearn/metrics/metrics.py", line 66, in auc
x, y = check_arrays(x, y) File "/apps/anaconda/1.6.0/lib/python2.7/site-packages/sklearn/utils/validation.py", line 215, in check_arrays
_assert_all_finite(array) File "/apps/anaconda/1.6.0/lib/python2.7/site-packages/sklearn/utils/validation.py", line 18, in _assert_all_finite
raise ValueError("Array contains NaN or infinity.") ValueError: Array contains NaN or infinity.
What does it mean in terms or results/is there a way to overcome this?
Are you trying to us roc_curve to evaluate a multiclass classifier? In other words, if you are using roc_curve on a classification problem that is not binary, then this won't work correctly. There is math out there for multidimensional ROC analysis, but the current ROC methods in python don't implement them.
To evaluate multiclass problems trying using methods like: confusion_matrix and classification_report from sklearn, and kappa() from skll.
You state this line:
fpr, tpr, thresholds = metrics.roc_curve(y, pred, pos_label=2)
which leads to the conclusion that you may have copied the sklearn example which also uses "pos_label=2".
However, in most cases you want the "pos_label" to be 1. So if your code outputs probabilities and they are between 0 and 1, then your pos_label should be 1.

Resources