scikit-learn: building a learning curve with SVC

I'm trying to graph a learning curve using the SVC classifier. The dataset is somewhat skewed, with class sizes of about 150, 1000, 1000, 1000, and 150. I'm running into a problem when fitting the estimator:
File "/Users/carrier24sg/.virtualenvs/ml/lib/python2.7/site-packages/sklearn/learning_curve.py", line 135, in learning_curve
for train, test in cv for n_train_samples in train_sizes_abs)
File "/Users/carrier24sg/.virtualenvs/ml/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 644, in __call__
self.dispatch(function, args, kwargs)
File "/Users/carrier24sg/.virtualenvs/ml/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 391, in dispatch
job = ImmediateApply(func, args, kwargs)
File "/Users/carrier24sg/.virtualenvs/ml/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 129, in __init__
self.results = func(*args, **kwargs)
File "/Users/carrier24sg/.virtualenvs/ml/lib/python2.7/site-packages/sklearn/cross_validation.py", line 1233, in _fit_and_score
estimator.fit(X_train, y_train, **fit_params)
File "/Users/carrier24sg/.virtualenvs/ml/lib/python2.7/site-packages/sklearn/svm/base.py", line 140, in fit
X = atleast2d_or_csr(X, dtype=np.float64, order='C')
File "/Users/carrier24sg/.virtualenvs/ml/lib/python2.7/site-packages/sklearn/svm/base.py", line 450, in _validate_targets
% len(cls))
ValueError: The number of classes has to be greater than one; got 1
My code:
df = pd.read_csv('../resources/problem2_processed_validate.csv')
data, label = preprocess_text(df)
cv = StratifiedKFold(label, 10)
plt = plot_learning_curve(estimator=SVC(), title="Learning curve", X=data, y=label.values, cv=cv)
train_sizes, train_scores, test_scores = learning_curve(
    estimator, data, y=label, cv=cv, train_sizes=np.linspace(.1, 1.0, 5))
Even though I use stratified sampling, I still run into this error. I believe it's because the learning curve code doesn't perform stratification when incrementing the dataset size, so at some step all the samples have the same class label.
How should I resolve this?

You could use StratifiedShuffleSplit instead of StratifiedKFold, and then write the learning curve loop yourself, creating a new CV object at each iteration. StratifiedShuffleSplit allows you to specify a train_size and a test_size which you can increment as you create your learning curve. As long as you let train_size be greater than the number of classes, it will be able to stratify.
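A minimal sketch of that approach, assuming data and label are the arrays from the question and using the modern sklearn.model_selection API (the sklearn.cross_validation module used in the question has since been removed):
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.svm import SVC

mean_test_scores = []
for frac in np.linspace(0.1, 0.8, 5):
    # A fresh CV object per learning-curve step, with a growing train_size.
    # Stratification works as long as each train split can contain every class.
    cv = StratifiedShuffleSplit(n_splits=10, train_size=frac, test_size=0.2)
    scores = [SVC().fit(data[tr], label[tr]).score(data[te], label[te])
              for tr, te in cv.split(data, label)]
    mean_test_scores.append(np.mean(scores))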

You are right: learning_curve doesn't perform stratification when creating a smaller dataset; it just takes the first chunk of the data. Lines 134-136 in learning_curve.py say
train[:n_train_samples] for n_train_samples in train_sizes_abs
You can shuffle your data in advance, so that the slice train[:n_train_samples] may (but is not guaranteed to) include data points from all classes. If you are willing to do some more work, what @eickenberg proposed will work.
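A minimal sketch of the shuffling workaround, assuming data and label are the arrays from the question; sklearn.utils.shuffle permutes both in unison:
from sklearn.utils import shuffle

# Permute features and labels together so that every prefix slice
# train[:n_train_samples] is likely to contain all classes.
data, label = shuffle(data, label, random_state=0)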
P.S. This sounds like something that should be included in sklearn. If you do end up writing that code, please send a pull request on GitHub.

Related

clf.pred_proba used for Histogram Oriented Gradient predictions

I wrote this code several years ago. I am revisiting it only to find that some things have since changed in the usage of some statements. The code examines 99 images of trucks, SUVs, and sedans by converting them into histograms of oriented gradients (HOG). It then uses sklearn to create an SVM using
clf = svm.SVC(kernel='linear', probability=True)
clf.fit(feature, targets)
where feature is the HOG images and targets is the vehicle type, i.e., TRK, SED, or SUV.
Everything appears to run to this point.
I then submit another image (not in the training set) to the SVM to test the model. I use these statements:
def rgb2gray(rgb):
    # standard luminance weights for RGB-to-grayscale conversion
    return np.dot(rgb[..., :3], [0.2989, 0.5870, 0.1140])

Predfile = 'XXXX.prd.png'
# (loading of Predfile into img is omitted in the question)
grayscale = rgb2gray(img)
fdPred = hog(grayscale, orientations=8, pixels_per_cell=(6, 6), cells_per_block=(3, 3), visualize=False)
prob = clf.predict_proba(fdPred)
At this point I get the following error:
File "C:\Users\thiir\.conda\envs\py37\lib\site-packages\sklearn\svm\_base.py", line 670, in _predict_proba
X = self._validate_for_predict(X)
File "C:\Users\thiir\.conda\envs\py37\lib\site-packages\sklearn\svm\_base.py", line 475, in _validate_for_predict
order="C", accept_large_sparse=False)
File "C:\Users\thiir\.conda\envs\py37\lib\site-packages\sklearn\utils\validation.py", line 63, in inner_f
return f(*args, **kwargs)
File "C:\Users\thiir\.conda\envs\py37\lib\site-packages\sklearn\utils\validation.py", line 698, in check_array
"if it contains a single sample.".format(array))
ValueError: Expected 2D array, got 1D array instead:
array=[0. 0. 0. ... 0. 0. 0.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
Any insights?
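The error message itself points at the likely fix: hog returns a 1D feature vector for a single image, while predict_proba expects a 2D array of shape (n_samples, n_features). A sketch of the reshape it asks for:
# fdPred is the HOG vector of one image; make it a batch of one sample.
prob = clf.predict_proba(fdPred.reshape(1, -1))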

Bugs when fitting Multi label text classification models

I am now trying to fit a classification model for a multi-label text classification problem.
I have a train set X_train that contains a list of cleaned texts, like
["I am constructing Markov chains with to states and inferring
transition probabilities empirically by simply counting how many
times I saw each transition in my raw data",
"I know the chips only of the players of my table and mine obviously I
also know the total number of chips the max and min amount chips the
players have and the average stackIs it possible to make an
approximation of my probability of winningI have,
...]
and a training set y of multiple tags corresponding to each text in X_train, like
[['hypothesis-testing', 'statistical-significance', 'markov-process'],
['probability', 'normal-distribution', 'games'],
...]
Now I want to fit a model that can predict the tags for a text set X_test that has the same format as X_train.
I have used MultiLabelBinarizer to convert the tags and TfidfVectorizer to convert the cleaned texts in the train set.
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.feature_extraction.text import TfidfVectorizer

multilabel_binarizer = MultiLabelBinarizer()
multilabel_binarizer.fit(y)
Y = multilabel_binarizer.transform(y)

vectorizer = TfidfVectorizer(stop_words=stopWordList)
vectorizer.fit(X_train)
x_train = vectorizer.transform(X_train)
But when I try to fit the model I always get errors. I have tried OneVsRestClassifier and LogisticRegression.
When I fit a OneVsRestClassifier model I got errors like
Traceback (most recent call last):
File "/opt/conda/envs/data3/lib/python3.6/socketserver.py", line 317, in _handle_request_noblock
self.process_request(request, client_address)
File "/opt/conda/envs/data3/lib/python3.6/socketserver.py", line 348, in process_request
self.finish_request(request, client_address)
File "/opt/conda/envs/data3/lib/python3.6/socketserver.py", line 361, in finish_request
self.RequestHandlerClass(request, client_address, self)
File "/opt/conda/envs/data3/lib/python3.6/socketserver.py", line 696, in __init__
self.handle()
File "/usr/local/spark/python/pyspark/accumulators.py", line 268, in handle
poll(accum_updates)
File "/usr/local/spark/python/pyspark/accumulators.py", line 241, in poll
if func():
File "/usr/local/spark/python/pyspark/accumulators.py", line 245, in accum_updates
num_updates = read_int(self.rfile)
File "/usr/local/spark/python/pyspark/serializers.py", line 714, in read_int
raise EOFError
EOFError
When I fit a LogisticRegression model I got warnings like
/opt/conda/envs/data3/lib/python3.6/site-packages/sklearn/linear_model/sag.py:326: ConvergenceWarning: The max_iter was reached which means the coef_ did not converge
"the coef_ did not converge", ConvergenceWarning)
Does anyone know where the problem is and how to solve it? Many thanks.

OneVsRestClassifier fits one classifier per class. You need to tell it which type of classifier you want (for example, logistic regression).
The following code works for me:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
classifier = OneVsRestClassifier(LogisticRegression())
classifier.fit(x_train, Y)
X_test= ["I play with Markov chains"]
x_test = vectorizer.transform(X_test)
classifier.predict(x_test)
output: array([[0, 1, 1, 0, 0, 1]])

stacked GRU model in keras

I want to create a GRU model of 3 layers where the layers have 32, 16, and 8 units respectively. The model takes an analog value as input and produces an analog value as output.
I have written the following code:
def getAModelGRU(neuron=(10), look_back=1, numInputs=1, numOutputs=1):
    model = Sequential()
    if len(neuron) > 1:
        model.add(GRU(units=neuron[0], input_shape=(look_back, numInputs)))
        for i in range(1, len(neuron) - 1):
            model.add(GRU(units=neuron[i]))
        model.add(GRU(units=neuron[-1], input_shape=(look_back, numInputs)))
    else:
        model.add(GRU(units=neuron, input_shape=(look_back, numInputs)))
    model.add(Dense(numOutputs))
    model.compile(loss='mean_squared_error', optimizer='adam')
    return model
I will call this function as:
chkEKF = getAModelGRU(neuron=(32,16,8), look_back=1, numInputs=10, numOutputs=6)
I obtained the following:
Traceback (most recent call last):
File "/home/momtaz/Dropbox/QuadCopter/quad_simHierErrorCorrectionEstimator.py", line 695, in <module>
Single_Point2Point()
File "/home/momtaz/Dropbox/QuadCopter/quad_simHierErrorCorrectionEstimator.py", line 74, in Single_Point2Point
chkEKF = getAModelGRU(neuron=(32,16,8), look_back=1, numInputs=10, numOutputs=6)
File "/home/momtaz/Dropbox/QuadCopter/rnnUtilQuad.py", line 72, in getAModelGRU
model.add(GRU(units=neuron[i]))
File "/home/momtaz/PycharmProjects/venv/lib/python3.6/site-packages/keras/engine/sequential.py", line 181, in add
output_tensor = layer(self.outputs[0])
File "/home/momtaz/PycharmProjects/venv/lib/python3.6/site-packages/keras/layers/recurrent.py", line 532, in __call__
return super(RNN, self).__call__(inputs, **kwargs)
File "/home/momtaz/PycharmProjects/venv/lib/python3.6/site-packages/keras/engine/base_layer.py", line 414, in __call__
self.assert_input_compatibility(inputs)
File "/home/momtaz/PycharmProjects/venv/lib/python3.6/site-packages/keras/engine/base_layer.py", line 311, in assert_input_compatibility
str(K.ndim(x)))
ValueError: Input 0 is incompatible with layer gru_2: expected ndim=3, found ndim=2
I searched online but did not find any solution for this 'ndim'-related issue.
Please let me know what I am doing wrong here.
You need to ensure that the input_shape parameter is defined only in the first layer, and that every layer has return_sequences=True except possibly the last one (depending on your model).
The code below covers the common case where you want to stack several layers and only the number of units in each layer changes.
import tensorflow as tf

model = tf.keras.Sequential()
gru_options = [dict(units=units,
                    time_major=False,
                    # a regularizer object, not a bare float
                    kernel_regularizer=tf.keras.regularizers.l2(0.01),
                    # ... potentially more options
                    return_sequences=True) for units in [32, 16, 8]]
gru_options[0]['input_shape'] = (n_timesteps, n_inputs)  # placeholders for your dimensions
gru_options[-1]['return_sequences'] = False  # optionally disable sequences in the last layer.
                                             # If you want to return sequences in your last layer,
                                             # delete this line; however, it is necessary if you
                                             # want to connect this to a Dense layer, for example.
for opts in gru_options:
    model.add(tf.keras.layers.GRU(**opts))
model.add(tf.keras.layers.Dense(6))
By the way, there is a bug in your code as posted: the body of the else clause is not indented. Also notice that Python classes implementing the Iterable protocol (e.g. lists and tuples) can be iterated over with for-in syntax; you don't have to do C-style index iteration (the for-in form is more idiomatic, or "pythonic").

Constrained optimization for word2vec giving TypeError

I want to run a word2vec model in TensorFlow with bounds on the embeddings so that they stay within unit distance of the origin. That is, for every coordinate x in the target and context matrices, -1 <= x <= 1.
I found this nice interface that allows one to do this. However, when I make the changes to the original code, I get the following error.
File "/home/sivanov/PycharmProjects/hypemb/tf_word2vec/word2vec_original.py", line 379, in build_graph
self.optimize(loss)
File "/home/sivanov/PycharmProjects/hypemb/tf_word2vec/word2vec_original.py", line 297, in optimize
method = 'SLSQP')
File "/home/sivanov/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/opt/python/training/external_optimizer.py", line 126, in __init__
self._packed_var = self._pack(self._vars)
File "/home/sivanov/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/opt/python/training/external_optimizer.py", line 259, in _pack
return array_ops.concat(flattened, 0)
File "/home/sivanov/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/array_ops.py", line 1066, in concat
name=name)
File "/home/sivanov/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 493, in _concat_v2
name=name)
File "/home/sivanov/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 463, in apply_op
raise TypeError("%s that don't all match." % prefix)
TypeError: Tensors in list passed to 'values' of 'ConcatV2' Op have types [float32, float32, float32, int32] that don't all match.
So I suspect that the matrices I am trying to bound have a type mismatch. However, I initialize them with a uniform distribution, so I'm not sure where an int32 type could come from.
emb = tf.Variable(
    tf.random_uniform(
        [opts.vocab_size, opts.emb_dim], -init_width, init_width),
    name="emb")
self._emb = emb

sm_w_t = tf.Variable(
    tf.random_uniform(
        [opts.vocab_size, opts.emb_dim], -init_width, init_width),
    name="sm_w_t", dtype=tf.float32)
self._sm_w_t = sm_w_t
In addition to the aforementioned lines, I also added the following snippets:
In the optimize function:
optimizer = tf.contrib.opt.ScipyOptimizerInterface(
    loss, options={'max_iter': 5},
    var_to_bounds={self._emb: (-1., 1.), self._sm_w_t: (-1., 1.)},
    method='SLSQP')
self._optimizer = optimizer
In the _train_thread_body function:
self._optimizer.minimize(self._session)
The full word2vec code can be found here, and I call it by running python word2vec_test.py (with the word2vec_ops.so file compiled first; instructions can be found here).
How can I perform optimization of word2vec embeddings with constraints on the coordinates?
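One plausible cause, offered as a guess rather than a confirmed diagnosis: ScipyOptimizerInterface packs all trainable variables by default, and the original word2vec graph defines an int32 global_step variable that is trainable by default, which would explain the lone int32 in the concat error. A sketch of restricting the optimizer to the embedding matrices via the var_list argument:
# Limit packing to the float32 embedding matrices so the
# int32 global_step variable is not concatenated with them.
optimizer = tf.contrib.opt.ScipyOptimizerInterface(
    loss,
    var_list=[self._emb, self._sm_w_t],
    var_to_bounds={self._emb: (-1., 1.), self._sm_w_t: (-1., 1.)},
    options={'max_iter': 5},
    method='SLSQP')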

How to use OneVsRestClassifier with SVC for multilabel problems?

I'm using OneVsRestClassifier for multilabel classification. It works with LinearSVC, but when I apply it to SVC, the following error appears:
classifier = OneVsRestClassifier(SVC(class_weight='balanced'))
classifier.fit(X1, y1)
y2 = classifier.predict(X2)
Traceback (most recent call last):
...
File "/usr/local/lib/python2.7/dist-packages/sklearn/multiclass.py", line 219, in predict
return predict_ovr(self.estimators_, self.label_binarizer_, X)
File "/usr/local/lib/python2.7/dist-packages/sklearn/multiclass.py", line 93, in predict_ovr
Y = np.array([_predict_binary(e, X) for e in estimators])
File "/usr/local/lib/python2.7/dist-packages/sklearn/multiclass.py", line 66, in _predict_binary
score = estimator.predict_proba(X)[:, 1]
File "/usr/local/lib/python2.7/dist-packages/sklearn/svm/base.py", line 490, in predict_proba
"probability estimates must be enabled to use this method")
NotImplementedError: probability estimates must be enabled to use this method
Does anybody know what it is?
This is a bug. The OneVsRestClassifier calls the predict_proba method when it finds one, but the one on SVC does not actually work unless you construct it with probability=True to get Platt scaling (which I don't actually encourage).
The reason it works for LinearSVC is that that class does not have a predict_proba at all, so OvR falls back to the decision_function method.
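For completeness, the Platt-scaling route the answer mentions looks like this (at the cost of an extra internal cross-validation inside fit):
# Enable probability estimates so predict_proba works inside OvR.
classifier = OneVsRestClassifier(SVC(class_weight='balanced', probability=True))
classifier.fit(X1, y1)
y2 = classifier.predict(X2)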
