I have a multi-layer neural net and I want an SVM at the end. From googling and searching Stack Exchange, it seems this is easily implemented in Keras using the hinge or categorical_hinge loss function. However, I am confused as to which one to use.
My examples are to be classified into one of two classes, either class 0 or class 1. So I can do it via either:
Method 1 https://github.com/keras-team/keras/issues/2588 (uses hinge) or How do I use categorical_hinge in Keras? (uses categorical_hinge):
Labels will be of shape (N, 2) with values of 0 or 1 indicating whether the example belongs to that class or not.
nb_classes = 2
model.add(Dense(nb_classes, W_regularizer=l2(0.01)))
model.add(Activation('linear'))
model.compile(loss='hinge',  # or 'categorical_hinge'?
              optimizer='adadelta',
              metrics=['accuracy'])
The predicted class is then whichever of the two output nodes has the higher value?
Method 2 https://github.com/keras-team/keras/issues/2830 (uses hinge):
The first commenter mentioned that hinge is supposed to be binary_hinge and that the labels must be -1 or 1 for no or yes, and that the activation for the last SVM layer should be tanh with 1 node only.
So it should look something like this, but the labels will be of shape (N, 1) with values of either -1 or 1.
model.add(Dense(1, W_regularizer=l2(0.01)))
model.add(Activation('tanh'))
model.compile(loss='hinge',
              optimizer='adadelta',
              metrics=['accuracy'])
So which method is correct or more desirable? I am unsure of what to use since there are multiple answers online and the Keras documentation says nothing at all about the hinge and categorical_hinge loss functions. Thank you!
Might be a bit late but here is my answer.
You can do it in multiple ways:
Since you have 2 classes it is a binary problem and you can use the normal hinge.
The architecture then only needs a single output node, with labels -1 and 1 as you said.
You can also use two outputs in the last layer; your labels just have to be one-hot encodings of the class, and then you use the categorical hinge.
Regarding the activation: both a linear layer and a tanh would make an SVM; the tanh just smooths the output.
I would suggest making it binary and using a tanh layer, but try both to see what works.
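For what it's worth, a minimal sketch of the binary variant could look like the following. It assumes the Keras 2 API (kernel_regularizer instead of W_regularizer), and the hidden layer and input size are made up:

from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.regularizers import l2

model = Sequential()
model.add(Dense(64, activation='relu', input_shape=(100,)))  # hypothetical hidden layer / input size
model.add(Dense(1, kernel_regularizer=l2(0.01)))             # single output node
model.add(Activation('tanh'))                                # smoothed SVM-style output in [-1, 1]
model.compile(loss='hinge', optimizer='adadelta', metrics=['accuracy'])

# Labels y_train should be -1 or 1 with shape (N, 1):
# model.fit(x_train, y_train, epochs=10, batch_size=32)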
Related
I am currently turning my binary classification model into a multi-class classification model. Bear with me, I am very new to PyTorch and machine learning.
Most of what I state here, I know from the following video.
https://www.youtube.com/watch?v=7q7E91pHoW4&t=654s
What I read / know is that the CrossEntropyLoss already has the Softmax function implemented, thus my output layer is linear.
What I then read/saw is that I can just choose my model prediction by taking the torch.max() of my model output (which comes from my last linear layer). This feels weird because I have some negative outputs, and I thought I needed to apply the softmax function first, but it seems to work correctly without it.
So now the big confusing question I have is: when would I use the softmax function? Would I only use it when my loss doesn't have it implemented? But then I would choose my prediction based on the outputs of the softmax layer, which wouldn't be the same as with the linear output layer.
Thank you guys for every answer this gets.
For calculating the loss with CrossEntropyLoss you do not need softmax, because CrossEntropyLoss already includes it. However, to turn the model outputs into probabilities you still need to apply softmax.
Let's say you didn't apply softmax at the end of your model and trained it with cross-entropy. Then you want to evaluate your model on new data, get outputs, and use those outputs for classification. At this point you can manually apply softmax to your outputs, and there will be no problem. This is how it is usually done.
Training()
MODEL ----> FC LAYER ----> raw outputs ----> CrossEntropy Loss
Eval()
MODEL ----> FC LAYER ----> raw outputs ----> Softmax ----> Probabilities
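As a minimal PyTorch sketch of that flow (the model, sizes, and data here are made-up placeholders):

import torch
import torch.nn as nn

# Hypothetical model: last layer is Linear, no softmax.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 3))
criterion = nn.CrossEntropyLoss()  # expects raw logits + integer class labels

x = torch.randn(4, 10)             # dummy batch
y = torch.tensor([0, 2, 1, 0])     # integer class labels

# Training: feed raw logits straight into CrossEntropyLoss.
logits = model(x)
loss = criterion(logits, y)

# Evaluation: apply softmax only if you want probabilities.
with torch.no_grad():
    probs = torch.softmax(model(x), dim=1)
    preds = probs.argmax(dim=1)
# Note: argmax of the raw logits gives the same predictions, because
# softmax is monotonic; softmax is only needed for the probabilities.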
Yes, you need to apply softmax on the output layer. When you are doing binary classification you are free to use relu, sigmoid, tanh, etc. as the activation function. But when you are doing multi-class classification, softmax is required because the softmax activation function distributes the probability across the output nodes, so you can easily conclude that the output node with the highest probability is the predicted class. Thank you. Hope this is useful!
One common task in DL is normalizing input samples to zero mean and unit variance. One can "manually" perform the normalization using code like this:
import numpy as np

mean = np.mean(X, axis=0)   # per-feature mean of the training data
std = np.std(X, axis=0)     # per-feature standard deviation
X = (X - mean) / std        # normalize every sample with the training statistics
However, one must then keep the mean and std values around to normalize the testing data, in addition to the Keras model being trained. Since the mean and std are learnable parameters, perhaps Keras can learn them? Something like this:
m = Sequential()
m.add(SomeKerasLayerForNormalizing(...))  # hypothetical normalizing layer
m.add(Conv2D(20, (5, 5), input_shape=(21, 100, 3), padding='valid'))
# ... rest of network
m.add(Dense(1, activation='sigmoid'))
I hope you understand what I'm getting at.
Add BatchNormalization as the first layer and it works as expected, though not exactly like the OP's example. You can see the detailed explanation here.
Both the OP's example and batch normalization use a learned mean and standard deviation of the input data during inference. But the OP's example uses a simple mean that gives every training sample equal weight, while the BatchNormalization layer uses a moving average that gives recently-seen samples more weight than older samples.
Importantly, batch normalization works differently from the OP's example during training. During training, the layer normalizes its output using the mean and standard deviation of the current batch of inputs.
A second distinction is that the OP's code produces an output with a mean of zero and a standard deviation of one. Batch Normalization instead learns a mean and standard deviation for the output that improves the entire network's loss. To get the behavior of the OP's example, Batch Normalization should be initialized with the parameters scale=False and center=False.
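A minimal sketch of that setup, assuming an input shape like the OP's and with scale=False, center=False so the layer only normalizes:

from keras.models import Sequential
from keras.layers import BatchNormalization, Conv2D, Flatten, Dense

model = Sequential()
# scale=False, center=False disables the learned gamma/beta, so the layer only
# normalizes, mimicking the OP's plain zero-mean / unit-variance preprocessing.
model.add(BatchNormalization(input_shape=(21, 100, 3), scale=False, center=False))
model.add(Conv2D(20, (5, 5), padding='valid', activation='relu'))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))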
There's now a Keras layer for this purpose, Normalization. At time of writing it is in the experimental module, keras.layers.experimental.preprocessing.
https://keras.io/api/layers/preprocessing_layers/core_preprocessing_layers/normalization/
Before you use it, you call the layer's adapt method with the data X you want to derive the scale from (i.e. mean and standard deviation). Once you do this, the scale is fixed (it does not change during training). The scale is then applied to the inputs whenever the model is used (during training and prediction).
import keras
from keras.layers.experimental.preprocessing import Normalization

norm_layer = Normalization()
norm_layer.adapt(X)  # X is the data to derive the mean and standard deviation from

model = keras.Sequential()
model.add(norm_layer)
# ... Continue as usual.
Maybe you can use sklearn.preprocessing.StandardScaler to scale your data.
This object allows you to save the scaling parameters.
Then you can use mixed-type inputs into your model, let's say:
Your_model
[param1_scaler, param2_scaler]
Here is a link https://www.pyimagesearch.com/2019/02/04/keras-multiple-inputs-and-mixed-data/
https://keras.io/getting-started/functional-api-guide/
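A small sketch of that idea, assuming you persist the fitted scaler with joblib (the data and file name here are placeholders):

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.random.rand(100, 4)   # placeholder training data
X_test = np.random.rand(10, 4)     # placeholder test data

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from the training data
joblib.dump(scaler, 'scaler.joblib')            # save the scaling parameters

# Later, at prediction time:
scaler = joblib.load('scaler.joblib')
X_test_scaled = scaler.transform(X_test)        # reuse the same mean/std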
There's BatchNormalization, which learns mean and standard deviation of the input. I haven't tried using it as the first layer of the network, but as I understand it, it should do something very similar to what you're looking for.
I am currently working on a classification task with class labels 0 and 1. For this I am using scikit-learn's MLPClassifier, which provides an output of either 0 or 1 for each training example. However, I cannot find any documentation on what the output layer of the MLPClassifier is actually doing (which activation function? which encoding?).
Since only one class is output, I assume something like one-hot encoding is used. Is this assumption correct? Is there any documentation tackling this question for the MLPClassifier?
The out_activation_ attribute gives you the type of activation used in the output layer of your MLPClassifier.
From Documentation:
out_activation_ : string
Name of the output activation function.
The activation param just sets the hidden layer's activation function.
activation : {‘identity’, ‘logistic’, ‘tanh’, ‘relu’}, default ‘relu’
Activation function for the hidden layer.
The output layer is decided internally in this piece of code.
# Output for regression
if not is_classifier(self):
    self.out_activation_ = 'identity'
# Output for multi class
elif self._label_binarizer.y_type_ == 'multiclass':
    self.out_activation_ = 'softmax'
# Output for binary class and multi-label
else:
    self.out_activation_ = 'logistic'
Hence, for binary classification it would be logistic and for multi-class it would be softmax.
To know more details about these activations, see here.
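As a quick check, you could fit a classifier on a toy binary problem and inspect the attribute (this is just a sketch with made-up data):

from sklearn.neural_network import MLPClassifier
from sklearn.datasets import make_classification

# Binary toy problem: out_activation_ should come out as 'logistic'.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
clf = MLPClassifier(hidden_layer_sizes=(20,), max_iter=500, random_state=0)
clf.fit(X, y)
print(clf.out_activation_)  # 'logistic' for binary, 'softmax' for multi-class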
You have most of the information in the docs. The MLP is a simple neural network. It can use several activation functions; the default is relu.
It doesn't use one-hot encoding; rather, you need to feed in a y (target) vector with class labels.
My understanding is that the last activation function is the logistic function, and the output is set to 1 if the probability is >0.5 and to 0 otherwise.
However, you can output the probability if you want.
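For example, assuming clf is a fitted MLPClassifier as in the sketch above:

probs = clf.predict_proba(X[:5])  # class probabilities from the logistic output
preds = clf.predict(X[:5])        # hard 0/1 labels (probability thresholded at 0.5)
print(probs)
print(preds)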
It is recommended that we use ReLU in the final layer of the neural network when we are learning regressions.
It makes sense to me, since the output from ReLU is not confined between 0 and 1.
However, how does it behave when x < 0 (i.e. when the ReLU output is zero)? Can y (the result of the regression) still be less than 0?
I believe I am missing a basic mathematical concept here. Any help is appreciated.
You typically use:
A linear layer for regression in order to get a continuous value
Softmax for classification where you want a probability distribution of classes
But these aren't set in stone. If you know your output value for a regression should only be positive, why not use a ReLU? If the output of your classification isn't a probability distribution (e.g. which classes exist), you could just as easily use a sigmoid.
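A small Keras sketch contrasting the two regression heads (shapes and layer sizes are arbitrary):

from keras.models import Sequential
from keras.layers import Dense

# Regression head with a linear output: predictions can be any real value,
# including negatives.
reg_model = Sequential([
    Dense(32, activation='relu', input_shape=(8,)),
    Dense(1, activation='linear'),
])
reg_model.compile(loss='mse', optimizer='adam')

# If the target is known to be non-negative, a ReLU output clamps predictions
# at zero; y can then never be less than 0.
pos_model = Sequential([
    Dense(32, activation='relu', input_shape=(8,)),
    Dense(1, activation='relu'),
])
pos_model.compile(loss='mse', optimizer='adam')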
I'm working with neural networks and I've implemented the following architecture using keras with tensorflow backend:
For training, I'll feed some labels into the layer labels_vector; this vector can have int32 values (i.e. 0 could be a label). For the testing phase, I need to just ignore this input layer; if I set it to 0, the results could be wrong, since I've trained with labels that can be equal to the 0 vector. Is there a way to simply ignore or disable this layer in the prediction phase?
Thanks in advance.
How to ignore some input layer?
You can't. Keras cannot just ignore an input layer as the output depends on it.
One solution to get nearly what you want is to define a custom label in your training data to be the null value. Your network will learn to ignore it if it feels that it is not an important feature.
If labels_vector is a vector of categorical labels, use one-hot encoding instead of integer encoding. Integer encoding assumes that there is a natural ordered relationship between the labels, which is wrong here.
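For illustration, a one-hot encoding of integer labels could be produced with keras.utils.to_categorical (the labels here are made up):

import numpy as np
from keras.utils import to_categorical

labels = np.array([0, 2, 1, 0])              # integer labels; 0 is a valid class
one_hot = to_categorical(labels, num_classes=3)
# one_hot is now:
# [[1., 0., 0.],
#  [0., 0., 1.],
#  [0., 1., 0.],
#  [1., 0., 0.]]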