I got a little bit lost while studying loss functions for multi-task learning.
For instance, in single-task binary classification, say classifying emails as spam or not, the probabilities of the two labels (spam/not spam) sum to 1 when using a softmax activation with a softmax cross-entropy loss. How does that carry over to multi-task learning?
Let's consider the case of 5 tasks, each of them a binary problem. Is the softmax applied to each task independently (e.g. for task 1: probability of label 1 = 0.7 and label 2 = 0.3; for task 2: probability of label 1 = 0.2 and label 2 = 0.8, and so on), or does it treat the tasks jointly (e.g. if label 1 of task 1 has a probability of 0.80, all the labels of all the other tasks sum to 0.20)?
Some notes:
Nitpicking: you should not use softmax for binary classification, but rather a plain sigmoid (which is the two-class reduction of the softmax), followed by binary cross-entropy / log loss (the corresponding reduction of softmax cross-entropy).
For multi-task learning that involves classification, you would typically use multiple binary classifications. Say you have an image and you want an output saying whether there are pedestrians, cars and road signs in it. This is not multi-class classification, as an image can contain all of the above. So instead you define your output as 3 nodes, and you compute a binary classification for each node. This is done in one multi-task NN instead of running 3 different NNs, under the assumption that all 3 classification problems can benefit from the same latent layer or embedding created in that one network.
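A minimal sketch of that setup, assuming the TF2 Keras functional API (the layer sizes and head names are made up for illustration):

from tensorflow.keras import layers, Model

inputs = layers.Input(shape=(64,))                     # made-up feature size
shared = layers.Dense(32, activation="relu")(inputs)   # the shared latent layer

# One sigmoid node per task: three independent binary classifiers.
pedestrian = layers.Dense(1, activation="sigmoid", name="pedestrian")(shared)
car = layers.Dense(1, activation="sigmoid", name="car")(shared)
sign = layers.Dense(1, activation="sigmoid", name="sign")(shared)

model = Model(inputs, [pedestrian, car, sign])
model.compile(optimizer="adam", loss="binary_crossentropy")  # applied per head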
Primarily, the loss function can differ between tasks in multi-task classification (note that this is not MULTI-LABEL classification).
For example, task 1 can be binary classification, task 2 can be next sentence prediction, and so on. Since different tasks involve different loss functions, the first part of your assumption is the right one: softmax is applied only to the labels of the first task while learning the first task.
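A hedged Keras sketch of that idea (the head names, sizes and loss choices are illustrative, not prescriptive); note the softmax in task2 normalizes only over that head's own labels, not across tasks:

from tensorflow.keras import layers, Model

inp = layers.Input(shape=(16,))              # made-up input size
h = layers.Dense(8, activation="relu")(inp)  # shared representation
t1 = layers.Dense(1, activation="sigmoid", name="task1")(h)  # binary task
t2 = layers.Dense(3, activation="softmax", name="task2")(h)  # 3-way task

model = Model(inp, [t1, t2])
# Each head gets its own loss, keyed by the output layer's name.
model.compile(
    optimizer="adam",
    loss={"task1": "binary_crossentropy", "task2": "categorical_crossentropy"},
)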
Related
I'm performing an image classification task. Images are labeled 0, 1 or 2. Should the size of the last linear layer in the model output be 3 or 1? In general, for a 3-class problem the output size is set to 3, and the class with the maximum probability among the three is returned. But I saw the last layer set to 1 in some code, which actually seems logical to me. What do you think? (Also, I don't use a softmax or sigmoid function in the last layer.)
To perform classification into c classes (c = 3 in your example) you need to predict the probability of each class, so you need a c-dimensional output.
Usually you do not explicitly apply softmax to the "raw predictions" (aka "logits") - the loss function usually does that for you in a more numerically-robust way (see, e.g., nn.CrossEntropyLoss).
After you trained the model, at inference time you can take argmax over the predicted c logits and output a single scalar - the index of the predicted class. This can only be done during inference since argmax is not a differentiable operation.
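A minimal PyTorch sketch of all three points (the 128-dim feature size and the dummy batch are made up):

import torch
import torch.nn as nn

model = nn.Linear(128, 3)            # last layer: c = 3 logits
criterion = nn.CrossEntropyLoss()    # applies log-softmax + NLL internally

x = torch.randn(4, 128)              # dummy batch of 4 feature vectors
y = torch.tensor([0, 2, 1, 0])       # integer class labels
loss = criterion(model(x), y)        # training: feed raw logits, not softmax
loss.backward()

with torch.no_grad():                # inference only: argmax is not differentiable
    pred = model(x).argmax(dim=1)    # one scalar per sample, in {0, 1, 2}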
Can anyone explain how to interpret the coefficientMatrix, interceptVector and confusion matrix of a multinomial logistic regression?
According to Spark documentation:
Multiclass classification is supported via multinomial logistic (softmax) regression. In multinomial logistic regression, the algorithm produces K sets of coefficients, or a matrix of dimension K×J where K is the number of outcome classes and J is the number of features. If the algorithm is fit with an intercept term then a length K vector of intercepts is available.
I ran an example using Spark ML 2.3.0 and got the following results.
If I analyse what I get:
The coefficientMatrix has dimensions 5 × 11
The interceptVector has length 5
If so, why does the confusion matrix have dimensions 4 × 4?
Also, can anyone give an interpretation of the coefficientMatrix and interceptVector?
Why do I get negative coefficients?
If 5 is the number of classes after classification, why do I get only 4 rows in the confusion matrix?
EDIT
I forgot to mention that I am still a beginner in machine learning and that searching Google didn't help, so maybe I'll get an upvote :)
Regarding the 4x4 confusion matrix: I imagine that when you split your data into test and train, there were 5 classes present in your training set and only 4 classes present in your test set. This can easily happen if the distribution of your response variable is imbalanced.
You'll want to try to perform some stratified split between test and train prior to modeling. If you are working with pyspark, you may find this library helpful: https://github.com/databricks/spark-sklearn
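If you are on plain PySpark, a rough stratified split can also be done with DataFrame.sampleBy; the toy DataFrame and the 80/20 fractions below are just for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Toy stand-in for your real labeled DataFrame.
df = spark.createDataFrame([(i % 5, float(i)) for i in range(100)], ["label", "feature"])

fractions = {c: 0.8 for c in range(5)}            # keep ~80% of each of the 5 classes
train = df.sampleBy("label", fractions, seed=42)  # approximate stratified sample
test = df.subtract(train)                         # the remaining rows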
Now regarding negative coefficients for a multi-class Logistic Regression: As you mentioned, your returned coefficientMatrix shape is 5x11.
Spark generated five models via a one-vs-all approach: the first model treats the 1st label as the positive class and all other labels together as the negative class. Let's say the 1st coefficient for this model is -2.23. To interpret this coefficient we take its exponential, exp(-2.23) ≈ 0.11. Interpretation: with a one-unit increase in the 1st feature, the odds of the positive label are multiplied by about 0.11, i.e. reduced by roughly 89%.
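The arithmetic behind that reading, as a one-liner (the -2.23 is the hypothetical coefficient from above):

import math

odds_ratio = math.exp(-2.23)       # ≈ 0.1075
print(odds_ratio, 1 - odds_ratio)  # odds multiplied by ~0.11, i.e. reduced by ~89%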
I'm playing with Keras + TF. I have a model composed of 4 LSTM layers + 2 dense layers.
I have 3 features, which are 3 sine sequences, and a target which is the product of the 3 sine sequences.
The LSTM layers are configured with a look-back of 30 time-steps.
I train the RNN on 80% of the data, and when I then ask it to predict on that same learned data (80% of the total), I obtain a very good prediction.
Next I take the last 20% of the data, split it into 10 sub-parts, and loop:
predict(part_x[0]), fit(part_x[0], part_y[0]), predict(part_x[1]), fit(part_x[1], part_y[1])... But the quality of the predictions drops dramatically.
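In code, that loop looks roughly like this (model, part_x and part_y follow the asker's naming and are assumed to exist; the fit settings are placeholders):

predictions = []
for i in range(10):                                        # the 10 sub-parts
    predictions.append(model.predict(part_x[i]))           # forecast block i first
    model.fit(part_x[i], part_y[i], epochs=1, verbose=0)   # then fine-tune on it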
Is it correct to expect that a predict(x[i]) / fit(x[i], y[i]) loop should produce a decent outcome for every x[i+1] block?
Another question: is it possible to train an RNN with 4 features and predict with only 3? If yes, how can I "blind" the unavailable feature at prediction time?
TIA
Roberto C.
A project I am working on has a reinforcement learning stage using the REINFORCE algorithm. The model used has a final softmax activation layer, and because of that a negative learning rate is used as a replacement for negative rewards. I have some doubts about this process and can't find much literature on using a negative learning rate.
Does reinforcement learning work when switching the learning rate between positive and negative? If not, what would be a better approach: get rid of the softmax, or does Keras have a nicer option for this?
Loss function:
from keras import backend as K

def log_loss(y_true, y_pred):
    '''
    Keras 'loss' function for the REINFORCE algorithm,
    where y_true is the action that was taken, and updates
    with the negative gradient will make that action more likely.
    We use the negative gradient because Keras expects training
    to minimize a loss function.
    '''
    # Clip to keep y_pred strictly inside (0, 1) so the log never blows up.
    return -y_true * K.log(K.clip(y_pred, K.epsilon(), 1.0 - K.epsilon()))
Switching learning rate:
K.set_value(optimizer.lr, lr * (+1 if won else -1))
learner_net.train_on_batch(np.concatenate(st_tensor, axis=0),
                           np.concatenate(mv_tensor, axis=0))
Update, test results
I ran a test with only positive reinforcement samples, omitting all negative examples and thus the negative learning rate. The winning rate is rising, the model is improving, and I can safely assume that using a negative learning rate is not correct.
Does anybody have thoughts on how we should implement this?
Update, model explanation
We are trying to recreate AlphaGo as described by DeepMind, the slow policy net:
For the first stage of the training pipeline, we build on prior work on predicting expert moves in the game of Go using supervised learning [13, 21-24]. The SL policy network p_σ(a|s) alternates between convolutional layers with weights σ and rectifier nonlinearities. A final softmax layer outputs a probability distribution over all legal moves a.
Not sure if it is the best way, but at least I found a way that works.
For all negative training samples I reuse the network prediction, set the action I want to unlearn to zero, and adjust all other values to sum to one again.
I tried several ways to adjust them afterwards, but haven't run enough tests to be sure what works best:
apply softmax (but then the action that has to be unlearned gets a nonzero value again)
redistribute the old action's value uniformly over all other actions
set all illegal action values to zero and distribute the total removed value
distribute the removed value proportionally to the values of the other actions
There are probably several other ways to do this, and what works best may depend on the use case, but this one works at least; a sketch of the last option follows.
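A minimal numpy sketch of the proportional redistribution; unlearn_target and the toy numbers are mine, not from the project:

import numpy as np

def unlearn_target(pred, action):
    # Zero out the action to unlearn, then rescale the rest so the
    # target distribution sums to 1 again (proportional redistribution).
    target = pred.copy()
    target[action] = 0.0
    total = target.sum()
    return target / total if total > 0 else target

pred = np.array([0.5, 0.3, 0.2])   # toy network output
print(unlearn_target(pred, 0))     # -> [0.  0.6 0.4]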
I have training data that falls into two classes, let's say Yes and No. The data represents three tasks, easy, medium and difficult. A person performs these tasks and is classified into one of the two classes as a result. Each task is classified independently and then the results are combined. I am using 3 independently trained SVM classifiers and then voting on the final result.
I am looking to provide a measure of confidence or probability associated with each classification. LIBSVM can provide a probability estimate along with the classification for each task (easy, medium and difficult, say Pe, Pm and Pd) but I am unsure of how best to combine these into an overall estimate for the final classification of the person (let's call it Pp).
My attempts so far have been along the lines of a simple average:
Pp = (Pe + Pm + Pd) / 3
An inverse-variance weighted average (since each task is repeated a few times, the sample variances VARe, VARm and VARd can be calculated; in this case Pe would be a simple average over all the easy samples):
Pp = (Pe/VARe + Pm/VARm + Pd/VARd) / (1/VARe + 1/VARm + 1/VARd)
Or a multiplication (under the assumption that these events are independent, which I am unsure of since the underlying tasks are related):
Pp = Pe * Pm * Pd
The multiplication yields a very low number, so it's unclear how to interpret that as an overall probability when the result of the voting is very clear.
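For a feel of how the three rules differ, a tiny comparison with invented numbers:

pe, pm, pd = 0.9, 0.8, 0.7          # hypothetical per-task probabilities
ve, vm, vd = 0.01, 0.04, 0.09       # hypothetical sample variances

mean = (pe + pm + pd) / 3                               # 0.800
ivw = (pe/ve + pm/vm + pd/vd) / (1/ve + 1/vm + 1/vd)    # ≈ 0.865
prod = pe * pm * pd                                     # 0.504 (shrinks with every task)
print(mean, ivw, prod)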
Would any of these three options be the best or is there some other method / detail I'm overlooking?
Based on your comment, I will make the following suggestion. If you need to do this with SVMs (and, as you say, you get better performance when you do), take the outputs of your intermediate classifiers and feed them as features to a final classifier. Even better, switch to a multi-layer neural net where the inputs represent the inputs to the intermediates, the (first) hidden layer represents the outputs of the intermediate problem, and the subsequent layer(s) represent the final decision you want. This way you keep the benefit of an intermediate layer, but its output is optimised to help the final prediction rather than for accuracy in its own right (which I assume you don't really care about).
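A minimal sketch of the first suggestion (stacking), with made-up numbers and scikit-learn standing in for whatever final classifier you pick:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Rows: one person each; columns: the per-task probabilities [Pe, Pm, Pd].
X_meta = np.array([[0.9, 0.8, 0.7],
                   [0.2, 0.4, 0.1],
                   [0.6, 0.7, 0.3],
                   [0.1, 0.2, 0.2]])
y = np.array([1, 0, 1, 0])                    # final Yes/No labels

final = LogisticRegression().fit(X_meta, y)   # learns how much to trust each task
print(final.predict_proba(X_meta)[:, 1])      # combined Pp per person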
The correct generative model for these tests likely looks something like the following:
Generate an intelligence/competence score i
For each test t: generate pass/fail according to p_t(pass | i)
This is simplified, but it should illustrate that you have a latent variable i on which these tests depend (and there's also structure between them, since presumably p_easy(pass|i) > p_medium(pass|i) > p_hard(pass|i); you could potentially model this as a logistic regression with a continuous 'hardness' feature). I suspect what you're asking for is a way to do inference on some thresholding function of i, but you want to do it in a classification way rather than as a probabilistic model. That's fine, but without explicitly encoding the latent variable and the structure between the tests it's going to be hard (and no averaging of the probabilities will account for the missing structure).
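A toy version of that generative story, with a logistic link and made-up hardness values (i is the latent competence):

import numpy as np

def p_pass(i, hardness):
    # Logistic link: higher competence, lower hardness -> higher pass probability.
    return 1.0 / (1.0 + np.exp(-(i - hardness)))

hardness = np.array([-1.0, 0.0, 1.0])  # easy < medium < hard (invented values)
print(p_pass(0.5, hardness))           # decreasing: p_easy > p_medium > p_hard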
I hope that helps. If I've made assumptions that aren't justified, please feel free to correct me.