How does a trained SVR model predict values? - svm

I've been trying to understand how does a model trained with support vector machines for regression predict values. I have trained a model with the sklearn.svm.SVR, and now I'm wondering how to "manually" predict the outcome of an input.
Some background - the model is trained with kernel SVR, with RBF function and uses the dual formulation. So now I have arrays of the dual coefficients, the indexes of the support vectors, and the support vectors themselves.
I found the function which is used to fit the hyperplane but I've been unsuccessful in applying that to "manually" predict outcomes without the function .predict.
The few things I tried all include the dot products of the input (features) array, and all the support vectors.

If anyone ever needs this, I've managed to understand the equation and code it in python.
The following is the used equation for the dual formulation:
where N is the number of observations, and αi multiplied by yi are the dual coefficients found from the model's attributed model.dual_coef_. The xiT are some of the observations used for training (support vectors) accessed by the attribute model.support_vectors_ (transposed to allow multiplication of the two matrices), x is the input vector containing a value for each feature (its the one observation for which we want to get prediction), and b is the intercept accessed by model.intercept_.
The xiT and x, however, are the observations transformed in a higher-dimensional space, as explained by mery in this post.
The calculation of the transformation by RBF can be either applied manually step by stem or by using the sklearn.metrics.pairwise.rbf_kernel.
With the latter, the code would look like this (my case shows I have 589 support vectors, and 40 features).
First we access the coefficients and vectors:
support_vectors = model.support_vectors_
dual_coefs = model.dual_coef_[0]
Then:
pred = (np.matmul(dual_coefs.reshape(1,589),
rbf_kernel(support_vectors.reshape(589,40),
Y=input_array.reshape(1,40),
gamma=model.get_params()['gamma']
)
)
+ model.intercept_
)
If the RBF funcion needs to be applied manually, step by step, then:
vrbf = support_vectors.reshape(589,40) - input_array.reshape(1,40)
pred = (np.matmul(dual_coefs.reshape(1,589),
np.diag(np.exp(-model.get_params()['gamma'] *
np.matmul(vrbf, vrbf.T)
)
).reshape(589,1)
)
+ model.intercept_
)
I placed the .reshape() function even where it is not necessary, just to emphasize the shapes for the matrix operations.
These both give the same results as model.predict(input_array)

Related

Multiclass semantic segmentation model evaluation

I am doing a project on multiclass semantic segmentation. I have formulated a model that outputs pretty descent segmented images by decreasing the loss value. However, I cannot evaluate the model performance in metrics, such as meanIoU or Dice coefficient.
In case of binary semantic segmentation it was easy just to set the threshold of 0.5, to classify the outputs as an object or background, but it does not work in the case of multiclass semantic segmentation. Could you please tell me how to obtain model performance on the aforementioned metrics? Any help will be highly appreciated!
By the way, I am using PyTorch framework and CamVid dataset.
If anyone is interested in this answer, please also look at this issue. The author of the issue points out that mIoU can be computed in a different way (and that method is more accepted in literature). So, consider that before using the implementation for any formal publication.
Basically, the other method suggested by the issue-poster is to separately accumulate the intersections and unions over the entire dataset and divide them at the final step. The method in the below original answer computes intersection and union for a batch of images, then divides them to get IoU for the current batch, and then takes a mean of the IoUs over the entire dataset.
However, this below given original method is problematic because the final mean IoU would vary with the batch-size. On the other hand, the mIoU would not vary with the batch size for the method mentioned in the issue as the separate accumulation would ensure that batch size is irrelevant (though higher batch size can definitely help speed up the evaluation).
Original answer:
Given below is an implementation of mean IoU (Intersection over Union) in PyTorch.
def mIOU(label, pred, num_classes=19):
pred = F.softmax(pred, dim=1)
pred = torch.argmax(pred, dim=1).squeeze(1)
iou_list = list()
present_iou_list = list()
pred = pred.view(-1)
label = label.view(-1)
# Note: Following for loop goes from 0 to (num_classes-1)
# and ignore_index is num_classes, thus ignore_index is
# not considered in computation of IoU.
for sem_class in range(num_classes):
pred_inds = (pred == sem_class)
target_inds = (label == sem_class)
if target_inds.long().sum().item() == 0:
iou_now = float('nan')
else:
intersection_now = (pred_inds[target_inds]).long().sum().item()
union_now = pred_inds.long().sum().item() + target_inds.long().sum().item() - intersection_now
iou_now = float(intersection_now) / float(union_now)
present_iou_list.append(iou_now)
iou_list.append(iou_now)
return np.mean(present_iou_list)
Prediction of your model will be in one-hot form, so first take softmax (if your model doesn't already) followed by argmax to get the index with the highest probability at each pixel. Then, we calculate IoU for each class (and take the mean over it at the end).
We can reshape both the prediction and the label as 1-D vectors (I read that it makes the computation faster). For each class, we first identify the indices of that class using pred_inds = (pred == sem_class) and target_inds = (label == sem_class). The resulting pred_inds and target_inds will have 1 at pixels labelled as that particular class while 0 for any other class.
Then, there is a possibility that the target does not contain that particular class at all. This will make that class's IoU calculation invalid as it is not present in the target. So, you assign such classes a NaN IoU (so you can identify them later) and not involve them in the calculation of the mean.
If the particular class is present in the target, then pred_inds[target_inds] will give a vector of 1s and 0s where indices with 1 are those where prediction and target are equal and zero otherwise. Taking the sum of all elements of this will give us the intersection.
If we add all the elements of pred_inds and target_inds, we'll get the union + intersection of pixels of that particular class. So, we subtract the already calculated intersection to get the union. Then, we can divide the intersection and union to get the IoU of that particular class and add it to a list of valid IoUs.
At the end, you take the mean of the entire list to get the mIoU. If you want the Dice Coefficient, you can calculate it in a similar fashion.

How to obtain multinomial probabilities in WinBUGS with multiple regression

In WinBUGS, I am specifying a model with a multinomial likelihood function, and I need to make sure that the multinomial probabilities are all between 0 and 1 and sum to 1.
Here is the part of the code specifying the likelihood:
e[k,i,1:9] ~ dmulti(P[k,i,1:9],n[i,k])
Here, the array P[] specifies the probabilities for the multinomial distribution.
These probabilities are to be estimated from my data (the matrix e[]) using multiple linear regressions on a series of fixed and random effects. For instance, here is the multiple linear regression used to predict one of the elements of P[]:
P[k,1,2] <- intercept[1,2] + Slope1[1,2]*Covariate1[k] +
Slope2[1,2]*Covariate2[k] + Slope3[1,2]*Covariate3[k]
+ Slope4[1,2]*Covariate4[k] + RandomEffect1[group[k]] +
RandomEffect2[k]
At compiling, the model produces an error:
elements of proportion vector of multinomial e[1,1,1] must be between zero and one
If I understand this correctly, this means that the elements of the vector P[k,i,1:9] (the probability vector in the multinomial likelihood function above) may be very large (or small) numbers. In reality, they all need to be between 0 and 1, and sum to 1.
I am new to WinBUGS, but from reading around it seems that somehow using a beta regression rather than multiple linear regressions might be the way forward. However, although this would allow each element to be between 0 and 1, it doesn't seem to get to the heart of the problem, which is that all the elements of P[k,i,1:9] must be positive and sum to 1.
It may be that the response variable can very simply be transformed to be a proportion. I have tried this by trying to divide each element by the sum of P[k,i,1:9], but so far no success.
Any tips would be very gratefully appreciated!
(I have supplied the problematic sections of the model; the whole thing is fairly long.)
The usual way to do this would be to use the multinomial equivalent of a logit link to constrain the transformed probabilities to the interval (0,1). For example (for a single predictor but it is the same principle for as many predictors as you need):
Response[i, 1:Categories] ~ dmulti(prob[i, 1:Categories], Trials[i])
phi[i,1] <- 1
prob[i,1] <- 1 / sum(phi[i, 1:Categories])
for(c in 2:Categories){
log(phi[i,c]) <- intercept[c] + slope1[c] * Covariate1[i]
prob[i,c] <- phi[i,c] / sum(phi[i, 1:Categories])
}
For identifibility the value of phi[1] is set to 1, but the other values of intercept and slope1 are estimated independently. When the number of Categories is equal to 2, this collapses to the usual logistic regression but coded for a multinomial response:
log(phi[i,2]) <- intercept[2] + slope1[2] * Covariate1[i]
prob[i,2] <- phi[i, 2] / (1 + phi[i, 2])
prob[i,1] <- 1 / (1 + phi[i, 2])
ie:
logit(prob[i,2]) <- intercept[2] + slope1[2] * Covariate1[i]
prob[i,1] <- 1 - prob[i,2]
In this model I have indexed slope1 by the category, meaning that each level of the outcome has an independent relationship with the predictor. If you have an ordinal response and want to assume that the odds ratio associated with the covariate is consistent between successive levels of the response, then you can drop the index on slope1 (and reformulate the code slightly so that phi is cumulative) to get a proportional odds logistic regression (POLR).
Edit
Here is a link to some example code covering logistic regression, multinomial regression and POLR from a course I teach:
http://runjags.sourceforge.net/examples/squirrels.R
Note that it uses JAGS (rather than WinBUGS) but as far as I know there are no differences in model syntax for these types of models. If you want to quickly get started with runjags & JAGS from a WinBUGS background then you could follow this vignette:
http://runjags.sourceforge.net/quickjags.html

How to compare predictive power of PCA and NMF

I would like to compare the output of an algorithm with different preprocessed data: NMF and PCA.
In order to get somehow a comparable result, instead of choosing just the same number of components for each PCA and NMF, I would like to pick the amount that explains e.g 95% of retained variance.
I was wondering if its possible to identify the variance retained in each component of NMF.
For instance using PCA this would be given by:
retainedVariance(i) = eigenvalue(i) / sum(eigenvalue)
Any ideas?
TL;DR
You should loop over different n_components and estimate explained_variance_score of the decoded X at each iteration. This will show you how many components do you need to explain 95% of variance.
Now I will explain why.
Relationship between PCA and NMF
NMF and PCA, as many other unsupervised learning algorithms, are aimed to do two things:
encode input X into a compressed representation H;
decode H back to X', which should be as close to X as possible.
They do it in a somehow similar way:
Decoding is similar in PCA and NMF: they output X' = dot(H, W), where W is a learned matrix parameter.
Encoding is different. In PCA, it is also linear: H = dot(X, V), where V is also a learned parameter. In NMF, H = argmin(loss(X, H, W)) (with respect to H only), where loss is mean squared error between X and dot(H, W), plus some additional penalties. Minimization is performed by coordinate descent, and result may be nonlinear in X.
Training is also different. PCA learns sequentially: the first component minimizes MSE without constraints, each next kth component minimizes residual MSE subject to being orthogonal with the previous components. NMF minimizes the same loss(X, H, W) as when encoding, but now with respect to both H and W.
How to measure performance of dimensionality reduction
If you want to measure performance of an encoding/decoding algorithm, you can follow the usual steps:
Train your encoder+decoder on X_train
To measure in-sample performance, compare X_train'=decode(encode(X_train)) with X_train using your preferred metric (e.g. MAE, RMSE, or explained variance)
To measure out-of-sample performance (generalizing ability) of your algorithm, do step 2 with the unseen X_test.
Let's try it with PCA and NMF!
from sklearn import decomposition, datasets, model_selection, preprocessing, metrics
# use the well-known Iris dataset
X, _ = datasets.load_iris(return_X_y=True)
# split the dataset, to measure overfitting
X_train, X_test = model_selection.train_test_split(X, test_size=0.5, random_state=1)
# I scale the data in order to give equal importance to all its dimensions
# NMF does not allow negative input, so I don't center the data
scaler = preprocessing.StandardScaler(with_mean=False).fit(X_train)
X_train_sc = scaler.transform(X_train)
X_test_sc = scaler.transform(X_test)
# train the both decomposers
pca = decomposition.PCA(n_components=2).fit(X_train_sc)
nmf = decomposition.NMF(n_components=2).fit(X_train_sc)
print(sum(pca.explained_variance_ratio_))
It will print you explained variance ratio of 0.9536930834362043 - the default metric of PCA, estimated using its eigenvalues. We can measure it in a more direct way - by applying a metric to actual and "predicted" values:
def get_score(model, data, scorer=metrics.explained_variance_score):
""" Estimate performance of the model on the data """
prediction = model.inverse_transform(model.transform(data))
return scorer(data, prediction)
print('train set performance')
print(get_score(pca, X_train_sc))
print(get_score(nmf, X_train_sc))
print('test set performance')
print(get_score(pca, X_test_sc))
print(get_score(nmf, X_test_sc))
which gives
train set performance
0.9536930834362043 # same as before!
0.937291711378812
test set performance
0.9597828443047842
0.9590555069007827
You can see that on the training set PCA performs better than NMF, but on the test set their performance is almost identical. This happens, because NMF applies lots of regularization:
H and W (the learned parameter) must be non-negative
H should be as small as possible (L1 and L2 penalties)
W should be as small as possible (L1 and L2 penalties)
These regularizations make NMF fit worse than possible to the training data, but they might improve its generalizing ability, which happened in our case.
How to choose the number of components
In PCA, it is simple, because its components h_1, h_2, ... h_k are learned sequentially. If you add the new component h_(k+1), the first k will not change. Thus, you can estimate performance of each component, and these estimates will not depent on the number of components. This makes it possible for PCA to output the explained_variance_ratio_ array after only a single fit to data.
NMF is more complex, because all its components are trained at the same time, and each one depends on all the rest. Thus, if you add the k+1th component, the first k components will change, and you cannot match each particular component with its explained variance (or any other metric).
But what you can to is to fit a new instance of NMF for each number of components, and compare the total explained variance:
ks = [1,2,3,4]
perfs_train = []
perfs_test = []
for k in ks:
nmf = decomposition.NMF(n_components=k).fit(X_train_sc)
perfs_train.append(get_score(nmf, X_train_sc))
perfs_test.append(get_score(nmf, X_test_sc))
print(perfs_train)
print(perfs_test)
which would give
[0.3236945680665101, 0.937291711378812, 0.995459457205891, 0.9974027602663655]
[0.26186701106012833, 0.9590555069007827, 0.9941424954209546, 0.9968456603914185]
Thus, three components (judging by the train set performance) or two components (by the test set) are required to explain at least 95% of variance. Please notice that this case is unusual and caused by a small size of training and test data: usually performance degrades a little bit on the test set, but in my case it actually improved a little.

How to correctly implement a batch-input LSTM network in PyTorch?

This release of PyTorch seems provide the PackedSequence for variable lengths of input for recurrent neural network. However, I found it's a bit hard to use it correctly.
Using pad_packed_sequence to recover an output of a RNN layer which were fed by pack_padded_sequence, we got a T x B x N tensor outputs where T is the max time steps, B is the batch size and N is the hidden size. I found that for short sequences in the batch, the subsequent output will be all zeros.
Here are my questions.
For a single output task where the one would need the last output of all the sequences, simple outputs[-1] will give a wrong result since this tensor contains lots of zeros for short sequences. One will need to construct indices by sequence lengths to fetch the individual last output for all the sequences. Is there more simple way to do that?
For a multiple output task (e.g. seq2seq), usually one will add a linear layer N x O and reshape the batch outputs T x B x O into TB x O and compute the cross entropy loss with the true targets TB (usually integers in language model). In this situation, do these zeros in batch output matters?
Question 1 - Last Timestep
This is the code that i use to get the output of the last timestep. I don't know if there is a simpler solution. If it is, i'd like to know it. I followed this discussion and grabbed the relative code snippet for my last_timestep method. This is my forward.
class BaselineRNN(nn.Module):
def __init__(self, **kwargs):
...
def last_timestep(self, unpacked, lengths):
# Index of the last output for each sequence.
idx = (lengths - 1).view(-1, 1).expand(unpacked.size(0),
unpacked.size(2)).unsqueeze(1)
return unpacked.gather(1, idx).squeeze()
def forward(self, x, lengths):
embs = self.embedding(x)
# pack the batch
packed = pack_padded_sequence(embs, list(lengths.data),
batch_first=True)
out_packed, (h, c) = self.rnn(packed)
out_unpacked, _ = pad_packed_sequence(out_packed, batch_first=True)
# get the outputs from the last *non-masked* timestep for each sentence
last_outputs = self.last_timestep(out_unpacked, lengths)
# project to the classes using a linear layer
logits = self.linear(last_outputs)
return logits
Question 2 - Masked Cross Entropy Loss
Yes, by default the zero padded timesteps (targets) matter. However, it is very easy to mask them. You have two options, depending on the version of PyTorch that you use.
PyTorch 0.2.0: Now pytorch supports masking directly in the CrossEntropyLoss, with the ignore_index argument. For example, in language modeling or seq2seq, where i add zero padding, i mask the zero padded words (target) simply like this:
loss_function = nn.CrossEntropyLoss(ignore_index=0)
PyTorch 0.1.12 and older: In the older versions of PyTorch, masking was not supported, so you had to implement your own workaround. I solution that i used, was masked_cross_entropy.py, by jihunchoi. You may be also interested in this discussion.
A few days ago, I found this method which uses indexing to accomplish the same task with a one-liner.
I have my dataset batch first ([batch size, sequence length, features]), so for me:
unpacked_out = unpacked_out[np.arange(unpacked_out.shape[0]), lengths - 1, :]
where unpacked_out is the output of torch.nn.utils.rnn.pad_packed_sequence.
I have compared it with the method described here, which looks similar to the last_timestep() method Christos Baziotis is using above (also recommended here), and the results are the same in my case.

Expectation Maximization algorithm(Gaussian Mixture Model) : ValueError: the input matrix must be positive semidefinite

I am trying to implement Expectation Maximization algorithm(Gaussian Mixture Model) on a data set data=[[x,y],...]. I am using mv_norm.pdf(data, mean,cov) function to calculate cluster responsibilities. But after calculating new values of covariance (cov matrix) after 6-7 iterations, cov matrix is becoming singular i.e determinant of cov is 0 (very small value) and hence it is giving errors
ValueError: the input matrix must be positive semidefinite
and
raise np.linalg.LinAlgError('singular matrix')
Can someone suggest any solution for this?
#E-step: Compute cluster responsibilities, given cluster parameters
def calculate_cluster_responsibility(data,centroids,cov_m):
pdfmain=[[] for i in range(0,len(data))]
for i in range(0,len(data)):
sum1=0
pdfeach=[[] for m in range(0,len(centroids))]
pdfeach[0]=1/3.*mv_norm.pdf(data[i], mean=centroids[0],cov=[[cov_m[0][0][0],cov_m[0][0][1]],[cov_m[0][1][0],cov_m[0][1][1]]])
pdfeach[1]=1/3.*mv_norm.pdf(data[i], mean=centroids[1],cov=[[cov_m[1][0][0],cov_m[1][0][1]],[cov_m[1][1][0],cov_m[0][1][1]]])
pdfeach[2]=1/3.*mv_norm.pdf(data[i], mean=centroids[2],cov=[[cov_m[2][0][0],cov_m[2][0][1]],[cov_m[2][1][0],cov_m[2][1][1]]])
sum1+=pdfeach[0]+pdfeach[1]+pdfeach[2]
pdfeach[:] = [x / sum1 for x in pdfeach]
pdfmain[i]=pdfeach
global old_pdfmain
if old_pdfmain==pdfmain:
return
old_pdfmain=copy.deepcopy(pdfmain)
softcounts=[sum(i) for i in zip(*pdfmain)]
calculate_cluster_weights(data,centroids,pdfmain,soft counts)
Initially, I've passed [[3,0],[0,3]] for each cluster covariance since expected number of clusters is 3.
Can someone suggest any solution for this?
The problem is your data lies in some manifold of dimension strictly smaller than the input data. In other words for example your data lies on a circle, while you have 3 dimensional data. As a consequence when your method tries to estimate 3 dimensional ellipsoid (covariance matrix) that fits your data - it fails since the optimal one is a 2 dimensional ellipse (third dimension is 0).
How to fix it? You will need some regularization of your covariance estimator. There are many possible solutions, all in M step, not E step, the problem is with computing covariance:
Simple solution, instead of doing something like cov = np.cov(X) add some regularizing term, like cov = np.cov(X) + eps * np.identity(X.shape[1]) with small eps
Use nicer estimator like LedoitWolf estimator from scikit-learn.
Initially, I've passed [[3,0],[0,3]] for each cluster covariance since expected number of clusters is 3.
This makes no sense, covariance matrix values has nothing to do with amount of clusters. You can initialize it with anything more or less resonable.

Resources