SVM tutorials state that if a data point falls in the area surrounding the separating line (in the margin) - it isn't classified. How is this implemented in libraries like SVMlight and libsvm?
For two-class classification, we usually assume that their targets are +1 and -1 respectively. Then we find the maximum-margin hyperplane by a QP solver. Because of soft margin (see the term C in SVM}, some samples exist in the margin.
But that's not problem.
We can determine the class that positive values as +-class and negative values as --class
To sum up,
Even if the samples are trained as +1 and -1, SVM classifies +-class when >= 0 or --class when < 0
Related
I am using Keras to setup neural networks.
As input data, I use vectors in which each coordinate can be either 0 (feature not present or not measured) or a value that can range for instance between 5000 and 10000.
So my input value distribution is a kind of gaussian centered let us say around 7500 plus a very thin peak at 0.
I cannot remove the vectors with 0 in some of their coordinates because almost all of them will have some 0s at some locations.
So my question is : "how to best normalize the input vectors ?". I see two possibilities :
just substract the mean and divide by standard deviation. The problem then is that the mean is biased by the high number of meaningless 0s, and the std is overestimated, which erases the fine changes in the meaningful measurement.
compute the mean and standard deviation on the non-zeros coordinates, which is more meaningful. But then all the 0 values that correspond to non measured data will come out with high (negative) values which gives some importance to meaningless data...
Does someone have an advice on how to proceed ?
Thanks !
Instead, represent your features as 2 dimensions:
First one is normalised value of the feature if it is non zero (where normalisation is computed over non zero elements), otherwise it is 0
Second is 1 iff the feature was 0, otherwise it is 1. This makes sure that 0 from the previous feature that could either come from raw 0, or from normalised 0 can be discriminated
You can think of this as encoding extra feature saying "the other feature is missing". This way scale of each feature is normalised, and all informatino preserved
Say I have a tensor named attn_weights of size [1,a], entries of which indicate the attention weights between the given query and |a| keys. I want to select the largest one using torch.nn.functional.gumbel_softmax.
I find docs about this function describe the parameter as logits - […, num_features] unnormalized log probabilities. I wonder whether should I take log of attn_weights before passing it into gumbel_softmax? And I find Wiki defines logit=lg(p/1-p), which is different from barely logrithm. I wonder which one should I pass to the function?
Further, I wonder how to choose tau in gumbel_softmax, any guidelines?
I wonder whether should I take log of attn_weights before passing it into gumbel_softmax?
If attn_weights are probabilities (sum to 1; e.g., output of a softmax), then yes. Otherwise, no.
I wonder how to choose tau in gumbel_softmax, any guidelines?
Usually, it requires tuning. The references provided in the docs can help you with that.
From Categorical Reparameterizaion with Gumbel-Softmax:
Figure 1, caption:
... (a) For low temperatures (τ = 0.1, τ = 0.5), the expected value of a Gumbel-Softmax random variable approaches the expected value of a categorical random variable with the same logits. As the temperature increases (τ = 1.0, τ = 10.0), the expected value converges to a uniform distribution over the categories.
Section 2.2, 2nd paragraph (emphasis mine):
While Gumbel-Softmax samples are differentiable, they are not identical to samples from the corresponding categorical distribution for non-zero temperature. For learning, there is a tradeoff between small temperatures, where samples are close to one-hot but the variance of the gradients is large, and large temperatures, where samples are smooth but the variance of the gradients is small (Figure 1). In practice, we start at a high temperature and anneal to a small but non-zero temperature.
Lastly, they remind the reader that tau can be learned:
If τ is a learned parameter (rather than annealed via a fixed
schedule), this scheme can be interpreted as entropy regularization (Szegedy et al., 2015; Pereyra et al., 2016), where the Gumbel-Softmax distribution can adaptively adjust the "confidence" of proposed samples during the training process.
I Created an Aspect Based Sentiment Analysis Classifier.
i need to return answer with very high certenty ,
so i want to return unknown
in case i am not sure about the answer
my model return for each label a percent and all 3 add up to 1
for example
i try the threshold in the ROC curve point for each label against all 3 and also the precision point of each label
for now i am just returning the max between all 3 .
attached link to 1172 tagged examples
https://github.com/ntedgi/bert-sentiment/blob/master/decision_maker_weka.csv
thanks
You could set a threshold and after getting the probabilities, check if the percent of probability is below your threshold, mark it as unknown.
please i like to classify a set of image in 4 class with SIFT DESCRIPTOR and SVM. Now, using SIFT extractor I get keypoints of different sizes exemple img1 have 100 keypoints img2 have 55 keypoints.... how build histograms that give fixed size vectors with matlab
In this case, perhaps dense sift is a good choice.
There are two main stages:
Stage 1: Creating a codebook.
Divide the input image into a set of sub-images.
Apply sift on each sub-image. Each key point will have 128 dimensional feature vector.
Encode these vectors to create a codebook by simply applying k-means clustering with a chosen k. Each image will produce a matrix Vi (i <= n and n is the number of images used to create the codeword.) of size 128 * m, where m is the number of key points gathered from the image. The input to K-means is therefore, a big matrix V created by horizontal concatenation of Vi, for all i. The output of K-means is a matrix C with size 128 * k.
Stage 2: Calculating Histograms.
For each image in the dataset, do the following:
Create a histogram vector h of size k and initialize it to zeros.
Apply dense sift as in step 2 in stage 1.
For each key point's vector find the index of it's "best match" vector in the codebook matrix C (can be the minimum in the Euclidian distance) .
Increase the corresponding bin to this index in h by 1.
Normalize h by L1 or L2 norms.
Now h is ready for classification.
Another possibility is to use Fisher's vector instead of a codebook, https://hal.inria.fr/file/index/docid/633013/filename/jegou_aggregate.pdf
You will always get different number of keypoints for different images, but the size of feature vector of each descriptor point remains same i.e. 128. People prefer using Vector Quantization or K-Mean Clustering and build Bag-of-Words model histogram. You can have a look at this thread.
Using the conventional SIFT approach you will never have the same number of key points in every image. One way of achieving that is to sample the descriptors densely, using Dense SIFT, that places a regular grid on top of the image. If all images have the same size, then you will have the same number of key points per image.
I've a small set of data points (around 10) in a 2D space, and each of them have a category label. I wish to classify a new data point based on the existing data point labels and also associate a 'probability' for belonging to any particular label class.
Is it appropriate to label the new point based on the label to its nearest neighbor( like a K-nearest neighbor, K=1)? For getting the probability I wish to permute all the labels and calculate all the minimum distance of the unknown point and the rest and finding the fraction of cases where the minimum distance is lesser or equal to the distance that was used to label it.
Thanks
The Nearest Neighbour method is already using the Bayes theorem to estimate the probability using the points in a ball containing your chosen K points. There is no need to transform, as the number of points in the ball of K points belonging to each label divided by the total number of points in that ball already is an approximation of the posterior probability of that label. In other words:
P(label|z) = P(z|label)P(label) / P(z) = K(label)/K
This is obtained using the Bayes rule of probability on an estimated probability estimated using a subset of the data. In particular, using:
VP(x) = K/N (this gives you the probability of a point in a ball of volume V)
P(x) = K/NV (from above)
P(x=label) = K(label)/N(label)V (where K(label) and N(label) are the number of points in the ball of that given class and the number of points in the total samples of that class)
and
P(label) = N(label)/N.
Therefore, just pick a K, calculate the distances, count the points and by checking their labels and recounting you will have your probability.
Roweis uses a probabilistic framework with KNN in his publication Neighbourhood Component Analysis. The idea is to use a "soft" nearest neighbour classification, where the probability that a point i uses another point j as its neighbour is defined by
,
where d_ij is the euclidean distance between point i and j.
The are no probabilities for such K-nearest classification method because it is discriminative classification as well as SVM. There are should be used postporcess for learning probabilities on unseen data with generative model like logistic regression.
1. learn K nearest classifier
2. Train logistic regression on distance and average distance to K nearest for validation data.
Check for details LibSVM article.
Sort the distances to the 10 centres; they could be
1 5 6 ... — one near, others far
1 1 1 5 6 ... — 3 near, others far
... lots of possibilities.
You could combine the 10 distances to a single number, e.g. 1 - (nearest / average) ** p,
but that's throwing away information.
(Different powers p makes the hills around the centres steeper or flatter.)
If your centres are really Gaussian hills though, take a look at
Multivariate kernel density estimation.
Added:
There are zillions of functions that go smoothly between 0 and 1,
but that doesn't make them probabilities of something.
"Probability" means either that chance, likelihood, is involved,
as in probability of rain;
or that you're trying to impress somebody.
Added again: scholar.google.com "(single|1) nearest neighbor classifier" gets > 300 hits;
"k nearest neighbor classifier" gets almost 3000.
It seems to me (non-expert) that, out of 10 different ways of mapping k-NN distances to labels,
each one might be better than the 9 others — for some data, with some error measure.
Anyway, you could try asking stats.stackexchange.com ,
The answer is : it depends.
Imagine your labels are the surname of a person, and the X,Y coordinates represent some essential characteristics of the person's DNA sequence. Clearly a more close DNA description enhance the probability of having the same surnames.
Now suppose the X,Y is the lat/long of the work office for that person. Working closer isn't related to label (surname) sharing.
So, it depends on the semantic of your tags and axes.
HTH!