Support Vector Machines || find hyperplane and values of w & b and the alpha value - svm

Here three points are given; they are the support vectors. Two of them belong to the negative class and one to the positive class. I need to find the hyperplane, the values of w and b, and the alpha values. Please help me, I don't understand this, and it is an important question for my exam.
SVM Question
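If you want to check a hand-worked solution numerically, here is a minimal sketch using scikit-learn's SVC with a linear kernel and a very large C (which approximates the hard-margin SVM of textbook exercises). The three points below are hypothetical placeholders, since the figure's coordinates are not reproduced here; substitute the actual support vectors from your problem. coef_, intercept_ and dual_coef_ expose w, b and the signed alphas (y_i * alpha_i).

import numpy as np
from sklearn.svm import SVC

# Hypothetical points standing in for the three support vectors in the figure.
X = np.array([[1.0, 1.0],   # negative class
              [2.0, 0.0],   # negative class
              [3.0, 3.0]])  # positive class
y = np.array([-1, -1, 1])

clf = SVC(kernel="linear", C=1e6)   # large C approximates a hard margin
clf.fit(X, y)

print("w      =", clf.coef_[0])               # normal vector of the hyperplane
print("b      =", clf.intercept_[0])          # hyperplane is w.x + b = 0
print("alphas =", np.abs(clf.dual_coef_[0]))  # dual_coef_ stores y_i * alpha_i
print("support vectors:\n", clf.support_vectors_)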

Related

Calculating the parameters of a mixture of von Mises distribution

I would like to calculate the values of the concentration (kappa) and mean direction (mu) for a von Mises mixture model from the theta values given by the movMF() function in R. At the bottom of this message chain there is a similar question with an example for a two-component von Mises mixture. The solution seems to be mu = theta/norm(theta) and kappa = norm(theta); however, this gives a matrix of four values for mu, where I'd expect it to be only a vector of two values (one mean direction for each component). I have a feeling I misunderstood the meaning of mu, or it might be that the conversion formulas are wrong. I'd appreciate any help or clarification on this matter.
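As a rough numpy sketch of the conversion described above (assuming theta has one row per mixture component and equals kappa times the unit mean-direction vector mu, which is how movMF parameterizes the density): kappa is the row norm and mu = theta / kappa is a unit vector per component, which is why mu comes out as a 2 x 2 matrix rather than two scalars; mean directions as angles can be recovered with atan2. The theta values below are made up.

import numpy as np

# Hypothetical theta matrix: one row per mixture component (2-d, circular data).
theta = np.array([[ 3.2, 1.1],
                  [-0.4, 5.0]])

kappa = np.linalg.norm(theta, axis=1)       # one concentration per component
mu = theta / kappa[:, None]                 # one unit mean-direction vector per component
mu_angles = np.arctan2(mu[:, 1], mu[:, 0])  # mean directions as angles (radians)

print(kappa)      # two concentrations
print(mu)         # 2 x 2 matrix: each row is a unit vector
print(mu_angles)  # two angles, one per component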

Impute among specific values only

I have a dataframe where I need to impute a value based on the other samples. The column is numerical and contains industry codes, e.g. 1111 - IT, 1234 - Finance, and so on. I've tried to apply KNNImputer and it does produce a number, but as far as I understand it averages the outputs of the neighbors, thus generating a number that does not exist in the column.
The imputer code is the following:
import pandas as pd
from sklearn.impute import KNNImputer

X = df.copy()                        # df is the dataframe with the missing industry codes
imputer = KNNImputer(n_neighbors=5)
filled = imputer.fit_transform(X)    # returns a NumPy array
cols = X.columns
df_imputed = pd.DataFrame(data=filled, columns=cols)
The output it provides is: 6405.2
However, the closest industry codes are 6399 or 6411
How can I impute a numerical column using only the values that already exist in it?
The technical answer to this is actually surprisingly simple: just ask for a single neighbor in your knn imputer:
imputer = KNNImputer(n_neighbors=1)
This way, the knn predictions will not be averaged among the (many) neighbors, but they will actually consist only of values already existing in your data.
Notice that this is the answer to the programming question you are actually asking; whether this is actually the correct approach given the specific form of your data and features is beyond the scope of this answer (and arguably off-topic for SO).
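For concreteness, a small self-contained sketch with a made-up industry-code column: with n_neighbors=1 the imputed value is copied from the single nearest row, so it is always a code that already exists in the data.

import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    "revenue":  [10.0, 12.0, 55.0, 54.0, 11.0],
    "industry": [6399, 6411, 1111, 1234, np.nan],
})

imputer = KNNImputer(n_neighbors=1)
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(df_imputed["industry"].tolist())  # the NaN is replaced by an existing code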

How does probability come into play in a kNN algorithm?

kNN seems relatively simple to understand: you have your data points and you draw them in your feature space (in a feature space of dimension 2, it's the same as drawing points on an xy-plane graph). When you want to classify a new data point, you place it in the same feature space, find the k nearest neighbors, and see what their labels are, ultimately taking the label(s) with the most votes.
So where does probability come into play here? All I am doing is calculating the distance between two points and taking the label(s) of the closest neighbors.
For a new test sample you look at the K nearest neighbors and at their labels.
You count how many of those K samples fall in each class, and divide the counts by K.
For example, let's say that you have 2 classes in your classifier and you use K=3 nearest neighbors, and the labels of those 3 nearest samples are (0, 1, 1): the probability of class 0 is 1/3 and the probability of class 1 is 2/3.
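A small sketch of this counting using scikit-learn's KNeighborsClassifier (the training points are made up): predict_proba returns exactly these neighbor-count fractions.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.array([[0.0], [0.2], [0.9], [1.4], [5.0]])
y_train = np.array([0, 1, 1, 0, 1])

clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)

# The 3 nearest neighbors of 0.5 are 0.2, 0.9 and 0.0 with labels (1, 1, 0),
# so the estimated probabilities are [1/3, 2/3] for classes (0, 1).
print(clf.predict_proba([[0.5]]))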

Searching for closest statistically significant match in k-dimensional set

At a very high level this is similar to the nearest neighbor search problem.
From wiki: "given a set S of points in a space M and a query point q ∈ M, find the closest point in S to q".
But there are some significant differences. Specifics:
Each point is described by k variables.
The variables are not all numerical; there are mixed data types (string, int, etc.).
Not all possible values of all variables are known, but they come from reasonably small sets.
In the data set to be searched there will be multiple points with the same values for all k variables.
Another way to look at this is that there will be many duplicate points.
For each point, let's call the number of duplicates its frequency.
Given a query point q, I need to find the nearest neighbor p such that the frequency of p is at least 15.
There seems to be a wide range of algorithms around NNS, statistical classification, and best bin match.
I am getting a little lost in all the variations. Is there already a standard algorithm I can use, or would I need to modify one?
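One way to frame the problem as code, as a brute-force baseline rather than one of the specialised NNS algorithms: collapse duplicates into a frequency count, keep only points with frequency of at least 15, then scan the remaining distinct points with a simple mixed-type distance. The 0/1 mismatch for strings and the raw absolute difference for numbers are illustrative assumptions, not a recommendation.

from collections import Counter

def mixed_distance(a, b):
    """Sum of per-variable distances: 0/1 mismatch for strings, absolute difference for numbers."""
    d = 0.0
    for x, y in zip(a, b):
        if isinstance(x, str) or isinstance(y, str):
            d += 0.0 if x == y else 1.0
        else:
            d += abs(x - y)
    return d

def nearest_frequent(points, query, min_freq=15):
    """Return the closest distinct point whose duplicate count is at least min_freq."""
    freq = Counter(points)  # points are tuples of the k variables
    candidates = [p for p, n in freq.items() if n >= min_freq]
    return min(candidates, key=lambda p: mixed_distance(p, query), default=None)

# Hypothetical usage: 20 copies of one point, 3 copies of another.
data = [("blue", 3, 1.0)] * 20 + [("red", 4, 2.0)] * 3
print(nearest_frequent(data, ("red", 3, 1.5)))  # -> ('blue', 3, 1.0), the only frequent point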

Why do principal components of the covariance matrix capture the maximum variance of the variables?

I am trying to understand PCA; I went through several tutorials. So far I understand that the eigenvectors of a matrix indicate the directions in which vectors are rotated and scaled when multiplied by that matrix, in proportion to the eigenvalues. Hence the eigenvector associated with the maximum eigenvalue defines the direction of maximum rotation. I understand that along the principal component the variation is maximum and the reconstruction error is minimum. What I do not understand is:
Why does finding the eigenvectors of the covariance matrix correspond to the axes along which the original variables are best represented?
In addition to tutorials, I reviewed other answers here including this and this. But still I do not understand it.
Your premise is incorrect. PCA (and eigenvectors of a covariance matrix) certainly don't represent the original data "better".
Briefly, the goal of PCA is to find some lower dimensional representation of your data (X, which is in n dimensions) such that as much of the variation is retained as possible. The upshot is that this lower dimensional representation is an orthogonal subspace and it's the best k dimensional representation (where k < n) of your data. We must find that subspace.
Another way to think about this: given a data matrix X find a matrix Y such that Y is a k-dimensional projection of X. To find the best projection, we can just minimize the difference between X and Y, which in matrix-speak means minimizing ||X - Y||^2.
Since Y is just a projection of X onto a lower-dimensional space, we can write Y = X*v*v^T, where v has orthonormal columns and v*v^T is a lower-rank projection. Google "rank" if this doesn't make sense. We know X*v lives in a lower dimension than X, but we don't know which direction it points.
To find that direction, we look for the v such that ||X - X*v*v^T||^2 is minimized. Expanding the square and using v^T*v = I, this equals ||X||^2 - ||X*v||^2, so minimizing the reconstruction error is equivalent to maximizing ||X*v||^2, which for a single unit vector v is v^T*X^T*X*v; and X^T*X is (up to a constant factor, for centered data) the sample covariance matrix of your data. This is mathematically why we care about the covariance of the data. Also, it turns out that the v that does this best is made up of eigenvectors of that covariance matrix: there is one eigenvector for each dimension in the lower-dimensional projection/approximation. These eigenvectors are also orthogonal.
Remember, if the eigenvectors are orthogonal, then the covariance between any two of the projected components X*v_i and X*v_j is 0. Now think of a matrix with non-zero entries on the diagonal and zeros off the diagonal: that is the covariance matrix of those projected components, i.e. the data expressed in the eigenvector basis.
Hopefully that helps bridge the connection between covariance matrix and how it helps to yield the best lower dimensional subspace.
Again, eigenvectors don't better define our original variables. The axes determined by applying PCA to a dataset are linear combinations of our original variables that exhibit maximum variance and produce the closest possible approximation to our original data (as measured by the l2 norm).
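A small numerical sketch of that claim (random data, made-up mixing matrix): the projection onto the top eigenvector of the sample covariance matrix has variance equal to the largest eigenvalue, and no other unit direction gives a larger projected variance.

import numpy as np

rng = np.random.default_rng(0)
# Made-up 3-dimensional data with unequal variance along different directions.
X = rng.normal(size=(500, 3)) @ np.array([[3.0, 0.5, 0.1],
                                          [0.0, 1.0, 0.2],
                                          [0.0, 0.0, 0.3]])
X = X - X.mean(axis=0)                      # center the data

C = X.T @ X / (len(X) - 1)                  # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)        # eigh returns eigenvalues in ascending order
v = eigvecs[:, -1]                          # eigenvector of the largest eigenvalue

u = rng.normal(size=3)
u = u / np.linalg.norm(u)                   # an arbitrary unit direction for comparison

print("largest eigenvalue:          ", eigvals[-1])
print("variance along top eigvector:", (X @ v).var(ddof=1))  # matches the eigenvalue
print("variance along random u:     ", (X @ u).var(ddof=1))  # never larger
print("rank-1 reconstruction error: ", np.linalg.norm(X - np.outer(X @ v, v)) ** 2)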
