Modelling known noise with a Gaussian process model - scikit-learn

I need some help understanding how to model known noise using a Gaussian process model.
I have some noisy data; for the purpose of this discussion let's say the noise is Gaussian.
I can model this noise using a GP model with the following kernel.
k1 = Matern(length_scale=[3, 0.2, 0.2])
k2 = WhiteKernel(noise_level=1)
kernel = k1 + k2
I fit to the data, and get a nice result (image here - sorry I am apparently not allowed to post inline images?)
In this case, I am fitting the kernel hyperparameters to the data, and everything seems to be working well. However, this is a rather artificial situation: normally I won't have the luxury of feeding the GP multiple data points at the same parameter values, but I will usually have an estimate of the noise in advance, so I need to figure out how to model this noise without fitting it explicitly.
What I am confused about is how to interpret and set the noise_level. From what I have read in the docs, I should interpret the noise_level parameter as 'corresponding to the variance of Gaussian noise'. For the data above, following a model fit, the model predicts a standard deviation of 0.5, corresponding to a variance of 0.25. This is correct; however, the noise_level in gp.kernel_.k2.noise_level is 1.09902. I would think there is some correspondence between this number and the predicted std, but I can't figure out what it is. Furthermore, when I repeat the experiment with different data noise levels, the prediction of the standard deviation remains good, but gp.kernel_.k2.noise_level doesn't change at all...
What I would like to do is be able to say "In this data, I know that the noise is roughly x" and then set up the kernel accordingly like this:
k1 = Matern()
k2 = WhiteKernel(noise_level=something_derived_from_x, noise_level_bounds='fixed')
kernel = k1 + k2
However, I cannot figure out how I should do this. I feel like I've fundamentally misunderstood something; can anyone help me out?
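For reference, a minimal sketch of that setup, assuming the noise standard deviation x is known in advance (the data names below are placeholders, and the alpha alternative is just one option):
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern, WhiteKernel

x = 0.5                    # known noise standard deviation (assumed value)
noise_variance = x ** 2    # WhiteKernel's noise_level is a variance
k1 = Matern(length_scale=[3, 0.2, 0.2])
k2 = WhiteKernel(noise_level=noise_variance, noise_level_bounds="fixed")
kernel = k1 + k2
gp = GaussianProcessRegressor(kernel=kernel)
# gp.fit(X_train, y_train)  # only the Matern hyperparameters get optimised now
# An alternative is to drop the WhiteKernel and pass the variance as alpha,
# which adds it to the diagonal of the kernel matrix during fitting:
# gp = GaussianProcessRegressor(kernel=k1, alpha=noise_variance)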

Related

Does SciKit Have A InHouse Function That Tallies The Accuracy For Each Y Solution?

I have a LinearSVC model that predicts some stock data. It has a 90% accuracy rating, but I think this might be because some y values are far more likely than others. I want to see whether there is a way to check, for each y I've defined, how accurately that y was predicted.
I haven't seen anything like this in the docs, but it just makes sense to have it.
If what you really want is a measure of confidence rather than actual probabilities, you can use the method LinearSVC.decision_function(). See the documentation, or look at probability calibration with CalibratedClassifierCV in its documentation.
You can use the confusion matrix implemented in scikit-learn to compare the predicted and true values of your classification problem for each individual class. The diagonal holds the counts of correct predictions per class, which can easily be converted to a per-class percentage accuracy.
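For example, a minimal sketch of both suggestions, assuming y_true and y_pred already exist for a fitted classifier (the variable names are placeholders):
from sklearn.metrics import confusion_matrix, classification_report

cm = confusion_matrix(y_true, y_pred)
# Per-class accuracy (recall): correct predictions on the diagonal,
# divided by the number of true samples of each class (row sums).
per_class_accuracy = cm.diagonal() / cm.sum(axis=1)
print(per_class_accuracy)
# classification_report gives precision/recall/F1 per class in one call.
print(classification_report(y_true, y_pred))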

Improving linear regression model by taking absolute value of predicted output?

I have a particular regression problem that I was able to improve using Python's abs() function. I am still somewhat new to machine learning, and I wanted to know if what I am doing is actually "allowed", so to speak, as a way of improving a regression model. The following lines describe my method:
from sklearn import linear_model
from sklearn.model_selection import cross_val_predict
lr = linear_model.LinearRegression()
predicted = abs(cross_val_predict(lr, features, labels_postop_IS, cv=10))
I attempted this solution because linear regression can sometimes produce negative predicted values, even though in my particular case these predictions should never be negative, as they represent a physical quantity.
Using the abs() function, my predictions produce a better fit for the data.
Is this allowed?
Why would it not be "allowed"? If you want to make certain statistical statements (like a 95% confidence interval, for example), you need to be careful. However, most ML practitioners do not care too much about the underlying statistical assumptions and just want a black-box model that can be evaluated on accuracy or some other performance metric. So basically everything is allowed in ML; you just have to be careful not to overfit. Maybe a more sensible solution to your problem would be to use a function that truncates at 0, like f(x) = x if x > 0 else 0. This way large negative values don't suddenly become large positive ones.
On a side note, you should probably try some other models as well, with more parameters, like an SVR with a non-linear kernel. The thing is that a linear regression fits a line, and if this line is not parallel to your x-axis (thinking in the single-variable case) it will inevitably produce negative values at some point along the line. That's one reason why it is often advised not to use linear regression for predictions outside the range of the fitted data.
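As a sketch, the truncation idea could look like this, assuming predicted holds the raw (possibly negative) output of cross_val_predict from the question:
import numpy as np

# Truncate at zero instead of taking the absolute value, so a prediction of
# -5.0 becomes 0.0 rather than flipping to +5.0.
predicted_truncated = np.maximum(predicted, 0)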
A straight line y = a + bx will predict negative y for some x unless a > 0 and b = 0. Using a logarithmic scale seems like a natural way to fix this.
In the case of linear regression, there is no restriction on your outputs.
If your data is non-negative (as in your case, where the values are physical quantities and cannot be negative), you could model it using a generalized linear model (GLM) with a log link function. With a Poisson distribution this is known as Poisson regression, which is helpful for modelling discrete non-negative counts such as the problem you described. The Poisson distribution is parameterized by a single value λ, which describes both the expected value and the variance of the distribution.
I cannot say your approach is wrong but a better way is to go towards the above method.
In effect, this approach fits a linear model to the log of the expected value of your observations.
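A minimal sketch of that suggestion using scikit-learn's PoissonRegressor (a GLM with a log link); features and labels_postop_IS are assumed to be the same arrays as in the question, with non-negative targets:
from sklearn.linear_model import PoissonRegressor
from sklearn.model_selection import cross_val_predict

glm = PoissonRegressor(alpha=0.0)  # alpha is the L2 regularisation strength
predicted = cross_val_predict(glm, features, labels_postop_IS, cv=10)
# Because of the log link, each prediction is exp(linear predictor),
# so it can never be negative.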

How to get started with Tensorflow

I am pretty new to TensorFlow, and I am currently learning it through the tutorial at https://www.tensorflow.org/get_started/get_started
It is said in the manual that:
We've created a model, but we don't know how good it is yet. To evaluate the model on training data, we need a y placeholder to provide the desired values, and we need to write a loss function.
A loss function measures how far apart the current model is from the provided data. We'll use a standard loss model for linear regression, which sums the squares of the deltas between the current model and the provided data. linear_model - y creates a vector where each element is the corresponding example's error delta. We call tf.square to square that error. Then, we sum all the squared errors to create a single scalar that abstracts the error of all examples using tf.reduce_sum:
q1."we don't know how good it is yet.", I didn't understand this
quote as the simple model created is a simple slope equation and on
what it should train for?, as the model is a simple slope. Is it
require an perfect slope or what? why am I training that model and
for what?
q2.what is a loss function? Is loss function is used to determine the
accuracy of the model? Why is it required?
q3. I didn't understand " 'sums the squares of the deltas' between
the current model and the provided data."
q4.I didn't understood this part of code,"squared_deltas =
tf.square(linear_model - y)
this is the code:
y = tf.placeholder(tf.float32)
squared_deltas = tf.square(linear_model - y)
loss = tf.reduce_sum(squared_deltas)
print(sess.run(loss, {x:[1,2,3,4], y:[0,-1,-2,-3]}))
These may be simple questions, but I am a beginner to TensorFlow and am having a hard time understanding it.
1) You're kind of right to ask "why should we train for a simple problem?", but this is just an introductory piece. With any machine learning task you need to evaluate your model to see how good it is. In this case you are just training to find the coefficients of the line of best fit.
2) A loss function in any machine learning context represents the error of your model. This usually means some function of the "distance" between your calculated value and the ground-truth value. Think of it as an internal evaluation score. You want to minimise your loss, so the gradients and parameter updates are based on it.
3/4) Your questions here have more to do with least-squares regression. It's a statistical method for fitting lines of best fit to points. The deltas represent the differences between your calculated values and the true values. The aim is to minimise the area of the squares, and hence minimise the error, giving a better line of best fit.
What you are doing in this Tensorflow example is creating a machine learning model that will learn the coefficients for the line of best fit automatically using a least squares based system.
Pretty much all of your questions have to do with the loss function.
The loss function is a function that determines how far apart your output are from the expected (correct) output.
It has two usages:
Help the algorithm determine whether the tweaking of the weights is moving things in a good or a bad direction
Determine the accuracy (roughly, the number of times your system guesses the correct answer)
Here, the loss function is the sum of the squared deltas, i.e. the sum of the differences (deltas) between the expected output and the actual output, each squared.
I think it's squared to magnify the errors the algorithm makes.
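For illustration, a minimal NumPy sketch of the same sum-of-squared-deltas loss, assuming the tutorial's initial values W = 0.3 and b = -0.3:
import numpy as np

W, b = 0.3, -0.3
x = np.array([1, 2, 3, 4], dtype=float)
y = np.array([0, -1, -2, -3], dtype=float)
linear_model = W * x + b                   # current model's predictions
squared_deltas = (linear_model - y) ** 2   # per-example squared error delta
loss = squared_deltas.sum()                # one scalar summarising all errors
print(loss)                                # 23.66 for these initial parameters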

Meaning of Weight Gradient in CNN

I developed a CNN using MatConvNet and am able to visualize the weights of the 1st layer. It looked very similar to what is shown here (also attached below in case I am not specific enough): http://cs.stanford.edu/people/karpathy/convnetjs/demo/cifar10.html
My question is: what are the weight gradients? I'm not sure what those are and am unable to generate them...
Weights in a NN
In a neural network, a series of linear functions represented as matrices is applied to the features (usually with a nonlinearity between them). These functions are determined by the values in the matrices, referred to as weights.
You can visualize the weights of a normal neural network, but it usually means something slightly different to visualize the convolutional layers of a CNN. These layers are designed to learn a feature computation over space.
When you visualize the weights, you're looking for patterns. A nice smooth filter may mean that the weights are well learned and "looking for something in particular". A noisy weight visualization may mean that you've undertrained your network, overfit it, need more regularization, or something else nefarious (a decent source for these claims).
From this decent review of weight visualizations, we can see patterns start to emerge from treating the weights as images:
Weight Gradients
"Visualizing the gradient" means taking the gradient matrix and treating like an image [1], just like you took the weight matrix and treated it like an image before.
A gradient is just a derivative; for images, it's usually computed as a finite difference - grossly simplified, the X gradient subtracts pixels next to each other in a row, and the Y gradient subtracts pixels next to each other in a column.
For the common example of a filter that extracts edges, we may see a strong gradient in a particular direction. By visualizing the gradients (taking the matrix of finite differences and treating it like an image), you can get a more immediate idea of how your filter is operating on the input. There are a lot of cutting-edge techniques for interpreting these results, but making the image pop up is the easy part!
A similar technique involves visualizing the activations after a forward pass over the input. In this case, you're looking at how the input was changed by the weights; by visualizing the weights, you're looking at how you expect them to change the input.
Don't over-think it - the weights are interesting because they let us see how the function behaves, and the gradients of the weights are just another feature to help explain what's going on. There's nothing sacred about that feature: here are some cool clustering features (t-SNE) from the google paper that look at space separability.
[1] It can be more complicated if you introduce weight sharing, but not that much
My answer here covers this question https://stackoverflow.com/a/68988426/10661506
Long story short, weight gradient of layer l is the gradient of the loss with respect to the weights of layer l.
If you have a correct implementation of backpropagation, you should have access to these gradients as they are needed to compute the weights update at every layer.
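For reference, a minimal PyTorch sketch (not MatConvNet, which the question uses) of how the weight gradients of a first convolutional layer become available after a backward pass; the network, shapes and data here are illustrative assumptions:
import torch
import torch.nn as nn

net = nn.Sequential(nn.Conv2d(3, 8, kernel_size=5), nn.ReLU(), nn.Flatten(),
                    nn.Linear(8 * 28 * 28, 10))
x = torch.randn(4, 3, 32, 32)                  # a fake batch of images
target = torch.randint(0, 10, (4,))
loss = nn.CrossEntropyLoss()(net(x), target)
loss.backward()                                # fills .grad for every parameter
conv = net[0]
weights = conv.weight.detach()                 # the filters you already visualize
weight_grads = conv.weight.grad                # d(loss)/d(weights) for that layer
print(weights.shape, weight_grads.shape)       # both are (8, 3, 5, 5)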

SVM training to infer point position with respect to two other points in a video

I would like to train an SVM with OpenCV C++ to infer the position of a point in the image with respect to two other points to which the wanted point is related.
Basically I have the trajectories of the three points during a whole video and I would like to use these trajectories as training data of the SVM.
I'm new to machine learning techniques, and after some reading I think I've understood that an SVM returns a boolean result (true if some conditions are satisfied at the same time, false if not). In my case I need a position in the image as the result.
I'm not sure how I should organize the training set; I was thinking of doing something like this:
T1 T2 T3 label=1
where T1, T2 and T3 contain all the points belonging to the three trajectories that I know to be correct;
T1 T2 T4 label=-1
where T1 and T2 are the same as before while T4 contains random points that don't lie on the trajectory T3.
Once I have trained the SVM with different trajectories from different videos, I would like to pass it three points: P1(x,y) and P2(x,y), corresponding to T1 and T2 at time t, and a random point P(x,y), and the SVM should predict whether the random point is in the wanted position or not.
Could anybody explain to me whether this approach is wrong and why?
Thanks
This approach is wrong mostly because your problem is not a binary classification problem; it is rather a regression problem. Your desired output is a value, not a binary label, so training an SVM, or any other binary classifier, is a bad idea. A classification problem is a search for a mapping from your input data into some finite (and small) set of possible labels (like "true" and "false", or "cat", "dog" or "face"). Regression, on the other hand, seeks a mapping from your input data into a (possibly multi-dimensional) real-valued space, so instead of labels you are looking for actual values. In your case you are looking for coordinates, which are (as I suppose) two real numbers. If you model your problem as binary classification then:
There is no sensible way of creating a training set: you have only "positive" examples. You could generate "negative" ones by taking points which are not correct, but almost every point is incorrect, so it would be better to train a one-class SVM; and, as mentioned before, it is not a classification problem at all.
Actual testing would have horrible complexity, as you would have to ask for each candidate point "is this the correct answer?"
Instead, you should train any regression model with data of form
(point_1, point_2) -> point_3
so model can find a function which maps your two input points onto one output point. There are many possible models for this task:
linear regression
neural network
SVR (support vector regression)
In short:
your output is a label, a discrete value from a finite set -> classifier
your output is a continuous value -> regression model
If it is still not clear to you, I suggest a good video from Stanford University:
http://www.youtube.com/watch?v=5RLRKkzYWuQ
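A minimal scikit-learn sketch of the regression formulation (not OpenCV C++, which the question uses); the random arrays below are only stand-ins for real trajectory data:
import numpy as np
from sklearn.multioutput import MultiOutputRegressor
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.random((200, 4))   # one row per time step: x1, y1, x2, y2
y = rng.random((200, 2))   # target point per time step: x3, y3
model = MultiOutputRegressor(SVR(kernel="rbf"))  # SVR is single-output, so wrap it
model.fit(X, y)
p1, p2 = (0.2, 0.3), (0.7, 0.1)
predicted_p3 = model.predict([[p1[0], p1[1], p2[0], p2[1]]])
print(predicted_p3)        # estimated (x3, y3)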

Resources