sklearn's LassoCV with regularization parameter per target - scikit-learn

For linear regressions, alphas (regularization parameters) are usually obtained via cross-validation.
My goal is to perform a separate regression for each target dimension, so that an alpha is chosen for each one. Assuming that my input data has shape 100 x 10 and my target data has shape 100 x 3 (n_samples x dim), I want to get three alphas, one per target dimension.
For this, sklearn.linear_model.RidgeCV has alpha_per_target. If alpha_per_target is set to True, a separate alpha is selected for each target dimension.
However, there is no such thing in sklearn.linear_model.LassoCV. I think I understand what Lasso and Ridge regression are. But I don't understand why the LassoCV in sklearn does not have something similar to alpha_per_target from RidgeCV. Perhaps I misunderstood the concepts so that applying alpha_per_target to LassoCV is totally nonsense.
Could anyone help me understand why something like alpha_per_target is not implemented in LassoCV? Is it because it does not make sense, or is it simply not implemented yet?
References
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoCV.html
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeCV.html
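One possible workaround, sketched here under the assumption that fitting each target independently is acceptable, is to wrap LassoCV in sklearn.multioutput.MultiOutputRegressor. That fits one LassoCV per target column, so each target gets its own cross-validated alpha. The synthetic data and variable names below are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.multioutput import MultiOutputRegressor

# Synthetic data with the shapes from the question (assumption: any real
# data of shape (100, 10) and (100, 3) would be used the same way).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
true_coef = rng.normal(size=(10, 3))
Y = X @ true_coef + 0.1 * rng.normal(size=(100, 3))

# MultiOutputRegressor clones the estimator and fits one copy per target
# column, so every target runs its own cross-validation over alphas.
model = MultiOutputRegressor(LassoCV(cv=5)).fit(X, Y)

# One selected alpha per target dimension.
alphas_per_target = [est.alpha_ for est in model.estimators_]
print(alphas_per_target)
```

This is not the same mechanism as RidgeCV's alpha_per_target; it simply runs three independent Lasso fits, which costs roughly three times one LassoCV fit.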

Related

Returning std from sklearn gaussian process regression for multiple targets?

I'm using scikit-learn to fit a Gaussian process regressor to some data. Ideally I want to do this for data with multiple targets, but the prediction doesn't seem to return the std for multiple targets. As an example, here I train a Gaussian process on 3 target statistics and predict at 100 sampled positions:
gpr = GaussianProcessRegressor(kernel=kernel)
gpr.fit(x.reshape(-1,1), y_obs)
y,y_err=gpr.predict(x_sample.reshape(-1,1),return_std=True)
Here the training data has shape x.shape=(20,) and y_obs.shape=(20,3). The predicted mean and errors (y, y_err) then do not have the same shape.
print(y.shape)
print(y_err.shape)
returns
(100,3)
(100,)
The mean, y, has the shape I expect, as I requested the 3 targets at 100 sampled positions. However, y_err doesn't seem to be predicted for each target statistic.
This doesn't seem to work as the documentation describes, since both the mean and the std should have shape (n_samples,) or (n_samples, n_targets).
Is this a bug, or am I missing something?
As far as I know, this is a bug (see https://github.com/scikit-learn/scikit-learn/pull/22199 and other related issues), fixed not so long ago. Sklearn 1.1.0 and up should return the proper shape. (However, multitarget GPR seems to be problematic in general so you may potentially encounter other issues still.)
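For older versions, one possible workaround is to fit a separate regressor per target and stack the results. The sketch below assumes that independent fits are acceptable (each target then gets its own optimized kernel hyperparameters, which differs from a joint fit). The data, the RBF kernel, and the alpha value are placeholders, not the asker's actual setup.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Placeholder data with the shapes from the question.
x = np.linspace(0, 10, 20)                       # shape (20,)
y_obs = np.column_stack([np.sin(x), np.cos(x), np.sin(2 * x)])  # (20, 3)
x_sample = np.linspace(0, 10, 100)               # shape (100,)

means, stds = [], []
for j in range(y_obs.shape[1]):
    # One independent GP per target column (assumed kernel: RBF).
    gpr = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-4)
    gpr.fit(x.reshape(-1, 1), y_obs[:, j])
    m, s = gpr.predict(x_sample.reshape(-1, 1), return_std=True)
    means.append(m)
    stds.append(s)

# Stack the per-target results column by column.
y = np.column_stack(means)
y_err = np.column_stack(stds)
print(y.shape, y_err.shape)
```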

What might be the best loss function when target is a gaussian label?

I have a simple CNN with the inputs as
Cropped grayscale patches of size MxN centered on the object of interest. The intensity of each patch is rescaled to [0, 1].
Target Gaussian label of the same size MXN with values ranging
in [5.0155e-173, 1]. This label is kept fixed throughout the training.
The goal is to learn the target label and use the learned model to detect the object in a test image. I am using Adam optimizer with various loss functions such as categorical_crossentropy, mean_squared_error, and mean_absolute_error but training halts soon probably due to the low values returned by all these loss functions (vanishing gradients?). Increasing the batch size from 1 to 16~32 sometimes helps in completing the iteration but gives undesired outcomes at test time.
Is it because the loss function is too sensitive to the lower values in the target and even treats them as outliers hence steering the whole learning process in the wrong direction?
I'll be grateful for your help in fixing the loss function in such a scenario.
I think the best choice here is to use some probability distribution pseudo-distance. The first one that comes to mind is the Kullback-Leibler divergence, which is already implemented in PyTorch (see [kldivloss](https://pytorch.org/docs/stable/nn.html#kldivloss)) and Keras. Other well-known distances include the Jensen-Shannon divergence and the Earth Mover's distance (the same distance that was used in WGAN).
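To make the idea concrete, here is a minimal NumPy sketch of the KL divergence between a Gaussian label and a predicted map, both treated as discrete probability distributions over the M x N pixels. It is an illustration of the quantity only, not a training loss for any particular framework; the helper names gaussian_label and kl_divergence, the patch size, and the sigma value are all hypothetical.

```python
import numpy as np

def gaussian_label(m, n, center, sigma):
    """Hypothetical helper: an m x n Gaussian bump with peak value 1."""
    rows, cols = np.mgrid[0:m, 0:n]
    d2 = (rows - center[0]) ** 2 + (cols - center[1]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

def kl_divergence(target, pred, eps=1e-12):
    """KL(target || pred) after normalizing both maps to sum to 1."""
    p = np.clip(np.asarray(target, dtype=np.float64).ravel(), eps, None)
    q = np.clip(np.asarray(pred, dtype=np.float64).ravel(), eps, None)
    # Clipping avoids log(0) for the extremely small tail values;
    # renormalizing afterwards keeps both maps valid distributions.
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

label = gaussian_label(32, 32, center=(16, 16), sigma=3.0)
good_pred = gaussian_label(32, 32, center=(16, 16), sigma=3.0)
bad_pred = gaussian_label(32, 32, center=(8, 24), sigma=3.0)

print(kl_divergence(label, good_pred))  # identical maps
print(kl_divergence(label, bad_pred))   # shifted peak, larger divergence
```

Because the maps are normalized first, the near-zero tail of the label contributes almost nothing to the sum, which is one reason a distribution-based distance may behave better here than a per-pixel error.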

Neural network regression evaluation based on target range

I am currently fitting a neural network to predict a continuous target from 1 to 10. However, the samples are not evenly distributed over the entire data set: samples with target ranging from 1-3 are quite underrepresented (only account for around 5% of the data). However, they are of big interest, since the low range of the target is kind of the critical range.
Is there any way to know how my model predicts these low range samples in particular? I know that when doing multiclass classification I can examine the recall to get a taste of how well the model performs on a certain class. For classification use cases I can also set the class weight parameter in Keras to account for class imbalances, but this is obviously not possible for regression.
Until now, I use typical metrics like MAE, MSE, RMSE and get satisfying results. I would however like to know how the model performs on the "critical" samples.
From my point of view, I would first compute the test metrics (MSE, RMSE, etc.) over the whole test set, covering the full range of values (1-10). Then I would compute them separately for the specific range that you consider critical (let's say 1-3) and compare the two populations. You can even test whether the difference between the two populations is statistically significant (Wilcoxon tests etc.).
Maybe this link could be useful for your comparisons. Since you are doing regression, you can also compare MSE and RMSE between the two.
What you need to do is find identifiers for these critical samples. Often times row indices are used for this. Once you have predicted all of your samples, use those stored indices to find the critical samples in your predictions and run whatever automatic metric over those filtered samples. I hope this answers your question.
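The filtering described above can be sketched as follows. The arrays are made-up numbers for illustration, and the threshold of 3 is taken from the question's "critical" range; a boolean mask plays the role of the stored indices.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up targets and predictions on a held-out test set.
y_true = np.array([1.5, 2.0, 2.8, 4.0, 5.5, 6.0, 7.2, 8.0, 9.1, 9.8])
y_pred = np.array([2.5, 2.9, 3.0, 4.2, 5.4, 6.3, 7.0, 8.1, 9.0, 9.9])

# Identify the critical samples by their true target value.
critical = y_true <= 3.0

# Metrics over the whole test set versus the critical subset only.
mae_all = mean_absolute_error(y_true, y_pred)
mae_critical = mean_absolute_error(y_true[critical], y_pred[critical])
rmse_all = np.sqrt(mean_squared_error(y_true, y_pred))
rmse_critical = np.sqrt(mean_squared_error(y_true[critical], y_pred[critical]))

print(f"MAE  all={mae_all:.3f}  critical={mae_critical:.3f}")
print(f"RMSE all={rmse_all:.3f}  critical={rmse_critical:.3f}")
```

In this toy data the model looks fine overall but is noticeably worse on the low range, which is exactly the kind of gap the aggregate metric hides.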

Is standardized scaling a pre-requisite for applying PCA using sklearn?

I have a set of 70 input variables on which I need to perform PCA. As per my understanding centering data such that for each input variable mean is 0 and variance is 1, is necessary for applying PCA.
I am having a hard time figuring out whether I need to apply preprocessing.StandardScaler() before passing my data set to PCA, or whether the PCA function in sklearn does this on its own.
If the latter is the case, then explained_variance_ratio_ should be the same regardless of whether I apply preprocessing.StandardScaler() or not.
But the results are different, hence I believe preprocessing.StandardScaler() is necessary before applying PCA. Is it true?
Yes, it's true: scikit-learn's PCA does not standardize the input dataset; it only centers it by subtracting the mean.
See also this post.
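A quick way to check this yourself, sketched with synthetic data whose columns have deliberately different scales (the scale factors and the shift of 1000 are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Five independent features with very different variances.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) * np.array([1.0, 10.0, 100.0, 0.1, 5.0])

# PCA centers the data itself, so shifting every value changes nothing.
ratio_raw = PCA().fit(X).explained_variance_ratio_
ratio_shifted = PCA().fit(X + 1000.0).explained_variance_ratio_

# PCA does not rescale, so standardizing first changes the result.
X_std = StandardScaler().fit_transform(X)
ratio_scaled = PCA().fit(X_std).explained_variance_ratio_

print(ratio_raw)      # dominated by the large-variance feature
print(ratio_shifted)  # same as ratio_raw up to rounding
print(ratio_scaled)   # much more evenly spread
```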

Improving linear regression model by taking absolute value of predicted output?

I have a particular regression problem that I was able to improve using Python's abs() function. I am still somewhat new when it comes to machine learning, and I wanted to know if what I am doing is actually "allowed," so to speak, for improving a regression problem. The following lines describe my method:
lr = linear_model.LinearRegression()
predicted = abs(cross_val_predict(lr, features, labels_postop_IS, cv=10))
I attempted this solution because linear regression can sometimes produce negative predicted values, even though in my particular case these predictions should never be negative, as they are a physical quantity.
Using the abs() function, my predictions produce a better fit for the data.
Is this allowed?
Why would it not be "allowed"? If you want to make certain statistical statements (a 95% CI, for example), you need to be careful. However, most ML practitioners do not care too much about the underlying statistical assumptions and just want a black-box model that can be evaluated based on accuracy or some other performance metric. So basically everything is allowed in ML; you just have to be careful not to overfit. A more sensible solution to your problem might be to use a function that truncates at 0, like f(x) = x if x > 0 else 0. This way large negative values don't suddenly become large positive ones.
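The difference between the two post-processing choices can be seen on a few made-up raw predictions (the numbers are illustrative only):

```python
import numpy as np

# Made-up raw outputs of a linear model, some of them negative.
raw = np.array([-3.0, -0.5, 0.0, 2.0])

# abs() mirrors negative predictions: -3.0 becomes a confident 3.0.
flipped = np.abs(raw)

# Truncating at zero, f(x) = x if x > 0 else 0, maps them to 0 instead.
clipped = np.clip(raw, 0.0, None)

print(flipped)
print(clipped)
```

Applied to the question's code, this would mean wrapping the cross_val_predict output in np.clip(..., 0.0, None) instead of abs().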
On a side note, you should probably also try some other models with more parameters, like an SVR with a non-linear kernel. The point is that an LR fits a line, and if this line is not parallel to your x-axis (thinking of the single-variable case) it will inevitably produce negative values at some point on the line. That's one reason why it is often advised not to use LRs for predictions outside the "fitted" data.
A straight line y=a+bx will predict negative y for some x unless a>0 and b=0. Using a logarithmic scale seems a natural solution to fix this.
In the case of linear regression, there is no restriction on your outputs.
If your data is non-negative (as in your case, where the values are physical quantities and cannot be negative), you could use a generalized linear model (GLM) with a log link function. With a Poisson response distribution this is known as Poisson regression; it is most commonly used for discrete non-negative counts, but the log link guarantees positive predictions either way. The Poisson distribution is parameterized by a single value λ, which is both the expected value and the variance of the distribution.
I cannot say your approach is wrong, but the method above is a better way to go.
In effect, you are fitting a linear model to the log of the expected value of your observations.
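A minimal sketch of this idea, assuming scikit-learn's PoissonRegressor (a GLM with a log link) is an acceptable stand-in for the model described above. The synthetic data and the choice alpha=0.0 (no regularization) are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, PoissonRegressor

# Synthetic, strictly positive target that grows with the feature.
rng = np.random.default_rng(0)
X = rng.uniform(0, 3, size=(200, 1))
y = np.exp(0.8 * X[:, 0]) + rng.uniform(0, 0.5, size=200)

# Query points, including one far outside the fitted range.
X_new = np.array([[-5.0], [0.0], [5.0]])

# Ordinary least squares has no restriction on its outputs.
lin_pred = LinearRegression().fit(X, y).predict(X_new)

# The log link means predictions are exp(linear term), hence positive.
glm_pred = PoissonRegressor(alpha=0.0).fit(X, y).predict(X_new)

print(lin_pred)  # the extrapolated value at x = -5 goes negative
print(glm_pred)  # all values stay above zero
```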
