sklearn GLM for negative values - scikit-learn

I need to fit a dataset where the target variable has heavy tails, so I want to use a GLM whose error distribution has heavier tails than the normal.
However, if I check the implemented models in sklearn:
https://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model
All the options (Gamma, Tweedie) are for positive distributions only.
Is there a reason for that? Is there perhaps an option to adapt those models to symmetric target variables that can take negative values too?

Related

Using GAN to Model Posteriors with PyTorch

I have a 5-dimensional dataset and I'm interested in using a neural network to model the posterior distributions from which the data was drawn. I decided to implement a GAN to do this, and have been familiarizing myself with PyTorch.
I'm wondering how one should go about restricting what values the generator can produce for the parameters. One of the parameters must take nonnegative real values, another must take nonnegative integer values, and the other three can take any real value.
My first idea was to control this through the transfer function applied to the nodes in the output layer of my neural network. But all of the PyTorch examples I've seen so far apply the same transfer function to all of the output nodes, which is not what I want to do. Is there a way to apply a different transfer function to each output node? Or is there maybe a better way to approach this problem?
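One possible way to do this (a minimal sketch; the layer sizes and the ordering of the parameters in the output are illustrative assumptions, not from the original post) is to slice the generator's raw output and apply a different activation to each slice, e.g. softplus for the nonnegative parameters and the identity for the unrestricted ones. The integer constraint is not differentiable, so it is typically handled outside the network, e.g. by rounding the sampled value.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Generator(nn.Module):
    def __init__(self, latent_dim=16, hidden_dim=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 5),  # 5 raw outputs, one per parameter
        )

    def forward(self, z):
        raw = self.body(z)
        nonneg_real = F.softplus(raw[:, 0:1])   # nonnegative real parameter
        nonneg_int = F.softplus(raw[:, 1:2])    # nonnegative; round outside the graph when sampling
        unrestricted = raw[:, 2:5]              # three unrestricted real parameters
        return torch.cat([nonneg_real, nonneg_int, unrestricted], dim=1)

gen = Generator()
fake_params = gen(torch.randn(8, 16))  # batch of 8 generated parameter vectors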

How to implement KNN to impute categorical features in a sklearn pipeline

I want to use KNN to impute categorical features in a sklearn pipeline (multiple categorical features are missing).
I have done quite a bit of research on existing KNN solutions (fancyimpute, sklearn's KNeighborsRegressor). None of them seems able to both:
work in a sklearn pipeline
impute categorical features
Some of my questions are (any advice is highly appreciated):
Is there any existing approach that allows using KNN (or any other regressor) to impute missing values (categorical in this case) and that works within a sklearn pipeline?
The fancyimpute KNN implementation does not seem to use Hamming distance for imputing missing values (which would be ideal for categorical features).
Is there any fast KNN implementation available, considering that KNN is time-consuming when imputing missing values (i.e., it runs a prediction for each missing value against the whole dataset)?
The default KNeighborsRegressor is supposed to be able to regress missing values, but only with numeric values. Therefore, for categorical values, I believe you most likely need to encode them first and then impute the missing values (see the sketch below).
KNNImputer most likely uses the mean/mode, etc.
IterativeImputer from sklearn can run the imputation against the whole dataset.
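A minimal sketch of that encode-then-impute idea inside a pipeline (assuming scikit-learn >= 1.1, where OrdinalEncoder has encoded_missing_value; the column names and data are made up). Note that KNNImputer then treats the arbitrary integer codes as numeric, which is a rough approximation.

import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, FunctionTransformer
from sklearn.impute import KNNImputer

X = pd.DataFrame({
    "color": ["red", "blue", np.nan, "red", "blue"],
    "size": ["S", np.nan, "M", "L", "M"],
})

pipe = Pipeline([
    # keep missing entries as NaN so the imputer can see them
    ("encode", OrdinalEncoder(encoded_missing_value=np.nan)),
    ("impute", KNNImputer(n_neighbors=2)),
    # neighbour averaging yields fractional codes; round back to valid category codes
    ("round", FunctionTransformer(np.round)),
])

X_imputed_codes = pipe.fit_transform(X)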
KNNImputer is new as of sklearn version 0.22.0
KNNImputer uses a euclidean distance metric by default, but you can pass in your own custom distance metric.
I can't speak to the speed of KNNImputer, but I'd imagine there have been some optimizations done on it if it's made it into sklearn.
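As an example of the custom-metric option mentioned above (a sketch only, assuming the categorical columns have already been ordinal-encoded as in the pipeline above): per the KNNImputer documentation, the metric can be a callable that receives two rows plus a missing_values keyword, so a nan-aware Hamming-style distance could look like this.

import numpy as np
from sklearn.impute import KNNImputer

def nan_hamming(x, y, *, missing_values=np.nan, **kwargs):
    # fraction of positions where the two rows differ, ignoring entries
    # that are missing in either row (missing values assumed to be NaN here)
    observed = ~(np.isnan(x) | np.isnan(y))
    if not observed.any():
        return np.inf
    return np.mean(x[observed] != y[observed])

imputer = KNNImputer(n_neighbors=2, metric=nan_hamming)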
KNeighborsRegressor and KNNImputer do not behave the same, as explained here: https://scikit-learn.org/stable/auto_examples/impute/plot_iterative_imputer_variants_comparison.html#sphx-glr-auto-examples-impute-plot-iterative-imputer-variants-comparison-py
With KNeighborsRegressor, you have to use sklearn's IterativeImputer class. Missing values are initialized with the column mean; for each missing cell, you then perform iterative imputations using the K nearest neighbours, and the algorithm stops after convergence. Depending on the settings, this can be stochastic (i.e., it may produce a different imputation each run).
With KNNImputer, a distance measure that can handle missing values is computed (the default is the nan-euclidean distance). Empty cells are filled with the mean of the K nearest neighbours that have a value for the corresponding variable. Doc: https://scikit-learn.org/stable/modules/impute.html#nearest-neighbors-imputation
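A minimal sketch contrasting the two approaches described above (assuming scikit-learn >= 0.22; IterativeImputer still needs the explicit experimental import):

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer
from sklearn.neighbors import KNeighborsRegressor

X = np.array([[1.0, 2.0], [3.0, np.nan], [np.nan, 6.0], [8.0, 8.0]])

# iterative, model-based imputation using KNN as the regressor
iterative = IterativeImputer(estimator=KNeighborsRegressor(n_neighbors=2))
print(iterative.fit_transform(X))

# direct neighbour-based imputation with nan-euclidean distances
print(KNNImputer(n_neighbors=2).fit_transform(X))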

when to use min-max-scalar and standard-scalar

When should one use MinMaxScaler, and when StandardScaler?
I think it depends on the data. Are there any characteristics of the data to look at when deciding which preprocessing method to use?
I looked at the docs, but can someone give me more insight into it?
The choice of scaling will indeed depend on the type of data that you have. For most cases, StandardScaler is the scaler of choice. If you know that you have some outliers, go for the RobustScaler.
Then, if you deal with features that have an unusual distribution, like for instance the digits dataset, these scalers will not be the best choice. Indeed, on this dataset a lot of pixels are zero, so the distribution has a peak at zero, and dividing by the standard deviation will not be beneficial. So basically, when the distribution of a feature is far from normal, you need to take an alternative.
In the case of the digits, MinMaxScaler is a much better choice. However, if you want to keep zeros at zero (because you use sparse matrices), go for MaxAbsScaler.
NB: also look at the QuantileTransformer and the PowerTransformer if you want a feature to follow a Normal/Uniform distribution whatever the original distribution was.
I hope this helps.
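As a small illustration of the scalers mentioned above (the numbers are made up, with one obvious outlier):

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler, MaxAbsScaler

x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # note the outlier

for scaler in (StandardScaler(), MinMaxScaler(), RobustScaler(), MaxAbsScaler()):
    # RobustScaler is the least affected by the outlier; MinMaxScaler squashes
    # everything except the outlier into a narrow band near 0
    print(scaler.__class__.__name__, scaler.fit_transform(x).ravel())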
When to use MinMaxScaler, RobustScaler, StandardScaler, and Normalizer
https://towardsdatascience.com/scale-standardize-or-normalize-with-scikit-learn-6ccc7d176a02
StandardScaler
StandardScaler assumes that the data has normally distributed features and will scale them to zero mean and unit standard deviation. Use StandardScaler() if you know the data distribution is normal; for most cases it does no harm. Especially with variance-based methods (PCA, clustering, logistic regression, SVMs, perceptrons, neural networks), StandardScaler is in fact very important. On the other hand, it will not make much of a difference if you are using tree-based classifiers or regressors.
MinMaxScaler
MinMaxScaler will transform each value in the column proportionally into the range [0, 1]. This is quite acceptable in cases where we are not concerned about standardisation along the variance axes, e.g. image processing or neural networks expecting values between 0 and 1.
Guide to Scaling and Standardizing
Compare the effect of different scalers on data with outliers

Improving linear regression model by taking absolute value of predicted output?

I have a particular regression problem that I was able to improve using Python's abs() function. I am still somewhat new when it comes to machine learning, and I wanted to know if what I am doing is actually "allowed," so to speak, for improving a regression problem. The following lines describe my method:
from sklearn import linear_model
from sklearn.model_selection import cross_val_predict
lr = linear_model.LinearRegression()
predicted = abs(cross_val_predict(lr, features, labels_postop_IS, cv=10))
I attempted this solution because linear regression can sometimes produce negative prediction values, even though in my particular case these predictions should never be negative, as they represent a physical quantity.
Using the abs() function, my predictions produce a better fit for the data.
Is this allowed?
Why would it not be "allowed"? If you want to make certain statistical statements (like a 95% CI, for example), you need to be careful. However, most ML practitioners do not care too much about the underlying statistical assumptions and just want a black-box model that can be evaluated based on accuracy or some other performance metric. So basically everything is allowed in ML; you just have to be careful not to overfit. Maybe a more sensible solution to your problem would be to use a function that truncates at 0, like f(x) = x if x > 0 else 0. This way large negative values don't suddenly become large positive ones.
On a side note, you should probably try some other models as well, with more parameters, like an SVR with a non-linear kernel. The thing is obviously that an LR fits a line, and if this line is not parallel to your x-axis (thinking in the single-variable case) it will inevitably lead to negative values at some point on the line. That's one reason why it is often advised not to use LRs for predictions outside the "fitted" data.
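A minimal sketch comparing the question's abs() approach with the truncation suggested above (the data here is synthetic, standing in for the question's features and labels_postop_IS):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

# synthetic stand-in for the question's data
features, labels = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
labels = np.abs(labels)  # pretend the target is a nonnegative physical quantity

raw = cross_val_predict(LinearRegression(), features, labels, cv=10)

pred_abs = np.abs(raw)             # abs(): large negative predictions flip to large positive ones
pred_clip = np.clip(raw, 0, None)  # truncation at zero: f(x) = x if x > 0 else 0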
A straight line y = a + bx will predict a negative y for some x unless b = 0 and a >= 0. Using a logarithmic scale seems like a natural solution to fix this.
In the case of linear regression, there is no restriction on your outputs.
If your data is non-negative (as in your case, where the values are physical quantities and cannot be negative), you could model it using a generalized linear model (GLM) with a log link function. One such model is Poisson regression, which is helpful for modeling discrete non-negative counts such as in the problem you described. The Poisson distribution is parameterized by a single value λ, which describes both the expected value and the variance of the distribution.
I cannot say your approach is wrong, but a better way would be to go with the above method.
With a log link, you are effectively fitting a linear model to the log of the expected value of your observations.
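A minimal sketch of that GLM approach using scikit-learn's built-in Poisson GLM (PoissonRegressor, available since version 0.23); the synthetic count target here is only illustrative:

import numpy as np
from sklearn.linear_model import PoissonRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = rng.poisson(lam=np.exp(0.3 * X[:, 0] + 0.1))  # nonnegative counts

glm = PoissonRegressor(alpha=0.0).fit(X, y)
pred = glm.predict(X)  # predictions are exp(linear term), hence always nonnegative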

Setting feature weights for KNN

I am working with sklearn's implementation of KNN. While my input data has about 20 features, I believe some of the features are more important than others. Is there a way to:
set the feature weights for each feature when "training" the KNN learner.
learn what the optimal weight values are with or without pre-processing the data.
On a related note, I understand that KNN generally does not require training, but since sklearn implements it using KD-trees, the tree must be generated from the training data. However, this sounds like it's turning KNN into a binary tree problem. Is that the case?
Thanks.
kNN is simply based on a distance function. When you say "feature two is more important than others", it usually means that a difference in feature two is worth, say, 10x a difference in the other coordinates. A simple way to achieve this is by multiplying coordinate #2 by its weight, so you put into the tree not the original coordinates but the coordinates multiplied by their respective weights.
In case your features are combinations of the coordinates, you might need to apply an appropriate matrix transform to your coordinates before applying the weights; see PCA (principal component analysis). PCA is likely to help you with question 2.
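A minimal sketch of that multiply-by-weights trick (the weights here are illustrative, not learned; multiplying the columns by fixed weights is equivalent to using a weighted Euclidean distance):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
w = np.array([10.0, 1.0, 1.0, 1.0])  # feature 1 counts roughly 10x as much in the distance

knn = KNeighborsClassifier(n_neighbors=5).fit(X * w, y)
# the same weights must be applied to new samples at prediction time
pred = knn.predict(X[:5] * w)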
The answer to question 2 is called "metric learning" and is currently not implemented in scikit-learn. Using the popular Mahalanobis distance amounts (for uncorrelated features) to rescaling the data with StandardScaler. Ideally you would want your metric to take the labels into account.
