Gradient Descent with Linear Regression in scikit-learn

The LinearRegression model from sklearn finds its parameters with a closed-form solution (the normal equation). However, with large datasets gradient descent is said to be more efficient. Is there any way to fit a linear regression in sklearn using gradient descent?

The estimator you are looking for is sklearn.linear_model.SGDRegressor.
You can set the loss hyperparameter, which defines the loss function to be used.
Be aware that the SGD in SGDRegressor stands for Stochastic Gradient Descent, which means that the gradient of the loss is estimated one sample at a time and the model is updated along the way with a decreasing strength schedule (a.k.a. the learning rate).
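As a rough sketch (not part of the original answer, and assuming a recent scikit-learn where the squared loss is spelled "squared_error"), fitting an ordinary least-squares model by SGD looks something like this; feature scaling matters a lot for SGD, hence the StandardScaler:

    import numpy as np
    from sklearn.linear_model import SGDRegressor
    from sklearn.preprocessing import StandardScaler
    from sklearn.pipeline import make_pipeline

    # Synthetic data, just for illustration.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(10_000, 20))
    y = X @ rng.normal(size=20) + rng.normal(scale=0.1, size=10_000)

    # loss="squared_error" makes SGDRegressor minimize the same least-squares
    # objective as LinearRegression, but by stochastic gradient descent.
    model = make_pipeline(
        StandardScaler(),
        SGDRegressor(loss="squared_error", max_iter=1000, tol=1e-4),
    )
    model.fit(X, y)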

Related

Lasso with Coordinate Descent in Scikit-Learn

I've tried to implement lasso regression with coordinate descent. In a later step the objective function will also include the first derivative of the function. All derivatives are computed by an automatic differentiation tool. In the first step I've tried to implement the lasso with simple cyclic coordinate descent, without including the derivative.
In a small example with 4 features and ~100 samples the algorithm converges to the right solution. But the solution on my real dataset and the solution of the lasso regression from scikit-learn are different. Furthermore, scikit-learn's algorithm converges a lot faster. I've used the default settings in the scikit-learn setup.
My question is: what is the difference between the default scikit-learn algorithm for lasso regression and simple cyclic coordinate descent? Is there a paper which describes the implemented algorithm?

Is there any place in the scikit-learn Lasso/Quantile Regression source code where L1 regularization is applied?

I could not find where the L1 norm (Manhattan distance) of the weights is calculated and multiplied by alpha (the L1 regularization coefficient) in the Lasso Regression and Quantile Regression source code of scikit-learn.
I was trying to implement Lasso Regression and Quantile Regression w/ NumPy and compare results w/ scikit-learn models.
I don't believe the loss function (including the regularization penalty) is ever explicitly calculated, no.
Instead, the loss function is optimized by coordinate descent, so we only ever need to calculate derivatives of the loss function. That happens in the enet_coordinate_descent function (or its relatives), and I think the relevant bit is here.
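To make that concrete, here is a rough NumPy sketch (not scikit-learn's optimized Cython code) of plain cyclic coordinate descent for the lasso objective (1/(2n)) * ||y - Xw||^2 + alpha * ||w||_1; the soft-thresholding step is the only place the L1 term shows up:

    import numpy as np

    def soft_threshold(z, t):
        # This is where the alpha * ||w||_1 penalty enters: shrink toward zero by t.
        return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

    def lasso_cd(X, y, alpha, n_iter=100):
        n, p = X.shape
        w = np.zeros(p)
        col_sq = (X ** 2).sum(axis=0) / n          # per-feature curvature
        for _ in range(n_iter):
            for j in range(p):
                r_j = y - X @ w + X[:, j] * w[j]   # residual excluding feature j
                rho = X[:, j] @ r_j / n
                w[j] = soft_threshold(rho, alpha) / col_sq[j]
        return w

scikit-learn's coordinate descent performs essentially this update, plus the elastic-net term, a duality-gap based stopping criterion and other bookkeeping, which is also part of why it converges faster than a naive implementation.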

What is the partial derivative of sklearn's SVM (hinge) loss function with respect to the input?

Does sklearn have a method to get the gradient of the loss function w.r.t. the input for an SVM that you have trained? I am also using a Gaussian (RBF) kernel.
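As far as I know there is no built-in sklearn method for this, but for a fitted SVC with an RBF kernel the gradient can be assembled by hand from the dual_coef_, support_vectors_ and intercept_ attributes. The following is only a sketch of that calculation with made-up data, not an official API:

    import numpy as np
    from sklearn.svm import SVC

    # Toy data, purely illustrative.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    y = np.sign(X[:, 0] + 0.5 * X[:, 1])

    gamma = 0.5                                   # fixed so we can reuse it below
    clf = SVC(kernel="rbf", C=1.0, gamma=gamma).fit(X, y)

    def hinge_grad_wrt_input(x, label):
        """Gradient of max(0, 1 - y*f(x)) w.r.t. x, where f is the decision function."""
        sv = clf.support_vectors_                 # (n_SV, n_features)
        coef = clf.dual_coef_[0]                  # alpha_i * y_i per support vector
        diff = x - sv                             # (n_SV, n_features)
        k = np.exp(-gamma * np.sum(diff ** 2, axis=1))   # RBF kernel values K(x, sv_i)
        f = coef @ k + clf.intercept_[0]          # decision function value
        if label * f >= 1:                        # outside the margin: hinge is flat
            return np.zeros_like(x)
        # df/dx = sum_i coef_i * dK(x, sv_i)/dx = sum_i coef_i * (-2*gamma) * (x - sv_i) * K_i
        df_dx = -2.0 * gamma * ((coef * k) @ diff)
        return -label * df_dx                     # chain rule through max(0, 1 - y*f)

    g = hinge_grad_wrt_input(X[0], y[0])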

gensim Word2Vec - how to apply stochastic gradient descent?

To my understanding, batch (vanilla) gradient descent makes one parameter update using all of the training data. Stochastic gradient descent (SGD) updates the parameters for each training sample, helping the model converge faster, at the cost of higher fluctuation in the loss.
Batch (vanilla) gradient descent sets batch_size=corpus_size.
SGD sets batch_size=1.
And mini-batch gradient descent sets batch_size=k, in which k is usually 32, 64, 128...
How does gensim apply SGD or mini-batch gradient descent? It seems that batch_words is the equivalent of batch_size, but I want to be sure.
Is setting batch_words=1 in the gensim model equivalent to applying SGD?
No, batch_words in gensim refers to the size of work-chunks sent to worker threads.
The gensim Word2Vec class updates model parameters after each training micro-example of (context)->(target-word) (where context might be a single word, as in skip-gram, or the mean of several words, as in CBOW).
For example, you can review this optimized w2v_fast_sentence_sg_neg() Cython function for skip-gram with negative sampling, deep in the Word2Vec training loop:
https://github.com/RaRe-Technologies/gensim/blob/460dc1cb9921817f71b40b412e11a6d413926472/gensim/models/word2vec_inner.pyx#L159
Observe that it is considering exactly one target-word (word_index parameter) and one context-word (word2_index), and updating both the word-vectors (aka 'projection layer' syn0) and the model's hidden-to-output weights (syn1neg) before it might be called again with a subsequent single (context)->(target-word) pair.
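So batch_words only changes how the corpus is chunked for the worker threads; the parameter updates themselves still happen one micro-example at a time. A small illustrative call (gensim 4.x parameter names assumed, toy data made up):

    from gensim.models import Word2Vec

    sentences = [["the", "quick", "brown", "fox"],
                 ["jumps", "over", "the", "lazy", "dog"]]

    # batch_words controls the size of work chunks handed to worker threads;
    # it does not turn the per-example updates into mini-batch updates.
    model = Word2Vec(sentences,
                     vector_size=50,   # called "size" in gensim 3.x
                     sg=1,             # skip-gram
                     negative=5,       # negative sampling
                     workers=4,
                     batch_words=10000,
                     min_count=1)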

Scikit-learn learning curve strongly dependent on batch size of MLPClassifier? Or: how to diagnose bias/variance for a neural network?

I am currently working on a classification problem with two classes in scikit-learn, with the solver adam and activation relu. To explore whether my classifier suffers from high bias or high variance, I plotted the learning curve with scikit-learn's built-in function:
https://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html
I am using GroupKFold cross-validation with 8 splits.
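Roughly, the setup looks like this (the data, group labels and hyperparameters below are placeholders, not my real dataset):

    import numpy as np
    from sklearn.model_selection import learning_curve, GroupKFold
    from sklearn.neural_network import MLPClassifier

    # Placeholder data with group labels for GroupKFold.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(800, 20))
    y = rng.integers(0, 2, size=800)
    groups = rng.integers(0, 16, size=800)

    clf = MLPClassifier(solver="adam", activation="relu",
                        batch_size=64,          # the hyperparameter in question
                        max_iter=500, random_state=0)

    train_sizes, train_scores, val_scores = learning_curve(
        clf, X, y, groups=groups, cv=GroupKFold(n_splits=8),
        train_sizes=np.linspace(0.1, 1.0, 5), scoring="accuracy")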
However, I found that my learning curve is strongly dependent on the batch size of my classifier:
https://imgur.com/a/FOaWKN1
Is it supposed to be like this? I thought learning curves show the accuracy scores as a function of the portion of training data used, independent of any batches/epochs. Can I actually use this built-in function for batch methods? If yes, which batch size should I choose (full batch, i.e. batch size = number of training examples, or something smaller), and what diagnosis do I get from this? Or how do you usually diagnose bias/variance problems of a neural network classifier?
Help would be really appreciated!
Yes, the learning curve depends on the batch size.
The optimal batch size depends on the type of data and the total volume of the data.
In the ideal case a batch size of 1 would be best, but in practice, with big volumes of data, this approach is not feasible.
I think you have to find it through experimentation, because you can't easily calculate the optimal value.
Moreover, when you change the batch size you might want to change the learning rate as well, so you want to keep control over the process.
But indeed, having a tool to find the optimal (memory- and time-wise) batch size would be quite interesting.
What is Stochastic Gradient Descent?
Stochastic gradient descent, often abbreviated SGD, is a variation of the gradient descent algorithm that calculates the error and updates the model for each example in the training dataset.
The update of the model for each training example means that stochastic gradient descent is often called an online machine learning algorithm.
What is Batch Gradient Descent?
Batch gradient descent is a variation of the gradient descent algorithm that calculates the error for each example in the training dataset, but only updates the model after all training examples have been evaluated.
One cycle through the entire training dataset is called a training epoch. Therefore, it is often said that batch gradient descent performs model updates at the end of each training epoch.
What is Mini-Batch Gradient Descent?
Mini-batch gradient descent is a variation of the gradient descent algorithm that splits the training dataset into small batches that are used to calculate model error and update model coefficients.
Implementations may choose to sum the gradient over the mini-batch or take the average of the gradient which further reduces the variance of the gradient.
Mini-batch gradient descent seeks to find a balance between the robustness of stochastic gradient descent and the efficiency of batch gradient descent. It is the most common implementation of gradient descent used in the field of deep learning.
Source: https://machinelearningmastery.com/gentle-introduction-mini-batch-gradient-descent-configure-batch-size/
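To make the three variants concrete, here is a rough NumPy sketch (illustrative only, using a plain least-squares gradient) where the only difference between batch, mini-batch and stochastic gradient descent is how many examples feed each parameter update:

    import numpy as np

    def gradient_descent(X, y, batch_size, lr=0.01, epochs=20):
        """batch_size == len(X): batch GD (one update per epoch).
        batch_size == 1: stochastic GD (one update per example).
        1 < batch_size < len(X): mini-batch GD."""
        n, p = X.shape
        w = np.zeros(p)
        for _ in range(epochs):
            idx = np.random.permutation(n)          # shuffle each epoch
            for start in range(0, n, batch_size):
                b = idx[start:start + batch_size]
                # Averaged least-squares gradient over the current batch.
                grad = 2.0 * X[b].T @ (X[b] @ w - y[b]) / len(b)
                w -= lr * grad
        return w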

Resources