I am working in Weka with the Linear Regression model. I realized that by multiplying two relevant attributes from my dataset and adding the product as an extra attribute, I improve the performance of the linear regression.
However, I cannot understand why! Why does multiplying two relevant attributes give better results?
This is a sign that the function you're approximating isn't linear in the original inputs, but it is in their product. In effect, you've reinvented multivariate polynomial regression.
E.g., suppose the function you're approximating has the form y = a × x² + b × x + c. A linear regression model fitted on x only won't give good results, but when you feed it both x² and x, it can learn the correct a and b.
The same is true in the multivariate setting: a function might not be linear in x1 and x2 separately, but it might be in x1 × x2, which you call an "interaction attribute". (I know these as cross-product features or feature conjunctions; they are what the polynomial kernel in an SVM computes implicitly, and one reason kernelized SVMs can outperform plain linear models.)
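Here is a minimal sketch of the idea in Python with scikit-learn (the question is about Weka, but the principle is the same; the synthetic data and names below are my own, purely for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))                   # two original attributes x1, x2
y = 3.0 * X[:, 0] * X[:, 1] + rng.normal(0, 0.1, 200)   # target depends on their product

plain = LinearRegression().fit(X, y)                    # linear in x1 and x2 only
X_int = np.column_stack([X, X[:, 0] * X[:, 1]])         # add the interaction attribute x1*x2
inter = LinearRegression().fit(X_int, y)

print(plain.score(X, y))      # near 0: no linear signal in x1, x2 alone
print(inter.score(X_int, y))  # near 1: the model is linear in the product

With the extra column, the model is ordinary linear regression again, just in an augmented feature space.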
Related
I'm exploring the scikit-learn logistic regression algorithm. I understand that as part of the training, the algorithm builds a regression curve where the y-variable ranges from 0 to 1 (a sigmoid S-curve). The y-variable is a continuous variable here (although in reality it is a discrete variable).
How is the algorithm able to learn the S-curve, when the training dataset reflects reality and includes the y-variable as a discrete variable? There is no probability estimate in the training, so I'm wondering how the algorithm is able to learn the S-curve.
There is no probability estimate in the training
Sure, but we pretend there is for modeling purposes. We want to maximize the probability of, as you call it, “reality”—if the observed response (the discrete value you refer to) is a 0, we want to predict that with probability 1; similarly, if the response is a 1, we want to predict that with probability 1.
Fitting the model to one data point, getting the right answer with probability 1, would be easy. Of course, we have more than one data point. We have to balance concerns between these. We want the predicted value sigmoid(weights * features) to be close to the true response (0 or 1) for all of the data points, but there may not be a way to set the parameters of the model to achieve this. (That is, the data may not be linearly separable.)
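A small numeric illustration of what "maximize the probability of reality" means (the toy data and the two weight values below are made up): the quality of a candidate set of weights is judged by the likelihood it assigns to the observed 0/1 labels.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# toy data: four examples with one feature each, and their observed 0/1 responses
x = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([0, 0, 1, 1])

def log_likelihood(w, b):
    p = sigmoid(w * x + b)                 # predicted probability of response 1
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

print(log_likelihood(0.1, 0.0))   # roughly -2.5: a poor fit
print(log_likelihood(3.0, 0.0))   # roughly -0.1: much closer to the maximum of 0

Fitting the model means searching for the weights with the highest such likelihood, without ever needing probabilities in the training data.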
Good question! The fitting process in logistic regression is a search procedure that seeks the beta coefficients that minimize the discrepancy between the probabilities predicted by the model (continuous values) and the data (discrete values).
In logistic regression, you model probabilities using a logistic function (also known as a sigmoid function):
XB = B0 + B1 * X1 + B2 * X2 + ... + BN * XN
p(X) = e^(XB) / (1 + e^(XB))
The algorithm finds the beta coefficients by maximum likelihood estimation, which amounts to minimizing a cost function. For logistic regression that cost function is the negative log-likelihood (log loss):
sum -( y_i * log P(X_i) + (1 - y_i) * log(1 - P(X_i)) )
(Cost functions such as the squared error sum (P(X_i) - y_i)^2 or the absolute error sum |P(X_i) - y_i| turn up in other settings, but they are not what maximum likelihood gives for logistic regression.)
An initial set of betas is picked (often at random), the cost is calculated, and the algorithm picks a new set of betas that results in a lower cost. The algorithm stops searching for new betas when the decrease in cost is smaller than a given threshold (set by the tol parameter in sklearn).
The way the model converges to the final set of coefficients depends on the solver parameter. Each solver has a different way of converging to the final set of betas, but they usually converge to the same results.
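As a minimal sketch with scikit-learn (the synthetic data and names are mine): the model is trained on discrete 0/1 labels only, yet its predict_proba output traces out the fitted S-curve.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
x = rng.uniform(-4, 4, size=(300, 1))
# discrete 0/1 labels only -- no probabilities anywhere in the training data
y = (x[:, 0] + rng.normal(0, 1, 300) > 0).astype(int)

clf = LogisticRegression().fit(x, y)

grid = np.linspace(-4, 4, 9).reshape(-1, 1)
print(clf.predict_proba(grid)[:, 1])   # smooth S-curve of P(y=1), learned from 0/1 labels
print(clf.coef_, clf.intercept_)       # the fitted betas of the linear part XB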
I am exploring Amazon Sagemaker and need to know whether it has a built-in polynomial regression algorithm.
Polynomial regression can be implemented using linear regression: you create x^2, x^3, x^4, and so on as extra attributes in the training data and fit a linear model on them.
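A minimal sketch of that feature expansion in plain Python/NumPy (illustrative only, it does not touch the SageMaker API; the synthetic data is mine):

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=100)
y = 2.0 * x**2 - x + 1.0 + rng.normal(0, 0.5, 100)   # quadratic ground truth

degree = 3
# design matrix with columns 1, x, x^2, x^3 -- the "create x^2, x^3, ..." step
X = np.column_stack([x**d for d in range(degree + 1)])

coefs, *_ = np.linalg.lstsq(X, y, rcond=None)        # ordinary linear least squares
print(coefs)                                          # approximately [1, -1, 2, 0]

The same trick should work with any linear learner, since the model stays linear in its coefficients.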
Check out the Sagemaker documentation. You might be especially interested in linear learner:
For input, you give the model labeled examples (x, y). x is a high-dimensional vector and y is a numeric label.
...
Continuous objectives, such as mean square error, cross entropy loss, absolute error.
We have the following methods to develop a linear regression model:
1. Ordinary Least Squares
2. Linear Algebra
3. Gradient Descent
How do I choose between these methods? Can anyone please clarify their pros and cons?
My understanding is that linear algebra is used to implement Ordinary Least Squares (OLS), such that in your question (1) and (2) are effectively the same thing. OLS can only be used for curve fitting equations that are linear in the coefficients and cannot directly be used for non-linear equations. Gradient descent is one of the ways to curve fit non-linear equations, but requires good starting parameters from which to begin the descent in error space.
I invite the more experienced statisticians on this list to comment on my small summary.
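As a small sketch of the contrast on the same linear problem (synthetic data and names are mine): the closed-form normal-equation solution of (1)/(2) and the iterative gradient descent of (3) arrive at essentially the same coefficients.

import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=200)])   # intercept column plus one feature
y = X @ np.array([1.5, -2.0]) + rng.normal(0, 0.1, 200)

# (1)/(2) OLS via linear algebra: solve the normal equations (X'X) b = X'y
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# (3) gradient descent on the mean squared error
beta_gd = np.zeros(2)
learning_rate = 0.1
for _ in range(2000):
    grad = 2.0 / len(y) * X.T @ (X @ beta_gd - y)
    beta_gd -= learning_rate * grad

print(beta_ols)   # close to [1.5, -2.0]
print(beta_gd)    # essentially the same

For a problem that is linear in the coefficients, the closed-form solution is usually preferable; gradient descent becomes attractive when the model is non-linear in its parameters or the data is too large to factorize in one go.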
I was planning to use sklearn linear_model to plot a graph of linear regression result, and statsmodels.api to get a detail summary of the learning result. However, the two packages produce very different results on the same input.
For example, the constant term from sklearn is 7.8e-14, but the constant term from statsmodels is 48.6. (I added a column of 1's in x for the constant term when using both methods.) My code for both methods is succinct:
import statsmodels.api as sm
from sklearn import linear_model

# Use statsmodels linear regression to get a result (summary) for the model.
def reg_statsmodels(y, x):
    results = sm.OLS(y, x).fit()
    return results

# Use sklearn linear regression to compute the coefficients for the prediction.
def reg_sklearn(y, x):
    lr = linear_model.LinearRegression()
    lr.fit(x, y)
    return lr.coef_
The input is too complicated to post here. Is it possible that a singular input x caused this problem?
From a 3-D plot made with PCA, it seems that the sklearn result is not a good approximation. What are some explanations? I still want to make a visualization, so it would be very helpful to fix the issue with the sklearn linear regression fit.
You say that
I added a column of 1's in x for constant term when using both methods
But the documentation of LinearRegression says that
LinearRegression(fit_intercept=True, [...])
That is, it fits an intercept by default. This could explain the difference you see in the constant term.
Now for the other coefficients, differences can occur when two of the variables are highly correlated. Let's consider the most extreme case where two of your columns are identical. Then reducing the coefficient in front of any of the two can be compensated by increasing the other. This is the first thing I'd check.
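A minimal sketch of how to make the two packages comparable (the synthetic data and names are mine): either keep sklearn's default intercept and let statsmodels add the constant, or pass the ones column to both and switch sklearn's own intercept off.

import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 2))
y = 5.0 + x @ np.array([2.0, -1.0]) + rng.normal(0, 0.5, 100)

# Option 1: no ones column; sklearn fits the intercept itself, statsmodels via add_constant
sk = LinearRegression().fit(x, y)                 # fit_intercept=True by default
st = sm.OLS(y, sm.add_constant(x)).fit()
print(sk.intercept_, sk.coef_)
print(st.params)                                  # const, x1, x2 -- should match the above

# Option 2: ones column included explicitly; then disable sklearn's built-in intercept
X1 = np.column_stack([np.ones(len(y)), x])
sk2 = LinearRegression(fit_intercept=False).fit(X1, y)
print(sk2.coef_)                                  # first coefficient is now the constant

Passing a ones column while leaving fit_intercept=True gives sklearn two ways to represent the constant, which is why its reported coefficient for that column can collapse to nearly zero.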
Using scikit-learn to fit a one-dimensional model, without an intercept:
lm = sklearn.linear_model.LinearRegression(fit_intercept=False)
lm.fit(x, y)
When evaluating the score using the training data I get a negative .score().
lm.score(x, y)
-0.00256
Why? Does the R2 score compare the variance of my intercept-less model with a model with an intercept?
(Note that it is the same data that I used to fit the model.)
From the Wikipedia article on R^2:
Important cases where the computational definition of R2 can yield negative values, depending on the definition used, arise [...] where linear regression is conducted without including an intercept.
(emphasis mine).
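A small sketch of why that happens (synthetic data and names are mine): .score() returns R^2 = 1 - SS_res / SS_tot, where SS_tot is the error of the constant predict-the-mean baseline. A fit forced through the origin can do worse than that baseline, which pushes R^2 below zero.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=(100, 1))
y = 10.0 + 0.1 * x[:, 0] + rng.normal(0, 0.5, 100)    # large offset, tiny slope

lm = LinearRegression(fit_intercept=False).fit(x, y)  # forced through the origin

ss_res = np.sum((y - lm.predict(x)) ** 2)             # residuals of the no-intercept fit
ss_tot = np.sum((y - y.mean()) ** 2)                  # residuals of the mean-only baseline
print(1 - ss_res / ss_tot)                            # same value as lm.score(x, y)
print(lm.score(x, y))                                 # strongly negative here

So the score is not secretly comparing your model to a refitted model with an intercept; it compares it to the mean of y, and without an intercept your fit can lose that comparison.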