XGBoost model variable transformation - statistics

I am working on an XGBoost model with a few input variables. There is one variable X for which I am testing different transformations:
Option 1: compute a group-by average of X and use the deviation X - group_by_mean(X) as the input.
Option 2: fit a simple linear regression y = aX + b on the group-by X and use the fitted y as the input.
I run two models with otherwise identical inputs.
Result: I get better predictions from option 1 than from option 2 with the XGBoost model.
My question is: can anyone point me to the potential theoretical reason why option 1 gives a better result as an input to an XGBoost model than option 2?
I suspect it is because option 2, a simple linear regression, introduces unobserved error (its residuals), while option 1, a simple average, has zero unobserved error, since I use all known information in option 1. But I would appreciate more theoretical reasoning and backing if possible.
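For concreteness, here is a minimal pandas sketch of the two options; the column names (group, X, target) and the exact group-by setup are assumptions, since the post doesn't spell them out:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "group":  ["a", "a", "b", "b", "b", "b"],   # hypothetical grouping key
    "X":      [1.0, 3.0, 2.0, 4.0, 6.0, 8.0],
    "target": [0.5, 1.1, 0.9, 1.8, 2.4, 3.1],   # hypothetical target
})

# Option 1: deviation from the group mean. Nothing is estimated, so the
# transformed feature loses no information beyond the group mean itself.
df["X_dev"] = df["X"] - df.groupby("group")["X"].transform("mean")

# Option 2: fit y = a*X + b and feed the fitted values in as the feature.
# The residuals of this fit are information the downstream model never sees.
a, b = np.polyfit(df["X"], df["target"], deg=1)
df["X_fit"] = a * df["X"] + b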


What should be the output size of an image classifier model?

I'm performing an image classification task. Images are labeled 0, 1, or 2. Should the size of the last linear layer's output be 3 or 1? In general, for a 3-class problem the output is set to 3, and the class with the maximum probability among the three is returned. But I saw the last layer set to 1 in some code, which actually seems logical to me. What do you think? (Also, I don't use a softmax or sigmoid function in the last layer.)
To perform classification into c classes (c = 3 in your example) you need to predict the probability of each class, therefore your model should produce a c-dimensional output.
Usually you do not explicitly apply softmax to the "raw predictions" (aka "logits"): the loss function usually does that for you in a more numerically robust way (see, e.g., nn.CrossEntropyLoss).
After you have trained the model, at inference time you can take the argmax over the predicted c logits and output a single scalar, the index of the predicted class. This can only be done during inference, since argmax is not a differentiable operation. A minimal sketch follows.
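For concreteness, a minimal PyTorch sketch of this setup; the 28x28 single-channel input shape and the hidden layer size are placeholder assumptions:
import torch
import torch.nn as nn

num_classes = 3  # labels 0, 1, 2

model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 64),
    nn.ReLU(),
    nn.Linear(64, num_classes),    # c-dim output of raw logits, no softmax here
)

criterion = nn.CrossEntropyLoss()  # applies log-softmax internally

# Training step: the loss is computed on the raw logits.
images = torch.randn(8, 1, 28, 28)
labels = torch.randint(0, num_classes, (8,))
loss = criterion(model(images), labels)

# Inference: collapse the c logits to a single class index with argmax.
with torch.no_grad():
    predicted = model(images).argmax(dim=1)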

Interpreting coefficientMatrix, interceptVector, and confusion matrix in multinomial logistic regression

Can anyone explain how to interpret the coefficientMatrix, the interceptVector, and the confusion matrix of a multinomial logistic regression?
According to Spark documentation:
Multiclass classification is supported via multinomial logistic (softmax) regression. In multinomial logistic regression, the algorithm produces K sets of coefficients, or a matrix of dimension K×J where K is the number of outcome classes and J is the number of features. If the algorithm is fit with an intercept term then a length K vector of intercepts is available.
I ran an example using Spark ML 2.3.0 and got this result.
If I analyse what I get:
The coefficientMatrix has dimensions 5 x 11.
The interceptVector has length 5.
If so, why does the confusion matrix have dimensions 4 x 4?
Also, can anyone give an interpretation of the coefficientMatrix and interceptVector?
Why do I get negative coefficients?
If 5 is the number of classes, why do I get only 4 rows in the confusion matrix?
EDIT
I forgot to mention that I am still a beginner in machine learning and my Google search didn't help, so maybe I'll get an upvote :)
Regarding the 4x4 confusion matrix: I imagine that when you split your data into test and train, 5 classes were present in your training set but only 4 were present in your test set. This can easily happen if the distribution of your response variable is imbalanced.
You'll want to perform a stratified split between test and train prior to modeling. If you are working with pyspark, you may find this library helpful: https://github.com/databricks/spark-sklearn
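As a rough sketch, here is an approximate stratified split in plain pyspark (assuming a hypothetical response column named "label" and unique rows; sampleBy samples each class at roughly, not exactly, the given fraction):
fractions = {row["label"]: 0.8 for row in df.select("label").distinct().collect()}
train = df.sampleBy("label", fractions=fractions, seed=42)  # ~80% of each class
test = df.subtract(train)                                   # the remaining ~20%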
Now regarding negative coefficients in a multi-class logistic regression: as you mentioned, the returned coefficientMatrix has shape 5x11.
Spark produced five sets of coefficients, one per class (the quoted documentation calls this multinomial, i.e. softmax, regression; you can think of each row roughly as a one-vs-rest model where the positive class is that row's label and the negative class is everything else). Say the 1st coefficient of the 1st row is -2.23. To interpret it, take the exponential: exp(-2.23) is approximately 0.11. Interpretation: with a one-unit increase in the 1st feature, the odds of the positive label are multiplied by about 0.11, i.e. reduced by roughly 90%.
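A quick check of that arithmetic in Python:
import math

odds_ratio = math.exp(-2.23)   # ~0.1075
print((1 - odds_ratio) * 100)  # ~89.25, i.e. odds reduced by roughly 90%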

Sklearn error: predict(x, y) takes 2 positional arguments but 3 were given

I am working on a multivariate regression analysis in sklearn, and I took a thorough look at the documentation. When I run the predict() function I get the error: predict() takes 2 positional arguments but 3 were given.
X is a DataFrame and y is a column; I have tried converting the DataFrame to an array/matrix but still get the error.
I have added a snippet showing how x and y are built:
x_train = train.drop('y-variable', axis=1)
y_train = train['y-variable']
x_test = test.drop('y-variable', axis=1)
y_test = test['y-variable']
x = x_test.as_matrix()
y = y_test.as_matrix()
reg = linear_model.LinearRegression()
reg.fit(x_train, y_train)
reg.coef_
reg.predict(x, y)  # raises: predict() takes 2 positional arguments but 3 were given
Use reg.predict(x). You don't need to provide the y values to predict. In fact, the purpose of training the machine learning model is to let it infer the values of y given the input parameters in x.
Also, the documentation of predict here explains that predict expects only x as a parameter.
The reason why you get the error
predict() takes 2 positional arguments but 3 were given
is that when you call reg.predict(x), Python implicitly translates this to reg.predict(self, x); that's why the error says predict() takes 2 positional arguments. The way you call it, reg.predict(x, y), is translated to reg.predict(self, x, y), so 3 positional arguments are passed instead of 2, which explains the whole error message.
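A minimal illustration of the implicit self argument, independent of sklearn:
class Model:
    def predict(self, x):          # 2 positional arguments: self and x
        return x

m = Model()
m.predict([1, 2, 3])               # OK: becomes Model.predict(m, [1, 2, 3])
try:
    m.predict([1, 2, 3], [0])      # becomes Model.predict(m, [1, 2, 3], [0])
except TypeError as e:
    print(e)                       # predict() takes 2 positional arguments but 3 were given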
When you are testing over the test set, it is assumed you don't have the labels for it. You are testing to see how well your model can generalize, and hence you compare the predictions with the real labels. When you want to predict, you use only your X variable(s).
I think you are getting confused between reg.predict() and reg.score(). The former is used for making predictions with the trained model: it takes only your features/independent variables X (the object itself, self, is handled internally) and returns the predicted target/dependent variable Y, which you can later compare with the actual target values to evaluate the model's performance. However, if you wish to do the model evaluation in a single step, you can use the reg.score() method, which takes both X and Y as inputs and computes the corresponding evaluation measure (R^2 or accuracy, depending on the problem at hand). Please refer to sklearn.linear_model.LinearRegression for more information.
Also, these methods are common to most supervised learning models in sklearn.
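A minimal sketch of the two calls, reusing the reg, x_test, and y_test names from the question:
y_pred = reg.predict(x_test)       # features only -> predicted targets
r2 = reg.score(x_test, y_test)     # features plus true targets -> R^2 in one step

# The score call is equivalent to this two-step evaluation:
from sklearn.metrics import r2_score
r2_manual = r2_score(y_test, reg.predict(x_test))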

How to get started with TensorFlow

I am pretty new to TensorFlow, and I am currently learning it from the getting-started guide at https://www.tensorflow.org/get_started/get_started
It is said in the manual that:
We've created a model, but we don't know how good it is yet. To evaluate the model on training data, we need a y placeholder to provide the desired values, and we need to write a loss function.
A loss function measures how far apart the current model is from the provided data. We'll use a standard loss model for linear regression, which sums the squares of the deltas between the current model and the provided data. linear_model - y creates a vector where each element is the corresponding example's error delta. We call tf.square to square that error. Then, we sum all the squared errors to create a single scalar that abstracts the error of all examples using tf.reduce_sum.
q1."we don't know how good it is yet.", I didn't understand this
quote as the simple model created is a simple slope equation and on
what it should train for?, as the model is a simple slope. Is it
require an perfect slope or what? why am I training that model and
for what?
q2.what is a loss function? Is loss function is used to determine the
accuracy of the model? Why is it required?
q3. I didn't understand " 'sums the squares of the deltas' between
the current model and the provided data."
q4.I didn't understood this part of code,"squared_deltas =
tf.square(linear_model - y)
This is the code:
y = tf.placeholder(tf.float32)                 # placeholder for the true target values
squared_deltas = tf.square(linear_model - y)   # per-example squared error
loss = tf.reduce_sum(squared_deltas)           # sum of the squared errors (a scalar)
print(sess.run(loss, {x: [1, 2, 3, 4], y: [0, -1, -2, -3]}))
These may be simple questions, but I am a beginner with TensorFlow and am having a hard time understanding it.
1) You're kind of right to ask why we should train for such a simple problem, but this is just an introductory piece. With any machine learning task you need to evaluate your model to see how good it is. In this case you are simply training to find the coefficients of the line of best fit.
2) A loss function in any machine learning context represents the error of your model. It is usually a function of the "distance" between your calculated values and the ground-truth values. Think of it as an internal evaluation score. You want to minimise your loss, so the gradients and parameter updates are driven by it.
3/4) These questions are really about least-squares regression, a statistical method for fitting a line of best fit through points. The deltas are the differences between your calculated values and the true values. The aim is to minimise the sum of the squares of these deltas, and hence minimise the error and obtain a better line of best fit.
What you are doing in this TensorFlow example is building a machine learning model that learns the coefficients of the line of best fit automatically using a least-squares criterion. A worked version of the loss computation is sketched below.
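For example, here is the tutorial's loss computed by hand with NumPy, assuming the guide's initial parameter values W = 0.3 and b = -0.3:
import numpy as np

x = np.array([1, 2, 3, 4], dtype=float)
y = np.array([0, -1, -2, -3], dtype=float)
W, b = 0.3, -0.3                   # the untrained starting parameters

deltas = (W * x + b) - y           # per-example error: [0.0, 1.3, 2.6, 3.9]
loss = np.sum(deltas ** 2)         # sum of squared deltas
print(loss)                        # 23.66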
Pretty much all of your questions have to do with the loss function.
The loss function is a function that determines how far apart your outputs are from the expected (correct) outputs.
It has two uses:
Help the algorithm determine whether the tweaking of the weights is moving in a good or bad direction
Determine the accuracy (roughly, how often your system guesses the correct answer)
The loss function here is the sum of the squared deltas, where a delta is the difference between the expected output and the actual output.
I think it's squared to magnify the errors the algorithm makes.

Different Linear Regression Coefficients with statsmodels and sklearn

I was planning to use sklearn's linear_model to plot a graph of the linear regression result, and statsmodels.api to get a detailed summary of the fit. However, the two packages produce very different results on the same input.
For example, the constant term from sklearn is 7.8e-14, but the constant term from statsmodels is 48.6. (I added a column of 1's to x for the constant term when using both methods.) My code for both methods is succinct:
# Use statsmodels linear regression to get a result (summary) for the model.
def reg_statsmodels(y, x):
    results = sm.OLS(y, x).fit()
    return results

# Use sklearn linear regression to compute the coefficients for the prediction.
def reg_sklearn(y, x):
    lr = linear_model.LinearRegression()
    lr.fit(x, y)
    return lr.coef_
The input is too complicated to post here. Is it possible that a singular input x caused this problem?
From a 3-D plot made using PCA, it seems that the sklearn result is not a good approximation. What are some explanations? I still want to make a visualization, so it would be very helpful to fix the issue with the sklearn linear regression.
You say that
I added a column of 1's in x for constant term when using both methods
But the documentation of LinearRegression says that
LinearRegression(fit_intercept=True, [...])
fits an intercept by default. This could explain the difference in the constant term: you are effectively specifying the intercept twice, once via the ones column and once via fit_intercept. Two ways to reconcile the libraries are sketched below.
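A sketch of the two fixes, assuming x already contains the explicit column of ones:
# (a) Keep the ones column and tell sklearn not to add a second intercept:
lr = linear_model.LinearRegression(fit_intercept=False)
lr.fit(x, y)

# (b) Or drop the ones column and let sklearn fit the intercept itself;
#     the constant term then lives in lr.intercept_ rather than lr.coef_.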
Now for the other coefficients: differences can occur when two of the variables are highly correlated. Consider the most extreme case, where two of your columns are identical. Then reducing the coefficient in front of either of the two can be compensated by increasing the other, so many coefficient combinations fit equally well. This is the first thing I'd check.
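A small sketch of that extreme case, showing how sklearn spreads the coefficient across two identical columns:
import numpy as np
from sklearn import linear_model

rng = np.random.RandomState(0)
x1 = rng.randn(100)
X = np.column_stack([x1, x1])      # two identical columns
y = 3 * x1 + 0.1 * rng.randn(100)

lr = linear_model.LinearRegression().fit(X, y)
print(lr.coef_)                    # ~[1.5, 1.5]: any pair summing to ~3 fits equally well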
