In linear regression, is the coefficient of un-transformed (0,1) categorical data is the same scale as other coefficient of transformed features? - statistics

I want to perform a linear regression on house price dataset to rank the features that most impacting the price. I handled categorical data by one hot encoding. Then, I transformed (log) all the features except the categorical data (0,1). After I run the regression, is the coefficient of the categorical data is the same scale as other feature's coefficient?
I have a problem to understand why 0,1 data cannot be transformed and is the coefficient of untransformed one hot encoded data is the same scale with other transformed coefficient.

Related

Scaling new data LSTM [duplicate]

While applying min max scaling to normalize your features, do you apply min max scaling on the entire dataset before splitting it into training, validation and test data?
Or do you split first and then apply min max on each set, using the min and max values from that specific set?
Lastly , when making a prediction on a new input, should the features of that input be normalized using the min, max values from the training data before being fed into the network?
Split it, then scale. Imagine it this way: you have no idea what real-world data looks like, so you couldn't scale the training data to it. Your test data is the surrogate for real-world data, so you should treat it the same way.
To reiterate: Split, scale your training data, then use the scaling from your training data on the testing data.

Combine Coefficients of Regression from multiple training sets in scikit Linear Regression

I have a million rows in my training data set, of i am performing the fit in batches. Is there any way to combine the coefficients generated from different batches.

Is the loss in keras in percentage?

I am trying to implement VGGNet-16 for depth map prediction from single image. In the training the RMSE loss comes out to be 0.1599.
That loss value, is it in percentage or not?
No, if you want a percentage of a correctly classified data you can look at a value of accuracy.
Definition of RMSE from Wikipedia:
The root-mean-square deviation (RMSD) or root-mean-square error (RMSE) is a frequently used measure of the differences between values (sample and population values) predicted by a model or an estimator and the values actually observed.
It's always non-negative, and values closer to zero are better.

Why does k=1 in KNN give the best accuracy?

I am using Weka IBk for text classificaiton. Each document basically is a short sentence. The training dataset contains 15,000 documents. While testing, I can see that k=1 gives the best accuracy? How can this be explained?
If you are querying your learner with the same dataset you have trained on with k=1, the output values should be perfect barring you have data with the same parameters that have different outcome values. Do some reading on overfitting as it applies to KNN learners.
In the case where you are querying with the same dataset as you trained with, the query will come in for each learner with some given parameter values. Because that point exists in the learner from the dataset you trained with, the learner will match that training point as closest to the parameter values and therefore output whatever Y value existed for that training point, which in this case is the same as the point you queried with.
The possibilities are:
The data training with data tests are the same data
Data tests have high similarity with the training data
The boundaries between classes are very clear
The optimal value for K is depends on the data. In general, the value of k may reduce the effect of noise on the classification, but makes the boundaries between each classification becomes more blurred.
If your result variable contains values of 0 or 1 - then make sure you are using as.factor, otherwise it might be interpreting the data as continuous.
Accuracy is generally calculated for the points that are not in training dataset that is unseen data points because if you calculate the accuracy for unseen values (values not in training dataset), you can claim that my model's accuracy is the accuracy that is been calculated for the unseen values.
If you calculate accuracy for training dataset, KNN with k=1, you get 100% as the values are already seen by the model and a rough decision boundary is formed for k=1. When you calculate the accuracy for the unseen data it performs really bad that is the training error would be very low but the actual error would be very high. So it would be better if you choose an optimal k. To choose an optimal k you should be plotting a graph between error and k value for the unseen data that is the test data, now you should choose the value of the where the error is lowest.
To answer your question now,
1) you might have taken the entire dataset as train data set and would have chosen a subpart of the dataset as the test dataset.
(or)
2) you might have taken accuracy for the training dataset.
If these two are not the cases than please check the accuracy values for higher k, you will get even better accuracy for k>1 for the unseen data or the test data.

How do I run the Spark logistic regression with categorical features using python?

I have a data with some categorical variables and I want to run a logistic regression using Mllib , it seems like the model support only continous variables.
Does anyone know how to deal with this please ?
Logistic regression, like the other linear models, takes as input an RDD whereas a LabeledPoint is a Double (the label) and the associated Vector (a double Array).
Categorical values (Strings) are not supported, however you could convert those to binary columns.
For example if you have a column RAG taking values Red, Amber and Green, you would add three binary column isRed, isAmber and isGreen of which only one of them is 1 (true) and the others are 0 (zero) for each sample.
See as further explanation: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.DictVectorizer.html

Resources