Can some coefficients be held constant during regression training in PySpark? - apache-spark

Is it possible to specify that certain coefficients should be held constant (at a pre-determined value) during the training of a regression model in PySpark?
For example, if I have the simple, single-feature data shown below, I can fit a straight line to it with linear regression and allow both coefficients to be fit. Then I get the green line.
However, if I somehow know the slope is 2.3, I can fix that coefficient to 2.3 and fit the intercept, which is the blue line.
This is a trivial example, but is there a means to do this in Spark (PySpark especially)?
Or is there a hook to add a custom cost function? (Then I could make the cost extremely large if certain coefficients are far from the expected value.)
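One library-independent way to get the blue line, sketched under assumptions: when the slope is held at a known value, the intercept is the only free parameter, and its least-squares estimate is the mean of y minus slope times x. The arrays below are made up for illustration (they are not the data from the plot), and NumPy stands in for Spark here; the same subtraction could presumably be written as a column expression on a DataFrame before aggregating.

```python
import numpy as np

# Made-up single-feature data (an assumption; not the data from the question).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.1, 5.9, 7.8, 10.2])

# Green line: both coefficients are free.
free_slope, free_intercept = np.polyfit(x, y, deg=1)

# Blue line: slope held at 2.3. Move the known term to the left-hand side
# and fit what remains; for an intercept-only model the least-squares
# solution is the mean of the adjusted target.
fixed_slope = 2.3
fixed_intercept = np.mean(y - fixed_slope * x)

print("free fit: ", free_slope, free_intercept)
print("fixed fit:", fixed_slope, fixed_intercept)
```

The same offset trick extends to several fixed coefficients: subtract every known term from the label and fit only the remaining features.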

Related

Using discretization before or after splitting data?

I am new to data mining concepts and have a question regarding implementation of a technique.
I am using a dataset with large continuous values.
Now, I am trying to code an algorithm where I need to discretize the data (not scale it: scaling has no effect here because the algorithm is not distance-based).
Now for discretization, I have a similar question with regards to scaling and train test split.
For scaling, I know we should split data and then fit transform the train and transform the test based on what we fit from train.
But what do we do for discretization? I am using scikit-learn's KBinsDiscretizer and trying to make sense of whether I should split first and then discretize, the same way we normally scale, or discretize first and then split.
The issue came up because I used 17 bins with the uniform strategy (0-16 value range).
With split then discretize, I get (0-16) range throughout in train but not in test.
With discretize and split, I get (0-16) range in both.
With the former strategy my accuracy is around 85%, but with the latter it's a whopping 97%, which leads me to believe I have definitely overfit the data.
Please advise on what I should be doing for discretization and whether the data interpretation was correct.
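For reference, here is a minimal sketch of the split-then-discretize order, following the same fit-on-train, transform-on-test pattern described for scaling. The random data and the 75/25 split are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import KBinsDiscretizer

# Made-up continuous feature with a large value range (an assumption).
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1000.0, size=(200, 1))

X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

# Bin edges are learned from the training split only.
disc = KBinsDiscretizer(n_bins=17, encode="ordinal", strategy="uniform")
train_bins = disc.fit_transform(X_train)

# The test split is binned with the edges learned above.
test_bins = disc.transform(X_test)

print(train_bins.min(), train_bins.max())
print(test_bins.min(), test_bins.max())
```

With this order the training bins always span 0-16, while the test bins may cover only part of that range, which matches the behaviour reported above.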

GridSearchCV: based on mean_test_score results, predict should perform much worse, but it does not

I am trying to evaluate the performance of a regressor by means of GridSearchCV. In my implementation cv is an int, so I'm applying the K-fold validation method. Looking at cv_results_['mean_test_score'], the best mean score on the k-fold unseen data is around 0.7, while the train scores are much higher, like 0.999. This is very normal, and I'm ok with that.
Well, following the reasoning behind this concept, when I apply the best_estimator_ on the whole data set, I expect to see at least some part of the data predicted not perfectly, right? Instead, the numerical deviations between the predicted quantities and the real values are near zero for all datapoints. And this smells of overfitting.
I don't understand that, because if I remove a small part of the data and apply GridSearchCV to the remaining part, I find almost identical results as above, but the best regressor applied to the totally unseen data predicts with much higher errors, like 10%, 30% or 50%. Which is what I expected, at least for some points, fitting GridSearchCV on the whole set, based on the results of k-fold test sets.
Now, I understand that this forces the predictor to see all datapoints, but the best estimator is the result of k fits, each of which never saw a 1/k fraction of the data. Since mean_test_score is the average of these k scores, I expect to see a bunch of predictions (depending on the cv value) with errors distributed around a mean error that justifies a 0.7 score.
The refit=True parameter of GridSearchCV makes the estimator with the found best set of hyperparameters be refit on the full data. So if your training error is almost zero in the CV folds, you would expect it to be near zero in the best_estimator_ as well.
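The effect of refit=True can be reproduced on synthetic data. The sketch below is an illustration under assumptions (a decision tree regressor, a noisy sine target, and a small grid are all made up here, not taken from the question): the cross-validated score stays modest, while the refit best_estimator_ scores almost perfectly on the data it was refit on.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Made-up noisy regression data (an assumption for illustration).
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 10.0, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.5, size=200)

# Fully grown trees memorise their training data, so train scores are ~1.
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"splitter": ["best", "random"]},
    cv=5,
    return_train_score=True,
)
search.fit(X, y)  # refit=True is the default

# Mean score on the held-out folds for the best hyperparameters.
cv_score = search.best_score_

# best_estimator_ was refit on ALL of X, so this is an in-sample score.
full_score = search.best_estimator_.score(X, y)

print("mean_test_score (best):", cv_score)
print("score on the data used for refit:", full_score)
```

An honest error estimate for the refit model needs data that was held out before calling fit, as in the experiment with the removed portion described in the question.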

Improving linear regression model by taking absolute value of predicted output?

I have a particular classification problem that I was able to improve using Python's abs() function. I am still somewhat new when it comes to machine learning, and I wanted to know if what I am doing is actually "allowed," so to speak, for improving a regression problem. The following line describes my method:
from sklearn import linear_model
from sklearn.model_selection import cross_val_predict
lr = linear_model.LinearRegression()
predicted = abs(cross_val_predict(lr, features, labels_postop_IS, cv=10))
I attempted this solution because linear regression can sometimes produce negative predicted values, even though in my particular case these predictions should never be negative, as they are a physical quantity.
Using the abs() function, my predictions produce a better fit for the data.
Is this allowed?
Why would it not be "allowed"? If you want to make certain statistical statements (a 95% CI, for example) you need to be careful. However, most ML practitioners do not care too much about underlying statistical assumptions and just want a blackbox model that can be evaluated based on accuracy or some other performance metric. So basically everything is allowed in ML, you just have to be careful not to overfit. Maybe a more sensible solution to your problem would be to use a function that truncates at 0, like f(x) = x if x > 0 else 0. This way larger negative values don't suddenly become large positive ones.
On a side note, you should probably try some other models as well with more parameters like a SVR with a non-linear kernel. The thing is obviously that a LR fits a line, and if this line is not parallel to your x-axis (thinking in the single variable case) it will inevitably lead to negative values at some point on the line. That's one reason for why it is often advised not to use LRs for predictions outside the "fitted" data.
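The difference between abs() and the truncation suggested in this answer is easy to see on a few made-up predictions (the numbers below are assumptions for illustration):

```python
import numpy as np

# Made-up raw predictions from a linear model (an assumption).
raw = np.array([-3.0, -0.1, 0.5, 2.0])

# abs() flips the sign: a prediction of -3.0 becomes 3.0.
via_abs = np.abs(raw)

# Truncation at zero, f(x) = x if x > 0 else 0: -3.0 becomes 0.0.
via_clip = np.maximum(raw, 0.0)

print(via_abs)
print(via_clip)
```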
A straight line y=a+bx will predict negative y for some x unless b=0 and a≥0. Using a logarithmic scale seems a natural way to fix this.
In the case of linear regression, there is no restriction on your outputs.
If your data is non-negative (as in your case the values are physical quantities and cannot be negative), you could model it using a generalized linear model (GLM) with a log link function. With a Poisson family this is known as Poisson regression, which is designed for discrete non-negative counts, so check whether that distribution is a reasonable fit for the quantity you described. The Poisson distribution is parameterized by a single value λ, which is both the expected value and the variance of the distribution.
I cannot say your approach is wrong, but a better way is to move towards the above method.
With a log link, you are fitting a linear model to the log of the expected value of your observations, so the predicted mean is always positive.
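To make the log-link idea concrete, here is a hand-rolled sketch of Poisson regression fitted by iteratively reweighted least squares in NumPy. It is an illustration under assumptions, not any library's implementation: the data are made up, the starting value log(y + 0.1) is one common choice, and the iteration count is fixed rather than checked for convergence.

```python
import numpy as np

# Made-up non-negative data (an assumption for illustration).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.0, 2.0, 2.0, 4.0, 7.0, 12.0])
X = np.column_stack([np.ones_like(x), x])  # intercept column + feature

# Log link: log(E[y]) = X @ beta. Start from the log of the shifted data.
eta = np.log(y + 0.1)
for _ in range(50):
    mu = np.exp(eta)               # current predicted mean, always > 0
    z = eta + (y - mu) / mu        # working response
    XtW = X.T * mu                 # Poisson working weights are mu
    beta = np.linalg.solve(XtW @ X, XtW @ z)
    eta = X @ beta

mu = np.exp(eta)
print("coefficients:", beta)
print("fitted means:", mu)

# Even far outside the fitted range the prediction stays positive.
print("prediction at x = -10:", np.exp(beta[0] + beta[1] * -10.0))
```

Because predictions are exp(linear term), no post-hoc abs() or truncation is needed to keep them positive.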

How to get started with Tensorflow

I am pretty new to Tensorflow, and I am currently learning it through the website https://www.tensorflow.org/get_started/get_started
It is said in the manual that:
We've created a model, but we don't know how good it is yet. To evaluate the model on training data, we need a y placeholder to provide the desired values, and we need to write a loss function.
A loss function measures how far apart the current model is from the provided data. We'll use a standard loss model for linear regression, which sums the squares of the deltas between the current model and the provided data. linear_model - y creates a vector where each element is the corresponding example's error delta. We call tf.square to square that error. Then, we sum all the squared errors to create a single scalar that abstracts the error of all examples using tf.reduce_sum:
q1. "we don't know how good it is yet." I didn't understand this quote: the model created is a simple slope equation, so what should it train for? Does it require a perfect slope, or what? Why am I training that model, and for what?
q2. What is a loss function? Is the loss function used to determine the accuracy of the model? Why is it required?
q3. I didn't understand "'sums the squares of the deltas' between the current model and the provided data."
q4. I didn't understand this part of the code: "squared_deltas = tf.square(linear_model - y)"
This is the code:
y = tf.placeholder(tf.float32)
squared_deltas = tf.square(linear_model - y)
loss = tf.reduce_sum(squared_deltas)
print(sess.run(loss, {x:[1,2,3,4], y:[0,-1,-2,-3]}))
These may be simple questions, but I am a beginner with Tensorflow and am having a hard time understanding it.
1) So you're kind of right about "Why should we train for a simple problem" but this is just an introduction piece. With any machine learning task you need to evaluate your model to see how good it is. In this case you are just trying to train to find the coefficients for the line of best fit.
2) A loss function in any machine learning context represents your error with your model. This usually means a function of your "distance" of your calculated value to the ground truth value. Think of it as an internal evaluation score. You want to minimise your loss so the gradients and parameter changes are based on your loss.
3/4) Your question here has more to do with least squares regression. It's a statistical method to create lines of best fit between points. The deltas represent the differences between your calculated values and the truth values. The aim is to minimise the total area of the squares, and hence minimise the error and get a better line of best fit.
What you are doing in this Tensorflow example is creating a machine learning model that will learn the coefficients for the line of best fit automatically using a least squares based system.
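The loss from the question can be reproduced by hand in NumPy, which may make the "sum of squared deltas" easier to follow. The values W = 0.3 and b = -0.3 are assumptions here, since the post does not show how linear_model was defined.

```python
import numpy as np

# Assumed starting parameters (the post does not show them).
W, b = 0.3, -0.3

# The inputs and desired outputs fed to sess.run in the question.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([0.0, -1.0, -2.0, -3.0])

linear_model = W * x + b            # the model's current predictions
deltas = linear_model - y           # one error per example
squared_deltas = np.square(deltas)  # same role as tf.square
loss = np.sum(squared_deltas)       # same role as tf.reduce_sum

print(deltas)
print(squared_deltas)
print(loss)
```

Under these assumed parameters the deltas are 0, 1.3, 2.6 and 3.9, so the loss comes to about 23.66; a perfect line through the data would give a loss of 0.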
Pretty much all of your questions have to do with the loss function.
The loss function is a function that determines how far apart your outputs are from the expected (correct) outputs.
It has two uses:
Help the algorithm determine whether tweaking the weights is moving it in a good or a bad direction
Give a measure of how well the model fits (a rough stand-in for accuracy, i.e. how often your system guesses the correct answer)
The loss function here is the sum of the squared deltas, where a delta is the difference between the expected output and the actual output.
I think it's squared to magnify the errors the algorithm makes; squaring also keeps positive and negative deltas from cancelling out.
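The first use of the loss, telling the algorithm which way to tweak the weights, can be sketched with plain gradient descent in NumPy. This is an illustration under assumptions: the starting values W = 0.3 and b = -0.3, the learning rate of 0.01 and the 1000 steps are all chosen here, not taken from the posts above.

```python
import numpy as np

# The data from the question.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([0.0, -1.0, -2.0, -3.0])

def loss(W, b):
    """Sum of squared deltas between the line W*x + b and the data."""
    return np.sum((W * x + b - y) ** 2)

# Assumed starting point and step size.
W, b = 0.3, -0.3
learning_rate = 0.01
initial_loss = loss(W, b)

for _ in range(1000):
    delta = W * x + b - y
    # Gradient of the loss with respect to each parameter.
    grad_W = 2.0 * np.sum(delta * x)
    grad_b = 2.0 * np.sum(delta)
    # Step against the gradient, i.e. in the direction that lowers the loss.
    W -= learning_rate * grad_W
    b -= learning_rate * grad_b

final_loss = loss(W, b)
print(W, b, initial_loss, final_loss)
```

The data lie exactly on the line y = -x + 1, so the parameters drift towards W = -1 and b = 1 as the loss shrinks.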

Modelling probabilities in a regularized (logistic?) regression model in python

I would like to fit a regression model to probabilities. I am aware that linear regression is often used for this purpose, but I have several probabilities at or near 0.0 and 1.0 and would like to fit a regression model where the output is constrained to lie between 0.0 and 1.0. I want to be able to specify a regularization norm and strength for the model and ideally do this in python (but an R implementation would be helpful as well). All the logistic regression packages I've found seem to be only suited for classification whereas this is a regression problem (albeit one where I want to use the logit link function). I use scikits-learn for my classification and regression needs so if this regression model can be implemented in scikits-learn, that would be fantastic (it seemed to me that this is not possible), but I'd be happy about any solution in python and/or R.
The question has two issues, penalized estimation and fractional or proportions data as dependent variable. I worked on each separately but never tried the combination.
Penalization
Statsmodels has had L1 regularized Logit and other discrete models like Poisson for some time. In recent months there has been a lot of effort to support more penalization but it is not in statsmodels yet. Elastic net for linear and Generalized Linear Model (GLM) is in a pull request and will be merged soon. More penalized GLM like L2 penalization for GAM and splines or SCAD penalization will follow over the next months based on pull requests that still need work.
Two examples for the current L1 fit_regularized for Logit are the question "Difference in SGD classifier results and statsmodels results for logistic with l1" and https://github.com/statsmodels/statsmodels/blob/master/statsmodels/examples/l1_demo/short_demo.py
Note, the penalization weight alpha can be a vector with zeros for coefficients like the constant if they should not be penalized.
http://www.statsmodels.org/dev/generated/statsmodels.discrete.discrete_model.Logit.fit_regularized.html
Fractional models
Binary and binomial models in statsmodels do not impose that the dependent variable is binary and work as long as the dependent variable is in the [0,1] interval.
Fractions or proportions can be estimated with Logit as Quasi-maximum likelihood estimator. The estimates are consistent if the mean function, logistic, cumulative normal or similar link function, is correctly specified but we should use robust sandwich covariance for proper inference. Robust standard errors can be obtained in statsmodels through a fit keyword cov_type='HC0'.
The best documentation is for Stata, http://www.stata.com/manuals14/rfracreg.pdf, and the references therein. I went through those references before Stata had fracreg, and it works correctly with at least Logit and Probit, which were my test cases. (I can't find my scripts or test cases right now.)
The bad news for inference is that robust covariance matrices have not been added to fit_regularized, so the correct sandwich covariance is not directly available. The standard covariance matrix and standard errors of the parameter estimates are derived under the assumption that the model, i.e. the likelihood function, is correctly specified, which will not be the case if the data are fractions and not binary.
Besides using Quasi-Maximum Likelihood with binary models, it is also possible to use a likelihood that is defined for fractional data in (0, 1). A popular model is Beta regression, which is also waiting in a pull request for statsmodels and is expected to be merged within the next months.
