Hardcode a spark logistic regression model - apache-spark

I've trained a model using PySpark and would like to compare its performance to that of an existing heuristic.
I just want to hardcode an LR model with the coefficients 0.1, 0.5, and 0.7, call .transform on the test data to get the predictions, and compute the accuracies.
How do I hardcode a model?

Unfortunately it's not possible to just set the coefficients of a pyspark LR model. The pyspark LR model is actually a wrapper around a java ml model (see class JavaEstimator).
So when the LR model is fit, it transfers the params from the paramMap to a new java estimator, which is fit to the data. All the LogisticRegressionModel methods/attributes are just calls to the java model using the _call_java method.
Since the coefficients aren't params (you can see a comprehensive list using explainParams on an LR instance), you can't pass them to the java LR model that's created, and there is no setter method for them.
For example, for a logistic regression model lmr, you can see that the only coefficient-related setters are for the params you can set when you instantiate a pyspark LR instance: lowerBoundsOnCoefficients and upperBoundsOnCoefficients.
print([c for c in lmr._java_obj.__dir__() if "coefficient" in c.lower()])
# >>> ['coefficientMatrix', 'lowerBoundsOnCoefficients',
# 'org$apache$spark$ml$classification$LogisticRegressionParams$_setter_$lowerBoundsOnCoefficients_$eq',
# 'getLowerBoundsOnCoefficients',
# 'org$apache$spark$ml$classification$LogisticRegressionParams$_setter_$upperBoundsOnCoefficients_$eq',
# 'getUpperBoundsOnCoefficients', 'upperBoundsOnCoefficients', 'coefficients',
# 'org$apache$spark$ml$classification$LogisticRegressionModel$$_coefficients']
Trying to set the "coefficients" attribute yields this:
print(lmr.coefficients)
# >>> DenseVector([18.9303, -18.9303])
lmr.coefficients = [10, -10]
# >>> AttributeError: can't set attribute
So you'd have to roll your own pyspark transformer if you want to be able to provide coefficients. It would probably be easier just to calculate the results using the standard logistic function, as per @pault's comment.
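As a rough illustration of the "standard logistic function" route, here is a minimal plain-Python sketch. The coefficients are the ones from the question; the zero intercept, the 0.5 threshold, the helper names and the test rows are all assumptions made up for the example.

```python
import math

# Coefficients from the question; the zero intercept is an assumption.
COEFFICIENTS = [0.1, 0.5, 0.7]
INTERCEPT = 0.0


def predict_proba(features):
    """Return P(label = 1) under the standard logistic function."""
    margin = INTERCEPT + sum(w * x for w, x in zip(COEFFICIENTS, features))
    return 1.0 / (1.0 + math.exp(-margin))


def predict(features, threshold=0.5):
    """Return 1 if the probability exceeds the threshold, else 0."""
    return 1 if predict_proba(features) > threshold else 0


def accuracy(rows):
    """rows is an iterable of (features, label) pairs."""
    rows = list(rows)
    correct = sum(1 for features, label in rows if predict(features) == label)
    return correct / len(rows)


# Made-up test rows: (features, label)
test_rows = [
    ([1.0, 2.0, 3.0], 1),
    ([-1.0, -2.0, -3.0], 0),
    ([0.5, -1.0, 0.2], 1),
]
print(accuracy(test_rows))
```

On a Spark DataFrame the same arithmetic would have to be expressed with column expressions or a UDF; the sketch only shows the computation itself.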

You can set lower and upper bounds on the coefficients of an LR model.
Since you know exactly which coefficients you want, you can set the lower and upper bounds to the same values, and the fitted model will end up with exactly those coefficients.
You can specify the bounds as dense matrices like this:
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Matrices
# 1 x 3 matrices: one row (binary classification), one column per feature
a = Matrices.dense(1, 3, [0.1, 0.5, 0.7])
b = Matrices.dense(1, 3, [0.1, 0.5, 0.7])
and incorporate them into the model as hyperparameters:
lr = LogisticRegression(featuresCol='features', labelCol='label', maxIter=10,
                        lowerBoundsOnCoefficients=a,
                        upperBoundsOnCoefficients=b,
                        threshold=0.5)
And voilà, you have your model.
Note that only the coefficients are pinned this way; the intercept is still estimated from the training data.
You can then call fit and transform on your model:
best_mod = lr.fit(train)
predict_train = best_mod.transform(train)  # train data
predict_test = best_mod.transform(test)  # test data

Related

Spark/Pyspark: SVM - How to get Area-under-curve?

I have been dealing with random forest and naive Bayes lately. Now I want to use a support vector machine.
After fitting the model I wanted to use the output columns "probability" and "label" to compute the AUC value. But now I see that there is no "probability" column for SVM.
Here is what I have done so far:
from pyspark.ml.classification import LinearSVC
from pyspark.mllib.evaluation import BinaryClassificationMetrics
svm = LinearSVC(maxIter=5, regParam=0.01)
model = svm.fit(train)
scores = model.transform(train)
results = scores.select('probability', 'label')
# Create Score-Label Set for 'BinaryClassificationMetrics'
results_collect = results.collect()
results_list = [(float(i[0][0]), 1.0-float(i[1])) for i in results_collect]
scoreAndLabels = sc.parallelize(results_list)
metrics = BinaryClassificationMetrics(scoreAndLabels)
print("AUC-value: " + str(round(metrics.areaUnderROC,4)))
That is the approach I used in the past for random forest and naive Bayes. I thought I could do the same with SVM, but it does not work because there is no "probability" output column.
Does anyone know why the "probability" column does not exist, and how I can compute the AUC value now?
Using the most recent Spark/PySpark at the time of this answer:
If you use the pyspark.ml module (rather than mllib), you can work with DataFrames as the interface:
svm = LinearSVC(maxIter=5, regParam=0.01)
model = svm.fit(train)
test_prediction = model.transform(test)
Create the evaluator (see its source code for settings):
from pyspark.ml.evaluation import BinaryClassificationEvaluator
evaluator = BinaryClassificationEvaluator()
Apply evaluator to data (again, source code shows more options):
evaluation = evaluator.evaluate(test_prediction)
The result of evaluate is, by default, the area under the ROC curve:
print("evaluation (area under ROC): %f" % evaluation)
The SVM algorithm doesn't provide probability estimates, only raw scores.
There is an algorithm proposed by Platt to compute probabilities from SVM scores, but it is criticized by some and apparently not implemented in Spark.
Btw, there was a similar question: What does the score of the Spark MLLib SVM output mean?
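To see why raw scores are enough for AUC, here is a small plain-Python sketch that computes the area under the ROC curve as the fraction of (positive, negative) pairs that the score ranks correctly. The function name and the example scores are made up for illustration, and the pairwise loop is quadratic, so it is only meant for small inputs.

```python
def area_under_roc(scores, labels):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as one half.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up SVM margins (signed distances to the hyperplane), not probabilities
raw_scores = [2.3, 0.4, -0.1, -1.7, 0.9]
true_labels = [1, 1, 0, 0, 0]
print(area_under_roc(raw_scores, true_labels))
```

Only the ordering of the scores matters here, which is why no calibration to probabilities is needed for this metric.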

LDA model prediction inconsistency

I trained an LDA model and loaded it into the environment to transform new data:
from pyspark.ml.clustering import LocalLDAModel
lda = LocalLDAModel.load(path)
df = lda.transform(text)
The model adds a new column called topicDistribution. In my opinion, this distribution should be the same for the same input; otherwise the model is not consistent. In practice, however, it is not.
May I ask why this happens and how to fix it?
LDA uses randomness when training and, depending on the implementation, when inferring on new data. The implementation in Spark is based on EM MAP inference, so I believe it only uses randomness when training the model. This means that the results can differ each time the model is retrained and run.
To get the same results when running on the same input and same parameters, you can set the random seed when training the model. For example, to set the random seed to 1:
model = LDA.train(data, k=2, seed=1)
To set the seed when transforming new data, create a parameter map to override the default value (None for seed):
lda = LocalLDAModel.load(path)
paramMap = {lda.seed: 1}
df = lda.transform(text, paramMap)
For more information about overwriting model parameters, see here.

How can I correctly use Pipeline with MinMaxScaler + NMF to predict data?

This is a very small sklearn snippet:
from sklearn import decomposition, linear_model
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

logistic = linear_model.LogisticRegression()
pipe = Pipeline(steps=[
    ('scaler_2', MinMaxScaler()),
    ('pca', decomposition.NMF(6)),
    ('logistic', logistic),
])
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2)
pipe.fit(Xtrain, ytrain)
ypred = pipe.predict(Xtest)
I will get this error:
raise ValueError("Negative values in data passed to %s" % whom)
ValueError: Negative values in data passed to NMF (input X)
According to this question:
Scaling test data to 0 and 1 using MinMaxScaler
I know this is because:
"This is due to the fact that the lowest value in my test data was lower than the train data, of which the min max scaler was fit"
But I am wondering, is this a bug?
It seems that MinMaxScaler (like all scalers) should be applied right before I do the prediction and should not depend on the previously fitted training data. Am I right?
Or how can I correctly use preprocessing scalers with Pipeline?
Thanks.
This is not a bug. The main reason you add the scaler to the pipeline is to prevent information from your test set leaking into your model. When you fit the pipeline to your training data, the MinMaxScaler stores the min and max of the training data and uses these values to scale any other data it sees at prediction time. As you also highlighted, these are not necessarily the min and max of your test set. Therefore you may end up with negative values in your scaled test set when the min of your test set is smaller than the min of the training set. You need a scaler that does not give you negative values. For instance, you may use sklearn.preprocessing.StandardScaler with the parameter with_mean=False. This way it does not center the data before scaling; it only scales each feature to unit variance, so features that are non-negative to begin with stay non-negative.
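The effect can be reproduced by hand with the min-max formula. The following plain-Python sketch uses made-up numbers and hypothetical helper names; it is not scikit-learn's implementation, only the same arithmetic for a single feature.

```python
def fit_min_max(train_values):
    """Learn the min and max from the training data only."""
    return min(train_values), max(train_values)


def min_max_transform(values, data_min, data_max):
    """Scale to the (0, 1) range using the previously learned min and max."""
    span = data_max - data_min
    return [(v - data_min) / span for v in values]


train = [2.0, 4.0, 6.0]   # made-up training feature
test = [1.0, 5.0]         # contains a value below the training min

data_min, data_max = fit_min_max(train)
print(min_max_transform(train, data_min, data_max))
print(min_max_transform(test, data_min, data_max))
```

The test value 1.0 lies below the training minimum of 2.0, so it is mapped to a negative number, which is exactly what NMF rejects.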
If your data is stationary and sampling is done properly, you can assume that your test set resembles your train set to a large extent.
Therefore, you can expect the min/max over the test set to be close to the min/max over the train set, with the exception of a few "outliers".
To decrease the chance of producing negative values with MinMaxScaler on the test set, scale your data not to the (0,1) range but to a range that leaves some "safety space" for your transformer, like this:
MinMaxScaler(feature_range=(1,2))
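How much "safety space" the shifted range buys can be checked with the same min-max arithmetic. This is a plain-Python sketch with made-up numbers and a hypothetical helper, assuming the usual formula of scaling to the unit interval and then mapping onto feature_range.

```python
def min_max_transform(values, data_min, data_max, feature_range=(0.0, 1.0)):
    """Scale values with a min and max learned on the training data."""
    low, high = feature_range
    unit = [(v - data_min) / (data_max - data_min) for v in values]
    return [u * (high - low) + low for u in unit]


train_min, train_max = 2.0, 6.0   # made-up values learned on the training set
test = [1.0, 5.0, -3.0]           # 1.0 is a mild outlier, -3.0 an extreme one

print(min_max_transform(test, train_min, train_max, feature_range=(0.0, 1.0)))
print(min_max_transform(test, train_min, train_max, feature_range=(1.0, 2.0)))
```

With the (1,2) range a test value only becomes negative if it falls below the training minimum by more than the full training span, so mild outliers are tolerated but extreme ones are not.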

Spark : setNumClasses() for a subset of labels for Multiclass LogisticRegressionModel

I have a database with ids (labels) that range from 1 to 1040. I am using multiclass logistic regression to predict the id. Now I want to train on only a subset of the labels, let's say 800 to 810. I get an error when I call setNumClasses(11) for 11 classes. I must always set this to the maximum class value, which is 1040. That way the model is trained for all labels from 0 to 1040, which is very expensive and uses a lot of resources.
Am I understanding this right? How can I train my model on only a subset of the labels by passing the class count to setNumClasses(count_of_classes)?
final LogisticRegressionModel model = new LogisticRegressionWithLBFGS()
.setNumClasses(811).run(train.rdd());
Based on the comments on the previous answer, I take the second-to-last comment to be the main question. Calling setNumClasses(23) means that all the classes in the training set must be in the range 0 to 22. Check the docs, which say:
:: Experimental :: Set the number of possible outcomes for k classes classification problem in Multinomial Logistic Regression. By default, it is binary logistic regression so k will be set to 2.
That means that for binary logistic regression the classes are 0 and 1, so setNumClasses(2) is the default.
If the training set contains other classes such as 2, 3 or 4, binary classification will not work.
Proposed solution: if your training set or subset contains the classes 790 - 801 and 900 - 910, then normalise or transform the labels to the range 0 to 22 and call setNumClasses(23).
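The label transformation proposed above can be sketched in plain Python as follows. The helper name and the example ids are made up; in Spark the same mapping would have to be applied to the label of every training row, and the inverse mapping to every prediction.

```python
def build_label_index(labels):
    """Map each distinct original label to a contiguous index 0..k-1."""
    return {label: index for index, label in enumerate(sorted(set(labels)))}


# Made-up subset of original ids
original_labels = [790, 801, 900, 790, 910, 801]

label_to_index = build_label_index(original_labels)
index_to_label = {index: label for label, index in label_to_index.items()}

remapped = [label_to_index[label] for label in original_labels]
num_classes = len(label_to_index)   # the value to pass to setNumClasses

print(remapped)
print(num_classes)
print([index_to_label[i] for i in remapped])   # back to the original ids
```

Keeping the inverse dictionary around matters, because the model will predict the contiguous indices rather than the original ids.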
You cannot do it like this. You are supplying the full set of training data, and it probably fails somewhere in the gradient descent method in Spark (not sure, since you haven't provided the error message).
Also, how is Spark supposed to figure out which labels it should train the model for?
What you should do is filter the RDD down to only the rows with the labels for which you want to train the model. For instance, let's say your labels are values from 0 to 1040 and you only want to train for labels 0 to 800; you can do:
val actualTrainingRDD = train.filter(_.label < 801)
// train on the filtered RDD, not on the original one
val model = new LogisticRegressionWithLBFGS()
  .setNumClasses(801)
  .run(actualTrainingRDD)
Edit: yes, it's of course possible to choose a different set of labels; that was just an example. Simply change the filter to:
train.filter( row => (row.label >= 790 && row.label < 801) )
This is Scala, Java closures use ->, right?

Dealing with no data values

During training, none of my features had '0' values, so I built my SVM model successfully.
However, when I use that model for prediction, some samples have '0' values in some features. These '0's are no-data values. How can I deal with no-data values during prediction? I could impute during training, but if I remove the no-data values during prediction I will have missing predictions at those sample locations.
At those sample points, not all features are missing, only some.
Any suggestions are appreciated.
If some data values are NaN, you need an imputer to fill in those missing values. A common strategy is to replace them with the 'mean' or 'median' of the feature.
from sklearn.impute import SimpleImputer  # replaces the removed sklearn.preprocessing.Imputer
imputer = SimpleImputer(strategy='mean')
X_data = imputer.fit_transform(X_data_with_missing_values)
You can then fit the SVM using this imputed X_data.
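For the situation in the question, where missing values only show up at prediction time, the idea is to learn the replacement values from the training data and reuse them later. Here is a plain-Python sketch of mean imputation under that assumption; the helper names and the data are made up, and 0 is treated as the "no data" marker as the question describes.

```python
MISSING = 0.0  # assumption: 0 marks "no data", as described in the question


def fit_column_means(rows):
    """Per-column mean of the training rows, skipping missing entries.

    Assumes every column has at least one observed value.
    """
    means = []
    for column in zip(*rows):
        observed = [v for v in column if v != MISSING]
        means.append(sum(observed) / len(observed))
    return means


def impute(rows, means):
    """Replace missing entries with the column means learned in training."""
    return [
        [means[j] if v == MISSING else v for j, v in enumerate(row)]
        for row in rows
    ]


train_rows = [[1.0, 10.0], [3.0, 30.0], [5.0, 20.0]]   # made-up, no missing values
predict_rows = [[2.0, 0.0], [0.0, 25.0]]               # 0.0 marks missing values

column_means = fit_column_means(train_rows)
print(column_means)
print(impute(predict_rows, column_means))
```

Every prediction row keeps its observed features and gets a value for the missing ones, so no sample location is dropped from the output.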
