Adjust Intercept of Spark DataFrame API Logistic Regression Model - apache-spark

I'm training a logistic regression in Spark. However, due to specifics in my training data, I need to manually adjust the model afterwards, namely change the intercept.
That was easy to do with the RDD API: just instantiate a new LogisticRegressionModel:
val intercept = model.intercept + adjustment
val adjustedModel = new LogisticRegressionModel(model.weights, intercept)
However, the LogisticRegressionModel constructor in the DataFrame API was made private. How can I make manual adjustments to the model?

I had the same problem this afternoon. I was in test mode, trying to make it work no matter what, so I didn't care how dirty the solution was: get the coefficients from your model, get the intercept, adjust it, then compute the predictions by hand using the same logic Spark uses (look for BLAS.dot, margin and score in the source). Spark's BLAS is private to Spark, so you have to do the same thing there: copy the code for dot and handle both SparseVector and DenseVector. Dirty, but it works.
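A minimal sketch of that by-hand prediction, written in plain Python rather than against Spark itself. The function names are made up for illustration, and it assumes a binary model whose score is the logistic function of the margin (dot product plus intercept). A dense feature vector is a list here, and a sparse one is a dict of index to value.

```python
import math

def dot(coefficients, features):
    """Dot product for dense (list) or sparse (dict: index -> value) features."""
    if isinstance(features, dict):
        return sum(coefficients[i] * v for i, v in features.items())
    return sum(w * x for w, x in zip(coefficients, features))

def predict_probability(coefficients, intercept, features):
    # margin = w . x + b, score = logistic(margin)
    margin = dot(coefficients, features) + intercept
    return 1.0 / (1.0 + math.exp(-margin))

def predict_label(coefficients, intercept, features, threshold=0.5):
    # Threshold of 0.5 is an assumption; use whatever your model was configured with.
    score = predict_probability(coefficients, intercept, features)
    return 1.0 if score > threshold else 0.0

# Hypothetical values standing in for model.coefficients and model.intercept.
coefficients = [0.5, -0.25]
original_intercept = 0.0
adjusted_intercept = original_intercept + 1.0

print(predict_probability(coefficients, adjusted_intercept, [2.0, 4.0]))
```

The only Spark-specific step left is pulling the coefficients and intercept out of the fitted model and applying the function to each row.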

Related

Spark- The purpose of saving ALS model

I'm trying to understand the purpose of storing an ALS model and what the use case for a stored model would be.
I have a dataset with over 300M rows, and I'm using a Hadoop cluster and Spark to calculate recommendations with the ALS algorithm.
The whole computation takes around 5h, and I'm wondering in what case I would store my model and use it, for example, the next day... and I don't see any. So either I'm doing something wrong (which is possible, given that I'm a beginner in the ML world), or the ALS algorithm in Spark and the ability to save it to disk are not very helpful.
Right now, I use it as follows:
df_input = spark.read.format("avro").load(PATH, schema=SCHEMA)
als = ALS(maxIter=12, regParam=0.05, rank=15, userCol="user", itemCol="item", ratingCol="rating", coldStartStrategy="drop")
model = als.fit(df_input)
df_recommendations = model.recommendForAllUsers(10)
As I mentioned, df_input is a DataFrame with over 300M rows. The total calculation time is around 5h, and afterwards I get 10 recommended items for each user in the dataset.
Many tutorials and books show an example of training the model and validating it with test data. Something like:
df_input = spark.read.format("avro").load(PATH, schema=SCHEMA)
(training, test) = df_input.randomSplit(weights = [0.7, 0.3])
als = ALS(maxIter=12, regParam=0.05, rank=15, userCol="user", itemCol="item", ratingCol="rating", coldStartStrategy="drop")
model = als.fit(training)
model.write().save("saved_model")
...
model = ALSModel.load('saved_model')
predictions = model.transform(test)  # or df_input to get predictions for each user
I don't see any advantage in using it this way. However, I do see one big disadvantage: you don't use 30% of the data to train the model.
As far as I know, there isn't a way to use an ALS model online (in real time), at least not without an external package/library.
You can't incrementally update this model.
You can't use it for newly registered users, because they don't exist in the stored matrix factorization, so there won't be any recommendations for them.
All you can do is check what the prediction would be for a given user-item pair, which is basically the same thing that would be returned in the first code example (the one using the fit() method).
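To make the last two points concrete, here is a rough sketch in plain Python of what a stored matrix factorization can and cannot answer. The dictionaries and the function name are hypothetical stand-ins for the saved user and item factors, not Spark's API. A prediction is the dot product of a user's factor vector and an item's factor vector, and a user with no stored factors gets no prediction (returned as None in this sketch).

```python
def predict_rating(user_factors, item_factors, user, item):
    """Predicted rating for a (user, item) pair, or None if either is unseen."""
    u = user_factors.get(user)
    v = item_factors.get(item)
    if u is None or v is None:
        # Cold start: nothing was learned for this user or item.
        return None
    return sum(a * b for a, b in zip(u, v))

# Hypothetical stored factors with rank 2.
user_factors = {1: [1.0, 2.0]}
item_factors = {10: [0.5, 0.25]}

print(predict_rating(user_factors, item_factors, 1, 10))  # known user
print(predict_rating(user_factors, item_factors, 2, 10))  # newly registered user
```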
What would be a reason to store this model on disk and load it when needed? Or when (what conditions should be met) should I consider storing the model and reusing it? Could you provide a use case?

Aggregate training results to predict

When training a model, the results depend on the sampling. To obtain something better, you could repeat the training (on other randomly created training samples, using KFold, StratifiedKFold, ...), somehow aggregate the results, and in this way get a result that is more robust than one created from a single sample alone. Question: is this already implemented in sklearn or similar? Apologies if this is a straightforward question; I haven't seen a simple solution.
I see that there is a function called cross_val_predict. However, my first impression from a quick look at the source code is that it predicts as many times as it trains, and I would like to predict only once, so that I can pickle the trained models, somehow aggregate the results, and predict later, instead of repeating the whole training again.
So far I think the best option is the ensemble methods in sklearn.
I'm leaving here the solution I was using before. I am pretty sure it could be improved (as mentioned above, the ensemble methods in sklearn are better). I have placed it at https://github.com/rafaelvalero/aggreating_predictions_sklearn, where there is a notebook with an example (using the iris dataset), in case anyone wants to play around and see in detail how it could be done.
That solution trains models (in parallel, using joblib), pickles each trained model (a model from sklearn), stores the results (using joblib dump) and later recovers them to create predictions (in parallel, using joblib), which are then aggregated.
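The aggregation step could look roughly like the sketch below. It assumes a majority vote over class labels, which is only one possible choice; the linked notebook may aggregate differently. The function name and the sample predictions are made up for illustration, and the training, pickling and loading steps are left out.

```python
from collections import Counter

def majority_vote(predictions_per_model):
    """Combine label predictions from several models, one vote per model.

    predictions_per_model[m][i] is the label model m predicts for sample i.
    """
    n_samples = len(predictions_per_model[0])
    aggregated = []
    for i in range(n_samples):
        votes = Counter(preds[i] for preds in predictions_per_model)
        # most_common(1) returns the label with the highest vote count.
        aggregated.append(votes.most_common(1)[0][0])
    return aggregated

# Hypothetical predictions from three models trained on different folds.
predictions = [
    [0, 1, 1],
    [0, 1, 0],
    [1, 1, 0],
]
print(majority_vote(predictions))
```

Using an odd number of models avoids ties in the binary case. For probabilistic outputs, averaging the predicted probabilities is a common alternative.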

How to extract average metrics with Cross-Validation in PySpark

I'm trying to perform cross-validation over a Random Forest in Spark 1.6.0, and I'm finding it hard to obtain the evaluation metrics (precision, recall, f1...). I want the average of the metrics over all folds. Is it possible to obtain them with CrossValidator and MulticlassClassificationEvaluator?
I only found examples where the evaluation is performed later on an independent test dataset, using the best model from the cross-validation. I'm not planning to use a train and test set, but to use the whole dataframe (df) for the cross-validation, let it make the splits, and then take the average metrics.
paramGrid = ParamGridBuilder().build()
evaluator = MulticlassClassificationEvaluator()
crossval = CrossValidator(
    estimator=pipeline,
    estimatorParamMaps=paramGrid,
    evaluator=evaluator,
    numFolds=5)
model = crossval.fit(df)
evaluator.evaluate(model.transform(df))
For now, I obtain the best model metric with the last line of the above code evaluator.evaluate(model.transform(df)) and I'm not totally sure that I'm doing it correctly.
In Spark 2.x, it is possible to get the average metrics using model.avgMetrics. This returns an array of doubles with one entry per parameter map in estimatorParamMaps, each being the evaluator's metric averaged over the folds.
MulticlassClassificationEvaluator computes a single metric, selected by metricName: f1 (the default), weightedPrecision, weightedRecall or accuracy (as documented here). The metric can be changed as needed using the setter in the evaluator class.
If you also need to get the best model parameters chosen by the cross validator, please see my answer in here.
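As a rough illustration of what an averaged metric per parameter map means, the sketch below averages per-fold scores in plain Python. The function name and the numbers are hypothetical, and this is not Spark's implementation. It only shows the arithmetic: one evaluator metric per fold and per parameter map, averaged over the folds.

```python
def average_metrics(fold_metrics):
    """Average a metric over folds, separately for each parameter map.

    fold_metrics[f][p] is the metric for fold f and parameter map p.
    """
    n_folds = len(fold_metrics)
    n_param_maps = len(fold_metrics[0])
    return [
        sum(fold[p] for fold in fold_metrics) / n_folds
        for p in range(n_param_maps)
    ]

# Hypothetical f1 scores: 2 folds x 2 parameter maps.
fold_metrics = [
    [0.8, 0.6],
    [0.9, 0.7],
]
print(average_metrics(fold_metrics))
```

With an empty ParamGridBuilder().build() there is a single parameter map, so the result would hold a single averaged value.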

Can I extract significance values for Logistic Regression coefficients in pyspark

Is there a way to get the significance level of each coefficient we receive after we fit a logistic regression model on training data?
I was trying to find a way and could not figure it out myself.
I think I might get the significance level of each feature by running a chi-squared test. However, first, I'm not sure whether I can run the test on all features together, and second, my data values are numeric, so whether it would give me the right result remains a question as well.
Right now I am running the modeling part with statsmodels and scikit-learn, but I would certainly like to know how I can get these results from PySpark ML or MLlib itself.
If anyone can shed some light on this, it would be helpful.
I only use mllib. I think that once you have trained a model, you can use the toPMML method to export it in PMML format (an XML file), and then parse the XML file to get the feature weights. Here is an example:
https://spark.apache.org/docs/2.0.2/mllib-pmml-model-export.html
Hope that will help
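A sketch of the parsing step, using only the Python standard library. The PMML string below is a trimmed, hypothetical sample, not actual Spark output. It assumes the weights appear as NumericPredictor elements with name and coefficient attributes, which is how PMML regression tables are laid out. Note that this recovers the weights only; it does not produce significance values.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed PMML fragment used only for illustration.
SAMPLE_PMML = """<PMML xmlns="http://www.dmg.org/PMML-4_2" version="4.2">
  <RegressionModel functionName="classification" normalizationMethod="logit">
    <RegressionTable intercept="-0.5" targetCategory="1">
      <NumericPredictor name="field_0" coefficient="1.25"/>
      <NumericPredictor name="field_1" coefficient="-0.75"/>
    </RegressionTable>
    <RegressionTable intercept="0.0" targetCategory="0"/>
  </RegressionModel>
</PMML>"""

def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]

def extract_weights(pmml_text):
    """Return {feature name: coefficient} for every NumericPredictor element."""
    root = ET.fromstring(pmml_text)
    weights = {}
    for element in root.iter():
        if local_name(element.tag) == "NumericPredictor":
            weights[element.get("name")] = float(element.get("coefficient"))
    return weights

print(extract_weights(SAMPLE_PMML))
```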

How am I supposed to use RandomizedLogisticRegression in Scikit-learn?

I simply have failed to understand the documentation for this class.
I can fit data using it and get the scores for the features, but is this all this class is supposed to do?
I can't see how I can use it to actually perform regression using the model that was fit. The example in the documentation above is simply creating an instance of the class, so I can't see how that is supposed to help.
There are methods that perform a 'transform' operation, but there is no mention of what kind of transform that is.
So is it possible to use this class to get actual predictions on new test data, and is it possible to use it in cross-validation to compare its performance with other methods I'm using?
I've used the highest ranking features in other classifiers, but I'm not sure if more than that is possible with this classifier.
Update: I've found the use for fit_transform under feature selection part of the documentation:
When the goal is to reduce the dimensionality of the data to use with another classifier, they expose a transform method to select the non-zero coefficient
Unless I get an answer that says I'm wrong, I'll assume that this classifier indeed does not do prediction. I'll wait before I answer my own question.
Randomized LR is supposed to be a feature selection method, not a classifier in and of itself. Its API matches that of a standard scikit-learn transformer:
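from sklearn.linear_model import RandomizedLogisticRegression
randomlr = RandomizedLogisticRegression()
# fitting needs the training labels to score the features
X_train = randomlr.fit_transform(X_train, y_train)
X_test = randomlr.transform(X_test)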
Then fit a model to X_train and do classification on X_test as usual.
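Conceptually, the transform step keeps only the columns whose feature score passed the selection. The sketch below imitates that in plain Python. The function name, the scores and the threshold value are assumptions for illustration, not scikit-learn's implementation.

```python
def select_features(rows, scores, threshold=0.25):
    """Keep only the columns whose score is above the threshold."""
    keep = [j for j, score in enumerate(scores) if score > threshold]
    return [[row[j] for j in keep] for row in rows]

# Hypothetical data: 2 samples x 3 features, with one score per feature.
X = [[1, 2, 3],
     [4, 5, 6]]
scores = [0.9, 0.1, 0.5]

print(select_features(X, scores))
```

The reduced matrix is what the downstream classifier is then trained and evaluated on.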
