GLM with Apache Spark 2.2.0 - Tweedie family default Link value - apache-spark

I am using Spark 2.2.0 with Python. I am trying to figure out what default value of the link param Spark accepts in GeneralizedLinearRegression for the Tweedie family.
When I look at the documentation https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.ml.regression.GeneralizedLinearRegression
class pyspark.ml.regression.GeneralizedLinearRegression(self, labelCol="label", featuresCol="features", predictionCol="prediction", family="gaussian", link=None, fitIntercept=True, maxIter=25, tol=1e-6, regParam=0.0, weightCol=None, solver="irls", linkPredictionCol=None)
it seems that the default value when family="tweedie" should be None, but when I tried this (using a test similar to the unit test https://github.com/apache/spark/pull/17146/files/fe1d3ae36314e385990f024bca94ab1e416476f2):
from pyspark.ml.linalg import Vectors
from pyspark.ml.regression import GeneralizedLinearRegression

df = spark.createDataFrame([(1.0, Vectors.dense(0.0, 0.0)),
                            (1.0, Vectors.dense(1.0, 2.0)),
                            (2.0, Vectors.dense(0.0, 0.0)),
                            (2.0, Vectors.dense(1.0, 1.0))], ["label", "features"])
glr = GeneralizedLinearRegression(family="tweedie", variancePower=1.42, link=None)
model = glr.fit(df)
transformed = model.transform(df)
it raised a NullPointerException on the Java side:
Py4JJavaError: An error occurred while calling o6739.w. :
java.lang.NullPointerException ...
It works well when I remove the explicit link=None from the initialization of the model:
from pyspark.ml.linalg import Vectors

df = spark.createDataFrame([(1.0, Vectors.dense(0.0, 0.0)),
                            (1.0, Vectors.dense(1.0, 2.0)),
                            (2.0, Vectors.dense(0.0, 0.0)),
                            (2.0, Vectors.dense(1.0, 1.0))], ["label", "features"])
glr = GeneralizedLinearRegression(family="tweedie", variancePower=1.42)
model = glr.fit(df)
transformed = model.transform(df)
I would like to be able to pass a standard set of params like
params = {"family": "Onefamily", "link": "OnelinkAccordingToFamily", ...}
and then initialize the GLM as:
glr = GeneralizedLinearRegression(family=params["family"], link=params["link"], ...)
so that it is more generic and works for any combination of family and link.
It just seems that the link value is not ignored when family="tweedie". Any idea what default value I should use? I tried link='' and link='None', but both raise 'invalid link function'.

To use the tweedie family with GeneralizedLinearRegression you need to specify the power link function through the linkPower parameter; you should not set link to None explicitly, which is what led to the exception you got.
Here is an example of how to use it:
from pyspark.ml.linalg import Vectors
from pyspark.ml.regression import GeneralizedLinearRegression

df = spark.createDataFrame(
    [(1.0, Vectors.dense(0.0, 0.0)),
     (1.0, Vectors.dense(1.0, 2.0)),
     (2.0, Vectors.dense(0.0, 0.0)),
     (2.0, Vectors.dense(1.0, 1.0))], ["label", "features"])
glr = GeneralizedLinearRegression(family="tweedie", variancePower=1.6)
model = glr.fit(df)                       # the default link power applies here
model2 = glr.setLinkPower(-1.0).fit(df)   # explicit link power
PS : The default link power in the tweedie family is 1 - variancePower.
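If you want to drive this from a generic params dict as described in the question, one approach is to pass link only when it is actually set and to use linkPower / variancePower for the tweedie family. A minimal sketch (the helper name build_glr and the dict keys are just illustrative, not a Spark API):
from pyspark.ml.regression import GeneralizedLinearRegression

def build_glr(params):
    # Only forward the keyword arguments that make sense for the chosen family;
    # for tweedie, use linkPower/variancePower instead of link.
    kwargs = {"family": params["family"]}
    if params["family"] == "tweedie":
        if "variancePower" in params:
            kwargs["variancePower"] = params["variancePower"]
        if "linkPower" in params:
            kwargs["linkPower"] = params["linkPower"]
    elif params.get("link"):
        kwargs["link"] = params["link"]
    return GeneralizedLinearRegression(**kwargs)

glr = build_glr({"family": "tweedie", "variancePower": 1.42})
glr2 = build_glr({"family": "binomial", "link": "logit"})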

Related

Spark Pipeline - How to extract attributes from trained features transformer

I need to extract attributes from trained transformers so I can use them for serving later, such as the bin boundaries from a QuantileDiscretizer or the name-to-index map from a StringIndexer. For example, how do I extract the bin boundaries from "discretizer_trained" in the code below?
I was not able to find this by googling or in the official documentation https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.feature.QuantileDiscretizer
// https://spark.apache.org/docs/latest/ml-features.html#quantilediscretizer
import org.apache.spark.ml.feature.QuantileDiscretizer

val data = Array((0, 18.0), (1, 19.0), (2, 8.0), (3, 5.0), (4, 2.2))
val df = spark.createDataFrame(data).toDF("id", "hour")
val discretizer = new QuantileDiscretizer()
  .setInputCol("hour")
  .setOutputCol("result")
  .setNumBuckets(3)
val discretizer_trained = discretizer.fit(df)
In Scala Spark running:
discretizer_trained.getSplits
in your example will produce:
res1: Array[Double] = Array(-Infinity, 5.0, 18.0, Infinity)
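The same attributes are reachable from PySpark if that is where you serve the model: QuantileDiscretizer.fit returns a Bucketizer whose getSplits() gives the bin boundaries, and a fitted StringIndexerModel exposes its labels. A minimal sketch, assuming a PySpark DataFrame df that has the "hour" column from above and, for the StringIndexer part, a hypothetical string column "name":
from pyspark.ml.feature import QuantileDiscretizer, StringIndexer

discretizer = QuantileDiscretizer(inputCol="hour", outputCol="result", numBuckets=3)
bucketizer = discretizer.fit(df)       # fit() returns a Bucketizer
splits = bucketizer.getSplits()        # bin boundaries, e.g. [-inf, 5.0, 18.0, inf]

indexer_model = StringIndexer(inputCol="name", outputCol="name_idx").fit(df)
name_to_index = {label: i for i, label in enumerate(indexer_model.labels)}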

How to use specific GPUs in Keras for multi-GPU training?

I have a server with 4 GPUs. I want to use exactly 2 of them for multi-GPU training.
The Keras documentation provided here gives some insight into how to use multiple GPUs, but I want to select specific ones. Is there a way to achieve this?
from keras import backend as K
import tensorflow as tf

c = []
for d in ['/device:GPU:2', '/device:GPU:3']:
    with K.tf.device(d):
        config = tf.ConfigProto(intra_op_parallelism_threads=4,
                                inter_op_parallelism_threads=4,
                                allow_soft_placement=True,
                                device_count={'CPU': 1, 'GPU': 2})
        a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3])
        b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2])
        c.append(tf.matmul(a, b))
with tf.device('/cpu:0'):
    sum = tf.add_n(c)
session = tf.Session(config=config)
K.set_session(session)
I think this should work. You just need the indices of the GPU devices you want to use - in this case 2 and 3. Relevant links:
1) https://github.com/carla-simulator/carla/issues/116
2) https://www.tensorflow.org/guide/using_gpu#using_multiple_gpus
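A related option in TF 1.x (not part of the snippet above) is to restrict which physical GPUs the session can see at all via gpu_options.visible_device_list; the selected devices are then renumbered inside the process as GPU:0 and GPU:1. A minimal sketch, assuming the TF 1.x Keras backend:
import tensorflow as tf
from keras import backend as K

# Only expose physical GPUs 2 and 3 to this process; Keras/TF will address them
# as '/device:GPU:0' and '/device:GPU:1' from now on.
config = tf.ConfigProto(allow_soft_placement=True)
config.gpu_options.visible_device_list = "2,3"
K.set_session(tf.Session(config=config))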
The best way is to compile the Keras model with a tf.distribute Strategy by creating and compiling your model in the strategy's scope. For example:
import contextlib
import tensorflow as tf

def model_scope(devices):
    if 1 < len(devices):
        strategy = tf.distribute.MirroredStrategy(devices)
        scope = strategy.scope()
    else:
        scope = contextlib.suppress()  # Python 3.4+
    return scope

devices = ['/device:GPU:2', '/device:GPU:3']
with model_scope(devices):
    # create and compile your model
    model = get_model()
    model.compile(optimizer=optimizer, loss=loss)

How to load a spark model

I do not succeed in loading a model I have just saved; I get a strange error.
from transforms.api import Output, transform, transform_df
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.classification import LogisticRegressionModel
import logging

logger = logging.getLogger(__name__)

def save_model(spark_session, output, model, model_name='model4'):
    foundry_file_system = output.filesystem()._foundry_fs
    logger.info("The path 1 is : " + str(foundry_file_system))
    path = foundry_file_system._root_path + "/" + model_name
    logger.info("The path 2 is : " + str(path))
    model.write().overwrite().session(spark_session).save(path)
    model = LogisticRegressionModel.read().session(spark_session).load(path)
    df_to_predict = spark_session.createDataFrame([
        (Vectors.dense([0.0, 1.1, 0.1]),),
        (Vectors.dense([2.0, 1.0, -1.0]),),
        (Vectors.dense([2.0, 1.3, 1.0]),),
        (Vectors.dense([0.0, 1.2, -0.5]),)], ["features"])
    df_predicted = model.transform(df_to_predict)
    logger.info(df_predicted.show())
    logger.info(df_predicted.count())

def my_compute_function(ctx, output_model):
    training = ctx.spark_session.createDataFrame([
        (1.0, Vectors.dense([0.0, 1.1, 0.1])),
        (0.0, Vectors.dense([2.0, 1.0, -1.0])),
        (0.0, Vectors.dense([2.0, 1.3, 1.0])),
        (1.0, Vectors.dense([0.0, 1.2, -0.5]))], ["label", "features"])
    lr = LogisticRegression(maxIter=10, regParam=0.01)
    model1 = lr.fit(training)
    save_model(ctx.spark_session, output_model, model1, 'model4')
Here is the error I get:
NonRetryableError: Py4JJavaError: An error occurred while calling
o266.load. : scala.MatchError:
[2,3,[1,null,null,WrappedArray(0.06817659473873602)],[1,1,3,null,null,WrappedArray(-3.1009356010205322,
2.6082147383214482, -0.38017912254303043),true],false] (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema) at
org.apache.spark.ml.classification.LogisticRegressionModel$LogisticRegressionModelReader.load(LogisticRegression.scala:1273)
....
That error is indicative of using a different method to load the model than the one used to write it.
You should be using LogisticRegressionModel.load, not LogisticRegression.read().
This can also be caused if the Parquet metadata doesn't match. I recommend setting the summary metadata level to NONE:
spark.conf.set("parquet.summary.metadata.level", "NONE")

Get survival function with Spark ML

I am training an accelerated failure time model with PySpark (from pyspark.ml.regression import AFTSurvivalRegression).
Now I want to apply the model to new data and get the probability that the event happens before time t (the survival function). Which method should I use? The documentation is not clear to me: https://spark.apache.org/docs/2.1.0/api/python/pyspark.ml.html#pyspark.ml.regression.AFTSurvivalRegression
For example, if I do the following:
from pyspark.ml.regression import AFTSurvivalRegression
from pyspark.ml.linalg import Vectors

training = spark.createDataFrame([
    (1.218, 1.0, Vectors.dense(1.560, -0.605)),
    (2.949, 0.0, Vectors.dense(0.346, 2.158)),
    (3.627, 0.0, Vectors.dense(1.380, 0.231)),
    (0.273, 1.0, Vectors.dense(0.520, 1.151)),
    (4.199, 0.0, Vectors.dense(0.795, -0.226))], ["label", "censor", "features"])
quantileProbabilities = [0.25, 0.75]
aft = AFTSurvivalRegression(quantileProbabilities=quantileProbabilities,
                            quantilesCol="quantiles")
model = aft.fit(training)
model.transform(training).show(truncate=False)
The output contains a quantiles column. Does it mean that, for the first line, P(event happening between 0.832 and 9.48) = 50%?
Thanks

How to evaluate implicit trained model in spark MLlib [duplicate]

I'm trying to tune the parameters of an ALS matrix factorization model that uses implicit data. For this, I'm trying to use pyspark.ml.tuning.CrossValidator to run through a parameter grid and select the best model. I believe my problem is in the evaluator, but I can't figure it out.
I can get this to work for an explicit data model with a regression RMSE evaluator, as follows:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from pyspark.ml.recommendation import ALS
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.sql.functions import rand

conf = SparkConf() \
    .setAppName("MovieLensALS") \
    .set("spark.executor.memory", "2g")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

dfRatings = sqlContext.createDataFrame(
    [(0, 0, 4.0), (0, 1, 2.0), (1, 1, 3.0), (1, 2, 4.0), (2, 1, 1.0), (2, 2, 5.0)],
    ["user", "item", "rating"])
dfRatingsTest = sqlContext.createDataFrame([(0, 0), (0, 1), (1, 1), (1, 2), (2, 1), (2, 2)], ["user", "item"])

alsExplicit = ALS()
defaultModel = alsExplicit.fit(dfRatings)

paramMapExplicit = ParamGridBuilder() \
    .addGrid(alsExplicit.rank, [8, 12]) \
    .addGrid(alsExplicit.maxIter, [10, 15]) \
    .addGrid(alsExplicit.regParam, [1.0, 10.0]) \
    .build()

evaluatorR = RegressionEvaluator(metricName="rmse", labelCol="rating")
cvExplicit = CrossValidator(estimator=alsExplicit, estimatorParamMaps=paramMapExplicit, evaluator=evaluatorR)
cvModelExplicit = cvExplicit.fit(dfRatings)
predsExplicit = cvModelExplicit.bestModel.transform(dfRatingsTest)
predsExplicit.show()
When I try to do this for implicit data (let's say counts of views rather than ratings), I get an error that I can't quite figure out. Here's the code (very similar to the above):
dfCounts = sqlContext.createDataFrame(
    [(0, 0, 0), (0, 1, 12), (0, 2, 3), (1, 0, 5), (1, 1, 9), (1, 2, 0), (2, 0, 0), (2, 1, 11), (2, 2, 25)],
    ["user", "item", "rating"])
dfCountsTest = sqlContext.createDataFrame([(0, 0), (0, 1), (1, 1), (1, 2), (2, 1), (2, 2)], ["user", "item"])

alsImplicit = ALS(implicitPrefs=True)
defaultModelImplicit = alsImplicit.fit(dfCounts)

paramMapImplicit = ParamGridBuilder() \
    .addGrid(alsImplicit.rank, [8, 12]) \
    .addGrid(alsImplicit.maxIter, [10, 15]) \
    .addGrid(alsImplicit.regParam, [1.0, 10.0]) \
    .addGrid(alsImplicit.alpha, [2.0, 3.0]) \
    .build()

evaluatorB = BinaryClassificationEvaluator(metricName="areaUnderROC", labelCol="rating")
evaluatorR = RegressionEvaluator(metricName="rmse", labelCol="rating")
cv = CrossValidator(estimator=alsImplicit, estimatorParamMaps=paramMapImplicit, evaluator=evaluatorR)
cvModel = cv.fit(dfCounts)
predsImplicit = cvModel.bestModel.transform(dfCountsTest)
predsImplicit.show()
I tried doing this with an RMSE evaluator and I got an error. As I understand it, I should also be able to use the AUC metric with the binary classification evaluator, because the predictions of implicit matrix factorization form a confidence matrix c_ui for predictions of a binary preference matrix p_ui, per this paper, which the documentation for PySpark ALS cites.
Using either evaluator gives me an error and I can't find any fruitful discussion about cross-validating implicit ALS models online. I'm looking through the CrossValidator source code to try to figure out what's wrong, but am having trouble. One of my thoughts is that after the process converts the implicit data matrix r_ui to the binary matrix p_ui and confidence matrix c_ui, I'm not sure what it's comparing the predicted c_ui matrix against during the evaluation stage.
Here is the error:
Traceback (most recent call last):
File "<ipython-input-16-6c43b997005e>", line 1, in <module>
cvModel = cv.fit(dfCounts)
File "C:/spark-1.6.1-bin-hadoop2.6/python\pyspark\ml\pipeline.py", line 69, in fit
return self._fit(dataset)
File "C:/spark-1.6.1-bin-hadoop2.6/python\pyspark\ml\tuning.py", line 239, in _fit
model = est.fit(train, epm[j])
File "C:/spark-1.6.1-bin-hadoop2.6/python\pyspark\ml\pipeline.py", line 67, in fit
return self.copy(params)._fit(dataset)
File "C:/spark-1.6.1-bin-hadoop2.6/python\pyspark\ml\wrapper.py", line 133, in _fit
java_model = self._fit_java(dataset)
File "C:/spark-1.6.1-bin-hadoop2.6/python\pyspark\ml\wrapper.py", line 130, in _fit_java
return self._java_obj.fit(dataset._jdf)
File "C:\spark-1.6.1-bin-hadoop2.6\python\lib\py4j-0.9-src.zip\py4j\java_gateway.py", line 813, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "C:/spark-1.6.1-bin-hadoop2.6/python\pyspark\sql\utils.py", line 45, in deco
return f(*a, **kw)
File "C:\spark-1.6.1-bin-hadoop2.6\python\lib\py4j-0.9-src.zip\py4j\protocol.py", line 308, in get_return_value
format(target_id, ".", name), value)
etc.......
UPDATE
I tried scaling the input so it's in the range of 0 to 1 and using an RMSE evaluator. It seems to work well until I try to insert it into the CrossValidator.
The following code works: I get predictions and I get an RMSE value from my evaluator.
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import FloatType
import pyspark.sql.functions as F
from pyspark.ml.recommendation import ALS
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import RegressionEvaluator

conf = SparkConf() \
    .setAppName("ALSPractice") \
    .set("spark.executor.memory", "2g")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

# Users 0, 1, 2, 3 - Items 0, 1, 2, 3, 4, 5 - Ratings 0.0-5.0
dfCounts2 = sqlContext.createDataFrame(
    [(0, 0, 5.0), (0, 1, 5.0), (0, 3, 0.0), (0, 4, 0.0),
     (1, 0, 5.0), (1, 2, 4.0), (1, 3, 0.0), (1, 4, 0.0),
     (2, 0, 0.0), (2, 2, 0.0), (2, 3, 5.0), (2, 4, 5.0),
     (3, 0, 0.0), (3, 1, 0.0), (3, 3, 4.0)],
    ["user", "item", "rating"])

dfCountsTest2 = sqlContext.createDataFrame(
    [(0, 0), (0, 1), (0, 2), (0, 3), (0, 4),
     (1, 0), (1, 1), (1, 2), (1, 3), (1, 4),
     (2, 0), (2, 1), (2, 2), (2, 3), (2, 4),
     (3, 0), (3, 1), (3, 2), (3, 3), (3, 4)], ["user", "item"])

# Normalize rating data to [0,1] range based on max rating
colmax = dfCounts2.select(F.max('rating')).collect()[0][0]
normalize = F.udf(lambda x: x / colmax, FloatType())
dfCountsNorm = dfCounts2.withColumn('ratingNorm', normalize(F.col('rating')))

alsImplicit = ALS(implicitPrefs=True)
defaultModelImplicit = alsImplicit.fit(dfCountsNorm)
preds = defaultModelImplicit.transform(dfCountsTest2)

evaluatorR2 = RegressionEvaluator(metricName="rmse", labelCol="ratingNorm")
evaluatorR2.evaluate(defaultModelImplicit.transform(dfCountsNorm))
What I don't understand is why the following doesn't work: I'm using the same estimator, the same evaluator, and fitting the same data. Why do these work above but not within the CrossValidator?
paramMapImplicit = ParamGridBuilder() \
    .addGrid(alsImplicit.rank, [8, 12]) \
    .addGrid(alsImplicit.maxIter, [10, 15]) \
    .addGrid(alsImplicit.regParam, [1.0, 10.0]) \
    .addGrid(alsImplicit.alpha, [2.0, 3.0]) \
    .build()

cv = CrossValidator(estimator=alsImplicit, estimatorParamMaps=paramMapImplicit, evaluator=evaluatorR2)
cvModel = cv.fit(dfCountsNorm)
Ignoring the technical issues, strictly speaking neither method is correct given the input generated by ALS with implicit feedback:
You cannot use RegressionEvaluator because, as you already know, the prediction can be interpreted as a confidence value and is represented as a floating-point number in the range [0, 1], while the label column is just an unbounded integer. These values are clearly not comparable.
You cannot use BinaryClassificationEvaluator because, even if the prediction can be interpreted as a probability, the label doesn't represent a binary decision. Moreover, the prediction column has an invalid type and couldn't be used directly with BinaryClassificationEvaluator.
You can try to convert one of the columns so that the input fits the requirements, but this is not really a justified approach from a theoretical perspective and introduces additional parameters that are hard to tune:
map the label column to the [0, 1] range and use RMSE.
convert the label column to a binary indicator with a fixed threshold and extend ALS / ALSModel to return the expected column type. Assuming a threshold value of 1, it could be something like this:
from pyspark.ml.recommendation import *
from pyspark.sql.functions import udf, col
from pyspark.mllib.linalg import DenseVector, VectorUDT

class BinaryALS(ALS):
    def fit(self, df):
        assert self.getImplicitPrefs()
        model = super(BinaryALS, self).fit(df)
        return ALSBinaryModel(model._java_obj)

class ALSBinaryModel(ALSModel):
    def transform(self, df):
        transformed = super(ALSBinaryModel, self).transform(df)
        as_vector = udf(lambda x: DenseVector([1 - x, x]), VectorUDT())
        return transformed.withColumn(
            "rawPrediction", as_vector(col("prediction")))

# Add binary label column
with_binary = dfCounts.withColumn(
    "label_binary", (col("rating") > 0).cast("double"))

als_binary_model = BinaryALS(implicitPrefs=True).fit(with_binary)

evaluatorB = BinaryClassificationEvaluator(
    metricName="areaUnderROC", labelCol="label_binary")
evaluatorB.evaluate(als_binary_model.transform(with_binary))
## 1.0
Generally speaking, material about evaluating recommender systems with implicit feedback is largely missing from textbooks; I suggest you read eliasah's answer about evaluating this kind of recommender.
With implicit feedback we don't have user reactions to our recommendations, so we cannot use precision-based metrics.
In the already cited paper, the expected percentile ranking metric is used instead.
You can try to implement an Evaluator based on a similar metric in Spark ML and use it in your cross-validation pipeline, as sketched below.
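A minimal sketch of such a custom evaluator is given below; the class name PercentileRankEvaluator, the "user" / "rating" / "prediction" column names, and the exact aggregation are illustrative assumptions, not Spark API. It computes the expected percentile ranking from the paper (lower is better) over the ALS prediction column:
from pyspark.ml.evaluation import Evaluator
from pyspark.sql import functions as F
from pyspark.sql.window import Window

class PercentileRankEvaluator(Evaluator):
    """Expected percentile ranking: sum(r_ui * rank_ui) / sum(r_ui), where
    rank_ui is the percentile position of item i in user u's recommendation
    list (0.0 = top of the list, 1.0 = bottom)."""

    def __init__(self, labelCol="rating", predictionCol="prediction", userCol="user"):
        super(PercentileRankEvaluator, self).__init__()
        self.labelCol = labelCol
        self.predictionCol = predictionCol
        self.userCol = userCol

    def _evaluate(self, dataset):
        # Rank items per user by descending predicted confidence.
        w = Window.partitionBy(self.userCol).orderBy(F.col(self.predictionCol).desc())
        ranked = dataset.withColumn("percent_rank", F.percent_rank().over(w))
        num, den = ranked.select(
            F.sum(F.col(self.labelCol) * F.col("percent_rank")),
            F.sum(self.labelCol)).first()
        return float(num) / float(den)

    def isLargerBetter(self):
        return False  # a lower expected percentile rank is better

# Hypothetical usage: pass it to CrossValidator like any built-in evaluator
# cv = CrossValidator(estimator=alsImplicit, estimatorParamMaps=paramMapImplicit,
#                     evaluator=PercentileRankEvaluator(labelCol="rating"))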
Very late to the party here, but I'll post in case anyone stumbles upon this question like I did.
I was getting a similar error when trying to use CrossValidator with an ALS model. I resolved it by setting the coldStartStrategy parameter in ALS to "drop". That is:
alsImplicit = ALS(implicitPrefs=True, coldStartStrategy="drop")
and keep the rest of the code the same.
I expect what was happening in my example is that the cross-validation splits created scenarios where I had items in the validation set that did not appear in the training set, which results in NaN prediction values. The best solution is to drop the NaN values when evaluating, as described in the documentation.
I don't know if we were getting the same error so can't guarantee this would solve OP's problem, but it's good practice to set coldStartStrategy="drop" for cross validation anyway.
Note: my error message was "Params must be either a param map or a list/tuple of param maps", which didn't seem to imply an issue with the coldStartStrategy parameter or NaN values but this solution resolved the error.
In order to cross validate my ALS model with implicitPrefs=True, I needed to adapt #zero323's answer slightly for pyspark==2.3.0 where I was getting the following exception:
py4j.Py4JException: Target Object ID does not exist for this gateway :o2733
	at py4j.Gateway.invoke(Gateway.java...java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:214)
	at java.lang.Thread.run(Thread.java:748)
ALS extends JavaEstimator which provides the hooks necessary for fitting Estimators that wrap Java/Scala implementations. We need to override _create_model in BinaryALS so PySpark can keep all the Java object references straight:
import pyspark.sql.functions as F
from pyspark.ml.linalg import DenseVector, VectorUDT
from pyspark.ml.recommendation import ALS, ALSModel
from pyspark.sql.dataframe import DataFrame

class ALSBinaryModel(ALSModel):
    def transform(self, df: DataFrame) -> DataFrame:
        transformed = super().transform(df)
        as_vector = F.udf(lambda x: DenseVector([1 - x, x]), VectorUDT())
        return transformed.withColumn("rawPrediction", as_vector(F.col("prediction")))

class BinaryALS(ALS):
    def fit(self, df: DataFrame) -> ALSBinaryModel:
        assert self.getImplicitPrefs()
        return super().fit(df)

    def _create_model(self, java_model) -> ALSBinaryModel:
        return ALSBinaryModel(java_model=java_model)
