Get survival function with Spark ML - apache-spark

I am training an Accelerated Failure Time (AFT) model with PySpark (from pyspark.ml.regression import AFTSurvivalRegression).
Now I want to apply the model to new data and get, for a time t, the probability that the event happens before t (i.e. one minus the survival function). Which method should I use? The documentation is not clear to me: https://spark.apache.org/docs/2.1.0/api/python/pyspark.ml.html#pyspark.ml.regression.AFTSurvivalRegression
For example, if I do the following:
from pyspark.ml.regression import AFTSurvivalRegression
from pyspark.ml.linalg import Vectors

training = spark.createDataFrame([
    (1.218, 1.0, Vectors.dense(1.560, -0.605)),
    (2.949, 0.0, Vectors.dense(0.346, 2.158)),
    (3.627, 0.0, Vectors.dense(1.380, 0.231)),
    (0.273, 1.0, Vectors.dense(0.520, 1.151)),
    (4.199, 0.0, Vectors.dense(0.795, -0.226))], ["label", "censor", "features"])
quantileProbabilities = [0.25, 0.75]
aft = AFTSurvivalRegression(quantileProbabilities=quantileProbabilities,
                            quantilesCol="quantiles")
model = aft.fit(training)
model.transform(training).show(truncate=False)
I get as output:
Does this mean that, for the first line, P(event happening between 0.832 and 9.48) = 50%?
Thanks
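With quantileProbabilities = [0.25, 0.75], the quantiles column holds the 0.25- and 0.75-quantiles of the predicted event-time distribution, so by definition 50% of the probability mass lies between them. For the survival function itself, here is a minimal sketch, assuming Spark's Weibull AFT parameterization S(t | x) = exp(-(t / exp(β·x + intercept))^(1 / scale)):

import numpy as np

def survival_prob(model, features, t):
    # Weibull AFT: scale parameter lambda = exp(beta·x + intercept),
    # shape parameter k = 1 / model.scale
    lam = np.exp(model.coefficients.dot(features) + model.intercept)
    k = 1.0 / model.scale
    return np.exp(-(t / lam) ** k)

# P(event before time t) = 1 - S(t), e.g.
# 1 - survival_prob(model, Vectors.dense(1.560, -0.605), 2.0)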

Related

The Keras MultiHeadAttention() class does not return expected values

I would like to match the results of the self_attention() function on page 339 of Chollet's book, Deep Learning with Python, second edition, with those of the MultiHeadAttention() example just below on the same page.
I wrote an example with the same input and I get different results. Can somebody explain why? I have included the self_attention() function for clarity.
import numpy as np
from scipy.special import softmax
from tensorflow.keras.layers import MultiHeadAttention

def self_attention(input_sequence):
    # The output will consist of contextual embeddings of the same shape
    output = np.zeros(shape=input_sequence.shape)
    for i, pivot_vector in enumerate(input_sequence):
        scores = np.zeros(shape=(len(input_sequence),))
        for j, vector in enumerate(input_sequence):
            scores[j] = np.dot(pivot_vector, vector.T)  # Q K^T
        scores /= np.sqrt(input_sequence.shape[1])  # / sqrt(d_k)
        scores = softmax(scores)  # softmax(Q K^T / sqrt(d_k))
        print(i, scores)
        new_pivot_representation = np.zeros(shape=pivot_vector.shape)
        for j, vector in enumerate(input_sequence):
            new_pivot_representation += vector * scores[j]
        output[i] = new_pivot_representation
    return output

test_input_sequence = np.array([[[1.0, 0.0, 0.0, 1.0],
                                 [0.0, 1.0, 0.0, 0.0],
                                 [0.0, 1.0, 1.0, 1.0]]])
test_input_sequence.shape
# (1, 3, 4)
self_attention(test_input_sequence[0])
"""
returns
[[0.50648039 0.49351961 0.30719589 0.81367628]
[0.23269654 0.76730346 0.38365173 0.61634827]
[0.21194156 0.78805844 0.57611688 0.78805844]]
the attention scores being:
[0.50648039 0.18632372 0.30719589]
[0.23269654 0.38365173 0.38365173]
[0.21194156 0.21194156 0.57611688]
"""
att_layer = MultiHeadAttention(num_heads=1,
                               key_dim=4,
                               use_bias=False,
                               attention_axes=(1,))
att_layer(test_input_sequence,
          test_input_sequence,
          test_input_sequence,
          return_attention_scores=True)
"""
returns
array([[[-0.46123487, 0.36683324, -0.47130704, -0.00722525],
[-0.49571565, 0.37488416, -0.52883905, -0.02713571],
[-0.4566634 , 0.38055322, -0.45884743, -0.00156384]]],
dtype=float32)
and the attention scores
array([[[[0.31446996, 0.36904442, 0.3164856 ],
[0.34567958, 0.2852166 , 0.36910382],
[0.2934979 , 0.3996053 , 0.30689687]]]], dtype=float32)>)
"""
I found the answer. This is due to the three dense (projection) layers applied to query, key, and value, and the one applied after the attention module (this last dense layer is missing from Fig. 11.8 in the book).
To reproduce the results of self_attention(), we just need to make those dense layers pass-through (identity):
i_4 = np.identity(4)
# Three (4, 1, 4) identity kernels for the query/key/value projections
# (num_heads=1, key_dim=4), plus one (1, 4, 4) identity kernel for the
# output projection (use_bias=False, so there are no bias weights).
w_pt_4 = [i_4.reshape(4, 1, 4) for _ in range(3)] + [i_4.reshape(1, 4, 4)]
att_layer.set_weights(w_pt_4)
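As a quick sanity check (a sketch; the layer was already built by the earlier call, so set_weights applies cleanly), the layer should now reproduce self_attention():

out, scores = att_layer(test_input_sequence,
                        test_input_sequence,
                        test_input_sequence,
                        return_attention_scores=True)
np.testing.assert_allclose(out.numpy()[0],
                           self_attention(test_input_sequence[0]),
                           rtol=1e-5)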

How to load a Spark model

I cannot load a model that I have just saved; I get a strange error.
from transforms.api import Output, transform, transform_df
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.classification import LogisticRegressionModel
import logging

logger = logging.getLogger(__name__)

def save_model(spark_session, output, model, model_name='model4'):
    foundry_file_system = output.filesystem()._foundry_fs
    logger.info("The path 1 is : " + str(foundry_file_system))
    path = foundry_file_system._root_path + "/" + model_name
    logger.info("The path 2 is : " + str(path))
    model.write().overwrite().session(spark_session).save(path)
    model = LogisticRegressionModel.read().session(spark_session).load(path)
    df_to_predict = spark_session.createDataFrame([
        (Vectors.dense([0.0, 1.1, 0.1]),),
        (Vectors.dense([2.0, 1.0, -1.0]),),
        (Vectors.dense([2.0, 1.3, 1.0]),),
        (Vectors.dense([0.0, 1.2, -0.5]),)], ["features"])
    df_predicted = model.transform(df_to_predict)
    logger.info(df_predicted.show())
    logger.info(df_predicted.count())

def my_compute_function(ctx, output_model):
    training = ctx.spark_session.createDataFrame([
        (1.0, Vectors.dense([0.0, 1.1, 0.1])),
        (0.0, Vectors.dense([2.0, 1.0, -1.0])),
        (0.0, Vectors.dense([2.0, 1.3, 1.0])),
        (1.0, Vectors.dense([0.0, 1.2, -0.5]))], ["label", "features"])
    lr = LogisticRegression(maxIter=10, regParam=0.01)
    model1 = lr.fit(training)
    save_model(ctx.spark_session, output_model, model1, 'model4')
Here is the error I get:
NonRetryableError: Py4JJavaError: An error occurred while calling
o266.load. : scala.MatchError:
[2,3,[1,null,null,WrappedArray(0.06817659473873602)],[1,1,3,null,null,WrappedArray(-3.1009356010205322,
2.6082147383214482, -0.38017912254303043),true],false] (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema) at
org.apache.spark.ml.classification.LogisticRegressionModel$LogisticRegressionModelReader.load(LogisticRegression.scala:1273)
....
That error is indicative of using a different method to load the model than the one that was used to write it. You should be using LogisticRegressionModel.load, not LogisticRegression.read().
This can also be caused by the parquet metadata not matching. I recommend setting the summary metadata level to NONE:
spark.conf.set("parquet.summary.metadata.level", "NONE")
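Putting both suggestions together, a minimal sketch (assuming path still points at the saved model and spark is the active session):

from pyspark.ml.classification import LogisticRegressionModel

spark.conf.set("parquet.summary.metadata.level", "NONE")
model = LogisticRegressionModel.load(path)  # match the class that wrote the model
df_predicted = model.transform(df_to_predict)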

Reconstruct Matrix from svd components with Pyspark

I'm working on SVD with PySpark, but neither the documentation nor anywhere else explains how to reconstruct the original matrix from the decomposed factors. For example, using PySpark's SVD I got the U, s and V factors as below.
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg.distributed import RowMatrix

rows = sc.parallelize([
    Vectors.sparse(5, {1: 1.0, 3: 7.0}),
    Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
    Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
])
mat = RowMatrix(rows)

# Compute the top 5 singular values and corresponding singular vectors.
svd = mat.computeSVD(5, computeU=True)
U = svd.U  # The U factor is a RowMatrix.
s = svd.s  # The singular values are stored in a local dense vector.
V = svd.V  # The V factor is a local dense matrix.
Now I want to reconstruct the original matrix by multiplying the factors back together. The equation is:
mat_cal = U · diag(s) · Vᵀ
In plain Python (NumPy) this is easy, but in PySpark I'm not getting the result.
I found this link, but it's in Scala and I don't know how to convert it to PySpark. If someone can guide me, it would be very helpful.
Thanks!
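For reference, the dense NumPy version of that equation, using the same three rows as above, might look like this:

import numpy as np

A = np.array([[0., 1., 0., 7., 0.],
              [2., 0., 3., 4., 5.],
              [4., 0., 0., 6., 7.]])
U_np, s_np, Vt_np = np.linalg.svd(A, full_matrices=False)
A_rec = U_np @ np.diag(s_np) @ Vt_np  # U · diag(s) · Vᵀ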
Convert s to a diagonal matrix Σ:
import numpy as np
from pyspark.mllib.linalg import DenseMatrix
Σ = DenseMatrix(len(s), len(s), np.diag(s).ravel("F"))
Transpose V, convert to column-major order, and then convert back to DenseMatrix:
V_ = DenseMatrix(V.numCols, V.numRows, V.toArray().transpose().ravel("F"))
Multiply:
mat_ = U.multiply(Σ).multiply(V_)
Inspect the results:
for row in mat_.rows.take(3):
    print(row.round(12))
[0. 1. 0. 7. 0.]
[2. 0. 3. 4. 5.]
[4. 0. 0. 6. 7.]
Check the norm:
np.linalg.norm(np.array(rows.collect()) - np.array(mat_.rows.collect()))
1.2222842061189339e-14
Of course, the last two steps are used only for testing and won't be feasible on real-life data.

Exception during xgboost prediction: can not initialize DMatrix from DMatrix

I trained an XGBoost model in Python using the scikit-learn API and serialized it with the pickle library. I uploaded the model to ML Engine, but when I try to do online predictions, I get the following exception:
Prediction failed: Exception during xgboost prediction: can not initialize DMatrix from DMatrix
An example of the json I'm using for prediction is the following:
{
"instances":[
[
24.90625,
21.6435643564356,
20.3762376237624,
24.3679245283019,
30.2075471698113,
28.0947368421053,
16.7797359774725,
14.9262079299572,
17.9888028979966,
15.3333284503293,
19.6535308744024,
17.1501961307627,
0.0,
0.0,
0.0,
0.0,
0.0,
509.0,
497.0,
439.0,
427.0,
407.0,
1.0,
1.0,
1.0,
1.0,
1.0,
2.0,
23.0,
10.0,
58.0,
11.0,
20.0,
23.3617021276596,
23.3617021276596,
23.3617021276596,
23.3617021276596,
23.3617021276596,
23.9423076923077,
26.3082269243683,
23.6212606363851,
22.6752334301282,
27.4343583104833,
34.0090408101173,
11.1991944104063,
7.33420726455092,
8.15160392948917,
11.4119236389594,
17.9429092915607,
18.0573102225845,
32.8902876598084,
-0.00286123032904149,
-0.00286123032904149,
-0.00286123032904149,
-0.00286123032904149,
-0.00286123032904149,
-0.0028328611898017,
0.0534138904223018,
0.0534138904223018,
0.0534138904223018,
0.0534138904223018,
0.0534138904223018,
0.0531491870801522
]
]
}
I use the following code to train my model:
import xgboost as xgb

def _train_model(X, y):
    clf = xgb.XGBClassifier(max_depth=6,
                            learning_rate=0.01,
                            n_estimators=100,
                            n_jobs=-1)
    clf.fit(X, y)
    return clf
Where X and y are both numpy.ndarray:
Type of X: <class 'numpy.ndarray'> Type of y: <class 'numpy.ndarray'>
Also I'm using xgboost 0.72.1, Python 3.5 and ML runtime 1.9.
Does anyone know what the source of the problem might be?
Thanks!
Seems like the issue is due to the pickling. I was able to reproduce it and am working on a fix, but meanwhile could you try exporting your classifier like below instead?
clf._Booster.save_model('./model.bst')
That should unblock you for now. If it doesn't, feel free to reach out to cloudml-feedback@google.com.
I also faced a similar problem (a feature mismatch) when I tried to score test data using a trained XGBoost model that had been dumped in .pkl format.
However, after saving the model in .bst format, I was able to score the same data without any issues. It looks like the .pkl and .bst serialization paths behave differently in XGBoost.
Going a little further, and answering kuza's question above on loading the saved model:
save model:
clf._Booster.save_model('./model.bst')
loading the saved model:
model = xgboost.Booster({'nthread': 4}) # initialize before loading model
model.load_model('./model.bst') # load model
This cleared up two issues that I had with using pickle on the model. Issue 1 was a weird exception: ValueError: feature_names mismatch.
Also check whether you are using predict_proba on the loaded model and getting a weird exception; the fix for that was to use the plain predict function instead of predict_proba.
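To make the predict call concrete, a minimal sketch (the zero row is just a hypothetical placeholder with the same 63 features as the training data; a raw Booster scores a DMatrix, and for a binary objective predict already returns probabilities):

import numpy as np
import xgboost

booster = xgboost.Booster({'nthread': 4})
booster.load_model('./model.bst')

X_new = np.zeros((1, 63))                        # hypothetical input row
preds = booster.predict(xgboost.DMatrix(X_new))  # probabilities, not classes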

GLM with Apache Spark 2.2.0 - Tweedie family default Link value

I am using Spark 2.2.0 with Python. I am trying to figure out which default value of the link param Spark accepts in GeneralizedLinearRegression in the case of the Tweedie family.
When I look at the documentation https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.ml.regression.GeneralizedLinearRegression
class pyspark.ml.regression.GeneralizedLinearRegression(self, labelCol="label", featuresCol="features", predictionCol="prediction", family="gaussian", link=None, fitIntercept=True, maxIter=25, tol=1e-6, regParam=0.0, weightCol=None, solver="irls", linkPredictionCol=None)
it seems that the default value when family='tweedie' should be None, but when I tried this (using a test similar to the unit test https://github.com/apache/spark/pull/17146/files/fe1d3ae36314e385990f024bca94ab1e416476f2):
from pyspark.ml.regression import GeneralizedLinearRegression
from pyspark.ml.linalg import Vectors

df = spark.createDataFrame([(1.0, Vectors.dense(0.0, 0.0)),
                            (1.0, Vectors.dense(1.0, 2.0)),
                            (2.0, Vectors.dense(0.0, 0.0)),
                            (2.0, Vectors.dense(1.0, 1.0))], ["label", "features"])
glr = GeneralizedLinearRegression(family="tweedie", variancePower=1.42, link=None)
model = glr.fit(df)
transformed = model.transform(df)
it raised a null-pointer Java exception:
Py4JJavaError: An error occurred while calling o6739.w. :
java.lang.NullPointerException ...
It works well when I remove the explicit link=None from the initialization of the model:
from pyspark.ml.linalg import Vectors

df = spark.createDataFrame([(1.0, Vectors.dense(0.0, 0.0)),
                            (1.0, Vectors.dense(1.0, 2.0)),
                            (2.0, Vectors.dense(0.0, 0.0)),
                            (2.0, Vectors.dense(1.0, 1.0))], ["label", "features"])
glr = GeneralizedLinearRegression(family="tweedie", variancePower=1.42)
model = glr.fit(df)
transformed = model.transform(df)
I would like to be able to pass a standard set of params like
params = {"family": "Onefamily", "link": "OnelinkAccordingToFamily", ...}
and then initialize the GLM as:
glr = GeneralizedLinearRegression(family=params["family"], link=params["link"], ...)
so that it is more uniform and works for any combination of family and link. It just seems that the link value is not ignored when family="tweedie". Any idea what default value I should use? I tried link='' and link='None', but both raise 'invalid link function'.
For the Tweedie family in GLR you need to define the link through the power link function, specified via the linkPower parameter; you shouldn't set link to None, which is what led to the exception you got.
Here is an example of how to use it:
from pyspark.ml.regression import GeneralizedLinearRegression
from pyspark.ml.linalg import Vectors

df = spark.createDataFrame(
    [(1.0, Vectors.dense(0.0, 0.0)),
     (1.0, Vectors.dense(1.0, 2.0)),
     (2.0, Vectors.dense(0.0, 0.0)),
     (2.0, Vectors.dense(1.0, 1.0))], ["label", "features"])

glr = GeneralizedLinearRegression(family="tweedie", variancePower=1.6)
model = glr.fit(df)                      # the default link power applies
model2 = glr.setLinkPower(-1.0).fit(df)  # explicit link power
PS: the default link power for the Tweedie family is 1 - variancePower.
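Following up on the question's wish for a uniform params dict, here is a sketch under the assumption that tweedie has to be special-cased through linkPower while the other families keep using link (the dict keys and the make_glr helper are illustrative, not a Spark API):

from pyspark.ml.regression import GeneralizedLinearRegression

def make_glr(params):
    # Tweedie is configured via variancePower/linkPower, not via link.
    if params["family"] == "tweedie":
        return GeneralizedLinearRegression(
            family="tweedie",
            variancePower=params["variancePower"],
            linkPower=params.get("linkPower", 1 - params["variancePower"]))
    return GeneralizedLinearRegression(family=params["family"],
                                       link=params["link"])

glr = make_glr({"family": "tweedie", "variancePower": 1.42, "linkPower": 0.0})
model = glr.fit(df)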
