I'm getting predictions through spark.ml.classification.LogisticRegressionModel.predict. A number of the rows have the prediction column as 1.0 and probability column as .04. The model.getThreshold is 0.5 so I'd assume the model is classifying everything over a 0.5 probability threshold as 1.0.
How am I supposed to interpret a result with a 1.0 prediction and a probability of 0.04?
The probability column from performing a LogisticRegression should contain a list with the same length as the number of classes, where each index gives the corresponding probability for that class. I made a small example with two classes for illustration:
case class Person(label: Double, age: Double, height: Double, weight: Double)
val df = List(Person(0.0, 15, 175, 67),
Person(0.0, 30, 190, 100),
Person(1.0, 40, 155, 57),
Person(1.0, 50, 160, 56),
Person(0.0, 15, 170, 56),
Person(1.0, 80, 180, 88)).toDF()
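val assembler = new VectorAssembler().setInputCols(Array("age", "height", "weight"))
.setOutputCol("features")
// select is a DataFrame method, so it is applied to the transformed DataFrame, not to the assembler
val df2 = assembler.transform(df).select("label", "features")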
df2.show
+-----+------------------+
|label| features|
+-----+------------------+
| 0.0| [15.0,175.0,67.0]|
| 0.0|[30.0,190.0,100.0]|
| 1.0| [40.0,155.0,57.0]|
| 1.0| [50.0,160.0,56.0]|
| 0.0| [15.0,170.0,56.0]|
| 1.0| [80.0,180.0,88.0]|
+-----+------------------+
val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.3).setElasticNetParam(0.8)
val Array(testing, training) = df2.randomSplit(Array(0.7, 0.3))
val model = lr.fit(training)
val predictions = model.transform(testing)
predictions.select("probability", "prediction").show(false)
+----------------------------------------+----------+
|probability |prediction|
+----------------------------------------+----------+
|[0.7487950501224138,0.2512049498775863] |0.0 |
|[0.6458452667523259,0.35415473324767416]|0.0 |
|[0.3888393314864866,0.6111606685135134] |1.0 |
+----------------------------------------+----------+
These are the class probabilities together with the final prediction made by the algorithm. The class with the highest probability is the one predicted.
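To make the rule concrete, here is a minimal plain-Python sketch (not Spark's implementation) of how a binary prediction can be derived from a probability vector. It assumes that index 0 holds the probability of class 0, index 1 that of class 1, and that class 1 is predicted when its probability exceeds the threshold. The helper name predict_from_probability is made up for illustration.

```python
def predict_from_probability(probability, threshold=0.5):
    """Return 1.0 if the class-1 probability exceeds the threshold, else 0.0."""
    # Assumption: probability[0] is P(class 0), probability[1] is P(class 1)
    return 1.0 if probability[1] > threshold else 0.0

# probability vectors copied from the output table above
rows = [
    [0.7487950501224138, 0.2512049498775863],
    [0.6458452667523259, 0.35415473324767416],
    [0.3888393314864866, 0.6111606685135134],
]
predictions = [predict_from_probability(p) for p in rows]
print(predictions)
```

Under that assumption, a row showing a probability of 0.04 next to a 1.0 prediction is consistent if the 0.04 is the first entry of the vector (class 0), which leaves 0.96 for class 1.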
How will a method in Spark treat a vector assembler column? For example, if I have longitude and latitude columns, is it better to assemble them using a vector assembler and then put the result into my model, or does it make no difference if I just put them in directly (separately)?
Example1:
loc_assembler = VectorAssembler(inputCols=['long', 'lat'], outputCol='loc')
vector_assembler = VectorAssembler(inputCols=['loc', 'feature1', 'feature2'], outputCol='features')
lr = LinearRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
pipeline = Pipeline(stages=[loc_assembler, vector_assembler, lr])
Example2:
vector_assembler = VectorAssembler(inputCols=['long', 'lat', 'feature1', 'feature2'], outputCol='features')
lr = LinearRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
pipeline = Pipeline(stages=[vector_assembler, lr])
What is the difference? Which one is better?
There will not be any difference simply because, in both your examples, the final form of the features column will be the same, i.e. in your 1st example, the loc vector will be broken back into its individual components.
Here is a short demonstration with dummy data (leaving the linear regression part aside, as it is unnecessary for this discussion):
spark.version
# u'2.3.1'
# dummy data:
df = spark.createDataFrame([[0, 33.3, -17.5, 10., 0.2],
[1, 40.4, -20.5, 12., 2.2],
[2, 28., -23.9, -2., -1.7],
[3, 29.5, -19.0, -0.5, -0.2],
[4, 32.8, -18.84, 1.5, 1.8]
],
["id","lat", "long", "other", "label"])
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.pipeline import Pipeline
loc_assembler = VectorAssembler(inputCols=['long', 'lat'], outputCol='loc')
vector_assembler = VectorAssembler(inputCols=['loc', 'other'], outputCol='features')
pipeline = Pipeline(stages=[loc_assembler, vector_assembler])
model = pipeline.fit(df)
model.transform(df).show()
The result is:
+---+----+------+-----+-----+-------------+-----------------+
| id| lat| long|other|label| loc| features|
+---+----+------+-----+-----+-------------+-----------------+
| 0|33.3| -17.5| 10.0| 0.2| [-17.5,33.3]|[-17.5,33.3,10.0]|
| 1|40.4| -20.5| 12.0| 2.2| [-20.5,40.4]|[-20.5,40.4,12.0]|
| 2|28.0| -23.9| -2.0| -1.7| [-23.9,28.0]|[-23.9,28.0,-2.0]|
| 3|29.5| -19.0| -0.5| -0.2| [-19.0,29.5]|[-19.0,29.5,-0.5]|
| 4|32.8|-18.84| 1.5| 1.8|[-18.84,32.8]|[-18.84,32.8,1.5]|
+---+----+------+-----+-----+-------------+-----------------+
i.e. the features column is identical to the one in your 2nd example (not shown here), where you do not use the intermediate assembled feature loc.
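The flattening can also be illustrated without Spark. The following is a plain-Python sketch of the idea only, not VectorAssembler itself; the function assemble is a hypothetical stand-in that concatenates scalars and vectors (lists) into one flat list.

```python
def assemble(values):
    """Concatenate scalars and vectors (lists/tuples) into one flat feature list."""
    features = []
    for value in values:
        if isinstance(value, (list, tuple)):
            # a previously assembled vector is broken back into its components
            features.extend(float(v) for v in value)
        else:
            features.append(float(value))
    return features

lon, lat, other = -17.5, 33.3, 10.0   # first row of the dummy data
loc = assemble([lon, lat])            # intermediate vector, as in Example1
nested = assemble([loc, other])       # vector + scalar, as in Example1
direct = assemble([lon, lat, other])  # scalars only, as in Example2
print(nested, direct, nested == direct)
```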
When training a model, say a linear regression, we may normalize the train and test datasets, for example with MinMaxScaler.
After training the model and using it to make predictions, we need to scale the predictions back to the original representation.
In Python (scikit-learn), there is an "inverse_transform" method. For example:
from sklearn.preprocessing import MinMaxScaler
data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
scaler = MinMaxScaler(copy=True, feature_range=(0, 1))
print(data)
dataScaled = scaler.fit(data).transform(data)
print(dataScaled)
# scale the values back to the original representation
print(scaler.inverse_transform(dataScaled))
Is there a similar method in Spark?
I have googled a lot, but found no answer. Can anyone give me some suggestions?
Thank you very much!
In our company, in order to solve the same problem on the StandardScaler, we extended spark.ml with this (among other things):
package org.apache.spark.ml
import org.apache.spark.ml.linalg.DenseVector
import org.apache.spark.ml.util.Identifiable
package object feature {
implicit class RichStandardScalerModel(model: StandardScalerModel) {
private def invertedStdDev(sigma: Double): Double = 1 / sigma
private def invertedMean(mu: Double, sigma: Double): Double = -mu / sigma
def inverse(newOutputCol: String): StandardScalerModel = {
val sigma: linalg.Vector = model.std
val mu: linalg.Vector = model.mean
val newSigma: linalg.Vector = new DenseVector(sigma.toArray.map(invertedStdDev))
val newMu: linalg.Vector = new DenseVector(mu.toArray.zip(sigma.toArray).map { case (m, s) => invertedMean(m, s) })
val inverted: StandardScalerModel = new StandardScalerModel(Identifiable.randomUID("stdScal"), newSigma, newMu)
.setInputCol(model.getOutputCol)
.setOutputCol(newOutputCol)
inverted
.set(inverted.withMean, model.getWithMean)
.set(inverted.withStd, model.getWithStd)
}
}
}
It should be fairly easy to modify it or do something similar for your specific case.
Keep in mind that due to JVM's double implementation, you normally lose precision in these operations, so you will not recover the exact original values you had before the transformation (e.g.: you will probably get something like 1.9999999999999998 instead of 2.0).
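The arithmetic behind the inverted model can be checked in plain Python. This is only a sketch of the idea, assuming both withMean and withStd are enabled so that the scaler computes (x - mean) / std; the helper standardize and the sample numbers are made up for illustration.

```python
import math

def standardize(x, mean, std):
    """Apply (x - mean) / std element-wise, as a standard scaler with mean and std would."""
    return [(xi - m) / s for xi, m, s in zip(x, mean, std)]

mean = [2.0, 10.0]
std = [0.5, 4.0]

# "inverted" parameters, mirroring invertedStdDev and invertedMean above
inv_std = [1 / s for s in std]
inv_mean = [-m / s for m, s in zip(mean, std)]

original = [2.0, 18.0]
scaled = standardize(original, mean, std)
# feeding the scaled values through a scaler with the inverted parameters
restored = standardize(scaled, inv_mean, inv_std)

# compare with a tolerance rather than ==, because of floating-point rounding
print(all(math.isclose(a, b) for a, b in zip(original, restored)))
```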
There is no direct solution here.
Since an array can only be passed to a UDF when it is a column (lit(array) won't do the trick), I am using the following workaround.
In a nutshell, it turns the arrays of scales (means and standard deviations) into strings, passes them to the UDFs, and does the math there.
You can pass the same strings to an inverse function (also included here) to get the inverted values.
Code:
import numpy as np
from pyspark.ml.feature import VectorAssembler, QuantileDiscretizer
from pyspark.ml.linalg import SparseVector, DenseVector, Vectors, VectorUDT
from pyspark.sql.functions import col, lit, udf
df = spark.createDataFrame([
    (0, 1, 0.5, -1),
    (1, 2, 1.0, 1),
    (2, 4, 10.0, 2)
], ["id", 'x1', 'x2', 'x3'])
df.show()
def Normalize(df):
    # keep only the mean and stddev rows of the summary and cast them to double
    scales = df.describe()
    scales = scales.filter("summary = 'mean' or summary = 'stddev'")
    scales = scales.select(["summary"] + [col(c).cast("double") for c in scales.columns[1:]])
    assembler = VectorAssembler(
        inputCols=scales.columns[1:],
        outputCol="X_scales")
    df_scales = assembler.transform(scales)
    x_mean = df_scales.filter("summary = 'mean'").select('X_scales')
    x_std = df_scales.filter("summary = 'stddev'").select('X_scales')
    # encode the scales as '|'-separated string literals so they can be passed to a UDF
    ks_std_lit = lit('|'.join([str(s) for s in list(x_std.collect()[0].X_scales)]))
    ks_mean_lit = lit('|'.join([str(s) for s in list(x_mean.collect()[0].X_scales)]))
    assembler = VectorAssembler(
        inputCols=df.columns[0:4],
        outputCol="features")
    df_features = assembler.transform(df)
    df_features = df_features.withColumn('Scaled', exec_norm_udf(df_features.features, ks_mean_lit, ks_std_lit))
    return df_features, ks_mean_lit, ks_std_lit
def exec_norm(vector, x_mean, x_std):
    # forward transform: (x - mean) / std
    x_mean = [float(s) for s in x_mean.split('|')]
    x_std = [float(s) for s in x_std.split('|')]
    res = (np.array(vector) - np.array(x_mean)) / np.array(x_std)
    res = list(res)
    return Vectors.dense(res)
exec_norm_udf = udf(exec_norm, VectorUDT())
def scaler_invert(vector, x_mean, x_std):
    # inverse transform: x * std + mean
    x_mean = [float(s) for s in x_mean.split('|')]
    x_std = [float(s) for s in x_std.split('|')]
    res = (np.array(vector) * np.array(x_std)) + np.array(x_mean)
    res = list(res)
    return Vectors.dense(res)
scaler_invert_udf = udf(scaler_invert, VectorUDT())
df, scaler_mean, scaler_std = Normalize(df)
df.withColumn('inverted', scaler_invert_udf(df.Scaled, scaler_mean, scaler_std)).show(truncate=False)
Maybe I'm late to the party, but I recently faced exactly the same problem and couldn't find a viable solution.
I presume that the author of this question doesn't need to invert the MinMax values of whole vectors, but only those of a single column.
I also assume that the min and max values of that column, as well as the min-max parameters of the scaler, are known.
Maths behind MinMaxScaler as per scikit learn website:
X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_scaled = X_std * (max - min) + min
"Reverse-engineered" MinMaxScaler formula
X_scaled = (X - Xmin) / (Xmax - Xmin) * (max - min) + min
X = (max * Xmin - min * Xmax - Xmin * X_scaled + Xmax * X_scaled)/(max - min)
Implementation
from sklearn.preprocessing import MinMaxScaler
import pandas
data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
scaler = MinMaxScaler(copy=True, feature_range=(0, 1))
print(data)
dataScaled = scaler.fit(data).transform(data)
data_sp = spark.createDataFrame(pandas.DataFrame(data, columns=["x", "y"]).join(pandas.DataFrame(dataScaled, columns=["x_scaled", "y_scaled"])))
data_sp.show()
print("Inversing column: y_scaled")
Xmax = data_sp.select("y").rdd.max()[0]
Xmin = data_sp.select("y").rdd.min()[0]
_max = scaler.feature_range[1]
_min = scaler.feature_range[0]
print("Xmax =", Xmax, "Xmin =", Xmin, "max =", _max, "min =", _min)
data_sp.withColumn(colName="y_scaled_inversed", col=(_max * Xmin - _min * Xmax - Xmin * data_sp.y_scaled + Xmax * data_sp.y_scaled)/(_max - _min)).show()
Outputs
[[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
+----+---+--------+--------+
| x| y|x_scaled|y_scaled|
+----+---+--------+--------+
|-1.0| 2| 0.0| 0.0|
|-0.5| 6| 0.25| 0.25|
| 0.0| 10| 0.5| 0.5|
| 1.0| 18| 1.0| 1.0|
+----+---+--------+--------+
Inversing column: y_scaled
Xmax = 18 Xmin = 2 max = 1 min = 0
+----+---+--------+--------+-----------------+
| x| y|x_scaled|y_scaled|y_scaled_inversed|
+----+---+--------+--------+-----------------+
|-1.0| 2| 0.0| 0.0| 2.0|
|-0.5| 6| 0.25| 0.25| 6.0|
| 0.0| 10| 0.5| 0.5| 10.0|
| 1.0| 18| 1.0| 1.0| 18.0|
+----+---+--------+--------+-----------------+
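The reverse-engineered formula can also be checked without Spark or scikit-learn. Below is a small plain-Python round trip; the function names and the extra (-1, 1) feature range are my own additions for illustration, and the sample values are the y column from above.

```python
def min_max_scale(x, x_min, x_max, lo=0.0, hi=1.0):
    """Forward formula: X_scaled = (X - Xmin) / (Xmax - Xmin) * (max - min) + min."""
    return (x - x_min) / (x_max - x_min) * (hi - lo) + lo

def min_max_invert(x_scaled, x_min, x_max, lo=0.0, hi=1.0):
    """Inverse formula, solved for X."""
    return (hi * x_min - lo * x_max - x_min * x_scaled + x_max * x_scaled) / (hi - lo)

ys = [2, 6, 10, 18]
x_min, x_max = min(ys), max(ys)

# check the round trip for the default range and for a non-default one
for lo, hi in [(0.0, 1.0), (-1.0, 1.0)]:
    scaled = [min_max_scale(y, x_min, x_max, lo, hi) for y in ys]
    restored = [min_max_invert(s, x_min, x_max, lo, hi) for s in scaled]
    print((lo, hi), scaled, restored)
```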
I have a spark dataframe containing geo-information.
my_df.show(2)
## +----+----+-----------+----------+
## | x0 | x1 | longitude | latitude |
## +----+----+-----------+----------+
## | ...| ...| 51.043 | 13.6847 |
## | ...| ...| 42.6753 | 23.3218 |
I took the longitude and the latitude out of my dataframe and calculated some center points with the kmeans library from pyspark.
#Trains a k-means model
k = 120
model = KMeans.train(dataset, k)
print ("Final centers: " + str(model.clusterCenters))
the output
Final centers: [array([ 51.04307692, 13.68474126]), array([-33.434 , -70.58366667]), array([ 42.67533333, 23.32185981]), array([ 45.876, -61.492]), array([ 53.07465714, 8.4655 ]), array([ 4.594, 114.262]), array([ 48.15665306, 11.54269728]), array([ 51.51729851, 7.49838806]), array([ 48.76316125, 9.15357859]), ....
Does anyone have an idea how to add the matching centers to my dataframe?
## +----+----+-----------+----------+-----------+----------+
## | x0 | x1 | longitude | latitude | mean_long | mean_lat |
## +----+----+-----------+----------+-----------+----------+
## | ...| ...| 51.043 | 13.6847 | 50.000 | 15.000 |
## | ...| ...| 42.6753 | 23.3218 | 50.000 | 15.000 |
If you have decided to use DataFrames, you should use the new pyspark.ml API, not the legacy pyspark.mllib. It provides a number of clustering methods, including K-Means, and the fitted model's transform method will attach a prediction column to the DataFrame.
Please check ML documentation for details (API and required input types):
https://spark.apache.org/docs/latest/ml-clustering.html#k-means
Hope this helps!
(note - I have taken sample data from Spark documentation page)
from pyspark.ml.linalg import Vectors
from pyspark.ml.clustering import KMeans
import pandas as pd
#generate data
data = [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),),
(Vectors.dense([9.0, 8.0]),), (Vectors.dense([8.0, 9.0]),)]
df = sqlContext.createDataFrame(data, ["features"])
df.show()
#run kmeans clustering model
kmeans = KMeans(k=2, seed=1)
model = kmeans.fit(df)
predictions=model.transform(df).withColumnRenamed("prediction","cluster_id")
centers = model.clusterCenters()
#preprocessing centers so that it can be joined with predictions dataframe
centers_p_df = pd.DataFrame(centers)
centers_p_df.insert(0, 'new_col', range(0, len(centers_p_df)))
centers_df = sqlContext.createDataFrame(centers_p_df, schema=['cluster_id','centers_col1','centers_col2'])
final_df = predictions.join(centers_df, on="cluster_id").drop("cluster_id")
final_df.show()
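Conceptually, the prediction followed by the join on cluster_id gives every row the coordinates of its closest center. Here is a plain-Python sketch of that idea, independent of Spark; the two centers and two rows are taken loosely from the question, and nearest_center is a hypothetical helper using squared Euclidean distance.

```python
def nearest_center(point, centers):
    """Return the index of the center closest to point (squared Euclidean distance)."""
    distances = [sum((p - c) ** 2 for p, c in zip(point, center)) for center in centers]
    return distances.index(min(distances))

# illustrative values only, loosely based on the question's data
centers = [(51.04307692, 13.68474126), (42.67533333, 23.32185981)]
rows = [(51.043, 13.6847), (42.6753, 23.3218)]

matched = []
for longitude, latitude in rows:
    cluster_id = nearest_center((longitude, latitude), centers)
    mean_long, mean_lat = centers[cluster_id]
    # original columns followed by the matching center coordinates
    matched.append((longitude, latitude, mean_long, mean_lat))

for row in matched:
    print(row)
```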
I have the following Python test code (the arguments to ALS.train are defined elsewhere):
r1 = (2, 1)
r2 = (3, 1)
test = sc.parallelize([r1, r2])
model = ALS.train(ratings, rank, numIter, lmbda)
predictions = model.predictAll(test)
print test.take(1)
print predictions.count()
print predictions
Which works, because it has a count of 1 against the predictions variable and outputs:
[(2, 1)]
1
ParallelCollectionRDD[2691] at parallelize at PythonRDD.scala:423
However, when I try and use an RDD I created myself using the following code, it doesn't appear to work anymore:
model = ALS.train(ratings, rank, numIter, lmbda)
validation_data = validation.map(lambda xs: tuple(int(x) for x in xs))
predictions = model.predictAll(validation_data)
print validation_data.take(1)
print predictions.count()
print validation_data
Which outputs:
[(61, 3864)]
0
PythonRDD[4018] at RDD at PythonRDD.scala:43
As you can see, predictAll comes back empty when passed the mapped RDD. The values going in are both of the same format. The only noticeable difference that I can see is that the first example uses parallelize and produces a ParallelCollectionRDD whereas the second example just uses a map which produces a PythonRDD. Does predictAll only work if passed a certain type of RDD? If so, is it possible to convert between RDD types? I'm not sure how to get this working.
There are two basic conditions under which MatrixFactorizationModel.predictAll may return an RDD with fewer items than the input:
user is missing in the training set.
product is missing in the training set.
You can easily reproduce this behavior and check that it does not depend on how the RDD was created. First, let's use example data to build a model:
from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating
def parse(s):
    x, y, z = s.split(",")
    return Rating(int(x), int(y), float(z))
ratings = (sc.textFile("data/mllib/als/test.data")
.map(parse)
.union(sc.parallelize([Rating(1, 5, 4.0)])))
model = ALS.train(ratings, 10, 10)
Next, let's see which products and users are present in the training data:
set(ratings.map(lambda r: r.product).collect())
## {1, 2, 3, 4, 5}
set(ratings.map(lambda r: r.user).collect())
## {1, 2, 3, 4}
Now let's create test data and check predictions:
valid_test = sc.parallelize([(2, 5), (1, 4), (3, 5)])
valid_test
## ParallelCollectionRDD[434] at parallelize at PythonRDD.scala:423
model.predictAll(valid_test).count()
## 3
So far so good. Next, let's map it using the same logic as in your code:
valid_test_ = valid_test.map(lambda xs: tuple(int(x) for x in xs))
valid_test_
## PythonRDD[497] at RDD at PythonRDD.scala:43
model.predictAll(valid_test_).count()
## 3
Still fine. Next, let's create invalid data and repeat the experiment:
invalid_test = sc.parallelize([
(2, 6), # No product in the training data
(6, 1) # No user in the training data
])
invalid_test
## ParallelCollectionRDD[500] at parallelize at PythonRDD.scala:423
model.predictAll(invalid_test).count()
## 0
invalid_test_ = invalid_test.map(lambda xs: tuple(int(x) for x in xs))
model.predictAll(invalid_test_).count()
## 0
As expected there are no predictions for invalid input.
Finally, you can confirm this is really the case by using the ML model, whose training and prediction are completely independent of Python code:
from pyspark.ml.recommendation import ALS as MLALS
model_ml = MLALS(rank=10, maxIter=10).fit(
ratings.toDF(["user", "item", "rating"])
)
model_ml.transform((valid_test + invalid_test).toDF(["user", "item"])).show()
## +----+----+----------+
## |user|item|prediction|
## +----+----+----------+
## | 6| 1| NaN|
## | 1| 4| 1.0184212|
## | 2| 5| 4.0041084|
## | 3| 5|0.40498763|
## | 2| 6| NaN|
## +----+----+----------+
As you can see no corresponding user / item in the training data means no prediction.
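The observed behavior can be mimicked in plain Python. This is only a sketch of the effect, not of how predictAll is implemented: the hypothetical helper predictable_pairs keeps a (user, product) pair only when both ids were seen in training. The id sets and test pairs are the ones from the experiment above.

```python
def predictable_pairs(pairs, known_users, known_products):
    """Keep only the (user, product) pairs whose user and product were both seen in training."""
    return [(u, p) for u, p in pairs if u in known_users and p in known_products]

known_users = {1, 2, 3, 4}
known_products = {1, 2, 3, 4, 5}

valid_test = [(2, 5), (1, 4), (3, 5)]
invalid_test = [(2, 6), (6, 1)]   # unknown product, unknown user

print(len(predictable_pairs(valid_test, known_users, known_products)))
print(len(predictable_pairs(invalid_test, known_users, known_products)))
```

So an empty result for a pair such as (61, 3864) suggests checking whether that user and that product actually occur in the training ratings.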
I'm trying to apply a score to a Spark DataFrame using PySpark. Let's assume that I built a simple regression model outside of Spark and want to map the coefficient values created in the model to the individual columns in the DataFrame to create a new column that is the sum of each of the different source columns multiplied by the individual coefficients. I understand that there are many utilities in Spark mllib for modeling, but I want to understand how this 'brute force' method could be accomplished. I also know that DataFrames/RDDs are immutable, so a new DataFrame would have to be created.
Here's some pseudo-code for reference:
#load example data
df = sqlContext.createDataFrame(data)
df.show(5)
df.select("age", "parch", "pclass").show(5)
+----+-----+------+
| age|parch|pclass|
+----+-----+------+
|22.0| 0| 3|
|38.0| 0| 1|
|26.0| 0| 3|
|35.0| 0| 1|
|35.0| 0| 3|
+----+-----+------+
only showing top 5 rows
The model created outside of Spark is a logistic regression model based on a binary response. So essentially I want to map the logit function to these three columns to produce a fourth scored column. Here are the coefficients from the model:
intercept: 3.435222
age: -0.039841
parch: 0.176439
pclass: -1.239452
Here is a description of the logit function for reference:
https://en.wikipedia.org/wiki/Logistic_regression
For comparison, here is how I would do the same thing in R using tidyr and dplyr
library(dplyr)
library(tidyr)
#Example data
Age <- c(22, 38, 26, 35, 35)
Parch <- c(0,0,0,0,0)
Pclass <- c(3, 1, 3, 1, 3)
#Wrapped in a dataframe
mydf <- data.frame(Age, Parch, Pclass)
#Using dplyr to create a new dataframe with mutated column
scoredf = mydf %>%
mutate(score = round(1/(1 + exp(-(3.435 + -0.040 * Age + 0.176 * Parch + -1.239 * Pclass))),2))
scoredf
If I interpret your question correctly, you want to compute the class conditional probability of each sample given the coefficients you computed offline and do it "manually".
Does something like this work:
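import math
from pyspark.sql.functions import col, udf
from pyspark.sql.types import DoubleType
def myLogisticFunc(age, parch, pclass):
    intercept = 3.435222
    betaAge = -0.039841
    betaParch = 0.176439
    betaPclass = -1.239452
    # linear predictor followed by the logistic (sigmoid) function
    z = intercept + betaAge * age + betaParch * parch + betaPclass * pclass
    return 1.0 / (1.0 + math.exp(-z))
# declare the return type explicitly; without it the UDF returns strings
myLogisticFuncUDF = udf(myLogisticFunc, DoubleType())
df.withColumn("score", myLogisticFuncUDF(col("age"), col("parch"), col("pclass"))).show()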
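To sanity-check the scoring logic before running it on a cluster, the same computation can be done in plain Python on the five sample rows from the question. This is just a sketch; the function name score and the hard-coded rows are for illustration.

```python
import math

# coefficients from the model fitted outside of Spark
INTERCEPT = 3.435222
BETA_AGE = -0.039841
BETA_PARCH = 0.176439
BETA_PCLASS = -1.239452

def score(age, parch, pclass):
    """Logistic function applied to the linear combination of the three columns."""
    z = INTERCEPT + BETA_AGE * age + BETA_PARCH * parch + BETA_PCLASS * pclass
    return 1.0 / (1.0 + math.exp(-z))

# (age, parch, pclass) for the five rows shown in the question
rows = [(22.0, 0, 3), (38.0, 0, 1), (26.0, 0, 3), (35.0, 0, 1), (35.0, 0, 3)]
scores = [score(*row) for row in rows]
for row, s in zip(rows, scores):
    print(row, round(s, 2))
```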