ALS in mllib vs ALS in ml - spark [duplicate] - apache-spark

I have the following Python test code (the arguments to ALS.train are defined elsewhere):
r1 = (2, 1)
r2 = (3, 1)
test = sc.parallelize([r1, r2])
model = ALS.train(ratings, rank, numIter, lmbda)
predictions = model.predictAll(test)
print test.take(1)
print predictions.count()
print predictions
This works: the predictions variable has a count of 1, and the output is:
[(2, 1)]
1
ParallelCollectionRDD[2691] at parallelize at PythonRDD.scala:423
However, when I try and use an RDD I created myself using the following code, it doesn't appear to work anymore:
model = ALS.train(ratings, rank, numIter, lmbda)
validation_data = validation.map(lambda xs: tuple(int(x) for x in xs))
predictions = model.predictAll(validation_data)
print validation_data.take(1)
print predictions.count()
print validation_data
Which outputs:
[(61, 3864)]
0
PythonRDD[4018] at RDD at PythonRDD.scala:43
As you can see, predictAll comes back empty when passed the mapped RDD. The values going in are both of the same format. The only noticeable difference I can see is that the first example uses parallelize and produces a ParallelCollectionRDD, whereas the second example just uses a map, which produces a PythonRDD. Does predictAll only work if passed a certain type of RDD? If so, is it possible to convert between RDD types? I'm not sure how to get this working.

There are two basic conditions under which MatrixFactorizationModel.predictAll may return an RDD with fewer items than the input:
the user is missing from the training set,
the product is missing from the training set.
You can easily reproduce this behavior and check that it does not depend on how the RDD was created. First, let's use example data to build a model:
from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating
def parse(s):
    x, y, z = s.split(",")
    return Rating(int(x), int(y), float(z))

ratings = (sc.textFile("data/mllib/als/test.data")
    .map(parse)
    .union(sc.parallelize([Rating(1, 5, 4.0)])))
model = ALS.train(ratings, 10, 10)
Next, let's see which products and users are present in the training data:
set(ratings.map(lambda r: r.product).collect())
## {1, 2, 3, 4, 5}
set(ratings.map(lambda r: r.user).collect())
## {1, 2, 3, 4}
Now let's create test data and check predictions:
valid_test = sc.parallelize([(2, 5), (1, 4), (3, 5)])
valid_test
## ParallelCollectionRDD[434] at parallelize at PythonRDD.scala:423
model.predictAll(valid_test).count()
## 3
So far so good. Next, let's map it using the same logic as in your code:
valid_test_ = valid_test.map(lambda xs: tuple(int(x) for x in xs))
valid_test_
## PythonRDD[497] at RDD at PythonRDD.scala:43
model.predictAll(valid_test_).count()
## 3
Still fine. Next, let's create invalid data and repeat the experiment:
invalid_test = sc.parallelize([
    (2, 6),  # No product in the training data
    (6, 1)   # No user in the training data
])
invalid_test
## ParallelCollectionRDD[500] at parallelize at PythonRDD.scala:423
model.predictAll(invalid_test).count()
## 0
invalid_test_ = invalid_test.map(lambda xs: tuple(int(x) for x in xs))
model.predictAll(invalid_test_).count()
## 0
As expected, there are no predictions for invalid input.
Finally, you can confirm this is really the case by using an ML model, whose training and prediction are completely independent of the Python code:
from pyspark.ml.recommendation import ALS as MLALS
model_ml = MLALS(rank=10, maxIter=10).fit(
    ratings.toDF(["user", "item", "rating"])
)
model_ml.transform((valid_test + invalid_test).toDF(["user", "item"])).show()
## +----+----+----------+
## |user|item|prediction|
## +----+----+----------+
## | 6| 1| NaN|
## | 1| 4| 1.0184212|
## | 2| 5| 4.0041084|
## | 3| 5|0.40498763|
## | 2| 6| NaN|
## +----+----+----------+
As you can see, no corresponding user / item in the training data means no prediction.
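If you only want predictions for pairs the model can actually score, a minimal workaround (my own sketch, not part of the original answer) is to filter the test RDD against the users and products seen in training:
known_users = set(ratings.map(lambda r: r.user).collect())
known_products = set(ratings.map(lambda r: r.product).collect())
# keep only (user, product) pairs present in the training data
filtered_test = (valid_test + invalid_test).filter(
    lambda up: up[0] in known_users and up[1] in known_products)
model.predictAll(filtered_test).count()
## 3
Collecting the full id sets to the driver only scales to modest numbers of users / products; for large data, joining against the training ids would be safer.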

Related

How does a Spark model treat a vector column?

How will a method in Spark treat a VectorAssembler column? For example, if I have longitude and latitude columns, is it better to assemble them using VectorAssembler and then feed the result into my model, or does it make no difference if I just put them in directly (separately)?
Example1:
loc_assembler = VectorAssembler(inputCols=['long', 'lat'], outputCol='loc')
vector_assembler = VectorAssembler(inputCols=['loc', 'feature1', 'feature2'], outputCol='features')
lr = LinearRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
pipeline = Pipeline(stages=[loc_assembler, vector_assembler, lr])
Example2:
vector_assembler = VectorAssembler(inputCols=['long', 'lat', 'feature1', 'feature2'], outputCol='features')
lr = LinearRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
pipeline = Pipeline(stages=[vector_assembler, lr])
What is the difference? Which one is better?
There will not be any difference simply because, in both your examples, the final form of the features column will be the same, i.e. in your 1st example, the loc vector will be broken back into its individual components.
Here is a short demonstration with dummy data (leaving the linear regression part aside, as it is unnecessary for this discussion):
spark.version
# u'2.3.1'
# dummy data:
df = spark.createDataFrame([[0, 33.3, -17.5, 10., 0.2],
                            [1, 40.4, -20.5, 12., 2.2],
                            [2, 28., -23.9, -2., -1.7],
                            [3, 29.5, -19.0, -0.5, -0.2],
                            [4, 32.8, -18.84, 1.5, 1.8]],
                           ["id", "lat", "long", "other", "label"])
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.pipeline import Pipeline
loc_assembler = VectorAssembler(inputCols=['long', 'lat'], outputCol='loc')
vector_assembler = VectorAssembler(inputCols=['loc', 'other'], outputCol='features')
pipeline = Pipeline(stages=[loc_assembler, vector_assembler])
model = pipeline.fit(df)
model.transform(df).show()
The result is:
+---+----+------+-----+-----+-------------+-----------------+
| id| lat| long|other|label| loc| features|
+---+----+------+-----+-----+-------------+-----------------+
| 0|33.3| -17.5| 10.0| 0.2| [-17.5,33.3]|[-17.5,33.3,10.0]|
| 1|40.4| -20.5| 12.0| 2.2| [-20.5,40.4]|[-20.5,40.4,12.0]|
| 2|28.0| -23.9| -2.0| -1.7| [-23.9,28.0]|[-23.9,28.0,-2.0]|
| 3|29.5| -19.0| -0.5| -0.2| [-19.0,29.5]|[-19.0,29.5,-0.5]|
| 4|32.8|-18.84| 1.5| 1.8|[-18.84,32.8]|[-18.84,32.8,1.5]|
+---+----+------+-----+-----+-------------+-----------------+
i.e. the features column is arguably identical to the one in your 2nd example (not shown here), where you do not use the intermediate assembled feature loc...
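For completeness, a quick sketch of the flat variant on the same dummy data (the names here are illustrative); the resulting features column carries exactly the same vectors as above:
vector_assembler_flat = VectorAssembler(inputCols=['long', 'lat', 'other'], outputCol='features')
model_flat = Pipeline(stages=[vector_assembler_flat]).fit(df)
model_flat.transform(df).select('id', 'features').show()
# features again comes out as [-17.5,33.3,10.0], [-20.5,40.4,12.0], ...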

python spark: narrowing down most relevant features using PCA

I am using Spark 2.2 with Python, with PCA from the ml.feature module and VectorAssembler to feed my features to the PCA. To clarify, let's say I have a table with three columns, col1, col2 and col3; then I am doing:
from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(inputCols=table.columns, outputCol="features")
df = assembler.transform(table).select("features")
from pyspark.ml.feature import PCA
pca = PCA(k=2, inputCol="features", outputCol="pcaFeatures")
model = pca.fit(df)
At this time I have run PCA with 2 components and I can look at its values as:
m = model.pc.values.reshape(3, 2)
which corresponds to 3 rows (= the number of columns in my original table) and 2 columns (= the number of components in my PCA). My question is: are the three rows here in the same order in which I specified my input columns to the VectorAssembler above? To clarify further, does the above matrix correspond to:
     | PC1 | PC2 |
-----|-----|-----|
col1 |     |     |
col2 |     |     |
col3 |     |     |
Note that the example here is only for clarity. In my real problem I am dealing with ~1600 columns and a bunch of selections. I could not find any definitive answer to this in the Spark documentation. I want to do this to pick the best columns / features from my original table to train my model, based on the top principal components. Or is there anything else / better in Spark ML PCA that I should be looking at to deduce such a result?
Or can I not use PCA for this, and do I have to use other techniques like Spearman ranking etc.?
are the (...) rows here in the same order in which I had specified my input columns
Yes, they are. Let's trace what is going on:
from pyspark.ml.feature import PCA, VectorAssembler
data = [
    (0.0, 1.0, 0.0, 7.0, 0.0), (2.0, 0.0, 3.0, 4.0, 5.0),
    (4.0, 0.0, 0.0, 6.0, 7.0)
]
df = spark.createDataFrame(data, ["u", "v", "x", "y", "z"])
VectorAssembler follows the order of columns:
assembler = VectorAssembler(inputCols=df.columns, outputCol="features")
vectors = assembler.transform(df).select("features")
vectors.schema[0].metadata
# {'ml_attr': {'attrs': {'numeric': [{'idx': 0, 'name': 'u'},
# {'idx': 1, 'name': 'v'},
# {'idx': 2, 'name': 'x'},
# {'idx': 3, 'name': 'y'},
# {'idx': 4, 'name': 'z'}]},
# 'num_attrs': 5}}
So do the principal components:
model = PCA(inputCol="features", outputCol="pc_features", k=3).fit(vectors)
?model.pc
# Type: property
# String form: <property object at 0x7feb5bdc1d68>
# Docstring:
# Returns a principal components Matrix.
# Each column is one principal component.
#
# .. versionadded:: 2.0.0
Finally, a sanity check:
import numpy as np
x = np.array(data)
y = model.pc.values.reshape(3, 5).transpose()
z = np.array(model.transform(vectors).rdd.map(lambda x: x.pc_features).collect())
np.linalg.norm(x.dot(y) - z)
# 8.881784197001252e-16
You can see the actual order of the columns here
df.schema["features"].metadata["ml_attr"]["attrs"]
There will usually be two classes, "binary" and "numeric":
import pandas as pd

pd.DataFrame(df.schema["features"].metadata["ml_attr"]["attrs"]["binary"] +
             df.schema["features"].metadata["ml_attr"]["attrs"]["numeric"]).sort_values("idx")
This should give the exact order of all the columns. You can verify that the order of input and output columns remains the same.
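Putting the two answers together, here is a small sketch (reusing vectors and model from the first answer's snippets) that labels each row of the loading matrix with its input column name:
import pandas as pd

attrs = vectors.schema["features"].metadata["ml_attr"]["attrs"]["numeric"]
names = [a["name"] for a in sorted(attrs, key=lambda a: a["idx"])]
# model.pc is a (features x components) matrix; its rows follow the
# VectorAssembler input order, i.e. u, v, x, y, z here
loadings = pd.DataFrame(model.pc.toArray(), index=names,
                        columns=["PC%d" % (i + 1) for i in range(model.pc.numCols)])
loadings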

Is there no "inverse_transform" method for a scaler like MinMaxScaler in spark?

When training a model, say a linear regression, we may apply a normalization such as MinMaxScaler on the train and test datasets.
After we have the trained model and use it to make predictions, we need to scale the predictions back to the original representation.
In Python, there is an inverse_transform method for this. For example:
from sklearn.preprocessing import MinMaxScaler

data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
scaler = MinMaxScaler(copy=True, feature_range=(0, 1))
print(data)
dataScaled = scaler.fit(data).transform(data)
print(dataScaled)
scaler.inverse_transform(dataScaled)
Is there a similar method in Spark?
I have googled a lot, but found no answer. Can anyone give me some suggestions?
Thank you very much!
In our company, in order to solve the same problem for the StandardScaler, we extended spark.ml with this (among other things):
package org.apache.spark.ml

import org.apache.spark.ml.linalg.DenseVector
import org.apache.spark.ml.util.Identifiable

package object feature {

  implicit class RichStandardScalerModel(model: StandardScalerModel) {

    private def invertedStdDev(sigma: Double): Double = 1 / sigma

    private def invertedMean(mu: Double, sigma: Double): Double = -mu / sigma

    def inverse(newOutputCol: String): StandardScalerModel = {
      val sigma: linalg.Vector = model.std
      val mu: linalg.Vector = model.mean
      val newSigma: linalg.Vector = new DenseVector(sigma.toArray.map(invertedStdDev))
      val newMu: linalg.Vector = new DenseVector(mu.toArray.zip(sigma.toArray).map { case (m, s) => invertedMean(m, s) })
      val inverted: StandardScalerModel = new StandardScalerModel(Identifiable.randomUID("stdScal"), newSigma, newMu)
        .setInputCol(model.getOutputCol)
        .setOutputCol(newOutputCol)
      inverted
        .set(inverted.withMean, model.getWithMean)
        .set(inverted.withStd, model.getWithStd)
    }
  }
}
It should be fairly easy to modify it or do something similar for your specific case.
Keep in mind that due to the JVM's double implementation, you normally lose precision in these operations, so you will not recover the exact original values you had before the transformation (e.g. you will probably get something like 1.9999999999999998 instead of 2.0).
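For a PySpark-only route, here is a rough sketch of the same idea (assuming the scaler was fitted with withMean=True and withStd=True; train_df is a placeholder DataFrame with an assembled features column):
import numpy as np
from pyspark.ml.feature import StandardScaler
from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql.functions import udf

scaler = StandardScaler(inputCol="features", outputCol="scaled",
                        withMean=True, withStd=True)
scaler_model = scaler.fit(train_df)
mu = scaler_model.mean.toArray()
sigma = scaler_model.std.toArray()
# undo (x - mu) / sigma element-wise
unscale = udf(lambda v: Vectors.dense(np.array(v) * sigma + mu), VectorUDT())
scaler_model.transform(train_df).withColumn("unscaled", unscale("scaled")).show()
The same precision caveat applies here as in the Scala version.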
There is no direct solution here.
Since passing an array to a UDF can only be done when the array is a column (lit(array) won't do the trick), I am using the following workaround.
In a nutshell, it turns the inverted scales array into a string, passes it to the UDF, and does the math there.
You can use that scales string in the inverse function (also attached here) to get the inverted values.
Code:
import numpy as np
from pyspark.sql.functions import col, lit, udf
from pyspark.ml.feature import VectorAssembler, QuantileDiscretizer
from pyspark.ml.linalg import SparseVector, DenseVector, Vectors, VectorUDT

df = spark.createDataFrame([
    (0, 1, 0.5, -1),
    (1, 2, 1.0, 1),
    (2, 4, 10.0, 2)
], ["id", "x1", "x2", "x3"])
df.show()

def Normalize(df):
    # mean / stddev per column via describe()
    scales = df.describe()
    scales = scales.filter("summary = 'mean' or summary = 'stddev'")
    scales = scales.select(["summary"] + [col(c).cast("double") for c in scales.columns[1:]])
    assembler = VectorAssembler(inputCols=scales.columns[1:], outputCol="X_scales")
    df_scales = assembler.transform(scales)
    x_mean = df_scales.filter("summary = 'mean'").select('X_scales')
    x_std = df_scales.filter("summary = 'stddev'").select('X_scales')
    # serialize the scale vectors as '|'-delimited strings so they can be
    # passed to the UDF as literal columns
    ks_std_lit = lit('|'.join([str(s) for s in list(x_std.collect()[0].X_scales)]))
    ks_mean_lit = lit('|'.join([str(s) for s in list(x_mean.collect()[0].X_scales)]))
    assembler = VectorAssembler(inputCols=df.columns[0:4], outputCol="features")
    df_features = assembler.transform(df)
    df_features = df_features.withColumn('Scaled', exec_norm_udf(df_features.features, ks_mean_lit, ks_std_lit))
    return df_features, ks_mean_lit, ks_std_lit

def exec_norm(vector, x_mean, x_std):
    x_mean = [float(s) for s in x_mean.split('|')]
    x_std = [float(s) for s in x_std.split('|')]
    res = (np.array(vector) - np.array(x_mean)) / np.array(x_std)
    return Vectors.dense(list(res))

exec_norm_udf = udf(exec_norm, VectorUDT())

def scaler_invert(vector, x_mean, x_std):
    x_mean = [float(s) for s in x_mean.split('|')]
    x_std = [float(s) for s in x_std.split('|')]
    res = (np.array(vector) * np.array(x_std)) + np.array(x_mean)
    return Vectors.dense(list(res))

scaler_invert_udf = udf(scaler_invert, VectorUDT())

df, scaler_mean, scaler_std = Normalize(df)
df.withColumn('inverted', scaler_invert_udf(df.Scaled, scaler_mean, scaler_std)).show(truncate=False)
Maybe I'm too late to the party; however, I recently faced exactly the same problem and couldn't find any viable solution.
Presume that the author of this question doesn't need to invert the MinMax values of whole vectors, but instead needs to invert only one column.
The min and max values of that column, as well as the min/max parameters of the scaler, are also known.
The math behind MinMaxScaler, as per the scikit-learn website:
X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_scaled = X_std * (max - min) + min
"Reverse-engineered" MinMaxScaler formula
X_scaled = (X - Xmin) / (Xmax) - Xmin) * (max - min) + min
X = (max * Xmin - min * Xmax - Xmin * X_scaled + Xmax * X_scaled)/(max - min)
Implementation
from sklearn.preprocessing import MinMaxScaler
import pandas

data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
scaler = MinMaxScaler(copy=True, feature_range=(0, 1))
print(data)
dataScaled = scaler.fit(data).transform(data)

data_sp = spark.createDataFrame(
    pandas.DataFrame(data, columns=["x", "y"])
    .join(pandas.DataFrame(dataScaled, columns=["x_scaled", "y_scaled"])))
data_sp.show()

print("Inversing column: y_scaled")
Xmax = data_sp.select("y").rdd.max()[0]
Xmin = data_sp.select("y").rdd.min()[0]
_max = scaler.feature_range[1]
_min = scaler.feature_range[0]
print("Xmax =", Xmax, "Xmin =", Xmin, "max =", _max, "min =", _min)

data_sp.withColumn(
    colName="y_scaled_inversed",
    col=(_max * Xmin - _min * Xmax - Xmin * data_sp.y_scaled
         + Xmax * data_sp.y_scaled) / (_max - _min)).show()
Outputs
[[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
+----+---+--------+--------+
| x| y|x_scaled|y_scaled|
+----+---+--------+--------+
|-1.0| 2| 0.0| 0.0|
|-0.5| 6| 0.25| 0.25|
| 0.0| 10| 0.5| 0.5|
| 1.0| 18| 1.0| 1.0|
+----+---+--------+--------+
Inversing column: y_scaled
Xmax = 18 Xmin = 2 max = 1 min = 0
+----+---+--------+--------+-----------------+
| x| y|x_scaled|y_scaled|y_scaled_inversed|
+----+---+--------+--------+-----------------+
|-1.0| 2| 0.0| 0.0| 2.0|
|-0.5| 6| 0.25| 0.25| 6.0|
| 0.0| 10| 0.5| 0.5| 10.0|
| 1.0| 18| 1.0| 1.0| 18.0|
+----+---+--------+--------+-----------------+
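Alternatively, if the scaling itself is done with Spark's own MinMaxScaler rather than scikit-learn's, the fitted model already exposes originalMin / originalMax, so nothing has to be recomputed. A sketch (the variable names are illustrative):
import numpy as np
from pyspark.ml.feature import MinMaxScaler as SparkMinMaxScaler
from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql.functions import udf

train = spark.createDataFrame([(Vectors.dense([2.0]),), (Vectors.dense([6.0]),),
                               (Vectors.dense([10.0]),), (Vectors.dense([18.0]),)],
                              ["features"])
mm = SparkMinMaxScaler(inputCol="features", outputCol="scaled")
mm_model = mm.fit(train)
o_min = mm_model.originalMin.toArray()  # per-feature Xmin
o_max = mm_model.originalMax.toArray()  # per-feature Xmax
lo, hi = mm.getMin(), mm.getMax()       # feature_range, (0, 1) by default
# X = (X_scaled - min) / (max - min) * (Xmax - Xmin) + Xmin
invert = udf(lambda v: Vectors.dense((np.array(v) - lo) / (hi - lo)
                                     * (o_max - o_min) + o_min), VectorUDT())
mm_model.transform(train).withColumn("inverted", invert("scaled")).show()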

Python Spark combineByKey Average

I'm trying to learn Spark in Python, and am stuck with combineByKey for averaging the values in key-value pairs. In fact, my confusion is not with the combineByKey syntax, but with what comes afterward. The typical example (from the O'Reilly 2015 Learning Spark book) can be seen on the web in many places; here's one.
The problem is with the sumCount.map(lambda (key, (totalSum, count)): (key, totalSum / count)).collectAsMap() statement. Using Spark 2.0.1 and iPython 3.5.2, this throws a syntax error exception. Simplifying it to something that should work (and is what's in the O'Reilly book), sumCount.map(lambda key,vals: (key, vals[0]/vals[1])).collectAsMap(), causes Spark to go bats**t crazy with Java exceptions, but I do note a TypeError: <lambda>() missing 1 required positional argument: 'v' error.
Can anyone point me to an example of this functionality that actually works with a recent version of Spark & Python? For completeness, I've included my own minimum working (or rather, non-working) example:
In: pRDD = sc.parallelize([("s",5),("g",3),("g",10),("c",2),("s",10),("s",3),("g",-1),("c",20),("c",2)])
In: cbk = pRDD.combineByKey(lambda x:(x,1), lambda x,y:(x[0]+y,x[1]+1),lambda x,y:(x[0]+y[0],x[1]+y[1]))
In: cbk.collect()
Out: [('s', (18, 3)), ('g', (12, 3)), ('c', (24, 3))]
In: cbk.map(lambda key,val:(k,val[0]/val[1])).collectAsMap() <-- errors
It's easy enough to compute [(e[0],e[1][0]/e[1][1]) for e in cbk.collect()], but I'd rather get the "Sparkic" way working.
Step by step:
lambda (key, (totalSum, count)): ... is so-called Tuple Parameter Unpacking, which has been removed in Python 3.
RDD.map takes a function which expects a single argument. The function you try to use:
lambda key, vals: ...
is a function which expects two arguments, not one. A valid translation from the 2.x syntax would be:
lambda key_vals: (key_vals[0], key_vals[1][0] / key_vals[1][1])
or:
def get_mean(key_vals):
    key, (total, cnt) = key_vals
    return key, total / cnt

cbk.map(get_mean)
You can also make this much simpler with mapValues:
cbk.mapValues(lambda x: x[0] / x[1])
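On the data above, this yields the expected means (consistent with the sums shown by cbk.collect()):
cbk.mapValues(lambda x: x[0] / x[1]).collectAsMap()
## {'s': 6.0, 'g': 4.0, 'c': 8.0}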
Finally, a numerically stable solution would be:
from pyspark.statcounter import StatCounter
(pRDD
    .combineByKey(
        lambda x: StatCounter([x]),
        StatCounter.merge,
        StatCounter.mergeStats)
    .mapValues(StatCounter.mean))
Averaging per value of a specific column can be done using the Window concept. Consider the following code:
import pyspark.sql.functions as F
from pyspark.sql import Window
df = spark.createDataFrame([('a', 2), ('b', 3), ('a', 6), ('b', 5)],
                           ['a', 'i'])
win = Window.partitionBy('a')
df.withColumn('avg', F.avg('i').over(win)).show()
Would yield:
+---+---+---+
| a| i|avg|
+---+---+---+
| b| 3|4.0|
| b| 5|4.0|
| a| 2|4.0|
| a| 6|4.0|
+---+---+---+
The average aggregation is done on each worker separately, requires no round trip to the driver, and is therefore efficient.
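If you do not need the average repeated on every row, the same per-key means can be obtained with a plain aggregation (a quick sketch on the same df):
df.groupBy('a').agg(F.avg('i').alias('avg')).show()
# one row per key: ('a', 4.0) and ('b', 4.0)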

Apply Model Scores to Spark DataFrame - Python

I'm trying to apply a score to a Spark DataFrame using PySpark. Let's assume that I built a simple regression model outside of Spark and want to map the coefficient values created in the model to the individual columns in the DataFrame to create a new column that is the sum of each of the different source columns multiplied by the individual coefficients. I understand that there are many utilities in Spark mllib for modeling, but I want to understand how this 'brute force' method could be accomplished. I also know that DataFrames/RDDs are immutable, so a new DataFrame would have to be created.
Here's some pseudo-code for reference:
#load example data
df = sqlContext.createDataFrame(data)
df.show(5)
dfmappd.select("age", "parch", "pclass").show(5)
+----+-----+------+
| age|parch|pclass|
+----+-----+------+
|22.0| 0| 3|
|38.0| 0| 1|
|26.0| 0| 3|
|35.0| 0| 1|
|35.0| 0| 3|
+----+-----+------+
only showing top 5 rows
The model created outside of Spark is a logistic regression model based on a binary response. So essentially I want to map the logit function to these three columns to produce a fourth scored column. Here are the coefficients from the model:
intercept: 3.435222
age: -0.039841
parch: 0.176439
pclass: -1.239452
Here is a description of the logit function for reference:
https://en.wikipedia.org/wiki/Logistic_regression
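Concretely, the score to reproduce for each row is score = 1 / (1 + exp(-(intercept + coef_age * age + coef_parch * parch + coef_pclass * pclass))).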
For comparison, here is how I would do the same thing in R using tidyr and dplyr
library(dplyr)
library(tidyr)
#Example data
Age <- c(22, 38, 26, 35, 35)
Parch <- c(0,0,0,0,0)
Pclass <- c(3, 1, 3, 1, 3)
#Wrapped in a dataframe
mydf <- data.frame(Age, Parch, Pclass)
#Using dplyr to create a new dataframe with mutated column
scoredf = mydf %>%
  mutate(score = round(1 / (1 + exp(-(3.435 + -0.040 * Age + 0.176 * Parch + -1.239 * Pclass))), 2))
scoredf
If I interpret your question correctly, you want to compute the class conditional probability of each sample given the coefficients you computed offline and do it "manually".
Does something like this work?
import math
from pyspark.sql.functions import col, udf

def myLogisticFunc(age, parch, pclass):
    intercept = 3.435222
    betaAge = -0.039841
    betaParch = 0.176439
    betaPclass = -1.239452
    z = intercept + betaAge * age + betaParch * parch + betaPclass * pclass
    return 1.0 / (1.0 + math.exp(-z))

myLogisticFuncUDF = udf(myLogisticFunc)
df.withColumn("score", myLogisticFuncUDF(col("age"), col("parch"), col("pclass"))).show()
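As a quick sanity check (my own arithmetic, mirroring the R snippet above), for the first row (age=22, parch=0, pclass=3):
import math

z = 3.435222 - 0.039841 * 22 + 0.176439 * 0 - 1.239452 * 3  # ≈ -1.1596
round(1.0 / (1.0 + math.exp(-z)), 2)
## 0.24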
