How to compute percentiles in Apache Spark - apache-spark

I have an RDD of integers (i.e. RDD[Int]) and what I would like to do is to compute the following eleven percentiles: [0th, 10th, 20th, ..., 90th, 100th]. What is the most efficient way to do that?

You can:
1. Sort the dataset via rdd.sortBy()
2. Compute the size of the dataset via rdd.count()
3. Zip with index to facilitate percentile retrieval
4. Retrieve the desired percentile via rdd.lookup(), e.g. for the 10th percentile rdd.lookup(0.1 * size)
To compute the median and the 99th percentile:
getPercentiles(rdd, new double[]{0.5, 0.99}, size, numPartitions);
In Java 8:
public static double[] getPercentiles(JavaRDD<Double> rdd, double[] percentiles, long rddSize, int numPartitions) {
    double[] values = new double[percentiles.length];
    JavaRDD<Double> sorted = rdd.sortBy((Double d) -> d, true, numPartitions);
    // Key each value by its position in the sorted order
    JavaPairRDD<Long, Double> indexed = sorted.zipWithIndex().mapToPair((Tuple2<Double, Long> t) -> t.swap());
    for (int i = 0; i < percentiles.length; i++) {
        double percentile = percentiles[i];
        // Clamp so that percentile 1.0 maps to the last element instead of one past the end
        long id = Math.min(rddSize - 1, (long) (rddSize * percentile));
        values[i] = indexed.lookup(id).get(0);
    }
    return values;
}
Note that this requires sorting the dataset, which is O(n log n) and can be expensive on large datasets.
The other answer, which suggests simply computing a histogram, would not compute the percentiles correctly. Here is a counterexample: a dataset composed of 100 numbers, 99 of them being 0 and one being 1. With 10 bins you end up with all 99 zeros in the first bin and the 1 in the last bin, with 8 empty bins in the middle.
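The counterexample can be checked without Spark. The following is a minimal pure-Python sketch, not Spark code: the helper names equal_width_histogram and percentile_by_sort are hypothetical, the data is assumed to be non-empty with at least two distinct values, and the index rule mirrors the clamped (long) (size * percentile) lookup.

```python
def equal_width_histogram(values, bins):
    """Count values in `bins` equal-width buckets spanning [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins  # assumes hi > lo
    counts = [0] * bins
    for v in values:
        # The maximum value belongs to the last bucket, not one past the end
        i = min(int((v - lo) / width), bins - 1)
        counts[i] += 1
    return counts


def percentile_by_sort(values, p):
    """Exact percentile via sort and index, with p in [0, 1]."""
    ordered = sorted(values)
    # Clamp so that p = 1.0 maps to the last element
    return ordered[min(len(ordered) - 1, int(len(ordered) * p))]


data = [0] * 99 + [1]
print(equal_width_histogram(data, 10))  # [99, 0, 0, 0, 0, 0, 0, 0, 0, 1]
print(percentile_by_sort(data, 0.5))    # 0
print(percentile_by_sort(data, 1.0))    # 1
```

The histogram only says that 99% of the values fall somewhere in the first bin, while the sorted lookup returns the actual value at each rank.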

How about t-digest?
https://github.com/tdunning/t-digest
A new data structure for accurate on-line accumulation of rank-based statistics such as quantiles and trimmed means. The t-digest algorithm is also very parallel friendly making it useful in map-reduce and parallel streaming applications.
The t-digest construction algorithm uses a variant of 1-dimensional k-means clustering to produce a data structure that is related to the Q-digest. This t-digest data structure can be used to estimate quantiles or compute other rank statistics. The advantage of the t-digest over the Q-digest is that the t-digest can handle floating point values while the Q-digest is limited to integers. With small changes, the t-digest can handle any values from any ordered set that has something akin to a mean. The accuracy of quantile estimates produced by t-digests can be orders of magnitude more accurate than those produced by Q-digests in spite of the fact that t-digests are more compact when stored on disk.
In summary, the particularly interesting characteristics of the t-digest are that it
has smaller summaries than Q-digest
works on doubles as well as integers.
provides part per million accuracy for extreme quantiles and typically <1000 ppm accuracy for middle quantiles
is fast
is very simple
has a reference implementation that has > 90% test coverage
can be used with map-reduce very easily because digests can be merged
It should be fairly easy to use the reference Java implementation from Spark.

I discovered this gist
https://gist.github.com/felixcheung/92ae74bc349ea83a9e29
that contains the following function:
/**
 * compute percentile from an unsorted Spark RDD
 * @param data: input data set of Long integers
 * @param tile: percentile to compute (eg. 85 percentile)
 * @return value of input data at the specified percentile
 */
def computePercentile(data: RDD[Long], tile: Double): Double = {
  // NIST method; data to be sorted in ascending order
  val r = data.sortBy(x => x)
  val c = r.count()
  if (c == 1) r.first()
  else {
    val n = (tile / 100d) * (c + 1d)
    val k = math.floor(n).toLong
    val d = n - k
    if (k <= 0) r.first()
    else {
      val index = r.zipWithIndex().map(_.swap)
      val last = c
      if (k >= c) {
        index.lookup(last - 1).head
      } else {
        index.lookup(k - 1).head + d * (index.lookup(k).head - index.lookup(k - 1).head)
      }
    }
  }
}
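To see the interpolation arithmetic on its own, here is a minimal pure-Python sketch of the same rank formula, n = (tile / 100) * (count + 1). The function name nist_percentile is hypothetical and the input is assumed to be non-empty; it works on an in-memory list rather than an RDD.

```python
import math


def nist_percentile(values, tile):
    """Percentile with linear interpolation between neighbouring ranks."""
    ordered = sorted(values)
    c = len(ordered)
    if c == 1:
        return float(ordered[0])
    n = (tile / 100.0) * (c + 1)
    k = math.floor(n)  # integer part of the 1-based rank
    d = n - k          # fractional part used for interpolation
    if k <= 0:
        return float(ordered[0])
    if k >= c:
        return float(ordered[-1])
    return ordered[k - 1] + d * (ordered[k] - ordered[k - 1])


print(nist_percentile(range(1, 11), 50))  # 5.5
```

For ten values and tile = 50 the rank is 5.5, so the result lies halfway between the 5th and 6th sorted values.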

If you don't mind converting your RDD to a DataFrame, and using a Hive UDAF, you can use percentile. Assuming you've loaded HiveContext hiveContext into scope:
hiveContext.sql("SELECT percentile(x, array(0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9)) FROM yourDataFrame")
I found out about this Hive UDAF in this answer.

Here is my Python implementation on Spark for calculating the percentile for an RDD containing values of interest.
import numpy as np
def percentile_threshold(ardd, percentile):
    assert percentile > 0 and percentile <= 100, "percentile should be larger than 0 and smaller or equal to 100"
    # Nearest-rank index into the sorted data (0-based)
    index = int(np.ceil(ardd.count() / 100 * percentile - 1))
    return ardd.sortBy(lambda x: x).zipWithIndex().map(lambda x: (x[1], x[0])).lookup(index)[0]
# Now test it out
randlist = list(range(1, 10001))
np.random.shuffle(randlist)
ardd = sc.parallelize(randlist)
print(percentile_threshold(ardd, 0.001))
print(percentile_threshold(ardd, 1))
print(percentile_threshold(ardd, 60.11))
print(percentile_threshold(ardd, 99))
print(percentile_threshold(ardd, 99.999))
print(percentile_threshold(ardd, 100))
# output:
# 1
# 100
# 6011
# 9900
# 10000
# 10000
Separately, I defined the following function to get the 10th to 100th percentile.
def get_percentiles(rdd, stepsize=10):
    percentiles = []
    rddcount100 = rdd.count() / 100
    # Sort the rdd argument once and key each value by its position
    sortedrdd = rdd.sortBy(lambda x: x).zipWithIndex().map(lambda x: (x[1], x[0]))
    for p in range(0, 101, stepsize):
        if p == 0:
            pass
            # I am not aware of a formal definition of 0 percentile,
            # you can put a place holder like this if you want
            # percentiles.append(sortedrdd.lookup(0)[0] - 1)
        elif p == 100:
            percentiles.append(sortedrdd.lookup(int(np.ceil(rddcount100 * 100 - 1)))[0])
        else:
            pv = sortedrdd.lookup(int(np.ceil(rddcount100 * p) - 1))[0]
            percentiles.append(pv)
    return percentiles
randlist = list(range(1, 10001))
np.random.shuffle(randlist)
ardd = sc.parallelize(randlist)
get_percentiles(ardd, 10)
# [1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000]
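The index arithmetic can be verified on a plain list before running it on a cluster. This is a minimal sketch of the nearest-rank rule (rank = ceil(n * p / 100), 1-based); the name nearest_rank_percentiles is hypothetical and nothing here touches Spark.

```python
import math


def nearest_rank_percentiles(values, step=10):
    """Nearest-rank percentiles for p = step, 2 * step, ..., up to 100."""
    ordered = sorted(values)
    n = len(ordered)
    result = []
    for p in range(step, 101, step):
        rank = math.ceil(n * p / 100)  # 1-based rank of the p-th percentile
        result.append(ordered[max(rank, 1) - 1])
    return result


print(nearest_rank_percentiles(range(1, 10001)))
# [1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000]
```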

Convert your RDD into an RDD of Double, and then use the .histogram(10) action. See DoubleRDD ScalaDoc

If N percent is small, like 10% or 20%, then I would do the following:
1. Compute the size of the dataset with rdd.count(); skip this if you already know the size and can pass it as an argument.
2. Rather than sorting the whole dataset, find the top(N) of each partition. To do that, work out N as N% of rdd.count(), then sort each partition and take its top(N). Now you have a much smaller dataset to sort.
3. rdd.sortBy
4. zipWithIndex
5. filter (index < topN)
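The steps above can be sketched in plain Python, with a list of lists standing in for the partitions of an RDD. The function name top_fraction and the sample data are hypothetical; the point is only that each partition can be cut down to its own n largest values before the final sort.

```python
import heapq


def top_fraction(partitions, fraction):
    """Return the largest `fraction` of all values, sorted descending."""
    total = sum(len(p) for p in partitions)
    n = int(total * fraction)
    # A partition can contribute at most n of the global top n,
    # so keeping only its n largest values loses nothing
    candidates = []
    for part in partitions:
        candidates.extend(heapq.nlargest(n, part))
    return heapq.nlargest(n, candidates)


parts = [[5, 1, 9, 3], [8, 2, 7, 4], [6, 10, 0, 11]]
print(top_fraction(parts, 0.25))  # [11, 10, 9]
```

The final sort then runs over at most n values per partition instead of the whole dataset.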

Based on the answer given here Median UDAF in Spark/Scala, I used an UDAF to compute percentiles over spark windows (spark 2.1) :
First an abstract generic UDAF used for other aggregations
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._
import scala.collection.mutable
import scala.collection.mutable.ArrayBuffer
abstract class GenericUDAF extends UserDefinedAggregateFunction {
def inputSchema: StructType =
StructType(StructField("value", DoubleType) :: Nil)
def bufferSchema: StructType = StructType(
StructField("window_list", ArrayType(DoubleType, false)) :: Nil
)
def deterministic: Boolean = true
def initialize(buffer: MutableAggregationBuffer): Unit = {
buffer(0) = new ArrayBuffer[Double]()
}
def update(buffer: MutableAggregationBuffer,input: org.apache.spark.sql.Row): Unit = {
var bufferVal = buffer.getAs[mutable.WrappedArray[Double]](0).toBuffer
bufferVal+=input.getAs[Double](0)
buffer(0) = bufferVal
}
def merge(buffer1: MutableAggregationBuffer, buffer2: org.apache.spark.sql.Row): Unit = {
// Read both buffers as Seq[Double]: the stored array comes back as a WrappedArray, not an ArrayBuffer
buffer1(0) = buffer1.getAs[Seq[Double]](0) ++ buffer2.getAs[Seq[Double]](0)
}
def dataType: DataType
def evaluate(buffer: Row): Any
}
Then the Percentile UDAF customized for deciles :
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._
import scala.collection.mutable
import scala.collection.mutable.ArrayBuffer
class DecilesUDAF extends GenericUDAF {
override def dataType: DataType = ArrayType(DoubleType, false)
override def evaluate(buffer: Row): Any = {
val sortedWindow = buffer.getAs[mutable.WrappedArray[Double]](0).sorted.toBuffer
val windowSize = sortedWindow.size
if (windowSize == 0) return null
if (windowSize == 1) return (0 to 10).map(_ => sortedWindow.head).toArray
(0 to 10).map(i => sortedWindow(Math.min(windowSize-1, i*windowSize/10))).toArray
}
}
The UDAF is then instantiated and called over a partitioned and ordered window :
val deciles = new DecilesUDAF()
df.withColumn("mt_deciles", deciles(col("mt")).over(myWindow))
You can then split the resulting array into multiple columns with getItem :
def splitToColumns(size: Int, splitCol:String)(df: DataFrame) = {
(0 to size).foldLeft(df) {
case (df_arg, i) => df_arg.withColumn("mt_decile_"+i, col(splitCol).getItem(i))
}
}
df.transform(splitToColumns(10, "mt_deciles" ))
The UDAF is slower than native spark functions but as long as each grouped bag or each window is relatively small and fits into a single executor, it should be fine. The main advantage is using spark parallelism.
With little effort, this code could be extended to n-quantiles.
I tested the code using this function :
def testDecilesUDAF = {
val window = W.partitionBy("user")
val deciles = new DecilesUDAF()
val schema = StructType(StructField("mt", DoubleType) :: StructField("user", StringType) :: Nil)
val rows1 = (1 to 20).map(i => Row(i.toDouble, "a"))
val rows2 = (21 to 40).map(i => Row(i.toDouble, "b"))
val df = spark.createDataFrame(spark.sparkContext.makeRDD[Row](rows1++rows2), schema)
df.withColumn("deciles", deciles(col("mt")).over(window))
.transform(splitToColumns(10, "deciles" ))
.drop("deciles")
.show(100, truncate=false)
}
First 3 lines of output :
+----+----+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+------------+
|mt |user|mt_decile_0|mt_decile_1|mt_decile_2|mt_decile_3|mt_decile_4|mt_decile_5|mt_decile_6|mt_decile_7|mt_decile_8|mt_decile_9|mt_decile_10|
+----+----+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+------------+
|21.0|b |21.0 |23.0 |25.0 |27.0 |29.0 |31.0 |33.0 |35.0 |37.0 |39.0 |40.0 |
|22.0|b |21.0 |23.0 |25.0 |27.0 |29.0 |31.0 |33.0 |35.0 |37.0 |39.0 |40.0 |
|23.0|b |21.0 |23.0 |25.0 |27.0 |29.0 |31.0 |33.0 |35.0 |37.0 |39.0 |40.0 |
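The index rule used in evaluate, min(windowSize - 1, i * windowSize / 10), carries over to n-quantiles by replacing 10 with n. Here is a minimal pure-Python sketch of that rule; the function name n_quantiles is hypothetical and it operates on a list instead of an aggregation buffer.

```python
def n_quantiles(values, n):
    """Return n + 1 cut points using index min(size - 1, i * size // n)."""
    ordered = sorted(values)
    size = len(ordered)
    if size == 0:
        return None
    return [ordered[min(size - 1, i * size // n)] for i in range(n + 1)]


print(n_quantiles(range(21, 41), 10))
# [21, 23, 25, 27, 29, 31, 33, 35, 37, 39, 40]
```

With n = 10 and the values 21 to 40, this reproduces the deciles shown for user "b" in the output above.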

Another alternative is to use top and last on an RDD of Double. For example, val percentile_99th_value=scores.top((count/100).toInt).last
This method is more suited for individual percentiles.
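The same idea in plain Python, as a rough sketch: take the k largest values and keep the smallest of them. The name percentile_from_top is hypothetical, and heapq.nlargest plays the role of top(k) here.

```python
import heapq


def percentile_from_top(values, k):
    """Smallest of the k largest values, i.e. the equivalent of top(k).last."""
    return heapq.nlargest(k, values)[-1]


scores = list(range(1, 1001))
k = len(scores) // 100  # top 1% of 1000 values is 10 values
print(percentile_from_top(scores, k))  # 991
```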

Here is my easy approach:
val percentiles = Array(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1)
val accuracy = 1000000
df.stat.approxQuantile("score", percentiles, 1.0/accuracy)
output:
scala> df.stat.approxQuantile("score", percentiles, 1.0/accuracy)
res88: Array[Double] = Array(0.011044141836464405, 0.02022990956902504, 0.0317261666059494, 0.04638145491480827, 0.06498630344867706, 0.0892181545495987, 0.12161539494991302, 0.16825592517852783, 0.24740923941135406, 0.9188197255134583)
accuracy: The accuracy parameter (default: 10000) is a positive numeric literal which controls approximation accuracy at the cost of memory. Higher value of accuracy yields better accuracy, 1.0/accuracy is the relative error of the approximation.
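To make the meaning of the relative error concrete: the Spark documentation for approxQuantile states that for N rows, probability p and error err, the returned value x satisfies floor((p - err) * N) <= rank(x) <= ceil((p + err) * N). The sketch below only evaluates that bound; the function name rank_bounds is hypothetical and the clamping to [0, N] is my own addition.

```python
import math


def rank_bounds(p, relative_error, n):
    """Range of exact ranks that an approximate quantile may fall into."""
    low = max(0, math.floor((p - relative_error) * n))
    high = min(n, math.ceil((p + relative_error) * n))
    return low, high


print(rank_bounds(0.5, 0.125, 1000))  # (375, 625)
```

A smaller relative error narrows this range, at the cost of more memory during the computation.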

Related

Optimizing Pyspark UDF on large data

I am trying to optimize this code that creates a dummy when the column's value (of a pyspark dataframe) is in [categories].
When run on 100K rows, it takes about 30 seconds. In my case I have around 20M rows, which will take a lot of time.
def create_dummy(dframe,col_name,top_name,categories,**options):
    lst_tmp_col = []
    if 'lst_tmp_col' in options:
        lst_tmp_col = options["lst_tmp_col"]
    udf = UserDefinedFunction(lambda x: 1 if x in categories else 0, IntegerType())
    dframe = dframe.withColumn(str(top_name), udf(col(col_name))).cache()
    dframe = dframe.select(lst_tmp_col+ [str(top_name)])
    return dframe
In other words, how do I optimize this function and cut down the total run time given the volume of my data? And how do I make sure that this function does not iterate over my data?
Appreciate your suggestions. Thanks
You don't need a UDF for encoding the categories. You can use isin:
import pyspark.sql.functions as F
def create_dummy(dframe, col_name, top_name, categories, **options):
    lst_tmp_col = []
    if 'lst_tmp_col' in options:
        lst_tmp_col = options["lst_tmp_col"]
    dframe = dframe.withColumn(str(top_name), F.col(col_name).isin(categories).cast("int")).cache()
    dframe = dframe.select(lst_tmp_col + [str(top_name)])
    return dframe

How to generate large word count file in Spark?

I want to generate a 10-million-line word count file for a performance test (each line contains the same sentence), but I have no idea how to code it.
Could you give me example code that saves the file directly to HDFS?
You can try something like this.
Generate one column with values from 1 to 100k and another with values from 1 to 100, then explode both of them with explode(column).
You can't generate one column with 10 Mil values because kryo buffer is gonna throw an error.
I don't know if this is the best performance way to do it, but it is the fastest way I can think right now.
val generateList = udf((s: Int) => {
val buf = scala.collection.mutable.ArrayBuffer.empty[Int]
for(i <- 1 to s) {
buf += i
}
buf
})
val someDF = Seq(
("Lorem ipsum dolor sit amet, consectetur adipiscing elit.")
).toDF("sentence")
val someDfWithMilColumn = someDF.withColumn("genColumn1", generateList(lit(100000)))
.withColumn("genColumn2", generateList(lit(100)))
val someDfWithMilColumn100k = someDfWithMilColumn
.withColumn("expl_val", explode($"genColumn1")).drop("expl_val", "genColumn1")
val someDfWithMilColumn10mil = someDfWithMilColumn100k
.withColumn("expl_val2", explode($"genColumn2")).drop("genColumn2", "expl_val2")
someDfWithMilColumn10mil.write.parquet(path)
You can do it by joining the 2 DFs as below,
Also find the code explanation inline.
import org.apache.spark.sql.SaveMode
object GenerateTenMils {
def main(args: Array[String]): Unit = {
val spark = Constant.getSparkSess
spark.conf.set("spark.sql.crossJoin.enabled","true") // Enable cross join
import spark.implicits._
//Create a DF with your sentence
val df = List("each line has the same sentence").toDF
//Create another Dataset with 10000000 records
spark.range(10000000)
.join(df) // Cross Join the dataframes
.coalesce(1) // Output to a single file
.drop("id") // Drop the extra column
.write
.mode(SaveMode.Overwrite)
.text("src/main/resources/tenMils") // Write as text file
}
}
You could follow this approach.
Use tail recursion to generate the list of sentences and the DataFrames, and union to build the big DataFrame.
val spark = SparkSession
.builder()
.appName("TenMillionsRows")
.master("local[*]")
.config("spark.sql.shuffle.partitions","4") //Change to a more reasonable default number of partitions for our data
.config("spark.app.id","TenMillionsRows") // To silence Metrics warning
.getOrCreate()
val sc = spark.sparkContext
import spark.implicits._
import scala.annotation.tailrec
/**
* Returns a List of nums sentences
* @param sentence
* @param num
* @return
*/
def getList(sentence: String, num: Int) : List[String] = {
@tailrec
def loop(st: String,n: Int, acc: List[String]): List[String] = {
n match {
case num if num == 0 => acc
case _ => loop(st, n - 1, st :: acc)
}
}
loop(sentence,num,List())
}
/**
* Returns a Dataframe that is the union of nums dataframes
* @param lst
* @param num
* @return
*/
def getDataFrame(lst: List[String], num: Int): DataFrame = {
@tailrec
def loop (ls: List[String],n: Int, acc: DataFrame): DataFrame = {
n match {
case n if n == 0 => acc
case _ => loop(lst,n - 1, acc.union(sc.parallelize(ls).toDF("sentence")))
}
}
loop(lst, num, sc.parallelize(List(sentence)).toDF("sentence"))
}
val sentence = "hope for the best but prepare for the worst"
val lSentence = getList(sentence, 100000)
val dfs = getDataFrame(lSentence,100)
println(dfs.count())
// output: 10000001
dfs.write.orc("path_to_hdfs") // write dataframe to a orc file
// you can save the file as parquet, txt, json .......
// with dataframe.write
Hope this helps.

Compounding in Spark

I have a dataframe of this format
Date | Return
01/01/2015 0.0
02/02/2015 -0.02
03/02/2015 0.05
04/02/2015 0.07
I would like to add a column containing the compounded return, which is calculated as:
1 for the 1st row.
(1 + Return(i)) * Compounded(i-1) for every later row.
So my df finally will be
Date | Return | Compounded
01/01/2015 0.0 1.0
02/02/2015 -0.02 1.0*(1-0.02)=0.98
03/02/2015 0.05 0.98*(1+0.05)=1.029
04/02/2015 0.07 1.029*(1+0.07)=1.10103
Answers in Java will be highly appreciated.
You can also create a custom aggregate function and use it in a window function.
Something like this (writing freeform so there probably would be some mistakes):
package com.myuadfs
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._
class MyUDAF() extends UserDefinedAggregateFunction {
def inputSchema: StructType = StructType(Array(StructField("Return", DoubleType)))
def bufferSchema = StructType(Array(StructField("compounded", DoubleType)))
def dataType: DataType = DoubleType
def deterministic = true
def initialize(buffer: MutableAggregationBuffer) = {
buffer(0) = 1.0 // set compounded to 1
}
def update(buffer: MutableAggregationBuffer, input: Row) = {
buffer(0) = buffer.getDouble(0) * ( input.getDouble(0) + 1)
}
// this generally merges two aggregated buffers. This means this
// would not have worked properly had you been working with a regular
// aggregate but since you are planning to use this inside a window
// only this should not be called at all.
def merge(buffer1: MutableAggregationBuffer, buffer2: Row) = {
buffer1(0) = buffer1.getDouble(0) * buffer2.getDouble(0) // partial products combine by multiplication
}
def evaluate(buffer: Row) = {
buffer.getDouble(0)
}
}
Now you can use this inside a window function. Something like this:
import org.apache.spark.sql.expressions.Window
val windowSpec = Window.orderBy("date")
val myUDAF = new MyUDAF()
val newDF = df.withColumn("compounded", myUDAF(df("Return")).over(windowSpec))
Note that this has the limitation that the entire calculation must fit in a single partition, so very large data would be a problem. That said, this kind of operation is normally performed after partitioning by some key (e.g. add a partitionBy to the window), in which case only the rows for a single key need to fit in one partition.
First, we define a function f(line) (suggest a better name, please!!) to process the lines.
def f(line):
    global firstLine
    global last_compounded
    if line[0] == 'Date':
        # Header row: remember that the next row is the first data row
        firstLine = True
        return (line[0], line[1], 'Compounded')
    if firstLine:
        # First data row: the compounded return starts at 1
        last_compounded = 1.0
        firstLine = False
    else:
        last_compounded = (1+float(line[1]))*last_compounded
    return (line[0], line[1], last_compounded)
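Two global variables (could be improved?) keep the Compounded(i-1) value and whether we are processing the first line. Both need an initial value before the map runs (e.g. firstLine = False and last_compounded = 1), and the running value is carried only from row to row within one partition, so the result is correct only when the rows are processed in order in a single partition.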
With your data in some_file, a solution could be:
rdd = sc.textFile('some_file').map(lambda l: l.split())
r1 = rdd.map(lambda l: f(l))
rdd.collect()
[[u'Date', u'Return'], [u'01/01/2015', u'0.0'], [u'02/02/2015', u'-0.02'], [u'03/02/2015', u'0.05'], [u'04/02/2015', u'0.07']]
r1.collect()
[(u'Date', u'Return', 'Compounded'), (u'01/01/2015', u'0.0', 1.0), (u'02/02/2015', u'-0.02', 0.98), (u'03/02/2015', u'0.05', 1.05), (u'04/02/2015', u'0.07', 1.1235000000000002)]
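For reference, the recurrence from the question (1 for the first row, then (1 + Return(i)) * Compounded(i-1)) can be evaluated on a plain list without any shared state. This is a minimal sketch; the function name compounded is hypothetical and the returns are assumed to be already in date order.

```python
def compounded(returns):
    """Running compounded return: 1 for the first row, then multiply by (1 + r)."""
    out = []
    running = 1.0
    for i, r in enumerate(returns):
        if i > 0:
            running *= 1.0 + r
        out.append(running)
    return out


print(compounded([0.0, -0.02, 0.05, 0.07]))
# approximately [1.0, 0.98, 1.029, 1.10103]
```

Because each value depends on the previous one, the rows have to be visited in order, which is the constraint any distributed version has to respect.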

Probability of predictions using Spark LogisticRegressionWithLBFGS for multiclass classification

I am using LogisticRegressionWithLBFGS() to train a model with multiple classes.
From the documentation in mllib it is written that clearThreshold() can be used only if the classification is binary. Is there a way to use something similar for multiclass classification in order to output the probabilities of each class in a given input in the model?
There are two ways to accomplish this. One is to create a method that assumes the responsibility of predictPoint in LogisticRegression.scala
object ClassificationUtility {
def predictPoint(dataMatrix: Vector, model: LogisticRegressionModel):
(Double, Array[Double]) = {
require(dataMatrix.size == model.numFeatures)
val dataWithBiasSize: Int = model.weights.size / (model.numClasses - 1)
val weightsArray: Array[Double] = model.weights match {
case dv: DenseVector => dv.values
case _ =>
throw new IllegalArgumentException(s"weights only supports dense vector but got type ${model.weights.getClass}.")
}
var bestClass = 0
var maxMargin = 0.0
val withBias = dataMatrix.size + 1 == dataWithBiasSize
val classProbabilities: Array[Double] = new Array[Double](model.numClasses)
(0 until model.numClasses - 1).foreach { i =>
var margin = 0.0
dataMatrix.foreachActive { (index, value) =>
if (value != 0.0) margin += value * weightsArray((i * dataWithBiasSize) + index)
}
// Intercept is required to be added into margin.
if (withBias) {
margin += weightsArray((i * dataWithBiasSize) + dataMatrix.size)
}
if (margin > maxMargin) {
maxMargin = margin
bestClass = i + 1
}
classProbabilities(i+1) = 1.0 / (1.0 + Math.exp(-margin))
}
return (bestClass.toDouble, classProbabilities)
}
}
Note it is only slightly different from the original method, it just calculates the logistic as a function of the input features. It also defines some vals and vars that are originally private and included outside of this method. Ultimately, it indexes the scores in an Array and returns it along with the best answer. I call my method like so:
// Compute raw scores on the test set.
val predictionAndLabelsAndProbabilities = test
.map { case LabeledPoint(label, features) =>
val (prediction, probabilities) = ClassificationUtility
.predictPoint(features, model)
(prediction, label, probabilities)}
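One caveat worth illustrating: applying a sigmoid to each margin separately gives per-class scores that do not sum to 1. As far as I understand the multinomial model in MLlib, it stores K - 1 weight vectors and treats class 0 as a pivot with an implicit margin of 0, so normalized probabilities would come from a softmax over [0, margin_1, ..., margin_{K-1}]. The sketch below shows that calculation in plain Python; the function name pivot_class_probabilities is hypothetical and the margins are assumed to be computed already.

```python
import math


def pivot_class_probabilities(margins):
    """Softmax over [0, m_1, ..., m_{K-1}], where class 0 is the pivot class."""
    all_margins = [0.0] + list(margins)
    shift = max(all_margins)  # subtract the max for numerical stability
    exps = [math.exp(m - shift) for m in all_margins]
    total = sum(exps)
    return [e / total for e in exps]


probs = pivot_class_probabilities([2.0, -1.0])
print(probs)       # one probability per class, class 0 first
print(sum(probs))  # sums to 1 up to rounding
```

The class with the largest margin still wins, so the predicted label is unchanged; only the reported scores differ.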
However:
It seems the Spark contributors are discouraging the use of MLlib in favor of ML. The ML logistic regression API currently does not support multi-class classification. I am now using OneVsRest which acts as a wrapper for one vs all classification. You can obtain the raw scores by iterating through the models:
val lr = new LogisticRegression().setFitIntercept(true)
val ovr = new OneVsRest()
ovr.setClassifier(lr)
val ovrModel = ovr.fit(training)
ovrModel.models.zipWithIndex.foreach {
case (model: LogisticRegressionModel, i: Int) =>
model.save(s"model-${model.uid}-$i")
}
val model0 = LogisticRegressionModel.load("model-logreg_457c82141c06-0")
val model1 = LogisticRegressionModel.load("model-logreg_457c82141c06-1")
val model2 = LogisticRegressionModel.load("model-logreg_457c82141c06-2")
Now that you have the individual models, you can obtain the probabilities by calculating the sigmoid of the rawPrediction
def sigmoid(x: Double): Double = {
1.0 / (1.0 + Math.exp(-x))
}
val newPredictionAndLabels0 = model0.transform(newRescaledData)
.select("prediction", "rawPrediction")
.map(row => (row.getDouble(0),
sigmoid(row.getAs[org.apache.spark.mllib.linalg.DenseVector](1).values(1)) ))
newPredictionAndLabels0.foreach(println)
val newPredictionAndLabels1 = model1.transform(newRescaledData)
.select("prediction", "rawPrediction")
.map(row => (row.getDouble(0),
sigmoid(row.getAs[org.apache.spark.mllib.linalg.DenseVector](1).values(1)) ))
newPredictionAndLabels1.foreach(println)
val newPredictionAndLabels2 = model2.transform(newRescaledData)
.select("prediction", "rawPrediction")
.map(row => (row.getDouble(0),
sigmoid(row.getAs[org.apache.spark.mllib.linalg.DenseVector](1).values(1)) ))
newPredictionAndLabels2.foreach(println)

model.predictProbabilities() for LogisticRegression in Spark?

I'm running a multi-class Logistic Regression (withLBFGS) with Spark 1.6.
given x and possible labels {1.0,2.0,3.0}
the final model will only output what is the best prediction, say 2.0.
If I'm interested to know what was the second best prediction, say 3.0, how could I retrieve that information?
In NaiveBayes I would use the model.predictProbabilities() function which for each sample would output a vector with all the probabilities for each possible outcome.
There are two ways to do logistic regression in Spark: spark.ml and spark.mllib.
With DataFrames you can use spark.ml:
import org.apache.spark
import sqlContext.implicits._
def p(label: Double, a: Double, b: Double) =
new spark.mllib.regression.LabeledPoint(
label, new spark.mllib.linalg.DenseVector(Array(a, b)))
val data = sc.parallelize(Seq(p(1.0, 0.0, 0.5), p(0.0, 0.5, 1.0)))
val df = data.toDF
val model = new spark.ml.classification.LogisticRegression().fit(df)
model.transform(df).show
You get the raw predictions and probabilities:
+-----+---------+--------------------+--------------------+----------+
|label| features| rawPrediction| probability|prediction|
+-----+---------+--------------------+--------------------+----------+
| 1.0|[0.0,0.5]|[-19.037302860930...|[5.39764620520461...| 1.0|
| 0.0|[0.5,1.0]|[18.9861466274786...|[0.99999999431904...| 0.0|
+-----+---------+--------------------+--------------------+----------+
With RDDs you can use spark.mllib:
val model = new spark.mllib.classification.LogisticRegressionWithLBFGS().run(data)
This model does not expose the raw predictions and probabilities. You can take a look at predictPoint. It multiplies the vectors and picks the class with the highest prediction. The weights are publicly accessible, so you could copy that algorithm and save the predictions instead of just returning the highest one.
Following the suggestions from @Daniel Darabos:
I tried to use the LogisticRegression function from ml instead of mllib.
Unfortunately it supports only binary logistic regression, not multi-class.
I took a look at predictPoint
and modified it so that it returns the probabilities for each class. Here is what it looks like:
def predictPointForMulticlass(featurizedVector:Vector,weightsArray:Vector,intercept:Double,numClasses:Int,numFeatures:Int) : HashMap[Int, Double] = {
val weightsArraySize = weightsArray.size
val dataWithBiasSize = weightsArraySize / (numClasses - 1)
val withBias = false
var bestClass = 0
var maxMargin = 0.0
var margins = new Array[Double](numClasses - 1)
var temp_marginMap = new HashMap[Int, Double]()
var res = new HashMap[Int, Double]()
(0 until numClasses - 1).foreach { i =>
var margin = 0.0
var index = 0
featurizedVector.toArray.foreach(value => {
if (value != 0.0) {
margin += value * weightsArray((i * dataWithBiasSize) + index)
}
index += 1
}
)
// Intercept is required to be added into margin.
if (withBias) {
margin += weightsArray((i * dataWithBiasSize) + featurizedVector.size)
}
val prob = 1.0 / (1.0 + Math.exp(-margin))
margins(i) = margin
temp_marginMap += (i -> margin)
if(margin > maxMargin) {
maxMargin = margin
bestClass = i + 1
}
}
for ((k,v) <- temp_marginMap){
val calc =probCalc(maxMargin,v)
res += (k -> calc)
}
return res
}
where probCalc() is simply defined as:
def probCalc(maxMargin:Double,margin:Double) :Double ={
val res = 1.0 / (1.0 + Math.exp(-(margin - maxMargin)))
res
}
I'm returning a Hashmap[Int, Double] but that can be changed as you wish.
Hopefully this helps!
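For the original goal of finding the second-best prediction, the per-class margins are enough, since any monotonic transform keeps their order. Here is a minimal sketch in plain Python; the function name rank_classes is hypothetical, and it assumes the K - 1 margins are given with class 0 acting as the pivot class with margin 0.

```python
def rank_classes(margins):
    """Class indices ordered from most to least likely (class 0 has margin 0)."""
    all_margins = [0.0] + list(margins)
    return sorted(range(len(all_margins)), key=lambda k: all_margins[k], reverse=True)


ranking = rank_classes([-1.2, 2.5, 0.7])
print(ranking)     # [2, 3, 0, 1]
print(ranking[1])  # second-best class: 3
```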

Resources