How to define mergeExpressions for a custom DeclarativeAggregate (in catalyst package) - apache-spark

I don't understand the general approach one takes to determine the mergeExpressions function for non-trivial aggregators.
The mergeExpressions method for something like org.apache.spark.sql.catalyst.expressions.aggregate.Average is straightforward:
override lazy val mergeExpressions = Seq(
  /* sum = */ sum.left + sum.right,
  /* count = */ count.left + count.right
)
The mergeExpressions for CentralMomentAgg aggregators is a bit more involved.
What I would like to do is create a WeightedStddevSamp aggregator modeled after Spark's CentralMomentAgg.
I almost have it working, but the weighted standard deviations that it produces are still a little off from what I compute by hand.
I'm having trouble debugging it because I do not understand how to derive the exact logic for the mergeExpressions method.
Below is my code. The updateExpressions method is based on this weighted incremental algorithm, so I'm pretty sure that method is correct. I believe my problem is in the mergeExpressions method. Any hints would be appreciated.
abstract class WeightedCentralMomentAgg(child: Expression, weight: Expression) extends DeclarativeAggregate {
  override def children: Seq[Expression] = Seq(child, weight)
  override def nullable: Boolean = true
  override def dataType: DataType = DoubleType
  override def inputTypes: Seq[AbstractDataType] = Seq(DoubleType, DoubleType)

  protected val wSum = AttributeReference("wSum", DoubleType, nullable = false)()
  protected val mean = AttributeReference("mean", DoubleType, nullable = false)()
  protected val s = AttributeReference("s", DoubleType, nullable = false)()

  override val aggBufferAttributes = Seq(wSum, mean, s)
  override val initialValues: Seq[Expression] = Array.fill(3)(Literal(0.0))

  // See https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Weighted_incremental_algorithm
  override val updateExpressions: Seq[Expression] = {
    val newWSum = wSum + weight
    val newMean = mean + (weight / newWSum) * (child - mean)
    val newS = s + weight * (child - mean) * (child - newMean)
    Seq(
      If(IsNull(child), wSum, newWSum),
      If(IsNull(child), mean, newMean),
      If(IsNull(child), s, newS)
    )
  }

  override val mergeExpressions: Seq[Expression] = {
    val wSum1 = wSum.left
    val wSum2 = wSum.right
    val newWSum = wSum1 + wSum2
    val delta = mean.right - mean.left
    val deltaN = If(newWSum === Literal(0.0), Literal(0.0), delta / newWSum)
    val newMean = mean.left + wSum1 / newWSum * delta // ???
    val newS = s.left + s.right + wSum1 * wSum2 * delta * deltaN // ???
    Seq(newWSum, newMean, newS)
  }
}

// Compute the weighted sample standard deviation of a column
case class WeightedStddevSamp(child: Expression, weight: Expression)
  extends WeightedCentralMomentAgg(child, weight) {

  override val evaluateExpression: Expression = {
    If(wSum === Literal(0.0), Literal.create(null, DoubleType),
      If(wSum === Literal(1.0), Literal(Double.NaN),
        Sqrt(s / wSum)))
  }

  override def prettyName: String = "wtd_stddev_samp"
}

For any hash aggregation, the work is divided into four steps:
1) Initialize the buffer (wSum, mean, s).
2) Within a partition, update the buffer for a key with every input row (updateExpressions is applied once per row).
3) After the shuffle, merge all the buffers for the same key using mergeExpressions. wSum.left means wSum in the left buffer, and wSum.right means wSum in the other buffer (see the sketch after this list).
4) Get the final result from the buffer using evaluateExpression.
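To make step 3 concrete, here is a small plain-Scala sketch (my own illustration, not Catalyst code) of what the merge has to compute for a (wSum, mean, s) buffer, using the weighted analogue of the pairwise-combination formulas from the Wikipedia article linked in the code. Here s is the weighted sum of squared deviations about the mean, i.e. it has not yet been divided by wSum:

// Plain-Scala mirror of the merge step for the buffer layout (wSum, mean, s).
final case class WBuf(wSum: Double, mean: Double, s: Double)

def merge(l: WBuf, r: WBuf): WBuf = {
  val newWSum = l.wSum + r.wSum
  val delta   = r.mean - l.mean
  val deltaN  = if (newWSum == 0.0) 0.0 else delta / newWSum
  WBuf(
    newWSum,
    l.mean + deltaN * r.wSum,                    // shift the left mean toward the right mean
    l.s + r.s + l.wSum * r.wSum * delta * deltaN // cross term for the squared deviations
  )
}

// Example: rows (1.0, w=1) and (2.0, w=1) land in one partition, row (4.0, w=1) in another.
val left  = WBuf(2.0, 1.5, 0.5)
val right = WBuf(1.0, 4.0, 0.0)
println(merge(left, right)) // WBuf(3.0, 2.333..., 4.666...), the same buffer you get
                            // from processing all three rows in a single partition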

I discovered how to write the mergeExpressions function for weighted standard deviation. I actually had it right, but was then using a population-variance rather than a sample-variance calculation in evaluateExpression. The implementation shown below gives the same result as the one above, but is easier to understand.
override val mergeExpressions: Seq[Expression] = {
  val newN = n.left + n.right
  val wSum1 = wSum.left
  val wSum2 = wSum.right
  val newWSum = wSum1 + wSum2
  val delta = mean.right - mean.left
  val deltaN = If(newWSum === Literal(0.0), Literal(0.0), delta / newWSum)
  val newMean = mean.left + deltaN * wSum2
  val newS = (((wSum1 * s.left) + (wSum2 * s.right)) / newWSum) + (wSum1 * wSum2 * deltaN * deltaN)
  Seq(newN, newWSum, newMean, newS)
}
Here are some references:
https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance
http://www.itl.nist.gov/div898/software/dataplot/refman2/ch2/weightsd.pdf
http://people.ds.cam.ac.uk/fanf2/hermes/doc/antiforgery/stats.pdf
https://blog.cordiner.net/2010/06/16/calculating-variance-and-mean-with-mapreduce-python/ (This last one gave me the clue I needed for the mergeExpressions function)
Davies' post gives an outline of the approach, but for many non-trivial aggregators, I think the mergeExpressions function can be quite complex and involve advanced math to determine a correct and efficient solution. Fortunately, in this case, I found someone who had worked it out.
This solution matches what I worked out by hand. It's important to note that evaluateExpression needs to be modified slightly (to s / ((n - 1) * wSum / n)) if you want the sample variance instead of the population variance.
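For reference, here is a hedged sketch of what that adjusted evaluateExpression could look like, assuming an n attribute (a DoubleType row count) is added to the aggregation buffer alongside wSum, mean, and s, as the merge above already implies:

// Sketch only: sample standard deviation using the s / ((n - 1) * wSum / n) correction.
override val evaluateExpression: Expression = {
  If(wSum === Literal(0.0), Literal.create(null, DoubleType),
    If(n === Literal(1.0), Literal(Double.NaN),
      Sqrt(s / ((n - Literal(1.0)) * wSum / n))))
}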

Related

Smart cast to 'Bitmap!' is impossible, because 'textBitmap' is a local variable that is captured by a changing closure

Whenever I build my project I get this error.
Here is the Kotlin class code:
var textBitmap: Bitmap? = null
dynamicItem.dynamicText[imageKey]?.let { drawingText ->
    dynamicItem.dynamicTextPaint[imageKey]?.let { drawingTextPaint ->
        drawTextCache[imageKey]?.let {
            textBitmap = it
        } ?: kotlin.run {
            textBitmap = Bitmap.createBitmap(drawingBitmap.width, drawingBitmap.height, Bitmap.Config.ARGB_8888)
            val drawRect = Rect(0, 0, drawingBitmap.width, drawingBitmap.height)
            val textCanvas = Canvas(textBitmap)
            drawingTextPaint.isAntiAlias = true
            val fontMetrics = drawingTextPaint.getFontMetrics()
            val top = fontMetrics.top
            val bottom = fontMetrics.bottom
            val baseLineY = drawRect.centerY() - top / 2 - bottom / 2
            textCanvas.drawText(drawingText, drawRect.centerX().toFloat(), baseLineY, drawingTextPaint)
            drawTextCache.put(imageKey, textBitmap as Bitmap)
        }
    }
}
I couldn't figure out how to fix it.
Instead of nesting let like that, I would prefer to use guard clauses:
val drawingText = dynamicItem.dynamicText[imageKey] ?: return // or you could assign an empty string `?: ""`
val drawingTextPaint = dynamicItem.dynamicTextPaint[imageKey] ?: return
val textBitmap: Bitmap = drawTextCache[imageKey] ?: Bitmap.createBitmap(drawingBitmap.width, drawingBitmap.height, Bitmap.Config.ARGB_8888).applyCanvas {
    val drawRect = Rect(0, 0, drawingBitmap.width, drawingBitmap.height)
    val fontMetrics = drawingTextPaint.getFontMetrics()
    val top = fontMetrics.top
    val bottom = fontMetrics.bottom
    val baseLineY = drawRect.centerY() - top / 2 - bottom / 2
    drawingTextPaint.isAntiAlias = true
    drawText(drawingText, drawRect.centerX().toFloat(), baseLineY, drawingTextPaint)
}
drawTextCache.put(imageKey, textBitmap)
Basically Kotlin can't smart cast textBitmap to a non-null Bitmap inside that lambda. You're probably getting the error on the Canvas(textBitmap) call, which can't take a null parameter, and the compiler can't guarantee textBitmap isn't null at that moment.
It's a limitation of lambdas referencing external vars that can be changed. I think it's because a lambda could potentially be run at some other time, so no guarantees can be made about what's happening to that external variable and whether something else could have modified it. I don't know the details; there's some chat here if you like.
The fix is pretty easy though, if all you're doing is creating a textBitmap variable and assigning something to it:
// Assign it as a result of the expression - no need to create a var first and keep
// changing the value, no need for a temporary null value, it can just be a val
val textBitmap: Bitmap? =
    dynamicItem.dynamicText[imageKey]?.let { drawingText ->
        dynamicItem.dynamicTextPaint[imageKey]?.let { drawingTextPaint ->
            drawTextCache[imageKey]
                ?: Bitmap.createBitmap(drawingBitmap.width, drawingBitmap.height, Bitmap.Config.ARGB_8888).apply {
                    val drawRect = Rect(0, 0, drawingBitmap.width, drawingBitmap.height)
                    val textCanvas = Canvas(this)
                    drawingTextPaint.isAntiAlias = true
                    val fontMetrics = drawingTextPaint.getFontMetrics()
                    val top = fontMetrics.top
                    val bottom = fontMetrics.bottom
                    val baseLineY = drawRect.centerY() - top / 2 - bottom / 2
                    textCanvas.drawText(drawingText, drawRect.centerX().toFloat(), baseLineY, drawingTextPaint)
                    drawTextCache.put(imageKey, this)
                }
        }
    }
I'd recommend breaking the bitmap creation part out into its own function for readability, and personally I'd avoid the nested lets (because it's not immediately obvious what you get in what situation), but that's a style choice.

Implementing a multithreading function for running "foreach, map and reduce" parallel

I am quite new to Scala, but I am learning about threads and multithreading.
As the title says, I am trying to implement a way to divide the problem across a variable number of threads.
We are given this code:
/** Executes the provided function for each entry in the input sequence in parallel.
  *
  * @param input the input sequence
  * @param parallelism the number of threads to use
  * @param f the function to run
  */
def parallelForeach[A](input: IndexedSeq[A], parallelism: Int, f: A => Unit): Unit = ???
I tried implementing it like this:
def parallelForeach[A](input: IndexedSeq[A], parallelism: Int, f: A => Unit): Unit = {
  if (parallelism < 1) {
    throw new IllegalArgumentException("a degree of parallelism < 1 is not allowed for parallel foreach")
  }
  val threads = (0 until parallelism).map { threadId =>
    val startIndex = threadId * input.size / parallelism
    val endIndex = (threadId + 1) * input.size / parallelism
    val task: Runnable = () => {
      (startIndex until endIndex).foreach { A =>
        val key = input.grouped(input.size / parallelism)
        val x: Unit = input.foreach(A => f(A))
        x
      }
    }
    new Thread(task)
  }
  threads.foreach(_.start())
  threads.foreach(_.join())
}
for this test:
test("parallel foreach should perform the given function once for each element in the sequence") {
  val counter = AtomicLong(0L)
  parallelForeach((1 to 100), 16, counter.addAndGet(_))
  assert(counter.get() == 5050)
}
But, as you can guess, it doesn't work this way: my result isn't 5050 but 505000.
Now here is my question. How do I implement a way to use multithreading efficiently, so that there are, for example, 16 different threads working at the same time?
Check your test: "1 to 100".
With your code, every thread iterates over the full 100 elements once per index in its slice, which is why your result is 100 times too large.
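One possible fix, as a minimal sketch (not an official solution): make each thread apply f only to the elements of its own slice instead of re-iterating the whole input.

def parallelForeach[A](input: IndexedSeq[A], parallelism: Int, f: A => Unit): Unit = {
  require(parallelism >= 1, "a degree of parallelism < 1 is not allowed for parallel foreach")
  val threads = (0 until parallelism).map { threadId =>
    // each thread gets a contiguous, non-overlapping slice of the indices
    val startIndex = threadId * input.size / parallelism
    val endIndex   = (threadId + 1) * input.size / parallelism
    new Thread(() => (startIndex until endIndex).foreach(i => f(input(i))))
  }
  threads.foreach(_.start())
  threads.foreach(_.join())
}

With this version the test above applies the function to each of 1 to 100 exactly once, giving 5050.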

Why am I getting a race condition in multi-threading scala?

I am trying to parallelise a p-norm calculation over an array.
To achieve that I tried the following. I understand I can solve this differently, but I am interested in understanding where the race condition occurs:
val toSum = Array(0, 1, 2, 3, 4, 5, 6)

// Calculate the sum over a segment of an array
def sumSegment(a: Array[Int], p: Double, s: Int, t: Int): Int = {
  val res = { for (i <- s until t) yield scala.math.pow(a(i), p) }.reduceLeft(_ + _)
  res.toInt
}

// Calculate the p-norm over an Array a
def parallelpNorm(a: Array[Int], p: Double): Double = {
  var acc = 0L
  // The worker who should calculate the sum over a slice of an array
  class sumSegmenter(s: Int, t: Int) extends Thread {
    override def run() {
      // Calculate the sum over the slice
      val subsum = sumSegment(a, p, s, t)
      // Add the sum of the slice to the accumulator in a synchronized fashion
      val x = new AnyRef {}
      x.synchronized {
        acc = acc + subsum
      }
    }
  }
  val split = a.size / 2
  val seg_one = new sumSegmenter(0, split)
  val seg_two = new sumSegmenter(split, a.size)
  seg_one.start
  seg_two.start
  seg_one.join
  seg_two.join
  scala.math.pow(acc, 1.0 / p)
}

println(parallelpNorm(toSum, 2))
The expected output is 9.5393920142, but some runs instead give me 9.273618495495704 or even 2.23606797749979.
Any recommendations on where the race condition could happen?
The problem has been explained in the other answer, but a better way to avoid this race condition and improve performance is to use an AtomicInteger:
// Calculate the p-norm over an Array a
def parallelpNorm(a: Array[Int], p: Double): Double = {
  val acc = new AtomicInteger(0)
  // The worker who should calculate the sum over a slice of an array
  class sumSegmenter(s: Int, t: Int) extends Thread {
    override def run() {
      // Calculate the sum over the slice
      val subsum = sumSegment(a, p, s, t)
      // Add the sum of the slice to the accumulator in a synchronized fashion
      acc.getAndAdd(subsum)
    }
  }
  val split = a.length / 2
  val seg_one = new sumSegmenter(0, split)
  val seg_two = new sumSegmenter(split, a.length)
  seg_one.start()
  seg_two.start()
  seg_one.join()
  seg_two.join()
  scala.math.pow(acc.get, 1.0 / p)
}
Modern processors can do atomic operations without blocking, which can be much faster than explicit synchronisation. In my tests this runs twice as fast as the original code (with correct placement of x).
Move val x = new AnyRef {} outside sumSegmenter (that is, into parallelpNorm): the problem is that each thread is using its own mutex rather than sharing one.
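A minimal sketch of that fix, with one lock object shared by both worker threads:

def parallelpNorm(a: Array[Int], p: Double): Double = {
  var acc = 0L
  val lock = new AnyRef // shared by every sumSegmenter instance
  class sumSegmenter(s: Int, t: Int) extends Thread {
    override def run(): Unit = {
      val subsum = sumSegment(a, p, s, t)
      lock.synchronized { acc = acc + subsum } // all threads contend on the same monitor
    }
  }
  val split = a.length / 2
  val seg_one = new sumSegmenter(0, split)
  val seg_two = new sumSegmenter(split, a.length)
  seg_one.start(); seg_two.start()
  seg_one.join(); seg_two.join()
  scala.math.pow(acc.toDouble, 1.0 / p)
}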

Accessing rows outside of window while aggregating in Spark dataframe

In short, in the example below I want to pin 'b to be the value in the row that the result will appear in.
Given:
a,b
1,2
4,6
3,7 ==> 'special would be: (1-7 + 4-7 + 3-7) == -13 in this row
val baseWin = Window.partitionBy("something_I_forgot").orderBy("whatever")
val sumWin = baseWin.rowsBetween(-2, 0)
frame.withColumn("special", sum('a - 'b).over(sumWin))
Or, another way to think of it: I want to close over the row when I calculate the sum, so that I can pass in the value of 'b (in this case 7).
Update:
Here is what I want to accomplish as a UDF. In short, I used a foldLeft:
def mad(field: Column, numPeriods: Integer): Column = {
  val baseWin = Window.partitionBy("exchange", "symbol").orderBy("datetime")
  val win = baseWin.rowsBetween(numPeriods + 1, 0)
  val subFunc: (Seq[Double], Int) => Double = { (input: Seq[Double], numPeriods: Int) =>
    val agg = grizzled.math.stats.mean(input: _*)
    val fooBar = (1.0 / -numPeriods) * input.foldLeft(0.0)((a, b) => a + Math.abs(b - agg))
    fooBar
  }
  val myUdf = udf(subFunc)
  myUdf(collect_list(field.cast(DoubleType)).over(win), lit(numPeriods))
}
If I understood correctly what you're trying to do, I think you can refactor your logic a bit to achieve it. The way you have it right now, you're probably getting "-7" instead of -13.
For the "special" column, (1-7 + 4-7 + 3-7), you can calculate it like (sum(a) - count(*) * b):
dfA.withColumn("special", sum('a).over(win) - count("*").over(win) * 'b)
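To sanity-check this on the sample data, here is a small self-contained sketch; dfA, the grp/ord columns and the window are stand-ins I made up, since the real partitioning and ordering columns aren't shown in the question:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{count, sum}

val spark = SparkSession.builder().master("local[*]").appName("special-check").getOrCreate()
import spark.implicits._

// grp/ord are placeholder partition and order columns
val dfA = Seq((1, 1, 1, 2), (1, 2, 4, 6), (1, 3, 3, 7)).toDF("grp", "ord", "a", "b")
val win = Window.partitionBy("grp").orderBy("ord").rowsBetween(-2, 0)

dfA.withColumn("special", sum($"a").over(win) - count("*").over(win) * $"b").show()
// Last row: sum(a) = 1 + 4 + 3 = 8, count = 3, b = 7, so 8 - 3 * 7 = -13.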

How to compute the distance matrix in spark?

I have tried pairing the samples, but it costs a huge amount of memory, as 100 samples leads to 9900 pairs, which is quite costly. What could be a more effective way of computing the distance matrix in a distributed environment in Spark?
Here is a snippet of pseudocode showing what I'm trying:
val input = sc.textFile("AirPassengers.csv", (numPartitions / 2))
val i = input.map(s => Vectors.dense(s.split(',').map(_.toDouble)))
val indexed = i.zipWithIndex() // including the index of each sample
val indexedData = indexed.map { case (k, v) => (v, k) }

val pairedSamples = indexedData.cartesian(indexedData)

val filteredSamples = pairedSamples.filter { case (x, y) =>
  x._1.toInt > y._1.toInt // to consider only the upper or lower triangle
}
filteredSamples.cache
filteredSamples.count
The above code creates the pairs, but even if my dataset contains only 100 samples, the pairing above results in 4950 pairs, which could be very costly for big data.
I recently answered a similar question.
Basically, it still comes down to computing n(n-1)/2 pairs, which would be 4950 computations in your example. What makes this approach different, however, is that I use joins instead of cartesian. With your code, the solution would look like this:
val input = sc.textFile("AirPassengers.csv", (numPartitions / 2))
val i = input.map(s => Vectors.dense(s.split(',').map(_.toDouble)))
val indexed = i.zipWithIndex()
// including the index of each sample
val indexedData = indexed.map { case (k, v) => (v, k) }

// prepare indices
val count = i.count
val indices = sc.parallelize(for (i <- 0L until count; j <- 0L until count; if i > j) yield (i, j))

val joined1 = indices.join(indexedData).map { case (i, (j, v)) => (j, (i, v)) }
val joined2 = joined1.join(indexedData).map { case (j, ((i, v1), v2)) => ((i, j), (v1, v2)) }

// after that, you can then compute the distance using your distFunc
val distRDD = joined2.mapValues { case (v1, v2) => distFunc(v1, v2) }
Try this method and compare it with the one you already posted. Hopefully, this can speed up your code a bit.
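The last line assumes a distFunc is already defined somewhere. A minimal Euclidean version over MLlib vectors (my assumption, not part of the original answer) could be:

import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Vectors.sqdist returns the squared Euclidean distance, so take the square root.
def distFunc(v1: Vector, v2: Vector): Double =
  math.sqrt(Vectors.sqdist(v1, v2))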
As far as I can see from checking various sources and the Spark MLlib clustering documentation, Spark doesn't currently support distance or pdist matrices.
In my opinion, 100 samples will always output at least 4950 values, so manually creating a distributed matrix solver using a transformation (like .map) would be the best solution.
This can serve as the Java version of jtitusj's answer:
public List<Tuple2<Tuple2<Long, Long>, Double>> getDistanceMatrix(Dataset<Row> ds, String vectorCol) {
    JavaRDD<Vector> rdd = ds.toJavaRDD().map(new Function<Row, Vector>() {
        private static final long serialVersionUID = 1L;

        public Vector call(Row row) throws Exception {
            return row.getAs(vectorCol);
        }
    });

    List<Vector> vectors = rdd.collect();
    long count = ds.count();
    List<Tuple2<Tuple2<Long, Long>, Double>> distanceList = new ArrayList<Tuple2<Tuple2<Long, Long>, Double>>();

    for (long i = 0; i < count; i++) {
        for (long j = 0; j < count && i > j; j++) {
            Tuple2<Long, Long> indexPair = new Tuple2<Long, Long>(i, j);
            double d = DistanceMeasure.getDistance(vectors.get((int) i), vectors.get((int) j));
            distanceList.add(new Tuple2<Tuple2<Long, Long>, Double>(indexPair, d));
        }
    }
    return distanceList;
}
