Accessing rows outside of window while aggregating in Spark dataframe - apache-spark

In short, in the example below I want to pin 'b to the value it has in the row where the result will appear.
Given:
a,b
1,2
4,6
3,7 ==> 'special would be: (1-7 + 4-7 + 3-7) == -13 in this row
val baseWin = Window.partitionBy("something_I_forgot").orderBy("whatever")
val win = baseWin.rowsBetween(-2, 0)
frame.withColumn("special", sum('a - 'b).over(win))
Another way to think of it: I want to close over the current row when calculating the sum, so that I can pass in that row's value of 'b (7 in this case).
Update
Here is what I want to accomplish as a UDF. In short, I used a foldLeft.
import org.apache.spark.sql.Column
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{collect_list, lit, udf}
import org.apache.spark.sql.types.DoubleType

// Note: numPeriods is passed as a negative number (e.g. -20 for a 20-row lookback),
// which makes both rowsBetween(numPeriods + 1, 0) and the 1.0 / -numPeriods factor work out.
def mad(field: Column, numPeriods: Integer): Column = {
  val baseWin = Window.partitionBy("exchange", "symbol").orderBy("datetime")
  val win = baseWin.rowsBetween(numPeriods + 1, 0)
  val subFunc: (Seq[Double], Int) => Double = { (input: Seq[Double], numPeriods: Int) =>
    val agg = grizzled.math.stats.mean(input: _*)
    (1.0 / -numPeriods) * input.foldLeft(0.0)((a, b) => a + Math.abs(b - agg))
  }
  val myUdf = udf(subFunc)
  myUdf(collect_list(field.cast(DoubleType)).over(win), lit(numPeriods))
}

If I understood correctly what you're trying to do, I think you can refactor your logic a bit to achieve it. With what you have right now, you're probably getting -7 instead of -13.
For the "special" column, (1-7 + 4-7 + 3-7), you can calculate it as (sum(a) - count(*) * b):
dfA.withColumn("special", sum('a).over(win) - count("*").over(win) * 'b)
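As a quick sanity check of this on the sample data, here is a hypothetical spark-shell style snippet (toDF and the 'a symbol syntax come from the shell's pre-imported implicits; the toy data has no real partition/order columns, so it simply orders by b):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{count, sum}

val frame = Seq((1, 2), (4, 6), (3, 7)).toDF("a", "b")
val win = Window.orderBy("b").rowsBetween(-2, 0)

frame.withColumn("special", sum('a).over(win) - count("*").over(win) * 'b).show()
// the b = 7 row should show special = (1 + 4 + 3) - 3 * 7 = -13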

Related

treeAggregate use case explanation

I am trying to understand treeAggregate, but there aren't enough examples online.
Does the following code merge the elements of each partition and then call makeSummary, do the same for every partition in parallel (summing the results and summarizing again), and then, with depth set to (let's say) 5, repeat this 5 times?
The result I want is to keep summarizing the arrays until only one of them remains.
val summary = input.transform(rdd => {
  rdd.treeAggregate(initialSet)(addToSet, mergePartitionSets, 5)
  // this returns Array[Double], not an rdd, but still
})

val initialSet = Array.empty[Double]

def addToSet = (s: Array[Double], v: (Int, Array[Double])) => {
  val p = s ++ v._2
  val ret = makeSummary(p, 10000)
  ret
}

val mergePartitionSets = (p1: Array[Double], p2: Array[Double]) => {
  val p = p1 ++ p2
  val ret = makeSummary(p, 10000)
  ret
}
// makeSummary selects half of the points of p randomly
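For what it's worth, here is a minimal self-contained sketch of how the three pieces passed to treeAggregate fit together: the seqOp folds one element into a per-partition buffer, the combOp merges two partial buffers, and depth controls how many levels of partial merging happen before the driver sees the result. The data, the makeSummary stand-in, and the limits below are made up purely for illustration; assume sc is an existing SparkContext (e.g. in spark-shell).
// Crude stand-in for makeSummary: keep every other element once the array grows past `limit`.
def makeSummary(p: Array[Double], limit: Int): Array[Double] =
  if (p.length <= limit) p else p.zipWithIndex.collect { case (x, i) if i % 2 == 0 => x }

val data = sc.parallelize(Seq.tabulate(1000)(i => Array(i.toDouble)), numSlices = 8)

val summary: Array[Double] = data.treeAggregate(Array.empty[Double])(
  (buf, arr) => makeSummary(buf ++ arr, 100), // seqOp: fold one element into the partition's buffer
  (b1, b2) => makeSummary(b1 ++ b2, 100),     // combOp: merge two partial buffers
  depth = 3                                   // levels of tree-style merging before the final result
)
println(summary.length)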

How to define mergeExpressions for a custom DeclarativeAggregate (in catalyst package)

I don't understand the general approach one takes to determine the mergeExpressions function for non-trivial aggregators.
The mergeExpressions method for something like org.apache.spark.sql.catalyst.expressions.aggregate.Average is straightforward:
override lazy val mergeExpressions = Seq(
  /* sum = */ sum.left + sum.right,
  /* count = */ count.left + count.right
)
The mergeExpressions for CentralMomentAgg aggregators is a bit more involved.
What I would like to do is create a WeightedStddevSamp aggregator modeled after Spark's CentralMomentAgg.
I almost have it working, but the weighted standard deviations it produces are still a little off from what I compute by hand.
I'm having trouble debugging it because I do not understand how to derive the exact logic for the mergeExpressions method.
Below is my code. The updateExpressions method is based on this weighted incremental algorithm, so I'm pretty sure that method is correct. I believe my problem is in the mergeExpressions method. Any hints would be appreciated.
abstract class WeightedCentralMomentAgg(child: Expression, weight: Expression) extends DeclarativeAggregate {

  override def children: Seq[Expression] = Seq(child, weight)
  override def nullable: Boolean = true
  override def dataType: DataType = DoubleType
  override def inputTypes: Seq[AbstractDataType] = Seq(DoubleType, DoubleType)

  protected val wSum = AttributeReference("wSum", DoubleType, nullable = false)()
  protected val mean = AttributeReference("mean", DoubleType, nullable = false)()
  protected val s = AttributeReference("s", DoubleType, nullable = false)()

  override val aggBufferAttributes = Seq(wSum, mean, s)
  override val initialValues: Seq[Expression] = Array.fill(3)(Literal(0.0))

  // See https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Weighted_incremental_algorithm
  override val updateExpressions: Seq[Expression] = {
    val newWSum = wSum + weight
    val newMean = mean + (weight / newWSum) * (child - mean)
    val newS = s + weight * (child - mean) * (child - newMean)
    Seq(
      If(IsNull(child), wSum, newWSum),
      If(IsNull(child), mean, newMean),
      If(IsNull(child), s, newS)
    )
  }

  override val mergeExpressions: Seq[Expression] = {
    val wSum1 = wSum.left
    val wSum2 = wSum.right
    val newWSum = wSum1 + wSum2
    val delta = mean.right - mean.left
    val deltaN = If(newWSum === Literal(0.0), Literal(0.0), delta / newWSum)
    val newMean = mean.left + wSum1 / newWSum * delta // ???
    val newS = s.left + s.right + wSum1 * wSum2 * delta * deltaN // ???
    Seq(newWSum, newMean, newS)
  }
}

// Compute the weighted sample standard deviation of a column
case class WeightedStddevSamp(child: Expression, weight: Expression)
  extends WeightedCentralMomentAgg(child, weight) {

  override val evaluateExpression: Expression = {
    If(wSum === Literal(0.0), Literal.create(null, DoubleType),
      If(wSum === Literal(1.0), Literal(Double.NaN),
        Sqrt(s / wSum)))
  }

  override def prettyName: String = "wtd_stddev_samp"
}
For any hash aggregation, the work is divided into four steps:
1) Initialize the buffer (wSum, mean, s).
2) Within a partition, update the buffer for each key given all the input (call updateExpressions for each input row).
3) After shuffling, merge all the buffers for the same key using mergeExpressions; wSum.left means wSum in the left buffer, wSum.right means wSum in the other buffer.
4) Get the final result from the buffer using evaluateExpression.
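To make these four steps concrete, here is a plain-Scala sketch (not Catalyst code; the names are made up) of the same lifecycle for the simple Average buffer shown earlier:
// Buffer = (sum, count), mirroring Average's aggregation buffer
case class AvgBuf(sum: Double, count: Long)

val init = AvgBuf(0.0, 0L)                                                   // 1) initialize the buffer
def update(b: AvgBuf, x: Double) = AvgBuf(b.sum + x, b.count + 1)            // 2) per-row update within a partition
def merge(l: AvgBuf, r: AvgBuf) = AvgBuf(l.sum + r.sum, l.count + r.count)   // 3) merge buffers after the shuffle
def evaluate(b: AvgBuf): Double = b.sum / b.count                            // 4) produce the final value

val partition1 = Seq(1.0, 4.0).foldLeft(init)(update)
val partition2 = Seq(3.0).foldLeft(init)(update)
println(evaluate(merge(partition1, partition2)))                             // (1 + 4 + 3) / 3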
I discovered how to write the mergeExpressions function for weighted standard deviation. I actually had it right, but was using a population variance rather than a sample variance calculation in evaluateExpression. The implementation shown below gives the same result as above, but it is easier to understand.
override val mergeExpressions: Seq[Expression] = {
  // n is an additional count attribute added to the aggregation buffer
  // (used for the sample-variance correction described below)
  val newN = n.left + n.right
  val wSum1 = wSum.left
  val wSum2 = wSum.right
  val newWSum = wSum1 + wSum2
  val delta = mean.right - mean.left
  val deltaN = If(newWSum === Literal(0.0), Literal(0.0), delta / newWSum)
  val newMean = mean.left + deltaN * wSum2
  val newS = (((wSum1 * s.left) + (wSum2 * s.right)) / newWSum) + (wSum1 * wSum2 * deltaN * deltaN)
  Seq(newN, newWSum, newMean, newS)
}
Here are some references
https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance
http://www.itl.nist.gov/div898/software/dataplot/refman2/ch2/weightsd.pdf
http://people.ds.cam.ac.uk/fanf2/hermes/doc/antiforgery/stats.pdf
https://blog.cordiner.net/2010/06/16/calculating-variance-and-mean-with-mapreduce-python/ (This last one gave me the clue I needed for the mergeExpressions function)
Davies' post gives an outline of the approach, but for many non-trivial aggregators, I think the mergeExpressions function can be quite complex and involve advanced math to determine a correct and efficient solution. Fortunately, in this case, I found someone who had worked it out.
This solution matches what I worked out by hand. It's important to note that evaluateExpression needs to be modified slightly (to s / ((n-1)*wSum/n)) if you want sample variance instead of population variance.
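For anyone trying to convince themselves of the merge step, here is a small plain-Scala sketch (not Catalyst code; the names are illustrative) that applies the same mean/s merge formulas as the corrected mergeExpressions above, treating s as the weighted population variance of each partial buffer, and checks them against a direct single-pass computation:
// Partial aggregation buffer: total weight, weighted mean, weighted (population) variance
case class Buf(wSum: Double, mean: Double, s: Double)

// Merge two buffers with the same formulas as the corrected mergeExpressions above
def merge(a: Buf, b: Buf): Buf = {
  val newWSum = a.wSum + b.wSum
  val delta = b.mean - a.mean
  val deltaN = if (newWSum == 0.0) 0.0 else delta / newWSum
  val newMean = a.mean + deltaN * b.wSum
  val newS = ((a.wSum * a.s) + (b.wSum * b.s)) / newWSum + (a.wSum * b.wSum * deltaN * deltaN)
  Buf(newWSum, newMean, newS)
}

// Direct single-pass weighted mean / population variance, for comparison
def direct(xs: Seq[(Double, Double)]): Buf = { // (value, weight) pairs
  val w = xs.map(_._2).sum
  val mean = xs.map { case (x, wi) => x * wi }.sum / w
  val s = xs.map { case (x, wi) => wi * (x - mean) * (x - mean) }.sum / w
  Buf(w, mean, s)
}

val left = Seq((1.0, 2.0), (2.0, 1.0))
val right = Seq((4.0, 3.0), (6.0, 1.0))
println(merge(direct(left), direct(right))) // merged buffer ...
println(direct(left ++ right))              // ... should match the single-pass result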

I want to collect the data frame column values in an array list to conduct some computations, is it possible?

I am loading data from Phoenix like this:
val tableDF = sqlContext.phoenixTableAsDataFrame("Hbtable", Array("ID", "distance"), conf = configuration)
and want to carry out the following computation on the column values distance:
val list = Array(10, 20, 30, 40, 10, 20, 0, 10, 20, 30, 40, 50, 60) // list of values from the column distance
val first = list(0)
val last = list(list.length - 1)
var m = 0
for (a <- 0 to list.length - 2) {
  if (list(a + 1) < list(a) && list(a + 1) >= 0) {
    m = m + list(a)
  }
}
val totalDist = (m + last - first)
You can do something like this; it returns an Array[Any]:
val array = df.select("distance").rdd.map(r => r(0)).collect()
If you want the proper data type, you can cast the values; this returns an Array[Int]:
val array = df.select("distance").rdd.map(r => r(0).asInstanceOf[Int]).collect()
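Putting the two together, a rough sketch of the full computation might look like this (assuming the distance column holds integers and the collected values fit comfortably in driver memory):
// Collect the distance column to the driver, then run the same loop as in the question
val list: Array[Int] = tableDF.select("distance").rdd.map(r => r(0).asInstanceOf[Int]).collect()

val first = list.head
val last = list.last
var m = 0
for (a <- 0 until list.length - 1) {
  // accumulate list(a) whenever the series drops to a smaller, non-negative value
  if (list(a + 1) < list(a) && list(a + 1) >= 0) m += list(a)
}
val totalDist = m + last - first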

Using PartitionBy to split and efficiently compute RDD groups by Key

I've implemented a solution to group RDD[K, V] by key and to compute data according to each group (K, RDD[V]), using partitionBy and Partitioner. Nevertheless, I'm not sure if it is really efficient and I'd like to have your point of view.
Here is a sample case: given a list of [K: Int, V: Int], compute the mean of the Vs for each group of K, knowing that it should be distributed and that the number of V values may be very large. That should give:
List[K, V] => (K, mean(V))
The simple Partitioner class:
class MyPartitioner(maxKey: Int) extends Partitioner {
  def numPartitions = maxKey
  def getPartition(key: Any): Int = key match {
    case i: Int if i < maxKey => i
  }
}
The partitioning code:
val l = List((1, 1), (1, 8), (1, 30), (2, 4), (2, 5), (3, 7))
val rdd = sc.parallelize(l)
val p = rdd.partitionBy(new MyPartitioner(4)).cache()

p.foreachPartition(x => {
  try {
    val r = sc.parallelize(x.toList)
    val id = r.first() // get the K partition id
    val v = r.map(x => x._2)
    println(id._1 + "->" + mean(v))
  } catch {
    case e: UnsupportedOperationException => 0
  }
})
The output is :
1->13, 2->4, 3->7
My questions are :
What really happens when calling partitionBy? (Sorry, I didn't find enough documentation on it.)
Is it really efficient to map by partition, knowing that in my production case there would not be many keys (around 50) but very many values per key (around 1 million)?
What is the cost of parallelize(x.toList)? Is it a reasonable thing to do? (I need an RDD as the input of mean().)
How would you do it by yourself ?
Regards
Your code should not work. You cannot pass the SparkContext object to the executors. (It's not Serializable.) Also I don't see why you would need to.
To calculate the mean, you need to calculate the sum and the count and take their ratio. The default partitioner will do fine.
import org.apache.spark.rdd.RDD

def meanByKey(rdd: RDD[(Int, Int)]): RDD[(Int, Double)] = {
  case class SumCount(sum: Double, count: Double)
  val sumCounts = rdd.aggregateByKey(SumCount(0.0, 0.0))(
    (sc, v) => SumCount(sc.sum + v, sc.count + 1.0),
    (sc1, sc2) => SumCount(sc1.sum + sc2.sum, sc1.count + sc2.count))
  sumCounts.mapValues(sc => sc.sum / sc.count)
}
This is an efficient single-pass calculation that generalizes well.
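As a quick usage check against the sample data from the question (output order may differ):
val rdd = sc.parallelize(List((1, 1), (1, 8), (1, 30), (2, 4), (2, 5), (3, 7)))
meanByKey(rdd).collect().foreach(println)
// (1,13.0), (2,4.5), (3,7.0)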

Groovier way of manipulating the list

I have two lists like this:
def a = [100,200,300]
def b = [30,60,90]
I want a Groovier way of manipulating a like this:
1) The first element of a should be changed to a[0]-2*b[0]
2) The second element of a should be changed to a[1]-4*b[1]
3) The third element of a should be changed to a[2]-8*b[2]
(provided that both a and b have the same length of 3)
If the list is changed to a map like this, let's say:
def a1 = [100:30, 200:60, 300:90]
how could one do the same operation in this case?
Thanks in advance.
For a List, I'd go with:
def result = []
a.eachWithIndex { item, index ->
  result << item - ((2**(index + 1)) * b[index])
}
For a Map it's a bit easier, but it still requires external state:
int i = 1
def result = a1.collect { k, v -> k - ((2**i++) * v) }
It's a pity that Groovy doesn't have an analog of zip for this case - something like zipWithIndex or collectWithIndex.
Using collect
In response to Victor in the comments, you can do this using collect:
def a = [100,200,300]
def b = [30,60,90]
// Introduce a list `c` of the multiplier
def c = (1..a.size()).collect { 2**it }
// Transpose these lists together, and calculate
[a,b,c].transpose().collect { x, y, z ->
  x - y * z
}
Using inject
You can also use inject, passing in a map of multiplier and result, then fetching the result out at the end:
def result = [a,b].transpose().inject( [ mult:2, result:[] ] ) { acc, vals ->
  acc.result << vals.with { av, bv -> av - ( acc.mult * bv ) }
  acc.mult *= 2
  acc
}.result
And similarly, you can use inject for the map:
def result = a1.inject( [ mult:2, result:[] ] ) { acc, key, val ->
  acc.result << key - ( acc.mult * val )
  acc.mult *= 2
  acc
}.result
Using inject has the advantage that you don't need to declare external variables, but the disadvantage that the code is harder to read (and, as Victor points out in the comments, it makes static analysis of the code hard or impossible for IDEs and groovypp).
def a1 = [100:30, 200:60, 300:90]
a1.eachWithIndex { item, index ->
  println item.key - ((2**(index + 1)) * item.value)
}
