Working with arraybuffers in Spark UDAF's

Working with arraybuffers in Spark UDAF's - apache-spark

I'm writing a UDAF in spark which calculates a range representation of integers.
My intermediate results are ArrayBuffers and the final result is also a ArrayBuffer. But I'm getting this error when I run the code -
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.ClassCastException: scala.collection.mutable.WrappedArray$ofRef cannot be cast to scala.collection.mutable.ArrayBuffer
at $iwC$$iwC$Concat.update(<console>:33)
at org.apache.spark.sql.execution.aggregate.ScalaUDAF.update(udaf.scala:445)
at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$11.apply(AggregationIterator.scala:178)
at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$11.apply(AggregationIterator.scala:171)
at org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.processCurrentSortedGroup(SortBasedAggregationIterator.scala:100)
at org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:139)
at org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:30)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.insertAll(BypassMergeSortShuffleWriter.java:119)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
This is my aggregation function -
import org.apache.spark.sql.expressions.MutableAggregationBuffer
import org.apache.spark.sql.expressions.UserDefinedAggregateFunction
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types.StructField
import org.apache.spark.sql.types.DataType
import org.apache.spark.sql.types.LongType
import org.apache.spark.sql.types.ArrayType
import scala.collection.mutable.ArrayBuffer
class Concat extends UserDefinedAggregateFunction {
def inputSchema: org.apache.spark.sql.types.StructType =
StructType(StructField("value", LongType) :: Nil)
def bufferSchema: StructType = StructType(
StructField("concatenation",ArrayType(LongType,false) ) :: Nil
)
def dataType: DataType = ArrayType(LongType,false)
def deterministic: Boolean = true
def initialize(buffer: MutableAggregationBuffer): Unit = {
buffer.update(0, new ArrayBuffer[Long]() )
}
def update(buffer: MutableAggregationBuffer,input: Row): Unit = {
val l=buffer.getSeq(0).asInstanceOf[ ArrayBuffer[Long] ]
val v=input.getAs[ Long ](0)
val n=l.size
if(n >= 2){
val x1=l(n-2)
val x2=l(n-1)
if( x1-1 == v)
l(n-2)=v
else if(x1+x2+1 == v)
l(n-1)=x2+1
else
l += v
l += 0L
}
else{
l += v
l += 0L
}
buffer.update(0,l)
}
def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
val a=buffer1.getSeq(0).asInstanceOf[ ArrayBuffer[Long] ]
val b=buffer2.getSeq(0).asInstanceOf[ ArrayBuffer[Long] ]
a ++ b
}
def evaluate(buffer: Row): Any = {
buffer(0)
}
}
I looked into udaf.scala as well but I'm not able to figure how to make it work & I'm not very proficient in scala. How can I make it work?

Related

Spark custom aggregation : collect_list+UDF vs UDAF

I often have the need to perform custom aggregations on dataframes in spark 2.1, and used these two approaches :
Using groupby/collect_list to get all the values in a single row, then apply an UDF to aggregate the values
Writing a custom UDAF (User defined aggregate function)
I generally prefer the first option as its easier to implement and more readable than the UDAF implementation. But I would assume that the first option is generally slower, because more data is sent around the network (no partial aggregation), but my experience shows that UDAF are generally slow. Why is that?
Concrete example: Calculating histograms:
Data is in a hive table (1E6 random double values)
val df = spark.table("testtable")
def roundToMultiple(d:Double,multiple:Double) = Math.round(d/multiple)*multiple
UDF approach:
val udf_histo = udf((xs:Seq[Double]) => xs.groupBy(x => roundToMultiple(x,0.25)).mapValues(_.size))
df.groupBy().agg(collect_list($"x").as("xs")).select(udf_histo($"xs")).show(false)
+--------------------------------------------------------------------------------+
|UDF(xs) |
+--------------------------------------------------------------------------------+
|Map(0.0 -> 125122, 1.0 -> 124772, 0.75 -> 250819, 0.5 -> 248696, 0.25 -> 250591)|
+--------------------------------------------------------------------------------+
UDAF-Approach
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._
import scala.collection.mutable
class HistoUDAF(binWidth:Double) extends UserDefinedAggregateFunction {
override def inputSchema: StructType =
StructType(
StructField("value", DoubleType) :: Nil
)
override def bufferSchema: StructType =
new StructType()
.add("histo", MapType(DoubleType, IntegerType))
override def deterministic: Boolean = true
override def dataType: DataType = MapType(DoubleType, IntegerType)
override def initialize(buffer: MutableAggregationBuffer): Unit = {
buffer(0) = Map[Double, Int]()
}
private def mergeMaps(a: Map[Double, Int], b: Map[Double, Int]) = {
a ++ b.map { case (k,v) => k -> (v + a.getOrElse(k, 0)) }
}
override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
val oldBuffer = buffer.getAs[Map[Double, Int]](0)
val newInput = Map(roundToMultiple(input.getDouble(0),binWidth) -> 1)
buffer(0) = mergeMaps(oldBuffer, newInput)
}
override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
val a = buffer1.getAs[Map[Double, Int]](0)
val b = buffer2.getAs[Map[Double, Int]](0)
buffer1(0) = mergeMaps(a, b)
}
override def evaluate(buffer: Row): Any = {
buffer.getAs[Map[Double, Int]](0)
}
}
val histo = new HistoUDAF(0.25)
df.groupBy().agg(histo($"x")).show(false)
+--------------------------------------------------------------------------------+
|histoudaf(x) |
+--------------------------------------------------------------------------------+
|Map(0.0 -> 125122, 1.0 -> 124772, 0.75 -> 250819, 0.5 -> 248696, 0.25 -> 250591)|
+--------------------------------------------------------------------------------+
My tests show that the collect_list/UDF approach is about 2 times faster than the UDAF approach. Is this a general rule, or are there cases where UDAF is really much faster and the rather awkward implemetation is justified?

UDAF is slower because it deserializes/serializes aggregator from/to internal buffer on each update -> on each row which is quite expensive (some more details). Instead you should use Aggregator (in fact, UDAF have been deprecated since Spark 3.0).

Compounding in Spark

I have a dataframe of this format
Date | Return
01/01/2015 0.0
02/02/2015 -0.02
03/02/2015 0.05
04/02/2015 0.07
I would like to do compounding and add a column which will return Compounded return. Compounded return is calculated as:
1 for 1st row.
(1+Return(i))* Compounded(i-1))
So my df finally will be
Date | Return | Compounded
01/01/2015 0.0 1.0
02/02/2015 -0.02 1.0*(1-0.2)=0.8
03/02/2015 0.05 0.8*(1+0.05)=0.84
04/02/2015 0.07 0.84*(1+0.07)=0.8988
Answers in Java will be highly appreciated.

You can also create a custom aggregate function and use it in a window function.
Something like this (writing freeform so there probably would be some mistakes):
package com.myuadfs
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._
class MyUDAF() extends UserDefinedAggregateFunction {
def inputSchema: StructType = StructType(Array(StructField("Return", DoubleType)))
def bufferSchema = StructType(StructField("compounded", DoubleType))
def dataType: DataType = DoubleType
def deterministic = true
def initialize(buffer: MutableAggregationBuffer) = {
buffer(0) = 1.0 // set compounded to 1
}
def update(buffer: MutableAggregationBuffer, input: Row) = {
buffer(0) = buffer.getDouble(0) * ( input.getDouble(0) + 1)
}
// this generally merges two aggregated buffers. This means this
// would not have worked properly had you been working with a regular
// aggregate but since you are planning to use this inside a window
// only this should not be called at all.
def merge(buffer1: MutableAggregationBuffer, buffer2: Row) = {
buffer1(0) = buffer1.getDouble(0) + buffer2.getDouble(0)
}
def evaluate(buffer: Row) = {
buffer.getDouble(0)
}
}
Now you can use this inside a window function. Something like this:
import org.apache.spark.sql.Window
val windowSpec = Window.orderBy("date")
val newDF = df.withColumn("compounded", df("Return").over(windowSpec)
Note that this has the limitation that the entire calculation should fit in a single partition so if you have too large a data you would have a problem. That said, nominally this kind of operations are performed after some partitioning by key (e.g. add a partitionBy to the window) and then a single element should be part of a key.

First, we define a function f(line) (suggest a better name, please!!) to process the lines.
def f(line):
global firstLine
global last_compounded
if line[0] == 'Date':
firstLine = True
return (line[0], line[1], 'Compounded')
else:
firstLine = False
if firstLine:
last_compounded = 1
firstLine = False
else:
last_compounded = (1+float(line[1]))*last_compounded
return (line[0], line[1], last_compounded)
Using two global variables (could be improved?), we keep the Compounded(i-1) value and if we are processing the first line.
With your data in some_file, a solution could be:
rdd = sc.textFile('some_file').map(lambda l: l.split())
r1 = rdd.map(lambda l: f(l))
rdd.collect()
[[u'Date', u'Return'], [u'01/01/2015', u'0.0'], [u'02/02/2015', u'-0.02'], [u'03/02/2015', u'0.05'], [u'04/02/2015', u'0.07']]
r1.collect()
[(u'Date', u'Return', 'Compounded'), (u'01/01/2015', u'0.0', 1.0), (u'02/02/2015', u'-0.02', 0.98), (u'03/02/2015', u'0.05', 1.05), (u'04/02/2015', u'0.07', 1.1235000000000002)]

Writing an efficient aggregation function for Spark SQL

I wrote an aggregation function which returns range encoded representation of a column of Long data. I ran it on a 1 GB parquet file which has 50 columns. My cluster has 55 executors with 4 cores for each node. The run time is around 5 minutes even after caching the dataframe. Is there any way to run this query in a more efficient manner ?
Here is the UDAF -
class Concat extends UserDefinedAggregateFunction {
def inputSchema: org.apache.spark.sql.types.StructType =
StructType(StructField("value", LongType) :: Nil)
def bufferSchema: StructType = StructType(
StructField("concatenation",ArrayType(LongType,false) ) :: Nil
)
def dataType: DataType = ArrayType(LongType,false)
def deterministic: Boolean = true
def initialize(buffer: MutableAggregationBuffer): Unit = {
buffer.update(0, new ArrayBuffer[Long]() )
}
def update(buffer: MutableAggregationBuffer,input: Row): Unit = {
val l=buffer.getAs[ ArrayBuffer[Long] ](0).toBuffer.asInstanceOf[ ArrayBuffer[Long] ]
val v=input.getAs[ Long ](0)
val n=l.size
if(n<2){
l += v
l += 0L
}
else{
val x1 = l(n-2)
val x2 = l(n-1)
if( x1-1 == v){
l(n-2)= v
l(n-1)= x2+1
}
else if(x1+x2+1 == v)
l(n-1)= x2+1
else{
l += v
l += 0L
}
}
buffer.update(0,l)
}
def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
val a=buffer1.getAs[ WrappedArray[Long] ](0)
val b=buffer2.getAs[ WrappedArray[Long] ](0)
buffer1.update(0,a ++ b)
}
def evaluate(buffer: Row): Any = {
buffer.getSeq(0)
}
}
Here is how I am running the query -
val concat = new Concat
sqlContext.udf.register("lcon", concat)
val df=sqlContext.read.parquet("file_url")
df.cache
df.registerTempTable("agg11")
val results=sqlContext.sql("SELECT lcon(Id) from agg11 WHERE Status IN (1) AND Device IN (1,4) AND Medium IN (1)").collect

Problems running Spark GraphX algorithms on generated graphs

I have created a graph in Spark GraphX using the following codes. (See my question and solution)
import scala.math.random
import org.apache.spark._
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD
import scala.util.Random
import org.apache.spark.HashPartitioner
object SparkER {
val nPartitions: Integer = 4
val n: Long = 100
val p: Double = 0.1
def genNodeIds(nPartitions: Int, n: Long)(i: Int) = {
(0L until n).filter(_ % nPartitions == i).toIterator
}
def genEdgesForId(p: Double, n: Long, random: Random)(i: Long) = {
(i + 1 until n).filter(_ => random.nextDouble < p).map(j => Edge(i, j, ()))
}
def genEdgesForPartition(iter: Iterator[Long]) = {
val random = new Random(new java.security.SecureRandom())
iter.flatMap(genEdgesForId(p, n, random))
}
def main(args: Array[String]) {
val conf = new SparkConf().setAppName("Spark ER").setMaster("local[4]")
val sc = new SparkContext(conf)
val empty = sc.parallelize(Seq.empty[Int], nPartitions)
val ids = empty.mapPartitionsWithIndex((i, _) => genNodeIds(nPartitions, n)(i))
val edges = ids.mapPartitions(genEdgesForPartition)
val vertices: VertexRDD[Unit] = VertexRDD(ids.map((_, ())))
val graph = Graph(vertices, edges)
val cc = graph.connectedComponents().vertices //Throwing Exceptions
println("Stopping Spark Context")
sc.stop()
}
}
Now, I can access the graph and see the degrees of the nodes. But when I try to get some measures, such as Connected components, I am getting the following exceptions.
15/12/22 12:12:57 ERROR Executor: Exception in task 3.0 in stage 6.0 (TID 19)
java.lang.ArrayIndexOutOfBoundsException: -1
at org.apache.spark.graphx.util.collection.GraphXPrimitiveKeyOpenHashMap$mcJI$sp.apply$mcJI$sp(GraphXPrimitiveKeyOpenHashMap.scala:64)
at org.apache.spark.graphx.impl.EdgePartition.updateVertices(EdgePartition.scala:91)
at org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:75)
at org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:73)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.insertAll(BypassMergeSortShuffleWriter.java:99)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
15/12/22 12:12:57 ERROR Executor: Exception in task 1.0 in stage 6.0 (TID 17)
java.lang.ArrayIndexOutOfBoundsException: -1
at org.apache.spark.graphx.util.collection.GraphXPrimitiveKeyOpenHashMap$mcJI$sp.apply$mcJI$sp(GraphXPrimitiveKeyOpenHashMap.scala:64)
at org.apache.spark.graphx.impl.EdgePartition.updateVertices(EdgePartition.scala:91)
at org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:75)
at org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:73)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.insertAll(BypassMergeSortShuffleWriter.java:99)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Why am I nable to perform these operations on the generated graph using GraphX?

I found that, if I do the following the exception does not occur.
val graph = Graph(vertices, edges).partitionBy(PartitionStrategy.RandomVertexCut)
Apparently, some GraphX algorithms require the repartitioning. But the purpose is not entirely clear to me.

Spark: Work around nested RDD

There are two tables. First table has records with two fields book1 and book2. These are id's of books that usualy are read together, in pairs.
Second table has columns books and readers of these books, where books and readers are book and reader IDs, respectively. For every reader in the second table I need to find corresponding books in the pairs table. For example if reader read books 1,2,3 and we have pairs (1,7), (6,2), (4,10) the resulting list for this reader should have books 7,6.
I first group books by readers and then iterate pairs. Every book in pair I try to match with all books in a user list:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.log4j.Logger
import org.apache.log4j.Level
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
object Simple {
case class Pair(book1: Int, book2: Int)
case class Book(book: Int, reader: Int, name:String)
val pairs = Array(
Pair(1, 2),
Pair(1, 3),
Pair(5, 7)
)
val testRecs = Array(
Book(book = 1, reader = 710, name = "book1"),
Book(book = 2, reader = 710, name = "book2"),
Book(book = 3, reader = 710, name = "book3"),
Book(book = 8, reader = 710, name = "book8"),
Book(book = 1, reader = 720, name = "book1"),
Book(book = 2, reader = 720, name = "book2"),
Book(book = 8, reader = 720, name = "book8"),
Book(book = 3, reader = 730, name = "book3"),
Book(book = 8, reader = 740, name = "book8")
)
def main(args: Array[String]) {
Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.OFF)
// set up environment
val conf = new SparkConf()
.setMaster("local[5]")
.setAppName("Simple")
.set("spark.executor.memory", "2g")
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
val pairsDf = sc.parallelize(pairs).toDF()
val testData = sc.parallelize(testRecs)
// *** Group test data by reader
val testByReader = testData.map(r => (r.reader, r.book))
val testGroups = testByReader.groupByKey()
val x = testGroups.map(tuple => tuple match {
case(user, bookIter) => matchList(user,pairsDf, bookIter.toList)
})
x.foreach(println)
}
def matchList(user:Int, df: DataFrame, toMatch: List[Int]) = {
//val x = df.map(r => (r(0), r(1))) --- This also fails!!
//x
val relatedBooks = df.map(r => {
val book1 = r(0)
val book2 = r(1)
val z = toMatch.map(book =>
if (book == book1)
List(book2)
else {
if (book == book2) List(book1)
else List()
} //if
)
z.flatMap(identity)
})
(user,relatedBooks)
}
}
This results in java.lang.NullPointerException (below). As I understand, Spark does not support nested RDDs. Please advise on another way to solve this task.
...
15/06/09 18:59:25 INFO Server: jetty-8.y.z-SNAPSHOT
15/06/09 18:59:25 INFO AbstractConnector: Started SocketConnector#0.0.0.0:44837
15/06/09 18:59:26 INFO Server: jetty-8.y.z-SNAPSHOT
15/06/09 18:59:26 INFO AbstractConnector: Started SelectChannelConnector#0.0.0.0:4040
[Stage 0:> (0 + 0) / 5]15/06/09 18:59:30 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 5)
java.lang.NullPointerException
at org.apache.spark.sql.DataFrame.schema(DataFrame.scala:253)
at org.apache.spark.sql.DataFrame.rdd(DataFrame.scala:961)
at org.apache.spark.sql.DataFrame.map(DataFrame.scala:848)
at Simple$.matchList(Simple.scala:60)
at Simple$$anonfun$2.apply(Simple.scala:52)
at Simple$$anonfun$2.apply(Simple.scala:51)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:798)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:798)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1498)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1498)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)

You can create two rdds . One for bookpair and one for readerbook and then join the two rdds by bookid.
val bookpair = Array((1,2),(2,4),(3,4),(5,6),(4,6),(7,3))
val bookpairRdd = sc.parallelize(bookpair)
val readerbook = Array(("foo",1),("bar",2),("user1",3),("user3",4))
val readerRdd = sc.parallelize(readerbook).map(x => x.swap)
val joinedRdd = readerRdd.join(bookpairRdd)
joinedRdd.foreach(println)
(4,(user3,6))
(3,(user1,4))
(2,(bar,4))
(1,(foo,2))

As you've noticed, we can't nest RDDs. One option would be to emit book-user pairs, then join that with the book info, and then group the results by user id (grouping by key is a bit sketchy, but assuming no user has read so many books that the book info for that user doesn't fit in memory it should be ok).

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Working with arraybuffers in Spark UDAF's - apache-spark

Related

Spark custom aggregation : collect_list+UDF vs UDAF

Compounding in Spark

Writing an efficient aggregation function for Spark SQL

Problems running Spark GraphX algorithms on generated graphs

Spark: Work around nested RDD

Categories

Resources