Writing an efficient aggregation function for Spark SQL - apache-spark

I wrote an aggregation function that returns a range-encoded representation of a column of Long data. I ran it on a 1 GB Parquet file with 50 columns. My cluster has 55 executors with 4 cores per node. The run time is around 5 minutes, even after caching the DataFrame. Is there a way to run this query more efficiently?
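Here is the UDAF:
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._
import scala.collection.mutable.{ArrayBuffer, WrappedArray}

class Concat extends UserDefinedAggregateFunction {
  def inputSchema: org.apache.spark.sql.types.StructType =
    StructType(StructField("value", LongType) :: Nil)
  def bufferSchema: StructType = StructType(
    StructField("concatenation", ArrayType(LongType, false)) :: Nil
  )
  def dataType: DataType = ArrayType(LongType, false)
  def deterministic: Boolean = true
  def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer.update(0, new ArrayBuffer[Long]())
  }
  // The buffer holds (start, extra) pairs: (s, k) stands for s, s+1, ..., s+k.
  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    val l = buffer.getAs[ArrayBuffer[Long]](0).toBuffer.asInstanceOf[ArrayBuffer[Long]]
    val v = input.getAs[Long](0)
    val n = l.size
    if (n < 2) {
      // first value: start a new range
      l += v
      l += 0L
    } else {
      val x1 = l(n - 2)
      val x2 = l(n - 1)
      if (x1 - 1 == v) {
        // v extends the last range downwards
        l(n - 2) = v
        l(n - 1) = x2 + 1
      } else if (x1 + x2 + 1 == v) {
        // v extends the last range upwards
        l(n - 1) = x2 + 1
      } else {
        l += v
        l += 0L
      }
    }
    buffer.update(0, l)
  }
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    val a = buffer1.getAs[WrappedArray[Long]](0)
    val b = buffer2.getAs[WrappedArray[Long]](0)
    buffer1.update(0, a ++ b)
  }
  def evaluate(buffer: Row): Any = {
    buffer(0)
  }
}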
Here is how I am running the query:
val concat = new Concat
sqlContext.udf.register("lcon", concat)
val df = sqlContext.read.parquet("file_url")
df.registerTempTable("agg11") // make the DataFrame visible to SQL as agg11
val results = sqlContext.sql("SELECT lcon(Id) from agg11 WHERE Status IN (1) AND Device IN (1,4) AND Medium IN (1)").collect
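The update step of the UDAF above can be modelled without Spark as a fold over the input values. The following is a minimal plain-Scala sketch of that logic, useful for checking the encoding on small inputs; the names RangeEncoding and encode are made up for illustration and do not appear in the question.

```scala
object RangeEncoding {
  // Encodes values as a flat sequence of (start, extra) pairs, where a pair
  // (s, k) stands for the consecutive values s, s+1, ..., s+k.
  def encode(values: Seq[Long]): Vector[Long] =
    values.foldLeft(Vector.empty[Long]) { (acc, v) =>
      val n = acc.size
      if (n >= 2 && acc(n - 2) - 1 == v)
        // v sits just below the last range: move the start down, grow the range
        acc.updated(n - 2, v).updated(n - 1, acc(n - 1) + 1)
      else if (n >= 2 && acc(n - 2) + acc(n - 1) + 1 == v)
        // v sits just above the last range: grow the range
        acc.updated(n - 1, acc(n - 1) + 1)
      else
        // v is not adjacent to the last range: start a new one
        acc :+ v :+ 0L
    }
}
```

For example, RangeEncoding.encode(Seq(1L, 2L, 3L, 7L, 8L, 10L)) yields Vector(1, 2, 7, 1, 10, 0). Two things in the UDAF may be worth checking when looking for the slow part: update appears to copy the whole buffer on every row (toBuffer), and merge only concatenates partial results, so ranges that meet at a partition boundary are not coalesced.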


How to generate large word count file in Spark?

I want to generate a 10-million-line word count input file for a performance test (every line contains the same sentence), but I have no idea how to code it.
Could you give me example code that saves the file directly to HDFS?
You can try something like this.
Generate one column with values from 1 to 100k and another with values from 1 to 100, then explode both of them with explode(column).
You can't generate a single column with 10 million values because the Kryo buffer will throw an error.
I don't know if this is the best-performing way to do it, but it is the fastest way I can think of right now.
import org.apache.spark.sql.functions.{explode, lit, udf}
import spark.implicits._ // assumes a SparkSession named spark

// UDF that returns the list 1..s
val generateList = udf((s: Int) => {
  val buf = scala.collection.mutable.ArrayBuffer.empty[Int]
  for (i <- 1 to s) {
    buf += i
  }
  buf
})
val someDF = Seq(
  ("Lorem ipsum dolor sit amet, consectetur adipiscing elit.")
).toDF("sentence")
val someDfWithMilColumn = someDF.withColumn("genColumn1", generateList(lit(100000)))
  .withColumn("genColumn2", generateList(lit(100)))
// 1 row -> 100k rows
val someDfWithMilColumn100k = someDfWithMilColumn
  .withColumn("expl_val", explode($"genColumn1")).drop("expl_val", "genColumn1")
// 100k rows -> 10 million rows
val someDfWithMilColumn10mil = someDfWithMilColumn100k
  .withColumn("expl_val2", explode($"genColumn2")).drop("genColumn2", "expl_val2")
You can do it by cross joining two DataFrames, as below.
The code is explained inline.
import org.apache.spark.sql.SaveMode
object GenerateTenMils {
  def main(args: Array[String]): Unit = {
    val spark = Constant.getSparkSess
    spark.conf.set("spark.sql.crossJoin.enabled", "true") // Enable cross join
    import spark.implicits._
    // Create a DF with your sentence
    val df = List("each line has the same sentence").toDF
    // Create another Dataset with 10000000 records
    spark.range(10000000)
      .join(df) // Cross join the DataFrames
      .coalesce(1) // Output to a single file
      .drop("id") // Drop the extra column
      .write
      .mode(SaveMode.Overwrite)
      .text("src/main/resources/tenMils") // Write as text file
  }
}
You could follow this approach.
A tail-recursive function generates the list of sentences, a second one builds DataFrames from that list, and union combines them into the big DataFrame.
val spark = SparkSession
  .builder()
  .config("spark.sql.shuffle.partitions", "4") // Change to a more reasonable default number of partitions for our data
  .config("spark.app.id", "TenMillionsRows") // To silence Metrics warning
  .getOrCreate()
val sc = spark.sparkContext
import spark.implicits._
/**
  * Returns a List of num sentences
  * @param sentence
  * @param num
  * @return
  */
def getList(sentence: String, num: Int): List[String] = {
  def loop(st: String, n: Int, acc: List[String]): List[String] = {
    n match {
      case num if num == 0 => acc
      case _ => loop(st, n - 1, st :: acc)
    }
  }
  loop(sentence, num, List())
}
/**
  * Returns a DataFrame that is the union of num DataFrames
  * @param lst
  * @param num
  * @return
  */
def getDataFrame(lst: List[String], num: Int): DataFrame = {
  def loop(ls: List[String], n: Int, acc: DataFrame): DataFrame = {
    n match {
      case n if n == 0 => acc
      case _ => loop(lst, n - 1, acc.union(sc.parallelize(ls).toDF("sentence")))
    }
  }
  loop(lst, num, sc.parallelize(List(sentence)).toDF("sentence"))
}
val sentence = "hope for the best but prepare for the worst"
val lSentence = getList(sentence, 100000)
val dfs = getDataFrame(lSentence, 100)
println(dfs.count())
// output: 10000001
dfs.write.orc("path_to_hdfs") // write the DataFrame to an ORC file
// you can save the file as parquet, txt, json .......
// with dataframe.write
Hope this helps.
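As an alternative to the three answers above, the rows can also be generated directly with spark.range, which avoids the UDF, the cross join, and the driver-side list. This is only a sketch under assumptions: the object and method names (SentenceFile, repeated) and the output path are placeholders, not taken from the thread.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.lit

object SentenceFile {
  // Builds a single-column DataFrame holding n copies of the sentence.
  // spark.range(n) produces n rows; each row is replaced by the literal.
  def repeated(spark: SparkSession, sentence: String, n: Long): DataFrame =
    spark.range(n).select(lit(sentence).as("value"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("SentenceFile").getOrCreate()
    // Placeholder output path; replace with a real HDFS directory.
    repeated(spark, "each line has the same sentence", 10000000L)
      .write.text("hdfs:///tmp/ten_million_lines")
    spark.stop()
  }
}
```

The text writer requires exactly one string column, which is why the id column from spark.range is replaced rather than kept. The output is a directory of part files, one per partition.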

Spark custom aggregation : collect_list+UDF vs UDAF

I often need to perform custom aggregations on DataFrames in Spark 2.1, and have used these two approaches:
Using groupBy/collect_list to get all the values in a single row, then applying a UDF to aggregate the values
Writing a custom UDAF (user-defined aggregate function)
I generally prefer the first option, as it's easier to implement and more readable than the UDAF implementation. I would assume that the first option is generally slower, because more data is sent around the network (no partial aggregation), but in my experience UDAFs are generally slow. Why is that?
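Concrete example: calculating histograms.
The data is in a Hive table (1E6 random double values):
val df = spark.table("testtable")
def roundToMultiple(d: Double, multiple: Double) = Math.round(d / multiple) * multiple
UDF approach:
val udf_histo = udf((xs: Seq[Double]) => xs.groupBy(x => roundToMultiple(x, 0.25)).mapValues(_.size))
df.groupBy().agg(collect_list($"x").as("xs")).select(udf_histo($"xs")).show(false)
+--------------------------------------------------------------------------------+
|UDF(xs)                                                                         |
+--------------------------------------------------------------------------------+
|Map(0.0 -> 125122, 1.0 -> 124772, 0.75 -> 250819, 0.5 -> 248696, 0.25 -> 250591)|
+--------------------------------------------------------------------------------+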
UDAF approach:
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._
import scala.collection.mutable
class HistoUDAF(binWidth: Double) extends UserDefinedAggregateFunction {
  override def inputSchema: StructType =
    StructType(
      StructField("value", DoubleType) :: Nil
    )
  override def bufferSchema: StructType =
    new StructType()
      .add("histo", MapType(DoubleType, IntegerType))
  override def deterministic: Boolean = true
  override def dataType: DataType = MapType(DoubleType, IntegerType)
  override def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = Map[Double, Int]()
  }
  // Adds the counts of b to the counts of a, key by key
  private def mergeMaps(a: Map[Double, Int], b: Map[Double, Int]) = {
    a ++ b.map { case (k, v) => k -> (v + a.getOrElse(k, 0)) }
  }
  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    val oldBuffer = buffer.getAs[Map[Double, Int]](0)
    val newInput = Map(roundToMultiple(input.getDouble(0), binWidth) -> 1)
    buffer(0) = mergeMaps(oldBuffer, newInput)
  }
  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    val a = buffer1.getAs[Map[Double, Int]](0)
    val b = buffer2.getAs[Map[Double, Int]](0)
    buffer1(0) = mergeMaps(a, b)
  }
  override def evaluate(buffer: Row): Any = {
    buffer.getAs[Map[Double, Int]](0)
  }
}
val histo = new HistoUDAF(0.25)
df.groupBy().agg(histo($"x")).show(false)
+--------------------------------------------------------------------------------+
|histoudaf(x)                                                                    |
+--------------------------------------------------------------------------------+
|Map(0.0 -> 125122, 1.0 -> 124772, 0.75 -> 250819, 0.5 -> 248696, 0.25 -> 250591)|
+--------------------------------------------------------------------------------+
My tests show that the collect_list/UDF approach is about 2 times faster than the UDAF approach. Is this a general rule, or are there cases where a UDAF is really much faster and the rather awkward implementation is justified?
A UDAF is slower because it deserializes and serializes the aggregation buffer from and to Spark's internal format on each update, that is, on every row, which is quite expensive. You should use Aggregator instead (in fact, UserDefinedAggregateFunction has been deprecated since Spark 3.0).
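To illustrate the Aggregator suggestion, here is a minimal sketch of the same histogram written against org.apache.spark.sql.expressions.Aggregator. The class name HistoAggregator is hypothetical, and the choice of ExpressionEncoder for the map buffer is an assumption that should hold on recent Spark versions; it has not been benchmarked against the numbers in the question.

```scala
import org.apache.spark.sql.Encoder
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.expressions.Aggregator

// Typed aggregator: input Double, buffer and result Map[bin -> count].
class HistoAggregator(binWidth: Double)
    extends Aggregator[Double, Map[Double, Int], Map[Double, Int]] {

  // Same binning rule as roundToMultiple in the question
  private def bin(d: Double): Double = Math.round(d / binWidth) * binWidth

  // Empty histogram
  def zero: Map[Double, Int] = Map.empty

  // Adds one value to a partial histogram
  def reduce(histo: Map[Double, Int], x: Double): Map[Double, Int] = {
    val k = bin(x)
    histo.updated(k, histo.getOrElse(k, 0) + 1)
  }

  // Combines two partial histograms by summing counts per bin
  def merge(a: Map[Double, Int], b: Map[Double, Int]): Map[Double, Int] =
    b.foldLeft(a) { case (acc, (k, v)) => acc.updated(k, acc.getOrElse(k, 0) + v) }

  def finish(histo: Map[Double, Int]): Map[Double, Int] = histo

  def bufferEncoder: Encoder[Map[Double, Int]] = ExpressionEncoder[Map[Double, Int]]()
  def outputEncoder: Encoder[Map[Double, Int]] = ExpressionEncoder[Map[Double, Int]]()
}
```

On Spark 3.0 or later it can be applied to an untyped column by wrapping it with functions.udaf, for example df.agg(udaf(new HistoAggregator(0.25)).apply($"x")). On Spark 2.x the typed route is df.as[Double].select(new HistoAggregator(0.25).toColumn), assuming the table has the single double column x.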

Spark can not serialize the BufferedImage class

I have a Not Serializable Class exception in Spark 2.2.0.
This is the procedure I am trying to implement in Scala:
Read a set of JPEG images from HDFS.
Build an array of java.awt.image.BufferedImage objects.
Extract the buffer of each java.awt.image.BufferedImage and store it in a 2D array, building an array of two-dimensional arrays (Array[Array[Int]]) that contain the image buffer information.
Transform the array of Array[Array[Int]] into an org.apache.spark.rdd.RDD[Array[Array[Int]]] using the sc.parallelize method.
Perform image processing operations in a distributed way by transforming the initial org.apache.spark.rdd.RDD[Array[Array[Int]]].
This is the code:
import org.apache.spark.sql.SparkSession
import javax.imageio.ImageIO
import java.io.ByteArrayInputStream
def binarize(image: Array[Array[Int]], threshold: Int): Array[Array[Int]] = {
  val height = image.size
  val width = image(0).size
  val result = Array.ofDim[Int](height, width)
  for (i <- 0 until height) {
    for (j <- 0 until width) {
      result(i)(j) = if (image(i)(j) <= threshold) 0 else 255
    }
  }
  result
}
object imageTestObj {
  def main(args: Array[String]) {
    val spark = SparkSession.builder().appName("imageTest2").getOrCreate()
    val sc = spark.sparkContext
    val saveToHDFS = false
    val threshold: Int = 128
    val partitions = 32
    val inPathStr = "hdfs://"
    val outPathStr = if (saveToHDFS) "hdfs://" else "/home/vitrion/IdeaProjects/imageTest2/output/"
    val files = sc.binaryFiles(inPathStr).collect
    val AWTImageArray = files.map { binFile =>
      val input = binFile._2.open()
      val name = binFile._1
      var buffer: Array[Byte] = Array.fill(input.available)(0)
      input.readFully(buffer) // read the file contents into the byte array
      ImageIO.read(new ByteArrayInputStream(buffer))
    }
    val ImgBuffers = AWTImageArray.map { image =>
      val height = image.getHeight
      val width = image.getWidth
      val buffer = Array.ofDim[Int](height, width)
      for (i <- 0 until height) {
        for (j <- 0 until width) {
          buffer(i)(j) = image.getRaster.getDataBuffer.getElem(0, i * width + j)
        }
      }
      buffer
    }
    val inputImages = sc.parallelize(ImgBuffers, partitions).cache()
    val op1 = inputImages.map(image => binarize(image, threshold))
  }
}
This algorithm gets a very well-known exception:
org.apache.spark.SparkException: Task not serializable
Caused by: java.io.NotSerializableException: java.awt.image.BufferedImage
Serialization stack:
- object not serializable (class: java.awt.image.BufferedImage, ...
I do not understand why Spark attempts to serialize the BufferedImage class when it is only used before the first RDD in the application is created. Shouldn't BufferedImage need to be serialized only if I try to create an RDD[BufferedImage]?
Can somebody explain to me what is going on?
Thank you in advance...
What you are actually serializing in Spark is a function. That function cannot contain references to instances of non-serializable classes. Instantiating a non-serializable class inside the function is fine, but referring from the function to an existing instance of a non-serializable class is not.
Most probably one of the functions you pass to Spark references an instance of BufferedImage.
Check your code and make sure no function references a BufferedImage object.
By inlining some code so that no BufferedImage object is serialized, I guess you can overcome the exception. Can you try out this code (I did not execute it myself)?
object imageTestObj {
  def main(args: Array[String]) {
    val spark = SparkSession.builder().appName("imageTest2").getOrCreate()
    val sc = spark.sparkContext
    val saveToHDFS = false
    val threshold: Int = 128
    val partitions = 32
    val inPathStr = "hdfs://"
    val outPathStr = if (saveToHDFS) "hdfs://" else "/home/vitrion/IdeaProjects/imageTest2/output/"
    val ImgBuffers = sc.binaryFiles(inPathStr).collect.map { binFile =>
      val input = binFile._2.open()
      val name = binFile._1
      val bytes: Array[Byte] = Array.fill(input.available)(0)
      input.readFully(bytes) // read the file contents into the byte array
      val image = ImageIO.read(new ByteArrayInputStream(bytes))
      // Inlining must be here, so that BufferedImage is not serialized.
      val height = image.getHeight
      val width = image.getWidth
      // Renamed from buffer to avoid clashing with the byte array above
      val pixels = Array.ofDim[Int](height, width)
      for (i <- 0 until height) {
        for (j <- 0 until width) {
          pixels(i)(j) = image.getRaster.getDataBuffer.getElem(0, i * width + j)
        }
      }
      pixels
    }
    val inputImages = sc.parallelize(ImgBuffers, partitions).cache()
    val op1 = inputImages.map(image => binarize(image, threshold))
  }
}
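The rule in the answer (a function may create non-serializable objects, but must not hold a reference to one) can be checked outside Spark, because Spark ships closures with plain Java serialization. The sketch below is a hypothetical helper, not part of the original code; the names SerializationCheck, isJavaSerializable, closureOverImage and closureOverPixels are made up for illustration.

```scala
import java.awt.image.BufferedImage
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

object SerializationCheck {
  // True if obj can be written with Java serialization,
  // false if it (or something it references) is not Serializable.
  def isJavaSerializable(obj: AnyRef): Boolean = {
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    try { out.writeObject(obj); true }
    catch { case _: NotSerializableException => false }
    finally out.close()
  }

  // A function that captures a BufferedImage from its enclosing scope
  def closureOverImage(): Int => Int = {
    val image = new BufferedImage(2, 2, BufferedImage.TYPE_BYTE_GRAY)
    (x: Int) => x + image.getWidth
  }

  // A function that captures only the extracted pixel array
  def closureOverPixels(): Int => Int = {
    val pixels = Array.ofDim[Int](2, 2)
    (x: Int) => x + pixels.length
  }
}
```

With this helper, SerializationCheck.isJavaSerializable(SerializationCheck.closureOverImage()) is expected to be false and the closureOverPixels variant true, which mirrors why the inlined version keeps BufferedImage out of anything Spark has to ship to executors.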

spark dataframe null value count

I am new to Spark and I want to calculate the null rate of each column (I have 200 columns). My function is as follows:
def nullCount(dataFrame: DataFrame): Unit = {
  val args = dataFrame.columns.length
  val cols = dataFrame.columns
  val d = dataFrame.count()
  println("Follows are the null value rate of each columns")
  for (i <- Range(0, args)) {
    // -900 is the value that marks a missing entry; one full pass per column
    val nullrate = dataFrame.rdd.filter(r => r(i) == (-900)).count.toDouble / d
    println(cols(i), nullrate)
  }
}
But I find it too slow. Is there a more efficient way to do this?
Adapted from this answer by zero323:
import org.apache.spark.sql.functions.{col, count, when}
df.select(df.columns.map(c => (count(c) / count("*")).alias(c)): _*)
With -900 as the marker value:
df.select(df.columns.map(c => (count(when(col(c) === -900, col(c))) / count("*")).alias(c)): _*)
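The two expressions above can be wrapped into small helpers that compute every column's rate in a single pass over the data. This is a sketch under assumptions: the object and method names (NullRates, nullRates, sentinelRates) are invented here. Note that count(c) counts the non-null values of a column, so the first expression in the answer gives the non-null fraction; the helper below counts actual nulls instead.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, count, lit, when}

object NullRates {
  // Fraction of rows, per column, whose value is SQL NULL.
  // when(...) without otherwise yields NULL for non-matching rows,
  // and count skips NULLs, so only matching rows are counted.
  def nullRates(df: DataFrame): DataFrame =
    df.select(df.columns.map { c =>
      (count(when(col(c).isNull, lit(1))) / count(lit(1))).alias(c)
    }: _*)

  // Fraction of rows, per column, equal to a marker value such as -900.
  def sentinelRates(df: DataFrame, sentinel: Int): DataFrame =
    df.select(df.columns.map { c =>
      (count(when(col(c) === sentinel, lit(1))) / count(lit(1))).alias(c)
    }: _*)
}
```

Both helpers return a one-row DataFrame with one column per input column, so NullRates.sentinelRates(df, -900).show() prints all 200 rates after a single job instead of one job per column.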

spark: How to merge shuffled rdd efficiently?

I have 5 shuffled key-value RDDs: one big one (1,000,000 records) and 4 relatively small ones (100,000 records each). All RDDs were shuffled with the same number of partitions. I have two strategies for merging the 5 RDDs:
Merge the 5 RDDs together
Merge the 4 small RDDs together and then join the result with the big one
I thought strategy 2 would be more efficient, as it would not re-shuffle the big RDD, but the experiment shows that strategy 1 is more efficient. The code and output follow:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkContext, SparkConf}
object MergeStrategy extends App {
  val conf = new SparkConf().setMaster("local[4]").setAppName("test")
  val sc = new SparkContext(conf)
  val sqlContext = new SQLContext(sc)
  val bigRddSize = 1e6.toInt
  val smallRddSize = 1e5.toInt
  val bigRdd = sc.parallelize((0 until bigRddSize)
    .map(x => (scala.util.Random.nextInt, 0))).repartition(100).cache
  val smallRddList = (0 until 4).map(i => {
    val rst = sc.parallelize((0 until smallRddSize)
      .map(x => (scala.util.Random.nextInt, 0))).repartition(100).cache
    rst
  })
  // strategy 1
  locally {
    val begin = System.currentTimeMillis
    val s1Rst = sc.union(Array(bigRdd) ++ smallRddList).distinct(100)
    println(s1Rst.count) // action that triggers the computation
    val end = System.currentTimeMillis
    val timeCost = (end - begin) / 1000d
    println("S1 time count: %.1f s".format(timeCost))
  }
  // strategy 2
  locally {
    val begin = System.currentTimeMillis
    val smallMerged = sc.union(smallRddList).distinct(100).cache
    val s2Rst = bigRdd.fullOuterJoin(smallMerged).flatMap({ case (key, (left, right)) => {
      if (left.isDefined && right.isDefined) Array((key, left.get), (key, right.get)).distinct
      else if (left.isDefined) Array((key, left.get))
      else if (right.isDefined) Array((key, right.get))
      else throw new Exception("Cannot happen")
    }})
    println(s2Rst.count) // action that triggers the computation
    val end = System.currentTimeMillis
    val timeCost = (end - begin) / 1000d
    println("S2 time count: %.1f s".format(timeCost))
  }
}
S1 time count: 39.7 s
S2 time count: 49.8 s
Is my understanding of shuffled RDDs wrong? Can anybody give me some advice?
I found a way to merge RDDs more efficiently. Compare the following 2 merging strategies:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext
import org.apache.spark.{HashPartitioner, SparkContext, SparkConf}
import scala.collection.mutable.ArrayBuffer
object MergeStrategy extends App {
  val conf = new SparkConf().setMaster("local[4]").setAppName("test")
  val sc = new SparkContext(conf)
  val sqlContext = new SQLContext(sc)
  val rddCount = 20
  val mergeCount = 5
  val dataSize = 20000
  val parts = 50
  // generate data
  val testData = for (i <- 0 until rddCount)
    yield sc.parallelize(scala.util.Random.shuffle((0 until dataSize).toList).map(x => (x, 0)))
      .partitionBy(new HashPartitioner(parts))
  testData.foreach(x => println(x.count))
  // strategy 1: merge directly
  locally {
    val buff = ArrayBuffer[RDD[(Int, Int)]]()
    val begin = System.currentTimeMillis
    for (i <- 0 until rddCount) {
      buff += testData(i)
      if ((buff.size >= mergeCount || i == rddCount - 1) && buff.size > 1) {
        val merged = sc.union(buff).distinct
          .partitionBy(new HashPartitioner(parts)).cache
        merged.count // action: runs the merge inside the timed section
        buff.clear()
        buff += merged
      }
    }
    val end = System.currentTimeMillis
    val timeCost = (end - begin) / 1000d
    println("Strategy 1 Time Cost: %.1f".format(timeCost))
    assert(buff.size == 1)
    println("Strategy 1 Complete, with merged Count %s".format(buff(0).count))
  }
  // strategy 2: merge directly without repartition
  locally {
    val buff = ArrayBuffer[RDD[(Int, Int)]]()
    val begin = System.currentTimeMillis
    for (i <- 0 until rddCount) {
      buff += testData(i)
      if ((buff.size >= mergeCount || i == rddCount - 1) && buff.size > 1) {
        val merged = sc.union(buff).distinct(parts).cache
        merged.count // action: runs the merge inside the timed section
        buff.clear()
        buff += merged
      }
    }
    val end = System.currentTimeMillis
    val timeCost = (end - begin) / 1000d
    println("Strategy 2 Time Cost: %.1f".format(timeCost))
    assert(buff.size == 1)
    println("Strategy 2 Complete, with merged Count %s".format(buff(0).count))
  }
}
The result shows that strategy 1 (20.8 seconds) is more efficient than strategy 2 (34.3 seconds). My PC runs Windows 8 with a 4-core 2.0 GHz CPU and 8 GB of memory.
The only difference is that strategy 1 repartitions the merged result with a HashPartitioner and strategy 2 does not. As a result, strategy 1 produces a ShuffledRDD, while strategy 2 produces a MapPartitionsRDD. I think the RDD.distinct function processes a ShuffledRDD more efficiently than a MapPartitionsRDD.
