Aggregate for mode(most common element) in spark dataframes - apache-spark

In Spark I am using a library for which I am supposed to provide the aggregates and the library then does a series of joins/groupby's and calls the aggregate at the end. I am trying to avoid violating encapsulation (although I can if necessary), and just call this method with an aggregate (traditionally sum or min etc.)
In this case I am trying to run mode, however, which I am not sure of how to run in an aggregate.

Here's a Spark (2.1.0) UDAF to calculate the Statistical Mode for a given column:
package org.anish.spark.mostcommonvalue
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._
import scalaz.Scalaz._
/**
* Spark User Defined Aggregate Function to calculate the most frequent value in a column. This is similar to
* Statistical Mode. When there are two random values, this function selects any one. When calculating mode, both
* these values together is considered as mode.
*
* Usage:
*
* DataFrame / DataSet DSL
* val mostCommonValue = new MostCommonValue
* df.groupBy("group_id").agg(mostCommonValue(col("mode_column")), mostCommonValue(col("city")))
*
* Spark SQL:
* sqlContext.udf.register("mode", new MostCommonValue)
* %sql
* -- Use a group_by statement and call the UDAF.
* select group_id, mode(id) from table group by group_id
*
* Reference: https://docs.databricks.com/spark/latest/spark-sql/udaf-scala.html
*
* Created by anish on 26/05/17.
*/
class MostCommonValue extends UserDefinedAggregateFunction {
// This is the input fields for your aggregate function.
// We use StringType, because Mode can also be meaningfully applied on nominal data
override def inputSchema: StructType =
StructType(StructField("value", StringType) :: Nil)
// This is the internal fields you keep for computing your aggregate.
// We store the frequency of all the distinct element we encounter for the given attribute in this HashMap
override def bufferSchema: StructType = StructType(
StructField("frequencyMap", DataTypes.createMapType(StringType, LongType)) :: Nil
)
// This is the output type of your aggregation function.
override def dataType: DataType = StringType
override def deterministic: Boolean = true
// This is the initial value for the buffer schema.
override def initialize(buffer: MutableAggregationBuffer): Unit = {
buffer(0) = Map[String, Long]()
}
// This is how to update your buffer schema given an input.
override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
buffer(0) = buffer.getAs[Map[String, Long]](0) |+| Map(input.getAs[String](0) -> 1L)
}
// This is how you merge two objects with the bufferSchema type.
override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
buffer1(0) = buffer1.getAs[Map[String, Long]](0) |+| buffer2.getAs[Map[String, Long]](0)
}
// This is where you output the final value, given the final value of your bufferSchema.
override def evaluate(buffer: Row): String = {
buffer.getAs[Map[String, Long]](0).maxBy(_._2)._1
}
}
credit/source:
https://gist.github.com/anish749/6a815ed281f538068a0d3a20ca9044fa

Related

Cumulative product UDF for Spark SQL

I have seen in other posts of this being done for dataframes: https://stackoverflow.com/a/52992212/4080521
But I am trying to figure out how I can write an udf for a cumulative product.
Assuming I have a very basic table
Input data:
+----+
| val|
+----+
| 1 |
| 2 |
| 3 |
+----+
If i want to take the sum of this I can simply do something like
sparkSession.createOrReplaceTempView("table")
spark.sql("""Select SUM(table.val) from table""").show(100, false)
and this simply works because SUM is a pre defined function.
How would I define something similar for multiplication (or even how can I implement sum in an UDF myself)?
Trying the following
sparkSession.createOrReplaceTempView("_Period0")
val prod = udf((vals:Seq[Decimal]) => vals.reduce(_ * _))
spark.udf.register("prod",prod)
spark.sql("""Select prod(table.vals) from table""").show(100, false)
I get the following error:
Message: cannot resolve 'UDF(vals)' due to data type mismatch: argument 1 requires array<decimal(38,18)> type, however, 'table.vals' is of decimal(28,14)
Obviously each specific cell is not an array, but it seems the udf needs to take in an array to perform the aggregation. Is it even possible with spark sql?
You can implement it through UserDefinedAggregateFunction
You need to define several functions to work with the input and the buffer values.
Quick example for the product function using just doubles as type:
import org.apache.spark.sql.expressions.MutableAggregationBuffer
import org.apache.spark.sql.expressions.UserDefinedAggregateFunction
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
class myUDAF extends UserDefinedAggregateFunction {
// inputSchema for the function
override def inputSchema: StructType = {
new StructType().add("val", DoubleType, nullable = true)
}
//Schema for the inner UDAF buffer, in the product case, you just need an accumulator
override def bufferSchema: StructType = StructType(StructField("accumulated", DoubleType) :: Nil)
//OutputDataType
override def dataType: DataType = DoubleType
override def deterministic: Boolean = true
//Initicla buffer value 1 for product
override def initialize(buffer: MutableAggregationBuffer) = buffer(0) = 1.0
//How to update the buffer, for product you just need to perform a product between the two elements (buffer & input)
override def update(buffer: MutableAggregationBuffer, input: Row) = {
buffer(0) = buffer.getAs[Double](0) * input.getAs[Double](0)
}
//Merge results with the previous buffered value (product as well here)
override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
buffer1(0) = buffer1.getAs[Double](0) * buffer2.getAs[Double](0)
}
//Function on how to return the value
override def evaluate(buffer: Row) = buffer.getAs[Double](0)
}
Then you can register the function as you would do with any other UDF:
spark.udf.register("prod", new myUDAF)
RESULT
scala> spark.sql("Select prod(val) from table").show
+-----------+
|myudaf(val)|
+-----------+
| 6.0|
+-----------+
You can find further documentation here

How to know data type of result of evaluation based on a schema

I have a an sql expression and input schema type. Based on these two information it seems possible to deduct the result type when the expression is applied to a data frame with the schema. How can I know result type without explicitly creating a data frame with the given schema.
I basically want to complete following method:
def resultType(sqlExpression: String, schema: StructType): DataType = {
}
One way to achieve it is:
def createEmptyDataFrame(schema: StructType): DataFrame = {
val spark = SparkSession.builder().getOrCreate()
val dummyRDD = spark.sparkContext.parallelize(Seq(Row.empty))
spark.createDataFrame(dummyRDD, schema)
}
def resultType(sqlExpression: String, schema: StructType): DataType = {
val df = createEmptyDataFrame(schema)
val result = df.withColumn("temp", expr(sqlExpression))
result.schema("temp").dataType
}
How can achieve the same without creating an empty dataframe?

Cannot evaluate ML model on Structured Streaming, because RDD transformations and actions are invoked inside other transformations

This is a well-known limitation[1] of Structured Streaming that I'm trying to get around using a custom sink.
In what follows, modelsMap is a map of string keys to org.apache.spark.mllib.stat.KernelDensity models
and
streamingData is a streaming dataframe org.apache.spark.sql.DataFrame = [id1: string, id2: string ... 6 more fields]
I'm trying to evaluate each row of streamingData against its corresponding model from modelsMap, enhance each row with prediction, and write to Kakfa.
An obvious way would be .withColumn, using a UDF to predict, and write using kafka sink.
But this is illegal because:
org.apache.spark.SparkException: This RDD lacks a SparkContext. It
could happen in the following cases: (1) RDD transformations and
actions are NOT invoked by the driver, but inside of other
transformations; for example, rdd1.map(x => rdd2.values.count() * x) is
invalid because the values transformation and count action cannot be
performed inside of the rdd1.map transformation. For more information,
see SPARK-5063.
I get the same error with a custom sink that implements forEachWriter which was a bit unexpected:
import org.apache.spark.sql.ForeachWriter
import java.util.Properties
import kafkashaded.org.apache.kafka.clients.producer._
class customSink(topic:String, servers:String) extends ForeachWriter[(org.apache.spark.sql.Row)] {
val kafkaProperties = new Properties()
kafkaProperties.put("bootstrap.servers", servers)
kafkaProperties.put("key.serializer", "kafkashaded.org.apache.kafka.common.serialization.StringSerializer")
kafkaProperties.put("value.serializer", "kafkashaded.org.apache.kafka.common.serialization.StringSerializer")
val results = new scala.collection.mutable.HashMap[String, String]
var producer: KafkaProducer[String, String] = _
def open(partitionId: Long,version: Long): Boolean = {
producer = new KafkaProducer(kafkaProperties)
true
}
def process(value: (org.apache.spark.sql.Row)): Unit = {
var prediction = Double.NaN
try {
val id1 = value(0)
val id2 = value(3)
val id3 = value(5)
val time_0 = value(6).asInstanceOf[Double]
val key = f"$id1/$id2/$id3"
var model = modelsMap(key)
println("Looking up key: ",key)
var prediction = Double.NaN
prediction = model.estimate(Array[Double](time_0))(0)
println(prediction)
} catch {
case e: NoSuchElementException =>
val prediction = Double.NaN
println(prediction)
}
producer.send(new ProducerRecord(topic, value.mkString(",")+","+prediction.toString))
}
def close(errorOrNull: Throwable): Unit = {
producer.close()
}
}
val writer = new customSink("<broker>", "<topic>")
val query = streamingData
.writeStream
.foreach(writer)
.outputMode("update")
.trigger(Trigger.ProcessingTime(10.seconds))
.start()
model.estimate is implemented under the hood using aggregate in mllib.stat, and there's no way to get around it.
What changes do I make? (I could collect each batch and execute a for loop using driver, but then I'm not using spark the way it's intended)
References:
https://www.slideshare.net/databricks/realtime-machine-learning-analytics-using-structured-streaming-and-kinesis-firehose slide#11 mentions limitations
https://www.oreilly.com/learning/extend-structured-streaming-for-spark-ml
https://github.com/holdenk/spark-structured-streaming-ml (proposed solution)
https://issues.apache.org/jira/browse/SPARK-16454
https://issues.apache.org/jira/browse/SPARK-16407

Does spark.sql.Dataset.groupByKey support window operations like groupBy?

In Spark Structured Streaming, we can do window operations on event time with groupBy like:
import spark.implicits._
val words = ... // streaming DataFrame of schema { timestamp: Timestamp, word: String }
// Group the data by window and word and compute the count of each group
val windowedCounts = words.groupBy(
window($"timestamp", "10 minutes", "5 minutes"),
$"word"
).count()
Does groupByKey also supports window operations?
Thanks.
It is possible to write a helper function that makes it easier to generate a time-windowing function to give to groupByKey.
object windowing {
import java.sql.Timestamp
import java.time.Instant
/** given:
* a row type R
* a function from R to the Timestamp
* a windowing width in seconds
* return: a function that allows groupByKey to do windowing
*/
def windowBy[R](f:R=>Timestamp, width: Int) = {
val w = width.toLong * 1000L
(row: R) => {
val tsCur = f(row)
val msCur = tsCur.getTime()
val msLB = (msCur / w) * w
val instLB = Instant.ofEpochMilli(msLB)
val instUB = Instant.ofEpochMilli(msLB+w)
(Timestamp.from(instLB), Timestamp.from(instUB))
}
}
}
And in your example, it might be used like this:
case class MyRow(timestamp: Timestamp, word: String)
val windowBy60 = windowing.windowBy[MyRow](_.timestamp, 60)
// count words by time window
words.as[MyRow]
.groupByKey(windowBy60)
.count()
Or counting by (window, word) pairs:
words.as[MyRow]
.groupByKey(row => (windowBy60(row), row.word))
.count()
Yes and no. It cannot be used directly, as it is applicable only to SQL / DataFrame API, but you can always extend the record with window field:
val dfWithWindow = df.withColumn("window", window(...)))
case class Window(start: java.sql.Timestamp. end: java.sql.Timestamp)
case class MyRecordWithWindow(..., window: Window)
and use it for grouping:
dfWithWindow.as[MyRecordWithWindow].groupByKey(_.window).mapGroups(...)

Demultiplexing RDD onto multiple ORC tables

I'm trying to convert data stored in S3 as JSON-per-line textfiles to a structured, columnar format like ORC or Parquet on S3.
The source files contain data of multiple schemes (eg. HTTP request, HTTP response, ...), which need to be parsed into different Spark Dataframes of the correct type.
Example schemas:
val Request = StructType(Seq(
StructField("timestamp", TimestampType, nullable=false),
StructField("requestId", LongType),
StructField("requestMethod", StringType),
StructField("scheme", StringType),
StructField("host", StringType),
StructField("headers", MapType(StringType, StringType, valueContainsNull=false)),
StructField("path", StringType),
StructField("sessionId", StringType),
StructField("userAgent", StringType)
))
val Response = StructType(Seq(
StructField("timestamp", TimestampType, nullable=false),
StructField("requestId", LongType),
StructField("contentType", StringType),
StructField("contentLength", IntegerType),
StructField("statusCode", StringType),
StructField("headers", MapType(keyType=StringType, valueType=StringType, valueContainsNull=false)),
StructField("responseDuration", DoubleType),
StructField("sessionId", StringType)
))
I got that part working fine, however trying to write out the data back to S3 as efficiently as possible seems to be an issue atm.
I tried 3 approaches:
muxPartitions from the silex project
caching the parsed S3 input and looping over it multiple times
making each scheme type a separate partition of the RDD
In the first case, the JVM ran out of memory and in the second one the machine ran out of disk space.
The third I haven't thoroughly tested yet, but this does not seem an efficient use of processing power (as only one node of the cluster (the one on which this particular partition is) would actually be writing the data back out to S3).
Relevant code:
val allSchemes = Schemes.all().keys.toArray
if (false) {
import com.realo.warehouse.multiplex.implicits._
val input = readRawFromS3(inputPrefix) // returns RDD[Row]
.flatMuxPartitions(allSchemes.length, data => {
val buffers = Vector.tabulate(allSchemes.length) { j => ArrayBuffer.empty[Row] }
data.foreach {
logItem => {
val schemeIndex = allSchemes.indexOf(logItem.logType)
if (schemeIndex > -1) {
buffers(schemeIndex).append(logItem.row)
}
}
}
buffers
})
allSchemes.zipWithIndex.foreach {
case (schemeName, index) =>
val rdd = input(index)
writeColumnarToS3(rdd, schemeName)
}
} else if (false) {
// Naive approach
val input = readRawFromS3(inputPrefix) // returns RDD[Row]
.persist(StorageLevel.MEMORY_AND_DISK)
allSchemes.foreach {
schemeName =>
val rdd = input
.filter(x => x.logType == schemeName)
.map(x => x.row)
writeColumnarToS3(rdd, schemeName)
}
input.unpersist()
} else {
class CustomPartitioner extends Partitioner {
override def numPartitions: Int = allSchemes.length
override def getPartition(key: Any): Int = allSchemes.indexOf(key.asInstanceOf[String])
}
val input = readRawFromS3(inputPrefix)
.map(x => (x.logType, x.row))
.partitionBy(new CustomPartitioner())
.map { case (logType, row) => row }
.persist(StorageLevel.MEMORY_AND_DISK)
allSchemes.zipWithIndex.foreach {
case (schemeName, index) =>
val rdd = input
.mapPartitionsWithIndex(
(i, iter) => if (i == index) iter else Iterator.empty,
preservesPartitioning = true
)
writeColumnarToS3(rdd, schemeName)
}
input.unpersist()
}
Conceptually, I think the code should have 1 output DStream per scheme type and the input RDD should pick 'n place each processed item onto the correct DStream (with batching for better throughput).
Does anyone have any pointers as to how to implement this? And/or is there a better way of tackling this problem?
Given that the input is a json, you can read it into a dataframe of strings (each line is a single string). Then you can extract the type from each json (either by using a UDF or by using a function such as get_json_object or json_tuple).
Now you have two columns: The type and the original json. You can now use partitionBy dataframe option when writing the dataframe. This would result in a directory for each type and the content of the directory would include the original jsons.
Now you can read each type with its own schema.
You can also do a similar thing with RDD using a map which turns the input rdd into a pair rdd with the key being the type and the value being the json converted to the target schema. Then you can use partitionBy and map partition to save each partition to a file or you can use reduce by key to write to different files (e.g. by using the key to set the filename).
You could also take a look at Write to multiple outputs by key Spark - one Spark job
Note that I assumed here that the goal is to split to file. Depending on your specific use case, other options might be viable. For example, if your different schemas are close enough, you can create a super schema which encompasses all of them and create the dataframe directly from that. Then you can either work on the dataframe directly or use the dataframe partitionBy to write the different subtypes to different directories (but this time already saved to parquet).
This is what I came up with eventually:
I use a custom partitioner to partition the data based on their scheme plus the hashcode of the row.
The reasoning here is that we want to be able to only process certain partitions, yet still allow all nodes to participate (for performance reasons). So we don't spread the data over just 1 partition, but over X partitions (with X being the number of nodes times 2, in this example).
Then for each scheme, we prune the partitions we don't need and thus we will only process the ones we do.
Code example:
def process(date : ReadableInstant, schemesToProcess : Array[String]) = {
// Tweak this based on your use case
val DefaultNumberOfStoragePartitions = spark.sparkContext.defaultParallelism * 2
class CustomPartitioner extends Partitioner {
override def numPartitions: Int = schemesToProcess.length * DefaultNumberOfStoragePartitions
override def getPartition(key: Any): Int = {
// This is tightly coupled with how `input` gets transformed below
val (logType, rowHashCode) = key.asInstanceOf[(String, Int)]
(schemesToProcess.indexOf(logType) * DefaultNumberOfStoragePartitions) + Utils.nonNegativeMod(rowHashCode, DefaultNumberOfStoragePartitions)
}
/**
* Internal helper function to retrieve all partition indices for the given key
* #param key input key
* #return
*/
private def getPartitions(key: String): Seq[Int] = {
val index = schemesToProcess.indexOf(key) * DefaultNumberOfStoragePartitions
index until (index + DefaultNumberOfStoragePartitions)
}
/**
* Returns an RDD which only traverses the partitions for the given key
* #param rdd base RDD
* #param key input key
* #return
*/
def filterRDDForKey[T](rdd: RDD[T], key: String): RDD[T] = {
val partitions = getPartitions(key).toSet
PartitionPruningRDD.create(rdd, x => partitions.contains(x))
}
}
val partitioner = new CustomPartitioner()
val input = readRawFromS3(date)
.map(x => ((x.logType, x.row.hashCode), x.row))
.partitionBy(partitioner)
.persist(StorageLevel.MEMORY_AND_DISK_SER)
// Initial stage: caches the processed data + gets an enumeration of all schemes in this RDD
val schemesInRdd = input
.map(_._1._1)
.distinct()
.collect()
// Remaining stages: for each scheme, write it out to S3 as ORC
schemesInRdd.zipWithIndex.foreach {
case (schemeName, index) =>
val rdd = partitioner.filterRDDForKey(input, schemeName)
.map(_._2)
.coalesce(DefaultNumberOfStoragePartitions)
writeColumnarToS3(rdd, schemeName)
}
input.unpersist()
}

Resources