Why is a new batch triggered without any new offsets in the streaming source? - apache-spark

I have multiple Spark Structured Streaming jobs, and the usual behaviour I see is that a new batch is triggered only when there are new offsets in Kafka, which is used as the source for the streaming query.
But when I run this example, which demonstrates arbitrary stateful operations using mapGroupsWithState, I see that a new batch is triggered even if there is no new data in the streaming source. Why is that, and can it be avoided?
Update-1
I modified the above example code and removed the state-related operations (updating/removing state). The function simply returns zero. But a batch is still triggered every 10 seconds, even without any new data on the netcat server.
import java.sql.Timestamp

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming._

object Stateful {

  def main(args: Array[String]): Unit = {
    val host = "localhost"
    val port = "9999"

    val spark = SparkSession
      .builder
      .appName("StructuredSessionization")
      .master("local[2]")
      .getOrCreate()

    import spark.implicits._

    // Create DataFrame representing the stream of input lines from connection to host:port
    val lines = spark.readStream
      .format("socket")
      .option("host", host)
      .option("port", port)
      .option("includeTimestamp", true)
      .load()

    // Split the lines into words, treat words as sessionId of events
    val events = lines
      .as[(String, Timestamp)]
      .flatMap { case (line, timestamp) =>
        line.split(" ").map(word => Event(sessionId = word, timestamp))
      }

    val sessionUpdates = events
      .groupByKey(event => event.sessionId)
      .mapGroupsWithState[SessionInfo, Int](GroupStateTimeout.ProcessingTimeTimeout) {
        case (sessionId: String, events: Iterator[Event], state: GroupState[SessionInfo]) =>
          0
      }

    val query = sessionUpdates
      .writeStream
      .outputMode("update")
      .trigger(Trigger.ProcessingTime("10 seconds"))
      .format("console")
      .start()

    query.awaitTermination()
  }
}

case class Event(sessionId: String, timestamp: Timestamp)

case class SessionInfo(
    numEvents: Int,
    startTimestampMs: Long,
    endTimestampMs: Long)

The reason the empty batches show up is the use of timeouts within the mapGroupsWithState call.
According to the book "Learning Spark 2.0":
"The next micro-batch will call the function on this timed-out key even if there is no data for that key in that micro-batch. [...] Since the timeouts are processed during the micro-batches, the timing of their execution is imprecise and depends heavily on the trigger interval [...]."
Since you have set the timeout to GroupStateTimeout.ProcessingTimeTimeout, it aligns with your query's trigger interval of 10 seconds. The alternative would be to base the timeout on event time (i.e. GroupStateTimeout.EventTimeTimeout).
The ScalaDocs on GroupState provide some more details:
When the timeout occurs for a group, the function is called for that group with no values, and GroupState.hasTimedOut() set to true.
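If you keep the timeout but want those timeout-only invocations to do something useful (or nothing at all), you can branch on hasTimedOut inside the state function. Below is a minimal, hedged sketch that reuses the question's Event and SessionInfo types; the update logic, the placeholder timestamps and the 10-second timeout duration are purely illustrative:

val sessionUpdates = events
  .groupByKey(event => event.sessionId)
  .mapGroupsWithState[SessionInfo, Int](GroupStateTimeout.ProcessingTimeTimeout) {
    case (sessionId: String, events: Iterator[Event], state: GroupState[SessionInfo]) =>
      if (state.hasTimedOut) {
        // Invoked without new data because the timeout fired for this key.
        state.remove()
        0
      } else {
        // Invoked because new events arrived for this key.
        val numEvents = state.getOption.map(_.numEvents).getOrElse(0) + events.size
        state.update(SessionInfo(numEvents, 0L, 0L)) // timestamps left as placeholders
        state.setTimeoutDuration("10 seconds")       // re-arm the processing-time timeout
        numEvents
      }
  }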

Related

Spark Structured Streaming - testing one batch at a time

I'm trying to create a test for a custom MicroBatchReadSupport DataSource which I've implemented.
For that, I want to invoke one batch at a time, which will read the data using this DataSource (I've created appropriate mocks). I want to invoke a batch, verify that the correct data was read (currently by saving it to a memory sink and checking the output), and only then invoke the next batch and verify its output.
I couldn't find a way to invoke the batches one after the other.
If I use streamingQuery.processAllAvailable(), the batches are invoked one after the other, without allowing me to verify the output of each one separately. Using trigger(Trigger.Once()) doesn't help either, because it executes one batch and I can't continue to the next one.
Is there any way to do what I want?
Currently this is my basic code:
val dataFrame = sparkSession.readStream.format("my-custom-data-source").load()
val dsw: DataStreamWriter[Row] = dataFrame.writeStream
  .format("memory")
  .queryName("test_output")
val streamingQuery = dsw
  .start()
streamingQuery.processAllAvailable()
What I've ended up doing is setting up the test with a DataStreamWriter which runs with Trigger.Once() but saves its progress to a checkpoint. Each time we invoke dsw.start(), the new batch resumes from the latest offsets recorded in the checkpoint. I'm also saving the data into a global temp view, so I'm able to query the data in a similar way to using the memory sink. For that I'm using foreachBatch (which is only available since Spark 2.4).
This is in code:
val dataFrame = sparkSession.readStream.format("my-custom-data-source").load()
val dsw = getNewDataStreamWriter(dataFrame)

testFirstBatch(dsw)
testSecondBatch(dsw)

private def getNewDataStreamWriter(dataFrame: DataFrame) = {
  val checkpointTempDir = Files.createTempDirectory("tests").toAbsolutePath.toString
  val dsw: DataStreamWriter[Row] = dataFrame.writeStream
    .trigger(Trigger.Once())
    .option("checkpointLocation", checkpointTempDir)
    .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
      batchDF.createOrReplaceGlobalTempView("input_data")
    }
  dsw
}
And the actual test code for each batch (e.g. testFirstBatch) is:
val rows = processNextBatch(dsw)
assertResult(10)(rows.length)

private def processNextBatch(dsw: DataStreamWriter[Row]) = {
  val streamingQuery = dsw
    .start()
  streamingQuery.processAllAvailable()
  sparkSession.sql("select * from global_temp.input_data").collect()
}

Call a function with each element of a stream in Databricks

I have a streaming DataFrame in Databricks, and I want to perform an action on each element. On the net I found special-purpose methods, like writing it to the console or dumping it into memory, but I want to add some business logic and put some results into Redis.
To be more specific, this is how it would look in the non-streaming case:
val someDataFrame = Seq(
  ("key1", "value1"),
  ("key2", "value2"),
  ("key3", "value3"),
  ("key4", "value4")
).toDF()

def someFunction(keyValuePair: (String, String)) = {
  println(keyValuePair)
}

someDataFrame.collect.foreach(r => someFunction((r(0).toString, r(1).toString)))
But if someDataFrame is not a plain DataFrame but a streaming DataFrame (in my case coming from Kafka), the error message is this:
org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
Could anyone please help me solve this problem?
Some important notes:
I've read the relevant documentation, like Spark Streaming or Databricks Streaming and a few other descriptions as well.
I know that there must be something like start() and awaitTermination, but I don't know the exact syntax. The descriptions did not help.
It would take pages to list all the possibilities I tried, so I would rather not provide them.
I do not want to solve the specific problem of displaying the result, i.e. please do not provide a solution to this specific case. someFunction would look something like this:
val someData = readSomeExternalData()
if (condition containing keyValuePair and someData) {
doSomething(keyValuePair);
}
(The question "What is the purpose of ForeachWriter in Spark Structured Streaming?" does not provide a working example and therefore does not answer my question.)
Here is an example that uses foreachBatch to save every item to Redis via the streaming API.
Related to a previous question (DataFrame to RDD[(String, String)] conversion)
// import spark and spark-redis
import org.apache.spark._
import org.apache.spark.sql._
import org.apache.spark.streaming._
import org.apache.spark.sql.types._
import com.redislabs.provider.redis._

// schema of the csv files
val userSchema = new StructType()
  .add("name", "string")
  .add("age", "string")

// create a data stream reader from a dir with csv files
val csvDF = spark
  .readStream
  .format("csv")
  .option("sep", ";")
  .schema(userSchema)
  .load("./data") // directory where the CSV files are

// redis
val redisConfig = new RedisConfig(new RedisEndpoint("localhost", 6379))
implicit val readWriteConfig: ReadWriteConfig = ReadWriteConfig.Default

csvDF.map(r => (r.getString(0), r.getString(1))) // convert the DataFrame to a Dataset[(String, String)]
  .writeStream // create a data stream writer
  .foreachBatch((df, _) => sc.toRedisKV(df.rdd)(redisConfig)) // save each batch to Redis after converting it to an RDD
  .start // start processing
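If, as in the question, you need arbitrary per-element business logic rather than a bulk write, the same foreachBatch pattern can be adapted. The following is only a hedged sketch: streamDF stands for the streaming DataFrame from Kafka, while someCondition, doSomething and readSomeExternalData are the question's placeholders, not real APIs.

import spark.implicits._

// Hypothetical sketch: run arbitrary per-element logic inside foreachBatch.
streamDF
  .map(r => (r.getString(0), r.getString(1))) // extract a key/value pair per record
  .writeStream
  .foreachBatch { (batchDF: Dataset[(String, String)], batchId: Long) =>
    val someData = readSomeExternalData()          // load reference data once per batch
    batchDF.collect().foreach { keyValuePair =>    // note: collect() brings the batch to the driver
      if (someCondition(keyValuePair, someData)) { // placeholder condition
        doSomething(keyValuePair)                  // e.g. write the pair to Redis
      }
    }
  }
  .start()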
You can also call a simple user-defined function via foreachBatch in Structured Streaming.
Please try this; it will print 'hello world' for every micro-batch of messages received from the TCP socket:
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from pyspark.sql.functions import split

spark = SparkSession.builder.appName("StructuredNetworkWordCount").getOrCreate()

# Create DataFrame representing the stream of input lines from connection to localhost:9999
lines = spark.readStream.format("socket").option("host", "localhost").option("port", 9999).load()

# Split the lines into words
words = lines.select(
    explode(
        split(lines.value, " ")
    ).alias("word")
)

# Generate running word count
wordCounts = words.groupBy("word").count()

# Function invoked once per micro-batch (despite the name, it receives the whole batch DataFrame)
def process_row(df, epoch_id):
    # Write the batch to storage here; for now just print
    print('hello world')

# Start the query that calls process_row on every micro-batch
query = words.writeStream.foreachBatch(process_row).start()
# query = wordCounts.writeStream.outputMode("complete").format("console").start()

query.awaitTermination()

Spark Structured Streaming Kafka Microbatch count

I am using Spark Structured Streaming to read records from a Kafka topic; I intend to count the number of records received in each micro-batch of the Spark readStream.
This is a snippet:
val kafka_df = sparkSession
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:port")
  .option("subscribe", "test-count")
  .load()
I understand from the docs that kafka_df will be lazily evaluated when a streaming query is started (to come next), and as it is evaluated, it holds a micro-batch. So I figured that doing a groupBy on topic followed by a count should work.
Like this:
val counter = kafka_df
  .groupBy("topic")
  .count()
Now to evaluate all of this, we need a streaming query, let's say a console sink query to print it to the console. And this is where I see the problem. A streaming query on aggregated DataFrames, such as counter above, works only with outputMode complete/update and not with append.
This effectively means that the count reported by the streaming query is cumulative.
Like this:
val counter_json = counter.toJSON // to jsonify
val count_query = counter_json
  .writeStream
  .outputMode("update")
  .format("console")
  .start() // kicks off lazy evaluation
  .awaitTermination()
In a controlled setup, where:
actual published records: 1500
actual received micro-batches: 3
actual received records: 1500
The count for each micro-batch is supposed to be 500, so I hoped (wished) that the query would print to the console:
topic: test-count
count: 500
topic: test-count
count: 500
topic: test-count
count: 500
But it doesn't. It actually prints:
topic: test-count
count: 500
topic: test-count
count: 1000
topic: test-count
count: 1500
I understand this is because of the outputMode complete/update (cumulative).
My question: Is it possible to accurately get the count of each micro-batch in Spark-Kafka structured streaming?
From the docs, I found out about the watermark approach (to support append):
val windowedCounts = kafka_df
  .withWatermark("timestamp", "10 seconds")
  .groupBy(window($"timestamp", "10 seconds", "10 seconds"), $"topic")
  .count()

val console_query = windowedCounts
  .writeStream
  .outputMode("append")
  .format("console")
  .start()
  .awaitTermination()
But the results of this console_query are inaccurate and appear way off the mark.
TL;DR - Any thoughts on accurately counting the records in Spark-Kafka micro-batch would be appreciated.
If you want to process only a specific number of records with every trigger within a Structured Streaming application using Kafka, use the option maxOffsetsPerTrigger:
val kafka_df = sparkSession
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:port")
  .option("subscribe", "test-count")
  .option("maxOffsetsPerTrigger", 500)
  .load()
"TL;DR - Any thoughts on accurately counting the records in Spark-Kafka micro-batch would be appreciated."
You can count the records fetched from Kafka by using a StreamingQueryListener (ScalaDocs).
This allows you to print out the exact number of rows that were received from the subscribed Kafka topic. The onQueryProgress API gets called during every micro-batch and contains lots of useful meta information about your query. If no data is flowing into the query, onQueryProgress is called every 10 seconds. Below is a simple example that prints out the number of input messages.
spark.streams.addListener(new StreamingQueryListener() {
  override def onQueryStarted(queryStarted: QueryStartedEvent): Unit = {}
  override def onQueryTerminated(queryTerminated: QueryTerminatedEvent): Unit = {}
  override def onQueryProgress(queryProgress: QueryProgressEvent): Unit = {
    println("NumInputRows: " + queryProgress.progress.numInputRows)
  }
})
In case you are validating the performance of your Structured Streaming query, it is usually best to keep an eye on the following two metrics:
queryProgress.progress.inputRowsPerSecond
queryProgress.progress.processedRowsPerSecond
If the input rate is higher than the processing rate, you might increase the resources for your job or reduce the maximum limit (by reducing the readStream option maxOffsetsPerTrigger). If the processing rate is higher, you may want to increase this limit.
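As a small, hedged extension of the listener above (not part of the original answer), the same callback can log those two rates alongside the per-batch row count; all the fields used here belong to StreamingQueryProgress:

import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener.{QueryProgressEvent, QueryStartedEvent, QueryTerminatedEvent}

spark.streams.addListener(new StreamingQueryListener() {
  override def onQueryStarted(event: QueryStartedEvent): Unit = {}
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = {}
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    val p = event.progress
    // Per-batch row count plus the two throughput metrics discussed above
    println(s"batchId=${p.batchId} numInputRows=${p.numInputRows} " +
      s"inputRowsPerSecond=${p.inputRowsPerSecond} " +
      s"processedRowsPerSecond=${p.processedRowsPerSecond}")
  }
})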

How does Structured Streaming execute separate streaming queries (in parallel or sequentially)?

I'm writing a test application that consumes messages from Kafka topics and then pushes data into S3 and into RDBMS tables (the flow is similar to the one presented here: https://databricks.com/blog/2017/04/26/processing-data-in-apache-kafka-with-structured-streaming-in-apache-spark-2-2.html). So I read data from Kafka and then:
every message should be saved into S3
some messages should be saved to table A in an external database (based on a filter condition)
some other messages should be saved to table B in an external database (another filter condition)
So I have something like:
Dataset<Row> df = spark
  .readStream()
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1,topic2,topic3")
  .option("startingOffsets", "earliest")
  .load()
  .select(from_json(col("value").cast("string"), schema, jsonOptions).alias("parsed_value"))
(please notice that I'm reading from more than one Kafka topic).
Next I define the required datasets:
Dataset<Row> allMessages = df.select(.....)
Dataset<Row> messagesOfType1 = df.select() //some unique conditions applied on JSON elements
Dataset<Row> messagesOfType2 = df.select() //some other unique conditions
and now for each Dataset I create a query to start processing:
StreamingQuery s3Query = allMessages
  .writeStream()
  .format("parquet")
  .option("startingOffsets", "latest")
  .option("path", "s3_location")
  .start();

StreamingQuery firstQuery = messagesOfType1
  .writeStream()
  .foreach(new CustomForEachWiriterType1()) // class that extends ForeachWriter[T] and saves data into an external RDBMS table
  .start();

StreamingQuery secondQuery = messagesOfType2
  .writeStream()
  .foreach(new CustomForEachWiriterType2()) // class that extends ForeachWriter[T] and saves data into an external RDBMS table (may even be a different database than before)
  .start();
Now I'm wondering:
Will those queries be executed in parallel (or one after another in FIFO order, so that I should assign those queries to separate scheduler pools)?
Will those queries be executed in parallel?
Yes. These queries are going to be executed in parallel (on every trigger, which you did not specify, and hence they run as fast as possible).
Internally, when you execute start on a DataStreamWriter, you create a StreamExecution that in turn immediately creates the so-called daemon microBatchThread (quoted from the Spark source code below):
val microBatchThread =
  new StreamExecutionThread(s"stream execution thread for $prettyIdString") {
    override def run(): Unit = {
      // To fix call site like "run at <unknown>:0", we bridge the call site from the caller
      // thread to this micro batch thread
      sparkSession.sparkContext.setCallSite(callSite)
      runBatches()
    }
  }
You can see that every query runs in its own thread with the name:
stream execution thread for [prettyIdString]
You can check the separate threads using jstack or jconsole.
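To illustrate the consequence with a hedged sketch (this is not the question's Java code; it reuses the messagesOfType1/messagesOfType2 names and a console sink for brevity, and the checkpoint paths are hypothetical): each start() returns immediately and runs its query on its own stream execution thread, so the driver typically blocks on the StreamingQueryManager rather than on a single query.

val q1 = messagesOfType1.writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/chk-type1") // hypothetical checkpoint path
  .start()

val q2 = messagesOfType2.writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/chk-type2") // hypothetical checkpoint path
  .start()

// Both queries now run concurrently on their own threads;
// block the driver until any of them terminates.
spark.streams.awaitAnyTermination()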

How to load history data when starting a Spark Streaming process, and calculate running aggregations

I have some sales-related JSON data in my ElasticSearch cluster, and I would like to use Spark Streaming (with Spark 1.4.1) to dynamically aggregate incoming sales events from my eCommerce website via Kafka, to have a current view of each user's total sales (in terms of revenue and products).
What's not really clear to me from the docs I read is how I can load the history data from ElasticSearch upon the start of the Spark application, and how to calculate, for example, the overall revenue per user (based on the history and the incoming sales from Kafka).
I have the following (working) code to connect to my Kafka instance and receive the JSON documents:
import kafka.serializer.StringDecoder

import org.apache.spark.streaming._
import org.apache.spark.streaming.kafka._
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.sql.SQLContext

object ReadFromKafka {
  def main(args: Array[String]) {
    val checkpointDirectory = "/tmp"
    val conf = new SparkConf().setAppName("Read Kafka JSONs").setMaster("local[2]")
    val topicsSet = Array("tracking").toSet

    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(10))

    // Create a direct Kafka stream with brokers and topics
    val kafkaParams = Map[String, String]("metadata.broker.list" -> "localhost:9092")
    val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topicsSet)

    // Iterate
    messages.foreachRDD { rdd =>
      // If data is present, continue
      if (rdd.count() > 0) {
        // Create SQLContext and parse JSON
        val sqlContext = new SQLContext(sc)
        val trackingEvents = sqlContext.read.json(rdd.values)
        // Sample aggregation of the incoming data
        trackingEvents.groupBy("type").count().show()
      }
    }

    // Start the computation
    ssc.start()
    ssc.awaitTermination()
  }
}
I know that there's a plugin for ElasticSearch (https://www.elastic.co/guide/en/elasticsearch/hadoop/master/spark.html#spark-read), but it's not really clear to me how to integrate the read at startup with the streaming calculation, so that the history data is aggregated together with the streaming data.
Help is much appreciated! Thanks in advance.
RDDs are immutable, so after they are created you cannot add data to them, for example updating the revenue with new events.
What you can do is union the existing data with the new events to create a new RDD, which you can then use as the current total. For example...
var currentTotal: RDD[(Key, Value)] = ... // read from ElasticSearch

messages.foreachRDD { rdd =>
  currentTotal = currentTotal.union(rdd)
}
In this case we make currentTotal a var since it will be replaced by the reference to the new RDD when it gets unioned with the incoming data.
After the union you may want to perform some further operations such as reducing the values which belong to the same Key, but you get the picture.
If you use this technique note that the lineage of your RDDs will grow, as each newly created RDD will reference its parent. This can cause a stack overflow style lineage problem. To fix this you can call checkpoint() on the RDD periodically.
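A minimal, hedged sketch of that pattern (the (String, Double) key/value types, the revenueByUser DStream, the initialTotalsFromElasticSearch RDD and the every-10-batches interval are all illustrative placeholders, not the question's actual code):

import org.apache.spark.rdd.RDD

sc.setCheckpointDir("/tmp/checkpoints") // required before calling checkpoint()

var currentTotal: RDD[(String, Double)] = initialTotalsFromElasticSearch
var batchCount = 0L

revenueByUser.foreachRDD { rdd =>
  // Fold the new events into the running totals and collapse duplicate keys
  currentTotal = currentTotal.union(rdd).reduceByKey(_ + _)
  batchCount += 1
  // Periodically truncate the growing lineage
  if (batchCount % 10 == 0) {
    currentTotal.checkpoint()
    currentTotal.count() // force an action so the checkpoint is actually written
  }
}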
