Spark Structured Streaming Kafka Microbatch count - apache-spark

I am using Spark Structured Streaming to read records from a Kafka topic, and I want to count the number of records received in each micro-batch of the readStream.
This is a snippet:
val kafka_df = sparkSession
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "host:port")
.option("subscribe", "test-count")
.load()
I understand from the docs that kafka_df will be lazily evaluated when a streamingQuery is started (to come next), and as it is evaluated, it holds a micro-batch. So, I figured doing a groupBy on topic followed by a count should work.
Like this:
val counter = kafka_df
.groupBy("topic")
.count()
Now, to evaluate all of this, we need a streaming query, let's say a console sink query that prints to the console. And this is where I see the problem: a streaming query on an aggregated DataFrame, such as counter, works only with outputMode complete/update and not with append.
This effectively means that the count reported by the streaming query is cumulative.
Like this:
val counter_json = counter.toJSON //to jsonify
val count_query = counter_json
.writeStream.outputMode("update")
.format("console")
.start() // kicks off lazy evaluation
.awaitTermination()
In a controlled set up, where:
actual Published records: 1500
actual Received micro-batches : 3
actual Received records: 1500
The count of each microbatch is supposed to be 500, so I hoped (wished) that the query prints to console:
topic: test-count
count: 500
topic: test-count
count: 500
topic: test-count
count: 500
But it doesn't. It actually prints:
topic: test-count
count: 500
topic: test-count
count: 1000
topic: test-count
count: 1500
This I understand is because of 'outputMode' complete/update (cumulative)
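One way to recover per-batch numbers from the cumulative totals that update/complete mode reports is to difference consecutive values. The following is a minimal Python sketch of that arithmetic only, assuming the totals for one topic have been collected in arrival order; the helper name per_batch_counts is hypothetical and not part of Spark.

```python
def per_batch_counts(cumulative_totals):
    """Turn running totals into per-batch increments by differencing."""
    counts = []
    previous = 0
    for total in cumulative_totals:
        counts.append(total - previous)
        previous = total
    return counts


# Totals as printed by the cumulative console query in the question.
print(per_batch_counts([500, 1000, 1500]))  # [500, 500, 500]
```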
My question: Is it possible to accurately get the count of each micro-batch in Spark-Kafka Structured Streaming?
From the docs, I found out about the watermark approach (to support append):
val windowedCounts = kafka_df
.withWatermark("timestamp", "10 seconds")
.groupBy(window($"timestamp", "10 seconds", "10 seconds"), $"topic")
.count()
val console_query = windowedCounts
.writeStream
.outputMode("append")
.format("console")
.start()
.awaitTermination()
But the results of this console_query are inaccurate and appear to be way off the mark.
TL;DR - Any thoughts on accurately counting the records in Spark-Kafka micro-batch would be appreciated.

If you want to process only a specific number of records with every trigger within a Structured Streaming application using Kafka, use the option maxOffsetsPerTrigger:
val kafka_df = sparkSession
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "host:port")
.option("subscribe", "test-count")
.option("maxOffsetsPerTrigger", 500)
.load()
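To see how such a cap relates to the 1500-record test from the question, the batch sizes can be worked out by hand. This is a rough Python sketch of the arithmetic, not of Spark's scheduling: it assumes all records are already in the topic when the query starts, that the query reads from the beginning, and that the limit is applied exactly. The function name batch_sizes is made up for illustration.

```python
def batch_sizes(total_records, max_offsets_per_trigger):
    """Split a backlog into consecutive batches capped at the given limit."""
    sizes = []
    remaining = total_records
    while remaining > 0:
        size = min(remaining, max_offsets_per_trigger)
        sizes.append(size)
        remaining -= size
    return sizes


print(batch_sizes(1500, 500))  # [500, 500, 500]
print(batch_sizes(1200, 500))  # [500, 500, 200]
```

With several partitions the limit is a total that is divided across partitions, so individual batches may not hit the cap exactly.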

"TL;DR - Any thoughts on accurately counting the records in Spark-Kafka micro-batch would be appreciated."
You can count the records fetched from Kafka by using a StreamingQueryListener (ScalaDocs).
This allows you to print out the exact number of rows that were received from the subscribed Kafka topic. The onQueryProgress API gets called during every micro-batch and contains lots of useful meta information on your query. If no data is flowing into the query the onQueryProgress is called every 10 seconds. Below is a simple example that prints out the number of input messages.
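import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener.{QueryProgressEvent, QueryStartedEvent, QueryTerminatedEvent}

spark.streams.addListener(new StreamingQueryListener() {
  override def onQueryStarted(queryStarted: QueryStartedEvent): Unit = {}
  override def onQueryTerminated(queryTerminated: QueryTerminatedEvent): Unit = {}
  // Called after each micro-batch; numInputRows is the number of rows read in that batch
  override def onQueryProgress(queryProgress: QueryProgressEvent): Unit = {
    println("NumInputRows: " + queryProgress.progress.numInputRows)
  }
})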
In case you are validating the performance of your Structured Streaming query, it is usually best to keep an eye on the following two metrics:
queryProgress.progress.inputRowsPerSecond
queryProgress.progress.processedRowsPerSecond
In case input is higher than processed you might increase resources for your job or reduce the maximum limit (by reducing the readStream option maxOffsetsPerTrigger). If processed is higher, you may want to increase this limit.
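The rule of thumb above can be written down as a small decision function. This is only an illustrative Python sketch: the function name tuning_hint and the returned messages are invented here, and the two arguments stand for the inputRowsPerSecond and processedRowsPerSecond values reported in the query progress.

```python
def tuning_hint(input_rows_per_second, processed_rows_per_second):
    """Compare arrival rate with processing rate and suggest a direction."""
    if input_rows_per_second > processed_rows_per_second:
        # Data arrives faster than it is processed.
        return "falling behind: add resources or lower maxOffsetsPerTrigger"
    if processed_rows_per_second > input_rows_per_second:
        # The query keeps up with room to spare.
        return "headroom: maxOffsetsPerTrigger could be raised"
    return "balanced"


print(tuning_hint(1000.0, 800.0))
print(tuning_hint(500.0, 900.0))
```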

Related

Change filter/where condition when restarting a Structured Streaming query reading data from Delta Table

In Structured Streaming, will the checkpoints keep track of which data has already been processed from a Delta Table?
def fetch_data_streaming(source_table: str):
    print("Fetching now")
    streamingInputDF = (
        spark
        .readStream
        .format("delta")
        .option("maxBytesPerTrigger", 1024)
        .table(source_table)
        .where("measurementId IN (1351,1350)")
        .where("year >= '2021'")
    )
    query = (
        streamingInputDF
        .writeStream
        .outputMode("append")
        .option("checkpointLocation", "/streaming_checkpoints/5")
        .foreachBatch(customWriter)
        .start()
        .awaitTermination()
    )
    return query

def customWriter(batchDF, batchId):
    print(batchId)
    print(batchDF.count())
    batchDF.show(10)
    length = batchDF.count()
    print("batchId,batch size:", batchId, length)
If I change the where clause in streamingInputDF to add more measurementId values, the Structured Streaming job doesn't always acknowledge the change and fetch the new data values. Sometimes it continues to run as if nothing has changed, whereas at other times it starts fetching the new values.
Isn't the checkpoint supposed to identify the change?
Edit: Schema of delta table:
col_name       data_type
measurementId  int
year           int
time           timestamp
q              smallint
v              string
"In Structured Streaming, will the checkpoints keep track of which data has already been processed?"
Yes, the Structured Streaming job will store the read version of the Delta table in its checkpoint files to avoid producing duplicates.
Within the checkpoint directory in the folder "offsets", you will see that Spark stored the progress per batchId. For example it will look like below:
v1
{"batchWatermarkMs":0,"batchTimestampMs":1619695775288,"conf":[...]}
{"sourceVersion":1,"reservoirId":"d910a260-6aa2-4a7c-9f5c-1be3164127c0","reservoirVersion":2,"index":2,"isStartingVersion":true}
Here, the important part is the "reservoirVersion":2 which tells you that the streaming job has consumed all data from the Delta Table as of version 2.
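If you want to check this value without opening the file by hand, the offset file can be parsed line by line. The following Python sketch is a hedged illustration that only assumes the layout shown above (a version header followed by JSON lines); the helper name reservoir_version is hypothetical, and lines that are not valid JSON, such as the one with the elided conf, are simply skipped.

```python
import json


def reservoir_version(offset_file_text):
    """Return the reservoirVersion of the last JSON line that carries one, else None."""
    version = None
    for line in offset_file_text.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue  # skips the "v1" header and blank lines
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # e.g. the metadata line whose conf is elided as [...]
        if "reservoirVersion" in entry:
            version = entry["reservoirVersion"]
    return version


sample = """v1
{"batchWatermarkMs":0,"batchTimestampMs":1619695775288,"conf":[...]}
{"sourceVersion":1,"reservoirId":"d910a260-6aa2-4a7c-9f5c-1be3164127c0","reservoirVersion":2,"index":2,"isStartingVersion":true}"""

print(reservoir_version(sample))  # 2
```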
If you restart your Structured Streaming query with an additional filter condition, the new condition will therefore not be applied to historic records but only to those that were added to the Delta Table after version 2.
In order to see this behavior in action you can use below code and analyse the content in the checkpoint files.
val deltaPath = "file:///tmp/delta/table"
val checkpointLocation = "file:///tmp/checkpoint/"
// run the following two lines once
val deltaDf = Seq(("1", "foo1"), ("2", "foo2"), ("3", "foo2")).toDF("id", "value")
deltaDf.write.format("delta").mode("append").save(deltaPath)
// run this code for the first time, then add filter condition, then run again
val query = spark.readStream
.format("delta")
.load(deltaPath)
.filter(col("id").isin("1")) // in the second run add "2"
.writeStream
.format("console")
.outputMode("append")
.option("checkpointLocation", checkpointLocation)
.start()
query.awaitTermination()
Now, if you append some more data to the Delta table while the streaming query is shut down and then restart it with the new filter condition, the condition will be applied to the new data.
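The effect can also be reasoned about with a toy model that does not involve Spark or Delta at all. In the Python sketch below, a "table" is a dictionary from version number to the rows added in that version, and the checkpoint is reduced to a single last-processed version. Both simplifications, the function name run_stream, and the rows added in version 1 are assumptions made purely for illustration.

```python
def run_stream(table_versions, last_processed_version, allowed_ids):
    """Emit rows from versions newer than the checkpoint that pass the filter."""
    emitted = []
    for version in sorted(table_versions):
        if version <= last_processed_version:
            continue  # already covered by the checkpoint, so never re-read
        for row_id, value in table_versions[version]:
            if row_id in allowed_ids:
                emitted.append((row_id, value))
    newest = max(table_versions) if table_versions else last_processed_version
    return emitted, max(newest, last_processed_version)


# First run: only id "1" passes the filter.
table = {0: [("1", "foo1"), ("2", "foo2"), ("3", "foo2")]}
first, checkpoint = run_stream(table, -1, {"1"})
print(first)  # [('1', 'foo1')]

# New data arrives while the query is stopped, then the filter is widened.
table[1] = [("2", "foo4"), ("1", "foo5")]
second, checkpoint = run_stream(table, checkpoint, {"1", "2"})
print(second)  # [('2', 'foo4'), ('1', 'foo5')]
```

The row ("2", "foo2") from version 0 never shows up in the second run, because that version lies behind the checkpoint.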

spark structured streaming operation duration

I am running a structured streaming job with kafka source.
spark: 2.4.7
python: 3.6.8
from pyspark.sql import SparkSession
from pyspark.ml.classification import GBTClassificationModel

spark = SparkSession.builder.getOrCreate()
ds = (spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", kafka_servers)
.option("subscribe", topic_name)
.load())
# data preprocessing
ds = ...
model = GBTClassificationModel.load(model_path)
ds = model.transform(ds)
query = (ds.writeStream
.outputMode("update")
.format("kafka")
.option("kafka.bootstrap.servers", kafka_servers)
.option("checkpointLocation", checkpoint_dir)
.option("topic", output_topic)
.trigger(processingTime="0 seconds")
.start())
query.awaitTermination()
The spark web UI displays the following indicators:
AddBatch: 1955.0
GetBatch: 1.0
GetOffset: 0.0
QueryPlanning: 3555.0
TriggerExecution: 2.0
WalCommit: 5569.0
undefined: 20.0
The following is a description of the indicators on the spark official website:
Operation Duration. The amount of time taken to perform various operations in milliseconds. The tracked operations are listed as follows.
addBatch: Time taken to read the micro-batch’s input data from the sources, process it, and write the batch’s output to the sink. This should take the bulk of the micro-batch’s time.
getBatch: Time taken to prepare the logical query to read the input of the current micro-batch from the sources.
latestOffset & getOffset: Time taken to query the maximum available offset for this source.
queryPlanning: Time taken to generate the execution plan.
walCommit: Time taken to write the offsets to the metadata log.
Why are WalCommit and QueryPlanning much larger than AddBatch?
Thanks!

Read whole Kafka topic as spark dataframe in offsets batches

I am trying to read all data in a kafka topic in batches (reading between two offset values) and load them to spark dataframes, without using readStream in spark streaming.
My idea is:
1. I first get the total number of records in the topic by finding the maximum offset value.
2. I define step, namely the number of records per batch.
3. With a for loop, I read each batch from the Kafka topic by setting the startingOffsets and endingOffsets parameters.
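The batching plan in these steps boils down to splitting the range from 0 to the maximum offset into fixed-size chunks. Below is a minimal Python sketch of just that arithmetic, assuming a single partition and treating the end of each range as exclusive. The function name offset_ranges is hypothetical. Unlike a plain i + step, the last range is clamped so it does not run past the maximum offset.

```python
def offset_ranges(max_offset, step):
    """Return (start, end) pairs covering [0, max_offset); each end is exclusive."""
    ranges = []
    for start in range(0, max_offset, step):
        ranges.append((start, min(start + step, max_offset)))
    return ranges


print(offset_ranges(2500, 1000))  # [(0, 1000), (1000, 2000), (2000, 2500)]
```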
This is my code (for a topic with a single partition) to print the count in each batch:
import scala.sys.process._

val maxOffsetValue = {
Process(s"kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list localhost:9092 --topic topicname")
.!!
.split(":")
.last
.trim
.toInt
}
val step = 1000
for (i <- 0 until maxOffsetValue by step) {
val df: DataFrame = {
spark
.read
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "topicname")
.option("startingOffsets", s"""{"topicname":{"0":${i}}}""")
.option("endingOffsets", s"""{"topicname":{"0":${i+step}}}""")
.load()
.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.as[(String, String)]
.select(from_json(col("value"), dataSchema) as "data")
.select("data.*")
}
println(s"i: ${i}, i+step: ${i+step}, count: ${df.count()}")
}
However, the JSON format for startingOffsets and endingOffsets does not seem flexible, as apparently an offset needs to be specified for every partition, e.g. something like {"topicname":{"0":${i}, "1":${i}}} if there are two partitions.
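Rather than writing that JSON by hand for every partition, the string can be generated from a mapping of partition to offset. The Python sketch below only illustrates building the string in the shape shown above. The helper name offsets_json is made up, and how the per-partition offsets are obtained is left open.

```python
import json


def offsets_json(topic, partition_offsets):
    """Build a startingOffsets/endingOffsets style JSON string for one topic."""
    per_partition = {
        str(partition): offset
        for partition, offset in sorted(partition_offsets.items())
    }
    return json.dumps({topic: per_partition})


print(offsets_json("topicname", {0: 0, 1: 0}))
# {"topicname": {"0": 0, "1": 0}}
```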
My questions are:
Is there a better way to achieve the same results, possibly that can be extended directly to a multi partition topic?
Is there a way to read the maximum offset without using a shell command?

how to determine where the Kafka consumption is made on Spark

We tried two MLlib transformers for consuming from Kafka: one using a Spark Kafka batch query (the Structured Streaming Kafka source in batch mode), and one using normal Kafka consumers. The normal consumers read into a list that gets converted to a DataFrame.
We created two MLlib pipelines, each starting with an empty DataFrame. The first transformer of each pipeline did the reading from Kafka, and we had one pipeline per Kafka transformer type.
We then ran the pipelines: first with normal Kafka consumers, second with the Spark consumer, third with normal consumers again.
The Spark config had 4 executors, with 1 core each:
"spark.executor.cores": "1", "spark.executor.instances": 4,
Questions are:
A. Where is the consumption made: on the executors or on the driver? According to the driver UI, it looks like in both cases the executors did all the work; the driver did not pass any data and 4 executors got created.
B. Why do we have a different number of executors running? In the first run, with normal consumers, we see 4 executors working; in the second run, with the Spark Kafka connector, 1 executor; in the third run, with normal consumers, 1 executor but 2 cores.
You'll see the driver's UI attached at the bottom.
this is the relevant code:
Normal Consumer:
var kafkaConsumer: KafkaConsumer[String, String] = null
val readMessages = () => {
for (record <- records) {
recordList.append(record.value())
}
}
kafkaConsumer.subscribe(util.Arrays.asList($(topic)))
readMessages()
var df = recordList.toDF
kafkaConsumer.close()
val json_schema =
df.sparkSession.read.json(df.select("value").as[String]).schema
df = df.select(from_json(col("value"), json_schema).as("json"))
df = df.select(col("json.*"))
Spark consumer:
val records = dataset
.sparkSession
.read
.format("kafka")
.option("kafka.bootstrap.servers", $(url))
.option("subscribe", $(this.topic))
.option("kafkaConsumer.pollTimeoutMs", s"${$(timeoutMs)}")
.option("startingOffsets", $(startingOffsets))
.option("endingOffsets", $(endingOffsets))
.load
.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.as[(String, String)]
OmnixLogger.warn(uid, "Executor ID AFTER polling: " + SparkEnv.get.executorId)
val json_schema = records.sparkSession.read.json(records.select("value").as[String]).schema
var df: DataFrame = records.select(from_json(col("value"), json_schema).as("json"))
df = df.select(col("json.*"))

Why a new batch is triggered without getting any new offsets in streaming source?

I have multiple spark structured streaming jobs and the usual behaviour that I see is that a new batch is triggered only when there are any new offsets in Kafka which is used as source to create streaming query.
But when I run this example, which demonstrates arbitrary stateful operations using mapGroupsWithState, I see that a new batch is triggered even if there is no new data in the streaming source. Why is that, and can it be avoided?
Update-1
I modified the above example code and removed the state-related operations such as updating/removing the state. The function simply outputs zero. But a batch is still triggered every 10 seconds without any new data on the netcat server.
import java.sql.Timestamp
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming._
object Stateful {
def main(args: Array[String]): Unit = {
val host = "localhost"
val port = "9999"
val spark = SparkSession
.builder
.appName("StructuredSessionization")
.master("local[2]")
.getOrCreate()
import spark.implicits._
// Create DataFrame representing the stream of input lines from connection to host:port
val lines = spark.readStream
.format("socket")
.option("host", host)
.option("port", port)
.option("includeTimestamp", true)
.load()
// Split the lines into words, treat words as sessionId of events
val events = lines
.as[(String, Timestamp)]
.flatMap { case (line, timestamp) =>
line.split(" ").map(word => Event(sessionId = word, timestamp))
}
val sessionUpdates = events
.groupByKey(event => event.sessionId)
.mapGroupsWithState[SessionInfo, Int](GroupStateTimeout.ProcessingTimeTimeout) {
case (sessionId: String, events: Iterator[Event], state: GroupState[SessionInfo]) =>
0
}
val query = sessionUpdates
.writeStream
.outputMode("update")
.trigger(Trigger.ProcessingTime("10 seconds"))
.format("console")
.start()
query.awaitTermination()
}
}
case class Event(sessionId: String, timestamp: Timestamp)
case class SessionInfo(
numEvents: Int,
startTimestampMs: Long,
endTimestampMs: Long)
The reason for the empty batches showing up is the usage of Timeouts within the mapGroupsWithState call.
The book "Learning Spark 2.0" says:
"The next micro-batch will call the function on this timed-out key even if there is no data for that key in that micro-batch. [...] Since the timeouts are processed during the micro-batches, the timing of their execution is imprecise and depends heavily on the trigger interval [...]."
Because you set the timeout to GroupStateTimeout.ProcessingTimeTimeout, it aligns with the trigger interval of your query, which is 10 seconds. The alternative would be to set the timeout based on event time (i.e. GroupStateTimeout.EventTimeTimeout).
The ScalaDocs on GroupState provide some more details:
When the timeout occurs for a group, the function is called for that group with no values, and GroupState.hasTimedOut() set to true.
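The behaviour described in that quote can be mimicked with a toy model in plain Python. This is not how Spark implements it. It is only a sketch, with the invented name micro_batch_calls, of which function calls a single micro-batch would make if timeouts are checked once per batch.

```python
def micro_batch_calls(now_ms, new_events, timeouts):
    """List the (key, values, has_timed_out) calls one micro-batch would make.

    new_events: dict key -> list of values that arrived in this batch
    timeouts:   dict key -> timestamp in ms at which the key's timeout expires
    """
    # Keys with new data are called with their values and has_timed_out=False.
    calls = [(key, values, False) for key, values in new_events.items()]
    # Keys without new data whose timeout has expired are called with no values.
    for key in sorted(timeouts):
        if key not in new_events and timeouts[key] <= now_ms:
            calls.append((key, [], True))
    return calls


# No new data at all, yet the batch still has work to do for key "a".
print(micro_batch_calls(20_000, {}, {"a": 15_000}))  # [('a', [], True)]
```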
