Spark Structured Streaming operation duration

I am running a structured streaming job with kafka source.
spark: 2.4.7
python: 3.6.8
from pyspark.ml.classification import GBTClassificationModel
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
ds = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", kafka_servers)
      .option("subscribe", topic_name)
      .load())
# data preprocessing
ds = ...
model = GBTClassificationModel.load(model_path)
ds = model.transform(ds)
query = (ds.writeStream
         .outputMode("update")
         .format("kafka")
         .option("kafka.bootstrap.servers", kafka_servers)
         .option("checkpointLocation", checkpoint_dir)
         .option("topic", output_topic)
         .trigger(processingTime="0 seconds")
         .start())
query.awaitTermination()
The spark web UI displays the following indicators:
AddBatch: 1955.0
GetBatch: 1.0
GetOffset: 0.0
QueryPlanning: 3555.0
TriggerExecution: 2.0
WalCommit: 5569.0
undefined: 20.0
The following is a description of the indicators on the spark official website:
Operation Duration. The amount of time taken to perform various operations in milliseconds. The tracked operations are listed as follows.
addBatch: Time taken to read the micro-batch’s input data from the sources, process it, and write the batch’s output to the sink. This should take the bulk of the micro-batch’s time.
getBatch: Time taken to prepare the logical query to read the input of the current micro-batch from the sources.
latestOffset & getOffset: Time taken to query the maximum available offset for this source.
queryPlanning: Time taken to generates the execution plan.
walCommit: Time taken to write the offsets to the metadata log.
Why are WalCommit and QueryPlanning much larger than AddBatch?
Thanks!
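For reference, the same durations can also be read programmatically rather than from the web UI. Below is a minimal sketch, assuming a PySpark StreamingQuery whose lastProgress is a dict containing a durationMs map. The helper name rank_durations is hypothetical, and the sample values are simply the UI figures quoted above.

```python
def rank_durations(progress):
    """Return (operation, milliseconds) pairs, slowest first.

    `progress` is assumed to look like the dict returned by
    query.lastProgress in PySpark; None or a missing map yields [].
    """
    durations = (progress or {}).get("durationMs", {})
    return sorted(durations.items(), key=lambda kv: kv[1], reverse=True)


# Sample values copied from the UI figures in the question (illustrative).
sample = {
    "durationMs": {
        "addBatch": 1955,
        "getBatch": 1,
        "getOffset": 0,
        "queryPlanning": 3555,
        "triggerExecution": 2,
        "walCommit": 5569,
    }
}

for name, ms in rank_durations(sample):
    print(f"{name}: {ms} ms")
```

With a running query, this would be called as rank_durations(query.lastProgress), once per micro-batch or from a monitoring loop.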

Related

Sink from Delta Live Table to Kafka, initial sink works, but any subsequent updates fail

I have a DLT pipeline that ingests a topic from my kafka stream, transforms it into a DLT, then I wish to write that table back into Kafka under a new topic.
So far I have this working, but only on the first load of the table; any subsequent update crashes my read and write streams.
My DLT table updates correctly, so I see updates from my pipeline in the Gold table:
CREATE OR REFRESH LIVE TABLE deal_gold1
TBLPROPERTIES ("quality" = "gold")
COMMENT "Gold Deals"
AS SELECT
documentId,
eventTimestamp,
substring(fullDocument.owner_id, 11, 24) as owner_id,
fullDocument.owner_type as owner_type,
substring(fullDocument.account_id, 11, 24) as account_id,
substring(fullDocument.manager_account_id, 11, 24) as manager_account_id,
fullDocument.hubspot_deal_id as hubspot_deal_id,
fullDocument.stage as stage,
fullDocument.status as status,
fullDocument.title as title
FROM LIVE.deal_bronze_cleansed
But when I try to read from it in a separate notebook, these updates cause the stream to crash:
import pyspark.sql.functions as fn
from pyspark.sql.types import StringType
# this one is the problem not the write stream
df = spark.readStream.format("delta").table("deal_stream_test.deal_gold1")
display(df)
writeStream= (
df
.selectExpr("CAST(documentId AS STRING) AS key", "to_json(struct(*)) AS value")
.writeStream
.format("kafka")
.outputMode("append")
.option("ignoreChanges", "true")
.option("checkpointLocation", "/tmp/benperram21/checkpoint")
.option("kafka.bootstrap.servers", confluentBootstrapServers)
.option("ignoreChanges", "true")
.option("kafka.security.protocol", "SASL_SSL")
.option("kafka.sasl.jaas.config", "kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required username='{}' password='{}';".format(confluentApiKey, confluentSecret))
.option("kafka.ssl.endpoint.identification.algorithm", "https")
.option("kafka.sasl.mechanism", "PLAIN")
.option("topic", confluentTopicName)
.start()
)
From what I have read, this might be because the table is not being read as "append". Any thoughts on this? Everything works except updates.
Right now DLT doesn't support output to arbitrary sinks. Also, all Spark operations should be done inside the nodes of the execution graph (functions decorated with dlt.table or dlt.view).
For now, the workaround is to run that notebook outside of the DLT pipeline, as a separate task in a multitask job (workflow).

Change filter/where condition when restarting a Structured Streaming query reading data from Delta Table

In Structured Streaming, will the checkpoints keep track of which data has already been processed from a Delta Table?
def fetch_data_streaming(source_table: str):
    print("Fetching now")
    streamingInputDF = (
        spark
        .readStream
        .format("delta")
        .option("maxBytesPerTrigger", 1024)
        .table(source_table)
        .where("measurementId IN (1351,1350)")
        .where("year >= '2021'")
    )
    query = (
        streamingInputDF
        .writeStream
        .outputMode("append")
        .option("checkpointLocation", "/streaming_checkpoints/5")
        .foreachBatch(customWriter)
        .start()
        .awaitTermination()
    )
    return query

def customWriter(batchDF, batchId):
    print(batchId)
    print(batchDF.count())
    batchDF.show(10)
    length = batchDF.count()
    print("batchId,batch size:", batchId, length)
If I change the where clause in streamingInputDF to add more measurementId values, the structured streaming job doesn't always acknowledge the change and fetch the new data values. Sometimes it continues to run as if nothing has changed, whereas at other times it starts fetching the new values.
Isn't the checkpoint supposed to identify the change?
Edit: Schema of the Delta table:
col_name | data_type
measurementId | int
year | int
time | timestamp
q | smallint
v | string
"In Structured Streaming, will the checkpoints keep track of which data has already been processed from a Delta Table?"
Yes, the Structured Streaming job will store the read version of the Delta table in its checkpoint files to avoid producing duplicates.
Within the checkpoint directory in the folder "offsets", you will see that Spark stored the progress per batchId. For example it will look like below:
v1
{"batchWatermarkMs":0,"batchTimestampMs":1619695775288,"conf":[...]}
{"sourceVersion":1,"reservoirId":"d910a260-6aa2-4a7c-9f5c-1be3164127c0","reservoirVersion":2,"index":2,"isStartingVersion":true}
Here, the important part is the "reservoirVersion":2 which tells you that the streaming job has consumed all data from the Delta Table as of version 2.
Restarting your Structured Streaming query with an additional filter condition therefore does not apply that condition to historic records; it affects only records added to the Delta table after version 2.
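To check which version a query has consumed without opening the file by hand, the offsets file can be parsed directly. Below is a minimal sketch, assuming the three-part layout shown in the example above (version marker, batch metadata, then one JSON line per source). The helper name source_offsets is hypothetical, and the file content is the example from this answer.

```python
import json


def source_offsets(offset_file_text):
    """Parse the per-source offset lines of a checkpoint offsets file.

    Assumed layout (as in the example above): line 1 is the version
    marker, line 2 is batch metadata, and each further line is the JSON
    offset of one source. The metadata line is skipped, not parsed.
    """
    lines = [ln for ln in offset_file_text.splitlines() if ln.strip()]
    return [json.loads(ln) for ln in lines[2:]]


# File content copied from the example above; the elided "conf" value
# is left as-is because the metadata line is never parsed.
text = """v1
{"batchWatermarkMs":0,"batchTimestampMs":1619695775288,"conf":[...]}
{"sourceVersion":1,"reservoirId":"d910a260-6aa2-4a7c-9f5c-1be3164127c0","reservoirVersion":2,"index":2,"isStartingVersion":true}
"""

for offset in source_offsets(text):
    print("consumed up to Delta version:", offset["reservoirVersion"])
```

In practice the text would come from the highest-numbered file in the checkpoint's "offsets" folder.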
In order to see this behavior in action you can use below code and analyse the content in the checkpoint files.
val deltaPath = "file:///tmp/delta/table"
val checkpointLocation = "file:///tmp/checkpoint/"
// run the following two lines once
val deltaDf = Seq(("1", "foo1"), ("2", "foo2"), ("3", "foo2")).toDF("id", "value")
deltaDf.write.format("delta").mode("append").save(deltaPath)
// run this code for the first time, then add filter condition, then run again
val query = spark.readStream
.format("delta")
.load(deltaPath)
.filter(col("id").isin("1")) // in the second run add "2"
.writeStream
.format("console")
.outputMode("append")
.option("checkpointLocation", checkpointLocation)
.start()
query.awaitTermination()
Now, if you append some more data to the Delta table while the streaming query is shut down and then restart it with the new filter condition, the new condition will be applied to the new data.

how to determine where the Kafka consumption is made on Spark

We tried two MLlib transformers for consuming from Kafka: one using a Structured Streaming batch query, like here, and one using normal Kafka consumers. The normal consumers read into a list that gets converted to a DataFrame.
We created two MLlib pipelines starting with an empty DataFrame. The first transformer of each pipeline did the reading from Kafka. We had one pipeline for each Kafka transformer type.
We then ran the pipelines: 1st with normal Kafka consumers, 2nd with the Spark consumer, 3rd with normal consumers again.
The spark config had 4 executors, with 1 core each:
"spark.executor.cores": "1", "spark.executor.instances": 4,
Questions are:
A. Where is the consumption made? On the executors or on the driver? According to the driver UI, it looks like in both cases the executors did all the work: the driver did not pass any data and 4 executors got created.
B. Why do we have a different number of executors running? In the 1st run, with normal consumers, we see 4 executors working; in the 2nd run, with the Spark Kafka connector, 1 executor; in the 3rd run, with normal consumers, 1 executor but 2 cores?
You'll see the driver's UI attached at the bottom.
this is the relevant code:
Normal Consumer:
var kafkaConsumer: KafkaConsumer[String, String] = null
val readMessages = () => {
for (record <- records) {
recordList.append(record.value())
}
}
kafkaConsumer.subscribe(util.Arrays.asList($(topic)))
readMessages()
var df = recordList.toDF
kafkaConsumer.close()
val json_schema =
df.sparkSession.read.json(df.select("value").as[String]).schema
df = df.select(from_json(col("value"), json_schema).as("json"))
df = df.select(col("json.*"))
Spark consumer:
val records = dataset
.sparkSession
.read
.format("kafka")
.option("kafka.bootstrap.servers", $(url))
.option("subscribe", $(this.topic))
.option("kafkaConsumer.pollTimeoutMs", s"${$(timeoutMs)}")
.option("startingOffsets", $(startingOffsets))
.option("endingOffsets", $(endingOffsets))
.load
.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.as[(String, String)]
OmnixLogger.warn(uid, "Executor ID AFTER polling: " + SparkEnv.get.executorId)
val json_schema = records.sparkSession.read.json(records.select("value").as[String]).schema
var df: DataFrame = records.select(from_json(col("value"), json_schema).as("json"))
df = df.select(col("json.*"))

Spark Structured Streaming - AssertionError in Checkpoint due to increasing the number of input sources

I am trying to join two streams into one and write the result to a topic
code:
1- Reading two topics
val PERSONINFORMATION_df: DataFrame = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "xx:9092")
.option("subscribe", "PERSONINFORMATION")
.option("group.id", "info")
.option("maxOffsetsPerTrigger", 1000)
.option("startingOffsets", "earliest")
.load()
val CANDIDATEINFORMATION_df: DataFrame = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "xxx:9092")
.option("subscribe", "CANDIDATEINFORMATION")
.option("group.id", "candent")
.option("startingOffsets", "earliest")
.option("maxOffsetsPerTrigger", 1000)
.option("failOnDataLoss", "false")
.load()
2- Parse data to join them:
val parsed_PERSONINFORMATION_df: DataFrame = PERSONINFORMATION_df
.select(from_json(expr("cast(value as string) as actualValue"), schemaPERSONINFORMATION).as("s")).select("s.*")
val parsed_CANDIDATEINFORMATION_df: DataFrame = CANDIDATEINFORMATION_df
.select(from_json(expr("cast(value as string) as actualValue"), schemaCANDIDATEINFORMATION).as("s")).select("s.*")
val df_person = parsed_PERSONINFORMATION_df.as("dfperson")
val df_candidate = parsed_CANDIDATEINFORMATION_df.as("dfcandidate")
3- Join two frames
val joined_df : DataFrame = df_candidate.join(df_person, col("dfcandidate.PERSONID") === col("dfperson.ID"),"inner")
val string2json: DataFrame = joined_df.select($"dfcandidate.ID".as("key"),to_json(struct($"dfcandidate.ID", $"FULLNAME", $"PERSONALID")).cast("String").as("value"))
4- Write them to a topic
string2json.writeStream.format("kafka")
.option("kafka.bootstrap.servers", "xxxx:9092")
.option("topic", "toDelete")
.option("checkpointLocation", "checkpoints")
.option("failOnDataLoss", "false")
.start()
.awaitTermination()
Error message:
21/01/25 11:01:41 ERROR streaming.MicroBatchExecution: Query [id = 9ce8bcf2-0299-42d5-9b5e-534af8d689e3, runId = 0c0919c6-f49e-48ae-a635-2e95e31fdd50] terminated with error
java.lang.AssertionError: assertion failed: There are [1] sources in the checkpoint offsets and now there are [2] sources requested by the query. Cannot continue.
Your code looks fine to me, it is rather the checkpointing that is causing the issue.
Based on the error message you are getting, you probably ran this job with only one stream source. Then you added the code for the stream join and tried to restart the application without removing the existing checkpoint files. Now the application tries to recover from the checkpoint files but realises that you initially had only one source and now have two.
The section Recovery Semantics after Changes in a Streaming Query explains which changes are allowed and not allowed when using checkpointing. Changing the number of input sources is not allowed:
"Changes in the number or type (i.e. different source) of input sources: This is not allowed."
To solve your problem: Delete the current checkpoint files and re-start the job.
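Before restarting, it can help to confirm how many sources the existing checkpoint actually recorded. Below is a minimal sketch, assuming a checkpoint on the local filesystem whose "offsets" folder holds one file per batch (named by batch id) with a version line, a metadata line, and then one line per source. The helper name sources_in_checkpoint and the demo file content are hypothetical.

```python
import os
import tempfile


def sources_in_checkpoint(checkpoint_dir):
    """Count the sources recorded in the newest offsets file.

    Assumes the layout described in the lead-in. Returns None when the
    offsets folder is missing or no batch has been written yet.
    """
    offsets_dir = os.path.join(checkpoint_dir, "offsets")
    if not os.path.isdir(offsets_dir):
        return None
    batch_ids = [int(name) for name in os.listdir(offsets_dir) if name.isdigit()]
    if not batch_ids:
        return None
    with open(os.path.join(offsets_dir, str(max(batch_ids)))) as handle:
        lines = [ln for ln in handle.read().splitlines() if ln.strip()]
    # Everything after the version and metadata lines is one source each.
    return max(len(lines) - 2, 0)


# Demo with a throwaway directory (hypothetical content, one source).
demo_dir = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_dir, "offsets"))
with open(os.path.join(demo_dir, "offsets", "0"), "w") as handle:
    handle.write('v1\n{"batchWatermarkMs":0}\n{"PERSONINFORMATION":{"0":10}}\n')

expected_sources = 2  # the joined query reads two Kafka topics
found = sources_in_checkpoint(demo_dir)
if found is not None and found != expected_sources:
    print(f"checkpoint has {found} source(s), query needs {expected_sources}: "
          "delete the old checkpoint files before restarting")
```

For a checkpoint on HDFS or object storage, the same check would need the corresponding filesystem client instead of os.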

Spark Structured Streaming Kafka Microbatch count

I am using Spark Structured Streaming to read records from a Kafka topic, and I want to count the number of records received in each micro-batch of the Spark readStream.
This is a snippet:
val kafka_df = sparkSession
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "host:port")
.option("subscribe", "test-count")
.load()
I understand from the docs that kafka_df will be lazily evaluated when a streamingQuery is started (to come next), and as it is evaluated, it holds a micro-batch. So, I figured doing a groupBy on topic followed by a count should work.
Like this:
val counter = kafka_df
.groupBy("topic")
.count()
Now, to evaluate all of this, we need a StreamingQuery, let's say a console sink query to print it on the console. And this is where I see the problem: a StreamingQuery on an aggregated DataFrame, such as counter, works only with outputMode complete/update and not with append.
This effectively means that, the count reported by the streamingQuery is cumulative.
Like this:
val counter_json = counter.toJSON //to jsonify
val count_query = counter_json
.writeStream.outputMode("update")
.format("console")
.start() // kicks off lazy evaluation
.awaitTermination()
In a controlled set up, where:
actual Published records: 1500
actual Received micro-batches : 3
actual Received records: 1500
The count of each microbatch is supposed to be 500, so I hoped (wished) that the query prints to console:
topic: test-count
count: 500
topic: test-count
count: 500
topic: test-count
count: 500
But it doesn't. It actually prints:
topic: test-count
count: 500
topic: test-count
count: 1000
topic: test-count
count: 1500
This, I understand, is because of outputMode complete/update (cumulative).
My question: Is it possible to accurately get the count of each micro-batch in Spark-Kafka structured streaming?
From the docs, I found out about the watermark approach (to support append):
val windowedCounts = kafka_df
.withWatermark("timestamp", "10 seconds")
.groupBy(window($"timestamp", "10 seconds", "10 seconds"), $"topic")
.count()
val console_query = windowedCounts
.writeStream
.outputMode("append")
.format("console")
.start()
.awaitTermination()
But the results of this console_query are inaccurate and appear to be way off the mark.
TL;DR - Any thoughts on accurately counting the records in Spark-Kafka micro-batch would be appreciated.
If you want to process only a specific number of records with every trigger within a Structured Streaming application using Kafka, use the option maxOffsetsPerTrigger:
val kafka_df = sparkSession
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "host:port")
.option("subscribe", "test-count")
.option("maxOffsetsPerTrigger", 500)
.load()
"TL;DR - Any thoughts on accurately counting the records in Spark-Kafka micro-batch would be appreciated."
You can count the records fetched from Kafka by using a StreamingQueryListener (ScalaDocs).
This allows you to print out the exact number of rows that were received from the subscribed Kafka topic. The onQueryProgress API gets called during every micro-batch and contains lots of useful meta information on your query. If no data is flowing into the query the onQueryProgress is called every 10 seconds. Below is a simple example that prints out the number of input messages.
spark.streams.addListener(new StreamingQueryListener() {
override def onQueryStarted(queryStarted: QueryStartedEvent): Unit = {}
override def onQueryTerminated(queryTerminated: QueryTerminatedEvent): Unit = {}
override def onQueryProgress(queryProgress: QueryProgressEvent): Unit = {
println("NumInputRows: " + queryProgress.progress.numInputRows)
}
})
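The listener above is Scala. In PySpark, a comparable per-batch count can be obtained by polling the query's progress. Below is a minimal sketch, assuming query.recentProgress yields a list of dicts with batchId and numInputRows keys. The helper name rows_per_batch is hypothetical, and the sample events mirror the three batches of 500 records from the question.

```python
def rows_per_batch(progress_events):
    """Map each batchId to the number of input rows it received.

    If a batchId appears more than once, the latest event wins.
    """
    return {p["batchId"]: p["numInputRows"] for p in progress_events}


# Sample progress events mirroring the controlled set-up in the question.
events = [
    {"batchId": 0, "numInputRows": 500},
    {"batchId": 1, "numInputRows": 500},
    {"batchId": 2, "numInputRows": 500},
]

counts = rows_per_batch(events)
print(counts)
print("total:", sum(counts.values()))
```

With a running query, this would be called as rows_per_batch(query.recentProgress). Unlike the groupBy count, these figures are per micro-batch rather than cumulative.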
In case you are validating the performance of your Structured Streaming query, it is usually best to keep an eye on the following two metrics:
queryProgress.progress.inputRowsPerSecond
queryProgress.progress.processedRowsPerSecond
In case input is higher than processed you might increase resources for your job or reduce the maximum limit (by reducing the readStream option maxOffsetsPerTrigger). If processed is higher, you may want to increase this limit.
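The comparison described above can be written as a small check on a single progress event. Below is a minimal sketch, assuming a dict shaped like query.lastProgress in PySpark. The helper name tuning_hint and the sample rates are hypothetical.

```python
def tuning_hint(progress):
    """Compare input and processing rates from one progress event.

    Missing rates are treated as 0.0 (an assumption for this sketch).
    """
    incoming = progress.get("inputRowsPerSecond", 0.0)
    processed = progress.get("processedRowsPerSecond", 0.0)
    if incoming > processed:
        return "falling behind: add resources or lower maxOffsetsPerTrigger"
    if processed > incoming:
        return "headroom: maxOffsetsPerTrigger could be raised"
    return "balanced"


# Hypothetical rates, for illustration only.
print(tuning_hint({"inputRowsPerSecond": 120.0, "processedRowsPerSecond": 80.0}))
print(tuning_hint({"inputRowsPerSecond": 50.0, "processedRowsPerSecond": 200.0}))
```

A single event can be noisy, so averaging the rates over several recent progress events before acting on the hint is probably the safer design.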
