Spark 3 Structured Streaming: use maxOffsetsPerTrigger in the Kafka source with Trigger.Once - apache-spark

We need to use maxOffsetsPerTrigger in the Kafka source with Trigger.Once() in Structured Streaming, but based on this issue it seems that Spark 3 reads allAvailable instead. Is there a way to achieve rate limiting in this situation?
Here is sample code in Spark 3:
def options: Map[String, String] = Map(
  "kafka.bootstrap.servers" -> conf.getStringSeq("bootstrapServers").mkString(","),
  "subscribe" -> conf.getString("topic")
) ++
  Option(conf.getLong("maxOffsetsPerTrigger")).map("maxOffsetsPerTrigger" -> _.toString)

val streamingQuery = sparkSession.readStream
  .format("kafka")
  .options(options)
  .load
  .writeStream
  .trigger(Trigger.Once)
  .start()

There is no way to properly set a rate limit in this situation. Since maxOffsetsPerTrigger is not applied to streaming jobs with the Once trigger, you could do one of the following to achieve an equivalent result:
Choose another trigger and use maxOffsetsPerTrigger to limit the rate, then kill the job manually after it has finished processing all data.
Use the options startingOffsets and endingOffsets while making the job a batch job, and repeat until you have processed all data within the topic (a sketch follows below). However, there is a reason why "Streaming in RunOnce mode is better than Batch", as detailed here.
The last option would be to look into the linked pull request and compile Spark on your own.
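For the batch variant, a minimal sketch might look like this (the bootstrap servers, topic name, and offset ranges are placeholder values; you have to persist the last processed offsets yourself between runs):
// Bounded batch read of a slice of the topic; repeat with an advancing
// offset range until the whole topic has been processed.
val sliceDF = sparkSession.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")      // placeholder
  .option("subscribe", "testTopic")                         // placeholder
  .option("startingOffsets", """{"testTopic":{"0":0}}""")   // placeholder range
  .option("endingOffsets", """{"testTopic":{"0":10000}}""") // placeholder range
  .load()

sliceDF.selectExpr("CAST(value AS STRING)").show()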

Here is how we "solved" this. This is basically the approach mike wrote about in the accepted answer.
In our case, the size of the messages varied very little, and we therefore knew how long processing a batch takes. So in a nutshell we:
replaced Trigger.Once() with Trigger.ProcessingTime(<ms>), since maxOffsetsPerTrigger is honored in this mode
killed the running query by calling awaitTermination(<ms>) to mimic Trigger.Once()
set the processing interval to be larger than the termination interval so that exactly one "batch" would fit to be processed
val kafkaOptions = Map[String, String](
  "kafka.bootstrap.servers" -> "localhost:9092",
  "failOnDataLoss" -> "false",
  "subscribePattern" -> "testTopic",
  "startingOffsets" -> "earliest",
  "maxOffsetsPerTrigger" -> "10" // "batch" size
)

val streamWriterOptions = Map[String, String](
  "checkpointLocation" -> "path/to/checkpoints"
)

val processingInterval = 30000L
val terminationInterval = 15000L

sparkSession
  .readStream
  .format("kafka")
  .options(kafkaOptions)
  .load()
  .writeStream
  .options(streamWriterOptions)
  .format("console")
  .trigger(Trigger.ProcessingTime(processingInterval))
  .start()
  .awaitTermination(terminationInterval)
This works because the first batch is read and processed respecting the maxOffsetsPerTrigger limit, say in 10 seconds. The second batch then starts processing but is terminated mid-operation after ~5 s and never reaches the set 30 s mark. Its offsets are stored correctly, though, and Spark picks up and processes this "killed" batch in the next run.
A downside of this approach is that you have to know approximately how long it takes to process one "batch": if you set the terminationInterval too low, the job will constantly terminate before producing any output.
Of course, if you don't care about the exact number of batches you process in one run, you can easily make the processingInterval several times smaller than the terminationInterval. In that case you may process a varying number of batches in one go, while still respecting the value of maxOffsetsPerTrigger, as illustrated below.
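For illustration, a variant of the constants above (the values are arbitrary) that lets up to ~12 capped micro-batches run before the application stops:
val processingInterval = 5000L   // trigger every 5 s
val terminationInterval = 60000L // stop the application after ~60 s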

Related

How to speed up spark streaming unit-tests

I have a Spark application using Structured Streaming, written in Scala. For unit tests I'm using MemoryStream: https://github.com/apache/spark/blob/v3.1.3/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/memory.scala#L43
To simulate the "real world" I want the tests to produce batches, so I basically use the foreachBatch streaming sink where, in each batch, I add the next set of rows of data (currently just one row per batch).
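For context, here is a minimal sketch of the setup the snippet below assumes (memStream, batches, batchIter, and MyRow are not shown in the original question; an existing SparkSession named spark is assumed):
import org.apache.spark.sql.execution.streaming.MemoryStream

case class MyRow(id: Long, value: String) // hypothetical row type

implicit val sqlCtx = spark.sqlContext
import spark.implicits._

// Pre-computed test input, one row per "batch".
val batches: Seq[MyRow] = (1L to 10L).map(i => MyRow(i, s"row-$i"))
val batchIter = batches.iterator

val memStream = MemoryStream[MyRow]
memStream.addData(batchIter.next()) // seed the first micro-batch
val df = memStream.toDF()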
def foreachBatch(dataset: Dataset[MyRow], batchId: Long): Unit = {
  log.info(s"Processing batch id $batchId / ${batches.size}")
  val batch = dataset.collect()
  log.info(s"${batch.length} rows in batch")
  if (batchIter.hasNext) {
    memStream.addData(batchIter.next())
  }
}
val query = df.writeStream
  .trigger(Trigger.ProcessingTime(100.millis))
  .queryName("testQuery")
  .outputMode(OutputMode.Append())
  .foreachBatch(foreachBatch _)
  .start()
Unfortunately, even when I implement flatMapGroupsWithState as a no-op that only returns an empty iterator, this takes ~10-15 seconds per batch on an otherwise idle, recent MacBook Pro 16".
So testing this with only 10 rows is quite painful.
I played with Spark's driver/executor memory settings and the number of executors (from local[1] to local[*]), all to pretty much no avail.
Is there any way I might be missing to speed things up?
Writing one input file per batch also didn't seem faster.

Consume a Kafka Topic every hour using Spark

I want to consume a Kafka topic as a batch: read the topic hourly and pick up only the latest hour of data.
val readStream = existingSparkSession
  .read
  .format("kafka")
  .option("kafka.bootstrap.servers", hostAddress)
  .option("subscribe", "kafka.raw")
  .load()
But this always reads the first 20 rows, and these rows start from the very beginning, so it never picks up the latest rows.
How can I read the latest rows on an hourly basis using Scala and Spark?
If you read Kafka messages in batch mode, you need to do the bookkeeping of which data is new and which is not yourself. Remember that Spark will not commit any messages back to Kafka, so every time you restart the batch job it will read from the beginning (or from wherever the setting startingOffsets, which defaults to earliest for batch queries, points to).
For your scenario, where you want to run the job once every hour and only process the new data that arrived in Kafka during the previous hour, you can make use of the writeStream trigger option Trigger.Once for streaming queries.
There is a nice blog from Databricks that explains why a streaming query with Trigger.Once should be preferred over a batch query.
The main point being:
"When you’re running a batch job that performs incremental updates, you generally have to deal with figuring out what data is new, what you should process, and what you should not. Structured Streaming already does all this for you."
Make sure that you also set the option "checkpointLocation" in your writeStream. In the end, you can have a simple cron job that submits your streaming job once an hour.
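A minimal sketch of such an hourly job (reusing existingSparkSession and hostAddress from the question; the parquet sink, its path, and the checkpoint location are placeholder choices):
import org.apache.spark.sql.streaming.Trigger

val query = existingSparkSession
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", hostAddress)
  .option("subscribe", "kafka.raw")
  .load()
  .selectExpr("CAST(value AS STRING)")
  .writeStream
  .format("parquet")                                      // placeholder sink
  .option("path", "/data/kafka-raw")                      // placeholder
  .option("checkpointLocation", "/checkpoints/kafka-raw") // offsets are tracked here
  .trigger(Trigger.Once())                                // process everything new, then stop
  .start()

query.awaitTermination()
Each invocation then processes exactly the offsets that arrived since the previous run, because the checkpoint remembers where the last run stopped.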

How to slow down the write speed of Kafka Producer?

I use Spark to write data to Kafka in this way:
df.write.format("kafka").save()
Can I control the write speed to Kafka to avoid putting pressure on it?
Are there any options that help to slow down the speed?
I think setting linger.ms to a non-zero value would help, as it controls the amount of time to wait for additional messages before sending the current batch. Note that in Spark's Kafka sink, Kafka producer configs have to be passed with the kafka. prefix. The code can look like the following:
df.write.format("kafka").option("kafka.linger.ms", "100").save()
But this really depends on a lot of things. If your Kafka cluster is "big" enough and configured properly, I wouldn't worry too much about the speed. After all, Kafka is designed to cope with traffic spikes.
Generally, Structured Streaming will try to process data as fast as possible by default. There are options in each source that let you control the processing rate, such as maxFilesPerTrigger in the file source and maxOffsetsPerTrigger in the Kafka source.
val streamingETLQuery = cloudtrailEvents
  .withColumn("date", $"timestamp".cast("date")) // derive the date
  .writeStream
  .trigger(ProcessingTime("10 seconds")) // check for files every 10s
  .format("parquet") // write as Parquet partitioned by date
  .partitionBy("date")
  .option("path", "/cloudtrail")
  .option("checkpointLocation", "/cloudtrail.checkpoint/")
  .start()
val df = spark.readStream
  .format("text")
  .option("maxFilesPerTrigger", 1)
  .load("text-logs")
Read the following links for more details:
https://jaceklaskowski.gitbooks.io/spark-structured-streaming/spark-sql-streaming-KafkaSource.html
https://jaceklaskowski.gitbooks.io/spark-structured-streaming/spark-sql-streaming-FileStreamSource.html
https://databricks.com/blog/2017/01/19/real-time-streaming-etl-structured-streaming-apache-spark-2-1.html
http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#input-sources
http://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html

Long and consistent wait between tasks in spark streaming job

I have a spark streaming job running on Mesos.
All its batches take the exact same time and this time is much longer than expected.
The jobs pull data from Kafka, process it, insert it into Cassandra, and also push it back to Kafka into a different topic.
Each batch (below) has 3 jobs: 2 of them pull from Kafka, process the data, and insert into Cassandra, while the other one pulls from Kafka, processes the data, and pushes it back into Kafka.
I inspected the batches in the Spark UI and found that they all take the same time (4 s); drilling down further, they actually process for less than a second each, but they all have a gap of the same length (around 4 seconds).
Adding more executors or more processing power doesn't look like it will make a difference.
Details of batch: Processing time = 12s & total delay = 1.2 s ??
So I drill down into each job of the batch (they all take the exact same time = 4s even if they are doing different processing):
They all take 4 seconds to run one of their stages (the one that reads from Kafka).
Now I drill down into the stage of one of them (they are all very similar):
Why this wait? The whole thing actually only takes 0.5 s to run; it is just waiting. Is it waiting for Kafka?
Has anyone experienced anything similar?
What could I have coded wrong or configured incorrectly?
EDIT:
Here is a minimal piece of code that triggers this behaviour. This makes me think that it must somehow be the setup.
object Test {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf(true)
    val streamingContext = new StreamingContext(sparkConf, Seconds(5))

    val kafkaParams = Map[String, String](
      "bootstrap.servers" -> "####,####,####",
      "group.id" -> "test"
    )
    val stream = KafkaUtils.createDirectStream[String, Array[Byte], StringDecoder, DefaultDecoder](
      streamingContext, kafkaParams, Set("test_topic")
    )

    stream.map(t => "LEN=" + t._2.length).print()

    streamingContext.start()
    streamingContext.awaitTermination()
  }
}
Even if all the executors are on the same node (spark.executor.cores=2, spark.cores.max=2), the problem is still there, and the gap is exactly 4 seconds as before, even with a single Mesos executor.
Even if the topic has no messages (a batch of 0 records), Spark Streaming takes 4 seconds for every batch.
The only way I have been able to fix this is by setting cores=1 and cores.max=1, so that only one task is created to execute.
That task has locality NODE_LOCAL. So it seems that when the locality is NODE_LOCAL the execution is instantaneous, but when the locality is ANY it takes 4 seconds to connect to Kafka. All the machines are in the same 10 Gb network. Any idea why this would be?
The problem was with spark.locality.wait; this link gave me the idea.
Its default value is 3 seconds, and that whole time was being spent on every batch processed in Spark Streaming.
I set it to 0 seconds when submitting the job to Mesos (--conf spark.locality.wait=0), and everything now runs as expected.
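The same property can also be set programmatically on the SparkConf; a small sketch:
val sparkConf = new SparkConf(true)
  .set("spark.locality.wait", "0") // don't wait for a data-local executor slot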

How to rewind Kafka Offsets in spark structured streaming readstream

I have a Spark Structured Streaming job which is configured to read data from Kafka. Please go through the code below to check the readStream() parameters for reading the latest data from Kafka.
I understand that readStream() reads from the first offset when a new query is started, not on resume.
But I don't know how to start a new query every time I restart my job in IntelliJ.
val kafkaStreamingDF = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", AppProperties.getProp(AppConstants.PROPS_SERVICES_KAFKA_SERVERS))
  .option("subscribe", AppProperties.getProp(AppConstants.PROPS_SDV_KAFKA_TOPICS))
  .option("failOnDataLoss", "false")
  .option("startingOffsets", "earliest")
  .load()
  .selectExpr("CAST(value as STRING)", "CAST(topic as STRING)")
I have also tried setting the offsets with """{"topicA":{"0":0,"1":0}}"""
Following is my writeStream:
val query = kafkaStreamingDF
  .writeStream
  .format("console")
  .start()
Every time I restart my job in the IntelliJ IDE, the logs show that the offset has been set to latest instead of 0 or earliest.
Is there a way I can clean my checkpoint? In that case, I don't know where the checkpoint directory is, because in the above case I don't specify any checkpointing.
Kafka relies on the property auto.offset.reset to take care of offset management.
The default is “latest,” which means that lacking a valid offset, the consumer will start reading from the newest records (records that were written after the consumer started running). The alternative is “earliest,” which means that lacking a valid offset, the consumer will read all the data in the partition, starting from the very beginning.
As per your question, you want to read the entire data from the topic, so setting "startingOffsets" to "earliest" should work. But also make sure that you are setting enable.auto.commit to false.
Setting enable.auto.commit to true means that offsets are committed automatically, with a frequency controlled by the config auto.commit.interval.ms, whenever messages are read from Kafka; this doesn't necessarily mean that Spark has finished processing those messages. For precise control over committing offsets, set the Kafka parameter enable.auto.commit to false.
Try setting .option("kafka.client.id", "XX") to use a different client.id.
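For reference, a minimal sketch of rewinding via per-partition offsets in startingOffsets (the broker address is a placeholder). Note that startingOffsets only applies when a query starts fresh; a resumed query always continues from its checkpoint, so you would have to delete or change the checkpointLocation to start over:
val rewoundDF = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:9092")            // placeholder
  .option("subscribe", "topicA")
  .option("startingOffsets", """{"topicA":{"0":0,"1":0}}""") // partition -> offset
  .option("failOnDataLoss", "false")
  .load()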
