I would like to know if the read operations from a Kafka queue is faster by using batch-Kafka RDD instead of the KafkaDirectStream when I want to read all the Kafka queue.
I've observed that reading from different partition with batch RDD is not resulting in Spark concurrent jobs. Is there some Spark proprierties to config in order to allow this behaviour?
Thanks.
Try running your spark consumers in different threads or as different processes. That's the approach I take. I've observed that I get the best concurrency by allocating one consumer thread (or process) per topic partition.
Regarding your questions about batch vs KafkaDirectStream, I think even KafkaDirectStream is batch oriented. The batch interval can be specified in the streaming context, like this:
private static final int INTERVAL = 5000; // 5 seconds
JavaSparkContext sc = new JavaSparkContext(conf);
SQLContext sqlContext = new SQLContext(sc);
JavaStreamingContext ssc = new JavaStreamingContext(sc, new Duration(INTERVAL));
There's a good image that described how spark streaming is batch oriented here:
http://spark.apache.org/docs/1.6.0/streaming-programming-guide.html#discretized-streams-dstreams
Spark is essentially a batch engine and Spark streaming takes batching closer to streaming by defining something called micro-batching. Micro-batching is nothing but specifying batch interval to be very low (can be as low as 50ms per the advice in the official documentation). So now all it matters is how much is your micro-batch interval going to be. If you keep it low, you would feel it is near real-time.
On the Kafka consumer front, Spark direct receiver runs as a separate task in each executor. So if you have enough executors as the partitions, then it fetches data from all partitions and creates an RDD out of it.
If you are talking about reading from multiple queues, then you would create multiple DStreams, which would again need more executors to match the total number of partitions.
Related
I have a spark application that has to process multiple queries in parallel using a single Kafka topic as the source.
The behavior I noticed is that each query has its own consumer (which is in its own consumer group) causing the same data to be streamed to the application multiple times (please correct me if I'm wrong) which seems very inefficient, instead I would like to have a single stream of data that would be then processed in parallel by Spark.
What would be the recommended way to improve performance in the scenario above ? Should I focus on optimizing Kafka partitions instead of how Spark interacts with Kafka ?
Any thoughts are welcome,
Thank you.
The behavior I noticed is that each query has its own consumer (which is in its own consumer group) causing the same data to be streamed to the application multiple times (please correct me if I'm wrong) which seems very inefficient, instead I would like to have a single stream of data that would be then processed in parallel by Spark.
tl;dr Not possible in the current design.
A single streaming query "starts" from a sink. There can only be one in a streaming query (I'm repeating it myself to remember better as I seem to have been caught multiple times while with Spark Structured Streaming, Kafka Streams and recently with ksqlDB).
Once you have a sink (output), the streaming query can be started (on its own daemon thread).
For exactly the reasons you mentioned (not to share data for which Kafka Consumer API requires group.id to be different), every streaming query creates a unique group ID (cf. this code and the comment in 3.3.0) so the same records can be transformed by different streaming queries:
// Each running query should use its own group id. Otherwise, the query may be only assigned
// partial data since Kafka will assign partitions to multiple consumers having the same group
// id. Hence, we should generate a unique id for each query.
val uniqueGroupId = KafkaSourceProvider.batchUniqueGroupId(sourceOptions)
And that makes sense IMHO.
Should I focus on optimizing Kafka partitions instead of how Spark interacts with Kafka ?
Guess so.
You can separate your source data frame into different stages, yes.
val df = spark.readStream.format("kafka") ...
val strDf = df.select(cast('value).as("string")) ...
val df1 = strDf.filter(...) # in "parallel"
val df2 = strDf.filter(...) # in "parallel"
Only the first line should be creating Kafka consumer instance(s), not the other stages, as they depend on the consumer records from the first stage.
I have a simple job (20 executors, 8G memory each) that reads from Kafka (with 50 partitions), checkpoints to HDFS, and posts data to a HTTP endpoint (1000 events per second). I recently started to see some straggling executors which would take far longer compared to other executors. As part of investigation I was trying to rule out data skew; is there a way to print partition:offsets for executors? Or is there any other way to track why an executor maybe straggling?
I know I can implement StreamingQueryListener but that'll only give me partition:offsets per batch, and won't tell me which executor is processing a specific partition.
You can have it printed if you have used a sink with foreach. forEach in structured spark streaming. The open method has those details and it gets executed for every executor. so u have those details
What is the functionality of the queueStream function in Spark StreamingContext. According to my understanding it is a queue which queues the incoming DStream. If that is the case then how it is handled in the cluster with many node. Does each node will have this queueStream and the DStream is partitioned among all the nodes in the cluster? How does this queueStream work in cluster setup?
I have read below explanation in the [Spark Streaming documentation][https://spark.apache.org/docs/latest/streaming-programming-guide.html#basic-sources), but I didn't understand it completely. Please help me to understand it.
Queue of RDDs as a Stream: For testing a Spark Streaming application with test data, one can also create a DStream based on a queue of RDDs, using streamingContext.queueStream(queueOfRDDs). Each RDD pushed into the queue will be treated as a batch of data in the DStream, and processed like a stream.
val myQueueRDD= scala.collection.mutable.Queue[RDD[MyObject]]()
val myStream= ssc.queueStream(myQueueRDD)
for(count <- 1 to 100) {
val randomData= generateData() //Generated random data
val rdd= ssc.sparkContext.parallelize(randomData) //Creates the rdd of the random data.
myQueueRDD+= rdd //Addes data to queue.
}
myStream.foreachRDD(rdd => rdd.mapPartitions(data => evaluate(data)))
How the above part of the code will get executed in the spark streaming context with respect to partitions on different nodes.
QueueInputDStream is intended for testing. It uses standard scala.collection.mutable.Queue to store RDDs which imitate incoming batches.
Does each node will have this queueStream and the DStream is partitioned among all the nodes in the cluster
No. There is only one copy of the queue and all data distribution is handled by RDDs. compute logic is very simple with dequeue (oneAtATime set to true) or union of the current queue (oneAtATime set to false) at each tick. This applies to DStreams in general - each stream is just a sequence of RDDs, which provide data distribution mechanism.
While it still follows InputDStream API, conceptually it is just a local collection from which you take elements every batchDuration.
I am building a prototype with spark streaming 1.5.0. DirectKafkaInputDStream is used.
And a simple stage to read from kafka by DirectKafkaInputDStream can't handle massive amount of messages. The stage spends longer time then batch interval, once the message rate reach or exceed a certain value. And the rate is much lower than I expect. ( I have done another benchmark of my kafka cluster with multiple consumer instances in different servers)
JavaPairInputDStream<String, String> recipeDStream =
KafkaUtils.createDirectStream(jssc,
String.class,
String.class,
StringKeyDecoder.class,
StringDecoder.class,
kafkaParams, kafkaTopicsSet);
After reading this article, I realize that the DirectKafkaInputDStream is run on the same node as the driver program. is it ture? If so, then DirectKafkaInputDStream can easily be stressed as it read all message in one node then dispatch to all executors.
And it means JavaPairReceiverInputDStream has better performance in handling high volume data, since receivers runs on multiple executor instances.
Am I right? Can someone explain this? Thank you.
No, the direct stream is only communicating from the driver to kafka in order to find the latest available offsets. Actual messages are read only on the executors.
Switching .createStream to .createDirectStream should in general perform better, not worse. If you've got a small reproducible example to the contrary, please share it on the spark mailing list or jira.
I have two node standalone cluster for spark stream processing. below is my sample code which demonstrate process I am executing.
sparkConf.setMaster("spark://rsplws224:7077")
val ssc=new StreamingContext()
println(ssc.sparkContext.master)
val inDStream = ssc.receiverStream //batch of 500 ms as i would like to have 1 sec latency
val filteredDStream = inDStream.filter // filtering unwanted tuples
val keyDStream = filteredDStream.map // converting to pair dstream
val stateStream = keyDStream .updateStateByKey //updating state for history
stateStream.checkpoint(Milliseconds(2500)) // to remove long lineage and meterilizing state stream
stateStream.count()
val withHistory = keyDStream.join(stateStream) //joining state wit input stream for further processing
val alertStream = withHistory.filter // decision to be taken by comparing history state and current tuple data
alertStream.foreach // notification to other system
My Problem is spark is not distributing this state RDD to multiple nodes or not distributing task to other node and causing high latency in response, my input load is around 100,000 tuples per seconds.
I have tried below things but nothing is working
1) spark.locality.wait to 1 sec
2) reduce memory allocated to executer process to check weather spark distribute RDD or task but even if it goes beyond memory limit of first node (m1) where drive is also running.
3) increased spark.streaming.concurrentJobs from 1 (default) to 3
4) I have checked in streaming ui storage that there are around 20 partitions for state dstream RDD all located on local node m1.
If I run SparkPi 100000 then spark is able to utilize another node after few seconds (30-40) so I am sure that my cluster configuration is fine.
Edit
One thing I have noticed that even for my RDD if I set storage level MEMORY_AND_DISK_SER_2 then also in app ui storage it shows Memory Serialized 1x Replicated
Spark will not distribute stream data across the cluster automatically for it tends to make full use of data locality(to launch a task on where its data lies will be better, this is default configuration). But you can use repartition to distribute stream data and improve the parallelism. You can turn to http://spark.apache.org/docs/1.3.0/streaming-programming-guide.html#performance-tuning for more information.
If your not hitting the cluster and your jobs only run locally it most likely means your Spark Master in your SparkConf is set to the local URI not the master URI.
By default the value of spark.default.parallelism property is "Local mode" so all the tasks will be executed in the node is receiving the data.
Change this property in spark-defaults.conf file in order to increase the parallelism level.