Spark Streaming takes a long time to read from Kafka - apache-spark

I built a cluster with CDH 5.14.2. It has 5 nodes, each with 130 GB of memory and 40 CPU cores. I wrote a Spark Streaming application that reads from about 10 Kafka topics, aggregates the messages of each topic separately, and finally saves the Kafka offsets to ZooKeeper. The Spark tasks take a long time to process the Kafka messages. The messages are not skewed, and I found that Spark takes a long time to read from Kafka.
My code:
// build input streams from the Kafka topics
JavaInputDStream<ConsumerRecord<String, String>> stream1 =
    MyKafkaUtils.buildInputStream(KafkaConfig.kafkaFlowGrouppId, topic1, ssc);
JavaInputDStream<ConsumerRecord<String, String>> stream2 =
    MyKafkaUtils.buildInputStream(KafkaConfig.kafkaFlowGrouppId, topic2, ssc);
JavaInputDStream<ConsumerRecord<String, String>> stream3 =
    MyKafkaUtils.buildInputStream(KafkaConfig.kafkaFlowGrouppId, topic3, ssc);
...
// aggregate the Kafka messages using Spark SQL
result1 = process(stream1);
result2 = process(stream2);
result3 = process(stream3);
...
// write the results to Kafka
writeToKafka(result1);
writeToKafka(result2);
writeToKafka(result3);
// save the offsets to ZooKeeper
saveOffset(stream1);
saveOffset(stream2);
saveOffset(stream3);
Spark web UI information: (screenshot not included)

Related

Spark Structured Streaming: process each row on a different worker node as soon as it arrives

I am using Spark 2.3 Structured Streaming with Kafka as the input stream.
My cluster consists of a master and 3 workers (the master runs on one of the worker machines).
My Kafka topic has 3 partitions, matching the number of workers.
I am using the default trigger and a foreach sink to process the data.
When the first message arrives at the driver, processing starts immediately on one of the available worker nodes. If a second message arrives while the first is still being processed, it is not started immediately on an idle worker; its execution is delayed until the first worker finishes. Only then do all the waiting executions start, in parallel, on all the available workers (say I have 3 waiting messages).
How can I force execution to start immediately on an idle worker?
A snippet of my code:
val sparkSession = SparkSession.builder().config(conf).getOrCreate()
import sparkSession.implicits._
import org.apache.spark.sql.ForeachWriter
val writer = new ForeachWriter[String] {
  override def open(partitionId: Long, version: Long) = true
  override def process(filePath: String) = {
    val filesSeq = fileHandler
      .handleData(filePath) // long processing
  }
  override def close(errorOrNull: Throwable) = {}
}
val filesDf = kafkaStreamSubscriber
  .buildtream(conf, kafkaInputTopic)
val ds = filesDf.map(x=>x.getAs("filePath").asInstanceOf[String])
val query =
  ds.writeStream
    .foreach(writer)
    .start
ds.writeStream
  .format("console")
  .option("truncate", "false")
  .start()
println("lets go....")
query.awaitTermination()
What am I doing wrong? I don't want idle workers while there is data waiting to be processed.
Thanks
Refer to the Triggers section of the Spark Structured Streaming documentation.
As far as I understand, the default trigger processes one micro-batch at a time. If you need to process data as soon as it arrives, I would suggest considering the experimental continuous mode.
My understanding is that with a trigger of, say, 5 seconds, each micro-batch reads messages from all 3 partitions, so you will have 3 tasks running at the same time. The next micro-batch does not start until all of them have finished.
Hope it helps!
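The delay described in the question can be reproduced with a toy model that needs no Spark at all. The sketch below is only an illustration under stated assumptions: the class and method names are made up, arrival times are sorted, every message sits in its own partition, and all tasks in a batch take the same time. It is not how Spark schedules work internally, only the "one micro-batch at a time" rule from the answer.

```java
import java.util.Arrays;

class MicroBatchModel {
    /**
     * Returns the time at which each message starts processing when a batch
     * may begin only after the previous batch has finished, and each batch
     * picks up every message that has arrived by then.
     * Assumes arrivals are sorted in ascending order.
     */
    static long[] startTimes(long[] arrivals, long processingTime) {
        long[] starts = new long[arrivals.length];
        long clock = 0; // earliest time the next batch may start
        int next = 0;   // index of the first message not yet assigned to a batch
        while (next < arrivals.length) {
            // If nothing is waiting, the batch starts when the next message arrives.
            long batchStart = Math.max(clock, arrivals[next]);
            // Every message that has arrived by batchStart joins this batch.
            while (next < arrivals.length && arrivals[next] <= batchStart) {
                starts[next] = batchStart;
                next++;
            }
            // Tasks run in parallel and take equally long, so the batch
            // finishes one processingTime after it started.
            clock = batchStart + processingTime;
        }
        return starts;
    }

    public static void main(String[] args) {
        // Messages arrive at t = 0, 1, 2, 3 and each takes 10 time units.
        long[] starts = startTimes(new long[] {0, 1, 2, 3}, 10);
        System.out.println(Arrays.toString(starts)); // prints [0, 10, 10, 10]
    }
}
```

The first message is processed alone, and the three that arrive meanwhile all wait until t = 10 and then run together, which matches the behaviour observed in the question.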

Data loss: Spark 2.1 streaming with Kafka broker 0.8.2.1

I am using Spark 2.1 streaming with Kafka broker version 0.8.2.1. Spark and Kafka run on separate servers on AWS.
I use the direct approach, val directKafkaStream = KafkaUtils.createDirectStream, with StreamingContext(conf, Seconds(300)). I expect to receive 30 strings from the stream, but I actually receive only 15 to 25. As a cross-check, a Kafka consumer on the same topic shows 30 strings during the same 300 seconds, while stream.foreachRDD { rdd => gives 15 to 20 strings.
What could be causing this inconsistent data? I am using a SparkSession to create sc and ssc.
Thank you.
Set auto.offset.reset to smallest in the Kafka parameters:
val kafkaParams = Map[String, String](
"auto.offset.reset" -> "smallest", ......)

Message getting lost in Kafka + Spark Streaming

I am facing data loss in Spark Streaming with Kafka. My use case is as follows:
A Spark Streaming (DirectStream) application reads messages from a Kafka topic and processes them.
Based on the processed message, the app writes it to one of several Kafka topics. For example, if the message is harmonized it is written to the harmonized topic, otherwise to the unharmonized topic.
The problem is that during streaming I am somehow losing some messages, i.e. not all incoming messages are written to the harmonized or unharmonized topics.
For example, if the app receives 30 messages in one batch, sometimes it writes all of them to the output topics (this is the expected behaviour), but sometimes it writes only 27 (3 messages are lost; this number can change).
These are the versions I am using:
Spark 1.6.0
Kafka 0.9
The Kafka topic configuration is as follows:
num of brokers: 3
num replication factor: 3
num of partitions: 3
Following are the properties I am using for kafka:
val props = new Properties()
props.put("metadata.broker.list", properties.getProperty("metadataBrokerList"))
props.put("auto.offset.reset", properties.getProperty("autoOffsetReset"))
props.put("group.id", properties.getProperty("group.id"))
props.put("serializer.class", "kafka.serializer.StringEncoder")
props.put("outTopicHarmonized", properties.getProperty("outletKafkaTopicHarmonized"))
props.put("outTopicUnharmonized", properties.getProperty("outletKafkaTopicUnharmonized"))
props.put("acks", "all");
props.put("retries", "5");
props.put("request.required.acks", "-1")
Following is the piece of code where I am writing processed messages to Kafka:
val schemaRdd2 = finalHarmonizedDF.toJSON
schemaRdd2.foreachPartition { partition =>
  val producerConfig = new ProducerConfig(props)
  val producer = new Producer[String, String](producerConfig)
  partition.foreach { row =>
    if (debug) println(row.mkString)
    val keyedMessage = new KeyedMessage[String, String](props.getProperty("outTopicHarmonized"),
      null, row.toString())
    producer.send(keyedMessage)
  }
  //hack, should be done with the flush
  Thread.sleep(1000)
  producer.close()
}
I explicitly added sleep(1000) for testing purposes, but this does not solve the problem either.
Any suggestions would be appreciated.
Try tuning the batchDuration parameter (set when initializing the StreamingContext) to a value larger than the processing time of each RDD. This solved my problem.
Because you don't want to lose any messages, you need delivery semantics that do not drop data (at least once or exactly once). On the producer side this calls for acks='all', which you have already set.
According to this resource [1], the acks='all' property must be used in conjunction with the min.insync.replicas property.
[1] https://www.linkedin.com/pulse/kafka-producer-delivery-semantics-sylvester-daniel/
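As a rough sketch of the producer-side settings discussed above, the snippet below builds the configuration for the newer Java producer (org.apache.kafka.clients.producer.KafkaProducer). The class name, method name and broker addresses are placeholders for illustration. As far as I know, acks and retries are settings of that newer producer, while the old Scala Producer/ProducerConfig used in the question reads request.required.acks instead, so mixing both sets of keys in one Properties object may not have the intended effect.

```java
import java.util.Properties;

class ProducerSettings {
    /** Builds settings aimed at not losing messages on the producer side. */
    static Properties durableProducerProps(String bootstrapServers) {
        Properties props = new Properties();
        // Placeholder broker list, e.g. "broker1:9092,broker2:9092".
        props.put("bootstrap.servers", bootstrapServers);
        // Wait for all in-sync replicas to acknowledge each record.
        props.put("acks", "all");
        // Retry transient send failures instead of dropping the record.
        props.put("retries", "5");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        // min.insync.replicas is a broker/topic setting, not a producer
        // property, so it is deliberately absent here.
        return props;
    }

    public static void main(String[] args) {
        Properties props = durableProducerProps("broker1:9092,broker2:9092");
        System.out.println(props.getProperty("acks")); // prints all
    }
}
```

With the newer producer, calling flush() before close() waits for buffered records to be sent, which would take the place of the Thread.sleep(1000) hack.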

Several Spark Streams in the Same Process - How to match a thread per stream?

I implemented a spark-streaming job that contains several kafka streams. Each stream has its own topic.
for (Map.Entry<String, Map<TopicAndPartition, Long>> byTopic : perTopicMap.entrySet()) {
    logger.warn("Creating stream for topic " + byTopic.getKey() + " with the following offsets" + byTopic.getValue());
    JavaInputDStream<String> directStream = KafkaStreamFactory.createDirectStream(jssc, kafkaParams, byTopic.getValue(), AllMetadataMessageHandler.INSTANCE);
    processJavaDStream(directStream);
}
Later in the code I save each stream into its table according to topic:
private void processJavaDStream(JavaDStream<String> eventStream) {
    eventStream.foreachRDD((JavaRDD<String> rdd) -> {
        Dataset<Row> myDataSet = someCalc(rdd);
        myDataSet.write().format(PARQUET_FORMAT).mode(SaveMode.Append).partitionBy(PARTITION_BY_DAY, PARTITION_BY_HOUR).saveAsTable(topic);
    });
}
Everything works very well, but I would like to benefit from the parallelism that Spark offers through spark.streaming.concurrentJobs.
However, when I set it, Spark runs all streams in the same thread pool. After some time, 2 different Spark executor threads write to the same table, because the same stream gets 2 different threads. This causes a collision and the write to the table fails.
Is there a way to match threads to streams so that every stream gets exactly one executor thread?

Why does my Spark Streaming application not print the number of records from Kafka (using count operator)?

I am working on a Spark application that needs to read data from Kafka. I created a Kafka topic to which a producer was posting messages. I verified with the console consumer that the messages were posted successfully.
I wrote a short Spark application to read the data from Kafka, but it is not getting any data.
This is the code I used:
def main(args: Array[String]): Unit = {
  val Array(zkQuorum, group, topics, numThreads) = args
  val sparkConf = new SparkConf().setAppName("SparkConsumer").setMaster("local[2]")
  val ssc = new StreamingContext(sparkConf, Seconds(2))
  val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
  val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)
  process(lines) // prints the number of records in Kafka topic
  ssc.start()
  ssc.awaitTermination()
}
private def process(lines: DStream[String]) {
  val z = lines.count()
  println("count of lines is "+z)
  //edit
  lines.foreachRDD(rdd => rdd.map(println)
    // <-- Why does this **not** print?
  )
}
Any suggestions on how to resolve this issue?
EDIT
I have also used
lines.foreachRDD(rdd => rdd.map(println))
in the actual code, but that does not work either. I set the retention period as mentioned in the post "Kafka spark directStream can not get data", but the problem still exists.
Your process method continues the DStream pipeline but has no output operator to get the pipeline executed every batch interval.
You can "see" it by reading the signature of the count operator:
count(): DStream[Long]
Quoting count's scaladoc:
Returns a new DStream in which each RDD has a single element generated by counting each RDD of this DStream.
So you have a dstream of Kafka records that you transform into a dstream of single values (the results of count). Nothing in the pipeline outputs it (to the console or any other sink).
You have to end the pipeline using an output operator as described in the official documentation Output Operations on DStreams:
Output operations allow DStream’s data to be pushed out to external systems like a database or a file systems. Since the output operations actually allow the transformed data to be consumed by external systems, they trigger the actual execution of all the DStream transformations (similar to actions for RDDs).
(Low-level) Output operators register input dstreams as output dstreams so that execution can start. By design, Spark Streaming's DStream has no notion of being an output dstream. It is DStreamGraph that knows about, and can differentiate between, input and output dstreams.
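The same "nothing runs until something consumes the result" rule can be seen in plain Java streams, which makes for a small Spark-free analogy. The class and method names below are made up for illustration, and java.util.stream is only a stand-in for DStream/RDD transformations. The sketch shows a map step that never executes on its own and does execute once a terminal operation is attached.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Stream;

class LazyPipelineDemo {
    /** Builds a map step but attaches no terminal operation. */
    static List<String> transformOnly(List<String> input) {
        List<String> seen = new ArrayList<>();
        // map is lazy: the lambda is registered but never invoked here.
        Stream<String> mapped = input.stream().map(s -> { seen.add(s); return s; });
        return seen;
    }

    /** Same map step, followed by a terminal operation that forces it. */
    static List<String> withTerminalOperation(List<String> input) {
        List<String> seen = new ArrayList<>();
        input.stream()
             .map(s -> { seen.add(s); return s; })
             .forEach(s -> { }); // terminal operation: triggers the map step
        return seen;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("a", "b");
        System.out.println(transformOnly(records));         // prints []
        System.out.println(withTerminalOperation(records)); // prints [a, b]
    }
}
```

In the question's Scala code, the corresponding change would presumably be to end with an output operator or action, for example lines.count().print(), and rdd.foreach(println) in place of rdd.map(println) inside foreachRDD.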
