Message getting lost in Kafka + Spark Streaming - apache-spark

I am facing an issue of data loss in Spark Streaming with Kafka. My use case is as follows:
A Spark Streaming (DirectStream) application reads messages from a Kafka topic and processes them.
Based on the processed message, the app writes it to one of several Kafka topics: for example, if the message is harmonized it is written to the harmonized topic, otherwise to the unharmonized topic.
Now, the problem is that during streaming I am somehow losing some messages, i.e. not all of the incoming messages are written to the harmonized or unharmonized topics.
For example, if the app receives 30 messages in one batch, sometimes it writes all of them to the output topics (the expected behaviour), but sometimes it writes only 27 (3 messages are lost; this number can vary).
Following are the versions I am using:
Spark 1.6.0
Kafka 0.9
The Kafka topic configuration is as follows:
number of brokers: 3
replication factor: 3
number of partitions: 3
Following are the properties I am using for Kafka:
val props = new Properties()
props.put("metadata.broker.list", properties.getProperty("metadataBrokerList"))
props.put("auto.offset.reset", properties.getProperty("autoOffsetReset"))
props.put("group.id", properties.getProperty("group.id"))
props.put("serializer.class", "kafka.serializer.StringEncoder")
props.put("outTopicHarmonized", properties.getProperty("outletKafkaTopicHarmonized"))
props.put("outTopicUnharmonized", properties.getProperty("outletKafkaTopicUnharmonized"))
props.put("acks", "all")
props.put("retries", "5")
props.put("request.required.acks", "-1")
Following is the piece of code where I am writing processed messages to Kafka:
val schemaRdd2 = finalHarmonizedDF.toJSON
schemaRdd2.foreachPartition { partition =>
  val producerConfig = new ProducerConfig(props)
  val producer = new Producer[String, String](producerConfig)
  partition.foreach { row =>
    if (debug) println(row.mkString)
    val keyedMessage = new KeyedMessage[String, String](props.getProperty("outTopicHarmonized"),
      null, row.toString())
    producer.send(keyedMessage)
  }
  // hack, should be done with the flush
  Thread.sleep(1000)
  producer.close()
}
I have explicitly added the sleep(1000) for testing purposes.
But even this does not solve the problem :(
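For reference, this is roughly what I meant by doing it with a flush: a sketch that assumes switching to the new org.apache.kafka.clients.producer.KafkaProducer API shipped with Kafka 0.9 (and that props also carries bootstrap.servers plus key.serializer/value.serializer for it), not the code I currently run:
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

schemaRdd2.foreachPartition { partition =>
  // assumes props also contains bootstrap.servers, key.serializer and value.serializer
  val producer = new KafkaProducer[String, String](props)
  partition.foreach { row =>
    producer.send(new ProducerRecord[String, String](
      props.getProperty("outTopicHarmonized"), row.toString()))
  }
  producer.flush() // block until all buffered records have been sent
  producer.close() // close() also flushes, so the sleep is no longer needed
}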
Any suggestion would be appreciated.

Try tuning the batchDuration parameter (when initializing the StreamingContext) to a value larger than the processing time of each batch's RDD. This solved my problem.
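For example, a minimal sketch of what I mean, assuming a 10-second batch interval is larger than your per-batch processing time (the app name and interval are just placeholders):
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sparkConf = new SparkConf().setAppName("HarmonizerApp") // placeholder app name
// batchDuration: must be larger than the time it takes to process one batch's RDD
val ssc = new StreamingContext(sparkConf, Seconds(10))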

Because you don't want to lose any messages, you might want to choose the 'exactly once' delivery semantics, which guarantees no data loss. To configure exactly-once delivery semantics you have to use acks = "all", which you did.
According to this resource [1], the acks = "all" property must be used in conjunction with the min.insync.replicas property.
[1] https://www.linkedin.com/pulse/kafka-producer-delivery-semantics-sylvester-daniel/
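A sketch of that combination, assuming the new producer API and a replication factor of 3 (the min.insync.replicas value of 2 is illustrative, and the properties and command below are assumptions about your setup, not your actual config):
import java.util.Properties

// producer side: wait for acknowledgement from all in-sync replicas and retry on failure
val producerProps = new Properties()
producerProps.put("bootstrap.servers", properties.getProperty("metadataBrokerList")) // assumes host:port pairs
producerProps.put("acks", "all")
producerProps.put("retries", "5")
producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

// broker/topic side: with a replication factor of 3, require at least 2 in-sync replicas, e.g.
//   kafka-configs.sh --zookeeper <zk> --alter --entity-type topics \
//     --entity-name <topic> --add-config min.insync.replicas=2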

Related

Spark mapPartitions Issue

I am using Spark mapPartitions on my DataFrame, and the use case is that I should submit one job (either calling a Lambda or sending an SQS message) for each partition.
I am partitioning on a custom-formatted date column and logging the number of partitions before and after, and that is working as expected.
However, when I look at the total number of jobs, it is more than the number of partitions. For some of the partitions there are two or three jobs!
Here is the code I am using:
val yearMonthQueryRDD = yearMonthQueryDF.rdd.mapPartitions(
  partition => {
    val partitionObjectList = new java.util.ArrayList[String]()
    logger.info("partitionIndex = {}", TaskContext.getPartitionId())
    val partitionCounter: AtomicLong = new AtomicLong(0)
    val partitionSize: AtomicLong = new AtomicLong(0)
    val paritionColumnName: AtomicReference[String] = new AtomicReference[String]()
    // Iterate over the objects in a given partition
    val updatedPartition = partition.map(record => {
      import yearMonthQueryDF.sparkSession.implicits._
      partitionCounter.set(partitionCounter.get() + 1)
      val recordSizeInt = Integer.parseInt(record.getAs("object_size"))
      val recordSize: Long = recordSizeInt.toLong
      partitionObjectList.add(record.getAs("object_key"))
      paritionColumnName.set(record.getAs("partition_column_name"))
      record
    }).toList
    logger_ref.info("No.of Elements in Partition [" + paritionColumnName.get() + "] are =[" + partitionCounter.get() + "] Total Size=[" + partitionSize.get() + "]")
    // Submit a job for the partition
    // jobUtil.submitJob(paritionColumnName.get(), partitionObjectList, partitionSize.get())
    updatedPartition.toIterator
  }
)
Another thing that is making debugging harder is that the logging statements inside the mapPartitions() method are not found in the container error logs. Since they are executed on each worker node rather than on the master node, I expected to find them in the container logs rather than in the master node logs. (I still need to figure out why I am only seeing stderr logs but not stdout logs on the containers.)
Thanks
Sateesh

How to save data to process later after stopping DirectStream in SparkStreaming?

I am creating the following Kafka direct stream:
val messages = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](topicsSet, kafkaParams))
Then saving the values as:
val lines = messages.map(_.value)
Then stopping the streaming context when I have no further offsets to consume, as follows:
lines.foreachRDD(rdd => {
  if (rdd.isEmpty()) {
    messages.stop()
    ssc.stop(false)
  } else {
  }
})
Then I am printing the lines as follows:
lines.print()
Then I am starting the stream as:
ssc.start()
It is working fine. It reads the RDDs, prints the top 10, stops the messages stream, and stops the streaming context. But then when I execute the same line lines.print() it throws an exception saying that new inputs, transformations, or outputs cannot be added after stopping the StreamingContext.
How do I achieve my goal? I am running it in spark-shell, not as a binary (a mandatory requirement).
Here is what I actually want to achieve:
1) Consume all JSON records from the Kafka topic.
2) Stop getting further records (it is guaranteed that no new records will be added to the Kafka topic after consuming, so I don't want to keep processing empty batches).
3) Do some preprocessing by extracting some fields from the JSON records.
4) Do further operations on the preprocessed data.
5) Done.
When you call lines.print() again, it tries to run the transformation messages.map(_.value) again. Since you stopped the context, it fails.
Save the data behind the lines variable by performing an action before stopping the context.
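A minimal sketch of that idea, assuming each batch is small enough to collect to the driver (names reused from your snippet; collected is a new, hypothetical buffer):
import scala.collection.mutable.ArrayBuffer

// driver-side buffer that survives after the StreamingContext is stopped
val collected = ArrayBuffer[String]()

lines.foreachRDD { rdd =>
  if (rdd.isEmpty()) {
    ssc.stop(false)
  } else {
    // action: materialize this batch on the driver while the context is still running
    collected ++= rdd.collect()
  }
}
ssc.start()
// once the context has stopped, run the preprocessing and further operations on `collected`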

When I write messages into a Kafka topic using Spark Streaming, they all go into one partition

How can I make Spark write messages into all the partitions in Kafka, so that I can use a direct stream and improve the streaming performance?
Here is my code:
object kafka {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("FlightawareSparkApp")
    sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val ssc = new StreamingContext(sparkConf, Seconds(3))
    val lines = ssc.socketTextStream("localhost", 18436)
    val topic = "test"
    val props = new java.util.Properties()
    props.put("metadata.broker.list", "list")
    props.put("bootstrap.servers", "list")
    // props.put("bootstrap.servers", "localhost:9092")
    props.put("client.id", "KafkaProducer")
    props.put("producer.type", "async")
    props.put("key.serializer", "org.apache.kafka.common.serialization.IntegerSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    lines.foreachRDD(rdd => {
      rdd.foreachPartition(part => {
        val producer = new KafkaProducer[Integer, String](props)
        part.foreach(msg => {
          // no key is given, so the record is sent without a partitioning key
          val record = new ProducerRecord[Integer, String](topic, msg)
          producer.send(record)
        })
        producer.close()
      })
    })
    ssc.start()
    ssc.awaitTermination()
  }
}
This code pushes messages into the Kafka topic, but when I check the counts using
/usr/hdp/current/kafka-broker/bin/kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list $KAFKABROKERS --topic test --time -1
I get output where I can see the messages in only one partition:
test:8:0
test:2:0
test:5:0
test:4:0
test:7:0
test:1:0
test:9:0
test:3:0
test:6:237629
test:0:0
Any suggestions on how to spread the data across all the partitions?
How can I set a partition key in the program in order to distribute the messages across the partitions?
Thanks,
Ankush Reddy.
It's because you don't set a key. You can find the following details in the Kafka FAQ [1].
Why is data not evenly distributed among partitions when a partitioning key is not specified?
In Kafka producer, a partition key can be specified to indicate the destination partition of the message. By default, a hashing-based partitioner is used to determine the partition id given the key, and people can use customized partitioners also.
To reduce # of open sockets, in 0.8.0 (https://issues.apache.org/jira/browse/KAFKA-1017), when the partitioning key is not specified or null, a producer will pick a random partition and stick to it for some time (default is 10 mins) before switching to another one. So, if there are fewer producers than partitions, at a given point of time, some partitions may not receive any data. To alleviate this problem, one can either reduce the metadata refresh interval or specify a message key and a customized random partitioner. For more detail see this thread http://mail-archives.apache.org/mod_mbox/kafka-dev/201310.mbox/%3CCAFbh0Q0aVh%2Bvqxfy7H-%2BMnRFBt6BnyoZk1LWBoMspwSmTqUKMg%40mail.gmail.com%3E
[1] https://cwiki.apache.org/confluence/display/KAFKA/FAQ
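A sketch of the simplest change to the code above: give each record a key so the default partitioner spreads the records across partitions (the round-robin counter here is a hypothetical choice of key):
import java.util.concurrent.atomic.AtomicInteger
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

rdd.foreachPartition(part => {
  val producer = new KafkaProducer[Integer, String](props)
  val counter = new AtomicInteger(0) // hypothetical per-task counter used as the key
  part.foreach(msg => {
    // with a non-null key, the default partitioner hashes it to choose the partition
    val record = new ProducerRecord[Integer, String](topic, Int.box(counter.getAndIncrement()), msg)
    producer.send(record)
  })
  producer.close()
})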

Several Spark Streams in the Same Process - How to match a thread per stream?

I implemented a Spark Streaming job that contains several Kafka streams. Each stream has its own topic.
for (Map.Entry<String, Map<TopicAndPartition, Long>> byTopic : perTopicMap.entrySet()) {
    logger.warn("Creating stream for topic " + byTopic.getKey() + " with the following offsets" + byTopic.getValue());
    JavaInputDStream<String> directStream = KafkaStreamFactory.createDirectStream(jssc, kafkaParams, byTopic.getValue(), AllMetadataMessageHandler.INSTANCE);
    processJavaDStream(directStream);
}
Later in the code I save each stream into its own table according to the topic:
private void processJavaDStream(JavaDStream<String> eventStream) {
    eventStream.foreachRDD((JavaRDD<String> rdd) -> {
        Dataset<Row> myDataSet = someCalc(rdd);
        myDataSet.write().format(PARQUET_FORMAT).mode(SaveMode.Append)
                 .partitionBy(PARTITION_BY_DAY, PARTITION_BY_HOUR).saveAsTable(topic);
    });
}
Now everything works very well, but I would like to take advantage of the parallelism that Spark offers via spark.streaming.concurrentJobs.
However, when I add this, Spark runs all streams in the same thread pool. After some time, two different Spark executor threads write to the same table, i.e. the same stream is handled by two different threads. This causes a collision and the write into the table fails.
Is there a way to match threads and streams so that every stream gets exactly one executor thread?

Why does my Spark Streaming application not print the number of records from Kafka (using count operator)?

I am working on a Spark application that needs to read data from Kafka. I created a Kafka topic to which a producer was posting messages, and I verified from the console consumer that the messages were successfully posted.
I wrote a short Spark application to read data from Kafka, but it is not getting any data.
Following is the code I used:
def main(args: Array[String]): Unit = {
  val Array(zkQuorum, group, topics, numThreads) = args
  val sparkConf = new SparkConf().setAppName("SparkConsumer").setMaster("local[2]")
  val ssc = new StreamingContext(sparkConf, Seconds(2))
  val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
  val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)
  process(lines) // prints the number of records in the Kafka topic
  ssc.start()
  ssc.awaitTermination()
}

private def process(lines: DStream[String]) {
  val z = lines.count()
  println("count of lines is " + z)
  // edit
  lines.foreachRDD(rdd => rdd.map(println)
    // <-- Why does this **not** print?
  )
}
Any suggestions on how to resolve this issue?
****** EDIT ******
I have used
lines.foreachRDD(rdd => rdd.map(println))
as well in the actual code, but that is also not working. I set the retention period as mentioned in the post Kafka spark directStream can not get data, but the problem still exists.
Your process is a continuation of a DStream pipeline with no output operator to get the pipeline executed every batch interval.
You can "see" it by reading the signature of the count operator:
count(): DStream[Long]
Quoting the count's scaladoc:
Returns a new DStream in which each RDD has a single element generated by counting each RDD of this DStream.
So, you have a DStream of Kafka records that you transform into a DStream of single values (the result of count), but nothing outputs it (to the console or any other sink).
You have to end the pipeline using an output operator, as described in the official documentation under Output Operations on DStreams:
Output operations allow DStream’s data to be pushed out to external systems like a database or a file system. Since the output operations actually allow the transformed data to be consumed by external systems, they trigger the actual execution of all the DStream transformations (similar to actions for RDDs).
(Low-level) Output operators register input dstreams as output dstreams so that execution can start. Spark Streaming's DStream by design has no notion of being an output dstream; it is the DStreamGraph that knows and can differentiate between input and output dstreams.
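For example, a minimal sketch of the fix: print the count DStream directly, or force the per-record println with an action instead of the lazy map:
// output operator on the counts: triggers execution and prints every batch interval
lines.count().print()

// for the records themselves, use an action (foreach), not the lazy map;
// note that println runs on the executors, so check the executor stdout logs
lines.foreachRDD(rdd => rdd.foreach(record => println(record)))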
