Spark Streaming with Kafka - Not all Kafka messages are received - apache-spark

I'm working with Spark Streaming and Kafka for the first time and have been facing the following issue.
I'm using the receiver-based approach for Kafka integration with Spark Streaming:
val kafkaConf = Map("metadata.broker.list" -> "ip1:9042,ip2:9042", "group.id" -> "raw-email-event-streaming-consumer", "zookeeper.connect" -> "ip1:2181,ip2:2181")
val kafkaStream = KafkaUtils.createStream[Array[Byte], String, DefaultDecoder, StringDecoder](ssc, kafkaConf, Map(RN_MAIL_TOPIC -> RN_MAIL_TOPIC_PARTITIONS), StorageLevel.MEMORY_ONLY_SER)
I have found that many messages from the Kafka topic are never received by the Spark Streaming job (about 5k of the 7k messages produced are missing). Please provide insights as to why this might be happening. I currently submit my streaming job to an Azure HDInsight cluster in standalone mode, as follows:
spark-submit --class <class_name> --master local[*] --deploy-mode client <executable>.jar

Related

spark streaming + kafka always rebalance

Spark Streaming is consuming a stream from Kafka, but whenever new topics are added to Kafka, the Spark applications fail (the Kafka consumer group rebalances) and cannot recover automatically.
Versions:
kafka:kafka_2.11-1.0.0
spark:spark-streaming_2.11-2.4.0

Spark NiFi site to site connection

I am new to NiFi. I am trying to send data from NiFi to Spark, that is, to establish a stream from a NiFi output port to Spark, following this tutorial.
NiFi is running on Kubernetes and I am using the Spark operator on the same cluster to submit my applications.
It seems that Spark is able to reach the NiFi web endpoint, and it starts a streaming receiver. However, no data arrives in the Spark app through the output port and I get empty RDDs. I have not seen any warnings or errors in the Spark logs.
Any idea or information that could help me solve this issue is appreciated.
My code:
val conf = new SiteToSiteClient.Builder()
.keystoreFilename("..")
.keystorePass("...")
.keystoreType(...)
.truststoreFilename("..")
.truststorePass("..")
.truststoreType(...)
.url("https://...../nifi")
.portName("spark")
.buildConfig()
val lines = ssc.receiverStream(new NiFiReceiver(conf, StorageLevel.MEMORY_ONLY))

I cannot connect from my cloud kafka to databricks community edition's spark cluster

1- I have a Spark cluster on Databricks Community Edition and a Kafka instance on GCP.
2- I just want to ingest the Kafka stream from Databricks Community Edition and analyze the data in Spark.
3- This is my connection code:
val UsYoutubeDf =
spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "XXX.XXX.115.52:9092")
.option("subscribe", "usyoutube")
.load()
As mentioned, my data is arriving in Kafka.
I added the spark.driver.host address to the firewall settings; otherwise I cannot ping my Kafka machine from the Databricks cluster.
import org.apache.spark.sql.streaming.Trigger.ProcessingTime
val sortedModelCountQuery = sortedyouTubeSchemaSumDf
.writeStream
.outputMode("complete")
.format("console")
.option("truncate","false")
.trigger(ProcessingTime("5 seconds"))
.start()
After this step, the data does not arrive in Spark on the cluster:
import org.apache.spark.sql.streaming.Trigger.ProcessingTime
sortedModelCountQuery: org.apache.spark.sql.streaming.StreamingQuery = org.apache.spark.sql.execution.streaming.StreamingQueryWrapper@3bd8a775
It stays like this. Actually, the data is coming, but the code I wrote for analysis does not work here

How to subscribe to a new topic with subscribePattern?

I am using Spark Structured Streaming with Kafka, and the topics are subscribed by pattern:
option("subscribePattern", "topic.*")
// Subscribe to a pattern
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("subscribePattern", "topic.*")
.load()
Once the job is running and a new topic matching the pattern is created, say topic.new_topic, the job does not automatically start listening to the new topic; it requires a restart.
Is there a way to automatically subscribe to new topics matching the pattern without restarting the job?
Spark: 3.0.0
The default behavior of a KafkaConsumer is to check every 5 minutes whether there are new partitions to be consumed. This interval is set through the consumer config:
metadata.max.age.ms: The period of time in milliseconds after which we force a refresh of metadata even if we haven't seen any partition leadership changes to proactively discover any new brokers or partitions.
According to the Kafka Specific Configurations section of the Spark + Kafka Integration Guide, you can set this configuration by using the prefix kafka. as shown below:
.option("kafka.metadata.max.age.ms", "1000")
With this setting, a newly created topic is picked up about 1 second after its creation.
(Tested with Spark 3.0.0 and Kafka Broker 2.5.0)
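For readers using PySpark rather than Scala, the same reader can be sketched as follows. This is a minimal sketch, not part of the tested answer: the helper names kafka_reader_options and open_kafka_stream are hypothetical, the broker addresses are placeholders, and an existing SparkSession with the Kafka source on the classpath is assumed.

```python
def kafka_reader_options(bootstrap_servers, topic_pattern, metadata_max_age_ms=1000):
    """Build the option map for a Structured Streaming Kafka source (hypothetical helper)."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribePattern": topic_pattern,
        # Options prefixed with "kafka." are handed to the underlying Kafka consumer,
        # so this sets the consumer's metadata.max.age.ms.
        "kafka.metadata.max.age.ms": str(metadata_max_age_ms),
    }


def open_kafka_stream(spark, options):
    """Apply the options to an existing SparkSession and return the streaming DataFrame."""
    return spark.readStream.format("kafka").options(**options).load()


# Example (placeholders, not real hosts):
# df = open_kafka_stream(spark, kafka_reader_options("host1:9092,host2:9092", "topic.*"))
```

A very low refresh interval means more frequent metadata requests to the brokers, so a value between 1 second and the 5-minute default may be a more comfortable choice for a long-running job.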

Spark Streaming + Kafka: SparkException: Couldn't find leader offsets for Set

I'm trying to set up Spark Streaming to get messages from a Kafka queue. I'm getting the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o30.createDirectStream.
: org.apache.spark.SparkException: java.nio.channels.ClosedChannelException
org.apache.spark.SparkException: Couldn't find leader offsets for Set([test-topic,0])
at org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$checkErrors$1.apply(KafkaCluster.scala:366)
at org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$checkErrors$1.apply(KafkaCluster.scala:366)
at scala.util.Either.fold(Either.scala:97)
Here is the code I'm executing (pyspark):
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
directKafkaStream = KafkaUtils.createDirectStream(ssc, ["test-topic"], {"metadata.broker.list": "host.domain:9092"})
ssc.start()
ssc.awaitTermination()
There were a couple of similar posts with the same error. In all cases the cause was an empty Kafka topic. There are messages in my "test-topic"; I can read them with
kafka-console-consumer --zookeeper host.domain:2181 --topic test-topic --from-beginning --max-messages 100
Does anyone know what might be the problem?
I'm using:
Spark 1.5.2 (apache)
Kafka 0.8.2.0+kafka1.3.0 (CDH 5.4.7)
You need to check two things:
1) Check that the topic and partition exist; in your case the topic is test-topic and the partition is 0.
2) Based on your code, you are trying to consume messages from offset 0, and it is possible that no message is available at offset 0 any more. Check what your earliest offset is and try to consume from there.
The command below prints the earliest offset (--time -2 means earliest; --time -1 means latest):
sh kafka/bin/kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list "your broker list" --topic "topic name" --time -2
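To see how many messages are actually retained per partition, GetOffsetShell can be run twice, once with --time -2 (earliest) and once with --time -1 (latest), and the two outputs compared. The sketch below assumes the tool prints one topic:partition:offset line per partition; the function names are hypothetical.

```python
def parse_offsets(output):
    """Parse 'topic:partition:offset' lines into {(topic, partition): offset}."""
    offsets = {}
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        # The last two colon-separated fields are the partition and the offset.
        topic, partition, offset = line.rsplit(":", 2)
        offsets[(topic, int(partition))] = int(offset)
    return offsets


def available_messages(earliest_output, latest_output):
    """Return the number of retained messages per partition (latest - earliest)."""
    earliest = parse_offsets(earliest_output)
    latest = parse_offsets(latest_output)
    return {tp: latest[tp] - earliest[tp] for tp in earliest if tp in latest}
```

If the earliest offset of a partition is greater than 0, older messages have already been removed by retention, and a consumer that asks for offset 0 will not find them.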
1) Make sure that you have already created the topic test-topic.
Run the following command to list the topics:
kafka-topics.sh --list --zookeeper [host or ip of zookeeper]:[port]
2) After checking your topic, set the listener in the "Socket Server Settings" section of your Kafka configuration:
listeners=PLAINTEXT://[host or ip of Kafka]:[port]
If you define short host names in /etc/hosts and use them in your Kafka servers' configuration, you should change those names to IP addresses, or register the same short host names in the /etc/hosts file of your local PC or client.
The error occurs because the Spark Streaming library cannot resolve the short host name on the PC or client.
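A quick way to confirm the resolution problem is to run a check like the one below on the machine where the Spark driver runs, passing the host names the brokers advertise. This is a minimal sketch using only the Python standard library; the function names are hypothetical, and it only tests name resolution, not whether the port is reachable.

```python
import socket


def split_broker(address):
    """Split 'host:port' into (host, port)."""
    host, _, port = address.rpartition(":")
    return host, int(port)


def check_brokers(broker_list):
    """Return {address: resolved IPv4 address or None} for a comma-separated broker list."""
    results = {}
    for address in broker_list.split(","):
        address = address.strip()
        host, _port = split_broker(address)
        try:
            results[address] = socket.gethostbyname(host)
        except socket.gaierror:
            # The name does not resolve on this machine.
            results[address] = None
    return results


# Example: check_brokers("host.domain:9092") maps the address to an IP, or to None
# if the client cannot resolve the name.
```

Any entry that maps to None is a name the client cannot resolve, which matches the situation described above.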
Another option is to force creation of the topic if it doesn't exist. You can do this by setting the property "auto.create.topics.enable" to "true" in the kafkaParams map, like this:
val kafkaParams = Map[String, String](
"bootstrap.servers" -> kafkaHost,
"group.id" -> kafkaGroup,
"auto.create.topics.enable" -> "true")
Using Scala 2.11 and Kafka 0.10 versions.
One reason for this type of error, where the leader cannot be found for the specified topic, is a problem with the Kafka server configuration.
Open your Kafka server config:
vim ./kafka/kafka-<your-version>/config/server.properties
In the "Socket Server Settings" section, provide the IP of your host if it is missing:
listeners=PLAINTEXT://{host-ip}:{host-port}
I was using the Kafka setup provided with the MapR sandbox and was trying to access Kafka from Spark code. I was getting the same error because my configuration was missing the IP.
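The check described above can be scripted. The sketch below parses the text of a server.properties file and reports listener entries that have no host part, such as PLAINTEXT://:9092. The function names are hypothetical, and it assumes simple key=value lines with # comments.

```python
def read_properties(text):
    """Parse 'key=value' lines, ignoring blank lines and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def listeners_missing_host(props):
    """Return listener entries such as 'PLAINTEXT://:9092' that have no host part."""
    missing = []
    for entry in props.get("listeners", "").split(","):
        entry = entry.strip()
        if not entry:
            continue
        _, _, host_port = entry.partition("://")
        host, _, _port = host_port.rpartition(":")
        if not host:
            missing.append(entry)
    return missing


# Example:
# with open("config/server.properties") as f:
#     print(listeners_missing_host(read_properties(f.read())))
```

An empty result only means that every configured listener names a host; if the listeners line is commented out entirely, the broker falls back to its defaults and this check reports nothing.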
