Spark Streaming + Kafka: SparkException: Couldn't find leader offsets for Set - apache-spark

I'm trying to set up Spark Streaming to get messages from a Kafka queue. I'm getting the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o30.createDirectStream.
: org.apache.spark.SparkException: java.nio.channels.ClosedChannelException
org.apache.spark.SparkException: Couldn't find leader offsets for Set([test-topic,0])
at org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$checkErrors$1.apply(KafkaCluster.scala:366)
at org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$checkErrors$1.apply(KafkaCluster.scala:366)
at scala.util.Either.fold(Either.scala:97)
Here is the code I'm executing (pyspark):
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
ssc = StreamingContext(sc, 10)  # sc is the SparkContext provided by the pyspark shell; 10-second batches
directKafkaStream = KafkaUtils.createDirectStream(ssc, ["test-topic"], {"metadata.broker.list": "host.domain:9092"})
ssc.start()
ssc.awaitTermination()
There were a couple of similar posts with the same error. In all those cases the cause was an empty Kafka topic. There are messages in my "test-topic", though; I can read them with
kafka-console-consumer --zookeeper host.domain:2181 --topic test-topic --from-beginning --max-messages 100
Does anyone know what might be the problem?
I'm using:
Spark 1.5.2 (apache)
Kafka 0.8.2.0+kafka1.3.0 (CDH 5.4.7)

You need to check two things:
1) Check whether the topic and partition exist; in your case the topic is test-topic and the partition is 0.
2) Based on your code, you are trying to consume messages from offset 0, and it is possible that no message is available at offset 0 any more (for example because retention already deleted it). Check what your earliest offset is and try consuming from there.
Below is the command to check the earliest offset:
sh kafka/bin/kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list "your broker list" --topic "topic name" --time -2
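If the earliest offset turns out to be greater than 0, a hedged workaround is to let the direct stream start from whatever the earliest available offset is via auto.offset.reset. A minimal Scala sketch, assuming the Spark 1.x / Kafka 0.8 direct-stream API used in the question; the app name, batch interval and print() sink are illustrative, while broker and topic are the ones from the question:
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val sparkConf = new SparkConf().setAppName("earliest-offset-demo")
val ssc = new StreamingContext(sparkConf, Seconds(10))
// "smallest" tells the direct stream to begin at the earliest offset still held by the broker,
// instead of an offset that retention may already have removed.
val kafkaParams = Map(
  "metadata.broker.list" -> "host.domain:9092",
  "auto.offset.reset" -> "smallest")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("test-topic"))
stream.map(_._2).print()
ssc.start()
ssc.awaitTermination()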

1) You have to make sure that you have already created the topic test-topic (a create command is shown below if it does not exist).
Run the following command to list the topics:
kafka-topics.sh --list --zookeeper [host or ip of zookeeper]:[port]
2) After checking your topic, configure the listener in the "Socket Server Settings" section of your Kafka configuration:
listeners=PLAINTEXT://[host or ip of Kafka]:[port]
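If test-topic does not show up in that list, it can be created with the same tooling; a minimal example, assuming a single-broker setup and the same ZooKeeper address (the replication factor and partition count are illustrative):
kafka-topics.sh --create --zookeeper [host or ip of zookeeper]:[port] --replication-factor 1 --partitions 1 --topic test-topic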

If you define short host names in /etc/hosts and use them in your Kafka servers' configuration, you should either change those names to IPs or register the same short host names in your local PC's or client's /etc/hosts.
The error occurs because the Spark streaming library cannot resolve the short hostname on the PC or client.
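For example, an /etc/hosts entry like the following on the client machine would let the Spark driver resolve the broker's advertised short name (the IP and hostname here are hypothetical):
192.168.0.10    kafka1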

Another option is to force creation of the topic if it doesn't exist. You can do this by setting the property "auto.create.topics.enable" to "true" in the kafkaParams map, like this:
val kafkaParams = Map[String, String](
"bootstrap.servers" -> kafkaHost,
"group.id" -> kafkaGroup,
"auto.create.topics.enable" -> "true")
Using Scala 2.11 and Kafka 0.10 versions.
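For context, here is a minimal sketch of how such a kafkaParams map is typically passed to the Kafka 0.10 direct stream; the StreamingContext setup, broker address, group id and topic name below are illustrative placeholders, not values from the answer:
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

val sparkConf = new SparkConf().setAppName("auto-create-topic-demo")
val ssc = new StreamingContext(sparkConf, Seconds(10))
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker-host:9092",
  "group.id" -> "my-consumer-group",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "auto.create.topics.enable" -> "true")
// The 0.10 integration takes the params through a ConsumerStrategy.
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("my-new-topic"), kafkaParams))
stream.map(_.value).print()
ssc.start()
ssc.awaitTermination()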

One of the reasons for this type of error, where the leader cannot be found for the specified topic, is a problem with your Kafka server config.
Open your Kafka server config:
vim ./kafka/kafka-<your-version>/config/server.properties
In the "Socket Server Settings" section , provide IP for your host if its missing :
listeners=PLAINTEXT://{host-ip}:{host-port}
I was using the Kafka setup provided with the MapR sandbox and was trying to access Kafka via Spark code. I was getting the same error while accessing my Kafka, since my configuration was missing the IP.

Related

How to modify on-the-fly kafka cluster logger?

I have a secure Kafka cluster (SSL with certificates) in production and I want to modify some logger levels on the fly, without restarting the cluster (not even with a rolling update).
The official docs state that you can modify broker configuration dynamically.
So I tried this command:
/bin/kafka-configs --bootstrap-server localhost:9092 --describe --entity-type broker-loggers --entity-name 1
only to obtain this error:
java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.ClusterAuthorizationException: Cluster authorization failed.
If I try with port 9093, I get a java.util.concurrent.TimeoutException.
kafka-configs is the right command to use.
You need to tell the command "who you are" / "log in".
It's achieved with the --command-config option.
There is an official example:
kafka-configs --command-config /etc/kafka/client.properties --bootstrap-server [hostname]:9093 --describe --entity-type broker-loggers --entity-name 1
Once you can use describe, you can alter as well, like this:
kafka-configs --command-config /etc/kafka/client.properties --bootstrap-server [hostname]:9093 --alter --add-config "kafka.authorizer.logger=INFO" --entity-type broker-loggers --entity-name 1
which results in:
Completed updating config for broker-logger 1.
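For reference, a minimal sketch of what such a client.properties could look like for an SSL-secured cluster; all paths and passwords below are placeholders, not values from the question:
security.protocol=SSL
ssl.truststore.location=/path/to/client.truststore.jks
ssl.truststore.password=<truststore-password>
ssl.keystore.location=/path/to/client.keystore.jks
ssl.keystore.password=<keystore-password>
ssl.key.password=<key-password>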

Spark NiFi site to site connection

I am new to NiFi. I am trying to send data from NiFi to Spark, i.e. to establish a stream from a NiFi output port to Spark, according to this tutorial.
NiFi is running on Kubernetes, and I am using the Spark operator on the same cluster to submit my applications.
It seems like Spark is able to reach the NiFi web API and it starts a streaming receiver. However, no data is coming into the Spark app through the output port and I get empty RDDs. I have not seen any warnings or errors in the Spark logs.
Any idea or information which could help me solve this issue is appreciated.
My code:
val conf = new SiteToSiteClient.Builder()
.keystoreFilename("..")
.keystorePass("...")
.keystoreType(...)
.truststoreFilename("..")
.truststorePass("..")
.truststoreType(...)
.url("https://...../nifi")
.portName("spark")
.buildConfig()
val lines = ssc.receiverStream(new NiFiReceiver(conf, StorageLevel.MEMORY_ONLY))

How to subscribe to a new topic with subscribePattern?

I am using Spark Structured Streaming with Kafka, and the topic is subscribed to as a pattern:
option("subscribePattern", "topic.*")
// Subscribe to a pattern
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("subscribePattern", "topic.*")
.load()
Once I start the job and a new topic, say topic.new_topic, is created, the job does not automatically start listening to it and requires a restart.
Is there a way to automatically subscribe to new topics matching the pattern without restarting the job?
Spark: 3.0.0
The default behavior of a KafkaConsumer is to check every 5 minutes whether there are new partitions to be consumed. This interval is set through the consumer config:
metadata.max.age.ms: The period of time in milliseconds after which we force a refresh of metadata even if we haven't seen any partition leadership changes to proactively discover any new brokers or partitions.
According to the Spark + Kafka Integration Guide on Kafka Specific Configuration you can set this configuration by using the prefix kafka. as shown below:
.option("kafka.metadata.max.age.ms", "1000")
Through this setting the newly created topic will be consumed 1 second after its creation.
(Tested with Spark 3.0.0 and Kafka Broker 2.5.0)
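Putting it together with the stream definition from the question (the broker addresses are the question's placeholders; the SparkSession setup is added here only for completeness):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("subscribe-pattern-demo").getOrCreate()
val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribePattern", "topic.*")
  // Forces a metadata refresh every second so new topics matching the pattern are picked up.
  .option("kafka.metadata.max.age.ms", "1000")
  .load()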

what is zookeeper.broker.path

I'm learning Spark and Kafka and came across the kafka-spark-consumer project, which seems to consume messages from Kafka efficiently. The project requires configuring a few Kafka and ZooKeeper properties, and that's where I'm struggling: what does the property zookeeper.broker.path mean? Sorry if it's a basic question.
I have configured Kafka as a single node with the following properties:
broker.id=1
port=9093
log.dir=/tmp/kafka-logs-1
and ZooKeeper as:
zookeeper.connect=localhost:2181/brokers
zookeeper.connection.timeout.ms=6000
If I try to configure zookeeper.broker.path with /brokers, I get the following exception from the consumer:
Exception in thread "main" java.lang.RuntimeException: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /brokers/topics/<name>/partitions
at consumer.kafka.ReceiverLauncher.getNumPartitions(ReceiverLauncher.java:217)
at consumer.kafka.ReceiverLauncher.createStream(ReceiverLauncher.java:79)
at consumer.kafka.ReceiverLauncher.launch(ReceiverLauncher.java:51)
at com.ibm.spark.streaming.KafkaConsumer.run(KafkaConsumer.java:78)
at com.ibm.spark.streaming.KafkaConsumer.start(KafkaConsumer.java:43)
at com.ibm.spark.streaming.KafkaConsumer.main(KafkaConsumer.java:103)
Can you help me understand what the ZooKeeper broker path is here and how I can configure it?
EDIT
The above error was caused by a non-existent topic; the moment I created the topic, the error went away.
As answered by user007, the /brokers node is created by Kafka in ZooKeeper by default.
There is no need for '/brokers' in the zookeeper.connect property. It should be:
zookeeper.connect=localhost:2181
I am not familiar with the "kafka-spark-consumer" project you mentioned, but /brokers is usually the default node Kafka creates in ZooKeeper; I haven't seen any library asking the user to configure it.
/brokers is the znode path under which metadata like topics are stored.
Go to the Kafka bin directory, then invoke the ZooKeeper shell: ./zookeeper-shell.sh localhost:2181
Then run ls /brokers. You should be able to see topics and the other child nodes created there.
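For illustration, a session against a default local setup would look roughly like this (the exact child znodes listed depend on the Kafka version):
./zookeeper-shell.sh localhost:2181
ls /brokers
ls /brokers/topics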

Apache Spark broadcast error: Error sending message as driverActor is null [message = UpdateBlockInfo(BlockManagerId

I'm using Apache Spark 1.1.0 and I'm currently having an issue with the broadcast method. When I broadcast a small dataset to a 5-node cluster, I get "Error sending message as driverActor is null" after broadcasting the variables several times (the app runs under JBoss).
Any help would be appreciated.
The problem was resolved by no longer stopping and restarting the SparkContext in the app (running in WildFly), because the actor system does not shut down properly in this Spark version.
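A minimal sketch of that pattern, assuming an embedded deployment where the application owns its context (the object name, app name and master URL are hypothetical): create one SparkContext for the lifetime of the JVM and never stop and recreate it per request.
import org.apache.spark.{SparkConf, SparkContext}

object SharedSparkContext {
  // Created lazily once per JVM and reused for every job/broadcast;
  // do not call sc.stop() while the application server is still serving requests.
  lazy val sc: SparkContext = new SparkContext(
    new SparkConf().setAppName("jboss-spark-app").setMaster("local[*]")) // hypothetical master URL
}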
