How to subscribe to a new topic with subscribePattern? - apache-spark

I am using Spark Structured Streaming with Kafka and subscribe to topics by pattern:
option("subscribePattern", "topic.*")
// Subscribe to a pattern
val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribePattern", "topic.*")
  .load()
Once the job is running and a new matching topic is created, say topic.new_topic, the job does not automatically start listening to it; it requires a restart.
Is there a way to automatically pick up new topics matching the pattern without restarting the job?
Spark: 3.0.0

The default behavior of a KafkaConsumer is to check every 5 minutes whether there are new topics or partitions to be consumed. This interval is set through the consumer config
metadata.max.age.ms: The period of time in milliseconds after which we force a refresh of metadata even if we haven't seen any partition leadership changes to proactively discover any new brokers or partitions.
According to the Kafka Specific Configurations section of the Spark + Kafka Integration Guide, you can set this configuration by prefixing it with kafka., as shown below:
.option("kafka.metadata.max.age.ms", "1000")
With this setting, a newly created topic that matches the pattern will start being consumed about 1 second after its creation.
(Tested with Spark 3.0.0 and Kafka Broker 2.5.0)
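Putting it together with the query from the question, a minimal sketch (same placeholder servers and pattern as above) would look like this:
val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribePattern", "topic.*")
  .option("kafka.metadata.max.age.ms", "1000") // refresh topic/partition metadata every second
  .load()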

Related

Can I send messages to a Kafka cluster via Azure Databricks as a batch job (and close my connection once the messages I sent are consumed)?

I want to send messages to Kafka via Azure Databricks once a day, received as a batch job.
I need to send them to a Kafka server, but we don't want to have a cluster running all day for this job.
I saw the Databricks writeStream method (I can't make it work yet, but that is not the purpose of my question). It looks like I need to be streaming day and night to make it run.
Is there a way to use it as a batch job? Can I send the messages to the Kafka server and shut down my cluster once they are received?
df = spark \
  .readStream \
  .format("delta") \
  .option("numPartitions", 5) \
  .option("rowsPerSecond", 5) \
  .load('/mnt/sales/marketing/numbers/DELTA/')
(df.select("Sales", "value")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "rferferfeez.eu-west-1.aws.confluent.cloud:9092")
  .option("topic", "bingofr")
  .option("kafka.sasl.username", "jakich")
  .option("kafka.sasl.password", 'ozifjoijfziaihufzihufazhufhzuhfzuoehza')
  .option("checkpointLocation", "/mnt/sales/marketing/numbers/temp/")
  .option("spark.kafka.clusters.cluster.sasl.token.mechanism", "cluster-buyit")
  .option("request.timeout.ms", 30)
  .option("includeHeaders", "true")
  .start()
)
kafkashaded.org.apache.kafka.common.errors.TimeoutException: Topic bingofr not present in metadata after 60000 ms.
It is worth noting that we also have an Event Hub. Would I be better off sending messages to our Event Hub and implementing a triggered function that writes to Kafka?
Just want to elaborate on Alex Ott's comment, as it seems to work.
By adding .trigger(availableNow=True), you can "periodically spin up a cluster, process everything that is available since the last period, and then shutdown the cluster. In some case, this may lead to significant cost savings."
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#triggers
(df.select("key", "value", "partition")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", host)
  .option("topic", topic)
  .option("kafka.sasl.jaas.config",
    'kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required username="{}" password="{}";'.format(userid, password))
  .option("checkpointLocation", "/mnt/Sales/Markerting/Whiteboards/temp/")
  .option("kafka.security.protocol", "SASL_SSL")
  .trigger(availableNow=True)
  .start()
)
Normally Kafka is a continuous service/capability, at least where I have been.
I would consider a cloud service like Azure, where an Event Hub is used on a per-message basis with the Kafka API: always on, pay per message.
Otherwise, you will need a batch job that starts Kafka, does your processing, and then stops Kafka. You do not state whether everything is on Databricks, though.

spark streaming + kafka always rebalance

Spark Streaming is consuming from Kafka, but when new topics are added to Kafka, the Spark applications always go wrong (Kafka's group ID rebalances) and the applications can't auto-recover.
Versions:
Kafka: kafka_2.11-1.0.0
Spark: spark-streaming_2.11-2.4.0

I cannot connect from my cloud kafka to databricks community edition's spark cluster

1- I have a Spark cluster on Databricks Community Edition and a Kafka instance on GCP.
2- I just want to ingest the Kafka stream from Databricks Community Edition and analyze the data with Spark.
3- This is my connection code.
val UsYoutubeDf =
  spark
    .readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "XXX.XXX.115.52:9092")
    .option("subscribe", "usyoutube")
    .load()
As mentioned, my data is arriving in Kafka.
I added spark.driver.host to the firewall settings; otherwise I cannot even ping my Kafka machine from the Databricks cluster.
import org.apache.spark.sql.streaming.Trigger.ProcessingTime
val sortedModelCountQuery = sortedyouTubeSchemaSumDf
  .writeStream
  .outputMode("complete")
  .format("console")
  .option("truncate", "false")
  .trigger(ProcessingTime("5 seconds"))
  .start()
After this point, the data does not come through to Spark on the cluster:
import org.apache.spark.sql.streaming.Trigger.ProcessingTime
sortedModelCountQuery: org.apache.spark.sql.streaming.StreamingQuery = org.apache.spark.sql.execution.streaming.StreamingQueryWrapper@3bd8a775
It stays like this. The data is actually arriving, but the analysis code I wrote does not produce any output here.

Spark Streaming with Kafka - Not all Kafka messages are received

I'm working with Spark Streaming and Kafka for the first time and have been facing the following issue:
I'm using the receiver-based approach for Kafka integration with Spark Streaming as:
val kafkaConf = Map("metadata.broker.list" -> "ip1:9042,ip2:9042", "group.id" -> "raw-email-event-streaming-consumer", "zookeeper.connect" -> "ip1:2181,ip2:2181")
val kafkaStream = KafkaUtils.createStream[Array[Byte], String, DefaultDecoder, StringDecoder](ssc, kafkaConf, Map(RN_MAIL_TOPIC -> RN_MAIL_TOPIC_PARTITIONS), StorageLevel.MEMORY_ONLY_SER)
I have found that a lot of messages from the Kafka topic are not received in the Spark Streaming job (~5k out of 7k messages produced aren't received). Please provide insights as to why this might be happening. I currently submit my streaming job to an Azure HDInsight cluster in standalone mode, as follows:
spark-submit --class <class_name> --master local[*] --deploy-mode client <executable>.jar

Spark Streaming + Kafka: SparkException: Couldn't find leader offsets for Set

I'm trying to setup Spark Streaming to get messages from Kafka queue. I'm getting the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o30.createDirectStream.
: org.apache.spark.SparkException: java.nio.channels.ClosedChannelException
org.apache.spark.SparkException: Couldn't find leader offsets for Set([test-topic,0])
at org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$checkErrors$1.apply(KafkaCluster.scala:366)
at org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$checkErrors$1.apply(KafkaCluster.scala:366)
at scala.util.Either.fold(Either.scala:97)
Here is the code I'm executing (pyspark):
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
directKafkaStream = KafkaUtils.createDirectStream(ssc, ["test-topic"], {"metadata.broker.list": "host.domain:9092"})
ssc.start()
ssc.awaitTermination()
There were a couple of similar posts with the same error. In all cases the cause was an empty Kafka topic, but there are messages in my "test-topic"; I can get them out with
kafka-console-consumer --zookeeper host.domain:2181 --topic test-topic --from-beginning --max-messages 100
Does anyone know what might be the problem?
I'm using:
Spark 1.5.2 (apache)
Kafka 0.8.2.0+kafka1.3.0 (CDH 5.4.7)
You need to check two things:
1) Check whether this topic and partition exist; in your case the topic is test-topic and the partition is 0.
2) Based on your code, you are trying to consume messages from offset 0, and it is possible that no message is available at offset 0 anymore. Check what your earliest offset is and try consuming from there (see the sketch after the command below).
Below is the command to check the earliest offset:
sh kafka/bin/kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list "your broker list" --topic "topic name" --time -2
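If the earliest offset is indeed greater than 0, one way to start from it is to set auto.offset.reset to smallest in the Kafka parameters of the direct stream. Below is only a minimal Scala sketch under that assumption (broker and topic names are taken from the question, everything else is illustrative); the PySpark createDirectStream accepts the same setting in its kafkaParams dict:
// Start the 0.8 direct stream from the earliest offset that is still retained,
// instead of assuming messages exist from offset 0.
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val sparkConf = new SparkConf().setAppName("earliest-offset-sketch")
val ssc = new StreamingContext(sparkConf, Seconds(5))

val kafkaParams = Map(
  "metadata.broker.list" -> "host.domain:9092",
  "auto.offset.reset"    -> "smallest"  // begin at the earliest retained offset
)

val directKafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("test-topic"))

directKafkaStream.map(_._2).print()
ssc.start()
ssc.awaitTermination()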
1) You have to make sure that you have already created the topic test-topic.
Run the following command to check the list of topics:
kafka-topics.sh --list --zookeeper [host or ip of zookeeper]:[port]
2) After checking your topic, you have to review your Kafka configuration in the Socket Server Settings section:
listeners=PLAINTEXT://[host or ip of Kafka]:[port]
If you define short host names in /etc/hosts and use them in your Kafka servers' configurations, you should change those names to IPs, or register the same short host names in the /etc/hosts of your local PC or client.
The error occurs because the Spark Streaming library cannot resolve the short hostname on the PC or client.
Another option is to force creation of the topic if it doesn't exist. You can do this by setting the property "auto.create.topics.enable" to "true" in the kafkaParams map, like this (a sketch of passing the map to the stream follows below):
val kafkaParams = Map[String, String](
  "bootstrap.servers" -> kafkaHost,
  "group.id" -> kafkaGroup,
  "auto.create.topics.enable" -> "true")
Using Scala 2.11 and Kafka 0.10 versions.
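For context, here is a hedged sketch of how a map like the one above could be handed to the Kafka 0.10 direct stream API. It assumes ssc (a StreamingContext), kafkaHost and kafkaGroup already exist, and it adds the key/value deserializers that the 0.10 consumer requires; those additions are not part of the original answer.
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

// Consumer properties, including the auto.create.topics.enable flag suggested above.
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> kafkaHost,
  "group.id" -> kafkaGroup,
  "auto.create.topics.enable" -> "true",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer])

// Wire the parameters into the direct stream for the topic in question.
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Subscribe[String, String](Seq("test-topic"), kafkaParams))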
One of the reasons for this type of error, where a leader cannot be found for the specified topic, is a problem with your Kafka server configs.
Open your Kafka server configs:
vim ./kafka/kafka-<your-version>/config/server.properties
In the "Socket Server Settings" section, provide the IP for your host if it's missing:
listeners=PLAINTEXT://{host-ip}:{host-port}
I was using the Kafka setup provided with the MapR sandbox and was trying to access Kafka via Spark code. I was getting the same error while accessing my Kafka because my configuration was missing the IP.
