Dynamically update topics list for spark kafka consumer - apache-spark

Is it possible to dynamically update topics list in spark-kafka consumer?
I have a Spark Streaming application which uses spark-kafka consumer.
Say initially I have a spark-kafka consumer listening for the topics ["test"], and after a while my topics list gets updated to ["test","testNew"]. Is there a way to update the spark-kafka consumer's topics list and ask it to consume data for the updated list of topics, without stopping the Spark Streaming application or the Spark Streaming context?

Is it possible to dynamically update topics list in spark-kafka consumer
No. Both the receiver and receiver-less approaches are fixed once you initialize the Kafka stream using KafkaUtils. There is no way for you to pass new topics as you go, as the DAG is fixed.
If you want to read dynamically, perhaps consider a batch Kafka job which is scheduled iteratively, reads the topics dynamically, and creates an RDD out of them.
An additional solution would be to use a technology that gives you more flexibility over the consumption, such as Akka Streams.
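For illustration, a minimal sketch of such a scheduled batch job using the 0.10 integration's KafkaUtils.createRDD (the topic list, offsets, and kafkaParams here are placeholders you would resolve yourself, e.g. from ZooKeeper or a database, not code from the answer above):

import java.util.List;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;
import org.apache.spark.streaming.kafka010.OffsetRange;

public class DynamicTopicsBatchJob {
    // Call this from a scheduler; each run re-reads the current topic list.
    public static void run(JavaSparkContext jsc, Map<String, Object> kafkaParams,
                           List<String> currentTopics, long fromOffset, long untilOffset) {
        // One OffsetRange per topic; partition 0 only, for brevity.
        OffsetRange[] ranges = currentTopics.stream()
                .map(t -> OffsetRange.create(t, 0, fromOffset, untilOffset))
                .toArray(OffsetRange[]::new);

        JavaRDD<ConsumerRecord<String, String>> rdd = KafkaUtils.<String, String>createRDD(
                jsc, kafkaParams, ranges, LocationStrategies.PreferConsistent());

        rdd.foreach(record -> System.out.println(record.topic() + " -> " + record.value()));
    }
}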

As Yuval said, it isn't possible, but there might be a workaround if you know the structure/format of the data you are dealing with from Kafka.
For example,
If your streaming application is listening to topics ["test","testNew"]
Down the line, you want to add a new topic named [test4]. As a workaround, you can simply add a unique key to the data that would go into it and pass that data through the existing topics.
Design your streaming application in such a way that it recognizes/filters the data based on the key you added to that test4 data.
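A minimal sketch of that filtering step, assuming a JavaInputDStream<ConsumerRecord<String, String>> named stream and "test4" as the distinguishing key (both placeholders):

// Records tagged with the "new topic" key go one way, everything else the other.
JavaDStream<ConsumerRecord<String, String>> test4Data =
        stream.filter(record -> "test4".equals(record.key()));
JavaDStream<ConsumerRecord<String, String>> existingData =
        stream.filter(record -> !"test4".equals(record.key()));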

You can use a thread-based approach:
1. Define a cache, using any data structure, that holds the list of topics.
2. Provide a way to add elements to this cache.
3. Have two classes, A and B, where B has all the Spark-related logic.
4. Class A is a long-running job, and from A you call B; whenever there is a new topic, you just spawn a new thread running B (see the sketch below).
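A very rough sketch of this idea (all class and method names here are hypothetical, not from any library):

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class TopicLauncher {                       // "class A": the long-running driver
    private final Set<String> topicCache = ConcurrentHashMap.newKeySet();  // step 1: the cache

    public void addTopic(String topic) {           // step 2: way to add to the cache
        if (topicCache.add(topic)) {
            // step 4: spawn a new thread running the Spark logic ("class B")
            new Thread(() -> new SparkTopicJob(topic).run(), "spark-" + topic).start();
        }
    }
}

class SparkTopicJob {                              // "class B": holds the Spark-related logic
    private final String topic;
    SparkTopicJob(String topic) { this.topic = topic; }
    public void run() {
        // create a streaming context / direct stream for `topic` here
    }
}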

I'd suggest trying ConsumerStrategies.SubscribePattern from the latest Spark-Kafka integration (0.10) API version.
That would look like:
KafkaUtils.createDirectStream(
  mySparkStreamingContext,
  PreferConsistent,
  SubscribePattern("test.*".r.pattern, myKafkaParamsMap))
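For the Java API of the same 0.10 integration, a roughly equivalent call would look like this (jssc and kafkaParams are placeholders for your streaming context and consumer parameter map):

JavaInputDStream<ConsumerRecord<String, String>> stream =
        KafkaUtils.createDirectStream(
                jssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, String>SubscribePattern(
                        java.util.regex.Pattern.compile("test.*"), kafkaParams));

The point of the pattern subscription is that topics created later which match the regex should be picked up as the underlying consumer refreshes its metadata, without restarting the streaming context.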

Related

Spark Streaming in Java: Reading from two Kafka Topics using One Consumer using JavaInputDStream

I have a Spark application which is required to read from two different topics using a single consumer, with the Spark Java API.
The Kafka message key and value schemas are the same for both topics.
Below is the workflow:
1. Read messages from both topics, with the same groupId, using JavaInputDStream<ConsumerRecord<String, String>> and iterate using foreachRDD.
2. Inside the loop, read offsets, filter messages based on the message key, and create a JavaRDD<String>.
3. Iterate on the JavaRDD<String> using mapPartitions.
4. Inside the mapPartitions loop, iterate over the records using forEachRemaining.
5. Perform data enrichment, transformation, etc. on the rows inside the forEachRemaining loop.
6. Commit the offsets (see the sketch below).
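A rough sketch of that workflow (the key check and the enrichment step are placeholders, not the actual application code; stream is the JavaInputDStream created earlier):

stream.foreachRDD(rdd -> {
    // 2. read the offset ranges for this micro-batch
    OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();

    // 2. filter on the message key and keep only the values
    JavaRDD<String> values = rdd
            .filter(record -> "wanted-key".equals(record.key()))
            .map(ConsumerRecord::value);

    // 3./4./5. iterate per partition and enrich each row
    JavaRDD<String> enriched = values.mapPartitions(rows -> {
        List<String> out = new ArrayList<>();
        rows.forEachRemaining(row -> out.add(row.toUpperCase())); // stand-in for enrichment
        return out.iterator();
    });
    enriched.count(); // force evaluation before committing

    // 6. commit the offsets back to Kafka
    ((CanCommitOffsets) stream.inputDStream()).commitAsync(offsetRanges);
});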
I want to understand the questions below. Please provide your answers or share any documentation which can help me find the answers.
1. How are the messages received/consumed from the two topics (one common group id, same schema for both key/value) in one consumer?
Let's say the consumer reads data every second. Producer1 produces 50 messages to Topic1 and Producer2 produces 1000 messages to Topic2.
2. Is it going to read all messages (1000+50) in one batch and process them together in the workflow, or is it going to read the 50 messages first, process them, and then read the 1000 messages and process them?
3. What parameter should I use to control the number of messages being read in one batch per second?
4. Will the same group id create any issue while consuming?
The official Spark Streaming documentation already explains how to consume multiple topics per group id:
https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html
Collection<String> topics = Arrays.asList("topicA", "topicB");

JavaInputDStream<ConsumerRecord<String, String>> stream =
    KafkaUtils.createDirectStream(
        streamingContext,
        LocationStrategies.PreferConsistent(),
        ConsumerStrategies.<String, String>Subscribe(topics, kafkaParams)
    );
It uses one group id and follows the same schema for both topics.
Not sure about this; however, from my understanding it would consume all the messages, depending on the batch size.
Set "spark.streaming.backpressure.enabled" to true and "spark.streaming.kafka.maxRatePerPartition" to a numeric value; based on these, Spark limits the number of messages to consume from Kafka per batch. Also set the batch duration accordingly. https://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/streaming/api/java/JavaStreamingContext.html
This totally depends on your application usage.
1. How are the messages received/consumed from the two topics (one common group id, same schema for both key/value) in one consumer?
Let's say the consumer reads data every second. Producer1 produces 50 messages to Topic1 and Producer2 produces 1000 messages to Topic2.
Any Kafka consumer can subscribe to a list of topics, so there are no constraints here.
So if you have one consumer, it will be responsible for all the partitions of both Topic1 and Topic2.
2. Is it going to read all messages (1000+50) in one batch and process them together in the workflow, or is it going to read the 50 messages first, process them, and then read the 1000 messages and process them?
3. What parameter should I use to control the number of messages being read in one batch per second?
Answer to questions 2 and 3:
It will receive all the messages together (1050) or even more, depending on your configuration.
In order to allow the consumer to receive batches of 1050 or more, raise max.poll.records (default 500) to 1050 (or higher); other configuration could become a bottleneck, but you should be OK with the defaults for the rest.
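For example, in the kafkaParams map passed to the consumer (bootstrap servers and group id are placeholders):

Map<String, Object> kafkaParams = new HashMap<>();
kafkaParams.put("bootstrap.servers", "localhost:9092");
kafkaParams.put("key.deserializer", StringDeserializer.class);
kafkaParams.put("value.deserializer", StringDeserializer.class);
kafkaParams.put("group.id", "my-group");
kafkaParams.put("max.poll.records", 1050); // default is 500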
4. Will the same group id create any issue while consuming?
The same group-id will affect you if you create more than one consumer, making the consumers split the partitions of both topics between them.
Moreover, if your consumer dies or stops for some reason, you have to bring it back up with the same group-id; this way the consumer "remembers" the last offset consumed and continues from the point where it stopped.
If you have any more problems regarding your consumer, I suggest you read this article; it is chapter 4 of Kafka: The Definitive Guide, explains consumers in depth, and should answer further questions.
If you want to explore the configuration options, the documentation is always helpful.

kafka connect multiple topics in sink connector properties

I am trying to read 2 Kafka topics using the Cassandra sink connector and insert into 2 Cassandra tables. How can I go about doing this?
This is my connector.properties file:
name=cassandra-sink-orders
connector.class=com.datamountaineer.streamreactor.connect.cassandra.sink.CassandraSinkConnector
tasks.max=1
topics=topic1,topic2
connect.cassandra.kcql=INSERT INTO ks.table1 SELECT * FROM topic1;INSERT INTO ks.table2 SELECT * FROM topic2
connect.cassandra.contact.points=localhost
connect.cassandra.port=9042
connect.cassandra.key.space=ks
connect.cassandra.contact.points=localhost
connect.cassandra.username=cassandra
connect.cassandra.password=cassandra
Am I doing everything right? Is this the best way of doing this or should I create two separate connectors?
There's one issue with your config: you need one task per topic-partition, so if each of your topics has one partition, you need tasks.max set to at least 2.
I don't see this documented in Connect's docs, which is a shame.
If you want to consume those two topics in one consumer, that's fine and it's a correct setup. The best way of doing this depends on whether those messages should be consumed by one or two consumers, so it depends on your business logic.
Anyway, if you want to consume two topics via one consumer, that should work fine, since a consumer can subscribe to multiple topics. Did you try running this consumer? Is it working?

What is the most simple way to write to kafka from spark stream

I would like to write data from a Spark stream to Kafka.
I know that I can use KafkaUtils to read from Kafka.
But KafkaUtils doesn't provide an API to write to Kafka.
I checked a past question and sample code.
Is the above sample code the simplest way to write to Kafka?
If I adopt the approach from that sample, I must create many classes...
Do you know of a simpler way, or a library that helps with writing to Kafka?
Have a look here:
Basically, this blog post summarises your possibilities, which are written up in different variations in the link you provided.
If we look at your task straightforwardly, we can make several assumptions:
Your output data is divided into several partitions, which may (and quite often will) reside on different machines
You want to send the messages to Kafka using the standard Kafka Producer API
You don't want to pass data between machines before the actual send to Kafka
Given those assumptions, your set of solutions is pretty limited: you either have to create a new Kafka producer for each partition and use it to send all the records of that partition, or you can wrap this logic in some sort of factory/sink, but the essential operation remains the same: you'll still request a producer object for each partition and use it to send the partition's records.
I suggest you continue with one of the examples in the provided link; the code is pretty short, and any library you find would most probably do the exact same thing behind the scenes.
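As a minimal sketch of the per-partition producer approach (assuming a JavaDStream<String> called lines and a target topic "out-topic", both placeholders; a real job would reuse or pool producers instead of opening one per batch and partition):

lines.foreachRDD(rdd ->
    rdd.foreachPartition(records -> {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // One producer per partition, created on the executor that holds the data.
        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        records.forEachRemaining(value ->
                producer.send(new ProducerRecord<>("out-topic", value)));
        producer.close();
    })
);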

Storm like structure in Apache Spark

You know how in Apache Storm you can have a Spout streaming data to multiple Bolts. Is there a way to do something similar in Apache Spark?
I basically want there to be one program that reads data from a Kafka queue and outputs it to 2 different programs, which can then process it in their own, different ways.
Specifically, there would be a reader program that would read data from the Kafka queue and output it to 2 programs x and y. x would process the data to calculate metrics of one kind (in my case it would calculate the user activities) whereas y would calculate metrics of another kind (in my case this would be checking activities based on different devices).
Can someone help me understand how this is possible in Spark?
Why don't you simply create two topologies?
Both topologies should have a spout reading from the Kafka topic (yes, you can have multiple topologies reading from the same topic; I have this running on production systems). Make sure that you use different spout configs; otherwise Kafka/ZooKeeper will see both topologies as the same consumer. Have a look at the documentation here.
Spoutconfig is an extension of KafkaConfig that supports additional fields with ZooKeeper connection info and for controlling behavior specific to KafkaSpout. The zkRoot will be used as the root to store your consumer's offset. The id should uniquely identify your spout.
public SpoutConfig(BrokerHosts hosts, String topic, String zkRoot, String id);
Implement program x in topology x and program y in topology y.
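A sketch of what the two spout configs might look like (the broker/ZooKeeper address, ids, and the bolt classes are placeholders; the distinct ids are what keep the two topologies' offsets separate):

BrokerHosts hosts = new ZkHosts("zookeeper-host:2181");

SpoutConfig spoutX = new SpoutConfig(hosts, "myTopic", "/kafka-spouts", "topology-x-spout");
SpoutConfig spoutY = new SpoutConfig(hosts, "myTopic", "/kafka-spouts", "topology-y-spout");

TopologyBuilder builderX = new TopologyBuilder();
builderX.setSpout("kafka-spout", new KafkaSpout(spoutX), 1);
builderX.setBolt("user-activity", new UserActivityBolt(), 2).shuffleGrouping("kafka-spout");

TopologyBuilder builderY = new TopologyBuilder();
builderY.setSpout("kafka-spout", new KafkaSpout(spoutY), 1);
builderY.setBolt("device-activity", new DeviceActivityBolt(), 2).shuffleGrouping("kafka-spout");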
Another option would be to have two graphs of bolts subscribing to the same spout, but IMHO this is not optimal, because failed tuples (which are likely to fail in only one single graph) would be replayed to both graphs, even if they fail in only one of the graphs; therefore some Kafka messages would end up being processed twice. Using two separate topologies, you avoid this.

Apache kafka message dispatching and balance loading

I've just started with Apache Kafka and am really trying to figure out how I could design my system to use it in the proper manner.
I'm building a system which processes data, and a chunk of data is actually a task (object) that needs to be processed. An object knows how it should be processed, so that's not a problem.
My system is actually split into 3 main components: a Publisher (code which spawns tasks), the transport (actually Kafka), and a set of Consumers, which are workers that just pull data from the queue and process it somehow. It's important to note that a Consumer could be a publisher itself, if its task needs 2-step computation (the Consumer just creates tasks and sends them back to the transport).
So we could start with the idea that I have 3 servers: 1 single root publisher (the Kafka server is also running there) and 2 consumer servers which actually handle the tasks. The data workflow is like this: the Publisher creates a task and puts it on the transport, then one of the consumers takes this task from the queue and handles it. And it would be nice if each consumer handled the same amount of tasks as the others (so the workload is spread equally between consumers).
Which Kafka configuration pattern do I need to use for this case? Does Kafka have some message-balancing features, or do I need to create 2 partitions and bind each consumer to a single partition so that it can consume data only from that partition?
In Kafka, the number of partitions roughly translates to the parallelism of the system.
The general tip is to create more partitions per topic (e.g. 10) and, while creating the consumer, specify the number of consumer threads corresponding to the number of partitions.
In the high-level consumer API, while creating the consumer you can provide the number of streams (threads) to create per topic. Assume that you create 10 partitions and you run the consumer process from a single machine; you can give topicCount as 10. If you run the consumer process from 2 servers, you could specify the topicCount as 5.
Please refer to this link:
The createMessageStreams call registers the consumer for the topic, which results in rebalancing the consumer/broker assignment. The API encourages creating many topic streams in a single call in order to minimize this rebalancing.
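A sketch with the old high-level consumer API the quote refers to (host names, group id, and topic are placeholders):

Properties props = new Properties();
props.put("zookeeper.connect", "zookeeper-host:2181");
props.put("group.id", "task-workers");

ConsumerConnector consumer =
        kafka.consumer.Consumer.createJavaConsumerConnector(new ConsumerConfig(props));

// 10 partitions, one consumer machine -> ask for 10 streams (threads)
Map<String, Integer> topicCountMap = new HashMap<>();
topicCountMap.put("tasks", 10);

Map<String, List<KafkaStream<byte[], byte[]>>> streams =
        consumer.createMessageStreams(topicCountMap);
// hand each KafkaStream to its own worker thread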
Also, you can dynamically increase the number of partitions using the kafka-add-partitions.sh command under kafka/bin. After increasing the partitions you can restart the consumer process with an increased topicCount.
Also, while producing, you should use the KeyedMessage class with some random key within your message object, so that the messages are evenly distributed across the different partitions.
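For instance, with the old producer API (the broker address and topic are placeholders):

Properties props = new Properties();
props.put("metadata.broker.list", "kafka-host:9092");
props.put("serializer.class", "kafka.serializer.StringEncoder");

Producer<String, String> producer = new Producer<>(new ProducerConfig(props));
String key = java.util.UUID.randomUUID().toString(); // random key -> spread across partitions
producer.send(new KeyedMessage<>("tasks", key, "serialized task payload"));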
