This is the code I have. Do I need to add something to the properties?
At first I thought this had to do with partitioning, but it turns out that there's a way to make the Kafka producer use more threads.
Can someone explain how I can do this?
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig}

val props = new Properties()
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, url)
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
props.put(ProducerConfig.PARTITIONER_CLASS_CONFIG, "IdPartitioner")
val kafkaProducer = new KafkaProducer[String, String](props)
turns out that there's a way to make kafka producer use more threads
Not natively, no. You would have to create a new Thread yourself, or use a higher-level producing library that might do that for you. Spring Kafka or Akka's Kafka support might be options to look at, or Spark/Flink/Beam (since you had Hadoop).
Multiple messages will already be sent in a batch, however, and each produced record includes a topic name and key, so even a single thread is producing "in parallel" to multiple possible brokers.
KafkaProducer is thread-safe, and it uses an internal thread to send messages to brokers.
It does keep an internal buffer of messages waiting to be sent (how long they wait is controlled by linger.ms), and a scenario like this is possible:
thread 1 submits message M1 (target = topic1/partition1) to producer
thread 2 submits message M2 (target = topic2/partition1) to producer
assumption: t1/p1 & t2/p1 are both hosted on the same broker
producer's internal thread wakes up and sends both of them in the same request
In general it should suffice for your requirements - you can submit your produce requests in parallel, and they will be organized by KafkaProducer itself.
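To make the scenario above concrete, here is a toy model that uses only the JDK (no Kafka client at all). Two threads append records to a shared thread-safe buffer, and a single drain step groups whatever has accumulated by destination broker. The class name, the Map.Entry record shape and the partition-to-broker map are all made up for illustration; the real KafkaProducer's accumulator and sender thread are considerably more involved.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Toy model only: this is NOT how KafkaProducer is implemented internally.
final class SharedBufferSketch {
    // Each pending entry is (topic/partition, payload).
    private final ConcurrentLinkedQueue<Map.Entry<String, String>> buffer =
            new ConcurrentLinkedQueue<>();
    // Hypothetical lookup: which broker hosts which topic/partition.
    private final Map<String, String> brokerFor;

    SharedBufferSketch(Map<String, String> brokerFor) {
        this.brokerFor = brokerFor;
    }

    // Safe to call from any thread, like send() on a shared producer.
    void submit(String topicPartition, String payload) {
        buffer.add(Map.entry(topicPartition, payload));
    }

    // What a single sender thread might do on wake-up: drain and group per broker.
    Map<String, List<String>> drainByBroker() {
        Map<String, List<String>> requests = new TreeMap<>();
        Map.Entry<String, String> e;
        while ((e = buffer.poll()) != null) {
            String broker = brokerFor.getOrDefault(e.getKey(), "unknown");
            requests.computeIfAbsent(broker, b -> new ArrayList<>()).add(e.getValue());
        }
        return requests;
    }

    // Replays the M1/M2 scenario: two threads, two topics, one broker.
    static Map<String, List<String>> demo() {
        SharedBufferSketch producer = new SharedBufferSketch(Map.of(
                "topic1/partition1", "broker-1",
                "topic2/partition1", "broker-1"));
        ExecutorService pool = Executors.newFixedThreadPool(2);
        pool.execute(() -> producer.submit("topic1/partition1", "M1")); // thread 1
        pool.execute(() -> producer.submit("topic2/partition1", "M2")); // thread 2
        pool.shutdown();
        try {
            pool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(ex);
        }
        return producer.drainByBroker();
    }
}
```

In demo(), both messages end up in the single request for broker-1, although they came from different threads and target different topics. The order of M1 and M2 inside that request depends on thread scheduling.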
I need to consume from a Kafka topic that will hold millions of records. Once I read from the topic, I need to transform the data and write it to another topic. I am able to consume messages from the topic, process the data with multiple threads and write to another topic.
I followed the example from here https://projectreactor.io/docs/kafka/1.3.5-SNAPSHOT/reference/index.html#concurrent-ordered
Here is my code:
public Flux<?> flux() {
    KafkaSender<Integer, Person> sender = sender(senderOptions());
    return KafkaReceiver.create(receiverOptions(Collections.singleton(sourceTopic)))
            .receive()
            .map(m -> SenderRecord.create(transform(m.value()), m.receiverOffset()))
            .as(sender::send)
            .doOnNext(m -> m.correlationMetadata().acknowledge())
            .doOnCancel(() -> close());
}
I have multiple consumers to read from and was looking into adding different reader threads to read from the topic due to the volume of data. However, the reactor-kafka documentation mentions KafkaReceiver is not thread-safe since the underlying KafkaConsumer cannot be accessed concurrently by multiple threads.
I am looking for suggestions on reading from a topic concurrently.
What you are looking for is called a Consumer Group. The maximum parallel consumption you can run is limited by the number of partitions your topic has.
The Kafka Consumer Group mechanism lets you split the work of consuming a topic between different "readers" that belong to the same group. The work is divided so that each partition is handled by exactly one consumer in the group; a consumer gets one or more partitions, depending on the number of consumers in the group and the number of partitions in the topic.
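To illustrate why the partition count caps the parallelism, here is a small sketch of dividing partitions among the members of a group. It is a simplified round-robin written for this answer, not Kafka's actual assignor (the built-in range, round-robin and sticky assignors differ in detail), and the consumer names are made up.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified illustration; Kafka's real partition assignors are more elaborate.
final class GroupAssignmentSketch {
    // Deals partitions 0..partitions-1 out to the consumers in turn.
    static Map<String, List<Integer>> assign(int partitions, List<String> consumers) {
        if (consumers.isEmpty()) {
            throw new IllegalArgumentException("need at least one consumer");
        }
        Map<String, List<Integer>> result = new LinkedHashMap<>();
        for (String consumer : consumers) {
            result.put(consumer, new ArrayList<>());
        }
        for (int p = 0; p < partitions; p++) {
            String owner = consumers.get(p % consumers.size());
            result.get(owner).add(p); // each partition has exactly one owner
        }
        return result;
    }
}
```

With 3 partitions and consumers c1 and c2, c1 gets partitions 0 and 2 and c2 gets partition 1. With 2 partitions and 3 consumers, the third consumer gets nothing and sits idle, which is the limit described above.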
I have a spark application which is required to read from two different topics using one consumer using Spark Java.
The kafka message key & value schema is same for both the topics.
Below is the workflow:
1. Read messages from both the topics, same groupID, using JavaInputDStream<ConsumerRecord<String, String>> and iterate using foreachRDD
2. Inside the loop, Read offsets, filter messages based on the message key and create JavaRDD<String>
3. Iterate on JavaRDD<String> using mapPartitions
4. Inside mapPartitions loop, iterate over them using forEachRemaining.
5. Perform data enrichment, transformation, etc on the rows inside forEachRemaining loop.
6. commit
I would like to understand the questions below. Please provide your answers or share any documentation which can help me find answers.
1. How are the messages received/consumed from two topics (one common group id, same schema for both key and value) in one consumer?
Let's say the consumer reads data every second. Producer1 produces 50 messages to Topic1 and Producer2 produces 1000 messages to Topic2.
2. Is it going to read all messages (1000+50) in one batch and process them together in the workflow, OR is it going to read 50 messages first, process them, and then read 1000 messages and process them?
3. What parameter should I use to control the number of messages being read in one batch per second?
4. Will the same group id create any issue while consuming?
The official Spark Streaming documentation already explains how to consume multiple topics per group id.
https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html
Collection<String> topics = Arrays.asList("topicA", "topicB");
JavaInputDStream<ConsumerRecord<String, String>> stream =
    KafkaUtils.createDirectStream(
        streamingContext,
        LocationStrategies.PreferConsistent(),
        ConsumerStrategies.<String, String>Subscribe(topics, kafkaParams)
    );
One group id and follows same schema for both the topics.
Not sure about this; however, from my understanding it would consume all the messages, depending on the batch size.
Set "spark.streaming.backpressure.enabled" to true and "spark.streaming.kafka.maxRatePerPartition" to a numeric value (records per second per partition); based on these, Spark limits the number of messages it consumes from Kafka per batch. Also set the batch duration accordingly. https://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/streaming/api/java/JavaStreamingContext.html
This totally depends on your application usage.
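As a rough worked example of the rate limit mentioned above: since spark.streaming.kafka.maxRatePerPartition is a per-second, per-partition rate, the upper bound for one batch is that rate times the total number of partitions times the batch duration in seconds. The helper below is only arithmetic, and the numbers (4 partitions across both topics, a 1-second batch) are assumptions picked for illustration. With backpressure enabled, Spark may read fewer records than this cap.

```java
// Arithmetic sketch only; the class and method names are made up for this answer.
final class BatchCapSketch {
    // Upper bound on records in one micro-batch under maxRatePerPartition.
    static long maxRecordsPerBatch(long maxRatePerPartition, int totalPartitions, long batchSeconds) {
        return maxRatePerPartition * totalPartitions * batchSeconds;
    }

    public static void main(String[] args) {
        // Assumed setup: 2 topics x 2 partitions, rate 100, batch duration 1 s.
        System.out.println(maxRecordsPerBatch(100, 4, 1)); // prints 400
    }
}
```

So in the question's scenario, a batch would hold at most 400 of the 1050 waiting messages under these assumed settings, and the rest would be picked up by later batches.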
1. How the messages are received/consumed from two topics(one common group id, same schema both key/value) in one consumer.
Let say the consumer reads data every second. Producer1 produces 50 messages to Topic1 and Producer 2 produces 1000 messages to Topic2.
Any Kafka consumer can subscribe to a list of topics, so there is no constraint here.
So if you have one consumer, it will be responsible for all the partitions of both Topic1 and Topic2.
2. Is it going to read all msgs(1000+50) in one batch and process together in the workflow, OR is it going to read 50 msgs first, process them and then read 1000 msgs and process them.
3. What parameter should I use to control the number of messages being read in one batch per second.
Answer for both 2,3 questions:
It will receive all the messages together (1050), or even more, depending on your configuration.
To allow the consumer to receive batches of 1050 or more, raise max.poll.records (default 500) to 1050 or more. Other settings could become a bottleneck, but the defaults should be fine for the rest.
4. Will same group id create any issue while consuming.
The same group-id will affect you if you create more than one consumer: the consumers in the group will then split the partitions of the topics between them.
Moreover, if your consumer dies or stops for some reason, you have to bring it back up with the same group-id; this way the consumer "remembers" the last offset consumed and continues from the point where it stopped.
If you have any more problems with your consumer, I suggest reading this article; it is chapter 4 of Kafka: The Definitive Guide, which explains consumers in depth and should answer further questions.
If you want to explore the configuration options, the documentation is always helpful.
We have a front layer which just receives messages and writes to the Kafka topics for back-end processing. We send the messages at a very high rate; per day we process 1 billion messages. We have a thread pool which accepts the messages and writes to the Kafka producer instance. Here I have created only one producer (single instance) which is shared among multiple threads.
Recently, I have been observing that 90% of the threads are in blocked state. I found out that Kafka is sending the data sequentially. There was a synchronized block in the producer.send() method in the Kafka Java driver:
def send(messages: KeyedMessage[K,V]*) {
  lock synchronized {
    if (hasShutdown.get)
      throw new ProducerClosedException
    recordStats(messages)
    sync match {
      case true => eventHandler.handle(messages)
      case false => asyncSend(messages)
    }
  }
}
The documentation says that we don't need to create multiple producer instances; one instance can be shared in a multi-threaded environment. But how can we do that? Or should we better create a pool of producer instances?
The reason it is recommended to share the producer client across threads is that it leads to better batching, as the messages are batched at the partition level. Better batching leads to better compression (if enabled) and also better throughput. You can consider tuning parameters like buffer.memory, linger.ms and batch.size to optimize the throughput.
Once this is done, you can consider adding multiple producers.
Also, consider increasing the number of partitions for the topic, if the incoming rate for the topic is quite high.
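For reference, the tuning parameters mentioned above could be set like this. This is only a sketch: the keys are standard producer configuration names, but the values are arbitrary starting points chosen for illustration and should be measured against your own workload, and the helper class name is made up. It uses plain string keys, so it compiles without the Kafka client on the classpath; the resulting Properties would be passed to the producer constructor.

```java
import java.util.Properties;

// Config fragment only; the values are illustrative guesses, not recommendations.
final class ProducerTuningSketch {
    static Properties tunedProps(String bootstrapServers) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("linger.ms", "20");            // wait up to 20 ms for a batch to fill
        props.put("batch.size", "65536");        // target batch size per partition, in bytes
        props.put("buffer.memory", "67108864");  // total bytes the producer may buffer
        props.put("compression.type", "lz4");    // larger batches tend to compress better
        return props;
    }
}
```

A larger linger.ms trades a little latency for fuller batches, which is usually what a high-volume front layer wants.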
Is it possible to dynamically update topics list in spark-kafka consumer?
I have a Spark Streaming application which uses spark-kafka consumer.
Say initially I have a spark-kafka consumer listening for topics ["test"], and after a while my topics list is updated to ["test","testNew"]. Is there a way to update the spark-kafka consumer's topics list and ask it to consume data for the updated list of topics, without stopping the Spark Streaming application or the streaming context?
Is it possible to dynamically update topics list in spark-kafka consumer
No. Both the receiver and receiverless approaches are fixed once you initialize the kafka stream using KafkaUtils. There is no way for you to pass new topics as you go as the DAG is fixed.
If you want to read dynamically, perhaps consider a batch job which is scheduled iteratively, reads the topics dynamically and creates an RDD out of them.
An additional solution would be to use a technology that gives you more flexibility over the consumption, such as Akka Streams.
As Yuval said, it isn't possible, but there might be a workaround if you know the structure/format of the data you are dealing with from Kafka.
For example:
Your streaming application is listening to topics ["test","testNew"].
Down the line you want to add a new topic named [test4]. As a workaround, you can simply add a unique key to the data that would go into it and send that data to the existing topics.
Design your streaming application so that it recognizes/filters the data based on the key you added to that data.
You can use a thread-based approach:
1. Define a cache, using any data structure, which contains the list of topics.
2. Provide a way to add elements to this cache.
3. Have two classes, A and B, where B holds all the Spark-related logic.
4. Class A is a long-running job that calls B; whenever there is a new topic, you just spawn a new thread running B.
I'd suggest trying ConsumerStrategies.SubscribePattern from the latest Spark-Kafka integration (0.10) API version.
That would look like:
KafkaUtils.createDirectStream(
  mySparkStreamingContext,
  PreferConsistent,
  SubscribePattern("test.*".r.pattern, myKafkaParamsMap))
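To see which topic names a pattern like the one above would pick up, here is a small JDK-only check. It assumes the pattern has to match the whole topic name, which is my understanding of how pattern subscription behaves; the topic names and the helper class are made up for illustration.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Illustration only: filters a list of topic names with a full-name regex match.
final class TopicPatternSketch {
    static List<String> matching(Pattern pattern, List<String> topics) {
        return topics.stream()
                .filter(topic -> pattern.matcher(topic).matches()) // whole name must match
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("test.*");
        // "test" and "testNew" match; "other" and "mytest" do not.
        System.out.println(matching(p, List.of("test", "testNew", "other", "mytest")));
    }
}
```

So a topic created later as "testNew" would fall under the same subscription, provided its name fits the pattern you chose up front.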
I've just started with Apache Kafka and am trying to figure out how to design my system to use it properly.
I'm building a system which processes data, and my chunk of data is a task (object) that needs to be processed. The object knows how it should be processed, so that's not a problem.
My system is split into 3 main components: the Publisher (code which spawns tasks), the transport (Kafka), and a set of Consumers (workers that pull data from the queue and process it). It's important to note that a Consumer can be a publisher itself if its task needs a 2-step computation (the Consumer just creates tasks and sends them back to the transport).
So we could start with the idea that I have 3 servers: 1 single root publisher (with the Kafka server also running there) and 2 consumer servers which actually handle the tasks. The data workflow is like this: the Publisher creates a task and puts it on the transport, then one of the consumers takes this task from the queue and handles it. It would be nice if each consumer handled the same amount of tasks as the others (so the workload is spread equally between consumers).
Which Kafka configuration pattern do I need to use for that case? Does Kafka have some message balancing features, or do I need to create 2 partitions, with each consumer bound to a single partition and consuming data only from that partition?
In Kafka, the number of partitions roughly translates to the parallelism of the system.
The general tip is to create more partitions per topic (e.g. 10) and, when creating the consumer, to specify a number of consumer threads corresponding to the number of partitions.
In the High-level consumer API while creating the consumer you can provide the number of streams(threads) to create per topic. Assume that you create 10 partitions and you run the consumer process from a single machine, you can give topicCount as 10. If you run the consumer process from 2 servers you could specify the topicCount as 5.
Please refer to this link
The createMessageStreams call registers the consumer for the topic, which results in rebalancing the consumer/broker assignment. The API encourages creating many topic streams in a single call in order to minimize this rebalancing.
Also, you can dynamically increase the number of partitions using the kafka-add-partitions.sh command under kafka/bin. After increasing the partitions, you can restart the consumer process with an increased topicCount.
Also, while producing, you should use the KeyedMessage class with some random key taken from your message object, so that the messages are evenly distributed across the different partitions.
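To show why random keys spread the load, here is a JDK-only sketch that maps keys to partitions by hashing. It uses String.hashCode as a stand-in hash, which is an assumption made for illustration and not the hash Kafka's partitioner actually uses; the class and method names are made up as well.

```java
import java.util.Random;

// Illustration only: String.hashCode stands in for the real partitioner's hash.
final class KeySpreadSketch {
    // Maps a key to a partition in the range 0..numPartitions-1.
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // Counts how many of `messages` random keys land on each partition.
    static int[] spread(int messages, int numPartitions, long seed) {
        Random rnd = new Random(seed); // fixed seed keeps the run repeatable
        int[] counts = new int[numPartitions];
        for (int i = 0; i < messages; i++) {
            String key = Long.toHexString(rnd.nextLong());
            counts[partitionFor(key, numPartitions)]++;
        }
        return counts;
    }
}
```

Calling spread(10000, 10, 42L) should give each of the 10 partitions a share in the neighbourhood of 1000 messages. The same key always maps to the same partition, so a constant or heavily skewed key would send everything to one place, which is why the key should vary between messages.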