Kafka Consumer Thread, Task, Partition? - multithreading

I have a Kafka cluster with 3 machines, and a topic with 6 partitions (2 partitions per machine).
When I start a consumer application that has 6 consumer threads and belongs to one group, I know each consumer thread will be assigned one partition.
What I want to know is: will the task of a consumer thread run on the machine where its partition is located, or on the machine where the app was started?

The model you are talking about sounds like the one we have with Apache Spark, where workers that process data run on worker nodes, coordinated by a driver application on the developer/user machine.
Kafka doesn't work this way.
Kafka brokers are independent of the Kafka application(s) where consumers run to get messages from topics/partitions.
The machine where you start your consumer application(s) is the machine where the application runs; it doesn't run on the broker nodes. The application, with its consumers, connects to the "remote" broker nodes to fetch messages.
It's also true that you can run your Kafka application(s) on a broker node as just another JVM process, but that's not the model you describe above (which, as I said, is much more like Apache Spark).
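To make the point concrete, here is a minimal sketch (broker addresses, topic and group names are assumptions, not taken from the question): every thread below runs inside your application's JVM, wherever you launched it, and only talks to the brokers over the network.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ConsumerApp {
    public static void main(String[] args) {
        // 6 threads, one consumer each, all in the same group: the group coordinator
        // on the brokers assigns each consumer one of the 6 partitions.
        for (int i = 0; i < 6; i++) {
            new Thread(() -> {
                Properties props = new Properties();
                props.put("bootstrap.servers", "broker1:9092,broker2:9092,broker3:9092");
                props.put("group.id", "my-group");
                props.put("key.deserializer",
                        "org.apache.kafka.common.serialization.StringDeserializer");
                props.put("value.deserializer",
                        "org.apache.kafka.common.serialization.StringDeserializer");
                try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                    consumer.subscribe(Collections.singletonList("my-topic"));
                    while (true) {
                        // poll() runs here, in this JVM, not on the broker machines
                        ConsumerRecords<String, String> records =
                                consumer.poll(Duration.ofMillis(500));
                        records.forEach(r -> System.out.println(r.partition() + ": " + r.value()));
                    }
                }
            }).start();
        }
    }
}
```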

Related

What is the best approach to scale a Node.js app that consumes Kafka messages?

I have a Node.js application that consumes Kafka messages, performs some long-running operations, and finally sends the result to another microservice through a Kafka message.
What is the best approach to improve my application's scalability so it can process more Kafka messages per second, and why?
Using the Node.js cluster module to create multiple instances that connect to Kafka using the same consumer group (I think this should improve scalability with the number of cores).
Using a thread pool from the Node.js worker_threads module and distributing the consumed Kafka messages across the worker threads.
Any other suggestions?

Clustered app - only one server at a time reads from kafka, what am I missing?

I have a clustered application built around Spring tooling, using Kafka as the message layer for the fabric. At a high level, its architecture is a master process that parcels out work to slave processes running on separate hardware/VMs.
Master
  |_______________
  |       |       |
slave1  slave2  slave3
What I expect to happen is, if I throw 100 messages at Kafka, each of the slaves (three in this example) will pick up a proportionate number of messages and execute a proportionate amount of the work (about 1/3rd in this example).
What really happens is that one slave picks up all of the messages and executes all of the work. It is indeterminate which slave will pick up the messages, but it is guaranteed that once a slave starts picking up messages, the others will not until that slave has finished its work.
To me, it looks like the read from Kafka is pulling all of the messages from the queue, rather than one at a time. This leads me to believe I missed a configuration either in Kafka or in Spring Kafka.
I think you're missing a conceptual understanding of what Apache Kafka is and how it works.
First of all, there are no queues. Messages are stored in the topic, and everybody subscribed can get the same message. However, there is a concept of a consumer group: regardless of the number of subscribers, only one of them will read a given message if they share the same consumer group.
There is another feature in Kafka called partitions. With it you can distribute your messages across different partitions, or they will be assigned automatically (evenly, by default). Partitions have another use as well: when there are several subscribers for the same topic in the same consumer group, the partitions are distributed between them. So you may want to reconsider your logic in favor of the built-in features of Apache Kafka.
There is nothing to do from the Spring Kafka perspective, though. You only need to configure your topic with a reasonable number of partitions and provide the same consumer group for all your "slaves", as sketched below.
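As a rough illustration (the topic name, partition count, and group id are made up, not from the question), a Spring Kafka sketch of that setup might look like this:

```java
import org.apache.kafka.clients.admin.NewTopic;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Component;

@Configuration
class WorkTopicConfig {
    // Enough partitions so that each slave can be assigned at least one.
    @Bean
    public NewTopic workTopic() {
        return new NewTopic("work-topic", 6, (short) 1);
    }
}

@Component
class WorkListener {
    // Every slave instance uses the same group id, so the broker spreads the
    // partitions (and therefore the messages) across the running slaves.
    @KafkaListener(topics = "work-topic", groupId = "slave-workers")
    public void handle(String message) {
        // ... do this slave's share of the work ...
    }
}
```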

How Kafka partitions are shared in Spark streaming with Kafka?

I am wondering how the Kafka partitions are shared among the SimpleConsumer instances being run from inside the executor processes. I know how the high-level Kafka consumers share the partitions across different consumers in the consumer group. But how does that happen when Spark is using the SimpleConsumer? There will be multiple executors for the streaming jobs across machines.
All Spark executors should also be part of the same consumer group. Spark uses roughly the same Java API for Kafka consumers; it's just the scheduling that distributes the work across multiple machines.
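For illustration, here is a hedged sketch using the newer spark-streaming-kafka-0-10 direct integration (rather than the legacy SimpleConsumer the question mentions); the broker address, topic, and group id are assumptions. The single group id is shared by all the consumers Spark creates on the executors, and the location strategy spreads the topic partitions evenly across them.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

public class DirectStreamExample {
    public static JavaInputDStream<ConsumerRecord<String, String>> build(JavaStreamingContext jssc) {
        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "broker1:9092");
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        // One group id shared by every executor-side consumer.
        kafkaParams.put("group.id", "spark-streaming-group");

        return KafkaUtils.createDirectStream(
                jssc,
                LocationStrategies.PreferConsistent(),   // spread partitions evenly over executors
                ConsumerStrategies.Subscribe(Collections.singletonList("my-topic"), kafkaParams));
    }
}
```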

Spark Streaming and High Availability

I'm building an Apache Spark application that acts on multiple streams.
I did read the Performance Tuning section of the documentation:
http://spark.apache.org/docs/latest/streaming-programming-guide.html#performance-tuning
What I didn't get is:
1) Are the streaming receivers located on multiple worker nodes or on the driver machine?
2) What happens if one of the nodes that receives the data fails (power off/restart)?
Are the streaming receivers located on multiple worker nodes or on the driver machine?
Receivers are located on worker nodes, which are responsible for consuming from the source that holds the data.
What happens if one of the nodes that receives the data fails (power off/restart)?
The receiver is located on the worker node. The worker node gets its tasks from the driver. The driver can either be located on a dedicated master server if you're running in client mode, or on one of the workers if you're running in cluster mode. If a node that doesn't run the driver fails, the driver will re-assign the partitions held by the failed node to a different node, which will then re-read the data from the source and do the additional processing needed to recover from the failure.
This is why a replayable source such as Kafka or AWS Kinesis is needed.

Force the Spark Streaming Kafka consumer processes onto different machines

I'm using Spark Streaming integrated with spark-streaming-kafka.
My Kafka topic has 80 partitions, while my machines have 40 cores each. I found that when the job is running, the Kafka consumer processes are only deployed to 2 machines (40*2=80), so the bandwidth of those 2 machines becomes very, very high.
I wonder whether there is any way to control the dispatch of the Kafka consumers, in order to balance the bandwidth and memory usage?
You can use this consumer from Spark-Packages.
http://spark-packages.org/package/dibbhatt/kafka-spark-consumer
This consumer has been running successfully in many production deployments and is the most reliable Receiver-based low-level consumer.
It gives more control over offset commits and Receiver fault tolerance. It also lets you control how many Receivers you configure for your topic, which determines the parallelism.
Dibyendu
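For comparison, with the built-in receiver-based API (spark-streaming-kafka 0.8, not the package linked above), the usual way to spread consumption over more machines is to start several receivers and union their streams. A rough sketch follows; the topic, group id, ZooKeeper quorum, and receiver count are assumptions:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;

public class MultiReceiverExample {
    public static JavaPairDStream<String, String> build(JavaStreamingContext jssc) {
        int numReceivers = 8; // each receiver can land on a different executor/machine
        List<JavaPairDStream<String, String>> streams = new ArrayList<>();
        for (int i = 0; i < numReceivers; i++) {
            streams.add(KafkaUtils.createStream(
                    jssc, "zk1:2181", "my-group",
                    Collections.singletonMap("my-topic", 10))); // 10 consumer threads per receiver
        }
        // Union the per-receiver streams into a single DStream for downstream processing.
        JavaPairDStream<String, String> unified = streams.get(0);
        for (int i = 1; i < streams.size(); i++) {
            unified = unified.union(streams.get(i));
        }
        return unified;
    }
}
```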
