Force Spark Streaming's Kafka consumer processes onto different machines - apache-spark

I'm using Spark Streaming integrated with spark-streaming-kafka.
My Kafka topic has 80 partitions, while my machines have 40 cores each. I found that when the job is running, the Kafka consumer processes are deployed to only 2 machines (40*2=80), so the bandwidth usage on those 2 machines becomes very high.
Is there any way to control how the Kafka consumers are dispatched, in order to balance the bandwidth and memory usage across the cluster?

You can use this consumer from Spark-Packages.
http://spark-packages.org/package/dibbhatt/kafka-spark-consumer
This consumer has been running successfully in many production deployments and is the most reliable receiver-based low-level consumer.
It gives more control over offset commits and receiver fault tolerance. It also lets you control how many receivers you configure for your topic, which determines the parallelism.
Dibyendu
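For comparison, the stock spark-streaming-kafka receiver API can achieve a similar spread: instead of one receiver owning all 80 partitions, you create several receiver streams and union them, so Spark can place the receivers on more executors. A rough sketch (receiver count, ZooKeeper address, group id and topic name are made-up values, not from the question):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// Sketch only: split the 80-partition topic across several receivers so the
// network traffic is no longer concentrated on two machines.
val conf = new SparkConf().setAppName("kafka-spread")
val ssc = new StreamingContext(conf, Seconds(5))

val numReceivers = 8                        // assumed value; tune to your cluster
val zkQuorum = "zk-host:2181"               // hypothetical ZooKeeper quorum
val groupId = "my-consumer-group"           // hypothetical consumer group
val topics = Map("my-topic" -> 10)          // ~10 consumer threads per receiver

val kafkaStreams = (1 to numReceivers).map { _ =>
  KafkaUtils.createStream(ssc, zkQuorum, groupId, topics)
}
val unified = ssc.union(kafkaStreams)       // downstream processing uses the combined stream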

Related

Clustered app - only one server at a time reads from kafka, what am I missing?

I have a clustered application built around Spring tooling, using Kafka as the message layer for the fabric. At a high level, its architecture is a master process that parcels out work to slave processes running on separate hardware/VMs.
Master
  |________________________
  |           |           |
slave1      slave2      slave3
What I expect to happen is, if I throw 100 messages at Kafka, each of the slaves (three in this example) will pick up a proportionate number of messages and execute a proportionate amount of the work (about 1/3 in this example).
What really happens is that one slave picks up all of the messages and executes all of the work. It is indeterminate which slave will pick up the messages, but it is guaranteed that once a slave starts picking up messages, the others will not until that slave has finished its work.
To me, it looks like the read from Kafka is pulling all of the messages from the queue, rather than one at a time. This leads me to believe I missed a configuration either on Kafka or in Spring Kafka.
I think you are missing a conceptual understanding of what Apache Kafka is and how it works.
First of all, there are no queues. Messages are stored in a topic, and everybody who subscribes can get the same message. However, there is the concept of a consumer group: regardless of the number of subscribers, only one of them will read a given message if they share the same consumer group.
There is another feature in Kafka called partitions. You can distribute your messages into different partitions, or they will be assigned automatically (evenly by default). The partitions feature has another use: when several subscribers to the same topic are in the same consumer group, the partitions are distributed between them. So you may want to reconsider your logic in favor of these built-in features of Apache Kafka.
There is nothing to do from the Spring Kafka perspective, though. You only need to configure your topic with a reasonable number of partitions and use the same consumer group for all your "slaves".
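To see the consumer-group mechanics outside of Spring, here is a minimal sketch with the plain Kafka consumer API (broker address, topic and group name are assumptions; a kafka-clients version with poll(Duration) is assumed). Running this same code on every "slave" with the same group.id makes Kafka split the topic's partitions among the instances, so each message is handled by exactly one of them:

import java.time.Duration
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer

// Minimal sketch: start this same program on every slave. Because group.id is
// identical, Kafka assigns each instance a disjoint subset of the topic's partitions.
val props = new Properties()
props.put("bootstrap.servers", "kafka-host:9092")   // assumed broker address
props.put("group.id", "work-group")                  // same group on every slave
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(Collections.singletonList("work-topic"))  // topic created with >= 3 partitions

while (true) {
  val records = consumer.poll(Duration.ofMillis(500))
  records.forEach { r =>
    println(s"partition ${r.partition()}: ${r.value()}")     // each record reaches exactly one slave
  }
}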

Kafka Consumer Thread, Task, Partition?

I have a Kafka cluster with 3 machines, and a topic with 6 partitions (2 partitions on every machine).
When I start a consumer application which has 6 consumer threads belonging to one group, I know each consumer thread will be assigned one partition.
What I want to know is: will a consumer thread's task run on the machine where its partition is located, or on the machine where the application was started?
The model you are talking about sounds like the one we have in Apache Spark, where workers for processing data run on worker nodes, coordinated by a driver application on the developer/user machine.
Kafka doesn't work in this way.
Kafka brokers are independent of the Kafka application(s) in which consumers run to fetch messages from topics/partitions.
Your consumer application runs on whichever machine you start it on; it does not run on the broker nodes. The application, with its consumers, connects to the "remote" broker nodes to fetch messages.
It's also true that you can run your Kafka application(s) on a broker node as just another JVM process, but that is not the model you describe above (which, as I said, is much more like Apache Spark).

Can a Spring Kafka consumer run on multiple machines for the same group?

Kafka says that offsets are managed by consumers and that there should be as many consumers as partitions for the same group.
Spring Integration says that the number of consumer streams in the high-level consumer is the number of partitions for the same group.
So, can the Spring Kafka consumer code run on multiple servers for the same group? If yes, how are the offsets kept from conflicting between servers?
According to the Kafka documentation (http://kafka.apache.org/documentation.html#introduction), when consumer groups are used, each message is consumed by exactly one consumer in the group. Each consumer can run on its own machine; two consumers can also run on the same machine, in which case each consumer can be a separate process.
One group can contain multiple consumers. Partitions are distributed among all the consumers in one group by an assignment algorithm. The number of consumers can be larger or smaller than the number of partitions.
Offsets can be managed with the help of ZooKeeper, but not all clients implement every feature yet.
As for your use case: Kafka is essentially an at-least-once delivery system. It can behave as at-most-once by disabling retries on the producer, or by committing offsets before processing a batch of messages. Implementing an exactly-once delivery system is very difficult and requires cooperation from the application, but since Kafka exposes offsets, it may be possible. For more details, see http://kafka.apache.org/documentation.html#semantics, http://ben.kirw.in/2014/11/28/kafka-patterns/, https://dzone.com/articles/kafka-clients-at-most-once-at-least-once-exactly-o, and so on.
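A rough sketch of the two commit orderings mentioned above, using the plain Kafka consumer API; consumer is assumed to be already subscribed, and process is a hypothetical placeholder for the application's work:

import java.time.Duration
import org.apache.kafka.clients.consumer.KafkaConsumer

// Placeholder for whatever the application does with a record.
def process(value: String): Unit = println(value)

// At-most-once: commit before processing. A crash mid-batch loses the
// uncommitted work, but nothing is ever processed twice.
def atMostOnce(consumer: KafkaConsumer[String, String]): Unit = {
  val records = consumer.poll(Duration.ofMillis(500))
  consumer.commitSync()
  records.forEach(r => process(r.value()))
}

// At-least-once: process first, commit afterwards. A crash mid-batch replays
// the whole batch on restart, so process must tolerate duplicates.
def atLeastOnce(consumer: KafkaConsumer[String, String]): Unit = {
  val records = consumer.poll(Duration.ofMillis(500))
  records.forEach(r => process(r.value()))
  consumer.commitSync()
}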
Based on my personal experience, I spent a lot of time trying to make my Kafka system an exactly-once delivery system, but when the server goes down, some messages can still be consumed twice. My testing was done on a standalone Kafka server, whereas a Kafka cluster is normally used in production, so I think it can be treated as coming close to an exactly-once system.

Data locality in Spark Streaming

Recently I've been doing performance tests on Spark Streaming. I ran a receiver on one of the 6 slaves and submitted a simple word-count application to the cluster (I know this configuration is not proper in practice, it's just a simple test). I analyzed the scheduling log and found that nearly 88% of tasks were scheduled to the node where the receiver ran, the locality was always PROCESS_LOCAL, and the CPU utilization on that node was very high. Why doesn't Spark Streaming distribute the data across the cluster and make full use of it? I've read the official guide and it does not explain this in detail, especially for Spark Streaming. Will it copy stream data to another node with a free CPU and start a new task there when a task is on a node with a busy CPU? If so, how can we explain the former case?
When you run the stream receiver on just one of the 6 nodes, all the received data is processed on that node (that is data locality).
Data is not distributed across other nodes by default. If you need the input stream to be repartitioned (balanced across the cluster) before further processing, you can use
inputStream.repartition(<number of partitions>)
This distributes the received batches of data across the specified number of machines in the cluster before further processing.
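For example, a minimal sketch (ssc, the stream source and the partition count are assumptions, not from the question):

// Rebalance a single-receiver stream across the cluster before the word-count steps run.
val lines = ssc.socketTextStream("receiver-host", 9999)   // any single-receiver input works here
val balanced = lines.repartition(12)                      // e.g. ~2 partitions per node on 6 slaves
val counts = balanced.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()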
You can read more about the level of parallelism in the Spark documentation:
https://spark.apache.org/docs/latest/streaming-programming-guide.html#performance-tuning

Start multiple processor threads on a Spark worker within one core

Our situation: we are using Spark Streaming with AWS Kinesis.
If we set the Spark master to local mode, "local[32]", then Spark can consume data from Kinesis fairly quickly.
But if we switch to a cluster with 1 master and 3 workers (on 4 separate machines) and set the master to "spark://[IP]:[port]", then the Spark cluster consumes data at a very slow rate. This cluster has 3 worker machines, each with 1 core.
I'm trying to speed up consumption, so I added more executors on each worker machine, but it does not help much, since each executor needs at least 1 core (and my worker machines have only 1 core each). I also read that adding more Kinesis shards helps scale up, but I just want to maximize my read capacity.
Since local mode can consume fast enough, is it possible to start multiple "Kinesis record processor threads" on each worker machine, as shown in the picture below? Or to start many threads to consume from Kinesis within 1 core?
Thank you very much.
(Picture from https://spark.apache.org/docs/1.2.0/streaming-kinesis-integration.html)
It turns out to be related to the resources of the cluster.
For AWS Kinesis, one Kinesis stream requires one receiver in the Spark cluster, and each receiver occupies one core on the Spark workers.
I increased each worker to 4 cores, and then the executors had spare cores to run jobs.
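For reference, a rough sketch of the receiver/core arithmetic against the Spark 1.2 Kinesis API linked above (stream name, endpoint and shard count are made up). Each receiver pins one worker core, so the cluster needs more total cores than receivers to leave room for processing tasks, which is why going from 1 to 4 cores per worker helped:

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kinesis.KinesisUtils
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream

// Sketch only: one receiver per Kinesis shard, unioned into a single stream.
// With numShards receivers, the cluster must have more than numShards cores,
// otherwise no cores are left for the processing tasks themselves.
val ssc = new StreamingContext(sparkConf, Seconds(2))   // sparkConf assumed to exist
val numShards = 3                                       // match your Kinesis stream's shard count
val streams = (1 to numShards).map { _ =>
  KinesisUtils.createStream(ssc, "my-stream",
    "https://kinesis.us-east-1.amazonaws.com", Seconds(2),
    InitialPositionInStream.LATEST, StorageLevel.MEMORY_AND_DISK_2)
}
val unified = ssc.union(streams)                        // process the combined stream downstream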
