How are Kafka partitions shared in Spark Streaming with Kafka? - apache-spark

I am wondering how the Kafka partitions are shared among the SimpleConsumer instances running inside the executor processes. I know how the high-level Kafka consumers share the partitions across different consumers in a consumer group, but how does that happen when Spark is using the SimpleConsumer? There will be multiple executors for the streaming jobs across machines.

All Spark executors should also be part of the same consumer group. Spark uses roughly the same Java API for Kafka consumers; it's just the scheduling that distributes the work across multiple machines.
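As a purely illustrative sketch (not from the original answer), this is how the 0.10 direct-stream integration wires this up; the broker address, topic name, and group id below are placeholders. Every executor receives the same kafkaParams, including group.id, and the scheduler hands each Kafka partition to a task:

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

    val ssc = new StreamingContext(new SparkConf().setAppName("kafka-direct-sketch"), Seconds(5))

    // The same consumer settings (including group.id) are used on every executor.
    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "broker1:9092",              // placeholder broker
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "my-streaming-app",          // placeholder group id
      "auto.offset.reset"  -> "latest"
    )

    // One Spark partition (and therefore one task) per Kafka partition of "my-topic";
    // PreferConsistent spreads those partitions evenly across the available executors.
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Array("my-topic"), kafkaParams))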

Related

How can I start multiple consumers for a Kafka topic with multiple partitions?

I have recently started using Spark and have to deal with a case where I need to consume multiple partitions of a Kafka topic in Spark. How can I start multiple consumers? Do I need to run multiple instances of the same application with the same group id, or is there any configuration I can set when starting the application so that it handles this internally?
Passing --num-executors and using more than one core per executor will create more than one consumer thread in Spark.
Each consumer thread is mapped to a single partition.
Make the total number of threads equal to the total number of partitions to maximize distributed throughput.
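A hedged sketch of the corresponding sizing, with made-up numbers for illustration: for a topic with 8 partitions, 4 executors with 2 cores each give 8 parallel consumer tasks, one per partition.

    import org.apache.spark.SparkConf

    // Equivalent to: spark-submit --num-executors 4 --executor-cores 2 ...
    // 4 executors x 2 cores = 8 parallel tasks, matching an 8-partition topic.
    val conf = new SparkConf()
      .setAppName("kafka-parallel-consumers")
      .set("spark.executor.instances", "4")
      .set("spark.executor.cores", "2")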

Spark 2.2 Structured Streaming reading from a Kafka topic: only one executor reading from partitions

I have connected a Kafka topic with 4 partitions to Spark Streaming using the new API: Structured Streaming. Everything goes well except that, of my two worker nodes, only one (so one executor with 2 cores) is used. The jobs have 4 tasks, but only one worker node is being used.
I tried to increase the number of executors in spark-submit, but that didn't change anything.
Any ideas on how to engage the other nodes?
Thanks,
Mikel

How many consumers are created to read records per direct stream?

I'm using Spark Streaming to read data from Kafka (using the Kafka direct stream API).
How many Kafka consumers are instantiated for a stream? Is the number of Kafka consumers equal to the number of executors? Does each executor instantiate one Kafka consumer (with the same group id)?
With the direct approach, the number of consumers will be exactly the same as the number of Kafka partitions:
The Spark Streaming integration for Kafka 0.10 is similar in design to the 0.8 Direct Stream approach. It provides simple parallelism, 1:1 correspondence between Kafka partitions and Spark partitions, and access to offsets and metadata
and a separate consumer is initialized for each partition.
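One way to observe this 1:1 mapping is to inspect the offset ranges of each micro-batch. This is only a sketch: it assumes a stream built as in the direct-stream example earlier on this page (same placeholder topic and broker names) and uses the HasOffsetRanges interface exposed by the direct stream.

    import org.apache.spark.streaming.kafka010.HasOffsetRanges

    stream.foreachRDD { rdd =>
      // One offset range (and one RDD partition) per Kafka partition in the batch.
      val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      ranges.foreach { r =>
        println(s"${r.topic}-${r.partition}: offsets ${r.fromOffset} to ${r.untilOffset}")
      }
      println(s"RDD partitions in this batch: ${rdd.getNumPartitions}")
    }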

Spark Streaming and Kafka: one cluster or several standalone boxes?

I am about to make a decision about using the Spark Streaming Kafka integration.
I have a Kafka topic (which I can break into several topics) queuing several tens of thousands of messages per minute; my Spark Streaming application ingests the messages, applies transformations, and then updates a UI.
Knowing that all failures are handled and data are replicated in Kafka, what is the best option for implementing the Spark Streaming application in order to achieve the best possible performance and robustness:
1. One Kafka topic and one Spark cluster.
2. Several Kafka topics and several standalone Spark boxes (one machine with a standalone Spark cluster for each topic).
3. Several Kafka topics and one Spark cluster.
I am tempted to go for the second option, but I couldn't find people talking about such a solution.
An important element to consider in this case is the partitioning of the topic.
The parallelism level of your Kafka-Spark integration will be determined by the number of partitions of the topic. The direct Kafka model simplifies the consumption model by establishing a 1:1 mapping between the number of partitions of the topic and RDD partitions for the corresponding Spark job.
So, the recommended setup would be: one Kafka topic with n partitions (where n is tuned for your usecase) and a Spark cluster with enough resources to process the data from those partitions in parallel.
Option #2 feels like trying to re-implement what Spark gives you out of the box: Spark gives you resilient distributed computing. Option #2 is trying to parallelize the payload over several machines and deal with failure by having independent executors. You get that with a single Spark cluster, with the benefit of improved resource usage and a single deployment.
Option #1 is straightforward, simple, and probably more efficient. If your requirements are met, that's the one to go for (and honor the KISS principle).
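To illustrate the recommended setup, the topic can be created with n partitions up front; this is only a sketch using Kafka's AdminClient, and the topic name, partition count, and replication factor are placeholder choices.

    import java.util.{Collections, Properties}
    import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, NewTopic}

    val props = new Properties()
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092")  // placeholder broker
    val admin = AdminClient.create(props)

    // n is tuned for your use case; it caps the parallelism of the Spark job reading the topic.
    val n = 12
    admin.createTopics(Collections.singletonList(new NewTopic("events", n, 3.toShort))).all().get()
    admin.close()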

Data receiving in Spark Streaming

Recently I've been doing performance tests on Spark Streaming, but a few things have puzzled me.
In Spark Streaming, receivers are scheduled to run in executors on worker nodes.
How many receivers are there in a cluster? Can I control the number of receivers?
If not all workers run receivers to receive stream data, will the other worker nodes not receive any data? In that case, how can I guarantee task scheduling based on data locality? By copying data from the nodes that run the receivers?
There is only one receiver per DStream, but you can create more than one DStream and union them together to act as one. This is why it is suggested to run Spark Streaming on a cluster with at least N(receivers) + 1 cores. Once the data is past the receiving portion, it is mostly a simple Spark application and follows the same rules as a batch job. (This is why streaming is referred to as micro-batching.)
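A short sketch of the multiple-receiver pattern described above, using the receiver-based 0.8 integration; the ZooKeeper address, group id, and topic name are placeholders. Each createStream call starts its own receiver, and union merges them into a single DStream for downstream processing.

    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val ssc = new StreamingContext(new SparkConf().setAppName("multi-receiver-sketch"), Seconds(5))

    // Three receivers, each occupying one core; leave at least one more core for processing.
    val numReceivers = 3
    val kafkaStreams = (1 to numReceivers).map { _ =>
      KafkaUtils.createStream(ssc, "zk-host:2181", "my-group", Map("my-topic" -> 1),
        StorageLevel.MEMORY_AND_DISK_SER_2)
    }
    val unified = ssc.union(kafkaStreams)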
