How does Kafka achieve parallelism when multiple consumers read the same topic and the same partition? - multithreading

I have read from multiple sources on Stack Overflow indicating that using multiple consumer groups will let me read the same topic and the same partition from multiple consumers concurrently.
For example,
Can multiple Kafka consumers read from the same partition of same topic by default?
How Kafka broadcast to many Consumer Groups
Parallel Producing and Consuming in Kafka
So this is a follow-up question to my previous question, but in a slightly different context. We can only read from and write to the partition leader, and Kafka logs are stored on hard disk; each partition corresponds to a log.
Now if I have 100 consumer groups reading from the same topic and the same partition, they are all effectively reading from the same machine, because reads go to the partition leader and never to the replicas. How does Kafka scale this kind of read load?
How does it achieve parallelism? Is it just spawning many threads and processes on the same machine to handle all the consumption concurrently? How can this approach scale horizontally?
Thank you

If you have 100 consumers all reading from the same partition, then the data for that partition will be cached in the Linux OS page cache (memory), so 99, or perhaps even all 100, of the consumers will be fetching data from RAM instead of from a spinning hard disk. This is a distinctive feature of Kafka: although it is written in a JVM language, it is designed to lean on off-heap memory (the OS page cache) rather than a JVM-level cache, which is what keeps many parallel consumers of the same data cheap.
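As a rough sketch of the consumer side of that scenario (Scala with the standard Kafka Java client; the broker address localhost:9092, the topic my-topic and the group names are placeholders), each consumer sits in its own consumer group and is assigned the same partition:

    import java.time.Duration
    import java.util.{Collections, Properties}
    import scala.jdk.CollectionConverters._

    import org.apache.kafka.clients.consumer.KafkaConsumer
    import org.apache.kafka.common.TopicPartition

    object IndependentGroupsSketch extends App {

      // Build a consumer that belongs to its own consumer group.
      def newConsumer(groupId: String): KafkaConsumer[String, String] = {
        val props = new Properties()
        props.put("bootstrap.servers", "localhost:9092")   // placeholder broker
        props.put("group.id", groupId)                     // each group sees the full log
        props.put("key.deserializer",
          "org.apache.kafka.common.serialization.StringDeserializer")
        props.put("value.deserializer",
          "org.apache.kafka.common.serialization.StringDeserializer")
        new KafkaConsumer[String, String](props)
      }

      // 100 consumers, each in its own group, all assigned the same partition 0
      // of the same (placeholder) topic.
      val partition = new TopicPartition("my-topic", 0)
      val consumers = (1 to 100).map { i =>
        val c = newConsumer(s"group-$i")
        c.assign(Collections.singletonList(partition))
        c
      }

      // Every poll() goes to the partition leader; because all 100 consumers read
      // the same log segments, repeated fetches are served from the OS page cache.
      consumers.foreach { c =>
        c.poll(Duration.ofMillis(100)).asScala.foreach { r =>
          println(s"offset=${r.offset} value=${r.value}")
        }
      }

      consumers.foreach(_.close())
    }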

Related

Spark and Kafka: how to increase parallelism for a producer sending a large batch of records, to improve network usage?

I am trying to understand how I can send (produce) a large batch of records to a Kafka topic from Spark.
From the docs I can see that Spark attempts to reuse the same producer across tasks on the same worker. When sending a lot of records at once, the network will be a bottleneck (as well as memory, since Kafka buffers records to be sent). So I am wondering which configuration makes the best use of the network:
1. Fewer workers with more cores each (which I suppose means more threads per worker)
2. More workers with fewer cores per worker (which I suppose gives better network IO, since it is spread across more machines)
Let's say the options I have for 1 and 2 are as follows (from Databricks):
4 workers with 16 cores per worker = 64 cores
10 workers with 4 cores per worker = 40 cores
To better utilize network IO, which is the better choice?
My thought on this for now, but I am not sure, so I am asking you here:
Although from a CPU point of view (jobs with expensive computations) option 1 would be better (more concurrency and less shuffle), from a network IO point of view I would rather use option 2, even if I end up with fewer cores overall.
Appreciate any input on this.
Thank you all.
The best solution is to have more workers to achieve parallelism (scale horizontally). The DataFrame has to be written to Kafka using Structured Streaming with Kafka as the sink, as explained here: https://docs.databricks.com/spark/latest/structured-streaming/kafka.html (if you don't want a persistent stream, you can always use the trigger-once option; see the sketch after this answer).
Additionally, you can assume that 1 DataFrame partition uses 1 CPU core, so you can also optimize along that axis (though Databricks usually handles this automatically for streaming).
On the Kafka side, I guess it could be good to have a number of partitions/brokers similar to the number of Spark/Databricks workers.
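A minimal sketch of such a write, assuming Structured Streaming on Databricks; the source format and path, broker list, topic name, checkpoint location and key/value columns are all placeholders:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, struct, to_json}
    import org.apache.spark.sql.streaming.Trigger

    val spark = SparkSession.builder().appName("kafka-sink-sketch").getOrCreate()

    // Placeholder streaming source; substitute whatever source your DataFrame comes from.
    val df = spark.readStream
      .format("delta")
      .load("/mnt/source-table")

    df.select(
        col("id").cast("string").as("key"),                 // Kafka sink expects key/value columns
        to_json(struct(df.columns.map(col): _*)).as("value"))
      .writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
      .option("topic", "my-topic")
      .option("checkpointLocation", "/mnt/checkpoints/kafka-sink")
      .trigger(Trigger.Once())       // one-shot run instead of a persistent stream
      .start()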

Does the number of Kafka partitions increase the speed of Spark writing to Kafka?

When reading, Spark has a 1:1 mapping to Kafka partitions, so with more partitions we can get more parallelism in our job.
But does the same apply when Spark is writing to Kafka? Is writing a dataset to a topic with 4 partitions faster than writing it to a topic with 1 partition?
Yes.
If your topic has 1 partition, it lives on a single broker, so if you increase the produce rate for the topic, that one broker becomes busy. But if you have multiple partitions, your Kafka cluster spreads those partitions across different brokers and the produce rate is shared among them. So yes, writing the same dataset to a topic with 4 partitions is faster than writing it to a topic with 1 partition.
It is not only about the produce rate. Kafka brokers also run several background processes such as compaction, compression and segment management, so that workload grows with the number of messages; with multiple partitions on multiple brokers, it gets distributed as well.
However, you don’t necessarily want to use more partitions than needed because increasing partition count simultaneously increases the number of open server files and leads to increased replication latency.
From the Kafka documentation:
Distribution
The partitions of the log are distributed over the servers in the Kafka cluster with each server handling data and requests for a share of the partitions. Each partition is replicated across a configurable number of servers for fault tolerance.
Each partition has one server which acts as the "leader" and zero or more servers which act as "followers". The leader handles all read and write requests for the partition while the followers passively replicate the leader. If the leader fails, one of the followers will automatically become the new leader. Each server acts as a leader for some of its partitions and a follower for others so load is well balanced within the cluster.
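If you are creating the topic yourself, a minimal AdminClient sketch along those lines (the broker address, topic name, partition count and replication factor are all placeholders) could look like this:

    import java.util.{Collections, Properties}
    import org.apache.kafka.clients.admin.{AdminClient, NewTopic}

    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092")   // placeholder broker

    val admin = AdminClient.create(props)
    try {
      // 4 partitions, replication factor 3: the partition leaders are spread across
      // brokers, so produce load is shared instead of hitting a single broker.
      val topic = new NewTopic("my-topic", 4, 3.toShort)
      admin.createTopics(Collections.singletonList(topic)).all().get()
    } finally {
      admin.close()
    }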

Can Spark/EMR read data from S3 multi-threaded?

Due to an unfortunate sequence of events, we've ended up with a very fragmented dataset stored on S3. The table metadata is stored in Glue, the data is written with bucketBy, and it is stored in Parquet format. Discovery of the files is therefore not an issue, and the number of Spark partitions is equal to the number of buckets, which provides a good level of parallelism.
When we load this dataset on Spark/EMR, each Spark partition ends up loading around ~8k files from S3.
As the data is stored in a columnar format and our use case only needs a couple of fields, we don't really read all the data, just a very small portion of what is stored.
Based on CPU utilization on the worker nodes, I can see that each task (one per partition) is using only around 20% of its CPU, which I suspect is due to a single thread per task reading files from S3 sequentially, hence a lot of IO wait.
Is there a way to encourage Spark tasks on EMR to read data from S3 with multiple threads, so that we can read several files at the same time from S3 within a task? That way we could use the 80% idle CPU to make things a bit faster.
There are two parts to reading S3 data with Spark dataframes:
Discovery (listing the objects on S3)
Reading the S3 objects, including decompressing, etc.
Discovery typically happens on the driver. Some managed Spark environments have optimizations that use cluster resources for faster discovery. This is not typically a problem unless you get beyond 100K objects. Discovery is slower if you have .option("mergeSchema", true), as each file has to be touched to discover its schema.
Reading S3 files is part of executing an action. The parallelism of reading is min(number of partitions, number of available cores). More partitions and more available cores mean faster I/O... in theory. In practice, S3 can be quite slow if you haven't accessed these files regularly enough for S3 to scale up their availability, so additional Spark parallelism has diminishing returns. Watch the total network read/write bandwidth per active core and tune your execution for the highest value.
You can discover the number of partitions with df.rdd.partitions.length.
There are additional things you can do if the S3 I/O throughput is low:
Make sure the data on S3 is dispersed when it comes to its prefix (see https://docs.aws.amazon.com/AmazonS3/latest/dev/optimizing-performance.html).
Open an AWS support request and ask for the prefixes holding your data to be scaled up.
Experiment with different node types. We have found storage-optimized nodes to have better effective I/O.
Hope this helps.
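For reference, a small sketch of the kind of read being discussed, with a placeholder bucket/prefix and column names; it skips schema merging, prunes columns to the couple of fields needed, and prints the partition count mentioned above:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("s3-read-sketch").getOrCreate()

    val df = spark.read
      .option("mergeSchema", "false")     // avoid touching every file just to merge schemas
      .parquet("s3://my-bucket/fragmented-table/")
      .select("field_a", "field_b")       // column pruning: only the fields we need

    // Read parallelism is roughly min(#partitions, #available cores).
    println(s"partitions = ${df.rdd.partitions.length}, " +
      s"default parallelism = ${spark.sparkContext.defaultParallelism}")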

Increase number of partitions in DStream to be greater than Kafka partitions in the direct approach

There are 32 Kafka partitions and 32 consumers, as per the direct approach.
But the data processing by the 32 consumers is slower than the Kafka rate (1.5x), which creates a backlog of data in Kafka.
I want to increase the number of partitions in the DStream received by each consumer.
I would like the solution to increase partitions on the consumer side rather than increasing the partitions in Kafka.
In the direct stream approach, you can have at most #consumers = #partitions; Kafka does not allow more than one consumer per partition per group.id. By the way, you are asking for more partitions per consumer? It will not help, since your consumers are already running at full capacity and are still insufficient.
A few technical changes you can try to reduce the data backlog in Kafka:
Increase number of partitions - although you do not want to do this, still this is the easiest approach. Sometimes platform just needs more hardware.
Optimize processing at consumer side - check possibility of record de-duplication before processing, reduce disk I/O, loop unrolling techniques etc to reduce time taken by consumers.
(higher difficulty) Controlled data distribution - it is often found that some partitions process better than others. It may be worth checking whether this is the case in your platform. Kafka's data distribution policy (together with the message key) often causes uneven load inside the cluster: https://www.cloudera.com/documentation/kafka/latest/topics/kafka_performance.html
Assuming you have enough hardware resources allocated to the consumers, you can check the parameter below:
spark.streaming.kafka.maxRatePerPartition
It sets the maximum number of records consumed from a single Kafka partition per second.
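A minimal direct-stream sketch showing both that rate cap and an optional DStream repartition on the consumer side; the broker list, group id, topic name, batch interval and numbers are placeholders, and the repartition step only pays off if per-record processing is heavy:

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.KafkaUtils

    // Cap how fast each Kafka partition is read: at most 1000 records/sec/partition.
    val conf = new SparkConf()
      .setAppName("direct-stream-sketch")
      .set("spark.streaming.kafka.maxRatePerPartition", "1000")

    val ssc = new StreamingContext(conf, Seconds(5))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker1:9092",          // placeholder broker list
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "my-group",                       // placeholder group id
      "auto.offset.reset" -> "latest")

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("my-topic"), kafkaParams))

    // Optionally spread records over more tasks than there are Kafka partitions.
    // This adds a shuffle, so it only helps when processing dominates the cost.
    stream.map(_.value)
      .repartition(64)
      .foreachRDD { rdd =>
        rdd.foreach(value => println(value))          // stand-in for real processing
      }

    ssc.start()
    ssc.awaitTermination()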

Kafka partitioning for Spark Streaming

I am using Kafka with Spark Streaming (2.2.0). The load on the system is dynamic, and I am trying to understand how to handle auto scaling.
There are two aspects of auto scaling:
Auto scale the computing infra
Auto scale the application components to take advantage of the auto scaled infra
Infra auto scaling: there can be various well-defined trigger points for scaling infra. One possible trigger in my case is the latency or delay in processing messages arriving at Kafka. So I can monitor the Kafka cluster, and if message processing is delayed by more than a certain factor, I know that more computing power needs to be thrown in.
Application auto scaling: in the above scenario, let's say that I add one more node to the Spark cluster once I see that messages are being held up in Kafka for too long. A new worker starts and registers with the master, so the Spark cluster has more horsepower available. There are two ways of making use of this additional horsepower. One strategy could be to repartition the Kafka topic by adding more partitions; once I do that, the Spark cluster will pull more messages in parallel during the next batch, so the processing speed may go up. The other strategy could be not to repartition the Kafka topic, but to add more cores to the existing executors so that the message processing time goes down and more messages can be processed from an individual partition in the same time.
I am not sure if the above strategy is correct or there are other ways of handling such scenarios?
add more cores to existing executors so that the message processing time goes down and thus more messages may be processed from an individual partition in same time.
Spark doesn't work like that. Each partition is normally processed by a single thread. Adding more cores might give you a performance boost only if there are tasks queued up waiting for executors.
Might, because CPU is not the only resource that matters; adding more cores won't help if the bottleneck is, for example, the network.
One strategy could be to repartition the kafka topic by adding more partitions. Once I do that Spark cluster will pull more messages in parallel during next batch and thus the processing speed may go up.
This will help if the Spark cluster has enough resources to process the additional partitions. Otherwise they will just wait for their share of resources.
Also, adding partitions alone might not be a solution, if you don't scale Kafka cluster at the same time.
Finally your comment:
Now in the code I could be doing a repartition of this RDD to speed up processing.
Unless processing is heavy, repartitioning will cost more than just processing the data.
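As a rough illustration of that trade-off (a sketch with made-up data and a deliberately cheap placeholder function), the repartition variant pays for a full shuffle before doing the same work:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("repartition-cost-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical, deliberately cheap per-record work.
    val lightProcessing = (s: String) => s.trim.toLowerCase

    val rdd = sc.parallelize(Seq("a ", " B", "c ", " D"), numSlices = 4)

    // Without repartition: each task maps its partition's records in place.
    val direct = rdd.map(lightProcessing).count()

    // With repartition: every record is serialized, shuffled over the network and
    // deserialized before the same cheap map runs, which usually costs more than it saves.
    val shuffled = rdd.repartition(200).map(lightProcessing).count()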
So what is the answer?
Scaling only one component can achieve constant throughput if resources are unbalanced.
If resources are balanced, you might have to scale all interacting components.
Before you do, make sure that you have correctly identified the bottleneck.
Even if you scale up your infrastructure, the number of parallel consumers is on the order of the number of partitions in your topic. So the right way is to increase the number of partitions as and when required; if you also feel the need to scale up your infra, you can do so as well.
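As a sketch of increasing an existing topic's partition count with the Kafka AdminClient (broker address, topic name and target count are placeholders; note that for keyed topics this changes which partition a given key maps to):

    import java.util.{Collections, Properties}
    import org.apache.kafka.clients.admin.{AdminClient, NewPartitions}

    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092")   // placeholder broker

    val admin = AdminClient.create(props)
    try {
      // Grow the (placeholder) topic to 64 partitions so that up to 64 consumers
      // in one group (or 64 Spark tasks) can read it in parallel.
      admin.createPartitions(
        Collections.singletonMap("my-topic", NewPartitions.increaseTo(64))
      ).all().get()
    } finally {
      admin.close()
    }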
