I have some use cases I would like clarified, about Kafka topic partitioning and Spark Streaming resource utilization.
I use Spark standalone mode, so the only settings I have are "total number of executors" and "executor memory". As far as I know, and according to the documentation, the way to introduce parallelism into Spark Streaming is to use a partitioned Kafka topic: the RDD will have the same number of partitions as Kafka when I use the spark-kafka direct stream integration.
So if I have 1 partition in the topic and 1 executor core, that core will read sequentially from Kafka.
What happens if I have:
2 partitions in the topic and only 1 executor core? Will that core read first from one partition and then from the second one, so that there is no benefit in partitioning the topic?
2 partitions in the topic and 2 cores? Will one executor core then read from one partition, and the second core from the second partition?
1 Kafka partition and 2 executor cores?
Thank you.
The basic rule is that you can scale up to the number of Kafka partitions. If you set spark.executor.cores greater than the number of partitions, some of the threads will be idle. If it's less than the number of partitions, Spark will have threads read from one partition then the other. So:
2 partitions, 1 core: reads from one partition, then the other. (I am not sure how Spark decides how much to read from each before switching.)
2p, 2c: parallel execution
1p, 2c: one thread is idle
For case #1, note that having more partitions than executors is OK, since it allows you to scale out later without having to re-partition. The trick is to make sure that the number of partitions is evenly divisible by the number of executors. Spark has to process all the partitions before passing data on to the next step in the pipeline, so 'remainder' partitions can slow down processing. For example, with 5 partitions and 4 threads, processing takes the time of 2 partitions: 4 run at once, then one thread runs the 5th partition by itself.
Also note that you may see better processing throughput if you keep the number of partitions/RDDs the same throughout the pipeline by explicitly setting the number of data partitions in functions like reduceByKey().
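To make the 1:1 mapping concrete, here is a minimal sketch of the direct stream setup under discussion (this assumes the spark-streaming-kafka-0-10 integration; the broker address, topic name and group id are placeholders):

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val conf = new SparkConf().setAppName("kafka-direct-sketch")
val ssc  = new StreamingContext(conf, Seconds(10))

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "localhost:9092",            // placeholder broker
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "example-group"              // placeholder group id
)

// Each batch RDD has exactly one Spark partition per Kafka partition of "topic1".
val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Array("topic1"), kafkaParams))

// Keeping the partition count explicit downstream, as suggested above.
stream.map(record => (record.key, 1L))
  .reduceByKey(_ + _, numPartitions = 2)
  .print()

ssc.start()
ssc.awaitTermination()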
Related
Let's say there are 3 executors (ex1, ex2, ex3), and a receiver is running in one of them (ex1). What happens when data arrives at that source?
Let's say data arrives in a Kafka topic, say "topic1". The receiver running in ex1 will consume the data that arrived in the topic, right? Now where is that data stored?
Is it stored in that executor ex1 itself?
What if that data is too huge? Does Spark break it down and distribute it over to other executors?
Let's say each executor (ex1, ex2, ex3) has a capacity of 10 GB and data of 15 GB arrives (a hypothetical assumption). What happens to ex1? Will it fail, or will this be handled? If it is handled, how? Does Spark distribute the data over the cluster? If it does, how does foreachRDD fit into the picture when only one RDD is formed per batch? If the data is distributed by breaking it up, is there then more than one RDD in the cluster for that particular batch?
How many receivers run in a Spark job? Does it depend on the number of input sources? If Spark is reading from 4 different Kafka topics, does that mean 4 different receivers will run separately in different executors? What if there are only 2 executors and 4 Kafka topics/sources? In that case, will the 4 receivers run evenly across the two executors? What if the number of sources is odd? If there are two executors and 3 Kafka sources, will one of the executors run two receivers? What if one of the executors dies? How does it get recovered?
Is it stored in that executor ex1 itself?
Yes, all the data fetched by the Spark driver is pushed to the executors for further processing.
What if that data is too huge? Does Spark break it down and distribute it over to other executors?
The data is hash-partitioned once it is read by the Spark receiver, and is then distributed fairly among the executors. If you still see data skew, try adding a custom partitioner and repartitioning the data.
How many receivers run in a Spark job? Does it depend on the number of input sources? If Spark is reading from 4 different Kafka topics, does that mean 4 different receivers will run separately in different executors? What if there are only 2 executors and 4 Kafka topics/sources? In that case, will the 4 receivers run evenly across the two executors? What if the number of sources is odd? If there are two executors and 3 Kafka sources, will one of the executors run two receivers? What if one of the executors dies? How does it get recovered?
There is only one receiver (for one or multiple topics), and it does the Kafka offset management. It hands a range of offsets per topic over to the Spark executors, and the executors read the data directly from Kafka. If any one of the executors dies, all of its stages are re-executed from the last successfully saved stage.
The Spark load distribution is not based on the size of the data but on the count of events.
Guidelines say that if there are N partitions for a topic then Spark should have 2N executors to achieve optimum CPU resource utilization.
You should find more details at the link below:
https://blog.cloudera.com/reading-data-securely-from-apache-kafka-to-apache-spark/
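As a small sketch of that point (reusing the ssc and kafkaParams placeholders from the direct stream sketch earlier on this page): with the direct approach, a single stream can subscribe to all four topics at once, so there is no pool of per-topic receivers to spread across executors:

// One direct stream over several topics: offsets are tracked per topic-partition,
// and each batch RDD has one Spark partition per Kafka topic-partition.
val topics = Array("topic1", "topic2", "topic3", "topic4")   // placeholder topic names
val multiTopicStream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](topics, kafkaParams))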
I am interested in knowing the following nitty-gritty details of Spark parallelism and partitioning:
1. How many partitions can an executor hold in Spark?
2. How are the partitions distributed (by what mechanism) among the executors?
3. How do I set the size of a partition? I would like to know the relevant config parameter.
4. Does an executor store all its partitions in memory? If not, when spilling to disk, does it spill an entire partition to disk or only part of a partition?
5. When there are 2 cores per executor but there are 5 partitions in that executor, what happens then?
Not quite the right way to look at it. An Executor holds nothing, it just does work.
A Partition is processed by a Core that has been assigned to an Executor. An Executor typically has 1 core but can have more than 1 such Core.
An App has Actions that translate to 1 or more Jobs.
A Job has Stages (based on Shuffle Boundaries).
Stages have Tasks, the number of these depends on number of Partitions.
Parallel processing of the Partitions depends on number of Cores allocated to Executors.
Spark is scalable in terms of Cores, Memory and Disk. In relation to your questions, the latter two mean that if the Partitions cannot all fit into Memory on the Worker for your Job, then one or more Partitions will spill, each in its entirety, to Disk.
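A minimal sketch of that hierarchy, assuming a running SparkContext named sc and an illustrative input file:

val words = sc.textFile("data.txt")     // partitions come from the input splits
  .flatMap(_.split(" "))
  .map((_, 1))
val counts = words.reduceByKey(_ + _)   // shuffle boundary -> a second Stage
counts.count()                          // Action -> one Job with two Stages

// Each Stage launches one Task per Partition; at any moment at most
// (number of Executors * Cores per Executor) of those Tasks run in parallel.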
I have to write Spark Streaming (createDirectStream API) code. I will be receiving around 90K messages per second, so I thought of using 100 partitions for the Kafka topic to improve performance.
Could you please let me know how many executors I should use? Can I use 50 executors with 2 cores per executor?
Also, if the batch interval is 10 seconds and the number of partitions of the Kafka topic is 100, will I receive 100 RDDs, i.e. 1 RDD from each Kafka partition? Will there be only 1 RDD from each partition for the 10-second batch interval?
Thanks
There is no good answer, really, and it depends on how much executor memory + cores you have in your cluster.
The hard limit is that you cannot have more total executor processes than Kafka partitions, and you don't want to saturate your network or other I/O.
Therefore, first check whether you are capping the network and/or memory/disks with one executor, then run two and see if throughput doubles and network rates are cut in half on that one machine. Then scale out the cores and instances as needed.
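For what it's worth, here is a hedged sketch of how the 50-executors-with-2-cores idea from the question would be expressed in configuration (the property names are standard Spark settings; the memory value is purely an assumption to tune against your own workload):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("kafka-throughput-test")
  .set("spark.executor.instances", "50")  // 50 executors * 2 cores = 100 task slots,
  .set("spark.executor.cores", "2")       // one slot per Kafka partition
  .set("spark.executor.memory", "4g")     // assumed value, not a recommendation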
Dropbox recently wrote a blog on their performance testing.
Regarding RDDs, assuming you have a 1:1 mapping of executor instances to partitions, each executor would see 10 seconds' worth of data per interval for only one partition, and each executor would have its own RDD to process, so 100 total RDDs are processed per batch. IMO, the "amount of RDDs" isn't too important because you always get 1 RDD per interval.
Can I say the following?
1. Is the number of Spark tasks equal to the number of Spark partitions?
2. Is one run of the executor (a batch inside the executor) equal to one task?
3. Does every task produce only one partition?
4. (duplicate of 1.)
The degree of parallelism, or the number of tasks that can be run concurrently, is set by:
The number of Executor Instances (configuration)
The Number of Cores per Executor (configuration)
The Number of Partitions being used (coded)
Actual parallelism is the smaller of
executors * cores - which gives the amount of slots available to run tasks
partitions - each partition will translate to a task whenever a slot opens up.
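A tiny illustration of that rule (the numbers are made up, and Spark computes this internally; it is not something you call):

val executorInstances = 3     // spark.executor.instances
val coresPerExecutor  = 2     // spark.executor.cores
val numPartitions     = 10    // partitions of the RDD being processed

val slots       = executorInstances * coresPerExecutor  // 6 task slots
val parallelism = math.min(slots, numPartitions)        // 6 tasks run at any one time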
Tasks that run on the same executor will share the same JVM. This is used by the Broadcast feature as you only need one copy of the Broadcast data per Executor for all tasks to be able to access it through shared memory.
You can have multiple executors running, on the same machine, or on different machines. Executors are the true means of scalability.
Note that each Task takes up one Thread ¹, and is assumed to be assigned to one core ².
So -
Is the number of Spark tasks equal to the number of Spark partitions?
No (see previous).
Is one run of the executor (a batch inside the executor) equal to one task?
An Executor is started as an environment for the tasks to run. Multiple tasks will run concurrently within that Executor (multiple threads).
Does every task produce only one partition?
For a task, it is one Partition in, one partition out. However, a repartitioning or shuffle/sort can happen in between tasks.
Is the number of Spark tasks equal to the number of Spark partitions?
Same as (1.)
(¹) The assumption is that within your tasks you are not multithreading yourself (never do that, otherwise the core estimate will be off).
(²) Note that due to hyper-threading, you might have more than one virtual core per physical core, and thus you can have several threads per core. You might even be able to handle multiple threads (2 to 3) on a single core without hyper-threading.
Partitions are a feature of RDD and are only available at design time (before an action is called).
Tasks are part of TaskSet per Stage per ActiveJob in a Spark application.
Is the number of Spark tasks equal to the number of Spark partitions?
Yes.
Is one run of the executor (a batch inside the executor) equal to one task?
That recursively uses "executor" and does not make much sense to me.
Does every task produce only one partition?
Almost.
Every task produces the output of executing the code (that it was created for) on the data in a partition.
Is the number of Spark tasks equal to the number of Spark partitions?
Almost.
The number of Spark tasks in a single stage equals the number of RDD partitions.
1. Is the number of Spark tasks equal to the number of Spark partitions?
Yes.
Spark breaks up the data into chunks called partitions. A partition is a collection of rows that sits on one physical machine in the cluster. The default partition size is 128 MB. Partitions allow every executor to perform work in parallel. A single partition will have a parallelism of only one, even if you have many executors.
Likewise, many partitions with only one executor will give you a parallelism of only one. You need to balance the number of executors and partitions to get the desired parallelism. Each partition is processed by only one executor (1 executor for 1 partition for 1 task at a time).
A good rule is that the number of partitions should be larger than the number of executors on your cluster.
See also: Chambers, Bill; Zaharia, Matei. Spark: The Definitive Guide: Big Data Processing Made Simple (p. 27). O'Reilly Media. Kindle Edition.
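A minimal sketch of that rule, assuming a SparkContext named sc and, say, 3 executors with 2 cores each (6 task slots in total); the file name and numbers are illustrative:

val raw = sc.textFile("events.txt")
println(raw.getNumPartitions)   // whatever the input splits produced

// Keep at least as many partitions as task slots; here 2x the 6 assumed slots,
// so no core has to sit idle waiting for a single oversized partition.
val balanced = raw.repartition(12)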
2. Is one run of the executor (a batch inside the executor) equal to one task?
Cores are slots for tasks, and each executor can process more than one partition at a time if it has more than one core.
3. Does every task produce only one partition?
It depends on the transformation.
Spark has Wide transformations and Narrow Transformation.
Wide transformation: input partitions contribute to many output partitions (shuffles -> aggregation, sort, joins). This is often referred to as a shuffle, whereby Spark exchanges partitions across the cluster. When we perform a shuffle, Spark writes the results to disk.
Narrow transformation: each input partition contributes to only one output partition.
See also: Chambers, Bill; Zaharia, Matei. Spark: The Definitive Guide: Big Data Processing Made Simple. O'Reilly Media. Kindle Edition.
Note: reading a file is a narrow transformation because it does not require a shuffle, but when you read a file in a splittable format like Parquet, the file will be split into many partitions.
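A short sketch of the two kinds of transformation (assuming a SparkContext named sc; the data and partition count are illustrative):

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), numSlices = 4)

// Narrow: each input partition feeds exactly one output partition, no shuffle.
val doubled = pairs.mapValues(_ * 2)

// Wide: input partitions contribute to many output partitions, so a shuffle happens
// and Spark writes shuffle files to disk between the two stages.
val summed = pairs.reduceByKey(_ + _)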
I know that I can repartition an RDD to increase its partitions and use coalesce to decrease its partitions. I have two questions regarding this that I cannot completely understand after reading different resources.
Spark will use a sensible default (1 partition per block, which was 64 MB in early versions and is now 128 MB) when generating an RDD. But I have also read that it is recommended to use 2 or 3 times the number of cores running the jobs. So here comes the question:
How many partitions should I use for a given file? For example, suppose I have a 10 GB .parquet file, and 3 executors with 2 cores and 3 GB of memory each.
Should I repartition? How many partitions should I use? What is the better way to make that choice?
Are all data types (i.e. .txt, .parquet, etc.) repartitioned by default if no partitioning is provided?
Spark can run a single concurrent task for every partition of an RDD, up to the total number of cores in the cluster.
For example:
val rdd = sc.textFile("file.txt", 5)
The above line of code creates an RDD named rdd, backed by file.txt, with 5 partitions.
Suppose you have a cluster with 4 cores and assume that each partition takes 5 minutes to process. For the above RDD with 5 partitions, 4 partition processes will run in parallel, as there are 4 cores, and the 5th partition will be processed after 5 minutes, when one of the 4 cores is free.
The entire processing will complete in 10 minutes, and while the 5th partition is being processed, the remaining resources (3 cores) will stay idle.
The best way to decide on the number of partitions in an RDD is to make the number of partitions equal to the number of cores in the cluster, so that all the partitions are processed in parallel and the resources are utilized in an optimal way.
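The timing above generalizes roughly as follows (pure arithmetic, not a Spark API; the 5-minute figure is the assumption from the example):

val numPartitions  = 5
val totalCores     = 4
val minutesPerTask = 5

val waves        = math.ceil(numPartitions.toDouble / totalCores).toInt  // 2 waves of tasks
val totalMinutes = waves * minutesPerTask                                // 10 minutes in total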
Question: Are all data types (i.e. .txt, .parquet, etc.) repartitioned by default if no partitioning is provided?
There is a default number of partitions for every RDD.
To check it, you can use rdd.partitions.length right after the RDD is created.
To use the existing cluster resources in an optimal way and to speed things up, we have to consider repartitioning to ensure that all cores are utilized and all partitions have a sufficient number of uniformly distributed records.
For better understanding, also have a look at https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-rdd-partitions.html
Note: there is no fixed formula for this. The general convention most people follow is:
(number of executors * number of cores) * factor (where the factor is typically 2 or 3)
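A small sketch of that convention (all the numbers are assumptions; adjust them to your cluster):

val numExecutors     = 3
val coresPerExecutor = 2
val factor           = 3    // the "2 or 3 times" multiplier

val rdd = sc.textFile("data.txt")
println(rdd.partitions.length)   // the default number of partitions

val tuned = rdd.repartition(numExecutors * coresPerExecutor * factor)   // 18 partitions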