How many executors should I use for spark streaming - apache-spark

I have to write Spark Streaming (createDirectStream API) code. I will be receiving around 90K messages per second, so I thought of using 100 partitions for the Kafka topic to improve performance.
Could you please let me know how many executors I should use? Can I use 50 executors with 2 cores per executor?
Also, if the batch interval is 10 seconds and the Kafka topic has 100 partitions, will I receive 100 RDDs, i.e. 1 RDD from each Kafka partition? Will there be only 1 RDD from each partition for the 10-second batch interval?
Thanks

There is no good answer, really, and it depends on how much executor memory + cores you have in your cluster.
The hard limit is that you cannot have more total executor processes than Kafka partitions, and you don't want to saturate your network or other I/O.
Therefore, first find out whether you are capping the network and/or memory/disks with one executor, then run two and see if throughput doubles and the per-machine network rate is cut in half. Then scale out the cores and instances as needed.
Dropbox recently wrote a blog on their performance testing
Regarding RDDs, assuming a 1:1 mapping of executor cores to Kafka partitions, each core would see 10 seconds' worth of data for one partition per interval, with each core processing its own partition, so 100 partitions processed per batch. IMO, the "number of RDDs" isn't too important, because you always get 1 RDD per batch interval; the parallelism comes from that RDD's 100 partitions.
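For reference, here is a minimal sketch of what the direct-stream setup might look like with the spark-streaming-kafka-0-10 integration (the broker address, topic name, and group id are placeholders; only the 10-second batch interval and the 100-partition topic come from the question):
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

val conf = new SparkConf().setAppName("kafka-direct-stream")
val ssc = new StreamingContext(conf, Seconds(10)) // 10-second batch interval

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker1:9092", // placeholder broker address
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "my-streaming-app" // placeholder group id
)

// Each batch produces one RDD whose partitions mirror the 100 Kafka partitions.
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("my-topic"), kafkaParams)
)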

Related

Spark Job Internals

I tried looking through various posts but did not get an answer. Let's say my Spark job has 1000 input partitions but I only have 8 executor cores. The job has 2 stages. Can someone help me understand exactly how Spark processes this? If you can help answer the questions below, I'd really appreciate it.
As there are only 8 executor cores, will Spark process Stage 1 of my job 8 partitions at a time?
If the above is true, after the first set of 8 partitions is processed, where is this data stored while Spark is running the second set of 8 partitions?
If I don't have any wide transformations, will this cause a spill to disk?
For a Spark job, what is the optimal file size? I mean, is Spark better at processing 1 MB files with 1000 Spark partitions, or, say, 10 MB files with 100 Spark partitions?
Sorry if these questions are vague. This is not a real use case, but as I am learning about Spark I am trying to understand the internal details of how the different partitions get processed.
Thank You!
Spark will run all tasks for the first stage before starting the second. This does not mean that it will start 8 partitions, wait for them all to complete, and then start another 8. Instead, each time a core finishes a partition, it will start another partition from the first stage until all partitions from the first stage have been started; Spark then waits until all tasks in the first stage are complete before starting the second stage.
The data is stored in memory or, if not enough memory is available, spilled to the executor's local disk. Whether a spill happens depends on exactly how much memory is available and how much intermediate data is produced.
The optimal file size varies and is best measured, but here are some key factors to consider:
The total number of files limits total parallelism, so should be greater than the number of cores.
The amount of memory used processing a partition should be less than the amount available to the executor (~4 GB for AWS Glue).
There is overhead per file read, so you don't want too many small files.
I would be inclined towards 10MB files or larger if you only have 8 cores.
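As a rough illustration of the small-file trade-off (the path and numbers below are made up for the example), you can check how many partitions a read produces and coalesce down when the files are tiny:
// Hypothetical input: a directory of many ~1 MB text files.
val rdd = sc.textFile("s3://my-bucket/small-files/")
println(s"partitions from read: ${rdd.getNumPartitions}")

// With only 8 cores, collapsing to a small multiple of the core count
// keeps the per-file/per-partition overhead down without losing parallelism.
val compacted = rdd.coalesce(16)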

When should I repartition an RDD?

I know that I can repartition an RDD to increase its partitions and use coalesce to decrease its partitions. I have two questions regarding this that I cannot completely understand after reading different resources.
Spark will use a sensible default (1 partition per block, which is 64 MB in earlier versions and 128 MB now) when generating an RDD. But I also read that it is recommended to use 2 or 3 times the number of cores running the jobs. So here comes the question:
How many partitions should I use for a given file? For example, suppose I have a 10 GB .parquet file and 3 executors with 2 cores and 3 GB of memory each.
Should I repartition? How many partitions should I use? What is the better way to make that choice?
Are all data types (ie .txt, .parquet, etc..) repartitioned by default if no partitioning is provided?
Spark can run a single concurrent task for every partition of an RDD, up to the total number of cores in the cluster.
For example :
val rdd = sc.textFile("file.txt", 5)
The above line of code will create an RDD named rdd with 5 partitions.
Suppose you have a cluster with 4 cores, and assume that each partition takes 5 minutes to process. For the above RDD with 5 partitions, 4 partitions will be processed in parallel, since there are 4 cores, and the 5th partition will be processed after 5 minutes, when one of the 4 cores becomes free.
The entire processing will be completed in 10 minutes, and while the 5th partition is being processed, the remaining 3 cores will sit idle.
The best way to decide on the number of partitions in an RDD is to make the number of partitions equal to the number of cores in the cluster, so that all the partitions are processed in parallel and the resources are utilized optimally.
Question: Are all data types (i.e. .txt, .parquet, etc.) repartitioned by default if no partitioning is provided?
Every RDD has a default number of partitions. To check, you can use rdd.partitions.length right after the RDD is created.
To use the existing cluster resources optimally and to speed things up, we have to consider repartitioning, so that all cores are utilized and all partitions have enough records, uniformly distributed.
For better understanding, also have a look at https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-rdd-partitions.html
Note: there is no fixed formula for this. The general convention most people follow is
(number of executors * number of cores per executor) * replication factor (which may be 2 or 3)
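A quick way to check and adjust this in spark-shell, using the numbers from the question (10 GB Parquet input, 3 executors * 2 cores = 6 slots, times the 2-3x factor mentioned above; the path is a placeholder):
val df = spark.read.parquet("/data/input.parquet") // placeholder path
println(df.rdd.partitions.length)                  // default number of partitions

// 6 cores * 2 = 12 partitions, so every core stays busy for ~2 waves of tasks
val repartitioned = df.repartition(12)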

Low Spark Streaming CPU utilization

In my Spark Streaming job, the CPU is underutilized (only 5-10%).
It is fetching data from Kafka and sending it to DynamoDB or a third-party endpoint.
Is there any recommendation for the job that will better utilize the CPU resources, assuming the endpoint is not the bottleneck?
The level of parallelism when reading from Kafka depends on the number of partitions of the topic.
If the number of partitions in a topic is small, you will not be able to parallelize efficiently in a Spark Streaming cluster.
First, increase the number of partitions of the topic.
If you cannot increase the number of partitions of the Kafka topic, increase the number of partitions by repartitioning inside DStream.foreachRDD.
This will distribute the data across all the nodes and be more efficient.
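A minimal sketch of that repartition step (the partition count of 100 and the writeToEndpoint function are placeholders, not part of any real API):
stream.foreachRDD { rdd =>
  // Spread the records across more tasks than the topic has partitions,
  // then do the per-record work (the DynamoDB / third-party writes).
  rdd.repartition(100).foreachPartition { records =>
    records.foreach(record => writeToEndpoint(record)) // hypothetical sink
  }
}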

Spark performance tuning - number of executors vs number for cores

I have two questions around performance tuning in Spark:
I understand one of the key things for controlling parallelism in the spark job is the number of partitions that exist in the RDD that is being processed, and then controlling the executors and cores processing these partitions. Can I assume this to be true:
# of executors * # of executor cores should be <= # of partitions, i.e. one partition is always processed by one core of one executor, so there is no point in having more executors * cores than the number of partitions.
I understand that having a high number of cores per executor can have a negative impact on things like HDFS writes, but here's my second question: purely from a data processing point of view, what is the difference between the two? For example, if I have a 10-node cluster, what would be the difference between these two jobs (assuming there's ample memory per node to process everything):
5 executors * 2 executor cores
2 executors * 5 executor cores
Assuming there's infinite memory and CPU, from a performance point of view should we expect the above two to perform the same?
Most of the time, using larger executors (more memory, more cores) is better. First, a larger executor with plenty of memory can easily support broadcast joins and do away with the shuffle. Second, since tasks are not created equal, statistically larger executors have a better chance of surviving OOM issues.
The only problem with large executors is GC pauses. G1GC helps.
In my experience, if I had a cluster with 10 nodes, I would go for 20 Spark executors. The details of the job matter a lot, so some testing will help determine the optimal configuration.
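For concreteness, the two layouts from the question map to settings like these (shown via SparkConf; on YARN the equivalent spark-submit flags are --num-executors and --executor-cores):
import org.apache.spark.SparkConf

// Layout A: 5 executors * 2 cores each
val confA = new SparkConf()
  .set("spark.executor.instances", "5")
  .set("spark.executor.cores", "2")

// Layout B: 2 executors * 5 cores each
val confB = new SparkConf()
  .set("spark.executor.instances", "2")
  .set("spark.executor.cores", "5")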

Kafka topic partitions to Spark streaming

I have some use cases that I would like clarified, about Kafka topic partitioning -> Spark Streaming resource utilization.
I use Spark standalone mode, so the only settings I have are "total number of executors" and "executor memory". As far as I know, and according to the documentation, the way to introduce parallelism into Spark Streaming is to use a partitioned Kafka topic -> the RDD will have the same number of partitions as the Kafka topic when I use the Spark-Kafka direct stream integration.
So if I have 1 partition in the topic, and 1 executor core, that core will sequentially read from Kafka.
What happens if I have:
2 partitions in the topic and only 1 executor core? Will that core read first from one partition and then from the second one, so there will be no benefit in partitioning the topic?
2 partitions in the topic and 2 cores? Will then 1 executor core read from 1 partition, and second core from the second partition?
1 kafka partition and 2 executor cores?
Thank you.
The basic rule is that you can scale up to the number of Kafka partitions. If you set spark.executor.cores greater than the number of partitions, some of the threads will be idle. If it's less than the number of partitions, Spark will have threads read from one partition then the other. So:
2 partitions, 1 core: reads from one partition and then the other. (I am not sure how Spark decides how much to read from each before switching.)
2p, 2c: parallel execution
1p, 2c: one thread is idle
For case #1, note that having more partitions than executors is OK, since it allows you to scale out later without having to repartition. The trick is to make sure that the number of partitions is evenly divisible by the number of executors. Spark has to process all the partitions before passing data on to the next step in the pipeline, so 'remainder' partitions can slow down processing. For example, 5 partitions and 4 threads => processing takes the time of 2 partitions: 4 are processed at once, then one thread runs the 5th partition by itself.
Also note that you may see better processing throughput if you keep the number of partitions/RDDs the same throughout the pipeline, by explicitly setting the number of data partitions in functions like reduceByKey().
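For that last point, most shuffle transformations accept an explicit partition count; a small sketch, assuming pairs is a key-value RDD produced earlier in the pipeline and 100 matches the topic's partition count:
// The second argument to reduceByKey fixes the number of shuffle partitions,
// keeping the parallelism constant through this step of the pipeline.
val counts = pairs.reduceByKey(_ + _, 100)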
