Spark: understanding partitioning - cores

Spark: understanding partitioning - cores - multithreading

I'd like to understand partitioning in Spark.
I am running spark in local mode on windows 10.
My laptop has 2 physical cores and 4 logical cores.
1/ Terminology : to me, a core in spark = a thread. So a core in Spark is different than a physical core, right? A Spark core is associated to a task, right?
If so, since you need a thread for a partition, if my sparksql dataframe has 4 partitions, it needs 4 threads right?
2/ If I have 4 logical cores, does it mean that I can only run 4 concurrent threads at the same time on my laptop? So 4 in Spark?
3/ Setting the number of partitions : how to choose the number of partitions of my dataframe, so that further transformations and actions run as fast as possible?
-Should it have 4 partitions since my laptop has 4 logical cores?
-Is the number of partitions related to physical cores or logical cores?
-In spark documentations, it's written that you need 2-3 tasks per CPU. Since I have two physical coresn should the nb of partitions be equal to 4or6?
(I know that number of partitions will not have much effect on local mode, but this is just to understand)

Theres no such thing as a "spark core". If you are referring to options like --executor-cores then yes, that refers to how many tasks each executor will run concurrently.
You can set the number of concurrent tasks to whatever you want, but more than the number of logical cores you have probably won't give and advantage.
Number of partitions to use is situational. Without knowing the data or the transformations you are doing it's hard to give a number. Typical advice is to use just below a multiple of your total cores., for example, if you have 16 cores, maybe 47, 79, 127 and similar numbers just under a multiple of 16 are good to use. The reason for this is you want to make sure all cores are working (as little time as possible do you have resources idle, waiting for others to finish). but you leave a little extra to allow for speculative execution (spark may decide to run the same task twice if it is running slowly to see if it will go faster on a second try).
Picking the number is a bit of trial and error though, Take advantage of the spark job server to monitor how your tasks are running. Having few tasks with many of records each means you should probably increase the number of partitions, on the other hand, many partitions with only a few records each is also bad and you should try to reduce the partitioning in these cases.

Related

decide no of partition in spark (running on YARN) based on executer ,cores and memory

How to decide no of partition in spark (running on YARN) based on executer, cores and memory.
As i am new to spark so doesn't have much hands on real scenario
I know many things to consider to decide the partition but still any production general scenario explanation in detail will be very helpful.
Thanks in advance

One important parameter for parallel collections is the number of
partitions to cut the dataset into. Spark will run one task for each
partition of the cluster. Typically you want 2-4 partitions for each
CPU in your cluster
the number of parition is recommended to be 2/4 * the number of cores.
so if you have 7 executor with 5 core , you can repartition between 7*5*2 = 70 and 7*5*4 = 140 partition
https://spark.apache.org/docs/latest/rdd-programming-guide.html

IMO with spark 3.0 and AWS EMR 2.4.x with adaptive query execution you're often better off letting spark handle it. If you do want to hand tune it the answer can often times be complicated. One good option is to have 2 or 4 times the number of cpus available. While this is useful for most datasizes it becomes problematic with very large and very small datasets. In those cases it's useful to aim for ~128MB per partition.

How the Number of partitions and Number of concurrent tasks in spark calculated

I have a cluster with 4 nodes (each with 16 cores) using Spark 1.0.1.
I have an RDD which I've repartitioned so it has 200 partitions (hoping to increase the parallelism).
When I do a transformation (such as filter) on this RDD, I can't seem to get more than 64 tasks (my total number of cores across the 4 nodes) going at one point in time. By tasks, I mean the number of tasks that appear under the Application Spark UI. I tried explicitly setting the spark.default.parallelism to 128 (hoping I would get 128 tasks concurrently running) and verified this in the Application UI for the running application but this had no effect. Perhaps, this is ignored for a 'filter' and the default is the total number of cores available.
I'm fairly new with Spark so maybe I'm just missing or misunderstanding something fundamental. Any help would be appreciated.

This is correct behavior. Each "core" can execute exactly one task at a time, with each task corresponding to a partition. If your cluster only has 64 cores, you can only run at most 64 tasks at once.
You could run multiple workers per node to get more executors. That would give you more cores in the cluster. But however many cores you have, each core will run only one task at a time.
you can see the more details on the following thread
How does Spark paralellize slices to tasks/executors/workers?

How is task distributed in spark

I am trying to understand that when a job is submitted from the spark-submit and I have spark deployed system with 4 nodes how is the work distributed in spark. If there is large data set to operate on, I wanted to understand exactly in how many stages are the task divided and how many executors run for the job. Wanted to understand how is this decided for every stage.

It's hard to answer this question exactly, because there are many uncertainties.
Number of stages depends only on described workflow, which includes different kind of maps, reduces, joins, etc. If you understand it, you basically can read that right from the code. But most importantly that helps you to write more performant algorithms, because it's generally known the one have to avoid shuffles. For example, when you do a join, it requires shuffle - it's a boundary stage. This is pretty simple to see, you have to print rdd.toDebugString() and then look at indentation (look here), because indentation is a shuffle.
But with number of executors that's completely different story, because it depends on number of partitions. It's like for 2 partitions it requires only 2 executors, but for 40 ones - all 4, since you have only 4. But additionally number of partitions might depend on few properties you can provide at the spark-submit:
spark.default.parallelism parameter or
data source you use (f.e. for HDFS and Cassandra it is different)
It'd be a good to keep all of the cores in cluster busy, but no more (meaning single process only just one partition), because processing of each partition takes a bit of overhead. On the other hand if your data is skewed, then some cores would require more time to process bigger partitions, than others - in this case it helps to split data to more partitions so that all cores are busy roughly same amount of time. This helps with balancing cluster and throughput at the same time.

Partitioning the RDD for Spark Jobs

When I submit job spark in yarn cluster I see spark-UI I get 4 stages of jobs but, memory used is very low in all nodes and it says 0 out of 4 gb used. I guess that might be because I left it in default partition.
Files size ranges are betweenr 1 mb to 100 mb in s3. There are around 2700 files with size of 26 GB. And exactly same 2700 jobs were running in stage 2.
Is it worth to repartition something around 640 partitons, would it improve the performace? or
It doesn't matter if partition is granular than actually required? or
My submit parameters needs to be addressed?
Cluster details,
Cluster with 10 nodes
Overall memory 500 GB
Overall vCores 64
--excutor-memory 16 g
--num-executors 16
--executor-cores 1
Actually it runs on 17 cores out of 64. I dont want to increase the number of cores since others might use the cluster.

You partition, and repartition for following reasons:
To make sure we have enough work to distribute to the distinct cores in our cluster (nodes * cores_per_node). Obviously we need to tune the number of executors, cores per executor, and memory per executor to make that happen as intended.
To make sure we evenly distribute work: the smaller the partitions, the lesser the chance than one core might have much more work to do than all other cores. Skewed distribution can have a huge effect on total lapse time if the partitions are too big.
To keep partitions in managable sizes. Not to big, and not to small so we dont overtax GC. Also bigger partitions might have issues when we have non-linear O.
To small partitions will create too much process overhead.
As you might have noticed, there will be a goldilocks zone. Testing will help you determine ideal partition size.
Note that it is ok to have much more partitions than we have cores. Queuing partitions to be assigned a task is something that I design for.
Also make sure you configure your spark job properly otherwise:
Make sure you do not have too many executors. One or Very Few executors per node is more than enough. Fewer executors will have less overhead, as they work in shared memory space, and individual tasks are handled by threads instead of processes. There is a huge amount of overhead to starting up a process, but Threads are pretty lightweight.
Tasks need to talk to each other. If they are in the same executor, they can do that in-memory. If they are in different executors (processes), then that happens over a socket (overhead). If that is over multiple nodes, that happens over a traditional network connection (more overhead).
Assign enough memory to your executors. When using Yarn as the scheduler, it will fit the executors by default by their memory, not by the CPU you declare to use.
I do not know what your situation is (you made the node names invisible), but if you only have a single node with 15 cores, then 16 executors do not make sense. Instead, set it up with One executor, and 16 cores per executor.

Settle the right number of partition on RDD

I read some comments which says than a good number of partition for a RDD is 2-3 time the number of core. I have 8 nodes each with two 12-cores processor, so i have 192 cores, i setup the partition beetween 384-576 but it doesn't seems works efficiently, i tried 8 partition, same result. Maybe i have to setup other parameters in order to my job works better on the cluster rather than on my machine. I add that the file i analyse make 150k lines.
val data = sc.textFile("/img.csv",384)

The primary effect would be by specifying too few partitions or far too many partitions.
Too few partitions You will not utilize all of the cores available in the cluster.
Too many partitions There will be excessive overhead in managing many small tasks.
Between the two the first one is far more impactful on performance. Scheduling too many smalls tasks is a relatively small impact at this point for partition counts below 1000. If you have on the order of tens of thousands of partitions then spark gets very slow.
Now, considering your case, you are getting the same results from 8 and 384-576 partitions. Generally the thumb rule says,
NoOfPartitions = (NumberOfWorkerNodes*NoOfCoresPerWorkerNode)-1
It says that, as we know, the task is processed by CPU cores. So we should set that many number of partitions which is the total number of cores in the cluster to process-1(for Application Master of driver). That means the each core will process each partition at a time.
That means with 191 partitions can improve the performance. Otherwise impact of setting less and more partitions scenario is explained in beginnning.
Hope this will help!!!

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string