Apache Spark: Understanding terminology of Driver and Executor Configuration

I keep seeing the following terms in distributed computing open-source projects, particularly in Apache Spark, and I am hoping to get an explanation with a simple example.
spark.driver.cores - Number of cores to use for the driver process, only in cluster mode.
spark.driver.memory - Amount of memory to use for the driver process
spark.executor.cores - The number of cores to use on each executor
spark.executor.memory - Amount of memory to use per executor process
spark.task.cpus - Number of cores to allocate for each task.
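(For reference, these are typically supplied as --conf options to spark-submit or set before the SparkSession is created; a minimal sketch with purely illustrative values, not recommendations:)

```python
# Minimal sketch with purely illustrative values (not recommendations).
# spark.driver.* normally has to be given at submit time, e.g.
#   spark-submit --conf spark.driver.cores=1 --conf spark.driver.memory=2g app.py
# because the driver JVM is already running by the time this code executes.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("terminology-demo")
    .config("spark.executor.cores", "2")    # CPU slots per executor process
    .config("spark.executor.memory", "4g")  # heap per executor process
    .config("spark.task.cpus", "1")         # slots one task needs: 2 / 1 = 2 concurrent tasks per executor
    .getOrCreate()
)
```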
For example, say there are three computers C1, C2 and C3, each with an Intel i5 processor (CPU) that has 4 cores (shorturl.at/apsSW), 16 GB of RAM and 1 TB of secondary storage (mounted/built-in).
So, where would the above-mentioned terms fit if I try to process a 1 GB CSV file using those three computers in a Spark cluster environment with YARN?
If we take C1 as the master computer/server/node (uff.. too many terms) and C2 and C3 as slaves/workers/executors,
at a high level, I am thinking (with the assumption that the file is not on C1, C2 or C3 but on some cloud storage) that:
When we submit a Spark program (let's say a program that reads the CSV and displays the first row on the driver's console) to the driver (C1), it tells its executors (C2 and C3) to fetch the data from that CSV file over the internet/intranet in partitions (if partitions are defined in the program).
But,
I don't know how driver cores and memory & executor cores and memory impact this entire process.
Oh.. coming to the term "process": it always seems to be linked to cores. What exactly does a process represent? Does it represent the process behind the Spark program?
And then there are tasks.. do these tasks represent processes spawned from the Spark program?
And in the cloud, there is the term vCPU to confuse things even more. Does a vCPU correspond to a single core?

So, where would the above-mentioned terms fit if I try to process a 1 GB CSV file using those three computers in a Spark cluster environment with YARN?
By default, the driver would be randomly assigned to one of those three nodes (with 1 GB of memory and 1 core), and 2 executors (if I remember correctly) would be assigned (with 1 GB of memory and 1 core each).
Of course, all of this is configurable, but it's best to stick to the defaults until you run out of memory.
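If you do need to move off the defaults later, here is a hedged sketch of what that might look like on YARN (placeholder values only; spark.executor.instances is the property that controls the executor count on YARN):

```python
# Illustrative only: overriding the YARN defaults mentioned above.
# Resist tuning these until you actually hit a memory or performance problem.
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = (
    SparkConf()
    .setAppName("yarn-sizing-demo")
    .set("spark.executor.instances", "2")  # number of executor processes requested from YARN
    .set("spark.executor.memory", "2g")    # heap per executor
    .set("spark.executor.cores", "2")      # concurrent task slots per executor
)
spark = SparkSession.builder.config(conf=conf).getOrCreate()
```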
I don't know how driver cores and memory & executor cores and memory impact this entire process.
Good, leave it alone until you run out of memory or need more performance. Don't pre-tune your config. Wait until there is a performance issue, then solve it. Way too often people assign 8 GB and 8 cores to a process that uses 50k and 1 core. There are settings to tune performance later when it becomes an issue. Know that they're there, and when you run into an issue, then start to tweak them.
Oh.. coming to the term "process": it always seems to be linked to cores. What exactly does a process represent? Does it represent the process behind the Spark program?
I don't really get the ask. Computation takes resources; you need space (memory) or cores to process information.
And then there are tasks.. do these tasks represent processes spawned from the Spark program?
A task is a unit of work defined by the program you write. Look at the Spark UI; it will give you some intuition as to how things work.
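As a rough illustration of how tasks relate to partitions (assuming an existing SparkSession named spark):

```python
# Rough illustration (assumes an existing SparkSession named `spark`):
# a simple stage runs one task per partition of the data it processes.
df = spark.range(0, 1_000_000, numPartitions=8)
print(df.rdd.getNumPartitions())  # 8
df.count()  # look at this job in the Spark UI: its first stage shows 8 tasks
```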
And in the cloud, there is the term vCPU to confuse things even more. Does a vCPU correspond to a single core?
It allows you to define the number of cores on a machine. You can under-provision or over-provision them on a node, depending on your needs. For now, imagine them as your way to define what the node has available, and Spark will interpret them as cores.

Related

Spark: How to tune memory / cores given my cluster?

There are several threads with significant votes that I am having difficulty interpreting, perhaps because the jargon of 2016 differs from that of today (or I am just not getting it).
Apache Spark: The number of cores vs. the number of executors
How to tune spark executor number, cores and executor memory?
Azure/Databricks offers some best practices on cluster sizing: https://learn.microsoft.com/en-us/azure/databricks/clusters/cluster-config-best-practices
So for my workload, let's say I am interested in (using Databricks' current jargon):
1 Driver: 64 GB of memory and 8 cores
1 Worker: 256 GB of memory and 64 cores
Drawing on the Microsoft link above, fewer workers should in turn lead to less shuffle, which is among the most costly Spark operations.
So, I have 1 driver and 1 worker. How, then, do I translate these terms into what is discussed here on SO in terms of "nodes" and "executors"?
Ultimately, I would like to set my Spark config "correctly" so that cores and memory are as optimized as possible.
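For what it's worth, here is the back-of-the-envelope arithmetic I have in mind (the numbers are my own assumptions about how a 64-core worker might be carved into executor JVMs, not Databricks defaults):

```python
# Back-of-the-envelope sketch; assumptions, not Databricks' actual behaviour.
# One 64-core / 256 GB worker node carved into several executor JVMs.
worker_cores, worker_mem_gb = 64, 256
cores_per_executor = 8                                    # commonly cited ballpark
executors_per_worker = worker_cores // cores_per_executor            # 8
mem_per_executor_gb = int(worker_mem_gb / executors_per_worker * 0.9)  # ~28 GB, leaving headroom for overhead

print(executors_per_worker, mem_per_executor_gb)
# Roughly: spark.executor.instances=8, spark.executor.cores=8, spark.executor.memory=28g
```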

Can Spark executor be enabled for multithreading more than CPU cores?

I understand that if executor-cores is set to more than 1, then the executor will run tasks in parallel. However, from my experience, the number of parallel processes in the executor is always equal to the number of CPUs in the executor.
For example, suppose I have a machine with 48 cores and set executor-cores to 4, and then there will be 12 executors.
What we need is to run 8 threads or more for each executor (so 2 or more threads per CPU). The reason is that the tasks are quite lightweight and CPU usage is quite low, around 10%, so we want to boost CPU usage by running multiple threads per CPU.
So I am asking whether we could possibly achieve this through Spark configuration. Thanks a lot!
Spark executors process tasks, which are derived from the execution plan/code and the partitions of the DataFrame. Each core on an executor always processes only one task at a time, so each executor can run at most as many concurrent tasks as it has cores. Running more concurrent tasks per executor than that, as you are asking for, is not possible.
You should look at code changes instead: minimize the number of shuffles (avoid inner joins; use window functions instead) and check for skew in your data leading to non-uniformly distributed partition sizes (DataFrame partitions, not storage partitions).
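A quick way to check for that kind of skew (a sketch that assumes an existing DataFrame named df) is to count rows per partition:

```python
# Sketch: count rows per partition of an existing DataFrame `df`.
# A few huge partitions next to many tiny ones indicates skew.
from pyspark.sql.functions import spark_partition_id

df.groupBy(spark_partition_id().alias("partition")).count().orderBy("count", ascending=False).show()
```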
WARNING:
If you are, however, alone on your cluster and you do not want to change your code, you can change the YARN settings for the server and advertise it with more than 48 cores, even though there are just 48. This can lead to severe instability of the system, since executors are now sharing CPUs (and your OS also needs CPU power).
This answer is meant as a complement to @Telijas' answer, because in general I agree with it. It's just to give a tiny bit of extra information.
There are some configuration parameters with which you can set the number of threads for certain parts of Spark. For example, there is a section in the Spark docs that discusses some of them (for all of this I'm looking at the latest Spark version at the time of writing this post: version 3.3.1):
Depending on jobs and cluster configurations, we can set number of threads in several places in Spark to utilize available resources efficiently to get better performance. Prior to Spark 3.0, these thread configurations apply to all roles of Spark, such as driver, executor, worker and master. From Spark 3.0, we can configure threads in finer granularity starting from driver and executor. Take RPC module as example in below table. For other modules, like shuffle, just replace “rpc” with “shuffle” in the property names except spark.{driver|executor}.rpc.netty.dispatcher.numThreads, which is only for RPC module.
Property Name | Default | Meaning
spark.{driver|executor}.rpc.io.serverThreads | Fall back on spark.rpc.io.serverThreads | Number of threads used in the server thread pool
spark.{driver|executor}.rpc.io.clientThreads | Fall back on spark.rpc.io.clientThreads | Number of threads used in the client thread pool
spark.{driver|executor}.rpc.netty.dispatcher.numThreads | Fall back on spark.rpc.netty.dispatcher.numThreads | Number of threads used in RPC message dispatcher thread pool
Then here follows a list (non-exhaustive and in no particular order; I've just been looking through the source code) of some other thread-count-related configuration parameters:
spark.sql.streaming.fileSource.cleaner.numThreads
spark.storage.decommission.shuffleBlocks.maxThreads
spark.shuffle.mapOutput.dispatcher.numThreads
spark.shuffle.push.numPushThreads
spark.shuffle.push.merge.finalizeThreads
spark.rpc.connect.threads
spark.rpc.io.threads
spark.rpc.netty.dispatcher.numThreads (will be overridden by the driver/executor-specific ones from the table above)
spark.resultGetter.threads
spark.files.io.threads
I didn't add the meaning of these parameters to this answer because that's a different question and quite "Googleable". This is just meant as an extra bit of info.
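If you do end up experimenting with one of them, it is set like any other Spark property; a small sketch (the value is purely illustrative, not a tuning suggestion):

```python
# Sketch only: thread-count properties are passed like any other Spark config,
# e.g. spark-submit --conf spark.rpc.io.serverThreads=8 ...
# You can check what the running application actually picked up:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
print(spark.sparkContext.getConf().get("spark.rpc.io.serverThreads", "not set"))
```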

Spark: understanding partitioning - cores

I'd like to understand partitioning in Spark.
I am running spark in local mode on windows 10.
My laptop has 2 physical cores and 4 logical cores.
1/ Terminology: to me, a core in Spark = a thread. So a core in Spark is different from a physical core, right? A Spark core is associated with a task, right?
If so, since you need a thread per partition, if my Spark SQL DataFrame has 4 partitions, it needs 4 threads, right?
2/ If I have 4 logical cores, does it mean that I can only run 4 concurrent threads at the same time on my laptop? So 4 in Spark?
3/ Setting the number of partitions: how do I choose the number of partitions of my DataFrame so that further transformations and actions run as fast as possible?
-Should it have 4 partitions since my laptop has 4 logical cores?
-Is the number of partitions related to physical cores or logical cores?
-In the Spark documentation, it's written that you need 2-3 tasks per CPU. Since I have two physical cores, should the number of partitions be equal to 4 or 6?
(I know that the number of partitions will not have much effect in local mode, but this is just to understand.)
There's no such thing as a "Spark core". If you are referring to options like --executor-cores, then yes, that refers to how many tasks each executor will run concurrently.
You can set the number of concurrent tasks to whatever you want, but more than the number of logical cores you have probably won't give an advantage.
The number of partitions to use is situational. Without knowing the data or the transformations you are doing, it's hard to give a number. Typical advice is to use just below a multiple of your total cores; for example, if you have 16 cores, then 47, 79, 127 and similar numbers just under a multiple of 16 are good to use. The reason is that you want to make sure all cores are working (spending as little time as possible with resources idle, waiting for others to finish), but you leave a little extra to allow for speculative execution (Spark may decide to run the same task twice if it is running slowly, to see if it will go faster on a second try).
Picking the number is a bit of trial and error, though. Take advantage of the Spark job server to monitor how your tasks are running. Having few tasks with many records each means you should probably increase the number of partitions; on the other hand, many partitions with only a few records each is also bad, and you should try to reduce the partitioning in those cases.
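As a small sketch of that trial-and-error loop (assuming an existing DataFrame df; 47 is just the example figure for 16 cores from above):

```python
# Sketch of the trial-and-error loop described above (assumes an existing
# DataFrame `df`; 47 comes from the 16-core example in the text).
import time

for n in (16, 31, 47):
    start = time.time()
    df.repartition(n).count()  # force a full pass over the repartitioned data
    print(n, "partitions ->", round(time.time() - start, 2), "s")
```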

Apache Spark: Can you start a worker in standalone mode with more cores or memory than physically available?

I am talking about the standalone mode of spark.
Let's say the value of SPARK_WORKER_CORES=200 and there are only 4 cores available on the node where I am trying to start the worker. Will the worker get 4 cores and continue, or will the worker not start at all?
A similar case: what if I set SPARK_WORKER_MEMORY=32g and there is only 2g of memory actually available on that node?
"Cores" in Spark is sort of a misnomer. "Cores" actually corresponds to the number of threads created to process data. So, you could have an arbitrarily large number of cores without an explicit failure. That being said, overcommitting by 50x will likely lead to incredibly poor performance due to context switching and overhead costs. This means that for both workers and executors you can arbitrarily increase this number. In practice in Spark Standalone, I've generally seen this overcommitted no more than 2-3x the number of logical cores.
When it comes to specifying worker memory, once again, you can in theory increase it to an arbitrarily large number. This is because, for a worker, the memory amount specifies how much it is allowed to allocate for executors, but it doesn't explicitly allocate that amount when you start the worker. Therefore you can make this value much larger than physical memory.
The caveat here is that when you start up an executor, if you set the executor memory to be greater than the amount of physical memory, your executors will fail to start. This is because executor memory directly corresponds to the -Xmx setting of the java process for that executor.

Apache Spark standalone mode: number of cores

I'm trying to understand the basics of Spark internals, and the Spark documentation for submitting applications in local mode says this about the spark-submit --master setting:
local[K]: Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine).
local[*]: Run Spark locally with as many worker threads as logical cores on your machine.
Since all the data is stored on a single local machine, it does not benefit from distributed operations on RDDs.
How does it benefit and what internally is going on when Spark utilizes several logical cores?
The system will allocate additional threads for processing data. Despite being limited to a single machine, it can still take advantage of the high degree of parallelism available in modern servers.
If you have a reasonably sized data set, say something with a dozen partitions, you can measure the time it takes to use local[1] vs local[n] (where n is the number of cores in your machine). You can also see the difference in utilization of your machine. If you only have one core designated for use, it will only use 100% of one core (plus some extra for garbage collection). If you have 4 cores and specify local[4], it will use 400% of a core (4 cores), and execution time can be significantly shortened (although not typically by 4x).
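A rough sketch of that experiment (the timings and the speedup you see will depend heavily on the workload):

```python
# Rough sketch of the local[1] vs local[4] comparison described above.
import time
from pyspark.sql import SparkSession

for master in ("local[1]", "local[4]"):
    spark = SparkSession.builder.master(master).appName("local-cores-demo").getOrCreate()
    start = time.time()
    # a dozen partitions of CPU-bound work, as suggested above
    spark.range(0, 50_000_000, numPartitions=12).selectExpr("sum(sqrt(id))").collect()
    print(master, "->", round(time.time() - start, 2), "s")
    spark.stop()
```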
