I'm trying to understand the basics of Spark internals. The Spark documentation for submitting applications in local mode says, for the spark-submit --master setting:
local[K] Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine).
local[*] Run Spark locally with as many worker threads as logical cores on your machine.
Since all the data is stored on a single local machine, it does not benefit from distributed operations on RDDs.
How does it benefit, and what is going on internally when Spark utilizes several logical cores?
The system will allocate additional threads for processing data. Despite being limited to a single machine, it can still take advantage of the high degree of parallelism available in modern servers.
If you have a reasonably sized data set, say something with a dozen partitions, you can measure the time it takes to use local[1] vs local[n] (where n is the number of cores in your machine). You can also see the difference in the utilization of your machine. If you only have one core designated for use, it will only use 100% of one core (plus some extra for garbage collection). If you have 4 cores and specify local[4], it will use 400% (all 4 cores), and execution time can be significantly shortened (although typically not by 4x).
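As a rough way to try this yourself, here is a minimal sketch (the data size and app name are arbitrary, not from the question) that you could submit once with --master local[1] and once with --master local[4], comparing the wall-clock time and the CPU usage reported by top:

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: submit with `--master local[1]` and again with `--master local[4]`
// and compare the elapsed time and CPU utilization. Data size is arbitrary.
object LocalParallelismDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("local-parallelism-demo").getOrCreate()
    val sc = spark.sparkContext

    // A dozen partitions, as suggested above.
    val rdd = sc.parallelize(1 to 10000000, numSlices = 12)

    val start = System.nanoTime()
    // CPU-bound work so the difference between 1 and 4 executor threads is visible.
    val result = rdd.map(i => math.sqrt(i.toDouble)).sum()
    val elapsedSeconds = (System.nanoTime() - start) / 1e9

    println(s"sum = $result, took $elapsedSeconds s")
    spark.stop()
  }
}
```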
Related
I have been seeing the following terms more and more often in distributed computing open-source projects, particularly in Apache Spark, and I am hoping to get an explanation with a simple example.
spark.driver.cores - Number of cores to use for the driver process, only in cluster mode.
spark.driver.memory - Amount of memory to use for the driver process
spark.executor.cores - The number of cores to use on each executor
spark.executor.memory - Amount of memory to use per executor process
spark.task.cpus - Number of cores to allocate for each task.
For example, suppose there are three computers C1, C2 and C3, each with an Intel i5 processor (CPU) that has 4 cores (shorturl.at/apsSW), 16 GB of RAM and 1 TB of secondary storage (mounted/built-in).
So where would the above-mentioned terms fit if I try to process a 1 GB CSV file using those three computers in a Spark cluster environment with YARN?
If we take C1 as the master computer/server/node (uff.. too many terms) and C2, C3 as slaves/workers/executors:
At a high level, I am thinking (with the assumption that the file is not on C1, C2 or C3 but on some cloud storage):
When we submit a Spark program (let's say a program that reads the file and displays the first row on the driver's console) to the driver (C1), it tells its executors (C2 and C3) to fetch the data from that CSV file over the internet/intranet in partitions (if partitions are defined in the program).
But,
I don't know how driver cores and memory & executor cores and memory impact this entire process.
Oh.. coming to the term process, it has always been linked to cores. What exactly does a process represent? Does it represent the process behind the Spark program?
And then there are tasks.. do these tasks represent processes spawned from the Spark program?
And in the cloud, there is the term vCPU to confuse things further. Does a vCPU correspond to a single core?
So where would the above-mentioned terms fit if I try to process a 1 GB CSV file using those three computers in a Spark cluster environment with YARN?
By default you would have a driver randomly assigned to one of those three nodes (with 1 GB of memory and 1 core) and 2 (if I remember correctly) executors assigned (with 1 GB of memory and 1 core each).
Of course all of this is configurable, but it's best to stick to the defaults until you run out of memory.
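For illustration only, this is roughly where each of those settings lives if you declare them programmatically (the values are placeholders, not the defaults described above). Note that the driver settings normally have to be passed at launch time, e.g. on the spark-submit command line or in spark-defaults.conf, because the driver JVM is already running by the time this code executes:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: mapping the terms from the question to their configuration keys.
// Values are placeholders; driver settings (spark.driver.*) only take effect
// when supplied at launch time (spark-submit / spark-defaults.conf).
val spark = SparkSession.builder()
  .appName("config-terms-demo")            // hypothetical app name
  .config("spark.driver.memory", "1g")     // memory for the driver process
  .config("spark.driver.cores", "1")       // driver cores (cluster mode only)
  .config("spark.executor.memory", "1g")   // memory per executor process
  .config("spark.executor.cores", "1")     // concurrent task slots per executor
  .config("spark.task.cpus", "1")          // cores reserved for each task
  .getOrCreate()
```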
I don't know how driver cores and memory & executor cores and memory impact this entire process.
Good, leave it alone until you run out of memory or need more performance. Don't pre-tune your config. Wait until there is a performance issue, then solve that. Way too often people assign 8 GB and 8 cores to a process and then use 50 KB and 1 core. There are settings to tune performance later when it becomes an issue. Know that they're there, and when you run into an issue, start to tweak them.
Oh.. coming to the term process, it has always been linked to cores. What exactly does a process represent? Does it represent the process behind the Spark program?
I don't really get the ask. Computation takes resources. You need space (memory) or cores to process information.
And then there are tasks.. do these tasks represent processes spawned from the Spark program?
A task is a unit of work defined by the program you write. Look at the Spark UI and it will give you some intuition about how things work.
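As a toy illustration (not taken from your program): in each stage, Spark runs one task per partition, which you can confirm in the UI.

```scala
// Toy sketch (for spark-shell, where `sc` is the SparkContext):
// Spark runs one task per partition in each stage.
val rdd = sc.parallelize(1 to 1000, numSlices = 8)
println(rdd.getNumPartitions)   // 8
rdd.map(_ * 2).count()          // the Spark UI shows this stage running 8 tasks
```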
And in the cloud, there is the term vCPU to confuse things further. Does a vCPU correspond to a single core?
It allows you to define the number of cores on a machine. You can under-provision or over-provision them on a node, depending on your needs. For now, imagine them as your way of defining what the node has available, and know that Spark will interpret them as cores.
As is well known, it is possible to increase the number of cores when submitting our application. Actually, I'm trying to allocate all the available cores on the server to the Spark application. I'm wondering what will happen to performance: will it get worse or better than usual?
The first thing that might come to mind about allocating cores (--executor-cores) is that more cores per executor means more parallelism, more tasks executed concurrently, and better performance. But that's not true in the Spark ecosystem. After leaving 1 core for the OS and other applications running on the worker, studies have shown that it's optimal to allocate 5 cores per executor.
For example, if you have a worker node with 16 cores, the optimal number of executors and cores per executor will be --num-executors 3 and --executor-cores 5 (as 5*3=15) respectively.
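As a hedged sketch of that arithmetic on YARN (the memory value is a placeholder, and the keys below are the configuration equivalents of the --num-executors and --executor-cores flags):

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: a 16-core worker, leave 1 core for the OS, 5 cores per executor
// => 3 executors on that node. The memory figure is a placeholder.
val spark = SparkSession.builder()
  .appName("executor-sizing-demo")
  .config("spark.executor.instances", "3")   // equivalent of --num-executors 3
  .config("spark.executor.cores", "5")       // equivalent of --executor-cores 5
  .config("spark.executor.memory", "4g")     // placeholder, size it to your node
  .getOrCreate()
```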
It's not only optimal resource allocation that brings better performance; it also depends on how the transformations and actions are done on the DataFrames. More shuffling of data between different executors hampers performance.
Your operating system always needs resources for its own basic needs.
It's good to keep 1 core and 1 GB of memory for the operating system and other applications.
If you allocate all resources to Spark, it is not going to improve your performance; your other applications will starve for resources.
I don't think it's a good idea to allocate all resources to Spark alone.
Follow the post below if you want to tune your Spark cluster:
How to tune spark executor number, cores and executor memory?
I'd like to understand partitioning in Spark.
I am running Spark in local mode on Windows 10.
My laptop has 2 physical cores and 4 logical cores.
1/ Terminology: to me, a core in Spark = a thread. So a core in Spark is different from a physical core, right? And a Spark core is associated with a task, right?
If so, since you need a thread per partition, if my Spark SQL DataFrame has 4 partitions, it needs 4 threads, right?
2/ If I have 4 logical cores, does it mean that I can only run 4 concurrent threads at the same time on my laptop? So 4 in Spark?
3/ Setting the number of partitions: how do I choose the number of partitions of my DataFrame so that further transformations and actions run as fast as possible?
- Should it have 4 partitions since my laptop has 4 logical cores?
- Is the number of partitions related to physical cores or logical cores?
- In the Spark documentation, it's written that you need 2-3 tasks per CPU. Since I have two physical cores, should the number of partitions be 4 or 6?
(I know that the number of partitions will not have much effect in local mode, but this is just to understand.)
There's no such thing as a "Spark core". If you are referring to options like --executor-cores, then yes, that refers to how many tasks each executor will run concurrently.
You can set the number of concurrent tasks to whatever you want, but more than the number of logical cores you have probably won't give an advantage.
The number of partitions to use is situational. Without knowing the data or the transformations you are doing, it's hard to give a number. Typical advice is to use just below a multiple of your total cores; for example, if you have 16 cores, then 47, 79, 127 and similar numbers just under a multiple of 16 are good to use. The reason for this is that you want to make sure all cores are working (with as little time as possible spent with resources idle, waiting for others to finish), but you leave a little extra to allow for speculative execution (Spark may decide to run the same task twice if it is running slowly, to see if it goes faster on a second try).
Picking the number is a bit of trial and error, though. Take advantage of the Spark job server to monitor how your tasks are running. Having few tasks with many records each means you should probably increase the number of partitions; on the other hand, many partitions with only a few records each is also bad, and you should try to reduce the partitioning in those cases.
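For example (numbers purely illustrative, and the input/output paths are hypothetical), with 16 total cores you might repartition to just under 3 * 16 = 48 and then watch how the tasks spread out in the UI:

```scala
// Illustrative sketch, assuming an existing SparkSession `spark` (as in spark-shell).
// 16 total cores => pick a partition count just under a multiple of 16, e.g. 47.
val df = spark.read.parquet("/path/to/input")    // hypothetical input path
val repartitioned = df.repartition(47)
println(repartitioned.rdd.getNumPartitions)      // 47
repartitioned.write.parquet("/path/to/output")   // hypothetical output path
```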
I want to run Spark with only 1 thread. But whatever option I tried, Spark always used all 8 cores in my CPU.
I tried various combinations of --master local, --master local[1], --executor-cores 1, --total-executor-cores 1, and --conf spark.max.cores=1, but nothing worked. When I look at top on my Ubuntu 14.04 machine, CPU usage is always about 600% (approximately 75% * 8 cores).
My goal is to compare running time of Spark tasks by varying number of cores used. Please help!
** Added
I'm working on the code from https://github.com/amplab/SparkNet/blob/master/src/main/scala/apps/CifarApp.scala . Sincerely thank you for everybody's help.
First of all, you're mixing options which belong to different deployment modes. Parameters like spark.cores.max (not spark.max.cores) or spark.executor.cores are meaningful only in Standalone mode (which is not the same as local) and on YARN.
In the case of local mode, the only thing that really matters is the parameter n passed with the master definition (local[n]). That doesn't mean local[1] will run using only one thread. Spark alone uses a number of different threads (20 or so, if I remember correctly) for bookkeeping, supervising, shuffles, the UI and other things.
What is limited is the number of executor threads. That still doesn't mean a single executor cannot start more than one thread, which is most likely the case here. You're using libraries which are designed for parallel execution. If you don't use a GPU, then computations are most likely executed in parallel on the CPUs. All of that is independent of, and not controlled by, Spark itself. If you want full control, you should execute your application in a restricted environment like a VM or a container.
The code you referred to uses SparkContext.parallelize(...) without setting the numSlices (number of partitions) argument. This means that the value of spark.default.parallelism (see the docs) is used to decide the number of partitions == cores to use.
From the docs, this parameter defaults to:
For operations like parallelize with no parent RDDs, it depends on the cluster manager:
Local mode: number of cores on the local machine
So, adding --conf spark.default.parallelism=1 to your command should make these RDDs use a single partition, and thus a single core.
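Alternatively, a small sketch of the same idea: passing the partition count to parallelize explicitly overrides spark.default.parallelism for that RDD.

```scala
// Sketch (spark-shell, where `sc` is the SparkContext): force a single partition.
val rdd = sc.parallelize(1 to 1000000, numSlices = 1)
println(rdd.getNumPartitions)   // 1 => one task, hence one executor thread doing the work
```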
How to decide the number of workers on spark standalone cluster mode?
The duration decreased when I added workers in standalone cluster mode.
For example, for my 3.5 GB of input data, WordCount took 3.8 min. However, it took 2.6 min after I added one worker with 4 GB of memory.
Is it fine to add workers for tuning Spark? I am thinking about the risks involved.
My environment settings were as below:
Memory 128 GB, 16 CPUs for 9 VMs
CentOS
Hadoop 2.5.0-cdh5.2.0
Spark 1.1.0
Input data information
3.5 GB data file from HDFS
You can tune both the executors (the number of JVMs and their memory) as well as the number of tasks. If what you're doing can benefit from parallelism, you can spin up more executors via configuration and increase the number of tasks (by calling repartition/coalesce etc. in your code).
When you set the parallelism, take into account whether you're doing mostly I/O or computation, etc. Generally speaking, Spark's recommendation is 2-3 tasks per CPU core.
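As a back-of-the-envelope sketch (all the numbers are hypothetical, and the multiplier of 2-3 comes from the guidance above), you can derive a partition count from your total executor cores and repartition accordingly:

```scala
// Sketch only: derive a partition count from total executor cores (hypothetical numbers),
// assuming an existing DataFrame `df`.
val executors = 8          // number of executor JVMs in the cluster
val coresPerExecutor = 4   // spark.executor.cores
val tasksPerCore = 3       // the docs suggest 2-3 tasks per CPU core
val targetPartitions = executors * coresPerExecutor * tasksPerCore   // 96

val tuned = df.repartition(targetPartitions)
```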