How to decide the number of workers in Spark standalone cluster mode?
The job duration decreased when I added workers in standalone cluster mode.
For example, with 3.5 GB of input data, WordCount took 3.8 min. After I added one worker with 4 GB of memory, it took 2.6 min.
Is it fine to add workers to tune Spark? I am wondering about the risks of doing that.
My environment settings were as follows:
Memory 128 G, 16 CPU for 9 VM
CentOS
Hadoop 2.5.0-cdh5.2.0
Spark 1.1.0
Input data information
3.5 G data file from HDFS
You can tune both the executors (the number of JVMs and their memory) and the number of tasks. If your job can benefit from parallelism, you can spin up more executors through configuration and adjust the number of tasks (by calling repartition to increase it, or coalesce to reduce it, in your code).
When you set the parallelism, take into account whether the job is mostly I/O-bound or compute-bound. Generally speaking, the Spark recommendation is 2-3 tasks per CPU core.
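To make the 2-3 tasks per core guideline concrete, here is a minimal sketch in plain Python. The function name suggested_partitions and the 16-core cluster are made up for illustration; only the 2-3 multiplier comes from the recommendation above.

```python
def suggested_partitions(total_cores: int, tasks_per_core: int = 3) -> int:
    """Return a partition count that gives `tasks_per_core` tasks per core."""
    if total_cores < 1 or tasks_per_core < 1:
        raise ValueError("total_cores and tasks_per_core must be positive")
    return total_cores * tasks_per_core


# Hypothetical cluster: 4 workers with 4 cores each = 16 cores in total.
low = suggested_partitions(16, tasks_per_core=2)
high = suggested_partitions(16, tasks_per_core=3)
print(f"aim for roughly {low}-{high} partitions")
```

The resulting number is what you would pass to something like rdd.repartition(n). Treat it as a starting point and measure, since an I/O-heavy job may behave differently from a compute-heavy one.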
Related
I keep seeing the following terms in distributed computing open source projects, particularly in Apache Spark, and I am hoping to get an explanation with a simple example.
spark.driver.cores - Number of cores to use for the driver process, only in cluster mode.
spark.driver.memory - Amount of memory to use for the driver process
spark.executor.cores - The number of cores to use on each executor
spark.executor.memory - Amount of memory to use per executor process
spark.task.cpus - Number of cores to allocate for each task.
For example, suppose there are three computers C1, C2 and C3, each with an Intel i5 processor (CPU) that has 4 cores (shorturl.at/apsSW), 16 GB of RAM and 1 TB of secondary storage (mounted/built-in).
So, where would the above-mentioned terms fit if I try to process a 1 GB CSV file using those three computers in a Spark cluster environment with YARN?
Suppose we take C1 as the master computer/server/node (Uff.. too many terms) and C2, C3 as slaves/workers/executors.
At a high level, I am thinking that (assuming the file is not on C1, C2 or C3 but on some cloud storage):
When we submit a Spark program (let's say a program that reads the file and displays the first row on the driver's console) on the driver (C1), it tells its executors (C2 and C3) to fetch the data from that CSV file over the internet/intranet in partitions (if partitions are defined in the program).
But,
I don't know how driver cores and memory, and executor cores and memory, impact this entire process.
Oh.. coming to the term process, it has always been linked to cores. What exactly does a process represent? Does it represent the process behind the Spark program?
And then there are tasks.. do these tasks represent processes spawned from the Spark program?
And in the cloud, there is the term vCPU to confuse things more. Does a vCPU correspond to a single core?
So, where does above mentioned terms would fit if I try to process 1
GB of CSV file using those three computers in spark cluster
environment with YARN?
By default you would have a driver randomly assigned to one of those three nodes (with 1 GB of memory and 1 core) and 2 (if I remember correctly) executors assigned (with 1 GB of memory and 1 core each).
Of course all of this is configurable but it's best to stick to defaults until you run out of memory.
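As a rough illustration of how the executor settings relate to each other, the sketch below computes how many tasks can run at the same time. The class name ExecutorLayout is made up for this example, and the default values are the ones recalled in this answer (2 executors, 1 core each), not something read from a live cluster.

```python
from dataclasses import dataclass


@dataclass
class ExecutorLayout:
    num_executors: int = 2   # spark.executor.instances (default as recalled above)
    executor_cores: int = 1  # spark.executor.cores
    task_cpus: int = 1       # spark.task.cpus

    def tasks_per_executor(self) -> int:
        # Each task reserves `task_cpus` of the executor's cores.
        return self.executor_cores // self.task_cpus

    def concurrent_tasks(self) -> int:
        return self.num_executors * self.tasks_per_executor()


defaults = ExecutorLayout()
# Hypothetical tuning: one executor on each of C2 and C3, using all 4 cores.
tuned = ExecutorLayout(num_executors=2, executor_cores=4)
print(defaults.concurrent_tasks(), tuned.concurrent_tasks())
```

Under these assumptions the defaults run 2 tasks at once and the tuned layout runs 8. Driver cores and memory are not part of this count, because tasks run on the executors.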
don't know how driver cores and memory & executor cores and memory
impact this entire process.
Good, leave it alone until you run out of memory or need more performance. Don't pre-tune your config. Wait until there is a performance issue, then solve that. Way too often people assign 8 gigs and 8 cores to a process that uses 50k and 1 core. There are settings to tune performance later when it becomes an issue. Know that they are there, and when you run into an issue, start to tweak them.
Oh.. coming to the term process, it is always been linked to cores.
What exactly a process represent, does it represent process behind
spark program?
I don't really get the ask. Computation takes resources. You need space or cores to process information.
And there comes, tasks.. are these tasks represent processes spawned
from spark program?
A task is a unit of work defined by the program you write. Look into the spark UI and it will give you some intuition as to how things work.
And in cloud, there is a term vCPUs to confuse more, does vCPU
corresponds to a single core?
It allows you to define the number of cores on a machine. You can undersubscribe or oversubscribe them on a node, depending on your needs. For now, think of them as your way to define what the node has available; Spark will interpret them as cores.
I'm running a Spark batch job on AWS Fargate in standalone mode. The compute environment has 8 vCPUs, and the job definition has 1 vCPU and 2048 MB of memory. In the Spark application I can specify how many cores I want to use, which I do with the code below:
from pyspark.sql import SparkSession

# local[8] runs Spark in local mode with 8 worker threads
sparkSess = SparkSession.builder.master("local[8]")\
    .appName("test app")\
    .config("spark.debug.maxToStringFields", "1000")\
    .config("spark.sql.sources.partitionOverwriteMode", "dynamic")\
    .getOrCreate()
local[8] is specifying 8 cores/threads (that’s what I'm assuming).
Initially I ran the Spark app without specifying cores; I think the job was running in a single thread, and it took around 10 min to complete. Specifying this number reduces the processing time: with 2 it dropped to almost 5 minutes, and with 4 and then 8 it now takes almost 4 minutes. But I don't understand the relation between vCPUs and Spark threads. Whatever number I specify for cores, sparkContext.defaultParallelism shows that value.
Is this the correct way? Is there any relation between this number and the vCPUs that I specify in the job definition or compute environment?
You are running in Spark Local Mode. Learning Spark has this to say about Local mode:
Spark driver runs on a single JVM, like a laptop or single node
Spark executor runs on the same JVM as the driver
Cluster manager Runs on the same host
Damji, Jules S., Wenig, Brooke, Das, Tathagata, Lee, Denny. Learning Spark (p. 30). O'Reilly Media. Kindle Edition.
local[N] launches with N threads. Given the above definition of Local Mode, those N threads must be shared by the Local Mode Driver, Executor and Cluster Manager.
As such, from the available vCPUs, allotting one vCPU for the Driver thread, one for the Cluster Manager, one for the OS and the remaining for Executor seems reasonable.
The optimal number of threads/vCPUs for the Executor will depend on the number of partitions your data has.
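The allotment suggested above can be written down as a small helper. This is only a sketch of the heuristic from this answer (reserve one vCPU each for the driver, the cluster manager and the OS), not a rule that Spark enforces; the function names and the reserved=3 default are assumptions made for illustration.

```python
def executor_threads(vcpus: int, reserved: int = 3) -> int:
    """Threads left for the executor after reserving `reserved` vCPUs.

    Never returns less than 1, since local mode needs at least one thread.
    """
    return max(1, vcpus - reserved)


def local_master(vcpus: int, reserved: int = 3) -> str:
    """Build a master string such as 'local[5]' for SparkSession.builder.master()."""
    return f"local[{executor_threads(vcpus, reserved)}]"


print(local_master(8))  # compute environment with 8 vCPUs
print(local_master(1))  # job definition with 1 vCPU
```

Whether the reserved count should be 3, or something smaller, is worth measuring on your own job, especially if it spends most of its time waiting on I/O.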
Please bear with me because I am still quite new to Spark.
I have a GCP DataProc cluster which I am using to run a large number of Spark jobs, 5 at a time.
The cluster is 1 + 16 nodes, with 8 cores / 40 GB memory / 1 TB storage per node.
Now I might be misunderstanding something or not doing something correctly, but I currently have 5 jobs running at once, and the Spark UI shows that only 34/128 vcores are in use, and they do not appear to be evenly distributed (the jobs were executed simultaneously, but the distribution is 2/7/7/11/7). There is only one core allocated per running container.
I have used the flags --executor-cores 4 and --num-executors 6 which doesn't seem to have made any difference.
Can anyone offer some insight/resources as to how I can fine tune these jobs to use all available resources?
I have managed to solve the issue - I had no cap on the memory usage so it looked as though all memory was allocated to just 2 cores per node.
I added the property spark.executor.memory=4G and re-ran the job, and it instantly allocated 92 cores.
Hope this helps someone else!
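For anyone wondering why the memory setting changes how many cores get used, here is a back-of-the-envelope sketch. It assumes each executor container needs its heap plus an overhead of 10% (minimum 384 MB), and that a node offers 32768 MB to YARN; that node figure and the function name are hypothetical, and YARN's rounding of container sizes is ignored. It is not meant to reproduce the exact 92 cores above.

```python
def executors_per_node(node_mem_mb: int, node_cores: int,
                       executor_mem_mb: int, executor_cores: int = 1,
                       overhead_fraction: float = 0.10,
                       min_overhead_mb: int = 384) -> int:
    """Executors that fit on one node, limited by both memory and cores."""
    overhead_mb = max(min_overhead_mb, int(executor_mem_mb * overhead_fraction))
    container_mb = executor_mem_mb + overhead_mb
    by_memory = node_mem_mb // container_mb
    by_cores = node_cores // executor_cores
    return min(by_memory, by_cores)


# Hypothetical node: 8 cores, 32768 MB usable by YARN.
small = executors_per_node(32768, 8, executor_mem_mb=4096)
large = executors_per_node(32768, 8, executor_mem_mb=16384)
print(small, large)
```

With small executors the node is limited by memory to 7 one-core executors, while with very large executors only 1 fits and the other cores sit idle.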
The Dataproc default configurations should take care of the number of executors. Dataproc also enables dynamic allocation, so executors will only be allocated if needed (according to Spark).
Spark cannot parallelize beyond the number of partitions in a Dataset/RDD. You may need to set the following properties to get good cluster utilization:
spark.default.parallelism: the default number of output partitions from transformations on RDDs (when not explicitly set)
spark.sql.shuffle.partitions: the number of output partitions from aggregations using the SQL API
Depending on your use case, it may make sense to explicitly set partition counts for each operation.
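As a sketch of how the two properties could be passed on the command line, the helper below builds --conf arguments for spark-submit. The helper name conf_args, the value 256 (twice the 128 vcores mentioned in the question) and the script name job.py are placeholders for illustration only.

```python
def conf_args(settings: dict) -> list:
    """Turn a dict of Spark properties into spark-submit --conf arguments."""
    args = []
    for key, value in sorted(settings.items()):
        args += ["--conf", f"{key}={value}"]
    return args


settings = {
    "spark.default.parallelism": 256,     # RDD transformations
    "spark.sql.shuffle.partitions": 256,  # SQL/DataFrame shuffles
}
command = ["spark-submit", *conf_args(settings), "job.py"]
print(" ".join(command))
```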
I have a cluster with 4 nodes (each with 16 cores) using Spark 1.0.1.
I have an RDD which I've repartitioned so it has 200 partitions (hoping to increase the parallelism).
When I do a transformation (such as filter) on this RDD, I can't seem to get more than 64 tasks (my total number of cores across the 4 nodes) going at one point in time. By tasks, I mean the number of tasks that appear under the Application Spark UI. I tried explicitly setting the spark.default.parallelism to 128 (hoping I would get 128 tasks concurrently running) and verified this in the Application UI for the running application but this had no effect. Perhaps, this is ignored for a 'filter' and the default is the total number of cores available.
I'm fairly new with Spark so maybe I'm just missing or misunderstanding something fundamental. Any help would be appreciated.
This is correct behavior. Each "core" can execute exactly one task at a time, with each task corresponding to a partition. If your cluster only has 64 cores, you can only run at most 64 tasks at once.
You could run multiple workers per node to get more executors. That would give you more cores in the cluster. But however many cores you have, each core will run only one task at a time.
You can see more details in the following thread:
How does Spark paralellize slices to tasks/executors/workers?
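The one-task-per-core rule from this answer can be checked with a little arithmetic. The sketch below uses the numbers from the question (200 partitions, 64 cores); the function names are made up for illustration.

```python
import math


def running_tasks(partitions: int, total_cores: int) -> int:
    """Tasks that can run at the same moment: one per core, at most one per partition."""
    return min(partitions, total_cores)


def task_waves(partitions: int, total_cores: int) -> int:
    """Number of rounds needed to work through all partitions."""
    return math.ceil(partitions / total_cores)


partitions, cores = 200, 64
print(running_tasks(partitions, cores))  # tasks running at once
print(task_waves(partitions, cores))     # rounds until the stage finishes
```

So with 200 partitions on 64 cores, at most 64 tasks run together and the stage takes 4 rounds, the last of which only has 8 tasks left. Setting spark.default.parallelism to 128 does not change either number, because the core count is the limit.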
I'm trying to understand the basics of Spark internals. The Spark documentation on submitting applications in local mode says this about the spark-submit --master setting:
local[K] Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine).
local[*] Run Spark locally with as many worker threads as logical cores on your machine.
Since all the data is stored on a single local machine, the job does not benefit from distributed operations on RDDs.
So how does it benefit, and what is going on internally, when Spark utilizes several logical cores?
The system will allocate additional threads for processing data. Despite being limited to a single machine, it can still take advantage of the high degree of parallelism available in modern servers.
If you have a reasonably sized data set, say something with a dozen partitions, you can measure the time it takes to use local[1] vs local[n] (where n is the number of cores in your machine). You can also see the difference in utilization of your machine. If you only have one core designated for use, it will only use 100% of one core (plus some extra for garbage collection). If you have 4 cores, and specify local[4], it will use 400% of a core (4 cores). And execution time can be significantly shortened (although not typically by 4x).
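One common way to see why the speedup is usually below 4x is Amdahl's law, which is brought in here as an illustration and is not something Spark itself computes. The sketch assumes a hypothetical job where 90% of the work can run in parallel and the rest (scheduling, reading results back on the driver, and so on) is serial.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Ideal speedup when only `parallel_fraction` of the work scales with cores."""
    if not 0.0 <= parallel_fraction <= 1.0 or cores < 1:
        raise ValueError("need 0 <= parallel_fraction <= 1 and cores >= 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# Hypothetical job: 90% parallel work, run with local[4].
speedup = amdahl_speedup(0.9, 4)
print(f"estimated speedup with 4 cores: {speedup:.2f}x")
```

Under that assumption, 4 cores give a bit over 3x rather than 4x. The real fraction depends on the job, so timing local[1] against local[n] as described above is still the way to find out.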