I am running a TPC-DS benchmark for Spark 3.0.1 in local mode and using sparkMeasure to get workload statistics. I have 16 total cores and SparkContext is available as
Spark context available as 'sc' (master = local[*], app id = local-1623251009819)
Q1. For local[*], driver and executors are created in a single JVM with 16 threads. Considering Spark's configuration which of the following will be true?
1 worker instance, 1 executor having 16 cores/threads
1 worker instance, 16 executors each having 1 core
For a particular query, sparkMeasure reports shuffle data as follows
shuffleRecordsRead => 183364403
shuffleTotalBlocksFetched => 52582
shuffleLocalBlocksFetched => 52582
shuffleRemoteBlocksFetched => 0
shuffleTotalBytesRead => 1570948723 (1498.0 MB)
shuffleLocalBytesRead => 1570948723 (1498.0 MB)
shuffleRemoteBytesRead => 0 (0 Bytes)
shuffleRemoteBytesReadToDisk => 0 (0 Bytes)
shuffleBytesWritten => 1570948723 (1498.0 MB)
shuffleRecordsWritten => 183364480
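For scale, here is a rough back-of-the-envelope pass over the numbers above in plain Python. The values are copied from the report; the derived variable names are only illustrative and are not sparkMeasure fields.

```python
# Values copied from the sparkMeasure report above.
metrics = {
    "shuffleRecordsRead": 183364403,
    "shuffleTotalBlocksFetched": 52582,
    "shuffleRemoteBlocksFetched": 0,
    "shuffleTotalBytesRead": 1570948723,
    "shuffleRemoteBytesRead": 0,
}

# Derived figures (illustrative names, not part of sparkMeasure).
total_mb = metrics["shuffleTotalBytesRead"] / (1024 * 1024)
bytes_per_record = metrics["shuffleTotalBytesRead"] / metrics["shuffleRecordsRead"]
bytes_per_block = metrics["shuffleTotalBytesRead"] / metrics["shuffleTotalBlocksFetched"]
remote_fraction = metrics["shuffleRemoteBytesRead"] / metrics["shuffleTotalBytesRead"]

print(f"total read: {total_mb:.1f} MB")
print(f"average bytes per record: {bytes_per_record:.2f}")
print(f"average bytes per block: {bytes_per_block:.0f}")
print(f"fraction fetched remotely: {remote_fraction:.0%}")
```

So every block was fetched locally, yet roughly 1.5 GB still went through the shuffle write/read path.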
Q2. Regardless of the query specifics, why is there data shuffling when everything is inside a single JVM?
An executor is a JVM process. When you use local[*], you run Spark locally with as many worker threads as there are logical cores on your machine, so you get 1 executor with as many worker threads as logical cores.
When you set SPARK_WORKER_INSTANCES=5 in spark-env.sh and run start-master.sh and start-slave.sh spark://local:7077 to bring up a standalone Spark cluster on your local machine, you have one master and 5 workers. To submit your application to this cluster, you must configure it like
SparkSession.builder().appName("app").master("spark://localhost:7077")
In this case you can't specify [*] or [2], for example. But when you set the master to local[*], a single JVM process is created, the master and all workers live in that process, and the JVM is destroyed after your application finishes. local[*] and spark://localhost:7077 are two separate things.
Workers do their job using tasks, and each task is actually a thread, i.e. task = thread. Workers have memory, and they assign a portion of that memory to each task so that it can do its job, such as reading part of a dataset into its own portion of memory or applying a transformation to the data it has read. When a task such as a join needs data from other partitions, a shuffle occurs regardless of whether the job runs on a cluster or locally. On a cluster, the two tasks may be on different machines, so network transfer is added to the other costs, such as one task writing its result and another task reading it. In local mode, if task B needs the data in task A's partition, task A has to write it out and task B then reads it to do its job.
Local mode is a non-distributed, single-JVM deployment mode.
Q1: It is neither. In this mode Spark spawns all execution components, namely the Driver, n threads for data processing, and the Master, in a single JVM.
If I had to map it to one of your 2 options I would say 1 worker instance, 16 executors each having 1 core, but as said, this is not the right way to look at it. The other option could be N Workers with M Executors with 1 core each, where N x M = 16.
The default parallelism is the number of threads specified in the master URL, here local[*].
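As a rough illustration of how the master URL maps to a thread count, here is a minimal sketch in plain Python. It does not use Spark; `local_threads` is a hypothetical helper, and it only handles the `local`, `local[N]` and `local[*]` forms.

```python
import os
import re
from typing import Optional


def local_threads(master: str, cpu_count: Optional[int] = None) -> int:
    """Return the worker-thread count implied by a local master URL.

    Hypothetical helper for illustration; cpu_count can be passed in
    to make the result independent of the machine running the sketch.
    """
    cores = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    if master == "local":
        return 1  # plain "local" means a single worker thread
    match = re.fullmatch(r"local\[(\*|\d+)\]", master)
    if match is None:
        raise ValueError(f"not a local master URL handled by this sketch: {master!r}")
    spec = match.group(1)
    # "*" means one thread per logical core; otherwise use the given number.
    return cores if spec == "*" else int(spec)


print(local_threads("local[*]", cpu_count=16))  # 16
print(local_threads("local[8]"))                # 8
```

On the 16-core machine from the question, local[*] therefore gives 16 threads, which is also what the default parallelism reports.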
Q2: The threads service partitions concurrently, each thread handling one partition at a time, for as many partitions as the current Stage needs; the Driver assigns the next partition to a thread when it becomes free. A shuffle marks a stage boundary, regardless of how you run, in a YARN cluster or locally.
Shuffling, what is that then? A shuffle occurs when data has to be re-arranged across partitions, e.g. for a groupBy or orderBy. We may have M partitions before the groupBy and N partitions after it. This is a wide-transformation concept at the core of Spark's parallel processing, so it applies (even) with local[*].
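The M-to-N re-arrangement can be sketched in plain Python. This is a toy model under simplifying assumptions (integer keys, `key % N` as the partitioner, everything held in lists), not Spark's shuffle implementation; all function names are made up for the example.

```python
from collections import defaultdict


def shuffle_write(map_partition, num_reduce_partitions):
    """One map-side task: bucket its records by target reduce partition."""
    buckets = defaultdict(list)
    for key, value in map_partition:
        buckets[key % num_reduce_partitions].append((key, value))
    return buckets


def shuffle_read(all_map_outputs, reduce_id):
    """One reduce-side task: collect its bucket from every map task's output."""
    records = []
    for buckets in all_map_outputs:
        records.extend(buckets.get(reduce_id, []))
    return records


def group_sum(map_partitions, num_reduce_partitions):
    """A groupBy-then-sum over M input partitions producing N output partitions."""
    map_outputs = [shuffle_write(p, num_reduce_partitions) for p in map_partitions]
    result = []
    for reduce_id in range(num_reduce_partitions):
        sums = defaultdict(int)
        for key, value in shuffle_read(map_outputs, reduce_id):
            sums[key] += value
        result.append(dict(sums))
    return result


# M = 3 input partitions with keys scattered across them, N = 2 output partitions.
map_partitions = [
    [(1, 10), (2, 20), (3, 30)],
    [(1, 1), (4, 4)],
    [(2, 2), (3, 3), (4, 40)],
]
reduced = group_sum(map_partitions, num_reduce_partitions=2)
print(reduced)
```

Nothing in the sketch depends on where the tasks run: each key's records start out spread over several input partitions and have to be written out and re-read to be summed, which is why the shuffle appears even inside one JVM.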
I've noticed that Spark can sometimes schedule all the tasks of a job onto the same executor when the other executors are busy processing tasks from another job.
Here is a quick example from the Spark shell, where I created 3 executors with 1 core each:
#1 sc.parallelize(List.range(0,1000), 2).mapPartitions(x => { Thread.sleep(100000); x }).count()
#2 sc.parallelize(List.range(0,1000), 10).mapPartitions(x => { Thread.sleep(1000); x }).count()
Job #1 hogs 2 of the executors by sleeping, and job #2 ends up running all 10 of its tasks sequentially on the 3rd executor. It is understandable from Spark's perspective that there weren't enough resources, but shouldn't it at least try to distribute the tasks, even if that means they get delayed? What happens now is that this one executor becomes a hotspot for the next stage of that job, because all the shuffle files have been persisted on that executor.
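The effect can be reproduced with a toy model in plain Python. This is only a greedy "first free slot" simulation under simplifying assumptions (FIFO order, one core per executor, no locality preferences); it is not Spark's actual scheduler, and `assign_tasks` is a made-up name.

```python
def assign_tasks(num_executors, tasks):
    """Greedy FIFO placement: each task goes to the executor that frees up first.

    tasks is a list of (job_name, duration_seconds) in submission order.
    Returns a list of (job_name, executor_index) placements.
    """
    free_at = [0.0] * num_executors  # time at which each single-core executor is free
    placement = []
    for job, duration in tasks:
        # min() returns the lowest index on ties, so idle executors fill in order.
        executor = min(range(num_executors), key=lambda i: free_at[i])
        placement.append((job, executor))
        free_at[executor] += duration
    return placement


# Job 1: 2 tasks sleeping 100 s each; job 2: 10 tasks of 1 s each.
tasks = [("job1", 100.0)] * 2 + [("job2", 1.0)] * 10
placement = assign_tasks(3, tasks)
job2_executors = {executor for job, executor in placement if job == "job2"}
print(job2_executors)  # {2}
```

In this model every job 2 task lands on the third executor, because it is always the first one to become free again.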
While this is a manually constructed example, we have noticed the same thing in our production setup, where jobs with, say, 10k tasks are executed on only a handful of executors.
To give you some background:
We create a single application and run several jobs (16 jobs in parallel, sometimes even 48) within it. Should we have 1 application per job instead? What's the rationale behind creating 1 application vs. multiple? All jobs are usually independent of each other.
I am trying to understand the relationship between different components and elements in the Spark architecture but am unable to get a grip on it. Can someone please validate my assumptions and correct me where I am wrong.
My understanding - A node is the actual physical machine. One node can contain the main driver while others will contain the workers.
Q - Can a node have multiple drivers (if I have multiple applications)?
My understanding - A worker is a process within a node. There can be multiple workers within each node though not recommended.
My understanding - An executor is a sub-process(?) within a worker process. Each worker can have multiple executors.
Q. What metric determines the number of executors per worker?
Q. Is the idea of a JVM associated with an executor process or at a higher "worker" level?
Q. What is the relationship between a core and an executor?
Q - Can RAM and HD be allocated at an executor level?
For e.g., if I have a worker node with 100GB of RAM and 5 TB HD, can I allocate 20 GB RAM and 1 TB HDD per executor for that worker?
My understanding - A partition is a portion of the actual data. This split could happen using hashing, round robin or range.
Q - What determines the location of these data partitions?
For e.g., if I have a cluster with 2 nodes, 10 executors (5 executors in each node) and a dataframe with 20 partitions, I'm assuming I would have 2 partitions in each executor or is there a chance that partition distribution could be skewed? What would I need to do to ensure that all my partitions that have a certain partitioning key get co-located within the same worker so there is minimum network transfer when these partitions have to work together to, say, perform an aggregation or a join?
Q - What happens when a repartition() is performed. For e.g., if I have 20 partitions across 10 executors (say, 2 partitions in each) and I repartition(2). I will now have only 2 portions of data which I assume would be resting in a couple of executors. What happens to the remaining executors?
Assumption - A task is the lowest unit of work that performs the actual ask. The number of tasks depends on the number of partitions. So, if there are 20 partitions, I would have 20 tasks in each stage.
Q - Are these tasks performed by individual executors?
Q - If I have fewer executors (say, 10) than partitions (say, 20), does it mean that only 10 tasks will be executed in parallel at any point? Is the degree of parallelism constrained by the number of executors?
Thanks in advance!
My understanding - A node is the actual physical machine. One node can contain the main driver while others will contain the workers. (This is correct as a starting point.)
Q - Can a node have multiple drivers (if I have multiple applications)?
Yes, because a driver is just a process created for the program you have written, and you can have multiple processes running on the same node.
My understanding - A worker is a process within a node. There can be multiple workers within each node though not recommended.
Your understanding here is not quite right: "worker" usually refers to the node or machine itself. Whether you say "worker" or "worker node", both mean the same thing.
My understanding - An executor is a sub-process(?) within a worker process. Each worker can have multiple executors.
An executor is a process on the worker node, and a single worker node can have multiple executors.
Q. What metric determines the number of executors per worker?
The configuration (number of cores and memory) of your worker node determines the maximum number of executors that can run on it.
Q. Is the idea of a JVM associated with an executor process or at a higher "worker" level?
It is associated with the executor process. A Spark executor is a single JVM instance on a node that serves a single Spark application.
Q. What is the relationship between a core and an executor?
The cores property controls the number of concurrent tasks an executor can run. For example, if you request 2 executors with 2 cores each, you can run 4 tasks concurrently during your job execution.
Q - Can RAM and HD be allocated at an executor level?
For e.g., if I have a worker node with 100GB of RAM and 5 TB HD, can I allocate 20 GB RAM and 1 TB HDD per executor for that worker?
Generally, Spark performs its computation in memory. RAM is allocated at the executor level, while disk is allocated at the worker-node level only. Spark spills data to disk only when it does not fit in memory.
My understanding - A partition is a portion of the actual data. This split could happen using hashing, round robin or range.
Q - What determines the location of these data partitions?
These partitions could be anywhere, and in most cases they will not be evenly distributed. Some executors may not hold a single partition while others hold more than 2.
To co-locate rows that share a key, repartition the data by the relevant column of your dataframe; Spark then partitions the data by that column's values and makes sure that equal values end up in the same partition.
When you repartition the data into 2 partitions, Spark shuffles the data between all the executors and breaks it into 2 partitions. Those partitions could be on any of the executors, and the other executors would be empty or idle in that case.
Assumption - A task is the lowest unit of work that performs the actual ask. The number of tasks dependent on the number of partitions. So, if there are 20 partitions, I would have 20 tasks in each stage.
You would have 20 tasks for that specific stage, but the number won't necessarily stay the same across stages, because a new stage is created whenever a data shuffle has to happen. If your code involves no shuffle, it creates a single stage with 20 tasks.
Q - Are these tasks performed by individual executors? Yes
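Q - If I have less executors (say, 10) than the partitions (say, 20), does it mean that only 10 tasks will be executed in parallel at any point? Is the degree of parallelism constrained by the number of executors? Yes, or more precisely by the total number of executor cores: with 10 single-core executors, only 10 tasks run at a time.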
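The core/executor arithmetic in these answers can be written down as a small plain-Python sketch. The function names are made up for illustration; `cpus_per_task` stands in for Spark's `spark.task.cpus` setting, which defaults to 1.

```python
import math


def task_slots(num_executors, cores_per_executor, cpus_per_task=1):
    """Number of tasks that can run at the same time across all executors."""
    return num_executors * (cores_per_executor // cpus_per_task)


def waves(num_partitions, slots):
    """How many rounds of tasks a stage needs when each slot runs one task at a time."""
    return math.ceil(num_partitions / slots)


def idle_slots(num_partitions, slots):
    """Slots left without work when a stage has fewer tasks than slots."""
    return max(slots - num_partitions, 0)


print(task_slots(2, 2))                 # 2 executors x 2 cores -> 4 concurrent tasks
print(waves(20, task_slots(10, 1)))     # 20 partitions on 10 single-core executors -> 2 waves
print(idle_slots(2, task_slots(10, 1))) # after repartition(2): 8 of 10 slots idle
```

This assumes all tasks take about the same time; with skewed partitions the number of waves is only a lower bound on how long a stage takes.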
I'm running a Spark batch job on AWS Fargate in standalone mode. The compute environment has 8 vCPUs, and the job definition has 1 vCPU and 2048 MB of memory. In the Spark application I can specify how many cores I want to use, which I do with the code below:
from pyspark.sql import SparkSession

sparkSess = SparkSession.builder.master("local[8]")\
    .appName("test app")\
    .config("spark.debug.maxToStringFields", "1000")\
    .config("spark.sql.sources.partitionOverwriteMode", "dynamic")\
    .getOrCreate()
local[8] is specifying 8 cores/threads (that’s what I'm assuming).
Initially I was running the Spark app without specifying cores. I think the job was running in a single thread, and it took around 10 minutes to complete. Specifying a number reduces the processing time: I started with 2, which brought it down to almost 5 minutes, and then changed to 4 and 8, and now it takes almost 4 minutes. But I don't understand the relation between vCPUs and Spark threads. Whatever number I specify for cores, sparkContext.defaultParallelism shows me that value.
Is this the correct way? Is there any relation between this number and the vCPUs that I specify in the job definition or compute environment?
You are running in Spark Local Mode. Learning Spark has this to say about Local mode:
Spark driver runs on a single JVM, like a laptop or single node
Spark executor runs on the same JVM as the driver
Cluster manager runs on the same host
Damji, Jules S., Wenig, Brooke, Das, Tathagata, Lee, Denny. Learning Spark (p. 30). O'Reilly Media. Kindle Edition.
local[N] launches with N threads. Given the above definition of Local Mode, those N threads must be shared by the Local Mode Driver, Executor and Cluster Manager.
As such, from the available vCPUs, allotting one vCPU for the Driver thread, one for the Cluster Manager, one for the OS and the remaining for Executor seems reasonable.
The optimal number of threads/vCPUs for the Executor will depend on the number of partitions your data has.
I have installed Spark on a master and 2 workers. Each worker machine has 8 cores. When I start the master, the workers start properly without any problem, but the Spark GUI shows that each worker has only 2 cores assigned.
How can I increase the number of cores so that each worker works with all 8?
The setting that controls cores per executor is spark.executor.cores. See the docs. It can be set either via a spark-submit command-line argument or in spark-defaults.conf. The file is usually located in /etc/spark/conf, though this varies by installation. You can search for the conf file with find / -type f -name spark-defaults.conf
spark.executor.cores 8
However the setting does not guarantee that each executor will always get all the available cores. This depends on your workload.
If you schedule tasks on a dataframe or rdd, spark will run a parallel task for each partition of the dataframe. A task will be scheduled to an executor (separate jvm) and the executor can run multiple tasks in parallel in jvm threads on each core.
Also, an executor will not necessarily run on a separate worker. If there is enough memory, 2 executors can share a worker node.
In order to use all the cores, the setup in your case could look as follows,
given that you have 10 GB of memory on each node:
spark.default.parallelism 14
spark.executor.instances 2
spark.executor.cores 7
spark.executor.memory 9g
Setting memory to 9g makes sure each executor is assigned to a separate node. Each executor will have 7 cores available, and each dataframe operation will be scheduled as 14 concurrent tasks, 7 on each executor. You can also repartition a dataframe instead of setting spark.default.parallelism. One core and 1 GB of memory are left for the operating system.
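The numbers in this setup follow from simple arithmetic, which the plain-Python sketch below spells out. `size_executors` is a hypothetical helper, and the "one executor per node, reserve 1 core and 1 GB for the OS" rule is just the assumption used in this answer, not a general recommendation.

```python
def size_executors(num_nodes, cores_per_node, mem_gb_per_node,
                   reserved_cores=1, reserved_mem_gb=1):
    """Derive the settings above for one executor per node.

    Assumes one core and 1 GB per node are held back for the operating system.
    """
    executor_cores = cores_per_node - reserved_cores
    executor_mem_gb = mem_gb_per_node - reserved_mem_gb
    return {
        "spark.executor.instances": num_nodes,
        "spark.executor.cores": executor_cores,
        "spark.executor.memory": f"{executor_mem_gb}g",
        # One task per executor core across the whole cluster.
        "spark.default.parallelism": num_nodes * executor_cores,
    }


settings = size_executors(num_nodes=2, cores_per_node=8, mem_gb_per_node=10)
for key, value in settings.items():
    print(key, value)
```

With 2 workers of 8 cores and 10 GB each, this reproduces the four values listed above.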
I am running Spark v1.6.1 in standalone mode on a single machine with 64 GB of RAM and 16 cores.
I have created five worker instances in order to get five executors, since in standalone mode there cannot be more than one executor per worker node.
Configuration:
SPARK_WORKER_INSTANCES 5
SPARK_WORKER_CORE 1
SPARK_MASTER_OPTS "-Dspark.deploy.default.Cores=5"
all other configurations are default in spark_env.sh
I am running a spark streaming direct kafka job at an interval of 1 min, which takes data from kafka and after some aggregation write the data to mongo.
Problems:
When I start the master and slaves, one master process and five worker processes are started, each consuming only about 212 MB of RAM. When I submit the job, 5 executor processes and 1 job process are created as well, and memory usage grows to 8 GB in total and keeps growing slowly over time, even when there is no data to process.
We unpersist the cached RDD at the end and have also set spark.cleaner.ttl to 600, but memory still keeps growing.
One more thing: I have seen that SPARK-1706 was merged, so why am I unable to create multiple executors within a worker? Also, in the spark_env.sh file, every executor-related setting is listed as applying to YARN mode only.
Any help would be greatly appreciated,
Thanks