What is the Spark cluster equivalent of standalone mode's local[N]? That is, which parameter in cluster mode takes the value we pass as N in local[N]?
In local[N], N is the maximum number of cores that can be used on the node at any point in time.
In cluster mode you can set --executor-cores N.
It means that each executor can run a maximum of N tasks at the same time.
In cluster mode, one executor will run on one worker node, which means that one executor will take all the cores on the worker node. This can result in under-utilization of resources. Keep in mind that the driver will also take up one worker node.
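For illustration, a hedged sketch of how that looks from code, assuming a standalone cluster whose master URL is spark://master:7077 (a placeholder); spark.executor.cores is the configuration property behind --executor-cores:

from pyspark.sql import SparkSession

# spark.executor.cores plays the role that N plays in local[N],
# but per executor rather than per local JVM.
spark = (SparkSession.builder
         .master("spark://master:7077")        # placeholder master URL
         .appName("cluster-cores-example")
         .config("spark.executor.cores", "4")  # same effect as spark-submit --executor-cores 4
         .config("spark.executor.memory", "8g")
         .getOrCreate())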
Related
I see in my Spark job that usually (but not always) only one partition is being processed on each node. What could be the possible reasons? How can I debug it?
You should check the executors' resource configuration:
spark.executor.memory
spark.executor.cores
These configs control how many executors can run concurrently on each node and therefore how many partitions are processed concurrently (by default, each executor core processes a single partition at a time).
For example, if your nodes have 8 cores and 32 GB of memory each and your Spark application is configured with:
spark.executor.memory=25g
spark.executor.cores=3
only one executor will be able to run concurrently on each node; to run 2 executors concurrently, each node would need at least 50 GB of memory.
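The arithmetic can be sketched as a small helper (illustrative only; it ignores spark.executor.memoryOverhead and other reserved memory):

def executors_per_node(node_cores, node_memory_gb, executor_cores, executor_memory_gb):
    # An executor needs both its cores and its memory to fit on the node,
    # so the scarcer resource decides how many can run concurrently.
    by_cores = node_cores // executor_cores
    by_memory = node_memory_gb // executor_memory_gb
    return min(by_cores, by_memory)

# 8-core / 32 GB nodes with spark.executor.cores=3, spark.executor.memory=25g:
print(executors_per_node(8, 32, 3, 25))   # 1 -> memory is the bottleneck
# The same nodes with spark.executor.memory=12g would fit 2 executors:
print(executors_per_node(8, 32, 3, 12))   # 2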
I am running a TPC-DS benchmark for Spark 3.0.1 in local mode and using sparkMeasure to get workload statistics. I have 16 total cores and SparkContext is available as
Spark context available as 'sc' (master = local[*], app id = local-1623251009819)
Q1. For local[*], the driver and executors are created in a single JVM with 16 threads. Considering Spark's configuration, which of the following will be true?
1 worker instance, 1 executor having 16 cores/threads
1 worker instance, 16 executors each having 1 core
For a particular query, sparkMeasure reports shuffle data as follows
shuffleRecordsRead => 183364403
shuffleTotalBlocksFetched => 52582
shuffleLocalBlocksFetched => 52582
shuffleRemoteBlocksFetched => 0
shuffleTotalBytesRead => 1570948723 (1498.0 MB)
shuffleLocalBytesRead => 1570948723 (1498.0 MB)
shuffleRemoteBytesRead => 0 (0 Bytes)
shuffleRemoteBytesReadToDisk => 0 (0 Bytes)
shuffleBytesWritten => 1570948723 (1498.0 MB)
shuffleRecordsWritten => 183364480
Q2. Regardless of the query specifics, why is there data shuffling when everything is inside a single JVM?
An executor is a JVM process. When you use local[*] you run Spark locally with as many worker threads as logical cores on your machine, so: 1 executor and as many worker threads as logical cores.
When you configure SPARK_WORKER_INSTANCES=5 in spark-env.sh and execute start-master.sh and start-slave.sh spark://localhost:7077 to bring up a standalone Spark cluster on your local machine, you have one master and 5 workers. If you want to send your application to this cluster, you must configure the application like SparkSession.builder().appName("app").master("spark://localhost:7077"); in this case you can't specify [*] or [2], for example. But when you specify the master to be local[*], a single JVM process is created, and the master and all workers live in that JVM process; after your application finishes, that JVM instance is destroyed. local[*] and spark://localhost:7077 are two separate things.
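To make the contrast concrete, here is a hedged sketch of the two ways to set the master; the standalone URL assumes you have started a local cluster as described above:

from pyspark.sql import SparkSession

# Option A: local mode -- driver, "master" and all worker threads live in one JVM.
spark_local = (SparkSession.builder
               .master("local[*]")    # [*] = one thread per logical core
               .appName("app")
               .getOrCreate())

# Option B: a standalone cluster started with start-master.sh / start-slave.sh.
# The master URL replaces local[*]; you cannot pass [*] or [2] here.
# spark_cluster = (SparkSession.builder
#                  .master("spark://localhost:7077")
#                  .appName("app")
#                  .getOrCreate())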
Workers do their job using tasks, and each task is actually a thread, i.e. task = thread. Workers have memory, and they assign a memory partition to each task so it can do its job, such as reading a part of a dataset into its own memory partition or performing a transformation on the data it has read. When a task such as a join needs other partitions, a shuffle occurs regardless of whether the job runs in cluster or local mode. In a cluster, two tasks may be on different machines, so network transmission is added on top of the other work, such as writing the result and then having it read by another task. In local mode, if task B needs the data in task A's partition, task A must write it down and then task B reads it to do its job.
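A minimal sketch of that behaviour, assuming a recent PySpark; the table sizes and app name are illustrative, and broadcast joins are disabled only so the demo is forced to shuffle:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")
         .appName("local-shuffle-demo")
         .config("spark.sql.autoBroadcastJoinThreshold", "-1")  # force a shuffle join for the demo
         .getOrCreate())

a = spark.range(0, 1_000_000).withColumnRenamed("id", "k")
b = spark.range(0, 1_000_000).withColumnRenamed("id", "k")

# Rows with the same key must end up in the same partition, so the data is
# re-arranged (shuffled) even though every task runs inside this single JVM.
joined = a.join(b, "k")
joined.explain()       # the physical plan shows Exchange (shuffle) operators
print(joined.count())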
Local mode is the same as a non-distributed, single-JVM deployment mode.
Q1: It is neither. In this mode Spark spawns all execution components, namely the Driver, n threads for data processing, and the Master, in a single JVM.
If I had to map it onto one of your 2 options I would say 1 worker instance with 16 executors each having 1 core, but as said this is not the right way to look at it. The other option could be N Workers with M Executors of 1 core each, where N x M = 16.
The default parallelism is the number of threads as specified in the master URL, local[*].
Q2: The threads service partitions concurrently, one at a time, as many as needed, sequentially within the current Stage, and are assigned by the Driver when free. A Stage boundary entails shuffling, regardless of how you run, on a YARN cluster or locally.
So what is shuffling, then? A shuffle occurs when data needs to be re-arranged across the existing partitions, e.g. by a groupBy or orderBy: we may have M partitions before the groupBy and N partitions after it. This is a wide-transformation concept at the core of Spark's parallel processing, so even with local[*] it applies.
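Here is a minimal sketch of that M-to-N behaviour in local mode; the partition counts (16 in, 8 out) and the app name are just illustrative, and adaptive execution is disabled so the output stays predictable:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")
         .appName("wide-transformation-demo")
         .config("spark.sql.shuffle.partitions", "8")     # N: partitions after the shuffle
         .config("spark.sql.adaptive.enabled", "false")   # keep the partition count predictable
         .getOrCreate())

df = spark.range(0, 1_000_000, numPartitions=16)          # M = 16 input partitions
grouped = df.groupBy((df.id % 100).alias("bucket")).count()

print(df.rdd.getNumPartitions())        # 16 -> M
print(grouped.rdd.getNumPartitions())   # 8  -> N (spark.sql.shuffle.partitions)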
I'm running a Spark batch job on AWS Fargate in standalone mode. In the compute environment I have 8 vCPUs, and the job definition has 1 vCPU and 2048 MB of memory. In the Spark application I can specify how many cores I want to use, and I'm doing that with the code below:
from pyspark.sql import SparkSession

sparkSess = SparkSession.builder.master("local[8]")\
    .appName("test app")\
    .config("spark.debug.maxToStringFields", "1000")\
    .config("spark.sql.sources.partitionOverwriteMode", "dynamic")\
    .getOrCreate()
local[8] specifies 8 cores/threads (that's what I'm assuming).
Initially I was running the Spark app without specifying cores, and I think the job was running in a single thread and taking around 10 minutes to complete, but specifying this number reduces the processing time. I started with 2 and it dropped to almost 5 minutes; then I changed it to 4 and then 8, and now it takes almost 4 minutes. But I don't understand the relation between vCPUs and Spark threads. Whatever number I specify for cores, sparkContext.defaultParallelism shows me that value.
Is this the correct way? Is there any relation between this number and the vCPUs that I specify in the job definition or the compute environment?
You are running in Spark Local Mode. Learning Spark has this to say about Local mode:
Spark driver runs on a single JVM, like a laptop or single node
Spark executor runs on the same JVM as the driver
Cluster manager runs on the same host
Damji, Jules S.,Wenig, Brooke,Das, Tathagata,Lee, Denny. Learning Spark (p. 30). O'Reilly Media. Kindle Edition.
local[N] launches with N threads. Given the above definition of Local Mode, those N threads must be shared by the Local Mode Driver, Executor and Cluster Manager.
As such, from the available vCPUs, allotting one vCPU for the Driver thread, one for the Cluster Manager, one for the OS and the remaining for Executor seems reasonable.
The optimal number of threads/vCPUs for the Executor will depend on the number of partitions your data has.
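As a hedged sketch of that advice, you could derive N for local[N] from the vCPUs the container actually reports; the headroom of 3 and the app name are assumptions, not fixed rules:

import os
from pyspark.sql import SparkSession

available = os.cpu_count() or 1
# Leave roughly one vCPU each for the driver thread, the cluster manager and the OS
# (as suggested above) and give the rest to executor threads, with a floor of 1.
executor_threads = max(1, available - 3)

spark = (SparkSession.builder
         .master(f"local[{executor_threads}]")
         .appName("fargate-batch")   # app name is illustrative
         .getOrCreate())

print(spark.sparkContext.defaultParallelism)   # equals executor_threads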
I have installed Spark on a master and 2 workers. The original number of cores per worker is 8. When I start the master, the workers work properly without any problem, but the problem is that in the Spark GUI each worker has only 2 cores assigned.
Kindly, how can I increase the number of cores so that each worker works with all 8 cores?
The setting which controls cores per executor is spark.executor.cores. See the docs. It can be set either via a spark-submit command-line argument or in spark-defaults.conf. The file is usually located in /etc/spark/conf (ymmv). You can search for the conf file with find / -type f -name spark-defaults.conf
spark.executor.cores 8
However, the setting does not guarantee that each executor will always get all the available cores. This depends on your workload.
If you schedule tasks on a DataFrame or RDD, Spark will run a parallel task for each partition of the DataFrame. A task is scheduled to an executor (a separate JVM), and the executor can run multiple tasks in parallel in JVM threads, one per core.
Also, an executor will not necessarily run on a separate worker. If there is enough memory, 2 executors can share a worker node.
In order to use all the cores, the setup in your case could look as follows, given you have 10 GB of memory on each node:
spark.default.parallelism 14
spark.executor.instances 2
spark.executor.cores 7
spark.executor.memory 9g
Setting the memory to 9g will make sure each executor is assigned to a separate node. Each executor will have 7 cores available, and each DataFrame operation will be scheduled as 14 concurrent tasks, distributed 7 to each executor. You can also repartition a DataFrame instead of setting default.parallelism. One core and 1 GB of memory are left for the operating system.
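For illustration, the same setup passed programmatically; spark://master:7077 is a placeholder for your master URL, and depending on your cluster manager you may prefer spark-submit flags or spark-defaults.conf instead:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("spark://master:7077")        # placeholder: your cluster's master URL
         .appName("use-all-cores")
         .config("spark.executor.instances", "2")
         .config("spark.executor.cores", "7")
         .config("spark.executor.memory", "9g")
         .config("spark.default.parallelism", "14")
         .getOrCreate())

print(spark.sparkContext.defaultParallelism)       # 14 if the setting was picked up
df = spark.range(0, 1_000_000).repartition(14)     # repartitioning is the alternative mentioned above
print(df.rdd.getNumPartitions())                   # 14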
Here's my problem. Let's say I have a long-running PySpark framework. It has thousands of tasks that can all be executed in parallel. I get allocated 1,000 cores at the beginning on many different hosts. Each task needs one core. Then, when those finish, the host holds onto one core and has no active tasks. Since there are a large number of hosts, what can happen is that a larger and larger percentage of my cores are allocated to executors that don't have any active tasks. So I can have 1000 cores allocated, but only 100 active tasks. The other 900 cores are allocated to executors that have no active tasks. How can I improve this? Is there a way to shut down executors that aren't doing anything? I am currently using PySpark 1.2, so it'd be great for the functionality to be in that version, but would be happy to hear about solutions (or better solutions) in newer versions. Thanks!
If you do not specify the number of executors Spark should use, Spark allocates executors as long as it has at least 1 task pending in its queue (provided dynamic allocation is enabled). You can set an upper limit on the number of executors that Spark can dynamically allocate with this parameter: spark.dynamicAllocation.maxExecutors.
In other words, when launching Spark, use:
pyspark --master yarn-client --conf spark.dynamicAllocation.maxExecutors=1000
instead of
pyspark --master yarn-client --num-executors=1000
By default, Spark will release executors after 60 seconds of inactivity.
Note: if you .persist() your Spark DataFrames, make sure to .unpersist() them, otherwise Spark will not release the executors.
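A hedged sketch of how this fits together on a newer Spark version; the input path is hypothetical, and your cluster must provide the external shuffle service for executors to be removed safely:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("yarn")   # on newer versions, yarn-client is spelled --master yarn --deploy-mode client
         .appName("dynamic-allocation-demo")
         .config("spark.dynamicAllocation.enabled", "true")
         .config("spark.shuffle.service.enabled", "true")    # keep shuffle files available after executor removal
         .config("spark.dynamicAllocation.maxExecutors", "1000")
         .config("spark.dynamicAllocation.executorIdleTimeout", "60s")  # the 60s default mentioned above
         .getOrCreate())

df = spark.read.parquet("hdfs:///path/to/input")   # hypothetical input path
df.persist()
# ... run the parallel work ...
df.unpersist()   # release cached blocks so idle executors can actually be removed
spark.stop()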