What factors affect how many Spark jobs run concurrently? - apache-spark

We recently set up the Spark Job Server, to which our Spark jobs are submitted. But we found that our 20-node (8 cores/128 GB memory per node) Spark cluster can only run 10 Spark jobs concurrently.
Can someone share some detail about which factors actually affect how many Spark jobs can run concurrently? How can we tune the configuration so that we can take full advantage of the cluster?

The question is missing some context, but first: it seems Spark Job Server itself limits the number of concurrent jobs (unlike Spark, which puts a limit on the number of tasks, not jobs):
From application.conf:

    # Number of jobs that can be run simultaneously per context
    # If not set, defaults to number of cores on machine where jobserver is running
    max-jobs-per-context = 8
If that's not the issue (you've set the limit higher, or are using more than one context), then the total number of cores in the cluster (8 * 20 = 160) is the maximum number of concurrent tasks. If each of your jobs creates 16 tasks, Spark will queue the next incoming job while waiting for CPUs to become available.
Spark creates a task per partition of the input data. The number of partitions is decided by the layout of the input on disk, or by calling repartition or coalesce on the RDD/DataFrame to change the partitioning manually. Some operations that combine more than one RDD (e.g. union) can also change the number of partitions.
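A minimal sketch of inspecting and changing the partition (and hence task) count, assuming a SparkSession named spark; the path and partition counts are illustrative:

    val df = spark.read.parquet("/data/events")  // partitioning decided by the input layout on disk
    println(df.rdd.getNumPartitions)             // number of tasks a stage over this data will create
    val wider    = df.repartition(200)           // full shuffle into 200 partitions
    val narrower = df.coalesce(10)               // merge down to 10 partitions without a full shuffle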

Some things that could limit the parallelism that you're seeing:
If your job consists only of map operations (or other shuffle-free operations), it will be limited to the number of partitions your data has. So even if you have 20 executors, if you have 10 partitions of data, it will spawn only 10 tasks (unless the data is splittable, as with Parquet, LZO-indexed text, etc.).
If you're performing a take() operation (without a shuffle), Spark performs an exponential take: it starts with one task and grows the number of partitions scanned until it has collected enough data to satisfy the take, as the sketch below illustrates.
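For illustration, a small sketch of that behavior, assuming a SparkSession named spark (the numbers are arbitrary):

    // take() first scans one partition, then exponentially grows the number of
    // partitions scanned until enough rows are collected, so even with 100
    // partitions this usually runs only a handful of tasks.
    val rdd = spark.sparkContext.parallelize(1 to 1000000, numSlices = 100)
    val firstFive = rdd.take(5) // typically satisfied by the first partition alone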
Can you share more about your workflow? That would help us diagnose it.

Related

Can the number of executor cores be greater than the total number of spark tasks? [duplicate]

What happens when number of spark tasks be greater than the executor core? How is this scenario handled by Spark

Can the number of Spark tasks be greater than the number of executor cores?

What happens when the number of Spark tasks is greater than the number of executor cores? How is this scenario handled by Spark?
Is this related to this question?
Anyway, you can check this Cloudera how-to. In the "Tuning Resource Allocation" section, it explains that a Spark application can request executors dynamically by turning on the dynamic allocation property. It's also important to set cluster properties such as num-executors, executor-cores and executor-memory so that the Spark request fits into what your resource manager has available.
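For example, a spark-submit invocation in the spirit of that how-to (the class name, jar and all values are illustrative, not recommendations):

    spark-submit \
      --conf spark.dynamicAllocation.enabled=true \
      --conf spark.dynamicAllocation.maxExecutors=19 \
      --executor-cores 5 \
      --executor-memory 20g \
      --class com.example.MyApp my-app.jar
    # Note: dynamic allocation on YARN also requires the external shuffle
    # service (spark.shuffle.service.enabled=true) on the node managers.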
Yes, this scenario can happen, and in this case some of the cores will be idle. Scenarios where this can happen:
You call coalesce or repartition with a number of partitions < the number of cores.
You use the default value of spark.sql.shuffle.partitions (200) while having more than 200 cores available. This will be an issue for joins, sorting and aggregation; in that case you may want to increase spark.sql.shuffle.partitions (see the sketch below).
Note that even if you have enough tasks, some (or most) of them could be empty. This can happen if you have large data skew, or if you do something like groupBy() or a Window without a partitionBy. In that case the empty partitions will finish immediately, leaving most of your cores idle.
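A minimal sketch of that tuning, assuming a SparkSession named spark; ordersDf and customersDf are hypothetical DataFrames:

    // Raise the shuffle partition count above the 200 default so that
    // joins/aggregations can keep more than 200 cores busy.
    spark.conf.set("spark.sql.shuffle.partitions", 400)    // e.g. 2-3x the total core count
    val joined = ordersDf.join(customersDf, "customer_id") // this shuffle now produces 400 partitions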
I think the question is a little off beam; what you ask is unlikely. Why?
With a lot of data you will have many partitions, and you may repartition.
Say you have 10,000 partitions, which equates to 10,000 tasks.
An executor core serves a partition, effectively a task (a 1:1 mapping), and when finished moves on to the next task, until all tasks in the stage are finished; then the next stage starts (if it is in the plan / DAG).
It's more likely that you will not have a cluster of 10,000 executor cores at most places (for your app), but there are sites that have that, that is true.
If you have more cores allocated than needed, they remain idle and non-usable for others. But with dynamic resource allocation, executors can be relinquished. I have worked with YARN and Spark Standalone; how this works with Kubernetes I am not sure.
Transformations alter what you need in terms of resources. E.g. an order by may result in fewer partitions and thus may contribute to idleness.

What is the relationship between tasks and partitions?

Can I say:
1. The number of Spark tasks is equal to the number of Spark partitions?
2. One executor run (a batch inside an executor) is equal to one task?
3. Every task produces only one partition?
4. (duplicate of 1.)
The degree of parallelism, i.e. the number of tasks that can run concurrently, is set by:
The number of executor instances (configuration)
The number of cores per executor (configuration)
The number of partitions being used (code)
The actual parallelism is the smaller of:
executors * cores, which gives the number of slots available to run tasks
partitions: each partition will translate into a task whenever a slot opens up
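A quick sketch of that rule, assuming an existing SparkSession spark and an RDD rdd; the config keys are the standard ones behind --num-executors and --executor-cores:

    val conf  = spark.sparkContext.getConf
    val slots = conf.getInt("spark.executor.instances", 1) *
                conf.getInt("spark.executor.cores", 1)         // available task slots
    val concurrentTasks = math.min(slots, rdd.getNumPartitions) // actual parallelism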
Tasks that run on the same executor share the same JVM. The Broadcast feature relies on this: only one copy of the broadcast data is needed per executor, and all of its tasks can access it through shared memory, as sketched below.
You can have multiple executors running on the same machine or on different machines. Executors are the true means of scalability.
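A minimal broadcast sketch, assuming a SparkSession spark and a hypothetical RDD[Int] named rdd:

    val lookup = Map(1 -> "one", 2 -> "two")
    val bc = spark.sparkContext.broadcast(lookup)           // one copy shipped per executor
    val labelled = rdd.map(x => bc.value.getOrElse(x, "?")) // all tasks in an executor share bc.value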
Note that each Task takes up one Thread ¹, and is assumed to be assigned to one core ².
So -
Is the number of Spark tasks equal to the number of Spark partitions?
No (see previous).
Is one executor run (a batch inside an executor) equal to one task?
An executor is started as an environment for tasks to run in. Multiple tasks will run concurrently within that executor (multiple threads).
Does every task produce only one partition?
For a task, it is one partition in, one partition out. However, a repartitioning or shuffle/sort can happen in between tasks.
Is the number of Spark tasks equal to the number of Spark partitions?
Same as (1.)
(¹) The assumption is that within your tasks you are not multithreading yourself (never do that, otherwise the core estimate will be off).
(²) Note that due to hyper-threading you might have more than one virtual core per physical core, and thus several threads per core. You might even be able to handle multiple threads (2 to 3) on a single core without hyper-threading.
Partitions are a feature of RDDs and are only available at design time (before an action is called).
Tasks are part of a TaskSet, per stage, per ActiveJob in a Spark application.
Is the number of Spark tasks equal to the number of Spark partitions?
Yes.
Is one executor run (a batch inside an executor) equal to one task?
That recursively uses "executor" and does not make much sense to me.
Does every task produce only one partition?
Almost.
Every task produces the output of executing the code (it was created for) over the data in one partition.
Is the number of Spark tasks equal to the number of Spark partitions?
Almost.
The number of Spark tasks in a single stage equals the number of RDD partitions.
1. Is the number of Spark tasks equal to the number of Spark partitions?
Yes.
Spark breaks the data up into chunks called partitions. A partition is a collection of rows that sits on one physical machine in the cluster. The default partition size is 128 MB. Partitions allow every executor to perform work in parallel, but one partition will have a parallelism of only one, even if you have many executors.
Likewise, many partitions with only one executor will give you a parallelism of only one. You need to balance the number of executors and partitions to achieve the desired parallelism. Each partition is processed by only one executor (1 executor for 1 partition for 1 task at a time).
A good rule of thumb is that the number of partitions should be larger than the number of executors in your cluster.
See also: Chambers, Bill; Zaharia, Matei. Spark: The Definitive Guide: Big Data Processing Made Simple (p. 27). O'Reilly Media. Kindle Edition.
2. Is one executor run (a batch inside an executor) equal to one task?
Cores are slots for tasks, and each executor can process more than one partition at a time if it has more than one core.
3. Does every task produce only one partition?
It depends on the transformation.
Spark has wide transformations and narrow transformations.
Wide transformation: the input partitions contribute to many output partitions (shuffles: aggregations, sorts, joins). Often referred to as a shuffle, whereby Spark exchanges partitions across the cluster. When we perform a shuffle, Spark writes the results to disk.
Narrow transformation: each input partition contributes to only one output partition.
See also: Chambers, Bill; Zaharia, Matei. Spark: The Definitive Guide: Big Data Processing Made Simple. O'Reilly Media. Kindle Edition.
Note: reading a file is a narrow transformation because it does not require a shuffle, but when you read a file in a splittable format like Parquet, the file will be split into many partitions.
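A small illustration of narrow vs. wide, assuming a SparkSession named spark:

    val nums = spark.sparkContext.parallelize(1 to 100, numSlices = 8)
    val squared = nums.map(n => n * n)       // narrow: 8 partitions in, 8 partitions out
    val sums = squared.map(n => (n % 3, n))
                      .reduceByKey(_ + _)    // wide: shuffles data across the cluster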

Apache Spark - restricting per stage concurrent thread count

I want to control the number of partitions being processed concurrently at every stage of my Spark RDD. .repartition(...) is not the solution, as that only changes the overall number of partitions in a stage, not how many of them are being processed at once.
Normally you restrict the number of concurrently processed partitions up front using the --executor-cores and --num-executors parameters, and even that is not exact, as the processing of stages can be staggered, etc.
The main thing I want to accomplish is a data-load process from a database with certain resource restrictions (a limited number of concurrent connections), but I do not want those database restrictions to dictate the concurrency of the rest of my Spark process or RDD. I also do not want to force very large partitions at the beginning of the process that would then have to be split and redistributed further.
It seems like a reasonable thing to expect, but at first glance not something that can be accomplished within the Spark API.
Example (some pseudo code):

    rdd = pseudoReadFromJDBC(partitions = 500, parallelism = 10)
      .repartition(100)
      .parallelism(50)
      .operatorOnRDD();
So in this case, in the first stage, I would split the data retrieved from the JDBC query into 500 smaller datasets, but restrict Spark to running only 10 of them simultaneously, so that at most 10 JDBC connections are open at the same time. The other partitions would just queue up.
Then in the second stage I might repartition, but more importantly I would want to choose a higher degree of actual parallelism, because I am no longer restricted by the database's limit on simultaneous connections.
That is what I mean with changing it on a per stage basis.
There is one parameter, spark.default.parallelism. You can try changing its value.
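One common workaround (not a true per-stage thread limit) is to bound the number of simultaneous JDBC connections by reading with only 10 partitions, then repartitioning so that later stages can use more parallelism. A sketch, with an illustrative URL, table, column and bounds:

    import java.util.Properties

    val props = new Properties()
    props.setProperty("user", "etl")        // credentials are placeholders
    props.setProperty("password", "secret")

    val raw = spark.read.jdbc(
      "jdbc:postgresql://db:5432/shop", "orders",
      columnName = "id", lowerBound = 0L, upperBound = 1000000L,
      numPartitions = 10,                   // at most 10 concurrent DB connections
      connectionProperties = props)

    val wide = raw.repartition(100)         // later stages are no longer tied to the DB limit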

How are the number of partitions and the number of concurrent tasks in Spark calculated?

I have a cluster with 4 nodes (each with 16 cores) using Spark 1.0.1.
I have an RDD which I've repartitioned so it has 200 partitions (hoping to increase the parallelism).
When I run a transformation (such as filter) on this RDD, I can't seem to get more than 64 tasks (my total number of cores across the 4 nodes) running at any one point in time. By tasks, I mean the number of tasks that appear under the application's Spark UI. I tried explicitly setting spark.default.parallelism to 128 (hoping I would get 128 tasks running concurrently) and verified this in the application UI, but this had no effect. Perhaps this is ignored for a filter, and the default is the total number of cores available.
I'm fairly new with Spark so maybe I'm just missing or misunderstanding something fundamental. Any help would be appreciated.
This is the correct behavior. Each "core" can execute exactly one task at a time, with each task corresponding to a partition. If your cluster only has 64 cores, you can run at most 64 tasks at once.
You could run multiple workers per node to get more executors. That would give you more cores in the cluster. But however many cores you have, each core will run only one task at a time.
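To make the arithmetic concrete, a quick sketch using the numbers from the question:

    // spark.default.parallelism does not raise this ceiling; it only sets the
    // default partition count for operations that create new RDDs.
    val waves = math.ceil(200.0 / 64).toInt // 4 "waves" of tasks: 64 + 64 + 64 + 8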
You can find more details in the following thread:
How does Spark paralellize slices to tasks/executors/workers?
