Apache Spark - restricting per stage concurrent thread count - apache-spark

I want to control the number of concurrent partitions being processed at every stage in my Spark RDD. .repartition(...) is not the solution, as that just modifies the overall number of partitions during a stage, not how many are being processed.
Normally you restrict the number of concurrent partitions at the start by using the --executor-cores and --num-executors parameters. And that is not exact, as processing stages can be staggered etc.
The main thing I want to accomplish is a data load process from a database with certain resource restrictions (concurrent connections), but I do not want those database restrictions to dictate the concurrency of the rest of my Spark process or RDD. I also do not want to force very large partitions at the beginning of the process that will have to be further split and redistributed.
It seems like a reasonable thing to expect, but at first glance not something that can be accomplished within the Spark API.
Example (some pseudo code):
rdd = pseudoReadFromJDBC(partitions = 500, parallelism = 10)
    .repartition(100)
    .parallelism(50)
    .operatorOnRDD();
So in this case, in the first stage, I would split the data retrieved from the JDBC query into 500 smaller datasets. However, I would restrict Spark to running only 10 of those tasks simultaneously, so that at most 10 JDBC connections are open at once. The other partitions would just queue up.
Then in the second stage, I might repartition, but more importantly, I want to choose a higher degree of actual parallelism because I am no longer restricted by the database's limit on the number of simultaneous connections.
That is what I mean by changing it on a per-stage basis.
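To make the requested semantics concrete, here is a minimal sketch in plain Python rather than Spark. It assumes a single process in which a thread pool stands in for the executor cores and a semaphore stands in for the per-stage limit. The names run_stage and fake_jdbc_read are hypothetical, and nothing here is a Spark API; it only illustrates "500 partitions, at most 10 running at once, then a wider second stage".

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_stage(num_partitions, pool_size, max_concurrent, work):
    """Run work(i) for every partition, allowing at most max_concurrent at once.

    Returns the results and the peak number of tasks observed running together.
    """
    gate = threading.BoundedSemaphore(max_concurrent)  # per-stage concurrency cap
    lock = threading.Lock()
    state = {"running": 0, "peak": 0}

    def task(i):
        with gate:  # tasks beyond the cap wait here, i.e. they "queue up"
            with lock:
                state["running"] += 1
                state["peak"] = max(state["peak"], state["running"])
            try:
                return work(i)
            finally:
                with lock:
                    state["running"] -= 1

    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        results = list(pool.map(task, range(num_partitions)))
    return results, state["peak"]


def fake_jdbc_read(i):
    """Hypothetical stand-in for reading one partition over a JDBC connection."""
    time.sleep(0.001)
    return i


# Stage 1: 500 partitions, but never more than 10 "connections" open at once.
loaded, load_peak = run_stage(500, pool_size=50, max_concurrent=10, work=fake_jdbc_read)

# Stage 2: 100 partitions, free to use all 50 worker threads.
processed, process_peak = run_stage(100, pool_size=50, max_concurrent=50, work=lambda i: i * 2)
```

In a real cluster the tasks run in separate executor JVMs, so an in-process semaphore like this would not cap connections cluster-wide; the sketch only shows the behaviour being asked for.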

There is one parameter, spark.default.parallelism. You can try changing its value.

Related

Spark map-side aggregation: Per partition only?

I have been reading on map-side reduce/aggregation and there is one thing I can't seem to understand clearly. Does it happen per partition only or is it broader in scope? I mean does it also reduce across partitions if the same key appears in multiple partitions processed by the same Executor?
Now I have a few more questions depending on whether the answer is "per partition only" or not.
Assuming it's per partition:
What are good ways to deal with a situation where I know my dataset lends itself well to reducing further across local partitions before a shuffle? E.g. I process 10 partitions per Executor and I know they all include many overlapping keys, so the data could potentially be reduced to just 1/10th. Basically I'm looking for a local reduce() (like so many others). Calling coalesce() on them comes to mind; are there any common methods to deal with this?
Assuming it reduces across partitions:
Does it happen per Executor? How about Executors assigned to the same Worker node: do they have the ability to reduce across each other's partitions, recognizing that they are co-located?
Does it happen per core (Thread) within the Executor? The reason I'm asking is that some of the diagrams I looked at seem to show a Mapper per core/Thread of the Executor. It looks like the results of all tasks coming out of that core go to a single Mapper instance (which does the shuffle writes, if I am not mistaken).
Is it deterministic? E.g. if I have a record, let's say A=1 in 10 partitions processed by the same Executor, can I expect to see A=10 for the task reading the shuffle output? Or is it best-effort, e.g. it still reduces but there are some constraints (buffer size etc.) so the shuffle read may encounter A=4 and A=6.
Map-side aggregation is similar to the Hadoop combiner approach. Reducing locally makes sense for Spark as well and means less shuffling. So it works per partition, as you state.
When applying a reducing operation, e.g. a groupBy & sum, a shuffle is needed so that equal keys end up in the same partition; the local reduce described above happens per partition (with DataFrames automatically). A simple count, say, will also reduce locally, and the overall count is then computed by taking the intermediate results back to the driver.
So, results are combined on the Driver from the Executors, depending on what is actually requested, e.g. a collect or printing a count. But if the data is written out after an aggregation of some nature, then the reducing is limited to the Executors on the Workers.
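As a rough illustration of "per partition only", the following plain-Python sketch simulates a combiner that runs separately on each partition, followed by a merge after the shuffle. The function names and the sample data are hypothetical; this is not Spark code, only a model of the behaviour described above.

```python
from collections import Counter


def combine_partition(records):
    """Map-side combine: sum values per key within ONE partition only."""
    partial = Counter()
    for key, value in records:
        partial[key] += value
    return list(partial.items())


def reduce_after_shuffle(partials_per_partition):
    """Reduce side: merge the partial sums that arrive from every partition."""
    total = Counter()
    for partial in partials_per_partition:
        for key, value in partial:
            total[key] += value
    return dict(total)


# Ten partitions on the same hypothetical executor, all with overlapping keys.
partitions = [[("A", 1), ("A", 1), ("B", 1)] for _ in range(10)]

partials = [combine_partition(p) for p in partitions]

# 30 input records shrink to 20 shuffled records (2 keys x 10 partitions),
# not to 2: nothing is merged across partitions before the shuffle.
shuffled_records = sum(len(p) for p in partials)

totals = reduce_after_shuffle(partials)
```

Under this model, the task reading the shuffle output sees one partial value per key per upstream partition and only arrives at the full total after its own merge.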

What are Shuffled Partitions?

What is spark.sql.shuffle.partitions in a more technical sense? I have seen answers that say it "configures the number of partitions that are used when shuffling data for joins or aggregations."
What does that actually mean? How does shuffling work from node to node differently when this number is higher or lower?
Thanks!
Partitions define where data resides in your cluster. A single partition can contain many rows, but all of them will be processed together in a single task on one node.
Looking at edge cases: if you repartition your data into a single partition, it will be processed by only one executor, even if you have 100 executors.
On the other hand, if you have a single executor, but multiple partitions, they will be all (obviously) processed on the same machine.
Shuffles happen when one executor needs data from another. A basic example is a groupBy aggregation, as we need all related rows to calculate the result. Irrespective of how many partitions we had before the groupBy, after it Spark will split the results into spark.sql.shuffle.partitions partitions.
Quoting "Spark: The Definitive Guide" by Bill Chambers and Matei Zaharia:
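A small sketch of the idea, in plain Python: every row is routed to one of a fixed number of output partitions based on a hash of its key, so all rows for a key land together. The helper names are hypothetical, and zlib.crc32 is only a stand-in for whatever hash function Spark uses internally.

```python
import zlib
from collections import defaultdict


def shuffle_partition(key: str, num_shuffle_partitions: int) -> int:
    """Pick the target partition for a key (crc32 is a stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_shuffle_partitions


def shuffle(rows, num_shuffle_partitions):
    """Route (key, value) rows into the configured number of output partitions."""
    out = defaultdict(list)
    for key, value in rows:
        out[shuffle_partition(key, num_shuffle_partitions)].append((key, value))
    return out


rows = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]

# However the input was partitioned, the output has at most 4 partitions here,
# and both "a" rows (and both "b" rows) end up in the same one.
result = shuffle(rows, num_shuffle_partitions=4)
```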
A good rule of thumb is that the number of partitions should be larger than the number of executors on your cluster, potentially by multiple factors depending on the workload. If you are running code on your local machine, it would behoove you to set this value lower because your local machine is unlikely to be able to execute that number of tasks in parallel.
So, to sum up, if you set this number lower than your cluster's capacity to run tasks, you won't be able to use all of its resources. On the other hand, since each task runs on a single partition, having thousands of small partitions would (I expect) have some overhead.
spark.sql.shuffle.partitions is the parameter which determines how many blocks your shuffle will be performed in.
Say you had 40 GB of data and had spark.sql.shuffle.partitions set to 400. Then your data will be shuffled in blocks of 40 GB / 400 = roughly 100 MB each (assuming your data is evenly distributed).
By changing the spark.sql.shuffle.partitions you change the size of blocks being shuffled and the number of blocks for each shuffle stage.
As Daniel says, a rule of thumb is to never set spark.sql.shuffle.partitions lower than the number of cores for a job.
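The block-size arithmetic above can be written out as a short calculation. This is only a back-of-the-envelope sketch that assumes evenly distributed data and decimal units (1 GB = 1000 MB); the function name is hypothetical.

```python
def shuffle_block_size_mb(total_gb: float, shuffle_partitions: int) -> float:
    """Average shuffle block size in MB, assuming an even distribution."""
    return total_gb * 1000 / shuffle_partitions


size_400 = shuffle_block_size_mb(40, 400)    # the example from the text
size_200 = shuffle_block_size_mb(40, 200)    # fewer partitions, larger blocks
size_4000 = shuffle_block_size_mb(40, 4000)  # more partitions, smaller blocks
```

Halving the setting doubles the average block size, and multiplying it by ten shrinks the blocks to a tenth, which is the trade-off between fewer large tasks and many small ones.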

Number of Executor Cores and benefits or otherwise - Spark

Some run-time clarifications are requested.
In a thread I read elsewhere, it was stated that a Spark Executor should only have a single Core allocated. However, I wonder if this is really always true. Reading the various SO questions and the like, as well as Karau, Wendell et al., it is clear that there are experts on both sides, some of whom state that one should in some cases specify more Cores per Executor, but the discussion tends to be more technical than functional. That is to say, functional examples are lacking.
My understanding is that a Partition of an RDD or DF, DS, is serviced by a single Executor. Fine, no issue, makes perfect sense. So, how can the Partition benefit from multiple Cores?
If I have a map followed by, say, a filter, these are not two Tasks that can be interleaved (as in what Informatica does), as my understanding is that they are fused together. This being so, what is an example of a benefit from an assigned Executor running more Cores?
From JL: In other (more technical) words, a Task is a computation on the records in a RDD partition in a Stage of a RDD in a Spark Job. What does it mean functionally speaking, in practice?
Moreover, can an Executor be allocated if not all Cores can be acquired? I presume there is a wait period and that after a while it may be allocated in a more limited capacity. True?
From a highly rated answer on SO, "What is a task in Spark? How does the Spark worker execute the jar file?", the following is stated: "When you create the SparkContext, each worker starts an executor." From another SO question: "When a SparkContext is created, each worker node starts an executor."
Not sure I follow these assertions. If Spark does not know the number of partitions etc. in advance, why allocate Executors so early?
I ask this, as even this excellent post How are stages split into tasks in Spark? does not give a practical example of multiple Cores per Executor. I can follow the post clearly and it fits in with my understanding of 1 Core per Executor.
"My understanding is that a Partition (...) serviced by a single Executor."
That's correct; however, the opposite is not true: a single executor can handle multiple partitions / tasks (across multiple stages or even multiple RDDs).
"then what is an example of benefit from an assigned Executor running more Cores?"
First and foremost, processing multiple tasks at the same time. Since each executor is a separate JVM, which is a relatively heavy process, it might be preferable to keep only one instance for a number of threads. Additionally, it can provide further advantages, like exposing shared memory that can be used across multiple tasks (for example to store broadcast variables).
A secondary application is applying multiple threads to a single partition when the user invokes multi-threaded code. That, however, is not something that is done by default (see Number of CPUs per Task in Spark).
See also What are the benefits of running multiple Spark tasks in the same JVM?
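As a loose analogy in plain Python (not Spark code), an executor with several cores behaves like one process with a small thread pool: several partitions are processed at the same time, and all tasks read one shared in-memory copy of a lookup table instead of each holding their own. EXECUTOR_CORES, broadcast_lookup and run_task are hypothetical names chosen for this sketch.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

EXECUTOR_CORES = 4  # hypothetical number of cores given to one executor

# Loaded once per "executor" and shared by every task running inside it,
# similar in spirit to a broadcast variable.
broadcast_lookup = {"a": 1, "b": 2, "c": 3}


def run_task(partition):
    """One task: process a single partition using the shared lookup table."""
    total = sum(broadcast_lookup.get(key, 0) for key in partition)
    return total, threading.current_thread().name


partitions = [["a", "b"], ["c"], ["a", "a", "c"], ["b", "x"], ["c", "c"], ["a"]]

# Six tasks, at most four running at once, all inside the same process.
with ThreadPoolExecutor(max_workers=EXECUTOR_CORES, thread_name_prefix="core") as executor:
    results = list(executor.map(run_task, partitions))

sums = [total for total, _ in results]
threads_used = {name for _, name in results}
```

With one core per executor, the same six partitions would need either six separate processes, each with its own copy of the lookup table, or one process working through them one after another.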
"If Spark does not know the number of partitions etc. in advance, why allocate Executors so early?"
Pretty much by extension of the points made above: executors are not created to handle a specific task / partition. They are long-running processes, and as long as dynamic allocation is not enabled, they are intended to last for the full lifetime of the corresponding application / driver (preemption or failures, as well as the already mentioned dynamic allocation, can affect that, but that's the basic model).

How are tasks distributed in Spark

I am trying to understand how work is distributed in Spark when a job is submitted via spark-submit to a Spark deployment with 4 nodes. If there is a large data set to operate on, I want to understand exactly how many stages the work is divided into and how many executors run for the job, and how this is decided for every stage.
It's hard to answer this question exactly, because there are many uncertainties.
The number of stages depends only on the described workflow, which includes different kinds of maps, reduces, joins, etc. If you understand it, you can basically read that right from the code. More importantly, it helps you write more performant algorithms, because it is generally known that one should avoid shuffles. For example, a join requires a shuffle, which makes it a stage boundary. This is pretty simple to see: print rdd.toDebugString() and look at the indentation, because each indentation step marks a shuffle.
The number of executors is a completely different story, because it depends on the number of partitions. For example, 2 partitions require only 2 executors, but 40 partitions use all 4, since you have only 4. Additionally, the number of partitions might depend on:
the spark.default.parallelism parameter, which you can provide at spark-submit, or
the data source you use (e.g. it is different for HDFS and Cassandra)
It is good to keep all of the cores in the cluster busy, but with no more partitions than that (meaning each core processes just one partition), because processing each partition carries a bit of overhead. On the other hand, if your data is skewed, some cores need more time to process bigger partitions than others. In this case it helps to split the data into more partitions so that all cores are busy for roughly the same amount of time. This helps with balancing the cluster and with throughput at the same time.
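The effect of skew can be sketched with a toy scheduler in plain Python. It assumes that each partition has a known processing cost and that the next partition always goes to whichever core becomes free first; the function name and the cost numbers are hypothetical, and per-partition overhead is ignored.

```python
import heapq


def stage_duration(partition_costs, num_cores):
    """Time until the last core finishes, handing each partition to the
    core that becomes free first."""
    cores = [0] * num_cores  # time at which each core becomes free
    heapq.heapify(cores)
    for cost in partition_costs:
        free_at = heapq.heappop(cores)
        heapq.heappush(cores, free_at + cost)
    return max(cores)


# 13 units of work on 4 cores, once skewed and once split evenly.
skewed = [10, 1, 1, 1]   # one huge partition dominates the stage
balanced = [1] * 13      # the same work in 13 equal partitions

skewed_time = stage_duration(skewed, num_cores=4)
balanced_time = stage_duration(balanced, num_cores=4)
```

In the skewed case three cores sit idle while one works through the big partition, so the stage takes 10 time units; with even partitions the same work finishes in 4.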

What factors affect how many Spark jobs can run concurrently

We recently set up the Spark Job Server, to which our Spark jobs are submitted. But we found out that our 20-node Spark cluster (8 cores / 128 GB memory per node) can only run 10 Spark jobs concurrently.
Can someone share some detailed info about what factors would actually affect how many spark jobs can be run concurrently? How can we tune the conf so that we can take full advantage of the cluster?
The question is missing some context, but first: it seems like Spark Job Server limits the number of concurrent jobs (unlike Spark itself, which puts a limit on the number of tasks, not jobs).
From application.conf:
# Number of jobs that can be run simultaneously per context
# If not set, defaults to number of cores on machine where jobserver is running
max-jobs-per-context = 8
If that's not the issue (you set the limit higher, or are using more than one context), then the total number of cores in the cluster (8*20 = 160) is the maximum number of concurrent tasks. If each of your jobs creates 16 tasks, 10 jobs occupy all 160 cores, and Spark would queue the next incoming job until CPUs become available.
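The core-count arithmetic can be spelled out as a quick calculation. This is a back-of-the-envelope sketch that assumes every job keeps all of its tasks running at once and that each task uses one core by default; the function name is hypothetical.

```python
def max_concurrent_jobs(nodes, cores_per_node, tasks_per_job, cpus_per_task=1):
    """Number of jobs whose tasks can all run at the same time."""
    total_task_slots = (nodes * cores_per_node) // cpus_per_task
    return total_task_slots // tasks_per_job


# The cluster from the question: 20 nodes with 8 cores each, 16 tasks per job.
jobs = max_concurrent_jobs(nodes=20, cores_per_node=8, tasks_per_job=16)
```

With these numbers the result is 10, which matches the limit observed in the question.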
Spark creates a task per partition of the input data, and the number of partitions is decided according to the partitioning of the input on disk, or by calling repartition or coalesce on the RDD/DataFrame to manually change the partitioning. Some other actions that operate on more than one RDD (e.g. union) may also change the number of partitions.
Some things that could limit the parallelism that you're seeing:
If your job consists of only map operations (or other shuffle-less operations), it will be limited to the number of partitions of data you have. So even if you have 20 executors, if you have 10 partitions of data, it will only spawn 10 tasks (unless the data is splittable, in something like parquet, LZO indexed text, etc).
If you're performing a take() operation (without a shuffle), it performs an exponential take, using only one task at first and then growing until it collects enough data to satisfy the take operation.
Can you share more about your workflow? That would help us diagnose it.
