I am running a query using Hive on Spark which is exhibiting some strange behavior. I've run it multiple times and observed the same behavior. The query:
reads from a large Hive external table
Spark creates ~990,000 tasks
runs in a YARN queue with > 2900 CPUs available
uses 700 executors with 4 CPUs per executor
All is well at the start of the job. After ~1.5 hours of 2800 CPUs cranking, the job is ~80% complete (800k/990k tasks). From there, things start to nosedive: Spark stops using all of the CPUs available to it to work on tasks. With ~190k tasks to go, Spark will gradually drop from using 2800 CPUs to double digits (usually bottoming out around 20 total CPUs). This makes the last 190k tasks take significantly longer to finish than the previous 800k.
I could see as the job got very close to completing that Spark would be unable to parallelize a small amount of remaining tasks across a large number of CPUs. But with 190k tasks left to be started, it seems way too early for that.
Things I've checked:
No other job is pre-empting its resources in YARN. (In addition, if this were the case, I would expect the job to randomly lose/regain resources, instead of predictably losing steam at the 80% mark).
This occurs whether dynamic allocation is enabled or disabled. If disabled, Spark has all 2800 CPUs available for the entire run time of the job - it just doesn't use them. If enabled, Spark does spin down executors as it decides it no longer needs them.
If data skew were the issue, I could see some tasks taking longer than others to finish. But it doesn't explain why Spark wouldn't be using idle CPUs to start on the backlog of tasks still to go.
Does anyone have any advice?
For posterity, this answer from Travis Hegner contained the solution.
Setting spark.locality.wait=0s fixes this issue. I'm still not sure why a 3-second wait causes such a pile-up in Spark's ability to schedule tasks, but setting it to 0 makes the job run extremely well.
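For reference, a minimal sketch of the ways this property can be set (the property name and the 0s value come from the fix above; the application name is illustrative):

```scala
// 1) On the command line:  spark-submit --conf spark.locality.wait=0s ...
// 2) For Hive on Spark:    SET spark.locality.wait=0s;  (in the Hive session)
// 3) In code, before the SparkContext is created:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("locality-wait-demo")        // illustrative name
  .config("spark.locality.wait", "0s")  // don't hold idle cores waiting for data-local slots
  .getOrCreate()
```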
Related
I have a dataset on which I'd like to run multiple jobs in parallel.
I do this by launching each action in its own thread to get multiple Spark jobs per Spark application like the docs say.
Now the task I'm running doesn't benefit endlessly from throwing more cores at it - at around 50 cores the gain from adding more resources is quite minimal.
So for example if I have 2 jobs and 100 cores I'd like to run both jobs in parallel each of them only occupying 50 cores at max to get faster results.
One thing I could probably do is set the number of partitions to 50, so each job could only spawn 50 tasks(?). But apparently there are performance benefits to having more partitions than available cores, to get better overall utilization.
Other than that, I didn't spot anything useful in the docs for limiting the resources per Apache Spark job inside one application. (I'd like to avoid spawning multiple applications to split up the executors.)
Is there any good way to do this?
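For context, the thread-per-action pattern described above can be sketched like this (df1 and df2 are hypothetical DataFrames; each action submitted from its own thread becomes a separate concurrent Spark job):

```scala
// Two actions launched from separate threads run as two concurrent Spark jobs.
val t1 = new Thread(() => println(df1.count()))  // job 1
val t2 = new Thread(() => println(df2.count()))  // job 2
t1.start(); t2.start()
t1.join(); t2.join()
```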
Perhaps asking Spark driver to use fair scheduling is the most appropriate solution in your case.
Starting in Spark 0.8, it is also possible to configure fair sharing between jobs. Under fair sharing, Spark assigns tasks between jobs in a “round robin” fashion, so that all jobs get a roughly equal share of cluster resources. This means that short jobs submitted while a long job is running can start receiving resources right away and still get good response times, without waiting for the long job to finish. This mode is best for multi-user settings.
There is also a concept of pools, but I've not used them, perhaps that gives you some more flexibility on top of fair scheduling.
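A minimal sketch of fair scheduling with pools (spark.scheduler.mode, spark.scheduler.allocation.file and spark.scheduler.pool are real Spark properties; the pool name and file path are assumptions):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.scheduler.mode", "FAIR")
  // Optional: define pools (weight, minShare, schedulingMode) in an XML file:
  // .config("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
  .getOrCreate()

// Inside each job-submitting thread, pick the pool that thread's jobs run in:
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool_a") // hypothetical pool name
```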
Seems like conflicting requirements with no silver bullet.
parallelize as much as possible.
limit any one job from hogging resources IF (and only if) another job is running as well.
So:
if you increase number of partitions then you'll address #1 but not #2.
if you specify spark.cores.max then you'll address #2 but not #1.
if you do both (more partitions and limit spark.cores.max) then you'll address #2 but not #1.
If you only increase the number of partitions, then the only thing you're risking is that a long-running big job will delay the completion/execution of some smaller jobs. Overall, though, it will take the same amount of time to run both jobs on the given hardware in any order, as long as you're not restricting concurrency (spark.cores.max).
In general I would stay away from restricting concurrency (spark.cores.max).
Bottom line, IMO
don't touch spark.cores.max.
increase partitions if you're not using all your cores.
use fair scheduling
if you have strict latency/response-time requirements then use separate auto-scaling clusters for long running and short running jobs
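A sketch of the "increase partitions" route (the numbers are illustrative; a common rule of thumb is 2-4 tasks per core, and df is a hypothetical DataFrame):

```scala
// Partition count used after shuffles (joins, groupBy, ...):
spark.conf.set("spark.sql.shuffle.partitions", "200")
// Or repartition an individual dataframe so there are more tasks than cores:
val widened = df.repartition(200)
```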
I understand that if executor-cores is set to more than 1, then the executor will run tasks in parallel. However, in my experience, the number of parallel processes in an executor is always equal to the number of CPUs in the executor.
For example, suppose I have a machine with 48 cores and set executor-cores to 4, and then there will be 12 executors.
What we need is to run 8 or more threads for each executor (so 2 or more threads per CPU). The reason is that the tasks are quite lightweight and CPU usage is quite low, around 10%, so we want to boost CPU usage by running multiple threads per CPU.
So I'm asking whether we could possibly achieve this through the Spark configuration. Thanks a lot!
Spark executors process tasks, which are derived from the execution plan/code and the partitions of the dataframe. Each core on an executor always processes only one task at a time, so each executor gets at most as many concurrent tasks as it has cores. Running more tasks per executor than it has cores, as you are asking for, is not possible.
You should look for code changes instead: minimize the number of shuffles (no inner joins; use windows instead) and check for skew in your data leading to non-uniformly distributed partition sizes (dataframe partitions, not storage partitions).
WARNING:
If, however, you are alone on your cluster and you do not want to change your code, you can change the YARN settings for the server so that it reports more than 48 cores, even though there are just 48. This can lead to severe instability of the system, since executors are now sharing CPUs. (And your OS also needs CPU power.)
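For illustration only, such over-reporting would be done in yarn-site.xml on each NodeManager (the 96 here is an assumed 2x of the 48 physical cores; again, this oversubscribes CPUs and can destabilize the node):

```xml
<!-- yarn-site.xml: report more vcores than the machine physically has -->
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>96</value>
</property>
```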
This answer is meant as a complement to @Telijas' answer, because in general I agree with it. It's just to give that tiny bit of extra information.
There are some configuration parameters with which you can set the number of threads for certain parts of Spark. There is, for example, a section in the Spark docs that discusses some of them (for all of this I'm looking at the latest Spark version at the time of writing this post: version 3.3.1):
Depending on jobs and cluster configurations, we can set number of threads in several places in Spark to utilize available resources efficiently to get better performance. Prior to Spark 3.0, these thread configurations apply to all roles of Spark, such as driver, executor, worker and master. From Spark 3.0, we can configure threads in finer granularity starting from driver and executor. Take RPC module as example in below table. For other modules, like shuffle, just replace “rpc” with “shuffle” in the property names except spark.{driver|executor}.rpc.netty.dispatcher.numThreads, which is only for RPC module.
Property Name: spark.{driver|executor}.rpc.io.serverThreads
  Default: Fall back on spark.rpc.io.serverThreads
  Meaning: Number of threads used in the server thread pool

Property Name: spark.{driver|executor}.rpc.io.clientThreads
  Default: Fall back on spark.rpc.io.clientThreads
  Meaning: Number of threads used in the client thread pool

Property Name: spark.{driver|executor}.rpc.netty.dispatcher.numThreads
  Default: Fall back on spark.rpc.netty.dispatcher.numThreads
  Meaning: Number of threads used in RPC message dispatcher thread pool
Then here follows a non-exhaustive list (in no particular order; I've just been looking through the source code) of some other thread-count-related configuration parameters:
spark.sql.streaming.fileSource.cleaner.numThreads
spark.storage.decommission.shuffleBlocks.maxThreads
spark.shuffle.mapOutput.dispatcher.numThreads
spark.shuffle.push.numPushThreads
spark.shuffle.push.merge.finalizeThreads
spark.rpc.connect.threads
spark.rpc.io.threads
spark.rpc.netty.dispatcher.numThreads (will be overridden by the driver/executor-specific ones from the table above)
spark.resultGetter.threads
spark.files.io.threads
I didn't add the meaning of these parameters to this answer because that's a different question and quite "Googleable". This is just meant as an extra bit of info.
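These are set like any other Spark property, e.g. at session build time (a sketch; the property names come from the list above, while the values are purely illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.executor.rpc.netty.dispatcher.numThreads", "8")
  .config("spark.rpc.io.threads", "16")
  .getOrCreate()
```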
This might be a very generic question, but I hope someone can offer a hint. I've found that sometimes my Spark job seems to hit a "pause" many times:
The nature of the job is: read ORC files (from a Hive table), filter by certain columns, no joins, then write out to another Hive table.
There were total 64K tasks for my job / stage (FileScan orc, followed by Filter, Project).
The application has 500 executors, each has 4 cores. Initially, about 2000 tasks were running concurrently, things look good.
After a while, I noticed the number of running tasks dropped all the way down to near 100. Many cores/executors were just waiting with nothing to do. (I checked the logs from these waiting executors; there were no errors. All tasks assigned to them were done; they were just waiting.)
After about 3-5 minutes, then these waiting executors suddenly got tasks assigned and now were working happily.
Any particular reasons this can be? The application is running from spark-shell (--master yarn --deploy-mode client, with number of executors/sizes etc. specified)
Thanks!
I am new to Spark. I am facing a performance issue when the number of worker nodes is increased. To investigate it, I have tried some sample code in spark-shell.
I have created an Amazon AWS EMR cluster with 2 worker nodes (m3.xlarge). I used the following code in spark-shell on the master node.
var df = sqlContext.range(0,6000000000L).withColumn("col1",rand(10)).withColumn("col2",rand(20))
df.selectExpr("id","col1","col2","if(id%2=0,1,0) as key").groupBy("key").agg(avg("col1"),avg("col2")).show()
This code executed without any issues and took around 8 minutes. But when I added 2 more worker nodes (m3.xlarge) and executed the same code using spark-shell on the master node, the time increased to 10 minutes.
Here is the issue: I think the time should decrease, if not by half then at least somewhat. I have no idea why the same Spark job takes more time as worker nodes are added. Any idea why this is happening? Am I missing anything?
This should not happen, but it is possible for an algorithm to run slower when distributed.
Basically, if the synchronization part is a heavy one, doing it with 2 nodes will take more time than with one.
I would start by comparing some simpler transformations, running more asynchronous code without any sync points (such as group by key), and see if you get the same issue.
@z-star, yes, an algorithm might be slow when distributed. I found the solution by using Spark Dynamic Allocation. This enables Spark to use only the required executors, while static allocation runs a job on all executors, which was increasing the execution time with more nodes.
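A sketch of the dynamic-allocation settings involved (the property names are real Spark configuration; the executor bounds are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.shuffle.service.enabled", "true")      // external shuffle service, needed on YARN
  .config("spark.dynamicAllocation.minExecutors", "1")
  .config("spark.dynamicAllocation.maxExecutors", "8")  // cap on scale-out
  .getOrCreate()
```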
I am using YARN environment to run spark programs,
with option --master yarn-cluster.
When I open a Spark application's application master, I see a lot of Scheduler Delay in a stage. Some of the delays are even more than 10 minutes. I wonder what they are and why they take so long?
Update:
Usually operations like aggregateByKey take much more time (i.e. scheduler delay) before the executors really start doing tasks. Why is that?
Open the "Show Additional Metrics" (click the right-pointing triangle so it points down) and mouse over the check box for "Scheduler Delay". It shows this tooltip:
Scheduler delay includes time to ship the task from the scheduler to the executor, and time to send the task result from the executor to the scheduler. If scheduler delay is large, consider decreasing the size of tasks or decreasing the size of task results.
The scheduler is part of the master that divides the job into stages of tasks and works with the underlying cluster infrastructure to distribute them around the cluster.
Have a look at TaskSetManager's class comment:
..Schedules the tasks within a single TaskSet in the TaskSchedulerImpl. This class keeps track of each task, retries tasks if they fail (up to a limited number of times), and handles locality-aware scheduling for this TaskSet via delay scheduling...
I assume it is the result of the following paper, on which Matei Zaharia (co-founder and Chief Technologist of Databricks, which develops Spark) also worked: https://cs.stanford.edu/~matei/
Thus, Spark checks the partition locality of a pending task. If the locality level is low (e.g. the data is not in the local JVM), the task does not get directly killed or ignored; instead it gets a launch delay, which is fair.
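The delay-scheduling waits involved can be tuned per locality level (these are real Spark properties; only the base property's 3s default is shown, the per-level ones fall back on it):

```
spark.locality.wait           # base wait before relaxing the locality level (default: 3s)
spark.locality.wait.process   # override for PROCESS_LOCAL
spark.locality.wait.node      # override for NODE_LOCAL
spark.locality.wait.rack      # override for RACK_LOCAL
```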