How does static resource calculation work, if we have multiple spark jobs that share the same cluster?.
As per the calculation, which is explained in many Spark videos and many other websites.
link :https://medium.com/analytics-vidhya/understanding-resource-allocation-configurations-for-a-spark-application-9c1307e6b5e3
link : https://www.projectpro.io/recipes/explain-resource-allocation-configurations-for-spark-application
If we have:
10 node cluster---->16 cores per node------->64 GB per node.
Then the final calculation will look like this:
num-executor : 29, executor-core: 5, executor-memory: 19 GB.
But if we give these resources to a single job, then basically an entire cluster's resources will be consumed by a single job. Then how will other jobs work?
Where does this calculation actually help us in real-time projects?
.
Related
I have a dataset for which I'd like to run multiple jobs for in parallel.
I do this by launching each action in its own thread to get multiple Spark jobs per Spark application like the docs say.
Now the task I'm running doesn't benefit endlessly from throwing more cores at it - at like 50 cores or so the gain of adding more resources is quite minimal.
So for example if I have 2 jobs and 100 cores I'd like to run both jobs in parallel each of them only occupying 50 cores at max to get faster results.
One thing I could probably do is to set the amount of partitions to 50 so the jobs could only spawn 50 tasks(?). But apparently there are some performance benefits of having more partitions than available cores to get a better overall utilization.
But other than that I didn't spot anything useful in the docs to limit the resources per Apache Spark job inside one application. (I'd like to avoid spawning multiple applications to split up the executors).
Is there any good way to do this?
Perhaps asking Spark driver to use fair scheduling is the most appropriate solution in your case.
Starting in Spark 0.8, it is also possible to configure fair sharing between jobs. Under fair sharing, Spark assigns tasks between jobs in a “round robin” fashion, so that all jobs get a roughly equal share of cluster resources. This means that short jobs submitted while a long job is running can start receiving resources right away and still get good response times, without waiting for the long job to finish. This mode is best for multi-user settings.
There is also a concept of pools, but I've not used them, perhaps that gives you some more flexibility on top of fair scheduling.
Seems like conflicting requirements with no silver bullet.
parallelize as much as possible.
limit any one job from hogging resources IF (and only if) another job is running as well.
So:
if you increase number of partitions then you'll address #1 but not #2.
if you specify spark.cores.max then you'll address #2 but not #1.
if you do both (more partitions and limit spark.cores.max) then you'll address #2 but not #1.
If you only increase number of partitions then only thing you're risking is that a long running big job will delay the completion/execution of some smaller jobs, though overall it'll take the same amount of time to run two jobs on given hardware in any order as long as you're not restricting concurrency (spark.cores.max).
In general I would stay away from restricting concurrency (spark.cores.max).
Bottom line, IMO
don't touch spark.cores.max.
increase partitions if you're not using all your cores.
use fair scheduling
if you have strict latency/response-time requirements then use separate auto-scaling clusters for long running and short running jobs
I have a question about how I can work on large files using Spark. Let's say I have a really large file (1 TB) while I only have access to 500GB RAM in my cluster. A simple wordcount application would look like the follows:
sc.textfile(path_to_file).flatmap(split_line_to_words).map(lambda x: (x,1)).reduceByKey()
When I do not have access to enough memory, will the above application fail due to OOM? if so, what are some ways I can fix this?
Well, this is not an issue.
N Partitions equal to block size of HDFS (like) file system will be created on Worker Nodes at some stage physically- resulting in many N small tasks to execute, easily fitting inside the 500GB, over the life of the Spark App.
Partitions and its task equivalent will run concurrently, based on how many executors you have allocated. If you have, say, M executors with 1 core, then max M tasks run concurrently. Depends also on scheduling and resource allocation mode.
Spark handles like any OS as it were, situations of size and resources and depending on resources available, more or less can be done. The DAG Scheduler plays a role in all this. But keeping it simple here.
Please bear with me because I am still quite new to Spark.
I have a GCP DataProc cluster which I am using to run a large number of Spark jobs, 5 at a time.
Cluster is 1 + 16, 8 cores / 40gb mem / 1TB storage per node.
Now I might be misunderstanding something or not doing something correctly, but I currently have 5 jobs running at once, and the Spark UI is showing that only 34/128 vcores are in use, and they do not appear to be evenly distributed (The jobs were executed simultaneously, but the distribution is 2/7/7/11/7. There is only one core allocated per running container.
I have used the flags --executor-cores 4 and --num-executors 6 which doesn't seem to have made any difference.
Can anyone offer some insight/resources as to how I can fine tune these jobs to use all available resources?
I have managed to solve the issue - I had no cap on the memory usage so it looked as though all memory was allocated to just 2 cores per node.
I added the property spark.executor.memory=4G and re-ran the job, it instantly allocated 92 cores.
Hope this helps someone else!
The Dataproc default configurations should take care of the number of executors. Dataproc also enables dynamic allocation, so executors will only be allocated if needed (according to Spark).
Spark cannot parallelize beyond the number of partitions in a Dataset/RDD. You may need to set the following properties to get good cluster utilization:
spark.default.parallelism: the default number of output partitions from transformations on RDDs (when not explicitly set)
spark.sql.shuffle.partitions: the number of output partitions from aggregations using the SQL API
Depending on your use case, it may make sense to explicitly set partition counts for each operation.
I have a processing pipeline that is built using Spark SQL. The objective is to read data from Hive in the first step and apply a series of functional operations (using Spark SQL) in order to achieve the functional output. Now, these operations are quite in number (more than 100), which means I am running around 50 to 60 spark sql queries in a single pipeline. While the application completes successfully without any issues, my focus area has shifted to optimizing the overall process. I have been able to speed up the executions using spark.sql.shuffle.partitions, changing the executor memory and reducing the size of the spark.memory.fraction from default 0.6 to 0.2. I got great benefits by doing all these changes and the over all execution time reduced from 20-25 mins to around 10 mins. Data volume is around 100k rows (source side).
The observations that I have from the Cluster are:
-The number of jobs triggered as apart of application id are 235.
-The total number of stages across all the jobs created are around 600.
-8 executors are used in a two node cluster (64 GB RAM in total with 10 cores).
-The resource manager UI of Yarn (for an application id) becomes very slow to retrieve the details of jobs/stages.
In one of the videos of Spark tuning, I heard that we should try to reduce the number of stages to a bare minimum, also DAG size should be smaller. What are the guidelines to do this. How to find the number of shuffles that are happening (my SQLs have many joins and group by clauses).
I would like to have suggestions on the above scenario of what possible things I can do in order to improvise the performance and handle the data skews in the SQL queries that are JOIN/GROUP_BY heavy.
Thanks
I'm running Hadoop task on a cluster of YARN with max of 8 tasks and 16 cores.
When I run the job I see 8 tasks running on a node yet all 16 cores been used.
Is map task is multi threaded ?
Map task use more than 1 core ?
Can I know which cores used each map task ?
Thanks,
Assaf
You can configure the number of cores per map, as well as the maximum number of usable cores - see here.
The question sounds a bit confused, so, some more details which may be relevant:
A task might do more than just run a map, and, if you're running hadoop, you might be using the cores with something else in the system (ie, maybe some other process is using the cores).
A mapping task might use more than one mapper to do its job - that's part of the point of using hadoop and a MR architecture - your work will get auto-magically distributed and split for you.
Also, beware, your number of tasks doesn't directly relate to the number of mappers, cores or other resources in use; if what you're looking to do is limit cpu usage, or in any other way control resource allocation, change the properties of your containers.
For a more detailed discussion of resource allocation (esp. when compared to MR1) see here.