Spark jobs seem to only be using a small amount of resources - apache-spark

Please bear with me because I am still quite new to Spark.
I have a GCP DataProc cluster which I am using to run a large number of Spark jobs, 5 at a time.
The cluster is 1 master + 16 workers, with 8 cores / 40 GB memory / 1 TB storage per node.
Now I might be misunderstanding something or not doing something correctly, but I currently have 5 jobs running at once, and the Spark UI shows that only 34/128 vcores are in use. They also do not appear to be evenly distributed: the jobs were started simultaneously, but the split is 2/7/7/11/7, and only one core is allocated per running container.
I have used the flags --executor-cores 4 and --num-executors 6, which doesn't seem to have made any difference.
Can anyone offer some insight/resources as to how I can fine tune these jobs to use all available resources?

I have managed to solve the issue - I had no cap on memory usage, so all of each node's memory was being allocated to just 2 cores.
I added the property spark.executor.memory=4G and re-ran the job; it instantly allocated 92 cores.
Hope this helps someone else!
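For reference, here is a minimal Scala sketch of how those caps could be set when building the session instead of on the spark-submit command line; the application name is made up, and the values simply mirror the ones from this question. Note that executor settings set programmatically only take effect before the SparkContext is created, so passing them as --conf flags to spark-submit is the more common route.

import org.apache.spark.sql.SparkSession

// Cap executor memory so YARN can pack more executors (and therefore cores) onto each node.
val spark = SparkSession.builder()
  .appName("capped-executors")              // hypothetical application name
  .config("spark.executor.memory", "4g")    // the cap that fixed the allocation above
  .config("spark.executor.cores", "4")      // cores per executor
  .config("spark.executor.instances", "6")  // only honored when dynamic allocation is off
  .getOrCreate()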

The Dataproc default configurations should take care of the number of executors. Dataproc also enables dynamic allocation, so executors will only be allocated if needed (according to Spark).
Spark cannot parallelize beyond the number of partitions in a Dataset/RDD. You may need to set the following properties to get good cluster utilization:
spark.default.parallelism: the default number of output partitions from transformations on RDDs (when not explicitly set)
spark.sql.shuffle.partitions: the number of output partitions from aggregations using the SQL API
Depending on your use case, it may make sense to explicitly set partition counts for each operation.
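As a hedged illustration of that last point, here is a small Scala sketch of setting shuffle parallelism for a session and choosing partition counts per operation; the path, column name and the value 128 are made up for the example.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("partition-tuning").getOrCreate() // hypothetical app name

// Session-wide default for SQL/DataFrame shuffles (spark.default.parallelism must be set
// before the context starts, e.g. via spark-submit --conf).
spark.conf.set("spark.sql.shuffle.partitions", "128")

// Per-operation partition counts:
val events = spark.read.parquet("gs://my-bucket/events")   // hypothetical input
val byUser = events
  .repartition(128, col("user_id"))                        // hypothetical key column
  .groupBy("user_id")
  .count()

// The RDD API takes an explicit numPartitions argument on most shuffle transformations.
val pairCounts = events.rdd
  .map(row => (row.getAs[String]("user_id"), 1L))
  .reduceByKey(_ + _, 128)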

Related

Why caching small Spark RDDs takes big memory allocation in Yarn?

The cached RDDs (8 in total) are not big, only around 30 GB altogether; however, the Hadoop UI shows the Spark application taking a lot of memory even though no jobs are active, i.e. 1.4 TB. Why so much?
Why does it show around 100 executors (i.e. vCores here) even when no jobs are running?
Also, if the cached RDDs are stored across 100 executors, are those executors reserved so that no other Spark apps can use them to run tasks any more? To rephrase the question: does holding a small amount of memory (via .cache) in executors prevent other Spark apps from leveraging their idle compute resources?
Is there any Spark or Zeppelin configuration that could cause this behaviour?
UPDATE 1
After checking the Spark conf (in Zeppelin), it seems there is a default setting (configured by the administrator) of spark.executor.memory=10G, which is probably the reason.
However, this raises a new question: is it possible to keep only the memory needed for the cached RDDs in each executor and release the rest, instead of always holding the initially configured spark.executor.memory=10G?
Perhaps you can try to repartition(n) your RDD to fewer partitions (n < 100) before caching. A ~30GB RDD would probably fit into the storage memory of ten 10GB executors. A good overview of Spark memory management can be found here. This way, only those executors that hold cached blocks will be "pinned" to your application, while the rest can be reclaimed by YARN via Spark dynamic allocation after spark.dynamicAllocation.executorIdleTimeout (default 60s).
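A minimal Scala sketch of that suggestion, assuming a roughly 30 GB RDD loaded from a made-up HDFS path; the partition count of 10 is only illustrative:

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("cache-fewer-partitions").getOrCreate() // hypothetical name
val bigRdd = spark.sparkContext.textFile("hdfs:///data/large-input")               // hypothetical ~30GB input

// Shrink to fewer, larger partitions before caching so the cached blocks land on
// fewer executors; the idle executors can then be released by dynamic allocation
// once spark.dynamicAllocation.executorIdleTimeout expires.
val compact = bigRdd.repartition(10).persist(StorageLevel.MEMORY_ONLY)
compact.count() // materialize the cache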
Q: Is it possible to keep only the memory needed for the cached RDDs in each executor and release the rest, instead of always holding the initially configured spark.executor.memory=10G?
When Spark uses YARN as its execution engine, YARN allocates containers of a size specified by the application (at least spark.executor.memory + spark.executor.memoryOverhead, possibly more in the case of PySpark) for all the executors. How much memory Spark actually uses inside a container is irrelevant, since the resources allocated to a container are considered off-limits to other YARN applications.
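As a rough worked example under default settings: with spark.executor.memory=10G, the default memory overhead is max(384 MB, 0.10 * 10 GB), i.e. about 1 GB, so each YARN container is sized at roughly 11 GB regardless of how much of it the cached blocks actually use. Around 100 such executors therefore reserve on the order of 1.1 TB of cluster memory, which is in the same ballpark as the 1.4 TB reported above (the exact figure also depends on YARN's allocation rounding and any driver/ApplicationMaster containers).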
Spark assumes that your data is evenly distributed across all the executors and tasks; that's why memory is sized per task. So to make Spark consume less memory, your data has to be evenly distributed:
If you are reading from Parquet files or CSVs, make sure that they have similar sizes. Running repartition() causes shuffling, which, if the data is heavily skewed, may cause other problems when executors don't have enough resources.
Caching won't help release memory on the executors because it doesn't redistribute the data.
Look at the "Event Timeline" on the Stages page and check how big the green bars are. Their length is normally tied to the data distribution, so it's a way to see how much data is loaded (proportionally) onto every task and how much work each one is doing. Since all tasks are assigned the same memory, you can see graphically whether resources are being wasted (mostly tiny bars with only a few big ones).
There are different ways to create evenly distributed files for your process. Here are a few possibilities, though there are certainly more:
Using Hive with the DISTRIBUTE BY clause: you need to distribute by a field that is well balanced in order to get the expected number (and size) of files.
If the process creating those files is a Spark job reading from a database, try to create as many connections as the number of files you need and use a suitable field to populate the Spark partitions. This is achieved, as explained here and here, with the partitionColumn, lowerBound, upperBound and numPartitions properties (see the sketch after this list).
Repartition may work, but also check whether coalesce makes sense in your process, or in the upstream one that generates the files you are reading.
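For the JDBC case in the second bullet, here is a hedged Scala sketch; the connection URL, table, column and bounds are all invented for illustration, and it assumes a numeric, roughly uniformly distributed key and a JDBC driver on the classpath.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("balanced-jdbc-read").getOrCreate() // hypothetical name

// Read the source table as ~100 evenly sized partitions by range-splitting on order_id.
val orders = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db-host:5432/sales")  // hypothetical connection
  .option("dbtable", "public.orders")                     // hypothetical table
  .option("partitionColumn", "order_id")                  // numeric, roughly uniform key
  .option("lowerBound", "1")
  .option("upperBound", "10000000")
  .option("numPartitions", "100")
  .load()

// Write evenly sized files for downstream jobs to read.
orders.repartition(100).write.parquet("hdfs:///data/orders_balanced") // hypothetical output path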

What performance parameters to set for spark scala code to run on yarn using spark-submit?

My use case is to merge two tables, where one table contains 30 million records with 200 columns and the other contains 1 million records with 200 columns. I am using a broadcast join for the small table. I am loading both tables as DataFrames from Hive managed tables on HDFS.
I need to know what values to set for driver memory, executor memory and the other related parameters for this use case.
I have this hardware configuration for my YARN cluster:
Spark version 2.0.0
HDP version 2.5.3.0-37
1) YARN clients: 20
2) Max. virtual cores allocated per container (yarn.scheduler.maximum-allocation-vcores): 19
3) Max. memory allocated per YARN container: 216 GB
4) Cluster memory available: 3.1 TB
Any other info you need I can provide for this cluster.
I have to decrease the time to complete this process.
I have been using some configurations, but I think they are wrong; the job took 4.5 minutes to complete, and I think Spark is capable of doing better than that.
There are mainly two things to look at when you want to speed up your Spark application.
Caching/persistence:
This is not a direct way to speed up the processing. It is useful when you have multiple actions (reduce, join, etc.) and you want to avoid re-computation of the RDDs in case of failures, and hence decrease the application run time.
Increasing the parallelism:
This is the actual solution to speed up your Spark application. It can be achieved by increasing the number of partitions. Depending on the use case, you might have to increase the partitions:
Whenever you create your DataFrames/RDDs: this is the better way to increase the partitions, as you don't have to trigger a costly shuffle operation to do it.
By calling repartition: this will trigger a shuffle operation.
Note: once you increase the number of partitions, also increase the number of executors (possibly a large number of small containers with a few vcores and a few GB of memory each).
Increasing the parallelism inside each executor
By adding more cores to each executor, you can increase the parallelism at the partition level. This will also speed up the processing.
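For the merge described in the question, a hedged Scala sketch might look like the following; the table names and join key are placeholders, and the partition count is only an example of the "increase the parallelism" advice above.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().appName("broadcast-merge").getOrCreate() // hypothetical name

val big = spark.table("db.big_table")     // hypothetical: ~30M rows x 200 cols
val small = spark.table("db.small_table") // hypothetical: ~1M rows x 200 cols

// Broadcasting the small side avoids shuffling the 30M-row table; repartitioning the
// big side controls how many tasks (and therefore how much parallelism) the join uses.
val merged = big
  .repartition(200)
  .join(broadcast(small), Seq("id"))      // hypothetical join key

merged.write.mode("overwrite").saveAsTable("db.merged_table") // hypothetical output table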
To get a better understanding of the configurations, please refer to this post.

How the Number of partitions and Number of concurrent tasks in spark calculated

I have a cluster with 4 nodes (each with 16 cores) using Spark 1.0.1.
I have an RDD which I've repartitioned so it has 200 partitions (hoping to increase the parallelism).
When I do a transformation (such as filter) on this RDD, I can't seem to get more than 64 tasks (my total number of cores across the 4 nodes) going at one point in time. By tasks, I mean the number of tasks that appear under the Application Spark UI. I tried explicitly setting spark.default.parallelism to 128 (hoping I would get 128 tasks running concurrently) and verified this in the Application UI for the running application, but this had no effect. Perhaps this is ignored for a 'filter' and the default is the total number of cores available.
I'm fairly new with Spark so maybe I'm just missing or misunderstanding something fundamental. Any help would be appreciated.
This is correct behavior. Each "core" can execute exactly one task at a time, with each task corresponding to a partition. If your cluster only has 64 cores, you can only run at most 64 tasks at once.
You could run multiple workers per node to get more executors. That would give you more cores in the cluster. But however many cores you have, each core will run only one task at a time.
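A small Scala sketch of what this answer describes, using the RDD API; the numbers are illustrative.

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("task-slots-demo") // hypothetical application name
val sc = new SparkContext(conf)

// 200 partitions means 200 tasks per stage, but at most <total cores> run at the same time.
val rdd = sc.parallelize(1 to 1000000, numSlices = 200)
println(s"partitions: ${rdd.partitions.length}")  // 200
println(s"task slots: ${sc.defaultParallelism}")  // roughly the total cores available

val evens = rdd.filter(_ % 2 == 0)
evens.count() // schedules 200 tasks; with 64 cores, they run in roughly 4 waves of 64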
You can find more details in the following thread:
How does Spark paralellize slices to tasks/executors/workers?

Spark shows different number of cores than what is passed to it using spark-submit

TL;DR
The Spark UI shows a different number of cores and a different amount of memory than what I ask for with spark-submit.
More details:
I'm running Spark 1.6 in standalone mode.
When I run spark-submit I pass it 1 executor instance with 1 core for the executor and also 1 core for the driver.
What I would expect is that my application will be run with 2 cores in total.
When I check the environment tab in the UI, I see that it received the correct parameters I gave it; however, it still seems to be using a different number of cores.
This is my spark-defaults.conf that I'm using:
spark.executor.memory 5g
spark.executor.cores 1
spark.executor.instances 1
spark.driver.cores 1
Checking the environment tab on the Spark UI shows that these are indeed the received parameters, but the UI still shows something else.
Does anyone have any idea what might cause Spark to use a different number of cores than what I pass it? I obviously tried googling it but didn't find anything useful on the topic.
Thanks in advance
TL;DR
Use spark.cores.max instead to define the total number of cores available, and thus limit the number of executors.
In standalone mode, a greedy strategy is used and as many executors will be created as there are cores and memory available on your worker.
In your case, you specified 1 core and 5GB of memory per executor.
The following will be calculated by Spark :
As there are 8 cores available, it will try to create 8 executors.
However, as there is only 30GB of memory available, it can only create 6 executors: each executor will have 5GB of memory, which adds up to 30GB.
Therefore, 6 executors will be created, and a total of 6 cores will be used with 30GB of memory.
Spark basically fulfilled what you asked for. In order to achieve what you want, you can make use of the spark.cores.max option documented here and specify the exact number of cores you need.
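A minimal sketch for this 1.6-era standalone setup, using SparkConf directly; the values mirror the question and the application name is made up.

import org.apache.spark.{SparkConf, SparkContext}

// Cap the total executor cores for the application so the standalone scheduler
// stops spawning one executor per remaining worker core.
val conf = new SparkConf()
  .setAppName("capped-standalone-app")  // hypothetical name
  .set("spark.cores.max", "1")          // total executor cores across the cluster
  .set("spark.executor.cores", "1")
  .set("spark.executor.memory", "5g")
  .set("spark.driver.cores", "1")
val sc = new SparkContext(conf)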
A few side-notes :
spark.executor.instances is a YARN-only configuration
spark.driver.cores defaults to 1 already
I am also working on making the number of executors in standalone mode easier to reason about; this might get integrated into a future release of Spark and hopefully help you figure out exactly how many executors you are going to have, without having to calculate it on the go.

Spark + EMR using Amazon's "maximizeResourceAllocation" setting does not use all cores/vcores

I'm running an EMR cluster (version emr-4.2.0) for Spark using the Amazon specific maximizeResourceAllocation flag as documented here. According to those docs, "this option calculates the maximum compute and memory resources available for an executor on a node in the core node group and sets the corresponding spark-defaults settings with this information".
I'm running the cluster using m3.2xlarge instances for the worker nodes. I'm using a single m3.xlarge for the YARN master - the smallest m3 instance I can get it to run on, since it doesn't do much.
The situation is this: When I run a Spark job, the number of requested cores for each executor is 8. (I only got this after configuring "yarn.scheduler.capacity.resource-calculator": "org.apache.hadoop.yarn.util.resource.DominantResourceCalculator" which isn't actually in the documentation, but I digress). This seems to make sense, because according to these docs an m3.2xlarge has 8 "vCPUs". However, on the actual instances themselves, in /etc/hadoop/conf/yarn-site.xml, each node is configured to have yarn.nodemanager.resource.cpu-vcores set to 16. I would (at a guess) think that must be due to hyperthreading or perhaps some other hardware fanciness.
So the problem is this: when I use maximizeResourceAllocation, I get the number of "vCPUs" that the Amazon Instance type has, which seems to be only half of the number of configured "VCores" that YARN has running on the node; as a result, the executor is using only half of the actual compute resources on the instance.
Is this a bug in Amazon EMR? Are other people experiencing the same problem? Is there some other magic undocumented configuration that I am missing?
Okay, after a lot of experimentation, I was able to track down the problem. I'm going to report my findings here to help people avoid frustration in the future.
While there is a discrepancy between the 8 cores asked for and the 16 VCores that YARN knows about, this doesn't seem to make a difference. YARN isn't using cgroups or anything fancy to actually limit how many CPUs the executor can actually use.
"Cores" on the executor is actually a bit of a misnomer. It is actually how many concurrent tasks the executor will willingly run at any one time; essentially boils down to how many threads are doing "work" on each executor.
When maximizeResourceAllocation is set, when you run a Spark program, it sets the property spark.default.parallelism to be the number of instance cores (or "vCPUs") for all the non-master instances that were in the cluster at the time of creation. This is probably too small even in normal cases; I've heard that it is recommended to set this at 4x the number of cores you will have to run your jobs. This will help make sure that there are enough tasks available during any given stage to keep the CPUs busy on all executors.
When you have data that comes from different runs of different Spark programs, your data (in RDD or Parquet format or whatever) is quite likely to be saved with varying numbers of partitions. When running a Spark program, make sure you repartition data either at load time or before a particularly CPU-intensive task. Since you have access to the spark.default.parallelism setting at runtime, this can be a convenient number to repartition to.
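A hedged Scala sketch of repartitioning to a multiple of spark.default.parallelism at load time; the path and the 4x factor are illustrative, not part of the original answer.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("emr-parallelism").getOrCreate() // hypothetical name

// defaultParallelism is available at runtime, so it makes a convenient repartition target.
val parallelism = spark.sparkContext.defaultParallelism
val df = spark.read
  .parquet("s3://my-bucket/input")  // hypothetical input
  .repartition(parallelism * 4)     // keep every executor core busy during wide stages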
TL;DR
maximizeResourceAllocation will do almost everything for you correctly except...
You probably want to explicitly set spark.default.parallelism to 4x the number of instance cores you want the job to run on, on a per-"step" (in EMR speak) / per-"application" (in YARN speak) basis, i.e. set it every time, and...
Make sure within your program that your data is appropriately partitioned (i.e. has enough partitions) to allow Spark to parallelize it properly.
With this setting you should get 1 executor on each instance (except the master), each with 8 cores and about 30GB of RAM.
Is the Spark UI at http://:8088/ not showing that allocation?
I'm not sure that setting is really a lot of value compared to the other one mentioned on that page, "Enabling Dynamic Allocation of Executors". That lets Spark manage its own number of executors for a job, and if you launch a task with 2 CPU cores and 3 GB of RAM per executor you'll get a pretty good ratio of CPU to memory for EMR's instance sizes.
In EMR version 3.x, maximizeResourceAllocation was implemented with a reference table: https://github.com/awslabs/emr-bootstrap-actions/blob/master/spark/vcorereference.tsv
It is used by a shell script, maximize-spark-default-config, in the same repo; you can take a look at how they implemented it.
Maybe in the new EMR version 4 this reference table is somehow wrong... I believe you can find all of these AWS scripts on the EC2 instances of your EMR cluster, located in /usr/lib/spark or /opt/aws or somewhere like that.
Anyway, at the least you can write your own bootstrap action scripts for this in EMR 4, with a correct reference table, similar to the implementation in EMR 3.x.
Moreover, since we are going to use the STUPS infrastructure, it is worth taking a look at the STUPS appliance for Spark: https://github.com/zalando/spark-appliance
You can explicitly specify the number of cores by setting the senza parameter DefaultCores when you deploy your Spark cluster.
Some highlights of this appliance compared to EMR are:
It can be used even with t2 instance types, it is auto-scalable based on roles like other STUPS appliances, etc.
You can easily deploy your cluster in HA mode with ZooKeeper, so there is no SPOF on the master node. HA mode in EMR is currently still not possible, and I believe EMR is mainly designed for "large clusters temporarily for ad-hoc analysis jobs", not for a "dedicated cluster that is permanently on", so HA mode will not be possible with EMR in the near future.
