Spark: Understanding Dynamic Allocation

I have launched a Spark job with the following configuration:
--master yarn --deploy-mode cluster --conf spark.scheduler.mode=FAIR --conf spark.dynamicAllocation.enabled=true --conf spark.dynamicAllocation.maxExecutors=19 --conf spark.dynamicAllocation.minExecutors=0
It ran well and finished successfully, but after checking the Spark History UI, this is what I saw:
My questions are (I'm more interested in understanding than in solutions):
Why does Spark request the last executor if it has no task to do?
How can I optimise the cluster resources requested by my job in dynamic allocation mode?
I'm using Spark 2.3.0 on YARN.

You need to meet the 2 requirements for using Spark dynamic allocation:
spark.dynamicAllocation.enabled=true
spark.shuffle.service.enabled=true => the purpose of the external shuffle service is to allow executors to be removed without deleting the shuffle files they wrote.
The resources are adjusted dynamically based on the workload: the application gives executors back when they are no longer needed and requests them again when there are pending tasks.
I am not sure that there is a fixed order; executors are simply requested in rounds, and the number requested grows exponentially, i.e. an application will add 1 executor in the first round, then 2, 4, 8 and so on.
See: Configuring external shuffle service
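As a reference point, a spark-submit that satisfies both requirements could look like the sketch below; the jar name and the idle timeout are illustrative, the rest mirrors the question:

spark-submit \
  --master yarn --deploy-mode cluster \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=0 \
  --conf spark.dynamicAllocation.maxExecutors=19 \
  --conf spark.dynamicAllocation.executorIdleTimeout=60s \
  your-app.jar

Note that on YARN the shuffle service itself also has to be running on each NodeManager (the spark_shuffle auxiliary service), which is what the "Configuring external shuffle service" documentation describes.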

It's difficult to know what Spark did there without knowing the content of the job you submitted. Unfortunately, the configuration string you provided does not say much about what Spark will actually do once the job is submitted.
You will likely get a better understanding of what happened during a task by looking at the 'SQL' part of the history UI (right side of the top bar) as well as at the stdout logs.
Generally one of the better places to read about how Spark works is the official page: https://spark.apache.org/docs/latest/cluster-overview.html
Happy sparking ;)

It's because of the allocation policy:
Additionally, the number of executors requested in each round
increases exponentially from the previous round.
reference
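If that ramp-up feels too aggressive, the pace of the request rounds is controlled by two standard timeouts that can be tuned on spark-submit (the values below are illustrative, not taken from the question):

--conf spark.dynamicAllocation.schedulerBacklogTimeout=5s
--conf spark.dynamicAllocation.sustainedSchedulerBacklogTimeout=5s

The first timeout decides how long tasks must sit in the backlog before the first request round fires; the second controls the interval between the subsequent, exponentially growing rounds.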

Related

Spark on EMR-5.32.0 not spawning requested executors

I am running into some problems in (Py)Spark on EMR (release 5.32.0). Approximately a year ago I ran the same program on an EMR cluster (I think the release must have been 5.29.0). Then I was able to configure my PySpark program using spark-submit arguments properly. However, now I am running the same/similar code, but the spark-submit arguments do not seem to have any effect.
My cluster configuration:
master node: 8 vCore, 32 GiB memory, EBS only storage EBS Storage:128 GiB
slave nodes: 10 x 16 vCore, 64 GiB memory, EBS only storage EBS Storage:256 GiB
I run the program with the following spark-submit arguments:
spark-submit --master yarn --conf "spark.executor.cores=3" --conf "spark.executor.instances=40" --conf "spark.executor.memory=8g" --conf "spark.driver.memory=8g" --conf "spark.driver.maxResultSize=8g" --conf "spark.dynamicAllocation.enabled=false" --conf "spark.default.parallelism=480" update_from_text_context.py
I did not change anything in the default configurations on the cluster.
Below is a screenshot of the Spark UI, which shows only 10 executors, whereas I expect to have 40 executors available...
I tried different spark-submit arguments in order to make sure the error was unrelated to Apache Spark: setting the number of executor instances does not change the executors. I tried a lot of things, and nothing seems to help.
I am a little lost here, could someone help?
UPDATE:
I ran the same code on EMR release label 5.29.0, and there the conf settings in the spark-submit arguments do seem to take effect:
Why is this happening?
Sorry for the confusion, but this is intentional. On emr-5.32.0, Spark+YARN will coalesce multiple executor requests that land on the same node into a larger executor container. Note how even though you had fewer executors than you expected, each of them had more memory and cores than you had specified. (There's one asterisk here, though, that I'll explain below.)
This feature is intended to provide better performance by default in most cases. If you would really prefer to keep the previous behavior, you may disable this new feature by setting spark.yarn.heterogeneousExecutors.enabled=false, though we (I am on the EMR team) would like to hear from you about why the previous behavior is preferable.
One thing that doesn't make sense to me, though, is that you should be ending up with the same total number of executor cores that you would have without this feature, but that doesn't seem to have occurred for the example you shared. You asked for 40 executors with 3 cores each but then got 10 executors with 15 cores each, which is a bit more in total. This may have to do with the way that your requested spark.executor.memory of 8g divides into the memory available on your chosen instance type, which I'm guessing is probably m5.4xlarge. One thing that may help you is to remove all of your overrides for spark.executor.memory/cores/instances and just use the defaults. Our hope is that defaults will give the best performance in most cases. If not, like I said above, please let us know so that we can improve further!
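If you do prefer the old one-container-per-requested-executor behaviour on emr-5.32.0, the flag mentioned above can be passed straight to spark-submit; this sketch simply combines it with the settings from the question:

spark-submit --master yarn \
  --conf "spark.yarn.heterogeneousExecutors.enabled=false" \
  --conf "spark.executor.cores=3" --conf "spark.executor.instances=40" \
  --conf "spark.executor.memory=8g" --conf "spark.dynamicAllocation.enabled=false" \
  update_from_text_context.py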
OK, in case someone is facing the same problem: as a workaround you can just revert to a previous version of EMR. In my case I reverted to EMR release label 5.29.0, which solved all my problems. Suddenly I was able to configure the Spark job again!
Still I am not sure why it doesn't work in EMR release label 5.32.0. So if someone has suggestions, please let me know!

What is the difference between submitting spark job to spark-submit and to hadoop directly?

I have noticed that in my project there are 2 ways of running Spark jobs.
The first way is to submit the job via the spark-submit script:
./bin/spark-submit
--class org.apache.spark.examples.SparkPi
--master local[8]
/path/to/examples.jar
100
The second way is to package the Java file into a jar and run it via hadoop, with the Spark code inside MainClassName:
hadoop jar JarFile.jar MainClassName
What is the difference between these 2 ways?
What prerequisites do I need in order to use either?
As you stated, the second way of running a Spark job, packaging a Java file with Spark classes and/or syntax, essentially wraps your Spark job inside a Hadoop job. This has its disadvantages: your job becomes directly dependent on the Java and Scala versions installed on your system/cluster, and you have to deal with the compatibility matrix between the two frameworks' versions. In that case the developer has to keep two different platforms in mind when setting up the job, even if it feels simpler for Hadoop users who are more comfortable with Java and the Map/Reduce/Driver layout than with Spark's more opinionated setup and Scala's somewhat steep learning curve.
The first way of submitting a job is the most "standard" one (judging by the majority of usage you can find online, so take this with a grain of salt): execution happens almost entirely within Spark (unless, of course, you read your input from or write your output to HDFS). With this approach you only depend on Spark itself, keeping Hadoop's quirks (i.e. its YARN resource management) out of your job, and it can be noticeably faster since it's the most direct approach.
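Whichever launcher you use, the main class itself looks roughly the same; the difference lies in who sets up the classpath, the master and the deploy mode. A minimal Scala sketch of such a main class (the workload and app name are illustrative, only MainClassName comes from the question):

import org.apache.spark.sql.SparkSession

object MainClassName {
  def main(args: Array[String]): Unit = {
    // With spark-submit, the master and deploy mode come from the command line;
    // when launched via `hadoop jar`, they must be set here or in spark-defaults.conf.
    val spark = SparkSession.builder()
      .appName("example-job")
      .getOrCreate()

    // Trivial workload, just to have something runnable.
    val count = spark.range(0, 100).count()
    println(s"count = $count")

    spark.stop()
  }
}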

Multiple Spark applications at the same time, same jar file... jobs are in Waiting status

Spark/Scala noob here.
I am running Spark in a clustered environment. I have two very similar apps (each with a unique Spark config and context). When I try to kick them both off, the first seems to grab all the resources and the second waits for resources. I am setting resources on the submit, but it doesn't seem to matter. Each node has 24 cores and 45 GB of memory available for use. Here are the two commands I am using to submit the jobs that I want to run in parallel.
./bin/spark-submit --master spark://MASTER:6066 --class MainAggregator --conf spark.driver.memory=10g --conf spark.executor.memory=10g --executor-cores 3 --num-executors 5 sparkapp_2.11-0.1.jar -new
./bin/spark-submit --master spark://MASTER:6066 --class BackAggregator --conf spark.driver.memory=5g --conf spark.executor.memory=5g --executor-cores 3 --num-executors 5 sparkapp_2.11-0.1.jar 01/22/2020 01/23/2020
Also, I should note that the second app does kick off, but in the master monitoring web page I see it as "Waiting" and it has 0 cores until the first is done. The apps pull from the same tables, but the chunks of data they pull are very different, so the RDDs/DataFrames are unique, if that makes a difference.
What am I missing in order to run these at the same time?
second App does kick off but in the master monitoring webpage I see it
as "Waiting" and it will have 0 cores until the first is done.
I encountered the same thing some time back. There are 2 possible causes:
1) You don't have enough infrastructure.
2) You might be using the Capacity Scheduler, which doesn't have a preemption mechanism to accommodate new jobs until the running ones release resources.
If it is #1, you have to add more nodes and allocate more resources in your spark-submit.
If it is #2, you can adopt the Hadoop Fair Scheduler, where you can maintain 2 pools (see the Spark documentation on this). The advantage is that you can run parallel jobs: the Fair Scheduler takes care of it by preempting some resources and allocating them to the other job that is running in parallel.
mainpool for the first Spark job.
backlogpool to run your second Spark job.
To achieve this you need an XML file with a pool configuration like the sample below:
<?xml version="1.0"?>
<allocations>
  <pool name="default">
    <schedulingMode>FAIR</schedulingMode>
    <weight>3</weight>
    <minShare>3</minShare>
  </pool>
  <pool name="mainpool">
    <schedulingMode>FAIR</schedulingMode>
    <weight>3</weight>
    <minShare>3</minShare>
  </pool>
  <pool name="backlogpool">
    <schedulingMode>FAIR</schedulingMode>
    <weight>3</weight>
    <minShare>3</minShare>
  </pool>
</allocations>
Along with that you need to make some minor changes in the driver code, such as telling Spark which pool the first job should go to and which pool the second job should go to, as sketched below.
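A minimal Scala sketch of that driver-side change, assuming the XML above is saved as fairscheduler.xml and referenced via spark.scheduler.allocation.file (all names and paths here are illustrative):

import org.apache.spark.sql.SparkSession

object PoolExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("pool-example")
      .config("spark.scheduler.mode", "FAIR")
      .config("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
      .getOrCreate()
    val sc = spark.sparkContext

    // Jobs triggered from this thread land in "mainpool"
    sc.setLocalProperty("spark.scheduler.pool", "mainpool")
    // ... run the first job's actions here ...

    // Switch the pool (typically done from a second thread) for the backlog work
    sc.setLocalProperty("spark.scheduler.pool", "backlogpool")
    // ... run the second job's actions here ...

    spark.stop()
  }
}

Keep in mind that these pools share resources within a single Spark application, so the two jobs would have to be driven from the same driver (for example from separate threads) for this to apply.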
For more details on how it works, see my articles:
hadoop-yarn-fair-schedular-advantages-explained-part1
hadoop-yarn-fair-schedular-advantages-explained-part2
Try these ideas to overcome the waiting. Hope this helps..

Getting BusyPoolException from com.datastax.spark.connector.writer.QueryExecutor, what am I doing wrong?

I am using spark-sql-2.4.1 and spark-cassandra-connector_2.11-2.4.1 with Java 8 and Apache Cassandra 3.0.
My spark-submit / Spark cluster environment is set up as below to load 2 billion records.
--executor-cores 3
--executor-memory 9g
--num-executors 5
--driver-cores 2
--driver-memory 4g
I am using the following configuration:
cassandra.concurrent.writes=1500
cassandra.output.batch.size.rows=10
cassandra.output.batch.size.bytes=2048
cassandra.output.batch.grouping.key=partition
cassandra.output.consistency.level=LOCAL_QUORUM
cassandra.output.batch.grouping.buffer.size=3000
cassandra.output.throughput_mb_per_sec=128
The job takes around 2 hours, which is a really huge amount of time.
When I check the logs I see
WARN com.datastax.spark.connector.writer.QueryExecutor - BusyPoolException
How do I fix this?
You have an incorrect value for cassandra.concurrent.writes: it means that you're sending 1500 concurrent batches at the same time, but by default the Java driver allows only 1024 simultaneous requests. Setting this parameter too high usually overloads the nodes and, as a result, causes task retries.
Other settings are also off: if you specify cassandra.output.batch.size.rows, its value overrides cassandra.output.batch.size.bytes. See the corresponding section of the Spark Cassandra Connector reference for more details.
One aspect of performance tuning is having the correct number of Spark partitions so you reach good parallelism, but this really depends on your code, how many nodes are in the Cassandra cluster, etc.
P.S. Also note that the configuration parameters must start with spark.cassandra., not just cassandra. If you specified them in the latter form, they are ignored and the defaults are used.
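For illustration, here is roughly how the properties from the question could be passed with the required prefix (the write-concurrency knob in the connector is spark.cassandra.output.concurrent.writes); the concurrency value is just an example well below the driver's 1024-request limit, and only one of the two batch-size settings is kept, for the reason given above:

--conf spark.cassandra.output.concurrent.writes=100
--conf spark.cassandra.output.batch.size.bytes=2048
--conf spark.cassandra.output.batch.grouping.key=partition
--conf spark.cassandra.output.consistency.level=LOCAL_QUORUM
--conf spark.cassandra.output.batch.grouping.buffer.size=3000
--conf spark.cassandra.output.throughput_mb_per_sec=128

You would still need to tune these against your own cluster, together with the number of Spark partitions mentioned above.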

A Spark cluster issue

I know that when the Spark cluster in the production environment runs a job, it is in standalone mode.
While I was running a job, memory overflow on a few of the workers caused the worker node processes to die.
I would like to ask how to analyze the error shown in the image below:
Spark Worker Fatal Error
EDIT: This is a relatively common problem; if the steps below don't help, please also see Spark java.lang.OutOfMemoryError: Java heap space.
Without seeing your code, here is the process you should follow:
(1) If the issue is caused primarily by the Java allocation running out of space within the container allocation, I would advise adjusting your memory overhead settings (below). The suggested values are a little high and will cause some excess spin-up of vcores. Add the two settings below to your spark-submit and re-run.
--conf "spark.yarn.executor.memoryOverhead=4000m"
--conf "spark.yarn.driver.memoryOverhead=2000m"
(2) Adjust Executor and Driver Memory Levels. Start low and climb. Add these values to the spark-submit statement.
--driver-memory 10g
--executor-memory 5g
(3) Adjust Number of Executor Values in the spark submit.
--num-executors ##
(4) Look at the YARN stages of the job, figure out where inefficiencies are present in the code and where persistence can be added or replaced. I would advise looking heavily into Spark tuning.
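Putting steps (1) to (3) together, the spark-submit might look like this sketch; the jar name and the executor count are placeholders, and the numbers are only a starting point to iterate on:

spark-submit \
  --conf "spark.yarn.executor.memoryOverhead=4000m" \
  --conf "spark.yarn.driver.memoryOverhead=2000m" \
  --driver-memory 10g \
  --executor-memory 5g \
  --num-executors 10 \
  your-app.jar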
