Why Spark applications are not running on all nodes - apache-spark

I installed the following Spark benchmark:
https://github.com/BBVA/spark-benchmarks
I run Spark on top of YARN with 8 workers, but I only get 2 running executors during the job (TestDFSIO).
I also set executor-cores to 9, but still only 2 executors are running.
Why would that happen?
I think the problem comes from YARN, because I get an almost identical issue with TestDFSIO on Hadoop: at the beginning of the job only two nodes run, but then all the nodes execute the application in parallel!
Note that I am not using HDFS for storage!

I solved this issue. I set the number of cores per executor to 5 (--executor-cores) and the total number of executors to 23 (--num-executors), which had been left at the default of 2.
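For illustration, these flags go on the spark-submit line. This is only a sketch: the jar path, main class, and benchmark arguments below are placeholders for the spark-benchmarks TestDFSIO job, and the right executor count and size depend on the memory and vcores YARN actually offers per node.
spark-submit --master yarn --deploy-mode cluster \
  --num-executors 23 --executor-cores 5 --executor-memory 4g \
  --class <TestDFSIO main class> path/to/spark-benchmarks.jar <benchmark args>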

Related

Dataproc pyspark on yarn cluster not starting with specified number of cores per executor. Or, maybe it is?

This has been discussed in several other posts, and the problem is the same; however, I suspect that the YARN-reported numbers may simply be wrong, and I'm looking for an expert to shed some light. Here is what I have noticed.
I start the Spark session with:
from pyspark.sql import SparkSession

# fixed resource request: 4 executors, 3 cores and 6 GB each, dynamic allocation off
spark = SparkSession.builder \
    .config("spark.dynamicAllocation.enabled", False) \
    .config("spark.executor.cores", "3") \
    .config("spark.executor.instances", "4") \
    .config("spark.executor.memory", "6g") \
    .getOrCreate()
And YARN shows only 1 core allocated per executor.
However, I can still see 12 tasks running in parallel in the Spark UI.
As suggested in the posts linked, I set yarn.scheduler.capacity.resource-calculator=org.apache.hadoop.yarn.util.resource.DominantResourceCalculator in the capacity-scheduler.xml
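(For reference, applying that change on a vanilla Hadoop install looks roughly like the following; the reload step varies by distribution, and some setups need a full ResourceManager restart rather than a queue refresh.)
# edit $HADOOP_CONF_DIR/capacity-scheduler.xml and set
#   yarn.scheduler.capacity.resource-calculator
#   to org.apache.hadoop.yarn.util.resource.DominantResourceCalculator,
# then reload the scheduler configuration:
yarn rmadmin -refreshQueues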
The situation remains exactly the same:
YARN shows 1 core allocated per executor
the Spark UI shows 12 tasks running in parallel
Question:
Is the number shown in the YARN UI a bug? How can I reconcile these two incompatible numbers?

Spark with Hadoop Yarn: Use the entire cluster nodes

I'm using Spark with HDFS storage and YARN. My cluster contains 5 nodes (1 master and 4 slaves).
Master node: 48 GB RAM, 16 CPU cores
Slave nodes: 12 GB RAM, 16 CPU cores
I'm executing two different processes: a WordCount job and a SparkSQL query, on two different files. Everything works, but I have some questions; maybe I don't understand Hadoop and Spark very well.
First example: WordCount
I executed the WordCount job and got the result in two files (part-00000 and part-00001). part-00000 is available on slave4 and slave1, and part-00001 on slave3 and slave4.
Why is there no part on slave2? Is that normal?
When I look at the application_ID, I see that only 1 slave did the job.
Why is my task not well distributed over my cluster?
Second example: SparkSQL
In this case I don't save any file, because I just want to return an SQL result, but again only 1 slave node does the work.
So why does only 1 slave node do the work while my cluster seems to be working fine?
The command line I use to execute this is:
time ./spark/bin/spark-submit --master yarn --deploy-mode cluster /home/valentin/SparkCount.py
Thank you!
spark.executor.instances defaults to 2
You need to increase this value to have more executors running at once
You can also tweak the cores and memory allocated to each executor. As far as I know, there is no magic formula.
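For the manual route, a rough sketch for the cluster described above (4 slaves with 12 GB RAM and 16 cores each) could look like the following; the exact numbers are assumptions and depend on how much memory and how many vcores YARN is allowed to hand out per node:
time ./spark/bin/spark-submit --master yarn --deploy-mode cluster \
  --num-executors 4 --executor-cores 4 --executor-memory 4g \
  /home/valentin/SparkCount.py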
If you don't want to specify these values by hand, I suggest reading the section on Dynamic Allocation in the Spark documentation.

Why does Spark UI show only 6 cores available per worker while I have more?

Why does the Spark UI show only 6 cores available per worker (not the number of cores used) while I have 16 on each of my 3 machines (8 sockets * 2 cores/socket), or even 32 if I take into account the number of threads per core (2)? I tried to set SPARK_WORKER_CORES in the spark-env.sh file, but it changes nothing (I made the change on all 3 workers). I also commented out the line to see if that changes anything: the number of cores available is always stuck at 6.
I'm using Spark 2.2.0 in a standalone cluster:
pyspark --master spark://myurl:7077
The output of the lscpu command confirms these core counts.
I found that I simply had to stop the master and the slaves and restart them so that the SPARK_WORKER_CORES parameter is picked up.
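For example (a minimal sketch; $SPARK_HOME and the core count of 16 are assumptions based on the machines described above):
# in $SPARK_HOME/conf/spark-env.sh on every worker
export SPARK_WORKER_CORES=16

# then restart the standalone daemons from the master so the new value is read
$SPARK_HOME/sbin/stop-all.sh
$SPARK_HOME/sbin/start-all.sh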

Spark loses all executors one minute after starting

I run pyspark on an 8-node Google Dataproc cluster with default settings.
A few seconds after starting, I see 30 executor cores running (as expected):
>>> sc.defaultParallelism
30
One minute later:
>>> sc.defaultParallelism
2
From that point all actions run on only 2 cores:
>>> rng = sc.parallelize(range(1,1000000))
>>> rng.cache()
>>> rng.count()
>>> rng.getNumPartitions()
2
If I run rng.cache() while the cores are still connected, they stay connected and jobs get distributed.
Checking the monitoring app (port 4040 on the master node) shows that the executors were removed:
Executor 1
Removed at 2016/02/25 16:20:14
Reason: Container container_1456414665542_0006_01_000002 exited from explicit termination request.
Is there some setting that could keep cores connected without workarounds?
For the most part, what you are seeing is just the difference between how Spark on YARN is configured and Spark standalone. At the moment, YARN's reporting of "VCores Used" doesn't actually correspond to a real container reservation of cores; containers are based purely on the memory reservation.
Overall there are a few things at play here:
Dynamic allocation causes Spark to relinquish idle executors back to YARN, and unfortunately at the moment Spark prints that spammy but harmless "lost executor" message. This was the classic problem of Spark on YARN, where Spark originally paralyzed the clusters it ran on because it would grab the maximum number of containers it thought it needed and then never give them up.
With dynamic allocation, when you start a long job, Spark quickly allocates new containers (with something like an exponential ramp-up so it can fill a full YARN cluster within a couple of minutes), and when idle, it relinquishes executors with the same ramp-down at an interval of about 60 seconds (if idle for 60 seconds, relinquish some executors).
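If you just want idle executors to stick around longer rather than disabling the feature entirely, that idle timeout itself is configurable; spark.dynamicAllocation.executorIdleTimeout defaults to 60s. For example (the 600s value is just an assumption):
spark-shell --conf spark.dynamicAllocation.executorIdleTimeout=600s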
If you want to disable dynamic allocation you can run:
spark-shell --conf spark.dynamicAllocation.enabled=false
gcloud dataproc jobs submit spark --properties spark.dynamicAllocation.enabled=false --cluster <your-cluster> foo.jar
Alternatively, if you specify a fixed number of executors, it should also automatically disable dynamic allocation:
spark-shell --conf spark.executor.instances=123
gcloud dataproc jobs submit spark --properties spark.executor.instances=123 --cluster <your-cluster> foo.jar

Spark cluster only using master

I have a Spark 1.1.0 cluster with three machines of differing power. When I run the start-all.sh script and check the UI, I have all the slaves and the master listed. Each worker is listed with its number of cores shown correctly (they have differing numbers of cores), but with a notice that zero are used.
cores
4 (0 Used)
2 (0 Used)
8 (8 Used)
SSH is set up and working, and Hadoop seems fine too. The 8-core machine is the master, so any submitted job runs only there. I see it being executed in the web UI, but the other workers are never given work.
What might be happening here is that the Total_Input_File_Size is less than the MAX_SPLIT_SIZE, so only one mapper is created, and it executes only on the master.
The number of mappers generated is Total_Input_File_Size / MAX_SPLIT_SIZE. So if you gave it a small file, try a larger input file or lower the value of MAX_SPLIT_SIZE.
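To check whether that is the case (a hedged sketch; the input path below is a placeholder), compare the size of the input file with the HDFS block size, which is what the split size is derived from by default:
# size of the input file(s)
hdfs dfs -du -h /path/to/input
# default block/split size of the cluster
hdfs getconf -confKey dfs.blocksize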
Let me know if the problem is anything else.
Have you set --deploy-mode cluster in your spark-submit command?
If you leave this option out, the application will not be distributed to the other workers.
