Spark standalone cluster, job running on one executor - apache-spark

I have a small cluster of 3 nodes with 12 total cores and 44 GB of memory. I am reading a small text file from HDFS (5 MB) and running the k-means algorithm on it. I set the number of executors to 3 and partitioned my text file into three partitions. The application UI shows that only one of the executors is running all the tasks.
Here is a screenshot of the application GUI:
[screenshot omitted]
And here is the Jobs UI:
[screenshot omitted]
Can somebody help me figure out why my tasks are all running in one executor while others are idle? Thanks.

Try to repartition your file into 12 partitions. If you have 3 partitions and each node has 4 cores, that is enough to run all tasks on 1 node. Spark roughly splits the work as 1 partition per core.
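A minimal PySpark sketch of that idea, assuming sc is the active SparkContext and the HDFS path is illustrative:

points = sc.textFile("hdfs:///data/points.txt", minPartitions=12)  # hypothetical path
print(points.getNumPartitions())   # should now be at least 12

# Or repartition an existing RDD explicitly before running k-means:
points = points.repartition(12)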

Related

Spark UI on Google Dataproc: numbers interpretation

I'm running a Spark job on a Google Dataproc cluster (3 n1-highmem-4 nodes, so 4 cores and 26 GB each, with the master being the same type).
I have a few questions about the information displayed in the Hadoop and Spark UIs:
1) When I check the Hadoop UI I get this:
My question here is: my total RAM is supposed to be 84 GB (3 x 26), so why is only 60 GB displayed here? Is 24 GB used for something else?
2) This is the screen showing the currently launched executors.
My questions are:
Why are only 10 cores used? Shouldn't we be able to launch a 6th executor using the 2 remaining cores, since we have 12 and 2 seem to be used per executor?
Why 2 cores per executor? Does it change anything if we run 12 executors with 1 core each instead?
What is the "Input" column? The total volume each executor received to analyze?
3) This is a screenshot of the "Storage" panel. I see the dataframe I'm working on.
I don't understand the "Size in Memory" column. Is it the total RAM used to cache the dataframe? It seems very low compared to the size of the raw files I load into the DataFrame (500 GB+). Is that a wrong interpretation?
Thanks to anyone who reads this!
If you take a look at this answer, it mostly answers your questions 1 and 2.
To sum up, the total memory is less because some memory is reserved to run the OS and system daemons, or the Hadoop daemons themselves, e.g. the NameNode and NodeManager.
It is similar for cores: in your case there are 3 nodes, each node runs 2 executors, and each executor uses 2 cores, except on the node where the application master lives. On that node there is only one executor, and the remaining cores are given to the application master. That's why you see only 5 executors and 10 cores.
For your 3rd question, that number should be the memory used by the partitions cached in that RDD, which is approximately equal to the memory allocated to each executor (in your case ~13 GB).
Note that Spark doesn't load your 500 GB of data at once; instead it loads data in partitions, and the number of concurrently loaded partitions depends on the number of cores you have available.
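As a rough illustration, here is a sketch assuming a PySpark session and an illustrative input path; persisting is lazy, so only the partitions an action actually touches show up under "Size in Memory":

from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.appName("storage-tab-demo").getOrCreate()

# Illustrative path; on disk this dataset may be hundreds of GB.
df = spark.read.parquet("hdfs:///data/events")
df.persist(StorageLevel.MEMORY_AND_DISK)

# Only the partitions pulled through by an action are actually cached,
# so a narrow action keeps the "Size in Memory" figure small.
df.limit(10).collect()

print(df.rdd.getNumPartitions())   # how many partitions the data is split into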

Increase the Spark workers' cores

I have installed Spark on a master and 2 workers. The original number of cores per worker is 8. When I start the master, the workers work properly without any problem, but the problem is that in the Spark GUI each worker has only 2 cores assigned.
How can I increase the number of cores so that each worker runs with 8 cores?
The setting that controls cores per executor is spark.executor.cores (see the docs). It can be set either via a spark-submit command-line argument or in spark-defaults.conf. The file is usually located in /etc/spark/conf (YMMV). You can search for the conf file with find / -type f -name spark-defaults.conf
spark.executor.cores 8
However, the setting does not guarantee that each executor will always get all the available cores. This depends on your workload.
If you schedule tasks on a DataFrame or RDD, Spark will run a parallel task for each partition of the DataFrame. A task is scheduled to an executor (a separate JVM), and the executor can run multiple tasks in parallel in JVM threads, one per core.
Also, an executor will not necessarily run on a separate worker. If there is enough memory, 2 executors can share a worker node.
In order to use all the cores, the setup in your case could look as follows, given you have 10 GB of memory on each node:
spark.default.parallelism 14
spark.executor.instances 2
spark.executor.cores 7
spark.executor.memory 9g
Setting the memory to 9g makes sure each executor is assigned to a separate node. Each executor will have 7 cores available, and each DataFrame operation will be scheduled as 14 concurrent tasks, distributed 7 to each executor. You can also repartition a DataFrame instead of setting default.parallelism. One core and 1 GB of memory are left for the operating system.
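The same settings could also be applied programmatically; a sketch, assuming a PySpark application and an illustrative standalone master URL:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .master("spark://master:7077")              # assumption: your standalone master URL
    .appName("use-all-cores")
    .config("spark.default.parallelism", "14")
    .config("spark.executor.instances", "2")
    .config("spark.executor.cores", "7")
    .config("spark.executor.memory", "9g")
    .getOrCreate())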

Where does a Spark job run in a cluster of 2 nodes when the spark-submit configuration can easily be accommodated on a single node? (cluster mode)

A Spark cluster has 2 worker nodes.
Node 1: 64 GB, 8 cores.
Node 2: 64 GB, 8 cores.
Now if I submit a Spark job using spark-submit in cluster mode with 2 executors, each with 32 GB of memory and 4 cores,
my question is: as the above configuration can be accommodated on a single node itself, will Spark run it using 2 worker nodes or just one node?
Also, if the available cores are not an exact multiple of the cores per executor, how many cores are allocated to each executor?
Example: the number of cores available in a node after excluding one core for the YARN daemon is 7. Since there are 2 nodes, 2 * 7 = 14 cores are available in total, and HDFS gives good throughput if the number of cores per executor is 5.
Now 14 / 5 gives the number of executors. Should I consider 14 / 5 as 2 or 3 executors? And how are these cores then distributed equally?
It is more of a resource-manager question than a Spark question, but in your case the 2 executors can't run on a single machine, because the OS has an overhead that uses at least 1 core and 1 GB of RAM, even if you set the memory to 30 GB and 3 cores per executor. They will run on different nodes because Spark tries to get the best data locality it can, so obviously it won't use the same node for 2 executors.
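For the sizing arithmetic in the question, a quick plain-Python sketch: an executor's cores must all come from one node, so you round down per node rather than dividing the cluster total.

nodes = 2
cores_per_node = 7                    # after reserving one core for the YARN daemon
cores_per_executor = 5

executors_per_node = cores_per_node // cores_per_executor            # floor(7 / 5) = 1
executors = nodes * executors_per_node                               # 2 executors, not 3
unused = nodes * (cores_per_node - executors_per_node * cores_per_executor)  # 2 * 2 = 4 cores idle
print(executors, unused)                                             # 2 4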

How to handle Spark executors when the number of partitions does not match the number of executors?

Let's say I have 3 executors and 4 partitions, and we assume these numbers cannot be changed.
This is not an efficient setup, because we have to read in 2 passes: in the first pass, we read 3 partitions; and in the second pass, we read 1 partition.
Is there a way in Spark that we can improve the efficiency without changing the number of executors and partitions?
In your scenario you need to update the number of cores.
In Spark, each partition is taken up for execution by one task. As you have 3 executors and 4 partitions, and if you assume you have 3 cores in total, i.e. one core per executor, then 3 partitions of data will run in parallel, and the last partition will be picked up once a core on one of the executors becomes free. To handle this latency we need to increase spark.executor.cores=2, i.e. each executor can run 2 threads, i.e. 2 tasks, at a time.
So all your partitions will be executed in parallel, but there is no guarantee whether 1 executor will run 2 tasks and the other 2 executors will run one task each, or whether 2 executors will run 2 tasks each while the remaining executor stays idle.
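A sketch of how that setting could be passed, assuming a PySpark session; the names and numbers simply mirror the scenario above:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("four-partitions-three-executors")
    .config("spark.executor.instances", "3")   # fixed at 3, as in the question
    .config("spark.executor.cores", "2")       # 2 tasks per executor -> 6 slots for 4 partitions
    .getOrCreate())

df = spark.range(0, 1_000_000).repartition(4)  # 4 partitions, as in the question
df.count()                                     # all 4 partitions can now run in one wave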

Spark not using all available cores on a Node in Standalone Cluster

I'm running a small cluster with 1 separate master and 1 slave node (with 8 VCores). I launch the cluster via /sbin/start-all.sh and then attach pyspark to it with /bin/pyspark --master spark://<master ip>:7077.
Now in the web UI everything seems OK: my worker is registered with the master and I have 8 cores available. The pyspark shell also got all 8 cores.
I have a small RDD consisting of 14 rows, each row containing a string pointing to a compressed text file.
import gzip

def open_gzip(filepath):
    # Read the whole compressed file and split it on Windows-style line endings
    with gzip.open(filepath, 'rb') as f:
        file_content = f.read()
    return file_content.split(b'\r\n')

wat_paths_rdd = sc.textFile('./file.paths')
wat_rdd = wat_paths_rdd.flatMap(open_gzip)
Now when I run this code, I can see in htop that only 2 cores on my worker node are utilized when flatMap is invoked.
I have tried to set the following parameters on both slave and master, to no avail:
in /conf/spark-defaults.conf
spark.cores.max 8
spark.executor.cores 8
even though I can set
spark.executor.memory 14500m
in /conf/spark-env.sh
export SPARK_WORKER_CORES=8
I'm a bit at a loss here; in my previous config, where I ran everything off one machine, spark.cores.max 8 was enough.
The number of cores utilised depends on the number of tasks, which in turn depends on the number of partitions of your RDD. Please check
rdd.getNumPartitions
If it is 2, then you need to increase the number of partitions to 2-3 times the number of cores, using
rdd.repartition
or by asking for more partitions at the start, when you read/parallelize your file.
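A minimal sketch of the check and the fix, reusing the names from the code in the question above:

# Check how the 14-line paths file was split:
print(wat_paths_rdd.getNumPartitions())              # likely far fewer than 8

# Either repartition to roughly 2-3x the available cores...
wat_paths_rdd = wat_paths_rdd.repartition(16)

# ...or ask for more partitions when reading the file in the first place:
wat_paths_rdd = sc.textFile('./file.paths', minPartitions=16)

wat_rdd = wat_paths_rdd.flatMap(open_gzip)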
