Spark with Hadoop YARN: use all the cluster nodes

I'm using Spark with HDFS for storage and YARN. My cluster contains 5 nodes (1 master and 4 slaves).
Master node: 48 GB RAM - 16 CPU cores
Slave nodes: 12 GB RAM - 16 CPU cores
I'm running two different jobs on two different files: a WordCount and a SparkSQL query. Everything works, but I have some questions; maybe I don't understand Hadoop and Spark very well.
First example: WordCount
I ran the WordCount job and got the result in two files (part-00000 and part-00001). Part-00000 is available on slave4 and slave1, and part-00001 on slave3 and slave4.
Why is there no part on slave2? Is that normal?
When I look at the application_ID, I see that only 1 slave did the work:
Why isn't my task distributed evenly across my cluster?
Second example: SparkSQL
In this case I don't save a file because I just want to return an SQL result, but again only 1 slave node does the work.
So why does only 1 slave node perform the task when I have a cluster that seems to be working fine?
The command line I use to execute this is:
time ./spark/bin/spark-submit --master yarn --deploy-mode cluster /home/valentin/SparkCount.py
Thank you!

spark.executor.instances defaults to 2.
You need to increase this value to have more executors running at once.
You can also tweak the cores and memory allocated to each executor. As far as I know, there is no magic formula.
If you don't want to specify these values by hand, I'd suggest reading the section on Dynamic Allocation in the Spark documentation.
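As a rough sketch (the numbers are illustrative, not tuned for your 12 GB slaves), these settings can be set in SparkCount.py when the SparkContext is created, or passed equivalently as --num-executors, --executor-cores and --executor-memory flags to spark-submit:

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("SparkCount")
        .set("spark.executor.instances", "4")   # one executor per slave instead of the default 2
        .set("spark.executor.cores", "4")       # cores each executor may use
        .set("spark.executor.memory", "8g"))    # heap per executor; leave headroom for the YARN overhead
sc = SparkContext(conf=conf)

With four executors, one per slave, YARN can schedule tasks on all of slave1-slave4 instead of packing the default two executors onto one or two nodes.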

Related

Spark UI on Google Dataproc: numbers interpretation

I'm running a Spark job on a Google Dataproc cluster (3 nodes, n1-highmem-4, so 4 cores and 26 GB each, same type for the master).
I have a few questions about the information displayed in the Hadoop and Spark UIs:
1)
When I check the Hadoop UI I get this:
My question here is: my total RAM should be 78 GB (3x26), so why is only 60 GB displayed here? Are the remaining ~18 GB used for something else?
2)
This is the screen showing currently launched executors.
My questions are:
Why are only 10 cores used? Shouldn't we be able to launch a 6th executor using the 2 remaining cores, since we have 12 and each executor seems to use 2?
Why 2 cores per executor? Would it change anything to run 12 executors with 1 core each instead?
What is the "Input" column? The total volume of data each executor received to analyze?
3)
This is a screenshot of the "Storage" panel. I see the dataframe I'm working on.
I don't understand the "Size in Memory" column. Is it the total RAM used to cache the dataframe? It seems very low compared to the size of the raw files I load into the dataframe (500 GB+). Is my interpretation wrong?
Thanks to anyone who reads this!
If you take a look at this answer, it mostly covers your questions 1 and 2.
To sum up, the total memory is less because some memory is reserved for the OS and for system daemons or the Hadoop daemons themselves, e.g. the NameNode and NodeManager.
It is similar for cores: in your case there are 3 nodes, each node runs 2 executors, and each executor uses 2 cores, except on the node where the application master lives. That node runs only one executor, and the remaining cores are given to the application master. That's why you see only 5 executors and 10 cores.
For your 3rd question, that number should be the memory used by the partitions of that RDD, which is approximately equal to the memory allocated to each executor, in your case ~13 GB.
Note that Spark doesn't load your 500 GB of data at once; it loads data in partitions, and the number of concurrently loaded partitions depends on the number of cores you have available.
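Here is a minimal PySpark sketch of how this plays out (the input path is hypothetical; the 5 executors x 2 cores figure comes from your executors screen):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-check").getOrCreate()

# hypothetical input standing in for the ~500 GB of raw files from the question
df = spark.read.json("gs://my-bucket/raw/*.json.gz")
print(df.rdd.getNumPartitions())   # how many partitions Spark split the input into

df.cache()
df.count()   # materializes the cache; the "Size in Memory" column in the Storage tab
             # only reflects the partitions currently held in executor memory,
             # not the full raw input

# At any moment, at most (executors x cores per executor) partitions are being
# processed, i.e. 5 x 2 = 10 concurrent tasks on this cluster.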

Why Spark applications are not running on all nodes

I installed the following spark benchmark:
https://github.com/BBVA/spark-benchmarks
I run Spark on top of YARN on 8 workers but I only get 2 running executors during the job (TestDFSIO).
I also set executor-cores to 9, but still only 2 executors are running.
Why would that happen?
I think the problem comes from YARN, because I get an almost identical issue with TestDFSIO on Hadoop: at the beginning of the job only two nodes run, but then all the nodes execute the application in parallel!
Note that I am not using HDFS for storage!
I solved this issue by setting the number of cores per executor to 5 (--executor-cores) and the total number of executors to 23 (--num-executors), which had defaulted to 2.

Need help in understanding pyspark execution on yarn as Master

I already have some picture of the YARN architecture as well as the Spark architecture. But when I try to understand them together (that is, what happens when a Spark job runs on YARN as master) on a Hadoop cluster, I get confused. So first I will describe my understanding with the example below, and then I will come to my confusions.
Say I have a file "orderitems" stored on HDFS with some replication factor.
Now I am processing the data by reading this file into a Spark RDD (say, for calculating order revenue).
I have written the code and configured spark-submit as given below:
spark-submit \
--master yarn \
--conf spark.ui.port=21888 \
--num-executors 2 \
--executor-memory 512M \
src/main/python/order_revenue.py
Let's assume that I have created the RDD with 5 partitions and that I have executed in yarn-client mode.
Now, as per my understanding, once I submit the Spark job on YARN:
The request goes to the Application Manager, which is a component of the Resource Manager.
The Application Manager finds a Node Manager and asks it to launch a container.
This is the first container of the application, and we call it the Application Master.
The Application Master takes over responsibility for executing and monitoring the job.
Since I have submitted in client mode, the driver program will run on my edge/gateway node.
I have set num-executors to 2 and executor memory to 512 MB.
I have also set the number of partitions for the RDD to 5, which means it will create 5 partitions of the data read and distribute them over 5 nodes.
Now here are my confusions about this:
I have read in the user guide that the partitions of an RDD are distributed to different nodes. Are these nodes the same as the 'data nodes' of the HDFS cluster? I mean, here there are 5 partitions; does this mean they are on 5 data nodes?
I have set num-executors to 2, so these 5 partitions of data will use 2 executors (CPUs). My next question is: where will these 2 executors be picked from? I mean, the 5 partitions are on 5 nodes, right? So are these 2 executors also on some of those nodes?
The scheduler is responsible for allocating resources to the various running applications, subject to constraints of capacities, queues, etc. Also, a container is a Linux control group (cgroup), a Linux kernel feature that allows users to allocate CPU, memory, disk I/O and bandwidth to a user process. So my final question is: are containers actually provided by the "scheduler"?
I am confused here. I have referred to architecture diagrams, release documents and some videos and got mixed up.
Hoping for some helping hands here.
To answer your questions first:
1) Very simply, an executor is a Spark worker process and the driver is the manager process; they have nothing to do with Hadoop nodes. Think of executors as processing units (say 2 here): repartition(5) divides the data into 5 chunks to be processed by these 2 executors, and on some basis those chunks will be divided amongst the 2 executors. Repartitioning data does not create nodes.
Spark cluster architecture:
Spark on yarn client mode:
Spark on yarn cluster mode:
For other details you can read the blog posts https://sujithjay.com/2018/07/24/Understanding-Apache-Spark-on-YARN/
and https://0x0fff.com/spark-architecture/
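To make point 1 concrete, here is a minimal sketch of the order-revenue job under the submit options above (the HDFS path and the column index are illustrative):

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("order_revenue")
sc = SparkContext(conf=conf)   # --master yarn, --num-executors 2 etc. come from spark-submit

# 5 partitions means 5 tasks, not 5 nodes: the tasks are scheduled onto the
# 2 executors, which are JVM processes inside YARN containers, not HDFS data nodes
order_items = sc.textFile("/user/hadoop/orderitems", minPartitions=5)
print(order_items.getNumPartitions())

# illustrative: assume the order item subtotal is the 5th comma-separated field
revenue = order_items.map(lambda line: float(line.split(",")[4])).sum()
print(revenue)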

Spark not using all available cores on a Node in Standalone Cluster

I'm running a small cluster with 1 separate master and 1 slave node (with 8 vcores). I launch the cluster via /sbin/start-all.sh and then attach pyspark to it with /bin/pyspark --master spark://<master ip>:7077
Now in the web UI everything seems OK: my worker is registered with the master and I have 8 cores available. The pyspark shell also got all 8 cores.
I have a small RDD consisting of 14 rows, each row containing a string that points to a compressed text file.
import gzip

def open_gzip(filepath):
    with gzip.open(filepath, 'rb') as f:
        file_content = f.read()
    return file_content.split(b'\r\n')

wat_paths_rdd = sc.textFile('./file.paths')
wat_rdd = wat_paths_rdd.flatMap(open_gzip)
Now when I try to run this code, I can see in htop that only 2 cores on my worker node are utilized when flatMap is invoked.
I have tried to set the following parameters on both the slave and the master, to no avail:
in /conf/spark-defaults.conf
spark.cores.max 8
spark.executor.cores 8
even though I can set
spark.executor.memory 14500m
in /conf/spark-env.sh
export SPARK_WORKER_CORES=8
I'm a bit at a loss here; in my previous setup, where I ran everything on one machine, spark.cores.max 8 was enough.
The number of cores utilised depends on the number of tasks, which in turn depends on the number of partitions of your RDD. Please check
rdd.getNumPartitions()
If it returns 2, then you need to increase the number of partitions to 2-3 times the number of cores, either with
rdd.repartition(numPartitions)
or up front when you read in or parallelize your file, as in the sketch below.
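A rough sketch under the asker's setup, reusing sc and open_gzip from the question (the repartition count of 16 is illustrative, roughly 2x the 8 available cores):

wat_paths_rdd = sc.textFile('./file.paths')
print(wat_paths_rdd.getNumPartitions())   # for a 14-line file this is often just 2

# spread the 14 file paths over more partitions so that more tasks (and cores) run at once
wat_rdd = wat_paths_rdd.repartition(16).flatMap(open_gzip)
wat_rdd.count()   # now up to 8 tasks can run concurrently on the 8-core worker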

Spark on YARN resource manager: Relation between YARN Containers and Spark Executors

I'm new to Spark on YARN and don't understand the relation between YARN containers and Spark executors. I tried out the following configuration, based on the results of the yarn-utils.py script, which can be used to find an optimal cluster configuration.
The Hadoop cluster (HDP 2.4) I'm working on:
1 Master Node:
CPU: 2 CPUs with 6 cores each = 12 cores
RAM: 64 GB
SSD: 2 x 512 GB
5 Slave Nodes:
CPU: 2 CPUs with 6 cores each = 12 cores
RAM: 64 GB
HDD: 4 x 3 TB = 12 TB
HBase is installed (this is one of the parameters for the script below)
So I ran python yarn-utils.py -c 12 -m 64 -d 4 -k True (c=cores, m=memory, d=hdds, k=hbase-installed) and got the following result:
Using cores=12 memory=64GB disks=4 hbase=True
Profile: cores=12 memory=49152MB reserved=16GB usableMem=48GB disks=4
Num Container=8
Container Ram=6144MB
Used Ram=48GB
Unused Ram=16GB
yarn.scheduler.minimum-allocation-mb=6144
yarn.scheduler.maximum-allocation-mb=49152
yarn.nodemanager.resource.memory-mb=49152
mapreduce.map.memory.mb=6144
mapreduce.map.java.opts=-Xmx4915m
mapreduce.reduce.memory.mb=6144
mapreduce.reduce.java.opts=-Xmx4915m
yarn.app.mapreduce.am.resource.mb=6144
yarn.app.mapreduce.am.command-opts=-Xmx4915m
mapreduce.task.io.sort.mb=2457
I made these settings via the Ambari interface and restarted the cluster. The values also roughly match what I had calculated manually before.
I now have problems:
finding the optimal settings for my spark-submit parameters --num-executors, --executor-cores & --executor-memory;
understanding the relation between the YARN containers and the Spark executors;
understanding the hardware information in my Spark History UI (less memory is shown than I set, when extrapolated to overall memory by multiplying by the number of worker nodes);
understanding the concept of vcores in YARN (I couldn't find any useful examples for this yet).
However, I found the post What is a container in YARN?, but it didn't really help, as it doesn't describe the relation to the executors.
Can someone help to solve one or more of the questions?
I will report my insights here step by step:
The first important thing is this fact (source: this Cloudera documentation):
When running Spark on YARN, each Spark executor runs as a YARN container. [...]
This means the number of containers will always be the same as the number of executors created by a Spark application, e.g. via the --num-executors parameter of spark-submit (plus one container for the YARN application master).
Every container always allocates at least the amount of memory set by yarn.scheduler.minimum-allocation-mb. This means that if --executor-memory is set to e.g. only 1g but yarn.scheduler.minimum-allocation-mb is e.g. 6g, the container is much bigger than the Spark application needs.
The other way round, if --executor-memory is set to something higher than the yarn.scheduler.minimum-allocation-mb value, e.g. 12g, the container will allocate more memory, but only if the requested amount of memory is smaller than or equal to the yarn.scheduler.maximum-allocation-mb value.
The value of yarn.nodemanager.resource.memory-mb determines how much memory can be allocated in total by all containers on one host!
=> So setting yarn.scheduler.minimum-allocation-mb to a lower value allows you to run smaller containers, e.g. for smaller executors (otherwise memory would be wasted).
=> Setting yarn.scheduler.maximum-allocation-mb to the maximum value (e.g. equal to yarn.nodemanager.resource.memory-mb) allows you to define bigger executors (more memory is allocated if requested, e.g. via the --executor-memory parameter).
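To illustrate how these settings interact, here is a rough back-of-the-envelope sketch in Python. It assumes the default Spark-on-YARN memory overhead of max(384 MB, 10% of the executor memory) and a scheduler that normalizes requests up to a multiple of the minimum allocation (as the Capacity Scheduler does); check your own scheduler's rounding rule.

# values from the yarn-utils.py output above
min_alloc_mb = 6144     # yarn.scheduler.minimum-allocation-mb
max_alloc_mb = 49152    # yarn.scheduler.maximum-allocation-mb
node_mb      = 49152    # yarn.nodemanager.resource.memory-mb

def container_size_mb(executor_memory_mb):
    # approximate size of the YARN container backing one Spark executor
    overhead_mb = max(384, int(0.10 * executor_memory_mb))   # default spark.yarn.executor.memoryOverhead
    requested_mb = executor_memory_mb + overhead_mb
    # the scheduler rounds the request up to a multiple of the minimum allocation
    granted_mb = ((requested_mb + min_alloc_mb - 1) // min_alloc_mb) * min_alloc_mb
    if granted_mb > max_alloc_mb:
        raise ValueError("request exceeds yarn.scheduler.maximum-allocation-mb")
    return granted_mb

print(container_size_mb(1 * 1024))               # --executor-memory 1g  -> 6144 MB container, mostly wasted
print(container_size_mb(12 * 1024))              # --executor-memory 12g -> 18432 MB container
print(node_mb // container_size_mb(12 * 1024))   # only 2 such executors fit per worker node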
