Spark number of cores used - apache-spark

I have a very simple Spark job which reads a million movie ratings and reports each rating value and the number of times it occurs.
The job runs on the Spark cluster and it is running fine.
I have a couple of questions about the parameters I use to run the job.
I have 2 nodes running:
Node-1 = 24GB RAM & 8 vCPUs.
Node-2 = 8GB RAM & 2 vCPUs.
So in total I have 32GB RAM and 10 vCPUs.
The spark-submit command:
spark-submit --master spark://hadoop-master:7077 --executor-memory 4g --num-executors 4 --executor-cores 4 /home/hduser/ratings-counter.py
1. When I run the above command, which cores does Spark use? Are they from node-1 or node-2, or does it allocate them randomly?
2. If I don't set the number of executors, how many executors does Spark use by default?
from pyspark import SparkConf, SparkContext
import collections
conf = SparkConf().setMaster("hadoop-master").setAppName("RatingsHistogram")
sc = SparkContext(conf = conf)
lines = sc.textFile("hdfs://hadoop-master:8020/user/hduser/gutenberg/ml-10M100K/ratings.dat")
ratings = lines.map(lambda x: x.split('::')[2])
result = ratings.countByValue()
sortedResults = collections.OrderedDict(sorted(result.items()))
for key, value in sortedResults.items():
    print("%s %i" % (key, value))

Is it from node-1 or node-2, or does it allocate randomly?
It really depends on how many workers you have initialized. Since in your spark-submit command you specified a total of 4 executors, each executor will allocate 4 GB of memory and 4 cores from the Spark worker's total memory and cores. One easy way to see on which node each executor was started is to check Spark's Master UI (the default port is 8080) and select your running app from there. Then you can check the Executors tab within the application's UI.
If I don't set the number of executors, how many executors does Spark use by default?
Usually, it initializes one executor per worker instance and uses all of the worker's resources.
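If you prefer to check programmatically instead of through the UI, the driver also serves a monitoring REST API that lists the executors and the hosts they landed on. Below is a minimal Python sketch; it assumes the driver is running on hadoop-master and that its application UI is on the default port 4040, which may differ in your setup.
# Minimal sketch: list executors and the hosts they run on via the
# Spark monitoring REST API. Assumes the driver runs on hadoop-master
# and its application UI is reachable on the default port 4040.
import json
from urllib.request import urlopen

base = "http://hadoop-master:4040/api/v1"

# There is normally one running application per driver UI.
apps = json.load(urlopen(f"{base}/applications"))
app_id = apps[0]["id"]

for ex in json.load(urlopen(f"{base}/applications/{app_id}/executors")):
    # 'hostPort' tells you whether the executor landed on node-1 or node-2.
    print(ex["id"], ex["hostPort"], "cores:", ex["totalCores"])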

Related

Spark on Yarn Number of Cores in EMR Cluster

I have an EMR cluster for Spark with the below configuration of 2 instances.
r4.2xlarge
8 vCores
So in total I have 16 vCores, and the same is reflected in the YARN vCores display.
I have submitted a Spark streaming job with the parameters --num-executors 2 --executor-cores 5. So I was assuming it would use 2*5 = 10 vCores in total for the executors, but it is only using 2 cores in total from the cluster (+1 for the driver).
And in Spark, the job is still running with 10 (2*5) parallel tasks. It seems like it is just running 5 threads within each executor.
I have read in different questions and in the documentation that --executor-cores uses actual vCores, but here it is only running the tasks as threads.
Is my understanding correct here?
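One way to sanity-check this from inside the job (a rough, non-EMR-specific sketch, not part of the original question) is to print what Spark itself thinks it has been given; on YARN with the default memory-only resource calculator, the vCores shown in the YARN UI can stay at 1 per container even though Spark runs --executor-cores task threads per executor.
# Rough sketch: print the executor settings Spark sees and its default
# parallelism. The fallback values ("2", "1") are just defaults, not
# recommendations.
from pyspark import SparkConf, SparkContext

sc = SparkContext(conf=SparkConf().setAppName("core-check"))
conf = sc.getConf()
print("spark.executor.instances:", conf.get("spark.executor.instances", "2"))
print("spark.executor.cores:", conf.get("spark.executor.cores", "1"))
# Roughly executors * cores per executor on YARN, i.e. how many tasks
# run concurrently regardless of what YARN's vCore display shows.
print("defaultParallelism:", sc.defaultParallelism)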

Why is Spark utilizing only one core per executor? How does it decide to utilize cores, other than by the number of partitions?

I am running Spark in an HPC environment on Slurm, using Spark standalone mode with Spark version 1.6.1. The problem is that my Slurm node is not fully used in Spark standalone mode. I am using spark-submit in my Slurm script. There are 16 cores available on a node and I get all 16 cores per executor, as I can see in the Spark UI. But only one core per executor is actually utilized. The top + 1 command on the worker node, where the executor process is running, shows that only one CPU is being used out of the 16 CPUs. I have 255 partitions, so partitions do not seem to be the problem here.
$SPARK_HOME/bin/spark-submit \
--class se.uu.farmbio.vs.examples.DockerWithML \
--master spark://$MASTER:7077 \
--executor-memory 120G \
--driver-memory 10G \
When I change the script to
$SPARK_HOME/bin/spark-submit \
--class se.uu.farmbio.vs.examples.DockerWithML \
--master local[*] \
--executor-memory 120G \
--driver-memory 10G \
I see 0 cores allocated to the executor in the Spark UI, which is understandable because we are no longer using Spark standalone cluster mode. But now all the cores are utilized when I check with top + 1 on the worker node, which hints that the problem is not with the application code but with how Spark standalone mode utilizes the resources.
So how does Spark decide to use one core per executor when it has 16 cores and also has enough partitions? What can I change so it utilizes all the cores?
I am using spark-on-slurm for launching the jobs.
The Spark configurations in both cases are as follows:
--master spark://MASTER:7077
(spark.app.name,DockerWithML)
(spark.jars,file:/proj/b2015245/bin/spark-vs/vs.examples/target/vs.examples-0.0.1-jar-with-dependencies.jar)
(spark.app.id,app-20170427153813-0000)
(spark.executor.memory,120G)
(spark.executor.id,driver)
(spark.driver.memory,10G)
(spark.history.fs.logDirectory,/proj/b2015245/nobackup/eventLogging/)
(spark.externalBlockStore.folderName,spark-75831ca4-1a8b-4364-839e-b035dcf1428d)
(spark.driver.maxResultSize,2g)
(spark.executorEnv.OE_LICENSE,/scratch/10230979/SureChEMBL/oe_license.txt)
(spark.driver.port,34379)
(spark.submit.deployMode,client)
(spark.driver.host,x.x.x.124)
(spark.master,spark://m124.uppmax.uu.se:7077)
--master local[*]
(spark.app.name,DockerWithML)
(spark.app.id,local-1493296508581)
(spark.externalBlockStore.folderName,spark-4098cf14-abad-4453-89cd-3ce3603872f8)
(spark.jars,file:/proj/b2015245/bin/spark-vs/vs.examples/target/vs.examples-0.0.1-jar-with-dependencies.jar)
(spark.driver.maxResultSize,2g)
(spark.master,local[*])
(spark.executor.id,driver)
(spark.submit.deployMode,client)
(spark.driver.memory,10G)
(spark.driver.host,x.x.x.124)
(spark.history.fs.logDirectory,/proj/b2015245/nobackup/eventLogging/)
(spark.executorEnv.OE_LICENSE,/scratch/10230648/SureChEMBL/oe_license.txt)
(spark.driver.port,36008)
Thanks,
The problem is that you only have one worker node. In Spark standalone mode, one executor is launched per worker instance. To launch multiple executors within a physical worker, you need to launch multiple logical worker instances, which you do by configuring this property:
SPARK_WORKER_INSTANCES
By default, it is set to 1. You can increase it, based on the computation you are doing in your code, to make use of the resources you have.
You want your job to be distributed among executors to utilize the resources properly, but what is happening is that only one executor is getting launched, which cannot utilize the number of cores and the amount of memory you have. So you are not getting the benefit of Spark's distributed computation.
You can set SPARK_WORKER_INSTANCES = 5
and allocate 2 cores per executor, so 10 cores would be utilized properly.
This is how you tune the configuration to get optimum performance.
Try setting spark.executor.cores (default value is 1)
According to the Spark documentation:
the number of cores to use on each executor. For YARN and standalone mode only. In standalone mode, setting this parameter allows an application to run multiple executors on the same worker, provided that there are enough cores on that worker. Otherwise, only one executor per application will run on each worker.
See https://spark.apache.org/docs/latest/configuration.html
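For reference, here is a minimal PySpark sketch of requesting several smaller executors from a standalone master by setting spark.executor.cores (the property quoted above) together with an optional spark.cores.max cap; the master URL, memory and core values are placeholders to adapt to your node.
# Sketch: several 2-core executors per worker instead of one executor
# that grabs all 16 cores. All values below are placeholders.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("spark://MASTER:7077")        # placeholder master URL
        .setAppName("multi-executor-example")
        .set("spark.executor.cores", "2")        # cores per executor
        .set("spark.executor.memory", "14g")     # memory per executor
        .set("spark.cores.max", "16"))           # cap on total cores used

sc = SparkContext(conf=conf)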
In Spark standalone cluster mode you should set --num-executors to cores_per_node * number_of_nodes. For example, if you have 3 nodes with 8 cores per node, you should write --num-executors 24.

Why does a line count job run slower in the Spark shell than as a MapReduce job?

I did a test to compare the performance of Spark and MapReduce. I have a three node cluster with 128GB of memory on each node.
I ran a job to calculate how many lines are in a 10GB file.
I ran the line count job with MapReduce using the default Hadoop configuration. It takes only about 23 seconds.
When I run the line count job in spark-shell with 8GB of memory per node, it takes more than 6 minutes, which really astonishes me.
Here is the command to start spark-shell, along with the code of the Spark job.
spark-shell --master spark://10.8.12.16:7077 --executor-memory 8G
val s= sc.textFile("hdfs://ns/alluxio/linecount/10G.txt")
s.count()
Here are my Spark configuration files:
spark-env.sh
export JAVA_HOME=/home/appadmin/jdk1.8.0_77
export SPARK_HOME=/home/appadmin/spark-2.0.0-bin-without-hadoop
export HADOOP_HOME=/home/appadmin/hadoop-2.7.2
export SPARK_DIST_CLASSPATH=$(/home/appadmin/hadoop-2.7.2/bin/hadoop classpath)
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
export SPARK_LIBARY_PATH=.:$JAVA_HOME/lib:$JAVA_HOME/jre/lib:$HADOOP_HOME/lib/native
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
SPARK_MASTER_HOST=10.8.12.16
SPARK_MASTER_WEBUI_PORT=28686
SPARK_LOCAL_DIRS=/home/appadmin/spark-2.0.0-bin-without-hadoop/sparkdata/local
SPARK_WORKER_MEMORY=10g
SPARK_WORKER_DIR=/home/appadmin/spark-2.0.0-bin-without-hadoop/sparkdata/work
SPARK_LOG_DIR=/home/appadmin/spark-2.0.0-bin-without-hadoop/logs
spark-defaults.conf
spark.driver.memory 5g
spark.eventLog.dir hdfs://10.8.12.16:9000/spark-event-log
You can pass the number of partitions (i.e. override defaultMinPartitions) to adjust the number of partitions, like this:
sc.textFile(file, numPartitions)
.count()
You can also try a repartition after loading to see the effect.
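Concretely, in PySpark this could look like the sketch below; the HDFS path is the one from the question, and the partition count of 64 is purely illustrative.
# Sketch: read the 10GB file with more input partitions so the count can
# use every core, and compare with an explicit repartition (which shuffles).
from pyspark import SparkConf, SparkContext

sc = SparkContext(conf=SparkConf().setAppName("line-count"))
lines = sc.textFile("hdfs://ns/alluxio/linecount/10G.txt", 64)
print("partitions after read:", lines.getNumPartitions())
print("line count:", lines.count())
print("line count after repartition:", lines.repartition(64).count())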
Also, have a look at how-to-tune-your-apache-spark-jobs
You can further debug and adjust settings by printing
sc.getConf.getAll.mkString("\n")
You can also get the number of executors, as in the example snippet below.
/** Method that just returns the current active/registered executors
  * excluding the driver.
  * @param sc The spark context to retrieve registered executors.
  * @return a list of executors, each in the form of host:port.
  */
def currentActiveExecutors(sc: SparkContext): Seq[String] = {
  val allExecutors = sc.getExecutorMemoryStatus.map(_._1)
  val driverHost: String = sc.getConf.get("spark.driver.host")
  allExecutors.filter(! _.split(":")(0).equals(driverHost)).toList
}
sc.getConf.getInt("spark.executor.instances", 1)
getExecutorStorageStatus and getExecutorMemoryStatus both include the driver in the executors they return (hence the filter above).
By default, Spark runs with one executor. I would change the setup to be:
Leave 8GB on each machine for the OS and other process overhead. That leaves you with 120GB. Given that the garbage collector starts to degrade with heaps larger than around 32GB, I'd have 4 executors per machine with 30GB each.
So, I would set:
spark.executor.instances = 12
spark.executor.cores = (number of cores each of your machines has - 1) / 4 (leave 1 for OS)
spark.executor.memory = 30g
Then run your application again.
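As a sketch, those settings could be wired up like this in PySpark; the 16-core figure is purely an assumption for illustration, since the question only states 128GB of memory per node.
# Sketch of the suggested sizing expressed as Spark properties.
# cores_per_machine = 16 is an assumption; substitute your hardware.
from pyspark import SparkConf, SparkContext

cores_per_machine = 16
executor_cores = (cores_per_machine - 1) // 4    # leave one core for the OS

conf = (SparkConf()
        .setAppName("line-count-tuned")
        .set("spark.executor.instances", "12")   # 4 executors x 3 nodes
        .set("spark.executor.cores", str(executor_cores))
        .set("spark.executor.memory", "30g"))

sc = SparkContext(conf=conf)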

Spark num-executors

I have set up a 10 node HDP platform on AWS. Below is my configuration:
2 servers - Name Node and Standby Name Node
7 Data Nodes, each with 40 vCPUs and 160 GB of memory.
I am trying to calculate the number of executors when submitting Spark applications, and after going through different blogs I am confused about what this parameter actually means.
Looking at the blog below, it seems that num-executors is the total number of executors across all nodes:
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
But looking at the blog below, it seems that num-executors is per node or server:
https://blogs.aws.amazon.com/bigdata/post/Tx578UTQUV7LRP/Submitting-User-Applications-with-spark-submit
Can anyone please clarify and review the below:
Is the num-executors value per node, or is it the total number of executors across all the data nodes?
I am using the below calculation to come up with the core count, executor count and memory per executor:
Number of cores <= 5 (assuming 5)
Num executors = (40-1)/5 = 7
Memory = (160-1)/7 = 22 GB
With the above calculation, which would be the correct way?
--master yarn-client --driver-memory 10G --executor-memory 22G --num-executors 7 --executor-cores 5
OR
--master yarn-client --driver-memory 10G --executor-memory 22G --num-executors 49 --executor-cores 5
Thanks,
Jayadeep
Can anyone please clarify and review the below:
Is the num-executors value per node, or is it the total number of executors across all the data nodes?
You first need to understand that the executors run on the NodeManagers (you can think of these like workers in Spark standalone). A number of containers (each bundling vCPU, memory, network, disk, etc.) equal to the number of executors specified will be allocated for your Spark application on YARN. These executor containers will run on multiple NodeManagers, and their placement depends on the CapacityScheduler (the default scheduler in HDP).
So, to sum up, the total number of executors is the number of resource containers you specify for your application to run.
Refer to this blog to understand it better.
I am using the below calculation to come up with the core count, executor count and memory per executor
Number of cores <= 5 (assuming 5) Num executors = (40-1)/5 = 7 Memory = (160-1)/7 = 22 GB
There is no rigid formula for calculating the number of executors. Instead you can try enabling Dynamic Allocation in YARN for your application.
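A minimal sketch of what enabling dynamic allocation could look like from PySpark; the property names are Spark's standard ones, the executor bounds are placeholders, and the external shuffle service has to be enabled on the NodeManagers for this to work.
# Sketch: let YARN grow and shrink the executor count instead of fixing
# --num-executors. The min/initial/max bounds are placeholders.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("dynamic-allocation-example")
        .set("spark.dynamicAllocation.enabled", "true")
        .set("spark.shuffle.service.enabled", "true")    # required on YARN
        .set("spark.dynamicAllocation.minExecutors", "2")
        .set("spark.dynamicAllocation.initialExecutors", "4")
        .set("spark.dynamicAllocation.maxExecutors", "49")
        .set("spark.executor.cores", "5")
        .set("spark.executor.memory", "22g"))

sc = SparkContext(conf=conf)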
There is a hiccup with the CapacityScheduler. As far as I understand, it allows you to schedule only by memory. You will first need to change its scheduling type to the DominantResourceCalculator. That will allow you to request combinations of memory and cores. Once you change that, you should be able to ask for both CPU and memory for your Spark application.
As for the --num-executors flag, you can even keep it at a very high value like 1000. It will still allocate only the number of containers that can actually be launched on each node. As and when your cluster resources increase, the containers attached to your application will increase. The number of containers that you can launch per node is limited by the amount of resources allocated to the NodeManagers on those nodes.

Using all resources in Apache Spark with Yarn

I am using Apache Spark with the YARN client.
I have 4 worker PCs, each with 8 vCPUs and 30 GB of RAM, in my Spark cluster.
I set my executor memory to 2G and the number of instances to 33.
My job is taking 10 hours to run and all machines are about 80% idle.
I don't understand the correlation between executor memory and executor instances. Should I have an instance per vCPU? Should I set the executor memory to be (memory of the machine) / (executors per machine)?
I believe that you have to use the following command:
spark-submit --num-executors 4 --executor-memory 7G --driver-memory 2G --executor-cores 8 --class "YourClassName" --master yarn-client
The number of executors should be 4, since you have 4 workers. The executor memory should be close to the maximum memory that each YARN node has allocated, roughly ~5-6GB (I assume each machine has 30GB of total RAM).
You should take a look at the spark-submit parameters and make sure you fully understand them.
We were using Cassandra as our data source for Spark. The problem was that there were not enough partitions; we needed to split up the data more. Our mapping of Cassandra partitions to Spark partitions was not fine-grained enough, and we would only generate 10 or 20 tasks instead of hundreds of tasks.
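A generic sketch of how to spot that symptom (not the original Cassandra job): compare the partition count, and therefore the task count, of an RDD with the cores available, and repartition when it is far too low.
# Generic sketch: detect an under-partitioned input and split it up so
# there are enough tasks to keep every core busy. The 3x multiplier is a
# common rule of thumb, not a hard rule.
from pyspark import SparkConf, SparkContext

sc = SparkContext(conf=SparkConf().setAppName("partition-check"))
rdd = sc.parallelize(range(1000000), 10)    # stand-in source with only 10 partitions

total_cores = sc.defaultParallelism
print("partitions:", rdd.getNumPartitions(), "vs cores:", total_cores)

if rdd.getNumPartitions() < total_cores:
    rdd = rdd.repartition(total_cores * 3)  # aim for a few tasks per core

print("tasks in the next stage:", rdd.getNumPartitions())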
