spark Local mode vs standalone cluster in term of cores and threads usage - apache-spark

im comparing between pyspark local mode and standalone mode where
local :
findspark.init('C:\spark\spark-3.0.3-bin-hadoop2.7')
conf=SparkConf()
conf.setMaster("local[*]")
conf.setAppName('firstapp')
sc = SparkContext(conf=conf)
spark = SparkSession(sc)
standalone :
findspark.init('C:\spark\spark-3.0.3-bin-hadoop2.7')
conf=SparkConf()
conf.setMaster("spark://127.0.0.2:7077")
conf.setAppName('firstapp')
sc = SparkContext(conf=conf)
spark = SparkSession(sc)
plus starting the Master and the workers using :
Master
bin\spark-class2.cmd org.apache.spark.deploy.master.Master
Worker multiple times depending on the number of workers
bin\spark-class2.cmd org.apache.spark.deploy.worker.Worker -c 1 -m 1G spark://127.0.0.1:7077 where '1' mean one core and '1G' mean 1gb or Ram.
my question is : what is the difference between local mode and standalone mode in term of the usage of threads and cores ?

Related

what are the configurations needed to lunch pyspark standalone cluster?

im new to pyspark and im looking for deploying my program in a cluster.
i checked out a tutorial which the steps are:
bin\spark-class2.cmd org.apache.spark.deploy.master.Master.
bin\spark-class2.cmd org.apache.spark.deploy.worker.Worker -c 2 -m 2G spark://192.168.43.78:7077.
lunching the app with python myapp.py with:
findspark.init('C:\spark\spark-3.0.3-bin-hadoop2.7')
conf=SparkConf()
conf.setMaster('spark://192.168.43.78:7077')
conf.setAppName('firstapp')
sc = SparkContext(conf=conf)
spark = SparkSession(sc)
My question is: what are the configurations needed more than that to lunch pyspark standalone cluster ?

SparkConf settings not used when running Spark app in cluster mode on YARN

I wrote a Spark application, which sets sets some configuration stuff via SparkConf instance, like this:
SparkConf conf = new SparkConf().setAppName("Test App Name");
conf.set("spark.driver.cores", "1");
conf.set("spark.driver.memory", "1800m");
conf.set("spark.yarn.am.cores", "1");
conf.set("spark.yarn.am.memory", "1800m");
conf.set("spark.executor.instances", "30");
conf.set("spark.executor.cores", "3");
conf.set("spark.executor.memory", "2048m");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> inputRDD = sc.textFile(...);
...
When I run this application with the command (master=yarn & deploy-mode=client)
spark-submit --class spark.MyApp --master yarn --deploy-mode client /home/myuser/application.jar
everything seems to work fine, the Spark History UI shows correct executor information:
But when running it with (master=yarn & deploy-mode=cluster)
my Spark UI shows wrong executor information (~512 MB instead of ~1400 MB):
Also my App name equals Test App Name when running in client mode, but is spark.MyApp when running in cluster mode. It seems that however some default settings are taken when running in Cluster mode. What am I doing wrong here? How can I make these settings for the Cluster mode?
I'm using Spark 1.6.2 on a HDP 2.5 cluster, managed by YARN.
OK, I think I found out the problem! In short form: There's a difference between running Spark settings in Standalone and in YARN-managed mode!
So when you run Spark applications in the Standalone mode, you can focus on the Configuration documentation of Spark, see http://spark.apache.org/docs/1.6.2/configuration.html
You can use the following settings for Driver & Executor CPU/RAM (just as explained in the documentation):
spark.executor.cores
spark.executor.memory
spark.driver.cores
spark.driver.memory
BUT: When running Spark inside a YARN-managed Hadoop environment, you have to be careful with the following settings and consider the following points:
orientate on the "Spark on YARN" documentation rather then on the Configuration documentation linked above: http://spark.apache.org/docs/1.6.2/running-on-yarn.html (the properties explained here have a higher priority then the ones explained in the Configuration docu (this seems to describe only the Standalone cluster vs. client mode, not the YARN cluster vs. client mode!!))
you can't use SparkConf to set properties in yarn-cluster mode! Instead use the corresponding spark-submit parameters:
--executor-cores 5
--executor-memory 5g
--driver-cores 3
--driver-memory 3g
In yarn-client mode you can't use the spark.driver.cores and spark.driver.memory properties! You have to use the corresponding AM properties in a SparkConf instance:
spark.yarn.am.cores
spark.yarn.am.memory
You can't set these AM properties via spark-submit parameters!
To set executor resources in yarn-client mode you can use
spark.executor.cores and spark.executor.memory in SparkConf
--executor-cores and executor-memory parameters in spark-submit
if you set both, the SparkConf settings overwrite the spark-submit parameter values!
This is the textual form of my notes:
Hope I can help anybody else with this findings...
Just to add on to D. Müller's answer:
Same issue happened to me and I tried the settings with some different combination. I am running Pypark 2.0.0 on YARN cluster.
I found that driver-memory must be written during spark submit but executor-memory can be written in script (i.e. SparkConf) and the application will still work.
My application will die if driver-memory is less than 2g. The error is:
ERROR yarn.ApplicationMaster: RECEIVED SIGNAL TERM
ERROR yarn.ApplicationMaster: User application exited with status 143
CASE 1:
driver & executor both written in SparkConf
spark = (SparkSession
.builder
.appName("driver_executor_inside")
.enableHiveSupport()
.config("spark.executor.memory","4g")
.config("spark.executor.cores","2")
.config("spark.yarn.executor.memoryOverhead","1024")
.config("spark.driver.memory","2g")
.getOrCreate())
spark-submit --master yarn --deploy-mode cluster myscript.py
CASE 2:
- driver in spark submit
- executor in SparkConf in script
spark = (SparkSession
.builder
.appName("executor_inside")
.enableHiveSupport()
.config("spark.executor.memory","4g")
.config("spark.executor.cores","2")
.config("spark.yarn.executor.memoryOverhead","1024")
.getOrCreate())
spark-submit --master yarn --deploy-mode cluster --conf spark.driver.memory=2g myscript.py
The job Finished with succeed status. Executor memory correct.
CASE 3:
- driver in spark submit
- executor not written
spark = (SparkSession
.builder
.appName("executor_not_written")
.enableHiveSupport()
.config("spark.executor.cores","2")
.config("spark.yarn.executor.memoryOverhead","1024")
.getOrCreate())
spark-submit --master yarn --deploy-mode cluster --conf spark.driver.memory=2g myscript.py
Apparently the executor memory is not set. Meaning CASE 2 actually captured executor memory settings despite writing it inside sparkConf.

Running spark application in local mode

I'm trying to start my Spark application in local mode using spark-submit. I am using Spark 2.0.2, Hadoop 2.6 & Scala 2.11.8 on Windows. The application runs fine from within my IDE (IntelliJ), and I can also start it on a cluster with actual, physical executors.
The command I'm running is
spark-submit --class [MyClassName] --master local[*] target/[MyApp]-jar-with-dependencies.jar [Params]
Spark starts up as usual, but then terminates with
java.io.Exception: Failed to connect to /192.168.88.1:56370
What am I missing here?
Check which port you are using: if on cluster: log in to master node and include:
--master spark://XXXX:7077
You can find it always in spark ui under port 8080
Also check your spark builder config if you have set master already as it takes priority when launching eg:
val spark = SparkSession
.builder
.appName("myapp")
.master("local[*]")

Is there a bug with SparkContext() and SparkConf()

When I try to init SparkContext with SparkConf as below:
from pyspark import *
from pyspark.streaming import *
cfg = SparkConf().setMaster('yarn').setAppName('MyApp')
sc = SparkContext(conf=cfg)
print(sc.getConf().getAll())
rdd = sc.parallelize(list('abcdefg')).map(lambda x:(x,1))
print(rdd.collect())
The output show that it does not run with yarn:
[(u'spark.master', u'local[10]'), ...]
It used the config which in $SPARK_HOME/conf/spark-defaults.conf:
spark.master local[10]
My computer:
Python2.7.2 Spark2.1.0
Then I run the same code in spark2.0.2 and SparkConf() works as well
So it is really a bug ?
To utilize yarn, you should specify whether the driver should run on the master or one of the worker nodes.
yarn-client will execute driver on the master node
SparkConf().setMaster('yarn-client')
yarn-cluster will execute driver on one of the worker nodes
SparkConf().setMaster('yarn-cluster')
Here is an example for running in yarn-client mode.

SPARK_WORKER_INSTANCES setting not working in Spark Standalone Windows

I'm trying to setup a standalone Spark 2.0 server to process an analytics function in parallel. To do this I want to run 8 workers, with a single core per each worker. However, the Spark Master/Worker UI doesn't seem to be reflecting my configuration.
I'm using :
Standalone Spark 2.0
8 Cores 24gig RAM
windows server 2008
pyspark
spark-env.sh file is configured as follows:
SPARK_WORKER_INSTANCES = 8
SPARK_WORKER_CORES = 1
SPARK_WORKER_MEMORY = 2g
spark-defaults.conf is configured as follows:
spark.cores.max = 8
I start the master:
spark-class org.apache.spark.deploy.master.Master
I start the workers by running this command 8 times within a batch file:
spark-class org.apache.spark.deploy.worker.Worker spark://10.0.0.10:7077
The problem is that the UI shows up as follows:
As you can see each worker has 8 cores instead of the 1 core I have assigned it via the SPARK_WORKER_CORES setting. Also the memory is reflective of the entire machine memory not the 2g assigned to each worker. How can I configure Spark to run with 1 core/2g per each worker in standalone mode?
I fixed this to adding the cores and memory arguments to the worker itself.
start spark-class org.apache.spark.deploy.worker.Worker --cores 1 --memory 2g spark://10.0.0.10:7077

Resources