How to use standalone Master's resources for workers? - apache-spark

I've installed Apache Spark 1.5.2 (for Hadoop 2.6+). My cluster consists of the following hardware:
Master: 12 CPU Cores & 128 GB RAM
Slave1: 12 CPU Cores & 64 GB RAM
Slave2: 6 CPU Cores & 64 GB RAM
Currently my slaves file has these two entries:
slave1_ip
slave2_ip
Because the master machine also has quite powerful hardware, it would not be used to capacity by the master process alone. So I wanted to ask whether it is possible to give some of the CPU cores and RAM of the master machine to a third worker instance? Thank you!
FIRST ATTEMPT TO SOLVE THE PROBLEM
After Jacek Laskowski's answer I set the following settings:
spark-defaults.conf (only on Master machine):
spark.driver.cores=2
spark.driver.memory=4g
spark-env.sh (on Master):
SPARK_WORKER_CORES=10
SPARK_WORKER_MEMORY=120g
spark-env.sh (on Slave1):
SPARK_WORKER_CORES=12
SPARK_WORKER_MEMORY=60g
spark-env.sh (on Slave2):
SPARK_WORKER_CORES=6
SPARK_WORKER_MEMORY=60g
I also added the master's ip address to the slaves file.
The cluster now consists of 3 worker nodes (the two slaves plus the master), which is perfect.
BUT: the web UI shows only 1024 MB of RAM per node, see the screenshot:
Can someone tell me how to fix this? Setting spark.executor.memory assigns the same amount of RAM to every machine, which is not optimal if I want to use as much RAM as possible on each node. What am I doing wrong? Thank you!

It's possible. Just limit the number of cores and memory used by the master and run one or more workers on the machine.
Use conf/spark-defaults.conf where you can set up spark.driver.memory and spark.driver.cores. Consult Spark Configuration.
You should however use conf/spark-env.sh to set up more than one instance per node using SPARK_WORKER_INSTANCES. Include the other settings as follows:
SPARK_WORKER_INSTANCES=2
SPARK_WORKER_CORES=2
SPARK_WORKER_MEMORY=2g
You may also want to set the amount of RAM for executors (per worker) using spark.executor.memory or SPARK_EXECUTOR_MEMORY (as depicted in the screenshot).
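Applied to the asker's master machine, a minimal sketch of conf/spark-env.sh could look like this (values are purely illustrative and assume a couple of cores and a few GB of RAM are kept back for the master daemon and the driver):
SPARK_WORKER_INSTANCES=2
SPARK_WORKER_CORES=5
SPARK_WORKER_MEMORY=60g
With two worker instances of 5 cores and 60g each, roughly 10 cores and 120 GB of the master machine are offered to the cluster, while 2 cores and the remaining RAM stay free for the master process and the driver.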

In the Spark standalone cluster manager, the conf files such as spark-env.sh should be the same on the master and on every worker. If the configuration does not match across machines, the worker falls back to its default memory, which is 1g.
spark-defaults.conf (only on Master machine):
spark.driver.cores=2
spark.driver.memory=4g
spark-env.sh (on Master):
SPARK_WORKER_CORES=10
SPARK_WORKER_MEMORY=60g
spark-env.sh (on Slave1):
SPARK_WORKER_CORES=10
SPARK_WORKER_MEMORY=60g
spark-env.sh (on Slave2):
SPARK_WORKER_CORES=10
SPARK_WORKER_MEMORY=60g
and the slaves file (conf/slaves) on each machine should list:
masterip
slave1ip
slave2ip
After the above configuration you have 3 workers, one on the master machine and two on the slave nodes, and your driver also runs on the master machine.
But be careful: you are assigning a lot of memory and cores in the configuration, and if your machines are small the resource manager cannot allocate those resources.
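After changing spark-env.sh and the slaves file, the standalone daemons need to be restarted so the workers re-register with the new settings; a minimal sketch, assuming a standard Spark layout and passwordless SSH to the slaves:
sbin/stop-all.sh
sbin/start-all.sh
The master's web UI (port 8080 by default) should then list three workers with the configured cores and memory.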

I know this is a very old post, but why wouldn't you set the property spark.executor.memory in spark-defaults.conf (or --executor-memory)?
Note this value is 1024 MB by default, and that is what you seem to be encountering.
The thing is that executor memory is defined at the application level and not at the node level, so there doesn't seem to be a way to start the executors with different cores/memory on different nodes.
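A hedged example (the values and application jar are placeholders; since executor memory is application-level, it has to fit on the smallest worker): either set
spark.executor.memory=48g
in conf/spark-defaults.conf, or pass it at submit time:
spark-submit --master spark://master_ip:7077 --executor-memory 48g your-app.jar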

Related

How to increase the "memory total" that is displayed on the YARN UI?

I have a cluster on EMR (emr-5.20.0) with an m5.2xlarge as the master node, two m4.large as core nodes and three m4.large as worker nodes. The total RAM of this cluster is 62 GB, but the YARN UI shows a total memory of only 30 GB.
Can somebody help me understand how this value is calculated?
I have already checked the configuration in yarn-site.xml and spark-defaults.conf, and they are configured according to the AWS recommendation: https://docs.aws.amazon.com/pt_br/emr/latest/ReleaseGuide/emr-hadoop-task-config.html#emr-hadoop-task-config-m5
Any help is welcome.
The memory settings in YARN can be configured using the following cluster parameters:
yarn.nodemanager.resource.memory-mb
yarn.scheduler.minimum-allocation-mb
yarn.scheduler.increment-allocation-mb
yarn.scheduler.maximum-allocation-mb
By tweaking these parameters you can increase or decrease the total memory allocated to the cluster.
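For illustration only (these are not recommended values, and whatever you set has to leave headroom on each 8 GB m4.large node for the OS and the Hadoop/EMR daemons), the relevant yarn-site.xml properties could be raised roughly like this, e.g. via EMR's configurations API:
yarn.nodemanager.resource.memory-mb=6656
yarn.scheduler.minimum-allocation-mb=512
yarn.scheduler.maximum-allocation-mb=6656
The total shown in the YARN UI is essentially the sum of yarn.nodemanager.resource.memory-mb over the NodeManager nodes, so raising that value on each node raises the displayed total.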
YARN does not include the master node in its available memory/cores.
So you should get roughly 5 x 8 GB (m4.large). You will get less than that because some memory is reserved as overhead for the OS and services.

emr-5.4.0 (Spark executors memory allocation issue)

I created a Spark cluster (for learning, so I did not create a high-memory/CPU cluster) with 1 master node and 2 core nodes to run executors, using the config below:
Master: Running 1 m4.large (2 cores, 8 GB)
Core: Running 2 c4.large (2 cores, 3.5 GB)
Hive 2.1.1, Pig 0.16.0, Hue 3.11.0, Spark 2.1.0, Sqoop 1.4.6, HBase 1.3.0
When pyspark is run, I get the error below:
Required executor memory (1024+384 MB) is above the max threshold (896 MB) of this cluster! Please check the values of 'yarn.scheduler.maximum-allocation-mb' and/or 'yarn.nodemanager.resource.memory-mb'.
Before trying to increase the yarn-site.xml config, I am curious to understand why EMR takes just 896 MB as the limit when the master has 8 GB and each worker node has 3.5 GB.
Also, the Resource Manager URL (for the master: http://master-public-dns-name:8088/) is showing 1.75 GB, whereas the VM's memory is 8 GB. Is HBase or other software taking up too much memory?
If anyone has encountered a similar issue, please share your insight on why EMR sets such low defaults. Thanks!
Before trying to increase yarn-site.xml config, curious to understand why EMR is taking just 896MB as limit when master has 8GB and worker node has 3.5GB each.
If you run Spark jobs in YARN cluster mode (which you probably were), the executors run on the core nodes and the master's memory is not used.
Now, although your core EC2 instance (c4.large) has 3.75 GB to use, EMR configures YARN not to use all of this memory for running YARN containers or Spark executors. This is because you have to leave enough memory for other permanent daemons (like HDFS's DataNode, YARN's NodeManager, EMR's own daemons, etc., depending on the applications you provision).
EMR publishes the default YARN configuration it sets for all instance types on this page: http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hadoop-task-config.html
c4.large
Configuration Option Default Value
mapreduce.map.java.opts -Xmx717m
mapreduce.map.memory.mb 896
yarn.scheduler.maximum-allocation-mb 1792
yarn.nodemanager.resource.memory-mb 1792
So yarn.nodemanager.resource.memory-mb = 1792, which means 1792 MB is the physical memory that will be allocated to YARN containers on that core node, which has 3.75 GB of actual memory. Also check spark-defaults.conf, where EMR sets some defaults for Spark executor memory. These are defaults, and of course you can change them before starting the cluster using EMR's configurations API. But keep in mind that if you over-provision memory for YARN containers, you might starve some other processes.
Given that, it is important to understand YARN configs and how Spark interacts with YARN:
https://hadoop.apache.org/docs/r2.7.2/hadoop-yarn/hadoop-yarn-common/yarn-default.xml
http://spark.apache.org/docs/latest/running-on-yarn.html
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
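As a quick workaround that avoids touching yarn-site.xml at all (illustrative value only; with the default overhead of max(384 MB, 10% of executor memory), 450 + 384 = 834 MB stays below the quoted 896 MB threshold), the executor memory can simply be lowered at launch:
pyspark --executor-memory 450m
The more sustainable fix is to raise yarn.nodemanager.resource.memory-mb and yarn.scheduler.maximum-allocation-mb through EMR's configurations, as described above.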
It's not really a property of EMR but rather of YARN, which is the resource manager running on EMR.
My personal take on YARN is that it is really built for managing long-running clusters that continuously take in a variety of jobs that they have to run simultaneously. In these cases it makes sense for YARN to assign only a small part of the available memory to each job.
Unfortunately, when it comes to special-purpose clusters (as in: "I will just spin up a cluster, run my job and terminate the cluster again"), these YARN defaults are simply annoying, and you have to configure a bunch of stuff in order to make YARN utilise your resources optimally. But running on EMR, it's what we are stuck with these days, so one has to live with that...

Run spark driver on separate machine

Currently I am using Spark 2.0.0 in cluster mode (Standalone cluster) with the following cluster config:
Workers: 4
Cores in use: 32 Total, 32 Used
Memory in use: 54.7 GB Total, 42.0 GB Used
I have 4 slaves (workers), and 1 master machine. There are 3 main parts to a Spark cluster - Master, Driver, Workers (ref)
Now my problem is that the driver starts up on one of the worker nodes, which prevents me from using the worker nodes at their full capacity (RAM-wise). For example, if I run my Spark job with 2g of memory for the driver, I am left with only ~13 GB of memory on each machine for executor memory (assuming each machine has 15 GB of RAM in total). Now I think there can be two ways to fix this:
1) Run driver on master machine, this way I can specify full 15gb RAM as executor memory
2) Specify driver machine explicitly (one of the worker nodes), and assign memory to both driver and executor for this machine accordingly. For rest of the worker nodes I can specify max executor memory.
How do I achieve point 1 or 2? Or is it even possible?
Any pointers are appreciated.
To run the driver on the master, run spark-submit from the master and specify --deploy-mode client. See "Launching applications with spark-submit" in the docs.
It is not possible to specify which worker the driver will run on when using --deploy-mode cluster. However, you can run the driver on a worker and achieve maximum cluster utilisation if you use a cluster manager such as YARN or Mesos.
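For point 1, a minimal sketch of a client-mode submission run on the master machine (the master URL, memory values and application jar are placeholders):
./bin/spark-submit \
  --master spark://master_ip:7077 \
  --deploy-mode client \
  --driver-memory 2g \
  --executor-memory 13g \
  your-app.jar
Because the driver then runs inside the spark-submit process on the master machine, none of the workers' RAM is consumed by the driver.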

Spark standalone configuration having multiple executors

I'm trying to set up a standalone Spark 2.0 server to process an analytics function in parallel. To do this I want to have a single worker with multiple executors.
I'm using :
Standalone Spark 2.0
8 Cores
24gig RAM
windows server 2008
pyspark (although this appears unrelated)
This is just for pure proof of concept purposes but I want to have 8 executors, one per each core.
I've tried to follow the other threads on this topic, but for some reason it's not working for me, e.g.:
Spark Standalone Number Executors/Cores Control
My configuration is as follows:
conf\spark-defaults.conf
spark.cores.max = 8
spark.executor.cores = 1
I have also tried to change my spark-env.sh file, to no avail. Instead, what happens is that my 1 worker shows only 1 executor on it. As you can see below, it still shows the standalone mode with 1 executor that has 8 cores.
I believe you mixed up local and standalone modes:
Local mode is a development tool where all processes are executed inside a single JVM. An application is started in local mode by setting the master to local, local[*] or local[n]. spark.executor.cores and related executor settings are not applicable in local mode because there is only one embedded executor.
Standalone mode requires a standalone Spark cluster. It requires a master node (can be started using SPARK_HOME/sbin/start-master.sh script) and at least one worker node (can be started using SPARK_HOME/sbin/start-slave.sh script).
SparkConf should be created with the master node's address (spark://host:port).
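A minimal sketch of what that looks like in practice (host name, port and application jar are placeholders):
$SPARK_HOME/sbin/start-master.sh
$SPARK_HOME/sbin/start-slave.sh spark://master_host:7077
$SPARK_HOME/bin/spark-submit --master spark://master_host:7077 your-app.jar
The first command runs on the master machine, the second on each worker machine (pointing at the master URL shown in the master's log/web UI), and the third submits an application against that master.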
You first need to configure your spark standalone cluster, then set the amount of resources needed for each individual spark application you want to run.
In order to configure the cluster, you can try this:
In conf/spark-env.sh:
Set the SPARK_WORKER_INSTANCES = 10
which determines the number of Worker instances (#Executors) per node (its default value is only 1)
Set the SPARK_WORKER_CORES = 15
number of cores that one Worker can use (default: all cores, your case is 36)
Set SPARK_WORKER_MEMORY = 55g
total amount of memory that can be used on one machine (Worker Node) for running Spark programs.
Copy this configuration file to all Worker Nodes, on the same folder
Start your cluster by running the scripts in sbin (sbin/start-all.sh, ...)
As you have 5 workers, with the above configuration you should see 5 (workers) * 10 (executors per worker) = 50 alive executors on the master's web interface (http://localhost:8080 by default)
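For the asker's single 8-core / 24 GB Windows machine, a hedged adaptation of those settings (values are illustrative, leaving some memory for the OS and the driver; on Windows they go into conf\spark-env.cmd as set VAR=value) could be:
SPARK_WORKER_INSTANCES=8
SPARK_WORKER_CORES=1
SPARK_WORKER_MEMORY=2g
This registers eight worker processes, each able to offer a single 1-core / 2g executor to an application.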
When you run an application in standalone mode, by default it will acquire all available executors in the cluster. You need to explicitly set the amount of resources for this application, e.g.:
val conf = new SparkConf()
.setMaster(...)
.setAppName(...)
.set("spark.executor.memory", "2g")
.set("spark.cores.max", "10")

Spark multi-node is not much faster than single-node when executing a query; how to increase the number of cores in VirtualBox?

I'm doing some tests with Spark in VirtualBox. I have a CPU with 8 cores on the host machine, and I would like to test Spark with as many cores as possible in the VirtualBox environment to get the best performance possible.
I'm using 3 VirtualBox machines: one master and two slaves. In the VirtualBox settings I configured the master machine with 2 GB RAM and 1 CPU, and each slave machine with 4 GB RAM and 3 CPUs.
When I start the spark-shell with YARN ("spark-shell --master yarn-client"), the cluster appears with the settings shown in the screenshot.
But when I execute a query, the same query that takes 4 min on just one node without YARN takes 2.5 min with 3 nodes, so it is not much of a difference.
Do you know how I can configure this environment better to increase performance? Is it possible to configure Spark with YARN to use more cores, given that I have a CPU with 8 cores on the host machine?
I did not do any Spark core configuration to get the values in the screenshot; the only configs I set in Spark were:
(spark-env.sh)
SPARK_JAVA_OPTS=-Dspark.driver.port=53411
HADOOP_CONF_DIR=$HADOOP_HOME/conf
SPARK_MASTER_IP=master
(spark-defaults.conf)
spark.master spark://master:7077
spark.serializer org.apache.spark.serializer.KryoSerializer
(slaves)
slave1
slave2
And to start spark:
spark-shell --master yarn-client
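One thing worth trying (illustrative values only, assuming the NodeManagers on the two slave VMs actually have this much memory and these cores available to YARN) is to request the executors explicitly when launching the shell:
spark-shell --master yarn-client --num-executors 2 --executor-cores 3 --executor-memory 2g
With yarn-client mode the driver runs inside the spark-shell process on the master VM, so only the slave VMs need to cover the executors' cores and memory.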
