Spark Standalone cluster, memory per executor issue - apache-spark

Hi, I am launching my Spark application with the spark-submit script, as follows:
spark-submit --master spark://Maatari-xxxxxxx.local:7077 --class EstimatorApp /Users/sul.maatari/IdeaProjects/Workshit/target/scala-2.11/Workshit-assembly-1.0.jar --deploy-mode cluster --executor-memory 15G num-executors 2
I have a Spark standalone cluster deployed on two nodes (my 2 laptops). The cluster is running fine. By default it sets 15G for the workers and 8 cores for the executors. Now I am experiencing the following strange behavior: although I am explicitly setting the memory, and this can also be seen in the environment variables of the SparkConf UI, the cluster UI says that my application is limited to 1024MB of executor memory. This makes me think of the default 1G parameter, and I wonder why that is.
My application does indeed fail because of a memory issue; I know that I need a lot of memory for this application.
One last point of confusion is the driver program. Given that I am in cluster mode, why does spark-submit not return immediately? I thought that since the driver is executed on the cluster, the client (i.e. the submitting process) should return immediately. This further suggests to me that something is not right with my configuration and with how things are being executed.
Can anyone help diagnose that?

Two possibilities:
Given that your command line has --num-executors mis-specified, it may be that Spark "gives up" on the other settings as well.
How much memory does your laptop have? Most of us use Macs, and in my experience you would then not be able to run with more than about 8GB.
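Note also that spark-submit treats everything after the application JAR as arguments to the application itself, so in the command above --deploy-mode, --executor-memory and the executor count were never seen by spark-submit at all, which would explain both the 1024MB default and why the client does not return (the job never actually ran in cluster mode). A hedged rewrite of the command, reusing the paths and values from the question (--num-executors is left out because it is a YARN-only option; on standalone the executor count is controlled via spark.cores.max and spark.executor.cores):
spark-submit --master spark://Maatari-xxxxxxx.local:7077 --deploy-mode cluster --class EstimatorApp --executor-memory 15G /Users/sul.maatari/IdeaProjects/Workshit/target/scala-2.11/Workshit-assembly-1.0.jar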

Related

Spark: huge number of threads get created

Spark version 2.1 Hadoop 2.7.3
I have a Spark job that has only 1 stage and 100 partitions, and my application itself doesn't create any threads. But after I submit it as
spark-submit --class xxx --master yarn --deploy-mode cluster --driver-memory 4g --executor-memory 1g --num-executors 7 --executor-cores 1 ./my.jar
I found that on every server it uses about 400 threads. Why are so many threads being used? The cluster has 6 servers, so one of the servers gets 2 executors, and that uses about 800 threads in the Spark process. When I actually run this and give it a lot of cores, I get a "cannot create native thread" error after the system reaches 32,000 threads, which is the limit from the system's ulimit setting. Even if I can assign fewer cores and get around this error, using so many threads won't be efficient anyway. Can someone give some hints?
Update:
it was the connection to HBase causing the problem, not Spark using those threads.
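For anyone hitting something similar, one way to see where the threads come from is to dump a histogram of live JVM threads grouped by name from inside an executor task; this is a diagnostic sketch, not code from the original post:
import scala.collection.JavaConverters._

// Count live JVM threads, grouping names so that numbered pool threads
// (e.g. "...-pool-1", "...-pool-2") collapse into one bucket.
def threadHistogram(): Map[String, Int] =
  Thread.getAllStackTraces.keySet.asScala.toSeq
    .groupBy(_.getName.replaceAll("\\d+", "#"))
    .mapValues(_.size)
    .toMap

// Example: print the histogram once per partition to the executor logs.
// rdd.foreachPartition(_ => threadHistogram().foreach(println))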
Check the scheduler XML configuration in the conf directory
Check the scheduler used
Check the weight configured
If there is no pool set, try setting a pool:
sc.setLocalProperty("spark.scheduler.pool", "test")
and configure the following pool in the scheduler XML file:
<pool name="test">
<schedulingMode>FAIR</schedulingMode>
<weight>1</weight>
<minShare>2</minShare>
</pool>
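If the pool above still doesn't take effect, it may be because the scheduler mode is still FIFO. A minimal sketch of enabling the FAIR scheduler and pointing Spark at the allocation file (the file path and app name here are illustrative):
import org.apache.spark.{SparkConf, SparkContext}

// Enable fair scheduling and load the XML file containing the <pool name="test"> definition.
val conf = new SparkConf()
  .setAppName("fair-pool-example")
  .set("spark.scheduler.mode", "FAIR")
  .set("spark.scheduler.allocation.file", "/path/to/conf/fairscheduler.xml")
val sc = new SparkContext(conf)

// Jobs started from this thread now run in the "test" pool.
sc.setLocalProperty("spark.scheduler.pool", "test")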

Can someone let me know how to decide --executor-memory and --num-executors in a spark-submit job? What is the concept of the number of executor cores?

How do I decide --executor-memory and --num-executors in a spark-submit job? What is the concept of the number of executor cores?
Also, what is the clear difference between cluster and client deploy mode, and how do I choose the deploy mode?
The first part of your question, where you ask about --executor-memory, --num-executors and --executor-cores, usually depends on the kind of tasks your Spark application is going to perform.
Executor memory indicates the amount of physical memory you want to allocate to the JVM that runs the executor. The value will depend on your requirements. For example, if you're just going to parse a large text file you'll require much less memory than you need for, say, image processing.
The number of executors is the number of executor JVMs you want to spawn on your cluster. Again, it depends on a lot of factors such as your cluster size, the type of machines in the cluster, etc.
Each executor takes a share of the work and performs its instructions as tasks. These tasks run on the executor's cores (processors). This helps you achieve parallelism within a given executor, but make sure you don't allocate all the cores of a machine to its executor, because some are needed for the machine's normal functioning.
On to the second part of your question: Spark has the two --deploy-mode values you already named, i.e. cluster and client.
Client mode is when you connect an external machine to a cluster and run a Spark job from that external machine, for example when you connect your laptop to a cluster and run spark-shell from it. The driver JVM is invoked on your laptop, and the session is killed as soon as you disconnect your laptop. The case is similar for a spark-submit job: if you run a job with --deploy-mode client, your laptop hosts the driver, and the job is killed as soon as the laptop is disconnected (not sure about this one).
Cluster mode: when you specify --deploy-mode cluster for your job, then even if you submit it from your laptop or any other machine, the job (JAR) is taken care of by the ResourceManager and ApplicationMaster, just like any other application in YARN. You won't be able to see the output on your screen, but most complex Spark jobs write to a filesystem anyway, so the output is taken care of that way.
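As an illustration (the class name, JAR and resource values below are placeholders, not from the question), a YARN cluster-mode submission that sets all three knobs explicitly could look like this:
spark-submit --master yarn --deploy-mode cluster --class com.example.MyApp --num-executors 4 --executor-cores 3 --executor-memory 6g my-app.jar
With these values, the job can run at most 4 x 3 = 12 tasks concurrently, and each executor JVM gets 6g of heap.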

How do I run multiple Spark applications in parallel on a standalone master?

Using the Spark (1.6.1) standalone master, I need to run multiple applications on the same Spark master.
All applications submitted after the first one stay in 'WAITING' state. I also observed that the running one holds the sum of all cores across the workers.
I already tried limiting this by using SPARK_EXECUTOR_CORES, but that is for a YARN config, while I am running the "standalone master". I tried running many workers on the same master, but every time the first submitted application consumes all the workers.
I was having the same problem on a Spark standalone cluster.
What I found is that it somehow utilises all the resources for one single job. We need to define the resources so that there will also be room to run other jobs.
Below is the command I am using to submit a Spark job.
bin/spark-submit --class classname --master spark://hjvm1:6066 --deploy-mode cluster --driver-memory 500M --conf spark.executor.memory=1g --conf spark.cores.max=1 /data/test.jar
A crucial parameter for running multiple jobs in parallel on a Spark standalone cluster is spark.cores.max. Note that spark.executor.instances, num-executors and spark.executor.cores alone won't allow you to achieve this on Spark standalone; all your jobs except a single active one will be stuck with WAITING status.
Spark standalone resource scheduling:
The standalone cluster mode currently only supports a simple FIFO scheduler across applications. However, to allow multiple concurrent users, you can control the maximum number of resources each application will use. By default, it will acquire all cores in the cluster, which only makes sense if you just run one application at a time. You can cap the number of cores by setting spark.cores.max ...
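A minimal sketch of doing the same thing from code on Spark 1.6 (the app name and values are illustrative; spark.cores.max here plays the role described above):
import org.apache.spark.{SparkConf, SparkContext}

// Cap this application's total cores and per-executor memory so that the
// standalone master has resources left over to schedule other applications.
val conf = new SparkConf()
  .setAppName("capped-app")
  .set("spark.cores.max", "2")
  .set("spark.executor.memory", "1g")
val sc = new SparkContext(conf)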
I am assuming you run all the workers on one server and try to simulate a cluster. The reason for this assumption is that otherwise you could use one worker and one master to run a standalone Spark cluster.
Executor cores are something completely different from normal cores. To set the number of executors you will need YARN to be turned on, as you said earlier. The executor cores are the number of concurrent tasks an executor can run (when using HDFS it is advisable to keep this below 5) [1].
The cores you want to limit to make the workers run are the "CPU cores". These are specified in the configuration of Spark 1.6.1 [2]. In Spark there is the option to set the number of CPU cores when starting a slave [3]. This happens with -c CORES, --cores CORES, which defines the total CPU cores to allow Spark applications to use on the machine (default: all available); only on a worker.
The command to start Spark would be something like this:
./sbin/start-all.sh --cores 2
Hope this helps
In the configuration settings, add this line to the "./conf/spark-env.sh" file:
export SPARK_MASTER_OPTS="-Dspark.deploy.defaultCores=1"
The default number of cores per application is now limited to 1 by the master, so if multiple Spark applications are running, each will take only one core by default. Then define the number of workers and give the workers this setting:
export SPARK_WORKER_OPTS="-Dspark.deploy.defaultCores=1"
Each worker then has one core as well. Remember that this has to be set for every worker in the configuration settings.

What is the minimum hardware infrastructure required to run Spark in standalone cluster mode?

I am running Spark standalone cluster mode on my local computer. This is the hardware information about my computer:
Intel Core i5
Number of Processors: 1
Total Number of Cores: 2
Memory: 4 GB.
I am trying to run a Spark program from Eclipse on the Spark standalone cluster. This is part of my code:
String logFile = "/Users/BigDinosaur/Downloads/spark-2.0.1-bin-hadoop2.7 2/README.md"; //
SparkConf conf = new SparkConf().setAppName("Simple Application").setMaster("spark://BigDinosaur.local:7077");
After running the program in Eclipse I am getting the following warning message:
Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
(Screenshot of the web UI omitted.)
After going through other people's answers to similar problems, it seems like a hardware resource mismatch is the root cause.
I want to get more information on:
What is the minimum hardware infrastructure required for a Spark standalone cluster to run an application on it?
It started running after I ran the following command:
./start-slave.sh spark://localhost:7077 --cores 1 --memory 1g
I gave it 1 core and 1g of memory.
As far as I know, Spark allocates memory from whatever memory is available when the Spark job starts.
You may want to try explicitly providing cores and executor memory when starting the job.
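For example (shown in Scala; the values are illustrative, and the same .set(...) calls exist on the Java SparkConf API), the conf from the question could request no more than what the 1-core / 1g worker offers:
import org.apache.spark.{SparkConf, SparkContext}

// Ask for no more than the single 1g / 1-core worker can provide, so the
// initial job can actually be granted resources.
val conf = new SparkConf()
  .setAppName("Simple Application")
  .setMaster("spark://BigDinosaur.local:7077")
  .set("spark.cores.max", "1")
  .set("spark.executor.memory", "512m")
val sc = new SparkContext(conf)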

spark-submit executor-memory issue on Amazon EMR 5.0

I launch a Python Spark program like this:
/usr/lib/spark/bin/spark-submit \
--master yarn \
--executor-memory 2g \
--driver-memory 2g \
--num-executors 2 --executor-cores 4 \
my_spark_program.py
I get the error:
Required executor memory (2048+4096 MB) is above the max threshold
(5760 MB) of this cluster! Please check the values of
'yarn.scheduler.maximum-allocation-mb' and/or
'yarn.nodemanager.resource.memory-mb'.
This is a brand new EMR 5 cluster with one master m3.2xlarge system and two core m3.xlarge systems. Everything should be set to defaults. I am currently the only user, running only one job on this cluster.
If I lower executor-memory from 2g to 1500m, it works. This seems awfully low. An EC2 m3.xlarge server has 15GB of RAM. These are Spark worker/executor machines with no other purpose, so I would like to use as much of that as possible for Spark.
Can someone explain how I go from having an EC2 worker instance with 15GB to being able to assign a Spark worker only 1.5GB?
On http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/TaskConfiguration_H2.html I see that for the EC2 m3.xlarge the yarn.nodemanager.resource.memory-mb default is 11520MB, and 5760MB with HBase installed. I'm not using HBase, but I believe it is installed on my cluster. Would removing HBase free up lots of memory? Is that yarn.nodemanager.resource.memory-mb setting the most relevant setting for available memory?
When I pass --executor-memory to spark-submit, is that per core or for the whole worker?
When I get the error Required executor memory (2048+4096 MB), the first value (2048) is what I pass to --executor-memory, and I can change it and see the error message change accordingly. What is the second value, 4096MB? How can I change it? Should I change it?
I tried to post this issue to the AWS developer forum (https://forums.aws.amazon.com/forum.jspa?forumID=52) and I get the error "Your message quota has been reached. Please try again later." when I haven't even posted anything. Why would I not have permission to post a question there?
Yes, if HBase is installed, it will use quite a bit of memory by default. You should not put it on your cluster unless you need it.
Your error would make sense if there were only 1 core node: 6G (4G for the 2 executors, 2G for the driver) would be more memory than your resource manager has to allocate. With 2 core nodes, you should actually be able to allocate three 2G executors: 1 on the node with the driver, 2 on the other.
In general, this sheet could help you make sure you get the most out of your cluster.
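On the 2048+4096 question: in Spark on YARN the second number in that error is the executor memory overhead (spark.yarn.executor.memoryOverhead in Spark 2.x), which this cluster appears to have configured at 4096MB; executor memory plus overhead must fit under yarn.scheduler.maximum-allocation-mb. A hedged example (values illustrative, not tuned for EMR) that sets both explicitly so their sum stays under the 5760MB threshold:
/usr/lib/spark/bin/spark-submit \
--master yarn \
--executor-memory 4g \
--conf spark.yarn.executor.memoryOverhead=512 \
--driver-memory 2g \
--num-executors 2 --executor-cores 4 \
my_spark_program.py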
