Can someone let me know how to decide --executor-memory and --num-executors in a spark-submit job? What is the concept of --executor-cores?

How do I decide the --executor-memory and --num-executors values for a spark-submit job? What is the concept of --executor-cores?
Also, what is the difference between cluster and client deploy modes, and how do I choose the deploy mode?

The first part of your question, where you ask about --executor-memory, --num-executors and --executor-cores, usually depends on the kind of work your Spark application is going to perform.
Executor memory is the amount of memory you want to allocate to the JVM that runs each executor. The value depends on your workload; for example, just parsing a large text file requires far less memory than, say, image processing.
The number of executors is the number of executor JVMs you want to spawn on your cluster. Again, it depends on many factors, such as your cluster size, the type of machines in the cluster, and so on.
Each executor runs its share of the work as tasks, and those tasks run on the executor's cores. Executor cores give you parallelism within a single executor, but don't allocate all of a machine's cores to its executors, because some are needed for the machine to function normally.
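As a rough, hypothetical illustration (the class name, JAR and numbers below are placeholders, not recommendations), a spark-submit invocation that sets all three of these knobs might look like this:
# Hypothetical example: 10 executors, each with 4 cores and 8G of heap,
# leaving some cores and memory on every node for the OS and cluster daemons.
spark-submit \
  --class com.example.MyApp \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 10 \
  --executor-cores 4 \
  --executor-memory 8G \
  my-app.jar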
On to the second part of your question: Spark has two --deploy-mode options, which you have already named, i.e. cluster and client.
Client mode is when you connect an external machine to a cluster and run a Spark job from that external machine, for example when you connect your laptop to a cluster and run spark-shell from it. The driver JVM is started on your laptop, and the session is killed as soon as your laptop disconnects. The same holds for a spark-submit job: if you run it with --deploy-mode client, the driver runs on your laptop, and the job is killed as soon as the laptop disconnects.
Cluster mode: when you specify --deploy-mode cluster, then even if you submit the job from your laptop or any other machine, the job (JAR) is handled by the ResourceManager and ApplicationMaster, just like any other YARN application. You won't see the output on your screen, but most non-trivial Spark jobs write their results to a file system anyway, so that takes care of it.
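Purely as an illustration (the class and JAR names are placeholders), the same job submitted in the two modes differs only in the --deploy-mode flag; what changes is where the driver JVM lives:
# client mode: the driver runs in the spark-submit process on your machine,
# so the job dies if your machine disconnects
spark-submit --master yarn --deploy-mode client --class com.example.MyApp my-app.jar

# cluster mode: the driver runs inside an ApplicationMaster container on the cluster,
# so the submitting machine can go away once the job has been accepted
spark-submit --master yarn --deploy-mode cluster --class com.example.MyApp my-app.jar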

Related

Spark Standalone cluster, memory per executor issue

Hi, I am launching my Spark application with the spark-submit script as follows:
spark-submit --master spark://Maatari-xxxxxxx.local:7077 --class EstimatorApp /Users/sul.maatari/IdeaProjects/Workshit/target/scala-2.11/Workshit-assembly-1.0.jar --deploy-mode cluster --executor-memory 15G num-executors 2
I have a Spark standalone cluster deployed on two nodes (my 2 laptops). The cluster is running fine. By default it sets 15G for the workers and 8 cores for the executors. Now I am experiencing the following strange behavior: although I am explicitly setting the memory, and this can also be seen in the environment variables on the Spark conf UI, the cluster UI says that my application is limited to 1024MB of executor memory. This makes me think of the default 1G parameter. I wonder why that is.
My application indeed fails because of a memory issue; I know that I need a lot of memory for this application.
One last point of confusion is the driver program. Given that I am in cluster mode, why does spark-submit not return immediately? I thought that since the driver is executed on the cluster, the client, i.e. the submitting process, should return immediately. This further suggests that something is not right with my configuration and how things are being executed.
Can anyone help diagnose that?
Two possibilities:
Given that your command line mis-specifies --num-executors (the leading -- is missing) and places it, along with --deploy-mode and --executor-memory, after the application JAR, Spark may be "giving up" on those settings as well and falling back to its defaults (a reordered sketch follows below).
How much memory does your laptop have? Most of us use Macs, and in my experience you would not be able to run with much more than about 8GB.
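To expand on the first possibility: spark-submit only parses options that appear before the application JAR; everything after the JAR is passed to your main class as program arguments, so --deploy-mode, --executor-memory and num-executors above were most likely never applied (which would also explain the 1024MB default and why the submission behaves like client mode and does not return). A reordered sketch might look like the following; note that --num-executors is documented as a YARN option, so on standalone you would typically cap resources with --total-executor-cores or spark.cores.max instead (the value 4 here is only an illustration):
spark-submit \
  --master spark://Maatari-xxxxxxx.local:7077 \
  --deploy-mode cluster \
  --class EstimatorApp \
  --executor-memory 15G \
  --total-executor-cores 4 \
  /Users/sul.maatari/IdeaProjects/Workshit/target/scala-2.11/Workshit-assembly-1.0.jar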

How do I run multiple spark applications in parallel in standalone master

Using Spark (1.6.1) standalone master, I need to run multiple applications on the same Spark master.
All applications submitted after the first one always remain in the 'WAITING' state. I have also observed that the running one holds all the cores across all workers.
I already tried limiting this with SPARK_EXECUTOR_CORES, but that is a YARN setting, while I am running the standalone master. I tried running many workers on the same master, but every time the first submitted application consumes all the workers.
I was having the same problem on a Spark standalone cluster.
What I found is that, somehow, it was using all of the resources for one single job. We need to cap the resources so that there is room to run other jobs as well.
Below is the command I am using to submit a Spark job.
bin/spark-submit --class classname --master spark://hjvm1:6066 --deploy-mode cluster --driver-memory 500M --conf spark.executor.memory=1g --conf spark.cores.max=1 /data/test.jar
A crucial parameter for running multiple jobs in parallel on a Spark standalone cluster is spark.cores.max. Note that spark.executor.instances, num-executors and spark.executor.cores alone won't achieve this on Spark standalone; all your jobs except a single active one will be stuck in WAITING status. A sketch of two capped submissions follows the quote below.
Spark standalone resource scheduling:
The standalone cluster mode currently only supports a simple FIFO scheduler across applications. However, to allow multiple concurrent users, you can control the maximum number of resources each application will use. By default, it will acquire all cores in the cluster, which only makes sense if you just run one application at a time. You can cap the number of cores by setting spark.cores.max ...
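As a sketch (the host name, class names and numbers are illustrative), two applications submitted like this can run side by side on the same standalone master, because each caps itself well below the cluster total:
# first application: at most 4 cores in total, 2G per executor
spark-submit --master spark://master-host:7077 \
  --conf spark.cores.max=4 --conf spark.executor.memory=2g \
  --class com.example.JobA job-a.jar

# second application: same caps, so it no longer sits in WAITING behind the first
spark-submit --master spark://master-host:7077 \
  --conf spark.cores.max=4 --conf spark.executor.memory=2g \
  --class com.example.JobB job-b.jar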
I am assuming you run all the workers on one server to simulate a cluster. The reason for this assumption is that otherwise you could simply use one worker and one master to run a standalone Spark cluster.
Executor cores are something completely different from the machine's ordinary CPU cores. To set the number of executors you will need YARN, as you said earlier. The executor cores setting is the number of concurrent tasks an executor can run (when using HDFS it is advisable to keep this at 5 or below) [1].
What you want to limit so that the workers can share the machine are the "CPU cores". These are specified in the Spark 1.6.1 configuration [2]. In Spark there is an option to set the number of CPU cores when starting a worker [3]. This is done with -c CORES, --cores CORES, which defines the total CPU cores that Spark applications are allowed to use on the machine (default: all available); this applies only to workers.
The command to start a worker with a core limit would be something like this:
./sbin/start-slave.sh spark://<master-host>:7077 --cores 2
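As far as I know, the same cap can also be set per worker in conf/spark-env.sh instead of on the command line; the values below are only an example:
# conf/spark-env.sh on each worker machine
export SPARK_WORKER_CORES=2      # CPU cores this worker offers to Spark applications
export SPARK_WORKER_MEMORY=4g    # total memory this worker offers (illustrative value)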
Hope this helps
In the configuration settings, add this line to the ./conf/spark-env.sh file:
export SPARK_MASTER_OPTS="-Dspark.deploy.defaultCores=1"
The maximum number of cores per application is now limited to 1 by the master.
If multiple Spark applications are running, each will use only one core by default. Then define the number of workers and give the workers this setting:
export SPARK_WORKER_OPTS="-Dspark.deploy.defaultCores=1"
Each worker is then limited to one core as well. Remember this has to be set in the configuration settings for every worker.

Spark yarn cluster vs client - how to choose which one to use?

The spark docs have the following paragraph that describes the difference between yarn client and yarn cluster:
There are two deploy modes that can be used to launch Spark applications on YARN. In cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. In client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN.
I'm assuming there are two choices for a reason. If so, how do you choose which one to use?
Please use facts to justify your response so that this question and answer(s) meet stackoverflow's requirements.
There are a few similar questions on Stack Overflow; however, those questions focus on the difference between the two approaches, not on when one approach is more suitable than the other.
A common deployment strategy is to submit your application from a gateway machine that is physically co-located with your worker machines (e.g. Master node in a standalone EC2 cluster). In this setup, client mode is appropriate. In client mode, the driver is launched directly within the spark-submit process which acts as a client to the cluster. The input and output of the application is attached to the console. Thus, this mode is especially suitable for applications that involve the REPL (e.g. Spark shell).
Alternatively, if your application is submitted from a machine far from the worker machines (e.g. locally on your laptop), it is common to use cluster mode to minimize network latency between the drivers and the executors. Note that cluster mode is currently not supported for Mesos clusters. Currently only YARN supports cluster mode for Python applications. -- Submitting Applications
What I understand from this is that both strategies use the cluster to distribute tasks; the difference is where the "driver program" runs: locally with spark-submit, or, also in the cluster.
When you should use either of them is detailed in the quote above, but I also did another thing: for big jars, I used rsync to copy them to the cluster (or even to the master node) over a link roughly 100 times faster, and then submitted from the cluster. This can be better than cluster mode for big jars. Note that client mode probably does not transfer the jar to the master, so at that point the difference between the two is minimal. Client mode is probably better when the driver program is idle most of the time, to make full use of the cores on the local machine and perhaps to avoid transferring the jar to the master (even on the loopback interface a big jar takes quite a few seconds). And with client mode you can transfer (rsync) the jar to any cluster node.
On the other hand, if the driver is very intensive in CPU or I/O, cluster mode may be more appropriate, to better balance the cluster (in client mode, the local machine would run both the driver and as many workers as possible, overloading it, slowing down the local tasks, and possibly leaving the whole job waiting for a couple of tasks from the local machine).
Conclusion: to sum up, if I am on the same local network as the cluster, I would use client mode and submit from my laptop. If the cluster is far away, I would either submit locally with cluster mode, or rsync the jar to the remote cluster and submit it there, in client or cluster mode, depending on how heavy the driver program is on resources.
AFAIK, with the driver program running in the cluster, it is less vulnerable to remote disconnects crashing the driver and the entire Spark job. This is especially useful for long-running jobs such as stream-processing workloads.
Spark Jobs Running on YARN
When running Spark on YARN, each Spark executor runs as a YARN container. Where MapReduce schedules a container and fires up a JVM for each task, Spark hosts multiple tasks within the same container. This approach enables several orders of magnitude faster task startup time.
Spark supports two modes for running on YARN, “yarn-cluster” mode and “yarn-client” mode. Broadly, yarn-cluster mode makes sense for production jobs, while yarn-client mode makes sense for interactive and debugging uses where you want to see your application’s output immediately.
Understanding the difference requires an understanding of YARN's Application Master concept. In YARN, each application instance has an Application Master process, which is the first container started for that application. The Application Master is responsible for requesting resources from the ResourceManager and, when allocated them, telling NodeManagers to start containers on its behalf. Application Masters obviate the need for an active client: the process starting the application can go away, and coordination continues from a process managed by YARN running on the cluster.
In yarn-cluster mode, the driver runs in the Application Master. This means that the same process is responsible for both driving the application and requesting resources from YARN, and this process runs inside a YARN container. The client that starts the app doesn’t need to stick around for its entire lifetime.
(Diagram: yarn-cluster mode)
The yarn-cluster mode is not well suited to using Spark interactively, but the yarn-client mode is. Spark applications that require user input, like spark-shell and PySpark, need the Spark driver to run inside the client process that initiates the Spark application. In yarn-client mode, the Application Master is merely present to request executor containers from YARN. The client communicates with those containers to schedule work after they start:
(Diagram: yarn-client mode)
The referenced post also includes a table with a concise list of differences between these modes; in short, in yarn-cluster mode the driver runs in the Application Master and interactive use such as spark-shell is not supported, while in yarn-client mode the driver runs in the client process and the Application Master only requests resources.
Reference: Apache Spark Resource Management and YARN App Models, https://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/ (a mirror is available on web.archive.org)
In yarn-cluster mode, the driver program runs on the node where the Application Master is running, whereas in yarn-client mode the driver program runs on the node from which the job is submitted, i.e. the centralized gateway node.
Both answers by Ram and Chavati/Wai Lee are useful and helpful, but here I just want to add a couple of other points:
Much has been written about the proximity of driver to the executors, and that is a big factor. The other factors are:
Will the driver process be around until the execution of the job is finished?
How is the returned data handled?
#1. In Spark client mode, the driver process must be up and running the whole time the Spark job is executing. So if you have a truly long job that takes, say, many hours to run, you need to make sure the driver process stays up and running and that the driver session does not get logged out automatically.
On the other hand, after submitting a job to run in cluster mode, the submitting process can go away and the job keeps running on the cluster. So this is typically how a production job runs: the job can be triggered by a timer or by an external event, and it then runs to completion without worrying about the lifetime of the process that submitted it.
#2. In client mode, you can call collect() to gather all the data back from the executors and then write/save the returned data to a local file on local disk. This may not work in cluster mode, as the driver typically runs on a different, remote host. Data written that way therefore needs to be persisted on a commonly mounted volume (such as GPFS or NFS) or in a distributed file system like HDFS. If the job is running under Hadoop/YARN, the more common approach in cluster mode is simply to have each executor persist the data to HDFS, and not to call collect() at all. collect() actually has scalability issues when a large number of executors return a large amount of data: it can overwhelm the driver process.
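A minimal sketch of that cluster-mode pattern, assuming a hypothetical job class and HDFS output path: the application saves its results to HDFS itself, and you pull them back with the HDFS CLI only if you need them locally:
# submit in cluster mode; the job itself writes its output to an HDFS path
# (e.g. hdfs:///user/me/results) instead of collecting it to the driver
spark-submit --master yarn --deploy-mode cluster \
  --class com.example.ReportJob report-job.jar

# later, from any machine with HDFS access, fetch the persisted results
hdfs dfs -get /user/me/results ./results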

Spark on Mesos is much slower than local

I'm running a Spark Streaming process on a host with 16 CPUs and 64 GB of RAM, using Mesos.
When I run it with Mesos as the cluster manager (by setting --master mesos://leader.mesos:5050), it runs much slower than in local mode (--master local[4]).
I can't find the reason for that and I have no clue. One thing I've noticed is that there is one specific task that takes significantly more time on Mesos than locally.
The weird thing (maybe that should be the question's title) is that the task itself takes 6s, while its stage (it has only one stage) takes less than a second. See the attached pictures (Mesos (1) and (2)). How come? Isn't a job equal to the sum of its parts?
(Screenshots: Local; Mesos (1) and (2))
Another note: I did manage to run this exact same Spark Streaming process on another Mesos cluster, and there it runs in a sensible amount of time, much like in the local mode described above. The only difference I can think of is that that cluster has more than one host, and Spark runs with 2 executors rather than 1 (I couldn't find a way to run more than 1 executor on the same host with Mesos). Could this be the reason?
Any clues would be much appreciated.
Spark can run over Mesos in two modes: coarse-grained (default) and fine-grained (see documentation).
In coarse-grained mode Spark launches exactly one executor on each machine it is assigned by Mesos, and inside that single Mesos task it runs its own smaller tasks. This has the benefit of lower startup overhead (in your case you don't want to change this mode).
Could you be more specific about your streaming job? Is it CPU-, disk-, or network-bound? You can easily compare performance if you run some of the Spark examples.
If your job is CPU-intensive you might consider setting spark.mesos.extra.cores. By default Spark tries to acquire all cores that Mesos offers, so if no other task is running on that cluster it shouldn't be a problem.
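For reference, that setting is just a --conf flag on spark-submit; a hypothetical sketch (the class and JAR are placeholders, and the master URL is the one from the question):
spark-submit --master mesos://leader.mesos:5050 \
  --conf spark.mesos.extra.cores=2 \
  --class com.example.StreamingApp streaming-app.jar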

Spark Standalone Mode multiple shell sessions (applications)

In Spark 1.0.0 Standalone mode with multiple worker nodes, I'm trying to run a Spark shell from two different computers (same Linux user).
In the documentation, it says "By default, applications submitted to the standalone mode cluster will run in FIFO (first-in-first-out) order, and each application will try to use all available nodes."
The maximum number of cores is set to 4, with 8 being available per worker (via SPARK_JAVA_OPTS="-Dspark.cores.max=4"). Memory is also limited such that enough should be available for both.
However, when looking at the Spark master web UI, the shell application that was started later always remains in the "WAITING" state until the first one exits. The number of cores assigned to it is 0, and the memory per node is 10G (the same as for the one that is already running).
Is there a way to have both shells running at the same time without using Mesos?
Before a shell will start processing on a Spark standalone cluster, there have to be sufficient cores and memory. You must specify from each Spark shell the number of cores you want, or it will use them all. If you specify 5 cores with executor memory = 10G (the amount of memory you allocated for the executors), and the second Spark shell runs with 2 cores and 10G of memory, the second one will still not start, because the first shell is using both executors and all of the memory on both. If you specify 5G of executor memory for each Spark shell, then they can run concurrently.
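A sketch of what that might look like (using the 5-core / 5G figures from the example above; the master URL is a placeholder): each shell caps both its total cores and its executor memory, so the second one can get executors too:
# first shell: 5 cores in total, 5G per executor
spark-shell --master spark://<master-host>:7077 --total-executor-cores 5 --executor-memory 5G

# second shell, from the other computer: 2 cores, 5G per executor,
# which now fits alongside the first
spark-shell --master spark://<master-host>:7077 --total-executor-cores 2 --executor-memory 5G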
Essentially you want to have multiple jobs running on a standalone cluster -- unfortunately, it is really not designed to handle this case well. If you want to do that you should use either Mesos or YARN.
One workaround is to restrict the number of cores per spark-shell using --total-executor-cores. For example, to restrict it to 16 cores, launch it like this:
bin/spark-shell --total-executor-cores 16 --master spark://$MASTER:7077
In this case each shell will use only 16 cores, so you can have two shells running on your 32-core cluster. They can then run simultaneously but never use more than 16 cores each :(
This solution is far from ideal, I know. You depend on users to restrict themselves, to shut down their shells, and resources are wasted when a user is not running code. I have created a request to fix this on JIRA, which you can vote for.
The application ends when your shell dies. So you cannot run two spark-shells concurrently from two laptops. What you can do is launch one spark-shell, launch the other, and have the second start when the first one dies.
Unlike spark-shell, spark-submit does terminate once the computation is over. So you can spark-submit one app, launch a spark-shell, and have the shell take over the moment the application is done.
Or you can run two apps sequentially (one after the other) with two spark-submit launches.

Resources