Spark executors with different amounts of memory on Mesos - apache-spark

Is it possible to have executors with different amounts of memory on a Mesos cluster? Or am I bounded by the machine with the least memory? (Assuming I want to use all available cpus).

Short anwer: No.
Unfortunately, Spark Mesos and YARN only allow giving as much resources (cores, memory, etc.) per machine as your worst machine has (discussion). Ideally, the cluster should be homogeneous in order to take full advantage of its resources.
However, there might exist a workaround for your problem. According to the linked source above, Spark standalone allows creating multiple workers on some machines. You might modify your worker configuration to be appropriate for the worst machine, and start multiple workers on these.
For example, given two computers with 4G and 20G memory respectively, you could create 5 workers on the latter, each with a configuration to use just 4G of memory, as limited per the first machine.

Related

Limit Spark application from grabbing all the resources in a YARN cluster

We (an engineering team) are running an EMR cluster with YARN and Spark. What is typically happening is that when one user submits a heavy memory intensive job, it grabs all the YARN available memory and then all the subsequent users submitted jobs have to wait for that memory to clear (I know that autoscaling will solve this problem to a certain extent and we are looking into that, but we would like to avoid a single user occupying all the memory even when the cluster is autoscaled to it's full limits).
Is there a way to configure YARN such that any application (Spark or otherwise) may not occupy more than, say 75% of available memory?
Thanks
According to the documentation, you can manage the amount of memory allocated to an executor using the parameter: spark.executor.memory

Is there a way to set the niceness setting of spark executor processes?

I have a cluster of machines that I have to share with other processes. Lets just say I am not a nice person and want my spark executor processes to have a higher priority then other people's processes. How can I set that?
I am using StandAlone mode, v2.01, running on RHEL7
Spark does not currently (2.4.0) support nice process priorities. Grepping through the codebase, there is no usage of nice, and hence no easy to set process priority on executors using out-of-the-box Spark. It would be a little odd of Spark to do this, since it only assumes it can start a JVM, not that the base operating system is UNIX.
There are hacky ways to get around this that I do NOT recommend. For instance, if you are using Mesos as a resource manager, you could set spark.mesos.executor.docker.image to an image where java actually calls nice -1 old-java "$#".
Allocate all the resources to the spark application leaving minimal
resource needed for os to run.
A simple scenario :
Imagine a cluster with six nodes running NodeManagers(Yarn Mode), each equipped with 16 cores and 64GB of memory. The NodeManager capacities, yarn.nodemanager.resource.memory-mb and yarn.nodemanager.resource.cpu-vcores, should probably be set to 63 * 1024 = 64512 (megabytes) and 15 respectively. We avoid allocating 100% of the resources to YARN containers because the node needs some resources to run the OS and Hadoop daemons. In this case, we leave a gigabyte and a core for these system processes.

Do multiple spark applications running on yarn have any impact on each other?

Do multiple spark jobs running on yarn have any impact on each other?
e.g. If the traffic on one streaming job increases too much does it have any effect on second job? Will it slow it down or any other consequences?
I have enough resources for both of the applications to run concurrently.
Yes they do. Depending on how your scheduler is set up (static vs dynamic) they either share just the network output (important for shuffles) and disk throughput (important for shuffles, reading in of data locally or on HDFS, writing away data locally or on HDFS) or also the memory and CPUs if it's on dynamic allocation. Still, running your two jobs on parallel as opposed to sequentially will benefit on average, due to the network and disk resources not being used constantly. This mostly depends on the amount of shuffling necessary in your jobs.

Spark resource scheduling - Standalone cluster manager

I have pretty low configuration testing machine for my data pipelines developed in Spark. I will use only one AWS t2.large instance, which has only 2 CPUs and 8 GB of RAM.
I need to run 2 spark streaming jobs, as well as leave some memory and CPU power for occasionally testing batch jobs.
So I have master and one worker, which are on the same machine.
I have some general questions:
1) How many executors can run per one worker? I know that default is one, but does it make sense to change this?
2) Can one executor execute multiple applications, or one executor is dedicated only to one application?
3) Is a way to make this work, to set memory that application can use in configuration file, or when I create spark context?
Thank you
How many executors can run per one worker? I know that default is one, but does it make sense to change this?
It makes sense only in case you have enough resources. Say, on a machine with 24 GB and 12 cores it's possible to run 3 executors if you're sure that 8 GB is enough for one executor.
Can one executor execute multiple applications, or one executor is dedicated only to one application?
Nope, every application starts their own executors.
Is a way to make this work, to set memory that application can use in configuration file, or when I create spark context?
I'm not sure I understand the question, but there are 3 ways to provide configuration for applications
file spark-defaults.conf, but don't forget to turn on to read default properties, when you create new SparkConf instance.
providing system properties through -D, when you run the application or --conf if that's spark-submit or spark-shell. Although for memory options there are specific parameters like spark.executor.memory or spark.driver.memory and others to be used.
provides the same options through new SparkConf instance using its set methods.

What are the benefits of running multiple Spark instances per node (physical machine)?

Is there any advantage to starting more than one spark instance (master or worker) on a particular machine/node?
The spark standalone documentation doesn't explicitly say anything about starting a cluster or multiple workers on the same node. It does seem to implicitly conflate that one worker equals one node
Their hardware provisioning page says:
Finally, note that the Java VM does not always behave well with more than 200 GB of RAM. If you purchase machines with more RAM than this, you can run multiple worker JVMs per node. In Spark’s standalone mode, you can set the number of workers per node with the SPARK_WORKER_INSTANCES variable in conf/spark-env.sh, and the number of cores per worker with SPARK_WORKER_CORES.
So aside from working with large amounts of memory or testing cluster configuration, is there any benefit to running more than one worker per node?
I think the obvious benefit is to improve the resource utilization of the hardware per box without losing performance. In terms of parallelism, one big executor with multiple cores seems to be same with multiple executors with less cores.

Resources