Spark on YARN multitenancy

We are currently evaluating Spark on our cluster, which already supports MRv2 over YARN.
We have noticed a problem with running jobs concurrently, in particular that a running Spark job will not release its resources until the job is finished. Ideally, if two people run any combination of MRv2 and Spark jobs, the resources should be fairly distributed.
I have noticed a feature called "dynamic resource allocation" in Spark 1.2, but it does not seem to solve the problem, because it releases resources only when Spark is IDLE, not while it's BUSY.
I haven't been able to locate any further information on this matter. On the other hand, I feel this must be a pretty common issue for a lot of users.
So,
What is your experience when dealing with a multitenant MRv2 and Spark cluster on YARN?
Is Spark architecturally suited to support releasing resources while it's busy? Is this a planned feature, or is it something that conflicts with the idea of Spark executors?
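For reference, a minimal sketch (not part of the original question) of the configuration that dynamic allocation relies on when running on YARN; the property names are Spark's, the values are only examples, and the external shuffle service additionally has to be set up on the YARN NodeManagers:

    // Minimal sketch of enabling dynamic allocation for a Spark-on-YARN application.
    // The numeric values are examples only.
    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("dynamic-allocation-sketch")            // hypothetical app name
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.shuffle.service.enabled", "true")        // requires the external shuffle service on the NodeManagers
      .set("spark.dynamicAllocation.minExecutors", "1")
      .set("spark.dynamicAllocation.maxExecutors", "20")
      .set("spark.dynamicAllocation.executorIdleTimeout", "60s") // idle executors are released after this timeout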

Related

Updating a job's jar on Databricks

I have a shared cluster on Databricks which is used by several jobs.
When I launch the execution of a job, the updated jar corresponding to that job is not used; on the cluster, I see that it uses an old version of the jar.
To clarify, I publish the jar through the Databricks API 2.0.
My question is: why, when I start the execution of my job, does the execution on the cluster always use an old version?
Thank you for your help.
The old jar will be removed from the cluster only when the cluster is terminated. If you have a shared cluster that never terminates, then that never happens. This is a limitation not of Databricks but of Java, which can't unload classes that are already in use (or at least it's very hard to implement reliably).
In most cases it's really not recommended to use a shared cluster, for several reasons:
it costs significantly more (~4x)
tasks affect each other from a performance point of view
there is a high probability of dependency conflicts, plus the inability to update libraries without affecting other tasks
there is a kind of "garbage" that collects on the driver nodes
...
If you use a shared cluster to get faster execution, I recommend looking into Instance Pools, especially in combination with preloading the Databricks Runtime onto the nodes in the instance pool.

Can I run a Spark 2.0.* artifact on a Spark 2.2.* standalone cluster?

I am aware of the fact that with a change of Spark's major version (i.e. from 1.* to 2.*) there will be compile-time failures due to changes in existing APIs.
As far as I know, Spark guarantees that with a minor version update (i.e. 2.0.* to 2.2.*), changes will be backward compatible.
Although this eliminates the possibility of compile-time failures with the upgrade, would it be safe to assume that there won't be any runtime failures either if I submit a job on a Spark 2.2.* standalone cluster using an artifact (jar) created with 2.0.* dependencies?
would it be safe to assume that there won't be any runtime failures either if I submit a job on a 2.2.* cluster using an artifact (jar) created with 2.0.* dependencies?
Yes.
I'd even say that there's no concept of a Spark cluster unless we talk about the built-in Spark Standalone cluster.
In other words, you deploy a Spark application to a cluster, e.g. Hadoop YARN or Apache Mesos, as an application jar that may or may not contain Spark jars and so can disregard what's already available in the environment.
If, however, you do mean Spark Standalone, things may have broken between releases, even between 2.0 and 2.2, as the jars in your Spark application have to be compatible with the ones on the JVMs of the Spark workers (they are already pre-loaded).
I would not claim full compatibility between releases of Spark Standalone.
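To make the "may or may not contain Spark jars" point concrete, a common approach is to mark Spark as a provided dependency so that whatever Spark version the cluster already ships is used at runtime. A minimal sbt sketch, where the module list and version numbers are only examples:

    // build.sbt -- minimal sketch; versions and modules are examples.
    name := "my-spark-app"
    scalaVersion := "2.11.8"

    // "provided" keeps the Spark jars out of the assembled application jar,
    // so the Spark version already on the cluster (e.g. 2.2.*) is used at runtime.
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "2.0.2" % "provided",
      "org.apache.spark" %% "spark-sql"  % "2.0.2" % "provided"
    )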

Spark Job Server multithreading and dynamic allocation

I had pretty big expectations of Spark Job Server, but found that it critically lacks documentation.
Could you please answer one or all of the following questions:
Does Spark Job Server submit jobs through a Spark session?
Is it possible to run a few jobs in parallel with Spark Job Server? I saw people run into some trouble with this; I haven't seen a solution yet.
Is it possible to run a few jobs in parallel with different CPU, core, and executor configs?
Spark Job Server does not support SparkSession yet. We will be working on it.
You can either create multiple contexts, or run a single context with the FAIR scheduler.
Use different contexts with different resource configs.
Basically, Job Server is just a REST API for creating Spark contexts, so you should be able to do anything you could do with a Spark context.
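As a rough illustration of the FAIR scheduler option (this is plain Spark, not a Spark Job Server API; the pool names and the allocation-file path are placeholders):

    // Minimal sketch of running jobs in parallel within one SparkContext using
    // the FAIR scheduler. Pool names and the allocation-file path are placeholders.
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("fair-scheduler-sketch")
      .set("spark.scheduler.mode", "FAIR")                                  // schedule jobs fairly inside the app
      .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml") // optional: defines named pools

    val sc = new SparkContext(conf)

    // Jobs submitted from different threads can be routed to different pools.
    sc.setLocalProperty("spark.scheduler.pool", "poolA")
    // ... trigger a job here (e.g. an action on an RDD) ...
    sc.setLocalProperty("spark.scheduler.pool", "poolB")
    // ... trigger another job, typically from another thread ...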

Apache Spark application deployment best practices

I have a couple of use cases for Apache Spark applications/scripts, generally of the following form:
General ETL use case -
more specifically a transformation of a Cassandra column family containing many events (think event sourcing) into various aggregated column families.
Streaming use case -
realtime analysis of the events as they arrive in the system.
For (1), I'll need to kick off the Spark application periodically.
For (2), just kick off the long running Spark Streaming process at boot time and let it go.
(Note - I'm using Spark Standalone as the cluster manager, so no YARN or Mesos)
I'm trying to figure out the most common / best practice deployment strategies for Spark applications.
So far the options I can see are:
Deploying my program as a jar, and running the various tasks with spark-submit - which seems to be the way recommended in the Spark docs. Some thoughts about this strategy:
how do you start/stop tasks - just using simple bash scripts?
how is scheduling managed? - simply use cron?
any resilience? (e.g. Who schedules the jobs to run if the driver server dies?)
Creating a separate webapp as the driver program.
creates a SparkContext programmatically to talk to the Spark cluster (see the sketch after this list)
allowing users to kick off tasks through the http interface
using Quartz (for example) to manage scheduling
could use cluster with zookeeper election for resilience
Spark job server (https://github.com/ooyala/spark-jobserver)
I don't think there's much benefit over (2) for me, as I don't (yet) have many teams and projects talking to Spark, and would still need some app to talk to job server anyway
no scheduling built in as far as I can see
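For option (2), a minimal sketch of what creating a SparkContext programmatically against a standalone master looks like; the master URL and resource settings below are placeholders:

    // Minimal sketch of a driver (e.g. embedded in a webapp) creating its own
    // SparkContext against a standalone cluster. Master URL and resource values
    // are placeholders.
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("webapp-driver-sketch")
      .setMaster("spark://master-host:7077")  // standalone master URL (placeholder)
      .set("spark.executor.memory", "4g")     // example resource settings
      .set("spark.cores.max", "8")

    val sc = new SparkContext(conf)
    // An HTTP handler in the webapp can now trigger jobs on sc,
    // and a scheduler such as Quartz can do the same.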
I'd like to understand the general consensus w.r.t a simple but robust deployment strategy - I haven't been able to determine one by trawling the web, as of yet.
Thanks very much!
Even though you are not using Mesos for Spark, you could have a look at
- Chronos, offering a distributed and fault-tolerant cron
- Marathon, a Mesos framework for long-running applications
Note that this doesn't mean you have to move your Spark deployment to Mesos; e.g. you could just use Chronos to trigger the spark-submit.
I hope I understood your problem correctly and this helps you a bit!

Which cluster type should I choose for Spark?

I am new to Apache Spark, and I just learned that Spark supports three types of cluster:
Standalone - meaning Spark will manage its own cluster
YARN - using Hadoop's YARN resource manager
Mesos - Apache's dedicated resource manager project
I think I should try Standalone first. In the future, I need to build a large cluster (hundreds of instances).
Which cluster type should I choose?
Spark Standalone Manager: A simple cluster manager included with Spark that makes it easy to set up a cluster. By default, each application uses all the available nodes in the cluster.
A few benefits of YARN over Standalone & Mesos:
YARN allows you to dynamically share and centrally configure the same pool of cluster resources between all frameworks that run on YARN.
You can take advantage of all the features of YARN schedulers for categorizing, isolating, and prioritizing workloads.
The Spark standalone mode requires each application to run an executor on every node in the cluster, whereas with YARN you choose the number of executors to use (see the sketch after this list).
YARN directly handles rack and machine locality in your requests, which is convenient.
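To illustrate the point above about choosing the number of executors on YARN, a minimal sketch; the values are examples only and can equally be passed on the spark-submit command line:

    // Minimal sketch of fixing executor count and size for a Spark-on-YARN app.
    // All values are examples.
    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("yarn-executor-sizing-sketch")
      .set("spark.executor.instances", "10")  // how many executors YARN should allocate
      .set("spark.executor.cores", "4")       // cores per executor
      .set("spark.executor.memory", "8g")     // memory per executor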
The resource request model is, oddly, backwards in Mesos. In YARN, you (the framework) request containers with a given specification and give locality preferences. In Mesos you get resource "offers" and choose to accept or reject those based on your own scheduling policy. The Mesos model is arguably more flexible, but seemingly more work for the person implementing the framework.
If you have a big Hadoop cluster already in place, YARN is the better choice.
The Standalone manager requires the user to configure each of the nodes with a shared secret. Mesos’ default authentication module, Cyrus SASL, can be replaced with a custom module. YARN has security for authentication, service-level authorization, authentication for Web consoles and data confidentiality. Hadoop authentication uses Kerberos to verify that each user and service is authenticated.
High availability is offered by all three cluster managers but Hadoop YARN doesn’t need to run a separate ZooKeeper Failover Controller.
Useful links:
spark documentation page
agildata article
I think the best ones to answer that are those who work on Spark. So, from Learning Spark:
Start with a standalone cluster if this is a new deployment. Standalone mode is the easiest to set up and will provide almost all the same features as the other cluster managers if you are only running Spark.
If you would like to run Spark alongside other applications, or to use richer resource scheduling capabilities (e.g. queues), both YARN and Mesos provide these features. Of these, YARN will likely be preinstalled in many Hadoop distributions.
One advantage of Mesos over both YARN and standalone mode is its fine-grained sharing option, which lets interactive applications such as the Spark shell scale down their CPU allocation between commands. This makes it attractive in environments where multiple users are running interactive shells.
In all cases, it is best to run Spark on the same nodes as HDFS for fast access to storage. You can install Mesos or the standalone cluster manager on the same nodes manually, or most Hadoop distributions already install YARN and HDFS together.
Standalone is pretty clear: as others mentioned, it should be used only when you have a Spark-only workload.
Between YARN and Mesos, one thing to consider is that, unlike MapReduce, a Spark job grabs executors and holds them for the entire lifetime of the job, whereas in MapReduce a job can acquire and release mappers and reducers over its lifetime.
If you have long-running Spark jobs that don't fully utilize all the resources they got at the beginning, you may want to share those resources with other applications, and that you can only do via either Mesos or Spark dynamic allocation. https://spark.apache.org/docs/2.0.2/job-scheduling.html#scheduling-across-applications
So with YARN, the only way to have dynamic allocation for Spark is to use Spark's own dynamic allocation; YARN won't interfere in that, while Mesos will. Again, this whole point only matters if you have a long-running Spark application that you would like to scale up and down dynamically.
In this case and similar dilemmas in data engineering, there are many side questions to be answered before choosing one distribution method over another.
For example, if you are not running your processing engine on more than 3 nodes, you usually are not facing a problem too big to handle, so the margin of performance tuning between YARN and Spark Standalone (based on experience) will not settle your decision. Usually you will try to keep your pipeline simple, especially when your services are not managed by a cloud provider and bugs and failures happen often.
I choose Standalone for relatively small or uncomplicated pipelines, but if I have a Hadoop cluster already in place, I prefer to take advantage of all the extra configuration that Hadoop (YARN) can give me.
Mesos has a more sophisticated scheduling design, allowing applications like Spark to negotiate with it. It's more suitable for the diversity of applications today. I found this article really insightful:
https://www.oreilly.com/ideas/a-tale-of-two-clusters-mesos-and-yarn
"... YARN is optimized for scheduling Hadoop jobs, which are historically (and still typically) batch jobs with long run times. This means that YARN was not designed for long-running services, nor for short-lived interactive queries (like small and fast Spark jobs), and while it’s possible to have it schedule other kinds of workloads, this is not an ideal model. The resource demands, execution model, and architectural demands of MapReduce are very different from those of long-running services, such as web servers or SOA applications, or real-time workloads like those of Spark or Storm..."
