Spark Job Server multithreading and dynamic allocation - apache-spark

I had pretty big expectations from Spark Job Server, but found out it critically lack of documentation.
Could you please answer one/all of next questions:
Does Spark Job Server submit jobs through Spark session?
Is it possible to run few jobs in parallel with Spark Job Server? I saw people faced some troubles, I haven't seen solution yet.
Is it possible to run few jobs in parallel with different CPU, cores, executors configs?

Spark jobserver do not support SparkSession yet. We will be working on it.
Either you can create multiple contexts or you could run a context to use FAIR scheduler.
Use different contexts with different resource config.
Basically job server is just a rest API for creating spark contexts. So you should be able to do what you could do with spark context.

Related

Is it possible to run Hive on Spark with YARN capacity scheduler?

I use Apache Hive 2.1.1-cdh6.2.1 (Cloudera distribution) with MR as execution engine and YARN's Resource Manager using Capacity scheduler.
I'd like to try Spark as an execution engine for Hive. While going through the docs, I found a strange limitation:
Instead of the capacity scheduler, the fair scheduler is required. This fairly distributes an equal share of resources for jobs in the YARN cluster.
Having all the queues set up properly, that's very undesirable for me.
Is it possible to run Hive on Spark with YARN capacity scheduler? If not, why?
I'm not sure you can execute Hive using spark Engines. I highly recommend you configure Hive to use Tez https://cwiki.apache.org/confluence/display/Hive/Hive+on+Tez which is faster than MR and it's pretty similar to Spark due to it uses DAG as the task execution engine.
We are running it at work using the command on Beeline as described https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started just writing it at the beginning of the sql file to run
set hive.execution.engine=spark;
select ... from table....
We are not using capacity scheduler because there are hundreds of jobs run per yarn queue, and when jobs are resource avid, we have other queues to let them run. That also allows designing a configuration based on job consumption per queue more realistic based on the actual need of the group of jobs
Hope this helps

running multiple Spark jobs on a Mesos cluster

I would like to run multiple spark jobs on my Mesos cluster, and have all spark jobs share the same spark framework. Is this possible?
I have tried running the MesosClusterDispatcher and have the spark jobs connect to the dispatcher, but each spark job launches its own "Spark Framework" (I have tried running both client-mode and cluster-mode).
Is this the expected behaviour?
Is it possible to share the same spark-framework among multiple spark jobs?
It is normal and it's the expected behaviour.
In Mesos as far as I know, SparkDispatcher is in charge of allocate resources for your Spark Driver which will act as a framework. Once Spark driver has been allocated, it is responsible for talk to Mesos and accept offers to allocate the executors where tasks will be executed.

How to create a custom apache Spark scheduler?

I have a p2p mesh network of nodes. It has its own balancing and given a task T can reliably execute it (if one node fails another will continue). My mesh network has Java and Python apis. I wonder what are the steps needed to make Spark call my API to lunch tasks?
Oh boy, that's a really broad question, but I agree with Daniel. If you really want to do this, you could first start with:
Scheduler Backends, which states things like:
Being a scheduler backend in Spark assumes a Apache Mesos-like model in which "an application" gets resource offers as machines
become available and can launch tasks on them. Once a scheduler
backend obtains the resource allocation, it can start executors.
TaskScheduler, since you need to understand how tasks are meant
to be scheduled to build a scheduler, which mentions things like
this:
A TaskScheduler gets sets of tasks (as TaskSets) submitted to it from the DAGScheduler for each stage, and is responsible for sending
the tasks to the cluster, running them, retrying if there are
failures, and mitigating stragglers.
An important concept here is the Dependency Acyclic Graph (GDA),
where you can take a look at its GitHub page.
You can also read
What is the difference between FAILED AND ERROR in spark application states to get an intuition.
Spark Listeners — Intercepting Events from Spark can also come
in handy:
Spark Listeners intercept events from the Spark scheduler that are emitted over the course of execution of Spark applications.
You could take first the Exercise: Developing Custom SparkListener
to monitor DAGScheduler in Scala to check your understanding.
In general, Mastering Apache Spark 2.0 seems to have plenty of resources, but I will not list more here.
Then, you have to meet the Final Boss in this game, Spark's Scheduler GitHub page and get the feel. Hopefully, all this will be enough to get you started! :)
Take a look at how existing schedulers (YARN and Mesos) are implemented.
Implement the scheduler for your system.
Contribute your changes to the Apache Spark project.

Spark on YARN multitenancy

we are currently evaluating Spark on our cluster that already supports MRv2 over YARN.
We have noticed a problem with running jobs concurrently, in particular that a running Spark job will not release its resources until the job is finished. Ideally, if two people run any combination of MRv2 and Spark jobs, the resources should be fairly distributed.
I have noticed a feature called "dynamic resource allocation" in Spark 1.2, but this does not seem to be solving the problem, because it releases resources only when Spark is IDLE, not while it's BUSY.
I haven't been able to locate any further information on this matter. On the other hand, I feel this must be pretty common issue for a lot of users.
So,
What is your experience when dealing with multitenant MRv2 and Spark cluster with YARN?
Is Spark architectually adept to support releasing resources while it's busy? Is this a planned feature or is it something that conflicts with the idea of Spark executors?

Apache Spark application deployment best practices

I have a couple of use cases for Apache Spark applications/scripts, generally of the following form:
General ETL use case -
more specifically a transformation of a Cassandra column family containing many events (think event sourcing) into various aggregated column families.
Streaming use case -
realtime analysis of the events as they arrive in the system.
For (1), I'll need to kick off the Spark application periodically.
For (2), just kick off the long running Spark Streaming process at boot time and let it go.
(Note - I'm using Spark Standalone as the cluster manager, so no yarn or mesos)
I'm trying to figure out the most common / best practice deployment strategies for Spark applications.
So far the options I can see are:
Deploying my program as a jar, and running the various tasks with spark-submit - which seems to be the way recommended in the spark docs. Some thoughts about this strategy:
how do you start/stop tasks - just using simple bash scripts?
how is scheduling managed? - simply use cron?
any resilience? (e.g. Who schedules the jobs to run if the driver server dies?)
Creating a separate webapp as the driver program.
creates a spark context programmatically to talk to the spark cluster
allowing users to kick off tasks through the http interface
using Quartz (for example) to manage scheduling
could use cluster with zookeeper election for resilience
Spark job server (https://github.com/ooyala/spark-jobserver)
I don't think there's much benefit over (2) for me, as I don't (yet) have many teams and projects talking to Spark, and would still need some app to talk to job server anyway
no scheduling built in as far as I can see
I'd like to understand the general consensus w.r.t a simple but robust deployment strategy - I haven't been able to determine one by trawling the web, as of yet.
Thanks very much!
Even though you are not using Mesos for Spark, you could have a look at
-Chronos offering a distributed and fault tolerant cron
-Marathon a Mesos framework for long running applications
Note that this doesn't mean you have to move your spark deployment to mesos, e.g. you could just use chronos to trigger the spark -submit.
I hope I understood your problem correctly and this helps you a bit!

Resources