How to determine the underlying MapReduce jobs in Spark? - apache-spark

Given a Spark application, how to determine how the application is mapped into its underlying MapReduce jobs?

The Spark application itself doesn't know anything about the underlying execution framework. That is part of the abstraction that allows it to run in different modes (local, Mesos, standalone, yarn-client and yarn-cluster).
You will, however, see the YARN application id after submitting your application with spark-submit. It usually looks something like this:
application_1453729472522_0110
You can also use the yarn command to list currently running applications like this:
yarn application -list
That prints all applications running in the cluster. Spark applications have the application type SPARK.
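If you want to handle such ids in a script, they can be split into their parts. This is only a minimal sketch, assuming the usual YARN layout application_<cluster timestamp>_<sequence number>; the helper name parse_application_id is made up for illustration.

```python
import re
from typing import Tuple

# Assumed layout of a YARN application id: application_<clusterTimestamp>_<sequence>
_APP_ID = re.compile(r"^application_(\d+)_(\d+)$")


def parse_application_id(app_id: str) -> Tuple[int, int]:
    """Return (cluster_timestamp, sequence_number) for a YARN application id."""
    match = _APP_ID.match(app_id.strip())
    if match is None:
        raise ValueError(f"not a YARN application id: {app_id!r}")
    return int(match.group(1)), int(match.group(2))
```

Calling parse_application_id("application_1453729472522_0110") yields the cluster timestamp 1453729472522 and the sequence number 110. As far as I know, the listing itself can also be narrowed with yarn application -list -appTypes SPARK.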

I would say each stage is a MapReduce job. I cannot give you a reference for this, but in my experience, by looking at how the stages are constructed you can see what was cast as a Map phase (chained maps, filters, flatMaps) and what was cast as a Reduce phase (groupBy, collect, join, etc.) and grouped into one stage. You can also identify Map-only or Reduce-only MapReduce jobs.
It also helps to output the DAG, as you see the same chaining there.
You can access the stages in the Spark UI while your Spark job is running.
Disclaimer: this is based on experience and deductive reasoning.
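To make the chaining idea concrete, here is a toy model in Python. It is not Spark's DAGScheduler, only a sketch that assumes a linear chain of transformations and assumes that each shuffle-requiring operation starts a new stage. The WIDE_OPS set and the split_into_stages name are illustrative choices, and actions such as collect are left out.

```python
from typing import List

# Assumption for this toy model: these transformations always need a shuffle.
WIDE_OPS = {"groupBy", "groupByKey", "reduceByKey", "join", "distinct", "repartition"}


def split_into_stages(ops: List[str]) -> List[List[str]]:
    """Group a linear chain of transformations into stages.

    Narrow operations are chained into the current stage; a wide
    operation closes the current stage and opens the next one.
    """
    stages: List[List[str]] = [[]]
    for op in ops:
        if op in WIDE_OPS and stages[-1]:
            stages.append([])  # shuffle boundary: start a new stage
        stages[-1].append(op)
    return [stage for stage in stages if stage]
```

For the chain map, filter, groupBy, map, join, filter this gives three stages: [map, filter], [groupBy, map] and [join, filter]. The real stage list for a job is what the Spark UI shows.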

Related

Does spark behave like a library?

When writing an application in Scala using Spark, is it, when run, a regular Scala application which "delegates the Spark jobs to the Spark cluster" and gets the desired results back?
Or does it get completely compiled to something special consumed by a "spark engine" ?
It depends on the "deploy mode".
If you use local mode, then all actions occur locally, and you don't get any of the benefits of distribution that Spark is meant for. While it can be used to abstract different libraries and provide clean ways to process data via dataframes or ML, it's not really intended to be used like that.
Instead, you can use cluster mode, in which your app just defines the steps to take, and then, when submitted, everything happens in the remote cluster. In order to process data back in the driver, you need to use methods such as collect(), or otherwise download the results from remote file systems/databases.

How to ensure that DAG is not recomputed after the driver is restarted?

How can I ensure that an entire Spark DAG is highly available, i.e. not recomputed from scratch when the driver is restarted (default HA in YARN cluster mode)?
Currently, I use spark to orchestrate multiple smaller jobs i.e.
read table1
hash some columns
write to HDFS
This is performed for multiple tables.
Now when the driver is restarted, e.g. while working on the second table, the first one is reprocessed, even though it has already been stored successfully.
I believe that the default mechanism of checkpointing (the raw input values) would not make sense.
What would be a good solution here?
Is it possible to checkpoint the (small) configuration information and only reprocess what has not already been computed?
TL;DR Spark is not a task orchestration tool. While it has a built-in scheduler and some fault tolerance mechanisms, it is about as suitable for granular task management as it is for, say, server orchestration (hey, we can call pipe on each machine to execute bash scripts, right?).
If you want granular recovery, choose a minimal unit of computation that makes sense for a given process (read, hash, write looks like a good choice, based on the description), make it an application, and use external orchestration to submit the jobs.
You can build a poor man's alternative by checking whether the expected output exists and skipping that part of the job if it does, but really don't: we have a variety of battle-tested tools which can do a far better job than this.
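For completeness, the skip-if-already-done check could look roughly like the sketch below. It assumes the output lives on a filesystem visible through a local path and that a finished write leaves a _SUCCESS marker file in the output directory (the marker Hadoop's file output committer writes by default). The names run_pending and process are hypothetical; process stands for whatever submits the read, hash, write application for one table.

```python
from pathlib import Path
from typing import Callable, Iterable, List


def run_pending(tables: Iterable[str], output_root: Path,
                process: Callable[[str, Path], None]) -> List[str]:
    """Run `process` for every table whose output has no _SUCCESS marker yet."""
    processed = []
    for table in tables:
        out_dir = output_root / table
        if (out_dir / "_SUCCESS").exists():
            continue  # already written by an earlier run, skip it
        process(table, out_dir)
        processed.append(table)
    return processed
```

With HDFS as the target you would need an HDFS client instead of pathlib, which is one more reason to leave this bookkeeping to a proper workflow tool.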
As a side note, Spark doesn't provide HA for the driver, only supervision with automatic restarts. Also, independent jobs (read -> transform -> write) create independent DAGs: there is no global DAG, and a proper checkpoint of the application would require a full snapshot of its state (like good old BLCR).
when the driver is restarted (default HA in yarn cluster mode).
When the driver of a Spark application is gone, your Spark application is gone and so are all the cached datasets. That's by default.
You have to use some sort of caching solution like https://www.alluxio.org/ or https://ignite.apache.org/. Both work with Spark and both claim to let data outlive a Spark application.
There have been times when people used Spark Job Server to share data across Spark applications (which is similar to restarting Spark drivers).

Spark Job Server multithreading and dynamic allocation

I had pretty big expectations of Spark Job Server, but found that it critically lacks documentation.
Could you please answer one or all of the following questions:
Does Spark Job Server submit jobs through a Spark session?
Is it possible to run a few jobs in parallel with Spark Job Server? I saw that people ran into trouble with this, and I haven't seen a solution yet.
Is it possible to run a few jobs in parallel with different CPU, cores, and executor configs?
Spark jobserver does not support SparkSession yet. We will be working on it.
You can either create multiple contexts or run a single context with the FAIR scheduler.
Use different contexts with different resource configs.
Basically, job server is just a REST API for creating Spark contexts, so you should be able to do whatever you could do with a Spark context.
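As a rough illustration of "different contexts with different resource configs", the sketch below only builds the REST URLs and does not send anything. The host and port, the route shapes, and the query parameter names (num-cpu-cores, memory-per-node, appName, classPath, context) are written from memory of the spark-jobserver README and should be treated as assumptions to check against your version. The context names and the class com.example.HashTable are made up.

```python
from urllib.parse import urlencode

JOBSERVER = "http://localhost:8090"  # assumed host and port


def context_url(name: str, cores: int, memory: str) -> str:
    """URL to POST to in order to create a context with its own resources (assumed route)."""
    query = urlencode({"num-cpu-cores": cores, "memory-per-node": memory})
    return f"{JOBSERVER}/contexts/{name}?{query}"


def job_url(app_name: str, class_path: str, context: str) -> str:
    """URL to POST to in order to run a job inside a named context (assumed route)."""
    query = urlencode({"appName": app_name, "classPath": class_path, "context": context})
    return f"{JOBSERVER}/jobs?{query}"


# Two contexts with different sizes; jobs sent to them can run side by side.
small = context_url("small", 2, "512m")
large = context_url("large", 8, "4g")
```

For several jobs inside one context, the Spark setting spark.scheduler.mode=FAIR is the relevant switch; how it is passed to a context depends on the job server configuration.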

How to create a custom apache Spark scheduler?

I have a p2p mesh network of nodes. It has its own balancing and, given a task T, can reliably execute it (if one node fails, another will continue). My mesh network has Java and Python APIs. I wonder what steps are needed to make Spark call my API to launch tasks?
Oh boy, that's a really broad question, but I agree with Daniel. If you really want to do this, you could start with:
Scheduler Backends, which states things like:
Being a scheduler backend in Spark assumes an Apache Mesos-like model in which "an application" gets resource offers as machines become available and can launch tasks on them. Once a scheduler backend obtains the resource allocation, it can start executors.
TaskScheduler, since you need to understand how tasks are meant to be scheduled in order to build a scheduler, which mentions things like this:
A TaskScheduler gets sets of tasks (as TaskSets) submitted to it from the DAGScheduler for each stage, and is responsible for sending the tasks to the cluster, running them, retrying if there are failures, and mitigating stragglers.
An important concept here is the Directed Acyclic Graph (DAG); you can take a look at its GitHub page.
You can also read What is the difference between FAILED AND ERROR in spark application states to get an intuition.
Spark Listeners - Intercepting Events from Spark can also come in handy:
Spark Listeners intercept events from the Spark scheduler that are emitted over the course of execution of Spark applications.
You could first take the Exercise: Developing Custom SparkListener to monitor DAGScheduler in Scala to check your understanding.
In general, Mastering Apache Spark 2.0 seems to have plenty of resources, but I will not list more here.
Then, you have to meet the Final Boss in this game, Spark's Scheduler GitHub page, and get a feel for it. Hopefully, all this will be enough to get you started! :)
Take a look at how existing schedulers (YARN and Mesos) are implemented.
Implement the scheduler for your system.
Contribute your changes to the Apache Spark project.
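To get a feel for the TaskScheduler responsibility quoted above (send tasks out, retry on failure), here is a deliberately tiny, sequential Python sketch. It is not Spark code: a real integration would implement Spark's scheduler interfaces on the JVM. The launch callable stands in for a hypothetical call into the mesh network's Python API, and the default of 4 mirrors Spark's spark.task.maxFailures default.

```python
from typing import Callable, Dict, Hashable, List


def run_task_set(tasks: List[Hashable],
                 launch: Callable[[Hashable], object],
                 max_failures: int = 4) -> Dict[Hashable, object]:
    """Run every task via `launch`, retrying each one until it has failed max_failures times."""
    results: Dict[Hashable, object] = {}
    for task in tasks:
        failures = 0
        while True:
            try:
                results[task] = launch(task)  # hypothetical call into the mesh API
                break
            except Exception as exc:
                failures += 1
                if failures >= max_failures:
                    # Give up on the whole task set once one task keeps failing.
                    raise RuntimeError(f"task {task!r} failed {failures} times") from exc
    return results
```

Straggler mitigation, locality and resource offers are all missing here. The point is only the shape of the contract: a set of tasks goes in, and either all results come out or the set is aborted.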

Apache Spark application deployment best practices

I have a couple of use cases for Apache Spark applications/scripts, generally of the following form:
(1) General ETL use case - more specifically, a transformation of a Cassandra column family containing many events (think event sourcing) into various aggregated column families.
(2) Streaming use case - realtime analysis of the events as they arrive in the system.
For (1), I'll need to kick off the Spark application periodically.
For (2), just kick off the long running Spark Streaming process at boot time and let it go.
(Note - I'm using Spark Standalone as the cluster manager, so no yarn or mesos)
I'm trying to figure out the most common / best practice deployment strategies for Spark applications.
So far the options I can see are:
(1) Deploying my program as a jar, and running the various tasks with spark-submit - which seems to be the way recommended in the Spark docs. Some thoughts about this strategy:
- how do you start/stop tasks - just using simple bash scripts?
- how is scheduling managed? - simply use cron?
- any resilience? (e.g. who schedules the jobs to run if the driver server dies?)
(2) Creating a separate webapp as the driver program.
- creates a Spark context programmatically to talk to the Spark cluster
- allowing users to kick off tasks through the HTTP interface
- using Quartz (for example) to manage scheduling
- could use a cluster with ZooKeeper election for resilience
(3) Spark job server (https://github.com/ooyala/spark-jobserver)
- I don't think there's much benefit over (2) for me, as I don't (yet) have many teams and projects talking to Spark, and would still need some app to talk to job server anyway
- no scheduling built in as far as I can see
I'd like to understand the general consensus w.r.t a simple but robust deployment strategy - I haven't been able to determine one by trawling the web, as of yet.
Thanks very much!
Even though you are not using Mesos for Spark, you could have a look at
- Chronos, offering a distributed and fault-tolerant cron
- Marathon, a Mesos framework for long-running applications
Note that this doesn't mean you have to move your Spark deployment to Mesos, e.g. you could just use Chronos to trigger the spark-submit.
I hope I understood your problem correctly and this helps you a bit!
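Whatever ends up triggering the job (cron, Chronos, or something else), it has to produce a spark-submit call. Below is a small sketch of assembling one for a standalone cluster. The master URL, class name and jar path are placeholders, and the --supervise flag (driver restart on failure in standalone cluster deploy mode) is worth checking against the docs for your Spark version.

```python
import shlex
from typing import List


def spark_submit_command(master: str, main_class: str, jar: str,
                         supervise: bool = True) -> List[str]:
    """Build the argument list for a cluster-mode spark-submit call."""
    cmd = ["spark-submit", "--master", master, "--deploy-mode", "cluster"]
    if supervise:
        cmd.append("--supervise")  # ask the standalone master to restart a failed driver
    cmd += ["--class", main_class, jar]
    return cmd


# Placeholder values, not taken from the question.
command = spark_submit_command("spark://master-host:7077",
                               "com.example.EventAggregation",
                               "/opt/jobs/etl.jar")
command_line = " ".join(shlex.quote(part) for part in command)
```

The resulting command_line string is what would go into the crontab entry or the Chronos job definition; the list form is convenient for subprocess.run if the trigger is itself a Python script.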
