I have a requirement where I need to support adhoc runnable task which are spark streaming jobs.
This is job scheduling mechanism should support orchestrating , monitoring , rescheduling and cancelling operations of these stream jobs. Error handling such as stream complete notifications , steam error notification is also one of the area that it is expected to support.
I tried to explore Airflow. Owning to limited knowledge on Airflow, I am not able to find anything that can help me to orchestra this sparked streaming jobs.Airflow seems supporting scheduling but with batch and intervals.
Could you please suggest any libraries or any mechanism using open-source technology that can help me achieve this particular requirement.References to enterprise solutions are also welcome.
Related
My goal is to optimally run Spark applications alongside the stateless workload in my cluster to make the best use of my cluster resources.
Since Spark applications can suffer from partial scheduling (drivers blocking the executors as the driver pods are started first which then request for the executor pods), a simple strategy to prevent this would be to implement the much talked about gang/co-scheduling to make sure that we only start the driver pod if we can guarantee that all the executors can be started in the future by implementing some kind of a reservations design such that the driver can reserve resources for the executors that will be started in the future.
Also, this reservation definition/implementation must be visible to all the other non-spark pods as well since they would also have to log their resource requests like the Spark pods so we have a clear picture of the cluster resource utilization.
The current implementations include running a new custom scheduler or implementing a scheduler extender to do so but I was wondering if we can achieve this by writing custom scheduler plugins. Additionally, what extension points in the scheduling framework would the plugins have to take advantage of to optimize the scheduling of Spark jobs in a multi-tenant environment (with different kinds of workload) so that my default profile can continue to schedule the stateless workload while the custom profile that uses these plugins can schedule Spark applications?
Finally, would this be the best way to optimize scheduling Spark and Stateless workload in a multi-tenant environment? What would the drawbacks of this approach (using custom plugins) be since we only have a single queue that all the profiles must share?
It sounds like what you would like to have is Gang Scheduling 📆. If you'd like to have that capability, I suggest you use Volcano to schedule/run 🏃 your jobs in Kubernetes with Gang Scheduling.
Another approach is to create your own scheduler using the scheduler extender as described here or use the Palantir gang scheduler extender.
✌️
I have an Apache Spark data loading and transformation application with pyspark.sql that runs for half an hour before throwing an AttributeError or other run-time exceptions.
I want to test my application end-to-end with a small data sample, something like Apache Pig's ILLUSTRATE. Sampling down the data does not help much. Is there a simple way to do this?
It sounds like an idea that could easily be handled by a SparkListener. It gives you access to all the low-level details that the web UI of any Spark application could ever be able to show you. All the events that are flying between the driver (namely DAGScheduler and TaskScheduler with SchedulerBackend) and executors are posted to registered SparkListeners, too.
A Spark listener is an implementation of the SparkListener developer API (that is an extension of SparkListenerInterface where all the callback methods are no-op/do-nothing).
Spark uses Spark listeners for web UI, event persistence (for Spark History Server), dynamic allocation of executors and other services.
You can develop your own custom Spark listeners and register them using SparkContext.addSparkListener method or spark.extraListeners setting.
Go to a Spark UI of your job and you will find a DAG Visualization there. That's a graph representing your job
To test your job on a sample use sample as an input first of all ;) Also you may run your spark locally, not on a cluster and then debug it in IDE of your choice (like IDEA)
More info:
This great answer explaining DAG
DAG introduction from DataBricks
I have a p2p mesh network of nodes. It has its own balancing and given a task T can reliably execute it (if one node fails another will continue). My mesh network has Java and Python apis. I wonder what are the steps needed to make Spark call my API to lunch tasks?
Oh boy, that's a really broad question, but I agree with Daniel. If you really want to do this, you could first start with:
Scheduler Backends, which states things like:
Being a scheduler backend in Spark assumes a Apache Mesos-like model in which "an application" gets resource offers as machines
become available and can launch tasks on them. Once a scheduler
backend obtains the resource allocation, it can start executors.
TaskScheduler, since you need to understand how tasks are meant
to be scheduled to build a scheduler, which mentions things like
this:
A TaskScheduler gets sets of tasks (as TaskSets) submitted to it from the DAGScheduler for each stage, and is responsible for sending
the tasks to the cluster, running them, retrying if there are
failures, and mitigating stragglers.
An important concept here is the Dependency Acyclic Graph (GDA),
where you can take a look at its GitHub page.
You can also read
What is the difference between FAILED AND ERROR in spark application states to get an intuition.
Spark Listeners — Intercepting Events from Spark can also come
in handy:
Spark Listeners intercept events from the Spark scheduler that are emitted over the course of execution of Spark applications.
You could take first the Exercise: Developing Custom SparkListener
to monitor DAGScheduler in Scala to check your understanding.
In general, Mastering Apache Spark 2.0 seems to have plenty of resources, but I will not list more here.
Then, you have to meet the Final Boss in this game, Spark's Scheduler GitHub page and get the feel. Hopefully, all this will be enough to get you started! :)
Take a look at how existing schedulers (YARN and Mesos) are implemented.
Implement the scheduler for your system.
Contribute your changes to the Apache Spark project.
Please consider the following scenario:
created initial pipeline via Spark streaming
enable checkpointing
run the application for a while
stop streaming application
made tiny changes to the pipeline, e.g. business logic remained untouched but did some refactoring, renaming, class moving, etc.
restart the streaming application
get an exception as pipeline stored in checkpoint directory differs on a class level from the new one
What are the best practices to deal with such a scenario? How can we seamlessly upgrade streaming application with checkpointing enabled? What are the best practices for versioning of streaming applications?
tl;dr Checkpointing is for recovery situations not for upgrades.
From the official documentation about Checkpointing:
A streaming application must operate 24/7 and hence must be resilient to failures unrelated to the application logic (e.g., system failures, JVM crashes, etc.). For this to be possible, Spark Streaming needs to checkpoint enough information to a fault- tolerant storage system such that it can recover from failures.
So to answer your question about using checkpointing (that is meant for fault tolerance) and changing your application code, you should not expect it would work since it is against the design.
I have a couple of use cases for Apache Spark applications/scripts, generally of the following form:
General ETL use case -
more specifically a transformation of a Cassandra column family containing many events (think event sourcing) into various aggregated column families.
Streaming use case -
realtime analysis of the events as they arrive in the system.
For (1), I'll need to kick off the Spark application periodically.
For (2), just kick off the long running Spark Streaming process at boot time and let it go.
(Note - I'm using Spark Standalone as the cluster manager, so no yarn or mesos)
I'm trying to figure out the most common / best practice deployment strategies for Spark applications.
So far the options I can see are:
Deploying my program as a jar, and running the various tasks with spark-submit - which seems to be the way recommended in the spark docs. Some thoughts about this strategy:
how do you start/stop tasks - just using simple bash scripts?
how is scheduling managed? - simply use cron?
any resilience? (e.g. Who schedules the jobs to run if the driver server dies?)
Creating a separate webapp as the driver program.
creates a spark context programmatically to talk to the spark cluster
allowing users to kick off tasks through the http interface
using Quartz (for example) to manage scheduling
could use cluster with zookeeper election for resilience
Spark job server (https://github.com/ooyala/spark-jobserver)
I don't think there's much benefit over (2) for me, as I don't (yet) have many teams and projects talking to Spark, and would still need some app to talk to job server anyway
no scheduling built in as far as I can see
I'd like to understand the general consensus w.r.t a simple but robust deployment strategy - I haven't been able to determine one by trawling the web, as of yet.
Thanks very much!
Even though you are not using Mesos for Spark, you could have a look at
-Chronos offering a distributed and fault tolerant cron
-Marathon a Mesos framework for long running applications
Note that this doesn't mean you have to move your spark deployment to mesos, e.g. you could just use chronos to trigger the spark -submit.
I hope I understood your problem correctly and this helps you a bit!