Create a Spark pool per user by default on Zeppelin Notebook - apache-spark

I am working with Spark inside Zeppelin in a collaborative environment, so we have a single interpreter shared by many users. For this reason, I configured it with per-user instantiation in scoped mode.
With this configuration, a job from user X waits for the resources allocated to jobs from other users.
To change this behavior and allow jobs from different users to run at the same time, I set the Spark configuration spark.scheduler.mode to FAIR (in the Zeppelin interpreter settings). For this to take effect, each user needs to manually define their own Spark pool in their notebook (jobs from different pools can run at the same time: https://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application) with this code:
sc.setLocalProperty("spark.scheduler.pool", "pool1")
P.S.: After one hour the interpreter shuts down. If users forget to run this command the next time, their jobs fall into the default pool, which is not good.
What I want to know: is it possible to set a per-user Spark pool automatically when a user runs their paragraphs, without manual effort every time?
If there is another way to do this, please let me know.
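For what it's worth, here is a minimal sketch of one workaround, assuming the pool assignment lives in a shared init paragraph that every notebook runs first. The pool-name convention and the use of the OS user are my assumptions, not a Zeppelin feature; sc is the SparkContext that Zeppelin provides.

import getpass
# Derive a per-user pool name and set it once per interpreter session.
# getpass.getuser() returns the OS user of the interpreter process, which in
# scoped mode may differ from the Zeppelin login user; substitute whatever
# user identifier your setup exposes.
pool_name = "pool_" + getpass.getuser()
sc.setLocalProperty("spark.scheduler.pool", pool_name)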

Related

How to manage multiple environments in pyspark clusters?

I want to:
Have multiple Python environments on my PySpark Dataproc cluster
Specify, when submitting a job, which environment the job should run in
I want the environments to persist so that I can use them on an as-needed basis. I won't tear down the cluster, but I will occasionally stop it. I want the environments to persist the way they do on a normal VM.
Currently, I know how to submit a job together with its entire environment using conda-pack, but the problem is that this ships the whole environment payload every time I submit a job, and it does not address the issue of handling multiple environments for different projects.
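For reference, a hedged sketch of the "stage the packed environment once" variant of that conda-pack approach. It assumes YARN client mode (the Dataproc default); the GCS path and the environment alias are made up, and the exact keys may need adjusting for your deploy mode and Spark version.

import os
from pyspark.sql import SparkSession
# The packed environment is assumed to already live at this (hypothetical) GCS
# path; "#environment" tells YARN to unpack it under that alias on each
# executor, so the payload is not re-shipped from the client on every submit.
os.environ["PYSPARK_PYTHON"] = "./environment/bin/python"   # interpreter the executors should use
spark = (SparkSession.builder
         .appName("prestaged-conda-env")
         .config("spark.yarn.dist.archives", "gs://my-bucket/envs/project_a.tar.gz#environment")
         .getOrCreate())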

Do I need to/how to clean up Spark Sessions?

I am launching Spark 2.4.6 in a Python Flask web service. I'm running a single Spark Context and I have also enabled FAIR scheduling.
Each time a user makes a request to one of the REST endpoints, I call spark = sparkSession.newSession() and then execute various operations using Spark SQL in this somewhat isolated environment.
My concern is: after 100, 10,000, or a million requests, with an equal number of new sessions, am I at some point going to run into issues? Is there a way to let my SparkContext know that I don't need an old session anymore and that it can be cleared?
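For context, a minimal sketch of that per-request pattern (the schema, view name, and cleanup call are illustrative): newSession() shares the SparkContext and table cache but gets its own SQL conf, temporary views, and UDFs, so dropping the views you created, or simply letting the session object fall out of scope, is normally the only cleanup involved.

from pyspark.sql import SparkSession

base = SparkSession.builder.appName("flask-spark").getOrCreate()

def handle_request(sql_text, rows):
    # One throwaway session per request: separate SQL conf, temp views and UDFs,
    # but the same underlying SparkContext and table cache as `base`.
    session = base.newSession()
    session.createDataFrame(rows, "id INT, value STRING") \
           .createOrReplaceTempView("request_data")      # visible only in this session
    result = session.sql(sql_text).collect()
    session.catalog.dropTempView("request_data")         # explicit cleanup (optional)
    return result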

How to ensure that DAG is not recomputed after the driver is restarted?

How can I ensure that the entire DAG of a Spark job is highly available, i.e. not recomputed from scratch when the driver is restarted (default HA in YARN cluster mode)?
Currently, I use Spark to orchestrate multiple smaller jobs, i.e.
read table1
hash some columns
write to HDFS
this is performed for multiple tables.
Now when the driver is restarted, i.e. while working on the second table, the first one is reprocessed, even though it would already have been stored successfully.
I believe that the default mechanism of checkpointing (the raw input values) would not make sense.
What would be a good solution here?
Is it possible to checkpoint the (small) configuration information and only reprocess what has not already been computed?
TL;DR Spark is not a task orchestration tool. While it has a built-in scheduler and some fault tolerance mechanisms, it is about as suitable for granular task management as, say, server orchestration software (hey, we can call pipe on each machine to execute bash scripts, right?).
If you want granular recovery, choose a minimal unit of computation that makes sense for a given process (read, hash, write looks like a good choice, based on the description), make it an application, and use external orchestration to submit the jobs.
You can build a poor man's alternative by checking whether the expected output exists and skipping that part of the job in that case, but really don't - we have a variety of battle-tested tools which can do a far better job than this.
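For completeness, that "check whether the output exists" fallback might look roughly like this in PySpark (paths, table names, and the hashed column are illustrative, and the FileSystem lookup goes through py4j internals):

from pyspark.sql import SparkSession
from pyspark.sql.functions import sha2, col

spark = SparkSession.builder.appName("hash-tables").getOrCreate()

def output_exists(path):
    # Hadoop FileSystem check via the JVM gateway; slightly internal, but a common idiom.
    jvm = spark._jvm
    fs = jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())
    return fs.exists(jvm.org.apache.hadoop.fs.Path(path))

for table in ["table1", "table2", "table3"]:
    target = "hdfs:///hashed/" + table
    if output_exists(target):
        continue  # output written before the restart; skip recomputation
    (spark.table(table)
          .withColumn("some_column", sha2(col("some_column").cast("string"), 256))
          .write.parquet(target))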
As a side note, Spark doesn't provide HA for the driver, only supervision with automatic restarts. Also, independent jobs (read -> transform -> write) create independent DAGs - there is no global DAG, and a proper checkpoint of the application would require a full snapshot of its state (like good old BLCR).
when the driver is restarted (default HA in yarn cluster mode).
When the driver of a Spark application is gone, your Spark application is gone and so are all the cached datasets. That's by default.
You have to use some sort of caching solution like https://www.alluxio.org/ or https://ignite.apache.org/. Both work with Spark and both claim to offer the ability to outlive a Spark application.
There have been times when people used Spark Job Server to share data across Spark applications (which is similar to restarting Spark drivers).

How to enable Fair scheduler?

I'd like to understand the internals of Spark's FAIR scheduling mode. The thing is that it does not seem as fair as one would expect from the official Spark documentation:
Starting in Spark 0.8, it is also possible to configure fair sharing between jobs. Under fair sharing, Spark assigns tasks between jobs in a “round robin” fashion, so that all jobs get a roughly equal share of cluster resources. This means that short jobs submitted while a long job is running can start receiving resources right away and still get good response times, without waiting for the long job to finish. This mode is best for multi-user settings.
It seems like jobs are not handled equally and are actually managed in FIFO order.
To give more information on the topic:
I am using Spark on YARN with the Java API. To enable fair mode, the code is:
SparkConf conf = new SparkConf();
conf.set("spark.scheduler.mode", "FAIR");
conf.setMaster("yarn-client").setAppName("MySparkApp");
JavaSparkContext sc = new JavaSparkContext(conf);
Did I miss something?
It appears that you didn't set up the pools and all your jobs end up in a single default pool as described in Configuring Pool Properties:
Specific pools’ properties can also be modified through a configuration file.
and later
A full example is also available in conf/fairscheduler.xml.template. Note that any pools not configured in the XML file will simply get default values for all settings (scheduling mode FIFO, weight 1, and minShare 0).
It can also be that you didn't set the local property that selects the pool for a given job (or jobs), as described in Fair Scheduler Pools:
Without any intervention, newly submitted jobs go into a default pool, but jobs’ pools can be set by adding the spark.scheduler.pool “local property” to the SparkContext in the thread that’s submitting them.
Finally, it can mean that you use a single default FIFO pool, so a single pool in FIFO mode changes nothing compared to FIFO without pools.
Only you can know the real answer :)
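To illustrate how the pieces fit together (shown in PySpark purely for brevity; the allocation file path is a placeholder and "production" is just the pool name used in the shipped template), FAIR mode, a pool definition, and the local property selecting the pool all have to be in place:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("fair-demo")
         .config("spark.scheduler.mode", "FAIR")
         # a copy of conf/fairscheduler.xml.template with real pool definitions
         .config("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
         .getOrCreate())

# Jobs submitted from this thread now go to the "production" pool instead of "default".
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "production")
spark.range(10000000).selectExpr("sum(id)").collect()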

How does a Spark Application work?

I am trying to implement a simple Spark SQL application that takes a query as input and processes the data. But because I need to cache the data, I have to maintain a single SQLContext object. I do not understand how I can use the same SQLContext and keep receiving queries from users.
So how does an application work? When an application is submitted to the cluster, does it keep running on the cluster, or does it perform a specific task and shut down immediately after the task?
A Spark application has a driver program that starts and configures the SparkContext. The driver program can be inside your application, and you can use the same SparkContext throughout the life of your application.
The SparkContext is thread-safe, so multiple users can use it to run jobs concurrently.
There is an open source project Zeppelin that does just that.
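A minimal sketch of that long-running pattern (the input path, table name, and console loop are illustrative; a real service would accept queries over HTTP or RPC): the driver, and therefore the cached data, stays alive for as long as the loop runs.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("interactive-sql").getOrCreate()
spark.read.parquet("/data/events").cache().createOrReplaceTempView("events")

while True:
    query = input("sql> ")
    if query.strip().lower() == "exit":
        break
    spark.sql(query).show()

spark.stop()  # the application and its cached data go away only when the driver exits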
