Best Practice For Deploying and Running Periodic Spark Job - apache-spark

I have a number of spark batch jobs each of which need to be run every x hours. I'm sure this must be a common problem but there seems to be relatively little on the internet as to what the best practice is here for setting this up. My current setup is as follows:
1. The build system (sbt) builds a tar.gz containing a fat jar plus a script that will invoke spark-submit.
2. Once tests have passed, the CI system (Jenkins) copies the tar.gz to HDFS.
3. I set up a Chronos job to unpack the tar.gz to the local filesystem and run the script that submits to Spark.
This setup works reasonably well, but there are some aspects of step 3) that I'm not fond of. Specifically:
I need a separate script (executed by Chronos) that copies the tar.gz from HDFS, unpacks it and runs the spark-submit task. As far as I can tell Chronos can't run scripts from HDFS, so I have to keep a copy of this script on every Mesos worker, which makes deployment more complex than it would be if everything just lived on HDFS.
I have a feeling that I have too many moving parts. For example, I was wondering if I could create an executable jar that could submit itself (the args would be the Spark master and the main class), in which case I could do away with at least one of the wrapper scripts. Unfortunately I haven't found a good way of doing this.
As this is a problem that everyone faces, I was wondering if anyone could suggest a better solution.

To download and extract the archive you can use the Mesos fetcher by setting the uris field in the Chronos job config.
To do the same on the executor side you can set the spark.executor.uri parameter in the default Spark conf.
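As a rough sketch of what that could look like (the job name, schedule, paths and Spark version below are placeholders, not taken from the question), a Chronos job definition can fetch the tar.gz straight from HDFS via the uris field, so no wrapper script has to live on the Mesos workers:

{
  "name": "my-spark-batch-job",
  "schedule": "R/2016-01-01T00:00:00Z/PT6H",
  "uris": ["hdfs://namenode:8020/deploy/my-spark-batch-job.tar.gz"],
  "command": "./run-spark-job.sh"
}

The Mesos fetcher downloads each URI into the task sandbox and automatically extracts recognised archive types such as .tar.gz, so the command runs against the unpacked contents. Similarly, when running Spark on Mesos, spark-defaults.conf can point executors at a Spark distribution stored on HDFS:

spark.executor.uri  hdfs://namenode:8020/spark/spark-1.6.2-bin-hadoop2.6.tgz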

Related

Using different version of hadoop client library with apache spark

I'm trying to run two or more jobs in parallel. All jobs append data under the same output path; the problem is that the first job to finish runs cleanup and erases the _temporary folder, which causes the other jobs to throw an exception.
With hadoop-client 3 there is a configuration flag, mapreduce.fileoutputcommitter.cleanup.skipped, to disable automatic cleanup of this folder.
I was able to exclude the dependencies from spark-core and add the new hadoop-client using maven. This runs fine for master=local, but I'm not convinced it is correct.
My questions are:
Is it possible to use a different hadoop-client library with apache spark (e.g. hadoop-client version 3 with apache spark 2.3), and what is the correct approach?
Is there a better way to run multiple jobs in parallel writing under the same path?
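For reference, here is a minimal PySpark sketch of one way to pass that flag through, using Spark's spark.hadoop.* prefix (which forwards properties into the Hadoop Configuration). Whether the flag actually takes effect still depends on a Hadoop 3 client being on the classpath, and the app name and output path are placeholders:

from pyspark.sql import SparkSession

# spark.hadoop.* properties are copied into the Hadoop Configuration seen by
# the output committer; the flag itself only exists in hadoop-client 3.
spark = (
    SparkSession.builder
    .appName("parallel-append-job")
    .config("spark.hadoop.mapreduce.fileoutputcommitter.cleanup.skipped", "true")
    .getOrCreate()
)

# each of the parallel jobs appends under the same base path
spark.range(1000).write.mode("append").parquet("hdfs:///data/shared/output")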

Is there a way to submit spark job on different server running master

We have a requirement to schedule spark jobs; since we are familiar with apache-airflow we want to go ahead with it to create different workflows. I searched the web but did not find a step-by-step guide to scheduling a spark job on Airflow, or an option to run it on a different server from the one running the master.
Answer to this will be highly appreciated.
Thanks in advance.
There are 3 ways you can submit Spark jobs using Apache Airflow remotely:
(1) Using SparkSubmitOperator: This operator expects that you have a spark-submit binary and YARN client config set up on your Airflow server. It invokes the spark-submit command with the given options, blocks until the job finishes and returns the final status. The good thing is, it also streams the logs from the spark-submit command's stdout and stderr.
You really only need to configure a yarn-site.xml file, I believe, in order for spark-submit --master yarn --deploy-mode client to work.
Once an Application Master is deployed within YARN, Spark runs local to the Hadoop cluster.
If you really want, you could add an hdfs-site.xml and hive-site.xml to be submitted as well from Airflow (if that's possible), but otherwise at least the hdfs-site.xml file should be picked up from the YARN container classpath.
(2) Using SSHOperator: Use this operator to run bash commands on a remote server (using the SSH protocol via the paramiko library), such as spark-submit. The benefit of this approach is that you don't need to copy the hdfs-site.xml or maintain any file.
(3) Using SimpleHTTPOperator with Livy: Livy is an open source REST interface for interacting with Apache Spark from anywhere. You just need to make REST calls.
I personally prefer SSHOperator :)
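Just to make option (2) concrete, here is a minimal DAG sketch; the connection id, schedule and spark-submit command are placeholders, and the import path is the older contrib-style one (newer Airflow releases ship SSHOperator under airflow.providers.ssh.operators.ssh):

from datetime import datetime, timedelta

from airflow import DAG
from airflow.contrib.operators.ssh_operator import SSHOperator  # airflow.providers.ssh.operators.ssh in newer releases

default_args = {
    "owner": "airflow",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="spark_submit_over_ssh",      # hypothetical DAG name
    default_args=default_args,
    start_date=datetime(2019, 1, 1),
    schedule_interval="0 */6 * * *",     # placeholder: every 6 hours
    catchup=False,
) as dag:
    submit_job = SSHOperator(
        task_id="spark_submit",
        ssh_conn_id="spark_edge_node",   # hypothetical Airflow connection to the host that has spark-submit
        command=(
            "spark-submit --master yarn --deploy-mode cluster "
            "--name my_batch_job /opt/jobs/my_batch_job.py"   # placeholder job path
        ),
    )

The SparkSubmitOperator variant looks much the same, except the task is defined with that operator's application path and Spark connection id, and spark-submit plus the YARN client config must be present on the Airflow worker itself.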

Running spark streaming forever on production

I am developing a spark streaming application which basically reads data off kafka and saves it periodically to HDFS.
I am running pyspark on YARN.
My question is more about production use. Right now, I run my application like this:
spark-submit stream.py
Imagine you are going to deliver this spark streaming application (in python) to a client; what would you do in order to keep it running forever? You wouldn't just hand over this file and say "Run this on the terminal". That's too unprofessional.
What I want to do is submit the job to the cluster (or to local processes) and never have to watch the logs on the console, or resort to a solution like linux screen to run it in the background (because that also seems too unprofessional).
What is the most professional and efficient way to permanently submit a spark-streaming job to the cluster ?
I hope I was unambiguous. Thanks!
You could use spark-jobserver, which provides a REST interface for uploading your jar and running it. You can find the documentation on the spark-jobserver project page.
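For a rough idea of the flow (the jar name, app name and class path below are placeholders, 8090 is the server's default port, and the exact endpoints are documented on the project page; note this jar-based flow targets JVM jobs):

# upload the application jar under an app name
curl --data-binary @target/scala-2.11/my-streaming-job.jar localhost:8090/jars/my-streaming-job

# start the job (optionally passing a config string in the POST body)
curl -d "" 'localhost:8090/jobs?appName=my-streaming-job&classPath=com.example.StreamingMain'

# poll the returned jobId to check status
curl localhost:8090/jobs/<jobId>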

How to modify spark source code and build

I just started learning Spark. I have imported the Spark source code into IDEA and made some small changes (just adding some println()) to the source code. What should I do to see these updates? Should I recompile Spark? Thanks!
At the bare minimum, you will need maven 3.3.3 and Java 7+.
You can follow the steps at http://spark.apache.org/docs/latest/building-spark.html
The "make-distribution.sh" script is quite handy which comes within the spark source code root directory. This script will produce a distributable tar.gz which you can simply extract and launch spark-shell or spark-submit. After making the source code changes in spark, you can run this script with the right options (mainly passing the desired hadoop version, yarn or hive support options but these are required if you want to run on top of hadoop distro, or want to connect to existing hive).
BTW, inserting println() will not be a good idea as it can severely slow down the performance of the job. You should use a logger instead.
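For example (the profile names below assume a Hadoop 2.7 / YARN / Hive build; in newer source trees the script lives under dev/, in older ones in the root directory):

# from the root of the spark source checkout
./dev/make-distribution.sh --name my-custom-spark --tgz -Phadoop-2.7 -Pyarn -Phive -Phive-thriftserver

# unpack the produced archive and try out your change
tar -xzf spark-*-bin-my-custom-spark.tgz
cd spark-*-bin-my-custom-spark
./bin/spark-shell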

How can I pass app-specific configuration to Spark workers?

I have a Spark app which uses many workers. I'd like to be able to pass simple configuration information to them easily (without having to recompile): e.g. USE_ALGO_A. If this was a local app, I'd just set the info in environment variables, and read them. I've tried doing something similar using spark-env.sh, but the variables don't seem to propagate properly.
How can I do simple runtime configuration of my code in the workers?
(PS I'm running a spark-ec2 type cluster)
You need to take care of configuring each worker.
From the Spark docs:
You can edit /root/spark/conf/spark-env.sh on each machine to set Spark configuration options, such as JVM options. This file needs to be copied to every machine to reflect the change.
If you use an Amazon EC2 cluster, there is a script that RSYNCs a directory between the master and all the workers.
The easiest way to do this is to use a script we provide called copy-dir. First edit your spark-env.sh file on the master, then run ~/spark-ec2/copy-dir /root/spark/conf to RSYNC it to all the workers.
see https://spark.apache.org/docs/latest/ec2-scripts.html
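Using the USE_ALGO_A example from the question, that would look roughly like this on a spark-ec2 cluster (the paths are the defaults used by spark-ec2, and the restart step is an assumption to make sure the daemons pick up the new environment):

# on the master
echo 'export USE_ALGO_A=1' >> /root/spark/conf/spark-env.sh
~/spark-ec2/copy-dir /root/spark/conf

# restart the standalone daemons so the new environment is inherited by executors
/root/spark/sbin/stop-all.sh
/root/spark/sbin/start-all.sh

The code running on the workers can then read the value with an ordinary environment lookup (e.g. sys.env.get("USE_ALGO_A") in Scala or os.environ.get("USE_ALGO_A") in Python) inside the functions executed on the executors.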
