Run both batch and real time jobs on Spark with jobserver - apache-spark

I have a Spark job that runs every day as part of a pipeline and performs simple batch processing - let's say, adding a column to a DataFrame with another column's value squared (old DF: x; new DF: x, x^2).
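For concreteness, the batch step is roughly the following kind of transformation (a minimal PySpark sketch; the paths and column names here are just placeholders):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("square-column-batch").getOrCreate()

df = spark.read.parquet("s3://my-bucket/input/")              # hypothetical input
df.withColumn("x_squared", col("x") * col("x")) \
  .write.mode("overwrite").parquet("s3://my-bucket/output/")  # hypothetical output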
I also have a front-end app that consumes these two columns.
I want to allow my users to edit x and get the answer from the same code base.
Since the batch job is already written in Spark, I looked for a way to achieve this against my Spark cluster and ran into Spark Jobserver, which I thought might help here.
My questions:
Can Spark Jobserver support both batch and single-request processing?
Can I use the same Jobserver-compatible JAR to run a Spark job on AWS EMR?
I'm open to hearing about other tools that can help with such a use case.
Thanks!

Not sure I understood your scenario fully, but with Spark Jobserver you can configure your batch jobs and pass different parameters to them.
Yes, once you have a Jobserver-compatible JAR, you should be able to use it with Jobserver running against Spark in standalone mode, on YARN, or on EMR. But please take into account that you will need to set up Jobserver on EMR yourself; the open-source documentation currently seems to be a bit outdated.
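To make that a bit more concrete, here is a hedged sketch of driving the same Jobserver-compatible JAR in two ways - the nightly batch and a synchronous single request from the front-end - through Jobserver's REST API. The host, app name, class path and config keys (mode, input.x) below are assumptions, not something from the question:

import requests

JOBSERVER = "http://jobserver-host:8090"   # assumed Jobserver address
APP = "squares"                            # name the JAR was uploaded under
JOB_CLASS = "com.example.SquareJob"        # hypothetical job class inside the JAR

# One-time upload of the JAR (can also be done with curl).
with open("squares-assembly.jar", "rb") as jar:
    requests.post(f"{JOBSERVER}/jars/{APP}", data=jar)

# Nightly batch run: fire and forget, poll /jobs/<id> later if needed.
requests.post(
    f"{JOBSERVER}/jobs",
    params={"appName": APP, "classPath": JOB_CLASS},
    data="mode = batch",                   # HOCON config string passed to the job
)

# Single request from the front-end: sync=true blocks until the result is ready.
single = requests.post(
    f"{JOBSERVER}/jobs",
    params={"appName": APP, "classPath": JOB_CLASS, "sync": "true"},
    data="input.x = 5",
)
print(single.json())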

Related

How to use Airflow to restart a failed structured streaming spark job?

I need to run a structured streaming Spark job on AWS EMR. As a resilience requirement, if the Spark job fails for some reason, we would like it to be recreated in EMR. This is similar to task orchestration in ECS, which can restart a task if its health check fails. However, EMR is more of a compute engine than an orchestration system.
I am looking for a big data workflow orchestration tool, such as Airflow. However, it cannot support cycles in a DAG. How can I implement something like the following?
step_adder (EmrAddStepsOperator) >> step_checker (EmrStepSensor) >> step_adder (EmrAddStepsOperator)
What is the suggested way to improve job-level resilience like this? Any comments are welcome!
Some of the resilience is already covered by Apache Spark itself (for jobs submitted with spark-submit); however, when you want to interact with processes that are not within Spark, Airflow might be a solution. In your case, a Sensor can help detect whether a certain condition has happened or not, and based on that you can decide what to do in the DAG. Here is a simple HttpSensor that waits for a batch job to see whether it has successfully finished:
# Airflow 2.x import; on Airflow 1.10 use: from airflow.sensors.http_sensor import HttpSensor
from airflow.providers.http.sensors.http import HttpSensor

wait_batch_to_finish = HttpSensor(
    http_conn_id='spark_web',
    task_id="wait_batch_to_finish",
    method="GET",
    headers={"Content-Type": "application/json"},
    endpoint="/json",
    # check_spark_status is a user-defined callable (see the sketch below);
    # the XCom value identifies the application submitted by batch_intel_task.
    response_check=lambda response: check_spark_status(response, "{{ ti.xcom_pull('batch_intel_task')}}"),
    poke_interval=60,
    dag=dag
)
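The answer references a check_spark_status helper without defining it. A minimal sketch of what it could look like, assuming the spark_web connection points at a Spark standalone master, /json is its status endpoint, and the XCom value is the submitted application's name - all assumptions to adapt to your setup:

def check_spark_status(response, app_name):
    """Return True once the application has left the master's active list."""
    payload = response.json()
    active = {app.get("name") for app in payload.get("activeapps", [])}
    completed = {app.get("name") for app in payload.get("completedapps", [])}
    # Succeed only when the app is no longer active and shows up as completed.
    return app_name not in active and app_name in completed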

Apache Nifi - Submitting Spark batch jobs through Apache Livy

I want to schedule my Spark batch jobs from NiFi. I can see there is an ExecuteSparkInteractive processor which submits Spark jobs to Livy, but it executes code provided in a property or taken from the content of the incoming flow file. How should I schedule my Spark batch jobs from NiFi, and also take different actions depending on whether the batch job fails or succeeds?
You could use ExecuteProcess to run a spark-submit command.
But what you seem to be looking for is not a dataflow management tool, but a workflow manager. Two great examples of workflow managers are Apache Oozie and Apache Airflow.
If you still want to use NiFi to schedule Spark jobs, you can use the GenerateFlowFile processor as the scheduled trigger (on the primary node so it won't be scheduled twice - unless you want that), connect it to the ExecuteProcess processor, and make it run the spark-submit command.
For a slightly more complex workflow, I've written an article about it :)
Hope it helps.
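For completeness, since the question specifically mentions Livy: besides the interactive sessions that ExecuteSparkInteractive uses, Livy also exposes a /batches REST endpoint that submits a JAR directly, which you could call from NiFi (e.g. via InvokeHTTP) or from any script. A hedged sketch - the Livy address, JAR path and class name are placeholders:

import time
import requests

LIVY = "http://livy-host:8998"

# Submit the batch job; the JAR must be reachable from the cluster.
resp = requests.post(
    f"{LIVY}/batches",
    json={
        "file": "hdfs:///jobs/my-batch-job.jar",   # hypothetical JAR location
        "className": "com.example.BatchJob",       # hypothetical main class
        "args": ["2024-01-01"],
    },
    headers={"Content-Type": "application/json"},
)
batch_id = resp.json()["id"]

# Poll until the batch reaches a terminal state, then branch on success/failure.
state = "starting"
while state not in ("success", "dead", "killed"):
    time.sleep(30)
    state = requests.get(f"{LIVY}/batches/{batch_id}/state").json()["state"]
print("Batch finished with state:", state)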

Can we submit a spark job from spark itself (may be from another spark program)?

Can anyone clarify a question I was asked in one of my interviews? It could be that the question itself is wrong, I am not sure. However, I have searched everywhere and could not find anything related to it. The question is:
Can we run a Spark job from another Spark program?
Yup, you are right, it doesn't make much sense. We can run our application from our driver program, but that is the same as launching it from any application using Spark Launcher https://github.com/phalodi/Spark-launcher . The exception is that we can't run an application inside RDD closures, because they run on worker nodes, so it will not work.
Can we run a Spark job from another Spark program?
I'd focus on another part of the question, since the following holds for any Spark program:
Can we run a Spark job from any Spark program?
That means that either there was a follow-up question or some introduction to the one you were asked.
If I were you and heard the question, I'd say "Yes, indeed!"
A Spark application is, in other words, a launcher of Spark jobs, and the only reason to have a Spark application is to run Spark jobs sequentially or in parallel.
Any Spark application does this and nothing more.
A Spark application is a Scala application (when Scala is the programming language), and as such it is possible to run a Spark program from another Spark program (whether it makes sense in general I put aside, as there could be conflicts with multiple SparkContexts per single JVM).
Given that the other Spark application is a separate executable JAR, you could launch it using the Process API in Scala (as you would any other application):
scala.sys.process - This package handles the execution of external processes.
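The answer refers to Scala's Process API; since the other examples on this page are in Python, here is the same idea sketched with Python's subprocess instead - the driver of one application simply shells out to spark-submit for the other. The JAR path and main class are placeholders:

import subprocess

# Launch the other Spark application as an external process and wait for it.
exit_code = subprocess.call([
    "spark-submit",
    "--class", "com.example.OtherSparkApp",   # hypothetical main class
    "--master", "yarn",
    "/path/to/other-spark-app.jar",           # hypothetical JAR
])
print("Child Spark application finished with exit code", exit_code)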

How can we set the execution parameters for an apache spark application

We have set up a multi-node cluster with 4 nodes for testing the Spark application.
Each node has 250 GB RAM and 48 cores.
We run the master on one node and the other 3 as slaves.
We have developed a Spark application using Scala.
We use the spark-submit option to run the job.
Now here is the point where we are stuck and need more clarification to proceed.
Query 1:
Which is the best option to run a Spark job:
a) Spark as master
b) YARN as master
and what is the difference?
Query 2:
While running any Spark job we can provide options like the number of executors, number of cores, executor memory, etc.
Could you please advise what the optimal values for these parameters would be for better performance in my case?
Any help would be very much appreciated, since it would be helpful for anyone starting with Spark :)
Thanks!
Query 1: YARN is a better resource manager and supports more features than the Spark standalone master. For more, you can visit
Apache Spark Cluster Managers
Query 2: You can only assign resources at the time of job initialization. There are command-line flags available for this. Also, if you don't wish to pass command-line flags with spark-submit, you can set them when creating the Spark configuration in the code.
You can see the available flags using
spark-submit --help
For more information, visit Spark Configuration.
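As a hedged sketch of the in-code alternative mentioned above, here is how the same resources could be set when building the session in PySpark; the numbers are placeholders to tune for your data and cluster, not a recommendation:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("my-batch-app")
    .config("spark.executor.instances", "11")   # placeholder values
    .config("spark.executor.cores", "5")
    .config("spark.executor.memory", "36g")
    .config("spark.driver.memory", "8g")
    .getOrCreate()
)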
Choosing resources mainly depends on the size of the data you want to process and the complexity of the problem.
Please visit 5 mistakes to avoid while writing Spark applications

Is spark or spark with mesos the easiest to start with?

If I want a simple setup that gives me a quick start: would a combination of Apache Spark and Mesos be the easiest? Or would Apache Spark alone be better - for example, because Mesos would add complexity given what it does, or because Mesos does so many things that it would be harder to deal with than Spark alone, etc.?
All I want is to be able to submit jobs and manage the cluster and jobs easily, nothing fancy for now. Is Spark or Spark/Mesos better, or something else?
The easiest way to start using Spark is to start a standalone Spark cluster on EC2.
It is as easy as running a single script - spark-ec2 - and it will do the rest for you.
The only case when a standalone cluster may not suit you is if you want to run more than a single Spark job at a time (at least that was the case with Spark 1.1).
For me personally, the standalone Spark cluster was good enough for a long time when I was running ad-hoc jobs - analyzing the company's logs on S3 and learning Spark - and then destroying the cluster.
If you want to run more than one Spark job at a time, I would go with Mesos.
An alternative would be to install CDH from Cloudera, which is relatively easy (they provide install scripts and instructions) and is available for free.
CDH would provide you with powerful tools to manage the cluster.
When using CDH to run Spark, it uses YARN, and we run into one issue or another from time to time with Spark on YARN.
The main disadvantage for me is that CDH provides its own build of Spark, so it is usually one minor version behind, which is a lot for such a rapidly progressing project as Spark.
So I would try Mesos for running Spark if I need to run more than one job at a time.
Just for completeness, Hortonworks provides a downloadable HDP sandbox VM and also supports Spark on HDP. It is a good starting point as well.
Additionally, you can spin up your own cluster. I do this on my laptop - not for real big data use cases, but for learning with a moderate amount of data.
import subprocess as s
from time import sleep

# Path to spark-class.cmd in the local Spark installation (Windows).
cmd = "D:\\spark\\spark-1.3.1-bin-hadoop2.6\\spark-1.3.1-bin-hadoop2.6\\spark-1.3.1-bin-hadoop2.6\\bin\\spark-class.cmd"
master = "org.apache.spark.deploy.master.Master"
worker = "org.apache.spark.deploy.worker.Worker"
masterUrl = "spark://BigData:7077"

masterProcess = [cmd, master]
workerProcess = [cmd, worker, masterUrl]
noWorker = 3

# Start the master, give it a moment to come up, then start the workers.
pMaster = s.Popen(masterProcess)
sleep(3)
pWorkers = []
for i in range(noWorker):
    pw = s.Popen(workerProcess)
    pWorkers.append(pw)
The code above starts a master and 3 workers, which I can monitor using the UI. This is just to get going if you need a quick local setup.
