I am trying to build a Kafka and Spark Streaming use case in which Spark Streaming consumes a stream from Kafka, and we enrich the stream and store the enriched stream in some target system.
My question is: does it make sense to run the Spark Streaming job in yarn-cluster or yarn-client mode? (Hadoop is not involved here.)
My current thinking is that a Spark Streaming job should run only in local mode, but a further question is how to improve the performance of a Spark Streaming job.
Thanks,
local[*]
This is specific to running the job in local mode.
We usually use this for POCs and on very small data sets.
You can debug the job to understand how each line of code works.
But be aware that since the job runs on your local machine, you cannot get the most out of Spark's distributed architecture.
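For illustration, a minimal PySpark sketch of local mode (the app name is just a placeholder):

from pyspark.sql import SparkSession

# local[*] runs the driver and executors in a single local JVM,
# with as many worker threads as the machine has cores.
spark = (SparkSession.builder
         .master("local[*]")
         .appName("local-poc")  # hypothetical app name
         .getOrCreate())

print(spark.range(10).count())  # small sanity check on local data
spark.stop()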
yarn-client
Your driver program runs on the YARN client, i.e. the machine where you type the command to submit the Spark application. But the tasks are still executed on the executors.
yarn-cluster
In cluster mode, the Spark driver runs inside an application master process managed by YARN on the cluster, and the client can go away after initiating the application. This is the best way to run a Spark job if you want to benefit from the advantages provided by a cluster manager.
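As a rough sketch of how the modes surface in code: setting the master programmatically only ever gives you client mode; cluster mode has to be requested at submit time. This assumes HADOOP_CONF_DIR / YARN_CONF_DIR point at your cluster configuration.

from pyspark.sql import SparkSession

# Setting the master in code keeps the driver on this machine
# (client mode). Cluster mode can only be requested via
# spark-submit --master yarn --deploy-mode cluster app.py
spark = (SparkSession.builder
         .master("yarn")
         .appName("yarn-client-sketch")  # hypothetical app name
         .getOrCreate())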
I hope this gives you clarity on how you may want to deploy your Spark job.
In fact, Spark provides very clean documentation explaining the various deployment strategies, with examples:
https://spark.apache.org/docs/latest/running-on-yarn.html
The difference is that with yarn-client you force the Spark job to use the host where you run spark-submit as the driver, whereas in yarn-cluster the driver will not land on the same host every time you run it.
So the best choice is to always use yarn-cluster, to avoid overloading a single host if you are going to submit multiple jobs from that host with yarn-client.
I have set up a Dataproc cluster to run my Spark jobs on. I have just set up the cluster and have not started any Spark session yet. Still, I see Spark, MapReduce, YARN, and other processes in my top command output. What is that about? Shouldn't the Spark processes start only after I have started the SparkSession with the configurations of my choice?
These are background daemons that run and monitor the Hadoop and Spark ecosystem and wait for you to submit a request or program to run. They need to be up and running before you can run a Spark app. This is pretty normal on Linux.
I am exploring Spark's job recovery mechanism and I have a few queries related to it:
How does Spark recover from a driver node failure?
How does it recover from executor node failures?
What are the ways to handle such scenarios?
Driver node failure: If the driver node running our Spark application goes down, the SparkSession details are lost, and all the executors with their in-memory data are lost as well. If we restart the application, the getOrCreate() method will reinitialize the SparkSession from the checkpoint directory and resume processing, as in the sketch below.
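A minimal sketch of that pattern for a DStream application, assuming a hypothetical checkpoint directory:

from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

CHECKPOINT_DIR = "hdfs:///tmp/app-checkpoint"  # hypothetical path

def create_context():
    # Called only when no checkpoint exists yet (first run).
    sc = SparkContext(conf=SparkConf().setAppName("recoverable-stream"))
    ssc = StreamingContext(sc, 10)  # 10-second batches
    ssc.checkpoint(CHECKPOINT_DIR)
    # ... define the input DStream and transformations here ...
    return ssc

# After a driver failure and restart, this rebuilds the context
# from the checkpoint instead of calling create_context() again.
ssc = StreamingContext.getOrCreate(CHECKPOINT_DIR, create_context)
ssc.start()
ssc.awaitTermination()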
On most cluster managers, Spark does not automatically relaunch the driver if it crashes, so you need to monitor it with a tool like monit and restart it. The best way to do this is probably specific to your environment. One place where Spark provides more support is the Standalone cluster manager, which supports a --supervise flag when submitting the driver that lets Spark restart it. You will also need to pass --deploy-mode cluster so that the driver runs within the cluster rather than on your local machine, like:
./bin/spark-submit --deploy-mode cluster --supervise --master spark://... App.jar
Important point: when the driver crashes, Spark's executors will also be restarted.
Executor node failure: Any of the worker nodes running executors can fail, resulting in a loss of in-memory data.
For the failure of an executor node, Spark Streaming uses the same techniques as core Spark for its fault tolerance. All the data received from external sources is replicated among the worker nodes, so all RDDs created through transformations of this replicated input data are tolerant to the failure of a worker node: the RDD lineage allows the system to recompute the lost data all the way from the surviving replica of the input data. A sketch of the replicated storage level follows.
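For example, receiver-based input streams accept a storage level, and the replicated levels are what make this recomputation possible. A sketch, assuming the ssc from the previous snippet and a hypothetical socket source:

from pyspark import StorageLevel

# The _2 suffix keeps two replicas of each received block across
# worker nodes, so a surviving replica exists if one executor fails.
lines = ssc.socketTextStream("stream-host", 9999,  # hypothetical source
                             StorageLevel.MEMORY_AND_DISK_2)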
I hope the points above also cover your third question.
I want to use Airflow to orchestrate jobs that include running some Pig scripts, shell scripts, and Spark jobs.
For the Spark jobs in particular, I want to use Apache Livy, but I am not sure whether that is a good idea or whether I should run spark-submit instead.
What is the best way to track a Spark job from Airflow after I have submitted it?
My assumption is that you have an application JAR containing Java / Scala code that you want to submit to a remote Spark cluster. Livy is arguably the best option for remote spark-submit when evaluated against the other possibilities (a submission sketch follows the list):
Specifying remote master IP: Requires modifying global configurations / environment variables
Using SSHOperator: SSH connection might break
Using EmrAddStepsOperator: Dependent on EMR
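For reference, a sketch of a remote submit through Livy's batch API; the endpoint, JAR location, and class name are placeholders:

import requests

LIVY_URL = "http://livy-host:8998"  # hypothetical Livy endpoint

# POST /batches is Livy's remote equivalent of spark-submit.
payload = {
    "file": "s3://my-bucket/app.jar",  # hypothetical JAR location
    "className": "com.example.App",    # hypothetical main class
    "args": ["--date", "2020-01-01"],  # hypothetical arguments
}
batch = requests.post(f"{LIVY_URL}/batches", json=payload).json()
print(batch["id"], batch["state"])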
Regarding tracking
Livy only reports state and not progress (% completion of stages)
If you're OK with that, you can just poll the Livy server via its REST API and keep printing logs to the console; those will appear in the task logs in the web UI (View Logs).
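A polling sketch against that REST API; the endpoint and batch id are placeholders:

import time
import requests

LIVY_URL = "http://livy-host:8998"  # hypothetical Livy endpoint
batch_id = 42  # id returned by the POST /batches call

# Poll until Livy reports a terminal state; the prints end up
# in the Airflow task log.
while True:
    state = requests.get(f"{LIVY_URL}/batches/{batch_id}/state").json()["state"]
    print(f"Livy batch {batch_id} state: {state}")
    if state in ("success", "dead", "killed"):
        break
    time.sleep(30)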
Other considerations
Livy doesn't support reusing a SparkSession across POST /batches requests
If that's imperative, you'll have to write your application code in PySpark and use POST /sessions requests, as sketched below
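A sketch of that session flow, which keeps one SparkSession alive across statements (the endpoint is a placeholder):

import time
import requests

LIVY_URL = "http://livy-host:8998"  # hypothetical Livy endpoint

# POST /sessions starts a long-lived interactive session whose
# SparkSession is reused by every statement sent to it.
session = requests.post(f"{LIVY_URL}/sessions", json={"kind": "pyspark"}).json()
session_id = session["id"]

# Wait until the session is idle, then run code inside it.
while requests.get(f"{LIVY_URL}/sessions/{session_id}").json()["state"] != "idle":
    time.sleep(5)

stmt = requests.post(f"{LIVY_URL}/sessions/{session_id}/statements",
                     json={"code": "spark.range(10).count()"}).json()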
References
How to submit Spark jobs to EMR cluster from Airflow?
livy/examples/pi_app
rssanders3/livy_spark_operator_python_example
Useful links
Remote spark-submit to YARN running on EMR
I have a Spark cluster in YARN mode and some batch and streaming jobs to run. I found that Spring XD can process batch and stream jobs with its embedded Spark.
My question is: can Spring XD use an external Spark cluster in YARN mode? How can I do that? Thanks a lot.
You can't launch a Spark job in YARN mode with Spring XD; you can only launch locally or on a Spark cluster using a master URL. With Spring Cloud Data Flow you can launch a spark-yarn task; see http://docs.spring.io/spring-cloud-task-app-starters/docs/1.0.1.RELEASE/reference/html/_spark_yarn_task.html
I would like to run multiple Spark jobs on my Mesos cluster and have all the Spark jobs share the same Spark framework. Is this possible?
I have tried running the MesosClusterDispatcher and having the Spark jobs connect to the dispatcher, but each Spark job launches its own "Spark Framework" (I have tried both client mode and cluster mode).
Is this the expected behaviour?
Is it possible to share the same spark-framework among multiple spark jobs?
It is normal, and it is the expected behaviour.
In Mesos, as far as I know, the MesosClusterDispatcher is in charge of allocating resources for your Spark driver, which will act as a framework. Once the Spark driver has been allocated, it is responsible for talking to Mesos and accepting offers to allocate the executors where the tasks will be executed.