Task/Job Failed on Mesos - apache-spark

Does anyone know why every task/job that I run using spark-submit on my Apache Mesos cluster always ends in state FAILED? For your information, I am using the Spark Mesos dispatcher to run in cluster mode and spark-submit to submit the job.
Could this be the effect of too little total disk? If so, how do I increase the disk capacity? Or is there another problem?

Related

What if the driver in a Spark job fails?

I am exploring the Spark job recovery mechanism and have a few queries related to it:
How does Spark recover from a driver node failure?
How does it recover from executor node failures?
What are the ways to handle such scenarios?
Driver node failure: If the driver node running our Spark application goes down, the Spark session details are lost, and all the executors, along with their in-memory data, are lost as well. If we restart the application, the getOrCreate() method will reinitialize the Spark session from the checkpoint directory and resume processing.
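A minimal sketch of this checkpoint-based recovery, assuming a Spark Streaming application; the checkpoint path, application name, and batch interval are illustrative:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///checkpoints/recoverable-app"   // illustrative path

// Builds a fresh context on first run; never called on recovery.
def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("RecoverableApp")   // illustrative name
  val ssc  = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir)
  // ... define input streams and transformations here ...
  ssc
}

// On a clean start this calls createContext(); after a driver restart it
// rebuilds the context, and the pending work, from the checkpoint directory.
val ssc = StreamingContext.getOrCreate(checkpointDir, () => createContext())
ssc.start()
ssc.awaitTermination()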
On most cluster managers, Spark does not automatically relaunch the driver if it crashes, so we need to monitor it using a tool like monit and restart it. The best way to do this is specific to the environment. One place where Spark provides more support is the Standalone cluster manager, which supports a --supervise flag when submitting the driver that lets Spark restart it. We also need to pass --deploy-mode cluster so the driver runs within the cluster rather than on the local machine, like:
./bin/spark-submit --deploy-mode cluster --supervise --master spark://... App.jar
Important point: when the driver crashes, Spark's executors are also restarted.
Executor node failure: Any of the worker nodes running an executor can fail, resulting in the loss of that executor's in-memory data.
For the failure of an executor node, Spark Streaming uses the same techniques as core Spark for fault tolerance. All data received from external sources is replicated among the worker nodes. All RDDs created through transformations of this replicated input data are tolerant to the failure of a worker node, because the RDD lineage allows the system to recompute the lost data from the surviving replica of the input data.
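The replication described here happens automatically for receiver-based streaming input; a manual equivalent for a plain RDD, as a sketch with an illustrative path and assuming a spark-shell sc binding, uses a replicated storage level:

import org.apache.spark.storage.StorageLevel

// Keep two in-memory copies of the input on different workers; if the node
// holding one copy fails, the surviving copy plus the RDD lineage lets Spark
// recompute any lost downstream partitions.
val input = sc.textFile("hdfs:///data/events")             // illustrative path
val replicated = input.persist(StorageLevel.MEMORY_ONLY_2) // 2 replicas
val lineLengths = replicated.map(_.length)                 // recomputable via lineage
println(lineLengths.sum())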
I hope the points above also cover the third question.

Should slave nodes be launched/started separately on Amazon EMR server?

I have just launched an Amazon Elastic MapReduce cluster after hitting java.lang.OutOfMemoryError: Java heap space while fetching 120 million rows from a database in PySpark. I have 1 master and 2 slave nodes running, each with 4 cores and 8 GB RAM.
I am trying to load a massive dataset from a MySQL database (containing approx. 120M rows). The query loads fine, but when I do a df.show() operation or try to perform operations on the Spark DataFrame, I get errors like:
org.apache.spark.SparkException: Job 0 cancelled because SparkContext was shut down
Task 0 in stage 0.0 failed 1 times; aborting job
java.lang.OutOfMemoryError: GC overhead limit exceeded
My questions are:
When I SSH into the Amazon EMR master and run htop, I see that 5 GB out of 8 GB is already in use. Why is this?
On the Amazon EMR portal, I can see that the master and slave servers are running. I'm not sure if the slave servers are being used or if it's just the master doing all the work. Do I have to separately launch or "start" the 2 slave nodes, or does Spark do that automatically? If so, how do I do this?
If you are running Spark in local mode (local[*]) from the master, it will only use the master node.
How are you submitting the Spark job?
Use YARN cluster or client mode when submitting the Spark job to use resources efficiently.
Read more on YARN cluster vs client.
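For example, a cluster-mode submission for this setup might look like the following; the resource values and file name are illustrative, not a recommendation:

./bin/spark-submit --master yarn --deploy-mode cluster \
  --num-executors 2 --executor-cores 3 --executor-memory 4g \
  my_job.py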
The master node runs all the other services like Hive, MySQL, etc. Those services may be taking the 5 GB of RAM if you aren't using standalone mode.
In the YARN UI (http://<master-public-dns>:8088) you can check in more detail what other containers are running.
You can check where your Spark driver and executors are running in the Spark UI (http://<master-public-dns>:18080).
Select your job and go to the Executors section; there you will find the machine IP of each executor.
Enable Ganglia in EMR, or check the CloudWatch EC2 metrics, to see each machine's utilization.
Spark doesn't start or terminate nodes.
If you want to scale your cluster depending on job load, apply an autoscaling policy to the CORE or TASK instance group.
But you need at least 1 CORE node always running.

Kafka Spark Streaming

I was trying to build a Kafka and Spark Streaming use case, in which Spark Streaming consumes a stream from Kafka, and we enrich the stream and store the enriched stream into some target system.
My question here is: does it make sense to run the Spark Streaming job in yarn-cluster or yarn-client mode? (Hadoop is not involved here.)
I think the Spark Streaming job should run only in local mode, but another question is how to improve the performance of a Spark Streaming job.
Thanks,
local[*]
This runs the job in local mode.
Usually we use this for POCs and on very small data.
You can debug the job to understand how each line of code works.
But be aware that since the job runs on your local machine, you cannot get the most out of Spark's distributed architecture.
yarn-client
Your driver program runs on the YARN client, i.e., the machine where you type the command to submit the Spark application. But the tasks are still executed on the executors.
yarn-cluster
In cluster mode, the Spark driver runs inside an application master process managed by YARN on the cluster, and the client can go away after initiating the application. This is the best way of running the Spark job to benefit from the advantages a cluster manager provides.
I hope this gives you clarity on how you may want to deploy your Spark job.
In fact, Spark provides very clean documentation explaining various deployment strategies with examples:
https://spark.apache.org/docs/latest/running-on-yarn.html
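For illustration, the deployment mode is just a submit-time flag; the class name and jar below are placeholders:

./bin/spark-submit --master "local[*]" --class com.example.StreamApp app.jar                   # everything in one local JVM
./bin/spark-submit --master yarn --deploy-mode client --class com.example.StreamApp app.jar    # driver on the submitting host
./bin/spark-submit --master yarn --deploy-mode cluster --class com.example.StreamApp app.jar   # driver inside YARN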
The difference is that with yarn-client you force the Spark job to use the host where you run spark-submit as the driver, whereas in yarn-cluster the driver won't land on the same host every time you run it.
So the best choice is to always use yarn-cluster, to avoid overloading the same host if you are going to submit multiple jobs from that host.

Zeppelin persists job in YARN

When I run a Spark job from Zeppelin, the job finishes successfully, but it stays in YARN in the RUNNING state.
The problem is that the job keeps holding resources in YARN. I think Zeppelin keeps the job alive in YARN.
How can I resolve this problem?
Thank you
There are two solutions.
The quick one is to use the "restart interpreter" functionality, which is misnamed, since it merely stops the interpreter, in this case the Spark job running in YARN.
The elegant one is to configure Zeppelin to use dynamic allocation with Spark. In that case the YARN application master will continue running, and with it the Spark driver, but all executors (which are the real resource hogs) can be freed by YARN when they're not in use.
The easiest and most straightforward solution is to restart the Spark interpreter.
But as Rick mentioned, if you use Spark dynamic allocation, the additional step of enabling the Spark shuffle service on all agent nodes is required (this is disabled by default).
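A sketch of the relevant configuration, with illustrative values; in Zeppelin these would go into the Spark interpreter settings (or spark-defaults.conf):

spark.dynamicAllocation.enabled                true
spark.shuffle.service.enabled                  true
spark.dynamicAllocation.minExecutors           0
spark.dynamicAllocation.maxExecutors           10
spark.dynamicAllocation.executorIdleTimeout    60s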
Just close your Spark context so that the Spark job gets the status FINISHED.
Your memory should then be released.
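In a Zeppelin paragraph this is a one-liner, assuming the interpreter's standard sc binding:

sc.stop()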

How can we set the execution parameters for an Apache Spark application

We have set up a multinode cluster with 4 nodes for testing the Spark application.
Each node has 250 GB RAM and 48 cores.
The master runs on one node and the other 3 run as slaves.
We have developed a Spark application using Scala.
We use the spark-submit option to run the job.
Now here is the point where we are stuck and need more clarification to proceed.
Query 1:
Which is the better option to run a Spark job:
a) Spark as master
b) YARN as master
and what is the difference?
Query 2:
While running any Spark job we can provide options like the number of executors, number of cores, executor memory, etc.
Could you please advise what the optimal values for these parameters would be for better performance in my case?
Any help would be very much appreciated, since it would be helpful for anyone starting with Spark :)
Thanks!
Query 1: YARN is a better resource manager and supports more features than the Spark standalone master. For more you can visit
Apache Spark Cluster Managers
Query 2: You can only assign resources at the time of job initialization. There are command-line flags available. Also, if you don't wish to pass command-line flags with spark-submit, you can set them when creating the Spark configuration in code.
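A minimal sketch of setting these when building the configuration in Scala; the application name and resource values are illustrative, not a tuning recommendation:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Equivalent to the --num-executors / --executor-cores / --executor-memory
// flags of spark-submit.
val conf = new SparkConf()
  .setAppName("MyApp")                        // hypothetical app name
  .set("spark.executor.instances", "6")
  .set("spark.executor.cores", "8")
  .set("spark.executor.memory", "32g")

val spark = SparkSession.builder().config(conf).getOrCreate()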
You can see the available flags using
spark-submit --help
For more information visit Spark Configuration.
Choosing resources depends mainly on the size of the data you want to process and the complexity of the problem.
Please visit 5 Mistakes to Avoid While Writing Spark Applications.
