I'm trying to set up a Spark cluster and I've come across an annoying bug...
When I submit a Spark application, it runs fine on the workers until I kill one (for example by running stop-slave.sh on the worker node).
When the worker is killed, Spark then tries to relaunch an executor on an available worker node, but it fails every time (I know because the web UI either displays FAILED or LAUNCHING for the executor; it never succeeds).
I can't seem to find any help, even in the documentation, so can someone confirm that Spark can and will try to relaunch the executor on an available node if a worker is killed (either on the node where the worker previously ran, or on another available node if that node is unreachable)?
Here's the output from the worker node:
Spark worker error
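In case it's relevant, the only knob I've found that looks related is the standalone master's executor-retry limit; below is a minimal sketch of how I'd raise it (the master URL, app name, and value are just illustrative assumptions):

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("relaunch-test")                          # illustrative app name
        .master("spark://master-host:7077")                # assumed standalone master URL
        # allow more consecutive executor failures before the master gives up on the app
        .config("spark.deploy.maxExecutorRetries", "20")
        .getOrCreate()
    )

I realize this only changes how many consecutive failures the master tolerates; it doesn't explain why the relaunched executor keeps failing in the first place.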
Thank you for your help!
Related
I am running Spark Thrift Server on EMR (6.6), with managed scaling enabled.
From time to time we have SQL queries that get stuck for a long time (45 minutes) until a new request comes to the server and releases them.
When that happens, we see that there is one executor on a task node that EMR has asked to kill.
What could be the reason for such behavior? How could it be avoided?
It turned out that AWS has a feature that prevents Spark from sending tasks to executors that run on a DECOMMISSIONING node.
In our case, we have min-executors = 1 and the last remaining executor was on a DECOMMISSIONING node, so Spark did not send it any tasks, but it also did not request new resources because it still had that executor.
So that seems to be an EMR bug.
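For reference, here is a rough sketch of the settings involved; the values are illustrative assumptions, and disabling the decommissioning blacklist has its own trade-offs (tasks may then fail on nodes that really are going away):

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("thrift-decommissioning-sketch")                 # illustrative name
        .config("spark.dynamicAllocation.enabled", "true")
        .config("spark.dynamicAllocation.minExecutors", "2")      # keep more than one executor alive
        # EMR-specific flag for the behavior described above; disabling it is a workaround, not a fix
        .config("spark.blacklist.decommissioning.enabled", "false")
        .getOrCreate()
    )

In practice these would normally go into the Thrift server's Spark configuration (spark-defaults / --conf at startup) rather than application code.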
I have a Spark job that periodically hangs, leaving my AWS EMR cluster in a state where an application is RUNNING but the cluster is really stuck. I know that if my job doesn't get stuck, it'll finish in 5 hours or less. If it's still running after that, it's a sign that the job is stuck. YARN and the Spark UI are still responsive; it's just that an executor gets stuck on a task.
Background: I'm using an ephemeral EMR cluster that performs only one step before terminating, so it's not a problem to kill it off if I notice this job is hanging.
What's the easiest way to kill the task, job, or cluster in this case? Ideally this would not involve setting up some extra service to monitor the job; ideally there would be some kind of Spark / YARN / EMR setting I could use.
Note: I've tried using Spark speculation to unblock the stuck Spark job, but that doesn't help.
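For concreteness, this is roughly what I mean by speculation (the exact values here are illustrative, not necessarily what I used):

    from pyspark import SparkConf

    conf = (
        SparkConf()
        .set("spark.speculation", "true")
        .set("spark.speculation.interval", "1s")       # how often to check for slow tasks
        .set("spark.speculation.multiplier", "1.5")    # how much slower than the median counts as slow
        .set("spark.speculation.quantile", "0.75")     # fraction of tasks that must finish before checking
    )

Speculation re-runs slow tasks elsewhere, but in my case the stuck executor keeps the job hanging anyway.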
EMR has a Bootstrap Actions feature that lets you run scripts when the cluster is being initialized. I've used this feature with a startup script that monitors how long the cluster has been online and terminates the cluster after a certain time.
I use a script based on this one for the bootstrap action: https://github.com/thomhopmans/themarketingtechnologist/blob/master/6_deploy_spark_cluster_on_aws/files/terminate_idle_cluster.sh
Basically, make a script that checks /proc/uptime to see how long the EC2 machine has been online; once the uptime surpasses your time limit, you can send a shutdown command to the cluster.
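The linked script is a shell script; purely as an illustration of the same idea, a rough Python sketch might look like this (the time limit and the shutdown command are assumptions, adapt them to your setup):

    import subprocess
    import time

    MAX_UPTIME_SECONDS = 5.5 * 60 * 60   # assumed limit, a bit above the expected 5h runtime

    def uptime_seconds():
        # /proc/uptime: the first field is seconds since the machine booted
        with open("/proc/uptime") as f:
            return float(f.read().split()[0])

    while True:
        if uptime_seconds() > MAX_UPTIME_SECONDS:
            # assumed termination mechanism; the linked shell script shows a fuller version
            subprocess.run(["sudo", "shutdown", "-h", "now"], check=False)
            break
        time.sleep(300)   # check every 5 minutes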
I have a daily pipeline running on Spark Standalone 2.1. It's deployed to and runs on AWS EC2 and uses S3 for its persistence layer. For the most part, the pipeline runs without a hitch, but occasionally the job hangs on a single worker node during a reduceByKey operation. When I log into the worker, I notice that the CPU (as seen via top) is pegged at 100%. My remedy so far is to reboot the worker node so that Spark re-assigns the task, and the job proceeds fine from there.
I would like to mitigate this issue. I gather that I could prevent CPU pegging by switching to YARN as my cluster manager, but I wonder whether I could configure Spark Standalone to prevent it, perhaps by limiting the number of cores that get assigned to the Spark job? Any suggestions would be greatly appreciated.
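For concreteness, these are the kinds of application-level limits I have in mind (values are just examples; I'm not sure they're the right knobs):

    from pyspark import SparkConf

    conf = (
        SparkConf()
        .setMaster("spark://master-host:7077")    # assumed standalone master URL
        .set("spark.cores.max", "8")              # total cores the application may use cluster-wide
        .set("spark.executor.cores", "2")         # cores per executor
        .set("spark.task.cpus", "1")              # cores reserved per task
    )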
We are running a Spark Streaming application in yarn-cluster mode on a cluster that was set up using Cloudera.
We designated one of the nodes as the Spark gateway, and we run the spark-submit command from this node.
We want to test the HA of our cluster, so we test what happens when different nodes crash (we stop them).
We saw that when we stop the driver node, the application continues to run but it doesn't do anything, and "yarn -list" still shows the stopped node as the driver node. When we bring the node back up, the application resumes and the driver moves to another node, but this only happens once the node is back up. Shouldn't YARN move the driver to another node as soon as the driver node dies?
Another thing we saw is that if we kill the spark-gateway node, the application stops.
How can we run the application so that it doesn't have any single point of failure?
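For concreteness, these are the settings I assume are involved in restarting the driver (ApplicationMaster) on another node; the values are illustrative and I may be looking at the wrong knobs:

    from pyspark import SparkConf

    conf = (
        SparkConf()
        .setAppName("streaming-ha-test")                                # illustrative name
        # how many times YARN may restart the ApplicationMaster/driver (capped by
        # yarn.resourcemanager.am.max-attempts on the cluster side)
        .set("spark.yarn.maxAppAttempts", "4")
        .set("spark.yarn.am.attemptFailuresValidityInterval", "1h")     # forget old failures after this interval
    )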
I have a 3-node Spark standalone cluster, and on the master node I also have a worker. When I submit an app to the cluster, the two other workers start RUNNING, but the worker on the master node stays in the LOADING state, and eventually another worker is launched on one of the other machines.
Is having a worker and a master on the same node the problem?
If so, is there a way to work around this problem, or should I never have a worker and a master on the same node?
P.S. The machines have 8 cores each, and the workers are set to use 7 cores and not all of the RAM.
Yes, you can; here is a quote from the Spark documentation:
In addition to running on the Mesos or YARN cluster managers, Spark also provides a simple standalone deploy mode. You can launch a standalone cluster either manually, by starting a master and workers by hand, or use our provided launch scripts. It is also possible to run these daemons on a single machine for testing.
It is possible to have a machine hosting both Workers and a Master.
Is it possible that you misconfigured the spark-env.sh on that specific machine?
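If it isn't spark-env.sh, another thing worth checking is whether the executor you request can actually fit on the master-node worker; a minimal sketch, assuming a standalone master URL and illustrative sizes:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("spark://master-host:7077")     # assumed standalone master URL
        .config("spark.executor.cores", "4")    # must be <= the 7 cores that worker advertises
        .config("spark.executor.memory", "8g")  # must fit in the memory that worker advertises
        .getOrCreate()
    )

If the executor request is larger than what that worker advertises (SPARK_WORKER_CORES / SPARK_WORKER_MEMORY in spark-env.sh), no executor can ever launch there, and the master will keep placing executors on the other machines.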