what happens when I kill a yarn's node manager - apache-spark

Suppose there are 10 containers running on this machine(5 is mapreduce tasks, and 5 is spark on yarn executors).
And if I kill the node manager, what happens for these 10 containers process?
Before I restart the node manager ,what should I do first?

Killing nodemanager will only affect the containers of this particular node. All the running containers will get lost on restart/kill. They will get relaunched once the node comes up or the nodemanager process get start(if application/job still running).
NOTE: Jobs ApplicationMaster should not be running on this slave.
what happens when the node with the ApplicationMaster dies ?
In this case the yarn launchs a new ApplicationMaster on some other node. All the containers relaunched again in this case.

Answering according to hadoop 2.7.x dist: check this article: http://hortonworks.com/blog/resilience-of-yarn-applications-across-nodemanager-restarts/
If you don't have yarn.nodemanager.recovery.enabled set to true then you're container will be KILLED (spark or mapreduce or anything else) However you're job will most likely continue to run.
You need to check this property in your env using hadoop conf | grep yarn.nodemanager.recovery.dir . If it false which is for me by default then nothing you can do to prevent getting those container killed upon restart imo. However you can try to modify the flag and set other required property for future cases if you want containers to be recovered.
Look at this one too: http://www.cloudera.com/documentation/enterprise/5-4-x/topics/admin_ha_yarn_work_preserving_recovery.html

Related

How to kill an EMR task programatically

I want to programtically kill an EMR streaming task. If I kill it from EMR UI or boto client, it disappears in EMR, but it is still active in the Hadoop cluster (see this article). Only if I go through the Hadoop resource manager and kill it from there, the job is terminated.
How can do the same programatically?
You can ssh to cluster and use yarn application -kill application_id or use yarn api to kill application
Can not kill a YARN application through REST api
As #maxime-g said, the only way to kill a yarn application is to run the following command: yarn application -kill application_id.
But it is possible to run an EMR which runs a script on the master node, and that script should include this command, and possible take an argument.

spark standalone running on docker cleanup not running

I'm running spark on standalone mode as a docker service where I have one master node and one spark worker. I followed the spark documentation instructions:
https://spark.apache.org/docs/latest/spark-standalone.html
to add the properties where the spark cluster cleans itself and I set those in my docker_entrypoint
export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.interval=900 -Dspark.worker.cleanup.appDataTtl=900
and verify that it was enables following the logs of the worker node service
My question is do we expect to get all directories located on SPARK_WORKER_DIR directory to be cleaned ? or does it only clean the application files
Because I still see some empty directories holding there

Starting multiple workers on a master node in Standalone mode

I have a machine with 80 cores. I'd like to start a Spark server in standalone mode on this machine with 8 executors, each with 10 cores. But, when I try to start my second worker on the master, I get an error.
$ ./sbin/start-master.sh
Starting org.apache.spark.deploy.master.Master, logging to ...
$ ./sbin/start-slave.sh spark://localhost:7077 -c 10
Starting org.apache.spark.deploy.worker.Worker, logging to ...
$ ./sbin/start-slave.sh spark://localhost:7077 -c 10
org.apache.spark.deploy.worker.Worker running as process 64606. Stop it first.
In the documentation, it clearly states "you can start one or more workers and connect them to the master via: ./sbin/start-slave.sh <master-spark-URL>". So why can't I do that?
A way to get the same parallelism is to start many workers.
You can do this by adding to the ./conf/spark-env.sh file:
SPARK_WORKER_INSTANCES=8
SPARK_WORKER_CORES=10
SPARK_EXECUTOR_CORES=10
In a single machine, it is quite complicated but you can try docker or Kubernetes.
Create multiple docker containers for spark workers.
Just specify a new identity for every new worker/master and then launch the start-worker.sh
export SPARK_IDENT_STRING=worker2
./spark-node2/sbin/start-worker.sh spark://DESKTOP-HSK5ETQ.localdomain:7077
thanks to https://stackoverflow.com/a/46205968/1743724

Docker Container with Apache Spark in standalone cluster mode

I am trying to construct a docker image containing Apache Spark. IT is built upon the openjdk-8-jre official image.
The goal is to execute Spark in cluster mode, thus having at least one master (started via sbin/start-master.sh) and one or more slaves (sbin/start-slave.sh). See spark-standalone-docker for my Dockerfile and entrypoint script.
The build itself actually goes through, the problem is that when I want to run the container, it starts and stops shortly after. The cause is that Spark master launch script starts the master in daemon mode and exits. Thus the container terminates, as there is no process running in the foreground anymore.
The obvious solution is to run the Spark master process in foreground, but I could not figure out how (Google did not turn up anything either). My "workaround-solution" is to run tails -f on the Spark log directory.
Thus, my questions are:
How can you run Apache Spark Master in foreground?
If the first is not possible / feasible / whatever, what is the preferred (i.e. best practice) solution to keeping a container "alive" (I really don't want to use an infinite loop and a sleep command)?
UPDATED ANSWER (for spark 2.4.0):
To start spark master on foreground, just set the ENV variable
SPARK_NO_DAEMONIZE=true on your environment before running ./start-master.sh
and you are good to go.
for more info, check $SPARK_HOME/sbin/spark-daemon.sh
# Runs a Spark command as a daemon.
#
# Environment Variables
#
# SPARK_CONF_DIR Alternate conf dir. Default is ${SPARK_HOME}/conf.
# SPARK_LOG_DIR Where log files are stored. ${SPARK_HOME}/logs by default.
# SPARK_MASTER host:path where spark code should be rsync'd from
# SPARK_PID_DIR The pid files are stored. /tmp by default.
# SPARK_IDENT_STRING A string representing this instance of spark. $USER by default
# SPARK_NICENESS The scheduling priority for daemons. Defaults to 0.
# SPARK_NO_DAEMONIZE If set, will run the proposed command in the foreground. It will not output a PID file.
##
How can you run Apache Spark Master in foreground?
You can use spark-class with Master.
bin/spark-class org.apache.spark.deploy.master.Master
and the same thing for workers:
bin/spark-class org.apache.spark.deploy.worker.Worker $MASTER_URL
If you're looking for production ready solution you should consider using a proper supervisor like dumb-init or tini.

How to know the state of Spark job

Now I have a job running on amazon ec2 and I use putty to connect with the ec2 cluster,but just know the connection of putty is lost.After I reconnect with the ec2 cluster I have no output of the job,so I don't know if my job is still running.Anybody know how to check the state of Spark job?
thanks
assuming you are on yarn cluster, you could run
yarn application -list
to get a list of appliactions and then run
yarn application -status applicationId
to know the status
It is good practice to use GNU Screen (or other similar tool) to keep session alive (but detached, if connection lost with machine) when working on remote machines.
The status of a Spark application can be ascertained from Spark UI (or Yarn UI).
If you are looking for cli command:
For stand-alone cluster use:
spark-submit --status <app-driver-id>
For yarn:
yarn application --status <app-id>

Resources