Spark cluster: "Initial job has not accepted any resources" and executors keep exiting - apache-spark

I have a Spark cluster on cloud resources with two instances, one as master and one as worker. The total resources are 4 cores and 10 GB of RAM.
I can start the shell, and the worker registers successfully, but running even simple code fails.
The error from shell is:
Spark version:2.3.0
System: CentOS v7
The firewalls are stopped.
Here is the config:
export JAVA_HOME=/usr/java/jdk1.8.0_144
export SPARK_MASTER_IP=IP
export PYSPARK_PYTHON=/opt/anaconda3/bin/python
export SPARK_WORKER_MEMORY=2g
export SPARK_WORKER_INSTANCES=1
export SPARK_WORKER_CORES=4
export SPARK_EXECUTOR_MEMORY=1g
I set up another Spark cluster with a similar config on three physical machines, and it works well. At first I got the same error there, but I solved it by stopping the firewalls. Now I want to set up the cluster in the cloud; I got the same error again, but the same fix did not resolve it. I suspect a port problem, because I have only opened ports 80, 4040, 6066, 7077, 8080, 8081, and 8787.
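One way to test the port theory is to check, from the worker instance, whether the ports Spark actually uses accept TCP connections. The executor connects back to the driver on the port shown as -Dspark.driver.port in the worker's launch command (39928 in the worker log in this question). A minimal sketch, assuming plain TCP reachability is the only thing in question; the function name is illustrative and not part of Spark:

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False
```

For example, port_is_open("spark-master.novalocal", 39928) run on the worker would show whether the driver port from the log is reachable. If it is not, either open the port in the cloud security rules or pin spark.driver.port (and spark.blockManager.port) to ports that are already open.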
Here is the error:
Here are the logs:
Master log:
2018-04-12 13:09:14 INFO Master:54 - Registering app Spark shell
2018-04-12 13:09:14 INFO Master:54 - Registered app Spark shell with ID app-20180412130914-0000
2018-04-12 13:09:14 INFO Master:54 - Launching executor app-20180412130914-0000/0 on worker worker-20180411144020-192.**.**.**-44986
2018-04-12 13:11:15 INFO Master:54 - Removing executor app-20180412130914-0000/0 because it is EXITED
2018-04-12 13:11:15 INFO Master:54 - Launching executor app-20180412130914-0000/1 on worker worker-20180411144020-192.**.**.**-44986
2018-04-12 13:13:16 INFO Master:54 - Removing executor app-20180412130914-0000/1 because it is EXITED
2018-04-12 13:13:16 INFO Master:54 - Launching executor app-20180412130914-0000/2 on worker worker-20180411144020-192.**.**.**-44986
2018-04-12 13:15:17 INFO Master:54 - Removing executor app-20180412130914-0000/2 because it is EXITED
2018-04-12 13:15:17 INFO Master:54 - Launching executor app-20180412130914-0000/3 on worker worker-20180411144020-192.**.**.**-44986
2018-04-12 13:16:15 INFO Master:54 - Removing app app-20180412130914-0000
2018-04-12 13:16:15 INFO Master:54 - 192.**.**.**:39766 got disassociated, removing it.
2018-04-12 13:16:15 INFO Master:54 - IP:39928 got disassociated, removing it.
2018-04-12 13:16:15 WARN Master:66 - Got status update for unknown executor app-20180412130914-0000/3
Worker log:
2018-04-12 13:09:12 INFO Worker:54 - Asked to launch executor app-20180412130914-0000/0 for Spark shell
2018-04-12 13:09:12 INFO SecurityManager:54 - Changing view acls to: root
2018-04-12 13:09:12 INFO SecurityManager:54 - Changing modify acls to: root
2018-04-12 13:09:12 INFO SecurityManager:54 - Changing view acls groups to:
2018-04-12 13:09:12 INFO SecurityManager:54 - Changing modify acls groups to:
2018-04-12 13:09:12 INFO SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
2018-04-12 13:09:12 INFO ExecutorRunner:54 - Launch command: "/usr/java/jdk1.8.0_144/bin/java" "-cp" "/opt/spark-2.3.0-bin-hadoop2.7/conf/:/opt/spark-2.3.0-bin-hadoop2.7/jars/*" "-Xmx1024M" "-Dspark.driver.port=39928" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@IP:39928" "--executor-id" "0" "--hostname" "192.**.**.**" "--cores" "4" "--app-id" "app-20180412130914-0000" "--worker-url" "spark://Worker@192.**.**.**:44986"
2018-04-12 13:11:13 INFO Worker:54 - Executor app-20180412130914-0000/0 finished with state EXITED message Command exited with code 1 exitStatus 1
2018-04-12 13:11:13 INFO Worker:54 - Asked to launch executor app-20180412130914-0000/1 for Spark shell
2018-04-12 13:11:13 INFO SecurityManager:54 - Changing view acls to: root
2018-04-12 13:11:13 INFO SecurityManager:54 - Changing modify acls to: root
2018-04-12 13:11:13 INFO SecurityManager:54 - Changing view acls groups to:
2018-04-12 13:11:13 INFO SecurityManager:54 - Changing modify acls groups to:
2018-04-12 13:11:13 INFO SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
2018-04-12 13:11:13 INFO ExecutorRunner:54 - Launch command: "/usr/java/jdk1.8.0_144/bin/java" "-cp" "/opt/spark-2.3.0-bin-hadoop2.7/conf/:/opt/spark-2.3.0-bin-hadoop2.7/jars/*" "-Xmx1024M" "-Dspark.driver.port=39928" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@spark-master.novalocal:39928" "--executor-id" "1" "--hostname" "192.**.**.**" "--cores" "4" "--app-id" "app-20180412130914-0000" "--worker-url" "spark://Worker@192.**.**.**:44986"
2018-04-12 13:13:15 INFO Worker:54 - Executor app-20180412130914-0000/1 finished with state EXITED message Command exited with code 1 exitStatus 1
2018-04-12 13:13:15 INFO Worker:54 - Asked to launch executor app-20180412130914-0000/2 for Spark shell
2018-04-12 13:13:15 INFO SecurityManager:54 - Changing view acls to: root
2018-04-12 13:13:15 INFO SecurityManager:54 - Changing modify acls to: root
2018-04-12 13:13:15 INFO SecurityManager:54 - Changing view acls groups to:
2018-04-12 13:13:15 INFO SecurityManager:54 - Changing modify acls groups to:
2018-04-12 13:13:15 INFO SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
2018-04-12 13:13:15 INFO ExecutorRunner:54 - Launch command: "/usr/java/jdk1.8.0_144/bin/java" "-cp" "/opt/spark-2.3.0-bin-hadoop2.7/conf/:/opt/spark-2.3.0-bin-hadoop2.7/jars/*" "-Xmx1024M" "-Dspark.driver.port=39928" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@spark-master.novalocal:39928" "--executor-id" "2" "--hostname" "192.**.**.**" "--cores" "4" "--app-id" "app-20180412130914-0000" "--worker-url" "spark://Worker@192.**.**.**:44986"
2018-04-12 13:15:16 INFO Worker:54 - Executor app-20180412130914-0000/2 finished with state EXITED message Command exited with code 1 exitStatus 1
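The launch commands in the worker log name the address the executor will try to reach. A small sketch that pulls the driver host and port out of such a line and compares the port with the list of opened ports from the question. That a blocked driver port is the cause here is an inference from the logs, not something they state directly; the helper name and the regular expression are illustrative:

```python
import re

# Ports the question says are open on the cloud instances.
OPEN_PORTS = {80, 4040, 6066, 7077, 8080, 8081, 8787}

# Matches e.g. "--driver-url" "spark://CoarseGrainedScheduler@host:39928".
# The separator before the host is matched loosely because some pasted
# copies of the log show '#' where Spark prints '@'.
DRIVER_URL = re.compile(
    r'"--driver-url"\s+"spark://CoarseGrainedScheduler[@#]([^":]+):(\d+)"'
)


def driver_endpoint(launch_line: str):
    """Return (host, port) from an ExecutorRunner launch command, or None."""
    match = DRIVER_URL.search(launch_line)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


sample = (
    '"--driver-url" "spark://CoarseGrainedScheduler@spark-master.novalocal:39928" '
    '"--executor-id" "1"'
)
host, port = driver_endpoint(sample)
print(host, port, "open" if port in OPEN_PORTS else "not in the opened list")
# prints: spark-master.novalocal 39928 not in the opened list
```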

Related

Apache Spark on IPv6

I am trying to run Spark on IPv6. The Spark master comes up on the IPv6 DNS hostname, but the Spark worker node doesn't start; whether I pass an IP or a DNS name, the error is the same. I need to use LOCAL_WORKER_IP=127.0.0.1 to make the Spark worker start.
Spark worker log:
fd74:ca9b:3a09:868c:172:18:0:462a spark-master.t253-u000265.svc.cluster.local
21/08/09 16:41:40 INFO Worker: Started daemon with process name: 10@v4-virtio-spark-worker-zjgxs
21/08/09 16:41:40 INFO SignalUtils: Registered signal handler for TERM
21/08/09 16:41:40 INFO SignalUtils: Registered signal handler for HUP
21/08/09 16:41:40 INFO SignalUtils: Registered signal handler for INT
21/08/09 16:41:41 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
21/08/09 16:41:41 INFO SecurityManager: Changing view acls to: root
21/08/09 16:41:41 INFO SecurityManager: Changing modify acls to: root
21/08/09 16:41:41 INFO SecurityManager: Changing view acls groups to:
21/08/09 16:41:41 INFO SecurityManager: Changing modify acls groups to:
21/08/09 16:41:41 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
21/08/09 16:41:42 INFO Utils: Successfully started service 'sparkWorker' on port 37816
21/08/09 16:41:42 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[main,5,main]
java.lang.AssertionError: assertion failed: Expected hostname (not IP) but got fd74:ca9b:3a09:868c:172:18:0:488e
at scala.Predef$.assert(Predef.scala:170)
at org.apache.spark.util.Utils$.checkHost(Utils.scala:1014)
at org.apache.spark.deploy.worker.Worker.<init>(Worker.scala:60)
at org.apache.spark.deploy.worker.Worker$.startRpcEnvAndEndpoint(Worker.scala:811)
at org.apache.spark.deploy.worker.Worker$.main(Worker.scala:779)
at org.apache.spark.deploy.worker.Worker.main(Worker.scala)
21/08/09 16:41:42 INFO ShutdownHookManager: Shutdown hook called
Does anyone know whether the Spark worker can be configured to run on IPv6, and with which settings?
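The assertion message in the stack trace says the worker was handed a raw IPv6 literal where Utils.checkHost expected a hostname. Before passing a value to the worker, it can help to confirm which kind of string it is. A minimal sketch using Python's standard ipaddress module; it does not reproduce Spark's own check, and the function name is made up for illustration:

```python
import ipaddress


def classify_host(value: str) -> str:
    """Return 'ipv6', 'ipv4', or 'hostname' for a host string."""
    try:
        # Strip optional brackets, as in "[fd74::1]".
        address = ipaddress.ip_address(value.strip("[]"))
    except ValueError:
        return "hostname"
    return "ipv6" if address.version == 6 else "ipv4"


for candidate in (
    "fd74:ca9b:3a09:868c:172:18:0:488e",
    "spark-master.t253-u000265.svc.cluster.local",
    "127.0.0.1",
):
    print(candidate, "->", classify_host(candidate))
```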

Is it possible to configure the Beam portable runner with the Spark configurations?

TL;DR
Is it possible to configure the Beam portable runner with the Spark configurations? More precisely, is it possible to configure spark.driver.host in the portable runner?
Motivation
Currently, we have Airflow running in a Kubernetes cluster, and since we aim to use TensorFlow Extended, we need to use Apache Beam. For our use case, Spark would be the appropriate runner, and because Airflow and TensorFlow are written in Python, we would need to use Apache Beam's portable runner (https://beam.apache.org/documentation/runners/spark/#portability).
The problem
The portable runner creates the Spark context inside its container and leaves no room for configuring the driver DNS name, so the executors inside the worker pods cannot reach the driver (the job server).
Setup
Following the Beam documentation, the job server was deployed in the same pod as Airflow so that the two containers can use the local network between them.
Job server config:
- name: beam-spark-job-server
  image: apache/beam_spark_job_server:2.27.0
  args: ["--spark-master-url=spark://spark-master:7077"]
Job server/airflow service:
apiVersion: v1
kind: Service
metadata:
  name: airflow-scheduler
  labels:
    app: airflow-k8s
spec:
  type: ClusterIP
  selector:
    app: airflow-scheduler
  ports:
    - port: 8793
      protocol: TCP
      targetPort: 8793
      name: scheduler
    - port: 8099
      protocol: TCP
      targetPort: 8099
      name: job-server
    - port: 7077
      protocol: TCP
      targetPort: 7077
      name: spark-master
    - port: 8098
      protocol: TCP
      targetPort: 8098
      name: artifact
    - port: 8097
      protocol: TCP
      targetPort: 8097
      name: java-expansion
Ports 8097, 8098, and 8099 belong to the job server, 8793 to Airflow, and 7077 to the Spark master.
Development/Errors
When I run a simple Beam example, python -m apache_beam.examples.wordcount --output ./data_test/ --runner=PortableRunner --job_endpoint=localhost:8099 --environment_type=LOOPBACK, from the Airflow container, I get the following output in the Airflow pod:
Defaulting container name to airflow-scheduler.
Use 'kubectl describe pod/airflow-scheduler-local-f685b5bc7-9d7r6 -n airflow-main-local' to see all of the containers in this pod.
airflow@airflow-scheduler-local-f685b5bc7-9d7r6:/opt/airflow$ python -m apache_beam.examples.wordcount --output ./data_test/ --runner=PortableRunner --job_endpoint=localhost:8099 --environment_type=LOOPBACK
INFO:apache_beam.internal.gcp.auth:Setting socket default timeout to 60 seconds.
INFO:apache_beam.internal.gcp.auth:socket default timeout is 60.0 seconds.
INFO:oauth2client.client:Timeout attempting to reach GCE metadata service.
WARNING:apache_beam.internal.gcp.auth:Unable to find default credentials to use: The Application Default Credentials are not available. They are available if running in Google Compute Engine. Otherwise, the environment variable GOOGLE_APPLICATION_CREDENTIALS must be defined pointing to a file defining the credentials. See https://developers.google.com/accounts/docs/application-default-credentials for more information.
Connecting anonymously.
INFO:apache_beam.runners.worker.worker_pool_main:Listening for workers at localhost:35837
WARNING:root:Make sure that locally built Python SDK docker image has Python 3.7 interpreter.
INFO:root:Default Python SDK image for environment is apache/beam_python3.7_sdk:2.27.0
INFO:apache_beam.runners.portability.portable_runner:Environment "LOOPBACK" has started a component necessary for the execution. Be sure to run the pipeline using
with Pipeline() as p:
p.apply(..)
This ensures that the pipeline finishes before this program exits.
INFO:apache_beam.runners.portability.portable_runner:Job state changed to STOPPED
INFO:apache_beam.runners.portability.portable_runner:Job state changed to STARTING
INFO:apache_beam.runners.portability.portable_runner:Job state changed to RUNNING
And the worker log:
21/02/19 19:50:00 INFO Worker: Asked to launch executor app-20210219194804-0000/47 for BeamApp-root-0219194747-7d7938cf_51452c51-dffe-4c61-bcb7-60c7779e3256
21/02/19 19:50:00 INFO SecurityManager: Changing view acls to: root
21/02/19 19:50:00 INFO SecurityManager: Changing modify acls to: root
21/02/19 19:50:00 INFO SecurityManager: Changing view acls groups to:
21/02/19 19:50:00 INFO SecurityManager: Changing modify acls groups to:
21/02/19 19:50:00 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
21/02/19 19:50:00 INFO ExecutorRunner: Launch command: "/usr/local/openjdk-8/bin/java" "-cp" "/opt/spark/conf/:/opt/spark/jars/*" "-Xmx1024M" "-Dspark.driver.port=44447" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@airflow-scheduler-local-f685b5bc7-9d7r6:44447" "--executor-id" "47" "--hostname" "172.18.0.3" "--cores" "1" "--app-id" "app-20210219194804-0000" "--worker-url" "spark://Worker@172.18.0.3:35837"
21/02/19 19:50:02 INFO Worker: Executor app-20210219194804-0000/47 finished with state EXITED message Command exited with code 1 exitStatus 1
21/02/19 19:50:02 INFO ExternalShuffleBlockResolver: Clean up non-shuffle files associated with the finished executor 47
21/02/19 19:50:02 INFO ExternalShuffleBlockResolver: Executor is not registered (appId=app-20210219194804-0000, execId=47)
21/02/19 19:50:02 INFO Worker: Asked to launch executor app-20210219194804-0000/48 for BeamApp-root-0219194747-7d7938cf_51452c51-dffe-4c61-bcb7-60c7779e3256
21/02/19 19:50:02 INFO SecurityManager: Changing view acls to: root
21/02/19 19:50:02 INFO SecurityManager: Changing modify acls to: root
21/02/19 19:50:02 INFO SecurityManager: Changing view acls groups to:
21/02/19 19:50:02 INFO SecurityManager: Changing modify acls groups to:
21/02/19 19:50:02 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
21/02/19 19:50:02 INFO ExecutorRunner: Launch command: "/usr/local/openjdk-8/bin/java" "-cp" "/opt/spark/conf/:/opt/spark/jars/*" "-Xmx1024M" "-Dspark.driver.port=44447" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@airflow-scheduler-local-f685b5bc7-9d7r6:44447" "--executor-id" "48" "--hostname" "172.18.0.3" "--cores" "1" "--app-id" "app-20210219194804-0000" "--worker-url" "spark://Worker@172.18.0.3:35837"
21/02/19 19:50:04 INFO Worker: Executor app-20210219194804-0000/48 finished with state EXITED message Command exited with code 1 exitStatus 1
21/02/19 19:50:04 INFO ExternalShuffleBlockResolver: Clean up non-shuffle files associated with the finished executor 48
21/02/19 19:50:04 INFO ExternalShuffleBlockResolver: Executor is not registered (appId=app-20210219194804-0000, execId=48)
21/02/19 19:50:04 INFO Worker: Asked to launch executor app-20210219194804-0000/49 for BeamApp-root-0219194747-7d7938cf_51452c51-dffe-4c61-bcb7-60c7779e3256
21/02/19 19:50:04 INFO SecurityManager: Changing view acls to: root
21/02/19 19:50:04 INFO SecurityManager: Changing modify acls to: root
21/02/19 19:50:04 INFO SecurityManager: Changing view acls groups to:
21/02/19 19:50:04 INFO SecurityManager: Changing modify acls groups to:
21/02/19 19:50:04 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
21/02/19 19:50:04 INFO ExecutorRunner: Launch command: "/usr/local/openjdk-8/bin/java" "-cp" "/opt/spark/conf/:/opt/spark/jars/*" "-Xmx1024M" "-Dspark.driver.port=44447" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@airflow-scheduler-local-f685b5bc7-9d7r6:44447" "--executor-id" "49" "--hostname" "172.18.0.3" "--cores" "1" "--app-id" "app-20210219194804-0000" "--worker-url" "spark://Worker@172.18.0.3:35837"
.
.
.
As we can see, the executors exit constantly. As far as I know, this is caused by missing communication between the executor and the driver (the job server in this case). Also, the "--driver-url" is built from the driver pod name and the random port given by "-Dspark.driver.port".
As we can't define the name of the service, the worker tries to use the driver's original name and a randomly generated port. Because the configuration comes from the driver, changing the default conf files on the worker/master has no effect.
Using this answer as an example, I tried setting the env variable SPARK_PUBLIC_DNS in the job server, but this did not change anything in the worker logs.
Note
Running a Spark job directly in Kubernetes:
kubectl run spark-base --rm -it --labels="app=spark-client" --image bde2020/spark-base:2.4.5-hadoop2.7 -- bash ./spark/bin/pyspark --master spark://spark-master:7077 --conf spark.driver.host=spark-client
having the service:
apiVersion: v1
kind: Service
metadata:
  name: spark-client
spec:
  selector:
    app: spark-client
  clusterIP: None
I get a fully working pyspark shell. If I omit the --conf parameter, I get the same behavior as in the first setup (executors exiting indefinitely):
21/02/19 20:21:02 INFO Worker: Executor app-20210219202050-0002/4 finished with state EXITED message Command exited with code 1 exitStatus 1
21/02/19 20:21:02 INFO ExternalShuffleBlockResolver: Clean up non-shuffle files associated with the finished executor 4
21/02/19 20:21:02 INFO ExternalShuffleBlockResolver: Executor is not registered (appId=app-20210219202050-0002, execId=4)
21/02/19 20:21:02 INFO Worker: Asked to launch executor app-20210219202050-0002/5 for Spark shell
21/02/19 20:21:02 INFO SecurityManager: Changing view acls to: root
21/02/19 20:21:02 INFO SecurityManager: Changing modify acls to: root
21/02/19 20:21:02 INFO SecurityManager: Changing view acls groups to:
21/02/19 20:21:02 INFO SecurityManager: Changing modify acls groups to:
21/02/19 20:21:02 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
21/02/19 20:21:02 INFO ExecutorRunner: Launch command: "/usr/local/openjdk-8/bin/java" "-cp" "/opt/spark/conf/:/opt/spark/jars/*" "-Xmx1024M" "-Dspark.driver.port=46161" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@spark-base:46161" "--executor-id" "5" "--hostname" "172.18.0.20" "--cores" "1" "--app-id" "app-20210219202050-0002" "--worker-url" "spark://Worker@172.18.0.20:45151"
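The only difference between the working and the failing pyspark invocation is the spark.driver.host setting, which controls the host name that ends up in the executors' --driver-url. A small sketch that assembles such --conf arguments; the helper is illustrative, and the port value 7078 is an arbitrary placeholder, while spark.driver.host, spark.driver.port and spark.blockManager.port are standard Spark properties:

```python
def driver_conf_args(host, driver_port=None, block_manager_port=None):
    """Build '--conf key=value' arguments for the driver's network settings."""
    props = {"spark.driver.host": host}
    if driver_port is not None:
        props["spark.driver.port"] = str(driver_port)
    if block_manager_port is not None:
        props["spark.blockManager.port"] = str(block_manager_port)
    args = []
    for key, value in props.items():
        args += ["--conf", f"{key}={value}"]
    return args


command = ["pyspark", "--master", "spark://spark-master:7077"]
command += driver_conf_args("spark-client", driver_port=7078)  # 7078 is a placeholder
print(" ".join(command))
```

Pinning the driver port in addition to the host makes it possible to expose that port through a Kubernetes service instead of relying on a random one.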
I have three solutions to choose from depending on your deployment requirements. In order of difficulty:
Use the Spark "uber jar" job server. This starts an embedded job server inside the Spark master, instead of using a standalone job server in a container. This would simplify your deployment a lot, since you would not need to start the beam_spark_job_server container at all.
python -m apache_beam.examples.wordcount \
--output ./data_test/ \
--runner=SparkRunner \
--spark_submit_uber_jar \
--spark_master_url=spark://spark-master:7077 \
--environment_type=LOOPBACK
You can pass the properties through a Spark configuration file. Create the Spark configuration file, and add spark.driver.host and whatever other properties you need. In the docker run command for the job server, mount that configuration file to the container, and set the SPARK_CONF_DIR environment variable to point to that directory.
If neither of those works for you, you can alternatively build your own customized version of the job server container. Pull the Beam source from GitHub. Check out the release branch you want to use (e.g. git checkout origin/release-2.28.0). Modify the entrypoint spark-job-server.sh and set -Dspark.driver.host=x there. Then build the container using ./gradlew :runners:spark:job-server:container:docker -Pdocker-repository-root="your-repo" -Pdocker-tag="your-tag".
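For the configuration-file option, the file in question is spark-defaults.conf, which holds one property per line as a key and a value separated by whitespace. A minimal sketch that writes such a file into a directory that can then be mounted and referenced through SPARK_CONF_DIR. The helper names, the host value and the port 7078 are assumptions for illustration, and whether the job server image picks the directory up is the claim of the answer above, not something verified here:

```python
from pathlib import Path


def render_spark_defaults(props: dict) -> str:
    """Render properties in spark-defaults.conf style: one 'key value' per line."""
    return "".join(f"{key} {value}\n" for key, value in sorted(props.items()))


def write_conf_dir(conf_dir: str, props: dict) -> Path:
    """Write spark-defaults.conf into conf_dir and return the file's path."""
    directory = Path(conf_dir)
    directory.mkdir(parents=True, exist_ok=True)
    target = directory / "spark-defaults.conf"
    target.write_text(render_spark_defaults(props))
    return target


# Placeholder values: use the DNS name and port the workers can actually reach.
print(render_spark_defaults({
    "spark.driver.host": "airflow-scheduler",
    "spark.driver.port": "7078",
}), end="")
```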
Let me revise the answer. The job server needs to be able to communicate with the workers, and vice versa. The executors keep exiting because they cannot. You need to configure things so that they can communicate; a Kubernetes headless service can solve this.
A working example is available at https://github.com/cometta/python-apache-beam-spark .
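Whether a headless service is doing its job can be checked from inside a worker pod by resolving the driver's name. A minimal sketch using only the Python standard library; the function name is illustrative, and the name to resolve would be whatever appears after the endpoint name in the executors' --driver-url:

```python
import socket


def resolve(name: str):
    """Return the sorted IP addresses a name resolves to, or [] on failure."""
    try:
        infos = socket.getaddrinfo(name, None)
    except socket.gaierror:
        return []
    # Each entry's sockaddr starts with the address string.
    return sorted({info[4][0] for info in infos})
```

An empty result for the driver's name from a worker pod would mean the executors cannot even look up the driver, before any port is involved.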

spark-submit: unable to get driver status

I'm running a job on a test Spark standalone in cluster mode, but I'm finding myself unable to monitor the status of the driver.
Here is a minimal example using spark-2.4.3 (master and one worker running on the same node, started by running sbin/start-all.sh on a freshly unarchived installation with the default conf and no conf/slaves set), executing spark-submit from the node itself:
$ spark-submit --master spark://ip-172-31-15-245:7077 --deploy-mode cluster \
--class org.apache.spark.examples.SparkPi \
/home/ubuntu/spark/examples/jars/spark-examples_2.11-2.4.3.jar 100
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.NativeCodeLoader).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
19/06/27 09:08:28 INFO SecurityManager: Changing view acls to: ubuntu
19/06/27 09:08:28 INFO SecurityManager: Changing modify acls to: ubuntu
19/06/27 09:08:28 INFO SecurityManager: Changing view acls groups to:
19/06/27 09:08:28 INFO SecurityManager: Changing modify acls groups to:
19/06/27 09:08:28 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(ubuntu); groups with view permissions: Set(); users with modify permissions: Set(ubuntu); groups with modify permissions: Set()
19/06/27 09:08:28 INFO Utils: Successfully started service 'driverClient' on port 36067.
19/06/27 09:08:28 INFO TransportClientFactory: Successfully created connection to ip-172-31-15-245/172.31.15.245:7077 after 29 ms (0 ms spent in bootstraps)
19/06/27 09:08:28 INFO ClientEndpoint: Driver successfully submitted as driver-20190627090828-0008
19/06/27 09:08:28 INFO ClientEndpoint: ... waiting before polling master for driver state
19/06/27 09:08:33 INFO ClientEndpoint: ... polling master for driver state
19/06/27 09:08:33 INFO ClientEndpoint: State of driver-20190627090828-0008 is RUNNING
19/06/27 09:08:33 INFO ClientEndpoint: Driver running on 172.31.15.245:41057 (worker-20190627083412-172.31.15.245-41057)
19/06/27 09:08:33 INFO ShutdownHookManager: Shutdown hook called
19/06/27 09:08:33 INFO ShutdownHookManager: Deleting directory /tmp/spark-34082661-f0de-4c56-92b7-648ea24fa59c
$ spark-submit --master spark://ip-172-31-15-245:7077 --status driver-20190627090828-0008
19/06/27 09:09:27 WARN RestSubmissionClient: Unable to connect to server spark://ip-172-31-15-245:7077.
Exception in thread "main" org.apache.spark.deploy.rest.SubmitRestConnectionException: Unable to connect to server
at org.apache.spark.deploy.rest.RestSubmissionClient$$anonfun$requestSubmissionStatus$3.apply(RestSubmissionClient.scala:165)
at org.apache.spark.deploy.rest.RestSubmissionClient$$anonfun$requestSubmissionStatus$3.apply(RestSubmissionClient.scala:148)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
at org.apache.spark.deploy.rest.RestSubmissionClient.requestSubmissionStatus(RestSubmissionClient.scala:148)
at org.apache.spark.deploy.SparkSubmit.requestStatus(SparkSubmit.scala:111)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:88)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: org.apache.spark.deploy.rest.SubmitRestConnectionException: No response from server
at org.apache.spark.deploy.rest.RestSubmissionClient.readResponse(RestSubmissionClient.scala:285)
at org.apache.spark.deploy.rest.RestSubmissionClient.org$apache$spark$deploy$rest$RestSubmissionClient$$get(RestSubmissionClient.scala:195)
at org.apache.spark.deploy.rest.RestSubmissionClient$$anonfun$requestSubmissionStatus$3.apply(RestSubmissionClient.scala:152)
... 11 more
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:190)
at org.apache.spark.deploy.rest.RestSubmissionClient.readResponse(RestSubmissionClient.scala:278)
... 13 more
Spark is in good health (I'm able to run other jobs after the one above), the driver-20190627090828-0008 appears as "FINISHED" in the web UI.
Is there something I am missing?
UPDATE:
on the master log all I get is
19/07/01 09:40:24 INFO master.Master: 172.31.15.245:42308 got disassociated, removing it.
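The stack trace shows that --status goes through RestSubmissionClient, so it needs the master's REST submission server rather than the RPC port 7077. As far as I know, that server listens on port 6066 by default and is disabled by default in Spark 2.4 unless spark.master.rest.enabled is set to true, which would explain the timeout. A sketch of querying the status endpoint directly, assuming the REST server is enabled and reachable on its default port; the function names are illustrative:

```python
import json
import urllib.request


def status_url(master_host: str, driver_id: str, rest_port: int = 6066) -> str:
    """Build the standalone REST status URL for a submitted driver."""
    return f"http://{master_host}:{rest_port}/v1/submissions/status/{driver_id}"


def fetch_status(master_host: str, driver_id: str, rest_port: int = 6066,
                 timeout: float = 10.0) -> dict:
    """Query the master's REST server and return the decoded JSON response."""
    url = status_url(master_host, driver_id, rest_port)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


print(status_url("ip-172-31-15-245", "driver-20190627090828-0008"))
```

With the REST server enabled, the equivalent spark-submit call would point --master at the REST port (spark://ip-172-31-15-245:6066) instead of 7077.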

Spark on YARN runs indefinitely

I had Spark (2.2 on Hadoop 2.7) jobs running and had to restart the sparkmaster machine. Now Spark jobs on YARN get submitted, accepted, and start running, but never finish.
The cluster has 1 + 3 nodes: the ResourceManager and NameNode run on the sparkmaster node, and a NodeManager and DataNode run on each of the 3 worker nodes.
Executor Log:
/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
17/12/15 08:58:02 INFO executor.CoarseGrainedExecutorBackend: Started daemon with process name: 130256@cassandralake1node3.localdomain
17/12/15 08:58:02 INFO util.SignalUtils: Registered signal handler for TERM
17/12/15 08:58:02 INFO util.SignalUtils: Registered signal handler for HUP
17/12/15 08:58:02 INFO util.SignalUtils: Registered signal handler for INT
17/12/15 08:58:03 WARN util.Utils: Your hostname, cassandralake1node3.localdomain resolves to a loopback address: 127.0.0.1; using 10.204.211.105 instead (on interface em1)
17/12/15 08:58:03 WARN util.Utils: Set SPARK_LOCAL_IP if you need to bind to another address
17/12/15 08:58:03 INFO spark.SecurityManager: Changing view acls to: root
17/12/15 08:58:03 INFO spark.SecurityManager: Changing modify acls to: root
17/12/15 08:58:03 INFO spark.SecurityManager: Changing view acls groups to:
17/12/15 08:58:03 INFO spark.SecurityManager: Changing modify acls groups to:
17/12/15 08:58:03 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
17/12/15 08:58:03 INFO client.TransportClientFactory: Successfully created connection to /10.204.211.105:40866 after 85 ms (0 ms spent in bootstraps)
17/12/15 08:58:04 INFO spark.SecurityManager: Changing view acls to: root
17/12/15 08:58:04 INFO spark.SecurityManager: Changing modify acls to: root
17/12/15 08:58:04 INFO spark.SecurityManager: Changing view acls groups to:
17/12/15 08:58:04 INFO spark.SecurityManager: Changing modify acls groups to:
17/12/15 08:58:04 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
17/12/15 08:58:04 INFO client.TransportClientFactory: Successfully created connection to /10.204.211.105:40866 after 1 ms (0 ms spent in bootstraps)
17/12/15 08:58:04 INFO storage.DiskBlockManager: Created local directory at /tmp/hadoop-root/nm-local-dir/usercache/root/appcache/application_1513329182871_0010/blockmgr-15ae52df-c267-427e-b8f1-ef1c84059740
17/12/15 08:58:04 INFO memory.MemoryStore: MemoryStore started with capacity 1311.0 MB
17/12/15 08:58:04 INFO executor.CoarseGrainedExecutorBackend: Connecting to driver: spark://CoarseGrainedScheduler@10.204.211.105:40866
17/12/15 08:58:04 INFO executor.CoarseGrainedExecutorBackend: Successfully registered with driver
17/12/15 08:58:04 INFO executor.Executor: Starting executor ID 1 on host cassandranode3
17/12/15 08:58:04 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 35983.
17/12/15 08:58:04 INFO netty.NettyBlockTransferService: Server created on cassandranode3:35983
17/12/15 08:58:04 INFO storage.BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
17/12/15 08:58:04 INFO storage.BlockManagerMaster: Registering BlockManager BlockManagerId(1, cassandranode3, 35983, None)
17/12/15 08:58:04 INFO storage.BlockManagerMaster: Registered BlockManager BlockManagerId(1, cassandranode3, 35983, None)
17/12/15 08:58:04 INFO storage.BlockManager: external shuffle service port = 7337
17/12/15 08:58:04 INFO storage.BlockManager: Registering executor with local external shuffle service.
17/12/15 08:58:04 INFO client.TransportClientFactory: Successfully created connection to cassandranode3/10.204.211.105:7337 after 1 ms (0 ms spent in bootstraps)
17/12/15 08:58:04 INFO storage.BlockManager: Initialized BlockManager: BlockManagerId(1, cassandranode3, 35983, None)
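The executor log warns that the node's hostname resolves to the loopback address 127.0.0.1. The log does not establish that this is why the jobs never finish, but after a restart it is cheap to check each node for it. A minimal sketch with the Python standard library; the function names are illustrative:

```python
import ipaddress
import socket


def is_loopback(address: str) -> bool:
    """Return True if the given IP address string is a loopback address."""
    return ipaddress.ip_address(address).is_loopback


def hostname_resolves_to_loopback() -> bool:
    """Check whether this machine's own hostname resolves to loopback."""
    try:
        return is_loopback(socket.gethostbyname(socket.gethostname()))
    except socket.gaierror:
        return False  # the hostname does not resolve at all
```

If hostname_resolves_to_loopback() returns True on a node, the usual remedies are to fix the hostname entry in /etc/hosts or to set SPARK_LOCAL_IP, as the warning itself suggests.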
Driver Log:
O util.Utils: Using initial executors = 2, max of spark.dynamicAllocation.initialExecutors, spark.dynamicAllocation.minExecutors and spark.executor.instances
17/12/15 09:50:06 INFO yarn.YarnAllocator: Will request 2 executor container(s), each with 1 core(s) and 3072 MB memory (including 1024 MB of overhead)
17/12/15 09:50:06 INFO yarn.YarnAllocator: Submitted 2 unlocalized container requests.
17/12/15 09:50:06 INFO yarn.ApplicationMaster: Started progress reporter thread with (heartbeat : 3000, initial allocation : 200) intervals
17/12/15 09:50:07 INFO impl.AMRMClientImpl: Received new token for : cassandranode2:38628
17/12/15 09:50:07 INFO impl.AMRMClientImpl: Received new token for : cassandranode3:39212
17/12/15 09:50:07 INFO yarn.YarnAllocator: Launching container container_1513329182871_0011_01_000002 on host cassandranode2 for executor with ID 1
17/12/15 09:50:07 INFO yarn.YarnAllocator: Launching container container_1513329182871_0011_01_000003 on host cassandranode3 for executor with ID 2
17/12/15 09:50:07 INFO yarn.YarnAllocator: Received 2 containers from YARN, launching executors on 2 of them.
17/12/15 09:50:07 INFO impl.ContainerManagementProtocolProxy: yarn.client.max-cached-nodemanagers-proxies : 0
17/12/15 09:50:07 INFO impl.ContainerManagementProtocolProxy: yarn.client.max-cached-nodemanagers-proxies : 0
17/12/15 09:50:07 INFO impl.ContainerManagementProtocolProxy: Opening proxy : cassandranode3:39212
17/12/15 09:50:07 INFO impl.ContainerManagementProtocolProxy: Opening proxy : cassandranode2:38628
17/12/15 09:50:09 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.204.211.105:47622) with ID 2
17/12/15 09:50:09 INFO spark.ExecutorAllocationManager: New executor 2 has registered (new total is 1)
17/12/15 09:50:09 INFO storage.BlockManagerMasterEndpoint: Registering block manager cassandranode3:33779 with 1311.0 MB RAM, BlockManagerId(2, cassandranode3, 33779, None)
17/12/15 09:50:11 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.204.211.103:43578) with ID 1
17/12/15 09:50:11 INFO spark.ExecutorAllocationManager: New executor 1 has registered (new total is 2)
17/12/15 09:50:11 INFO storage.BlockManagerMasterEndpoint: Registering block manager cassandranode2:37931 with 1311.0 MB RAM, BlockManagerId(1, cassandranode2, 37931, None)
17/12/15 09:50:11 INFO cluster.YarnClusterSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
17/12/15 09:50:11 INFO cluster.YarnClusterScheduler: YarnClusterScheduler.postStartHook done
17/12/15 09:50:11 INFO internal.SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/tmp/hadoop-root/nm-local-dir/usercache/root/appcache/application_1513329182871_0011/container_1513329182871_0011_01_000001/spark-warehouse').
17/12/15 09:50:11 INFO internal.SharedState: Warehouse path is 'file:/tmp/hadoop-root/nm-local-dir/usercache/root/appcache/application_1513329182871_0011/container_1513329182871_0011_01_000001/spark-warehouse'.
17/12/15 09:50:11 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler#e087bd4{/SQL,null,AVAILABLE,#Spark}
17/12/15 09:50:11 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler#c93af1f{/SQL/json,null,AVAILABLE,#Spark}
17/12/15 09:50:11 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler#53fd3a5d{/SQL/execution,null,AVAILABLE,#Spark}
17/12/15 09:50:11 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler#7dcd6778{/SQL/execution/json,null,AVAILABLE,#Spark}
17/12/15 09:50:11 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler#3a25ecc9{/static/sql,null,AVAILABLE,#Spark}
17/12/15 09:50:12 INFO state.StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
17/12/15 09:51:09 INFO spark.ExecutorAllocationManager: Request to remove executorIds: 2
17/12/15 09:51:11 INFO spark.ExecutorAllocationManager: Request to remove executorIds: 1
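A note on the timing in the driver log above: each executor is asked to be removed exactly 60 seconds after it registered. The spark-defaults shown in this post enable spark.dynamicAllocation.enabled but do not set spark.dynamicAllocation.executorIdleTimeout, whose default is 60s, so the removals look consistent with idle executors being released. That reading is an inference from the timestamps, not something the log states. A minimal sketch for measuring the interval, with the four relevant lines copied from the log (the function name executor_lifetimes is made up for illustration):

```python
from datetime import datetime
import re

# Four lines copied from the driver log in this post.
LOG = """\
17/12/15 09:50:09 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.204.211.105:47622) with ID 2
17/12/15 09:50:11 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.204.211.103:43578) with ID 1
17/12/15 09:51:09 INFO spark.ExecutorAllocationManager: Request to remove executorIds: 2
17/12/15 09:51:11 INFO spark.ExecutorAllocationManager: Request to remove executorIds: 1
"""

TS = "%y/%m/%d %H:%M:%S"  # timestamp layout used by these log lines
REGISTERED = re.compile(r"^(\S+ \S+) .*Registered executor .* with ID (\d+)$")
REMOVED = re.compile(r"^(\S+ \S+) .*Request to remove executorIds: (\d+)$")

def executor_lifetimes(log_text):
    """Return {executor_id: seconds between registration and removal request}."""
    start, lifetimes = {}, {}
    for line in log_text.splitlines():
        m = REGISTERED.match(line)
        if m:
            start[m.group(2)] = datetime.strptime(m.group(1), TS)
            continue
        m = REMOVED.match(line)
        if m and m.group(2) in start:
            end = datetime.strptime(m.group(1), TS)
            lifetimes[m.group(2)] = (end - start[m.group(2)]).total_seconds()
    return lifetimes

print(executor_lifetimes(LOG))  # {'2': 60.0, '1': 60.0}
```

If the executors are meant to stay alive longer while idle, raising spark.dynamicAllocation.executorIdleTimeout is one setting to experiment with.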
spark-defaults.conf:
spark.master yarn
spark.eventLog.enabled true
spark.eventLog.dir file:///home/sparkeventlogs
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.driver.memory 5g
spark.driver.cores 1
spark.yarn.am.memory 2048m
spark.yarn.am.cores 1
spark.submit.deployMode cluster
spark.dynamicAllocation.enabled true
spark.shuffle.service.enabled true
spark.driver.maxResultSize 20g
spark.jars.packages datastax:spark-cassandra-connector:2.0.5-s_2.11
spark.cassandra.connection.host 10.204.211.101,10.204.211.103,10.204.211.105
spark.executor.extraJavaOptions -XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCDateStamps
spark.driver.extraJavaOptions -Dhdp.version=2.7.4
spark.cassandra.read.timeout_ms 180000
spark.yarn.stagingDir hdfs:///tmp
spark.network.timeout 2400
spark.yarn.driver.memoryOverhead 2048
spark.yarn.executor.memoryOverhead 1024
spark.network.timeout 2400
yarn.resourcemanager.app.timeout.minutes=-1
spark.yarn.submit.waitAppCompletion true
spark.sql.inMemoryColumnarStorage.compressed true
spark.sql.inMemoryColumnarStorage.batchSize 10000
Spark Submit command:
spark-submit --class com.swcassandrautil.popstatsclone.popihits --master yarn --deploy-mode cluster --executor-cores 1 --executor-memory 2g --conf spark.dynamicAllocation.initialExecutors=2 --conf spark.dynamicAllocation.maxExecutors=8 --conf spark.dynamicAllocation.minExecutors=2 --conf spark.memory.fraction=0.75 --conf spark.memory.storageFraction=0.75 /scala/statscloneihits/target/scala-2.11/popstatscloneihits_2.11-1.0.jar "/mnt/data/tmp/xyz*" "\t";
I would appreciate your input.
Thanks
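For reference, the container size in the driver log ("3072 MB memory (including 1024 MB of overhead)") matches the settings in this post: --executor-memory 2g plus spark.yarn.executor.memoryOverhead 1024. A small sketch of that arithmetic follows. The helper name container_request_mb is made up for illustration, and the fallback used when no overhead is configured (the larger of 384 MB and 10% of executor memory) is the Spark 2.x default as far as I know. Any rounding YARN applies to its allocation increment is not modelled.

```python
def container_request_mb(executor_memory_mb, overhead_mb=None):
    """Memory Spark asks YARN for per executor container, in MB."""
    if overhead_mb is None:
        # Assumed Spark 2.x default: max(384 MB, 10% of executor memory).
        overhead_mb = max(384, int(executor_memory_mb * 0.10))
    return executor_memory_mb + overhead_mb

# Values from this post: --executor-memory 2g, memoryOverhead 1024.
print(container_request_mb(2048, 1024))  # 3072

# Same executor memory with no explicit overhead configured.
print(container_request_mb(2048))  # 2432
```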

Can't connect slaves to master in Spark

I am using 4 instances on Compute Engine, each running Spark set up with Cloudera Manager. I have no problems starting the master and connecting to it from my local browser, where it shows as spark://instance-1:7077. When I run start-slave on the remaining instances I get no errors, until I look in the log:
16/05/02 13:10:18 INFO worker.Worker: Started daemon with process name: 12612#instance-2.c.cluster1-1294.internal
16/05/02 13:10:18 INFO worker.Worker: Registered signal handlers for [TERM, HUP, INT]
16/05/02 13:10:18 INFO spark.SecurityManager: Changing view acls to: root
16/05/02 13:10:18 INFO spark.SecurityManager: Changing modify acls to: root
16/05/02 13:10:18 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with mod$
16/05/02 13:10:19 INFO util.Utils: Successfully started service 'sparkWorker' on port 60270.
16/05/02 13:10:19 INFO worker.Worker: Starting Spark worker 10.142.0.3:60270 with 2 cores, 6.3 GB RAM
16/05/02 13:10:19 INFO worker.Worker: Running Spark version 1.6.0
16/05/02 13:10:19 INFO worker.Worker: Spark home: /opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/lib/spark
16/05/02 13:10:19 ERROR worker.Worker: Failed to create work directory /opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/lib/spark/work
If I use mkdir to create 'work', it fails with an error saying the directory already exists:
mkdir: cannot create directory ‘work’: File exists
The file does exist, and when I list it with ls it is highlighted in red on a black background. Any help would be appreciated.
This may be a permissions issue.
Try changing the owner of the Spark directory:
$ sudo chown -R your_userName:your_groupName /opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/lib/spark
Then change the mode of the same path:
$ sudo chmod 777 /opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/lib/spark
Also, all the slaves must have SSH access to each other and be able to communicate with one another.
Finally, copy all of the Spark configuration files to the slave nodes as well.
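Before changing ownership, it may be worth checking what 'work' actually is. With the default GNU ls colours, red text on a black background often marks a symlink whose target does not exist. That would also explain why mkdir reports "File exists" while Spark cannot create the directory. This is a guess based on the colour described in the question, not something the log confirms. A minimal sketch for checking (the function name describe_path is made up for illustration):

```python
import os
import stat

def describe_path(path):
    """Classify a path so a 'File exists' error from mkdir can be explained."""
    if not os.path.lexists(path):
        return "missing"
    if os.path.islink(path):
        # os.path.exists follows symlinks, so False here means the target is gone.
        if not os.path.exists(path):
            return "broken symlink -> " + os.readlink(path)
        return "symlink -> " + os.readlink(path)
    if os.path.isdir(path):
        mode = stat.S_IMODE(os.lstat(path).st_mode)
        return "directory, mode " + oct(mode)
    return "exists but is not a directory"

# Path taken from the worker log in the question.
work_dir = "/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/lib/spark/work"
print(describe_path(work_dir))
```

If this reports a broken symlink, creating the link's target directory with suitable ownership for the user running the worker is likely a narrower fix than chmod 777 on the whole Spark directory.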

Resources