Spark executors fail to run on Kubernetes cluster - apache-spark

I am trying to run a simple spark job on a kubernetes cluster. I deployed a pod that starts a pyspark shell and in that shell I am changing the spark configuration as specified below:
>>> sc.stop()
>>> sparkConf = SparkConf()
>>> sparkConf.setMaster("k8s://https://kubernetes.default.svc:443")
>>> sparkConf.setAppName("pyspark_test")
>>> sparkConf.set("spark.submit.deployMode", "client")
>>> sparkConf.set("spark.executor.instances", 2)
>>> sparkConf.set("spark.kubernetes.container.image", "us.icr.io/testspark/spark:v1")
>>> sparkConf.set("spark.kubernetes.namespace", "anonymous")
>>> sparkConf.set("spark.driver.memory", "1g")
>>> sparkConf.set("spark.executor.memory", "1g")
>>> sparkConf.set("spark.driver.host", "testspark")
>>> sparkConf.set("spark.driver.port", "37771")
>>> sparkConf.set("spark.kubernetes.driver.pod.name", "testspark")
>>> sparkConf.set("spark.driver.bindAddress", "0.0.0.0")
>>>
>>> spark = SparkSession.builder.config(conf=sparkConf).getOrCreate()
>>> sc = spark.sparkContext
This starts two new executor pods, but both fail:
satyam@Satyams-MBP ~ % kubectl get pods -n anonymous
NAME READY STATUS RESTARTS AGE
pysparktest-c1c8f177591feb60-exec-1 0/2 Error 0 111m
pysparktest-c1c8f177591feb60-exec-2 0/2 Error 0 111m
testspark 2/2 Running 0 116m
I checked logs for one of the executor pod and it shows following error:
satyam@Satyams-MBP ~ % kubectl logs -n anonymous pysparktest-c1c8f177591feb60-exec-1 -c spark-kubernetes-executor
++ id -u
+ myuid=185
++ id -g
+ mygid=0
+ set +e
++ getent passwd 185
+ uidentry=
+ set -e
+ '[' -z '' ']'
+ '[' -w /etc/passwd ']'
+ echo '185:x:185:0:anonymous uid:/opt/spark:/bin/false'
+ SPARK_CLASSPATH=':/opt/spark/jars/*'
+ env
+ grep SPARK_JAVA_OPT_
+ sed 's/[^=]*=\(.*\)/\1/g'
+ sort -t_ -k4 -n
+ readarray -t SPARK_EXECUTOR_JAVA_OPTS
+ '[' -n '' ']'
+ '[' '' == 2 ']'
+ '[' '' == 3 ']'
+ '[' -n '' ']'
+ '[' -z ']'
+ case "$1" in
+ shift 1
+ CMD=(${JAVA_HOME}/bin/java "${SPARK_EXECUTOR_JAVA_OPTS[@]}" -Xms$SPARK_EXECUTOR_MEMORY -Xmx$SPARK_EXECUTOR_MEMORY -cp "$SPARK_CLASSPATH:$SPARK_DIST_CLASSPATH" org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url $SPARK_DRIVER_URL --executor-id $SPARK_EXECUTOR_ID --cores $SPARK_EXECUTOR_CORES --app-id $SPARK_APPLICATION_ID --hostname $SPARK_EXECUTOR_POD_IP)
+ exec /usr/bin/tini -s -- /usr/local/openjdk-8/bin/java -Dio.netty.tryReflectionSetAccessible=true -Dspark.driver.port=37771 -Xms1g -Xmx1g -cp ':/opt/spark/jars/*:' org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@testspark:37771 --executor-id 1 --cores 1 --app-id spark-application-1612108001966 --hostname 172.30.174.196
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
21/01/31 15:46:49 INFO CoarseGrainedExecutorBackend: Started daemon with process name: 14@pysparktest-c1c8f177591feb60-exec-1
21/01/31 15:46:49 INFO SignalUtils: Registered signal handler for TERM
21/01/31 15:46:49 INFO SignalUtils: Registered signal handler for HUP
21/01/31 15:46:49 INFO SignalUtils: Registered signal handler for INT
21/01/31 15:46:49 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
21/01/31 15:46:49 INFO SecurityManager: Changing view acls to: 185,root
21/01/31 15:46:49 INFO SecurityManager: Changing modify acls to: 185,root
21/01/31 15:46:49 INFO SecurityManager: Changing view acls groups to:
21/01/31 15:46:49 INFO SecurityManager: Changing modify acls groups to:
21/01/31 15:46:49 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(185, root); groups with view permissions: Set(); users with modify permissions: Set(185, root); groups with modify permissions: Set()
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1748)
at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:61)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:283)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:272)
at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:302)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.$anonfun$run$3(CoarseGrainedExecutorBackend.scala:303)
at scala.runtime.java8.JFunction1$mcVI$sp.apply(JFunction1$mcVI$sp.java:23)
at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:877)
at scala.collection.immutable.Range.foreach(Range.scala:158)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:876)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.$anonfun$run$1(CoarseGrainedExecutorBackend.scala:301)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:62)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:61)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
... 4 more
Caused by: java.io.IOException: Failed to connect to testspark/172.30.174.253:37771
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:253)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:195)
at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:204)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:202)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:198)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: testspark/172.30.174.253:37771
Caused by: java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:716)
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:330)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:702)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:748)
I have also created a headless service according to the instruction here https://spark.apache.org/docs/latest/running-on-kubernetes.html#client-mode-networking. Below is the yaml for the service as well as the driver pod:
Service
apiVersion: v1
kind: Service
metadata:
name: testspark
spec:
clusterIP: "None"
selector:
spark-app-selector: testspark
ports:
- name: driver-rpc-port
protocol: TCP
port: 37771
targetPort: 37771
- name: blockmanager
protocol: TCP
port: 37772
targetPort: 37772
Driver Pod
apiVersion: v1
kind: Pod
metadata:
name: testspark
labels:
spark-app-selector: testspark
spec:
containers:
- name: testspark
securityContext:
runAsUser: 0
image: jupyter/pyspark-notebook
ports:
- containerPort: 37771
command: ["tail", "-f", "/dev/null"]
serviceAccountName: default-editor
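As a sanity check, the headless Service only gets an endpoint for the driver if its selector matches the Pod's labels; a toy Python check of that matching rule, using the values copied from the two manifests above:

```python
# Toy check of the mechanics the headless Service relies on: every
# key/value in the Service selector must appear in the Pod's labels
# (values copied from the manifests above).
service_selector = {"spark-app-selector": "testspark"}
pod_labels = {"spark-app-selector": "testspark"}

def selects(selector, labels):
    """True if every selector key/value pair is present in the Pod's labels."""
    return all(labels.get(k) == v for k, v in selector.items())

print(selects(service_selector, pod_labels))  # True -> the Service has an endpoint
```

If this were False, `kubectl describe service testspark` would show no endpoints and the DNS name would not resolve to the driver.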
This should have allowed the executor pods to connect to the driver (which, I checked, has the correct IP 172.30.174.249). To debug the network, I started a shell in the driver container and ran netstat to list the listening ports. Here is the output:
root@testspark:/opt/spark/work-dir# netstat -tulpn
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 127.0.0.1:15000 0.0.0.0:* LISTEN -
tcp 0 0 0.0.0.0:15001 0.0.0.0:* LISTEN -
tcp 0 0 0.0.0.0:15090 0.0.0.0:* LISTEN -
tcp6 0 0 :::4040 :::* LISTEN 35/java
tcp6 0 0 :::37771 :::* LISTEN 35/java
tcp6 0 0 :::15020 :::* LISTEN -
tcp6 0 0 :::41613 :::* LISTEN 35/java
I also tried to connect to the driver pod on the port 37771 via telnet from another running pod on the same namespace and it was able to connect.
root@test:/# telnet 172.30.174.249 37771
Trying 172.30.174.249...
Connected to 172.30.174.249.
Escape character is '^]'.
I am not sure why my executor pods are not able to connect to the driver on the same port. Am I missing any configuration or am I doing anything wrong? I can supply more information if required.
UPDATE
I created a fake Spark executor image with the following Dockerfile:
FROM us.icr.io/testspark/spark:v1
ENTRYPOINT ["tail", "-f", "/dev/null"]
and passed this image as the spark.kubernetes.container.image config while instantiating the Spark context. I got two running executor pods. I exec'd into one of them with kubectl exec -n anonymous -it pysparktest-c1c8f177591feb60-exec-1 -c spark-kubernetes-executor bash and ran /opt/entrypoint.sh executor, and to my surprise the executor could connect to the driver just fine. Here is the log for the same:
++ id -u
+ myuid=185
++ id -g
+ mygid=0
+ set +e
++ getent passwd 185
+ uidentry='185:x:185:0:anonymous uid:/opt/spark:/bin/false'
+ set -e
+ '[' -z '185:x:185:0:anonymous uid:/opt/spark:/bin/false' ']'
+ SPARK_CLASSPATH=':/opt/spark/jars/*'
+ env
+ grep SPARK_JAVA_OPT_
+ sort -t_ -k4 -n
+ sed 's/[^=]*=\(.*\)/\1/g'
+ readarray -t SPARK_EXECUTOR_JAVA_OPTS
+ '[' -n '' ']'
+ '[' '' == 2 ']'
+ '[' '' == 3 ']'
+ '[' -n '' ']'
+ '[' -z ']'
+ case "$1" in
+ shift 1
+ CMD=(${JAVA_HOME}/bin/java "${SPARK_EXECUTOR_JAVA_OPTS[@]}" -Xms$SPARK_EXECUTOR_MEMORY -Xmx$SPARK_EXECUTOR_MEMORY -cp "$SPARK_CLASSPATH:$SPARK_DIST_CLASSPATH" org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url $SPARK_DRIVER_URL --executor-id $SPARK_EXECUTOR_ID --cores $SPARK_EXECUTOR_CORES --app-id $SPARK_APPLICATION_ID --hostname $SPARK_EXECUTOR_POD_IP)
+ exec /usr/bin/tini -s -- /usr/local/openjdk-8/bin/java -Dio.netty.tryReflectionSetAccessible=true -Dspark.driver.port=37771 -Xms1g -Xmx1g -cp ':/opt/spark/jars/*:' org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@testspark.anonymous.svc.cluster.local:37771 --executor-id 1 --cores 1 --app-id spark-application-1612191192882 --hostname 172.30.174.249
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
21/02/01 15:00:16 INFO CoarseGrainedExecutorBackend: Started daemon with process name: 39@pysparktest-27b678775e1556d9-exec-1
21/02/01 15:00:16 INFO SignalUtils: Registered signal handler for TERM
21/02/01 15:00:16 INFO SignalUtils: Registered signal handler for HUP
21/02/01 15:00:16 INFO SignalUtils: Registered signal handler for INT
21/02/01 15:00:16 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
21/02/01 15:00:17 INFO SecurityManager: Changing view acls to: 185,root
21/02/01 15:00:17 INFO SecurityManager: Changing modify acls to: 185,root
21/02/01 15:00:17 INFO SecurityManager: Changing view acls groups to:
21/02/01 15:00:17 INFO SecurityManager: Changing modify acls groups to:
21/02/01 15:00:17 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(185, root); groups with view permissions: Set(); users with modify permissions: Set(185, root); groups with modify permissions: Set()
21/02/01 15:00:17 INFO TransportClientFactory: Successfully created connection to testspark.anonymous.svc.cluster.local/172.30.174.253:37771 after 173 ms (0 ms spent in bootstraps)
21/02/01 15:00:18 INFO SecurityManager: Changing view acls to: 185,root
21/02/01 15:00:18 INFO SecurityManager: Changing modify acls to: 185,root
21/02/01 15:00:18 INFO SecurityManager: Changing view acls groups to:
21/02/01 15:00:18 INFO SecurityManager: Changing modify acls groups to:
21/02/01 15:00:18 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(185, root); groups with view permissions: Set(); users with modify permissions: Set(185, root); groups with modify permissions: Set()
21/02/01 15:00:18 INFO TransportClientFactory: Successfully created connection to testspark.anonymous.svc.cluster.local/172.30.174.253:37771 after 3 ms (0 ms spent in bootstraps)
21/02/01 15:00:18 INFO DiskBlockManager: Created local directory at /var/data/spark-839bad93-b01c-4bc9-a33f-51c7493775e3/blockmgr-ad6a42b9-cfe2-4cdd-aa28-37a0ab77fb16
21/02/01 15:00:18 INFO MemoryStore: MemoryStore started with capacity 413.9 MiB
21/02/01 15:00:19 INFO CoarseGrainedExecutorBackend: Connecting to driver: spark://CoarseGrainedScheduler#testspark.anonymous.svc.cluster.local:37771
21/02/01 15:00:19 INFO ResourceUtils: ==============================================================
21/02/01 15:00:19 INFO ResourceUtils: Resources for spark.executor:
21/02/01 15:00:19 INFO ResourceUtils: ==============================================================
21/02/01 15:00:19 INFO CoarseGrainedExecutorBackend: Successfully registered with driver
21/02/01 15:00:19 INFO Executor: Starting executor ID 1 on host 172.30.174.249
21/02/01 15:00:19 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 40515.
21/02/01 15:00:19 INFO NettyBlockTransferService: Server created on 172.30.174.249:40515
21/02/01 15:00:19 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
21/02/01 15:00:19 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(1, 172.30.174.249, 40515, None)
21/02/01 15:00:19 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(1, 172.30.174.249, 40515, None)
21/02/01 15:00:19 INFO BlockManager: Initialized BlockManager: BlockManagerId(1, 172.30.174.249, 40515, None)
I am actually puzzled why this might be happening. Is there any workaround I can try to get this working automatically, instead of having to run it manually?

I was finally able to solve this problem with the help of my colleague. I just added these two configs to disable the Istio sidecar injection and it started working.
sparkConf.set("spark.kubernetes.driver.annotation.sidecar.istio.io/inject", "false")
sparkConf.set("spark.kubernetes.executor.annotation.sidecar.istio.io/inject", "false")
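For reference, the full set of options from the question plus the two sidecar annotations can be kept together in one place. A sketch using plain key/value pairs (with pyspark available, they would be applied one by one via SparkConf.set, as in the session at the top of the question):

```python
# Sketch: the client-mode configuration from this question, plus the fix.
# All names/ports are the ones used above; this is illustrative, not a
# drop-in SparkConf (apply the pairs via SparkConf().set(k, v) in pyspark).
conf_pairs = {
    "spark.submit.deployMode": "client",
    "spark.executor.instances": "2",
    "spark.kubernetes.namespace": "anonymous",
    "spark.kubernetes.driver.pod.name": "testspark",
    "spark.driver.host": "testspark",
    "spark.driver.port": "37771",
    "spark.driver.bindAddress": "0.0.0.0",
    # The fix: keep the Envoy sidecar out of the driver and executor pods
    # so the executor -> driver RPC on 37771 is not intercepted.
    "spark.kubernetes.driver.annotation.sidecar.istio.io/inject": "false",
    "spark.kubernetes.executor.annotation.sidecar.istio.io/inject": "false",
}

for key, value in sorted(conf_pairs.items()):
    print(f"{key}={value}")
```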

I don't have much experience with PySpark, but I once set up Java Spark to run on a Kubernetes cluster in client mode, like you are trying now, and I believe the configuration should mostly be the same.
First of all, you should check whether the headless service is working as expected, starting with:
kubectl describe service -n anonymous testspark
and see if there are any endpoints in the whole description. Second, from inside one of your Pods, you could check whether nslookup resolves the hostname you are expecting your driver Pod to have:
kubectl exec -n <namespace> -it <pod-name> -- bash  # exec into a Pod which has nslookup
nslookup testspark
nslookup testspark.testspark
If the names do resolve correctly to the driver Pod's current IP address, then it could be something related to the Spark configuration.
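Equivalently, the resolution check can be scripted; a minimal Python sketch (the hostnames from this question only resolve from inside the cluster, so the example below only exercises localhost):

```python
import socket

# Minimal DNS-resolution check, mirroring the nslookup step above.
def resolve(hostname):
    """Return the IPv4 address the resolver gives for hostname, or None on failure."""
    try:
        return socket.gethostbyname(hostname)
    except socket.gaierror:
        return None

# In-cluster you would try resolve("testspark") and expect the driver Pod IP;
# here we just sanity-check the helper itself.
print(resolve("localhost"))
```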
The only difference I found between your configuration and the one I was using with Java is that, as spark.driver.host, I was using something like:
service-name.namespace.svc.cluster.local
but, in theory, it should be the same.
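For illustration, that FQDN is just the service name, the namespace, and the cluster DNS suffix joined together; a hypothetical helper using the names from this question:

```python
# Hypothetical helper: build the in-cluster FQDN recommended above for
# spark.driver.host ("testspark" / "anonymous" are from the question;
# "svc.cluster.local" is the default cluster DNS suffix).
def driver_host(service, namespace, suffix="svc.cluster.local"):
    return f"{service}.{namespace}.{suffix}"

print(driver_host("testspark", "anonymous"))
# testspark.anonymous.svc.cluster.local
```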
Another thing that could be useful in debugging the problem is a describe of one of the executor Pods, just to check that the configurations are all correct.
Edit:
This is the spark-submit command I was using:
sh /opt/spark/bin/spark-submit --master k8s://https://master-ip:6443 --deploy-mode client --name ${POD_NAME} --class "my.spark.application.main.Processor" --conf "spark.executor.instances=2" --conf "spark.kubernetes.namespace=${POD_NAMESPACE}" --conf "spark.kubernetes.driver.pod.name=${POD_NAME}" --conf "spark.kubernetes.container.image=${SPARK_IMAGE}" --conf "spark.kubernetes.executor.request.cores=2" --conf "spark.kubernetes.executor.limit.cores=2" --conf "spark.driver.memory=4g" --conf "spark.executor.cores=2" --conf "spark.executor.memory=4g" --conf "spark.driver.host=${SPARK_DRIVER_HOST}" --conf "spark.ui.dagGraph.retainedRootRDDs=1000" --conf "spark.driver.port=${SPARK_DRIVER_PORT}" --conf "spark.driver.extraJavaOptions=${DRIVER_JAVA_ARGS}" --conf "spark.executor.extraJavaOptions=${EXECUTOR_JAVA_ARGS}" /opt/spark/jars/app.jar -jobConfig ${JOB_CONFIGURATION}
And most of the configurations, like spark.driver.port or spark.driver.host, were essentially the same as yours, differing only in names and ports.

Related

How to solve "no org.apache.spark.deploy.worker.Worker to stop" issue?

I am using a Spark standalone cluster in Google Cloud, composed of 1 master and 4 worker nodes. When I start the cluster, I can see the master and workers running. But when I try to stop-all, I get the following issue. Maybe this is the reason I cannot run spark-submit. How do I solve this issue? The following is the terminal output.
sparkuser@master:/opt/spark/logs$ jps
1867 Jps
sparkuser@master:/opt/spark/logs$ start-all.sh
starting org.apache.spark.deploy.master.Master, logging to /opt/spark/logs/spark-sparkuser-org.apache.spark.deploy.master.Master-1-master.out
worker4: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/logs/spark-sparkuser-org.apache.spark.deploy.worker.Worker-1-worker4.out
worker1: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/logs/spark-sparkuser-org.apache.spark.deploy.worker.Worker-1-worker1.out
worker2: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/logs/spark-sparkuser-org.apache.spark.deploy.worker.Worker-1-worker2.out
worker3: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/logs/spark-sparkuser-org.apache.spark.deploy.worker.Worker-1-worker3.out
sparkuser@master:/opt/spark/logs$ jps -lm
1946 sun.tools.jps.Jps -lm
1886 org.apache.spark.deploy.master.Master --host master --port 7077 --webui-port 8080
sparkuser@master:/opt/spark/logs$ cat spark-sparkuser-org.apache.spark.deploy.master.Master-1-master.out
Spark Command: /usr/lib/jvm/jdk1.8.0_202/bin/java -cp /opt/spark/conf/:/opt/spark/jars/* -Xmx1g org.apache.spark.deploy.master.Master --host master --port 7077 --webui-port 8080
========================================
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
22/10/13 04:28:23 INFO Master: Started daemon with process name: 1886@master
22/10/13 04:28:23 INFO SignalUtils: Registering signal handler for TERM
22/10/13 04:28:23 INFO SignalUtils: Registering signal handler for HUP
22/10/13 04:28:23 INFO SignalUtils: Registering signal handler for INT
22/10/13 04:28:24 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/10/13 04:28:24 INFO SecurityManager: Changing view acls to: sparkuser
22/10/13 04:28:24 INFO SecurityManager: Changing modify acls to: sparkuser
22/10/13 04:28:24 INFO SecurityManager: Changing view acls groups to:
22/10/13 04:28:24 INFO SecurityManager: Changing modify acls groups to:
22/10/13 04:28:24 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(sparkuser); groups with view permissions: Set(); users with modify permissions: Set(sparkuser); groups with modify permissions: Set()
22/10/13 04:28:24 INFO Utils: Successfully started service 'sparkMaster' on port 7077.
22/10/13 04:28:24 INFO Master: Starting Spark master at spark://master:7077
22/10/13 04:28:24 INFO Master: Running Spark version 3.2.2
22/10/13 04:28:25 INFO Utils: Successfully started service 'MasterUI' on port 8080.
22/10/13 04:28:25 INFO MasterWebUI: Bound MasterWebUI to 127.0.0.1, and started at http://localhost:8080
22/10/13 04:28:25 INFO Master: I have been elected leader! New state: ALIVE
sparkuser@master:/opt/spark/logs$ stop-all.sh
worker2: no org.apache.spark.deploy.worker.Worker to stop
worker4: no org.apache.spark.deploy.worker.Worker to stop
worker1: no org.apache.spark.deploy.worker.Worker to stop
worker3: no org.apache.spark.deploy.worker.Worker to stop
stopping org.apache.spark.deploy.master.Master
sparkuser@master:/opt/spark/logs$

Spark on Kubernetes does not start executors, does not even try. Why?

Following the instructions, I am trying to deploy my pyspark app on the Azure AKS free tier with spark.executor.instances=5:
spark-submit \
--master k8s://https://xxxxxxx-xxxxxxx.hcp.westeurope.azmk8s.io:443 \
--deploy-mode cluster \
--name sparkbasics \
--conf spark.executor.instances=5 \
--conf spark.kubernetes.container.image=aosb06.azurecr.io/sparkbasics:v300 \
local:///opt/spark/work-dir/main.py
Everything works fine (including the application itself), except that I see no executor pods at all, only the driver pod.
kubectl get pods
NAME READY STATUS RESTARTS AGE
sparkbasics-f374377b3c78ac68-driver 0/1 Completed 0 52m
Dockerfile is from Spark distribution.
What can the issue be? Is there a problem with resource allocation?
There seem to be no issues in the driver logs.
kubectl logs <driver-pod>
2021-08-12 22:25:54,332 INFO spark.SparkContext: Running Spark version 3.1.2
2021-08-12 22:25:54,378 INFO resource.ResourceUtils: ==============================================================
2021-08-12 22:25:54,378 INFO resource.ResourceUtils: No custom resources configured for spark.driver.
2021-08-12 22:25:54,379 INFO resource.ResourceUtils: ==============================================================
2021-08-12 22:25:54,379 INFO spark.SparkContext: Submitted application: SimpleApp
2021-08-12 22:25:54,403 INFO resource.ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , memory -> name: memory, amount: 1024, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0)
2021-08-12 22:25:54,422 INFO resource.ResourceProfile: Limiting resource is cpu
2021-08-12 22:25:54,422 INFO resource.ResourceProfileManager: Added ResourceProfile id: 0
2021-08-12 22:25:54,475 INFO spark.SecurityManager: Changing view acls to: 185,aovsyannikov
2021-08-12 22:25:54,475 INFO spark.SecurityManager: Changing modify acls to: 185,aovsyannikov
2021-08-12 22:25:54,475 INFO spark.SecurityManager: Changing view acls groups to:
2021-08-12 22:25:54,475 INFO spark.SecurityManager: Changing modify acls groups to:
2021-08-12 22:25:54,475 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(185, aovsyannikov); groups with view permissions: Set(); users with modify permissions: Set(185, aovsyannikov); groups with modify permissions: Set()
2021-08-12 22:25:54,717 INFO util.Utils: Successfully started service 'sparkDriver' on port 7078.
2021-08-12 22:25:54,781 INFO spark.SparkEnv: Registering MapOutputTracker
2021-08-12 22:25:54,818 INFO spark.SparkEnv: Registering BlockManagerMaster
2021-08-12 22:25:54,843 INFO storage.BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
2021-08-12 22:25:54,844 INFO storage.BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
2021-08-12 22:25:54,848 INFO spark.SparkEnv: Registering BlockManagerMasterHeartbeat
2021-08-12 22:25:54,862 INFO storage.DiskBlockManager: Created local directory at /var/data/spark-1e9aa64b-e0a1-44ae-a097-ebb3c2f32404/blockmgr-c51b9095-5426-4a00-b17a-461de2b80357
2021-08-12 22:25:54,892 INFO memory.MemoryStore: MemoryStore started with capacity 413.9 MiB
2021-08-12 22:25:54,909 INFO spark.SparkEnv: Registering OutputCommitCoordinator
2021-08-12 22:25:55,023 INFO util.log: Logging initialized #3324ms to org.sparkproject.jetty.util.log.Slf4jLog
2021-08-12 22:25:55,114 INFO server.Server: jetty-9.4.40.v20210413; built: 2021-04-13T20:42:42.668Z; git: b881a572662e1943a14ae12e7e1207989f218b74; jvm 1.8.0_275-b01
2021-08-12 22:25:55,139 INFO server.Server: Started #3442ms
2021-08-12 22:25:55,184 INFO server.AbstractConnector: Started ServerConnector@59b3b32{HTTP/1.1, (http/1.1)}{0.0.0.0:4040}
2021-08-12 22:25:55,184 INFO util.Utils: Successfully started service 'SparkUI' on port 4040.
kubectl describe pod <driver-pod>
Name: sparkbasics-f374377b3c78ac68-driver
Namespace: default
Priority: 0
Node: aks-default-31057657-vmss000000/10.240.0.4
Start Time: Fri, 13 Aug 2021 01:25:47 +0300
Labels: spark-app-selector=spark-256cc7f64af9451b89e0098397980974
spark-role=driver
Annotations: <none>
Status: Succeeded
IP: 10.244.0.28
IPs:
IP: 10.244.0.28
Containers:
spark-kubernetes-driver:
Container ID: containerd://b572a4056014cd4b0520b808d64d766254d30c44ba12fc98717aee3b4814f17d
Image: aosb06.azurecr.io/sparkbasics:v300
Image ID: aosb06.azurecr.io/sparkbasics@sha256:965393784488025fffc7513edcb4a62333ba59a5ee3076346fd8d335e1715213
Ports: 7078/TCP, 7079/TCP, 4040/TCP
Host Ports: 0/TCP, 0/TCP, 0/TCP
Args:
driver
--properties-file
/opt/spark/conf/spark.properties
--class
org.apache.spark.deploy.PythonRunner
local:///opt/spark/work-dir/main.py
State: Terminated
Reason: Completed
Exit Code: 0
Started: Fri, 13 Aug 2021 01:25:51 +0300
Finished: Fri, 13 Aug 2021 01:56:40 +0300
Ready: False
Restart Count: 0
Limits:
memory: 1433Mi
Requests:
cpu: 1
memory: 1433Mi
Environment:
SPARK_USER: aovsyannikov
SPARK_APPLICATION_ID: spark-256cc7f64af9451b89e0098397980974
SPARK_DRIVER_BIND_ADDRESS: (v1:status.podIP)
SB_KEY_STORAGE: <set to the key 'STORAGE' in secret 'sparkbasics'> Optional: false
SB_KEY_OPENCAGE: <set to the key 'OPENCAGE' in secret 'sparkbasics'> Optional: false
SB_KEY_STORAGEOUT: <set to the key 'STORAGEOUT' in secret 'sparkbasics'> Optional: false
SPARK_LOCAL_DIRS: /var/data/spark-1e9aa64b-e0a1-44ae-a097-ebb3c2f32404
SPARK_CONF_DIR: /opt/spark/conf
Mounts:
/opt/spark/conf from spark-conf-volume-driver (rw)
/var/data/spark-1e9aa64b-e0a1-44ae-a097-ebb3c2f32404 from spark-local-dir-1 (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-wlqjt (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
spark-local-dir-1:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
spark-conf-volume-driver:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: spark-drv-6f83b17b3c78af1f-conf-map
Optional: false
default-token-wlqjt:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-wlqjt
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events: <none>
I found a mistake in the pyspark app itself.
...
SparkSession.builder.master("local")
...
It should be without the master:
...
SparkSession.builder
...
as simple as that :(
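The reason is precedence: settings applied in application code override what spark-submit injects, so the hard-coded local master silently replaced the k8s:// master and no executors were ever requested. A toy illustration of that override (the k8s URL here is a placeholder):

```python
# Toy illustration of why .master("local") broke executor scheduling:
# conf set in application code overrides what spark-submit injected.
submit_conf = {"spark.master": "k8s://https://example:443"}  # set by spark-submit (placeholder URL)
code_conf = {"spark.master": "local"}                        # SparkSession.builder.master("local")

effective = {**submit_conf, **code_conf}  # code-level settings win
print(effective["spark.master"])  # local -> no Kubernetes executors requested
```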

Spark worker does not resolve host on ECS but works fine with IP address

I'm trying to run Spark 3.0 on AWS ECS on EC2. I have a spark-worker service and a spark-master service. When I try to run the worker with the hostname of the master (as exposed through ECS service discovery), it fails to resolve. When I put in the hard-coded IP address/port, it works.
Here are some commands I ran inside the worker Docker container after I SSH'd into the EC2 instance backing the ECS cluster:
# as can be seen below, the master host is reachable from the worker Docker container
root@b87fad6a3ffa:/usr/spark-3.0.0# ping spark_master.mynamespace
PING spark_master.mynamespace (172.21.60.11) 56(84) bytes of data.
64 bytes from ip-172-21-60-11.eu-west-1.compute.internal (172.21.60.11): icmp_seq=1 ttl=254 time=0.370 ms
# the following works just fine -- starting the worker successfully and connecting to the master:
root@b87fad6a3ffa:/usr/spark-3.0.0# /bin/sh -c "bin/spark-class org.apache.spark.deploy.worker.Worker spark://172.21.60.11:7077"
# !!! this is the fail
root@b87fad6a3ffa:/usr/spark-3.0.0# /bin/sh -c "bin/spark-class org.apache.spark.deploy.worker.Worker spark://spark_master.mynamespace:7077"
20/07/01 21:03:41 INFO worker.Worker: Started daemon with process name: 422@b87fad6a3ffa
20/07/01 21:03:41 INFO util.SignalUtils: Registered signal handler for TERM
20/07/01 21:03:41 INFO util.SignalUtils: Registered signal handler for HUP
20/07/01 21:03:41 INFO util.SignalUtils: Registered signal handler for INT
20/07/01 21:03:41 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/07/01 21:03:42 INFO spark.SecurityManager: Changing view acls to: root
20/07/01 21:03:42 INFO spark.SecurityManager: Changing modify acls to: root
20/07/01 21:03:42 INFO spark.SecurityManager: Changing view acls groups to:
20/07/01 21:03:42 INFO spark.SecurityManager: Changing modify acls groups to:
20/07/01 21:03:42 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
20/07/01 21:03:42 INFO util.Utils: Successfully started service 'sparkWorker' on port 39915.
20/07/01 21:03:42 ERROR util.SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[main,5,main]
org.apache.spark.SparkException: Invalid master URL: spark://spark_master.mynamespace:7077
at org.apache.spark.util.Utils$.extractHostPortFromSparkUrl(Utils.scala:2397)
at org.apache.spark.rpc.RpcAddress$.fromSparkURL(RpcAddress.scala:47)
at org.apache.spark.deploy.worker.Worker$.$anonfun$startRpcEnvAndEndpoint$3(Worker.scala:859)
at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
at scala.collection.TraversableLike.map(TraversableLike.scala:238)
at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
at org.apache.spark.deploy.worker.Worker$.startRpcEnvAndEndpoint(Worker.scala:859)
at org.apache.spark.deploy.worker.Worker$.main(Worker.scala:828)
at org.apache.spark.deploy.worker.Worker.main(Worker.scala)
20/07/01 21:03:42 INFO util.ShutdownHookManager: Shutdown hook called
# following is just FYI
root@b87fad6a3ffa:/usr/spark-3.0.0# /bin/sh -c "bin/spark-class org.apache.spark.deploy.worker.Worker --help"
20/07/01 21:16:10 INFO worker.Worker: Started daemon with process name: 552@b87fad6a3ffa
20/07/01 21:16:10 INFO util.SignalUtils: Registered signal handler for TERM
20/07/01 21:16:10 INFO util.SignalUtils: Registered signal handler for HUP
20/07/01 21:16:10 INFO util.SignalUtils: Registered signal handler for INT
Usage: Worker [options] <master>
Master must be a URL of the form spark://hostname:port
Options:
-c CORES, --cores CORES Number of cores to use
-m MEM, --memory MEM Amount of memory to use (e.g. 1000M, 2G)
-d DIR, --work-dir DIR Directory to run apps in (default: SPARK_HOME/work)
-i HOST, --ip IP Hostname to listen on (deprecated, please use --host or -h)
-h HOST, --host HOST Hostname to listen on
-p PORT, --port PORT Port to listen on (default: random)
--webui-port PORT Port for web UI (default: 8081)
--properties-file FILE Path to a custom Spark properties file.
Default is conf/spark-defaults.conf.
...
The master node itself works just fine; I can see its admin UI on port 8080, etc.
Any idea why Spark does not resolve the hostname but only works with an IP address?
The issue was the underscore (_) I used in the hostname. When I renamed spark_master and spark_worker to use - instead, the problem was solved.
Relevant links:
https://bugs.java.com/bugdatabase/view_bug.do?bug_id=6587184
URI - getHost returns null. Why?
The relevant piece of code from the Spark codebase:
def extractHostPortFromSparkUrl(sparkUrl: String): (String, Int) = {
  try {
    val uri = new java.net.URI(sparkUrl)
    val host = uri.getHost
    val port = uri.getPort
    if (uri.getScheme != "spark" ||
        host == null ||
        port < 0 ||
        (uri.getPath != null && !uri.getPath.isEmpty) || // uri.getPath returns "" instead of null
        uri.getFragment != null ||
        uri.getQuery != null ||
        uri.getUserInfo != null) {
      throw new SparkException("Invalid master URL: " + sparkUrl)
    }
    (host, port)
  } catch {
    case e: java.net.URISyntaxException =>
      throw new SparkException("Invalid master URL: " + sparkUrl, e)
  }
}
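The failure can be reproduced outside Spark: java.net.URI.getHost returns null when the hostname contains an underscore (an illegal hostname character per the URI syntax rules; see the JDK bug linked above), which trips the host == null check in extractHostPortFromSparkUrl. A minimal sketch (the class and helper names are mine):

```java
import java.net.URI;
import java.net.URISyntaxException;

public class UnderscoreHost {
    // Returns the host that java.net.URI extracts from the URL,
    // or null when the authority is not a valid host (e.g. underscores).
    static String hostOf(String url) {
        try {
            return new URI(url).getHost();
        } catch (URISyntaxException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Underscore hostname: getHost() is null, so Spark rejects the URL.
        System.out.println(hostOf("spark://spark_master.mynamespace:7077")); // null
        // Hyphenated hostname parses as expected.
        System.out.println(hostOf("spark://spark-master.mynamespace:7077")); // spark-master.mynamespace
    }
}
```

This is why renaming the services from spark_master/spark_worker to hyphenated names (or using a bare IP) makes the "Invalid master URL" error go away.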

Apache Spark: worker can't connect to master but can ping and ssh from worker to master

I'm trying to set up an 8-node cluster on 8 RHEL 7.3 x86 machines using Spark 2.0.1. start-master.sh goes through fine:
Spark Command: /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.102-4.b14.el7.x86_64/jre/bin/java -cp /usr/local/bin/spark-2.0.1-bin-hadoop2.7/conf/:/usr/local/bin/spark-2.0.1-bin-hadoop2.7/jars/* -Xmx1g org.apache.spark.deploy.master.Master --host lambda.foo.net --port 7077 --webui-port 8080
========================================
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/12/08 04:26:46 INFO Master: Started daemon with process name: 22181@lambda.foo.net
16/12/08 04:26:46 INFO SignalUtils: Registered signal handler for TERM
16/12/08 04:26:46 INFO SignalUtils: Registered signal handler for HUP
16/12/08 04:26:46 INFO SignalUtils: Registered signal handler for INT
16/12/08 04:26:46 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/12/08 04:26:46 INFO SecurityManager: Changing view acls to: root
16/12/08 04:26:46 INFO SecurityManager: Changing modify acls to: root
16/12/08 04:26:46 INFO SecurityManager: Changing view acls groups to:
16/12/08 04:26:46 INFO SecurityManager: Changing modify acls groups to:
16/12/08 04:26:46 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
16/12/08 04:26:46 INFO Utils: Successfully started service 'sparkMaster' on port 7077.
16/12/08 04:26:46 INFO Master: Starting Spark master at spark://lambda.foo.net:7077
16/12/08 04:26:46 INFO Master: Running Spark version 2.0.1
16/12/08 04:26:46 INFO Utils: Successfully started service 'MasterUI' on port 8080.
16/12/08 04:26:46 INFO MasterWebUI: Bound MasterWebUI to 0.0.0.0, and started at http://19.341.11.212:8080
16/12/08 04:26:46 INFO Utils: Successfully started service on port 6066.
16/12/08 04:26:46 INFO StandaloneRestServer: Started REST server for submitting applications on port 6066
16/12/08 04:26:46 INFO Master: I have been elected leader! New state: ALIVE
But when I try to bring up the workers, using start-slaves.sh, what I see in the log of the workers is:
Spark Command: /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.102-4.b14.el7.x86_64/jre/bin/java -cp /usr/local/bin/spark-2.0.1-bin-hadoop2.7/conf/:/usr/local/bin/spark-2.0.1-bin-hadoop2.7/jars/* -Xmx1g org.apache.spark.deploy.worker.Worker --webui-port 8081 spark://lambda.foo.net:7077
========================================
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/12/08 04:30:00 INFO Worker: Started daemon with process name: 14649@hawk040os4.foo.net
16/12/08 04:30:00 INFO SignalUtils: Registered signal handler for TERM
16/12/08 04:30:00 INFO SignalUtils: Registered signal handler for HUP
16/12/08 04:30:00 INFO SignalUtils: Registered signal handler for INT
16/12/08 04:30:00 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/12/08 04:30:00 INFO SecurityManager: Changing view acls to: root
16/12/08 04:30:00 INFO SecurityManager: Changing modify acls to: root
16/12/08 04:30:00 INFO SecurityManager: Changing view acls groups to:
16/12/08 04:30:00 INFO SecurityManager: Changing modify acls groups to:
16/12/08 04:30:00 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
16/12/08 04:30:00 INFO Utils: Successfully started service 'sparkWorker' on port 35858.
16/12/08 04:30:00 INFO Worker: Starting Spark worker 15.242.22.179:35858 with 24 cores, 1510.2 GB RAM
16/12/08 04:30:00 INFO Worker: Running Spark version 2.0.1
16/12/08 04:30:00 INFO Worker: Spark home: /usr/local/bin/spark-2.0.1-bin-hadoop2.7
16/12/08 04:30:00 INFO Utils: Successfully started service 'WorkerUI' on port 8081.
16/12/08 04:30:00 INFO WorkerWebUI: Bound WorkerWebUI to 0.0.0.0, and started at http://15.242.22.179:8081
16/12/08 04:30:00 INFO Worker: Connecting to master lambda.foo.net:7077...
16/12/08 04:30:00 WARN Worker: Failed to connect to master lambda.foo.net:7077
org.apache.spark.SparkException: Exception thrown in awaitResult
at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:88)
at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:96)
at org.apache.spark.deploy.worker.Worker$$anonfun$org$apache$spark$deploy$worker$Worker$$tryRegisterAllMasters$1$$anon$1.run(Worker.scala:216)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Failed to connect to lambda.foo.net/19.341.11.212:7077
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:179)
at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:197)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:191)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:187)
... 4 more
Caused by: java.net.NoRouteToHostException: No route to host: lambda.foo.net/19.341.11.212:7077
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
... 1 more
16/12/08 04:30:12 INFO Worker: Retrying connection to master (attempt # 1)
16/12/08 04:30:12 INFO Worker: Connecting to master lambda.foo.net:7077...
16/12/08 04:30:12 WARN Worker: Failed to connect to master lambda.foo.net:7077
org.apache.spark.SparkException: Exception thrown in awaitResult
at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
So it says "No route to host". But I could successfully ping the master from the worker node, as well as ssh from the worker to the master node.
Why does Spark say "No route to host"?
Problem solved: the firewall was blocking the packets.
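A firewall that allows ICMP and SSH but drops other TCP ports produces exactly this symptom: ping and ssh succeed while port 7077 is unreachable. A plain TCP connect reproduces the worker's NoRouteToHostException independently of Spark's RPC layer. A small hypothetical checker (the class and method names are mine; the host and port are taken from the worker log above):

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class PortCheck {
    // Attempts a TCP connect with a timeout; returns true when the port
    // is reachable, false on connection refused, no route, or timeout.
    static boolean canConnect(String host, int port, int timeoutMs) {
        try (Socket s = new Socket()) {
            s.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Master host/port from the logs above; expect false while the firewall blocks it.
        System.out.println(canConnect("lambda.foo.net", 7077, 3000));
    }
}
```

On RHEL 7 the usual remedy (assuming firewalld is the active firewall) is to open the master port on the master node, e.g. firewall-cmd --permanent --add-port=7077/tcp followed by firewall-cmd --reload.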

Spark UI is inaccessible from host for sequenceiq/spark:1.6.0

I used the commands given below. I am able to run the Spark Pi example and standalone programs, but the Spark UI is not accessible from the host. Any help?
docker pull sequenceiq/spark:1.6.0
docker build --rm -t sequenceiq/spark:1.6.0 .
docker run -it -p 8088:8088 -p 8042:8042 -p 4040:4040 -h sandbox sequenceiq/spark:1.6.0 bash
docker run -i -t -v /scratch:/mnt -e 8088 -e 8042 -e 4040 sequenceiq/spark:1.6.0 /bin/bash
docker run -d -h sandbox sequenceiq/spark:1.6.0 -d
#run the spark shell
spark-shell \
--master yarn-client \
--driver-memory 1g \
--executor-memory 1g \
--executor-cores 1
#execute the following command, which should return 1000
scala> sc.parallelize(1 to 1000).count()
Spark shell snippet:
16/12/08 22:42:27 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@172.17.0.5:43371]
16/12/08 22:42:27 INFO util.Utils: Successfully started service 'sparkDriverActorSystem' on port 43371.
16/12/08 22:42:27 INFO spark.SparkEnv: Registering MapOutputTracker
16/12/08 22:42:27 INFO spark.SparkEnv: Registering BlockManagerMaster
16/12/08 22:42:27 INFO storage.DiskBlockManager: Created local directory at /tmp/blockmgr-61bf4e4d-e0f9-48a9-924f-8971c3d916e7
16/12/08 22:42:27 INFO storage.MemoryStore: MemoryStore started with capacity 511.5 MB
16/12/08 22:42:28 INFO spark.SparkEnv: Registering OutputCommitCoordinator
16/12/08 22:42:28 INFO server.Server: jetty-8.y.z-SNAPSHOT
16/12/08 22:42:28 INFO server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040
16/12/08 22:42:28 INFO util.Utils: Successfully started service 'SparkUI' on port 4040.
16/12/08 22:42:28 INFO ui.SparkUI: Started SparkUI at http://172.17.0.5:4040
16/12/08 22:42:28 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
Try setting SPARK_LOCAL_IP to the host's IP and then try to access the UI.
Got the answer: I was running the spark shell with YARN as the master, so the job was visible under the ResourceManager UI.
If I run the spark shell with local as the master, the Spark UI works fine.
Check which master you are running and open the corresponding UI.
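In other words, the UI port depends on the master: with a local master the driver serves its own UI on 4040 (published by -p 4040:4040 above), while under YARN the application UI is reached through the ResourceManager on 8088. A quick hypothetical probe of both endpoints (class and method names are mine; the container IP is the one from the shell snippet above):

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class UiProbe {
    // Returns the HTTP status code for a UI endpoint, or -1 if unreachable.
    static int probe(String url) {
        try {
            HttpURLConnection c = (HttpURLConnection) new URL(url).openConnection();
            c.setConnectTimeout(2000);
            c.setReadTimeout(2000);
            int code = c.getResponseCode();
            c.disconnect();
            return code;
        } catch (IOException e) {
            return -1;
        }
    }

    public static void main(String[] args) {
        System.out.println(probe("http://172.17.0.5:4040")); // driver UI (local master)
        System.out.println(probe("http://172.17.0.5:8088")); // YARN ResourceManager UI
    }
}
```

Whichever endpoint answers tells you which UI your current master setting actually exposes from the host.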
