Following the instructions, I am trying to deploy my PySpark app on an Azure AKS free tier cluster with spark.executor.instances=5:
spark-submit \
--master k8s://https://xxxxxxx-xxxxxxx.hcp.westeurope.azmk8s.io:443 \
--deploy-mode cluster \
--name sparkbasics \
--conf spark.executor.instances=5 \
--conf spark.kubernetes.container.image=aosb06.azurecr.io/sparkbasics:v300 \
local:///opt/spark/work-dir/main.py
Everything works fine (including the application itself), except that I see no executor pods at all, only the driver pod.
kubectl get pods
NAME READY STATUS RESTARTS AGE
sparkbasics-f374377b3c78ac68-driver 0/1 Completed 0 52m
The Dockerfile is from the Spark distribution.
What could be the issue? Is there a problem with resource allocation?
The driver logs seem to show no issues.
kubectl logs <driver-pod>
2021-08-12 22:25:54,332 INFO spark.SparkContext: Running Spark version 3.1.2
2021-08-12 22:25:54,378 INFO resource.ResourceUtils: ==============================================================
2021-08-12 22:25:54,378 INFO resource.ResourceUtils: No custom resources configured for spark.driver.
2021-08-12 22:25:54,379 INFO resource.ResourceUtils: ==============================================================
2021-08-12 22:25:54,379 INFO spark.SparkContext: Submitted application: SimpleApp
2021-08-12 22:25:54,403 INFO resource.ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , memory -> name: memory, amount: 1024, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0)
2021-08-12 22:25:54,422 INFO resource.ResourceProfile: Limiting resource is cpu
2021-08-12 22:25:54,422 INFO resource.ResourceProfileManager: Added ResourceProfile id: 0
2021-08-12 22:25:54,475 INFO spark.SecurityManager: Changing view acls to: 185,aovsyannikov
2021-08-12 22:25:54,475 INFO spark.SecurityManager: Changing modify acls to: 185,aovsyannikov
2021-08-12 22:25:54,475 INFO spark.SecurityManager: Changing view acls groups to:
2021-08-12 22:25:54,475 INFO spark.SecurityManager: Changing modify acls groups to:
2021-08-12 22:25:54,475 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(185, aovsyannikov); groups with view permissions: Set(); users with modify permissions: Set(185, aovsyannikov); groups with modify permissions: Set()
2021-08-12 22:25:54,717 INFO util.Utils: Successfully started service 'sparkDriver' on port 7078.
2021-08-12 22:25:54,781 INFO spark.SparkEnv: Registering MapOutputTracker
2021-08-12 22:25:54,818 INFO spark.SparkEnv: Registering BlockManagerMaster
2021-08-12 22:25:54,843 INFO storage.BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
2021-08-12 22:25:54,844 INFO storage.BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
2021-08-12 22:25:54,848 INFO spark.SparkEnv: Registering BlockManagerMasterHeartbeat
2021-08-12 22:25:54,862 INFO storage.DiskBlockManager: Created local directory at /var/data/spark-1e9aa64b-e0a1-44ae-a097-ebb3c2f32404/blockmgr-c51b9095-5426-4a00-b17a-461de2b80357
2021-08-12 22:25:54,892 INFO memory.MemoryStore: MemoryStore started with capacity 413.9 MiB
2021-08-12 22:25:54,909 INFO spark.SparkEnv: Registering OutputCommitCoordinator
2021-08-12 22:25:55,023 INFO util.log: Logging initialized #3324ms to org.sparkproject.jetty.util.log.Slf4jLog
2021-08-12 22:25:55,114 INFO server.Server: jetty-9.4.40.v20210413; built: 2021-04-13T20:42:42.668Z; git: b881a572662e1943a14ae12e7e1207989f218b74; jvm 1.8.0_275-b01
2021-08-12 22:25:55,139 INFO server.Server: Started #3442ms
2021-08-12 22:25:55,184 INFO server.AbstractConnector: Started ServerConnector#59b3b32{HTTP/1.1, (http/1.1)}{0.0.0.0:4040}
2021-08-12 22:25:55,184 INFO util.Utils: Successfully started service 'SparkUI' on port 4040.
kubectl describe pod <driver-pod>
Name: sparkbasics-f374377b3c78ac68-driver
Namespace: default
Priority: 0
Node: aks-default-31057657-vmss000000/10.240.0.4
Start Time: Fri, 13 Aug 2021 01:25:47 +0300
Labels: spark-app-selector=spark-256cc7f64af9451b89e0098397980974
spark-role=driver
Annotations: <none>
Status: Succeeded
IP: 10.244.0.28
IPs:
IP: 10.244.0.28
Containers:
spark-kubernetes-driver:
Container ID: containerd://b572a4056014cd4b0520b808d64d766254d30c44ba12fc98717aee3b4814f17d
Image: aosb06.azurecr.io/sparkbasics:v300
Image ID: aosb06.azurecr.io/sparkbasics@sha256:965393784488025fffc7513edcb4a62333ba59a5ee3076346fd8d335e1715213
Ports: 7078/TCP, 7079/TCP, 4040/TCP
Host Ports: 0/TCP, 0/TCP, 0/TCP
Args:
driver
--properties-file
/opt/spark/conf/spark.properties
--class
org.apache.spark.deploy.PythonRunner
local:///opt/spark/work-dir/main.py
State: Terminated
Reason: Completed
Exit Code: 0
Started: Fri, 13 Aug 2021 01:25:51 +0300
Finished: Fri, 13 Aug 2021 01:56:40 +0300
Ready: False
Restart Count: 0
Limits:
memory: 1433Mi
Requests:
cpu: 1
memory: 1433Mi
Environment:
SPARK_USER: aovsyannikov
SPARK_APPLICATION_ID: spark-256cc7f64af9451b89e0098397980974
SPARK_DRIVER_BIND_ADDRESS: (v1:status.podIP)
SB_KEY_STORAGE: <set to the key 'STORAGE' in secret 'sparkbasics'> Optional: false
SB_KEY_OPENCAGE: <set to the key 'OPENCAGE' in secret 'sparkbasics'> Optional: false
SB_KEY_STORAGEOUT: <set to the key 'STORAGEOUT' in secret 'sparkbasics'> Optional: false
SPARK_LOCAL_DIRS: /var/data/spark-1e9aa64b-e0a1-44ae-a097-ebb3c2f32404
SPARK_CONF_DIR: /opt/spark/conf
Mounts:
/opt/spark/conf from spark-conf-volume-driver (rw)
/var/data/spark-1e9aa64b-e0a1-44ae-a097-ebb3c2f32404 from spark-local-dir-1 (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-wlqjt (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
spark-local-dir-1:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
spark-conf-volume-driver:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: spark-drv-6f83b17b3c78af1f-conf-map
Optional: false
default-token-wlqjt:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-wlqjt
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events: <none>
I have found the mistake: it was in the PySpark app itself.
...
SparkSession.builder.master("local")
...
It should be without the master setting, because hard-coding master("local") makes the driver run everything locally and never request executors from Kubernetes:
...
SparkSession.builder
...
as simple as that :(
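For reference, the corrected session setup would look roughly like the sketch below; the app name is taken from the driver log ("SimpleApp"), and the rest is illustrative. Leaving out .master() lets spark-submit supply the k8s:// master URL together with --conf settings such as spark.executor.instances.
from pyspark.sql import SparkSession

# Do not hard-code a master here; spark-submit injects the k8s:// master URL
# and the --conf values such as spark.executor.instances.
spark = (
    SparkSession.builder
    .appName("SimpleApp")  # app name as seen in the driver log
    .getOrCreate()
)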
Related
I'm trying to start an example Spark job (Hadoop 2.7 and Spark 3.3.1) on my Hadoop cluster consisting of a namenode and datanode0.
After running start-dfs.sh, I can see the datanode in the UI, and running jps on the datanode shows the "DataNode" process.
When I try to run spark-submit with the example, I get the following output:
spark#namenode:~$ spark-submit --deploy-mode client --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.12-3.3.1.jar 10
23/02/07 21:23:08 INFO SparkContext: Running Spark version 3.3.1
23/02/07 21:23:08 INFO ResourceUtils: ==============================================================
23/02/07 21:23:08 INFO ResourceUtils: No custom resources configured for spark.driver.
23/02/07 21:23:08 INFO ResourceUtils: ==============================================================
23/02/07 21:23:08 INFO SparkContext: Submitted application: Spark Pi
23/02/07 21:23:09 INFO ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , memory -> name: memory, amount: 512, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0)
23/02/07 21:23:09 INFO ResourceProfile: Limiting resource is cpus at 1 tasks per executor
23/02/07 21:23:09 INFO ResourceProfileManager: Added ResourceProfile id: 0
23/02/07 21:23:09 INFO SecurityManager: Changing view acls to: spark
23/02/07 21:23:09 INFO SecurityManager: Changing modify acls to: spark
23/02/07 21:23:09 INFO SecurityManager: Changing view acls groups to:
23/02/07 21:23:09 INFO SecurityManager: Changing modify acls groups to:
23/02/07 21:23:09 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(spark); groups with view permissions: Set(); users with modify permissions: Set(spark); groups with modify permissions: Set()
23/02/07 21:23:09 INFO Utils: Successfully started service 'sparkDriver' on port 45161.
23/02/07 21:23:09 INFO SparkEnv: Registering MapOutputTracker
23/02/07 21:23:09 INFO SparkEnv: Registering BlockManagerMaster
23/02/07 21:23:09 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
23/02/07 21:23:09 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
23/02/07 21:23:09 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
23/02/07 21:23:09 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-09c3077f-8d16-4496-9808-0626cefe1cc7
23/02/07 21:23:09 INFO MemoryStore: MemoryStore started with capacity 93.3 MiB
23/02/07 21:23:09 INFO SparkEnv: Registering OutputCommitCoordinator
23/02/07 21:23:10 INFO Utils: Successfully started service 'SparkUI' on port 4040.
23/02/07 21:23:10 INFO SparkContext: Added JAR file:/home/spark/spark/examples/jars/spark-examples_2.12-3.3.1.jar at spark://namenode:45161/jars/spark-examples_2.12-3.3.1.jar with timestamp 1675801388738
23/02/07 21:23:10 INFO RMProxy: Connecting to ResourceManager at namenode/192.168.1.17:8032
23/02/07 21:23:11 INFO Client: Retrying connect to server: namenode/192.168.1.17:8032. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
23/02/07 21:23:12 INFO Client: Retrying connect to server: namenode/192.168.1.17:8032. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
23/02/07 21:23:13 INFO Client: Retrying connect to server: namenode/192.168.1.17:8032. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
23/02/07 21:23:14 INFO Client: Retrying connect to server: namenode/192.168.1.17:8032. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
23/02/07 21:23:15 INFO Client: Retrying connect to server: namenode/192.168.1.17:8032. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
23/02/07 21:23:16 INFO Client: Retrying connect to server: namenode/192.168.1.17:8032. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
^C23/02/07 21:23:17 INFO DiskBlockManager: Shutdown hook called
23/02/07 21:23:17 INFO ShutdownHookManager: Shutdown hook called
23/02/07 21:23:17 INFO ShutdownHookManager: Deleting directory /tmp/spark-5a15bfc6-654d-4332-be86-3f20b4cf40f1/userFiles-049088cd-d541-4424-930b-dcaa18634860
23/02/07 21:23:17 INFO ShutdownHookManager: Deleting directory /tmp/spark-4d529cc6-0bc8-43a9-b2af-3671ee11f963
23/02/07 21:23:17 INFO ShutdownHookManager: Deleting directory /tmp/spark-5a15bfc6-654d-4332-be86-3f20b4cf40f1
Is it possible that I've messed up the YARN configuration?
Here's my /etc/hosts:
192.168.1.17 namenode
192.168.1.23 datanode0
This is my yarn-site.xml:
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>namenode</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>${yarn.resourcemanager.hostname}:8030</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>${yarn.resourcemanager.hostname}:8032</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>${yarn.resourcemanager.hostname}:8088</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>${yarn.resourcemanager.hostname}:8031</value>
</property>
<property>
<name>yarn.resourcemanager.admin.address</name>
<value>${yarn.resourcemanager.hostname}:8088</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>1536</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>1536</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>128</value>
</property>
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
</configuration>
Hi, I'm facing an issue starting a Spark cluster on Docker Swarm.
I've already created a few clusters on Docker Swarm, but right now I can't figure out why I can't reach the worker UI on its dedicated port.
I have 3 physical hosts (which means 3 Docker Swarm nodes).
spark_stack_spark-worker-1.0.u2qvp1swxp5g#test_host_a | 22/06/14 18:46:58 INFO SecurityManager: Changing view acls to: 185
spark_stack_spark-worker-1.0.u2qvp1swxp5g#test_host_a | 22/06/14 18:46:58 INFO SecurityManager: Changing modify acls to: 185
spark_stack_spark-worker-1.0.u2qvp1swxp5g#test_host_a | 22/06/14 18:46:58 INFO SecurityManager: Changing view acls groups to:
spark_stack_spark-worker-1.0.u2qvp1swxp5g#test_host_a | 22/06/14 18:46:58 INFO SecurityManager: Changing modify acls groups to:
spark_stack_spark-worker-1.0.u2qvp1swxp5g#test_host_a | 22/06/14 18:46:58 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(185); groups with view permissions: Set(); users with modify permissions: Set(185); groups with modify permissions: Set()
spark_stack_spark-worker-1.0.u2qvp1swxp5g#test_host_a | 22/06/14 18:46:58 INFO Utils: Successfully started service 'sparkWorker' on port 7000.
spark_stack_spark-worker-1.0.u2qvp1swxp5g#test_host_a | 22/06/14 18:46:58 INFO Worker: Worker decommissioning not enabled.
spark_stack_spark-worker-1.0.u2qvp1swxp5g#test_host_a | 22/06/14 18:46:59 INFO Worker: Starting Spark worker 10.0.2.148:7000 with 6 cores, 10.0 GiB RAM
spark_stack_spark-worker-1.0.u2qvp1swxp5g#test_host_a | 22/06/14 18:46:59 INFO Worker: Running Spark version 3.2.1
spark_stack_spark-worker-1.0.u2qvp1swxp5g#test_host_a | 22/06/14 18:46:59 INFO Worker: Spark home: /opt/spark
spark_stack_spark-worker-1.0.u2qvp1swxp5g#test_host_a | 22/06/14 18:46:59 INFO ResourceUtils: ==============================================================
spark_stack_spark-worker-1.0.u2qvp1swxp5g#test_host_a | 22/06/14 18:46:59 INFO ResourceUtils: No custom resources configured for spark.worker.
spark_stack_spark-worker-1.0.u2qvp1swxp5g#test_host_a | 22/06/14 18:46:59 INFO ResourceUtils: ==============================================================
spark_stack_spark-worker-1.0.u2qvp1swxp5g#test_host_a | 22/06/14 18:46:59 INFO Utils: Successfully started service 'WorkerUI' on port 8081.
spark_stack_spark-worker-1.0.u2qvp1swxp5g#test_host_a | 22/06/14 18:46:59 INFO WorkerWebUI: Bound WorkerWebUI to 10.0.2.148, and started at http://10.242.130.225:8081
spark_stack_spark-worker-1.0.u2qvp1swxp5g#test_host_a | 22/06/14 18:46:59 INFO Worker: Connecting to master spark-master:7077...
spark_stack_spark-worker-1.0.u2qvp1swxp5g#test_host_a | 22/06/14 18:46:59 INFO TransportClientFactory: Successfully created connection to spark-master/10.0.2.144:7077 after 46 ms (0 ms spent in bootstraps)
spark_stack_spark-worker-1.0.u2qvp1swxp5g#test_host_a | 22/06/14 18:46:59 INFO Worker: Successfully registered with master spark://10.0.2.145:7077
gsi49cc8bevb spark_stack_spark-master global 1/1 new_spark-cluster:3.2.1 *:7077->7077/tcp, *:9090->8080/tcp
fiszfm8mgl2p spark_stack_spark-worker-1 global 1/1 new_spark-cluster:3.2.1 *:7001->7000/tcp, *:9091->8081/tcp
"Endpoint": {
"Spec": {
"Mode": "vip",
"Ports": [
{
"Protocol": "tcp",
"TargetPort": 7000,
"PublishedPort": 7001,
"PublishMode": "ingress"
},
{
"Protocol": "tcp",
"TargetPort": 8081,
"PublishedPort": 9091,
"PublishMode": "ingress"
}
]
},
"Ports": [
{
"Protocol": "tcp",
"TargetPort": 7000,
"PublishedPort": 7001,
"PublishMode": "ingress"
},
{
"Protocol": "tcp",
"TargetPort": 8081,
"PublishedPort": 9091,
"PublishMode": "ingress"
}
The worker web UI should be reachable on port 9091 via the public DNS, since that port is published, but it isn't. Why?
The master web UI, published the same way, works fine.
I am trying to run a simple Spark job on a Kubernetes cluster. I deployed a pod that starts a PySpark shell, and in that shell I am changing the Spark configuration as specified below:
>>> sc.stop()
>>> sparkConf = SparkConf()
>>> sparkConf.setMaster("k8s://https://kubernetes.default.svc:443")
>>> sparkConf.setAppName("pyspark_test")
>>> sparkConf.set("spark.submit.deployMode", "client")
>>> sparkConf.set("spark.executor.instances", 2)
>>> sparkConf.set("spark.kubernetes.container.image", "us.icr.io/testspark/spark:v1")
>>> sparkConf.set("spark.kubernetes.namespace", "anonymous")
>>> sparkConf.set("spark.driver.memory", "1g")
>>> sparkConf.set("spark.executor.memory", "1g")
>>> sparkConf.set("spark.driver.host", "testspark")
>>> sparkConf.set("spark.driver.port", "37771")
>>> sparkConf.set("spark.kubernetes.driver.pod.name", "testspark")
>>> sparkConf.set("spark.driver.bindAddress", "0.0.0.0")
>>>
>>> spark = SparkSession.builder.config(conf=sparkConf).getOrCreate()
>>> sc = spark.sparkContext
This starts two new executor pods, but both fail:
satyam#Satyams-MBP ~ % kubectl get pods -n anonymous
NAME READY STATUS RESTARTS AGE
pysparktest-c1c8f177591feb60-exec-1 0/2 Error 0 111m
pysparktest-c1c8f177591feb60-exec-2 0/2 Error 0 111m
testspark 2/2 Running 0 116m
I checked the logs for one of the executor pods and they show the following error:
satyam#Satyams-MBP ~ % kubectl logs -n anonymous pysparktest-c1c8f177591feb60-exec-1 -c spark-kubernetes-executor
++ id -u
+ myuid=185
++ id -g
+ mygid=0
+ set +e
++ getent passwd 185
+ uidentry=
+ set -e
+ '[' -z '' ']'
+ '[' -w /etc/passwd ']'
+ echo '185:x:185:0:anonymous uid:/opt/spark:/bin/false'
+ SPARK_CLASSPATH=':/opt/spark/jars/*'
+ env
+ grep SPARK_JAVA_OPT_
+ sed 's/[^=]*=\(.*\)/\1/g'
+ sort -t_ -k4 -n
+ readarray -t SPARK_EXECUTOR_JAVA_OPTS
+ '[' -n '' ']'
+ '[' '' == 2 ']'
+ '[' '' == 3 ']'
+ '[' -n '' ']'
+ '[' -z ']'
+ case "$1" in
+ shift 1
+ CMD=(${JAVA_HOME}/bin/java "${SPARK_EXECUTOR_JAVA_OPTS[@]}" -Xms$SPARK_EXECUTOR_MEMORY -Xmx$SPARK_EXECUTOR_MEMORY -cp "$SPARK_CLASSPATH:$SPARK_DIST_CLASSPATH" org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url $SPARK_DRIVER_URL --executor-id $SPARK_EXECUTOR_ID --cores $SPARK_EXECUTOR_CORES --app-id $SPARK_APPLICATION_ID --hostname $SPARK_EXECUTOR_POD_IP)
+ exec /usr/bin/tini -s -- /usr/local/openjdk-8/bin/java -Dio.netty.tryReflectionSetAccessible=true -Dspark.driver.port=37771 -Xms1g -Xmx1g -cp ':/opt/spark/jars/*:' org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@testspark:37771 --executor-id 1 --cores 1 --app-id spark-application-1612108001966 --hostname 172.30.174.196
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
21/01/31 15:46:49 INFO CoarseGrainedExecutorBackend: Started daemon with process name: 14#pysparktest-c1c8f177591feb60-exec-1
21/01/31 15:46:49 INFO SignalUtils: Registered signal handler for TERM
21/01/31 15:46:49 INFO SignalUtils: Registered signal handler for HUP
21/01/31 15:46:49 INFO SignalUtils: Registered signal handler for INT
21/01/31 15:46:49 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
21/01/31 15:46:49 INFO SecurityManager: Changing view acls to: 185,root
21/01/31 15:46:49 INFO SecurityManager: Changing modify acls to: 185,root
21/01/31 15:46:49 INFO SecurityManager: Changing view acls groups to:
21/01/31 15:46:49 INFO SecurityManager: Changing modify acls groups to:
21/01/31 15:46:49 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(185, root); groups with view permissions: Set(); users with modify permissions: Set(185, root); groups with modify permissions: Set()
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1748)
at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:61)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:283)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:272)
at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:302)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.$anonfun$run$3(CoarseGrainedExecutorBackend.scala:303)
at scala.runtime.java8.JFunction1$mcVI$sp.apply(JFunction1$mcVI$sp.java:23)
at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:877)
at scala.collection.immutable.Range.foreach(Range.scala:158)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:876)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.$anonfun$run$1(CoarseGrainedExecutorBackend.scala:301)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:62)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:61)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
... 4 more
Caused by: java.io.IOException: Failed to connect to testspark/172.30.174.253:37771
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:253)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:195)
at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:204)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:202)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:198)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: testspark/172.30.174.253:37771
Caused by: java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:716)
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:330)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:702)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:748)
I have also created a headless service according to the instructions here: https://spark.apache.org/docs/latest/running-on-kubernetes.html#client-mode-networking. Below is the YAML for the service as well as the driver pod:
Service
apiVersion: v1
kind: Service
metadata:
name: testspark
spec:
clusterIP: "None"
selector:
spark-app-selector: testspark
ports:
- name: driver-rpc-port
protocol: TCP
port: 37771
targetPort: 37771
- name: blockmanager
protocol: TCP
port: 37772
targetPort: 37772
Driver Pod
apiVersion: v1
kind: Pod
metadata:
name: testspark
labels:
spark-app-selector: testspark
spec:
containers:
- name: testspark
securityContext:
runAsUser: 0
image: jupyter/pyspark-notebook
ports:
- containerPort: 37771
command: ["tail", "-f", "/dev/null"]
serviceAccountName: default-editor
This should have allowed the executor pods to connect to the driver (which, I checked, has the correct IP 172.30.174.249). To debug the network, I started a shell in the driver container and ran netstat to list the listening ports. Here is the output:
root#testspark:/opt/spark/work-dir# netstat -tulpn
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 127.0.0.1:15000 0.0.0.0:* LISTEN -
tcp 0 0 0.0.0.0:15001 0.0.0.0:* LISTEN -
tcp 0 0 0.0.0.0:15090 0.0.0.0:* LISTEN -
tcp6 0 0 :::4040 :::* LISTEN 35/java
tcp6 0 0 :::37771 :::* LISTEN 35/java
tcp6 0 0 :::15020 :::* LISTEN -
tcp6 0 0 :::41613 :::* LISTEN 35/java
I also tried to connect to the driver pod on port 37771 via telnet from another running pod in the same namespace, and it was able to connect.
root#test:/# telnet 172.30.174.249 37771
Trying 172.30.174.249...
Connected to 172.30.174.249.
Escape character is '^]'.
I am not sure why my executor pods are not able to connect to the driver on the same port. Am I missing any configuration, or am I doing something wrong? I can supply more information if required.
UPDATE
I created a fake Spark executor image with the following Dockerfile:
FROM us.icr.io/testspark/spark:v1
ENTRYPOINT ["tail", "-f", "/dev/null"]
and passed this image as the spark.kubernetes.container.image config while instantiating the Spark context. I got two running executor pods. I exec'd into one of them with kubectl exec -n anonymous -it pysparktest-c1c8f177591feb60-exec-1 -c spark-kubernetes-executor bash, ran /opt/entrypoint.sh executor, and to my surprise the executor could connect to the driver just fine. Here is the log for the same:
++ id -u
+ myuid=185
++ id -g
+ mygid=0
+ set +e
++ getent passwd 185
+ uidentry='185:x:185:0:anonymous uid:/opt/spark:/bin/false'
+ set -e
+ '[' -z '185:x:185:0:anonymous uid:/opt/spark:/bin/false' ']'
+ SPARK_CLASSPATH=':/opt/spark/jars/*'
+ env
+ grep SPARK_JAVA_OPT_
+ sort -t_ -k4 -n
+ sed 's/[^=]*=\(.*\)/\1/g'
+ readarray -t SPARK_EXECUTOR_JAVA_OPTS
+ '[' -n '' ']'
+ '[' '' == 2 ']'
+ '[' '' == 3 ']'
+ '[' -n '' ']'
+ '[' -z ']'
+ case "$1" in
+ shift 1
+ CMD=(${JAVA_HOME}/bin/java "${SPARK_EXECUTOR_JAVA_OPTS[@]}" -Xms$SPARK_EXECUTOR_MEMORY -Xmx$SPARK_EXECUTOR_MEMORY -cp "$SPARK_CLASSPATH:$SPARK_DIST_CLASSPATH" org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url $SPARK_DRIVER_URL --executor-id $SPARK_EXECUTOR_ID --cores $SPARK_EXECUTOR_CORES --app-id $SPARK_APPLICATION_ID --hostname $SPARK_EXECUTOR_POD_IP)
+ exec /usr/bin/tini -s -- /usr/local/openjdk-8/bin/java -Dio.netty.tryReflectionSetAccessible=true -Dspark.driver.port=37771 -Xms1g -Xmx1g -cp ':/opt/spark/jars/*:' org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@testspark.anonymous.svc.cluster.local:37771 --executor-id 1 --cores 1 --app-id spark-application-1612191192882 --hostname 172.30.174.249
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
21/02/01 15:00:16 INFO CoarseGrainedExecutorBackend: Started daemon with process name: 39#pysparktest-27b678775e1556d9-exec-1
21/02/01 15:00:16 INFO SignalUtils: Registered signal handler for TERM
21/02/01 15:00:16 INFO SignalUtils: Registered signal handler for HUP
21/02/01 15:00:16 INFO SignalUtils: Registered signal handler for INT
21/02/01 15:00:16 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
21/02/01 15:00:17 INFO SecurityManager: Changing view acls to: 185,root
21/02/01 15:00:17 INFO SecurityManager: Changing modify acls to: 185,root
21/02/01 15:00:17 INFO SecurityManager: Changing view acls groups to:
21/02/01 15:00:17 INFO SecurityManager: Changing modify acls groups to:
21/02/01 15:00:17 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(185, root); groups with view permissions: Set(); users with modify permissions: Set(185, root); groups with modify permissions: Set()
21/02/01 15:00:17 INFO TransportClientFactory: Successfully created connection to testspark.anonymous.svc.cluster.local/172.30.174.253:37771 after 173 ms (0 ms spent in bootstraps)
21/02/01 15:00:18 INFO SecurityManager: Changing view acls to: 185,root
21/02/01 15:00:18 INFO SecurityManager: Changing modify acls to: 185,root
21/02/01 15:00:18 INFO SecurityManager: Changing view acls groups to:
21/02/01 15:00:18 INFO SecurityManager: Changing modify acls groups to:
21/02/01 15:00:18 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(185, root); groups with view permissions: Set(); users with modify permissions: Set(185, root); groups with modify permissions: Set()
21/02/01 15:00:18 INFO TransportClientFactory: Successfully created connection to testspark.anonymous.svc.cluster.local/172.30.174.253:37771 after 3 ms (0 ms spent in bootstraps)
21/02/01 15:00:18 INFO DiskBlockManager: Created local directory at /var/data/spark-839bad93-b01c-4bc9-a33f-51c7493775e3/blockmgr-ad6a42b9-cfe2-4cdd-aa28-37a0ab77fb16
21/02/01 15:00:18 INFO MemoryStore: MemoryStore started with capacity 413.9 MiB
21/02/01 15:00:19 INFO CoarseGrainedExecutorBackend: Connecting to driver: spark://CoarseGrainedScheduler#testspark.anonymous.svc.cluster.local:37771
21/02/01 15:00:19 INFO ResourceUtils: ==============================================================
21/02/01 15:00:19 INFO ResourceUtils: Resources for spark.executor:
21/02/01 15:00:19 INFO ResourceUtils: ==============================================================
21/02/01 15:00:19 INFO CoarseGrainedExecutorBackend: Successfully registered with driver
21/02/01 15:00:19 INFO Executor: Starting executor ID 1 on host 172.30.174.249
21/02/01 15:00:19 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 40515.
21/02/01 15:00:19 INFO NettyBlockTransferService: Server created on 172.30.174.249:40515
21/02/01 15:00:19 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
21/02/01 15:00:19 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(1, 172.30.174.249, 40515, None)
21/02/01 15:00:19 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(1, 172.30.174.249, 40515, None)
21/02/01 15:00:19 INFO BlockManager: Initialized BlockManager: BlockManagerId(1, 172.30.174.249, 40515, None)
I am actually puzzled why this might be happening. Is there any workaround I can try to get this working automatically, instead of having to run it manually?
I was finally able to solve this problem with the help of a colleague. I just added these two configs to disable Istio sidecar injection, and it started working:
sparkConf.set("spark.kubernetes.driver.annotation.sidecar.istio.io/inject", "false")
sparkConf.set("spark.kubernetes.executor.annotation.sidecar.istio.io/inject", "false")
I don't have much experience with PySpark, but I once set up Java Spark to run on a Kubernetes cluster in client mode, like you are trying now, and I believe the configuration should mostly be the same.
First of all, you should check whether the headless service is working as expected, starting with:
kubectl describe svc -n anonymous testspark
and see whether it lists any endpoints, and look over the whole description. Second, from inside one of your Pods, you could check whether nslookup resolves the hostname you are expecting your driver Pod to have.
kubectl exec -n <namespace> -it <pod-name> -- bash  # exec into a Pod that has nslookup available
nslookup testspark
nslookup testspark.testspark
If the names resolve correctly to the driver Pod's current IP address, then it could be something related to the Spark configuration.
The only difference I found between your configuration and the one I was using with Java is that for spark.driver.host I was using something like:
service-name.namespace.svc.cluster.local
but, in theory, it should be the same.
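Applied to your setup, that would look roughly like the sketch below, which simply reuses the testspark service and anonymous namespace names from your question; adjust them if yours differ.
from pyspark import SparkConf

# Hedged sketch: point spark.driver.host at the fully qualified DNS name of the
# headless service ("testspark" in namespace "anonymous" in the question).
sparkConf = SparkConf()
sparkConf.set("spark.driver.host", "testspark.anonymous.svc.cluster.local")
sparkConf.set("spark.driver.port", "37771")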
Another thing that could be useful in debugging the problem is a describe of one of the executor Pods, just to check that the configurations are all correct.
Edit:
This is the spark-submit command I was using:
sh /opt/spark/bin/spark-submit --master k8s://https://master-ip:6443 --deploy-mode client --name ${POD_NAME} --class "my.spark.application.main.Processor" --conf "spark.executor.instances=2" --conf "spark.kubernetes.namespace=${POD_NAMESPACE}" --conf "spark.kubernetes.driver.pod.name=${POD_NAME}" --conf "spark.kubernetes.container.image=${SPARK_IMAGE}" --conf "spark.kubernetes.executor.request.cores=2" --conf "spark.kubernetes.executor.limit.cores=2" --conf "spark.driver.memory=4g" --conf "spark.executor.cores=2" --conf "spark.executor.memory=4g" --conf "spark.driver.host=${SPARK_DRIVER_HOST}" --conf "spark.ui.dagGraph.retainedRootRDDs=1000" --conf "spark.driver.port=${SPARK_DRIVER_PORT}" --conf "spark.driver.extraJavaOptions=${DRIVER_JAVA_ARGS}" --conf "spark.executor.extraJavaOptions=${EXECUTOR_JAVA_ARGS}" /opt/spark/jars/app.jar -jobConfig ${JOB_CONFIGURATION}
And most of the configurations, like SPARK_DRIVER_PORT or SPARK_DRIVER_HOST, were very similar to yours; only the names and ports differed.
I'm trying to set up an 8-node cluster on 8 RHEL 7.3 x86 machines using Spark 2.0.1. start-master.sh goes through fine:
Spark Command: /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.102-4.b14.el7.x86_64/jre/bin/java -cp /usr/local/bin/spark-2.0.1-bin-hadoop2.7/conf/:/usr/local/bin/spark-2.0.1-bin-hadoop2.7/jars/* -Xmx1g org.apache.spark.deploy.master.Master --host lambda.foo.net --port 7077 --webui-port 8080
========================================
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/12/08 04:26:46 INFO Master: Started daemon with process name: 22181#lambda.foo.net
16/12/08 04:26:46 INFO SignalUtils: Registered signal handler for TERM
16/12/08 04:26:46 INFO SignalUtils: Registered signal handler for HUP
16/12/08 04:26:46 INFO SignalUtils: Registered signal handler for INT
16/12/08 04:26:46 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/12/08 04:26:46 INFO SecurityManager: Changing view acls to: root
16/12/08 04:26:46 INFO SecurityManager: Changing modify acls to: root
16/12/08 04:26:46 INFO SecurityManager: Changing view acls groups to:
16/12/08 04:26:46 INFO SecurityManager: Changing modify acls groups to:
16/12/08 04:26:46 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
16/12/08 04:26:46 INFO Utils: Successfully started service 'sparkMaster' on port 7077.
16/12/08 04:26:46 INFO Master: Starting Spark master at spark://lambda.foo.net:7077
16/12/08 04:26:46 INFO Master: Running Spark version 2.0.1
16/12/08 04:26:46 INFO Utils: Successfully started service 'MasterUI' on port 8080.
16/12/08 04:26:46 INFO MasterWebUI: Bound MasterWebUI to 0.0.0.0, and started at http://19.341.11.212:8080
16/12/08 04:26:46 INFO Utils: Successfully started service on port 6066.
16/12/08 04:26:46 INFO StandaloneRestServer: Started REST server for submitting applications on port 6066
16/12/08 04:26:46 INFO Master: I have been elected leader! New state: ALIVE
But when I try to bring up the workers using start-slaves.sh, what I see in the workers' logs is:
Spark Command: /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.102-4.b14.el7.x86_64/jre/bin/java -cp /usr/local/bin/spark-2.0.1-bin-hadoop2.7/conf/:/usr/local/bin/spark-2.0.1-bin-hadoop2.7/jars/* -Xmx1g org.apache.spark.deploy.worker.Worker --webui-port 8081 spark://lambda.foo.net:7077
========================================
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/12/08 04:30:00 INFO Worker: Started daemon with process name: 14649#hawk040os4.foo.net
16/12/08 04:30:00 INFO SignalUtils: Registered signal handler for TERM
16/12/08 04:30:00 INFO SignalUtils: Registered signal handler for HUP
16/12/08 04:30:00 INFO SignalUtils: Registered signal handler for INT
16/12/08 04:30:00 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/12/08 04:30:00 INFO SecurityManager: Changing view acls to: root
16/12/08 04:30:00 INFO SecurityManager: Changing modify acls to: root
16/12/08 04:30:00 INFO SecurityManager: Changing view acls groups to:
16/12/08 04:30:00 INFO SecurityManager: Changing modify acls groups to:
16/12/08 04:30:00 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
16/12/08 04:30:00 INFO Utils: Successfully started service 'sparkWorker' on port 35858.
16/12/08 04:30:00 INFO Worker: Starting Spark worker 15.242.22.179:35858 with 24 cores, 1510.2 GB RAM
16/12/08 04:30:00 INFO Worker: Running Spark version 2.0.1
16/12/08 04:30:00 INFO Worker: Spark home: /usr/local/bin/spark-2.0.1-bin-hadoop2.7
16/12/08 04:30:00 INFO Utils: Successfully started service 'WorkerUI' on port 8081.
16/12/08 04:30:00 INFO WorkerWebUI: Bound WorkerWebUI to 0.0.0.0, and started at http://15.242.22.179:8081
16/12/08 04:30:00 INFO Worker: Connecting to master lambda.foo.net:7077...
16/12/08 04:30:00 WARN Worker: Failed to connect to master lambda.foo.net:7077
org.apache.spark.SparkException: Exception thrown in awaitResult
at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:88)
at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:96)
at org.apache.spark.deploy.worker.Worker$$anonfun$org$apache$spark$deploy$worker$Worker$$tryRegisterAllMasters$1$$anon$1.run(Worker.scala:216)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Failed to connect to lambda.foo.net/19.341.11.212:7077
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:179)
at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:197)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:191)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:187)
... 4 more
Caused by: java.net.NoRouteToHostException: No route to host: lambda.foo.net/19.341.11.212:7077
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
... 1 more
16/12/08 04:30:12 INFO Worker: Retrying connection to master (attempt # 1)
16/12/08 04:30:12 INFO Worker: Connecting to master lambda.foo.net:7077...
16/12/08 04:30:12 WARN Worker: Failed to connect to master lambda.foo.net:7077
org.apache.spark.SparkException: Exception thrown in awaitResult
at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
So it says "No route to host". But I could successfully ping the master from the worker node, as well as ssh from the worker to the master node.
Why does spark say "No route to host"?
Problem solved: the firewall was blocking the packets.
I can successfully run the Java version of the Pi example as follows:
./bin/spark-submit --class org.apache.spark.examples.SparkPi \
--master yarn-client \
--num-executors 3 \
--driver-memory 4g \
--executor-memory 2g \
--executor-cores 1 \
--queue thequeue \
lib/spark-examples*.jar \
10
However, the Python version fails with the following error. I used yarn-client mode; the pyspark command line in yarn-client mode returns the same output. Can anyone help me figure out this problem?
nlp#yyy2:~/spark$ ./bin/spark-submit --master yarn-client examples/src/main/python/pi.py
15/01/05 17:22:26 INFO spark.SecurityManager: Changing view acls to: nlp
15/01/05 17:22:26 INFO spark.SecurityManager: Changing modify acls to: nlp
15/01/05 17:22:26 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(nlp); users with modify permissions: Set(nlp)
15/01/05 17:22:26 INFO slf4j.Slf4jLogger: Slf4jLogger started
15/01/05 17:22:26 INFO Remoting: Starting remoting
15/01/05 17:22:26 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver#yyy2:42747]
15/01/05 17:22:26 INFO util.Utils: Successfully started service 'sparkDriver' on port 42747.
15/01/05 17:22:26 INFO spark.SparkEnv: Registering MapOutputTracker
15/01/05 17:22:26 INFO spark.SparkEnv: Registering BlockManagerMaster
15/01/05 17:22:26 INFO storage.DiskBlockManager: Created local directory at /tmp/spark-local-20150105172226-aeae
15/01/05 17:22:26 INFO storage.MemoryStore: MemoryStore started with capacity 265.1 MB
15/01/05 17:22:27 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/01/05 17:22:27 INFO spark.HttpFileServer: HTTP File server directory is /tmp/spark-cbe0079b-79c5-426b-b67e-548805423b11
15/01/05 17:22:27 INFO spark.HttpServer: Starting HTTP Server
15/01/05 17:22:27 INFO server.Server: jetty-8.y.z-SNAPSHOT
15/01/05 17:22:27 INFO server.AbstractConnector: Started SocketConnector#0.0.0.0:57169
15/01/05 17:22:27 INFO util.Utils: Successfully started service 'HTTP file server' on port 57169.
15/01/05 17:22:27 INFO server.Server: jetty-8.y.z-SNAPSHOT
15/01/05 17:22:27 INFO server.AbstractConnector: Started SelectChannelConnector#0.0.0.0:4040
15/01/05 17:22:27 INFO util.Utils: Successfully started service 'SparkUI' on port 4040.
15/01/05 17:22:27 INFO ui.SparkUI: Started SparkUI at http://yyy2:4040
15/01/05 17:22:27 INFO client.RMProxy: Connecting to ResourceManager at yyy14/10.112.168.195:8032
15/01/05 17:22:27 INFO yarn.Client: Requesting a new application from cluster with 6 NodeManagers
15/01/05 17:22:27 INFO yarn.Client: Verifying our application has not requested more than the maximum memory capability of the cluster (8192 MB per container)
15/01/05 17:22:27 INFO yarn.Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
15/01/05 17:22:27 INFO yarn.Client: Setting up container launch context for our AM
15/01/05 17:22:27 INFO yarn.Client: Preparing resources for our AM container
15/01/05 17:22:28 INFO hdfs.DFSClient: Created HDFS_DELEGATION_TOKEN token 24 for xxx on ha-hdfs:hzdm-cluster1
15/01/05 17:22:28 INFO yarn.Client: Uploading resource file:/home/nlp/platform/spark-1.2.0-bin-2.5.2/lib/spark-assembly-1.2.0-hadoop2.5.2.jar -> hdfs://hzdm-cluster1/user/nlp/.sparkStaging/application_1420444011562_0023/spark-assembly-1.2.0-hadoop2.5.2.jar
15/01/05 17:22:29 INFO yarn.Client: Uploading resource file:/home/nlp/platform/spark-1.2.0-bin-2.5.2/examples/src/main/python/pi.py -> hdfs://hzdm-cluster1/user/nlp/.sparkStaging/application_1420444011562_0023/pi.py
15/01/05 17:22:29 INFO yarn.Client: Setting up the launch environment for our AM container
15/01/05 17:22:29 INFO spark.SecurityManager: Changing view acls to: nlp
15/01/05 17:22:29 INFO spark.SecurityManager: Changing modify acls to: nlp
15/01/05 17:22:29 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(nlp); users with modify permissions: Set(nlp)
15/01/05 17:22:29 INFO yarn.Client: Submitting application 23 to ResourceManager
15/01/05 17:22:30 INFO impl.YarnClientImpl: Submitted application application_1420444011562_0023
15/01/05 17:22:31 INFO yarn.Client: Application report for application_1420444011562_0023 (state: ACCEPTED)
15/01/05 17:22:31 INFO yarn.Client:
client token: Token { kind: YARN_CLIENT_TOKEN, service: }
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: root.default
start time: 1420449749969
final status: UNDEFINED
tracking URL: http://yyy14:8070/proxy/application_1420444011562_0023/
user: nlp
15/01/05 17:22:32 INFO yarn.Client: Application report for application_1420444011562_0023 (state: ACCEPTED)
15/01/05 17:22:33 INFO yarn.Client: Application report for application_1420444011562_0023 (state: ACCEPTED)
15/01/05 17:22:34 INFO yarn.Client: Application report for application_1420444011562_0023 (state: ACCEPTED)
15/01/05 17:22:35 INFO yarn.Client: Application report for application_1420444011562_0023 (state: ACCEPTED)
15/01/05 17:22:36 INFO yarn.Client: Application report for application_1420444011562_0023 (state: ACCEPTED)
15/01/05 17:22:36 INFO cluster.YarnClientSchedulerBackend: ApplicationMaster registered as Actor[akka.tcp://sparkYarnAM#yyy16:52855/user/YarnAM#435880073]
15/01/05 17:22:36 INFO cluster.YarnClientSchedulerBackend: Add WebUI Filter. org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, Map(PROXY_HOSTS -> yyy14, PROXY_URI_BASES -> http://yyy14:8070/proxy/application_1420444011562_0023), /proxy/application_1420444011562_0023
15/01/05 17:22:36 INFO ui.JettyUtils: Adding filter: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
15/01/05 17:22:37 INFO yarn.Client: Application report for application_1420444011562_0023 (state: RUNNING)
15/01/05 17:22:37 INFO yarn.Client:
client token: Token { kind: YARN_CLIENT_TOKEN, service: }
diagnostics: N/A
ApplicationMaster host: yyy16
ApplicationMaster RPC port: 0
queue: root.default
start time: 1420449749969
final status: UNDEFINED
tracking URL: http://yyy14:8070/proxy/application_1420444011562_0023/
user: nlp
15/01/05 17:22:37 INFO cluster.YarnClientSchedulerBackend: Application application_1420444011562_0023 has started running.
15/01/05 17:22:37 INFO netty.NettyBlockTransferService: Server created on 35648
15/01/05 17:22:37 INFO storage.BlockManagerMaster: Trying to register BlockManager
15/01/05 17:22:37 INFO storage.BlockManagerMasterActor: Registering block manager yyy2:35648 with 265.1 MB RAM, BlockManagerId(<driver>, yyy2, 35648)
15/01/05 17:22:37 INFO storage.BlockManagerMaster: Registered BlockManager
15/01/05 17:22:37 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkYarnAM#yyy16:52855] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
15/01/05 17:22:38 ERROR cluster.YarnClientSchedulerBackend: Yarn application has already exited with state FINISHED!
15/01/05 17:22:38 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages/stage/kill,null}
15/01/05 17:22:38 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/,null}
15/01/05 17:22:38 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/static,null}
15/01/05 17:22:38 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/executors/threadDump/json,null}
15/01/05 17:22:38 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/executors/threadDump,null}
15/01/05 17:22:38 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/executors/json,null}
15/01/05 17:22:38 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/executors,null}
15/01/05 17:22:38 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/environment/json,null}
15/01/05 17:22:38 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/environment,null}
15/01/05 17:22:38 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/storage/rdd/json,null}
15/01/05 17:22:38 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/storage/rdd,null}
15/01/05 17:22:38 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/storage/json,null}
15/01/05 17:22:38 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/storage,null}
15/01/05 17:22:38 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages/pool/json,null}
15/01/05 17:22:38 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages/pool,null}
15/01/05 17:22:38 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages/stage/json,null}
15/01/05 17:22:38 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages/stage,null}
15/01/05 17:22:38 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages/json,null}
15/01/05 17:22:38 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages,null}
15/01/05 17:22:38 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/jobs/job/json,null}
15/01/05 17:22:38 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/jobs/job,null}
15/01/05 17:22:38 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/jobs/json,null}
15/01/05 17:22:38 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/jobs,null}
15/01/05 17:22:38 INFO ui.SparkUI: Stopped Spark web UI at http://yyy2:4040
15/01/05 17:22:38 INFO scheduler.DAGScheduler: Stopping DAGScheduler
15/01/05 17:22:38 INFO cluster.YarnClientSchedulerBackend: Shutting down all executors
15/01/05 17:22:38 INFO cluster.YarnClientSchedulerBackend: Asking each executor to shut down
15/01/05 17:22:38 INFO cluster.YarnClientSchedulerBackend: Stopped
15/01/05 17:22:39 INFO spark.MapOutputTrackerMasterActor: MapOutputTrackerActor stopped!
15/01/05 17:22:39 INFO storage.MemoryStore: MemoryStore cleared
15/01/05 17:22:39 INFO storage.BlockManager: BlockManager stopped
15/01/05 17:22:39 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
15/01/05 17:22:39 INFO spark.SparkContext: Successfully stopped SparkContext
15/01/05 17:22:39 INFO remote.RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
15/01/05 17:22:39 INFO remote.RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
15/01/05 17:22:39 INFO remote.RemoteActorRefProvider$RemotingTerminator: Remoting shut down.
15/01/05 17:22:57 INFO cluster.YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after waiting maxRegisteredResourcesWaitingTime: 30000(ms)
Traceback (most recent call last):
File "/home/nlp/platform/spark-1.2.0-bin-2.5.2/examples/src/main/python/pi.py", line 29, in <module>
sc = SparkContext(appName="PythonPi")
File "/home/nlp/spark/python/pyspark/context.py", line 105, in __init__
conf, jsc)
File "/home/nlp/spark/python/pyspark/context.py", line 153, in _do_init
self._jsc = jsc or self._initialize_context(self._conf._jconf)
File "/home/nlp/spark/python/pyspark/context.py", line 201, in _initialize_context
return self._jvm.JavaSparkContext(jconf)
File "/home/nlp/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 701, in __call__
File "/home/nlp/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.NullPointerException
at org.apache.spark.SparkContext.<init>(SparkContext.scala:497)
at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:61)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:408)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:214)
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
If you're running this example on Java 8, this may be due to Java 8's excessive memory allocation strategy: https://issues.apache.org/jira/browse/YARN-4714
You can force YARN to ignore this by setting the following properties in yarn-site.xml:
<property>
<name>yarn.nodemanager.pmem-check-enabled</name>
<value>false</value>
</property>
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
Try the deploy mode parameter, like this:
--deploy-mode cluster
I had a problem like yours, and with this parameter it worked.
I experienced a similar problem using spark-submit and yarn-client (I got the same NPE/stack trace). Tuning down my memory settings did the trick. It seems to fail like this when you try to allocate too much memory. I would start by removing the --executor-memory and --driver-memory switches.
I reduced the number of cores in the Advanced spark-env to make it work.
I ran into this issue running the following (HDP 2.3, Spark 1.3.1):
spark-shell
--master yarn-client
--driver-memory 4g
--executor-memory 4g
--executor-cores 1
--num-executors 4
The solution for me was to set the Spark config value:
spark.yarn.am.extraJavaOptions=-Dhdp.version=2.3.0.0-2557
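In a PySpark program, that property could be set when the context is created, for example as in the sketch below; the -Dhdp.version value is the one quoted above and should be adjusted to match your HDP installation.
from pyspark import SparkConf, SparkContext

# Hedged sketch: pass the HDP version to the YARN application master,
# using the value from this answer; adjust it to your installation.
conf = (
    SparkConf()
    .setMaster("yarn-client")
    .set("spark.yarn.am.extraJavaOptions", "-Dhdp.version=2.3.0.0-2557")
)
sc = SparkContext(conf=conf)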