How to fix "Forbidden!Configured service account doesn't have access" with Spark on Kubernetes? - apache-spark

I am trying to run the basic example of submitting a spark application with a k8s cluster.
I created my docker image, using the script from the spark folder :
sudo ./bin/docker-image-tool.sh -mt spark-docker build
sudo docker image ls
REPOSITORY TAG IMAGE ID CREATED SIZE
spark-r spark-docker 793527583e00 17 minutes ago 740MB
spark-py spark-docker c984e15fe747 18 minutes ago 446MB
spark spark-docker 71950de529b3 18 minutes ago 355MB
openjdk 8-alpine 88d1c219f815 15 hours ago 105MB
hello-world latest fce289e99eb9 3 months ago 1.84kB
And then tried to submit the SparkPi examples (as in the official documentation).
./bin/spark-submit \
--master k8s://[MY_IP]:8443 \
--deploy-mode cluster \
--name spark-pi --class org.apache.spark.examples.SparkPi \
--driver-memory 1g \
--executor-memory 3g \
--conf spark.executor.instances=2 \
--conf spark.kubernetes.container.image=spark:spark-docker \
local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar
But the run fail with the following Exception :
io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: GET at: https://kubernetes.default.svc/api/v1/namespaces/default/pods/spark-pi-1554304245069-driver.
Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. pods "spark-pi-1554304245069-driver" is forbidden: User "system:serviceaccount:default:default" cannot get resource "pods" in API group "" in the namespace "default".
Here are the full logs of the pod from the Kubernetes Dashboard :
2019-04-03 15:10:50 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler#49096b06{/executors/threadDump,null,AVAILABLE,#Spark}
2019-04-03 15:10:50 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler#4a183d02{/executors/threadDump/json,null,AVAILABLE,#Spark}
2019-04-03 15:10:50 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler#5d05ef57{/static,null,AVAILABLE,#Spark}
2019-04-03 15:10:50 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler#34237b90{/,null,AVAILABLE,#Spark}
2019-04-03 15:10:50 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler#1d01dfa5{/api,null,AVAILABLE,#Spark}
2019-04-03 15:10:50 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler#31ff1390{/jobs/job/kill,null,AVAILABLE,#Spark}
2019-04-03 15:10:50 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler#759d81f3{/stages/stage/kill,null,AVAILABLE,#Spark}
2019-04-03 15:10:50 INFO SparkUI:54 - Bound SparkUI to 0.0.0.0, and started at http://spark-pi-1554304245069-driver-svc.default.svc:4040
2019-04-03 15:10:50 INFO SparkContext:54 - Added JAR file:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar at spark://spark-pi-1554304245069-driver-svc.default.svc:7078/jars/spark-examples_2.11-2.4.0.jar with timestamp 1554304250157
2019-04-03 15:10:51 ERROR SparkContext:91 - Error initializing SparkContext.
org.apache.spark.SparkException: External scheduler cannot be instantiated
at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2794)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:493)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2520)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:935)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:926)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:926)
at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31)
at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:849)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: GET at: https://kubernetes.default.svc/api/v1/namespaces/default/pods/spark-pi-1554304245069-driver. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. pods "spark-pi-1554304245069-driver" is forbidden: User "system:serviceaccount:default:default" cannot get resource "pods" in API group "" in the namespace "default".
at io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:470)
at io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:407)
at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:379)
at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:343)
at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleGet(OperationSupport.java:312)
Notes :
Spark 2.4
Kubernetes 1.14.0
I use Minikube for my k8s cluster

Hello I had the same issue.
I then found this Github issue
https://github.com/GoogleCloudPlatform/continuous-deployment-on-kubernetes/issues/113
That point me to the problem. I solved the issue following the Spark guide for RBAC cluster here
https://github.com/GoogleCloudPlatform/continuous-deployment-on-kubernetes/issues/113
Create a serviceaccount
kubectl create serviceaccount spark
Give the service account the edit role on the cluster
kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=default:spark --namespace=default
Run spark submit with the following flag, in order to run it with the (just created(service account)
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark
Hope it helps!

Simone's solution perfectly works for me. put more hints for newbies.
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark
Above conf sould add as first argument. Appending it at the end of spark submit command wont work.

Related

Why does Spark fail with "No File System for scheme: local"?

I am trying to submit Spark job onto the Spark Cluster which is setup on AWS EKS as
NAME READY STATUS RESTARTS AGE
spark-master-5f98d5-5kdfd 1/1 Running 0 22h
spark-worker-878598b54-jmdcv 1/1 Running 2 3d11h
spark-worker-878598b54-sz6z6 1/1 Running 2 3d11h
i am using below manifest
apiVersion: batch/v1
kind: Job
metadata:
name: spark-on-eks
spec:
template:
spec:
containers:
- name: spark
image: repo:spark-appv6
command: [
"/bin/sh",
"-c",
"/opt/spark/bin/spark-submit \
--master spark://192.XXX.XXX.XXX:7077 \
--deploy-mode cluster \
--name spark-app \
--class com.xx.migration.convert.TestCase \
--conf spark.kubernetes.container.image=repo:spark-appv6
--conf spark.kubernetes.namespace=spark-pi \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-pi \
--conf spark.executor.instances=2 \
local:///opt/spark/examples/jars/testing-jar-with-dependencies.jar"
]
serviceAccountName: spark-pi
restartPolicy: Never
backoffLimit: 4
and getting below error log
20/12/25 10:06:41 INFO Utils: Successfully started service 'driverClient' on port 34511.
20/12/25 10:06:41 INFO TransportClientFactory: Successfully created connection to /192.XXX.XXX.XXX:7077 after 37 ms (0 ms spent in bootstraps)
20/12/25 10:06:41 INFO ClientEndpoint: Driver successfully submitted as driver-20201225100641-0011
20/12/25 10:06:41 INFO ClientEndpoint: ... waiting before polling master for driver state
20/12/25 10:06:46 INFO ClientEndpoint: ... polling master for driver state
20/12/25 10:06:46 INFO ClientEndpoint: State of driver-2020134340641-0011 is ERROR
20/12/25 10:06:46 ERROR ClientEndpoint: Exception from cluster was: java.io.IOException: No FileSystem for scheme: local
java.io.IOException: No FileSystem for scheme: local
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.spark.util.Utils$.getHadoopFileSystem(Utils.scala:1853)
at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:737)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:535)
at org.apache.spark.deploy.worker.DriverRunner.downloadUserJar(DriverRunner.scala:166)
at org.apache.spark.deploy.worker.DriverRunner.prepareAndRunDriver(DriverRunner.scala:177)
at org.apache.spark.deploy.worker.DriverRunner$$anon$2.run(DriverRunner.scala:96)
20/12/25 10:06:46 INFO ShutdownHookManager: Shutdown hook called
20/12/25 10:06:46 INFO ShutdownHookManager: Deleting directory /tmp/spark-d568b819-fe8e-486f-9b6f-741rerf87cf1
Also when i try to submit job in client mode without container parameter, it gets submitted successfully but job keeps runnings and spins multiple executors on worker nodes.
Spark version- 3.0.0
When used k8s://http://Spark-Master-ip:7077 \ i get following error
20/12/28 06:59:12 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
20/12/28 06:59:12 INFO SparkKubernetesClientFactory: Auto-configuring K8S client using current context from users K8S config file
20/12/28 06:59:12 INFO KerberosConfDriverFeatureStep: You have not specified a krb5.conf file locally or via a ConfigMap. Make sure that you have the krb5.conf locally on the driver image.
20/12/28 06:59:13 WARN WatchConnectionManager: Exec Failure
java.net.SocketException: Connection reset
at java.net.SocketInputStream.read(SocketInputStream.java:209)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at okio.Okio$2.read(Okio.java:140)
at okio.AsyncTimeout$2.read(AsyncTimeout.java:237)
at okio.RealBufferedSource.indexOf(RealBufferedSource.java:354)
at okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.java:226)
at okhttp3.internal.http1.Http1Codec.readHeaderLine(Http1Codec.java:215)
at okhttp3.internal.http1.Http1Codec.readResponseHeaders(Http1Codec.java:189)
at okhttp3.internal.http.CallServerInterceptor.intercept(CallServerInterceptor.java:88)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:45)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:127)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
at io.fabric8.kubernetes.client.utils.BackwardsCompatibilityInterceptor.intercept(BackwardsCompatibilityInterceptor.java:134)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
at io.fabric8.kubernetes.client.utils.ImpersonatorInterceptor.intercept(ImpersonatorInterceptor.java:68)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
at io.fabric8.kubernetes.client.utils.HttpClientUtils.lambda$createHttpClient$3(HttpClientUtils.java:109)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:257)
at okhttp3.RealCall$AsyncCall.execute(RealCall.java:201)
at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Please help with above requirement, Thanks
Assuming you're not using spark on k8s operator the master should be:
k8s://https://kubernetes.default.svc.cluster.local
if not, you can get your master address by running:
$ kubectl cluster-info
Kubernetes master is running at https://kubernetes.docker.internal:6443
EDIT:
In spark-on-k8s cluster-mode the k8s://<api_server_host>:<k8s-apiserver-port> should be provided (note that adding the port is must!)
In spark-on-k8s the role of "master" (in spark) is played by kubernetes itself - which is responsible to allocate resources for running your driver and workers.
The real reason for the exception:
java.io.IOException: No FileSystem for scheme: local
was that a Worker of the Spark Standalone cluster wanted to downloadUserJar, but simply didn't recognize local URI scheme.
This is because Spark Standalone does not understand it and, unless I'm mistaken, the only cluster environments that support this local URI scheme are Spark on YARN and Spark on Kubernetes.
And that's where you can connect the dots why this exception was sorted out by changing the master URL. Well, the OP wanted to deploy the Spark application to Kubernetes (and followed the rules for Spark on Kubernetes) while the master URL was spark://192.XXX.XXX.XXX:7077 which is for Spark Standalone.

How to fix: pods "" is forbidden: User "system:anonymous" cannot watch resource "pods" in API group "" in the namespace "default"

I am trying to run my spark over k8, I have set up my RBAC using the below commands:
kubectl create serviceaccount spark
kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=default:spark --namespace=default
Spark command from outside of k8 cluster:
bin/spark-submit --master k8s://https://<master_ip>:6443 --deploy-mode cluster --conf spark.kubernetes.authenticate.submission.caCertFile=/usr/local/spark/spark-2.4.5-bin-hadoop2.7/ca.crt --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark --conf spark.kubernetes.container.image=bitnami/spark:latest test.py
error:
Exception in thread "main" io.fabric8.kubernetes.client.KubernetesClientException: pods "test-py-1590306482639-driver" is forbidden: User "system:anonymous" cannot watch resource "pods" in API group "" in the namespace "default"
at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onFailure(WatchConnectionManager.java:206)
at okhttp3.internal.ws.RealWebSocket.failWebSocket(RealWebSocket.java:571)
at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:198)
at okhttp3.RealCall$AsyncCall.execute(RealCall.java:206)
at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Suppressed: java.lang.Throwable: waiting here
at io.fabric8.kubernetes.client.utils.Utils.waitUntilReady(Utils.java:134)
at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager.waitUntilReady(WatchConnectionManager.java:350)
at io.fabric8.kubernetes.client.dsl.base.BaseOperation.watch(BaseOperation.java:759)
at io.fabric8.kubernetes.client.dsl.base.BaseOperation.watch(BaseOperation.java:738)
at io.fabric8.kubernetes.client.dsl.base.BaseOperation.watch(BaseOperation.java:69)
at org.apache.spark.deploy.k8s.submit.Client$$anonfun$run$1.apply(KubernetesClientApplication.scala:140)
at org.apache.spark.deploy.k8s.submit.Client$$anonfun$run$1.apply(KubernetesClientApplication.scala:140)
at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2542)
at org.apache.spark.deploy.k8s.submit.Client.run(KubernetesClientApplication.scala:140)
at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication$$anonfun$run$5.apply(KubernetesClientApplication.scala:250)
at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication$$anonfun$run$5.apply(KubernetesClientApplication.scala:241)
at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2543)
at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:241)
at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:204)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
20/05/24 07:48:04 INFO ShutdownHookManager: Shutdown hook called
20/05/24 07:48:04 INFO ShutdownHookManager: Deleting directory /tmp/spark-f0eeb957-a02e-458f-8778-21fb2307cf42
Spark Docker images source --> docker pull bitnami/spark
I am also giving my crt file here present on the master of k8 cluster. I am trying to run spark-submit command from another GCP instance.
Can someone please help me here i am stuck with this since last couple of days.
Edit
I have created another clusterrole with cluster-admin permission but still it is not working
spark.kubernetes.authenticate applies only to deploy-mode client, and you run with deploy-mode cluster
Depending on how you authenticate to the kubernetes cluster, you might need to provide different config parameters starting with spark.kubernetes.authenticate.submission (these config parameters apply when running with deploy-mode cluster). Look for ~/.kube/config file and search for the user. For example, if the user section specifies
access-token: XXXX
then pass spark.kubernetes.authenticate.submission.oauthToken

Spark Kubernetes Error : Pod Already Exists

When i try to submit my app through spark-submit i get the following error:
Please help me resolve the problem
Error:
pod name: newdriver
namespace: default
labels: spark-app-selector -> spark-a17960c79886423383797eaa77f9f706, spark-role -> driver
pod uid: 0afa41ae-4e4c-47be-86a3-1ef77739506c
creation time: 2020-05-06T14:11:29Z
service account name: spark
volumes: spark-local-dir-1, spark-conf-volume, spark-token-tks2g
node name: minikube
start time: 2020-05-06T14:11:29Z
phase: Running
container status:
container name: spark-kubernetes-driver
container image: spark-py:v3.0
container state: running
container started at: 2020-05-06T14:11:31Z
Exception in thread "main" io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://172.17.0.2:8443/api/v1/namespaces/default/pods. Message: pods "newtrydriver" already exists. Received status: Status(apiVersion=v1, code=409, details=StatusDetails(causes=[], group=null, kind=pods, name=newtrydriver, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=pods "newtrydriver" already exists, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=AlreadyExists, status=Failure, additionalProperties={}).
at io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:510)
at io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:449)
at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:413)
at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:372)
at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleCreate(OperationSupport.java:241)
at io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:819)
at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:334)
at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:330)
at org.apache.spark.deploy.k8s.submit.Client.$anonfun$run$2(KubernetesClientApplication.scala:130)
at org.apache.spark.deploy.k8s.submit.Client.$anonfun$run$2$adapted(KubernetesClientApplication.scala:129)
at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2539)
at org.apache.spark.deploy.k8s.submit.Client.run(KubernetesClientApplication.scala:129)
at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$4(KubernetesClientApplication.scala:221)
at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$4$adapted(KubernetesClientApplication.scala:215)
at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2539)
at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:215)
at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:188)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:928)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
20/05/06 14:11:34 INFO ShutdownHookManager: Shutdown hook called
20/05/06 14:11:34 INFO ShutdownHookManager: Deleting directory /tmp/spark-b7ea9c80-6040-460a-ba43-5c6e656d3039
Configuration for Submitting the job to k8s
./spark-submit
--master k8s://https://172.17.0.2:8443
--deploy-mode cluster
--conf spark.executor.instances=3
--conf spark.kubernetes.container.image=spark-py:v3.0
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark
--name newtry
--conf spark.kubernetes.driver.pod.name=newdriver
local:///opt/spark/examples/src/main/python/spark-submit-old.py
Running spark on k8s in Cluster Mode
No other Pod with the name newdriver running on my minikube
Please check if there is a Pod named newdriver in namespace default by running kubectl get pods --namespace default --show-all. You probably already have Terminated or Completed Spark Driver Pod with this name left from the previous runs. If so, delete it by running kubectl delete pod newdriver --namespace default and then try to launch new Spark job again.

org.apache.hadoop.net.ConnectTimeoutException: Call From local_computer to hadoop_server failed on socket timeout exception

We're trying to execute a local spark submission to a YARN that exists in another server:
The idea here is we're trying to externalize the submission from a jupyterlab that exists in a cloud container, into a jupyterlb in local environment; we exported the same configuration files that existed in the jupyterlab container into a local environment (laptop in our case, with a vpn to allow access).
The problem is, in the cloud container, the submission to yarn get executed succesfully.
2020-03-18 12:45:15 INFO RequestHedgingRMFailoverProxyProvider:89 - Created wrapped proxy for [rm1, rm2]
2020-03-18 12:45:15 INFO AHSProxy:42 - Connecting to Application History server at server_address
2020-03-18 12:45:15 INFO RequestHedgingRMFailoverProxyProvider:147 - Looking for the active RM in [rm1, rm2]...
2020-03-18 12:45:15 INFO RequestHedgingRMFailoverProxyProvider:171 - Found active RM [rm2]
2020-03-18 12:45:15 INFO Client:54 - Requesting a new application from cluster with 348 NodeManagers
2020-03-18 12:45:15 INFO Configuration:2752 - resource-types.xml not found
2020-03-18 12:45:15 INFO ResourceUtils:418 - Unable to find 'resource-types.xml'.
2020-03-18 12:45:15 INFO Client:54 - Verifying our application has not requested more than the maximum memory capability of the cluster (346687 MB per container)
2020-03-18 12:45:15 INFO Client:54 - Will allocate AM container, with 61952 MB memory including 5632 MB overhead
2020-03-18 12:45:15 INFO Client:54 - Setting up container launch context for our AM
2020-03-18 12:45:15 INFO Client:54 - Setting up the launch environment for our AM container
2020-03-18 12:45:15 INFO Client:54 - Preparing resources for our AM container
2020-03-18 12:45:15 INFO HadoopFSDelegationTokenProvider:54 - getting token for: DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-1993770491_1, ugi=id(auth:KERBEROS)]]
2020-03-18 12:45:15 INFO DFSClient:703 - Created token for owner: HDFS_DELEGATION_TOKEN owner=id, renewer=yarn, realUser=, issueDate=1584535515764, maxDate=1585140315764, sequenceNumber=31968916, masterKeyId=874 on ha-hdfs:edwbiprdmil
2020-03-18 12:45:16 INFO metastore:377 - Trying to connect to metastore with URI server_uri
2020-03-18 12:45:16 INFO metastore:473 - Connected to metastore.
So at the end, we get connected to the metastore.
Now, we we execute the same script, in our local machine, and after a bunch of configuration,
!spark-submit --conf spark.authenticate=True --conf spark.sql.hive.manageFilesourcePartitions=False \
--name test_spark --master yarn --deploy-mode cluster --driver-memory 55g --num-executors 20 \
--executor-memory 24g --executor-cores 30 --queue ADHOC_TACTICAL --conf spark.dynamicAllocation.enabled=true \
--conf spark.shuffle.service.enabled=true --conf spark.dynamicAllocation.maxExecutors=100 \
--conf spark.dynamicAllocation.minExecutors=10 --conf spark.dynamicAllocation.executorIdleTimeout=240s \
--conf spark.yarn.tags=tag submit.py
and the result of the execution, goes as follows:
2020-03-18 13:08:48,677] WARN Unable to load native-hadoop library for your platform... using builtin-java classes where applicable (org.apache.hadoop.util.NativeCodeLoader)
[2020-03-18 13:08:51,835] WARN The short-circuit local reads feature cannot be used because libhadoop cannot be loaded. (org.apache.hadoop.hdfs.shortcircuit.DomainSocketFactory)
[2020-03-18 13:08:52,590] INFO Timeline service address: server/ws/v1/timeline/ (org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl)
[2020-03-18 13:09:22,519] INFO Failing over to rm2 (org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider)
[2020-03-18 13:09:48,026] INFO Exception while invoking getClusterMetrics of class ApplicationClientProtocolPBClientImpl over rm2 after 1 fail over attempts. Trying to fail over after sleeping for 36102ms. (org.apache.hadoop.io.retry.RetryInvocationHandler)
org.apache.hadoop.net.ConnectTimeoutException: Call From local to server failed on socket timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=server:8032]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:751)
at org.apache.hadoop.ipc.Client.call(Client.java:1479)
at org.apache.hadoop.ipc.Client.call(Client.java:1412)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy30.getClusterMetrics(Unknown Source)
at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterMetrics(ApplicationClientProtocolPBClientImpl.java:206)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy31.getClusterMetrics(Unknown Source)
at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getYarnClusterMetrics(YarnClientImpl.java:487)
at org.apache.spark.deploy.yarn.Client$$anonfun$submitApplication$1.apply(Client.scala:165)
at org.apache.spark.deploy.yarn.Client$$anonfun$submitApplication$1.apply(Client.scala:165)
at org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54)
at org.apache.spark.deploy.yarn.Client.logInfo(Client.scala:60)
at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:164)
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1135)
at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1527)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=dcmipphm13002.edc.nam.gm.com/10.125.2.20:8032]
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:534)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495)
at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:614)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:712)
at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1528)
at org.apache.hadoop.ipc.Client.call(Client.java:1451)
... 26 more
So, any thoughts? We thought about a firewall that stops us from the connection, or a permission requires to allow the server to accept requests from local machines, but does any of you faced the same problem with the same issue?

User "system:anonymous" cannot create resource "pods" in API group "" in the namespace "default"

I'm trying to run Spark on EKS. Created an EKS cluster, added nodes and then trying to submit a Spark job from an EC2 instance.
Ran following commands for access:
kubectl create serviceaccount spark
kubectl create clusterrolebinding spark-role --clusterrole=admin --serviceaccount=default:spark --namespace=default
spark-submit command used:
bin/spark-submit \
--master k8s://https://XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX.us-east-1.eks.amazonaws.com \
--deploy-mode cluster \
--name spark-pi \
--class org.apache.spark.examples.SparkPi \
--conf spark.executor.instances=2 \
--conf spark.app.name=spark-pi \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.kubernetes.container.image=k8sspark:latest \
--conf spark.kubernetes.authenticate.submission.caCertFile=ca.pem \
local:////usr/spark-2.4.3-bin-hadoop2.7/examples/jars/spark-examples_2.11-2.4.3.jar 100000
It returns:
log4j:WARN No appenders could be found for logger (io.fabric8.kubernetes.client.Config).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
19/06/06 16:03:50 WARN WatchConnectionManager: Executor didn't terminate in time after shutdown in close(), killing it in: io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager#5b43fbf6
Exception in thread "main" io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX.us-east-1.eks.amazonaws.com/api/v1/namespaces/default/pods. Message: pods is forbidden: User "system:anonymous" cannot create resource "pods" in API group "" in the namespace "default". Received status: Status(apiVersion=v1, code=403, details=StatusDetails(causes=[], group=null, kind=pods, name=null, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=pods is forbidden: User "system:anonymous" cannot create resource "pods" in API group "" in the namespace "default", metadata=ListMeta(_continue=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=Forbidden, status=Failure, additionalProperties={}).
at io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:478)
at io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:417)
at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:381)
at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:344)
at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleCreate(OperationSupport.java:227)
at io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:787)
at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:357)
at org.apache.spark.deploy.k8s.submit.Client$$anonfun$run$2.apply(KubernetesClientApplication.scala:141)
at org.apache.spark.deploy.k8s.submit.Client$$anonfun$run$2.apply(KubernetesClientApplication.scala:140)
at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2543)
at org.apache.spark.deploy.k8s.submit.Client.run(KubernetesClientApplication.scala:140)
at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication$$anonfun$run$5.apply(KubernetesClientApplication.scala:250)
at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication$$anonfun$run$5.apply(KubernetesClientApplication.scala:241)
at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2543)
at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:241)
at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:204)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:849)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
19/06/06 16:03:50 INFO ShutdownHookManager: Shutdown hook called
19/06/06 16:03:50 INFO ShutdownHookManager: Deleting directory /tmp/spark-0060fe01-33eb-4cb4-b96b-d5be687016bc
Tried creating different clusterrole with admin privilege. But it did not work.
Any idea how to fix this one?

Resources