file access error running spark on kubernetes - apache-spark

I followed the Spark on Kubernetes blog but got to a point where it runs the job but fails inside the worker pods with an file access error.
2018-05-22 22:20:51 WARN TaskSetManager:66 - Lost task 0.0 in stage 0.0 (TID 0, 172.17.0.15, executor 3): java.nio.file.AccessDeniedException: ./spark-examples_2.11-2.3.0.jar
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:84)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
at sun.nio.fs.UnixCopyFile.copyFile(UnixCopyFile.java:243)
at sun.nio.fs.UnixCopyFile.copy(UnixCopyFile.java:581)
at sun.nio.fs.UnixFileSystemProvider.copy(UnixFileSystemProvider.java:253)
at java.nio.file.Files.copy(Files.java:1274)
at org.apache.spark.util.Utils$.org$apache$spark$util$Utils$$copyRecursive(Utils.scala:632)
at org.apache.spark.util.Utils$.copyFile(Utils.scala:603)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:478)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:755)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:747)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:747)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:312)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
The command i use to run the SparkPi example is :
$DIR/$SPARKVERSION/bin/spark-submit \
--master=k8s://https://192.168.99.101:8443 \
--deploy-mode=cluster \
--conf spark.executor.instances=3 \
--name spark-pi \
--class org.apache.spark.examples.SparkPi \
--conf spark.kubernetes.container.image=172.30.1.1:5000/myapp/spark-docker:latest \
--conf spark.kubernetes.namespace=$namespace \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.kubernetes.driver.pod.name=spark-pi-driver \
local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar
On working through the code it seems like the spark jar files are being copied to an internal location inside the container. But:
Should this happen since they are local and are already there
If the do need to be copied to another location in the container how do i make this part of the container writable since it is created by the master node.
RBAC has been setup as follows: (oc get rolebinding -n myapp)
NAME ROLE USERS GROUPS SERVICE ACCOUNTS SUBJECTS
admin /admin developer
spark-role /edit spark
And the service account (oc get sa -n myapp)
NAME SECRETS AGE
builder 2 18d
default 2 18d
deployer 2 18d
pusher 2 13d
spark 2 12d
Or am i doing something silly here?
My kubernetes system is running inside Docker Machine (via virtualbox on osx)
I am using:
openshift v3.9.0+d0f9aed-12
kubernetes v1.9.1+a0ce1bc657
Any hints on solving this greatly appreciated?

I know this is an 5m old post, but it looks that there's not enough information related to this issue around, so I'm posting my answer in case it can help someone.
It looks like you are not running the process inside the container as root, if that's the case you can take a look at this link (https://github.com/minishift/minishift/issues/2836).
Since it looks like you are also using openshift you can do:
oc adm policy add-scc-to-user anyuid -z spark-sa -n spark
In my case I'm using kubernetes and I need to use runAsUser:XX. Thus I gave group read/write access to /opt/spark inside the container and that solved the issue, just add the following line to resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile.
RUN chmod g+rwx -R /opt/spark
Of course you have to re-build the docker images manually or using the provided script like shown below.
./bin/docker-image-tool.sh -r YOUR_REPO -t YOUR_TAG build
./bin/docker-image-tool.sh -r YOUR_REPO -t YOUR_TAG push

Related

Submitting a spark job to a kubernetes cluster using bitnami spark docker image

I have a local setup with minikube and I'm trying to use spark-submit to submit a job to a local Kubernetes. The idea here is to use my local machine's spark-submit to submit to the kubernetes master which will handle creating a spark cluster and taking it down when the work is finished.
I'm using the image bitnami/spark:3.2.1 and the following command:
./bin/spark-submit --master k8s://https://127.0.0.1:52388 \
--deploy-mode cluster \
--conf spark.executor.instances=1 \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.kubernetes.container.image=bitnami/spark:3.2.1 \
--class org.apache.spark.examples.JavaSparkPi \
--name spark-pi \
local:///opt/bitnami/spark/examples/jars/spark-examples_2.12-3.2.1.jar
This does not seem to work and the logs in the spark driver are:
[...]
Caused by: java.io.IOException: Failed to connect to spark-master:7077
[...]
and
[...]
Caused by: java.net.UnknownHostException: spark-master
[...]
If I use the docker-image-tool.sh to build a custom spark docker image with the python bindings and use that, it works perfectly. How is bitnami's image special and why doesn't it recognise that the master in this case is kubernetes?
I also tried using the option conf spark.kubernetes.driverEnv.SPARK_MASTER_URL=spark://127.0.0.1:7077 when submitting but the error was similar to above.

How to fix: pods "" is forbidden: User "system:anonymous" cannot watch resource "pods" in API group "" in the namespace "default"

I am trying to run my spark over k8, I have set up my RBAC using the below commands:
kubectl create serviceaccount spark
kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=default:spark --namespace=default
Spark command from outside of k8 cluster:
bin/spark-submit --master k8s://https://<master_ip>:6443 --deploy-mode cluster --conf spark.kubernetes.authenticate.submission.caCertFile=/usr/local/spark/spark-2.4.5-bin-hadoop2.7/ca.crt --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark --conf spark.kubernetes.container.image=bitnami/spark:latest test.py
error:
Exception in thread "main" io.fabric8.kubernetes.client.KubernetesClientException: pods "test-py-1590306482639-driver" is forbidden: User "system:anonymous" cannot watch resource "pods" in API group "" in the namespace "default"
at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onFailure(WatchConnectionManager.java:206)
at okhttp3.internal.ws.RealWebSocket.failWebSocket(RealWebSocket.java:571)
at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:198)
at okhttp3.RealCall$AsyncCall.execute(RealCall.java:206)
at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Suppressed: java.lang.Throwable: waiting here
at io.fabric8.kubernetes.client.utils.Utils.waitUntilReady(Utils.java:134)
at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager.waitUntilReady(WatchConnectionManager.java:350)
at io.fabric8.kubernetes.client.dsl.base.BaseOperation.watch(BaseOperation.java:759)
at io.fabric8.kubernetes.client.dsl.base.BaseOperation.watch(BaseOperation.java:738)
at io.fabric8.kubernetes.client.dsl.base.BaseOperation.watch(BaseOperation.java:69)
at org.apache.spark.deploy.k8s.submit.Client$$anonfun$run$1.apply(KubernetesClientApplication.scala:140)
at org.apache.spark.deploy.k8s.submit.Client$$anonfun$run$1.apply(KubernetesClientApplication.scala:140)
at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2542)
at org.apache.spark.deploy.k8s.submit.Client.run(KubernetesClientApplication.scala:140)
at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication$$anonfun$run$5.apply(KubernetesClientApplication.scala:250)
at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication$$anonfun$run$5.apply(KubernetesClientApplication.scala:241)
at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2543)
at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:241)
at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:204)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
20/05/24 07:48:04 INFO ShutdownHookManager: Shutdown hook called
20/05/24 07:48:04 INFO ShutdownHookManager: Deleting directory /tmp/spark-f0eeb957-a02e-458f-8778-21fb2307cf42
Spark Docker images source --> docker pull bitnami/spark
I am also giving my crt file here present on the master of k8 cluster. I am trying to run spark-submit command from another GCP instance.
Can someone please help me here i am stuck with this since last couple of days.
Edit
I have created another clusterrole with cluster-admin permission but still it is not working
spark.kubernetes.authenticate applies only to deploy-mode client, and you run with deploy-mode cluster
Depending on how you authenticate to the kubernetes cluster, you might need to provide different config parameters starting with spark.kubernetes.authenticate.submission (these config parameters apply when running with deploy-mode cluster). Look for ~/.kube/config file and search for the user. For example, if the user section specifies
access-token: XXXX
then pass spark.kubernetes.authenticate.submission.oauthToken

Spark's example throws FileNotFoundException in client mode

I have: Ubuntu 14.04, Hadoop 2.7.7, Spark 2.2.0.
I just installed everything.
When I try to run the the Spark's example:
bin/spark-submit --deploy-mode client \
--class org.apache.spark.examples.SparkPi \
examples/jars/spark-examples_2.11-2.2.0.jar 10
I get the following error:
INFO yarn.Client:
client token: N/A
diagnostics: Application application_1552490646290_0007 failed 2 times due to AM Container for
appattempt_1552490646290_0007_000002 exited with exitCode: -1000 For
more detailed output, check application tracking
page:http://ip-123-45-67-89:8088/cluster/app/application_1552490646290_0007 Then,
click on links to logs of each attempt. Diagnostics: File
file:/tmp/spark-f5879f52-6777-481a-8ecf-bbb55e376901/__spark_libs__6948713644593068670.zip
does not exist java.io.FileNotFoundException: File
file:/tmp/spark-f5879f52-6777-481a-8ecf-bbb55e376901/__spark_libs__6948713644593068670.zip
does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:428)
at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:253)
at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:63)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:361)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:421)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:358)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:62)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:473)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1152)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:622)
at java.lang.Thread.run(Thread.java:748)
I get the same error both in client mode and cluster mode.
It seems that it fails by loading the spark libs. As Daniel points out, it could be related to your read rights. Besides, it could be related to running out of disk space.
However, in our case to avoid transfer latencies to the master and read/write rights in the local machine, we put the spark-libs in the HDFS of the Yarn cluster and then, we point them in the spark.yarn.archive property.
jar cv0f spark-libs.jar -C $SPARK_HOME/jars/ .
export HADOOP_USER_NAME=hadoop
hadoop fs -mkdir -p /apps/spark/
hadoop fs -put -f ${SPARK_HOME}/spark-libs.jar /apps/spark/
# spark-defaults.conf
spark.yarn.archive hdfs:///apps/spark/spark-libs.jar
First, Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an hdfs:// path or a file:// path that is present on all nodes.
Second , If you run on YARN mode you have point your master to yarn submitting-applications, and put your jar file in hdfs
# Run on a YARN cluster
# Connect to a YARN cluster in client or cluster mode depending on the value
# of --deploy-mode. The cluster location will be found based on the HADOOP_CONF_DIR
# or YARN_CONF_DIR variable.
export HADOOP_CONF_DIR=XXX
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode cluster \ # can be client for client mode
hdfs://path/to/spark-examples.jar
1000

Exception in thread "main" org.apache.spark.SparkException: Must specify the driver container image

I am trying to do spark-submit on minikube(Kubernetes) from local machine CLI with command
spark-submit --master k8s://https://127.0.0.1:8001 --name cfe2
--deploy-mode cluster --class com.yyy.Test --conf spark.executor.instances=2 --conf spark.kubernetes.container.image docker.io/anantpukale/spark_app:1.1 local://spark-0.0.1-SNAPSHOT.jar
I have a simple spark job jar built on verison 2.3.0. I also have containerized it in docker and minikube up and running on virtual box.
Below is exception stack:
Exception in thread "main" org.apache.spark.SparkException: Must specify the driver container image at org.apache.spark.deploy.k8s.submit.steps.BasicDriverConfigurationStep$$anonfun$3.apply(BasicDriverConfigurationStep.scala:51) at org.apache.spark.deploy.k8s.submit.steps.BasicDriverConfigurationStep$$anonfun$3.apply(BasicDriverConfigurationStep.scala:51) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.deploy.k8s.submit.steps.BasicDriverConfigurationStep.<init>(BasicDriverConfigurationStep.scala:51)
at org.apache.spark.deploy.k8s.submit.DriverConfigOrchestrator.getAllConfigurationSteps(DriverConfigOrchestrator.scala:82)
at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication$$anonfun$run$5.apply(KubernetesClientApplication.scala:229)
at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication$$anonfun$run$5.apply(KubernetesClientApplication.scala:227)
at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2585)
at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:227)
at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:192)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:879)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:227)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) 2018-04-06 13:33:52 INFO ShutdownHookManager:54 - Shutdown hook called 2018-04-06 13:33:52 INFO ShutdownHookManager:54 - Deleting directory C:\Users\anant\AppData\Local\Temp\spark-6da93408-88cb-4fc7-a2de-18ed166c3c66
Look like bug with default value for parameters spark.kubernetes.driver.container.image, that must be spark.kubernetes.container.image. So try specify driver/executor container image directly:
spark.kubernetes.driver.container.image
spark.kubernetes.executor.container.image
From the source code, the only available conf options are:
spark.kubernetes.container.image
spark.kubernetes.driver.container.image
spark.kubernetes.executor.container.image
And I noticed that Spark 2.3.0 has changed a lot in terms of k8s implementation compared to 2.2.0. For example, instead of specifying driver and executor separately, the official starter's guide is to use a single image given to spark.kubernetes.container.image.
See if this works:
spark-submit \
--master k8s://http://127.0.0.1:8001 \
--name cfe2 \
--deploy-mode cluster \
--class com.oracle.Test \
--conf spark.executor.instances=2 \
--conf spark.kubernetes.container.image=docker/anantpukale/spark_app:1.1 \
--conf spark.kubernetes.authenticate.submission.oauthToken=YOUR_TOKEN \
--conf spark.kubernetes.authenticate.submission.caCertFile=PATH_TO_YOUR_CERT \
local://spark-0.0.1-SNAPSHOT.jar
The token and cert can be found on k8s dashboard. Follow the instructions to make Spark 2.3.0 compatible docker images.

Spark 2 broadcast inside Docker uses random port

I'm trying to run Spark 2 inside Docker containers and it is being kind of hard for me. So long I think have come a long way, having been able to deploy a Standalone master to host A and a worker to host B. I configured the /etc/hosts of the Docker container so master and worker can access their respective hosts. They see each other and everything looks to be fine.
I deploy master with this set of opened ports:
docker run -ti --rm \
--name sparkmaster \
--hostname=host.a \
--add-host host.a:xx.xx.xx.xx \
--add-host host.b:xx.xx.xx.xx \
-p 18080:18080 \
-p 7001:7001 \
-p 7002:7002 \
-p 7003:7003 \
-p 7004:7004 \
-p 7005:7005 \
-p 7006:7006 \
-p 4040:4040 \
-p 7077:7077 \
malkab/spark:ablative_alligator
Then I submit this Spark config options:
export SPARK_MASTER_OPTS="-Dspark.driver.port=7001
-Dspark.fileserver.port=7002
-Dspark.broadcast.port=7003 -Dspark.replClassServer.port=7004
-Dspark.blockManager.port=7005 -Dspark.executor.port=7006
-Dspark.ui.port=4040 -Dspark.broadcast.blockSize=4096"
export SPARK_MASTER_PORT=7077
export SPARK_MASTER_WEBUI_PORT=18080
I deploy the worker on its host with the SPARK_WORKER_XXX version of the aforementioned env variables, and with an analogous docker run.
Then I enter into the master container and spark-submit a job:
spark-submit --master spark://host.a:7077 /src/Test06.py
Everything starts fine: I can see the job being distributed to the worker. But when the Block Manager tries to register the block, it seems to be using a random port, which is not accesible outside the container:
INFO BlockManagerMasterEndpoint: Registering block manager host.b:39673 with 366.3 MB RAM, BlockManagerId(0, host.b, 39673, None)
Then I get this error:
WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, host.b, executor 0): java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_0_piece0 of broadcast_0
And the worker reports:
java.io.IOException: Failed to connect to host.a/xx.xx.xx.xx:55638
I've been able so far to avoid the usage of random ports with the previous settings, but this Block Manager port in particular seems to be random. I though it was controlled by the spark.blockManager.port, but this seems not to be the case. I've reviewed all configuration options to no avail.
So, final question: what is this port and can be avoided it to be random?
Thanks in advance.
EDIT:
This is the executor launched. As you can see, random ports are open both on the driver (which is in the same host as the master) and the worker:
Random ports for executors
I understand that a worker instantiates many executors, and that they should have a port assigned. Is there any way to, at least, limit the range of ports for this communications? How do the people handle Spark behind tight firewalls then?
EDIT 2:
I finally got it working. If I pass, as I mentioned before, the property spark.driver.host=host.a in the SPARK_MASTER_OPTS and SPARK_WORKER_OPTS it won't work, however, if I configure it in my code at the configuration of the context:
conf = SparkConf().setAppName("Test06") \
.set("spark.driver.port", "7001") \
.set("spark.driver.host", "host.a") \
.set("spark.fileserver.port", "7002") \
.set("spark.broadcast.port", "7003") \
.set("spark.replClassServer.port", "7004") \
.set("spark.blockManager.port", "7005") \
.set("spark.executor.port", "7006") \
.set("spark.ui.port", "4040") \
.set("spark.broadcast.blockSize", "4096") \
.set("spark.local.dir", "/tmp") \
.set("spark.driver.extraClassPath", "/classes/postgresql.jar") \
.set("spark.executor.extraClassPath", "/classes/postgresql.jar")
it somehow worked. Why is the setting not honored in SPARK_MASTER_OPTS or SPARK_WORKER_OPTS?

Resources