My macOS version: 10.15.3
Minikube version: 1.9.2
I start Minikube with the following command, without any extra configuration:
minikube start --driver=virtualbox \
  --image-repository='registry.cn-hangzhou.aliyuncs.com/google_containers' \
  --cpus 4 --memory 4096 --alsologtostderr
I downloaded spark-2.4.5-bin-hadoop2.7 from the official Spark website and built the Spark images with the following commands:
eval $(minikube docker-env)
./bin/docker-image-tool.sh -m -t 2.4.5 build
Then I run SparkPi with the following command from my local machine, where Spark 2.4.5 is stored:
kubectl create serviceaccount spark
kubectl create clusterrolebinding spark-role --clusterrole=admin --serviceaccount=default:spark --namespace=default
./bin/spark-submit \
--master k8s://https://192.168.99.104:8443 \
--deploy-mode cluster \
--name spark-pi \
--class org.apache.spark.examples.SparkPi \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.executor.instances=1 \
--conf spark.kubernetes.container.image=spark:2.4.5 \
local:///opt/spark/examples/jars/spark-examples_2.11-2.4.5.jar
I get the following error; the full log can be found at the "full log" link.
Can anyone explain this error and how to solve it?
Please check the Kubernetes version you launched with Minikube.
Spark v2.4.5 bundles fabric8 Kubernetes client v4.6.1, which is compatible with Kubernetes API versions up to 1.15.2 (see the referenced answer).
You can launch a specific Kubernetes API version with Minikube by adding the --kubernetes-version flag to the minikube start command (see the docs).
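For example, a minimal sketch of such a start command (v1.15.2 is the highest version cited as compatible above; the driver and resource flags are carried over from the question):

```shell
# Pin Minikube to a Kubernetes API version supported by the fabric8
# client (v4.6.1) bundled with Spark 2.4.5.
minikube start --driver=virtualbox \
  --kubernetes-version=v1.15.2 \
  --cpus 4 --memory 4096
```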
The issue might also be caused by an OkHttp library bug described in a comment on this question.
Another Spark image (from gcr.io/spark-operator/spark) worked for me without downgrading Kubernetes:
bin/spark-submit \
--master k8s://https://192.168.99.100:8443 \
--deploy-mode cluster \
--name spark-pi \
--class org.apache.spark.examples.SparkPi \
--conf spark.driver.cores=1 \
--conf spark.driver.memory=512m \
--conf spark.executor.instances=2 \
--conf spark.executor.memory=512m \
--conf spark.executor.cores=1 \
--conf spark.kubernetes.container.image=gcr.io/spark-operator/spark:v2.4.5 \
--conf spark.kubernetes.container.image.pullPolicy=IfNotPresent \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
local:///opt/spark/examples/jars/spark-examples_2.11-2.4.5.jar
Related
I am trying to run an ETL job using Apache Spark (Java) in a Kubernetes cluster. The application runs and data is inserted into the database (MySQL), but the application does not appear in the Spark Web UI.
The command I used for submitting the application is:
./spark-submit --class com.xxxx.etl.EtlApplication \
--name MyETL \
--master k8s://XXXXXXXXXX.xxx.us-west-2.eks.amazonaws.com:443 \
--conf "spark.kubernetes.container.image=YYYYYY.yyy.ecr.us-west-2.amazonaws.com/spark-poc:32" \
--conf "spark.kubernetes.driverEnv.SPARK_MASTER_URL=spark://my-spark-master-headless.default.svc.cluster.local:7077" \
--conf "spark.kubernetes.authenticate.driver.serviceAccountName=my-spark" \
--conf "spark.kubernetes.driver.request.cores=256m" \
--conf "spark.kubernetes.driver.limit.cores=512m" \
--conf "spark.kubernetes.executor.request.cores=256m" \
--conf "spark.kubernetes.executor.limit.cores=512m" \
--deploy-mode cluster \
local:///opt/bitnami/spark/examples/jars/EtlApplication-with-dependencies.jar 1000
I use a Jenkins job to build my code and move the jar to the /opt/bitnami/spark/examples/jars folder in the container inside the cluster.
The job shows as running when I check with kubectl get pods, and it is visible at localhost:4040 after mapping the port to localhost with kubectl port-forward pod/myetl-df26f5843cb88da7-driver 4040:4040.
I tried the same spark-submit command with the Spark example jar (which came with the Spark installation in the container):
./spark-submit --class org.apache.spark.examples.SparkPi \
--conf "spark.kubernetes.container.image=YYYYYY.yyy.ecr.us-west-2.amazonaws.com/spark-poc:5" \
--master k8s://XXXXXXXXXX.xxx.us-west-2.eks.amazonaws.com:443 \
--conf "spark.kubernetes.driverEnv.SPARK_MASTER_URL=spark://my-spark-master-headless.default.svc.cluster.local:7077" \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=my-spark \
--deploy-mode cluster \
local:///opt/bitnami/spark/examples/jars/spark-examples_2.12-3.3.0.jar 1000
This time the application is listed in the Spark Web UI. I tried several options, and on removing the line --conf "spark.kubernetes.driverEnv.SPARK_MASTER_URL=spark://my-spark-master-headless.default.svc.cluster.local:7077", the SparkPi example application is also no longer displayed in the Spark Web UI.
Am I missing something? Do I need to change my Java code to accept spark.kubernetes.driverEnv.SPARK_MASTER_URL? I tried several options but nothing works.
Thanks in advance.
I have a PySpark job present locally on my laptop. If I want to submit it to my Minikube cluster using spark-submit, how do I pass the Python file?
I'm using the following command, but it isn't working:
./spark-submit \
--master k8s://https://192.168.64.6:8443 \
--deploy-mode cluster \
--name amazon-data-review \
--conf spark.kubernetes.namespace=jupyter \
--conf spark.executor.instances=1 \
--conf spark.kubernetes.driver.limit.cores=1 \
--conf spark.executor.cores=1 \
--conf spark.executor.memory=500m \
--conf spark.kubernetes.container.image=prateek/spark-ubuntu-2.4.5 \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.kubernetes.container.image.pullPolicy=Always \
--conf spark.kubernetes.container.image.pullSecrets=dockerlogin \
--conf spark.eventLog.enabled=true \
--conf spark.eventLog.dir=s3a://prateek/spark-hs/ \
--conf spark.hadoop.fs.s3a.access.key=xxxxx \
--conf spark.hadoop.fs.s3a.secret.key=xxxxx \
--conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
--conf spark.hadoop.fs.s3a.fast.upload=true \
/Users/prateek/apache-spark/amazon_data_review.py
I am getting the following error:
python3: can't open file '/Users/prateek/apache-spark/amazon_data_review.py': [Errno 2] No such file or directory
Is it required to keep the file within the Docker image itself? Can't we run it by keeping it locally on the laptop?
Spark on Kubernetes doesn't support submitting locally stored files with spark-submit.
What you could do to make it work in cluster mode is to build a Spark Docker image based on prateek/spark-ubuntu-2.4.5 with amazon_data_review.py put inside it (e.g. using a Docker COPY /Users/prateek/apache-spark/amazon_data_review.py /amazon_data_review.py statement).
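A minimal Dockerfile sketch of that approach (the base image name comes from the question; the in-image path /amazon_data_review.py is an arbitrary choice):

```dockerfile
# Bake the PySpark script into the image so it can be addressed via local://
FROM prateek/spark-ubuntu-2.4.5
COPY amazon_data_review.py /amazon_data_review.py
```

Build it from the directory containing the script (e.g. docker build -t prateek/spark-ubuntu-2.4.5-etl .) and use the new tag in spark.kubernetes.container.image.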
Then just refer to it in the spark-submit command via the local:// file system, e.g.:
spark-submit \
--master ... \
--conf ... \
...
local:///amazon_data_review.py
The alternative is to store the file at an http(s):// or hdfs://-like accessible location.
It's solved. Running it in client mode made it work:
--deploy-mode client
I am trying to submit a Spark job to a Kubernetes cluster in cluster mode, from a client inside the cluster, with the --packages attribute so that dependencies are downloaded by the driver and executors, but it is not working: it refers to paths on the submitting client. (kubectl proxy is on.)
Here are the submit options:
/usr/local/bin/spark-submit \
--verbose \
--master=k8s://http://127.0.0.1:8001 \
--deploy-mode cluster \
--class org.apache.spark.examples.SparkPi \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.kubernetes.namespace=spark \
--conf spark.kubernetes.container.image= <...> \
--conf spark.executor.instances=2 \
--conf spark.kubernetes.pyspark.pythonVersion=3 \
--conf spark.kubernetes.driver.secretKeyRef.AWS_ACCESS_KEY_ID=datazone-s3-secret:AWS_ACCESS_KEY_ID \
--conf spark.kubernetes.driver.secretKeyRef.AWS_SECRET_ACCESS_KEY=datazone-s3-secret:AWS_SECRET_ACCESS_KEY \
--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3 \
s3.py 10
In the logs I can see that the packages refer to my local file system.
Spark config:
(spark.kubernetes.namespace,spark)
(spark.jars,file:///Users/<my username>/.ivy2/jars/com.amazonaws_aws-java-sdk-1.7.4.jar,file:///Users/<my username>/.ivy2/jars/org.apache.hadoop_hadoop-aws-2.7.3.jar,file:///Users/<my username>/.ivy2/jars/joda-time_joda-time-2.10.5.jar, ....
Did someone face this problem?
I've been experimenting with Spark (2.3.0) on Kubernetes for the past few days.
I've tested the SparkPi example from both Linux and Windows machines and found that spark-submit on Linux runs fine and gives proper results (spoiler: Pi is roughly 3.1402157010785055),
while on Windows, Spark fails with classpath issues (Could not find or load main class org.apache.spark.examples.SparkPi).
I've noticed that when running spark-submit from Linux, the classpath looks like this:
-cp ':/opt/spark/jars/*:/var/spark-data/spark-jars/spark-examples_2.11-2.3.0.jar:/var/spark-data/spark-jars/spark-examples_2.11-2.3.0.jar'
While on Windows, the logs show a slightly different version:
-cp ':/opt/spark/jars/*:/var/spark-data/spark-jars/spark-examples_2.11-2.3.0.jar;/var/spark-data/spark-jars/spark-examples_2.11-2.3.0.jar'
Note the : vs. ; in the classpath, which I think is the cause of this issue.
Any suggestions on how to spark-submit from a Windows machine without the classpath issue?
This is our spark-submit command:
bin/spark-submit \
--master k8s://https://xxx.xxx.xxx.xxx:6443 \
--deploy-mode cluster \
--name spark-pi \
--class org.apache.spark.examples.SparkPi \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.driver.memory=1G \
--conf spark.driver.cores=1 \
--conf spark.executor.instances=5 \
--conf spark.executor.cores=1 \
--conf spark.executor.memory=500m \
--conf spark.kubernetes.container.image=spark:2.3.0 \
http://xxx.xxx.xxx.xxx:9080/spark-examples_2.11-2.3.0.jar
Thanks
As a workaround, you can overwrite the environment variable SPARK_MOUNTED_CLASSPATH in the script $SPARK_HOME/kubernetes/dockerfiles/spark/entrypoint.sh so that the wrong semicolons are replaced by the correct colons.
Then rebuild the Docker image, e.g. with $SPARK_HOME/bin/docker-image-tool.sh. After that, spark-submit on Windows should work.
See also Spark issue tracker: https://issues.apache.org/jira/browse/SPARK-24599
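A sketch of that substitution (assuming the classpath value is held in SPARK_MOUNTED_CLASSPATH inside entrypoint.sh, as described above; the example value is taken from the Windows logs in the question):

```shell
# Normalize Windows-style ';' classpath separators to POSIX ':'.
SPARK_MOUNTED_CLASSPATH=':/opt/spark/jars/*:/var/spark-data/spark-jars/spark-examples_2.11-2.3.0.jar;/var/spark-data/spark-jars/spark-examples_2.11-2.3.0.jar'
SPARK_MOUNTED_CLASSPATH="$(printf '%s' "$SPARK_MOUNTED_CLASSPATH" | tr ';' ':')"
echo "$SPARK_MOUNTED_CLASSPATH"  # every separator is now ':'
```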
I am running a sample Spark job in a Kubernetes cluster with the following command:
bin/spark-submit \
--deploy-mode cluster \
--class org.apache.spark.examples.SparkPi \
--master k8s://https://XXXXX \
--kubernetes-namespace sidartha-spark-cluster \
--conf spark.executor.instances=2 \
--conf spark.app.name=spark-pi \
--conf spark.kubernetes.driver.docker.image=kubespark/spark-driver:v2.1.0-kubernetes-0.1.0-rc1 \
--conf spark.kubernetes.executor.docker.image=kubespark/spark-executor:v2.1.0-kubernetes-0.1.0-rc1 \
examples/jars/spark-examples_2.11-2.1.0-k8s-0.1.0-SNAPSHOT.jar 1000
I am building Spark from apache-spark-on-k8s.
I am not able to find the jar containing the org.apache.spark.deploy.kubernetes.submit.Client class.
This issue is resolved: we need to build spark/resource-managers/kubernetes from source.