AWS EKS Spark 3.0, Hadoop 3.2 Error - NoClassDefFoundError: com/amazonaws/services/s3/model/MultiObjectDeleteException

I'm running JupyterHub on EKS and want to leverage EKS IRSA functionality to run Spark workloads on K8s. I had prior experience using Kube2IAM; however, I'm now planning to move to IRSA.
This error is not caused by IRSA, as the service accounts are attached perfectly fine to the driver and executor pods and I can access S3 via the CLI and SDK from both. The issue is with accessing S3 using Spark on Spark 3.0 / Hadoop 3.2:
Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext. : java.lang.NoClassDefFoundError: com/amazonaws/services/s3/model/MultiObjectDeleteException
I'm using the following versions:
APACHE_SPARK_VERSION=3.0.1
HADOOP_VERSION=3.2
aws-java-sdk-1.11.890
hadoop-aws-3.2.0
Python 3.7.3
I tested with a different version as well:
aws-java-sdk-1.11.563.jar
Please share a solution if you have come across this issue.
PS: This is not an IAM policy error either; the IAM policies are perfectly fine.

Finally, all the issues were solved with the jars below:
hadoop-aws-3.2.0.jar
aws-java-sdk-bundle-1.11.874.jar (https://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk-bundle/1.11.874)
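If these jars aren't already baked into your Spark image, one way to get the pair onto the classpath is to pull them from Maven Central into Spark's jars directory when building the image. This is only a minimal sketch, assuming Spark is installed at /opt/spark inside the container; adjust paths and versions to your setup:
# Fetch the hadoop-aws / aws-java-sdk-bundle pair into Spark's classpath
# (assumed install location: /opt/spark).
SPARK_HOME=/opt/spark
curl -fSL -o "$SPARK_HOME/jars/hadoop-aws-3.2.0.jar" \
  https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.2.0/hadoop-aws-3.2.0.jar
curl -fSL -o "$SPARK_HOME/jars/aws-java-sdk-bundle-1.11.874.jar" \
  https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.874/aws-java-sdk-bundle-1.11.874.jar
# If an older or non-bundle aws-java-sdk jar is also on the classpath, remove it
# so only the bundle is picked up.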
For anyone trying to run Spark on EKS using IRSA, this is the working Spark config:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("pyspark-data-analysis-1") \
    .config("spark.kubernetes.driver.master", "k8s://https://xxxxxx.gr7.ap-southeast-1.eks.amazonaws.com:443") \
    .config("spark.kubernetes.namespace", "jupyter") \
    .config("spark.kubernetes.container.image", "xxxxxx.dkr.ecr.ap-southeast-1.amazonaws.com/spark-ubuntu-3.0.1") \
    .config("spark.kubernetes.container.image.pullPolicy", "Always") \
    .config("spark.kubernetes.authenticate.driver.serviceAccountName", "spark") \
    .config("spark.kubernetes.authenticate.executor.serviceAccountName", "spark") \
    .config("spark.kubernetes.executor.annotation.eks.amazonaws.com/role-arn", "arn:aws:iam::xxxxxx:role/spark-irsa") \
    .config("spark.hadoop.fs.s3a.aws.credentials.provider", "com.amazonaws.auth.WebIdentityTokenCredentialsProvider") \
    .config("spark.kubernetes.authenticate.submission.caCertFile", "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt") \
    .config("spark.kubernetes.authenticate.submission.oauthTokenFile", "/var/run/secrets/kubernetes.io/serviceaccount/token") \
    .config("spark.hadoop.fs.s3a.multiobjectdelete.enable", "false") \
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
    .config("spark.hadoop.fs.s3a.fast.upload", "true") \
    .config("spark.executor.instances", "1") \
    .config("spark.executor.cores", "3") \
    .config("spark.executor.memory", "10g") \
    .getOrCreate()
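Before suspecting the jars or the config above, it can help to confirm that IRSA itself is wired up. A couple of quick checks, as a sketch (the namespace and service account names are the ones from this config; the driver pod name is a placeholder):
# The service account should carry the IRSA role annotation:
kubectl -n jupyter get serviceaccount spark -o yaml | grep eks.amazonaws.com/role-arn
# The IRSA webhook injects these env vars into pods using that service account;
# com.amazonaws.auth.WebIdentityTokenCredentialsProvider reads them:
kubectl -n jupyter exec <driver-pod-name> -- env | grep -E 'AWS_ROLE_ARN|AWS_WEB_IDENTITY_TOKEN_FILE'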

You can check out this blog (https://medium.com/swlh/how-to-perform-a-spark-submit-to-amazon-eks-cluster-with-irsa-50af9b26cae), which uses:
Spark 2.4.4
Hadoop 2.7.3
AWS SDK 1.11.834
The example spark-submit command is:
/opt/spark/bin/spark-submit \
--master=k8s://https://4A5<i_am_tu>545E6.sk1.ap-southeast-1.eks.amazonaws.com \
--deploy-mode cluster \
--name spark-pi \
--class org.apache.spark.examples.SparkPi \
--conf spark.kubernetes.driver.pod.name=spark-pi-driver \
--conf spark.kubernetes.container.image=vitamingaugau/spark:spark-2.4.4-irsa \
--conf spark.kubernetes.namespace=spark-pi \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-pi \
--conf spark.kubernetes.authenticate.executor.serviceAccountName=spark-pi \
--conf spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.WebIdentityTokenCredentialsProvider \
--conf spark.kubernetes.authenticate.submission.caCertFile=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt \
--conf spark.kubernetes.authenticate.submission.oauthTokenFile=/var/run/secrets/kubernetes.io/serviceaccount/token \
local:///opt/spark/examples/target/scala-2.11/jars/spark-examples_2.11-2.4.4.jar 20000
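A prerequisite the spark-submit above assumes is a spark-pi service account that carries the IRSA role annotation and is allowed to manage executor pods. A rough sketch of that setup (the role ARN is a placeholder, and the edit ClusterRole is just one common choice for the RBAC part):
kubectl create namespace spark-pi
kubectl -n spark-pi create serviceaccount spark-pi
kubectl -n spark-pi annotate serviceaccount spark-pi \
  eks.amazonaws.com/role-arn=arn:aws:iam::xxxxxx:role/spark-irsa
# Let the driver create and delete executor pods in its namespace:
kubectl -n spark-pi create rolebinding spark-pi-edit \
  --clusterrole=edit --serviceaccount=spark-pi:spark-pi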

Related

Multiple Spark sessions in one job submitted on Kubernetes

Can we start and stop multiple Spark sessions on Kubernetes within a single submitted job?
For example, if I submit one job using this:
bin/spark-submit \
--master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
--deploy-mode cluster \
--name spark-pi \
--class org.apache.spark.examples.SparkPi \
--conf spark.executor.instances=5 \
--conf spark.kubernetes.container.image=<spark-image> \
local:///path/to/examples.jar
In my Python code, can I start and stop Spark sessions?
For example:
from pyspark.sql import SparkSession

# start a spark session
session = SparkSession \
    .builder \
    .appName(appname) \
    .getOrCreate()
## Doing some operations using spark
session.stop()

## some python code.

# start another spark session
session = SparkSession \
    .builder \
    .appName(appname) \
    .getOrCreate()
## Doing some operations using spark
session.stop()
Is this possible or not?

How to submit PySpark job on Kubernetes (minikube) using spark-submit

I have a PySpark job present locally on my laptop. If I want to submit it to my minikube cluster using spark-submit, any idea how to pass the Python file?
I'm using the following command, but it isn't working:
./spark-submit \
--master k8s://https://192.168.64.6:8443 \
--deploy-mode cluster \
--name amazon-data-review \
--conf spark.kubernetes.namespace=jupyter \
--conf spark.executor.instances=1 \
--conf spark.kubernetes.driver.limit.cores=1 \
--conf spark.executor.cores=1 \
--conf spark.executor.memory=500m \
--conf spark.kubernetes.container.image=prateek/spark-ubuntu-2.4.5 \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.kubernetes.container.image.pullPolicy=Always \
--conf spark.kubernetes.container.image.pullSecrets=dockerlogin \
--conf spark.eventLog.enabled=true \
--conf spark.eventLog.dir=s3a://prateek/spark-hs/ \
--conf spark.hadoop.fs.s3a.access.key=xxxxx \
--conf spark.hadoop.fs.s3a.secret.key=xxxxx \
--conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
--conf spark.hadoop.fs.s3a.fast.upload=true \
/Users/prateek/apache-spark/amazon_data_review.py
I'm getting the following error:
python3: can't open file '/Users/prateek/apache-spark/amazon_data_review.py': [Errno 2] No such file or directory
Is it required to keep the file within the Docker image itself? Can't we run it by keeping it locally on the laptop?
Spark on Kubernetes doesn't support submitting locally stored files with spark-submit.
What you could do to make it work in cluster mode is build a Spark Docker image based on prateek/spark-ubuntu-2.4.5 with amazon_data_review.py put inside it (e.g. using a Docker COPY /Users/prateek/apache-spark/amazon_data_review.py /amazon_data_review.py statement).
Then just refer to it in the spark-submit command using the local:// file system, e.g.:
spark-submit \
--master ... \
--conf ... \
...
local:///amazon_data_review.py
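The image-build step for that approach might look like this; it is only a sketch (the derived tag prateek/spark-ubuntu-2.4.5-app and the build-context layout are assumptions):
# Build a derived image that carries the PySpark script; run this from a
# directory containing amazon_data_review.py.
cat > Dockerfile <<'EOF'
FROM prateek/spark-ubuntu-2.4.5
COPY amazon_data_review.py /amazon_data_review.py
EOF
docker build -t prateek/spark-ubuntu-2.4.5-app .
docker push prateek/spark-ubuntu-2.4.5-app
# Then use this tag as spark.kubernetes.container.image and point spark-submit
# at local:///amazon_data_review.py as shown above.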
The alternative is to store the file at an http(s):// or hdfs://-accessible location.
It's solved. Running it in client mode helped:
--deploy-mode client
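In other words, a condensed sketch of the same command (the elided flags stay as in the original command); in client mode the driver runs in the spark-submit process on the laptop, so the local path is readable:
./spark-submit \
  --master k8s://https://192.168.64.6:8443 \
  --deploy-mode client \
  --conf ... \
  /Users/prateek/apache-spark/amazon_data_review.py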

Run Spark example on Kubernetes failed

My Mac OS/X Version : 10.15.3
Minikube Version: 1.9.2
I start minikube with the following command, without any extra configuration:
minikube start --driver=virtualbox \
  --image-repository='registry.cn-hangzhou.aliyuncs.com/google_containers' --cpus 4 --memory 4096 --alsologtostderr
And I downloaded spark-2.4.5-bin-hadoop2.7 from the official Spark website and built the Spark images with the following commands:
eval $(minikube docker-env)
./bin/docker-image-tool.sh -m -t 2.4.5 build
Then I run SparkPi using the following commands from my local machine, where Spark 2.4.5 is stored:
kubectl create serviceaccount spark
kubectl create clusterrolebinding spark-role --clusterrole=admin --serviceaccount=default:spark --namespace=default
./bin/spark-submit \
--master k8s://https://192.168.99.104:8443 \
--deploy-mode cluster \
--name spark-pi \
--class org.apache.spark.examples.SparkPi \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.executor.instances=1 \
--conf spark.kubernetes.container.image=spark:2.4.5 \
local:///opt/spark/examples/jars/spark-examples_2.11-2.4.5.jar
I get an error when submitting; the full log can be found at the "full log" link.
Can anyone explain this error and how to solve it?
Please check the Kubernetes version you launched with Minikube.
Spark v2.4.5 ships fabric8 Kubernetes client v4.6.1, which is compatible with the Kubernetes API only up to 1.15.2 (refer to the linked answer).
You can launch a specific Kubernetes API version with Minikube by adding the --kubernetes-version flag to the minikube start command (refer to the docs).
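For example, a sketch of pinning Minikube to a version inside that range (v1.15.2 here simply mirrors the ceiling mentioned above; pick whatever suits your setup):
# Start Minikube on a Kubernetes version the Spark 2.4.5 fabric8 client supports.
minikube start --driver=virtualbox --kubernetes-version=v1.15.2 --cpus 4 --memory 4096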
The issue might also be caused by the OkHttp library bug described in the comments of this question.
Another spark image (from gcr.io/spark-operator/spark) worked for me, without downgrading the version of Kubernetes.
bin/spark-submit \
--master k8s://https://192.168.99.100:8443 \
--deploy-mode cluster \
--name spark-pi \
--class org.apache.spark.examples.SparkPi \
--conf spark.driver.cores=1 \
--conf spark.driver.memory=512m \
--conf spark.executor.instances=2 \
--conf spark.executor.memory=512m \
--conf spark.executor.cores=1 \
--conf spark.kubernetes.container.image=gcr.io/spark-operator/spark:v2.4.5 \
--conf spark.kubernetes.container.image.pullPolicy=IfNotPresent \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
local:///opt/spark/examples/jars/spark-examples_2.11-2.4.5.jar

Kubernetes spark-submit in cluster mode: --packages not working as expected

I am trying to submit a Spark job to a Kubernetes cluster in cluster mode, from a client inside the cluster, with the --packages attribute so that dependencies are downloaded by the driver and executors, but it is not working: it refers to a path on the submitting client. (kubectl proxy is on.)
Here are the submit options:
/usr/local/bin/spark-submit \
--verbose \
--master=k8s://http://127.0.0.1:8001 \
--deploy-mode cluster \
--class org.apache.spark.examples.SparkPi \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.kubernetes.namespace=spark \
--conf spark.kubernetes.container.image= <...> \
--conf spark.executor.instances=2 \
--conf spark.kubernetes.pyspark.pythonVersion=3 \
--conf spark.kubernetes.driver.secretKeyRef.AWS_ACCESS_KEY_ID=datazone-s3-secret:AWS_ACCESS_KEY_ID \
--conf spark.kubernetes.driver.secretKeyRef.AWS_SECRET_ACCESS_KEY=datazone-s3-secret:AWS_SECRET_ACCESS_KEY \
--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3 \
s3.py 10
In the logs I can see that the packages refer to my local file system.
Spark config:
(spark.kubernetes.namespace,spark)
(spark.jars,file:///Users/<my username>/.ivy2/jars/com.amazonaws_aws-java-sdk-1.7.4.jar,file:///Users/<my username>/.ivy2/jars/org.apache.hadoop_hadoop-aws-2.7.3.jar,file:///Users/<my username>/.ivy2/jars/joda-time_joda-time-2.10.5.jar, ....
Has anyone faced this problem?

java.lang.ClassNotFoundException: org.apache.spark.deploy.kubernetes.submit.Client

I am running a sample Spark job in a Kubernetes cluster with the following command:
bin/spark-submit \
--deploy-mode cluster \
--class org.apache.spark.examples.SparkPi \
--master k8s://https://XXXXX \
--kubernetes-namespace sidartha-spark-cluster \
--conf spark.executor.instances=2 \
--conf spark.app.name=spark-pi \
--conf spark.kubernetes.driver.docker.image=kubespark/spark-driver:v2.1.0-kubernetes-0.1.0-rc1 \
--conf spark.kubernetes.executor.docker.image=kubespark/spark-executor:v2.1.0-kubernetes-0.1.0-rc1 \
examples/jars/spark-examples_2.11-2.1.0-k8s-0.1.0-SNAPSHOT.jar 1000
I am building Spark from apache-spark-on-k8s.
I am not able to find the jar containing the org.apache.spark.deploy.kubernetes.submit.Client class.
This issue is resolved. We need to build the spark/resource-manager/kubernetes module from source.
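As a rough sketch of that build, assuming the apache-spark-on-k8s fork exposes a kubernetes Maven profile (check the fork's README for the exact flags):
# Build a distribution of the fork with its Kubernetes resource manager included
# (profile name assumed; verify against the fork's documentation).
git clone https://github.com/apache-spark-on-k8s/spark.git
cd spark
dev/make-distribution.sh --tgz -Phadoop-2.7 -Pkubernetes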
