How to use PySpark in yarn-cluster mode from code - apache-spark

This is my code:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("yarn") \
    .config("hive.metastore.uris", settings.HIVE_URL) \
    .config("spark.executor.memory", "5g") \
    .config("spark.driver.memory", "2g") \
    .config("spark.submit.deployMode", "cluster") \
    .enableHiveSupport() \
    .appName(app_name) \
    .getOrCreate()
I want to use this approach to submit a task to a YARN cluster.
But it throws an exception:
Cluster deploy mode is not applicable to Spark shells.
What should I do?
I need to use cluster mode, not client mode.
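The exception usually means Spark treated the program as a shell: when a PySpark script is started with plain python (or from a notebook), the driver is already running locally, so a cluster deploy mode set in the builder cannot take effect. The usual route is to let spark-submit choose the mode, e.g. spark-submit --master yarn --deploy-mode cluster --driver-memory 2g my_job.py (my_job.py is a hypothetical file name). A minimal sketch of the script, reusing settings.HIVE_URL from the question:
import settings                      # same settings module as in the question
from pyspark.sql import SparkSession

app_name = "yarn-cluster-sketch"     # placeholder app name

# Master and deploy mode are supplied by spark-submit, not hard-coded here;
# driver memory is likewise best given to spark-submit (--driver-memory).
spark = SparkSession.builder \
    .config("hive.metastore.uris", settings.HIVE_URL) \
    .config("spark.executor.memory", "5g") \
    .enableHiveSupport() \
    .appName(app_name) \
    .getOrCreate()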

Related

Multiple Spark sessions in one job submitted on Kubernetes

Can we start and stop multiple Spark sessions in Kubernetes within a single submitted job?
For example, if I submit one job using this:
bin/spark-submit \
--master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
--deploy-mode cluster \
--name spark-pi \
--class org.apache.spark.examples.SparkPi \
--conf spark.executor.instances=5 \
--conf spark.kubernetes.container.image=<spark-image> \
local:///path/to/examples.jar
In my Python code, can I start and stop Spark sessions?
For example:
from pyspark.sql import SparkSession

# start spark session
session = SparkSession \
    .builder \
    .appName(appname) \
    .getOrCreate()
## Doing some operations using spark
session.stop()

## some python code.

# start spark session again
session = SparkSession \
    .builder \
    .appName(appname) \
    .getOrCreate()
## Doing some operations using spark
session.stop()
Is this possible or not?
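For what the pattern itself is worth, here is a small self-contained sketch of sequential sessions in one driver process (the app names are placeholders); on Kubernetes each stop() should release the executors that the stopped session had requested, while the driver pod keeps running the script:
from pyspark.sql import SparkSession

# First session: create, use, stop.
session = SparkSession.builder.appName("phase-1").getOrCreate()
print(session.range(5).count())          # trivial Spark work
session.stop()                           # releases this session's executors

# Plain Python code can run here with no active session.

# Second session: getOrCreate() builds a fresh session after the stop.
session = SparkSession.builder.appName("phase-2").getOrCreate()
print(session.range(10).count())
session.stop()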

AWS EKS Spark 3.0, Hadoop 3.2 Error - NoClassDefFoundError: com/amazonaws/services/s3/model/MultiObjectDeleteException

I'm running JupyterHub on EKS and want to leverage EKS IRSA functionality to run Spark workloads on K8s. I have prior experience using Kube2IAM; however, I'm now planning to move to IRSA.
This error is not caused by IRSA, as service accounts are attached perfectly fine to the driver and executor pods and I can access S3 via the CLI and SDK from both. The issue is related to accessing S3 using Spark 3.0 / Hadoop 3.2:
Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext. : java.lang.NoClassDefFoundError: com/amazonaws/services/s3/model/MultiObjectDeleteException
I'm using the following versions:
APACHE_SPARK_VERSION=3.0.1
HADOOP_VERSION=3.2
aws-java-sdk-1.11.890
hadoop-aws-3.2.0
Python 3.7.3
I tested with a different version as well:
aws-java-sdk-1.11.563.jar
Please suggest a solution if you have come across this issue.
PS: This is not an IAM policy error either; the IAM policies are perfectly fine.
Finally, all the issues were solved with the jars below:
hadoop-aws-3.2.0.jar
aws-java-sdk-bundle-1.11.874.jar (https://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk-bundle/1.11.874)
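As a possible alternative to baking the jars into the image (a sketch, not part of the original answer), the same two artifacts can be resolved from Maven at session start via spark.jars.packages:
from pyspark.sql import SparkSession

# Sketch: let Spark download hadoop-aws 3.2.0 and the matching SDK bundle.
spark = SparkSession.builder \
    .appName("s3a-dependency-sketch") \
    .config("spark.jars.packages",
            "org.apache.hadoop:hadoop-aws:3.2.0,"
            "com.amazonaws:aws-java-sdk-bundle:1.11.874") \
    .getOrCreate()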
For anyone trying to run Spark on EKS using IRSA, this is the working Spark config:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.appName("pyspark-data-analysis-1") \
.config("spark.kubernetes.driver.master","k8s://https://xxxxxx.gr7.ap-southeast-1.eks.amazonaws.com:443") \
.config("spark.kubernetes.namespace", "jupyter") \
.config("spark.kubernetes.container.image", "xxxxxx.dkr.ecr.ap-southeast-1.amazonaws.com/spark-ubuntu-3.0.1") \
.config("spark.kubernetes.container.image.pullPolicy" ,"Always") \
.config("spark.kubernetes.authenticate.driver.serviceAccountName", "spark") \
.config("spark.kubernetes.authenticate.executor.serviceAccountName", "spark") \
.config("spark.kubernetes.executor.annotation.eks.amazonaws.com/role-arn","arn:aws:iam::xxxxxx:role/spark-irsa") \
.config("spark.hadoop.fs.s3a.aws.credentials.provider", "com.amazonaws.auth.WebIdentityTokenCredentialsProvider") \
.config("spark.kubernetes.authenticate.submission.caCertFile", "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt") \
.config("spark.kubernetes.authenticate.submission.oauthTokenFile", "/var/run/secrets/kubernetes.io/serviceaccount/token") \
.config("spark.hadoop.fs.s3a.multiobjectdelete.enable", "false") \
.config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
.config("spark.hadoop.fs.s3a.fast.upload","true") \
.config("spark.executor.instances", "1") \
.config("spark.executor.cores", "3") \
.config("spark.executor.memory", "10g") \
.getOrCreate()
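With a session configured like the one above, S3 access goes through the s3a:// scheme; a quick sanity check might look like this (bucket and prefix are placeholders):
# Placeholder bucket/prefix, purely to exercise the S3A connector.
df = spark.read.parquet("s3a://your-bucket/some/prefix/")
df.show(5)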
You can also check out this blog (https://medium.com/swlh/how-to-perform-a-spark-submit-to-amazon-eks-cluster-with-irsa-50af9b26cae), which uses:
Spark 2.4.4
Hadoop 2.7.3
AWS SDK 1.11.834
The example spark-submit command is:
/opt/spark/bin/spark-submit \
--master=k8s://https://4A5<i_am_tu>545E6.sk1.ap-southeast-1.eks.amazonaws.com \
--deploy-mode cluster \
--name spark-pi \
--class org.apache.spark.examples.SparkPi \
--conf spark.kubernetes.driver.pod.name=spark-pi-driver \
--conf spark.kubernetes.container.image=vitamingaugau/spark:spark-2.4.4-irsa \
--conf spark.kubernetes.namespace=spark-pi \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-pi \
--conf spark.kubernetes.authenticate.executor.serviceAccountName=spark-pi \
--conf spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.WebIdentityTokenCredentialsProvider \
--conf spark.kubernetes.authenticate.submission.caCertFile=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt \
--conf spark.kubernetes.authenticate.submission.oauthTokenFile=/var/run/secrets/kubernetes.io/serviceaccount/token \
local:///opt/spark/examples/target/scala-2.11/jars/spark-examples_2.11-2.4.4.jar 20000

spark-submit failing to connect to metastore due to Kerberos (Caused by GSSException: No valid credentials provided) but works in local-client mode

It seems that, inside Docker, the PySpark shell in local-client mode works and is able to connect to Hive. However, issuing spark-submit with all dependencies fails with the error below.
20/08/24 14:03:01 INFO storage.BlockManagerMasterEndpoint: Registering block manager test.server.com:41697 with 6.2 GB RAM, BlockManagerId(3, test.server.com, 41697, None)
20/08/24 14:03:02 INFO hive.HiveUtils: Initializing HiveMetastoreConnection version 1.2.1 using Spark classes.
20/08/24 14:03:02 INFO hive.metastore: Trying to connect to metastore with URI thrift://metastore.server.com:9083
20/08/24 14:03:02 ERROR transport.TSaslTransport: SASL negotiation failure
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
at org.apache.thrift.transport.TSaslClientTransport.handleSaslStartMessage(TSaslClientTransport.java:94)
at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:271)
Running a simple Pi example on PySpark works fine with no Kerberos issues, but when trying to access Hive I get the Kerberos error.
The spark-submit command:
spark-submit --master yarn --deploy-mode cluster --files=/etc/hive/conf/hive-site.xml,/etc/hive/conf/yarn-site.xml,/etc/hive/conf/hdfs-site.xml,/etc/hive/conf/core-site.xml,/etc/hive/conf/mapred-site.xml,/etc/hive/conf/ssl-client.xml --name fetch_hive_test --executor-memory 12g --num-executors 20 test_hive_minimal.py
test_hive_minimal.py is a simple PySpark script to show tables in a test DB:
from pyspark.sql import SparkSession
#declaration
appName = "test_hive_minimal"
master = "yarn"
# Create the Spark session
sc = SparkSession.builder \
.appName(appName) \
.master(master) \
.enableHiveSupport() \
.config("spark.hadoop.hive.enforce.bucketing", "True") \
.config("spark.hadoop.hive.support.quoted.identifiers", "none") \
.config("hive.exec.dynamic.partition", "True") \
.config("hive.exec.dynamic.partition.mode", "nonstrict") \
.getOrCreate()
# Custom freeform query: show tables in the test DB
sql = "show tables in user_tables"
df_new = sc.sql(sql)
df_new.show()
sc.stop()
Can anyone shed some light on how to fix this? Aren't Kerberos tickets managed automatically by YARN? All other Hadoop resources are accessible.
UPDATE:
The issue was fixed after sharing a volume mount on the Docker container and passing the keytab (mounted into the container's local path) and principal along with hive-site.xml for accessing the metastore.
spark-submit --master yarn \
--deploy-mode cluster \
--jars /srv/python/ext_jars/terajdbc4.jar \
--files=/etc/hive/conf/hive-site.xml \
--keytab /home/alias/.kt/alias.keytab \
--principal alias@realm.com.org \
--name td_to_hive_test \
--driver-cores 2 \
--driver-memory 2G \
--num-executors 44 \
--executor-cores 5 \
--executor-memory 12g \
td_to_hive_test.py
I think your driver has tickets, but your executors do not. Add the following parameters to your spark-submit:
--principal : you can find the principal with klist -k
--keytab : the path to the keytab
More information: https://spark.apache.org/docs/latest/running-on-yarn.html#yarn-specific-kerberos-configuration
Can you try the command-line property below while running a job on the cluster?
-Djavax.security.auth.useSubjectCredsOnly=false
You can add the above property to the spark-submit command.
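With spark-submit, a JVM property like that is typically forwarded via --conf spark.driver.extraJavaOptions and --conf spark.executor.extraJavaOptions on the command line. Purely as a sketch, these are the conf keys involved when expressed in a builder (note that the driver flag generally has to be supplied before the driver JVM starts, i.e. on the spark-submit line, to be effective):
from pyspark.sql import SparkSession

# Sketch of the conf keys involved; on a real cluster-mode job these are
# usually passed to spark-submit with --conf instead of being set in code.
spark = SparkSession.builder \
    .appName("kerberos-jvm-flag-sketch") \
    .config("spark.driver.extraJavaOptions",
            "-Djavax.security.auth.useSubjectCredsOnly=false") \
    .config("spark.executor.extraJavaOptions",
            "-Djavax.security.auth.useSubjectCredsOnly=false") \
    .getOrCreate()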

How does spark-submit.sh work with different modes and different cluster managers?

In Apache Spark, how does spark-submit.sh work with different modes and different cluster managers? Specifically:
In local deployment mode,
does spark-submit.sh skip calling any cluster manager?
Is it correct that there is no need to install a cluster manager on the local machine?
In client or cluster deployment mode,
Does spark-submit.sh work with different cluster managers (Spark standalone, YARN, Mesos, Kubernetes)? Do different cluster managers have different interfaces, and does spark-submit.sh have to invoke them in different ways?
Does spark-submit.sh present the same interface to programmers, apart from --master? (The --master option of spark-submit.sh is used to specify the cluster manager.)
Thanks.
To make things clear: there is no need to specify a cluster manager when running Spark in any mode (client, cluster, or local). The cluster manager is only there to make resource allocation easier and independent, and it is always your choice whether to use one.
The spark-submit command doesn't need a cluster manager to be present in order to run.
The different ways in which you can use the command are:
1) local mode:
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master local[8] \
/path/to/examples.jar \
100
2) client mode without an external resource manager (i.e., Spark standalone mode):
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://207.184.161.138:7077 \
--executor-memory 20G \
--total-executor-cores 100 \
/path/to/examples.jar \
1000
3) cluster mode with spark standalone mode:
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://207.184.161.138:7077 \
--deploy-mode cluster \
--supervise \
--executor-memory 20G \
--total-executor-cores 100 \
/path/to/examples.jar \
1000
4) client/cluster mode with a resource manager (pass --deploy-mode client for client mode):
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode cluster \
--executor-memory 20G \
--num-executors 50 \
/path/to/examples.jar \
1000
As you can see above, spark-submit.sh behaves in the same way whether or not there is a cluster manager. Likewise, if you want to use a resource manager such as YARN or Mesos, the behaviour of spark-submit remains the same.
You can read more about spark-submit in the Spark documentation on submitting applications.
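From the application side the interface is uniform as well: a PySpark program can omit the master entirely and let spark-submit's --master and --deploy-mode decide where it runs. A small sketch (the file name pi_sketch.py is hypothetical), which could be submitted with any of the variants above:
# pi_sketch.py -- hypothetical file name; submit it with any of the
# spark-submit variants above (e.g. --master local[8] or --master yarn).
from operator import add
from random import random

from pyspark.sql import SparkSession

# No .master(...) here: spark-submit's --master/--deploy-mode decide that.
spark = SparkSession.builder.appName("pi-sketch").getOrCreate()

n = 100000
count = (spark.sparkContext
         .parallelize(range(n))
         .map(lambda _: 1 if random() ** 2 + random() ** 2 <= 1 else 0)
         .reduce(add))
print("Pi is roughly", 4.0 * count / n)

spark.stop()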

How to use Apache Spark to query Hive table with Kerberos?

I am attempting to use Scala with Apache Spark locally to query a Hive table which is secured with Kerberos. I have no issues connecting and querying the data programmatically without Spark. However, the problem comes when I try to connect and query in Spark.
My code when run locally without Spark:
Class.forName("org.apache.hive.jdbc.HiveDriver")
System.setProperty("kerberos.keytab", keytab)
System.setProperty("kerberos.principal", keytab)
System.setProperty("java.security.krb5.conf", krb5.conf)
System.setProperty("java.security.auth.login.config", jaas.conf)
val conf = new Configuration
conf.set("hadoop.security.authentication", "Kerberos")
UserGroupInformation.setConfiguration(conf)
UserGroupInformation.createProxyUser("user", UserGroupInformation.getLoginUser)
UserGroupInformation.loginUserFromKeytab(user, keytab)
UserGroupInformation.getLoginUser.checkTGTAndReloginFromKeytab()
if (UserGroupInformation.isLoginKeytabBased) {
UserGroupInformation.getLoginUser.reloginFromKeytab()
}
else if (UserGroupInformation.isLoginTicketBased) UserGroupInformation.getLoginUser.reloginFromTicketCache()
val con = DriverManager.getConnection("jdbc:hive://hdpe-hive.company.com:10000", user, password)
val ps = con.prepareStatement("select * from table limit 5").executeQuery();
Does anyone know how I could include the keytab, krb5.conf and jaas.conf into my Spark initialization function so that I am able to authenticate with Kerberos to get the TGT?
My Spark initialization function:
conf = new SparkConf().setAppName("mediumData")
.setMaster(numCores)
.set("spark.driver.host", "localhost")
.set("spark.ui.enabled","true") //enable spark UI
.set("spark.sql.shuffle.partitions",defaultPartitions)
sparkSession = SparkSession.builder.config(conf).enableHiveSupport().getOrCreate()
I do not have files such as hive-site.xml, core-site.xml.
Thank you!
Looking at your code, you need to set the following properties in the spark-submit command on the terminal. Note that both -D flags go into a single extraJavaOptions value, because repeating --conf for the same key overrides the earlier value.
spark-submit --master yarn \
--principal YOUR_PRINCIPAL_HERE \
--keytab YOUR_KEYTAB_HERE \
--conf spark.driver.extraJavaOptions="-Djava.security.auth.login.config=JAAS_CONF_PATH -Djava.security.krb5.conf=KRB5_PATH" \
--conf spark.executor.extraJavaOptions="-Djava.security.auth.login.config=JAAS_CONF_PATH -Djava.security.krb5.conf=KRB5_PATH" \
--class YOUR_MAIN_CLASS_NAME_HERE code.jar
