How to use Apache Spark to query Hive table with Kerberos? - apache-spark

I am attempting to use Scala with Apache Spark locally to query a Hive table that is secured with Kerberos. I have no issues connecting and querying the data programmatically without Spark. However, the problem comes when I try to connect and query in Spark.
My code when run locally without Spark:
Class.forName("org.apache.hive.jdbc.HiveDriver")
System.setProperty("kerberos.keytab", keytab)
System.setProperty("kerberos.principal", user)
System.setProperty("java.security.krb5.conf", krb5.conf)
System.setProperty("java.security.auth.login.config", jaas.conf)
val conf = new Configuration
conf.set("hadoop.security.authentication", "Kerberos")
UserGroupInformation.setConfiguration(conf)
UserGroupInformation.createProxyUser("user", UserGroupInformation.getLoginUser)
UserGroupInformation.loginUserFromKeytab(user, keytab)
UserGroupInformation.getLoginUser.checkTGTAndReloginFromKeytab()
if (UserGroupInformation.isLoginKeytabBased) {
UserGroupInformation.getLoginUser.reloginFromKeytab()
}
else if (UserGroupInformation.isLoginTicketBased) UserGroupInformation.getLoginUser.reloginFromTicketCache()
val con = DriverManager.getConnection("jdbc:hive://hdpe-hive.company.com:10000", user, password)
val ps = con.prepareStatement("select * from table limit 5").executeQuery();
Does anyone know how I could include the keytab, krb5.conf and jaas.conf into my Spark initialization function so that I am able to authenticate with Kerberos to get the TGT?
My Spark initialization function:
conf = new SparkConf().setAppName("mediumData")
.setMaster(numCores)
.set("spark.driver.host", "localhost")
.set("spark.ui.enabled","true") //enable spark UI
.set("spark.sql.shuffle.partitions",defaultPartitions)
sparkSession = SparkSession.builder.config(conf).enableHiveSupport().getOrCreate()
I do not have files such as hive-site.xml, core-site.xml.
Thank you!

Looking at your code, you need to set the following properties in the spark-submit command on the terminal.
spark-submit --master yarn \
--principal YOUR_PRINCIPAL_HERE \
--keytab YOUR_KEYTAB_HERE \
--conf "spark.driver.extraJavaOptions=-Djava.security.auth.login.config=JAAS_CONF_PATH -Djava.security.krb5.conf=KRB5_PATH" \
--conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=JAAS_CONF_PATH -Djava.security.krb5.conf=KRB5_PATH" \
--class YOUR_MAIN_CLASS_NAME_HERE code.jar
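One detail worth checking in a command like this: when the same key is passed to --conf more than once, Spark keeps only one value for it, so several -D options for the same JVM are usually joined into a single space-separated extraJavaOptions string. The Python sketch below assembles such a command. The helper name build_kerberos_submit and every path and principal in it are hypothetical placeholders, not part of the original answer.

```python
import shlex

def build_kerberos_submit(principal, keytab, jaas_conf, krb5_conf, main_class, jar):
    """Assemble a spark-submit argument list (hypothetical helper, illustration only)."""
    # Join both -D flags into ONE value so neither setting replaces the other.
    java_opts = " ".join([
        f"-Djava.security.auth.login.config={jaas_conf}",
        f"-Djava.security.krb5.conf={krb5_conf}",
    ])
    return [
        "spark-submit", "--master", "yarn",
        "--principal", principal,
        "--keytab", keytab,
        "--conf", f"spark.driver.extraJavaOptions={java_opts}",
        "--conf", f"spark.executor.extraJavaOptions={java_opts}",
        "--class", main_class,
        jar,
    ]

cmd = build_kerberos_submit(
    "user@EXAMPLE.COM", "/path/to/user.keytab",   # placeholder values
    "/path/to/jaas.conf", "/path/to/krb5.conf",
    "com.example.Main", "code.jar",
)
print(shlex.join(cmd))  # shell-quoted form of the command
```

Building the command as a list and quoting it with shlex.join keeps the space inside each extraJavaOptions value from being split by the shell.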

Related

Not able to create Parquet table as default using spark-submit jobs on EMR

I was able to run an EMR step like this
spark-sql -f "script_location" --jars EMR_SPARK_JARFILE_FULL_PATH --hiveconf hive.default.fileformat=parquet --hiveconf hive.default.fileformat.managed=parquet --conf spark.sql.crossJoin.enabled=true -deploy-mode cluster
With this, Spark creates tables in Parquet by default from SparkSQL scripts (this works as expected).
Now, following the documentation at https://spark.apache.org/docs/latest/configuration.html#custom-hadoophive-configuration,
I want to do the same, but this time with PySpark, so I tried to run
spark-submit "script_location" \
--jars "jar_location" \
--conf spark.hive.default.fileformat=Parquet \
--conf spark.hive.default.fileformat.managed=Parquet \
--conf spark.sql.crossJoin.enabled=true \
-deploy-mode cluster
It seems spark-submit is not setting Parquet as the default for table creation.
Is there something I'm missing?

pyspark connection to MariaDB fails with ClassNotFoundException

I'm trying to retrieve data from MariaDB with pyspark.
I created a Spark session configured to include the JDBC jar file, but that did not solve the problem. The current code to create the session looks like this:
path = "hdfs://nameservice1/user/PATH/TO/JDBC/mariadb-java-client-2.7.1.jar"
# or path = "/home/PATH/TO/JDBC/mariadb-java-client-2.7.1.jar"
spark = SparkSession.builder.config("spark.jars", path)\
.config("spark.driver.extraClassPath", path)\
.config("spark.executor.extraClassPath", path)\
.enableHiveSupport()\
.getOrCreate()
Note that I've tried every configuration variant I know
(checking permissions, switching between HDFS and local directories, adding or removing settings, ...).
The code to load the data is:
sql = "SOME_SQL_TO_RETRIEVE_DATA"
df = spark.read.format('jdbc').option('dbtable', sql)\
.option('url', 'jdbc:mariadb://{host}:{port}/{db}')\
.option("user", SOME_USER)\
.option("password", SOME_PASSWORD)\
.option("driver", 'org.mariadb.jdbc.Driver')\
.load()
But it fails with java.lang.ClassNotFoundException: org.mariadb.jdbc.Driver
When I tried this with spark-submit, I saw this log message:
... INFO SparkContext: Added Jar /PATH/TO/JDBC/mariadb-java-client-2.7.1.jar at spark://SOME_PATH/jars/mariadb-java-client-2.7.1.jar with timestamp SOME_TIMESTAMP
What is wrong?
For anyone who runs into the same problem: I figured it out. The Spark documentation says:
Note: In client mode, this config must not be set through the SparkConf directly in your application, because the driver JVM has already started at that point. Instead, please set this through the --driver-class-path command line option or in your default properties file.
So instead of setting the configuration in the Python code, I added arguments to spark-submit, following that documentation.
spark-submit {other arguments ...} \
--driver-class-path PATH/TO/JDBC/my-jdbc.jar \
--jars PATH/TO/JDBC/my-jdbc.jar \
MY_PYTHON_SCRIPT.py
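When the script is started with plain python rather than spark-submit, the same flags can, as far as I know, be supplied through the PYSPARK_SUBMIT_ARGS environment variable, provided it is set before the first SparkSession is created (that is when the driver JVM launches). A minimal sketch, assuming a hypothetical jar path. The trailing pyspark-shell token is what PySpark expects at the end of this variable.

```python
import os

# Hypothetical location of the JDBC driver jar; replace with the real path.
jdbc_jar = "/home/PATH/TO/JDBC/mariadb-java-client-2.7.1.jar"

# Must be set BEFORE any SparkSession/SparkContext is created, because
# the driver JVM picks up its classpath only once, at launch.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    f"--driver-class-path {jdbc_jar} --jars {jdbc_jar} pyspark-shell"
)

# Only after this point would the session be created, e.g.:
# from pyspark.sql import SparkSession
# spark = SparkSession.builder.enableHiveSupport().getOrCreate()
print(os.environ["PYSPARK_SUBMIT_ARGS"])
```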

spark-submit failing to connect to metastore due to Kerberos : Caused by GSSException: No valid credentials provided . but works in local-client mode

It seems that the pyspark shell in Docker works in local-client mode and is able to connect to Hive. However, issuing spark-submit with all dependencies fails with the error below.
20/08/24 14:03:01 INFO storage.BlockManagerMasterEndpoint: Registering block manager test.server.com:41697 with 6.2 GB RAM, BlockManagerId(3, test.server.com, 41697, None)
20/08/24 14:03:02 INFO hive.HiveUtils: Initializing HiveMetastoreConnection version 1.2.1 using Spark classes.
20/08/24 14:03:02 INFO hive.metastore: Trying to connect to metastore with URI thrift://metastore.server.com:9083
20/08/24 14:03:02 ERROR transport.TSaslTransport: SASL negotiation failure
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
at org.apache.thrift.transport.TSaslClientTransport.handleSaslStartMessage(TSaslClientTransport.java:94)
at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:271)
Running a simple pi example on pyspark works fine with no Kerberos issues, but trying to access Hive produces the Kerberos error.
Spark-submit command:
spark-submit --master yarn --deploy-mode cluster --files=/etc/hive/conf/hive-site.xml,/etc/hive/conf/yarn-site.xml,/etc/hive/conf/hdfs-site.xml,/etc/hive/conf/core-site.xml,/etc/hive/conf/mapred-site.xml,/etc/hive/conf/ssl-client.xml --name fetch_hive_test --executor-memory 12g --num-executors 20 test_hive_minimal.py
test_hive_minimal.py is a simple pyspark script to show tables in test db:
from pyspark.sql import SparkSession
#declaration
appName = "test_hive_minimal"
master = "yarn"
# Create the Spark session
sc = SparkSession.builder \
.appName(appName) \
.master(master) \
.enableHiveSupport() \
.config("spark.hadoop.hive.enforce.bucketing", "True") \
.config("spark.hadoop.hive.support.quoted.identifiers", "none") \
.config("hive.exec.dynamic.partition", "True") \
.config("hive.exec.dynamic.partition.mode", "nonstrict") \
.getOrCreate()
# Define the function to load data from Teradata
#custom freeform query
sql = "show tables in user_tables"
df_new = sc.sql(sql)
df_new.show()
sc.stop()
Can anyone shed some light on how to fix this? Aren't Kerberos tickets managed automatically by YARN? All other Hadoop resources are accessible.
UPDATE:
The issue was fixed after sharing a volume mount on the Docker container and passing the keytab/principal along with hive-site.xml for accessing the metastore.
# The keytab is mounted into the container and kept at a local Docker path.
spark-submit --master yarn \
--deploy-mode cluster \
--jars /srv/python/ext_jars/terajdbc4.jar \
--files=/etc/hive/conf/hive-site.xml \
--keytab /home/alias/.kt/alias.keytab \
--principal alias@realm.com.org \
--name td_to_hive_test \
--driver-cores 2 \
--driver-memory 2G \
--num-executors 44 \
--executor-cores 5 \
--executor-memory 12g \
td_to_hive_test.py
I think your driver has tickets but your executors do not. Add the following parameters to your spark-submit command:
--principal : you can list the principals this way: klist -k
--keytab : path to the keytab
More information: https://spark.apache.org/docs/latest/running-on-yarn.html#yarn-specific-kerberos-configuration
Can you try the command-line property below when running the job on the cluster?
-Djavax.security.auth.useSubjectCredsOnly=false
You can add this property to the spark-submit command.
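Before submitting, it can help to sanity-check the two values being passed. The sketch below is only an illustration: kerberos_args is a hypothetical helper, and the rules it applies (a principal looks like name@REALM, and the keytab file must exist locally) are simplifying assumptions rather than checks that spark-submit performs in this form.

```python
import os
import tempfile

def kerberos_args(principal, keytab):
    """Return spark-submit flags for keytab login (hypothetical helper)."""
    name, sep, realm = principal.partition("@")
    if not (name and sep and realm):
        raise ValueError(f"principal should look like name@REALM, got {principal!r}")
    if not os.path.isfile(keytab):
        raise FileNotFoundError(f"keytab not found: {keytab}")
    return ["--principal", principal, "--keytab", keytab]

# Demonstration with a throwaway file standing in for a real keytab.
with tempfile.NamedTemporaryFile(suffix=".keytab") as fake_keytab:
    args = kerberos_args("alias@EXAMPLE.COM", fake_keytab.name)
    print(args)
```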

SparkSession Application Source Code Config Properties not Overriding JupyterHub & Zeppelin on AWS EMR defaults

I have the Spark driver set up to use Zeppelin and/or JupyterHub as a client for interactive Spark programming on AWS EMR. However, when I create the SparkSession with custom config properties (application name, number of cores, executor RAM, number of executors, serializer, etc.), they do not override the default values for those configs (confirmed under the Environment tab in the Spark UI and with spark.conf.get(...)).
As with any Spark app, these clients on EMR should use my custom config properties, because settings made in SparkSession code take the highest precedence, ahead of spark-submit, the Spark config file, and then spark-defaults. JupyterHub also launches a Spark application immediately, without my coding one, or when I just run an empty cell.
Is there a setting specific to Zeppelin or JupyterHub, or a separate XML conf, that needs to be adjusted to get custom configs recognized and working? Any help is much appreciated.
Below is an example of creating a basic application where these cluster resource configs should be applied, instead of the standard defaults that Zeppelin/JupyterHub on EMR currently uses.
# via zep or jup [configs NOT being recognized]
from pyspark.sql import SparkSession
spark = SparkSession\
.builder\
.appName("app_name")\
.master("yarn")\
.config("spark.submit.deployMode","client")\
.config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")\
.config("spark.executor.instances", 11)\
.config("spark.executor.cores", 5)\
.config("spark.executor.memory", "19g")\
.getOrCreate()
# via ssh terminal [configs ARE recognized at run-time]
pyspark \
--name "app_name" \
--master yarn \
--deploy-mode client \
--num-executors 11 \
--executor-cores 5 \
--executor-memory 19g \
--conf "spark.serializer=org.apache.spark.serializer.KryoSerializer"
Found a solution. The config.json file under /etc/jupyter/conf had some default Spark config values, so I removed them, leaving an empty JSON key/value like => _configs":{}. A custom SparkSession created via JupyterHub now picks up the specified cluster configs.
The %%configure magic command also always works:
https://github.com/jupyter-incubator/sparkmagic
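For reference, the precedence the question relies on (values set in application code first, then spark-submit flags, then spark-defaults.conf) can be modelled in a few lines of Python. This is a toy model with made-up values, not Spark's implementation, and it does not cover the case where a notebook layer has already created a session before the user's cell runs.

```python
from collections import ChainMap

# Made-up values for illustration only.
spark_defaults = {"spark.executor.memory": "4g", "spark.executor.cores": "2"}
submit_flags = {"spark.executor.memory": "8g"}
app_code = {"spark.executor.memory": "19g", "spark.executor.instances": "11"}

# ChainMap looks keys up left to right, so the leftmost mapping wins.
effective = ChainMap(app_code, submit_flags, spark_defaults)

for key in sorted(effective):
    print(key, "=", effective[key])
```

In this model a key that appears only in a lower layer, such as spark.executor.cores, still shows through, which matches the idea that defaults apply only where nothing more specific was set.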

A master URL must be set in your configuration gives lot of confusion

I have compiled my Spark Scala code in Eclipse.
I am trying to run my jar on EMR (5.9.0, Spark 2.2.0) using spark-submit.
But when I run it I get an error:
Details : Exception in thread "main" org.apache.spark.SparkException: A master URL must be set in your configuration
After reading lots of StackOverflow answers I am confused, and I have not found a correct explanation of how and why to set the app master.
This is how I run my jar. I have tried all of the options below:
spark-submit --class financialLineItem.FinancialLineItem s3://trfsmallfffile/AJAR/SparkJob-0.1-jar-with-dependencies.jar
spark-submit --master yarn --class financialLineItem.FinancialLineItem s3://trfsmallfffile/AJAR/SparkJob-0.1-jar-with-dependencies.jar
spark-submit --master yarn-client --class financialLineItem.FinancialLineItem s3://trfsmallfffile/AJAR/SparkJob-0.1-jar-with-dependencies.jar
spark-submit --deploy-mode cluster --master yarn --class financialLineItem.FinancialLineItem s3://trfsmallfffile/AJAR/SparkJob-0.1-jar-with-dependencies.jar
spark-submit --deploy-mode cluster --master yarn-client --class financialLineItem.FinancialLineItem s3://trfsmallfffile/AJAR/SparkJob-0.1-jar-with-dependencies.jar
spark-submit --master local[*] --deploy-mode cluster --class financialLineItem.FinancialLineItem s3://trfsmallfffile/AJAR/SparkJob-0.1-jar-with-dependencies.jar
spark-submit --master local[1] --deploy-mode cluster --class financialLineItem.FinancialLineItem s3://trfsmallfffile/AJAR/SparkJob-0.1-jar-with-dependencies.jar
spark-submit --master local[2] --deploy-mode cluster --class financialLineItem.FinancialLineItem s3://trfsmallfffile/AJAR/SparkJob-0.1-jar-with-dependencies.jar
spark-submit --master local[3] --deploy-mode cluster --class financialLineItem.FinancialLineItem s3://trfsmallfffile/AJAR/SparkJob-0.1-jar-with-dependencies.jar
spark-submit --master local[4] --deploy-mode cluster --class financialLineItem.FinancialLineItem s3://trfsmallfffile/AJAR/SparkJob-0.1-jar-with-dependencies.jar
spark-submit --master local[5] --deploy-mode cluster --class financialLineItem.FinancialLineItem s3://trfsmallfffile/AJAR/SparkJob-0.1-jar-with-dependencies.jar
I am not setting any app master in my Scala code.
package financialLineItem
import org.apache.spark.{ SparkConf, SparkContext }
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
object FinancialLineItem {
def main(args: Array[String]) {
println("Enterin In to Spark Mode ")
val conf = new SparkConf().setAppName("FinanicalLineItem");
println("After conf")
val sc = new SparkContext(conf); //Creating spark context
println("After SC")
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
val get_cus_val = sqlContext.udf.register("get_cus_val", (filePath: String) => filePath.split("\\.")(3))
val rdd = sc.textFile("s3://path/FinancialLineItem/MAIN")
val header = rdd.filter(_.contains("LineItem.organizationId")).map(line => line.split("\\|\\^\\|")).first()
val schema = StructType(header.map(cols => StructField(cols.replace(".", "_"), StringType)).toSeq)
val data = sqlContext.createDataFrame(rdd.filter(!_.contains("LineItem.organizationId")).map(line => Row.fromSeq(line.split("\\|\\^\\|").toSeq)), schema)
val schemaHeader = StructType(header.map(cols => StructField(cols.replace(".", "."), StringType)).toSeq)
val dataHeader = sqlContext.createDataFrame(rdd.filter(!_.contains("LineItem.organizationId")).map(line => Row.fromSeq(line.split("\\|\\^\\|").toSeq)), schemaHeader)
val df1resultFinal = data.withColumn("DataPartition", get_cus_val(input_file_name))
val rdd1 = sc.textFile("s3://path/FinancialLineItem/INCR")
val header1 = rdd1.filter(_.contains("LineItem.organizationId")).map(line => line.split("\\|\\^\\|")).first()
val schema1 = StructType(header1.map(cols => StructField(cols.replace(".", "_"), StringType)).toSeq)
val data1 = sqlContext.createDataFrame(rdd1.filter(!_.contains("LineItem.organizationId")).map(line => Row.fromSeq(line.split("\\|\\^\\|").toSeq)), schema1)
val windowSpec = Window.partitionBy("LineItem_organizationId", "LineItem_lineItemId").orderBy($"TimeStamp".cast(LongType).desc)
val latestForEachKey = data1.withColumn("rank", rank().over(windowSpec)).filter($"rank" === 1).drop("rank", "TimeStamp")
val dfMainOutput = df1resultFinal.join(latestForEachKey, Seq("LineItem_organizationId", "LineItem_lineItemId"), "outer")
.select($"LineItem_organizationId", $"LineItem_lineItemId",
when($"DataPartition_1".isNotNull, $"DataPartition_1").otherwise($"DataPartition").as("DataPartition"),
when($"FinancialConceptCodeGlobalSecondaryId_1".isNotNull, $"FinancialConceptCodeGlobalSecondaryId_1").otherwise($"FinancialConceptCodeGlobalSecondaryId").as("FinancialConceptCodeGlobalSecondaryId"),
when($"FFAction_1".isNotNull, $"FFAction_1").otherwise($"FFAction|!|").as("FFAction|!|"))
.filter(!$"FFAction|!|".contains("D|!|"))
val dfMainOutputFinal = dfMainOutput.na.fill("").select($"DataPartition", $"StatementTypeCode", concat_ws("|^|", dfMainOutput.schema.fieldNames.filter(_ != "DataPartition").map(c => col(c)): _*).as("concatenated"))
val headerColumn = dataHeader.columns.toSeq
val headerLast = headerColumn.mkString("", "|^|", "|!|").dropRight(3)
val dfMainOutputFinalWithoutNull = dfMainOutputFinal.withColumn("concatenated", regexp_replace(col("concatenated"), "|^|null", "")).withColumnRenamed("concatenated", headerLast)
dfMainOutputFinalWithoutNull.repartition(1).write.partitionBy("DataPartition", "StatementTypeCode")
.format("csv")
.option("nullValue", "")
.option("delimiter", "\t")
.option("quote", "\u0000")
.option("header", "true")
.option("codec", "gzip")
.save("s3://path/FinancialLineItem/output")
}
}
I even tried setting the master URL in my Scala code.
This EMR example for Spark works:
spark-submit --deploy-mode cluster --class org.apache.spark.examples.JavaSparkPi /usr/lib/spark/examples/jars/spark-examples.jar 5
If this works, why does my jar not work?
I tried a print statement in my Scala class before creating the Spark context, and it is printed, so there is no issue with the jar file creation.
I don't know what I am missing.
Updating with my Eclipse IDE setup as well.
I followed the AWS "add steps" documentation.
This is my observation:
A master URL like "spark://..." is for Spark Standalone, but EMR uses Spark on YARN, so the master URL should be just "yarn". This is already configured for you in spark-defaults.conf.
More findings:
When I tried to submit from spark-shell I got the error below:
User class threw exception: java.lang.UnsupportedOperationException: empty collection.
I think there might be some issue with my code as well.
But I get the correct result when I run it from Zeppelin.
There's a lot of confusion going on here in the question and in the first answer. If you're running on EMR, which runs Spark on YARN, you do not need to set a master URL at all. It automatically defaults to "yarn", which is the correct value when running Spark on YARN (as opposed to Spark Standalone, which would have a master URL like spark://:7077).
As mentioned in one of the other answers, "--master local" and "--deploy-mode cluster" also don't make sense together. "--master local" should only be used for local development and testing purposes and doesn't make sense to use on a cluster of machines such as on EMR. All it does is run your entire application in a single JVM; it won't run on YARN, it won't be distributed across the cluster, and there won't even be a separation between your driver code and the tasks.
As for "--deploy-mode cluster", as also stated in the other answer, this means that your driver runs in a YARN container on the cluster along with the executors, as opposed to the default of "--deploy-mode client", where the driver runs on the master node outside of YARN.
For more information, please see the Spark documentation, mainly https://spark.apache.org/docs/latest/submitting-applications.html and https://spark.apache.org/docs/latest/running-on-yarn.html.
As explained in the documentation, --deploy-mode cluster asks spark-submit to run the driver on one of the worker nodes inside the cluster.
That, however, isn't applicable to your execution, as you're running locally. You should be using the client deploy mode. For that, just remove the --deploy-mode parameter altogether.
You have to choose one of the following calls, depending on how you want to run the driver program (or executors, for the last option). It's important to understand the differences as they are consequential.
If you want to run the driver program on the cluster (cluster mode, master chooses where on the cluster):
spark-submit --master spark://master.address.com:7077 --deploy-mode cluster #other options
If you want to run the driver program on the machine that is calling spark-submit (client mode, executors remain on the cluster):
spark-submit --master spark://master.address.com:7077 --deploy-mode client #other options
If you are running all locally (driver and executors), then your local master is appropriate:
spark-submit --master local[*] #other options
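The combinations discussed in these answers can be summarised in a small check. This is a sketch of the reasoning only: check_submit_options is a hypothetical helper, and its rules are the ones stated in the answers (a local master does not go together with cluster deploy mode; with a cluster master, the deploy mode decides where the driver runs).

```python
def check_submit_options(master, deploy_mode="client"):
    """Describe where the driver would run for a master/deploy-mode pair."""
    if deploy_mode not in ("client", "cluster"):
        raise ValueError(f"unknown deploy mode: {deploy_mode!r}")
    if master.startswith("local"):
        if deploy_mode == "cluster":
            return "invalid: a local master cannot be used with cluster deploy mode"
        return "driver and executors run in a single local JVM"
    if deploy_mode == "cluster":
        return "driver runs inside the cluster, next to the executors"
    return "driver runs on the machine that called spark-submit"

for master, mode in [("local[*]", "cluster"), ("yarn", "cluster"), ("yarn", "client")]:
    print(f"--master {master} --deploy-mode {mode}: {check_submit_options(master, mode)}")
```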
