Passing kerberos keytab and principal in spark conf - apache-spark

I am trying to run my Spark application in local mode from within IntelliJ. The application reads a text file from HDFS using sc.textFile("hdfs://..."). The HDFS cluster is secured with Kerberos authentication. I know you can use the Spark launcher and specify the Kerberos keytab and principal, but for that I would have to run sbt assembly every time I make a code change and want to test it. Is there an alternate/better way of specifying the Kerberos keytab file and principal to Spark? Also, is there a parameter for providing the HDFS namenode information?
Thanks!

First, you may provide these parameters when building your SparkSession (described here).
The second option is to pass the principal and keytab as command-line arguments of your app.
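A minimal sketch of the first option, assuming Spark 3.x configuration keys (on Spark 2.x the equivalents were spark.yarn.keytab and spark.yarn.principal); the principal, keytab path, and namenode address are placeholders, and in local mode an explicit UserGroupInformation login before building the session may also be needed:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.security.UserGroupInformation
import org.apache.spark.sql.SparkSession

// Log in to Kerberos explicitly first; useful in local mode, where no
// cluster manager handles the keytab for you. All values are placeholders.
val hadoopConf = new Configuration()
hadoopConf.set("hadoop.security.authentication", "kerberos")
hadoopConf.set("fs.defaultFS", "hdfs://namenode.example.com:8020") // namenode info
UserGroupInformation.setConfiguration(hadoopConf)
UserGroupInformation.loginUserFromKeytab("user@EXAMPLE.COM", "/path/to/user.keytab")

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("kerberos-local-test")
  .config("spark.kerberos.principal", "user@EXAMPLE.COM")  // spark.yarn.principal on 2.x
  .config("spark.kerberos.keytab", "/path/to/user.keytab") // spark.yarn.keytab on 2.x
  .config("spark.hadoop.fs.defaultFS", "hdfs://namenode.example.com:8020")
  .getOrCreate()

val lines = spark.sparkContext.textFile("hdfs:///path/to/input.txt")
println(lines.count())

With fs.defaultFS set this way, the namenode no longer has to be hard-coded into every hdfs:// URI.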

Related

Is there a Delegation Token example of spark on k8s by using secrets?

We are using Spark on k8s with a kerberized HDFS. I saw there are several ways to enable Kerberos, and one of them is using a delegation token with pre-populated secrets within the namespace, but I can hardly find a complete example.
How do I create a delegation token on the HDFS side and then populate it into a secret in k8s?
Any answer or reference is greatly appreciated.
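One possible sketch of the HDFS side, using the Hadoop Credentials API (the principal, keytab, renewer, and output path are placeholders, and this assumes the kerberized cluster's core-site.xml and hdfs-site.xml are on the classpath):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.security.{Credentials, UserGroupInformation}

// Log in with a keytab, fetch HDFS delegation tokens, and serialize them
// to a local file that can then be loaded into a k8s secret.
val conf = new Configuration()
UserGroupInformation.setConfiguration(conf)
UserGroupInformation.loginUserFromKeytab("user@EXAMPLE.COM", "/path/to/user.keytab")

val fs = FileSystem.get(conf)
val creds = new Credentials()
fs.addDelegationTokens("user@EXAMPLE.COM" /* renewer; placeholder */, creds)
creds.writeTokenStorageFile(new Path("file:///tmp/hadoop.token"), conf)

The resulting file can be turned into a secret with kubectl create secret generic hadoop-token --from-file=hadoop.token=/tmp/hadoop.token and referenced at submit time via spark.kubernetes.kerberos.tokenSecret.name and spark.kubernetes.kerberos.tokenSecret.itemKey (check the Spark security docs for the exact property names on your version).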

Databricks and Informatica Delta Lake connector spark configuration

I am working with Informatica Data Integrator and trying to set up a connection with a Databricks cluster. So far everything seems to work fine, but one issue is that under Spark configuration we had to put the SAS key for the ADLS gen 2 storage account.
The reason for this is that when Informatica tries to write to Databricks it first has to write that data into a folder in ADLS gen 2 and then Databricks essentially takes that file and writes it as a Delta Lake table.
Now one issue is that the field where we put the Spark config contains the full SAS value (url plus token and password). That is not really a good thing unless we only make 1 person an admin.
Has anyone worked with Informatica and Databricks? Is it possible to put the Spark config in a file and have the Informatica connector read that file? Or is it possible to add that SAS key to the Spark cluster (the interactive cluster we use) and have the cluster read the info from that file?
Thank you for any help with this.
You really don't need to put the SAS key value into the Spark configuration. Instead, store that value in an Azure Key Vault-backed secret scope (on Azure) or a Databricks-backed secret scope (on other clouds), and then refer to it from the Spark configuration using the syntax {{secrets/<secret-scope-name>/<secret-key>}} (see doc). This way the SAS key value is read at cluster start and won't be available to users who have access to the cluster UI.
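For example, the cluster's Spark config field could then contain a line like the following instead of the raw SAS value (the scope name, key name, container, and storage account are placeholders, and the exact fs.azure property depends on whether the account is accessed via wasbs or abfss):

fs.azure.sas.mycontainer.mystorageacct.blob.core.windows.net {{secrets/informatica-scope/adls-sas-token}}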

Authorization through Apache Ranger in Spark

We have Ranger policies defined on Hive tables, and authorization works as expected when we use the Hive CLI and Beeline. But when we access those Hive tables using spark-shell or spark-submit, it does not work.
Is there any way to set this up?
Problem Statement:
Ranger secures the Hive (JDBC) server (HiveServer2) only, but Spark does not interact with HS2; it talks directly to the metastore. Hence, the only way to apply Ranger's Hive policies is to use Hive via JDBC. The other option is HDFS or storage ACLs, which give coarse-grained control over file paths and the like; you can use Ranger to manage HDFS ACLs as well, and in that scenario Spark will be bound by those policies. But, as noted, managing HDFS ACLs through Ranger is still coarse-grained control over files, and fine-grained use cases at the row/column level remain uncovered.
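For illustration, access that does go through HS2, and is therefore checked against Ranger's Hive policies, looks like this (host, principal, and table are placeholders):

beeline -u "jdbc:hive2://hs2-host:10000/default;principal=hive/_HOST@EXAMPLE.COM" \
  -e "SELECT * FROM mytable LIMIT 10"

spark-shell and spark-submit bypass this path, which is why the same policies are not enforced there.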
Check the Ranger audits in the Ranger UI, look for denied results for those tables, and verify the user.

Specify Azure key in Spark 2.x version

I'm trying to access a wasb (Azure Blob Storage) file in Spark and need to specify the account key.
How do I specify the account key in the spark-env.sh file?
fs.azure.account.key.test.blob.core.windows.net
EC5sNg3qGN20qqyyr2W1xUo5qApbi/zxkmHMo5JjoMBmuNTxGNz+/sF9zPOuYA==
When I try this it throws the following error:
fs.azure.account.key.test.blob.core.windows.net: command not found
From your description, it is not clear whether your Spark runs on Azure or locally. Either way, spark-env.sh is not the right place for this setting: it is a shell script that gets sourced at startup, so a bare Hadoop property line is interpreted as a command, which is exactly why you see "command not found".
For Spark running locally, refer to this blog post, which introduces how to access Azure Blob Storage from Spark. The key is to configure the Azure Storage account as HDFS-compatible storage in the core-site.xml file and to add the two jars hadoop-azure and azure-storage to your classpath, so that storage can be accessed via the wasb[s] protocol.
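A minimal core-site.xml sketch for the local case (the account name is taken from your question; the key value is a placeholder):

<property>
  <name>fs.azure.account.key.test.blob.core.windows.net</name>
  <value>YOUR_STORAGE_ACCOUNT_KEY</value>
</property>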
For Spark running on Azure, the only difference is that you access HDFS with wasb; all the configuration is done by Azure when the HDInsight cluster with Spark is created.
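Alternatively, for quick local testing the same property can be set programmatically rather than in core-site.xml (account, key, container, and path are placeholders; hadoop-azure and azure-storage still need to be on the classpath):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("wasb-test")
  .getOrCreate()

// Same setting as the core-site.xml property above; values are placeholders.
spark.sparkContext.hadoopConfiguration.set(
  "fs.azure.account.key.test.blob.core.windows.net",
  "<your-storage-account-key>")

val lines = spark.sparkContext.textFile(
  "wasbs://mycontainer@test.blob.core.windows.net/path/to/file.txt")
println(lines.count())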

How to authenticate with spark-submit cluster mode?

I'd like to run spark-submit remotely from a local machine to submit a job to a Spark cluster (cluster mode). What method do I use to authenticate myself to the cluster?
You need to be on a Spark client machine to submit a job. Once you are on the Spark client and submit a job, you are authenticated as the user id under which you submit; if that user id has permissions on the cluster, authorization is handled automatically. For a Kerberos-enabled cluster, you first need a keytab for that user and session to authenticate with the cluster.
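As a sketch for a Kerberos-enabled YARN cluster (principal, keytab path, class, and jar are placeholders; --principal and --keytab are standard spark-submit options on YARN):

kinit -kt /path/to/user.keytab user@EXAMPLE.COM
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --principal user@EXAMPLE.COM \
  --keytab /path/to/user.keytab \
  --class com.example.Main \
  /path/to/app.jar

With --principal and --keytab, Spark can also renew tickets for long-running jobs instead of relying only on the kinit session.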
