How to authenticate with spark-submit cluster mode? - apache-spark

I'd like to remotely run spark-submit from a local machine to submit a job to a Spark cluster (cluster mode). What method do I use to authenticate myself to the cluster?

You need to be on a Spark client machine (one with the cluster's client configuration installed) to submit a job. When you run spark-submit, you are authenticated as the user ID under which you submit the job; if that user has the required permissions on the cluster, authorization is handled automatically. On a Kerberos-enabled cluster, you first need to obtain a ticket or generate a keytab for that user and session to authenticate with the cluster.
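For example, on a Kerberos-enabled YARN cluster the flow might look like this sketch; the principal, keytab path, class, and jar names are placeholders, not values from the question:

    # Obtain a Kerberos ticket as the submitting user (principal/keytab are placeholders)
    kinit -kt /etc/security/keytabs/alice.keytab alice@EXAMPLE.COM

    # Submit in cluster mode; passing --principal/--keytab lets YARN re-authenticate
    # and renew delegation tokens for long-running applications
    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --principal alice@EXAMPLE.COM \
      --keytab /etc/security/keytabs/alice.keytab \
      --class com.example.MyApp \
      my-app.jar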

Related

How to remotely submit Spark jobs to an Azure HDInsight cluster without Livy

I want to submit a Spark job to an Azure HDInsight cluster from Airflow. I don't want to use Livy because it doesn't accumulate logs in Airflow. Is it possible to submit the job remotely? SSH is one option, but if the job is long-running the connection might break. Is there any other option?
Note - the Airflow cluster is remote; it is not colocated with the Spark cluster.

Passing kerberos keytab and principal in spark conf

I am trying to run my Spark application in local mode from within IntelliJ. The application reads a text file from HDFS using sc.textFile("hdfs://..."). HDFS is secured by Kerberos authentication. I know you can use the Spark launcher and specify a Kerberos keytab and principal, but then I would have to run sbt assembly every time I make a code change and want to test it. Is there an alternate/better way of specifying the Kerberos keytab file and principal to Spark? Also, is there a parameter for providing the HDFS namenode information?
Thanks!
First, you can provide these parameters as configuration when building your SparkSession, as in the sketch below.
The second option is to pass the principal and keytab as command-line arguments of your app.
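A minimal sketch of the first option in Scala; the paths, principal, and namenode address are placeholders, the spark.kerberos.* property names assume Spark 3.x (older releases use spark.yarn.keytab / spark.yarn.principal), and fs.defaultFS answers the namenode question:

    import org.apache.spark.sql.SparkSession

    // Minimal sketch: all values below are placeholders, not real settings.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("kerberos-local-test")
      // Spark 3.x property names; on Spark 2.x use spark.yarn.keytab / spark.yarn.principal
      .config("spark.kerberos.keytab", "/path/to/user.keytab")
      .config("spark.kerberos.principal", "user@EXAMPLE.COM")
      // spark.hadoop.* properties are forwarded to the Hadoop configuration,
      // so this sets the default namenode for HDFS paths
      .config("spark.hadoop.fs.defaultFS", "hdfs://namenode.example.com:8020")
      .getOrCreate()

    // With fs.defaultFS set, an HDFS path resolves against that namenode
    val lines = spark.sparkContext.textFile("/data/input.txt")
    println(lines.count())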

submit hdinsight Spark job using IntelliJ IDEA failure

When I submit an HDInsight Spark job using IntelliJ IDEA Community, I get this error:
Failed to submit application to spark cluster.
Exception : Forbidden. Attached Azure DataLake Store is not supported in Automated login model.
Please logout first and try Interactive login model
The exception is shown when you select the Automated option in the Azure Sign In dialog and submit a Spark job to a cluster whose storage is Azure Data Lake Store. So, use the Interactive option for that cluster.
The Automated login mode is only used for clusters whose storage is Azure Blob storage.
You could try the following steps:
Sign out from the Azure Explorer first.
Sign in with the Interactive option.
Select the Spark cluster with Azure Data Lake Store in the Spark job submission dialog and submit the job.
Refer to https://learn.microsoft.com/en-us/azure/azure-toolkit-for-intellij-sign-in-instructions for more instructions.
[Update]
If your account has no permission to access that Azure Data Lake Store, the same exception will be thrown.
Refer to https://learn.microsoft.com/en-us/azure/data-lake-store/data-lake-store-security-overview
The compiled Spark job is uploaded to the ADL folder adl://<adls>.azuredatalakestore.net/<cluster attached folder>/SparkSubmission/**, so the user needs write permission there. It is best to ask your admin to check your role access.

How to submit Spark job to AWS EC2 cluster?

I am new to AWS EC2 and need to know how I can submit my Spark job to an AWS EC2 Spark cluster. In Azure we can submit the job directly through IntelliJ IDEA with the Azure plugin.
You can submit a Spark job easily with the spark-submit command. See http://spark.apache.org/docs/latest/submitting-applications.html
Options:
1) Log in to the master or another driver gateway node and use spark-submit to submit the job through YARN/Mesos/etc.
2) Use spark-submit in cluster deploy mode from any machine with sufficient ports and firewall access (this may require configuration, such as client config files from Cloudera Manager for a CDH cluster); see the sketch after this list.
3) Use a server setup like Livy (open source through Cloudera; MS Azure HDInsight uses and contributes to it) or perhaps the Thrift server. Livy (livy.io) is a nice, simple REST service that also has Scala/Java language APIs to make it extra easy to submit jobs (and run interactive persisted sessions!).
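As an illustration of option 2, a hedged sketch of a cluster-mode submission from a remote machine; the master URL, class, and jar location are placeholders, and for a YARN cluster you would instead use --master yarn with HADOOP_CONF_DIR pointing at the cluster's client configs:

    # All hosts and paths below are placeholders
    spark-submit \
      --master spark://ec2-master.example.com:7077 \
      --deploy-mode cluster \
      --class com.example.MyApp \
      --executor-memory 4g \
      hdfs://ec2-master.example.com:8020/jars/my-app.jar

Note that in cluster deploy mode the driver runs on a cluster node, so the application jar must be at a location the cluster nodes can reach (e.g. HDFS or S3), not only on your local disk.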

Hive ODBC connecting to HDInsight

I'm setting up a SQL Server VM in Azure and I want it to be able to connect to Hive on an HDInsight cluster. I'm trying to set up the ODBC DSN and I'm unsure of what the various settings are and how to find them in my Azure portal:
Hostname
Username
Password (can I reset this if I've forgotten it?)
Cheers, Chris.
Hostname: HDInsight cluster name
Username: HDInsight cluster username
Password: HDInsight cluster password
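As a hedged illustration (not from the original answer): with the Microsoft Hive ODBC driver, the DSN fields for an HDInsight cluster typically look like the following, where <clustername> and the credentials are whatever was chosen when the cluster was provisioned:

    Host:       <clustername>.azurehdinsight.net
    Port:       443
    Mechanism:  Windows Azure HDInsight Service
    User Name:  <cluster HTTP user set at provisioning>
    Password:   <cluster password set at provisioning>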
I don't think you can recover the password. You can delete the HDInsight cluster and create another one. Because Hadoop jobs are batch jobs and an HDInsight cluster usually contains multiple nodes, people usually create a cluster, run a MapReduce job, and delete the cluster right after the job completes; it is too costly to leave an HDInsight cluster sitting idle in the cloud.
Because an HDInsight cluster uses Windows Azure Blob storage for data storage, deleting the cluster will not impact the data.
