HDInsight Spark cluster - can't connect to Azure Data Lake Store - azure

So I have created an HDInsight Spark Cluster. I want it to access Azure Data Lake Store.
To create the HDInsight Spark cluster I followed the instructions at https://azure.microsoft.com/en-gb/documentation/articles/data-lake-store-hdinsight-hadoop-use-portal, however there was no option in the Azure Portal to configure AAD or add a Service Principal.
So my cluster was created using Azure Blob Storage only. Now I want to extend it to access Azure Data Lake Store. However, the "Cluster AAD Identity" dialog states "Service Principal: DISABLED" and all fields in the dialog are greyed out and disabled. I can't see any way to extend the storage to point to ADLS.
Any help would be appreciated!
Thanks :-)

You can move your data from Blob to ADLS with Data Factory, but you can't access ADLS directly from a Spark cluster.

Please create the Azure HDInsight cluster with a Service Principal. The Service Principal should have access to your Data Lake Store account.
You can configure an existing cluster to use Data Lake Store manually (see the sketch below), but that is very complicated, and in fact there is no documentation for it.
So the recommended way is to create the cluster with a Service Principal from the start.
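For context, the manual route mentioned above amounts to handing the Service Principal's OAuth details to the Hadoop ADLS (Gen1) connector yourself. Below is a minimal Python sketch from a Spark session, assuming the adl:// connector is available on the cluster; the client ID, secret, tenant, and store names are placeholders, not values taken from this thread:

# Minimal sketch: point the adl:// connector at a Service Principal (placeholders in <>).
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.adl.oauth2.access.token.provider.type", "ClientCredential")
hadoop_conf.set("fs.adl.oauth2.client.id", "<client-id>")
hadoop_conf.set("fs.adl.oauth2.credential", "<client-secret>")
hadoop_conf.set("fs.adl.oauth2.refresh.url", "https://login.microsoftonline.com/<tenant-id>/oauth2/token")
df = spark.read.csv("adl://<datalake-store-name>.azuredatalakestore.net/MyData.csv")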

Which type of cluster did you create?
In our Linux cluster all the options listed in the guide you linked are available.

Related

Azure Synapse spark read from default storage

We are working on an Azure Synapse Analytics project with a CI/CD pipeline. I want to read data with a serverless Spark pool from a storage account, but without specifying the storage account name. Is this possible? We are using the default storage account but a separate container for the data lake data.
I can read data with spark.read.parquet('abfss://{container_name}@{account_name}.dfs.core.windows.net/filepath.parquet'), but since the name of the storage account differs between dev, test, and prod, this would need to be parameterized, and I would like to avoid that if possible. Is there any native Spark way to do this? I found some documentation about doing this with pandas and fsspec, but not with Spark alone.
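No answer is recorded for this one, but a common workaround (a minimal sketch only, not a native Spark feature) is to have the CI/CD pipeline inject the account name per environment as a Spark configuration value, so the notebook never hard-codes it. The key spark.myapp.storageAccount below is a hypothetical name chosen for illustration:

# Minimal sketch: the deployment pipeline sets a per-environment Spark config value.
# "spark.myapp.storageAccount" is a hypothetical key, not a built-in Spark setting.
account_name = spark.conf.get("spark.myapp.storageAccount")
container_name = "datalake"
path = f"abfss://{container_name}@{account_name}.dfs.core.windows.net/filepath.parquet"
df = spark.read.parquet(path)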

How to access ADLS blob containers from Databricks using User Assigned Identity

I have an ADLS storage account with blob containers. I have successfully mounted ADLS with a Service Principal in Databricks and am able to do my necessary transformations on the data.
Now I'm in the process of moving to User Assigned Managed Identities to avoid keeping secrets in my code. For this, I have created the required Managed Identity and enabled it for my Service Principal by assigning the necessary role on the storage account.
My question is: how can I use the Managed Identity, or how can I do my transformations on the ADLS storage from Databricks without mounting or using secrets?
Please suggest a working solution or any helpful forum for the same.
Thanks.
You can authenticate automatically to Azure Data Lake Storage Gen1 (ADLS Gen1) and Azure Data Lake Storage Gen2 (ADLS Gen2) from Azure Databricks clusters using the same Azure Active Directory (Azure AD) identity that you use to log into Azure Databricks. When you enable Azure Data Lake Storage credential passthrough for your cluster, commands that you run on that cluster can read and write data in Azure Data Lake Storage without requiring you to configure service principal credentials for access to storage.
Enable Azure Data Lake Storage credential passthrough for a High Concurrency cluster
High concurrency clusters can be shared by multiple users. They support only Python and SQL with Azure Data Lake Storage credential passthrough.
When you create a cluster, set Cluster Mode to High Concurrency.
Under Advanced Options, select Enable credential passthrough for user-level data access and only allow Python and SQL commands.
Enable Azure Data Lake Storage credential passthrough for a Standard cluster
When you create a cluster, set the Cluster Mode to Standard.
Under Advanced Options, select Enable credential passthrough for user-level data access and select the user name from the Single User Access drop-down.
Access Azure Data Lake Storage directly using credential passthrough
After configuring Azure Data Lake Storage credential passthrough and creating storage containers, you can access data directly in Azure Data Lake Storage Gen1 using an adl:// path and Azure Data Lake Storage Gen2 using an abfss:// path.
Example:
Python - spark.read.csv("adl://<storage-account-name>.azuredatalakestore.net/MyData.csv").collect()
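For ADLS Gen2, the equivalent direct access would use an abfss:// path (a sketch; the container and account names are placeholders):
Python - spark.read.csv("abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/MyData.csv").collect()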
Refer to this official documentation: Access Azure Data Lake Storage using Azure Active Directory credential passthrough

How to connect Azure Data Factory with SQL Endpoints instead of interactive cluster?

Is it possible to connect Azure Data Factory with Azure Databricks SQL Endpoints (Delta tables and views) instead of an interactive cluster? I tried with the Azure Databricks Delta Lake connector, but it only has options for clusters, not endpoints.
Unfortunately, you cannot connect to Azure Databricks SQL endpoints from ADF.
Note: With the compute option, you can connect to an Azure Databricks workspace with the below cluster options:
New Job cluster
Existing interactive cluster
Existing instance pool
Note: With the Datastore option (Azure Databricks Delta Lake), you can connect only to existing interactive clusters.
We would appreciate it if you could share this feedback on our feedback channel, which is open for the user community to upvote and comment on. This allows our product teams to effectively prioritize your request against the existing feature backlog and gives insight into the potential impact of implementing the suggested feature.

How to check whether the storage account V2 created is having data lake gen2 property or not in Azure?

I'm very new to Azure and would like to know how I can check whether an existing Storage account V2 in a resource group is of type Data Lake Gen2 or not.
I know that selecting the Hierarchical namespace enabled option during creation makes it Data Lake Gen2.
But how can I check after creation:
Anywhere in the portal.
Azure CLI - any CLI commands to check.
Thanks in advance.
On the portal, select the storage account and click on Configuration. You should be able to see on the right-hand side whether Hierarchical namespace has been enabled.
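If you prefer to check programmatically, the storage account's properties expose the hierarchical namespace flag. Below is a minimal Python sketch using the azure-mgmt-storage and azure-identity packages; the subscription, resource group, and account names are placeholders:

# Minimal sketch: read the hierarchical namespace flag via the management SDK.
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient

client = StorageManagementClient(DefaultAzureCredential(), "<subscription-id>")
account = client.storage_accounts.get_properties("<resource-group>", "<storage-account-name>")
# is_hns_enabled is True when the hierarchical namespace (Data Lake Gen2) is enabled.
print(account.is_hns_enabled)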

attaching additional storage accounts with SAS key while creating HDInsight cluster from the Azure portal

How do I specify an additional storage account with a SAS key from the Azure portal while creating an HDInsight cluster? It's expecting the actual storage key, not a SAS key. Ideally I want to do that and then export a template out of it. My goal is to get an ARM template example for attaching storage with a SAS key to an HDInsight cluster, but I am not able to find this template anywhere. I just need an example that I can use.
Unfortunately, you don't have the option to attach additional storage accounts with a SAS key while creating an HDInsight cluster from the Azure portal.
I would request you to provide the feedback here:
https://feedback.azure.com/forums/217335-hdinsight
All of the feedback you share in these forums will be monitored and reviewed by the Microsoft engineering teams responsible for building Azure.
