azureml tabular dataset over azure gen2 datalake - azure-machine-learning-service

What have I tried
set up an AzureML DataStore using Identity based authentication
set up an AzureML Dataset for a single file under a specific file system
from azureml.core import Dataset, Workspace

workspace = Workspace.from_config("config.json", auth=auth)
dataset = Dataset.get_by_name(workspace, 'engage_event_type')
frame = dataset.to_pandas_dataframe()
I am able to explore the dataset from azure portal and it displays the right data correctly.
However, when running the above (where auth is a Service Principal that has the same rights as the Azure ML workspace instance), I get a stream of repeated log lines like the ones below, but no errors, exceptions, or completion. The underlying data is < 10 KB.
Resolving access token for scope "https://datalake.azure.net//.default" using identity of type "SP".
Resolving access token for scope "https://datalake.azure.net//.default" using identity of type "SP".
I have tried running the script both on local compute and on a compute instance; both gave the same issue.
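For reference, a minimal sketch of the kind of setup described above, assuming azureml-core, a CSV file, and placeholder names/IDs (none of these values come from the original post):
from azureml.core import Dataset, Datastore, Workspace
from azureml.core.authentication import ServicePrincipalAuthentication

# Service principal used to authenticate against the workspace (all IDs are placeholders).
auth = ServicePrincipalAuthentication(
    tenant_id="<tenant-id>",
    service_principal_id="<client-id>",
    service_principal_password="<client-secret>")
workspace = Workspace.from_config("config.json", auth=auth)

# Identity-based datastore: no credentials are stored, so data access is resolved
# from the identity of the caller (or the compute) at runtime.
datastore = Datastore.register_azure_data_lake_gen2(
    workspace=workspace,
    datastore_name="engage_adls",           # placeholder datastore name
    filesystem="<file-system>",             # the ADLS Gen2 container / file system
    account_name="<storage-account>")

# Tabular dataset over a single file (CSV format assumed here).
dataset = Dataset.Tabular.from_delimited_files(path=(datastore, "<path/to/file.csv>"))
dataset.register(workspace, name="engage_event_type")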

Related

Databricks Delta - Error: Overlapping auth mechanisms using deltaTable.detail()

In Azure Databricks, I have a Unity Catalog metastore created on ADLS in its own container (metastore@stgacct.dfs.core.windows.net/), connected with the Azure identity. Works fine.
I have a container on the same storage account called data. I'm using notebook-scoped credentials to gain access to that container, via abfss://data@stgacct... Works fine.
Using the python Delta API, I'm creating an object for my DeltaTable using: deltaTable = DeltaTable.forName(spark, "mycat.myschema.mytable"). I'm able to perform normal Delta functions using that object like MERGE. Works fine.
However, if I attempt to run the deltaTable.detail() command, I get the error: "Your query is attempting to access overlapping paths through multiple authorization mechanisms, which is not currently supported."
It's as if Spark doesn't know which credential to use to fulfill the .detail() command: the metastore identity, or the SPN I used when I scoped my credentials for the data container (which also has rights to the metastore container).
To test: if I restart my cluster, which drops the Spark conf for ADLS, and then run deltaTable = DeltaTable.forName(spark, "mycat.myschema.mytable") followed by deltaTable.detail(), I get the error "Failure to initialize configuration: Invalid configuration value detected for fs.azure.account.key", as if it's not using the metastore credentials, which I would have expected since it's a Unity Catalog managed table (??).
Suggestions?
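For context, a rough sketch of the kind of notebook-scoped credential setup described above (all account, SPN, and secret names are placeholders). Note that these fs.azure.account.* settings are keyed by storage account rather than by container, so they also apply to paths in the metastore container on the same account, which is consistent with the overlapping-auth error:
stg = "stgacct.dfs.core.windows.net"   # placeholder storage account endpoint

# Notebook-scoped service principal credentials intended for the "data" container.
spark.conf.set(f"fs.azure.account.auth.type.{stg}", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{stg}",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{stg}", "<client-id>")
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{stg}",
               dbutils.secrets.get(scope="<scope>", key="<key>"))
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{stg}",
               "https://login.microsoftonline.com/<tenant-id>/oauth2/token")

from delta.tables import DeltaTable
deltaTable = DeltaTable.forName(spark, "mycat.myschema.mytable")
deltaTable.detail()   # the call reported to raise the overlapping-paths error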

Accessing Azure ADLS gen2 with Pyspark on Databricks

I'm trying to learn Spark, Databricks & Azure.
I'm trying to access ADLS Gen2 from Databricks using PySpark.
I can't find a proper way; I believe it's super simple, but I have failed so far.
Currently each time I receive the following:
Unable to access container {name} in account {name} using anonymous
credentials, and no credentials found for them in the configuration.
I already have Gen2 running, and I have a SAS_URI for access.
What I was trying so far:
(based on this link: https://learn.microsoft.com/pl-pl/azure/databricks/data/data-sources/azure/adls-gen2/azure-datalake-gen2-sas-access):
spark.conf.set(f"fs.azure.account.auth.type.{STORAGE_ACCOUNT_NAME}.dfs.core.windows.net", {SAS_URI})
spark.conf.set(f"fs.azure.sas.token.provider.type.{STORAGE_ACCOUNT_NAME}.dfs.core.windows.net", {SAS_URI})
Then to reach out to data:
sd_xxx = spark.read.parquet(f"wasbs://{CONTAINER_NAME}@{STORAGE_ACCOUNT_NAME}.dfs.core.windows.net/{proper_path_to_files/}")
Your configuration is incorrect. The first parameter should be set to just the value SAS, while the second should be set to the name of a Scala/Java class that returns the SAS token; you can't simply use a URI with the SAS information in it, you need to implement some custom code.
If you want to use wasbs, that is the protocol for accessing Azure Blob Storage. Although it can also be used to access ADLS Gen2 (not recommended), you then need to use blob.core.windows.net instead of dfs.core.windows.net, and also set the correct Spark property for Azure Blob access.
The more common procedure to follow is here: Access Azure Data Lake Storage Gen2 using OAuth 2.0 with an Azure service principal
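As a rough sketch of the service-principal (OAuth) approach linked above, with placeholder IDs and the same variable names as in the question (assuming the ABFS driver that ships with Databricks):
spark.conf.set(f"fs.azure.account.auth.type.{STORAGE_ACCOUNT_NAME}.dfs.core.windows.net", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{STORAGE_ACCOUNT_NAME}.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{STORAGE_ACCOUNT_NAME}.dfs.core.windows.net",
               "<application-id>")
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{STORAGE_ACCOUNT_NAME}.dfs.core.windows.net",
               "<client-secret>")
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{STORAGE_ACCOUNT_NAME}.dfs.core.windows.net",
               "https://login.microsoftonline.com/<tenant-id>/oauth2/token")

# Read via abfss:// with the '@' separator, rather than wasbs:// against dfs.core.windows.net.
sd_xxx = spark.read.parquet(
    f"abfss://{CONTAINER_NAME}@{STORAGE_ACCOUNT_NAME}.dfs.core.windows.net/{proper_path_to_files}")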

Loading file from Azure Blob storage into Azure SQL DB: error code 86 The specified network password is not correct

I've been trying to run the following script to read the file from azure blob storage.
--------------------------------------------
-- CREATING CREDENTIAL
--------------------------------------------
-- Shared access signature
--------------------------------------------
CREATE DATABASE SCOPED CREDENTIAL dlcred
with identity='SHARED ACCESS SIGNATURE',
SECRET = 'sv=2018-03-28&ss=bfqt&srt=sco&sp=rwdlacup&se=2019-12-01T07:28:58Z&st=2019-08-31T23:28:58Z&spr=https,http&sig=<signature from storage account>';
--------------------------------------------
--CREATING SOURCE
--------------------------------------------
CREATE EXTERNAL DATA SOURCE datalake
WITH (
TYPE =  BLOB_STORAGE,
LOCATION='https://<storageaccount>.blob.core.windows.net/<blob>',
CREDENTIAL = dlcred
);
Originally, the script worked just fine, but later on it started giving the following error when running the last query below - Cannot bulk load because the file "test.txt" could not be opened. Operating system error code 86(The specified network password is not correct.)
--TEST
--------------------------------------------
SELECT CAST(BulkColumn AS XML)
FROM OPENROWSET
(
 BULK 'test.xml',
 DATA_SOURCE = 'datalake', 
 SINGLE_BLOB
) as xml_import
The same error happens if I create a credential with service principal or access key.
Tried literally everything and logged the ticket with Azure support, however they are struggling to replicate this error.
I feel like it's an issue outside of the storage account and SQL server - Azure has a whole bunch of services that can be activated/deactivated against a subscription, and I feel like it's one of these that's preventing us from successfully mapping the storage account.
Has anyone encountered this error? And if so, how did you solve it?
I was able to get this issue resolved with Microsoft support. Following section F here, I granted Storage Blob Data Contributor access to the managed identity of the SQL Server instance, then ran the SQL statements from the managed-identity section here: https://learn.microsoft.com/en-us/sql/t-sql/statements/bulk-insert-transact-sql?view=sql-server-ver15#f-importing-data-from-a-file-in-azure-blob-storage.
Preserving the code solution below:
CREATE DATABASE SCOPED CREDENTIAL msi_cred WITH IDENTITY = 'Managed Identity';
GO
CREATE EXTERNAL DATA SOURCE MyAzureBlobStorage
WITH (
    TYPE = BLOB_STORAGE,
    LOCATION = 'https://****************.blob.core.windows.net/curriculum',
    CREDENTIAL = msi_cred
);
BULK INSERT Sales.Invoices
FROM 'inv-2017-12-08.csv'
WITH (DATA_SOURCE = 'MyAzureBlobStorage');
In order to do this, the SQL server instance requires a managed identity to be assigned to it. This can be done at creation time with the --assign-identity flag.

Error running Spark on Databricks: constructor public XXX is not whitelisted

I was using Azure Databricks and trying to run some example python code from this page.
But I get this exception:
py4j.security.Py4JSecurityException: Constructor public org.apache.spark.ml.classification.LogisticRegression(java.lang.String) is not whitelisted.
This error shows up with some library methods when using a High Concurrency cluster with credential passthrough enabled. If that is your scenario, a workaround that may be an option is to use a different cluster mode.
py4j.security.Py4JSecurityException: ... is not whitelisted
This exception is thrown when you have accessed a method that Azure Databricks has not explicitly marked as safe for Azure Data Lake Storage credential passthrough clusters. In most cases, this means that the method could allow a user on an Azure Data Lake Storage credential passthrough cluster to access another user's credentials.
Reference: https://docs.azuredatabricks.net/spark/latest/data-sources/azure/adls-passthrough.html
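For illustration, a minimal sketch of the kind of call from the linked examples page that triggers the exception (the training data here is made up):
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

training = spark.createDataFrame(
    [(1.0, Vectors.dense([0.0, 1.1, 0.1])),
     (0.0, Vectors.dense([2.0, 1.0, -1.0]))],
    ["label", "features"])

# On a High Concurrency cluster with ADLS credential passthrough enabled, the
# constructor call below raises py4j.security.Py4JSecurityException; on a
# standard cluster mode it runs normally.
lr = LogisticRegression(maxIter=10, regParam=0.01)
model = lr.fit(training)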

Azure ML studio export data Azure Storage V2

I already posted my problem here, and they suggested that I post it here.
I am trying to export data from Azure ML to Azure Storage but I have this error:
Error writing to cloud storage: The remote server returned an error: (400) Bad Request.. Please check the url. . ( Error 0151 )
My blob storage configuration is Storage V2 / Standard, with Require secure transfer enabled.
If I set Require secure transfer to disabled, the export works fine.
How can I export data to my blob storage with Require secure transfer enabled?
According to the official tutorial Export to Azure Blob Storage, there are two authentication types for exporting data to Azure Blob Storage: SAS and Account. They are described below.
For Authentication type, choose Public (SAS URL) if you know that the storage supports access via a SAS URL.
A SAS URL is a special type of URL that can be generated by using an Azure storage utility, and is available for only a limited time. It contains all the information that is needed for authentication and download.
For URI, type or paste the full URI that defines the account and the public blob.
For private accounts, choose Account, and provide the account name and the account key, so that the experiment can write to the storage account.
Account name: Type or paste the name of the account where you want to save the data. For example, if the full URL of the storage account is http://myshared.blob.core.windows.net, you would type myshared.
Account key: Paste the storage access key that is associated with the account.
I tried to use a simple module combination, with the figure and Python code below, to reproduce the issue you got.
import pandas as pd

def azureml_main(dataframe1=None, dataframe2=None):
    dataframe1 = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})
    return dataframe1,
When I tried to use the Account authentication type with my Blob Storage V2 account, I got the same Error 0151 issue as yours, shown below after clicking the View error log button under the View output log link.
Error 0151
There was an error writing to cloud storage. Please check the URL.
This error in Azure Machine Learning occurs when the module tries to write data to cloud storage but the URL is unavailable or invalid.
Resolution
Check the URL and verify that it is writable.
Exception Messages
Error writing to cloud storage (possibly a bad url).
Error writing to cloud storage: {0}. Please check the url.
Based on the error description above, the error appears to be caused by the Export Data module incorrectly generating the blob URL with SAS from the account information. I suspect the module code is old and not compatible with the new V2 storage API or API version; you can report it to feedback.azure.com.
However, when I switched to the SAS authentication type and entered a blob URL with a SAS query string for my container, generated via the Azure Storage Explorer tool as below, it worked fine.
Fig 1: Right click on the container of your Blob Storage account, and click the Get Shared Access Signature
Fig 2: Enable the Write permission (using the UTC timezone is recommended) and click the Create button
Fig 3: Copy the Query string value, and build a blob url with a container SAS query string like https://<account name>.blob.core.windows.net/<container name>/<blob name><query string>
Note: The blob must not already exist in the container; otherwise, an Error 0057 will be raised.
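For example, a blob URL of that shape could be assembled like this (all values are placeholders):
account_name = "<account name>"
container_name = "<container name>"
blob_name = "output.csv"                # must not already exist in the container
sas_query = "?sv=...&ss=...&sig=..."    # query string copied from Azure Storage Explorer
blob_url = f"https://{account_name}.blob.core.windows.net/{container_name}/{blob_name}{sas_query}"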

Resources