How to access files stored in Azure Data Lake and use them as input to AzureBatchStep in azureml.pipeline.steps? - azure-machine-learning-service

I registered an Azure data lake datastore as in the documentation in order to access the files stored in it.
I used
DataReference(datastore, data_reference_name=None, path_on_datastore=None, mode='mount', path_on_compute=None, overwrite=False)
and used it as input to an Azure pipeline step in the AzureBatchStep method.
But I ran into an issue: the datastore name could not be fetched in the input.
Is Azure Data Lake not accessible in Azure ML, or am I getting something wrong?

Azure Data Lake is not supported as an input in AzureBatchStep. You should probably use a DataTransferStep to copy data from ADL to Blob and then use the output of the DataTransferStep as an input to AzureBatchStep.
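A rough sketch of that wiring with the Azure ML SDK v1 might look like the following. The datastore names, the attached Data Factory compute, the Batch pool, and the executable are all placeholders, so adjust them to your workspace; this is not the only valid arrangement, just one way to stage the ADLS data on blob storage before the batch step:

from azureml.core import Workspace, Datastore
from azureml.core.compute import BatchCompute, DataFactoryCompute
from azureml.data.data_reference import DataReference
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import AzureBatchStep, DataTransferStep

ws = Workspace.from_config()

# Source: the registered ADLS datastore; destination: a blob datastore.
adls_store = Datastore.get(ws, "my_adls_datastore")
blob_store = Datastore.get(ws, "workspaceblobstore")

adls_input = DataReference(datastore=adls_store,
                           data_reference_name="adls_input",
                           path_on_datastore="raw/input")
blob_staging = DataReference(datastore=blob_store,
                             data_reference_name="blob_staging",
                             path_on_datastore="staging/input")

# An attached Azure Data Factory compute performs the ADLS-to-blob copy.
adf_compute = DataFactoryCompute(ws, "adf-compute")
transfer = DataTransferStep(name="adls_to_blob",
                            source_data_reference=adls_input,
                            destination_data_reference=blob_staging,
                            compute_target=adf_compute)

# The transfer output (now on blob storage) feeds the batch step.
batch_step = AzureBatchStep(name="process_on_batch",
                            pool_id="my-batch-pool",
                            source_directory="batch_scripts",
                            executable="run.cmd",
                            inputs=[transfer.get_output()],
                            compute_target=BatchCompute(ws, "my-batch-compute"))

pipeline = Pipeline(workspace=ws, steps=[transfer, batch_step])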

Related

How to Read Append Blobs as DataFrames in Azure DataBricks

My batch processing pipeline in Azure has the following scenario: I am using the copy activity in Azure Data Factory to unzip thousands of zip files, stored in a blob storage container. These zip files are stored in a nested folder structure inside the container, e.g.
zipContainer/deviceA/component1/20220301.zip
The resulting unzipped files will be stored in another container, preserving the hierarchy in the sink's copy behavior option, e.g.
unzipContainer/deviceA/component1/20220301.zip/measurements_01.csv
I enabled logging on the copy activity and provided the folder path to store the generated logs (in txt format), which have the following structure:
Timestamp, Level, OperationName, OperationItem, Message
2022-03-01 15:14:06.9880973, Info, FileWrite, "deviceA/component1/2022.zip/measurements_01.csv", "Complete writing file. File is successfully copied."
I want to read the content of these logs in an R notebook in Azure Databricks, in order to get the complete paths of these csv files for processing. The command I used, read.df, is part of the SparkR library:
Logs <- read.df(log_path, source = "csv", header="true", delimiter=",")
The following exception is returned:
Exception: Incorrect Blob type, please use the correct Blob type to access a blob on the server. Expected BLOCK_BLOB, actual APPEND_BLOB.
The generated logs from the copy activity are of append blob type. read.df() can read block blobs without any issue.
Given the above scenario, how can I read these logs successfully into my R session in Databricks?
According to this Microsoft documentation, Azure Databricks and Hadoop Azure WASB implementations do not support reading append blobs.
https://learn.microsoft.com/en-us/azure/databricks/kb/data-sources/wasb-check-blob-types
When you try to read a log file of this append blob type, you get the error: Exception: Incorrect Blob type, please use the correct Blob type to access a blob on the server. Expected BLOCK_BLOB, actual APPEND_BLOB.
So you cannot read the append-blob log file from a blob storage account. A solution is to use an Azure Data Lake Storage Gen2 container for logging. When you run the pipeline with ADLS Gen2 as the log destination, it creates log files of block blob type, which you can then read without any issue from Databricks.
(Screenshots: the copy activity log settings using blob storage vs. using ADLS Gen2 for logging.)
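For example, once the logs are written to ADLS Gen2 as block blobs, reading them could look like the sketch below. It uses PySpark rather than the question's SparkR, and assumes an abfss:// path for which access is already configured; the storage account and container names are placeholders, and the column names come from the log structure shown above:

log_path = "abfss://logs@<storage-account>.dfs.core.windows.net/copy-activity-logs/"

# Read the comma-delimited log files with a header row.
logs_df = (spark.read
           .option("header", "true")
           .option("delimiter", ",")
           .csv(log_path))

# Keep only the file-write entries and collect the csv paths for processing.
csv_paths = [row["OperationItem"]
             for row in logs_df.filter(logs_df.OperationName == "FileWrite").collect()]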

Azure Data Factory Copy Snowflake to Azure blob storage ErrorCode 2200 User configuration issue ErrorCode=SnowflakeExportCopyCommandValidationFailed

I need help with this error.
ADF copy activity, moving data from Snowflake to Azure Blob Storage as delimited text.
I am able to preview the Snowflake source data. I am also able to browse the containers via the sink browse. This doesn't look like an issue with permissions.
ErrorCode=SnowflakeExportCopyCommandValidationFailed,
'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,
Message=Snowflake Export Copy Command validation failed:
'The Snowflake copy command payload is invalid.
Cannot specify property: column mapping,
Source=Microsoft.DataTransfer.ClientLibrary,'
Thanks for your help
Clearing the mapping from the copy activity worked.
To resolve the above error, please check the following:
The error means the copy command you are using is not valid, so first check the copy command itself.
To get more detail about the error, try the following:
COPY INTO MYTABLE VALIDATION_MODE = 'RETURN_ERRORS' FILES=('result.csv');
Check if you have any issue with the data files before loading the data again.
Try checking the storage account you are currently using and note that Snowflake doesn't support Data Lake Storage Gen1.
Use the COPY INTO command to copy the data from the Snowflake database table into Azure blob storage container.
Note:
Use the blob.core.windows.net endpoint for all supported types of Azure blob storage accounts, including Data Lake Storage Gen2.
Make sure to have either ACCOUNTADMIN role or a role with the global CREATE INTEGRATION privilege to run the below sample command:
copy into 'azure://myaccount.blob.core.windows.net/mycontainer/unload/' from mytable storage_integration = myint;
With the above command, you do not need to include credentials to access the storage.
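If you want to run the same unload from code instead of a worksheet, a minimal sketch using the snowflake-connector-python package could look like this; the connection parameters are placeholders, and myint is the storage integration from the command above:

import snowflake.connector

# Placeholder connection parameters; the role must be allowed to use the integration.
conn = snowflake.connector.connect(account="<account_identifier>",
                                   user="<user>",
                                   password="<password>",
                                   warehouse="<warehouse>",
                                   database="<database>",
                                   schema="<schema>")
try:
    # Unload the table to the Azure container via the storage integration,
    # so no storage credentials appear in the command itself.
    conn.cursor().execute(
        "copy into 'azure://myaccount.blob.core.windows.net/mycontainer/unload/' "
        "from mytable storage_integration = myint")
finally:
    conn.close()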
For more details, please refer to the links below:
Unloading into Microsoft Azure — Snowflake Documentation
COPY INTO <table> — Snowflake Documentation

Advanced analytics on the container logs present inside storage account V2 (append_blob)

I sent my Azure Databricks logs to a storage account, and Microsoft by default writes those entries as append blobs. When I tried to read the JSON data with the access key, I got an error (shaded.databricks.org.apache.hadoop.fs.azure.AzureException: com.microsoft.azure.storage.StorageException: Incorrect Blob type, please use the correct Blob type to access a blob on the server. Expected BLOCK_BLOB, actual APPEND_BLOB).
Is there any way to read that data path directly (insights-logs-jobs.mdd.blob.core.windows.net/resourceId=/SUBSCRIPTIONS/xxxxxx-xxxx-xxxxx/RESOURCEGROUPS/ssd--RG/PROVIDERS/MICROSOFT.DATABRICKS/WORKSPACES/addd-PROCESS-xx-ADB/y%3D2021/m%3D12/d%3D07/h%3D00/m%3D00/PT1H.json)?
As a second approach, I copied the data to another container, where it arrives as block_blob, and reading it with Databricks works. But I need to automate copying the data from multiple containers to another one, as seen in the diagram.
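For the automation part the question describes, one possible sketch with the azure-storage-blob Python SDK is to download each append blob and re-upload it, since upload_blob writes a block blob by default, which Databricks can read. The connection string and container names are placeholders, and this could be scheduled as a job or a function:

from azure.storage.blob import BlobServiceClient

# Placeholder connection string and container names.
service = BlobServiceClient.from_connection_string("<connection-string>")
source = service.get_container_client("insights-logs-jobs")
target = service.get_container_client("logs-as-block-blobs")

for blob in source.list_blobs():
    data = source.get_blob_client(blob.name).download_blob().readall()
    # upload_blob creates a block blob by default.
    target.upload_blob(name=blob.name, data=data, overwrite=True)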

Azure Function can write files in Data Lake, when it is bound to Event Grid, but not when it is called from Azure Data Factory

I have an Azure Function that should process zip files and convert their contents to csv files and save them on a Data Lake gen 1.
I have enabled managed Identity of this Azure Functions. Then I have added this managed Identity as an OWNER on the Access control (IAM) of Data Lake.
First Scenario:
I call this Azure Function from Azure Data Factory and send the file URIs of the zip files, which are persisted in a storage account, from Azure Data Factory to Azure Functions. When the Azure Function processes the files by saving csv files in the Data Lake, I get this error:
Error in creating file root\folder1\folder2\folder3\folder4\test.csv.
Operation: CREATE failed with HttpStatus:Unauthorized Token Length: 1162
Unknown Error: Unexpected type of exception in JSON error output.
Expected: RemoteException Actual: error Source:
Microsoft.Azure.DataLake.Store StackTrace: at
Microsoft.Azure.DataLake.Store.WebTransport.ParseRemoteError(Byte[]
errorBytes, Int32 errorBytesLength, OperationResponse resp, String
contentType).
RemoteJsonErrorResponse: Content-Type of error response:
application/json; charset=utf-8.
Error:{"error":{"code":"AuthenticationFailed","message":"The access token in the 'Authorization' header is expired.
Second Scenario:
I have set an Event Grid for this Azure Functions. After dropping zip files in storage account, which is bound to Azure Functions, zip files are processed and csv files are successfully saved in the Data Lake.
Good to know: the Azure Function and the Data Lake are in the same VNet.
Could someone explain to me, why my function works fine with the event grid but doesn't work if I call it from Azure Data Factory? (saving csv files in Data Lake Gen 1)
Have you shown all of the error?
From the error, it seems to be related to Azure Active Directory authentication (Data Lake Storage Gen1 uses Azure Active Directory for authentication). Did you use a Bearer token in your Azure Function when you tried to send something to Data Lake Storage Gen1?
Please show the code of your Azure Function, otherwise it will be hard to find the cause of the error.
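Since the error says the access token in the Authorization header is expired, one hedged sketch (assuming the Function builds the Bearer token itself with the azure-identity package and the Function's managed identity) is to request a fresh token on every invocation instead of caching it, so it cannot be expired by the time the Data Lake call is made:

from azure.identity import DefaultAzureCredential

# Uses the Function's managed identity; https://datalake.azure.net/ is the
# ADLS Gen1 resource. Request the token per invocation rather than caching it.
credential = DefaultAzureCredential()
token = credential.get_token("https://datalake.azure.net/.default")
headers = {"Authorization": "Bearer " + token.token}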

Write DataFrame from Databricks to Data Lake

It happens that I am manipulating some data using Azure Databricks. Such data is in an Azure Data Lake Storage Gen1. I mounted the data into DBFS, but now, after transforming the data I would like to write it back into my data lake.
To mount the data I used the following:
configs = {"dfs.adls.oauth2.access.token.provider.type": "ClientCredential",
"dfs.adls.oauth2.client.id": "<your-service-client-id>",
"dfs.adls.oauth2.credential": "<your-service-credentials>",
"dfs.adls.oauth2.refresh.url": "https://login.microsoftonline.com/<your-directory-id>/oauth2/token"}
dbutils.fs.mount(source = "adl://<your-data-lake-store-account-name>.azuredatalakestore.net/<your-directory-name>", mount_point = "/mnt/<mount-name>",extra_configs = configs)
I want to write back a .csv file. For this task I am using the following line
dfGPS.write.mode("overwrite").format("com.databricks.spark.csv").option("header", "true").csv("adl://<your-data-lake-store-account-name>.azuredatalakestore.net/<your-directory-name>")
However, I get the following error:
IllegalArgumentException: u'No value for dfs.adls.oauth2.access.token.provider found in conf file.'
Is there any piece of code that can help me, or a link that walks me through it?
Thanks.
If you mount Azure Data Lake Store, you should use the mountpoint to store your data, instead of "adl://...". For details on how to mount Azure Data Lake Store (ADLS) Gen1, see the Azure Databricks documentation. You can verify that the mountpoint works with:
dbutils.fs.ls("/mnt/<newmountpoint>")
So, after mounting ADLS Gen1, try:
dfGPS.write.mode("overwrite").format("com.databricks.spark.csv").option("header", "true").csv("/mnt/<mount-name>/<your-directory-name>")
This should work if you added the mountpoint properly and you also have access rights with the service principal on the ADLS.
Spark always writes multiple files into a directory, because each partition is saved individually. See also the following Stack Overflow question.
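If a single csv file is needed, one option (at the cost of pulling all the data into one partition) is to coalesce the DataFrame before writing, for example:

# Collapse to one partition so Spark writes a single part file in the directory.
dfGPS.coalesce(1).write.mode("overwrite").option("header", "true").csv("/mnt/<mount-name>/<your-directory-name>")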
