How to Read Append Blobs as DataFrames in Azure Databricks

My batch processing pipeline in Azure has the following scenario: I am using the copy activity in Azure Data Factory to unzip thousands of zip files, stored in a blob storage container. These zip files are stored in a nested folder structure inside the container, e.g.
zipContainer/deviceA/component1/20220301.zip
The resulting unzipped files are stored in another container, with the hierarchy preserved via the sink's copy behavior option, e.g.
unzipContainer/deviceA/component1/20220301.zip/measurements_01.csv
I enabled logging on the copy activity and provided a folder path to store the generated logs (in txt format), which have the following structure:
Timestamp, Level, OperationName, OperationItem, Message
2022-03-01 15:14:06.9880973, Info, FileWrite, "deviceA/component1/2022.zip/measurements_01.csv", "Complete writing file. File is successfully copied."
I want to read the content of these logs in an R notebook in Azure Databricks, in order to get the complete paths of these csv files for processing. The command I used, read.df, is part of the SparkR library:
Logs <- read.df(log_path, source = "csv", header="true", delimiter=",")
The following exception is returned:
Exception: Incorrect Blob type, please use the correct Blob type to access a blob on the server. Expected BLOCK_BLOB, actual APPEND_BLOB.
The generated logs from the copy activity are of the append blob type; read.df() can read block blobs without any issue.
Given the above scenario, how can I read these logs successfully into my R session in Databricks?

According to the Microsoft documentation below, Azure Databricks and the Hadoop Azure WASB implementation do not support reading append blobs.
https://learn.microsoft.com/en-us/azure/databricks/kb/data-sources/wasb-check-blob-types
When you try to read a log file of the append blob type, it fails with the error Exception: Incorrect Blob type, please use the correct Blob type to access a blob on the server. Expected BLOCK_BLOB, actual APPEND_BLOB.
So you cannot read log files of the append blob type from a blob storage account. A solution is to use an Azure Data Lake Storage Gen2 container for logging: when you run the pipeline with ADLS Gen2 as the log destination, it creates the log file as a block blob, which you can then read from Databricks without any issue.
Using blob storage for logging:
Using ADLS gen2 for logging:
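Once the logs land as block blobs (or have otherwise been downloaded as text), extracting the complete paths of the successfully copied files is plain CSV parsing. A minimal sketch, shown in Python for brevity; the parse_copy_log helper and the sample text are illustrative, built from the five-column log layout shown in the question:

```python
import csv
import io

def parse_copy_log(log_text):
    """Return the paths of files that the copy activity reported as written."""
    paths = []
    reader = csv.DictReader(io.StringIO(log_text))
    for row in reader:
        # Keep only successful FileWrite entries.
        if row["OperationName"] == "FileWrite" and "successfully copied" in row["Message"]:
            paths.append(row["OperationItem"])
    return paths

# Sample log content matching the structure shown in the question.
sample = (
    "Timestamp,Level,OperationName,OperationItem,Message\n"
    '2022-03-01 15:14:06.9880973,Info,FileWrite,"deviceA/component1/2022.zip/measurements_01.csv",'
    '"Complete writing file. File is successfully copied."\n'
)

print(parse_copy_log(sample))  # ['deviceA/component1/2022.zip/measurements_01.csv']
```

The returned relative paths can then be prefixed with the container URL to build the full file locations for processing.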

Related

Advanced analytics on container logs stored in a Storage account V2 (append_blob)

I sent my Azure Databricks logs to a storage account, and Microsoft by default stores those entries as append_blob. When I tried to read the JSON data with the access key, I got an error: shaded.databricks.org.apache.hadoop.fs.azure.AzureException: com.microsoft.azure.storage.StorageException: Incorrect Blob type, please use the correct Blob type to access a blob on the server. Expected BLOCK_BLOB, actual APPEND_BLOB.
Is there any way to read that data path directly (insights-logs-jobs.mdd.blob.core.windows.net/resourceId=/SUBSCRIPTIONS/xxxxxx-xxxx-xxxxx/RESOURCEGROUPS/ssd--RG/PROVIDERS/MICROSOFT.DATABRICKS/WORKSPACES/addd-PROCESS-xx-ADB/y%3D2021/m%3D12/d%3D07/h%3D00/m%3D00/PT1H.json)?
As a second approach, I tried copying the data to another container, where it arrives as block_blob, and reading it with Databricks then works. But I need to automate copying the data from multiple containers to another, as shown in the diagram.
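As an aside, the y%3D2021/m%3D12/... segments in that path are just URL-encoded y=2021/m=12/... partition folders, and the PT1H.json diagnostic files are newline-delimited JSON. Once the append blob's bytes are downloaded (for example with the azure-storage-blob SDK), parsing them needs only the standard library. A stdlib-only sketch; the sample record is illustrative, not a real Databricks log entry:

```python
import json
from urllib.parse import unquote

# Decode the URL-encoded partition segments of a diagnostic-log blob path.
encoded = "y%3D2021/m%3D12/d%3D07/h%3D00/m%3D00/PT1H.json"
print(unquote(encoded))  # y=2021/m=12/d=07/h=00/m=00/PT1H.json

# Diagnostic logs are newline-delimited JSON: one record per line.
sample_lines = (
    '{"category": "jobs", "operationName": "runStart"}\n'
    '{"category": "jobs", "operationName": "runSucceeded"}\n'
)
records = [json.loads(line) for line in sample_lines.splitlines() if line.strip()]
print([r["operationName"] for r in records])  # ['runStart', 'runSucceeded']
```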

Databricks DBFS file not found after upload

I am using the following code in Databricks to upload files to DBFS. The files are showing when I do dbutils.fs.ls(path). However when I try to read, I am getting a file not found error (see further down). Also, the file sizes are showing as zero?
def WriteFileToDbfs(file_path, test_folder_file_path, target_test_file_name):
    # Read the source Delta table, keep the first 1000 rows, and write them out as Parquet.
    df = spark.read.format("delta").load(file_path)
    df2 = df.limit(1000)
    df2.write.mode("overwrite").parquet(test_folder_file_path + target_test_file_name)
Here is the error:
AnalysisException: Path does not exist: dbfs:/tmp/qa_test/test-file.parquet;
Here are the files listed but with zero sizes:
In Azure Databricks, this is expected behavior:
For files, it displays the actual file size.
For directories, it displays size = 0.
For corrupted files, it displays size = 0.
You can get more details using the Databricks CLI or DBFS Explorer:
DBFS Explorer was created as a quick way to upload and download files to the Databricks filesystem (DBFS). This will work with both AWS and Azure instances of Databricks. You will need to create a bearer token in the web interface in order to connect.
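One detail worth checking in the snippet above (an assumption, since the full call site isn't shown) is that test_folder_file_path + target_test_file_name concatenates without a separator, which can silently produce a path like dbfs:/tmp/qa_testtest-file.parquet and then a "Path does not exist" error on read. Joining explicitly avoids that; the folder and file names below are illustrative:

```python
import posixpath

folder = "dbfs:/tmp/qa_test"   # illustrative folder (no trailing slash)
name = "test-file.parquet"     # illustrative file name

# Naive concatenation drops the separator when the folder lacks a trailing slash.
print(folder + name)                 # dbfs:/tmp/qa_testtest-file.parquet

# posixpath.join inserts the "/" only when it is missing.
print(posixpath.join(folder, name))  # dbfs:/tmp/qa_test/test-file.parquet
```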

How to access files stored in Azure Data Lake and use them as input to AzureBatchStep in azureml.pipeline.steps?

I registered an Azure data lake datastore as in the documentation in order to access the files stored in it.
I used
DataReference(datastore, data_reference_name=None, path_on_datastore=None, mode='mount', path_on_compute=None, overwrite=False)
and used it as input to azure pipeline step in AzureBatchStep method.
But I ran into an issue: the datastore name could not be fetched for the input.
Is Azure Data Lake not accessible in Azure ML or am I getting it wrong?
Azure Data Lake is not supported as an input to AzureBatchStep. You should probably use a DataTransferStep to copy the data from ADLS to Blob storage, and then use the output of the DataTransferStep as input to the AzureBatchStep.

Get a list of all files in an Azure Data Lake directory for a Lookup activity in ADF V2

I have a number of files in Azure Data Lake Storage, and I am creating a pipeline in ADF V2 to get a list of all the files in a folder in ADLS. How can I do this?
You should use the Get Metadata activity.
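For the Get Metadata approach, requesting the childItems field returns the names and types of everything in the folder. A hedged sketch of the activity's typeProperties (the activity and dataset names are illustrative):

```json
{
    "name": "GetFileList",
    "type": "GetMetadata",
    "typeProperties": {
        "dataset": {
            "referenceName": "AdlsFolderDataset",
            "type": "DatasetReference"
        },
        "fieldList": [ "childItems" ]
    }
}
```

The activity's output, @activity('GetFileList').output.childItems, can then feed a ForEach or Lookup downstream.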
Alternatively, you could follow the steps below to list files in ADLS.
1. Use the ADLS SDK to get the list of file names in a specific directory and output the results, for example with the Java SDK (linked below); of course, you could also use .NET or Python.
// list directory contents ("client" is an ADLStoreClient created beforehand)
List<DirectoryEntry> list = client.enumerateDirectory("/a/b", 2000);
System.out.println("Directory listing for directory /a/b:");
for (DirectoryEntry entry : list) {
    printDirectoryInfo(entry);
}
System.out.println("Directory contents listed.");
2. Compile the program so that it can be executed, and store it in Azure Blob Storage.
3. Use a Custom activity in Azure Data Factory to configure the blob storage path and execute the program. More details are in the Custom activity documentation.
You could use custom activity in Azure data factory.
https://learn.microsoft.com/en-us/azure/data-lake-store/data-lake-store-get-started-java-sdk#list-directory-contents

Write DataFrame from Databricks to Data Lake

I am manipulating some data using Azure Databricks. The data is in Azure Data Lake Storage Gen1. I mounted the data into DBFS, but now, after transforming it, I would like to write it back into my data lake.
To mount the data I used the following:
configs = {"dfs.adls.oauth2.access.token.provider.type": "ClientCredential",
"dfs.adls.oauth2.client.id": "<your-service-client-id>",
"dfs.adls.oauth2.credential": "<your-service-credentials>",
"dfs.adls.oauth2.refresh.url": "https://login.microsoftonline.com/<your-directory-id>/oauth2/token"}
dbutils.fs.mount(source = "adl://<your-data-lake-store-account-name>.azuredatalakestore.net/<your-directory-name>", mount_point = "/mnt/<mount-name>", extra_configs = configs)
I want to write back a .csv file. For this task I am using the following line
dfGPS.write.mode("overwrite").format("com.databricks.spark.csv").option("header", "true").csv("adl://<your-data-lake-store-account-name>.azuredatalakestore.net/<your-directory-name>")
However, I get the following error:
IllegalArgumentException: u'No value for dfs.adls.oauth2.access.token.provider found in conf file.'
Any piece of code that can help me, or a link that walks me through it?
Thanks.
If you mount Azure Data Lake Store, you should use the mount point to store your data, instead of the "adl://..." URL. For details on how to mount Azure Data Lake Store (ADLS) Gen1, see the Azure Databricks documentation. You can verify that the mount point works with:
dbutils.fs.ls("/mnt/<newmountpoint>")
So, after mounting ADLS Gen1, try (note the leading slash in the mount path):
dfGPS.write.mode("overwrite").format("com.databricks.spark.csv").option("header", "true").csv("/mnt/<mount-name>/<your-directory-name>")
This should work if you added the mount point properly and the Service Principal also has access rights on the ADLS.
Spark always writes multiple files into a directory, because each partition is saved individually. See also the related Stack Overflow question.
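Because Spark writes one file per partition plus marker files such as _SUCCESS, downstream code that expects a single CSV usually has to pick out the part files. A stdlib sketch simulating a typical output directory (the file names are illustrative of Spark's default part-file naming):

```python
import fnmatch
import os
import tempfile

# Simulate the directory layout Spark leaves behind after a CSV write.
out_dir = tempfile.mkdtemp()
for name in ["_SUCCESS", "part-00000-abc123.csv", "part-00001-abc123.csv"]:
    open(os.path.join(out_dir, name), "w").close()

# Pick out only the CSV part files, ignoring Spark's marker files.
part_files = sorted(fnmatch.filter(os.listdir(out_dir), "part-*.csv"))
print(part_files)  # ['part-00000-abc123.csv', 'part-00001-abc123.csv']
```

Calling coalesce(1) on the DataFrame before the write reduces the output to a single part file, at the cost of funnelling all data through one task.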
