Databricks DBFS file not found after upload - databricks

I am using the following code in Databricks to upload files to DBFS. The files are showing when I do dbutils.fs.ls(path). However when I try to read, I am getting a file not found error (see further down). Also, the file sizes are showing as zero?
def WriteFileToDbfs(file_path, test_folder_file_path, target_test_file_name):
    df = spark.read.format("delta").load(file_path)
    df2 = df.limit(1000)
    df2.write.mode("overwrite").parquet(test_folder_file_path + target_test_file_name)
Here is the error:
AnalysisException: Path does not exist: dbfs:/tmp/qa_test/test-file.parquet;
Here are the files listed but with zero sizes:

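In Azure Databricks, this is expected behavior: Spark writes the output of df2.write...parquet(...) as a directory of part files, and the listing reports size=0 for directories.
For files, it displays the actual file size.
For directories, it displays size=0.
For corrupted files, it displays size=0.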
You can get more details using the Azure Databricks CLI:
You can get more details using DBFS Explorer:
DBFS Explorer was created as a quick way to upload and download files to the Databricks filesystem (DBFS). This will work with both AWS and Azure instances of Databricks. You will need to create a bearer token in the web interface in order to connect.
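To illustrate the size=0 rule for directories described above: the sketch below sums the sizes of the files underneath a directory instead of trusting the directory's own size. It assumes listing entries expose `path`, `name` and `size`, and that directory names end with `/`, as entries from `dbutils.fs.ls` do. `Entry`, `total_size`, `fake_ls` and the sample listing are hypothetical stand-ins so the snippet runs outside Databricks.

```python
from collections import namedtuple

# Hypothetical stand-in for the entries returned by dbutils.fs.ls.
Entry = namedtuple("Entry", ["path", "name", "size"])


def total_size(path, ls):
    """Sum file sizes under path; `ls` is a listing function such as dbutils.fs.ls."""
    total = 0
    for entry in ls(path):
        if entry.name.endswith("/"):
            # Directories report size 0, so descend into them instead.
            total += total_size(entry.path, ls)
        else:
            total += entry.size
    return total


# Made-up listing that mimics a Spark parquet output directory.
FAKE_FS = {
    "dbfs:/tmp/qa_test/": [
        Entry("dbfs:/tmp/qa_test/test-file.parquet/", "test-file.parquet/", 0),
    ],
    "dbfs:/tmp/qa_test/test-file.parquet/": [
        Entry("dbfs:/tmp/qa_test/test-file.parquet/_SUCCESS", "_SUCCESS", 0),
        Entry("dbfs:/tmp/qa_test/test-file.parquet/part-00000.parquet",
              "part-00000.parquet", 2048),
    ],
}


def fake_ls(path):
    # Look up the made-up listing for this path.
    return FAKE_FS[path]


print(total_size("dbfs:/tmp/qa_test/", fake_ls))
```

On a cluster, the equivalent call would presumably be `total_size("dbfs:/tmp/qa_test/", dbutils.fs.ls)`.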

Related

How to Read Append Blobs as DataFrames in Azure DataBricks

My batch processing pipeline in Azure has the following scenario: I am using the copy activity in Azure Data Factory to unzip thousands of zip files, stored in a blob storage container. These zip files are stored in a nested folder structure inside the container, e.g.
zipContainer/deviceA/component1/20220301.zip
The resulting unzipped files will be stored in another container, preserving the hierarchy in the sink's copy behavior option, e.g.
unzipContainer/deviceA/component1/20220301.zip/measurements_01.csv
I enabled the logging of the copy activity as:
And then provided the folder path to store the generated logs (in txt format), which have the following structure:
Timestamp,Level,OperationName,OperationItem,Message
2022-03-01 15:14:06.9880973,Info,FileWrite,"deviceA/component1/2022.zip/measurements_01.csv","Complete writing file. File is successfully copied."
I want to read the content of these logs in an R notebook in Azure Databricks, in order to get the complete paths of these csv files for processing. The command I used, read.df, is part of the SparkR library:
Logs <- read.df(log_path, source = "csv", header="true", delimiter=",")
The following exception is returned:
Exception: Incorrect Blob type, please use the correct Blob type to access a blob on the server. Expected BLOCK_BLOB, actual APPEND_BLOB.
The logs generated by the copy activity are of append blob type. read.df() can read block blobs without any issue.
Given the above scenario, how can I read these logs successfully into my R session in Databricks?
According to this Microsoft documentation, Azure Databricks and the Hadoop Azure WASB implementation do not support reading append blobs.
https://learn.microsoft.com/en-us/azure/databricks/kb/data-sources/wasb-check-blob-types
That is why reading a log file of append blob type fails with: Exception: Incorrect Blob type, please use the correct Blob type to access a blob on the server. Expected BLOCK_BLOB, actual APPEND_BLOB.
So you cannot read a log file of append blob type from a blob storage account. One solution is to use an Azure Data Lake Storage Gen2 container for logging. When the pipeline writes its logs to ADLS Gen2, the log file is created as a block blob, which you can then read from Databricks without any issue.
Using blob storage for logging:
Using ADLS gen2 for logging:
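Once the log is readable, the goal stated in the question (collecting the paths of the copied csv files) comes down to filtering the log rows. The following is a minimal sketch in Python rather than R, assuming the log is comma-delimited with the header row shown in the question. The function name `copied_file_paths` and the `SAMPLE_LOG` string are illustrative and not part of any Azure API.

```python
import csv
import io


def copied_file_paths(log_text):
    """Return OperationItem values for rows that report a successful file write."""
    reader = csv.DictReader(io.StringIO(log_text))
    return [
        row["OperationItem"]
        for row in reader
        if row["OperationName"] == "FileWrite" and row["Level"] == "Info"
    ]


# Sample built from the single log row shown in the question.
SAMPLE_LOG = (
    "Timestamp,Level,OperationName,OperationItem,Message\n"
    '2022-03-01 15:14:06.9880973,Info,FileWrite,'
    '"deviceA/component1/2022.zip/measurements_01.csv",'
    '"Complete writing file. File is successfully copied."\n'
)

print(copied_file_paths(SAMPLE_LOG))
```

The returned paths are relative to the sink container, so they would still need the container name prepended before processing.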

write/save Dataframe to azure file share from azure databricks

How can I write to an Azure file share from Azure Databricks Spark jobs?
I configured the Hadoop storage key and value:
spark.sparkContext.hadoopConfiguration.set(
  "fs.azure.account.key.STORAGEKEY.file.core.windows.net",
  "SECRETVALUE"
)
val wasbFileShare =
  s"wasbs://testfileshare@STORAGEKEY.file.core.windows.net/testPath"
df.coalesce(1).write.mode("overwrite").csv(wasbFileShare)
When I try to save the dataframe to the Azure file share, I see the following resource-not-found error, although the URI exists.
Exception in thread "main" org.apache.hadoop.fs.azure.AzureException: com.microsoft.azure.storage.StorageException: The requested URI does not represent any resource on the server.
Steps to connect to an Azure file share from Databricks:
First, install the Microsoft Azure Storage File Share client library for Python using pip install in Databricks. https://pypi.org/project/azure-storage-file-share/
After installing it, create a storage account. Then you can create a file share from Databricks:
from azure.storage.fileshare import ShareClient
share = ShareClient.from_connection_string(conn_str="<connection_string consists of FileEndpoint=myFileEndpoint(https://storageaccountname.file.core.windows.net/);SharedAccessSignature=sasToken>", share_name="<file share name that you want to create>")
share.create_share()
This code uploads a file to the file share from Databricks:
from azure.storage.fileshare import ShareFileClient
file_client = ShareFileClient.from_connection_string(conn_str="<connection_string consists of FileEndpoint=myFileEndpoint(https://storageaccountname.file.core.windows.net/);SharedAccessSignature=sasToken>", share_name="<your_fileshare_name>", file_path="my_file")
with open("./SampleSource.txt", "rb") as source_file:
    file_client.upload_file(source_file)
Refer to this link for further information: https://pypi.org/project/azure-storage-file-share/
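The connection string used in the snippets above combines the account's file endpoint with a SAS token. A small sketch of assembling it follows. The helper name `file_share_connection_string` is hypothetical, and the account name and the elided `sv=...&sig=...` token are placeholders to be replaced with real values.

```python
def file_share_connection_string(account_name, sas_token):
    """Build a SAS-based connection string for the account's file endpoint."""
    # A SAS copied from the portal often starts with "?", which is not part of the token.
    token = sas_token.lstrip("?")
    endpoint = f"https://{account_name}.file.core.windows.net/"
    return f"FileEndpoint={endpoint};SharedAccessSignature={token}"


# Placeholder values; substitute your own account name and SAS token.
conn_str = file_share_connection_string("storageaccountname", "?sv=...&sig=...")
print(conn_str)
```

The resulting string is what would be passed as `conn_str` to `ShareClient.from_connection_string` or `ShareFileClient.from_connection_string`.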

Azure AdlCopy Error: Invalid JSON Primitive:

I am trying to copy several csv files from an Azure Storage Blob to an Azure Data Lake Storage Gen1 using AdlCopy in Standalone Mode. The Data Lake Storage folder I am trying to move the data to is currently empty. Here is my CMD script that I am using:
C:\Users\username\Applications>AdlCopy /source https://myblobstorage.blob.core.windows.net/Folder/SubFolder/SubFolder/ /dest swebhdfs://mydatalakestorage.azuredatalakestore.net/Folder/ /sourcekey mysourcekey
When I run the script I am prompted for my credentials. After I enter them this is the error that I get:
Initializing Copy.
Invalid JSON primitive: .
Copy Failed.
Has anyone else had any experience with this error? I have yet to see any documentation on how to handle this error or what might be causing it. I appreciate any help or guidance!

Write DataFrame from Databricks to Data Lake

It happens that I am manipulating some data using Azure Databricks. Such data is in an Azure Data Lake Storage Gen1. I mounted the data into DBFS, but now, after transforming the data I would like to write it back into my data lake.
To mount the data I used the following:
configs = {"dfs.adls.oauth2.access.token.provider.type": "ClientCredential",
           "dfs.adls.oauth2.client.id": "<your-service-client-id>",
           "dfs.adls.oauth2.credential": "<your-service-credentials>",
           "dfs.adls.oauth2.refresh.url": "https://login.microsoftonline.com/<your-directory-id>/oauth2/token"}
dbutils.fs.mount(source = "adl://<your-data-lake-store-account-name>.azuredatalakestore.net/<your-directory-name>", mount_point = "/mnt/<mount-name>", extra_configs = configs)
I want to write back a .csv file. For this task I am using the following line
dfGPS.write.mode("overwrite").format("com.databricks.spark.csv").option("header", "true").csv("adl://<your-data-lake-store-account-name>.azuredatalakestore.net/<your-directory-name>")
However, I get the following error:
IllegalArgumentException: u'No value for dfs.adls.oauth2.access.token.provider found in conf file.'
Any piece of code that can help me? Or link that walks me through.
Thanks.
If you mount Azure Data Lake Store, you should use the mount point to store your data instead of "adl://...". For details on how to mount Azure Data Lake Store (ADLS) Gen1, see the Azure Databricks documentation. You can verify that the mount point works with:
dbutils.fs.ls("/mnt/<newmountpoint>")
So try after mounting ADLS Gen 1:
dfGPS.write.mode("overwrite").format("com.databricks.spark.csv").option("header", "true").csv("/mnt/<mount-name>/<your-directory-name>")
This should work if you added the mount point properly and the Service Principal also has access rights on the ADLS.
Spark always writes multiple files into a directory, because each partition is saved individually. See also the following Stack Overflow question.
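The advice to write through the mount point rather than the "adl://" URI amounts to a simple path translation, sketched below. The helper `to_mount_path` and the account and directory names in the example are hypothetical; the real values are whatever was passed as `source` and `mount_point` to `dbutils.fs.mount`.

```python
def to_mount_path(adl_uri, mount_source, mount_point):
    """Translate an adl:// URI into the matching path under a DBFS mount point."""
    source = mount_source.rstrip("/")
    if adl_uri != source and not adl_uri.startswith(source + "/"):
        raise ValueError(f"{adl_uri} is not under the mounted source {mount_source}")
    # Keep only the part of the URI that lies below the mounted directory.
    relative = adl_uri[len(source):].lstrip("/")
    base = "/" + mount_point.strip("/")
    return f"{base}/{relative}" if relative else base


# Hypothetical account, directory and mount names.
path = to_mount_path(
    "adl://myaccount.azuredatalakestore.net/raw/gps/out",
    "adl://myaccount.azuredatalakestore.net/raw",
    "/mnt/raw",
)
print(path)
```

The translated path is the one to hand to `.csv(...)` when writing the DataFrame.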

Error reading csv file in Azure HDInsight Blob Store.

The Jupyter notebook gives the following error:
u'Path does not exist: wasb://spk123-2018@dreamcatcher.blob.core.windows.net/data/raw-flight-data.csv;
The csv file does exist in the blob store of Azure HDInsight.
What command are you running in Jupyter?
Check for two possible problems:
1. The file doesn't actually exist. You can check that by going to the Azure portal -> storage account and checking whether this container has the file.
2. You didn't connect the storage account to the cluster when you created the cluster.
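Both checks depend on knowing which container and storage account the path refers to. A WASB path has the form `wasb://<container>@<account>.blob.core.windows.net/<path>`, and the sketch below splits one into those parts with the standard library. `parse_wasb_uri` is a hypothetical helper, not part of HDInsight or Spark.

```python
from urllib.parse import urlsplit


def parse_wasb_uri(uri):
    """Split a wasb:// or wasbs:// URI into container, storage account and blob path."""
    parts = urlsplit(uri)
    if parts.scheme not in ("wasb", "wasbs"):
        raise ValueError(f"not a WASB URI: {uri}")
    if not parts.username or not parts.hostname:
        raise ValueError(f"expected <container>@<account> in: {uri}")
    return {
        "container": parts.username,
        # The account name is the first label of the host name.
        "account": parts.hostname.split(".")[0],
        "blob": parts.path.lstrip("/"),
    }


info = parse_wasb_uri(
    "wasb://spk123-2018@dreamcatcher.blob.core.windows.net/data/raw-flight-data.csv"
)
print(info)
```

The `container` and `blob` values are what to look for in the portal, and `account` is the storage account that has to be attached to the cluster.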
