Jupyter notebook gives the following error:
u'Path does not exist: wasb://spk123-2018@dreamcatcher.blob.core.windows.net/data/raw-flight-data.csv;
The CSV file does exist in the blob storage attached to the Azure HDInsight cluster.
What command are you running in Jupyter?
Check for two possible problems:
1. The file actually doesn't exist. You can check this by going to the Azure portal -> storage account and verifying that the container contains the file.
2. You didn't connect the storage account to the cluster when you created the cluster.
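If both checks pass, the read itself should work with the standard WASB URI format (container@account). A minimal PySpark sketch, assuming "spk123-2018" is the container and "dreamcatcher" is the storage account, as they appear in the error above:

# The WASB format is wasb://<container>@<account>.blob.core.windows.net/<path>
csv_path = "wasb://spk123-2018@dreamcatcher.blob.core.windows.net/data/raw-flight-data.csv"

# Read the CSV and preview a few rows to confirm the path resolves
df = spark.read.csv(csv_path, header=True, inferSchema=True)
df.show(5)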
My batch processing pipeline in Azure has the following scenario: I am using the copy activity in Azure Data Factory to unzip thousands of zip files, stored in a blob storage container. These zip files are stored in a nested folder structure inside the container, e.g.
zipContainer/deviceA/component1/20220301.zip
The resulting unzipped files will be stored in another container, preserving the hierarchy in the sink's copy behavior option, e.g.
unzipContainer/deviceA/component1/20220301.zip/measurements_01.csv
I enabled logging on the copy activity and provided the folder path to store the generated logs (in txt format), which have the following structure:
Timestamp,Level,OperationName,OperationItem,Message
2022-03-01 15:14:06.9880973,Info,FileWrite,"deviceA/component1/2022.zip/measurements_01.csv","Complete writing file. File is successfully copied."
I want to read the content of these logs in an R notebook in Azure Databricks, in order to get the complete paths of these csv files for processing. The command I used, read.df, is part of the SparkR library:
Logs <- read.df(log_path, source = "csv", header="true", delimiter=",")
The following exception is returned:
Exception: Incorrect Blob type, please use the correct Blob type to access a blob on the server. Expected BLOCK_BLOB, actual APPEND_BLOB.
The logs generated by the copy activity are of the append blob type. read.df() can read block blobs without any issue.
Given the above scenario, how can I read these logs successfully into my R session in Databricks?
According to this Microsoft documentation, Azure Databricks and Hadoop Azure WASB implementations do not support reading append blobs.
https://learn.microsoft.com/en-us/azure/databricks/kb/data-sources/wasb-check-blob-types
When you try to read a log file of the append blob type, it gives the error: Exception: Incorrect Blob type, please use the correct Blob type to access a blob on the server. Expected BLOCK_BLOB, actual APPEND_BLOB.
So you cannot read log files of the append blob type from a blob storage account. A solution is to use an Azure Data Lake Storage Gen2 container for logging. When you run the pipeline with ADLS Gen2 as the log destination, it creates log files of the block blob type, which you can then read from Databricks without any issue.
Using blob storage for logging:
Using ADLS gen2 for logging:
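Once the logs land in ADLS Gen2 as block blobs, you can read them and pull out the copied file paths. A hedged PySpark sketch (a Python cell alongside the R one; the abfss path below is a placeholder, and the column names are taken from the log sample above):

from pyspark.sql import functions as F

# Placeholder path to the copy-activity logs in the ADLS Gen2 container
log_path = "abfss://logs@<storageaccount>.dfs.core.windows.net/copylogs/"

logs = spark.read.csv(log_path, header=True)

# Keep successful FileWrite entries and extract the written file paths
csv_paths = (logs
    .filter(F.col("OperationName") == "FileWrite")
    .filter(F.col("Message").contains("successfully copied"))
    .select("OperationItem"))

csv_paths.show(truncate=False)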
I uploaded my CSV file to Azure Data Lake Storage Gen2 using the Azure Synapse portal. Then I tried "Select TOP 100 rows" and got an error after running the auto-generated SQL.
Auto-generated SQL:
SELECT
    TOP 100 *
FROM
    OPENROWSET(
        BULK 'https://accountname.dfs.core.windows.net/filesystemname/test_file/contract.csv',
        FORMAT = 'CSV',
        PARSER_VERSION = '2.0'
    ) AS [result]
Error:
File 'https://accountname.dfs.core.windows.net/filesystemname/test_file/contract.csv'
cannot be opened because it does not exist or it is used by another process.
This error in Synapse Studio has a link underneath it (leading to a self-help document) that explains the error itself.
Do you have the rights needed on the storage account?
You must have the Storage Blob Data Contributor or Storage Blob Data Reader role for this query to work.
Summary from the docs:
You need to have a Storage Blob Data Owner/Contributor/Reader role to use your identity to access the data. Even if you are an Owner of a Storage Account, you still need to add yourself into one of the Storage Blob Data roles.
Check out the full documentation: Control storage account access for serverless SQL pool.
If your storage account is protected with firewall rules, then take a look at this Stack Overflow answer.
Reference: the full docs article.
I just took your code, updated the path to one of my own files, and it worked just fine:
SELECT
    TOP 100 *
FROM
    OPENROWSET(
        BULK 'https://XXX.dfs.core.windows.net/himanshu/NYCTaxi/PassengerCountStats.csv',
        FORMAT = 'CSV',
        PARSER_VERSION = '2.0'
    ) AS [result]
Please check whether the path to which you uploaded the file and the one used in the script are the same.
You can check this as follows:
Navigate to the workspace -> Data -> ADLS Gen2 -> go to the file -> right-click, open Properties, copy the URI from there, and paste it into the script.
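If you prefer to verify the path programmatically rather than through the portal, here is a hedged sketch using the azure-storage-file-datalake package; the account, filesystem, and folder names are copied from the question and are placeholders:

from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

# Connect with your Azure AD identity (it needs a Storage Blob Data role, as above)
service = DataLakeServiceClient(
    account_url="https://accountname.dfs.core.windows.net",
    credential=DefaultAzureCredential())

fs_client = service.get_file_system_client("filesystemname")

# List everything under the folder referenced in the BULK path
for p in fs_client.get_paths(path="test_file"):
    print(p.name)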
I am using the following code in Databricks to upload files to DBFS. The files show up when I do dbutils.fs.ls(path). However, when I try to read them, I get a file-not-found error (see further down). Also, the file sizes are showing as zero.
def WriteFileToDbfs(file_path, test_folder_file_path, target_test_file_name):
    df = spark.read.format("delta").load(file_path)
    df2 = df.limit(1000)
    df2.write.mode("overwrite").parquet(test_folder_file_path + target_test_file_name)
Here is the error:
AnalysisException: Path does not exist: dbfs:/tmp/qa_test/test-file.parquet;
Here are the files listed but with zero sizes:
In Azure Databricks, this is expected behavior:
For files, it displays the actual file size.
For directories, it displays size=0.
For corrupted files, it displays size=0.
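A quick way to see which of the listed entries are directories versus files is to inspect the FileInfo objects returned by dbutils.fs.ls; the folder path below is taken from the question:

# Directories (such as a parquet output folder) report size 0
for f in dbutils.fs.ls("dbfs:/tmp/qa_test/"):
    kind = "dir" if f.isDir() else "file"
    print(kind, f.name, f.size)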
You can get more details using the Databricks CLI.
You can also use DBFS Explorer:
DBFS Explorer was created as a quick way to upload and download files to the Databricks filesystem (DBFS). This will work with both AWS and Azure instances of Databricks. You will need to create a bearer token in the web interface in order to connect.
How do I write to an Azure file share from Azure Databricks Spark jobs?
I configured the Hadoop storage key and value:
spark.sparkContext.hadoopConfiguration.set(
  "fs.azure.account.key.STORAGEKEY.file.core.windows.net",
  "SECRETVALUE"
)
val wasbFileShare =
  s"wasbs://testfileshare@STORAGEKEY.file.core.windows.net/testPath"

df.coalesce(1).write.mode("overwrite").csv(wasbFileShare)
When I try to save the dataframe to the Azure file share, I see the following resource-not-found error, although the URI is present.
Exception in thread "main" org.apache.hadoop.fs.azure.AzureException: com.microsoft.azure.storage.StorageException: The requested URI does not represent any resource on the server.
Steps to connect to an Azure file share from Databricks:
First, install the Microsoft Azure Storage File Share client library for Python in Databricks using pip install: https://pypi.org/project/azure-storage-file-share/
After installing it, create a storage account. Then you can create a file share from Databricks:
from azure.storage.fileshare import ShareClient
share = ShareClient.from_connection_string(conn_str="<connection_string consists of FileEndpoint=myFileEndpoint(https://storageaccountname.file.core.windows.net/);SharedAccessSignature=sasToken>", share_name="<file share name that you want to create>")
share.create_share()
This code uploads a file into the file share from Databricks:
from azure.storage.fileshare import ShareFileClient
file_client = ShareFileClient.from_connection_string(conn_str="<connection_string consists of FileEndpoint=myFileEndpoint(https://storageaccountname.file.core.windows.net/);SharedAccessSignature=sasToken>", share_name="<your_fileshare_name>", file_path="my_file")
with open("./SampleSource.txt", "rb") as source_file:
file_client.upload_file(source_file)
Refer to this link for further information: https://pypi.org/project/azure-storage-file-share/
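To get from the Spark job in the question to the file share, one hedged approach is to write the DataFrame to a temporary DBFS path first and then push the resulting CSV with ShareFileClient; the connection string, share name, and paths below are placeholders:

import glob
from azure.storage.fileshare import ShareFileClient

# Write the DataFrame as a single CSV to a temporary DBFS location
tmp_path = "dbfs:/tmp/fileshare_upload"
df.coalesce(1).write.mode("overwrite").csv(tmp_path, header=True)

# DBFS is exposed on the driver under /dbfs, so pick up the part file locally
local_part = glob.glob("/dbfs/tmp/fileshare_upload/part-*.csv")[0]

file_client = ShareFileClient.from_connection_string(
    conn_str="<connection_string>",
    share_name="<your_fileshare_name>",
    file_path="testPath/output.csv")

with open(local_part, "rb") as source_file:
    file_client.upload_file(source_file)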
I have mounted my blob storage to DBFS under the name /mnt/deltalake, and the blob storage container name is deltalake.
Mounting to DBFS is done using an Azure Key Vault-backed secret scope.
When I try to create a database using CREATE DATABASE abc with location '/mnt/deltalake/databases/abc', it errors out saying the path does not exist.
However, when I used the DBFS path as storage, with CREATE DATABASE abc with location '/user/hive/warehouse/databases/abc', it was always successful.
I wonder what is going wrong. Suggestions, please.
Using a mount point, you should be able to access existing files or write new files through Databricks.
However, I believe the SQL commands, such as CREATE DATABASE, only work on the underlying Hive metastore.
You may need to create a database for your blob storage outside of Databricks, and then connect to that database to read and write from it using Databricks.
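To illustrate the first point, a quick hedged sanity check that the mount from the question is visible and writable from Databricks before pointing a database location at it:

# List the mount to confirm it resolves to the blob container
display(dbutils.fs.ls("/mnt/deltalake"))

# Write and read back a small test file under the intended database path
dbutils.fs.put("/mnt/deltalake/databases/abc/_mount_test.txt", "ok", True)
print(dbutils.fs.head("/mnt/deltalake/databases/abc/_mount_test.txt"))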