Azure Data Lake Gen2 Storage Account blob vs adf choice

Azure Data Lake Gen2 Storage Account blob vs adf choice - azure

I am new to Azure Data Lake Storage Gen2 service. I have a Storage Account with "Hierarchical namespace" option Enabled.
I am using AzCopy to move some files and folders. From the command line I can - within the address string - use either the option "blob" or the "adf" string tokens:
'https://myaccount.blob.core.windows.net/mycontainer/myfolder'
or
'https://myaccount.adf.core.windows.net/mycontainer/myfolder'
again within the .\azcopy.exe copy command.
"Apparently" both ways succeed giving the same result. My question is: is there any difference if I use blob or adf in the address string? If yes, what is it?
Also, whatever string token I choose, in the Azure portal a file address is always given with the blob string token..
thanks

In the storage account Endpoint page, you can see all the available endpoints for you to use for their services.
Both blob and dfs work for you because both of them are supported in Azure Data Lake Storage Gen2 . However, in Gen1, you may only have the blob service but not the dfs service available (like below). In that case, you won't be able to use the dfs endpoint.

blob and dfs represent the resource type in the endpoint URL

Related

Connecting power bi to Azure data lake gen 2

Hey everyone I am trying to connect Powerbi to my data lake gen 2 on azure I am set as
Storage Blob Data Contributor Aswell as Storage Blob Data Reader on the Storage account level i am not sure if I am doing something wrong but I also used the MS docs yet nothing
https://learn.microsoft.com/en-us/power-query/connectors/datalakestorage
The URL is formatted according to this format https://.dfs.core.windows.net//

To resolve Access to the resource is forbidden error, try following ways:
As suggested by Etienne Oosthuysen, check the date-time settings of your system.
As per documentation:
Only this format is supported https://<accountname>.dfs.core.windows.net/<container>
It doesn't support filename or subfolder like this,https://<accountname>.dfs.core.windows.net/<container>/<filename> or https://<accountname>.dfs.core.windows.net/<container>/<subfolder>
You can refer to Get Data from Azure Data Lake Gen 2 : Access to the resource is forbidden

After disabling both Enable soft delete for blobs and containers in azure storage account connection issue was resolved.
To Change Settings
Go To Azure storage account ---> Data Protection--->Disable soft delete for blobs/Enable soft delete for containers

Azure Synapse severless SQL pool - query execution fails

After completing tutorial 1, I am working on this tutorial 2 from Microsoft Azure team to run the following query (shown in step 3). But the query execution gives the error shown below:
Question: What may be the cause of the error, and how can we resolve it?
Query:
SELECT
TOP 100 *
FROM
OPENROWSET(
BULK 'https://contosolake.dfs.core.windows.net/users/NYCTripSmall.parquet',
FORMAT='PARQUET'
) AS [result]
Error:
Warning: No datasets were found that match the expression 'https://contosolake.dfs.core.windows.net/users/NYCTripSmall.parquet'. Schema cannot be determined since no files were found matching the name pattern(s) 'https://contosolake.dfs.core.windows.net/users/NYCTripSmall.parquet'. Please use WITH clause in the OPENROWSET function to define the schema.
NOTE: The path of the file in the container is correct, and actually I generated the following query just by right clicking the file inside container and generated the script as shown below:
Remarks:
Azure Data Lake Storage Gen2 account name: contosolake
Container name: users
Firewall settings used on the Azure Data lake account:
Azure Data Lake Storage Gen2 account is allowing public access (ref):
Container has required access level (ref)
UPDATE:
The owner of the subscription is someone else, and I did not get the option Check the "Assign myself the Storage Blob Data Contributor role on the Data Lake Storage Gen2 account" box described in item 3 of Basics tab > Workspace details section of tutorial 1. I also do not have permissions to add roles - although I'm the owner of synapse workspace. So I am using workaround described in the Configure anonymous public read access for containers and blobs from Azure team.

--Workaround
If you are unable to granting Storage Blob Data Contributor, use ACL to grant permissions.
All users that need access to some data in this container also needs
to have the EXECUTE permission on all parent folders up to the root
(the container). Learn more about how to set ACLs in Azure Data Lake
Storage Gen2.
Note:
Execute permission on the container level needs to be set within the
Azure Data Lake Gen2. Permissions on the folder can be set within
Azure Synapse.
Go to the container holding NYCTripSmall.parquet.
--Update
As per your update in comments, it seems you would have to do as below.
Contact the Owner of the storage account, and ask them to perform the following tasks:
Assign the workspace MSI to the Storage Blob Data Contributor role on
the storage account
Assign you to the Storage Blob Data Contributor role on the storage
account
--
I was able to get the query results following the tutorial doc you have mentioned for the same dataset.
Since you confirm that the file is present and in the right path, refresh linked ADLS source and publish query before running, just in case if a transient issue.
Two things I suspect are
Try setting Microsoft network routing in Network Routing settings in ADLS account.
Check if built-in pool is online and you have atleast contributer roles on both Synapse workspace and Storage account. (If the current credentials using to run the query has not created the resources)

Accessing Azure ADLS gen2 with Pyspark on Databricks

I'm trying to learn Spark, Databricks & Azure.
I'm trying to access GEN2 from Databricks using Pyspark.
I can't find a proper way, I believe it's super simple but I failed.
Currently each time I receive the following:
Unable to access container {name} in account {name} using anonymous
credentials, and no credentials found for them in the configuration.
I have already running GEN2 + I have a SAS_URI to access.
What I was trying so far:
(based on this link: https://learn.microsoft.com/pl-pl/azure/databricks/data/data-sources/azure/adls-gen2/azure-datalake-gen2-sas-access):
spark.conf.set(f"fs.azure.account.auth.type.{STORAGE_ACCOUNT_NAME}.dfs.core.windows.net", {SAS_URI})
spark.conf.set(f"fs.azure.sas.token.provider.type.{STORAGE_ACCOUNT_NAME}.dfs.core.windows.net", {SAS_URI})
Then to reach out to data:
sd_xxx = spark.read.parquet(f"wasbs://{CONTAINER_NAME}#{STORAGE_ACCOUNT_NAME}.dfs.core.windows.net/{proper_path_to_files/}")

Your configuration is incorrect. The first parameter should be set to just SAS value, while second - to name of Scala/Java class that will return the SAS token - you can't use just URI with SAS information in it, you need to implement some custom code.
If you want to use wasbs that the protocol for accessing Azure Blog Storage, and although it could be used for accessing ADLS Gen2 (not recommended although), but you need to use blob.core.windows.net instead of dfs.core.windows.net, and also set correct spark property for Azure Blob access.

The more common procedure to follow is here: Access Azure Data Lake Storage Gen2 using OAuth 2.0 with an Azure service principal

Running query using serverless sql pool (built-in) on CSV file in Azure Data Lake Storage Gen2 failed

I uploaded my CSV file into my Azure Data Lake Storage Gen2 using Azure Synapse portal. Then I tried select Top 100 rows and got an error after running auto-generated SQL.
Auto-generated SQL:
SELECT
    TOP 100 *
FROM
    OPENROWSET(
        BULK 'https://accountname.dfs.core.windows.net/filesystemname/test_file/contract.csv',
        FORMAT = 'CSV',
        PARSER_VERSION='2.0'
) AS [result]
Error:
File 'https://accountname.dfs.core.windows.net/filesystemname/test_file/contract.csv'
cannot be opened because it does not exist or it is used by another process.

This error in Synapse Studio has link (which leads to self-help document) underneath it which explains the error itself.
Do you have rights needed on the storage account?
You must have Storage Blob Data Contributor or Storage Blob Data Reader in order for this query to work.
Summary from the docs:
You need to have a Storage Blob Data Owner/Contributor/Reader role to
use your identity to access the data. Even if you are an Owner of a
Storage Account, you still need to add yourself into one of the
Storage Blob Data roles.
Check out the full documentation for Control Storage account access for serverless SQL pool
If your storage account is protected with firewall rules then take a look at this stack overflow answer.
Reference full docs article.

I just took your code & updated the path to what I have and it worked just worked fine
SELECT
    TOP 100 *
FROM
    OPENROWSET(
        BULK 'https://XXX.dfs.core.windows.net/himanshu/NYCTaxi/PassengerCountStats.csv',
        FORMAT = 'CSV',
        PARSER_VERSION='2.0'
) AS [result]
Please check if the path to which you have uploaded the file and the one used in the script is the same .
You can do this to check that
Navigate to WS -> Data -> ADLS gen2 -> Go to the file -> right click go to the property and copy the Uri from there paste in the script .

Azcopy throws error while executing via Terraform

I am using the Azcopy tool to copy a storage account to another. While executing the command using terminal it executes perfectly. But while executing the same using Terraform's local-executioner it throws an error. Please find the code and error below.
Code:
resource "null_resource" "backup" {
provisioner "local-exec" {
command= <<EOF
azcopy cp "https://${var.src_storage_acc_name}.blob.core.windows.net${var.src_sas}" "https://${var.dest_storage_acc_name}.blob.core.windows.net${var.dest_sas}"
EOF
}
}
Error:
Error running command ' azcopy cp "https://strsrc.blob.core.windows.net?[SAS]" "https://strdest.blob.core.windows.net?[SAS]"
': exit status 1. Output: INFO: The parameters you supplied were Source: '"https://strsrc.blob.core.windows.net?[SAS]-REDACTED- of type Local, and Destination: '"https://strdest.blob.core.windows.net?[SAS]-REDACTED- of type Local
INFO: Based on the parameters supplied, a valid source-destination combination could not automatically be found. Please check the parameters you supplied. If they are correct, please specify an exact source and destination type using the --from-to switch. Valid values are two-word phases of the form BlobLocal, LocalBlob etc. Use the word 'Blob' for Blob Storage, 'Local' for the local file system, 'File' for Azure Files, and 'BlobFS' for ADLS Gen2. If you need a combination that is not supported yet, please log an issue on the AzCopy GitHub issues list.
failed to parse user input due to error: the inferred source/destination combination could not be identified, or is currently not supported
Please provide your thoughts on this.

Today I needed to implement a similar task, and I used the azcopy cp command with --recursive=true option which is given in the document.
It successfully copied all contents of the source container to the destination.
Copy all blob containers, directories, and blobs from storage account to another by using a SAS token:
- azcopy cp "https://[srcaccount].blob.core.windows.net?[SAS]" "https://[destaccount].blob.core.windows.net?[SAS]" --recursive=true

azcopy only support certain combinations of source and destination types (blob, Gen1, Gen2, S3, Local file system, ...) for copy sub-command.
azcopy tries to guess source/destination types based on URL & params.
This error means that
you're trying to use a combination that isn't supported OR
Nothing you can do. Raise a issue as suggested. They'll probably just ignore it like this or this.
there is something wrong with your URL. E.g. you have blob.core.windows.net when you should've had dfs.core.windows.net or vice versa. This in turn causes mis-identification of source and destination types.
If you're sure that the combination is supported then you can tell azcopy the types using --from-to. Ironically, when you use a combination that isn't supported (e.g. BlobFSBlobFS), it gives the same error message instead of saying "source destination combination not supported).
When dealing with Gen2, you could use blob instead of dfs in the URL to make it use older (blob/Gen1) APIs to interact with your Gen2 account. Though less performant, it still might work.
'Blob' for Blob Storage
'Local' for the local file system
'File' for Azure Files
'BlobFS' for ADLS Gen2
As of now following combinations are supported per documentation:
local <-> Azure Blob (SAS or OAuth authentication)
local <-> Azure Files (Share/directory SAS authentication)
local <-> Azure Data Lake Storage Gen 2 (SAS, OAuth, or shared key authentication)
Azure Blob (SAS or public) -> Azure Blob (SAS or OAuth authentication)
Azure Blob (SAS or public) -> Azure Files (SAS)
Azure Files (SAS) -> Azure Files (SAS)
Azure Files (SAS) -> Azure Blob (SAS or OAuth authentication)
Amazon Web Services (AWS) S3 (Access Key) -> Azure Block Blob (SAS or OAuth authentication)
Google Cloud Storage (Service Account Key) -> Azure Block Blob (SAS or OAuth authentication) [Preview]

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string