A file is added by a Logic App and then processed by Data Factory V2
I have a Data Factory that accesses Data Lake Gen1 to process the file. I receive the following error when I try to debug the Data Factory after the file is added.
"ErrorCode=FileForbidden,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Failed to read a 'AzureDataLakeStore' file. File path: 'Stem/Benchmark/DB_0_Measures_1_05052020 - Copy - Copy - rounded, date changed - Copy (3).csv'.,Source=Microsoft.DataTransfer.ClientLibrary,''Type=System.Net.WebException,Message=The remote server returned an error: (403) Forbidden.,Source=System,'",
When I "Apply to children" after next load permission, error is gone.
Tried so far:
- Assigned permissions in the Data Lake for the Data Factory and its children.
- Assigned permissions on the Data Lake folder for the Data Factory and its children.
- Added the Data Factory as a Contributor to the Data Lake.
- Added the Data Factory as an Owner to the Data Lake.
- Allowed "all Azure services to access this Data Lake Storage Gen1 account".
After all of these attempts, I still need to manually "apply permission to children" for each file added.
Is there any way to fix this?
I can reproduce your error:
This is how I resolved it:
My account is the Owner of the Data Lake Gen1, and the Data Factory is a Contributor on it.
You need to grant Read + Execute access on the parent folders and then do what @Bowman Zhu mentioned above.
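If you would rather script this than click through the portal, here is a rough sketch (assuming Python and the azure-datalake-store package; the tenant, service principal, store name, folder path, and object ID are placeholders) of adding both an access ACL and a default ACL for the Data Factory identity, so files added later inherit the permission:

# pip install azure-datalake-store
# Sketch only: every ID, secret, and name below is a placeholder.
from azure.datalake.store import core, lib

# Sign in as a principal that is allowed to manage ACLs on the store.
token = lib.auth(
    tenant_id="<tenant-id>",
    client_id="<service-principal-client-id>",
    client_secret="<service-principal-secret>",
)
adl = core.AzureDLFileSystem(token, store_name="<adls-gen1-account-name>")

adf_object_id = "<object-id-of-the-data-factory-managed-identity>"
acl_spec = (
    f"user:{adf_object_id}:r-x,"           # access ACL on the folder itself
    f"default:user:{adf_object_id}:r-x"    # default ACL inherited by new children
)
adl.modify_acl_entries("Stem/Benchmark", acl_spec)  # folder path from the question

Note that a default ACL only affects items created after it is set; files that already exist may still need a one-time "apply to children" (or a recursive ACL run).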
I'm using Data Flow in Data Factory and I need to join a table from Synapse with my flow of data.
When I added the new source in Azure Data Flow I had to add a Staging linked service (as the label said: "For SQL DW, please specify a staging location for PolyBase.")
So I specified a path in Azure Data Lake Gen2 in which PolyBase can create its temporary directory.
Nevertheless I'm getting this error:
{"StatusCode":"DFExecutorUserError","Message":"Job failed due to reason: at Source 'keyMapCliente': shaded.msdataflow.com.microsoft.sqlserver.jdbc.SQLServerException: CREATE EXTERNAL TABLE AS SELECT statement failed as the path name 'abfss://MyContainerName#mystorgaename.dfs.core.windows.net/Raw/Tmp/e3e71c102e0a46cea0b286f17cc5b945/' could not be used for export. Please ensure that the specified path is a directory which exists or can be created, and that files can be created in that directory.","Details":"shaded.msdataflow.com.microsoft.sqlserver.jdbc.SQLServerException: CREATE EXTERNAL TABLE AS SELECT statement failed as the path name 'abfss://MyContainerName#mystorgaename.dfs.core.windows.net/Raw/Tmp/e3e71c102e0a46cea0b286f17cc5b945/' could not be used for export. Please ensure that the specified path is a directory which exists or can be created, and that files can be created in that directory.\n\tat shaded.msdataflow.com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDatabaseError(SQLServerException.java:262)\n\tat shaded.msdataflow.com.microsoft.sqlserver.jdbc.SQLServerStatement.getNextResult(SQLServerStatement.java:1632)\n\tat shaded.msdataflow.com.microsoft.sqlserver.jdbc.SQLServerStatement.doExecuteStatement(SQLServerStatement.java:872)\n\tat shaded.msdataflow.com.microsoft.sqlserver.jdbc.SQLServerStatement$StmtExecCmd.doExecute(SQLServerStatement.java:767)\n\tat shaded.msdataflow.com.microsoft.sqlserver.jdbc.TDSCommand.execute(IOBuffer.java:7418)\n\tat shaded.msdataflow.com.microsoft.sqlserver.jd"}
The following are the Azure Data Flow Settings:
This is the added source inside the data flow:
Any help is appreciated.
I have reproduced this and was able to enable the staging location as an Azure Data Lake Gen2 storage account for PolyBase and connect to the Synapse table data successfully.
Create your database scoped credential with the Azure storage account key as the secret.
Create an external data source and an external table with the scoped credential you created (a sketch of these two steps follows below).
In Azure Data Factory:
Enable staging and connect to the Azure Data Lake Gen2 storage account with the Account key authentication type.
In the data flow, connect your source to the Synapse table and enable the staging property in the source options.
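For reference, a minimal sketch of the two T-SQL objects from the first two steps, issued from Python with pyodbc (the server, database, credential names, storage account, and key are all placeholders, and the external table/file format part is omitted):

# pip install pyodbc
import pyodbc

# Connect to the dedicated SQL pool (placeholders throughout).
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=<workspace>.sql.azuresynapse.net;"
    "DATABASE=<dedicated-pool>;"
    "UID=<sql-admin>;PWD=<password>",
    autocommit=True,
)

statements = [
    # A database master key must already exist; if it does not, run
    # "CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong-password>'" first.

    # Step 1: database scoped credential holding the storage account key.
    """CREATE DATABASE SCOPED CREDENTIAL StagingCred
       WITH IDENTITY = '<storage-account-name>',
            SECRET   = '<storage-account-key>'""",

    # Step 2: external data source pointing at the ADLS Gen2 staging container.
    """CREATE EXTERNAL DATA SOURCE StagingSource
       WITH (
           TYPE = HADOOP,
           LOCATION = 'abfss://<container>@<storage-account-name>.dfs.core.windows.net',
           CREDENTIAL = StagingCred
       )""",
]

cursor = conn.cursor()
for sql in statements:
    cursor.execute(sql)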
After completing tutorial 1, I am working on tutorial 2 from the Microsoft Azure team to run the following query (shown in step 3). But the query execution gives the error shown below:
Question: What may be the cause of the error, and how can we resolve it?
Query:
SELECT
    TOP 100 *
FROM
    OPENROWSET(
        BULK 'https://contosolake.dfs.core.windows.net/users/NYCTripSmall.parquet',
        FORMAT = 'PARQUET'
    ) AS [result]
Error:
Warning: No datasets were found that match the expression 'https://contosolake.dfs.core.windows.net/users/NYCTripSmall.parquet'. Schema cannot be determined since no files were found matching the name pattern(s) 'https://contosolake.dfs.core.windows.net/users/NYCTripSmall.parquet'. Please use WITH clause in the OPENROWSET function to define the schema.
NOTE: The path of the file in the container is correct. In fact, I generated the query above simply by right-clicking the file inside the container and generating the script, as shown below:
Remarks:
Azure Data Lake Storage Gen2 account name: contosolake
Container name: users
Firewall settings used on the Azure Data Lake account:
The Azure Data Lake Storage Gen2 account allows public access (ref):
The container has the required access level (ref).
UPDATE:
The owner of the subscription is someone else, and I did not get the option to check the "Assign myself the Storage Blob Data Contributor role on the Data Lake Storage Gen2 account" box described in item 3 of the Basics tab > Workspace details section of tutorial 1. I also do not have permission to add roles, although I am the owner of the Synapse workspace. So I am using the workaround described in Configure anonymous public read access for containers and blobs from the Azure team.
--Workaround
If you are unable to grant Storage Blob Data Contributor, use ACLs to grant permissions.
All users that need access to some data in this container also need to have the EXECUTE permission on all parent folders up to the root (the container). Learn more about how to set ACLs in Azure Data Lake Storage Gen2.
Note:
Execute permission at the container level needs to be set within Azure Data Lake Gen2. Permissions on the folder can be set within Azure Synapse.
Go to the container holding NYCTripSmall.parquet.
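If you prefer to script the ACL change rather than click through the portal, here is a rough sketch (assuming Python with the azure-identity and azure-storage-file-datalake packages; the account and container names come from the question, the object ID is a placeholder, and the identity running it must itself be allowed to change ACLs):

# pip install azure-identity azure-storage-file-datalake
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://contosolake.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)
container = service.get_file_system_client("users")

# Object ID of the user (or workspace identity) that runs the serverless query.
principal_object_id = "<aad-object-id>"

# Grant read + execute from the container root downwards: execute on folders
# covers the "EXECUTE on all parent folders up to the root" requirement, and
# read lets the query open NYCTripSmall.parquet.
container.get_directory_client("/").update_access_control_recursive(
    acl=f"user:{principal_object_id}:r-x"
)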
--Update
As per your update in the comments, it seems you would have to do the following.
Contact the Owner of the storage account, and ask them to perform the following tasks:
Assign the workspace MSI to the Storage Blob Data Contributor role on the storage account.
Assign you to the Storage Blob Data Contributor role on the storage account.
--
I was able to get the query results following the tutorial doc you mentioned, for the same dataset.
Since you confirm that the file is present and in the right path, refresh the linked ADLS source and publish the query before running, in case it is a transient issue.
Two things I suspect:
Try setting Microsoft network routing in the Network routing settings of the ADLS account.
Check whether the built-in pool is online and whether you have at least Contributor roles on both the Synapse workspace and the storage account (if the credentials currently used to run the query did not create the resources).
I uploaded my CSV file into my Azure Data Lake Storage Gen2 using the Azure Synapse portal. Then I tried "Select TOP 100 rows" and got an error after running the auto-generated SQL.
Auto-generated SQL:
SELECT
    TOP 100 *
FROM
    OPENROWSET(
        BULK 'https://accountname.dfs.core.windows.net/filesystemname/test_file/contract.csv',
        FORMAT = 'CSV',
        PARSER_VERSION = '2.0'
    ) AS [result]
Error:
File 'https://accountname.dfs.core.windows.net/filesystemname/test_file/contract.csv'
cannot be opened because it does not exist or it is used by another process.
This error in Synapse Studio has a link underneath it (which leads to a self-help document) that explains the error itself.
Do you have the rights needed on the storage account?
You must have Storage Blob Data Contributor or Storage Blob Data Reader in order for this query to work.
Summary from the docs:
You need to have a Storage Blob Data Owner/Contributor/Reader role to
use your identity to access the data. Even if you are an Owner of a
Storage Account, you still need to add yourself into one of the
Storage Blob Data roles.
Check out the full documentation for Control Storage account access for serverless SQL pool.
If your storage account is protected with firewall rules, then take a look at this Stack Overflow answer.
Reference: the full docs article.
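If you want to confirm quickly whether the identity you sign in with can read the file at all, here is a small sketch (assuming Python with azure-identity and azure-storage-file-datalake; the account, container, and file names are taken from the query above):

# pip install azure-identity azure-storage-file-datalake
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://accountname.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)
file_client = service.get_file_system_client("filesystemname").get_file_client(
    "test_file/contract.csv"
)

# This succeeds only if the identity holds a Storage Blob Data role (or has
# matching ACLs); a 403 here points at the missing role described above.
print(file_client.get_file_properties())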
I just took your code, updated the path to what I have, and it worked just fine.
SELECT
    TOP 100 *
FROM
    OPENROWSET(
        BULK 'https://XXX.dfs.core.windows.net/himanshu/NYCTaxi/PassengerCountStats.csv',
        FORMAT = 'CSV',
        PARSER_VERSION = '2.0'
    ) AS [result]
Please check whether the path to which you uploaded the file and the one used in the script are the same.
You can do the following to check that:
Navigate to the workspace -> Data -> ADLS Gen2 -> go to the file -> right-click it, open Properties, copy the URI from there, and paste it into the script.
I am running a data flow activity using Azure Data Factory.
Source data source - Azure Blob
Destination data source - Azure Data Lake Gen 2
For example, I have a file named "test_123.csv" in Azure Blob. When I create a data flow activity to filter some data and copy it to the Data Lake, it changes the file name to "part-00.csv" in the Data Lake.
How can I keep my original filename?
Yes, you can do that; please look at the screenshot below. Please do let me know how it goes.
I have an Azure Function that processes zip files, converts their contents to CSV files, and saves them to a Data Lake Gen1.
I have enabled the managed identity of this Azure Function. Then I added this managed identity as an Owner in the Access control (IAM) of the Data Lake.
First Scenario:
I call this Azure Function from Azure Data Factory and send the file URIs of the zip files, which are persisted in a storage account, from Azure Data Factory to the Azure Function. When the Azure Function processes the files and saves the CSV files in the Data Lake, I get this error:
Error in creating file root\folder1\folder2\folder3\folder4\test.csv.
Operation: CREATE failed with HttpStatus:Unauthorized Token Length: 1162
Unknown Error: Unexpected type of exception in JSON error output.
Expected: RemoteException Actual: error Source:
Microsoft.Azure.DataLake.Store StackTrace: at
Microsoft.Azure.DataLake.Store.WebTransport.ParseRemoteError(Byte[]
errorBytes, Int32 errorBytesLength, OperationResponse resp, String
contentType).
RemoteJsonErrorResponse: Content-Type of error response:
application/json; charset=utf-8.
Error:{"error":{"code":"AuthenticationFailed","message":"The access token in the 'Authorization' header is expired.
Second Scenario:
I have set up an Event Grid trigger for this Azure Function. After dropping zip files into the storage account, which is bound to the Azure Function, the zip files are processed and the CSV files are successfully saved in the Data Lake.
Good to know: the Azure Function and the Data Lake are in the same VNet.
Could someone explain to me why my function works fine with Event Grid but doesn't work when I call it from Azure Data Factory (saving CSV files to Data Lake Gen1)?
Have you shown all of the error?
From the error, it seems related to Azure Active Directory authentication (Data Lake Storage Gen1 uses Azure Active Directory for authentication). Have you used a Bearer token in your Azure Function when you try to send something to Data Lake Storage Gen1? For example, see the sketch below.
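A minimal sketch (assuming the function is written in Python and uses the azure-identity package) of requesting a fresh token for the Data Lake Gen1 resource with the managed identity on each invocation, instead of reusing a cached or caller-supplied token that may have expired:

# pip install azure-identity
from azure.identity import ManagedIdentityCredential

def get_adls_gen1_bearer_token() -> str:
    credential = ManagedIdentityCredential()
    # "https://datalake.azure.net/" is the Azure AD resource for Data Lake Gen1.
    access_token = credential.get_token("https://datalake.azure.net/.default")
    return access_token.token  # send as "Authorization: Bearer <token>"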
Please show the code of your Azure Function; otherwise it will be hard to find the cause of the error.