Unable to read XML files from Azure Blob container using PySpark

I am trying to read multiple XML files from an Azure blob container using PySpark. When I run the script in an Azure Synapse notebook, I get the error below.
Note:
I have tested the connection using the Azure Data Lake Gen 2 linked service (both 'to linked service' and 'to file path').
I have added my workspace under 'Role Assignments' and given it the 'Storage Blob Data Contributor' role.
PySpark code (the error is thrown at the line below):
df = spark.read.format("xml").options(rowTag="x",inferSchema = True).load(xmlfile.path)
My assumption is that I don't have read permission to the XML files, but I am not sure if I am missing anything. Can you please shed some light?
Py4JJavaError: An error occurred while calling o2333.load.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 26 in stage 3.0 failed 4 times, most recent failure: Lost task 26.3 in stage 3.0 (TID 288) : java.nio.file.AccessDeniedException: Operation failed: "This request is not authorized to perform this operation using this permission.", 403, HEAD, https://<*path/x.xml*>
at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.checkException(AzureBlobFileSystem.java:1185)
at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.open(AzureBlobFileSystem.java:200)
at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.open(AzureBlobFileSystem.java:187)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:912)
at com.databricks.spark.xml.XmlRecordReader.initialize(XmlInputFormat.scala:86)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.liftedTree1$1(NewHadoopRDD.scala:240)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:237)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:192)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:91)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:374)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:338)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:57)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:374)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:338)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:57)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:374)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:338)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:498)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:501)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: Operation failed: "This request is not authorized to perform this operation using this permission.", 403, HEAD, <path/x.xml>
at org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.execute(AbfsRestOperation.java:207)
at org.apache.hadoop.fs.azurebfs.services.AbfsClient.getPathStatus(AbfsClient.java:570)
at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.openFileForRead(AzureBlobFileSystemStore.java:627)
at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.open(AzureBlobFileSystem.java:196)
... 23 more

Operation failed: "This request is not authorized to perform this operation using this permission.", 403
This error occurs when you don't have the proper role assigned to access the storage account.
Please make sure you have the 'Storage Blob Data Contributor' role.
I tried the same process in my environment without the role assigned to myself and got a similar error.
When I added the 'Storage Blob Data Contributor' role assignment to my user account and ran the code, the file was read successfully.
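Once the role assignment has propagated (it can take a few minutes), the read should succeed. Below is a minimal sketch mirroring the question's code; the account, container, and path are hypothetical, and it assumes the spark-xml connector is available on the Spark pool:

df = (
    spark.read.format("xml")
    .option("rowTag", "x")           # treat each <x> element as one row, as in the question
    .option("inferSchema", "true")   # let spark-xml infer column types
    .load("abfss://mycontainer@mystorageaccount.dfs.core.windows.net/data/*.xml")
)
df.show(5)

Note that 'Storage Blob Data Contributor' is a data-plane role; control-plane roles such as Owner or Contributor on the storage account are not sufficient for the HEAD request that failed with 403.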
Reference:
Error: "This request is not authorized to perform this operation using this permission.", 403 in Azure Synapse notebook while running from PySpark - Microsoft Q&A (Pradeepcheekatala-MSFT)

Related

The connection to the sink database is failed. Detailed error message is: Login failed for user '<token-identified principal>' - Azure Synapse Link for SQL

I'm trying to create an Azure Synapse Link for Azure SQL Database, using the steps from here:
https://learn.microsoft.com/en-us/azure/synapse-analytics/synapse-link/connect-synapse-link-sql-database
After I create the link connection and try to start it, I receive the following error:
The connection to the sink database is failed. Detailed error message is: Login failed for user ''.
I have also configured the Azure SQL database to use AAD auth. The connection to the Azure database seems to be working.
My user (the one used to create the Synapse workspace) is a Subscription Owner.
The user is also an owner of the storage account.
I added the SQL Managed Identity as a Storage Blob Data Contributor.
Did anyone else get this error and manage to fix it?
There are certain limitations when connecting a SQL Database to Synapse Link, as per the documentation:
When setting up your workspace, users must select "Disable Managed Virtual Network" and "Allow connections from any IP addresses."
A link connection cannot be enabled by Azure Synapse Link for SQL if the database owner does not have a mapped login; this will cause the error. The ALTER AUTHORIZATION command can be used to work around this problem by changing the database owner to a user with a valid login (see the sketch after this list).
Azure Synapse Link for SQL is not supported on the Free, Basic, or Standard tiers when they have fewer than 100 DTUs.
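For the database-owner limitation, here is a minimal sketch of the ALTER AUTHORIZATION workaround; the database and login names are hypothetical, and it assumes you run it connected to the target database as a server administrator:

-- Map the database owner to a principal that has a valid login, so that
-- Synapse Link can impersonate the owner when enabling the link connection.
ALTER AUTHORIZATION ON DATABASE::[YourDatabase] TO [your_admin_login];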
With these limitations addressed, I tried to connect the SQL Database to Synapse Link and was able to connect without error.
I was trying to create a Synapse Link service with an on-premises SQL Server and was getting the following error:
Failed to enable Synapse Link on the source due to 'Failed to enable the source database: Some internal error happened due to 'Calling internal service failed: Failed to execute non query on change publisher with status code 400 and error Fail to non-query change publisher with error: 'sqlErrorCode - 22301; exceptionCode - TransferServiceUnknowError; error - A database operation failed with the following error: 'Could not update the metadata. The failure occurred when executing the command '(null)'. The error/state returned was 15517/1: 'Cannot execute as the database principal because the principal "dbo" does not exist, this type of principal cannot be impersonated, or you do not have permission.'. Use the action and error to determine the cause of the failure and resubmit the request.'; detailedError - A database operation failed with the following error: 'Could not update the metadata. The failure occurred when executing the command '(null)'. The error/state returned was 15517/1: 'Cannot execute as the database principal because the principal "dbo" does not exist, this type of principal cannot be impersonated, or you do not have permission.'. Use the action and error to determine the cause of the failure and resubmit the request.'
I resolved it by changing the corresponding database owner to 'sa', and it worked:
USE [YourCorrespondingDatabase];
EXEC sp_changedbowner 'sa';

Could not create lake database from Synapse notebooks

I am new to Azure Synapse and trying to create a database (managed table) from a Synapse notebook. I have added the Storage Blob Data Contributor role for the Synapse workspace and the specific user. I have attached the error details.
%%sql
CREATE DATABASE sample
Error: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Got exception: java.nio.file.AccessDeniedException Operation failed: "This request is not authorized to perform this operation.", 403, HEAD, https://XXXXXXXXXX.dfs.core.windows.net/XXXXXXXXXXXXX/?upn=false&action=getAccessControl&timeout=90)
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:112)
org.apache.spark.sql.hive.HiveExternalCatalog.createDatabase(HiveExternalCatalog.scala:193)
org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:137)
org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:124)
org.apache.spark.sql.internal.SharedState.globalTempViewManager$lzycompute(SharedState.scala:153)
org.apache.spark.sql.internal.SharedState.globalTempViewManager(SharedState.scala:151)
org.apache.spark.sql.hive.HiveSessionStateBuilder.$anonfun$catalog$2(HiveSessionStateBuilder.scala:60)
org.apache.spark.sql.catalyst.catalog.SessionCatalog.globalTempViewManager$lzycompute(SessionCatalog.scala:99)
org.apache.spark.sql.catalyst.catalog.SessionCatalog.globalTempViewManager(SessionCatalog.scala:99)
org.apache.spark.sql.catalyst.catalog.SessionCatalog.createDatabase(SessionCatalog.scala:218)
org.apache.spark.sql.execution.command.CreateDatabaseCommand.run(ddl.scala:82)
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:228)
org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3687)
The error indicates that your account doesn't have enough permissions on the workspace's storage account. Please make sure the Storage Blob Data Contributor role is assigned on the blob storage account.
You can also go through the permissions documentation here.
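As a quick way to verify from the notebook that the role assignment has propagated, you can try listing the workspace's primary storage before retrying CREATE DATABASE. A minimal sketch with hypothetical account and container names:

from notebookutils import mssparkutils

# If this listing succeeds, the Spark pool can reach the storage account and
# the metastore call behind CREATE DATABASE should no longer return a 403.
for f in mssparkutils.fs.ls("abfss://mycontainer@mystorageaccount.dfs.core.windows.net/"):
    print(f.name)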

Getting an error while creating a Delta table from an Azure Synapse notebook

I'm trying to create a Delta table from an Azure Synapse notebook and I am getting an error. I have already added my current IP address to the storage account. I am able to write the data as a Delta file, but when I try to create a Delta table it throws an error. I checked all the Microsoft documents for this issue, and they only say to add an IP address to the storage account. Is there anything I am missing, or is it a bug? Thanks in advance.
Error: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Got exception: java.nio.file.AccessDeniedException Operation failed: "This request is not authorized to perform this operation.", 403, HEAD, https://xxxxxxxxxxxxxx.dfs.core.windows.net/xxxxxxxfilesystem/?upn=false&action=getAccessControl&timeout=90)
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:112)
org.apache.spark.sql.hive.HiveExternalCatalog.createDatabase(HiveExternalCatalog.scala:193)
org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:137)
org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:124)
org.apache.spark.sql.hive.HiveSessionStateBuilder.externalCatalog(HiveSessionStateBuilder.scala:44)
org.apache.spark.sql.hive.HiveSessionStateBuilder.$anonfun$catalog$1(HiveSessionStateBuilder.scala:59)
org.apache.spark.sql.catalyst.catalog.SessionCatalog.externalCatalog$lzycompute(SessionCatalog.scala:98)
org.apache.spark.sql.catalyst.catalog.SessionCatalog.externalCatalog(SessionCatalog.scala:98)
org.apache.spark.sql.catalyst.catalog.SessionCatalog.databaseExists(SessionCatalog.scala:266)
The error indicates that your account doesn't have enough permissions on the workspace's storage account.
Add the Storage Blob Data Contributor RBAC role to the user that is running the notebook.
You can go through the links here.
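For context: writing the Delta files only touches the storage path, while CREATE TABLE also goes through the Hive metastore, which performs the getAccessControl HEAD call on the workspace's primary storage that you can see failing in the trace; that is why the write succeeds while the table creation returns a 403. Once the role is assigned, both steps should work. A minimal sketch with hypothetical names, where df stands for the DataFrame you already wrote:

delta_path = "abfss://mycontainer@mystorageaccount.dfs.core.windows.net/delta/sample"

df.write.format("delta").mode("overwrite").save(delta_path)   # this part already worked

spark.sql(                                                    # this part needs metastore access
    f"CREATE TABLE IF NOT EXISTS sample_delta USING DELTA LOCATION '{delta_path}'"
)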

Cannot connect Databricks to Azure blob storage

I successfully mounted Azure blob storage using a SAS key with the code below:
dbutils.fs.mount(
    source = "wasbs://" + user + "@" + account + ".blob.core.windows.net",
    mount_point = "/mnt/" + mountName,
    extra_configs = {"fs.azure.sas." + user + "." + account + ".blob.core.windows.net": key}
)
However, I cannot save output to the blob or display the blob directory (which worked before).
Here is part of the error:
"shaded.databricks.org.apache.hadoop.fs.azure.AzureException: java.util.NoSuchElementException: An error occurred while enumerating the result, check the original exception for details."
"Caused by: com.microsoft.azure.storage.StorageException: This request is not authorized to perform this operation using this resource type.
at com.microsoft.azure.storage.StorageException.translateException(StorageException.java:89)
at com.microsoft.azure.storage.core.StorageRequest.materializeException(StorageRequest.java:305)
at com.microsoft.azure.storage.core.ExecutionEngine.executeWithRetry(ExecutionEngine.java:178)
at com.microsoft.azure.storage.core.LazySegmentedIterator.hasNext(LazySegmentedIterator.java:109)"
Does anyone know the cause of this issue?

AuthenticationException when creating Azure ML Dataset from Azure Data Lake Gen2 Datastore

I have an Azure Data Lake Gen2 with a public endpoint and a standard Azure ML instance.
I have created both components with my user and I am listed as Contributor.
I want to use data from this data lake in Azure ML.
I have added the data lake as a Datastore using Service Principal authentication.
When I then try to create a Tabular Dataset using the Azure ML GUI, I get the following error:
Access denied
You do not have permission to the specified path or file.
{
"message": "ScriptExecutionException was caused by StreamAccessException.\n StreamAccessException was caused by AuthenticationException.\n 'AdlsGen2-ListFiles (req=1, existingItems=0)' for '[REDACTED]' on storage failed with status code 'Forbidden' (This request is not authorized to perform this operation using this permission.), client request ID '1f9e329b-2c2c-49d6-a627-91828def284e', request ID '5ad0e715-a01f-0040-24cb-b887da000000'. Error message: [REDACTED]\n"
}
I have tried having our Azure Portal admin, who has admin access to both Azure ML and the Data Lake, try the same, and she gets the same error.
I tried creating the Dataset using the Python SDK and got a similar error:
ExecutionError:
Error Code: ScriptExecution.StreamAccess.Authentication
Failed Step: 667ddfcb-c7b1-47cf-b24a-6e090dab8947
Error Message: ScriptExecutionException was caused by StreamAccessException.
StreamAccessException was caused by AuthenticationException.
'AdlsGen2-ListFiles (req=1, existingItems=0)' for 'https://mydatalake.dfs.core.windows.net/mycontainer?directory=mydirectory/csv&recursive=true&resource=filesystem' on storage failed with status code 'Forbidden' (This request is not authorized to perform this operation using this permission.), client request ID 'a231f3e9-b32b-4173-b631-b9ed043fdfff', request ID 'c6a6f5fe-e01f-0008-3c86-b9b547000000'. Error message: {"error":{"code":"AuthorizationPermissionMismatch","message":"This request is not authorized to perform this operation using this permission.\nRequestId:c6a6f5fe-e01f-0008-3c86-b9b547000000\nTime:2020-11-13T06:34:01.4743177Z"}}
| session_id=75ed3c11-36de-48bf-8f7b-a0cd7dac4d58
I have created Datastores and Datasets for both a normal blob storage account and a managed SQL database with no issues, and I have only Contributor access to those, so I cannot understand why I should not be authorized to add the data lake. The fact that our admin gets the same error leads me to believe there is some other issue.
I hope you can help me identify what it is or give me some clue of what more to test.
Edit:
I see I might have duplicated this post: How to connect AMLS to ADLS Gen 2?
I will test that solution and close this post if it works.
This was actually a duplicate of How to connect AMLS to ADLS Gen 2?.
The solution is to give the service principal that Azure ML uses to access the data lake the Storage Blob Data Reader role. Note that you may have to wait a few minutes for this to take effect.
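For reference, a minimal sketch of registering the ADLS Gen2 datastore and creating the Tabular Dataset with the azureml-core (v1) SDK once the service principal has Storage Blob Data Reader; every name, ID, and path here is hypothetical:

from azureml.core import Dataset, Datastore, Workspace

ws = Workspace.from_config()

# Register the ADLS Gen2 filesystem, authenticating with the service
# principal that was granted Storage Blob Data Reader on the account.
datastore = Datastore.register_azure_data_lake_gen2(
    workspace=ws,
    datastore_name="my_adls_gen2",
    filesystem="mycontainer",
    account_name="mydatalake",
    tenant_id="<tenant-id>",
    client_id="<client-id>",
    client_secret="<client-secret>",
)

# The directory listing is the call that previously failed with 'Forbidden'.
dataset = Dataset.Tabular.from_delimited_files(path=(datastore, "mydirectory/csv/*.csv"))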
