Azure Synapse Spark Pool Submission Failure with error message "SparkJobDefinitionActionFailed" - azure

I am following the tutorials below and trying to submit a Spark job using the Spark engine in Azure Synapse.
The submission failed with the following error:
Error:
{
"code": "SparkJobDefinitionActionFailed",
"message": "Spark job batch request for workspace contosows, spark compute contosospark with session id null failed with a system error. Please try again",
"target": null,
"details": null,
"error": null
}
Can anyone give some guidance/suggestions on how to resolve it?
More information about my setup:
Region: Southeast Asia for both the Azure Synapse workspace and ADLS Gen2
I granted myself both the Storage Blob Data Owner and Storage Blob Data Contributor roles as suggested.
Tutorials used:
https://learn.microsoft.com/en-us/azure/synapse-analytics/quickstart-create-workspace
https://learn.microsoft.com/en-us/azure/synapse-analytics/quickstart-create-apache-spark-pool-portal
https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-job-definitions
Thanks!

After investigation, the product group deployed a hotfix for this problem.
You can now submit the Spark job without any issues. Please let me know if you are still experiencing this issue.

Related

File path error in pipeline for spark notebook in azure synapse

I have a Spark notebook which I am running from a pipeline. The notebook runs fine manually, but in the pipeline it gives an error about the file location. In the code I am loading the file into a data frame. The file location in the code is abfss://storage_name/folder_name/*, but in the pipeline it takes abfss://storage_name/filename.parquet.
This is the error:
{
"errorCode": "6002",
"message": "org.apache.spark.sql.AnalysisException: Path does not exist: abfss://storage_name/filename.parquet\n at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$checkAndGlobPathIfNecessary$4(DataSource.scala:806)\n\n at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$checkAndGlobPathIfNecessary$4$adapted(DataSource.scala:803)\n\n at org.apache.spark.util.ThreadUtils$.$anonfun$parmap$2(ThreadUtils.scala:372)\n\n at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:659)\n\n at scala.util.Success.$anonfun$map$1(Try.scala:255)\n\n at scala.util.Success.map(Try.scala:213)\n\n at scala.concurrent.Future.$anonfun$map$1(Future.scala:292)\n\n at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33)\n\n at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33)\n\n at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)\n\n at java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1402)\n\n at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)\n\n at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)\n\n at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)\n\n at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175)\n",
"failureType": "UserError",
"target": "notebook_name",
"details": []
}
The above error mainly happens because of a permissions issue: the Synapse workspace lacks the permissions required to access the storage account, so you need to grant it the Storage Blob Data Contributor role.
To add the Storage Blob Data Contributor role to your workspace, refer to this Microsoft documentation.
Also, make sure you are following the proper ADLS Gen2 syntax:
abfss://<container_name>@<storage_account_name>.dfs.core.windows.net/<path>
Sample code:
df = spark.read.load('abfss://<container_name>@<storage_account_name>.dfs.core.windows.net/samplefile.parquet', format='parquet')
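For the wildcard folder read described in the question, the same corrected syntax applies. A minimal sketch with placeholder container, account, and folder names:
# Read every Parquet file under a folder; all names below are placeholders.
df = spark.read.load(
    'abfss://<container_name>@<storage_account_name>.dfs.core.windows.net/<folder_name>/*',
    format='parquet'
)
df.show(5)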
For more detailed information, refer to this link.
I added the required access for my Synapse workspace, and it worked.

Error on source dataset with REST Connector in Azure Synapse pipeline

I am following "Copy and transform data from and to a REST endpoint by using Azure Data Factory" to load a file from my Box.com account into an Azure Data Lake Gen2 (ADLS Gen2) container. I'm using a Synapse pipeline with the REST connector as the source, where I've set the Base URL in step 3 of the tutorial to https://api.box.com/2.0/files/:file_id/content, where file_id is the id of my file stored in Box.com (ref: here).
When I run the pipeline, I get the following error. What may I be doing wrong, and how can the issue be resolved?
"errorCode": "2200",
"message": "Failure happened on 'Source' side. ErrorCode=RestSourceCallFailed,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=The HttpStatusCode 401 indicates failure.\nRequest URL: https://api.box.com/2.0/files/:984786751561/content\nResponse payload:,Source=Microsoft.DataTransfer.ClientLibrary,'",
"failureType": "UserError",
"target": "Copy data1",
"details": []
I'm from the Microsoft for Founders Hub team. A 401 indicates unauthorized access. In step 3 of the tutorial you mentioned, there is a Test connection button; please click it to make sure you are authorized.
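As a quick way to check authorization outside the pipeline, you can call the same Box endpoint directly. A minimal sketch, assuming a Box developer token; the token and file id are placeholders:
import requests

# Placeholders: a valid Box access token and the numeric file id from Box.com.
token = '<box_access_token>'
file_id = '<file_id>'

# A valid token returns 302 (redirect to the download URL) or 200;
# an invalid or expired token reproduces the 401 seen in the pipeline.
resp = requests.get(
    f'https://api.box.com/2.0/files/{file_id}/content',
    headers={'Authorization': f'Bearer {token}'},
    allow_redirects=False,
)
print(resp.status_code)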

Getting an error while creating a delta table from an Azure Synapse notebook

I'm trying to create a delta table from an Azure Synapse notebook and I'm getting an error. I have already added my current IP address to the storage account. I am able to write a delta file, but when I try to create a delta table it throws an error. I checked all the Microsoft documents for this issue; they tell me to add my IP address to the storage account. Is there anything I am missing, or is it a bug? Thanks in advance.
Error: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Got exception: java.nio.file.AccessDeniedException Operation failed: "This request is not authorized to perform this operation.", 403, HEAD, https://xxxxxxxxxxxxxx.dfs.core.windows.net/xxxxxxxfilesystem/?upn=false&action=getAccessControl&timeout=90)org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:112)org.apache.spark.sql.hive.HiveExternalCatalog.createDatabase(HiveExternalCatalog.scala:193)org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:137)org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:124)org.apache.spark.sql.hive.HiveSessionStateBuilder.externalCatalog(HiveSessionStateBuilder.scala:44)org.apache.spark.sql.hive.HiveSessionStateBuilder.$anonfun$catalog$1(HiveSessionStateBuilder.scala:59)org.apache.spark.sql.catalyst.catalog.SessionCatalog.externalCatalog$lzycompute(SessionCatalog.scala:98)org.apache.spark.sql.catalyst.catalog.SessionCatalog.externalCatalog(SessionCatalog.scala:98)org.apache.spark.sql.catalyst.catalog.SessionCatalog.databaseExists(SessionCatalog.scala:266)
The error indicates that your account doesn't have enough permissions on the workspace's storage account.
Add the Storage Blob Data Contributor RBAC role to the user that is running the notebook.
You can go through the links here.
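For reference, the error surfaces only when creating the table because that path goes through the Hive metastore, which checks access control on the container root (the getAccessControl call in the stack trace), while writing a delta file only touches the target folder. A minimal sketch with placeholder names, once the role has been granted:
# df is an existing DataFrame; the container and account names are placeholders.
path = 'abfss://<container>@<account>.dfs.core.windows.net/delta/mytable'

# Writing a delta file: only needs access to the target folder.
df.write.format('delta').mode('overwrite').save(path)

# Creating a delta table: also registers it in the Hive metastore,
# which is where the 403 on getAccessControl was raised.
df.write.format('delta').mode('overwrite').saveAsTable('mytable')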

AuthenticationException when creating Azure ML Dataset from Azure Data Lake Gen2 Datastore

I have an Azure Data Lake Gen2 with public endpoint and a standard Azure ML instance.
I have created both components with my user and I am listed as Contributor.
I want to use data from this data lake in Azure ML.
I have added the data lake as a Datastore using Service Principal authentication.
When I then try to create a Tabular Dataset using the Azure ML GUI, I get the following error:
Access denied
You do not have permission to the specified path or file.
{
"message": "ScriptExecutionException was caused by StreamAccessException.\n StreamAccessException was caused by AuthenticationException.\n 'AdlsGen2-ListFiles (req=1, existingItems=0)' for '[REDACTED]' on storage failed with status code 'Forbidden' (This request is not authorized to perform this operation using this permission.), client request ID '1f9e329b-2c2c-49d6-a627-91828def284e', request ID '5ad0e715-a01f-0040-24cb-b887da000000'. Error message: [REDACTED]\n"
}
I have tried having our Azure Portal admin, who has admin access to both Azure ML and the Data Lake, try the same, and she gets the same error.
I tried creating the Dataset using the Python SDK and get a similar error:
ExecutionError:
Error Code: ScriptExecution.StreamAccess.Authentication
Failed Step: 667ddfcb-c7b1-47cf-b24a-6e090dab8947
Error Message: ScriptExecutionException was caused by StreamAccessException.
StreamAccessException was caused by AuthenticationException.
'AdlsGen2-ListFiles (req=1, existingItems=0)' for 'https://mydatalake.dfs.core.windows.net/mycontainer?directory=mydirectory/csv&recursive=true&resource=filesystem' on storage failed with status code 'Forbidden' (This request is not authorized to perform this operation using this permission.), client request ID 'a231f3e9-b32b-4173-b631-b9ed043fdfff', request ID 'c6a6f5fe-e01f-0008-3c86-b9b547000000'. Error message: {"error":{"code":"AuthorizationPermissionMismatch","message":"This request is not authorized to perform this operation using this permission.\nRequestId:c6a6f5fe-e01f-0008-3c86-b9b547000000\nTime:2020-11-13T06:34:01.4743177Z"}}
| session_id=75ed3c11-36de-48bf-8f7b-a0cd7dac4d58
I have created Datastores and Datasets for both a normal blob storage and a managed SQL database with no issues, and I only have Contributor access to those, so I cannot understand why I should not be authorized to add the data lake. The fact that our admin gets the same error leads me to believe there is some other issue.
I hope you can help me identify what it is or give me some clue of what more to test.
Edit:
I see I might have duplicated this post: How to connect AMLS to ADLS Gen 2?
I will test that solution and close this post if it works
This was actually a duplicate of How to connect AMLS to ADLS Gen 2?.
The solution is to give the service principal that Azure ML uses to access the data lake the Storage Blob Data Reader role. Note that you have to wait a few minutes for this to take effect.
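For reference, a minimal sketch of the SDK flow once the role is in place, using the account, container, and directory names from the error message; the service principal credentials are placeholders:
from azureml.core import Workspace, Datastore, Dataset

ws = Workspace.from_config()

# Register the ADLS Gen2 filesystem with service principal credentials
# (the same principal that needs Storage Blob Data Reader on the account).
datastore = Datastore.register_azure_data_lake_gen2(
    workspace=ws,
    datastore_name='mydatalake_ds',
    filesystem='mycontainer',
    account_name='mydatalake',
    tenant_id='<tenant-id>',
    client_id='<client-id>',
    client_secret='<client-secret>',
)

# Create a Tabular Dataset from the CSV directory and preview a few rows.
dataset = Dataset.Tabular.from_delimited_files(path=(datastore, 'mydirectory/csv/*.csv'))
print(dataset.take(5).to_pandas_dataframe())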

Synapse LINK Load streaming DataFrame from Azure Cosmos DB container

I am trying to use the change feed in Synapse, using Synapse Link to connect to Cosmos DB:
dfStream = spark.readStream\
.format("cosmos.oltp")\
.option("spark.synapse.linkedService", "<enter linked service name>")\
.option("spark.cosmos.container", "<enter container name>")\
.option("spark.cosmos.changeFeed.readEnabled", "true")\
.option("spark.cosmos.changeFeed.startFromTheBeginning", "true")\
.option("spark.cosmos.changeFeed.checkpointLocation", "/localReadCheckpointFolder")\
.option("spark.cosmos.changeFeed.queryName", "streamQuery")\
.load()
But I'm getting the error below:
Error : org.apache.hadoop.fs.azurebfs.contracts.exceptions.AbfsRestOperationException: Operation failed: "This request is not authorized to perform this operation using this permission.", 403, DELETE, https://adlsgarage7.dfs.core.windows.net/adlsgarage7/localReadCheckpointFolder/streamQuery?
You need Contributor access to the container of the Data Lake account that was connected to the workspace at creation time: Storage Blob Data Contributor ARM access to the account adlsgarage7, or at least to the container adlsgarage7.
You should also make sure to fill in the name of the linked service you connect to and the name of the container.
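Once that access is in place, the stream can be started as usual. A minimal sketch that writes the change feed to a delta folder; the output and checkpoint paths are placeholders:
# The checkpoint folder is where the 403 on DELETE was rejected, so it must be
# writable by the workspace identity with the role described above.
streamQuery = dfStream.writeStream\
    .format("delta")\
    .outputMode("append")\
    .option("checkpointLocation", "/localWriteCheckpointFolder")\
    .start("/delta/changeFeedOutput")

streamQuery.awaitTermination()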
