Azure Synapse transaction stuck in queued state

We have several tables living in an Azure Synapse database that we need to truncate as part of a larger pipeline. But starting very recently, whenever we try to truncate the tables, the transactions never complete and remain suspended indefinitely.
sys.dm_pdw_exec_requests shows these transactions as having a status of "Suspended" while sys.dm_pdw_waits shows them as being in a state of "Queued" with a priority of "8". Does anyone know what these values mean? We haven't found any further documentation on them. Any suggestions on particular queries that can be executed within Synapse that could provide more detail about the reason for these suspensions?
The reason all of this matters to us is that we have a pipeline set up in Azure Data Factory that copies data to these tables. This pipeline is set up to first truncate the data in our tables before inserting new data. Because the truncation is eternally suspended, the copy jobs never complete and the pipeline fails.

I tried to repro the issue in my environment, where I have multiple tables in an Azure Synapse dedicated SQL pool.
I ran the command TRUNCATE TABLE <table_name> and it works fine for me.
Running select * from sys.dm_pdw_exec_requests also shows that my query completed successfully.
I also tried to truncate multiple tables in a single run, and that works fine for me as well.
sys.dm_pdw_exec_requests shows these transactions as having a status of "Suspended"
SUSPENDED: it means the request is not currently active because it is waiting on a resource. The resource can be an I/O operation for reading a page, a wait on network communication, or a lock or latch. The request becomes active again once the resource it is waiting on is available.
sys.dm_pdw_waits shows them as being in a state of "Queued" with a priority of "8"
It means your query is in the queue and will be executed once the queries ahead of it with higher priority have completed.
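If you want to see exactly which object a suspended request is queued behind, you can join the two DMVs on request_id. Below is a minimal sketch that runs that join from Python with pyodbc (the connection details are placeholders, and you can just as well run the inner T-SQL directly in Synapse Studio):

    import pyodbc

    # Placeholder connection details for the dedicated SQL pool; replace with your own.
    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=<workspace>.sql.azuresynapse.net;"
        "DATABASE=<dedicated-pool>;UID=<user>;PWD=<password>"
    )

    # Join the waits DMV to the requests DMV to see what each suspended request is waiting on.
    diagnostic_sql = """
    SELECT r.request_id, r.command, r.status,
           w.type, w.object_type, w.object_name, w.state, w.priority
    FROM sys.dm_pdw_exec_requests AS r
    JOIN sys.dm_pdw_waits AS w ON w.request_id = r.request_id
    WHERE r.status = 'Suspended';
    """

    for row in conn.cursor().execute(diagnostic_sql):
        print(row)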
To influence the priority of your queries, you can configure workload importance in the dedicated SQL pool. Queries with higher importance are scheduled to run before queries with lower importance. To assign importance to queries, you need to create a workload classifier.
Refer to the configure workload importance documentation to implement this.
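As a rough sketch of what that can look like, assuming an existing workload group/resource class and the login your ADF pipeline uses (the classifier name, group, and login below are illustrative, not taken from your environment):

    import pyodbc

    conn = pyodbc.connect("<connection string for the dedicated SQL pool>")  # placeholder
    conn.autocommit = True  # run the DDL outside an explicit transaction

    # Classify the pipeline's login so its queries (including the TRUNCATEs) get HIGH importance.
    conn.cursor().execute("""
        CREATE WORKLOAD CLASSIFIER wcPipelineTruncate
        WITH (
            WORKLOAD_GROUP = 'largerc',      -- an existing workload group / resource class
            MEMBERNAME     = 'adf_etl_user', -- the login the ADF pipeline connects with
            IMPORTANCE     = HIGH            -- scheduled ahead of NORMAL-importance queries
        );
    """)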

Related

Update databricks job status through API

We need to execute a long-running exe on a Windows machine and are thinking of ways to integrate it with the workflow. The plan is to include the exe as a task in the Databricks workflow.
We are thinking of a couple of approaches:
Create a DB table and enter a row when this particular task gets started in the workflow. Exe which is running on a windows machine will ping the database table for any new records. Once a new record is found, the exe proceeds with actual execution and updates the status after completion. Databricks will query this table constantly for the status and once completed, task finishes.
Using the Databricks API, check whether the task has started execution, and have the exe continue with its execution. After the application finishes, update the task status to complete; until then the Databricks task would run like a while (true) loop. But the current API doesn't appear to support updating a task's execution status to complete (not 100% sure).
Please share thoughts OR alternate solutions.
This is an interesting problem. Is there a reason you must use Databricks to execute an EXE?
Regardless, I think you have the right kind of idea. Here is how I would do it with the Jobs API:
Have your exe process output a file to a staging location, probably in DBFS, since this will be locally accessible inside of Databricks.
Build a notebook to load this file; having a table is optional but may give you additional logging capabilities if needed. The output of your notebook should use the dbutils.notebook.exit method, which allows you to return a value such as a string. You could return "In Progress" and "Success", or the latest line from the file you've written.
Wrap that notebook in a Databricks job, execute it on an interval with a cron schedule (you said 1 minute), and retrieve the output value of the job via the get-output endpoint.
Additional note: the benefit of abstracting this into return values from a notebook is that you could orchestrate it via other workflow tools, e.g. Databricks Workflows or Azure Data Factory, inside an Until condition. There are no limits as long as you can orchestrate a notebook in that tool.
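For the polling side of the get-output step above, a hedged sketch using the Jobs API (the workspace URL, token, run id, and the "Success" return-value convention are placeholders/assumptions):

    import time
    import requests

    HOST = "https://<your-databricks-instance>"   # placeholder workspace URL
    TOKEN = "<personal-access-token>"             # placeholder PAT
    RUN_ID = 12345                                # run id of the wrapped notebook job

    while True:
        resp = requests.get(
            f"{HOST}/api/2.1/jobs/runs/get-output",
            headers={"Authorization": f"Bearer {TOKEN}"},
            params={"run_id": RUN_ID},
        )
        resp.raise_for_status()
        # notebook_output.result is whatever the notebook passed to dbutils.notebook.exit()
        result = resp.json().get("notebook_output", {}).get("result")
        if result == "Success":
            break
        time.sleep(60)  # the one-minute interval mentioned above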

How to run a long-running dbt job?

I am using the dbt CLI to regularly update data via dbt run. However, this materializes several tables, and can take 20+ hours to do a full refresh.
I am currently running this from my PC/cloud VM, but I don't want to keep my PC on / VM running just to run the dbt CLI. Moreover, I've been burned before by trying to do this (brief Wi-Fi issue interrupting a dbt operation 10h into a 12h table materialization).
Are there any good patterns for this? Note that I'm using SQL Server which is not supported by DBT cloud.
I've considered:
Setting up airflow / prefect
Having a small vm just for DBT to run
Moving to a faster database (eg. from Azure SQL to Azure Synapse)
Any ideas?
I would agree with Branden here. Throwing resources at the problem should be the last resort. The first thing you should do is try optimizing the SQL queries. Once the queries are optimized, the time a full refresh takes depends on data volume. If the volume is high, you should be doing incremental runs rather than full refreshes. You can schedule incremental runs using something like a cron scheduler or Airflow.
Another thing to note: you don't need to run the whole project if you only want selected models. You can always do dbt run -m +model
+model -> run the model with all upstream dependencies
model+ -> run the model with all downstream dependencies
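As a small sketch, a scheduled selective run could be as simple as the following (the model name is just an example; drop it into cron, Airflow, or Prefect as suggested above):

    import subprocess

    # Run one model plus all of its upstream dependencies (same as `dbt run -m +my_big_model`).
    subprocess.run(
        ["dbt", "run", "--select", "+my_big_model"],
        check=True,  # raise if dbt exits with a non-zero status
    )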
Another aspect: since you're using SQL Server, which is a row store (more suited to ETL), you may also need some dimensional modeling. dbt is the T of ELT, where data is already loaded into a powerful column-store warehouse (like Snowflake/Redshift) and dimensional modeling is not needed because queries already leverage the columnar storage. That doesn't mean dbt cannot work with row stores, but dimensional modeling may be needed.
Second, you can always have a small VM, or run it on something like ECS Fargate, which is serverless so you're charged only while dbt runs.
Finally, if nothing else works, you should consider moving to something like Synapse, which can apply more compute-intensive resources to run queries faster.

Solving Timeout Issue with Azure Synapse/ADF Pipeline of Pipelines

Context
We're creating a pipeline of Spark Jobs in Azure Synapse (much like in Azure Data Factory) that read in data from various databases and merge it into a larger dataset. Originally, we had a single pipeline that worked, with many Spark Jobs leading into others. As part of a redesign, we were thinking that we would create a pipeline for each individual Spark job, so that we can create various orchestration pipelines. If the definition of a Spark job changes, we only have to change the definition file in one place.
Problem
When we run our "pipeline of pipelines", we get an error that we don't get with the single pipeline. The error:
Is a series of timeout errors like: Timeout waiting for idle object
Occurs in different areas of the pipeline on different runs
This results in the failure of the pipeline as a whole.
Questions
What is the issue happening here?
Is there a setting that could help solve it? (Azure Synapse does not provide many levers)
Is our pipeline-of-pipelines construction an anti-pattern? (Doesn't seem like it should be since Azure allows it)
Hypotheses
From the post here: are we running out of available connections in Azure? (how would you solve that?)

Azure Data Factory - Limit the number of Databricks pipeline running at the same time

I am using ADF to execute Databricks notebooks. At this time, I have 6 pipelines, and they are executed consecutively.
Specifically, after the former is done, the latter is executed with multiple parameters by the loop box, and this keeps going. For example, after the first pipeline is done, it will trigger 3 instances of the second pipeline with different parameters, and each of these instances will trigger multiple instances of the third pipeline. As a result, the deeper I go, the more pipelines I have to run.
The issue is: when each pipeline is executed, it asks Databricks to allocate a cluster to run on. However, Databricks limits the number of cores that can be used for each workspace, which causes pipeline instances to fail to run.
My question is: is there any solution to control the number of pipeline instance running at the same time, or any solution to handle my issue?
Thanks in advance :-)
Why does this issue occur?
Note: Creating a Databricks cluster always depends on the number of cores available in the subscription. Before creating any Databricks cluster, make sure enough cores are available in the selected region for the chosen VM family's vCPUs.
You can check the core limit of your subscription in the Azure Portal => Subscriptions => select your subscription => Settings "Usage + quotas" => check the usage quota available for each region.
Example: if your subscription has more than 72 cores available, the ADF runs succeed; otherwise they fail with an error like the following:
Activity Validate failed: Databricks execution failed with error message: Unexpected failure while waiting for the cluster to be ready. Cause Unexpected state for cluster (job-200-run-1): Could not launch cluster due to cloud provider failures. azure_error_code: OperationNotAllowed, azure_error_message: Operation results in exceeding quota limits of Core. Maximum allowed: 350, Current in use: 344
I'm trying to create 6 pipelines with Databricks clusters of 2 worker nodes each, which means it requires:
(6 pipelines) * (1 driver node + 2 worker nodes) * (4 cores) = 72 cores.
The above calculation uses the VM size Standard_DS3_v2, which has 4 cores per node.
Note: creating a Databricks Spark cluster requires more than 4 cores, i.e. a minimum of 4 cores for the driver type plus 4 cores for the worker type.
Resolutions for this issue:
Increase the core limit by raising a ticket with the billing and subscription team to get a higher limit. With this option, you are charged only for the cores you actually use.
Limit your job frequency so that fewer clusters are created, or consider using a single job to copy multiple files, so that cluster creation doesn't exhaust the cores available under your subscription.
To request an increase of one or more resources that support such an increase, submit an Azure support request (select "Quota" for the issue type).
Issue type: Service and subscription limits (quotas)
Reference: Total regional vCPU limit increases
Hope this helps. Do let us know if you have any further queries.
Do click on "Mark as Answer" and Upvote on the post that helps you, this can be beneficial to other community members.
You can limit the number of activities run in parallel at each ForEach level by setting the Batch Count parameter (found under the Settings tab of the ForEach loop).
batchCount - the batch count used to control the number of parallel executions (when isSequential is set to false).
https://learn.microsoft.com/en-us/azure/data-factory/control-flow-for-each-activity
If you are not able to set a limit at the overall pipeline level, try arriving at a minimal batch count value in each of your nested ForEach loops.
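For reference, the relevant ForEach settings in the pipeline JSON look roughly like this (shown here as a Python dict; the activity name and items expression are illustrative):

    # Illustrative sketch of a ForEach activity definition; names and expressions are placeholders.
    foreach_activity = {
        "name": "ForEachNotebookRun",
        "type": "ForEach",
        "typeProperties": {
            "items": {"value": "@pipeline().parameters.runParams", "type": "Expression"},
            "isSequential": False,  # run iterations in parallel...
            "batchCount": 2,        # ...but never more than 2 at a time
            "activities": [
                # the Databricks Notebook activity goes here
            ],
        },
    }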

What happens when HDInsight sources data from Azure DocumentDB

I have a Hadoop job running on HDInsight that sources data from Azure DocumentDB. This job runs once a day, and as new data comes into DocumentDB every day, my Hadoop job filters out old records and only processes the new ones (this is done by storing a timestamp somewhere). However, if new records come in while the Hadoop job is running, I don't know what happens to them. Are they fed to the running job or not? How does the throttling mechanism in DocumentDB play a role here?
as the hadoop job is running and if new records come in, I don't know what happens to them. Are they fed to running job or not?
The answer to this depends on what phase or step the Hadoop job is in. Data gets pulled once, at the beginning. Documents added while data is being pulled will be included in the Hadoop job results. Documents added after the data pull has finished will not be included in the Hadoop job results.
Note: ORDER BY _ts is needed for consistent behavior, as the Hadoop job simply follows the continuation token when paging through query results.
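To illustrate the incremental pattern (the Hadoop connector issues its own query; this is just a sketch of a timestamp filter with ORDER BY _ts, using the current azure-cosmos Python SDK and placeholder account details):

    from azure.cosmos import CosmosClient

    # Placeholder account URL, key, database and collection names.
    client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
    container = client.get_database_client("<database>").get_container_client("<collection>")

    last_ts = 1700000000  # epoch-seconds checkpoint stored by the previous daily run

    # ORDER BY c._ts keeps paging deterministic while new documents keep arriving.
    items = container.query_items(
        query="SELECT * FROM c WHERE c._ts > @ts ORDER BY c._ts",
        parameters=[{"name": "@ts", "value": last_ts}],
        enable_cross_partition_query=True,
    )
    for doc in items:
        print(doc["id"])  # stand-in for the job's per-document processing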
"How the throttling mechanism in DocumentDB play roles here?"
The DocumentDB Hadoop connector will automatically retry when throttled.
