Solving Timeout Issue with Azure Synapse/ADF Pipeline of Pipelines

Context
We're creating a pipeline of Spark Jobs in Azure Synapse (much like in Azure Data Factory) that read in data from various databases and merge it into a larger dataset. Originally, we had a single pipeline that worked, with many Spark Jobs leading into others. As part of a redesign, we were thinking that we would create a pipeline for each individual Spark job, so that we can create various orchestration pipelines. If the definition of a Spark job changes, we only have to change the definition file in one place.
Problem
When we run our "pipeline of pipelines", we get an error that we don't get with the single pipeline. The error:
Is a series of timeout errors like: Timeout waiting for idle object
Occurs in different areas of the pipeline on different runs
This results in the failure of the pipeline as a whole.
Questions
What is the issue happening here?
Is there a setting that could help solve it? (Azure Synapse does not provide many levers)
Is our pipeline-of-pipelines construction an anti-pattern? (Doesn't seem like it should be since Azure allows it)
Hypotheses
From the post here: are we running out of available connections in Azure? (how would you solve that?)
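
"Timeout waiting for idle object" is the message an exhausted Apache Commons object/connection pool typically raises when a caller cannot borrow a connection before the wait limit, which fits this hypothesis: each Spark JDBC read holds one connection per partition, and running many child pipelines in parallel multiplies the concurrent connections against the same sources. Two mitigations to try: lower the concurrency setting on the orchestration pipeline, and cap connections per job. A minimal PySpark sketch of the latter, with all connection details hypothetical:

```python
# Hedged sketch: bound how many simultaneous JDBC connections one Spark job
# opens, so parallel child pipelines don't exhaust the source's pool.
# URL, table, credentials, and bounds below are all hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bounded-jdbc-read").getOrCreate()

df = (spark.read.format("jdbc")
      .option("url", "jdbc:sqlserver://source-db;databaseName=sales")
      .option("dbtable", "dbo.orders")
      .option("user", "reader")
      .option("password", "...")
      # At most 4 partitions => at most 4 concurrent connections for this read.
      .option("numPartitions", 4)
      .option("partitionColumn", "order_id")
      .option("lowerBound", 1)
      .option("upperBound", 10000000)
      .load())
```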

Related

Azure Synapse transaction stuck in queued state

We have several tables living in an Azure Synapse DB that we need to truncate as part of a larger pipeline. But, starting very recently, whenever we try to truncate the tables, the transactions are unable to complete and get stuck in an eternal state of suspension.
sys.dm_pdw_exec_requests shows these transactions as having a status of "Suspended" while sys.dm_pdw_waits shows them as being in a state of "Queued" with a priority of "8". Does anyone know what these values mean? We haven't found any further documentation on them. Any suggestions on particular queries that can be executed within Synapse that could provide more detail about the reason for these suspensions?
The reason all of this matters to us is that we have a pipeline set up in Azure Data Factory that copies data to these tables. This pipeline is set up to first truncate the data in our tables before inserting new data. Because the truncation is eternally suspended, the copy jobs never complete and the pipeline fails.
I tried to repro the issue in my environment, where I have multiple tables in an Azure Synapse dedicated SQL pool.
I ran the command TRUNCATE TABLE <table_name> and it works fine for me.
select * from sys.dm_pdw_exec_requests also shows that my query completed successfully.
I also tried to truncate multiple tables in a single run, and that works fine too.
sys.dm_pdw_exec_requests shows these transactions as having a status of "Suspended"
SUSPENDED: It means the request is not currently active because it is waiting on a resource. The resource can be an I/O for reading a page, communication over the network, or a lock or latch. The request will become active once whatever it is waiting for completes.
sys.dm_pdw_waits shows them as being in a state of "Queued" with a priority of "8"
It means your query is queued and will be executed once the queued queries with higher priority have completed.
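
To dig further into what a suspended request is actually waiting on, you can join the two DMVs mentioned above. A minimal sketch run from Python (the ODBC DSN is hypothetical):

```python
# Hedged sketch: list suspended requests together with the wait that is
# queuing them. "synapse_pool" is a hypothetical ODBC DSN for the dedicated
# SQL pool.
import pyodbc

conn = pyodbc.connect("DSN=synapse_pool")
rows = conn.execute("""
    SELECT r.request_id, r.status, w.state, w.priority, w.type, w.object_name
    FROM sys.dm_pdw_exec_requests AS r
    JOIN sys.dm_pdw_waits AS w
      ON r.request_id = w.request_id
    WHERE r.status = 'Suspended';
""").fetchall()

for row in rows:
    print(row)  # e.g. which object the TRUNCATE is waiting to lock
```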
To set the priority of your queries, you can configure workload importance in dedicated SQL pool. Setting importance in dedicated SQL pool for Azure Synapse allows you to influence the scheduling of queries. Queries with higher importance will be scheduled to run before queries with lower importance. To assign importance to queries, you need to create a workload classifier.
Refer to Configure workload importance to implement this.
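
As an illustration, a minimal sketch of creating such a classifier from Python, assuming a hypothetical login etl_loader runs the truncate-and-copy pipeline (server, database, and group names are also hypothetical):

```python
# Hedged sketch: classify all queries from the 'etl_loader' login as
# above-normal importance so they are scheduled ahead of normal work.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=myworkspace.sql.azuresynapse.net;"  # hypothetical workspace
    "DATABASE=mypool;UID=admin_user;PWD=...;"
)
conn.autocommit = True  # run the DDL outside an explicit transaction
conn.execute("""
    CREATE WORKLOAD CLASSIFIER etl_loader_classifier
    WITH (
        WORKLOAD_GROUP = 'xlargerc',    -- workload group / resource class
        MEMBERNAME     = 'etl_loader',  -- hypothetical login
        IMPORTANCE     = above_normal
    );
""")
```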

Update databricks job status through API

We need to execute a long-running exe on a Windows machine and are thinking of ways to integrate it with the workflow. The plan is to include the exe as a task in the Databricks workflow.
We are thinking of a couple of approaches:
Create a DB table and insert a row when this particular task starts in the workflow. The exe running on the Windows machine will poll the table for new records; once a new record is found, the exe proceeds with the actual execution and updates the status after completion. Databricks will query this table constantly for the status, and once it is complete, the task finishes.
Using the Databricks API, check whether the task has started executing in the exe and continue with execution. Until the application finishes, the Databricks task would spin in a while (true) loop; after completion we would update the task status to complete. But the current API doesn't appear to support updating a task's execution status to Complete (not 100% sure).
Please share thoughts OR alternate solutions.
This is an interesting problem. Is there a reason you must use Databricks to execute an EXE?
Regardless, I think you have the right kind of idea. Here is how I would do this with the Jobs API:
Have your EXE process output a file to a staging location, probably in DBFS, since this will be locally accessible inside of Databricks.
Build a notebook to load this file; having a table is optional but may give you additional logging capabilities if needed. Your notebook should end with the dbutils.notebook.exit method, which allows you to output a value such as a string. You could return "In Progress" and "Success", or the latest line from the file you've written.
Wrap that notebook in a Databricks job, execute it on an interval with a cron schedule (you said 1 minute), and retrieve the output value of your job via the get-output endpoint.
Additional note: the benefit of abstracting this into return values from a notebook is that you could orchestrate it via other workflow tools, e.g. Databricks Workflows, or Azure Data Factory inside an Until condition. There are no limits so long as you can orchestrate a notebook in that tool.
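
To make the notebook step concrete, here is a minimal sketch of the polling notebook, assuming the EXE appends status lines to a file in DBFS; the path and the "DONE" sentinel are hypothetical conventions:

```python
# Hedged sketch: surface the EXE's latest status via dbutils.notebook.exit.
# The status file path and "DONE" marker are hypothetical conventions.
STATUS_PATH = "/dbfs/exe-status/latest.txt"  # DBFS is FUSE-mounted at /dbfs

try:
    with open(STATUS_PATH) as f:
        text = f.read().strip()
    last_line = text.splitlines()[-1] if text else ""
except FileNotFoundError:
    last_line = ""  # the EXE hasn't written anything yet

# dbutils is provided by the Databricks notebook runtime; the exit value
# becomes the notebook_output returned by the runs/get-output endpoint.
dbutils.notebook.exit("Success" if last_line == "DONE" else "In Progress")
```

The orchestrator then polls the Jobs API's runs/get-output endpoint mentioned above and reads this value from the notebook_output field.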

How to run a long-running dbt job?

I am using the dbt CLI to regularly update data via dbt run. However, this materializes several tables, and can take 20+ hours to do a full refresh.
I am currently running this from my PC/cloud VM, but I don't want to keep my PC on / VM running just to run the dbt CLI. Moreover, I've been burned before by trying to do this (brief Wi-Fi issue interrupting a dbt operation 10h into a 12h table materialization).
Are there any good patterns for this? Note that I'm using SQL Server, which is not supported by dbt Cloud.
I've considered:
Setting up airflow / prefect
Having a small vm just for DBT to run
Moving to a faster database (e.g. from Azure SQL to Azure Synapse)
Any ideas?
I would agree here with Branden. Throwing resources at the problem should be the last resort. The first thing you should do is try optimizing the SQL queries. If the queries are optimized, the time a full refresh takes will depend on data volume. If the volume is high, you should be doing incremental runs rather than full refreshes. You can schedule incremental runs using something like a cron scheduler or Airflow, as sketched below.
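
A minimal Airflow sketch of that scheduling, assuming dbt is installed on the Airflow worker and the project lives at a hypothetical path:

```python
# Hedged sketch: run dbt on a recurring schedule with Airflow.
# The project path and schedule below are hypothetical.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="dbt_incremental_run",
    start_date=datetime(2023, 1, 1),
    schedule_interval="0 */6 * * *",  # every 6 hours; adjust to your load window
    catchup=False,
) as dag:
    dbt_run = BashOperator(
        task_id="dbt_run",
        # Incremental models only process new/changed rows, so each run stays cheap.
        bash_command="cd /opt/dbt/my_project && dbt run",
    )
```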
Another thing to note: you don't need a full dbt run if you only want selected models. You can always do dbt run -m +model
+model -> run the model with all upstream dependencies
model+ -> run the model with all downstream dependencies
Another aspect: since you're using SQL Server, which is a row store (more suited to ETL), you may also need some dimensional modeling. dbt is the T of ELT, where data is already loaded into a powerful column-store warehouse (like Snowflake/Redshift) and dimensional modeling is not needed because queries already leverage the columnar storage. That doesn't mean dbt cannot work with row stores, but dimensional modeling may be needed.
Second, you can always have a small VM, or run it on something like ECS Fargate. That is a serverless solution, and you're charged only when dbt runs.
Finally, if nothing works, then you should consider moving to something like Synapse, which will likely use compute-intensive resources to run queries faster.

Databricks Jobs - Is there a way to restart a failed task instead of the whole pipeline?

If I have for example a (multitask) Databricks job with 3 tasks in series and the second one fails - is there a way to start from the second task instead of running the whole pipeline again?
Right now this is not possible, but if you refer to Databricks' Q3 2021 public roadmap, there were some items around improving multi-task jobs.
Update, September 2022: this functionality was released back in May 2022 under the name Repair & Rerun.
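
Repair & Rerun lets you retry only the failed tasks, either from the run page in the UI or via the Jobs 2.1 repair endpoint. A minimal sketch (workspace URL, token, run ID, and task key are all hypothetical):

```python
# Hedged sketch: re-run only the failed task of a multi-task job run via
# POST /api/2.1/jobs/runs/repair. All identifiers below are hypothetical.
import requests

HOST = "https://adb-1234567890123456.7.azuredatabricks.net"
TOKEN = "dapi..."  # personal access token placeholder

resp = requests.post(
    f"{HOST}/api/2.1/jobs/runs/repair",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "run_id": 455644833,        # the failed job run
        "rerun_tasks": ["task_2"],  # only the task that failed
    },
)
resp.raise_for_status()
print(resp.json())  # includes a repair_id for the new attempt
```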
If you are running Databricks on Azure, it is also possible via Azure Data Factory.

How do I monitor progress and recover in a long-running Spark map job?

We're using Spark to run an ETL process by which data gets loaded in from a massive (500+GB) MySQL database and converted into aggregated JSON files, then gets written out to Amazon S3.
My question is two-fold:
This job could take a long time to run, and it would be nice to know how the mapping is going. I know Spark has a built-in log manager. Is it as simple as putting a log statement inside each map? I'd like to know when each record gets mapped.
Suppose this massive job fails in the middle (maybe it chokes on a DB record, or the MySQL connection drops). Is there an easy way to recover from this in Spark? I've heard that caching/checkpointing can potentially solve this, but I'm not sure how.
Thanks!
Seems like two questions with lots of possible answers and detail. Anyway, assuming a non-Spark-Streaming context, and referencing other material based on my own reading/research, a limited response:
The following covers logging and progress checking of stages, tasks, and jobs:
Global logging via log4j, tailored by using the log4j.properties.template file stored under the SPARK_HOME/conf folder, which serves as a basis for defining logging requirements for your own purposes, but at the Spark level.
Programmatically, by using a Logger: import org.apache.log4j.{Level, Logger}.
The REST API to get the status of Spark jobs (see the sketch after this list). See this enlightening blog: http://arturmkrtchyan.com/apache-spark-hidden-rest-api
There is also a SparkListener that can be used.
The Web UI at http://<master>:8080 to see progress.
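
To illustrate the REST API option from the list above, a minimal sketch that polls the monitoring API served by the driver's Web UI (port 4040 by default; the host is hypothetical):

```python
# Hedged sketch: poll Spark's monitoring REST API for job progress.
# "driver-host" is a hypothetical address for the driver node.
import requests

BASE = "http://driver-host:4040/api/v1"

app_id = requests.get(f"{BASE}/applications").json()[0]["id"]
for job in requests.get(f"{BASE}/applications/{app_id}/jobs").json():
    print(job["jobId"], job["status"],
          f'{job["numCompletedTasks"]}/{job["numTasks"]} tasks done')
```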
Recovery depends on the type of failure: graceful vs. non-graceful, fault-tolerance aspects or memory-usage issues, and things like serious database duplicate-key errors, depending on the API used.
See How does Apache Spark handle system failure when deployed in YARN? Spark handles its own failures by looking at the DAG and attempting to reconstruct a partition by re-executing what is needed. This all falls under fault tolerance, for which nothing needs to be done.
Things outside of Spark's domain and control mean the run is over: e.g. memory issues that may result from exceeding various parameters on large-scale computations, a DataFrame JDBC write against a store hitting a duplicate-key error, or JDBC connection outages. These mean re-execution.
As an aside, some failures are not logged as such even though they are failures, e.g. duplicate-key inserts on some Hadoop storage managers.
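
On the recovery question, a minimal PySpark sketch of the checkpointing idea the question mentions: materialize an expensive intermediate result to durable storage so a downstream failure does not force a full re-read of MySQL. All URLs, paths, and credentials are hypothetical:

```python
# Hedged sketch: truncate lineage with DataFrame.checkpoint() so recovery
# restarts from durable storage, not from the 500+GB MySQL source.
# Connection details and paths below are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mysql-etl").getOrCreate()
spark.sparkContext.setCheckpointDir("s3a://my-bucket/spark-checkpoints")

raw = (spark.read.format("jdbc")
       .option("url", "jdbc:mysql://db-host:3306/prod")
       .option("dbtable", "events")
       .option("user", "etl")
       .option("password", "...")
       .load())

# Eagerly written to the checkpoint dir; failed downstream stages recompute
# from here rather than re-reading MySQL.
stable = raw.checkpoint()
stable.write.mode("overwrite").json("s3a://my-bucket/aggregated-json/")
```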
