We have an ADFv1 pipeline with an HDInsightHive-type activity that submits a Hive script to an HDInsight Hadoop cluster. Looking at the JSON for the pipeline, there doesn't seem to be any way to specify the YARN queue that the job should be submitted to.
So it appears the job is always submitted to the default queue. I haven't found anything in the ADFv1 documentation yet for specifying a queue name (assuming we actually create additional YARN queues on the cluster using the Capacity Scheduler).
Can someone provide sample JSON for specifying a YARN queue in an activity, if that is possible at all? My requirement is specifically for ADFv1, but I would also like to know: if this is a limitation of ADFv1, is it addressed in ADFv2?
Currently, Azure Data Factory doesn't support submitting an activity to a specific queue.
An Azure Data Factory activity is always submitted to the default queue.
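For reference, here is roughly the shape of an ADFv1 HDInsightHive activity definition, sketched as a Python dict that mirrors the pipeline JSON (the activity, linked service, and script names are hypothetical). Nothing in typeProperties names a YARN queue:

```python
# Minimal sketch of an ADFv1 HDInsightHive activity, expressed as a Python dict
# that mirrors the pipeline JSON. All names below are hypothetical placeholders.
hive_activity = {
    "name": "RunHiveScript",
    "type": "HDInsightHive",
    "linkedServiceName": "HDInsightLinkedService",    # compute linked service
    "typeProperties": {
        "scriptPath": "scripts/myscript.hql",         # Hive script in blob storage
        "scriptLinkedService": "StorageLinkedService",
        "defines": {"sampleVar": "sampleValue"}       # parameters passed to the script
    }
    # ...inputs/outputs/policy/scheduler omitted; no property here selects a YARN queue.
}
```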
I would suggest you vote up an idea submitted by another Azure customer.
https://feedback.azure.com/forums/270578-data-factory/suggestions/32956186-hdinsightspark-activity-should-support-additional
All of the feedback you share in these forums will be monitored and reviewed by the Microsoft engineering teams responsible for building Azure.
I need to develop an event-driven pipeline that should be triggered on file arrival in ADLS Gen2, i.e. ABFS. On file arrival I need to trigger 4 subsequent Spark jobs on an Azure Databricks cluster.
For orchestrating the Spark jobs I can use Databricks Jobs so that the jobs are triggered as a pipeline.
But the first job should be triggered only after the file arrives.
I am currently exploring ways to achieve this but need expert advice on how to design it in the best possible manner with respect to cost.
One solution could be to use Azure Data Factory to orchestrate the entire flow with a Storage Event Trigger, but adopting ADF just for the event-based trigger doesn't seem worthwhile to me, since the rest of the application, i.e. the Spark jobs, can be pipelined with the Databricks Jobs feature. Also, in terms of cost, ADF can be expensive. Another solution could be to use an Azure Functions Blob Trigger to detect the file arrival, but I am not able to understand how I can trigger Azure Databricks jobs from Azure Functions. Going with Functions could be cost-effective, since the function would not be running/active until a file has arrived.
Note: There can be multiple files arriving in an hour, and there is no fixed schedule for file arrival.
Also, the trigger file is different from the data files, i.e. on arrival of a trigger file, the Spark pipeline consumes the actual data files.
Data files and trigger files have different extensions, and both arrive in ABFS.
Your worry about ADF cost is misplaced. Pipelines themselves are extremely cheap; the activities that actually move data and use CPU are where most of the cost is. For instance, Data Flows run on managed Spark clusters, which is reflected in the pricing. See Data Factory Pricing. Using a pipeline to orchestrate Databricks jobs is a common, simple, and (at least on the ADF side) very inexpensive pattern.
If you want to kick off a Databricks job from an Azure Function, there's an API for that (see the sketch below). Also check out Databricks Auto Loader, though running your Databricks cluster continuously for it can be expensive.
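As a minimal sketch of the Azure Functions route (assuming the Python programming model for Functions, a pre-created Databricks job, and placeholder names for the container path and app settings), a blob-triggered function can start the first job through the Jobs API run-now endpoint:

```python
import os
import requests
import azure.functions as func

app = func.FunctionApp()

# Fires when a trigger file lands in the container (path and suffix are placeholders).
@app.blob_trigger(arg_name="triggerfile",
                  path="landing/{name}.trigger",      # hypothetical container and extension
                  connection="AzureWebJobsStorage")
def start_spark_pipeline(triggerfile: func.InputStream):
    host = os.environ["DATABRICKS_HOST"]            # e.g. https://adb-<workspace-id>.azuredatabricks.net
    token = os.environ["DATABRICKS_TOKEN"]          # PAT or AAD token stored as an app setting
    job_id = int(os.environ["DATABRICKS_JOB_ID"])   # the first job of the 4-job pipeline

    # Kick off the Databricks job; the remaining jobs run as its downstream jobs/tasks.
    resp = requests.post(f"{host}/api/2.1/jobs/run-now",
                         headers={"Authorization": f"Bearer {token}"},
                         json={"job_id": job_id},
                         timeout=30)
    resp.raise_for_status()
```

Filtering the trigger path on the trigger-file extension keeps the data files themselves from starting extra runs, which matches the trigger-file/data-file split described in the question.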
Problem
Due to internal requirements, I need to run a Synapse pipeline and then trigger an ADF pipeline. It does not seem that there is a Microsoft-approved method of doing this. The pipelines run infrequently (every week or month) and the ADF pipeline must run after the Synapse pipeline.
Options
It seems that other answers pose several options:
Azure Functions. Create an Azure function that calls the CreatePipelineRun function on the ADF pipeline. At the end of the Synapse pipeline, insert a block that calls the Azure function.
Use the REST API and Web Activity. Use the REST API to make a call to run the ADF pipeline. Insert a Web Activity block at the end of the Synapse pipeline to make the API call (see the sketch after this list).
Tables and polling. Insert a record into a table in a managed database with data about the Synapse pipeline run. Have regular polling from the ADF pipeline to check for new records and run when ready.
Storage Event. Create a timestamped blob file at the end of the Synapse run. Use the "storage event trigger" within ADF to trigger the ADF pipeline.
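To make the REST API option concrete, here is a rough sketch of the call the Web Activity (or an Azure Function) would make against the Data Factory REST API; the subscription, resource group, factory, and pipeline names are placeholders:

```python
import requests
from azure.identity import DefaultAzureCredential  # pip install azure-identity

# Placeholder identifiers -- substitute your own.
SUB, RG = "<subscription-id>", "<resource-group>"
FACTORY, PIPELINE = "<data-factory-name>", "<adf-pipeline-name>"

# Acquire an AAD token for Azure Resource Manager.
token = DefaultAzureCredential().get_token("https://management.azure.com/.default").token

url = (f"https://management.azure.com/subscriptions/{SUB}/resourceGroups/{RG}"
       f"/providers/Microsoft.DataFactory/factories/{FACTORY}"
       f"/pipelines/{PIPELINE}/createRun?api-version=2018-06-01")

# Start the ADF pipeline; the response carries the runId for later status checks.
resp = requests.post(url, headers={"Authorization": f"Bearer {token}"}, json={})
resp.raise_for_status()
print(resp.json()["runId"])
```

In a Synapse Web Activity you would point at the same URL and authenticate with the workspace's managed identity (resource https://management.azure.com/) instead of fetching a token in code.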
Question
Which of these would be closest to the "approved" option? Are there any clear disadvantages to any of these?
As you mentioned, there is no "approved" solution for this problem. All the approaches you mentioned have pros and cons and should work. For me, Option #3 has been very successful. We have built a Queue Manager based on Tables & Stored Procedures in Azure SQL. We use Logic Apps to process the Triggers which can be Scheduled, Blob Events, or REST calls. Those Logic Apps insert jobs in the Queue table via Stored Procedure. That Stored Procedure can be called directly by virtually any system, so your Synapse pipeline could insert a Queue job to execute the ADF pipeline. Other benefits include a log of all the pipeline runs, support for multiple Data Factories (and now Synapse Workspaces), and a web interface we wrapped around the database for management and tracking.
We have 2 other Logic Apps that process the Queue (a Status manager and an Executor). These run constantly (every 1 minute and every 3 minutes). The actions to check status and create pipeline runs are both implemented as .NET Azure Functions [you'll need different SDKs for Synapse vs. ADF]. This system runs thousands of pipelines a month, sometimes more, across numerous Data Factories and Synapse Workspaces.
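For illustration only (the functions described above are .NET, not this code), the equivalent create-run and status-check calls with the Python ADF management SDK look roughly like this; all names are placeholders, and Synapse pipelines need the separate azure-synapse-artifacts package:

```python
from azure.identity import DefaultAzureCredential               # pip install azure-identity
from azure.mgmt.datafactory import DataFactoryManagementClient  # pip install azure-mgmt-datafactory

# Placeholder identifiers -- substitute your own.
client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# "Executor" side: start a pipeline run.
run = client.pipelines.create_run("<resource-group>", "<factory-name>",
                                  "<pipeline-name>", parameters={})

# "Status manager" side: check the run until it reaches a terminal state.
status = client.pipeline_runs.get("<resource-group>", "<factory-name>", run.run_id).status
print(run.run_id, status)   # e.g. "Queued", "InProgress", "Succeeded", "Failed"
```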
The PROs here are many: this disconnected approach lets facets of your system operate in isolation, and it is flexible, in that you can tie virtually any system into the Queue. Your scenario of a pipeline that needs to execute another pipeline in a different system is a perfect example.
The CON here is that this is the most involved approach. If this is a one-off problem you are trying to solve, choose one of the other options.
I've successfully integrated Snowpipe with a container inside Azure Storage and loaded data into my target table, but I can't quite figure out how Snowpipe actually works. Also, please let me know if there is already a good resource that answers this question; I'd be very grateful.
In my example, I tested the Snowpipe mechanism that uses cloud messaging. From my understanding, when a file is uploaded into an Azure container, Azure Event Grid sends an event message to an Azure queue, from which Snowpipe is notified that a new file has been uploaded into the container. Snowpipe then starts its loading process in the background and imports the data into the target table.
If this is correct, I don't understand how the Azure queue informs Snowpipe about uploaded files. Is this connected to the "notification integration" inside Snowflake? Also, I don't understand what it means when the Snowflake documentation says that "Snowpipe copies the files into a queue, from which they are loaded into the target table...". Is this an Azure queue or some Snowflake-side queue?
I hope this question makes sense, any help or detailed explanation of the whole process is appreciated!
You've pretty much nailed it. To answer your specific questions (and don't feel bad about asking them, this is definitely confusing):
How does the Azure queue inform Snowpipe about uploaded files? Is this connected to the "notification integration" inside Snowflake?
Yes, this is the notification integration. But Azure is not "informing" Snowpipe; it's the other way around. The Azure queue creates notifications that various other applications can subscribe to (this has no awareness of Snowflake). The notification integration on the Snowflake side is Snowflake's way of integrating with these external notifications.
Snowpipe's queueing
Once Snowflake receives one of these notifications, it puts that notification into a Snowflake-side queue (or, according to that page, the file itself; I was surprised by this, but the end result is the same). Snowpipes are wired up to that notification integration (as part of the CREATE PIPE statement). The files are directed to the appropriate Snowpipe based on the information in the stage (also part of the pipe's CREATE statement; I'm actually not certain whether this part is a push or a pull). The pipe then runs its COPY INTO on that file.
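As an illustration of how those pieces are wired together, here is a rough sketch using snowflake-connector-python; every object name, URI, and credential below is a placeholder, not part of the original setup:

```python
import snowflake.connector  # pip install snowflake-connector-python

# Placeholder connection details.
cur = snowflake.connector.connect(account="<account>", user="<user>", password="<password>").cursor()

# 1) The notification integration subscribes Snowflake to the Azure storage queue
#    that Event Grid writes blob-created events into.
cur.execute("""
    CREATE NOTIFICATION INTEGRATION azure_event_int
      ENABLED = TRUE
      TYPE = QUEUE
      NOTIFICATION_PROVIDER = AZURE_STORAGE_QUEUE
      AZURE_STORAGE_QUEUE_PRIMARY_URI = 'https://<storage-account>.queue.core.windows.net/<queue-name>'
      AZURE_TENANT_ID = '<tenant-id>'
""")

# 2) The stage points at the container where the files land.
cur.execute("""
    CREATE STAGE my_azure_stage
      URL = 'azure://<storage-account>.blob.core.windows.net/<container>/'
      CREDENTIALS = (AZURE_SAS_TOKEN = '<sas-token>')
""")

# 3) The pipe ties the stage and the integration together: matching notifications
#    end up in Snowpipe's internal queue and the COPY INTO runs automatically.
cur.execute("""
    CREATE PIPE my_pipe
      AUTO_INGEST = TRUE
      INTEGRATION = 'AZURE_EVENT_INT'
      AS COPY INTO my_target_table FROM @my_azure_stage
""")
```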
I have a time-consuming custom activity running in an Azure Data Factory pipeline.
It copies files from Blob storage to an FTP server recursively.
The entire activity takes 3-4 hours, depending on the number of files in the folder.
But while the pipeline is running, it shows the progress as 0%.
How can I update the pipeline progress from the custom activity?
In short, I doubt you will be able to. The services are very disconnected from each other.
You might be better off writing out to the generic Azure activity log and monitoring progress directly from the custom activity method. This is an assumption though.
Hope this helps.
I am new to using on-demand HDInsight. I have a basic question:
I have multiple activities running simultaneously in separate ADF pipelines, each using an HDInsight on-demand linked service. How many HDInsight instances get created? Is it one instance per activity?
I got a bit confused because the documentation states that each instance created has a time-to-live value, and if a new job arrives within that window the instance will process it. Does the new job need to come from an activity in the same pipeline that originally created the instance, or is the instance shared across activities in other pipelines?
Also, I just wanted to confirm my understanding that the cores used by on-demand instances do not count towards the subscription's usage count.
Really sorry if the questions are very basic, but any help is very much appreciated.
Partial answers to my questions are provided below; refer to the comments section above for the open points.
The answer on sharing an instance across pipelines is in the documentation: "If the timetolive property value is appropriately set, multiple pipelines can share the instance of the on-demand HDInsight cluster."
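For reference, a minimal sketch of where timeToLive is set on the on-demand linked service, expressed as a Python dict that mirrors the ADFv1 JSON (names and values are placeholders):

```python
# ADFv1 on-demand HDInsight linked service, mirrored as a Python dict.
# timeToLive is the idle window during which the cluster stays alive and can be
# reused by later activities, including activities in other pipelines.
on_demand_hdinsight = {
    "name": "HDInsightOnDemandLinkedService",
    "properties": {
        "type": "HDInsightOnDemand",
        "typeProperties": {
            "clusterSize": 4,             # worker node count (placeholder)
            "timeToLive": "00:30:00",     # keep the idle cluster for 30 minutes
            "version": "3.6",             # HDInsight version (placeholder)
            "linkedServiceName": "AzureStorageLinkedService"  # storage used by the cluster
        }
    }
}
```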
Regarding my other question on CPU limits for HDInsight: per the Azure limits documentation, the on-demand HDInsight core limit is 60 cores per subscription, and this is different from the general per-subscription core limit.
Also, interestingly, for manually created HDInsight clusters there is a separate CPU limit, as mentioned in this Stack Overflow link. As of today it is 170 cores per subscription, obtainable by issuing the PowerShell command Get-AzureHDInsightProperties. Again, I understand this limit is different from the subscription's general core limit.