Trigger Dataflow job on file arrival in GCS using Cloud Composer - python-3.x

I'm new to GCP and trying to gain a better understanding of the ETL services offered by Google Cloud. I'm trying to do a POC where a Dataflow job runs whenever a file arrives at a specified location in GCS, and I want to use Cloud Composer for this.
How can I use different Airflow operators (one to watch for a file in GCS and another to trigger the Dataflow job) in a DAG?

Create a Cloud Function that is triggered when a file is created in / arrives at your bucket. You can find how to deploy such a function in the following documentation: https://cloud.google.com/functions/docs/calling/storage#deploy
Code your function to trigger your DAG; this is well explained in the following documentation: https://cloud.google.com/composer/docs/how-to/using/triggering-with-gcf
Use one of the following Airflow operators to run a Dataflow job as a task within your DAG (a minimal sketch follows below):
DataflowCreateJavaJobOperator if you are using the Java SDK
DataflowCreatePythonJobOperator if you are using the Python SDK
More details here: https://airflow.apache.org/docs/apache-airflow-providers-google/2.1.0/operators/cloud/dataflow.html
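For illustration, here is a minimal sketch of such a DAG using DataflowCreatePythonJobOperator. The DAG id, bucket, pipeline file, and job options are hypothetical placeholders, and the DAG has no schedule because the Cloud Function triggers it:

from datetime import datetime

from airflow import models
from airflow.providers.google.cloud.operators.dataflow import DataflowCreatePythonJobOperator

with models.DAG(
    dag_id="gcs_to_dataflow",          # hypothetical DAG id
    schedule_interval=None,            # no schedule: the Cloud Function triggers this DAG
    start_date=datetime(2021, 1, 1),
    catchup=False,
) as dag:
    run_dataflow = DataflowCreatePythonJobOperator(
        task_id="run_dataflow_job",
        py_file="gs://my-bucket/pipelines/my_pipeline.py",  # hypothetical Beam pipeline
        job_name="gcs-triggered-job",
        location="us-central1",
        options={
            # the Cloud Function can pass the arriving file name via dag_run.conf
            "input": "gs://my-bucket/incoming/{{ dag_run.conf.get('file', '') }}",
            "temp_location": "gs://my-bucket/tmp/",
        },
    )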
That answers your question about Composer, but you should consider replacing Composer with Cloud Workflows: rather than using an Airflow Dataflow operator, you can use the Cloud Workflows Dataflow connector, and code your Cloud Function to trigger a Cloud Workflows execution instead. There are client libraries in several languages to do that. Here is the link to choose your library: https://cloud.google.com/workflows/docs/reference/libraries
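As a hedged sketch of that last step, the Cloud Function could start a Workflows execution with the Python client library roughly like this (the project, location, and workflow names are placeholders):

import json

from google.cloud.workflows import executions_v1

def trigger_workflow(event, context):
    # the Cloud Storage event payload carries the bucket and object name
    client = executions_v1.ExecutionsClient()
    parent = "projects/my-project/locations/us-central1/workflows/my-workflow"  # hypothetical
    client.create_execution(
        request={
            "parent": parent,
            "execution": {
                "argument": json.dumps({"bucket": event["bucket"], "file": event["name"]})
            },
        }
    )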
Finally, some Cloud Workflows pros compared to Composer:
it's serverless: you don't have to worry about machine types, networking, ...
you use YAML to describe your workflow in an easy and fast way, so you don't have to learn Airflow
it's considerably cheaper than Composer
...

Related

How to invoke Job/Task in Azure Databricks from Azure Function

I need to develop an event-driven pipeline which should be triggered on file arrival in ADLS Gen2, i.e. ABFS. On file arrival I need to trigger 4 subsequent Spark jobs on an Azure Databricks cluster.
For orchestrating the Spark jobs I can use Databricks Jobs so that the jobs are triggered in a pipeline.
But the first job should be triggered only after the file arrives.
I am currently exploring ways to achieve this but need expert advice on how to design it in the best possible manner with respect to cost.
One solution could be to use Azure Data Factory to orchestrate the entire flow based on its Storage Event Trigger component, but going for ADF just for the event-based trigger doesn't look sensible to me, as the rest of the application (the Spark jobs) can be pipelined with the Databricks Jobs feature. Also, in terms of cost, ADF can be expensive. Another solution could be to use an Azure Functions Blob Trigger to detect the file arrival, but I cannot work out how to trigger Azure Databricks jobs from Azure Functions. Going with Functions could be cost effective, as the function would not be running/active until the file has arrived.
Note: there can be multiple files arriving in an hour, and there is no fixed schedule for file arrival.
Also, the trigger file is different from the data files, i.e. on arrival of a trigger file, the Spark pipeline would consume the actual data files.
Data files and trigger files have different extensions, and both arrive in ABFS.
Your worry about ADF cost is misplaced. The Pipelines are extremely cheap; the activities that actually move data and use CPU are where most of the cost is. For instance, Data Flows are run on managed Spark clusters, which is reflected in the pricing. See Data Factory Pricing. Using a Pipeline to orchestrate Databricks jobs is a common, simple, and (at least for ADF) very inexpensive pattern.
If you want to kick off a Databricks job from an Azure Function, there's an API for that. Also check out Databricks Auto Loader, but note that running your Databricks cluster continuously can be expensive.
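For instance (a rough sketch rather than a drop-in solution), a Python Azure Function with a Blob trigger could call the Jobs API run-now endpoint; the workspace URL, token handling, trigger-file extension, and job id below are all hypothetical:

import os

import azure.functions as func
import requests

def main(myblob: func.InputStream):
    # only react to trigger files, not to the data files themselves
    if not myblob.name.endswith(".trg"):            # hypothetical trigger-file extension
        return
    workspace_url = os.environ["DATABRICKS_HOST"]   # e.g. https://adb-xxxx.azuredatabricks.net
    token = os.environ["DATABRICKS_TOKEN"]          # personal access token from app settings
    resp = requests.post(
        f"{workspace_url}/api/2.1/jobs/run-now",
        headers={"Authorization": f"Bearer {token}"},
        json={"job_id": 123},                        # hypothetical Databricks job id
        timeout=30,
    )
    resp.raise_for_status()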

How to Trigger ADF Pipeline from Synapse Pipelines

Problem
Due to internal requirements, I need to run a Synapse pipeline and then trigger an ADF pipeline. It does not seem that there is a Microsoft-approved method of doing this. The pipelines run infrequently (every week or month) and the ADF pipeline must run after the Synapse pipeline.
Options
It seems that other answers pose several options:
Azure Functions. Create an Azure Function that calls the CreatePipelineRun function on the ADF pipeline. At the end of the Synapse pipeline, insert a block that calls the Azure Function (a sketch of this call follows the list below).
Use the REST API and Web Activity. Use the REST API to make a call to run the ADF pipeline. Insert a Web Activity block at the end of the Synapse pipeline to make the API call.
Tables and polling. Insert a record into a table in a managed database with data about the Synapse pipeline run. Have regular polling from the ADF pipeline to check for new records and run when ready.
Storage Event. Create a timestamped blob file at the end of the Synapse run. Use the "storage event trigger" within ADF to trigger the ADF pipeline.
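As an illustration of option 1, an Azure Function could call CreatePipelineRun through the azure-mgmt-datafactory SDK along these lines; the subscription, resource group, factory, and pipeline names are placeholders:

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

credential = DefaultAzureCredential()
adf_client = DataFactoryManagementClient(credential, "<subscription-id>")  # hypothetical

# start the ADF pipeline run
run = adf_client.pipelines.create_run(
    resource_group_name="my-rg",            # hypothetical
    factory_name="my-adf",                  # hypothetical
    pipeline_name="my-pipeline",            # hypothetical
    parameters={"triggeredBy": "synapse"},
)

# check the status of the run
print(adf_client.pipeline_runs.get("my-rg", "my-adf", run.run_id).status)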
Question
Which of these would be closest to the "approved" option? Are there any clear disadvantages to any of these?
As you mentioned, there is no "approved" solution for this problem. All the approaches you mentioned have pros and cons and should work. For me, Option #3 has been very successful. We have built a Queue Manager based on Tables & Stored Procedures in Azure SQL. We use Logic Apps to process the Triggers which can be Scheduled, Blob Events, or REST calls. Those Logic Apps insert jobs in the Queue table via Stored Procedure. That Stored Procedure can be called directly by virtually any system, so your Synapse pipeline could insert a Queue job to execute the ADF pipeline. Other benefits include a log of all the pipeline runs, support for multiple Data Factories (and now Synapse Workspaces), and a web interface we wrapped around the database for management and tracking.
We have 2 other Logic Apps that process the Queue (a Status manager and an Executor). These run constantly (every 1 minute and every 3 minutes). The actions to check status and create pipeline runs are both implemented as .NET Azure Functions [you'll need different SDKs for Synapse vs. ADF]. This system runs thousands of pipelines a month, sometimes more, across numerous Data Factories and Synapse Workspaces.
The PROs here are many, but this disconnected approach permits facets of your system to operate in isolation. And it is flexible, in that you can tie virtually any system into the Queue. Your example of a pipeline that needs to execute another pipeline in a different system is a perfect example.
The CON here is that this is the most involved approach. If this is a one-off problem you are trying to solve, choose one of the other options.

Is it possible to use Control-M to orchestrate Azure Data Factory jobs

Is it possible to use Control-M to orchestrate Azure Data Factory jobs?
I found this agent that can be installed on a VM:
https://azuremarketplace.microsoft.com/en-us/marketplace/apps/bmc-software.ctm-agent-linux-arm
But I didn't find documentation about it.
Can Control-M call a REST API to run and monitor a job? I could use Azure Functions and Blobs to control it.
All Control-M components can be installed and operated on Azure (and most other cloud infrastructure). Either use the link you quote or alternatively deploy Agents using the Control-M Automation API (AAPI), or a combination of the two.
As long as you are on a fairly recent version of Control-M, you can do most operational tasks through the API; for example, you can monitor a job like so:
ctm run jobs:status::get -s "jobid=controlm:00001"
The Control-M API is developing quickly; check out the documentation linked from here:
https://docs.bmc.com/docs/automation-api/9019100monthly/services-872868740.html#Control-MAutomationAPI-Services-ctmrunjob:status::get
Also see:
https://github.com/controlm/automation-api-quickstart
http://controlm.github.io
https://docs.bmc.com/docs/display/public/workloadautomation/Control-M+Automation+API+-+Services
https://52.32.170.215:8443/automation-api/swagger-ui.html
At this time, I don't believe you will find any out-of-the-box connectors for Control-M to Azure Data Factory integration. You do have some other options, though!
Proxy ADF Yourself
You can write the glue code for this yourself, essentially acting as the mediator between the two.
Write a program that will invoke the ADF REST API to run a pipeline.
Details Here
After triggering the pipeline, write the code to monitor its status.
Details Here
Have Control-M call your code via an Agent that has access to it.
I've done this with a C# console app running on a local server, and a Control-M Agent that invokes the glue code.
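For reference, the same glue logic sketched in Python against the ADF REST API would look roughly like the following; every identifier below is a placeholder:

import time

import requests
from azure.identity import DefaultAzureCredential

SUB, RG, FACTORY, PIPELINE = "<sub-id>", "<resource-group>", "<factory>", "<pipeline>"  # hypothetical
BASE = (f"https://management.azure.com/subscriptions/{SUB}/resourceGroups/{RG}"
        f"/providers/Microsoft.DataFactory/factories/{FACTORY}")

token = DefaultAzureCredential().get_token("https://management.azure.com/.default").token
headers = {"Authorization": f"Bearer {token}"}

# trigger the pipeline run
run_id = requests.post(f"{BASE}/pipelines/{PIPELINE}/createRun?api-version=2018-06-01",
                       headers=headers).json()["runId"]

# poll until the run reaches a terminal state
while True:
    status = requests.get(f"{BASE}/pipelineruns/{run_id}?api-version=2018-06-01",
                          headers=headers).json()["status"]
    if status in ("Succeeded", "Failed", "Cancelled"):
        break
    time.sleep(30)
print(f"Pipeline run {run_id} finished with status {status}")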
The Control-M documentation here also describes a way for you to execute an Azure Function directly from Control-M. This means you could put your glue code in an Azure Function.
Details Here
ALTERNATIVE METHOD
For a "no code" way, check out this Logic App connector.
Write a Logic App that runs the pipeline and gets the pipeline run to monitor its status in a loop.
Next, Control-M should be able to use a plugin to invoke the Logic App.
Notes
Note that Control-M requires an HTTP trigger for Azure Functions and Logic Apps.
You might also be able to take advantage of the Control-M Web Services plugin, though in my experience its lack of support for different authentication methods was disappointing.
Hope this helps!
I just came across this post, so I'm a bit late to the party.
Control-M includes Application Integrator, which enables you to use integrations created by others and to either enhance them or build your own. You can use REST or the CLI to instruct Control-M which requests should be sent to an application when a job is started, during execution and monitoring, and how to analyze results and collect output.
A public repository accessible from Application Integrator shows existing jobs, and there is one for Data Factory. I have extended it a bit so that the Data Factory pipeline is started and monitored to completion via REST, and then a PowerShell script is invoked to retrieve the pipeline run information for each activity within the pipeline.
I've posted that job and script in https://github.com/JoeGoldberg/automation-api-community-solutions/tree/master/4-ai-job-type-examples/CTM4AzureDataFactory but the README is coming later.

Is there any alternative to WebJobs in AWS (like in Azure)?

I need to implement scheduled tasks, so that every X amount of time the job starts running and launches an .exe file.
I did those tasks in Azure very easily, but I can't find anything appropriate in Amazon Web Services.
Can you tell me if there is something in AWS similar to Azure WebJobs?
The AWS service that most closely fits your needs is AWS Lambda, but as your comment states, you do not want to code.
When comparing AWS to other cloud providers, it stands out that AWS focuses on very primitive services that can be connected to build complex systems. This is an advantage, as you can tailor the cloud to your needs; however, it can be more complex to set up compared to a PaaS.

Automating Azure Machine Learning

Is there a way of automating the calls to the Azure Machine Learning Service (AML)?
I've created the web service from AML. Now I have to make the calls in an automated way. I'm trying to build a system that connects to a Raspberry Pi for sensor data and gets a prediction from the ML service to be saved along with the data itself.
Is there something in Azure to automate this or should I do it within the application?
I'm assuming you've created the web service from the experiment and are asking about consuming it. You can consume the web service from anything that can make an API call to the endpoint (a minimal example of such a call is sketched below). I don't know the exact architecture of your solution, but also take a look at the following approach, as it might suit your scenario.
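As a hedged sketch (the endpoint URL, API key, and input schema below are placeholders that depend on your published web service), a request-response call from Python could look like this:

import requests

# hypothetical request-response endpoint and key, copied from the web service dashboard
url = ("https://<region>.services.azureml.net/workspaces/<workspace-id>"
       "/services/<service-id>/execute?api-version=2.0&details=true")
api_key = "<your-api-key>"

payload = {
    "Inputs": {
        # column names and values must match your web service's input schema
        "input1": {"ColumnNames": ["temperature", "humidity"], "Values": [["21.5", "40"]]},
    },
    "GlobalParameters": {},
}

resp = requests.post(url, json=payload,
                     headers={"Authorization": f"Bearer {api_key}"},
                     timeout=30)
resp.raise_for_status()
print(resp.json())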
Stream Analytics on Azure has a feature called Functions (just a heads-up, it's still in preview) that can automate the usage of deployed ML services from your account. Since you are trying to gather info from IoT devices, you might use Event Hubs or IoT Hub to get the data and process it using Stream Analytics, and during that processing you can use the web service as a Function in SA to get on-the-go ML results.
Usage is relatively simple if you are familiar with Stream Analytics or SQL queries in general. This link shows the step-by-step implementation, and the usage is below:
WITH subquery AS (
    SELECT text, "webservicealias"(text) AS result FROM input
)
SELECT text, result.[Score]
INTO output
FROM subquery
Hope this helps!
Mert
You can also schedule this automatically using a PowerShell command and any task scheduler.
PowerShell for Azure ML: https://github.com/hning86/azuremlps and its usage is described here: https://github.com/hning86/azuremlps#invoke-amlwebservicerrsendpoint
Task Scheduler for PowerShell: http://www.metalogix.com/help/Content%20Matrix%20Console/SharePoint%20Edition/002_HowTo/004_SharePointActions/012_SchedulingPowerShell.htm
