Http requests in azure data factory to retrieve xml data and store it in azure storage blob - azure

Previously we made logic app in azure where we used http request to retrieve xml file from our clients system.
It goes like that:
HTTP request --> response body is xml data --> we save that xml data in azure blob storage as xml file.
My question is how and if its possible to do the same thing in azure data factory?
Reason for us to move this process over to data factory is that we also need to execute sql server stored procedures there and in logic app there is that 2 minute timeout and some of our procedures run longer than 2 min.

If you're looking for a way to manually trigger an Azure Data Factory pipeline,
You can manually run your pipeline by using one of the following methods:
.NET SDK
Azure PowerShell module
REST API
Python SDK
The following sample command shows you how to run your pipeline by using the REST API manually:
POST
https://management.azure.com/subscriptions/mySubId/resourceGroups/myResourceGroup/providers/Microsoft.DataFactory/factories/myDataFactory/pipelines/copyPipeline/createRun?api-version=2017-03-01-preview
More information: Manual execution (on-demand) with JSON
There are more questions to be answered, however, like "can we increase the timeout for the Logic App" (yes, see HTTP request limits - Timeout duration), "does the Logic App need to wait for the Stored Procedures to complete" and "is Data Factory the best tool for the job". The best answer to your question depends on the answer to all of these questions.
Based on the information you provided, running the logic in a different way like a Logic App on an Integrated Service Environment or an Azure Function feels like the best option.

Related

Using Azure Data Factory to ingest incoming data from a REST API

Is there a way to create an Azure ADF Pipeline to ingest the incoming POST requests? I have this gateway app (outside Azure) that is able to publish data via REST as it arrives from the application and this data needs to be ingested into a Data Lake. I am utilizing the REST calls from another pipeline to pull the data but this basically needs to do the reverse - the data will be pushed and i need to be constantly 'listening' to those calls...
Is this something an ADF pipeline should do or maybe there are any other Azure components able to do it?
Previous comment is right and is one of the approach to get it working but would need bit of coding (for azure function).
There could also be an alternate solution to cater to your requirement is with Azure Logic Apps and Azure data factory.
Step 1: Create a HTTP triggered logic app which would be invoked by your gateway app and data will be posted to this REST callable endpoint.
Step 2: Create ADF pipeline with a parameter, this parameter holds the data that needs to be pushed to the data lake. It could be raw data and can be transformed as a step within the pipeline before pushing it to the data lake.
Step 3: Once logic app is triggered, you can simply use Azure data factory actions to invoke the data factory pipeline created in step 2 and pass the posted data as a pipeline parameter to your ADF pipeline.
This should be it, with this - you can spin up your code-less solution.
If your outside application is already pushing via REST, why not have it make calls directly to the Data Lake REST APIs? This would cut out the middle steps and bring everything under your control.
Azure Data Factory is a batch data movement service. If you want to push the data over HTTP, you can implement a simple Azure Function to accept the data and write it to the Azure Data Lake.
See Azure Functions HTTP triggers and bindings overview

Can azure event hub ingest json events from azure blog storage without writing any code?

Is it possible to use some ready made construct in azure cloud environment to ingest the events (in json format) that are currently stored in azure blob storage and have it submit those events directly to azure event hub without writing any (however small) custom code? In other words, I would like to use configuration driven approach only.
Sure. You can try to use Azure Logic Apps to realize your needs without any code or just with some function expressions, please refer to the offical documents of Azure Logic Apps to know more details.
The logic flow is as the figure below.
You can refer to my sample below to make it works.
Here is my sample to receive an event from my EventHub and transfer to Azure Blob Storage to create a new blob for storing the event data.
Create an Azure Logic App instance on Azure portal, it should be easy for you.
Move to the tab Logic app designer to configure the logic flow.
Click Save and Run buttons. Then, use ServiceBusExplorer (downloaded from https://github.com/paolosalvatori/ServiceBusExplorer/releases) to send event message and check whether new blob created using AzureStorageExplorer. It works fine after a few minutes.

Data from HTTP endpoint to be loaded into Azure Data Lake using Azure Data Factory

I am trying to build a so called "modern data warehouse" using Azure services.
First step is to gather all the data in its native raw format into Azure Data Lake store. For some of the data sources we have no other choice than to use API for consuming the data. There's not much information when searching, therefore I am asking.
Is it possible to define 2 Web Activities in my pipeline that will handle below scenario?
Web1 activity gets an API URL generated from C# (Azure Function). It returns data in JSON format and saves it to Web1.Output - this is working fine.
Web2 activity consumes Web1.Output and saves it into Azure Data Lake as a plain txt file (PUT or POST) - this is needed.
Above scenario is achievable by using Copy activity, but then I am not able to pass dynamic URL generated by Azure Functions. How do I save the JSON output to ADL? Is there any other way?
Thanks!
Since you are using blob storage as an intermediary, and want to consume the blob upon creation, you could take advantage of Event Triggers. You can set up the Event trigger to run a pipeline containing Web2 activity. Which kicks off when the Web1 activity completes (separate pipeline).
By separating the two activities into separate pipelines, the workflow becomes asynchronous. This means you will not need to wait for both activities to complete before doing the next URL. There are many other benefits as well.

Taking parameters from manual triggers in ADF

Usecase
We have an on-premise Hadoop setup and we are using power BI as a BI visualization tool. What we do currently to get data on Powerbi is as follows.
Copy data from on-premise to Azure Blob(Our on-premise schedule does this once the data is ready in Hive)
Data from Azure Blob is then copied to Azure-DataWarehouse/Azure-SQL
Cube refreshes on Azure AAS, AAS pulls data from Azure DataWarehouse/SQL
To do the step2 and step3 we are currently running a web server on Azure and the endpoints are configured to take few parameters like the table name, azure file location, cube information and so on.
Sample http request:
http://azure-web-server-scheduler/copydata?from=blob&to=datawarehouse&fromloc=myblob/data/today.csv&totable=mydb.mytable
Here the web servers extract the values from variables(from, fromloc, to, totable) and them does the copy activity. We did this as we had a lot of tables and all could reuse the same function.
Now we have use cases piling up(retries, control flows, email alerts, monitoring) and we are looking for a cloud alternative to do the scheduling job for us, we would still like to hit an HTTP endpoint like the above one.
One of the alternatives we have checked till now is the Azure Data Factory, where are create pipelines to achieve the steps above and trigger the ADF using http endpoints.
Problems
How can we take parameters from the http post call and make it available as custom variables[1], this is required within the pipeline so that we can still write a function for each step{2, 3} and the function can take these parameters, we don't want to create an ADF for each table.
How can we detect for failure in ADF steps and send email alerts during failures?
What are the other options apart from ADF to do this in Azure?
[1] https://learn.microsoft.com/en-us/azure/data-factory/control-flow-system-variables
You could trigger the copy job from blob to SQL DW via a Get Metadata Acitivity. It can be used in the following scenarios:
- Validate the metadata information of any data
- Trigger a pipeline when data is ready/ available
For eMail notification you can use a Web Activity calling a LogicApp. See the following tuturial how to send an email notification.

Is Azure Data Factory suitable for downloading data from non-Azure REST APIs?

Consider a data processing pipeline as follows:
Fetch a large amount of data from a REST API that's hosted somewhere on the internet and persist it to a data store.
Perform some complex data transformations on the persisted data.
Persist the results of the data transformations on a data store.
Aiming to implement such a pipeline in Azure, steps 2 and 3 seem like a good fit for implementation as Azure Data Factory activities.
My questions is - Does it make sense to implement step 1 in an Azure Data Factory activity as well?
Technically it might be possible to code a .Net activity that perform the data download and persistence.
No - do not implement step 1 in an Azure Data Factory activity.
Technically it is possible to run the entire process from ADF but I would argue that the choice is more costly (relatively) than other options available to you because you will pay for each activity in Azure Data Factory.
For instance, what if the rest api does not have any new data to offer when you initiate the (scheduled) activity? You'll pay for that.
You might consider the following as an easy to implement alternative:
1 - Create a .NET console app, publish as a WebJob, schedule to run daily.
2 - The long-running console app can query the rest api, persist data into azure storage / documentdb, push a message into queue which triggers ADF steps 2/3 to run against the saved data.
I have done exactly that using .Net Activity. I had a need to fetch data from Salesforce api. This has been working well for my needs. Here is a post I wrote up about creating a .net activity and storing the data in azure data lake.
As in Newport99's answer yes you will incur costs for that activity but I am not sure how cost effect it would be to be running a separate web app to host a web job and also run the Azure Data Factory pipeline. When I was originally designing a solution the WebJob was my first choice but in the end I prefer to have the whole solution utilizing one azure service instead of multiple.
Hope that helps.
There have been a lot of improvements to ADF in the years since this question was posted, including a REST connector.
Here's the approach recommended by ADF at this time...
Copy data from a REST endpoint by using Azure Data Factory

Resources