Using Azure Data Factory to ingest incoming data from a REST API

Is there a way to create an Azure ADF pipeline to ingest incoming POST requests? I have a gateway app (outside Azure) that publishes data via REST as it arrives from the application, and this data needs to be ingested into a Data Lake. In another pipeline I use REST calls to pull data, but this basically needs to do the reverse: the data will be pushed, and I need to be constantly 'listening' for those calls...
Is this something an ADF pipeline should do, or is there another Azure component better suited to it?

The previous comment is right and is one approach to get this working, but it would need a bit of coding (for the Azure Function).
An alternative solution that caters to your requirement is to combine Azure Logic Apps with Azure Data Factory.
Step 1: Create an HTTP-triggered Logic App which will be invoked by your gateway app; the data will be posted to this REST-callable endpoint.
Step 2: Create an ADF pipeline with a parameter; this parameter holds the data that needs to be pushed to the data lake. It could be the raw data, and it can be transformed in a step within the pipeline before being pushed to the data lake.
Step 3: Once the Logic App is triggered, you can use the Azure Data Factory actions (in Logic Apps) to invoke the pipeline created in step 2 and pass the posted data as a pipeline parameter to your ADF pipeline (the underlying run call is sketched below).
That should be it; with this you can spin up a code-less solution.
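For reference, the Data Factory action used in step 3 boils down to a createRun call against the ADF management API, with the posted data passed as a pipeline parameter. A minimal sketch of that call in Python; the subscription, resource group, factory and pipeline names are placeholders, and the pipeline is assumed to declare a string parameter named payload:

```python
import requests
from azure.identity import DefaultAzureCredential

# Token for the Azure management plane.
token = DefaultAzureCredential().get_token("https://management.azure.com/.default").token

run_url = (
    "https://management.azure.com/subscriptions/mySubId/resourceGroups/myResourceGroup"
    "/providers/Microsoft.DataFactory/factories/myDataFactory"
    "/pipelines/IngestToDataLake/createRun?api-version=2018-06-01"
)

# The request body supplies the pipeline parameter values,
# here the raw data posted by the gateway app.
resp = requests.post(
    run_url,
    headers={"Authorization": f"Bearer {token}"},
    json={"payload": '{"deviceId": 42, "reading": 17.3}'},
)
resp.raise_for_status()
print(resp.json()["runId"])
```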

If your outside application is already pushing via REST, why not have it make calls directly to the Data Lake REST APIs? This would cut out the middle steps and bring everything under your control.
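If you go that route, the gateway app would call the ADLS Gen2 REST API itself, which is roughly a create-append-flush sequence per file. A hedged sketch under those assumptions; the account, file system and path are placeholders, and the caller is assumed to have an AAD identity with write access to the account:

```python
import requests
from azure.identity import DefaultAzureCredential

payload = b'{"deviceId": 42, "reading": 17.3}'

# AAD token scoped to Azure Storage.
token = DefaultAzureCredential().get_token("https://storage.azure.com/.default").token
headers = {"Authorization": f"Bearer {token}", "x-ms-version": "2021-08-06"}

# Placeholder account, file system ("raw") and file path.
path = "https://mydatalake.dfs.core.windows.net/raw/incoming/event-001.json"

# 1) Create the file, 2) append the bytes, 3) flush to commit them.
requests.put(f"{path}?resource=file", headers=headers).raise_for_status()
requests.patch(f"{path}?action=append&position=0", headers=headers, data=payload).raise_for_status()
requests.patch(f"{path}?action=flush&position={len(payload)}", headers=headers).raise_for_status()
```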

Azure Data Factory is a batch data movement service. If you want to push the data over HTTP, you can implement a simple Azure Function to accept the data and write it to the Azure Data Lake.
See the Azure Functions HTTP triggers and bindings overview.
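A minimal sketch of what such a function could look like, using the Azure Functions Python programming model and the Data Lake SDK; the storage account URL and the "raw" file system are placeholders, and the function app's identity is assumed to have write access:

```python
import uuid
import azure.functions as func
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

app = func.FunctionApp()

@app.route(route="ingest", methods=["POST"], auth_level=func.AuthLevel.FUNCTION)
def ingest(req: func.HttpRequest) -> func.HttpResponse:
    # Body of the incoming POST from the gateway app.
    body = req.get_body()

    # Write the raw payload to ADLS Gen2 (account URL and file system are placeholders).
    service = DataLakeServiceClient(
        account_url="https://mydatalake.dfs.core.windows.net",
        credential=DefaultAzureCredential(),
    )
    file_client = service.get_file_system_client("raw").get_file_client(
        f"incoming/{uuid.uuid4()}.json"
    )
    file_client.upload_data(body, overwrite=True)

    return func.HttpResponse(status_code=202)
```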

Related

HTTP requests in Azure Data Factory to retrieve XML data and store it in an Azure Storage blob

Previously we made a Logic App in Azure where we used an HTTP request to retrieve an XML file from our client's system.
It goes like this:
HTTP request --> response body is XML data --> we save that XML data in Azure Blob Storage as an XML file.
My question is how, and whether, it's possible to do the same thing in Azure Data Factory?
The reason for us to move this process over to Data Factory is that we also need to execute SQL Server stored procedures there, and in Logic Apps there is a 2-minute timeout and some of our procedures run longer than 2 minutes.
If you're looking for a way to manually trigger an Azure Data Factory pipeline, you can run your pipeline by using one of the following methods:
.NET SDK
Azure PowerShell module
REST API
Python SDK
The following sample command shows you how to run your pipeline manually by using the REST API:
POST
https://management.azure.com/subscriptions/mySubId/resourceGroups/myResourceGroup/providers/Microsoft.DataFactory/factories/myDataFactory/pipelines/copyPipeline/createRun?api-version=2017-03-01-preview
More information: Manual execution (on-demand) with JSON
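The same run can be started with the Python SDK (azure-mgmt-datafactory); a short sketch reusing the placeholder names from the sample URL above:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

client = DataFactoryManagementClient(DefaultAzureCredential(), "mySubId")

# Start a run of the pipeline referenced in the REST sample above.
run = client.pipelines.create_run(
    resource_group_name="myResourceGroup",
    factory_name="myDataFactory",
    pipeline_name="copyPipeline",
)
print(run.run_id)
```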
There are more questions to be answered, however, like "can we increase the timeout for the Logic App" (yes, see HTTP request limits - Timeout duration), "does the Logic App need to wait for the stored procedures to complete" and "is Data Factory the best tool for the job". The best answer to your question depends on the answers to all of these.
Based on the information you provided, running the logic in a different way, such as a Logic App on an Integration Service Environment or an Azure Function, feels like the best option.

Data from HTTP endpoint to be loaded into Azure Data Lake using Azure Data Factory

I am trying to build a so-called "modern data warehouse" using Azure services.
The first step is to gather all the data in its native raw format into an Azure Data Lake store. For some of the data sources we have no choice but to use an API to consume the data. There's not much information on this when searching, therefore I am asking.
Is it possible to define 2 Web Activities in my pipeline that will handle the scenario below?
Web1 activity gets an API URL generated from C# (Azure Function). It returns data in JSON format and saves it to Web1.Output - this is working fine.
Web2 activity consumes Web1.Output and saves it into Azure Data Lake as a plain txt file (PUT or POST) - this is needed.
The above scenario is achievable by using a Copy activity, but then I am not able to pass the dynamic URL generated by the Azure Function. How do I save the JSON output to ADL? Is there any other way?
Thanks!
Since you are using blob storage as an intermediary, and want to consume the blob upon creation, you could take advantage of Event Triggers. You can set up the event trigger to run a pipeline containing the Web2 activity, which kicks off when the Web1 activity (in a separate pipeline) completes.
By separating the two activities into separate pipelines, the workflow becomes asynchronous. This means you will not need to wait for both activities to complete before starting on the next URL. There are many other benefits as well.
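As a rough sketch, such a blob-created event trigger can be registered with the Python SDK (azure-mgmt-datafactory); the storage account, container, trigger and pipeline names below are placeholders:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    BlobEventsTrigger,
    PipelineReference,
    TriggerPipelineReference,
    TriggerResource,
)

client = DataFactoryManagementClient(DefaultAzureCredential(), "mySubId")

# Fire whenever a new blob lands in the intermediary container,
# and run the pipeline that contains the Web2 activity.
trigger = BlobEventsTrigger(
    events=["Microsoft.Storage.BlobCreated"],
    blob_path_begins_with="/staging/blobs/",
    scope=(
        "/subscriptions/mySubId/resourceGroups/myResourceGroup"
        "/providers/Microsoft.Storage/storageAccounts/mystorageaccount"
    ),
    pipelines=[
        TriggerPipelineReference(
            pipeline_reference=PipelineReference(reference_name="Web2Pipeline")
        )
    ],
)

client.triggers.create_or_update(
    "myResourceGroup", "myDataFactory", "BlobCreatedTrigger",
    TriggerResource(properties=trigger),
)
```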

Use Azure Functions as custom activity in ADFv2

Is it possible to somehow package and execute an already written Azure Function as a custom activity in Azure Data Factory?
My workflow is as follows:
I want to use an Azure Function (which does some data processing) in an ADF pipeline as a custom activity. This custom activity is just one of the activities in the pipeline, but it's key that it gets executed.
Is it possible to somehow package and execute an already written Azure Function as a custom activity in Azure Data Factory?
As far as I know, there is no way to do that so far. In my opinion, you do not need to package the Azure Function. I suggest using a Web Activity to invoke the endpoint of your Azure Function, which would fit into your existing pipeline nicely.
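For illustration, this is roughly what such a Web Activity looks like when the pipeline is authored through the Python SDK (azure-mgmt-datafactory); the function URL, key and resource names are placeholders, and the function is assumed to expose an HTTP trigger:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import PipelineResource, WebActivity

client = DataFactoryManagementClient(DefaultAzureCredential(), "mySubId")

# Call the Azure Function's HTTP endpoint as a regular pipeline activity.
call_function = WebActivity(
    name="CallProcessingFunction",
    method="POST",
    url="https://myfuncapp.azurewebsites.net/api/process?code=<function-key>",
    body={"inputPath": "raw/incoming/"},
)

client.pipelines.create_or_update(
    "myResourceGroup", "myDataFactory", "ProcessingPipeline",
    PipelineResource(activities=[call_function]),
)
```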

Taking parameters from manual triggers in ADF

Usecase
We have an on-premises Hadoop setup and we are using Power BI as a BI visualization tool. What we currently do to get data into Power BI is as follows.
Copy data from on-premises to Azure Blob (our on-premises scheduler does this once the data is ready in Hive)
Data from Azure Blob is then copied to Azure Data Warehouse / Azure SQL
The cube is refreshed on Azure AAS; AAS pulls data from Azure Data Warehouse / SQL
To do steps 2 and 3 we are currently running a web server on Azure, and its endpoints are configured to take a few parameters like the table name, the Azure file location, cube information, and so on.
Sample http request:
http://azure-web-server-scheduler/copydata?from=blob&to=datawarehouse&fromloc=myblob/data/today.csv&totable=mydb.mytable
Here the web server extracts the values from the variables (from, fromloc, to, totable) and then does the copy activity. We did this because we had a lot of tables that could all reuse the same function.
Now we have use cases piling up (retries, control flows, email alerts, monitoring) and we are looking for a cloud alternative to do the scheduling for us; we would still like to hit an HTTP endpoint like the one above.
One of the alternatives we have checked so far is Azure Data Factory, where we create pipelines to achieve the steps above and trigger ADF using HTTP endpoints.
Problems
How can we take parameters from the HTTP POST call and make them available as custom variables [1]? This is required within the pipeline so that we can still write one function for each of steps 2 and 3 and have the function take these parameters; we don't want to create an ADF pipeline for each table.
How can we detect failures in ADF steps and send email alerts when they occur?
What are the other options apart from ADF to do this in Azure?
[1] https://learn.microsoft.com/en-us/azure/data-factory/control-flow-system-variables
You could trigger the copy job from Blob to SQL DW via a Get Metadata Activity. It can be used in the following scenarios:
- Validate the metadata information of any data
- Trigger a pipeline when data is ready/available
For email notification you can use a Web Activity calling a Logic App. See the following tutorial on how to send an email notification.
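For the failure-detection part, one simple option is to poll the run status with the Python SDK and post to a Logic App HTTP endpoint when a run has failed; the Logic App URL and resource names below are placeholders:

```python
import requests
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

client = DataFactoryManagementClient(DefaultAzureCredential(), "mySubId")

run_id = "<run id returned by createRun>"
run = client.pipeline_runs.get("myResourceGroup", "myDataFactory", run_id)

if run.status == "Failed":
    # POST to the Logic App's "When a HTTP request is received" trigger,
    # which then sends the alert email.
    requests.post(
        "https://<logic-app-http-trigger-url>",
        json={"pipeline": run.pipeline_name, "runId": run_id, "error": run.message},
    )
```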

Is Azure Data Factory suitable for downloading data from non-Azure REST APIs?

Consider a data processing pipeline as follows:
Fetch a large amount of data from a REST API that's hosted somewhere on the internet and persist it to a data store.
Perform some complex data transformations on the persisted data.
Persist the results of the data transformations on a data store.
Aiming to implement such a pipeline in Azure, steps 2 and 3 seem like a good fit for implementation as Azure Data Factory activities.
My question is: does it make sense to implement step 1 as an Azure Data Factory activity as well?
Technically it might be possible to code a .NET custom activity that performs the data download and persistence.
No - do not implement step 1 in an Azure Data Factory activity.
Technically it is possible to run the entire process from ADF, but I would argue that this choice is more costly (relatively) than other options available to you, because you will pay for each activity in Azure Data Factory.
For instance, what if the REST API has no new data to offer when you initiate the (scheduled) activity? You'll pay for that anyway.
You might consider the following as an easy to implement alternative:
1 - Create a .NET console app, publish it as a WebJob, and schedule it to run daily.
2 - The long-running console app can query the REST API, persist the data into Azure Storage / DocumentDB, and push a message into a queue which triggers ADF steps 2/3 to run against the saved data (a rough sketch of this loop follows below).
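The console app is suggested as .NET, but the flow is the same in any language; a rough sketch in Python, where the API URL, connection string, container and queue names are placeholders:

```python
import requests
from azure.storage.blob import BlobServiceClient
from azure.storage.queue import QueueClient

# 1) Pull data from the external REST API (placeholder URL).
data = requests.get("https://api.example.com/v1/records").content

# 2) Persist the raw payload to Azure Storage.
blob_service = BlobServiceClient.from_connection_string("<storage-connection-string>")
blob_service.get_blob_client("raw", "records/latest.json").upload_blob(data, overwrite=True)

# 3) Signal downstream processing (ADF steps 2/3) via a queue message.
queue = QueueClient.from_connection_string("<storage-connection-string>", "adf-work")
queue.send_message("records/latest.json")
```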
I have done exactly that using a .NET custom activity. I needed to fetch data from the Salesforce API, and this has been working well for my needs. Here is a post I wrote about creating a .NET activity and storing the data in Azure Data Lake.
As in Newport99's answer, yes, you will incur costs for that activity, but I am not sure how cost-effective it would be to run a separate web app to host a WebJob and also run the Azure Data Factory pipeline. When I was originally designing a solution the WebJob was my first choice, but in the end I preferred to have the whole solution use one Azure service instead of multiple.
Hope that helps.
There have been a lot of improvements to ADF in the years since this question was posted, including a REST connector.
Here's the approach recommended by ADF at this time...
Copy data from a REST endpoint by using Azure Data Factory
