Is Azure Data Factory suitable for downloading data from non-Azure REST APIs? - azure

Consider a data processing pipeline as follows:
Fetch a large amount of data from a REST API that's hosted somewhere on the internet and persist it to a data store.
Perform some complex data transformations on the persisted data.
Persist the results of the data transformations on a data store.
Aiming to implement such a pipeline in Azure, steps 2 and 3 seem like a good fit for implementation as Azure Data Factory activities.
My questions is - Does it make sense to implement step 1 in an Azure Data Factory activity as well?
Technically it might be possible to code a .Net activity that perform the data download and persistence.

No - do not implement step 1 in an Azure Data Factory activity.
Technically it is possible to run the entire process from ADF but I would argue that the choice is more costly (relatively) than other options available to you because you will pay for each activity in Azure Data Factory.
For instance, what if the rest api does not have any new data to offer when you initiate the (scheduled) activity? You'll pay for that.
You might consider the following as an easy to implement alternative:
1 - Create a .NET console app, publish as a WebJob, schedule to run daily.
2 - The long-running console app can query the rest api, persist data into azure storage / documentdb, push a message into queue which triggers ADF steps 2/3 to run against the saved data.

I have done exactly that using .Net Activity. I had a need to fetch data from Salesforce api. This has been working well for my needs. Here is a post I wrote up about creating a .net activity and storing the data in azure data lake.
As in Newport99's answer yes you will incur costs for that activity but I am not sure how cost effect it would be to be running a separate web app to host a web job and also run the Azure Data Factory pipeline. When I was originally designing a solution the WebJob was my first choice but in the end I prefer to have the whole solution utilizing one azure service instead of multiple.
Hope that helps.

There have been a lot of improvements to ADF in the years since this question was posted, including a REST connector.
Here's the approach recommended by ADF at this time...
Copy data from a REST endpoint by using Azure Data Factory

Related

Using Azure Data Factory to ingest incoming data from a REST API

Is there a way to create an Azure ADF Pipeline to ingest the incoming POST requests? I have this gateway app (outside Azure) that is able to publish data via REST as it arrives from the application and this data needs to be ingested into a Data Lake. I am utilizing the REST calls from another pipeline to pull the data but this basically needs to do the reverse - the data will be pushed and i need to be constantly 'listening' to those calls...
Is this something an ADF pipeline should do or maybe there are any other Azure components able to do it?
Previous comment is right and is one of the approach to get it working but would need bit of coding (for azure function).
There could also be an alternate solution to cater to your requirement is with Azure Logic Apps and Azure data factory.
Step 1: Create a HTTP triggered logic app which would be invoked by your gateway app and data will be posted to this REST callable endpoint.
Step 2: Create ADF pipeline with a parameter, this parameter holds the data that needs to be pushed to the data lake. It could be raw data and can be transformed as a step within the pipeline before pushing it to the data lake.
Step 3: Once logic app is triggered, you can simply use Azure data factory actions to invoke the data factory pipeline created in step 2 and pass the posted data as a pipeline parameter to your ADF pipeline.
This should be it, with this - you can spin up your code-less solution.
If your outside application is already pushing via REST, why not have it make calls directly to the Data Lake REST APIs? This would cut out the middle steps and bring everything under your control.
Azure Data Factory is a batch data movement service. If you want to push the data over HTTP, you can implement a simple Azure Function to accept the data and write it to the Azure Data Lake.
See Azure Functions HTTP triggers and bindings overview

Azure data factory data flow task cannot take on prem as source

Hey,
I am having trouble in creating a Data Flows task which uses an on-prem source. Is this not possible in the Preview version?
I have created a self hosted IR to connect ADF to my laptop, and that is what I want to use. In the pic below I am trying to create a dataset off self hosted IR. It works great in Copy task but for Data Flows it is greyed out.
For this question, I asked Azure support for help and they replied me with the answer:
Answer:
This means on-premise SQL server is not supported as dataset in data flow in current stage.
Screen shot:
Update:
Data flow now only support Azure IR so it doesn’t support on-premise dataset.
Refer to Integration runtime types.
Hope this helps.
If your goal is to use visual data transformations in ADF using Mapping Data Flows with on-prem data, then build a pipeline with a Copy Activity first. Use the Self-Hosted Integration Runtime with the Copy Activity to stage your data in Blob Store. Then add a subsequent Execute Data Flow activity to transform that data.
I made video on how to do this:
https://www.youtube.com/watch?v=IN-4v0e7UIs
Reproduce your issue on my side ,however nothing about this feature is claimed on the official document. As you can see everywhere about data flow cliams that:
You could submit any voice here:
Also found a feedback for data flow in ADF for your reference.If you need push the progress of it,you could vote up it. Also,i would suggest you referring to the comments in the link:
For access to the 80+ ADF connectors, use Copy Activity to stage data
for transformation.
Data Flows will access data in your lake (Blob, ADB, ADW, ADLS) for
transformation.
Think of Copy Activity as your data heavy-lifting activity and Data
Flow as your data transformation engine.

Data from HTTP endpoint to be loaded into Azure Data Lake using Azure Data Factory

I am trying to build a so called "modern data warehouse" using Azure services.
First step is to gather all the data in its native raw format into Azure Data Lake store. For some of the data sources we have no other choice than to use API for consuming the data. There's not much information when searching, therefore I am asking.
Is it possible to define 2 Web Activities in my pipeline that will handle below scenario?
Web1 activity gets an API URL generated from C# (Azure Function). It returns data in JSON format and saves it to Web1.Output - this is working fine.
Web2 activity consumes Web1.Output and saves it into Azure Data Lake as a plain txt file (PUT or POST) - this is needed.
Above scenario is achievable by using Copy activity, but then I am not able to pass dynamic URL generated by Azure Functions. How do I save the JSON output to ADL? Is there any other way?
Thanks!
Since you are using blob storage as an intermediary, and want to consume the blob upon creation, you could take advantage of Event Triggers. You can set up the Event trigger to run a pipeline containing Web2 activity. Which kicks off when the Web1 activity completes (separate pipeline).
By separating the two activities into separate pipelines, the workflow becomes asynchronous. This means you will not need to wait for both activities to complete before doing the next URL. There are many other benefits as well.

Does azure data factory support real time copy activity?

I want to copy data from an on-premise oracle db to sql server on a real time basis.
Data Factory has many strengths, but frequency is not one of them. Have you considered a different approach?
For real time integrations I would recommend Functional App, Logic App and/or Service Bus. I would call this App whenever there is a relevant change in the oracle DB. Alternatively you could have an API on top of this on-premise oracle DB that you call from a scheduled APP.
If you expect heavy traffic you might want to consider using a Service Bus. The illustration below shows how Azure Service Bus sends data from a Publisher(on-premise) to a Subscriber(Azure sql DB) using Topic Message.
Illustration of large scale service bus dataflow
Ref. Azure Service Bus
Welcome to Stack Overflow!
You can copy data from an Oracle database to any supported sink data store. For a list of data stores that are supported as sources or sinks by the copy activity, see the Supported data stores table.
To copy data from and to an Oracle database that isn't publicly accessible, you need to set up a Self-hosted Integration Runtime. The integration runtime provides a built-in Oracle driver. Therefore, you don't need to manually install a driver when you copy data from and to Oracle.
For more details and step by step procedure, refer Copy data from and to Oracle by using Azure Data Factory.
You can also use custom activity as per your needs. Refer custom activities.
Hope this helps.
You could copy data incrementally. But there is frequency limitation. Please reference this post. https://social.msdn.microsoft.com/Forums/en-US/54380f98-716b-4a95-88af-cad2ab7e47b5/what-type-of-data-ingestion-does-azure-data-factory-use?forum=AzureDataFactory

Which Azure products are needed for a staging database?

I have several external data APIs that I access using some Python scripts. My scripts run from an on-premises server, transform the data, and store it in a SQL Server database on the same server. I suppose it's a rudimentary ETL system run with Python and T-SQL.
The system is about to grow quite a bit with new APIs and will require more complex data pipelines (for example, some of the API data will be spun off to more than one table). I think this would be a good time to move the system onto Azure (we are heavily integrated with Microsoft so it will have to be Azure!).
I have spent a few days researching the Azure products that would let me run Python scripts to access data from web APIs and store the processed data in a cloud database. I'm looking for advice on what sort of Azure products other people have used for similar jobs. At the moment it seems I will need:
Azure SQL Database to hold the processed data that can be accessed by various colleagues.
Azure Data Factory to manage, log, and schedule the pipeline jobs and to run my custom Python scripts (is this even possible?).
Azure Batch to run the aforementioned Python scripts but I'm not sure about this.
I want to put together a proposal basically and start thinking about costs but it would be good to hear from someone who has done something similar - am I on the right track or completely off? Should I just stay on-premises? Thank you in advance.
Azure SQL Database, Azure SQL Data Warehouse are good for relational data. And if you want to use NoSQL, you could go with Azure Cosmos DB. If you want to use Files to store data, you could use Azure Data Lake.
For python scripts, you could use custom activity or Data bricks for Azure Data Factory.
Azure SQL Warehouse should be used if the amount of data you want to load is in petabytes. Also, Azure Data warehouse is not meant for complex transformations. I would recommend it for plain data load with PolyBase.

Resources