Is it possible to download a million files in parallel from a REST API endpoint into Blob Storage using Azure Data Factory?

I am fairly new to Azure, and I have a task in hand to use any Azure service (or a group of Azure services working together) to download a million files in parallel from a third-party REST API endpoint, which returns one file at a time, into Blob Storage using Azure Data Factory.
WHAT I RESEARCHED:
From what I researched, my task boiled down to the following requirements:
Parallel runs in the millions: For this I deduced Azure Batch would be a good option, as it lets you run a large number of tasks in parallel on VMs (the same concept it uses for graphics rendering or machine learning workloads).
Save the REST API response to Blob Storage: I found that Azure Data Factory can handle this kind of ETL operation with its source/sink model, where the REST API is the source and Blob Storage is the sink.
WHAT I HAVE TRIED:
Here are some things to note:
I added the REST API and Blob as linked services.
The API endpoint takes a query string parameter named fileName.
I am passing the whole URL, including the query string.
The REST API is protected by a bearer token, which I am trying to pass using additional headers.
THE MAIN PROBLEM:
When publishing the pipeline I get a one-line error message saying the model is not appropriate, and it gives no insight into what is wrong.
OTHER QUERIES:
Is it possible to pass query string values dynamically from a SQL table, so that each file name is picked from a single-column result set returned by a stored procedure or inline query?
Is it possible to make this pipeline run in parallel using Azure Batch somehow? How can the two services be integrated?
Is it possible to achieve a million parallel downloads without Data Factory, using only Azure Batch?

It's hard to help with your main problem; you need to provide more of your pipeline code.
In relation to your other queries:
You can use a "Lookup activity" to fetch a list of files from a database (with either sproc or inline query). The next step would be a ForEach activity that iterates over the array and copies the file from the REST endpoint to the storage account. You can adjust the parallelism on the ForEach activity to match your requirement but around 20 concurrent executions is what you normally see.
Using Azure Batch just to download a file seems a bit overkill, as it should be a fairly quick operation. If you want to see an example of an Azure Batch job written in C#, I can recommend this example: https://github.com/Azure-Samples/batch-dotnet-quickstart/blob/master/BatchDotnetQuickstart. In terms of parallelism, I think you will manage to achieve a higher degree on Azure Batch compared to Azure Data Factory.
If you need to download 1M files truly in parallel, I don't think you have any option other than Azure Batch to get close to such numbers. But your API must be pretty beefy if it can handle 1M requests within a second or two.
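As a rough illustration of the Azure Batch route (the quickstart linked above is the C# equivalent), the sketch below submits one download task per file to an existing pool and job using the azure-batch Python SDK. The account details, job ID, and worker command line are assumptions, and add_collection accepts at most 100 tasks per call.

```python
# Rough sketch: submit one Azure Batch task per file to an existing pool and job.
# Account name/key/URL, the job ID, and the worker command line are hypothetical,
# and the pool + job are assumed to exist already.
from azure.batch import BatchServiceClient
from azure.batch.batch_auth import SharedKeyCredentials
import azure.batch.models as batchmodels

credentials = SharedKeyCredentials("mybatchaccount", "<account-key>")
batch_client = BatchServiceClient(
    credentials, batch_url="https://mybatchaccount.westeurope.batch.azure.com")

# In reality this list would come from the SQL lookup; kept small here for the sketch.
file_names = [f"file-{i:07d}.bin" for i in range(10_000)]

tasks = [
    batchmodels.TaskAddParameter(
        id=f"download-{i}",
        # each node runs a small script that fetches one file and uploads it to blob storage
        command_line=f"/bin/bash -c 'python3 download_one.py --file-name {name}'",
    )
    for i, name in enumerate(file_names)
]

# add_collection accepts at most 100 tasks per call, so submit in chunks
for start in range(0, len(tasks), 100):
    batch_client.task.add_collection("download-job", tasks[start:start + 100])
```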

Related

Solution architecture for data transfer from SQL Server database to external API and back

I am looking for a proper solution architecture for a data transfer scenario from SQL Server to an external API and then from the API back to SQL Server. We are thinking of using Azure technologies.
We have a database hosted on an Azure VM. When the author value in the book table changes, we would like to get all the data for that book from the related tables and transfer it to an external API. The number of rows to be transferred (the select-join) is huge, so the select-join query takes a long time to execute. After this data is read, it is transformed and then sent to an external API (over which we have no control). The transfer of the data to the API could take up to an hour. After the data is written into this API, we read some reports from it and write these reports back into the original database.
We must repeat this process more than 50 times per day.
We are thinking of using a Logic App to detect the trigger from SQL Server (as it is hosted on an Azure VM), publish this event to Azure Event Grid, and then use Azure Durable Functions to handle the read-SQL-data, transform, and send-to-external-API steps.
Does this make sense? Does anybody have any better ideas?
Thanks in advance
At this moment, the Logic Apps SQL connector can't detect when a particular row changes; it performs a select (which you provide) and then checks for changes at an interval you specify.
In other words, SQL Database doesn't offer a change feed like Cosmos DB, where you can subscribe to events and trigger an Azure Function.
Things you can do:
1. Add a trigger on SQL after insert/update that inserts the new/changed row into a separate table, and then use Logic Apps / Azure Functions to query this table and retrieve the data.
2. Migrate to Cosmos DB and use the change feed + Azure Functions.
3. Change your code so that after inserting into SQL Database, you also add a message with the identifier of the row you are about to insert/update to a queue, which is consumed by an Azure Function (see the sketch below).
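For option 3, a minimal Python sketch of the "write the row, then enqueue its identifier" pattern might look like the following; the connection strings, table, and queue name are hypothetical, and the queue would then be drained by an Azure Function or Durable Function.

```python
# Option 3 sketch: after writing/updating a row, enqueue its identifier for downstream processing.
# The connection strings, table name, and queue name are hypothetical.
import pyodbc
from azure.storage.queue import QueueClient

SQL_CONN = "DRIVER={ODBC Driver 18 for SQL Server};SERVER=...;DATABASE=...;UID=...;PWD=..."
QUEUE_CONN = "DefaultEndpointsProtocol=https;AccountName=...;AccountKey=...;EndpointSuffix=core.windows.net"

queue = QueueClient.from_connection_string(QUEUE_CONN, "changed-books")

def update_author(book_id, author):
    with pyodbc.connect(SQL_CONN) as conn:
        conn.execute("UPDATE dbo.Books SET Author = ? WHERE BookId = ?", author, book_id)
        conn.commit()
    # tell the downstream worker (Azure Function / Durable Function) which row changed
    queue.send_message(str(book_id))
```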

Looking for an alternative solution to processing tens of thousands of JSONs from Azure Blob to Azure SQL DB

I currently have pipelines developed that leverage Azure Data Factory for orchestration and Azure Databricks for its compute to perform the following actions. I receive tens of thousands of single-record JSON files into Azure Blob on a real-time basis, and every 15 minutes I check the folders for any new files. Once found, I load them into a dataframe using Databricks and write them into a single table in SQL DB, before other ADF jobs trigger stored procedures which then transform my data into the final SQL tables. We are looking to move away from Databricks because we are not using its true capabilities but are of course paying the Databricks costs. I'm looking for ideas on other solutions to load tens of thousands of JSONs into SQL DB (with minimal to no transformations) on a periodic (i.e. 15-minute) basis. We are a Microsoft shop, so we're not necessarily looking to move away from Azure tools.
Here are a few ideas:
Use Azure Functions + a Blob Trigger / Event Grid to process the JSON files in real time (every time a new JSON file arrives, it triggers your function). You could then insert into either the final table or a temporary table (see the sketch after these ideas).
Another idea would be to combine Azure Functions + a Blob Trigger / Event Grid to sink the data to a data lake, and then use ADF to sink it into the final SQL tables.
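As a sketch of the first idea, a blob-triggered Azure Function (shown here with the Python v2 programming model, purely for illustration) could load each incoming JSON file straight into SQL. The container name, connection setting, table, and columns are assumptions.

```python
# Sketch of the first idea: a blob-triggered Azure Function that loads each incoming
# JSON file into SQL. Container name, connection setting, table, and columns are hypothetical.
import json

import azure.functions as func
import pyodbc

SQL_CONN = "DRIVER={ODBC Driver 18 for SQL Server};SERVER=...;DATABASE=...;UID=...;PWD=..."

app = func.FunctionApp()

@app.blob_trigger(arg_name="blob", path="incoming-json/{name}",
                  connection="AzureWebJobsStorage")
def load_json_to_sql(blob: func.InputStream):
    record = json.loads(blob.read())  # each file holds a single JSON record
    with pyodbc.connect(SQL_CONN) as conn:
        conn.execute(
            "INSERT INTO dbo.StagingRecords (Id, Payload) VALUES (?, ?)",
            record.get("id"), json.dumps(record))
        conn.commit()
```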
Azure SQL DB is actually pretty capable as far as JSON goes, so you could just use OPENROWSET to import the data directly from blob storage and OPENJSON to shred it (sketched below). You could then use a Logic App running on a schedule to call the proc, say every 15 minutes; you wouldn't even need ADF as part of the solution.
I've worked up a couple of similar answers previously, e.g. here and here, but let me know if you want to progress further down this route and we can work up something more detailed.
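A minimal sketch of that OPENROWSET + OPENJSON approach follows, wrapped in a short Python script for illustration; in practice the T-SQL would live in a stored procedure called by the scheduled Logic App. The external data source, blob path, target table, and JSON paths are hypothetical, and a database-scoped credential plus external data source must already exist in the database.

```python
# Sketch of the OPENROWSET + OPENJSON approach: shred a JSON blob directly inside Azure SQL DB.
# The external data source, blob path, target table, and JSON paths are hypothetical; a
# database-scoped credential and external data source must already exist in the database.
import pyodbc

SHRED_JSON = """
INSERT INTO dbo.StagingRecords (Id, Name, Payload)
SELECT j.Id, j.Name, raw.BulkColumn
FROM OPENROWSET(
         BULK 'incoming-json/record-0001.json',
         DATA_SOURCE = 'MyBlobStore',   -- external data source pointing at the container
         SINGLE_CLOB) AS raw
CROSS APPLY OPENJSON(raw.BulkColumn)
     WITH (Id INT '$.id', Name NVARCHAR(200) '$.name') AS j;
"""

with pyodbc.connect(
        "DRIVER={ODBC Driver 18 for SQL Server};SERVER=...;DATABASE=...;UID=...;PWD=...") as conn:
    conn.execute(SHRED_JSON)
    conn.commit()
```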

Data from HTTP endpoint to be loaded into Azure Data Lake using Azure Data Factory

I am trying to build a so-called "modern data warehouse" using Azure services.
The first step is to gather all the data in its native raw format into Azure Data Lake Store. For some of the data sources we have no choice but to use an API to consume the data. There's not much information available when searching, therefore I am asking.
Is it possible to define two Web activities in my pipeline that will handle the scenario below?
The Web1 activity gets an API URL generated by C# (an Azure Function). It returns data in JSON format and saves it to Web1.Output; this is working fine.
The Web2 activity consumes Web1.Output and saves it into Azure Data Lake as a plain text file (PUT or POST); this is what I need.
The above scenario is achievable using a Copy activity, but then I am not able to pass the dynamic URL generated by the Azure Function. How do I save the JSON output to Azure Data Lake? Is there any other way?
Thanks!
Since you are using blob storage as an intermediary and want to consume the blob upon creation, you could take advantage of event triggers. You can set up an event trigger to run a pipeline containing the Web2 activity, which kicks off when the Web1 activity (in a separate pipeline) completes.
By separating the two activities into separate pipelines, the workflow becomes asynchronous. This means you will not need to wait for both activities to complete before doing the next URL. There are many other benefits as well.
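To make the "save the JSON output to the lake" step concrete, here is a small Python sketch using the Data Lake Storage Gen2 SDK; the account URL, credential, file system, path, and payload are hypothetical. In the ADF-only design, this write is what the Web2 activity (or a Copy activity sink) would perform.

```python
# Sketch: persist a JSON payload (e.g. Web1.Output) to Azure Data Lake Storage Gen2.
# The account URL, credential, file system, and path are hypothetical.
import json

from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://mydatalake.dfs.core.windows.net",
    credential="<account-key>")

payload = {"source": "web1", "rows": [1, 2, 3]}   # stand-in for Web1.Output

file_system = service.get_file_system_client("raw")
file_client = file_system.get_file_client("api/output.json")
file_client.upload_data(json.dumps(payload), overwrite=True)
```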

Is it possible to Push (using API) and Pull (using indexer) data into the same Index with Azure Search?

I have an index in Azure Search, let's say called Hotels.
I have a hotels table in Azure SQL with the same schema, which is a copy of the Hotels index in Azure Search.
I push from my back end to the Azure SQL table and to Azure Search on create/update/delete.
In a scenario where my data was pushed to Azure SQL but failed to be pushed to Azure Search, is it possible to put an indexer over my Azure SQL Hotels table, such that the indexer could sync the data that failed to be pushed from my back end into my Azure Search index (hotels)?
Yes, you can both mix push and pull as well as have multiple pull indexers targeting the same index. We see this done often when part of the data is in one data source and part in another, where the index is the point where they converge, coordinated by their key.
The pattern you're describing is not as common, but generally speaking it should work. You'd have to account for cases where your direct write conflicts with an indexer write, and make sure the writes you do as they happen ultimately win. Also, if you go down this path, make sure to configure a change detection policy (and a deletion detection policy if you delete rows) so the indexer reads from SQL incrementally and doesn't re-read everything on every run.
An alternative approach if you're worried about missing writes is to push all your writes into a queue, and then pull from the queue and into Azure Search. That way you have a single stream of writes instead of two.
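A rough Python sketch of that queue-based pattern: the back end enqueues every write, and a single worker drains the queue and upserts the documents into the search index. The queue name, search endpoint, key, index name, and document shape are hypothetical.

```python
# Sketch of the queue-based pattern: the back end enqueues every write, and this worker
# drains the queue and upserts documents into the search index. Names and keys are hypothetical.
import json

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.storage.queue import QueueClient

queue = QueueClient.from_connection_string("<storage-connection-string>", "hotel-writes")
search = SearchClient(
    endpoint="https://mysearch.search.windows.net",
    index_name="hotels",
    credential=AzureKeyCredential("<admin-key>"))

def drain_once():
    # in production this would typically be a queue-triggered Azure Function instead of a loop
    for msg in queue.receive_messages(messages_per_page=32):
        doc = json.loads(msg.content)                 # e.g. {"HotelId": "1", "Name": "..."}
        search.merge_or_upload_documents(documents=[doc])
        queue.delete_message(msg)
```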

Is Azure Data Factory suitable for downloading data from non-Azure REST APIs?

Consider a data processing pipeline as follows:
Fetch a large amount of data from a REST API that's hosted somewhere on the internet and persist it to a data store.
Perform some complex data transformations on the persisted data.
Persist the results of the data transformations on a data store.
When implementing such a pipeline in Azure, steps 2 and 3 seem like a good fit for implementation as Azure Data Factory activities.
My question is: does it make sense to implement step 1 as an Azure Data Factory activity as well?
Technically it might be possible to code a .NET activity that performs the data download and persistence.
No, do not implement step 1 as an Azure Data Factory activity.
Technically it is possible to run the entire process from ADF, but I would argue that the choice is (relatively) more costly than other options available to you, because you will pay for each activity in Azure Data Factory.
For instance, what if the REST API has no new data to offer when you initiate the (scheduled) activity? You'll pay for that anyway.
You might consider the following as an easy-to-implement alternative (sketched below):
1. Create a .NET console app, publish it as a WebJob, and schedule it to run daily.
2. The long-running console app can query the REST API, persist the data into Azure Storage / DocumentDB, and push a message onto a queue, which triggers ADF steps 2/3 to run against the saved data.
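Here is the same flow sketched in Python for illustration (the answer proposes a .NET console app); the API URL, container, and queue name are hypothetical.

```python
# Sketch of the WebJob flow: pull from the REST API, persist the payload to blob storage,
# then drop a message on a queue so the downstream ADF steps can pick it up.
# The API URL, container, and queue name are hypothetical.
import datetime
import json

import requests
from azure.storage.blob import BlobServiceClient
from azure.storage.queue import QueueClient

STORAGE_CONN = "<storage-connection-string>"
API_URL = "https://api.example.com/export"   # hypothetical REST endpoint

def run_once():
    resp = requests.get(API_URL, timeout=300)
    resp.raise_for_status()
    if not resp.content:
        return  # nothing new today: no blob written, nothing triggered, nothing to pay for

    blob_name = f"export-{datetime.date.today().isoformat()}.json"
    blobs = BlobServiceClient.from_connection_string(STORAGE_CONN)
    blobs.get_blob_client(container="raw-exports", blob=blob_name).upload_blob(
        resp.content, overwrite=True)

    queue = QueueClient.from_connection_string(STORAGE_CONN, "exports-ready")
    queue.send_message(json.dumps({"blob": blob_name}))

if __name__ == "__main__":
    run_once()
```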
I have done exactly that using a .NET custom activity. I had a need to fetch data from the Salesforce API, and this has been working well for my needs. Here is a post I wrote up about creating a .NET activity and storing the data in Azure Data Lake.
As in Newport99's answer, yes, you will incur costs for that activity, but I am not sure how cost-effective it would be to run a separate web app to host a WebJob and also run the Azure Data Factory pipeline. When I was originally designing a solution, the WebJob was my first choice, but in the end I preferred to have the whole solution use one Azure service instead of several.
Hope that helps.
There have been a lot of improvements to ADF in the years since this question was posted, including a REST connector.
Here's the approach recommended by ADF at this time...
Copy data from a REST endpoint by using Azure Data Factory
