Azure-based approach for sending 100,000 requests to an external service

Every night I need to get data from an external HTTP service and save it to Azure Data Lake.
Specifically, I need to get all the orders for all the customers. The problem is that there is no way to get this data in a single call: the id of a customer has to be provided in each separate call.
The URL format is something like /api/ordersByCutomer/{cutomerId}
I need data for 100,000 different customers, which results in 100,000 calls to the external service.
I tried Azure Data Factory with a ForEach activity in parallel mode, but each call takes about 4 seconds there (3 of which are spent in the queue), so the overall throughput was not satisfactory.
What is the best (i.e. the fastest) Azure-based approach for this, other than Azure Data Factory?
Thanks

You could write asynchronous code that hits the API/HTTP service in parallel, and run that code as a Custom Activity in ADF, which executes it on an Azure Batch account to get the job done. See: Use custom activities in an Azure Data Factory pipeline.
Also, before doing any of this, it would be wise to contact the owner/stakeholder of the external HTTP service to find out whether the service is rate limited, and whether it can handle such a load at all.
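For illustration, the kind of asynchronous fan-out code the answer refers to could look something like the sketch below (Python with aiohttp; the base URL, concurrency limit and result handling are placeholders to adapt to the real service and whatever rate limit its owner confirms):

```python
# Minimal sketch, not production code: fan out the 100,000 per-customer calls
# with a bounded level of concurrency, then hand the results on (e.g. to be
# written to Azure Data Lake by the custom activity).
import asyncio
import aiohttp

BASE_URL = "https://example.com/api/ordersByCutomer/{}"  # placeholder host
CONCURRENCY = 100                                        # tune to what the service tolerates

async def fetch_orders(session, semaphore, customer_id):
    async with semaphore:
        async with session.get(BASE_URL.format(customer_id)) as resp:
            resp.raise_for_status()
            return customer_id, await resp.json()

async def fetch_all(customer_ids):
    semaphore = asyncio.Semaphore(CONCURRENCY)
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_orders(session, semaphore, cid) for cid in customer_ids]
        return await asyncio.gather(*tasks)

if __name__ == "__main__":
    results = asyncio.run(fetch_all(range(1, 100_001)))
```

The semaphore is what keeps the parallelism at a level the external service can actually sustain, rather than firing all 100,000 requests at once.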

Related

Handling long-running tasks using Azure Functions and Azure Storage

I want to use Azure to handle long-running tasks that can’t be handled solely by a web server, as they exceed the 2 min HTTP limit (and would put unnecessary load on it regardless). In this case, it’s the generation of a PDF report that can take some time (between 2-5 mins). I’ve seen examples of solutions for this using other technologies (Celery, RabbitMQ, AWS Lambda, etc.) but not much using what's available on Azure (Functions and Storage in this case).
Details of my proposed solution are as follows (a diagram is here)
API (that has 3 endpoints):
Generate report – post a message to Azure Queue Storage
Get report generation status – query Azure Table Storage for status
Get report – retrieve PDF from Azure Blob Storage
Azure Queue Storage
Receives a message from the API containing parameters of the requested report
Azure Function
Triggered when a message is added to Azure Queue Storage
Creates a report generation status record in Azure Table Storage, set to ‘Generating’
Generates a report based on the parameters contained in the message
Stores the output PDF in Azure Blob Storage
Updates the report generation status record in Azure Table Storage to ‘Completed’
Azure Table Storage
Contains a table of report generation requests and associated status
Azure Blob Storage
Stores PDF reports
Other points
The app isn’t built yet – so there is no base case I’m comparing against (e.g. Celery/RabbitMQ)
The time it takes to run the report isn’t super important (i.e. I’m not concerned about Azure Function cold starts)
There’s no requirement for immediate notification of completion using something like webhooks – the client will poll the API every so often using the get report generation status endpoint.
There won't be much usage of the app, so having an always active server to handle tasks (vs Azure Function) seems to be a waste of money.
If I find that report generation takes longer than 10 minutes, I can split it across more than one Azure Function (to avoid the Consumption plan's hard limit of 10 minutes execution time per function)
My question is essentially whether or not this is a good approach (to me it seems good and relatively cost-effective; I’m just not sure whether there’s something I’m missing).
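For reference, a minimal sketch of the queue-triggered function in this flow might look like the following (Python; generate_pdf, the message field, table and container names, and the connection string are all placeholders, and the queue trigger binding itself would live in function.json):

```python
import json
import azure.functions as func
from azure.data.tables import TableClient
from azure.storage.blob import BlobClient

CONNECTION = "<storage-connection-string>"   # placeholder

def generate_pdf(params) -> bytes:
    """Hypothetical report generator; the real implementation goes here."""
    ...

def main(msg: func.QueueMessage) -> None:
    params = json.loads(msg.get_body())
    report_id = params["reportId"]           # hypothetical message field

    status = TableClient.from_connection_string(CONNECTION, table_name="ReportStatus")
    status.upsert_entity({"PartitionKey": "reports", "RowKey": report_id,
                          "Status": "Generating"})

    pdf_bytes = generate_pdf(params)

    blob = BlobClient.from_connection_string(CONNECTION, container_name="reports",
                                             blob_name=f"{report_id}.pdf")
    blob.upload_blob(pdf_bytes, overwrite=True)

    status.upsert_entity({"PartitionKey": "reports", "RowKey": report_id,
                          "Status": "Completed"})
```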
This can be simplified using Durable Functions. Most of the work is already handled by the framework, and you can also query an endpoint to check the completion status.
https://learn.microsoft.com/en-us/azure/azure-functions/durable/durable-functions-overview?tabs=csharp
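As a rough illustration of what that might look like (Python Durable Functions; the activity name is a placeholder, and the HTTP starter function and bindings are omitted), the orchestrator can be very small:

```python
# Sketch only: the framework persists state and exposes status-query endpoints,
# so the hand-rolled Table Storage status record becomes unnecessary.
import azure.durable_functions as df

def orchestrator_function(context: df.DurableOrchestrationContext):
    params = context.get_input()
    # Chain more activities here if generation has to be split up to stay
    # under per-function execution-time limits.
    blob_url = yield context.call_activity("GenerateReport", params)
    return blob_url

main = df.Orchestrator.create(orchestrator_function)
```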

Azure Data Factory (ADF) vs Azure Functions: How to choose?

Currently we are using blob-triggered Azure Functions to move JSON data into Cosmos DB. We are planning to replace the Azure Functions with an Azure Data Factory (ADF) pipeline.
I am new to Azure Data Factory (ADF), so I am not sure whether an ADF pipeline would be the better option or not.
Though my answer is a bit late, I would like to add that I would not recommend replacing your current setup with ADF. Reasons:
Cost: it is too expensive. ADF costs considerably more than Azure Functions.
Custom logic: ADF is not built to run cleansing logic or arbitrary custom code. Its primary purpose is data integration from external systems, using its vast connector pool.
Latency: ADF has much higher latency due to the large overhead of its job framework.
Based on your requirements, Azure Data Factory is the perfect option. You could follow this tutorial to configure the Cosmos DB output and the Azure Blob Storage input.
Its advantage over an Azure Function is that you don't need to write any custom code unless data cleaning is involved, and Azure Data Factory is the recommended option here; even if you need an Azure Function for other purposes, you can add it within the pipeline.
The fundamental use of Azure Data Factory is data ingestion. Azure Functions are serverless (Functions as a Service) and are best used for short-lived work; functions that run for many seconds become far more expensive. Azure Functions are good for event-driven microservices. For data ingestion, Azure Data Factory is the better option, as its running cost for large volumes of data is lower than Azure Functions. You can also integrate Spark processing pipelines in ADF for more advanced ingestion pipelines.
It also depends on your situation: Azure Functions are serverless, lightweight processes meant for quick responses to events, not for the volumetric workloads that batch processes handle.
So if your requirement is to respond quickly to an event with a small amount of data, stay with Azure Functions; if you need a batch process, switch to ADF.
Cost
Let's calculate the cost:
If your file is large:
copy duration 43:51 ≈ 43.867 h
4 (DIU) × 43.867 (h) × 0.25 ($/DIU-hour) = 43.867 $
43.867 $ / 7.514 GB ≈ 5.838 $/GB
If your file is small (2.497 MB), the copy takes about 45 seconds (counted as 1/60 h):
4 (DIU) × 1/60 (h) × 0.25 ($/DIU-hour) ≈ 0.0167 $
2.497 MB / 1024 ≈ 0.00244 GB
0.0167 $ / 0.00244 GB ≈ 6.844 $/GB
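The same arithmetic written out (figures are those quoted above):

```python
DIU = 4
PRICE = 0.25                              # $ per DIU-hour for the copy activity

# large file: 7.514 GB copied in roughly 43.867 hours
large_cost = DIU * 43.867 * PRICE         # ~43.87 $
print(large_cost / 7.514)                 # ~5.84 $/GB

# small file: 2.497 MB copied in ~45 seconds, counted as 1/60 of an hour
small_cost = DIU * (1 / 60) * PRICE       # ~0.0167 $
print(small_cost / (2.497 / 1024))        # ~6.84 $/GB
```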
Scale
The maximum number of instances an Azure Functions app can scale out to is 200.
ADF can run 3,000 concurrent external activities. In my test, only 1,500 copy activities actually ran in parallel. (This test wasted a lot of money.)

Is Azure Monitor a good store for custom application performance monitoring

We have legacy applications that currently write various runtime metrics (SQL call run times, API/HTTP request run times, etc.) to a local SQL DB.
Format: (source, event, data, executionduration)
We are moving away from storing those in the local SQL DB and are now publishing the same metrics to Azure Event Hubs.
We are looking for a good place to store these metrics for the purpose of monitoring the health of the application. A simple solution would be to store them in some DB and build a custom application to visualize the data in custom ways.
We are also considering using Azure Monitor for this purpose via data collector API (https://learn.microsoft.com/en-us/azure/azure-monitor/platform/data-collector-api)
QUESTION: Are there any issues with Azure Monitor that would prevent us from achieving this type of health monitoring?
Details
each event is small (few hundred characters)
expecting ~ 10 million events per day
retention of 1-2 days is enough
ability to aggregate old events per source per event is important (to have historical run time information)
Thank you
You can build simple graphs, and with the Log Analytics query language you can do just about any form of data analytics you need.
Here's a pretty good article on Monitor Visualizations.
learn.microsoft.com/en-us/azure/azure-monitor/log-query/charts
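For completeness, posting events through the Data Collector API mentioned in the question boils down to a signed HTTP POST. A rough sketch (Python; the workspace id, shared key and log type are placeholders, following the signature pattern described in the linked docs):

```python
import base64
import datetime
import hashlib
import hmac
import json

import requests

WORKSPACE_ID = "<workspace-id>"          # placeholder
SHARED_KEY = "<workspace-primary-key>"   # placeholder
LOG_TYPE = "AppRuntimeMetrics"           # becomes the AppRuntimeMetrics_CL table

def _signature(date: str, content_length: int) -> str:
    # HMAC-SHA256 signature scheme used by the Data Collector API.
    string_to_sign = f"POST\n{content_length}\napplication/json\nx-ms-date:{date}\n/api/logs"
    digest = hmac.new(base64.b64decode(SHARED_KEY),
                      string_to_sign.encode("utf-8"), hashlib.sha256).digest()
    return f"SharedKey {WORKSPACE_ID}:{base64.b64encode(digest).decode()}"

def post_events(events: list) -> None:
    body = json.dumps(events)
    date = datetime.datetime.utcnow().strftime("%a, %d %b %Y %H:%M:%S GMT")
    headers = {
        "Content-Type": "application/json",
        "Authorization": _signature(date, len(body)),
        "Log-Type": LOG_TYPE,
        "x-ms-date": date,
    }
    url = (f"https://{WORKSPACE_ID}.ods.opinsights.azure.com/api/logs"
           "?api-version=2016-04-01")
    requests.post(url, data=body, headers=headers).raise_for_status()

# Mirrors the (source, event, data, executionduration) format from the question.
post_events([{"source": "orders-api", "event": "sql_call",
              "data": "GetOrders", "executionduration": 125}])
```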

Azure Search: Frequency of calling indexer

For an Azure Search service, I am looking to get updates reflected as close to real time as possible. I can run the indexer via the REST API using a Logic App with a recurrence trigger.
Is there a catch to invoking the Logic App very frequently (every 3 seconds)?
Can the indexer get blocked because of too-frequent calls?
Is there any cost implication of this constant invocation (either on the Logic App or on Azure Search)?
I am trying to avoid building logic to determine the various scenarios in which the indexer needs to be called (it can get complex).
If you're OK with the indexer running every 5 minutes, you don't need to invoke it at all; it can run on a schedule.
If you need to invoke the indexer more frequently, you can run it once every 3 minutes on a free-tier service, and as often as you want on any paid tier. If the indexer is already running when you invoke it, the call will fail with a 409 status.
However, if you need to run the indexer very frequently, it's usually in response to new data arriving, and you already have that data in hand. In that case, it may be more efficient to index it directly using the push-style indexing API. See Add, update, or delete documents for the REST API reference, or use the Azure Search .NET SDK.
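As a rough illustration of the push-style approach via the REST API (the service name, index name, admin key and document shape are all placeholders to adapt to your index):

```python
import requests

SERVICE = "<search-service-name>"   # placeholder
INDEX = "<index-name>"              # placeholder
API_KEY = "<admin-api-key>"         # placeholder

def push_documents(docs):
    # POST to the documents endpoint; mergeOrUpload inserts or updates each doc.
    url = (f"https://{SERVICE}.search.windows.net/indexes/{INDEX}/docs/index"
           "?api-version=2020-06-30")
    payload = {"value": [{"@search.action": "mergeOrUpload", **d} for d in docs]}
    resp = requests.post(url, json=payload, headers={"api-key": API_KEY})
    resp.raise_for_status()
    return resp.json()

# Example: push a document as soon as the new data arrives, no indexer involved.
push_documents([{"id": "42", "title": "Newly arrived document"}])
```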

Pulling data asynchronously from third-party web service on Windows Azure Platform

I want to pull large amounts of data, frequently, from different third-party API web services and store it in a staging area (which is what I want to decide on right now), from where it will then be moved, piece by piece as required, into my application's database.
Can I use the Azure platform to achieve the above, and how well suited is it to this task?
What if the amount of data to be pulled is large and the pull frequency is high, i.e. perhaps half-hourly or hourly for 2,000 different users?
I assume that if this is possible at all, then bandwidth, data storage, server capacity, etc. will be Microsoft's concern rather than mine. And obviously, I should be able to access the data whenever I need it.
If I had to implement this on Windows Server, I know I would use a Windows service. But I don't know how it can be done on the Windows Azure platform, if it is possible at all.
As Rinat stated, you can use Lokad's solution. If you choose to do it yourself, you can run a timed task in your worker role: maybe spawn a thread that sleeps, waking every 30 minutes to perform its task. It can then reach out to the web services in question (or maybe one thread per web service?) and fetch data. You can store it temporarily in Azure Table Storage, which costs a fraction of SQL Azure ($0.15 per GB), and then easily read it out of Table Storage on demand and transfer it to SQL Azure.
Assuming your hosted services, storage and SQL Azure are in the same data center (by setting the affinity appropriately), you'd only pay for bandwidth when pulling data from the web service. There'd be no bandwidth charges to read from Table Storage or insert into SQL Azure.
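Purely as an illustration of the shape of that timed loop (this is not Worker Role code; the connection string, service list and fetch function are placeholders):

```python
import time

from azure.data.tables import TableServiceClient

CONNECTION = "<storage-connection-string>"   # placeholder
POLL_INTERVAL = 30 * 60                      # wake every 30 minutes
SERVICES = ["serviceA", "serviceB"]          # placeholder list of third-party APIs

def pull_from_service(service):
    """Hypothetical fetch from one third-party web service; returns dict rows."""
    ...

def run():
    tables = TableServiceClient.from_connection_string(CONNECTION)
    staging = tables.create_table_if_not_exists("Staging")
    while True:
        for service in SERVICES:
            for row in pull_from_service(service) or []:
                staging.upsert_entity(row)   # rows need PartitionKey/RowKey keys
        time.sleep(POLL_INTERVAL)
```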
In Windows Azure, a Worker Role is usually what hosts this kind of cloud processing. To accomplish your tasks you'll either need to implement the messaging/scheduling infrastructure yourself or use something like the Lokad.Cloud or Lokad.CQRS open-source projects for Azure.
We use Lokad.Cloud for distributed BI processing of hundreds of thousands of series, and Lokad.CQRS allows us to reliably retrieve and synchronize millions of products on a schedule.
There are samples, docs and community in both projects to get you started.
