Azure Logic App for migration of millions of files

I have the following requirements, for which I am considering Azure Logic Apps:
Files placed in Azure Blob Storage must be migrated to a custom destination (which can differ from case to case)
The number of files is roughly 1,000,000
When the process is over, we should have a report saying how many records (files) failed
If the process stops somewhere in the middle, the next run must pick up only the files that have not yet been migrated
The process must be as fast as possible, and the files must be migrated within N hours
What worries me is that I cannot find any examples or articles (including the official Azure documentation) where the same thing is achieved with Azure Logic Apps.
I have some ideas about my requirements and Azure Logic Apps:
I think I must use pagination to deal with this number of files, because a Logic App will not be able to read millions of file names in one go - https://learn.microsoft.com/en-us/azure/logic-apps/logic-apps-exceed-default-page-size-with-pagination
I can add a record to Azure Table Storage to track failed migrations (something like creating a record to say that the process started, and updating it when the file is moved to the destination)
I have no idea how I can resume the Logic App without a custom tracking mechanism (for instance, it could be the same Azure Table Storage instance) - see the sketch after this list
The question of splitting the work across several processing units is also still open
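To make the tracking idea concrete, here is a rough sketch outside of Logic Apps, using the Python storage SDKs: paged listing, one state record per file in Table Storage, and a server-side copy. The container names, table name, and connection-string setting are placeholders, not part of the question; files left in the "Started" state after a run are the ones to count as failed.

```python
import os

from azure.core.exceptions import ResourceNotFoundError
from azure.data.tables import TableClient
from azure.storage.blob import BlobServiceClient

conn = os.environ["STORAGE_CONNECTION_STRING"]               # assumed setting name
blob_service = BlobServiceClient.from_connection_string(conn)
source = blob_service.get_container_client("source")         # hypothetical container names
dest = blob_service.get_container_client("dest")
state = TableClient.from_connection_string(conn, table_name="MigrationState")  # table assumed to exist


def already_migrated(name: str) -> bool:
    try:
        entity = state.get_entity(partition_key="run-1", row_key=name.replace("/", "|"))
        return entity.get("Status") == "Done"
    except ResourceNotFoundError:
        return False


def migrate(blob_name: str) -> None:
    key = blob_name.replace("/", "|")                         # RowKey may not contain '/'
    state.upsert_entity({"PartitionKey": "run-1", "RowKey": key, "Status": "Started"})
    src_url = source.get_blob_client(blob_name).url           # assumes the copy is authorised (same account or SAS)
    dest.get_blob_client(blob_name).start_copy_from_url(src_url)
    # For large blobs you would poll the copy status here before marking the file done.
    state.upsert_entity({"PartitionKey": "run-1", "RowKey": key, "Status": "Done"})


# Paged listing keeps memory flat even with ~1,000,000 blobs.
for page in source.list_blobs(results_per_page=5000).by_page():
    for blob in page:
        if not already_migrated(blob.name):
            migrate(blob.name)
```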
Do you think Azure Logic Apps is the right choice for my needs, or should I consider something else? If Azure Logic Apps can work for me, could you please share your thoughts and ideas on how I can achieve the given requirements?

I don't think a Logic App is a good solution for implementing this requirement: roughly 1,000,000 files is simply too many for it to handle. For this requirement, I suggest you use Azure Data Factory.
To migrate data in Azure Blob Storage with Data Factory, you can refer to this document
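If you script the Data Factory setup, the pipeline is essentially a single Copy activity. A rough sketch with the azure-mgmt-datafactory Python SDK, assuming the blob datasets and linked services already exist; the subscription, resource group, factory, dataset, and pipeline names below are placeholders:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    BlobSink, BlobSource, CopyActivity, DatasetReference, PipelineResource,
)

subscription_id = "<subscription-id>"          # placeholder
rg_name, df_name = "my-rg", "my-adf"           # placeholders

adf = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# One Copy activity from a source blob dataset to a destination blob dataset.
copy = CopyActivity(
    name="CopyBlobs",
    inputs=[DatasetReference(reference_name="SourceBlobDataset")],
    outputs=[DatasetReference(reference_name="DestBlobDataset")],
    source=BlobSource(),
    sink=BlobSink(),
)
adf.pipelines.create_or_update(rg_name, df_name, "MigrateBlobs",
                               PipelineResource(activities=[copy]))
run = adf.pipelines.create_run(rg_name, df_name, "MigrateBlobs", parameters={})
print("started pipeline run", run.run_id)
```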

Related

Uploading data to Azure App Service's persistent storage (%HOME%)

We have a Windows-based App Service that requires a large dataset to run (roughly 30 GB of files stored in Azure Blob Storage). This data is static per app version and therefore should be accessible to all instances across a given slot (a slot in our case represents a version).
Based on our initial research, it seems like Persistent Storage (%HOME%) would be the ideal place for this, since data stored there is shared across instances, but not across slots.
The next step now is to load the required data as part of our devops deployment pipeline, since the app service cannot operate without the underlying data. However, it seems like the %HOME% directory is only accessible by the app service itself, even though the underlying implementation is using Azure Storage.
At this point, we're considering having the app service download the data during its startup, but then we hit a snag which is that we have two instances. We could implement a Mutex (using blob lease) but this seems to us to be too complicated a solution for a simple need.
Any thoughts about how to best implement this?
The problems I see with loading the data on container startup are the following:
It's going to be really slow, and you might hit one of the built-in App Service timeouts.
Every time your container restarts, or you add another instance, it will re-download all the data, and it might cause issues with blocked writes because of file handle locks, which can make files or directories on %HOME% completely inaccessible for reading and modifying (this just happened to me).
Instead, I would suggest connecting the app to Azure Files over SMB and, for example, having a directory per version. This way you can connect to Azure Files and write the data during your build pipeline, and save an environment variable or a file that tells each slot which directory to read the current version's data from.
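A rough sketch of the build-pipeline half of that suggestion, using the azure-storage-file-share Python package; the share name, directory layout, local data path, and environment variable names are made up for illustration:

```python
import os

from azure.storage.fileshare import ShareClient

conn = os.environ["FILES_CONNECTION_STRING"]     # assumed pipeline secret name
version = os.environ["BUILD_VERSION"]            # e.g. "1.4.2"

share = ShareClient.from_connection_string(conn, share_name="appdata")
directory = share.create_directory(f"v{version}")   # one directory per app version; run once per build

for name in os.listdir("./dataset"):                 # hypothetical local data folder
    with open(os.path.join("./dataset", name), "rb") as handle:
        directory.upload_file(name, handle)

# The app then reads an app setting (e.g. DATA_VERSION=v1.4.2) to know which
# directory on the mounted share holds the data for that slot's version.
```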

How can I find the source of my Hot LRS Write Operations on Azure Storage Account?

We are using an Azure Storage account to store some files that are downloaded by our app on demand by users.
Even though there should be no write operations (at least none I can think of), we exceed the included write operations just a few days into the billing period (see image).
Price-wise it's still within limits, but I'd still like to know whether this is normal and how I can analyze the matter. Besides the storage we are using
Functions and
App Service (mobile app)
but none of them should cause that many write operations. I've checked the logs of our functions, and none of those that access the queues or the blobs have been active lately. There are some functions that run every now and then, but only once every few minutes, and those do not access the storage at all.
I don't know if this is related, but there is a kind of periodic ingress on our blob storage (see the image below). The period is roughly 1 hour, but there is a baseline of 100 kB per 5 minutes.
Analyzing the metrics of the storage account further, I found that there is a constant stream of 1.90k transactions per hour for blobs and 1.3k transactions per hour for queues, which seems quite exceptional to me. (Please note that the resolution of this graph is 1 hour, while the former has a resolution of 5 minutes.)
Is there anything else I can do to analyze where the write operations come from? It kind of bothers me, since it does not seem as if it's supposed to be like that.
I've had the exact same problem; after enabling Storage Analytics and inspecting the $logs container, I found many log entries indicating that upon every request towards my Azure Functions, these write operations occur against the following container object:
https://[function-name].blob.core.windows.net:443/azure-webjobs-hosts/locks/linkfunctions/host?comp=lease
In my Azure Functions code I do not explicitly write to any container or file, but I have the following two application settings configured:
AzureWebJobsDashboard
AzureWebJobsStorage
So I filed a support ticket with Azure with the following questions:
Are the write operations triggered by these application settings? I believe so, but could you please confirm?
Will the write operations stop if I delete these application settings?
Could you please describe, at a high level, in what context these operations occur (e.g. logging, resource locking, other)?
and I got the following answers from Azure support team, respectively:
Yes, you are right. According to the log information, we can see “https://[function-name].blob.core.windows.net:443/azure-webjobs-hosts/locks/linkfunctions/host?comp=lease”.
This azure-webjobs-hosts folder is associated with the function app and is created by default when the function app is created. When the function app is running, it records these logs in the storage account configured with AzureWebJobsStorage.
You can't stop the write operations, because they record logs that the Azure Functions runtime needs. Please do not remove the application setting AzureWebJobsStorage. The Azure Functions runtime uses this storage account connection string for all functions except HTTP-triggered functions. Removing this application setting will leave your function app unable to start. By the way, you can remove AzureWebJobsDashboard; that will stop the Monitor feature rather than the operations above.
These operations record runtime logs of the function app. They occur when the backend allocates an instance for running the function app.
Best place to find information about storage usage is to make use of Storage Analytics especially Storage Analytics Logging.
There's a special blob container called $logs in the same storage account which will have detailed information about every operation performed against that storage account. You can view the blobs in that blob container and find the information.
If you don't see this blob container in your storage account, then you will need to enable storage analytics on your storage account. However considering you can see the metrics data, my guess is that it is already enabled.
Regarding the source of these write operations, have you enabled diagnostics for your Functions and App Service? These write diagnostic logs to blob storage. Storage Analytics also writes to the same account, and that will also cause write operations.
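If you want to go through $logs programmatically rather than eyeball the blobs, a small sketch along these lines can surface write-like operations. The field positions follow the classic Storage Analytics log format and the connection-string setting name is assumed; double-check both against your own logs:

```python
import os

from azure.storage.blob import BlobServiceClient

conn = os.environ["STORAGE_CONNECTION_STRING"]               # assumed setting name
logs = BlobServiceClient.from_connection_string(conn).get_container_client("$logs")

# Classic Storage Analytics log lines are semicolon-delimited; in format 1.0 the
# operation type is the third field and the object key is around the 13th -
# verify these positions against the version column in your own logs.
for blob in logs.list_blobs(name_starts_with="blob/"):
    text = logs.download_blob(blob.name).readall().decode("utf-8", errors="replace")
    for line in text.splitlines():
        fields = line.split(";")
        op = fields[2] if len(fields) > 2 else ""
        if op.startswith("Put") or op in ("CopyBlob", "SetBlobMetadata", "LeaseBlob"):
            print(fields[1], op, fields[12] if len(fields) > 12 else "")
```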
In my case, I had an Azure Application Insights resource that generated about 10K transactions per minute on its storage account for Functions and App Services, even though there were only a few HTTP requests among them. I'm not sure what triggered them, but once I removed Application Insights, everything went back to normal.

Monitor the amount of blobs entering into an Azure container

Basically I have a storage account with a container that contains blobs of unhandled errors. My task is to somehow generate a metric that can show how many blobs were uploaded to that container every hour. I tried using the Azure built-in metrics, but it seems like that might limit me to the entire storage account and not just the one container. I did some research on Power BI and thought that might be a good place to start, but again I came up empty.
If anyone has a good starting place for me, that would be incredible. I'm assuming this will end up being something that requires some SQL queries, or perhaps something I can do programmatically in Visual Studio. Apologies if this was posted in the wrong place, but it seemed like the best fit in my opinion.
Thanks!
You should take a look at Azure Event Grid with its Blob Storage integration. In short, whenever a blob is created, an event is raised by Azure Event Grid. You can consume this event and post the event data to an HTTP endpoint (or call an Azure Function), which can save information about the event in some persistent storage (Azure Tables, for example). You can then create reports by querying this data.
For more information about this, you may find this link helpful: https://learn.microsoft.com/en-us/azure/storage/blobs/storage-blob-event-overview.
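As one possible shape for that, here is a rough sketch of a Python Azure Function with an Event Grid trigger that records each BlobCreated event in Table Storage, bucketed by hour; the table name and the choice of the AzureWebJobsStorage connection string are illustrative assumptions:

```python
import os

import azure.functions as func
from azure.data.tables import TableClient


def main(event: func.EventGridEvent):
    # Only count blob-creation events.
    if event.event_type != "Microsoft.Storage.BlobCreated":
        return
    url = event.get_json().get("url", "")

    # Assumes the "BlobArrivals" table already exists in the storage account.
    table = TableClient.from_connection_string(
        os.environ["AzureWebJobsStorage"], table_name="BlobArrivals")
    table.upsert_entity({
        "PartitionKey": event.event_time.strftime("%Y-%m-%dT%H"),  # one partition per hour
        "RowKey": event.id,                                        # event id keeps the insert idempotent
        "BlobUrl": url,
    })
    # The hourly count is then a single partition query: PartitionKey eq '<hour>'.
```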

Architecture design and role communication with Azure in file bound app

I am considering moving my web application to Windows Azure for scalability purposes but I am wondering how best to partition my application.
I expect my scenario is typical and is as follows: my application allows users to upload raw data, this is processed and a report is generated. The user can then review their raw data and view their report.
So far I’m thinking of a web role and a worker role. However, I understand that a VHD can only be mounted to a single instance with read/write access, yet both my web role and worker role need access to a common file store. So perhaps I need a web role and two separate worker roles, one worker role for the processing and the other for reading and writing to a file store. Is this a good approach?
I am having difficulty picturing the plumbing between the roles and am concerned about the overhead caused by the communication this partitioning introduces, so I would welcome any input here.
Adding to Stuart's excellent answer: Blobs can store anything, with sizes up to 200GB. If you needed / wanted to persist an entire directory structure that's durable, you can mount a VHD with just a few lines of code. It's an NTFS volume that your app can interact with, just like any other drive.
In your case, a VHD doesn't fit well, because your web app would have to mount the VHD and be its sole writer. And if you have more than one web role instance (which you would if you wanted the SLA and wanted to scale), you could only have one writer. In this case, individual blobs fit much better.
As Stuart stated, this is a very normal and common pattern. And again, with only a few lines of code, you can call on the storage SDK to copy a file from blob storage to your instance's local disk. Then you can process the file using regular file I/O operations. When your report is complete, another few lines of code lets you copy your report into a new blob (most likely in a well-known container that the web role knows to look in).
You can take this a step further and insert rows into an Azure table, partitioned by customer, with the row key identifying the individual uploaded file and a third property holding the URI of the completed report. This makes it trivial for the web app to display a customer's completed reports.
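A sketch of that table layout with the current azure-data-tables Python package (the modern equivalent of the table client available back then); the customer id, file name, report URI, table name, and connection-string setting are all invented for illustration:

```python
import os

from azure.data.tables import TableClient

conn = os.environ["STORAGE_CONNECTION_STRING"]                      # assumed setting name
table = TableClient.from_connection_string(conn, table_name="Reports")

# One row per uploaded file, partitioned by customer.
table.upsert_entity({
    "PartitionKey": "customer-42",                                   # hypothetical customer id
    "RowKey": "upload-0001.csv",                                     # identifies the uploaded file
    "ReportUri": "https://<account>.blob.core.windows.net/reports/upload-0001-report.pdf",
})

# Listing a customer's completed reports is a single partition scan.
for row in table.query_entities("PartitionKey eq 'customer-42'"):
    print(row["RowKey"], row["ReportUri"])
```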
Blob storage is the easiest place to store files which lots of roles and role instances can then access - with none of them requiring special access.
The normal pattern suggested seems to be:
allow the raw files to be uploaded using instances of a web role
these web role instances return the HTTP response without doing the processing - they store the raw files in blob storage and add a "do this work" message to a queue.
the worker role instances pick up the message from the queue, read the raw blob, do the work, store the report result, then delete the message from the queue
all the web roles can then access the report when the user asks for it
That's the "normal pattern suggested" and you can see it implemented in things like the photo upload/thumbnail generation apps from the very first Azure PDC - its also used in this training course - follow through to the second page.
Of course, in practice you may need to build on this pattern depending on the size and type of data you are processing.
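A compressed sketch of that blob-and-queue handoff with the current Python storage SDKs; the queue name, container names, message format, and connection-string setting are all invented for illustration, and the processing step is a stand-in:

```python
import json
import os

from azure.storage.blob import BlobServiceClient
from azure.storage.queue import QueueClient

conn = os.environ["STORAGE_CONNECTION_STRING"]    # assumed setting name
blobs = BlobServiceClient.from_connection_string(conn)
queue = QueueClient.from_connection_string(conn, "work-items")


# Web role side: store the raw upload, then enqueue a "do this work" message.
def accept_upload(file_name: str, data: bytes) -> None:
    blobs.get_container_client("raw").upload_blob(file_name, data)
    queue.send_message(json.dumps({"blob": file_name}))


# Worker role side: pick up a message, process, store the report, delete the message.
def process_pending() -> None:
    for msg in queue.receive_messages():
        work = json.loads(msg.content)
        raw = blobs.get_container_client("raw").download_blob(work["blob"]).readall()
        report = raw.upper()                                   # stand-in for the real processing
        blobs.get_container_client("reports").upload_blob(work["blob"] + ".report", report)
        queue.delete_message(msg)
```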

Pulling data asynchronously from third-party web service on Windows Azure Platform

I want to pull a large amount of data, frequently, from different third-party API web services and store it in a staging area (this is what I want to decide right now), from where it will then be moved, piece by piece, into my application's database as required.
I wanted to know whether I can use the Azure platform to achieve the above, and how well suited the Azure platform is for this task.
What if the amount of data to be pulled is large and the frequency of the pulls is high, i.e. perhaps half-hourly or hourly for 2,000 different users?
I assume that if this is possible at all, then the bandwidth, data storage, server capability, etc. will not be something for me to worry about, but rather for Microsoft. And obviously, I should be able to access the data back whenever I need it.
If I had to implement this on Windows Servers, I know I would use a Windows service to do it. But I don't know how it can be done on the Windows Azure platform, if it is possible at all.
As Rinat stated, you can use Lokad's solution. If you choose to do it yourself, you can run a timed task in your worker role - maybe spawn a thread that sleeps, waking every 30 minutes to perform its task. It can then reach out to the web services in question (or maybe one thread per web service?) and fetch data. You can store it temporarily in Azure Table Storage, which costs a fraction of SQL Azure ($0.15 per GB), and then easily read it out of Table Storage on demand and transfer it to SQL Azure.
Assuming your hosted services, storage, and SQL Azure are in the same data center (by setting the affinity appropriately), you'd only pay for bandwidth when pulling data from the web service. There'd be no bandwidth charges to retrieve from Table Storage or insert into SQL Azure.
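If you go the do-it-yourself route, the worker's inner loop is essentially a timer plus a write to Table Storage. A rough modern-Python sketch; the endpoint, table name, partition/row keys, and connection-string setting are invented for illustration:

```python
import json
import os
import time

import requests
from azure.data.tables import TableClient

POLL_INTERVAL_SECONDS = 30 * 60


def poll_once(table: TableClient) -> None:
    # Hypothetical third-party endpoint; in practice one call per user or feed.
    response = requests.get("https://api.example.com/v1/data", timeout=30)
    response.raise_for_status()
    for item in response.json():
        table.upsert_entity({
            "PartitionKey": "example-feed",       # e.g. one partition per user or source
            "RowKey": str(item["id"]),            # assumes each record has a unique id
            "Payload": json.dumps(item),          # staged as-is; moved to SQL later
        })


if __name__ == "__main__":
    conn = os.environ["STORAGE_CONNECTION_STRING"]   # assumed setting name
    staging = TableClient.from_connection_string(conn, table_name="Staging")
    while True:
        poll_once(staging)
        time.sleep(POLL_INTERVAL_SECONDS)
```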
In Windows Azure, it's usually a Worker Role that hosts this kind of cloud processing. To accomplish your tasks you'll either need to implement this messaging/scheduling infrastructure yourself or use something like the Lokad.Cloud or Lokad.CQRS open-source projects for Azure.
We use Lokad.Cloud for distributed BI processing of hundreds of thousands of series, and Lokad.CQRS lets us reliably retrieve and synchronize millions of products on schedule.
There are samples, docs and community in both projects to get you started.
