Architecture design and role communication with Azure in file bound app - azure

I am considering moving my web application to Windows Azure for scalability purposes but I am wondering how best to partition my application.
I expect my scenario is typical and is as follows: my application allows users to upload raw data, this is processed and a report is generated. The user can then review their raw data and view their report.
So far I’m thinking a web role and a worker role. However, I understand that a VHD can be mounted to a single instance with read/write access so really both my web role and worker role need access to a common file store. So perhaps I need a web role and two separate worker roles, one worker role for the processing and the other for reading and writing to a file store. Is this a good approach?
I am having difficulty picturing the plumbing between the roles and am concerned about the overhead caused by communication between these partitions, so I would welcome any input here.

Adding to Stuart's excellent answer: Blobs can store anything, with sizes up to 200GB. If you needed / wanted to persist an entire directory structure that's durable, you can mount a VHD with just a few lines of code. It's an NTFS volume that your app can interact with, just like any other drive.
In your case, a vhd doesn't fit well, because your web app would have to mount a vhd and be the sole writer to it. And if you have more than one web role instance (which you would if you wanted the SLA and wanted to scale), you could only have one writer. In this case, individual blobs fit MUCH better.
As Stuart stated, this is a very normal and common pattern. And again, with only a few lines of code, you can call upon the storage sdk to copy a file from blob storage to your instance's local disk. Then you can process the file using regular File IO operations. When your report is complete, another few lines of code lets you copy your report into a new blob (most likely in a well-known container that the web role knows to look in).
You can take this a step further and insert rows into an Azure table that are partitioned by customer, with row key identifying the individual uploaded file, and a 3rd field representing the URI to the completed report. This makes it trivial for the web app to display a customer's completed reports.
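The table layout described above can be sketched as follows. This is a minimal in-memory model, not the storage SDK: `ReportEntity`, `ReportTable`, and the field names are hypothetical and chosen only to mirror the partition key (customer), row key (uploaded file), and report URI fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportEntity:
    partition_key: str  # customer id: groups one customer's rows together
    row_key: str        # identifies the individual uploaded file
    report_uri: str     # third field: URI of the completed report blob


class ReportTable:
    """In-memory stand-in for an Azure table, keyed by (partition key, row key)."""

    def __init__(self):
        self._rows = {}

    def upsert(self, entity):
        # Writing the same (partition key, row key) again replaces the old row.
        self._rows[(entity.partition_key, entity.row_key)] = entity

    def reports_for_customer(self, customer_id):
        """Return one customer's rows, i.e. a single-partition query."""
        return sorted(
            (e for (pk, _), e in self._rows.items() if pk == customer_id),
            key=lambda e: e.row_key,
        )
```

With this shape, listing a customer's completed reports is a lookup on one partition, which is why the web app side stays simple.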

Blob storage is the easiest place to store files which lots of roles and role instances can then access - with none of them requiring special access.
The normal pattern suggested seems to be:
allow the raw files to be uploaded using instances of a web role
these web role instances return the HTTP call without doing processing - they store the raw files in blob storage, and add a "do this work message" to a queue.
the worker role instances pick up the message from the queue, read the raw blob, do the work, store the report result, then delete the message from the queue
all the web roles can then access the report when the user asks for it
That's the "normal pattern suggested" and you can see it implemented in things like the photo upload/thumbnail generation apps from the very first Azure PDC - it's also used in this training course - follow through to the second page.
Of course, in practice you may need to build on this pattern depending on the size and type of data you are processing.
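The four steps of the pattern above can be sketched in a few lines. This is only an in-memory illustration: a dict stands in for blob storage and `queue.Queue` stands in for a storage queue, and every function name here (`handle_upload`, `make_report`, `process_one`, `get_report`) is hypothetical.

```python
import queue

blobs = {}                  # stand-in for blob storage: name -> bytes
work_queue = queue.Queue()  # stand-in for a storage queue


def handle_upload(name, raw_bytes):
    """Web role: store the raw file, enqueue a work message, return at once."""
    blobs["raw/" + name] = raw_bytes
    work_queue.put(name)


def make_report(raw_bytes):
    """Hypothetical processing step: here it just counts the lines."""
    return ("lines: %d" % len(raw_bytes.splitlines())).encode()


def process_one():
    """Worker role: take one message, build the report, then finish the message."""
    try:
        name = work_queue.get_nowait()
    except queue.Empty:
        return False
    blobs["reports/" + name] = make_report(blobs["raw/" + name])
    work_queue.task_done()  # plays the part of deleting the queue message
    return True


def get_report(name):
    """Web role: any instance can read the report, or gets None if not ready."""
    return blobs.get("reports/" + name)
```

The point of the split is that `handle_upload` never waits for `make_report`, so web instances stay responsive while workers scale independently.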

Related

Uploading data to Azure App Service's persistent storage (%HOME%)

We have a windows-based app service that requires a large dataset to run (files stored on Azure Blob Storage at around ~30GB). This data is static per app version, and therefore should be accessible to all instances across a given slot (a slot in our case represents a version).
Based on our initial research, it seems like Persistent Storage (%HOME%) would be the ideal place for this, since data stored there is shared across instances, but not across slots.
The next step now is to load the required data as part of our devops deployment pipeline, since the app service cannot operate without the underlying data. However, it seems like the %HOME% directory is only accessible by the app service itself, even though the underlying implementation is using Azure Storage.
At this point, we're considering having the app service download the data during its startup, but then we hit a snag which is that we have two instances. We could implement a Mutex (using blob lease) but this seems to us to be too complicated a solution for a simple need.
Any thoughts about how to best implement this?
The problems I see with loading the file on container startup are the following:
It's going to be really slow, and you might hit one of the built-in App Service timeouts.
Every time your container restarts, or you add another instance, it will re-download all the data, and it might cause issues with blocked writes because of file handle locks, which can make files or directories on %HOME% completely inaccessible for reading and modifying (I just had this happen to me).
For this I would rather suggest connecting the app to Azure Files by SMB, and for example have a directory per each version. This way you can connect to Azure Files and write the data during your build pipeline, and save an ENV variable or file that tells each slot which directory to get the current version's data from.
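A minimal sketch of the "directory per version" lookup, assuming the share is already mounted and that the pipeline sets an app setting named `DATA_VERSION` (a hypothetical name, as is `data_dir_for_slot`):

```python
import os
from pathlib import Path


def data_dir_for_slot(mount_root, env=os.environ):
    """Return the dataset directory for the version this slot is configured to use.

    DATA_VERSION is a hypothetical setting name written by the build pipeline.
    """
    version = env.get("DATA_VERSION")
    if not version:
        raise RuntimeError("DATA_VERSION is not set for this slot")
    path = Path(mount_root) / version
    if not path.is_dir():
        # Fail fast instead of starting the app without its data.
        raise FileNotFoundError("no data directory for version %s" % version)
    return path
```

Because each slot only reads its own directory, no instance has to download anything at startup and no locking between instances is needed.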

Azure LogicApp for migration of millions of files

I have the following requirements, where I consider using Azure LogicApp:
Files placed in Azure Blob Storage must be migrated into a custom place (it can be different from case to case)
Amount of files is something about 1 000 000
When the process is over, we should have a report saying how many records (files) failed
If the process stopped somewhere in the middle, the next run must take only files that have not been migrated
The process must be fast as it can be and files must be migrated within N hours
But what makes me worried is the fact that I cannot find any examples or articles (including official Azure Documentation) where the same thing is achieved by Azure LogicApp.
I have some ideas about my requirements and Azure Logic App:
I think that I must use pagination for dealing with this amount of files because Azure Logic App will not be able to read millions of file names - https://learn.microsoft.com/en-us/azure/logic-apps/logic-apps-exceed-default-page-size-with-pagination
I can add a record into Azure Table Storage to track failed migrations (something like creating a record to say that the process started and updating it when the file is moved to the destination)
I have no ideas how I can restart the Azure Logic App without using a custom tracking mechanism (for instance it can be the same Azure Table Storage instance)
And the question about splitting the work across several units is still open
Do you think that Azure Logic App is the right choice for my needs or I should consider something else? If Azure LogicApp can work for me, could you please share your thoughts and ideas on how I can achieve the given requirements?
I don't think Logic Apps is a good fit for this requirement, because roughly 1,000,000 files is too many for it to handle comfortably. For this requirement, I suggest using Azure Data Factory.
To migrate blob data with Data Factory, you can refer to this document
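Whichever tool does the copying, the tracking idea from the question (one status record per file, consulted on restart) can be sketched as below. This is an illustration only: `migrate` and `copy_file` are hypothetical names, and a plain dict stands in for the persistent store such as Azure Table Storage.

```python
def migrate(source_names, copy_file, status):
    """Copy every file not yet marked done; return (migrated, failed) counts.

    status maps file name -> "done" or "failed" and must persist between runs
    (the question suggests Azure Table Storage; a dict stands in here).
    """
    migrated = failed = 0
    for name in source_names:
        if status.get(name) == "done":
            continue  # already migrated by an earlier run
        try:
            copy_file(name)
        except Exception:
            status[name] = "failed"
            failed += 1
        else:
            status[name] = "done"
            migrated += 1
    return migrated, failed
```

A rerun skips everything already marked "done" and retries the rest, and the final failure report is just the set of names still marked "failed".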

Fastest Azure Blob Storage Replication method?

We have a situation where we now need to replicate all data from our Blob Storage account to another node in the same rough region (think WE Europe to NE Europe).
Currently we have only 1 application that writes content to certain containers, and other applications that write to other containers.
I'd ideally like to replicate all of them.
I've looked into the following:
Azure Blob Storage Geo-replication.
Azure Event Grid
Just writing the blobs in other applications themselves.
The issue with 1. is that the delivery time of the updated blob is not guaranteed. This could take up to 5 minutes, for example, which is too long in our use case. This is defined in their SLA.
The issue with 2. is that there is no guarantee of delivery time - only that delivery will take place at least once, whether it's to the intended destination, or a deadletter queue.
The issue with 3. is latency, cost & having to reengineer a few applications at once.
Expanding on 3. - we have an application that listens to messages published to Azure Service Bus. It acts upon those messages and creates blobs based on data that is stored in a DB on-site. It then uploads the blobs to the specific container they pertain to, and publishes a message saying it has done so.
The problem here is that we'd need to write from one data center, to another, incurring bandwidth charges, and ask this application to do double the work.
We would also need to ensure that blobs are written, and implement a method to ensure consistency - i.e. if one location goes down, we need to be able to replicate once it's back up.
So - what's the fastest replication method here? I'm assuming it's number 3. However, could there be another solution that has not been thought of so far?
Edit: 4. Have the generated files sent to Service Bus as messages and be picked up by applications dedicated to write, hosted in each geo-location.
Issue here is limitations on message size (makes sense!) we have files over 1MB.

How to store (and query) the MaxMind GeoIP2 database in Azure?

In an Azure Web App I need to efficiently query the MaxMind GeoIP2 City Database (due to the volume of queries and the latency requirements we cannot use the MaxMind's rest API).
I'm wondering what's the best approach for storing the db (binary MMDB format, accessed via the official .NET API) so that it's easy to update with minimal downtime (we are going to subscribe to monthly updates) and still cost effective with regard to Azure storage and transactions.
Apparently block blobs are the way to go, but I'm not sure about the monthly updates and the fact that the GeoIP2 API loads the whole db in memory (I do not know if this would be a problem for the Web App, or whether I need a worker role to keep it loaded or something else), and I do not know yet how large the file is.
What's the most cost effective solution that preserves low latency under a huge volume of queries?
According to the API docs you must have the database available in a file system (the API doesn't know anything about Azure storage and related REST API). So, regardless where you permanently store it, you'll need to have it on a disk somewhere.
I have no idea how large the database footprint is, but Web Apps, Cloud Services (web/worker roles) and Virtual Machines (whether Linux or Windows) all have local disks. And you have read/write access to these disks. So, you'd need to copy the database binary file (or csv) to local disk from somewhere. At this point, when you initialize the SDK, you'd create a DatabaseReader and point it to your locally-downloaded copy of the database file.
You mentioned storing the database in blob storage. There's nothing stopping you from doing so and simply downloading a copy to local disk. And there's nothing stopping you from storing multiple versions in multiple blobs. Note: You may also take advantage of Azure File storage (an SMB share). Which you choose is up to you.
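The "download a copy to local disk" step can be sketched as below, under some assumptions: `ensure_local_copy` is a hypothetical helper, `download` is a caller-supplied function that writes the file (for example by fetching a blob), and the file naming scheme is made up for illustration. Keeping one file per version means a monthly update is a new download next to the old file rather than an overwrite.

```python
import os
from pathlib import Path


def ensure_local_copy(version, download, cache_dir):
    """Return a local path for the given database version, downloading it once.

    download(dest_path) is a hypothetical callable that writes the file,
    for example by fetching a blob named after the version.
    """
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    final = cache / ("GeoIP2-City-%s.mmdb" % version)
    if final.exists():
        return final  # already downloaded on this instance
    partial = final.with_name(final.name + ".part")
    download(partial)
    os.replace(partial, final)  # rename into place so readers never see a partial file
    return final
```

The returned path is what you would then hand to the reader when you initialize it.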
As far as most cost effective solution: You'll need to do the pricing workup yourself to see what's most effective. You'd also need to evaluate how much RAM is available for the given size VM/role instance/Web App you choose. You mentioned Web Apps in your question: Web App instances scale from 0.5GB to 14GB, depending on the tier you choose (again, you'll need to evaluate this).

Multiple Instances of Azure Worker Roles for non-transaction integration tasks

We have an upcoming project where we'll need to integrate with 3rd parties over a variety of transports to get data from them.
Things like WCF Endpoints & Web API Rest Endpoints are fine.
However, in 2 scenarios we'll need to either pick up auto-generated emails containing XML from a POP3 account OR pull the XML files from an external SFTP account.
I'm about to start prototyping these now, but I'm wondering are there any standard practices, patterns or guidelines about how to deal with these non-transactional systems, in a multi-instance worker role environment. i.e.
What happens if 2 workers connect to the POP3 account, or to the same FTP server, at the same time?
What happens if 1 worker deletes the file from the FTP server while another is in mid-download?
Controlling duplication shouldn't be an issue, as we'll be logging everything on application side to a database, and everything should be uniquely identifiable so we'll be able to add if-not-exists-create-else-skip logic to the workers but I'm just wondering is there anything else I should be considering to make it more resilient/idempotent.
Just thinking out loud: since the data is primarily files and emails, one possible approach is to save them in blob storage first instead of processing them directly in your worker roles. So there would be some worker role instances which periodically poll the POP3 server / SFTP site, pull the data from there, and push it into blob storage. Once the blob is written, the same instance can delete the data from the source as well. With this approach you don't have to worry about duplicate records, because the blob will simply be overwritten (assuming each message/file has a unique identifier and the name of the blob is that identifier).
Once the file is in your blob storage, you can write a message to a Windows Azure Queue which has details about this blob (maybe the blob URL etc.). Then, using the 'Get' semantics of Windows Azure Queues, your worker role instances start fetching and processing these messages. Because of the Get semantics, once a message is fetched from the queue it becomes invisible to other callers (worker role instances in this case) for a period of time. This way you can avoid duplicate message processing.
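The 'Get' behaviour described above can be modelled in memory as follows. This is not the queue SDK: `VisibilityQueue` and its methods are hypothetical, and time is passed in explicitly as `now` so the behaviour is easy to follow.

```python
import itertools


class VisibilityQueue:
    """In-memory model of a queue whose fetched messages become invisible for a while."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._messages = {}  # id -> [body, time at which it is visible again]

    def put(self, body):
        self._messages[next(self._ids)] = [body, 0.0]

    def get(self, now, visibility_timeout=30.0):
        """Return (id, body) of the first visible message and hide it, or None."""
        for msg_id, entry in self._messages.items():
            if entry[1] <= now:
                entry[1] = now + visibility_timeout
                return msg_id, entry[0]
        return None

    def delete(self, msg_id):
        """Call after successful processing; otherwise the message reappears."""
        self._messages.pop(msg_id, None)
```

Note that a message comes back if the worker does not delete it before the timeout, so the processing step should still be safe to run twice, which fits the if-not-exists-create-else-skip logic mentioned in the question.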
UPDATE
So I'm trying to combat against two competing instances pulling the same file at the same moment from the SFTP
For this, I'll pitch my favorite Master/Slave Concept:). Essentially the idea is that each instance will try to acquire a lease on a single blob. The instance which acquires the lease becomes the master and others slave. Master would fetch the data from SFTP while slaves will wait. I've described this concept in my blog post which you can read here: http://gauravmantri.com/2013/01/23/building-a-simple-task-scheduler-in-windows-azure/, though the context of the blog is somewhat different.
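The lease idea can be sketched with an in-memory model. This is an illustration of the concept only, not the blob lease API: `BlobLease`, `try_acquire`, and `poll_sftp_if_master` are hypothetical names, and the current time is passed in as `now`.

```python
class BlobLease:
    """In-memory model of an exclusive, time-limited lease on a single blob."""

    def __init__(self):
        self._holder = None
        self._expires_at = 0.0

    def try_acquire(self, instance_id, now, duration=60.0):
        """Acquire or renew the lease; return True if instance_id holds it."""
        if self._holder in (None, instance_id) or now >= self._expires_at:
            self._holder = instance_id
            self._expires_at = now + duration
            return True
        return False


def poll_sftp_if_master(lease, instance_id, now, fetch):
    """Only the lease holder fetches; every other instance waits."""
    if lease.try_acquire(instance_id, now):
        fetch()
        return True
    return False
```

If the master instance dies, its lease expires and another instance takes over on its next attempt, so the SFTP pull keeps a single active reader without manual failover.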
Have a look at the recently released Cloud Design Patterns. You might be able to find the corresponding pattern and sample code for what you need.

Resources