Optimal architecture for processing and storing large data blocks quickly - Azure

I have a situation involving an MVC app to which a potentially large number of data chunks, each up to 32 MB, are uploaded. After each chunk is uploaded, it needs to be processed and a response sent before the client browser uploads the next chunk.
Ultimately the data and the results of its processing need to be stored on Azure storage. The data processing is CPU intensive. Given that transferring this amount of data takes an appreciable amount of time, I am looking to reduce the number of trips the data needs to do between machines, as well as move the work out of the web server threads.
Currently this is done by queuing up the jobs which are consumed by a single worker thread.
However, this process needs to be upgraded so that it runs an executable to do the heavy work.
At the end of processing, the data is uploaded to Azure Blob storage. So the data already needs to be transferred twice over the network (AFAIK) before the response is sent. Not ideal.
I am aware of the different queuing options in Azure, but am wary of making the situation worse rather than better. I don't want to overkill this problem, but do need to make the entire process run as quickly and efficiently as possible.
a) What kind of data transfer speeds can I expect between an Azure Web Role and Worker Role in a Cloud Service?
b) Is there any way to transfer the data directly to Azure storage and then process it there, without transferring it again?
c) Can / Will the worker role and web role actually run on the same machine?
d) Can I just run the .exe from inside the web app? How do I get its path?

I would suggest a workflow similar to:
Client uploads data directly to Blob storage (in smaller chunks as per this guide)
When the upload is finished, the client notifies your web service, and the web service posts a message on a Service Bus queue (jobQueue). The message contains a unique session identifier and the blob URL of the uploaded data. The web service then blocks and listens on another Service Bus queue (replyQueue) for the reply message with the specified sessionId.
A multi-threaded worker role long-polls the Service Bus jobQueue; each message it receives is processed, the processed data is stored somewhere, and a reply message is created and posted to the replyQueue with the sessionId set.
The web service will then receive the reply message (for the given sessionId) and can return a result to your client.
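If you go with this blocking request/reply variant, a minimal sketch of the web-service side could look like the following, assuming the classic Microsoft.ServiceBus.Messaging SDK; the queue names, the BlobUrl/ResultUrl properties and the timeouts are illustrative, and replyQueue must be created with sessions enabled:

```csharp
// Minimal request/reply sketch using the classic Service Bus SDK
// (Microsoft.ServiceBus.Messaging). Queue names and the BlobUrl/ResultUrl
// properties are illustrative assumptions; the replyQueue must have
// sessions enabled for AcceptMessageSession to work.
using System;
using Microsoft.ServiceBus.Messaging;

class JobDispatcher
{
    readonly QueueClient _jobQueue;
    readonly QueueClient _replyQueue;

    public JobDispatcher(string connectionString)
    {
        _jobQueue = QueueClient.CreateFromConnectionString(connectionString, "jobQueue");
        _replyQueue = QueueClient.CreateFromConnectionString(connectionString, "replyQueue");
    }

    // Called by the web service once the client reports the upload is done.
    public string ProcessUploadedBlob(string blobUrl)
    {
        var sessionId = Guid.NewGuid().ToString();

        var request = new BrokeredMessage();
        request.Properties["BlobUrl"] = blobUrl;
        request.ReplyToSessionId = sessionId;      // worker echoes this back as SessionId
        _jobQueue.Send(request);

        // Block until the worker posts the reply for this session (or time out).
        MessageSession session = _replyQueue.AcceptMessageSession(sessionId, TimeSpan.FromMinutes(5));
        BrokeredMessage reply = session.Receive(TimeSpan.FromMinutes(5));
        if (reply == null) throw new TimeoutException("Worker did not reply in time.");

        var resultUrl = (string)reply.Properties["ResultUrl"];
        reply.Complete();
        session.Close();
        return resultUrl;
    }
}
```

On the worker side, each reply would be sent with its SessionId set to the request's ReplyToSessionId, so AcceptMessageSession picks up only that session's reply.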
With an architecture similar to this you can scale vertically by using a bigger machine for your worker role, or horizontally by adding more instances of the worker role.
To make the process a bit more resilient, you may want to return to the client immediately after it has notified the web service of the uploaded data, and then signal the client directly from the worker role using SignalR when the data has been processed.
My answers to the other parts of your question are:
a) I'm unsure what the guarantees of data transfer speeds are between the roles
b) Yes you can transfer data directly to Blob Storage, and then from the Blob to the Worker Role
c) You can run Worker Role-style processing on the Web Role: call your Worker Role-style code from WebRole.OnStart and WebRole.Run (see the sketch below); then, as you need to scale, this code can be moved to its own dedicated Worker Role.
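For (c), a rough sketch of hosting the processing loop inside the Web Role's role entry point, assuming a hypothetical ProcessJobQueue() helper:

```csharp
// Sketch of Worker Role-style processing hosted inside the Web Role.
// ProcessJobQueue() is a hypothetical placeholder for your own job loop.
using System.Threading;
using Microsoft.WindowsAzure.ServiceRuntime;

public class WebRole : RoleEntryPoint
{
    public override bool OnStart()
    {
        // One-time initialization (e.g. configure storage/queue clients) goes here.
        return base.OnStart();
    }

    public override void Run()
    {
        // Runs alongside IIS on the same instance; keep looping for the role's lifetime.
        while (true)
        {
            ProcessJobQueue();      // hypothetical: pull and handle one batch of jobs
            Thread.Sleep(1000);
        }
    }

    void ProcessJobQueue()
    {
        // Placeholder for the same code you would otherwise host in a dedicated Worker Role.
    }
}
```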

Related

Fastest Azure Blob Storage Replication method?

We have a situation where we now need to replicate all data from our Blob Storage account to another node in the same rough region (think West Europe to North Europe).
Currently we have only 1 application that writes content to certain containers, and other applications that write to other containers.
I'd ideally like to replicate all of them.
I've looked into the following:
1. Azure Blob Storage geo-replication
2. Azure Event Grid
3. Just writing the blobs in the other applications themselves
The issue with 1. is that the delivery time of the updated blob is not guaranteed. It could take up to 5 minutes, for example, which is too long in our use case. This is defined in their SLA.
The issue with 2. is that there is no guarantee of delivery time - only that delivery will take place at least once, whether it's to the intended destination, or a deadletter queue.
The issue with 3. is latency, cost & having to reengineer a few applications at once.
Expanding on 3.: we have an application that listens to messages published to Azure Service Bus. It acts upon those messages and creates blobs based on data stored in an on-site DB. It then uploads each blob to the specific container it pertains to, and publishes a message saying it has done so.
The problem here is that we'd need to write from one data center, to another, incurring bandwidth charges, and ask this application to do double the work.
We would also need to ensure that blobs are written, and implement a method to ensure consistency - i.e. if one location goes down, we need to be able to replicate once it's back up.
So - what's the fastest replication method here? I'm assuming it's number 3. However, could there be another solution that has not been thought of so far?
Edit: 4. Have the generated files sent to Service Bus as messages and picked up by dedicated writer applications hosted in each geo-location.
The issue here is the limitation on message size (which makes sense!); we have files over 1 MB.
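A hedged sketch of a claim-check style variant of 4 - putting only the blob URI in the Service Bus message and letting a dedicated writer in each geo-location do a server-side copy - assuming the classic Service Bus and Storage SDKs (the property names and "mirror" container are illustrative):

```csharp
// Claim-check variant of option 4: the message carries only the blob URI,
// so the message-size limit no longer matters. Property/container names are
// illustrative assumptions.
using System;
using Microsoft.ServiceBus.Messaging;
using Microsoft.WindowsAzure.Storage.Blob;

static class RegionalWriter
{
    public static void HandleMessage(BrokeredMessage message, CloudBlobContainer mirrorContainer)
    {
        // The source URI must be readable by the copy service (public or carrying a SAS).
        var sourceUri = new Uri((string)message.Properties["BlobUri"]);
        var blobName = (string)message.Properties["BlobName"];

        CloudBlockBlob destination = mirrorContainer.GetBlockBlobReference(blobName);
        destination.StartCopy(sourceUri);   // StartCopyFromBlob in older storage SDKs

        message.Complete();
    }
}
```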

Ways to make a broker at Azure for anonymous HTTP API messages?

We need an API on Azure that stores messages sent to it (a broker) via HTTP in case my system (a Cloud Service) is unavailable or the DB is busy. It's not easy to change exactly what message will be sent. What are the ways to build such a broker on Azure?
Service Bus Queue looks interesting but it needs Shared Access Signatures as far as I understand.
Another Web Role could be a solution, but it takes time to implement.
A Virtual Machine with some tool (MSMQ?) seems like an option, but it requires maintenance.
What do you think?
Your scenario is a prime candidate for applying a Queue-Centric Work Pattern.
From http://www.asp.net/aspnet/overview/developing-apps-with-windows-azure/building-real-world-cloud-apps-with-windows-azure/queue-centric-work-pattern:
If either your Worker(s) or Database become unavailable, messages are still placed in durable storage and consumed later.
The Task Queue can take the form of an Azure Storage Queue or a Service Bus Queue. In every great design, the least complex component that does the job wins. In this case that would be Azure Storage Queues, durable, reliable, very few moving parts. Unless you absolutely need precision FIFO ordering, in which case you go with Service Bus.
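A minimal sketch of the enqueue side with Azure Storage Queues, assuming the classic Microsoft.WindowsAzure.Storage SDK; the queue name and connection-string handling are illustrative:

```csharp
// Queue-centric sketch: the HTTP endpoint only enqueues the raw message,
// so it stays available even when the downstream Cloud Service or DB is busy.
// Queue name and connection string are illustrative assumptions.
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Queue;

public class MessageBroker
{
    readonly CloudQueue _queue;

    public MessageBroker(string storageConnectionString)
    {
        var account = CloudStorageAccount.Parse(storageConnectionString);
        _queue = account.CreateCloudQueueClient().GetQueueReference("incoming-messages");
        _queue.CreateIfNotExists();
    }

    // Called from the HTTP endpoint that receives the sender's message body.
    public void Enqueue(string rawMessageBody)
    {
        _queue.AddMessage(new CloudQueueMessage(rawMessageBody));
    }
}
```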
From https://msdn.microsoft.com/en-us/library/dn568101.aspx:
This solution offers the following benefits:
It enables an inherently load-leveled system that can handle wide variations in the volume of requests sent by application instances. The queue acts as a buffer between the application instances and the consumer service instances, which can help to minimize the impact on availability and responsiveness for both the application and the service instances (as described by the Queue-based Load Leveling pattern). Handling a message that requires some long-running processing to be performed does not prevent other messages from being handled concurrently by other instances of the consumer service.
It improves reliability. If a producer communicates directly with a consumer instead of using this pattern, but does not monitor the consumer, there is a high probability that messages could be lost or fail to be processed if the consumer fails. In this pattern messages are not sent to a specific service instance, a failed service instance will not block a producer, and messages can be processed by any working service instance.
It does not require complex coordination between the consumers, or between the producer and the consumer instances. The message queue ensures that each message is delivered at least once.
It is scalable. The system can dynamically increase or decrease the number of instances of the consumer service as the volume of messages fluctuates.
It can improve resiliency if the message queue provides transactional read operations. If a consumer service instance reads and processes the message as part of a transactional operation, and if this consumer service instance subsequently fails, this pattern can ensure that the message will be returned to the queue to be picked up and handled by another instance of the consumer service.
Given you can't change the client, I would proxy the call. Recreate the API using the API Management Service in Azure, and change the web URL to point to the API Management Service proxy.
The proxy can then easily delegate to a Function App, as Aravind mentioned in the comments to your question, by using API Management Service policies.

Multiple Instances of Azure Worker Roles for non-transaction integration tasks

We have an upcoming project where we'll need to integrate with 3rd parties over a variety of transports to get data from them.
Things like WCF Endpoints & Web API Rest Endpoints are fine.
However, in two scenarios we'll need to either pick up auto-generated emails containing XML from a POP3 account or pull the XML files from an external SFTP account.
I'm about to start prototyping these now, but I'm wondering whether there are any standard practices, patterns or guidelines for dealing with these non-transactional systems in a multi-instance worker role environment, i.e.
What happens if two workers connect to the POP3 account at the same time, or to the same SFTP server at the same time?
What happens if one worker deletes a file from the SFTP server while another is mid-download?
Controlling duplication shouldn't be an issue, as we'll be logging everything on the application side to a database, and everything should be uniquely identifiable, so we'll be able to add if-not-exists-create-else-skip logic to the workers. I'm just wondering whether there is anything else I should be considering to make the process more resilient/idempotent.
Just thinking out loud: since the data is primarily files and emails, one possible approach is, instead of processing them directly in your worker roles, to first save them in blob storage. Some worker role instances would periodically poll the POP3 server / SFTP site, pull the data from there, and push it into blob storage. When the blob is written, the same instance can delete the data from the source as well. With this approach you don't have to worry about duplicate records, because the blob will simply be overwritten (assuming each message/file has a unique identifier and the name of the blob is that identifier).
Once the file is in blob storage, you can write a message to a Windows Azure Queue with the details of the blob (e.g. the blob URL). Then, using the 'Get' semantics of Windows Azure Queues, your worker role instances fetch and process these messages. Because of the Get semantics, once a message is fetched from the queue it becomes invisible to other callers (worker role instances in this case). This way you take care of duplicate message processing.
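A small sketch of those Get/Delete semantics on the worker side, assuming the classic storage SDK; ProcessBlob and the 5-minute visibility timeout are illustrative:

```csharp
// GetMessage hides the message from other worker instances for the visibility
// timeout; it only disappears for good after DeleteMessage.
using System;
using Microsoft.WindowsAzure.Storage.Queue;

static class BlobMessageProcessor
{
    public static void ProcessNextMessage(CloudQueue queue)
    {
        // Message stays invisible to other instances for 5 minutes while we work on it.
        CloudQueueMessage message = queue.GetMessage(TimeSpan.FromMinutes(5));
        if (message == null) return;        // nothing queued right now

        string blobUrl = message.AsString;  // assumption: the message body is the blob URL
        ProcessBlob(blobUrl);               // hypothetical processing step

        queue.DeleteMessage(message);       // delete only after successful processing
    }

    static void ProcessBlob(string blobUrl)
    {
        // Placeholder: download the blob and process its contents here.
    }
}
```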
UPDATE
So I'm trying to combat against two competing instances pulling the same file at the same moment from the SFTP
For this, I'll pitch my favorite master/slave concept :). Essentially, each instance tries to acquire a lease on a single blob. The instance that acquires the lease becomes the master and the others become slaves. The master fetches the data from SFTP while the slaves wait. I've described this concept in a blog post which you can read here: http://gauravmantri.com/2013/01/23/building-a-simple-task-scheduler-in-windows-azure/, though the context of the post is somewhat different.
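A rough sketch of that lease-based election, assuming the classic Microsoft.WindowsAzure.Storage SDK; the lock blob name and lease duration are illustrative:

```csharp
// Every instance tries to acquire a lease on the same well-known blob; only
// the instance that succeeds polls the SFTP/POP3 source. Names are illustrative.
using System;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;

static class MasterElection
{
    public static bool TryBecomeMaster(CloudBlobContainer container, out string leaseId)
    {
        CloudBlockBlob lockBlob = container.GetBlockBlobReference("sftp-poller-lock");
        if (!lockBlob.Exists())
            lockBlob.UploadText(string.Empty);    // create the blob we lease on

        try
        {
            // Leases must be 15-60 seconds (or infinite); renew while working.
            leaseId = lockBlob.AcquireLease(TimeSpan.FromSeconds(60), null);
            return true;                          // we are the master
        }
        catch (StorageException)                  // 409 Conflict: someone else holds the lease
        {
            leaseId = null;
            return false;                         // we are a slave; wait and retry later
        }
    }
}
```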
Have a look at the recently released Cloud Design Patterns guide. You might be able to find the corresponding pattern and sample code for what you need.

Way to share task and results between Azure website and workers

We need to change our system to a two-tiered structure on Azure, with an Azure website handling requests and adding tasks to a queue, which will then be processed in priority order by a set of Azure worker roles. The website will then return the results to the end user. The data and result sets for each task will be largish (several megabytes). What's the best way to broker this exchange of data?
We could do it via an Azure storage blob, but blobs are quite slow. Is there a better way? Up until now we have been doing everything in a scaled Azure website, which allows all instances access to the same disk.
If this is a long-running process I doubt that using blob storage would add that much overhead, although you don't specify what the tasks are.
On Zudio, long-running tasks update Table Storage tables with progress and completion status, and we use polling from the browser to check when a task has finished. In the case of a large result being returned to the user, the completion message includes a direct link to the blob with a shared access signature, so they can download it directly from storage. We're looking at replacing the polling with SignalR running over Service Bus and having the worker roles send updates directly to the client, but we haven't started that development work yet, so I can't tell you how that will actually work.
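For the shared access signature part, a minimal sketch, assuming the classic storage SDK (the container/blob names and one-hour expiry are illustrative):

```csharp
// Hand the client a time-limited, read-only download link to the result blob.
using System;
using Microsoft.WindowsAzure.Storage.Blob;

static class ResultLinks
{
    public static string GetDownloadUrl(CloudBlobContainer resultsContainer, string blobName)
    {
        CloudBlockBlob blob = resultsContainer.GetBlockBlobReference(blobName);

        string sas = blob.GetSharedAccessSignature(new SharedAccessBlobPolicy
        {
            Permissions = SharedAccessBlobPermissions.Read,
            SharedAccessExpiryTime = DateTimeOffset.UtcNow.AddHours(1)
        });

        // The client can GET this URL directly from Blob storage.
        return blob.Uri + sas;
    }
}
```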

Architecture design and role communication with Azure in file bound app

I am considering moving my web application to Windows Azure for scalability purposes but I am wondering how best to partition my application.
I expect my scenario is typical and is as follows: my application allows users to upload raw data, this is processed and a report is generated. The user can then review their raw data and view their report.
So far I’m thinking a web role and a worker role. However, I understand that a VHD can be mounted to a single instance with read/write access so really both my web role and worker role need access to a common file store. So perhaps I need a web role and two separate worker roles, one worker role for the processing and the other for reading and writing to a file store. Is this a good approach?
I am having difficulty picturing the plumbing between the roles and concerned of the overhead caused by the communication between this partitioning so would welcome any input here.
Adding to Stuart's excellent answer: Blobs can store anything, with sizes up to 200GB. If you needed / wanted to persist an entire directory structure that's durable, you can mount a VHD with just a few lines of code. It's an NTFS volume that your app can interact with, just like any other drive.
In your case, a VHD doesn't fit well, because your web app would have to mount the VHD and be its sole writer. And if you have more than one web role instance (which you would if you wanted the SLA and wanted to scale), you could only have one writer. In this case, individual blobs fit MUCH better.
As Stuart stated, this is a very normal and common pattern. And again, with only a few lines of code, you can call upon the storage SDK to copy a file from blob storage to your instance's local disk. Then you can process the file using regular File IO operations. When your report is complete, another few lines of code lets you copy your report into a new blob (most likely in a well-known container that the web role knows to look in).
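A hedged sketch of that copy-down / process / copy-up flow with the classic storage SDK; container names, local paths and GenerateReport are illustrative:

```csharp
// Download the raw blob to local disk, process it with regular File IO,
// then upload the report to a well-known container. Names are illustrative.
using System.IO;
using Microsoft.WindowsAzure.Storage.Blob;

static class ReportWorker
{
    public static void Handle(CloudBlobContainer uploads, CloudBlobContainer reports, string blobName)
    {
        string localRaw = Path.Combine(Path.GetTempPath(), blobName);
        string localReport = localRaw + ".report";

        // Copy the raw file from blob storage to the instance's local disk.
        using (var target = File.OpenWrite(localRaw))
            uploads.GetBlockBlobReference(blobName).DownloadToStream(target);

        GenerateReport(localRaw, localReport);   // hypothetical: ordinary File IO processing

        // Copy the finished report into a container the web role knows to look in.
        using (var source = File.OpenRead(localReport))
            reports.GetBlockBlobReference(blobName + ".report").UploadFromStream(source);
    }

    static void GenerateReport(string inputPath, string outputPath)
    {
        File.WriteAllText(outputPath, "report for " + inputPath);   // placeholder
    }
}
```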
You can take this a step further and insert rows into an Azure table that are partitioned by customer, with row key identifying the individual uploaded file, and a 3rd field representing the URI to the completed report. This makes it trivial for the web app to display a customer's completed reports.
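A minimal sketch of that table layout, assuming the classic Microsoft.WindowsAzure.Storage.Table API; the table name and ReportUri property are illustrative:

```csharp
// Partition by customer, row key identifies the uploaded file, and a third
// property holds the URI of the completed report. Names are illustrative.
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Table;

public class ReportEntity : TableEntity
{
    public ReportEntity() { }

    public ReportEntity(string customerId, string fileId)
    {
        PartitionKey = customerId;   // one partition per customer
        RowKey = fileId;             // uniquely identifies the uploaded file
    }

    public string ReportUri { get; set; }   // URI of the completed report blob
}

static class ReportIndex
{
    public static void Record(CloudStorageAccount account, string customerId, string fileId, string reportUri)
    {
        CloudTable table = account.CreateCloudTableClient().GetTableReference("reports");
        table.CreateIfNotExists();
        table.Execute(TableOperation.InsertOrReplace(
            new ReportEntity(customerId, fileId) { ReportUri = reportUri }));
    }
}
```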
Blob storage is the easiest place to store files which lots of roles and role instances can then access - with none of them requiring special access.
The normal pattern suggested seems to be:
allow the raw files to be uploaded using instances of a web role
these web role instances return the HTTP call without doing the processing - they store the raw files in blob storage and add a "do this work" message to a queue.
the worker role instances pick up the message from the queue, read the raw blob, do the work, store the report result, then delete the message from the queue
all the web roles can then access the report when the user asks for it
That's the "normal pattern suggested" and you can see it implemented in things like the photo upload/thumbnail generation apps from the very first Azure PDC - its also used in this training course - follow through to the second page.
Of course, in practice you may need to build on this pattern depending on the size and type of data you are processing.
