Azure Batch and use cases for JobManagerTask

I am currently digging into the Azure Batch service, and I am confused about the proper use of a JobManagerTask, and maybe about what the overall architecture of an Azure Batch application should look like. I have built the architecture below based on Microsoft code samples found on GitHub.
These are my current application components.
App1 - ClusterHead
Creates a job (including an auto pool)
Defines the JobManagerTask
Runs on a workstation
App2 - JobManagerTask
Splits input data into chunks
Pushes chunks (unit of work) onto an input queue
Creates tasks (CloudTask)
App3 - WorkloadRunner
Pulls from the input queue
Executes the task
Pushes to the output queue
Azure Storage Account
Linked to Azure Batch account
Provides input & output queues
Provides a result table
Azure Durable Function
Implements the aggregator pattern using durable entities so that I can access incoming results before the whole job has finished.
Gets triggered by messages in the output queue
Aggregates results and writes the entity to Azure Storage table
Questions
Is that a proper use of the JobManagerTask?
Why do I want/need the extra binary/application package that encapsulates the JobManagerTask?
Could someone please give an example of when I should prefer a JobManagerTask over creating the jobs manually?
Thanks in advance!

Your example is one way a JobManagerTask can be used, although, as you mentioned, if all of the data is generated by the JobManagerTask and is fixed, it could make sense to merge that work into your ClusterHead. In your case it just depends on whether you want the split and upload of your data to occur as part of the job or to run on the workstation.
One area where JobManagerTasks excel is when the incoming data is continuous. Basically, if you had a bunch of writers pushing to a raw input queue, you could have your JobManagerTask run continuously, reading from that queue, splitting the data, and creating the tasks.
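For reference, here is a minimal sketch (C#, Microsoft.Azure.Batch) of what the ClusterHead side could look like: a job on an auto pool with a JobManagerTask attached. The executable name SplitAndQueue.exe, the job id, the pool sizing and the account values are placeholder assumptions, not taken from your setup.

using Microsoft.Azure.Batch;
using Microsoft.Azure.Batch.Auth;

class ClusterHead
{
    static void Main()
    {
        var credentials = new BatchSharedKeyCredentials(
            "https://<account>.<region>.batch.azure.com", "<account>", "<key>");

        using (BatchClient batchClient = BatchClient.Open(credentials))
        {
            CloudJob job = batchClient.JobOperations.CreateJob();
            job.Id = "chunking-job";

            // Auto pool: created when the job is committed, torn down when the job completes.
            job.PoolInformation = new PoolInformation
            {
                AutoPoolSpecification = new AutoPoolSpecification
                {
                    PoolLifetimeOption = PoolLifetimeOption.Job,
                    PoolSpecification = new PoolSpecification
                    {
                        VirtualMachineSize = "standard_d2_v3",
                        TargetDedicatedComputeNodes = 4,
                        VirtualMachineConfiguration = new VirtualMachineConfiguration(
                            imageReference: new ImageReference(
                                publisher: "MicrosoftWindowsServer",
                                offer: "WindowsServer",
                                sku: "2019-datacenter-core",
                                version: "latest"),
                            nodeAgentSkuId: "batch.node.windows amd64")
                    }
                }
            };

            // The JobManagerTask runs your App2 on one node of the pool; App2 then splits
            // the input, pushes chunks to the input queue and adds CloudTasks to this job.
            job.JobManagerTask = new JobManagerTask("job-manager", "cmd /c SplitAndQueue.exe");

            job.Commit();
        }
    }
}

In the continuous scenario described above, the App2 started by that command line would simply loop over the raw input queue instead of doing a one-shot split.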

Related

Azure storage queue triggered function starts multiple times

All,
I have a storage queue triggered Azure Function. It loads various data into a database from files. I specify the input file in the message sent to the input queue.
However, when I send a message into the queue, my function starts in multiple instances and they all try to insert the same file into the db. If I log msg.dequeue_count I see it rising.
What should I do so that only one function instance runs for each message? Please note I'd like to keep the ability to start multiple instances for multiple messages, so that different files can be loaded in parallel.
This question was also asked here and the answer was to check out the chart comparing storage and service bus queues.
The bottom line is that storage queues offer 'at least once' delivery. If you want 'at most once', you should use Service Bus with PeekLock or ReceiveAndDelete.
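For illustration, here is a minimal sketch (C#, Azure.Messaging.ServiceBus) of the two receive modes mentioned above; the queue name "load-requests" and the connection string are placeholders.

using System;
using System.Threading.Tasks;
using Azure.Messaging.ServiceBus;

class ReceiveModesSketch
{
    static async Task Main()
    {
        var client = new ServiceBusClient("<service-bus-connection-string>");

        // ReceiveAndDelete: the message is removed as soon as it is handed to you,
        // so it is delivered at most once (a crash while processing loses it).
        ServiceBusReceiver fireAndForget = client.CreateReceiver("load-requests",
            new ServiceBusReceiverOptions { ReceiveMode = ServiceBusReceiveMode.ReceiveAndDelete });
        ServiceBusReceivedMessage message = await fireAndForget.ReceiveMessageAsync();
        if (message != null)
        {
            Console.WriteLine($"Loading file {message.Body}");
        }

        // PeekLock (the default): the message stays hidden while you hold the lock and is
        // only removed when you complete it; if you crash first it becomes visible again.
        ServiceBusReceiver locking = client.CreateReceiver("load-requests");
        ServiceBusReceivedMessage locked = await locking.ReceiveMessageAsync();
        if (locked != null)
        {
            Console.WriteLine($"Loading file {locked.Body}");
            await locking.CompleteMessageAsync(locked);
        }
    }
}

If you stay with Azure Functions, the Service Bus trigger receives in PeekLock mode and completes the message automatically when the function finishes successfully.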

What happens if multiple azure function apps bind to the same storage queue for input

I have function apps running in two different regions for redundancy, i.e. there are two separate apps in the Azure portal (deployed from the same code). So both apps have the function whose input binding points to the same storage queue. Would all messages be delivered to both, or would the messages get split between the two?
I am using C#, dotnet core, and Functions 2.0.
You do not have to worry about it. The function runtime will lock the messages using the default storage queue behavior.
From the docs:
The queue trigger automatically prevents a function from processing a queue message multiple times; functions do not have to be written to be idempotent.
Now I do know the docs are talking about one function that is scaling out, but the same applies to two functions with the same queue binding.
So
Would all messages be delivered to both or would the messages get split between the two?
The latter: messages will be split between the two.
Is anyone seeing anything different with this? I'm using Azure Service Bus queues, but it should be the same. I can see in our logs that the queue starts processing the item at almost exactly the same time, and it does so twice, which matches the number of function apps pointing to the same queue. It doesn't happen every time, but often enough that I can say the message is not being locked against being picked up multiple times.
This might be the case for a single function app that has logic to mitigate processing the same message twice.
However:
I have seen a scaled function app using a queue trigger sometimes get the same message on multiple instances.
You have to be prepared for that when scaling the function.
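One way to be prepared is to make the handler idempotent: record each message id before doing the real work and skip the message if it has already been recorded. A minimal sketch of that idea for a storage-queue-triggered C# function follows; the queue name, DeduplicationStore and FileLoader are hypothetical stand-ins for your own storage and load logic.

using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public static class LoadFileFunction
{
    [FunctionName("LoadFile")]
    public static async Task Run(
        [QueueTrigger("input-files")] string filePath,
        string id,        // bound by the runtime to the queue message id
        ILogger log)
    {
        // Hypothetical helper: atomically records the message id (e.g. an insert keyed
        // by id) and returns false if another instance has already recorded it.
        if (!await DeduplicationStore.TryMarkProcessedAsync(id))
        {
            log.LogInformation("Message {Id} was already handled, skipping.", id);
            return;
        }

        // Hypothetical helper: the actual load of the file into the database.
        await FileLoader.LoadIntoDatabaseAsync(filePath);
    }
}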

Parallelize Azure Logic App executions when copying a file from SFTP to Blob Storage

I have an Azure Logic App which gets triggered when a new file is added or modified in an SFTP server. When that happens the file is copied to Azure Blob Storage and then gets deleted from the SFTP server. This operation takes approximately 2 seconds per file.
The only problem I have is that these files (on average 500kb) are processed one by one. Given that I'm looking to transfer around 30,000 files daily this approach becomes very slow (something around 18 hours).
Is there a way to scale out/parallelize these executions?
I am not sure there is a built-in way to scale out/parallelize executions of an Azure Logic App. But based on my experience, if the timeliness requirements are not very strict, we could use ForEach to do that; the ForEach parallelism limit is 50 and the default is 20.
In your case, my suggestion is to keep the trigger that fires when a new file is added or modified on the SFTP server, but have it only insert a queue message with the file path as its content into an Azure Storage queue, ending the loop based on time or queue length. We could then get the queue message collection and, finally, in the ForEach action, fetch each queue message, fetch the corresponding file from the SFTP server, and create the blob.
If you're using C#, use Parallel.ForEach as Tom Sun said. If you go that route I also recommend using the async/await pattern for the IO operation (saving to blob). It will free up the executing thread to serve other requests while the file is being saved.
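Since Parallel.ForEach does not await async lambdas, a sketch of the same idea using Task.WhenAll with a SemaphoreSlim to bound concurrency may be closer to what you want. The SftpHelper methods below are hypothetical stand-ins for whatever SFTP client you use, and the concurrency limit of 20 simply mirrors the Logic App ForEach default.

using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using Azure.Storage.Blobs;

public static class SftpToBlobCopier
{
    public static async Task CopyAllAsync(IEnumerable<string> remotePaths, BlobContainerClient container)
    {
        var throttle = new SemaphoreSlim(20);   // cap concurrent transfers

        var copies = remotePaths.Select(async path =>
        {
            await throttle.WaitAsync();
            try
            {
                // Hypothetical helpers wrapping your SFTP library of choice.
                using Stream data = await SftpHelper.DownloadAsync(path);
                await container.GetBlobClient(Path.GetFileName(path)).UploadAsync(data, overwrite: true);
                await SftpHelper.DeleteAsync(path);
            }
            finally
            {
                throttle.Release();
            }
        });

        await Task.WhenAll(copies);
    }
}

As a rough estimate, with 20 concurrent transfers at about 2 seconds per file, 30,000 files would take on the order of an hour rather than ~18 hours.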

spring batch design advice for processing 50k files

We have more than 50k files coming in every day that need to be processed. For that we have developed POC apps with a design like this:
A polling app continuously picks up files from the FTP zone.
It validates each file and creates metadata in a db table.
Another poller picks 10-20 files from the db (only file id and status) and delivers them to the slave apps as messages.
A slave app takes a message and launches a Spring Batch job, which reads the data, does business validation in processors, and writes the validated data to the db/another file.
We used Spring Integration and Spring Batch for this POC.
Is it a good idea to launch the Spring Batch job in the slaves, or to implement the read, process and write logic directly as plain Java or Spring bean objects?
I need some insight on launching this job where each slave can have 10-25 MDPs (Spring message-driven POJOs) and each of these MDPs launches a job.
Note: each file will have at most 30-40 thousand records.
Generally, using Spring Integration and Spring Batch for such tasks is a good idea. This is what they are intended for.
With regard to Spring Batch, you get the whole retry, skip and restart handling out of the box. Moreover, you have all these readers and writers that are optimised for bulk operations. This works very well and you only have to concentrate on writing the appropriate mappers and such stuff.
If you want to use plain Java or Spring bean objects, you will probably end up developing such infrastructure code yourself... including all the needed effort for testing and so on.
Concerning your design:
Besides validating and creating the metadata entry, you could consider loading the entries directly into a database table. This would give you better "transactional" control if something fails. Your load job could look something like this:
step 1: a tasklet creates an entry in the metadata table with columns like
FILE_TO_PROCESS: XY.txt
STATE: START_LOADING
DATE: ...
ATTEMPT: ... (first attempt)
step 2: read and validate each line of the file and store it in a data table with columns like
DATA: ...
STATE: ...
FK_META_TABLE: foreign key to the meta table
step 3: update the metadata table with
STATE: LOAD_COMPLETED
So, as soon as your meta table entry gets the state LOAD_COMPLETED, you know that all entries of the file have been validated and are ready for further processing.
If something fails, you can just fix the file and reload it.
Then, to process further, you could have jobs which poll periodically and check whether there is new data in the database that should be processed. If more than one file has been loaded during the last period, simply process all files that are ready.
You could even have several slave processes polling from time to time. Just do a select-for-update on the state of the metadata table, or use an optimistic locking approach, to prevent several slaves from trying to process the same entries.
With this solution, you don't need a message infrastructure and you can still scale the whole application without any problems.

Azure event hubs and multiple consumer groups

I need help using Azure Event Hubs in the following scenario. I think consumer groups might be the right option for this, but I was not able to find a concrete example online.
Here is a rough description of the problem and the proposed solution using Event Hubs (I am not sure if this is the optimal solution; I would appreciate your feedback).
I have multiple event sources that generate a lot of event data (telemetry from sensors) which needs to be saved to our database, and some analysis, like running average and min-max, should be performed in parallel.
The sender can only send data to a single endpoint, but the event hub should make this data available to both data handlers.
I am thinking about using two consumer groups: the first one will be a cluster of worker role instances that take care of saving the data to our key-value store, and the second one will be an analysis engine (likely Azure Stream Analytics).
Firstly, how do I set up the consumer groups, and is there something that I need to do on the sender/receiver side so that copies of events appear in all consumer groups?
I did read many examples online, but they either use client.GetDefaultConsumerGroup(); and/or have all partitions processed by multiple instances of the same worker role.
For my scenario, when an event arrives, it needs to be processed by two different worker roles in parallel (one that saves the data and a second one that does some analysis).
Thank You!
TLDR: Looks reasonable, just make two Consumer Groups by using different names with CreateConsumerGroupIfNotExists.
Consumer Groups are primarily a concept so exactly how they work depends on how your subscribers are implemented. As you know, conceptually they are a group of subscribers working together so that each group receives all the messages and under ideal (won't happen) circumstances probably consumes each message once. This means that each Consumer Group will "have all partitions processed by multiple instances of the same worker role." You want this.
This can be implemented in different ways. Microsoft has provided two ways to consume messages from Event Hubs directly, plus the option to use things like Stream Analytics which are probably built on top of the two direct ways. The first way is the Event Hub Receiver; the second, which is higher level, is the Event Processor Host.
I have not used Event Hub Receiver directly, so this particular comment is based on the theory of how these sorts of systems work and speculation from the documentation: while they are created from EventHubConsumerGroups, this serves little purpose, as these receivers do not coordinate with one another. If you use these you will need to (and can!) do all the coordination and committing of offsets yourself, which has advantages in some scenarios, such as writing the offset to a transactional DB in the same transaction as computed aggregates. Using these low-level receivers, having different logical consumer groups use the same Azure consumer group probably shouldn't (normative, not practical, advice) be particularly problematic, but you should use different names in case it either does matter or you change to EventProcessorHosts.
Now onto more useful information, EventProcessorHosts are probably built on top of EventHubReceivers. They are a higher level thing and there is support to enable multiple machines to work together as a logical consumer group. Below I've included a lightly edited snippet from my code that makes an EventProcessorHost with a bunch of comments left in explaining some choices.
//We need an identifier for the lease. It must be unique across concurrently
//running instances of the program. There are three main options for this. The
//first is a static value from a config file. The second is the machine's NETBIOS
//name ie System.Environment.MachineName. The third is a random value unique per run which
//we have chosen here, if our VMs have very weak randomness bad things may happen.
string hostName = Guid.NewGuid().ToString();
//It's not clear if we want this here long term or if we prefer that the Consumer
//Groups be created out of band. Nor are there necessarily good tools to discover
//existing consumer groups.
NamespaceManager namespaceManager =
NamespaceManager.CreateFromConnectionString(eventHubConnectionString);
EventHubDescription ehd = namespaceManager.GetEventHub(eventHubPath);
namespaceManager.CreateConsumerGroupIfNotExists(ehd.Path, consumerGroupName);
host = new EventProcessorHost(hostName, eventHubPath, consumerGroupName,
eventHubConnectionString, storageConnectionString, leaseContainerName);
//Call something like this when you want it to start
await host.RegisterEventProcessorFactoryAsync(factory);
You'll notice that I told Azure to make a new Consumer Group if it doesn't exist; you'll get a lovely error message if it doesn't. I honestly don't know what the purpose of this is, because it doesn't include the Storage connection string, which needs to be the same across instances in order for the EventProcessorHost's coordination (and presumably commits) to work properly.
Here I've provided a picture from Azure Storage Explorer of the leases, and presumably offsets, from a Consumer Group I was experimenting with in November. Note that while I have a testhub and a testhub-testcg container, this is due to manually naming them. If they were in the same container it would be things like "$Default/0" vs "testcg/0".
As you can see there is one blob per partition. My assumption is that these blobs are used for two things. The first is the blob leases for distributing partitions amongst instances (see here); the second is storing the committed offsets within each partition.
Rather than the data getting pushed to the Consumer Groups, the consuming instances ask the storage system for data at some offset in one partition. EventProcessorHosts are a nice high-level way of having a logical consumer group where each partition is only read by one consumer at a time, and where the progress the logical consumer group has made in each partition is not forgotten.
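For completeness, here is a minimal sketch of the processor that the factory passed to RegisterEventProcessorFactoryAsync would create, using the same-era Microsoft.ServiceBus.Messaging API. SaveOrAnalyzeAsync is a hypothetical placeholder for your per-consumer-group logic (saving to the key-value store in one logical group, running the analysis in the other).

using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.ServiceBus.Messaging;

class SimpleEventProcessor : IEventProcessor
{
    public Task OpenAsync(PartitionContext context)
    {
        // Called when this instance takes the lease on a partition.
        return Task.FromResult(true);
    }

    public async Task ProcessEventsAsync(PartitionContext context, IEnumerable<EventData> messages)
    {
        foreach (EventData eventData in messages)
        {
            // Hypothetical placeholder for the real work on each event.
            await SaveOrAnalyzeAsync(eventData.GetBytes());
        }

        // Commit the offset for this partition so a restart resumes from here.
        await context.CheckpointAsync();
    }

    public Task CloseAsync(PartitionContext context, CloseReason reason)
    {
        // Called when the lease is lost or the host shuts down.
        return Task.FromResult(true);
    }

    static Task SaveOrAnalyzeAsync(byte[] payload) => Task.FromResult(true); // stub
}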
Remember that the throughput per partition is measured so that if you're maxing out ingress you can only have two logical consumers that are all up to speed. As such you'll want to make sure you have enough partitions, and throughput units, that you can:
Read all the data you send.
Catch up within the 24 hour retention period if you fall behind for a few hours due to issues.
In conclusion: consumer groups are what you need. The examples you read that use a specific consumer group are good; within each logical consumer group use the same name for the Azure Consumer Group, and have different logical consumer groups use different ones.
I haven't yet used Azure Stream Analytics, but at least during the preview release you are limited to the default consumer group. So don't use the default consumer group for something else, and if you need two separate lots of Azure Stream Analytics you may need to do something nasty. But it's easy to configure!
