How does the ForEach activity declare success in Azure Data Factory?

In the following scenario, a ForEach activity runs in an Azure Data Factory pipeline to copy data from source to destination.
The last Copy activity took 4:10:33, but the ForEach activity was marked Succeeded 36 minutes later, at 4:46:12.
Why does the ForEach activity need these extra 36 minutes?
Does ForEach also need to consolidate results from its inner activities before declaring success or failure?
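As a quick check of the figures above, the gap can be computed directly. This is only arithmetic on the two values quoted in the question, treating both as h:mm:ss; the variable names are made up for illustration.

```python
from datetime import timedelta

# Values quoted in the question, read as hours:minutes:seconds.
last_copy = timedelta(hours=4, minutes=10, seconds=33)
foreach_total = timedelta(hours=4, minutes=46, seconds=12)

# Time between the last inner activity finishing and ForEach reporting success.
gap = foreach_total - last_copy
print(gap)  # 0:35:39, i.e. roughly 36 minutes
```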

Official answer from Microsoft: The ForEach activity does wait for all inner activity runs to complete. In theory, there should not be much delay in marking the ForEach run as succeeded after the last inner activity run succeeds. However, ADF relies on a partner service to execute the runs, and it is possible that the partner service runs into failures and cannot complete the ForEach in time. It has built-in logic to keep retrying and recover, but in the ADF activity runs this shows up as a delay. It is also possible that the orchestration service fails and the partner service keeps retrying its calls to us, but partner service delay is usually the main cause here.
Our assumption: Duration is end-to-end for the pipeline activity. It takes into account all factors, such as marshaling your data flow script from ADF to the Spark cluster, cluster acquisition time, job execution, and I/O write time. Because ADF is serverless compute, I think ForEach needs time to wait for all activities to acquire and release computing resources, but this is my guess because there are few official explanations.
So there will be a delay, which varies with the inner activities.

Related

Add Tasks to a running Azure batch job and manually control termination

We have an Azure Batch job that uses some quite large files, which we upload to Azure Blob storage asynchronously so that we don't have to wait for all files to upload before starting the job. The job is made up of a collection of Tasks, each of which processes one file and generates output. All good so far: this is working fine.
I'd like to be able to create an Azure Task and add it to an existing, running Azure Job, increasing the length of the Task list, but I can't find how to do this. It seems that Azure expects you to define ALL Tasks for a Job before the Job starts; it then runs until all Tasks are complete and terminates the Job (which makes sense in some scenarios, but not mine).
I would like to suppress this Job completion behavior and be able to queue up additional Azure Tasks for the same Job. I could then monitor the Azure Job status (via the Tasks) and determine myself whether the Job is complete.
Our issue is that uploading multi-MB files takes time, and we want Task processing to start as soon as the first file is available. If we have to wait until all files are available, the start of processing is delayed, which is not what we need.
We 'could' create a job per task and manage it in our application, but that is a little 'messy', and I would like to use the encapsulating Azure Job entity and its supporting functionality if I possibly can.
Has anyone done this and can offer some guidance? Many thanks.
You can add new tasks to an existing Azure Batch job in the active state. There is no running state for an Azure Batch job. You can find a list of Azure Batch job states here.
Azure Batch Jobs, by default, do not automatically complete by terminating upon all tasks completing. You can view this related question regarding this subject.
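The behavior described in this answer can be illustrated with a small in-memory model. This is a toy sketch, not the Batch SDK: `ToyJob`, `JobState`, and all method names are hypothetical stand-ins, and the model only captures the two points made above (tasks can be added while the job is active, and finishing every task does not by itself complete the job).

```python
from enum import Enum


class JobState(Enum):
    ACTIVE = "active"
    COMPLETED = "completed"


class ToyJob:
    """In-memory stand-in for a Batch job (hypothetical, not the Batch SDK)."""

    def __init__(self, job_id):
        self.job_id = job_id
        self.state = JobState.ACTIVE
        self.tasks = {}  # task id -> "pending" or "done"

    def add_task(self, task_id):
        # Tasks may be added at any time while the job is active.
        if self.state is not JobState.ACTIVE:
            raise RuntimeError(f"job {self.job_id} is {self.state.value}")
        self.tasks[task_id] = "pending"

    def finish_task(self, task_id):
        # Finishing the last task does not change the job state.
        self.tasks[task_id] = "done"

    def all_tasks_done(self):
        return bool(self.tasks) and all(s == "done" for s in self.tasks.values())

    def terminate(self):
        # The application decides when the job is over.
        self.state = JobState.COMPLETED


job = ToyJob("process-uploads")
job.add_task("file-1")              # first upload finished: start processing now
job.finish_task("file-1")
state_after_first = job.state       # still ACTIVE although every task is done
job.add_task("file-2")              # a later upload adds another task
job.finish_task("file-2")
uploads_finished = True             # assumption: the uploader signals completion
if uploads_finished and job.all_tasks_done():
    job.terminate()                 # explicit, application-controlled termination
print(state_after_first.value, job.state.value)
```

In this model the application, not the service, decides when the job ends, which matches the default behavior described above.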

Do Azure Batch jobs need a watcher process?

We have a very long-running operation (potentially days) that we would like to trigger from a blob written to Azure Storage. This job could be started once a year, never, or many times over a few days.
Azure Batch jobs look exactly like what we need, assuming there doesn't need to be a 'watcher' process on the batch job as it runs. For example, if we can have an Azure Function catch the blob event, fire up a Batch job in a "fire and forget" fashion, and then end, that is exactly what we need. We aren't really too worried about reporting the progress of the job (we are using a SQL table for that); we just want to start the job and then monitor it out of band.
Is there a way to start a batch job and let the instigator process disappear while the job continues to run in the background? If not, is there any way to do this without having to have a constantly running process (Worker Role or Fabric Worker)? We are trying to avoid having a process (Worker/Fabric Role, Function using the App Function Plan, etc.) running all the time when 99.9% of the time it isn't doing anything.
Short answer: No, you don't need a watcher process.
Azure Batch tasks are asynchronous in nature. When you add a task (under a job), your call against the Batch service immediately returns with success or failure of the submission action itself (and not if the task completed successfully on a compute node or not). The Batch service takes care of scheduling the task among the available compute nodes in your pool, internally monitoring the progress of the task, updating stats, etc.
If you elect to do so, you can monitor the progress of your task independently of the submitting actor using any SDK, REST API or client tool. Or you can opt to monitor it out-of-band yourself, as you have described, if your task is updating an external monitor or data store. Or you can schedule a task and not monitor it at all; the service does not force you to monitor or watch the task.

How to host long running process into Azure Cloud?

I have a C# console application that extracts a 15 GB Firebird database file on a server to multiple files and loads the data from those files into a SQL Server database. The console application uses the System.Threading.Tasks.Parallel class to load the files into SQL Server in parallel.
It is a weekly process that takes 6 hours to complete.
What is the best option to move this (console application) process to the Azure cloud: WebJob, Worker Role, or any other cloud service?
How can I reduce the execution time (6 hours) after moving to the cloud?
How do I implement the suggested option? Please provide pointers, code samples, etc.
Detailed comments are very much appreciated.
Thanks
Bhanu.
Let me give some thought to this question of yours:
"What is the best option to move this (console application) process to the Azure cloud: WebJob, Worker Role, or any other cloud service?"
First, you can achieve the task with either a WebJob or a Worker Role, but I would suggest going with a WebJob.
The pros of a WebJob are:
- Deployment is quicker: you can turn your console app, without any change, into a continuously running WebJob within minutes (https://azure.microsoft.com/en-us/documentation/articles/web-sites-create-web-jobs/).
- Built-in timer support, whereas with a Worker Role you need to handle scheduling on your own.
- Fault tolerance: when your WebJob fails, there is built-in resume logic.
You might want to check out Azure Functions. You pay only for the processing time you use. Be aware, though, that executions on the Consumption plan are subject to a timeout (10 minutes at most), so a multi-hour job would have to be split into smaller pieces or run on a plan without that limit.
They can be set up on a schedule or kicked off from other events.
If you are already doing work in parallel, you could break out some of the parallel tasks into separate Azure Functions. Aside from that, how to speed things up would require specific knowledge of what you are trying to accomplish.
In the past, when I've tried to speed up work like this, I would start by writing log messages during processing that contain the current time or the measured duration (using the Stopwatch class), and then find out which areas can be improved. The slowness may also come from the SQL Server side, so more investigation would be needed on your part. But the first step is always capturing metrics.
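The "capture metrics first" advice can be sketched as follows. The answer refers to the .NET Stopwatch class; this sketch shows the same idea in Python with `time.perf_counter`. The step names and the `extract`/`load` functions are placeholders invented for the example, with `time.sleep` standing in for real work.

```python
import time
from contextlib import contextmanager

timings = {}  # step name -> accumulated seconds


@contextmanager
def timed(step):
    """Log and accumulate the wall-clock duration of one processing step."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        timings[step] = timings.get(step, 0.0) + elapsed
        print(f"{step}: {elapsed:.3f}s")


def extract():
    time.sleep(0.05)  # placeholder for exporting the database to files


def load():
    time.sleep(0.01)  # placeholder for loading the files into SQL Server


with timed("extract"):
    extract()
with timed("load"):
    load()

# The step with the largest accumulated time is the first candidate to optimize.
slowest = max(timings, key=timings.get)
print("slowest step:", slowest)
```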
Since Azure Functions can scale out horizontally, you might want to first break the data from the files into smaller chunks and let the functions handle one chunk each, processing several chunks in parallel. Be sure not to run more in parallel than your SQL Server can handle.
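A minimal sketch of the chunk-and-limit idea, run locally with a thread pool instead of Azure Functions. The chunk size, the `MAX_PARALLEL_LOADS` cap, and `load_chunk` are all assumptions made for illustration; a real `load_chunk` would do a bulk insert.

```python
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 100
MAX_PARALLEL_LOADS = 4  # assumed cap on what the database can handle at once


def chunks(rows, size):
    """Yield consecutive slices of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def load_chunk(chunk):
    # Placeholder for a bulk insert; returns the number of rows "loaded".
    return len(chunk)


rows = list(range(1050))
# max_workers bounds how many chunks are being loaded at the same time.
with ThreadPoolExecutor(max_workers=MAX_PARALLEL_LOADS) as pool:
    loaded = sum(pool.map(load_chunk, chunks(rows, CHUNK_SIZE)))
print(loaded)  # 1050
```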

Asynchronous recurring task scheduler in Windows Azure

We would like to let our customers schedule recurring tasks on a daily, weekly and monthly basis. Linear scalability is really important to us, which is why we use Windows Azure Table Storage instead of SQL Azure. The current design is the following:
- Scheduling information is stored in a Table Storage table. For example: Task A, daily; Task B, weekly; ...
- There are worker processes that run hourly and query this table, then decide whether they have to run a given task.
But what if multiple worker roles start to run the same task?
Some other requirements:
- The worker processes can be in different time zones.
Windows Azure Queue Storage could solve all the concurrency problems mentioned above, but it also introduces some new issues:
- How many queue items should we generate?
- What if the customer changes the recurrence rate or revokes the scheduling?
So, my question is: how to design a recurring task scheduler with multiple asynchronous workers using Windows Azure Storage?
Perhaps the new Azure Scheduler service could help?
http://www.windowsazure.com/en-us/services/scheduler/
Some thoughts:
"But what if multiple worker roles start to run the same task?"
This could very well happen. To avoid it, you could have one worker role instance (any instance from the pool) read from the table and push messages to a queue. While this instance is doing this work, all other instances wait. To decide which instance does this work, you can make use of blob lease functionality.
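The lease idea can be sketched with an in-memory stand-in. `ToyLease` is hypothetical and only mimics the one property that matters here: at most one caller can hold the lease at a time. A real blob lease would also expire or need renewal, which this sketch ignores.

```python
import threading


class ToyLease:
    """In-memory stand-in for an exclusive blob lease (hypothetical)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._holder = None

    def try_acquire(self, instance_id):
        # Succeeds for exactly one caller until the lease is released.
        with self._lock:
            if self._holder is None:
                self._holder = instance_id
                return True
            return False

    def release(self, instance_id):
        with self._lock:
            if self._holder == instance_id:
                self._holder = None


lease = ToyLease()
queue = []                          # stands in for the storage queue
due_tasks = ["Task A", "Task B"]    # stands in for rows read from the table


def worker(instance_id):
    if not lease.try_acquire(instance_id):
        return                      # another instance schedules this round
    queue.extend(due_tasks)         # only the lease holder enqueues work


threads = [threading.Thread(target=worker, args=(f"worker_{i}",)) for i in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(queue)  # ['Task A', 'Task B']
```

The winner keeps the lease for the whole round; releasing it immediately after enqueuing would let a slower instance acquire it and enqueue the same tasks again.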
"Some other requirements: The worker processes can be in different time zones."
Not sure about this. Assuming you're talking about Cloud Services Worker Roles, they could be in different data centers, but all of them will be in the UTC time zone.
"How many queue items should we generate?"
It really depends on how much work needs to be done. You could put all messages in one queue. A client can dequeue at most 32 messages from a queue at a time, so if you have, say, 100 tasks and thus 100 messages, each instance can read only up to 32 messages in a single call to the queue service.
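To make the 32-message limit concrete, here is a small simulation with an in-memory queue. `dequeue_batch` is a hypothetical helper, not a storage SDK call; it only enforces the per-call cap described above.

```python
from collections import deque

MAX_BATCH = 32  # per-call limit mentioned above


def dequeue_batch(queue, requested):
    """Remove and return up to min(requested, MAX_BATCH) messages."""
    count = min(requested, MAX_BATCH, len(queue))
    return [queue.popleft() for _ in range(count)]


messages = deque(f"task-{i}" for i in range(100))
batch_sizes = []
while messages:
    batch = dequeue_batch(messages, requested=100)
    batch_sizes.append(len(batch))
print(batch_sizes)  # [32, 32, 32, 4]
```

Draining 100 messages therefore takes four calls, whether they come from one instance or are spread across several.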
"What if the customer changes the recurrence rate or revokes the scheduling?"
That should be OK, as once the task is completed you must remove the message from the queue. The next time the task is invoked, you can read from the table again and it will give you the latest information about the task.
I would continue using Azure Table Storage, but mark the task as "in progress" before a worker starts working on it. Since ATS supports optimistic concurrency controlled by ETags, you can be assured that two workers won't be able to start the same task.
I would, however, think about retry logic for when jobs fail unexpectedly, and have a process that restarts jobs that appear to have been orphaned.
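The ETag approach can be sketched with an in-memory table. `ToyTable` and `PreconditionFailed` are hypothetical stand-ins for a table service that rejects an update when the supplied ETag no longer matches the stored one; the field names are made up for the example.

```python
import uuid


class PreconditionFailed(Exception):
    """Raised when the supplied ETag no longer matches the stored row."""


class ToyTable:
    """In-memory stand-in for a table with ETag checks (hypothetical)."""

    def __init__(self):
        self._rows = {}  # key -> (etag, entity)

    def insert(self, key, entity):
        self._rows[key] = (uuid.uuid4().hex, dict(entity))

    def get(self, key):
        etag, entity = self._rows[key]
        return etag, dict(entity)

    def update(self, key, entity, if_match):
        current_etag, _ = self._rows[key]
        if current_etag != if_match:
            raise PreconditionFailed(key)
        # Every successful write produces a new ETag.
        self._rows[key] = (uuid.uuid4().hex, dict(entity))


table = ToyTable()
table.insert("task-a", {"status": "idle", "owner": None})

# Two workers read the same version of the row before either one writes.
etag_1, copy_1 = table.get("task-a")
etag_2, copy_2 = table.get("task-a")

copy_1.update(status="in progress", owner="worker_1")
table.update("task-a", copy_1, if_match=etag_1)       # first writer wins

copy_2.update(status="in progress", owner="worker_2")
try:
    table.update("task-a", copy_2, if_match=etag_2)   # stale ETag
    second_claim = True
except PreconditionFailed:
    second_claim = False                              # worker_2 skips the task

print(second_claim, table.get("task-a")[1]["owner"])  # False worker_1
```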

Scheduled Tasks with Sql Azure?

I wonder if there's a way to use scheduled tasks with SQL Azure?
Any help is appreciated.
The point is that I want to run a simple, single-line statement every day and would like to avoid setting up a worker role.
There's no SQL Agent equivalent for SQL Azure today. You'd have to call your single-line statement from a background task. However, if you have a Web Role already, you can easily spawn a thread to handle this in your web role without having to create a Worker Role. I blogged about the concept here. To spawn a thread, you can either do it in the OnStart() event handler (where the Role instance is not yet added to the load balancer), or in the Run() method (where the Role instance has been added to the load balancer). Usually it's a good idea to do setup in the OnStart().
One caveat that might not be obvious, whether you execute this call in its own worker role or in a background thread of an existing Web Role: If you scale your Role to, say, two instances, you need to ensure that the daily call only occurs from one of the instances (otherwise you could end up with either duplicates, or a possibly-costly operation being performed multiple times). There are a few techniques you can use to avoid this, such as a table row-lock or an Azure Storage blob lease. With the former, you can use that row to store the timestamp of the last time the operation was executed. If you acquire the lock, you can check to see if the operation occurred within a set time window (maybe an hour?) to decide whether one of the other instances already executed it. If you fail to acquire the lock, you can assume another instance has the lock and is executing the command. There are other techniques - this is just one idea.
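The timestamp check described above might look like the following sketch. It assumes the lock on the row is already held when `run_daily_job` executes, and it uses a plain dictionary in place of the locked table row; the one-hour window and all names are illustrative.

```python
from datetime import datetime, timedelta, timezone

WINDOW = timedelta(hours=1)  # assumed window, as suggested above

state = {"last_run": None}   # stands in for the locked table row


def should_run(last_run, now, window=WINDOW):
    """True if no instance has executed the operation within the window."""
    return last_run is None or now - last_run >= window


def run_daily_job(instance_id, now, executed):
    # Assumption: the caller already holds the row lock or blob lease.
    if should_run(state["last_run"], now):
        executed.append(instance_id)   # placeholder for the SQL statement
        state["last_run"] = now


executed = []
t0 = datetime(2024, 1, 1, 3, 0, tzinfo=timezone.utc)
run_daily_job("instance_0", t0, executed)                         # runs
run_daily_job("instance_1", t0 + timedelta(minutes=2), executed)  # skipped
run_daily_job("instance_0", t0 + timedelta(days=1), executed)     # next day: runs
print(executed)  # ['instance_0', 'instance_0']
```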
In addition to David's answer, if you have a lot of scheduled tasks to do then it might be worth looking at:
- lokad.cloud - which has good handling of periodic tasks - http://lokadcloud.codeplex.com/
- quartz.net - which is a good all-round scheduling solution - http://quartznet.sourceforge.net/
(You could use quartz.net within the thread that David mentioned, but lokad.cloud would require a slightly bigger architectural change)
I hope it is acceptable to talk about your own company. We have a web based service that allows you to do this. You can click this link to see more details on how to schedule execution of SQL Azure queries.
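To overcome the issue of multiple role instances executing the same task, you can check the role instance ID and make sure that only the first instance executes the task.
using Microsoft.WindowsAzure.ServiceRuntime;

// Only the first role instance (ID ending in "_0") runs the task.
// Matching "_0" rather than "0" avoids also matching instances 10, 20, ...
string instanceId = RoleEnvironment.CurrentRoleInstance.Id;
if (!instanceId.EndsWith("_0"))
{
    return;
}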
