I have a PowerShell Azure runbook that iterates through a large storage account and enforces file age policies on the blobs within the account. It runs fine but hits the 3-hour limit of the Fair Share policy. I could use hybrid workers, but I would prefer to run multiple child runbooks in parallel, each handling a different portion of the blob account selected by the first letter of the blob name (a prefix).
Example:
First child runbook runs A-M
Second: N-Z
Third: a-m
Fourth: n-z
I'm thinking of using a prefix variable within a loop that will iterate between letters.
## Declaring the variables
$number_of_days_bak_threshold = 15
$number_of_days_trn_threshold = 2
# Use UTC so the thresholds compare correctly with the blobs' UTC timestamps
$current_date = (Get-Date).ToUniversalTime()
$date_before_blobs_to_be_deleted_bak = $current_date.AddDays(-$number_of_days_bak_threshold)
$date_before_blobs_to_be_deleted_trn = $current_date.AddDays(-$number_of_days_trn_threshold)
# Number of blobs deleted
$blob_count_deleted = 0
# Storage account details (replace the placeholders)
$storage_account_name = "<Account Name>"
$storage_account_key = "<Account Key>"
$container = "<Container>"
## Creating the storage context and listing the blobs in the container
$context = New-AzureStorageContext -StorageAccountName $storage_account_name -StorageAccountKey $storage_account_key
$blob_list = Get-AzureStorageBlob -Context $context -Container $container
## Creating log file
$log_file = "log-" + (Get-Date).ToString('yyyy-MM-dd-HH-mm-ss') + ".txt"
$local_log_file_path = Join-Path $env:TEMP $log_file
Write-Output "Log file saved as: $local_log_file_path"
## Iterate through each blob
foreach ($blob_iterator in $blob_list) {
    $blob_date = [datetime]$blob_iterator.LastModified.UtcDateTime
    # .trn files: delete when last modified on or before the trn threshold date
    if ($blob_iterator.Name -match '\.trn$') {
        if ($blob_date -le $date_before_blobs_to_be_deleted_trn) {
            Write-Output "-----------------------------------" | Out-File $local_log_file_path -Append
            Write-Output "Purging blob from Storage: $($blob_iterator.Name)" | Out-File $local_log_file_path -Append
            Write-Output " " | Out-File $local_log_file_path -Append
            Write-Output "Last Modified Date of the Blob: $blob_date" | Out-File $local_log_file_path -Append
            Write-Output "-----------------------------------" | Out-File $local_log_file_path -Append
            # Cmdlet to delete the blob
            Remove-AzureStorageBlob -Container $container -Blob $blob_iterator.Name -Context $context
            $blob_count_deleted += 1
            Write-Output "Deleted $($blob_iterator.Name)"
        }
    }
    # .bak files: delete when last modified on or before the bak threshold date
    elseif ($blob_iterator.Name -match '\.bak$') {
        if ($blob_date -le $date_before_blobs_to_be_deleted_bak) {
            Write-Output "-----------------------------------" | Out-File $local_log_file_path -Append
            Write-Output "Purging blob from Storage: $($blob_iterator.Name)" | Out-File $local_log_file_path -Append
            Write-Output " " | Out-File $local_log_file_path -Append
            Write-Output "Last Modified Date of the Blob: $blob_date" | Out-File $local_log_file_path -Append
            Write-Output "-----------------------------------" | Out-File $local_log_file_path -Append
            # Cmdlet to delete the blob
            Remove-AzureStorageBlob -Container $container -Blob $blob_iterator.Name -Context $context
            $blob_count_deleted += 1
            Write-Output "Deleted $($blob_iterator.Name)"
        }
    }
    else {
        Write-Error "Unable to determine file type: $($blob_iterator.Name)"
    }
}
Write-Output "Blobs deleted: $blob_count_deleted" | Out-File $local_log_file_path -Append
I expect to be able to run through the account in parallel.
So, I agree with @4c74356b41 that breaking the workload down is the best approach. However, that is not always as simple as it might sound. Below I describe the various workarounds for fairshare and the potential issues I can think of off the top of my head. It is quite a lot of information, so here are the highlights:
Create jobs that do part of the work and then start the next job in the sequence.
Create jobs that all run on part of the sequence in parallel.
Create a runbook that does the work in parallel but also in a single job.
Use PowerShell Workflow with checkpoints so that your job is not subjected to fairshare.
Migrate the workload to use Azure Functions, e.g. Azure PowerShell functions.
TL;DR
No matter what, there are ways to break up a sequential workload into sequentially executed jobs, e.g. each job works on a segment and then starts the next job as its last operation. (Like a kind of recursion.) However, managing a sequential approach to correctly handle intermittent failures can add a lot of complexity.
If the workload can be broken down into smaller jobs that do not use a lot of resources, then you could do the work in parallel. In other words, if the amount of memory and socket resources required by each segment is low, and as long as there is no overlap or contention, this approach should run in parallel much faster. I also suspect that in parallel, the combined job minutes will still be less than the minutes necessary for a sequential approach.
There is one gotcha with processing the segments in parallel...
When a bunch of AA jobs belonging to the same account are started together, the tendency that they will all run within the same sandbox instance increases significantly. Sandboxes are never shared with un-related accounts, but because of improvements in job start performance, there is a preference to share sandboxes for jobs within the same account. When these jobs all run at the same time, there is an increased likelihood that the overall sandbox resource quota will be hit and then the sandbox will perform a hard exit immediately.
Because of this gotcha, if your workload is memory or socket intensive, you may want to have a parent runbook that controls the lifecycle (i.e. start rate) of the child runbooks. This has the twisted effect that the parent runbook could now hit the fairshare limit.
The next workaround is to implement runbooks that kick off the job for the next processing segment when they are completed. The best approach for this is to store the next segment somewhere the job can retrieve it, e.g. variable or blob. This way, if a job fails with its segment, as long as there is some way of making sure the jobs run until the entire workload finishes, everything will eventually finish. You might want to use a watcher task to verify eventual completion and handle retries. Once you get to this level of complexity, you can experiment to discover how much parallelism you can introduce without hitting resource limits.
There is no way for a job to monitor the available resource and throttle itself.
There is no way to force each job to run in its own sandbox.
Whether jobs run in the same sandbox is very non-deterministic, which can cause problems with hard to trace intermittent failures.
If you are not worried about hitting resource limits, you could consider using the ThreadJob module on the PowerShell Gallery. With this approach, you would still have a single runbook, but now you would be able to parallelize the workload within that runbook and complete the workload before hitting the fairshare limit. This can be very effective if the individual tasks are fast and lightweight. Otherwise, this approach may work for a little while but begin to fail if the workload increases in either the time or resources required.
Do not use PowerShell Jobs within an AA Job to achieve parallelism. This includes not using commands like Parallel-ForEach. There are a lot of examples for VM-Start/Stop runbooks that use PowerShell Jobs; this is not a recommended approach. PowerShell Jobs require a lot of resources to execute, so using PowerShell Jobs will significantly increase the resources used by your AA Job and the chance of hitting the memory quota.
You can get around the fairshare limitation by re-implementing your code as a PowerShell Workflow and performing frequent checkpoints. When a workflow job hits the fairshare limit, if it has been performing checkpoints, it will be restarted on another sandbox, resuming from the last checkpoint.
My recollection is that your jobs need to perform a checkpoint at least once every 30 minutes. If they do this, they will resume after hitting the fairshare limit without any penalty, forever. (At the cost of a tremendous number of job minutes.)
Even without a checkpoint, a workflow will get retried 2 times after hitting the fairshare limit. Because of this, if your workflow code is idempotent and quickly skips previously completed work, your job may complete (within 9 hours, i.e. three runs of 3 hours) even without checkpoints.
However, workflows are not just PowerShell scripts wrapped inside a workflow {} script block:
There are a lot of subtle differences in the way workflows function compared to scripts. Mastering these subtleties is difficult at best.
Workflows do not checkpoint all state during job execution. The big one is that you need to write your workflow so that it re-authenticates with Azure after each checkpoint, because credentials are not captured by the checkpoint.
I don't think anyone is going to claim debugging an AA job is an easy task. For workflows this task is even harder. Even with all the correct dependencies, how a workflow runs when executed locally is different from how it executes in the cloud.
Scripts run measurably faster than workflows.
Migrate the work into Azure Functions. With the recent release of PowerShell functions, this might be relatively easy. Functions will have different limitations than those in Automation. This difference might suit your workload well, or be worse. I have not yet tried the Functions approach, so I can't really say.
The most obvious difference you will notice right away is that Functions is a lot more of a raw, DevOps oriented service than Automation. This is partly because Automation is a more mature product. (Automation was probably the first widely available serverless service, having launched roughly a year before Lambda.) Automation was purpose built for automating the management of cloud resources, and automation is the driving factor in the feature selection. Whereas Functions is a much more general purpose serverless operations approach. Anyway, one obvious difference at the moment is Functions does not have any built-in support for things like the RunAs account or Variables. I expect Functions will improve in this specific aspect over time, but right now it is pretty basic for automation tasks.
I have realized that the jobs submitted with a previous version of my software are useless because of a bug, so I want to cancel them. However, I also have newer jobs that I would like to keep running. All the jobs have the same job name and are running in the same partition.
I have written the following script to cancel the jobs with an ID lower than a given one.
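#!/bin/bash
# Cancel all of the current user's jobs whose ID is lower than the given one.
if [ -n "$1" ]
then
    MAX_JOBID=$1
else
    echo "An integer value is needed"
    exit 1
fi
# -h suppresses the header line so only job IDs are compared
JOBIDLIST=$(squeue -h -u "$USER" -o "%F")
for JOBID in $JOBIDLIST
do
    if [ "$JOBID" -lt "$MAX_JOBID" ]
    then
        echo "Cancelling job $JOBID"
        scancel "$JOBID"
    fi
done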
I would say that this is a recurrent situation for someone developing software, and I wonder if there is a direct way to do it using Slurm commands. Alternatively, do you use tricks like appending the software commit ID to the job name to handle this kind of situation?
Unfortunately, there is no direct way to cancel jobs in such scenarios.
Alternatively, as you pointed out, adding the software version or commit ID to the job name is useful. In that case you can use scancel --name=JOB_NAME_VERSION to cancel all the jobs with that job name.
Also, if the newly submitted jobs can be held using scontrol hold <jobid>, then all the PENDING jobs can be cancelled using scancel --state=PENDING.
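A minimal sketch of the naming idea, assuming a hypothetical base name mysim, a made-up short commit ID, and a hypothetical submission script job.sh. The helper versioned_job_name is illustrative only and is not part of Slurm; the sbatch and scancel lines are shown as comments because they need a cluster.

```shell
# versioned_job_name BASE COMMIT: build a job name such as "mysim-1a2b3c4"
versioned_job_name() {
    printf '%s-%s\n' "$1" "$2"
}

# In a real checkout the commit could come from: git rev-parse --short HEAD
name=$(versioned_job_name mysim 1a2b3c4)
echo "$name"

# On a cluster (not executed here):
#   sbatch --job-name="$name" job.sh   # submit under the versioned name
#   scancel --name="$name"             # later, cancel only that version's jobs
```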
In my case, I used a similar approach to yours: I piped the output of squeue to awk and cancelled the first N jobs I wanted to remove. It's a one-liner.
Something like this (replace N with the number of the last output line to cancel; line 1 is the header):
squeue <arguments> | awk 'NR>=2 && NR<=N {print $1}' | xargs /usr/bin/scancel
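A variation on the one-liner above that selects by job ID instead of by line position could look like the sketch below. The function name jobs_below is hypothetical, and the example feeds canned input rather than calling squeue, so it runs without a cluster. It assumes one job ID per line and skips anything that is not a plain integer, which means a header line and array task IDs such as 123_4 are ignored.

```shell
# jobs_below MAX: read job IDs on stdin, print those numerically lower than MAX.
# Lines that are not plain integers (header, array task IDs) are skipped.
jobs_below() {
    awk -v max="$1" '$1 ~ /^[0-9]+$/ && $1 + 0 < max + 0 { print $1 }'
}

# Canned input standing in for squeue output:
printf 'JOBID\n1001\n1002\n1005\n1010\n' | jobs_below 1005
# prints:
# 1001
# 1002

# On a cluster this could be wired up as (not executed here; -r is a GNU xargs option):
#   squeue -h -u "$USER" -o "%i" | jobs_below 1005 | xargs -r scancel
```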
In addition to the suggestions by @j23, you can organise your jobs with
job arrays: if all your jobs are similar in terms of submission script, make them a job array, and submit one job array per version of your software. Then you can cancel an entire job array with just one scancel command
a workflow management system: these make it easy to submit and manage sets of jobs (possibly on different clusters), for example
Fireworks https://materialsproject.github.io/fireworks/
Bosco https://osg-bosco.github.io/docs/
Slurm pipelines https://github.com/acorg/slurm-pipeline
Luigi https://github.com/spotify/luigi
With an Azure Data Factory "Tumbling Window" trigger, is it possible to limit the hours of each day during which it triggers (adding a window, you might say)?
For example I have a Tumbling Window trigger that runs a pipeline every 15 minutes. This is currently running 24/7 but I'd like it to only run during business hours (0700-1900) to reduce costs.
Edit:
I played around with this, and found another option which isn't ideal from a monitoring perspective, but it appears to work:
Create a new pipeline with a single "If Condition" step with a dynamic Expression like this:
@and(greater(int(formatDateTime(utcnow(),'HH')),6),less(int(formatDateTime(utcnow(),'HH')),20))
In the true case activity, add an Execute Pipeline step executing your original pipeline (with "Wait on completion" ticked)
In the false case activity, add a wait step which sleeps for X minutes
The longer you sleep for, the longer you can possibly encroach on your window, so adjust that to match.
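To make the boundaries of that expression concrete, here is a small sketch of the same test (hour greater than 6 and less than 20) applied to a few sample hours. The function name in_window is hypothetical, and the hours are passed in by hand rather than read from the clock. Note that the expression uses utcnow(), so the hours are UTC, and that hour 19 passes the test, so runs can start up to 19:59.

```shell
# in_window HOUR: succeed when the two-digit hour satisfies hour > 6 and hour < 20
in_window() {
    iw_hour=${1#0}          # drop one leading zero so "08" is not treated as octal
    iw_hour=${iw_hour:-0}   # "0" would otherwise become empty
    [ "$iw_hour" -gt 6 ] && [ "$iw_hour" -lt 20 ]
}

for h in 06 07 19 20; do
    if in_window "$h"; then echo "$h: run"; else echo "$h: wait"; fi
done
# prints:
# 06: wait
# 07: run
# 19: run
# 20: wait
```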
I need to give it a couple of days before I check the billing on the portal to see if it has reduced costs. At the moment I'm assuming a job which just sleeps for 15 minutes won't incur the costs that one running and processing data would.
There is no easy way, but you can create two deployment pipelines for the same job in Azure DevOps, and as soon as your window (0700 to 1900) expires, replace that job with a dummy job using an Azure DevOps pipeline.
I have created an Azure Batch account using the "user subscription" allocation mode in order to control the network my nodes belong to. The objective is to be able to open some firewalls for the set of IPs that the nodes may take.
I had been using the "batch service" allocation mode before without any trouble, but it forces a security hole, because you have to open your firewalls to all of Azure if you want to access other services from Batch.
The problem I am facing is that no matter what I try (be it Autoscale formula or just a fixed target node count) I never get any node allocated to my pool.
The only message I get is: AllocationTimedout: Desired number of dedicated nodes could not be allocated as the resize timeout was reached.
I checked the timeout (which is set to the default value of 10 minutes), and I expect Azure to be able to create nodes in less than 10 minutes (in "batch service" mode, it is much quicker).
I also checked my virtual machine quota and it is enough to create at least one node (it could create even more).
I think the timeout is not the real issue. It is the consequence of something not working in the background.
I checked the Activity log of batch and can see errors:
Write Deployments and Write VirtualMachineScaleSets.
The first seems to be linked to the second, and the second states:
Error code: InvalidParameter
Message: Windows computer name prefix cannot be more than 9 characters long, be entirely numeric, or contain the following characters: ` ~ ! @ # $ % ^ & * ( ) = + _ [ ] { } \ | ; : . ' " , < > / ?.
What am I missing here? The node names are given by Azure Batch, not by me, and they are indeed very long in the standard "batch service" allocation mode.
When I run a U-SQL script from the portal or Visual Studio, it goes through stages like preparing, queued, running and finalizing. What exactly happens behind the scenes in each of these stages? Will there be any difference in execution time when the job is run from Visual Studio or the portal in the dev and production environments? We need to clock the speeds and record the time the script would take in production. Ultimately, the goal is to run these scripts as Data Factory activities in production.
I assume that there would be differences since I assume your dev environment would probably run at lower resource usage (lower degree of parallelism both between jobs and inside a job) than your production environment. Otherwise there should be no difference.
Note that we are still working on performance so if you are running into particular issues, please let us know.
The phases roughly do the following (I am probably missing some parts):
preparing: includes compilation, optimization, Codegen, preparing the execution graph and required resources and putting the job into the queue.
queueing: The job sits in the queue to get executed once the job is at the top of the queue and resources are available to start the job. This can be impacted by setting the maximal number of jobs that can run in parallel (a setting you can set by "calling" support/us).
running: Actual job execution. This will be affected by resources: the maximum degree of parallelism specified on the job, network bandwidth, store access (throttling, bandwidth).
finalizing: Cleanup and stitching results into files, "sealing" table files. This can be more expensive depending on where you write the data (ADL is faster than WASB for example).
I need to move VHDs from one subscription to another. I would like to know which is the better option for this: Start-AzureStorageBlobCopy or AzCopy?
Which one takes less time?
Both of them would take the same time as all they do is initiate Async Server-Side Blob Copy. They just tell the service to start copying blob from source to destination. The actual copy operation is performed by Azure Blob Storage Service. The time it would take to copy the blob would depend on a number of factors including but not limited to:
Source & destination location.
Size of the source blob.
Load on storage service.
Running AzCopy without specifying the option /SyncCopy and running PowerShell command Start-AzureStorageBlobCopy should take the same duration, because they both use server side asynchronous copy.
If you'd like to copy blobs across regions, consider specifying the /SyncCopy option when running AzCopy in order to achieve a consistent speed. The asynchronous copy runs in the background on the storage servers, so you might see inconsistent speeds between copy operations.
If the /SyncCopy option is specified, AzCopy downloads the content to memory first and then uploads it back to Azure Storage. To get better performance from /SyncCopy, you should run AzCopy in a VM in the same region as the source storage account. Besides that, the VM size (which determines bandwidth and the number of CPU cores) will probably affect the copying performance as well.
For further information, please refer to Getting Started with the AzCopy Command-Line Utility
They don't take the same time.
I've tried to copy from one account to another and have a huge difference.
Start-AzureStorageBlobCopy -SrcBlob $_.Name -SrcContainer $Container -Context $ContextSrc -DestContainer $Container -DestBlob $_.Name -DestContext $ContextDst -Verbose
This takes about 2.5 hours.
& .\AzCopy.exe /Source:https://$StorageAccountNameSrc.blob.core.windows.net/$Container /Dest:https://$StorageAccountNameDst.blob.core.windows.net/$Container /SourceKey:$StorageAccountKeySrc /DestKey:$StorageAccountKeyDst /S
This takes several minutes.
I have about 600 MB in about 7000 files here.
Elapsed time: 00.00:03:41
Finished 44 of total 44 file(s).
[2017/06/22 17:05:35] Transfer summary:
-----------------
Total files transferred: 44
Transfer successfully: 44
Transfer skipped: 0
Transfer failed: 0
Elapsed time: 00.00:00:08
Finished 345 of total 345 file(s).
[2017/06/22 17:06:07] Transfer summary:
-----------------
Total files transferred: 345
Transfer successfully: 345
Transfer skipped: 0
Transfer failed: 0
Elapsed time: 00.00:00:31
Does anyone know why it's so different?
In most scenarios, AzCopy is likely to be quicker than Start-AzureStorageBlobCopy because of the way you initiate the copy, which results in fewer calls to the Azure API:
[AzCopy]1 call for whole container (regardless of blob count)
vs
[Start-AzureStorageBlobCopy] N number of calls due to number of blobs in container.
Initially I thought it would be the same, as both appear to trigger the same asynchronous copies on the Azure side. However, the difference is directly visible on the client side, as @Evgeniy has found in his answer.
In a scenario with one blob in the container, theoretically both commands would complete at the same time.
*EDIT (possible workaround): I was able to decrease my time tremendously by:
Removing console output AND
Using the -ConcurrentTaskCount switch, set to 100 in my case. Cut it down to under 5 minutes now.
AzCopy offers an SLA, which the async copy service lacks. AzCopy is designed for optimal performance. Use the /SyncCopy parameter to get a consistent copy speed.