I have a simple pipeline in ADF that is triggered by a Logic App every time someone submits a file as response in a Microsoft forms. The pipeline creates a cluster based in a Docker and then uses a Databricks notebook to run some calculations that can take several minutes.
The problem is that every time the pipeline is running and someone submits a new response to the forms, it triggers another pipeline run that, for some reason, will make the previous runs to fail.
The last pipeline will always work fine, but earlier runs will show this error:
> Operation on target "notebook" failed: Cluster 0202-171614-fxvtfurn does not exist
However, checking the parameters of the last pipeline it uses a different cluster id, 0202-171917-e616dsng for example.
It seems that for some reason, the computers resources for the first run are relocated in order to be used for the new pipeline run. However, the IDs of the cluster are different.
I have set up the concurrency up to 5 in the pipeline general settings tab, but still getting the same error.
Concurrency setup screenshot
Also, in the first connector that looks up for the docker image files I have the concurrency set up to 15, but this won’t fix the issue
look up concurrency screenshot
To me, it seems a very simple and common task when it comes to automation and data workflows, but I cannot figure it out.
I really appreciate any help and suggestions, thanks in advance
The best way would be use an existing pool rather than recreating the pool everytime
I have a certain script (python), which needs to be automated that is relatively memory and CPU intensive. For a monthly process, it runs ~300 times, and each time it takes somewhere from 10-24 hours to complete, based on input. It takes certain (csv) file(s) as input and produces certain file(s) as output, after processing of course. And btw, each run is independent.
We need to use configs and be able to pass command line arguments to the script. Certain imports, which are not default python packages, need to be installed as well (requirements.txt). Also, need to take care of logging pipeline (EFK) setup (as ES-K can be centralised, but where to keep log files and fluentd config?)
Last bit is monitoring - will we be able to restart in case of unexpected closure?
Best way to automate this, tools and technologies?
My thoughts
Create a docker image of the whole setup (python script, fluent-d config, python packages etc.). Now we somehow auto deploy this image (on a VM (or something else?)), execute the python process, save the output (files) to some central location (datalake, eg) and destroy the instance upon successful completion of process.
So, is what I'm thinking possible in Azure? If it is, what are the cloud components I need to explore -- answer to my somehows and somethings? If not, what is probably the best solution for my use case?
Any lead would be much appreciated. Thanks.
Normally for short living jobs I'd say use an Azure Function. Thing is, they have a maximum runtime of 10 minutes unless you put them on an App Service Plan. But that will costs more unless you manually stop/start the app service plan.
If you can containerize the whole thing I recommend using Azure Container Instances because you then only pay for what you actual use. You can use an Azure Function to start the container, based on an http request, timer or something like that.
You can set a restart policy to indicate what should happen in case of unexpected failures, see the docs.
Configuration can be passed from the Azure Function to the container instance or you could leverage the Azure App Configuration service.
Though I don't know all the details, this sounds like a good candidate for Azure Batch. There is no additional charge for using Batch. You only pay for the underlying resources consumed, such as the virtual machines, storage, and networking. Batch works well with intrinsically parallel (also known as "embarrassingly parallel") workloads.
The following high-level workflow is typical of nearly all applications and services that use the Batch service for processing parallel workloads:
Basic Workflow
Upload the data files that you want to process to an Azure Storage account. Batch includes built-in support for accessing Azure Blob storage, and your tasks can download these files to compute nodes when the tasks are run.
Upload the application files that your tasks will run. These files can be binaries or scripts and their dependencies, and are executed by the tasks in your jobs. Your tasks can download these files from your Storage account, or you can use the application packages feature of Batch for application management and deployment.
Create a pool of compute nodes. When you create a pool, you specify the number of compute nodes for the pool, their size, and the operating system. When each task in your job runs, it's assigned to execute on one of the nodes in your pool.
Create a job. A job manages a collection of tasks. You associate each job to a specific pool where that job's tasks will run.
Add tasks to the job. Each task runs the application or script that you uploaded to process the data files it downloads from your Storage account. As each task completes, it can upload its output to Azure Storage.
Monitor job progress and retrieve the task output from Azure Storage.
(source)
I would go with Azure Devops and a custom agent pool. This agent pool could include some virtual machines (maybe only one) with docker installed. I would then install all the necessary packages that you mentioned on this docker container and also the DevOps agent (it will be needed to communicate with the agent pool).
You could pass every parameter needed in the build container agents through Azure Devops tasks and also have a common storage layer for build and release pipeline. This way you could mamipulate/process your files on the build pipeline and then using the same folder create a task on the release pipeline to export/upload those files somewhere.
As this script should run many times through the month, you could have many containers so that to run more than one job at a given time.
I follow the same procedure for a corporate environment. I keep a VM running windows with multiple docker machines to compile diferent code frameworks. Each container includes different tools and is registered to a custom agent pool. Jobs are distributed across those containers and build and release pipelines integrate with multiple processing.
You probably suppose to use Azure Data Factory for moving and transforming data.
Then you can also use ADF for calling Azure Batch that will be using python.
https://learn.microsoft.com/en-us/azure/batch/tutorial-run-python-batch-azure-data-factory
Adding more info could probably suggest other better suggestions.
Situation:
I have a pipeline job that executes tests in parallel. I use Azure VMs that I start/stop on each build of the job thru Powershell. Before I run the job, it checks if there are available VMs on azure (offline VMs) then use that VMs for that build. If there is no available VMs then I will fail the job. Now, one of my requirements is that instead of failing the build, I need to queue the job until one of the nodes is offline/available then use those nodes.
Problem:
Is there any way for me to this? Any existing plugin or a build wrapper that will allow me to queue the job based on the status of the nodes? I was forced to do this because we need to stop the Azure VM to lessen the cost usage.
As of the moment, I am still researching if this is possible or any other way for me to achieve this. I am thinking of any groovy script that will check the nodes and if there are no available, I will manually add it to the build queue until at least 1 is available. The closest plugin that I got is Run Condition plugin but I think this will not work.
I am open to any approach that will help me achieve this. Thanks
I'm looking for what I would assume is quite a standard solution: I have a node app that doesn't do any web-work - simply runs and outputs to a console, and ends. I want to host it, preferably on Azure, and have it run once a day - ideally also logging output or sending me the output.
The only solution I can find is to create a VM on Azure, and set a cron job - then I need to either go fetch the debug logs daily, or write node code to email me the output. Anything more efficient available?
Azure Functions would be worth investigating. It can be timer triggered and would avoid the overhead of a VM.
Also I would investigate Azure Container Instances, this is a good match for their use case. You can have a container image that you run on an ACI instance that has your Node app. https://learn.microsoft.com/en-us/azure/container-instances/container-instances-tutorial-deploy-app
I'm not sure if it's an off topic for SO but I really need help here. Now in my project we are running load test on weekly basic and we are taking the advantage of ARM and azure CLI for making it fully automated test framework, starting from vm spinning to report gen.
But after the test, for now we are terminating the resource group manually and we have few though to make it automatic e.g by running a cron job. So just I'm curious if there is a better approach to do a graceful termination/destroy(not stop) automatically using azure cli based on a time window.
No, there is no such a way, but if everything is automated, you can run az group delete xxx at the end of your script\automation routine.
On top of that, take a look at Event Grid. Its a new service that can create actions in response to events.