I'm looking for what I would assume is quite a standard solution: I have a node app that doesn't do any web-work - simply runs and outputs to a console, and ends. I want to host it, preferably on Azure, and have it run once a day - ideally also logging output or sending me the output.
The only solution I can find is to create a VM on Azure, and set a cron job - then I need to either go fetch the debug logs daily, or write node code to email me the output. Anything more efficient available?
Azure Functions would be worth investigating. It can be timer triggered and would avoid the overhead of a VM.
Also I would investigate Azure Container Instances, this is a good match for their use case. You can have a container image that you run on an ACI instance that has your Node app. https://learn.microsoft.com/en-us/azure/container-instances/container-instances-tutorial-deploy-app
Related
I have a certain script (python), which needs to be automated that is relatively memory and CPU intensive. For a monthly process, it runs ~300 times, and each time it takes somewhere from 10-24 hours to complete, based on input. It takes certain (csv) file(s) as input and produces certain file(s) as output, after processing of course. And btw, each run is independent.
We need to use configs and be able to pass command line arguments to the script. Certain imports, which are not default python packages, need to be installed as well (requirements.txt). Also, need to take care of logging pipeline (EFK) setup (as ES-K can be centralised, but where to keep log files and fluentd config?)
Last bit is monitoring - will we be able to restart in case of unexpected closure?
Best way to automate this, tools and technologies?
My thoughts
Create a docker image of the whole setup (python script, fluent-d config, python packages etc.). Now we somehow auto deploy this image (on a VM (or something else?)), execute the python process, save the output (files) to some central location (datalake, eg) and destroy the instance upon successful completion of process.
So, is what I'm thinking possible in Azure? If it is, what are the cloud components I need to explore -- answer to my somehows and somethings? If not, what is probably the best solution for my use case?
Any lead would be much appreciated. Thanks.
Normally for short living jobs I'd say use an Azure Function. Thing is, they have a maximum runtime of 10 minutes unless you put them on an App Service Plan. But that will costs more unless you manually stop/start the app service plan.
If you can containerize the whole thing I recommend using Azure Container Instances because you then only pay for what you actual use. You can use an Azure Function to start the container, based on an http request, timer or something like that.
You can set a restart policy to indicate what should happen in case of unexpected failures, see the docs.
Configuration can be passed from the Azure Function to the container instance or you could leverage the Azure App Configuration service.
Though I don't know all the details, this sounds like a good candidate for Azure Batch. There is no additional charge for using Batch. You only pay for the underlying resources consumed, such as the virtual machines, storage, and networking. Batch works well with intrinsically parallel (also known as "embarrassingly parallel") workloads.
The following high-level workflow is typical of nearly all applications and services that use the Batch service for processing parallel workloads:
Basic Workflow
Upload the data files that you want to process to an Azure Storage account. Batch includes built-in support for accessing Azure Blob storage, and your tasks can download these files to compute nodes when the tasks are run.
Upload the application files that your tasks will run. These files can be binaries or scripts and their dependencies, and are executed by the tasks in your jobs. Your tasks can download these files from your Storage account, or you can use the application packages feature of Batch for application management and deployment.
Create a pool of compute nodes. When you create a pool, you specify the number of compute nodes for the pool, their size, and the operating system. When each task in your job runs, it's assigned to execute on one of the nodes in your pool.
Create a job. A job manages a collection of tasks. You associate each job to a specific pool where that job's tasks will run.
Add tasks to the job. Each task runs the application or script that you uploaded to process the data files it downloads from your Storage account. As each task completes, it can upload its output to Azure Storage.
Monitor job progress and retrieve the task output from Azure Storage.
(source)
I would go with Azure Devops and a custom agent pool. This agent pool could include some virtual machines (maybe only one) with docker installed. I would then install all the necessary packages that you mentioned on this docker container and also the DevOps agent (it will be needed to communicate with the agent pool).
You could pass every parameter needed in the build container agents through Azure Devops tasks and also have a common storage layer for build and release pipeline. This way you could mamipulate/process your files on the build pipeline and then using the same folder create a task on the release pipeline to export/upload those files somewhere.
As this script should run many times through the month, you could have many containers so that to run more than one job at a given time.
I follow the same procedure for a corporate environment. I keep a VM running windows with multiple docker machines to compile diferent code frameworks. Each container includes different tools and is registered to a custom agent pool. Jobs are distributed across those containers and build and release pipelines integrate with multiple processing.
You probably suppose to use Azure Data Factory for moving and transforming data.
Then you can also use ADF for calling Azure Batch that will be using python.
https://learn.microsoft.com/en-us/azure/batch/tutorial-run-python-batch-azure-data-factory
Adding more info could probably suggest other better suggestions.
I've containerized a logic that I have to run on a schedule. If I do my docker run locally (whatever my image is local or it is using the one from the hub) everything works great.
Now I need though to run that "docker run" on a scheduled base, on the cloud.
Azure would be preferred, but honestly, I'm looking for the easier and cheapest way to achieve this goal.
Moreover, my schedule can change, so maybe today that job runs once a day, in the future that can change.
What do you suggest?
You can create an Azure Logic app to trigger the start of a Azure Container Instance. As you have a "run-once" (every N minute/hour/..) container, the restart-policy should be set to "Never", so that the container only executes and then stops after the scheduling.
The Logic app needs to have the permissions to start the Container, so add a role assignment on the ACI to the managed identity of the logic App.
Screenshot shows the workflow with a Recurrence trigger, that starts an existing container every minute.
Should be quite cheap and utilizes only Azure services, without any custom infrastructure
Professionally I used 4 ways to run cron jobs/ scheduled builds. I give a quick summary of all with it pros/cons.
GitLab scheduled builds (free)
My personal preference would be to setup a scheduled pipeline in GitLab. Simply add the script to a .gitlab-ci.yml, configure the scheduled build and you are done. This is the lightweight option and works in most cases, if the execution time is not too long. I used this approach for scraping simple pages.
Jenkins scheduled builds (not-free)
I used the same approach as GitLab with Jenkins. But Jenkins comes with more overhead and you have to configure the entire Jenkins on multiple machines.
Kubernetes CronJob (expensive)
My third approach would be using a kubernetes cronjob. However, I would only use this if I consume a lot of memory/ram, or have a long execution time. I used this approach for dumping really large data sets.
Run a cron job from a container (expensive)
My last option would be to deploy a docker container on either a VM or a Kubernetes cluster and configure a cron job from within that docker container. You can even use docker-in-docker for that. This gives maximum flexibility, but comes with some challenges. Personally I like the separation of concerns when it comes to down-times etc. That's why never run a cron job as main process.
We have multiple Google Cloud Run services running for an API. There is one parent service and multiple child services. When the parent service starts it loads a schema from all the children.
Currently there isn't a way to tell the parent process to reload the schema so when a new child is deployed the parent service needs to be restarted to reload the schema.
We understand there there are 1 or more instances of Google Cloud Run running and have ideas on dealing with this, but are wondering if there is a way to restart the parent process at all. Without a way to achieve it, one or more is irrelevant for now. The only way found it by deploying the parent which seems like overkill.
The containers running in google cloud are Alpine Linux with Nodejs, running an express application/middleware. I can stop the node application running but not restart it. If I stop the service Google Cloud Run may still continue to serve traffic to that instances causing errors.
Perhaps I can stop the express service so Google Cloud run will replace that instance? Is this a possibility? Is there a graceful way to do it so it tries to complete and current requests first (not simply kill express)?
Looking for any approaches to force Google Cloud Run to restart or start new instances. Thoughts?
Your design seems, at high level, be a cache system: The parent service get the data from the child service and cache the data.
Therefore, you have all the difficulties of cache management, especially cache invalidation. There is no easy solution for that, but my recommendation will be to use memorystore where all child service publish the latest version number of their schema (at container startup for example). Then, the parent service checks (at each requests, for example) the status in memory store (single digit ms latency) if a new version is available of not. If a new, request the child service, and update the parent service schema cache.
If applicable, you can also set a TTL on your cache and reload it every minute for example.
EDIT 1
If I focus only on Cloud Run, you can in only one condition, restart your container without deploying a new version: set the max-instance param to 1, and implement an exit endpoint (simply do os.exit() or similar in your code)
Ok, you loose all the scale up capacity, but it's the only case where, with a special exit endpoint, you can exit the container and force Cloud Run to reload it at the next request.
If you have more than 1 instance, you won't be able to restart all the running instances but only this one which handle the "exit" request.
Therefore, the only one solution is to deploy a new revision (simply deploy, without code/config change)
I have a Nuxt app deployed on Azure App Service, with cron library to run schedule jobs. However I found that If there are more than one instance running, the schedule job will be duplicated. What is the proper way to handle it? Thanks!
If you have more than one instance of your app running, then you have more than one instance of cron running (I'm assuming you are referring to the npm module in your case). Both are going to activate on the same schedule since it's coded into both apps. And of course if you scale beyond that, you would have three, four, five jobs running.
There are a few options that let you run singleton jobs on a timer such as adding a WebJob to your App Service, creating a Logic App, and running an Azure Function. For a basic JS script I would recommend the function. You create a JSON file that defines the schedule you want to run it on (just like cron), it's JS so you can probably just copy your code over along with other npm modules you need, and you can set Configuration just like web apps so if your job needs to connect to storage or a database you can have connection strings and other info there just like you do for your existing web app.
Is there a way to run something like "node testscript.js" remotely?
If not, how do you test particular functions on App Engine? I can test them locally, but there are difference when running on App Engine.
If you want to run something in App Engine, you will have to deploy it, and whenever you make changes to the source code, you will have to redeploy it again to be able to run the updated code on App Engine. You should test your application in local thoroughly to be sure it will be working as expected when deployed.
With respect to the timeouts, please keep in mind that there are two environments: flexible and standard, where the timeout deadlines differ (60 sec for standard vs 60 min for flexible). Also, you can have long-running requests on App Engine standard if you use the manual scaling option.
You might also look at Cloud Functions, depending on what your scripts do. Some of the options to trigger Cloud Functions are HTTP requests or Direct Triggers.