Running a Node.js Application Once Every Year - node.js

I have recently been challenged with an architectural problem. Basically, I developed a Node.js application that fetches three zip files from Census.gov (13 MB, 1.2 MB and 6.7 GB), which takes about 15 to 20 minutes. After the files are downloaded, the application unzips them and extracts the needed data into an AWS RDS database. The issue for me is that this application needs to run only once each year. What would be the best solution for this kind of task? Also, the zip files are deleted after the processing is done.

I would use a cron job. You can use this website (https://crontab.guru/every-year) to determine the correct settings for the crontab.
0 0 1 1 *
This setting will run "At 00:00 on day-of-month 1 in January", i.e. once a year.
To run the Node.js program you simply put node yourprogram.js afterwards, so the entry would look like the line below. You may need to replace node with the full path to the node binary, and yourprogram.js with the full path to your script.
0 0 1 1 * node yourprogram.js
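As a concrete sketch (the paths are placeholders): assuming node is installed at /usr/bin/node and your script lives at /home/ubuntu/census-import/index.js, the full crontab entry could look like this:
# m h dom mon dow  command
0 0 1 1 * /usr/bin/node /home/ubuntu/census-import/index.js >> /home/ubuntu/census-import/yearly.log 2>&1
The >> redirection appends stdout and stderr to a log file so you can check afterwards whether the yearly run succeeded.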

Hi, I would offer a suggestion, but it depends on which services you use. For example, if you are on Google Cloud, use Cloud Scheduler. If you are on OpenShift or another Kubernetes platform, you can use a CronJob. In the worst case, I think, you would need to write a deployment YAML file that is triggered through a publisher/subscriber setup:
Make a subscriber service that is triggered by a Google Pub/Sub topic to do your task and, after everything has executed, publishes back to the broker (Google Pub/Sub).
Then make another subscriber that deletes the files once it receives the message that all tasks have executed; a minimal sketch of that cleanup subscriber is shown below.
I suggest this because, for a process like this, it is best practice to handle it asynchronously.
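A minimal sketch of the cleanup subscriber in Node.js, assuming the @google-cloud/pubsub client; the subscription name and file names are placeholders:
// cleanup-subscriber.js - deletes the zip files once the "job done" message arrives
const { PubSub } = require('@google-cloud/pubsub');
const fs = require('fs');

const pubsub = new PubSub();
const subscription = pubsub.subscription('census-cleanup-sub'); // placeholder subscription name

subscription.on('message', (message) => {
  console.log('Job finished, cleaning up:', message.data.toString());
  ['file1.zip', 'file2.zip', 'file3.zip'].forEach((file) => { // placeholder file names
    if (fs.existsSync(file)) fs.unlinkSync(file);
  });
  message.ack(); // acknowledge so Pub/Sub does not redeliver the message
});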
Thanks,

I would look into the AWS Batch service, which can run a scheduled job on an EC2 instance (virtual machine) or on Fargate (a serverless container runner).
Alternative #2: Use an AWS Lambda serverless function to execute a Node.js script (no need to set up an EC2 instance or Fargate). Lambda functions can be triggered by EventBridge rules using cron expressions. With Lambda, you pay for the number of executions and for execution time in 1 ms increments; however, this use case could be covered within the AWS Free Tier Lambda pricing. AWS Free Tier
Note on Lambda limits: Lambda execution time is limited to 15 minutes and local storage to a maximum of 10 GB (source: Lambda Quotas). Lambda CPU is allocated in proportion to the memory configuration, so you may need to increase memory to improve execution time. Lambda Memory Configuration
Alternative #3: You can build a state machine using AWS Step Functions to trigger Lambda functions in steps.
For example, a state machine can trigger three Lambda functions in parallel, where each function downloads its corresponding .zip file from census.gov and stores it in an Amazon S3 bucket. When all functions complete, the state machine can progress to the next step and trigger a fourth function to grab the data from S3 for processing and loading into the database. Once the data has been processed and loaded, a final step can delete the .zip files from S3 if you no longer need them. EventBridge can also be used here to execute the state machine using a cron expression. You can also use Amazon SNS to publish notifications (email/SMS/HTTP endpoint) to alert if any step fails or completes.
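A rough sketch of such a state machine in Amazon States Language; the state names and Lambda ARNs below are placeholders, not real resources:
{
  "Comment": "Yearly census import (illustrative sketch only)",
  "StartAt": "DownloadZips",
  "States": {
    "DownloadZips": {
      "Type": "Parallel",
      "Next": "ProcessAndLoad",
      "Branches": [
        { "StartAt": "DownloadFile1", "States": { "DownloadFile1": { "Type": "Task",
          "Resource": "arn:aws:lambda:us-east-1:123456789012:function:download-file-1", "End": true } } },
        { "StartAt": "DownloadFile2", "States": { "DownloadFile2": { "Type": "Task",
          "Resource": "arn:aws:lambda:us-east-1:123456789012:function:download-file-2", "End": true } } },
        { "StartAt": "DownloadFile3", "States": { "DownloadFile3": { "Type": "Task",
          "Resource": "arn:aws:lambda:us-east-1:123456789012:function:download-file-3", "End": true } } }
      ]
    },
    "ProcessAndLoad": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:process-and-load",
      "Next": "CleanUpS3"
    },
    "CleanUpS3": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:cleanup-s3",
      "End": true
    }
  }
}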

The simple solution is to schedule AWS Lambda functions using CloudWatch Events.
So, you will have an AWS Lambda function that downloads the .zip files into an S3 bucket, unzips them and extracts the data to the database. After that, the same function can empty the S3 bucket.
This function will be triggered yearly by CloudWatch Events.
For more information, check out this tutorial here
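A minimal sketch of wiring that up from the AWS CLI, assuming the Lambda function is called census-import (the function name, account ID and region are placeholders):
# Rule that fires once a year, at 00:00 UTC on January 1st
aws events put-rule --name census-import-yearly --schedule-expression "cron(0 0 1 1 ? *)"
# Allow CloudWatch Events / EventBridge to invoke the function
aws lambda add-permission --function-name census-import \
  --statement-id census-import-yearly --action "lambda:InvokeFunction" \
  --principal events.amazonaws.com \
  --source-arn arn:aws:events:us-east-1:123456789012:rule/census-import-yearly
# Point the rule at the function
aws events put-targets --rule census-import-yearly \
  --targets "Id"="1","Arn"="arn:aws:lambda:us-east-1:123456789012:function:census-import"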

Related

AWS and NodeJS architecture for a scheduled/cron task in multi server setup

I am using AWS services to deploy my application, which currently has the production site set up behind an Application Load Balancer running 2 instances of my Node.js server.
My current concern is that if I just set up node-cron to trigger a task at 5:00 am, it will do this on each server I spin up.
I need to implement an email delivery system where, at 5:00 am, it will query a database table I made in order to generate customized emails (I need to iterate over each individual's record, which has a unique array that helps build a list of items for each user). I then fire the object off to AWS SES.
What are some ways you have done this?
Currently, based on my reading, I am looking at two options:
Set up a node-cron child process within one cluster (but if I have auto-scaling, wouldn't this create duplicate node-cron tasks?); this would probably require Redis and tracking the process across servers
OR
Set up an EventBridge rule which fires api.mybackendserver.com/send-email-event, where I then carry out my logic (this seems like the simpler approach, and the drawback would be potential CPU/RAM spikes, which would be fine as I'm regionally based and would do this during off-peak hours).
EventBridge with a cron schedule is definitely a way to go. If you're worried about usage spikes, you could have the cron rule invoke a Lambda function that pushes an SQS message for each job; those messages would then be polled by your EC2 instances.
Another way would be to schedule a task that increases the number of instances before the cron event occurs.
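A minimal sketch of that Lambda, assuming the AWS SDK for JavaScript v2, an existing SQS queue URL and a hypothetical database query helper (all names are placeholders):
// enqueue-emails.js - invoked by an EventBridge cron rule; pushes one SQS message per user.
// Workers (EC2 instances or another Lambda) consume the queue and call SES.
const AWS = require('aws-sdk');
const sqs = new AWS.SQS();

const QUEUE_URL = process.env.EMAIL_QUEUE_URL; // placeholder queue URL

// Placeholder: replace with your real database query
async function getUsersDueForEmail() {
  return []; // e.g. [{ id: '123', items: ['a', 'b'] }, ...]
}

exports.handler = async () => {
  const users = await getUsersDueForEmail();
  // SQS batches are limited to 10 messages per request
  for (let i = 0; i < users.length; i += 10) {
    const batch = users.slice(i, i + 10);
    await sqs.sendMessageBatch({
      QueueUrl: QUEUE_URL,
      Entries: batch.map((user, idx) => ({
        Id: String(i + idx),
        MessageBody: JSON.stringify({ userId: user.id, items: user.items }),
      })),
    }).promise();
  }
};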

Long-running job on GCP Cloud Run

I am reading 10 million records from BigQuery, doing some transformation, creating a .csv file and uploading the same .csv stream data to an SFTP server using Node.js.
This job takes approximately 5 to 6 hours to complete when run locally.
The solution has been deployed on GCP Cloud Run, but after 2 to 3 seconds Cloud Run closes the container with a 503 error.
Please find the GCP Cloud Run configuration below.
Autoscaling: Up to 1 container instances
CPU allocated: default
Memory allocated: 2Gi
Concurrency: 10
Request timeout: 900 seconds
Is GCP Cloud Run a good option for a long-running background process?
You can use a VM instance with your container deployed and perform your job on it. At the end, kill or stop your VM.
But personally, I prefer serverless solutions and approaches, like Cloud Run. However, long-running jobs on Cloud Run will come one day! Until then, you have to deal with the 60-minute limit or use another service.
As a workaround, I propose you use Cloud Build. Yes, Cloud Build for running any container in it. I wrote an article on this. I ran a Terraform container on Cloud Build, but in reality you can run any container.
Set the timeout correctly, take care of the default service account and the assigned roles, and, something not yet available on Cloud Run, choose the number of CPUs (1, 8 or 32) for the processing to speed up your process.
Want a bonus? You have 120 minutes free per day and per billing account (be careful, it's not per project!).
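A minimal cloudbuild.yaml sketch for that workaround, assuming your job is already packaged as a container image (the image name is a placeholder):
# Run the job container as a single Cloud Build step with a long timeout
steps:
  - name: 'gcr.io/my-project/bq-to-sftp-job'  # placeholder image that performs the export/upload
timeout: '21600s'        # allow up to 6 hours for the whole build
options:
  machineType: 'E2_HIGHCPU_8'   # larger worker than the default
You would then start it with gcloud builds submit (or via Cloud Scheduler calling the Cloud Build API) instead of an HTTP request to Cloud Run.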
Update: October 2021
Cloud Run supports background activities.
Configure CPU to be always-allocated if you use background activities
Background activity is anything that happens after your HTTP response has been delivered. To determine whether there is background activity in your service that is not readily apparent, check your logs for anything that is logged after the entry for the HTTP request.
Configure CPU to be always-allocated
If you want to support background activities in your Cloud Run service, set your Cloud Run service CPU to be always allocated so you can run background activities outside of requests and still have CPU access.
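For example (a sketch; the service name and region are placeholders):
gcloud run services update my-job --region us-central1 --no-cpu-throttling
This keeps the CPU allocated outside of request handling, at the cost of being billed for the full lifetime of the instance.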
Is GCP Cloud Run a good option for a long-running background process?
Not a good option, because your container is 'brought to life' by an incoming HTTP request and, as soon as the container responds (e.g. sends something back), Google assumes the processing of the request is finished and cuts the CPU off.
Which may explain this:
The solution has been deployed on GCP Cloud Run, but after 2 to 3 seconds Cloud Run closes the container with a 503 error.
You can try using an Apache Beam pipeline deployed via Cloud Dataflow. Using Python, you can perform the task with the following steps:
Stage 1. Read the data from BigQuery table.
beam.io.Read(beam.io.BigQuerySource(query=your_query, use_standard_sql=True))
Stage 2. Upload Stage 1 result into a CSV file on a GCS bucket.
beam.io.WriteToText(file_path_prefix="",
                    file_name_suffix='.csv',
                    header='list of csv file headers')
Stage 3. Call a ParDo function which will then take the CSV file created in Stage 2 and upload it to the SFTP server. You can refer to this link.
You may consider a serverless, event-driven approach:
configure a Google Cloud Storage trigger on a Cloud Function that runs the transformation
extract/export the BigQuery data to the Cloud Function trigger bucket - this is the fastest way to get BigQuery data out
Sometimes data exported that way may be too large to be suitable in that form for Cloud Function processing, due to restrictions like the maximum execution time (currently 9 minutes) or the 2 GB memory limit.
In that case, you can split the original data file into smaller pieces and/or push them to Pub/Sub with a storage mirror.
All that said, we've used Cloud Functions to process a billion records, from building bloom filters to publishing data to Aerospike, in under a few minutes end to end.
I will try to use Dataflow to create the .csv file from BigQuery and will upload that file to GCS.

How to run multiple executables with Azure cloud services - Function apps?

I'm dealing with a legacy piece of software, totally not cloud friendly.
The local workflow is as follows:
Run Software1
Software1 creates some helper files to be used by Software2
Software2 runs and generates a result file
Software2 is a simulation model compiled as an executable.
I now need to run hundreds of simulations, and since this software doesn't even support multi-threading I'm looking at running it in the cloud. I have little to no experience with cloud computing. Our company mainly works with Azure, but I don't have a problem using AWS or another cloud computing service.
What I'm thinking as possible solution is:
Run a virtual machine that runs Software1
Software1 creates several folders. Each folder contains all the necessary files to perform a single simulation.
Each folder is uploaded to a blob storage folder
A Function App is triggered by the creation of the blob storage folder, and a run of Software2 is performed for each folder
Once Software2 is done with the simulation, the Function App copies the result file back to blob storage, in the same folder as the corresponding run.
I tested the Function App and it does what I need, but I'm not quite sure how to run it several times in parallel. Do you have any suggestions on how to achieve this? Or maybe I should be using something different than Function Apps.
Thank you in advance for your help,
Guido
If I have understood this correctly, you want to run this Function App multiple times in parallel to "simulate" parallel execution. I think you need to look at Event Grid and re-think your architecture.
If you use a blob trigger, your function will be triggered each time you perform an operation in the blob container. If 1 file = 1 run of Software2, a blob trigger is OK and Azure will scale and run your function in parallel. The issue is that Software2 needs to write the results back to blob storage, which creates new triggers.
Another way would be to have Software1 send a message to a Storage Queue or Service Bus, or an event with Event Grid, and have your function be triggered by that. You would then write a Durable Function using the "fan out/fan in" pattern to run Software2 in parallel, as in the sketch below.
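A minimal sketch of such an orchestrator in Node.js with the durable-functions package; the activity names GetSimulationFolders and RunSoftware2 are hypothetical:
// orchestrator function - fan out one activity per simulation folder, then fan in
const df = require('durable-functions');

module.exports = df.orchestrator(function* (context) {
  // Placeholder activity that lists the simulation folders prepared by Software1
  const folders = yield context.df.callActivity('GetSimulationFolders');

  // Fan out: start one RunSoftware2 activity per folder, in parallel
  const tasks = folders.map((folder) =>
    context.df.callActivity('RunSoftware2', folder)
  );

  // Fan in: wait until every simulation has finished, then return the results
  const results = yield context.df.Task.all(tasks);
  return results;
});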
You can also look at creating parallel branches in a Logic App.

Advice needed - Running Python code on GOOGLE CLOUD PLATFORM serverless

I have Python code which reads data from one cloud system via a REST API using the requests module and then writes data back to another cloud system via a REST API. This code runs anywhere from 1 to 4 hours every week. Is there a place in Google Cloud Platform where I can execute this code on a periodic basis, sort of like a scheduled batch job? Is there a serverless option to do this in App Engine? I know about the App Engine cron service, but it seems like it is only for calling a URL regularly. Any thoughts? Appreciate your help.
Google Cloud Scheduler could be the tool you are looking for. As it is mentioned in its documentation:
Cloud Scheduler is a fully managed enterprise-grade cron job scheduler. It allows you to schedule virtually any job, including batch, big data jobs, cloud infrastructure operations, and more. You can automate everything, including retries in case of failure to reduce manual toil and intervention.
Here you have the quickstart for Cloud Scheduler, and also another tutorial for Cron jobs.
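For example, a sketch of creating a weekly HTTP job with the gcloud CLI (the job name, schedule and URL are placeholders):
gcloud scheduler jobs create http weekly-sync \
  --schedule="0 5 * * 1" \
  --uri="https://example.com/run-sync" \
  --http-method=POST
The schedule uses standard cron syntax (here 05:00 every Monday); the endpoint would be whatever kicks off your Python job.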
You can use the Google Genomics API pipelines.run endpoint to run a long-running job on a Google Compute Engine virtual machine, and it will destroy the machine when it's done. If your job will run for less than 24 hours and can handle a failure, then you can use a preemptible VM to save cost.
Pipelines: Run
https://cloud.google.com/genomics/reference/rest/v2alpha1/pipelines/run
Preemptible Virtual Machines
https://cloud.google.com/preemptible-vms/
You could use Cloud Scheduler to kick off the job.
Pipelines may be preferable to the serverless technologies because those don't tend to handle long-running jobs as well.
You can use AI Platform Training to run any arbitrary Python package — it doesn’t have to be a machine learning job.

How to host long running process into Azure Cloud?

I have a C# console application which extracts a 15 GB Firebird database file at a server location into multiple files and then loads the data from those files into a SQL Server database. The console application uses the System.Threading.Tasks.Parallel class to perform the data load from the files to the SQL Server database in parallel.
It is a weekly process and it takes 6 hours to complete.
What is the best option for moving this (console application) process to the Azure cloud - WebJob, WorkerRole or any other cloud service?
How can I reduce the execution time (6 hrs) after moving to the cloud?
How do I implement the suggested option? Please provide pointers or code samples, etc.
Detailed comments would be very much appreciated.
Thanks
Bhanu.
Let me give some thoughts on this question of yours:
"What is the best option to move this (console application) process to the Azure cloud - WebJob, WorkerRole or any other cloud service?"
First, you can achieve the task with both a WebJob and a WorkerRole, but I would suggest you go with a WebJob.
Pros of a WebJob:
Deployment time is quicker; you can turn your console app, without any change, into a continuously running WebJob within minutes (https://azure.microsoft.com/en-us/documentation/articles/web-sites-create-web-jobs/)
Built-in timer support, whereas with a WorkerRole you will need to handle it on your own
Fault tolerance: when your WebJob fails, there is built-in resume logic
You might want to check out Azure Functions. You pay only for the processing time you use and there doesn't appear to be a maximum run time (unlike AWS Lambda).
They can be set up on a schedule or kicked off from other events.
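For the scheduling part, a minimal sketch of a timer trigger binding (function.json); the weekly NCRONTAB schedule below (02:00 every Sunday) is just a placeholder:
{
  "bindings": [
    {
      "name": "weeklyTimer",
      "type": "timerTrigger",
      "direction": "in",
      "schedule": "0 0 2 * * 0"
    }
  ]
}
The function body that the timer invokes would then perform, or fan out, the extract-and-load work.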
If you are already doing work in parallel, you could break some of the parallel tasks out into separate Azure Functions. Aside from that, how to speed things up would require specific knowledge of what you are trying to accomplish.
In the past, when I've tried to speed up work like this, I would start by emitting log messages during processing that contain the current time or that calculate durations (using the Stopwatch class), then find out which areas can be improved. The slowness may also be due to a slowdown on the SQL Server side; more investigation is needed on your part. But the first step is always capturing metrics.
Since Azure Functions can scale out horizontally, you might want to first break the data from the files into smaller chunks and let the functions handle each chunk, then spin up multiple parallel processings of those chunks. Be sure not to spin up more than your SQL Server can handle.
