Run Node JS on a multi-core cluster cloud

Is there a service or framework or any way that would allow me to run Node JS for heavy computations letting me choose the number of cores?
I'll be more specific: let's say I want to run some expensive computation for each of my users and I have 20000 users.
So I want to run the expensive computation for each user on a separate thread/core/computer, so I can finish the computation for all users faster.
But I don't want to deal with low level server configuration, all I'm looking for is something similar to AWS Lambda but for high performance computing, i.e., letting me scale as I please (maybe I want 1000 cores).
I did simulate this with AWS Lambda by having a "master" lambda that receives the data for all 20000 users and then calls a "computation" lambda for each user. Problem is, with AWS Lambda I can't make 20000 requests and wait for their callbacks at the same time (I get a request limit exceeded error).
With some setup I could use Amazon HPC, Google Compute Engine or Azure, but their instances only go up to 64 cores, so if I need more than that, I'd still have to set up all the machines I need separately and orchestrate the communication between them with something like Open MPI, handling the different low-level setups for master and compute instances (accessing them via SSH, etc.).
So is there any service where I can just paste my Node JS code, maybe choose the number of cores, and run it (without having to care about the OS or how many computers are in my cluster)?
I'm looking for something that can take that code:
var users = [...];

function expensiveCalculation(user) {
    // ...
    return ...;
}

users.forEach(function(user) {
    Thread.create(function() {
        save(user.id, expensiveCalculation(user));
    });
});
And run each thread on a separate core so they can run simultaneously (therefore finishing faster).
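For reference, the closest single-machine approximation of that pseudocode today is Node's worker_threads module. A minimal sketch (the user list and the calculation body are placeholders; this is not the managed multi-machine service being asked for):

const { Worker, isMainThread, parentPort, workerData } = require('worker_threads');

function expensiveCalculation(user) {
    // Placeholder CPU-bound work
    let sum = 0;
    for (let i = 0; i < 1e8; i += 1) sum += i % (user.id + 1);
    return sum;
}

if (isMainThread) {
    const users = [{ id: 1 }, { id: 2 }, { id: 3 }]; // stand-in user list
    users.forEach((user) => {
        const worker = new Worker(__filename, { workerData: user });
        worker.on('message', (result) => {
            console.log(`user ${user.id} ->`, result); // stand-in for save(user.id, result)
        });
    });
} else {
    parentPort.postMessage(expensiveCalculation(workerData));
}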

I think your problem is that you feel you need to process all 20000 inputs at once on the same machine. Have you looked into SQS from Amazon? You could push those 20000 inputs into SQS and then have a cluster of servers pull from that queue and process each one individually.
With this approach you could add as many servers, processes, or AWS Lambda invocations as you want. You could even use a combination of the three to see which is cheaper or faster. Adding resources simply reduces the time it takes to complete all the computations, and you wouldn't have to wait on 20000 requests to complete; each process can send a notification when it finishes its computation.
So basically, you could have a simple application that grabs 10 of these inputs at a time and runs your computation on them. After it finishes, it deletes them from SQS and sends a notification somewhere (maybe SNS?) so the user or some other system knows they are done. Then it repeats the process.
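A minimal sketch of such a worker in Node.js (the queue URL, topic ARN, and expensiveCalculation body are placeholders; assumes the aws-sdk v2 package and configured credentials):

const AWS = require('aws-sdk');
const sqs = new AWS.SQS({ region: 'us-east-1' });
const sns = new AWS.SNS({ region: 'us-east-1' });

const QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/users'; // placeholder
const TOPIC_ARN = 'arn:aws:sns:us-east-1:123456789012:computation-done';    // placeholder

function expensiveCalculation(user) { return user.id; } // stand-in for the real work

async function workLoop() {
    while (true) {
        // SQS hands back at most 10 messages per receive call.
        const { Messages = [] } = await sqs.receiveMessage({
            QueueUrl: QUEUE_URL,
            MaxNumberOfMessages: 10,
            WaitTimeSeconds: 20 // long polling
        }).promise();
        for (const msg of Messages) {
            const user = JSON.parse(msg.Body);
            const result = expensiveCalculation(user);
            // Delete only after the computation succeeded...
            await sqs.deleteMessage({ QueueUrl: QUEUE_URL, ReceiptHandle: msg.ReceiptHandle }).promise();
            // ...then notify via SNS that this user is done.
            await sns.publish({ TopicArn: TOPIC_ARN, Message: JSON.stringify({ userId: user.id, result: result }) }).promise();
        }
    }
}

workLoop();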
From there you can scale the process horizontally, so you wouldn't need a supercomputer to get through the work. You could either run a cluster of EC2 instances, each hosting several of these worker applications, or have a Lambda function invoked periodically to pull items out of SQS and process them.
EDIT:
To get started using an EC2 instance I would look at the docs here. To start with I would pick the smallest, cheapest instance (T2.micro I think), and leave everything at its defaults. There's no need to open any port other than the one for SSH.
Once it's set up and you've logged in, the first thing to do is run aws configure to set up your profile so that you can access AWS resources from the instance. After that, install Node and get your application onto the instance using git or something similar. Once that's done, go to the EC2 console; in the Actions menu there will be an option to create an image from the instance.
Once you've created the image, you can go to Auto Scaling groups and create a launch configuration using that AMI. It'll then let you specify how many instances you want to run.
I feel like this could also be done more easily using their container service, but honestly I don't know how to use it yet.

Related

Google Cloud Run - One container handling multiple similar requests with queue for each user

I have a SERVICE that gets a request from a webhook; this is currently deployed across separate Cloud Run containers. These separate containers use the exact same image; however, each instance processes data separately for a particular account.
This is because processing a request takes roughly 3-5 minutes, and if the user sends in more requests, each one needs to wait for the existing process to complete for that particular user before being processed, to avoid race conditions. The container can still receive webhooks in the meantime; however, the actual processing of the data needs to happen one at a time per account.
Is there a way to reduce the container count, for example by using one container to process all the requests, while still ensuring it processes one task per user at a time and waits for that to complete before processing the next request from the same user?
To explain it better:
Multiple tasks can run across all the users.
However, per user, only one task is processed at a time; once that completes, the next task for that user can be processed (see the sketch below).
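For reference, the serialization rule itself is easy to sketch in-process by chaining each user's tasks onto a promise (this only holds within a single container instance, which is exactly the limitation the question runs into; handleWebhook is an assumed application function):

// Map each account ID to the tail of that account's promise chain.
const queues = new Map();

function enqueueForUser(accountId, task) {
    const tail = queues.get(accountId) || Promise.resolve();
    // Start the new task only after the previous one settles,
    // whether it succeeded or failed.
    const next = tail.catch(() => {}).then(() => task());
    queues.set(accountId, next);
    return next;
}

// Usage: tasks for the same account run one at a time,
// while different accounts proceed in parallel.
// enqueueForUser(accountId, () => handleWebhook(payload));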
I was thinking of tracking the tasks in a Redis cache; however, with Cloud Run being stateless, I am not sure that is the right way to go.
Alternatively, I could separate the requests from the actual work (master/worker, across two images) and have the worker report back to the master once a task completes for a user, using concurrency to process multiple tasks across users; however, that might mean I'd have to increase the Cloud Run timeout.
I'd be glad to hear any other suggestions.
Apologies if this doesn't seem clear, feel free to ask for more information.

How to find/cure source of function app throughput issues

I have an Azure function app triggered by an HttpRequest. The function app reads the request, tosses one copy of it into a storage table for safekeeping and sends another copy to a queue for further processing by another element of the system. I have a client running an ApacheBench test that reports approximately 148 requests per second processed. That rate of processing will not be enough for our expected load.
My understanding of function apps is that they should spawn as many instances as needed to handle the load sent to them. But this function app might not be scaling out quickly enough, as it's only handling those 148 requests per second. I need it to handle at least 200 requests per second.
I’m not 100% sure the problem is on my end, though. In analyzing the performance of my function app I found a LOT of 429 errors. What I found online, particularly https://learn.microsoft.com/en-us/azure/azure-resource-manager/resource-manager-request-limits, suggests that these errors could be due to too many requests being sent from a single IP. Would several ApacheBench 10K and 20K request load tests within a given day cause the 429 error?
However, if that’s not it, if the problem is with my function app, how can I force my function app to spawn more instances more quickly? I assume this is the way to get more throughput per second. But I’m still very new at working with function apps so if there is a different way, I would more than welcome your input.
Maybe the Premium app service plan that's in public preview would handle more throughput? I've thought about switching over to that and running a quick test, but am unsure whether I'd be able to switch back.
Maybe EventHub is something I need to investigate? Is that something that might increase my apparent throughput by catching more requests and holding on to them until the function app could accept and process them?
Thanks in advance for any assistance you can give.
You don't provide much context about your app, but here are a few steps you can take to improve throughput.
If you want more control, use an App Service plan with Always On enabled to avoid cold starts. You will also need to configure autoscaling, since in this plan scaling is your responsibility and autoscale is not enabled by default.
Your Azure Function must be fully async: you have external dependencies, so you don't want to block a thread while calling them.
Look at the limits; you can tweak them in host.json.
A 429 error means the function is too busy to process your request, so it's likely that when you write to the table you are not using async calls and are blocking a thread.
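For illustration, a minimal sketch of a fully async handler in Node.js using output bindings, so the table and queue writes don't block the handler (the binding names tableRow and queueItem are assumptions and would need matching entries in function.json):

module.exports = async function (context, req) {
    // Output bindings: the runtime performs the table and queue writes
    // after the handler returns, with no blocked threads.
    context.bindings.tableRow = {
        PartitionKey: 'requests',
        RowKey: context.invocationId,
        payload: JSON.stringify(req.body)
    };
    context.bindings.queueItem = req.body;
    context.res = { status: 202 };
};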
Function apps do work well and scale as documented. The issue could be that the requests come from a single IP and Azure is treating them as a DDoS attempt. You can do the following:
Azure DevOps Load Test
You can load test using one of the Azure services; I am fairly sure they have better criteria for handling source IPs. See Azure DevOps Load Test.
Provision a VM in Azure
The way I normally do it is to provision a VM (Windows 10 Pro) in Azure and use JMeter to load test. I have used this method and it works fine. You can provision a couple of them and subdivide the load.
Use professional load testing services
If possible, you can use services like Loader.io. They use sophisticated algorithms to run the load test and provision a bunch of VMs to run the same test.
Use Application Insights
If you aren't already, you should be using Application Insights to get a better view from the server's perspective. Go to Live Stream and see how many instances get provisioned to handle the load test. You can easily look into events and error logs that may be arising, and deep dive into each associated dependency to investigate the problem.

Schedule a task to run at some point in the future (architecture)

So we have a Python flask app running making use of Celery and AWS SQS for our async task needs.
One tricky problem that we've been facing recently is creating a task to run in x days, or in 3 hours for example. We've had several needs for something like this.
For now we create events in the database with timestamps that store the time they should be triggered. Then we use celery beat to run a scheduled task every second that checks whether there are any events to process (based on the trigger timestamp) and processes them. However, this queries the database every second for events, which we feel could be improved.
We looked into the eta parameter in Celery (http://docs.celeryproject.org/en/latest/userguide/calling.html), which lets you schedule a task to run in x amount of time. However, it seems to be bad practice to have large ETAs, and AWS SQS has a visibility timeout of about two hours, so anything longer than that would cause a conflict.
I'm scratching my head right now. On the one hand this works, and pretty decently, in that things have been separated out with SNS, SQS, etc. to ensure scaling tolerance. However, it just doesn't feel right to query the database every second for events to process. Surely there's an easier way, or a service provided by Google/AWS, to schedule some event (pub/sub) to occur at some time in the future (x hours, minutes, etc.).
Any ideas?
Have you taken a look at AWS Step Functions, specifically the Wait state? You might be able to put together a couple of Lambda functions: the first one returns a timestamp (or the number of seconds to wait) to the Wait state, and the last one adds the message to SQS after the Wait returns.
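As a rough sketch (the Lambda ARNs are placeholders and the definition is untested), the state machine could look like this, expressed as a JavaScript object:

var definition = {
    StartAt: 'ComputeWakeTime',
    States: {
        // First Lambda returns an ISO-8601 timestamp into $.wakeAt
        ComputeWakeTime: {
            Type: 'Task',
            Resource: 'arn:aws:lambda:region:account:function:computeWakeTime',
            ResultPath: '$.wakeAt',
            Next: 'WaitUntil'
        },
        // The Wait state pauses until the timestamp produced above
        WaitUntil: { Type: 'Wait', TimestampPath: '$.wakeAt', Next: 'EnqueueJob' },
        // Last Lambda pushes the message onto SQS
        EnqueueJob: {
            Type: 'Task',
            Resource: 'arn:aws:lambda:region:account:function:enqueueJob',
            End: true
        }
    }
};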
Amazon's scheduling solution is to use CloudWatch to trigger events. Those events can place a message on an SQS/SNS endpoint, trigger an ECS task, run a Lambda, etc. A lot of folks use the trick of executing a Lambda that then does something else to trigger something in your system. For example, you could trigger a Lambda that pushes a job onto Redis for a Celery worker to pick up.
When creating a CloudWatch rule, you can specify either a rate (i.e., every 5 minutes) or an arbitrary time in cron syntax.
So my suggestion for your use case would be to create a CloudWatch rule that fires at the time your job needs to kick off (or a minute before, depending on how time-sensitive you are). That rule would then interact with your application to kick off your job. You only pay for the resources when CloudWatch triggers.
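A minimal sketch of creating such a one-shot rule from Node.js (the rule name, schedule, and Lambda ARN are placeholders; the Lambda also needs a permission allowing events.amazonaws.com to invoke it):

var AWS = require('aws-sdk');
var events = new AWS.CloudWatchEvents({ region: 'us-east-1' });

// CloudWatch cron expressions include a year field, so a rule can fire
// once at an absolute time: cron(Minutes Hours Day Month Day-of-week Year).
events.putRule({
    Name: 'run-job-1234',
    ScheduleExpression: 'cron(30 14 25 12 ? 2019)' // 14:30 UTC on 2019-12-25
}).promise().then(function () {
    return events.putTargets({
        Rule: 'run-job-1234',
        Targets: [{ Id: 'target-1', Arn: 'arn:aws:lambda:region:account:function:kickOffJob' }]
    }).promise();
});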
Have you looked into Amazon Simple Notification Service? It sounds like it would serve your needs...
https://aws.amazon.com/sns/
From that page:
Amazon SNS is a fully managed pub/sub messaging service that makes it easy to decouple and scale microservices, distributed systems, and serverless applications. With SNS, you can use topics to decouple message publishers from subscribers, fan-out messages to multiple recipients at once, and eliminate polling in your applications. SNS supports a variety of subscription types, allowing you to push messages directly to Amazon Simple Queue Service (SQS) queues, AWS Lambda functions, and HTTP endpoints. AWS services, such as Amazon EC2, Amazon S3 and Amazon CloudWatch, can publish messages to your SNS topics to trigger event-driven computing and workflows. SNS works with SQS to provide a powerful messaging solution for building cloud applications that are fault tolerant and easy to scale.
You could start the job with apply_async, and then use a countdown, like:
xxx.apply_async(..., countdown=TTT)
It is not guaranteed that the job starts exactly at that time, depending on how busy the queue is, but that does not seem to be an issue in your use case.

Google Cloud Platform: Running a scraping script for several hours

I have a NodeJS script that scrapes URLs every day.
The requests are throttled to be kind to the server, which results in my script running for a fairly long time (several hours).
I have been looking for a way to deploy it on GCP. Because it was previously run from cron, I naturally looked at how to run a cron job on Google Cloud. However, according to the docs, the script has to be exposed as an API, and HTTP calls to that API can only run for up to 60 minutes, which doesn't fit my needs.
I had a look at this S.O. question, which recommends using a Cloud Function. However, I am unsure this approach would be suitable in my case, as my script requires a lot more processing than the simple server-monitoring job described there.
Does anyone have experience doing this on GCP?
N.B.: To clarify, I want to avoid deploying it on a VPS.
Edit:
I reached out to Google; here is their reply:
Thank you for your patience. Currently, it is not possible to run a cron script for 6 to 7 hours in a row, since the current limitation for cron in App Engine is 60 minutes per HTTP request.
If it is possible for your use case, you can spread the 7 hours across recurring tasks, for example every 10 minutes or 1 hour. A cron job request is subject to the same limits as those for push task queues. Free applications can have up to 20 scheduled tasks. You may refer to the documentation for the cron schedule format.
Also, it is possible to still use Postgres and Redis with this. However, kindly take note that Postgres is still in beta.
As I can't spread the task, I had to keep managing a dokku VPS for this.
I would suggest combining two services, GAE Cron Jobs and Cloud Tasks.
Use GAE cron jobs to publish a list of sites and ranges to scrape to Cloud Tasks. This initialization process doesn't need to be 'kind' to the server yet; it can simply publish all the chunks of work to the Cloud Tasks queue and consider itself finished when that's done.
Follow that up with a task queue, and use the queue's rate-limiting configuration as the way to cap the overall request rate to the endpoint you're scraping. If you need less than 1 qps, add a sleep statement directly in your code. If you're really queueing millions or billions of jobs, follow their advice of having one queue feed another:
Large-scale/batch task enqueues
When a large number of tasks, for example millions or billions, need to be added, a double-injection pattern can be useful. Instead of creating tasks from a single job, use an injector queue. Each task added to the injector queue fans out and adds 100 tasks to the desired queue or queue group. The injector queue can be sped up over time, for example start at 5 TPS, then increase by 50% every 5 minutes.
That should be pretty hands-off, and only requires you to think through how the cron job pulls the next desired sites and pages, and how small the chunks of work should be.
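For illustration, a minimal sketch of enqueueing one chunk of work from Node.js with the Cloud Tasks client (the project, location, queue name, and worker URL are assumptions):

const { CloudTasksClient } = require('@google-cloud/tasks');
const client = new CloudTasksClient();

async function enqueueScrape(targetUrl) {
    const parent = client.queuePath('my-project', 'us-central1', 'scrape-queue'); // placeholders
    await client.createTask({
        parent,
        task: {
            httpRequest: {
                httpMethod: 'POST',
                url: 'https://scraper-worker.example.com/scrape', // assumed worker endpoint
                headers: { 'Content-Type': 'application/json' },
                body: Buffer.from(JSON.stringify({ url: targetUrl })).toString('base64')
            }
        }
    });
}

The queue's dispatch rate limit then controls how fast the worker endpoint is actually hit.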
I'm also working on a task like this: I need to crawl websites and had the same problem.
Instead of running the main crawler task on a VM, I moved it to Google Cloud Functions. The task consists of getting the target URL, scraping the page, saving the result to Datastore, and then returning the result to the caller.
This is how it works: I have a long-running application that can be called a master. The master knows which URLs we are going to access. But instead of accessing the target website itself, it sends each URL to a crawler function in GCF; the crawling is done there and the result is sent back to the master. In this arrangement the master only makes a request and receives a small amount of data, and it never touches the target website; the rest is left to GCF. You can offload your master and crawl websites in parallel via GCF, and you can use other methods to trigger GCF instead of an HTTP request too.
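A minimal sketch of what such a crawler function could look like (the Datastore kind, endpoint shape, and node-fetch dependency are assumptions):

const { Datastore } = require('@google-cloud/datastore');
const fetch = require('node-fetch');
const datastore = new Datastore();

// HTTP-triggered Cloud Function: fetch one URL, persist a summary,
// and return only a small payload to the calling master.
exports.crawl = async (req, res) => {
    const { url } = req.body;
    const html = await (await fetch(url)).text();
    await datastore.save({
        key: datastore.key(['Page', url]),
        data: { url, length: html.length, fetchedAt: new Date() }
    });
    res.json({ url, length: html.length });
};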

Node/Express: running specific CPU-intensive tasks in the background

I have a site that makes the standard data-bound calls, but it also has a few CPU-intensive tasks which are run a few times per day, mainly by the admin.
These tasks involve grabbing data from the db, running a few different time-consuming algorithms, then re-uploading the data. What would be the best method for making these calls and having them run without blocking the event loop?
I definitely want to keep the calculations on the server, so (browser) web workers wouldn't work here. Would a child process be enough? Or should I have a separate thread running in the background handling all /api/admin calls?
The basic answer to this scenario in Node.js land is to use the core cluster module - https://nodejs.org/docs/latest/api/cluster.html
Its API makes it straightforward to:
launch worker Node.js instances on the same machine (each instance has its own event loop)
keep a live communication channel for short messages between instances
This way, any work done in a child instance will not block your master's event loop.
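A minimal sketch of that pattern (the expensive job here is a placeholder loop; a real setup would route only the admin work to the workers):

const cluster = require('cluster');
const os = require('os');

function expensiveCalculation() {
    // Placeholder for the real CPU-bound admin task
    let sum = 0;
    for (let i = 0; i < 1e9; i += 1) sum += i;
    return sum;
}

if (cluster.isMaster) {
    // One worker per core; the master's event loop stays free.
    os.cpus().forEach(() => cluster.fork());
    cluster.on('message', (worker, msg) => {
        console.log(`worker ${worker.process.pid} finished:`, msg.result);
    });
} else {
    process.send({ result: expensiveCalculation() });
    process.exit(0);
}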
