Node.js - cron jobs in cluster mode or behind a load balancer executing multiple times

I have a NestJS server running in cluster mode on an EC2 instance using pm2.
I have successfully set up the cron jobs so that they execute only once in cluster mode, by starting the server with multiple named app configurations and using the pm2 process name (see the sketch after the config below).
{
  "apps": [
    {
      "script": "dist/main.js",
      "instances": "1",
      "exec_mode": "cluster",
      "name": "queue"
    },
    {
      "script": "dist/main.js",
      "instances": "1",
      "exec_mode": "cluster",
      "name": "coco"
    }
  ]
}
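For context, this is roughly how the job is gated on the pm2 process name. It's a minimal sketch, assuming @nestjs/schedule is used for the cron jobs and that pm2 injects the app name as process.env.name (which it normally does when "name" is set in the ecosystem file):

// jobs.service.ts - minimal sketch; assumes ScheduleModule.forRoot() is
// registered and that pm2 exposes the app name as process.env.name.
import { Injectable } from '@nestjs/common';
import { Cron } from '@nestjs/schedule';

@Injectable()
export class JobsService {
  // Fires every 5 minutes in every pm2 app, but only the "queue" app does the work.
  @Cron('0 */5 * * * *')
  async handleRecurringJob(): Promise<void> {
    if (process.env.name !== 'queue') {
      return; // the "coco" app (and any other instance) skips the job
    }
    // ...actual job logic here...
  }
}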
But I want to know how to handle this case when multiple instances are running behind a load balancer: the jobs are scheduled on each instance and executed multiple times against the same data from the remote database.
Any help would be appreciated.
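One common approach (not from the original post) is for every instance to schedule the job but take a short-lived distributed lock before running it, so only the first instance to acquire the lock actually executes. A minimal sketch, assuming Redis via ioredis; the key name and TTL are made up for illustration:

// cron-lock.ts - hedged sketch: Redis/ioredis, the key name and the TTL
// are assumptions for illustration, not part of the original setup.
import Redis from 'ioredis';

const redis = new Redis(process.env.REDIS_URL ?? 'redis://127.0.0.1:6379');

// SET ... NX EX succeeds for exactly one caller per key, so only one instance
// behind the load balancer runs a given scheduled tick.
export async function runOncePerCluster(jobName: string, work: () => Promise<void>): Promise<void> {
  const tick = new Date().toISOString().slice(0, 16); // e.g. "2024-01-01T12:00"
  const acquired = await redis.set(`cron-lock:${jobName}:${tick}`, '1', 'EX', 300, 'NX');
  if (acquired !== 'OK') {
    return; // another instance already claimed this run
  }
  await work();
}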

Related

Controlling the number of spark drivers running on Mesos

We have a cluster with 6 EC2 nodes in AWS (16 CPUs, 64 GB per node; 1 node running the Mesos master and 5 nodes running Mesos slaves). The Mesos dispatcher runs on the master node. We use this cluster exclusively to run Spark jobs. Our Spark configuration is 5 CPUs and 10 GB per executor (same for the driver).
In one of our scheduled jobs, we have a scenario where we need to do a few hundred spark-submits at the same time. When this happens, the dispatcher starts drivers for all these spark-submits, leaving no room for any executors in the cluster.
I'm looking at a few options and would like to get some pointers from members of the Spark/Mesos community.
Some options I don't want to get into are: increasing the cluster size, asking the analysts to change their job structure to combine all spark-submits into a single one, and switching to YARN/EMR (I actually tried this and ran into some messy queue problems there).
Option 1: Using Mesos roles
I found some documentation on the use of quotas and roles to solve this, but I'm not sure about the following:
How to create a Mesos role and update the resources to be made available to this role?
How to set up separate roles for Spark drivers and Spark executors?
By default all resources are in the * role. There is a flag called spark.mesos.role that I can set when doing spark-submit, but I'm not sure how to create this role and ensure it is used only for executors.
Option 2: Modifying the Mesos cluster scheduler
When a spark-submit reaches the Mesos dispatcher, it adds the driver request to a WaitingQueue. When drivers fail while executing and supervise mode is enabled, they are sent to a PendingRetryQueue with custom retry schedule settings. When resources become available from Mesos, the drivers from the PendingRetryQueue are scheduled first and the WaitingQueue is scheduled next.
I was thinking of keeping the WaitingQueue at size 5 (spark.mesos.maxDrivers) and, when there are more spark-submits than the queue size, adding these drivers to the PendingRetryQueue and scheduling them to run later. Currently, as per my understanding, when there are more than 200 drivers in the WaitingQueue the Mesos REST server sends a failed message and doesn't add them to the PendingRetryQueue.
Any help on implementing either of the options would be very helpful to me. Thanks in advance.
Update: Just saw that when I do spark-submit with a role, it runs only the executors in that role and the drivers run in the default * role. I think this should solve the issue for me. Once I test this, I'll post my update here and close this. Thanks.
As mentioned in the update, by default Mesos runs Spark drivers in the default role (*) and executors in the role provided by the 'spark.mesos.role' parameter. To control the resources available to each role, we can use quotas, guarantees or reservations. We went ahead with static reservations since they suited our requirements. Thanks.
Option 1 is the right one.
First, set the dispatcher quota by creating a file like dispatcher-quota.json:
$ cat dispatcher-quota.json
{
  "role": "dispatcher",
  "guarantee": [
    {
      "name": "cpus",
      "type": "SCALAR",
      "scalar": { "value": 5.0 }
    },
    {
      "name": "mem",
      "type": "SCALAR",
      "scalar": { "value": 5120.0 }
    }
  ]
}
Then push it to your Mesos master (leader) with:
curl -d @dispatcher-quota.json -X POST http://<master>:5050/quota
So now you have a quota for the drivers.
Ensure your dispatcher is running with the right service role; if needed, adjust it. If on DC/OS, use:
$ cat options.json
{
  "service": {
    "role": "dispatcher"
  }
}
$ dcos package install spark --options=options.json
Otherwise, feel free to share how you've deployed your dispatcher and I'll provide a how-to guide.
That's it for the drivers. Now let's do the same for the executors:
$ cat executor-quota.json
{
  "role": "executor",
  "guarantee": [
    {
      "name": "cpus",
      "type": "SCALAR",
      "scalar": { "value": 100.0 }
    },
    {
      "name": "mem",
      "type": "SCALAR",
      "scalar": { "value": 409600.0 }
    }
  ]
}
$ curl -d @executor-quota.json -X POST http://<master>:5050/quota
Adapt the values to your requirements.
Then make sure the executors are launched with the correct role by providing:
--conf spark.mesos.role=executor \
The source of my explanation is https://github.com/mesosphere/spark-build/blob/master/docs/job-scheduling.md; don't hesitate to ask if that's not enough.
This should do the trick.

How to tell apart Mesos frameworks launched by different commands/parameters

I am building a metrics collector to gather the running status of all the Spark jobs on the cluster. The Mesos API http://masterip/frameworks returns a lot of details about all the frameworks, and I then call http://slaveip/slave(1)/monitor/statistics to get each framework's detailed info from each slave and correlate the two.
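Roughly, the correlation step looks like the sketch below (Node 18+ global fetch; the host lists and the agent port 5051 are placeholders, and the response shapes are assumed to match the sample further down):

// correlate.ts - sketch only; master/slave hosts are placeholders and the
// JSON shapes are assumed from the /frameworks and /monitor/statistics output.
async function collect(masterHost: string, slaveHosts: string[]): Promise<void> {
  const master = (await (await fetch(`http://${masterHost}:5050/frameworks`)).json()) as any;
  // Index frameworks by id so the per-slave statistics can be joined back to a name.
  const frameworksById = new Map<string, any>(
    master.frameworks.map((f: any) => [f.id, f]),
  );

  for (const slave of slaveHosts) {
    const stats = (await (await fetch(`http://${slave}:5051/monitor/statistics`)).json()) as any[];
    for (const entry of stats) {
      const framework = frameworksById.get(entry.framework_id);
      console.log(framework?.name, entry.executor_id, entry.statistics.mem_rss_bytes);
    }
  }
}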
This works fine for most of the jobs, but I have some jobs that behave differently depending on the parameters passed when submitting. They show up under the same framework name in the Mesos GUI and I cannot tell them apart.
Is there a way to get the full command that launched each job? Or any other idea on how to tell them apart?
You can see there are multiple instances with the same framework name, as they are different Spark job instances.
When I connect to the Mesos slave, monitor/statistics doesn't show the full command with all the parameters, so I cannot tell which framework corresponds to which Spark job instance:
{
  "executor_id": "0",
  "executor_name": "Command Executor (Task: 0) (Command: sh -c '\"/usr/local...')",
  "framework_id": "06ba8de8-7fc3-422d-9ee3-17dd9ddcb2ca-3157",
  "source": "0",
  "statistics": {
    "cpus_limit": 2.1,
    "cpus_system_time_secs": 848.689999999,
    "cpus_user_time_secs": 5128.78,
    "mem_limit_bytes": 4757389312,
    "mem_rss_bytes": 2243149824,
    "timestamp": 1522858776.20098
  }
},
Thanks

Running integration tests on ephemeral server instance using Heroku CI

I'm trying to take advantage of this new feature of Heroku to test a parse-server/nodejs application that we have on Heroku, using mocha.
I was expecting Heroku to launch an ephemeral instance of my app along with the tests so that they could be run against it, but it doesn't seem like that's happening. Only the tests get launched.
Now, I found at least one snippet about configuring the Dyno formation to use dynos other than performance-m for the test, so I'm trying to declare my other dynos there as well:
"environments": {
"test": {
"scripts": {
"test-setup": "echo done",
"test": "npm run test"
},
"addons": [
{
"plan": "rediscloud:30",
"as": "REDISCLOUD_URL"
}
],
"formation": {
"test": {
"quantity": 1,
"size": "standard-1x"
},
"worker": {
"quantity": 1,
"size": "standard-1x"
},
"web": {
"quantity": 1,
"size": "standard-1x"
}
}
}
}
in my app.json, but it seems to be getting totally ignored.
I know my mocha script could import the relevant part of the web server and test against it, and that's what I've seen in the non-Heroku-related examples, but our app consists of a worker too, and I'd like to profile the interaction of both and test the job lengths against our performance expectations, rather than individual components, hence "integration tests". Is this a legit use of Heroku CI, or am I doing something wrong or have the wrong expectations? I'm more concerned about this than about getting it to work, because I'm quite certain I could get it to work in a number of ways (mocha spawning the server processes, the npm concurrently package, etc.), but if I can avoid hacks, all the better.
Locally, I was able to get both imported in the script, but the performance is degraded since it's now 2 processes plus the tests running in a single memory space, with Node.js's memory cap limitations and a single event loop instead of 3. While writing this I'm thinking I could probably use throng and spawn different functions depending on the process ID. I'll try this if I don't get any better solutions.
Edit: I already managed to make it run by spawning the server and worker as separate processes in a before step in mocha, calculating the proper RAM amount to allow each one from the env vars (roughly as sketched below). I'm still interested in knowing if there's a better solution.
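For reference, that before step looks roughly like the sketch below; the entry points dist/server.js and dist/worker.js, the TEST_RAM_MB env var and the memory split are hypothetical:

// test/setup.ts - sketch of spawning the web and worker processes before the
// integration tests; entry points, TEST_RAM_MB and the readiness wait are made up.
import { spawn, ChildProcess } from 'child_process';

const children: ChildProcess[] = [];

before(async function () {
  this.timeout(30000);
  // Split the available RAM between the two processes.
  const ramPerProcess = Math.floor(Number(process.env.TEST_RAM_MB ?? 512) / 2);
  for (const entry of ['dist/server.js', 'dist/worker.js']) {
    children.push(
      spawn('node', [`--max-old-space-size=${ramPerProcess}`, entry], {
        stdio: 'inherit',
        env: process.env,
      }),
    );
  }
  // Crude readiness wait; a real setup would poll a health endpoint instead.
  await new Promise((resolve) => setTimeout(resolve, 5000));
});

after(() => {
  children.forEach((child) => child.kill());
});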

Service Fabric Application Package Deployment Operation Timeout exception

I have a Service Fabric cluster with 3 nodes created on 3 systems, and they are interconnected. I am able to connect to each of the nodes. The nodes were created on Windows Server; these Windows Server VMs are on-premises.
I am trying to manually deploy my package to the cluster/one of the nodes and I am getting an Operation Timeout exception. I have used the commands below for deployment.
Service Fabric PowerShell commands:
Copy-ServiceFabricApplicationPackage -ApplicationPackagePath 'c:\sample\etc' -ApplicationPackagePathInImageStore 'abc.app.portaltype'
After executing the above command, it runs for 2-3 minutes and throws an Operation Timeout exception. My package size is almost 250 MB and approximately 15000 files exist in the package. I then passed an extra parameter, -TimeoutSec 600 (10 minutes), explicitly in the above command; it then executed successfully and the package was copied to the Service Fabric image store.
Register-ServiceFabricApplicationType -ApplicationPathInImageStore 'abc.app.portaltype'
After the Copy-ServiceFabricApplicationPackage command completed, I executed the Register-ServiceFabricApplicationType command above to register my application type in the cluster, but it also throws an Operation Timeout exception. I then passed -TimeoutSec 600 (10 minutes) explicitly in that command as well, but no luck; it throws the same operation timeout exception.
Just to check whether the operation timeout issue is caused by the number of files in the package or not, I created a simple, empty Service Fabric ASP.NET Core app, created a package and tried to deploy it to the same server using the commands above. It deployed within a fraction of a second and works smoothly.
Does anybody have any idea how to overcome this Service Fabric operation timeout issue?
How do I handle the operation timeout issue if the package contains a large set of files?
Any help/suggestions would be appreciated.
Thanks,
If this is taking longer than the 10-minute default max, it's probably one of the following issues:
Large application packages (>100s of MB)
Slow network connections
A large number of files within the application package (>1000s).
The following workarounds should help you.
Add the following settings to your cluster config:
"fabricSettings": [
{
"name": "NamingService",
"parameters": [
{
"name": "MaxOperationTimeout",
"value": "3600"
},
]
}
]
Also add:
"fabricSettings": [
{
"name": "EseStore",
"parameters": [
{
"name": "MaxCursors",
"value": "32768"
},
]
}
]
There are a couple of additional features which are currently rolling out. For these to be present and functional, you need to be sure that the client is at least 2.4.28 and the runtime of your cluster is at least 5.4.157. If you're staying up to date, these should already be present in your environment.
For register you can specify the -Async flag, which will handle the upload asynchronously, reducing the required timeout to just the time necessary to send the command, not the application package. You can also query the status of the registration with Get-ServiceFabricApplicationType. 5.5 fixes some issues with these commands, so if they aren't working for you, you'll have to wait for that release to hit your environment.

Deploying a containerised Node.js application through Mesos-Marathon

I am using Marathon to deploy my Docker-containerised Node.js application. My Marathon app spec is as follows:
{
  "id": "<some-name>",
  "cmd": null,
  "cpus": 1,
  "mem": 2800,
  "disk": 30720,
  "instances": 1,
  "container": {
    "docker": {
      "image": "<some-docker-registry-IP>:5000/<repo>",
      "network": "BRIDGE",
      "privileged": true,
      "forcePullImage": true,
      "parameters": [
        {
          "key": "net",
          "value": "host"
        }
      ],
      "portMappings": [
        {
          "containerPort": <some-port>,
          "hostPort": <some-port>,
          "protocol": "tcp",
          "name": null
        }
      ]
    },
    "type": "DOCKER"
  }
}
The problem, however, is that once the application runs out of memory, the server it is deployed on ends up being restarted. I need my services to listen on the private IP of the host machine, and that's why I am using --net=host.
Is it possible to just kill the task, freeing up the memory, so that Marathon can re-spawn it without restarting/shutting down the server? Or is there any other way to make the Docker container routable to the outside world without using --net=host?
Basically, I think there is a problem with your Node application if it shows memory-leaking behaviour. That's the first point I'd address.
The second one is that you should use something like pm2 in your application's Docker image, which will take care of restarting your application (inside the container itself) when it encounters a problem.
Furthermore, you could implement a Marathon health endpoint, so that Marathon can recognize that the application actually has problems.
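For a Node.js app the health endpoint itself can be as small as the sketch below (Express and the /health path are assumptions, not from the original post); Marathon then probes it via an HTTP entry in the app spec's healthChecks array:

// health.ts - minimal sketch of an endpoint Marathon's HTTP health check can hit.
import express from 'express';

const app = express();

app.get('/health', (_req, res) => {
  // Return 200 only while the app considers itself healthy; when the check
  // fails repeatedly, Marathon kills and re-spawns the task.
  res.status(200).send('OK');
});

app.listen(Number(process.env.PORT ?? 3000));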
To get some redundancy, I'd strongly advise that you run at least two instances of the application and use Mesos DNS plus a load balancer like marathon-lb on the public slave node(s), which will take care of the routing. This also allows you to use bridged networking if you want to.
