Shut down local client of Hazelcast executor service - executorservice

We are using a Hazelcast executor service to distribute tasks across our cluster of servers.
We want to shut down one of our servers and take it out of the cluster, but allow it to keep working for a period to finish what it is doing while not accepting any new tasks from the Hazelcast executor service.
I don't want to shut down the Hazelcast instance, because the current tasks may need it to complete their work.
Shutting down the Hazelcast executor service is not what I want either: that shuts down the executor cluster-wide.
I would like to keep processing the tasks in the local queue until it is empty and then shut down.
Is there a way to let a node in the cluster continue to use Hazelcast but tell it to stop accepting new tasks from the executor service?

Not that easily. However, members have attributes (Member::setX/::getX), so you could set an attribute to signal "no new tasks, please"; when you submit a task you then either preselect a member to execute on based on that attribute, or you use the submit overload that takes a MemberSelector.
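A minimal sketch of that approach, assuming Hazelcast 3.x (where member attributes can still be changed at runtime); the attribute name "accepting-tasks", the executor name "default" and the class names are made up for illustration:

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IExecutorService;
import com.hazelcast.core.Member;
import com.hazelcast.core.MemberSelector;
import java.io.Serializable;
import java.util.concurrent.Callable;
import java.util.concurrent.Future;

public class DrainingMemberSketch {

    // Hypothetical attribute name used to mark a member that should get no new tasks.
    static final String ACCEPTING_TASKS = "accepting-tasks";

    // Call this on the node you want to drain. New submissions that go through the
    // selector below will skip this member; tasks it already holds keep running.
    static void stopAcceptingTasks(HazelcastInstance hz) {
        hz.getCluster().getLocalMember().setBooleanAttribute(ACCEPTING_TASKS, false);
    }

    // Call this wherever tasks are submitted: only members that have not opted out
    // are eligible to execute the task.
    static <T> Future<T> submitToAcceptingMember(HazelcastInstance hz, Callable<T> task) {
        MemberSelector stillAccepting = (Member m) ->
                !Boolean.FALSE.equals(m.getBooleanAttribute(ACCEPTING_TASKS));
        IExecutorService executor = hz.getExecutorService("default");
        return executor.submit(task, stillAccepting);
    }

    // Trivial serializable task so the example is self-contained.
    static class Ping implements Callable<String>, Serializable {
        @Override
        public String call() {
            return "done";
        }
    }

    public static void main(String[] args) throws Exception {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        System.out.println(submitToAcceptingMember(hz, new Ping()).get());
        hz.shutdown();
    }
}

The draining node sets the flag, keeps its Hazelcast instance alive until its local queue is empty, and only then calls shutdown; submitters that go through the selector simply never pick it again.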

Related

Send task with celery signature using node name instead of queue?

Currently I am using the queue to specify the worker for the Celery task as follows:
celery.signature("cropper.run", args=[str("/ngpointdata/NG94_118.ngt"), int(1), str("NGdata_V4_ML_Test"), True], priority="3", queue="crop").delay()
But due to the needs of the pipeline I am working on, I have multiple workers with the same queue name, so I wanted to know if it is possible to send the task to a specific worker that has the same queue name as the others but a different node name?

The specific role of worker in DolphinDB?

What do worker, secondaryWorker, web worker, infra worker, dynamic worker, and local executor stand for, respectively, in DolphinDB? Why were secondaryWorker and dynamic worker introduced, and what are they used for?
worker: the thread for regular interactive jobs. It divides clients' requests into subtasks once they are received. Depending on the task granularity, the worker either executes the tasks itself or allocates them to a local or remote executor. The number of workers can be set with the configuration parameter workerNum; the default value is determined by the number of CPU cores.
secondary worker: the thread for secondary jobs. It is used to avoid job loops and to resolve deadlocks caused by circular dependencies between tasks. The upper limit can be set with the configuration parameter secondaryWorkerNum; the default value is workerNum.
web worker: the thread that processes HTTP requests. DolphinDB provides a web interface for cluster management, allowing users to interact with DolphinDB nodes. The upper limit can be set with the configuration parameter webWorkerNum; the default value is 1.
infra worker: the thread that reports heartbeats within clusters. It solves the problem of heartbeats not reaching the master in time when a cluster is under heavy load.
dynamic worker: a dynamic working thread that supplements the workers. If a new task arrives when all worker threads are occupied, the system creates a dynamic worker thread to perform it. The upper limit can be set with the configuration parameter maxDynamicWorker; the default value is workerNum. A dynamic worker is recycled by the system after being idle for 60 seconds to release memory resources.
local executor: the local thread that executes subtasks allocated by a worker. Each local executor can only execute one task at a time. All worker threads share the local executors. The number of local executors can be set with the configuration parameter localExecutors; the default value is the number of CPU cores minus 1. The numbers of workers and local executors directly determine the system's performance for concurrent computing.
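For reference, a minimal sketch of how these limits might look in a node's configuration file; the parameter names are the ones mentioned above, and the values are purely illustrative, assuming an 8-core machine:

workerNum=8
secondaryWorkerNum=8
webWorkerNum=1
maxDynamicWorker=8
localExecutors=7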

What does [Max tasks per child setting] exactly mean in Celery?

The doc says:
With this option you can configure the maximum number of tasks a worker can execute before it’s replaced by a new process.
Under what conditions will a worker be replaced by a new process? Does this setting mean that a worker, even with multiple processes, can only process one task at a time?
It means that when Celery has executed more tasks than the limit on one worker (the "worker" is a process if you use the default process pool), it will restart that worker automatically.
Say you use Celery for database manipulation and forget to close the database connection; the auto-restart mechanism will help you close all pending connections.
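For example, the limit is typically given when starting the worker; here proj is a placeholder application name and 100 an arbitrary limit:

celery -A proj worker --max-tasks-per-child=100

After a child process has executed 100 tasks, the pool replaces it with a fresh process.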

How do I get a list of worker/process IDs inside a strongloop cluster?

It looks like each process in a Strongloop cluster is considered a worker, and therefore if you use a tool like node-scheduler that schedules jobs and you have multiple workers, the job is executed multiple times.
Ideally I'd be able to do something like:
var cluster = require('cluster');
if (cluster.isMaster) {
  // execute code
}
Since this doesn't seem to be possible, I wonder if there is a way to get a list of all worker or process IDs from inside the node app so that I can do this same sort of thing with one worker? This will need to be something dynamic, as cluster.worker.id does not appear to be a reliable way to do this since the worker IDs are unpredictable.
Ideas?
"strongloop cluster" isn't a thing, its a node cluster: https://nodejs.org/dist/latest-v6.x/docs/api/cluster.html
No such API exists, and it wouldn't help you, you'd need to implement some kind of consensus algorithm to choose one of a (dynamic, workers can die and get replaced/restarted) set of workers as the "singleton"
Compose your system as microservices, if you need a singleton task runner, make it a service, and run it with a cluster size of 1.
This isn't really a cluster problem, its an inability to scale problem, isn't it? Cluster does internal scaling, you can limit to one for the scheduling service.... but when you will scale across multiple VMs (multiple Heroku dyno's, multiple docker containers, etc.) this will still fall apart... which will be the source of the timed node-schedule jobs?

gearman and retrying workers with unreliable external dependencies

I'm using gearman to queue a variety of different jobs, some which can always be serviced immediately, and some which can "fail", because they require an unreliable external service. (For example, sending email might require an SMTP server that's frequently unavailable.)
If an external service goes down, I'd like to keep all jobs which require that service on the queue, and retry one job occasionally (every few minutes, say) until the service becomes available again. (Perhaps optionally sending email if the service has not been available for hours.)
However I'd like jobs that don't require a failed service to be passed on to workers as soon as possible. How can this be achieved? (I'm happy to put some of the logic in the workers if necessary, although it seems to be a bit "late" to throttle on the worker side.)
Gearman should already handle this, as long as you have some workers which specialise in handling jobs with unreliable dependencies and don't handle other jobs, along with some workers that either do all jobs or just the jobs without unreliable dependencies.
All you would need to do is add some code to the unreliable-dependency workers so that they only accept jobs once they have checked that the dependent service is running. If the service is down, just have them wait a bit and retest the service (and continue ad infinitum); once the service is up, have them join the gearmand server, do a job, return the work, retest the service, and so on.
While the dependent service is down, the workers that don't handle jobs that need the service will keep on trundling through the job queue for the other jobs. Gearmand won't block an entire job queue (or worker) on one job type if there are workers available to handle other job types.
The key is to be sensible about how you define your job types and workers.
EDIT--
Ah-ha, I knew my thinking was a little out (I wrote my Gearman system about a year ago and haven't really touched it since). My solution to this type of issue was to have all the workers that normally handle the dependent job unregister their dependent-job handling capability with the gearmand server once a failure was detected in the dependent service (and any workers that are currently trying to complete that job should return a failure). Once the service is back up, get those same workers to re-register their ability to handle that job. Do note that this requires another channel of communication for the workers to be notified of the status of the dependent services.
Hope this helps
