Ramifications of using timeout on cluster shutdown?

I'm using the Java DataStax driver. I have a ServletContextListener that closes the DataStax Cluster object on context destruction by calling Cluster.shutdown(). The problem is that shutdown() takes several minutes to return.
Cluster.shutdown() has an overload where you can specify a timeout value. I can't seem to find any documentation on the ramifications of not waiting for the full shutdown, and when I specify a timeout of one millisecond, the cluster shuts down more or less instantly (as expected).
So, my question is: if I'm only shutting down the cluster when the servlet is shutting down anyway, is there a reason I should wait for the return? It seems that by specifying the timeout, I'm essentially calling an asynchronous shutdown, which should be OK, but I don't want to introduce a memory leak or any instability.
I'm pretty new to Cassandra/DataStax, so if information about using the timeout is spelled out somewhere, pointing me in that direction would be great!

If you do specify a short timeout, the method will initiate the shutdown but only wait for its completion for as long as asked. So yes, a short timeout won't interfere with the shutdown per se, which will continue asynchronously. If you don't care about knowing when the shutdown is complete (i.e. when exactly all resources have been properly closed), then there is no particular downside to using a timeout (and you can even use 0 for the timeout to make that intention clear).
I'll note that version 2.x of the driver changes the shutdown API slightly, making it asynchronous by default but returning a future for the shutdown completion, which hopefully makes it clearer what happens.

Related

Do I really need to call client.shutdown() when finished with Cassandra in Node.js script?

I've been trying to find information about Cassandra sessions relating to the Node.js cassandra-driver by DataStax. I read something which said that cassandra-driver automatically manages a session and that I don't need to call client.shutdown().
I'm looking for general information about how cassandra-driver manages sessions, how I can see all active Cassandra sessions, and whether I need to call shutdown() or whether that is counterproductive, given that a session would have to be reopened every time the script is run.
Based on "pm2 info" I don't see a ton of active handles, so I don't think anything wrong is going on, but I may be mistaken. RAM usage does seem a bit high for a small script (85 MB).
In the DataStax drivers, Session is a stateful object handling a pool of connections and aware of the status of the nodes in the Cluster at any time (avoiding sending requests to unavailable nodes). TCP sockets are opened, and it is a best practice to close them when you don't need them anymore. See here for more info: https://docs.datastax.com/en/developer/nodejs-driver-dse/2.1/features/connection-pooling/
Now, session.connect() may take a bit of time: the more nodes you have in your cluster, the longer it takes to open connections to every single one. This is why, when you work with FaaS, it is better to initialize connections during a cold start (avoiding an open/close for each request).
So:
Always close your connections (shutdown()) when you don't need them anymore (a shutdown hook in your application, as sketched below).
Keep your connections alive as long as you need them; do not shut down for each request, as this is NOT stateless.
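
A minimal sketch of that shutdown-hook pattern with the Node.js cassandra-driver (the contact point and data-center name are placeholders, not part of the original answer):

const cassandra = require('cassandra-driver');

// One long-lived client for the whole process; connect once at startup.
const client = new cassandra.Client({
  contactPoints: ['127.0.0.1'],   // placeholder
  localDataCenter: 'datacenter1', // placeholder
});

client.connect().catch((err) => {
  console.error('Could not connect', err);
  process.exit(1);
});

// Shutdown hook: drain the pool once, when the process is told to stop,
// not after every request.
process.on('SIGTERM', () => {
  client.shutdown().then(() => process.exit(0));
});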
Yes, it is "better" to connect the client outside of the handler function, to keep it stateful.
However, with AWS Lambda and Node.js, by default function execution continues until the event loop is empty or the function times out.
Create the client outside of the handler, set context.callbackWaitsForEmptyEventLoop = false, and don't call client.shutdown().
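
A minimal sketch of that Lambda setup (the handler shape and query are illustrative only):

const cassandra = require('cassandra-driver');

// Created once per container, outside the handler, so warm invocations
// reuse the same connection pool.
const client = new cassandra.Client({
  contactPoints: ['127.0.0.1'],   // placeholder
  localDataCenter: 'datacenter1', // placeholder
});

exports.handler = async (event, context) => {
  // Return as soon as the result is ready, even though the driver's
  // open sockets keep the event loop non-empty.
  context.callbackWaitsForEmptyEventLoop = false;
  const result = await client.execute('SELECT release_version FROM system.local');
  return result.rows[0];
  // Deliberately no client.shutdown() here.
};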

Is there a timeout on RoleEnvironment.Stopping on Azure Worker Roles?

We have some long-running tasks on our roles and need to be sure to stop them in a controlled way. Initially we tried to use the OnStop method, but MSDN says:
Important
Code running in the OnStop method has a limited time to finish when it is called for reasons other than a user-initiated shutdown. After this time elapses, the process is terminated, so you must make sure that code in the OnStop method can run quickly or tolerates not running to completion. The OnStop method is called after the Stopping event is raised.
The timeout seems to be around 30 seconds, and the overall shutdown procedure should take no more than 5 minutes.
Does this limitation also apply to the Stopping event? I can't find a clear and direct answer anywhere.

How to test master behaviour in a Node.JS cluster?

Suppose you are running a cluster in Node.js and you wish to unit-test it. For instance, you'd like to make sure that if a worker dies, the cluster takes some action, such as forking another worker and possibly performing some related job. Or that, under certain conditions, additional workers are spawned.
I suppose that in order to do this one must launch the cluster and somehow have access to its internal state; then (for instance) force workers to get stuck, and check the state after a delay. If so, how does one export the state?
You'll have to architect your master to return a reference to its cluster object. In your tests, you can kill one of its workers with cluster.workers[2].kill(). The worker object also has a reference to the child's process object, which you can use to simulate various conditions. You may have to use a setTimeout to ensure the master has the time to do its thing.
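
A sketch of that first approach, assuming a hypothetical ./master module whose start() forks the workers and returns Node's cluster object (real workers are forked here):

const assert = require('node:assert');
const { start } = require('./master'); // hypothetical module under test

const cluster = start({ workers: 2 });
const firstId = Object.keys(cluster.workers)[0];
cluster.workers[firstId].kill();

// Give the master a moment to react before inspecting its state.
setTimeout(() => {
  assert.strictEqual(Object.keys(cluster.workers).length, 2,
    'master should have forked a replacement worker');
  process.exit(0);
}, 500);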
The above method, however, still creates forks, which may be undesirable in a testing scenario. Your other option is to use a mocking library (SinonJS et al.) to mock out cluster's fork method, and then spy on the number of calls it gets. You can simulate worker death by calling cluster.emit('exit') on the master's cluster object.
Note: I'm not sure if this is an issue only with me, but cluster.emit always seems to emit twice for me, for some reason.
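
A sketch of the mocked approach (sinon.stub and cluster.emit are real APIs; ./master and its start() are the same hypothetical module as above):

const assert = require('node:assert');
const cluster = require('node:cluster');
const sinon = require('sinon');

// No real processes: fork() now returns a dummy worker object.
const forkStub = sinon.stub(cluster, 'fork').returns({});

require('./master').start({ workers: 2 }); // hypothetical
assert.strictEqual(forkStub.callCount, 2);

// Simulate a worker death; the master's 'exit' handler should refork.
cluster.emit('exit', { id: 1 }, 1, null);
assert.strictEqual(forkStub.callCount, 3);

forkStub.restore();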

Error conditions and retries in gearman?

Can someone guide me on how Gearman does retries when exceptions are thrown or when errors occur?
I use the Python gearman client in a Django app and my workers are initiated as a Django command. I read from this blog post that retries from error conditions are not straightforward and that they require sys.exit from the worker side.
Has this been fixed to retry, perhaps with sendFail or sendException?
Also, does Gearman support retries with an exponential backoff algorithm – say, if an SMTP failure happens, it retries after 2, 4, 8, 16 seconds, etc.?
To my understanding, Gearman employs a very "it's not my business" approach - e.g., it does not intervene in the jobs performed unless workers crash. Any success / failure messages are supposed to be handled by the client, not by the Gearman server itself.
In foreground jobs, this implies that all sendFail() / sendException() and other send*() calls are directed to the client, and it's up to the client to decide whether to retry the job or not. This makes sense, as sometimes you might not need to retry.
In background jobs, all the send*() functions lose their meaning, as there is no client that would be listening to the callbacks. As a result, the messages sent will just be ignored by Gearman. The only condition under which the job will be retried is when the worker crashes (which can be emulated with an exit(XX) command, where XX is a non-zero value). This, of course, is not something you want to do, because workers are usually supposed to be long-running processes, not ones that have to be restarted after each unsuccessful job.
Personally, I have solved this problem by extending the default GearmanJob class, intercepting the calls to the send*() functions and implementing the retry mechanism myself. Essentially, I pass all the retry-related data (maximum number of retries, times already retried) together with the workload and then handle everything myself. It is a bit cumbersome, but I understand why Gearman works this way - it just allows you to handle all the application logic.
Finally, regarding the ability to retry jobs with an exponential timeout (or any timeout for that matter): Gearman has a feature to add delayed jobs (look for SUBMIT_JOB_EPOCH in the protocol documentation), yet I am not sure about its status - the PHP extension and, I think, the Python module do not support it, and the docs say it may be removed in the future. But I understand it works at the moment - you just need to submit raw socket requests to Gearman to make it happen (and the exponential part should be implemented on your side, too).
However, this blog post argues that the SUBMIT_JOB_EPOCH implementation does not scale well. The author uses node.js and setTimeout() to make it work; I've seen others use the Unix utility at to do the same. Either way, Gearman will not do it for you. It will focus on reliability, but will let you focus on all the logic.
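
As a rough sketch of that client-side approach in Node.js (submitJob here stands in for whichever Gearman client call you use - it is hypothetical, as is the payload shape; the backoff schedule is the 2/4/8/16-second one from the question):

// Resubmit a failed background job with exponential backoff, carrying
// the retry metadata inside the workload itself.
function submitWithRetry(submitJob, name, payload, attempt = 0, maxRetries = 4) {
  const workload = JSON.stringify({ payload, attempt, maxRetries });
  submitJob(name, workload, (err) => {
    if (err && attempt < maxRetries) {
      const delayMs = 1000 * 2 ** (attempt + 1); // 2s, 4s, 8s, 16s
      setTimeout(() => {
        submitWithRetry(submitJob, name, payload, attempt + 1, maxRetries);
      }, delayMs);
    }
  });
}

// Usage with a stand-in submitJob that always fails:
submitWithRetry(
  (name, workload, cb) => cb(new Error('SMTP failure')),
  'send_email',
  { to: 'someone@example.com' }
);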

Perl CGI threads

I am having a bit of a problem with my CGI web application: I use ithreads to do some parallel processing, where all the threads have a common "goal". Thus I detach all of them, and once I find my answer, I call exit.
However, the problem is that the script will actually continue processing even after the user has closed the connection and left, which of course is a problem resource-wise.
Is there any way to force exit on the parent process if the user has disconnected?
If you're running under Apache and the client closes the connection prematurely, Apache sends a SIGTERM to the CGI process. In my simple testing, that kills the script and threads as default behavior.
However, if there is a proxy between the server and the client, it's possible that Apache will not be able to detect the closed connection (as the connection from the server to the proxy may remain open) - in that case, you're out of luck.
AFAIK, creating and destroying threads isn't (at least for now) good Perl practice, because it will constantly increase memory usage!
You should think of some other way to get the job done. Usually the solution is to create a pool of threads and send them arguments with the help of a shared array or Thread::Queue.
Personally, I would suggest changing your approach: when creating these workers for a client connection, save and associate the PID of each one with that connection. I personally like to use daemons instead of threads, i.e. Proc::Daemon. When the client disconnects prematurely (before the workers finish), send SIGTERM to each process ID associated with that client.
To exit gracefully, override the termination handler in the worker process with a stop condition, so something like:
$SIG{TERM} = sub { $continue = 0; };
where $continue would be the condition of the worker's processing loop. You would still have to watch out for code errors, because even though you can try overriding $SIG{__DIE__}, die() usually doesn't respect that and dies instantly, without grace ;) (at least from my experience).
I'm not sure how you go about detecting if the user has disconnected, but, if they have, you'll have to make the threads stop yourself, since they're obviously not being killed automatically.
Destroying threads is a dangerous operation, so there isn't a good way to do it.
The standard way, as far as I know, is to have a shared variable that the threads check periodically to determine if they should keep working. Set it to some value before you exit, and check for that value inside your threads.
You can also send a signal to the threads to kill them. The docs know more about this than I do.
