Specifically, if CFEngine is used to install the most recent version of an onboard device's firmware and run some tests to see whether a reboot is required, and the results indicate that the machine needs a restart, is this something that can be done from within CFEngine, or is it a practice that should be avoided, and if so, why? My experience with Puppet tells me that stopping a run to reboot could be a Very Bad Thing in certain cases, so I'm wondering whether the same limitations apply to CFEngine as well.
Stopping a CFEngine run is not that bad; it's designed to be convergent and modifications are always atomic. If it stops, the next runs will behave correctly.
However, writing promises that restart a device can lead to bad surprises (for example, a flaw in the promise's logic that results in never-ending restarts), so I suggest avoiding it if possible. If it is necessary (say, when handling thousands of devices), it should be thoroughly tested.
Like Nicolas said, there is no harm in stopping a CFEngine run. A CFEngine policy will continue converging the next time it runs. If you want to ensure that everything is properly finished before the reboot, you could just set a class that indicates that a reboot is needed, and do the actual reboot in a separate bundle that is called near the end of your bundlesequence (I'm assuming CFEngine 3).
And indeed, be VERY mindful and test VERY carefully the conditions under which the reboot will take place!
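Regardless of the configuration tool, one way to defend against that never-ending-restart failure mode is to gate the reboot behind a small guard program that the reboot promise invokes instead of calling shutdown directly. A hypothetical sketch in C; the state-file path, the limits, and the shutdown command are all made-up examples, and CFEngine would run the compiled binary from a commands promise:

/* reboot_guard.c - hypothetical helper invoked by the reboot promise.
   Refuses to reboot when too many reboots have already happened within
   a time window, so a flawed promise cannot reboot the device forever. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define STATE_FILE  "/var/lib/reboot_guard.stamps"  /* made-up path */
#define MAX_REBOOTS 3                               /* at most 3 reboots... */
#define WINDOW_SECS (24 * 60 * 60)                  /* ...per 24 hours */

int main(void)
{
    time_t now = time(NULL), stamps[MAX_REBOOTS];
    int count = 0;
    long t;

    /* Keep only the timestamps of reboots inside the window. */
    FILE *f = fopen(STATE_FILE, "r");
    if (f) {
        while (count < MAX_REBOOTS && fscanf(f, "%ld", &t) == 1)
            if (now - (time_t)t < WINDOW_SECS)
                stamps[count++] = (time_t)t;
        fclose(f);
    }

    if (count >= MAX_REBOOTS) {
        fprintf(stderr, "reboot_guard: too many recent reboots, refusing\n");
        return 1;  /* non-zero exit: the promise is flagged as not kept */
    }

    /* Record this attempt, then hand off to the real reboot command. */
    f = fopen(STATE_FILE, "w");
    if (f) {
        stamps[count++] = now;
        for (int i = 0; i < count; i++)
            fprintf(f, "%ld\n", (long)stamps[i]);
        fclose(f);
    }
    return system("/sbin/shutdown -r +1");
}

Returning non-zero when the limit is hit makes the failure show up in CFEngine's reports instead of in an endless reboot loop.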
Googling for it turns up many “how to persist data in a node app” results, but I'm looking for a way to store the program counter, memory state, event loop, call stack, etc. in persistent storage, and resume it later.
Benefits: if you see that the runtime (a server, container, serverless function) is about to terminate, instead of using business logic to pause and resume (custom work), handle it the same way operating systems handle multiple processes/threads: store everything, then resume it later from different infrastructure (but with identical specs).
I'm sure something like this exists; I probably just can't find the right search term.
PS: this might be an OS feature I'm looking for rather than something Node-specific, but if it can be done from within Node's API (e.g. V8 internals), I can basically get an unlimited / long-running lambda ;) (which is a bad idea, but I want to know if it's possible).
(V8 developer here.)
V8 definitely doesn't support this.
What V8 does support is taking a heap snapshot, and deserializing that on renewed process startup (and I believe Node is making use of this functionality). That's quite different from freezing an entire running process though.
I'm not sure what you mean by "the same way operating systems handle multiple processes / threads". Operating systems don't usually let you snapshot a process and transfer it to a different machine.
On the same machine, you could literally just let the OS do it: pause the process (e.g. press Ctrl+Z if you started it at a Linux command line, or use equivalent Task Manager functionality if your OS provides it, or similar), and resume it later. If the process itself doesn't fire any repeated tasks/timers, then that's almost equivalent to simply doing nothing: a process that executes no work won't get scheduled by the kernel anyway; a server that isn't serving any requests can just sit around waiting.
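For completeness, the pause/resume that Ctrl+Z gives you interactively can also be done programmatically with signals; a minimal C sketch, assuming you know the PID of the target process:

/* pause_resume.c - suspend and later resume another process, the
   programmatic equivalent of Ctrl+Z / fg at a shell.
   Usage: ./pause_resume <pid> stop|cont */
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <pid> stop|cont\n", argv[0]);
        return 1;
    }
    pid_t pid = (pid_t)atoi(argv[1]);

    /* SIGSTOP suspends the target unconditionally; SIGCONT resumes it.
       The kernel simply never schedules a stopped process. */
    int sig = strcmp(argv[2], "stop") == 0 ? SIGSTOP : SIGCONT;
    if (kill(pid, sig) != 0) {
        perror("kill");
        return 1;
    }
    return 0;
}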
If you actually need to transfer a running process to another machine, your best bet may be a VM which you can snapshot, transfer, resume.
I have a routine that crashes Linux and forces a reboot using a system function.
Now I have the problem that I need to crash Linux when a certain process dies. Using a script that starts the process and restarts the server when the script ends is not appropriate, since that takes some milliseconds.
Another idea would be to spawn the "shooting" process alongside the first and have it poll a counter: if the counter stops being incremented, it reboots the server.
This would result in an almost instant reaction.
Now the question is what a good timeframe would be. I have no idea what update rate the Linux scheduler would guarantee for such a counter, or what a good timeout would be.
I would also like to hear some alternatives to spawning this second process. Is there a way to tell Linux to run a certain routine when a given process crashes, or a listener mechanism for the event of problems with a given process?
The timeout idea is already implemented in the kernel. You can register any application as a software watchdog, but you'll have to lower the default timeout. Have a look at http://linux.die.net/man/8/watchdog for some ideas. That application can also handle user-defined tests. Realistically, unless you're running a real-time kernel like linux-rt, timeouts lower than 100 ms can be dangerous on a system under heavy load, especially if the check needs to poll another application.
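To make the software-watchdog idea concrete: an application registers itself by opening /dev/watchdog and then writes to it periodically; if it ever stops, the kernel reboots the machine once the configured timeout expires. A minimal C sketch (error handling trimmed, and the 10-second timeout is just an example):

/* pet_watchdog.c - register with the kernel software watchdog.
   If this process stops petting the device (because the health check
   failed, or the process itself died or hung), the kernel reboots the
   machine when the timeout expires.
   Note: closing the fd without first writing the magic 'V' character
   leaves the watchdog armed, so a crash of this process also reboots. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/watchdog.h>

int main(void)
{
    int fd = open("/dev/watchdog", O_WRONLY);
    if (fd < 0) {
        perror("open /dev/watchdog");
        return 1;
    }

    int timeout = 10;                       /* seconds; example value */
    ioctl(fd, WDIOC_SETTIMEOUT, &timeout);  /* the driver may round it */

    for (;;) {
        /* Run your health check here (e.g. poll the monitored
           application); on failure, simply stop petting the device. */
        (void)write(fd, "\0", 1);           /* pet the watchdog */
        sleep((unsigned)timeout / 2);
    }
}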
In the case of application crashes, you can handle them if your init supports notifications. For example, both upstart and systemd can do that by monitoring files (make sure coredumps are created in the right place).
But in any case, I'd suggest rethinking the idea of millisecond-resolution restarts. Do you really need to kill the system in that time, or do you just need to isolate it? Just syncing the disks will take a few extra milliseconds, and you probably don't want to skip that step. Instead of killing the host outright, you could make sure the affected app is stopped (SIGABRT?) and cut all networking (flush iptables, change the default policy to DROP).
My system includes a task which opens a network socket, receives pushed data from the network, processes it, and writes it out to disk or pings other machines depending on the messages. This task is intended to run forever, and the service is designed to have this task always running. But sometimes it crashes.
What's the best practice for keeping a task like this alive? Assume it's okay for the task to be dead for up to 30 seconds before we restart it.
Some obvious ideas include having a watchdog process that checks to make sure the process is still running. The watchdog could be triggered by cron. But how does it know whether the process is alive or not? Write a pidfile? Touch a heartbeat file? An ideal solution wouldn't continuously spin up more processes if the machine gets bogged down to the point where the watchdog runs faster than the heartbeat.
Are there standard linux tools for this? I can imagine a solution that uses a message queue, but I'm not sure if that's a good idea or not.
Depending on the nature of the task that you wish to monitor, one method is to write a simple wrapper that starts your task in a child process via fork().
The wrapper task can then do a waitpid() on the child and restart it if it is terminated.
This does depend on modifying the source for the task that you wish to run.
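If the task is a separate binary, the wrapper can instead fork() and exec() it, which avoids touching the task's own source. A minimal C sketch; the binary path is a placeholder:

/* respawn.c - start the task in a child via fork()/exec(), wait for it
   with waitpid(), and restart it whenever it terminates. */
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define TASK "/usr/local/bin/network-task"   /* placeholder binary */

int main(void)
{
    for (;;) {
        pid_t pid = fork();
        if (pid < 0) {
            perror("fork");
            return 1;
        }
        if (pid == 0) {                      /* child: become the task */
            execl(TASK, TASK, (char *)NULL);
            perror("execl");                 /* only reached on failure */
            _exit(127);
        }

        int status;                          /* parent: wait for the child */
        if (waitpid(pid, &status, 0) == pid)
            fprintf(stderr, "task exited (status %d), restarting\n",
                    WIFEXITED(status) ? WEXITSTATUS(status) : -1);

        sleep(1);  /* back off a little so a crash loop can't spin the CPU */
    }
}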
sysvinit will restart processes that die, if added to inittab.
If you're worried about the process freezing without crashing and ending the process, you can use a heartbeat and hard kill the active instance, letting init restart it.
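A minimal C sketch of that heartbeat check, suitable for running from cron; the paths and the 30-second staleness threshold are hypothetical, and the monitored task is assumed to touch the heartbeat file regularly:

/* heartbeat_check.c - cron-driven check for a frozen (not crashed)
   task: if the heartbeat file has not been touched recently, hard-kill
   the instance and let init respawn it. */
#include <signal.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <time.h>

#define HEARTBEAT  "/var/run/mytask.heartbeat"  /* task touches this */
#define PIDFILE    "/var/run/mytask.pid"
#define STALE_SECS 30                           /* allowed silence */

int main(void)
{
    struct stat st;
    if (stat(HEARTBEAT, &st) != 0)
        return 1;                               /* no heartbeat yet */

    if (time(NULL) - st.st_mtime <= STALE_SECS)
        return 0;                               /* fresh: all good */

    /* The heartbeat is stale: the process is alive but stuck. Kill it
       hard; init (a respawn entry in inittab) starts a new instance. */
    long pid = 0;
    FILE *f = fopen(PIDFILE, "r");
    if (f && fscanf(f, "%ld", &pid) == 1 && pid > 0)
        kill((pid_t)pid, SIGKILL);
    if (f)
        fclose(f);
    return 0;
}

Because the check exits immediately when the heartbeat is fresh, cron can run it frequently without piling up processes on a bogged-down machine.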
You could use monit along with daemonize. There are lots of tools for this in the *nix world.
Supervisor was designed precisely for this task. From the project website:
Supervisor is a client/server system that allows its users to monitor and control a number of processes on UNIX-like operating systems.
It runs as a daemon (supervisord) controlled by a command line tool, supervisorctl. The configuration file contains a list of programs it is supposed to monitor, among other settings.
The number of options is quite extensive; have a look at the docs for a complete list. In your case, the relevant configuration section might be something like this:
[program:my-network-task]
; the command to run (where your binary lives)
command=/bin/my-network-task
; start this program when supervisord starts
autostart=true
; restart the program automatically whenever it exits
autorestart=true
; the program must stay up this many seconds for the start to count as successful
startsecs=10
; give up after this many consecutive failed start attempts
startretries=3
I have used Supervisor myself and it worked really well once everything was set up. It requires Python, which should not be a big deal in most environments but might be.
We need to think big and our applications need to scale in order to work on the Windows Azure Platform. But how do I simulate a crash of one of the VMs running my application?
I want to see (debug) how my application behaves in such an environment.
Simulating faults is simple (just call Thread.Abort()); but it won't tell you much about your design.
In particular, debugging is a bit irrelevant, because when a VM stops working there is nothing left to observe (and nothing left to debug, either). You should just assume that your app is likely to be abruptly stopped at any point of its execution.
Since you cannot realistically observe all the subtle data corruption that could be caused by interrupted executions, you should design your persistence layer to be resilient to such problems from the start (idempotent processes help a lot when possible).
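One concrete building block for such a resilient design is making every state write atomic: write to a temporary file, then rename() it over the real one, so an interrupted execution leaves either the old state or the new state, never a torn file. The idea carries over to any platform; here it is as a minimal POSIX C sketch with placeholder filenames:

/* atomic_save.c - crash-resilient persistence via write-then-rename.
   If the process is killed at any point, readers see either the old
   file or the new one, never a half-written mixture. */
#include <stdio.h>

int save_state(const char *path, const char *data)
{
    char tmp[1024];
    snprintf(tmp, sizeof tmp, "%s.tmp", path);

    FILE *f = fopen(tmp, "w");
    if (!f)
        return -1;
    if (fputs(data, f) == EOF || fflush(f) == EOF) {
        fclose(f);
        remove(tmp);
        return -1;
    }
    /* an fsync on the file descriptor would go here for durability
       against power loss, not just process death */
    fclose(f);

    /* rename() is atomic on POSIX filesystems: the state file is
       replaced in one step, which also makes retries idempotent. */
    return rename(tmp, path);
}

int main(void)
{
    return save_state("app_state.txt", "checkpoint 42\n") ? 1 : 0;
}

Because re-running save_state with the same data produces the same result, a retry after an abrupt stop does no harm.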
My VPS account has been occasionally running out of memory. It's using Apache on Linux. Support says it's a slow memory leak and has enabled MaxRequestsPerChild to deal with it.
I have a few questions about this. When a child process dies, will it cause my scripts to lose session data? Does anyone have advice on how I can track down this memory leak?
Thanks
No, when a child process dies you will not lose any data unless it was in the middle of a request at the time (which should not happen if it exits due to MaxRequestsPerChild).
You should try to reproduce the memory leak using an identical software stack on your test system. You can use tools such as Valgrind to try to detect it.
You can also try a debug build of your web server and its modules, which will enable you to detect what's going on.
It's difficult to reproduce the behaviour of production systems in non-production ones. If you have automated test coverage of your web application, you could try running the full test suite, but in practice it is unlikely to cover every code path and may therefore miss the leaky one.
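To make the Valgrind suggestion concrete, here is a deliberately leaky C snippet and the invocation that flags it; the leak-check output points at the allocation stack, which is what you need to track a leak down:

/* leak_demo.c - compile with: cc -g leak_demo.c -o leak_demo
   then run:     valgrind --leak-check=full ./leak_demo
   Valgrind reports the 100 bytes as "definitely lost" and shows
   where they were allocated. */
#include <stdlib.h>
#include <string.h>

int main(void)
{
    char *buf = malloc(100);       /* allocated... */
    strcpy(buf, "request state");  /* ...used... */
    return 0;                      /* ...but never freed: a leak */
}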
When a child process dies, will it cause my scripts to lose session data?
Without knowing what scripting language and session handler you are using (and the actual code), it is rather hard to say.
In most cases, with scripting languages running as modules or via [Fast]CGI, it's very unlikely that the session data would actually be lost, although if the process dies in the middle of processing a request, it may not get the chance to write the updated session back to whatever is storing it. And in the very unlikely event that it dies during the write-back, it may corrupt the session data. These are quite exceptional circumstances.
OTOH, if your application logic is implemented via a daemon (e.g. a Java container), then it's quite probable that memory leaks could accumulate (although they would be reported against a different process).
Note that if the problem is alleviated by setting MaxRequestsPerChild then it implies that the problem is occurring in an Apache module.
The production releases of Apache itself are, in my experience, very stable and free of memory leaks. However, I've not used all the modules. I'm not sure whether ExtendedStatus gives a breakdown of memory usage by module; it might be worth checking.
I've previously seen problems with the memory management of modules loaded by the PHP module not respecting PHP's memory limits; these did clear down at the end of the request, though.
C.