I have a service on a Red Hat 7.1 server which I control with systemctl start, stop, restart, and status. At one point systemctl status reported the service as active, but the application "behind" the service was responding with an HTTP code other than 200.
I know I could use Monit or Nagios to check this and run systemctl restart - but I would like to know whether systemd offers something for this by default, so that I don't need any other tools installed.
My preferred solution would be to have my service restarted fully automatically whenever the HTTP return code differs from 200, using nothing but systemd itself (and perhaps with a way to notify a HipChat room or send an email...).
I've tried googling the topic - without luck. Please help :-)
The Short Answer
systemd has a native (socket-based) healthcheck method, but it's not HTTP-based. You can write a shim that polls status over HTTP and forwards it to the native mechanism, however.
The Long Answer
The Right Thing in the systemd world is to use the sd_notify socket mechanism to inform the init system when your application is fully available. Use Type=notify for your service to enable this functionality.
You can write to this socket directly using the sd_notify() call, or you can inspect the NOTIFY_SOCKET environment variable to get the name and have your own code write READY=1 to that socket when the application is returning 200s.
If you want to put this off to a separate process that polls your process over HTTP and then writes to the socket, you can do that -- ensure that NotifyAccess is set appropriately (by default, only the main process of the service is allowed to write to the socket).
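A minimal unit for that setup might look like this (the unit name and paths are illustrative, not from the question):

```ini
# /etc/systemd/system/myapp.service -- hypothetical unit; adjust names and paths
[Service]
Type=notify
# Allow helper processes, not just the main PID, to write READY=1 to $NOTIFY_SOCKET
NotifyAccess=all
ExecStart=/usr/local/bin/myapp
```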
Inasmuch as you're interested in detecting cases where the application fails after it was fully initialized, and triggering a restart, the sd_notify socket is appropriate in this scenario as well:
Send WATCHDOG_USEC=... to set the amount of time permitted between successful tests, then send WATCHDOG=1 whenever a self-test succeeds; if no successful test is seen within the configured period, your service will be restarted.
Related
I'm currently researching and experimenting with Kubernetes in Azure. I'm playing with AKS and the Application Gateway ingress. As I understand it, when a pod is added to a service, the endpoints are updated and the ingress controller continuously polls this information. As new endpoints are added AG is updated. As they're removed AG is also updated.
As pods are added there will be a small delay whilst that pod is added to the AG before it receives requests. However, when pods are removed, does that delay in update result in requests being forwarded to a pod that no longer exists?
If not, how does AG/K8S guarantee this? What behaviour could the end client potentially experience in this scenario?
Azure Application Gateway ingress is an ingress controller for your Kubernetes deployment which allows you to use the native Azure Application Gateway to expose your application to the internet. Its purpose is to route traffic directly to pods; everything else, including pod availability, scheduling, and management in general, remains the responsibility of Kubernetes itself.
When a pod receives a termination command, it doesn't stop instantly. Shortly afterwards, kube-proxy updates iptables to stop directing new traffic to the pod, but ingress controllers or load balancers that forward connections directly to the pod (which is the case with Application Gateway) may lag behind. This window can't be eliminated completely, but adding a 5-10 second delay can significantly improve the user experience.
If you need to terminate or scale down your application, you should consider the following steps:
Wait for a few seconds and then stop accepting connections
Close all keep-alive connections not in the middle of request
Wait for all active requests to finish
Shut down the application completely
Here are exact kubernetes mechanics which will help you to resolve your questions:
preStop hook - this hook is called immediately before a container is terminated, which makes it very helpful for graceful shutdown of an application. For example, a simple "sleep 5" shell command in a preStop hook can prevent users from seeing "connection refused" errors: after the pod receives the API request to terminate, it takes some time to update iptables and let the Application Gateway know that this pod is out of service. Since the preStop hook is executed prior to the SIGTERM signal, it helps bridge this gap.
(example can be found in attach lifecycle event)
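A sketch of such a hook (the container name, image, and sleep length are illustrative):

```yaml
# Pod spec fragment -- names and timings are illustrative
spec:
  containers:
  - name: web
    image: myapp:1.0
    lifecycle:
      preStop:
        exec:
          # Give kube-proxy and the Application Gateway time to drop this
          # endpoint before the container receives SIGTERM
          command: ["sh", "-c", "sleep 5"]
```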
readiness probe - this type of probe runs on the container throughout its lifetime and defines whether the pod is ready to accept and serve requests. When a container's readiness probe succeeds, the container can handle requests and the pod is added to the endpoints; when it fails, the pod cannot handle requests and is removed from the endpoints object. This works well both for newly created pods whose application takes some time to load and for already running pods whose application temporarily cannot process requests.
Before a pod is removed from the endpoints, its readiness probe must fail several times. You can lower this to a single failure with the failureThreshold field, but the probe still needs to detect at least one failed check.
(additional information on how to set it up can be found in configure liveness readiness startup probes)
startup probe - for applications that need extra time on first initialisation, it can be tricky to tune readiness probe parameters correctly without compromising fast responses afterwards.
Using the startup probe's failureThreshold * periodSeconds fields provides this flexibility.
terminationGracePeriodSeconds - also worth considering if an application needs more than the default 30 seconds to shut down gracefully (this is important for stateful applications, for example).
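Putting the probe mechanics together in one spec fragment (paths, ports, and numbers are illustrative, not prescriptive):

```yaml
# Pod spec fragment combining the mechanics above; values are illustrative
spec:
  terminationGracePeriodSeconds: 60     # more than the 30s default
  containers:
  - name: web
    image: myapp:1.0
    readinessProbe:
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 5
      failureThreshold: 1               # drop from endpoints after one failed check
    startupProbe:
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10
      failureThreshold: 30              # allow up to 300s for first initialisation
```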
We have two GoLang microservices (HTTP servers) and one GoLang background program (running an infinite loop).
Within each microservice we have added a diagnostic endpoint (HTTP port) that provides a health check of the service. Our Grafana monitoring talks to this diagnostic endpoint.
For the background program, how do we diagnose its health (up or down)?
You can add a small HTTP server in the background program, which responds to health-check requests.
When you get a request, verify state that the infinite loop keeps updated (the details depend on your own logic).
This way you can inspect the health of your program in Grafana as well, consistent with the microservices.
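One way to sketch that shared state (in Python for brevity; the same pattern translates directly to Go with a mutex-guarded timestamp, and the staleness threshold is an assumption):

```python
import threading
import time


class Heartbeat:
    """Thread-safe record of the background loop's last sign of progress."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._last = time.monotonic()

    def beat(self) -> None:
        """Called by the infinite loop each time it completes a unit of work."""
        with self._lock:
            self._last = time.monotonic()

    def healthy(self, max_age: float = 10.0) -> bool:
        """Called by the health-check HTTP handler: a fresh beat means healthy."""
        with self._lock:
            return time.monotonic() - self._last <= max_age
```

The HTTP handler would return 200 when healthy() is true and 503 otherwise, so Grafana sees the background program the same way it sees the microservices.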
I'm thinking about making a worker script to handle async tasks on my server, using a framework such as ReactPHP, Amp or Swoole that would be running permanently as a service (I haven't made my choice between these frameworks yet, so solutions involving any of these are helpful).
My web endpoints would still be managed by Apache + PHP-FPM as normal, and I want them to be able to send messages to the permanently running script to make it aware that an async job is ready to be processed ASAP.
Pseudo-code from a web endpoint:
$pdo->exec('INSERT INTO Jobs VALUES (...)');
$jobId = $pdo->lastInsertId();
notify_new_job_to_worker($jobId); // how?
How do you typically handle communication from PHP-FPM to the permanently running script in any of these frameworks? Do you set up a TCP / Unix Socket server and implement your own messaging protocol, or are there ready-made solutions to tackle this problem?
Note: In case you're wondering, I'm not planning to use a third-party message queue software, as I want async jobs to be stored as part of the database transaction (either the whole transaction is successful, including committing the pending job, or the whole transaction is discarded). This is my guarantee that no jobs will be lost. If, worst case scenario, the message cannot be sent to the running service, missed jobs may still be retrieved from the database at a later time.
If your worker "runs permanently" as a service, it should provide some API to interact through. I use AmPHP in my project for async services, and my services implement HTTP/Websockets servers (using Amp libraries) as an API transport.
Hey, ReactPHP core team member here. It totally depends on what your ReactPHP/Amp/Swoole process does. Looking at your example, my suggestion would be to use a message broker/queue like RabbitMQ. That way the process can pick a message up when it's ready for it and ack it when it's done. If anything happens to your process in the meantime and it dies, the message will be redelivered as long as it hasn't been acked. You can also build a small HTTP API, but that doesn't guarantee reprocessing of messages on fatal failures. Ultimately it all depends on your design; all three projects are toolsets for building your own architectures and systems, so it's all up to you.
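If you do go the socket route, the key design point from the question is that the database row is the source of truth and the message is only a hint. That can be sketched as a best-effort datagram nudge (shown in Python for brevity; the port, payload format, and function name are assumptions, and the same few lines work in PHP with socket_sendto):

```python
import socket

WORKER_ADDR = ("127.0.0.1", 9999)   # hypothetical address the worker listens on


def notify_new_job(job_id: int, addr=WORKER_ADDR) -> None:
    """Best-effort nudge to the worker: losing the datagram loses nothing,
    because the worker also scans the Jobs table periodically."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        try:
            s.sendto(str(job_id).encode(), addr)
        except OSError:
            pass   # worker down or unreachable; the periodic DB scan will catch up
```

The web endpoint calls this right after committing the transaction, and the worker treats any received id as "check the database now" rather than as the job payload itself.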
I have written a simple Python 3.7 Windows service and installed it successfully. Now I am facing this error:
"Error starting service: The service did not respond to the start or control request in a timely fashion."
Please Help me to fix this error.
Thanks
One of the most common errors from Windows when starting your service is Error 1053: "The service did not respond to the start or control request in a timely fashion." This can occur for multiple reasons, but there are a few things to check when you hit it:
Make sure your service actually stops: note that the main method has an infinite loop. A typical service template breaks out of that loop when the stop event occurs, but only if you call win32event.WaitForSingleObject somewhere within the loop and assign its return value to rc.
Make sure your service actually starts: the flip side of the first point; if your service starts but never enters the infinite loop, the main method returns and the service terminates.
Check that your system and user PATH contain the necessary routes: the DLL path is extremely important for your Python service, as it's how the script interfaces with Windows to operate as a service. Likewise, if the service is unable to load Python, you are also hooped. Check by typing echo %PATH% in both a regular console and a console running with Administrator privileges to ensure all of your paths have been loaded.
Give the service another restart: changes to your PATH variable may not kick in immediately; it's a Windows thing.
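The loop-and-wait pattern the first two points describe can be sketched portably, with threading.Event standing in for the win32 stop event (in a real pywin32 service you would call win32event.WaitForSingleObject on the service's stop handle instead):

```python
import threading

stop_event = threading.Event()   # stands in for the service's win32 stop event


def service_main(do_work, poll_seconds: float = 5.0) -> None:
    """Run until the stop event is signalled, doing one unit of work per cycle.

    wait(timeout) plays the role of WaitForSingleObject with a timeout: it
    returns True (ending the loop) as soon as the stop event is set, so the
    service both enters its loop promptly and stops in a timely fashion.
    """
    while not stop_event.wait(timeout=poll_seconds):
        do_work()
```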
I am a beginner in Node.js and am trying to use it in production. I want to achieve Node.js failover. I am running a chat app: if a Node server fails, the chat should not break; the client should automatically be connected to a different Node server, and the same socket id should be used for further chatting so that no chat messages are lost. Can this be achieved? Any samples?
I should not use Nginx/HAProxy. Also, let me know how the Node servers should be set up: Active-Active or Active-Passive.
PM2 is a popular choice of process manager, especially for its auto-failover, auto-scaling, and auto-restart features.
Its introduction reads as follows:
PM2 is a production process manager for Node.js applications with
a built-in load balancer. It allows you to keep applications alive
forever, to reload them without downtime and to facilitate common
system admin tasks.
Starting an application in production mode is as easy as:
$ pm2 start app.js
PM2 is constantly assailed by more than 700 tests.
Official website: http://pm2.keymetrics.io
Works on Linux (stable) & MacOSx (stable) & Windows (beta).
There are several problems you're tackling at once here:
Daemonization - keeping your app up: as already mentioned, tools such as forever can be used to supervise your Node.js application and restart it on failure. This is good for recovering from a worst-case crash.
Similarly recluster can be used to fork your application and make it more fault-resistant by creating a supervisor process and subprocesses.
Uncaught exceptions: a known hurdle in Node.js is that asynchronous errors cannot be caught with a try/catch block. As a consequence, exceptions can bubble up and cause your entire application to crash.
Rather than letting this occur, you should use domains to create a logical grouping of activities that are affected by the exception and handle it as appropriate. If you're running a webserver with state, an unhandled exception should probably be caught and the rest of the connections closed off gracefully before terminating the application.
(If you're running a stateless application, it may be possible to just ignore the exception and attempt to carry on; though this is not necessarily advisable. use it with care).
Security: This is a huge topic. You need to ensure at the very least:
Your application is running as a non-root user with least privileges. Ports < 1024 require root permissions. Typically this is proxied from a higher port with nginx, docker or similar.
You're using helmet and have hardened your application to the extent you can.
As an aside, I see you're using Apache in front of Node.js; this isn't necessarily ideal, as Apache will probably struggle under load with its threading model more than Node.js will with its event-loop model.
Assuming you use a database for authenticating clients, there isn't much more to it than a script that manages the state of the server script, like forever does: it tries to restart the script if it fails. Beyond that, you should design the server script to handle every known and possible unknown error, any signal sent to it, etc.
A small example, structured around streams:
(Websocket Router)
|
|_(Chat Channel #1) \
|_(Chat Channel #2) - Channel Cache // hold last 15 messages of every channel
|_(Chat Channel #3) /
|_(Authentication handler) // login-logout
Hope I helped in some way.
For a simple approach, I think you should build a reconnect mechanism on your client side and use a process manager such as forever or PM2 to manage your Node.js processes. I tried many ways but still couldn't overcome the socket issue: the socket is always killed whenever the process stops.
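For the client-side reconnect, an exponential-backoff schedule is the usual pattern (sketched in Python for brevity; the constants are assumptions, and the same logic is a few lines of JavaScript in the browser):

```python
def backoff_delays(base: float = 0.5, factor: float = 2.0,
                   cap: float = 30.0, retries: int = 6):
    """Yield the delay (in seconds) to sleep before each reconnect attempt,
    growing geometrically from base up to cap."""
    delay = base
    for _ in range(retries):
        yield min(delay, cap)
        delay *= factor
```

After each successful reconnect the client would re-register its session and replay any missed messages from a server-side cache, since a new server will assign a new socket id.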
You could try using pm2 start app.js -i 0. This runs your application in cluster mode, creating one child process per CPU core. You can share socket information between the various processes.