What signal is sent to the process running in the container when the k8s liveness probe fails? KILL or TERM

I have a use case where I need to gracefully terminate the container: a script inside the container kills the process gracefully using the command "kill PID" (which sends the TERM signal).
But I have liveness probe configured as well.
Currently the liveness probe is configured to run at a 60-second interval. So if the probe fires shortly after the graceful termination signal is sent, the overall health of the container might be reported as CRITICAL while the termination is still in progress.
In that case the liveness probe will fail and the container will be terminated immediately.
So I wanted to know whether the kubelet kills the container with TERM or KILL.
Appreciate your support
Thanks in advance

In Kubernetes, a liveness probe checks the health state of a container.
To answer your question on whether it uses SIGKILL or SIGTERM: both are used, in that order. Here is what happens under the hood.
Liveness probe check fails
Kubernetes stops routing of traffic to the container
Kubernetes restarts the container
Kubernetes starts routing traffic to the container again
For the container restart, SIGTERM is sent first; Kubernetes then waits for a configurable grace period (terminationGracePeriodSeconds, 30 seconds by default) before sending SIGKILL.
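A minimal sketch of where that grace period lives in a pod spec (the pod name and image below are placeholders, and 60 is just an example value):

    apiVersion: v1
    kind: Pod
    metadata:
      name: graceful-app                  # placeholder name
    spec:
      terminationGracePeriodSeconds: 60   # window between SIGTERM and SIGKILL
      containers:
      - name: app
        image: example.com/app:latest     # placeholder image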
A hack around your issue is to use the attribute:
timeoutSeconds
This specifies how long a request can take to respond before it’s considered a failure. You can add and adjust this parameter if the time taken for your application to come online is predictable.
Also, you can pair a readinessProbe with the livenessProbe, with an adequate delay so the container comes back into service after the process restarts. Check https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/ for more details on which parameters to use.
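A minimal sketch of both probes inside the container spec (the endpoint paths, port, and timing values are assumptions, not taken from the question):

    livenessProbe:
      httpGet:
        path: /healthz        # hypothetical health endpoint
        port: 8080
      periodSeconds: 60       # the 60-second interval from the question
      timeoutSeconds: 5       # how long a single probe may take before it counts as a failure
    readinessProbe:
      httpGet:
        path: /ready          # hypothetical readiness endpoint
        port: 8080
      initialDelaySeconds: 10 # give the process time to come back into service after a restart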

Related

How does the kubelet deliver the SIGTERM signal to containers on a Kubernetes node?

We have one container per pod. Each container runs a service that listens for the SIGTERM signal to initiate its shutdown process.
To initiate the shutdown process,
does the kubectl delete pod [podname] command need to be issued?
does the container receive the SIGTERM signal from the Linux kernel of the Kubernetes node?
if yes, how does the kubelet tell the kernel of the Kubernetes node to send the SIGTERM signal to a specific container?
Note: of course, a pod is essentially just a container isolated in its own network namespace
When initiating a pod deletion request
kubectl delete pods <POD>
the container runtime sends a TERM signal to the main process in each container.
When the --force --grace-period=0 flags are used, the container runtime sends TERM followed immediately by KILL.
kubectl does not initiate termination on a per-container basis; it terminates all containers running inside the deleted pod.
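For the service to shut down cleanly, its main process has to handle that TERM signal. A minimal sketch of a container entrypoint that does so (the script and service binary names are hypothetical):

    #!/bin/sh
    # entrypoint.sh (hypothetical): forward SIGTERM to the service and exit cleanly
    term_handler() {
      echo "SIGTERM received, shutting down"
      kill -TERM "$child" 2>/dev/null   # pass the signal on to the service
      wait "$child"                     # let it finish its shutdown work
      exit 0
    }
    trap term_handler TERM

    /usr/local/bin/my-service &         # placeholder for the real service binary
    child=$!
    wait "$child"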
Note that a pod does not transition to the Terminating state only upon an explicit request to delete it; termination is a core part of how Kubernetes manages your cluster, which means it might terminate a perfectly healthy container for one of the reasons below:
If the deployment is updated with a rolling update
If a node is drained
If a node runs out of resources
You can check the Pods Lifecycle documentation for further details.

How does Application Gateway prevent requests being sent to recently terminated pods?

I'm currently researching and experimenting with Kubernetes in Azure. I'm playing with AKS and the Application Gateway ingress. As I understand it, when a pod is added to a service, the endpoints are updated and the ingress controller continuously polls this information. As new endpoints are added AG is updated. As they're removed AG is also updated.
As pods are added there will be a small delay whilst that pod is added to the AG before it receives requests. However, when pods are removed, does that delay in update result in requests being forwarded to a pod that no longer exists?
If not, how does AG/K8S guarantee this? What behaviour could the end client potentially experience in this scenario?
Azure Application Gateway ingress is an ingress controller for your Kubernetes deployment that lets you use a native Azure Application Gateway to expose your application to the internet. Its purpose is to route traffic directly to pods. At the same time, everything concerning pod availability, scheduling, and management in general is handled by Kubernetes itself.
When a pod receives a command to terminate, it doesn't stop instantly. Right after, kube-proxies update iptables to stop directing traffic to the pod, but there may also be ingress controllers or load balancers forwarding connections directly to the pod (which is the case with Application Gateway). It's impossible to eliminate this window completely, but adding a 5-10 second delay can significantly improve the user experience.
If you need to terminate or scale down your application, you should consider following steps:
Wait for a few seconds and then stop accepting connections
Close all keep-alive connections not in the middle of request
Wait for all active requests to finish
Shut down the application completely
Here are the exact Kubernetes mechanics that will help you resolve your questions:
preStop hook - this hook is called immediately before a container is terminated. It is very helpful for graceful shutdown of an application. For example, a simple sh command with "sleep 5" in a preStop hook can prevent users from seeing "Connection refused" errors. After the pod receives an API request to terminate, it takes some time to update iptables and let the application gateway know that this pod is out of service. Since the preStop hook is executed before the SIGTERM signal, it helps close this gap.
(example can be found in attach lifecycle event)
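A minimal sketch of that preStop hook (the container name and image are placeholders):

    containers:
    - name: app                          # placeholder name
      image: example.com/app:latest      # placeholder image
      lifecycle:
        preStop:
          exec:
            command: ["sh", "-c", "sleep 5"]  # give iptables and the gateway time to drop the pod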
readiness probe - this type of probe runs on the container for its whole lifetime and defines whether the pod is ready to accept and serve requests. When a container's readiness probe returns success, the container can handle requests and it is added to the endpoints. If a readiness probe fails, the pod is not capable of handling requests and it is removed from the Endpoints object. This works very well for newly created pods when an application takes some time to load, as well as for already running pods when an application temporarily cannot process requests.
Before the pod is removed from the endpoints, the readiness probe has to fail several times. It's possible to lower this amount to a single failure using the failureThreshold field, but the probe still needs to detect at least one failed check.
(additional information on how to set it up can be found in configure liveness readiness startup probes)
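A minimal sketch of such a readiness probe (the endpoint path, port, and timings are assumptions):

    readinessProbe:
      httpGet:
        path: /ready          # hypothetical readiness endpoint
        port: 8080
      periodSeconds: 5
      failureThreshold: 1     # remove the pod from endpoints after a single failed check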
startup probe - for applications that require additional time for their first initialisation, it can be tricky to set readiness probe parameters that don't compromise fast responses once the application is up.
Using the failureThreshold * periodSeconds fields provides this flexibility.
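A minimal sketch: with the assumed values below, the application gets up to failureThreshold * periodSeconds = 30 * 10 = 300 seconds to finish its first initialisation before the probe gives up:

    startupProbe:
      httpGet:
        path: /healthz        # hypothetical health endpoint
        port: 8080
      failureThreshold: 30
      periodSeconds: 10       # 30 * 10 = 300 s for the first initialisation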
terminationGracePeriod - may also be considered if an application requires more than the default 30-second delay to shut down gracefully (e.g. this is important for stateful applications)

How to tell the Knative Pod Autoscaler not to kill an in-progress long-running pod

My goal:
Implement a cron job that runs once per week; I intend to implement this topology on Knative to save computing resources:
PingSource -> knative service
The PingSource will emit a dummy event to a Knative service once per week, just to bring up one Knative service pod. The Knative service pod will fetch a huge amount of data and then process it.
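A minimal sketch of that topology (the names, schedule, and payload below are assumptions):

    apiVersion: sources.knative.dev/v1
    kind: PingSource
    metadata:
      name: weekly-trigger              # hypothetical name
    spec:
      schedule: "0 0 * * 0"             # once per week, at midnight on Sunday
      data: '{"run": "weekly"}'         # dummy event payload
      sink:
        ref:
          apiVersion: serving.knative.dev/v1
          kind: Service
          name: data-processor          # hypothetical Knative service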
My concern:
If I set enable-scale-to-zero to true, the Knative Pod Autoscaler will probably shut down the Knative service pod even though the pod has not finished its work.
So far, I explored:
The scale-to-zero-grace-period, which can be configured to tell the autoscaler how long it should wait after the last traffic ends before shutting down the pod. But I don't think this approach is subtle enough. I would prefer something similar to a readinessProbe or livenessProbe: the autoscaler should send a probe to find out whether the pod is processing something before sending the kill signal.
In addition, according to Knative's docs, there are two types of event sink: callable and addressable. Both return a response or acknowledgement. Would the Knative autoscaler consider the pod as handling the request until the pod returns the response/acknowledgement? If so, as long as the pod does not respond, it won't be removed by the autoscaler.
The Knative autoscaler relies on the pod strictly working in a request/response fashion. As long as the "huge amount of data" is processed as part of an HTTP request (or Websocket session, or gRPC session etc.) the pod will not even be considered for deletion.
What will not work is accepting the request, returning immediately, and then munging the data in the background. The autoscaler will think that there's no activity at all and thus shut the pod down. There is a sandbox project that tries to implement such asynchronous semantics, though.

How does KillSignal interact with TimeoutStopSec in systemd?

Can someone let me know the following about the systemd service shutdown sequence?
If I have specified KillSignal=SIGTERM, how does this interact with TimeoutStopSec? Does this mean that during shutdown of the service, SIGTERM will be sent first, and if the service is still running after TimeoutStopSec, SIGKILL will be sent (if SendSIGKILL is set to yes)? I am asking about the case where nothing is specified in ExecStop.
Does TimeoutStopSec take into account ExecStop and all ExecStopPost commands?
This has been answered in a systemd mailing list thread. Posting the answer below.
1. If I have specified KillSignal=SIGTERM, how does this interact with TimeoutStopSec? Does this mean that during shutdown of the service, SIGTERM will be sent first, and if the service is still running after TimeoutStopSec, SIGKILL will be sent (if SendSIGKILL is set to yes)? I am asking about the case where nothing is specified in ExecStop.
Yes, that's correct
2. Does TimeoutStopSec take into account ExecStop and all ExecStopPost commands?
TimeoutStopSec applies to every command individually. If an ExecStopPost command fails (or times out), subsequent commands are not executed; but if each command takes close to TimeoutStopSec, the total execution time will approach the number of ExecStopPost commands multiplied by TimeoutStopSec.
From the systemd manual:
This option serves two purposes. First, it configures the time to wait for each ExecStop= command. If any of them times out, subsequent ExecStop= commands are skipped and the service will be terminated by SIGTERM. If no ExecStop= commands are specified, the service gets the SIGTERM immediately. This default behavior can be changed by the TimeoutStopFailureMode= option. Second, it configures the time to wait for the service itself to stop. If it doesn't terminate in the specified time, it will be forcibly terminated by SIGKILL (see KillMode= in systemd.kill(5)). Takes a unit-less value in seconds, or a time span value such as "5min 20s". Pass "infinity" to disable the timeout logic. Defaults to DefaultTimeoutStopSec= from the manager configuration file (see systemd-system.conf(5)).
If a service of Type=notify sends "EXTEND_TIMEOUT_USEC=…", this may cause the stop time to be extended beyond TimeoutStopSec=. The first receipt of this message must occur before TimeoutStopSec= is exceeded, and once the stop time has extended beyond TimeoutStopSec=, the service manager will allow the service to continue to stop, provided the service repeats "EXTEND_TIMEOUT_USEC=…" within the interval specified, or terminates itself (see sd_notify(3)).
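A minimal sketch of a unit using these options (the service name and binary are placeholders; SIGTERM and SendSIGKILL=yes are also the defaults):

    # example.service (hypothetical)
    [Service]
    ExecStart=/usr/local/bin/example-daemon
    KillSignal=SIGTERM      # signal sent first when the service is stopped
    TimeoutStopSec=30       # wait up to 30 s after SIGTERM...
    SendSIGKILL=yes         # ...then escalate to SIGKILL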

Systemd HTTP health check

I have a service on Red Hat 7.1 that I control with systemctl start, stop, restart, and status. One time systemctl status returned active, but the application "behind" the service responded with an HTTP code different from 200.
I know that I can use Monit or Nagios to check this and do the systemctl restart - but I would like to know if something exists by default in systemd, so that I do not need to install other tools.
My preferred solution would be to have my service restarted automatically if the HTTP return code is different from 200, using nothing but systemd itself (and maybe with the possibility to notify a HipChat room or send an email...).
I've tried googling the topic - without luck. Please help :-)
The Short Answer
systemd has a native (socket-based) healthcheck method, but it's not HTTP-based. You can write a shim that polls status over HTTP and forwards it to the native mechanism, however.
The Long Answer
The Right Thing in the systemd world is to use the sd_notify socket mechanism to inform the init system when your application is fully available. Use Type=notify for your service to enable this functionality.
You can write to this socket directly using the sd_notify() call, or you can inspect the NOTIFY_SOCKET environment variable to get the name and have your own code write READY=1 to that socket when the application is returning 200s.
If you want to put this off to a separate process that polls your process over HTTP and then writes to the socket, you can do that -- ensure that NotifyAccess is set appropriately (by default, only the main process of the service is allowed to write to the socket).
Inasmuch as you're interested in detecting cases where the application fails after it was fully initialized, and triggering a restart, the sd_notify socket is appropriate in this scenario as well:
Send WATCHDOG_USEC=... to set the amount of time which is permissible between successful tests, then WATCHDOG=1 whenever you have a successful self-test; whenever no successful test is seen for the configured period, your service will be restarted.
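A minimal sketch of such a shim, assuming the application answers on http://localhost:8080/health (the unit name, paths, and intervals are all made up):

    # myapp.service (hypothetical)
    [Service]
    Type=notify
    NotifyAccess=all        # let the helper processes below write to the notify socket
    WatchdogSec=30          # restart the service if no WATCHDOG=1 arrives within 30 s
    Restart=on-failure
    ExecStart=/usr/local/bin/myapp-with-healthcheck.sh

    #!/bin/sh
    # myapp-with-healthcheck.sh (hypothetical): start the app, then poll it over HTTP
    /usr/local/bin/myapp &
    systemd-notify --ready  # tell systemd the service is up
    while sleep 10; do
      # pet the watchdog only while the app responds successfully
      curl -fsS -o /dev/null http://localhost:8080/health && systemd-notify WATCHDOG=1
    done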
