I have a problem with Azure and its Kubernetes environment. From time to time, calls to the k8s API fail, and when that happens the pod experiencing the issue stops responding to calls from other pods (it looks more like a network issue than the application hanging), but the health check keeps working, so k8s does not restart the pod. The only way to restore it is to delete the pod.
Here is part of the stack trace:
Failed to get list of services from DiscoveryClient org.springframework.cloud.kubernetes.discovery.KubernetesDiscoveryClient#7057dbda due to: {}
io.fabric8.kubernetes.client.KubernetesClientException: Operation: [list] for kind: [Service] with name: [null] in namespace: [dev] failed.
at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64)
at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:72)
at io.fabric8.kubernetes.client.dsl.base.BaseOperation.list(BaseOperation.java:602)
at io.fabric8.kubernetes.client.dsl.base.BaseOperation.list(BaseOperation.java:63)
at org.springframework.cloud.kubernetes.discovery.KubernetesDiscoveryClient.getServices(KubernetesDiscoveryClient.java:253)
at org.springframework.cloud.kubernetes.discovery.KubernetesDiscoveryClient.getServices(KubernetesDiscoveryClient.java:249)
.....
Caused by: java.net.SocketTimeoutException: timeout
at okhttp3.internal.http2.Http2Stream$StreamTimeout.newTimeoutException(Http2Stream.java:593)
at okhttp3.internal.http2.Http2Stream$StreamTimeout.exitAndThrowIfTimedOut(Http2Stream.java:601)
What I know now:
the issue happens on Azure
it doesn't depend on load (the affected calls are background jobs, and they don't depend on load either, so this is expected)
it doesn't depend on the number of services deployed or the frequency of calls to the k8s API (so the actual traffic to the k8s API from the cluster doesn't matter)
it is very selective: if one service/replica is affected, others can keep working without issues
if the affected pod is restarted quickly (we have a job to automate restarts), the problem tends to jump to another service
Azure support says it is a problem with our apps, which are built on Spring Boot and its auto-discovery mechanism, but I am starting to doubt that.
Basically it looks like the pod is partially lost by the k8s engine.
So the question is: what is wrong, and what else can I check?
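For example, to separate a network problem from an application problem, one check I can run is to exec into the affected pod and call the API server directly, bypassing the Spring/fabric8 client (the pod name is a placeholder, and this assumes curl and a shell are available in the image):

kubectl exec -it <affected-pod> -n dev -- sh

# inside the pod: call the API server with the pod's own service account token;
# if this also hangs or times out, the problem is below the application
TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
curl -sk --max-time 10 \
  -H "Authorization: Bearer $TOKEN" \
  https://kubernetes.default.svc/api/v1/namespaces/dev/services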
Related
I have implemented OpenWhisk using Kubernetes on the Windows operating system. Now I need to implement the same thing on Linux. I followed this document for the Linux deployment: https://medium.com/#ansjin/openwhisk-deployment-on-a-kubernetes-cluster-7fd3fc2f3726. But when I list all the pods, the OpenWhisk pods are stuck in Pending status.
How do I bring these pods up?
It looks to me like something is going wrong with your Kubernetes/flannel installation. Kubernetes won't create new pods until it can assign pod IPs properly, so the CNI you use (in your case, flannel) needs to be working before OpenWhisk (or any other application) can be deployed.
Investigating the flannel-ds pod that is in CrashLoopBackOff, and figuring out why the coredns pods haven't finished creating, would be a good place to start debugging.
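For example (the pod names below are placeholders; use whatever names kubectl shows in your cluster), something along these lines is a reasonable starting point:

# list the CNI and DNS pods and where they are scheduled
kubectl -n kube-system get pods -o wide

# inspect the crashing flannel pod and its previous container logs
kubectl -n kube-system describe pod kube-flannel-ds-xxxxx
kubectl -n kube-system logs kube-flannel-ds-xxxxx --previous

# check why the coredns pods are stuck
kubectl -n kube-system describe pod -l k8s-app=kube-dns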
I want to find out how long node scaling takes on Azure Kubernetes Service (AKS) using logs.
It's possible with some assumptions.
This information is taken from the Azure AKS documentation (consider getting familiar with it; it describes how to enable the logs, where to look, etc.):
To diagnose and debug autoscaler events, logs and status can be retrieved from the autoscaler add-on. AKS manages the cluster autoscaler on your behalf and runs it in the managed control plane. You can enable control plane node to see the logs and operations from CA (cluster autoscaler).
The same cluster-autoscaler is used across different platforms, and each platform can have some specific setup (e.g. for Azure AKS). Based on that, the logs should contain events like:
status, scaleUp, scaleDown, eventResult
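As a sketch of how this can look on AKS (the resource IDs and the diagnostic-setting name below are placeholders), the cluster-autoscaler category of the control plane logs is sent to a Log Analytics workspace and then queried there:

# send cluster-autoscaler control plane logs to a Log Analytics workspace
az monitor diagnostic-settings create \
  --name aks-autoscaler-logs \
  --resource /subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.ContainerService/managedClusters/<aks-name> \
  --workspace /subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.OperationalInsights/workspaces/<workspace> \
  --logs '[{"category": "cluster-autoscaler", "enabled": true}]'

# then, in Log Analytics, a query such as
#   AzureDiagnostics | where Category == "cluster-autoscaler"
# shows the status/scaleUp/scaleDown/eventResult entries with timestamps,
# from which the node scaling time can be derived.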
Will the application stay live (in transaction) during pod deployment in AKS?
While we are performing the pod deployment, will application transactions go through, or will they error out?
The Deployment resource does a rolling update: new pods are created from the new template, and once they are Ready they are added to the Service load balancer; only then are old ones removed and terminated, so traffic keeps flowing to pods that can serve it.
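A minimal sketch of that, assuming a readiness endpoint at /health and placeholder names and image: a Deployment whose update strategy never takes all pods down at once, plus a readiness probe so new pods only receive traffic once they can actually serve it.

kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # keep all existing pods serving during the rollout
      maxSurge: 1         # bring up one new pod at a time
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
      - name: myapp
        image: myregistry/myapp:1.2.3
        ports:
        - containerPort: 8080
        readinessProbe:        # pod joins the Service only when this passes
          httpGet:
            path: /health
            port: 8080
          periodSeconds: 5
EOF

# watch the rollout; it only completes once the new pods are Ready
kubectl rollout status deployment/myapp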
We installed the following Presto cluster on Linux Red Hat 7.2:
presto latest version - 0.216
1 presto coordinator
231 presto workers
On each worker machine we can use the following command to verify the status:
/app/presto/presto-server-0.216/bin/launcher status
Running as 61824
and also stop/start it as follows:
/app/presto/presto-server-0.216/bin/launcher stop
/app/presto/presto-server-0.216/bin/launcher start
I also searched Google for a UI that can manage Presto status/stop/start,
but haven't found anything about this.
It's very strange that Presto doesn't come with some user interface that can show the cluster status and perform stop/start actions if we need to do so.
As everyone knows, the only user interface Presto has shows status and doesn't offer actions such as stop/start.
In the example screen of that UI we can see that only 5 of the 231 Presto workers are active, but it doesn't support stop/start actions and doesn't show on which workers Presto isn't active.
So what can we do about it?
It's a very bad idea to have to access each worker machine to see whether Presto is up or down.
Why doesn't Presto have a centralized UI that can do stop/start actions?
As an example of what we are expecting from the UI (partial list): per-worker up/down status, and stop/start actions for each worker.
Presto currently uses a discovery service where workers announce themselves to join the cluster, so if a worker node is not registered there is no way for the coordinator or the discovery server to know about its presence and/or restart it.
At Qubole, we use an external service alongside the Presto master that tracks nodes which do not register with the discovery service within a certain interval. This service is responsible for removing such nodes from the cluster.
One more thing we do is use the monit service on each of the Presto worker nodes, which ensures that the Presto server is restarted whenever it goes down.
You may have to do something similar for cluster management, as Presto does not provide it right now.
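As a rough sketch of that kind of external check (not something Presto ships with), assuming passwordless SSH from one admin host and a workers.txt file listing the worker hosts, a loop like this avoids logging in to each machine by hand:

#!/bin/bash
# check every worker from one place and restart presto where it is down
LAUNCHER=/app/presto/presto-server-0.216/bin/launcher

while read -r host; do
  if ! ssh -o ConnectTimeout=5 "$host" "$LAUNCHER status" | grep -q "Running as"; then
    echo "presto is down on $host - restarting"
    ssh "$host" "$LAUNCHER start"
  fi
done < workers.txt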
In my opinion, and from my experience managing a PrestoSQL cluster, this is a matter of which service discovery pattern the architecture uses.
So far, the open-source release of prestodb/prestosql uses the following patterns:
Server-side service discovery - a client app like the Presto CLI, or any app using a Presto SDK, just needs to reach the coordinator, without awareness of the worker nodes.
Service registry - a place to keep track of available instances.
Self-registration - a service instance is responsible for registering itself with the service registry. This is the key part, because it forces several behaviors:
Service instances must be registered with the service registry on startup and unregistered on shutdown
Service instances that crash must be unregistered from the service registry
Service instances that are running but incapable of handling requests must be unregistered from the service registry
So the life-cycle management of each Presto worker is left to the instance itself.
So what can we do about it?
Presto itself provides some observability, such as the HTTP APIs /v1/node and /v1/service/presto, to see instance status. Personally, I recommend using another cluster manager like k8s or Nomad to manage the Presto cluster members.
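For example (the coordinator host and port are placeholders), the registered and failed workers can be listed from the coordinator:

# workers currently registered with the discovery service
curl -s http://presto-coordinator:8080/v1/node

# workers that registered but have since failed
curl -s http://presto-coordinator:8080/v1/node/failed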
It's a very bad idea to have to access each worker machine to see whether Presto is up or down.
Why doesn't Presto have a centralized UI that can do stop/start actions?
No opinion on good/bad. Take k8s for example: you can manage all Presto workers as one k8s Deployment, with each Presto worker in one pod. It can use liveness, readiness and startup probes to automate the instance lifecycle with a little YAML, e.g. the livenessProbe design in the helm chart stable/presto. A cluster manager like k8s also provides a web UI, so you can touch resources and act like an admin. Or you can choose to write more Java code to extend Presto.
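A minimal sketch of that probe idea, assuming the worker listens on port 8080 and using placeholder names and image (this is not the actual stable/presto chart, just the same shape):

kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: presto-worker
spec:
  replicas: 231
  selector:
    matchLabels:
      app: presto-worker
  template:
    metadata:
      labels:
        app: presto-worker
    spec:
      containers:
      - name: presto-worker
        image: my-registry/presto:0.216
        ports:
        - containerPort: 8080
        livenessProbe:        # restart the pod when the server stops answering
          httpGet:
            path: /v1/info
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
EOF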
I'm trying to understand the logic Service Fabric uses to consider a node in a cluster unhealthy.
I recently deployed a new version of our application that had 3 unhealthy worker services running on all nodes. They are very light services that load messages from a queue, but because of their frequent failures, all other services running on the same node were affected for some reason, so all services were reported as unhealthy.
I assume this behavior is Service Fabric's health monitoring deciding the node is not healthy because multiple services are failing on the same node. Is this right?
What are the measures that SF uses to consider a node unhealthy?
Service Fabric's health model is described in detail here. The measures are always "health reports". Service Fabric emits some health reports on its own, but the model is also extensible and you can add your own.
Regardless of whether you've added any new health reports or are relying only on what is present in the system by default, you can see which health reports are being emitted for a given node either by selecting the node in SFX or by running a command like the following:
Get-ServiceFabricNodeHealth -NodeName Node1
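If you do add your own reports, a custom watchdog can report against a node with something like this (the SourceId, property and description are just examples):

Send-ServiceFabricNodeHealthReport -NodeName Node1 `
    -SourceId "MyWatchdog" -HealthProperty "DiskSpace" `
    -HealthState Warning -Description "Less than 10% disk space free"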
As we saw in the doc, node health is mainly determined by:
Health Reports against that particular node (ex: Node went down)
Failures of a Deployed Application
Failures of a particular Deployed Service Package (usually the code packages within it)
In these cases SF tries to grab as much information as possible about what failed (exit codes, exceptions and their stack traces, etc.) and reports a health warning or error for that node.
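To see which deployed application or service package on the node produced the warning or error, you can drill down with commands like these (the application and service manifest names are examples):

Get-ServiceFabricDeployedApplication -NodeName Node1
Get-ServiceFabricDeployedApplicationHealth -ApplicationName fabric:/MyApp -NodeName Node1
Get-ServiceFabricDeployedServicePackageHealth -ApplicationName fabric:/MyApp `
    -ServiceManifestName MyServicePkg -NodeName Node1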