Vertx clustered eventbus not removing old node on kubernetes rolling deployment - hazelcast

I have two vertx micro services running in cluster and communicate with each other using a headless service(link) in on premise cloud. Whenever I do a rolling deployment I am facing connectivity issue within services. When I analysed the log I can see that old node/pod is getting removed from cluster list but the event bus is not removing it and using it in round robin basis.
Below is the member group information before deployment
Member [192.168.4.54]:5701 - ace32cef-8cb2-4a3b-b15a-2728db068b80 //pod 1
Member [192.168.4.54]:5705 - f0c39a6d-4834-4b1d-a179-1f0d74cabbce this
Member [192.168.101.79]:5701 - ac0dcea9-898a-4818-b7e2-e9f8aaefb447 //pod 2
When deployment is started, pod 2 gets removed from the member list,
[192.168.4.54]:5701 [dev] [4.0.2] Could not connect to: /192.168.101.79:5701. Reason: SocketException[Connection refused to address /192.168.101.79:5701]
Removing connection to endpoint [192.168.101.79]:5701 Cause => java.net.SocketException {Connection refused to address /192.168.101.79:5701}, Error-Count: 5
Removing Member [192.168.101.79]:5701 - ac0dcea9-898a-4818-b7e2-e9f8aaefb447
And new member is added,
Member [192.168.4.54]:5701 - ace32cef-8cb2-4a3b-b15a-2728db068b80
Member [192.168.4.54]:5705 - f0c39a6d-4834-4b1d-a179-1f0d74cabbce this
Member [192.168.94.85]:5701 - 1347e755-1b55-45a3-bb9c-70e07a29d55b //new pod
All migration tasks have been completed. (repartitionTime=Mon May 10 08:54:19 MST 2021, plannedMigrations=358, completedMigrations=358, remainingMigrations=0, totalCompletedMigrations=3348, elapsedMigrationTime=1948ms, totalElapsedMigrationTime=27796ms)
But when a request is made to the deployed service, event though old pod is removed from member group the event bus is using the old pod/service reference(ac0dcea9-898a-4818-b7e2-e9f8aaefb447),
[vert.x-eventloop-thread-1] DEBUG io.vertx.core.eventbus.impl.clustered.ConnectionHolder - tx.id=f9f5cfc9-8ad8-4eb1-b12c-322feb0d1acd Not connected to server ac0dcea9-898a-4818-b7e2-e9f8aaefb447 - starting queuing
I checked the official documentation for rolling deployment and my deployment seems to be following two key things mentioned in documentation, only one pod removed and then the new one is added.
never start more than one new pod at once
forbid more than one unavailable pod during the process
I am using vertx 4.0.3 and hazelcast kubernetes 1.2.2. My verticle class is extending AbstractVerticle and deploying using,
Vertx.clusteredVertx(options, vertx -> {
vertx.result().deployVerticle(verticleName, deploymentOptions);
Sorry for the long post, any help is highly appreciated.

One possible reason could be due to a race condition with Kubernetes removing the pod and updating the endpoint in Kube-proxy as detailed in this extensive article. This race condition will lead to Kubernetes continuing to send traffic to the pod being removed after it has terminated.
One TL;DR solution is to add a delay when terminating a pod by either:
Have the service delay when it receives a SIGTERM (e.g. for 15 sec) such that it keeps responding to requests during that delay period like normal.
Use the Kubernetes preStop hook to execute a sleep 15 command on the container. This allows the service to continue responding to requests during that 15 second period while Kubernetes is updating it's endpoints. Kubernetes will send SIGTERM when the preStop hook completes.
Both solutions will give Kubernetes some time to propagate changes to it's internal components so that traffic stops being routed to the pod being removed.
A caveat to this answer is that I'm not familiar with Hazelcast Clustering and how your specific discover mode is setup.

Related

How does Application Gateway prevent requests being sent to recently terminated pods?

I'm currently researching and experimenting with Kubernetes in Azure. I'm playing with AKS and the Application Gateway ingress. As I understand it, when a pod is added to a service, the endpoints are updated and the ingress controller continuously polls this information. As new endpoints are added AG is updated. As they're removed AG is also updated.
As pods are added there will be a small delay whilst that pod is added to the AG before it receives requests. However, when pods are removed, does that delay in update result in requests being forwarded to a pod that no longer exists?
If not, how does AG/K8S guarantee this? What behaviour could the end client potentially experience in this scenario?
Azure Application gateway ingress is an ingress controller for your kubernetes deployment which allows you to use native Azure Application gateway to expose your application to the internet. Its purpose is to route the traffic to pods directly. At the same moment all questions about pods availability, scheduling and generally speaking management is on kubernetes itself.
When a pod receives a command to be terminated it doesn't happen instantly. Right after kube-proxies will update iptables to stop directing traffic to the pod. Also there may be ingress controllers or load balancers forwarding connections directly to the pod (which is the case with an application gateway). It's impossible to solve this issue completely, while adding 5-10 seconds delay can significantly improve users experience.
If you need to terminate or scale down your application, you should consider following steps:
Wait for a few seconds and then stop accepting connections
Close all keep-alive connections not in the middle of request
Wait for all active requests to finish
Shut down the application completely
Here are exact kubernetes mechanics which will help you to resolve your questions:
preStop hook - this hook is called immediately before a container is terminated. This is very helpful for graceful shutdowns of an application. For example simple sh command with "sleep 5" command in a preStop hook can prevent users to see "Connection refused errors". After the pod receives an API request to be terminated, it takes some time to update iptables and let an application gateway know that this pod is out of service. Since preStop hook is executed prior SIGTERM signal, it will help to resolve this issue.
(example can be found in attach lifecycle event)
readiness probe - this type of probe always runs on the container and defines whether pod is ready to accept and serve requests or not. When container's readiness probe returns success, it means the container can handle requests and it will be added to the endpoints. If a readiness probe fails, a pod is not capable to handle requests and it will be removed from endpoints object. It works very well with newly created pods when an application takes some time to load as well as for already running pods if an application takes some time for processing.
Before removing from the endpoints readiness probe should fail several times. It's possible to lower this amount to only one fail using failureTreshold field, however it still needs to detect one failed check.
(additional information on how to set it up can be found in configure liveness readiness startup probes)
startup probe - for some applications which require additional time on their first initialisation it can be tricky to set up a readiness probe parameters correctly and not compromise a fast response from the application.
Using failureThreshold * periodSecondsfields will provide this flexibility.
terminationGracePeriod - is also may be considered if an application requires more than default 30 seconds delay to gracefully shut down (e.g. this is important for stateful applications)

How to tells Knative Pod Autoscaler not to kill in-progress long running pod

My goal:
Implement a cron job run once per week and I intend to implement this topology on Knative to save the computing resources:
PingSource -> knative service
The PingSource will emit a dummy event to a knative service once per week just to bring up 1 knative service pod. The knative service pod will get huge amount of data and then process them.
My concern:
If I set enable-scale-to-zero to true, the Knative pod autoscaler probably shutdown the knative service pod even when the pod has not finished its work.
So far, I explored:
The scale-to-zero-grace-period which can be configured to tell the auto scaler how long it should wait after the last traffic ends to shutdown the pod. But I don't think this approach is subtle. I prefer somewhat similar to readinessProbe or livenessProbe. The auto scaler should send a probe to know whether the pod is processing something before sending the kill signal.
In addition, according to knative's docs, there are 2 type of event sink: callable and addressable. Addressable and Callable both return the response or acknowledgement. Would the knative auto scaler consider the pod as handling the request till the pod return the response/acknowledgement? So as long as the pod does not response, it won't be removed by the auto scaler.
The Knative autoscaler relies on the pod strictly working in a request/response fashion. As long as the "huge amount of data" is processed as part of an HTTP request (or Websocket session, or gRPC session etc.) the pod will not even be considered for deletion.
What will not work is sending the request, immediately return and then munging the data in the background. The autoscaler will think that there's no activity at all and thus shut it down. There is a sandbox project that tries to implement such asynchronous semantics though.

FabricDnsService is not preferred DNS server on the node on solution with one stateful and one stateless

Unhealthy event: SourceId='System.FabricDnsService', Property='Environment', HealthState='Warning', ConsiderWarningAsError=false.
FabricDnsService is not preferred DNS server on the node.
Wondering if anyone has a place on where to start on getting this warning in azure fabric?
Looks this was an issue on sf v6.0 and now it has beed fixed in v6.1
https://github.com/Azure/service-fabric-issues/issues/496
For now to workaround this you should turn OFF all your network connections except one major, reset a local cluster, redeploy the app.
https://learn.microsoft.com/en-us/azure/service-fabric/service-fabric-cluster-manifest

Service Fabric Cluster status "Upgrade service unreachable"

I Had SF cluster made of 3 Standard A0 nodes.
I scaled cluster in to 1 node and understood that this was bad idea because nothing was working in this state (even SF explorer was not working)
Then I scaled it out back to 3 nodes and restarted Primary scaleser.
Now all nodes in scaleset are up and running but SF cluster status is "Upgrade service unreachable".
I saw similar question Service Fabric Status: Upgrade service unreachable where was recommended to scale nodes up to D2 but this hasn't solve my problem.
I have connected to one node via RDP and are some Event logs:
EventLog -> Applications and Service Logs -> Microsoft Service Fabric -> Operational:
Node name: _SSService_0 has failed to open with upgrade domain: 0, fault domain: fd:/0, address: 10.0.0.4, hostname: SSService000000, isSeedNode: true, versionInstance: 5.6.210.9494:3, id: d9e8bae2d4d8116bfefb989b95e91f7b, dca instance: 131405546580494698, error: FABRIC_E_TIMEOUT
EventLog -> Applications and Service Logs -> Microsoft Service Fabric -> Admin:
client-10.0.0.4:19000/10.0.0.4:19000: error = 2147943625, failureCount=487. Filter by (type~Transport.St && ~"(?i)10.0.0.4:19000") to get listener lifecycle. Connect failure is expected if listener was never started, or listener/its process was stopped before/during connecting.
If you are scaling down the cluster by resizing VM scale set to 1 you're basically destroying the cluster because it requires a minimum of 3 nodes by design. Therefore the only way is to recreate it again from scratch.
If you need a tiny cluster consisting of just 1 node (like for testing purposes) there is a way in Azure now to create a single node cluster, but you won't be able to scale it as it's a special case not for production use.
Upgrade service unreachable this happens if the number of active VM or node of the cluster become 0 anyhow. In my case, his happened by restarting all the VM at a time. In this state, the nodes are available and running but they have been disconnected from the cluster.
I resolved this, by deallocating and restarting the node from Virtual machine Scale set.

How do I select the master Redis pod in this Kubernetes example?

Here's the example I have modeled after.
In the Readme's "Delete our manual pod" section:
The redis sentinels themselves, realize that the master has disappeared from the cluster, and begin the election procedure for selecting a new master. They perform this election and selection, and chose one of the existing redis server replicas to be the new master.
How do I select the new master? All 3 Redis server pods controlled by the redis replication controller from redis-controller.yaml still have the same
labels:
name: redis
which is what I currently use in my Service to select them. How will the 3 pods be distinguishable so that from Kubernetes I know which one is the master?
How will the 3 pods be distinguishable so that from Kubernetes I know
which one is the master?
Kubernetes isnt aware of the master nodes. You can find the pod manually by connecting to it and using:
redis-cli info
You will get lots of information about the server but we need role for our purpose:
redis-cli info | grep ^role
Output:
role: Master
Please note Replication controllers are replaced by Deployments for stateless services. For stateful services use Statefulsets.
Your client Redis library can actually handle this. For example with ioredis:
ioredis guarantees that the node you connected to is always a master even after a failover.
So, you actually connect to a redis-sentinel instead of a redis-client.
We need to do the same thing and tried different things like modifying chart. Finally, just created a simple python docker that does the labeling and created chart that expose the master redis as service. This periodically checked the pods create for redis-ha and label them according to their role ( master/ slave)
It uses the same sentinel commands to find the master/slave.
helm chart redis-pod-labeler here
source repo

Resources