I had an SF cluster made of 3 Standard A0 nodes.
I scaled the cluster in to 1 node and realized this was a bad idea, because nothing was working in that state (even Service Fabric Explorer was unavailable).
Then I scaled it back out to 3 nodes and restarted the primary scale set.
Now all nodes in the scale set are up and running, but the SF cluster status is "Upgrade service unreachable".
I saw a similar question, Service Fabric Status: Upgrade service unreachable, where it was recommended to scale the nodes up to D2, but this hasn't solved my problem.
I connected to one node via RDP; here are some event logs:
EventLog -> Applications and Service Logs -> Microsoft Service Fabric -> Operational:
Node name: _SSService_0 has failed to open with upgrade domain: 0, fault domain: fd:/0, address: 10.0.0.4, hostname: SSService000000, isSeedNode: true, versionInstance: 5.6.210.9494:3, id: d9e8bae2d4d8116bfefb989b95e91f7b, dca instance: 131405546580494698, error: FABRIC_E_TIMEOUT
EventLog -> Applications and Service Logs -> Microsoft Service Fabric -> Admin:
client-10.0.0.4:19000/10.0.0.4:19000: error = 2147943625, failureCount=487. Filter by (type~Transport.St && ~"(?i)10.0.0.4:19000") to get listener lifecycle. Connect failure is expected if listener was never started, or listener/its process was stopped before/during connecting.
If you scale the cluster down by resizing the VM scale set to 1 node, you are basically destroying the cluster, because it requires a minimum of 3 nodes by design. Therefore the only way forward is to recreate it from scratch.
If you need a tiny cluster consisting of just 1 node (e.g. for testing purposes), there is now a way in Azure to create a single-node cluster, but you won't be able to scale it, as it is a special case not intended for production use.
"Upgrade service unreachable" happens if the number of active VMs or nodes in the cluster somehow drops to 0. In my case, this happened because all the VMs were restarted at the same time. In this state the nodes are available and running, but they have been disconnected from the cluster.
I resolved this by deallocating and restarting the nodes from the virtual machine scale set.
I have two Vert.x microservices running in a cluster that communicate with each other using a headless service (link) in an on-premise cloud. Whenever I do a rolling deployment, I face connectivity issues between the services. When I analysed the logs, I could see that the old node/pod gets removed from the cluster member list, but the event bus does not remove it and keeps using it on a round-robin basis.
Below is the member group information before the deployment:
Member [192.168.4.54]:5701 - ace32cef-8cb2-4a3b-b15a-2728db068b80 //pod 1
Member [192.168.4.54]:5705 - f0c39a6d-4834-4b1d-a179-1f0d74cabbce this
Member [192.168.101.79]:5701 - ac0dcea9-898a-4818-b7e2-e9f8aaefb447 //pod 2
When the deployment is started, pod 2 gets removed from the member list:
[192.168.4.54]:5701 [dev] [4.0.2] Could not connect to: /192.168.101.79:5701. Reason: SocketException[Connection refused to address /192.168.101.79:5701]
Removing connection to endpoint [192.168.101.79]:5701 Cause => java.net.SocketException {Connection refused to address /192.168.101.79:5701}, Error-Count: 5
Removing Member [192.168.101.79]:5701 - ac0dcea9-898a-4818-b7e2-e9f8aaefb447
And the new member is added:
Member [192.168.4.54]:5701 - ace32cef-8cb2-4a3b-b15a-2728db068b80
Member [192.168.4.54]:5705 - f0c39a6d-4834-4b1d-a179-1f0d74cabbce this
Member [192.168.94.85]:5701 - 1347e755-1b55-45a3-bb9c-70e07a29d55b //new pod
All migration tasks have been completed. (repartitionTime=Mon May 10 08:54:19 MST 2021, plannedMigrations=358, completedMigrations=358, remainingMigrations=0, totalCompletedMigrations=3348, elapsedMigrationTime=1948ms, totalElapsedMigrationTime=27796ms)
But when a request is made to the deployed service, even though the old pod has been removed from the member group, the event bus still uses the old pod/service reference (ac0dcea9-898a-4818-b7e2-e9f8aaefb447):
[vert.x-eventloop-thread-1] DEBUG io.vertx.core.eventbus.impl.clustered.ConnectionHolder - tx.id=f9f5cfc9-8ad8-4eb1-b12c-322feb0d1acd Not connected to server ac0dcea9-898a-4818-b7e2-e9f8aaefb447 - starting queuing
I checked the official documentation for rolling deployments, and my deployment seems to follow the two key rules mentioned there: only one pod is removed, and then the new one is added.
never start more than one new pod at once
forbid more than one unavailable pod during the process
I am using Vert.x 4.0.3 and hazelcast-kubernetes 1.2.2. My verticle class extends AbstractVerticle, and I deploy it using:
Vertx.clusteredVertx(options, vertx -> {
    vertx.result().deployVerticle(verticleName, deploymentOptions);
});
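For context, the full clustered bootstrap looks roughly like this (a simplified sketch; the class name, verticle name and Hazelcast setup are placeholders, not my exact code):

import io.vertx.core.DeploymentOptions;
import io.vertx.core.Vertx;
import io.vertx.core.VertxOptions;
import io.vertx.spi.cluster.hazelcast.HazelcastClusterManager;

public class Main {
    public static void main(String[] args) {
        // Cluster manager backed by Hazelcast; member discovery is configured via hazelcast-kubernetes.
        VertxOptions options = new VertxOptions()
                .setClusterManager(new HazelcastClusterManager());

        Vertx.clusteredVertx(options, res -> {
            if (res.succeeded()) {
                // Deploy the verticle once the clustered Vert.x instance is ready.
                res.result().deployVerticle("com.example.MyVerticle", new DeploymentOptions());
            } else {
                res.cause().printStackTrace();
            }
        });
    }
}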
Sorry for the long post; any help is highly appreciated.
One possible reason could be a race condition between Kubernetes removing the pod and updating the endpoints in kube-proxy, as detailed in this extensive article. This race condition leads to Kubernetes continuing to send traffic to the pod being removed after it has terminated.
One TL;DR solution is to add a delay when terminating a pod by either:
Have the service delay shutdown when it receives a SIGTERM (e.g. for 15 seconds) so that it keeps responding to requests as normal during that delay period (see the sketch after this list).
Use the Kubernetes preStop hook to execute a sleep 15 command in the container. This allows the service to continue responding to requests during that 15-second period while Kubernetes is updating its endpoints. Kubernetes will send SIGTERM when the preStop hook completes.
Both solutions will give Kubernetes some time to propagate changes to its internal components so that traffic stops being routed to the pod being removed.
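As an illustration of the first option, a minimal JVM-side sketch could be a shutdown hook that keeps the process alive for a few seconds before closing Vert.x (purely a sketch; the names and the 15-second value are placeholders, adapt it to your own bootstrap):

import io.vertx.core.Vertx;

public class GracefulShutdown {
    // Call this once after the clustered Vertx instance has been created.
    public static void install(Vertx vertx) {
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            try {
                // Keep serving in-flight and new requests for ~15 seconds after SIGTERM,
                // giving Kubernetes time to remove the pod from its endpoints.
                Thread.sleep(15_000);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            // Then close Vert.x so the event bus and HTTP servers shut down cleanly.
            vertx.close();
        }));
    }
}

The preStop variant achieves the same effect from the Kubernetes side without touching application code.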
A caveat to this answer is that I'm not familiar with Hazelcast clustering and how your specific discovery mode is set up.
I'm running a NestJS web application implemented with Fastify on Kubernetes.
I split my application into multiple zones and deploy it to Kubernetes clusters in different physical locations (Cluster A & Cluster B).
Everything goes well, except for Zone X in Cluster A, which has the highest traffic of all zones.
(Here is a 2-day metrics dashboard for Zone X during a normal period.)
The problem only happens in Zone X of Cluster A and never in any other zone or cluster.
At first, some 499 responses appear in Cluster A's ingress dashboard, and soon the memory of the pods suddenly grows to the memory limit, one pod after another.
It seems that the 499 status is caused by the pods not sending responses back to the client.
At the same time, other zones in Cluster A work normally.
To avoid affecting users, I switch all network traffic to Cluster B and everything works properly, which rules out dirty data as the cause.
I tried killing and redeploying all pods of Zone X in Cluster A, but when I switched traffic back to Cluster A, the problem occurred again. However, after waiting 2-3 hours and then switching the traffic back, the problem disappears!
Since I don't know the cause, the only thing I can do is switch the traffic and check whether everything is back to normal.
I've looked into several possible node memory issues, but none of them seems to explain this problem. Any ideas or leads on this problem?
Versions:
nestjs: v6.1.1
fastify: v2.11.0
Docker image: node:12-alpine (v12.18.3)
Ingress: v0.30.0
Kubernetes: v1.18.12
Unhealthy event: SourceId='System.FabricDnsService', Property='Environment', HealthState='Warning', ConsiderWarningAsError=false.
FabricDnsService is not preferred DNS server on the node.
Wondering if anyone has a pointer on where to start with this warning in Azure Service Fabric?
It looks like this was an issue in SF v6.0 and it has been fixed in v6.1:
https://github.com/Azure/service-fabric-issues/issues/496
For now, to work around this, you should turn OFF all your network connections except the main one, reset the local cluster, and redeploy the app.
https://learn.microsoft.com/en-us/azure/service-fabric/service-fabric-cluster-manifest
After changing the node type's reliability tier from Silver to Bronze on Azure Service Fabric, the cluster health shows a warning. Below is the error evaluation from Service Fabric (the same warning also appears in the VM scale set of the Service Fabric cluster):
Services
Warning
Unhealthy services: 100% (1/1), ServiceType='ClusterManagerServiceType', MaxPercentUnhealthyServices=0%.
Service
Warning
Unhealthy service: ServiceName='fabric:/System/ClusterManagerService', AggregatedHealthState='Warning'.
Event
Warning
Unhealthy event: SourceId='System.PLB', Property='ServiceReplicaUnplacedHealth_Secondary_00000000-0000-0000-0000-000000002000', HealthState='Warning', ConsiderWarningAsError=false.
The Load Balancer was unable to find a placement for one or more of the Service's Replicas:
ClusterManagerServiceName Secondary Partition 00000000-0000-0000-0000-000000002000 could not be placed, possibly, due to the following constraints and properties:
TargetReplicaSetSize: 5
Placement Constraint: NodeTypeName==NOde
Depended Service: N/A
Constraint Elimination Sequence:
ReplicaExclusionStatic eliminated 2 possible node(s) for placement -- 1/3 node(s) remain.
ReplicaExclusionDynamic eliminated 1 possible node(s) for placement -- 0/3 node(s) remain.
Nodes Eliminated By Constraints:
ReplicaExclusionStatic -- No Colocations with Partition's Existing Secondaries/Instances:
FaultDomain:fd:/0 NodeName:_NOde_0 NodeType:NOde NodeTypeName:NOde UpgradeDomain:0 UpgradeDomain: ud:/0 Deactivation Intent/Status: None/None
FaultDomain:fd:/2 NodeName:_NOde_2 NodeType:NOde NodeTypeName:NOde UpgradeDomain:2 UpgradeDomain: ud:/2 Deactivation Intent/Status: None/None
ReplicaExclusionDynamic -- No Colocations with Partition's Existing Primary or Potential Secondaries:
FaultDomain:fd:/1 NodeName:_NOde_1 NodeType:NOde NodeTypeName:NOde UpgradeDomain:1 UpgradeDomain: ud:/1 Deactivation Intent/Status: None/None
Please help me solve this problem.
When you create your cluster with the Silver reliability tier, it will provision 5 replicas of the system services, i.e. the services that essentially are Service Fabric.
Downgrading from Silver to Bronze means that you change the target replica count of these services from 5 to 3.
In order for SF to place replicas on nodes, it evaluates a set of constraints, one of these being that it does not want two replicas of the same service partition to end up on the same node.
From your error it looks like you have one node type with 3 nodes in it but still have the Silver reliability tier (note the TargetReplicaSetSize: 5 in your log), which means that SF is unable to find nodes for the last two replicas of the system services (in your log it is System/ClusterManagerService, but the same applies to all system services).
Make sure that your cluster has at least as many nodes as your reliability tier needs, i.e. 3 nodes for the Bronze tier, 5 for Silver, and so on.
Also, what you are seeing is a warning that the cluster is not able to uphold its characteristics, but it should still be running, right?
I created a RabbitMQ cluster via Docker and Docker Cloud. I am running two RabbitMQ containers on two separate nodes (both hosted on AWS).
The output of rabbitmqctl cluster_status is:
Cluster status of node 'rabbit#rabbitmq-cluster-2' ...
[{nodes,[{disc,['rabbit#rabbitmq-cluster-1','rabbit#rabbitmq-cluster-2']}]},
{running_nodes,['rabbit#rabbitmq-cluster-1','rabbit#rabbitmq-cluster-2']},
{cluster_name,<<"rabbit#rabbitmq-cluster-1">>},
{partitions,[]}]
However, when I stop one container/node, my messages cannot get delivered and end up queued in .dlx.
I am using senecajs with NodeJS.
Has anybody had the same problem and can point me in the right direction?
To answer my own question:
The problem was that Docker, after starting, caches the DNS resolution and is not able to connect to a new host. So if one cluster node fails, Docker still tries to connect to that one, instead of trying a different one.
The solution was to write my own function for connecting to RabbitMQ. I first check with net.createConnection whether the host is online; if yes, I connect to it, and if not, I try a different one (a rough sketch of this fallback logic is shown below).
Every time a RabbitMQ node is down, my service fails, restarts and calls the "try this host" function.
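My implementation is in Node.js (the net.createConnection probe mentioned above), but the fallback logic itself is simple. As a hypothetical illustration of the same idea in Java (the host list, port and timeout are made-up values), it is roughly:

import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.List;

public class BrokerPicker {
    // Return the first RabbitMQ host that accepts a TCP connection, or null if none do.
    public static String firstReachable(List<String> hosts, int port, int timeoutMillis) {
        for (String host : hosts) {
            try (Socket probe = new Socket()) {
                // Same idea as the net.createConnection check: can we open a socket at all?
                probe.connect(new InetSocketAddress(host, port), timeoutMillis);
                return host;
            } catch (IOException ignored) {
                // Host is down or unreachable; try the next one.
            }
        }
        return null;
    }
}

The service then connects to whichever host this returns, and on failure it restarts and runs the check again.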