Pacemaker/Corosync/PostgreSQL cluster failovers during heavy load

At our company we're running PostgreSQL on 4-node clusters using Pacemaker and Corosync.
During heavy batch loads we suffer cluster failovers because the built-in resource monitoring times out when trying to access the database: the server is simply overloaded.
On the one hand, it is understandable cluster behaviour that a 'self-induced denial of service' should trigger a master switchover; on the other hand, we would like not to see our batches and service (temporarily) aborted because of this. A standalone server would have just pulled through. Obviously we are looking into optimizing and spreading the batches, but that is like putting one fire out only for another to pop up elsewhere.
I looked into Linux cgroups, but they don't seem to be a viable solution, as all they do is limit CPU/IO for the PostgreSQL resource, which is part of the problem :-)
Any ideas or suggestions very much appreciated!

Related

Databricks REST API throttling and capacity restrictions/limits

I've scaled up the hardware on an azure-databricks cluster ("all-purpose" cluster) appropriately so that it should handle a very large amount of work. The application is designed in a way where incoming data is processed in smallish, discrete chunks. The jobs run in ~20 to 30 seconds. But there is a high degree of concurrent jobs that need to execute at the same time (eg. anywhere from 0 to 50 simultaneous jobs).
The only approach for delivering jobs to the cluster seems to be by way of their REST API in azure databricks (doc: https://docs.databricks.com/dev-tools/api/latest/jobs.html )
Everything behaves normally until the number of concurrent jobs reaches 10 or so. At that point I see an unreasonable deterioration in throughput. But if I check ganglia or custom telemetry, there appears to be no reason for the deteriorated performance.
My suspicion is that the REST API itself is introducing an artificial bottleneck and they are throttling the number of jobs I can send over to my cluster. This was not self-evident to me. If I am paying for a large cluster, I should be allowed to send jobs to it. The REST API seems to be doing little more than serving as a communication channel that allows me to transmit my requests to my cluster. That API is the last place I would expect to find a resource bottleneck. A Spark developer would naturally investigate their code, then the cluster hardware. The REST API is not a reasonable place for Databricks to be introducing some additional, secretive limitations.
Does anyone know of another way to transmit distinct jobs to a cluster without going through the REST API? E.g. is there a way for the driver node in the cluster to spawn additional/distinct/first-class jobs without being counted against our REST API allowance?
This issue seems silly and artificial. The secretive nature of these limits is bothersome to me as well. If they are throttling the REST API then there should be a warning, error, or ganglia chart for that. Otherwise developers will struggle with the performance issues using trial and error and guesswork.
Any help is appreciated. I'd prefer not to go all the way back to the drawing board, because of an artificial restriction in their REST API (one that was probably put in place to protect an underpowered "control plane").
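One way to narrow down where the deterioration described above begins is to measure throughput at increasing concurrency levels. The following is only a rough sketch: `run_job` is a hypothetical caller-supplied function (for example, a wrapper that submits one job and waits for it to finish), and none of these names come from Databricks.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def measure_throughput(
    run_job: Callable[[], None],
    concurrency_levels: Iterable[int],
    jobs_per_level: int = 20,
) -> Dict[int, float]:
    """Return jobs completed per second for each concurrency level."""
    results = {}
    for level in concurrency_levels:
        start = time.perf_counter()
        with ThreadPoolExecutor(max_workers=level) as pool:
            futures = [pool.submit(run_job) for _ in range(jobs_per_level)]
            for future in futures:
                future.result()  # re-raise any failure from the job
        elapsed = time.perf_counter() - start
        results[level] = jobs_per_level / elapsed
    return results
```

If throughput stops growing at a similar concurrency level regardless of cluster size, that would be consistent with a limit outside the cluster hardware, though it does not by itself show where the limit sits.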

Spark is awesome, but it isn't designed to be a high-concurrency database. The folks at Databricks have done a lot to lift the concurrency limitations of Spark, but it still isn't a high-concurrency solution.
In other words, your problem isn't the REST API ... it's the Spark engine in Databricks.
I know you don't want to go back to the drawing board, but the choices here are all bad ones:
You can run multiple Databricks clusters ( https://docs.databricks.com/clusters/index.html ) and use NGINX or some other load balancer to distribute the API requests. This will get expensive quickly, but will avoid a redesign.
If your use case supports it, try using a real-time database that supports high concurrency. I like Druid (see https://druid.apache.org or https://imply.io if you want a managed version), but there are others in the same category.
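The first option can also be approximated on the client side by rotating submissions across clusters, which is a simpler stand-in for a real load balancer such as NGINX. This is a minimal sketch: the cluster identifiers and the `submit` callable are hypothetical placeholders, not part of any Databricks API.

```python
from itertools import cycle
from typing import Callable, Iterable, List, Tuple


def distribute_round_robin(
    jobs: Iterable[str],
    cluster_ids: List[str],
    submit: Callable[[str, str], None],
) -> List[Tuple[str, str]]:
    """Send each job to the next cluster in turn and return the assignments."""
    if not cluster_ids:
        raise ValueError("at least one cluster is required")
    assignments = []
    targets = cycle(cluster_ids)  # endless rotation over the clusters
    for job in jobs:
        cluster = next(targets)
        submit(cluster, job)  # hypothetical: e.g. a wrapper around a REST call
        assignments.append((cluster, job))
    return assignments
```

Round-robin ignores how busy each cluster is; a real load balancer can weigh targets by health or load.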

Stopping a Running Spark Application (Databricks Interactive Cluster)

I'm using databricks with an interactive cluster. If I review their management user-interface, there is only one "application" listed. And when I try to kill it, I always get this message
HTTP ERROR 405
Problem accessing /app/kill/. Reason:
Method Not Allowed
The end result is that I'm forced to restart the entire cluster. I use their "cluster pool" feature, which makes the wait time a bit shorter, but it still involves waiting for about a minute before I'm able to get back to work.
The reason I need to restart the application is to swap fresh JARs into the Spark environment. Otherwise, when I repeatedly use addJar(), I run into some annoying jar-hell issues (class not found errors and such).
Why does Databricks only list one application at a time in their "interactive" cluster?
Why doesn't databricks have a way to stop one application and start another in its place (without restarting the whole cluster)?
This affects development productivity when we are forced to sit around waiting an extra minute for no good reason. It is already pretty hard to be productive with spark.

Is implementing an Elasticsearch service on the same server as a Node server with auto scaling a good idea?

I am trying to deploy a project on a t3.large server with auto scaling.
My Elasticsearch service is deployed on the same system as my Node and React projects (I am not using AWS Elasticsearch).
Will this cause issues in the future, and do I need to move the Elasticsearch service to a separate server?

It's always nice to have a separate dedicated server for running Elasticsearch, but since you are using AWS, here are some things to consider and some steps you can take to minimize the issues:
Elasticsearch is a stateful application, in contrast to your Node and React apps (unless you are storing state there as well, which is not a good idea). Because those applications are stateless, autoscaling is very useful for them: you can scale instances up or down on demand based on CPU, memory, or other metrics.
With Elasticsearch or other stateful applications it becomes tricky: when you scale instances up or down, shards get relocated if their node is not reachable within a threshold, which can lead to an unbalanced Elasticsearch cluster.
To minimize these issues:
Make sure you are storing Elasticsearch indices on a network-attached disk, so that there is no data loss when autoscaling brings up a new instance; the new instance should reuse the earlier network-attached EBS volume (where your data is stored).
Make sure you don't create a new Elasticsearch process when instances scale up or down according to your autoscaling policy; the number of Elasticsearch processes should be fixed and changed only with manual intervention.
If you have to scale up the Elasticsearch cluster, make sure you disable shard allocation first to avoid the issues mentioned earlier.
These are some known issues you might face, and there could be more depending on your configuration. While writing this answer I felt it would be much easier to just have a dedicated instance for Elasticsearch and avoid these weird issues.
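The shard-allocation step mentioned above goes through the cluster settings API (the `cluster.routing.allocation.enable` setting). The sketch below only builds the request rather than sending it; the base URL is an assumption, and a secured cluster would also need authentication, which is not shown.

```python
import json
from urllib.request import Request


def allocation_settings_request(base_url: str, mode: str) -> Request:
    """Build (but do not send) a PUT to the cluster settings API."""
    if mode not in {"all", "primaries", "new_primaries", "none"}:
        raise ValueError(f"unsupported allocation mode: {mode}")
    body = {"persistent": {"cluster.routing.allocation.enable": mode}}
    return Request(
        url=f"{base_url.rstrip('/')}/_cluster/settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
```

The request could then be sent with `urllib.request.urlopen(...)`: once with a restrictive mode before scaling, and once with `"all"` afterwards to re-enable allocation.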

I would add the following to the other answers:
Elasticsearch performs best if it has enough RAM to keep its indexes entirely in RAM. If Elasticsearch is competing with the Node application for RAM, its performance will suffer.
From a maintenance/performance perspective, you should consider having at least a 3-node cluster, even if that means using smaller machines. If AWS is upgrading infrastructure and you have only one machine, then when that 0.05% unavailability hits, your search is down. If you need to do maintenance or upgrades on a node, having multiple machines will help with availability.
Depending on how you use Elasticsearch, how often you update or delete items in the indexes, and how fast your indexes grow, adding more machines/nodes to the cluster will help with growth.
There are probably many more things to consider, but that depends entirely on your application, budget, SLAs, etc.
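To put the 0.05% figure above into numbers, here is a small back-of-the-envelope calculation. It assumes node failures are independent, which is a strong simplification (shared infrastructure can fail together), and whether search actually stays up with a node down also depends on replica configuration.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours in a non-leap year


def expected_downtime_hours(unavailability: float) -> float:
    """Expected hours per year that a single machine is unavailable."""
    return unavailability * HOURS_PER_YEAR


def all_nodes_down_probability(unavailability: float, nodes: int) -> float:
    """Chance that every node is down at once, assuming independent failures."""
    return unavailability ** nodes
```

With an unavailability of 0.0005, a single machine is expected to be down for roughly 4.4 hours per year, while the chance of three independent nodes all being down at the same moment is 0.0005 cubed, about 1.25e-10.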

Blocking on idle connections on ClientRead for parametrized queries (bindings) during high traffic

I am looking for a good solution to a problem that occurs during high traffic peaks. I use Postgres on AWS with Node.js (knex for query building); details are below.
When I look at Performance Insights in my RDS console, I see that some queries are stuck on "ClientRead". My RDS instances are rather large and CPU utilization is very low (1%-10%). I confirmed this by connecting to the database and querying pg_stat_activity: a lot of queries are idle on the ClientRead event.
What do these queries have in common? Bindings. I assume that these parametrized queries are waiting to get values from my EC2 instances. I thought my services were too slow, so I scaled out to more instances, but the results on RDS were worse and more connections were blocked.
To test a fix, I converted a few of the queries from parametrized queries into raw SQL without bindings (with the values directly in the query). Exactly these queries then ran immediately without any problems. But that does not seem like a good solution, if only for security reasons.
At the moment I have no idea where the problem is. Should I reduce traffic by adding throttling on the API gateway? Create internal queues in my service? Is it a communication problem, or is it the settings of my RDS/Postgres?
If anyone has experience with similar cases and can point to a probable solution, link to documentation that could help, or help me detect where the problem is, that would be great.
AWS RDS (Aurora) Postgres 9.6.9
nodejs 10.12.0
knex 0.17.3
node-postgres 7.4.1

If your database backends are blocked waiting for ClientRead, that means that the database is waiting for requests from the client.
The queries you are seeing are not running queries. If the state is not active, query contains the last SQL statement that was run on this database connection.
If you are experiencing performance problems, the cause seems to be outside the database.
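To see the distinction described above in your own data, you can group sessions by state and wait event. This is a sketch only: the query uses `pg_stat_activity` columns that exist in PostgreSQL 9.6 and later, and the helper functions work on rows fetched by whatever driver you use (the row shape here is an assumption).

```python
from collections import Counter
from typing import Iterable, Optional, Tuple

# state, wait_event and query are pg_stat_activity columns (PostgreSQL 9.6+).
ACTIVITY_SQL = """
SELECT state, wait_event, query
FROM pg_stat_activity
WHERE datname = current_database()
"""

Row = Tuple[Optional[str], Optional[str], Optional[str]]


def summarize_activity(rows: Iterable[Row]) -> Counter:
    """Count sessions by (state, wait_event)."""
    return Counter((state, wait_event) for state, wait_event, _query in rows)


def waiting_on_client(rows: Iterable[Row]) -> int:
    """Sessions that are not running a statement and are waiting for the client."""
    return sum(
        1
        for state, wait_event, _query in rows
        if state != "active" and wait_event == "ClientRead"
    )
```

A large `waiting_on_client` count alongside few `active` sessions would match the explanation above: the backends are waiting for the application, not the other way round.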

I was facing the same problem and upgrading my pg Node.js package from 7.14.0 to 8.5.1 solved this exact problem.
Before upgrading, ClientRead wait event was accounting for something like 80 % of the RDS performance chart, now it's down to less than 10 %.

'Unable to connect Net/http: TLS handshake timeout' — Why can't Kubectl connect to Azure Kubernetes server? (AKS)

My question (to MS and anyone else) is: Why is this issue occurring and what work around can be implemented by the users / customers themselves as opposed to by Microsoft Support?
There have obviously been 'a few' other questions about this issue:
Managed Azure Kubernetes connection error
Can't contact our Azure-AKS kube - TLS handshake timeout
Azure Kubernetes: TLS handshake timeout (this one has some Microsoft feedback)
And multiple GitHub issues posted to the AKS repo:
https://github.com/Azure/AKS/issues/112
https://github.com/Azure/AKS/issues/124
https://github.com/Azure/AKS/issues/164
https://github.com/Azure/AKS/issues/177
https://github.com/Azure/AKS/issues/324
Plus a few twitter threads:
https://twitter.com/ternel/status/955871839305261057
TL;DR
Skip to workarounds in Answers below.
The current best solution is to post a help ticket and wait, or to re-create your AKS cluster (maybe more than once, cross your fingers, see below...), but there should be something better. At the very least, please let AKS preview customers, regardless of support tier, upgrade their support request severity for THIS specific issue.
You can also try scaling your Cluster (assuming that doesn't break your app).
What about GitHub?
Many of the above GitHub issues have been closed as resolved but the issue persists. Previously there was an announcements document regarding the problem but no such status updates are currently available even though the problem continues to present itself:
https://github.com/Azure/AKS/tree/master/annoucements
I am posting this as I have a few new tidbits that I haven't seen elsewhere and I am wondering if anyone has ideas as far as other potential options for working around the issue.
Affected VM / Node Resource Usage
The first piece I haven't seen mentioned elsewhere is Resource usage on the nodes / vms / instances that are being impacted by the above Kubectl 'Unable to connect to the server: net/http: TLS handshake timeout' issue.
Production Node Utilization
The node(s) on my impacted cluster look like this:
The drop in utilization and network io correlates strongly with both the increase in disk utilization AND the time period we began experiencing the issue.
The overall Node / VM utilization is generally flat prior to this chart for the previous 30 days with a few bumps relating to production site traffic / update pushes etc.
Metrics After Issue Mitigation (Added Postmortem)
To the above point, here are the metrics for the same Node after scaling up and then back down (which happened to alleviate our issue, but does not always work; see answers at bottom):
Notice the 'Dip' in CPU and Network? That's where the Net/http: TLS issue impacted us, and when the AKS Server was unreachable from Kubectl. It seems it wasn't talking to the VM / Node in addition to not responding to our requests.
As soon as we were back (scaled the number of nodes up by one and back down; see answers for the workaround) the Metrics (CPU etc.) went back to normal, and we could connect from Kubectl. This means we can probably create an Alarm based on this behavior (and I have an issue open asking about this on the Azure DevOps side: https://github.com/Azure/AKS/issues/416)
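As a rough illustration of what such an alarm could look for, the sketch below flags samples that fall well below the average of the preceding window. The window size and the 50% threshold are arbitrary assumptions, not values taken from Azure monitoring.

```python
from statistics import mean
from typing import List, Sequence


def find_dips(
    samples: Sequence[float], window: int = 5, drop_ratio: float = 0.5
) -> List[int]:
    """Return indexes whose value is below drop_ratio times the mean of the preceding window."""
    dips = []
    for i in range(window, len(samples)):
        baseline = mean(samples[i - window:i])
        if baseline > 0 and samples[i] < drop_ratio * baseline:
            dips.append(i)
    return dips
```

For example, `find_dips([40, 42, 41, 39, 40, 41, 10, 9, 40])` returns `[6, 7]`, the two samples where CPU collapsed relative to the recent baseline.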
Node Size Potentially Impacts Issue Frequency
Zimmergren over on GitHub indicates that he has fewer issues with larger instances than he did running bare-bones smaller nodes. This makes sense to me and could indicate that the way the AKS servers divvy up the workload (see next section) could be based on the size of the instances.
"The size of the nodes (e.g. D2, A4, etc) :)
I've experienced that when running A4 and up, my cluster is healther than if running A2, for example. (And I've got more than a dozen similar experiences with size combinations and cluster failures, unfortunately)." (https://github.com/Azure/AKS/issues/268#issuecomment-375715435)
Other Cluster size impact references:
giorgited (https://github.com/Azure/AKS/issues/268#issuecomment-376390692)
An AKS server responsible for more smaller Clusters may possibly get hit more often?
Existence of Multiple AKS Management 'Servers' in one Az Region
The next thing I haven't seen mentioned elsewhere is the fact that you can have multiple Clusters running side by side in the same Region where one Cluster (production for us in this case) gets hit with 'net/http: TLS handshake timeout' and the other is working fine and can be connected to normally via Kubectl (for us this is our identical staging environment).
The fact that users (Zimmergren etc above) seem to feel that the Node size impacts the likelihood that this issue will impact you also seems to indicate that node size may relate to the way the sub-region responsibilities are assigned to the sub-regional AKS management servers.
That could mean that re-creating your cluster with a different Cluster size would be more likely to place you on a different management server — alleviating the issue and reducing the likelihood that multiple re-creations would be necessary.
Staging Cluster Utilization
Both of our AKS Clusters are in U.S. East. As a reference to the above 'Production' Cluster metrics our 'Staging' Cluster (also U.S. East) resource utilization does not have the massive drop in CPU / Network IO — AND does not have the increase in disk etc. over the same period:
Identical Environments are Impacted Differently
Both of our Clusters are running identical ingresses, services, pods, containers so it is also unlikely that anything a user is doing causes this problem to crop up.
Re-creation is only SOMETIMES successful
The above existence of multiple AKS management server sub-regional responsibilities makes sense with the behavior described by other users on github (https://github.com/Azure/AKS/issues/112) where some users are able to re-create a cluster (which can then be contacted) while others re-create and still have issues.
Emergency could = Multiple Re-Creations
In an emergency (ie your production site... like ours... needs to be managed) you can PROBABLY just re-create until you get a working cluster that happens to land on a different AKS management server instance (one that is not impacted) but be aware that this may not happen on your first attempt — AKS cluster re-creation is not exactly instant.
That said...
Resources on the Impacted Nodes Continue to Function
All of the containers / ingresses / resources on our impacted VM appear to be working well and I don't have any alarms going off for up-time / resource monitoring (other than the utilization weirdness listed above in the graphs)
I want to know why this issue is occurring and what work around can be implemented by the users themselves as opposed to by Microsoft Support (currently have a ticket in). If you have an idea let me know.
Potential Hints at the Cause
https://github.com/Azure/AKS/issues/164#issuecomment-363613110
https://github.com/Azure/AKS/issues/164#issuecomment-365389154
Why no GKE?
I understand that Azure AKS is in preview and that a lot of people have moved to GKE because of this problem. That said, my Azure experience has been nothing but positive thus far and I would prefer to contribute a solution if at all possible.
And also... GKE occasionally faces something similar:
TLS handshake timeout with kubernetes in GKE
I would be interested to see if scaling the nodes on GKE also solved the problem over there.

Workaround 1 (May Not Work for Everyone)
An interesting solution (worked for me) to test is scaling the number of nodes in your cluster up, and then back down...
Log into the Azure Console — Kubernetes Service blade.
Scale your cluster up by 1 node.
Wait for scale to complete and attempt to connect (you should be able to).
Scale your cluster back down to the normal size to avoid cost increases.
Alternatively, you can (maybe) do this from the command line:
az aks scale --name <name-of-cluster> --node-count <new-number-of-nodes> --resource-group <name-of-cluster-resource-group>
Since it is a finicky issue and I used the web interface I am uncertain if the above is identical or would work.
The total time for me was ~2 minutes; for my situation that is MUCH better than re-creating / configuring a Cluster (potentially multiple times...)
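If you want to script the scale-up-then-down routine, a thin wrapper around the command quoted above could look like the sketch below. It uses only the flags shown in that command; the cluster and resource group names are placeholders, and like the command itself it is untested against a cluster in the broken state.

```python
import subprocess
from typing import List


def scale_command(cluster: str, resource_group: str, node_count: int) -> List[str]:
    """Build the argv for `az aks scale` using the flags quoted above."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    return [
        "az", "aks", "scale",
        "--name", cluster,
        "--resource-group", resource_group,
        "--node-count", str(node_count),
    ]


def bounce_node_count(
    cluster: str, resource_group: str, normal_count: int, run=subprocess.run
) -> None:
    """Scale up by one node, then back down to the normal size."""
    for count in (normal_count + 1, normal_count):
        run(scale_command(cluster, resource_group, count), check=True)
```

The `run` parameter exists so the sequence can be dry-run (for example by passing a function that just prints the command) before pointing it at a real cluster.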
That being Said....
Zimmergren brings up some good points that Scaling is not a true Solution:
"It worked sometimes, where the cluster self-healed a period after scaling. It failed sometimes with the same errors. I don't consider scaling a solution to this problem, as that causes other challenges depending on how things are set up. I wouldn't trust that routine for a GA workload, that's for sure. In the current preview, it's a bit wild west (and expected), and I'm happy to blow up the cluster and create a new one when this fails continuously." (https://github.com/Azure/AKS/issues/268#issuecomment-395299308)
Azure Support Feedback
Since I had a support ticket open at the time I ran into the above scaling solution, I was able to get feedback (or rather a guess) on why the above might have worked. Here's a paraphrased response:
"I know that scaling the cluster can sometimes help if you get into a state where the number of nodes is mismatched between “az aks show” and “kubectl get nodes”. This may be similar."
Workaround References:
GitHub user Scaled nodes from console and fixed the problem: https://github.com/Azure/AKS/issues/268#issuecomment-375722317
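The Azure Support guess quoted above (a node count mismatch between "az aks show" and "kubectl get nodes") can be checked by comparing the JSON output of the two commands. The sketch below assumes the outputs were captured with JSON formatting and parsed into dictionaries, that `az aks show` reports node counts under `agentPoolProfiles[].count`, and that `kubectl get nodes -o json` returns a list under `items`; treat those field names as assumptions to verify against your own output.

```python
from typing import Any, Dict


def expected_node_count(aks_show: Dict[str, Any]) -> int:
    """Sum node counts across agent pools in parsed `az aks show` output."""
    pools = aks_show.get("agentPoolProfiles") or []
    return sum(pool.get("count", 0) for pool in pools)


def actual_node_count(kubectl_nodes: Dict[str, Any]) -> int:
    """Count nodes in parsed `kubectl get nodes -o json` output."""
    return len(kubectl_nodes.get("items") or [])


def node_count_mismatch(aks_show: Dict[str, Any], kubectl_nodes: Dict[str, Any]) -> bool:
    """True when Azure and Kubernetes disagree about the number of nodes."""
    return expected_node_count(aks_show) != actual_node_count(kubectl_nodes)
```

Note that this check needs a working kubectl connection, so it is more useful for spotting the mismatch before the timeout issue appears than during it.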
Workaround Didn't Work?
If this DOES NOT work for you, please post a comment below as I am going to try to keep an up to date list of how often the issue crops up, whether it resolves itself, and whether this solution works across Azure AKS users (looks like it doesn't work for everyone).
Users Scaling Up / Down DID NOT work for:
omgsarge (https://github.com/Azure/AKS/issues/112#issuecomment-395231681)
Zimmergren (https://github.com/Azure/AKS/issues/268#issuecomment-395299308)
sercand — scale operation itself failed — not sure if it would have impacted connectability (https://github.com/Azure/AKS/issues/268#issuecomment-395301296)
Scaling Up / Down DID work for:
Me
LohithChanda (https://github.com/Azure/AKS/issues/268#issuecomment-395207716)
Zimmergren (https://github.com/Azure/AKS/issues/268#issuecomment-395299308)
Email Azure AKS Specific Support
If after all the diagnosis you still suffer from this issue, please don't hesitate to send email to aks-help@service.microsoft.com

Adding another answer since this is now the Azure Support official solution when the above attempts do not work. I haven't experienced the issue in a while so I can't verify this one but it seems like it would make sense to me (based on previous experience).
Credit on this one / full thread found here (https://github.com/Azure/AKS/issues/14#issuecomment-424828690)
Check for Tunneling Issues
SSH to the agent node that is running the tunnelfront pod
Get the tunnelfront logs: "docker ps" -> "docker logs "
Run "nslookup " with the FQDN obtained from the above command -> if it resolves to an IP, DNS works, so go to the next step
"ssh -vv azureuser# -p 9000" -> if the port is working, go to the next step
"docker exec -it /bin/bash", then type "ping google.com"; if there is no response, the tunnelfront pod doesn't have external network access, so do the following step
Restart kube-proxy using "kubectl delete po -n kube-system", choosing the kube-proxy that is running on the same node as tunnelfront. You can find it with "kubectl get po -n kube-system -o wide"
I feel like this particular work-around could PROBABLY be automated (for sure on Azure side but probably on the community side).
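As a starting point for that kind of automation, the DNS and port checks from the steps above can be approximated with the standard library. This is only a sketch: the hostname is whatever FQDN the tunnelfront logs give you (not filled in here), and a successful TCP connection is a weaker check than the full "ssh -vv" handshake.

```python
import socket


def resolves(hostname: str) -> bool:
    """Rough equivalent of the nslookup step: does the name resolve at all?"""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False


def port_open(hostname: str, port: int = 9000, timeout: float = 5.0) -> bool:
    """Rough equivalent of the ssh -p 9000 step: does a TCP connection succeed?"""
    try:
        with socket.create_connection((hostname, port), timeout=timeout):
            return True
    except OSError:
        return False
```

If `resolves(...)` is true but `port_open(...)` is false, that points at the tunnel rather than DNS, which is where the remaining steps (checking the pod's external network and restarting kube-proxy) come in.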
Email Azure AKS Specific Support
If after all the diagnosis you still suffer from this issue, please don't hesitate to send email to aks-help@service.microsoft.com

Workaround 2 Re-Create Cluster (Somewhat Obvious)
I am adding this one because there are some details to keep in mind and even though I touched on it in my original Question, that thing got long, so I am adding specific details on re-creation here.
Cluster Re-Creation Doesn't Always Work
Per the above in my original question there are multiple AKS Server instances that divide up responsibilities for a given Azure region (we think). Some, or all, of these can be impacted by this bug resulting in your Cluster being un-reachable via Kubectl.
That means that if you re-create your Cluster and it somehow lands on the same AKS server, that new Cluster will probably ALSO not be reachable, requiring...
Additional Re-creation Attempts
Re-creating multiple times will probably result in your new Cluster eventually landing on one of the other AKS servers (one that is working fine).
As far as I can tell, there is no indication that ALL AKS servers get hit with this problem at once (if ever).
Different Cluster Node Size
If you are in a pinch and want the highest possible probability (we haven't confirmed this) that your re-creation lands on a different AKS management server, choose a different Node size when you create your new Cluster (see Node Size section of the initial Question above).
I have opened this ticket asking Azure DevOps whether or not the Node Size is ACTUALLY related to deciding which Clusters are administered by which AKS management servers: https://github.com/Azure/AKS/issues/416
Support Ticket Fix vs. Self Healing
Since there are a lot of users who indicate that the problem occasionally solves itself and just goes away I think that it is reasonable to guess that Support actually fixes the offending AKS server (which may result in other users having their Clusters fixed — 'Self Heal') as opposed to fixing the individual user's Cluster.
Creating Support Tickets
To me the above would likely mean that creating a Ticket is probably a good thing since it would fix other user Clusters experiencing the same issue — it might also be an argument for allowing support issue severity escalation for this specific issue.
I think this is also a decent indicator that maybe Azure support hasn't figured out how to fully alarm for the problem yet, in which case creation of a support ticket serves that purpose as well.
I also asked Azure DevOps whether they Alarm for the issue (based on my experience easily visualizing the issue based on CPU and Network IO metric changes) on their side: https://github.com/Azure/AKS/issues/416
If NOT (haven't heard back) then it makes sense to create a ticket EVEN IF you plan to re-create your cluster since that ticket would then make Azure DevOps aware of the issue resulting in a fix for other users on that Cluster management server.
Things to make Cluster Re-Creation Easier
I will add to this (feedback / ideas are appreciated) but off the top of my head:
Be diligent (obvious) about how you store all YAML files used to create your Cluster (even if you don't re-deploy often for your app by design).
Script your DNS modifications in order to speed up pointing to the new instance — If you have a public facing app / service that utilizes DNS (Maybe something like this example for Google Domains?: https://gist.github.com/cyrusboadway/5a7b715665f33c237996, Full docs here: https://cloud.google.com/dns/api/v1/)

We just had this issue for one of our clusters. Sent a support ticket and got called back 5 minutes later by an engineer asking if it was OK for them to restart the API Server. 2 minutes later it was working again.
Reason was something about timeouts in their messaging queue.
