Need more information on a CouchDB Replication Error - couchdb

I'm attempting to do a one time replication of a CouchDB from a production server to my development machine for debugging. It simply gives, "Replication Error" with no other information (other than it will retry at increasing intervals). Is there a log somewhere or something I can do to see what the error was?

Related

Erlang connection reset

I have an API application running in a Docker container, and since moving to AWS, the API stops daily with the error: Erlang closed the connection. I've monitored the server during that time and no IOPS seem to be causing this issue. Beyond that, when the API fails, it won't restart on its own on one of our clusters. I'm not sure where to find the logs to get more context and could use any input that may be helpful. For more context, this API worked fairly well in our data center on physical servers, but now in AWS it fails daily. Any thoughts or suggestions as to why it may be failing to restart?
I've looked at the syslogs and the application server logs and don't see any kind of failures. Maybe I'm not looking in the proper place. At this point, someone from one of our teams has to manually restart the api with an init.d command. I don't want to create a cron job to "fix" this because that's a band-aid and not a true fix.
There really isn't enough information on the structure of your app or its connections, so no one can give a "true fix".
The problem may be as small as configuring your nodes differently, changing some of the server's local configurations, or you may need some "keep alive" handler towards AWS.
Maybe try adding a manual periodic dump to see if it's an accumulating problem, but I believe that if Erlang closed the connection, there's a problem between your API and AWS, and your API can't understand where it is located.

AKS Error: Failed to drain the node, aborting scale down

I am using Kured to perform safe reboots of our nodes to upgrade the OS and kernel versions.
In my understanding, it works by cordoning and draining the node, and the pods are scheduled on a new node with the older version. After the reboot, the nodes are uncordoned and back to the ready state and the temporary worker nodes get deleted.
It was perfectly fine until yesterday when one of the nodes failed to upgrade to the latest kernel version. It was on 5.4.0-1058-azure last week after a successful upgrade and it should be on 5.4.0-1059-azure yesterday after the latest patch, but it is using the old version 5.4.0-1047-azure (which I think is the version of the temporary node that got created).
Upon checking the log analytics on azure, it says that it failed to scale down.
Reason: ScaleDownFailed
Message: failed to drain the node, aborting ScaleDown
Any idea on why this is happening?
Firstly, there is a little misunderstanding of the OS and Kernel patching process.
"In my understanding, it works by cordoning and draining the node, and the pods are scheduled on a new node with the older version."
The new node(s) added should come with the latest node image version available for the node pool, including the latest security patches (it usually does not fall back to an older kernel version). You can check the AKS node image releases for reference.
However, the pods evicted by the drain operation from the node being rebooted do not necessarily have to land on the surge node. Evicted pods might very well be scheduled on an existing node, should that node fit the bill for scheduling them.
For every Pod that the scheduler discovers, the scheduler becomes responsible for finding the best Node for that Pod to run on. The scheduler reaches this placement decision taking into account the scheduling principles described in the Kubernetes documentation.
The documentation, at the time of writing, might be a little misleading on this.
About the error:
Reason: ScaleDownFailed
Message: failed to drain the node, aborting ScaleDown
This might happen due to a number of reasons. Common ones might be:
The scheduler could not find a suitable node to place evicted pods and the node pool could not scale up due to insufficient compute quota available. [Reference]
The scheduler could not find a suitable node to place evicted pods and the cluster could not scale up due to insufficient IP addresses in the node pool's subnet. [Reference]
PodDisruptionBudgets (PDBs) did not allow for at least 1 pod replica to be moved at a time causing the drain/evict operation to fail. [Reference]
In general,
The Eviction API can respond in one of three ways:
If the eviction is granted, then the Pod is deleted as if you sent a DELETE request to the Pod's URL and received back 200 OK.
If the current state of affairs wouldn't allow an eviction by the rules set forth in the budget, you get back 429 Too Many Requests. This is typically used for generic rate limiting of any requests, but here we mean that this request isn't allowed right now but it may be allowed later.
If there is some kind of misconfiguration, for example multiple PodDisruptionBudgets that refer to the same Pod, you get a 500 Internal Server Error response.
For a given eviction request, there are two cases:
There is no budget that matches this pod. In this case, the server always returns 200 OK.
There is at least one budget. In this case, any of the three above responses may apply.
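The three responses above can be sketched as a small classifier. This is only an illustration that follows the status codes as quoted in this answer; the names EvictionOutcome and classify_eviction_response are hypothetical and are not part of any Kubernetes client library.

```python
from enum import Enum


class EvictionOutcome(Enum):
    """Hypothetical labels for the three documented responses."""
    GRANTED = "granted"              # eviction allowed, Pod is deleted
    RETRY_LATER = "retry_later"      # budget does not allow it right now
    MISCONFIGURED = "misconfigured"  # e.g. overlapping PodDisruptionBudgets
    UNEXPECTED = "unexpected"        # anything not covered by the text above


def classify_eviction_response(status_code: int) -> EvictionOutcome:
    """Map an HTTP status code from an eviction request to an outcome."""
    if status_code == 200:
        return EvictionOutcome.GRANTED
    if status_code == 429:
        return EvictionOutcome.RETRY_LATER
    if status_code == 500:
        return EvictionOutcome.MISCONFIGURED
    return EvictionOutcome.UNEXPECTED
```

A drain loop built on this would typically wait and retry on RETRY_LATER, but stop and ask for human attention on MISCONFIGURED, since retrying cannot fix a misconfigured budget.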
Stuck evictions
In some cases, an application may reach a broken state, one where unless you intervene the eviction API will never return anything other than 429 or 500.
For example, this can happen if a ReplicaSet is creating Pods for your application but the replacement Pods do not become Ready. You can also see similar symptoms if the last Pod evicted has a very long termination grace period.
How to investigate further?
On the Azure Portal, navigate to your AKS cluster.
Go to Resource Health on the left-hand menu and click on Diagnose and solve problems.
You should see a set of diagnostic options.
If you click on each of the options, you should see a number of checks loading. You can set the time frame of impact in the top right-hand corner of the screen (press the Enter key after you have set the correct time frame). You can click on the More Info link on the right-hand side of each entry for detailed information and recommended action.
How to mitigate the issue?
Once you have identified the issue and followed the recommendations to fix it, perform an az aks upgrade on the AKS cluster to the same Kubernetes version it is currently running. This should initiate a reconcile operation wherever required under the hood.

Multiple Redis connection exception (No Connection available to service) during App service swap slots

I have a .NET Core web app in production, deployed in Azure as an App Service on the Premium tier (P2v2, 4 instances). I am also using Azure Redis Cache (Premium tier), which my app uses as its cache. I have two app services (primary and secondary) configured behind Traffic Manager for load balancing.
Whenever I deploy my app to production using the swap slots feature, the response time of both app services goes up to 20 seconds, the app is down for around 1 minute, and CPU utilization gets close to 90%. I also see multiple exceptions from the Redis client (for example: No connection is available to service this operation: EVAL; It was not possible to connect to the Redis server(s). To create a disconnected multiplexer, disable AbortOnConnectFail. ConnectTimeout; IOCP: (Busy=0,Free=1000,Min=8,Max=1000), WORKER: (Busy=452,Free=32315,Min=8,Max=32767), Local-CPU: n/a), and my HttpQueue length goes above 10.
From the above, I infer that the worker threads are overloaded, but I don't know why it is happening.
I am using the .NET StackExchange.Redis client version 2.0.601, recently updated from version 1.2.4.
Note:
I didn't use slot-specific app settings.
It happens on every slot swap during deployment.
I didn't find any app service restart in the logs.
I would like to know whether any of you have faced this issue. If so, please suggest where the problem might be or how to debug it; it would also help if you could share anything you tried.
I tried to find error logs in the Azure Redis Cache server but couldn't find any.
I am trying to figure out what is causing this issue, how to debug this kind of issue in Azure, and whether anybody has encountered the same problem and implemented a resolution.
Please let me know if you need any additional details.
Here is something that might be worth trying:
Cache metrics are reported using several reporting intervals, including Past hour, Today, Past week, and Custom. The Metric blade for each metrics chart displays the average, minimum, and maximum values for each metric in the chart, and some metrics display a total for the reporting interval.
Each metric includes two versions. One metric measures performance for the entire cache, and for caches that use clustering, a second version of the metric that includes (Shard 0-9) in the name measures performance for a single shard in a cache. For example if a cache has 4 shards, Cache Hits is the total amount of hits for the entire cache, and Cache Hits (Shard 3) is just the hits for that shard of the cache.
Try looking for the Error metric while monitoring.
https://learn.microsoft.com/en-us/azure/azure-cache-for-redis/cache-how-to-monitor#available-metrics-and-reporting-intervals
Additionally, retry on TimeoutException, RedisConnectionException, or SocketException, which ensures the client will try to connect again after any of these exceptions. You can read about the best practices around Redis Cache usage in the docs below:
https://learn.microsoft.com/en-us/azure/azure-cache-for-redis/cache-best-practices
https://learn.microsoft.com/en-us/azure/azure-cache-for-redis/cache-best-practices#when-is-it-safe-to-retry
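As a rough illustration of the retry idea, here is a minimal sketch of retry with exponential backoff. It is written in Python rather than .NET, so the built-in TimeoutError and ConnectionError stand in for the exception types named above, and call_with_retry with all of its parameters is a hypothetical helper, not part of any Redis client. Retrying is only appropriate for operations that are safe to repeat.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retry(
    operation: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...] = (TimeoutError, ConnectionError),
    attempts: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, retrying on the given exceptions with doubling delays."""
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            # Out of attempts: let the last exception propagate to the caller.
            if attempt == attempts - 1:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next try.
            sleep(base_delay * (2 ** attempt))
    raise ValueError("attempts must be at least 1")
```

The sleep parameter is injectable only so the delays can be observed or skipped in tests; in normal use the default time.sleep applies.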
Hope it helps.

Make Node/MEANjs Highly Available

I'm probably opening up a can of worms, given how many hundreds of directions can be taken with this, but I want high availability / disaster recovery for my MEANjs servers.
Right now, I have 3 servers:
MongoDB
App (Grunt'ing the main application, this is the front end server)
A third server for other processing on the back-end
So at the moment, if I reboot my MongoDB server (or more realistically, it crashes for some reason), I suddenly see this in my App server terminal:
MongoDB connection error: Error: failed to connect to [172.30.3.30:27017]
[nodemon] app crashed - waiting for file changes before starting...
After MongoDB is back online, nothing happens on the app server until I re-grunt.
What's the best practice for this situation? You can see in the error that I'm using nodemon to monitor changes to the app. I bet upon init I could get my MongoDB server to update a file on the app server within nodemon's view to force a restart? Or is there some other tool I can use for this? Or should I be handling my connections to the db server more gracefully so the app doesn't "crash"?
Is there a way to re-direct to a secondary mongodb in case the primary isn't available? This would be more apt to HA/DR type stuff.
I would like to start with a side note: Given the description in the question and the comments to it, I am not convinced that using AWS is a wise option. A PaaS provider like Heroku, OpenShift or AppFog seems to be more suitable, especially when combined with a MongoDB service provider. Running MongoDB on EBS can be quite a challenge when you are new to MongoDB. And pretty expensive, too, as soon as you need provisioned IOPS.
Note: In the following paragraphs, I simplified a few things for the sake of comprehensibility.
If you insist on running it on your own, however, you have an option. MongoDB itself comes with means of automatic, transparent failover, called a replica set.
A minimal replica set consists of two data-bearing nodes and a so-called arbiter. Write operations go only to the node currently elected "primary", and reads do, too, unless you explicitly allow or request reads to be performed on the current "secondary". The secondary constantly syncs to the primary. If the current primary goes down for some reason, the former secondary is elected primary.
The arbiter is there so that there is always a quorum (qualified majority would be an equivalent term) of members to elect the current secondary to be the new primary. This quorum is mainly important for edge cases, but since you cannot rule out these edge cases, an odd number of members is a hard requirement for a MongoDB replica set (setting aside some special cases).
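To make the quorum arithmetic concrete, here is a small sketch. It assumes every member has exactly one vote, and the function names are hypothetical; it only illustrates the majority rule and is not how MongoDB itself is implemented.

```python
def votes_needed(voting_members: int) -> int:
    """Smallest number of votes that is a strict majority of the set."""
    return voting_members // 2 + 1


def can_elect_primary(voting_members: int, reachable_members: int) -> bool:
    """True if the members that can still see each other form a majority."""
    return reachable_members >= votes_needed(voting_members)
```

With two data-bearing nodes and no arbiter, losing one node leaves 1 of 2 votes, which is not a majority, so no primary can be elected. With the arbiter added, the surviving data node plus the arbiter hold 2 of 3 votes, so the secondary can take over.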
The beauty of this is that almost all drivers, and the Node.js driver for sure, are replica set aware and deal with the failover procedure pretty gracefully. They simply send the reads and writes to the new primary, without any change needed at any other point.
You only need to deal with some cases during the failover process. Without going into much detail, you basically check for certain errors in the corresponding callbacks and redo the operation if you encounter one of those errors and redoing the operation is feasible.
As you might have noticed, the third member, the arbiter, does not hold much data. It is a very lightweight process and can basically run on the cheapest instance you can find.
So you have data replication and automatic, transparent failover with relative ease at the cost of the cheapest VM you can find, since you would need two data bearing nodes anyway if you used any other means.

Syncing clocks on multiple Azure VMs

I have a requirement to write a load test measuring message transmission latencies. In order to simulate a large number of simultaneous users without running into thread contention problems on one box, I'm spinning up multiple servers in Azure.
When I got my first results back, I was a little shocked to see that the results indicated the message was received before it was sent. I immediately realized that, while I had an implicit assumption that all the VMs would have their clocks synced to within milliseconds, that was clearly not the case.
I've spent several hours googling ways to resolve this, and I'm not getting anywhere. One thought was to have each VM query the time on a central server using NetRemoteTOD(), and then establish a per-machine correction factor to be added to the time measured from the local machine's clock. However, when I tried to run that method, I got error 2184, "The service has not been started". I have verified that both the RPC service and the Windows Time service are running on both the client and target machines, and I have not been successful in finding any information indicating what other service needs to be running (or even if the error really means what it seems to mean). (I also get the same error when running between my development desktop and a server on our corporate network. However, I can run it successfully against a PDC on the corporate network, but I can't find a PDC on Azure, since neither machine is part of a domain.)
So, does anyone have any information on what service needs to be started to get NetRemoteTOD (or the Windows NET TIME command, which relies on NetRemoteTOD under the covers) working? Alternatively, does anyone have a suggestion for some other technique to get a consistent time reference across multiple VMs in Azure? (Note, I don't necessarily need their clocks synced; I just need a way to establish a consistent correction factor to reference the times to a common source. Note also that I need sub-second accuracy; probably about 100 msec will do.) Basically, I just need a Windows function or shell command that will get me the time to sub-second accuracy on a given remote server.
Thanks in advance.
PS. Azure servers are running Server 2008 R2 SP1
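For illustration, the per-machine correction factor described above could be computed with the usual NTP-style four-timestamp calculation, sketched below. This does not address the NetRemoteTOD error; it assumes the reference server's receive and send timestamps can be obtained by some other means, and estimate_offset and OffsetEstimate are hypothetical names.

```python
from typing import NamedTuple


class OffsetEstimate(NamedTuple):
    offset: float      # reference clock minus local clock
    round_trip: float  # network time spent in both directions


def estimate_offset(t0: float, t1: float, t2: float, t3: float) -> OffsetEstimate:
    """Estimate the clock offset from one request/response exchange.

    t0: local clock when the request was sent
    t1: reference clock when the request arrived
    t2: reference clock when the response was sent
    t3: local clock when the response arrived
    All four values must use the same unit (e.g. milliseconds).
    """
    offset = ((t1 - t0) + (t2 - t3)) / 2.0
    round_trip = (t3 - t0) - (t2 - t1)
    return OffsetEstimate(offset, round_trip)
```

Adding offset to a local timestamp expresses it on the reference server's clock. The estimate is exact only when the network delay is the same in both directions; otherwise the error is bounded by half the round trip, so exchanges with a small round_trip give the most trustworthy correction.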
