ElasticSearch 6.0 timeout on cluster - node.js

I have 3 different servers with one instance of ES 6.0 on each, and another server with Node.js that queries them.
On each ES server I only changed:
discovery.zen.ping.unicast.hosts: [ LIST_ES_IP ]
discovery.zen.minimum_master_nodes: 2
My problem is that after some time (not well defined), I get timeout errors from the Node.js server. But if I call
curl -XGET 'IP:9200/_cluster/health?pretty'
on that same server, I can see that ES works fine.
If I remove one server from the cluster (commenting out the previous 2 config lines) and query only that one, everything works fine and I never get a timeout.
Do I need to change any other config to make this cluster work?
Do you have any ideas why I get timeouts only in cluster mode?
Thanks in advance,

Apparently the cause is in the elasticsearch-js client, because I re-enabled the cluster but set the client host to
"IP:9200"
and it has been working for 3 hours now.
Before, I had
[ "IP1:9200", "IP2:9200", "IP3:9200" ]
I also tried
[ {host: "IP", port: 9200}, {...} ]
but that timed out too.
So is there no way to have failover if one server goes down?
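For reference, a minimal sketch of how multiple hosts, retries, and sniffing are usually wired up in the legacy elasticsearch-js client; the IPs and option values below are placeholders, not a confirmed fix:

const elasticsearch = require('elasticsearch');

const client = new elasticsearch.Client({
    hosts: ['IP1:9200', 'IP2:9200', 'IP3:9200'], // all three ES nodes
    requestTimeout: 60000,                       // ms before a request times out
    maxRetries: 3,                               // retry a failed request on another host
    sniffOnConnectionFault: true                 // re-discover live nodes when a connection dies
});

// quick sanity check against the cluster
client.cluster.health({ level: 'cluster' })
    .then(health => console.log(health.status))
    .catch(err => console.error(err));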

Related

Random connection errors to MS SQL from nodeJS app

We have an AWS server running some Node.js services. The services connecting to MS SQL are randomly crashing with the message "Failed to connect to databaseserver:1433 - Could not connect (sequence)".
We are running on:
App server:
Linux Ubuntu 14.04
AWS m5
NodeJS: 8.11.2
Services are using the mssql package, latest version (4.3.0), which includes tedious 2.7.1.
DB server:
Windows Server 2012
SQL Server 2012
Throughput: about 300 rpm; the error also happens when throughput is lower (about 20 rpm).
The app is running in a cluster through PM2 (4 instances). We see the error happening on all 4 at the same time, but sometimes also on only 1 or 2 instances.
What we tried:
Upgrading to the alpha version of mssql with tedious 3.0.1. Did not make a difference.
Upgrading from an Amazon M4 machine to an M5 machine with enhanced networking.
Changing the pool settings in the app. We tried setting min connections to 0 as well as low/high values, and max to low/high values, to no avail (a sketch of these settings follows this list).
Duplicating the server to a new machine.
Setting idleTimeoutMillis to 1 second.
Pinging the DB server to see if there is a connection problem, but we see no unusual pings when the error happens.
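For reference, a sketch of how such pool settings are typically passed to mssql's ConnectionPool config; the values and credentials below are illustrative placeholders:

const sql = require('mssql');

const config = {
    server: 'databaseserver',     // placeholder host
    database: 'mydb',             // placeholder
    user: 'appuser',              // placeholder
    password: 'secret',           // placeholder
    pool: {
        min: 0,                   // tried 0 as well as higher values
        max: 10,                  // tried low and high values
        idleTimeoutMillis: 1000   // the 1-second setting mentioned above
    },
    options: {
        encrypt: false            // on-premises SQL Server 2012
    }
};

const pool = new sql.ConnectionPool(config);
pool.connect()
    .then(() => console.log('pool connected'))
    .catch(err => console.error(err));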
Connection on app startup:
App.sqlConnection = new App.SQL.ConnectionPool(config, function(err) {
    if (err) {
        Log.error(err);
        process.exit(1);
    }
    App.sqlConnection.on('error', err => {
        Log.error(`There was a connection err : ${err}`);
        process.exit(1);
    });
});
Running a request:
var request = new App.SQL.Request(App.sqlConnection);
request.query(sQuery, function(err, results)
{
});
Errors are caught by the "on error" handler.
The error happens randomly across services; some have more instances of the error than others.
We are running out of options. Any idea if we can see more detailed errors?
I have a couple of suggestions.
First, how sure are you that these errors are actually a problem? If your code simply retries, instead of exiting, are the connections stable afterwards, or can a connection drop in the middle of a query?
(Connections dropping in the middle of queries are obviously not good, but random failures on connection, that can be fixed by retries, are the best kind of problem to have IMHO.)
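As a rough sketch of that retry idea, reusing the App.SQL / Log names from the question (connectWithRetry is a hypothetical helper, not part of mssql or the original app):

function connectWithRetry(config, attemptsLeft, done) {
    const pool = new App.SQL.ConnectionPool(config, err => {
        if (err) {
            Log.error(`Connect failed (${attemptsLeft} attempts left): ${err}`);
            if (attemptsLeft > 0) {
                // back off briefly and try again instead of calling process.exit(1)
                return setTimeout(() => connectWithRetry(config, attemptsLeft - 1, done), 2000);
            }
            return done(err);
        }
        done(null, pool);
    });
}

connectWithRetry(config, 5, (err, pool) => {
    if (err) {
        Log.error(`Giving up after retries: ${err}`);
        process.exit(1);
    }
    App.sqlConnection = pool;
});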
Ignoring the potential in-code fix, I'm wondering about when you say you "duplicated the server to a new machine": did you launch a new AMI using the latest Windows Server 2012, or did you image and clone? If your database server is a couple of years old, you might actually be running outdated network drivers in your instance, which could give you some hiccups.
If you want to explore that, you could attempt rebuilding the entire database server from scratch on a newly launched AMI. Alternatively, you can upgrade the PV driver, network adapter, and EC2Config on your existing instance; you can find the instructions at the following links:
https://docs.aws.amazon.com/AWSEC2/latest/WindowsGuide/Upgrading_PV_drivers.html#aws-pv-upgrade
https://docs.aws.amazon.com/AWSEC2/latest/WindowsGuide/sriov-networking.html#enable-enhanced-networking
https://docs.aws.amazon.com/AWSEC2/latest/WindowsGuide/UsingConfig_Install.html

CouchDB gives no_majority when trying to update _security

After a CouchDB upgrade it was no longer possible to create new databases or update the _security of old databases.
I've just run into this with CouchDB 2.1.1. My problem was that the security object I was attempting to pass in was malformed.
From https://issues.apache.org/jira/browse/COUCHDB-2326,
Attempts to write security objects where the "admins" or "members" values are malformed will result in an HTTP 500 response with the following body:
{"error":"error","reason":"no_majority"}
This should really be an HTTP 400 response with a "bad_request" error value and a different error reason.
To reproduce:
$ curl -X PUT http://localhost:15984/test
{"ok":true}
$ curl -X PUT http://localhost:15984/test/_security -d '{"admins":[]}'
{"error":"error","reason":"no_majority"}
Yet another reason could be that old nodes were lingering in the _membership configuration.
I.e., _membership showed:
{
    "all_nodes": [
        "couchdb@localhost"
    ],
    "cluster_nodes": [
        "couchdb@127.0.0.1",
        "couchdb@localhost"
    ]
}
when it should show
{
    "all_nodes": [
        "couchdb@localhost"
    ],
    "cluster_nodes": [
        "couchdb@localhost"
    ]
}
Deleting the bad cluster node, as described in the docs, helped.
Note that the _nodes database might not be available on port 5984, but only on the node-local port 5986.
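A rough sketch of that deletion from Node, assuming Node 18+ global fetch; the node name, port, and credentials are placeholders, and the authoritative steps are in the CouchDB cluster docs:

const base = 'http://localhost:5986/_nodes';
const staleNode = 'couchdb@127.0.0.1';
const auth = { Authorization: 'Basic ' + Buffer.from('admin:password').toString('base64') };

(async () => {
    // fetch the stale node document to get its current _rev
    const doc = await (await fetch(`${base}/${encodeURIComponent(staleNode)}`, { headers: auth })).json();
    // delete it so it no longer shows up under cluster_nodes in /_membership
    const res = await fetch(`${base}/${encodeURIComponent(staleNode)}?rev=${doc._rev}`, {
        method: 'DELETE',
        headers: auth
    });
    console.log(await res.json());
})();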
For a PUT to /db/_security, if the user is not a db or server admin, the response is HTTP status 500 with {"error":"error","reason":"no_majority"}, but the server logs are more informative, including: {forbidden,<<"You are not a db or server admin.">>}
One reason could be that the couchdb process reached its maximum number of open files, which leads to read errors and (wrongly) no_majority errors.
Another reason could be that the server switched from a single-node configuration to a multiple-node configuration (for example during an upgrade).
Changing the number of nodes back to 1 helped.

Service Fabric Application PackageDeployment Operation Time Out exception

I have a Service Fabric cluster with 3 nodes created on 3 systems, and they are interconnected; I am able to connect to each of the nodes. The nodes are created on Windows Server, and these Windows Server VMs are on-premises.
I am trying to manually deploy my package to my cluster / one of the nodes, and I am getting an Operation Timeout exception. I used the commands below for deployment.
Service Fabric PowerShell commands:
Copy-ServiceFabricApplicationPackage -ApplicationPackagePath 'c:\sample\etc' -ApplicationPackagePathInImageStore 'abc.app.portaltype'
After executing the above command it runs for 2-3 minutes and throws an Operation Timeout exception. My package size is almost 250 MB and it contains approximately 15,000 files. I then passed an extra -TimeoutSec 600 (10 minutes) parameter explicitly to the command, and it executed successfully and copied the package to the Service Fabric image store.
Register-ServiceFabricApplicationType -ApplicationPathInImageStore 'abc.app.portaltype'
After the Copy-ServiceFabricApplicationPackage command finished, I executed the Register-ServiceFabricApplicationType command above to register my application type in the cluster, but it also throws an Operation Timeout exception. I then passed -TimeoutSec 600 (10 minutes) explicitly to that command as well, but no luck; it throws the same operation timeout exception.
Just to make sure whether this operation timeout issue is caused by the number of files in the package or not, I created a simple, empty Service Fabric ASP.NET Core app, packaged it, and tried to deploy it to the same cluster using the above commands; it deployed within a fraction of a second and works smoothly.
Does anybody have any idea how to overcome this Service Fabric operation timeout issue?
How should the operation timeout issue be handled if the package contains a large set of files?
Any help/suggestion would be very much appreciated.
Thanks,
If this is taking longer than the 10-minute default max, it's probably one of the following issues:
Large application packages (>100s of MB)
Slow network connections
A large number of files within the application package (>1000s).
The following workarounds should help you.
Add the following settings to your cluster config:
"fabricSettings": [
{
"name": "NamingService",
"parameters": [
{
"name": "MaxOperationTimeout",
"value": "3600"
},
]
}
]
Also add:
"fabricSettings": [
{
"name": "EseStore",
"parameters": [
{
"name": "MaxCursors",
"value": "32768"
},
]
}
]
There are a couple of additional features which are currently rolling out. For these to be present and functional, you need to make sure that the client is at least 2.4.28 and the runtime of your cluster is at least 5.4.157. If you're staying up to date, these should already be present in your environment.
For Register-ServiceFabricApplicationType you can specify the -Async flag, which handles the registration asynchronously, reducing the need for the timeout to cover just the time necessary to send the command rather than the processing of the whole application package. You can then query the status of the registration with Get-ServiceFabricApplicationType. 5.5 fixes some issues with these commands, so if they aren't working for you, you'll have to wait for that release to hit your environment.

"Server x timed out" during MongoDB aggregation

I have a script that periodically runs aggregation on a mongodb collection. As the dataset has grown, the amount of time it takes to aggregate has also grown. My aggregation script has recently stopped working consistently, and the error logs show:
error: { [MongoError: server <x> timed out]
name: 'MongoError',
message: 'server <x> timed out' }
I've tried debugging this, and the only pattern I can find is that this timeout seems to occur only when the aggregation takes longer than 2 minutes (it times out right around 2m). Does anyone have additional debugging tips for this? The 2-minute thing gives me the impression that I just need to configure some timeout somewhere, but I can't figure out where, or whether I'm just falling into a red-herring trap.
About the system configuration: this aggregation script is a Node.js (v5.9.1) application running in an Alpine-based Docker (v1.9.1) container. It uses the MongoDB Node driver (v2.1.19) against a single MongoDB server (though this is also happening in a separate environment with a replica set) running mongod (v3.2.6).
I got the same problem with a time-based log aggregation. I think I have the solution for you.
I found that the option socketTimeoutMS is responsible for this.
Check the default socketTimeoutMS value in your driver's mongo_client.js. For me it was 2 minutes (mongodb module version 2.1.18).
So just add this option to your URL:
mongodb://localhost:27017/test?maxPoolSize=2&socketTimeoutMS=600000
This sets the socket timeout to 10 minutes, which did the trick for me.
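A minimal sketch with the 2.x Node driver mentioned in the question; the collection name and pipeline are placeholders:

const MongoClient = require('mongodb').MongoClient;

// socketTimeoutMS raised via the connection string so long aggregations are not cut off
const url = 'mongodb://localhost:27017/test?maxPoolSize=2&socketTimeoutMS=600000';

MongoClient.connect(url, (err, db) => {
    if (err) return console.error(err);
    db.collection('events')
        .aggregate([{ $group: { _id: '$type', count: { $sum: 1 } } }])
        .toArray((aggErr, results) => {
            if (aggErr) return console.error(aggErr);
            console.log(results);
            db.close();
        });
});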

Using memcached failover servers in nodejs app

I'm trying to set up a robust memcached configuration for a nodejs app with the node-memcached driver, but it does not seem to use the specified failover servers when one server dies.
My local experiment goes as follows:
shell:
memcached -p 11212

node:
MC = require('memcached')
c = new MC('localhost:11211',   //this process does not exist
           {failOverServers: ['localhost:11212']})
c.get('foo', console.log)       //this will eventually time out
c.get('foo', console.log)       //repeat 5 or 6 times to exceed the retries number
//wait until all the connection errors appear in the console
//at this point, the failover server should be in use
c.get('foo', console.log)       //this still times out :(
Any ideas about what we might be doing wrong?
It seems that the failover feature is somewhat buggy in node-memcached.
To enable failover you must set the remove option:
c = new MC('localhost:11211',   //this process does not exist
           {failOverServers: ['localhost:11212'],
            remove: true})
Unfortunately, this is not going to work because of the following error:
[depricated] HashRing#replaceServer is removed.
[depricated] the API has no replacement
That is, when trying to replace a dead server with a replacement from the failover list, node-memcached outputs a deprecation error from the HashRing library (which, in turn, is maintained by the same author of node-memcached). IMHO, feel free to open a bug :-)
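For completeness, a sketch of the options discussed above combined in one client; the values are illustrative, and given the deprecation issue this still may not fail over cleanly:

const Memcached = require('memcached');

const c = new Memcached('localhost:11211', {
    failOverServers: ['localhost:11212'], // only consulted once a server is removed
    remove: true,                         // drop dead servers from the hash ring
    failures: 2,                          // failed attempts before a server is considered dead
    timeout: 1000                         // ms before a connection attempt is considered failed
});

c.get('foo', (err, value) => {
    if (err) return console.error(err);
    console.log(value);
});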
This happens when your Node.js server is not getting any session ID from memcached.
Please check that memcached sessions are configured properly in your php.ini file:
session.save_handler = memcache
session.save_path = "tcp://localhost:11212"
