Running into RedisTimeoutException and other exceptions with Redisson and Azure Redis Cache

I am seeing a lot of timeout exceptions and "Can't add slave" exceptions.
Steps to reproduce or test case: intermittent
Redis version: Azure Redis Cache with 5 shards, 4.0.14, 3.2.7
Redisson version: 3.11.4
Redisson configuration: default clustered config with the following overrides (a sketch of the equivalent Redisson Config follows the list):
REDIS_ENABLED | true
REDIS_KEEP_ALIVE | true
REDIS_THREADS | 512
REDIS_NETTY_THREADS | 1024
REDIS_MASTER_CONNECTION_MINIMUM_IDLE_SIZE | 5
REDIS_MASTER_CONNECTION_POOL_SIZE | 10
REDIS_SLAVE_CONNECTION_MINIMUM_IDLE_SIZE | 5
REDIS_SLAVE_CONNECTION_POOL_SIZE | 10
REDIS_TIMEOUT | 1000
REDIS_RETRY_INTERVAL | 500
REDIS_TCP_NO_DELAY | true
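For reference, here is a rough sketch (not the actual application code) of how these overrides would map onto Redisson's cluster configuration in Java; the node address below is a placeholder, and REDIS_ENABLED is assumed to be an application-level flag rather than a Redisson setting:
import org.redisson.Redisson;
import org.redisson.api.RedissonClient;
import org.redisson.config.Config;

public class RedissonClientFactory {
    public static RedissonClient create() {
        Config config = new Config();
        config.setThreads(512);        // REDIS_THREADS
        config.setNettyThreads(1024);  // REDIS_NETTY_THREADS
        config.useClusterServers()
              // Placeholder endpoint; the real Azure cache address is not shown in the question.
              .addNodeAddress("rediss://<cache-name>.redis.cache.windows.net:6380")
              .setKeepAlive(true)                      // REDIS_KEEP_ALIVE
              .setTcpNoDelay(true)                     // REDIS_TCP_NO_DELAY
              .setMasterConnectionMinimumIdleSize(5)   // REDIS_MASTER_CONNECTION_MINIMUM_IDLE_SIZE
              .setMasterConnectionPoolSize(10)         // REDIS_MASTER_CONNECTION_POOL_SIZE
              .setSlaveConnectionMinimumIdleSize(5)    // REDIS_SLAVE_CONNECTION_MINIMUM_IDLE_SIZE
              .setSlaveConnectionPoolSize(10)          // REDIS_SLAVE_CONNECTION_POOL_SIZE
              .setTimeout(1000)                        // REDIS_TIMEOUT (ms)
              .setRetryInterval(500);                  // REDIS_RETRY_INTERVAL (ms)
        return Redisson.create(config);
    }
}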
I see the following exceptions in the log:
exception: {
class: org.redisson.client.RedisConnectionException
thrownfrom: unknown
}
level: ERROR
logger_name: org.redisson.cluster.ClusterConnectionManager
message: Can't add slave: rediss://:15002
process: 6523
stack_trace: org.redisson.client.RedisTimeoutException: Command execution timeout for command: (READONLY), params: [], Redis client: [addr=rediss://:15002]
at org.redisson.client.RedisConnection.lambda$async$1(RedisConnection.java:207)
at io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:680)
at io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:755)
at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:483)
... 2 common frames omitted
Wrapped by: org.redisson.client.RedisConnectionException: Unable to connect to Redis server: /:15002
at org.redisson.connection.pool.ConnectionPool$1.lambda$run$0(ConnectionPool.java:160)
at org.redisson.misc.RedissonPromise.lambda$onComplete$0(RedissonPromise.java:183)
at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:577)
at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:551)
at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:490)
at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:615)
at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:608)
at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117)
at org.redisson.misc.RedissonPromise.tryFailure(RedissonPromise.java:96)
at org.redisson.connection.pool.ConnectionPool.promiseFailure(ConnectionPool.java:330)
at org.redisson.connection.pool.ConnectionPool.lambda$createConnection$1(ConnectionPool.java:296)
at org.redisson.misc.RedissonPromise.lambda$onComplete$0(RedissonPromise.java:183)
at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:577)
at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:570)
at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:549)
at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:490)
at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:615)
at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:608)
at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117)
at org.redisson.misc.RedissonPromise.tryFailure(RedissonPromise.java:96)
at org.redisson.client.RedisClient$2$1.run(RedisClient.java:240)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:510)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:518)
at i.n.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1044)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:748)
stack_trace: org.redisson.client.RedisTimeoutException: Unable to get connection! Try to increase 'nettyThreads' and/or connection pool size settingsNode source: NodeSource [slot=15393, addr=redis://:15007, redisClient=null, redirect=MOVED, entry=null], command: (PSETEX), params: [some key, 3600000, PooledUnsafeDirectByteBuf(ridx: 0, widx: 457, cap: 512)] after 0 retry attempts
at org.redisson.command.RedisExecutor$2.run(RedisExecutor.java:209)
at io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:680)
at io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:755)
at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:483)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:748)
level: ERROR
logger_name: org.redisson.cluster.ClusterConnectionManager
message: Can't add master: rediss://:15007 for slot ranges: [[15019-15564], [12288-13652], [4096-5461]]
process: 6574
thread_name: redisson-netty-2-718
timestamp: 2019-10-29 22:32:15.592
My logs are flooded with these exceptions, and when I log in to the Azure portal I see the CPU metric for Redis spiked to 100%. Any help is appreciated.

Related

How to capture data changes in YugabyteDB?

terminal 1:
postgres=# \c yugastore
You are now connected to database "yugastore" as user "postgres".
yugastore=# select count(*) from yugastore.users;
count
-------
2500
(1 row)
yugastore=# delete from yugastore.users;
DELETE 2500
(After starting insertion script at terminal 2)
yugastore=# select count(*) from yugastore.users;
ERROR: Query error: Restart read required at: { read: { physical: 1580057095845877 } local_limit: { physical: 1580057095880226 } global_limit: <min> in_txn_limit: <max> serial_no: 0 }
yugastore=# select count(*) from yugastore.users;
ERROR: Query error: Restart read required at: { read: { physical: 1580057098605539 } local_limit: { physical: 1580057098715271 } global_limit: <min> in_txn_limit: <max> serial_no: 0 }
terminal 2:
yugastore.users table is created and being populated.
time: 11:44:31.796 cumulative records: 100
time: 11:44:32.608 cumulative records: 200
time: 11:44:32.909 cumulative records: 300
time: 11:44:33.213 cumulative records: 400
time: 11:44:33.661 cumulative records: 500
...
time: 11:46:24.710 cumulative records: 18900
time: 11:46:25.137 cumulative records: 19000
time: 11:46:25.606 cumulative records: 19100
terminal 3:
[root@srvr0 ~]# java -jar ./yb_cdc_connector.jar --table_name yugastore.users --master_addrs 127.0.0.1:7100 --log_only
[2020-01-26 11:45:57,844] INFO Starting CDC Kafka Connector... (org.yb.cdc.Main:28)
2020-01-26 11:45:58,201 [INFO|org.yb.cdc.KafkaConnector|KafkaConnector] Creating new YB client...
[2020-01-26 11:46:02,853] INFO Discovered tablet YB Master for table YB Master with partition ["", "") (org.yb.client.AsyncYBClient:1593)
[2020-01-26 11:46:03,724] ERROR [Peer fakeUUID -> 127.0.0.1:9100] Tablet server sent error Invalid argument (yb/rpc/yb_rpc.cc:411): Call on service yb.cdc.CDCService received from Connection (0x0000000005b8e2d0) server 127.0.0.1:46926 => 127.0.0.1:9100 with an invalid method name: CreateCDCStream (org.yb.client.TabletClient:380)
2020-01-26 11:46:03,725 [ERROR|org.yb.cdc.Main|Main] Application ran into error:
org.yb.client.NonRecoverableException: [Peer fakeUUID -> 127.0.0.1:9100] Tablet server sent error Invalid argument (yb/rpc/yb_rpc.cc:411): Call on service yb.cdc.CDCService received from Connection (0x0000000005b8e2d0) server 127.0.0.1:46926 => 127.0.0.1:9100 with an invalid method name: CreateCDCStream
at org.yb.client.TabletClient.decode(TabletClient.java:379)
at org.yb.client.TabletClient.decode(TabletClient.java:98)
at org.jboss.netty.handler.codec.replay.ReplayingDecoder.callDecode(ReplayingDecoder.java:500)
at org.jboss.netty.handler.codec.replay.ReplayingDecoder.messageReceived(ReplayingDecoder.java:435)
at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
at org.yb.client.TabletClient.handleUpstream(TabletClient.java:608)
at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
at org.jboss.netty.handler.timeout.ReadTimeoutHandler.messageReceived(ReadTimeoutHandler.java:184)
at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
at org.yb.client.AsyncYBClient$TabletClientPipeline.sendUpstream(AsyncYBClient.java:2002)
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268)
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255)
at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318)
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Update 1:
After installing YugabyteDB 2.0.10.0, the "Restart read required" error is resolved, but no change logs are printed:
Error logs:
[root@srvr0 ~]# java -jar ./yb-cdc-connector.jar --table_name yugastore.users --master_addrs 127.0.0.1:7100 --stream_id 1 --log_only
[2020-01-28 08:27:31,101] INFO Starting CDC Kafka Connector... (org.yb.cdc.Main:28)
2020-01-28 08:27:31,154 [INFO|org.yb.cdc.KafkaConnector|KafkaConnector] Creating new YB client...
[2020-01-28 08:27:32,288] INFO Discovered tablet YB Master for table YB Master with partition ["", "") (org.yb.client.AsyncYBClient:1593)
2020-01-28 08:27:32,597 [INFO|org.yb.cdc.KafkaConnector|KafkaConnector] Polling for new tablet ce5115a780224cd0ab8a8e9c1a46b961
2020-01-28 08:27:32,604 [INFO|org.yb.cdc.KafkaConnector|KafkaConnector] Polling for new tablet cca5b30bb7784ae2a8796097d6fd5b2f
2020-01-28 08:27:32,694 [ERROR|org.yb.cdc.Poller|Poller] Invalid Request
2020-01-28 08:27:32,695 [ERROR|org.yb.cdc.Poller|Poller] Invalid Request
[root@srvr0 ~]#
Please help me in resolving the issues.
The read restart issue that you see with the select count(*) query has been fixed, and the fix is available from version 2.0.5.2: https://github.com/yugabyte/yugabyte-db/commit/3212616e351647436f808d4963d229e7881996c8.
Similarly, it seems like you are using an older, deprecated version of the CDC connector. You can get the connector using:
wget -O yb-cdc-connector.jar https://github.com/yugabyte/yb-kafka-connector/blob/master/yb-cdc/yb-cdc-connector.jar?raw=true
And then run:
java -jar ./yb-cdc-connector.jar --table_name yugastore.users --master_addrs 127.0.0.1:7100 --log_only

Kafka Connect Sink to Cassandra :: java.lang.VerifyError: Bad return type

I'm trying to set up a Kafka Connect sink to collect data from a topic into a Cassandra table using the DataStax connector: https://downloads.datastax.com/#akc
I'm running a standalone worker directly on the broker, running Kafka 0.10.2.2-1:
name=dse-sink
connector.class=com.datastax.kafkaconnector.DseSinkConnector
tasks.max=1
datastax-java-driver.advanced.protocol.version = V4
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.kafka.connect.storage.StringConverter
key.converter.schemas.enable=false
value.converter.schemas.enable=false
internal.key.converter=org.apache.kafka.connect.json.JsonConverter
internal.value.converter=org.apache.kafka.connect.json.JsonConverter
internal.key.converter.schemas.enable=false
internal.value.converter.schemas.enable=false
plugin.path=/usr/share/java/kafka-connect-dse/kafka-connect-dse-1.2.1.jar
topics=connect-test
contactPoints=172.16.0.48
loadBalancing.localDc=datacenter1
port=9042
ignoreErrors=true
topic.connect-test.cdrs.test.mapping= kafkakey=key, value=value
topic.connect-test.cdrs.test.consistencyLevel=LOCAL_QUORUM
But I get the following error:
[2019-12-23 16:58:43,165] ERROR Task dse-sink-0 threw an uncaught and unrecoverable exception (org.apache.kafka.connect.runtime.WorkerTask)
java.lang.VerifyError: Bad return type
Exception Details:
Location:
com/fasterxml/jackson/databind/cfg/MapperBuilder.streamFactory()Lcom/fasterxml/jackson/core/TokenStreamFactory; #7: areturn
Reason:
Type 'com/fasterxml/jackson/core/JsonFactory' (current frame, stack[0]) is not assignable to 'com/fasterxml/jackson/core/TokenStreamFactory' (from method signature)
Current Frame:
bci: #7
flags: { }
locals: { 'com/fasterxml/jackson/databind/cfg/MapperBuilder' }
stack: { 'com/fasterxml/jackson/core/JsonFactory' }
Bytecode:
0x0000000: 2ab4 0002 b600 08b0
at com.fasterxml.jackson.databind.json.JsonMapper.builder(JsonMapper.java:114)
at com.datastax.dsbulk.commons.codecs.json.JsonCodecUtils.getObjectMapper(JsonCodecUtils.java:36)
at com.datastax.kafkaconnector.codecs.CodecSettings.init(CodecSettings.java:131)
at com.datastax.kafkaconnector.state.LifeCycleManager.lambda$buildInstanceState$9(LifeCycleManager.java:423)
at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
at java.util.HashMap$ValueSpliterator.forEachRemaining(HashMap.java:1625)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
at com.datastax.kafkaconnector.state.LifeCycleManager.buildInstanceState(LifeCycleManager.java:457)
at com.datastax.kafkaconnector.state.LifeCycleManager.lambda$startTask$0(LifeCycleManager.java:106)
at java.util.concurrent.ConcurrentHashMap.computeIfAbsent(ConcurrentHashMap.java:1660)
at com.datastax.kafkaconnector.state.LifeCycleManager.startTask(LifeCycleManager.java:101)
at com.datastax.kafkaconnector.DseSinkTask.start(DseSinkTask.java:74)
at org.apache.kafka.connect.runtime.WorkerSinkTask.initializeAndStart(WorkerSinkTask.java:244)
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:145)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:139)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:182)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
No additional errors on the Cassandra or Kafka side.
I see an active connection on the Cassandra node, but nothing arrives in the keyspace.
Any idea why?
IMHO this is a problem caused by the use of the JSON internal converters with BigDecimal data (see the related SO question). As described in the following blog post, internal.key.converter and internal.value.converter have been deprecated since Kafka 2.0 and shouldn't be set explicitly. Can you comment out all internal.* properties and retry?
P.S. Also see how JSON + Decimal has changed in Kafka 2.4

failed for get of /hbase/hbaseid, code = CONNECTIONLOSS, retries = 6

I am trying to connect a Spark application to HBase. Below is the configuration I am using:
val conf = HBaseConfiguration.create()
conf.set("hbase.master", "localhost:16010")
conf.setInt("timeout", 120000)
conf.set("hbase.zookeeper.quorum", "2181")
val connection = ConnectionFactory.createConnection(conf)
and below are the 'jps' details:
5808 ResourceManager
8150 HMaster
8280 HRegionServer
5131 NameNode
8076 HQuorumPeer
5582 SecondaryNameNode
2798 org.eclipse.equinox.launcher_1.4.0.v20161219-1356.jar
8623 Jps
5951 NodeManager
5279 DataNode
I have also tried with hbase master 16010.
I am getting the below error:
19/09/12 21:49:00 WARN ClientCnxn: Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.SocketException: Invalid argument
at sun.nio.ch.Net.connect0(Native Method)
at sun.nio.ch.Net.connect(Net.java:454)
at sun.nio.ch.Net.connect(Net.java:446)
at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:648)
at org.apache.zookeeper.ClientCnxnSocketNIO.registerAndConnect(ClientCnxnSocketNIO.java:277)
at org.apache.zookeeper.ClientCnxnSocketNIO.connect(ClientCnxnSocketNIO.java:287)
at org.apache.zookeeper.ClientCnxn$SendThread.startConnect(ClientCnxn.java:1024)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1060)
19/09/12 21:49:00 WARN ReadOnlyZKClient: 0x1e3ff233 to 2181:2181 failed for get of /hbase/hbaseid, code = CONNECTIONLOSS, retries = 4
19/09/12 21:49:01 INFO ClientCnxn: Opening socket connection to server 2181/0.0.8.133:2181. Will not attempt to authenticate using SASL (unknown error)
19/09/12 21:49:01 ERROR ClientCnxnSocketNIO: Unable to open socket to 2181/0.0.8.133:2181
Looks like there is a problem connecting to ZooKeeper.
First, check that ZooKeeper is started on your local host on port 2181:
netstat -tunelp | grep 2181 | grep -i LISTEN
tcp6 0 0 :::2181 :::* LISTEN
In your conf, the hbase.zookeeper.quorum property must contain the IP of your ZooKeeper host, not the port (the port goes in hbase.zookeeper.property.clientPort).
My HBase connector is built with:
val conf = HBaseConfiguration.create()
conf.set("hbase.zookeeper.quorum", "10.80.188.65")
conf.set("hbase.master", "10.80.188.64:60000")
conf.set("hbase.zookeeper.property.clientPort", "2181")
conf.set("zookeeper.znode.parent", "/hbase-unsecure")
val connection = ConnectionFactory.createConnection(conf)

NodeJS request timeouts with concurrency 100

I have two machines, one "server" and one "client". Both are CentOS6 with NodeJS v5.8.0.
The server runs the following program:
const AppPort = 8080;
var app = require('express')();
var logger = require('log4js').getLogger();
var onFinished = require('on-finished');
var uid = require('uid');
var reqCnt = 0;
var reqFin = 0;
app.get('/', function(req, res) {
onFinished(req, function() {
reqFin++;
var ts2 = (new Date()).getTime();
logger.info(`uid=${req.uid}, dt=${ts2-req.ts1}`);
});
req.ts1 = (new Date()).getTime();
req.uid = uid();
reqCnt++;
logger.info(`ReqCnt=${reqCnt}, fins=${reqFin}`);
res.send("This is XML");
});
app.listen(AppPort);
Its only purpose is to return the "This is XML" string and measure the time taken to fulfill each request.
On the "client" machine I run the following program:
const AppPort = 10000;
var onFinished = require('on-finished');
var async = require('async');
var request = require('request');
var logger = require('log4js').getLogger();
var app = require('express')();
var fs = require('fs');
var util = require('util');
url = "http://my-server";
var errCnt = 0;
var okCnt = 0;
var active2 = 0;
setInterval(function() {
var errFrac = Math.floor(errCnt/(okCnt+errCnt)*100);
logger.info(`${okCnt},${errCnt},${active2},${errFrac}`);
}, 1000);
app.get('/test', function(req,res) {
onFinished(res, function() {
active2--;
});
active2++;
var ts1 = (new Date()).getTime();
request(url, {timeout: 1000}, function(err, response, body ) {
var ts2 = (new Date()).getTime();
var dt = ts2-ts1;
if ( err ) {
errCnt += 1;
logger.error(`Error: ${err}, dt=${dt}, errCnt=${errCnt}`);
res.send(`Error: ${err}`);
}
else {
okCnt += 1;
logger.info(`OK: ${url}`);
res.send(`OK: ${body}`);
}
});
});
var http = app.listen(AppPort);
logger.info(`Listening on ${AppPort}, pid=${process.pid}`);
This "client" code listens by itself on port 10000 and makes request to "server" machine to get "This is XML" string. This data is transferred back to "client"'s client.
I load-test my client code with siege:
siege -v -r 100 -c 100 http://my-client:10000/test
Almost immediately I start to get ETIMEDOUT errors:
[2016-03-15 18:17:05.155] [ERROR] [default] - Error: Error: ETIMEDOUT, dt=1028, errCnt=3
[2016-03-15 18:17:05.156] [ERROR] [default] - Error: Error: ETIMEDOUT, dt=1028, errCnt=4
[2016-03-15 18:17:05.156] [ERROR] [default] - Error: Error: ETIMEDOUT, dt=1027, errCnt=5
[2016-03-15 18:17:05.157] [ERROR] [default] - Error: Error: ETIMEDOUT, dt=1027, errCnt=6
[2016-03-15 18:17:05.157] [ERROR] [default] - Error: Error: ETIMEDOUT, dt=1027, errCnt=7
[2016-03-15 18:17:05.157] [ERROR] [default] - Error: Error: ETIMEDOUT, dt=1027, errCnt=8
[2016-03-15 18:17:05.158] [ERROR] [default] - Error: Error: ETIMEDOUT, dt=1027, errCnt=9
[2016-03-15 18:17:05.160] [ERROR] [default] - Error: Error: ETIMEDOUT, dt=1029, errCnt=10
[2016-03-15 18:17:05.160] [ERROR] [default] - Error: Error: ETIMEDOUT, dt=1028, errCnt=11
[2016-03-15 18:17:05.161] [ERROR] [default] - Error: Error: ETIMEDOUT, dt=1028, errCnt=12
Also, though much less frequently, getaddrinfo errors appear:
Error: Error: getaddrinfo ENOTFOUND {my-server-domain-here}:8080, dt=2, errCnt=4478
However, all requests to the "server" are processed in less than 3 milliseconds (dt values) on the server itself:
[2016-03-15 18:19:13.847] [INFO] [default] - uid=66ohx90, dt=1
[2016-03-15 18:19:13.862] [INFO] [default] - ReqCnt=5632, fins=5631
[2016-03-15 18:19:13.862] [INFO] [default] - uid=j8mpxdm, dt=0
[2016-03-15 18:19:13.865] [INFO] [default] - ReqCnt=5633, fins=5632
[2016-03-15 18:19:13.866] [INFO] [default] - uid=xcetqyj, dt=1
[2016-03-15 18:19:13.877] [INFO] [default] - ReqCnt=5634, fins=5633
[2016-03-15 18:19:13.877] [INFO] [default] - uid=i5qnbit, dt=0
[2016-03-15 18:19:13.895] [INFO] [default] - ReqCnt=5635, fins=5634
[2016-03-15 18:19:13.895] [INFO] [default] - uid=hpdmxpg, dt=1
[2016-03-15 18:19:13.930] [INFO] [default] - ReqCnt=5636, fins=5635
[2016-03-15 18:19:13.930] [INFO] [default] - uid=8g3t8md, dt=0
[2016-03-15 18:19:13.934] [INFO] [default] - ReqCnt=5637, fins=5636
[2016-03-15 18:19:13.934] [INFO] [default] - uid=8rwkad6, dt=0
[2016-03-15 18:19:14.163] [INFO] [default] - ReqCnt=5638, fins=5637
[2016-03-15 18:19:14.165] [INFO] [default] - uid=1sh2frd, dt=2
[2016-03-15 18:19:14.169] [INFO] [default] - ReqCnt=5639, fins=5638
[2016-03-15 18:19:14.170] [INFO] [default] - uid=comn76k, dt=1
[2016-03-15 18:19:14.174] [INFO] [default] - ReqCnt=5640, fins=5639
[2016-03-15 18:19:14.174] [INFO] [default] - uid=gj9e0fm, dt=0
[2016-03-15 18:19:14.693] [INFO] [default] - ReqCnt=5641, fins=5640
[2016-03-15 18:19:14.693] [INFO] [default] - uid=x0yw66n, dt=0
[2016-03-15 18:19:14.713] [INFO] [default] - ReqCnt=5642, fins=5641
[2016-03-15 18:19:14.714] [INFO] [default] - uid=e2cumjv, dt=1
[2016-03-15 18:19:14.734] [INFO] [default] - ReqCnt=5643, fins=5642
[2016-03-15 18:19:14.735] [INFO] [default] - uid=34e0ohl, dt=1
[2016-03-15 18:19:14.747] [INFO] [default] - ReqCnt=5644, fins=5643
[2016-03-15 18:19:14.749] [INFO] [default] - uid=34aau79, dt=2
So the problem is not that the "server" takes too long to process requests; the problem is on the client side.
In NodeJS 5.8 the globalAgent looks like the following:
console.log(require('http').globalAgent)
Agent {
domain: null,
_events: { free: [Function] },
_eventsCount: 1,
_maxListeners: undefined,
defaultPort: 80,
protocol: 'http:',
options: { path: null },
requests: {},
sockets: {},
freeSockets: {},
keepAliveMsecs: 1000,
keepAlive: false,
maxSockets: Infinity,
maxFreeSockets: 256 }
ulimits on my system look like:
[root@njs testreq]# ulimit -all
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 128211
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 200000
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 10240
cpu time (seconds, -t) unlimited
max user processes (-u) 128211
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
What could be the reason for the timeouts?
When running a few load tests recently I ran into a similar error; however, instead of ETIMEDOUT errors I saw multiple EADDRINUSE errors. At the time I was running the tests with the following HTTP agent configuration changes:
{
maxSockets: 256,
keepAlive: false
}
It turns out this configuration wastes a lot of cycles intentionally closing each connection after a single request, and the EADDRINUSE errors were due to running out of ephemeral ports.
For my tests I was still using version 0.12.9 so I'm not sure if this still holds in versions >= 4.x, but the core HTTP library will automatically maintain connections to servers based on the host/port/protocol when possible. This can greatly reduce the load on the client and server, but can also cause requests to build up if the client pool is too small to handle the rate of outbound requests. The best configuration then is one that will keep alive connections whenever possible, but still has a large enough connection pool to quickly handle each outbound request.
Additionally, Node.js is built on top of libuv which implements the event loop interface. One way or another, almost any asynchronous operation implemented by a core Node.js library will interact with libuv. In order to implement this type of interface libuv will use one of several different policies, one of which is a thread pool. The default size of this thread pool is 4, with a max of 128.
The important point here is that any calls to getaddrinfo and getnameinfo will use the thread pool, which means regardless of the size of your HTTP connection pool, DNS queries and some operations lower in the network stack will be serialized based on the thread pool size. It's possible to change the thread pool size by setting the environment variable UV_THREADPOOL_SIZE to a value in the range 4 - 128.
For my tests the ideal settings were UV_THREADPOOL_SIZE=50 with the following HTTP agent configuration.
{
maxSockets: 256,
keepAlive: true
}
This answer has more info on when and how libuv is used.
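Applied to the client program above, a minimal sketch (assuming the same http://my-server endpoint and the request/express modules already used there) is to create one shared keep-alive agent, pass it to request, and start the process with a larger libuv thread pool:
// Run with a larger libuv thread pool, e.g.: UV_THREADPOOL_SIZE=50 node client.js
var http = require('http');
var request = require('request');
var app = require('express')();

// One shared keep-alive agent: sockets to the upstream server are reused
// instead of a new connection (and DNS lookup) being made for every request.
var keepAliveAgent = new http.Agent({ keepAlive: true, maxSockets: 256 });
var url = 'http://my-server';

app.get('/test', function(req, res) {
  request(url, { timeout: 1000, agent: keepAliveAgent }, function(err, response, body) {
    if (err) {
      res.send('Error: ' + err);
    } else {
      res.send('OK: ' + body);
    }
  });
});

app.listen(10000);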

How do I reconnect to Cassandra using Hector?

I have the following code:
StringSerializer ss = StringSerializer.get();
String cf = "TEST";
CassandraHostConfigurator conf = new CassandraHostConfigurator("localhost:9160");
conf.setCassandraThriftSocketTimeout(40000);
conf.setExhaustedPolicy(ExhaustedPolicy.WHEN_EXHAUSTED_BLOCK);
conf.setRetryDownedHostsDelayInSeconds(5);
conf.setRetryDownedHostsQueueSize(128);
conf.setRetryDownedHosts(true);
conf.setLoadBalancingPolicy(new LeastActiveBalancingPolicy());
String key = Long.toString(System.currentTimeMillis());
Cluster cluster = HFactory.getOrCreateCluster("TestCluster", conf);
Keyspace keyspace = HFactory.createKeyspace("TestCluster", cluster);
Mutator<String> mutator = HFactory.createMutator(keyspace, StringSerializer.get()); int count = 0;
while (!"q".equals(new Scanner( System.in).next())) {
try{
mutator.insert(key, cf, HFactory.createColumn("column_" + count, "v_" + count, ss, ss));
count++;
} catch (Exception e) {
e.printStackTrace();
}
}
and I can write some values using it, but when I restart Cassandra, it fails. Here is the log:
[15:11:07] INFO [CassandraHostRetryService ] Downed Host Retry service started with queue size 128 and retry delay 5s
[15:11:07] INFO [JmxMonitor ] Registering JMX me.prettyprint.cassandra.service_ASG:ServiceType=hector,MonitorType=hector
[15:11:17] ERROR [HThriftClient ] Could not flush transport (to be expected if the pool is shutting down) in close for client: CassandraClient
org.apache.thrift.transport.TTransportException: java.net.SocketException: Broken pipe
at org.apache.thrift.transport.TIOStreamTransport.write(TIOStreamTransport.java:147)
at org.apache.thrift.transport.TFramedTransport.flush(TFramedTransport.java:156)
at me.prettyprint.cassandra.connection.client.HThriftClient.close(HThriftClient.java:98)
at me.prettyprint.cassandra.connection.client.HThriftClient.close(HThriftClient.java:26)
at me.prettyprint.cassandra.connection.HConnectionManager.closeClient(HConnectionManager.java:308)
at me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:257)
at me.prettyprint.cassandra.model.ExecutingKeyspace.doExecuteOperation(ExecutingKeyspace.java:97)
at me.prettyprint.cassandra.model.MutatorImpl.execute(MutatorImpl.java:243)
at me.prettyprint.cassandra.model.MutatorImpl.insert(MutatorImpl.java:69)
at com.app.App.main(App.java:40)
Caused by: java.net.SocketException: Broken pipe
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
at java.net.SocketOutputStream.write(SocketOutputStream.java:153)
at org.apache.thrift.transport.TIOStreamTransport.write(TIOStreamTransport.java:145)
... 9 more
[15:11:17] ERROR [HConnectionManager ] MARK HOST AS DOWN TRIGGERED for host localhost(127.0.0.1):9160
[15:11:17] ERROR [HConnectionManager ] Pool state on shutdown: :{localhost(127.0.0.1):9160}; IsActive?: true; Active: 1; Blocked: 0; Idle: 15; NumBeforeExhausted: 49
[15:11:17] INFO [ConcurrentHClientPool ] Shutdown triggered on :{localhost(127.0.0.1):9160}
[15:11:17] INFO [ConcurrentHClientPool ] Shutdown complete on :{localhost(127.0.0.1):9160}
[15:11:17] INFO [CassandraHostRetryService ] Host detected as down was added to retry queue: localhost(127.0.0.1):9160
[15:11:17] WARN [HConnectionManager ] Could not fullfill request on this host CassandraClient
[15:11:17] WARN [HConnectionManager ] Exception:
me.prettyprint.hector.api.exceptions.HectorTransportException: org.apache.thrift.transport.TTransportException: java.net.SocketException: Broken pipe
at me.prettyprint.cassandra.connection.client.HThriftClient.getCassandra(HThriftClient.java:82)
at me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:236)
at me.prettyprint.cassandra.model.ExecutingKeyspace.doExecuteOperation(ExecutingKeyspace.java:97)
at me.prettyprint.cassandra.model.MutatorImpl.execute(MutatorImpl.java:243)
at me.prettyprint.cassandra.model.MutatorImpl.insert(MutatorImpl.java:69)
at com.app.App.main(App.java:40)
Caused by: org.apache.thrift.transport.TTransportException: java.net.SocketException: Broken pipe
at org.apache.thrift.transport.TIOStreamTransport.write(TIOStreamTransport.java:147)
at org.apache.thrift.transport.TFramedTransport.flush(TFramedTransport.java:157)
at org.apache.cassandra.thrift.Cassandra$Client.send_set_keyspace(Cassandra.java:466)
at org.apache.cassandra.thrift.Cassandra$Client.set_keyspace(Cassandra.java:455)
at me.prettyprint.cassandra.connection.client.HThriftClient.getCassandra(HThriftClient.java:78)
... 5 more
Caused by: java.net.SocketException: Broken pipe
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
at java.net.SocketOutputStream.write(SocketOutputStream.java:153)
at org.apache.thrift.transport.TIOStreamTransport.write(TIOStreamTransport.java:145)
... 9 more
[15:11:17] INFO [HConnectionManager ] Client CassandraClient released to inactive or dead pool. Closing.
[15:11:17] INFO [HConnectionManager ] Client CassandraClient released to inactive or dead pool. Closing.
[15:11:17] INFO [HConnectionManager ] Added host localhost(127.0.0.1):9160 to pool
You have set -
conf.setRetryDownedHostsDelayInSeconds(5);
Try waiting more than 5 seconds after the restart.
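As an illustration only, here is a minimal sketch (reusing the mutator, key, cf, count and ss variables from the code above) of a write loop that backs off after a transport error, so the downed-host retry service has time to re-add the restarted node:
// import me.prettyprint.hector.api.exceptions.HectorTransportException;
Scanner in = new Scanner(System.in);
while (!"q".equals(in.next())) {
    try {
        mutator.insert(key, cf, HFactory.createColumn("column_" + count, "v_" + count, ss, ss));
        count++;
    } catch (HectorTransportException e) {
        e.printStackTrace();
        try {
            // Wait longer than retryDownedHostsDelayInSeconds (5 s above) so
            // CassandraHostRetryService can mark the restarted host as up again.
            Thread.sleep(6000);
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            break;
        }
    }
}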
Also, you may need to upgrade.
What size have you set for thrift_max_message_length_in_mb?
Kind regards.

Resources