We are having issues with dropped connections in a kafka-node client. It appears that ZooKeeper is resetting connections on us. When I run a tcpdump and view it in Wireshark, I see the following all over the place:
The source is always one of our ZooKeeper servers and the destination is our Kafka consumer. Our client seems to handle these just fine in most situations. In fact, I'm not at all convinced this is the cause of our failures, but it does seem odd. I was hoping someone with more experience in how kafka-node, ZooKeeper, and Kafka interact could provide some explanation.
ADDING SOME DETAILS FROM LOG
So, I see a few things in the logs. First, there are a ton of the following:
2016-03-11 20:26:32,357 [myid:2] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer#868] - Client attempting to establish new session at /10.196.2.106:59300
2016-03-11 20:26:32,358 [myid:2] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer#822] - Connection request from old client /10.196.2.106:59296; will be dropped if server is in r-o mode
Then there are a whole bunch of these:
2016-03-12 03:40:49,041 [myid:1] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn#357] - caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 0x1527b11827bfcfe, likely client has closed socket
at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:228)
at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
at java.lang.Thread.run(Thread.java:745)
2016-03-12 03:40:49,042 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn#1007] - Closed socket connection for client /10.196.2.106:33197 which had sessionid 0x1527b11827bfcfe
The strange thing is that these IPs belong to our Secor servers, so this is probably not related.
Other than that, I do not see anything really out of the ordinary.
Related
I'm trying to use GCP Pub/Sub StreamingPull with the Node.js client, and I understand that StreamingPull is expected to show a 100% error rate, as mentioned in the docs.
So do I have to restart the listener when the errorHandler receives an error? Also, which error code should I look for to tell whether the streaming connection has been closed? For reference, see Error Codes.
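const errorHandler = (error) => {
  // errorCodeCheckCondition is a placeholder for the check on error.code.
  if (errorCodeCheckCondition) {
    // Detach the handler first, then re-attach it to restart the listener.
    subscription.removeListener('message', messageHandler);
    subscription.on('message', messageHandler);
  }
};
subscription.on('error', errorHandler);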
I'm using GCP Pub/Sub StreamingPull for the first time, so please guide me.
You do need to re-establish the streaming pull connection after you get any error.
According to the StreamingPull RPC reference:
The server will close the stream and return the status on any error. The server may close the stream with status UNAVAILABLE to reassign server-side resources, in which case, the client should re-establish the stream. Flow control can be achieved by configuring the underlying RPC channel.
Since you already know that StreamingPull has a 100% error rate, I assume you have also gone through Diagnosing StreamingPull errors.
The Pub/Sub client library will re-establish the underlying streaming pull connection when it disconnects for a retriable reason, e.g., an UNAVAILABLE error. You can see in the StreamingPull config in the library the set of errors that are retried internally.
The errors you would typically get back at the application level are ones where some additional intervention is likely necessary, e.g., a PERMISSION_DENIED error (where the subscriber does not have permission to receive messages on the subscription) or a NOT_FOUND error (where the subscription does not exist). Retrying on these types of errors is likely to just result in the error recurring until the underlying issue is resolved.
You could decide that retrying is what you want to do because you want the subscriber to start working again without having to manually restart it once other steps are taken to fix the problem, but you'll want to make sure you have some way to discover these types of issues, perhaps through some kind of Cloud Monitoring alerting on streaming pull errors or on a large number of unprocessed messages building up.
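The retry-or-give-up decision described above could be sketched roughly as follows. This is only an illustration, not the client library's own mechanism: listenWithRestart, its options (retryDelayMs, onFatal, schedule), and the choice of which codes count as needing intervention are all hypothetical. It assumes, as the snippet in the question does, that removing and re-adding the 'message' listener is what restarts the listener, and that the error object carries a numeric gRPC status in error.code.

```javascript
// gRPC status codes; the numeric values are fixed by gRPC.
const NOT_FOUND = 5;
const PERMISSION_DENIED = 7;

// Assumed policy: these codes need manual intervention, so do not retry them.
const NEEDS_INTERVENTION = new Set([NOT_FOUND, PERMISSION_DENIED]);

// Hypothetical helper, not part of the Pub/Sub client library.
// `subscription` is any EventEmitter-like object that emits 'message' and 'error'.
function listenWithRestart(subscription, messageHandler, options = {}) {
  const {
    retryDelayMs = 5000,
    onFatal = (error) => console.error('not retrying:', error),
    schedule = setTimeout, // injectable so the delay can be replaced in tests
  } = options;

  const errorHandler = (error) => {
    // Detach first so the handler is never registered twice.
    subscription.removeListener('message', messageHandler);

    if (NEEDS_INTERVENTION.has(error.code)) {
      // Surface the problem (for example to alerting) instead of looping on it.
      onFatal(error);
      return;
    }

    // Any other error: re-attach the handler after a delay.
    schedule(() => subscription.on('message', messageHandler), retryDelayMs);
  };

  subscription.on('message', messageHandler);
  subscription.on('error', errorHandler);
}
```

With the real client this would be called once, as listenWithRestart(subscription, messageHandler, { onFatal: notifyOnCall }), where notifyOnCall stands for whatever alerting hook you already have. If you prefer to keep retrying on PERMISSION_DENIED or NOT_FOUND as well, empty the NEEDS_INTERVENTION set and rely on monitoring to notice the problem.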
I installed Apache Pulsar standalone. Pulsar sometimes times out. It is not related to high throughput, nor to a particular topic (see the following log). pulsar-admin brokers healthcheck also returns either OK or a timeout. How can I investigate this?
10:46:46.365 [pulsar-ordered-OrderedExecutor-7-0] WARN org.apache.pulsar.broker.service.BrokerService - Got exception when reading persistence policy for persistent://nnx/agent_ns/action_up-53da8177-b4b9-4b92-8f75-efe94dc2309d: null
java.util.concurrent.TimeoutException: null
at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784) ~[?:1.8.0_232]
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928) ~[?:1.8.0_232]
at org.apache.pulsar.zookeeper.ZooKeeperDataCache.get(ZooKeeperDataCache.java:97) ~[org.apache.pulsar-pulsar-zookeeper-utils-2.5.0.jar:2.5.0]
at org.apache.pulsar.broker.service.BrokerService.lambda$getManagedLedgerConfig$32(BrokerService.java:922) ~[org.apache.pulsar-pulsar-broker-2.5.0.jar:2.5.0]
at org.apache.bookkeeper.mledger.util.SafeRun$2.safeRun(SafeRun.java:49) [org.apache.pulsar-managed-ledger-2.5.0.jar:2.5.0]
at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36) [org.apache.bookkeeper-bookkeeper-common-4.10.0.jar:4.10.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_232]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_232]
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-common-4.1.43.Final.jar:4.1.43.Final]
I am glad you were able to resolve the issue by adding more cores. The issue was a connection timeout while trying to access some topic metadata that is stored in ZooKeeper, as indicated by the following line in the stack trace:
at org.apache.pulsar.zookeeper.ZooKeeperDataCache.get(ZooKeeperDataCache.java:97) ~[org.apache.pulsar-pulsar-zookeeper-utils-2.5.0.jar:2.5.0]
Increasing the cores must have freed up enough threads to allow the ZK node to respond to this request.
This looks like a connection issue, so check the connection to the server. If you are using a TLS certificate file path, check that it points to the right certificate.
The problem is that there are not many solutions on the internet for Apache Pulsar, but following the Apache Pulsar docs might help, and there are also the Apache Pulsar GitHub repository and sample projects.
How long do the threads take to stop and exit for ActiveMQConsumer? I get a segmentation fault on closing my application, which I figured out was due to the ActiveMQ threads. If I comment out the consumer, the issue is no longer present. Currently I am using cms::MessageConsumer in activemq-cpp-library-3.9.4.
I see that activemq::core::ActiveMQConsumer has an isClosed() function that I can use to confirm that the consumer is closed before I delete the objects, thereby avoiding the segmentation fault. I am assuming this will solve my issue, but I wanted to know the correct approach with these ActiveMQ objects to avoid issues with threads.
I was using the same session with consumer and producer, but when the broker is stopped and started the ActiveMQ reconnect was adding threads. I am not using failover.
So I have separated the sessions for sending and receiving, and I instantiate a connection factory, connection, and session for each separately. This design has had no issues, except that the application's memory is not cleaned up because of the segmentation fault above.
That's why I wanted to know when should I use cms::MessageConsumer vs ActiveMQConsumer?
The ActiveMQ Website has documentation with examples for the CMS client. I'd suggest reading those and following the example code in how it shuts down the connection and the library resources prior to application shutdown to ensure that resources are cleaned up appropriately.
As with JMS, the CMS consumer instance is linked with the thread in the session that created it, so if you are closing down, a good rule to follow is to close the session first to ensure that message deliveries are stopped before you delete any consumer instances.
While trying to use Cassandra 2.0.1, I started running into a version handshake problem.
There was an exception from OutboundTcpConnection.java stating that handshaking is not possible with a particular node.
I looked at a TCP dump and confirmed that there was no problem at the network layer.
The application is not completing the handshake process. Moreover, port 7000 is still active.
For example, all 8 of my nodes are up, but when I run nodetool status, some nodes show a DN (down) status. On closer examination, the TCP backlog queue was found to be overflowing, and that particular server had stopped listening for other servers in the cluster.
I am still not able to find the root cause of this problem.
Note: I tried the previous version of Cassandra, 1.2.4, and it worked fine at the time. Before going to production, I thought it better to move to 2.0.x, mainly to avoid migration overhead later. Can anyone offer an idea on this?
The exception I am getting is:
INFO [HANDSHAKE-/aa.bb.cc.XX] 2013-10-03 17:36:16,948 OutboundTcpConnection.java (line 385) Handshaking version with /aa.bb.cc.XX
INFO [HANDSHAKE-/aa.bb.cc.YY] 2013-10-03 17:36:17,280 OutboundTcpConnection.java (line 396) Cannot handshake version with /aa.bb.cc.YY
This sounds like https://issues.apache.org/jira/browse/CASSANDRA-6349. You should upgrade.
I'm running a redis / node.js server and had a
[Error: Auth error: Error: ERR max number of clients reached]
My current setup is a connection manager that adds connections until the maximum number of concurrent connections for my Heroku app (256, or 128 per dyno) is reached. Once the limit is reached, it just hands out an already existing connection. It's ultra fast and it's working.
However, yesterday night I got this error and I'm not able to reproduce it. It may be a rare error and I'm not sleeping well, knowing it's out there. Because: Once the error is thrown, my app is no longer reachable.
So my questions would be:
is that kind of a connection manager a good idea?
would it be a better idea to have the manager wait for 'idle' to be called and then close the connection, meaning that I would have to re-establish a connection every time a request comes in (this is what I wanted to avoid)?
how can I stop my app from going down? Should I just flush the connection pool whenever an error occurs?
What are your general strategies for handling multiple concurrent connections with a given maximum?
In case somebody is reading along:
The error was caused by a messed up redis 0.8.x that I deployed to live:
https://github.com/mranney/node_redis/issues/251
I was smart enough to remove the failed connections from the connection pool but forgot to call '.quit()' on them, so each one was still out there in the wild, still counting as a connection.
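A minimal sketch of the fix described above, where removing a connection from the pool and closing it happen in one step. ClientPool, acquire, and discard are hypothetical names, not the original connection manager. The only thing assumed about a client is that it has the quit() method mentioned above, and createClient is whatever factory you already use to open a connection.

```javascript
// Hypothetical pool: hands out new clients up to a cap, then reuses existing ones.
class ClientPool {
  constructor(createClient, maxClients) {
    this.createClient = createClient; // caller-supplied factory that opens a connection
    this.maxClients = maxClients;
    this.clients = [];
    this.next = 0;
  }

  acquire() {
    if (this.clients.length < this.maxClients) {
      const client = this.createClient();
      this.clients.push(client);
      return client;
    }
    // At the cap: hand out existing connections round-robin.
    const client = this.clients[this.next % this.clients.length];
    this.next += 1;
    return client;
  }

  // Remove a client from the pool AND close it, so the server stops counting it.
  discard(client) {
    const index = this.clients.indexOf(client);
    if (index === -1) return; // already discarded, do not quit twice
    this.clients.splice(index, 1);
    client.quit();
  }
}
```

The caller would invoke pool.discard(client) from its own error handling. Keeping the splice and the quit() call in the same method means a connection cannot leave the pool while still being open on the server side.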