Cassandra node not starting

I have a 4-node Cassandra cluster, of which 2 nodes are up and 2 are down.
When I start the down nodes, they go down again immediately.
When I check with service cassandra status,
I get "could not access pidfile for cassandra",
and in the system.log file the error is:
ERROR [main] 2017-09-15 15:44:46,277 CassandraDaemon.java:752 - Exception encountered during startup
java.lang.NullPointerException: null
at org.apache.cassandra.gms.Gossiper.isSafeForStartup(Gossiper.java:756) ~[apache-cassandra-3.10.jar:3.10]
at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:553) ~[apache-cassandra-3.10.jar:3.10]
at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:800) ~[apache-cassandra-3.10.jar:3.10]
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:666) ~[apache-cassandra-3.10.jar:3.10]
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:612) ~[apache-cassandra-3.10.jar:3.10]
at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:394) [apache-cassandra-3.10.jar:3.10]
at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:601) [apache-cassandra-3.10.jar:3.10]
at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:735) [apache-cassandra-3.10.jar:3.10]
INFO [StorageServiceShutdownHook] 2017-09-15 15:44:46,281 HintsService.java:221 - Paused hints dispatch
INFO [StorageServiceShutdownHook] 2017-09-15 15:44:46,282 Gossiper.java:1506 - Announcing shutdown

From the source code of Gossiper (link), I suspect that your nodes are stuck in the bootstrapping phase. Other nodes see them as already bootstrapped, but they failed to finish joining the cluster.
What could help is forcibly removing the stuck nodes from the cluster by running nodetool removenode on any other instance that was able to boot.
After that, clean the stuck instances by wiping their data directory (located in data/, or under the system folders if you installed from an OS package) and start the instances one by one, as in the sketch below.
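A minimal sketch of those steps, assuming a package install with data under /var/lib/cassandra (adjust the paths and service commands to your layout):

# on a healthy node: find the Host IDs of the stuck nodes, then remove them
nodetool status
nodetool removenode <host-id-of-stuck-node>

# on each stuck node: wipe its local state before it rejoins
sudo service cassandra stop
sudo rm -rf /var/lib/cassandra/data /var/lib/cassandra/commitlog /var/lib/cassandra/saved_caches
sudo service cassandra start    # bring the nodes back one at a time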
If you share the output of nodetool gossipinfo and nodetool status for your cluster, it might help to figure out what the real problem is.
For further reference, see the official guide.

Another possible cause is misconfigured IP addresses or seeds:
Double-check that the cluster name and the seed list are the same on all nodes in their cassandra.yaml files, and that all of the IP addresses are correct.
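As a quick sanity check (the /etc/cassandra path is an assumption for a package install; adjust it to wherever your cassandra.yaml lives), you can compare the relevant settings across nodes:

# run on every node and compare the output
grep -E 'cluster_name|seeds|listen_address|broadcast_address' /etc/cassandra/cassandra.yaml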

In the cassandra.yaml file, comment out these values and let Cassandra use its defaults:
data_file_directories: db/cassandra/data
commitlog_directory: db/cassandra/commitlog
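For example, after commenting them out the file would contain something like this (Cassandra then falls back to its default locations):

# data_file_directories: db/cassandra/data
# commitlog_directory: db/cassandra/commitlog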

Related

Apache Pulsar gets timeouts in an unpredictable way

I installed Apache Pulsar standalone. Pulsar sometimes times out. It's not related to high throughput or to a particular topic (see the log below). pulsar-admin brokers healthcheck also returns either OK or a timeout. How can I investigate this?
10:46:46.365 [pulsar-ordered-OrderedExecutor-7-0] WARN org.apache.pulsar.broker.service.BrokerService - Got exception when reading persistence policy for persistent://nnx/agent_ns/action_up-53da8177-b4b9-4b92-8f75-efe94dc2309d: null
java.util.concurrent.TimeoutException: null
at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784) ~[?:1.8.0_232]
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928) ~[?:1.8.0_232]
at org.apache.pulsar.zookeeper.ZooKeeperDataCache.get(ZooKeeperDataCache.java:97) ~[org.apache.pulsar-pulsar-zookeeper-utils-2.5.0.jar:2.5.0]
at org.apache.pulsar.broker.service.BrokerService.lambda$getManagedLedgerConfig$32(BrokerService.java:922) ~[org.apache.pulsar-pulsar-broker-2.5.0.jar:2.5.0]
at org.apache.bookkeeper.mledger.util.SafeRun$2.safeRun(SafeRun.java:49) [org.apache.pulsar-managed-ledger-2.5.0.jar:2.5.0]
at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36) [org.apache.bookkeeper-bookkeeper-common-4.10.0.jar:4.10.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_232]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_232]
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-common-4.1.43.Final.jar:4.1.43.Final]
I am glad you were able to resolve the issue by adding more cores. The issue was a connection timeout while trying to access some topic metadata that is stored inside ZooKeeper, as indicated by the following line in the stack trace:
at org.apache.pulsar.zookeeper.ZooKeeperDataCache.get(ZooKeeperDataCache.java:97) ~[org.apache.pulsar-pulsar-zookeeper-utils-2.5.0.jar:2.5.0]
Increasing the cores must have freed up enough threads to allow the ZooKeeper node to respond to this request.
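If you want to verify that ZooKeeper itself is responsive, a quick check like the following can help (the host and port are assumptions for a standalone setup, and the ruok four-letter command must be enabled on newer ZooKeeper versions):

echo ruok | nc localhost 2181          # should print "imok" if ZooKeeper is healthy
bin/pulsar-admin brokers healthcheck   # the broker-level check already mentioned above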
Check the connection to the server; it looks like a connection issue. If you are using a TLS certificate file path, check that you have the right certificate.
The problem is that there aren't many solutions for Apache Pulsar on the internet, but following the Apache Pulsar documentation may help, and there are also the Apache Pulsar GitHub repository and sample projects.

Nodetool status shows java.lang.ClassCastException

I have a single-node Cassandra server. It had been working well for a long time, until a server restart I made yesterday.
Now nodetool status gives me the following error:
error: null
-- StackTrace --
java.lang.ClassCastException
Cassandra itself seems to be running. A subset of the logs:
...
INFO 03:58:18 Node /10.0.0.4 state jump to NORMAL
INFO 03:58:18 Waiting for gossip to settle before accepting client requests...
INFO 03:58:26 No gossip backlog; proceeding
Release version is 3.7
I haven't been able to crack this. I'd be very thankful for any help.
Let me know if I can provide any more useful information.

Spark scheduler thresholds

I'm running an analysis tool on top of Spark that creates plenty of overhead, so computations take a lot more time. When I run it I get this error:
16/08/30 23:36:37 WARN TransportChannelHandler: Exception in connection from /132.68.60.126:36922
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:384)
at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:313)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:242)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:745)
16/08/30 23:36:37 ERROR TaskSchedulerImpl: Lost executor 0 on 132.68.60.126: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
I guess this happens because the scheduler thinks the executor failed, so it starts another one.
The workload is a simple string search (grep), and both the master and the slave are local, so there aren't supposed to be any failures. When running without the overhead, things are fine.
The question is: can I configure those timeout thresholds somewhere?
Thanks!
Solved it by setting spark.network.timeout 10000000 in spark-defaults.conf.
I was getting the same error even though I tried many things. My job used to get stuck and throw this error after running for a very long time. I tried a few workarounds, which helped me resolve it. Although I still see the same error, at least my job runs fine.
One reason could be that the executors kill themselves thinking that they
lost the connection to the master. I added the configurations below to the
spark-defaults.conf file.
spark.network.timeout 10000000
spark.executor.heartbeatInterval 10000000
Basically, I increased the network timeout and the heartbeat interval.
For the particular step that used to get stuck, I just cached the
dataframe that is used for processing in that step.
Note: these are workarounds; I still see the same error in the logs, but my job does not get terminated.
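As an illustration only, mirroring the values from the answer above, the same settings can also be passed at submit time instead of editing spark-defaults.conf (my_job.py is a placeholder name):

spark-submit \
  --conf spark.network.timeout=10000000 \
  --conf spark.executor.heartbeatInterval=10000000 \
  my_job.py
# the caching workaround is applied inside the job itself, e.g. calling .cache()
# on the dataframe used by the step that stalls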

How to recover Cassandra node from failed bootstrap

A node went down while I was bootstrapping a new node, and the bootstrap failed. The new node shut down, leaving the following messages in its log:
INFO [main] 2015-02-07 06:03:32,761 StorageService.java:1025 - JOINING: Starting to bootstrap...
ERROR [main] 2015-02-07 06:03:32,799 CassandraDaemon.java:465 - Exception encountered during startup
java.lang.RuntimeException: A node required to move the data consistently is down (/10.0.3.56). If you wish to move the data from a potentially inconsistent replica, restart the node with -Dcassandra.consistent.rangemovement=false
How do I recover the situation? Can I restart the bootstrap process once the failed node is back online? Or do I need to revert the partial bootstrap and try again somehow?
I have tracked down the original cause. The new node was able to connect to the node at 10.0.3.56, but 10.0.3.56 was not able to open connections back to the new node. 10.0.3.56 contained the only copy of some data that needed to be moved to the new node (replication factor == 1), but its attempts to send the data were blocked.
Since this involves a data move, not just replication, and based on the place in the code where the exception is thrown, I assume you are trying to replace a dead node as described here: http://www.datastax.com/documentation/cassandra/2.0/cassandra/operations/ops_replace_node_t.html
By the look of it, the node did not get to the point of joining the ring. You can double-check whether the node joined at all by running nodetool status.
If it did not, then you can simply delete everything from the data, commitlog and saved_caches directories and restart the process. What was wrong with that 10.0.3.56 node?
If the node has joined the ring, then it should still be safe to simply restart it once you bring node 10.0.3.56 back up.
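A rough sketch of that recovery, assuming the default package-install paths under /var/lib/cassandra (adjust to your data_file_directories and commitlog_directory settings):

# from an existing node: check whether the new node ever joined the ring
nodetool status

# if it never joined, clear its local state and bootstrap again once 10.0.3.56 is reachable
sudo service cassandra stop
sudo rm -rf /var/lib/cassandra/data /var/lib/cassandra/commitlog /var/lib/cassandra/saved_caches
sudo service cassandra start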

Cassandra handshake and internode communication

While trying to use Cassandra version 2.0.1, I started facing the handshaking-version problem.
There was an exception from OutboundTcpConnection.java stating that handshaking is not possible with a particular node.
I had a look at the TCP dump and confirmed that there was no problem at the network layer.
The application is not completing the handshake process. Moreover, port 7000 is still active.
For example, all 8 of my nodes are up, but when I run nodetool status, some nodes show a DN (down) status. Later, after examining further, the TCP backlog queue was found to be overflowing and the particular server had stopped listening for the other servers in the cluster.
I am still not able to spot the root cause of this problem.
Note: I tried a previous version of Cassandra, 1.2.4, and it was working fine at that time. Before going to production, I thought it would be better to move to the 2.0.x version, mainly to avoid migration overhead later. Can anyone provide an idea on this?
The exception I am getting is:
INFO [HANDSHAKE-/aa.bb.cc.XX] 2013-10-03 17:36:16,948 OutboundTcpConnection.java (line 385) Handshaking version with /aa.bb.cc.XX
INFO [HANDSHAKE-/aa.bb.cc.YY] 2013-10-03 17:36:17,280 OutboundTcpConnection.java (line 396) Cannot handshake version with /aa.bb.cc.YY
This sounds like https://issues.apache.org/jira/browse/CASSANDRA-6349. You should upgrade.
