Nodes fail after restart with the error:
INFO [Messaging-EventLoop-3-12] 2021-08-17 11:09:07,845 InboundConnectionInitiator.java:464 - /X.X.46.68:7000(/X.X.46.68:56090)->/X.X.X.77:7000-URGENT_MESSAGES-cdaa1ab9 messaging connection established, version = 12, framing = LZ4, encryption = unencrypted
INFO [Messaging-EventLoop-3-1] 2021-08-17 11:09:07,867 InboundConnectionInitiator.java:464 - /X.X.86.42:7000(/X.X.86.42:52188)->/X.X.X.77:7000-URGENT_MESSAGES-9c2d74c5 messaging connection established, version = 12, framing = CRC, encryption = unencrypted
ERROR [main] 2021-08-17 11:09:08,523 CassandraDaemon.java:909 - Exception encountered during startup
java.lang.RuntimeException: Unable to gossip with any peers
at org.apache.cassandra.gms.Gossiper.doShadowRound(Gossiper.java:1801)
at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:648)
at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:934)
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:784)
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:729)
at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:420)
at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:763)
at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:887)
INFO [StorageServiceShutdownHook] 2021-08-17 11:09:08,530 HintsService.java:220 - Paused hints dispatch
WARN [StorageServiceShutdownHook] 2021-08-17 11:09:08,531 Gossiper.java:1989 - No local state, state is in silent shutdown, or node hasn't joined, not announcing shutdown
INFO [StorageServiceShutdownHook] 2021-08-17 11:09:08,531 MessagingService.java:441 - Waiting for messaging service to quiesce
INFO [Messaging-EventLoop-3-7] 2021-08-17 11:09:08,534 OutboundConnection.java:1150 - /X.X.X.77:7000(/X.X.X.77:52766)->/X.X.X.76:7000-SMALL_MESSAGES-27a82ea6 successfully connected, version = 12, framing = CRC, encryption = unencrypted
INFO [Messaging-EventLoop-3-8] 2021-08-17 11:09:08,534 OutboundConnection.java:1150 - /X.X.X.77:7000(/X.X.X.77:52768)->/X.X.X.76:7000-LARGE_MESSAGES-762ad3e9 successfully connected, version = 12, framing = CRC, encryption = unencrypted
INFO [Messaging-EventLoop-3-1] 2021-08-17 11:09:08,535 OutboundConnection.java:1150 - /X.X.X.77:7000(/X.X.X.77:35938)->/X.X.X.40:7000-SMALL_MESSAGES-97e069da successfully connected, version = 12, framing = CRC, encryption = unencrypted
Seeds and other nodes show the following in the debug log while the node is starting up:
ERROR [Messaging-EventLoop-3-2] 2021-08-17 11:09:07,535 OutboundConnection.java:1058 - /X.X.X.116:7000->/X.X.X.77:7000-URGENT_MESSAGES-ef747971 channel in potentially inconsistent state after error; closing
java.lang.IllegalArgumentException: Maximum payload size is 128KiB
at org.apache.cassandra.net.FrameEncoderCrc.encode(FrameEncoderCrc.java:73)
at org.apache.cassandra.net.FrameEncoder.write(FrameEncoder.java:134)
at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:717)
at io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:764)
at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:790)
at io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:758)
at io.netty.channel.DefaultChannelPipeline.writeAndFlush(DefaultChannelPipeline.java:1020)
at io.netty.channel.AbstractChannel.writeAndFlush(AbstractChannel.java:299)
at org.apache.cassandra.net.AsyncChannelPromise.writeAndFlush(AsyncChannelPromise.java:77)
at org.apache.cassandra.net.OutboundConnection$EventLoopDelivery.doRun(OutboundConnection.java:837)
at org.apache.cassandra.net.OutboundConnection$Delivery.run(OutboundConnection.java:687)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:748)
INFO [Messaging-EventLoop-3-10] 2021-08-17 11:09:08,540 InboundConnectionInitiator.java:464 - /X.X.X.77:7000(/X.X.X.77:36684)->/X.X.X.116:7000-SMALL_MESSAGES-8ab4a5dc messaging connection established, version = 12, framing = CRC, encryption = unencrypted
INFO [Messaging-EventLoop-3-11] 2021-08-17 11:09:08,540 InboundConnectionInitiator.java:464 - /X.X.X.77:7000(/X.X.X.77:36686)->/X.X.X.116:7000-LARGE_MESSAGES-7f053d49 messaging connection established, version = 12, framing = CRC, encryption = unencrypted
INFO [Messaging-EventLoop-3-2] 2021-08-17 11:09:15,680 NoSpamLogger.java:92 - /X.X.X.116:7000->/X.X.X.77:7000-URGENT_MESSAGES-[no-channel] failed to connect
io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..) failed: Connection refused: /X.X.X.77:7000
Caused by: java.net.ConnectException: finishConnect(..) failed: Connection refused
at io.netty.channel.unix.Errors.throwConnectException(Errors.java:124)
at io.netty.channel.unix.Socket.finishConnect(Socket.java:251)
at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.doFinishConnect(AbstractEpollChannel.java:673)
at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:650)
at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:530)
at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:470)
at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:748)
INFO [Messaging-EventLoop-3-2] 2021-08-17 11:09:45,714 NoSpamLogger.java:92 - /X.X.X.116:7000->/X.X.X.77:7000-URGENT_MESSAGES-[no-channel] failed to connect
This started happening after upgrading from 3.10 to 4.0. It is not a firewall issue or a bad configuration, as the same configuration was working prior to the upgrade.
Neither of those entries is an error, so they're not the cause of your nodes failing to restart.
The first entry for gossip is logged at DEBUG, so it's not an issue. The second entry for messaging is logged at INFO, so it's purely informational and nothing to be concerned about.
You need to review the system.log and pay attention to the last one or two ERROR entries, because those are relevant for understanding why the nodes failed to restart. Cheers!
[EDIT] This error indicates that there is an issue with contacting the seed nodes:
ERROR [main] 2021-08-17 11:09:08,523 CassandraDaemon.java:909 - Exception encountered during startup
java.lang.RuntimeException: Unable to gossip with any peers
at org.apache.cassandra.gms.Gossiper.doShadowRound(Gossiper.java:1801)
at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:648)
...
In Cassandra 4.0, nodes are now identified by a combination of their IP + port (CASSANDRA-7544) so make sure that you've configured the seeds list accordingly. For example:
seed_provider:
  - class_name: org.apache.cassandra.locator.SimpleSeedProvider
    parameters:
      - seeds: "10.1.2.3:7000,10.1.2.4:7000,10.1.2.5:7000"
It's very important that at least one of the seed nodes is up and fully operational. For this reason, it is recommended to upgrade the seed nodes first.
Also ensure that there is network connectivity between the nodes using Linux utilities such as nc and telnet. Check that traffic between nodes on port 7000 is not being blocked by firewalls (for example iptables or firewalld). If you rebooted the servers, it's quite common for a firewall to come back up unintentionally.
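If you would rather test from the JVM itself, here is a minimal sketch of a TCP reachability check; the host value is a placeholder and 7000 assumes the default storage_port:
import java.net.InetSocketAddress;
import java.net.Socket;

public class PortCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder: replace with the address of the peer the node cannot gossip with.
        String host = args.length > 0 ? args[0] : "10.1.2.3";
        int port = 7000; // assumes the default Cassandra storage/gossip port
        try (Socket socket = new Socket()) {
            // Fails with ConnectException or SocketTimeoutException if the port
            // is blocked by a firewall or nothing is listening on it.
            socket.connect(new InetSocketAddress(host, port), 5000);
            System.out.println("Reachable: " + host + ":" + port);
        }
    }
}
Run it from each node against every other node; a timeout rather than an immediate refusal often points to a firewall silently dropping packets.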
[UPDATE] Check that the clocks on the servers are in sync. If there is too much drift, nodes will not be able to gossip. Cheers!
The error message below from the question occurred because the gossip messages sent when a node (re)starts may exceed the hard payload limit in a large cluster:
ERROR [Messaging-EventLoop-3-2] 2021-08-17 11:09:07,535 OutboundConnection.java:1058 - /X.X.X.116:7000->/X.X.X.77:7000-URGENT_MESSAGES-ef747971 channel in potentially inconsistent state after error; closing
java.lang.IllegalArgumentException: Maximum payload size is 128KiB
This is a bug that has been present since 4.0-alpha1 and was fixed in 4.0.1; see CASSANDRA-16877.
Also, if you see a log message like the one below on one of the seed nodes, it is due to clock drift among the nodes, which Erick mentioned in his updated answer:
INFO [ScheduledTasks:1] 2021-09-10 11:14:26,567 MessagingMetrics.java:206 - GOSSIP_DIGEST_SYN messages were dropped in last 5000 ms: 0 internal and 1 cross node. Mean internal dropped latency: 0 ms and Mean cross-node dropped latency: 15137813 ms
Related
I'm trying an exercise with Spark Streaming and Kafka. If I use the Kafka producer and consumer on the command line, I can publish and consume messages in Kafka. When I try to do it using pyspark in a Jupyter notebook, I get a ZooKeeper connection timeout error:
Client session timed out, have not heard from server in 6004ms for sessionid 0x0, closing socket connection and attempting reconnect
[2017-08-04 15:49:37,494] INFO Initiating client connection, connectString=127.0.0.1:2181 sessionTimeout=6000 watcher=org.I0Itec.zkclient.ZkClient#158da8e (org.apache.zookeeper.ZooKeeper)
[2017-08-04 15:49:37,524] INFO Waiting for keeper state SyncConnected (org.I0Itec.zkclient.ZkClient)
[2017-08-04 15:49:37,527] INFO Opening socket connection to server 127.0.0.1/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)
[2017-08-04 15:49:37,533] WARN Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141)
[2017-08-04 15:49:38,637] INFO Opening socket connection to server 127.0.0.1/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)
[2017-08-04 15:49:38,639] WARN Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
java.net.ConnectException: Connection refused
ZooKeeper has issues when using localhost (127.0.0.1), as described in https://issues.apache.org/jira/browse/ZOOKEEPER-1661?focusedCommentId=13599352
That comment includes a little program which demonstrates the following:
ZooKeeper does call InetAddress.getAllByName (see StaticHostProvider:60) on the connect string "localhost:2181" => as a result it gets 3 different addresses for localhost which then get shuffled (Collections.shuffle(this.serverAddresses), StaticHostProvider line 72).
Because of the (random) shuffling, the call to StaticHostProvider.next will sometimes return the fe80:0:0:0:0:0:0:1%1 address, which as you can see from this small program times out after 5s => this explains the randomness I am experiencing.
It really seems to me that what I am experiencing is a reverse DNS lookup issue with IPv6. Whether this reverse DNS lookup is actually useful and required by ZooKeeper, I do not know. It did not behave this way in 3.3.3.
Solution: specify your ZooKeeper address as an FQDN and make sure the reverse lookup works, or use 0.0.0.0 instead of localhost.
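For reference, a minimal sketch along the lines of the "little program" mentioned above (not the original code, just an illustration of the same lookup) would be:
import java.net.InetAddress;
import java.util.Arrays;

public class LocalhostLookup {
    public static void main(String[] args) throws Exception {
        // On hosts where /etc/hosts maps localhost to both IPv4 and IPv6,
        // this typically returns several addresses; the ZooKeeper client
        // shuffles them and may pick one that is not reachable.
        InetAddress[] addresses = InetAddress.getAllByName("localhost");
        System.out.println(Arrays.toString(addresses));
    }
}
If an fe80:... link-local address shows up in the output, that matches the behaviour described in the JIRA comment.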
I am trying to understand a warning: I see the exception below every time I run my Spark job. I am seeing this on 2 nodes of my 3-node cluster. But as I said, it's just a WARN and the job succeeds anyway.
com.datastax.driver.core.exceptions.ConnectionException: [x.x.x.x/x.x.x.x:9042] Pool was closed during initialization
CASSANDRA LOG
INFO [SharedPool-Worker-1] 2017-07-17 22:25:48,716 Message.java:605 - Unexpected exception during request; channel = [id: 0xf0ee1096, /x.x.x.x:54863 => /x.x.x.x:9042]
io.netty.channel.unix.Errors$NativeIoException: readAddress() failed: Connection timed out
at io.netty.channel.unix.Errors.newIOException(Errors.java:105) ~[netty-all-4.0.34.Final.jar:4.0.34.Final]
at io.netty.channel.unix.Errors.ioResult(Errors.java:121) ~[netty-all-4.0.34.Final.jar:4.0.34.Final]
at io.netty.channel.unix.FileDescriptor.readAddress(FileDescriptor.java:134) ~[netty-all-4.0.34.Final.jar:4.0.34.Final]
at io.netty.channel.epoll.AbstractEpollChannel.doReadBytes(AbstractEpollChannel.java:239) ~[netty-all-4.0.34.Final.jar:4.0.34.Final]
at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:822) ~[netty-all-4.0.34.Final.jar:4.0.34.Final]
at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:348) ~[netty-all-4.0.34.Final.jar:4.0.34.Final]
at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:264) ~[netty-all-4.0.34.Final.jar:4.0.34.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:112) ~[netty-all-4.0.34.Final.jar:4.0.34.Final]
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137) ~[netty-all-4.0.34.Final.jar:4.0.34.Final]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_121]
The core of the error is "Connection timed out". I recommend troubleshooting network connectivity to the Cassandra cluster, starting with simpler tools such as ping, telnet and nc. Some potential causes:
The Cassandra client's connection configuration included an address that is not valid (not a node in the Cassandra cluster).
A network misconfiguration or firewall rule is preventing connections from the client to the Cassandra server.
The destination Cassandra server is overloaded, such that it cannot respond to new connection requests.
You mentioned that the problem is intermittent ("seeing this in 2 nodes of my 3 node cluster") and does not cause job failure. This could be an indicator that any of the problems listed above is happening for just a subset of nodes in the cluster. (If connectivity to all nodes was broken, then the job likely would have failed.)
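To confirm connectivity from the client host with the same driver the job uses, here is a minimal sketch, assuming the classic DataStax Java driver (com.datastax.driver.core) that the exception above comes from; the contact point is a placeholder:
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Host;
import com.datastax.driver.core.Session;

public class ConnectivityTest {
    public static void main(String[] args) {
        // Placeholder contact point: use one of the three nodes in your cluster.
        try (Cluster cluster = Cluster.builder()
                .addContactPoint("x.x.x.x")
                .withPort(9042)
                .build();
             Session session = cluster.connect()) {
            // Print which nodes the driver discovered and whether it sees them as up.
            for (Host host : cluster.getMetadata().getAllHosts()) {
                System.out.println(host.getAddress() + " up=" + host.isUp());
            }
        }
    }
}
Running this from the Spark worker hosts (not just your workstation) helps show whether the timeout is specific to certain client/server pairs.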
I am seeing the log statements below every time a client disconnects. Is there anything we can do to avoid them, since they don't really seem to warrant a warning?
I believe these should be at DEBUG level.
01/11/2017 14:14:38,465 - INFO [hz._hzInstance_1_FB_API.IO.thread-in-0][][com.hazelcast.nio.tcp.TcpIpConnection] [127.0.0.1]:31201 [FB_API] [3.6.2] Connection [Address[127.0.0.1]:59972] lost. Reason: java.io.EOFException[Remote socket closed!]
01/11/2017 14:14:38,466 - WARN [hz._hzInstance_1_FB_API.IO.thread-in-0][][com.hazelcast.nio.tcp.nonblocking.NonBlockingSocketReader] [127.0.0.1]:31201 [FB_API] [3.6.2] hz._hzInstance_1_FB_API.IO.thread-in-0 Closing socket to endpoint Address[127.0.0.1]:59972, Cause:java.io.EOFException: Remote socket closed!
01/11/2017 14:14:38,467 - INFO [hz._hzInstance_1_FB_API.event-3][][com.hazelcast.client.ClientEndpointManager] [127.0.0.1]:31201 [FB_API] [3.6.2] Destroying ClientEndpoint{conn=Connection [0.0.0.0/0.0.0.0:31201 -> /127.0.0.1:59972], endpoint=Address[127.0.0.1]:59972, alive=false, type=JAVA_CLIENT, principal='ClientPrincipal{uuid='7181902f-8fe9-4065-b610-5a0e9c4ad212', ownerUuid='7ff02f81-c0e3-4094-a968-507c7df4df9f'}', firstConnection=true, authenticated=true}
01/11/2017 14:14:38,467 - INFO [hz._hzInstance_1_FB_API.event-3][][com.hazelcast.transaction.TransactionManagerService] [127.0.0.1]:31201 [FB_API] [3.6.2] Committing/rolling-back alive transactions of client, UUID: 7181902f-8fe9-4065-b610-5a0e9c4ad212
These exceptions/errors are distracting from our log analysis.
It's a code issue. You have three choices:
Ignore it
Filter it out in your logging configuration, with something like this for Log4j2:
<Logger name="com.hazelcast.nio.tcp.nonblocking.NonBlockingSocketReader" level="error" additivity="false"/>
(This is a bad idea, as you will miss genuine warnings.)
Upgrade to 3.7, or to 3.8 if it is GA when you do the upgrade.
We are running a MapReduce/Spark job to bulk load HBase data in one of our environments.
While running it, the connection to the HBase ZooKeeper cannot be initialized, and it throws the following error.
16/05/10 06:36:10 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=c321shu.int.westgroup.com:2181,c149jub.int.westgroup.com:2181,c167rvm.int.westgroup.com:2181 sessionTimeout=90000 watcher=hconnection-0x74b47a30, quorum=c321shu.int.westgroup.com:2181,c149jub.int.westgroup.com:2181,c167rvm.int.westgroup.com:2181, baseZNode=/hbase
16/05/10 06:36:10 INFO zookeeper.ClientCnxn: Opening socket connection to server c321shu.int.westgroup.com/10.204.152.28:2181. Will not attempt to authenticate using SASL (unknown error)
16/05/10 06:36:10 INFO zookeeper.ClientCnxn: Socket connection established, initiating session, client: /10.204.24.16:35740, server: c321shu.int.westgroup.com/10.204.152.28:2181
16/05/10 06:36:10 INFO zookeeper.ClientCnxn: Session establishment complete on server c321shu.int.westgroup.com/10.204.152.28:2181, sessionid = 0x5534bebb441bd3f, negotiated timeout = 60000
16/05/10 06:36:11 INFO mapreduce.HFileOutputFormat2: Looking up current regions for table ecpdevv1patents:NormNovusDemo
Exception in thread "main" org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=35, exceptions:
Tue May 10 06:36:11 CDT 2016, org.apache.hadoop.hbase.client.RpcRetryingCaller#3927df20, java.io.IOException: Call to c873gpv.int.westgroup.com/10.204.67.9:60020 failed on local exception: java.io.EOFException
We have executed the same job in Titan DEV too but face the same problem there. Please let us know if anyone has faced this problem before.
Details:
• Earlier, the job was failing to connect to localhost/127.0.0.1:2181, hence only the property hbase.zookeeper.quorum has been set in the MapReduce code to c149jub.int.westgroup.com,c321shu.int.westgroup.com,c167rvm.int.westgroup.com, which we got from hbase-site.xml (a sketch of this configuration is shown after this list).
• We are using jars from CDH version 5.3.3.
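For context, setting that property programmatically typically looks something like the sketch below (illustrative only; the hostnames are the ones from the question and the client port is assumed to be the default 2181):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class QuorumConfig {
    public static void main(String[] args) {
        // Starts from hbase-site.xml on the classpath, then overrides the quorum.
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum",
                "c149jub.int.westgroup.com,c321shu.int.westgroup.com,c167rvm.int.westgroup.com");
        conf.set("hbase.zookeeper.property.clientPort", "2181"); // assumed default port
        System.out.println("Quorum: " + conf.get("hbase.zookeeper.quorum"));
    }
}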
I was trying to create a table inside Accumulo using the createtable command and found out that it was getting stuck. I waited for around 20 mins before cancelling the createtable command.
createtable test_table
I have one master and 2 tablet servers, and found that my master and one of the tablet servers died. I could not telnet to port 9997 of that particular tablet server, and I could not even telnet to port 29999 (master.port.client in accumulo-site.xml). When I looked at the tserver logs of the dead server, I saw the following entries.
2016-05-10 02:12:07,456 [zookeeper.DistributedWorkQueue] INFO : Got unexpected zookeeper event: None for /accumulo/be4f66be-1508-4314-9bff-888b56d9b0ce/recovery
2016-05-10 02:12:23,883 [zookeeper.ZooCache] WARN : Saw (possibly) transient exception communicating with ZooKeeper, will retry
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /accumulo/be4f66be-1508-4314-9bff-888b56d9b0ce/tables
at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)
at org.apache.accumulo.fate.zookeeper.ZooCache$1.run(ZooCache.java:210)
at org.apache.accumulo.fate.zookeeper.ZooCache.retry(ZooCache.java:162)
at org.apache.accumulo.fate.zookeeper.ZooCache.getChildren(ZooCache.java:221)
at org.apache.accumulo.core.client.impl.Tables.exists(Tables.java:142)
at org.apache.accumulo.server.tabletserver.LargestFirstMemoryManager.tableExists(LargestFirstMemoryManager.java:149)
at org.apache.accumulo.server.tabletserver.LargestFirstMemoryManager.getMemoryManagementActions(LargestFirstMemoryManager.java:175)
at org.apache.accumulo.tserver.TabletServerResourceManager$MemoryManagementFramework.manageMemory(TabletServerResourceManager.java:408)
at org.apache.accumulo.tserver.TabletServerResourceManager$MemoryManagementFramework.access$400(TabletServerResourceManager.java:318)
at org.apache.accumulo.tserver.TabletServerResourceManager$MemoryManagementFramework$2.run(TabletServerResourceManager.java:346)
at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
at java.lang.Thread.run(Thread.java:745)
2016-05-10 02:12:23,884 [zookeeper.ZooCache] WARN : Saw (possibly) transient exception communicating with ZooKeeper, will retry
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /accumulo/be4f66be-1508-4314-9bff-888b56d9b0ce/tables/!0/conf/table.classpath.context
at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
at org.apache.accumulo.fate.zookeeper.ZooCache$2.run(ZooCache.java:264)
at org.apache.accumulo.fate.zookeeper.ZooCache.retry(ZooCache.java:162)
at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:289)
at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:238)
at org.apache.accumulo.server.conf.ZooCachePropertyAccessor.get(ZooCachePropertyAccessor.java:117)
at org.apache.accumulo.server.conf.ZooCachePropertyAccessor.get(ZooCachePropertyAccessor.java:103)
at org.apache.accumulo.server.conf.TableConfiguration.get(TableConfiguration.java:99)
at org.apache.accumulo.tserver.constraints.ConstraintChecker.classLoaderChanged(ConstraintChecker.java:93)
at org.apache.accumulo.tserver.tablet.Tablet.checkConstraints(Tablet.java:1225)
at org.apache.accumulo.tserver.TabletServer$8.run(TabletServer.java:2848)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
2016-05-10 02:12:23,887 [zookeeper.ZooReader] WARN : Saw (possibly) transient exception communicating with ZooKeeper
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /accumulo/be4f66be-1508-4314-9bff-888b56d9b0ce/tservers/accumulo.tablet.2:9997
at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
at org.apache.accumulo.fate.zookeeper.ZooReader.getStatus(ZooReader.java:132)
at org.apache.accumulo.fate.zookeeper.ZooLock.process(ZooLock.java:383)
at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
2016-05-10 02:12:24,252 [watcher.MonitorLog4jWatcher] INFO : Changing monitor log4j address to accumulo.master:4560
2016-05-10 02:12:24,252 [watcher.MonitorLog4jWatcher] INFO : Enabled log-forwarding
Even the master server's logs had the same stacktrace. My zookeeper is running.
At first, I thought it was a disk issue, perhaps no free space, but that was not the case. I ran fsck on the Accumulo instance.volumes and it returned HEALTHY status.
Does anyone know what exactly happened and if possible, how to avoid it?
EDIT: Even the tracer_accumulo.master.log had the same stack trace.
ZooKeeper session expirations occur when a thread inside the ZooKeeper client does not get to run within the necessary time (by default, 30s) to maintain the session, which is in-memory state shared between the ZooKeeper client and server (a short client-side sketch of this follows the list below). There is no single explanation for this, but there are many common culprits:
JVM garbage collection pauses in the client. Accumulo should log a warning if it experienced a pause.
Lack of CPU time. If the host itself is overburdened, Accumulo might not have the cycles to run all of the tasks it needs to in a timely manner.
Lack of sockets/file handles: Accumulo could be trying to connect to ZooKeeper but be unable to open new connections.
ZooKeeper might be rate-limiting connections as a denial-of-service prevention. Check the zookeeper logs for errors about dropping/denying new connections from a specific IP, and, if you see these errors, consider increasing maxClientCnxns in zoo.cfg.
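As a short illustration of the client-side session mentioned above, here is a minimal sketch using the plain ZooKeeper API; the connect string is a placeholder, and the 30s figure mirrors the default mentioned above:
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.ZooKeeper;

public class SessionDemo {
    public static void main(String[] args) throws Exception {
        // Placeholder connect string. The session timeout is negotiated with the
        // server and must be kept alive by the client's background heartbeat thread;
        // if that thread is starved (GC pause, CPU pressure), the session expires.
        ZooKeeper zk = new ZooKeeper("zk1.example.com:2181", 30_000,
                (WatchedEvent event) -> System.out.println("State change: " + event.getState()));
        Thread.sleep(5_000);
        System.out.println("Session id: 0x" + Long.toHexString(zk.getSessionId()));
        zk.close();
    }
}
Disconnected/Expired events arrive through that watcher callback; Accumulo's own ZooLock.process watcher, visible in the stack trace above, reacts to the same kind of events. The culprits listed above are about why the heartbeat fell behind in the first place.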