Getting exception com.hazelcast.spi.exception.TargetNotMemberException: Not Member! target - hazelcast

A single Hazelcast node is running, so the cluster has only one member.
We get the exception below once in a while. For example, the system runs continuously for 10 days and we start receiving this exception from the 10th day onward.
The Hazelcast version is 3.12.10.
The function call that results in this exception:
return resolveExecutor(task)
.submitToMember(new HazelcastAdapterTask(key, task), cluster.getLocalMember());
If the Hazelcast instance is down, the system throws com.hazelcast.core.HazelcastInstanceNotActiveException: Hazelcast instance is not active!
We have no clue why we are getting com.hazelcast.spi.exception.TargetNotMemberException: Not Member! target:
After a restart, everything works fine.
1. Does anyone know the reason for this?
2. With multiple nodes it is understandable: the node to which the Hazelcast cluster is trying to submit the task has somehow gone down. Why does this exception occur when a single node is running?
3. Are there any debug logs we can enable in Hazelcast?
We tried to reproduce the issue but were not able to.
Caused by: com.hazelcast.spi.exception.TargetNotMemberException: Not Member! target: [10.232.104.29]:44536, partitionId: -1, operation: com.hazelcast.executor.impl.operations.MemberCallableTaskOperation, service: hz:impl:executorService
at com.hazelcast.spi.impl.operationservice.impl.Invocation.initInvocationTarget(Invocation.java:307)
at com.hazelcast.spi.impl.operationservice.impl.Invocation.doInvoke(Invocation.java:614)
at com.hazelcast.spi.impl.operationservice.impl.Invocation.invoke0(Invocation.java:592)
at com.hazelcast.spi.impl.operationservice.impl.Invocation.invoke(Invocation.java:256)
at com.hazelcast.spi.impl.operationservice.impl.OperationServiceImpl.invokeOnTarget(OperationServiceImpl.java:326)
at com.hazelcast.executor.impl.ExecutorServiceProxy.submitToMember(ExecutorServiceProxy.java:319)
at com.hazelcast.executor.impl.ExecutorServiceProxy.submitToMember(ExecutorServiceProxy.java:308)
at com.tpt.atlant.grid.task.hazelcast.HazelcastTaskExecutor.submit(HazelcastTaskExecutor.java:166)
at com.tpt.valuation.ion.service.GenerateUpdateObligationTalkFunctionImpl.delegateTask(GenerateUpdateObligationTalkFunctionImpl.java:98)
at com.tpt.valuation.ion.service.GenerateUpdateObligationTalkFunctionImpl.lambda$generateUpdateObligation$0(GenerateUpdateObligationTalkFunctionImpl.java:72)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125)
at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:69)
at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78)
at com.iontrading.isf.executors.impl.monitoring.e.run(MonitoredRunnable.java:16)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
at ------ submitted from ------.(Unknown Source)
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolve(InvocationFuture.java:126)
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolveAndThrowIfException(InvocationFuture.java:79)
... 12 more
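On the third question, one possible starting point is Hazelcast's built-in diagnostics, which are switched on through JVM system properties. The sketch below assumes that `hazelcast.diagnostics.enabled` and `hazelcast.logging.type` behave in 3.12.10 as described in the reference manual (worth verifying there). The class and method names are made up for illustration, and `log4j` is only a guess based on the log format shown in this post.

```java
public class HazelcastDebugProps {

    // Sets the JVM system properties that Hazelcast reads when an instance starts.
    // loggingType is an assumption; documented values include jdk, log4j, log4j2, slf4j.
    static void enableDiagnostics(String loggingType) {
        System.setProperty("hazelcast.logging.type", loggingType);
        System.setProperty("hazelcast.diagnostics.enabled", "true");
    }

    public static void main(String[] args) {
        enableDiagnostics("log4j");
        System.out.println("hazelcast.diagnostics.enabled="
                + System.getProperty("hazelcast.diagnostics.enabled"));
        // Create the HazelcastInstance only after this point, so the
        // properties are already in place when the member starts.
    }
}
```

The same properties can be passed as `-D` flags on the command line instead. Raising the `com.hazelcast` logger to DEBUG in the logging framework's own configuration may additionally show membership changes around the time the exception appears.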
============Update===========================
I analyzed JVM memory usage.
It is not more than 55%.
Then I went through two months of logs.
The application started successfully and ran continuously for 5-6 days. I observed the following in the logs:
2021-11-04:-
2021-11-04 14:11:50,566 [hz._hzInstance_1_tpt-valuation_TEST02.cached.thread-73] WARN NioChannelOptions (log:46) - The configured tcp receive buffer size conflicts with the value actually being used by the socket and can lead to sub-optimal performance. Configured 1048576 bytes, actual 212992 bytes. On Linux look for kernel parameters 'net.ipv4.tcp_rmem' and 'net.core.rmem_max'.This warning will only be shown once.
2021-11-04 14:11:50,566 [hz._hzInstance_1_tpt-valuation_TEST02.cached.thread-73] WARN NioChannelOptions (log:46) - The configured tcp receive buffer size conflicts with the value actually being used by the socket and can lead to sub-optimal performance. Configured 1048576 bytes, actual 212992 bytes. On Linux look for kernel parameters 'net.ipv4.tcp_rmem' and 'net.core.rmem_max'.This warning will only be shown once.
2021-11-04 14:11:50,609 [hz._hzInstance_1_tpt-valuation_TEST02.migration] WARN MigrationManager (log:51) - [10.232.104.29]:44536 [tpt-valuation_TEST02] [3.12.10] partitionId=0 is completely lost!
2021-11-04 14:11:50,609 [hz._hzInstance_1_tpt-valuation_TEST02.migration] WARN MigrationManager (log:51) - [10.232.104.29]:44536 [tpt-valuation_TEST02] [3.12.10] partitionId=0 is completely lost!
2021-11-04 14:11:50,610 [hz._hzInstance_1_tpt-valuation_TEST02.migration] WARN MigrationManager (log:51) - [10.232.104.29]:44536 [tpt-valuation_TEST02] [3.12.10] partitionId=1 is completely lost!
2021-11-04 14:11:50,610 [hz._hzInstance_1_tpt-valuation_TEST02.migration] WARN MigrationManager (log:51) - [10.232.104.29]:44536 [tpt-valuation_TEST02] [3.12.10] partitionId=1 is completely lost!
2021-11-04 14:11:50,610 [hz._hzInstance_1_tpt-valuation_TEST02.migration] WARN MigrationManager (log:51) - [10.232.104.29]:44536 [tpt-valuation_TEST02] [3.12.10] partitionId=2 is completely lost!
2021-11-18:-
2021-11-18 19:13:20,434 [hz._hzInstance_1_tpt-valuation_TEST02.cached.thread-52] WARN NioChannelOptions (log:46) - The configured tcp receive buffer size conflicts with the value actually being used by the socket and can lead to sub-optimal performance. Configured 1048576 bytes, actual 212992 bytes. On Linux look for kernel parameters 'net.ipv4.tcp_rmem' and 'net.core.rmem_max'.This warning will only be shown once.
2021-11-18 19:13:20,434 [hz._hzInstance_1_tpt-valuation_TEST02.cached.thread-52] WARN NioChannelOptions (log:46) - The configured tcp receive buffer size conflicts with the value actually being used by the socket and can lead to sub-optimal performance. Configured 1048576 bytes, actual 212992 bytes. On Linux look for kernel parameters 'net.ipv4.tcp_rmem' and 'net.core.rmem_max'.This warning will only be shown once.
2021-11-18 19:13:20,478 [hz._hzInstance_1_tpt-valuation_TEST02.migration] WARN MigrationManager (log:51) - [10.232.104.29]:44536 [tpt-valuation_TEST02] [3.12.10] partitionId=0 is completely lost!
2021-11-18 19:13:20,478 [hz._hzInstance_1_tpt-valuation_TEST02.migration] WARN MigrationManager (log:51) - [10.232.104.29]:44536 [tpt-valuation_TEST02] [3.12.10] partitionId=0 is completely lost!
2021-11-18 19:13:20,479 [hz._hzInstance_1_tpt-valuation_TEST02.migration] WARN MigrationManager (log:51) - [10.232.104.29]:44536 [tpt-valuation_TEST02] [3.12.10] partitionId=1 is completely lost!
2021-11-18 19:13:20,479 [hz._hzInstance_1_tpt-valuation_TEST02.migration] WARN MigrationManager (log:51) - [10.232.104.29]:44536 [tpt-valuation_TEST02] [3.12.10] partitionId=1 is completely lost!
2021-11-18 19:13:20,479 [hz._hzInstance_1_tpt-valuation_TEST02.migration] WARN MigrationManager (log:51) - [10.232.104.29]:44536 [tpt-valuation_TEST02] [3.12.10] partitionId=2 is completely lost!
2021-11-18 19:13:20,479 [hz._hzInstance_1_tpt-valuation_TEST02.migration] WARN MigrationManager (log:51) - [10.232.104.29]:44536 [tpt-valuation_TEST02] [3.12.10] partitionId=2 is completely lost!
2021-11-18 19:13:20,479 [hz._hzInstance_1_tpt-valuation_TEST02.migration] WARN MigrationManager (log:51) - [10.232.104.29]:44536 [tpt-valuation_TEST02] [3.12.10] partitionId=3 is completely lost!
Only a single Hazelcast node is running.
Is there any specific reason for partitions being lost when a single node is running?

K8ssandra pod is replaying a large commit log and is not responding

We have a 3-node Cassandra 4 cluster. At some point (I don't know why) we get the following on one of the nodes:
CommitLog.java:173 - Replaying /opt/cassandra/data/commitlog/CommitLog-7-1674673652744.log
followed by a long list of commit logs.
We can see in the metrics that disk throughput was about 17 GB
During this time (the replaying node is unresponsive for almost 2 minutes) we see the following on the other 2 nodes:
NoSpamLogger.java:98 - /20.9.1.45:7000->prod-k8ssandra-seed-service/20.9.0.242:7000-SMALL_MESSAGES-[no-channel] failed to connect
java.nio.channels.ClosedChannelException: null
at org.apache.cassandra.net.OutboundConnectionInitiator$Handler.channelInactive(OutboundConnectionInitiator.java:248)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1405)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:901)
at io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:819)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Unknown Source)
Questions:
What is the reason for this commit log replay?
Can we mitigate this node outage risk?
Update:
It seems the node restart was initiated by K8ssandra, which can explain the replay. What is the reason for the HTTP 500? I can't seem to see an
INFO [nioEventLoopGroup-2-2] 2023-01-25 19:07:10,694 Cli.java:617 - address=/127.0.0.6:53027 url=/api/v0/probes/liveness status=200 OK
INFO [nioEventLoopGroup-2-1] 2023-01-25 19:07:12,698 Cli.java:617 - address=http url=/api/v0/probes/readiness status=500 Internal Server Error
INFO [epollEventLoopGroup-38-1] 2023-01-25 19:07:20,700 Clock.java:47 - Using native clock for microsecond precision
WARN [epollEventLoopGroup-38-2] 2023-01-25 19:07:20,701 AbstractBootstrap.java:452 - Unknown channel option 'TCP_NODELAY' for channel '[id: 0x919a5c8b]'
WARN [epollEventLoopGroup-38-2] 2023-01-25 19:07:20,703 Loggers.java:39 - [s33] Error connecting to Node(endPoint=/tmp/cassandra.sock, hostId=null, hashCode=71aac1d0), trying next node (AnnotatedConnectException: connect(..) failed: Connection refused: /tmp/cassandra.sock)
INFO [nioEventLoopGroup-2-2] 2023-01-25 19:07:20,703 Cli.java:617 - address=/127.0.0.6:51773 url=/api/v0/probes/readiness status=500 Internal Server Error
INFO [epollEventLoopGroup-39-1] 2023-01-25 19:07:25,393 Clock.java:47 - Using native clock for microsecond precision
WARN [epollEventLoopGroup-39-2] 2023-01-25 19:07:25,394 AbstractBootstrap.java:452 - Unknown channel option 'TCP_NODELAY' for channel '[id: 0x80b52436]'
WARN [epollEventLoopGroup-39-2] 2023-01-25 19:07:25,395 Loggers.java:39 - [s34] Error connecting to Node(endPoint=/tmp/cassandra.sock, hostId=null, hashCode=cc8ec36), trying next node (AnnotatedConnectException: connect(..) failed: Connection refused: /tmp/cassandra.sock)
INFO [pool-2-thread-1] 2023-01-25 19:07:25,602 LifecycleResources.java:186 - Started Cassandra
When Cassandra doesn't shut down cleanly, it doesn't have a chance to persist the contents of the memtables to disk, so when it is restarted, Cassandra replays the commit logs to repopulate the memtables.
It seems like you're confusing cause and effect. The K8ssandra operator restarted the pod because it was unresponsive -- the restart is the effect, not the cause.
You will need to review the Cassandra logs on the pod for clues as to why it became unresponsive. From your description that there was a large commitlog replayed on restart, I would suspect that there was a lot of traffic to the cluster (a large commitlog is a result of lots of writes) and an overloaded node would explain why it became unresponsive. Again, you will need to review the logs to determine the cause.
K8ssandra monitors the pods using "liveness" and "readiness" probes (aka health checks) and the HTTP 500 error would have been a result of the node being unresponsive. This would have triggered the operator to initiate a restart of the pod to automatically recover it. Cheers!

cassandra ec2 node is shutting down Caused by: java.nio.file.FileSystemException

This is the log I see
Caused by: java.nio.file.FileSystemException: /var/lib/cassandra/data/dev_fortis_mtd/explanationofbenefit-5fb6576031e511ec8611d5b080c74d01/snapshots/dropped-1666726203042-explanationofbenefit/mc-1-big-Summary.db -> /var/lib/cassandra/data/dev_fortis_mtd/explanationofbenefit-5fb6576031e511ec8611d5b080c74d01/mc-1-big-Summary.db: Operation not permitted
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:91) ~[na:1.8.0_342]
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) ~[na:1.8.0_342]
at sun.nio.fs.UnixFileSystemProvider.createLink(UnixFileSystemProvider.java:476) ~[na:1.8.0_342]
at java.nio.file.Files.createLink(Files.java:1086) ~[na:1.8.0_342]
at org.apache.cassandra.io.util.FileUtils.createHardLink(FileUtils.java:100) ~[apache-cassandra-3.11.11.jar:3.11.11]
... 23 common frames omitted
ERROR [InternalResponseStage:6] 2022-10-25 19:30:03,044 DefaultFSErrorHandler.java:66 - Stopping transports as disk_failure_policy is stop
ERROR [InternalResponseStage:6] 2022-10-25 19:30:03,044 StorageService.java:518 - Stopping gossiper
WARN [InternalResponseStage:6] 2022-10-25 19:30:03,044 StorageService.java:360 - Stopping gossip by operator request
INFO [InternalResponseStage:6] 2022-10-25 19:30:03,044 Gossiper.java:1683 - Announcing shutdown
INFO [InternalResponseStage:6] 2022-10-25 19:30:03,046 StorageService.java:2480 - Node /172.X.X.X state jump to shutdown
This looks like a problem with mc-1-big-Summary.db for the dev_fortis_mtd.explanationofbenefit table. Can you check the data directory to see whether mc-1... has a complete SSTable set? If not, can you remove the incomplete set and then repair this table to pull the data from another node in the cluster?
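The completeness check can be sketched roughly as below. This is only an illustration: the component list is an assumption about what a Cassandra 3.11 "mc"-format SSTable normally contains (CompressionInfo.db, for instance, is only present when compression is enabled), and the class and method names are hypothetical.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SSTableSetCheck {

    // Assumed component suffixes for one SSTable generation (e.g. "mc-1-big").
    static final List<String> EXPECTED = Arrays.asList(
            "Data.db", "Index.db", "Filter.db", "Summary.db",
            "Statistics.db", "CompressionInfo.db", "Digest.crc32", "TOC.txt");

    // Returns the expected components that are NOT present in tableDir.
    static List<String> missingComponents(Path tableDir, String prefix) {
        List<String> missing = new ArrayList<>();
        for (String component : EXPECTED) {
            if (!Files.exists(tableDir.resolve(prefix + "-" + component))) {
                missing.add(component);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // args[0]: table data directory, args[1]: SSTable prefix such as "mc-1-big"
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        String prefix = args.length > 1 ? args[1] : "mc-1-big";
        System.out.println("Missing components for " + prefix + ": "
                + missingComponents(dir, prefix));
    }
}
```

An empty list suggests the set is complete under these assumptions. Anything else is a hint to inspect the directory by hand before removing files.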

Error committing 10k records to Janus graph with cassandra

I'm fetching around 10 million records from an Oracle DB and trying to persist them to JanusGraph with Cassandra as the storage backend (using the Spark framework).
When I iterate over the records in a loop and commit every 10k, I get the error below:
ERROR StandardJanusGraph: Could not commit transaction [1] due to storage exception in commit
org.janusgraph.core.JanusGraphException: Could not execute operation due to backend exception
When I fetch only the first 100,000 (1 lakh) records from Oracle and commit every 1k, it works fine.
Can someone help me resolve this error? I appreciate your help. Thank you!
Update:
WARN [ReadStage-3] 2019-09-29 08:39:28,327 AbstractLocalAwareExecutorService.java:167 - Uncaught exception on thread Thread[ReadStage-3,5,main]: {}
WARN [MemtableFlushWriter:17] 2019-09-29 09:09:40,843 NativeLibrary.java:304 - open(/var/lib/cassandra/data/circuit_equipment/system_properties-eeef4cb0e29711e9af61a34111381c19, O_RDONLY) failed, errno (2).
ERROR [MemtableFlushWriter:17] 2019-09-29 09:09:40,846 LogTransaction.java:272 - Transaction log [md_txn_flush_de900e80-e298-11e9-af61-a34111381c19.log in /var/lib/cassandra/data/circuit_equipment/system_properties-eeef4cb0e29711e9af61a34111381c19] indicates txn was not completed, trying to abort it now
ERROR [MemtableFlushWriter:17] 2019-09-29 09:09:40,847 LogTransaction.java:275 - Failed to abort transaction log [md_txn_flush_de900e80-e298-11e9-af61-a34111381c19.log in /var/lib/cassandra/data/circuit_equipment/system_properties-eeef4cb0e29711e9af61a34111381c19]
ERROR [MemtableFlushWriter:17] 2019-09-29 09:09:40,848 LogTransaction.java:222 - Unable to delete /var/lib/cassandra/data/circuit_equipment/system_properties-eeef4cb0e29711e9af61a34111381c19/md_txn_flush_de900e80-e298-11e9-af61-a34111381c19.log as it does not exist, see debug log file for stack trace
ERROR [MemtablePostFlush:9] 2019-09-29 09:09:40,849 CassandraDaemon.java:228 - Exception in thread Thread[MemtablePostFlush:9,5,main]
WARN [StorageServiceShutdownHook] 2019-09-29 09:09:40,849 StorageService.java:4591 - Caught exception while waiting for memtable flushes during shutdown hook
ERROR [StorageServiceShutdownHook] 2019-09-29 09:09:40,931 AbstractCommitLogSegmentManager.java:308 - Failed to force-recycle all segments; at least one segment is still in use with dirty CFs.
WARN [main] 2019-09-29 09:09:44,580 NativeLibrary.java:187 - Unable to lock JVM memory (ENOMEM). This can result in part of the JVM being swapped out, especially with mmapped I/O enabled. Increase RLIMIT_MEMLOCK or run Cassandra as root.
WARN [main] 2019-09-29 09:09:44,581 StartupChecks.java:169 - JMX is not enabled to receive remote connections. Please see cassandra-env.sh for more info.
WARN [main] 2019-09-29 09:09:44,591 SigarLibrary.java:174 - Cassandra server running in degraded mode. Is swap disabled? : false, Address space adequate? : true, nofile limit adequate? : true, nproc limit adequate? : true
WARN [main] 2019-09-29 09:09:44,593 StartupChecks.java:311 - Maximum number of memory map areas per process (vm.max_map_count) 65530 is too low, recommended value: 1048575, you can change it with sysctl.
WARN [Native-Transport-Requests-1] 2019-09-29 09:12:12,841 CompressionParams.java:383 - The sstable_compression option has been deprecated. You should use class instead
WARN [Native-Transport-Requests-1] 2019-09-29 09:12:12,842 CompressionParams.java:334 - The chunk_length_kb option has been deprecated. You should use chunk_length_in_kb instead
WARN [main] 2019-09-29 12:59:57,584 NativeLibrary.java:187 - Unable to lock JVM memory (ENOMEM). This can result in part of the JVM being swapped out, especially with mmapped I/O enabled. Increase RLIMIT_MEMLOCK or run Cassandra as root.
WARN [main] 2019-09-29 12:59:57,585 StartupChecks.java:169 - JMX is not enabled to receive remote connections. Please see cassandra-env.sh for more info.
WARN [main] 2019-09-29 12:59:57,599 SigarLibrary.java:174 - Cassandra server running in degraded mode. Is swap disabled? : false, Address space adequate? : true, nofile limit adequate? : true, nproc limit adequate? : true
WARN [main] 2019-09-29 12:59:57,602 StartupChecks.java:311 - Maximum number of memory map areas per process (vm.max_map_count) 65530 is too low, recommended value: 1048575, you can change it with sysctl.
From these messages, you should disable swap (this is actually one of the main recommendations in Cassandra):
WARN [main] 2019-09-29 09:09:44,580 NativeLibrary.java:187 - Unable to lock JVM memory (ENOMEM). This can result in part of the JVM being swapped out, especially with mmapped I/O enabled. Increase RLIMIT_MEMLOCK or run Cassandra as root.
WARN [main] 2019-09-29 09:09:44,591 SigarLibrary.java:174 - Cassandra server running in degraded mode. Is swap disabled? : false, Address space adequate? : true, nofile limit adequate? : true, nproc limit adequate? : true
You should also raise vm.max_map_count, as the message below indicates. You can use this guide to set the other values for production environments.
WARN [main] 2019-09-29 12:59:57,602 StartupChecks.java:311 - Maximum number of memory map areas per process (vm.max_map_count) 65530 is too low, recommended value: 1048575, you can change it with sysctl.
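As a quick way to confirm the current setting on a Linux host, something like the following could be used. It is a minimal sketch that reads `/proc/sys/vm/max_map_count` and compares it with the recommended value quoted in the warning (1048575). The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class MaxMapCountCheck {

    // Recommended value taken from the Cassandra startup warning above.
    static final long RECOMMENDED = 1_048_575L;

    // True when the raw sysctl value is below the recommended value.
    static boolean isTooLow(String raw) {
        return Long.parseLong(raw.trim()) < RECOMMENDED;
    }

    public static void main(String[] args) throws Exception {
        Path sysctlFile = Paths.get("/proc/sys/vm/max_map_count");
        if (!Files.isReadable(sysctlFile)) {
            System.out.println("vm.max_map_count is not readable on this system");
            return;
        }
        String raw = new String(Files.readAllBytes(sysctlFile), StandardCharsets.UTF_8);
        System.out.println("vm.max_map_count = " + raw.trim()
                + (isTooLow(raw) ? " (too low)" : " (ok)"));
    }
}
```

The value itself is changed with sysctl, as the warning says. This check only reports what the kernel currently uses.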

Streaming error occurred org.apache.cassandra.io.FSReadError: java.io.IOException: Broken pipe

I am trying to add a node to the cluster. Adding the new node fails with a broken pipe, and Cassandra fails within 2 minutes of starting. I removed the node from the ring, and adding it back fails as well.
OS info: 4.4.0-59-generic #80-Ubuntu SMP x86_64 x86_64 x86_64 GNU/Linux.
This is the error I get on the node that I am trying to bootstrap.
Cassandra version: 2.2.7. I am getting a Broken pipe exception.
ERROR [STREAM-OUT-/123.120.56.71] 2017-04-10 23:46:15,410 StreamSession.java:532 - Stream #cbb7a150-1e47-11e7-a556-a98ec456f4de Streaming error occurred
org.apache.cassandra.io.FSReadError: java.io.IOException: Broken pipe
at org.apache.cassandra.io.util.ChannelProxy.transferTo(ChannelProxy.java:144) ~[apache-cassandra-2.2.7.jar:2.2.7]
at org.apache.cassandra.streaming.compress.CompressedStreamWriter$1.apply(CompressedStreamWriter.java:91) ~[apache-cassandra-2.2.7.jar:2.2.7]
at org.apache.cassandra.streaming.compress.CompressedStreamWriter$1.apply(CompressedStreamWriter.java:88) ~[apache-cassandra-2.2.7.jar:2.2.7]
at org.apache.cassandra.io.util.BufferedDataOutputStreamPlus.applyToChannel(BufferedDataOutputStreamPlus.java:297) ~[apache-cassandra-2.2.7.jar:2.2.7]
at org.apache.cassandra.streaming.compress.CompressedStreamWriter.write(CompressedStreamWriter.java:87) ~[apache-cassandra-2.2.7.jar:2.2.7]
at org.apache.cassandra.streaming.messages.OutgoingFileMessage.serialize(OutgoingFileMessage.java:90) ~[apache-cassandra-2.2.7.jar:2.2.7]
at org.apache.cassandra.streaming.messages.OutgoingFileMessage$1.serialize(OutgoingFileMessage.java:48) ~[apache-cassandra-2.2.7.jar:2.2.7]
at org.apache.cassandra.streaming.messages.OutgoingFileMessage$1.serialize(OutgoingFileMessage.java:40) ~[apache-cassandra-2.2.7.jar:2.2.7]
at org.apache.cassandra.streaming.messages.StreamMessage.serialize(StreamMessage.java:47) ~[apache-cassandra-2.2.7.jar:2.2.7]
at org.apache.cassandra.streaming.ConnectionHandler$OutgoingMessageHandler.sendMessage(ConnectionHandler.java:389) ~[apache-cassandra-2.2.7.jar:2.2.7]
at org.apache.cassandra.streaming.ConnectionHandler$OutgoingMessageHandler.run(ConnectionHandler.java:361) ~[apache-cassandra-2.2.7.jar:2.2.7]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_101]
Caused by: java.io.IOException: Broken pipe
at sun.nio.ch.FileChannelImpl.transferTo0(Native Method) ~[na:1.8.0_101]
at sun.nio.ch.FileChannelImpl.transferToDirectlyInternal(FileChannelImpl.java:428) ~[na:1.8.0_101]
at sun.nio.ch.FileChannelImpl.transferToDirectly(FileChannelImpl.java:493) ~[na:1.8.0_101]
at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:608) ~[na:1.8.0_101]
at org.apache.cassandra.io.util.ChannelProxy.transferTo(ChannelProxy.java:140) ~[apache-cassandra-2.2.7.jar:2.2.7]
... 11 common frames omitted
INFO [STREAM-OUT-/123.120.56.71] 2017-04-10 23:46:15,424 StreamResultFuture.java:183 - Stream #cbb7a150-1e47-11e7-a556-a98ec456f4de Session with /123.120.56.71 is complete
WARN [STREAM-OUT-/123.120.56.71] 2017-04-10 23:46:15,425 StreamResultFuture.java:210 - Stream #cbb7a150-1e47-11e7-a556-a98ec456f4de Stream failed
This can be due to corrupted data, a wrong SSL configuration, schema disagreement, or network failures.
It looks like you have corrupted data or a schema disagreement, so try the following:
1) Remove all the data from your data and commitlog directories, and then try to start the node.
2) If that doesn't help, try to start with auto_bootstrap: false in cassandra.yaml. After the node is up, run nodetool rebuild.
If it fails, please attach all the errors here.

Restart dead accumulo tablet server

I use a single-instance Accumulo database. It all worked fine until I tried to ingest multiple data sets (following this tutorial); then my tablet server died.
I tried to restart it (using bin/start-all or bin/start-here) but it did not work. Then I restarted the whole server, and it seems that bin/start-all starts the tablet server first:
WARN : Using Zookeeper /root/Installs/zookeeper-3.4.6/zookeeper-3.4.6. Use version 3.3.0 or greater to avoid zookeeper deadlock bug.
Starting monitor on localhost
WARN : Max open files on localhost is 1024, recommend 32768
Starting tablet servers .... done
Starting tablet server on 46.101.229.80
WARN : Max open files on 46.101.229.80 is 1024, recommend 32768
OpenJDK Client VM warning: You have loaded library /root/Installs/hadoop-2.6.0/lib/native/libhadoop.so.1.0.0 which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
2016-01-27 04:44:18,778 [util.NativeCodeLoader] WARN : Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2016-01-27 04:44:23,770 [fs.VolumeManagerImpl] WARN : dfs.datanode.synconclose set to false in hdfs-site.xml: data loss is possible on hard system reset or power loss
2016-01-27 04:44:23,803 [server.Accumulo] INFO : Attempting to talk to zookeeper
2016-01-27 04:44:24,246 [server.Accumulo] INFO : ZooKeeper connected and initialized, attempting to talk to HDFS
2016-01-27 04:44:24,802 [server.Accumulo] INFO : Connected to HDFS
Starting master on 46.101.229.80
WARN : Max open files on 46.101.229.80 is 1024, recommend 32768
Starting garbage collector on 46.101.229.80
WARN : Max open files on 46.101.229.80 is 1024, recommend 32768
Starting tracer on 46.101.229.80
WARN : Max open files on 46.101.229.80 is 1024, recommend 32768
But according to the monitor, the tablet server is still dead.
The tserver_46.101.229.80.err log is empty; the tserver_46.101.229.80.out log says:
OpenJDK Client VM warning: You have loaded library /root/Installs/hadoop-2.6.0/lib/native/libhadoop.so.1.0.0 which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
#
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill -9 %p"
# Executing /bin/sh -c "kill -9 3225"...
How can I get the tablet server up again?
I use a 32-bit 14.04 Linux on DigitalOcean, Hadoop 2.6, ZooKeeper 3.4.6 and Accumulo 1.6.4.
If the TabletServer is repeatedly crashing with an OutOfMemoryError, you need to increase the JVM maximum heap size via the -Xmx option in the ACCUMULO_TSERVER_OPTS in accumulo-env.sh.
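To see what maximum heap a given -Xmx value actually yields on this machine, a tiny check like the one below can be run with the same option (for example `java -Xmx1g HeapCheck`, where `1g` is only an example value). The class name is hypothetical. Note that a 32-bit JVM, as used in the question, can only address a few gigabytes in total, so very large -Xmx values will be rejected.

```java
public class HeapCheck {

    // Maximum heap the running JVM will attempt to use, in MiB.
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMiB() + " MiB");
    }
}
```

If the reported value matches the intended setting, the same -Xmx can then be placed in ACCUMULO_TSERVER_OPTS.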
