K8ssandra pod is replaying a large commit log and is not responding - cassandra

We have a 3-node Cassandra 4 cluster. At some point (I don't know why) we see this in one of the nodes:
CommitLog.java:173 - Replaying /opt/cassandra/data/commitlog/CommitLog-7-1674673652744.log
followed by a long list of similar replay lines. The metrics show disk throughput of about 17 GB during the replay.
During this time we see the following in the other two nodes (the replaying node is unresponsive for almost 2 minutes):
NoSpamLogger.java:98 - /20.9.1.45:7000->prod-k8ssandra-seed-service/20.9.0.242:7000-SMALL_MESSAGES-[no-channel] failed to connect
java.nio.channels.ClosedChannelException: null
at org.apache.cassandra.net.OutboundConnectionInitiator$Handler.channelInactive(OutboundConnectionInitiator.java:248)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1405)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:901)
at io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:819)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Unknown Source)
Questions:
What is the reason for this commit log replay?
Can we mitigate this node outage risk?
Update:
It seems the restart of the node was initiated by K8ssandra... this can explain the replay, but what is the reason for the HTTP 500? I can't seem to see an
INFO [nioEventLoopGroup-2-2] 2023-01-25 19:07:10,694 Cli.java:617 - address=/127.0.0.6:53027 url=/api/v0/probes/liveness status=200 OK
INFO [nioEventLoopGroup-2-1] 2023-01-25 19:07:12,698 Cli.java:617 - address=http url=/api/v0/probes/readiness status=500 Internal Server Error
INFO [epollEventLoopGroup-38-1] 2023-01-25 19:07:20,700 Clock.java:47 - Using native clock for microsecond precision
WARN [epollEventLoopGroup-38-2] 2023-01-25 19:07:20,701 AbstractBootstrap.java:452 - Unknown channel option 'TCP_NODELAY' for channel '[id: 0x919a5c8b]'
WARN [epollEventLoopGroup-38-2] 2023-01-25 19:07:20,703 Loggers.java:39 - [s33] Error connecting to Node(endPoint=/tmp/cassandra.sock, hostId=null, hashCode=71aac1d0), trying next node (AnnotatedConnectException: connect(..) failed: Connection refused: /tmp/cassandra.sock)
INFO [nioEventLoopGroup-2-2] 2023-01-25 19:07:20,703 Cli.java:617 - address=/127.0.0.6:51773 url=/api/v0/probes/readiness status=500 Internal Server Error
INFO [epollEventLoopGroup-39-1] 2023-01-25 19:07:25,393 Clock.java:47 - Using native clock for microsecond precision
WARN [epollEventLoopGroup-39-2] 2023-01-25 19:07:25,394 AbstractBootstrap.java:452 - Unknown channel option 'TCP_NODELAY' for channel '[id: 0x80b52436]'
WARN [epollEventLoopGroup-39-2] 2023-01-25 19:07:25,395 Loggers.java:39 - [s34] Error connecting to Node(endPoint=/tmp/cassandra.sock, hostId=null, hashCode=cc8ec36), trying next node (AnnotatedConnectException: connect(..) failed: Connection refused: /tmp/cassandra.sock)
INFO [pool-2-thread-1] 2023-01-25 19:07:25,602 LifecycleResources.java:186 - Started Cassandra

When Cassandra doesn't shut down cleanly, it doesn't have a chance to persist the contents of the memtables to disk, so when it is restarted it replays the commit logs to repopulate the memtables.
It seems like you're confusing cause and effect. The K8ssandra operator restarted the pod because it was unresponsive -- the restart is the effect, not the cause.
You will need to review the Cassandra logs on the pod for clues as to why it became unresponsive. Given your description of a large commit log being replayed on restart, I would suspect there was a lot of traffic to the cluster (a large commit log is the result of lots of writes), and an overloaded node would explain why it became unresponsive. Again, you will need to review the logs to determine the cause.
K8ssandra monitors the pods using "liveness" and "readiness" probes (aka health checks), and the HTTP 500 errors would have been a result of the node being unresponsive. The failing probes are what triggered the operator to restart the pod to recover it automatically. Cheers!
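To reduce how much commit log must be replayed after a *planned* restart, one common practice is to drain the node first so memtables are flushed and the commit log segments are recycled. A minimal sketch using standard `nodetool` commands (the pod name here is a made-up example; substitute your own):

```shell
# Flush all memtables to SSTables and stop accepting writes; after a drain,
# a restart has (almost) nothing to replay from the commit log.
kubectl exec prod-k8ssandra-dc1-default-sts-0 -c cassandra -- nodetool drain
```

An unplanned crash will always replay whatever was dirty at the moment of the crash; the only lever there is keeping write volume (and therefore commit log size) within what the node can absorb.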

Related

How to set up Spark HA when ZooKeeper has a password?

spark.deploy.zookeeper.url
The introduction only covers connections without a ZooKeeper password:
https://spark.apache.org/docs/latest/configuration.html#deploy
If ZooKeeper has a password, how do I set up Spark HA? Thank you.
I tried to configure it like this, but got an error:
-Dspark.deploy.zookeeper.url=test:test123@172.28.1.43:2181
2023-02-08 16:16:53,448 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=test:test123@172.28.1.43:2181 sessionTimeout=60000 watcher=org.apache.curator.ConnectionState@8eb0a94
2023-02-08 16:17:03,495 WARN zookeeper.ClientCnxn: Session 0x0 for server test:test123@172.28.1.43:2181, unexpected error, closing socket connection and attempting reconnect
java.lang.IllegalArgumentException: Unable to canonicalize address test:test123@172.28.1.43:2181 because it's not resolvable
at org.apache.zookeeper.SaslServerPrincipal.getServerPrincipal(SaslServerPrincipal.java:65)
at org.apache.zookeeper.SaslServerPrincipal.getServerPrincipal(SaslServerPrincipal.java:41)
at org.apache.zookeeper.ClientCnxn$SendThread.startConnect(ClientCnxn.java:1001)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1060)

Cassandra EC2 node is shutting down: Caused by: java.nio.file.FileSystemException

This is the log I see
Caused by: java.nio.file.FileSystemException: /var/lib/cassandra/data/dev_fortis_mtd/explanationofbenefit-5fb6576031e511ec8611d5b080c74d01/snapshots/dropped-1666726203042-explanationofbenefit/mc-1-big-Summary.db -> /var/lib/cassandra/data/dev_fortis_mtd/explanationofbenefit-5fb6576031e511ec8611d5b080c74d01/mc-1-big-Summary.db: Operation not permitted
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:91) ~[na:1.8.0_342]
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) ~[na:1.8.0_342]
at sun.nio.fs.UnixFileSystemProvider.createLink(UnixFileSystemProvider.java:476) ~[na:1.8.0_342]
at java.nio.file.Files.createLink(Files.java:1086) ~[na:1.8.0_342]
at org.apache.cassandra.io.util.FileUtils.createHardLink(FileUtils.java:100) ~[apache-cassandra-3.11.11.jar:3.11.11]
... 23 common frames omitted
ERROR [InternalResponseStage:6] 2022-10-25 19:30:03,044 DefaultFSErrorHandler.java:66 - Stopping transports as disk_failure_policy is stop
ERROR [InternalResponseStage:6] 2022-10-25 19:30:03,044 StorageService.java:518 - Stopping gossiper
WARN [InternalResponseStage:6] 2022-10-25 19:30:03,044 StorageService.java:360 - Stopping gossip by operator request
INFO [InternalResponseStage:6] 2022-10-25 19:30:03,044 Gossiper.java:1683 - Announcing shutdown
INFO [InternalResponseStage:6] 2022-10-25 19:30:03,046 StorageService.java:2480 - Node /172.X.X.X state jump to shutdown
This looks like a problem with mc-1-big-Summary.db for the dev_fortis_mtd.explanationofbenefit table. Can you check the data directory to see if generation mc-1 has a complete SSTable set? If not, remove the incomplete set and then repair this table to pull the data back from another node in the cluster.
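As a rough sketch of that completeness check (`check_sstable_set` is a hypothetical helper; the core component files of an `mc`-format SSTable are assumed here to be Data, Index, Summary, Filter, Statistics and CompressionInfo, with TOC.txt and Digest.crc32 omitted for brevity):

```shell
# Print any missing core component files for SSTable generation 1
# of an 'mc'-format table directory.
check_sstable_set() {
  dir="$1"
  for comp in Data Index Summary Filter Statistics CompressionInfo; do
    [ -f "$dir/mc-1-big-$comp.db" ] || echo "missing: mc-1-big-$comp.db"
  done
}

# Example (table directory from the stack trace above):
# check_sstable_set /var/lib/cassandra/data/dev_fortis_mtd/explanationofbenefit-5fb6576031e511ec8611d5b080c74d01
```

If the set turns out to be incomplete, move those files aside and run `nodetool repair dev_fortis_mtd explanationofbenefit` to re-stream the data from a replica.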

Error committing 10k records to Janus graph with cassandra

I'm fetching around 10 million records from an Oracle DB and trying to persist them to JanusGraph with Cassandra as the storage backend [using the Spark framework].
When I iterate the records in a loop and try to commit every 10K, I get the error below:
ERROR StandardJanusGraph: Could not commit transaction [1] due to storage exception in commit
org.janusgraph.core.JanusGraphException: Could not execute operation due to backend exception
When I fetch only the first 100K (1 lakh) records from Oracle and commit every 1K, it works fine.
Can someone help me resolve this error? Appreciate your help. Thank you!!
Update:
WARN [ReadStage-3] 2019-09-29 08:39:28,327 AbstractLocalAwareExecutorService.java:167 - Uncaught exception on thread Thread[ReadStage-3,5,main]: {}
WARN [MemtableFlushWriter:17] 2019-09-29 09:09:40,843 NativeLibrary.java:304 - open(/var/lib/cassandra/data/circuit_equipment/system_properties-eeef4cb0e29711e9af61a34111381c19, O_RDONLY) failed, errno (2).
ERROR [MemtableFlushWriter:17] 2019-09-29 09:09:40,846 LogTransaction.java:272 - Transaction log [md_txn_flush_de900e80-e298-11e9-af61-a34111381c19.log in /var/lib/cassandra/data/circuit_equipment/system_properties-eeef4cb0e29711e9af61a34111381c19] indicates txn was not completed, trying to abort it now
ERROR [MemtableFlushWriter:17] 2019-09-29 09:09:40,847 LogTransaction.java:275 - Failed to abort transaction log [md_txn_flush_de900e80-e298-11e9-af61-a34111381c19.log in /var/lib/cassandra/data/circuit_equipment/system_properties-eeef4cb0e29711e9af61a34111381c19]
ERROR [MemtableFlushWriter:17] 2019-09-29 09:09:40,848 LogTransaction.java:222 - Unable to delete /var/lib/cassandra/data/circuit_equipment/system_properties-eeef4cb0e29711e9af61a34111381c19/md_txn_flush_de900e80-e298-11e9-af61-a34111381c19.log as it does not exist, see debug log file for stack trace
ERROR [MemtablePostFlush:9] 2019-09-29 09:09:40,849 CassandraDaemon.java:228 - Exception in thread Thread[MemtablePostFlush:9,5,main]
WARN [StorageServiceShutdownHook] 2019-09-29 09:09:40,849 StorageService.java:4591 - Caught exception while waiting for memtable flushes during shutdown hook
ERROR [StorageServiceShutdownHook] 2019-09-29 09:09:40,931 AbstractCommitLogSegmentManager.java:308 - Failed to force-recycle all segments; at least one segment is still in use with dirty CFs.
WARN [main] 2019-09-29 09:09:44,580 NativeLibrary.java:187 - Unable to lock JVM memory (ENOMEM). This can result in part of the JVM being swapped out, especially with mmapped I/O enabled. Increase RLIMIT_MEMLOCK or run Cassandra as root.
WARN [main] 2019-09-29 09:09:44,581 StartupChecks.java:169 - JMX is not enabled to receive remote connections. Please see cassandra-env.sh for more info.
WARN [main] 2019-09-29 09:09:44,591 SigarLibrary.java:174 - Cassandra server running in degraded mode. Is swap disabled? : false, Address space adequate? : true, nofile limit adequate? : true, nproc limit adequate? : true
WARN [main] 2019-09-29 09:09:44,593 StartupChecks.java:311 - Maximum number of memory map areas per process (vm.max_map_count) 65530 is too low, recommended value: 1048575, you can change it with sysctl.
WARN [Native-Transport-Requests-1] 2019-09-29 09:12:12,841 CompressionParams.java:383 - The sstable_compression option has been deprecated. You should use class instead
WARN [Native-Transport-Requests-1] 2019-09-29 09:12:12,842 CompressionParams.java:334 - The chunk_length_kb option has been deprecated. You should use chunk_length_in_kb instead
WARN [main] 2019-09-29 12:59:57,584 NativeLibrary.java:187 - Unable to lock JVM memory (ENOMEM). This can result in part of the JVM being swapped out, especially with mmapped I/O enabled. Increase RLIMIT_MEMLOCK or run Cassandra as root.
WARN [main] 2019-09-29 12:59:57,585 StartupChecks.java:169 - JMX is not enabled to receive remote connections. Please see cassandra-env.sh for more info.
WARN [main] 2019-09-29 12:59:57,599 SigarLibrary.java:174 - Cassandra server running in degraded mode. Is swap disabled? : false, Address space adequate? : true, nofile limit adequate? : true, nproc limit adequate? : true
WARN [main] 2019-09-29 12:59:57,602 StartupChecks.java:311 - Maximum number of memory map areas per process (vm.max_map_count) 65530 is too low, recommended value: 1048575, you can change it with sysctl.
root@f451df425ca8:/var/log/cassandra#
From these messages, you should disable swap (this is actually one of the main recommendations for running Cassandra):
WARN [main] 2019-09-29 09:09:44,580 NativeLibrary.java:187 - Unable to lock JVM memory (ENOMEM). This can result in part of the JVM being swapped out, especially with mmapped I/O enabled. Increase RLIMIT_MEMLOCK or run Cassandra as root.
WARN [main] 2019-09-29 09:09:44,591 SigarLibrary.java:174 - Cassandra server running in degraded mode. Is swap disabled? : false, Address space adequate? : true, nofile limit adequate? : true, nproc limit adequate? : true
You should also change vm.max_map_count, and you can use this guide to set the other values recommended for production environments. From this message:
WARN [main] 2019-09-29 12:59:57,602 StartupChecks.java:311 - Maximum number of memory map areas per process (vm.max_map_count) 65530 is too low, recommended value: 1048575, you can change it with sysctl.
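Concretely, the usual fixes look something like this (the `/etc/sysctl.d/99-cassandra.conf` filename is just a convention; adjust for your distribution). These are host-level settings, so no test run accompanies them:

```shell
# Disable swap now, and comment out swap entries so it stays off across reboots.
sudo swapoff -a
sudo sed -i '/ swap / s/^/#/' /etc/fstab

# Raise the memory-map limit to the value Cassandra recommends.
sudo sysctl -w vm.max_map_count=1048575
echo 'vm.max_map_count = 1048575' | sudo tee /etc/sysctl.d/99-cassandra.conf
```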

Cassandra 3.11 throwing error "Exiting due to error while processing commit log during initialization"

I installed Cassandra using brew on my Mac, and it was working fine for a few days. But now it has started throwing this error without any change to the yaml file:
Connection error: ('Unable to connect to any servers', {'127.0.0.1': error(61, "Tried connecting to [('127.0.0.1', 9042)]. Last error: Connection refused")})
So I tried updating Cassandra to 3.11 using brew. Now while starting Cassandra I get this error:
ERROR [main] 2017-09-20 12:52:02,732 JVMStabilityInspector.java:82 - Exiting due to error while processing commit log during initialization.
org.apache.cassandra.db.commitlog.CommitLogReadHandler$CommitLogReadException: Encountered bad header at position 157007 of commit log /usr/local/var/lib/cassandra/commitlog/CommitLog-6-1505888222471.log, with bad position but valid CRC
at org.apache.cassandra.db.commitlog.CommitLogSegmentReader$SegmentIterator.computeNext(CommitLogSegmentReader.java:113) [apache-cassandra-3.11.0.jar:3.11.0]
at org.apache.cassandra.db.commitlog.CommitLogSegmentReader$SegmentIterator.computeNext(CommitLogSegmentReader.java:84) [apache-cassandra-3.11.0.jar:3.11.0]
at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143) [guava-18.0.jar:na]
at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138) [guava-18.0.jar:na]
at org.apache.cassandra.db.commitlog.CommitLogReader.readCommitLogSegment(CommitLogReader.java:190) [apache-cassandra-3.11.0.jar:3.11.0]
at org.apache.cassandra.db.commitlog.CommitLogReader.readAllFiles(CommitLogReader.java:84) [apache-cassandra-3.11.0.jar:3.11.0]
at org.apache.cassandra.db.commitlog.CommitLogReplayer.replayFiles(CommitLogReplayer.java:140) [apache-cassandra-3.11.0.jar:3.11.0]
at org.apache.cassandra.db.commitlog.CommitLog.recoverFiles(CommitLog.java:177) [apache-cassandra-3.11.0.jar:3.11.0]
at org.apache.cassandra.db.commitlog.CommitLog.recoverSegmentsOnDisk(CommitLog.java:158) [apache-cassandra-3.11.0.jar:3.11.0]
at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:325) [apache-cassandra-3.11.0.jar:3.11.0]
at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:600) [apache-cassandra-3.11.0.jar:3.11.0]
at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:689) [apache-cassandra-3.11.0.jar:3.11.0]
From this link, Cassandra: Exiting due to error while processing commit log during initialization, I got some info about nodetool repair. But even nodetool repair is not working:
objc[15089]: Class JavaLaunchHelper is implemented in both /Library/Java/JavaVirtualMachines/jdk1.8.0_111.jdk/Contents/Home/bin/java (0x10934b4c0) and /Library/Java/JavaVirtualMachines/jdk1.8.0_111.jdk/Contents/Home/jre/lib/libinstrument.dylib (0x10abba4e0). One of the two will be used. Which one is undefined.
nodetool: Failed to connect to '127.0.0.1:7199' - ConnectException: 'Connection refused (Connection refused)'.
commit log /usr/local/var/lib/cassandra/commitlog/CommitLog-6-1505888222471.log
It sounds like one of your commit log files got corrupted. Remove that file, and restart.
"But even nodetool repair is not working."
I wouldn't worry about that. If you're on a single-node cluster (e.g. your own Mac), repair has no other nodes to stream data from, so it won't work anyway.
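A cautious version of "remove that file, and restart" for a Homebrew install is to move the bad segment aside rather than deleting it outright. Any writes that existed only in that segment are lost, which is usually acceptable on a local dev machine:

```shell
# Move the corrupted commit log segment (named in the error) out of the way.
mkdir -p ~/cassandra-commitlog-backup
mv /usr/local/var/lib/cassandra/commitlog/CommitLog-6-1505888222471.log ~/cassandra-commitlog-backup/

# Restart the brew-managed Cassandra service.
brew services restart cassandra
```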

NEO4J local server does not start

I am running Linux in VirtualBox and am having an issue that I did not encounter on my machine with Linux as the primary OS.
When launching the neo4j service via sudo ./neo4j start in /opt/neo4j-community-2.3.1/bin, I get a timeout with the message "Failed to start within 120 seconds. Neo4j Server may have failed to start, please check the logs".
My log from /opt/neo4j-community-2.3.1/data/graph.db/messages.log says:
http://pastebin.com/wUA715QQ
and data/log/console.log says:
2016-01-06 02:07:03.404+0100 INFO Successfully started database
2016-01-06 02:07:03.603+0100 INFO Successfully stopped database
2016-01-06 02:07:03.604+0100 INFO Successfully shutdown Neo4j Server
2016-01-06 02:07:03.608+0100 ERROR Failed to start Neo4j: Starting Neo4j failed: Component 'org.neo4j.server.security.auth.FileUserRepository@9ab182' was successfully initialized, but failed to start. Please see attached cause exception. Starting Neo4j failed: Component 'org.neo4j.server.security.auth.FileUserRepository@9ab182' was successfully initialized, but failed to start. Please see attached cause exception.
org.neo4j.server.ServerStartupException: Starting Neo4j failed: Component 'org.neo4j.server.security.auth.FileUserRepository@9ab182' was successfully initialized, but failed to start. Please see attached cause exception.
at org.neo4j.server.exception.ServerStartupErrors.translateToServerStartupError(ServerStartupErrors.java:67)
at org.neo4j.server.AbstractNeoServer.start(AbstractNeoServer.java:234)
at org.neo4j.server.Bootstrapper.start(Bootstrapper.java:97)
at org.neo4j.server.CommunityBootstrapper.start(CommunityBootstrapper.java:48)
at org.neo4j.server.CommunityBootstrapper.main(CommunityBootstrapper.java:35)
Caused by: org.neo4j.kernel.lifecycle.LifecycleException: Component 'org.neo4j.server.security.auth.FileUserRepository@9ab182' was successfully initialized, but failed to start. Please see attached cause exception.
at org.neo4j.kernel.lifecycle.LifeSupport$LifecycleInstance.start(LifeSupport.java:462)
at org.neo4j.kernel.lifecycle.LifeSupport.start(LifeSupport.java:111)
at org.neo4j.server.AbstractNeoServer.start(AbstractNeoServer.java:194)
... 3 more
Caused by: java.nio.file.AccessDeniedException: /opt/neo4j-community-2.3.1/data/dbms/auth
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:84)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214)
at java.nio.file.Files.newByteChannel(Files.java:361)
at java.nio.file.Files.newByteChannel(Files.java:407)
at java.nio.file.Files.readAllBytes(Files.java:3152)
at org.neo4j.server.security.auth.FileUserRepository.loadUsersFromFile(FileUserRepository.java:208)
at org.neo4j.server.security.auth.FileUserRepository.start(FileUserRepository.java:73)
at org.neo4j.kernel.lifecycle.LifeSupport$LifecycleInstance.start(LifeSupport.java:452)
... 5 more
Any idea why the server won't start?
Check the permissions on /opt/neo4j-community-2.3.1/data/dbms/auth
See the line that says:
Caused by: java.nio.file.AccessDeniedException: /opt/neo4j-community-2.3.1/data/dbms/auth
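For example (assuming the server normally runs as a dedicated `neo4j` user; running it once via `sudo ./neo4j start` may have left root-owned files behind that the service account can no longer read):

```shell
# Inspect who owns the auth file the exception points at...
ls -l /opt/neo4j-community-2.3.1/data/dbms/auth

# ...and hand the data directory back to the user the server runs as.
sudo chown -R neo4j:neo4j /opt/neo4j-community-2.3.1/data
```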
