Zookeeper Leader Auto Failover - apache-spark

We are using a Spark standalone cluster with 3 ZooKeeper nodes in HA mode. I am seeing this issue in zookeeper.log:
Exception causing close of session 0x0 due to java.io.IOException: Len error 1195725856
Closed socket connection for client /10.23...... (no session established for client)
The ZooKeeper leader keeps failing over automatically from one server to another, and the Spark master then fails over as well.
Some clients are also getting continually disconnected and reconnected with this error.
How can this be fixed? (See the note after the full log below.)
Full Log:
[myid:3] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn#354] - Exception causing close of session 0x0 due to java.io.IOException: Len error 1195725856
[myid:3] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn#1001] - Closed socket connection for client /10....... (no session established for client)
[myid:3] - ERROR [LearnerHandler-/159.1.......:LearnerHandler#562] - Unexpected exception causing shutdown while sock still open
[myid:3] - WARN [LearnerHandler-/159.1......:LearnerHandler#575] - ******* GOODBYE /159.1..... ********
[myid:3] - INFO [WorkerReceiver[myid=3]:FastLeaderElection#542] - Notification: 1 (n.leader), 0x29000000ed (n.zxid), 0xa (n.round), LOOKING (n.state), 1 (n.sid), 0x29 (n.peerEPoch), LEADING (my state)
[myid:3] - INFO [LearnerHandler-/159........:LearnerHandler#263] - Follower sid: 1 : info : org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer#e144103
[myid:3] - INFO [LearnerHandler-/159.......LearnerHandler#318] - Synchronizing with Follower sid: 1 maxCommittedLog=0x29000000ed minCommittedLog=0x2800000007 peerLastZxid=0x29000000ed
[myid:3] - INFO [LearnerHandler-/159.......:LearnerHandler#395] - Sending DIFF
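One observation (not stated in the original post): the length in the "Len error" line decodes to printable ASCII, which usually means something other than a ZooKeeper client is talking to port 2181. A minimal sketch of the check in Java:

// 1195725856 == 0x47455420, i.e. the ASCII bytes "GET " read as a big-endian
// 4-byte frame length. In other words, something appears to be sending an HTTP
// request (e.g. a health check, load-balancer probe, or scanner) to the
// ZooKeeper client port, and ZooKeeper reads the first 4 bytes of the request
// line as an absurdly large packet length and closes the connection.
public class LenErrorDecode {
    public static void main(String[] args) {
        int len = 1195725856;
        byte[] bytes = {
            (byte) (len >>> 24),
            (byte) (len >>> 16),
            (byte) (len >>> 8),
            (byte) len
        };
        // Prints: "GET "
        System.out.println(new String(bytes, java.nio.charset.StandardCharsets.US_ASCII));
    }
}

If that is what is happening here, identifying the client at /10.23... that is sending HTTP to 2181 would be a reasonable place to start; the leader failovers themselves may have a separate cause, so treat this as a hint rather than a diagnosis.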

Related

K8ssandra pod is replaying a large commit log and is not responding

We have a 3-node Cassandra 4 cluster. At some point (I don't know why) we see this in one of the nodes:
CommitLog.java:173 - Replaying /opt/cassandra/data/commitlog/CommitLog-7-1674673652744.log
followed by a long list of similar log lines.
We can see in the metrics that disk throughput was about 17 GB.
During this time we see the following on the other 2 nodes (the node doing the replay is unresponsive for almost 2 minutes):
NoSpamLogger.java:98 - /20.9.1.45:7000->prod-k8ssandra-seed-service/20.9.0.242:7000-SMALL_MESSAGES-[no-channel] failed to connect
java.nio.channels.ClosedChannelException: null
at org.apache.cassandra.net.OutboundConnectionInitiator$Handler.channelInactive(OutboundConnectionInitiator.java:248)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1405)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:901)
at io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:819)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Unknown Source)
Questions:
What is the reason for this commit log replay?
Can we mitigate this node outage risk?
Update:
It seems the restart of the node was something initiated by K8ssandra... this can explain the replay. What is the reason for the HTTP 500? I can't seem to see an
INFO [nioEventLoopGroup-2-2] 2023-01-25 19:07:10,694 Cli.java:617 - address=/127.0.0.6:53027 url=/api/v0/probes/liveness status=200 OK
INFO [nioEventLoopGroup-2-1] 2023-01-25 19:07:12,698 Cli.java:617 - address=http url=/api/v0/probes/readiness status=500 Internal Server Error
INFO [epollEventLoopGroup-38-1] 2023-01-25 19:07:20,700 Clock.java:47 - Using native clock for microsecond precision
WARN [epollEventLoopGroup-38-2] 2023-01-25 19:07:20,701 AbstractBootstrap.java:452 - Unknown channel option 'TCP_NODELAY' for channel '[id: 0x919a5c8b]'
WARN [epollEventLoopGroup-38-2] 2023-01-25 19:07:20,703 Loggers.java:39 - [s33] Error connecting to Node(endPoint=/tmp/cassandra.sock, hostId=null, hashCode=71aac1d0), trying next node (AnnotatedConnectException: connect(..) failed: Connection refused: /tmp/cassandra.sock)
INFO [nioEventLoopGroup-2-2] 2023-01-25 19:07:20,703 Cli.java:617 - address=/127.0.0.6:51773 url=/api/v0/probes/readiness status=500 Internal Server Error
INFO [epollEventLoopGroup-39-1] 2023-01-25 19:07:25,393 Clock.java:47 - Using native clock for microsecond precision
WARN [epollEventLoopGroup-39-2] 2023-01-25 19:07:25,394 AbstractBootstrap.java:452 - Unknown channel option 'TCP_NODELAY' for channel '[id: 0x80b52436]'
WARN [epollEventLoopGroup-39-2] 2023-01-25 19:07:25,395 Loggers.java:39 - [s34] Error connecting to Node(endPoint=/tmp/cassandra.sock, hostId=null, hashCode=cc8ec36), trying next node (AnnotatedConnectException: connect(..) failed: Connection refused: /tmp/cassandra.sock)
INFO [pool-2-thread-1] 2023-01-25 19:07:25,602 LifecycleResources.java:186 - Started Cassandra
When Cassandra doesn't shut down cleanly, it doesn't get a chance to persist the contents of the memtables to disk, so when it is restarted, Cassandra replays the commit logs to repopulate the memtables.
It seems like you're confusing cause and effect. The K8ssandra operator restarted the pod because it was unresponsive -- the restart is the effect, not the cause.
You will need to review the Cassandra logs on the pod for clues as to why it became unresponsive. From your description that there was a large commitlog replayed on restart, I would suspect that there was a lot of traffic to the cluster (a large commitlog is a result of lots of writes) and an overloaded node would explain why it became unresponsive. Again, you will need to review the logs to determine the cause.
K8ssandra monitors the pods using "liveness" and "readiness" probes (aka health checks) and the HTTP 500 error would have been a result of the node being unresponsive. This would have triggered the operator to initiate a restart of the pod to automatically recover it. Cheers!
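A side note (not part of the original answer): the probe endpoints shown in the log can be queried by hand to see the same status the operator sees. A minimal sketch with Java 11's HttpClient, assuming the management API is reachable at 127.0.0.1:8080 from inside the pod (the host and port here are assumptions):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ReadinessCheck {
    public static void main(String[] args) throws Exception {
        // The /api/v0/probes/readiness path comes from the log above; a 200 means
        // the node is considered ready, while the 500s are what preceded the restart.
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://127.0.0.1:8080/api/v0/probes/readiness"))
                .GET()
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}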

How to set up Spark HA with a password-protected ZooKeeper?

spark.deploy.zookeeper.url
The documentation only introduces connections without a ZooKeeper password:
https://spark.apache.org/docs/latest/configuration.html#deploy
If ZooKeeper has a password, how should Spark HA be set up?
Thank you.
I tried to configure it like this, but got an error:
-Dspark.deploy.zookeeper.url=test:test123#172.28.1.43:2181
2023-02-08 16:16:53,448 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=test:test123#172.28.1.43:2181 sessionTimeout=60000 watcher=org.apache.curator.ConnectionState#8eb0a94
2023-02-08 16:17:03,495 WARN zookeeper.ClientCnxn: Session 0x0 for server test:test123#172.28.1.43:2181, unexpected error, closing socket connection and attempting reconnect
java.lang.IllegalArgumentException: Unable to canonicalize address test:test123#172.28.1.43:2181 because it's not resolvable
at org.apache.zookeeper.SaslServerPrincipal.getServerPrincipal(SaslServerPrincipal.java:65)
at org.apache.zookeeper.SaslServerPrincipal.getServerPrincipal(SaslServerPrincipal.java:41)
at org.apache.zookeeper.ClientCnxn$SendThread.startConnect(ClientCnxn.java:1001)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1060)
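For what it's worth (this is an observation, not an answer from the thread): the stack trace shows ZooKeeper trying to resolve the whole string test:test123#172.28.1.43 as a hostname, because the connect string may only contain host:port pairs; credentials cannot be embedded in spark.deploy.zookeeper.url. The log also shows the connection going through Apache Curator (watcher=org.apache.curator.ConnectionState). Purely as an illustration of where ZooKeeper digest credentials normally go in Curator-based code (a sketch, not a supported Spark setting):

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class CuratorDigestAuthSketch {
    public static void main(String[] args) throws Exception {
        // The connect string holds host:port pairs only -- no credentials.
        CuratorFramework client = CuratorFrameworkFactory.builder()
                .connectString("172.28.1.43:2181")
                // Digest credentials are passed separately, not inside the URL.
                .authorization("digest", "test:test123".getBytes())
                .retryPolicy(new ExponentialBackoffRetry(1000, 3))
                .build();
        client.start();
        client.blockUntilConnected();
        System.out.println("Connected with digest auth");
        client.close();
    }
}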

Add a Cassandra OSS 4.0 RC1 node into a cluster with DSE 6.0.14 nodes

Every node of the cluster is on DSE 6.0.14, with SSL enabled (listening on port 7001).
We're trying to add a node running open-source Cassandra 4.0 RC1.
We force the storage port on this node:
storage_port: 7001
otherwise the node tries to communicate on port 7000, which is closed.
We encountered the following error when trying to start the service on the new node:
INFO [main] 2021-05-10 16:22:00,985 StorageService.java:528 - Gathering node replacement information for /10.135.66.204:7001
DEBUG [main] 2021-05-10 16:22:00,986 YamlConfigurationLoader.java:112 - Loading settings from file:/etc/cassandra/default.conf/cassandra.yaml
DEBUG [main] 2021-05-10 16:22:00,996 YamlConfigurationLoader.java:112 - Loading settings from file:/etc/cassandra/default.conf/cassandra.yaml
INFO [Messaging-EventLoop-3-1] 2021-05-10 16:22:01,138 InboundConnectionInitiator.java:281 - peer /10.137.65.201:54916 only supports messaging versions lower (2) than this node supports (10)
ERROR [Messaging-EventLoop-3-2] 2021-05-10 16:22:01,237 NoSpamLogger.java:98 - /xx.xxx.xx.xxx:7001->/xx.xxx.xx.xxx:7001-URGENT_MESSAGES-[no-channel] failed to connect
java.nio.channels.ClosedChannelException: null
[...]
INFO [ScheduledTasks:1] 2021-05-10 16:22:02,398 TokenMetadata.java:525 - Updating topology for all endpoints that have changed
ERROR [Messaging-EventLoop-3-1] 2021-05-10 16:22:09,467 InboundConnectionInitiator.java:360 - Failed to properly handshake with peer /xx.xxx.xx.xxx:54922. Closing the channel.
java.lang.AssertionError: null
[...]
I don't know whether the error comes from a mistake in the configuration of the OSS 4.0 node or from an incompatibility between the new node's version and the version of the existing nodes in the cluster.

Hawkular: Could not connect to Cassandra DB on LocalHost

I'm trying to build a Hawkular server on Windows 7. Unfortunately this server works with a Cassandra DB. I've installed the newest version, and during the Hawkular startup process I get the following errors:
14:57:20,102 ERROR [org.hawkular.alerts.bus.log] (Thread-195 (ActiveMQ-client-global-threads-205387390)) HAWKALERT210009: Error accesing to DefinitionsService. Description: [java.lang.RuntimeException: Cassandra session is null]
14:57:19,843 WARN [org.hawkular.metrics.api.jaxrs.MetricsServiceLifecycle] (metricsservice-lifecycle-thread) HAWKMETRICS200004: [18] Retrying connecting to Cassandra cluster in [2]s...
14:57:19,839 ERROR [org.hawkular.alerts.engine.log] (EE-ManagedExecutorService-default-Thread-10) HAWKALERT220009: Definitions Service error in [Triggers]. Msg: [java.lang.RuntimeException: Cassandra session is null]
14:57:20,354 ERROR [org.hawkular.alerts.bus.log] (Thread-190 (ActiveMQ-client-global-threads-205387390)) HAWKALERT210009: Error accesing to DefinitionsService. Description: [com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /127.0.0.1:9042 (com.datastax.driver.core.exceptions.InvalidQueryException: unconfigured table schema_keyspaces))]
14:57:22,104 INFO [org.hawkular.inventory.impl.tinkerpop] (ServerService Thread Pool -- 111) HAWKINV001000: Using graph provider: org.hawkular.inventory.impl.tinkerpop.provider.TitanProvider
14:57:22,360 INFO [org.hawkular.metrics.api.jaxrs.MetricsServiceLifecycle] (metricsservice-lifecycle-thread) HAWKMETRICS200002: Initializing metrics service
14:57:22,396 WARN [org.hawkular.metrics.api.jaxrs.MetricsServiceLifecycle] (metricsservice-lifecycle-thread) HAWKMETRICS200003: Could not connect to Cassandra cluster - assuming its not up yet: All host(s) tried for query failed (tried: /127.0.0.1:9042 (com.datastax.driver.core.exceptions.InvalidQueryException: unconfigured table schema_keyspaces))
14:57:22,397 WARN [org.hawkular.metrics.api.jaxrs.MetricsServiceLifecycle] (metricsservice-lifecycle-thread) HAWKMETRICS200004: [19] Retrying connecting to Cassandra cluster in [3]s...
14:57:23,102 WARN [org.hawkular.inventory.cdi] (ServerService Thread Pool -- 111) HAWKINV003501: Inventory backend failed to initialize in an attempt 10 of 15 with message: Could not instantiate implementation: com.thinkaurelius.titan.diskstorage.cassandra.thrift.CassandraThriftStoreManager.
14:57:25,398 INFO [org.hawkular.metrics.api.jaxrs.MetricsServiceLifecycle] (metricsservice-lifecycle-thread) HAWKMETRICS200002: Initializing metrics service
14:57:25,438 WARN [org.hawkular.metrics.api.jaxrs.MetricsServiceLifecycle] (metricsservice-lifecycle-thread) HAWKMETRICS200003: Could not connect to Cassandra cluster - assuming its not up yet: All host(s) tried for query failed (tried: /127.0.0.1:9042 (com.datastax.driver.core.exceptions.InvalidQueryException: unconfigured table schema_keyspaces))
The Cassandra DB is online and I can connect to localhost:9160 with the Cassandra CQL shell, but the Hawkular server cannot connect. Have I forgotten something?
From the log, the port Hawkular is trying to use is 9042. Make sure of these settings in your cassandra.yaml:
start_native_transport: true
native_transport_port: 9042
The port you are using (9160) is the Thrift port.
I can only take a stab in the dark with such limited information.
However, you said you connected to localhost:9160, while the log mentions "127.0.0.1:9042".
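One way to narrow it down (an illustration, not from the original answer): the exceptions in the log come from the classic DataStax Java driver (com.datastax.driver.core), so a tiny standalone connection test against the native transport port shows whether 9042 is usable independently of Hawkular. A sketch assuming driver 3.x on the classpath:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class NativePortCheck {
    public static void main(String[] args) {
        // Hawkular speaks CQL over the native transport port (9042 by default);
        // 9160 is the Thrift port mentioned in the answer above.
        try (Cluster cluster = Cluster.builder()
                .addContactPoint("127.0.0.1")
                .withPort(9042)
                .build();
             Session session = cluster.connect()) {
            System.out.println("Connected to cluster: " + cluster.getClusterName());
        }
    }
}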

PouchDB Sync Error 500 / Couch DB 'Unknown peer'

When my local PouchDB is performing an initial (i.e. long) replication from a remote CouchDB, I sometimes get an error code 500 (Database encountered an unknown error) on my client, midway through the replication process.
If I look at the server logs I see:
[error] [<0.1304.0>] Unknown peer: {error,enotconn} for #Port<0.4817>
[error] [<0.1195.0>] Unknown peer: {error,enotconn} for #Port<0.4812>
[error] [<0.1314.0>] Unknown peer: {error,enotconn} for #Port<0.4826>
[error] [<0.1063.0>] Unknown peer: {error,enotconn} for #Port<0.4813>
Has anyone seen anything like this before?
Restarting the client by closing the tab entirely (sometimes more than once) results in the replication completing successfully.
