How to recover from a terminated pulsar bookie - apache-pulsar

We are running Apache Pulsar 2.7.2 in production with a 5-node (AWS r5ad.2xlarge) bookie cluster (BookKeeper 4.12.0). One of the nodes was terminated. As configured in our ASG, a new node came up and joined the cluster.
The Bookies have
autoRecoveryDaemonEnabled=true
lostBookieRecoveryDelay=0
bookkeeperClientMinNumRacksPerWriteQuorum=2
managedLedgerDefaultEnsembleSize=3
managedLedgerDefaultWriteQuorum=3
However, ledger re-replication did not take place. I tried decommissioning the terminated node using sudo /opt/apache-pulsar/apache-pulsar-2.7.2/bin/bookkeeper shell decommissionbookie -bookieid bookieIP:port but it was stuck at
23:53:36.465 [main] INFO org.apache.bookkeeper.client.BookKeeperAdmin - Count of Ledgers which need to be rereplicated: 793
00:03:37.293 [main] INFO org.apache.bookkeeper.client.BookKeeperAdmin - Count of Ledgers which need to be rereplicated: 793
00:13:38.119 [main] INFO org.apache.bookkeeper.client.BookKeeperAdmin - Count of Ledgers which need to be rereplicated: 793
00:23:39.194 [main] INFO org.apache.bookkeeper.client.BookKeeperAdmin - Count of Ledgers which need to be rereplicated: 793
00:33:39.995 [main] INFO org.apache.bookkeeper.client.BookKeeperAdmin - Count of Ledgers which need to be rereplicated: 793
for more than 30 minutes. We also listed the under-replicated ledgers with sh bookkeeper shell listunderreplicated and tried to read some of them with sh bookkeeper shell ledger -m, but that failed with an exception saying the terminated bookie could not be reached. We ended up deleting the under-replicated ledgers.
I am looking for suggestions on how best to recover from a terminated bookie without having to delete ledgers.

Now that Apache Pulsar 2.8.1 is out, can you upgrade and try again? The behavior you describe seems unusual.
To get access to all the Pulsar people in one location, sign up for the summit
https://streamnative.io/en/blog/community/2021-09-07-speakers-announced-for-pulsar-virtual-summit-europe-2021/

Related

Reaper failed to run repair on Cassandra nodes

After Reaper failed to run repair on 18 nodes of a Cassandra cluster, I ran a full repair on each node to fix the failed repair. After the full repair, Reaper executed successfully, but a few days later it failed again. I can see the following error in system.log:
ERROR [RMI TCP Connection(33673)-10.196.83.241] 2021-09-01 09:01:18,005 RepairRunnable.java:276 - Repair session 81540931-0b20-11ec-a7fa-8d6977dd3c87 for range [(-606604147644314041,-98440495518284645], (-3131564913406859309,-3010160047914391044]] failed with error Terminate session is called
java.io.IOException: Terminate session is called
at org.apache.cassandra.service.ActiveRepairService.terminateSessions(ActiveRepairService.java:191) ~[apache-cassandra-3.11.0.jar:3.11.0]
INFO [Native-Transport-Requests-2] 2021-09-01 09:02:52,020 Message.java:619 - Unexpected exception during request; channel = [id: 0x1e99a957, L:/10.196.18.230:9042 ! R:/10.254.252.33:62100]
io.netty.channel.unix.Errors$NativeIoException: readAddress() failed: Connection timed out
in nodetool tpstats I can see some pending tasks
Pool Name Active Pending
ReadStage 0 0
Repair#18 3 90
ValidationExecutor 3 3
Also in nodetool compactionstats there are 4 pending tasks:
-bash-4.2$ nodetool compactionstats
pending tasks: 4
- Main.visit: 1
- Main.post: 1
- Main.stream: 2
My question is: why is Reaper still failing even after a full repair, and what is the root cause of the pending repair tasks?
PS: The Reaper version is 2.2.3; I'm not sure whether this is a bug in Reaper.
You most likely don't have enough segments in your Reaper repair definition, or the default timeout (30 mins) is too low for your repair.
Segments (and the associated repair session) get terminated when they reach the timeout, in order to avoid stuck repairs. When tuned inappropriately, this can give the behavior you're observing.
Nodetool doesn't set a timeout on repairs, which explains why it passes there. The good news is that nothing will prevent repair from passing with Reaper once tuned correctly.
We're currently working on adaptive repairs to have Reaper deal with this situation automatically, but in the meantime you'll need to deal with this manually.
Check the list of segments in the UI and apply the following rule:
If you have less than 20% of segments failing, double the timeout by adjusting the hangingRepairTimeoutMins value in the config yaml.
If you have more than 20% of segments failing, double the number of segments.
Once repair passes at least twice, check the maximum duration of segments and further tune the number of segments to have them last at most 15 mins.
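The rule of thumb above can be sketched as a tiny decision helper. This is only an illustration: ReaperTuningRule and its Action enum are hypothetical names, not part of Reaper, and treating exactly 20% as the "more segments" case is an assumption, since the rule does not say what to do at the boundary.

```java
/** Hypothetical helper encoding the rule of thumb above; not part of Reaper. */
final class ReaperTuningRule {

    enum Action { NONE, DOUBLE_TIMEOUT, DOUBLE_SEGMENTS }

    /**
     * Suggests the next tuning step from the segment counts shown in the Reaper UI.
     * Assumption: exactly 20% failing is treated like "more than 20%".
     */
    static Action recommend(int failedSegments, int totalSegments) {
        if (totalSegments <= 0) {
            throw new IllegalArgumentException("totalSegments must be positive");
        }
        if (failedSegments < 0 || failedSegments > totalSegments) {
            throw new IllegalArgumentException("failedSegments out of range");
        }
        if (failedSegments == 0) {
            return Action.NONE; // nothing failed, nothing to tune
        }
        // failed / total < 0.2 is the same as failed * 5 < total (no floating point needed)
        boolean underTwentyPercent = (long) failedSegments * 5 < totalSegments;
        return underTwentyPercent ? Action.DOUBLE_TIMEOUT : Action.DOUBLE_SEGMENTS;
    }
}
```

For example, 10 failed segments out of 100 would suggest doubling hangingRepairTimeoutMins, while 30 out of 100 would suggest doubling the number of segments.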
Assuming you're not running Cassandra 4.0 yet, now that you have run repair through nodetool, you have sstables which are marked as repaired, as incremental repair would do. This creates a problem, because Reaper's repairs don't mark sstables as repaired and you now have two different sstable pools (repaired and unrepaired), which cannot be compacted together.
You'll need to use the sstablerepairedset tool to mark all sstables as unrepaired to put all sstables back in the same pool. Please read the documentation to learn how to achieve this.
A number of things could be going on, such as Reaper being unable to connect to the nodes via JMX (for whatever reason). It isn't possible to diagnose the problem with the limited information you've provided.
You'll need to check the Reaper logs for clues on the root cause.
As a side note, this isn't related to repairs and is a client/driver/app connecting to the node on the CQL port:
INFO [Native-Transport-Requests-2] 2021-09-01 09:02:52,020 Message.java:619 - Unexpected exception during request; channel = [id: 0x1e99a957, L:/10.196.18.230:9042 ! R:/10.254.252.33:62100]
io.netty.channel.unix.Errors$NativeIoException: readAddress() failed: Connection timed out
Cheers!

How to avoid query taking long time when adding new node to cluster

I found that a query (just a simple select query) takes a long time while a new node is being added to the cluster.
My execution time log:
17:49:40.008 [ThreadPoolTaskScheduler-14] INFO task.DiskCounting - void task.DiskCounting.runJob() executed in 8 ms
17:50:00.010 [ThreadPoolTaskScheduler-3] INFO task.DiskCounting - void task.DiskCounting.runJob() executed in 15010 ms
17:50:15.008 [ThreadPoolTaskScheduler-4] INFO task.DiskCounting - void task.DiskCounting.runJob() executed in 10008 ms
17:50:20.008 [ThreadPoolTaskScheduler-16] INFO task.DiskCounting - void task.DiskCounting.runJob() executed in 7 ms
Normally it takes about 10 ms, but it suddenly takes 15,000 ms when a node is being added.
I found that it gets stuck waiting for the new node to initialize its data.
Cassandra log (the new node):
INFO [HANDSHAKE-/194.187.1.52] 2019-05-31 17:49:36,056 OutboundTcpConnection.java:560 - Handshaking version with /194.187.1.52
INFO [GossipStage:1] 2019-05-31 17:49:36,059 Gossiper.java:1055 - Node /194.187.1.52 is now part of the cluster
INFO [RequestResponseStage-1] 2019-05-31 17:49:36,069 Gossiper.java:1019 - InetAddress /194.187.1.52 is now UP
INFO [GossipStage:1] 2019-05-31 17:49:36,109 TokenMetadata.java:479 - Updating topology for /194.187.1.52
INFO [GossipStage:1] 2019-05-31 17:49:36,109 TokenMetadata.java:479 - Updating topology for /194.187.1.52
INFO [MigrationStage:1] 2019-05-31 17:49:39,347 ViewManager.java:137 - Not submitting build tasks for views in keyspace system_traces as storage service is not initialized
INFO [MigrationStage:1] 2019-05-31 17:49:39,352 ColumnFamilyStore.java:411 - Initializing system_traces.events
INFO [MigrationStage:1] 2019-05-31 17:49:39,382 ColumnFamilyStore.java:411 - Initializing system_traces.sessions
It gets stuck at: Node /194.187.1.52 is now part of the cluster
The client then waits for the new node to initialize all of its data.
What I have tried:
1. I tried consistency levels ONE and QUORUM, and it made no difference.
2. I tried replication factors of 1, 2 and 3, and it still made no difference.
Why does a new node become part of the cluster before it has finished initializing its data?
Is there a way to solve this?
I expect that when I query an old node, performance is not affected by waiting for the new node to initialize its data.
Update: I resolved this problem.
My configuration was wrong: I had listed every node as a seed even before it joined the cluster, which caused read timeouts while a new node was being added.
After fixing this, all reads are normal, but I then found that insert queries timed out while a node was being added.
Finally, I tuned the following to avoid the insert timeouts:
/sbin/sysctl -w net.ipv4.tcp_keepalive_time=60 net.ipv4.tcp_keepalive_intvl=60 net.ipv4.tcp_keepalive_probes=5
and also changed the configuration to limit streaming throughput:
stream_throughput_outbound_megabits_per_sec : 100
Thanks a lot for the help.
This is just a theory, but one possible cause is that the new node is being chosen as a coordinator by your driver client. In that case, consistency level and replication aren't the main contributing factors to the delay in servicing your query.
If the new node is performing slowly at first for whatever reason and the driver is sending requests to it, the behavior of the coordinator can impact the servicing of your request.
What exactly is runJob doing? You suggested it is making a single query, but is it possible that it's a range query?
If it's a single query and it's taking as long as 10 seconds, that seems odd, as the default read timeout (read_request_timeout_in_ms) is 5 seconds. If it's a range query (a read involving multiple partitions), the default (range_request_timeout_in_ms) is 10 seconds. Are you adjusting those timeouts?
When you see responses that long for a single query, it could mean that the coordinator is what is impeding responsiveness. Otherwise, if the coordinator were responsive and the replicas were slow, you'd see a ReadTimeoutException returned to the client.
To better react to these cases, a number of client drivers implement a strategy called 'speculative execution'. As described in the documentation for the DataStax Java Driver for Apache Cassandra:
Sometimes a Cassandra node might be experiencing difficulties (ex: long GC pause) and take longer than usual to reply. Queries sent to that node will experience bad latency.
One thing we can do to improve that is pre-emptively start a second execution of the query against another node, before the first node has replied or errored out. If that second node replies faster, we can send the response back to the client (we also cancel the first execution – note that “cancelling” in this context simply means discarding the response when it arrives later, Cassandra does not support cancellation of in flight requests at this stage)
You could configure your driver to speculatively execute with a constant threshold for idempotent requests (such as reads). In the 3.x Java driver, it's done this way:
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.policies.ConstantSpeculativeExecutionPolicy;

Cluster cluster = Cluster.builder()
    .addContactPoint("127.0.0.1")
    .withSpeculativeExecutionPolicy(
        new ConstantSpeculativeExecutionPolicy(
            500, // delay (in ms) before a new execution is launched
            2    // maximum number of executions
        ))
    .build();
In this case, if the coordinator is slow to respond, after 500 ms the driver chooses another coordinator and submits a second request, and the first coordinator to respond wins.
Note that this might cause an amplification of requests sent to your cluster overall, so you want to tune that delay in such a way that it only kicks in when response time is highly anomalous. In your case, if requests normally take less than 10ms, 500ms is probably a reasonable number depending on what your higher percentile latencies look like.
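One rough way to ground that delay in data is to look at a high percentile of the latencies your client already records. The sketch below is an illustration only: SpeculativeDelayEstimator is a hypothetical helper, not part of the DataStax driver, and it uses a plain nearest-rank percentile over whatever latency samples (in milliseconds) you have collected.

```java
import java.util.Arrays;

/** Hypothetical helper, not part of the DataStax driver. */
final class SpeculativeDelayEstimator {

    /**
     * Returns the nearest-rank percentile of the given latency samples.
     * percentile must be in (0, 100]; the input array is not modified.
     */
    static long percentile(long[] latenciesMs, double percentile) {
        if (latenciesMs.length == 0) {
            throw new IllegalArgumentException("no latency samples");
        }
        if (percentile <= 0 || percentile > 100) {
            throw new IllegalArgumentException("percentile must be in (0, 100]");
        }
        long[] sorted = latenciesMs.clone(); // sort a copy, leave the caller's data alone
        Arrays.sort(sorted);
        // Nearest-rank method: smallest value with at least p% of samples at or below it.
        int rank = (int) Math.ceil(percentile / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }
}
```

If, say, the 99th percentile of your normal read latencies comes out well under 500 ms, then a 500 ms constant delay should only trigger for genuinely anomalous requests; if it comes out close to or above 500 ms, the delay is probably too aggressive.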
All that being said, if you are able to identify that the problem is the new node behaving poorly as a coordinator, it's worth understanding why. Adding speculative execution could be a nice way of working around the problem, but it's probably better to try to understand why the new node is performing so slowly. Having monitoring in place to observe Cassandra's metrics would likely give great visibility into the problem.
You can see this behaviour when the consistency level is too high or there are not enough copies of the data (replication factor). When a new node is added to the cluster, token ownership is rearranged. Once it has been determined which data the new node will own, that data is streamed to it, which may saturate the network.
In your question you don't mention the network settings or whether you are using cloud instances, both of which directly affect these limits. For example, an AWS m3.large instance has more restricted network capabilities than an i3.4xlarge.
Another variable to consider is disk configuration. If you are using your own hardware, check the I/O limits of your drives. If you are in the cloud, instance storage, when available, will perform better than external volumes such as AWS EBS; if you do use EBS, make sure to enable the "EBS optimized" option if your instance allows it.
Usually a RF of 3 with consistency level of Quorum should also help you to prevent the issue.

Cassandra node down...any ideas why?

I've put up a test cluster - four nodes. Severely underpowered(!) - ok CPU, only 2 gigs of ram, shared non ssd storage. Hey, it's test :)
I just kept it running for three days. No data going in or out..everything's just idle. Connected with opscenter.
This morning, we found one of the nodes went down around 2 am last night. The OS didn't go down (was responding to pings). The cassandra log around that time is:
INFO [MemtableFlushWriter:114] 2014-07-29 02:07:34,952 Memtable.java:360 - Completed flushing /var/lib/cassandra/system/sstable_activity-5a1ff267ace03f128563cfae6103c65e/system-sstable_activity-ka-107-Data.db (686 bytes) for commitlog position ReplayPosition(segmentId=1406304454537, position=29042136)
INFO [ScheduledTasks:1] 2014-07-29 02:08:24,227 GCInspector.java:116 - GC for ParNew: 276 ms for 1 collections, 648591696 used; max is 1040187392
Next entry is:
INFO [main] 2014-07-29 09:18:41,661 CassandraDaemon.java:102 - Hostname: xxxxx
i.e. when we restarted the node through opscenter.
Does that mean it crashed on GC, or that GC finished and something else crashed? Is there some other log I should be looking at?
Note: In opscenter eventlog, we see this:
7/29/2014, 2:15am Warning Node reported as being down: xxxxxxx
I appreciate the nodes are underpowered, but for being completely idle, it shouldn't crash, should it?
Using 2.1.0-rc4 btw.
My guess is that your node was shut down by the OOM killer. Because Linux overcommits RAM, when the system is under heavy memory pressure the kernel may kill applications to reclaim memory for the OS. With 2 GB of total RAM this can happen very easily.

Index initializer warning during bootstrap

I'm trying to simultaneously add 4 nodes to my current 2-node DC. I have Vnodes turned off as per Datastax suggestion. Right after the major index build in each node, the following warning is printed several times in the logs:
WARN [SolrSecondaryIndex ks.cf index initializer.] 2014-06-20 09:39:59,904 CassandraUtil.java (line 108) Error Operation timed out - received only 3 responses. on attempt 1 out of 4 with CL QUORUM...
I understand what it means. But why is Cassandra expecting the nodes to fulfill the CL when these nodes are still bootstrapping? More importantly, how does the warning affect the bootstrap? I noticed that the nodes are not doing any index build or streaming anymore; but they also remained in "Active - Joining" state. Is there any chance that they will finish? What should I do?
I'm using DSE 4.0.3. All existing and new nodes in the DC are Search nodes. I pre-computed the tokens using the Python program for Murmur3Partitioner.
EDIT:
Although nodetool compactionstats does not show any on-going index build in the nodes, for some reason, I still see a lot of these lines in the logs:
INFO [IndexPool backpressure thread-0] 2014-06-20 12:30:31,346 IndexPool.java (line 472) Throttling at 26 index requests per second with target total queue size at 40
INFO [IndexPool backpressure thread-0] 2014-06-20 12:30:34,169 IndexPool.java (line 428) Back pressure is active with total index queue size 18586 and average processing time 2770
EDIT:
Interestingly, I found the following lines in each node after digging through the log files:
INFO [main] 2014-06-20 09:39:48,588 StorageService.java (line 1036) Bootstrap completed! for the tokens [node token]
INFO [SolrSecondaryIndex ks.cf index initializer.] 2014-06-20 11:32:07,833 AbstractSolrSecondaryIndex.java (line 411) Reindexing 1417116631 commit log updates for core ks.cf
Based on these lines, I feel a lot safer that the bootstrap actually completed and that the nodes are simply re-indexing their data. I don't know, though, why the re-indexing process is not being shown in nodetool compactionstats.
It appears the bootstrap completed, and the DSE Search system is running normally.
"why the re-indexing process is not being shown in nodetool compactionstats"
DSE Search is not generally exposed via Cassandra command line tools. The log output should show the indexing as having completed; were you able to verify that?

Timeout cassandra hector

I've started working with Cassandra. I downloaded Cassandra (1.1.1) to my Windows PC and started it. Everything works fine.
I then began to reimplement an old application (in Java, using Hector 1.1) which imports about 200,000,000 records from 4 tables, which should be inserted into 4 column families. After importing about 2,000,000 records I get a timeout exception and Cassandra stops responding to requests:
2012-07-03 15:35:43,299 WARN - Could not fullfill request on this host CassandraClient<localhost:9160-16>
2012-07-03 15:35:43,300 WARN - Exception: me.prettyprint.hector.api.exceptions.HTimedOutException: TimedOutException()
....
Caused by: TimedOutException()
at org.apache.cassandra.thrift.Cassandra$batch_mutate_result.read(Cassandra.java:20269)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
at org.apache.cassandra.thrift.Cassandra$Client.recv_batch_mutate(Cassandra.java:922)
at org.apache.cassandra.thrift.Cassandra$Client.batch_mutate(Cassandra.java:908)
at me.prettyprint.cassandra.model.MutatorImpl$3.execute(MutatorImpl.java:246)
at me.prettyprint.cassandra.model.MutatorImpl$3.execute(MutatorImpl.java:243)
at me.prettyprint.cassandra.service.Operation.executeAndSetResult(Operation.java:103)
at me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:258)
The last entries inside the logfile are:
INFO 15:35:31,678 Writing Memtable-cf2#678837311(7447722/53551072 serialized/live bytes, 262236 ops)
INFO 15:35:32,810 Completed flushing \var\lib\cassandra\data\keySpaceName\cf2\keySpaceName-cf2-hd-205-Data.db (3292685 bytes) for commitlog position ReplayPosition(segmentId=109596147695328, position=131717208)
INFO 15:35:33,282 Compacted to [\var\lib\cassandra\data\keySpaceName\cf3\keySpaceName-cf3-hd-29-Data.db,]. 33.992.615 to 30.224.481 (~88% of original) bytes for 282.032 keys at 1,378099MB/s. Time: 20.916ms.
INFO 15:35:33,286 Compacting [SSTableReader(path='\var\lib\cassandra\data\keySpaceName\cf4\keySpaceName-cf4-hd-8-Data.db'), SSTableReader(path='\var\lib\cassandra\data\keySpaceName\cf4\keySpaceName-cf4-hd-6-Data.db'), SSTableReader(path='\var\lib\cassandra\data\keySpaceName\cf4\keySpaceName-cf4-hd-7-Data.db'), SSTableReader(path='\var\lib\cassandra\data\keySpaceName\cf4\keySpaceName-cf4-hd-5-Data.db')]
INFO 15:35:34,871 Compacted to [\var\lib\cassandra\data\keySpaceName\cf4\keySpaceName-cf4-hd-9-Data.db,]. 4.249.270 to 2.471.543 (~58% of original) bytes for 30.270 keys at 1,489916MB/s. Time: 1.582ms.
INFO 15:35:41,858 Compacted to [\var\lib\cassandra\data\keySpaceName\cf2\keySpaceName-cf2-hd-204-Data.db,]. 48.868.818 to 24.033.164 (~49% of original) bytes for 135.367 keys at 2,019011MB/s. Time: 11.352ms.
I created 4 column families as follows:
ColumnFamilyDefinition cf1 = HFactory.createColumnFamilyDefinition(
    "keyspacename",
    "cf1",
    ComparatorType.ASCIITYPE);
The column families have the following column counts:
16 columns
14 columns
7 columns
5 columns
The keyspace is created with replication factor 1 and the default strategy (simple).
I insert the records (rows) with 'Mutator#addInsertion'.
Any advice on avoiding this exception?
Regards
WM
That exception is basically Cassandra saying that it's far enough behind on mutations that it won't complete your requests before they time out. Assuming your PC isn't a beast, you should probably throttle your requests. I suggest sleeping for a while after catching that exception and then retrying; there's no harm in accidentally writing the same row twice, and Cassandra should catch up on writes pretty quickly.
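A minimal sketch of that sleep-and-retry idea, written against plain Java rather than Hector so that it doesn't assume anything about Hector's API. RetryWithBackoff is a hypothetical helper; in your importer the operation would wrap the mutator call and retryOn would be the timeout exception class you are seeing (HTimedOutException in your stack trace). Doubling the delay after each failure is my own choice here; a fixed sleep would work too.

```java
import java.util.concurrent.Callable;

/** Hypothetical helper: retries an operation, sleeping between attempts. */
final class RetryWithBackoff {

    /**
     * Runs op, retrying up to maxAttempts times when it throws retryOn.
     * The sleep starts at initialDelayMs and doubles after every failed attempt.
     * Any other exception, or the last failure, is rethrown to the caller.
     */
    static <T> T call(Callable<T> op,
                      Class<? extends Exception> retryOn,
                      int maxAttempts,
                      long initialDelayMs) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long delayMs = initialDelayMs;
        for (int attempt = 1; ; attempt++) {
            try {
                return op.call();
            } catch (Exception e) {
                // Give up on unexpected exceptions or once the attempts are used up.
                if (!retryOn.isInstance(e) || attempt >= maxAttempts) {
                    throw e;
                }
                Thread.sleep(delayMs); // give the server time to catch up
                delayMs *= 2;
            }
        }
    }
}
```

This is only safe because the writes are idempotent, as noted above: replaying a batch that partially succeeded just rewrites the same rows.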
If you were in a production environment, I would look more closely at other reasons why the node might be performing poorly.
