I've got a few nodes in a ring with replication factor 3, and I'm trying to change the hardware on one of the nodes. What's happening is that I'm getting a streaming failure exception.
I've tried a few times, always with the same failure. The upstream node (10.0.10.54) is dreadfully short on space and it's not realistic to compact or do any sstable operations on it. What I would like to do is:
Bring up a new node with all the data streamed prior to the failed event
Run a repair on it (nodetool repair -pr)
Decommission the 10.0.10.54 node
What I can't figure out is how to do this: every time I bring up the new node it goes into JOINING, and what I want is to force it into RUNNING with the data it has already copied during its JOINING state.
The exception, for those interested:
WARN [StreamReceiveTask:6] 2016-04-25 06:48:51,107 StreamResultFuture.java:207 - [Stream #bb34c010-0a1b-11e6-a009-d100b9716be2] Stream failed
INFO [MemtableFlushWriter:214] 2016-04-25 06:48:51,107 Memtable.java:382 - Completed flushing /mnt/cassandra/data/system/compactions_in_progress-55080ab05d9c388690a4acb25fe1f77b/system-compactions_in_progress-tmp-ka-276-Data.db (0.000KiB) for commitlog position ReplayPosition(segmentId=1461502431578, position=9474892)
INFO [CompactionExecutor:259] 2016-04-25 06:48:51,252 CompactionTask.java:141 - Compacting [SSTableReader(path='/mnt/cassandra/data/trends/stream_trends-a5bb42a07e2911e58fd6f3cfff022ad4/trends-stream_trends-ka-79-Data.db'), SSTableReader(path='/mnt/cassandra/data/trends/stream_trends-a5bb42a07e2911e58fd6f3cfff022ad4/trends-stream_trends-ka-87-Data.db')]
ERROR [main] 2016-04-25 06:48:51,270 CassandraDaemon.java:581 - Exception encountered during startup
java.lang.RuntimeException: Error during boostrap: Stream failed
at org.apache.cassandra.dht.BootStrapper.bootstrap(BootStrapper.java:86) ~[cassandra-all-2.1.13.1218.jar:2.1.13.1218]
at org.apache.cassandra.service.StorageService.bootstrap(StorageService.java:1166) ~[cassandra-all-2.1.13.1218.jar:2.1.13.1218]
at org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:944) ~[cassandra-all-2.1.13.1218.jar:2.1.13.1218]
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:740) ~[cassandra-all-2.1.13.1218.jar:2.1.13.1218]
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:617) ~[cassandra-all-2.1.13.1218.jar:2.1.13.1218]
at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:389) ~[cassandra-all-2.1.13.1218.jar:2.1.13.1218]
at com.datastax.bdp.server.DseDaemon.setup(DseDaemon.java:336) ~[dse-core-4.8.6.jar:4.8.6]
at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:564) ~[cassandra-all-2.1.13.1218.jar:2.1.13.1218]
at com.datastax.bdp.DseModule.main(DseModule.java:74) [dse-core-4.8.6.jar:4.8.6]
Caused by: org.apache.cassandra.streaming.StreamException: Stream failed
at org.apache.cassandra.streaming.management.StreamEventJMXNotifier.onFailure(StreamEventJMXNotifier.java:85) ~[cassandra-all-2.1.13.1218.jar:2.1.13.1218]
at com.google.common.util.concurrent.Futures$4.run(Futures.java:1172) ~[guava-16.0.1.jar:na]
at com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297) ~[guava-16.0.1.jar:na]
at com.google.common.util.concurrent.ExecutionList.executeListener(ExecutionList.java:156) ~[guava-16.0.1.jar:na]
at com.google.common.util.concurrent.ExecutionList.execute(ExecutionList.java:145) ~[guava-16.0.1.jar:na]
at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:202) ~[guava-16.0.1.jar:na]
at org.apache.cassandra.streaming.StreamResultFuture.maybeComplete(StreamResultFuture.java:208) ~[cassandra-all-2.1.13.1218.jar:2.1.13.1218]
at org.apache.cassandra.streaming.StreamResultFuture.handleSessionComplete(StreamResultFuture.java:184) ~[cassandra-all-2.1.13.1218.jar:2.1.13.1218]
at org.apache.cassandra.streaming.StreamSession.closeSession(StreamSession.java:415) ~[cassandra-all-2.1.13.1218.jar:2.1.13.1218]
at org.apache.cassandra.streaming.StreamSession.maybeCompleted(StreamSession.java:692) ~[cassandra-all-2.1.13.1218.jar:2.1.13.1218]
at org.apache.cassandra.streaming.StreamSession.taskCompleted(StreamSession.java:653) ~[cassandra-all-2.1.13.1218.jar:2.1.13.1218]
at org.apache.cassandra.streaming.StreamReceiveTask$OnCompletionRunnable.run(StreamReceiveTask.java:179) ~[cassandra-all-2.1.13.1218.jar:2.1.13.1218]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_77]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_77]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_77]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[na:1.8.0_77]
at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_77]
You can't skip the JOINING stage while a node is bootstrapping. I'm guessing you were following these steps to replace the node? https://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_replace_node_t.html
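If you were following that procedure, note that a dead node is normally replaced by starting the new node with the replace_address flag rather than by forcing it out of JOINING. A minimal sketch, assuming a package install that uses cassandra-env.sh (fill in the address of the node you are replacing):
# In cassandra-env.sh on the replacement node, before its first start:
JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address=<address_of_node_being_replaced>"
Remove that flag again once the replacement node has finished bootstrapping.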
Were all of your nodes online throughout the entire streaming process? If a replica crashes or goes offline while streaming, it can cause the stream to fail. If your nodes are very low on disk space, it can cause Cassandra to act in odd ways or crash. If this is the case, you may need to add additional storage to your existing node before adding the new node.
You can add more disk space to your existing node like this (a command sketch follows these steps):
Stop Cassandra
Attach a larger disk to the machine/VM
Copy the cassandra data directory (/var/lib/cassandra/data) to the new disk
Change the cassandra data directory mount point to the new disk using a symlink
Start Cassandra up
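A rough command sketch of those steps, assuming a package install, data in /var/lib/cassandra/data, and the new disk showing up as /dev/xvdf (adjust the device, filesystem and paths to your environment):
sudo service cassandra stop
sudo mkfs.ext4 /dev/xvdf                                           # format the newly attached disk
sudo mkdir -p /mnt/cassandra-big
sudo mount /dev/xvdf /mnt/cassandra-big
sudo rsync -aH /var/lib/cassandra/data/ /mnt/cassandra-big/data/   # copy the data, preserving ownership/permissions
sudo mv /var/lib/cassandra/data /var/lib/cassandra/data.old        # keep the old copy until you've verified the move
sudo ln -s /mnt/cassandra-big/data /var/lib/cassandra/data         # point the old path at the new disk
sudo chown -R cassandra:cassandra /mnt/cassandra-big/data
sudo service cassandra start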
I had one unreachable node in my cluster and I tried replacing it, but it wasn't successful. So I left the node and accepted the data loss, given the replication factor of 3.
Now, when I try to decommission or add a server, it's not working as expected.
I'm getting these INFO messages on all the nodes. I have tried assassinate and removenode as well. The node doesn't show up in nodetool status, but my guess is that it is still persisted somewhere and gossip is causing the issues.
INFO [GossipStage:1] 2021-05-29 07:25:37,404 Gossiper.java:1029 - InetAddress /10.43.5.118 is now DOWN
INFO [GossipStage:1] 2021-05-29 07:25:37,405 StorageService.java:2324 - Removing tokens [] for /10.43.5.118
Also, while restarting a node, I get an ERROR from gossip, a NullPointerException: it's not able to get the host id. I tried removing the node with the older method mentioned on Stack Overflow, using JMX.
ERROR [GossipStage:1] 2021-05-29 08:48:35,229 CassandraDaemon.java:226 - Exception in thread Thread[GossipStage:1,5,main]
java.lang.NullPointerException: null
at org.apache.cassandra.gms.Gossiper.getHostId(Gossiper.java:866) ~[apache-cassandra-3.9.jar:3.9]
at org.apache.cassandra.service.StorageService.handleStateNormal(StorageService.java:2096) ~[apache-cassandra-3.9.jar:3.9]
at org.apache.cassandra.service.StorageService.onChange(StorageService.java:1822) ~[apache-cassandra-3.9.jar:3.9]
at org.apache.cassandra.service.StorageService.onJoin(StorageService.java:2536) ~[apache-cassandra-3.9.jar:3.9]
at org.apache.cassandra.gms.Gossiper.handleMajorStateChange(Gossiper.java:1070) ~[apache-cassandra-3.9.jar:3.9]
at org.apache.cassandra.gms.Gossiper.applyStateLocally(Gossiper.java:1181) ~[apache-cassandra-3.9.jar:3.9]
at org.apache.cassandra.gms.GossipDigestAck2VerbHandler.doVerb(GossipDigestAck2VerbHandler.java:49) ~[apache-cassandra-3.9.jar:3.9]
at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:64) ~[apache-cassandra-3.9.jar:3.9]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_181]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_181]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[na:1.8.0_181]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_181]
at java.lang.Thread.run(Thread.java:748) [na:1.8.0_181]
Can someone let me know how to remove this node completely?
In this case, please remove all the data from the removed node and try to start it standalone. If that works, clear the data on that node again and rejoin the cluster after making the configuration change.
Also, if the other nodes don't hold too much data, run cleanup and repair once on them before adding the new node.
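A hedged sketch of that suggestion, assuming default package paths. The first part wipes the local state of the node you want to bring back; the second part is run on each existing node before the new node joins:
# On the node being re-added: stop it and clear all local state so it bootstraps fresh.
sudo service cassandra stop
sudo rm -rf /var/lib/cassandra/data/* /var/lib/cassandra/commitlog/* /var/lib/cassandra/saved_caches/*
# On each existing node, before adding the new node:
nodetool cleanup           # drop data the node no longer owns
nodetool repair -pr        # repair the node's primary ranges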
We have a 5 node cassandra cluster running cassandra 3.9. We have a keyspace "ks" and a table "cf". We created several indexes on the table like "cf_c1_idx", "cf_c1_idx_1", "cf_c2_idx".
When I do a nodetool flush, the flush of 1 of the index files fails with the following exception:
-- StackTrace --
java.lang.RuntimeException: Last written key DecoratedKey(4dd1d75b-e52f-6c49-e7cd-c52a968e70de, 4dd1d75be52f6c49e7cdc52a968e70de) >= current key DecoratedKey(00000000-0000-0000-0000-000000000000, 5331cc31ae396031e6be66312c89c379) writing into /var/lib/cassandra/data/ks/cf-8d8b1ba0081c11e7a4206f8b05d669ae/.cf_c1_idx_1/mc-401-big-Data.db
at org.apache.cassandra.io.sstable.format.big.BigTableWriter.beforeAppend(BigTableWriter.java:122)
at org.apache.cassandra.io.sstable.format.big.BigTableWriter.append(BigTableWriter.java:161)
at org.apache.cassandra.io.sstable.SimpleSSTableMultiWriter.append(SimpleSSTableMultiWriter.java:48)
at org.apache.cassandra.db.Memtable$FlushRunnable.writeSortedContents(Memtable.java:458)
at org.apache.cassandra.db.Memtable$FlushRunnable.call(Memtable.java:493)
at org.apache.cassandra.db.Memtable$FlushRunnable.call(Memtable.java:380)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
When I run nodetool flush again after a few seconds, it succeeds without a hitch. We also sometimes notice the same exception showing up during commitlog replay after restarting a node. We end up deleting the commitlog directory so Cassandra can start, and then run a repair to sync the data that was lost. Is this happening because of secondary indexes not getting updated in time? Also, this is a read-intensive cluster.
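For reference, the workaround we use when replay fails looks roughly like this (package-install paths assumed; dropping the commitlog discards unflushed writes, which is why we repair afterwards):
sudo service cassandra stop
sudo mv /var/lib/cassandra/commitlog /var/lib/cassandra/commitlog.bad    # keep a copy rather than deleting outright
sudo mkdir /var/lib/cassandra/commitlog
sudo chown cassandra:cassandra /var/lib/cassandra/commitlog
sudo service cassandra start
nodetool repair ks         # re-sync the writes lost with the commitlog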
We have a cluster with 6 nodes across two datacenters (3 nodes each). We start a repair on one node and shortly afterwards we find something like this in the logs:
ERROR [Repair#1:1] 2016-05-31 01:33:28,075 CassandraDaemon.java:195 - Exception in thread Thread[Repair#1:1,5,RMI Runtime]
com.google.common.util.concurrent.UncheckedExecutionException: org.apache.cassandra.exceptions.RepairException: [repair #e8e21070-26be-11e6-aae8-77b20cefeee5 on ..... Validation failed in /xx.xxx.xx.xx
at com.google.common.util.concurrent.Futures.wrapAndThrowUnchecked(Futures.java:1525) ~[guava-18.0.jar:na]
at com.google.common.util.concurrent.Futures.getUnchecked(Futures.java:1511) ~[guava-18.0.jar:na]
at org.apache.cassandra.repair.RepairJob.run(RepairJob.java:162) ~[apache-cassandra-3.0.4.jar:3.0.4]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_77]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[na:1.8.0_77]
at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_77]
Afterwards nothing seems to happen anymore. We left the repair uninterrupted for several days, but still nothing happened. We also tried it on two different clusters with the same result.
After searching through the web we stumbled upon https://support.datastax.com/hc/en-us/articles/205256895--Validation-failed-when-running-a-nodetool-repair. It says that we should run "nodetool scrub" and, if that does not help, "sstablescrub".
We tried nodetool scrub, but the repair still does not work. We then started an sstablescrub, but it seems to take forever. It uses only one CPU at 100%, and the data and index files are growing, but it has now been running for over a day and the file is still only 1.2 GB in size.
Is it normal that "sstablescrub" is so slow?
The cluster has already been running for some time, and we missed running the repair within GCGraceSeconds. Might that be why the repair is not working?
We currently do not know how to get the repair running; we hope someone can help.
What the exception indicates is that the node was not able to receive the results of the merkle tree computation that was supposed to happen on /xx.xxx.xx.xx. Please check the logs on that node instead. The node you started the repair on is likely fine and does not require sstable scrubbing.
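For example, something along these lines on the /xx.xxx.xx.xx node should surface the underlying validation error (log path assumes a package install):
grep -iE 'ERROR|validation|merkle' /var/log/cassandra/system.log | less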
Using DSE 4.8.6 (C* 2.1.13.1218)
When I try adding a new node in a new datacenter, bootstrapping / node rebuild is always interrupted by streaming errors.
Error example from system.log:
ERROR [STREAM-IN-/172.31.47.213] 2016-04-19 12:30:28,531 StreamSession.java:621 - [Stream #743d44e0-060e-11e6-985c-c1820b05e9ae] Remote peer 172.31.47.213 failed stream session.
INFO [STREAM-IN-/172.31.47.213] 2016-04-19 12:30:30,665 StreamResultFuture.java:180 - [Stream #743d44e0-060e-11e6-985c-c1820b05e9ae] Session with /172.31.47.213 is complete
There is about 500 GB of data to be streamed to the new node. The bootstrap or rebuild operation streams it from 4 different nodes in the other (main) DC.
When a streaming error occurs, all synced data is wiped (and I have to start over).
What I tried so far:
bootstrapping the node
setting auto_bootstrap: false in cassandra.yaml and manually running nodetool rebuild
disabling streaming_socket_timeout_in_ms and setting more aggressive TCP keepalive values in my Linux configuration, following the advice in the CASSANDRA-9440 ticket (see the sysctl sketch after this list)
increasing phi_convict_threshold (to the max)
not bootstrapping the node at all and using repair to stream the data (I stopped the repair with a nearly full disk and 80K SSTables; after 3 days of trying to compact them, I gave up)
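For reference, the keepalive change mentioned above was along these lines, using what I believe are the values commonly suggested for this issue (adjust to your environment):
sudo tee /etc/sysctl.d/99-cassandra-keepalive.conf <<'EOF'
net.ipv4.tcp_keepalive_time = 60
net.ipv4.tcp_keepalive_probes = 3
net.ipv4.tcp_keepalive_intvl = 10
EOF
sudo sysctl -p /etc/sysctl.d/99-cassandra-keepalive.conf    # apply without a reboot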
Any other things I should try? I'm in the process of running nodetool scrub on every failing node to see if this helps...
On the stream out node, these are the error messages:
ERROR [STREAM-IN-/172.31.45.28] 2016-05-11 13:10:43,842 StreamSession.java:505 - [Stream #ecfe0390-1763-11e6-b6c8-c1820b05e9ae] Streaming error occurred
java.net.SocketTimeoutException: null
at sun.nio.ch.SocketAdaptor$SocketInputStream.read(SocketAdaptor.java:229) ~[na:1.7.0_80]
at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:103) ~[na:1.7.0_80]
at java.nio.channels.Channels$ReadableByteChannelImpl.read(Channels.java:385) ~[na:1.7.0_80]
at org.apache.cassandra.streaming.messages.StreamMessage.deserialize(StreamMessage.java:51) ~[cassandra-all-2.1.14.1272.jar:2.1.14.1272]
at org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:257) ~[cassandra-all-2.1.14.1272.jar:2.1.14.1272]
at java.lang.Thread.run(Thread.java:745) [na:1.7.0_80]
and then:
INFO [STREAM-IN-/172.31.45.28] 2016-05-10 07:59:14,023 StreamResultFuture.java:180 - [Stream #ea1271b0-1679-11e6-917a-c1820b05e9ae] Session with /172.31.45.28 is complete
WARN [STREAM-IN-/172.31.45.28] 2016-05-10 07:59:14,023 StreamResultFuture.java:207 - [Stream #ea1271b0-1679-11e6-917a-c1820b05e9ae] Stream failed
ERROR [STREAM-OUT-/172.31.45.28] 2016-05-10 07:59:14,024 StreamSession.java:505 - [Stream #ea1271b0-1679-11e6-917a-c1820b05e9ae] Streaming error occurred
java.lang.AssertionError: Memory was freed
at org.apache.cassandra.io.util.SafeMemory.checkBounds(SafeMemory.java:97) ~[cassandra-all-2.1.13.1218.jar:2.1.13.1218]
at org.apache.cassandra.io.util.Memory.getLong(Memory.java:249) ~[cassandra-all-2.1.13.1218.jar:2.1.13.1218]
at org.apache.cassandra.io.compress.CompressionMetadata.getTotalSizeForSections(CompressionMetadata.java:247) ~[cassandra-all-2.1.13.1218.jar:2.1.13.1218]
at org.apache.cassandra.streaming.messages.FileMessageHeader.size(FileMessageHeader.java:112) ~[cassandra-all-2.1.13.1218.jar:2.1.13.1218]
at org.apache.cassandra.streaming.StreamSession.fileSent(StreamSession.java:546) ~[cassandra-all-2.1.13.1218.jar:2.1.13.1218]
at org.apache.cassandra.streaming.messages.OutgoingFileMessage$1.serialize(OutgoingFileMessage.java:50) ~[cassandra-all-2.1.13.1218.jar:2.1.13.1218]
at org.apache.cassandra.streaming.messages.OutgoingFileMessage$1.serialize(OutgoingFileMessage.java:41) ~[cassandra-all-2.1.13.1218.jar:2.1.13.1218]
at org.apache.cassandra.streaming.messages.StreamMessage.serialize(StreamMessage.java:45) ~[cassandra-all-2.1.13.1218.jar:2.1.13.1218]
at org.apache.cassandra.streaming.ConnectionHandler$OutgoingMessageHandler.sendMessage(ConnectionHandler.java:358) ~[cassandra-all-2.1.13.1218.jar:2.1.13.1218]
at org.apache.cassandra.streaming.ConnectionHandler$OutgoingMessageHandler.run(ConnectionHandler.java:338) ~[cassandra-all-2.1.13.1218.jar:2.1.13.1218]
As answered in the Cassandra ticket CASSANDRA-11345, this issue was due to a big SSTable file (40GB) being transferred.
The transfer of said file takes more than 1 hour and by default streaming operations time out if an outgoing transfer takes more than 1 hour.
To change this default behavior, you can set streaming_socket_timeout_in_ms in the cassandra.yaml configuration file to a large value (e.g. 72000000 ms, or 20 hours).
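A minimal sketch of that change, assuming a package install with cassandra.yaml in /etc/cassandra/ (DSE installs may keep it elsewhere); if the setting is not present in the file, add the line instead of patching it:
sudo sed -i 's/^#* *streaming_socket_timeout_in_ms:.*/streaming_socket_timeout_in_ms: 72000000/' /etc/cassandra/cassandra.yaml
sudo service cassandra restart     # restart each node, one at a time, for the change to take effect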
Don't forget to change this value on the existing nodes too, not just the new nodes!
(not that I'm admitting anything here...)
I'm having issues with the longevity of a Spark-Kinesis streaming application running on the Spark standalone cluster manager. The program runs for around 50 hours and then stops receiving data from Kinesis without giving any valid error as to why it stopped. But if I restart the application, it works for another day and a half or so.
I'm seeing a whole lot of errors during program execution. I'm not sure whether these are related to the unexpected stoppage, because the errors appear in the logs even when the Spark application is working fine.
There is no error specific to the stoppage in the driver or executors. I also checked for out-of-memory errors but was not able to spot any in the logs. Could you please help me understand what these error messages mean? Do they have anything to do with the longevity issue? Where do you think I should debug to understand what's happening with this?
2016-04-15 13:32:19 INFO KinesisRecordProcessor:58 - Shutdown: Shutting down workerId ip-10-205-1-150.us-west-2.compute.internal:6394789f-acb9-4702-8ea2-c2a3637d925a with reason ZOMBIE
2016-04-15 13:32:19 ERROR ShutdownTask:123 - Application exception.
java.lang.NullPointerException
at java.util.concurrent.ConcurrentHashMap.hash(ConcurrentHashMap.java:333)
at java.util.concurrent.ConcurrentHashMap.remove(ConcurrentHashMap.java:1175)
at org.apache.spark.streaming.kinesis.KinesisCheckpointer.removeCheckpointer(KinesisCheckpointer.scala:66)
at org.apache.spark.streaming.kinesis.KinesisReceiver.removeCheckpointer(KinesisReceiver.scala:249)
at org.apache.spark.streaming.kinesis.KinesisRecordProcessor.shutdown(KinesisRecordProcessor.scala:124)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.V1ToV2RecordProcessorAdapter.shutdown(V1ToV2RecordProcessorAdapter.java:48)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShutdownTask.call(ShutdownTask.java:94)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:48)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:23)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
2016-04-15 13:32:19 ERROR ShutdownTask:123 - Application exception.
java.lang.NullPointerException
at java.util.concurrent.ConcurrentHashMap.hash(ConcurrentHashMap.java:333)
at java.util.concurrent.ConcurrentHashMap.remove(ConcurrentHashMap.java:1175)
at org.apache.spark.streaming.kinesis.KinesisCheckpointer.removeCheckpointer(KinesisCheckpointer.scala:66)
at org.apache.spark.streaming.kinesis.KinesisReceiver.removeCheckpointer(KinesisReceiver.scala:249)
at org.apache.spark.streaming.kinesis.KinesisRecordProcessor.shutdown(KinesisRecordProcessor.scala:124)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.V1ToV2RecordProcessorAdapter.shutdown(V1ToV2RecordProcessorAdapter.java:48)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShutdownTask.call(ShutdownTask.java:94)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:48)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:23)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
2016-04-15 13:33:12 INFO LeaseRenewer:235 - Worker ip-10-205-1-151.us-west-2.compute.internal:188bd3f5-095b-405f-ac9f-b7eff11e16d1 lost lease with key shardId-000000000046 - discovered during update
2016-04-15 13:33:12 WARN MetricsHelper:67 - No metrics scope set in thread RecurringTimer - Kinesis Checkpointer - Worker ip-10-205-1-151.us-west-2.compute.internal:188bd3f5-095b-405f-ac9f-b7eff11e16d1, getMetricsScope returning NullMetricsScope.
2016-04-15 13:33:12 ERROR KinesisRecordProcessor:95 - ShutdownException: Caught shutdown exception, skipping checkpoint.
com.amazonaws.services.kinesis.clientlibrary.exceptions.ShutdownException: Can't update checkpoint - instance doesn't hold the lease for this shard
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisClientLibLeaseCoordinator.setCheckpoint(KinesisClientLibLeaseCoordinator.java:120)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.RecordProcessorCheckpointer.advancePosition(RecordProcessorCheckpointer.java:216)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.RecordProcessorCheckpointer.checkpoint(RecordProcessorCheckpointer.java:137)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.RecordProcessorCheckpointer.checkpoint(RecordProcessorCheckpointer.java:103)
at org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1$$anonfun$apply$1.apply$mcV$sp(KinesisCheckpointer.scala:81)
at org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1$$anonfun$apply$1.apply(KinesisCheckpointer.scala:81)
at org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1$$anonfun$apply$1.apply(KinesisCheckpointer.scala:81)
at scala.util.Try$.apply(Try.scala:161)
at org.apache.spark.streaming.kinesis.KinesisRecordProcessor$.retryRandom(KinesisRecordProcessor.scala:145)
at org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1.apply(KinesisCheckpointer.scala:81)
at org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1.apply(KinesisCheckpointer.scala:75)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.streaming.kinesis.KinesisCheckpointer.checkpoint(KinesisCheckpointer.scala:75)
at org.apache.spark.streaming.kinesis.KinesisCheckpointer.org$apache$spark$streaming$kinesis$KinesisCheckpointer$$checkpointAll(KinesisCheckpointer.scala:103)
at org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$1.apply$mcVJ$sp(KinesisCheckpointer.scala:117)
at org.apache.spark.streaming.util.RecurringTimer.triggerActionForNextInterval(RecurringTimer.scala:94)
at org.apache.spark.streaming.util.RecurringTimer.org$apache$spark$streaming$util$RecurringTimer$$loop(RecurringTimer.scala:106)
at org.apache.spark.streaming.util.RecurringTimer$$anon$1.run(RecurringTimer.scala:29)
2016-04-15 13:33:12 WARN KinesisCheckpointer:91 - Failed to checkpoint shardId shardId-000000000046 to DynamoDB.
com.amazonaws.services.kinesis.clientlibrary.exceptions.ShutdownException: Can't update checkpoint - instance doesn't hold the lease for this shard
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisClientLibLeaseCoordinator.setCheckpoint(KinesisClientLibLeaseCoordinator.java:120)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.RecordProcessorCheckpointer.advancePosition(RecordProcessorCheckpointer.java:216)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.RecordProcessorCheckpointer.checkpoint(RecordProcessorCheckpointer.java:137)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.RecordProcessorCheckpointer.checkpoint(RecordProcessorCheckpointer.java:103)
at org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1$$anonfun$apply$1.apply$mcV$sp(KinesisCheckpointer.scala:81)
at org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1$$anonfun$apply$1.apply(KinesisCheckpointer.scala:81)
at org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1$$anonfun$apply$1.apply(KinesisCheckpointer.scala:81)
at scala.util.Try$.apply(Try.scala:161)
at org.apache.spark.streaming.kinesis.KinesisRecordProcessor$.retryRandom(KinesisRecordProcessor.scala:145)
at org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1.apply(KinesisCheckpointer.scala:81)
at org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1.apply(KinesisCheckpointer.scala:75)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.streaming.kinesis.KinesisCheckpointer.checkpoint(KinesisCheckpointer.scala:75)
at org.apache.spark.streaming.kinesis.KinesisCheckpointer.org$apache$spark$streaming$kinesis$KinesisCheckpointer$$checkpointAll(KinesisCheckpointer.scala:103)
at org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$1.apply$mcVJ$sp(KinesisCheckpointer.scala:117)
at org.apache.spark.streaming.util.RecurringTimer.triggerActionForNextInterval(RecurringTimer.scala:94)
at org.apache.spark.streaming.util.RecurringTimer.org$apache$spark$streaming$util$RecurringTimer$$loop(RecurringTimer.scala:106)
at org.apache.spark.streaming.util.RecurringTimer$$anon$1.run(RecurringTimer.scala:29)
2016-04-15 13:33:12 INFO LeaseRenewer:116 - Worker ip-10-205-1-151.us-west-2.compute.internal:188bd3f5-095b-405f-ac9f-b7eff11e16d1 lost lease with key shardId-000000000046
2016-04-15 13:33:12 INFO MemoryStore:58 - Block input-2-1460727103359 stored as values in memory (estimated size 120.3 KB, free 171.8 MB)