Cassandra: Cannot remove a node and gossip gives an ERROR

I have one unreachable node in my cluster, and my attempt to replace it wasn't successful. So I left the node alone and accepted the data loss, since the replication factor is 3.
Now, when I try to decommission or add a server, it doesn't work as expected.
I'm getting these INFO messages on all the nodes. I have tried assassinate and removenode as well. The node doesn't show up in nodetool status, but my guess is that it is still persisted somewhere and gossip is causing the issues.
INFO [GossipStage:1] 2021-05-29 07:25:37,404 Gossiper.java:1029 - InetAddress /10.43.5.118 is now DOWN
INFO [GossipStage:1] 2021-05-29 07:25:37,405 StorageService.java:2324 - Removing tokens [] for /10.43.5.118
Also, while restarting a node, I get an ERROR from gossip, which is a NullPointerException; it is not able to get the host ID. I also tried removing the dead node with the older JMX-based method mentioned on Stack Overflow.
ERROR [GossipStage:1] 2021-05-29 08:48:35,229 CassandraDaemon.java:226 - Exception in thread Thread[GossipStage:1,5,main]
java.lang.NullPointerException: null
at org.apache.cassandra.gms.Gossiper.getHostId(Gossiper.java:866) ~[apache-cassandra-3.9.jar:3.9]
at org.apache.cassandra.service.StorageService.handleStateNormal(StorageService.java:2096) ~[apache-cassandra-3.9.jar:3.9]
at org.apache.cassandra.service.StorageService.onChange(StorageService.java:1822) ~[apache-cassandra-3.9.jar:3.9]
at org.apache.cassandra.service.StorageService.onJoin(StorageService.java:2536) ~[apache-cassandra-3.9.jar:3.9]
at org.apache.cassandra.gms.Gossiper.handleMajorStateChange(Gossiper.java:1070) ~[apache-cassandra-3.9.jar:3.9]
at org.apache.cassandra.gms.Gossiper.applyStateLocally(Gossiper.java:1181) ~[apache-cassandra-3.9.jar:3.9]
at org.apache.cassandra.gms.GossipDigestAck2VerbHandler.doVerb(GossipDigestAck2VerbHandler.java:49) ~[apache-cassandra-3.9.jar:3.9]
at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:64) ~[apache-cassandra-3.9.jar:3.9]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_181]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_181]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[na:1.8.0_181]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_181]
at java.lang.Thread.run(Thread.java:748) [na:1.8.0_181]
Can someone let me know how to remove this node completely?
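For reference, the removal attempts mentioned above were along these lines (the host ID below is a placeholder, and the JMX step is only a rough description of the older approach):
nodetool status                      # the dead node no longer appears here
nodetool gossipinfo                  # check whether 10.43.5.118 still appears in gossip state
nodetool removenode <host-id>        # standard removal by host ID
nodetool removenode force            # when the removal hangs
nodetool assassinate 10.43.5.118     # forced removal by IP (Cassandra 2.2+)
# Older JMX route: connect with jconsole/jmxterm and invoke
# unsafeAssassinateEndpoint("10.43.5.118") on the org.apache.cassandra.net:type=Gossiper MBean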

In this case, please remove all the data from the removed node and try to start it standalone. If that works, clear the data on that node again and rejoin it to the cluster with the configuration change.
Also, if the other nodes hold a small amount of data, please run cleanup and repair once before adding the new node.
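A minimal sketch of that sequence, assuming the default package data directories (adjust paths and service handling to your setup):
# On each remaining node, one at a time, before bootstrapping the replacement:
nodetool cleanup            # drop data for token ranges the node no longer owns
nodetool repair -pr         # repair the node's primary ranges
# On the node being re-added, with Cassandra stopped, wipe its old state:
sudo rm -rf /var/lib/cassandra/data/* /var/lib/cassandra/commitlog/* /var/lib/cassandra/saved_caches/*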

Related

Cassandra 3.9 stuck in joining state

I have a 14-node Cassandra 3.9 cluster with ~250 GB of data on each node. Recently I have been attempting to add a 15th node to this cluster. The node has been stuck in the Joining state for the past 2 days, and nodetool netstats is clear (the checks I'm running are sketched at the end of this post). The main thing I find suspicious in the system.log of the joining node is errors like these.
ERROR [Native-Transport-Requests-1] 2018-02-16 15:43:32,635 Message.java:617 - Unexpected exception during request; channel = [id: 0x8ed1cb3b, L:/**.**.**.42:9042 - R:/**.**.**.**:41614]
java.lang.NullPointerException: null
at org.apache.cassandra.auth.PasswordAuthenticator.authenticate(PasswordAuthenticator.java:88) ~[apache-cassandra-3.9.jar:3.9]
at org.apache.cassandra.auth.PasswordAuthenticator.access$300(PasswordAuthenticator.java:59) ~[apache-cassandra-3.9.jar:3.9]
at org.apache.cassandra.auth.PasswordAuthenticator$PlainTextSaslAuthenticator.getAuthenticatedUser(PasswordAuthenticator.java:220) ~[apache-cassandra-3.9.jar:3.9]
at org.apache.cassandra.transport.messages.AuthResponse.execute(AuthResponse.java:78) ~[apache-cassandra-3.9.jar:3.9]
at org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:513) [apache-cassandra-3.9.jar:3.9]
at org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:407) [apache-cassandra-3.9.jar:3.9]
at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) [netty-all-4.0.39.Final.jar:4.0.39.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:366) [netty-all-4.0.39.Final.jar:4.0.39.Final]
at io.netty.channel.AbstractChannelHandlerContext.access$600(AbstractChannelHandlerContext.java:35) [netty-all-4.0.39.Final.jar:4.0.39.Final]
at io.netty.channel.AbstractChannelHandlerContext$7.run(AbstractChannelHandlerContext.java:357) [netty-all-4.0.39.Final.jar:4.0.39.Final]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_91]
at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:164) [apache-cassandra-3.9.jar:3.9]
at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:109) [apache-cassandra-3.9.jar:3.9]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_91]
This error message is from a client trying to connect to this node. It seems to fail authentication. How might I proceed in this situation? How should I bring this node to normalcy?
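For context, the checks run against the joining node so far look roughly like this (all of them currently report nothing in flight):
nodetool netstats           # streaming progress; currently clear
nodetool compactionstats    # pending compactions on the joining node
nodetool status             # the new node is still listed as UJ (Up/Joining)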
There are two different problems here:
The auth issue the client is facing is related to a bug in Cassandra 3.9 during bootstrap of new nodes. It has been resolved in later versions of Cassandra, as documented at https://issues.apache.org/jira/browse/CASSANDRA-12813.
We had a streaming issue similar to this with Cassandra 3.9. Taking a deeper look at the system.log, there was an error about a huge partition (greater than 100 MB) that could not be compacted because it exceeded the default commitlog segment size. We were able to get around it once we increased commitlog_segment_size_in_mb to 512 MB. Check for huge-partition warnings and adjust the size accordingly.
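As a rough illustration of what to look for and adjust (the values and paths below are examples, not recommendations):
# Huge-partition warnings in system.log (default package log location):
grep -i "large partition" /var/log/cassandra/system.log
# cassandra.yaml - defaults shown as comments, with the value we ended up using:
# commitlog_segment_size_in_mb: 32
commitlog_segment_size_in_mb: 512
# compaction_large_partition_warning_threshold_mb: 100   # threshold for the warning above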

Running nodetool resetlocalschema causes exceptions complaining system_auth keyspace doesn't exist

I have a 15-node Cassandra 3.9 cluster. I recently faced an issue where one of my nodes was piling up GossipStage messages. Following some guidance I found on a similar report, I ran 'nodetool resetlocalschema' on that node. Gossip warnings like this one continue to show up in the logs:
WARN [GossipTasks:1] 2018-02-11 23:55:34,197 Gossiper.java:771 - Gossip stage has 180317 pending tasks; skipping status check (no nodes will be marked down)
In addition, I see the following exception. Any guidance on how I can overcome this and bring the node back to normal? Also, I should mention that I have PasswordAuthenticator enabled in cassandra.yaml.
ERROR [Native-Transport-Requests-1] 2018-02-11 23:55:33,581 Message.java:617 - Unexpected exception during request; channel = [id: 0xbaa65545, L:/10.1.21.51:9042 - R:/10.1.86.40:35082]
java.lang.RuntimeException: com.google.common.util.concurrent.UncheckedExecutionException: java.lang.IllegalArgumentException: Unknown keyspace/cf pair (system_auth.roles)
at org.apache.cassandra.auth.PasswordAuthenticator.authenticate(PasswordAuthenticator.java:107) ~[apache-cassandra-3.9.jar:3.9]
at org.apache.cassandra.auth.PasswordAuthenticator.access$300(PasswordAuthenticator.java:59) ~[apache-cassandra-3.9.jar:3.9]
at org.apache.cassandra.auth.PasswordAuthenticator$PlainTextSaslAuthenticator.getAuthenticatedUser(PasswordAuthenticator.java:220) ~[apache-cassandra-3.9.jar:3.9]
at org.apache.cassandra.transport.messages.AuthResponse.execute(AuthResponse.java:78) ~[apache-cassandra-3.9.jar:3.9]
at org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:513) [apache-cassandra-3.9.jar:3.9]
at org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:407) [apache-cassandra-3.9.jar:3.9]
at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) [netty-all-4.0.39.Final.jar:4.0.39.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:366) [netty-all-4.0.39.Final.jar:4.0.39.Final]
at io.netty.channel.AbstractChannelHandlerContext.access$600(AbstractChannelHandlerContext.java:35) [netty-all-4.0.39.Final.jar:4.0.39.Final]
at io.netty.channel.AbstractChannelHandlerContext$7.run(AbstractChannelHandlerContext.java:357) [netty-all-4.0.39.Final.jar:4.0.39.Final]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_91]
at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:164) [apache-cassandra-3.9.jar:3.9]
at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:109) [apache-cassandra-3.9.jar:3.9]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_91]
Caused by: com.google.common.util.concurrent.UncheckedExecutionException: java.lang.IllegalArgumentException: Unknown keyspace/cf pair (system_auth.roles)
at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2203) ~[guava-18.0.jar:na]
at com.google.common.cache.LocalCache.get(LocalCache.java:3937) ~[guava-18.0.jar:na]
at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3941) ~[guava-18.0.jar:na]
at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4824) ~[guava-18.0.jar:na]
at org.apache.cassandra.auth.AuthCache.get(AuthCache.java:108) ~[apache-cassandra-3.9.jar:3.9]
at org.apache.cassandra.auth.PasswordAuthenticator.authenticate(PasswordAuthenticator.java:88) ~[apache-cassandra-3.9.jar:3.9]
... 13 common frames omitted
Actually, this issue was resolved by simply restarting the seed nodes of my cluster first, followed by the rest of the nodes. Thanks for all the inputs; truly appreciated.
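In case it helps anyone else, it was just a rolling restart, roughly like this on each node, seeds first (the service name depends on how Cassandra is installed):
nodetool drain                      # flush memtables and stop accepting new writes
sudo service cassandra restart
nodetool status                     # wait for the node to come back as UN before moving on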

Cassandra 3 Repair never finishes

We have a cluster with 6 nodes in two datacenters (3 nodes each). We start a repair on one node, and shortly afterwards we find something like this in the logs:
ERROR [Repair#1:1] 2016-05-31 01:33:28,075 CassandraDaemon.java:195 - Exception in thread Thread[Repair#1:1,5,RMI Runtime]
com.google.common.util.concurrent.UncheckedExecutionException: org.apache.cassandra.exceptions.RepairException: [repair #e8e21070-26be-11e6-aae8-77b20cefeee5 on ..... Validation failed in /xx.xxx.xx.xx
at com.google.common.util.concurrent.Futures.wrapAndThrowUnchecked(Futures.java:1525) ~[guava-18.0.jar:na]
at com.google.common.util.concurrent.Futures.getUnchecked(Futures.java:1511) ~[guava-18.0.jar:na]
at org.apache.cassandra.repair.RepairJob.run(RepairJob.java:162) ~[apache-cassandra-3.0.4.jar:3.0.4]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_77]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[na:1.8.0_77]
at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_77]
Afterwards nothing seems to happen anymore. We left the repair untouched for several days, but still nothing happened. We also tried it on two different clusters with the same result.
After searching the web we stumbled upon https://support.datastax.com/hc/en-us/articles/205256895--Validation-failed-when-running-a-nodetool-repair. It says that we should run "nodetool scrub" and, if that does not help, "sstablescrub".
We tried nodetool scrub, but the repair still does not work. We have now started sstablescrub, but it seems to take forever. It uses only one CPU at 100%, and the data and index files are growing, but it has been running for over a day and the file is only 1.2 GB in size.
Is it normal that "sstablescrub" is so slow?
The cluster has already been running for some time, and we missed gc_grace_seconds for the repair. Might that be why the repair is not working?
We currently do not know how to get the repair running; we hope someone can help.
What the exception indicates is that the node was not able to receive the results of the Merkle tree computation that was supposed to happen on /xx.xxx.xx.xx. Please check the logs on that node instead; the node you started the repair on is likely fine and does not require SSTable scrubbing.
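For example, on the /xx.xxx.xx.xx node, something along these lines should surface the underlying error around the time of the failed validation (log path assumes a package install):
grep -iE "validation|repair #e8e21070" /var/log/cassandra/system.log
# Corrupt or oversized SSTables usually show up there as compaction/validation exceptions;
# that is the node where nodetool scrub or sstablescrub would actually apply.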

Cassandra: avoid the JOINING state and go straight to RUNNING

I've got a few nodes in a ring with replication factor 3, and I'm trying to change the hardware on one of the nodes. What's happening is that I'm getting a streaming failure exception.
I've tried a few times, always with the same failure. The upstream node (10.0.10.54) is dreadfully out of space, and it's not realistic to compact or do any SSTable operations on it. What I would like to do is:
Bring up a new node with all the data streamed prior to the failed event
Run a repair on it (nodetool repair -pr)
Decommission the 10.0.10.54 node
What I can't figure out is how to do this: every time I bring up the new node it goes into JOINING, and what I want is to force it into RUNNING with the data it has already copied during its JOINING state.
The exception, for those interested:
WARN [StreamReceiveTask:6] 2016-04-25 06:48:51,107 StreamResultFuture.java:207 - [Stream #bb34c010-0a1b-11e6-a009-d100b9716be2] Stream failed
INFO [MemtableFlushWriter:214] 2016-04-25 06:48:51,107 Memtable.java:382 - Completed flushing /mnt/cassandra/data/system/compactions_in_progress-55080ab05d9c388690a4acb25fe1f77b/system-compactions_in_progress-tmp-ka-276-Data.db (0.000KiB) for commitlog position ReplayPosition(segmentId=1461502431578, position=9474892)
INFO [CompactionExecutor:259] 2016-04-25 06:48:51,252 CompactionTask.java:141 - Compacting [SSTableReader(path='/mnt/cassandra/data/trends/stream_trends-a5bb42a07e2911e58fd6f3cfff022ad4/trends-stream_trends-ka-79-Data.db'), SSTableReader(path='/mnt/cassandra/data/trends/stream_trends-a5bb42a07e2911e58fd6f3cfff022ad4/trends-stream_trends-ka-87-Data.db')]
ERROR [main] 2016-04-25 06:48:51,270 CassandraDaemon.java:581 - Exception encountered during startup
java.lang.RuntimeException: Error during boostrap: Stream failed
at org.apache.cassandra.dht.BootStrapper.bootstrap(BootStrapper.java:86) ~[cassandra-all-2.1.13.1218.jar:2.1.13.1218]
at org.apache.cassandra.service.StorageService.bootstrap(StorageService.java:1166) ~[cassandra-all-2.1.13.1218.jar:2.1.13.1218]
at org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:944) ~[cassandra-all-2.1.13.1218.jar:2.1.13.1218]
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:740) ~[cassandra-all-2.1.13.1218.jar:2.1.13.1218]
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:617) ~[cassandra-all-2.1.13.1218.jar:2.1.13.1218]
at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:389) ~[cassandra-all-2.1.13.1218.jar:2.1.13.1218]
at com.datastax.bdp.server.DseDaemon.setup(DseDaemon.java:336) ~[dse-core-4.8.6.jar:4.8.6]
at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:564) ~[cassandra-all-2.1.13.1218.jar:2.1.13.1218]
at com.datastax.bdp.DseModule.main(DseModule.java:74) [dse-core-4.8.6.jar:4.8.6]
Caused by: org.apache.cassandra.streaming.StreamException: Stream failed
at org.apache.cassandra.streaming.management.StreamEventJMXNotifier.onFailure(StreamEventJMXNotifier.java:85) ~[cassandra-all-2.1.13.1218.jar:2.1.13.1218]
at com.google.common.util.concurrent.Futures$4.run(Futures.java:1172) ~[guava-16.0.1.jar:na]
at com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297) ~[guava-16.0.1.jar:na]
at com.google.common.util.concurrent.ExecutionList.executeListener(ExecutionList.java:156) ~[guava-16.0.1.jar:na]
at com.google.common.util.concurrent.ExecutionList.execute(ExecutionList.java:145) ~[guava-16.0.1.jar:na]
at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:202) ~[guava-16.0.1.jar:na]
at org.apache.cassandra.streaming.StreamResultFuture.maybeComplete(StreamResultFuture.java:208) ~[cassandra-all-2.1.13.1218.jar:2.1.13.1218]
at org.apache.cassandra.streaming.StreamResultFuture.handleSessionComplete(StreamResultFuture.java:184) ~[cassandra-all-2.1.13.1218.jar:2.1.13.1218]
at org.apache.cassandra.streaming.StreamSession.closeSession(StreamSession.java:415) ~[cassandra-all-2.1.13.1218.jar:2.1.13.1218]
at org.apache.cassandra.streaming.StreamSession.maybeCompleted(StreamSession.java:692) ~[cassandra-all-2.1.13.1218.jar:2.1.13.1218]
at org.apache.cassandra.streaming.StreamSession.taskCompleted(StreamSession.java:653) ~[cassandra-all-2.1.13.1218.jar:2.1.13.1218]
at org.apache.cassandra.streaming.StreamReceiveTask$OnCompletionRunnable.run(StreamReceiveTask.java:179) ~[cassandra-all-2.1.13.1218.jar:2.1.13.1218]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_77]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_77]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_77]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[na:1.8.0_77]
at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_77]
You can't skip the JOINING stage while a node is bootstrapping. I'm guessing you were following these steps to replace the node? https://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_replace_node_t.html
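If so, the replacement node is normally started with the replace_address JVM flag, roughly like this in cassandra-env.sh on the new node (the address is a placeholder for the node being replaced, and the line should be removed once the bootstrap completes):
JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address=<address-of-dead-node>"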
Were all of your nodes online throughout the entire streaming process? If a replica crashes or goes offline while streaming, it can cause the stream to fail. If your nodes are very low on disk space, it can cause Cassandra to act in odd ways or crash. If this is the case, you may need to add additional storage to your existing node before adding the new node.
You can add more disk space to your existing node like this (a rough command sketch follows the list):
Stop Cassandra
Attach a larger disk to the machine/VM
Copy the cassandra data directory (/var/lib/cassandra/data) to the new disk
Change the cassandra data directory mount point to the new disk using a symlink
Start Cassandra up
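Assuming an example device /dev/xvdf and the default data directory (adjust device names, mount points, and service commands for your system), the steps look roughly like this:
sudo service cassandra stop
sudo mkfs.ext4 /dev/xvdf                                  # format the newly attached disk
sudo mkdir -p /mnt/cassandra-new
sudo mount /dev/xvdf /mnt/cassandra-new
sudo rsync -a /var/lib/cassandra/data/ /mnt/cassandra-new/data/
sudo mv /var/lib/cassandra/data /var/lib/cassandra/data.old
sudo ln -s /mnt/cassandra-new/data /var/lib/cassandra/data
sudo chown -R cassandra:cassandra /mnt/cassandra-new/data
sudo service cassandra start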

Kinesis Spark Streaming longevity issues

I'm having issues with the longevity of a Spark-Kinesis streaming application running on the Spark standalone cluster manager. The program runs for around 50 hours and then stops receiving data from Kinesis without giving any valid error about why it stopped. But if I restart the application, it works for another day and a half or so.
I'm seeing a whole lot of errors during program execution. I'm not sure whether they are related to the unexpected stoppage of events, because these errors appear in the logs even when the Spark application is working fine.
There is no error specific to the stoppage in the driver or the executors. I also checked for out-of-memory errors but was not able to spot any in the logs. Could you please help me understand what these error messages mean? Do they have anything to do with the longevity issue? Where do you think I should start debugging to understand what is happening?
2016-04-15 13:32:19 INFO KinesisRecordProcessor:58 - Shutdown: Shutting down workerId ip-10-205-1-150.us-west-2.compute.internal:6394789f-acb9-4702-8ea2-c2a3637d925a with reason ZOMBIE
2016-04-15 13:32:19 ERROR ShutdownTask:123 - Application exception.
java.lang.NullPointerException
at java.util.concurrent.ConcurrentHashMap.hash(ConcurrentHashMap.java:333)
at java.util.concurrent.ConcurrentHashMap.remove(ConcurrentHashMap.java:1175)
at org.apache.spark.streaming.kinesis.KinesisCheckpointer.removeCheckpointer(KinesisCheckpointer.scala:66)
at org.apache.spark.streaming.kinesis.KinesisReceiver.removeCheckpointer(KinesisReceiver.scala:249)
at org.apache.spark.streaming.kinesis.KinesisRecordProcessor.shutdown(KinesisRecordProcessor.scala:124)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.V1ToV2RecordProcessorAdapter.shutdown(V1ToV2RecordProcessorAdapter.java:48)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShutdownTask.call(ShutdownTask.java:94)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:48)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:23)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
2016-04-15 13:32:19 ERROR ShutdownTask:123 - Application exception.
java.lang.NullPointerException
at java.util.concurrent.ConcurrentHashMap.hash(ConcurrentHashMap.java:333)
at java.util.concurrent.ConcurrentHashMap.remove(ConcurrentHashMap.java:1175)
at org.apache.spark.streaming.kinesis.KinesisCheckpointer.removeCheckpointer(KinesisCheckpointer.scala:66)
at org.apache.spark.streaming.kinesis.KinesisReceiver.removeCheckpointer(KinesisReceiver.scala:249)
at org.apache.spark.streaming.kinesis.KinesisRecordProcessor.shutdown(KinesisRecordProcessor.scala:124)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.V1ToV2RecordProcessorAdapter.shutdown(V1ToV2RecordProcessorAdapter.java:48)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShutdownTask.call(ShutdownTask.java:94)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:48)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:23)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
2016-04-15 13:33:12 INFO LeaseRenewer:235 - Worker ip-10-205-1-151.us-west-2.compute.internal:188bd3f5-095b-405f-ac9f-b7eff11e16d1 lost lease with key shardId-000000000046 - discovered during update
2016-04-15 13:33:12 WARN MetricsHelper:67 - No metrics scope set in thread RecurringTimer - Kinesis Checkpointer - Worker ip-10-205-1-151.us-west-2.compute.internal:188bd3f5-095b-405f-ac9f-b7eff11e16d1, getMetricsScope returning NullMetricsScope.
2016-04-15 13:33:12 ERROR KinesisRecordProcessor:95 - ShutdownException: Caught shutdown exception, skipping checkpoint.
com.amazonaws.services.kinesis.clientlibrary.exceptions.ShutdownException: Can't update checkpoint - instance doesn't hold the lease for this shard
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisClientLibLeaseCoordinator.setCheckpoint(KinesisClientLibLeaseCoordinator.java:120)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.RecordProcessorCheckpointer.advancePosition(RecordProcessorCheckpointer.java:216)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.RecordProcessorCheckpointer.checkpoint(RecordProcessorCheckpointer.java:137)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.RecordProcessorCheckpointer.checkpoint(RecordProcessorCheckpointer.java:103)
at org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1$$anonfun$apply$1.apply$mcV$sp(KinesisCheckpointer.scala:81)
at org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1$$anonfun$apply$1.apply(KinesisCheckpointer.scala:81)
at org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1$$anonfun$apply$1.apply(KinesisCheckpointer.scala:81)
at scala.util.Try$.apply(Try.scala:161)
at org.apache.spark.streaming.kinesis.KinesisRecordProcessor$.retryRandom(KinesisRecordProcessor.scala:145)
at org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1.apply(KinesisCheckpointer.scala:81)
at org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1.apply(KinesisCheckpointer.scala:75)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.streaming.kinesis.KinesisCheckpointer.checkpoint(KinesisCheckpointer.scala:75)
at org.apache.spark.streaming.kinesis.KinesisCheckpointer.org$apache$spark$streaming$kinesis$KinesisCheckpointer$$checkpointAll(KinesisCheckpointer.scala:103)
at org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$1.apply$mcVJ$sp(KinesisCheckpointer.scala:117)
at org.apache.spark.streaming.util.RecurringTimer.triggerActionForNextInterval(RecurringTimer.scala:94)
at org.apache.spark.streaming.util.RecurringTimer.org$apache$spark$streaming$util$RecurringTimer$$loop(RecurringTimer.scala:106)
at org.apache.spark.streaming.util.RecurringTimer$$anon$1.run(RecurringTimer.scala:29)
2016-04-15 13:33:12 WARN KinesisCheckpointer:91 - Failed to checkpoint shardId shardId-000000000046 to DynamoDB.
com.amazonaws.services.kinesis.clientlibrary.exceptions.ShutdownException: Can't update checkpoint - instance doesn't hold the lease for this shard
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisClientLibLeaseCoordinator.setCheckpoint(KinesisClientLibLeaseCoordinator.java:120)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.RecordProcessorCheckpointer.advancePosition(RecordProcessorCheckpointer.java:216)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.RecordProcessorCheckpointer.checkpoint(RecordProcessorCheckpointer.java:137)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.RecordProcessorCheckpointer.checkpoint(RecordProcessorCheckpointer.java:103)
at org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1$$anonfun$apply$1.apply$mcV$sp(KinesisCheckpointer.scala:81)
at org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1$$anonfun$apply$1.apply(KinesisCheckpointer.scala:81)
at org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1$$anonfun$apply$1.apply(KinesisCheckpointer.scala:81)
at scala.util.Try$.apply(Try.scala:161)
at org.apache.spark.streaming.kinesis.KinesisRecordProcessor$.retryRandom(KinesisRecordProcessor.scala:145)
at org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1.apply(KinesisCheckpointer.scala:81)
at org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1.apply(KinesisCheckpointer.scala:75)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.streaming.kinesis.KinesisCheckpointer.checkpoint(KinesisCheckpointer.scala:75)
at org.apache.spark.streaming.kinesis.KinesisCheckpointer.org$apache$spark$streaming$kinesis$KinesisCheckpointer$$checkpointAll(KinesisCheckpointer.scala:103)
at org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$1.apply$mcVJ$sp(KinesisCheckpointer.scala:117)
at org.apache.spark.streaming.util.RecurringTimer.triggerActionForNextInterval(RecurringTimer.scala:94)
at org.apache.spark.streaming.util.RecurringTimer.org$apache$spark$streaming$util$RecurringTimer$$loop(RecurringTimer.scala:106)
at org.apache.spark.streaming.util.RecurringTimer$$anon$1.run(RecurringTimer.scala:29)
2016-04-15 13:33:12 INFO LeaseRenewer:116 - Worker ip-10-205-1-151.us-west-2.compute.internal:188bd3f5-095b-405f-ac9f-b7eff11e16d1 lost lease with key shardId-000000000046
2016-04-15 13:33:12 INFO MemoryStore:58 - Block input-2-1460727103359 stored as values in memory (estimated size 120.3 KB, free 171.8 MB)
