Can't add a new Cassandra datacenter due to streaming errors

Using DSE 4.8.6 (C* 2.1.13.1218)
When I try adding a new node in a new datacenter, the bootstrap / node rebuild is always interrupted by streaming errors.
Error example from system.log:
ERROR [STREAM-IN-/172.31.47.213] 2016-04-19 12:30:28,531 StreamSession.java:621 - [Stream #743d44e0-060e-11e6-985c-c1820b05e9ae] Remote peer 172.31.47.213 failed stream session.
INFO [STREAM-IN-/172.31.47.213] 2016-04-19 12:30:30,665 StreamResultFuture.java:180 - [Stream #743d44e0-060e-11e6-985c-c1820b05e9ae] Session with /172.31.47.213 is complete
There is about 500GB of data to be streamed to the new node. The bootstrap or rebuild operation streams it from 4 different nodes in the other (main) DC.
When a streaming error occurs, all the data synced so far is wiped and I have to start over.
What I tried so far:
bootstrapping the node
setting auto_bootstrap: false in cassandra.yaml and manually running nodetool rebuild
disabling streaming_socket_timeout_in_ms and setting more aggressive TCP keepalive values in my Linux config, following the advice in CASSANDRA-9440 (see the sketch after this list)
increasing phi_convict_threshold (to the max)
not bootstrapping the node and using repair to stream the data (I stopped the repair with a nearly full disk and 80K SSTables; after 3 days of trying to compact them, I gave up)
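For reference, the TCP keepalive tuning I applied (following CASSANDRA-9440) looked roughly like this; the values are illustrative, not necessarily the exact ones from my setup, and the cassandra.yaml change has to go on every node:
# more aggressive keepalives so dead streaming connections are detected quickly
sudo sysctl -w net.ipv4.tcp_keepalive_time=60     # start probing after 60s of idle
sudo sysctl -w net.ipv4.tcp_keepalive_intvl=10    # 10s between probes
sudo sysctl -w net.ipv4.tcp_keepalive_probes=3    # drop the connection after 3 failed probes
# and in cassandra.yaml on every node:
# streaming_socket_timeout_in_ms: 0               # 0 disables the streaming socket timeout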
Any other things I should try? I'm in the process of running nodetool scrub on every failing node to see if this helps...
On the stream-out node, these are the error messages:
ERROR [STREAM-IN-/172.31.45.28] 2016-05-11 13:10:43,842 StreamSession.java:505 - [Stream #ecfe0390-1763-11e6-b6c8-c1820b05e9ae] Streaming error occurred
java.net.SocketTimeoutException: null
at sun.nio.ch.SocketAdaptor$SocketInputStream.read(SocketAdaptor.java:229) ~[na:1.7.0_80]
at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:103) ~[na:1.7.0_80]
at java.nio.channels.Channels$ReadableByteChannelImpl.read(Channels.java:385) ~[na:1.7.0_80]
at org.apache.cassandra.streaming.messages.StreamMessage.deserialize(StreamMessage.java:51) ~[cassandra-all-2.1.14.1272.jar:2.1.14.1272]
at org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:257) ~[cassandra-all-2.1.14.1272.jar:2.1.14.1272]
at java.lang.Thread.run(Thread.java:745) [na:1.7.0_80]
and then:
INFO [STREAM-IN-/172.31.45.28] 2016-05-10 07:59:14,023 StreamResultFuture.java:180 - [Stream #ea1271b0-1679-11e6-917a-c1820b05e9ae] Session with /172.31.45.28 is complete
WARN [STREAM-IN-/172.31.45.28] 2016-05-10 07:59:14,023 StreamResultFuture.java:207 - [Stream #ea1271b0-1679-11e6-917a-c1820b05e9ae] Stream failed
ERROR [STREAM-OUT-/172.31.45.28] 2016-05-10 07:59:14,024 StreamSession.java:505 - [Stream #ea1271b0-1679-11e6-917a-c1820b05e9ae] Streaming error occurred
java.lang.AssertionError: Memory was freed
at org.apache.cassandra.io.util.SafeMemory.checkBounds(SafeMemory.java:97) ~[cassandra-all-2.1.13.1218.jar:2.1.13.1218]
at org.apache.cassandra.io.util.Memory.getLong(Memory.java:249) ~[cassandra-all-2.1.13.1218.jar:2.1.13.1218]
at org.apache.cassandra.io.compress.CompressionMetadata.getTotalSizeForSections(CompressionMetadata.java:247) ~[cassandra-all-2.1.13.1218.jar:2.1.13.1218]
at org.apache.cassandra.streaming.messages.FileMessageHeader.size(FileMessageHeader.java:112) ~[cassandra-all-2.1.13.1218.jar:2.1.13.1218]
at org.apache.cassandra.streaming.StreamSession.fileSent(StreamSession.java:546) ~[cassandra-all-2.1.13.1218.jar:2.1.13.1218]
at org.apache.cassandra.streaming.messages.OutgoingFileMessage$1.serialize(OutgoingFileMessage.java:50) ~[cassandra-all-2.1.13.1218.jar:2.1.13.1218]
at org.apache.cassandra.streaming.messages.OutgoingFileMessage$1.serialize(OutgoingFileMessage.java:41) ~[cassandra-all-2.1.13.1218.jar:2.1.13.1218]
at org.apache.cassandra.streaming.messages.StreamMessage.serialize(StreamMessage.java:45) ~[cassandra-all-2.1.13.1218.jar:2.1.13.1218]
at org.apache.cassandra.streaming.ConnectionHandler$OutgoingMessageHandler.sendMessage(ConnectionHandler.java:358) ~[cassandra-all-2.1.13.1218.jar:2.1.13.1218]
at org.apache.cassandra.streaming.ConnectionHandler$OutgoingMessageHandler.run(ConnectionHandler.java:338) ~[cassandra-all-2.1.13.1218.jar:2.1.13.1218]

As answered in the Cassandra ticket CASSANDRA-11345, this issue was due to a large SSTable file (40GB) being transferred.
Transferring that file takes more than 1 hour, and by default streaming operations time out if an outgoing transfer takes more than 1 hour.
To change this default behavior, set streaming_socket_timeout_in_ms in the cassandra.yaml configuration file to a large value (e.g. 72000000 ms, i.e. 20 hours).

Don't forget to change this value on the existing nodes too, not just the new nodes!
(not that I'm admitting anything here...)
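A minimal sketch of rolling that change out to every node (host names and the config path are placeholders; a DSE package install keeps cassandra.yaml under /etc/dse/cassandra/ and uses the dse service instead):
for h in node1 node2 node3; do   # list every existing and new node here
  # assumes the setting is not already present in the file; otherwise edit it in place
  ssh "$h" "echo 'streaming_socket_timeout_in_ms: 72000000' | sudo tee -a /etc/cassandra/cassandra.yaml"
  ssh "$h" "sudo service cassandra restart"   # rolling restart: one node at a time, wait for it to rejoin
done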

Related

Cassandra rebuild getting halted

I have a Cassandra cluster with 18 prod nodes in DC1 and 12 backup nodes in DC2. A few days ago all the backup nodes went down and crossed the gc_grace period. Now I am trying to bring all backup nodes back up, so I removed all data from the backup nodes and tried to rebuild, but the rebuild halts with a FileNotFoundException.
The rebuild command is: nohup nodetool rebuild DC1 &
(DC1 is the prod data center)
Error in nohup.out file :
Error while rebuilding node: Stream failed
-- StackTrace --
java.lang.RuntimeException: Error while rebuilding node: Stream failed
at org.apache.cassandra.service.StorageService.rebuild(StorageService.java:1076)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
Error in system.log:
Caused by: java.util.concurrent.ExecutionException: java.lang.RuntimeException: java.io.FileNotFoundException: /data1/cassandra/data/system/compactions_in_progress-55080ab05d9c388690a4acb25fe1f77b/system-compactions_in_progress-tmp-ka-62-Data.db (No such file or directory)
at com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299) ~[guava-16.0.jar:na]
at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286) ~[guava-16.0.jar:na]
at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116) ~[guava-16.0.jar:na]
at org.apache.cassandra.utils.FBUtilities.waitOnFuture(FBUtilities.java:372) ~[apache-cassandra-2.1.16.jar:2.1.16]
... 12 common frames omitted
Caused by: java.lang.RuntimeException: java.io.FileNotFoundException: /data1/cassandra/data/system/compactions_in_progress-55080ab05d9c388690a4acb25fe1f77b/system-compactions_in_progress-tmp-ka-62-Data.db (No such file or directory)
at org.apache.cassandra.io.util.SequentialWriter.<init>(SequentialWriter.java:82) ~[apache-cassandra-2.1.16.jar:2.1.16]
at org.apache.cassandra.io.compress.CompressedSequentialWriter.<init>(CompressedSequentialWriter.java:67) ~[apache-cassandra-2.1.16.jar:2.1.16]
at org.apache.cassandra.io.util.SequentialWriter.open(SequentialWriter.java:124) ~[apache-cassandra-2.1.16.jar:2.1.16]
at org.apache.cassandra.io.sstable.SSTableWriter.<init>(SSTableWriter.java:130) ~[apache-cassandra-2.1.16.jar:2.1.16]
at org.apache.cassandra.db.Memtable$FlushRunnable.createFlushWriter(Memtable.java:414) ~[apache-cassandra-2.1.16.jar:2.1.16]
at org.apache.cassandra.db.Memtable$FlushRunnable.writeSortedContents(Memtable.java:351) ~[apache-cassandra-2.1.16.jar:2.1.16]
at org.apache.cassandra.db.Memtable$FlushRunnable.runMayThrow(Memtable.java:335) ~[apache-cassandra-2.1.16.jar:2.1.16]
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) ~[apache-cassandra-2.1.16.jar:2.1.16]
at com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297) ~[guava-16.0.jar:na]
at org.apache.cassandra.db.ColumnFamilyStore$Flush.run(ColumnFamilyStore.java:1134) ~[apache-cassandra-2.1.16.jar:2.1.16]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ~[na:1.7.0_79]
at java.util.concurrent.FutureTask.run(FutureTask.java:262) ~[na:1.7.0_79]
... 5 common frames omitted
Caused by: java.io.FileNotFoundException: /data1/cassandra/data/system/compactions_in_progress-55080ab05d9c388690a4acb25fe1f77b/system-compactions_in_progress-tmp-ka-62-Data.db (No such file or directory)
Your problem is not the FileNotFoundException itself. It's the fact that you are streaming system tables. System tables are created locally on the node when it starts up; all data should be streamed EXCEPT the system tables data under /data1/cassandra/data/system/.
Which Cassandra version are you using?
If you didn't change anything that forces Cassandra to stream the system tables, I would say this is a bug.
While you triggered the rebuild in DC2, there were compactions in progress in DC1. You can issue the following command on all nodes of DC1 to see the compactions in progress:
nodetool compactionstats
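For example, to check every DC1 node in one go (hypothetical host names; assumes SSH access to each node):
for h in dc1-node01 dc1-node02 dc1-node03; do   # list all 18 DC1 nodes
  echo "== $h =="
  ssh "$h" nodetool compactionstats
done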
As part of compaction, SSTables are merged together and the tmp "compactions_in_progress" tables disappear once the merge completes. So the streaming of those temp tables gets lost along the way from DC1 to DC2, resulting in this streaming failure.
Also, these compactions could have been triggered by a "nodetool repair" initiated in DC1. So wait for repairs to complete, if any are in progress, to avoid this situation.
Since there are 18 nodes in DC1, I believe the storage size of the cluster is huge. A cleaner way to get around this situation is to pause compaction for the duration of the rebuild and rebuild one keyspace at a time. So rather than rebuilding the entire cluster with
nohup nodetool rebuild DC1 &
issue the following command in DC1:
nodetool disableautocompaction keyspace-name1
Then rebuild that keyspace in DC2, one node at a time:
nohup nodetool rebuild keyspace-name1 DC1 &
Once the rebuild of that keyspace is complete on all nodes in DC2:
nodetool enableautocompaction keyspace-name1
Repeat the above steps for all the keyspaces until done. You can skip system tables like "system", which are local to each node and get rebuilt automatically as you bring the node up (even with an empty data directory).
If there are too many application keyspaces to deal with, it becomes a little bit of manual work; a scripted version of the loop is sketched below.
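A rough sketch of scripting that per-keyspace loop, mirroring the commands above (keyspace names and host lists are placeholders; it assumes SSH access to every node):
KEYSPACES="ks1 ks2 ks3"        # your application keyspaces
DC1_NODES="dc1-n01 dc1-n02"    # all 18 prod nodes
DC2_NODES="dc2-n01 dc2-n02"    # all 12 backup nodes

for ks in $KEYSPACES; do
  for h in $DC1_NODES; do ssh "$h" nodetool disableautocompaction "$ks"; done
  for h in $DC2_NODES; do ssh "$h" nodetool rebuild "$ks" DC1; done   # one DC2 node at a time, as in the answer above
  for h in $DC1_NODES; do ssh "$h" nodetool enableautocompaction "$ks"; done
done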

Cassandra 3.9 flush fails

We have a 5 node cassandra cluster running cassandra 3.9. We have a keyspace "ks" and a table "cf". We created several indexes on the table like "cf_c1_idx", "cf_c1_idx_1", "cf_c2_idx".
When I do a nodetool flush, the flush of one of the index files fails with the following exception:
-- StackTrace --
java.lang.RuntimeException: Last written key DecoratedKey(4dd1d75b-e52f-6c49-e7cd-c52a968e70de, 4dd1d75be52f6c49e7cdc52a968e70de) >= current key DecoratedKey(00000000-0000-0000-0000-000000000000, 5331cc31ae396031e6be66312c89c379) writing into /var/lib/cassandra/data/ks/cf-8d8b1ba0081c11e7a4206f8b05d669ae/.cf_c1_idx_1/mc-401-big-Data.db
at org.apache.cassandra.io.sstable.format.big.BigTableWriter.beforeAppend(BigTableWriter.java:122)
at org.apache.cassandra.io.sstable.format.big.BigTableWriter.append(BigTableWriter.java:161)
at org.apache.cassandra.io.sstable.SimpleSSTableMultiWriter.append(SimpleSSTableMultiWriter.java:48)
at org.apache.cassandra.db.Memtable$FlushRunnable.writeSortedContents(Memtable.java:458)
at org.apache.cassandra.db.Memtable$FlushRunnable.call(Memtable.java:493)
at org.apache.cassandra.db.Memtable$FlushRunnable.call(Memtable.java:380)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
When I run nodetool flush again after a few seconds, it succeeds without a hitch. We also sometimes see the same exception during commitlog replay after restarting a node. We end up deleting the commitlog directory so Cassandra can start, and then run a repair to sync the data that was lost. Is this happening because secondary indexes are not getting updated in time? Also, this is a read-intensive cluster.
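For reference, the recovery steps we end up doing look roughly like this (paths assume a default package install; moving the commitlog aside discards unflushed writes, which is why the repair afterwards is needed):
sudo service cassandra stop
sudo mv /var/lib/cassandra/commitlog /var/lib/cassandra/commitlog.bad   # set it aside rather than deleting outright
sudo install -d -o cassandra -g cassandra /var/lib/cassandra/commitlog
sudo service cassandra start
nodetool repair ks cf   # resync the writes lost with the discarded commitlog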

Cassandra 3 Repair never finishes

We have a cluster with 6 nodes across two datacenters (3 nodes each). We start a repair on one node and shortly afterwards we find something like this in the logs:
ERROR [Repair#1:1] 2016-05-31 01:33:28,075 CassandraDaemon.java:195 - Exception in thread Thread[Repair#1:1,5,RMI Runtime]
com.google.common.util.concurrent.UncheckedExecutionException: org.apache.cassandra.exceptions.RepairException: [repair #e8e21070-26be-11e6-aae8-77b20cefeee5 on ..... Validation failed in /xx.xxx.xx.xx
at com.google.common.util.concurrent.Futures.wrapAndThrowUnchecked(Futures.java:1525) ~[guava-18.0.jar:na]
at com.google.common.util.concurrent.Futures.getUnchecked(Futures.java:1511) ~[guava-18.0.jar:na]
at org.apache.cassandra.repair.RepairJob.run(RepairJob.java:162) ~[apache-cassandra-3.0.4.jar:3.0.4]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_77]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[na:1.8.0_77]
at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_77]
Afterwards nothing seems to happen anymore. We left the repair running for several days without interrupting it, but still nothing happens. We also tried it on two different clusters with the same result.
After searching the web we stumbled upon https://support.datastax.com/hc/en-us/articles/205256895--Validation-failed-when-running-a-nodetool-repair. It says that we should run "nodetool scrub" and, if that does not help, "sstablescrub".
We tried nodetool scrub, but the repair still does not work. We then started an sstablescrub, but it seems to take forever: it uses only one CPU at 100%, and while the data and index files are growing, it has now been running for over a day and the file is still only 1.2GB.
Is it normal for "sstablescrub" to be so slow?
The cluster has already been running for some time, and we missed the GCGraceSeconds window for the repair. Might that be why the repair is not working?
We currently do not know how to get the repair running; we hope someone can help.
What the exception indicates is that the node was not able to receive the results of the Merkle tree computation that was supposed to happen on /xx.xxx.xx.xx. Please check the logs on that node instead. The node you started the repair on is likely fine and does not require SSTable scrubbing.
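For example, to look for the failure on /xx.xxx.xx.xx (assuming the default log location; adjust the path for your install):
grep -iE 'validation|error' /var/log/cassandra/system.log | tail -n 50
# corruption or exceptions reported there would explain why the validation (Merkle tree) compaction failed;
# if that node's SSTables are the problem, scrub on that node, not on the one that started the repair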

Cassandra: avoid JOINING state, go straight to RUNNING

I've got a few nodes in a ring with replication factor 3 and I'm trying to change the hardware on one node. What's happening is that I'm getting a streaming failure exception.
I've tried a few times, always with the same failure. The upstream node (10.0.10.54) is dreadfully out of space and it's not realistic to compact or do any SSTable operations on it. What I would like to do is:
Bring up a new node with all the data streamed prior to the failed event
Run a repair on it (nodetool repair -pr)
Decommission the 10.0.10.54 node
What I can't figure out is this: every time I bring up the new node it goes into JOINING, and what I want is to force it into RUNNING with the data it has already copied during its JOINING state.
The exception for those interested -
WARN [StreamReceiveTask:6] 2016-04-25 06:48:51,107 StreamResultFuture.java:207 - [Stream #bb34c010-0a1b-11e6-a009-d100b9716be2] Stream failed
INFO [MemtableFlushWriter:214] 2016-04-25 06:48:51,107 Memtable.java:382 - Completed flushing /mnt/cassandra/data/system/compactions_in_progress-55080ab05d9c388690a4acb25fe1f77b/system-compactions_in_progress-tmp-ka-276-Data.db (0.000KiB) for commitlog position ReplayPosition(segmentId=1461502431578, position=9474892)
INFO [CompactionExecutor:259] 2016-04-25 06:48:51,252 CompactionTask.java:141 - Compacting [SSTableReader(path='/mnt/cassandra/data/trends/stream_trends-a5bb42a07e2911e58fd6f3cfff022ad4/trends-stream_trends-ka-79-Data.db'), SSTableReader(path='/mnt/cassandra/data/trends/stream_trends-a5bb42a07e2911e58fd6f3cfff022ad4/trends-stream_trends-ka-87-Data.db')]
ERROR [main] 2016-04-25 06:48:51,270 CassandraDaemon.java:581 - Exception encountered during startup
java.lang.RuntimeException: Error during boostrap: Stream failed
at org.apache.cassandra.dht.BootStrapper.bootstrap(BootStrapper.java:86) ~[cassandra-all-2.1.13.1218.jar:2.1.13.1218]
at org.apache.cassandra.service.StorageService.bootstrap(StorageService.java:1166) ~[cassandra-all-2.1.13.1218.jar:2.1.13.1218]
at org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:944) ~[cassandra-all-2.1.13.1218.jar:2.1.13.1218]
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:740) ~[cassandra-all-2.1.13.1218.jar:2.1.13.1218]
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:617) ~[cassandra-all-2.1.13.1218.jar:2.1.13.1218]
at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:389) ~[cassandra-all-2.1.13.1218.jar:2.1.13.1218]
at com.datastax.bdp.server.DseDaemon.setup(DseDaemon.java:336) ~[dse-core-4.8.6.jar:4.8.6]
at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:564) ~[cassandra-all-2.1.13.1218.jar:2.1.13.1218]
at com.datastax.bdp.DseModule.main(DseModule.java:74) [dse-core-4.8.6.jar:4.8.6]
Caused by: org.apache.cassandra.streaming.StreamException: Stream failed
at org.apache.cassandra.streaming.management.StreamEventJMXNotifier.onFailure(StreamEventJMXNotifier.java:85) ~[cassandra-all-2.1.13.1218.jar:2.1.13.1218]
at com.google.common.util.concurrent.Futures$4.run(Futures.java:1172) ~[guava-16.0.1.jar:na]
at com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297) ~[guava-16.0.1.jar:na]
at com.google.common.util.concurrent.ExecutionList.executeListener(ExecutionList.java:156) ~[guava-16.0.1.jar:na]
at com.google.common.util.concurrent.ExecutionList.execute(ExecutionList.java:145) ~[guava-16.0.1.jar:na]
at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:202) ~[guava-16.0.1.jar:na]
at org.apache.cassandra.streaming.StreamResultFuture.maybeComplete(StreamResultFuture.java:208) ~[cassandra-all-2.1.13.1218.jar:2.1.13.1218]
at org.apache.cassandra.streaming.StreamResultFuture.handleSessionComplete(StreamResultFuture.java:184) ~[cassandra-all-2.1.13.1218.jar:2.1.13.1218]
at org.apache.cassandra.streaming.StreamSession.closeSession(StreamSession.java:415) ~[cassandra-all-2.1.13.1218.jar:2.1.13.1218]
at org.apache.cassandra.streaming.StreamSession.maybeCompleted(StreamSession.java:692) ~[cassandra-all-2.1.13.1218.jar:2.1.13.1218]
at org.apache.cassandra.streaming.StreamSession.taskCompleted(StreamSession.java:653) ~[cassandra-all-2.1.13.1218.jar:2.1.13.1218]
at org.apache.cassandra.streaming.StreamReceiveTask$OnCompletionRunnable.run(StreamReceiveTask.java:179) ~[cassandra-all-2.1.13.1218.jar:2.1.13.1218]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_77]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_77]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_77]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[na:1.8.0_77]
at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_77]
You can't skip the JOINING stage while a node is bootstrapping. I'm guessing you were following these steps to replace the node? https://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_replace_node_t.html
Were all of your nodes online throughout the entire streaming process? If a replica crashes or goes offline while streaming, it can cause the stream to fail. If your nodes are very low on disk space, it can cause Cassandra to act in odd ways or crash. If this is the case, you may need to add additional storage to your existing node before adding the new node.
You can add more disk space to your existing node like this (a command sketch follows these steps):
Stop Cassandra
Attach a larger disk to the machine/VM
Copy the cassandra data directory (/var/lib/cassandra/data) to the new disk
Change the cassandra data directory mount point to the new disk using a symlink
Start Cassandra up
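A minimal sketch of those steps, assuming the default data path and a new disk already formatted and mounted at /mnt/bigdisk (device and paths are placeholders):
sudo service cassandra stop
sudo rsync -a /var/lib/cassandra/data/ /mnt/bigdisk/cassandra-data/
sudo mv /var/lib/cassandra/data /var/lib/cassandra/data.old
sudo ln -s /mnt/bigdisk/cassandra-data /var/lib/cassandra/data   # the old path now points at the new disk
sudo chown -R cassandra:cassandra /mnt/bigdisk/cassandra-data
sudo service cassandra start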

Kinesis Spark Streaming longevity issues

I'm having issues with the longevity of a Spark-Kinesis Streaming application running under the Spark standalone cluster manager. The program runs for around 50 hours and then stops receiving data from Kinesis without giving any valid error as to why it stopped. But if I restart the application, it works for another day and a half or so.
I'm seeing a whole lot of errors during program execution. I'm not sure whether these are related to the unexpected stoppage of events, because these errors appear in the logs even when the Spark application is working fine.
There is no error specific to the stoppage in the driver or executors. I also checked for any out-of-memory errors but was not able to spot one in the logs. Could you please help me understand what these error messages mean? Do they have anything to do with the longevity issue? Where do you think I should debug to understand what's happening?
2016-04-15 13:32:19 INFO KinesisRecordProcessor:58 - Shutdown: Shutting down workerId ip-10-205-1-150.us-west-2.compute.internal:6394789f-acb9-4702-8ea2-c2a3637d925a with reason ZOMBIE
2016-04-15 13:32:19 ERROR ShutdownTask:123 - Application exception.
java.lang.NullPointerException
at java.util.concurrent.ConcurrentHashMap.hash(ConcurrentHashMap.java:333)
at java.util.concurrent.ConcurrentHashMap.remove(ConcurrentHashMap.java:1175)
at org.apache.spark.streaming.kinesis.KinesisCheckpointer.removeCheckpointer(KinesisCheckpointer.scala:66)
at org.apache.spark.streaming.kinesis.KinesisReceiver.removeCheckpointer(KinesisReceiver.scala:249)
at org.apache.spark.streaming.kinesis.KinesisRecordProcessor.shutdown(KinesisRecordProcessor.scala:124)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.V1ToV2RecordProcessorAdapter.shutdown(V1ToV2RecordProcessorAdapter.java:48)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShutdownTask.call(ShutdownTask.java:94)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:48)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:23)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
2016-04-15 13:32:19 ERROR ShutdownTask:123 - Application exception.
java.lang.NullPointerException
at java.util.concurrent.ConcurrentHashMap.hash(ConcurrentHashMap.java:333)
at java.util.concurrent.ConcurrentHashMap.remove(ConcurrentHashMap.java:1175)
at org.apache.spark.streaming.kinesis.KinesisCheckpointer.removeCheckpointer(KinesisCheckpointer.scala:66)
at org.apache.spark.streaming.kinesis.KinesisReceiver.removeCheckpointer(KinesisReceiver.scala:249)
at org.apache.spark.streaming.kinesis.KinesisRecordProcessor.shutdown(KinesisRecordProcessor.scala:124)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.V1ToV2RecordProcessorAdapter.shutdown(V1ToV2RecordProcessorAdapter.java:48)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShutdownTask.call(ShutdownTask.java:94)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:48)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:23)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
2016-04-15 13:33:12 INFO LeaseRenewer:235 - Worker ip-10-205-1-151.us-west-2.compute.internal:188bd3f5-095b-405f-ac9f-b7eff11e16d1 lost lease with key shardId-000000000046 - discovered during update
2016-04-15 13:33:12 WARN MetricsHelper:67 - No metrics scope set in thread RecurringTimer - Kinesis Checkpointer - Worker ip-10-205-1-151.us-west-2.compute.internal:188bd3f5-095b-405f-ac9f-b7eff11e16d1, getMetricsScope returning NullMetricsScope.
2016-04-15 13:33:12 ERROR KinesisRecordProcessor:95 - ShutdownException: Caught shutdown exception, skipping checkpoint.
com.amazonaws.services.kinesis.clientlibrary.exceptions.ShutdownException: Can't update checkpoint - instance doesn't hold the lease for this shard
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisClientLibLeaseCoordinator.setCheckpoint(KinesisClientLibLeaseCoordinator.java:120)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.RecordProcessorCheckpointer.advancePosition(RecordProcessorCheckpointer.java:216)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.RecordProcessorCheckpointer.checkpoint(RecordProcessorCheckpointer.java:137)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.RecordProcessorCheckpointer.checkpoint(RecordProcessorCheckpointer.java:103)
at org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1$$anonfun$apply$1.apply$mcV$sp(KinesisCheckpointer.scala:81)
at org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1$$anonfun$apply$1.apply(KinesisCheckpointer.scala:81)
at org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1$$anonfun$apply$1.apply(KinesisCheckpointer.scala:81)
at scala.util.Try$.apply(Try.scala:161)
at org.apache.spark.streaming.kinesis.KinesisRecordProcessor$.retryRandom(KinesisRecordProcessor.scala:145)
at org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1.apply(KinesisCheckpointer.scala:81)
at org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1.apply(KinesisCheckpointer.scala:75)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.streaming.kinesis.KinesisCheckpointer.checkpoint(KinesisCheckpointer.scala:75)
at org.apache.spark.streaming.kinesis.KinesisCheckpointer.org$apache$spark$streaming$kinesis$KinesisCheckpointer$$checkpointAll(KinesisCheckpointer.scala:103)
at org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$1.apply$mcVJ$sp(KinesisCheckpointer.scala:117)
at org.apache.spark.streaming.util.RecurringTimer.triggerActionForNextInterval(RecurringTimer.scala:94)
at org.apache.spark.streaming.util.RecurringTimer.org$apache$spark$streaming$util$RecurringTimer$$loop(RecurringTimer.scala:106)
at org.apache.spark.streaming.util.RecurringTimer$$anon$1.run(RecurringTimer.scala:29)
2016-04-15 13:33:12 WARN KinesisCheckpointer:91 - Failed to checkpoint shardId shardId-000000000046 to DynamoDB.
com.amazonaws.services.kinesis.clientlibrary.exceptions.ShutdownException: Can't update checkpoint - instance doesn't hold the lease for this shard
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisClientLibLeaseCoordinator.setCheckpoint(KinesisClientLibLeaseCoordinator.java:120)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.RecordProcessorCheckpointer.advancePosition(RecordProcessorCheckpointer.java:216)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.RecordProcessorCheckpointer.checkpoint(RecordProcessorCheckpointer.java:137)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.RecordProcessorCheckpointer.checkpoint(RecordProcessorCheckpointer.java:103)
at org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1$$anonfun$apply$1.apply$mcV$sp(KinesisCheckpointer.scala:81)
at org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1$$anonfun$apply$1.apply(KinesisCheckpointer.scala:81)
at org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1$$anonfun$apply$1.apply(KinesisCheckpointer.scala:81)
at scala.util.Try$.apply(Try.scala:161)
at org.apache.spark.streaming.kinesis.KinesisRecordProcessor$.retryRandom(KinesisRecordProcessor.scala:145)
at org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1.apply(KinesisCheckpointer.scala:81)
at org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1.apply(KinesisCheckpointer.scala:75)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.streaming.kinesis.KinesisCheckpointer.checkpoint(KinesisCheckpointer.scala:75)
at org.apache.spark.streaming.kinesis.KinesisCheckpointer.org$apache$spark$streaming$kinesis$KinesisCheckpointer$$checkpointAll(KinesisCheckpointer.scala:103)
at org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$1.apply$mcVJ$sp(KinesisCheckpointer.scala:117)
at org.apache.spark.streaming.util.RecurringTimer.triggerActionForNextInterval(RecurringTimer.scala:94)
at org.apache.spark.streaming.util.RecurringTimer.org$apache$spark$streaming$util$RecurringTimer$$loop(RecurringTimer.scala:106)
at org.apache.spark.streaming.util.RecurringTimer$$anon$1.run(RecurringTimer.scala:29)
2016-04-15 13:33:12 INFO LeaseRenewer:116 - Worker ip-10-205-1-151.us-west-2.compute.internal:188bd3f5-095b-405f-ac9f-b7eff11e16d1 lost lease with key shardId-000000000046
2016-04-15 13:33:12 INFO MemoryStore:58 - Block input-2-1460727103359 stored as values in memory (estimated size 120.3 KB, free 171.8 MB)