Cassandra repair failing because of GC - garbage-collection

We have a 9-node cluster and run repairs every night as recommended (1 node each night).
We recently started having problems during the repairs: some nodes would die with OutOfMemory errors because the GC could not collect fast enough. In the beginning it was a promotion issue (as shown by the detailed GC logs).
So we assumed that CMS was not triggered early enough, which prevented ParNew from promoting surviving objects. We then lowered -XX:CMSInitiatingOccupancyFraction from 75 to 50 to force the old-generation GC to trigger sooner.
It seemed to work, but yesterday two nodes died because the GC couldn't cope with the allocation rate, producing logs like these:
INFO [ScheduledTasks:1] 2013-09-27 23:36:38,111 GCInspector.java (line 119) GC for ConcurrentMarkSweep: 21756 ms for 1 collections, 8003258240 used; max is 8211660800
WARN [ScheduledTasks:1] 2013-09-27 23:36:38,878 GCInspector.java (line 142) Heap is 0.9746211436302873 full. You may need to reduce memtable and/or cache sizes. Cassandra will now flush up to the two largest memtables to free up memory. Adjust flush_largest_memtables_at threshold in cassandra.yaml if you don't want Cassandra to do this automatically
INFO [ScheduledTasks:1] 2013-09-27 23:36:57,018 GCInspector.java (line 119) GC for ConcurrentMarkSweep: 17265 ms for 1 collections, 6587223560 used; max is 8211660800
WARN [ScheduledTasks:1] 2013-09-27 23:36:57,243 GCInspector.java (line 142) Heap is 0.802179208376459 full. You may need to reduce memtable and/or cache sizes. Cassandra will now flush up to the two largest memtables to free up memory. Adjust flush_largest_memtables_at threshold in cassandra.yaml if you don't want Cassandra to do this automatically
INFO [ScheduledTasks:1] 2013-09-27 23:37:18,180 GCInspector.java (line 119) GC for ConcurrentMarkSweep: 18437 ms for 1 collections, 6961687392 used; max is 8211660800
WARN [ScheduledTasks:1] 2013-09-27 23:37:18,785 GCInspector.java (line 142) Heap is 0.8477806818323523 full. You may need to reduce memtable and/or cache sizes. Cassandra will now flush up to the two largest memtables to free up memory. Adjust flush_largest_memtables_at threshold in cassandra.yaml if you don't want Cassandra to do this automatically
INFO [ScheduledTasks:1] 2013-09-27 23:37:40,416 GCInspector.java (line 119) GC for ConcurrentMarkSweep: 19032 ms for 1 collections, 7338693168 used; max is 8211660800
WARN [ScheduledTasks:1] 2013-09-27 23:37:40,456 GCInspector.java (line 142) Heap is 0.893691708259552 full. You may need to reduce memtable and/or cache sizes. Cassandra will now flush up to the two largest memtables to free up memory. Adjust flush_largest_memtables_at threshold in cassandra.yaml if you don't want Cassandra to do this automatically
INFO [ScheduledTasks:1] 2013-09-27 23:38:02,994 GCInspector.java (line 119) GC for ConcurrentMarkSweep: 18853 ms for 1 collections, 7570047632 used; max is 8211660800
WARN [ScheduledTasks:1] 2013-09-27 23:38:03,008 GCInspector.java (line 142) Heap is 0.9218656026318086 full. You may need to reduce memtable and/or cache sizes. Cassandra will now flush up to the two largest memtables to free up memory. Adjust flush_largest_memtables_at threshold in cassandra.yaml if you don't want Cassandra to do this automatically
INFO [ScheduledTasks:1] 2013-09-27 23:38:26,110 GCInspector.java (line 119) GC for ConcurrentMarkSweep: 19564 ms for 1 collections, 7714594464 used; max is 8211660800
WARN [ScheduledTasks:1] 2013-09-27 23:38:26,132 GCInspector.java (line 142) Heap is 0.9394682332713986 full. You may need to reduce memtable and/or cache sizes. Cassandra will now flush up to the two largest memtables to free up memory. Adjust flush_largest_memtables_at threshold in cassandra.yaml if you don't want Cassandra to do this automatically
INFO [ScheduledTasks:1] 2013-09-27 23:38:49,733 GCInspector.java (line 119) GC for ConcurrentMarkSweep: 20388 ms for 1 collections, 7843428464 used; max is 8211660800
WARN [ScheduledTasks:1] 2013-09-27 23:38:49,748 GCInspector.java (line 142) Heap is 0.9551573859456055 full. You may need to reduce memtable and/or cache sizes. Cassandra will now flush up to the two largest memtables to free up memory. Adjust flush_largest_memtables_at threshold in cassandra.yaml if you don't want Cassandra to do this automatically
INFO [ScheduledTasks:1] 2013-09-27 23:39:14,564 GCInspector.java (line 119) GC for ConcurrentMarkSweep: 20956 ms for 1 collections, 7934286376 used; max is 8211660800
WARN [ScheduledTasks:1] 2013-09-27 23:39:14,578 GCInspector.java (line 142) Heap is 0.9662218848591505 full. You may need to reduce memtable and/or cache sizes. Cassandra will now flush up to the two largest memtables to free up memory. Adjust flush_largest_memtables_at threshold in cassandra.yaml if you don't want Cassandra to do this automatically
INFO [ScheduledTasks:1] 2013-09-27 23:39:40,186 GCInspector.java (line 119) GC for ConcurrentMarkSweep: 22440 ms for 1 collections, 8008275464 used; max is 8211660800
WARN [ScheduledTasks:1] 2013-09-27 23:39:40,915 GCInspector.java (line 142) Heap is 0.9752321313612954 full. You may need to reduce memtable and/or cache sizes. Cassandra will now flush up to the two largest memtables to free up memory. Adjust flush_largest_memtables_at threshold in cassandra.yaml if you don't want Cassandra to do this automatically
INFO [ScheduledTasks:1] 2013-09-27 23:40:01,836 GCInspector.java (line 119) GC for ConcurrentMarkSweep: 19911 ms for 1 collections, 8022614576 used; max is 8211660800
WARN [ScheduledTasks:1] 2013-09-27 23:40:06,032 GCInspector.java (line 142) Heap is 0.976978320390438 full. You may need to reduce memtable and/or cache sizes. Cassandra will now flush up to the two largest memtables to free up memory. Adjust flush_largest_memtables_at threshold in cassandra.yaml if you don't want Cassandra to do this automatically
INFO [ScheduledTasks:1] 2013-09-27 23:40:27,407 GCInspector.java (line 119) GC for ConcurrentMarkSweep: 22590 ms for 1 collections, 8058828880 used; max is 8211660800
WARN [ScheduledTasks:1] 2013-09-27 23:40:31,091 GCInspector.java (line 142) Heap is 0.9813884275395302 full. You may need to reduce memtable and/or cache sizes. Cassandra will now flush up to the two largest memtables to free up memory. Adjust flush_largest_memtables_at threshold in cassandra.yaml if you don't want Cassandra to do this automatically
INFO [GossipTasks:1] 2013-09-27 23:40:53,798 Gossiper.java (line 799) InetAddress /<datacenter02>.<node2> is now DOWN
INFO [GossipTasks:1] 2013-09-27 23:40:53,846 Gossiper.java (line 799) InetAddress /<datacenter01>.<node3> is now DOWN
INFO [GossipStage:1] 2013-09-27 23:40:53,857 Gossiper.java (line 785) InetAddress /<datacenter01>.<node3> is now UP
INFO [GossipStage:1] 2013-09-27 23:40:53,909 Gossiper.java (line 785) InetAddress /<datacenter02>.<node2> is now UP
This time the heap grows and the GC runs for 10-20 seconds at a time without reducing heap usage, causing nodes to think that each other is down because they are busy GCing. In the end the nodes died of OOM.
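To quantify collections like the ones logged above, a small script can pull the durations and heap occupancy out of the GCInspector lines. This is a minimal sketch that assumes the log format shown in this question; the function names and the 10-second cutoff are illustrative choices, not Cassandra settings.

```python
import re

# Matches GCInspector lines of the form shown above, e.g.
# "GC for ConcurrentMarkSweep: 21756 ms for 1 collections, 8003258240 used; max is 8211660800"
GC_LINE = re.compile(
    r"GC for (?P<collector>\w+): (?P<ms>\d+) ms for (?P<count>\d+) collections, "
    r"(?P<used>\d+) used; max is (?P<max>\d+)"
)

def parse_gc_line(line):
    """Return a dict describing one GCInspector line, or None if it does not match."""
    m = GC_LINE.search(line)
    if m is None:
        return None
    used, max_heap = int(m.group("used")), int(m.group("max"))
    return {
        "collector": m.group("collector"),
        "duration_ms": int(m.group("ms")),
        "collections": int(m.group("count")),
        "heap_fraction": used / max_heap,  # heap in use after the collection
    }

def long_collections(lines, threshold_ms=10_000):
    """Yield parsed entries whose reported GC time exceeds threshold_ms (hypothetical cutoff)."""
    for line in lines:
        entry = parse_gc_line(line)
        if entry and entry["duration_ms"] > threshold_ms:
            yield entry

if __name__ == "__main__":
    sample = (
        "INFO [ScheduledTasks:1] 2013-09-27 23:36:38,111 GCInspector.java (line 119) "
        "GC for ConcurrentMarkSweep: 21756 ms for 1 collections, "
        "8003258240 used; max is 8211660800"
    )
    for entry in long_collections([sample]):
        print(entry["collector"], entry["duration_ms"], round(entry["heap_fraction"], 3))
```

Feeding the whole system.log through long_collections gives a quick list of collections that ran long while the heap stayed nearly full, which is the pattern described here.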
We then tried updating to the latest version of Cassandra (1.2.8 -> 1.2.10), even though none of the bugs fixed in between suggested any improvement for our problem. We reran a repair last night; no nodes crashed, but they failed to repair some ranges because of GCs like these:
INFO [ScheduledTasks:1] 2013-09-29 04:45:05,467 GCInspector.java (line 119) GC for ParNew: 22875 ms for 2 collections, 4128819328 used; max is 8211660800
INFO [ScheduledTasks:1] 2013-09-29 04:53:24,597 GCInspector.java (line 119) GC for ParNew: 133643 ms for 2 collections, 3102634584 used; max is 8211660800
This time it's ParNew taking ridiculous amounts of time.
I first thought of a load issue, but it continued to happen during the weekend, when only the repair is running.
Any help would be appreciated to diagnose / fix our issue.

The StatusLogger info doesn't show anything unusual except for GC taking a while. (Are you running on VMs? That tends to reduce GC performance: http://www.slideshare.net/eonnen/high-performance-network-programming-on-the-jvm-oscon-2012/62.)
My guess: repair adds enough load to the system that it falls behind processing requests and spends too much memory buffering them. You can verify this by looking for "dropped" messages in the log. By default it will buffer 10s worth of requests; to reduce this, lower the appropriate rpc timeouts in cassandra.yaml.
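Checking for dropped messages can be scripted. The sketch below assumes the dropped-message lines look like "12 MUTATION messages dropped in last 5000ms"; that format and the function name are assumptions, so adjust the pattern to whatever your log actually contains.

```python
import re
from collections import Counter

# Assumed format: "<count> <VERB> messages dropped ...", e.g.
# "12 MUTATION messages dropped in last 5000ms" (illustrative, not taken from the logs above).
DROPPED = re.compile(r"(?P<count>\d+) (?P<verb>[A-Z_]+) messages dropped")

def count_dropped(lines):
    """Sum dropped-message counts per verb across the given log lines."""
    totals = Counter()
    for line in lines:
        m = DROPPED.search(line)
        if m:
            totals[m.group("verb")] += int(m.group("count"))
    return totals

if __name__ == "__main__":
    # Hypothetical log lines used only to exercise the function.
    example = [
        "INFO [ScheduledTasks:1] 12 MUTATION messages dropped in last 5000ms",
        "INFO [ScheduledTasks:1] 3 READ messages dropped in last 5000ms",
        "INFO [ScheduledTasks:1] 5 MUTATION messages dropped in last 5000ms",
    ]
    print(dict(count_dropped(example)))
```

A non-empty result during the repair window would support the theory that the node is falling behind on requests.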

Try using the G1 GC instead of CMS. G1 is designed to avoid long pauses like these:
https://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsTuneJVM.html

Cassandra: Long Par New GC Pauses when Bootstrapping new nodes to cluster

I've seen an issue that happens fairly often when bootstrapping new nodes to a DataStax Enterprise Cassandra cluster (ver: 2.0.10.71).
When starting the new node to be bootstrapped, the bootstrap process starts to stream data from other nodes in the cluster. After a short period of time (usually a minute or less), other nodes in the cluster show high ParNew GC pause times and then drop off from the cluster, failing the stream session.
INFO [main] 2015-04-27 16:59:58,644 StreamResultFuture.java (line 91) [Stream #d42dfef0-ecfe-11e4-8099-5be75b0950b8] Beginning stream session with /10.1.214.186
INFO [GossipTasks:1] 2015-04-27 17:01:06,342 Gossiper.java (line 890) InetAddress /10.1.214.186 is now DOWN
INFO [HANDSHAKE-/10.1.214.186] 2015-04-27 17:01:21,400 OutboundTcpConnection.java (line 386) Handshaking version with /10.1.214.186
INFO [RequestResponseStage:11] 2015-04-27 17:01:23,439 Gossiper.java (line 876) InetAddress /10.1.214.186 is now UP
Then on the other node:
10.1.214.186 ERROR [STREAM-IN-/10.1.212.233] 2015-04-27 17:02:07,007 StreamSession.java (line 454) [Stream #d42dfef0-ecfe-11e4-8099-5be75b0950b8] Streaming error occurred
We also see lines like these in the logs:
10.1.219.232 INFO [ScheduledTasks:1] 2015-04-27 18:20:19,987 GCInspector.java (line 116) GC for ParNew: 118272 ms for 2 collections, 980357368 used; max is 12801015808
10.1.221.146 INFO [ScheduledTasks:1] 2015-04-27 18:20:29,468 GCInspector.java (line 116) GC for ParNew: 154911 ms for 1 collections, 1287263224 used; max is 12801015808
It seems that it happens on different nodes each time we try to bootstrap a new node.
I've found this related ticket. https://issues.apache.org/jira/browse/CASSANDRA-6653
My only guess is that when the new node comes up, a lot of compactions fire off, and that might be causing the GC pause times. I had considered setting concurrent_compactors to half my total CPU count.
Anyone have an idea?
Edit: More details around GC settings. Using i2.2xlarge nodes on EC2:
MAX_HEAP_SIZE="12G"
HEAP_NEWSIZE="800M"
Also
JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"
JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=1"
JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75"
JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
JVM_OPTS="$JVM_OPTS -XX:+UseTLAB"
With help from the DSE crew, the following settings helped us.
On an i2.2xlarge node (8 CPUs, 60 GB of RAM, local SSD only):
Increasing the heap new size to 512M * number of CPUs (in our case 4G)
Setting memtable_flush_writers = 8
Setting concurrent_compactors = total CPUs / 2 (in our case 4)
After making these changes we no longer see ParNew GC times exceeding 1 second on bootstrap (previously we were seeing GC times of 50-100 seconds). FWIW, we don't see these ParNew GC times during normal operation, only during bootstrap.
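The two CPU-based rules of thumb above can be written down as a small calculation. This is only a sketch of the arithmetic described in this answer; the function name is made up for illustration, and memtable_flush_writers is left out because the answer gives it as a fixed value rather than a formula.

```python
def bootstrap_tuning(num_cpus, new_size_per_cpu_mb=512):
    """Apply the sizing rules quoted above: 512M of new gen per CPU, compactors = CPUs / 2."""
    if num_cpus < 1:
        raise ValueError("num_cpus must be at least 1")
    return {
        # Value for HEAP_NEWSIZE in cassandra-env.sh
        "HEAP_NEWSIZE": f"{num_cpus * new_size_per_cpu_mb}M",
        # Value for concurrent_compactors in cassandra.yaml (never below 1)
        "concurrent_compactors": max(1, num_cpus // 2),
    }

if __name__ == "__main__":
    # 8 CPUs, as on the i2.2xlarge node described above
    print(bootstrap_tuning(8))
```

For 8 CPUs this yields a 4096M new generation and 4 compactors, matching the values reported above.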

Cassandra bootstrap hang in join state for a very long time

Our production environment has two Cassandra nodes (v2.0.5), and we want to add an additional node to improve scalability. We followed the steps described in the DataStax docs.
After bootstrapping the new node, we observed some exceptions in the log:
ERROR [CompactionExecutor:42] 2015-03-25 19:01:01,821 CassandraDaemon.java (line 192) Exception in thread Thread[CompactionExecutor:42,1,main]
java.lang.RuntimeException: org.apache.cassandra.db.filter.TombstoneOverwhelmingException
at org.apache.cassandra.service.pager.QueryPagers$1.next(QueryPagers.java:154)
at org.apache.cassandra.service.pager.QueryPagers$1.next(QueryPagers.java:137)
It then repeated some compaction tasks, and after two weeks it still had not completed bootstrap. The node remains in a not-joined state:
INFO [CompactionExecutor:4468] 2015-03-30 09:18:20,288 ColumnFamilyStore.java (line 784) Enqueuing flush of Memtable-compactions_in_progress#1247174540(212/13568 serialized/live bytes, 7 ops)
INFO [FlushWriter:314] 2015-03-30 09:18:22,408 Memtable.java (line 373) Completed flushing /var/lib/cassandra/data/production_alarm_keyspace/alarm_history_data_new/production_alarm_keyspace-alarm_history_data_new-jb-118-Data.db (11216962 bytes) for commitlog position ReplayPosition(segmentId=1427280544702, position=24550137)
INFO [FlushWriter:314] 2015-03-30 09:18:22,409 Memtable.java (line 333) Writing Memtable-alarm_master_data#37361826(26718076/141982437 serialized/live bytes, 791595 ops)
INFO [FlushWriter:314] 2015-03-30 09:18:24,018 Memtable.java (line 373) Completed flushing /var/lib/cassandra/data/production_alarm_keyspace/alarm_master_data/production_alarm_keyspace-alarm_master_data-jb-346-Data.db (8407637 bytes) for commitlog position ReplayPosition(segmentId=1427280544702, position=24550137)
INFO [FlushWriter:314] 2015-03-30 09:18:24,018 Memtable.java (line 333) Writing Memtable-compactions_in_progress#1247174540(212/13568 serialized/live bytes, 7 ops)
INFO [FlushWriter:314] 2015-03-30 09:18:24,185 Memtable.java (line 373) Completed flushing /var/lib/cassandra/data/system/compactions_in_progress/system-compactions_in_progress-jb-1019-Data.db (201 bytes) for commitlog position ReplayPosition(segmentId=1427280544702, position=24550511)
INFO [CompactionExecutor:4468] 2015-03-30 09:18:24,186 CompactionTask.java (line 115) Compacting [SSTableReader(path='/var/lib/cassandra/data/production_alarm_keyspace/alarm_common_dump_by_minutes/production_alarm_keyspace-alarm_common_dump_by_minutes-jb-356-Data.db'), SSTableReader(path='/var/lib/cassandra/data/production_alarm_keyspace/alarm_common_dump_by_minutes/production_alarm_keyspace-alarm_common_dump_by_minutes-jb-357-Data.db'), SSTableReader(path='/var/lib/cassandra/data/production_alarm_keyspace/alarm_common_dump_by_minutes/production_alarm_keyspace-alarm_common_dump_by_minutes-jb-355-Data.db'), SSTableReader(path='/var/lib/cassandra/data/production_alarm_keyspace/alarm_common_dump_by_minutes/production_alarm_keyspace-alarm_common_dump_by_minutes-jb-354-Data.db')]
INFO [CompactionExecutor:4468] 2015-03-30 09:18:39,189 ColumnFamilyStore.java (line 784) Enqueuing flush of Memtable-compactions_in_progress#810255650(0/0 serialized/live bytes, 1 ops)
INFO [FlushWriter:314] 2015-03-30 09:18:39,189 Memtable.java (line 333) Writing Memtable-compactions_in_progress#810255650(0/0 serialized/live bytes, 1 ops)
INFO [FlushWriter:314] 2015-03-30 09:18:39,357 Memtable.java (line 373) Completed flushing /var/lib/cassandra/data/system/compactions_in_progress/system-compactions_in_progress-jb-1020-Data.db (42 bytes) for commitlog position ReplayPosition(segmentId=1427280544702, position=25306969)
INFO [CompactionExecutor:4468] 2015-03-30 09:18:39,367 CompactionTask.java (line 275) Compacted 4 sstables to [/var/lib/cassandra/data/production_alarm_keyspace/alarm_common_dump_by_minutes/production_alarm_keyspace-alarm_common_dump_by_minutes-jb-358,]. 70,333,241 bytes to 70,337,669 (~100% of original) in 15,180ms = 4.418922MB/s. 260 total partitions merged to 248. Partition merge counts were {1:236, 2:12, }
nodetool status shows only two nodes, which is expected because nodetool in 2.0.5 has a bug where joining nodes are not shown.
[bpmesb#bpmesbap2 ~]$ nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address      Load      Tokens  Owns   Host ID                               Rack
UN  172.18.8.56  99 GB     256     51.1%  7321548a-3998-4122-965f-0366dd0cc47e  rack1
UN  172.18.8.57  93.78 GB  256     48.9%  bb306032-ff1c-4209-8300-d8c3de843f26  rack1
Can anybody help with this? The DataStax docs say bootstrap should only take a few minutes, but ours has not completed after 2 weeks. We searched Stack Overflow and found this issue, which may be related to our problem.
After a few days of testing and reading the exception log, we found what may be the key issue behind this problem.
ERROR 19:01:01,821 Exception in thread Thread[NonPeriodicTasks:1,5,main]
java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.RuntimeException: org.apache.cassandra.db.filter.TombstoneOverwhelmingException
at org.apache.cassandra.utils.FBUtilities.waitOnFuture(FBUtilities.java:413)
at org.apache.cassandra.db.index.SecondaryIndexManager.maybeBuildSecondaryIndexes(SecondaryIndexManager.java:140)
at org.apache.cassandra.streaming.StreamReceiveTask$OnCompletionRunnable.run(StreamReceiveTask.java:113)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
The log shows that one stream receive task hit a TombstoneOverwhelmingException. We think that is why Cassandra never changed the node's state to normal.
These are the steps we took to fix the problem. We used nodetool to compact the table and rebuild its secondary index on the two original nodes.
nohup nodetool compact production_alarm_keyspace object_daily_data &
nohup nodetool rebuild_index production_alarm_keyspace object_daily_data object_daily_data_alarm_no_idx_1 &
We then restarted the new node, and after an hour it moved to the normal state and has worked fine since.

Cassandra CPU Load (too much)

Using top
8260 root 20 0 5163m 4.7g 133m S 144.6 30.5 2496:46 java
Most of the time %CPU is >170.
I am trying to identify the issue. I think GC or flushing is to blame.
S0    S1     E      O      P      YGC    YGCT     FGC  FGCT    GCT      LGCC                GCC
0.00  16.73  74.74  29.33  59.91  27819  407.186  206  10.729  417.914  Allocation Failure  No GC
0.00  16.73  99.57  29.33  59.91  27820  407.186  206  10.729  417.914  Allocation Failure  Allocation Failure
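One way to read the jstat figures above is to divide the accumulated GC time by the number of collections. The sketch below does that for a single row; it assumes the column order shown above (ten numeric columns followed by the LGCC and GCC text columns), and the function names are illustrative.

```python
# Numeric columns in the order they appear in the jstat output above.
NUMERIC_COLUMNS = ["S0", "S1", "E", "O", "P", "YGC", "YGCT", "FGC", "FGCT", "GCT"]

def parse_jstat_row(row):
    """Parse the numeric part of one jstat row; the trailing cause columns are ignored."""
    values = row.split()[: len(NUMERIC_COLUMNS)]
    return dict(zip(NUMERIC_COLUMNS, map(float, values)))

def average_gc_ms(stats):
    """Return (average young GC ms, average full GC ms) from accumulated counts and seconds."""
    young = 1000 * stats["YGCT"] / stats["YGC"] if stats["YGC"] else 0.0
    full = 1000 * stats["FGCT"] / stats["FGC"] if stats["FGC"] else 0.0
    return young, full

if __name__ == "__main__":
    row = "0.00 16.73 74.74 29.33 59.91 27819 407.186 206 10.729 417.914 Allocation Failure No GC"
    young_ms, full_ms = average_gc_ms(parse_jstat_row(row))
    print(f"avg young GC: {young_ms:.1f} ms, avg full GC: {full_ms:.1f} ms")
```

For the first row this works out to roughly 15 ms per young collection and 52 ms per full collection, which gives a baseline for judging whether individual collections or their frequency is the bigger cost.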
Also, the Cassandra logs show ReplayPosition entries with the same segment ID, and the memtable is being flushed too often.
INFO [SlabPoolCleaner] 2015-01-20 13:55:48,515 ColumnFamilyStore.java:840 - Enqueuing flush of bid_list: 112838010 (11%) on-heap, 0 (0%) off-heap
INFO [MemtableFlushWriter:1587] 2015-01-20 13:55:48,516 Memtable.java:325 - Writing Memtable-bid_list#2003093066(23761503 serialized bytes, 211002 ops, 11%/0% of on/off-heap limit)
INFO [MemtableFlushWriter:1587] 2015-01-20 13:55:49,251 Memtable.java:364 - Completed flushing /root/Cassandra/apache-cassandra-2.1.2/bin/./../data/data/bigdspace/bid_list-27b59f109fa211e498559b0947587867/bigdspace-bid_list-ka-3965-Data.db (4144688 bytes) for commitlog position ReplayPosition(segmentId=1421647511710, position=25289038)
INFO [SlabPoolCleaner] 2015-01-20 13:56:23,429 ColumnFamilyStore.java:840 - Enqueuing flush of bid_list: 104056985 (10%) on-heap, 0 (0%) off-heap
INFO [MemtableFlushWriter:1589] 2015-01-20 13:56:23,429 Memtable.java:325 - Writing Memtable-bid_list#1124683519(21909522 serialized bytes, 194778 ops, 10%/0% of on/off-heap limit)
INFO [MemtableFlushWriter:1589] 2015-01-20 13:56:24,130 Memtable.java:364 - Completed flushing /root/Cassandra/apache-cassandra-2.1.2/bin/./../data/data/bigdspace/bid_list-27b59f109fa211e498559b0947587867/bigdspace-bid_list-ka-3967-Data.db (3830733 bytes) for commitlog position ReplayPosition(segmentId=1421647511710, position=25350445)
INFO [SlabPoolCleaner] 2015-01-20 13:56:55,493 ColumnFamilyStore.java:840 - Enqueuing flush of bid_list: 95807739 (9%) on-heap, 0 (0%) off-heap
INFO [MemtableFlushWriter:1590] 2015-01-20 13:56:55,494 Memtable.java:325 - Writing Memtable-bid_list#473510037(20170635 serialized bytes, 179514 ops, 9%/0% of on/off-heap limit)
INFO [MemtableFlushWriter:1590] 2015-01-20 13:56:56,151 Memtable.java:364 - Completed flushing /root/Cassandra/apache-cassandra-2.1.2/bin/./../data/data/bigdspace/bid_list-27b59f109fa211e498559b0947587867/bigdspace-bid_list-ka-3968-Data.db (3531752 bytes) for commitlog position ReplayPosition(segmentId=1421647511710, position=25373052)
Any help or suggestions would be great. I have also disabled durable writes (set to false) for the keyspace. Thanks
I just found out that after restarting all the nodes, YGC on one of the servers kicks in even when nothing is happening (I had stopped the data dump, etc.).
What type of compaction do you use? Size tiered or Leveled?
If you are using leveled compaction, can you switch over to size-tiered? You seem to have too many compactions. Increasing the SSTable size for leveled compaction may also help.
sstable_size_in_mb (Default: 160MB)
The target size for SSTables that use the leveled compaction strategy. Although SSTable sizes
should be less or equal to sstable_size_in_mb, it is possible to have
a larger SSTable during compaction. This occurs when data for a given
partition key is exceptionally large. The data is not split into two
SSTables.
(http://www.datastax.com/documentation/cassandra/1.2/cassandra/reference/referenceTableAttributes.html#reference_ds_zyq_zmz_1k__sstable_size_in_mb)
If you are using size-tiered compaction, increase the number of SSTables required before a minor compaction is triggered. This is set when the table is created, but you can change it using the ALTER TABLE command. Example below:
ALTER TABLE users WITH
compaction = { 'class' : 'SizeTieredCompactionStrategy', 'min_threshold' : 6 };
This triggers a minor compaction once 6 SSTables have been created.

DataStax Enterprise: Spark Cassandra Batch Size

I set the parameter spark.cassandra.output.batch.size.rows in my SparkConf as follows:
val conf = new SparkConf(true)
.set("spark.cassandra.connection.host", "host")
.set("spark.cassandra.auth.username", "cassandra")
.set("spark.cassandra.auth.password", "cassandra")
.set("spark.cassandra.output.batch.size.rows", "5120")
.set("spark.cassandra.output.concurrent.writes", "10")
but when I perform
saveToCassandra("data","ten_days")
I continue to see warnings in my system.log:
INFO [FlushWriter:7] 2014-11-20 11:11:16,498 Memtable.java (line 395) Completed flushing /var/lib/cassandra/data/system/hints/system-hints-jb-76-Data.db (5747287 bytes) for commitlog position ReplayPosition(segmentId=1416480663951, position=44882909)
INFO [FlushWriter:7] 2014-11-20 11:11:16,499 Memtable.java (line 355) Writing Memtable-ten_days#1656582530(32979978/329799780 serialized/live bytes, 551793 ops)
WARN [Native-Transport-Requests:761] 2014-11-20 11:11:16,499 BatchStatement.java (line 226) Batch of prepared statements for [data.ten_days] is of size 36825, exceeding specified threshold of 5120 by 31705.
WARN [Native-Transport-Requests:777] 2014-11-20 11:11:16,500 BatchStatement.java (line 226) Batch of prepared statements for [data.ten_days] is of size 36813, exceeding specified threshold of 5120 by 31693.
WARN [Native-Transport-Requests:822] 2014-11-20 11:11:16,501 BatchStatement.java (line 226) Batch of prepared statements for [data.ten_days] is of size 36823, exceeding specified threshold of 5120 by 31703.
WARN [Native-Transport-Requests:835] 2014-11-20 11:11:16,500 BatchStatement.java (line 226) Batch of prepared statements for [data.ten_days] is of size 36817, exceeding specified threshold of 5120 by 31697.
WARN [Native-Transport-Requests:781] 2014-11-20 11:11:16,501 BatchStatement.java (line 226) Batch of prepared statements for [data.ten_days] is of size 36817, exceeding specified threshold of 5120 by 31697.
WARN [Native-Transport-Requests:755] 2014-11-20 11:11:16,501 BatchStatement.java (line 226) Batch of prepared statements for [data.ten_days] is of size 36822, exceeding specified threshold of 5120 by 31702.
I know these are only warnings, but I would like to understand why my settings aren't working as expected. I can also see a lot of hints in my cluster. Could the batch size affect the number of hints in the cluster?
Thanks
You have set the batch size in rows instead of the batch size in bytes. This means the connector limits batches by the number of rows rather than by the memory size of the batch.
spark.cassandra.output.batch.size.rows: number of rows per single
batch; default is 'auto' which means the connector will adjust the
number of rows based on the amount of data in each row
spark.cassandra.output.batch.size.bytes: maximum total size of the
batch in bytes; defaults to 64 kB.
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/5_saving.md
More importantly, you are most likely going to be better off with a larger batch size (64 kB) and changing the warn limit in the cassandra.yaml file.
Edit:
Recently we've seen that larger batches can cause instability with certain C* configurations so lower the value if the system becomes unstable.
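The warnings in the question can be checked mechanically to see how far each batch overshoots the limit. The sketch below parses the BatchStatement warning format shown in the question. It assumes the reported sizes are in bytes, and that the 5120 in the warnings is the server-side warn limit (5 kB expressed in bytes) rather than the row count set in SparkConf; both are assumptions, and the function names are illustrative.

```python
import re

# Matches the BatchStatement warning format shown in the question.
BATCH_WARN = re.compile(
    r"Batch of prepared statements for \[(?P<table>[^\]]+)\] is of size (?P<size>\d+), "
    r"exceeding specified threshold of (?P<threshold>\d+) by (?P<excess>\d+)"
)

def parse_batch_warning(line):
    """Return table, size, threshold and excess from one warning line, or None."""
    m = BATCH_WARN.search(line)
    if m is None:
        return None
    return {
        "table": m.group("table"),
        "size": int(m.group("size")),            # assumed to be bytes
        "threshold": int(m.group("threshold")),  # assumed to be bytes
        "excess": int(m.group("excess")),
    }

def oversize_ratio(entry):
    """How many times larger than the threshold the batch was."""
    return entry["size"] / entry["threshold"]

if __name__ == "__main__":
    line = (
        "WARN [Native-Transport-Requests:761] 2014-11-20 11:11:16,499 BatchStatement.java "
        "(line 226) Batch of prepared statements for [data.ten_days] is of size 36825, "
        "exceeding specified threshold of 5120 by 31705."
    )
    entry = parse_batch_warning(line)
    print(entry["table"], entry["size"], f"{oversize_ratio(entry):.1f}x threshold")
```

Under those assumptions the batches in the question are about 7 times the warn limit while still well below a 64 kB batch size, which is consistent with the row-based setting not limiting the batch size in bytes.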

Cassandra nodetool repair freezes entire cluster

Need help understanding what's happening to Cassandra when attempting a nodetool repair on one of the column families in our keyspace.
We are running Cassandra 2.0.7 and have a table we use for indexing object data in our system.
CREATE TABLE ids_by_text (
object_type text,
field_name text,
ref_type text,
value text,
ref_id timeuuid,
PRIMARY KEY((object_type,field_name,ref_type),value,ref_id)
)
Rows can grow to be quite large. We have roughly 10 million objects in the database with an average of 4-6 fields that are indexing them via the table above. It doesn't seem like a lot to me.
When running nodetool repair, it will run for a bit and then hit a point where the following exception is thrown:
ERROR [AntiEntropySessions:8] 2014-07-06 16:47:48,863 RepairSession.java (line 286) [repair #5f37c2e0-052b-11e4-92f5-b9bfa38ef354] session completed with the following error
org.apache.cassandra.exceptions.RepairException: [repair #5f37c2e0-052b-11e4-92f5-b9bfa38ef354 on apps/ids_by_text, (-7683110849073497716,-7679039947314690170]] Sync failed between /10.0.2.166 and /10.0.2.163
at org.apache.cassandra.repair.RepairSession.syncComplete(RepairSession.java:207)
at org.apache.cassandra.service.ActiveRepairService.handleMessage(ActiveRepairService.java:236)
at org.apache.cassandra.repair.RepairMessageVerbHandler.doVerb(RepairMessageVerbHandler.java:59)
at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:60)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
INFO [ScheduledTasks:1] 2014-07-06 16:47:48,909 GCInspector.java (line 116) GC for ConcurrentMarkSweep: 66029 ms for 1 collections, 7898896176 used; max is 8547991552
INFO [GossipTasks:1] 2014-07-06 16:47:48,901 Gossiper.java (line 883) InetAddress /10.0.2.162 is now DOWN
INFO [GossipTasks:1] 2014-07-06 16:47:49,181 Gossiper.java (line 883) InetAddress /10.0.2.163 is now DOWN
INFO [GossipTasks:1] 2014-07-06 16:47:49,184 StreamResultFuture.java (line 186) [Stream #da84b3e1-052b-11e4-92f5-b9bfa38ef354] Session with /10.0.2.163 is complete
WARN [GossipTasks:1] 2014-07-06 16:47:49,186 StreamResultFuture.java (line 215) [Stream #da84b3e1-052b-11e4-92f5-b9bfa38ef354] Stream failed
INFO [GossipTasks:1] 2014-07-06 16:47:49,187 Gossiper.java (line 883) InetAddress /10.0.2.165 is now DOWN
INFO [GossipTasks:1] 2014-07-06 16:47:49,188 Gossiper.java (line 883) InetAddress /10.0.2.164 is now DOWN
INFO [GossipTasks:1] 2014-07-06 16:47:49,189 Gossiper.java (line 883) InetAddress /10.0.2.166 is now DOWN
INFO [GossipTasks:1] 2014-07-06 16:47:49,189 StreamResultFuture.java (line 186) [Stream #da84b3e0-052b-11e4-92f5-b9bfa38ef354] Session with /10.0.2.166 is complete
WARN [GossipTasks:1] 2014-07-06 16:47:49,189 StreamResultFuture.java (line 215) [Stream #da84b3e0-052b-11e4-92f5-b9bfa38ef354] Stream failed
At this point the other nodes become unresponsive and start throwing TPStatus logs. The system does not recover from this. We are dead.
I went through and ran 'nodetool scrub' on all of the nodes. That worked on most of them, some failed, so I used 'sstablescrub' on them. We wrote a script that did a subrange repair and I can identify the ranges that are problematic, but I haven't done enough testing to know if that is consistent or symptomatic. Testing is tough when it brings production down, so I have to be cautious.
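A subrange repair of the kind mentioned above can be sketched as follows. This is not the script from the question, only an illustration: it splits one token range into equal pieces and builds nodetool commands using the -st and -et options. It assumes the range does not wrap around the ring, it only prints the commands rather than running them, and you should check how your nodetool version parses negative token values.

```python
def split_token_range(start, end, parts):
    """Split (start, end] into `parts` contiguous subranges; assumes start < end (no wraparound)."""
    if start >= end:
        raise ValueError("start must be less than end")
    if parts < 1 or parts > end - start:
        raise ValueError("parts must be between 1 and the width of the range")
    width = end - start
    bounds = [start + (width * i) // parts for i in range(parts)] + [end]
    return list(zip(bounds[:-1], bounds[1:]))

def repair_commands(keyspace, table, start, end, parts):
    """Build one nodetool repair command per subrange (commands are returned, not executed)."""
    return [
        f"nodetool repair -st {s} -et {e} {keyspace} {table}"
        for s, e in split_token_range(start, end, parts)
    ]

if __name__ == "__main__":
    # Range taken from the RepairException above; the split into 4 parts is arbitrary.
    for cmd in repair_commands(
        "apps", "ids_by_text", -7683110849073497716, -7679039947314690170, 4
    ):
        print(cmd)
```

Running the subranges one at a time makes it easier to see which slice triggers the long GC, and a failing slice can be split again to narrow it down.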
Sidebar question... how do you stop a repair that is underway? If I can see things going sideways, I'd like to stop it.
Note that every other column family in the keyspace repairs just fine.
I am not sure what other detail to give. We have been beating our heads against this for a week and, well, we're stuck.
This issue (https://issues.apache.org/jira/browse/CASSANDRA-7330) may relate to the unresponsiveness after repair failure. It is fixed in the latest 2.0.9 version.
Regarding "how do you stop a repair that is underway?":
It is still a work in progress (https://issues.apache.org/jira/browse/CASSANDRA-3486).
You can stop a repair in 2.1.* as follows:
wget -q -O jmxterm.jar http://downloads.sourceforge.net/cyclops-group/jmxterm-1.0-alpha-4-uber.jar
java -jar ./jmxterm.jar
open localhost:7199 -u [optional username] -p [optional password]
bean org.apache.cassandra.db:type=StorageService
run forceTerminateAllRepairSessions
