Why does data corruption happen in Cassandra 1.2?

I dropped a column in Cassandra 1.2 a couple of days ago by:
1. dropping the whole table,
2. recreating the table without the column,
3. re-running the insert statements (without the column).
The reason I did it that way is that Cassandra 1.2 doesn't support the "drop column" operation.
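For concreteness, the workaround looked roughly like this (a sketch with hypothetical keyspace, table, and column names, not my actual schema):

-- hypothetical names; the real table has more columns
DROP TABLE my_ks.my_table;
CREATE TABLE my_ks.my_table (
    id text PRIMARY KEY,
    col_a text    -- the column being "dropped" is simply omitted from the new definition
);
INSERT INTO my_ks.my_table (id, col_a) VALUES ('row1', 'value1');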
Today the Ops team notified me of a data corruption issue.
My questions:
What is the root cause?
How do I fix it?
ERROR [ReadStage:79] 2014-11-04 11:29:55,021 CassandraDaemon.java (line 191) Exception in thread Thread[ReadStage:79,5,main]
org.apache.cassandra.io.sstable.CorruptSSTableException: org.apache.cassandra.db.ColumnSerializer$CorruptColumnException: invalid column name length 0 (/data/cassandra/data/xxx/yyy/zzz-Data.db, 1799885 bytes remaining)
at org.apache.cassandra.db.columniterator.SimpleSliceReader.computeNext(SimpleSliceReader.java:110)
at org.apache.cassandra.db.columniterator.SimpleSliceReader.computeNext(SimpleSliceReader.java:40)
at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
at org.apache.cassandra.db.columniterator.SSTableSliceIterator.hasNext(SSTableSliceIterator.java:90)
at org.apache.cassandra.db.filter.QueryFilter$2.getNext(QueryFilter.java:171)
at org.apache.cassandra.db.filter.QueryFilter$2.hasNext(QueryFilter.java:154)
at org.apache.cassandra.utils.MergeIterator$OneToOne.computeNext(MergeIterator.java:199)
at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
at org.apache.cassandra.db.filter.SliceQueryFilter.collectReducedColumns(SliceQueryFilter.java:160)
at org.apache.cassandra.db.filter.QueryFilter.collateColumns(QueryFilter.java:136)
at org.apache.cassandra.db.filter.QueryFilter.collateOnDiskAtom(QueryFilter.java:84)
at org.apache.cassandra.db.CollationController.collectAllData(CollationController.java:291)
at org.apache.cassandra.db.CollationController.getTopLevelColumns(CollationController.java:65)
at org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1398)
at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1214)
at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1130)
at org.apache.cassandra.db.Table.getRow(Table.java:344)
at org.apache.cassandra.db.SliceFromReadCommand.getRow(SliceFromReadCommand.java:70)
at org.apache.cassandra.db.ReadVerbHandler.doVerb(ReadVerbHandler.java:44)
at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:56)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Caused by: org.apache.cassandra.db.ColumnSerializer$CorruptColumnException: invalid column name length 0 (/data/cassandra/data/xxx/yyy/zzz-Data.db, 1799885 bytes remaining)
at org.apache.cassandra.db.ColumnSerializer$CorruptColumnException.create(ColumnSerializer.java:148)
at org.apache.cassandra.db.OnDiskAtom$Serializer.deserializeFromSSTable(OnDiskAtom.java:86)
at org.apache.cassandra.db.OnDiskAtom$Serializer.deserializeFromSSTable(OnDiskAtom.java:73)
at org.apache.cassandra.db.columniterator.SimpleSliceReader.computeNext(SimpleSliceReader.java:106)
... 24 more
ERROR [ReadStage:89] 2014-11-04 11:29:58,076 CassandraDaemon.java (line 191) Exception in thread Thread[ReadStage:89,5,main]
java.lang.OutOfMemoryError: Java heap space
at org.apache.cassandra.io.util.RandomAccessReader.readBytes(RandomAccessReader.java:376)
at org.apache.cassandra.utils.ByteBufferUtil.read(ByteBufferUtil.java:392)
at org.apache.cassandra.utils.ByteBufferUtil.readWithLength(ByteBufferUtil.java:355)
at org.apache.cassandra.db.ColumnSerializer.deserializeColumnBody(ColumnSerializer.java:108)
at org.apache.cassandra.db.OnDiskAtom$Serializer.deserializeFromSSTable(OnDiskAtom.java:92)
at org.apache.cassandra.db.OnDiskAtom$Serializer.deserializeFromSSTable(OnDiskAtom.java:73)
at org.apache.cassandra.db.columniterator.SimpleSliceReader.computeNext(SimpleSliceReader.java:106)
at org.apache.cassandra.db.columniterator.SimpleSliceReader.computeNext(SimpleSliceReader.java:40)
at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
at org.apache.cassandra.db.columniterator.SSTableSliceIterator.hasNext(SSTableSliceIterator.java:90)
at org.apache.cassandra.db.filter.QueryFilter$2.getNext(QueryFilter.java:171)
at org.apache.cassandra.db.filter.QueryFilter$2.hasNext(QueryFilter.java:154)
at org.apache.cassandra.utils.MergeIterator$OneToOne.computeNext(MergeIterator.java:199)

C* 1.2 supports column deletions for CQL tables - http://www.datastax.com/documentation/cql/3.0/cql/cql_using/use_delete.html
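For example, deleting a column's value from existing rows is just a CQL DELETE (hypothetical names, matching the sketch in the question):

DELETE col_a FROM my_ks.my_table WHERE id = 'row1';  -- removes the value, not the column definition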
However, I do not see anything wrong with the procedure you described for re-creating the table without the column. Here are some steps to go forward.
Assumptions -
The corruption you are seeing is in the new table, not the old one (do they have the same name?)
You have a replication factor and number of nodes high enough for you to be able to take this node offline
Your client's load balancing policy is set up appropriately, so that when the node goes down it will fail over to another node
Procedure -
1) Take your node offline
nodetool drain
This will flush memtables and make your node stop accepting requests.
2) Run nodetool scrub
nodetool scrub <keyspace> <table>
If this completes successfully, you are done. Bring your node back up by restarting Cassandra, then run nodetool repair <keyspace> <table>.
3) If scrub errored out (probably with a corruption error), try the sstablescrub utility. ssh into your box and run:
sstablescrub <keyspace> <table>
Note: run this as the same OS user you use to start Cassandra.
If this completes successfully, you are done. Bring your node back up by restarting Cassandra, then run nodetool repair <keyspace> <table>.
4) If this doesn't work (again errors out with a corruption error), you will have to remove the SSTable and rebuild it from your other replicas using repair:
mv the culprit sstable from your data directory to a backup directory (delete it later once it's rebuilt)
restart cassandra
nodetool repair <keyspace> <table> -- This repair will take time.
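A minimal sketch of the removal step, assuming the path from the log above and that zzz- is the sstable's file-name prefix (every component of that sstable generation must move together):

mkdir -p /backup/corrupt-sstables   # hypothetical backup location
mv /data/cassandra/data/xxx/yyy/zzz-* /backup/corrupt-sstables/   # -Data.db, -Index.db, -Filter.db, etc.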
Please let me know if you are able to reproduce this corruption.

Related

Cassandra rebuild getting halted

I have a Cassandra cluster with 18 prod nodes in DC1 and 12 backup nodes in the DC2 data center. A few days ago all the backup nodes went down and stayed down past the gc_grace period. Now I am trying to bring all the backup nodes back up, so I removed all data from the backup nodes and am trying to rebuild, but the rebuild halts with a FileNotFoundException:
The rebuild command is: nohup nodetool rebuild DC1 &
(DC1 is the prod data center)
Error in nohup.out file :
Error while rebuilding node: Stream failed
-- StackTrace --
java.lang.RuntimeException: Error while rebuilding node: Stream failed
at org.apache.cassandra.service.StorageService.rebuild(StorageService.java:1076)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
Error in system.log:
Caused by: java.util.concurrent.ExecutionException: java.lang.RuntimeException: java.io.FileNotFoundException: /data1/cassandra/data/system/compactions_in_progress-55080ab05d9c388690a4acb25fe1f77b/system-compactions_in_progress-tmp-ka-62-Data.db (No such file or directory)
at com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299) ~[guava-16.0.jar:na]
at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286) ~[guava-16.0.jar:na]
at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116) ~[guava-16.0.jar:na]
at org.apache.cassandra.utils.FBUtilities.waitOnFuture(FBUtilities.java:372) ~[apache-cassandra-2.1.16.jar:2.1.16]
... 12 common frames omitted
Caused by: java.lang.RuntimeException: java.io.FileNotFoundException: /data1/cassandra/data/system/compactions_in_progress-55080ab05d9c388690a4acb25fe1f77b/system-compactions_in_progress-tmp-ka-62-Data.db (No such file or directory)
at org.apache.cassandra.io.util.SequentialWriter.<init>(SequentialWriter.java:82) ~[apache-cassandra-2.1.16.jar:2.1.16]
at org.apache.cassandra.io.compress.CompressedSequentialWriter.<init>(CompressedSequentialWriter.java:67) ~[apache-cassandra-2.1.16.jar:2.1.16]
at org.apache.cassandra.io.util.SequentialWriter.open(SequentialWriter.java:124) ~[apache-cassandra-2.1.16.jar:2.1.16]
at org.apache.cassandra.io.sstable.SSTableWriter.<init>(SSTableWriter.java:130) ~[apache-cassandra-2.1.16.jar:2.1.16]
at org.apache.cassandra.db.Memtable$FlushRunnable.createFlushWriter(Memtable.java:414) ~[apache-cassandra-2.1.16.jar:2.1.16]
at org.apache.cassandra.db.Memtable$FlushRunnable.writeSortedContents(Memtable.java:351) ~[apache-cassandra-2.1.16.jar:2.1.16]
at org.apache.cassandra.db.Memtable$FlushRunnable.runMayThrow(Memtable.java:335) ~[apache-cassandra-2.1.16.jar:2.1.16]
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) ~[apache-cassandra-2.1.16.jar:2.1.16]
at com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297) ~[guava-16.0.jar:na]
at org.apache.cassandra.db.ColumnFamilyStore$Flush.run(ColumnFamilyStore.java:1134) ~[apache-cassandra-2.1.16.jar:2.1.16]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ~[na:1.7.0_79]
at java.util.concurrent.FutureTask.run(FutureTask.java:262) ~[na:1.7.0_79]
... 5 common frames omitted
Caused by: java.io.FileNotFoundException: /data1/cassandra/data/system/compactions_in_progress-55080ab05d9c388690a4acb25fe1f77b/system-compactions_in_progress-tmp-ka-62-Data.db (No such file or directory)
Your problem is not the FileNotFoundException itself. It's the fact that you are streaming system tables. System tables are created locally on the node when it starts up; all the data should be streamed EXCEPT the system tables' data (/data1/cassandra/data/system/).
Which Cassandra version are you using?
If you didn't change anything that forced Cassandra to stream the system tables, I would say this is a bug.
While you triggered the rebuild in DC2, there were compactions in progress in DC1. You can issue the following command on all nodes of DC1 to see the compactions in progress:
nodetool compactionstats
As part of compaction, sstables are merged together, and the tmp "compactions_in_progress" tables disappear once the merge completes. So the streaming of those temp tables gets lost along the way from DC1 to DC2, resulting in this streaming failure.
These compactions could also be triggered by a "nodetool repair" initiated in DC1, so wait for any in-progress repairs to complete to avoid this situation.
Since DC1 has 18 nodes, I believe the storage size of the cluster is huge. A cleaner way to get around this situation is to pause compaction for the duration of the rebuild and rebuild one keyspace at a time. So rather than rebuilding the entire cluster with
nohup nodetool rebuild DC1 &
Issue the following command in DC1
nodetool disableautocompaction keyspace-name1
Then rebuild that keyspace in DC2, one node at a time
nohup nodetool rebuild keyspace-name1 DC1 &
Once rebuild is complete in all nodes in DC2 for that keyspace
nodetool enableautocompaction keyspace-name1
Repeat the above two steps for all the keyspaces until done. You can skip system tables like "system", which are local to each node and get rebuilt automatically as you bring the node up (even with an empty data directory).
If there are too many application keyspaces to deal with, it becomes a little bit of manual work; a sketch of how this could be scripted follows.
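A minimal shell sketch of that loop, assuming passwordless ssh, placeholder host names, and the same nodetool command forms given above (untested, illustrative only):

#!/bin/sh
# Hypothetical host lists - replace with your real nodes.
DC1_NODES="dc1-n1 dc1-n2 dc1-n3"
DC2_NODES="dc2-n1 dc2-n2 dc2-n3"
KEYSPACES="app_ks1 app_ks2"   # application keyspaces only; skip "system"

for ks in $KEYSPACES; do
    # Pause compaction for this keyspace on every DC1 node.
    for h in $DC1_NODES; do ssh "$h" nodetool disableautocompaction "$ks"; done
    # Rebuild this keyspace in DC2, one node at a time.
    for h in $DC2_NODES; do ssh "$h" nodetool rebuild "$ks" DC1; done
    # Re-enable compaction once the keyspace is rebuilt everywhere.
    for h in $DC1_NODES; do ssh "$h" nodetool enableautocompaction "$ks"; done
done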

Cassandra SStableLoader Streaming Error Broken pipe

I'm trying to use sstableloader to migrate tables from a Cassandra 2.1 cluster to a Cassandra 3.11 cluster, and while some SSTables are loaded successfully, I keep hitting weird errors with others.
When I repeatedly try to load the same SSTables, sometimes I get a generic
java.util.concurrent.ExecutionException:
org.apache.cassandra.streaming.StreamException: Stream failed
Other times
org.apache.cassandra.io.FSReadError: java.io.IOException: Broken pipe
However, if I check in system.log I always find this error:
java.lang.IllegalArgumentException: No column name component found in
cell name
I tried searching for the error above but did not find any helpful details.
I tried nodetool repair on the tables in the 2.x cluster, but after a couple of hours it looks like it's still not done. I would try nodetool scrub, but I'm not sure whether that would result in data loss.
The problem is that the sstable format changed in Cassandra 3.0, so you can't just stream sstables in the 2.1 format to a 3.x node.
The official (and better) way is to first upgrade your node(s) to a newer C* version and then run nodetool upgradesstables -a.
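As a rough sketch, the sequence after the version upgrade would look like this (hypothetical keyspace, table, target address, and data path; upgradesstables -a rewrites the files into the current format, after which sstableloader can stream them):

nodetool upgradesstables -a my_ks my_table
sstableloader -d 10.0.0.1 /var/lib/cassandra/data/my_ks/my_table-<table-id>/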

cassandra 3.9 flush fails

We have a 5-node Cassandra cluster running Cassandra 3.9. We have a keyspace "ks" and a table "cf". We created several indexes on the table, like "cf_c1_idx", "cf_c1_idx_1", and "cf_c2_idx".
When I do a nodetool flush, the flush of one of the index files fails with the following exception:
-- StackTrace --
java.lang.RuntimeException: Last written key DecoratedKey(4dd1d75b-e52f-6c49-e7cd-c52a968e70de, 4dd1d75be52f6c49e7cdc52a968e70de) >= current key DecoratedKey(00000000-0000-0000-0000-000000000000, 5331cc31ae396031e6be66312c89c379) writing into /var/lib/cassandra/data/ks/cf-8d8b1ba0081c11e7a4206f8b05d669ae/.cf_c1_idx_1/mc-401-big-Data.db
at org.apache.cassandra.io.sstable.format.big.BigTableWriter.beforeAppend(BigTableWriter.java:122)
at org.apache.cassandra.io.sstable.format.big.BigTableWriter.append(BigTableWriter.java:161)
at org.apache.cassandra.io.sstable.SimpleSSTableMultiWriter.append(SimpleSSTableMultiWriter.java:48)
at org.apache.cassandra.db.Memtable$FlushRunnable.writeSortedContents(Memtable.java:458)
at org.apache.cassandra.db.Memtable$FlushRunnable.call(Memtable.java:493)
at org.apache.cassandra.db.Memtable$FlushRunnable.call(Memtable.java:380)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
When I run nodetool flush again after a few seconds, it succeeds without a hitch. We also sometimes notice the same exception during commitlog replay after restarting a node. We end up deleting the commitlog directory so Cassandra can start, and then run a repair to sync the data that was lost. Is this happening because the secondary indexes are not getting updated in time? Also, this is a read-intensive cluster.

Cassandra 3 Repair never finishes

We have a cluster with 6 nodes in two datacenters (3 nodes each). We start a repair on one node, and shortly afterwards we find something like this in the logs:
ERROR [Repair#1:1] 2016-05-31 01:33:28,075 CassandraDaemon.java:195 - Exception in thread Thread[Repair#1:1,5,RMI Runtime]
com.google.common.util.concurrent.UncheckedExecutionException: org.apache.cassandra.exceptions.RepairException: [repair #e8e21070-26be-11e6-aae8-77b20cefeee5 on ..... Validation failed in /xx.xxx.xx.xx
at com.google.common.util.concurrent.Futures.wrapAndThrowUnchecked(Futures.java:1525) ~[guava-18.0.jar:na]
at com.google.common.util.concurrent.Futures.getUnchecked(Futures.java:1511) ~[guava-18.0.jar:na]
at org.apache.cassandra.repair.RepairJob.run(RepairJob.java:162) ~[apache-cassandra-3.0.4.jar:3.0.4]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_77]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[na:1.8.0_77]
at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_77]
Afterwards nothing seems to happen anymore. We did not interrupt the repair for several days, but still nothing happened. We also tried it on two different clusters with the same result.
After searching the web we stumbled upon https://support.datastax.com/hc/en-us/articles/205256895--Validation-failed-when-running-a-nodetool-repair. It says that we should run "nodetool scrub" and, if that does not help, "sstablescrub".
We tried nodetool scrub, but the repair still does not work. We have now started an sstablescrub, but it seems to take forever. It uses only one CPU at 100%, and the data and index files are growing, but it has now been running for over a day and the file is still only 1.2 GB.
Is it normal for "sstablescrub" to be so slow?
The cluster has been running for some time, and we missed the GCGraceSeconds window for the repair. Might that cause the repair not to work?
We currently do not know how to get the repair running; we hope someone can help.
What the exception indicates is that the node was not able to receive the results of the merkle tree computation that was supposed to happen on /xx.xxx.xx.xx. Please check the logs of that node instead. The node you started the repair on is likely fine and does not require sstable scrubbing.
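As a starting point, grepping that replica's log around the repair time should surface the underlying validation failure (hypothetical log location; adjust to your install):

grep -iE 'error|validation' /var/log/cassandra/system.log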

Decommission cassandra node times out with "received only 0 responses"

When I try to decommission a node in my Cassandra cluster, the process starts (I see active streams flowing from the node being decommissioned to the other nodes in the cluster (using vnodes)), but then after a little delay nodetool decommission exits with the following error message.
I can repeatedly run nodetool decommission and it will start streaming data to other nodes, but so far it always exits with the error below.
Why am I seeing this, and is there a way I can safely decommission this node?
Exception in thread "main" java.lang.RuntimeException: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
at org.apache.cassandra.db.HintedHandOffManager.getHintsSlice(HintedHandOffManager.java:578)
at org.apache.cassandra.db.HintedHandOffManager.listEndpointsPendingHints(HintedHandOffManager.java:528)
at org.apache.cassandra.service.StorageService.streamHints(StorageService.java:2854)
at org.apache.cassandra.service.StorageService.unbootstrap(StorageService.java:2834)
at org.apache.cassandra.service.StorageService.decommission(StorageService.java:2795)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:93)
at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:27)
at com.sun.jmx.mbeanserver.MBeanIntrospector.invokeM(MBeanIntrospector.java:208)
at com.sun.jmx.mbeanserver.PerInterface.invoke(PerInterface.java:120)
at com.sun.jmx.mbeanserver.MBeanSupport.invoke(MBeanSupport.java:262)
at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.invoke(DefaultMBeanServerInterceptor.java:836)
at com.sun.jmx.mbeanserver.JmxMBeanServer.invoke(JmxMBeanServer.java:761)
at javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1454)
at javax.management.remote.rmi.RMIConnectionImpl.access$300(RMIConnectionImpl.java:74)
at javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIConnectionImpl.java:1295)
at javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMIConnectionImpl.java:1387)
at javax.management.remote.rmi.RMIConnectionImpl.invoke(RMIConnectionImpl.java:818)
at sun.reflect.GeneratedMethodAccessor28.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:303)
at sun.rmi.transport.Transport$1.run(Transport.java:159)
at java.security.AccessController.doPrivileged(Native Method)
at sun.rmi.transport.Transport.serviceCall(Transport.java:155)
at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:535)
at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:790)
at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:649)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
at org.apache.cassandra.service.ReadCallback.get(ReadCallback.java:100)
at org.apache.cassandra.service.StorageProxy.getRangeSlice(StorageProxy.java:1213)
at org.apache.cassandra.db.HintedHandOffManager.getHintsSlice(HintedHandOffManager.java:573)
... 33 more
The hinted handoff manager is checking for hints to see if it needs to pass those off during the decommission, so that the hints don't get lost. You most likely have a lot of hints, or a bunch of tombstones, or something in the table causing the query to time out. You aren't seeing any other exceptions in your logs before the timeout, are you? Raising the read timeout period on your nodes before you decommission them, or manually deleting the hints CF, should most likely get you past this. If you delete them, you would then want to make sure you run a full cluster repair when you are done with all of your decommissions, to propagate data from any hints you deleted.
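A sketch of those two workarounds (values are illustrative; the yaml option is read_request_timeout_in_ms on C* 1.2 and later, rpc_timeout_in_ms on older releases):

# cassandra.yaml - raise the read timeout before decommissioning (restart required)
read_request_timeout_in_ms: 30000

# or, from cqlsh, drop the stored hints entirely (then repair after all decommissions)
TRUNCATE system.hints;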
The short answer is that the node I was trying to decommission was underpowered for the amount of data it held. As of this writing, there seems to be a reasonable hard minimum of resources needed to handle nodes with arbitrary amounts of data, somewhere in the neighborhood of what an AWS i2.2xlarge provides. In particular, the old m1 instances let you get into trouble by allowing you to store far more data on each node than the available memory and compute resources can support.
