Cassandra 3 Repair never finishes

We have a cluster with 6 nodes in two datacenters (3 nodes each). We are starting a repair on one node, and shortly afterwards we can find something like this in the logs:
ERROR [Repair#1:1] 2016-05-31 01:33:28,075 CassandraDaemon.java:195 - Exception in thread Thread[Repair#1:1,5,RMI Runtime]
com.google.common.util.concurrent.UncheckedExecutionException: org.apache.cassandra.exceptions.RepairException: [repair #e8e21070-26be-11e6-aae8-77b20cefeee5 on ..... Validation failed in /xx.xxx.xx.xx
at com.google.common.util.concurrent.Futures.wrapAndThrowUnchecked(Futures.java:1525) ~[guava-18.0.jar:na]
at com.google.common.util.concurrent.Futures.getUnchecked(Futures.java:1511) ~[guava-18.0.jar:na]
at org.apache.cassandra.repair.RepairJob.run(RepairJob.java:162) ~[apache-cassandra-3.0.4.jar:3.0.4]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_77]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[na:1.8.0_77]
at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_77]
Afterwards, nothing seems to happen anymore. We did not interrupt the repair for several days, but still nothing happens. We also tried it on two different clusters with the same result.
After searching the web we stumbled upon https://support.datastax.com/hc/en-us/articles/205256895--Validation-failed-when-running-a-nodetool-repair. It says that we should run "nodetool scrub" and, if that does not help, "sstablescrub".
We tried nodetool scrub, but the repair still does not work. We have now started sstablescrub, but it seems to take forever. It uses only one CPU at 100%, and the data and index files are growing, but it has now been running for over a day and the file is only 1.2 GB in size.
Is it normal that "sstablescrub" is so slow?
The cluster has already been running for some time, and we missed repairing within GCGraceSeconds. Might that be why the repair is not working?
We currently do not know how to get the repair running and hope someone can help.

What the exception indicates is that the node was not able to receive the results of the Merkle tree computation that was supposed to happen on /xx.xxx.xx.xx. Please check the logs on that node instead. The node you started the repair run on is likely fine and does not require sstable scrubbing.
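A hedged way to check, assuming a packaged install with the default log location (adjust the path to your setup); run this on the node that failed validation and look for errors around the time of the repair:
grep -iE 'validation|error' /var/log/cassandra/system.log | tail -n 100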

Related

Databricks notebook crashes on memory-intensive job

I am running a few operations to aggregate a large quantity of data (about 600 GB) on Azure Databricks. I noticed recently that the notebook crashes and Databricks returns the error below. The same code worked before on a smaller 6-node cluster. After upgrading it to 12 nodes, I started getting this error, and I suspect it is a configuration problem.
Any help please? I use the default Spark configuration with 200 partitions and 88 executors on my nodes.
Thanks
Internal error, sorry. Attach your notebook to a different cluster or restart the current cluster.
java.lang.RuntimeException: abort: DriverClient destroyed
at com.databricks.backend.daemon.driver.DriverClient.$anonfun$poll$3(DriverClient.scala:381)
at scala.concurrent.Future.$anonfun$flatMap$1(Future.scala:307)
at scala.concurrent.impl.Promise.$anonfun$transformWith$1(Promise.scala:41)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
at com.databricks.threading.NamedExecutor$$anon$2.$anonfun$run$1(NamedExecutor.scala:335)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.logging.UsageLogging.$anonfun$withAttributionContext$1(UsageLogging.scala:238)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
at com.databricks.logging.UsageLogging.withAttributionContext(UsageLogging.scala:233)
at com.databricks.logging.UsageLogging.withAttributionContext$(UsageLogging.scala:230)
at com.databricks.threading.NamedExecutor.withAttributionContext(NamedExecutor.scala:265)
at com.databricks.threading.NamedExecutor$$anon$2.run(NamedExecutor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
I'm not sure about the cost implications, but how about enabling the autoscaling option on the cluster and bumping up Max Workers? You can also try changing the Worker Type to one with better resources.
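If you go the autoscaling route, the relevant part of the cluster spec looks roughly like the sketch below (the worker counts and node type are purely illustrative, not recommendations):
"autoscale": {
    "min_workers": 4,
    "max_workers": 16
},
"node_type_id": "Standard_DS4_v2"
The same values can be set from the cluster edit page in the Databricks UI.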
Just for other people facing a similar issue.
In my situation, the same error sometimes happened when there were multiple Spark actions in one cell of a Databricks notebook.
Surprisingly, splitting the cell before the code where the error occurred, or simply inserting time.sleep(5) there, worked for me. However, I'm not sure why it worked...
For example:
import time                 # needed for time.sleep

df1.count()                 # some Spark action
time.sleep(5)               # or split the cell here instead
pipeline.fit(df1)           # another Spark action where the error happened

Cassandra 3.9 flush fails

We have a 5-node Cassandra cluster running Cassandra 3.9. We have a keyspace "ks" and a table "cf". We created several indexes on the table, such as "cf_c1_idx", "cf_c1_idx_1", and "cf_c2_idx".
When I do a nodetool flush, the flush of one of the index files fails with the following exception:
-- StackTrace --
java.lang.RuntimeException: Last written key DecoratedKey(4dd1d75b-e52f-6c49-e7cd-c52a968e70de, 4dd1d75be52f6c49e7cdc52a968e70de) >= current key DecoratedKey(00000000-0000-0000-0000-000000000000, 5331cc31ae396031e6be66312c89c379) writing into /var/lib/cassandra/data/ks/cf-8d8b1ba0081c11e7a4206f8b05d669ae/.cf_c1_idx_1/mc-401-big-Data.db
at org.apache.cassandra.io.sstable.format.big.BigTableWriter.beforeAppend(BigTableWriter.java:122)
at org.apache.cassandra.io.sstable.format.big.BigTableWriter.append(BigTableWriter.java:161)
at org.apache.cassandra.io.sstable.SimpleSSTableMultiWriter.append(SimpleSSTableMultiWriter.java:48)
at org.apache.cassandra.db.Memtable$FlushRunnable.writeSortedContents(Memtable.java:458)
at org.apache.cassandra.db.Memtable$FlushRunnable.call(Memtable.java:493)
at org.apache.cassandra.db.Memtable$FlushRunnable.call(Memtable.java:380)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
When I run nodetool flush again after a few seconds, it succeeds without a hitch. We also sometimes notice the same exception showing up during commitlog replay after restarting a node. We end up deleting the commitlog directory so Cassandra can start, and then run a repair to sync the data that was lost. Is this happening because the secondary indexes are not getting updated in time? Also, this is a read-intensive cluster.
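For reference, the recovery procedure described above looks roughly like this, assuming a packaged install and the default commitlog_directory (paths, service name, and keyspace/table names may differ on your systems):
sudo service cassandra stop
mv /var/lib/cassandra/commitlog /var/lib/cassandra/commitlog.bak    # default commitlog_directory from cassandra.yaml
sudo service cassandra start
nodetool repair ks cf    # re-sync the data lost with the discarded commitlog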

Elasticsearch write using Spark

I'm creating a document collection in Spark as an RDD and writing it to Elasticsearch with the Elasticsearch Spark read/write library. The cluster that creates the collection is large, so when it writes to ES I get the errors below indicating ES is overloaded, which does not surprise me. This does not seem to fail the job: the tasks may be retried and eventually succeed, and in the Spark GUI the job is reported as having finished successfully.
Is there a way to somehow throttle the ES writing lib to avoid the retries (I can't change the cluster size)?
Do these errors mean that some data was not written to the index?
Here is one of many reported task failure errors, but again no job failure is reported:
2017-03-20 10:48:27,745 WARN org.apache.spark.scheduler.TaskSetManager [task-result-getter-2] - Lost task 568.1 in stage 81.0 (TID 18982, ip-172-16-2-76.ec2.internal): org.apache.spark.util.TaskCompletionListenerException: Could not write all entries [41/87360] (maybe ES was overloaded?). Bailing out...
at org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:112)
at org.apache.spark.scheduler.Task.run(Task.scala:102)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
The lib I'm using is
org.elasticsearch" % "elasticsearch-spark_2.10" % "2.1.2"
Can you follow this link: https://www.elastic.co/guide/en/elasticsearch/hadoop/current/spark.html
In the Spark conf, or in your Elasticsearch connector properties, you need to increase the maximum number of records that can be sent in a single bulk request, and that should solve your problem.
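The knobs elasticsearch-hadoop exposes for this are the es.batch.* settings. A hedged sketch in Scala (the node address, index name, and the values themselves are illustrative, not tuned recommendations):
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._                // adds saveToEs to RDDs

val conf = new SparkConf()
  .setAppName("es-write-throttled")
  .set("es.nodes", "es-host")                   // illustrative ES endpoint
  .set("es.batch.size.entries", "500")          // docs per bulk request
  .set("es.batch.size.bytes", "1mb")            // size cap per bulk request
  .set("es.batch.write.retry.count", "6")       // retries before "Bailing out..."
  .set("es.batch.write.retry.wait", "30s")      // back-off between retries

val sc = new SparkContext(conf)
val docs = sc.parallelize(Seq(Map("id" -> 1, "body" -> "example document")))
docs.saveToEs("myindex/mytype")                 // illustrative index/type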

Why does data corruption happen in Cassandra 1.2?

I dropped a column in Cassandra 1.2 a couple of days ago by:
1. dropping the whole table,
2. recreating the table without the column,
3. re-running the insert statements (without the column).
The reason I did it that way is that Cassandra 1.2 doesn't support the "drop column" operation.
Today I was notified by the Ops team about a data corruption issue.
My questions:
What is the root cause?
How to fix it?
ERROR [ReadStage:79] 2014-11-04 11:29:55,021 CassandraDaemon.java (line 191) Exception in thread Thread[ReadStage:79,5,main]
org.apache.cassandra.io.sstable.CorruptSSTableException: org.apache.cassandra.db.ColumnSerializer$CorruptColumnException: invalid column name length 0 (/data/cassandra/data/xxx/yyy/zzz-Data.db, 1799885 bytes remaining)
at org.apache.cassandra.db.columniterator.SimpleSliceReader.computeNext(SimpleSliceReader.java:110)
at org.apache.cassandra.db.columniterator.SimpleSliceReader.computeNext(SimpleSliceReader.java:40)
at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
at org.apache.cassandra.db.columniterator.SSTableSliceIterator.hasNext(SSTableSliceIterator.java:90)
at org.apache.cassandra.db.filter.QueryFilter$2.getNext(QueryFilter.java:171)
at org.apache.cassandra.db.filter.QueryFilter$2.hasNext(QueryFilter.java:154)
at org.apache.cassandra.utils.MergeIterator$OneToOne.computeNext(MergeIterator.java:199)
at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
at org.apache.cassandra.db.filter.SliceQueryFilter.collectReducedColumns(SliceQueryFilter.java:160)
at org.apache.cassandra.db.filter.QueryFilter.collateColumns(QueryFilter.java:136)
at org.apache.cassandra.db.filter.QueryFilter.collateOnDiskAtom(QueryFilter.java:84)
at org.apache.cassandra.db.CollationController.collectAllData(CollationController.java:291)
at org.apache.cassandra.db.CollationController.getTopLevelColumns(CollationController.java:65)
at org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1398)
at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1214)
at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1130)
at org.apache.cassandra.db.Table.getRow(Table.java:344)
at org.apache.cassandra.db.SliceFromReadCommand.getRow(SliceFromReadCommand.java:70)
at org.apache.cassandra.db.ReadVerbHandler.doVerb(ReadVerbHandler.java:44)
at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:56)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Caused by: org.apache.cassandra.db.ColumnSerializer$CorruptColumnException: invalid column name length 0 (/data/cassandra/data/xxx/yyy/zzz-Data.db, 1799885 bytes remaining)
at org.apache.cassandra.db.ColumnSerializer$CorruptColumnException.create(ColumnSerializer.java:148)
at org.apache.cassandra.db.OnDiskAtom$Serializer.deserializeFromSSTable(OnDiskAtom.java:86)
at org.apache.cassandra.db.OnDiskAtom$Serializer.deserializeFromSSTable(OnDiskAtom.java:73)
at org.apache.cassandra.db.columniterator.SimpleSliceReader.computeNext(SimpleSliceReader.java:106)
... 24 more
ERROR [ReadStage:89] 2014-11-04 11:29:58,076 CassandraDaemon.java (line 191) Exception in thread Thread[ReadStage:89,5,main]
java.lang.OutOfMemoryError: Java heap space
at org.apache.cassandra.io.util.RandomAccessReader.readBytes(RandomAccessReader.java:376)
at org.apache.cassandra.utils.ByteBufferUtil.read(ByteBufferUtil.java:392)
at org.apache.cassandra.utils.ByteBufferUtil.readWithLength(ByteBufferUtil.java:355)
at org.apache.cassandra.db.ColumnSerializer.deserializeColumnBody(ColumnSerializer.java:108)
at org.apache.cassandra.db.OnDiskAtom$Serializer.deserializeFromSSTable(OnDiskAtom.java:92)
at org.apache.cassandra.db.OnDiskAtom$Serializer.deserializeFromSSTable(OnDiskAtom.java:73)
at org.apache.cassandra.db.columniterator.SimpleSliceReader.computeNext(SimpleSliceReader.java:106)
at org.apache.cassandra.db.columniterator.SimpleSliceReader.computeNext(SimpleSliceReader.java:40)
at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
at org.apache.cassandra.db.columniterator.SSTableSliceIterator.hasNext(SSTableSliceIterator.java:90)
at org.apache.cassandra.db.filter.QueryFilter$2.getNext(QueryFilter.java:171)
at org.apache.cassandra.db.filter.QueryFilter$2.hasNext(QueryFilter.java:154)
at org.apache.cassandra.utils.MergeIterator$OneToOne.computeNext(MergeIterator.java:199)
C* 1.2 supports column deletions for cql tables - http://www.datastax.com/documentation/cql/3.0/cql/cql_using/use_delete.html
However, I do not see anything wrong with the procedure you described to re-create the table without your column. Here are some steps to move forward.
Assumptions -
The corruption you are seeing is in the new table, not the old one (do they have the same name?)
You have a replication factor and number of nodes that are high enough for you to be able to take this node offline
Your client's load balancing policy is set up appropriately so that when the node goes down it will fail over to another node
Procedure -
1) Take your node offline
nodetool drain
This will flush memtables and make your node stop accepting requests.
2) Run nodetool scrub
nodetool scrub [keyspace][table]
If this completes successfully then you are done: bring your node back up by restarting Cassandra and run nodetool repair <keyspace> <table>.
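For example, assuming a packaged install (substitute your own keyspace and table):
sudo service cassandra start
nodetool repair <keyspace> <table>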
3) If scrub errored out (probably with a corruption error), try the sstablescrub utility. ssh into your box and run:
sstablescrub <keyspace> <table>
Note: run this using the same OS user you use to start Cassandra.
If this completes successfully then you are done: bring your node back up by restarting Cassandra and run nodetool repair <keyspace> <table>, as in step 2.
4) If this doesn't work (it again errors out with a corruption error), you will have to remove the SSTable and rebuild it from your other replicas using repair (see the sketch after this step):
mv the culprit sstable from your data directory to a backup directory
restart cassandra
(delete it later once it's rebuilt)
nodetool repair keyspace cf -- This repair will take time.
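A hedged sketch of this step for a Cassandra 1.2 layout (the data path, sstable generation number, and service name are placeholders; move every component file of the offending sstable, not just the -Data.db file):
sudo service cassandra stop
mkdir -p /var/lib/cassandra/backup
mv /var/lib/cassandra/data/<keyspace>/<cf>/<keyspace>-<cf>-ic-1234-* /var/lib/cassandra/backup/
sudo service cassandra start
nodetool repair <keyspace> <cf>    # rebuilds the moved data from the other replicas; this will take time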
Please let me know if you are able to reproduce this corruption.

Decommission cassandra node times out with "received only 0 responses"

When I try to decommission a node in my Cassandra cluster, the process starts (I see active streams flowing from the node being decommissioned to the other nodes in the cluster (using vnodes)), but then after a little delay nodetool decommission exits with the following error message.
I can repeatedly run nodetool decommission and it will start streaming data to other nodes, but so far it always exits with the below error.
Why am I seeing this, and is there a way I can safely decommission this node?
Exception in thread "main" java.lang.RuntimeException: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
at org.apache.cassandra.db.HintedHandOffManager.getHintsSlice(HintedHandOffManager.java:578)
at org.apache.cassandra.db.HintedHandOffManager.listEndpointsPendingHints(HintedHandOffManager.java:528)
at org.apache.cassandra.service.StorageService.streamHints(StorageService.java:2854)
at org.apache.cassandra.service.StorageService.unbootstrap(StorageService.java:2834)
at org.apache.cassandra.service.StorageService.decommission(StorageService.java:2795)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:93)
at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:27)
at com.sun.jmx.mbeanserver.MBeanIntrospector.invokeM(MBeanIntrospector.java:208)
at com.sun.jmx.mbeanserver.PerInterface.invoke(PerInterface.java:120)
at com.sun.jmx.mbeanserver.MBeanSupport.invoke(MBeanSupport.java:262)
at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.invoke(DefaultMBeanServerInterceptor.java:836)
at com.sun.jmx.mbeanserver.JmxMBeanServer.invoke(JmxMBeanServer.java:761)
at javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1454)
at javax.management.remote.rmi.RMIConnectionImpl.access$300(RMIConnectionImpl.java:74)
at javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIConnectionImpl.java:1295)
at javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMIConnectionImpl.java:1387)
at javax.management.remote.rmi.RMIConnectionImpl.invoke(RMIConnectionImpl.java:818)
at sun.reflect.GeneratedMethodAccessor28.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:303)
at sun.rmi.transport.Transport$1.run(Transport.java:159)
at java.security.AccessController.doPrivileged(Native Method)
at sun.rmi.transport.Transport.serviceCall(Transport.java:155)
at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:535)
at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:790)
at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:649)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
at org.apache.cassandra.service.ReadCallback.get(ReadCallback.java:100)
at org.apache.cassandra.service.StorageProxy.getRangeSlice(StorageProxy.java:1213)
at org.apache.cassandra.db.HintedHandOffManager.getHintsSlice(HintedHandOffManager.java:573)
... 33 more
The hinted handoff manager is checking for hints to see if it needs to pass those off during the decommission so that the hints don't get lost. You most likely have a lot of hints, a bunch of tombstones, or something else in the table causing the query to time out. You aren't seeing any other exceptions in your logs before the timeout, are you? Raising the read timeout period on your nodes before you decommission them, or manually deleting the hints CF, should most likely get you past this. If you delete them, you would then want to make sure you run a full cluster repair when you are done with all of your decommissions, to propagate data from any hints you deleted.
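A hedged sketch of those two workarounds (the timeout value is illustrative):
In cassandra.yaml, temporarily raise the request timeouts before decommissioning, e.g.:
read_request_timeout_in_ms: 30000
range_request_timeout_in_ms: 30000
Or delete the stored hints from cqlsh:
TRUNCATE system.hints;
Then, once all decommissions are done, run nodetool repair on each remaining node.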
The short answer is that the node I was trying to decommission was underpowered for the amount of data it held. As of this writing there seems to be a reasonable hard minimum of resources needed to handle nodes with arbitrary amounts of data, somewhere in the neighborhood of what an AWS i2.2xlarge provides. In particular, the old m1 instances let you get into trouble by allowing you to store far more data on each node than the available memory and compute resources can support.
