I am using Cassandra 2.1.5 and Cassandra Java driver 2.0.10. I am getting the exception below when fetching data from a Cassandra table.
com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /127.0.0.1:9042 (com.datastax.driver.core.TransportException: [/127.0.0.1:9042] Connection has been closed))
at com.datastax.driver.core.exceptions.NoHostAvailableException.copy(NoHostAvailableException.java:84)
at com.datastax.driver.core.DefaultResultSetFuture.extractCauseFromExecutionException(DefaultResultSetFuture.java:265)
at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:179)
at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:52)
at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:36)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at com.datastax.spark.connector.cql.SessionProxy.invoke(SessionProxy.scala:33)
at com.sun.proxy.$Proxy8.execute(Unknown Source)
at com.exportstagging.SparkTest.ProductDataLoader.dbQuery(ProductDataLoader.java:418)
at com.exportstagging.SparkTest.ProductDataLoader.main(ProductDataLoader.java:442)
Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /127.0.0.1:9042 (com.datastax.driver.core.TransportException: [/127.0.0.1:9042] Connection has been closed))
at com.datastax.driver.core.RequestHandler.reportNoMoreHosts(RequestHandler.java:216)
at com.datastax.driver.core.RequestHandler.access$900(RequestHandler.java:45)
at com.datastax.driver.core.RequestHandler$SpeculativeExecution.sendRequest(RequestHandler.java:276)
at com.datastax.driver.core.RequestHandler$SpeculativeExecution$1.run(RequestHandler.java:374)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
My Cassandra table has 50,000 columns and 380,000 rows. When I fire this query, I get the above error:
SELECT * FROM mykeyspace.productdata WHERE id IN (1,...,6000);
I have used the token function to create batches for fetching data from Cassandra.
Trying to fetch 6,000 partitions at once, on 380,000 rows in a table with 50,000 columns, seems like total overkill.
Multi-partition selects should be done with asynchronous queries, one query per partition (see the sketch below).
Having 50,000 columns in a table looks like a data modeling problem. What is your use case?
Fetching 380,000 rows at once shouldn't be necessary (and will take a very long time anyway); it looks like an analytical query, which would be far better handled through Spark.
By trying to pull so much data at once, you are most likely triggering an OOM on your Cassandra node, which would explain the "Connection has been closed" message.
My advice is to review your model, split the load with async queries, and use proper paging.
If you want to crunch that much data at once, do it with Spark through batch processing, write the result to another Cassandra table, and access that result table through smaller, faster interactive queries.
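As a rough illustration of the async, one-query-per-partition approach: a minimal sketch assuming id is an int partition key. The table name comes from the question; the contact point, fetch size and everything else is illustrative.

import com.datastax.driver.core.BoundStatement;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSetFuture;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import java.util.ArrayList;
import java.util.List;

public class AsyncPartitionFetch {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect();

        // Prepare once, then bind per partition key instead of one giant IN (...).
        PreparedStatement ps =
                session.prepare("SELECT * FROM mykeyspace.productdata WHERE id = ?");

        List<ResultSetFuture> futures = new ArrayList<ResultSetFuture>();
        for (int id = 1; id <= 6000; id++) {
            BoundStatement bound = ps.bind(id);
            bound.setFetchSize(500); // page through wide rows instead of loading them whole
            futures.add(session.executeAsync(bound));
        }

        // Consume the results; each future corresponds to exactly one partition.
        for (ResultSetFuture future : futures) {
            for (Row row : future.getUninterruptibly()) {
                // process the row here
            }
        }

        cluster.close();
    }
}

In practice you would also drain the futures in chunks of a few hundred so the number of in-flight requests stays bounded and the coordinator is not flooded.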
Related
I have been doing a left join on two tables in SQL (table A contains billions of rows and table B contains millions of rows) and creating a new table (table C) from the result of the join. I am using an XXL SQL Warehouse on Databricks with two workers to handle the compute for this operation.
The main issue is that when I schedule this job, it fails every time, whereas when I run it manually the job finishes almost every time.
DROP TABLE IF EXISTS tableC;
CREATE TABLE tableC
AS (
SELECT *
FROM tableA
LEFT JOIN tableB
ON tableA.id = tableB.id
)
After running the SQL query as a scheduled job, I get the error below. From my understanding, the error occurs in the final stage of the job, when it is writing the rows into a Delta table; however, I can't figure out the reason for the failure. I have checked that the data is not corrupted and that I have enough memory. The part that puzzles me most is that it works when the SQL command is run manually but always seems to fail when the query is part of a scheduled job run.
Job aborted due to stage failure: ResultStage 9 (write at WriteIntoDeltaCommand.scala:70) has failed the maximum allowable number of times: 4. Most recent failure reason:
org.apache.spark.shuffle.FetchFailedException
at org.apache.spark.errors.SparkCoreErrors$.fetchFailedError(SparkCoreErrors.scala:315)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:1167)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.$anonfun$next$1(ShuffleBlockFetcherIterator.scala:905)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:736)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:85)
at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.sql.execution.RowToColumnarExec$$anon$1.hasNext(Columnar.scala:491)
at com.databricks.photon.CloseableIterator$$anon$5.hasNext(CloseableIterator.scala:62)
at com.databricks.photon.NativeColumnBatchIterator.hasNext(NativeColumnBatchIterator.java:47)
at 0x8235362 <photon>.HasNext(external/workspace_spark_3_3/photon/jni-wrappers/jni-native-column-batch-iterator.cc:62)
at 0x4707a79 <photon>.OpenImpl(external/workspace_spark_3_3/photon/exec-nodes/file-writer-node.cc:161)
at com.databricks.photon.JniApiImpl.open(Native Method)
at com.databricks.photon.JniApi.open(JniApi.scala)
at com.databricks.photon.JniExecNode.open(JniExecNode.java:64)
at com.databricks.photon.PhotonWriteStageExec.$anonfun$executeWrite$5(PhotonWriteStageExec.scala:87)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.photon.PhotonExec.timeit(PhotonExec.scala:300)
at com.databricks.photon.PhotonExec.timeit$(PhotonExec.scala:298)
at com.databricks.photon.PhotonWriteStageExec.timeit(PhotonWriteStageExec.scala:38)
at com.databricks.photon.PhotonWriteStageExec.$anonfun$executeWrite$4(PhotonWriteStageExec.scala:87)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1730)
at com.databricks.photon.PhotonWriteStageExec.$anonfun$executeWrite$2(PhotonWriteStageExec.scala:84)
at com.databricks.photon.PhotonExec.$anonfun$executePhoton$2(PhotonExec.scala:484)
at com.databricks.photon.PhotonExec.$anonfun$executePhoton$2$adapted(PhotonExec.scala:353)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndexInternal$2(RDD.scala:882)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndexInternal$2$adapted(RDD.scala:882)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:372)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:336)
at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$3(ResultTask.scala:75)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$1(ResultTask.scala:75)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:55)
at org.apache.spark.scheduler.Task.doRunTask(Task.scala:169)
at org.apache.spark.scheduler.Task.$anonfun$run$4(Task.scala:137)
at com.databricks.unity.EmptyHandle$.runWithAndClose(UCSHandle.scala:104)
at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:137)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.Task.run(Task.scala:96)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$13(Executor.scala:902)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1696)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:905)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:760)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.nio.channels.ClosedChannelException
at org.apache.spark.network.client.StreamInterceptor.channelInactive(StreamInterceptor.java:62)
at org.apache.spark.network.util.TransportFrameDecoder.channelInactive(TransportFrameDecoder.java:223)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1405)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:901)
at io.netty.channel.AbstractChannel$AbstractUnsafe$7.run(AbstractChannel.java:813)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:469)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
... 1 more
Based on the stack trace, the issue seems to have two parts:
org.apache.spark.shuffle.FetchFailedException (the failed operation)
Caused by: java.nio.channels.ClosedChannelException (the root cause)
The problem is possibly that the default retry and/or timeout configuration is not suitable for complex workloads.
Check out this article and try the recommendations, mainly in the 'Network Timeout' section: https://towardsdatascience.com/fetch-failed-exception-in-apache-spark-decrypting-the-most-common-causes-b8dff21075c
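For reference, a minimal sketch of where those settings could be raised. The property names (spark.network.timeout, spark.shuffle.io.maxRetries, spark.shuffle.io.retryWait) are standard Spark settings; the values are illustrative guesses rather than tuned recommendations, and on Databricks they are normally set in the cluster or SQL warehouse configuration instead of in code.

import org.apache.spark.sql.SparkSession;

public class ShuffleTimeoutTuning {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("tableC-build")
                // Give slow shuffle fetches more time before they are declared failed.
                .config("spark.network.timeout", "600s")
                // Retry failed shuffle block fetches more often, with a longer pause in between.
                .config("spark.shuffle.io.maxRetries", "10")
                .config("spark.shuffle.io.retryWait", "30s")
                .getOrCreate();

        // ... then run the same DROP TABLE / CREATE TABLE ... AS statements as before
        // via spark.sql(...).
    }
}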
I am trying to load a dataset of Avro files into a Teradata table through Spark streaming (JDBC). The configuration is set properly and the load succeeds to a certain extent (I can validate that rows of data have been inserted into the table), but halfway through I start getting exceptions and the load fails. The stack trace is below. Any inkling as to what might be causing this?
18/02/08 17:27:42 ERROR executor.Executor: Exception in task 2.0 in stage 0.0 (TID 0)
java.sql.BatchUpdateException: [Teradata JDBC Driver] [TeraJDBC 16.20.00.02] [Error 1154] [SQLState HY000] A failure occurred while inserting the batch of rows destined for database table "database"."table". Details of the failure can be found in the exception chain that is accessible with getNextException.
at com.teradata.jdbc.jdbc_4.util.ErrorFactory.makeBatchUpdateException(ErrorFactory.java:149)
at com.teradata.jdbc.jdbc_4.util.ErrorFactory.makeBatchUpdateException(ErrorFactory.java:133)
at com.teradata.jdbc.jdbc.fastload.FastLoadManagerPreparedStatement.executeBatch(FastLoadManagerPreparedStatement.java:2389)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:592)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:670)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:670)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.sql.SQLException: [Teradata JDBC Driver] [TeraJDBC 16.20.00.02] [Error 1147] [SQLState HY000] The next failure(s) in the exception chain occurred while beginning FastLoad of database table "database"."table"
at com.teradata.jdbc.jdbc_4.util.ErrorFactory.makeDriverJDBCException(ErrorFactory.java:95)
at com.teradata.jdbc.jdbc_4.util.ErrorFactory.makeDriverJDBCException(ErrorFactory.java:70)
at com.teradata.jdbc.jdbc.fastload.FastLoadManagerPreparedStatement.beginFastLoad(FastLoadManagerPreparedStatement.java:966)
at com.teradata.jdbc.jdbc.fastload.FastLoadManagerPreparedStatement.executeBatch(FastLoadManagerPreparedStatement.java:2210)
The issue stems from trying to load data in append mode into an existing table. FastLoad doesn't support that: the target table should be empty (i.e. truncated) each time the process runs. That makes it useful for staging data prior to crunching it, but not for storing it. A sketch of a write that keeps the staging table empty is below.
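To illustrate, a minimal sketch of such a write. This is not the original poster's job: the host, database, table, credentials and input path are placeholders; it assumes a Spark version where the JDBC truncate write option exists (2.1+) and the built-in Avro reader (2.4+), plus the Teradata JDBC driver's TYPE=FASTLOAD URL parameter.

import java.util.Properties;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class TeradataStagingLoad {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("avro-to-teradata").getOrCreate();

        // Placeholder input path; "avro" is built in from Spark 2.4, older versions
        // need the spark-avro package instead.
        Dataset<Row> df = spark.read().format("avro").load("/data/input/");

        Properties props = new Properties();
        props.setProperty("user", "dbuser");          // placeholder credentials
        props.setProperty("password", "dbpass");
        props.setProperty("driver", "com.teradata.jdbc.TeraDriver");

        // TYPE=FASTLOAD asks the Teradata JDBC driver to use FastLoad for batch inserts.
        // FastLoad needs an empty target table, so the staging table is emptied
        // (truncated, not dropped) on every run via Overwrite + truncate=true.
        String url = "jdbc:teradata://tdhost/DATABASE=stagedb,TYPE=FASTLOAD";

        df.write()
          .mode(SaveMode.Overwrite)
          .option("truncate", "true") // keep the table definition, just empty it first
          .jdbc(url, "staging_table", props);
    }
}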
All of my nodes are throwing a FileNotFoundException during compaction. As a result, not a single compaction (automatic or manual) can finish, and my SSTable count is now in the thousands for a single CF (CQL3).
nodetool compactionstats shows hundreds of pending tasks on each node, but nothing is being processed.
Below is an example log of the exception:
Error occurred during compaction
java.util.concurrent.ExecutionException: java.lang.RuntimeException: java.io.FileNotFoundException: /home/cassandra/data/mtg_keywords_v5/keyword_organic_results/mtg_keywords_v5-keyword_organic_results-jb-31111-Data.db (No such file or directory)
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:188)
at org.apache.cassandra.db.compaction.CompactionManager.performMaximal(CompactionManager.java:281)
at org.apache.cassandra.db.ColumnFamilyStore.forceMajorCompaction(ColumnFamilyStore.java:1935)
at org.apache.cassandra.service.StorageService.forceKeyspaceCompaction(StorageService.java:2210)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at sun.reflect.misc.Trampoline.invoke(MethodUtil.java:75)
at sun.reflect.GeneratedMethodAccessor14.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at sun.reflect.misc.MethodUtil.invoke(MethodUtil.java:279)
at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:112)
at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:46)
at com.sun.jmx.mbeanserver.MBeanIntrospector.invokeM(MBeanIntrospector.java:237)
at com.sun.jmx.mbeanserver.PerInterface.invoke(PerInterface.java:138)
at com.sun.jmx.mbeanserver.MBeanSupport.invoke(MBeanSupport.java:252)
at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.invoke(DefaultMBeanServerInterceptor.java:819)
at com.sun.jmx.mbeanserver.JmxMBeanServer.invoke(JmxMBeanServer.java:801)
at javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1487)
at javax.management.remote.rmi.RMIConnectionImpl.access$300(RMIConnectionImpl.java:97)
at javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIConnectionImpl.java:1328)
at javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMIConnectionImpl.java:1420)
at javax.management.remote.rmi.RMIConnectionImpl.invoke(RMIConnectionImpl.java:848)
at sun.reflect.GeneratedMethodAccessor40.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:322)
at sun.rmi.transport.Transport$1.run(Transport.java:177)
at sun.rmi.transport.Transport$1.run(Transport.java:174)
at java.security.AccessController.doPrivileged(Native Method)
at sun.rmi.transport.Transport.serviceCall(Transport.java:173)
at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:556)
at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:811)
at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:670)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: java.io.FileNotFoundException: /home/cassandra/data/mtg_keywords_v5/keyword_organic_results/mtg_keywords_v5-keyword_organic_results-jb-31111-Data.db (No such file or directory)
at org.apache.cassandra.io.compress.CompressedThrottledReader.open(CompressedThrottledReader.java:52)
at org.apache.cassandra.io.sstable.SSTableReader.openDataReader(SSTableReader.java:1355)
at org.apache.cassandra.io.sstable.SSTableScanner.<init>(SSTableScanner.java:67)
at org.apache.cassandra.io.sstable.SSTableReader.getScanner(SSTableReader.java:1161)
at org.apache.cassandra.io.sstable.SSTableReader.getScanner(SSTableReader.java:1173)
at org.apache.cassandra.db.compaction.AbstractCompactionStrategy.getScanners(AbstractCompactionStrategy.java:252)
at org.apache.cassandra.db.compaction.AbstractCompactionStrategy.getScanners(AbstractCompactionStrategy.java:258)
at org.apache.cassandra.db.compaction.CompactionTask.runWith(CompactionTask.java:126)
at org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48)
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
at org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:60)
at org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:59)
at org.apache.cassandra.db.compaction.CompactionManager$6.runMayThrow(CompactionManager.java:296)
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
... 3 more
Caused by: java.io.FileNotFoundException: /home/cassandra/data/mtg_keywords_v5/keyword_organic_results/mtg_keywords_v5-keyword_organic_results-jb-31111-Data.db (No such file or directory)
at java.io.RandomAccessFile.open(Native Method)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:241)
at org.apache.cassandra.io.util.RandomAccessReader.<init>(RandomAccessReader.java:58)
at org.apache.cassandra.io.compress.CompressedRandomAccessReader.<init>(CompressedRandomAccessReader.java:76)
at org.apache.cassandra.io.compress.CompressedThrottledReader.<init>(CompressedThrottledReader.java:34)
at org.apache.cassandra.io.compress.CompressedThrottledReader.open(CompressedThrottledReader.java:48)
... 18 more
I'm currently in the middle of migrating 4.8 billion rows from MySQL, which I do via sstableloader in batches of 1 to 4 million rows. Does the exception mean that I've already lost data and must repeat the migration from scratch? So far I don't see any stream errors in my logs.
My environment is as follows:
DSE 4.0.1 (Cassandra 2.0.5)
CentOS 6.x x86_64
Java 1.7.0_5x
EDIT:
Some additional info:
1. During the bulk loading process, I devised a mechanism to kill sstableloader when the total progress reaches 100%. I also issue a "nodetool stop INDEX_BUILD" to all nodes. The reason is that sstableloader waits for the secondary index build to finish, and that takes hours to complete (whereas the actual import time is just a fraction of the index-build time). I found that the imported data remains intact after killing the sstableloader process and cancelling the secondary index build, so I wrote a script to automate the mechanism. So far I have completed more than 200 bulk loads with this trick.
2. I have paused the migration and restarted the nodes several times in the past week because the OS load reaches high levels (yellow or red in OpsCenter) after finishing several cycles of #1. It's possible that a compaction was in progress when I restarted the nodes via dse cassandra-stop (yes, we are running DSE as a standalone process).
Could any of these be the cause? How do I get out of this situation? Manual compaction/repair doesn't work because it always throws exceptions. For repair the exception is different, but the meaning is the same: some SSTable files are missing:
ERROR [MiscStage:2] 2014-05-03 00:42:10,386 CassandraDaemon.java (line 196) Exception in thread Thread[MiscStage:2,5,main]
java.lang.RuntimeException: Tried to hard link to file that does not exist /home/cassandra/data/mtg_keywords_v5/keyword_organic_results/mtg_keywords_v5-keyword_organic_results-jb-23797-Summary.db
at org.apache.cassandra.io.util.FileUtils.createHardLink(FileUtils.java:76)
at org.apache.cassandra.io.sstable.SSTableReader.createLinks(SSTableReader.java:1215)
at org.apache.cassandra.db.ColumnFamilyStore.snapshotWithoutFlush(ColumnFamilyStore.java:1816)
at org.apache.cassandra.db.ColumnFamilyStore.snapshot(ColumnFamilyStore.java:1849)
at org.apache.cassandra.service.SnapshotVerbHandler.doVerb(SnapshotVerbHandler.java:40)
at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:60)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Have you dropped and recreated the keyspace? If so, it's probably this:
https://issues.apache.org/jira/browse/CASSANDRA-4857
Restart your nodes to clear the bad filename out of memory.
When I try to decommission a node in my Cassandra cluster, the process starts (I see active streams flowing from the node being decommissioned to the other nodes in the cluster (using vnodes)), but then after a short delay nodetool decommission exits with the following error message.
I can repeatedly run nodetool decommission and it will start streaming data to other nodes, but so far it has always exited with the error below.
Why am I seeing this, and is there a way I can safely decommission this node?
Exception in thread "main" java.lang.RuntimeException: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
at org.apache.cassandra.db.HintedHandOffManager.getHintsSlice(HintedHandOffManager.java:578)
at org.apache.cassandra.db.HintedHandOffManager.listEndpointsPendingHints(HintedHandOffManager.java:528)
at org.apache.cassandra.service.StorageService.streamHints(StorageService.java:2854)
at org.apache.cassandra.service.StorageService.unbootstrap(StorageService.java:2834)
at org.apache.cassandra.service.StorageService.decommission(StorageService.java:2795)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:93)
at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:27)
at com.sun.jmx.mbeanserver.MBeanIntrospector.invokeM(MBeanIntrospector.java:208)
at com.sun.jmx.mbeanserver.PerInterface.invoke(PerInterface.java:120)
at com.sun.jmx.mbeanserver.MBeanSupport.invoke(MBeanSupport.java:262)
at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.invoke(DefaultMBeanServerInterceptor.java:836)
at com.sun.jmx.mbeanserver.JmxMBeanServer.invoke(JmxMBeanServer.java:761)
at javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1454)
at javax.management.remote.rmi.RMIConnectionImpl.access$300(RMIConnectionImpl.java:74)
at javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIConnectionImpl.java:1295)
at javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMIConnectionImpl.java:1387)
at javax.management.remote.rmi.RMIConnectionImpl.invoke(RMIConnectionImpl.java:818)
at sun.reflect.GeneratedMethodAccessor28.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:303)
at sun.rmi.transport.Transport$1.run(Transport.java:159)
at java.security.AccessController.doPrivileged(Native Method)
at sun.rmi.transport.Transport.serviceCall(Transport.java:155)
at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:535)
at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:790)
at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:649)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
at org.apache.cassandra.service.ReadCallback.get(ReadCallback.java:100)
at org.apache.cassandra.service.StorageProxy.getRangeSlice(StorageProxy.java:1213)
at org.apache.cassandra.db.HintedHandOffManager.getHintsSlice(HintedHandOffManager.java:573)
... 33 more
The hinted handoff manager is checking for hints to see if it needs to pass them off during the decommission, so that the hints don't get lost. You most likely have a lot of hints, or a bunch of tombstones, or something else in the table causing the query to time out. You aren't seeing any other exceptions in your logs before the timeout, are you? Raising the read timeout period on your nodes before you decommission them, or manually deleting the hints CF, should most likely get you past this (a sketch is below). If you delete them, you would then want to make sure you run a full cluster repair when you are done with all of your decommissions, to propagate data from any hints you deleted.
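For illustration only, a sketch of the "manually delete the hints CF" workaround via the Java driver. The contact point is a placeholder, it assumes a Cassandra version where system.hints can be truncated directly, and in practice this is usually just a TRUNCATE run from cqlsh on the node being decommissioned.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class ClearHintsBeforeDecommission {
    public static void main(String[] args) {
        // Connect to the node about to be decommissioned (address is a placeholder).
        Cluster cluster = Cluster.builder().addContactPoint("10.0.0.12").build();
        Session session = cluster.connect();

        // Drop all stored hints on this node; the same as running
        // "TRUNCATE system.hints;" from cqlsh. Remember to run a full cluster
        // repair once all decommissions are done, since any data those hints
        // carried has to be re-propagated.
        session.execute("TRUNCATE system.hints");

        cluster.close();

        // The alternative workaround is to raise read_request_timeout_in_ms in
        // cassandra.yaml on the node and restart it before decommissioning.
    }
}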
The short answer is that the node I was trying to decommission was underpowered for the amount of data it held. As of this writing there seems to be a reasonable hard minimum of resources needed to handle nodes with arbitrary amounts of data, which seems to be somewhere in the neighborhood of what an AWS i2.2xlarge provides. In particular, the old m1 instances let you get into trouble by allowing you to store far more data on each node than the memory and compute resources available can support on it.
We migrated our cluster to Cassandra 1.2 (from 1.1.7) and we tried to run a repair after the migration (we run them periodically, as recommended).
The repair failed due to the SSTable version being different, so we used nodetool upgradesstables and nodetool scrub and retried.
But then the repair failed with this exception:
java.io.IOException: Broken pipe
at com.google.common.base.Throwables.propagate(Throwables.java:160)
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:32)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Caused by: java.io.IOException: Broken pipe
at sun.nio.ch.FileChannelImpl.transferTo0(Native Method)
at sun.nio.ch.FileChannelImpl.transferToDirectly(Unknown Source)
at sun.nio.ch.FileChannelImpl.transferTo(Unknown Source)
at org.apache.cassandra.streaming.compress.CompressedFileStreamTask.stream(CompressedFileStreamTask.java:90)
at org.apache.cassandra.streaming.FileStreamTask.runMayThrow(FileStreamTask.java:91)
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
We also noticed that, before dying, nodetool netstats showed weird statistics (SSTable transfers at 756845%...).
EDIT
It seems to be a problem with the transfer of compressed SSTables; I will try to disable compression on this ColumnFamily and retry.
RE-EDIT
Disabling compression fixed the problem, so we tried to re-enable it and got a
java.lang.RuntimeException:
Last written key DecoratedKey(97115[...]3136) >= current key DecoratedKey(8...
Any idea?
The issue was caused by a bug in Cassandra 1.2 affecting transfers of compressed column families between nodes. It was fixed in 1.2.2 (see the Jira entry for this bug).
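For reference, a sketch of the disable/re-enable compression workaround mentioned in the edits above. The keyspace and table names are placeholders, the compression map uses the 1.2-era CQL3 syntax, and the real fix remains upgrading to 1.2.2.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class ToggleCompression {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect();

        // Disable compression on the affected column family (placeholder names),
        // using the 1.2-era CQL3 compression map syntax.
        session.execute("ALTER TABLE my_keyspace.my_cf "
                + "WITH compression = {'sstable_compression': ''}");

        // ... re-run the repair / streaming that was failing ...

        // Re-enable compression afterwards. The real fix, though, is upgrading to
        // Cassandra 1.2.2+, where the compressed-streaming bug is resolved.
        session.execute("ALTER TABLE my_keyspace.my_cf "
                + "WITH compression = {'sstable_compression': 'SnappyCompressor'}");

        cluster.close();
    }
}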