Cassandra exceptions during repair - cassandra

I am currently having issues with cassandra when running a nodetool repair.
I ran a nodetool repair on each of our cassandra nodes and came across the exception mentioned below.
After doing some reading I ran a "nodetool scrub" and "sstablescrub" with no success, I am still seeing the same errors when running a nodetool repair.
We are using cassandra version 2.0.1.
Has anyone else seen this problem?
ERROR [AntiEntropySessions:3] 2013-11-26 15:25:03,315 RepairSession.java (line 278) [repair #f9f86700-56f1-11e3-9885-5938b4e97c9c] session completed with the following error
org.apache.cassandra.exceptions.RepairException: [repair #f9f86700-56f1-11e3-9885-5938b4e97c9c on <keyspace>/<table-1>, (-3566327001497837731,-3559225618918749690]] Validation failed in /<ipaddress-1>
at org.apache.cassandra.repair.RepairSession.validationComplete(RepairSession.java:152)
at org.apache.cassandra.service.ActiveRepairService.handleMessage(ActiveRepairService.java:188)
at org.apache.cassandra.repair.RepairMessageVerbHandler.doVerb(RepairMessageVerbHandler.java:59)
at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:56)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
ERROR [AntiEntropySessions:3] 2013-11-26 15:25:03,315 CassandraDaemon.java (line 185) Exception in thread Thread[AntiEntropySessions:3,5,RMI Runtime]
java.lang.RuntimeException: org.apache.cassandra.exceptions.RepairException: [repair #f9f86700-56f1-11e3-9885-5938b4e97c9c on <keyspace>/<table-1>, (-3566327001497837731,-3559225618918749690]] Validation failed in /<ipaddress-1>
at com.google.common.base.Throwables.propagate(Throwables.java:160)
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:32)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Caused by: org.apache.cassandra.exceptions.RepairException: [repair #f9f86700-56f1-11e3-9885-5938b4e97c9c on <keyspace>/<table-1>, (-3566327001497837731,-3559225618918749690]] Validation failed in /<ipaddress-1>
at org.apache.cassandra.repair.RepairSession.validationComplete(RepairSession.java:152)
at org.apache.cassandra.service.ActiveRepairService.handleMessage(ActiveRepairService.java:188)
at org.apache.cassandra.repair.RepairMessageVerbHandler.doVerb(RepairMessageVerbHandler.java:59)
at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:56)
... 3 more
INFO [AntiEntropySessions:4] 2013-11-26 15:25:03,319 RepairSession.java (line 236) [repair #fa075b20-56f1-11e3-9885-5938b4e97c9c] new session: will sync /<ipaddress1>, /<ipaddress2>, /<ipaddress3> on range (2637939872511762724,2642446772090452246] for <keyspace>.[<table-1>, <table-2>, <table-3>, <table-4>, <table-5>]
INFO [AntiEntropySessions:4] 2013-11-26 15:25:03,319 RepairJob.java (line 116) [repair #fa075b20-56f1-11e3-9885-5938b4e97c9c] requesting merkle trees for <table-2> (to [/<ipaddress1>, <ipaddress2>, <ipaddress3>])
ERROR [ValidationExecutor:5] 2013-11-26 15:25:03,353 Validator.java (line 242) Failed creating a merkle tree for [repair #f9f86700-56f1-11e3-9885-5938b4e97c9c on <keyspace>/<table-1>, (-3566327001497837731,-3559225618918749690]], /<ipaddress-2> (see log for details)
ERROR [ValidationExecutor:5] 2013-11-26 15:25:03,353 CassandraDaemon.java (line 185) Exception in thread Thread[ValidationExecutor:5,1,main]
java.lang.AssertionError
at org.apache.cassandra.db.compaction.PrecompactedRow.update(PrecompactedRow.java:171)
at org.apache.cassandra.repair.Validator.rowHash(Validator.java:198)
at org.apache.cassandra.repair.Validator.add(Validator.java:151)
at org.apache.cassandra.db.compaction.CompactionManager.doValidationCompaction(CompactionManager.java:798)
at org.apache.cassandra.db.compaction.CompactionManager.access$600(CompactionManager.java:60)
at org.apache.cassandra.db.compaction.CompactionManager$8.call(CompactionManager.java:395)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)

Depending on your data model, you may be seeing the effects CASSANDRA-6152. See this comment for details:
https://issues.apache.org/jira/browse/CASSANDRA-6152?focusedCommentId=13793977&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13793977
Regardless, you should update to 2.0.3 immediately as there are several other fairly severe bugs lurking on your current version.

Related

Hive on Spark timeout

Under cloudera distribution, I have configured HIVE ON SPARK according to the online documentation.
When trying to execute a simple query for testing:
beeline -u "jdbc:hive2://<HOST_NAME>.<DOMAIN>:10000/default" -n mehditazi -p <PASSWORD> -e "SET hive.execution.engine=spark;SET spark.dynamicAllocation.enabled=true;SET spark.executor.memory=4g;SET spark.executor.cores=4;SET hive.spark.client.connect.timeout=5000;select count(*) from default.sample_07";
I get the following errors :
<== ON HUE I GET ==>
Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.spark.SparkTask
<== ON HIVE I GET ==>
2018-05-31 18:29:51,625 WARN [main] mapreduce.TableMapReduceUtil: The hbase-prefix-tree module jar containing PrefixTreeCodec is not present. Continuing without it.
scan complete in 3ms
Connecting to jdbc:hive2://<HOST_NAME>.<DOMAIN>:10000/default
Connected to: Apache Hive (version 1.1.0-cdh5.8.0)
Driver: Hive JDBC (version 1.1.0-cdh5.8.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
No rows affected (0.101 seconds)
No rows affected (0.005 seconds)
No rows affected (0.005 seconds)
No rows affected (0.005 seconds)
No rows affected (0.005 seconds)
INFO : Compiling command(queryId=hive_20180531182929_1e4bd43e-df8a-4b87-b898-dc73eebfbda3): select count(*) from default.sample_07
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:_c0, type:bigint, comment:null)], properties:null)
INFO : Completed compiling command(queryId=hive_20180531182929_1e4bd43e-df8a-4b87-b898-dc73eebfbda3); Time taken: 0.463 seconds
INFO : Executing command(queryId=hive_20180531182929_1e4bd43e-df8a-4b87-b898-dc73eebfbda3): select count(*) from default.sample_07
INFO : Query ID = hive_20180531182929_1e4bd43e-df8a-4b87-b898-dc73eebfbda3
INFO : Total jobs = 1
INFO : Launching Job 1 out of 1
INFO : Starting task [Stage-1:MAPRED] in serial mode
INFO : In order to change the average load for a reducer (in bytes):
INFO : set hive.exec.reducers.bytes.per.reducer=<number>
INFO : In order to limit the maximum number of reducers:
INFO : set hive.exec.reducers.max=<number>
INFO : In order to set a constant number of reducers:
INFO : set mapreduce.job.reduces=<number>
ERROR : Failed to execute spark task, with exception 'org.apache.hadoop.hive.ql.metadata.HiveException(Failed to create spark client.)'
org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create spark client.
at org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionImpl.open(SparkSessionImpl.java:64)
at org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionManagerImpl.getSession(SparkSessionManagerImpl.java:114)
at org.apache.hadoop.hive.ql.exec.spark.SparkUtilities.getSparkSession(SparkUtilities.java:125)
at org.apache.hadoop.hive.ql.exec.spark.SparkTask.execute(SparkTask.java:97)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:160)
at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:100)
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1782)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1539)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1318)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1127)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1120)
at org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:178)
at org.apache.hive.service.cli.operation.SQLOperation.access$100(SQLOperation.java:72)
at org.apache.hive.service.cli.operation.SQLOperation$2$1.run(SQLOperation.java:232)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
at org.apache.hive.service.cli.operation.SQLOperation$2.run(SQLOperation.java:245)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.util.concurrent.TimeoutException: Timed out waiting for client connection.
at com.google.common.base.Throwables.propagate(Throwables.java:156)
at org.apache.hive.spark.client.SparkClientImpl.<init>(SparkClientImpl.java:120)
at org.apache.hive.spark.client.SparkClientFactory.createClient(SparkClientFactory.java:80)
at org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient.createRemoteClient(RemoteHiveSparkClient.java:99)
at org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient.<init>(RemoteHiveSparkClient.java:95)
at org.apache.hadoop.hive.ql.exec.spark.HiveSparkClientFactory.createHiveSparkClient(HiveSparkClientFactory.java:65)
at org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionImpl.open(SparkSessionImpl.java:62)
... 22 more
Caused by: java.util.concurrent.ExecutionException: java.util.concurrent.TimeoutException: Timed out waiting for client connection.
at io.netty.util.concurrent.AbstractFuture.get(AbstractFuture.java:37)
at org.apache.hive.spark.client.SparkClientImpl.<init>(SparkClientImpl.java:104)
... 27 more
Caused by: java.util.concurrent.TimeoutException: Timed out waiting for client connection.
at org.apache.hive.spark.client.rpc.RpcServer$2.run(RpcServer.java:141)
at io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38)
at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:120)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
... 1 more
ERROR : Failed to execute spark task, with exception 'org.apache.hadoop.hive.ql.metadata.HiveException(Failed to create spark client.)'
org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create spark client.
at org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionImpl.open(SparkSessionImpl.java:64)
at org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionManagerImpl.getSession(SparkSessionManagerImpl.java:114)
at org.apache.hadoop.hive.ql.exec.spark.SparkUtilities.getSparkSession(SparkUtilities.java:125)
at org.apache.hadoop.hive.ql.exec.spark.SparkTask.execute(SparkTask.java:97)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:160)
at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:100)
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1782)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1539)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1318)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1127)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1120)
at org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:178)
at org.apache.hive.service.cli.operation.SQLOperation.access$100(SQLOperation.java:72)
at org.apache.hive.service.cli.operation.SQLOperation$2$1.run(SQLOperation.java:232)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
at org.apache.hive.service.cli.operation.SQLOperation$2.run(SQLOperation.java:245)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.util.concurrent.TimeoutException: Timed out waiting for client connection.
at com.google.common.base.Throwables.propagate(Throwables.java:156)
at org.apache.hive.spark.client.SparkClientImpl.<init>(SparkClientImpl.java:120)
at org.apache.hive.spark.client.SparkClientFactory.createClient(SparkClientFactory.java:80)
at org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient.createRemoteClient(RemoteHiveSparkClient.java:99)
at org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient.<init>(RemoteHiveSparkClient.java:95)
at org.apache.hadoop.hive.ql.exec.spark.HiveSparkClientFactory.createHiveSparkClient(HiveSparkClientFactory.java:65)
at org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionImpl.open(SparkSessionImpl.java:62)
... 22 more
After investigation it's a bug which will be fixed in HIVE-10594[1].

how to understand yarn appattempt log and diagnostics?

I am running a spark application on YARN cluster(on AWS EMR). The application seems to be killed and I want to find the cause. I try to understand the YARN info given in the following screen.
The diagnostic line in the screen seems to show that YARN killing the app because of the memory limit:
Diagnostics: Container [pid=1540,containerID=container_1488651686158_0012_02_000001] is running beyond physical memory limits. Current usage: 1.6 GB of 1.4 GB physical memory used; 3.6 GB of 6.9 GB virtual memory used. Killing container.
However, the appattempt log shows completely different exception, something related to the IO/network. My question is : should I trust the diagnostic in the screen or the appattempt log? Is the IO exception causing the kill or the out of memory cause the IO exception in the appattempt log? Is it another log/diagnostic I should look at? Thanks.
17/03/04 21:59:02 ERROR Utils: Uncaught exception in thread task-result-getter-0
java.lang.InterruptedException
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:998)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202)
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:190)
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:190)
at org.apache.spark.network.BlockTransferService.fetchBlockSync(BlockTransferService.scala:104)
at org.apache.spark.storage.BlockManager.getRemoteBytes(BlockManager.scala:579)
at org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:82)
at org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$1.apply(TaskResultGetter.scala:63)
at org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$1.apply(TaskResultGetter.scala:63)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1951)
at org.apache.spark.scheduler.TaskResultGetter$$anon$3.run(TaskResultGetter.scala:62)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Exception in thread "task-result-getter-0" java.lang.Error: java.lang.InterruptedException
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1148)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.InterruptedException
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:998)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202)
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:190)
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:190)
at org.apache.spark.network.BlockTransferService.fetchBlockSync(BlockTransferService.scala:104)
at org.apache.spark.storage.BlockManager.getRemoteBytes(BlockManager.scala:579)
at org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:82)
at org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$1.apply(TaskResultGetter.scala:63)
at org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$1.apply(TaskResultGetter.scala:63)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1951)
at org.apache.spark.scheduler.TaskResultGetter$$anon$3.run(TaskResultGetter.scala:62)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
... 2 more
17/03/04 21:59:02 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
17/03/04 21:59:02 ERROR TransportResponseHandler: Still have 1 requests outstanding when connection from ip-172-31-9-207.ec2.internal/172.31.9.207:38437 is closed
17/03/04 21:59:02 INFO RetryingBlockFetcher: Retrying fetch (1/3) for 1 outstanding blocks after 5000 ms
17/03/04 21:59:02 ERROR DiskBlockManager: Exception while deleting local spark dir: /mnt/yarn/usercache/hadoop/appcache/application_1488651686158_0012/blockmgr-941a13d8-1b31-4347-bdec-180125b6f4ca
java.io.IOException: Failed to delete: /mnt/yarn/usercache/hadoop/appcache/application_1488651686158_0012/blockmgr-941a13d8-1b31-4347-bdec-180125b6f4ca
at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:1010)
at org.apache.spark.storage.DiskBlockManager$$anonfun$org$apache$spark$storage$DiskBlockManager$$doStop$1.apply(DiskBlockManager.scala:169)
at org.apache.spark.storage.DiskBlockManager$$anonfun$org$apache$spark$storage$DiskBlockManager$$doStop$1.apply(DiskBlockManager.scala:165)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at org.apache.spark.storage.DiskBlockManager.org$apache$spark$storage$DiskBlockManager$$doStop(DiskBlockManager.scala:165)
at org.apache.spark.storage.DiskBlockManager.stop(DiskBlockManager.scala:160)
at org.apache.spark.storage.BlockManager.stop(BlockManager.scala:1361)
at org.apache.spark.SparkEnv.stop(SparkEnv.scala:89)
at org.apache.spark.SparkContext$$anonfun$stop$11.apply$mcV$sp(SparkContext.scala:1842)
at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1283)
at org.apache.spark.SparkContext.stop(SparkContext.scala:1841)
at org.apache.spark.SparkContext$$anonfun$2.apply$mcV$sp(SparkContext.scala:581)
at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1951)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
17/03/04 21:59:02 INFO MemoryStore: MemoryStore cleared
17/03/04 21:59:02 INFO BlockManager: BlockManager stopped
17/03/04 21:59:02 INFO BlockManagerMaster: BlockManagerMaster stopped
17/03/04 21:59:02 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
17/03/04 21:59:02 ERROR Utils: Uncaught exception in thread Thread-3
java.lang.NoClassDefFoundError: Could not initialize class java.nio.file.FileSystems$DefaultFileSystemHolder
at java.nio.file.FileSystems.getDefault(FileSystems.java:176)
at java.nio.file.Paths.get(Paths.java:138)
at org.apache.spark.util.Utils$.isSymlink(Utils.scala:1021)
at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:991)
at org.apache.spark.SparkEnv.stop(SparkEnv.scala:102)
at org.apache.spark.SparkContext$$anonfun$stop$11.apply$mcV$sp(SparkContext.scala:1842)
at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1283)
at org.apache.spark.SparkContext.stop(SparkContext.scala:1841)
at org.apache.spark.SparkContext$$anonfun$2.apply$mcV$sp(SparkContext.scala:581)
at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1951)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
17/03/04 21:59:02 WARN ShutdownHookManager: ShutdownHook '$anon$2' failed, java.lang.NoClassDefFoundError: Could not initialize class java.nio.file.FileSystems$DefaultFileSystemHolder
java.lang.NoClassDefFoundError: Could not initialize class java.nio.file.FileSystems$DefaultFileSystemHolder
at java.nio.file.FileSystems.getDefault(FileSystems.java:176)
at java.nio.file.Paths.get(Paths.java:138)
at org.apache.spark.util.Utils$.isSymlink(Utils.scala:1021)
at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:991)
at org.apache.spark.SparkEnv.stop(SparkEnv.scala:102)
at org.apache.spark.SparkContext$$anonfun$stop$11.apply$mcV$sp(SparkContext.scala:1842)
at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1283)
at org.apache.spark.SparkContext.stop(SparkContext.scala:1841)
at org.apache.spark.SparkContext$$anonfun$2.apply$mcV$sp(SparkContext.scala:581)
at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1951)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
The information in your screenshot is the most relevant. Your ApplicationMaster container ran out of memory. You need to increase yarn.app.mapreduce.am.resource.mb which is set in mapred-site.xml. I recommend a value of 2000 since that will usually accommodate running Spark and MapReduce applications at scale.
The container was killed (memory exceeds physical memory limits) so any attempt to reach this container fails.
Yarn is fine to have an overall view of the process, but you should prefer spark history server to analyse better your job (check unbalanced memory in spark history).

sstableloader fails to upload data if node is down

I have a test setup in which sstableloader fails to upload data if one of the Cassandra node is down. I can see that
Is there a way to instruct sstableloader not to connect (or open a stream) to the dead node (I don't want to decommission/remove the node from cluster)?
Cassandra cluster info: Datastax community version 2.1.2, 3 node cluster out of which 2 are seed nodes.
During testing bulk upload, one of the seed node was down. The keyspace has replication factor = 2.
Exception encountered:
progress: total: 100% 0 MB/s(avg: 0 MB/s)ERROR 09:07:48 [Stream #8972f510-efe1-11e4-abad-9d409520f182] Streaming error occurred
java.net.ConnectException: Connection refused
at sun.nio.ch.Net.connect0(Native Method) ~[na:1.7.0_65]
at sun.nio.ch.Net.connect(Net.java:465) ~[na:1.7.0_65]
at sun.nio.ch.Net.connect(Net.java:457) ~[na:1.7.0_65]
at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:670) ~[na:1.7.0_65]
at java.nio.channels.SocketChannel.open(SocketChannel.java:184) ~[na:1.7.0_65]
at org.apache.cassandra.tools.BulkLoadConnectionFactory.createConnection(BulkLoadConnectionFactory.java:62) ~[apache-cassandra-2.1.2.jar:2.1.2]
at org.apache.cassandra.streaming.StreamSession.createConnection(StreamSession.java:229) ~[apache-cassandra-2.1.2.jar:2.1.2]
at org.apache.cassandra.streaming.ConnectionHandler.initiate(ConnectionHandler.java:79) ~[apache-cassandra-2.1.2.jar:2.1.2]
at org.apache.cassandra.streaming.StreamSession.start(StreamSession.java:216) ~[apache-cassandra-2.1.2.jar:2.1.2]
at org.apache.cassandra.streaming.StreamCoordinator$StreamSessionConnector.run(StreamCoordinator.java:208) [apache-cassandra-2.1.2.jar:2.1.2]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_65]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_65]
at java.lang.Thread.run(Thread.java:745) [na:1.7.0_65]
progress: [/192.168.1.17]0:1/1 100% total: 100% 0 MB/s(avg: 1 MB/s)WARN 09:07:48 [Stream #8972f510-efe1-11e4-abad-9d409520f182] Stream failed
Streaming to the following hosts failed:
[/192.168.1.15]
java.util.concurrent.ExecutionException: org.apache.cassandra.streaming.StreamException: Stream failed
at com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299)
at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286)
at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
at org.apache.cassandra.tools.BulkLoader.main(BulkLoader.java:121)
Caused by: org.apache.cassandra.streaming.StreamException: Stream failed
at org.apache.cassandra.streaming.StreamResultFuture.maybeComplete(StreamResultFuture.java:208)
at org.apache.cassandra.streaming.StreamResultFuture.handleSessionComplete(StreamResultFuture.java:184)
at org.apache.cassandra.streaming.StreamSession.closeSession(StreamSession.java:382)
at org.apache.cassandra.streaming.StreamSession.complete(StreamSession.java:574)
at org.apache.cassandra.streaming.StreamSession.messageReceived(StreamSession.java:438)
at org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:251)
at java.lang.Thread.run(Thread.java:745)
Thanks in advance,
Anirban.
I just figured that I can pass an ignore list to sstableloader. By passing dead nodes in the ignore list sstableloader ran successfully in my test setup.

Cassandra Nodetool Repair error RepairException

When I run "nodetool -h 192.168.1.161 repair" on a node I get the following error. How should I resolve this? I'm running Cassandra 2.0.2.
ERROR [Thread-26] 2014-08-17 18:24:01,289 StorageService.java (line 2477) Repair session 8d2a1190-25aa-11e4-8a15-ff681618d551 for range (1844674407370955161,5534023222112865484] failed with error org.apache.cassandra.exceptions.RepairException: [repair #8d2a1190-25aa-11e4-8a15-ff681618d551 on WIDGET_CF/STATS, (1844674407370955161,5534023222112865484]] Validation failed in /192.168.1.164
java.util.concurrent.ExecutionException: java.lang.RuntimeException: org.apache.cassandra.exceptions.RepairException: [repair #8d2a1190-25aa-11e4-8a15-
ff681618d551 on WIDGET_CF/STATS, (1844674407370955161,5534023222112865484]] Validation failed in /192.168.1.164
at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:252)
at java.util.concurrent.FutureTask.get(FutureTask.java:111)
at org.apache.cassandra.service.StorageService$4.runMayThrow(StorageService.java:2469)
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at java.lang.Thread.run(Thread.java:724)
Caused by: java.lang.RuntimeException: org.apache.cassandra.exceptions.RepairException: [repair #8d2a1190-25aa-11e4-8a15-ff681618d551 on WIDGET_CF/STA
TS, (1844674407370955161,5534023222112865484]] Validation failed in /192.168.1.164
at com.google.common.base.Throwables.propagate(Throwables.java:160)
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:32)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
... 1 more
Caused by: org.apache.cassandra.exceptions.RepairException: [repair #8d2a1190-25aa-11e4-8a15-ff681618d551 on WIDGET_CF/STATS, (1844674407370955161,553
4023222112865484]] Validation failed in /192.168.1.164
at org.apache.cassandra.repair.RepairSession.validationComplete(RepairSession.java:152)
at org.apache.cassandra.service.ActiveRepairService.handleMessage(ActiveRepairService.java:188)
at org.apache.cassandra.repair.RepairMessageVerbHandler.doVerb(RepairMessageVerbHandler.java:59)
at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:56)
... 3 more

Cassandra compaction fails: IOError IOException Map failed

We have been running Cassandra 1.0.2 in production for many months, with minimal trouble. Lately, we have started to see consistent failures in all nodes, with this error:
ERROR [CompactionExecutor:199] 2013-03-02 00:21:26,179 AbstractCassandraDaemon.java (line 133) Fatal exception in thread Thread[CompactionExecutor:199,1,RMI Runtime]
java.io.IOError: java.io.IOException: Map failed
at org.apache.cassandra.io.util.MmappedSegmentedFile$Builder.createSegments(MmappedSegmentedFile.java:225)
at org.apache.cassandra.io.util.MmappedSegmentedFile$Builder.complete(MmappedSegmentedFile.java:202)
at org.apache.cassandra.io.sstable.SSTableWriter.closeAndOpenReader(SSTableWriter.java:308)
at org.apache.cassandra.db.compaction.CompactionTask.execute(CompactionTask.java:174)
at org.apache.cassandra.db.compaction.CompactionManager$4.call(CompactionManager.java:275)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)
Caused by: java.io.IOException: Map failed
at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:758)
at org.apache.cassandra.io.util.MmappedSegmentedFile$Builder.createSegments(MmappedSegmentedFile.java:217)
... 9 more
Caused by: java.lang.OutOfMemoryError: Map failed
at sun.nio.ch.FileChannelImpl.map0(Native Method)
at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:755)
... 10 more
INFO [Thread-2] 2013-03-02 00:21:26,222 MessagingService.java (line 488) Shutting down MessageService...
The node dies after this. We remove the oldest data files and start the node again. The node runs for a while, then dies again. This happens for all eight nodes in our ring.
We are using the default compaction strategy (size tiered) and we think it is the best choice for our problem domain.
Q: Why is this happening and what should we do to fix it?

Resources