I have a 14-node Cassandra 3.9 cluster with ~250 GB of data on each node. Recently I have been attempting to add a 15th node to this cluster. The node has been stuck in the Joining state for the past 2 days, and nodetool netstats is clear. The main thing I find suspicious in the system.log of the joining node is errors like this one:
ERROR [Native-Transport-Requests-1] 2018-02-16 15:43:32,635 Message.java:617 - Unexpected exception during request; channel = [id: 0x8ed1cb3b, L:/**.**.**.42:9042 - R:/**.**.**.**:41614]
java.lang.NullPointerException: null
at org.apache.cassandra.auth.PasswordAuthenticator.authenticate(PasswordAuthenticator.java:88) ~[apache-cassandra-3.9.jar:3.9]
at org.apache.cassandra.auth.PasswordAuthenticator.access$300(PasswordAuthenticator.java:59) ~[apache-cassandra-3.9.jar:3.9]
at org.apache.cassandra.auth.PasswordAuthenticator$PlainTextSaslAuthenticator.getAuthenticatedUser(PasswordAuthenticator.java:220) ~[apache-cassandra-3.9.jar:3.9]
at org.apache.cassandra.transport.messages.AuthResponse.execute(AuthResponse.java:78) ~[apache-cassandra-3.9.jar:3.9]
at org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:513) [apache-cassandra-3.9.jar:3.9]
at org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:407) [apache-cassandra-3.9.jar:3.9]
at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) [netty-all-4.0.39.Final.jar:4.0.39.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:366) [netty-all-4.0.39.Final.jar:4.0.39.Final]
at io.netty.channel.AbstractChannelHandlerContext.access$600(AbstractChannelHandlerContext.java:35) [netty-all-4.0.39.Final.jar:4.0.39.Final]
at io.netty.channel.AbstractChannelHandlerContext$7.run(AbstractChannelHandlerContext.java:357) [netty-all-4.0.39.Final.jar:4.0.39.Final]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_91]
at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:164) [apache-cassandra-3.9.jar:3.9]
at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:109) [apache-cassandra-3.9.jar:3.9]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_91]
This error message is from a client trying to connect to this node. It seems to fail authentication. How might I proceed in this situation? How should I bring this node to normalcy?
There are two different problems here:
The auth issue the client is facing is caused by a bug in Cassandra 3.9 that occurs while a new node is bootstrapping. It has been fixed in later versions of Cassandra, as documented at https://issues.apache.org/jira/browse/CASSANDRA-12813.
We had a streaming issue similar to this with Cassandra 3.9. Taking a deeper look at the system.log, there was an error about a huge partition (larger than 100 MB) that could not be compacted because it exceeded the default commitlog_segment_size. We were able to get around it once we increased commitlog_segment_size_in_mb to 512 MB. Check for huge-partition warnings and adjust the size accordingly.
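If it helps, here is a rough way to check for those warnings and apply the setting; the log and config paths are typical package-install defaults and may differ on your systems:
# look for large-partition warnings on the joining node (default log location assumed)
grep -i "Writing large partition" /var/log/cassandra/system.log
# in cassandra.yaml on every node, followed by a rolling restart
commitlog_segment_size_in_mb: 512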
Related
I have one unreachable node in my cluster and I tried replacing it, which wasn't successful. So I left the node and accepted the data loss, since the replication factor is 3.
Now, when I try to decommission or add a server, it's not working as expected.
I'm getting these INFO messages on all the nodes. I have also tried to assassinate and remove the node. It doesn't show up in nodetool status, but my guess is that it is still persisted somewhere and gossip is causing the issues.
INFO [GossipStage:1] 2021-05-29 07:25:37,404 Gossiper.java:1029 - InetAddress /10.43.5.118 is now DOWN
INFO [GossipStage:1] 2021-05-29 07:25:37,405 StorageService.java:2324 - Removing tokens [] for /10.43.5.118
And also, while restarting the node, I get an ERROR from gossip, a NullPointerException: it is not able to get the host ID. I tried removing the node with the old method mentioned on Stack Overflow, using JMX.
ERROR [GossipStage:1] 2021-05-29 08:48:35,229 CassandraDaemon.java:226 - Exception in thread Thread[GossipStage:1,5,main]
java.lang.NullPointerException: null
at org.apache.cassandra.gms.Gossiper.getHostId(Gossiper.java:866) ~[apache-cassandra-3.9.jar:3.9]
at org.apache.cassandra.service.StorageService.handleStateNormal(StorageService.java:2096) ~[apache-cassandra-3.9.jar:3.9]
at org.apache.cassandra.service.StorageService.onChange(StorageService.java:1822) ~[apache-cassandra-3.9.jar:3.9]
at org.apache.cassandra.service.StorageService.onJoin(StorageService.java:2536) ~[apache-cassandra-3.9.jar:3.9]
at org.apache.cassandra.gms.Gossiper.handleMajorStateChange(Gossiper.java:1070) ~[apache-cassandra-3.9.jar:3.9]
at org.apache.cassandra.gms.Gossiper.applyStateLocally(Gossiper.java:1181) ~[apache-cassandra-3.9.jar:3.9]
at org.apache.cassandra.gms.GossipDigestAck2VerbHandler.doVerb(GossipDigestAck2VerbHandler.java:49) ~[apache-cassandra-3.9.jar:3.9]
at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:64) ~[apache-cassandra-3.9.jar:3.9]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_181]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_181]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[na:1.8.0_181]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_181]
at java.lang.Thread.run(Thread.java:748) [na:1.8.0_181]
Can someone let me know how to remove this node completely?
In this case, please remove all the data from the removed node and try to start it standalone. If it works, clear the data on that node again and join it to the cluster with the configuration change.
Also, if the other nodes hold only a small amount of data, run cleanup and repair once before adding the new node.
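As a rough sketch of fully purging the phantom node and re-adding it (the IP is taken from the logs above; the host ID placeholder, data paths and service commands are assumptions that depend on your install):
nodetool status                      # note the Host ID if the dead node still shows up
nodetool removenode <host-id>        # or 'nodetool removenode force' if it hangs
nodetool assassinate 10.43.5.118     # last resort: purge the endpoint from gossip state
# on the node you want to re-add, with Cassandra stopped, wipe its old state:
rm -rf /var/lib/cassandra/data/* /var/lib/cassandra/commitlog/* /var/lib/cassandra/saved_caches/*
# then start Cassandra so the node bootstraps as a brand-new member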
I am using Spark 2.2.0 to do data processing. I am using Dataframe.join to join 2 dataframes together, however I encountered this stack trace:
18/03/29 11:27:06 INFO YarnAllocator: Driver requested a total number of 0 executor(s).
18/03/29 11:27:09 ERROR FileFormatWriter: Aborting job null.
org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:123)
at org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:248)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
at org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:126)
at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:98)
at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.codegenInner(BroadcastHashJoinExec.scala:197)
at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doConsume(BroadcastHashJoinExec.scala:82)
at org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:155)
...........
Caused by: org.apache.spark.SparkException: Cannot broadcast the table that is larger than 8GB: 10 GB
at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1$$anonfun$apply$1.apply(BroadcastExchangeExec.scala:86)
at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1$$anonfun$apply$1.apply(BroadcastExchangeExec.scala:73)
at org.apache.spark.sql.execution.SQLExecution$.withExecutionId(SQLExecution.scala:103)
at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1.apply(BroadcastExchangeExec.scala:72)
at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1.apply(BroadcastExchangeExec.scala:72)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
I searched the Internet for this error but didn't find any hint or solution for how to fix it.
Does Spark automatically broadcast DataFrames as part of a join? I am very surprised by this 8 GB limit, because I would have thought DataFrames support "big data" and 8 GB is not very big at all.
Thank you very much in advance for your advice on this.
Linh
After some reading, I tried disabling the auto-broadcast and it seemed to work. Change the Spark config with:
'spark.sql.autoBroadcastJoinThreshold': '-1'
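For reference, a minimal sketch of setting this programmatically, assuming an existing SparkSession named spark and placeholder DataFrames df1/df2 with an illustrative join key "id":
// disable automatic broadcast joins; -1 means "never broadcast based on the size estimate"
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
// or pass it at submit time: --conf spark.sql.autoBroadcastJoinThreshold=-1
val joined = df1.join(df2, Seq("id"))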
Currently there is a hard limit in Spark: the broadcast variable size must be less than 8 GB. See here.
The 8 GB limit is generally big enough. Consider that if you are running a job with 100 executors, the Spark driver needs to send the 8 GB of data to 100 nodes, resulting in 800 GB of network traffic. This cost is much lower if you don't broadcast and use a plain (shuffle) join instead.
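One way to confirm which strategy the planner picked (df1, df2 and the join key are illustrative):
// with the threshold disabled you should see SortMergeJoin instead of BroadcastHashJoin in the plan
df1.join(df2, Seq("id")).explain()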
We have a Cassandra keyspace with 2 tables in production. We changed its compression strategy from LZ4Compressor (the default) to DeflateCompressor
using
ALTER TABLE "Keyspace"."TableName" WITH compression = {'class': 'DeflateCompressor'};
We have around 300 GB of data on each node of our 5-node Cassandra cluster with replication factor 2. Is running nodetool upgradesstables recommended as a best practice or not?
All the sources we have read say that, if necessary, I can use the nodetool upgradesstables command. But I want to know what the actual best practice is, given that our data is in production.
Sources:
When you add compression to an existing column family, existing SSTables on disk are not
compressed immediately. Any new SSTables that are created will be compressed, and any existing SSTables will be
compressed during the normal Cassandra compaction process. If necessary, you can force existing SSTables to be
rewritten and compressed by using nodetool upgradesstables (Cassandra 1.0.4 or later) or nodetool scrub
After all nodes completed upgradesstables, a large number of exceptions are being encountered in my Cassandra logs.
UPDATE - After running upgradesstables, my cluster is now throwing a lot of errors.
Sample
ERROR [ReadRepairStage:74899] 2018-04-08 14:50:09,779 CassandraDaemon.java:229 - Exception in thread Thread[ReadRepairStage:74899,5,main]
org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
at org.apache.cassandra.service.DataResolver$RepairMergeListener.close(DataResolver.java:171) ~[apache-cassandra-3.10.jar:3.10]
at org.apache.cassandra.db.partitions.UnfilteredPartitionIterators$2.close(UnfilteredPartitionIterators.java:182) ~[apache-cassandra-3.10.jar:3.10]
at org.apache.cassandra.db.transform.BaseIterator.close(BaseIterator.java:82) ~[apache-cassandra-3.10.jar:3.10]
at org.apache.cassandra.service.DataResolver.compareResponses(DataResolver.java:89) ~[apache-cassandra-3.10.jar:3.10]
at org.apache.cassandra.service.AsyncRepairCallback$1.runMayThrow(AsyncRepairCallback.java:50) ~[apache-cassandra-3.10.jar:3.10]
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) ~[apache-cassandra-3.10.jar:3.10]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[na:1.8.0_144]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[na:1.8.0_144]
at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:79) ~[apache-cassandra-3.10.jar:3.10]
at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_144]
DEBUG [ReadRepairStage:74889] 2018-04-08 14:50:07,777 ReadCallback.java:242 - Digest mismatch:
org.apache.cassandra.service.DigestMismatchException: Mismatch for key DecoratedKey(1013727261649388230, 715cb15cc5624c5a930ddfce290a690b) (d728e9a275616b0e05a0cd1b03bd9ef6 vs d41d8cd98f00b204e9800998ecf8427e)
at org.apache.cassandra.service.DigestResolver.compareResponses(DigestResolver.java:92) ~[apache-cassandra-3.10.jar:3.10]
at org.apache.cassandra.service.ReadCallback$AsyncRepairRunner.run(ReadCallback.java:233) ~[apache-cassandra-3.10.jar:3.10]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_144]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_144]
at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:79) [apache-cassandra-3.10.jar:3.10]
at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_144]
DEBUG [GossipStage:1] 2018-04-08 14:50:08,490 FailureDetector.java:457 - Ignoring interval time of 2000213620 for /10.196.22.208
DEBUG [ReadRepairStage:74899] 2018-04-08 14:50:09,778 DataResolver.java:169 - Timeout while read-repairing after receiving all 1 data and digest responses
ERROR [ReadRepairStage:74899] 2018-04-08 14:50:09,779 CassandraDaemon.java:229 - Exception in thread Thread[ReadRepairStage:74899,5,main]
org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
at org.apache.cassandra.service.DataResolver$RepairMergeListener.close(DataResolver.java:171) ~[apache-cassandra-3.10.jar:3.10]
at org.apache.cassandra.db.partitions.UnfilteredPartitionIterators$2.close(UnfilteredPartitionIterators.java:182) ~[apache-cassandra-3.10.jar:3.10]
at org.apache.cassandra.db.transform.BaseIterator.close(BaseIterator.java:82) ~[apache-cassandra-3.10.jar:3.10]
at org.apache.cassandra.service.DataResolver.compareResponses(DataResolver.java:89) ~[apache-cassandra-3.10.jar:3.10]
When you run nodetool upgradesstables, it writes new SSTables from the existing ones using the new options you specified. This is an IO-intensive process that may affect the performance of your cluster, so you need to plan it accordingly. You also need enough free disk space to perform this operation, and the command should run as the same user that runs Cassandra.
It really depends on your needs - if it's not urgent, you can simply wait until normal compaction occurs, and the data will be re-compressed then.
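If you do decide to rewrite the SSTables up front, a sketch of the per-node steps (run one node at a time; the keyspace and table names are the placeholders from the question):
nodetool upgradesstables -a Keyspace TableName   # -a rewrites even SSTables already on the current version
nodetool compactionstats                         # watch progress and gauge the I/O impact
nodetool tablestats Keyspace.TableName           # verify the new SSTable compression ratio afterwards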
I have a 15-node Cassandra 3.9 cluster. I recently faced an issue where one of my nodes was piling up GossipStage messages. Following some guidance I found in a similar report, I ran 'nodetool resetlocalschema' on that node. Gossip warnings like this one continue to show up in the logs:
WARN [GossipTasks:1] 2018-02-11 23:55:34,197 Gossiper.java:771 - Gossip stage has 180317 pending tasks; skipping status check (no nodes will be marked down)
I also see the following exception. Any guidance on how I can overcome this and bring the node back to normal? I should also mention that I have PasswordAuthenticator enabled in cassandra.yaml.
ERROR [Native-Transport-Requests-1] 2018-02-11 23:55:33,581 Message.java:617 - Unexpected exception during request; channel = [id: 0xbaa65545, L:/10.1.21.51:9042 - R:/10.1.86.40:35082]
java.lang.RuntimeException: com.google.common.util.concurrent.UncheckedExecutionException: java.lang.IllegalArgumentException: Unknown keyspace/cf pair (system_auth.roles)
at org.apache.cassandra.auth.PasswordAuthenticator.authenticate(PasswordAuthenticator.java:107) ~[apache-cassandra-3.9.jar:3.9]
at org.apache.cassandra.auth.PasswordAuthenticator.access$300(PasswordAuthenticator.java:59) ~[apache-cassandra-3.9.jar:3.9]
at org.apache.cassandra.auth.PasswordAuthenticator$PlainTextSaslAuthenticator.getAuthenticatedUser(PasswordAuthenticator.java:220) ~[apache-cassandra-3.9.jar:3.9]
at org.apache.cassandra.transport.messages.AuthResponse.execute(AuthResponse.java:78) ~[apache-cassandra-3.9.jar:3.9]
at org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:513) [apache-cassandra-3.9.jar:3.9]
at org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:407) [apache-cassandra-3.9.jar:3.9]
at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) [netty-all-4.0.39.Final.jar:4.0.39.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:366) [netty-all-4.0.39.Final.jar:4.0.39.Final]
at io.netty.channel.AbstractChannelHandlerContext.access$600(AbstractChannelHandlerContext.java:35) [netty-all-4.0.39.Final.jar:4.0.39.Final]
at io.netty.channel.AbstractChannelHandlerContext$7.run(AbstractChannelHandlerContext.java:357) [netty-all-4.0.39.Final.jar:4.0.39.Final]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_91]
at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:164) [apache-cassandra-3.9.jar:3.9]
at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:109) [apache-cassandra-3.9.jar:3.9]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_91]
Caused by: com.google.common.util.concurrent.UncheckedExecutionException: java.lang.IllegalArgumentException: Unknown keyspace/cf pair (system_auth.roles)
at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2203) ~[guava-18.0.jar:na]
at com.google.common.cache.LocalCache.get(LocalCache.java:3937) ~[guava-18.0.jar:na]
at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3941) ~[guava-18.0.jar:na]
at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4824) ~[guava-18.0.jar:na]
at org.apache.cassandra.auth.AuthCache.get(AuthCache.java:108) ~[apache-cassandra-3.9.jar:3.9]
at org.apache.cassandra.auth.PasswordAuthenticator.authenticate(PasswordAuthenticator.java:88) ~[apache-cassandra-3.9.jar:3.9]
... 13 common frames omitted
Actually, this issue was resolved by simply restarting the seed nodes of my cluster first, followed by the rest of the nodes. Thanks for all the inputs, truly appreciated.
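For anyone else hitting this, the restart was essentially a standard rolling restart, seeds first; the service command is an assumption that depends on your install:
nodetool drain && sudo systemctl restart cassandra   # one node at a time
nodetool status                                      # wait for the node to come back as UN before moving on
nodetool describecluster                             # confirm all nodes agree on a single schema version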
I'm using Spark with Cassandra and I want to write data into my Cassandra table:
CREATE TABLE IF NOT EXISTS MyTable(
user TEXT,
date TIMESTAMP,
event TEXT,
PRIMARY KEY ((user), date, event)
);
But I got this error:
java.io.IOException: Failed to write statements to KeySpace.MyTable.
at com.datastax.spark.connector.writer.TableWriter$$anonfun$write$1.apply(TableWriter.scala:145)
at com.datastax.spark.connector.writer.TableWriter$$anonfun$write$1.apply(TableWriter.scala:120)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:100)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:99)
at com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:151)
at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:99)
at com.datastax.spark.connector.writer.TableWriter.write(TableWriter.scala:120)
at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:36)
at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:36)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1202)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:696)
at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1420)
at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
at org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1375)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
at akka.dispatch.Mailbox.run(Mailbox.scala:220)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
15/04/28 17:57:47 WARN TaskSetManager: Lost task 13.2 in stage 1.0 (TID 43, dev2-cim.aid.fr): TaskKilled (killed intentionally)
and this warning in my Cassandra log file:
WARN [SharedPool-Worker-2] 2015-04-28 16:45:21,219 BatchStatement.java:243 - Batch of prepared statements for [*********] is of size 8158, exceeding specified threshold of 5120 by 3038
After searching the Internet, I found this link which explains how the author fixed the same problem:
http://progexc.blogspot.fr/2015/03/write-batch-size-error-spark-cassandra.html
So now I've modified my Spark configuration to add:
conf.set("spark.cassandra.output.batch.grouping.key", "None")
conf.set("spark.cassandra.output.batch.size.rows", "10")
conf.set("spark.cassandra.output.batch.size.bytes", "2048")
These values removed the warning message I was getting in the Cassandra logs, but I still have the same error: Failed to write statements.
In my Spark log file I found this error:
Failed to execute:
com.datastax.spark.connector.writer.RichBatchStatement#67827d57
com.datastax.driver.core.exceptions.InvalidQueryException: Key may not be empty
at com.datastax.driver.core.Responses$Error.asException(Responses.java:103)
at com.datastax.driver.core.DefaultResultSetFuture.onSet(DefaultResultSetFuture.java:140)
at com.datastax.driver.core.RequestHandler.setFinalResult(RequestHandler.java:293)
at com.datastax.driver.core.RequestHandler.onSet(RequestHandler.java:455)
at com.datastax.driver.core.Connection$Dispatcher.messageReceived(Connection.java:734)
at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
at org.jboss.netty.handler.timeout.IdleStateAwareChannelUpstreamHandler.handleUpstream(IdleStateAwareChannelUpstreamHandler.java:36)
at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
at org.jboss.netty.handler.timeout.IdleStateHandler.messageReceived(IdleStateHandler.java:294)
at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:296)
at org.jboss.netty.handler.codec.oneone.OneToOneDecoder.handleUpstream(OneToOneDecoder.java:70)
at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:296)
at org.jboss.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:462)
at org.jboss.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:443)
at org.jboss.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:303)
at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268)
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255)
at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
I had the same problem and found the solution in the comments above (by Amine CHERIFI and maasg).
The column corresponding to the primary key was not always filled with a proper value (in my case with an empty string "").
This triggered the ERROR
ERROR QueryExecutor: Failed to execute: \
com.datastax.spark.connector.writer.RichBatchStatement#26ad2668 \
com.datastax.driver.core.exceptions.InvalidQueryException: Key may not be empty
The solution was to provide a default non-empty string.
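As a minimal sketch of that fix (the RDD, column and table names, and the "unknown" sentinel are illustrative, not from the original code):
import com.datastax.spark.connector._

// rows is an RDD[(String, java.util.Date, String)] of (user, date, event)
val cleaned = rows.map { case (user, date, event) =>
  // Cassandra rejects null/empty partition keys ("Key may not be empty"), so substitute a sentinel
  val safeUser = Option(user).filter(_.nonEmpty).getOrElse("unknown")
  (safeUser, date, event)
}
cleaned.saveToCassandra("keyspace", "mytable", SomeColumns("user", "date", "event"))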
If you are running in yarn-cluster mode, don't forget to check the entire log on YARN using yarn logs -applicationId <appId> --appOwner <appOwner>.
This gave me more reasons for the failure than the logs on the YARN web UI:
Caused by: com.datastax.driver.core.exceptions.UnavailableException: Not enough replicas available for query at consistency LOCAL_QUORUM (2 required but only 1 alive)
at com.datastax.driver.core.Responses$Error$1.decode(Responses.java:50)
at com.datastax.driver.core.Responses$Error$1.decode(Responses.java:37)
at com.datastax.driver.core.Message$ProtocolDecoder.decode(Message.java:266)
at com.datastax.driver.core.Message$ProtocolDecoder.decode(Message.java:246)
at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:89)
... 11 more
The solution is to set spark.cassandra.output.consistency.level=ANY in your spark-defaults.conf.
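For example (the property name is the connector's; whether ANY or a stricter level such as LOCAL_ONE is appropriate depends on your durability needs):
# spark-defaults.conf
spark.cassandra.output.consistency.level  ANY
# or on the SparkConf, matching the style used in the question:
conf.set("spark.cassandra.output.consistency.level", "ANY")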
I resolved the issue by restarting my cluster as well as the nodes.
The following is what I tried.
I was also facing the same issue. I tried all the options mentioned above, but with no success.
My data size is 174 GB in total. My cluster has 3 nodes, each with 16 cores and 48 GB of RAM.
I tried to load the 174 GB in a single shot, and hit this same issue.
After that I split the 174 GB into 109 files of about 1.6 GB each and tried to load them; this time I faced the same problem again after loading 100 files (each 1.6 GB).
I thought the problem might be with the data in the 101st file. I tried loading that file into a new table, and tried loading new data into a new table, but all of these cases hit the same issue.
I then concluded it was a problem with the Cassandra cluster and restarted the cluster and the nodes.
After that, the issue went away.
Add a breakpoint at com/datastax/spark/connector/writer/AsyncExecutor.scala:45 and you can see the real exception.
In my case, the replication_factor of my keyspace was 2, but only one replica was alive.
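A quick way to check for this situation (the keyspace name is a placeholder): with a replication factor of 2 and one replica down, writes at the connector's default LOCAL_QUORUM cannot succeed, so either bring the down replica back or lower spark.cassandra.output.consistency.level for the job.
nodetool status                  # check which replicas are actually up
DESCRIBE KEYSPACE my_keyspace;   # in cqlsh: shows the keyspace's replication settings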