I have a Spark Streaming application that has multiple data flows (DStreams) that write in the same Cassandra table. When testing my application on a large amount of random data, I'm receiving an error from Spark Cassandra Connector that has very little information helpful for debugging. The error looks like this:
java.util.concurrent.ExecutionException: com.datastax.driver.core.exceptions.InvalidQueryException: Key may not be empty
at com.baynote.shaded.com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299)
at com.baynote.shaded.com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286)
at com.baynote.shaded.com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
at com.datastax.spark.connector.rdd.CassandraJoinRDD$$anonfun$fetchIterator$1.apply(CassandraJoinRDD.scala:268)
at com.datastax.spark.connector.rdd.CassandraJoinRDD$$anonfun$fetchIterator$1.apply(CassandraJoinRDD.scala:268)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at com.datastax.spark.connector.util.CountingIterator.hasNext(CountingIterator.scala:12)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:189)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:64)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: com.datastax.driver.core.exceptions.InvalidQueryException: Key may not be empty
at com.datastax.driver.core.Responses$Error.asException(Responses.java:136)
at com.datastax.driver.core.DefaultResultSetFuture.onSet(DefaultResultSetFuture.java:179)
at com.datastax.driver.core.RequestHandler.setFinalResult(RequestHandler.java:184)
at com.datastax.driver.core.RequestHandler.access$2500(RequestHandler.java:43)
at com.datastax.driver.core.RequestHandler$SpeculativeExecution.setFinalResult(RequestHandler.java:798)
at com.datastax.driver.core.RequestHandler$SpeculativeExecution.onSet(RequestHandler.java:617)
at com.datastax.driver.core.Connection$Dispatcher.channelRead0(Connection.java:1005)
at com.datastax.driver.core.Connection$Dispatcher.channelRead0(Connection.java:928)
at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:244)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:831)
at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:346)
at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:254)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
... 1 more
The problem with it is that I can't tell which DStream, and which data cause it. I can check every DStream that writes to Cassandra, or write my own data validator, but I'm looking for a more generic solution.
The other problem is that it the error kills the whole job instead of ignoring it and continue writing other data. Basically in case of simple non-spark writing I would catch the exception, log it and continue writing the rest of the data. Is there a way to do something like that in Spark Cassandra Connector?
So is there something I can do about those two problems?
I think we should consider two cases:
Validate your input data to make sure data for Key (in cassandra columns) is not Null or invalid data format
Your data is RDD, so you can sort to ignore invalid data before calling save method.
Related
I'm trying to consume messages from RabbitMQ in Spark Streaming following this answer: https://stackoverflow.com/a/38172737/1344854. I'm failing with IllegalArgumentException and I don't know why.
6/09/05 13:23:22 ERROR ReceiverTracker: Deregistered receiver for stream 0: Error starting receiver 0 - java.lang.IllegalArgumentException: ssl://user:password#bunny.cloudamqp.com:8883/vhost
at org.eclipse.paho.client.mqttv3.MqttConnectOptions.validateURI(MqttConnectOptions.java:458)
at org.eclipse.paho.client.mqttv3.MqttAsyncClient.<init>(MqttAsyncClient.java:273)
at org.eclipse.paho.client.mqttv3.MqttAsyncClient.<init>(MqttAsyncClient.java:167)
at org.eclipse.paho.client.mqttv3.MqttClient.<init>(MqttClient.java:224)
at org.apache.spark.streaming.mqtt.MQTTReceiver.onStart(MQTTInputDStream.scala:71)
at org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:148)
at org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:130)
at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:575)
at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:565)
at org.apache.spark.SparkContext$$anonfun$37.apply(SparkContext.scala:1992)
at org.apache.spark.SparkContext$$anonfun$37.apply(SparkContext.scala:1992)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
I tried passing various combinations of brokerUrl argument to the MQTTUtils.createStream constructor including tcp:
tcp://user:password#bunny.cloudamqp.com:1883/vhost
Why is it failing to validate and how to fix it?
I am trying to save a Dataframe To cassandra, which results in error due to wrong password lets say. The job goes to failed state throwing the below exception in Jobs tab in Spark UI but in the streaming tab, the corresponding batch remains in active state forever.It doesn't fail the streaming batch in yarn-client mode.. Whereas, the same works fine in Yarn-cluster mode, it throws the same error and ends the application.
Why is this difference in behaviour in the 2 modes? Why does yarn-client mode behaves in this way?
Exception seen in both modes:
About to save to Cassandra.
16/08/04 08:04:43 ERROR org.apache.spark.streaming.scheduler.JobScheduler: Error running job streaming job 1470297880000 ms.0
java.io.IOException: Failed to open native connection to Cassandra at {172.x.x.x}:9042
at com.datastax.spark.connector.cql.CassandraConnector$.com$datastax$spark$connector$cql$CassandraConnector$$createSession(CassandraConnector.scala:162)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$2.apply(CassandraConnector.scala:148)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$2.apply(CassandraConnector.scala:148)
at com.datastax.spark.connector.cql.RefCountedCache.createNewValueAndKeys(RefCountedCache.scala:31)
at com.datastax.spark.connector.cql.RefCountedCache.acquire(RefCountedCache.scala:56)
at com.datastax.spark.connector.cql.CassandraConnector.openSession(CassandraConnector.scala:81)
at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:109)
at com.datastax.spark.connector.rdd.partitioner.CassandraRDDPartitioner$.getTokenFactory(CassandraRDDPartitioner.scala:184)
at org.apache.spark.sql.cassandra.CassandraSourceRelation$.apply(CassandraSourceRelation.scala:267)
at org.apache.spark.sql.cassandra.DefaultSource.createRelation(DefaultSource.scala:84)
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:170)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:146)
at boccstreamingall$$anonfun$process_kv_text_stream$1.apply(bocc_spark_all.scala:249)
at boccstreamingall$$anonfun$process_kv_text_stream$1.apply(bocc_spark_all.scala:233)
at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:631)
at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:631)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:42)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:40)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:40)
at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:399)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:40)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
at scala.util.Try$.apply(Try.scala:161)
at org.apache.spark.streaming.scheduler.Job.run(Job.scala:34)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:207)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:207)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:207)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:206)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: com.datastax.driver.core.exceptions.AuthenticationException: Authentication error on host /172.x.x.x:9042: Username and/or password are incorrect
at com.datastax.driver.core.Connection$8.apply(Connection.java:376)
at com.datastax.driver.core.Connection$8.apply(Connection.java:346)
at shadeio.common.util.concurrent.Futures$ChainingListenableFuture.run(Futures.java:861)
at shadeio.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297)
at shadeio.common.util.concurrent.ExecutionList.executeListener(ExecutionList.java:156)
at shadeio.common.util.concurrent.ExecutionList.execute(ExecutionList.java:145)
at shadeio.common.util.concurrent.AbstractFuture.set(AbstractFuture.java:185)
at com.datastax.driver.core.Connection$Future.onSet(Connection.java:1174)
at com.datastax.driver.core.Connection$Dispatcher.channelRead0(Connection.java:1005)
at com.datastax.driver.core.Connection$Dispatcher.channelRead0(Connection.java:928)
at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:244)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1084)
at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:959)
at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:327)
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:230)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:831)
at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:346)
at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:254)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
I am running a spark streaming application with the input source as Kafka. The version of spark is 1.4.0.
My application runs fine under, but now when I enable checkpointing, run the job and then restart the job to see if check-pointing is working properly I get the following flooded into the logs and the job halts.
Could you help me in resolving this issue. Please let me know if any other information is needed. Basically I want to add the checkpointing feature to my spark streaming application.
15/10/30 13:23:00 INFO TorrentBroadcast: Started reading broadcast variable 4
java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_4_piece0 of broadcast_4
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1257)
at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:165)
at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:88)
at com.toi.columbia.aggregate.util.CalendarUtil.isRecordCassandraInsertableV1(CalendarUtil.java:103)
at com.toi.columbia.aggregate.stream.v1.AdvPublisherV1$3.call(AdvPublisherV1.java:124)
at com.toi.columbia.aggregate.stream.v1.AdvPublisherV1$3.call(AdvPublisherV1.java:110)
at org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$fn$1$1.apply(JavaDStreamLike.scala:172)
at org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$fn$1$1.apply(JavaDStreamLike.scala:172)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at com.datastax.spark.connector.util.CountingIterator.hasNext(CountingIterator.scala:10)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at com.datastax.spark.connector.writer.TableWriter.measureMaxInsertSize(TableWriter.scala:89)
at com.datastax.spark.connector.writer.TableWriter.com$datastax$spark$connector$writer$TableWriter$$optimumBatchSize(TableWriter.scala:107)
at com.datastax.spark.connector.writer.TableWriter$$anonfun$write$1.apply(TableWriter.scala:133)
at com.datastax.spark.connector.writer.TableWriter$$anonfun$write$1.apply(TableWriter.scala:127)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:98)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:97)
at com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:149)
at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:97)
at com.datastax.spark.connector.writer.TableWriter.write(TableWriter.scala:127)
at com.datastax.spark.connector.streaming.DStreamFunctions$$anonfun$saveToCassandra$1$$anonfun$apply$1.apply(DStreamFunctions.scala:26)
at com.datastax.spark.connector.streaming.DStreamFunctions$$anonfun$saveToCassandra$1$$anonfun$apply$1.apply(DStreamFunctions.scala:26)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Failed to get broadcast_4_piece0 of broadcast_4
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:138)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:138)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:137)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:120)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:120)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:120)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:175)
at org.apache.spark.u
maybe you forgot to increase the spark.cleaner.ttl so the task gets cleaned.
see here https://issues.apache.org/jira/browse/SPARK-5594
I believe you are creating the broadcast variables inside
JavaStreamingContextFactory factory = new JavaStreamingContextFactory() {}
Try creating the broadcast variables outside this overridden method.
As is clear from you exception - the broadcast variables are not being intitialized when you restart your chekpointed application.
cheers!
Can IMap or other Hazelcast distributed data structures like AtomicLong be accessed from within process() method of an EntryProcessor?
I'm getting following exception:
java.util.concurrent.ExecutionException: java.lang.IllegalThreadStateException: Thread[hz.Alcatraz-ANP-Sys-HAZLE-2.actiance.local.partition-operation.thread-5,5,Alcatraz-ANP-Sys-HAZLE-2.actiance.local] cannot make remote call: com.hazelcast.concurrent.lock.operations.LockOperation#3229190f
at java.util.concurrent.FutureTask.report(FutureTask.java:122) ~[na:1.7.0_51]
at java.util.concurrent.FutureTask.get(FutureTask.java:188) ~[na:1.7.0_51]
at com.hazelcast.executor.impl.DistributedExecutorService$CallableProcessor.run(DistributedExecutorService.java:189) ~[hazelcast-3.3_actiance.jar:3.3]
at com.hazelcast.util.executor.CachedExecutorServiceDelegate$Worker.run(CachedExecutorServiceDelegate.java:209) [hazelcast-3.3_actiance.jar:3.3]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_51]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_51]
at java.lang.Thread.run(Thread.java:744) [na:1.7.0_51]
at com.hazelcast.util.executor.HazelcastManagedThread.executeRun(HazelcastManagedThread.java:76) [hazelcast-3.3_actiance.jar:3.3]
at com.hazelcast.util.executor.HazelcastManagedThread.run(HazelcastManagedThread.java:92) [hazelcast-3.3_actiance.jar:3.3]
I'm using Hazelcast version 3.3
You can access other datastructures but you need to make sure they're in the same data partition as the currently processed entry. This means you can use (for example) data affinity to pin all data together in the same partition.
Sharing an IAtomicLong between different partitions is not possible though.
PS: You also shouldn't mutate other data than the current processed entry since it might end up in a deadlock between different nodes.
I'm trying to use the new join functionality from the 1.2 version but I get an error with the repartitionByCassandraReplica function in the repl.
I've tried to duplicate the example of the website and created a cassandra table (shopping_history) with a couple of elements :
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.mde
import com.datastax.spark.connector.rdd._
import com.datastax.spark.connector.cql.CassandraConnector
import com.datastax.spark.connector._
import com.datastax.driver.core._
case class CustomerID(cust_id: Int)
val idsOfInterest = sc.parallelize(1 to 1000).map(CustomerID(_))
val repartitioned = idsOfInterest.repartitionByCassandraReplica("cim_dev", "shopping_history", 10)
repartitioned.first()
I get this error :
15/04/13 18:35:43 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 2, dev2-cim.aid.fr): java.lang.ClassNotFoundException: $line31.$read$$iwC$$iwC$CustomerID
at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:344)
at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:59)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1613)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
at org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at org.apache.spark.rdd.RDD$$anonfun$27.apply(RDD.scala:1098)
at org.apache.spark.rdd.RDD$$anonfun$27.apply(RDD.scala:1098)
at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1353)
at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1353)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
I use spark 1.2.0 with connector 1.2.0 RC 3.
The joinWithCassandraTable function used on idsOfInterest works.
I'm also curious about the differences betwween : joinWithCassandraTable / cassandraTable with a In clause / foreachPartition(withSessionDo) syntax.
Do they all request the data to the local node which acts as a coordinator ?
Is joinWithCassandraTable combine with repartitionByCassandraReplica as efficient as an async query, requesting data only to the local node ? What happen if repartitionByCassandraReplica is not applied ?
I've already asked this question on the google group forum of the cassandra connector :
https://groups.google.com/a/lists.datastax.com/forum/#!topic/spark-connector-user/b615ANGSySc
Thanks
I'll answer your second question here, and followup with the first portion if I can figure something out based on more information.
I'm also curious about the differences betwween :
joinWithCassandraTable / cassandraTable with a In clause /
foreachPartition(withSessionDo) syntax.
The cassandraTable with an in Clause will create a single spark partition. So it may be allright for very small in clauses, but the clause must be serialized from the driver to the spark application. This could be really bad for large in clauses and in general we don't want to send data back and forth from the spark driver to the executors if we don't have to.
joinWithCassandraTable and foreachPartition(withSessionDo) are very similar. The main difference is that the joinWithCassandraTable call is using the Connector transformation and reading code which should make it much easier to get Scala objects out of your Cassandra Rows. In both of these cases your data stays in RDD form and won't get serialized back to the driver. They will also both use the partitioner from the previous RDD (or last RDD which exposes a preferredLocation method) so they will be able to do work with repartitionByCassandraTable.
If repartitionByCassandraTable is not applied the data will be requested on a node that may or may not be a coordinator for the information you are requesting. This will add an additional network hop in your query but this may not be a very large performance penalty. The times at which you want to repartition before joining really depend on the total volume of data and the cost of the spark shuffle in the repartition op.