Exception with saving and loading of LinearSVM model in Mllib - apache-spark

I want to use a linear SVM for classification. Here is the problem I encounter while using MLlib. I am using CDH 5.4.4 and Spark 1.3, with the MLlib dependency specified as follows in my pom file:
<properties>
<uber.jar.name>linearsvm.jar</uber.jar.name>
<cdh.version>2.6.0-cdh5.4.4</cdh.version>
<cdh.spark.version>1.3.0-cdh5.4.4</cdh.spark.version>
</properties>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-mllib_2.10</artifactId>
<version>1.3.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>${cdh.spark.version}</version>
<exclusions>
<exclusion>
<groupId>org.xerial.snappy</groupId>
<artifactId>snappy-java</artifactId>
</exclusion>
</exclusions>
</dependency>
Here is my code to train the model
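import org.apache.spark.mllib.classification.SVMWithSGD
def main() {
val numIterations = 100
// Run training algorithm to build the model
// (training is an RDD[LabeledPoint] prepared elsewhere)
val model = SVMWithSGD.train(training, numIterations)
// Save the trained model (spark is the SparkContext)
model.save(spark,"mymodelpath")
}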
Here is another class where I load that model
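import org.apache.spark.mllib.classification.SVMModel
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
def performScoring (test: RDD[LabeledPoint] ) {
// load the saved model (spark is the SparkContext)
val savedModel = SVMModel.load(spark, "mymodelpath")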
savedModel.clearThreshold()
// Compute raw scores on the test set.
val scoreAndLabels = test.map { point =>
val score = savedModel.predict(point.features)
(score, point.label)
}
// Get evaluation metrics.
val metrics = new BinaryClassificationMetrics(scoreAndLabels)
val auROC = metrics.areaUnderROC()
println("Area under ROC = " + auROC)
}
Here is the exception that I get:
Exception in thread "main" java.lang.IncompatibleClassChangeError: Implementing class
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:190)
at org.apache.spark.mapred.SparkHadoopMapRedUtil$class.firstAvailableClass(SparkHadoopMapRedUtil.scala:61)
at org.apache.spark.mapred.SparkHadoopMapRedUtil$class.newJobContext(SparkHadoopMapRedUtil.scala:27)
at org.apache.spark.SparkHadoopWriter.newJobContext(SparkHadoopWriter.scala:40)
at org.apache.spark.SparkHadoopWriter.getJobContext(SparkHadoopWriter.scala:182)
at org.apache.spark.SparkHadoopWriter.preSetup(SparkHadoopWriter.scala:64)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1057)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:954)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:863)
at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1290)
at org.apache.spark.mllib.classification.impl.GLMClassificationModel$SaveLoadV1_0$.save(GLMClassificationModel.scala:61)
at org.apache.spark.mllib.classification.SVMModel.save(SVM.scala:84)
at LinearSVM$.main(LinearSVM.scala:32)
at LinearSVM.main(LinearSVM.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)

Related

IllegalName: com/azure/core/implementation/serializer/MalformedValueException

Hi all, I'm using these dependencies:
<dependency>
<groupId>com.microsoft.graph</groupId>
<artifactId>microsoft-graph</artifactId>
<version>5.45.0</version>
</dependency>
<dependency>
<groupId>com.azure</groupId>
<artifactId>azure-identity</artifactId>
<version>1.7.3</version>
<scope>compile</scope>
</dependency>
But I'm receiving this error:
at sun.reflect.GeneratedMethodAccessor56.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
.........
at org.apache.catalina.core.StandardContext.listenerStart(StandardContext.java:4720)
at org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5154)
at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:150)
at org.apache.catalina.core.ContainerBase$StartChild.call(ContainerBase.java:1409)
at org.apache.catalina.core.ContainerBase$StartChild.call(ContainerBase.java:1399)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.NoClassDefFoundError: IllegalName: com/azure/core/implementation/serializer/MalformedValueException
at java.lang.ClassLoader.preDefineClass(ClassLoader.java:647)
at java.lang.ClassLoader.defineClass(ClassLoader.java:754)
at java.lang.ClassLoader.defineClass(ClassLoader.java:635)
Have you any idea?

Hive Warehouse Connector + Spark = signer information does not match signer information of other classes in the same package

I'm trying to use the Hive Warehouse Connector with Spark on HDP 3.1 and I get an exception even with the simplest example (below).
The class causing the problem, JaninoRuntimeException, is present both in org.codehaus.janino:janino:jar:3.0.8 (a dependency of spark_sql) and in com.hortonworks.hive:hive-warehouse-connector_2.11:jar.
I've tried excluding the janino library from spark_sql, but that resulted in other janino classes going missing. And I need HWC for the new functionality.
Has anyone had the same error? Any ideas how to deal with it?
This is the error I'm getting:
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.intellij.rt.execution.CommandLineWrapper.main(CommandLineWrapper.java:66)
Caused by: java.lang.SecurityException: class "org.codehaus.janino.JaninoRuntimeException"'s signer information does not match signer information of other classes in the same package
at java.lang.ClassLoader.checkCerts(ClassLoader.java:898)
at java.lang.ClassLoader.preDefineClass(ClassLoader.java:668)
at java.lang.ClassLoader.defineClass(ClassLoader.java:761)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection$.create(GenerateSafeProjection.scala:197)
at org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection$.create(GenerateSafeProjection.scala:36)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1321)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3277)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2489)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2489)
at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3259)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3258)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2489)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2703)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:254)
at org.apache.spark.sql.Dataset.show(Dataset.scala:723)
at org.apache.spark.sql.Dataset.show(Dataset.scala:682)
at org.apache.spark.sql.Dataset.show(Dataset.scala:691)
at Main$.main(Main.scala:15)
at Main.main(Main.scala)
... 5 more
My sbt file:
name := "testHwc"
version := "0.1"
scalaVersion := "2.11.11"
resolvers += "Hortonworks repo" at "http://repo.hortonworks.com/content/repositories/releases/"
libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "3.1.1.3.1.0.0-78"
// https://mvnrepository.com/artifact/com.hortonworks.hive/hive-warehouse-connector
libraryDependencies += "com.hortonworks.hive" %% "hive-warehouse-connector" % "1.0.0.3.1.0.0-78"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.3.2.3.1.0.0-78"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.2.3.1.0.0-78"
And the source code:
import com.hortonworks.hwc.HiveWarehouseSession
import org.apache.spark.sql.SparkSession
object Main {
def main(args: Array[String]): Unit = {
val ss = SparkSession.builder()
.config("spark.sql.hive.hiveserver2.jdbc.url", "nnn")
.master("local[*]").getOrCreate()
import ss.sqlContext.implicits._
val rdd = ss.sparkContext.makeRDD(Seq(1, 2, 3, 4, 5, 6, 7))
rdd.toDF("col1").show()
val hive = HiveWarehouseSession.session(ss).build()
}
}
After some investigation I discovered that whether the error occurs depends on the order of the libraries on the classpath.
For some unknown reason, when I ran this project in IntelliJ IDEA the classpath order was random on each run, so the app failed or succeeded at random.
In the end, the HiveWarehouseConnector jar has to come after the Spark SQL jar on the classpath.
UPDATE
As suggested in this answer, the order inside IntelliJ IDEA can be changed in the Dependencies tab.
Otherwise, I was not able to solve this issue within IntelliJ: the order was always random, but when I executed the program outside of IntelliJ I could set the order I needed.
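When two jars ship the same class, it can help to check which locations on the classpath provide it and which one the JVM actually picked. The following is only a minimal diagnostic sketch using standard JDK calls; the class name ClasspathProbe and its helper methods are hypothetical names chosen for illustration, and the default class name is simply the one from the error above.

```java
import java.io.IOException;
import java.net.URL;
import java.security.CodeSource;
import java.util.Collections;
import java.util.List;

public class ClasspathProbe {

    // Converts a binary class name (dots) into the resource path a class loader looks up.
    static String resourcePath(String className) {
        return className.replace('.', '/') + ".class";
    }

    // Lists every classpath location that provides the given class.
    static List<URL> findAll(String className, ClassLoader loader) throws IOException {
        return Collections.list(loader.getResources(resourcePath(className)));
    }

    // Returns the location a class was loaded from, or null when the class
    // has no code source (as is the case for JDK bootstrap classes).
    static URL loadedFrom(Class<?> cls) {
        CodeSource src = cls.getProtectionDomain().getCodeSource();
        return src == null ? null : src.getLocation();
    }

    public static void main(String[] args) throws Exception {
        String name = args.length > 0 ? args[0] : "org.codehaus.janino.JaninoRuntimeException";
        List<URL> hits = findAll(name, ClassLoader.getSystemClassLoader());
        System.out.println(hits.size() + " location(s) provide " + name);
        for (URL u : hits) {
            System.out.println("  " + u);
        }
    }
}
```

Run with the same classpath as the failing application: if more than one location is printed, the class is duplicated, and the classpath order decides which copy is used.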
In Maven projects, I solve it like this:
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.3.1</version>
<exclusions>
<exclusion>
<groupId>org.codehaus.janino</groupId>
<artifactId>janino</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>org.codehaus.janino</groupId>
<artifactId>janino</artifactId>
<version>3.0.10</version>
</dependency>

Structured Spark Streaming throwing java.lang.NoSuchMethodError: org.apache.spark.sql.SQLContext.internalCreateDataFrame

I'm starting with Structured Spark Streaming with Kafka source and was following a simple tutorial. My Kafka server is OSS kafka. My producer source code is as follows
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;
public class LogGenerator {
public static void main(String[] args) throws Exception {
Properties prop = new Properties();
prop.load(ClassLoader.getSystemResourceAsStream("kafka-producer.properties"));
KafkaProducer<String, String> producer = new KafkaProducer<String, String>(prop);
for (int i=0; i<10000; i++) {
System.out.printf("%d ", i);
producer.send(new ProducerRecord<>("my-log-topic", Integer.toString(i), Integer.toString(i)));
}
}
}
The producer writes just (0,0) through (9999,9999).
The Structured Streaming code to read from this topic is as follows:
import org.apache.spark.sql.SparkSession
object DimaStreamer {
//initialize configs
val spark: SparkSession = SparkSession.builder().master("local").appName("StreamingData").getOrCreate()
def main(args: Array[String]): Unit = {
val dfRefundStream = spark.readStream.format("kafka")
.option("kafka.bootstrap.servers", "0.0.0.0:9091,0.0.0.0:9092")
.option("subscribe", "my-log-topic")
.load()
import org.apache.spark.sql.functions._
dfRefundStream.printSchema()
dfRefundStream.select(col("value")).writeStream
.format("console")
.queryName("inittest")
.start()
.awaitTermination()
}
}
It's a Maven project. The job is executed as follows, with the --jars option used to pass the dependency jars as a comma-separated list.
spark-submit --class com.apple.arcadia.solr.batch.DimaStreamer --jars $JARSPATH target/mainproject-1.0-SNAPSHOT.jar
The job throws the following exception
19/06/13 15:10:05 INFO internals.ConsumerCoordinator: Setting newly assigned partitions [my-log-topic-1, my-log-topic-0] for group spark-kafka-source-0ce55060-09ad-430a-9916-e0063d8103fe--48657453-driver-0
19/06/13 15:10:06 INFO kafka010.KafkaSource: Initial offsets: {"my-log-topic":{"1":9878,"0":10122}}
19/06/13 15:10:06 INFO streaming.StreamExecution: Committed offsets for batch 0. Metadata OffsetSeqMetadata(0,1560463806021,Map(spark.sql.shuffle.partitions -> 200))
19/06/13 15:10:06 INFO kafka010.KafkaSource: GetBatch called with start = None, end = {"my-log-topic":{"1":9878,"0":10122}}
19/06/13 15:10:06 INFO kafka010.KafkaSource: Partitions added: Map()
19/06/13 15:10:06 INFO kafka010.KafkaSource: GetBatch generating RDD of offset range: KafkaSourceRDDOffsetRange(my-log-topic-0,10122,10122,None), KafkaSourceRDDOffsetRange(my-log-topic-1,9878,9878,None)
19/06/13 15:10:06 ERROR streaming.StreamExecution: Query inittest [id = 305e10aa-a446-4f42-a4e9-8d2372250fa8, runId = 2218049f-08a4-40bf-9b46-b7e898321b85] terminated with error
java.lang.NoSuchMethodError: org.apache.spark.sql.SQLContext.internalCreateDataFrame(Lorg/apache/spark/rdd/RDD;Lorg/apache/spark/sql/types/StructType;Z)Lorg/apache/spark/sql/Dataset;
at org.apache.spark.sql.kafka010.KafkaSource.getBatch(KafkaSource.scala:301)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$2$$anonfun$apply$7.apply(StreamExecution.scala:614)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$2$$anonfun$apply$7.apply(StreamExecution.scala:610)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at org.apache.spark.sql.execution.streaming.StreamProgress.foreach(StreamProgress.scala:25)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at org.apache.spark.sql.execution.streaming.StreamProgress.flatMap(StreamProgress.scala:25)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$2.apply(StreamExecution.scala:610)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$2.apply(StreamExecution.scala:610)
at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:279)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch(StreamExecution.scala:609)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(StreamExecution.scala:306)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$apply$mcZ$sp$1.apply(StreamExecution.scala:294)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$apply$mcZ$sp$1.apply(StreamExecution.scala:294)
at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:279)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1.apply$mcZ$sp(StreamExecution.scala:294)
at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches(StreamExecution.scala:290)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:206)
Exception in thread "stream execution thread for inittest [id = 305e10aa-a446-4f42-a4e9-8d2372250fa8, runId = 2218049f-08a4-40bf-9b46-b7e898321b85]" java.lang.NoSuchMethodError: org.apache.spark.sql.SQLContext.internalCreateDataFrame(Lorg/apache/spark/rdd/RDD;Lorg/apache/spark/sql/types/StructType;Z)Lorg/apache/spark/sql/Dataset;
at org.apache.spark.sql.kafka010.KafkaSource.getBatch(KafkaSource.scala:301)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$2$$anonfun$apply$7.apply(StreamExecution.scala:614)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$2$$anonfun$apply$7.apply(StreamExecution.scala:610)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at org.apache.spark.sql.execution.streaming.StreamProgress.foreach(StreamProgress.scala:25)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at org.apache.spark.sql.execution.streaming.StreamProgress.flatMap(StreamProgress.scala:25)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$2.apply(StreamExecution.scala:610)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$2.apply(StreamExecution.scala:610)
at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:279)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch(StreamExecution.scala:609)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(StreamExecution.scala:306)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$apply$mcZ$sp$1.apply(StreamExecution.scala:294)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$apply$mcZ$sp$1.apply(StreamExecution.scala:294)
at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:279)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1.apply$mcZ$sp(StreamExecution.scala:294)
at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches(StreamExecution.scala:290)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:206)
Exception in thread "main" org.apache.spark.sql.streaming.StreamingQueryException: org.apache.spark.sql.SQLContext.internalCreateDataFrame(Lorg/apache/spark/rdd/RDD;Lorg/apache/spark/sql/types/StructType;Z)Lorg/apache/spark/sql/Dataset;
=== Streaming Query ===
Identifier: inittest [id = 305e10aa-a446-4f42-a4e9-8d2372250fa8, runId = 2218049f-08a4-40bf-9b46-b7e898321b85]
Current Committed Offsets: {}
Current Available Offsets: {KafkaSource[Subscribe[my-log-topic]]: {"my-log-topic":{"1":9878,"0":10122}}}
Current State: ACTIVE
Thread State: RUNNABLE
Logical Plan:
Project [value#1]
+- StreamingExecutionRelation KafkaSource[Subscribe[my-log-topic]], [key#0, value#1, topic#2, partition#3, offset#4L, timestamp#5, timestampType#6]
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches(StreamExecution.scala:343)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:206)
Caused by: java.lang.NoSuchMethodError: org.apache.spark.sql.SQLContext.internalCreateDataFrame(Lorg/apache/spark/rdd/RDD;Lorg/apache/spark/sql/types/StructType;Z)Lorg/apache/spark/sql/Dataset;
at org.apache.spark.sql.kafka010.KafkaSource.getBatch(KafkaSource.scala:301)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$2$$anonfun$apply$7.apply(StreamExecution.scala:614)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$2$$anonfun$apply$7.apply(StreamExecution.scala:610)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at org.apache.spark.sql.execution.streaming.StreamProgress.foreach(StreamProgress.scala:25)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at org.apache.spark.sql.execution.streaming.StreamProgress.flatMap(StreamProgress.scala:25)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$2.apply(StreamExecution.scala:610)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$2.apply(StreamExecution.scala:610)
at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:279)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch(StreamExecution.scala:609)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(StreamExecution.scala:306)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$apply$mcZ$sp$1.apply(StreamExecution.scala:294)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$apply$mcZ$sp$1.apply(StreamExecution.scala:294)
at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:279)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1.apply$mcZ$sp(StreamExecution.scala:294)
at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches(StreamExecution.scala:290)
... 1 more
Based on the exception, it seems that kafka010.KafkaSource calls a method with the signature SQLContext.internalCreateDataFrame(rdd, schema, boolean) that is missing at runtime.
Upon checking the source code here [https://github.com/apache/spark/blob/master/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSource.scala],
I see that the method exists. Has anyone seen this issue and resolved it? If so, could you please help me understand what the problem was?
Finally, the relevant dependencies in the POM are:
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.3.1</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.3.1</version>
<classifier>tests</classifier>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.3.1</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka-0-10 -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
<version>2.3.1</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql-kafka-0-10_2.11</artifactId>
<version>2.3.1</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.3.1</version>
</dependency>
I got this running after noticing that the Spark installation I was running spark-submit from was version 2.2, while the libraries (as you can see) were all 2.3.1. I changed the Spark version to match and was able to run it successfully.
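A NoSuchMethodError of this kind means the class found at runtime does not have a method that the calling code was compiled against. Besides comparing the output of spark-submit --version with the versions in the POM, one can probe the runtime classpath directly. The following is only a sketch using plain JDK reflection; MethodProbe and declaresMethod are hypothetical names, and the parameter count of 3 is taken from the descriptor in the error message above (RDD, StructType, boolean).

```java
import java.lang.reflect.Method;

public class MethodProbe {

    // Returns true if the named class can be loaded and declares a method
    // (at any visibility) with the given name and number of parameters.
    static boolean declaresMethod(String className, String methodName, int paramCount) {
        try {
            // Load without initializing, so no static initializers run.
            Class<?> cls = Class.forName(className, false, MethodProbe.class.getClassLoader());
            for (Method m : cls.getDeclaredMethods()) {
                if (m.getName().equals(methodName) && m.getParameterCount() == paramCount) {
                    return true;
                }
            }
            return false;
        } catch (ClassNotFoundException | LinkageError e) {
            // The class is absent or cannot be linked against this classpath.
            return false;
        }
    }

    public static void main(String[] args) {
        boolean found = declaresMethod(
                "org.apache.spark.sql.SQLContext", "internalCreateDataFrame", 3);
        System.out.println("SQLContext.internalCreateDataFrame with 3 parameters present: " + found);
    }
}
```

Running this with the same classpath that spark-submit uses should show whether the Spark SQL jar visible at runtime matches the one the Kafka source was built against.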

Spark Cassandra Java Connection NoSuchMethodError or NoClassDefFoundError

From a Spark Java app submitted to the Spark cluster hosted on my machine, I am trying to connect to a Cassandra DB hosted on my machine at 127.0.0.1:9042, and my Spring Boot application is failing to start.
Approach 1 -
Based on the Spark-Cassandra-Connector link, I included the following in the POM file:
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.0.0</version>
</dependency>
<dependency>
<groupId>com.datastax.spark</groupId>
<artifactId>spark-cassandra-connector_2.11</artifactId>
<version>2.0.0-M3</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<version>2.0.0</version>
</dependency>
Approach 1 - NoSuchMethodError - Log File:
16/09/08 15:12:50 ERROR SpringApplication: Application startup failed
java.lang.NoSuchMethodError: com.datastax.driver.core.KeyspaceMetadata.getMaterializedViews()Ljava/util/Collection;
at com.datastax.spark.connector.cql.Schema$.com$datastax$spark$connector$cql$Schema$$fetchTables$1(Schema.scala:281)
at com.datastax.spark.connector.cql.Schema$$anonfun$com$datastax$spark$connector$cql$Schema$$fetchKeyspaces$1$2.apply(Schema.scala:305)
at com.datastax.spark.connector.cql.Schema$$anonfun$com$datastax$spark$connector$cql$Schema$$fetchKeyspaces$1$2.apply(Schema.scala:304)
at scala.collection.TraversableLike$WithFilter$$anonfun$map$2.apply(TraversableLike.scala:683)
at scala.collection.immutable.HashSet$HashSet1.foreach(HashSet.scala:316)
at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:972)
at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:972)
at scala.collection.TraversableLike$WithFilter.map(TraversableLike.scala:682)
at com.datastax.spark.connector.cql.Schema$.com$datastax$spark$connector$cql$Schema$$fetchKeyspaces$1(Schema.scala:304)
at com.datastax.spark.connector.cql.Schema$$anonfun$fromCassandra$1.apply(Schema.scala:325)
at com.datastax.spark.connector.cql.Schema$$anonfun$fromCassandra$1.apply(Schema.scala:322)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withClusterDo$1.apply(CassandraConnector.scala:122)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withClusterDo$1.apply(CassandraConnector.scala:121)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:111)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:110)
at com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:140)
at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:110)
at com.datastax.spark.connector.cql.CassandraConnector.withClusterDo(CassandraConnector.scala:121)
at com.datastax.spark.connector.cql.Schema$.fromCassandra(Schema.scala:322)
at com.datastax.spark.connector.cql.Schema$.tableFromCassandra(Schema.scala:342)
at com.datastax.spark.connector.rdd.CassandraTableRowReaderProvider$class.tableDef(CassandraTableRowReaderProvider.scala:50)
at com.datastax.spark.connector.rdd.CassandraTableScanRDD.tableDef$lzycompute(CassandraTableScanRDD.scala:60)
at com.datastax.spark.connector.rdd.CassandraTableScanRDD.tableDef(CassandraTableScanRDD.scala:60)
at com.datastax.spark.connector.rdd.CassandraTableRowReaderProvider$class.verify(CassandraTableRowReaderProvider.scala:137)
at com.datastax.spark.connector.rdd.CassandraTableScanRDD.verify(CassandraTableScanRDD.scala:60)
at com.datastax.spark.connector.rdd.CassandraTableScanRDD.getPartitions(CassandraTableScanRDD.scala:232)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1911)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:875)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:873)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
at org.apache.spark.rdd.RDD.foreach(RDD.scala:873)
at org.apache.spark.api.java.JavaRDDLike$class.foreach(JavaRDDLike.scala:350)
at org.apache.spark.api.java.AbstractJavaRDDLike.foreach(JavaRDDLike.scala:45)
at com.initech.myapp.cassandra.service.CassandraDataService.getMatches(CassandraDataService.java:45)
at com.initech.myapp.processunit.MySparkApp.receive(MySparkApp.java:120)
at com.initech.myapp.processunit.MySparkApp.process(MySparkApp.java:61)
at com.initech.myapp.processunit.MySparkApp.run(MySparkApp.java:144)
at org.springframework.boot.SpringApplication.callRunner(SpringApplication.java:789)
at org.springframework.boot.SpringApplication.callRunners(SpringApplication.java:779)
at org.springframework.boot.SpringApplication.afterRefresh(SpringApplication.java:769)
at org.springframework.boot.SpringApplication.run(SpringApplication.java:314)
at org.springframework.boot.SpringApplication.run(SpringApplication.java:1185)
at org.springframework.boot.SpringApplication.run(SpringApplication.java:1174)
at com.initech.myapp.MySparkAppBootApp.main(MyAppProcessingUnitsApplication.java:20)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.springframework.boot.loader.MainMethodRunner.run(MainMethodRunner.java:48)
at org.springframework.boot.loader.Launcher.launch(Launcher.java:87)
at org.springframework.boot.loader.Launcher.launch(Launcher.java:50)
at org.springframework.boot.loader.JarLauncher.main(JarLauncher.java:58)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:58)
at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
16/09/08 15:12:50 INFO AnnotationConfigApplicationContext: Closing org.springframework.context.annotation.AnnotationConfigApplicationContext@3381b4fc: startup date [Thu Sep 08 15:12:40 PDT 2016]; root of context hierarchy
Approach 2 -
Since what I am developing is a Java Spark app, I thought of using the Spark-Cassandra-Connector-Java and included the following in the POM file:
<dependency>
<groupId>com.datastax.spark</groupId>
<artifactId>spark-cassandra-connector_2.11</artifactId>
<version>2.0.0-M3</version>
</dependency>
<dependency>
<groupId>com.datastax.spark</groupId>
<artifactId>spark-cassandra-connector-java_2.11</artifactId>
<version>1.2.6</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.0.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<version>2.0.0</version>
</dependency>
and ended up with the error below.
Approach 2 - SelectableColumnRef NoClassDefFoundError - Log File:
16/09/08 16:28:07 ERROR SpringApplication: Application startup failed
java.lang.NoClassDefFoundError: com/datastax/spark/connector/SelectableColumnRef
at com.initech.myApp.cassandra.service.CassandraDataService.getMatches(CassandraDataService.java:41)
My Spark main method calls the process() method below:
public boolean process() throws InterruptedException {
    logger.debug("In the process() method");
    SparkConf sparkConf = new SparkConf().setAppName("My Process Unit");
    sparkConf.set("spark.cassandra.connection.host", "127.0.0.1");
    sparkConf.set("spark.cassandra.connection.port", "9042");
    logger.debug("SparkConf = " + sparkConf);
    JavaStreamingContext javaStreamingContext = new JavaStreamingContext(sparkConf, new Duration(1000));
    logger.debug("JavaStreamingContext = " + javaStreamingContext);
    JavaSparkContext javaSparkContext = javaStreamingContext.sparkContext();
    logger.debug("Java Spark context = " + javaSparkContext);
    JavaRDD<MyData> myDataJavaRDD = receive(javaSparkContext);
    myDataJavaRDD.foreach(myData -> {
        logger.debug("myData = " + myData);
    });
    javaStreamingContext.start();
    javaStreamingContext.awaitTermination();
    return true;
}
which calls the receive() method below:
private JavaRDD<MyData> receive(JavaSparkContext javaSparkContext) {
    logger.debug("receive method called...");
    List<String> myAppConfigsStrings = myAppConfiguration.get();
    logger.debug("Received ..." + myAppConfigsStrings);
    for (String myAppConfigStr : myAppConfigsStrings) {
        ObjectMapper mapper = new ObjectMapper();
        MyAppConfig myAppConfig;
        try {
            logger.debug("Parsing the myAppConfigStr..." + myAppConfigStr);
            myAppConfig = mapper.readValue(myAppConfigStr, MyAppConfig.class);
            logger.debug("Parse Complete...");
            // Check for matching data in Cassandra
            JavaRDD<MyData> cassandraRowsRDD = cassandraDataService.getMatches(myAppConfig, javaSparkContext);
            cassandraRowsRDD.foreach(myData -> {
                logger.debug("myData = " + myData);
            });
            return cassandraRowsRDD;
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
    return null;
}
Eventually calling the Cassandra Data Service getMatches() method below:
@Service
public class CassandraDataService implements Serializable {
    private static final Log logger = LogFactory.getLog(CassandraDataService.class);
    public JavaRDD<MyData> getMatches(MyAppConfig myAppConfig, JavaSparkContext javaSparkContext) {
        logger.debug("Creating the MyDataID...");
        MyDataID myDataID = new MyDataID();
        myDataID.set...(myAppConfig.get...);
        myDataID.set...(myAppConfig.get...);
        myDataID.set...(myAppConfig.get...);
        logger.debug("MyDataID = " + myDataID);
        JavaRDD<MyData> cassandraRowsRDD = javaFunctions(javaSparkContext).cassandraTable("myKeySpace", "myData", mapRowTo(MyData.class));
        cassandraRowsRDD.foreach(myData -> {
            logger.debug("====== Cassandra Data Service ========");
            logger.debug("myData = " + myData);
            logger.debug("====== Cassandra Data Service ========");
        });
        return cassandraRowsRDD;
    }
}
Has anyone experienced a similar error, or could anyone point me in the right direction?
I have tried googling and reading through several posts, but none of them helped. Thanks.
Update 9/9/2016 2:15 PM PST
I tried the approach above. Here is what I did:
Started a Spark cluster with one worker.
Submitted my Spark app, packaged as a Spring Boot uber JAR, using the spark-submit command below:
./bin/spark-submit --class org.springframework.boot.loader.JarLauncher --master spark://localhost:6066 --deploy-mode cluster /Users/apple/Repos/Initech/Officespace/target/my-spring-spark-boot-streaming-app-0.1-SNAPSHOT.jar
The Spark driver program started successfully and launched my Spark app, which was put in the "WAITING" state because the only running worker had been allocated to the driver program.
I then started another worker, and the app then failed on that worker with "java.lang.ClassNotFoundException: com.datastax.spark.connector.rdd.partitioner.CassandraPartition". The stack trace is below.
In case it is useful, here is the stack I am using:
1. cqlsh 5.0.1 | Cassandra 2.2.7 | CQL spec 3.3.1
2. Spark - 2.0.0
3. Spring Boot - 1.4.0.RELEASE
4. JARs listed in Approach 1 above
Exception stack trace
16/09/09 14:13:24 ERROR SpringApplication: Application startup failed
java.lang.IllegalStateException: Failed to execute ApplicationRunner
at org.springframework.boot.SpringApplication.callRunner(SpringApplication.java:792)
at org.springframework.boot.SpringApplication.callRunners(SpringApplication.java:779)
at org.springframework.boot.SpringApplication.afterRefresh(SpringApplication.java:769)
at org.springframework.boot.SpringApplication.run(SpringApplication.java:314)
at org.springframework.boot.SpringApplication.run(SpringApplication.java:1185)
at org.springframework.boot.SpringApplication.run(SpringApplication.java:1174)
at com.initech.officespace.MySpringBootSparkApp.main(MySpringBootSparkApp.java:23)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.springframework.boot.loader.MainMethodRunner.run(MainMethodRunner.java:48)
at org.springframework.boot.loader.Launcher.launch(Launcher.java:87)
at org.springframework.boot.loader.Launcher.launch(Launcher.java:50)
at org.springframework.boot.loader.JarLauncher.main(JarLauncher.java:58)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:58)
at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 6, 192.168.0.30): java.lang.ClassNotFoundException: com.datastax.spark.connector.rdd.partitioner.CassandraPartition
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:67)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1620)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1521)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1781)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2018)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1942)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1808)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:373)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:253)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1450)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1438)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1437)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1437)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1659)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1618)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1607)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1871)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1884)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1897)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1911)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:875)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:873)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
at org.apache.spark.rdd.RDD.foreach(RDD.scala:873)
at org.apache.spark.api.java.JavaRDDLike$class.foreach(JavaRDDLike.scala:350)
at org.apache.spark.api.java.AbstractJavaRDDLike.foreach(JavaRDDLike.scala:45)
at com.initech.officespace.cassandra.service.CassandraDataService.getMatches(CassandraDataService.java:43)
at com.initech.officespace.processunit.MyApp.receive(MyApp.java:120)
at com.initech.officespace.processunit.MyApp.process(MyApp.java:61)
at com.initech.officespace.processunit.MyApp.run(MyApp.java:144)
at org.springframework.boot.SpringApplication.callRunner(SpringApplication.java:789)
... 20 more
Caused by: java.lang.ClassNotFoundException: com.datastax.spark.connector.rdd.partitioner.CassandraPartition
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:67)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1620)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1521)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1781)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2018)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1942)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1808)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:373)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:253)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
16/09/09 14:13:24 INFO AnnotationConfigApplicationContext: Closing org.springframework.context.annotation.AnnotationConfigApplicationContext@3381b4fc: startup date [Fri Sep 09 14:10:40 PDT 2016]; root of context hierarchy
Update 2 on 9/9/2016 3:20 PM PST
The issue is now resolved, based on the answer provided by RussS in "Issues with datastax spark-cassandra connector".
After updating my spark-submit command to the one below, the worker is able to pick up the connector and start working on the RDDs :)
./bin/spark-submit --class org.springframework.boot.loader.JarLauncher --master spark://localhost:6066 --deploy-mode cluster --packages com.datastax.spark:spark-cassandra-connector_2.11:2.0.0-M3 /Users/apple/Repos/Initech/Officespace/target/my-spring-spark-boot-streaming-app-0.1-SNAPSHOT.jar
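When chasing a ClassNotFoundException like the ones above, it can help to check whether a given JVM can see the connector classes at all, and which JAR they come from. The following is a minimal sketch using only the Java standard library; ClasspathProbe and its methods are hypothetical helper names, not part of Spark or the connector. Note that it only reports on the JVM it runs in, so a result from the driver says nothing about the executors unless the same check is run inside a task.

```java
import java.net.URL;
import java.security.CodeSource;

/** Hypothetical diagnostic helper; not part of Spark or the Cassandra connector. */
final class ClasspathProbe {

    /** Returns true if the named class can be located through the given class loader. */
    static boolean isLoadable(String className, ClassLoader loader) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, loader);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    /** Returns the JAR or directory a class was loaded from, or a placeholder if unknown. */
    static String origin(Class<?> cls) {
        CodeSource source = cls.getProtectionDomain().getCodeSource();
        if (source == null) {
            return "(bootstrap or unknown)";
        }
        URL location = source.getLocation();
        return location == null ? "(unknown)" : location.toString();
    }

    public static void main(String[] args) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        // Class names taken from the exceptions reported in this question
        String[] names = {
            "com.datastax.spark.connector.rdd.partitioner.CassandraPartition",
            "com.datastax.spark.connector.SelectableColumnRef"
        };
        for (String name : names) {
            System.out.println(name + " loadable: " + isLoadable(name, loader));
        }
    }
}
```

If a class is reported as loadable, calling origin() on it shows which JAR supplied it, which is useful when two connector versions may be on the classpath at once.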
The solution could be different in your case.
I had this exception when I tried to run Spark with Cassandra from my PC (as the driver) in Java.
You can add the spark-cassandra-connector JAR to the SparkContext. In my case it looked like the example below:
JavaSparkContext sc = new JavaSparkContext(conf);
sc.addJar("./build/libs/spark-cassandra-connector_2.11-2.4.2.jar"); // location of driver could be different.
com.datastax.driver.core.KeyspaceMetadata.getMaterializedViews is available starting with version 3.0 of the driver.
Try adding this dependency to Approach 1:
<dependency>
<groupId>com.datastax.cassandra</groupId>
<artifactId>cassandra-driver-core</artifactId>
<version>3.1.0</version>
</dependency>
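To see whether the driver that actually ends up on the classpath has the method in question, a reflection check can be run in the same JVM. This is a rough sketch using only the standard library; MethodProbe and hasPublicMethod are hypothetical names introduced for illustration, and the class and method names passed in main are the ones mentioned in this answer.

```java
import java.lang.reflect.Method;

/** Hypothetical helper that checks for a public method by name via reflection. */
final class MethodProbe {

    /** Returns true if the class can be loaded and exposes a public method with this name. */
    static boolean hasPublicMethod(String className, String methodName) {
        try {
            Class<?> cls = Class.forName(className, false, MethodProbe.class.getClassLoader());
            // getMethods() returns public methods, including inherited ones
            for (Method method : cls.getMethods()) {
                if (method.getName().equals(methodName)) {
                    return true;
                }
            }
            return false;
        } catch (ClassNotFoundException | LinkageError e) {
            // Class is missing or cannot be linked on this classpath
            return false;
        }
    }

    public static void main(String[] args) {
        boolean present = hasPublicMethod(
            "com.datastax.driver.core.KeyspaceMetadata", "getMaterializedViews");
        System.out.println("getMaterializedViews available: " + present);
    }
}
```

A result of false means either that the driver is not on the classpath at all or that an older driver without the method was picked up; the check does not distinguish between the two.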

Spark 1.5.1 Create RDD from Cassandra (ClassNotFoundException: com.datastax.spark.connector.japi.rdd.CassandraTableScanJavaRDD)

I am trying to fetch records from Cassandra and create an RDD.
JavaRDD<Encounters> rdd = javaFunctions(ctx).cassandraTable("keyspace1", "employee", mapRowTo(Encounters.class));
I get this error when submitting the job on Spark 1.5.1:
Exception in thread "main" java.lang.NoClassDefFoundError: com/datastax/spark/connector/japi/rdd/CassandraTableScanJavaRDD
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:274)
at org.apache.spark.util.Utils$.classForName(Utils.scala:173)
at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:56)
at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
Caused by: java.lang.ClassNotFoundException: com.datastax.spark.connector.japi.rdd.CassandraTableScanJavaRDD
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
Current Dependencies:
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>1.5.1</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>1.5.1</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>2.7.1</version>
</dependency>
<dependency>
<groupId>com.datastax.spark</groupId>
<artifactId>spark-cassandra-connector-java_2.11</artifactId>
<version>1.5.0-M2</version>
</dependency>
<dependency>
<groupId>com.datastax.cassandra</groupId>
<artifactId>cassandra-driver-core</artifactId>
<version>3.0.0-alpha4</version>
</dependency>
Java Code:
import com.tempTable.Encounters;
import java.util.Date;
import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.api.java.JavaRDD;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapRowTo;
Long now = new Date().getTime();
// Each setter returns the SparkConf, so the calls are chained with a leading dot
SparkConf conf = new SparkConf(true)
    .setAppName("SparkSQLJob_" + now)
    .set("spark.cassandra.connection.host", "192.168.1.75")
    .set("spark.cassandra.connection.port", "9042");
SparkContext ctx = new SparkContext(conf);
// The row class passed to mapRowTo must match the RDD element type
JavaRDD<Encounters> rdd = javaFunctions(ctx).cassandraTable("keyspace1", "employee", mapRowTo(Encounters.class));
System.out.println("rdd count = " + rdd.count());
Is there an issue with the versions in the dependencies?
Please help me resolve this error.
Thanks in advance.
You need to add the JAR file via SparkConf:
.setJars(Seq(System.getProperty("user.dir") + "/target/scala-2.10/sparktest.jar"))
For more information, refer to http://www.datastax.com/dev/blog/common-spark-troubleshooting
The simple answer is:
you need all the dependencies bundled inside the JAR file
or
the executor machines should have all your dependent JAR files on their classpath.
Solution for building a fatJar using gradle:
buildscript {
dependencies {
classpath 'com.github.jengelman.gradle.plugins:shadow:1.2.2'
}
repositories {
jcenter()
}
}
apply plugin: 'com.github.johnrengelman.shadow'
Then call "gradle shadowJar" to build your JAR file. After that, submit your job again; this should resolve the problem.
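Before submitting, it may be worth confirming that the connector classes really ended up inside the fat JAR. The sketch below uses java.util.jar.JarFile from the standard library to look for entries under a package path; JarContents and containsPrefix are hypothetical names, and the JAR path is expected as a command-line argument. It assumes a flattened JAR such as the one the shadow plugin produces; JARs nested inside another JAR would not be matched by a class path prefix.

```java
import java.io.IOException;
import java.util.jar.JarFile;

/** Hypothetical helper that checks whether a JAR contains entries under a given path. */
final class JarContents {

    /** Returns true if any entry name in the JAR starts with the given prefix. */
    static boolean containsPrefix(String jarPath, String prefix) throws IOException {
        try (JarFile jar = new JarFile(jarPath)) {
            return jar.stream().anyMatch(entry -> entry.getName().startsWith(prefix));
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("usage: java JarContents <path-to-fat-jar>");
            return;
        }
        // Package path of the connector classes named in the exceptions above
        String prefix = "com/datastax/spark/connector/";
        System.out.println(prefix + " bundled: " + containsPrefix(args[0], prefix));
    }
}
```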

Resources