Exception while submit spark job on yarn cluster with remote jvm - apache-spark

I am using below java code to submit job on yarn-cluster.
public ApplicationId submitQuery(String requestId, String query,String fileLocations) {
String driverJar = getDriverJar();
String driverClass = propertyService.getAppPropertyValue(TypeString.QUERY_DRIVER_CLASS);
String driverAppName = propertyService.getAppPropertyValue(TypeString.DRIVER_APP_NAME);
String extraJarsNeeded = propertyService.getAppPropertyValue(TypeString.DRIVER_EXTRA_JARS_NEEDED);
String[] args = new String[] {
// the name of your application
"--name",
driverAppName,
// memory for driver (optional)
"--driver-memory",
"1000M",
// path to your application's JAR file
// required in yarn-cluster mode
"--jar",
"local:/home/ankit/Repository/Personalization/rtis/Cust360QueryDriver/target/SnapdealCustomer360QueryDriver-jar-with-selective-dependencies.jar",
"--addJars",
"local:/home/ankit/Downloads/lib/spark-assembly-1.3.1-hadoop2.4.0.jar,local:/home/ankit/.m2/repository/org/slf4j/slf4j-api/1.7.5/slf4j-api-1.7.5.jar,local:/home/ankit/.m2/repository/org/slf4j/slf4j-log4j12/1.7.5/slf4j-log4j12-1.7.5.jar",
// name of your application's main class (required)
"--class",
driverClass,
"--arg",
requestId,
"--arg",
query,
"--arg",
fileLocations,
"--arg",
"yarn-client"
};
System.setProperty("HADOOP_CONF_DIR", "/home/hduser/hadoop-2.7.0/etc/hadoop");
Configuration config = new Configuration();
config.set("yarn.resourcemanager.address", propertyService.getAppPropertyValue(TypeString.RESOURCE_MANGER_URL));
config.set("fs.default.name", propertyService.getAppPropertyValue(TypeString.FS_DEFAULT_NAME));
System.setProperty("SPARK_YARN_MODE", "true");
SparkConf sparkConf = new SparkConf();
ClientArguments cArgs = new ClientArguments(args, sparkConf);
// create an instance of yarn Client client
Client client = new Client(cArgs, config, sparkConf);
ApplicationId id = client.submitApplication();
return id;
}
Job is getting submitted to yarn-cluster and i am able to retrieve application id but i am getting below Exception while running job on spark cluster.
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/Logging
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:482)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.Logging
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
... 13 more
though mentioned class in /home/ankit/Downloads/lib/spark-assembly-1.3.1-hadoop2.4.0.jar. looks like jar mentioned in --addJars is not getting added in driver's spark context.
Am i doing something wrong?? Any help would be appreciated.

Are you deploying on Cloudera's distribution ? spark.yarn.jar in the CDH 5.4 config has a 'local:' prefix for local files, but Spark version >= 1.5 does not like this, you should just use the full path name for your spark assembly. See also here.

Try building JAR without spark dependency and pass dependent jars with --jars in spark submit. Most of the times ClassNotFoundException is due to spark and application itself is dependent on same jar.
Suggested solution:
package without dependency and add dependent jars with --jars during
spark-submit
Modify application to use same version of third-party
library as spark has
Use shading in your build tool

Related

An Apache Beam pipeline on Azure HDInsight's SparkRunner

I try to get a Beam pipeline to run on Azure's HDInsight SparkRunner.
I tried first with a cluster based on Spark 2.3.0/Hadoop 2.7 (HDI 3.6) and then also 2.3.1/Hadoop 3.0 (HDI 4.0 Preview).
I tried using Apache Beam 2.2.0 and next 2.10.0-SNAPSHOT.
The spark-submit command is (for Beam 2.10.0):
JARS="wasbs:///dependency/hadoop-azure-3.1.1.3.0.2.0-50.jar,wasbs:///dependency/azure-storage-7.0.0.jar,wasbs:///dependency/beam-model-fn-execution-2.10.0-SNAPSHOT.jar,wasbs:///dependency/beam-model-job-management-2.10.0-SNAPSHOT.jar,wasbs:///dependency/beam-model-pipeline-2.10.0-SNAPSHOT.jar,wasbs:///dependency/beam-runners-core-construction-java-2.10.0-SNAPSHOT.jar,wasbs:///dependency/beam-runners-core-java-2.10.0-SNAPSHOT.jar,wasbs:///dependency/beam-runners-direct-java-2.10.0-SNAPSHOT.jar,wasbs:///dependency/beam-runners-spark-2.10.0-SNAPSHOT.jar,wasbs:///dependency/beam-sdks-java-core-2.10.0-SNAPSHOT.jar,wasbs:///dependency/beam-sdks-java-fn-execution-2.10.0-SNAPSHOT.jar,wasbs:///dependency/beam-sdks-java-io-hadoop-file-system-2.10.0-SNAPSHOT.jar,wasbs:///dependency/beam-vendor-grpc-1_13_1-0.1.jar"
spark-submit --conf spark.yarn.maxAppAttempts=1 --deploy-mode cluster --master yarn --jars $JARS --class example.MinimalWordCountJava8 wasbs:///mavenproject1-1.0-SNAPSHOT.jar --runner=SparkRunner
(initially -jars was not given the hadoop-azure and azure-storage jars, but that did not make any difference).
The main() looks like this:
public static void main(String[] args) {
JavaSparkContext ct = new JavaSparkContext();
Configuration config = ct.hadoopConfiguration();
config.set("fs.wasbs.impl", "org.apache.hadoop.fs.azure.NativeAzureFileSystem");
config.set("fs.wasb.impl", "org.apache.hadoop.fs.azure.NativeAzureFileSystem");
config.set("fs.AbstractFileSystem.wasb.impl", "org.apache.hadoop.fs.azure.Wasb");
config.set("fs.AbstractFileSystem.wasb.impl", "org.apache.hadoop.fs.azure.Wasbs");
config.set("fs.azure", "org.apache.hadoop.fs.azure.NativeAzureFileSystem");
config.set("fs.azure.account.key." + account + ".blob.core.windows.net", key);
config.set("fs.defaultFS", "wasb://" + container + "#" + account + ".blob.core.windows.net");
System.out.println("### hello.txt content:");
JavaRDD<String> content = ct.textFile("wasbs:///hello.txt");
System.out.println(content.toString());
System.out.println("### MinimalWordCountJava8");
PipelineOptions options = PipelineOptionsFactory.create();
SparkContextOptions sparkContextOptions = options.as(SparkContextOptions.class);
sparkContextOptions.setUsesProvidedSparkContext(true);
sparkContextOptions.setProvidedSparkContext(ct);
sparkContextOptions.setRunner(SparkRunner.class);
Pipeline p = Pipeline.create(sparkContextOptions);
p.apply(TextIO.read().from("hello.txt"))
.apply(FlatMapElements
.into(TypeDescriptors.strings())
.via((String word) -> Arrays.asList(word.split("[^\\p{L}]+"))))
.apply(Filter.by((String word) -> !word.isEmpty()))
.apply(Count.<String>perElement())
.apply(MapElements
.into(TypeDescriptors.strings())
.via((KV<String, Long> wordCount) -> wordCount.getKey() + ": " + wordCount.getValue()))
// CHANGE 3/3: The Google Cloud Storage path is required for outputting the results to.
.apply(TextIO.write().to("output"));
p.run().waitUntilFinish();
It fails when calling Pipeline.create(options); with this exception trace:
18/12/09 14:47:10 ERROR ApplicationMaster: User class threw exception: java.lang.IllegalArgumentException: Failed to construct Hadoop filesystem with configuration Configuration: /usr/hdp/3.0.2.0-50/hadoop/conf/core-site.xml, /usr/hdp/3.0.2.0-50/hadoop/conf/hdfs-site.xml
java.lang.IllegalArgumentException: Failed to construct Hadoop filesystem with configuration Configuration: /usr/hdp/3.0.2.0-50/hadoop/conf/core-site.xml, /usr/hdp/3.0.2.0-50/hadoop/conf/hdfs-site.xml
at org.apache.beam.sdk.io.hdfs.HadoopFileSystemRegistrar.fromOptions(HadoopFileSystemRegistrar.java:59)
at org.apache.beam.sdk.io.FileSystems.verifySchemesAreUnique(FileSystems.java:489)
at org.apache.beam.sdk.io.FileSystems.setDefaultPipelineOptions(FileSystems.java:479)
at org.apache.beam.sdk.PipelineRunner.fromOptions(PipelineRunner.java:47)
at org.apache.beam.sdk.Pipeline.create(Pipeline.java:145)
at io.aptly.mavenproject1.MinimalWordCountJava8.main(MinimalWordCountJava8.java:88)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$4.run(ApplicationMaster.scala:721)
Caused by: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "wasbs"
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3332)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3352)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3403)
at org.apache.hadoop.fs.FileSystem$Cache.getUnique(FileSystem.java:3377)
at org.apache.hadoop.fs.FileSystem.newInstance(FileSystem.java:530)
at org.apache.hadoop.fs.FileSystem.newInstance(FileSystem.java:542)
at org.apache.beam.sdk.io.hdfs.HadoopFileSystem.<init>(HadoopFileSystem.java:82)
at org.apache.beam.sdk.io.hdfs.HadoopFileSystemRegistrar.fromOptions(HadoopFileSystemRegistrar.java:56)
... 10 more
18/12/09 14:47:10 INFO ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason: User class threw exception: java.lang.IllegalArgumentException: Failed to construct Hadoop filesystem with configuration Configuration: /usr/hdp/3.0.2.0-50/hadoop/conf/core-site.xml, /usr/hdp/3.0.2.0-50/hadoop/conf/hdfs-site.xml
at org.apache.beam.sdk.io.hdfs.HadoopFileSystemRegistrar.fromOptions(HadoopFileSystemRegistrar.java:59)
at org.apache.beam.sdk.io.FileSystems.verifySchemesAreUnique(FileSystems.java:489)
at org.apache.beam.sdk.io.FileSystems.setDefaultPipelineOptions(FileSystems.java:479)
at org.apache.beam.sdk.PipelineRunner.fromOptions(PipelineRunner.java:47)
at org.apache.beam.sdk.Pipeline.create(Pipeline.java:145)
at io.aptly.mavenproject1.MinimalWordCountJava8.main(MinimalWordCountJava8.java:88)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$4.run(ApplicationMaster.scala:721)
Caused by: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "wasbs"
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3332)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3352)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3403)
at org.apache.hadoop.fs.FileSystem$Cache.getUnique(FileSystem.java:3377)
at org.apache.hadoop.fs.FileSystem.newInstance(FileSystem.java:530)
at org.apache.hadoop.fs.FileSystem.newInstance(FileSystem.java:542)
at org.apache.beam.sdk.io.hdfs.HadoopFileSystem.<init>(HadoopFileSystem.java:82)
at org.apache.beam.sdk.io.hdfs.HadoopFileSystemRegistrar.fromOptions(HadoopFileSystemRegistrar.java:56)
The submit works (the wasps:// is recognised) and reading the small wasps:///hello.txt does not fail. These cases indicate that using wasps:// is fine until that point.
It's early inside Beam, that it seems to fail.
Because of this I passed the JavaSparkContext with the PipelineOptions (with dynamic hadoop configurations that were suggested by other SO question/answers). But this did not make a difference for me.
Anyone who can guide on how to get around this issue?
From quickly digging through code and bug trackers, it looks like Azure is supported as a Hadoop filesystem starting with Hadoop 3.2.0 (code, Jira). Currently Beam is pinned to version 2.7.3. This would explain the failure in Beam's HadoopFilesystem.
It may be that spark-submit succeeded because wasbs:// is supported via a different mechanism than Hadoop's libraries or using a bundled and newer version of Hadoop.

ClassNotFoundException when submitting JAR to Spark via spark-submit

I'm struggling to submit a JAR to Apache Spark using spark-submit.
To make things easier, I've experimented using this blog post. The code is
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
object SimpleScalaSpark {
def main(args: Array[String]) {
val logFile = "/Users/toddmcgrath/Development/spark-1.6.1-bin-hadoop2.4/README.md" // I've replaced this with the path to an existing file
val conf = new SparkConf().setAppName("Simple Application").setMaster("local[*]")
val sc = new SparkContext(conf)
val logData = sc.textFile(logFile, 2).cache()
val numAs = logData.filter(line => line.contains("a")).count()
val numBs = logData.filter(line => line.contains("b")).count()
println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
}
}
I'm running building this with Intellij Idea 2017.1 and running on Spark 2.1.0. Everything is running fine when I run it in the IDE.
I then build it as a JAR and attempt to use spark-submit as follows
./spark-submit --class SimpleScalaSpark --master local[*] ~/Documents/Spark/Scala/supersimple/out/artifacts/supersimple_jar/supersimple.jar
This results in the following error
java.lang.ClassNotFoundException: SimpleScalaSpark
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:229)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:695)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
I'm at a loss as to what I'm missing...especially given that it runs as expected in the IDE.
As per your description above
,You are not giving the correct class name, so it is not able to find that class.
Just replace SimpleSparkScala with SimpleScalaSpark
Try running this command:
./spark-submit --class SimpleScalaSpark --master local[*] ~/Documents/Spark/Scala/supersimple/out/artifacts/supersimple_jar/supersimple.jar
Looks like there is an issue with your jar. You can check what classes are present in your jar by using the command:
vi supersimple.jar
If SimpleScalaSpark class does not appear in the output of the previous command, that means your jar is not built properly.
IDEs work differently from shell in many ways.
I believe for shell you need to add --jars parameter
spark submit add multiple jars in classpath
I am observing ClassNotFound on new classes I introduce. I am using a fat jar. I verified that the JAR file contains the new class file in all the copies in each node. (I am using the regular filesystem to load the Spark application, not hdfs nor an http URL).
The JAR file loaded by the worker did not have the new class I introduced. It is an older version.
The only way I found to get around the problem is to use a different filename for the JAR every time that I call spark-submit script.

AWS EMR no host: hdfs:///var/log/spark/apps

I am trying to use AWS EMR (emr-4.3.0) Spark 1.6.0, Hadoop 2.7.0
I created EMR cluster, and added Step(in AWS ERM web) with my sample jar.
It's SpringBoot application and written by Java(1.8) (I installed JDK8 in the box)
It run with following command
hadoop jar /var/lib/aws/emr/step-runner/hadoop-jars/command-runner.jar spark-submit --deploy-mode cluster --class org.springframework.boot.loader.JarLauncher s3://my-test/SparkForSpring-S1.2014.jar
I created SparkContext as following code.
SparkConf conf = new SparkConf().setAppName("SparkForSpring");
return new JavaSparkContext(conf);
but it fails with following error, I feel like it's something like not related to my application, I am new to Spark, Yarn though.
Caused by: org.springframework.beans.factory.BeanDefinitionStoreException: Factory method [public org.apache.spark.api.java.JavaSparkContext com.pivotal.demo.spark.rocket.rdd.SparkConfig.javaSparkContext()] threw exception; nested exception is java.io.IOException: Incomplete HDFS URI, no host: hdfs:///var/log/spark/apps
at org.springframework.beans.factory.support.SimpleInstantiationStrategy.instantiate(SimpleInstantiationStrategy.java:188)
at org.springframework.beans.factory.support.ConstructorResolver.instantiateUsingFactoryMethod(ConstructorResolver.java:586)
... 49 more
Caused by: java.io.IOException: Incomplete HDFS URI, no host: hdfs:///var/log/spark/apps
at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:143)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2653)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:92)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2687)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2669)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371)
at org.apache.spark.util.Utils$.getHadoopFileSystem(Utils.scala:1650)
at org.apache.spark.scheduler.EventLoggingListener.<init>(EventLoggingListener.scala:66)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:547)
at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:59)
at com.pivotal.demo.spark.rocket.rdd.SparkConfig.javaSparkContext(SparkConfig.java:35)
at com.pivotal.demo.spark.rocket.rdd.SparkConfig$$EnhancerBySpringCGLIB$$82429e1b.CGLIB$javaSparkContext$0(<generated>)
at com.pivotal.demo.spark.rocket.rdd.SparkConfig$$EnhancerBySpringCGLIB$$82429e1b$$FastClassBySpringCGLIB$$10b15a77.invoke(<generated>)
at org.springframework.cglib.proxy.MethodProxy.invokeSuper(MethodProxy.java:228)
at org.springframework.context.annotation.ConfigurationClassEnhancer$BeanMethodInterceptor.intercept(ConfigurationClassEnhancer.java:312)
at com.pivotal.demo.spark.rocket.rdd.SparkConfig$$EnhancerBySpringCGLIB$$82429e1b.javaSparkContext(<generated>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at org.springframework.beans.factory.support.SimpleInstantiationStrategy.instantiate(SimpleInstantiationStrategy.java:166)
... 50 more
I read some document but quite not sure what should I do to fix this error. A hint will be greatly helpful.
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-file-systems.html
I solved this problem by not using SpringBoot's executable jar rather I use maven shade plugin to package only spring related jar files in one jar and using system classloader. Here is full pom.xml
I got a hint from this question's answer
apache-spark 1.3.0 and yarn integration and spring-boot as a container

Spark streaming does not insert data to Cassandra

I have spark-streaming code which works in client mode: it reads data from kafka, does some processing, and use spark-cassandra-connector to insert data to cassandra.
When I use the "--deploy-mode cluster", data does not get inserted, and I get the following error:
Exception in thread "streaming-job-executor-53" java.lang.NoClassDefFoundError: com/datastax/spark/connector/ColumnSelector
at com.enerbyte.spark.jobs.wattiopipeline.WattiopipelineStreamingJob$$anonfun$main$2.apply(WattiopipelineStreamingJob.scala:94)
at com.enerbyte.spark.jobs.wattiopipeline.WattiopipelineStreamingJob$$anonfun$main$2.apply(WattiopipelineStreamingJob.scala:88)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:50)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:50)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:50)
at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:426)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:49)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:49)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:49)
at scala.util.Try$.apply(Try.scala:161)
at org.apache.spark.streaming.scheduler.Job.run(Job.scala:39)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:224)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:224)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:224)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:223)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: com.datastax.spark.connector.ColumnSelector
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
I added dependancy for connector like this:
"com.datastax.spark" %% "spark-cassandra-connector" % "1.5.0" % "provided"
This is my application code:
val measurements = KafkaUtils.createDirectStream[
Array[Byte],
Array[Byte],
DefaultDecoder,
DefaultDecoder](ssc, kafkaConfig, Set("wattio"
))
.map {
case (k, v) => {
val decoder = new AvroDecoder[WattioMeasure](null,
WattioMeasure.SCHEMA$)
decoder.fromBytes(v)
}
}
//inserting into WattioRaw
WattioFunctions.run(WattioFunctions.
processWattioRaw(measurements))(
(rdd: RDD[
WattioTenantRaw], t: Time) => {
rdd.cache()
//get all the different tenants
val differentTenants = rdd.map(a
=> a.tenant).distinct().collect()
// for each tenant, create keyspace value and flush to cassandra
differentTenants.foreach(tenant => {
val keyspace = tenant + "_readings"
rdd.filter(a => a.tenant == tenant).map(s => s.wattioRaw).saveToCassandra(keyspace, "wattio_raw")
})
rdd.unpersist(true)
}
)
ssc.checkpoint("/tmp")
ssc.start()
ssc.awaitTermination()
You need to make sure your JAR is available to the workers. The spark master will open up a file server once the execution of the job starts.
You need to specify the path to your uber jar either by using SparkContext.setJars, or via the --jars flag passed to spark-submit.
From the documentation
When using spark-submit, the application jar along with any jars included with the --jars option will be automatically transferred to the cluster. Spark uses the following URL scheme to allow different strategies for disseminating jars
Actually I solved it by removing "provided" in the dependancy list, so that sbt packaged spark-cassandra-connector to my assembly jar.
Interesting thing is that in my launch script, even when I tryed to use
spark-submit --repositories "location of my artifactory repository" --packages "spark-cassandra-connector"
or
spark-submit --jars "spark-cassandra-connector.jar"
both of them failed!
scope provided means, you expect the JDK or a container to provide the dependency at runtime and that particular dependency jar will not be part of your final application War/jar you are creating hence that error.

why the map function of sc.cassandraTable("test", "users").select("username") can not work?

Following spark-cassandra-connector's demo and Installing the Cassandra / Spark OSS Stack, under spark-shell, I tried the following snippet:
sc.stop
val conf = new SparkConf(true)
.set("spark.cassandra.connection.host", "172.21.0.131")
.set("spark.cassandra.auth.username", "adminxx")
.set("spark.cassandra.auth.password", "adminxx")
val sc = new SparkContext("172.21.0.131", "Cassandra Connector Test", conf)
val rdd = sc.cassandraTable("test", "users").select("username")
Many operators of rdd can work fine, such as:
rdd.first
rdd.count
But when I use map:
val result = rdd.map(x => 1) //just for simple
result: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[61] at map at <console>:32
Then, I run:
result.first
I got the following errors:
15/12/11 15:09:00 WARN TaskSetManager: Lost task 0.0 in stage 31.0 (TID 104, 124.250.36.124): java.lang.ClassNotFoundException:
$line346.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1
Caused by: java.lang.ClassNotFoundException: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:278)
at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:67)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
I don't know why I got such error? Any advice will be appreciated!
UPDATED:
According #RussSpitzer's answer for CassandraRdd.map( row => row.getInt("id)) does not work , java.lang.ClassNotFoundException happened!, I resolved this error through following errors, instead of using sc.stop and creating an new SparkContext, I start spark-shell with options:
bin/spark-shell -conf spark.cassandra.connection.host=172.21.0.131 --conf spark.cassandra.auth.username=adminxx --conf spark.cassandra.auth.password=adminxx
And then all steps are the same and work fine.
Russell Spitzer's answer from the spark-connector-user list:
I'm pretty sure the main problem here is that you start a context with --jars and then kill that context and then start another one. Try simplifying your code, instead of setting all of those spark conf options and creating a new contexts run your shell like. Also the jar that you want on the classpath is the connector assembly jar, not a custom build of a Scala script you want to run.
./spark-shell --conf spark.casandra.connection.host=10.129.20.80 ...
You should not need to modify the ack.wait.timeout or the executor.extraClasspath.
Spark applications normally send their compiled code as jar files to the executors. This way the function that you map is present on the executors.
The situation is more tricky in spark-shell. It has to compile and broadcast the code for your every line interactively. There is not even a class you're operating inside. It creates these fake $$iwC$$ classes to solve this.
Normally this works out well, but you may have hit a spark-shell bug. You can try to work around it by putting your code inside a class in spark-shell:
object Obj { val mapper = { x: String => 1 } }
val result = rdd.map(Obj.mapper)
But it is probably safest to implement your code as an application instead of just writing it in spark-shell.

Resources