Spark Streaming does not insert data into Cassandra

I have Spark Streaming code which works in client mode: it reads data from Kafka, does some processing, and uses spark-cassandra-connector to insert data into Cassandra.
When I use "--deploy-mode cluster", the data does not get inserted and I get the following error:
Exception in thread "streaming-job-executor-53" java.lang.NoClassDefFoundError: com/datastax/spark/connector/ColumnSelector
at com.enerbyte.spark.jobs.wattiopipeline.WattiopipelineStreamingJob$$anonfun$main$2.apply(WattiopipelineStreamingJob.scala:94)
at com.enerbyte.spark.jobs.wattiopipeline.WattiopipelineStreamingJob$$anonfun$main$2.apply(WattiopipelineStreamingJob.scala:88)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:50)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:50)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:50)
at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:426)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:49)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:49)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:49)
at scala.util.Try$.apply(Try.scala:161)
at org.apache.spark.streaming.scheduler.Job.run(Job.scala:39)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:224)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:224)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:224)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:223)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: com.datastax.spark.connector.ColumnSelector
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
I added the dependency for the connector like this:
"com.datastax.spark" %% "spark-cassandra-connector" % "1.5.0" % "provided"
This is my application code:
val measurements = KafkaUtils.createDirectStream[
    Array[Byte],
    Array[Byte],
    DefaultDecoder,
    DefaultDecoder](ssc, kafkaConfig, Set("wattio"))
  .map { case (k, v) =>
    val decoder = new AvroDecoder[WattioMeasure](null, WattioMeasure.SCHEMA$)
    decoder.fromBytes(v)
  }

// inserting into WattioRaw
WattioFunctions.run(WattioFunctions.processWattioRaw(measurements))(
  (rdd: RDD[WattioTenantRaw], t: Time) => {
    rdd.cache()
    // get all the different tenants
    val differentTenants = rdd.map(a => a.tenant).distinct().collect()
    // for each tenant, create the keyspace value and flush to Cassandra
    differentTenants.foreach(tenant => {
      val keyspace = tenant + "_readings"
      rdd.filter(a => a.tenant == tenant).map(s => s.wattioRaw).saveToCassandra(keyspace, "wattio_raw")
    })
    rdd.unpersist(true)
  }
)

ssc.checkpoint("/tmp")
ssc.start()
ssc.awaitTermination()

You need to make sure your JAR is available to the workers. The Spark master will open up a file server once execution of the job starts.
You need to specify the path to your uber jar either by using SparkContext.setJars, or via the --jars flag passed to spark-submit.
From the documentation:
When using spark-submit, the application jar along with any jars included with the --jars option will be automatically transferred to the cluster. Spark uses the following URL scheme to allow different strategies for disseminating jars.
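For example, a minimal sketch of setting the jar list programmatically via SparkConf.setJars (the conf-level setter); the jar path below is hypothetical and should point at your own assembly jar:
import org.apache.spark.{SparkConf, SparkContext}

// Ships the listed jar(s) to the executors when the context starts.
val conf = new SparkConf()
  .setAppName("WattiopipelineStreamingJob")
  .setJars(Seq("/path/to/wattiopipeline-assembly.jar"))
val sc = new SparkContext(conf)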

Actually I solved it by removing "provided" from the dependency list, so that sbt packaged spark-cassandra-connector into my assembly jar.
Interestingly, in my launch script, even when I tried to use
spark-submit --repositories "location of my artifactory repository" --packages "spark-cassandra-connector"
or
spark-submit --jars "spark-cassandra-connector.jar"
both of them failed!

The provided scope means you expect the JDK or a container to provide the dependency at runtime, so that particular dependency jar will not be part of the final application WAR/JAR you are building; hence the error.
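For reference, this is roughly what the asker's fix amounts to in build.sbt: dropping "provided" so sbt-assembly bundles the connector into the fat jar (version as in the question):
// Without "provided", the connector classes (including
// com.datastax.spark.connector.ColumnSelector) end up inside the assembly jar.
libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "1.5.0"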

Related

Unrecognized Hadoop major version number

I am trying to initialize an Apache Spark instance on Windows 10 to run a local test. My problem is that during the initialization of the Spark instance I get an error message. This code has worked for me many times previously, so I am guessing something might have changed in the dependencies or the configuration. I am running JDK version 1.8.0_192, Hadoop should be 3.0.0, and the Spark version is 2.4.0. I am also using Maven as the build tool, if that is relevant.
Here is the way I am setting up the session:
def withSparkSession(testMethod: SparkSession => Any) {
  val uuid = UUID.randomUUID().toString
  val pathRoot = s"C:/data/temp/spark-testcase/$uuid" // TODO: make this independent from Windows
  val derbyRoot = s"C:/data/temp/spark-testcase/derby_system_root"
  // TODO: clear me up -- Derby based metastore should be cleared up
  System.setProperty("derby.system.home", s"${derbyRoot}")

  val conf = new SparkConf()
    .set("testcase.root.dir", s"${pathRoot}")
    .set("spark.sql.warehouse.dir", s"${pathRoot}/test-hive-dwh")
    .set("spark.sql.catalogImplementation", "hive")
    .set("hive.exec.scratchdir", s"${pathRoot}/hive-scratchdir")
    .set("hive.exec.dynamic.partition.mode", "nonstrict")
    .setMaster("local[*]")
    .setAppName("Spark Hive Test case")

  val spark = SparkSession.builder()
    .config(conf)
    .enableHiveSupport()
    .getOrCreate()

  try {
    testMethod(spark)
  }
  finally {
    spark.sparkContext.stop()
    println(s"Deleting test case root directory: $pathRoot")
    deleteRecursively(nioPaths.get(pathRoot))
  }
}
And this is the error message I receive:
An exception or error caused a run to abort.
java.lang.ExceptionInInitializerError
at org.apache.hadoop.hive.conf.HiveConf.<clinit>(HiveConf.java:105)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:238)
at org.apache.spark.sql.SparkSession$.hiveClassesArePresent(SparkSession.scala:1117)
at org.apache.spark.sql.SparkSession$Builder.enableHiveSupport(SparkSession.scala:866)
.
.
.
at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
at org.scalatest.Transformer.apply(Transformer.scala:22)
at org.scalatest.Transformer.apply(Transformer.scala:20)
at org.scalatest.FunSpecLike$$anon$1.apply(FunSpecLike.scala:454)
at org.scalatest.TestSuite$class.withFixture(TestSuite.scala:196)
at org.scalamock.scalatest.AbstractMockFactory$$anonfun$withFixture$1.apply(AbstractMockFactory.scala:35)
at org.scalamock.scalatest.AbstractMockFactory$$anonfun$withFixture$1.apply(AbstractMockFactory.scala:34)
at org.scalamock.MockFactoryBase$class.withExpectations(MockFactoryBase.scala:41)
at org.scalamock.scalatest.AbstractMockFactory$class.withFixture(AbstractMockFactory.scala:34)
at org.scalatest.FunSpecLike$class.invokeWithFixture$1(FunSpecLike.scala:451)
at org.scalatest.FunSpecLike$$anonfun$runTest$1.apply(FunSpecLike.scala:464)
at org.scalatest.FunSpecLike$$anonfun$runTest$1.apply(FunSpecLike.scala:464)
at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
at org.scalatest.FunSpecLike$class.runTest(FunSpecLike.scala:464)
at org.scalatest.FunSpec.runTest(FunSpec.scala:1630)
at org.scalatest.FunSpecLike$$anonfun$runTests$1.apply(FunSpecLike.scala:497)
at org.scalatest.FunSpecLike$$anonfun$runTests$1.apply(FunSpecLike.scala:497)
at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396)
at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:373)
at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:410)
at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:379)
at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461)
at org.scalatest.FunSpecLike$class.runTests(FunSpecLike.scala:497)
at org.scalatest.FunSpec.runTests(FunSpec.scala:1630)
at org.scalatest.Suite$class.run(Suite.scala:1147)
at org.scalatest.FunSpec.org$scalatest$FunSpecLike$$super$run(FunSpec.scala:1630)
at org.scalatest.FunSpecLike$$anonfun$run$1.apply(FunSpecLike.scala:501)
at org.scalatest.FunSpecLike$$anonfun$run$1.apply(FunSpecLike.scala:501)
at org.scalatest.SuperEngine.runImpl(Engine.scala:521)
at org.scalatest.FunSpecLike$class.run(FunSpecLike.scala:501)
at org.scalatest.FunSpec.run(FunSpec.scala:1630)
at org.scalatest.tools.SuiteRunner.run(SuiteRunner.scala:45)
at org.scalatest.tools.Runner$$anonfun$doRunRunRunDaDoRunRun$1.apply(Runner.scala:1346)
at org.scalatest.tools.Runner$$anonfun$doRunRunRunDaDoRunRun$1.apply(Runner.scala:1340)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.scalatest.tools.Runner$.doRunRunRunDaDoRunRun(Runner.scala:1340)
at org.scalatest.tools.Runner$$anonfun$runOptionallyWithPassFailReporter$2.apply(Runner.scala:1011)
at org.scalatest.tools.Runner$$anonfun$runOptionallyWithPassFailReporter$2.apply(Runner.scala:1010)
at org.scalatest.tools.Runner$.withClassLoaderAndDispatchReporter(Runner.scala:1506)
at org.scalatest.tools.Runner$.runOptionallyWithPassFailReporter(Runner.scala:1010)
at org.scalatest.tools.Runner$.run(Runner.scala:850)
at org.scalatest.tools.Runner.run(Runner.scala)
at org.jetbrains.plugins.scala.testingSupport.scalaTest.ScalaTestRunner.runScalaTest2or3(ScalaTestRunner.java:43)
at org.jetbrains.plugins.scala.testingSupport.scalaTest.ScalaTestRunner.main(ScalaTestRunner.java:26)
Caused by: java.lang.IllegalArgumentException: Unrecognized Hadoop major version number: 3.0.0-cdh6.3.4
at org.apache.hadoop.hive.shims.ShimLoader.getMajorVersion(ShimLoader.java:174)
at org.apache.hadoop.hive.shims.ShimLoader.loadShims(ShimLoader.java:139)
at org.apache.hadoop.hive.shims.ShimLoader.getHadoopShims(ShimLoader.java:100)
at org.apache.hadoop.hive.conf.HiveConf$ConfVars.<clinit>(HiveConf.java:368)
... 64 more
Process finished with exit code 2
So far I have tried changing the JDK version to jdk1.8.0_181 and jdk11+28-x64. I also tried deleting the HADOOP_HOME environment variable from the system, but that didn't help. (It is currently set to C:\Data\devtools\hadoop-win\3.0.0)
If you're on Windows, you shouldn't be pulling in CDH dependencies (3.0.0-cdh6.3.4), as Cloudera doesn't support Windows, last I checked.
But you should be using Spark 3 if you have Hadoop 3+, and keep HADOOP_HOME, as that is definitely necessary.
Also, Java 11 runtime support was only introduced in Hadoop 3.3.4, so Java 8 is what you should stick with.
I have solved the problem. During project development we also added HBase to the build, which pulled in a different Hadoop version from Cloudera as its dependency, so the versions got mixed up. Taking the HBase dependency out of the pom.xml solved the problem.
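As a quick sanity check (a sketch, not part of the original answer), you can print which Hadoop build actually lands on the test classpath, which makes a stray CDH artifact pulled in transitively easy to spot:
// Prints the Hadoop version visible to Hive/Spark at runtime,
// e.g. "3.0.0" for the vanilla artifact vs. "3.0.0-cdh6.3.4" for the CDH one.
import org.apache.hadoop.util.VersionInfo
println(VersionInfo.getVersion)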

ClassNotFoundException when submitting JAR to Spark via spark-submit

I'm struggling to submit a JAR to Apache Spark using spark-submit.
To make things easier, I've been experimenting using this blog post. The code is:
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf

object SimpleScalaSpark {
  def main(args: Array[String]) {
    val logFile = "/Users/toddmcgrath/Development/spark-1.6.1-bin-hadoop2.4/README.md" // I've replaced this with the path to an existing file
    val conf = new SparkConf().setAppName("Simple Application").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
  }
}
I'm building this with IntelliJ IDEA 2017.1 and running it on Spark 2.1.0. Everything runs fine when I run it in the IDE.
I then build it as a JAR and attempt to use spark-submit as follows:
./spark-submit --class SimpleScalaSpark --master local[*] ~/Documents/Spark/Scala/supersimple/out/artifacts/supersimple_jar/supersimple.jar
This results in the following error:
java.lang.ClassNotFoundException: SimpleScalaSpark
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:229)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:695)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
I'm at a loss as to what I'm missing, especially given that it runs as expected in the IDE.
As per your description above, you are not giving the correct class name, so it is not able to find that class.
Just replace SimpleSparkScala with SimpleScalaSpark.
Try running this command:
./spark-submit --class SimpleScalaSpark --master local[*] ~/Documents/Spark/Scala/supersimple/out/artifacts/supersimple_jar/supersimple.jar
Looks like there is an issue with your jar. You can check which classes are present in your jar using the command:
vi supersimple.jar
If the SimpleScalaSpark class does not appear in the output of the previous command, it means your jar is not built properly.
IDEs work differently from the shell in many ways.
I believe for the shell you need to add the --jars parameter:
spark submit add multiple jars in classpath
I am observing ClassNotFoundException on new classes I introduce. I am using a fat jar. I verified that the JAR file contains the new class file in all the copies on each node. (I am using the regular filesystem to load the Spark application, not HDFS nor an HTTP URL.)
The JAR file loaded by the worker did not have the new class I introduced; it was an older version.
The only way I found to get around the problem is to use a different filename for the JAR every time I call the spark-submit script.

--jars from different locations causes different jdbc behavior

When I load a MySQL JDBC driver by first copying it to the driver node and then including it via --jars /path/to/jdbc/driver.jar, referencing that JDBC driver and loading data into a DataFrame succeeds.
$ pyspark --jars /path/to/jdbc/driver.jar
>>> rdd = sqlContext.read.jdbc(url="jdbc:mysql://someAWSDatabase.us-west-2.rds.amazonaws.com:3306?user=root&password=somepassword", table="spark.test", properties={"driver":"com.mysql.jdbc.Driver"})
But if I instead point --jars at the publicly available https-hosted version of that exact jar file, it fails.
$ pyspark --jars https://s3/path/to/jdbc/driver.jar
>>> rdd = sqlContext.read.jdbc(url="jdbc:mysql://someAWSDatabase.us-west-2.rds.amazonaws.com:3306?user=root&password=somepassword", table="spark.test", properties={"driver":"com.mysql.jdbc.Driver"})
py4j.protocol.Py4JJavaError: An error occurred while calling o37.jdbc.
: java.lang.ClassNotFoundException: com.mysql.jdbc.Driver
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
...
According to the docs, you can submit jars from various locations, from local to http/https, etc. Why would this cause different behavior?
Update: I also tried running two spark-submit jobs, one with each variant of the path to the JDBC jar. The https jar submission threw the same error as above.

Why does the map function on sc.cassandraTable("test", "users").select("username") not work?

Following the spark-cassandra-connector demo and Installing the Cassandra / Spark OSS Stack, I tried the following snippet under spark-shell:
sc.stop
val conf = new SparkConf(true)
  .set("spark.cassandra.connection.host", "172.21.0.131")
  .set("spark.cassandra.auth.username", "adminxx")
  .set("spark.cassandra.auth.password", "adminxx")
val sc = new SparkContext("172.21.0.131", "Cassandra Connector Test", conf)
val rdd = sc.cassandraTable("test", "users").select("username")
Many operations on rdd work fine, such as:
rdd.first
rdd.count
But when I use map:
val result = rdd.map(x => 1) // just to keep it simple
result: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[61] at map at <console>:32
Then, I run:
result.first
I get the following error:
15/12/11 15:09:00 WARN TaskSetManager: Lost task 0.0 in stage 31.0 (TID 104, 124.250.36.124): java.lang.ClassNotFoundException:
$line346.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1
Caused by: java.lang.ClassNotFoundException: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:278)
at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:67)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
I don't know why I get this error. Any advice would be appreciated!
UPDATED:
According to @RussSpitzer's answer to CassandraRdd.map( row => row.getInt("id") ) does not work, java.lang.ClassNotFoundException happened!, I resolved this error as follows: instead of calling sc.stop and creating a new SparkContext, I start spark-shell with these options:
bin/spark-shell --conf spark.cassandra.connection.host=172.21.0.131 --conf spark.cassandra.auth.username=adminxx --conf spark.cassandra.auth.password=adminxx
And then all steps are the same and work fine.
Russell Spitzer's answer from the spark-connector-user list:
I'm pretty sure the main problem here is that you start a context with --jars, then kill that context and start another one. Try simplifying your code: instead of setting all of those Spark conf options and creating new contexts, run your shell like this. Also, the jar that you want on the classpath is the connector assembly jar, not a custom build of a Scala script you want to run.
./spark-shell --conf spark.cassandra.connection.host=10.129.20.80 ...
You should not need to modify the ack.wait.timeout or the executor.extraClasspath.
Spark applications normally send their compiled code to the executors as jar files. This way the function that you map is present on the executors.
The situation is trickier in spark-shell. It has to compile and broadcast the code for every line you type interactively. There is not even a class you're operating inside. It creates these fake $$iwC$$ classes to solve this.
Normally this works out well, but you may have hit a spark-shell bug. You can try to work around it by putting your code inside an object in spark-shell:
object Obj { val mapper = { x: CassandraRow => 1 } } // CassandraRow is the element type of a cassandraTable RDD
val result = rdd.map(Obj.mapper)
But it is probably safest to implement your code as an application instead of just writing it in spark-shell.
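For completeness, a minimal sketch of the same logic as a standalone application (the object name is made up here); when it is built into an assembly jar, the mapped function ships with the jar instead of relying on spark-shell's generated closure classes:
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

object UserCounter {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf(true)
      .setAppName("Cassandra Connector Test")
      .set("spark.cassandra.connection.host", "172.21.0.131")
      .set("spark.cassandra.auth.username", "adminxx")
      .set("spark.cassandra.auth.password", "adminxx")
    val sc = new SparkContext(conf)
    val rdd = sc.cassandraTable("test", "users").select("username")
    val result = rdd.map(_ => 1) // the lambda is compiled into the application jar
    println(result.first())
    sc.stop()
  }
}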

Exception while submitting a Spark job on a YARN cluster from a remote JVM

I am using the Java code below to submit a job in yarn-cluster mode.
public ApplicationId submitQuery(String requestId, String query, String fileLocations) {
    String driverJar = getDriverJar();
    String driverClass = propertyService.getAppPropertyValue(TypeString.QUERY_DRIVER_CLASS);
    String driverAppName = propertyService.getAppPropertyValue(TypeString.DRIVER_APP_NAME);
    String extraJarsNeeded = propertyService.getAppPropertyValue(TypeString.DRIVER_EXTRA_JARS_NEEDED);

    String[] args = new String[] {
        // the name of your application
        "--name",
        driverAppName,
        // memory for driver (optional)
        "--driver-memory",
        "1000M",
        // path to your application's JAR file
        // required in yarn-cluster mode
        "--jar",
        "local:/home/ankit/Repository/Personalization/rtis/Cust360QueryDriver/target/SnapdealCustomer360QueryDriver-jar-with-selective-dependencies.jar",
        "--addJars",
        "local:/home/ankit/Downloads/lib/spark-assembly-1.3.1-hadoop2.4.0.jar,local:/home/ankit/.m2/repository/org/slf4j/slf4j-api/1.7.5/slf4j-api-1.7.5.jar,local:/home/ankit/.m2/repository/org/slf4j/slf4j-log4j12/1.7.5/slf4j-log4j12-1.7.5.jar",
        // name of your application's main class (required)
        "--class",
        driverClass,
        "--arg",
        requestId,
        "--arg",
        query,
        "--arg",
        fileLocations,
        "--arg",
        "yarn-client"
    };

    System.setProperty("HADOOP_CONF_DIR", "/home/hduser/hadoop-2.7.0/etc/hadoop");

    Configuration config = new Configuration();
    config.set("yarn.resourcemanager.address", propertyService.getAppPropertyValue(TypeString.RESOURCE_MANGER_URL));
    config.set("fs.default.name", propertyService.getAppPropertyValue(TypeString.FS_DEFAULT_NAME));

    System.setProperty("SPARK_YARN_MODE", "true");

    SparkConf sparkConf = new SparkConf();
    ClientArguments cArgs = new ClientArguments(args, sparkConf);

    // create an instance of the YARN Client
    Client client = new Client(cArgs, config, sparkConf);
    ApplicationId id = client.submitApplication();
    return id;
}
The job gets submitted to the YARN cluster and I am able to retrieve the application ID, but I get the exception below while the job runs on the Spark cluster:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/Logging
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:482)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.Logging
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
... 13 more
The mentioned class is present in /home/ankit/Downloads/lib/spark-assembly-1.3.1-hadoop2.4.0.jar, yet it looks like the jar mentioned in --addJars is not getting added to the driver's Spark context.
Am I doing something wrong? Any help would be appreciated.
Are you deploying on Cloudera's distribution? spark.yarn.jar in the CDH 5.4 config has a 'local:' prefix for local files, but Spark versions >= 1.5 do not like this; you should just use the full path name for your Spark assembly. See also here.
Try building the JAR without the Spark dependency and pass the dependent jars with --jars in spark-submit. Most of the time, a ClassNotFoundException arises because both Spark and the application itself depend on the same jar.
Suggested solutions:
1. Package without the dependency and add the dependent jars with --jars during spark-submit.
2. Modify the application to use the same version of the third-party library that Spark ships with.
3. Use shading in your build tool (a sketch follows below).
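For the shading option, here is a minimal build.sbt sketch assuming an sbt build with the sbt-assembly plugin (the question's build tool is not stated; Maven users would reach for the maven-shade-plugin instead). The package names are illustrative only:
// Rename a conflicting third-party package inside the fat jar so the
// application's copy cannot clash with the copy bundled with Spark.
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("com.google.common.**" -> "myapp.shaded.guava.@1").inAll
)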
