java.lang.IllegalStateException: impossible to get artifacts when data has not been loaded. IvyNode = org.codehaus.jackson#jackson-core-asl;1.8.3

When I am building the Spark 1.6 source code for Hadoop version 2.6.0-cdh5.7.0 and YARN, I get the same error with both the sbt build and the Maven build:
[error] (yarn/*:update) java.lang.IllegalStateException: impossible to get artifacts when data has not been loaded.
IvyNode = org.codehaus.jackson#jackson-core-asl;1.8.3
I have added the Maven dependency
<dependency>
<groupId>org.codehaus.jackson</groupId>
<artifactId>jackson-core-asl</artifactId>
<version>1.8.3</version>
</dependency>
and I have also added the dependency below in the plugins.sbt file inside the project folder
libraryDependencies += "org.codehaus.jackson" % "jackson-core-asl" % "1.8.3" % "test" intransitive()
and I tried adding the following to plugins.sbt:
dependencyOverrides += "org.codehaus.jackson" % "jackson-core-asl" % "1.8.3"
But the error is still not gone.
Please help with this.
Thank you
Karim

I think this is the reason:
"All Codehaus services have now been terminated."
With the Codehaus repositories gone, Ivy can no longer resolve the artifact from there.

Related

Spark submit not working with Protobuf dependency

Apache Spark: 3.0.0
Protobuf: 3.5.1
Exception:
Exception in thread "main" java.lang.NoSuchMethodError: com.google.protobuf.Descriptors$Descriptor.getOneofs()Ljava/util/List;
Issue: When submitting my Spark Scala application on my local Kubernetes, I am getting:
"java.lang.NoSuchMethodError: com.google.protobuf.Descriptors$Descriptor.getOneofs()Ljava/util/List;"
It seems there is a dependency conflict for Protobuf.
I have tried a few things like https://github.com/nats-io/stan.java/issues/20 but nothing is working.
my build.sbt:
name := "test"
version := "0.1"
scalaVersion := "2.12.8"
val sparkVersion = "3.0.0"
val protobufVersion = "3.5.1"
resolvers += "confluent" at "http://packages.confluent.io/maven/"
resolvers += Resolver.jcenterRepo
resolvers += "Spark Packages Repo" at "http://dl.bintray.com/spark-packages/maven"
assemblyMergeStrategy in assembly := {
case "META-INF/services/org.apache.spark.sql.sources.DataSourceRegister" => MergeStrategy.concat
case PathList("META-INF", xs @ _*) => MergeStrategy.discard
case "application.conf" => MergeStrategy.concat
case x => MergeStrategy.first
}
assemblyShadeRules in assembly := Seq(
ShadeRule.rename("com.google.protobuf.*" -> "shadedproto.@1").inProject
.inLibrary("com.google.protobuf" % "protobuf-java" % protobufVersion)
)
coverageEnabled.in(ThisBuild, IntegrationTest, test) := true
//skipping test cases during package
test in assembly := {}
lazy val server = (project in file("."))
.configs(IntegrationTest)
.settings(Defaults.itSettings)
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-sql" % sparkVersion % Provided,
"org.apache.spark" %% "spark-sql-kafka-0-10" % sparkVersion,
"org.apache.kafka" % "kafka-clients" % "2.2.1",
"org.postgresql" % "postgresql" % postgreSqlVersion,
"com.google.protobuf" % "protobuf-java" % protobufVersion,
"com.typesafe" % "config" % "1.4.0"
)
To resolve this you need to do two things: 1) remove the duplicated protobuf library so that there is only one version of it available, and 2) fix any code that does not use the version of protobuf that Spark uses.
One of my favorite quick and dirty tricks to resolve this is to do a class search in IntelliJ for the conflicting class, in this case "Descriptors". This will show all jars that contain the class. Once you've figured out which jars are bringing in the conflicting class, you can remove one of them. Chances are it's going to be easiest to just match Spark's version of protobuf.
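Another quick check is to ask the JVM at runtime which jar a class was actually loaded from; a small diagnostic Scala snippet you can paste into spark-shell or the driver (just a sketch):
// prints the jar (or directory) that Descriptors was loaded from,
// which tells you whose protobuf ends up on the runtime classpath
println(classOf[com.google.protobuf.Descriptors]
  .getProtectionDomain.getCodeSource.getLocation)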
If you sift through the dependencies in Maven it seems like it should be using v2.5.0, so some other dependency may be bringing it in.
You may need to check your code to match this version. What's the full stack trace of your error? Is it your code calling the protobuf function or is it another library? If it's another library you may need to fork the library to make it compatible with Spark's dependencies.
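To find out which dependency is actually pulling in the newer protobuf, printing the dependency tree helps: with Maven, mvn dependency:tree -Dincludes=com.google.protobuf; with sbt, the dependencyTree task (built in from sbt 1.4, otherwise provided by the sbt-dependency-graph plugin). A hedged sketch of the plugin setup for older sbt versions:
// project/plugins.sbt -- only needed on sbt versions without a built-in dependencyTree task
addSbtPlugin("net.virtual-void" % "sbt-dependency-graph" % "0.9.2")
// then, in the sbt console, run:
//   dependencyTree
// and look for the com.google.protobuf entries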
This is something that I am currently trying to resolve myself.
In essence, spark-parent is using protobuf version 2.5.
You are using protobuf version 3.5.
Working in IntelliJ you will not have any issues with this code, but when trying to spark-submit the jar file you will encounter this problem.
The spark-parent POM forces the older version of protobuf.
Excluding a parent dependency from Maven is not doable.
Recompiling your proto files against 2.5 is out of the question and will not work.
I have tried shading the jar, but haven't managed to get that working.
I am just downloading the Spark code to see if there is something that can be done from that side.
I will report back when I have a solution for this.
-- Solved:
You do not need to create a fat jar to get this working. You can use an experimental feature to force Spark to use your package.
I am using Spring Boot as well, so I do not have to define the main class to be executed, but this is the spark-submit that is needed:
spark-submit -v --conf spark.driver.userClassPathFirst=true \
--conf spark.executor.userClassPathFirst=true \
--driver-java-options "-Dspring.profiles.active=local -Djava.util.logging.config.file=log4j.properties" \
--jars /path/to/com/google/protobuf/protobuf-java/3.17.3/protobuf-java-3.17.3.jar \
target/SparkStream-0.1.jar
Hope this helps you out. Do post your solution as well.
Use a shade plugin to shade your version of protobuf. This should resolve the issue.
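With sbt-assembly the shading would look roughly like the sketch below (assuming sbt-assembly 0.14+; the shadedproto prefix is arbitrary). Using ** rather than * makes the rule cover subpackages, and .inAll relocates the classes wherever they appear:
// build.sbt
assemblyShadeRules in assembly := Seq(
  // relocate every com.google.protobuf class in the fat jar so it cannot
  // clash with the protobuf 2.5 classes that Spark itself provides
  ShadeRule.rename("com.google.protobuf.**" -> "shadedproto.@1").inAll
)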

I am trying to develop a Spark application with Kafka, but I get a NoClassDefFoundError [duplicate]

The common problems when building and deploying Spark applications are:
java.lang.ClassNotFoundException.
object x is not a member of package y compilation errors.
java.lang.NoSuchMethodError
How can these be resolved?
Apache Spark's classpath is built dynamically (to accommodate per-application user code) which makes it vulnerable to such issues. user7337271's answer is correct, but there are some more concerns, depending on the cluster manager ("master") you're using.
First, a Spark application consists of these components (each one is a separate JVM, therefore potentially contains different classes in its classpath):
Driver: that's your application creating a SparkSession (or SparkContext) and connecting to a cluster manager to perform the actual work
Cluster Manager: serves as an "entry point" to the cluster, in charge of allocating executors for each application. There are several different types supported in Spark: standalone, YARN and Mesos, which we'll describe below.
Executors: these are the processes on the cluster nodes, performing the actual work (running Spark tasks)
The relationship between them is described in the diagram from Apache Spark's cluster mode overview.
Now, which classes should reside in each of these components? Let's go through them slowly:
Spark Code: these are Spark's libraries. They should exist in ALL three components, as they include the glue that lets Spark perform the communication between them. By the way, the Spark authors made a design decision to include code for ALL components in ALL components (e.g. to include code that should only run in an Executor in the driver too) to simplify this, so Spark's "fat jar" (in versions up to 1.6) or "archive" (in 2.0, details below) contains the necessary code for all components and should be available in all of them.
Driver-Only Code: this is user code that does not include anything that should be used on Executors, i.e. code that isn't used in any transformations on the RDD / DataFrame / Dataset. This does not necessarily have to be separated from the distributed user code, but it can be.
Distributed Code: this is user code that is compiled with the driver code, but also has to be executed on the Executors; everything the actual transformations use must be included in this jar (or jars).
Now that we got that straight, how do we get the classes to load correctly in each component, and what rules should they follow?
Spark Code: as previous answers state, you must use the same Scala and Spark versions in all components.
1.1 In Standalone mode, there's a "pre-existing" Spark installation to which applications (drivers) can connect. That means that all drivers must use that same Spark version running on the master and executors.
1.2 In YARN / Mesos, each application can use a different Spark version, but all components of the same application must use the same one. That means that if you used version X to compile and package your driver application, you should provide the same version when starting the SparkSession (e.g. via spark.yarn.archive or spark.yarn.jars parameters when using YARN). The jars / archive you provide should include all Spark dependencies (including transitive dependencies), and it will be shipped by the cluster manager to each executor when the application starts.
Driver Code: that's entirely up to you; driver code can be shipped as a bunch of jars or as a "fat jar", as long as it includes all Spark dependencies plus all user code.
Distributed Code: in addition to being present on the Driver, this code must be shipped to executors (again, along with all of its transitive dependencies). This is done using the spark.jars parameter.
To summarize, here's a suggested approach to building and deploying a Spark Application (in this case - using YARN):
Create a library with your distributed code, package it both as a "regular" jar (with a .pom file describing its dependencies) and as a "fat jar" (with all of its transitive dependencies included).
Create a driver application, with compile-dependencies on your distributed code library and on Apache Spark (with a specific version)
Package the driver application into a fat jar to be deployed to driver
Pass the right version of your distributed code as the value of spark.jars parameter when starting the SparkSession
Pass the location of an archive file (e.g. gzip) containing all the jars under lib/ folder of the downloaded Spark binaries as the value of spark.yarn.archive
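As a rough illustration of steps 4 and 5, these values can be set when the session is created; the jar and archive paths below are placeholders, not real artifacts:
// driver code -- a minimal sketch with hypothetical paths
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("my-app")
  // step 4: ship the distributed-code jar (and its dependencies) to the executors
  .config("spark.jars", "hdfs:///deps/distributed-code-assembly-1.0.jar")
  // step 5: archive containing all the Spark jars, uploaded once to HDFS
  .config("spark.yarn.archive", "hdfs:///spark/spark-libs.zip")
  .getOrCreate()
The same values can equally be passed with --conf on spark-submit instead of being hard-coded.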
When building and deploying Spark applications all dependencies require compatible versions.
Scala version. All packages have to use the same major (2.10, 2.11, 2.12) Scala version.
Consider the following (incorrect) build.sbt:
name := "Simple Project"
version := "1.0"
libraryDependencies ++= Seq(
"org.apache.spark" % "spark-core_2.11" % "2.0.1",
"org.apache.spark" % "spark-streaming_2.10" % "2.0.1",
"org.apache.bahir" % "spark-streaming-twitter_2.11" % "2.0.1"
)
We use spark-streaming for Scala 2.10 while the remaining packages are for Scala 2.11. A valid file could be:
name := "Simple Project"
version := "1.0"
libraryDependencies ++= Seq(
"org.apache.spark" % "spark-core_2.11" % "2.0.1",
"org.apache.spark" % "spark-streaming_2.11" % "2.0.1",
"org.apache.bahir" % "spark-streaming-twitter_2.11" % "2.0.1"
)
but it is better to specify the version globally and use %% (which appends the Scala version for you):
name := "Simple Project"
version := "1.0"
scalaVersion := "2.11.7"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "2.0.1",
"org.apache.spark" %% "spark-streaming" % "2.0.1",
"org.apache.bahir" %% "spark-streaming-twitter" % "2.0.1"
)
Similarly in Maven:
<project>
<groupId>com.example</groupId>
<artifactId>simple-project</artifactId>
<modelVersion>4.0.0</modelVersion>
<name>Simple Project</name>
<packaging>jar</packaging>
<version>1.0</version>
<properties>
<spark.version>2.0.1</spark.version>
</properties>
<dependencies>
<dependency> <!-- Spark dependency -->
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.bahir</groupId>
<artifactId>spark-streaming-twitter_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
</dependencies>
</project>
Spark version. All packages have to use the same major Spark version (1.6, 2.0, 2.1, ...).
Consider the following (incorrect) build.sbt:
name := "Simple Project"
version := "1.0"
libraryDependencies ++= Seq(
"org.apache.spark" % "spark-core_2.11" % "1.6.1",
"org.apache.spark" % "spark-streaming_2.10" % "2.0.1",
"org.apache.bahir" % "spark-streaming-twitter_2.11" % "2.0.1"
)
We use spark-core 1.6 while the remaining components are in Spark 2.0. A valid file could be:
name := "Simple Project"
version := "1.0"
libraryDependencies ++= Seq(
"org.apache.spark" % "spark-core_2.11" % "2.0.1",
"org.apache.spark" % "spark-streaming_2.10" % "2.0.1",
"org.apache.bahir" % "spark-streaming-twitter_2.11" % "2.0.1"
)
but it is better to use a variable (this is still incorrect, since the Scala versions remain mixed):
name := "Simple Project"
version := "1.0"
val sparkVersion = "2.0.1"
libraryDependencies ++= Seq(
"org.apache.spark" % "spark-core_2.11" % sparkVersion,
"org.apache.spark" % "spark-streaming_2.10" % sparkVersion,
"org.apache.bahir" % "spark-streaming-twitter_2.11" % sparkVersion
)
Similarly in Maven:
<project>
<groupId>com.example</groupId>
<artifactId>simple-project</artifactId>
<modelVersion>4.0.0</modelVersion>
<name>Simple Project</name>
<packaging>jar</packaging>
<version>1.0</version>
<properties>
<spark.version>2.0.1</spark.version>
<scala.version>2.11</scala.version>
</properties>
<dependencies>
<dependency> <!-- Spark dependency -->
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_${scala.version}</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_${scala.version}</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.bahir</groupId>
<artifactId>spark-streaming-twitter_${scala.version}</artifactId>
<version>${spark.version}</version>
</dependency>
</dependencies>
</project>
The Spark version used in the Spark dependencies has to match the Spark version of the Spark installation. For example, if you use 1.6.1 on the cluster you have to use 1.6.1 to build the jars. Minor version mismatches are not always accepted.
The Scala version used to build the jar has to match the Scala version used to build the deployed Spark. By default (downloadable binaries and default builds):
Spark 1.x -> Scala 2.10
Spark 2.x -> Scala 2.11
Additional packages that are not included in the fat jar should be accessible on the worker nodes. There are a number of options, including:
--jars argument for spark-submit - to distribute local jar files.
--packages argument for spark-submit - to fetch dependencies from Maven repository.
When submitting in cluster mode you should also include the application jar in --jars.
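For example, a spark-submit invocation might look like the sketch below (the class name, coordinates and paths are placeholders):
spark-submit \
  --class com.example.Main \
  --master yarn \
  --packages org.apache.bahir:spark-streaming-twitter_2.11:2.0.1 \
  --jars /path/to/extra-lib.jar \
  target/scala-2.11/my-app_2.11-1.0.jar
--packages resolves the listed Maven coordinates (plus transitive dependencies) at submit time, while --jars distributes local files as-is.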
In addition to the very extensive answer already given by user7337271, if the problem results from missing external dependencies you can build a jar with your dependencies with e.g. the Maven assembly plugin.
In that case, make sure to mark all the core Spark dependencies as "provided" in your build system and, as already noted, make sure they correlate with your runtime Spark version.
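In sbt, marking Spark as provided typically looks like the following sketch (the version is just the one used in the examples above), so Spark is on the compile classpath but stays out of the assembled jar:
// build.sbt -- Spark comes from the cluster at runtime, so mark it Provided
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.0.1" % Provided,
  "org.apache.spark" %% "spark-sql"  % "2.0.1" % Provided
)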
The dependency classes of your application should be included in the jar you pass as the application-jar option of your launch command.
More details can be found at the Spark documentation
Taken from the documentation:
application-jar: Path to a bundled jar including your application and
all dependencies. The URL must be globally visible inside of your
cluster, for instance, an hdfs:// path or a file:// path that is
present on all nodes
I think this problem must be solved with an assembly plugin: you need to build a fat jar.
For example, in sbt:
add the file $PROJECT_ROOT/project/assembly.sbt with the line addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.0")
add the needed libraries to build.sbt, e.g. libraryDependencies ++= Seq("com.some.company" %% "some-lib" % "1.0.0")
in the sbt console enter "assembly", and deploy the assembly jar
If you need more information, go to https://github.com/sbt/sbt-assembly
Add all the jar files from spark-2.4.0-bin-hadoop2.7\spark-2.4.0-bin-hadoop2.7\jars to the project. The spark-2.4.0-bin-hadoop2.7 distribution can be downloaded from https://spark.apache.org/downloads.html
I have the following build.sbt
lazy val root = (project in file(".")).
settings(
name := "spark-samples",
version := "1.0",
scalaVersion := "2.11.12",
mainClass in Compile := Some("StreamingExample")
)
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "2.4.0",
"org.apache.spark" %% "spark-streaming" % "2.4.0",
"org.apache.spark" %% "spark-sql" % "2.4.0",
"com.couchbase.client" %% "spark-connector" % "2.2.0"
)
// META-INF discarding
assemblyMergeStrategy in assembly := {
case PathList("META-INF", xs @ _*) => MergeStrategy.discard
case x => MergeStrategy.first
}
I've created a fat jar of my application using the sbt assembly plugin, but when running it with spark-submit it fails with the error:
java.lang.NoClassDefFoundError: rx/Completable$OnSubscribe
at com.couchbase.spark.connection.CouchbaseConnection.streamClient(CouchbaseConnection.scala:154)
I can see that the class exists in my fat jar:
jar tf target/scala-2.11/spark-samples-assembly-1.0.jar | grep 'Completable$OnSubscribe'
rx/Completable$OnSubscribe.class
Not sure what I am missing here, any clues?

spark-streaming-kafka doesn't work with scala-library [duplicate]

This question was closed as a duplicate; the answers are the same as for the previous question above and are not repeated here.

How to create fat jar using sbt

I am a newbie to Spark programming. I want to create a fat jar which includes all the dependency jars as well. Currently I am running my Spark application with the following command:
./spark-submit --class XYZ --jars dependency_1,dependency_2 main.jar
But I don't want to pass these dependency jars every time. I googled it but could not find a working solution.
One way I tried is using the assembly plugin, but it gives the following error:
[error] Not a valid command: assembly
[error] Not a valid project ID: assembly
[error] Expected ':' (if selecting a configuration)
[error] Not a valid key: assembly
[error] assembly
[error]
So please, does anyone have an idea of the best way to create a fat jar?
Thanks in advance.
Edit 1:
My build.sbt:
import AssemblyKeys._
assemblySettings
name := "Update Sim Count"
version := "1.0"
scalaVersion := "2.10.0"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.0"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.6.0"
libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "1.5.0-RC1"
libraryDependencies += "mysql" % "mysql-connector-java" % "5.1.12"
assembly.sbt:
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")
Edit 2:
The answer given by @Chobeat worked. I followed that blog. There is no need for the Build.scala from that blog; you can just add assembly.sbt and a few lines to build.sbt, and that will work. Thanks @Chobeat for your help.
Remember to add the sbt assembly plugin to your project. It's not a default command.
Building a fat jar with Spark is a bit tricky at first, but it's not black magic, and it's the correct way to achieve what you want.
Follow a tutorial and you will be good:
http://blog.prabeeshk.com/blog/2014/04/08/creating-uber-jar-for-spark-project-using-sbt-assembly/
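As a concrete starting point, a minimal setup could look like the sketch below (not tailored to your exact versions; note that with sbt-assembly 0.14.x the import AssemblyKeys._ and assemblySettings lines from older tutorials are no longer needed):
// project/assembly.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")

// build.sbt
name := "update-sim-count"
version := "1.0"
scalaVersion := "2.10.6"
// Spark is provided by the cluster, so keep it out of the fat jar
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.0" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.6.0" % "provided"
// everything else (Cassandra connector, MySQL driver, ...) stays a normal
// dependency and ends up inside the jar produced by the assembly task
Then run sbt assembly and pass the single jar from target/scala-2.10/ to spark-submit without any --jars.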

Exception in thread "main" java.lang.IllegalStateException: Library directory '/Users/dbl/spark/lib_managed/jars' does not exist

I built Spark 1.6 SNAPSHOT from sources with no issues:
$ mvn3 clean package -DskipTests
I'm running:
OS X 10.10.5.
Java 1.8
Maven 3.3.3
Spark 1.6 SNAPSHOT
Scala 2.11.7
Zinc 0.3.5.3
Hadoop 3.0 SNAPSHOT
I added the following dependency to my pom.xml (to try to resolve the warning about native libraries):
<dependency>
<groupId>com.googlecode.netlib-java</groupId>
<artifactId>netlib</artifactId>
<version>1.1</version>
</dependency>
Environment variables:
HADOOP_INSTALL=/Users/davidlaxer/hadoop/hadoop-dist/target/hadoop-3.0.0-SNAPSHOT
HADOOP_CONF_DIR=/Users/davidlaxer/hadoop/hadoop-dist/target/hadoop-3.0.0-SNAPSHOT/etc/hadoop
HADOOP_OPTS=-Djava.library.path=/Users/davidlaxer/hadoop/hadoop-dist/target/hadoop-3.0.0-SNAPSHOT/lib/native
CLASSPATH=/users/davidlaxer/trunk/core/src/test/java/:/Users/davidlaxer/hadoop/hadoop-dist/target/hadoop-dist-3.0.0-SNAPSHOT.jar:/Users/davidlaxer/clojure/target:/Users/davidlaxer/hadoop/lib/native:
SPARK_LIBRARY_PATH=/Users/davidlaxer/hadoop/hadoop-dist/target/hadoop-3.0.0-SNAPSHOT/lib/native
When I try to launch Spark with spark-shell I get the following error:
./spark-shell
Exception in thread "main" java.lang.IllegalStateException: Library directory '/Users/davidlaxer/spark/lib_managed/jars' does not exist.
at org.apache.spark.launcher.CommandBuilderUtils.checkState(CommandBuilderUtils.java:249)
at org.apache.spark.launcher.AbstractCommandBuilder.buildClassPath(AbstractCommandBuilder.java:227)
at org.apache.spark.launcher.AbstractCommandBuilder.buildJavaCommand(AbstractCommandBuilder.java:115)
at org.apache.spark.launcher.SparkSubmitCommandBuilder.buildSparkSubmitCommand(SparkSubmitCommandBuilder.java:196)
at org.apache.spark.launcher.SparkSubmitCommandBuilder.buildCommand(SparkSubmitCommandBuilder.java:121)
at org.apache.spark.launcher.Main.main(Main.java:86)
I reverted to Spark 1.5 and didn't have the problem:
git clone git://github.com/apache/spark.git -b branch-1.5
