SBT assembly jar exclusion - apache-spark

I'm using Spark (via the Java API) and require a single jar that can be pushed to the cluster; however, the jar itself should not include Spark. The app that deploys the jobs should, of course, include Spark.
I would like:
sbt run - everything should be compiled and executed
sbt smallAssembly - create a jar without spark
sbt assembly - create an uber jar with everything (including spark) for ease of deployment.
I have the first and third working. Any ideas on how I can do the second? What code would I need to add to my build.sbt file?
The question is not relevant only to Spark, but to any other dependency that I may wish to exclude as well.

% "provided" configuration
The first option for excluding a jar from the fat jar is to use the "provided" configuration on the library dependency. "provided" comes from Maven's provided scope, which is defined as follows:
This is much like compile, but indicates you expect the JDK or a container to provide the dependency at runtime. For example, when building a web application for the Java Enterprise Edition, you would set the dependency on the Servlet API and related Java EE APIs to scope provided because the web container provides those classes. This scope is only available on the compilation and test classpath, and is not transitive.
Since you're deploying your code to a container (in this case Spark), contrary to your comment you'd probably still need the Scala standard library and other library jars (e.g. Dispatch, if you used it) in the fat jar. This won't affect run or test.
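For example, a minimal sketch of marking only Spark as provided while leaving other libraries bundled (the Dispatch coordinates below are purely illustrative):
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.5.2" % "provided",  // supplied by the cluster, kept out of the fat jar
  "net.databinder.dispatch" %% "dispatch-core" % "0.11.3"     // still bundled into the fat jar
)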
packageBin
If you just want your own code, and no Scala standard library or other library dependencies, that would be packageBin, built into sbt. This packaged jar can be combined with the dependency-only jar you can make using sbt-assembly's assemblyPackageDependency.
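As a rough sketch of that combination (task names are sbt-assembly's; the optional tweak below assumes a 0.14.x-era assemblyOption):
// `sbt package` (packageBin)         -> a jar containing only your own classes
// `sbt assemblyPackageDependency`    -> a jar containing only the dependencies
// Optionally keep the Scala library out of the assembled jar as well:
assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala = false)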
excludedJars in assembly
The final option is to use excludedJars in assembly:
excludedJars in assembly := {
  val cp = (fullClasspath in assembly).value
  // drop the Spark core jar (matched by exact file name) from what assembly packages
  cp filter { _.data.getName == "spark-core_2.9.3-0.8.0-incubating.jar" }
}
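If you want to exclude every Spark jar rather than one exact file name, the same setting can filter by prefix (a sketch; the "spark-" prefix is illustrative):
excludedJars in assembly := {
  val cp = (fullClasspath in assembly).value
  // drop every jar whose file name starts with "spark-"
  cp filter { _.data.getName.startsWith("spark-") }
}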

For beginners like me, simply add % Provided to the Spark dependencies to exclude them from the uber-jar:
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.4.0" % Provided
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.4.0" % Provided
in build.sbt.
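Depending on your sbt setup, dependencies marked Provided may not be on the classpath for sbt run; if that bites, a commonly cited workaround (a sketch, based on the approach described in sbt-assembly's documentation) is to add them back for the run task only:
run in Compile := Defaults.runTask(
  fullClasspath in Compile,
  mainClass in (Compile, run),
  runner in (Compile, run)
).evaluated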

Related

spark-submit dependency conflict

I'm trying to submit a jar to Spark, but my jar contains dependencies that conflict with Spark's built-in jars (snakeyaml and others).
Is there a way to tell Spark to prefer whatever dependencies my project has over the jars inside its /jars directory?
UPDATE
When I run spark-submit, I get the following exception:
Caused by: java.lang.NoSuchMethodError: javax.validation.BootstrapConfiguration.getClockProviderClassName()Ljava/lang/String;
at org.hibernate.validator.internal.xml.ValidationBootstrapParameters.<init>(ValidationBootstrapParameters.java:63)
at org.hibernate.validator.internal.engine.ConfigurationImpl.parseValidationXml(ConfigurationImpl.java:540)
at org.hibernate.validator.internal.engine.ConfigurationImpl.buildValidatorFactory(ConfigurationImpl.java:337)
at javax.validation.Validation.buildDefaultValidatorFactory(Validation.java:110)
at org.hibernate.cfg.beanvalidation.TypeSafeActivator.getValidatorFactory(TypeSafeActivator.java:501)
at org.hibernate.cfg.beanvalidation.TypeSafeActivator.activate(TypeSafeActivator.java:84)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.hibernate.cfg.beanvalidation.BeanValidationIntegrator.integrate(BeanValidationIntegrator.java:132)
... 41 more
which is caused by Spark shipping an older version of validation-api (validation-api-1.1.0.Final.jar).
My project depends on the newer version, and it does get bundled with my jar (javax.validation:validation-api:jar:2.0.1.Final:compile).
I submit using this command:
/spark/bin/spark-submit --conf spark.executor.userClassPathFirst=true --conf spark.driver.userClassPathFirst=true
but I still get the same exception.
If you are building your jar using sbt, you need to exclude those classes that are already on the cluster. For example:
"org.apache.spark" %% "spark-core" % "2.2.0" % "provided"
You do that by adding "provided", which means these classes are already provided in the environment where you run the jar.
Not sure if you are using sbt, but I handled this in build.sbt via assembly, as I had all sorts of dependency conflicts at one stage. See below; maybe this will help.
This is controlled by setting the following confs to true:
spark.driver.userClassPathFirst
spark.executor.userClassPathFirst
I had issues with two jars, and this is what I ended up doing: I copied the required jars to a directory and used the extraClassPath option:
spark-submit --conf spark.driver.extraClassPath="C:\sparkjars\validation-api-2.0.1.Final.jar;C:\sparkjars\gson-2.8.6.jar" myspringbootapp.jar
From the documentation: spark.driver.extraClassPath - "Extra classpath entries to prepend to the classpath of the driver."
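Another option sometimes used for this kind of clash (not taken from the answers above, so treat it as a sketch) is to shade the conflicting package with sbt-assembly, so the copy bundled in your jar no longer collides with the cluster's older one:
assemblyShadeRules in assembly := Seq(
  // rename the bundled javax.validation classes; the target prefix is illustrative
  ShadeRule.rename("javax.validation.**" -> "shaded.javax.validation.@1").inAll
)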

Compiling Spark program: no 'lib' directory

I am going through the tutorial:
https://www.tutorialspoint.com/apache_spark/apache_spark_deployment.htm
When I got to the "Step 2: Compile program" section I got stuck, because there is no lib folder in the Spark directory.
Where is the lib folder? How can I compile the program?
I looked into the jars folder, but there is no file named spark-assembly-1.4.0-hadoop2.6.0.jar.
I am sorry I am not answering your question directly, but I want to guide you to a more convenient development process for Spark applications.
When you are developing a Spark application on your local computer, you should use sbt (the Scala build tool). After you are done writing code, compile it with sbt (by running sbt assembly). sbt will produce a 'fat jar' archive that already contains all the required dependencies for a job. Then upload the jar to the Spark cluster (for example, using the spark-submit script).
There is no reason to install sbt on your cluster because it is needed only for compilation.
You should check out the starter project that I created for you.
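As a rough starting point (names and versions here are illustrative, not taken from the tutorial), such a project can be as small as:
// project/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.5")

// build.sbt
name := "spark-sample"
scalaVersion := "2.11.8"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.2.0" % "provided"

// `sbt assembly` then produces target/scala-2.11/spark-sample-assembly-<version>.jar,
// which you submit with spark-submit.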

Spark doesn't find BLAS in MKL dll

I'm working in IntelliJ and specified this parameter for my JVM:
-Dcom.github.fommil.netlib.BLAS=mkl_rt.dll (my MKL folder is in the PATH)
However, I still get the following warning:
WARN BLAS: Failed to load implementation from: mkl_rt.dll
Any help?
I finally solved this issue; here are the complete steps to make it work in IntelliJ IDEA on Windows:
First, create an sbt project and make sure to put the following line in build.sbt:
libraryDependencies ++= Seq("com.github.fommil.netlib" % "all" % "1.1.1" pomOnly())
Refresh the project; after that you should have the libraries available. If that doesn't work for some reason, you can go to http://repo1.maven.org/maven2/com/github/fommil/netlib/ and download the necessary resources for your system directly.
Copy your mkl_rt.dll twice and rename the copies libblas3.dll and liblapack3.dll. Make sure the folders containing all the DLLs are in the PATH environment variable.
Finally, go to Run -> Edit Configurations and in the VM options put:
-Dcom.github.fommil.netlib.BLAS=mkl_rt.dll
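To check which implementation netlib-java actually picked up, a small sketch like the following (assuming the netlib classes are on the classpath, as configured above) prints the loaded backend:
object BlasCheck extends App {
  // Prints e.g. com.github.fommil.netlib.NativeSystemBLAS when a native backend
  // such as MKL loaded, or com.github.fommil.netlib.F2jBLAS on the pure-Java fallback.
  println(com.github.fommil.netlib.BLAS.getInstance().getClass.getName)
}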

unresolved dependency: com.eed3si9n#sbt-assembly;0.13.0: not found

I did lots of searching, saw many people having similar issues, and tried various suggested solutions. None worked.
Can someone help me?
resolvers += Resolver.url("bintray-sbt-plugins", url("http://dl.bintray.com/sbt/sbt-plugin-releases"))(Resolver.ivyStylePatterns)
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.13.0")
The file is inside the project folder.
Instead of version 0.13.0, I used version 0.14.0.
I fixed this by adding the POM file, which I downloaded from
https://dl.bintray.com/sbt/sbt-plugin-releases/com.eed3si9n/sbt-assembly/scala_2.10/sbt_0.13/0.14.4/ivys/
to my local Ivy folder under .ivy/local (if not present, create the local folder).
Once it was there, I ran the build and it downloaded the jar.
You need to add a [root_dir]/project/plugins.sbt file with the following content:
// packager
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.5")
Even better - don't use sbt-assembly at all! Fat jars cause conflicts during merging, which need to be resolved with assemblyMergeStrategy.
Use the binary distribution packaging plugin that sbt offers, which enables you to distribute as a binary script, dmg, msi, or tar.gz.
Check out sbt-native-packager
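A minimal sketch of that route (the plugin version below is illustrative):
// project/plugins.sbt
addSbtPlugin("com.typesafe.sbt" % "sbt-native-packager" % "1.3.25")

// build.sbt
enablePlugins(JavaAppPackaging)

// `sbt stage` builds a lib/ directory of dependency jars plus launch scripts,
// and `sbt universal:packageBin` zips the same layout for distribution.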

sbt, ivy, offline work, and weirdness

I'm trying to work on an sbt project offline (again). Things almost seem to be ok, but there are strange things that I'm baffled by. Here's what I'm noticing:
I've created an empty sbt project and am considering the following dependencies in build.sbt:
name := "sbtSand"
version := "1.0"
scalaVersion := "2.11.7"
libraryDependencies ++= Seq(
  "joda-time" % "joda-time" % "2.9.1",
  "org.apache.spark" %% "spark-core" % "1.5.2"
)
I've built the project while online and can see all the packages in [userhome]/.ivy2/cache. The project builds fine. I then turn off wifi, run sbt clean, and attempt to build. The build fails. I comment out the Spark dependency (keeping the joda-time one). Still offline, I run sbt compile. The project builds fine. I put the Spark dependency back in and run sbt clean. It again fails to build. I get back online and can build again.
The sbt output for the failed builds looks like this: https://gist.github.com/ashic/9e5ebc39ff4eb8c41ffb
The key part of it is:
[info] Resolving org.apache.hadoop#hadoop-mapreduce-client-app;2.2.0 ...
[warn] Host repo1.maven.org not found. url=https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-mapreduce-client-app/2.2.0/hadoop-mapreduce-client-app-2.2.0.pom
[info] You probably access the destination server through a proxy server that is not well configured.
It's interesting that sbt manages to use joda-time from the Ivy cache, but for the spark-core package (or rather one of its dependencies) it wants to reach out to the internet and fails the build. Could anybody please help me understand this, and what I can do to get this to work while fully offline?
It seems the issue is resolved in sbt 0.13.9. I was using 0.13.8. [The 0.13.9 MSI for Windows seemed to give me 0.13.8, while the 0.13.9.2 MSI installed the right version. Existing projects need to be updated manually to 0.13.9 in build.properties.]
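Separately from the version fix, sbt also has an offline setting that tells the resolution machinery to prefer the local cache; whether it fully avoids network lookups varies by sbt version, so treat this as a hedged extra rather than a guaranteed fix:
// build.sbt (or run `sbt "set offline := true" compile` from the shell)
offline := true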
