Spark: Fat-JAR crashing Zeppelin with NullPointerException - apache-spark

Note: This is NOT a duplicate of Getting NullPointerException when running Spark Code in Zeppelin 0.7.1
I've run into this roadblock in Apache Zeppelin on Amazon EMR. I'm trying to load a fat-jar (located on Amazon S3) into the Spark interpreter. Once the fat-jar is loaded, Zeppelin's Spark interpreter refuses to work, failing with the following stack trace:
java.lang.NullPointerException
at org.apache.zeppelin.spark.Utils.invokeMethod(Utils.java:38)
at org.apache.zeppelin.spark.Utils.invokeMethod(Utils.java:33)
at org.apache.zeppelin.spark.SparkInterpreter.createSparkContext_2(SparkInterpreter.java:398)
at org.apache.zeppelin.spark.SparkInterpreter.createSparkContext(SparkInterpreter.java:387)
at org.apache.zeppelin.spark.SparkInterpreter.getSparkContext(SparkInterpreter.java:146)
at org.apache.zeppelin.spark.SparkInterpreter.open(SparkInterpreter.java:843)
at org.apache.zeppelin.interpreter.LazyOpenInterpreter.open(LazyOpenInterpreter.java:70)
at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:491)
at org.apache.zeppelin.scheduler.Job.run(Job.java:175)
at org.apache.zeppelin.scheduler.FIFOScheduler$1.run(FIFOScheduler.java:139)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Even a simple Scala statement like val str: String = "sample string" that doesn't touch anything in the jar produces the above error log. Removing the jar from the interpreter's dependencies fixes the issue, so the problem clearly lies with the jar itself.
The fat-jar in question is generated by Jenkins using sbt assembly. The project (whose fat-jar I'm loading) consists of two submodules under a parent module.
While sharing the complete build.sbt and dependency files of all three modules would be impractical, I'm enclosing an exhaustive list of all dependencies and configurations used across them.
AWS dependencies
"com.amazonaws" % "aws-java-sdk-s3" % "1.11.218"
"com.amazonaws" % "aws-java-sdk-emr" % "1.11.218"
"com.amazonaws" % "aws-java-sdk-ec2" % "1.11.218"
Spark dependencies (marked provided via allSparkDependencies.map(_ % "provided"))
"org.apache.spark" %% "spark-core" % "2.2.0"
"org.apache.spark" %% "spark-sql" % "2.2.0"
"org.apache.spark" %% "spark-hive" % "2.2.0"
"org.apache.spark" %% "spark-streaming" % "2.2.0"
Testing dependencies
"org.scalatest" %% "scalatest" % "3.0.3" % Test
"com.holdenkarau" %% "spark-testing-base" % "2.2.0_0.7.2" % "test"
Other dependencies
"com.github.scopt" %% "scopt" % "3.7.0"
"com.typesafe" % "config" % "1.3.1"
"com.typesafe.play" %% "play-json" % "2.6.6"
"joda-time" % "joda-time" % "2.9.9"
"mysql" % "mysql-connector-java" % "5.1.41"
"com.github.gilbertw1" %% "slack-scala-client" % "0.2.2"
"org.scalaj" %% "scalaj-http" % "2.3.0"
Framework versions
Scala v2.11.11
SBT v1.0.3
Spark v2.2.0
Zeppelin v0.7.3
SBT Configurations
// cache options
offline := false
updateOptions := updateOptions.value.withCachedResolution(true)
// aggregate options
aggregate in assembly := false
aggregate in update := false
// fork options
fork in Test := true
// merge strategy
assemblyMergeStrategy in assembly := {
case m if m.toLowerCase.endsWith("manifest.mf") => MergeStrategy.discard
case m if m.startsWith("META-INF") => MergeStrategy.discard
case PathList("javax", "servlet", xs @ _*) => MergeStrategy.first
case PathList("org", "apache", xs @ _*) => MergeStrategy.first
case PathList("org", "jboss", xs @ _*) => MergeStrategy.first
case "about.html" => MergeStrategy.rename
case "reference.conf" => MergeStrategy.concat
case "application.conf" => MergeStrategy.concat
case _ => MergeStrategy.first
}

While the problem got fixed, I was honestly unable to drill down to its root cause (and hence to a real solution). After rigorously going through forums in vain, I ended up manually comparing (and re-aligning) my code, via git diff, against the last known working build.
It's been a while since then, and when I now check my git history, I find that the commit that fixed this problem contains only refactoring and build-related changes. My best guess is therefore that it was a build-related issue. I'm putting down all the changes I made to build.sbt.
I reiterate that I cannot establish whether it was these particular modifications that fixed the issue, so keep looking. I'll keep this question open until a conclusive cause (and solution) is found.
Marked the following dependencies as provided, as suggested here:
"org.apache.spark" %% "spark-core" % sparkVersion
"org.apache.spark" %% "spark-sql" % sparkVersion
"org.apache.spark" %% "spark-hive" % sparkVersion
"org.apache.spark" %% "spark-streaming" % sparkVersion
"com.holdenkarau" %% "spark-testing-base" % "2.2.0_0.7.2" % "test"
Overrode the following com.fasterxml.jackson dependencies:
dependencyOverrides += "com.fasterxml.jackson.core" % "jackson-core" % "2.6.5"
dependencyOverrides += "com.fasterxml.jackson.core" % "jackson-databind" % "2.6.5"
dependencyOverrides += "com.fasterxml.jackson.module" % "jackson-module-scala_2.11" % "2.6.5"
I'd like to point out one particular thing: the following Logback dependency, which I initially suspected, actually had nothing to do with this (we had faced issues with Logback in the past, so it suited us to blame it). While we removed it at the time of resolution, we've since added it back.
"ch.qos.logback" % "logback-classic" % "1.2.3"

Related

Is Kafka 2.4 compatible with Spark 2.4?

When updating Kafka from 2.3 to 2.4, what (if any) considerations need to be taken regarding Spark? Will it affect the existing Spark version?
I'm using a Kafka broker to consume messages produced from Spark 2.1, using a Kafka 2.4 client.
I think Kafka 2.4 should work with Spark 2.4 (you can install both locally and try producing/consuming a hello-world topic).
In case you are interested, here are the sbt dependencies:
val sparkVersion21 = "2.1.0"
lazy val dependencies = Seq(
"org.apache.spark" %% "spark-core" % sparkVersion21,
"org.apache.spark" %% "spark-sql" % sparkVersion21,
"org.apache.spark" %% "spark-hive" % sparkVersion21,
"org.apache.spark" %% "spark-yarn" % sparkVersion21,
"org.apache.kafka" % "kafka-clients" % "2.4.0",
"org.apache.kafka" %% "kafka" % "2.4.0",
"org.apache.kafka" % "kafka-streams" % "2.4.0")

Why Spark on AWS EMR doesn't load class from application fat jar?

My Spark application fails to run on an AWS EMR cluster. I noticed that this is because some classes are loaded from the path set by EMR and not from my application jar. For example:
java.lang.NoSuchMethodError: org.apache.avro.Schema$Field.<init>(Ljava/lang/String;Lorg/apache/avro/Schema;Ljava/lang/String;Ljava/lang/Object;)V
at com.sksamuel.avro4s.SchemaFor$.fieldBuilder(SchemaFor.scala:424)
at com.sksamuel.avro4s.SchemaFor$.fieldBuilder(SchemaFor.scala:406)
Here org.apache.avro.Schema is loaded from "jar:file:/usr/lib/spark/jars/avro-1.7.7.jar!/org/apache/avro/Schema.class",
whereas com.sksamuel.avro4s depends on avro 1.8.1. My application is built as a fat jar and includes avro 1.8.1. Why isn't that loaded, instead of 1.7.7 being picked from the classpath set by EMR?
This is just an example; I see the same with other libraries I include in my application. Maybe Spark depends on 1.7.7 and I'd have to shade it when including other dependencies. But why aren't the classes included in my app jar loaded first?
After a bit of reading I realized that this is how class loading works in Spark. There is a hook to change this behavior, spark.executor.userClassPathFirst. It didn't quite work when I tried it, and it's marked as experimental. I guess the best way to proceed is to shade dependencies. Given the number of libraries Spark and its components pull in, this could mean quite a lot of shading for complicated Spark apps.
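For reference, that hook is a plain Spark configuration setting; there is a driver-side twin as well, and both are documented as experimental (these are the values one would try, not a guaranteed fix):

```
# spark-defaults.conf (or pass each as a --conf flag to spark-submit); experimental
spark.driver.userClassPathFirst    true
spark.executor.userClassPathFirst  true
```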
I had the same exception. Based on a recommendation, I was able to resolve it by shading the avro dependency as you suggested:
assemblyShadeRules in assembly := Seq(
ShadeRule.rename("org.apache.avro.**" -> "latest_avro.@1").inAll
)
If it helps, here is my full build.sbt (project info aside):
val sparkVersion = "2.1.0"
val confluentVersion = "3.2.1"
resolvers += "Confluent" at "http://packages.confluent.io/maven"
libraryDependencies ++= Seq(
"org.scala-lang" % "scala-library" % scalaVersion.value % "provided",
"org.scala-lang" % "scala-reflect" % scalaVersion.value % "provided",
"org.apache.spark" %% "spark-streaming" % sparkVersion % "provided",
"org.apache.spark" % "spark-streaming-kafka-0-10_2.11" % sparkVersion,
"org.apache.spark" %% "spark-sql" % sparkVersion % "provided" excludeAll ExclusionRule(organization = "org.scala-lang"),
"org.apache.avro" % "avro" % "1.8.1" % "provided",
"com.databricks" %% "spark-avro" % "3.2.0",
"com.sksamuel.avro4s" %% "avro4s-core" % "1.6.4",
"io.confluent" % "kafka-avro-serializer" % confluentVersion
)
logBuffered in Test := false
assemblyShadeRules in assembly := Seq(
ShadeRule.rename("shapeless.**" -> "new_shapeless.@1").inAll,
ShadeRule.rename("org.apache.avro.**" -> "latest_avro.@1").inAll
)
assemblyMergeStrategy in assembly := {
case PathList("META-INF", xs @ _*) => MergeStrategy.discard
case x =>
val oldStrategy = (assemblyMergeStrategy in assembly).value
oldStrategy(x)
}

Spark 2.0 with Play! 2.5

I'm trying to use Spark 2.0 with Play! 2.5, but I can't get it to work properly (and there seems to be no example on GitHub).
I don't get any compilation errors, but I do get some strange execution errors.
For instance:
Almost all operations on a Dataset or a DataFrame lead to a NullPointerException:
val ds: Dataset[Event] = df.as[Event]
println(ds.count()) //Works well and prints the good results
ds.collect() // --> NullPointerException
ds.show also leads to a NullPointerException.
So there is a big problem somewhere that I'm missing; I suspect incompatible versions. Here is the relevant part of my build.sbt:
object Version {
val scala = "2.11.8"
val spark = "2.0.0"
val postgreSQL = "9.4.1211.jre7"
}
object Library {
val sparkSQL = "org.apache.spark" %% "spark-sql" % Version.spark
val sparkMLLib = "org.apache.spark" %% "spark-mllib" % Version.spark
val sparkCore = "org.apache.spark" %% "spark-core" % Version.spark
val postgreSQL = "org.postgresql" % "postgresql" % Version.postgreSQL
}
object Dependencies {
import Library._
val dependencies = Seq(
sparkSQL,
sparkMLLib,
sparkCore,
postgreSQL)
}
lazy val root = (project in file("."))
.settings(scalaVersion := Version.scala)
.enablePlugins(PlayScala)
libraryDependencies ++= Dependencies.dependencies
dependencyOverrides ++= Set(
"com.fasterxml.jackson.core" % "jackson-databind" % "2.7.4",
"com.fasterxml.jackson.module" %% "jackson-module-scala" % "2.7.4"
)
I ran into the same issue using Spark 2.0.0 with Play 2.5.12 (Java).
The activator seems to include jackson-databind 2.7.8 by default, and that doesn't work with jackson-module-scala.
I cleaned my sbt cache:
rm -r ~/.ivy2/cache
Here is my new build.sbt. It yields a warning while compiling, since Spark 2.0.0 is compiled against jackson-module-scala_2.11:2.6.5, but Spark 2 still seems to work with jackson-module-scala 2.8.7:
scalaVersion := "2.11.8"
libraryDependencies ++= Seq(
"com.fasterxml.jackson.core" % "jackson-core" % "2.8.7",
"com.fasterxml.jackson.core" % "jackson-databind" % "2.8.7",
"com.fasterxml.jackson.core" % "jackson-annotations" % "2.8.7",
"com.fasterxml.jackson.module" %% "jackson-module-scala" % "2.8.7",
"org.apache.spark" % "spark-core_2.11" % "2.0.0",
"org.apache.spark" % "spark-mllib_2.11" % "2.0.0"
)
The NullPointerException stemmed from jackson.databind.JsonMappingException: Incompatible Jackson Version: 2.x.x.
Please read https://github.com/FasterXML/jackson-module-scala/issues/233

KafkaUtils java.lang.NoClassDefFoundError Spark Streaming

I am attempting to print messages consumed from Kafka via Spark streaming. However, I keep running into the following error:
16/09/04 16:03:33 ERROR ApplicationMaster: User class threw exception: java.lang.NoClassDefFoundError: org/apache/spark/streaming/kafka/KafkaUtils$
There have been a few questions asked on StackOverflow regarding this very issue. Ex: https://stackoverflow.com/questions/27710887/kafkautils-class-not-found-in-spark-streaming#=
The answers given have not resolved this issue for me. I have tried creating an "uber jar" using sbt assembly and that did not work either.
Contents of sbt file:
name := "StreamKafka"
version := "1.0"
scalaVersion := "2.10.5"
libraryDependencies ++= Seq(
"org.apache.kafka" % "kafka_2.10" % "0.8.2.1" % "provided",
"org.apache.spark" % "spark-streaming_2.10" % "1.6.1" % "provided",
"org.apache.spark" % "spark-streaming-kafka_2.10" % "1.6.1" % "provided",
"org.apache.spark" % "spark-core_2.10" % "1.6.1" % "provided" exclude("com.esotericsoftware.minlog", "minlog") exclude("com.esotericsoftware.kryo", "kryo")
)
resolvers ++= Seq(
"Maven Central" at "https://repo1.maven.org/maven2/"
)
assemblyMergeStrategy in assembly := {
case m if m.toLowerCase.endsWith("manifest.mf") => MergeStrategy.discard
case m if m.toLowerCase.matches("meta-inf.*\\.sf$") => MergeStrategy.discard
case "log4j.properties" => MergeStrategy.discard
case m if m.toLowerCase.startsWith("meta-inf/services/") => MergeStrategy.filterDistinctLines
case "reference.conf" => MergeStrategy.concat
case _ => MergeStrategy.first
case PathList(ps @ _*) if ps.last endsWith "pom.properties" => MergeStrategy.discard
case x => val oldStrategy = (assemblyMergeStrategy in assembly).value
oldStrategy(x)
}
Posting the answer from the comments so that it will be easy for others to solve the issue.
You have to remove "provided" from the Kafka dependencies:
"org.apache.kafka" % "kafka_2.10" % "0.8.2.1" % "provided",
"org.apache.spark" % "spark-streaming-kafka_2.10" % "1.6.1" % "provided"
For the dependencies to be bundled in the jar, you have to run the sbt assembly command.
Also make sure that you are running the right jar file. You can find its name by checking the sbt assembly command's log.
This may be silly to ask, but does streamkafka_2.10-1.0.jar contain org/apache/spark/streaming/kafka/KafkaUtils.class?
As long as the cluster provides the Kafka and Spark classes at runtime, those dependencies must be excluded from the assembled JAR; if they aren't, you should expect errors like this from the Java class loader during application startup.
An additional benefit of assembling without dependencies is faster deployment. If the cluster provides the dependencies at runtime, it's best to omit them using % "provided".
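To answer the "does the jar contain the class" question without guessing, you can list the jar (jar tf streamkafka_2.10-1.0.jar | grep KafkaUtils) or script the check. Here is a small self-contained Scala sketch; the jar it builds and the entry it probes for are made up purely for the demo:

```scala
import java.io.{File, FileOutputStream}
import java.util.jar.{JarEntry, JarFile, JarOutputStream}

// Does the jar at `jarPath` contain the given entry (e.g. a .class file)?
def containsEntry(jarPath: String, entry: String): Boolean = {
  val jar = new JarFile(jarPath)
  try jar.getEntry(entry) != null
  finally jar.close()
}

// Demo: write a tiny throwaway jar with one entry, then probe it.
val tmp = File.createTempFile("demo", ".jar")
tmp.deleteOnExit()
val out = new JarOutputStream(new FileOutputStream(tmp))
out.putNextEntry(new JarEntry("org/apache/spark/streaming/kafka/KafkaUtils.class"))
out.closeEntry()
out.close()

println(containsEntry(tmp.getPath, "org/apache/spark/streaming/kafka/KafkaUtils.class")) // true
println(containsEntry(tmp.getPath, "missing/Class.class"))                               // false
```

getEntry is a direct lookup in the jar's central directory, so this stays fast even on a large fat-jar.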

Spark crash while reading json file when linked with aws-java-sdk

Let config.json be a small json file :
{
"toto": 1
}
I wrote a simple program that reads the json file with sc.textFile (because the file can be on S3, local, or HDFS, textFile is convenient):
import org.apache.spark.{SparkContext, SparkConf}
object testAwsSdk {
def main( args:Array[String] ):Unit = {
val sparkConf = new SparkConf().setAppName("test-aws-sdk").setMaster("local[*]")
val sc = new SparkContext(sparkConf)
val json = sc.textFile("config.json")
println(json.collect().mkString("\n"))
}
}
The SBT file pulls in only the spark-core library:
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "1.5.1" % "compile"
)
The program works as expected, writing the content of config.json to standard output.
Now I also want to link with aws-java-sdk, Amazon's SDK for accessing S3:
libraryDependencies ++= Seq(
"com.amazonaws" % "aws-java-sdk" % "1.10.30" % "compile",
"org.apache.spark" %% "spark-core" % "1.5.1" % "compile"
)
Executing the same code, Spark throws the following exception:
Exception in thread "main" com.fasterxml.jackson.databind.JsonMappingException: Could not find creator property with name 'id' (in class org.apache.spark.rdd.RDDOperationScope)
at [Source: {"id":"0","name":"textFile"}; line: 1, column: 1]
at com.fasterxml.jackson.databind.JsonMappingException.from(JsonMappingException.java:148)
at com.fasterxml.jackson.databind.DeserializationContext.mappingException(DeserializationContext.java:843)
at com.fasterxml.jackson.databind.deser.BeanDeserializerFactory.addBeanProps(BeanDeserializerFactory.java:533)
at com.fasterxml.jackson.databind.deser.BeanDeserializerFactory.buildBeanDeserializer(BeanDeserializerFactory.java:220)
at com.fasterxml.jackson.databind.deser.BeanDeserializerFactory.createBeanDeserializer(BeanDeserializerFactory.java:143)
at com.fasterxml.jackson.databind.deser.DeserializerCache._createDeserializer2(DeserializerCache.java:409)
at com.fasterxml.jackson.databind.deser.DeserializerCache._createDeserializer(DeserializerCache.java:358)
at com.fasterxml.jackson.databind.deser.DeserializerCache._createAndCache2(DeserializerCache.java:265)
at com.fasterxml.jackson.databind.deser.DeserializerCache._createAndCacheValueDeserializer(DeserializerCache.java:245)
at com.fasterxml.jackson.databind.deser.DeserializerCache.findValueDeserializer(DeserializerCache.java:143)
at com.fasterxml.jackson.databind.DeserializationContext.findRootValueDeserializer(DeserializationContext.java:439)
at com.fasterxml.jackson.databind.ObjectMapper._findRootDeserializer(ObjectMapper.java:3666)
at com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:3558)
at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:2578)
at org.apache.spark.rdd.RDDOperationScope$.fromJson(RDDOperationScope.scala:82)
at org.apache.spark.rdd.RDDOperationScope$$anonfun$5.apply(RDDOperationScope.scala:133)
at org.apache.spark.rdd.RDDOperationScope$$anonfun$5.apply(RDDOperationScope.scala:133)
at scala.Option.map(Option.scala:145)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:133)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.SparkContext.withScope(SparkContext.scala:709)
at org.apache.spark.SparkContext.hadoopFile(SparkContext.scala:1012)
at org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:827)
at org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:825)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.SparkContext.withScope(SparkContext.scala:709)
at org.apache.spark.SparkContext.textFile(SparkContext.scala:825)
at testAwsSdk$.main(testAwsSdk.scala:11)
at testAwsSdk.main(testAwsSdk.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)
Reading the stack trace, it seems that when aws-java-sdk is linked, sc.textFile detects that the file is a json file and tries to parse it with Jackson, assuming a certain format, which of course it cannot find. I need to link with aws-java-sdk, so my questions are:
1. Why does adding aws-java-sdk modify the behavior of spark-core?
2. Is there a workaround (the file can be on HDFS, S3, or local)?
I talked to Amazon support. It is a dependency issue with the Jackson library. In SBT, override Jackson:
libraryDependencies ++= Seq(
"com.amazonaws" % "aws-java-sdk" % "1.10.30" % "compile",
"org.apache.spark" %% "spark-core" % "1.5.1" % "compile"
)
dependencyOverrides ++= Set(
"com.fasterxml.jackson.core" % "jackson-databind" % "2.4.4"
)
Their answer:
We have done this on a Mac, an EC2 (Red Hat AMI) instance, and on EMR (Amazon Linux): three different environments. The root cause of the issue is that sbt builds a dependency graph and then deals with version conflicts by evicting the older version and picking the latest version of the dependent library. In this case, Spark depends on the 2.4 version of the Jackson library while the AWS SDK needs 2.5. So there is a version conflict, and sbt evicts Spark's dependency version (which is older) and picks the AWS SDK's version (which is the latest).
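The eviction behavior described can be illustrated with a toy sketch. This is a deliberate simplification (sbt's real resolver also handles version ranges, qualifiers, and overrides), and the version numbers are illustrative:

```scala
import scala.math.Ordering.Implicits._

// Toy model of sbt's default "latest wins" conflict resolution: when two
// versions of the same module clash, split "2.4.4" into Seq(2, 4, 4) and
// compare lexicographically; the numerically newer version survives and
// the older one is evicted.
def newer(a: String, b: String): String = {
  def parts(v: String): Seq[Int] = v.split('.').toSeq.map(_.toInt)
  if (parts(a) >= parts(b)) a else b
}

// Spark's Jackson vs a hypothetical newer AWS SDK Jackson: the SDK's wins.
println(newer("2.4.4", "2.5.3")) // 2.5.3
```

dependencyOverrides exists precisely to bypass this latest-wins resolution and pin a version explicitly.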
Adding to Boris's answer: if you don't want to pin a fixed version of Jackson (maybe you will upgrade Spark in the future) but still want to discard the one from the AWS SDK, you can do the following:
libraryDependencies ++= Seq(
"com.amazonaws" % "aws-java-sdk" % "1.10.30" % "compile" excludeAll (
ExclusionRule("com.fasterxml.jackson.core", "jackson-databind")
),
"org.apache.spark" %% "spark-core" % "1.5.1" % "compile"
)
