Is Kafka 2.4 compatible with Spark 2.4? - apache-spark

When updating Kafka from 2.3 to 2.4, what considerations (if any) need to be taken into account regarding Spark? Will it affect the existing Spark version?

I consume messages from a Kafka broker in Spark 2.1 using a Kafka 2.4 client.
I think Kafka 2.4 would work with Spark 2.4 as well (you can install both locally and try to produce to and consume from a helloWorld topic).
In case you are interested, here are my sbt dependencies:
val sparkVersion21 = "2.1.0"
lazy val dependencies = Seq(
"org.apache.spark" %% "spark-core" % sparkVersion21,
"org.apache.spark" %% "spark-sql" % sparkVersion21,
"org.apache.spark" %% "spark-hive" % sparkVersion21,
"org.apache.spark" %% "spark-yarn" % sparkVersion21,
"org.apache.kafka" % "kafka-clients" % "2.4.0",
"org.apache.kafka" %% "kafka" % "2.4.0",
"org.apache.kafka" % "kafka-streams" % "2.4.0")
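For the Spark 2.4 case asked about, a broker upgrade does not by itself force a client upgrade, since Kafka brokers continue to accept older clients. A minimal build.sbt sketch under that assumption is shown below: it uses Spark's own Kafka source and lets it bring in its kafka-clients transitively instead of pinning 2.4.0. The version number and the "provided" scope are illustrative assumptions, not taken from the answer above.

```scala
// build.sbt sketch (assumed versions; adjust to the cluster in use)
val sparkVersion = "2.4.0"

libraryDependencies ++= Seq(
  // Spark itself is usually supplied by the cluster at runtime
  "org.apache.spark" %% "spark-sql" % sparkVersion % "provided",
  // Structured Streaming Kafka source; pulls its own kafka-clients transitively
  "org.apache.spark" %% "spark-sql-kafka-0-10" % sparkVersion
)
```

Leaving kafka-clients unpinned keeps the client at the version the Spark connector was built against, which is the least surprising starting point for a helloWorld produce/consume test.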

Related

Spark: Fat-JAR crashing Zeppelin with NullPointerException

Note: This is NOT a duplicate of Getting NullPointerException when running Spark Code in Zeppelin 0.7.1
I've run into this roadblock in Apache Zeppelin on Amazon EMR. I'm trying to load a fat-jar (located on Amazon S3) into the Spark interpreter. Once the fat-jar is loaded, Zeppelin's Spark interpreter refuses to work and fails with the following stack trace:
java.lang.NullPointerException
at org.apache.zeppelin.spark.Utils.invokeMethod(Utils.java:38)
at org.apache.zeppelin.spark.Utils.invokeMethod(Utils.java:33)
at org.apache.zeppelin.spark.SparkInterpreter.createSparkContext_2(SparkInterpreter.java:398)
at org.apache.zeppelin.spark.SparkInterpreter.createSparkContext(SparkInterpreter.java:387)
at org.apache.zeppelin.spark.SparkInterpreter.getSparkContext(SparkInterpreter.java:146)
at org.apache.zeppelin.spark.SparkInterpreter.open(SparkInterpreter.java:843)
at org.apache.zeppelin.interpreter.LazyOpenInterpreter.open(LazyOpenInterpreter.java:70)
at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:491)
at org.apache.zeppelin.scheduler.Job.run(Job.java:175)
at org.apache.zeppelin.scheduler.FIFOScheduler$1.run(FIFOScheduler.java:139)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Even a simple Scala statement like val str: String = "sample string", which doesn't access anything from the jar, produces the above error log. Removing the jar from the interpreter's dependencies fixes the issue, so clearly it has something to do with the jar.
The fat-jar in question has been generated by Jenkins using sbt assembly. The project (whose fat-jar I'm loading) contains two submodules inside a parent module.
While sharing the complete build.sbt files and dependency files of all 3 modules would be impractical, I'm enclosing an exhaustive list of all dependencies and configurations used in them.
AWS dependencies
"com.amazonaws" % "aws-java-sdk-s3" % "1.11.218"
"com.amazonaws" % "aws-java-sdk-emr" % "1.11.218"
"com.amazonaws" % "aws-java-sdk-ec2" % "1.11.218"
Spark dependencies (marked as provided via allSparkdependencies.map(_ % "provided"))
"org.apache.spark" %% "spark-core" % "2.2.0"
"org.apache.spark" %% "spark-sql" % "2.2.0"
"org.apache.spark" %% "spark-hive" % "2.2.0"
"org.apache.spark" %% "spark-streaming" % "2.2.0"
Testing dependencies
"org.scalatest" %% "scalatest" % "3.0.3" % Test
"com.holdenkarau" %% "spark-testing-base" % "2.2.0_0.7.2" % "test"
Other dependencies
"com.github.scopt" %% "scopt" % "3.7.0"
"com.typesafe" % "config" % "1.3.1"
"com.typesafe.play" %% "play-json" % "2.6.6"
"joda-time" % "joda-time" % "2.9.9"
"mysql" % "mysql-connector-java" % "5.1.41"
"com.github.gilbertw1" %% "slack-scala-client" % "0.2.2"
"org.scalaj" %% "scalaj-http" % "2.3.0"
Framework versions
Scala v2.11.11
SBT v1.0.3
Spark v2.2.0
Zeppelin v0.7.3
SBT Configurations
// cache options
offline := false
updateOptions := updateOptions.value.withCachedResolution(true)
// aggregate options
aggregate in assembly := false
aggregate in update := false
// fork options
fork in Test := true
// merge strategy
assemblyMergeStrategy in assembly := {
case m if m.toLowerCase.endsWith("manifest.mf") => MergeStrategy.discard
case m if m.startsWith("META-INF") => MergeStrategy.discard
case PathList("javax", "servlet", _ @ _*) => MergeStrategy.first
case PathList("org", "apache", _ @ _*) => MergeStrategy.first
case PathList("org", "jboss", _ @ _*) => MergeStrategy.first
case "about.html" => MergeStrategy.rename
case "reference.conf" => MergeStrategy.concat
case "application.conf" => MergeStrategy.concat
case _ => MergeStrategy.first
}
While the problem got fixed, honestly speaking I was unable to drill down to the root cause of it (and hence a real solution for it). After rigorously going through forums in vain, I ended up manually comparing (and re-aligning) my code (git diff) with the last known working build. (!)
It's been a while since then and now when I check my git history, I find it (the commit that fixed this problem) contains either refactoring or build-related stuff. Therefore my best guess is that it was a build-related issue. I'm putting down all changes that I made to build.sbt.
I re-iterate that I cannot establish if it was for these particular modifications that the issue got fixed, so keep looking. I'll keep this question open until a conclusive cause (and solution) is found.
Mark the following dependencies as provided (or test-scoped, for the testing library), as suggested here:
"org.apache.spark" %% "spark-core" % sparkVersion
"org.apache.spark" %% "spark-sql" % sparkVersion
"org.apache.spark" %% "spark-hive" % sparkVersion
"org.apache.spark" %% "spark-streaming" % sparkVersion
"com.holdenkarau" %% "spark-testing-base" % "2.2.0_0.7.2" % "test"
Override the following fasterxml.jackson dependencies:
dependencyOverrides += "com.fasterxml.jackson.core" % "jackson-core" % "2.6.5"
dependencyOverrides += "com.fasterxml.jackson.core" % "jackson-databind" % "2.6.5"
dependencyOverrides += "com.fasterxml.jackson.module" % "jackson-module-scala_2.11" % "2.6.5"
I'd like to point out one particular thing: the following LogBack dependency, which I initially held responsible, actually had nothing to do with this (we had faced issues with LogBack in the past, so it was convenient to blame it). While we removed it at the time of resolution, we've added it back since then.
"ch.qos.logback" % "logback-classic" % "1.2.3"
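Since the root cause was never pinned down, one further build-related candidate may be worth ruling out. This is an assumption, not something the question confirms: the merge strategy in the question discards everything under META-INF, which also drops the META-INF/services registrations that some libraries rely on for java.util.ServiceLoader lookups. A minimal sketch that keeps those files, written in the same sbt-assembly syntax as the question:

```scala
// build.sbt sketch: keep service registrations, discard the rest of META-INF.
// Assumption: the sbt-assembly plugin is already enabled, as in the question.
assemblyMergeStrategy in assembly := {
  // Merge service-provider files line by line instead of discarding them
  case PathList("META-INF", "services", _ @ _*)   => MergeStrategy.filterDistinctLines
  case m if m.toLowerCase.endsWith("manifest.mf") => MergeStrategy.discard
  case m if m.startsWith("META-INF")              => MergeStrategy.discard
  case "reference.conf"                           => MergeStrategy.concat
  case _                                          => MergeStrategy.first
}
```

The services case has to come before the generic META-INF case, because the first matching pattern wins.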

Spark Cassandra Connector cannot find java.time.LocalDate

I am trying to query Cassandra using Spark with the Datastax Spark-Cassandra connector. The Spark code is
val conf = new SparkConf(true)
.setMaster("local[4]")
.setAppName("cassandra_query")
.set("spark.cassandra.connection.host", "mycassandrahost")
val sc = new SparkContext(conf)
val rdd = sc.cassandraTable("mykeyspace", "mytable").limit(10)
rdd.foreach(println)
sc.stop()
So it's just running locally for now. My build.sbt file looks like this:
scalaVersion := "2.11.8"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "2.0.0",
"org.apache.spark" %% "spark-sql" % "2.0.0",
"cc.mallet" % "mallet" % "2.0.7",
"com.amazonaws" % "aws-java-sdk" % "1.11.229",
"com.datastax.spark" % "spark-cassandra-connector_2.11" % "2.0.0"
)
I create a fat jar using the assembly plugin, and when I submit the Spark job I get the following error:
Lost task 6.0 in stage 0.0 (TID 6) on executor localhost: java.io.IOException (Exception during preparation of SELECT "pcid", "content" FROM "mykeyspace"."mytable" WHERE token("pcid") > ? AND token("pcid") <= ? LIMIT 10 ALLOW FILTERING: class java.time.LocalDate in JavaMirror with org.apache.spark.util.MutableURLClassLoader@923288b of type class org.apache.spark.util.MutableURLClassLoader with classpath [file:/root/GenderPrediction-assembly-0.1.jar] and parent being sun.misc.Launcher$AppClassLoader@1e69dff6 of type class sun.misc.Launcher$AppClassLoader with classpath [file:/root/spark/conf/,file:/root/spark/jars/datanucleus-core-3.2.10.jar,...not found.
(Note: there were too many jars listed in the above classpath so I just replaced them with a "..." )
So it looks like it can't find java.time.LocalDate - how can I fix this?
I found another post that looks similar: spark job cassandra error
However, it is a different class that cannot be found there, so I'm not sure if it helps.
java.time.LocalDate is part of Java 8, and it seems you are running a Java version lower than 8.
spark-cassandra-connector 2.0 requires Java 8.
Spark Cassandra version compatibility
Can you please try this:
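To check the Java-version hypothesis before changing any dependencies, a small standalone probe can be run with the same java binary that launches Spark. This is a minimal sketch; the object name JavaVersionCheck and the helper hasClass are made up for illustration.

```scala
object JavaVersionCheck {
  // True if the named class can be loaded by this JVM
  def hasClass(name: String): Boolean =
    scala.util.Try(Class.forName(name)).isSuccess

  def main(args: Array[String]): Unit = {
    // Which JVM is actually running this code
    println("java.version = " + System.getProperty("java.version"))
    println("java.home    = " + System.getProperty("java.home"))
    // java.time was introduced in Java 8, so this is false on older JVMs
    println("java.time.LocalDate available: " + hasClass("java.time.LocalDate"))
  }
}
```

With setMaster("local[4]") the driver and executors share one JVM, so a single check is enough; on a real cluster the executors may run a different Java installation than the driver.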
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "2.0.0",
"org.apache.spark" %% "spark-sql" % "2.0.0",
"cc.mallet" % "mallet" % "2.0.7",
"com.amazonaws" % "aws-java-sdk" % "1.11.229",
"com.datastax.spark" % "spark-cassandra-connector_2.11" % "2.0.0" exclude("joda-time", "joda-time"),
"joda-time" % "joda-time" % "2.3"
)

Why Spark on AWS EMR doesn't load class from application fat jar?

My Spark application fails to run on an AWS EMR cluster. I noticed that this is because some classes are loaded from the path set by EMR and not from my application jar. For example:
java.lang.NoSuchMethodError: org.apache.avro.Schema$Field.<init>(Ljava/lang/String;Lorg/apache/avro/Schema;Ljava/lang/String;Ljava/lang/Object;)V
at com.sksamuel.avro4s.SchemaFor$.fieldBuilder(SchemaFor.scala:424)
at com.sksamuel.avro4s.SchemaFor$.fieldBuilder(SchemaFor.scala:406)
Here org.apache.avro.Schema is loaded from "jar:file:/usr/lib/spark/jars/avro-1.7.7.jar!/org/apache/avro/Schema.class"
Whereas com.sksamuel.avro4s depends on avro 1.8.1. My application is built as a fat jar and contains avro 1.8.1. Why isn't that version loaded, instead of 1.7.7 being picked from the classpath set by EMR?
This is just an example. I see the same with other libraries I include in my application. Maybe Spark depends on 1.7.7 and I'd have to shade when including other dependencies. But why are the classes included in my app jar not loaded first?
After a bit of reading I realized that this is how class loading works in Spark. There is a hook to change this behavior: spark.executor.userClassPathFirst. It didn't quite work when I tried it, and it's marked as experimental. I guess the best way to proceed is to shade dependencies. Given the number of libraries that Spark and its components pull in, this might mean quite a lot of shading for complicated Spark apps.
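The kind of check used above to see where org.apache.avro.Schema came from can be sketched as a small helper. The object name ClassOrigin and the method locationOf are illustrative; inside the Spark job one would pass classOf[org.apache.avro.Schema] instead of the placeholder classes used here.

```scala
object ClassOrigin {
  // Returns the jar or directory a class was loaded from,
  // or None when the JVM does not expose a code source for it.
  def locationOf(cls: Class[_]): Option[String] =
    for {
      domain <- Option(cls.getProtectionDomain)
      source <- Option(domain.getCodeSource)
      url    <- Option(source.getLocation)
    } yield url.toString

  def main(args: Array[String]): Unit = {
    // Placeholder classes; substitute the class that triggers NoSuchMethodError
    println(locationOf(classOf[String]))
    println(locationOf(ClassOrigin.getClass))
  }
}
```

Running this on both the driver and inside a task shows whether the class is served from the application jar or from the cluster's own jars directory, which tells you whether shading is needed for that library.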
I had the same exception as you. Based on a recommendation, I was able to resolve this exception by shading the avro dependency as you suggested:
assemblyShadeRules in assembly := Seq(
ShadeRule.rename("org.apache.avro.**" -> "latest_avro.@1").inAll
)
If it helps, here is my full build.sbt (project info aside):
val sparkVersion = "2.1.0"
val confluentVersion = "3.2.1"
resolvers += "Confluent" at "http://packages.confluent.io/maven"
libraryDependencies ++= Seq(
"org.scala-lang" % "scala-library" % scalaVersion.value % "provided",
"org.scala-lang" % "scala-reflect" % scalaVersion.value % "provided",
"org.apache.spark" %% "spark-streaming" % sparkVersion % "provided",
"org.apache.spark" % "spark-streaming-kafka-0-10_2.11" % sparkVersion,
"org.apache.spark" %% "spark-sql" % sparkVersion % "provided" excludeAll ExclusionRule(organization = "org.scala-lang"),
"org.apache.avro" % "avro" % "1.8.1" % "provided",
"com.databricks" %% "spark-avro" % "3.2.0",
"com.sksamuel.avro4s" %% "avro4s-core" % "1.6.4",
"io.confluent" % "kafka-avro-serializer" % confluentVersion
)
logBuffered in Test := false
assemblyShadeRules in assembly := Seq(
ShadeRule.rename("shapeless.**" -> "new_shapeless.@1").inAll,
ShadeRule.rename("org.apache.avro.**" -> "latest_avro.@1").inAll
)
assemblyMergeStrategy in assembly := {
case PathList("META-INF", xs @ _*) => MergeStrategy.discard
case x =>
val oldStrategy = (assemblyMergeStrategy in assembly).value
oldStrategy(x)
}

Spark 2.0 with Play! 2.5

I'm trying to use Spark 2.0 with Play! 2.5 but I can't get it to work properly (and there seems to be no example on GitHub).
I don't get any compilation errors, but I do get some strange runtime errors.
For instance:
Almost all operations on a Dataset or a DataFrame lead to a NullPointerException:
val ds: Dataset[Event] = df.as[Event]
println(ds.count()) // Works well and prints the correct result
ds.collect() // --> NullPointerException
ds.show also leads to a NullPointerException.
So there is a big problem somewhere that I'm missing, and I think it comes from incompatible versions. Here is the relevant part of my build.sbt:
object Version {
val scala = "2.11.8"
val spark = "2.0.0"
val postgreSQL = "9.4.1211.jre7"
}
object Library {
val sparkSQL = "org.apache.spark" %% "spark-sql" % Version.spark
val sparkMLLib = "org.apache.spark" %% "spark-mllib" % Version.spark
val sparkCore = "org.apache.spark" %% "spark-core" % Version.spark
val postgreSQL = "org.postgresql" % "postgresql" % Version.postgreSQL
}
object Dependencies {
import Library._
val dependencies = Seq(
sparkSQL,
sparkMLLib,
sparkCore,
postgreSQL)
}
lazy val root = (project in file("."))
.settings(scalaVersion := Version.scala)
.enablePlugins(PlayScala)
libraryDependencies ++= Dependencies.dependencies
dependencyOverrides ++= Set(
"com.fasterxml.jackson.core" % "jackson-databind" % "2.7.4",
"com.fasterxml.jackson.module" %% "jackson-module-scala" % "2.7.4"
)
I ran into the same issue using Spark 2.0.0 with Play 2.5.12 (Java).
Activator seems to include com.fasterxml.jackson.core:jackson-databind 2.7.8 by default, and it doesn't work with jackson-module-scala.
I cleaned my sbt cache:
rm -r ~/.ivy2/cache
My new build.sbt is below. It yields a warning while compiling, since Spark 2.0.0 is compiled with jackson-module-scala_2.11:2.6.5, but Spark 2 still seems to work with jackson-module-scala 2.8.7:
scalaVersion := "2.11.8"
libraryDependencies ++= Seq(
"com.fasterxml.jackson.core" % "jackson-core" % "2.8.7",
"com.fasterxml.jackson.core" % "jackson-databind" % "2.8.7",
"com.fasterxml.jackson.core" % "jackson-annotations" % "2.8.7",
"com.fasterxml.jackson.module" %% "jackson-module-scala" % "2.8.7",
"org.apache.spark" % "spark-core_2.11" % "2.0.0",
"org.apache.spark" % "spark-mllib_2.11" % "2.0.0"
)
The NullPointerException is caused by jackson.databind.JsonMappingException: Incompatible Jackson Version: 2.x.x
Please read https://github.com/FasterXML/jackson-module-scala/issues/233

Spark crash while reading json file when linked with aws-java-sdk

Let config.json be a small JSON file:
{
"toto": 1
}
I wrote a simple program that reads the JSON file with sc.textFile (because the file can be on S3, local or HDFS, so textFile is convenient):
import org.apache.spark.{SparkContext, SparkConf}
object testAwsSdk {
def main( args:Array[String] ):Unit = {
val sparkConf = new SparkConf().setAppName("test-aws-sdk").setMaster("local[*]")
val sc = new SparkContext(sparkConf)
val json = sc.textFile("config.json")
println(json.collect().mkString("\n"))
}
}
The SBT file pulls in only the spark-core library:
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "1.5.1" % "compile"
)
The program works as expected, writing the content of config.json to standard output.
Now I also want to link with aws-java-sdk, Amazon's SDK for accessing S3:
libraryDependencies ++= Seq(
"com.amazonaws" % "aws-java-sdk" % "1.10.30" % "compile",
"org.apache.spark" %% "spark-core" % "1.5.1" % "compile"
)
Executing the same code, Spark throws the following exception:
Exception in thread "main" com.fasterxml.jackson.databind.JsonMappingException: Could not find creator property with name 'id' (in class org.apache.spark.rdd.RDDOperationScope)
at [Source: {"id":"0","name":"textFile"}; line: 1, column: 1]
at com.fasterxml.jackson.databind.JsonMappingException.from(JsonMappingException.java:148)
at com.fasterxml.jackson.databind.DeserializationContext.mappingException(DeserializationContext.java:843)
at com.fasterxml.jackson.databind.deser.BeanDeserializerFactory.addBeanProps(BeanDeserializerFactory.java:533)
at com.fasterxml.jackson.databind.deser.BeanDeserializerFactory.buildBeanDeserializer(BeanDeserializerFactory.java:220)
at com.fasterxml.jackson.databind.deser.BeanDeserializerFactory.createBeanDeserializer(BeanDeserializerFactory.java:143)
at com.fasterxml.jackson.databind.deser.DeserializerCache._createDeserializer2(DeserializerCache.java:409)
at com.fasterxml.jackson.databind.deser.DeserializerCache._createDeserializer(DeserializerCache.java:358)
at com.fasterxml.jackson.databind.deser.DeserializerCache._createAndCache2(DeserializerCache.java:265)
at com.fasterxml.jackson.databind.deser.DeserializerCache._createAndCacheValueDeserializer(DeserializerCache.java:245)
at com.fasterxml.jackson.databind.deser.DeserializerCache.findValueDeserializer(DeserializerCache.java:143)
at com.fasterxml.jackson.databind.DeserializationContext.findRootValueDeserializer(DeserializationContext.java:439)
at com.fasterxml.jackson.databind.ObjectMapper._findRootDeserializer(ObjectMapper.java:3666)
at com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:3558)
at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:2578)
at org.apache.spark.rdd.RDDOperationScope$.fromJson(RDDOperationScope.scala:82)
at org.apache.spark.rdd.RDDOperationScope$$anonfun$5.apply(RDDOperationScope.scala:133)
at org.apache.spark.rdd.RDDOperationScope$$anonfun$5.apply(RDDOperationScope.scala:133)
at scala.Option.map(Option.scala:145)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:133)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.SparkContext.withScope(SparkContext.scala:709)
at org.apache.spark.SparkContext.hadoopFile(SparkContext.scala:1012)
at org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:827)
at org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:825)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.SparkContext.withScope(SparkContext.scala:709)
at org.apache.spark.SparkContext.textFile(SparkContext.scala:825)
at testAwsSdk$.main(testAwsSdk.scala:11)
at testAwsSdk.main(testAwsSdk.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)
Reading the stack trace, it seems that when aws-java-sdk is linked, sc.textFile detects that the file is a JSON file and tries to parse it with Jackson assuming a certain format, which of course it cannot find. I need to link with aws-java-sdk, so my questions are:
1- Why adding aws-java-sdk modifies the behavior of spark-core?
2- Is there a work-around (the file can be on HDFS, S3 or local)?
I talked to Amazon support. It is a dependency issue with the Jackson library. In SBT, override Jackson:
libraryDependencies ++= Seq(
"com.amazonaws" % "aws-java-sdk" % "1.10.30" % "compile",
"org.apache.spark" %% "spark-core" % "1.5.1" % "compile"
)
dependencyOverrides ++= Set(
"com.fasterxml.jackson.core" % "jackson-databind" % "2.4.4"
)
Their answer:
We have done this on a Mac, Ec2 (redhat AMI) instance and on EMR (Amazon Linux). 3 Different environments. Root cause of the issue is that sbt builds a dependency graph and then deals with the issue of version conflicts by evicting the older version and picking the latest version of the dependent library. In this case, the spark depends on the 2.4 version of jackson library while the AWS SDK needs 2.5. So there is a version conflict and the sbt evicts spark's dependency version (which is older) and picks the AWS SDK version (which is the latest).
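The eviction described here happens without failing the build. As a hedged sketch, assuming an sbt version that resolves dependencies through Ivy (as sbt did at the time of this question), the conflict manager can be switched to strict mode so that a version conflict such as the Jackson one stops the build instead of being resolved silently:

```scala
// build.sbt sketch: fail dependency resolution on version conflicts
// instead of silently evicting the older version.
// Assumption: Ivy-based resolution; newer resolvers may not honor this setting.
conflictManager := ConflictManager.strict

libraryDependencies ++= Seq(
  "com.amazonaws" % "aws-java-sdk" % "1.10.30" % "compile",
  "org.apache.spark" %% "spark-core" % "1.5.1" % "compile"
)
```

Strict mode can be noisy on large dependency graphs, so it may be more practical as a temporary diagnostic than as a permanent setting; running the evicted task at the sbt prompt lists the evictions without failing the build.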
Adding to Boris' answer, if you don't want to use a fixed version of Jackson (maybe in the future you will upgrade Spark) but still want to discard the one from AWS, you can do the following:
libraryDependencies ++= Seq(
"com.amazonaws" % "aws-java-sdk" % "1.10.30" % "compile" excludeAll (
ExclusionRule("com.fasterxml.jackson.core", "jackson-databind")
),
"org.apache.spark" %% "spark-core" % "1.5.1" % "compile"
)

Resources