Spark 2.0 with Play! 2.5 - apache-spark

I'm trying to use Spark 2.0 with Play! 2.5, but I can't get it to work properly (and there seems to be no example on GitHub).
I get no compilation errors, but I do get some strange runtime errors.
For instance:
Almost every operation on a Dataset or a DataFrame leads to a NullPointerException:
val ds: Dataset[Event] = df.as[Event]
println(ds.count()) // Works well and prints the correct result
ds.collect() // --> NullPointerException
ds.show also leads to a NullPointerException.
So I'm missing a big problem somewhere, and I suspect it comes from incompatible versions. Here is the relevant part of my build.sbt:
object Version {
val scala = "2.11.8"
val spark = "2.0.0"
val postgreSQL = "9.4.1211.jre7"
}
object Library {
val sparkSQL = "org.apache.spark" %% "spark-sql" % Version.spark
val sparkMLLib = "org.apache.spark" %% "spark-mllib" % Version.spark
val sparkCore = "org.apache.spark" %% "spark-core" % Version.spark
val postgreSQL = "org.postgresql" % "postgresql" % Version.postgreSQL
}
object Dependencies {
import Library._
val dependencies = Seq(
sparkSQL,
sparkMLLib,
sparkCore,
postgreSQL)
}
lazy val root = (project in file("."))
.settings(scalaVersion := Version.scala)
.enablePlugins(PlayScala)
libraryDependencies ++= Dependencies.dependencies
dependencyOverrides ++= Set(
"com.fasterxml.jackson.core" % "jackson-databind" % "2.7.4",
"com.fasterxml.jackson.module" %% "jackson-module-scala" % "2.7.4"
)

I ran into the same issue using Spark 2.0.0 with Play 2.5.12 (Java).
Activator seems to include jackson-databind 2.7.8 (com.fasterxml.jackson.core) by default, and it doesn't work with jackson-module-scala.
I cleaned my sbt cache:
rm -r ~/.ivy2/cache
My new build.sbt is below. It yields a warning while compiling, since Spark 2.0.0 is compiled against jackson-module-scala_2.11:2.6.5, but Spark 2 still seems to work with jackson-module-scala 2.8.7:
scalaVersion := "2.11.8"
libraryDependencies ++= Seq(
"com.fasterxml.jackson.core" % "jackson-core" % "2.8.7",
"com.fasterxml.jackson.core" % "jackson-databind" % "2.8.7",
"com.fasterxml.jackson.core" % "jackson-annotations" % "2.8.7",
"com.fasterxml.jackson.module" %% "jackson-module-scala" % "2.8.7",
"org.apache.spark" % "spark-core_2.11" % "2.0.0",
"org.apache.spark" % "spark-mllib_2.11" % "2.0.0"
)
The NullPointerException was caused by jackson.databind.JsonMappingException: Incompatible Jackson Version: 2.x.x
For details, see https://github.com/FasterXML/jackson-module-scala/issues/233
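One way to keep the Jackson artifacts from drifting apart is to pin them all through a single version value. The build.sbt fragment below is a minimal sketch of that idea, not a confirmed fix: the jacksonVersion name is made up for illustration, and 2.8.7 is simply the version reported to work above.

```scala
// build.sbt fragment (sketch). `jacksonVersion` is a hypothetical helper name.
val jacksonVersion = "2.8.7" // taken from the answer above; adjust as needed

// Force every Jackson artifact to the same version so that
// jackson-module-scala and jackson-databind cannot end up mismatched.
dependencyOverrides += "com.fasterxml.jackson.core" % "jackson-core" % jacksonVersion
dependencyOverrides += "com.fasterxml.jackson.core" % "jackson-databind" % jacksonVersion
dependencyOverrides += "com.fasterxml.jackson.core" % "jackson-annotations" % jacksonVersion
dependencyOverrides += "com.fasterxml.jackson.module" %% "jackson-module-scala" % jacksonVersion
```

Running sbt evicted afterwards should show which Jackson versions were replaced, which helps confirm that the overrides took effect.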

Related

Is Kafka 2.4 compatible with Spark 2.4?

When updating Kafka from 2.3 to 2.4, what considerations (if any) need to be taken into account regarding Spark? Will it affect the existing Spark version?
I'm using a Kafka broker and consuming messages from Spark 2.1 with a Kafka 2.4 client.
So I think Kafka 2.4 should work with Spark 2.4 (you can install both locally and try to produce to and consume from a helloWorld topic).
In case you are interested in the sbt dependencies:
val sparkVersion21 = "2.1.0"
lazy val dependencies = Seq(
"org.apache.spark" %% "spark-core" % sparkVersion21,
"org.apache.spark" %% "spark-sql" % sparkVersion21,
"org.apache.spark" %% "spark-hive" % sparkVersion21,
"org.apache.spark" %% "spark-yarn" % sparkVersion21,
"org.apache.kafka" % "kafka-clients" % "2.4.0",
"org.apache.kafka" %% "kafka" % "2.4.0",
"org.apache.kafka" % "kafka-streams" % "2.4.0")
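If the target is Spark 2.4 rather than 2.1, a variant of the dependency list could look like the sketch below. It assumes Structured Streaming with the spark-sql-kafka-0-10 connector and Spark 2.4.0; the connector brings in its own kafka-clients version transitively, so no explicit Kafka client is declared. Whether that client behaves well against a 2.4 broker in a given setup is still something to verify with a small test topic.

```scala
// build.sbt fragment (sketch, assuming Spark 2.4.0).
// `sparkVersion24` is a hypothetical name mirroring `sparkVersion21` above.
val sparkVersion24 = "2.4.0"

lazy val dependencies = Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion24,
  "org.apache.spark" %% "spark-sql" % sparkVersion24,
  // Kafka source/sink for Structured Streaming; pulls in kafka-clients transitively.
  "org.apache.spark" %% "spark-sql-kafka-0-10" % sparkVersion24
)
```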

KryoException: Unable to find class with spark structured streaming

1-The Problem
I have a Spark program that makes use of Kryo, but not as part of the Spark mechanics. More specifically, I am using Spark Structured Streaming connected to Kafka.
I read binary values coming from Kafka and decode them on my own.
I get an exception while attempting to deserialize data with Kryo. However, this only happens when I package my program and run it on a Spark standalone cluster. It does not happen when I run it within IntelliJ, i.e. in Spark local mode (dev mode).
The exception that I get is as follows:
Caused by: com.esotericsoftware.kryo.KryoException: Unable to find class: com.elsevier.entellect.commons.package$RawData
Note that RawData is a case class of my own, located in one of the sub-projects of my multi-project build.
For context, more details are below:
2-build.sbt:
lazy val commonSettings = Seq(
organization := "com.elsevier.entellect",
version := "0.1.0-SNAPSHOT",
scalaVersion := "2.11.12",
resolvers += Resolver.mavenLocal,
updateOptions := updateOptions.value.withLatestSnapshots(false)
)
lazy val entellectextractors = (project in file("."))
.settings(commonSettings).aggregate(entellectextractorscommon, entellectextractorsfetchers, entellectextractorsmappers, entellectextractorsconsumers)
lazy val entellectextractorscommon = project
.settings(
commonSettings,
libraryDependencies ++= Seq(
"com.esotericsoftware" % "kryo" % "5.0.0-RC1",
"com.github.romix.akka" %% "akka-kryo-serialization" % "0.5.0" excludeAll(excludeJpountz),
"org.apache.kafka" % "kafka-clients" % "1.0.1",
"com.typesafe.akka" %% "akka-stream" % "2.5.16",
"com.typesafe.akka" %% "akka-http-spray-json" % "10.1.4",
"com.typesafe.akka" % "akka-slf4j_2.11" % "2.5.16",
"ch.qos.logback" % "logback-classic" % "1.2.3"
)
)
lazy val entellectextractorsfetchers = project
.settings(
commonSettings,
libraryDependencies ++= Seq(
"com.typesafe.akka" %% "akka-stream-kafka" % "0.22",
"com.typesafe.slick" %% "slick" % "3.2.3",
"com.typesafe.slick" %% "slick-hikaricp" % "3.2.3",
"com.lightbend.akka" %% "akka-stream-alpakka-slick" % "0.20")
)
.dependsOn(entellectextractorscommon)
lazy val entellectextractorsconsumers = project
.settings(
commonSettings,
libraryDependencies ++= Seq(
"com.typesafe.akka" %% "akka-stream-kafka" % "0.22")
)
.dependsOn(entellectextractorscommon)
lazy val entellectextractorsmappers = project
.settings(
commonSettings,
mainClass in assembly := Some("entellect.extractors.mappers.NormalizedDataMapper"),
assemblyMergeStrategy in assembly := {
case PathList("META-INF", "services", "org.apache.spark.sql.sources.DataSourceRegister") => MergeStrategy.concat
case PathList("META-INF", xs @ _*) => MergeStrategy.discard
case x => MergeStrategy.first},
dependencyOverrides += "com.fasterxml.jackson.core" % "jackson-core" % "2.9.5",
dependencyOverrides += "com.fasterxml.jackson.core" % "jackson-databind" % "2.9.5",
dependencyOverrides += "com.fasterxml.jackson.module" % "jackson-module-scala_2.11" % "2.9.5",
dependencyOverrides += "org.apache.jena" % "apache-jena" % "3.8.0",
libraryDependencies ++= Seq(
"org.apache.jena" % "apache-jena" % "3.8.0",
"edu.isi" % "karma-offline" % "0.0.1-SNAPSHOT",
"org.apache.spark" % "spark-core_2.11" % "2.3.1" % "provided",
"org.apache.spark" % "spark-sql_2.11" % "2.3.1" % "provided",
"org.apache.spark" %% "spark-sql-kafka-0-10" % "2.3.1"
//"com.datastax.cassandra" % "cassandra-driver-core" % "3.5.1"
))
.dependsOn(entellectextractorscommon)
lazy val excludeJpountz = ExclusionRule(organization = "net.jpountz.lz4", name = "lz4")
The sub-project that contains spark code is entellectextractorsmappers. The sub-project that contains the case class RawData that can't be found is entellectextractorscommon. entellectextractorsmappers explicitly depends on entellectextractorscommon.
3-The difference between submitting to a local standalone cluster and running in local development mode:
When I submit to the cluster, my Spark dependencies are as follows:
"org.apache.spark" % "spark-core_2.11" % "2.3.1" % "provided",
"org.apache.spark" % "spark-sql_2.11" % "2.3.1" % "provided",
When I run in local development mode (no submit script), they become:
"org.apache.spark" % "spark-core_2.11" % "2.3.1",
"org.apache.spark" % "spark-sql_2.11" % "2.3.1",
That is, in local dev I need the dependencies on the classpath, while when submitting to the cluster in standalone mode they are already present on the cluster, so I mark them as provided.
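Instead of editing the dependency scopes by hand for each mode, one commonly used approach is to keep the Spark dependencies as provided and put them back on the classpath only for sbt run. The fragment below is a sketch in sbt 0.13-style syntax (matching the `in` syntax used in this build). It only affects runs started through sbt, so an IntelliJ run configuration would need its own equivalent setting.

```scala
// In the settings of the Spark sub-project (sketch).
// Makes `sbt run` use the full compile classpath, which includes "provided"
// dependencies, while the assembly jar still leaves them out.
run in Compile := Defaults.runTask(
  fullClasspath in Compile,
  mainClass in (Compile, run),
  runner in (Compile, run)
).evaluated
```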
4-How I submit:
spark-submit --class entellect.extractors.mappers.DeNormalizedDataMapper --name DeNormalizedDataMapper --master spark://MaatPro.local:7077 --deploy-mode cluster --executor-memory 14G --num-executors 1 --conf spark.sql.shuffle.partitions=7 "/Users/maatari/IdeaProjects/EntellectExtractors/entellectextractorsmappers/target/scala-2.11/entellectextractorsmappers-assembly-0.1.0-SNAPSHOT.jar"
5-How I use Kryo:
5.1-Declaration and Registration
In the entellectextractorscommon project I have a package object with the following:
package object commons {
case class RawData(modelName: String,
modelFile: String,
sourceType: String,
deNormalizedVal: String,
normalVal: Map[String, String])
object KryoContext {
lazy val kryoPool = new Pool[Kryo](true, false, 16) {
protected def create(): Kryo = {
val kryo = new Kryo()
kryo.setRegistrationRequired(false)
kryo.addDefaultSerializer(classOf[scala.collection.Map[_,_]], classOf[ScalaImmutableAbstractMapSerializer])
kryo.addDefaultSerializer(classOf[scala.collection.generic.MapFactory[scala.collection.Map]], classOf[ScalaImmutableAbstractMapSerializer])
kryo.addDefaultSerializer(classOf[RawData], classOf[ScalaProductSerializer])
kryo
}
}
lazy val outputPool = new Pool[Output](true, false, 16) {
protected def create: Output = new Output(4096)
}
lazy val inputPool = new Pool[Input](true, false, 16) {
protected def create: Input = new Input(4096)
}
}
object ExecutionContext {
implicit lazy val system = ActorSystem()
implicit lazy val mat = ActorMaterializer()
implicit lazy val ec = system.dispatcher
}
}
5.2-Usage
In entellectextractorsmappers (where the Spark program is), I work with mapPartitions. In it, I have a method that decodes the data coming from Kafka using Kryo, as follows:
def decodeData(rowOfBinaryList: List[Row], kryoPool: Pool[Kryo], inputPool: Pool[Input]): List[RawData] = {
val kryo = kryoPool.obtain()
val input = inputPool.obtain()
val data = rowOfBinaryList.map(r => r.getAs[Array[Byte]]("message")).map{ binaryMsg =>
input.setInputStream(new ByteArrayInputStream(binaryMsg))
val value = kryo.readClassAndObject(input).asInstanceOf[RawData]
input.close()
value
}
kryoPool.free(kryo)
inputPool.free(input)
data
}
Note: the object KryoContext plus the lazy val ensures that kryoPool is instantiated once per JVM. I don't think the issue comes from that, however.
I read elsewhere a hint about issues between the class loaders used by Spark and by Kryo, but I'm not sure I really understand what is going on.
If someone could give me some pointers, that would help, because I have no idea where to start. Why would it work in local mode and not in cluster mode? Does the provided scope mess up the dependencies and create some issue with Kryo? Is it the sbt-assembly merge strategy that messes things up?
There are many possible leads; if anyone could help me narrow this down, that would be awesome!
So far, I have solved the problem by picking up the "enclosing" class loader, which I suppose is the one from Spark. This is after reading a few comments here and there about class loader issues between Kryo and Spark:
lazy val kryoPool = new Pool[Kryo](true, false, 16) {
protected def create(): Kryo = {
val cl = Thread.currentThread().getContextClassLoader()
val kryo = new Kryo()
kryo.setClassLoader(cl)
kryo.setRegistrationRequired(false)
kryo.addDefaultSerializer(classOf[scala.collection.Map[_,_]], classOf[ScalaImmutableAbstractMapSerializer])
kryo.addDefaultSerializer(classOf[scala.collection.generic.MapFactory[scala.collection.Map]], classOf[ScalaImmutableAbstractMapSerializer])
kryo.addDefaultSerializer(classOf[RawData], classOf[ScalaProductSerializer])
kryo
}
}
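A likely explanation (an assumption, not confirmed in this thread) is that on the cluster the Kryo classes and the application classes are loaded by different class loaders, so a lookup by name through Kryo's default loader misses RawData. The sketch below uses only the JDK to check whether a given loader can see a class; ClassLoaderProbe and canLoad are hypothetical names introduced for illustration.

```scala
object ClassLoaderProbe {
  /** Returns true if `loader` can resolve `className` (without initializing the class). */
  def canLoad(className: String, loader: ClassLoader): Boolean =
    try {
      Class.forName(className, false, loader)
      true
    } catch {
      case _: ClassNotFoundException => false
      case _: LinkageError           => false
    }

  def main(args: Array[String]): Unit = {
    // Class to look up; defaults to a class every Scala program has on its classpath.
    val name = args.headOption.getOrElse("scala.Option")
    val contextLoader = Thread.currentThread().getContextClassLoader
    val ownLoader = getClass.getClassLoader
    println(s"context loader sees $name: ${canLoad(name, contextLoader)}")
    println(s"defining loader sees $name: ${canLoad(name, ownLoader)}")
  }
}
```

Calling canLoad("com.elsevier.entellect.commons.package$RawData", classOf[Kryo].getClassLoader) from inside the mapPartitions function, and again with the thread context class loader, would show whether the two loaders really differ on the executors.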

Spark Cassandra Connector cannot find java.time.LocalDate

I am trying to query Cassandra using Spark with the Datastax Spark-Cassandra connector. The Spark code is
val conf = new SparkConf(true)
.setMaster("local[4]")
.setAppName("cassandra_query")
.set("spark.cassandra.connection.host", "mycassandrahost")
val sc = new SparkContext(conf)
val rdd = sc.cassandraTable("mykeyspace", "mytable").limit(10)
rdd.foreach(println)
sc.stop()
So it's just running locally for now. My build.sbt file looks like this:
scalaVersion := "2.11.8"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "2.0.0",
"org.apache.spark" %% "spark-sql" % "2.0.0",
"cc.mallet" % "mallet" % "2.0.7",
"com.amazonaws" % "aws-java-sdk" % "1.11.229",
"com.datastax.spark" % "spark-cassandra-connector_2.11" % "2.0.0"
)
I create a fat jar using the assembly plugin, and when I submit the Spark job I get the following error:
Lost task 6.0 in stage 0.0 (TID 6) on executor localhost: java.io.IOException (Exception during preparation of SELECT "pcid", "content" FROM "mykeyspace"."mytable" WHERE token("pcid") > ? AND token("pcid") <= ? LIMIT 10 ALLOW FILTERING: class java.time.LocalDate in JavaMirror with org.apache.spark.util.MutableURLClassLoader@923288b of type class org.apache.spark.util.MutableURLClassLoader with classpath [file:/root/GenderPrediction-assembly-0.1.jar] and parent being sun.misc.Launcher$AppClassLoader@1e69dff6 of type class sun.misc.Launcher$AppClassLoader with classpath [file:/root/spark/conf/,file:/root/spark/jars/datanucleus-core-3.2.10.jar,...not found.
(Note: there were too many jars listed in the above classpath, so I replaced them with "...".)
So it looks like it can't find java.time.LocalDate. How can I fix this?
I found another post that looks similar: spark job cassandra error
However, it is a different class that cannot be found, so I'm not sure if it helps.
java.time.LocalDate is part of Java 8, and it seems you are running a Java version lower than 8.
spark-cassandra-connector 2.0 requires Java 8.
Spark Cassandra version compatibility
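To check whether the running JVM really lacks java.time, a small JDK-only probe can be run with the same Java installation that launches the Spark job. This is a sketch; JavaTimeCheck and hasJavaTime are hypothetical names introduced for illustration.

```scala
object JavaTimeCheck {
  /** Returns true if java.time.LocalDate can be resolved by the current JVM. */
  def hasJavaTime: Boolean =
    try {
      Class.forName("java.time.LocalDate")
      true
    } catch {
      case _: ClassNotFoundException => false
    }

  def main(args: Array[String]): Unit = {
    // The version string is reported by the JVM itself, e.g. "1.7.0_80" for Java 7.
    println(s"java.version = ${System.getProperty("java.version")}")
    println(s"java.time.LocalDate available: $hasJavaTime")
  }
}
```

If the job runs on a cluster, the same property can be read inside a task (for example in a map over a small RDD) to see which Java version the executors use, since that may differ from the driver.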
Can you please try this:
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "2.0.0",
"org.apache.spark" %% "spark-sql" % "2.0.0",
"cc.mallet" % "mallet" % "2.0.7",
"com.amazonaws" % "aws-java-sdk" % "1.11.229",
"com.datastax.spark" % "spark-cassandra-connector_2.11" % "2.0.0" exclude("joda-time", "joda-time"),
"joda-time" % "joda-time" % "2.3"
)

object DataFrame is not a member of package org.apache.spark.sql

I import org.apache.spark.sql.DataFrame in my Scala file, then compile with sbt, and the error is: object DataFrame is not a member of package org.apache.spark.sql
I searched for solutions on the Internet, and it seems the problem is that the Spark version is too old. But I am using the newest version (2.1.1), so it is weird.
In the REPL, when I import org.apache.spark.sql.DataFrame, there is no error.
My function is like this:
def test(df: DataFrame): Unit={
....
}
When I define this function in the REPL, it is fine, but when I compile it using sbt, the error is: not found: type DataFrame.
My build.sbt:
name := "Hello"
version := "1.0"
scalaVersion := "2.11.8"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.1.1"
Can anyone help me fix this problem? Thanks.
You need both spark-core and spark-sql to work with DataFrame:
libraryDependencies ++= Seq(
// https://mvnrepository.com/artifact/org.apache.spark/spark-core_2.11
"org.apache.spark" %% "spark-core" % "2.1.1",
// https://mvnrepository.com/artifact/org.apache.spark/spark-sql_2.11
"org.apache.spark" %% "spark-sql" % "2.1.1"
)
Hope this helps!

Why Spark on AWS EMR doesn't load class from application fat jar?

My Spark application fails to run on an AWS EMR cluster. I noticed that this is because some classes are loaded from the path set by EMR and not from my application jar. For example:
java.lang.NoSuchMethodError: org.apache.avro.Schema$Field.<init>(Ljava/lang/String;Lorg/apache/avro/Schema;Ljava/lang/String;Ljava/lang/Object;)V
at com.sksamuel.avro4s.SchemaFor$.fieldBuilder(SchemaFor.scala:424)
at com.sksamuel.avro4s.SchemaFor$.fieldBuilder(SchemaFor.scala:406)
Here org.apache.avro.Schema is loaded from "jar:file:/usr/lib/spark/jars/avro-1.7.7.jar!/org/apache/avro/Schema.class",
whereas com.sksamuel.avro4s depends on Avro 1.8.1. My application is built as a fat jar and contains Avro 1.8.1. Why isn't that loaded, instead of 1.7.7 being picked from the classpath set by EMR?
This is just an example. I see the same with other libraries I include in my application. Maybe Spark depends on 1.7.7 and I'd have to shade when including other dependencies. But why are the classes included in my app jar not loaded first?
After a bit of reading I realized that this is how class loading works in Spark. There is a hook to change this behavior: spark.executor.userClassPathFirst. It didn't quite work when I tried it, and it's marked as experimental. I guess the best way to proceed is to shade dependencies. Given the number of libraries Spark and its components pull in, this might mean quite a lot of shading for complicated Spark apps.
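To confirm which jar actually supplies a class at runtime, the JVM can be asked directly. The helper below is a small JDK-only sketch; ClassOrigin and locationOf are names introduced for illustration and are not part of Spark or EMR.

```scala
object ClassOrigin {
  /** Returns the jar or directory a class was loaded from, if the JVM exposes it. */
  def locationOf(cls: Class[_]): String =
    Option(cls.getProtectionDomain.getCodeSource) // null for bootstrap classes
      .flatMap(cs => Option(cs.getLocation))
      .map(_.toString)
      .getOrElse("<bootstrap or unknown>")

  def main(args: Array[String]): Unit = {
    // Classes from the JDK itself usually have no code source.
    println(locationOf(classOf[String]))
    // Application and library classes normally report a jar or directory URL.
    println(locationOf(getClass))
  }
}
```

For example, logging ClassOrigin.locationOf(classOf[org.apache.avro.Schema]) from both driver and executor code would show whether the EMR-provided jar or the fat jar supplies the class.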
I had the same exception as you. Based on a recommendation, I was able to resolve this exception by shading the avro dependency as you suggested:
assemblyShadeRules in assembly := Seq(
ShadeRule.rename("org.apache.avro.**" -> "latest_avro.@1").inAll
)
If it helps, here is my full build.sbt (project info aside):
val sparkVersion = "2.1.0"
val confluentVersion = "3.2.1"
resolvers += "Confluent" at "http://packages.confluent.io/maven"
libraryDependencies ++= Seq(
"org.scala-lang" % "scala-library" % scalaVersion.value % "provided",
"org.scala-lang" % "scala-reflect" % scalaVersion.value % "provided",
"org.apache.spark" %% "spark-streaming" % sparkVersion % "provided",
"org.apache.spark" % "spark-streaming-kafka-0-10_2.11" % sparkVersion,
"org.apache.spark" %% "spark-sql" % sparkVersion % "provided" excludeAll ExclusionRule(organization = "org.scala-lang"),
"org.apache.avro" % "avro" % "1.8.1" % "provided",
"com.databricks" %% "spark-avro" % "3.2.0",
"com.sksamuel.avro4s" %% "avro4s-core" % "1.6.4",
"io.confluent" % "kafka-avro-serializer" % confluentVersion
)
logBuffered in Test := false
assemblyShadeRules in assembly := Seq(
ShadeRule.rename("shapeless.**" -> "new_shapeless.@1").inAll,
ShadeRule.rename("org.apache.avro.**" -> "latest_avro.@1").inAll
)
assemblyMergeStrategy in assembly := {
case PathList("META-INF", xs @ _*) => MergeStrategy.discard
case x =>
val oldStrategy = (assemblyMergeStrategy in assembly).value
oldStrategy(x)
}

Resources