How can I use directJoin with spark (scala)? - cassandra

I'm trying to use directJoin with the partition keys, but when I run the job, Spark doesn't use directJoin. I would like to understand if I am doing something wrong. Here is the code I used:
Configuring the settings:
val sparkConf: SparkConf = new SparkConf()
  .set(
    "spark.sql.extensions",
    "com.datastax.spark.connector.CassandraSparkExtensions"
  )
  .set(
    "spark.sql.catalog.CassandraCommercial",
    "com.datastax.spark.connector.datasource.CassandraCatalog"
  )
  .set(
    "spark.sql.catalog.CassandraCommercial.spark.cassandra.connection.host",
    Settings.cassandraServerAddress
  )
  .set(
    "spark.sql.catalog.CassandraCommercial.spark.cassandra.auth.username",
    Settings.cassandraUser
  )
  .set(
    "spark.sql.catalog.CassandraCommercial.spark.cassandra.auth.password",
    Settings.cassandraPass
  )
  .set(
    "spark.sql.catalog.CassandraCommercial.spark.cassandra.connection.port",
    Settings.cassandraPort
  )
I am using a catalog because I intend to use databases on different clusters.
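For example, a second catalog for another cluster would follow the same pattern (the catalog name and host below are placeholders):

// Hypothetical second catalog pointing at a different cluster; every
// spark.cassandra.* option is scoped under that catalog's prefix.
sparkConf
  .set(
    "spark.sql.catalog.CassandraAnalytics",
    "com.datastax.spark.connector.datasource.CassandraCatalog"
  )
  .set(
    "spark.sql.catalog.CassandraAnalytics.spark.cassandra.connection.host",
    Settings.otherCassandraServerAddress // placeholder
  )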
SparkSession:
val sparkSession: SparkSession = SparkSession
  .builder()
  .config(sparkConf)
  .appName(Settings.appName)
  .getOrCreate()
I tried it both ways below:
This:
val parameterVOne = spark.read
  .table("CassandraCommercial.ky.parameters")
  .select(
    "id",
    "year",
    "code"
  )
And this:
val parameterVTwo = spark.read
  .cassandraFormat("parameters", "CassandraCommercial.ky")
  .load
  .select(
    "id",
    "year",
    "code"
  )
The first one, although Spark did not use directJoin, brings up the data normally if I use show():
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Project [id#19, year#22, code#0]
+- SortMergeJoin [id#19, year#22, code#0], [id#0, year#3, code#2, value#6], Inner, ((id#19 = id#0) AND (year#22 = year#3) AND (code#0 = code#2))
And the second returns this:
Exception in thread "main" java.io.IOException: Failed to open native connection to Cassandra at {localhost:9042} :: Could not reach any contact point, make sure you've provided valid addresses (showing first 2 nodes, use getAllErrors() for more): Node(endPoint=localhost/127.0.0.1:9042, hostId=null, hashCode=307be82d): [com.datastax.oss.driver.api.core.connection.ConnectionInitException: [s1|control|connecting...] Protocol initialization request, step 1 (OPTIONS): failed to send request (com.datastax.oss.driver.shaded.netty.channel.StacklessClosedChannelException)], Node(endPoint=localhost/0:0:0:0:0:0:0:1:9042, hostId=null, hashCode=3ebc1052): [com.datastax.oss.driver.api.core.connection.ConnectionInitException: [s1|control|connecting...] Protocol initialization request, step 1 (OPTIONS): failed to send request (com.datastax.oss.driver.shaded.netty.channel.StacklessClosedChannelException)]
Apparently this second way did not take the settings defined in the catalog, and is accessing localhost directly unlike the first way.
The dataframe that has the keys has only 7 rows, while the cassandra dataframe has approximately 2 million.
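For context, this is roughly how the join is written (keysDf is a stand-in name for the 7-row dataframe) and how I check whether the connector picked a direct join, i.e. whether "Cassandra Direct Join" shows up in the plan instead of SortMergeJoin:

// Illustrative sketch; keysDf stands for the small 7-row dataframe.
val joined = keysDf.join(
  parameterVOne,
  Seq("id", "year", "code")
)
joined.explain() // expect "Cassandra Direct Join" instead of SortMergeJoin

// If I understand the connector docs correctly, the optimization can also be
// forced on for testing via the Dataset hint from the connector's implicits:
// import org.apache.spark.sql.cassandra._
// parameterVOne.directJoin(AlwaysOn)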
This is my build.sbt:
ThisBuild / version := "0.1.0-SNAPSHOT"
ThisBuild / scalaVersion := "2.12.15"

lazy val root = (project in file("."))
  .settings(
    name := "test-job",
    idePackagePrefix := Some("com.teste"),
    libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.2.1",
    libraryDependencies += "org.apache.spark" %% "spark-core" % "3.2.1",
    libraryDependencies += "org.postgresql" % "postgresql" % "42.3.3",
    libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "3.1.0",
    libraryDependencies += "joda-time" % "joda-time" % "2.10.14",
    libraryDependencies += "com.crealytics" %% "spark-excel" % "3.2.1_0.16.5-pre2",
    libraryDependencies += "com.datastax.spark" % "spark-cassandra-connector-assembly_2.12" % "3.1.0"
  )

I've seen this behavior in some versions of Spark - unfortunately, changes in Spark internals often break this functionality because it relies on internal details. So please provide more information on which versions of Spark and the Spark Cassandra Connector you are using.
Regarding the second error, I suspect that the direct join may not use the Spark SQL properties. Can you try to use spark.cassandra.connection.host, spark.cassandra.auth.password, and the other configuration parameters directly?
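Something like this (an untested sketch, reusing your Settings object), in addition to the catalog-prefixed ones:

// Connector-level settings, not scoped under the catalog prefix; the
// cassandraFormat / direct join code path may only look at these.
sparkConf
  .set("spark.cassandra.connection.host", Settings.cassandraServerAddress)
  .set("spark.cassandra.connection.port", Settings.cassandraPort)
  .set("spark.cassandra.auth.username", Settings.cassandraUser)
  .set("spark.cassandra.auth.password", Settings.cassandraPass)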
P.S. I have a long blog post on using DirectJoin, but it was tested on Spark 2.4.x (and maybe on 3.0, I don't remember).

Related

KryoException: Unable to find class with spark structured streaming

1-The Problem
I have a Spark program that makes use of Kryo, but not as part of the Spark mechanics. More specifically, I am using Spark Structured Streaming connected to Kafka.
I read binary values coming from Kafka and decode them on my own.
I am faced with an exception while attempting to deserialize data with Kryo. This issue, however, only happens when I package my program and run it on a Spark standalone cluster. That is, it does not happen when I run it within IntelliJ, i.e. in Spark local mode (dev mode).
The exception that I get is as follow:
Caused by: com.esotericsoftware.kryo.KryoException: Unable to find class: com.elsevier.entellect.commons.package$RawData
Note that RawData is a case class of my own, situated in one of the sub-projects of my multi-project build.
To understand the context please find more details below:
2-build.sbt:
lazy val commonSettings = Seq(
  organization := "com.elsevier.entellect",
  version := "0.1.0-SNAPSHOT",
  scalaVersion := "2.11.12",
  resolvers += Resolver.mavenLocal,
  updateOptions := updateOptions.value.withLatestSnapshots(false)
)

lazy val entellectextractors = (project in file("."))
  .settings(commonSettings)
  .aggregate(entellectextractorscommon, entellectextractorsfetchers, entellectextractorsmappers, entellectextractorsconsumers)

lazy val entellectextractorscommon = project
  .settings(
    commonSettings,
    libraryDependencies ++= Seq(
      "com.esotericsoftware" % "kryo" % "5.0.0-RC1",
      "com.github.romix.akka" %% "akka-kryo-serialization" % "0.5.0" excludeAll(excludeJpountz),
      "org.apache.kafka" % "kafka-clients" % "1.0.1",
      "com.typesafe.akka" %% "akka-stream" % "2.5.16",
      "com.typesafe.akka" %% "akka-http-spray-json" % "10.1.4",
      "com.typesafe.akka" % "akka-slf4j_2.11" % "2.5.16",
      "ch.qos.logback" % "logback-classic" % "1.2.3"
    )
  )

lazy val entellectextractorsfetchers = project
  .settings(
    commonSettings,
    libraryDependencies ++= Seq(
      "com.typesafe.akka" %% "akka-stream-kafka" % "0.22",
      "com.typesafe.slick" %% "slick" % "3.2.3",
      "com.typesafe.slick" %% "slick-hikaricp" % "3.2.3",
      "com.lightbend.akka" %% "akka-stream-alpakka-slick" % "0.20"
    )
  )
  .dependsOn(entellectextractorscommon)

lazy val entellectextractorsconsumers = project
  .settings(
    commonSettings,
    libraryDependencies ++= Seq(
      "com.typesafe.akka" %% "akka-stream-kafka" % "0.22"
    )
  )
  .dependsOn(entellectextractorscommon)

lazy val entellectextractorsmappers = project
  .settings(
    commonSettings,
    mainClass in assembly := Some("entellect.extractors.mappers.NormalizedDataMapper"),
    assemblyMergeStrategy in assembly := {
      case PathList("META-INF", "services", "org.apache.spark.sql.sources.DataSourceRegister") => MergeStrategy.concat
      case PathList("META-INF", xs @ _*) => MergeStrategy.discard
      case x => MergeStrategy.first
    },
    dependencyOverrides += "com.fasterxml.jackson.core" % "jackson-core" % "2.9.5",
    dependencyOverrides += "com.fasterxml.jackson.core" % "jackson-databind" % "2.9.5",
    dependencyOverrides += "com.fasterxml.jackson.module" % "jackson-module-scala_2.11" % "2.9.5",
    dependencyOverrides += "org.apache.jena" % "apache-jena" % "3.8.0",
    libraryDependencies ++= Seq(
      "org.apache.jena" % "apache-jena" % "3.8.0",
      "edu.isi" % "karma-offline" % "0.0.1-SNAPSHOT",
      "org.apache.spark" % "spark-core_2.11" % "2.3.1" % "provided",
      "org.apache.spark" % "spark-sql_2.11" % "2.3.1" % "provided",
      "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.3.1"
      //"com.datastax.cassandra" % "cassandra-driver-core" % "3.5.1"
    )
  )
  .dependsOn(entellectextractorscommon)

lazy val excludeJpountz = ExclusionRule(organization = "net.jpountz.lz4", name = "lz4")
The sub-project that contains spark code is entellectextractorsmappers. The sub-project that contains the case class RawData that can't be found is entellectextractorscommon. entellectextractorsmappers explicitly depends on entellectextractorscommon.
3-The difference between submitting to a local standalone cluster and running in local development mode:
When I submit to the cluster, my Spark dependencies are as follows:
"org.apache.spark" % "spark-core_2.11" % "2.3.1" % "provided",
"org.apache.spark" % "spark-sql_2.11" % "2.3.1" % "provided",
When I run in local development mode (no submit script), they become:
"org.apache.spark" % "spark-core_2.11" % "2.3.1",
"org.apache.spark" % "spark-sql_2.11" % "2.3.1",
That is, in local dev I need to have the Spark dependencies on the classpath myself, while when submitting to the cluster in standalone mode they are already there, so I mark them as provided.
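One common sbt pattern (a sketch, not something I have verified in this build) is to keep the Spark dependencies as "provided" and make the run task use the full compile classpath, so local runs still see them:

// Make `sbt run` include "provided" dependencies on the runtime classpath
// (pattern from the sbt-assembly documentation).
run in Compile := Defaults.runTask(
  fullClasspath in Compile,
  mainClass in (Compile, run),
  runner in (Compile, run)
).evaluated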
4-How I submit:
spark-submit --class entellect.extractors.mappers.DeNormalizedDataMapper --name DeNormalizedDataMapper --master spark://MaatPro.local:7077 --deploy-mode cluster --executor-memory 14G --num-executors 1 --conf spark.sql.shuffle.partitions=7 "/Users/maatari/IdeaProjects/EntellectExtractors/entellectextractorsmappers/target/scala-2.11/entellectextractorsmappers-assembly-0.1.0-SNAPSHOT.jar"
5-How I use Kryo:
5.1-Declaration and Registration
In the entellectextractorscommon project I have a package object with the following:
package object commons {

  case class RawData(modelName: String,
                     modelFile: String,
                     sourceType: String,
                     deNormalizedVal: String,
                     normalVal: Map[String, String])

  object KryoContext {
    lazy val kryoPool = new Pool[Kryo](true, false, 16) {
      protected def create(): Kryo = {
        val kryo = new Kryo()
        kryo.setRegistrationRequired(false)
        kryo.addDefaultSerializer(classOf[scala.collection.Map[_,_]], classOf[ScalaImmutableAbstractMapSerializer])
        kryo.addDefaultSerializer(classOf[scala.collection.generic.MapFactory[scala.collection.Map]], classOf[ScalaImmutableAbstractMapSerializer])
        kryo.addDefaultSerializer(classOf[RawData], classOf[ScalaProductSerializer])
        kryo
      }
    }

    lazy val outputPool = new Pool[Output](true, false, 16) {
      protected def create: Output = new Output(4096)
    }

    lazy val inputPool = new Pool[Input](true, false, 16) {
      protected def create: Input = new Input(4096)
    }
  }

  object ExecutionContext {
    implicit lazy val system = ActorSystem()
    implicit lazy val mat = ActorMaterializer()
    implicit lazy val ec = system.dispatcher
  }
}
5.2-Usage
In entellectextractorsmappers (where the Spark program is), I work with mapPartitions. In it, I have a method that decodes the data coming from Kafka and makes use of Kryo, as such:
def decodeData(rowOfBinaryList: List[Row], kryoPool: Pool[Kryo], inputPool: Pool[Input]): List[RawData] = {
  val kryo = kryoPool.obtain()
  val input = inputPool.obtain()
  val data = rowOfBinaryList.map(r => r.getAs[Array[Byte]]("message")).map { binaryMsg =>
    input.setInputStream(new ByteArrayInputStream(binaryMsg))
    val value = kryo.readClassAndObject(input).asInstanceOf[RawData]
    input.close()
    value
  }
  kryoPool.free(kryo)
  inputPool.free(input)
  data
}
Note: the object KryoContext together with the lazy val ensures that kryoPool is instantiated only once per JVM. I don't think the issue comes from that, however.
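For completeness, this is roughly how the decoding is driven inside the Spark job (an illustrative sketch: the kafkaDf dataframe and the `spark` SparkSession variable names are assumptions):

// Sketch of the mapPartitions call; kafkaDf stands for the dataframe read
// from the Kafka source, with the binary payload exposed as "message".
import spark.implicits._ // assuming the SparkSession is in scope as `spark`

val decoded: Dataset[RawData] = kafkaDf
  .selectExpr("value AS message")
  .mapPartitions { rows =>
    decodeData(rows.toList, KryoContext.kryoPool, KryoContext.inputPool).iterator
  }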
I read in some other places hints about issues with the class loaders used by Spark vs. Kryo, but I am not sure I really understand what is going on.
If someone could give me some pointers, that would help, because I have no idea where to start. Why would it work in local mode and not in cluster mode? Does the provided scope mess up the dependencies and create some issue with Kryo? Is it the sbt-assembly merge strategy that messes things up?
Many pointers are possible; if anyone could help me narrow this down, that would be awesome!
So far, I have solved that problem by picking up the "enclosing" class loader, which I suppose is the one from Spark. This is after reading a few comments here and there about class loader issues between Kryo and Spark:
lazy val kryoPool = new Pool[Kryo](true, false, 16) {
  protected def create(): Kryo = {
    val cl = Thread.currentThread().getContextClassLoader()
    val kryo = new Kryo()
    kryo.setClassLoader(cl)
    kryo.setRegistrationRequired(false)
    kryo.addDefaultSerializer(classOf[scala.collection.Map[_,_]], classOf[ScalaImmutableAbstractMapSerializer])
    kryo.addDefaultSerializer(classOf[scala.collection.generic.MapFactory[scala.collection.Map]], classOf[ScalaImmutableAbstractMapSerializer])
    kryo.addDefaultSerializer(classOf[RawData], classOf[ScalaProductSerializer])
    kryo
  }
}

Spark Cassandra Connector cannot find java.time.LocalDate

I am trying to query Cassandra using Spark with the DataStax Spark Cassandra Connector. The Spark code is:
val conf = new SparkConf(true)
  .setMaster("local[4]")
  .setAppName("cassandra_query")
  .set("spark.cassandra.connection.host", "mycassandrahost")
val sc = new SparkContext(conf)
val rdd = sc.cassandraTable("mykeyspace", "mytable").limit(10)
rdd.foreach(println)
sc.stop()
So it's just running locally for now. My build.sbt file looks like this:
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.0.0",
  "org.apache.spark" %% "spark-sql" % "2.0.0",
  "cc.mallet" % "mallet" % "2.0.7",
  "com.amazonaws" % "aws-java-sdk" % "1.11.229",
  "com.datastax.spark" % "spark-cassandra-connector_2.11" % "2.0.0"
)
I create a fat jar using the assembly plugin, and when I submit the Spark job I get the following error:
Lost task 6.0 in stage 0.0 (TID 6) on executor localhost: java.io.IOException (Exception during preparation of SELECT "pcid", "content" FROM "mykeyspace"."mytable" WHERE token("pcid") > ? AND token("pcid") <= ? LIMIT 10 ALLOW FILTERING: class java.time.LocalDate in JavaMirror with org.apache.spark.util.MutableURLClassLoader#923288b of type class org.apache.spark.util.MutableURLClassLoader with classpath [file:/root/GenderPrediction-assembly-0.1.jar] and parent being sun.misc.Launcher$AppClassLoader#1e69dff6 of type class sun.misc.Launcher$AppClassLoader with classpath [file:/root/spark/conf/,file:/root/spark/jars/datanucleus-core-3.2.10.jar,...not found.
(Note: there were too many jars listed in the above classpath so I just replaced them with a "..." )
So it looks like it can't find java.time.LocalDate - how can I fix this?
I found another post that looks similar: spark job cassandra error.
However, it is a different class that cannot be found, so I'm not sure if it helps.
java.time.LocalDate is part of Java 8, and it seems you are running a Java version lower than 8.
spark-cassandra-connector 2.0 requires Java 8.
Spark Cassandra version compatibility
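To double-check which JVM Spark is actually using on the executors (not only on the machine where you run spark-submit), something like this works:

// Prints the distinct java.version values seen on the executors.
val javaVersions = sc
  .parallelize(1 to 100)
  .map(_ => System.getProperty("java.version"))
  .distinct()
  .collect()
println(javaVersions.mkString(", "))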
Can you please try this:
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.0.0",
  "org.apache.spark" %% "spark-sql" % "2.0.0",
  "cc.mallet" % "mallet" % "2.0.7",
  "com.amazonaws" % "aws-java-sdk" % "1.11.229",
  "com.datastax.spark" % "spark-cassandra-connector_2.11" % "2.0.0" exclude("joda-time", "joda-time"),
  "joda-time" % "joda-time" % "2.3"
)

Exception in thread "main" java.lang.NoClassDefFoundError: org/deeplearning4j/nn/conf/layers/Layer

I am trying to build an application on Spark using the Deeplearning4j library. I have a cluster where I am going to run my jar (built using IntelliJ) with the spark-submit command. Here's my code:
package Com.Spark.Examples

import scala.collection.mutable.ListBuffer
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.canova.api.records.reader.RecordReader
import org.canova.api.records.reader.impl.CSVRecordReader
import org.deeplearning4j.nn.api.OptimizationAlgorithm
import org.deeplearning4j.nn.conf.MultiLayerConfiguration
import org.deeplearning4j.nn.conf.NeuralNetConfiguration
import org.deeplearning4j.nn.conf.layers.DenseLayer
import org.deeplearning4j.nn.conf.layers.OutputLayer
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork
import org.deeplearning4j.nn.weights.WeightInit
import org.deeplearning4j.spark.impl.multilayer.SparkDl4jMultiLayer
import org.nd4j.linalg.lossfunctions.LossFunctions

object FeedForwardNetworkWithSpark {
  def main(args: Array[String]): Unit = {
    val recordReader: RecordReader = new CSVRecordReader(0, ",")
    val conf = new SparkConf()
      .setAppName("FeedForwardNetwork-Iris")
    val sc = new SparkContext(conf)
    val numInputs: Int = 4
    val outputNum = 3
    val iterations = 1
    val multiLayerConfig: MultiLayerConfiguration = new NeuralNetConfiguration.Builder()
      .seed(12345)
      .iterations(iterations)
      .optimizationAlgo(OptimizationAlgorithm.STOCHASTIC_GRADIENT_DESCENT)
      .learningRate(1e-1)
      .l1(0.01).regularization(true).l2(1e-3)
      .list(3)
      .layer(0, new DenseLayer.Builder().nIn(numInputs).nOut(3).activation("tanh").weightInit(WeightInit.XAVIER).build())
      .layer(1, new DenseLayer.Builder().nIn(3).nOut(2).activation("tanh").weightInit(WeightInit.XAVIER).build())
      .layer(2, new OutputLayer.Builder(LossFunctions.LossFunction.MCXENT).weightInit(WeightInit.XAVIER)
        .activation("softmax")
        .nIn(2).nOut(outputNum).build())
      .backprop(true).pretrain(false)
      .build
    val network: MultiLayerNetwork = new MultiLayerNetwork(multiLayerConfig)
    network.init
    network.setUpdater(null)
    val sparkNetwork: SparkDl4jMultiLayer = new SparkDl4jMultiLayer(sc, network)
    val nEpochs: Int = 6
    val listBuffer = new ListBuffer[Array[Float]]()
    (0 until nEpochs).foreach { i =>
      val net: MultiLayerNetwork = sparkNetwork.fit("/user/iris.txt", 4, recordReader)
      listBuffer += net.params.data.asFloat().clone()
    }
    println("Parameters vs. iteration Output: ")
    (0 until listBuffer.size).foreach { i =>
      println(i + "\t" + listBuffer(i).mkString)
    }
  }
}
Here is my build.sbt file
name := "HWApp"
version := "0.1"
scalaVersion := "2.12.3"
libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "1.6.0" % "provided"
libraryDependencies += "org.apache.spark" % "spark-mllib_2.10" % "1.6.0" % "provided"
libraryDependencies += "org.deeplearning4j" % "deeplearning4j-nlp" % "0.4-rc3.8"
libraryDependencies += "org.deeplearning4j" % "dl4j-spark" % "0.4-rc3.8"
libraryDependencies += "org.deeplearning4j" % "deeplearning4j-core" % "0.4-rc3.8"
libraryDependencies += "org.nd4j" % "nd4j-x86" % "0.4-rc3.8" % "test"
libraryDependencies += "org.nd4j" % "nd4j-api" % "0.4-rc3.8"
libraryDependencies += "org.nd4j" % "nd4j-jcublas-7.0" % "0.4-rc3.8"
libraryDependencies += "org.nd4j" % "canova-api" % "0.0.0.14"
When I look at my code in IntelliJ it does not show any errors, but when I execute the application on the cluster I get the NoClassDefFoundError for org/deeplearning4j/nn/conf/layers/Layer shown in the title.
I don't know what it wants from me. Even a little help will be appreciated. Thanks.
I'm not sure how you came up with this list of versions (I'm assuming just randomly compiling? Please don't do that).
You are using a 1.5-year-old version of dl4j with dependencies that are a year older than that and don't exist anymore.
Start from scratch and follow our getting started guide and examples like you would for any other open source project.
Those can be found here:
https://deeplearning4j.org/quickstart
with example projects here:
https://github.com/deeplearning4j/dl4j-examples
A few more things:
Canova doesn't exist anymore and has been renamed DataVec for more than a year.
All dl4j, datavec, nd4j, ... versions must be the same.
If you are using any of our Scala modules like Spark, those must also always use the same Scala version.
So you are mixing Scala 2.12 with Scala 2.10 dependencies, which is a Scala no-no (that's not even dl4j specific).
Dl4j only supports Scala 2.11 at most. This is mainly because Hadoop distros like CDH and Hortonworks don't support Scala 2.12 yet.
Edit: Another thing to watch out for that is dl4j specific is how we handle Spark versions. Spark 1 and 2 are supported. Your artifact id should be:
dl4j-spark_${your scala version} (usually 2.10 or 2.11)
with a dependency like:
0.9.1_spark_${YOUR VERSION OF SPARK}
This is applicable for our NLP modules as well.
Edit for more folks who haven't followed our getting started guide (please do that, we keep it up to date): you also always need an nd4j backend. Usually this is nd4j-native-platform, but it may be CUDA if you are using GPUs, with:
nd4j-cuda-${YOUR CUDA VERSION}-platform
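As a rough sketch of how the naming scheme fits together (the versions below are only illustrative; take the current ones from the quickstart):

// All dl4j / nd4j artifacts share one version; dl4j-spark additionally
// encodes the Spark major version, and the Scala suffix must match scalaVersion.
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.6.3" % "provided",
  "org.deeplearning4j" % "deeplearning4j-core" % "0.9.1",
  "org.deeplearning4j" %% "dl4j-spark" % "0.9.1_spark_1",
  "org.nd4j" % "nd4j-native-platform" % "0.9.1"
)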

Why Spark on AWS EMR doesn't load class from application fat jar?

My Spark application fails to run on an AWS EMR cluster. I noticed that this is because some classes are loaded from the path set by EMR and not from my application jar. For example:
java.lang.NoSuchMethodError: org.apache.avro.Schema$Field.<init>(Ljava/lang/String;Lorg/apache/avro/Schema;Ljava/lang/String;Ljava/lang/Object;)V
at com.sksamuel.avro4s.SchemaFor$.fieldBuilder(SchemaFor.scala:424)
at com.sksamuel.avro4s.SchemaFor$.fieldBuilder(SchemaFor.scala:406)
Here org.apache.avro.Schema is loaded from "jar:file:/usr/lib/spark/jars/avro-1.7.7.jar!/org/apache/avro/Schema.class"
Whereas com.sksamuel.avro4s depends on avro 1.8.1. My application is built as a fat jar and includes avro 1.8.1. Why isn't that loaded instead of the 1.7.7 on the classpath set by EMR?
This is just an example; I see the same with other libraries I include in my application. Maybe Spark depends on avro 1.7.7 and I'd have to shade it when including other dependencies. But why are the classes included in my app jar not loaded first?
After a bit of reading I realized that this is how class loading works in Spark. There is a hook to change this behavior: spark.executor.userClassPathFirst. It didn't quite work when I tried it, and it's marked as experimental. I guess the best way to proceed is to shade dependencies. Given the number of libraries Spark and its components pull in, this can mean quite a lot of shading for complicated Spark apps.
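For reference, the flags can be set like this (or passed as --conf to spark-submit); both are documented as experimental:

import org.apache.spark.SparkConf

// Ask Spark to prefer classes from the application jar over its own jars.
val conf = new SparkConf()
  .set("spark.driver.userClassPathFirst", "true")
  .set("spark.executor.userClassPathFirst", "true")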
I had the same exception as you. Based on a recommendation, I was able to resolve this exception by shading the avro dependency as you suggested:
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("org.apache.avro.**" -> "latest_avro.@1").inAll
)
If it helps, here is my full build.sbt (project info aside):
val sparkVersion = "2.1.0"
val confluentVersion = "3.2.1"

resolvers += "Confluent" at "http://packages.confluent.io/maven"

libraryDependencies ++= Seq(
  "org.scala-lang" % "scala-library" % scalaVersion.value % "provided",
  "org.scala-lang" % "scala-reflect" % scalaVersion.value % "provided",
  "org.apache.spark" %% "spark-streaming" % sparkVersion % "provided",
  "org.apache.spark" % "spark-streaming-kafka-0-10_2.11" % sparkVersion,
  "org.apache.spark" %% "spark-sql" % sparkVersion % "provided" excludeAll ExclusionRule(organization = "org.scala-lang"),
  "org.apache.avro" % "avro" % "1.8.1" % "provided",
  "com.databricks" %% "spark-avro" % "3.2.0",
  "com.sksamuel.avro4s" %% "avro4s-core" % "1.6.4",
  "io.confluent" % "kafka-avro-serializer" % confluentVersion
)

logBuffered in Test := false

assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("shapeless.**" -> "new_shapeless.@1").inAll,
  ShadeRule.rename("org.apache.avro.**" -> "latest_avro.@1").inAll
)

assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case x =>
    val oldStrategy = (assemblyMergeStrategy in assembly).value
    oldStrategy(x)
}

Spark 2.0 with Play! 2.5

I'm trying to use Spark 2.0 with Play! 2.5, but I can't manage to make it work properly (and it seems that there is no example on GitHub).
I don't have any compilation errors, but I get some strange execution errors.
For instance:
Almost all operations on a Dataset or a DataFrame lead to a NullPointerException:
val ds: Dataset[Event] = df.as[Event]
println(ds.count()) // Works well and prints the correct result
ds.collect()        // --> NullPointerException
ds.show also leads to a NullPointerException.
So there is a big problem somewhere that I'm missing, and I think it comes from incompatible versions. Here is the relevant part of my build.sbt:
object Version {
  val scala = "2.11.8"
  val spark = "2.0.0"
  val postgreSQL = "9.4.1211.jre7"
}

object Library {
  val sparkSQL = "org.apache.spark" %% "spark-sql" % Version.spark
  val sparkMLLib = "org.apache.spark" %% "spark-mllib" % Version.spark
  val sparkCore = "org.apache.spark" %% "spark-core" % Version.spark
  val postgreSQL = "org.postgresql" % "postgresql" % Version.postgreSQL
}

object Dependencies {
  import Library._
  val dependencies = Seq(
    sparkSQL,
    sparkMLLib,
    sparkCore,
    postgreSQL)
}

lazy val root = (project in file("."))
  .settings(scalaVersion := Version.scala)
  .enablePlugins(PlayScala)

libraryDependencies ++= Dependencies.dependencies

dependencyOverrides ++= Set(
  "com.fasterxml.jackson.core" % "jackson-databind" % "2.7.4",
  "com.fasterxml.jackson.module" %% "jackson-module-scala" % "2.7.4"
)
I ran into the same issue using Spark 2.0.0 with Play 2.5.12 (Java).
The activator seems to include com.fasterxml.jackson-databind 2.7.8 by default, and it doesn't work with jackson-module-scala.
I cleaned my sbt cache
rm -r ~/.ivy2/cache
Here is my new build.sbt. It yields a warning while compiling, since Spark 2.0.0 is compiled against jackson-module-scala_2.11:2.6.5, but Spark 2 still seems to work with jackson-module-scala 2.8.7:
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "com.fasterxml.jackson.core" % "jackson-core" % "2.8.7",
  "com.fasterxml.jackson.core" % "jackson-databind" % "2.8.7",
  "com.fasterxml.jackson.core" % "jackson-annotations" % "2.8.7",
  "com.fasterxml.jackson.module" %% "jackson-module-scala" % "2.8.7",
  "org.apache.spark" % "spark-core_2.11" % "2.0.0",
  "org.apache.spark" % "spark-mllib_2.11" % "2.0.0"
)
The NullPointerException derives from jackson.databind.JsonMappingException: Incompatible Jackson Version: 2.x.x
Please read https://github.com/FasterXML/jackson-module-scala/issues/233
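If it helps, an alternative to depending on Jackson directly is to pin every Jackson artifact with dependencyOverrides (a sketch; 2.8.7 is the version that worked for me above):

// Force one consistent Jackson version across all transitive dependencies.
dependencyOverrides ++= Set(
  "com.fasterxml.jackson.core" % "jackson-core" % "2.8.7",
  "com.fasterxml.jackson.core" % "jackson-databind" % "2.8.7",
  "com.fasterxml.jackson.core" % "jackson-annotations" % "2.8.7",
  "com.fasterxml.jackson.module" %% "jackson-module-scala" % "2.8.7"
)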
