Not able to create an object of SparkBundleContext in MLeap - apache-spark

I have imported the required packages. I am even able to import SparkBundleContext
import org.apache.spark.ml.bundle.SparkBundleContext
But then when I do
val sbc = SparkBundleContext()
I get this error
java.lang.NoClassDefFoundError: org/apache/spark/ml/clustering/GaussianMixtureModel

The missing class org.apache.spark.ml.clustering.GaussianMixtureModel lives in the Spark ML artifact, so that jar has to be on the runtime classpath. If you are using Maven, add the Spark ML dependency as
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-mllib_2.11</artifactId>
    <version>2.1.1</version>
</dependency>
If you are using SBT, add the dependency as
libraryDependencies += "org.apache.spark" % "spark-mllib_2.11" % "2.1.1"
Use the version of the dependency that matches your Scala version.
Hope this helps!
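Once the Spark ML jar is actually on the runtime classpath, the constructor call from the question should work. A minimal sketch, assuming a plain local session (the app name and master here are illustrative, not taken from the question):
import org.apache.spark.ml.bundle.SparkBundleContext
import org.apache.spark.sql.SparkSession

// Illustrative local session; adjust master and app name to your setup.
val spark = SparkSession.builder()
  .appName("mleap-context-check")
  .master("local[*]")
  .getOrCreate()

// With spark-mllib available at runtime, this no longer throws NoClassDefFoundError.
val sbc = SparkBundleContext()
// MLeap examples typically also attach the transformed DataFrame via withDataset
// before serializing a fitted pipeline.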

Related

graphframes for pySpark v3.0.1

I'm trying to use the graphframes library with pySpark v3.0.1. (I'm using vscode on debian but trying to import the package from pyspark shell didn't work either)
According to the documentation, using $ pyspark --packages graphframes:graphframes:0.6.0-spark2.3-s_2.11 I should be able to work with it.
This sample code was taken from another post in StackOverflow posing the same question, although its solution didn't do the trick for me.
# Run inside the pyspark shell, where sqlContext is already defined.
from graphframes import GraphFrame

localVertices = [(1, "A"), (2, "B"), (3, "C")]
localEdges = [(1, 2, "love"), (2, 1, "hate"), (2, 3, "follow")]
v = sqlContext.createDataFrame(localVertices, ["id", "name"])
e = sqlContext.createDataFrame(localEdges, ["src", "dst", "action"])
g = GraphFrame(v, e)
throws error
py4j.protocol.Py4JJavaError: An error occurred while calling o63.createGraph.
java.lang.NoSuchMethodError: 'scala.collection.mutable.ArrayOps scala.Predef$.refArrayOps(java.lang.Object[])'
You need to use the correct graphframes version for Spark 3.0. You have used the graphframes for Spark 2.3 (0.6.0-spark2.3-s_2.11), which caused a Spark version conflict. You can try 0.8.1-spark3.0-s_2.12, which is currently the latest version of graphframes for Spark 3.0.
pyspark --packages graphframes:graphframes:0.8.1-spark3.0-s_2.12

BeforeStep AfterStep aren't called

I created a hook and am using @Before, @After, @BeforeStep, and @AfterStep.
The pom I set is as below:
<dependency>
    <groupId>io.cucumber</groupId>
    <artifactId>cucumber-java</artifactId>
    <version>4.2.0</version>
</dependency>
With this setting, only @Before and @After work; @BeforeStep and @AfterStep don't. How can I fix it?
If I change the version of cucumber-java to the latest version, 6.9.1, the following imports become invalid:
import cucumber.api.java.AfterStep;
import cucumber.api.java.BeforeStep;
Which package should I import? Is anyone able to help me fix this?
Try the io.cucumber.java package instead. In recent cucumber-java releases the hook annotations moved from cucumber.api.java to io.cucumber.java, so import io.cucumber.java.BeforeStep and io.cucumber.java.AfterStep.

Spark / Wiremock: guava version conflict

In a Spark application (v2.3.3), I want to use Wiremock from Scala tests. I use the following dependencies:
"org.apache.spark" %% "spark-sql" % "2.3.3" % "provided"
"org.apache.spark" %% "spark-mllib" % "2.3.3" % "provided"
"com.github.tomakehurst" % "wiremock" % "2.25.1" % Test
"org.scalatest" %% "scalatest" % "2.2.5" % Test
Doing so, I have the following error from a spark class:
java.lang.IllegalAccessError: tried to access method com.google.common.base.Stopwatch.<init>()V from class org.apache.hadoop.mapred.FileInputFormat
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:312)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:200)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:46)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.rdd.RDD$$anonfun$toLocalIterator$1.apply(RDD.scala:962)
at org.apache.spark.rdd.RDD$$anonfun$toLocalIterator$1.apply(RDD.scala:958)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.toLocalIterator(RDD.scala:958)
Then if I exclude guava from Wiremock:
"com.github.tomakehurst" % "wiremock" % "2.25.1" % Test exclude("com.google", "guava")
I get the following error when using Wiremock:
java.lang.NoSuchMethodError: com.google.common.base.Stopwatch.createStarted()Lcom/google/common/base/Stopwatch;
This is because the default Guava version on the classpath is 11.0.2.
Using Guava 18, which introduces the createStarted method, I still get the tried to access method com.google.common.base.Stopwatch.<init> error.
So the issue is that the two libraries require incompatible versions of Guava. How can I fix it? The solutions I found revolve around building an uber-jar, but in my case the conflict only affects the Test scope.
Using the wiremock-standalone library solves the issue:
"com.github.tomakehurst" % "wiremock-standalone" % "2.25.1" % Test
I don't really understand why; can anyone explain?
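For reference, a sketch of the dependency list from the question with the swap applied (versions copied from above; the comment about shading is an assumption about why the standalone jar avoids the clash, not something confirmed in the thread):
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql" % "2.3.3" % "provided",
  "org.apache.spark" %% "spark-mllib" % "2.3.3" % "provided",
  // The standalone artifact bundles its own third-party code, so it does not
  // clash with the Guava version pulled in by the Spark/Hadoop stack (assumed).
  "com.github.tomakehurst" % "wiremock-standalone" % "2.25.1" % Test,
  "org.scalatest" %% "scalatest" % "2.2.5" % Test
)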

Why does sbt fail with "object SQLContext is not a member of package org.apache.spark.sql"?

I have been trying to use Spark SQL for which I have used the following import:
import org.apache.spark.sql.SQLContext
but it is creating the error:
object SQLContext is not a member of package org.apache.spark.sql
I am using SBT as the build tool. The content of the sbt file is as follows:
name := "stream-demo"
version := "1.0"
scalaVersion := "2.11.7"
val sparkVersion = "2.1.0"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % sparkVersion,
"org.apache.spark" %% "spark-sql" % sparkVersion,
"org.apache.spark" %% "spark-streaming" % sparkVersion)
The problem is that the environment you use to compile the code did not refresh itself, so it never picked up the spark-sql module.
Please leave your sbt session and start over.
You could also run reload inside the sbt session (that way you don't destroy the JVM sbt runs on, which makes the reload faster).
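If you stay in the interactive sbt shell, reload re-reads build.sbt (picking up the spark-sql dependency) and compile then rebuilds; the prompt shown below is illustrative:
> reload
> compile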

spark-submit and spark-shell results mismatch

I have a simple test Spark program, shown below. The strange thing is that it runs well under spark-shell, but under spark-submit it fails at runtime with
java.lang.NoSuchMethodError:
which indicates that the line
val maps2=maps.collect.toMap
is the problem. But why does it compile without errors and work well under spark-shell? Thanks!
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext
import org.apache.spark._
import SparkContext._
val docs=sc.parallelize(Array(Array("once" ,"upon", "a", "time"), Array("there", "was", "a", "king")))
val hashingTF = new HashingTF()
val maps=docs.flatMap{term=>term.map(ele=>(hashingTF.indexOf(ele),ele))}
val maps2=maps.collect.toMap
I experienced the same problem. The fix was to match the scalaVersion in my sbt build to the version used by spark-shell. Like you, I was specifying a 2.11.x version in my sbt. At the time of this writing, the Spark binaries are compiled against Scala 2.10.x.
scalaVersion := "2.10.4"
To check your spark-shell scala version :
scala> scala.util.Properties.versionString
res7: String = version 2.10.4
You should check which Scala version Spark is compiled against. I had the same problem: sbt was set to 2.11.7 while Spark was built with 2.10. Try it!
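A minimal build.sbt sketch that keeps the two in sync; the Spark version is a placeholder here, so fill in whatever your cluster actually runs:
// Match the version reported by scala.util.Properties.versionString in spark-shell.
scalaVersion := "2.10.4"

// %% appends the Scala binary suffix (_2.10 here), so the artifacts line up with the shell.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "<your-spark-version>" % "provided",
  "org.apache.spark" %% "spark-mllib" % "<your-spark-version>" % "provided"
)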
