How to use Test Jars for spark in SBT

I am creating a Spark 2.0.1 project and want to use the Spark test-jars in my SBT project.
build.sbt:
scalaVersion := "2.11.0"
val sparkVersion = "2.0.1"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion % "compile",
  "org.apache.spark" %% "spark-sql" % sparkVersion % "compile",
  "org.scalatest" %% "scalatest" % "2.2.6" % "test",
  "org.apache.spark" %% "spark-core" % sparkVersion % "test" classifier "tests",
  "org.apache.spark" %% "spark-sql" % sparkVersion % "test" classifier "tests",
  "org.apache.spark" %% "spark-catalyst" % sparkVersion % "test" classifier "tests"
)
My Test code:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
import org.apache.spark.sql.test.SharedSQLContext

class LoaderTest extends org.apache.spark.sql.QueryTest with SharedSQLContext {
  import testImplicits._

  test("function current_date") {
    val df1 = Seq((1, 2), (3, 1)).toDF("a", "b")
    // Rest of test code and assertion using checkAnswer method
  }
}
But when I try to run the tests using:
sbt clean test
I get the following errors:
[info] Compiling 1 Scala source to /tstprg/test/target/scala-2.11/test-classes...
[error] bad symbolic reference to org.apache.spark.sql.catalyst.expressions.PredicateHelper encountered in class file 'PlanTest.class'.
[error] Cannot access type PredicateHelper in package org.apache.spark.sql.catalyst.expressions. The current classpath may be
[error] missing a definition for org.apache.spark.sql.catalyst.expressions.PredicateHelper, or PlanTest.class may have been compiled against a version that's
[error] incompatible with the one found on the current classpath.
[error] /tstprg/test/src/test/scala/facts/LoaderTest.scala:7: illegal inheritance;
[error] self-type facts.LoaderTest does not conform to org.apache.spark.sql.QueryTest's selftype org.apache.spark.sql.QueryTest
[error] class LoaderTest extends org.apache.spark.sql.QueryTest with SharedSQLContext {
[error] ^
[error] /tstprg/test/src/test/scala/facts/LoaderTest.scala:7: illegal inheritance;
[error] self-type facts.LoaderTest does not conform to org.apache.spark.sql.test.SharedSQLContext's selftype org.apache.spark.sql.test.SharedSQLContext
[error] class LoaderTest extends org.apache.spark.sql.QueryTest with SharedSQLContext {
[error] ^
[error] bad symbolic reference to org.apache.spark.sql.Encoder encountered in class file 'SQLImplicits.class'.
[error] Cannot access type Encoder in package org.apache.spark.sql. The current classpath may be
[error] missing a definition for org.apache.spark.sql.Encoder, or SQLImplicits.class may have been compiled against a version that's
[error] incompatible with the one found on the current classpath.
[error] /tstprg/test/src/test/scala/facts/LoaderTest.scala:11: bad symbolic reference to org.apache.spark.sql.catalyst.plans.logical encountered in class file 'SQLTestUtils.class'.
[error] Cannot access term logical in package org.apache.spark.sql.catalyst.plans. The current classpath may be
[error] missing a definition for org.apache.spark.sql.catalyst.plans.logical, or SQLTestUtils.class may have been compiled against a version that's
[error] incompatible with the one found on the current classpath.
[error] val df1 = Seq((1, 2), (3, 1)).toDF("a", "b")
[error] ^
[error] 5 errors found
[error] (test:compileIncremental) Compilation failed
Can anybody who has used the Spark test-jars for unit testing with SBT tell me what I am missing?
Note: This test works fine when I run it through the IntelliJ IDE.

Try using the scopes shown below:
version := "0.1"
scalaVersion := "2.11.11"
val sparkVersion = "2.3.1"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion % Provided,
  "org.apache.spark" %% "spark-core" % sparkVersion % Test classifier "tests",
  "org.apache.spark" %% "spark-core" % sparkVersion % Test classifier "test-sources",
  "org.apache.spark" %% "spark-sql" % sparkVersion % Provided,
  "org.apache.spark" %% "spark-sql" % sparkVersion % Test classifier "tests",
  "org.apache.spark" %% "spark-sql" % sparkVersion % Test classifier "test-sources",
  "org.apache.spark" %% "spark-catalyst" % sparkVersion % Test classifier "tests",
  "org.apache.spark" %% "spark-catalyst" % sparkVersion % Test classifier "test-sources",
  "com.typesafe.scala-logging" %% "scala-logging" % "3.9.0",
  "org.scalatest" %% "scalatest" % "3.0.4" % "test",
  "org.typelevel" %% "cats-core" % "1.1.0",
  "org.typelevel" %% "cats-effect" % "1.0.0-RC2",
  "org.apache.spark" %% "spark-streaming" % sparkVersion % Provided,
  "org.apache.spark" %% "spark-sql-kafka-0-10" % sparkVersion % Provided exclude ("net.jpountz.lz4", "lz4"),
  "com.pusher" % "pusher-java-client" % "1.8.0"
)

Try changing the scope of the dependencies marked as test, as shown below:
scalaVersion := "2.11.0"
val sparkVersion = "2.0.1"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-sql" % sparkVersion,
  "org.scalatest" %% "scalatest" % "2.2.6",
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-sql" % sparkVersion,
  "org.apache.spark" %% "spark-catalyst" % sparkVersion
)
or set the scope to "compile".
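
One way to check what either suggestion actually puts on the test classpath is a small custom task in build.sbt. The sketch below assumes sbt 1.x slash syntax (on sbt 0.13 the equivalent setting is `dependencyClasspath in Test`), and the task name `printSparkTestJars` is made up for illustration. PredicateHelper and Encoder both live in the main spark-catalyst artifact, so if only the `-tests` jar for catalyst shows up in the output, that would be consistent with the "bad symbolic reference" errors in the question.

```scala
// build.sbt (sketch): list the Spark jars that reach the test classpath.
// `printSparkTestJars` is a hypothetical task name, not part of sbt or Spark.
lazy val printSparkTestJars =
  taskKey[Unit]("Print the Spark jars on the test classpath")

printSparkTestJars := {
  (Test / dependencyClasspath).value
    .map(_.data.getName)            // Attributed[File] -> jar file name
    .filter(_.startsWith("spark-")) // keep only Spark artifacts
    .sorted
    .foreach(println)
}
```

Running `sbt printSparkTestJars` should then print one jar name per line, which makes it easy to compare against what IntelliJ puts on its own classpath.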

Related

Spark Session Catalog Failure

I'm reading data in batch from a Cassandra database and in streaming from Azure EventHubs using the Scala Spark API.
session.read
  .format("org.apache.spark.sql.cassandra")
  .option("keyspace", keyspace)
  .option("table", table)
  .option("pushdown", pushdown)
  .load()
and
session.readStream
  .format("eventhubs")
  .options(eventHubsConf.toMap)
  .load()
Everything was running fine, but now I get this exception out of nowhere:
User class threw exception: java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.catalog.SessionCatalog.<init>(Lscala/Function0;Lscala/Function0;Lorg/apache/spark/sql/catalyst/analysis/FunctionRegistry;Lorg/apache/spark/sql/internal/SQLConf;Lorg/apache/hadoop/conf/Configuration;Lorg/apache/spark/sql/catalyst/parser/ParserInterface;Lorg/apache/spark/sql/catalyst/catalog/FunctionResourceLoader;)V
at org.apache.spark.sql.internal.BaseSessionStateBuilder.catalog$lzycompute(BaseSessionStateBuilder.scala:132)
at org.apache.spark.sql.internal.BaseSessionStateBuilder.catalog(BaseSessionStateBuilder.scala:131)
at org.apache.spark.sql.internal.BaseSessionStateBuilder$$anon$1.<init>(BaseSessionStateBuilder.scala:157)
at org.apache.spark.sql.internal.BaseSessionStateBuilder.analyzer(BaseSessionStateBuilder.scala:157)
at org.apache.spark.sql.internal.BaseSessionStateBuilder$$anonfun$build$2.apply(BaseSessionStateBuilder.scala:293)
at org.apache.spark.sql.internal.BaseSessionStateBuilder$$anonfun$build$2.apply(BaseSessionStateBuilder.scala:293)
at org.apache.spark.sql.internal.SessionState.analyzer$lzycompute(SessionState.scala:79)
at org.apache.spark.sql.internal.SessionState.analyzer(SessionState.scala:79)
at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57)
at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:74)
at org.apache.spark.sql.SparkSession.baseRelationToDataFrame(SparkSession.scala:428)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:233)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:164)
I don't know exactly what changed, but here are my dependencies:
ThisBuild / scalaVersion := "2.11.11"
val sparkVersion = "2.4.0"
libraryDependencies ++= Seq(
  "org.apache.logging.log4j" % "log4j-core" % "2.11.1",
  "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
  "org.apache.spark" %% "spark-sql" % sparkVersion % "provided",
  "org.apache.spark" %% "spark-hive" % sparkVersion % "provided",
  "org.apache.spark" %% "spark-catalyst" % sparkVersion % "provided",
  "org.apache.spark" %% "spark-streaming" % sparkVersion % "provided",
  "com.microsoft.azure" % "azure-eventhubs-spark_2.11" % "2.3.10",
  "com.microsoft.azure" % "azure-eventhubs" % "2.3.0",
  "com.datastax.spark" %% "spark-cassandra-connector" % "2.4.1",
  "org.scala-lang.modules" %% "scala-java8-compat" % "0.9.0",
  "com.twitter" % "jsr166e" % "1.1.0",
  "com.holdenkarau" %% "spark-testing-base" % "2.4.0_0.12.0" % Test,
  "MrPowers" % "spark-fast-tests" % "0.19.2-s_2.11" % Test
)
Does anyone have a clue?

java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.catalog.SessionCatalog.<init>(
Lscala/Function0;Lscala/Function0;
Lorg/apache/spark/sql/catalyst/analysis/FunctionRegistry;
Lorg/apache/spark/sql/internal/SQLConf;
Lorg/apache/hadoop/conf/Configuration;
Lorg/apache/spark/sql/catalyst/parser/ParserInterface;
Lorg/apache/spark/sql/catalyst/catalog/FunctionResourceLoader;)V
This suggests to me that one of the libraries was compiled against a version of Spark different from the one currently on the runtime path. The method signature above matches the Spark 2.4.0 signature, see
https://github.com/apache/spark/blob/v2.4.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala#L56-L63
but not the Spark 2.3.0 signature.
My guess is that there is a Spark 2.3.0 runtime somewhere. Perhaps you are running the application using spark-submit from a Spark 2.3.0 install?
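
A quick way to test that guess is to print, from inside the submitted job, which Spark version and which jar are actually loaded at runtime. This is only a diagnostic sketch: the object name `SparkRuntimeCheck` is made up, and it assumes the job runs in the same environment (same spark-submit) as the failing application.

```scala
import org.apache.spark.SPARK_VERSION

// Hypothetical helper object, not part of the original application.
object SparkRuntimeCheck {
  def main(args: Array[String]): Unit = {
    // Version of the Spark core jar that is actually on the runtime classpath.
    println(s"Runtime Spark version: $SPARK_VERSION")

    // Location of the jar that SessionCatalog (the class named in the
    // NoSuchMethodError) was loaded from.
    val cls = Class.forName("org.apache.spark.sql.catalyst.catalog.SessionCatalog")
    val source = Option(cls.getProtectionDomain.getCodeSource).map(_.getLocation)
    println(s"SessionCatalog loaded from: ${source.getOrElse("unknown")}")
  }
}
```

If the printed version is not 2.4.0, the cluster or local install is providing a different Spark than the one the "provided" dependencies were compiled against.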

MicroBatchExecution Spark Structured Streaming with Kafka 2.4.0

When I try to run a Spark Structured Streaming application with Kafka integration, I keep getting this error:
ERROR MicroBatchExecution: Query [id = ff14fce6-71d3-4616-bd2d-40f07a85a74b, runId = 42670f29-21a9-4f7e-abd0-66ead8807282] terminated with error
java.lang.IllegalStateException: No entry found for connection 2147483647
Why does this happen? Could it be a dependency problem?
My build.sbt file looks like this:
name := "SparkAirflowK8s"
version := "0.1"
scalaVersion := "2.12.7"
val sparkVersion = "2.4.0"
val circeVersion = "0.11.0"
dependencyOverrides += "com.fasterxml.jackson.core" % "jackson-core" % "2.9.8"
dependencyOverrides += "com.fasterxml.jackson.core" % "jackson-databind" % "2.9.8"
dependencyOverrides += "com.fasterxml.jackson.module" % "jackson-module-scala_2.12" % "2.9.8"
resolvers += "Spark Packages Repo" at "http://dl.bintray.com/spark-packages/maven"
resolvers += "confluent" at "http://packages.confluent.io/maven/"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-sql" % sparkVersion,
  "org.apache.spark" %% "spark-streaming" % sparkVersion,
  "org.apache.spark" %% "spark-sql-kafka-0-10" % sparkVersion,
  "org.apache.kafka" %% "kafka" % "2.1.0",
  "org.scalatest" %% "scalatest" % "3.2.0-SNAP10" % "it, test",
  "org.scalacheck" %% "scalacheck" % "1.14.0" % "it, test",
  "io.kubernetes" % "client-java" % "3.0.0" % "it",
  "org.json" % "json" % "20180813",
  "io.circe" %% "circe-core" % circeVersion,
  "io.circe" %% "circe-generic" % circeVersion,
  "io.circe" %% "circe-parser" % circeVersion,
  "org.apache.avro" % "avro" % "1.8.2",
  "io.confluent" % "kafka-avro-serializer" % "5.0.1"
)
Here is the part of the code:
val sparkConf = new SparkConf()
  .setMaster(args(0))
  .setAppName("KafkaSparkJob")
val sparkSession = SparkSession
  .builder()
  .config(sparkConf)
  .getOrCreate()
val avroStream = sparkSession.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "topic1")
  .load()
val outputExample = avroStream
  .writeStream
  .outputMode("append")
  .format("console")
  .start()
outputExample.awaitTermination()

I changed localhost to the NodePort service defined for the Kafka deployment, and the exception no longer appears.
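
Since the fix was to point the job at the right broker address, one option is to keep that address out of the code. The sketch below reads it from an environment variable and falls back to the value from the question. The variable name `KAFKA_BOOTSTRAP_SERVERS` and the object name are assumptions made for illustration. The rest mirrors the code in the question.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical variant of the job from the question.
object KafkaSparkJobSketch {
  def main(args: Array[String]): Unit = {
    // KAFKA_BOOTSTRAP_SERVERS is an assumed variable name; on Kubernetes it
    // would hold the address of the Kafka service (for example the NodePort).
    val bootstrapServers =
      sys.env.getOrElse("KAFKA_BOOTSTRAP_SERVERS", "localhost:9092")

    val spark = SparkSession
      .builder()
      .master(args(0)) // master URL passed as first argument, as in the question
      .appName("KafkaSparkJob")
      .getOrCreate()

    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", bootstrapServers)
      .option("subscribe", "topic1")
      .load()

    stream.writeStream
      .outputMode("append")
      .format("console")
      .start()
      .awaitTermination()
  }
}
```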

SBT file for Spark Kafka

I am new to SBT. I am trying to create a project with a simple producer and consumer using Spark and Scala. Do I need to add anything else to this SBT file? I am using IntelliJ IDEA, Spark 2.2, CDH 5.10 and Kafka 0.10.
import sbt.Keys._
import sbt._
name := "consumer"
version := "1.0"
scalaVersion := "2.11.8"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.2.0.cloudera1"
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "2.2.0.cloudera1"
libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.2.0.cloudera1"
resolvers ++= Vector(
  "Cloudera repo" at "https://repository.cloudera.com/artifactory/cloudera-repos/"
)

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/sql/SQLContext

I am using IntelliJ 2016.3.
import sbt.Keys._
import sbt._

object ApplicationBuild extends Build {
  object Versions {
    val spark = "1.6.3"
  }
  val projectName = "example-spark"
  val common = Seq(
    version := "1.0",
    scalaVersion := "2.11.7"
  )
  val customLibraryDependencies = Seq(
    "org.apache.spark" %% "spark-core" % Versions.spark % "provided",
    "org.apache.spark" %% "spark-sql" % Versions.spark % "provided",
    "org.apache.spark" %% "spark-hive" % Versions.spark % "provided",
    "org.apache.spark" %% "spark-streaming" % Versions.spark % "provided",
    "org.apache.spark" %% "spark-streaming-kafka" % Versions.spark
      exclude("log4j", "log4j")
      exclude("org.spark-project.spark", "unused"),
    "com.typesafe.scala-logging" %% "scala-logging" % "3.1.0",
    "org.slf4j" % "slf4j-api" % "1.7.10",
    "org.slf4j" % "slf4j-log4j12" % "1.7.10"
      exclude("log4j", "log4j"),
    "log4j" % "log4j" % "1.2.17" % "provided",
    "org.scalatest" %% "scalatest" % "2.2.4" % "test"
  )
I keep getting the runtime exception below, even though I declared all the dependencies as shown above.
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/sql/SQLContext
at example.SparkSqlExample.main(SparkSqlExample.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.SQLContext
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 6 more
I investigated this further on the web and found that it is mainly caused by incorrect entries in build.sbt or by version mismatches. But in my case everything looks good, as shown above.
Please suggest what I did wrong here.

I guess this is because you marked your dependencies as "provided", but apparently you (or IDEA) don't provide them.
Try removing the "provided" option or (my preferred way) move the class with the main method to src/test/scala, where "provided" dependencies are on the classpath.
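
If the application is started with `sbt run` rather than from the IDE, a third option is to keep the "provided" scope and tell sbt to use the full compile classpath for `run`. The setting below is a sketch in sbt 1.x syntax. The build in the question uses the older `Build` trait, where the equivalent would be written with `run in Compile`, `fullClasspath in Compile` and so on. It only affects `sbt run` and does not change how IntelliJ builds its own run configuration.

```scala
// build.sbt (sbt 1.x syntax): make `sbt run` use the full compile classpath,
// which includes dependencies marked "provided".
Compile / run := Defaults
  .runTask(
    Compile / fullClasspath,
    Compile / run / mainClass,
    Compile / run / runner
  )
  .evaluated
```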

value saveToCassandra is not a member of org.apache.spark.streaming.dstream.DStream[(String, Int)]

I'm new to Spark, and I'm getting this error when I try to save data to Cassandra.
I have imported StreamingContext._ and SparkContext._, but I still get the error.
These are the dependencies I'm using:
"org.apache.spark" %% "spark-core" % "1.5.2",
"org.apache.spark" %% "spark-streaming" % "1.5.2",
"com.datastax.spark" %% "spark-cassandra-connector" % "1.5.0",
"org.apache.spark" %% "spark-sql" % "1.5.2"
Thank you

To use saveToCassandra on a DStream, you have to import DStreamFunctions, for example with:
import com.datastax.spark.connector.streaming._
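
A minimal sketch of how the import is used with a `DStream[(String, Int)]` like the one in the error message. The keyspace, table and column names (`test_ks`, `word_counts`, `word`, `count`) and the object name are placeholders, and the code assumes the Cassandra connection is configured on the SparkConf.

```scala
import com.datastax.spark.connector._           // SomeColumns and column-name conversions
import com.datastax.spark.connector.streaming._ // adds saveToCassandra to DStream
import org.apache.spark.streaming.dstream.DStream

// Hypothetical helper, not from the original post.
object SaveCounts {
  // Writes each (word, count) pair to the placeholder table test_ks.word_counts.
  def save(counts: DStream[(String, Int)]): Unit =
    counts.saveToCassandra("test_ks", "word_counts", SomeColumns("word", "count"))
}
```

Without the `streaming._` import the compiler cannot find the implicit conversion that adds `saveToCassandra` to DStream, which produces the "is not a member" error from the title.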
