Spark 2 Connect to HBase - apache-spark

I am trying to migrate code from Spark 1.6 / Scala 2.10 to Spark 2.4 / Scala 2.11, but the code does not compile.
The dependency versions, a minimal example, and the compilation error are shown below.
// Dependencies
, "org.apache.spark" %% "spark-core" % "2.4.0"
, "org.apache.spark" %% "spark-sql" % "2.4.0"
, "org.apache.hbase" % "hbase-server" % "1.2.0-cdh5.14.4"
, "org.apache.hbase" % "hbase-common" % "1.2.0-cdh5.14.4"
, "org.apache.hbase" % "hbase-spark" % "1.2.0-cdh5.14.4"
// Minimal example
package spark2.hbase

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.spark.HBaseContext
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

object ConnectToHBase {
  def main(args: Array[String]): Unit = {
    implicit val spark: SparkSession = SparkSession.builder.appName("Connect to HBase from Spark 2")
      .config("spark.master", "local")
      .getOrCreate()
    implicit val sc: SparkContext = spark.sparkContext
    val hbaseConf = HBaseConfiguration.create()
    val hbaseContext = new HBaseContext(sc, hbaseConf)
  }
}
// Compilation error
[error] missing or invalid dependency detected while loading class file 'HBaseContext.class'.
[error] Could not access type Logging in package org.apache.spark,
[error] because it (or its dependencies) are missing. Check your build definition for
[error] missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to see the problematic classpath.)
[error] A full rebuild may help if 'HBaseContext.class' was compiled against an incompatible version of org.apache.spark.

This works:
lazy val sparkVer = "2.4.0-cdh6.2.0"
lazy val hbaseVer = "2.1.0"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % sparkVer
, "org.apache.spark" %% "spark-sql" % sparkVer
, "org.apache.spark" %% "spark-streaming" % sparkVer
, "org.apache.hbase" % "hbase-common" % hbaseVer
, "org.apache.hbase" % "hbase-client" % hbaseVer
, "org.apache.hbase.connectors.spark" % "hbase-spark" % "1.0.0"
)
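The essential piece here is moving to Cloudera CDH 6 (not 5) and switching to the "hbase-spark" artifact published under org.apache.hbase.connectors.spark, because the CDH 5 build cannot work with Spark 2. The CDH 5 "hbase-spark" was compiled against Spark 1.x, where org.apache.spark.Logging was a public trait. Spark 2 no longer exposes it (it moved to org.apache.spark.internal.Logging), which is what the "Could not access type Logging" error reports.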
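One assumption worth stating: "2.4.0-cdh6.2.0" is a Cloudera build of Spark, so it will probably not resolve from Maven Central alone. A minimal sketch of the extra resolver line, reusing the Cloudera repository URL that appears in another build file further down this page:

```scala
// build.sbt fragment (sketch): add the Cloudera repository so that
// CDH-suffixed versions such as "2.4.0-cdh6.2.0" can be resolved.
resolvers += "Cloudera repo" at "https://repository.cloudera.com/artifactory/cloudera-repos/"
```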

Related

MicroBatchExecution Spark Structured Streaming with Kafka 2.4.0

When I run a Spark Structured Streaming application with Kafka integration, I keep getting this error:
ERROR MicroBatchExecution: Query [id = ff14fce6-71d3-4616-bd2d-40f07a85a74b, runId = 42670f29-21a9-4f7e-abd0-66ead8807282] terminated with error
java.lang.IllegalStateException: No entry found for connection 2147483647
Why does this happen? Could it be a dependency problem?
My build.sbt file looks like this:
name := "SparkAirflowK8s"
version := "0.1"
scalaVersion := "2.12.7"
val sparkVersion = "2.4.0"
val circeVersion = "0.11.0"
dependencyOverrides += "com.fasterxml.jackson.core" % "jackson-core" % "2.9.8"
dependencyOverrides += "com.fasterxml.jackson.core" % "jackson-databind" % "2.9.8"
dependencyOverrides += "com.fasterxml.jackson.module" % "jackson-module-scala_2.12" % "2.9.8"
resolvers += "Spark Packages Repo" at "http://dl.bintray.com/spark-packages/maven"
resolvers += "confluent" at "http://packages.confluent.io/maven/"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % sparkVersion,
"org.apache.spark" %% "spark-sql" % sparkVersion,
"org.apache.spark" %% "spark-streaming" % sparkVersion,
"org.apache.spark" %% "spark-sql-kafka-0-10" % sparkVersion,
"org.apache.kafka" %% "kafka" % "2.1.0",
"org.scalatest" %% "scalatest" % "3.2.0-SNAP10" % "it, test",
"org.scalacheck" %% "scalacheck" % "1.14.0" % "it, test",
"io.kubernetes" % "client-java" % "3.0.0" % "it",
"org.json" % "json" % "20180813",
"io.circe" %% "circe-core" % circeVersion,
"io.circe" %% "circe-generic" % circeVersion,
"io.circe" %% "circe-parser" % circeVersion,
"org.apache.avro" % "avro" % "1.8.2",
"io.confluent" % "kafka-avro-serializer" % "5.0.1"
)
Here is the part of the code:
val sparkConf = new SparkConf()
  .setMaster(args(0))
  .setAppName("KafkaSparkJob")
val sparkSession = SparkSession
  .builder()
  .config(sparkConf)
  .getOrCreate()
val avroStream = sparkSession.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "topic1")
  .load()
val outputExample = avroStream
  .writeStream
  .outputMode("append")
  .format("console")
  .start()
outputExample.awaitTermination()
I changed localhost in kafka.bootstrap.servers to the address of the NodePort service defined for the Kafka deployment. With that change, the exception no longer appears.
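To avoid hard-coding the broker address at all, one option is to read it from the environment. The following is only a sketch: the KafkaEndpoint object and the KAFKA_BOOTSTRAP_SERVERS variable name are hypothetical and not part of Spark or Kafka.

```scala
// Sketch: resolve the Kafka bootstrap address from the environment
// instead of hard-coding "localhost:9092".
// KAFKA_BOOTSTRAP_SERVERS is a hypothetical variable name chosen for this example.
object KafkaEndpoint {
  val EnvKey = "KAFKA_BOOTSTRAP_SERVERS"

  // Returns the configured address, or the local default when the variable is not set.
  def bootstrapServers(env: Map[String, String] = sys.env): String =
    env.getOrElse(EnvKey, "localhost:9092")
}

// Usage in the job from the question:
//   .option("kafka.bootstrap.servers", KafkaEndpoint.bootstrapServers())
```

When the job runs against a Kafka deployment in Kubernetes, the variable would be set to the NodePort service address rather than localhost.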

SBT file for Spark Kafka

I am new to SBT. I am trying to create a project with a simple producer and consumer using Spark and Scala. Do I need to add anything else to this SBT file? I am using IntelliJ IDEA with Spark 2.2, CDH 5.10, and Kafka 0.10.
import sbt.Keys._
import sbt._
name := "consumer"
version := "1.0"
scalaVersion := "2.11.8"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.2.0.cloudera1"
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "2.2.0.cloudera1"
libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.2.0.cloudera1"
resolvers ++= Vector(
"Cloudera repo" at "https://repository.cloudera.com/artifactory/cloudera-repos/"
)

How to use Test Jars for spark in SBT

I am creating a Spark 2.0.1 project and want to use the Spark test-jars in my SBT project.
build.sbt:
scalaVersion := "2.11.0"
val sparkVersion = "2.0.1"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % sparkVersion % "compile",
"org.apache.spark" %% "spark-sql" % sparkVersion % "compile",
"org.scalatest" %% "scalatest" % "2.2.6" % "test",
"org.apache.spark" %% "spark-core" % sparkVersion % "test" classifier "tests",
"org.apache.spark" %% "spark-sql" % sparkVersion % "test" classifier "tests",
"org.apache.spark" %% "spark-catalyst" % sparkVersion % "test" classifier "tests"
)
My Test code:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
import org.apache.spark.sql.test.SharedSQLContext

class LoaderTest extends org.apache.spark.sql.QueryTest with SharedSQLContext {
  import testImplicits._

  test("function current_date") {
    val df1 = Seq((1, 2), (3, 1)).toDF("a", "b")
    // Rest of test code and assertion using checkAnswer method
  }
}
But when I try to run the tests using:
sbt clean test
I get the following errors:
[info] Compiling 1 Scala source to /tstprg/test/target/scala-2.11/test-classes...
[error] bad symbolic reference to org.apache.spark.sql.catalyst.expressions.PredicateHelper encountered in class file 'PlanTest.class'.
[error] Cannot access type PredicateHelper in package org.apache.spark.sql.catalyst.expressions. The current classpath may be
[error] missing a definition for org.apache.spark.sql.catalyst.expressions.PredicateHelper, or PlanTest.class may have been compiled against a version that's
[error] incompatible with the one found on the current classpath.
[error] /tstprg/test/src/test/scala/facts/LoaderTest.scala:7: illegal inheritance;
[error] self-type facts.LoaderTest does not conform to org.apache.spark.sql.QueryTest's selftype org.apache.spark.sql.QueryTest
[error] class LoaderTest extends org.apache.spark.sql.QueryTest with SharedSQLContext {
[error] ^
[error] /tstprg/test/src/test/scala/facts/LoaderTest.scala:7: illegal inheritance;
[error] self-type facts.LoaderTest does not conform to org.apache.spark.sql.test.SharedSQLContext's selftype org.apache.spark.sql.test.SharedSQLContext
[error] class LoaderTest extends org.apache.spark.sql.QueryTest with SharedSQLContext {
[error] ^
[error] bad symbolic reference to org.apache.spark.sql.Encoder encountered in class file 'SQLImplicits.class'.
[error] Cannot access type Encoder in package org.apache.spark.sql. The current classpath may be
[error] missing a definition for org.apache.spark.sql.Encoder, or SQLImplicits.class may have been compiled against a version that's
[error] incompatible with the one found on the current classpath.
[error] /tstprg/test/src/test/scala/facts/LoaderTest.scala:11: bad symbolic reference to org.apache.spark.sql.catalyst.plans.logical encountered in class file 'SQLTestUtils.class'.
[error] Cannot access term logical in package org.apache.spark.sql.catalyst.plans. The current classpath may be
[error] missing a definition for org.apache.spark.sql.catalyst.plans.logical, or SQLTestUtils.class may have been compiled against a version that's
[error] incompatible with the one found on the current classpath.
[error] val df1 = Seq((1, 2), (3, 1)).toDF("a", "b")
[error] ^
[error] 5 errors found
[error] (test:compileIncremental) Compilation failed
Can anybody who has used the Spark test-jars for unit testing with SBT tell me what I am missing?
Note: This test works fine when I run it through the IntelliJ IDE.
Try using the scopes as shown below:
version := "0.1"
scalaVersion := "2.11.11"
val sparkVersion = "2.3.1"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % sparkVersion % Provided,
"org.apache.spark" %% "spark-core" % sparkVersion % Test classifier "tests",
"org.apache.spark" %% "spark-core" % sparkVersion % Test classifier "test-sources",
"org.apache.spark" %% "spark-sql" % sparkVersion % Provided,
"org.apache.spark" %% "spark-sql" % sparkVersion % Test classifier "tests",
"org.apache.spark" %% "spark-sql" % sparkVersion % Test classifier "test-sources",
"org.apache.spark" %% "spark-catalyst" % sparkVersion % Test classifier "tests",
"org.apache.spark" %% "spark-catalyst" % sparkVersion % Test classifier "test-sources",
"com.typesafe.scala-logging" %% "scala-logging" % "3.9.0",
"org.scalatest" %% "scalatest" % "3.0.4" % "test",
"org.typelevel" %% "cats-core" % "1.1.0",
"org.typelevel" %% "cats-effect" % "1.0.0-RC2",
"org.apache.spark" %% "spark-streaming" % sparkVersion % Provided,
"org.apache.spark" %% "spark-sql-kafka-0-10" % sparkVersion % Provided exclude ("net.jpountz.lz4", "lz4"),
"com.pusher" % "pusher-java-client" % "1.8.0"
)
Try changing the scope of the dependencies marked as test, as shown below:
scalaVersion := "2.11.0"
val sparkVersion = "2.0.1"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % sparkVersion,
"org.apache.spark" %% "spark-sql" % sparkVersion,
"org.scalatest" %% "scalatest" % "2.2.6",
"org.apache.spark" %% "spark-core" % sparkVersion ,
"org.apache.spark" %% "spark-sql" % sparkVersion,
"org.apache.spark" %% "spark-catalyst" % sparkVersion
)
Alternatively, add "compile" as the scope.

Error in Spark cassandra integration with spark-cassandra connector

I am trying to save data to Cassandra from Spark in standalone mode by running the following command:
bin/spark-submit --packages datastax:spark-cassandra-connector:1.6.0-s_2.10 \
  --class "pl.japila.spark.SparkMeApp" --master local /home/hduser2/code14/target/scala-2.10/simple-project_2.10-1.0.jar
My build.sbt file is:
name := "Simple Project"
version := "1.0"
scalaVersion := "2.10.4"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.0"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.6.0"
resolvers += "Spark Packages Repo" at "https://dl.bintray.com/spark-packages/maven"
libraryDependencies += "datastax" % "spark-cassandra-connector" % "1.6.0-s_2.10"
libraryDependencies ++= Seq(
"org.apache.cassandra" % "cassandra-thrift" % "3.5" ,
"org.apache.cassandra" % "cassandra-clientutil" % "3.5",
"com.datastax.cassandra" % "cassandra-driver-core" % "3.0.0"
)
My Spark code is:
package pl.japila.spark
import org.apache.spark.sql._
import com.datastax.spark.connector._
import com.datastax.driver.core._
import com.datastax.spark.connector.cql._
import org.apache.spark.{SparkContext, SparkConf}
import com.datastax.driver.core.QueryOptions._
import org.apache.spark.SparkConf
import com.datastax.driver.core._
import com.datastax.spark.connector.rdd._
object SparkMeApp {
  def main(args: Array[String]) {
    val conf = new SparkConf(true).set("spark.cassandra.connection.host", "127.0.0.1")
    val sc = new SparkContext("local", "test", conf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    val rdd = sc.cassandraTable("test", "kv")
    val collection = sc.parallelize(Seq(("cat", 30), ("fox", 40)))
    collection.saveToCassandra("test", "kv", SomeColumns("key", "value"))
  }
}
And I got this error:
Exception in thread "main" java.lang.NoSuchMethodError: com.datastax.driver.core.QueryOptions.setRefreshNodeIntervalMillis(I)Lcom/datastax/driver/core/QueryOptions;
at com.datastax.spark.connector.cql.DefaultConnectionFactory$.clusterBuilder(CassandraConnectionFactory.scala:49)
at com.datastax.spark.connector.cql.DefaultConnectionFactory$.createCluster(CassandraConnectionFactory.scala:92)
at com.datastax.spark.connector.cql.CassandraConnector$.com$datastax$spark$connector$cql$CassandraConnector$$createSession(CassandraConnector.scala:153)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$3.apply(CassandraConnector.scala:148)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$3.apply(CassandraConnector.scala:148)
at com.datastax.spark.connector.cql.RefCountedCache.createNewValueAndKeys(RefCountedCache.scala:31)
at com.datastax.spark.connector.cql.RefCountedCache.acquire(RefCountedCache.scala:56)
at com.datastax.spark.connector.cql.CassandraConnector.openSession(CassandraConnector.scala:81)
at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:109)
Versions used:
Spark - 1.6.0
Scala - 2.10.4
cassandra-driver-core jar - 3.0.0
Cassandra - 2.2.7
spark-cassandra-connector - 1.6.0-s_2.10
Any help would be appreciated.
I would start by removing the following:
libraryDependencies ++= Seq(
"org.apache.cassandra" % "cassandra-thrift" % "3.5" ,
"org.apache.cassandra" % "cassandra-clientutil" % "3.5",
"com.datastax.cassandra" % "cassandra-driver-core" % "3.0.0"
)
The libraries that the connector depends on are pulled in automatically by the --packages option, so they do not need to be declared separately.
Then I would test the package resolution by launching the spark-shell with:
./bin/spark-shell --packages datastax:spark-cassandra-connector:1.6.0-s_2.10
You should see the following resolutions happen correctly:
datastax#spark-cassandra-connector added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
confs: [default]
found datastax#spark-cassandra-connector;1.6.0-s_2.10 in spark-packages
found org.apache.cassandra#cassandra-clientutil;3.0.2 in list
found com.datastax.cassandra#cassandra-driver-core;3.0.0 in list
...
[2.10.5] org.scala-lang#scala-reflect;2.10.5
:: resolution report :: resolve 627ms :: artifacts dl 10ms
:: modules in use:
com.datastax.cassandra#cassandra-driver-core;3.0.0 from list in [default]
com.google.guava#guava;16.0.1 from list in [default]
com.twitter#jsr166e;1.1.0 from list in [default]
datastax#spark-cassandra-connector;1.6.0-s_2.10 from spark-packages in [default]
...
If these appear to resolve correctly but everything still doesn't work, I would try clearing out the cache for these artifacts.
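For reference, here is a sketch of what the build.sbt from the question would look like with the explicit Cassandra dependencies removed. It only applies the first step suggested above, using the same versions as the question, and is not confirmed here to resolve the NoSuchMethodError on its own.

```scala
// build.sbt (sketch): the build from the question without the explicit
// cassandra-thrift, cassandra-clientutil and cassandra-driver-core entries;
// the connector brings in the versions it was built against.
name := "Simple Project"
version := "1.0"
scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.0"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.6.0"

resolvers += "Spark Packages Repo" at "https://dl.bintray.com/spark-packages/maven"
libraryDependencies += "datastax" % "spark-cassandra-connector" % "1.6.0-s_2.10"
```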

error: value succinct is not a member of org.apache.spark.rdd.RDD[String]

I am trying out SuccinctRDD as a search mechanism.
Below is what I am trying, as per the documentation:
import edu.berkeley.cs.succinct.kv._
val data = sc.textFile("file:///home/aman/data/jsonDoc1.txt")
val succintdata = data.succinct.persist()
The link is here ...succint RDD
The error I am getting is below
<console>:32: error: value succinct is not a member of org.apache.spark.rdd.RDD[String]
val succintdata = data.succinct.persist()
Can anybody point out the problem here, or any step I should follow before this?
This is my sbt build:
name := "succinttest"
version := "1.0"
scalaVersion := "2.11.7"
resolvers += "Spark Packages Repo" at "http://dl.bintray.com/spark-packages/maven"
libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "1.5.2"
libraryDependencies += "org.apache.kafka" % "kafka_2.11" % "0.8.2.2"
libraryDependencies += "org.apache.spark" % "spark-streaming_2.11" % "1.5.2"
libraryDependencies += "org.apache.spark" % "spark-sql_2.11" % "1.5.2"
libraryDependencies += "amplab" % "succinct" % "0.1.7"
libraryDependencies += "org.apache.hadoop" % "hadoop-common" % "2.6.0" excludeAll ExclusionRule(organization = "javax.servlet")
This is a typical implicit conversion problem in Scala.
When you import the library:
import edu.berkeley.cs.succinct.kv._
you are importing all the classes and methods from this package, including its implicits. So, if you check the package object in the source:
https://github.com/amplab/succinct/blob/master/spark/src/main/scala/edu/berkeley/cs/succinct/kv/package.scala
you will find the following implicit class:
implicit class SuccinctContext(sc: SparkContext) {
  def succinctKV[K: ClassTag](filePath: String, storageLevel: StorageLevel = StorageLevel.MEMORY_ONLY)
      (implicit ordering: Ordering[K])
  : SuccinctKVRDD[K] = SuccinctKVRDD[K](sc, filePath, storageLevel)
}
This means that you get a new method on SparkContext, not on RDD[String], that creates a SuccinctKVRDD from a text file. So try the following code:
import edu.berkeley.cs.succinct.kv._
val data = sc.succinctKV("file:///home/aman/data/jsonDoc1.txt")
You will then have a succinct RDD on which you can run the operations you need, such as search, filterByValue, etc.:
https://github.com/amplab/succinct/blob/master/spark/src/main/scala/edu/berkeley/cs/succinct/kv/SuccinctKVRDD.scala
