Can spark sql create nosql table? - apache-spark

From what I have learned so far, I think Spark SQL can create Hive tables, since Spark SQL evolved from Shark, which itself came from Hive.
hive create table
Can spark-sql also create tables in other NoSQL stores, such as HBase, Cassandra, or Elasticsearch?
I searched the relevant documentation but did not find a spark-sql API for creating such tables.
Will it be supported in the future?

Creating indexes in Elasticsearch, or tables in HBase and Cassandra, directly through spark-sql is not possible at the moment.
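What you can do, once the table or index already exists, is write a DataFrame to it through the store's own Spark connector. Below is a minimal sketch for Cassandra (my own illustration, not from the original answer), assuming the spark-cassandra-connector package is on the classpath and that keyspace test and table kv already exist; elasticsearch-hadoop offers a similar DataFrame writer for Elasticsearch (the org.elasticsearch.spark.sql data source).
import org.apache.spark.sql.{SaveMode, SparkSession}

// connection settings must point at your cluster
val spark = SparkSession.builder()
  .appName("CassandraWriteSketch")
  .config("spark.cassandra.connection.host", "127.0.0.1")
  .getOrCreate()
import spark.implicits._

// illustrative data; "key" and "value" must match the existing table's columns
val df = Seq(("cat", 30), ("fox", 40)).toDF("key", "value")

df.write
  .format("org.apache.spark.sql.cassandra")   // data source provided by spark-cassandra-connector
  .options(Map("keyspace" -> "test", "table" -> "kv"))
  .mode(SaveMode.Append)
  .save()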
You can use the hbase-client library to interact with HBase natively: https://mvnrepository.com/artifact/org.apache.hbase/hbase-client
1. Creating a table in HBase
import org.apache.hadoop.hbase._
import org.apache.hadoop.hbase.client.HBaseAdmin
import org.apache.hadoop.conf.Configuration

// conf is an HBaseConfiguration and myTable holds the table name;
// HBaseAdmin / HTableDescriptor as used here are the HBase 1.x client API
val admin = new HBaseAdmin(conf)
if (!admin.tableExists(myTable)) {
  val htd = new HTableDescriptor(myTable)
  htd.addFamily(new HColumnDescriptor("id"))
  htd.addFamily(new HColumnDescriptor("name"))
  htd.addFamily(new HColumnDescriptor("country"))
  htd.addFamily(new HColumnDescriptor("pincode"))
  admin.createTable(htd)
}
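With the hbase-client 2.x dependency used in the build.sbt below, the 1.x-style HBaseAdmin and HTableDescriptor constructors above are deprecated or removed, so the same step would go through the Admin API instead. A minimal sketch under that assumption (the table and column-family names simply match the streaming example that follows):
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ColumnFamilyDescriptorBuilder, ConnectionFactory, TableDescriptorBuilder}

val conf = HBaseConfiguration.create()
val connection = ConnectionFactory.createConnection(conf)
val admin = connection.getAdmin
val tableName = TableName.valueOf("htd")
if (!admin.tableExists(tableName)) {
  // build a table descriptor with a single column family "personal"
  val descriptor = TableDescriptorBuilder.newBuilder(tableName)
    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("personal"))
    .build()
  admin.createTable(descriptor)
}
admin.close()
connection.close()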
2. Write to the HBase table that was created
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.hadoop.hbase.{HBaseConfiguration, HColumnDescriptor, HTableDescriptor}
import org.apache.spark.sql.SparkSession
import org.apache.hadoop.hbase.client.ConnectionFactory
import org.apache.hadoop.hbase.client.Connection
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.util.Bytes
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> "cmaster.localcloud.com:9092,cworker2.localcloud.com:9092,cworker1.localcloud.com:9092",
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"group.id" -> "use_a_separate_group_id_for_each_stream",
"auto.offset.reset" -> "latest",
"enable.auto.commit" -> (false: java.lang.Boolean)
)
val topics = Array("test")
val spark = SparkSession.builder().master("local[8]").appName("KafkaSparkHBasePipeline").getOrCreate()
spark.sparkContext.setLogLevel("OFF")

// create a streaming context and subscribe to the Kafka topic
// (a 5-second batch interval is assumed here)
val ssc = new StreamingContext(spark.sparkContext, Seconds(5))
val kafkaStream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](topics, kafkaParams))

val result = kafkaStream.map(record => (record.key, record.value))
result.foreachRDD(rdd => {
  val cols = rdd.map(x => x._2.split(","))
  val arr = cols.map(x => {
    val id = x(0)
    val name = x(1)
    val country = x(2)
    val pincode = x(3)
    (id, name, country, pincode)
  })
  arr.foreachPartition { iter =>
    val conf = HBaseConfiguration.create()
    conf.set("hbase.zookeeper.quorum", "cmaster.localcloud.com")
    conf.set("hbase.rootdir", "hdfs://localhost:8020/hbase")
    conf.set("hbase.zookeeper.property.clientPort", "2181")
    conf.set("zookeeper.znode.parent", "/hbase")
    conf.set("hbase.unsafe.stream.capability.enforce", "false")
    conf.set("hbase.cluster.distributed", "true")
    val conn = ConnectionFactory.createConnection(conf)
    import org.apache.hadoop.hbase.TableName
    val tableName = "htd"
    val table = TableName.valueOf(tableName)
    val hbaseTable = conn.getTable(table)
    val cfPersonal = "personal"
    iter.foreach(x => {
      val keyValue = "Key_" + x._1
      val id = new Put(Bytes.toBytes(keyValue))
      val name = x._2.toString
      val country = x._3.toString
      val pincode = x._4.toString
      id.addColumn(Bytes.toBytes(cfPersonal), Bytes.toBytes("name"), Bytes.toBytes(name))
      id.addColumn(Bytes.toBytes(cfPersonal), Bytes.toBytes("country"), Bytes.toBytes(country))
      id.addColumn(Bytes.toBytes(cfPersonal), Bytes.toBytes("pincode"), Bytes.toBytes(pincode))
      hbaseTable.put(id)
    })
    hbaseTable.close()
    conn.close()
  }
})

// start the streaming pipeline
ssc.start()
ssc.awaitTermination()
3. Dependencies used for this project (build.sbt)
name := "KafkaSparkHBasePipeline"
version := "0.1"
scalaVersion := "2.11.8"
resolvers += "Mavenrepository" at "https://mvnrepository.com"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.2"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.2"
libraryDependencies += "org.apache.hbase" % "hbase-client" % "2.2.0"
libraryDependencies += "org.apache.kafka" %% "kafka" % "2.3.0"
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "2.4.3"
libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.4.3"
4. To verify the data in HBase
[smart@cmaster sathyadev]$ hbase shell
Java HotSpot(TM) 64-Bit Server VM warning: Using incremental CMS is deprecated and will likely be removed in a future release
HBase Shell
Use "help" to get list of supported commands. Use "exit" to quit this interactive shell.
For Reference, please visit: http://hbase.apache.org/2.0/book.html
Version 2.1.0-cdh6.2.1, rUnknown, Wed Sep 11 01:05:56 PDT 2019
Took 0.0064 seconds
hbase(main):001:0> scan 'htd';
ROW                COLUMN+CELL
 Key_1010          column=personal:country, timestamp=1584125319078, value=USA
 Key_1010          column=personal:name, timestamp=1584125319078, value=Mark
 Key_1010          column=personal:pincode, timestamp=1584125319078, value=54321
 Key_1011          column=personal:country, timestamp=1584125320073, value=CA

Related

spark unit testing with SharedSparkSession (org.apache.spark.sql.test.SharedSparkSession.eventually)

I tried to write a simple unit test for a Spark join, following the tutorial Apache Spark Unit Testing Part 2 — Spark SQL.
I added all the dependencies from the reference, plus
"org.apache.spark" %% "spark-core" % "2.4.7" % Test,
"org.apache.spark" %% "spark-core" % "2.4.7" % Test classifier "tests",
"org.scalatest" %% "scalatest" % "3.2.3" % Test,
but I can't extend QueryTest because it involves AnyFunSuite in a private class. The code now looks like this:
class TestScalaCheck extends AnyFunSuite with SharedSparkSession {
  import testImplicits._

  test("join - join using") {
    val df = Seq(1, 2, 3).map(i => (i, i.toString)).toDF("int", "str")
    val df2 = Seq(1, 2, 3).map(i => (i, (i + 1).toString)).toDF("int", "str")
    checkAnswer(
      df.join(df2, "int"),
      Row(1, "1", "2") :: Row(2, "2", "3") :: Row(3, "3", "5") :: Nil)
  }
}
And it generates the exception message:
An exception or error caused a run to abort: org.apache.spark.sql.test.SharedSparkSession.eventually(Lorg/scalatest/concurrent/PatienceConfiguration$Timeout;Lorg/scalatest/concurrent/PatienceConfiguration$Interval;Lscala/Function0;Lorg/scalactic/source/Position;)Ljava/lang/Object;
java.lang.NoSuchMethodError: org.apache.spark.sql.test.SharedSparkSession.eventually(Lorg/scalatest/concurrent/PatienceConfiguration$Timeout;Lorg/scalatest/concurrent/PatienceConfiguration$Interval;Lscala/Function0;Lorg/scalactic/source/Position;)Ljava/lang/Object;
at org.apache.spark.sql.test.SharedSparkSession.afterEach(SharedSparkSession.scala:135)
at org.apache.spark.sql.test.SharedSparkSession.afterEach$(SharedSparkSession.scala:129)
at lesson2.TestScalaCheck.afterEach(TestScalaCheck.scala:14)
at org.scalatest.BeforeAndAfterEach.$anonfun$runTest$1(BeforeAndAfterEach.scala:247)
at org.scalatest.Status.$anonfun$withAfterEffect$1(Status.scala:377)
at org.scalatest.Status.$anonfun$withAfterEffect$1$adapted(Status.scala:373)
at org.scalatest.SucceededStatus$.whenCompleted(Status.scala:462)
at org.scalatest.Status.withAfterEffect(Status.scala:373)
at org.scalatest.Status.withAfterEffect$(Status.scala:371)
at org.scalatest.SucceededStatus$.withAfterEffect(Status.scala:434)
at org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:246)
at org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227)
at lesson2.TestScalaCheck.runTest(TestScalaCheck.scala:14)
at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:233)
at org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396)
at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475)
at org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:233)
at org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:232)
at org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1563)
at org.scalatest.Suite.run(Suite.scala:1112)
at org.scalatest.Suite.run$(Suite.scala:1094)
at org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1563)
at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$run$1(AnyFunSuiteLike.scala:237)
at org.scalatest.SuperEngine.runImpl(Engine.scala:535)
at org.scalatest.funsuite.AnyFunSuiteLike.run(AnyFunSuiteLike.scala:237)
at org.scalatest.funsuite.AnyFunSuiteLike.run$(AnyFunSuiteLike.scala:236)
at lesson2.TestScalaCheck.org$scalatest$BeforeAndAfterAll$$super$run(TestScalaCheck.scala:14)
at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213)
at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
at lesson2.TestScalaCheck.run(TestScalaCheck.scala:14)
at org.scalatest.tools.SuiteRunner.run(SuiteRunner.scala:45)
at org.scalatest.tools.Runner$.$anonfun$doRunRunRunDaDoRunRun$13(Runner.scala:1320)
at org.scalatest.tools.Runner$.$anonfun$doRunRunRunDaDoRunRun$13$adapted(Runner.scala:1314)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.scalatest.tools.Runner$.doRunRunRunDaDoRunRun(Runner.scala:1314)
at org.scalatest.tools.Runner$.$anonfun$runOptionallyWithPassFailReporter$24(Runner.scala:993)
at org.scalatest.tools.Runner$.$anonfun$runOptionallyWithPassFailReporter$24$adapted(Runner.scala:971)
at org.scalatest.tools.Runner$.withClassLoaderAndDispatchReporter(Runner.scala:1480)
at org.scalatest.tools.Runner$.runOptionallyWithPassFailReporter(Runner.scala:971)
at org.scalatest.tools.Runner$.run(Runner.scala:798)
at org.scalatest.tools.Runner.run(Runner.scala)
at org.jetbrains.plugins.scala.testingSupport.scalaTest.ScalaTestRunner.runScalaTest2or3(ScalaTestRunner.java:38)
at org.jetbrains.plugins.scala.testingSupport.scalaTest.ScalaTestRunner.main(ScalaTestRunner.java:25)
Could you help me to resolve it?
The original code is - DataFrameJoinSuite.scala
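A hedged guess at the cause, based only on the trace above: the Spark 2.4.x test jars (which provide SharedSparkSession) were built against the ScalaTest 3.0.x line, whose Eventually.eventually signature differs from ScalaTest 3.2.3, and that kind of binary mismatch is exactly what surfaces as this NoSuchMethodError. Aligning the test dependency may be worth trying, for example:
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.4.7" % Test,
  "org.apache.spark" %% "spark-core" % "2.4.7" % Test classifier "tests",
  "org.apache.spark" %% "spark-sql"  % "2.4.7" % Test classifier "tests",
  // assumption: use the ScalaTest line Spark 2.4.x was compiled against
  "org.scalatest"    %% "scalatest"  % "3.0.8" % Test
)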

Spark 2 Connect to HBase

I am trying to migrate code from Spark 1.6 / Scala 2.10 to Spark 2.4 / Scala 2.11 and cannot get it to compile.
Dependency versions, a minimal example, and the compilation error are shown below.
// Dependencies
, "org.apache.spark" %% "spark-core" % "2.4.0"
, "org.apache.spark" %% "spark-sql" % "2.4.0"
, "org.apache.hbase" % "hbase-server" % "1.2.0-cdh5.14.4"
, "org.apache.hbase" % "hbase-common" % "1.2.0-cdh5.14.4"
, "org.apache.hbase" % "hbase-spark" % "1.2.0-cdh5.14.4"
// Minimal example
package spark2.hbase
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.spark.HBaseContext
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession
object ConnectToHBase {
  def main(args: Array[String]): Unit = {
    implicit val spark: SparkSession = SparkSession.builder.appName("Connect to HBase from Spark 2")
      .config("spark.master", "local")
      .getOrCreate()
    implicit val sc: SparkContext = spark.sparkContext
    val hbaseConf = HBaseConfiguration.create()
    val hbaseContext = new HBaseContext(sc, hbaseConf)
  }
}
// Compilation error
[error] missing or invalid dependency detected while loading class file 'HBaseContext.class'.
[error] Could not access type Logging in package org.apache.spark,
[error] because it (or its dependencies) are missing. Check your build definition for
[error] missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to see the problematic classpath.)
[error] A full rebuild may help if 'HBaseContext.class' was compiled against an incompatible version of org.apache.spark.
This works:
lazy val sparkVer = "2.4.0-cdh6.2.0"
lazy val hbaseVer = "2.1.0"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % sparkVer
, "org.apache.spark" %% "spark-sql" % sparkVer
, "org.apache.spark" %% "spark-streaming" % sparkVer
, "org.apache.hbase" % "hbase-common" % hbaseVer
, "org.apache.hbase" % "hbase-client" % hbaseVer
, "org.apache.hbase.connectors.spark" % "hbase-spark" % "1.0.0"
)
The essential pieces here are using Cloudera CDH 6 (not 5) and a different version of "hbase-spark", because CDH 5 cannot work with Spark 2.
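Once it compiles, a minimal write through HBaseContext could look like the sketch below. This is my own illustration rather than part of the original answer, and it assumes a table t with column family cf already exists:
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.spark.HBaseContext
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("HBaseBulkPutSketch").config("spark.master", "local").getOrCreate()
val sc = spark.sparkContext
val hbaseContext = new HBaseContext(sc, HBaseConfiguration.create())

// write a couple of illustrative rows with bulkPut
val rows = sc.parallelize(Seq(("row1", "value1"), ("row2", "value2")))
hbaseContext.bulkPut[(String, String)](
  rows,
  TableName.valueOf("t"),
  { case (rowKey, value) =>
    new Put(Bytes.toBytes(rowKey))
      .addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(value))
  }
)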

Error in Spark cassandra integration with spark-cassandra connector

I am trying to save data to Cassandra from Spark in standalone mode, by running the following command:
bin/spark-submit --packages datastax:spark-cassandra-connector:1.6.0-s_2.10
--class "pl.japila.spark.SparkMeApp" --master local /home/hduser2/code14/target/scala-2.10/simple-project_2.10-1.0.jar
My build.sbt file is:
name := "Simple Project"
version := "1.0"
scalaVersion := "2.10.4"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.0"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.6.0"
resolvers += "Spark Packages Repo" at "https://dl.bintray.com/spark-packages/maven"
libraryDependencies += "datastax" % "spark-cassandra-connector" % "1.6.0-s_2.10"
libraryDependencies ++= Seq(
"org.apache.cassandra" % "cassandra-thrift" % "3.5" ,
"org.apache.cassandra" % "cassandra-clientutil" % "3.5",
"com.datastax.cassandra" % "cassandra-driver-core" % "3.0.0"
)
My Spark code is:
package pl.japila.spark
import org.apache.spark.sql._
import com.datastax.spark.connector._
import com.datastax.driver.core._
import com.datastax.spark.connector.cql._
import org.apache.spark.{SparkContext, SparkConf}
import com.datastax.driver.core.QueryOptions._
import org.apache.spark.SparkConf
import com.datastax.driver.core._
import com.datastax.spark.connector.rdd._
object SparkMeApp {
  def main(args: Array[String]) {
    val conf = new SparkConf(true).set("spark.cassandra.connection.host", "127.0.0.1")
    val sc = new SparkContext("local", "test", conf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    val rdd = sc.cassandraTable("test", "kv")
    val collection = sc.parallelize(Seq(("cat", 30), ("fox", 40)))
    collection.saveToCassandra("test", "kv", SomeColumns("key", "value"))
  }
}
And I got this error:
Exception in thread "main" java.lang.NoSuchMethodError: com.datastax.driver.core.QueryOptions.setRefreshNodeIntervalMillis(I)Lcom/datastax/driver/core/QueryOptions;
at com.datastax.spark.connector.cql.DefaultConnectionFactory$.clusterBuilder(CassandraConnectionFactory.scala:49)
at com.datastax.spark.connector.cql.DefaultConnectionFactory$.createCluster(CassandraConnectionFactory.scala:92)
at com.datastax.spark.connector.cql.CassandraConnector$.com$datastax$spark$connector$cql$CassandraConnector$$createSession(CassandraConnector.scala:153)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$3.apply(CassandraConnector.scala:148)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$3.apply(CassandraConnector.scala:148)
at com.datastax.spark.connector.cql.RefCountedCache.createNewValueAndKeys(RefCountedCache.scala:31)
at com.datastax.spark.connector.cql.RefCountedCache.acquire(RefCountedCache.scala:56)
at com.datastax.spark.connector.cql.CassandraConnector.openSession(CassandraConnector.scala:81)
at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:109)
Versions used are :-
Spark - 1.6.0
Scala - 2.10.4
cassandra-driver-core jar - 3.0.0
cassandra version 2.2.7
spark-cassandra connector - 1.6.0-s_2.10
SOMEBODY PLEASE HELP !!
I would start by removing
libraryDependencies ++= Seq(
"org.apache.cassandra" % "cassandra-thrift" % "3.5" ,
"org.apache.cassandra" % "cassandra-clientutil" % "3.5",
"com.datastax.cassandra" % "cassandra-driver-core" % "3.0.0"
)
The libraries that the connector depends on will be included automatically by the --packages mechanism.
Then I would test the packages resolution by launching the spark-shell with
./bin/spark-shell --packages datastax:spark-cassandra-connector:1.6.0-s_2.10
You should see the following resolutions happening correctly:
datastax#spark-cassandra-connector added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
confs: [default]
found datastax#spark-cassandra-connector;1.6.0-s_2.10 in spark-packages
found org.apache.cassandra#cassandra-clientutil;3.0.2 in list
found com.datastax.cassandra#cassandra-driver-core;3.0.0 in list
...
[2.10.5] org.scala-lang#scala-reflect;2.10.5
:: resolution report :: resolve 627ms :: artifacts dl 10ms
:: modules in use:
com.datastax.cassandra#cassandra-driver-core;3.0.0 from list in [default]
com.google.guava#guava;16.0.1 from list in [default]
com.twitter#jsr166e;1.1.0 from list in [default]
datastax#spark-cassandra-connector;1.6.0-s_2.10 from spark-packages in [default]
...
If these appear to resolve correctly but everything still doesn't work, I would try clearing out the cache for these artifacts.
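If the resolution log looks right, a quick smoke test from that spark-shell session confirms the connector end to end (this assumes the keyspace test and table kv from the question already exist):
import com.datastax.spark.connector._

// read the table back through the connector; a result coming back means the
// driver, connector, and Cassandra versions are working together
sc.cassandraTable("test", "kv").collect().foreach(println)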

Spark NoSuchMethodError

I'm running spark using the following Docker command:
docker run -it \
-p 8088:8088 -p 8042:8042 -p 50070:50070 \
-v "$(PWD)"/log4j.properties:/usr/local/spark/conf/log4j.properties \
-v "$(PWD)":/app -h sandbox sequenceiq/spark:1.6.0 bash
Running spark-submit --version reports version 1.6.0
My spark-submit command is the following:
spark-submit --class io.jobi.GithubDay \
--master local[*] \
--name "Daily Github Push Counter" \
/app/min-spark_2.11-1.0.jar \
"file:///app/data/github-archive/*.json" \
"/app/data/ghEmployees.txt" \
"file:///app/data/emp-gh-push-output" "json"
build.sbt
name := """min-spark"""
version := "1.0"
scalaVersion := "2.11.7"
lazy val sparkVersion = "1.6.0"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % sparkVersion % "provided",
"org.apache.spark" %% "spark-sql" % sparkVersion % "provided"
)
// Change this to another test framework if you prefer
libraryDependencies += "org.scalatest" %% "scalatest" % "2.2.4" % "test"
GithubDay.scala
package io.jobi
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}
import scala.io.Source.fromFile
/**
* Created by hammer on 7/15/16.
*/
object GithubDay {
  def main(args: Array[String]): Unit = {
    println("Application arguments: ")
    args.foreach(println)
    val conf = new SparkConf()
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    try {
      println("args(0): " + args(0))
      val ghLog = sqlContext.read.json(args(0))
      val pushes = ghLog.filter("type = 'PushEvent'")
      val grouped = pushes.groupBy("actor.login").count()
      val ordered = grouped.orderBy(grouped("count").desc)
      val employees = Set() ++ (
        for {
          line <- fromFile(args(1)).getLines()
        } yield line.trim
      )
      val bcEmployees = sc.broadcast(employees)
      import sqlContext.implicits._
      println("register function")
      val isEmployee = sqlContext.udf.register("SetContainsUdf", (u: String) => bcEmployees.value.contains(u))
      println("registered udf")
      val filtered = ordered.filter(isEmployee($"login"))
      println("applied filter")
      filtered.write.format(args(3)).save(args(2))
    } finally {
      sc.stop()
    }
  }
}
I build using sbt clean package but the output when I run it is:
Application arguments:
file:///app/data/github-archive/*.json
/app/data/ghEmployees.txt
file:///app/data/emp-gh-push-output
json
args(0): file:///app/data/github-archive/*.json
imported implicits
defined isEmp
register function
Exception in thread "main" java.lang.NoSuchMethodError: scala.reflect.api.JavaUniverse.runtimeMirror(Ljava/lang/ClassLoader;)Lscala/reflect/api/JavaUniverse$JavaMirror;
at io.jobi.GithubDay$.main(GithubDay.scala:53)
at io.jobi.GithubDay.main(GithubDay.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
From what I've read NoSuchMethodError is a result of version incompatibilities, but I'm building with 1.6.0 and deploying to 1.6.0 so I don't understand what's happening.
Unless you've compiled Spark yourself, version 1.6.0 is compiled with Scala 2.10.x out of the box. This is stated in the docs (which say 1.6.2, but the same applies to 1.6.0):
Spark runs on Java 7+, Python 2.6+ and R 3.1+. For the Scala API,
Spark 1.6.2 uses Scala 2.10. You will need to use a compatible Scala
version (2.10.x).
You want:
scalaVersion := "2.10.6"
One hint is that the error occurs in a Scala reflection class: scala.reflect.api.JavaUniverse.runtimeMirror(Ljava/lang/ClassLoader;)
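Applied to the build.sbt above, that amounts to the following sketch, under the assumption that the stock sequenceiq/spark:1.6.0 image ships the standard Scala 2.10 build of Spark:
name := """min-spark"""
version := "1.0"
scalaVersion := "2.10.6"  // match the Scala 2.10.x build of Spark 1.6.0
lazy val sparkVersion = "1.6.0"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
  "org.apache.spark" %% "spark-sql"  % sparkVersion % "provided"
)
libraryDependencies += "org.scalatest" %% "scalatest" % "2.2.4" % "test"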

error: value succinct is not a member of org.apache.spark.rdd.RDD[String]

I am trying out SuccinctRDD as a search mechanism.
Below is what I am trying, as per the docs:
import edu.berkeley.cs.succinct.kv._
val data = sc.textFile("file:///home/aman/data/jsonDoc1.txt")
val succintdata = data.succinct.persist()
The link is here: succinct RDD
The error I am getting is below
<console>:32: error: value succinct is not a member of org.apache.spark.rdd.RDD[String]
val succintdata = data.succinct.persist()
Can anybody point out the problem here, or any step I should follow before this?
This is basically the sbt build:
name := "succinttest"
version := "1.0"
scalaVersion := "2.11.7"
resolvers += "Spark Packages Repo" at "http://dl.bintray.com/spark-packages/maven"
libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "1.5.2"
libraryDependencies += "org.apache.kafka" % "kafka_2.11" % "0.8.2.2"
libraryDependencies += "org.apache.spark" % "spark-streaming_2.11" % "1.5.2"
libraryDependencies += "org.apache.spark" % "spark-sql_2.11" % "1.5.2"
libraryDependencies += "amplab" % "succinct" % "0.1.7"
libraryDependencies += "org.apache.hadoop" % "hadoop-common" % "2.6.0" excludeAll ExclusionRule(organization = "javax.servlet")
This is a typical implicit conversion problem in Scala.
When you import the library:
import edu.berkeley.cs.succinct.kv._
Then you are importing all the classes/methods from this package, including all the implicits. So, if you check the package object in the source:
https://github.com/amplab/succinct/blob/master/spark/src/main/scala/edu/berkeley/cs/succinct/kv/package.scala
... then you will see that you have the following implicit conversion:
implicit class SuccinctContext(sc: SparkContext) {
def succinctKV[K: ClassTag](filePath: String, storageLevel: StorageLevel = StorageLevel.MEMORY_ONLY)
(implicit ordering: Ordering[K])
: SuccinctKVRDD[K] = SuccinctKVRDD[K](sc, filePath, storageLevel)
}
This means that you have a new method on SparkContext to create a SuccinctKVRDD from a text file. So try the following code:
import edu.berkeley.cs.succinct.kv._
val data = sc.succinctKV("file:///home/aman/data/jsonDoc1.txt")
You will then have a Succinct KV RDD on which to do all the operations you need, like search, filterByValue, etc.:
https://github.com/amplab/succinct/blob/master/spark/src/main/scala/edu/berkeley/cs/succinct/kv/SuccinctKVRDD.scala
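For illustration only, a usage sketch might look like the following; the method names come from the SuccinctKVRDD source linked above, while the key type and the query string are assumptions:
import edu.berkeley.cs.succinct.kv._

val kv = sc.succinctKV[Long]("file:///home/aman/data/jsonDoc1.txt")
// fetch the value stored for a given key
val record: Array[Byte] = kv.get(0L)
// search all values for a query string and get back the matching keys
val matchingKeys = kv.search("someQuery")
matchingKeys.take(10).foreach(println)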
