Can I use Hive in concert with the Spark Cassandra connector?
scala> import org.apache.spark.sql.hive.HiveContext
scala> val hiveCtx = new HiveContext(sc)
This produces:
ivysettings.xml file not found in HIVE_HOME or HIVE_CONF_DIR,
/etc/hive/conf.dist/ivysettings.xml will be used
and then
scala> val rows = hiveCtx.sql("SELECT first_name,last_name,house FROM test_gce.students WHERE student_id=1")
results in this error:
org.apache.spark.sql.AnalysisException: no such table test_gce.students; line 1 pos 48
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.getTable(Analyzer.scala:260)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$7.applyOrElse(Analyzer.scala:268)
...
Is it possible to create a HiveContext from the SparkContext and use it, as I am trying to do, together with the Spark Cassandra connector?
Here is how I called spark-shell:
spark-shell --jars ~/spark-cassandra-connector/spark-cassandra-connector-assembly-1.4.0-M1-SNAPSHOT.jar --conf spark.cassandra.connection.host=10.240.0.0
Also, I am able to successfully access Cassandra with the pure connector code rather than just using Hive:
scala> val cRDD = sc.cassandraTable("test_gce", "students")
scala> cRDD.select("first_name", "last_name", "house").where("student_id=?", 1).collect()
res0: Array[com.datastax.spark.connector.CassandraRow] =
Array(CassandraRow{first_name: Harry, last_name: Potter, house: Godric Gryffindor})
Related
Following this link, I tried to query a Cassandra table as a Spark DataFrame:
val spark = SparkSession
.builder()
.appName("CassandraSpark")
.config("spark.cassandra.connection.host", "127.0.0.1")
.config("spark.cassandra.connection.port", "9042")
.master("local[2]")
.getOrCreate();
The node I'm using is a SearchAnalytics node.
Using this Spark session, I tried a SQL query:
val ss = spark.sql("select * from killr_video.videos where solr_query = '{\"q\":\"video_id:1\"}'")
Search indexing is already enabled on that table.
After running the program, this is the error I get:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Table or view not found: `killr_video`.`videos`; line 1 pos 14;
'Project [*]
+- 'UnresolvedRelation killr_video.videos
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:82)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:78)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:78)
at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:52)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:66)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:623)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:691)
How can I get Cassandra data into Spark?
From this error message it looks like you're running your code with standalone Spark, not via DSE Analytics (that is, via dse spark-submit or dse spark).
In this case you need to register the tables. The DSE documentation describes how to do it for all tables, using dse client-tool and spark-sql:
dse client-tool --use-server-config spark sql-schema --all > output.sql
spark-sql --jars byos-5.1.jar -f output.sql
For my example, it looks like the following:
USE test;
CREATE TABLE t122
USING org.apache.spark.sql.cassandra
OPTIONS (
keyspace "test",
table "t122",
pushdown "true");
Here is an example of solr_query that works out of the box if I run it in the spark-shell started with dse spark:
scala> val ss = spark.sql("select * from test.t122 where solr_query='{\"q\":\"t:t2\"}'").show
+---+----------+---+
| id|solr_query|  t|
+---+----------+---+
|  2|      null| t2|
+---+----------+---+
To make your life easier, it's better to use DSE Analytics rather than bring-your-own Spark.
I tried to use SparkSession with Hive tables.
I used the following code:
val spark= SparkSession.builder().appName("spark").master("local").enableHiveSupport().getOrCreate()
spark.sql("select * from data").show()
It reports that the table is not found, but the table exists in Hive. Please help me with this.
Qualify the table with its database name: spark.sql("select * from databasename.data").show() will work.
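If it is not clear which database holds the table, the session catalog can be inspected first. The following is only a sketch, assuming Spark 2.x built with Hive support; databasename and data are the placeholder names used above, not real objects.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("spark")
  .master("local")
  .enableHiveSupport()
  .getOrCreate()

// Show the databases this session can see
spark.catalog.listDatabases().show(false)

// Show the tables inside the placeholder database
spark.catalog.listTables("databasename").show(false)

// Either qualify the table name in the query ...
spark.sql("select * from databasename.data").show()

// ... or switch the current database and use the bare name
spark.catalog.setCurrentDatabase("databasename")
spark.sql("select * from data").show()
```

If only the default database is listed, the session is probably not connected to the Hive metastore you expect, and the Hive configuration seen by Spark is worth checking.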
You have to provide the warehouse path, like this:
import java.io.File
import org.apache.spark.sql.SparkSession

// warehouseLocation points to the default location for managed databases and tables
val warehouseLocation = new File("spark-warehouse").getAbsolutePath
val spark = SparkSession
.builder()
.appName("Spark Hive Example")
.config("spark.sql.warehouse.dir", warehouseLocation)
.enableHiveSupport()
.getOrCreate()
For more information, see: Hive Tables with Spark
I want to read data from an HBase table using the get command, given that I already have the row key. I want to do this in my Spark Streaming application. Is there any source code someone can share?
You can use Spark's newAPIHadoopRDD to read an HBase table; it returns an RDD.
For example:
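import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.spark.{SparkConf, SparkContext}

val sparkConf = new SparkConf().setAppName("Hbase").setMaster("local")
val sc = new SparkContext(sparkConf)
// HBase connection settings; adjust them for your cluster
val conf = HBaseConfiguration.create()
val tableName = "table"
conf.set("hbase.master", "localhost:60000")
conf.set("hbase.zookeeper.quorum", "localhost:2181")
conf.set("zookeeper.znode.parent", "/hbase-unsecure")
conf.set(TableInputFormat.INPUT_TABLE, tableName)
// Each record is a (row key, Result) pair
val rdd = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat], classOf[ImmutableBytesWritable], classOf[Result])
println("Number of Records found : " + rdd.count())
sc.stop()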
Or you can use a Spark HBase connector such as the Hortonworks HBase connector:
https://github.com/hortonworks-spark/shc
You can also use the Spark-Phoenix API:
https://phoenix.apache.org/phoenix_spark.html
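Since the question asks for a get by row key rather than a full scan, a single-row lookup with the plain HBase client API may be closer to what is needed. This is a minimal sketch, assuming an HBase 1.x or later client on the classpath; the row key "row-key-1", the column family "cf" and the qualifier "col" are hypothetical names chosen for illustration, and the connection settings are copied from the example above.

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
import org.apache.hadoop.hbase.util.Bytes

val conf = HBaseConfiguration.create()
conf.set("hbase.zookeeper.quorum", "localhost:2181")
conf.set("zookeeper.znode.parent", "/hbase-unsecure")

val connection = ConnectionFactory.createConnection(conf)
try {
  val table = connection.getTable(TableName.valueOf("table"))
  try {
    // "row-key-1" is a hypothetical row key
    val result = table.get(new Get(Bytes.toBytes("row-key-1")))
    if (result.isEmpty) {
      println("Row not found")
    } else {
      // "cf" and "col" are hypothetical family and qualifier names;
      // getValue returns null when the cell is absent, hence the Option
      val cell = Option(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col")))
      println(cell.map(bytes => Bytes.toString(bytes)).getOrElse("Cell not found"))
    }
  } finally {
    table.close()
  }
} finally {
  connection.close()
}
```

In a streaming job, a lookup like this would typically run inside foreachRDD and foreachPartition so that each partition opens its own connection on the executor; that wiring is not shown here.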
val sql = "select time from table"
val data = sql(sql).map(_.getTimeStamp(0).toString)
In the Hive table, the type of time is timestamp. When I run this program, it throws a NullPointerException.
val data = sql(sql).map(_.get(0).toString)
When I change to the code above, the same exception is thrown.
Can anyone tell me how to get timestamp data from Hive using Spark?
Thanks.
If you are trying to read data from a Hive table, you should use HiveContext instead of SQLContext.
If you are using Spark 2.0, you can try the following.
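import java.io.File
import org.apache.spark.sql.SparkSession

// Default location for managed databases and tables
val warehouseLocation = new File("spark-warehouse").getAbsolutePath
val spark = SparkSession
.builder()
.appName("Spark Hive Example")
.config("spark.sql.warehouse.dir", warehouseLocation)
.enableHiveSupport()
.getOrCreate()
import spark.implicits._
import spark.sql
val df = sql("select time from table")
df.select($"time").show()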
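If the NullPointerException comes from NULL values in the time column (an assumption; the question does not confirm the cause), the rows can be guarded before the timestamp is converted. A minimal sketch for Spark 2.x, where my_table is a hypothetical name standing in for the Hive table:

```scala
import java.sql.Timestamp
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("Spark Hive Example")
  .enableHiveSupport()
  .getOrCreate()
import spark.implicits._

// my_table is a hypothetical table name
val df = spark.sql("select time from my_table")

// Option 1: drop NULL rows before converting to strings
val timeStrings = df
  .filter($"time".isNotNull)
  .map(row => row.getTimestamp(0).toString)
timeStrings.show(false)

// Option 2: keep every row and represent NULL as None on the driver
val times: Array[Option[Timestamp]] = df.collect().map { row =>
  if (row.isNullAt(0)) None else Some(row.getTimestamp(0))
}
times.foreach(t => println(t.map(_.toString).getOrElse("NULL")))
```

Note that the accessor on Row is getTimestamp, and that calling toString on the result of get(0) fails in the same way when the cell is NULL.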
In previous versions of Spark, such as 1.6.1, I created a Cassandra context from the Spark context:
import org.apache.spark.{ Logging, SparkContext, SparkConf }
import org.apache.spark.sql.cassandra.CassandraSQLContext
//config
val conf: org.apache.spark.SparkConf = new SparkConf(true)
.set("spark.cassandra.connection.host", CassandraHost)
.setAppName(getClass.getSimpleName)
lazy val sc = new SparkContext(conf)
val cassandraSqlCtx: org.apache.spark.sql.cassandra.CassandraSQLContext = new CassandraSQLContext(sc)
//Query using Cassandra context
cassandraSqlCtx.sql("select id from table ")
But in Spark 2.0, SparkContext is replaced by SparkSession. How can I use the Cassandra context?
Short Answer: You don't. It has been deprecated and removed.
Long Answer: You don't want to. The HiveContext provides everything except for the catalogue and supports a much wider range of SQL (roughly HQL). In Spark 2.0 this just means you will need to manually register Cassandra tables using createOrReplaceTempView until an ExternalCatalogue is implemented.
In SQL this looks like:
spark.sql("""CREATE TEMPORARY TABLE words
|USING org.apache.spark.sql.cassandra
|OPTIONS (
| table "words",
| keyspace "test")""".stripMargin)
In the raw DataFrame API it looks like:
spark
.read
.format("org.apache.spark.sql.cassandra")
.options(Map("keyspace" -> "test", "table" -> "words"))
.load
.createOrReplaceTempView("words")
Both of these commands will register the table "words" for SQL queries.
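Once registered, the view can be queried like any other table. A short end-to-end sketch, assuming the Spark Cassandra connector is on the classpath and a Cassandra node is reachable at 127.0.0.1 (the host and the application name are assumptions for illustration):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("CassandraTempView") // hypothetical application name
  .config("spark.cassandra.connection.host", "127.0.0.1") // assumed host
  .getOrCreate()

// Register the Cassandra table test.words as the temporary view "words"
spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "test", "table" -> "words"))
  .load()
  .createOrReplaceTempView("words")

// Query the view by its unqualified name
val result = spark.sql("SELECT * FROM words LIMIT 10")
result.show()
```

The view is referenced by the name it was registered under, not by keyspace.table, because a temporary view does not belong to a database.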