Can't access Spark 2.0 Temporary Table from beeline - apache-spark

With Spark 1.5.1, I was already able to access spark-shell temporary tables from Beeline via the Thrift Server, following answers to related questions on Stack Overflow.
However, after upgrading to Spark 2.0, I can no longer see temporary tables from Beeline. Here are the steps I'm following.
I'm launching spark-shell using the following command:
./bin/spark-shell --master=myHost.local:7077 --conf spark.sql.hive.thriftServer.singleSession=true
Once the spark shell is ready, I enter the following lines to launch the Thrift Server and create a temporary view from a DataFrame backed by a JSON file:
import org.apache.spark.sql.hive.thriftserver._
spark.sqlContext.setConf("hive.server2.thrift.port","10002")
HiveThriftServer2.startWithContext(spark.sqlContext)
val df = spark.read.json("examples/src/main/resources/people.json")
df.createOrReplaceTempView("people")
spark.sql("select * from people").show()
The last statement displays the table and runs fine.
However, when I start Beeline and log in to my Thrift Server instance, I can't see any temporary tables:
show tables;
+------------+--------------+--+
| tableName | isTemporary |
+------------+--------------+--+
+------------+--------------+--+
No rows selected (0,658 seconds)
Did I miss something in my Spark upgrade from 1.5.1 to 2.0? How can I regain access to my temporary tables?

This worked for me after upgrading to Spark 2.0.1:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.hive.thriftserver._

val sparkConf =
  new SparkConf()
    .setAppName("Spark Thrift Server Demo")
    .setMaster(sparkMaster)
    .set("hive.metastore.warehouse.dir", hdfsDataUri + "/hive")

// Build a Hive-enabled session, then hand its context to the Thrift Server
val spark = SparkSession
  .builder()
  .enableHiveSupport()
  .config(sparkConf)
  .getOrCreate()

val sqlContext = new org.apache.spark.sql.SQLContext(spark.sparkContext)
HiveThriftServer2.startWithContext(sqlContext)
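For completeness, a minimal sketch of registering a temp view against that same sqlContext and checking it from Beeline (the JSON path is the one from the question; use whatever port you set via hive.server2.thrift.port):

// Register a temp view on the SQLContext that backs the Thrift Server
val people = sqlContext.read.json("examples/src/main/resources/people.json")
people.createOrReplaceTempView("people")
// From another terminal, connect with something like:
//   ./bin/beeline -u jdbc:hive2://localhost:10002
// show tables; should now list people with isTemporary = true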

Related

SQLContext in Spark2 not getting updated hive table records

I have a running application which queries a Hive table using HiveContext, and it works fine when I run it with spark-submit on Spark 1.6. As part of an upgrade we switched to Spark 2.1 and spark2-submit. Since Spark 2 doesn't support HiveContext, I am using SQLContext instead. The issue I'm facing is that once I start the context, any incremental changes in the Hive table are not visible in the query results. I am starting the SparkSession with enableHiveSupport(). If I stop and restart the application, I can see the new rows. The application writing the data does an MSCK REPAIR TABLE after writing, so I am not sure what I am missing.
This is the code snippet:
val spark= SparkSession.builder().enableHiveSupport().getOrCreate()
val sqlc=spark.sqlContext
sqlc.sql("select * from table1").show(false)
+---+----------+----+
| id| hire_dt|user|
+---+----------+----+
|1.0|2018-01-01|John|
|2.0|2018-12-01|Adam|
+---+----------+----+
Now, in another session, I added a new row, but if I run the above code it still returns only 2 rows.
This works fine if I do a refresh table first, i.e.:
val spark= SparkSession.builder().enableHiveSupport().getOrCreate()
val sqlc=spark.sqlContext
sqlc.sql("refresh table table1")
sqlc.sql("select * from table1").show(false)
My question is: why should I have to call refreshTable at all, since I never had to do it in Spark 1.6 when querying with HiveContext, and SQLContext is supposed to behave the same way as HiveContext?
Try
sqlContext.refreshTable("my_table")
In Spark 2.x, via the catalog API:
spark.catalog.refreshTable("my_table")
Or in SQL form:
spark.sql("refresh table my_table")
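Put together with the question's snippet, a minimal sketch (assuming the table is named table1, as above):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

// Invalidate Spark's cached metadata for the table before re-reading it
spark.catalog.refreshTable("table1")         // Spark 2.x catalog API
// spark.sqlContext.refreshTable("table1")   // equivalent via SQLContext
// spark.sql("refresh table table1")         // equivalent in SQL form

spark.sql("select * from table1").show(false)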

SPARK_HOME read from IntelliJ IDEA

I know it might be a trivial question, but I've been stuck on it for a while :) Basically, I am querying Hive tables from Spark. All connection settings for Hive are in the hive-site.xml file, which is in the SPARK_HOME/conf directory on my Windows PC. It works fine through spark-shell. However, when I run the same code from IntelliJ IDEA, it queries Hive locally instead of connecting to the remote metastore defined in hive-site.xml.
package main.scala

import org.apache.spark.sql.SparkSession

object Main {
  def main(args: Array[String]) {
    val spark = SparkSession
      .builder()
      .appName("Hive Tables")
      .master("local")
      .enableHiveSupport()
      .getOrCreate()

    import spark.sql
    sql("show tables").show()
  }
}
I have added environment variables for HADOOP_HOME and SPARK_HOME. As I mentioned, it works fine through spark-shell, but it seems hive-site.xml is not being read in IntelliJ IDEA.
Do you have any idea what I might be doing wrong?
Thanks
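No accepted fix is shown here, but as a sketch of two commonly suggested workarounds: add SPARK_HOME/conf (or a copy of hive-site.xml under src/main/resources) to the module's classpath in IntelliJ, or pass the metastore URI to the builder explicitly; the thrift:// URI below is a placeholder.

import org.apache.spark.sql.SparkSession

object Main {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("Hive Tables")
      .master("local")
      // Placeholder URI: only needed if hive-site.xml is not on the classpath
      .config("hive.metastore.uris", "thrift://your-metastore-host:9083")
      .enableHiveSupport()
      .getOrCreate()

    spark.sql("show tables").show()
  }
}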

zeppelin not showing hive tables in CDH cluster

I am running Zeppelin against a CDH cluster. The %sql paragraph does not work: the sample example, which loads the bank file and registers it as a temp table, works, but the Hive metastore tables do not show up.
How do I make Zeppelin point at the Hive metastore by default?
%spark
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("use database_name")
val df = sqlContext.sql("select * from table_name")
df.registerTempTable("table_name")
%sql
show tables
select * from table_name
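Not part of the question, but a common first check (a sketch, assuming the cluster's hive-site.xml has been copied into Zeppelin's conf directory so the interpreter talks to the CDH metastore rather than a local Derby one):

%spark
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
// If the metastore is picked up, this lists the CDH databases and tables,
// not just the temp tables registered in the notebook
hiveContext.sql("show databases").show()
hiveContext.sql("use database_name")
hiveContext.sql("show tables").show()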

Using hive database in spark

I am new to Spark and am trying to run some queries on the TPC-DS benchmark tables, using the Hortonworks Sandbox.
http://www.tpc.org/tpcds/
There is no problem using Hive through the shell or the Hive view on the sandbox. The problem is that I don't know how to connect to the database if I want to use Spark.
How can I use a Hive database in Spark to run the queries?
The only solution I know of so far is to rebuild each table manually and load the data into it with the following Scala code, which is not the best solution.
scala> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS employee(id INT, name STRING, age INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'")
scala> sqlContext.sql("LOAD DATA LOCAL INPATH 'employee.txt' INTO TABLE employee")
scala> val result = sqlContext.sql("FROM employee SELECT id, name, age")
scala> result.show()
I also read a bit about hive-site.xml, but I don't know where to find it or what changes to make to it in order to connect to the database.
There is no need to connect to a specific database when using Spark and HiveContext.
You simply need to copy the "hive-site.xml" file to the Spark conf folder (or you could also create a symlink).
cp $HIVE_HOME/conf/hive-site.xml $SPARK_HOME/conf/
Then, in Spark you can do something like this (I'm not a Scala user, so the syntax might be wrong):
val hc = new org.apache.spark.sql.hive.HiveContext(sc)
val result = hc.sql("SELECT col1, col2, col3 FROM dbname.tablename")
result.show()
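Once hive-site.xml is in place you can also switch the current database instead of qualifying every table name; dbname and tablename below are placeholders for the TPC-DS schema:

val hc = new org.apache.spark.sql.hive.HiveContext(sc)
hc.sql("USE dbname")
hc.sql("show tables").show()
val result = hc.sql("SELECT col1, col2 FROM tablename LIMIT 10")
result.show()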

Cannot create Spark Phoenix DataFrames

I am trying to load data from Apache Phoenix into a Spark DataFrame.
I have been able to successfully create an RDD with the following code:
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.phoenix.spark._  // provides sc.phoenixTableAsRDD

val sc = new SparkContext("local", "phoenix-test")
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

val foo: RDD[Map[String, AnyRef]] = sc.phoenixTableAsRDD(
  table = "FOO",
  columns = Seq("ID", "MESSAGE_EPOCH", "MESSAGE_VALUE"),
  zkUrl = Some("<zk-ip-address>:2181:/hbase-unsecure"))

foo.collect().foreach(x => println(x))
However, I have not been so lucky trying to create a DataFrame. My current attempt is:
val sc = new SparkContext("local", "phoenix-test")
val sqlContext = new SQLContext(sc)

val df = sqlContext.phoenixTableAsDataFrame(
  table = "FOO",
  columns = Seq("ID", "MESSAGE_EPOCH", "MESSAGE_VALUE"),
  zkUrl = Some("<zk-ip-address>:2181:/hbase-unsecure"))

df.select(df("ID")).show
Unfortunately the above code results in a ClassCastException:
java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericMutableRow cannot be cast to org.apache.spark.sql.Row
I am still very new to Spark. If anyone can help, it would be very much appreciated!
You haven't mentioned your Spark version or the details of the exception, but please see PHOENIX-2287, which is resolved and says:
Environment: HBase 1.1.1 running in standalone mode on OS X, Spark 1.5.0, Phoenix 4.5.2

Josh Mahonin added a comment - 23/Sep/15 17:56
Updated patch adds support for Spark 1.5.0, and is backwards compatible back down to 1.3.0 (manually tested; Spark version profiles may be worth looking at in the future). In 1.5.0, they've gone and explicitly hidden the GenericMutableRow data structure. Fortunately, we are able to use the external-facing 'Row' data type, which is backwards compatible, and should remain compatible in future releases as well. As part of the update, Spark SQL deprecated a constructor on their 'DecimalType'. In updating this, I exposed a new issue, which is that we don't carry forward the precision and scale of the underlying Decimal type through to Spark. For now I've set it to use the Spark defaults, but I'll create another issue for that specifically. I've included an ignored integration test in this patch as well.
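If you end up on a Phoenix release that includes that fix, the same load can also be written against Spark's generic data source API, which the phoenix-spark module supports as well (a sketch, with the same placeholder zkUrl as in the question):

import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

val sc = new SparkContext("local", "phoenix-test")
val sqlContext = new SQLContext(sc)

// Load the Phoenix table FOO through the data source API
val df = sqlContext.read
  .format("org.apache.phoenix.spark")
  .option("table", "FOO")
  .option("zkUrl", "<zk-ip-address>:2181:/hbase-unsecure")
  .load()

df.select("ID").show()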
