SQLContext in Spark2 not getting updated hive table records - apache-spark

I have a running application which queries hive table using HiveContext and it works fine if i run the application using spark-submit in spark1.6 . As part of upgrade we switched to spark2.1 and using spark2-submit. Since spark2 doesnt support HiveContext i m uing SQLContext instead. Issue i m facing is once i start the context any incremental changes in hive table is not visible in the hive query results. I am starting SparkContext with the enableHiveSupport() . IF i stop and restart the application i can see the rows. The application writing the data is doing MSCK REPAIR TABLE after writing so i am not sure what i am missing.
This is the code snippet
val spark= SparkSession.builder().enableHiveSupport().getOrCreate()
val sqlc=spark.sqlContext
sqlc.sql("select * from table1").show(false)
+---+----------+----+
| id| hire_dt|user|
+---+----------+----+
|1.0|2018-01-01|John|
|2.0|2018-12-01|Adam|
+---+----------+----+
Now in another session i added new row but if i ran the above code it still returns only 2 rows .
This works fine if i do a refresh table ie
val spark= SparkSession.builder().enableHiveSupport().getOrCreate()
val sqlc=spark.sqlContext
sqlc.sql("refresh table table1")
sqlc.sql("select * from table1").show(false)
My question is why should i do a refeshTable since i never did to do it in spark1.6 when i query using HiveContext and SQLContext is supposed to behave the same way as HiveContext

Try
sqlContext.refreshTable("my_table")
in spark 2.x spark.catalog.refreshTable("my_table")
in SQL Format spark.sql("refresh table my_table")

Related

How can we convert an external table to managed table in SPARK 2.2.0?

The below command was successfully converting external tables to managed tables in Spark 2.0.0:
ALTER TABLE {table_name} SET TBLPROPERTIES(EXTERNAL=FLASE);
However the above command is failing in Spark 2.2.0 with the below error:
Error in query: Cannot set or change the preserved property key:
'EXTERNAL';
As #AndyBrown pointed our in a comment you have the option of dropping to the console and invoking the Hive statement there. In Scala this worked for me:
import sys.process._
val exitCode = Seq("hive", "-e", "ALTER TABLE {table_name} SET TBLPROPERTIES(\"EXTERNAL\"=\"FALSE\")").!
I faced this problem using Spark 2.1.1 where #Joha's answer does not work because spark.sessionState is not accessible due to being declared lazy.
In Spark 2.2.0 you can do the following:
import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.catalyst.catalog.CatalogTable
import org.apache.spark.sql.catalyst.catalog.CatalogTableType
val identifier = TableIdentifier("table", Some("database"))
val oldTable = spark.sessionState.catalog.getTableMetadata(identifier)
val newTableType = CatalogTableType.MANAGED
val alteredTable = oldTable.copy(tableType = newTableType)
spark.sessionState.catalog.alterTable(alteredTable)
The issue is case-sensitivity on spark-2.1 and above.
Please try setting TBLPROPERTIES in lower case -
ALTER TABLE <TABLE NAME> SET TBLPROPERTIES('external'='false')
I had the same issue while using a hive external table. I solved the problem by directly setting the propery external to false in hive metastore using a hive metastore client
Table table = hiveMetaStoreClient.getTable("db", "table");
table.putToParameters("EXTERNAL","FALSE");
hiveMetaStoreClient.alter_table("db", "table", table,true);
I tried the above option from scala databricks notebook, and the
external table was converted to MANAGED table and the good part is
that the desc formatted option from spark on the new table is still
showing the location to be on my ADLS. This was one limitation that
spark was having, that we cannot specify the location for a managed
table.
As of now i am able to do a truncate table for this. hopefully there
was a more direct option for creating a managed table with location
specified from spark sql.

zeppelin not showing hive tables in CDH cluster

i am running zeppelin referring to CDH cluster. sql paragraph doesnot work. sample example showing bank file to be loaded and registered as temp table works.. but not the hive metastore tables.
How to make the default references to hive metastore?
%spark
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("use database_name")
val df = sqlContext.sql("select * from table_name")
df.registerTempTable("table_name")
%sql
show tables
select * from table_name

Can't access Spark 2.0 Temporary Table from beeline

With Spark 1.5.1, I've already been able to access spark-shell temporary tables from Beeline using Thrift Server. I've been able to do so by reading answers to related questions on Stackoverflow.
However, after upgrading to Spark 2.0, I can't see temporary tables from Beeline anymore, here are the steps I'm following.
I'm launching spark-shell using the following command:
./bin/spark-shell --master=myHost.local:7077 —conf spark.sql.hive.thriftServer.singleSession=true
Once the spark shell is ready I enter the following lines to launch thrift server and create a temporary view from a data frame taking its source in a json file
import org.apache.spark.sql.hive.thriftserver._
spark.sqlContext.setConf("hive.server2.thrift.port","10002")
HiveThriftServer2.startWithContext(spark.sqlContext)
val df = spark.read.json("examples/src/main/resources/people.json")
df.createOrReplaceTempView("people")
spark.sql("select * from people").show()
The last statement displays the table, it runs fine.
However when I start beeline and log to my thrift server instance, I can't see any temporary tables:
show tables;
+------------+--------------+--+
| tableName | isTemporary |
+------------+--------------+--+
+------------+--------------+--+
No rows selected (0,658 seconds)
Did I miss something regarding my spark upgrade from 1.5.1 to 2.0, how can I gain access to my temporary tables ?
This worked for me after upgrading to spark 2.0.1
val sparkConf =
new SparkConf()
.setAppName("Spark Thrift Server Demo")
.setMaster(sparkMaster)
.set("hive.metastore.warehouse.dir", hdfsDataUri + "/hive")
val spark = SparkSession
.builder()
.enableHiveSupport()
.config(sparkConf)
.getOrCreate()
val sqlContext = new org.apache.spark.sql.SQLContext(spark.sparkContext)
HiveThriftServer2.startWithContext(sqlContext)

Using hive database in spark

I am new in spark and trying to run some queries on tpcds benchmark tables, using HortonWorks Sandbox.
http://www.tpc.org/tpcds/
There is no problem while using hive through shell or hive-view on sandbox. The problem is that I don't know how connect to the database if I want to use the spark.
How can I use a hive database in spark for running the queries?
The only solution that I know till now is to rebuild each table manually and load data in them using the following scala codes, which is not the best solution.
scala> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS employee(id INT, name STRING, age INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'")
scala> sqlContext.sql("LOAD DATA LOCAL INPATH 'employee.txt' INTO TABLE employee")
scala> val result = sqlContext.sql("FROM employe SELECT id, name, age")
scala> result.show()
I also read some about hive-site.xml but I don't know where to find it and what changes to make on it to connect to the database.
There is no need to connect to a specific database when using Spark and HiveContext.
You simply need to copy the "hive-site.xml" file to the Spark conf folder (or you could also create a symlink).
cp $HIVE_HOME/conf/hive-site.xml $SPARK_HOME/conf/
Then, in Spark you can do something like that (I'm not a scala user so the syntax might be wrong) :
val hc = new org.apache.spark.sql.hive.HiveContext(sc)
val result = hc.sql("SELECT col1, col2, col3 FROM dbname.tablename")
result.show()

save Spark dataframe to Hive: table not readable because "parquet not a SequenceFile"

I'd like to save data in a Spark (v 1.3.0) dataframe to a Hive table using PySpark.
The documentation states:
"spark.sql.hive.convertMetastoreParquet: When set to false, Spark SQL will use the Hive SerDe for parquet tables instead of the built in support."
Looking at the Spark tutorial, is seems that this property can be set:
from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)
sqlContext.sql("SET spark.sql.hive.convertMetastoreParquet=false")
# code to create dataframe
my_dataframe.saveAsTable("my_dataframe")
However, when I try to query the saved table in Hive it returns:
hive> select * from my_dataframe;
OK
Failed with exception java.io.IOException:java.io.IOException:
hdfs://hadoop01.woolford.io:8020/user/hive/warehouse/my_dataframe/part-r-00001.parquet
not a SequenceFile
How do I save the table so that it's immediately readable in Hive?
I've been there...
The API is kinda misleading on this one.
DataFrame.saveAsTable does not create a Hive table, but an internal Spark table source.
It also stores something into Hive metastore, but not what you intend.
This remark was made by spark-user mailing list regarding Spark 1.3.
If you wish to create a Hive table from Spark, you can use this approach:
1. Use Create Table ... via SparkSQL for Hive metastore.
2. Use DataFrame.insertInto(tableName, overwriteMode) for the actual data (Spark 1.3)
I hit this issue last week and was able to find a workaround
Here's the story:
I can see the table in Hive if I created the table without partitionBy:
spark-shell>someDF.write.mode(SaveMode.Overwrite)
.format("parquet")
.saveAsTable("TBL_HIVE_IS_HAPPY")
hive> desc TBL_HIVE_IS_HAPPY;
OK
user_id string
email string
ts string
But Hive can't understand the table schema(schema is empty...) if I do this:
spark-shell>someDF.write.mode(SaveMode.Overwrite)
.format("parquet")
.saveAsTable("TBL_HIVE_IS_NOT_HAPPY")
hive> desc TBL_HIVE_IS_NOT_HAPPY;
# col_name data_type from_deserializer
[Solution]:
spark-shell>sqlContext.sql("SET spark.sql.hive.convertMetastoreParquet=false")
spark-shell>df.write
.partitionBy("ts")
.mode(SaveMode.Overwrite)
.saveAsTable("Happy_HIVE")//Suppose this table is saved at /apps/hive/warehouse/Happy_HIVE
hive> DROP TABLE IF EXISTS Happy_HIVE;
hive> CREATE EXTERNAL TABLE Happy_HIVE (user_id string,email string,ts string)
PARTITIONED BY(day STRING)
STORED AS PARQUET
LOCATION '/apps/hive/warehouse/Happy_HIVE';
hive> MSCK REPAIR TABLE Happy_HIVE;
The problem is that the datasource table created through Dataframe API(partitionBy+saveAsTable) is not compatible with Hive.(see this link). By setting spark.sql.hive.convertMetastoreParquet to false as suggested in the doc, Spark only puts data onto HDFS,but won't create table on Hive. And then you can manually go into hive shell to create an external table with proper schema&partition definition pointing to the data location.
I've tested this in Spark 1.6.1 and it worked for me. I hope this helps!
I have done in pyspark, spark version 2.3.0 :
create empty table where we need to save/overwrite data like:
create table databaseName.NewTableName like databaseName.OldTableName;
then run below command:
df1.write.mode("overwrite").partitionBy("year","month","day").format("parquet").saveAsTable("databaseName.NewTableName");
The issue is you can't read this table with hive but you can read with spark.
metadata doesn't already exist. In other words, it will add any partitions that exist on HDFS but not in metastore, to the hive metastore.

Resources