Spark SQL can't find a Hive table in HDP - apache-spark

I use HDP 3.1 and have added Spark2, Hive, and the other required services. I turned off the ACID feature in Hive. The Spark job can't find a table in Hive, even though the table exists there. The exception looks like:
org.apache.spark.sql.AnalysisException: Table or view not found
There is a hive-site.xml in Spark's conf folder. It is created automatically by HDP, but it is not the same as the file in Hive's conf folder. From the log, Spark gets Hive's thrift URI correctly.
I used Spark SQL to create a Hive table in spark-shell. The table was created in the folder specified by spark.sql.warehouse.dir. I changed that value to match hive.metastore.warehouse.dir, but the problem is still there.
I also enabled Hive support when creating the Spark session:
val ss = SparkSession.builder().appName("统计").enableHiveSupport().getOrCreate()

The hive-site.xml in Spark's conf folder contains the property metastore.catalog.default. Its value is spark; it should be changed to hive. Also, the ACID feature of Hive should be disabled.
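To see how the two hive-site.xml files differ, including the metastore.catalog.default property, the properties can be compared programmatically. This is a minimal sketch using only the Python standard library. The helper names (read_properties, diff_properties) and the sample XML strings are made up for illustration; on a real cluster you would pass in the text of the two actual files.

```python
import xml.etree.ElementTree as ET


def read_properties(xml_text):
    """Return {name: value} for every <property> in a Hadoop-style config."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def diff_properties(spark_xml, hive_xml):
    """Return {name: (spark_value, hive_value)} for properties that differ."""
    spark_props = read_properties(spark_xml)
    hive_props = read_properties(hive_xml)
    names = sorted(set(spark_props) | set(hive_props))
    return {
        n: (spark_props.get(n), hive_props.get(n))
        for n in names
        if spark_props.get(n) != hive_props.get(n)
    }


# Made-up sample contents standing in for the two real files.
SPARK_SAMPLE = """<configuration>
  <property><name>metastore.catalog.default</name><value>spark</value></property>
  <property><name>hive.metastore.uris</name><value>thrift://host:9083</value></property>
</configuration>"""

HIVE_SAMPLE = """<configuration>
  <property><name>hive.metastore.uris</name><value>thrift://host:9083</value></property>
</configuration>"""

for name, (spark_value, hive_value) in diff_properties(SPARK_SAMPLE, HIVE_SAMPLE).items():
    print(f"{name}: spark={spark_value!r} hive={hive_value!r}")
```

A value of None means the property is missing from that file.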

You can use the Hive Warehouse Connector and enable LLAP in the Hive configuration.

In HDP 3.0 and later, Spark and Hive use independent catalogs for accessing Spark SQL or Hive tables on the same or different platforms.
By default, Spark reads only the Spark catalog. This means Spark applications that attempt to read from or write to tables created with the Hive CLI fail with a table-not-found exception.
Workarounds:
Create the table in both the Hive CLI and Spark SQL
Use the Hive Warehouse Connector

Related

How can I show Hive tables using pyspark

Hello, I created a Spark HDInsight cluster on Azure and I'm trying to read Hive tables with pyspark, but the problem is that it shows me only the default database.
Does anyone have an idea?
If you are using HDInsight 4.0, Spark and Hive no longer share metadata.
By default you will not see Hive tables from pyspark. This is a problem I described in this post: How save/update table in hive, to be readbale on spark.
Still, here are some things you can try:
If you only want to test on the head node, you can edit hive-site.xml: set the property "metastore.catalog.default" to hive, then open pyspark from the command line.
If you want to apply the change to all cluster nodes, it needs to be made in Ambari:
Log in to Ambari as admin
Go to spark2 > Configs > hive-site-override
Again, set the property "metastore.catalog.default" to hive
Restart all required services from the Ambari panel
These changes make the Hive metastore catalog the default.
You can now see Hive databases and tables, but depending on the table structure, the table data may not display properly.
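For the head-node test described above, the hive-site.xml edit could be scripted instead of done by hand. The following is only a sketch using the Python standard library: set_property is a hypothetical helper, and SAMPLE is a made-up stand-in for the real file contents. On an Ambari-managed cluster a hand-edited file may be regenerated, which is why the cluster-wide change goes through Ambari.

```python
import xml.etree.ElementTree as ET


def set_property(xml_text, name, value):
    """Return config XML with property `name` set to `value` (added if absent)."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if (prop.findtext("name") or "").strip() == name:
            value_el = prop.find("value")
            if value_el is None:
                value_el = ET.SubElement(prop, "value")
            value_el.text = value
            break
    else:
        # No matching property was found, so append a new one.
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Made-up sample standing in for Spark's hive-site.xml.
SAMPLE = (
    "<configuration><property><name>metastore.catalog.default</name>"
    "<value>spark</value></property></configuration>"
)

updated = set_property(SAMPLE, "metastore.catalog.default", "hive")
print(updated)
```

Keep a backup of the original file before writing the result back.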
If you have created tables in other databases, try show tables from database_name, replacing database_name with the actual name.
You are missing the Hive server details in the SparkSession. If you haven't added any, Spark creates and uses a default database to run Spark SQL.
If you've already set spark.sql.warehouse.dir and spark.hadoop.hive.metastore.uris in the Spark default conf file, then add enableHiveSupport() when creating the SparkSession.
Otherwise, add the configuration details when creating the SparkSession:
.config("spark.sql.warehouse.dir","/user/hive/warehouse")
.config("hive.metastore.uris","thrift://localhost:9083")
.enableHiveSupport()
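Put together, the builder calls above might look like this in pyspark. This is a sketch, not a verified configuration: the function names and the app name in the usage comment are placeholders, and the warehouse path and thrift URI are the example values from the answer, which you would replace with your cluster's settings.

```python
def hive_session_options(warehouse_dir, metastore_uri):
    """Spark options that point a session at an external Hive metastore."""
    return {
        "spark.sql.warehouse.dir": warehouse_dir,
        "hive.metastore.uris": metastore_uri,
    }


def build_hive_session(app_name, options):
    """Create a Hive-enabled SparkSession; requires pyspark to be installed."""
    # Imported lazily so hive_session_options has no pyspark dependency.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in options.items():
        builder = builder.config(key, value)
    return builder.enableHiveSupport().getOrCreate()


# Usage on a machine with pyspark and a reachable metastore:
#   opts = hive_session_options("/user/hive/warehouse", "thrift://localhost:9083")
#   spark = build_hive_session("list-hive-tables", opts)
#   spark.sql("show databases").show()
```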

pyspark DataFrame saved to a Hive table cannot be found

We installed our cluster with CDH 6.2.
I use pyspark to create a DataFrame and then save it to Hive.
The file is created in the warehouse correctly, but the table cannot be found in Hive or Impala using show tables.
It can be found from Spark SQL using sql('show tables'), but that only shows tables created earlier by the Spark code, which means Spark cannot see tables created from the Hive or Impala console.
So I think the Spark code may not be connecting to the Hive Metastore server.
But I don't know how to configure it to use the Hive Metastore server.
To connect to the Hive metastore, you need to copy the hive-site.xml file into the spark/conf directory. Try the following:
ln -s /usr/lib/hive/conf/hive-site.xml /usr/lib/spark/conf/hive-site.xml
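The same link can be created from a script that first checks the source file exists and does not overwrite an existing target. This is a sketch with a hypothetical helper, link_hive_site. It is demonstrated on temporary directories, because the real conf paths vary by installation.

```python
import tempfile
from pathlib import Path


def link_hive_site(hive_conf_dir, spark_conf_dir):
    """Symlink hive-site.xml from the Hive conf dir into the Spark conf dir."""
    source = Path(hive_conf_dir) / "hive-site.xml"
    target = Path(spark_conf_dir) / "hive-site.xml"
    if not source.is_file():
        raise FileNotFoundError(f"no hive-site.xml in {hive_conf_dir}")
    if target.is_symlink() or target.exists():
        # Leave an existing file or link alone rather than overwrite it.
        return target
    target.symlink_to(source)
    return target


# Demonstration on throwaway directories instead of real conf folders.
with tempfile.TemporaryDirectory() as tmp:
    hive_conf = Path(tmp, "hive", "conf")
    spark_conf = Path(tmp, "spark", "conf")
    hive_conf.mkdir(parents=True)
    spark_conf.mkdir(parents=True)
    (hive_conf / "hive-site.xml").write_text("<configuration/>")
    link = link_hive_site(hive_conf, spark_conf)
    linked_ok = link.is_symlink()
    linked_text = link.read_text()
    print(linked_ok, linked_text)
```

Spark reads the file at startup, so restart the pyspark session after creating the link.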

Table loaded through Spark not accessible in Hive

Hive tables created through Spark (pyspark) are not accessible from Hive.
df.write.format("orc").mode("overwrite").saveAsTable("db.table")
Error while accessing from Hive:
Error: java.io.IOException: java.lang.IllegalArgumentException: bucketId out of range: -1 (state=,code=0)
The table is created successfully in Hive, and I can read it back in Spark. The table metadata is accessible in Hive, and the data file is in the table's HDFS directory.
TBLPROPERTIES of the Hive table are:
'bucketing_version'='2',
'spark.sql.create.version'='2.3.1.3.0.0.0-1634',
'spark.sql.sources.provider'='orc',
'spark.sql.sources.schema.numParts'='1',
I also tried creating the table with other workarounds, but got an error while creating the table:
df.write.mode("overwrite").saveAsTable("db.table")
OR
df.createOrReplaceTempView("dfTable")
spark.sql("CREATE TABLE db.table AS SELECT * FROM dfTable")
Error :
AnalysisException: u'org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Table default.src failed strict managed table checks due to the following reason: Table is marked as a managed table but is not transactional.);'
Stack version details:
Spark2.3
Hive3.1
Hortonworks Data Platform HDP3.0
From HDP 3.0, the catalogs for Apache Hive and Apache Spark are separated, and each uses its own catalog. They are mutually exclusive: the Apache Hive catalog can only be accessed by Apache Hive or by this library (the Hive Warehouse Connector), and the Apache Spark catalog can only be accessed by the existing APIs in Apache Spark. In other words, some features, such as ACID tables or Apache Ranger with Apache Hive tables, are only available in Apache Spark via this library. Those Hive tables should not be directly accessible through the Apache Spark APIs themselves.
The article below explains the steps:
Integrating Apache Hive with Apache Spark - Hive Warehouse Connector
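As a rough illustration of what using the connector from pyspark can look like, here is a sketch. It assumes the Hive Warehouse Connector jar and its pyspark_llap package have been passed to the Spark job, and that the HWC settings (such as the HiveServer2 JDBC URL) are already configured. The functions read_from_hive and write_to_hive are hypothetical wrappers. The data source string is my understanding of the connector's format name, so check it against the documentation for your HDP version.

```python
# Data source name of the Hive Warehouse Connector (assumed; verify for your version).
HWC_FORMAT = "com.hortonworks.spark.sql.hive.llap.HiveWarehouseConnector"


def read_from_hive(spark, query):
    """Run `query` through the connector and return the result as a DataFrame."""
    # pyspark_llap ships with the HWC package, not with stock pyspark.
    from pyspark_llap import HiveWarehouseSession

    hive = HiveWarehouseSession.session(spark).build()
    return hive.executeQuery(query)


def write_to_hive(df, table, mode="append"):
    """Write a DataFrame to a Hive managed table through the connector."""
    df.write.format(HWC_FORMAT).mode(mode).option("table", table).save()


# Usage inside a Spark job with HWC available:
#   rows = read_from_hive(spark, "SELECT * FROM db.table")
#   write_to_hive(rows, "db.table_copy")
```

Writing through the connector, rather than saveAsTable, lets Hive itself create and manage the table.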
I faced the same issue. After setting the following properties, it works fine.
set hive.mapred.mode=nonstrict;
set hive.optimize.ppd=true;
set hive.optimize.index.filter=true;
set hive.tez.bucket.pruning=true;
set hive.explain.user=false;
set hive.fetch.task.conversion=none;
set hive.support.concurrency=true;
set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

Snappydata store with hive metastore from existing spark installation

I am using snappydata-1.0.1 on HDP 2.6.2 with Spark 2.1.1, and I was able to connect from an external Spark application. But when I enable Hive support by adding hive-site.xml to the Spark conf, the SnappySession lists the tables from the Hive metastore instead of the Snappy store.
SparkConf sparkConf = new SparkConf().setAppName("TEST APP");
JavaSparkContext javaSparkContxt = new JavaSparkContext(sparkConf);
SparkSession sps = new SparkSession.Builder().enableHiveSupport().getOrCreate();
SnappySession snc = new SnappySession(new SparkSession(javaSparkContxt.sc()).sparkContext());
snc.sqlContext().sql("show tables").show();
The above code lists the tables in the Snappy store when hive-site.xml is not in the Spark conf. If hive-site.xml is added, it lists the tables from the Hive metastore.
Is it not possible to use the Hive metastore and the SnappyData metastore in the same application?
Can I read a Hive table into one DataFrame and a SnappyData table into another DataFrame in the same application?
Thanks in advance.
So, the Hive metastore itself isn't the problem. You can use Hive tables and Snappy tables in the same application, for example to copy a Hive table into Snappy's in-memory store.
But we will need to test the use of an external Hive metastore configured in hive-site.xml. This is perhaps a bug.
You should try using the Snappy smart connector, i.e. run your Spark application with the Spark distribution in HDP and connect to the SnappyData cluster using the connector (see docs). Here it looks like you are trying to run your Spark app with the SnappyData distribution.

Spark SQL: how does it tell hive to run query on spark?

As rightly pointed out here:
Spark SQL query execution on Hive
Spark SQL, when run through HiveContext, makes the SQL query use the Spark engine.
How does Spark SQL, by setting hive.execution.engine=spark, tell Hive to do so?
Note that this works automatically; we do not have to specify it in the hive-site.xml in Spark's conf directory.
There are two independent projects here:
Hive on Spark - a Hive project that integrates Spark as an additional execution engine.
Spark SQL - a Spark module that makes use of the Hive code.
HiveContext belongs to the second, and hive.execution.engine is a property of the first.
