Table loaded through Spark not accessible in Hive - apache-spark

Hive table created through Spark (pyspark) are not accessible from Hive.
df.write.format("orc").mode("overwrite").saveAsTable("db.table")
Error while accessing from Hive:
Error: java.io.IOException: java.lang.IllegalArgumentException: bucketId out of range: -1 (state=,code=0)
Table getting created successfully in Hive and able to read this table back in spark. Table metadata is accessible (in Hive) and data file in table (in hdfs) directory.
TBLPROPERTIES of Hive table are :
'bucketing_version'='2',
'spark.sql.create.version'='2.3.1.3.0.0.0-1634',
'spark.sql.sources.provider'='orc',
'spark.sql.sources.schema.numParts'='1',
I also tried creating table with other workarounds but getting error while creating table:
df.write.mode("overwrite").saveAsTable("db.table")
OR
df.createOrReplaceTempView("dfTable")
spark.sql("CREATE TABLE db.table AS SELECT * FROM dfTable")
Error :
AnalysisException: u'org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Table default.src failed strict managed table checks due to the following reason: Table is marked as a managed table but is not transactional.);'
Stack version details:
Spark2.3
Hive3.1
Hortonworks Data Platform HDP3.0

From HDP 3.0, catalogs for Apache Hive and Apache Spark are separated, and they use their own catalog; namely, they are mutually exclusive - Apache Hive catalog can only be accessed by Apache Hive or this library, and Apache Spark catalog can only be accessed by existing APIs in Apache Spark . In other words, some features such as ACID tables or Apache Ranger with Apache Hive table are only available via this library in Apache Spark. Those tables in Hive should not directly be accessible within Apache Spark APIs themselves.
Below article explain the steps:
Integrating Apache Hive with Apache Spark - Hive Warehouse Connector

I faced the same issue after setting the following properties, it is working fine.
set hive.mapred.mode=nonstrict;
set hive.optimize.ppd=true;
set hive.optimize.index.filter=true;
set hive.tez.bucket.pruning=true;
set hive.explain.user=false;
set hive.fetch.task.conversion=none;
set hive.support.concurrency=true;
set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

Related

On HDinsight 4.0, how can i save/update table on hive and this table be readable for oozie/spark/jupyter

Today we have this scenario:
Cluster Azure HDInsight 4.0
Running Workflow on Oozie
On this version, spark and hive do not share metadata anymore
We came from HDInsight 3.6, to work with this change, now we use Hive Warehouse Connector
Before: spark.write.saveAsTable("tableName", mode="overwrite")
Now: df.write.mode("overwrite").format(HiveWarehouseSession().HIVE_WAREHOUSE_CONNECTOR).option('table', "tableName").save()
The problem is on this point, using HWC make possible to save tables on hive.
But, hive databases/tables are not visible for Spark, Oozie and Jupyter, they see only tables on spark scope.
So, this is a major problem for us, because is not possible get data on managed tables from hive, and use them on oozie workflow.
To be possible save table on hive, and be visible on all cluster i made this configurations on Ambari:
hive > hive.strict.managed.tables = false
spark2 > metastore.catalog.default = hive
And now is possible to save table on hive, on the "old" way spark.write.saveAsTable.
But there is a problem when table is update/overwrite:
pyspark.sql.utils.AnalysisException: u'org.apache.hadoop.hive.ql.metadata.HiveException:
MetaException(message:java.security.AccessControlException: Permission denied: user=hive,
path="wasbs://containerName#storageName.blob.core.windows.net/hive/warehouse/managed/table"
:user:supergroup:drwxr-xr-x);'
So, i have two questions:
Is this the correct way to save table on hive, to be visible on all cluster?
How can i avoid this permission error on table overwrite? Keep in mind, this error occours when we execute Oozie Workflow
Thanks!

How can i show hive table using pyspark

Hello i created a spark HD insight cluster on azure and i’m trying to read hive tables with pyspark but the proble that its show me only default database
Anyone have an idea ?
If you are using HDInsight 4.0, Spark and Hive not share metadata anymore.
For default you will not see hive tables from pyspark, is a problem that i share on this post: How save/update table in hive, to be readbale on spark.
But, anyway, things you can try:
If you want test only on head node, you can change the hive-site.xml, on property "metastore.catalog.default", change the value to hive, after that open pyspark from command line.
If you want to apply to all cluster nodes, changes need to be made on Ambari.
Login as admin on ambari
Go to spark2 > Configs > hive-site-override
Again, update property "metastore.catalog.default", to hive value
Restart all required on Ambari panel
These changes define hive metastore catalog as default.
You can see hive databases and table now, but depending of table structure, you will not see the table data properly.
If you have created tables in other databases, try show tables from database_name. Replace database_name with the actual name.
You are missing details of hive server in SparkSession. If you haven't added any it will create and use default database to run sparksql.
If you've added configuration details in spark default conf file for spark.sql.warehouse.dir and spark.hadoop.hive.metastore.uris then while creating SparkSession add enableHiveSupport().
Else add configuration details while creating sparksession
.config("spark.sql.warehouse.dir","/user/hive/warehouse")
.config("hive.metastore.uris","thrift://localhost:9083")
.enableHiveSupport()

Spark sql can't find table in hive in HDP

I use HDP3.1 and I added Spark2, Hive and Other services which are needed. I turned of the ACID feature in Hive. The spark job can't find the table in hive. But the table exists in Hive. The exception likes:
org.apache.spark.sql.AnalysisException: Table or view not found
There is hive-site.xml in Spark's conf folder. It is automaticly created by HDP. But it isn't same as the file in hive's conf folder. And from the log, the spark can get the thrift URI of hive correctly.
I use spark sql and created one hive table in spark-shell. I found the table was created in the fold which is specified by spark.sql.warehouse.dir. I changed its value to the value of hive.metastore.warehouse.dir. But the problem is still there.
I also enabled hive support when creating spark session.
val ss = SparkSession.builder().appName("统计").enableHiveSupport().getOrCreate()
There is metastore.catalog.default in hive-site.xml in spark's conf folder. It value is spark. It should be changed to hive. And btw, we should disable ACID feature of hive.
You can use hivewarehouse connector and use llap in hive conf
In HDP 3.0 and later, Spark and Hive use independent catalogs for accessing SparkSQL or Hive tables on the same or different platforms.
Spark, by default, only reads the Spark Catalog. And, this means Spark applications that attempt to read/write to tables created using hive CLI will fail with table not found exception.
Workaround:
Create the table in Hive CLI and Spark SQL
Hive Warehouse Connector

Can Spark-sql work without a hive installation?

I have installed spark 2.4.0 on a clean ubuntu instance. Spark dataframes work fine but when I try to use spark.sql against a dataframe such as in the example below,i am getting an error "Failed to access metastore. This class should not accessed in runtime."
spark.read.json("/data/flight-data/json/2015-summary.json")
.createOrReplaceTempView("some_sql_view")
spark.sql("""SELECT DEST_COUNTRY_NAME, sum(count)
FROM some_sql_view GROUP BY DEST_COUNTRY_NAME
""").where("DEST_COUNTRY_NAME like 'S%'").where("sum(count) > 10").count()
Most of the fixes that I have see in relation to this error refer to environments where hive is installed. Is hive required if I want to use sql statements against dataframes in spark or am i missing something else?
To follow up with my fix. The problem in my case was that Java 11 was the default on my system. As soon as I set Java 8 as the default metastore_db started working.
Yes, we can run spark sql queries on spark without installing hive, by default hive uses mapred as an execution engine, we can configure hive to use spark or tez as an execution engine to execute our queries much faster. Hive on spark hive uses hive metastore to run hive queries. At the same time, sql queries can be executed through spark. If spark is used to execute simple sql queries or not connected with hive metastore server, its uses embedded derby database and a new folder with name metastore_db will be created under the user home folder who executes the query.

Where is Spark table created if Hive support is not enabled?

Where is table created when I use SQL syntax for create table in Spark 2.*?
CREATE TABLE ... AS
Does spark store it internally somewhere of there has to always be Impala or Hive support for Spark tables to work?
By default (when no connection specified e.g. in hive-site.xml). Spark will save it in the local metastore that is on Derby db.
And it should be stored in spark-warehouse directory in your Spark directory or whatever is defined in spark.sql.warehouse.dir prop.

Resources