How to register an existing Delta table in Hive - apache-spark

We use Spark to read and write data in Delta format stored on HDFS (Delta Lake version 0.5.0).
We would like to use Hive to interact with the Delta tables.
How can we register existing data in Delta format, located at a path on HDFS, with Hive?
Please note that we currently run Spark 2.4.0 on the Cloudera platform (CDH 6.3.3).

The only way I can do this so far is by registering it as an unmanaged table. The most significant difference, as far as I can tell, is that if you drop an unmanaged table, it does not drop the underlying data.
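As a rough sketch of what registering an unmanaged table could look like, the helper below only builds the DDL string. The table name `events` and the path `hdfs:///data/delta/events` are hypothetical placeholders, and whether a given Spark / Delta Lake combination accepts `USING DELTA` against the Hive metastore depends on the versions involved, so treat it as an assumption to verify on your cluster.

```python
def register_delta_table_sql(table_name: str, hdfs_path: str) -> str:
    """Build DDL that registers existing Delta files as an unmanaged table.

    Because a LOCATION is given, dropping the table later removes only the
    metastore entry and leaves the files under `hdfs_path` in place.
    """
    escaped_path = hdfs_path.replace("'", "\\'")
    return (
        f"CREATE TABLE IF NOT EXISTS {table_name} "
        f"USING DELTA LOCATION '{escaped_path}'"
    )


# Hypothetical table name and path, for illustration only.
ddl = register_delta_table_sql("events", "hdfs:///data/delta/events")
print(ddl)

# With a Hive-enabled SparkSession named `spark` (an assumption), the
# statement would be submitted as:
#     spark.sql(ddl)
```

This makes the table visible through Spark's catalog; it does not by itself make the Delta files readable from Hive.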

Related

Write a spark DataFrame to a table

I am trying to understand the Spark DataFrame API method called saveAsTable.
I have the following questions:
If I simply write a DataFrame using the saveAsTable API,
df7.write.saveAsTable("t1") (assuming t1 did not exist earlier), will the newly created table be a Hive table that can be read outside Spark using HiveQL?
Does Spark also create non-Hive tables (tables created using the saveAsTable API that cannot be read outside Spark using HiveQL)?
How can I check whether a table is a Hive table or a non-Hive table?
(I am new to big data processing, so pardon me if the question is not phrased properly.)
Yes. The newly created table will be a Hive table and can be queried from the Hive CLI, but only if the DataFrame was created from a single, non-partitioned input HDFS path.
Below is the documentation comment from the DataFrameWriter.scala class:
When the DataFrame is created from a non-partitioned
HadoopFsRelation with a single input path, and the data source
provider can be mapped to an existing Hive builtin SerDe (i.e. ORC and
Parquet), the table is persisted in a Hive compatible format, which
means other systems like Hive will be able to read this table.
Otherwise, the table is persisted in a Spark SQL specific format.
Yes, you can. Your table can be partitioned by a column, but it cannot use bucketing (bucketing is a known incompatibility between Spark and Hive).

Can Hive Read data from Delta lake file format?

I have started looking into the Delta Lake file format. Is Hive capable of reading data from this newly introduced format? If so, could you please let me know which SerDe you are using?
Hive support is available for the Delta Lake file format. The first step is to add the jars from https://github.com/delta-io/connectors to the Hive path. Then create a table using the following format:
CREATE EXTERNAL TABLE test.dl_attempts_stream
(
...
)
STORED BY 'io.delta.hive.DeltaStorageHandler'
LOCATION
The Delta format picks up partitions automatically, so there is no need to specify partitions when creating the table.
NOTE: If data is being inserted via a Spark job, provide hive-site.xml and call enableHiveSupport in the Spark job in order to create the Delta Lake table in Hive.
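To make the shape of that statement concrete, here is a small sketch that assembles a Hive DDL string for the storage handler named above. The table name, column list, and location are hypothetical placeholders (the original statement elides them), and the helper is only a string builder, not part of the connector.

```python
from typing import Sequence, Tuple


def delta_hive_ddl(
    table: str, columns: Sequence[Tuple[str, str]], location: str
) -> str:
    """Build a Hive CREATE EXTERNAL TABLE statement for a Delta location.

    No PARTITIONED BY clause is emitted, following the answer's note that
    partitions are picked up from the Delta table itself.
    """
    cols = ",\n".join(f"  `{name}` {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE {table}\n(\n{cols}\n)\n"
        "STORED BY 'io.delta.hive.DeltaStorageHandler'\n"
        f"LOCATION '{location}'"
    )


# Hypothetical table, columns and path, for illustration only.
hive_ddl = delta_hive_ddl(
    "test.example_delta",
    [("id", "BIGINT"), ("event_date", "STRING")],
    "/path/to/delta/table",
)
print(hive_ddl)
```

The generated statement is meant to be run from Hive (for example via Beeline) after the connector jars are on the Hive path.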

Apache Spark not using partition information from Hive partitioned external table

I have a simple Hive external table created on top of S3 (the files are in CSV format). When I run a Hive query, it shows all records and partitions.
However, when I use the same table in Spark (where the Spark SQL has a WHERE condition on the partition column), the plan does not show that a partition filter is applied. For a Hive managed table, on the other hand, Spark is able to use the partition information and apply the partition filter.
Is there any flag or setting that can help me make use of the partitions of Hive external tables in Spark? Thanks.
Update:
For some reason, only the Spark plan fails to show the partition filters. When you look at the data loaded, it is only loading the data needed from the partitions.
Example: with WHERE rating=0, it loads only one file of 1 MB; without the filter, it reads all 3 partitions for 3 MB.
tl;dr: set the following before running the SQL against the external table:
spark.sql("set spark.sql.hive.convertMetastoreOrc=true")
The difference in behaviour is not caused by the table being external or managed.
The behaviour depends on two factors:
1. Where the table was created (Hive or Spark)
2. File format (I believe it is ORC in this case, from the screen capture)
Where the table was created (Hive or Spark)
If the table was created using Spark APIs, it is considered a Datasource table.
If the table was created using HiveQL, it is considered a Hive native table.
The metadata of both kinds of table is stored in the Hive metastore; the only difference is in the provider field of the TBLPROPERTIES of the table (describe extended <tblName>). The value of the property is orc or empty for a Spark table and hive for a Hive table.
How Spark uses this information
When the provider is not hive (a Datasource table), Spark uses its native way of processing the data.
If the provider is hive, Spark uses Hive code to process the data.
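The rule above can be sketched as a small check over the rows of a `DESCRIBE FORMATTED` result. The sample rows below are hypothetical, and the assumption that the provider appears in a row whose first column is `Provider` should be confirmed against your Spark version.

```python
from typing import Iterable, Optional, Tuple


def find_provider(describe_rows: Iterable[Tuple[str, Optional[str]]]) -> Optional[str]:
    """Return the value of the 'Provider' row, or None if there is no such row."""
    for col_name, value in describe_rows:
        if col_name.strip().lower() == "provider":
            return (value or "").strip()
    return None


def table_kind(provider: Optional[str]) -> str:
    """Apply the rule from the answer: 'hive' is Hive native, anything else Datasource."""
    if provider is not None and provider.lower() == "hive":
        return "hive native"
    return "datasource"


# Hypothetical rows shaped like (col_name, data_type) pairs.
sample_rows = [
    ("id", "int"),
    ("Provider", "hive"),
    ("Location", "hdfs:///warehouse/t1"),
]
kind = table_kind(find_provider(sample_rows))
print(kind)

# With a SparkSession named `spark` (an assumption), the rows could come from:
#     rows = [(r.col_name, r.data_type)
#             for r in spark.sql("DESCRIBE FORMATTED t1").collect()]
```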
File format
Spark provides config flags that instruct the engine to use the Datasource way of processing the data for the following file formats: ORC and Parquet.
Flags:
Orc
val CONVERT_METASTORE_ORC = buildConf("spark.sql.hive.convertMetastoreOrc")
  .doc("When set to true, the built-in ORC reader and writer are used to process " +
    "ORC tables created by using the HiveQL syntax, instead of Hive serde.")
  .booleanConf
  .createWithDefault(true)
Parquet
val CONVERT_METASTORE_PARQUET = buildConf("spark.sql.hive.convertMetastoreParquet")
  .doc("When set to true, the built-in Parquet reader and writer are used to process " +
    "parquet tables created by using the HiveQL syntax, instead of Hive serde.")
  .booleanConf
  .createWithDefault(true)
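For completeness, a minimal sketch of turning the two flags into `set` statements from PySpark. The helper name is made up for illustration; only the two configuration keys come from the Spark source quoted above.

```python
def set_statement(key: str, value: bool) -> str:
    """Render a Spark SQL 'set' statement for a boolean config flag."""
    return f"set {key}={'true' if value else 'false'}"


# The two conversion flags quoted above.
CONVERT_FLAGS = (
    "spark.sql.hive.convertMetastoreOrc",
    "spark.sql.hive.convertMetastoreParquet",
)

statements = [set_statement(flag, True) for flag in CONVERT_FLAGS]
for stmt in statements:
    print(stmt)

# With a SparkSession named `spark` (an assumption), each one would be run as:
#     spark.sql(stmt)
```

Only the flag matching the table's file format matters for a given query.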
I also ran into this kind of problem with multiple joins of internal and external tables.
None of the tricks worked, including:
spark.sql("set spark.sql.hive.convertMetastoreParquet=false")
spark.sql("set spark.sql.hive.metastorePartitionPruning=true")
spark.sql("set spark.sql.hive.caseSensitiveInferenceMode=NEVER_INFER")
Does anyone know how to solve this problem?

PySpark is not able to read a Hive ORC transactional table through sparkContext/hiveContext. Can we update/delete Hive table data using PySpark?

I have tried to access a Hive ORC transactional table (which has underlying delta files on HDFS) using PySpark, but I am not able to read the transactional table through sparkContext/hiveContext.
/mydim/delta_0117202_0117202
/mydim/delta_0117203_0117203
Spark does not yet officially support Hive ACID tables. Take a full or incremental dump of the ACID table into a regular Hive ORC/Parquet partitioned table, then read that data using Spark.
There is an open Jira, SPARK-15348, to add support for reading Hive ACID tables.
If you run a major compaction on the ACID table (from Hive), Spark is able to read the base_XXX directories only, not the delta directories; SPARK-16996 addresses this.
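To check whether a table directory still contains delta directories that Spark would not read, a listing of its subdirectories can be classified by name. This is a sketch that assumes the usual `base_<n>` and `delta_<n>_<m>` naming; the `base_0117203` entry is a hypothetical example, and delta directories may linger for a while after compaction.

```python
import re
from typing import Iterable

_BASE = re.compile(r"^base_\d+")
_DELTA = re.compile(r"^(delete_)?delta_\d+_\d+")


def classify_acid_dir(path: str) -> str:
    """Classify a directory as 'base', 'delta' or 'other' from its last path segment."""
    leaf = path.rstrip("/").split("/")[-1]
    if _BASE.match(leaf):
        return "base"
    if _DELTA.match(leaf):
        return "delta"
    return "other"


def has_delta_dirs(paths: Iterable[str]) -> bool:
    """True if any directory in the listing is a delta directory."""
    return any(classify_acid_dir(p) == "delta" for p in paths)


# The delta paths come from the question; the base path is hypothetical.
listing = [
    "/mydim/delta_0117202_0117202",
    "/mydim/delta_0117203_0117203",
    "/mydim/base_0117203",
]
kinds = [classify_acid_dir(p) for p in listing]
print(kinds, has_delta_dirs(listing))
```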
There are some workarounds for reading ACID tables using SPARK-LLAP.
I think that starting from HDP 3.x, the HiveWarehouseConnector is able to read Hive ACID tables.

Databricks Delta and Hive Transactional Table

I've seen from two sources that, right now, you cannot interact in any meaningful way with Hive transactional tables from Spark:
Hive ACID
Hive Transactional Tables are not readable by spark
I see Databricks has released a transactional feature called Databricks Delta. Is it now possible to read Hive transactional tables using this feature?
No, not the Hive transactional tables. You create a new type of table called a Databricks Delta table (a Spark table of Parquet files) and leverage the Hive metastore to read from and write to these tables.
It is a kind of external table, but it works more like applying a schema to existing data; it is built on Spark and Parquet.
The solution to your problem might be to read the Hive files, impose the schema accordingly in a Databricks notebook, and then save the result as a Databricks Delta table,
like this: df.write.mode('overwrite').format('delta').save('/mnt/out/put/path')
You would still need to write a DDL statement pointing to that location. Just FYI, a Delta table is transactional.
I don't see the point of insisting on Spark alone for accessing Hive ACID.
Spark relies on a host language, with Python and Scala being the most popular choices.
You can use Hive ACID from Python with no issues; this is a well-proven integration.
Your data can reside in Spark DataFrames or RDDs, but as long as you can transfer it to standard Python data structures, you can interoperate with Hive ACID directly from those.
