Databricks Delta and Hive Transactional Table - apache-spark

I've seen from two sources that, right now, you cannot interact in any meaningful way with Hive transactional tables from Spark:
Hive ACID
Hive Transactional Tables are not readable by Spark
I see Databricks has released a transactional feature called Databricks Delta. Is it now possible to read Hive transactional tables using this feature?

No, not the Hive transactional tables. You create a new type of table, a Databricks Delta table (a Spark table backed by Parquet files), and leverage the Hive metastore to read from and write to it.
It is a kind of external table, but it works more like binding data to a schema; it is mostly Spark and Parquet.
A solution to your problem might be to read the Hive files, impose the schema accordingly in a Databricks notebook, and then save the result as a Databricks Delta table,
like this: df.write.mode('overwrite').format('delta').save('/mnt/out/put/path')
You would still need to write a DDL statement pointing to that location. Just FYI, a Delta table is transactional.
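The DDL step can be sketched as follows. This is only a minimal sketch, assuming a Databricks runtime where the CREATE TABLE ... USING DELTA LOCATION syntax is available. The table name events and the helper delta_table_ddl are hypothetical; the path is the placeholder from the snippet above.

```python
def delta_table_ddl(table_name: str, location: str) -> str:
    """Build a DDL statement that registers an existing Delta directory as a table.

    Hypothetical helper for illustration; it only formats a string.
    """
    # Refuse quotes rather than guess at escaping rules for the SQL literal.
    if "'" in location:
        raise ValueError("location must not contain single quotes")
    return (
        f"CREATE TABLE IF NOT EXISTS {table_name} "
        f"USING DELTA LOCATION '{location}'"
    )


ddl = delta_table_ddl("events", "/mnt/out/put/path")
print(ddl)

# In a notebook with an active SparkSession named `spark` (assumed),
# the statement would then be executed with:
# spark.sql(ddl)
```

Because the table only points at the location, it behaves like an external table: the data files live at the path, not inside the metastore warehouse.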

I don't see the point of insisting on Spark alone for accessing Hive ACID.
Spark actually relies on a host language, with Python and Scala being the most popular choices.
You can use Hive ACID from Python with no issues; this is a well-proven integration.
Your data can reside in Spark DataFrames or RDDs, but as long as you can transfer it to standard Python data structures, you can interoperate with Hive ACID directly from those.

Related

how to register an existing delta table to hive

We are using Spark to read and write data in Delta format stored in HDFS (Databricks Delta table version 0.5.0).
We would like to use the power of Hive to interact with the Delta tables.
How can we register existing data in Delta format from a path on HDFS to Hive?
Please note that we are currently running Spark (2.4.0) on the Cloudera platform (CDH 6.3.3).
The only way I can do this so far is by registering it as an unmanaged table. The most significant difference, as far as I can tell, is that dropping an unmanaged table does not drop the underlying data.

Writing Data into Hive Transactional table

I am trying to write data into a Hive transactional table using Spark. The following is the sample code that I used to insert data:
dataSet.write().format("orc")
    .partitionBy("column1")
    .bucketBy(2, "column2")
    .insertInto("table");
Unfortunately, I get the following error while running the application:
org.apache.spark.sql.AnalysisException: 'insertInto' does not support bucketBy right now;
The Spark and Hive versions I am using are 2.4 and 3.1. I have googled a lot but didn't find any solution. I am pretty new to Hive; any help would be appreciated.
https://issues.apache.org/jira/browse/SPARK-15348 states clearly that Spark does not currently support Hive ORC ACID processing. A pity, but it is not possible.
You need to write Hive scripts with Tez or MR as the underlying execution engine for Hive.

PySpark is not able to read Hive ORC transaction table through sparkContext/hiveContext ? Can we update/delete hive table data using Pyspark?

I have tried to access the Hive ORC Transactional table (which has underlying delta files on HDFS) using PySpark but I'm not able to read the transactional table through sparkContext/hiveContext.
/mydim/delta_0117202_0117202
/mydim/delta_0117203_0117203
Officially, Spark does not yet support Hive ACID tables. Take a full dump or incremental dump of the ACID table into a regular Hive ORC/Parquet partitioned table, then read that data using Spark.
There is an open Jira, SPARK-15348, to add support for reading Hive ACID tables.
If you run a major compaction on the ACID table (from Hive), Spark is able to read the base_XXX directories only, but not the delta directories; this is addressed in Jira SPARK-16996.
There are some workarounds for reading ACID tables using SPARK-LLAP.
I think that, starting from HDP 3.x, the HiveWarehouseConnector supports reading Hive ACID tables.
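To make the base/delta distinction above concrete, here is a minimal sketch that classifies directory names like the ones in the listing. It assumes the usual Hive ACID naming of base_<writeId> and delta_<minWriteId>_<maxWriteId>; the helper names are hypothetical, and other directory kinds (for example delete deltas) are deliberately not handled.

```python
import re
from typing import Iterable, NamedTuple, Optional


class AcidDir(NamedTuple):
    kind: str                    # "base" or "delta"
    min_write_id: Optional[int]  # None for base directories
    max_write_id: int


# Assumed naming convention; anything else is reported as unknown (None).
_DELTA = re.compile(r"^delta_(\d+)_(\d+)")
_BASE = re.compile(r"^base_(\d+)")


def classify_acid_dir(path: str) -> Optional[AcidDir]:
    """Classify the last path component as a base or delta directory."""
    name = path.rstrip("/").rsplit("/", 1)[-1]
    match = _DELTA.match(name)
    if match:
        return AcidDir("delta", int(match.group(1)), int(match.group(2)))
    match = _BASE.match(name)
    if match:
        return AcidDir("base", None, int(match.group(1)))
    return None


def has_uncompacted_deltas(paths: Iterable[str]) -> bool:
    """True if any delta directory is present, i.e. a major compaction is pending."""
    return any(
        d is not None and d.kind == "delta"
        for d in map(classify_acid_dir, paths)
    )


listing = ["/mydim/delta_0117202_0117202", "/mydim/delta_0117203_0117203"]
print([classify_acid_dir(p) for p in listing])
print(has_uncompacted_deltas(listing))
```

A check like this only tells you whether delta directories remain; it does not make Spark able to read them.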

Spark Connect Hive to HDFS vs Spark connect HDFS directly and Hive on the top of it?

Summary of the problem:
I have a particular use case: writing >10 GB of data per day to HDFS via Spark Streaming. We are currently in the design phase. We want to write the data to HDFS (a constraint) using Spark Streaming. The data is columnar.
We have 2 options (so far):
Naturally, I would like to use the Hive context to feed data to HDFS. The schema is defined and the data is fed in batches or row-wise.
There is another option. We can write data directly to HDFS through the Spark Streaming API. We are also considering this because, in this use case, we can then query the data from HDFS through Hive. This leaves options open to use other technologies in the future for new use cases that may come up.
Which is best?
Spark Streaming -> Hive -> HDFS -> Consumed by Hive.
VS
Spark Streaming -> HDFS -> Consumed by Hive, or other technologies.
Thanks.
So far I have not found a discussion on the topic, though my research may have fallen short. If there is any article that you can suggest, I would be most happy to read it.
"I have a particular use case to write >10gb data per day and data is columnar"
That means you are storing day-wise data. If that is the case, give the Hive table a date partition column so that you can easily query the data for each day. You can query the raw data from tools like Looker or Presto, or any other BI tool. If you are querying from Spark, you can use Hive features and properties. Moreover, if you store the data in a columnar format such as Parquet, Impala can query it using the Hive metastore.
If your data is columnar, consider Parquet or ORC.
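The date-partitioned layout suggested above can be sketched as follows. Hive-style partitions are directories named column=value under the table location. The column name dt, the base path /data/events, and the helper name are hypothetical and chosen only for illustration.

```python
from datetime import date


def partition_path(base_path: str, day: date, column: str = "dt") -> str:
    """Return the Hive-style partition directory for one day of data."""
    return f"{base_path.rstrip('/')}/{column}={day.isoformat()}"


print(partition_path("/data/events", date(2020, 1, 17)))
# prints /data/events/dt=2020-01-17

# With PySpark (assumed available), writing a DataFrame `df` that has a `dt`
# column produces this directory layout:
# df.write.partitionBy("dt").parquet("/data/events")
```

A query that filters on the partition column then only needs to read the directories for the requested days.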
Regarding option 2:
If Hive is an option, there is no need to feed the data into HDFS yourself and then create an external table in Hive to access it.
Conclusion:
I feel both are the same, but Hive is preferred considering direct queries on the raw data using BI tools or Spark. We can also query the data from HDFS using Spark. If the data is in a format like JSON, Parquet, or XML, option 2 brings no added advantage.
It depends on your final use cases. Please consider the two scenarios below when making the decision:
If you have a real-time or near-real-time case and all your data is a full refresh, then I would suggest the second approach, Spark Streaming -> HDFS -> Consumed by Hive. It will be faster than your first approach, Spark Streaming -> Hive -> HDFS -> Consumed by Hive, since it has one less layer.
If your data is incremental and also involves multiple update and delete operations, then it will be difficult to use HDFS, or Hive over HDFS, with Spark, since Spark cannot update or delete data in place on HDFS. In that case, both of your approaches will be difficult to implement. You can either go with a Hive managed table and do updates/deletes using HQL (which requires Hive ACID transactional tables, as in the Hortonworks Hive distribution), or go with a NoSQL database like HBase or Cassandra so that Spark can do upserts and deletes easily. From a programming perspective, that will also be easier than both of your approaches.
If you dump the data into NoSQL, you can use Hive over it for normal SQL or reporting purposes.
There are many tools and approaches available, but go with the one that fits all of your cases. :)

How is spark HiveContext/SQLContext retrieving schema/data?

I can't seem to find much documentation on it but when I pull data from Hive in Spark SQL how is it retrieving the schema, is it automatically looking in the Hive Metastore? Also is it Hive telling spark to look at the file location to pull the data into a DataFrame? And how does it handle a view or can it not handle a view yet?
Yes, it looks up the Hive metastore.
Spark gets the schema and the file location from the Hive metastore, then reads the underlying files itself and turns the result into a DataFrame of rows.
From the docs:
"When working with Hive one must construct a HiveContext, which inherits from SQLContext, and adds support for finding tables in the MetaStore and writing queries using HiveQL."

Resources