How to access HIVE ACID tables using HANA SDA Virtual table? - apache-spark

We are currently using HANA 1.0 SPS 12 and the Spark Controller to create virtual tables and access Hive data in HANA. The problem is that we have some SC2 tables we want to archive in HANA, for which we need full CRUD operations. We have converted a few of the Hive tables to ACID (transactional = true), and now we are not able to fetch records: queries return 0 records.
We tried Apache Drill, which has native support for Hive ACID tables, but querying the Hive tables through the Drill ODBC driver and a DSN fails. After checking the queries that hit Drill, we found that HANA wraps the schema name in double quotes, e.g. SELECT * FROM "hive.schemaname".tablename.
We tried changing Drill's identifier quoting from the default backtick to the double quote, but we ended up losing the remote schema refresh, because that query still wraps the schema name in backticks (`).

There is an incompatibility between Spark 2 and Hive ACID transactional tables; external tables can be used as a workaround. If you have an S-user, SAP has documented this in Note 2901291 (SAP HANA Spark Controller - Incompatibility of Spark 2 and Hive ACID transactional tables).

ACID table access is not possible using SDA, so we created a non-ACID table from the ACID table in Hive and accessed that one in HANA instead. We use the ACID table to perform all the necessary CRUD operations and then INSERT OVERWRITE into a partitioned non-ACID table; this way only the partitions where an update/change happened are recreated, rather than the whole table.
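A minimal sketch of that refresh step from Python, using the PyHive client to run the HiveQL; the connection details, the table names (acid_db.orders_acid, acid_db.orders_sda) and the load_date partition column are made up for illustration:

# Assumed layout: acid_db.orders_acid is the ACID table where CRUD happens,
# acid_db.orders_sda is the partitioned non-ACID copy that HANA reads via SDA.
from pyhive import hive

conn = hive.connect(host="hive-server", port=10000, username="etl")  # assumed connection details
cur = conn.cursor()

# Allow dynamic partitioning so only the partitions present in the SELECT get rewritten.
cur.execute("SET hive.exec.dynamic.partition=true")
cur.execute("SET hive.exec.dynamic.partition.mode=nonstrict")

cur.execute("""
INSERT OVERWRITE TABLE acid_db.orders_sda PARTITION (load_date)
SELECT id, name, amount, load_date
FROM acid_db.orders_acid
WHERE load_date = '2020-01-01'   -- restrict to the partitions that actually changed
""")

cur.close()
conn.close()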

Related

PySpark is not able to read a Hive ORC transactional table through sparkContext/hiveContext. Can we update/delete Hive table data using PySpark?

I have tried to access a Hive ORC transactional table (which has underlying delta files on HDFS, for example:) using PySpark, but I'm not able to read the transactional table through sparkContext/hiveContext.
/mydim/delta_0117202_0117202
/mydim/delta_0117203_0117203
Officially, Spark does not yet support Hive ACID tables. Take a full or incremental dump of the ACID table into a regular Hive ORC/Parquet partitioned table, then read that data using Spark.
There is an open JIRA, SPARK-15348, to add support for reading Hive ACID tables.
If you run a major compaction on the ACID table (from Hive), then Spark is able to read the base_XXX directories only, but not the delta directories; SPARK-16996 addresses this.
There are some workarounds to read ACID tables using Spark-LLAP, as mentioned in this link.
I think that starting from HDP 3.x the HiveWarehouseConnector is able to read Hive ACID tables (see the sketch after this answer).
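For the HDP 3.x route, here is a minimal sketch of reading an ACID table through the HiveWarehouseConnector from PySpark; it assumes the HWC jar and its pyspark_llap Python module are on the classpath and that spark.sql.hive.hiveserver2.jdbc.url (plus the LLAP settings) are already configured, and mydb.mydim is a made-up table name:

from pyspark.sql import SparkSession
from pyspark_llap import HiveWarehouseSession

spark = SparkSession.builder.appName("read-hive-acid").getOrCreate()

# Queries go through HiveServer2/LLAP, which understands the ACID delta files,
# instead of Spark trying to read the ORC delta directories directly.
hive = HiveWarehouseSession.session(spark).build()

df = hive.executeQuery("SELECT * FROM mydb.mydim")  # assumed ACID table
df.show()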

Hive - Copy database schema with partitions and recreate in another hive instance

I have copied the data and folder structure for a database with partitioned hive tables from one HDFS instance to another.
How can I do the same with the Hive metadata? I need the new HDFS instance's Hive to have this database and its tables defined using their existing partitioning, just like in the original location. And, of course, they need to maintain their original schemas, with the HDFS external table locations updated.
Happy to use direct hive commands, spark, or any general CLI utilities that are open source and readily available. I don't have an actual hadoop cluster (this is cloud storage), so please avoid answers that depend on map reduce/etc (like Sqoop).
Use the Hive command:
SHOW CREATE TABLE tablename;
This will print the CREATE TABLE statement. Copy it, change the table type to external, the location, the schema, and the column names if necessary, and execute it.
After you have created the table, use this command to create the partition metadata:
MSCK [REPAIR] TABLE tablename;
The equivalent command on Amazon Elastic MapReduce (EMR)'s version of Hive is:
ALTER TABLE tablename RECOVER PARTITIONS;
This will add the Hive partition metadata. See the manual here: RECOVER PARTITIONS.
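A rough sketch of scripting this with PySpark; the first part runs against the source cluster, the second against the target cluster, and the table name, schema and location are made up:

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# --- On the source cluster: dump the DDL ---
ddl = spark.sql("SHOW CREATE TABLE mydb.mytable").collect()[0][0]
print(ddl)  # save this, then edit it: make the table EXTERNAL and point LOCATION at the new storage

# --- On the target cluster (separate run against the other metastore) ---
edited_ddl = """
CREATE EXTERNAL TABLE mydb.mytable (id INT, name STRING)
PARTITIONED BY (load_date STRING)
STORED AS ORC
LOCATION 'gs://new-bucket/warehouse/mydb.db/mytable'
"""
spark.sql(edited_ddl)

# Register the partition directories that were copied over.
spark.sql("MSCK REPAIR TABLE mydb.mytable")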

Databricks Delta and Hive Transactional Table

I've seen from two sources that right now you cannot interact in any meaningful way with HIVE Transactional Tables from Spark.
Hive ACID
Hive Transactional Tables are not readable by spark
I see Databricks has released a Transactional feature called Databricks Delta. Is it possible to now read HIVE Transactional Tables using this feature?
Nope, not the Hive transactional tables. You create a new type of table called a Databricks Delta table (a Spark table of Parquet files) and leverage the Hive metastore to read/write to these tables.
It's a kind of external table, but it works more like data-to-schema. It's more of Spark and Parquet.
The solution for your problem might be to read the underlying Hive files, impose the schema accordingly in a Databricks notebook, and then save the result as a Databricks Delta table,
like this: df.write.mode('overwrite').format('delta').save('/mnt/out/put/path')
You would still need to write a DDL pointing to that location. Just FYI, a Delta table is transactional.
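A minimal sketch of that flow, assuming the Hive table's underlying files are ORC; the paths and the table name (mydb.users_delta) are made up:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the raw files that back the Hive table and impose/infer the schema
# (add .schema(...) if the inferred schema is not what you want).
df = spark.read.format("orc").load("/mnt/raw/users_orc")

# Save as a Delta table; Delta gives you the transactional behaviour (update/delete/merge).
df.write.mode("overwrite").format("delta").save("/mnt/delta/users")

# The DDL pointing to that location, so the table is visible through the metastore.
spark.sql("CREATE TABLE mydb.users_delta USING DELTA LOCATION '/mnt/delta/users'")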
I don't see the point of insisting on just Spark for accessing Hive ACID.
Actually, Spark relies on a host language, Python and Scala being the most popular choices.
You could use Hive ACID from Python with no issues; this is a well-proven integration.
Your data can reside in Spark DataFrames or RDDs, but as long as you can transfer it into standard Python data structures, you can interoperate with Hive ACID directly from those (see the sketch below).
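As a hedged sketch of that interop: collect the keys you care about from a Spark DataFrame into plain Python objects, then run ordinary HiveQL DML against the ACID table through the PyHive client (the connection details and the mydb.users_staging / mydb.users_acid table names are assumptions):

from pyhive import hive
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Pull the ids out of a Spark DataFrame into a plain Python list.
ids = [row["id"] for row in spark.table("mydb.users_staging").select("id").collect()]

conn = hive.connect(host="hive-server", port=10000, username="etl")  # assumed connection details
cur = conn.cursor()

if ids:
    # Ordinary ACID DML executed by Hive itself; fine here because the ids are trusted integers.
    cur.execute("DELETE FROM mydb.users_acid WHERE id IN ({})".format(",".join(str(i) for i in ids)))
cur.execute("UPDATE mydb.users_acid SET name = 'unknown' WHERE name IS NULL")

cur.close()
conn.close()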

update query in Spark SQL

I wonder whether I can use an UPDATE query in Spark SQL, just like:
sqlContext.sql("update users set name = '*' where name is null")
I got the error:
org.apache.spark.sql.AnalysisException:
Unsupported language features in query:update users set name = '*' where name is null
Does Spark SQL not support the UPDATE query, or am I writing the code incorrectly?
Spark SQL doesn't support UPDATE statements yet.
Hive has supported UPDATE since version 0.14, but even Hive supports updates/deletes only on tables that support transactions; this is mentioned in the Hive documentation.
See the answers in the Databricks forums confirming that UPDATEs/DELETEs are not supported in Spark SQL because it doesn't support transactions. If you think about it, supporting random updates is very complex with most big-data storage formats: it requires scanning huge files, updating specific records, and rewriting potentially terabytes of data. It is not normal SQL.
Now it's possible, with Databricks Delta Lake.
Spark SQL now supports UPDATE, DELETE and similar data-modification operations if the underlying table is in Delta format.
Check this out:
https://docs.delta.io/0.4.0/delta-update.html#update-a-table
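A minimal sketch of the original query rewritten against a Delta table, using the Python API from that 0.4.0 documentation; the /delta/users path is an assumption, and on newer Spark 3.x / Delta releases the plain SQL UPDATE statement works as well:

from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

# The users table stored in Delta format at an assumed path.
users = DeltaTable.forPath(spark, "/delta/users")

# Equivalent of: UPDATE users SET name = '*' WHERE name IS NULL
users.update(condition="name is null", set={"name": "'*'"})

# On Spark 3.x with a newer Delta release, the SQL form also works:
# spark.sql("UPDATE users SET name = '*' WHERE name IS NULL")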

Hive Bucketed Tables enabled for Transactions

So we are trying to create a bucketed Hive table in ORC format, enabled for transactions, using the statement below:
create table orctablecheck ( id int, name string) clustered by (id) into 3 buckets stored as orc TBLPROPERTIES ('transactional'='true')
The table gets created in Hive and is also reflected in Beeline, both in the metastore and in Spark SQL (which we have configured to run on top of Hive JDBC).
We are now inserting data into this table via Hive. However, after insertion the data does not show up in Spark SQL; it only shows up correctly in Hive.
The data only shows up in Spark SQL if we restart the Thrift Server.
Is the transactional attribute set on your table? I observed that the Hive transactional storage structure does not work with Spark yet. You can confirm this by looking at the transactional attribute in the output of the command below in the Hive console:
desc extended <tablename> ;
If you need to access the transactional table from Spark, consider running a major compaction and then try accessing the table again:
ALTER TABLE <tablename> COMPACT 'major';
I created a transactional table in Hive and stored data in it using Spark (records 1, 2, 3) and Hive (record 4).
After the major compaction:
I can see all 4 records in Hive (using Beeline)
I can see only records 1, 2, 3 in Spark (using spark-shell)
I am unable to update records 1, 2, 3 in Hive
An update to record 4 in Hive is OK
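Tying that together, a small Spark-side sketch for after the major compaction has finished in Hive; whether REFRESH TABLE is enough to avoid restarting the Thrift Server is an assumption worth testing in your setup:

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Run ALTER TABLE orctablecheck COMPACT 'major' in Hive first and wait until
# SHOW COMPACTIONS reports it as finished, so the data sits in base_XXX files.
spark.sql("REFRESH TABLE orctablecheck")   # drop any cached metadata/file listing
spark.table("orctablecheck").show()        # Spark reads the compacted base files, not the deltas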
