Deleting data from Hive managed table (Partitioned and Bucketed) - apache-spark

We have a hive managed table (its both partitioned and bucketed, and transaction = 'true').
We are using Spark (version 2.4) to interact with this hive table.
We are able to successfully ingest data into this table using the following:
sparkSession.sql("insert into table values('')")
However, we are not able to delete a row from this table. We are attempting to delete it using the command below:
sparkSession.sql("delete from table where col1 = '' and col2 = ''")
We are getting an operationNotAccepted exception.
Do we need to do anything specific to be able to perform this action?
Thanks
Anuj

Unless it is a Delta table, this is not possible.
ORC ACID from Spark does not support DELETE on Hive bucketed tables; see https://github.com/qubole/spark-acid
Apache Hudi on AWS could also be an option.
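For reference, a minimal sketch of the Delta Lake route (assuming the delta-core jar is on the classpath and the data has been rewritten in Delta format; the path and column values below are illustrative, not from the question). With Spark 2.4 and open-source Delta, the row-level delete goes through the DeltaTable Scala API rather than SQL:
// Minimal sketch, assuming Delta Lake (delta-core) is available; path and values are illustrative.
import io.delta.tables.DeltaTable

// df is the DataFrame holding the table's data (illustrative).
// One-time conversion: write the data out in Delta format.
df.write.format("delta").save("/data/my_delta_table")

// Row-level delete through the DeltaTable API.
val deltaTable = DeltaTable.forPath(sparkSession, "/data/my_delta_table")
deltaTable.delete("col1 = 'a' AND col2 = 'b'")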

Related

spark and impala in CDH Hadoop do not agree on the table schema

Need some help as we are baffled. Using Impala SQL, we did ALTER TABLE to add 3 columns to a parquet table. The table is used by both Spark (v2) and Impala jobs.
After the columns were added, Impala correctly reports the new columns using describe, however, Spark does not report the freshly added columns when spark.sql("describe tablename") is executed.
We double checked Hive and it correctly reports the added columns.
We ran refresh table tablename in spark but it still doesn't see the new columns. We believe we must be overlooking something simple. What step did we miss?
Update: Impala sees the table with the new columns, but Spark does not acknowledge them. Reading more about Spark, apparently the Spark engine reads the schema from the Parquet files rather than from the Hive metastore. The suggested workaround did not work, and the only recourse we could find was to drop the table and rebuild it.
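For reference, a sketch of the usual things to try in this situation (the config name is a real Spark setting; the table name is from the question; whether any of these resolves the issue depends on the environment):
// Refresh attempts and a workaround sketch; spark.sql.hive.convertMetastoreParquet=false
// makes Spark read the table through the Hive SerDe (and hence the metastore schema)
// instead of inferring the schema from the Parquet files.
spark.sql("REFRESH TABLE tablename")
spark.catalog.refreshTable("tablename")
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")
spark.sql("describe tablename").show(false)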

Spark SQL Merge query

I am trying to MERGE two tables using Spark SQL and am getting an error with the statement.
The tables are created as external tables pointing to Azure ADLS storage. The SQL is executed using Databricks.
Table 1:
Name,Age,Sex
abc,24,M
bca,25,F
Table 2:
Name,Age,Sex
abc,25,M
acb,25,F
The Table 1 is the target table and Table 2 is the source table.
Table 2 contains one insert and one update record, which need to be merged into the target, Table 1.
Query:
MERGE INTO table1 using table2 ON (table1.name=table2.name)
WHEN MATCHED AND table1.age <> table2.age AND table1.sex<>table2.sex
THEN UPDATE SET table1.age=table2.age, table1.sex=table2.sex
WHEN NOT MATCHED
THEN INSERT (name,age,sex) VALUES (table2.name,table2.age,table2.sex)
Does Spark SQL support MERGE, or is there another way of achieving it?
Thanks
Sat
To use MERGE you need the Delta Lake option (and the associated jars); with Delta tables, MERGE works.
Otherwise, SQL MERGE is not supported by Spark, and you need the DataFrame writer APIs with your own merge logic; there are a few different ways to do this. Even with ORC ACID, Spark will not work this way.
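As an illustration of the DataFrame route, a rough upsert sketch using a left anti join (table and column names are taken from the question; the output path is an assumption, and rewriting the whole target is just one possible strategy):
// Rough upsert sketch without MERGE: rows from table2 insert or replace rows in table1.
val target = spark.table("table1")
val source = spark.table("table2")

// Keep target rows whose key does not appear in the source, then append all source rows.
val unchanged = target.join(source, Seq("name"), "left_anti")
val merged = unchanged.unionByName(source)

// Rewrite the result; the path is illustrative.
merged.write.mode("overwrite").parquet("/mnt/adls/table1_merged")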

drop table command is not deleting path of hive table which was created by spark-sql

I am trying to drop an internal (managed) table that was created with Spark SQL. The table is getting dropped, but the table's location still exists. Can someone let me know how to do this?
I tried both Beeline and Spark SQL.
create table something(hello string)
PARTITIONED BY(date_d string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY "^"
LOCATION "hdfs://path"
Drop table something;
No rows affected (0.945 seconds)
Thanks
Spark internally uses the Hive metastore to create tables. If the table was created as an external Hive table from Spark, i.e. the data is present in HDFS and Hive provides a table view on it, the DROP TABLE command only deletes the metastore information and does not delete the data from HDFS.
So there are a couple of alternative strategies you could take:
Manually delete the data from HDFS using the hadoop fs -rm -r command (a sketch of doing this from Spark follows below).
Run ALTER TABLE on the table you want to delete to change the external table into an internal table, then drop the table.
ALTER TABLE <table-name> SET TBLPROPERTIES('EXTERNAL'='FALSE');
DROP TABLE <table-name>;
The first statement converts the external table into an internal table, and the second statement deletes the table along with the data.
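If you take the manual route from Spark, a small sketch of removing the leftover directory through the Hadoop FileSystem API after the drop (the spark variable is assumed to be a SparkSession; the table name and path are the placeholders from the question):
// Sketch: drop the table, then remove the leftover directory; the path is illustrative.
import org.apache.hadoop.fs.Path

spark.sql("DROP TABLE IF EXISTS something")
val tablePath = new Path("hdfs://path")
val fs = tablePath.getFileSystem(spark.sparkContext.hadoopConfiguration)
fs.delete(tablePath, true)   // recursive delete of the table location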

Hive PartitionFilter are not Applied

I am facing this issue with Hive.
When I query a table that is partitioned on a date column, for example
SELECT count(*) from table_name where date='2018-06-01', the query reads the entire table data and keeps running for hours.
Using EXPLAIN, I found that Hive is not applying the partition filter to the query.
I have double-checked that the table is partitioned on the date column using desc table_name.
The execution engine is Spark, and the data is stored in Azure Data Lake in Parquet format.
However, I have another table in the database for which the partition filter is applied and which executes as expected.
Can there be some issue with the Hive metadata, or is it something else?
Found the cause of this issue:
Hive wasn't applying the partition filters on some tables because those tables were cached.
When I restarted the Thrift server, the cache was cleared and the partition filters were applied.
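If restarting the Thrift server is too heavy-handed, a lighter-weight sketch is to drop the cached entries directly (standard Spark calls; the table name is illustrative):
// Clear the cached table (or the whole cache) so the partition filter is applied again.
spark.sql("UNCACHE TABLE table_name")
// or clear everything:
spark.catalog.clearCache()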

save dataframe as external hive table

I have used one way to save a dataframe as an external table using the Parquet file format, but is there some other way to save dataframes directly as an external table in Hive, like we have saveAsTable for managed tables?
You can do it this way:
df.write.format("orc").options(Map("path" -> "yourpath")).saveAsTable("anubhav")
In PySpark, an external table can be created as below:
df.write.option('path','<External Table Path>').saveAsTable('<Table Name>')
For an external table, don't use saveAsTable. Instead, save the data at the location of the external table specified by path, then add a partition so that it is registered with the Hive metadata. This will allow you to query from Hive by partition later.
// hc is HiveContext, df is DataFrame.
df.write.mode(SaveMode.Overwrite).parquet(path)
val sql =
s"""
|alter table $targetTable
|add if not exists partition
|(year=$year,month=$month)
|location "$path"
""".stripMargin
hc.sql(sql)
You can also save the dataframe with a manual CREATE TABLE:
dataframe.registerTempTable("temp_table");
hiveSqlContext.sql("create external table if not exists table_name as select * from temp_table");
The link below has a good explanation of CREATE TABLE: https://docs.databricks.com/spark/latest/spark-sql/language-manual/create-table.html
