How to drop partitions from Hive views? - apache-spark

I have a partitioned view and I am trying to drop an existing partition from the view definition using the Hive CLI. However, when I try to drop a partition, it throws the following error:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. null
Here is my create statement for view:
CREATE OR REPLACE VIEW test_view (logrecordtype, datacenter, ts_date, gen_date)
PARTITIONED ON (ts_date, gen_date)
AS SELECT logrecordtype, datacenter, ts_date, gen_date FROM test_table1 WHERE ts_date <= '20200720'
UNION ALL
SELECT logrecordtype, datacenter, ts_date, gen_date FROM test_table2 WHERE ts_date != '20200720';
The underlying tables test_table1, test_table2 are also partitioned by (ts_date, gen_date).
Drop partition command:
ALTER VIEW test_view DROP IF EXISTS PARTITION (ts_date = '20200720', gen_date = '2020072201')
I am able to add partitions and run SHOW PARTITIONS on the view, but DROP PARTITION fails.
My SHOW PARTITIONS command shows:
show partitions test_view;
ts_date=20200720/gen_date=2020072201

Related

Spark - nondeterministic expressions not allowed using DELETE FROM and EXISTS

The goal of this query is to delete the records that have the same keys as the new DataFrame.
This is the query that I'm executing:
val op = spark.sql(s"""
  DELETE FROM TABLE1 AS t
  WHERE EXISTS (
    SELECT 1
    FROM TABLE2 AS s
    WHERE t.DAY = s.DAY
      AND t.DATA_STREAM = s.DATA_STREAM
  )""")
The error that I'm getting is:
AnalysisException: nondeterministic expressions are only allowed in
Project, Filter, Aggregate, Window, or Generate, but found:
exists(t.DAY, t.DATA_STREAM)
in operator DeleteCommandEdge com.databricks.sql.transaction.tahoe.DeltaLog#49e1c8fd, exists#107963 [DAY#108138 && DATA_STREAM#108193]
;
DeleteCommandEdge com.databricks.sql.transaction.tahoe.DeltaLog#49e1c8fd, exists#107963 [DAY#108138 && DATA_STREAM#108193]
As far as I know, EXISTS is deterministic. Is there another way to achieve the same result with a different query?
The query is running on a Databricks cluster with Spark 3.2.1.
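No answer is recorded here, but one hedged alternative worth trying: Delta Lake on Databricks supports MERGE INTO with WHEN MATCHED THEN DELETE, which expresses the same keyed delete without placing a correlated EXISTS inside the DELETE statement. A minimal sketch, reusing the table and column names from the question (the DISTINCT keeps any target row from matching more than one source row, which MERGE requires):
// Sketch of a MERGE-based rewrite, assuming TABLE1 is a Delta table as described in the question.
val op = spark.sql(s"""
  MERGE INTO TABLE1 AS t
  USING (SELECT DISTINCT DAY, DATA_STREAM FROM TABLE2) AS s
  ON t.DAY = s.DAY AND t.DATA_STREAM = s.DATA_STREAM
  WHEN MATCHED THEN DELETE
""")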

Executing DDL statements inside Spark is slow and not in order

I'm trying to execute multiple commands with cqlSession inside Spark, but the commands are very slow, don't respect the order, and fail to create the table. I need to wait for the delete to finish before creating the new table.
delete:
connector.withSessionDo(cqlSession => cqlSession.execute("DROP TABLE IF EXISTS person"))
create:
connector.withSessionDo(cqlSession => cqlSession.execute("CREATE TABLE IF NOT EXISTS person(id text PRIMARY KEY)"))
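No answer is shown here, but one thing worth checking, offered only as a sketch: cqlSession.execute is synchronous, yet DDL statements can return before every node has agreed on the new schema, so a CREATE issued right after a DROP can race with the still-propagating drop. Assuming the DataStax Java driver 4.x CqlSession exposed by the Spark Cassandra Connector (where checkSchemaAgreement() is available), running both statements in one withSessionDo block and waiting for schema agreement in between would look like this:
connector.withSessionDo { cqlSession =>
  cqlSession.execute("DROP TABLE IF EXISTS person")
  // Block until all nodes agree on the schema change before recreating the table.
  while (!cqlSession.checkSchemaAgreement()) {
    Thread.sleep(200)
  }
  cqlSession.execute("CREATE TABLE IF NOT EXISTS person(id text PRIMARY KEY)")
}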

Hive permissions on view not inherited in Spark

I have a Parquet table test_table created in Hive. The location of one of the partitions is '/user/hive/warehouse/prod.db/test_table/date_id=20210701'.
I created a view based on this table:
create view prod.test_table_vw as
select date_id, src, telephone_number, action_date, duration from prod.test_table
Then I granted the SELECT privilege to a role:
GRANT SELECT ON TABLE prod.test_table_vw TO ROLE data_analytics;
Then a user with this role tries to query the data using spark.sql() from PySpark:
sql = spark.sql(f"""SELECT DISTINCT phone_number
FROM prod.test_table_vw
WHERE date_id=20210701""")
sql.show(5)
This code returns permission denied error:
Py4JJavaError: An error occurred while calling o138.collectToPython.
: org.apache.hadoop.security.AccessControlException: Permission denied: user=keytabuser, access=READ_EXECUTE, inode="/user/hive/warehouse/prod.db/test_table/date_id=20210701":hive:hive:drwxrwx--x
I can't grant select rights on the table itself because it contains sensitive fields; that's why I created a view with a limited list of columns.
Questions:
Why does Spark ignore Hive's SELECT permissions on the view? And how can this issue be resolved?
This is a Spark limitation:
When a Spark job accesses a Hive view, Spark must have privileges to
read the data files in the underlying Hive tables. Currently, Spark
cannot use fine-grained privileges based on the columns or the WHERE
clause in the view definition. If Spark does not have the required
privileges on the underlying data files, a SparkSQL query against the
view returns an empty result set, rather than an error.
Reference link.
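A workaround sometimes used in this situation (an assumption on my part, not part of the quoted answer): query the view through HiveServer2 over JDBC instead of through the Hive metastore catalog, so that Hive itself enforces the view-level grant rather than Spark reading the underlying HDFS files. Shown in Scala for brevity; the same options apply from PySpark. The host, port, and database are placeholders, and the Hive JDBC driver must be on the Spark classpath:
// Hedged sketch: read the view via HiveServer2 so Hive enforces the SELECT grant.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:hive2://hiveserver2-host:10000/prod") // placeholder host/port
  .option("driver", "org.apache.hive.jdbc.HiveDriver")
  .option("dbtable", "prod.test_table_vw")
  .load()
df.show(5)
The JDBC path is noticeably slower than reading the files directly, and quirks of the Hive JDBC driver (such as table-prefixed column names) may need extra handling, so treat this as a sketch rather than a drop-in fix.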

How to automatically sync a Hive external table with a MySQL table without using Sqoop?

I already have a MySQL table on my local (Linux) machine, and I have a Hive external table with the same schema as the MySQL table.
I want to sync the Hive external table whenever a new record is inserted or updated. A batch update, say hourly, is fine with me.
What is the best possible approach to achieve this without using Sqoop?
Thanks,
Sumit
Without Sqoop, you can create a table STORED BY JdbcStorageHandler. Project repository: https://github.com/qubole/Hive-JDBC-Storage-Handler. It will work like a usual Hive table, but the query will run on MySQL. Predicate pushdown will work.
DROP TABLE HiveTable;
CREATE EXTERNAL TABLE HiveTable(
  id INT,
  id_double DOUBLE,
  names STRING,
  test INT
)
STORED BY 'org.apache.hadoop.hive.jdbc.storagehandler.JdbcStorageHandler'
TBLPROPERTIES (
  "mapred.jdbc.driver.class"="com.mysql.jdbc.Driver",
  "mapred.jdbc.url"="jdbc:mysql://localhost:3306/rstore",
  "mapred.jdbc.username"="root",
  "mapred.jdbc.input.table.name"="JDBCTable",
  "mapred.jdbc.output.table.name"="JDBCTable",
  "mapred.jdbc.password"="",
  "mapred.jdbc.hive.lazy.split"="false"
);

Rebuild index failed on Hive on Azure HDInsight with Tez

I'm trying to create indexes on Hive on Azure HDInsight with Tez enabled.
I can successfully create the indexes, but I can't rebuild them: the job fails with this output:
Map 1: -/- Reducer 2: 0/1
Status: Failed
Vertex failed, vertexName=Map 1, vertexId=vertex_1421234198072_0091_1_01, diagnostics=[Vertex Input: measures initializer failed.]
Vertex killed, vertexName=Reducer 2, vertexId=vertex_1421234198072_0091_1_00, diagnostics=[Vertex received Kill in INITED state.]
DAG failed due to vertex failure. failedVertices:1 killedVertices:1
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask
I have created my table and indexes with the following job:
DROP TABLE IF EXISTS Measures;
CREATE TABLE Measures(
  topology string,
  val double,
  date timestamp
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE LOCATION 'wasb://<mycontainer>@<mystorage>.blob.core.windows.net/';
CREATE INDEX measures_index_topology ON TABLE Measures (topology) AS 'COMPACT' WITH DEFERRED REBUILD;
CREATE INDEX measures_index_date ON TABLE Measures (date) AS 'COMPACT' WITH DEFERRED REBUILD;
ALTER INDEX measures_index_topology ON Measures REBUILD;
ALTER INDEX measures_index_date ON Measures REBUILD;
Where am I wrong? And why does my index rebuild fail?
Best regards
It looks like Tez might have a problem with generating an index on an empty table. I was able to get the same error as you (without using the JSON SerDe), and if you look at the application logs for the DAG that fails, you might see something like:
java.lang.NullPointerException
at org.apache.hadoop.hive.ql.io.HiveInputFormat.init(HiveInputFormat.java:254)
at org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:299)
at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat.getSplits(TezGroupedSplitsInputFormat.java:68)
at org.apache.tez.mapreduce.hadoop.MRHelpers.generateOldSplits(MRHelpers.java:263)
at org.apache.tez.mapreduce.common.MRInputAMSplitGenerator.initialize(MRInputAMSplitGenerator.java:139)
at org.apache.tez.dag.app.dag.RootInputInitializerRunner$InputInitializerCallable$1.run(RootInputInitializerRunner.java:154)
at org.apache.tez.dag.app.dag.RootInputInitializerRunner$InputInitializerCallable$1.run(RootInputInitializerRunner.java:146)
...
If you populate the table with a single dummy record, it seems to work fine. I used:
INSERT INTO TABLE Measures SELECT market,0,0 FROM hivesampletable limit 1;
After that, the index rebuild was able to run without error.
