Hive permissions on view not inherited in Spark - apache-spark

I have a Parquet table test_table created in Hive. The location of one of its partitions is '/user/hive/warehouse/prod.db/test_table/date_id=20210701'.
I created a view based on this table:
create view prod.test_table_vw as
select date_id, src, telephone_number, action_date, duration from prod.test_table
Then I granted the SELECT privilege to a role:
GRANT SELECT ON TABLE prod.test_table_vw TO ROLE data_analytics;
Then a user with this role tries to query the data using spark.sql() from PySpark:
sql = spark.sql(f"""SELECT DISTINCT telephone_number
FROM prod.test_table_vw
WHERE date_id=20210701""")
sql.show(5)
This code returns a permission denied error:
Py4JJavaError: An error occurred while calling o138.collectToPython.
: org.apache.hadoop.security.AccessControlException: Permission denied: user=keytabuser, access=READ_EXECUTE, inode="/user/hive/warehouse/prod.db/test_table/date_id=20210701":hive:hive:drwxrwx--x
I can't grant SELECT rights on the table itself because it contains some sensitive fields; that's why I created a view with a limited list of columns.
Questions:
Why does Spark ignore Hive's SELECT permissions on the view, and how can this issue be resolved?

This is a Spark limitation:
When a Spark job accesses a Hive view, Spark must have privileges to
read the data files in the underlying Hive tables. Currently, Spark
cannot use fine-grained privileges based on the columns or the WHERE
clause in the view definition. If Spark does not have the required
privileges on the underlying data files, a SparkSQL query against the
view returns an empty result set, rather than an error.
Reference link.
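A common workaround, sketched below purely as an illustration: if you are allowed to create an additional table containing only the non-sensitive columns, you can materialize those columns and grant access on that table, so Spark reads files the role actually has HDFS permissions for. The table name prod.test_table_masked is hypothetical.
# Hypothetical workaround sketch, run by a user who can read prod.test_table.
spark.sql("""
    CREATE TABLE IF NOT EXISTS prod.test_table_masked
    STORED AS PARQUET
    AS SELECT date_id, src, telephone_number, action_date, duration
    FROM prod.test_table
""")
# The grant itself is still issued from Hive (e.g. beeline), not from Spark:
#   GRANT SELECT ON TABLE prod.test_table_masked TO ROLE data_analytics;
The downside is that the copy must be refreshed when new partitions land in the source table.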

Related

Spark SQL persistent view over jdbc data source

I want to create a persistent (global) view in spark sql that gets data from an underlying jdbc database connection. It works fine when I use a temporary (session-scoped) view as shown below but fails when trying to create a regular (persistent and global) view.
I don't understand why the latter should not work, and I couldn't find any docs or hints; all examples seem to use temporary views. Technically, I cannot see why it shouldn't work: the data is properly retrieved from the JDBC source in the temporary view, so it should not matter whether I "store" the query in a persistent view so that every call to the view retrieves data directly from the JDBC source.
Config.
tbl_in = myjdbctable
tbl_out = myview
db_user = 'myuser'
db_pw = 'mypw'
jdbc_url = 'jdbc:sqlserver://myserver.domain:1433;database=mydb'
This works.
query = f"""
create or replace temporary view {tbl_out}
using jdbc
options(
dbtable '{tbl_in}',
user '{db_user}',
password '{db_pw}',
url '{jdbc_url}'
)
"""
spark.sql(query)
> DataFrame[]
This does not work.
query = f"""
create or replace view {tbl_out}
using jdbc
options(
dbtable '{tbl_in}',
user '{db_user}',
password '{db_pw}',
url '{jdbc_url}'
)
"""
spark.sql(query)
> ParseException:
Error.
ParseException:
mismatched input 'using' expecting {'(', 'UP_TO_DATE', 'AS', 'COMMENT', 'PARTITIONED', 'TBLPROPERTIES'}(line 3, pos 0)
== SQL ==
create or replace view myview
using jdbc
^^^
options(
dbtable 'myjdbctable',
user 'myuser',
password '[REDACTED]',
url 'jdbc:sqlserver://myserver.domain:1433;database=mydb'
)
TL;DR: A spark sql table over jdbc source behaves like a view and so can be used like one.
It seems my assumptions about JDBC tables in Spark SQL were flawed. It turns out that a SQL table with a JDBC source (i.e. created via USING jdbc) is actually a live query against the JDBC source, not a one-off JDBC query at table creation time as I had assumed. In my mind it therefore behaves like a view: if the underlying JDBC source changes (e.g. new entries in a column), this is reflected in the Spark SQL table on read (e.g. SELECT FROM) without having to re-create the table.
It follows that a Spark SQL table over a JDBC source satisfies my requirement of having an always up-to-date reflection of the underlying table/SQL object in the JDBC source. Usually I would use a view for that. Maybe this is the reason why there is no persistent view over a JDBC source but only temporary views (which of course still make sense, as they are session-scoped). It should be noted that there are limits to how far the Spark SQL JDBC table behaves like a view, which may be surprising; in particular (a minimal sketch of creating such a table follows the list below):
if you add a column in the underlying JDBC table, it will not show up in the Spark SQL table
if you remove a column from the underlying JDBC table, an error will occur when the Spark SQL table is accessed (assuming the removed column was present during Spark SQL table creation)
if you remove the underlying JDBC table, an error will occur when the Spark SQL table is accessed
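The pattern described in the TL;DR can be sketched as follows; this snippet reuses the placeholders from the question and is illustrative, not code from the original post.
query = f"""
create table if not exists {tbl_out}
using jdbc
options(
dbtable '{tbl_in}',
user '{db_user}',
password '{db_pw}',
url '{jdbc_url}'
)
"""
spark.sql(query)
# Reads now go against the live JDBC source:
spark.sql(f"select * from {tbl_out}").show(5)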
The input of spark.sql should be DML (Data Manipulation Language). Its output is a dataframe.
In terms of best practices, you should avoid using DDL (Data Definition Language) with spark.sql. Even if some statements may work, it is not meant to be used this way.
If you want to run DDL, simply connect to your database using Python packages.
If you want to create a temp view in Spark, do it with the Spark DataFrame method createTempView (or createOrReplaceTempView), as sketched below.
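A minimal sketch of that DataFrame-based approach, reusing the connection placeholders from the question:
df = (spark.read.format("jdbc")
      .option("url", jdbc_url)
      .option("dbtable", tbl_in)
      .option("user", db_user)
      .option("password", db_pw)
      .load())
df.createOrReplaceTempView(tbl_out)  # session-scoped, like the working example above
spark.sql(f"select * from {tbl_out}").show(5)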

How to drop partitions from hive views?

I have a partitioned view and I am trying to drop an existing partition from the view definition using the Hive CLI. However, when I try to drop a partition, it throws the following error:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. null
Here is my create statement for the view:
CREATE or replace VIEW test_view (logrecordtype, datacenter, ts_date, gen_date)
PARTITIONED ON (ts_date, gen_date)
AS SELECT logrecordtype, datacenter, ts_date, gen_date from test_table1 where ts_date <= '20200720'
union all
select logrecordtype, datacenter, ts_date, gen_date from test_table2 where ts_date != '20200720';
The underlying tables test_table1, test_table2 are also partitioned by (ts_date, gen_date).
Drop partition command:
ALTER VIEW test_view DROP IF EXISTS PARTITION (ts_date = '20200720', gen_date = '2020072201')
I am able to add partitions and issue SHOW PARTITIONS on my view, but dropping a partition fails.
My SHOW PARTITIONS command shows:
show partitions test_view;
ts_date=20200720/gen_date=2020072201

Spark not able to write into a new hive table in partitioned and append mode

I created a new table in Hive, partitioned and in ORC format.
I am writing into this table from Spark using append mode, ORC format, and partitioning.
It fails with the exception:
org.apache.spark.sql.AnalysisException: The format of the existing table test.table1 is `HiveFileFormat`. It doesn't match the specified format `OrcFileFormat`.;
If I change the format from "orc" to "hive" while writing, it still fails with the exception:
Spark is not able to understand the underlying structure of the table.
So this issue happens because Spark is not able to write into a Hive table in append mode, since it can't create a new table. Overwrite succeeds because Spark creates the table again.
But my use case is to write in append mode from the start. insertInto also does not work, specifically for partitioned tables. I am pretty much blocked with my use case. Any help would be great.
Edit1:
Working on HDP 3.1.0 environment.
Spark Version is 2.3.2
Hive Version is 3.1.0
Edit 2:
// Reading the table
val inputdf=spark.sql("select id,code,amount from t1")
//writing into table
inputdf.write.mode(SaveMode.Append).partitionBy("code").format("orc").saveAsTable("test.t2")
Edit 3: Using insertInto()
val df2 =spark.sql("select id,code,amount from t1")
df2.write.format("orc").mode("append").insertInto("test.t2");
I get the error as:
20/05/17 19:15:12 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
20/05/17 19:15:12 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
20/05/17 19:15:13 WARN AcidUtils: Cannot get ACID state for test.t1 from null
20/05/17 19:15:13 WARN AcidUtils: Cannot get ACID state for test.t1 from null
20/05/17 19:15:13 WARN HiveMetastoreCatalog: Unable to infer schema for table test.t1 from file format ORC (inference mode: INFER_AND_SAVE). Using metastore schema.
If I rerun the insertInto command I get the following exception :
20/05/17 19:16:37 ERROR Hive: MetaException(message:The transaction for alter partition did not commit successfully.)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$alter_partitions_req_result$alter_partitions_req_resultStandardScheme.read(ThriftHiveMetastore.java)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$alter_partitions_req_result$alter_partitions_req_resultStandardScheme.read(ThriftHiveMetastore.java)
Error in hive metastore logs :
2020-05-17T21:17:43,891 INFO [pool-8-thread-198]: metastore.HiveMetaStore (HiveMetaStore.java:logInfo(907)) - 163: alter_partitions : tbl=hive.test.t1
2020-05-17T21:17:43,891 INFO [pool-8-thread-198]: HiveMetaStore.audit (HiveMetaStore.java:logAuditEvent(349)) - ugi=X#A.ORG ip=10.10.1.36 cmd=alter_partitions : tbl=hive.test.t1
2020-05-17T21:17:43,891 INFO [pool-8-thread-198]: metastore.HiveMetaStore (HiveMetaStore.java:alter_partitions_with_environment_context(5119)) - New partition values:[BR]
2020-05-17T21:17:43,913 ERROR [pool-8-thread-198]: metastore.ObjectStore (ObjectStore.java:alterPartitions(4397)) - Alter failed
org.apache.hadoop.hive.metastore.api.MetaException: Cannot change stats state for a transactional table without providing the transactional write state for verification (new write ID -1, valid write IDs null; current state null; new state {}
I was able to resolve the issue by using external tables in my use case. There is currently an open issue in Spark related to the ACID properties of Hive. Once I create the Hive table as external, I am able to do append operations on both partitioned and non-partitioned tables (a minimal sketch follows the link below).
https://issues.apache.org/jira/browse/SPARK-15348
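A minimal sketch of that external-table workaround, shown in PySpark; the schema, column types, and HDFS location are assumptions based on the columns in Edit 2, not details from the original post.
# Create the Hive table as EXTERNAL, then append from Spark.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS test.t2 (id INT, amount DOUBLE)
    PARTITIONED BY (code STRING)
    STORED AS ORC
    LOCATION '/warehouse/external/test/t2'
""")
# insertInto matches columns by position, with partition columns last;
# dynamic-partition inserts may also need this setting:
spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")
inputdf = spark.sql("select id, amount, code from test.t1")
inputdf.write.mode("append").insertInto("test.t2")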

How to get the Source Data When Ingestion Failure in KUSTO ADX

I have a base table in ADX Kusto DB.
.create table base (info:dynamic)
I have written a function which parses the dynamic column of the base table, extracts a few columns, and stores them in another table whenever the base table receives data (from Event Hub). Below are the function and its update policy:
.create function extractBase()
{
base
| evaluate bag_unpack(info)
| project tostring(column1), toreal(column2), toint(column3), todynamic(column4)
}
.alter table target_table policy update
@'[{"IsEnabled": true, "Source": "base", "Query": "extractBase()", "IsTransactional": false, "PropagateIngestionProperties": true}]'
Suppose the base table does not contain an expected column and an ingestion error happens. How do I get the source (row) for the failure?
When using .show ingestion failures, it displays the failure message. There is a column called IngestionSourcePath, but when I browse the URL I get a Resource Not Found exception.
If an ingestion failure happens, I need to store the particular row of the base table into an IngestionFailure table for further investigation.
In this case, your source data cannot "not have" a column defined by its schema.
If no value was ingested for some column in some row, a null value will be present there and the update policy will not fail.
Here the update policy will break if the original table row does not contain enough columns. Currently the source data for such errors is not emitted as part of the failure message.
In general, the source URI is only useful when you are ingesting data from blobs. In other cases the URI shown in the failed ingestion info is a URI of an internal blob that was created on the fly and that no one has access to.
However, there is a command that is missing from the documentation (we will make sure to update it) that allows you to duplicate (dump to a storage container you provide) the source data for the next failed ingestion into a specific table.
The syntax is:
.dup-next-failed-ingest into TableName to h@'Path to Azure blob container'
Here the path to Azure Blob container must include a writeable SAS.
The required permission to run this command is DB admin.
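For completeness, a hedged sketch of issuing this management command from Python with the azure-kusto-data package; the cluster URI, database name, table name, and blob path are all placeholders.
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

kcsb = KustoConnectionStringBuilder.with_az_cli_authentication(
    "https://mycluster.westeurope.kusto.windows.net")
client = KustoClient(kcsb)

# The blob container path must carry a writable SAS, as noted above.
command = (".dup-next-failed-ingest into TableName "
           "to h@'https://mystorage.blob.core.windows.net/dumps?<writable-SAS>'")
client.execute_mgmt("MyDatabase", command)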

Unable to select from SQL Database tables using node-ibm_db

I created a new table in the Bluemix SQL Database service by uploading a csv (baseball.csv) and took the default table name of "baseball".
I created a simple app in Node.js which is just trying to select data from the table with select * from baseball, but I keep getting the following error:
[IBM][CLI Driver][DB2/NT] SQL0204N "USERxxxx.BASEBALL" is an undefined name
Why can't it find my database table?
This issue seems independent of Bluemix; rather, it is a usage error.
This error is possibly caused by the following:
The object identified by name is not defined in the database.
User response
Ensure that the object name (including any required qualifiers) is correctly specified in the SQL statement and it exists.
Try running "list tables" from the command prompt to check whether your table name is spelled correctly.
http://www-01.ibm.com/support/knowledgecenter/SSEPGG_9.7.0/com.ibm.db2.luw.messages.sql.doc/doc/msql00204n.html?cp=SSEPGG_9.7.0%2F2-6-27-0-130
I created the table from the SQL Database web UI in Bluemix and took the default name of baseball. It looks like this creates a case-sensitive table name.
Unfortunately for me, the sql_db library (and all DB2 clients, I believe) auto-capitalizes the SQL query into "SELECT * FROM BASEBALL".
The solution was to either
A. Explicitly name my table BASEBALL in the web UI; or
B. Modify my sql query by quoting the table name:
select * from "baseball"
More info at http://www.ibm.com/developerworks/data/library/techarticle/0203adamache/0203adamache.html#N10121
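For illustration only: the question uses node-ibm_db, but the same quoting fix looks like this with the Python ibm_db driver (all connection values are placeholders).
import ibm_db

# Placeholder connection string.
conn = ibm_db.connect(
    "DATABASE=mydb;HOSTNAME=myhost;PORT=50000;PROTOCOL=TCPIP;UID=userxxxx;PWD=secret",
    "", "")

# Unquoted identifiers are folded to upper case (BASEBALL), so quote the
# lower-case name that the web UI created:
stmt = ibm_db.exec_immediate(conn, 'SELECT * FROM "baseball"')
row = ibm_db.fetch_assoc(stmt)
while row:
    print(row)
    row = ibm_db.fetch_assoc(stmt)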
