spark.table vs sql() AccessControlException - apache-spark

Trying to run
spark.table("db.table")
.groupBy($"date")
.agg(sum($"total"))
returns
org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to alter table. java.security.AccessControlException: Permission denied: user=user, access=WRITE, inode="/sources/db/table":tech_user:bgd_group:drwxr-x---
The same script, but written as
sql("SELECT sum(total) FROM db.table group by date").show()
returns the actual result.
I don't understand why this is happening. What is the first script trying to write exactly? Some staging result?
I have read permission for this table and I'm only trying to perform some aggregations.
Using Spark 2.2 for this.

In Spark 2.2, the default for spark.sql.hive.caseSensitiveInferenceMode was changed from NEVER_INFER to INFER_AND_SAVE. This mode causes Spark to infer a case-sensitive schema from the underlying files and then try to save it back into the Hive metastore. That save will fail if the user executing the command hasn't been granted permission to update the Hive metastore (HMS).
The obvious workaround is to set the inference mode back to NEVER_INFER, or to INFER_ONLY if the application relies on column names exactly as they appear in the files (CaseSensitivE).
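The setting can be flipped at runtime on the existing session (a sketch; the same call works from both Scala and PySpark, and spark is assumed to be the active SparkSession):
spark.conf.set("spark.sql.hive.caseSensitiveInferenceMode", "NEVER_INFER")  # back to the pre-2.2 behaviour, nothing is written to HMS
spark.conf.set("spark.sql.hive.caseSensitiveInferenceMode", "INFER_ONLY")   # or: keep case-sensitive names without persisting them
After that, the original spark.table("db.table") aggregation should run with read-only permissions.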

Related

Databricks Error: AnalysisException: Incompatible format detected. with Delta

I'm getting the following error when I attempt to write to my data lake with Delta on Databricks:
fulldf = spark.read.format("csv").option("header", True).option("inferSchema",True).load("/databricks-datasets/flights/")
fulldf.write.format("delta").mode("overwrite").save('/mnt/lake/BASE/flights/Full/')
The above produces the following error:
AnalysisException: Incompatible format detected.
You are trying to write to `/mnt/lake/BASE/flights/Full/` using Databricks Delta, but there is no
transaction log present. Check the upstream job to make sure that it is writing
using format("delta") and that you are trying to write to the table base path.
To disable this check, SET spark.databricks.delta.formatCheck.enabled=false
To learn more about Delta, see https://docs.databricks.com/delta/index.html
Any reason for the error?
Such an error usually occurs when there is already data in another format inside the folder, for example if Parquet or CSV files were written into it before. Remove the folder completely and try again, as in the sketch below.
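A minimal PySpark sketch of that, assuming the same path and DataFrame as in the question (dbutils is the Databricks file-system utility):
dbutils.fs.rm("/mnt/lake/BASE/flights/Full/", True)  # recursively wipe the old non-Delta files under the target path
fulldf.write.format("delta").mode("overwrite").save("/mnt/lake/BASE/flights/Full/")  # then rewrite as Delta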
This worked in my similar situation:
%sql CONVERT TO DELTA parquet.`/mnt/lake/BASE/flights/Full/`

User does not have privileges for ALTERTABLE_ADDCOLS while using spark.sql to read the data

Select query in spark.sql is resulting in the following error:
User *username* does not have privileges for ALTERTABLE_ADDCOLS
Spark version - 2.1.0
Trying to execute the following query:
dig = spark.sql("""select col1, col2 from dbname.tablename""")
It's caused by the spark.sql.hive.caseSensitiveInferenceMode property.
By default, Spark tries to infer the table's schema and then save it back as a table property, which requires ALTER TABLE privileges.
To avoid these messages you can change that configuration to INFER_ONLY.
Assuming a Spark session named spark, the code below should work:
spark.conf.set("spark.sql.hive.caseSensitiveInferenceMode", "INFER_ONLY")
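The property can also be set when the session is built, before any Hive table is read (a sketch; everything besides the config key and the query from the question is illustrative):
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .enableHiveSupport()
         .config("spark.sql.hive.caseSensitiveInferenceMode", "INFER_ONLY")
         .getOrCreate())

dig = spark.sql("""select col1, col2 from dbname.tablename""")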

Unable to query HIVE Parquet based EXTERNAL table from spark-sql

We have an external Hive table which is stored as Parquet. I am not the owner of the schema this Hive/Parquet table lives in, so I don't have much information about it.
The problem is that when I try to query that table from the spark-sql> shell prompt (not via Scala like spark.read.parquet("path")), I get 0 records and the message "Unable to infer schema". But when I created a managed table with CTAS in my personal schema, just for testing, I was able to query it from the spark-sql> shell prompt.
When I try it from spark-shell> via spark.read.parquet("../../00000_0").show(10), I am able to see the data.
So this makes it clear that something is wrong somewhere between
External Hive table - Parquet - Spark-SQL(shell)
If locating the schema were the issue, it should behave the same when accessed through the Spark session (spark.read.parquet("")).
I am using MapR 5.2, Spark version 2.1.0
Please suggest what the issue could be.

Create view from parquet unsupported?

I'm trying to create a permanent (as in non-temporary, not persisted) view from a Parquet source:
create view foo as select * from parquet.`file:///bar`;
and get the following exception:
java.lang.UnsupportedOperationException: unsupported plan Relation[id#67,col1#68,col2#69,col3#70,col4#71] parquet
Creating a temporary view with the same query works just fine. Am I missing something or is this seemingly obvious feature just not implemented yet? I'm running Spark SQL 2.0.0-rc1 (https://github.com/apache/spark/tree/v2.0.0-rc1).

DataStax Cassandra File System - Fixed Width Text File - Hive Integration Issue

I'm trying to read a fixed-width text file stored in the Cassandra File System (CFS) using Hive. I'm able to query the file when I run it from the Hive client. However, when I try to run it through the Hadoop Hive JDBC driver, it says the table is not available or the connection is bad. Below are the steps I followed.
Input file (employees.dat):
21736Ambalavanar Thirugnanam BOY-EAG 2005-05-091992-11-18
21737Anand Jeyamani BOY-AST 2005-05-091985-02-12
31123Muthukumar Rajendran BOY-EES 2009-08-121983-02-23
Starting Hive Client
bash-3.2# dse hive;
Logging initialized using configuration in file:/etc/dse/hive/hive-log4j.properties
Hive history file=/tmp/root/hive_job_log_root_201209250900_157600446.txt
hive> use HiveDB;
OK
Time taken: 1.149 seconds
Creating Hive External Table pointing to fixed width format text file
hive> CREATE EXTERNAL TABLE employees (empid STRING, firstname STRING, lastname STRING, dept STRING, dateofjoining STRING, dateofbirth STRING)
> ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
> WITH SERDEPROPERTIES ("input.regex" = "(.{5})(.{25})(.{25})(.{15})(.{10})(.{10}).*" )
> LOCATION 'cfs://hostname:9160/folder/';
OK
Time taken: 0.524 seconds
Do a select * from table.
hive> select * from employees;
OK
21736 Ambalavanar Thirugnanam BOY-EAG 2005-05-09 1992-11-18
21737 Anand Jeyamani BOY-AST 2005-05-09 1985-02-12
31123 Muthukumar Rajendran BOY-EES 2009-08-12 1983-02-23
Time taken: 0.698 seconds
Do a select with specific fields from the Hive table; this throws a permission error (first issue):
hive> select empid, firstname from employees;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
java.io.IOException: The ownership/permissions on the staging directory cfs:/tmp/hadoop-root/mapred/staging/root/.staging is not as expected. It is owned by root and permissions are rwxrwxrwx. The directory must be owned by the submitter root or by root and permissions must be rwx------
at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:108)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:856)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:416)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:824)
at org.apache.hadoop.hive.ql.exec.ExecDriver.execute(ExecDriver.java:452)
at org.apache.hadoop.hive.ql.exec.MapRedTask.execute(MapRedTask.java:136)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:133)
at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57)
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1332)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1123)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:931)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:255)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:212)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:403)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:671)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:554)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Job Submission failed with exception 'java.io.IOException(The ownership/permissions on the staging directory cfs:/tmp/hadoop-root/mapred/staging/root/.staging is not as expected. It is owned by root and permissions are rwxrwxrwx. The directory must be owned by the submitter root or by root and permissions must be rwx------)'
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MapRedTask
The second issue is that when I try to run the select * query through the Hive JDBC driver (outside of the DSE/Cassandra nodes), it says the table employees is not available. The external table I created acts like a temporary table and does not get persisted. When I run 'hive> show tables;', the employees table is not listed. Can anyone please help me figure out the problem?
I don't have an immediate answer for the first issue, but the second looks like it's due to a known issue.
There is a bug in DSE 2.1 which drops external tables created from CFS files from the metastore when show tables is run. Only the table metadata is removed; the data remains in CFS, so if you recreate the table definition you shouldn't have to reload it. Tables backed by Cassandra ColumnFamilies are not affected by this bug. This has been fixed in the 2.2 release of DSE, which is due for release imminently.
I'm not familiar with the Hive JDBC driver, but if it issues a Show Tables command at any point, it could be triggering this bug.
