Create view from parquet unsupported? - apache-spark

I'm trying to create a permanent (i.e. non-temporary, though not persisted/materialized) view from a parquet source:
create view foo as select * from parquet.`file:///bar`;
and get the following exception:
java.lang.UnsupportedOperationException: unsupported plan Relation[id#67,col1#68,col2#69,col3#70,col4#71] parquet
Creating a temporary view with the same query works just fine. Am I missing something or is this seemingly obvious feature just not implemented yet? I'm running Spark SQL 2.0.0-rc1 (https://github.com/apache/spark/tree/v2.0.0-rc1).
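For comparison, here is a minimal sketch of the temporary-view variant that does work, run from a Scala spark-shell (foo_tmp is just an illustrative name):
// The temporary view succeeds where CREATE VIEW fails, but it only lives
// for the current session.
spark.sql("CREATE TEMPORARY VIEW foo_tmp AS SELECT * FROM parquet.`file:///bar`")
spark.sql("SELECT * FROM foo_tmp").show()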

Related

Databricks Error: AnalysisException: Incompatible format detected. with Delta

I'm getting the following error when I attempt to write to my data lake with Delta on Databricks:
fulldf = spark.read.format("csv").option("header", True).option("inferSchema",True).load("/databricks-datasets/flights/")
fulldf.write.format("delta").mode("overwrite").save('/mnt/lake/BASE/flights/Full/')
The above produces the following error:
AnalysisException: Incompatible format detected.
You are trying to write to `/mnt/lake/BASE/flights/Full/` using Databricks Delta, but there is no
transaction log present. Check the upstream job to make sure that it is writing
using format("delta") and that you are trying to write to the table base path.
To disable this check, SET spark.databricks.delta.formatCheck.enabled=false
To learn more about Delta, see https://docs.databricks.com/delta/index.html
Any reason for the error?
Such an error usually occurs when the folder already contains data in another format, for example Parquet or CSV files written to it earlier. Remove the folder completely and try again.
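A rough Databricks-notebook sketch of that route in Scala (dbutils is available in Databricks notebooks; the delete is recursive and destructive, so only run it if nothing else needs those files):
// Clear out the non-Delta files sitting in the target folder...
dbutils.fs.rm("/mnt/lake/BASE/flights/Full/", true)   // recursive delete
// ...then rewrite the data as Delta, mirroring the code in the question.
val fulldf = spark.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/databricks-datasets/flights/")
fulldf.write.format("delta").mode("overwrite").save("/mnt/lake/BASE/flights/Full/")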
This worked in my similar situation:
%sql CONVERT TO DELTA parquet.`/mnt/lake/BASE/flights/Full/`

AssertionError: assertion failed: No plan for DeleteFromTable In Databricks

Is there any reason this command works well:
%sql SELECT * FROM Azure.Reservations WHERE timestamp > '2021-04-02'
returning 2 rows, while the below:
%sql DELETE FROM Azure.Reservations WHERE timestamp > '2021-04-02'
fails with:
Error in SQL statement: AssertionError: assertion failed: No plan for
DeleteFromTable (timestamp#394 > 1617321600000000)
?
I'm new to Databricks, but I'm sure I ran a similar command on another table (without a WHERE clause). The table was created based on a Parquet file.
DELETE FROM (and similarly UPDATE or MERGE) isn't supported on Parquet files; on Databricks it's currently supported only for the Delta format. You can convert your Parquet files into Delta using CONVERT TO DELTA, and then this command will work for you.
Another alternative is to read the Parquet files, filter them down to the rows you want to keep, and overwrite the Parquet files, as sketched below.
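A rough Scala sketch of that filter-and-overwrite approach (the timestamp column comes from the question; both paths are hypothetical):
import org.apache.spark.sql.functions.col

// Keep only the rows the DELETE would NOT have removed, write them to a
// staging path, then swap it in place of the original (reading and
// overwriting the same parquet path in a single job is unsafe).
val src = "/mnt/lake/BASE/reservations/"            // hypothetical source path
val dst = "/mnt/lake/BASE/reservations_filtered/"   // hypothetical staging path
spark.read.parquet(src)
  .filter(col("timestamp") <= "2021-04-02")
  .write.mode("overwrite").parquet(dst)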
It could also be that you are trying to DELETE from a VIEW rather than a table (in case the target isn't backed by a Parquet file at all).
Unfortunately, there is no easy way to tell a VIEW from a TABLE in Databricks just by its name; you can check whether it's indeed a view with:
SHOW VIEWS FROM Azure like 'reser*'
or, if it's a table:
SHOW TABLES FROM Azure like 'reser*'
See the SHOW TABLES and SHOW VIEWS syntax documentation for details.
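If you'd rather check from Scala than SQL, the Catalog API also exposes the object type (a sketch; the Azure database name is taken from the question):
// listTables returns both tables and views; tableType distinguishes them
// ("VIEW" vs "MANAGED"/"EXTERNAL").
spark.catalog.listTables("Azure")
  .filter(t => t.name.toLowerCase.startsWith("reser"))
  .select("name", "tableType")
  .show(false)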
Or just delete directly from the Delta path:
%sql
delete from delta.`/mnt/path`
where x
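The same delete can also be issued through the Delta Lake Scala API, assuming the Delta library is on the classpath (the path placeholder and the predicate below are taken from this question):
import io.delta.tables.DeltaTable

// Programmatic equivalent of: DELETE FROM delta.`/mnt/path` WHERE <condition>
DeltaTable.forPath(spark, "/mnt/path")
  .delete("timestamp > '2021-04-02'")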

spark.table vs sql() AccessControlException

Trying to run
spark.table("db.table")
.groupBy($"date")
.agg(sum($"total"))
returns
org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to alter table. java.security.AccessControlException: Permission denied: user=user, access=WRITE, inode="/sources/db/table":tech_user:bgd_group:drwxr-x---
the same script but written as
sql("SELECT sum(total) FROM db.table group by date").show()
returns the actual result.
I don't understand why this is happening. What is the first script trying to write exactly? Some staging result?
I have read permission for this table and I'm only trying to perform some aggregations.
Using Spark 2.2 for this.
In Spark 2.2, the default for spark.sql.hive.caseSensitiveInferenceMode was changed from NEVER_INFER to INFER_AND_SAVE. This mode causes Spark to infer a case-sensitive schema from the underlying files and try to save it into the Hive metastore. That save fails if the user executing the command wasn't granted permission to update the HMS.
The obvious workaround is to set the inference mode back to NEVER_INFER, or to INFER_ONLY if the application relies on column names exactly as they appear in the files (case-sensitive).
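For example, a minimal sketch of pinning that setting when the session is created (the app name is hypothetical; the same property can also be passed via --conf to spark-submit):
import org.apache.spark.sql.SparkSession

// Prevent Spark 2.2 from trying to write the inferred case-sensitive schema
// back into the Hive metastore (which needs write access the job may not have).
val spark = SparkSession.builder()
  .appName("daily-aggregation")    // hypothetical app name
  .enableHiveSupport()
  .config("spark.sql.hive.caseSensitiveInferenceMode", "NEVER_INFER")
  .getOrCreate()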

Spark returns Empty DataFrame but Populated in Hive

I have a table in Hive:
db.table_name
When I run the following in Hive, I get results back:
SELECT * FROM db.table_name;
When I run the following in a spark-shell
spark.read.table("db.table_name").show
It shows nothing. Similarly
sql("SELECT * FROM db.table_name").show
Also shows nothing. Selecting arbitrary columns before the show displays nothing either, and performing a count reports that the table has 0 rows.
Running the same queries works against other tables in the same database.
Spark Version: 2.2.0.cloudera1
The table is created using
table.write.mode(SaveMode.Overwrite).saveAsTable("db.table_name")
And if I read the Parquet files directly, it works:
spark.read.parquet(<path-to-files>).show
EDIT:
I'm currently working around this by describing the table to get its location and then using spark.read.parquet on that path.
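That workaround can be scripted roughly like this (a sketch; the exact label of the location row in DESCRIBE FORMATTED output can vary between Spark versions):
import org.apache.spark.sql.functions.col

// Pull the storage location out of the metastore, then bypass the metastore
// and read the parquet files at that location directly.
val location = spark.sql("DESCRIBE FORMATTED db.table_name")
  .filter(col("col_name").startsWith("Location"))
  .select("data_type")
  .first()
  .getString(0)
  .trim
spark.read.parquet(location).show()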
Have you refreshed the table metadata? You may need to refresh the table to see the new data.
spark.catalog.refreshTable("my_table")
I solved the problem by using
query_result.write.mode(SaveMode.Overwrite).format("hive").saveAsTable("table")
which stores the results as text files.
There is probably some incompatibility with Hive's Parquet handling.
I also found a Cloudera report about it (CDH Release Notes): they recommend creating the Hive table manually and then loading the data from a temporary table or via a query.
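A rough sketch of that recommendation (the column list is illustrative only, and it assumes the old copy of the table, written with saveAsTable, has been dropped first):
// Create the Hive table manually with an explicit schema and Hive's own Parquet storage...
spark.sql("""
  CREATE TABLE db.table_name (id BIGINT, total DOUBLE)
  STORED AS PARQUET
""")
// ...then load it from a temporary view over the query result
// (query_result as in the answer above).
query_result.createOrReplaceTempView("query_result_tmp")
spark.sql("INSERT OVERWRITE TABLE db.table_name SELECT id, total FROM query_result_tmp")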

Unable to query HIVE Parquet based EXTERNAL table from spark-sql

We have an external Hive table stored as Parquet. I am not the owner of the schema this table lives in, so I don't have much information about it.
The problem is that when I try to query that table from the spark-sql> shell prompt (not via Scala, e.g. spark.read.parquet("path")), I get 0 records together with the message "Unable to infer schema". But when I created a managed table using CTAS in my personal schema, just for testing, I was able to query it from the spark-sql> shell prompt.
When I try it from spark-shell> via spark.read.parquet("../../00000_0").show(10), I can see the data.
So this makes it clear that something is wrong between the external Hive table, Parquet, and spark-sql (shell).
If locating the schema were the issue, it should behave the same when accessed through the Spark session (spark.read.parquet("")).
I am using MapR 5.2, Spark version 2.1.0
Please suggest what the issue could be.
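If it helps to narrow things down, one way is to compare what the metastore exposes for the table against the files themselves from a spark-shell (a diagnostic sketch only, not a fix; db.ext_table is a stand-in for the real table name, and the file path comes from the question):
// What Spark sees when it goes through the Hive metastore:
spark.sql("DESCRIBE FORMATTED db.ext_table").show(100, false)
spark.table("db.ext_table").count()
// What the parquet files themselves contain, bypassing the metastore:
spark.read.parquet("../../00000_0").printSchema()
spark.read.parquet("../../00000_0").count()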
