I am saving my spark dataframe on azure databricks and create delta lake table.
It works fine, however I am getting this warning message while execution.
Question- Why I am still getting this message, even with my table is delta table. What is wrong with my approach, any inputs is greatly appreciated.
Warning Message
This query contains a highly selective filter. To improve the performance of queries, convert the table to Delta and run the OPTIMIZE ZORDER BY command on the table
Code
dfMerged.write\
.partitionBy("Date")\
.mode("append")\
.format("delta")\
.option("overwriteSchema", "true")\
.save("/mnt/path..")
spark.sql("CREATE TABLE DeltaUDTable USING DELTA LOCATION '/mnt/path..'")
Some more details
I've mounted azure storage gen 2 to above mount location.
databricks runtime - 6.4 (includes Apache Spark 2.4.5, Scala 2.11)
The warning message is clearly misleading as you already have a Delta option. Ignore it.
df.write.mode("overwirte").saveAsTable("table_loc")
Related
I create table on Hadoop cluster using PySpark SQL:spark.sql("CREATE TABLE my_table (...) PARTITIONED BY (...) STORED AS Parquet") and load some data with: spark.sql("INSERT INTO my_table SELECT * FROM my_other_table"), however the resulting files do not seem to be Parquet files, they're missing ".snappy.parquet" extension.
The same problem occurs when repeating those steps in Hive.
But surprisingly when I create table using PySpark DataFrame: df.write.partitionBy("my_column").saveAsTable(name="my_table", format="Parquet")
everything works just fine.
So, my question is: what's wrong with the SQL way of creating and populating Parquet table?
Spark version 2.4.5, Hive version 3.1.2.
Update (27 Dec 2022 after #mazaneicha answer)
Unfortunately, there is no parquet-tools on the cluster I'm working with, so the best I could do is to check the content of the files with hdfs dfs -tail (and -head). And in all cases there is "PAR1" both at the beginning and at the end of the file. And even more - the meta-data of parquet version (implementation):
Method # of files Total size Parquet version File name
Hive Insert 8 34.7 G Jparquet-mr version 1.10.0 xxxxxx_x
PySpark SQL Insert 8 10.4 G Iparquet-mr version 1.6.0 part-xxxxx-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx.c000
PySpark DF insertInto 8 10.9 G Iparquet-mr version 1.6.0 part-xxxxx-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx.c000
PySpark DF saveAsTable 8 11.5 G Jparquet-mr version 1.10.1 part-xxxxx-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx-c000.snappy.parquet
(To create the same number of files I used "repartition" with df, and "distribute by" with SQL).
So, considering the above mentioned, it's still not clear:
Why there is no file extension in 3 out of 4 cases?
Why files created with Hive are so big? (no compression, I suppose).
Why PySpark SQL and PySpark Dataframe versions/implementations of parquet differ and how set them explicitly?
File format is not defined by the extension, but rather by the contents. You can quickly check if format is parquet by looking for magic bytes PAR1 at the very beginning and the very end of a file.
For in-depth format, metadata and consistency checking, try opening a file with parquet-tools.
Update:
As mentioned in online docs, parquet is supported by Spark as one of the many data sources via its common DataSource framework, so that it doesn't have to rely on Hive:
"When reading from Hive metastore Parquet tables and writing to non-partitioned Hive metastore Parquet tables, Spark SQL will try to use its own Parquet support instead of Hive SerDe for better performance..."
You can find and review this implementation in Spark git repo (its open-source! :))
I want to add some columns in a Delta table using spark sql, but it showing me error like :
ALTER ADD COLUMNS does not support datasource table with type org.apache.spark.sql.delta.sources.DeltaDataSource.
You must drop and re-create the table for adding the new columns.
Is there any way to alter my table in delta lake?
Thanks a lot for this question! I learnt quite a lot while hunting down a solution 👍
This is Apache Spark 3.2.1 and Delta Lake 1.1.0 (all open source).
The reason for the error is that Spark SQL (3.2.1) supports ALTER ADD COLUMNS statement for csv, json, parquet, orc data sources only. Otherwise, it throws the exception.
I assume you ran ALTER ADD COLUMNS using SQL (as the root cause would've been caught earlier if you'd used Scala API or PySpark).
That leads us to org.apache.spark.sql.delta.catalog.DeltaCatalog that has to be "installed" to Spark SQL for it to recognize Delta Lake as a supported datasource. This is described in the official Quickstart.
For PySpark (on command line) it'd be as follows:
./bin/pyspark \
--packages io.delta:delta-core_2.12:1.1.0 \
--conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog
In order to extend Spark SQL with Delta Lake's features (incl. ALTER ADD COLUMNS support) you have to add the following configuration properties for DeltaSparkSessionExtension and DeltaCatalog:
spark.sql.extensions
spark.sql.catalog.spark_catalog
They are mandatory (and optional in managed environments like Azure Databricks that were mentioned as working fine for obvious reasons).
ALTER ADD COLUMNS does not support datasource table with type
org.apache.spark.sql.delta.sources.DeltaDataSource. You must drop and
re-create the table for adding the new columns.
You are facing this issue because of different versions in configuration. Make sure you are using compatible version of spark and scala.
As mentioned in comment you can alter table in delta lake, follow link mentioned by #David דודו Markovitz
Refer this similar issue
We are using spark for reading/writing data in delta format stored in HDFS (Databricks Delta table version 0.5.0).
We would like to utilize the power of Hive to interact with the delta tables.
How can we register an existing data in delta format from a path on HDFS to Hive?
Please note that currently we are running spark (2.4.0) on cloudera platform (CDH 6.3.3)
The only way I can do this so far is by registering it as an unmanaged table. The most significant difference, as far as I can tell, is that if you drop an unmanaged table, it does not drop the underlying data.
I have a simple Hive-External table which is created on top of S3 (Files are in CSV format). When I run the hive query it shows all records and partitions.
However when I use the same table in Spark ( where the Spark SQL has a where condition on the partition column) it does not show that a partition filter is applied. However for a Hive Managed table , Spark is able to use the information of partitions and apply the partition filter.
Is there any flag or setting that can help me make use of partitions of Hive external tables in Spark ? Thanks.
Update :
For some reason, only the spark plan is not showing the Partition Filters. However, when you look at the data loaded its only loading the data needed from the partitions.
Ex: Where rating=0 , loads only one file of 1 MB, when I don't have filter its reads all 3 partition for 3 MB
tl; dr set the following before the running sql for external table
spark.sql("set spark.sql.hive.convertMetastoreOrc=true")
The difference in behaviour is not because of extenal/managed table.
The behaviour depends on two factors
1. Where the table was created(Hive or Spark)
2. File format (I believe it is ORC in this case, from the screen capture)
Where the table was created(Hive or Spark)
If the table was create using Spark APIs, it is considered as Datasource table.
If the table was created usng HiveQL, it is considered as Hive native table.
The metadata of both these tables are store in Hive metastore, the only difference is in the provider field of TBLPROPERTIES of the tables(describe extended <tblName>). The value of the property is orcor empty in Spark table and hive for a Hive.
How spark uses this information
When provider is not hive(datasource table), Spark uses its native way of processing the data.
If provider is hive, Spark uses Hive code to process the data.
Fileformat
Spark gives config flag to instruct the engine to use Datasource way of processing the data for the floowing file formats = Orc and Parquet
Flags:
Orc
val CONVERT_METASTORE_ORC = buildConf("spark.sql.hive.convertMetastoreOrc")
.doc("When set to true, the built-in ORC reader and writer are used to process " +
"ORC tables created by using the HiveQL syntax, instead of Hive serde.")
.booleanConf
.createWithDefault(true)
Parquet
val CONVERT_METASTORE_PARQUET = buildConf("spark.sql.hive.convertMetastoreParquet")
.doc("When set to true, the built-in Parquet reader and writer are used to process " +
"parquet tables created by using the HiveQL syntax, instead of Hive serde.")
.booleanConf
.createWithDefault(true)
I also ran into this kind of problem having multiple joins of internal and external tables.
None of the tricks work including:
spark.sql("set spark.sql.hive.convertMetastoreParquet=false")
spark.sql("set spark.sql.hive.metastorePartitionPruning=true")
spark.sql("set spark.sql.hive.caseSensitiveInferenceMode=NEVER_INFER")
anyone who knows how to solve this problem.
I am having an issue with pyspark sql module. I created a partitioned table and saved it as parquet file into hive table by running spark job after multiple transformations.
Data load is successful into hive and also able to query the data. But when I try to query the same data from spark it says file path doesn't exist.
java.io.FileNotFoundException: File hdfs://localhost:8020/data/path/of/partition partition=15f244ee8f48a2f98539d9d319d49d9c does not exist
The partition which is mentioned in above error was the old partitioned column data which doesn't even exist now.
I have run the spark job which populates a new partition value.
I searched for solutions but all I can see is people say there was no issue in spark version 1.4 and there is an issue in 1.6
Can someone please suggest me the solution for this problem.