writeStream with format("delta") is not producing a Delta table - apache-spark

I am using Auto Loader in Databricks. However, when I save the stream as a Delta table, the generated table is NOT Delta.
(df.writeStream  # df: the streaming DataFrame produced by Auto Loader
    .format("delta")  # <-----------
    .option("checkpointLocation", checkpoint_path)
    .option("path", output_path)
    .trigger(availableNow=True)
    .toTable(table_name))
delta.DeltaTable.isDeltaTable(spark, table_name)
> false
Why is the generated table not delta format?
If I read the table using spark.read.table(table_name) it works, but if I try to use Redash or the built-in Databricks Data tab it produces an error and the schema is not parsed correctly.
An error occurred while fetching table: table_name
com.databricks.backend.common.rpc.DatabricksExceptions$SQLExecutionException: org.apache.spark.sql.AnalysisException: Incompatible format detected
A transaction log for Databricks Delta was found at s3://delta/_delta_log,
but you are trying to read from s3://delta using format("parquet"). You must use
'format("delta")' when reading and writing to a delta table.

Could you try this:
(
  df  # df: your streaming DataFrame
  .writeStream
  .option("checkpointLocation", <checkpointLocation_path>)
  .trigger(availableNow=True)
  .table("<table_name>")
)
Instead of toTable, can you try table?
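For reference, here is a quick way to double-check what the stream actually produced (a sketch, not part of the original answer; table_name and output_path are the same variables assumed above):

from delta.tables import DeltaTable

# DESCRIBE DETAIL reports the underlying format and location of the table.
spark.sql(f"DESCRIBE DETAIL {table_name}").select("format", "location").show(truncate=False)

# isDeltaTable expects the table's storage path (its root directory), not a metastore table name.
print(DeltaTable.isDeltaTable(spark, output_path))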

Related

Querying snowflake metadata using spark connector

I want to run a 'SHOW TABLES' statement through the spark-snowflake connector. I am running Spark on the Databricks platform and getting an "Object 'SHOW' does not exist or not authorized" error.
df = spark.read \
.format("snowflake") \
.options(**options) \
.option("query", "show tables") \
.load()
df.show()
A sample query like "SELECT 1" works as expected.
I know I could install the native Snowflake Python driver, but I'd like to avoid that solution if possible because I already have a session open through Spark.
There is also a way using the "Utils.runQuery" function, but I understand that it is only relevant for DDL statements (it doesn't return the actual results).
Thanks!
When using DataFrames, the Snowflake connector supports SELECT queries only.
This is documented in our docs.
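Since only SELECT statements go through the DataFrame path, one possible workaround (not part of the original answer; the schema filter is a placeholder) is to read the same metadata from Snowflake's INFORMATION_SCHEMA with a plain SELECT:

df = (spark.read
      .format("snowflake")
      .options(**options)
      .option("query",
              "SELECT table_name, table_schema, table_type "
              "FROM information_schema.tables "
              "WHERE table_schema = 'PUBLIC'")  # placeholder schema
      .load())
df.show()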

JDBC not truncating Postgres table on pyspark

I'm using the following code to truncate a table before inserting data into it.
df.write \
.option("driver", "org.postgresql:postgresql:42.2.16") \
.option("truncate", True) \
.jdbc(url=pgsql_connection, table="service", mode='append', properties=properties_postgres)
However, it is not working; the table still has the old data. I'm using append because I don't want the DB to drop and recreate the table every time.
I've tried .option("truncate", "true") but that didn't work either.
I get no error messages. How can I solve this problem and truncate my table using .option?
You need to use overwrite mode
# Note: the "driver" option expects the JDBC driver class name.
df.write \
    .option("driver", "org.postgresql.Driver") \
    .option("truncate", True) \
    .jdbc(url=pgsql_connection, table="service", mode='overwrite', properties=properties_postgres)
As given in the documentation (https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html):
truncate: When SaveMode.Overwrite is enabled, this option causes Spark to
truncate an existing table instead of dropping and recreating it.
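The same write expressed with the generic format("jdbc") writer, in case that reads more clearly (a sketch; the credential variables are placeholders):

(df.write
   .format("jdbc")
   .option("url", pgsql_connection)            # e.g. jdbc:postgresql://host:5432/db
   .option("dbtable", "service")
   .option("driver", "org.postgresql.Driver")
   .option("user", db_user)                    # placeholder credentials
   .option("password", db_password)
   .option("truncate", "true")                 # only honored together with overwrite mode
   .mode("overwrite")
   .save())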

How to specify delta table properties when writing a streaming spark dataframe

Let's assume I have a streaming dataframe, and I'm writing it to Databricks Delta Lake:
someStreamingDf.writeStream
.format("delta")
.outputMode("append")
.start("targetPath")
and then creating a delta table out of it:
spark.sql("CREATE TABLE <TBL_NAME> USING DELTA LOCATION '<targetPath>'
TBLPROPERTIES ('delta.autoOptimize.optimizeWrite'=true)")
which fails with AnalysisException: The specified properties do not match the existing properties at <targetPath>.
I know I can create a table beforehand:
CREATE TABLE <TBL_NAME> (
//columns
)
USING DELTA LOCATION '<targetPath>'
TBLPROPERTIES (
"delta.autoOptimize.optimizeWrite" = true,
....
)
and then just write to it, but writing this SQL with all the columns and their types looks like a bit of extra/unnecessary work. So is there a way to specify these TBLPROPERTIES while writing to a delta table (for the first time) and not beforehand?
If you look into the documentation, you can see that you can set the following property:
spark.conf.set(
"spark.databricks.delta.properties.defaults.autoOptimize.optimizeWrite", "true")
and then all newly created tables will have delta.autoOptimize.optimizeWrite set to true.
Another approach: create the table without the option, and then try to do ALTER TABLE ... SET TBLPROPERTIES (not tested, though).
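A minimal sketch of that second approach (untested here, as noted above; <TBL_NAME> and <targetPath> are the same placeholders as in the question):

# Create the table without TBLPROPERTIES first...
spark.sql("CREATE TABLE <TBL_NAME> USING DELTA LOCATION '<targetPath>'")

# ...then attach the property afterwards.
spark.sql("ALTER TABLE <TBL_NAME> "
          "SET TBLPROPERTIES ('delta.autoOptimize.optimizeWrite' = 'true')")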

How to saveAsTable to s3?

It looks like this will error out
df.write()
.option("mode", "DROPMALFORMED")
.option("compression", "snappy")
.mode("overwrite")
.bucketBy(32,"column")
.sortBy("column")
.parquet("s3://....");
With error
Exception in thread "main" org.apache.spark.sql.AnalysisException: 'save' does not support bucketing right now; at org.apache.spark.sql.DataFrameWriter.assertNotBucketed(DataFrameWriter.scala:314)
I see saveAsTable("myfile") is still supported but it only writes locally. How would I take that saveAsTable(...) output and put it on s3 after the job is done?
You can use something like below:
df
.write()
.option("mode", "DROPMALFORMED")
.option("compression", "snappy")
.option("path","s3://....")
.mode("overwrite")
.format("parquet")
.bucketBy(32,"column").sortBy("column")
.saveAsTable("tableName");
This will create an external table pointing to the S3 location.
The .option("path","s3://....") is the catch here.

Spark partitions: creating RDD partitions but not Hive partitions

This is a follow-up to Save Spark dataframe as dynamic partitioned table in Hive. I tried to use the suggestions in the answers but couldn't make it work in Spark 1.6.1.
I am trying to create partitions programmatically from a DataFrame. Here is the relevant code (adapted from a Spark test):
hc.setConf("hive.metastore.warehouse.dir", "tmp/tests")
// hc.setConf("hive.exec.dynamic.partition", "true")
// hc.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
hc.sql("create database if not exists tmp")
hc.sql("drop table if exists tmp.partitiontest1")
Seq(2012 -> "a").toDF("year", "val")
.write
.partitionBy("year")
.mode(SaveMode.Append)
.saveAsTable("tmp.partitiontest1")
hc.sql("show partitions tmp.partitiontest1").show
Full file is here: https://gist.github.com/SashaOv/7c65f03a51c7e8f9c9e018cd42aa4c4a
Partitioned files are created fine on the file system but Hive complains that the table is not partitioned:
======================
HIVE FAILURE OUTPUT
======================
SET hive.support.sql11.reserved.keywords=false
SET hive.metastore.warehouse.dir=tmp/tests
OK
OK
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Table tmp.partitiontest1 is not a partitioned table
======================
It looks like the root cause is that org.apache.spark.sql.hive.HiveMetastoreCatalog.newSparkSQLSpecificMetastoreTable always creates table with empty partitions.
Any help to move this forward is appreciated.
EDIT: also created SPARK-14927
I found a workaround: if you pre-create the table then saveAsTable() won't mess with it. So the following works:
hc.setConf("hive.metastore.warehouse.dir", "tmp/tests")
// hc.setConf("hive.exec.dynamic.partition", "true")
// hc.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
hc.sql("create database if not exists tmp")
hc.sql("drop table if exists tmp.partitiontest1")
// Added line:
hc.sql("create table tmp.partitiontest1(val string) partitioned by (year int)")
Seq(2012 -> "a").toDF("year", "val")
.write
.partitionBy("year")
.mode(SaveMode.Append)
.saveAsTable("tmp.partitiontest1")
hc.sql("show partitions tmp.partitiontest1").show
This workaround works in 1.6.1 but not in 1.5.1.
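For later Spark versions (2.x+), a related route is a dynamic partition insert with insertInto into the pre-created table. This is only a PySpark sketch under that assumption and is not verified on 1.x:

# Assumes tmp.partitiontest1 was pre-created as above (partitioned by year).
spark.conf.set("hive.exec.dynamic.partition", "true")
spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")

df = spark.createDataFrame([("a", 2012)], ["val", "year"])
# insertInto matches columns by position, so the partition column goes last.
df.write.insertInto("tmp.partitiontest1")

spark.sql("show partitions tmp.partitiontest1").show()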
