JDBC not truncating Postgres table on pyspark - apache-spark

I'm using the following code to truncate a table before inserting data into it:
df.write \
    .option("driver", "org.postgresql:postgresql:42.2.16") \
    .option("truncate", True) \
    .jdbc(url=pgsql_connection, table="service", mode='append', properties=properties_postgres)
However, it is not working: the table still contains the old data. I'm using append because I don't want the DB to drop and recreate the table every time.
I've also tried .option("truncate", "true"), but that didn't work either.
I get no error messages. How can I solve this problem and truncate my table using .option?

You need to use overwrite mode:
df.write \
    .option("driver", "org.postgresql:postgresql:42.2.16") \
    .option("truncate", True) \
    .jdbc(url=pgsql_connection, table="service", mode='overwrite', properties=properties_postgres)
As given in the documentation:
https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
truncate: true -> When SaveMode.Overwrite is enabled, this option causes Spark to
truncate an existing table instead of dropping and recreating it.
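For completeness, a quick way to confirm that the overwrite-with-truncate behaved as expected is to read the table back and check the row count. A minimal sketch, reusing the question's pgsql_connection and properties_postgres:
# Read the table back after the overwrite to verify that only the freshly
# written rows remain (assumes the same connection details used above).
row_count = spark.read.jdbc(
    url=pgsql_connection,
    table="service",
    properties=properties_postgres,
).count()
print(f"service now holds {row_count} rows")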

Related

streamWriter with format(delta) is not producing a delta table

I am using Auto Loader in Databricks. However, when I save the stream as a Delta table, the generated table is NOT Delta.
.writeStream
    .format("delta")  # <-----------
    .option("checkpointLocation", checkpoint_path)
    .option("path", output_path)
    .trigger(availableNow=True)
    .toTable(table_name))
delta.DeltaTable.isDeltaTable(spark, table_name)
> false
Why is the generated table not delta format?
If I try to read the table using spark.read.table(table_name) it works, but if I try to use Redash or the built-in Databricks Data tab it produces an error and the schema is not parsed correctly.
An error occurred while fetching table: table_name
com.databricks.backend.common.rpc.DatabricksExceptions$SQLExecutionException: org.apache.spark.sql.AnalysisException: Incompatible format detected
A transaction log for Databricks Delta was found at s3://delta/_delta_log,
but you are trying to read from s3://delta using format("parquet"). You must use
'format("delta")' when reading and writing to a delta table.
Could you try this:
(
    spark
    .writeStream
    .option("checkpointLocation", <checkpointLocation_path>)
    .trigger(availableNow=True)
    .table("<table_name>")
)
Instead of toTable, can you try table?
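If the suggestion above does not resolve the mismatch, it can also help to check what actually got written at the output path. A minimal sketch, assuming output_path is the same location passed to .option("path", ...) in the question:
from delta.tables import DeltaTable

# isDeltaTable expects a filesystem path, so point it at the stream's output
# location rather than the metastore table name.
print(DeltaTable.isDeltaTable(spark, output_path))

# Per the error message, read the same location explicitly as Delta.
spark.read.format("delta").load(output_path).printSchema()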

Querying Snowflake metadata using the Spark connector

I want to run a 'SHOW TABLES' statement through the spark-snowflake connector. I am running Spark on the Databricks platform and getting an "Object 'SHOW' does not exist or not authorized" error.
df = spark.read \
    .format("snowflake") \
    .options(**options) \
    .option("query", "show tables") \
    .load()
df.show()
A sample query like "SELECT 1" works as expected.
I know that I could install the native Python Snowflake driver, but I want to avoid that solution if possible because I have already opened the session using Spark.
There is also a way using the "Utils.runQuery" function, but I understand it is relevant only for DDL statements (it doesn't return the actual results).
Thanks!
When using DataFrames, the Snowflake connector supports SELECT queries only.
This is documented in our docs.
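Since only SELECT is supported through the DataFrame reader, one workaround is to query Snowflake's INFORMATION_SCHEMA instead of running SHOW TABLES. A minimal sketch, reusing the options dict from the question (the schema filter is an assumption; adjust or drop it as needed):
# INFORMATION_SCHEMA.TABLES can be read with a plain SELECT, so it passes
# through the connector's "query" option where SHOW TABLES does not.
tables_df = (
    spark.read
    .format("snowflake")
    .options(**options)
    .option(
        "query",
        "SELECT table_name, table_type "
        "FROM information_schema.tables "
        "WHERE table_schema = 'PUBLIC'",
    )
    .load()
)
tables_df.show()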

Issue with Apache Hudi Update and Delete Operation on Parquet S3 File

I am trying to simulate updates and deletes over a Hudi dataset and want to see the state reflected in the Athena table. We use the EMR, S3 and Athena services of AWS.
Attempting Record Update with a withdrawal object
withdrawalID_mutate = 10382495
updateDF = final_df.filter(col("withdrawalID") == withdrawalID_mutate) \
    .withColumn("accountHolderName", lit("Hudi_Updated"))
updateDF.write.format("hudi") \
    .options(**hudi_options) \
    .mode("append") \
    .save(tablePath)
hudiDF = spark.read \
    .format("hudi") \
    .load(tablePath).filter(col("withdrawalID") == withdrawalID_mutate).show()
This shows the updated record, but it is actually appended in the Athena table. Probably something to do with the Glue Catalog?
Attempting Record Delete
deleteDF = updateDF  # deleting the updated record above
deleteDF.write.format("hudi") \
    .option('hoodie.datasource.write.operation', 'upsert') \
    .option('hoodie.datasource.write.payload.class', 'org.apache.hudi.common.model.EmptyHoodieRecordPayload') \
    .options(**hudi_options) \
    .mode("append") \
    .save(tablePath)
The Athena table still shows the deleted record.
I also tried using mode("overwrite"), but as expected it deletes the older partitions and keeps only the latest.
Has anyone faced the same issue and can guide me in the right direction?
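For reference, the delete can also be expressed by requesting Hudi's delete operation explicitly instead of relying on the empty payload class. A minimal sketch under the question's own names (deleteDF, hudi_options, tablePath); note that a later .option() call overrides any conflicting key already set via .options():
# Explicitly request a delete operation; placing this .option() after
# .options(**hudi_options) means it wins over any operation key in that dict.
deleteDF.write.format("hudi") \
    .options(**hudi_options) \
    .option("hoodie.datasource.write.operation", "delete") \
    .mode("append") \
    .save(tablePath)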

Databricks Azure database warehouse saving tables

I am using the following code to write to an Azure SQL Data Warehouse table:
df_execution_config_remain.write \
    .format("com.databricks.spark.sqldw") \
    .option("user", user) \
    .option("password", pswd) \
    .option("url", "jdbc:sqlserver://" + sqlserver + ":" + port + ";database=" + database) \
    .option("forwardSparkAzureStorageCredentials", "true") \
    .option("dbTable", execution_config) \
    .option("tempDir", dwtmp) \
    .mode("Overwrite") \
    .save()
But Overwrite will drop the table and recreate it.
Questions:
1. I found the newly created table has round-robin distribution, which I don't want.
2. The columns have a different length from the original table (varchar(256)).
3. I don't want to use append, because I would like to clear the rows in the current table first.
Q1: Refer to the tableOptions parameter under the following link:
https://docs.databricks.com/spark/latest/data-sources/azure/sql-data-warehouse.html#parameters
Q2: Are you being affected by the maxStrLength parameter under that same link?
Q3: I think your approach is sound, but an alternative might be to use the preActions parameter under that same link, and TRUNCATE the table before loading.
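To make the preActions idea concrete, here is a minimal sketch under the question's own variable names (user, pswd, sqlserver, port, database, execution_config, dwtmp); the TRUNCATE statement and the switch to append mode are the only changes from the original write:
# Keep the existing table (and its distribution and column definitions) by
# appending, but clear it first with a TRUNCATE issued via preActions.
df_execution_config_remain.write \
    .format("com.databricks.spark.sqldw") \
    .option("user", user) \
    .option("password", pswd) \
    .option("url", "jdbc:sqlserver://" + sqlserver + ":" + port + ";database=" + database) \
    .option("forwardSparkAzureStorageCredentials", "true") \
    .option("dbTable", execution_config) \
    .option("tempDir", dwtmp) \
    .option("preActions", "TRUNCATE TABLE " + execution_config) \
    .mode("append") \
    .save()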

Data Frame write append approach

This works fine for me when saving data to MySQL:
jdbcDF.write \
    .format("jdbc") \
    .option("url", "jdbc:mysql://db4free.net:3306/gedmysql") \
    .option("dbtable", "newtable2") \
    .option("user", "user") \
    .option("password", "pswd") \
    .save()
However, I cannot seem to find the append/overwrite equivalent in this format. I see various things, but they do not seem to work.
E.g. .mode(SaveMode.Append) instead of .save() runs, but there is no change to the MySQL database. There appear to be two styles to use; the one I quoted is, I thought, new in 2.1.
You need to do it the "other way" from what I can see, i.e.:
jdbcDF.write.mode("append").jdbc(url, table, prop)
.save is only an option on the format I showed in the question. It is what it is.
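Spelled out against the MySQL example from the question, the .jdbc() variant looks like this (a minimal sketch; the connection details are the ones from the question, and the MySQL JDBC driver must be on the classpath):
# Append to the existing table via DataFrameWriter.jdbc(), passing the
# credentials through the properties dict instead of individual options.
mysql_props = {
    "user": "user",
    "password": "pswd",
    "driver": "com.mysql.cj.jdbc.Driver",  # Connector/J 8.x; older versions use com.mysql.jdbc.Driver
}

jdbcDF.write.mode("append").jdbc(
    url="jdbc:mysql://db4free.net:3306/gedmysql",
    table="newtable2",
    properties=mysql_props,
)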
