Data Frame write append approach - apache-spark

This works fine for me when saving data to MySQL:
jdbcDF.write
.format("jdbc")
.option("url", "jdbc:mysql://db4free.net:3306/gedmysql")
.option("dbtable", "newtable2")
.option("user", "user")
.option("password", "pswd")
.save()
However, I cannot seem to find the append/overwrite equivalent in this format. I have seen various suggestions, but they do not seem to work.
For example, .mode(SaveMode.Append) in place of .save() runs, but makes no change to the MySQL database. There appear to be two styles to use; the one I quoted I thought was new in 2.1.

You need to do it the "other way", from what I can see, i.e.:
jdbcDF.write.mode("append").jdbc(url, table, prop)
.save() is only an option with the format I showed in the question. It is what it is.
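For reference, a minimal PySpark sketch of the same append call; the URL, table name and credentials are the question's placeholders, and the driver class is an assumption (MySQL Connector/J 5.x):
# hedged sketch: append via DataFrameWriter.jdbc(); connection details are placeholders
url = "jdbc:mysql://db4free.net:3306/gedmysql"
prop = {
    "user": "user",
    "password": "pswd",
    # assumed Connector/J 5.x class; 8.x uses com.mysql.cj.jdbc.Driver
    "driver": "com.mysql.jdbc.Driver",
}
jdbcDF.write.mode("append").jdbc(url, "newtable2", properties=prop)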

Related

Querying snowflake metadata using spark connector

I want to run a 'SHOW TABLES' statement through the spark-snowflake connector. I am running Spark on the Databricks platform and getting an "Object 'SHOW' does not exist or not authorized" error.
df = spark.read \
.format("snowflake") \
.options(**options) \
.option("query", "show tables") \
.load()
df.show()
A sample query like "SELECT 1" works as expected.
I know I could install the native Python Snowflake driver, but I want to avoid that solution if possible because I have already opened the session using Spark.
There is also a way using the "Utils.runQuery" function, but I understand that is relevant only for DDL statements (it doesn't return the actual results).
Thanks!
When using DataFrames, the Snowflake connector supports SELECT queries only.
This is documented in our docs.
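As a workaround sketch (not part of the original answer): since only SELECT is supported through the query option, similar table metadata can usually be read from Snowflake's INFORMATION_SCHEMA instead of SHOW TABLES, assuming the connecting role can read it in the target database:
# hedged sketch: list tables via a plain SELECT, which the connector supports
df = spark.read \
    .format("snowflake") \
    .options(**options) \
    .option("query", "SELECT table_name, table_schema FROM information_schema.tables") \
    .load()
df.show()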

JDBC not truncating Postgres table on pyspark

I'm using the following code to truncate a table before inserting data on it.
df.write \
.option("driver", "org.postgresql:postgresql:42.2.16") \
.option("truncate", True) \
.jdbc(url=pgsql_connection, table="service", mode='append', properties=properties_postgres)
However, it is not working; the table still has the old data. I'm using append since I don't want the DB to drop and recreate the table every time.
I've tried .option("truncate", "true") but that didn't work either.
I get no error messages. How can I solve this problem using .option to truncate my table?
You need to use overwrite mode
df.write \
.option("driver", "org.postgresql:postgresql:42.2.16") \
.option("truncate", True) \
.jdbc(url=pgsql_connection, table="service", mode='overwrite', properties=properties_postgres)
As given in the documentation:
https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
truncate: true -> When SaveMode.Overwrite is enabled, this option causes Spark to
truncate an existing table instead of dropping and recreating it.
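For completeness, a hedged sketch of the same write using the generic format("jdbc") style, assuming pgsql_connection is the JDBC URL and properties_postgres is a dict holding the user and password from the question:
# equivalent sketch; "truncate" only takes effect together with overwrite mode
# properties_postgres is assumed to be a dict with "user" and "password" keys
df.write \
    .format("jdbc") \
    .option("url", pgsql_connection) \
    .option("dbtable", "service") \
    .option("truncate", "true") \
    .option("user", properties_postgres["user"]) \
    .option("password", properties_postgres["password"]) \
    .mode("overwrite") \
    .save()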

Issue with Apache Hudi Update and Delete Operation on Parquet S3 File

Here I am trying to simulate updates and deletes over a Hudi dataset and wish to see the state reflected in an Athena table. We use the EMR, S3, and Athena services of AWS.
Attempting Record Update with a withdrawal object
withdrawalID_mutate = 10382495
updateDF = final_df.filter(col("withdrawalID") == withdrawalID_mutate) \
.withColumn("accountHolderName", lit("Hudi_Updated"))
updateDF.write.format("hudi") \
.options(**hudi_options) \
.mode("append") \
.save(tablePath)
hudiDF = spark.read \
.format("hudi") \
.load(tablePath).filter(col("withdrawalID") == withdrawalID_mutate).show()
Spark shows the updated record, but it is actually appended as a new row in the Athena table. Probably something to do with the Glue Catalog?
Attempting Record Delete
deleteDF = updateDF #deleting the updated record above
deleteDF.write.format("hudi") \
.option('hoodie.datasource.write.operation', 'upsert') \
.option('hoodie.datasource.write.payload.class', 'org.apache.hudi.common.model.EmptyHoodieRecordPayload') \
.options(**hudi_options) \
.mode("append") \
.save(tablePath)
The Athena table still reflects the deleted record.
I also tried using mode("overwrite"), but as expected it deletes the older partitions and keeps only the latest.
Has anyone faced the same issue and can guide me in the right direction?

Databricks Azure database warehouse saving tables

I am using the following code to write to an Azure SQL Data Warehouse table:
df_execution_config_remain.write
.format("com.databricks.spark.sqldw")
.option("user", user)
.option("password", pswd)
.option("url","jdbc:sqlserver://"+sqlserver +":"+port+";database="+database)
.option("forwardSparkAzureStorageCredentials", "true")
.option("dbTable", execution_config)
.option("tempDir", dwtmp)
.mode("Overwrite")
.save()
But Overwrite will drop the table and recreate it.
Questions:
1. I found the newly created table has a round-robin distribution, which I don't want.
2. The columns have a different length from the original table (varchar(256)).
3. I don't want to use append, because I would like to clear the rows in the current table first.
Q1: Refer to the tableOptions parameter under the following link:
https://docs.databricks.com/spark/latest/data-sources/azure/sql-data-warehouse.html#parameters
Q2: Are you being affected by the maxStrLength parameter under that same link?
Q3: I think your approach is sound, but an alternative might be to use the preActions parameter under that same link, and TRUNCATE the table before loading.
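A hedged PySpark sketch of the Q3 alternative; preActions is the option name given in the linked Databricks docs, and the variables are the question's own placeholders:
# sketch: keep the existing table (append mode preserves distribution and column types)
# and clear its rows first via preActions; all variables are the question's placeholders
df_execution_config_remain.write \
    .format("com.databricks.spark.sqldw") \
    .option("url", f"jdbc:sqlserver://{sqlserver}:{port};database={database}") \
    .option("user", user) \
    .option("password", pswd) \
    .option("forwardSparkAzureStorageCredentials", "true") \
    .option("tempDir", dwtmp) \
    .option("dbTable", execution_config) \
    .option("preActions", f"TRUNCATE TABLE {execution_config}") \
    .mode("append") \
    .save()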

AnalysisException: It is not allowed to add database prefix

I am attempting to read in data from a table that is in a schema using JDBC. However, I'm getting an error:
org.apache.spark.sql.AnalysisException: It is not allowed to add database prefix `myschema` for the TEMPORARY view name.;
The code is pretty straightforward; the error occurs on the third line (the others are included just to show what I am doing). myOptions includes url, dbtable, driver, user, and password.
SQLContext sqlCtx = new SQLContext(ctx);
Dataset<Row> df = sqlCtx.read().format("jdbc").options(myOptions).load();
df.createOrReplaceTempView("myschema.test_table");
df = sqlCtx.sql("select field1, field2, field3 from myschema.test_table");
So if database/schema qualifiers are not allowed, how do you reference the correct one for your table? Leaving it off gives an 'invalid object name' error from the database, which is expected.
The only option I have on the database side is to use a default schema; however, this is user-based rather than session-based, so I would have to create one user and connection per schema I want to access.
What am I missing here? This seems like a common use case.
Edit: Also, for those attempting to close this as "a problem that can no longer be reproduced or a simple typographical error", how about a comment as to why that applies? If I have made a typo or a simple mistake, leave a comment and show me what it is. I can't be the only person who has run into this.
registerTempTable in Spark 1.2 used to work this way, and we were told that createOrReplaceTempView was supposed to replace it in 2.x. Yet the functionality is not there.
I figured it out.
The short answer is... the dbtable name and the temp view/table name are two different things and don't have to have the same value. dbtable defines where in the database to go for the data; the temp view/table name defines what you call it in your Spark SQL.
This was confusing at first because Spark 1.6 allowed the view name to match the full table name (and so the software I am using plugged it in for both in 1.6). If you were coding this by hand, you would just use a non-qualified table name for the temp table or view on either 1.6 or 2.2.
In order to reference a table in a schema in Spark 1.6, I had to do the following because the dbtable and view name were the same:
1. dbtable to "schema.table"
2. registerTempTable("schema.table")
3. Reference the table as `schema.table` (including the backticks, so the entire thing is treated as one identifier matching the view name) in the SQL
However, in Spark 2.2, since a schema/database prefix is not allowed in the view name, you need to (see the sketch after this list):
1. dbtable to "schema.table"
2. createOrReplaceTempView("table")
3. Reference table (not schema.table) in the SQL, matching the view name
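For illustration, a minimal PySpark sketch of those three steps; the JDBC URL and credentials are placeholders:
# 1. dbtable points at the schema-qualified table in the database
df = spark.read.format("jdbc") \
    .option("url", "jdbc:sqlserver://host:1433;databaseName=mydb") \
    .option("dbtable", "myschema.test_table") \
    .option("user", "user") \
    .option("password", "password") \
    .load()
# 2. the temp view name carries no schema prefix
df.createOrReplaceTempView("test_table")
# 3. the SQL references the plain view name, not schema.table
result = spark.sql("select field1, field2, field3 from test_table")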
I guess you are trying to fetch a specific table from an RDBMS. If you are using Spark 2.x or later, you can use the code below to load your table into a DataFrame.
DF = spark.read \
.format("jdbc") \
.option("url", "jdbc:oracle:thin:username/password#//hostname:portnumber/SID") \
.option("dbtable", "hr.emp") \
.option("user", "db_user_name") \
.option("password", "password") \
.option("driver", "oracle.jdbc.driver.OracleDriver") \
.load()
