How to saveAsTable to s3? - apache-spark

It looks like this will error out
df.write()
.option("mode", "DROPMALFORMED")
.option("compression", "snappy")
.mode("overwrite")
.bucketBy(32,"column")
.sortBy("column")
.parquet("s3://....");
with the error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: 'save' does not support bucketing right now; at org.apache.spark.sql.DataFrameWriter.assertNotBucketed(DataFrameWriter.scala:314)
I see saveAsTable("myfile") is still supported but it only writes locally. How would I take that saveAsTable(...) output and put it on s3 after the job is done?

You can do it like below:
df
.write()
.option("mode", "DROPMALFORMED")
.option("compression", "snappy")
.option("path","s3://....")
.mode("overwrite")
.format("parquet")
.bucketBy(32,"column").sortBy("column")
.saveAsTable("tableName");
This will create an external table pointing to the S3 location.
.option("path","s3://....") is the catch here.

Related

streamWriter with format(delta) is not producing a delta table

I am using AutoLoader in Databricks. However, when I save the stream as a Delta table, the generated table is NOT Delta.
(df.writeStream
.format("delta") # <-----------
.option("checkpointLocation", checkpoint_path)
.option("path", output_path)
.trigger(availableNow=True)
.toTable(table_name))
delta.DeltaTable.isDeltaTable(spark, table_name)
> false
Why is the generated table not in Delta format?
If I try to read the table using spark.read.table(table_name) it works, but if I try to use Redash or the built-in Databricks Data tab it produces an error and the schema is not parsed correctly.
An error occurred while fetching table: table_name
com.databricks.backend.common.rpc.DatabricksExceptions$SQLExecutionException: org.apache.spark.sql.AnalysisException: Incompatible format detected
A transaction log for Databricks Delta was found at s3://delta/_delta_log,
but you are trying to read from s3://delta using format("parquet"). You must use
'format("delta")' when reading and writing to a delta table.
Could you try this:
(
spark
.writeStream
.option("checkpointLocation", <checkpointLocation_path>)
.trigger(availableNow=True)
.table("<table_name>")
)
Instead of toTable, can you try table?
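As a side check, DeltaTable.isDeltaTable expects a storage path rather than a catalog table name, so a sketch like the following (reusing output_path and table_name from the question) may be more telling than the check in the question:
from delta.tables import DeltaTable
# isDeltaTable checks whether the given location is the root of a Delta table,
# so pass the storage path rather than the table name.
DeltaTable.isDeltaTable(spark, output_path)
# The catalog's view of the table (see the "Provider" row) can be checked with:
spark.sql(f"DESCRIBE EXTENDED {table_name}").show(truncate=False)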

Querying snowflake metadata using spark connector

I want to run a 'SHOW TABLES' statement through the spark-snowflake connector. I am running Spark on the Databricks platform and getting an "Object 'SHOW' does not exist or not authorized" error.
df = spark.read \
.format("snowflake") \
.options(**options) \
.option("query", "show tables") \
.load()
df.show()
A sample query like "SELECT 1" works as expected.
I know that I am able to install the native Python Snowflake driver, but I want to avoid this solution if possible because I already opened the session using Spark.
There is also a way using the "Utils.runQuery" function, but I understood that it is relevant only for DDL statements (it doesn't return the actual results).
Thanks!
When using DataFrames, the Snowflake connector supports SELECT queries only.
This is documented in our docs.
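Since only SELECT statements are accepted through the query option, one workaround is to pull the same metadata from INFORMATION_SCHEMA instead of SHOW TABLES. A rough sketch, reusing the options dict from the question (the schema name is a placeholder):
# Sketch: list tables via a SELECT on INFORMATION_SCHEMA, which the connector accepts.
metadata_query = """
    SELECT table_catalog, table_schema, table_name, row_count
    FROM information_schema.tables
    WHERE table_schema = 'PUBLIC'
"""
df_meta = (
    spark.read
    .format("snowflake")
    .options(**options)              # same connection options as in the question
    .option("query", metadata_query)
    .load()
)
df_meta.show()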

JDBC not truncating Postgres table on pyspark

I'm using the following code to truncate a table before inserting data into it.
df.write \
.option("driver", "org.postgresql:postgresql:42.2.16") \
.option("truncate", True) \
.jdbc(url=pgsql_connection, table="service", mode='append', properties=properties_postgres)
However, it is not working; the table still has the old data. I'm using append since I don't want the DB to drop and create a new table every time.
I've tried .option("truncate", "true") but that didn't work either.
I got no error messages. How can I solve this problem using .option to truncate my table?
You need to use overwrite mode
df.write \
.option("driver", "org.postgresql:postgresql:42.2.16") \
.option("truncate", True) \
.jdbc(url=pgsql_connection, table="service", mode='overwrite', properties=properties_postgres)
As given in the documentation:
https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
truncate: true -> When SaveMode.Overwrite is enabled, this option causes Spark to
truncate an existing table instead of dropping and recreating it.
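For reference, the same write expressed with the generic format("jdbc") writer (connection details below are placeholders) makes the pairing explicit: truncate only takes effect together with overwrite mode.
# Sketch: overwrite + truncate through the generic JDBC writer (placeholder URL/credentials).
(
    df.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://host:5432/db")
    .option("dbtable", "service")
    .option("user", "<user>")
    .option("password", "<password>")
    .option("driver", "org.postgresql.Driver")
    .option("truncate", "true")   # honored only with SaveMode.Overwrite
    .mode("overwrite")
    .save()
)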

How to read excel as a pyspark dataframe

I am able to read all files and formats like CSV, Parquet, and Delta from an ADLS Gen2 account with OAuth2 credentials.
However, when I try to read an Excel file like below,
df = spark.read.format("com.crealytics.spark.excel") \
.option("header", "true") \
.option("inferSchema", "true") \
.option("dataAddress", "'excel sheet name'!A1") \
.load(filepath)
I am getting the error below:
Failure to initialize configuration: Invalid configuration value detected for fs.azure.account.key
Note: I have installed the external library "com.crealytics:spark-excel_2.11:0.12.2" to read Excel files as a DataFrame.
Can anyone help me with the error here?
Try using this in the configs: "fs.azure.account.oauth2.client.secret": "<key-name>",
Different versions have different sets of parameters, so try using the latest release: https://mvnrepository.com/artifact/com.crealytics/spark-excel_2.12/0.13.7
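For example, with a service principal the OAuth configuration for ABFS usually involves the full set of keys below (storage account, client id/secret, and tenant are placeholders); the spark-excel read itself stays the same:
# Sketch: OAuth2 (service principal) configs for ADLS Gen2, with placeholder values.
storage_account = "<storage-account>"
suffix = f"{storage_account}.dfs.core.windows.net"
spark.conf.set(f"fs.azure.account.auth.type.{suffix}", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{suffix}",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{suffix}", "<client-id>")
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{suffix}", "<client-secret>")
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{suffix}",
               "https://login.microsoftonline.com/<tenant-id>/oauth2/token")
# The spark-excel read from the question should then pick up these credentials.
df = (
    spark.read.format("com.crealytics.spark.excel")
    .option("header", "true")
    .option("inferSchema", "true")
    .option("dataAddress", "'excel sheet name'!A1")
    .load(filepath)
)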

Spark partitions: creating RDD partitions but not Hive partitions

This is a follow-up to Save Spark dataframe as dynamic partitioned table in Hive. I tried to use the suggestions in the answers but couldn't make them work in Spark 1.6.1.
I am trying to create partitions programmatically from a DataFrame. Here is the relevant code (adapted from a Spark test):
hc.setConf("hive.metastore.warehouse.dir", "tmp/tests")
// hc.setConf("hive.exec.dynamic.partition", "true")
// hc.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
hc.sql("create database if not exists tmp")
hc.sql("drop table if exists tmp.partitiontest1")
Seq(2012 -> "a").toDF("year", "val")
.write
.partitionBy("year")
.mode(SaveMode.Append)
.saveAsTable("tmp.partitiontest1")
hc.sql("show partitions tmp.partitiontest1").show
Full file is here: https://gist.github.com/SashaOv/7c65f03a51c7e8f9c9e018cd42aa4c4a
Partitioned files are created fine on the file system but Hive complains that the table is not partitioned:
======================
HIVE FAILURE OUTPUT
======================
SET hive.support.sql11.reserved.keywords=false
SET hive.metastore.warehouse.dir=tmp/tests
OK
OK
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Table tmp.partitiontest1 is not a partitioned table
======================
It looks like the root cause is that org.apache.spark.sql.hive.HiveMetastoreCatalog.newSparkSQLSpecificMetastoreTable always creates a table with empty partitions.
Any help to move this forward is appreciated.
EDIT: also created SPARK-14927
I found a workaround: if you pre-create the table then saveAsTable() won't mess with it. So the following works:
hc.setConf("hive.metastore.warehouse.dir", "tmp/tests")
// hc.setConf("hive.exec.dynamic.partition", "true")
// hc.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
hc.sql("create database if not exists tmp")
hc.sql("drop table if exists tmp.partitiontest1")
// Added line:
hc.sql("create table tmp.partitiontest1(val string) partitioned by (year int)")
Seq(2012 -> "a").toDF("year", "val")
.write
.partitionBy("year")
.mode(SaveMode.Append)
.saveAsTable("tmp.partitiontest1")
hc.sql("show partitions tmp.partitiontest1").show
This workaround works in 1.6.1 but not in 1.5.1.
