How to generate a Spark SQL TRUNCATE query without ONLY - apache-spark

I am using Spark (3.0.0_preview) and reading from and writing to Greenplum (version 5.24). Greenplum 5.24 doesn't support the "truncate table only $table_name" command.
With Spark 3.0.0_preview and the JDBC driver ("org.postgresql" % "postgresql" % "42.2.5"), the command generated by Spark is "truncate table only $table_name".
df.write.format("jdbc").option("url", "jdbc:postgresql://test:5432/sample")
.option("user", "sample")
.option("password", "sample")
.option("dbtable", "test.employer")
.option("truncate", true) // generates TRUNCATE TABLE ONLY
.mode(SaveMode.Overwrite)
.save();
I want to generate the truncate command without the ONLY option, because ONLY is not supported by Greenplum 5.24.

As mentioned by @mazaneicha, Spark's PostgreSQL dialect can only generate TRUNCATE ONLY.
To get this working, I truncate the table myself from Scala over plain JDBC and then write with SaveMode.Append. Not a good fix, but it will work until we upgrade to Greenplum 6, which supports TRUNCATE TABLE ONLY.
truncate("test.employer", "jdbc:postgresql://test:5432/sample", "sample", "sample")
df.write.format("jdbc").option("url", "jdbc:postgresql://test:5432/sample")
.option("user", "sample")
.option("password", "sample")
.option("dbtable", "test.employer")
.mode(SaveMode.Append)
.save();
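import java.sql.DriverManager

def truncate(tableName: String, jdbcUrl: String, username: String, password: String): Unit = {
  val connection = DriverManager.getConnection(jdbcUrl, username, password)
  try {
    connection.setAutoCommit(true)
    val statement = connection.createStatement()
    // Plain TRUNCATE without ONLY, which Greenplum 5.24 accepts
    statement.execute(s"TRUNCATE TABLE $tableName")
  } finally {
    connection.close() // closing the connection also releases the statement
  }
}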
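
For PySpark users, the same truncate-then-append idea might look roughly like the sketch below. The helper names `build_truncate_sql` and `truncate_then_append` are hypothetical, and the sketch assumes you already hold a DB-API 2.0 connection to Greenplum (for example one opened with a PostgreSQL client library); it has not been run against Greenplum 5.24.

```python
import re

# Hypothetical helpers; only the "truncate yourself, then append" idea
# comes from the answer above.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)?")


def build_truncate_sql(table_name: str) -> str:
    """Return a plain TRUNCATE statement (no ONLY) for a [schema.]table name."""
    if not _IDENTIFIER.fullmatch(table_name):
        raise ValueError(f"unexpected table name: {table_name!r}")
    return f"TRUNCATE TABLE {table_name}"


def truncate_then_append(df, connection, table_name, jdbc_url, user, password):
    """Truncate through a DB-API connection, then let Spark append the rows."""
    cursor = connection.cursor()
    try:
        cursor.execute(build_truncate_sql(table_name))
        connection.commit()
    finally:
        cursor.close()
    (
        df.write.format("jdbc")
        .option("url", jdbc_url)
        .option("user", user)
        .option("password", password)
        .option("dbtable", table_name)
        .mode("append")  # append mode, so Spark never issues its own TRUNCATE
        .save()
    )
```

The table name is checked against a simple identifier pattern because it is interpolated directly into the SQL text.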

Related

How to filter data using spark.read in place?

I am trying to read data in Delta format from ADLS, and I want to read only a portion of that data by filtering at read time. The same approach worked for me when reading through JDBC:
query = f"""
select * from {table_name}
where
createdate < to_date('{createdate}','YYYY-MM-DD HH24:MI:SS') or
modifieddate < to_date('{modifieddate}','YYYY-MM-DD HH24:MI:SS')
"""
return spark.read \
.format("jdbc") \
.option("url", url) \
.option("query", query) \
.option("user", username) \
.option("password", password) \
.option("driver", "oracle.jdbc.driver.OracleDriver") \
.load()
So I tried to read Delta in a similar way using the query option, but it reads the whole table.
return spark.read \
.format("delta") \
.option("query", query) \
.load(path)
How can I solve this without reading the full DataFrame and then filtering it?
Thanks in advance!
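Spark uses a feature called predicate pushdown to optimize queries.
In the first case, the filters can be passed on to the Oracle database, which evaluates them and returns only the matching rows.
Delta does not work that way: there is no database engine to hand a SQL string to, and the query option is specific to the JDBC source. There can be optimisations through partition pruning, data skipping and Z-ordering, but since you are essentially querying Parquet files, Spark itself has to read every file that cannot be skipped and apply the filter afterwards.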
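
As a rough sketch of the usual alternative, the condition can be written as a DataFrame filter instead of a `query` option. DataFrame reads are lazy, so attaching the filter right after `load` does not materialize the table first; how many files Spark can actually skip depends on the table layout. The helper names are hypothetical, the column names come from the question, and the sketch assumes trusted timestamp strings in `yyyy-MM-dd HH:mm:ss` form.

```python
def build_date_filter(createdate: str, modifieddate: str) -> str:
    """Spark SQL predicate mirroring the WHERE clause of the JDBC query."""
    return (
        f"createdate < to_timestamp('{createdate}') "
        f"OR modifieddate < to_timestamp('{modifieddate}')"
    )


def read_delta_filtered(spark, path, createdate, modifieddate):
    """Load a Delta table lazily and attach the filter before any action runs."""
    return (
        spark.read.format("delta")
        .load(path)
        .filter(build_date_filter(createdate, modifieddate))
    )
```

Calling `read_delta_filtered(spark, path, createdate, modifieddate).explain()` shows whether the predicate appears among the pushed or partition filters of the scan.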

ClickHouse housepower driver with spark

I'm new to Stack and Spark, so please forgive my simplicity and mistakes!
I have a problem with ClickHouse and Spark (2.4.7); I work in a Jupyter notebook.
Basically, I want to insert a DataFrame with an Array column into a ClickHouse table with an Array(String) column. Using the yandex driver this is impossible, because JDBC doesn't support arrays, right? ;)
So I wanted to run the Spark session with the housepower jar, clickhouse-native-jdbc-shaded-2.6.4.jar, because I read that they added handling of arrays (correct me if I'm wrong).
And I want to run a query against ClickHouse via JDBC.
spark = SparkSession\
.builder\
.enableHiveSupport()\
.appName('custom-events-test')\
.config("spark.jars", "drivers/clickhouse-native-jdbc-shaded-2.6.4.jar")\
.getOrCreate()
My query:
query = """
select date as date,
partnerID as partnerID,
sessionID as sessionID,
toString(mapKeys(cast((JSONExtractKeysAndValues(ec.custom, 'String')), 'Map(String, String)'))) as keys,
toString(mapValues(cast((JSONExtractKeysAndValues(ec.custom, 'String')), 'Map(String, String)'))) as values
from audience.uber_all
array join eventContents as ec
PREWHERE date = ('2022-07-22')
WHERE partnerID = 'XXX'
and ec.custom != '{}'
order by date, partnerID
"""
and my code:
df_tab = spark.read \
.format("jdbc") \
.option("driver", "com.github.housepower.jdbc.ClickHouseDriver") \
.option("url", f"jdbc:clickhouse://{ch_host}:9000/{ch_db}") \
.option("query", query) \
.option("user", ch_user) \
.option("password", ch_pass) \
.load()
But then I get an error (shown in the original post as an image, housepower_error).
BUT when I run the above query with the yandex driver, ru.yandex.clickhouse.ClickHouseDriver,
everything works fine (even with the housepower jar).
This error also appears when I try to select a column like this:
JSONExtractKeysAndValues(ec.custom, 'String')
or
toString(JSONExtractKeysAndValues(ec.custom, 'String'))
What am I doing wrong?
Also, how do I insert a DataFrame with an Array column into a ClickHouse table with an Array(String) column using Spark JDBC? I looked everywhere but couldn't find a solution...
Thank you in advance !

Spark over JDBC uses an alias not allowed

Using Spark 2.4.0 and Exasol 6.2.0, I want to create a DataFrame from the simple query SELECT * FROM table_name over JDBC.
This is the code in Scala:
val df = sparkSession.read
.format("jdbc")
.option("driver", sourceJDBCConn("jdbcDriver"))
.option("url", sourceJDBCConn("jdbcUrl"))
.option("user", sourceJDBCConn("jdbcUsername"))
.option("password", sourceJDBCConn("jdbcPassword"))
.option("query", "SELECT * FROM table_name")
.load()
This works with PostgreSQL but not with Exasol. If I look into the session-audit I find the following SQL:
SELECT * FROM (SELECT * FROM table_name) __SPARK_GEN_JDBC_SUBQUERY_NAME_0 WHERE 1=0
The error message from Exasol is the following:
syntax error, unexpected $undefined, expecting UNION_ or EXCEPT_ or MINUS_ or INTERSECT_ [line 1, column 98]
While PostgreSQL appears to accept an alias with two leading underscores, such an alias is not allowed in Exasol. Is there any workaround to replace the alias __SPARK_GEN_JDBC_SUBQUERY_NAME_0 with an identifier accepted by Exasol?
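
One possible workaround, sketched here in Python and untested against Exasol 6.2.0, is to skip the `query` option and pass the statement through `dbtable` as a parenthesized subquery. Spark then uses the alias you supply instead of generating `__SPARK_GEN_JDBC_SUBQUERY_NAME_0`. The helper names and the default alias `spark_subquery` are hypothetical, and whether Exasol accepts that alias is an assumption.

```python
def as_dbtable_subquery(query: str, alias: str = "spark_subquery") -> str:
    """Wrap a SELECT so it can be passed as the JDBC `dbtable` option."""
    return f"({query.strip().rstrip(';')}) {alias}"


def read_query_with_alias(spark, options: dict, query: str, alias: str = "spark_subquery"):
    """Read via `dbtable` instead of `query`, so the alias is user-chosen."""
    reader = spark.read.format("jdbc")
    for key, value in options.items():
        reader = reader.option(key, value)
    return reader.option("dbtable", as_dbtable_subquery(query, alias)).load()
```

Here `options` would hold the same driver, url, user and password values as in the question, without the `query` entry.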

Query table names

I have been given access to a database. I am querying the data from a spark cluster. How do I check all the databases/tables I have access to?
jdbcUrl = "jdbc:sqlserver://{0}:{1};database={2};user='{3}';password='{4}'".format(jdbcHostname, jdbcPort, jdbcDatabase, jdbcUsername, jdbcPassword)
connectionProperties = {
"user" : jdbcUsername,
"password" : jdbcPassword,
"driver" : "com.microsoft.sqlserver.jdbc.SQLServerDriver"
}
df = spark.read.jdbc(url=jdbcUrl, properties=connectionProperties)
Access to the database has been authenticated.
In SQL Server itself:
select *
from sys.tables
Not sure whether you use a synonym as the way into the sys schema.
val tables = spark.read.jdbc(jdbc_url, "sys.tables", connectionProperties)
tables.select(...
If you have a synonym, replace sys.tables with it. There are two ways of writing this: the table approach or your own SQL query approach. The above is the table approach. Here is an example of the SQL query approach:
val dataframe_mysql = spark.read.jdbc(jdbcUrl, "(select k, v from sample order by k DESC) e", connectionProperties)
I just realized the above is the Scala version.
For pyspark specifically, see: https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
It is the same approach; the example there happens to use Postgres:
jdbcDF = spark.read \
.format("jdbc") \
.option("url", "jdbc:postgresql:dbserver") \
.option("dbtable", "schema.tablename") \
.option("user", "username") \
.option("password", "password") \
.load()
JDBC loading and saving can be achieved via either the load/save or jdbc methods, see the guide.
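
Applied to the SQL Server case in the question, a PySpark version of the table approach could look like the sketch below. The function name `list_tables` is hypothetical; `sys.tables` is the catalog view named in the answer, and the sketch assumes the connected user is allowed to read it.

```python
def list_tables(spark, jdbc_url: str, connection_properties: dict, catalog_view: str = "sys.tables"):
    """Return a DataFrame with one row per table listed in the catalog view."""
    return spark.read.jdbc(
        url=jdbc_url,
        table=catalog_view,
        properties=connection_properties,
    )
```

With the variables from the question, `list_tables(spark, jdbcUrl, connectionProperties).show()` would display the result.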

spark sql load all data into memory without using where clause

I have a really huge table containing billions of rows. I tried the Spark DataFrame API to load the data; here is my code:
sql = "select * from mytable where day = 2016-11-25 and hour = 10"
df = sqlContext.read \
.format("jdbc") \
.option("driver", driver) \
.option("url", url) \
.option("user", user) \
.option("password", password) \
.option("dbtable", table) \
.load(sql)
df.show()
I typed the SQL into MySQL and it returns about 100 rows, but the same SQL did not work in Spark SQL: it fails with an OOM error. It seems that Spark SQL loads all the data into memory without applying the where clause. So how can I make Spark SQL use the where clause?
I have solved the problem; the Spark documentation for the JDBC "dbtable" option gives the answer.
The key is to change the "dbtable" option: write your SQL as a parenthesized subquery with an alias and pass that instead of the table name. The corrected code is:
# 1. write your query as a subquery with an alias
sql = "(select * from mytable where day = 2016-11-25 and hour = 10) t1"
# 2. pass the subquery as the "dbtable" option instead of the table name
df = sqlContext.read \
.format("jdbc") \
.option("driver", driver) \
.option("url", url) \
.option("user", user) \
.option("password", password) \
.option("dbtable", sql) \
.load()
df.show()
# Same approach with an explicit "as" alias
sql = "(select * from mytable where day = 2016-11-25 and hour = 10) as t1"
df = sqlContext.read \
.format("jdbc") \
.option("driver", driver) \
.option("url", url) \
.option("user", user) \
.option("password", password) \
.option("dbtable", sql) \
.load()
df.show()

Resources