Spark over JDBC uses an alias that is not allowed - apache-spark

Using Spark 2.4.0 and Exasol 6.2.0, I want to create a DataFrame over JDBC from the simple query SELECT * FROM table_name.
This is the code in Scala:
val df = sparkSession.read
  .format("jdbc")
  .option("driver", sourceJDBCConn("jdbcDriver"))
  .option("url", sourceJDBCConn("jdbcUrl"))
  .option("user", sourceJDBCConn("jdbcUsername"))
  .option("password", sourceJDBCConn("jdbcPassword"))
  .option("query", "SELECT * FROM table_name")
  .load()
This works with PostgreSQL but not with Exasol. If I look into the session-audit I find the following SQL:
SELECT * FROM (SELECT * FROM table_name) __SPARK_GEN_JDBC_SUBQUERY_NAME_0 WHERE 1=0
The error message from Exasol is the following:
syntax error, unexpected $undefined, expecting UNION_ or EXCEPT_ or MINUS_ or INTERSECT_ [line 1, column 98]
While PostgreSQL appears to accept an alias with two leading underscores, such an alias is not allowed in Exasol. Is there any workaround to replace the alias __SPARK_GEN_JDBC_SUBQUERY_NAME_0 with an identifier that Exasol accepts?
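A possible workaround (a sketch, not from this thread): Spark only generates the __SPARK_GEN_JDBC_SUBQUERY_NAME_* alias when the query option is used. If you pass the statement through the dbtable option as a parenthesised subquery with your own alias, that alias is what ends up in the generated SQL:
val df = sparkSession.read
  .format("jdbc")
  .option("driver", sourceJDBCConn("jdbcDriver"))
  .option("url", sourceJDBCConn("jdbcUrl"))
  .option("user", sourceJDBCConn("jdbcUsername"))
  .option("password", sourceJDBCConn("jdbcPassword"))
  // dbtable instead of query: the alias "t" is chosen by you,
  // so no identifier with leading underscores reaches Exasol
  .option("dbtable", "(SELECT * FROM table_name) t")
  .load()
The schema probe then becomes SELECT * FROM (SELECT * FROM table_name) t WHERE 1=0, which Exasol should accept.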

Related

ClickHouse housepower driver with spark

I'm new to Stack and Spark, so please forgive my simplicity and mistakes!
I have a problem with ClickHouse and Spark (2.4.7); I work in a Jupyter notebook.
Basically, I want to insert a DataFrame with an Array column into a ClickHouse table with an Array(String) column. Using the yandex driver this is impossible, because JDBC doesn't support Arrays, right? ;)
So I wanted to run the Spark session with the housepower jar clickhouse-native-jdbc-shaded-2.6.4.jar, because I read that they added handling for Arrays - correct me if I'm wrong.
And I want to run a query against ClickHouse via JDBC.
spark = SparkSession \
    .builder \
    .enableHiveSupport() \
    .appName('custom-events-test') \
    .config("spark.jars", "drivers/clickhouse-native-jdbc-shaded-2.6.4.jar") \
    .getOrCreate()
My query:
query = """
select date as date,
partnerID as partnerID,
sessionID as sessionID,
toString(mapKeys(cast((JSONExtractKeysAndValues(ec.custom, 'String')), 'Map(String, String)'))) as keys,
toString(mapValues(cast((JSONExtractKeysAndValues(ec.custom, 'String')), 'Map(String, String)'))) as values
from audience.uber_all
array join eventContents as ec
PREWHERE date = ('2022-07-22')
WHERE partnerID = 'XXX'
and ec.custom != '{}'
order by date, partnerID
"""
and my code:
df_tab = spark.read \
    .format("jdbc") \
    .option("driver", "com.github.housepower.jdbc.ClickHouseDriver") \
    .option("url", f"jdbc:clickhouse://{ch_host}:9000/{ch_db}") \
    .option("query", query) \
    .option("user", ch_user) \
    .option("password", ch_pass) \
    .load()
But there I get an error (posted as a screenshot: housepower_error).
BUT when I run the above query with the yandex driver ru.yandex.clickhouse.ClickHouseDriver, everything works fine (even with the housepower jar).
The error also appears when I want to import a column like this:
JSONExtractKeysAndValues(ec.custom, 'String')
or
toString(JSONExtractKeysAndValues(ec.custom, 'String'))
What am I doing wrong?
And can you tell me how to insert a DataFrame with an Array column, using Spark JDBC, into a ClickHouse table that also has an Array(String) column? I was looking everywhere but couldn't find a solution...
Thank you in advance!

Error reading data from DB2 table in Spark

I'm trying to read data from a DB2 table using Spark JDBC, but I'm getting the below error.
Exception in thread "main" com.ibm.db2.jcc.am.SqlSyntaxErrorException: DB2 SQL Error: SQLCODE=-104, SQLSTATE=42601, SQLERRMC=*;SELECT * FROM select;UNION, DRIVER=4.19.26
Code:
val df = spark.sqlContext.read.format("jdbc")
  .option("driver", "com.ibm.db2.jcc.DB2Driver")
  .option("url", "jdbc:db2://server/database")
  .option("user", "user")
  .option("password", "password")
  .option("dbtable", "select * from schema.table limit 10")
  .load()
Any help is appreciated. Thank you.
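A likely cause (a sketch, not from this thread): the dbtable value is substituted straight into a generated SELECT * FROM ... WHERE 1=0 statement, so passing a bare SELECT produces the invalid SQL shown in the SQLERRMC. Pass a table name (or a parenthesised subquery with an alias) instead, and note that DB2 uses FETCH FIRST n ROWS ONLY rather than LIMIT:
// Sketch: give dbtable just the table name and let Spark apply the limit
val df = spark.sqlContext.read.format("jdbc")
  .option("driver", "com.ibm.db2.jcc.DB2Driver")
  .option("url", "jdbc:db2://server/database")
  .option("user", "user")
  .option("password", "password")
  .option("dbtable", "schema.table")
  .load()
  .limit(10)
If the limit must be pushed down to DB2, something like "(SELECT * FROM schema.table FETCH FIRST 10 ROWS ONLY) AS tmp" as the dbtable value should work on recent DB2 versions.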

Query table names

I have been given access to a database. I am querying the data from a spark cluster. How do I check all the databases/tables I have access to?
jdbcUrl = "jdbc:sqlserver://{0}:{1};database={2};user='{3}';password='{4}'".format(jdbcHostname, jdbcPort, jdbcDatabase, jdbcUsername, jdbcPassword)
connectionProperties = {
    "user" : jdbcUsername,
    "password" : jdbcPassword,
    "driver" : "com.microsoft.sqlserver.jdbc.SQLServerDriver"
}
df = spark.read.jdbc(url=jdbcUrl, properties=connectionProperties)
Access to the database has been authenticated.
In SQL Server itself:
select *
from sys.tables
Not sure if you use a synonym or not as the way into the sys schema.
val tables = spark.read.jdbc(jdbc_url, "sys.tables", connectionProperties)
tables.select(...
If you have a synonym, replace sys.tables with that. There are different ways of writing this: you can go down the tables approach or the own-SQL-query approach. The above is the tables approach; here, under the SQL query approach, is an example:
val dataframe_mysql = spark.read.jdbc(jdbcUrl, "(select k, v from sample order by k DESC) e", connectionProperties)
I just realized this is the Scala version. For pyspark specifically, see: https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
It is the same approach, but in this case for postgres:
jdbcDF = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql:dbserver") \
    .option("dbtable", "schema.tablename") \
    .option("user", "username") \
    .option("password", "password") \
    .load()
JDBC loading and saving can be achieved via either the load/save or jdbc methods, see the guide.
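To answer the original question directly with the SQL-query approach (a sketch, assuming SQL Server and the same jdbc_url/connectionProperties as above), the catalog views sys.databases and INFORMATION_SCHEMA.TABLES list what the login is allowed to see:
// Sketch: list visible databases, and the tables of the connected database
val databases = spark.read.jdbc(jdbc_url, "(SELECT name FROM sys.databases) d", connectionProperties)
val tables = spark.read.jdbc(jdbc_url, "(SELECT TABLE_SCHEMA, TABLE_NAME FROM INFORMATION_SCHEMA.TABLES) t", connectionProperties)
databases.show()
tables.show()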

How to generate a spark sql truncate query without Only

I am using Spark (3.0.0_preview) and reading/writing from/to Greenplum (version 5.24). Greenplum 5.24 doesn't support the "TRUNCATE TABLE ONLY $table_name" command.
With Spark 3.0.0_preview and the JDBC driver ("org.postgresql" % "postgresql" % "42.2.5"), the command generated by Spark is "TRUNCATE TABLE ONLY $table_name".
df.write.format("jdbc").option("url", "jdbc:postgresql://test:5432/sample")
.option("user", "sample")
.option("password", "sample")
.option("dbtable", "test.employer")
.option("truncate", true) // **Genearte truncate table only**
.mode(SaveMode.Overwrite)
.save();
I want to generate the truncate command without the ONLY option, as ONLY is not supported by Greenplum v5.24.
As mentioned by @mazaneicha, Spark's PostgreSQL dialect can only generate TRUNCATE TABLE ONLY.
To get this working, I am using plain JDBC from Scala to truncate my table. It is not a good fix, but it will work until we upgrade to Greenplum 6, which supports TRUNCATE TABLE ONLY.
truncate(""test.employer", "jdbc:postgresql://test:5432/sample","sample","sample" )
df.write.format("jdbc").option("url", "jdbc:postgresql://test:5432/sample")
.option("user", "sample")
.option("password", "sample")
.option("dbtable", "test.employer")
.mode(SaveMode.Append)
.save();
import java.sql.DriverManager

def truncate(tableName: String, jdbcUrl: String, username: String, password: String): Unit = {
  // Plain JDBC: issue TRUNCATE TABLE without the ONLY keyword
  val connection = DriverManager.getConnection(jdbcUrl, username, password)
  try {
    connection.setAutoCommit(true)
    connection.createStatement().execute(s"TRUNCATE TABLE $tableName")
  } finally {
    connection.close()
  }
}

Running custom Apache Phoenix SQL query in PySpark

Could someone provide an example, using pyspark, of how to run a custom Apache Phoenix SQL query and store the result of that query in an RDD or DataFrame? Note: I am looking for a custom query, not an entire table read into an RDD.
From Phoenix Documentation, to load an entire table I can use this:
table = sqlContext.read \
    .format("org.apache.phoenix.spark") \
    .option("table", "<TABLENAME>") \
    .option("zkUrl", "<hostname>:<port>") \
    .load()
I want to know what the corresponding equivalent would be for a custom SQL query:
sqlResult = sqlContext.read \
    .format("org.apache.phoenix.spark") \
    .option("sql", "select * from <TABLENAME> where <CONDITION>") \
    .option("zkUrl", "<HOSTNAME>:<PORT>") \
    .load()
Thanks.
This can be done using Phoenix as a JDBC data source as given below:
sql = '(select COL1, COL2 from TABLE where COL3 = 5) as TEMP_TABLE'
df = sqlContext.read.format('jdbc')\
.options(driver="org.apache.phoenix.jdbc.PhoenixDriver", url='jdbc:phoenix:<HOSTNAME>:<PORT>', dbtable=sql).load()
df.show()
However, it should be noted that if there are column aliases in the SQL statement, the .show() call will throw an exception (it will work if you use .select() to select only the columns that are not aliased); this is a possible bug in Phoenix.
Here you need to use .sql to work with custom queries. Here is the syntax:
dataframe = sqlContext.sql("select * from <table> where <condition>")
dataframe.show()
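Note that sqlContext.sql (or spark.sql) runs against Spark's own catalog, not directly against Phoenix, so this only works if the Phoenix table has first been loaded and registered as a temporary view. A minimal Scala sketch of that flow, reusing the placeholders from the question:
// Sketch: load the Phoenix table, expose it as a temp view, then query it with SQL
val phoenixTable = spark.read
  .format("org.apache.phoenix.spark")
  .option("table", "<TABLENAME>")
  .option("zkUrl", "<HOSTNAME>:<PORT>")
  .load()
phoenixTable.createOrReplaceTempView("phoenix_table")
val result = spark.sql("SELECT * FROM phoenix_table WHERE <CONDITION>")
result.show()
For a fully custom SQL statement pushed down to Phoenix, the JDBC approach shown above is the better fit.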
With Spark 2, I didn't have a problem with the .show() function, and I did not have to use .select() to print all values of the DataFrame coming from Phoenix.
So, make sure that your SQL query is inside parentheses; look at my example:
val sql = " (SELECT P.PERSON_ID as PERSON_ID, P.LAST_NAME as LAST_NAME, C.STATUS as STATUS FROM PERSON P INNER JOIN CLIENT C ON C.CLIENT_ID = P.PERSON_ID) "
val dft = dfPerson.sparkSession.read.format("jdbc")
  .option("driver", "org.apache.phoenix.jdbc.PhoenixDriver")
  .option("url", "jdbc:phoenix:<HOSTNAME>:<PORT>")
  .option("useUnicode", "true")
  .option("continueBatchOnError", "true")
  .option("dbtable", sql)
  .load()
dft.show();
It shows me:
+---------+--------------------+------+
|PERSON_ID| LAST_NAME|STATUS|
+---------+--------------------+------+
| 1005| PerDiem|Active|
| 1008|NAMEEEEEEEEEEEEEE...|Active|
| 1009| Admission|Active|
| 1010| Facility|Active|
| 1011| MeUP|Active|
+---------+--------------------+------+
