Error reading data from DB2 table in Spark - apache-spark

I'm trying to read data from a DB2 table using Spark JDBC, but I'm getting the error below.
Exception in thread "main" com.ibm.db2.jcc.am.SqlSyntaxErrorException: DB2 SQL Error: SQLCODE=-104, SQLSTATE=42601, SQLERRMC=*;SELECT * FROM select;UNION, DRIVER=4.19.26
Code:
val df = spark.sqlContext.read.format("jdbc")
.option("driver", "com.ibm.db2.jcc.DB2Driver")
.option("url", "jdbc:db2://server/database")
.option("user", "user")
.option("password", "password")
.option("dbtable","select * from schema.table limit 10").load()
Any help is appreciated. Thank you.
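For reference, Spark substitutes the dbtable value into a generated statement of the form SELECT ... FROM <dbtable>, which is why passing a bare SELECT produces the mangled SQL in the SQLCODE=-104 error. The usual fix is to pass a plain table name, or to wrap the query in parentheses with an alias, the same subquery trick shown in the last answer on this page. A minimal Scala sketch, untested against DB2 (FETCH FIRST is used because DB2 may reject LIMIT):

// Sketch: pass a table name, or a parenthesised subquery with your own alias,
// so the SQL Spark generates ("SELECT ... FROM <dbtable> ...") stays valid.
val df = spark.read
.format("jdbc")
.option("driver", "com.ibm.db2.jcc.DB2Driver")
.option("url", "jdbc:db2://server/database")
.option("user", "user")
.option("password", "password")
.option("dbtable", "(SELECT * FROM schema.table FETCH FIRST 10 ROWS ONLY) AS tmp")
.load()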

Related

Getting error: Py4JJavaError while writing data from Databricks to SQL Data Warehouse

I followed this link on writing data from Databricks to SQL Data Warehouse.
dataframe.write
.format("com.databricks.spark.sqldw")
.option("url", "jdbc:sqlserver.......")
.option("forwardSparkAzureStorageCredentials", "true")
.option("dbTable", "table")
.option("tempDir", "Blob_url")
.save()
but I am still getting this error:
Py4JJavaError: An error occurred while calling o174.save.
: java.lang.ClassNotFoundException
Please follow the steps below:
Configure Azure Storage account access key with Databricks
spark.conf.set(
"fs.azure.account.key.<storage_account>.blob.core.windows.net","Azure_access_key")
Syntax of the JDBC URL:
jdbc_url = "jdbc:sqlserver://<Server_name>.sql.azuresynapse.net:1433;database=<Database_name>;user=<user_name>;password=<password>;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=30;"
Sample Data frame:
Writing data from Azure Databricks to Synapse:
df.write \
.format("com.databricks.spark.sqldw") \
.option("url", jdbc_url) \
.option("forwardSparkAzureStorageCredentials", "true") \
.option("dbTable", "<table_name>") \
.option("tempDir", "wasbs://sdd#vambl.blob.core.windows.net/") \
.mode("overwrite") \
.save()
Output:
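The written table can then be read back through the same connector to verify the load. A rough Scala sketch (untested; the JDBC URL, container, storage account, and table name below are placeholders to fill in):

// Sketch: read the table back via com.databricks.spark.sqldw to check the write.
val check = spark.read
.format("com.databricks.spark.sqldw")
.option("url", "<jdbc_url>")
.option("forwardSparkAzureStorageCredentials", "true")
.option("tempDir", "wasbs://<container>@<storage_account>.blob.core.windows.net/")
.option("dbTable", "<table_name>")
.load()
check.show(5)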

How do I connect to Teradata from pyspark?

I am trying to connect to Teradata from PySpark. Can you please tell me how I can do that? I tried looking online but can't find anything.
I have the following jars :
tdgssconfig-15.10.00.14.jar
teradata-connector-1.4.1.jar
You need the JDBC connector JAR (get the one for the Teradata version you are connecting to), and then use something like this:
sc.addJar("yourDriver.jar")
val jdbcDF = sqlContext.load("jdbc", Map(
"url" -> "jdbc:teradata://<server_name>, TMODE=TERA, user=my_user, password=*****",
"dbtable" -> "schema.table_name",
"driver" -> "com.teradata.jdbc.TeraDriver"))
Step 1: Find the appropriate jdbc driver for the version of teradata you are using:
https://downloads.teradata.com/download/connectivity/jdbc-driver
Step 2: Go through the tutorial here
https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
Sample code:
val jdbcDF = spark.read
.format("jdbc")
.option("url", "<jdbc_connection_string>")
.option("dbtable", "schema.tablename")
.option("user", "username")
.option("password", "password")
.load()
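Whichever route you take, the driver jar has to be visible to both the driver and the executors. One way (a sketch only; the jar names and paths below are illustrative — the Teradata JDBC driver of that era typically ships as terajdbc4.jar plus tdgssconfig.jar) is to list them in spark.jars when building the session:

import org.apache.spark.sql.SparkSession

// Sketch: adjust the paths to wherever the Teradata jars live on your machines.
val spark = SparkSession.builder()
.appName("teradata-read")
.config("spark.jars", "/path/to/terajdbc4.jar,/path/to/tdgssconfig-15.10.00.14.jar")
.getOrCreate()
// ...then read with spark.read.format("jdbc") exactly as in the sample code above.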

Spark over JDBC uses an alias not allowed

Using Spark 2.4.0 and Exasol 6.2.0, I want to create a DataFrame over JDBC from a simple query: SELECT * FROM table_name.
This is the Scala code:
val df = sparkSession.read
.format("jdbc")
.option("driver", sourceJDBCConn("jdbcDriver"))
.option("url", sourceJDBCConn("jdbcUrl"))
.option("user", sourceJDBCConn("jdbcUsername"))
.option("password", sourceJDBCConn("jdbcPassword"))
.option("query", "SELECT * FROM table_name")
.load()
This works with PostgreSQL but not with Exasol. If I look into the session-audit I find the following SQL:
SELECT * FROM (SELECT * FROM table_name) __SPARK_GEN_JDBC_SUBQUERY_NAME_0 WHERE 1=0
The error message from Exasol is the following:
syntax error, unexpected $undefined, expecting UNION_ or EXCEPT_ or MINUS_ or INTERSECT_ [line 1, column 98]
While PostgreSQL seems to accept the alias with the two leading underscores, such an alias is not allowed in Exasol. Is there any workaround to change the alias __SPARK_GEN_JDBC_SUBQUERY_NAME_0 to an identifier that Exasol accepts?
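One workaround worth trying (a sketch, not verified against Exasol): the generated __SPARK_GEN_JDBC_SUBQUERY_NAME_0 alias only appears when the query option is used, so passing the statement through dbtable with an alias of your own choosing, reusing the sourceJDBCConn lookups from the question, avoids it:

// Sketch: with "dbtable" Spark uses the string verbatim as the FROM clause,
// so the alias is whatever we write here instead of the generated one.
val df = sparkSession.read
.format("jdbc")
.option("driver", sourceJDBCConn("jdbcDriver"))
.option("url", sourceJDBCConn("jdbcUrl"))
.option("user", sourceJDBCConn("jdbcUsername"))
.option("password", sourceJDBCConn("jdbcPassword"))
.option("dbtable", "(SELECT * FROM table_name) spark_src")
.load()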

Query table names

I have been given access to a database. I am querying the data from a spark cluster. How do I check all the databases/tables I have access to?
jdbcUrl = "jdbc:sqlserver://{0}:{1};database={2};user='{3}';password='{4}'".format(jdbcHostname, jdbcPort, jdbcDatabase, jdbcUsername, jdbcPassword)
connectionProperties = {
"user" : jdbcUsername,
"password" : jdbcPassword,
"driver" : "com.microsoft.sqlserver.jdbc.SQLServerDriver"
}
df = spark.read.jdbc(url=jdbcUrl, properties=connectionProperties)
Access to the database has been authenticated.
In SQL Server itself:
select *
from sys.tables
Not sure if you use a synonym or not as the way into the sys schema.
val tables = spark.read.jdbc(jdbc_url, "sys.tables", connectionProperties)
tables.select(...
If you have a synonym, replace sys.tables with it. There are two ways of writing this: the tables approach or your own SQL query approach. The above is the tables approach; here is an example of the SQL query approach:
val dataframe_mysql = spark.read.jdbc(jdbcUrl, "(select k, v from sample order by k DESC) e", connectionProperties)
This is the Scala version, I just realized.
For pyspark specifically:
See: https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
It is the same approach; this particular example just happens to use Postgres:
jdbcDF = spark.read \
.format("jdbc") \
.option("url", "jdbc:postgresql:dbserver") \
.option("dbtable", "schema.tablename") \
.option("user", "username") \
.option("password", "password") \
.load()
JDBC loading and saving can be achieved via either the load/save or jdbc methods, see the guide.
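For the databases half of the question, the same pattern can be pointed at another SQL Server catalog view. A minimal Scala sketch (connection details are placeholders; assumes your login is allowed to read sys.databases):

import java.util.Properties

// Sketch: same kind of connection details as above, placeholders to fill in.
val jdbc_url = "jdbc:sqlserver://<host>:<port>;database=<database>"
val connectionProperties = new Properties()
connectionProperties.put("user", "<username>")
connectionProperties.put("password", "<password>")
connectionProperties.put("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")

// sys.databases lists the databases visible to this login,
// the same way sys.tables lists the tables in the current database.
val databases = spark.read.jdbc(jdbc_url, "sys.databases", connectionProperties)
databases.select("name").show(false)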

spark sql load all data into memory without using where clause

So I have a really huge table containing billions of rows. I tried the Spark DataFrame API to load the data; here is my code:
sql = "select * from mytable where day = 2016-11-25 and hour = 10"
df = sqlContext.read \
.format("jdbc") \
.option("driver", driver) \
.option("url", url) \
.option("user", user) \
.option("password", password) \
.option("dbtable", table) \
.load(sql)
df.show()
I ran the SQL in MySQL and it returns about 100 rows, but the same statement did not work in Spark SQL; it hits an OOM error. It seems that Spark SQL loads all the data into memory without applying the where clause. How can I get Spark SQL to use the where clause?
I have solved the problem; the Spark docs give the answer. The key is to change the "dbtable" option and make your SQL a subquery. The correct answer is:
# 1. write your query SQL as a subquery
sql = "(select * from mytable where day = 2016-11-25 and hour = 10) t1"
# 2. change the "dbtable" option to your subquery SQL
df = sqlContext.read \
.format("jdbc") \
.option("driver", driver) \
.option("url", url) \
.option("user", user) \
.option("password", password) \
.option("dbtable", sql) \
.load()
df.show()
sql = "(select * from mytable where day = 2016-11-25 and hour = 10) as t1"
df = sqlContext.read
.format("jdbc")
.option("driver", driver)
.option("url", url)
.option("user", user)
.option("password", password)
.option("dbtable", sql)
.load(sql)
df.show()
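On Spark 2.4 and later the same pushdown can also be expressed with the query option instead of hand-wrapping the SQL in dbtable (this is the form used in the Exasol question above, with the alias caveat noted there). A minimal Scala sketch, where driver, url, user and password stand for the same connection settings as in the question:

// Sketch: the "query" option makes Spark wrap the statement itself, so only the
// rows matching the predicate are fetched. Note the quoted date literal.
val df = spark.read
.format("jdbc")
.option("driver", driver)
.option("url", url)
.option("user", user)
.option("password", password)
.option("query", "select * from mytable where day = '2016-11-25' and hour = 10")
.load()
df.show()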
