PySpark raises an error (ORA-00933) when fetching data from an Oracle Database

Context
I am using Databricks to connect to an Oracle Database and fetch data daily. We are using the following example code in PySpark to authenticate and access the database:
testDF = (spark.read
    .format("jdbc")
    .option("url", "jdbc:oracle:thin:@10.80.40.5:1526:D301")
    .option("dbtable", "SCHEMA_TEST.TABLE_TEST")
    .option("user", jdbcUsername)
    .option("password", jdbcPassword)
    .option("driver", "oracle.jdbc.driver.OracleDriver")
    .option("oracle.jdbc.timezoneAsRegion", "false")
    .option("fetchSize", "256")
    .load()
    .createOrReplaceTempView("test")
)
We then access the data via SQL using the command below:
SELECT * FROM test
The problem
Today we realized this approach is raising an error like the one below:
SQLSyntaxErrorException: ORA-00933: SQL command not properly ended
Caused by: OracleDatabaseException: ORA-00933: SQL command not
properly ended
Upon further investigation into the error log, I saw the following message:
Caused by: Error : 933, Position : 2735, Sql = SELECT *
SCHEMA_TEST.TABLE_TEST LIMIT 10 , Error Msg = ORA-00933: SQL command
not properly ended
So I believe that PySpark/DBSQL translates the instruction SELECT * FROM test into SELECT * SCHEMA_TEST.TABLE_TEST LIMIT 10 when fetching the data from the Oracle database.
The problem is that LIMIT does not exist in Oracle's SQL dialect, hence the error.
So I tried an instruction with a WHERE clause, thinking that this would prevent the LIMIT clause from being generated in the backend. The following code was then executed:
SELECT * FROM test WHERE column_example is not null
Well, IT WORKED. So for some reason, PySpark/DBSQL is translating a simple select-star instruction (SELECT * FROM test) incorrectly.
Does anyone know what might be the reason behind this?
Additional info:
Databricks Runtime: 12.0
Spark: 3.3.1
Oracle Driver: oracle.jdbc.driver.OracleDriver
Maven Oracle Driver: com.oracle.ojdbc:ojdbc8:19.3.0.0

Try an alternative approach: wrap the query in parentheses (with an alias) and pass the query instead of referencing the entire table. Note that Oracle does not support LIMIT, so restrict rows with ROWNUM (or FETCH FIRST n ROWS ONLY on Oracle 12c+):
queryString = "(SELECT * FROM SCHEMA_TEST.TABLE_TEST WHERE ROWNUM <= 10) tmp"
print("Query to fetch data from Oracle: ", queryString)
# Fetch data using the query instead of the full table
oracleDataDf = (spark.read
    .format("jdbc")
    .option("url", oracleUrl)
    .option("dbtable", queryString)
    .option("user", oracleDbUser)
    .option("password", oracleDbPass)
    .option("driver", oracleDriver)
    .load()
)
oracleDataDf.show()
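Alternatively, on recent Spark versions (2.4+) you can use the JDBC query option instead of wrapping a subquery in dbtable. A minimal sketch, assuming the same connection variables as above (oracleUrl, oracleDbUser, oracleDbPass, oracleDriver) are already defined:
# A minimal sketch using the JDBC "query" option; Oracle has no LIMIT clause,
# so rows are restricted with ROWNUM (or FETCH FIRST n ROWS ONLY on 12c+).
oracleQueryDf = (spark.read
    .format("jdbc")
    .option("url", oracleUrl)
    .option("query", "SELECT * FROM SCHEMA_TEST.TABLE_TEST WHERE ROWNUM <= 10")
    .option("user", oracleDbUser)
    .option("password", oracleDbPass)
    .option("driver", oracleDriver)
    .load()
)
oracleQueryDf.show()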

Related

How to read data from db2 using jdbc connection?

I'm trying to fetch the data from db2 using
df = spark.read.format("jdbc").option("user", "user").option("password", "password") \
    .option("driver", "com.ibm.db2.jcc.DB2Driver") \
    .option("url", "jdbc:db2://url:<port>/<DB>") \
    .option("query", query) \
    .load()
Locally, the query option works, but on the server it asks me to use dbtable.
When I use dbtable I get a SQL syntax error (SQLCODE=-104, SQLSTATE=42601) and it picks up the wrong columns.
Can someone help me with this?
You can use the AS400 driver to fetch DB2 data using Spark.
Your DB2 URL will look something like this: jdbc:as400://<DBIPAddress>
val query = "(select * from db.temptable) temp"
val df = spark.read.format("jdbc")
  .option("url", <YourURL>)
  .option("driver", "com.ibm.as400.access.AS400JDBCDriver")
  .option("dbtable", query)
  .option("user", <Username>)
  .option("password", <Password>)
  .load()
Please note that you will need to keep the query format as shown above (i.e. give an alias to the query). Hope this resolves your issue.
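If you need the same thing in PySpark (the original question uses PySpark), here is a rough translation of the Scala snippet above, keeping the placeholders from the answer and assuming the AS400 JDBC jar (jt400) is on the classpath:
# A hedged PySpark translation of the Scala example above; <DBIPAddress>,
# <Username>, and <Password> are placeholders to replace with real values.
query = "(select * from db.temptable) temp"  # the subquery must be aliased
df = (spark.read.format("jdbc")
    .option("url", "jdbc:as400://<DBIPAddress>")
    .option("driver", "com.ibm.as400.access.AS400JDBCDriver")
    .option("dbtable", query)
    .option("user", "<Username>")
    .option("password", "<Password>")
    .load()
)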

Using spark MS SQL connector PySpark causes NoSuchMethodError for BulkCopy

I'm trying to use the MS SQL connector for Spark to insert high volumes of data from pyspark.
After creating a session:
SparkSession.builder
.config('spark.jars.packages', 'org.apache.hadoop:hadoop-azure:3.2.0,org.apache.spark:spark-avro_2.12:3.1.2,com.microsoft.sqlserver:mssql-jdbc:8.4.1.jre8,com.microsoft.azure:spark-mssql-connector_2.12:1.2.0')
I get the following error:
ERROR executor.Executor: Exception in task 6.0 in stage 12.0 (TID 233)
java.lang.NoSuchMethodError: 'void com.microsoft.sqlserver.jdbc.SQLServerBulkCopy.writeToServer(com.microsoft.sqlserver.jdbc.ISQLServerBulkData)'
at com.microsoft.sqlserver.jdbc.spark.BulkCopyUtils$.bulkWrite(BulkCopyUtils.scala:110)
at com.microsoft.sqlserver.jdbc.spark.BulkCopyUtils$.savePartition(BulkCopyUtils.scala:58)
at com.microsoft.sqlserver.jdbc.spark.SingleInstanceWriteStrategies$.$anonfun$write$2(BestEffortSingleInstanceStrategy.scala:43)
at com.microsoft.sqlserver.jdbc.spark.SingleInstanceWriteStrategies$.$anonfun$write$2$adapted(BestEffortSingleInstanceStrategy.scala:42)
at org.apache.spark.rdd.RDD.$anonfun$foreachPartition$2(RDD.scala:1020)
at org.apache.spark.rdd.RDD.$anonfun$foreachPartition$2$adapted(RDD.scala:1020)
at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2236)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
When trying to write data like this:
try:
    (
        df.write.format("com.microsoft.sqlserver.jdbc.spark")
        .mode("append")
        .option("url", url)
        .option("dbtable", table_name)
        .option("user", username)
        .option("password", password)
        .option("schemaCheckEnabled", "false")
        .save()
    )
except ValueError as error:
    print("Connector write failed", error)
I tried different versions of spark and the sql connector but no luck so far.
I also tried using a jar for the mssql-jdbc dependency directly:
SparkSession.builder
.config('spark.jars', '/mssql-jdbc-8.4.1.jre8.jar')
.config(...)
It still complains that it can't find the method; however, if you inspect the JAR file, the method is defined in the source code.
Any tips on where to look are welcome!
We reproduced the same scenario in our environment and it is now working correctly.
There is an issue in JDBC driver 8.2.2; you can use an older version of the library.
With the older driver, the data was inserted into the table from PySpark.
Reference: NoSuchMethodError for BulkCopy.
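For illustration, a minimal sketch of pinning a specific driver and connector pair when building the session; the versions shown are assumptions and should be matched to your Spark version:
# A hedged sketch: pin the JDBC driver and Spark connector versions explicitly
# when creating the session (versions here are examples, not a verified fix).
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("mssql-bulk-write")
    .config("spark.jars.packages",
            "com.microsoft.sqlserver:mssql-jdbc:8.4.1.jre8,"
            "com.microsoft.azure:spark-mssql-connector_2.12:1.2.0")
    .getOrCreate()
)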
Using com.microsoft.sqlserver:mssql-jdbc:8.4.1.jre8 is one thing, but you also need the proper version of Microsoft's Spark SQL connector, compatible with your Spark version.
com.microsoft.azure:spark-mssql-connector_2.12_3.0:1.0.0-alpha and com.microsoft.sqlserver:mssql-jdbc:8.4.1.jre8 did not work in my case, as I'm using AWS Glue 3.0 (which is Spark 3.1).
I had to switch to com.microsoft.azure:spark-mssql-connector_2.12:1.2.0, as it is compatible with Spark 3.1.
def write_df_to_target(self, df, schema_table):
    spark = self.gc.spark_session
    spark.builder.config('spark.jars.packages', 'com.microsoft.sqlserver:mssql-jdbc:8.4.1.jre8,com.microsoft.azure:spark-mssql-connector_2.12:1.2.0').getOrCreate()
    credentials = self.get_credentials(self.replica_connection_name)
    df.write \
        .format("com.microsoft.sqlserver.jdbc.spark") \
        .option("url", credentials["url"] + ";databaseName=" + self.database_name) \
        .option("dbtable", schema_table) \
        .option("user", credentials["user"]) \
        .option("password", credentials["password"]) \
        .option("batchsize", "100000") \
        .option("numPartitions", "15") \
        .save()
One last thing: the AWS Glue job must have the --user-jars-first: "true" parameter.
This parameter tells Glue to load the provided jars first (i.e., they override the default ones).
Check whether an equivalent parameter exists on your end.

Unrecognized connection property 'url' when using Presto JDBC in Spark SQL

Here is my Spark SQL code, where I am trying to read a Presto table based on this guide: https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
val df = spark.read
.format("jdbc")
.option("driver", "com.facebook.presto.jdbc.PrestoDriver")
.option("url", "jdbc:presto://localhost:8889/mycatalog")
.option("query", "select * from mydb.mytable limit 1")
.option("user", "myuserid")
.load()
 
I am getting the following exception: unrecognized connection property 'url'
Exception in thread "main" java.sql.SQLException: Unrecognized connection property 'url'
at com.facebook.presto.jdbc.PrestoDriverUri.validateConnectionProperties(PrestoDriverUri.java:345)
at com.facebook.presto.jdbc.PrestoDriverUri.<init>(PrestoDriverUri.java:102)
at com.facebook.presto.jdbc.PrestoDriverUri.<init>(PrestoDriverUri.java:92)
at com.facebook.presto.jdbc.PrestoDriver.connect(PrestoDriver.java:87)
at org.apache.spark.sql.execution.datasources.jdbc.connection.BasicConnectionProvider.getConnection(BasicConnectionProvider.scala:49)
at org.apache.spark.sql.execution.datasources.jdbc.connection.ConnectionProvider$.create(ConnectionProvider.scala:68)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$createConnectionFactory$1(JdbcUtils.scala:62)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:56)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:226)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:354)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:326)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:308)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:308)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:226)
at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:341)
This issue seems related to https://github.com/prestodb/presto/issues/9254, where the url property is not recognized by Presto, and it looks like the fix needs to be made on the Spark side. Is there any other workaround for this issue?
PS:
Spark Version: 3.1.1
presto-jdbc version: 0.245
Looks like a Spark bug, fixed in 3.3:
https://issues.apache.org/jira/browse/SPARK-36163
There is no issue with Spark or the Presto JDBC driver. I don't think the URL you specified will work. You should change it to the format below.
jdbc:presto://localhost:8889/mycatalog
UPDATE
Not sure how it works with Spark versions < 3. As a workaround, you can use another jar in which the strict config check has been removed, as specified here.
@odonnry is correct that the issue was fixed in Spark 3.3.x, but if anyone cannot upgrade to Spark 3.3.x and is trying to use Trino, I created a workaround below according to the Jira issue linked by @Mohana:
https://github.com/amitferman/trino

How to create a table with primary key using jdbc spark connector (to ignite)

I'm trying to save a spark dataframe to the ignite cache using spark connector (pyspark) like this:
df.write.format("jdbc") \
.option("url", "jdbc:ignite:thin://<ignite ip>") \
.option("driver", "org.apache.ignite.IgniteJdbcThinDriver") \
.option("primaryKeyFields", 'id') \
.option("dbtable", "ignite") \
.mode("overwrite") \
.save()
# .option("createTableOptions", "primary key (id)") \
# .option("customSchema", 'id BIGINT PRIMARY KEY, txt TEXT') \
I have an error:
java.sql.SQLException: No PRIMARY KEY defined for CREATE TABLE
The library org.apache.ignite:ignite-spark-2.4:2.9.0 is installed. I can't use the ignite format because Azure Databricks uses a Spring Framework version that conflicts with the Spring Framework version in org.apache.ignite:ignite-spark-2.4:2.9.0. So I'm trying to use the JDBC thin client, but I can only read from or append to an existing cache.
I can't use the overwrite mode because I can't choose a primary key. There is a primaryKeyFields option for the ignite format, but it doesn't work with JDBC. The JDBC customSchema option is ignored, and createTableOptions adds the primary key statement after the closing parenthesis of the schema, which causes a SQL syntax error.
Is there a way to define a primary key with the JDBC Spark connector?
Here's an example with correct syntax that should work fine:
DataFrameWriter<Row> df = resultDF
    .write()
    .format(IgniteDataFrameSettings.FORMAT_IGNITE())
    .option(IgniteDataFrameSettings.OPTION_CONFIG_FILE(), configPath)
    .option(IgniteDataFrameSettings.OPTION_TABLE(), "Person")
    .option(IgniteDataFrameSettings.OPTION_CREATE_TABLE_PRIMARY_KEY_FIELDS(), "id, city_id")
    .option(IgniteDataFrameSettings.OPTION_CREATE_TABLE_PARAMETERS(), "template=partitioned,backups=1")
    .mode(Append);
Please let me know if something is wrong here.
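For reference, here is a rough PySpark equivalent of the Java snippet above. The string option keys ("config", "table", "primaryKeyFields", "createTableParameters") are assumed to match the IgniteDataFrameSettings constants, configPath is a placeholder path to an Ignite XML configuration file, and the ignite-spark integration must be on the classpath:
# A hedged PySpark sketch of the Java example above; the option keys and
# configPath are assumptions, not verified against your Ignite version.
(df.write
    .format("ignite")
    .option("config", configPath)
    .option("table", "Person")
    .option("primaryKeyFields", "id, city_id")
    .option("createTableParameters", "template=partitioned,backups=1")
    .mode("append")
    .save())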

Read mariaDB4J table into spark DataSet using JDBC

I am trying to read a table from MariaDB4J via JDBC using the following command:
Dataset<Row> jdbcDF = spark.read()
    .format("jdbc")
    .option("url", url)
    .option("dbtable", String.format("SELECT userID FROM %s;", TableNAME))
    .load();
I get the following error:
com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near 'SELECT userID FROM MYTABLE; WHERE 1=0' at line 1
I am not sure where the WHERE comes from and why I get this error...
Thanks
The value supplied to the dbtable option is not correct. Instead of specifying a raw SQL query, you should either use a table name qualified with the schema or a valid SQL subquery with an alias.
Dataset<Row> jdbcDF = spark.read()
    .format("jdbc")
    .option("url", url)
    .option("dbtable", String.format("(SELECT userID FROM %s) as table_alias", tableNAME))
    .load();
Your question is similar to this one, so forgive me if I quote myself:
The reason why you see the strange looking query WHERE 1=0 is that Spark tries to infer the schema of your data frame without loading any actual data. This query is guaranteed never to deliver any results, but the query result's metadata can be used by Spark to get the schema information.
