Read MariaDB4J table into Spark Dataset using JDBC - apache-spark

I am trying to read a table from MariaDB4J via JDBC using the following command:
Dataset<Row> jdbcDF = spark.read()
.format("jdbc")
.option("url", url)
.option("dbtable", String.format("SELECT userID FROM %s;", TableNAME))
.load();
I get the following error:
com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near 'SELECT userID FROM MYTABLE; WHERE 1=0' at line 1
I am not sure where the WHERE comes from and why I get this error...
Thanks

The value supplied to the dbtable option is not correct. Instead of specifying a bare SQL query, you should use either a table name (optionally qualified with the schema) or a valid SQL query wrapped in parentheses and given an alias.
Dataset<Row> jdbcDF = spark.read()
.format("jdbc")
.option("url", url)
.option("dbtable", String.format("(SELECT userID FROM %s) as table_alias", tableNAME))
.load();
Your question is similar to this one, so forgive me if I quote myself:
The reason you see the strange-looking query WHERE 1=0 is that Spark tries to infer the schema of your DataFrame without loading any actual data: it wraps whatever you pass to dbtable in a statement of roughly the form SELECT * FROM <dbtable> WHERE 1=0, which is also why the trailing semicolon in your string ends up in the middle of the statement and breaks the syntax. This query is guaranteed never to deliver any results, but its metadata can be used by Spark to obtain the schema information.
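As a side note, newer Spark versions (2.4+) also offer a query option that accepts a bare SELECT statement, so no parentheses or alias are needed. A minimal sketch in Scala (the Java API takes the same .option("query", ...) call); url and tableName are placeholders as in the snippet above:
// Minimal sketch (Spark 2.4+): the "query" option takes a bare SELECT, no alias needed.
// url and tableName are the same placeholders as in the Java snippet above.
val jdbcDF = spark.read
  .format("jdbc")
  .option("url", url)
  .option("query", s"SELECT userID FROM $tableName")
  .load()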

Related

PySpark is raising error (ORA-00933) when fetching data from Oracle Database

Context
I am using Databricks to connect to an Oracle Database and fetch data daily. We are using the following example code in PySpark to authenticate and access the database:
testDF = (spark.read
.format("jdbc")
.option("url", "jdbc:oracle:thin:#10.80.40.5:1526:D301")
.option("dbtable", "SCHEMA_TEST.TABLE_TEST")
.option("user", jdbcUsername)
.option("password", jdbcPassword)
.option("driver", "oracle.jdbc.driver.OracleDriver")
.option("oracle.jdbc.timezoneAsRegion", "false")
.option("fetchSize", "256")
.load()
.createOrReplaceTempView("test")
)
Then we used to access the information via SQL code using the command below:
SELECT * FROM test
The problem
Today we realized this approach is raising an error like the one below:
SQLSyntaxErrorException: ORA-00933: SQL command not properly ended
Caused by: OracleDatabaseException: ORA-00933: SQL command not
properly ended
Upon further investigation into the error log, I saw the following message:
Caused by: Error : 933, Position : 2735, Sql = SELECT *
SCHEMA_TEST.TABLE_TEST LIMIT 10 , Error Msg = ORA-00933: SQL command
not properly ended
So I believe that PySpark/DBSQL translates the instruction SELECT * FROM test into the instruction SELECT * SCHEMA_TEST.TABLE_TEST LIMIT 10 to fetch information from the Oracle database.
The problem is that LIMIT is not valid Oracle SQL syntax, hence the error.
So I tried a statement with a WHERE clause, thinking the translation would then avoid adding the LIMIT clause in the backend. The following code was executed:
SELECT * FROM test WHERE column_example is not null
Well, IT WORKED. So for some reason PySpark/DBSQL translates a plain select-star statement (SELECT * FROM test) incorrectly.
Does someone know what might be the reason behind it?
Additional info:
Databricks Runtime: 12.0
Spark: 3.3.1
Oracle Driver: oracle.jdbc.driver.OracleDriver
Maven Oracle Driver: com.oracle.ojdbc:ojdbc8:19.3.0.0
Try an alternative approach: enclose the query in parentheses (with an alias) and pass the query instead of the entire table.
queryString = "(SELECT * SCHEMA_TEST.TABLE_TEST LIMIT 10)"
print("Query to fetch data from Oracle:: ", query)
# Fetch data from query
oracleDataDf = spark.read
.format("jdbc")
.option("url", oracleUrl)
.option("dbtable", queryString)
.option("user", oracleDbUser)
.option("password", oracleDbPass)
.option("driver", oracleDriver)
.load()
oracleDataDf.show()
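Alternatively, and only as a sketch, on Spark 2.4+ the query option can push a bare SELECT down to Oracle using Oracle's own row-limiting syntax instead of LIMIT. This Scala sketch assumes Oracle 12c+ (for FETCH FIRST) and reuses the placeholder connection variables from the snippet above:
// Sketch: Spark 2.4+ "query" option with Oracle-native row limiting (Oracle 12c+).
// oracleUrl, oracleDbUser, oracleDbPass and oracleDriver are the same placeholders as above.
val oracleDF = spark.read.format("jdbc")
  .option("url", oracleUrl)
  .option("driver", oracleDriver)
  .option("user", oracleDbUser)
  .option("password", oracleDbPass)
  .option("query", "SELECT * FROM SCHEMA_TEST.TABLE_TEST FETCH FIRST 10 ROWS ONLY")
  .load()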

How to read data from DB2 using a JDBC connection?

I'm trying to fetch data from DB2 using:
df = spark.read.format("jdbc").option("user", "user").option("password", "password") \
    .option("driver", "com.ibm.db2.jcc.DB2Driver") \
    .option("url", "jdbc:db2://url:<port>/<DB>") \
    .option("query", query) \
    .load()
Locally the query option works, but on the server it asks me to use dbtable.
When I use dbtable I get a SQL syntax error (SQLCODE=-104, SQLSTATE=42601) and it picks up the wrong columns.
Can someone help me with this?
You can use the AS400 driver to fetch DB2 data using Spark.
Your DB2 URL will look something like this: jdbc:as400://<DBIPAddress>
val query = "(select * from db.temptable) temp"
val df = spark.read.format("jdbc")
  .option("url", <YourURL>)
  .option("driver", "com.ibm.as400.access.AS400JDBCDriver")
  .option("dbtable", query)
  .option("user", <Username>)
  .option("password", <Password>)
  .load()
Please note that you will need to keep the query format as shown above (i.e. give an alias to the query). Hope this resolves your issue.
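If the target is DB2 LUW rather than DB2 on IBM i, the same parenthesised-query-with-alias pattern should also work with the driver from the question. A minimal Scala sketch; the URL, credentials, column and table names are placeholders:
// Sketch using the DB2 LUW driver from the question; URL, credentials,
// column and table names are placeholders.
val db2Query = "(SELECT col1, col2 FROM myschema.mytable) tmp"
val db2DF = spark.read.format("jdbc")
  .option("url", "jdbc:db2://<host>:<port>/<DB>")
  .option("driver", "com.ibm.db2.jcc.DB2Driver")
  .option("dbtable", db2Query)
  .option("user", db2User)
  .option("password", db2Password)
  .load()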

Connect to Hive with a JDBC driver in Spark

I need to move data from a remote Hive to a local Hive with Spark. I connect to the remote Hive with the JDBC driver 'org.apache.hive.jdbc.HiveDriver'. When I read from Hive, the result contains the column headers as the column values instead of the actual data:
df = self.spark_session.read.format('jdbc') \
    .option('url', f"jdbc:hive2://{self.host}:{self.port}/{self.database}") \
    .option('driver', 'org.apache.hive.jdbc.HiveDriver') \
    .option("user", self.username) \
    .option("password", self.password) \
    .option('dbtable', 'test_table') \
    .load()
df.show()
Result:
+----------+
|str_column|
+----------+
|str_column|
|str_column|
|str_column|
|str_column|
|str_column|
+----------+
I know that Hive over JDBC isn't officially supported in Apache Spark, but I have already found solutions for reading from other unsupported sources, such as IBM Informix. Maybe someone has already solved this problem.
After debug&trace the code we will find the problem in JdbcDialect。There is no HiveDialect so spark will use default JdbcDialect.quoteIdentifier。
So you should implement a HiveDialect to fix this problem:
import org.apache.spark.sql.jdbc.JdbcDialect

class HiveDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean =
    url.startsWith("jdbc:hive2")

  override def quoteIdentifier(colName: String): String = {
    if (colName.contains(".")) {
      val colName1 = colName.substring(colName.indexOf(".") + 1)
      return s"`$colName1`"
    }
    s"`$colName`"
  }
}
And then register the Dialect by:
import org.apache.spark.sql.jdbc.JdbcDialects

JdbcDialects.registerDialect(new HiveDialect)
Finally, add the option hive.resultset.use.unique.column.names=false to the URL, like this:
option("url", "jdbc:hive2://bigdata01:10000?hive.resultset.use.unique.column.names=false")
refer to csdn blog
Apache Kyuubi provides a Hive dialect plugin here:
https://kyuubi.readthedocs.io/en/latest/extensions/engines/spark/jdbc-dialect.html
The Hive dialect plugin aims to provide Hive dialect support to Spark's JDBC source. It is auto-registered to Spark and applied to JDBC sources with a URL prefix of jdbc:hive2:// or jdbc:kyuubi://. It quotes identifiers in Hive SQL style, e.g. table.column becomes `table`.`column`.
Compile and get the dialect plugin from Kyuubi (it's a standalone Spark plugin, independent of Kyuubi).
Put the jar into $SPARK_HOME/jars.
Add the plugin to the config spark.sql.extensions=org.apache.spark.sql.dialect.KyuubiSparkJdbcDialectExtension; it will be auto-registered to Spark (a short sketch of the resulting setup follows below).
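Putting the steps above together, here is a minimal Scala sketch of the read path once the plugin jar is on the classpath; the host, port and table name are the placeholders used earlier:
import org.apache.spark.sql.SparkSession

// Sketch assuming the Kyuubi dialect jar is already in $SPARK_HOME/jars;
// host, port and table name are placeholders.
val spark = SparkSession.builder()
  .appName("hive-over-jdbc")
  .config("spark.sql.extensions",
    "org.apache.spark.sql.dialect.KyuubiSparkJdbcDialectExtension")
  .getOrCreate()

val df = spark.read.format("jdbc")
  .option("url", "jdbc:hive2://bigdata01:10000?hive.resultset.use.unique.column.names=false")
  .option("driver", "org.apache.hive.jdbc.HiveDriver")
  .option("dbtable", "test_table")
  .load()

df.show()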

How to create a table with a primary key using the JDBC Spark connector (to Ignite)

I'm trying to save a Spark DataFrame to the Ignite cache using the Spark connector (PySpark) like this:
df.write.format("jdbc") \
.option("url", "jdbc:ignite:thin://<ignite ip>") \
.option("driver", "org.apache.ignite.IgniteJdbcThinDriver") \
.option("primaryKeyFields", 'id') \
.option("dbtable", "ignite") \
.mode("overwrite") \
.save()
# .option("createTableOptions", "primary key (id)") \
# .option("customSchema", 'id BIGINT PRIMARY KEY, txt TEXT') \
I have an error:
java.sql.SQLException: No PRIMARY KEY defined for CREATE TABLE
The library org.apache.ignite:ignite-spark-2.4:2.9.0 is installed. I can't use the ignite format because Azure Databricks uses a Spring Framework version that conflicts with the one bundled in org.apache.ignite:ignite-spark-2.4:2.9.0. So I'm trying to use the JDBC thin client, but I can only read from or append to an existing cache.
I can't use overwrite mode because I can't choose a primary key. There is a primaryKeyFields option for the ignite format, but it doesn't work with jdbc. The jdbc customSchema option is ignored, and createTableOptions adds the primary key clause after the closing parenthesis of the schema, which causes a SQL syntax error.
Is there a way to determine a primary key for the jdbc spark connector?
Here's an example with correct syntax that should work fine:
DataFrameWriter < Row > df = resultDF
.write()
.format(IgniteDataFrameSettings.FORMAT_IGNITE())
.option(IgniteDataFrameSettings.OPTION_CONFIG_FILE(), configPath)
.option(IgniteDataFrameSettings.OPTION_TABLE(), "Person")
.option(IgniteDataFrameSettings.OPTION_CREATE_TABLE_PRIMARY_KEY_FIELDS(), "id, city_id")
.option(IgniteDataFrameSettings.OPTION_CREATE_TABLE_PARAMETERS(), "template=partitioned,backups=1")
.mode(Append);
Please let me know if something is wrong here.
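Since the question specifically asks about the JDBC path, one possible workaround (a sketch, not a documented connector feature) is to create the table with the desired primary key up front through the Ignite thin JDBC driver and only then append from Spark. A Scala sketch using the placeholder URL, table and columns from the question:
import java.sql.DriverManager

// Workaround sketch: pre-create the table with the desired PRIMARY KEY via the
// Ignite thin JDBC driver, then append rows from Spark. URL, table and columns
// are placeholders based on the question.
val conn = DriverManager.getConnection("jdbc:ignite:thin://<ignite ip>")
try {
  conn.createStatement().execute(
    "CREATE TABLE IF NOT EXISTS ignite (id BIGINT PRIMARY KEY, txt VARCHAR)")
} finally {
  conn.close()
}

df.write.format("jdbc")
  .option("url", "jdbc:ignite:thin://<ignite ip>")
  .option("driver", "org.apache.ignite.IgniteJdbcThinDriver")
  .option("dbtable", "ignite")
  .mode("append")
  .save()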

spark-sql Table or view not found error

I'm trying to run a basic Java program using Spark SQL and JDBC, and I'm running into the following error. Not sure what's wrong here; most of the material I have read does not explain what needs to be done to fix this problem.
It would also be great if someone could point me to some good material on Spark SQL (Spark 2.1.1). I'm planning to use Spark to implement ETLs, connecting to MySQL and other data sources.
Exception in thread "main" org.apache.spark.sql.AnalysisException: Table or view not found: myschema.mytable; line 1 pos 21;
String MYSQL_CONNECTION_URL = "jdbc:mysql://localhost:3306/myschema";
String MYSQL_USERNAME = "root";
String MYSQL_PWD = "root";
Properties connectionProperties = new Properties();
connectionProperties.put("user", MYSQL_USERNAME);
connectionProperties.put("password", MYSQL_PWD);
Dataset<Row> jdbcDF2 = spark.read()
.jdbc(MYSQL_CONNECTION_URL, "myschema.mytable", connectionProperties);
spark.sql("SELECT COUNT(*) FROM myschema.mytable").show();
That's because, by default, Spark does not register any tables from the JDBC connection's schemas in the Spark SQL context. You must register it yourself:
jdbcDF2.createOrReplaceTempView("mytable");
spark.sql("select count(*) from mytable");
Your jdbcDF2 has its source in myschema.mytable in MySQL and will load data from that table on some action.
Remember that a MySQL table is not the same as a Spark table or view. You are telling Spark to read data from MySQL, but you must register the DataFrame or Dataset as a table or view in the current Spark SQL context or Spark session.
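For completeness, the whole flow (read over JDBC, register a temp view, query it) can be written as a short Scala sketch reusing the connection values from the question:
// Sketch in Scala, reusing the question's connection values as placeholders.
val jdbcDF2 = spark.read
  .jdbc(MYSQL_CONNECTION_URL, "myschema.mytable", connectionProperties)

// Register the DataFrame as a temp view so Spark SQL can see it...
jdbcDF2.createOrReplaceTempView("mytable")

// ...and only then query it through spark.sql.
spark.sql("SELECT COUNT(*) FROM mytable").show()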
