Here is my Spark SQL code, where I am trying to read a Presto table based on this guide: https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
val df = spark.read
  .format("jdbc")
  .option("driver", "com.facebook.presto.jdbc.PrestoDriver")
  .option("url", "jdbc:presto://localhost:8889/mycatalog")
  .option("query", "select * from mydb.mytable limit 1")
  .option("user", "myuserid")
  .load()
I am getting the following exception: Unrecognized connection property 'url'
Exception in thread "main" java.sql.SQLException: Unrecognized connection property 'url'
at com.facebook.presto.jdbc.PrestoDriverUri.validateConnectionProperties(PrestoDriverUri.java:345)
at com.facebook.presto.jdbc.PrestoDriverUri.<init>(PrestoDriverUri.java:102)
at com.facebook.presto.jdbc.PrestoDriverUri.<init>(PrestoDriverUri.java:92)
at com.facebook.presto.jdbc.PrestoDriver.connect(PrestoDriver.java:87)
at org.apache.spark.sql.execution.datasources.jdbc.connection.BasicConnectionProvider.getConnection(BasicConnectionProvider.scala:49)
at org.apache.spark.sql.execution.datasources.jdbc.connection.ConnectionProvider$.create(ConnectionProvider.scala:68)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$createConnectionFactory$1(JdbcUtils.scala:62)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:56)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:226)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:354)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:326)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:308)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:308)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:226)
at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:341)
It seems this issue is related to https://github.com/prestodb/presto/issues/9254, where the property url is not recognized by Presto, and it looks like the fix needs to be made on the Spark side. Is there any other workaround for this issue?
PS:
Spark Version: 3.1.1
presto-jdbc version: 0.245
This looks like a Spark bug that was fixed in 3.3:
https://issues.apache.org/jira/browse/SPARK-36163
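To make the failure mode concrete, here is a minimal, Spark-free sketch in Python of what the bug appears to be, as far as I understand the Jira issue: every reader option (including Spark-only ones such as url, driver and query) is forwarded to the driver as a connection property, and a driver that validates its properties strictly rejects them. The names SPARK_ONLY_OPTIONS, DRIVER_PROPERTIES, strict_connect and connection_properties are hypothetical and are not Spark or Presto APIs; both sets are illustrative subsets, not the real lists.

```python
# Illustrative subsets only (assumption); the real Spark and Presto lists are longer.
SPARK_ONLY_OPTIONS = {"url", "driver", "query", "dbtable"}
DRIVER_PROPERTIES = {"user", "password", "SSL"}


def strict_connect(properties):
    """Mimic a driver that rejects any property it does not recognize."""
    for name in properties:
        if name not in DRIVER_PROPERTIES:
            raise ValueError(f"Unrecognized connection property '{name}'")
    return "connected"


def connection_properties(options):
    """Keep only the options that are meant for the driver itself."""
    return {k: v for k, v in options.items() if k not in SPARK_ONLY_OPTIONS}


options = {
    "url": "jdbc:presto://localhost:8889/mycatalog",
    "driver": "com.facebook.presto.jdbc.PrestoDriver",
    "query": "select * from mydb.mytable limit 1",
    "user": "myuserid",
}

# Forwarding every option triggers the strict validation failure.
try:
    strict_connect(options)
except ValueError as error:
    print(error)

# Filtering out the Spark-only options first lets the strict driver accept them.
print(strict_connect(connection_properties(options)))
```

The point of the sketch is only that the filtering has to happen on the caller's side (Spark) or the validation has to be relaxed on the driver's side, which matches the two kinds of workaround discussed in this thread.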
There is no issue with Spark or the Presto JDBC driver. I don't think the URL you specified will work.
You should change it to the format below.
jdbc:presto://localhost:8889/mycatalog
UPDATE
I'm not sure how it works with Spark versions < 3. As a workaround, you can use another jar in which the strict config check has been removed, as specified here.
@odonnry is correct that the issue was fixed in Spark 3.3.x, but for anyone who cannot upgrade to Spark 3.3.x and is trying to use Trino, I created the workaround linked below, based on the Jira issue linked by @Mohana:
https://github.com/amitferman/trino
I'm trying to use the MS SQL connector for Spark to insert high volumes of data from pyspark.
After creating a session:
SparkSession.builder
.config('spark.jars.packages', 'org.apache.hadoop:hadoop-azure:3.2.0,org.apache.spark:spark-avro_2.12:3.1.2,com.microsoft.sqlserver:mssql-jdbc:8.4.1.jre8,com.microsoft.azure:spark-mssql-connector_2.12:1.2.0')
I get the following error:
ERROR executor.Executor: Exception in task 6.0 in stage 12.0 (TID 233)
java.lang.NoSuchMethodError: 'void com.microsoft.sqlserver.jdbc.SQLServerBulkCopy.writeToServer(com.microsoft.sqlserver.jdbc.ISQLServerBulkData)'
at com.microsoft.sqlserver.jdbc.spark.BulkCopyUtils$.bulkWrite(BulkCopyUtils.scala:110)
at com.microsoft.sqlserver.jdbc.spark.BulkCopyUtils$.savePartition(BulkCopyUtils.scala:58)
at com.microsoft.sqlserver.jdbc.spark.SingleInstanceWriteStrategies$.$anonfun$write$2(BestEffortSingleInstanceStrategy.scala:43)
at com.microsoft.sqlserver.jdbc.spark.SingleInstanceWriteStrategies$.$anonfun$write$2$adapted(BestEffortSingleInstanceStrategy.scala:42)
at org.apache.spark.rdd.RDD.$anonfun$foreachPartition$2(RDD.scala:1020)
at org.apache.spark.rdd.RDD.$anonfun$foreachPartition$2$adapted(RDD.scala:1020)
at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2236)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
When trying to write data like this:
try:
    (
        df.write.format("com.microsoft.sqlserver.jdbc.spark")
        .mode("append")
        .option("url", url)
        .option("dbtable", table_name)
        .option("user", username)
        .option("password", password)
        .option("schemaCheckEnabled", "false")
        .save()
    )
except ValueError as error:
    print("Connector write failed", error)
I tried different versions of Spark and the SQL connector, but no luck so far.
I also tried using a jar for the mssql-jdbc dependency directly:
SparkSession.builder
.config('spark.jars', '/mssql-jdbc-8.4.1.jre8.jar')
.config(...)
It still complains that it can't find the method. However, if you inspect the JAR file, the method is defined in the source code.
Any tips on where to look are welcome!
We reproduced the same scenario in our environment, and it works correctly now.
There is an issue in JDBC driver 8.2.2; you can use an older version of the library.
Below is the code sample,
Output:
Data got inserted into table from pyspark.
Reference: NoSuchMethodError for BulkCopy.
Using com.microsoft.sqlserver:mssql-jdbc:8.4.1.jre8 is one thing, but you also need a version of Microsoft's Spark SQL connector that is compatible with your Spark version.
com.microsoft.azure:spark-mssql-connector_2.12_3.0:1.0.0-alpha and com.microsoft.sqlserver:mssql-jdbc:8.4.1.jre8 did not work in my case, as I'm using AWS Glue 3.0 (which is Spark 3.1).
I had to switch to com.microsoft.azure:spark-mssql-connector_2.12:1.2.0, as it's compatible with Spark 3.1.
def write_df_to_target(self, df, schema_table):
    spark = self.gc.spark_session
    spark.builder.config('spark.jars.packages', 'com.microsoft.sqlserver:mssql-jdbc:8.4.1.jre8,com.microsoft.azure:spark-mssql-connector_2.12:1.2.0').getOrCreate()
    credentials = self.get_credentials(self.replica_connection_name)
    df.write \
        .format("com.microsoft.sqlserver.jdbc.spark") \
        .option("url", credentials["url"] + ";databaseName=" + self.database_name) \
        .option("dbtable", schema_table) \
        .option("user", credentials["user"]) \
        .option("password", credentials["password"]) \
        .option("batchsize", "100000") \
        .option("numPartitions", "15") \
        .save()
One last thing: the AWS Glue job must have the --user-jars-first: "true" parameter.
This tells Glue to load the provided jars first (that is, they override the default ones).
Check whether an equivalent parameter exists in your environment.
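A NoSuchMethodError like the one in the question usually means that a different version of a class was loaded at run time than the one the connector was compiled against. As a rough pre-flight check, the sketch below parses a spark.jars.packages string and reports artifacts that are listed with more than one version. find_version_conflicts is a hypothetical helper, not part of Spark or Glue, and it cannot see jars that the platform itself puts on the classpath (the case that --user-jars-first addresses).

```python
from collections import defaultdict


def find_version_conflicts(packages):
    """Return {"group:artifact": [versions]} for artifacts that appear with
    more than one version in a comma-separated Maven coordinate string.

    Assumes plain group:artifact:version coordinates (no classifier).
    """
    versions = defaultdict(set)
    for coordinate in packages.split(","):
        coordinate = coordinate.strip()
        if not coordinate:
            continue
        group, artifact, version = coordinate.split(":")
        versions[f"{group}:{artifact}"].add(version)
    return {name: sorted(found) for name, found in versions.items() if len(found) > 1}


# The package list from the answer above.
packages = (
    "com.microsoft.sqlserver:mssql-jdbc:8.4.1.jre8,"
    "com.microsoft.azure:spark-mssql-connector_2.12:1.2.0"
)
print(find_version_conflicts(packages))
```

An empty result only means the list you pass in is self-consistent; a conflicting driver bundled with the runtime would still need the jar-precedence setting described above.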
I'm trying to run this simple function with Spark 3.0.0 on Zeppelin 0.9.0:
def getSession(uri: String): SparkSession = {
  SparkSession
    .builder()
    .master("local[*]")
    .appName("TEST")
    .config("spark.mongodb.input.database", uri)
    .config("spark.driver.extraJavaOptions", "-Duser.timezone=Europe/Rome")
    .config("spark.executor.extraJavaOptions", "-Duser.timezone=Europe/Rome")
    .getOrCreate()
}
Although the configuration is correct, I receive this error:
java.lang.IllegalArgumentException: Missing database name. Set via the 'spark.mongodb.input.uri' or 'spark.mongodb.input.database' property
I really have no idea why this is not working. Before a Zeppelin update, this function worked correctly.
We updated Zeppelin from 0.9.0preview to 0.9.0, but I don't know how that could affect this. Does anyone have any idea?
I need to move data from a remote Hive to a local Hive with Spark. I am trying to connect to the remote Hive with the JDBC driver 'org.apache.hive.jdbc.HiveDriver'. When I read from Hive, the result contains the column headers as the column values instead of the actual data:
df = self.spark_session.read.format('JDBC') \
    .option('url', f"jdbc:hive2://{self.host}:{self.port}/{self.database}") \
    .option('driver', 'org.apache.hive.jdbc.HiveDriver') \
    .option("user", self.username) \
    .option("password", self.password) \
    .option('dbtable', 'test_table') \
    .load()
df.show()
Result:
+----------+
|str_column|
+----------+
|str_column|
|str_column|
|str_column|
|str_column|
|str_column|
+----------+
I know that Hive JDBC isn't officially supported in Apache Spark, but I have already found solutions for loading data from other unsupported sources, such as IBM Informix. Maybe someone has already solved this problem.
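After debugging and tracing the code, we find that the problem is in JdbcDialect. There is no HiveDialect, so Spark uses the default JdbcDialect.quoteIdentifier. The default implementation wraps column names in double quotes, which Hive interprets as string literals, so every row returns the column name itself.
So you should implement a HiveDialect to fix this problem: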
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

class HiveDialect extends JdbcDialect {
  // Apply this dialect only to Hive JDBC URLs.
  override def canHandle(url: String): Boolean =
    url.startsWith("jdbc:hive2")

  // Quote with backticks and drop any "table." prefix from the column name.
  override def quoteIdentifier(colName: String): String = {
    val unqualified =
      if (colName.contains(".")) colName.substring(colName.indexOf(".") + 1)
      else colName
    s"`$unqualified`"
  }
}
And then register the Dialect by:
JdbcDialects.registerDialect(new HiveDialect)
Finally, add the option hive.resultset.use.unique.column.names=false to the URL, like this:
option("url", "jdbc:hive2://bigdata01:10000?hive.resultset.use.unique.column.names=false")
Reference: CSDN blog.
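The effect of the dialect can be illustrated without Spark. The sketch below mirrors the two quoting styles in plain Python: double quotes, which Hive reads as a string literal, versus backticks after dropping the table prefix, which is what the HiveDialect above does. default_quote and hive_quote are hypothetical names used only for illustration, and test_table is the table name from the question.

```python
def default_quote(col_name):
    """Double-quote style; Hive reads "str_column" as a string literal."""
    return f'"{col_name}"'


def hive_quote(col_name):
    """Backtick style, dropping everything up to the first dot,
    following the quoteIdentifier logic of the HiveDialect above."""
    if "." in col_name:
        col_name = col_name[col_name.index(".") + 1:]
    return f"`{col_name}`"


# Compare the SELECT statements each quoting style would generate.
for name in ["str_column", "test_table.str_column"]:
    print(f"SELECT {default_quote(name)} FROM test_table")
    print(f"SELECT {hive_quote(name)} FROM test_table")
```

With double quotes, Hive evaluates the selected expression as a constant string for every row, which is exactly the repeated str_column output in the question; with backticks it is resolved as a column reference.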
Apache Kyuubi has provided a Hive dialect plugin here.
https://kyuubi.readthedocs.io/en/latest/extensions/engines/spark/jdbc-dialect.html
The Hive Dialect plugin aims to provide Hive dialect support for Spark's JDBC source. It is registered with Spark automatically and applied to JDBC sources whose URL starts with jdbc:hive2:// or jdbc:kyuubi://. It quotes identifiers in Hive SQL style, e.g. table.column is quoted as `table`.`column`.
Compile and get the dialect plugin from Kyuubi. (It's a standalone Spark plugin that is independent of Kyuubi.)
Put the jar into $SPARK_HOME/jars.
Add the plugin to the config with spark.sql.extensions=org.apache.spark.sql.dialect.KyuubiSparkJdbcDialectExtension; it will be registered with Spark automatically.
Currently I am using Spark 3.2.0 with Trino 363. I am trying to connect to Trino, but I am getting an error. The error message is below.
Exception in thread "main" java.sql.SQLException: Unrecognized connection property 'url'
Below is the code I am using.
import java.util.Properties
import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession.builder().appName("Trino-Spark")
  .master("local[*]")
  .getOrCreate()
val properties = new Properties()
properties.setProperty("SSL", "true")
properties.setProperty("SSLVerification", "NONE")
properties.setProperty("user", "USERNAME")
properties.setProperty("password", "PWD")
val df = sparkSession.read.jdbc("jdbc:trino://HOST:PORT/hive", "hive.TABLE_NAME", properties)
println(s"Count: ${df.count()}")
Could anyone please point out what is wrong here? Thanks in advance.
I was able to run Spark 3.2.0 with Trino 363 by commenting out the line mentioned below and rebuilding the JDBC driver.
Trino 363 JDBC Driver
I am trying to integrate Cassandra with Spark and am facing the issue below.
Issue:
com.datastax.spark.connector.util.ConfigCheck$ConnectorConfigurationException: Invalid Config Variables
Only known spark.cassandra.* variables are allowed when using the Spark Cassandra Connector.
spark.cassandra.keyspace is not a valid Spark Cassandra Connector variable.
Possible matches:
spark.cassandra.sql.keyspace
spark.cassandra.output.batch.grouping.key
at com.datastax.spark.connector.util.ConfigCheck$.checkConfig(ConfigCheck.scala:50)
at com.datastax.spark.connector.cql.CassandraConnectorConf$.apply(CassandraConnectorConf.scala:253)
at org.apache.spark.sql.cassandra.CassandraSourceRelation$.apply(CassandraSourceRelation.scala:263)
at org.apache.spark.sql.cassandra.CassandraCatalog.org$apache$spark$sql$cassandra$CassandraCatalog$$buildRelation(CasandraCatalog.scala:41)
at org.apache.spark.sql.cassandra.CassandraCatalog$$anon$1.load(CassandraCatalog.scala:26)
at org.apache.spark.sql.cassandra.CassandraCatalog$$anon$1.load(CassandraCatalog.scala:23)
Below are the versions of Spark, Cassandra, and the connector that I am using.
Spark : 1.6.0
Cassandra : 2.1.17
Connector Used : spark-cassandra-connector_2.10-1.6.0-M1.jar
Below is the code snippet I am using to connect to Cassandra from Spark.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.cassandra.CassandraSQLContext

val conf: org.apache.spark.SparkConf = new SparkConf(true)
  .setAppName("Spark Cassandra")
  .set("spark.cassandra.connection.host", "abc.efg.lkh")
  .set("spark.cassandra.auth.username", "xyz")
  .set("spark.cassandra.auth.password", "1234")
  .set("spark.cassandra.keyspace", "abcded")
val sc = new SparkContext("local[*]", "Spark Cassandra", conf)
val csc = new CassandraSQLContext(sc)
csc.setKeyspace("abcded")
val my_df = csc.sql("select * from table")
When I try to create the DataFrame here, I get the error posted above. I also tried without passing the keyspace in the conf, but then it tries to access the default keyspace, which the user in question does not have access to.
A JIRA was already opened and closed for this:
https://datastax-oss.atlassian.net/browse/SPARKC-102
Yet I am still getting this issue. Please let me know whether I need to use the latest connector to resolve it.
Thanks in advance.
The important information is in the error message you posted [formatted for readability]:
Invalid Config Variables
Only known spark.cassandra.* variables are allowed when using the Spark Cassandra Connector.
spark.cassandra.keyspace is not a valid Spark Cassandra Connector variable.
Possible matches: spark.cassandra.sql.keyspace
spark.cassandra.keyspace is not an available property for the connector. A full list of the available properties can be found here: https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md
You may have some luck using the suggested spark.cassandra.sql.keyspace; otherwise you may just need to explicitly specify the keyspace for every Cassandra interaction you perform using the connector.
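A quick way to catch this kind of mistake before the connector does is to compare the keys you set against the ones you know are valid. The sketch below is plain Python and uses only the property names that appear in this thread; KNOWN_KEYS is a deliberately incomplete, hypothetical list (the reference page linked above has the real one), and invalid_cassandra_keys is not a connector API.

```python
# Property names taken from this thread only (assumption: incomplete);
# the connector supports many more.
KNOWN_KEYS = {
    "spark.cassandra.connection.host",
    "spark.cassandra.auth.username",
    "spark.cassandra.auth.password",
    "spark.cassandra.sql.keyspace",
    "spark.cassandra.output.batch.grouping.key",
}


def invalid_cassandra_keys(conf):
    """Return the spark.cassandra.* keys in conf that are not in KNOWN_KEYS."""
    return sorted(
        key for key in conf
        if key.startswith("spark.cassandra.") and key not in KNOWN_KEYS
    )


# The settings from the question, written as a plain dictionary.
conf = {
    "spark.app.name": "Spark Cassandra",
    "spark.cassandra.connection.host": "abc.efg.lkh",
    "spark.cassandra.auth.username": "xyz",
    "spark.cassandra.auth.password": "1234",
    "spark.cassandra.keyspace": "abcded",
}
print(invalid_cassandra_keys(conf))
```

Renaming the offending key to the suggested spark.cassandra.sql.keyspace makes the check pass, which is the same change the error message proposes.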