Spark Trino Connection - apache-spark

I am using Spark 3.2.0 with Trino 363. When I try to connect to Trino, I get the following error:
Exception in thread "main" java.sql.SQLException: Unrecognized connection property 'url'
Here is the code I am using:
import java.util.Properties
import org.apache.spark.sql.SparkSession
val sparkSession = SparkSession.builder().appName("Trino-Spark")
  .master("local[*]")
  .getOrCreate()
// Connection properties handed to the Trino JDBC driver
val properties = new Properties()
properties.setProperty("SSL", "true")
properties.setProperty("SSLVerification", "NONE")
properties.setProperty("user", "USERNAME")
properties.setProperty("password", "PWD")
val df = sparkSession.read.jdbc("jdbc:trino://HOST:PORT/hive", "hive.TABLE_NAME", properties)
println(s"Count: ${df.count()}")
Could anyone please help me figure out what is wrong here? Thanks in advance.

I was able to run Spark 3.2.0 with Trino 363 by commenting out the line referenced below and rebuilding the JDBC driver.
Trino 363 JDBC Driver

Related

SparkSession doesn't read database despite correct configuration

I'm trying to run this simple function with Spark 3.0.0 on Zeppelin 0.9.0:
def getSession(uri: String): SparkSession = {
  SparkSession
    .builder()
    .master("local[*]")
    .appName("TEST")
    .config("spark.mongodb.input.database", uri)
    .config("spark.driver.extraJavaOptions", "-Duser.timezone=Europe/Rome")
    .config("spark.executor.extraJavaOptions", "-Duser.timezone=Europe/Rome")
    .getOrCreate()
}
Although the configuration looks correct, I receive this error:
java.lang.IllegalArgumentException: Missing database name. Set via the 'spark.mongodb.input.uri' or 'spark.mongodb.input.database' property
I really have no idea why this is not working. Before a Zeppelin update, this function worked correctly.
We updated Zeppelin from 0.9.0preview to 0.9.0, but I don't know how that could affect this. Does anyone have any idea?

Pyspark Dataframe to AWS MySql: requirement failed: The driver could not open a JDBC connection

I want to write a pyspark dataframe into a MySQL table in AWS RDS, but I keep getting the error
pyspark.sql.utils.IllegalArgumentException: requirement failed: The driver could not open a JDBC connection. Check the URL: jdbc:mysql:mtestdb.ch4i3d3jc0yc.eu-central-1.rds.amazonaws.com
My code looks like this:
import os
import sys
from pyspark.sql import SparkSession
spark = SparkSession.builder\
.appName('test-app')\
.config('spark.jars.packages', 'mysql:mysql-connector-java:8.0.28')\
.getOrCreate()
properties = {'user':'admin', 'password':'password', 'driver':'com.mysql.cj.jdbc.Driver'}
resultDF.write.jdbc(url='jdbc:mysql:mtestdb.ch4i3d3jc0yc.eu-central-1.rds.amazonaws.com', table='mcm_objects', properties=properties)\
.mode('append')\
.save()
I also tried the url 'jdbc:mysql://mtestdb.ch4i3d3jc0yc.eu-central-1.rds.amazonaws.com',
but then I get the error:
java.sql.SQLException: No database selected
No idea what I am doing wrong. Any help would be greatly appreciated
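The table should be qualified as {dbname}.{table}. Use the URL form with //, and note that DataFrameWriter.jdbc performs the write itself, so mode() goes before it and no save() call is needed:
resultDF.write.mode('append').jdbc(url='jdbc:mysql://mtestdb.ch4i3d3jc0yc.eu-central-1.rds.amazonaws.com', table='{dbname}.mcm_objects', properties=properties)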
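The two errors in the question point at the URL: one form lacks the // separator, the other names no database. As a minimal sketch (the helper name mysql_jdbc_url, the default port 3306, and the example host are assumptions for illustration, not taken from the original post), the URL can be assembled like this:

```python
def mysql_jdbc_url(host, database=None, port=3306):
    """Build a MySQL JDBC URL of the form jdbc:mysql://host:port[/database]."""
    if not host:
        raise ValueError("host must be non-empty")
    url = f"jdbc:mysql://{host}:{port}"
    # Appending the database gives the connection a default schema.
    if database:
        url += f"/{database}"
    return url


# Hypothetical host and database names, for illustration only.
print(mysql_jdbc_url("db.example.com", "mydb"))
```

With a database in the URL, an unqualified table name should resolve against it; without one, qualifying the table as in the answer above is the alternative.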

Unrecognized connection property 'url' when using Presto JDBC in Spark SQL

Here is my Spark SQL code, where I am trying to read a Presto table based on this guide: https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
val df = spark.read
.format("jdbc")
.option("driver", "com.facebook.presto.jdbc.PrestoDriver")
.option("url", "jdbc:presto://localhost:8889/mycatalog")
.option("query", "select * from mydb.mytable limit 1")
.option("user", "myuserid")
.load()
 
I am getting the following exception, unrecognized connection property 'url'
Exception in thread "main" java.sql.SQLException: Unrecognized connection property 'url'
at com.facebook.presto.jdbc.PrestoDriverUri.validateConnectionProperties(PrestoDriverUri.java:345)
at com.facebook.presto.jdbc.PrestoDriverUri.<init>(PrestoDriverUri.java:102)
at com.facebook.presto.jdbc.PrestoDriverUri.<init>(PrestoDriverUri.java:92)
at com.facebook.presto.jdbc.PrestoDriver.connect(PrestoDriver.java:87)
at org.apache.spark.sql.execution.datasources.jdbc.connection.BasicConnectionProvider.getConnection(BasicConnectionProvider.scala:49)
at org.apache.spark.sql.execution.datasources.jdbc.connection.ConnectionProvider$.create(ConnectionProvider.scala:68)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$createConnectionFactory$1(JdbcUtils.scala:62)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:56)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:226)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:354)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:326)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:308)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:308)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:226)
at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:341)
This seems related to https://github.com/prestodb/presto/issues/9254, where the property url is not a recognized property in Presto. Does the fix need to be made on the Spark side? Are there any other workarounds for this issue?
PS:
Spark Version: 3.1.1
presto-jdbc version: 0.245
This looks like a Spark bug that was fixed in 3.3:
https://issues.apache.org/jira/browse/SPARK-36163
There is no issue with Spark or the Presto JDBC driver. I don't think the URL you specified will work.
You should change it to the format below.
jdbc:presto://localhost:8889/mycatalog
UPDATE
Not sure how it works with Spark versions < 3. As a workaround, you can use another jar where the strict config check has been removed, as specified here.
@odonnry is correct that the issue was fixed in Spark 3.3.x. For anyone who cannot upgrade to Spark 3.3.x and is trying to use Trino, I created a workaround, linked below, based on the Jira issue linked by @Mohana:
https://github.com/amitferman/trino
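The stack trace in the question shows the driver's validateConnectionProperties rejecting a property named url. As a toy model only (this is not the Presto, Trino, or Spark source; the function names, the allowed-property set, and the reserved-option list are hypothetical), strict validation and the general shape of a fix can be pictured like this:

```python
# Hypothetical subset of property names a strict driver might accept.
ALLOWED = {"user", "password", "SSL", "SSLVerification"}

# Illustrative data-source options that are not meant for the driver.
RESERVED = ("url", "driver", "query", "dbtable")


def validate_connection_properties(props):
    """Reject any property name the (toy) driver does not know."""
    for name in props:
        if name not in ALLOWED:
            raise ValueError(f"Unrecognized connection property '{name}'")


def driver_properties(options):
    """Keep only the options that should be forwarded to the driver."""
    return {k: v for k, v in options.items() if k not in RESERVED}


options = {"url": "jdbc:presto://localhost:8889/mycatalog", "user": "myuserid"}
validate_connection_properties(driver_properties(options))  # passes
```

Passing options unfiltered would raise in this model, which mirrors the reported error. See SPARK-36163 for what Spark actually changed.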

spark setCassandraConf is not working as expected

I am using .setCassandraConf(c_options_conf) to configure the SparkSession to connect to a Cassandra cluster, as shown below.
Working fine:
val spark = SparkSession
.builder()
.appName("DatabaseMigrationUtility")
.config("spark.master",devProps.getString("deploymentMaster"))
.getOrCreate()
.setCassandraConf(c_options_conf)
If I save a table using the DataFrame writer as below, it points to the configured cluster and saves to Cassandra without problems:
writeDfToCassandra(o_vals_df, key_space , "model_vals"); //working fine using o_vals_df.
But if I save through the RDD API as below, it points to localhost instead of the Cassandra cluster and fails to save.
Not working:
import spark.implicits._
val sc = spark.sparkContext
val audit_df = sc.parallelize(Seq(LogCaseClass(columnFamilyName, status,
error_msg,currentDate,currentTimeStamp, updated_user))).saveToCassandra(keyspace, columnFamilyName);
It throws an error because it tries to connect to localhost.
Error:
Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All
host(s) tried for query failed (tried: localhost/127.0.0.1:9042
(com.datastax.driver.core.exceptions.TransportException:
[localhost/127.0.0.1:9042] Cannot connect))
at com.datastax.driver.core.ControlConnection.reconnectInternal(ControlConnection.java:233)
What is wrong here? Why does it point to the default localhost even though the SparkSession is set to the Cassandra cluster and the earlier method works fine?
We need to set the config using two set methods of SparkSession, i.e. .config(conf) and .setCassandraConf(c_options_conf) with same values like below
val spark = SparkSession
.builder()
.appName("DatabaseMigrationUtility")
.config("spark.master",devProps.getString("deploymentMaster"))
.config("spark.dynamicAllocation.enabled",devProps.getString("spark.dynamicAllocation.enabled"))
.config("spark.executor.memory",devProps.getString("spark.executor.memory"))
.config("spark.executor.cores",devProps.getString("spark.executor.cores"))
.config("spark.executor.instances",devProps.getString("spark.executor.instances"))
.config(conf)
.getOrCreate()
.setCassandraConf(c_options_conf)
Then it works for the latest Cassandra API as well as the RDD/DataFrame API.
Setting IP via spark.cassandra.connection.host Spark property (not via setCassandraConf!) works for both RDD & DataFrames. This property could be set from command-line when submitting the job, or explicitly (example from documentation):
val conf = new SparkConf(true)
.set("spark.cassandra.connection.host", "192.168.123.10")
.set("spark.cassandra.auth.username", "cassandra")
.set("spark.cassandra.auth.password", "cassandra")
val sc = new SparkContext("spark://192.168.123.10:7077", "test", conf)
Take a look at the connector documentation, including the reference for the existing configuration properties.
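The answer above mentions that the property can also be set from the command line when submitting the job. A minimal sketch of assembling such a command follows; the helper name spark_submit_args and the application file app.jar are hypothetical, while --conf key=value is the standard spark-submit form and the property names and values are the ones from the answer:

```python
def spark_submit_args(app, conf):
    """Return a spark-submit argument list with one --conf per property."""
    args = ["spark-submit"]
    # Sort for a stable, reproducible command line.
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    args.append(app)
    return args


cassandra_conf = {
    "spark.cassandra.connection.host": "192.168.123.10",
    "spark.cassandra.auth.username": "cassandra",
    "spark.cassandra.auth.password": "cassandra",
}
print(" ".join(spark_submit_args("app.jar", cassandra_conf)))
```

Because the values end up in the Spark configuration itself, both the RDD and DataFrame paths should see the same host.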

Spark Cassandra Connector Issue

I am trying to integrate Cassandra with Spark and facing the below issue.
Issue:
com.datastax.spark.connector.util.ConfigCheck$ConnectorConfigurationException: Invalid Config Variables
Only known spark.cassandra.* variables are allowed when using the Spark Cassandra Connector.
spark.cassandra.keyspace is not a valid Spark Cassandra Connector variable.
Possible matches:
spark.cassandra.sql.keyspace
spark.cassandra.output.batch.grouping.key
at com.datastax.spark.connector.util.ConfigCheck$.checkConfig(ConfigCheck.scala:50)
at com.datastax.spark.connector.cql.CassandraConnectorConf$.apply(CassandraConnectorConf.scala:253)
at org.apache.spark.sql.cassandra.CassandraSourceRelation$.apply(CassandraSourceRelation.scala:263)
at org.apache.spark.sql.cassandra.CassandraCatalog.org$apache$spark$sql$cassandra$CassandraCatalog$$buildRelation(CasandraCatalog.scala:41)
at org.apache.spark.sql.cassandra.CassandraCatalog$$anon$1.load(CassandraCatalog.scala:26)
at org.apache.spark.sql.cassandra.CassandraCatalog$$anon$1.load(CassandraCatalog.scala:23)
Please find the below versions of spark Cassandra and connector I am using.
Spark : 1.6.0
Cassandra : 2.1.17
Connector Used : spark-cassandra-connector_2.10-1.6.0-M1.jar
Below is the code snippet I am using to connect Cassandra from spark.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.cassandra.CassandraSQLContext
val conf: SparkConf = new SparkConf(true)
  .setAppName("Spark Cassandra")
  .set("spark.cassandra.connection.host", "abc.efg.lkh")
  .set("spark.cassandra.auth.username", "xyz")
  .set("spark.cassandra.auth.password", "1234")
  .set("spark.cassandra.keyspace", "abcded") // the property rejected in the error above
val sc = new SparkContext("local[*]", "Spark Cassandra", conf)
val csc = new CassandraSQLContext(sc)
csc.setKeyspace("abcded")
val my_df = csc.sql("select * from table")
When I try to create the DataFrame, I get the error posted above. I also tried without passing the keyspace in the conf, but then it tries to access the default keyspace, which the given user cannot access.
A JIRA for this was already opened and closed:
https://datastax-oss.atlassian.net/browse/SPARKC-102
Yet I am still getting this issue. Please let me know whether I need to use the latest connector to resolve it.
Thanks in advance.
The important information is in the error message you posted [formatted for readability]:
Invalid Config Variables
Only known spark.cassandra.* variables are allowed when using the Spark Cassandra Connector.
spark.cassandra.keyspace is not a valid Spark Cassandra Connector variable.
Possible matches: spark.cassandra.sql.keyspace
spark.cassandra.keyspace is not an available property for the connector. A full list of the available properties can be found here: https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md
You may have some luck using the suggested spark.cassandra.sql.keyspace; otherwise you may just need to explicitly specify the keyspace for every Cassandra interaction you perform using the connector.
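To make the error message concrete, here is a toy sketch of a check that flags unknown spark.cassandra.* keys and suggests close matches. It is not the connector's implementation; the function name is hypothetical and the KNOWN list is a small, non-exhaustive sample of property names that appear in this thread:

```python
import difflib

# Non-exhaustive sample of property names mentioned in this thread.
KNOWN = [
    "spark.cassandra.connection.host",
    "spark.cassandra.auth.username",
    "spark.cassandra.auth.password",
    "spark.cassandra.sql.keyspace",
    "spark.cassandra.output.batch.grouping.key",
]


def check_cassandra_conf(conf):
    """Map each unknown spark.cassandra.* key to a list of similar known keys."""
    problems = {}
    for key in conf:
        if key.startswith("spark.cassandra.") and key not in KNOWN:
            problems[key] = difflib.get_close_matches(key, KNOWN)
    return problems


conf = {
    "spark.cassandra.connection.host": "abc.efg.lkh",
    "spark.cassandra.keyspace": "abcded",
}
for key, matches in check_cassandra_conf(conf).items():
    print(f"{key} is not valid. Possible matches: {matches}")
```

Running a check like this before creating the context surfaces a mistyped or unsupported key early.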
