How to create a table with a primary key using the JDBC Spark connector (to Ignite) - apache-spark

I'm trying to save a Spark DataFrame to an Ignite cache using the Spark connector (PySpark) like this:
df.write.format("jdbc") \
    .option("url", "jdbc:ignite:thin://<ignite ip>") \
    .option("driver", "org.apache.ignite.IgniteJdbcThinDriver") \
    .option("primaryKeyFields", 'id') \
    .option("dbtable", "ignite") \
    .mode("overwrite") \
    .save()
# .option("createTableOptions", "primary key (id)") \
# .option("customSchema", 'id BIGINT PRIMARY KEY, txt TEXT') \
I get this error:
java.sql.SQLException: No PRIMARY KEY defined for CREATE TABLE
The library org.apache.ignite:ignite-spark-2.4:2.9.0 is installed. I can't use the ignite format because Azure Databricks ships a Spring Framework version that conflicts with the one bundled in org.apache.ignite:ignite-spark-2.4:2.9.0. So I'm trying to use the JDBC thin client instead, but with it I can only read from or append to an existing cache.
I can't use the overwrite mode because I can't choose a primary key. There is a primaryKeyFields option for the ignite format, but it doesn't work with jdbc. The JDBC customSchema option is ignored, and createTableOptions appends the primary key statement after the closing parenthesis of the schema, which causes a SQL syntax error.
Is there a way to define a primary key for the JDBC Spark connector?

Here's an example with correct syntax that should work fine:
DataFrameWriter<Row> df = resultDF
    .write()
    .format(IgniteDataFrameSettings.FORMAT_IGNITE())
    .option(IgniteDataFrameSettings.OPTION_CONFIG_FILE(), configPath)
    .option(IgniteDataFrameSettings.OPTION_TABLE(), "Person")
    .option(IgniteDataFrameSettings.OPTION_CREATE_TABLE_PRIMARY_KEY_FIELDS(), "id, city_id")
    .option(IgniteDataFrameSettings.OPTION_CREATE_TABLE_PARAMETERS(), "template=partitioned,backups=1")
    .mode(SaveMode.Append);

df.save();
Please let me know if something is wrong here.
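For reference, here is a rough PySpark equivalent of that write, since the original question uses PySpark. This is only a sketch: it assumes the ignite format (and its Spring dependencies) can actually be loaded in your environment, and that the plain-string option keys match the IgniteDataFrameSettings constants in the Java example above; the config path is a placeholder.

# Sketch: writing with the native "ignite" format from PySpark.
# Assumes ignite-spark and its dependencies are on the classpath and that
# /path/to/ignite-config.xml points to a valid Ignite client configuration.
df.write.format("ignite") \
    .option("config", "/path/to/ignite-config.xml") \
    .option("table", "Person") \
    .option("primaryKeyFields", "id, city_id") \
    .option("createTableParameters", "template=partitioned,backups=1") \
    .mode("append") \
    .save()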

Related

df.show() in PySpark returns "UnauthorizedException: User my_user has no SELECT permission on <table system.size_estimates> or any of its parents"

I'm trying to read records from a Cassandra table. This code works fine:
df = spark.read \
    .format("org.apache.spark.sql.cassandra") \
    .option("spark.cassandra.connection.host", "my_host") \
    .option("spark.cassandra.connection.port", "9042") \
    .option("spark.cassandra.auth.username", "my_user") \
    .option("spark.cassandra.auth.password", "my_pass") \
    .option("keyspace", "my_keyspace") \
    .option("table", "my_table") \
    .load()
but when I try to show the records
df.show(3)
I get this exception:
com.datastax.oss.driver.api.core.servererrors.UnauthorizedException: User my_user has no SELECT permission on <table system.size_estimates> or any of its parents
The point is that I have permissions on my_keyspace only. But I can successfully connect with cqlsh to the same Cassandra host:port with the same user/pass and do whatever I want in my_keyspace.
Please advise what's wrong with the Spark code and how to handle this situation.
You need to grant read access to system.size_estimates for that user.
The Spark Cassandra connector estimates the size of the Cassandra tables using the values stored in system.size_estimates. The connector needs an estimate of the table size in order to calculate the number of Spark partitions. See my answer in this post for details.
If you've enabled the authorizer in Cassandra, authenticated users/roles are automatically given read access to some system tables:
system_schema.keyspaces
system_schema.columns
system_schema.tables
system.local
system.peers
But you will need to explicitly authorize your Spark user so it can access the size_estimates table with:
GRANT SELECT ON system.size_estimates TO spark_role
Note that the role only needs read access (SELECT permission) to the table. Cheers!

Connect to Hive with jdbc driver in Spark

I need to move data from a remote Hive to a local Hive with Spark. I'm trying to connect to the remote Hive with the JDBC driver 'org.apache.hive.jdbc.HiveDriver'. When I read from Hive, the result contains the column headers as the column values instead of the actual data:
df = self.spark_session.read.format('jdbc') \
    .option('url', f"jdbc:hive2://{self.host}:{self.port}/{self.database}") \
    .option('driver', 'org.apache.hive.jdbc.HiveDriver') \
    .option("user", self.username) \
    .option("password", self.password) \
    .option('dbtable', 'test_table') \
    .load()
df.show()
Result:
+----------+
|str_column|
+----------+
|str_column|
|str_column|
|str_column|
|str_column|
|str_column|
+----------+
I know that Hive over JDBC isn't officially supported in Apache Spark, but I have already found solutions for reading from other unsupported sources, such as IBM Informix. Maybe someone has already solved this problem.
After debugging and tracing the code, you will find that the problem is in JdbcDialect: there is no HiveDialect, so Spark falls back to the default JdbcDialect.quoteIdentifier.
So you should implement a HiveDialect to fix this problem:
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

class HiveDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean =
    url.startsWith("jdbc:hive2")

  override def quoteIdentifier(colName: String): String = {
    if (colName.contains(".")) {
      val unqualified = colName.substring(colName.indexOf(".") + 1)
      return s"`$unqualified`"
    }
    s"`$colName`"
  }
}
And then register the Dialect by:
JdbcDialects.registerDialect(new HiveDialect)
Finally, add the option hive.resultset.use.unique.column.names=false to the URL, like this:
option("url", "jdbc:hive2://bigdata01:10000?hive.resultset.use.unique.column.names=false")
For more details, refer to the original CSDN blog post.
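For PySpark users, the read from the original question would then look roughly like this, assuming the custom dialect above (or the Kyuubi plugin described in the next answer) is on the classpath and registered on the JVM side; the host, port, user, and password are placeholders.

# Sketch: the question's read, with the URL parameter suggested above.
# Host, port, user, and password are placeholders.
df = spark.read.format("jdbc") \
    .option("url", "jdbc:hive2://bigdata01:10000?hive.resultset.use.unique.column.names=false") \
    .option("driver", "org.apache.hive.jdbc.HiveDriver") \
    .option("user", "hive_user") \
    .option("password", "hive_password") \
    .option("dbtable", "test_table") \
    .load()
df.show()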
Apache Kyuubi provides a Hive dialect plugin here:
https://kyuubi.readthedocs.io/en/latest/extensions/engines/spark/jdbc-dialect.html
The Hive dialect plugin aims to provide Hive dialect support for Spark's JDBC source. It is automatically registered with Spark and applied to JDBC sources whose URL has the prefix jdbc:hive2:// or jdbc:kyuubi://. It quotes identifiers in Hive SQL style, e.g. table.column becomes `table`.`column`.
Compile and get the dialect plugin from Kyuubi (it's a standalone Spark plugin, independent of Kyuubi itself).
Put the jar into $SPARK_HOME/jars.
Add the plugin to the config spark.sql.extensions=org.apache.spark.sql.dialect.KyuubiSparkJdbcDialectExtension; it will be auto-registered with Spark (see the sketch below).
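From PySpark, this amounts to pointing the session at the plugin jar and the extension class. A minimal sketch, assuming the plugin has been built and copied locally; the jar path is a placeholder, and the spark.jars setting is redundant if the jar is already in $SPARK_HOME/jars.

from pyspark.sql import SparkSession

# Sketch: enabling the Kyuubi Hive dialect plugin from PySpark.
# The jar path below is a placeholder for wherever the compiled plugin lives.
spark = SparkSession.builder \
    .config("spark.jars", "/path/to/kyuubi-spark-jdbc-dialect.jar") \
    .config("spark.sql.extensions",
            "org.apache.spark.sql.dialect.KyuubiSparkJdbcDialectExtension") \
    .getOrCreate()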

No FileSystem for scheme: oss

I am using Alibaba Cloud to store processed data from my Spark scripts, but I am unable to upload the data to storage. I know how to do it with S3 by including some jars, but I'm not sure how to do it with the Alibaba OSS service.
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf()
conf.set("spark.hadoop.fs.oss.impl", "com.aliyun.fs.oss.nat.NativeOssFileSystem")

spark = SparkSession.builder.config("spark.jars", "/home/username/mysql-connector-java-5.1.38.jar") \
    .master("local").appName("PySpark_MySQL_test").getOrCreate()

wine_df = spark.read.format("jdbc").option("url", "jdbc:mysql://db.com:3306/service_db") \
    .option("driver", "com.mysql.jdbc.Driver").option("query", "select * from transactions limit 1000") \
    .option("user", "***").option("password", "***").load()

outputPath = "oss://Bucket_name"
rdd = wine_df.rdd.map(list)
rdd.saveAsTextFile(outputPath)
I think it may be because you have not opened up the OSS bucket's permissions.
In OSS, click your bucket and then Authorize, and change the relevant rules, for example by adding IP conditions.
That could work for you.
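If the bucket permissions are in place, also check that the OSS filesystem implementation is actually available to the session. Below is a minimal sketch in PySpark, assuming an OSS connector jar that provides com.aliyun.fs.oss.nat.NativeOssFileSystem is available locally; the jar path, endpoint, and credentials are placeholders, and the exact property names depend on which OSS connector you use.

from pyspark import SparkConf
from pyspark.sql import SparkSession

# Sketch: wiring the OSS filesystem into the Spark session.
# Jar paths, endpoint, and credentials are placeholders.
conf = SparkConf() \
    .set("spark.jars", "/path/to/oss-connector.jar,/home/username/mysql-connector-java-5.1.38.jar") \
    .set("spark.hadoop.fs.oss.impl", "com.aliyun.fs.oss.nat.NativeOssFileSystem") \
    .set("spark.hadoop.fs.oss.endpoint", "<oss-endpoint>") \
    .set("spark.hadoop.fs.oss.accessKeyId", "<access-key-id>") \
    .set("spark.hadoop.fs.oss.accessKeySecret", "<access-key-secret>")

spark = SparkSession.builder.config(conf=conf) \
    .master("local").appName("PySpark_MySQL_test").getOrCreate()

With the configuration attached to the builder this way, paths like oss://Bucket_name should resolve, provided the connector jar and its dependencies are present.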

Read mariaDB4J table into spark DataSet using JDBC

I am trying to read a table from MariaDB4J via JDBC using the following command:
Dataset<Row> jdbcDF = spark.read()
    .format("jdbc")
    .option("url", url)
    .option("dbtable", String.format("SELECT userID FROM %s;", TableNAME))
    .load();
I get the following error:
com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near 'SELECT userID FROM MYTABLE; WHERE 1=0' at line 1
I am not sure where the WHERE comes from and why I get this error...
Thanks
The value supplied to the dbtable option is not correct. Instead of specifying a bare SQL query, you should either use a table name (optionally qualified with the schema) or a valid SQL subquery with an alias.
Dataset<Row> jdbcDF = spark.read()
    .format("jdbc")
    .option("url", url)
    .option("dbtable", String.format("(SELECT userID FROM %s) as table_alias", tableNAME))
    .load();
Your question is similar to this one, so forgive me if I quote myself:
The reason why you see the strange-looking WHERE 1=0 clause is that Spark tries to infer the schema of your data frame without loading any actual data. This query is guaranteed never to return any rows, but its result metadata gives Spark the schema information.
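For completeness, the same fix in PySpark looks roughly like this (a sketch; url and the table name are placeholders, and on Spark 2.4+ the separate query option is an alternative to wrapping the statement in a subquery):

# Sketch: pass a subquery with an alias instead of a bare SELECT.
jdbc_df = spark.read.format("jdbc") \
    .option("url", url) \
    .option("dbtable", "(SELECT userID FROM MYTABLE) AS table_alias") \
    .load()

# On Spark 2.4+, a plain query can also be passed directly:
# .option("query", "SELECT userID FROM MYTABLE")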

Spark Cassandra Connector Issue

I am trying to integrate Cassandra with Spark and am facing the issue below.
Issue:
com.datastax.spark.connector.util.ConfigCheck$ConnectorConfigurationException: Invalid Config Variables
Only known spark.cassandra.* variables are allowed when using the Spark Cassandra Connector.
spark.cassandra.keyspace is not a valid Spark Cassandra Connector variable.
Possible matches:
spark.cassandra.sql.keyspace
spark.cassandra.output.batch.grouping.key
at com.datastax.spark.connector.util.ConfigCheck$.checkConfig(ConfigCheck.scala:50)
at com.datastax.spark.connector.cql.CassandraConnectorConf$.apply(CassandraConnectorConf.scala:253)
at org.apache.spark.sql.cassandra.CassandraSourceRelation$.apply(CassandraSourceRelation.scala:263)
at org.apache.spark.sql.cassandra.CassandraCatalog.org$apache$spark$sql$cassandra$CassandraCatalog$$buildRelation(CasandraCatalog.scala:41)
at org.apache.spark.sql.cassandra.CassandraCatalog$$anon$1.load(CassandraCatalog.scala:26)
at org.apache.spark.sql.cassandra.CassandraCatalog$$anon$1.load(CassandraCatalog.scala:23)
Please find below the versions of Spark, Cassandra, and the connector I am using:
Spark : 1.6.0
Cassandra : 2.1.17
Connector Used : spark-cassandra-connector_2.10-1.6.0-M1.jar
Below is the code snippet I am using to connect to Cassandra from Spark.
val conf: org.apache.spark.SparkConf = new SparkConf(true)
  .setAppName("Spark Cassandra")
  .set("spark.cassandra.connection.host", "abc.efg.lkh")
  .set("spark.cassandra.auth.username", "xyz")
  .set("spark.cassandra.auth.password", "1234")
  .set("spark.cassandra.keyspace", "abcded")

val sc = new SparkContext("local[*]", "Spark Cassandra", conf)
val csc = new CassandraSQLContext(sc)
csc.setKeyspace("abcded")
val my_df = csc.sql("select * from table")
Here, when I try to create the DF, I get the error posted above. I tried without passing the schema in the conf, but then it tries to access the default schema, where the mentioned user doesn't have access.
A JIRA was already opened and closed:
https://datastax-oss.atlassian.net/browse/SPARKC-102
yet I am still getting this issue. Please let me know whether I need to use the latest connector to resolve it.
Thanks in advance.
The important information is in the error message you posted [formatted for readability]:
Invalid Config Variables
Only known spark.cassandra.* variables are allowed when using the Spark Cassandra Connector.
spark.cassandra.keyspace is not a valid Spark Cassandra Connector variable.
Possible matches: spark.cassandra.sql.keyspace
spark.cassandra.keyspace is not an available property for the connector. A full list of the available properties can be found here: https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md
You may have some luck using the suggested spark.cassandra.sql.keyspace; otherwise you may just need to explicitly specify the keyspace for every Cassandra interaction you perform using the connector.
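For illustration, sketched in PySpark (the property name is the same from Scala): replace spark.cassandra.keyspace with the property suggested in the error message.

from pyspark import SparkConf

# Sketch: use spark.cassandra.sql.keyspace instead of spark.cassandra.keyspace.
# Host and credentials are the placeholders from the question.
conf = SparkConf() \
    .setAppName("Spark Cassandra") \
    .set("spark.cassandra.connection.host", "abc.efg.lkh") \
    .set("spark.cassandra.auth.username", "xyz") \
    .set("spark.cassandra.auth.password", "1234") \
    .set("spark.cassandra.sql.keyspace", "abcded")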
