df.show() in PySpark returns "UnauthorizedException: User my_user has no SELECT permission on <table system.size_estimates> or any of its parents" - apache-spark

I'm trying to read records from a Cassandra table. This code works fine:
df = spark.read \
.format("org.apache.spark.sql.cassandra") \
.option("spark.cassandra.connection.host", "my_host") \
.option("spark.cassandra.connection.port", "9042") \
.option("spark.cassandra.auth.username", "my_user") \
.option("spark.cassandra.auth.password", "my_pass") \
.option("keyspace", "my_keyspace") \
.option("table", "my_table") \
.load()
but when I try to show records
df.show(3)
I get this exception:
com.datastax.oss.driver.api.core.servererrors.UnauthorizedException: User my_user has no SELECT permission on <table system.size_estimates> or any of its parents
The point is that I have permissions on my_keyspace only.
But I can successfully connect with cqlsh to the same Cassandra host:port with the same user/pass and do whatever I want in my_keyspace.
Please advise what's wrong with the Spark code and how to handle this situation.

You need to grant read access to system.size_estimates to that user

The Spark Cassandra connector estimates the size of the Cassandra tables using the values stored in system.size_estimates. The connector needs an estimate of the table size in order to calculate the number of Spark partitions. See my answer in this post for details.
If you've enabled the authorizer in Cassandra, authenticated users/roles are automatically given read access to some system tables:
system_schema.keyspaces
system_schema.columns
system_schema.tables
system.local
system.peers
But you will need to explicitly authorize your Spark user so it can access the size_estimates table with:
GRANT SELECT ON system.size_estimates TO spark_role
Note that the role only needs read access (SELECT permission) to the table. Cheers!
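As a side note, here is a minimal PySpark sketch (assuming the grant above has been applied) that sets the same spark.cassandra.* options once at session level instead of repeating them on every read:
from pyspark.sql import SparkSession

# Minimal sketch: the spark.cassandra.* settings can also be set once on the
# SparkSession instead of being passed as options on every read.
spark = SparkSession.builder \
    .appName("cassandra_read") \
    .config("spark.cassandra.connection.host", "my_host") \
    .config("spark.cassandra.connection.port", "9042") \
    .config("spark.cassandra.auth.username", "my_user") \
    .config("spark.cassandra.auth.password", "my_pass") \
    .getOrCreate()

# Once GRANT SELECT ON system.size_estimates TO spark_role has been applied,
# the read and show from the question should succeed.
df = spark.read \
    .format("org.apache.spark.sql.cassandra") \
    .option("keyspace", "my_keyspace") \
    .option("table", "my_table") \
    .load()
df.show(3)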

Related

PySpark is raising error (ORA-00933) when fetching data from Oracle Database

Context
I am using Databricks to connect to an Oracle Database and fetch data daily. We are using the following example code in PySpark to authenticate and access the database:
testDF = (spark.read
.format("jdbc")
.option("url", "jdbc:oracle:thin:#10.80.40.5:1526:D301")
.option("dbtable", "SCHEMA_TEST.TABLE_TEST")
.option("user", jdbcUsername)
.option("password", jdbcPassword)
.option("driver", "oracle.jdbc.driver.OracleDriver")
.option("oracle.jdbc.timezoneAsRegion", "false")
.option("fetchSize", "256")
.load()
.createOrReplaceTempView("test")
)
Then we would access the information via SQL using the command below:
SELECT * FROM test
The problem
Today we realized this approach is raising an error like the one below:
SQLSyntaxErrorException: ORA-00933: SQL command not properly ended
Caused by: OracleDatabaseException: ORA-00933: SQL command not
properly ended
Upon further investigation into the error log, I saw the following message:
Caused by: Error : 933, Position : 2735, Sql = SELECT *
SCHEMA_TEST.TABLE_TEST LIMIT 10 , Error Msg = ORA-00933: SQL command
not properly ended
So I believe that PySpark/DBSQL translates the instruction SELECT * FROM test into the instruction SELECT * SCHEMA_TEST.TABLE_TEST LIMIT 10 to fetch information from the Oracle database.
The problem is that LIMIT does not exist in Oracle SQL, hence the error.
So I tried an instruction with a WHERE clause, because I thought that would prevent the LIMIT clause from being used in the backend. The following code was then executed:
SELECT * FROM test WHERE column_example is not null
Well, IT WORKED. So for some reason, PySpark/DBSQL is translating a simple select-star instruction (SELECT * FROM test) incorrectly.
Does someone know what might be the reason behind it?
Additional info:
Databricks Runtime: 12.0
Spark: 3.3.1
Oracle Driver: oracle.jdbc.driver.OracleDriver
Maven Oracle Driver: com.oracle.ojdbc:ojdbc8:19.3.0.0
Try an alternate approach: enclose a query in parentheses and pass that query instead of referencing the entire table. Note that the query itself must use Oracle syntax (ROWNUM rather than LIMIT):
queryString = "(SELECT * FROM SCHEMA_TEST.TABLE_TEST WHERE ROWNUM <= 10) tmp"
print("Query to fetch data from Oracle:: ", queryString)
# Fetch data from the query
oracleDataDf = spark.read \
    .format("jdbc") \
    .option("url", oracleUrl) \
    .option("dbtable", queryString) \
    .option("user", oracleDbUser) \
    .option("password", oracleDbPass) \
    .option("driver", oracleDriver) \
    .load()
oracleDataDf.show()
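Alternatively, on Spark 2.4 and later (the question is on Spark 3.3.1), the JDBC source also has a query option, so you don't have to build the parenthesized subquery yourself. A small sketch, reusing the oracleUrl/oracleDbUser/oracleDbPass/oracleDriver variables from above:
# Sketch using the JDBC "query" option (Spark 2.4+); the statement must still be
# valid Oracle SQL, so row limiting is done with ROWNUM instead of LIMIT.
oracleDataDf = spark.read \
    .format("jdbc") \
    .option("url", oracleUrl) \
    .option("query", "SELECT * FROM SCHEMA_TEST.TABLE_TEST WHERE ROWNUM <= 10") \
    .option("user", oracleDbUser) \
    .option("password", oracleDbPass) \
    .option("driver", oracleDriver) \
    .load()
oracleDataDf.show()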

Hive permissions on view not inherited in Spark

I have a parquet table test_table created in Hive. The location of one of the partitions is '/user/hive/warehouse/prod.db/test_table/date_id=20210701'.
I created a view based on this table:
create view prod.test_table_vw as
select date_id, src, telephone_number, action_date, duration from prod.test_table
Then I granted the SELECT privilege to a role:
GRANT SELECT ON TABLE prod.test_table_vw TO ROLE data_analytics;
Then a user with this role tries to query this data using spark.sql() from PySpark:
sql = spark.sql(f"""SELECT DISTINCT telephone_number
FROM prod.test_table_vw
WHERE date_id=20210701""")
sql.show(5)
This code returns a permission denied error:
Py4JJavaError: An error occurred while calling o138.collectToPython.
: org.apache.hadoop.security.AccessControlException: Permission denied: user=keytabuser, access=READ_EXECUTE, inode="/user/hive/warehouse/prod.db/test_table/date_id=20210701":hive:hive:drwxrwx--x
I can't grant SELECT rights on this table because it contains some sensitive fields; that's why I created a view with a limited list of columns.
Questions:
Why does Spark ignore Hive's SELECT permissions on the view? And how can I resolve this issue?
This is a Spark limitation:
When a Spark job accesses a Hive view, Spark must have privileges to
read the data files in the underlying Hive tables. Currently, Spark
cannot use fine-grained privileges based on the columns or the WHERE
clause in the view definition. If Spark does not have the required
privileges on the underlying data files, a SparkSQL query against the
view returns an empty result set, rather than an error.
Reference link.
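If granting the role direct read access to the underlying files is not acceptable because of the sensitive columns, one workaround (a suggestion of mine, not from the quoted documentation) is to have a user who already has access to prod.test_table materialize just the allowed columns into a separate table, and then grant SELECT on that table through Hive as before. A rough sketch, where prod.test_table_limited is a hypothetical table name:
# Hypothetical workaround: materialize only the non-sensitive columns into a new
# table whose files the data_analytics role is allowed to read. Run this as a user
# that can read prod.test_table; the GRANT is then issued through Hive as before.
spark.sql("""
    CREATE TABLE IF NOT EXISTS prod.test_table_limited
    USING parquet
    PARTITIONED BY (date_id)
    AS
    SELECT src, telephone_number, action_date, duration, date_id
    FROM prod.test_table
""")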

No FileSystem for scheme: oss

I am using Alibaba Cloud to store processed data from my Spark scripts, but I am unable to upload the data to storage. I know how to do this for S3 by including some jars, but I'm not sure how to do it for the Alibaba OSS service.
from pyspark.sql import SparkSession
from pyspark import SparkConf

conf = SparkConf()
conf.set("spark.hadoop.fs.oss.impl", "com.aliyun.fs.oss.nat.NativeOssFileSystem")
spark = SparkSession.builder.config("spark.jars", "/home/username/mysql-connector-java-5.1.38.jar") \
.master("local").appName("PySpark_MySQL_test").getOrCreate()
wine_df = spark.read.format("jdbc").option("url", "jdbc:mysql://db.com:3306/service_db") \
.option("driver", "com.mysql.jdbc.Driver").option("query", "select * from transactions limit 1000") \
.option("user", "***").option("password", "***").load()
outputPath = "oss://Bucket_name"
rdd = wine_df.rdd.map(list)
rdd.saveAsTextFile(outputPath)
I think it may be because you have not opened up access to OSS.
In OSS, click your bucket and then Authorize, and change the relevant rules, for example by adding allowed IPs.
That could work for you.
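Separately, note that in the code above the SparkConf holding fs.oss.impl is created but never passed to the SparkSession, so the oss:// scheme is never registered. Below is a rough sketch of how the OSS settings might be wired in, assuming the matching Aliyun OSS connector jars are on the classpath; the implementation class and credential key names depend on the connector version, so treat them as placeholders to verify:
from pyspark.sql import SparkSession

# Rough sketch (unverified): the fs.oss.* settings have to reach the SparkSession,
# and the OSS connector jar providing the implementation class must be on the
# classpath. Endpoint, credentials and output path below are placeholders.
spark = SparkSession.builder \
    .master("local") \
    .appName("PySpark_MySQL_test") \
    .config("spark.hadoop.fs.oss.impl", "com.aliyun.fs.oss.nat.NativeOssFileSystem") \
    .config("spark.hadoop.fs.oss.endpoint", "<oss_endpoint>") \
    .config("spark.hadoop.fs.oss.accessKeyId", "<access_key_id>") \
    .config("spark.hadoop.fs.oss.accessKeySecret", "<access_key_secret>") \
    .getOrCreate()

# wine_df is the dataframe loaded from MySQL in the question.
wine_df.write.mode("overwrite").parquet("oss://Bucket_name/output")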

How to create a table with primary key using jdbc spark connector (to ignite)

I'm trying to save a Spark dataframe to an Ignite cache using the Spark connector (PySpark) like this:
df.write.format("jdbc") \
.option("url", "jdbc:ignite:thin://<ignite ip>") \
.option("driver", "org.apache.ignite.IgniteJdbcThinDriver") \
.option("primaryKeyFields", 'id') \
.option("dbtable", "ignite") \
.mode("overwrite") \
.save()
# .option("createTableOptions", "primary key (id)") \
# .option("customSchema", 'id BIGINT PRIMARY KEY, txt TEXT') \
I get this error:
java.sql.SQLException: No PRIMARY KEY defined for CREATE TABLE
The library org.apache.ignite:ignite-spark-2.4:2.9.0 is installed. I can't use the ignite format because Azure Databricks uses a Spring Framework version that conflicts with the Spring Framework version bundled in org.apache.ignite:ignite-spark-2.4:2.9.0. So I'm trying to use the JDBC thin client instead, but I can only read from or append to an existing cache.
I can't use the overwrite mode because I can't choose a primary key. There is a primaryKeyFields option for the ignite format, but it doesn't work with jdbc. The jdbc customSchema option is ignored. The createTableOptions option adds the primary key statement after the closing parenthesis of the schema, which causes a SQL syntax error.
Is there a way to specify a primary key for the jdbc Spark connector?
Here's an example with correct syntax that should work fine:
DataFrameWriter < Row > df = resultDF
.write()
.format(IgniteDataFrameSettings.FORMAT_IGNITE())
.option(IgniteDataFrameSettings.OPTION_CONFIG_FILE(), configPath)
.option(IgniteDataFrameSettings.OPTION_TABLE(), "Person")
.option(IgniteDataFrameSettings.OPTION_CREATE_TABLE_PRIMARY_KEY_FIELDS(), "id, city_id")
.option(IgniteDataFrameSettings.OPTION_CREATE_TABLE_PARAMETERS(), "template=partitioned,backups=1")
.mode(Append);
Please let me know if something is wrong here.
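Since the question is about PySpark, here is a rough Python equivalent of the Java example above. The string option keys ("config", "table", "primaryKeyFields", "createTableParameters") are what I believe the IgniteDataFrameSettings constants resolve to; double-check them against your ignite-spark version:
# Rough PySpark equivalent of the Java example above; option keys are assumed to
# match the IgniteDataFrameSettings constants. config_path points at an Ignite
# XML configuration file.
df.write \
    .format("ignite") \
    .option("config", config_path) \
    .option("table", "Person") \
    .option("primaryKeyFields", "id, city_id") \
    .option("createTableParameters", "template=partitioned,backups=1") \
    .mode("append") \
    .save()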

Read mariaDB4J table into spark DataSet using JDBC

I am trying to read a table from MariaDB4J via JDBC using the following command:
Dataset<Row> jdbcDF = spark.read()
.format("jdbc")
.option("url", url)
.option("dbtable", String.format("SELECT userID FROM %s;", TableNAME))
.load();
I get the following error:
com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near 'SELECT userID FROM MYTABLE; WHERE 1=0' at line 1
I am not sure where the WHERE comes from and why I get this error...
Thanks
The value supplied to the dbtable option is not correct. Instead of specifying a bare SQL query, you should either use a table name qualified with the schema or a valid SQL query wrapped in parentheses with an alias.
Dataset<Row> jdbcDF = spark.read()
.format("jdbc")
.option("url", url)
.option("dbtable", String.format("(SELECT userID FROM %s) as table_alias", tableNAME))
.load();
Your question is similar to this one, so forgive me if I quote myself:
The reason why you see the strange looking query WHERE 1=0 is that Spark tries to infer the schema of your data frame without loading any actual data. This query is guaranteed never to deliver any results, but the query result's metadata can be used by Spark to get the schema information.
