Querying snowflake metadata using spark connector - apache-spark

I want to run a 'SHOW TABLES' statement through the spark-snowflake connector. I am running Spark on the Databricks platform and getting an "Object 'SHOW' does not exist or not authorized" error.
df = spark.read \
.format("snowflake") \
.options(**options) \
.option("query", "show tables") \
.load()
df.show()
A sample query like "SELECT 1" works as expected.
I know that I could install the native Snowflake Python driver, but I want to avoid that solution if possible because I have already opened the session using Spark.
There is also the "Utils.runQuery" function, but I understand it is only relevant for DDL statements (it doesn't return the actual results).
Thanks!

When using DataFrames, the Snowflake connector supports SELECT queries only.
This is covered in our documentation.
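As a possible workaround (my sketch, not part of the original answer), the metadata that SHOW TABLES returns can usually be obtained with a plain SELECT against Snowflake's INFORMATION_SCHEMA, which the DataFrame API does support. A minimal sketch, assuming the options dict already points at a valid database/schema and the role is allowed to read INFORMATION_SCHEMA:
# Hypothetical workaround: list tables via INFORMATION_SCHEMA instead of SHOW TABLES
tables_df = spark.read \
    .format("snowflake") \
    .options(**options) \
    .option("query", "select table_catalog, table_schema, table_name, table_type from information_schema.tables") \
    .load()
tables_df.show()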

Related

Delta lake in databricks - creating a table for existing storage

I currently have an append table in Databricks (Spark 3, Databricks Runtime 7.5).
parsedDf \
.select("somefield", "anotherField",'partition', 'offset') \
.write \
.format("delta") \
.mode("append") \
.option("mergeSchema", "true") \
.save(f"/mnt/defaultDatalake/{append_table_name}")
It was previously created with a CREATE TABLE command, and I don't use INSERT commands to write to it (as seen above).
Now I want to be able to use SQL logic to query it without going through createOrReplaceTempView every time. Is it possible to register a table on top of the current data without removing it? What changes do I need to support this?
UPDATE:
I've tried:
res= spark.sql(f"CREATE TABLE exploration.oplog USING DELTA LOCATION '/mnt/defaultDataLake/{append_table_name}'")
But I get an AnalysisException:
You are trying to create an external table exploration.dataitems_oplog
from /mnt/defaultDataLake/specificpathhere using Databricks Delta, but the schema is not specified when the
input path is empty.
Yet the path isn't empty.
Starting with Databricks Runtime 7.0, you can create a table in the Hive metastore from existing data, automatically discovering the schema, partitioning, etc. (see the documentation for all details). The base syntax is as follows (replace the values in <> with actual values):
CREATE TABLE <database>.<table>
USING DELTA
LOCATION '/mnt/defaultDatalake/<append_table_name>'
P.S. There is more documentation on the different aspects of managed vs. unmanaged tables that could be useful to read.
P.P.S. This works just fine for me on DBR 7.5 ML.
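For illustration (my sketch, not the answerer's screenshot), a minimal end-to-end check might look like the code below, assuming the Delta files already exist at the mount path and append_table_name is defined. Note that the location must match the path used for writing exactly, including case (the question writes to /mnt/defaultDatalake but creates the table over /mnt/defaultDataLake):
# Hypothetical check: register the existing Delta directory as a table, then query it with SQL
spark.sql(f"""
    CREATE TABLE IF NOT EXISTS exploration.oplog
    USING DELTA
    LOCATION '/mnt/defaultDatalake/{append_table_name}'
""")
spark.sql("SELECT somefield, anotherField FROM exploration.oplog LIMIT 10").show()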

Not able to write data in Hive using sparksql

I am loading data from one Hive table to another using Spark SQL. I've created the SparkSession with enableHiveSupport and I'm able to create tables in Hive using Spark SQL, but when I load data from one Hive table to another I get a permission issue:
Permission denied: user=anonymous, access=WRITE, path="hivepath".
I am running this as the spark user but I'm not able to understand why it is taking anonymous as the user instead of spark. Can anyone suggest how I should resolve this issue?
I'm using the code below.
sparksession.sql("insert overwrite table dbname.tablename select * from dbname.tablename")
If you're using Spark, you need to set the username in your Spark context.
System.setProperty("HADOOP_USER_NAME","newUserName")
val spark = SparkSession
.builder()
.appName("SparkSessionApp")
.master("local[*]")
.getOrCreate()
println(spark.sparkContext.sparkUser)
The first thing you may try, for the anonymous user:
root#host:~# su - hdfs
hdfs#host:~$ hadoop fs -mkdir /user/anonymous
hdfs#host:~$ hadoop fs -chown anonymous /user/anonymous
In general, running
export HADOOP_USER_NAME=youruser
before spark-submit will work, along with a spark-submit configuration like the one below:
--conf "spark.yarn.appMasterEnv.HADOOP_USER_NAME=${HADOOP_USER_NAME}" \
Alternatively, you can try using
sudo -su username spark-submit --class <your class>
Note: ideally this username setting should be part of your initial cluster setup; if that is done, then none of the above is needed and it is seamless.
I personally don't prefer hard-coding the username in the code; it should come from outside the Spark job.
To validate which user you are running as, run the command below:
sc.sparkUser
It will show you the current user; you can then try setting a new user.
In Scala, you can set the username with:
System.setProperty("HADOOP_USER_NAME","newUserName")
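For completeness (my sketch, not part of the original answers), a rough PySpark equivalent is to export the variable before the SparkSession (and its JVM) is created, so the driver process inherits it:
import os
from pyspark.sql import SparkSession

# Hypothetical PySpark sketch: set the Hadoop user before the JVM/SparkSession starts
os.environ["HADOOP_USER_NAME"] = "newUserName"

spark = SparkSession.builder \
    .appName("SparkSessionApp") \
    .master("local[*]") \
    .getOrCreate()

print(spark.sparkContext.sparkUser())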

How can I use the saveAsTable function when I have two Spark streams running in parallel in the same notebook?

I have two Spark streams set up in a notebook to run in parallel like so.
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool1")
df1 = spark \
.readStream.format("delta") \
.table("test_db.table1") \
.select('foo', 'bar')
writer_df1 = df1.writeStream.option("checkpointLocation", checkpoint_location_1) \
.foreachBatch(
lambda batch_df, batch_epoch:
process_batch(batch_df, batch_epoch)
) \
.start()
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool2")
df2 = spark \
.readStream.format("delta") \
.table("test_db.table2") \
.select('foo', 'bar')
writer_df2 = df2.writeStream.option("checkpointLocation", checkpoint_location_2) \
.foreachBatch(
lambda batch_df, batch_epoch:
process_batch(batch_df, batch_epoch)
) \
.start()
These dataframes then get processed row by row, with each row being sent to an API. If the API call reports an error, I convert the row to JSON and append it to a common failures table in Databricks.
columns = ['table_name', 'record', 'time_of_failure', 'error_or_status_code']
vals = [(table_name, json.dumps(row.asDict()), datetime.now(), str(error_or_http_code))]
error_df = spark.createDataFrame(vals, columns)
error_df.select('table_name', 'record', 'time_of_failure', 'error_or_status_code').write.format('delta').mode('append').saveAsTable("failures_db.failures_db")
When attempting to add the row to this table, the saveAsTable() call here throws the following exception.
py4j.protocol.Py4JJavaError: An error occurred while calling o3578.saveAsTable.
: java.lang.IllegalStateException: Cannot find the REPL id in Spark local properties. Spark-submit and R doesn't support transactional writes from different clusters. If you are using R, please switch to Scala or Python. If you are using spark-submit , please convert it to Databricks JAR job. Or you can disable multi-cluster writes by setting 'spark.databricks.delta.multiClusterWrites.enabled' to 'false'. If this is disabled, writes to a single table must originate from a single cluster. Please check https://docs.databricks.com/delta/delta-intro.html#frequently-asked-questions-faq for more details.
If I comment out one of the streams and re-run the notebook, any errors from the API calls get inserted into the table with no issues. I feel like there's some configuration I need to add but am not sure of where to go from here.
Not sure if this is the best solution, but I believe the problem comes from each stream writing to the table at the same time. I split this table into separate tables for each stream and it worked after that.
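For illustration (my sketch, not the answerer's code), "separate tables for each stream" could mean parameterizing the failure-table name per stream, for example inside a hypothetical error-logging helper called from process_batch:
import json
from datetime import datetime

# Hypothetical sketch: each stream logs failures to its own Delta table
# (assumes the Databricks notebook-global `spark` session)
def log_failure(table_name, row, error_or_http_code, failures_table):
    columns = ['table_name', 'record', 'time_of_failure', 'error_or_status_code']
    vals = [(table_name, json.dumps(row.asDict()), datetime.now(), str(error_or_http_code))]
    error_df = spark.createDataFrame(vals, columns)
    error_df.write.format('delta').mode('append').saveAsTable(failures_table)

# e.g. the table1 stream passes "failures_db.failures_table1"
# and the table2 stream passes "failures_db.failures_table2"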

Spark SQL error cannot evaluate expression, worked in 2.3.2 and fails in 2.4

I just posted an issue, I hope I did not over-step the protocol!
https://issues.apache.org/jira/browse/SPARK-26777
I wonder if anyone has hit this problem with Spark SQL 2.4.0 (from PySpark, Python 3.6).
spark.sql("select partition_year_utc,partition_month_utc,partition_day_utc \
from datalake_reporting.copy_of_leads_notification \
where partition_year_utc = (select max(partition_year_utc) from datalake_reporting.copy_of_leads_notification) \
and partition_month_utc = \
(select max(partition_month_utc) from datalake_reporting.copy_of_leads_notification as m \
where \
m.partition_year_utc = (select max(partition_year_utc) from datalake_reporting.copy_of_leads_notification)) \
and partition_day_utc = (select max(d.partition_day_utc) from datalake_reporting.copy_of_leads_notification as d \
where d.partition_month_utc = \
(select max(m1.partition_month_utc) from datalake_reporting.copy_of_leads_notification as m1 \
where m1.partition_year_utc = \
(select max(y.partition_year_utc) from datalake_reporting.copy_of_leads_notification as y) \
) \
) \
order by 1 desc, 2 desc, 3 desc limit 1 ").show(1,False)
The above PySpark/SQL code works in Presto/Athena, and it used to work in Spark 2.3.2 as well.
Now, on the latest Spark 2.4.0 (AWS EMR 5.20.0), it fails with the following error (query syntax):
py4j.protocol.Py4JJavaError: An error occurred while calling
o1326.showString. : java.lang.UnsupportedOperationException: Cannot
evaluate expression: scalar-subquery#4495 []
I submitted an issue to Spark, but I am also wondering if someone already knows about it.
I could rewrite this SQL to break it up into multiple (3-4) simple SQL statements, but I thought I'd post it here for opinions, as it is rather trivial code.
Thank you!
I'm running into the same issue and am reverting to EMR 5.17 for the time being, but after doing some reading I'm curious whether subquery aliasing might be the cause.
"Un-aliased subquery’s semantic has not been well defined with confusing behaviors. Since Spark 2.3, we invalidate such confusing cases, for example: SELECT v.i from (SELECT i FROM v), Spark will throw an analysis exception in this case because users should not be able to use the qualifier inside a subquery. See SPARK-20690 and SPARK-21335 for more details."
https://spark.apache.org/docs/2.4.0/sql-migration-guide-upgrade.html
You're using datalake_reporting.copy_of_leads_notification in both your query and your subqueries; maybe you need to use an alias?
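As a sketch of the workaround the asker mentions (breaking the statement into a few simpler queries), assuming the same table and partition column names as above (and quoting the values only if the partition columns are strings), something like this avoids the nested scalar subqueries entirely:
# Hypothetical rewrite: resolve the latest partition step by step instead of with nested scalar subqueries
tbl = "datalake_reporting.copy_of_leads_notification"

max_year = spark.sql(f"select max(partition_year_utc) as y from {tbl}").collect()[0].y
max_month = spark.sql(
    f"select max(partition_month_utc) as m from {tbl} where partition_year_utc = '{max_year}'"
).collect()[0].m
max_day = spark.sql(
    f"select max(partition_day_utc) as d from {tbl} "
    f"where partition_year_utc = '{max_year}' and partition_month_utc = '{max_month}'"
).collect()[0].d

print(max_year, max_month, max_day)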

AnalysisException: It is not allowed to add database prefix

I am attempting to read in data from a table that is in a schema using JDBC. However, I'm getting an error:
org.apache.spark.sql.AnalysisException: It is not allowed to add database prefix `myschema` for the TEMPORARY view name.;
The code is pretty straightforward; the error occurs on the third line (the others are included just to show what I am doing). myOptions includes url, dbtable, driver, user, and password.
SQLContext sqlCtx = new SQLContext(ctx);
Dataset<Row> df = sqlCtx.read().format("jdbc").options(myOptions).load();
df.createOrReplaceTempView("myschema.test_table");
df = sqlCtx.sql("select field1, field2, field3 from myschema.test_table");
So if database/schema qualifiers are not allowed, then how do you reference the correct one for your table? Leaving it off gives an 'invalid object name' from the database which is expected.
The only option I have at the database side is to use default schema, however this is user-based and not session-based so I would have to create one user and connection per schema I want to access.
What am I missing here? This seems like a common use case.
Edit: Also, for those attempting to close this as "a problem that can no longer be reproduced or a simple typographical error": how about a comment explaining why that is the reason to close? If I have made a typo or a simple mistake, leave a comment and show me what it is. I can't be the only person who has run into this.
registerTempTable in Spark 1.2 used to work this way, and we were told that createOrReplaceTempView was supposed to replace it in 2.x. Yet the functionality is not there.
I figured it out.
The short answer is... the dbtable name and the temp view/table name are two different things and don't have to have the same value. dbtable defines where in the database to go for the data; the temp view/table name is what you call it in your Spark SQL.
This was confusing at first because Spark 1.6 allowed the view name to match the full table name (and so the software I am using plugged it in for both on 1.6). If you were coding this by hand, you would just use an unqualified table name for the temp table or view on either 1.6 or 2.2.
In order to reference a table in a schema in Spark 1.6, I had to do the following because the dbtable and view name were the same:
1. dbtable to "schema.table"
2. registerTempTable("schema.table")
3. Reference table as `schema.table` (include the ticks to treat the entire thing as an identifier to match the view name) in the SQL
However, in Spark 2.2, since a schema/database prefix is not allowed in the view name, you need to do the following (see the sketch after this list):
1. dbtable to "schema.table"
2. createOrReplaceTempView("table")
3. Reference table (not schema.table) in the SQL (matching the view)
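For illustration (my sketch in PySpark rather than the Java above, with the same hypothetical names, and assuming myOptions is a dict of the JDBC options):
# Hypothetical sketch: dbtable keeps the schema prefix, the temp view name does not
df = spark.read.format("jdbc") \
    .options(**myOptions) \
    .option("dbtable", "myschema.test_table") \
    .load()

df.createOrReplaceTempView("test_table")
spark.sql("select field1, field2, field3 from test_table").show()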
I guess you are trying to fetch a specific table from an RDBMS. If you are using Spark 2.x or later, you can use the code below to get your table into a DataFrame.
DF = spark.read \
.format("jdbc") \
.option("url", "jdbc:oracle:thin:username/password#//hostname:portnumber/SID") \
.option("dbtable", "hr.emp") \
.option("user", "db_user_name") \
.option("password", "password") \
.option("driver", "oracle.jdbc.driver.OracleDriver") \
.load()
