AnalysisException: It is not allowed to add database prefix - apache-spark

I am attempting to read in data from a table that is in a schema using JDBC. However, I'm getting an error:
org.apache.spark.sql.AnalysisException: It is not allowed to add database prefix `myschema` for the TEMPORARY view name.;
The code is pretty straightforward; the error occurs on the third line (the others are included just to show what I am doing). myOptions includes url, dbtable, driver, user, and password.
SQLContext sqlCtx = new SQLContext(ctx);
Dataset<Row> df = sqlCtx.read().format("jdbc").options(myOptions).load();
df.createOrReplaceTempView("myschema.test_table");
df = sqlCtx.sql("select field1, field2, field3 from myschema.test_table");
So if database/schema qualifiers are not allowed, how do you reference the correct one for your table? Leaving the qualifier off gives an 'invalid object name' error from the database, which is expected.
The only option I have on the database side is to use a default schema; however, that is user-based rather than session-based, so I would have to create one user and connection per schema I want to access.
What am I missing here? This seems like a common use case.
Edit: Also, for those attempting to close this as "a problem that can no longer be reproduced or a simple typographical error": how about a comment explaining why that applies? If I have made a typo or a simple mistake, leave a comment and show me where. I can't be the only person who has run into this.
registerTempTable in Spark 1.2 used to work this way, and we were told that createOrReplaceTempView was supposed to replace it in 2.x. Yet the functionality is not there.

I figured it out.
The short answer is that the dbtable name and the temp view/table name are two different things and don't have to have the same value. dbtable defines where in the database to go for the data; the temp view/table name defines what you call it in your Spark SQL.
This was confusing at first because Spark 1.6 allowed the view name to match the fully qualified table name (and so the software I am using plugged the same value in for both on 1.6). If you were coding this by hand, you would just use an unqualified name for the temp table or view on either 1.6 or 2.2.
In order to reference a table in a schema in Spark 1.6, I had to do the following, because the dbtable and view name were the same:
1. Set dbtable to "schema.table"
2. Call registerTempTable("schema.table")
3. Reference the table as `schema.table` in the SQL (include the backticks so the whole thing is treated as a single identifier matching the view name)
However, in Spark 2.2, since a schema/database prefix is not allowed in the view name, you need to (see the sketch below):
1. Set dbtable to "schema.table"
2. Call createOrReplaceTempView("table")
3. Reference table (not schema.table) in the SQL, matching the view name
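A minimal PySpark sketch of the Spark 2.x pattern described above (the JDBC URL, driver, and credentials below are placeholders, not values from the question):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-temp-view").getOrCreate()

# Read the schema-qualified table over JDBC; the qualifier belongs in dbtable only.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:sqlserver://dbhost:1433;databaseName=mydb")   # placeholder URL
      .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")  # placeholder driver
      .option("dbtable", "myschema.test_table")
      .option("user", "db_user")
      .option("password", "db_password")
      .load())

# The temp view name must be unqualified in Spark 2.x.
df.createOrReplaceTempView("test_table")

spark.sql("select field1, field2, field3 from test_table").show()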

I guess you are trying to fetch a specific table from an RDBMS. If you are using Spark 2.x or later, you can use the code below to load your table into a DataFrame.
DF = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:oracle:thin:@//hostname:portnumber/SID") \
    .option("dbtable", "hr.emp") \
    .option("user", "db_user_name") \
    .option("password", "password") \
    .option("driver", "oracle.jdbc.driver.OracleDriver") \
    .load()
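If you then want to query it with Spark SQL, the same rule from the answer above applies: register the temp view under an unqualified name. A short sketch, reusing the DF variable from the snippet above:
# The view name stays unqualified even though dbtable is "hr.emp".
DF.createOrReplaceTempView("emp")
spark.sql("select * from emp").show()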

Related

Querying snowflake metadata using spark connector

I want to run a 'SHOW TABLES' statement through the spark-snowflake connector. I am running Spark on the Databricks platform and getting an "Object 'SHOW' does not exist or not authorized" error.
df = spark.read \
.format("snowflake") \
.options(**options) \
.option("query", "show tables") \
.load()
df.show()
A sample query like "SELECT 1" works as expected.
I know that I could install the native Python Snowflake driver, but I want to avoid that solution if possible because I already opened the session using Spark.
There is also a way using the "Utils.runQuery" function, but I understand that is relevant only for DDL statements (it doesn't return the actual results).
Thanks!
When using DataFrames, the Snowflake connector supports SELECT queries only.
This is documented in our docs.
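A SELECT-based workaround sketch, assuming the same options dict and Spark session used in the question: table metadata can usually be read from Snowflake's standard INFORMATION_SCHEMA views, which the query option can push down.
# List tables via a SELECT against INFORMATION_SCHEMA instead of SHOW TABLES,
# which the DataFrame-based connector cannot run.
df = (spark.read
      .format("snowflake")
      .options(**options)
      .option("query",
              "select table_name, table_schema "
              "from information_schema.tables "
              "where table_type = 'BASE TABLE'")
      .load())
df.show()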

Delta lake in databricks - creating a table for existing storage

I currently have an append table in Databricks (Spark 3, Databricks 7.5):
parsedDf \
.select("somefield", "anotherField",'partition', 'offset') \
.write \
.format("delta") \
.mode("append") \
.option("mergeSchema", "true") \
.save(f"/mnt/defaultDatalake/{append_table_name}")
It was created with a CREATE TABLE command before, and I don't use INSERT commands to write to it (as seen above).
Now I want to be able to use SQL to query it without going through createOrReplaceTempView every time. Is it possible to add a table for the current data without removing it? What changes do I need to make to support this?
UPDATE:
I've tried:
res= spark.sql(f"CREATE TABLE exploration.oplog USING DELTA LOCATION '/mnt/defaultDataLake/{append_table_name}'")
But I get an AnalysisException:
You are trying to create an external table exploration.dataitems_oplog from /mnt/defaultDataLake/specificpathhere using Databricks Delta, but the schema is not specified when the input path is empty.
The path isn't empty, though.
Starting with Databricks Runtime 7.0, you can create a table in the Hive metastore from existing data, automatically discovering the schema, partitioning, etc. (see the documentation for all details). The base syntax is the following (replace the values in <> with actual values):
CREATE TABLE <database>.<table>
USING DELTA
LOCATION '/mnt/defaultDatalake/<append_table_name>'
P.S. There is more documentation on different aspects of managed vs. unmanaged tables that could be useful to read.
P.P.S. This works just fine for me on DBR 7.5 ML.
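A minimal PySpark sketch of the same approach, reusing the mount path and the exploration.oplog name from the question (treat both as placeholders for your own database/table):
# Register the existing Delta directory as a metastore table, then query it
# with plain SQL -- no createOrReplaceTempView needed afterwards.
spark.sql("CREATE DATABASE IF NOT EXISTS exploration")
spark.sql(f"""
    CREATE TABLE IF NOT EXISTS exploration.oplog
    USING DELTA
    LOCATION '/mnt/defaultDatalake/{append_table_name}'
""")

spark.sql("select somefield, anotherField from exploration.oplog").show()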

Spark jdbc overwrite mode not working as expected

I would like to perform update and insert operations using Spark.
Please find the image reference of the existing table.
Here I am updating the location and inserttime for id 101, inserting 2 more records,
and writing to the target with overwrite mode:
df.write.format("jdbc")
.option("url", "jdbc:mysql://localhost/test")
.option("driver","com.mysql.jdbc.Driver")
.option("dbtable","temptgtUpdate")
.option("user", "root")
.option("password", "root")
.option("truncate","true")
.mode("overwrite")
.save()
After executing the above command, the data that gets inserted into the DB table is corrupted.
Data in the dataframe:
Could you please let me know your observations and solutions?
The Spark JDBC writer supports the following save modes:
append: Append the contents of this DataFrame to the existing data.
overwrite: Overwrite the existing data.
ignore: Silently ignore this operation if data already exists.
error (the default): Throw an exception if data already exists.
https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
Since you are using "overwrite" mode it recreate your table as per then column length, if you want your own table definition create table first and use "append" mode
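A minimal sketch of that recommendation, assuming the MySQL table temptgtUpdate has already been created with the desired definition (connection details are the placeholders from the question):
# Append instead of overwrite so Spark does not drop and recreate the table.
(df.write.format("jdbc")
    .option("url", "jdbc:mysql://localhost/test")
    .option("driver", "com.mysql.jdbc.Driver")
    .option("dbtable", "temptgtUpdate")
    .option("user", "root")
    .option("password", "root")
    .mode("append")
    .save())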
"I would like to perform update and insert operations using Spark."
There is no equivalent of the SQL UPDATE statement in Spark SQL, nor is there an equivalent of the SQL DELETE WHERE statement. Instead, you will have to delete the rows requiring an update outside of Spark, then write the Spark dataframe containing the new and updated records to the table using append mode (in order to preserve the remaining existing rows in the table).
In case you need to perform UPSERT/DELETE operations in your PySpark code, I suggest you use the pymysql library and execute your upsert/delete operations there. Please check this post for more info and a code sample for reference: Error while using INSERT INTO table ON DUPLICATE KEY, using a for loop array
Please modify the code sample as per your needs.
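A minimal pymysql upsert sketch along those lines (the table and column names are guesses based on the question, and the connection details are the same placeholders used above; adapt both to your schema):
import pymysql

# Collect the (small) dataframe to the driver; for large data, batch this instead.
rows = [r.asDict() for r in df.collect()]

conn = pymysql.connect(host="localhost", user="root", password="root", database="test")
try:
    with conn.cursor() as cur:
        # INSERT ... ON DUPLICATE KEY UPDATE performs the upsert in MySQL.
        cur.executemany(
            """
            INSERT INTO temptgtUpdate (id, location, inserttime)
            VALUES (%(id)s, %(location)s, %(inserttime)s)
            ON DUPLICATE KEY UPDATE
                location = VALUES(location),
                inserttime = VALUES(inserttime)
            """,
            rows,
        )
    conn.commit()
finally:
    conn.close()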
I wouldn't recommend TRUNCATE, since it would actually drop the table and create a new one. In doing so, the table may lose the column-level attributes that were set earlier... so be careful when using TRUNCATE, and be sure it is OK to drop and recreate the table.
The upsert logic works fine when following the steps below:
df = (spark.read.format("csv").
load("file:///C:/Users/test/Desktop/temp1/temp1.csv", header=True,
delimiter=','))
and then doing this:
(df.write.format("jdbc").
option("url", "jdbc:mysql://localhost/test").
option("driver", "com.mysql.jdbc.Driver").
option("dbtable", "temptgtUpdate").
option("user", "root").
option("password", "root").
option("truncate", "true").
mode("overwrite").save())
Still, I am unable to understand why it fails when I write using the dataframe directly.

Loading Snowflake from Databricks changes table structure

I'm doing a POC to load a Snowflake table from a dataframe in Databricks. I've successfully loaded the table; however, the load changes the table's structure.
For example, in Snowflake I created this table:
CREATE OR REPLACE TABLE FNT_DAYS
(
FNT_DT_PK TIMESTAMP_NTZ NOT NULL,
OPEN_DT_FLG VARCHAR(1),
HOLIDAY_DT_FLG VARCHAR(1),
LOAD_USR VARCHAR(10)
);
ALTER TABLE FNT_DAYS ADD CONSTRAINT FNT_DAYS_PK PRIMARY KEY (FNT_DT_PK);
When running my code in Databricks using Python, the table gets loaded successfully; however, the structure of the table changes to this:
CREATE OR REPLACE TABLE FNT_DAYS
(
FNT_DT_PK TIMESTAMP_NTZ,
OPEN_DT_FLG VARCHAR(16777216),
HOLIDAY_DT_FLG VARCHAR(16777216),
LOAD_USR VARCHAR(10)
);
Note that the primary key constraint is gone, the FNT_DT_PK field is no longer NOT NULL, and every single VARCHAR field's length has been changed to 16777216.
My Python code in Databricks is very straightforward:
%python
options = dict(sfUrl="mysnflk.snowflakecomputing.com",
               sfUser="me",
               sfPassword="******",
               sfDatabase="SNF_DB",
               sfSchema="PUBLIC",
               sfWarehouse="SNF_WH")

df = spark.sql("select * from exc.test")
df.write \
  .format("snowflake") \
  .mode("overwrite") \
  .options(**options) \
  .option("dbtable", "FNT_DAYS") \
  .save()
Do you have an idea of why the table structure is changed in Snowflake?
If you look at the query_history in Snowflake, do you see that the table is being recreated by the df.write command? It seems that it recreates the table and uses the datatypes of the dataframe to define the new table. I don't know exactly what is causing that, but I do see that the Snowflake example (https://docs.snowflake.net/manuals/user-guide/spark-connector-use.html#id1) has a slightly different syntax for the mode.
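A sketch of that check through the same connector, pulling recent statements from Snowflake's standard INFORMATION_SCHEMA.QUERY_HISTORY table function (the options dict is the one from the question; the filter on FNT_DAYS is just for illustration):
# Look for DROP/CREATE statements issued against FNT_DAYS by the connector.
history = (spark.read
           .format("snowflake")
           .options(**options)
           .option("query",
                   "select query_text, start_time "
                   "from table(information_schema.query_history()) "
                   "where query_text ilike '%FNT_DAYS%' "
                   "order by start_time desc")
           .load())
history.show(truncate=False)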
I should also note that the length of those VARCHAR fields will not hurt you in any way in Snowflake. Length does not affect storage or performance, and those lengths just mean that the connector is using VARCHAR as the data type without specifying a length. Also, primary key constraints are not enforced, so I'm not sure how important that is to you. The only thing I'd be concerned about is your NOT NULL, which Snowflake does enforce.

Databricks Azure database warehouse saving tables

I am using the following code to write to an Azure SQL Data Warehouse table:
df_execution_config_remain.write \
  .format("com.databricks.spark.sqldw") \
  .option("user", user) \
  .option("password", pswd) \
  .option("url", "jdbc:sqlserver://" + sqlserver + ":" + port + ";database=" + database) \
  .option("forwardSparkAzureStorageCredentials", "true") \
  .option("dbTable", execution_config) \
  .option("tempDir", dwtmp) \
  .mode("Overwrite") \
  .save()
But Overwrite will drop the table and recreate it.
Questions:
1. I found that the newly created table has round-robin distribution, which I don't want.
2. The columns have a different length from the original table: varchar(256).
3. I don't want to use append, because I would like to clear the existing rows in the current table.
Q1: Refer to the tableOptions parameter under the following link:
https://docs.databricks.com/spark/latest/data-sources/azure/sql-data-warehouse.html#parameters
Q2: Are you being affected by the maxStrLength parameter under that same link?
Q3: I think your approach is sound, but an alternative might be to use the preActions parameter under that same link, and TRUNCATE the table before loading.
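A sketch of that Q3 alternative: keep the existing table definition by appending, and clear the current rows first via preActions (tableOptions and maxStrLength only matter when the connector creates the table, i.e. in Overwrite mode; variable names are the ones from the question):
# TRUNCATE the target first, then append, so the hand-written table definition
# (distribution, column lengths) is preserved.
df_execution_config_remain.write \
  .format("com.databricks.spark.sqldw") \
  .option("user", user) \
  .option("password", pswd) \
  .option("url", "jdbc:sqlserver://" + sqlserver + ":" + port + ";database=" + database) \
  .option("forwardSparkAzureStorageCredentials", "true") \
  .option("dbTable", execution_config) \
  .option("tempDir", dwtmp) \
  .option("preActions", "TRUNCATE TABLE " + execution_config) \
  .mode("append") \
  .save()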
