Table or view not found exception when writing to Hive - apache-spark

I'm saving a dataframe to Hive using saveAsTable("schema.table"), but it throws an org.apache.spark.sql.AnalysisException: Table or view not found exception. This happens with both the "overwrite" and "append" mode flags.
The target table does indeed not exist; I checked using:
scala> spark.catalog.tableExists("table_name")
As I understand it, the mode flag only controls the behaviour when the target table already exists, so normally it should be irrelevant to the issue at hand. I suspect the issue is with the table creation itself, but I don't know how to investigate this.
Thank you!

In case you want PySpark / Spark SQL:
Spark's saveAsTable is not fully compatible with Hive tables.
I would suggest creating a temporary view in Spark and then loading the data into the Hive table using CTAS:
df.createOrReplaceTempView("mytempTable")
sqlContext.sql("create table mytable as select * from mytempTable")

Looks like a standard case of being stupid: the error had nothing to do with the table creation. I was attempting to compute table statistics on the very table I was trying to create, before actually creating it.
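For context, the failing sequence was presumably something like the following (a reconstruction; the exact statistics call isn't shown in the post):

# Computing statistics on the target table before it exists raises
# "Table or view not found".
spark.sql("ANALYZE TABLE schema.table COMPUTE STATISTICS")

# The write that would actually create the table only came afterwards.
df.write.mode("overwrite").saveAsTable("schema.table")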
Best,
SD_

Related

Meaning of spark.sql.sources.provider in TBLPROPERTIES

When I create a table in Spark over Parquet using saveAsTable and then view the table's TBLPROPERTIES, I see that one of the properties is spark.sql.sources.provider=parquet. I couldn't find this property anywhere in either the documentation or the Spark source itself, and I don't understand how it affects the table. Is there documentation anywhere on the TBLPROPERTIES that Spark appends to tables in general?
Spark SQL stores some Spark-specific table properties under the spark.sql prefix (e.g. table statistics for cost-based optimization).
Among them is spark.sql.sources.provider, which tells Spark SQL which data source (table) format to use when loading a (catalog) table.
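For reference, these properties can be inspected with a plain Spark SQL statement; the table name here is a placeholder:

# Lists Spark-managed table properties, including spark.sql.sources.provider.
spark.sql("SHOW TBLPROPERTIES my_table").show(truncate=False)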

How to write to Hive table with static partition using PySpark?

I've created a Hive table with a partition like this:
CREATE TABLE IF NOT EXISTS my_table
(uid INT, num INT) PARTITIONED BY (dt DATE)
Then, in PySpark, I have a dataframe and I've tried to write it to the Hive table like this:
df.write.format('hive').mode('append').partitionBy('dt').saveAsTable('my_table')
Running this I'm getting an exception:
Dynamic partition strict mode requires at least one static partition column. To turn this off set hive.exec.dynamic.partition.mode=nonstrict
I then added this config:
hive.exec.dynamic.partition=true
hive.exec.dynamic.partition.mode=nonstrict
This time there was no exception, but the table wasn't populated either!
Then I removed the above config and added this:
hive.exec.dynamic.partition=false
I also altered the code to:
df.write.format('hive').mode('append').partitionBy(dt='2022-04-29').saveAsTable('my_table')
This time I am getting:
Dynamic partition is disabled. Either enable it by setting hive.exec.dynamic.partition=true or specify partition column values
The Spark job I want to run will receive daily data, so I guess what I want is a static partition, but how does that work?
If you haven't predefined all the partitions, you will need to use:
hive.exec.dynamic.partition=true
hive.exec.dynamic.partition.mode=nonstrict
Remember that Hive is schema-on-read, and it won't automagically fix your data into partitions. You need to inform the metastore of the partitions.
You will need to do that manually with one of these two commands:
alter table <db_name>.<table_name> add partition(`date`='<date_value>') location '<hdfs_location_of the specific partition>';
or
MSCK REPAIR TABLE [tablename]
If the table is already created, and you are using append mode anyway, you can use insertInto instead of saveAsTable, and you don't even need .partitionBy('dt'):
df.write.format('hive').mode('append').insertInto('my_table')
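For completeness, here is a minimal sketch of how the two dynamic-partition settings above might be applied to a PySpark session before the write; the app name is a placeholder and the dataframe is assumed to match my_table's schema:

from pyspark.sql import SparkSession

# Build a Hive-enabled session with dynamic partitioning allowed.
spark = (SparkSession.builder
         .appName("daily-load")  # placeholder app name
         .config("hive.exec.dynamic.partition", "true")
         .config("hive.exec.dynamic.partition.mode", "nonstrict")
         .enableHiveSupport()
         .getOrCreate())

# df is assumed to have columns (uid, num, dt) matching my_table.
df.write.format('hive').mode('append').insertInto('my_table')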

How to perform insert overwrite dynamically on partitions of Delta file using PySpark?

I'm new to PySpark and am looking to overwrite a Delta partition dynamically. From other resources available online, I could see that Spark supports dynamic partition overwrite by setting the conf below to "dynamic":
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
However, when I try overwriting partitioned_table with a dataframe, the line of PySpark code below (on Databricks) overwrites the entire table instead of a single partition of the Delta file:
data.write.insertInto("partitioned_table", overwrite = True)
I did come across the option of using a Hive external table, but it is not straightforward in my case since partitioned_table is based on a Delta file.
Please let me know what I am missing here. Thanks in advance!
Look at this issue for details regarding dynamic overwrite on Delta tables: https://github.com/delta-io/delta/issues/348
You can use replaceWhere.
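A minimal PySpark sketch of replaceWhere, assuming the Delta table is partitioned by a dt column and stored at a path; both the path and the predicate are placeholders, not taken from the question:

# Overwrite only the rows matching the predicate; other partitions stay intact.
(data.write
    .format("delta")
    .mode("overwrite")
    .option("replaceWhere", "dt = '2022-04-29'")
    .save("/path/to/partitioned_table"))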

Hive Table or view not found although the Table exists

I am trying to run a Spark job written in Java on the Spark cluster to load records as a dataframe into a Hive table I created.
df.write().mode("overwrite").insertInto("dbname.tablename");
Although the table and database exist in Hive, it throws the error below:
org.apache.spark.sql.AnalysisException: Table or view not found: dbname.tablename, the database dbname doesn't exist.;
I also tried reading from an existing Hive table different from the one above, thinking there might have been an issue with my table creation.
I also checked whether my user has permission on the HDFS folder where Hive stores the data.
It all looks fine; I'm not sure what the issue could be.
Please suggest.
Thanks
I think it is searching for that table in Spark instead of Hive.
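One possible cause (my assumption, not stated in the answer) is that the SparkSession was built without Hive support, so table names are resolved against Spark's in-memory catalog rather than the Hive metastore. In PySpark terms (the Java builder API is analogous):

from pyspark.sql import SparkSession

# enableHiveSupport() points the session's catalog at the Hive metastore,
# so dbname.tablename can be resolved.
spark = (SparkSession.builder
         .appName("hive-insert")  # placeholder app name
         .enableHiveSupport()
         .getOrCreate())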

Using spark sql DataFrameWriter to create external Hive table

As part of a data integration process I am working on, I have a need to persist a Spark SQL DataFrame as an external Hive table.
My constraints at the moment:
Currently limited to Spark 1.6 (v1.6.0)
Need to persist the data in a specific location, retaining the data even if the table definition is dropped (hence external table)
I have found what appears to be a satisfactory solution to write the dataframe, df, as follows:
df.write.saveAsTable('schema.table_name',
format='parquet',
mode='overwrite',
path='/path/to/external/table/files/')
Doing a describe extended schema.table_name against the resulting table confirms that it is indeed external. I can also confirm that the data is retained (as desired) even if the table itself is dropped.
My main concern is that I can't really find a documented example of this anywhere, nor can I find much mention of it in the official docs (https://spark.apache.org/docs/1.6.0/api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter),
particularly the use of path to enforce the creation of an external table.
Is there a better/safer/more standard way to persist the dataframe?
I would rather create the Hive tables myself (e.g. CREATE EXTERNAL TABLE IF NOT EXISTS) exactly as I need them, and then in Spark just do: df.write.saveAsTable('schema.table_name', mode='overwrite').
This way you have control over the table creation and don't depend on the HiveContext doing what you need. In the past there were issues with Hive tables created this way, and the behavior can change in the future, since that API is generic and cannot guarantee the underlying implementation by the HiveContext.
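For illustration, a sketch of the approach described above in PySpark; the column definitions are assumptions, while the table name, path, and write call follow the question and answer:

# Pre-create the external table once, with an explicit location (columns are hypothetical).
sqlContext.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS schema.table_name (
        id INT,
        value STRING
    )
    STORED AS PARQUET
    LOCATION '/path/to/external/table/files/'
""")

# Then write from Spark into the table you control.
df.write.saveAsTable('schema.table_name', mode='overwrite')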
