Spark 2.2.1 HQL to ALTER Hive Table with partitions fails with InvalidOperationExeception - apache-spark

I have an application where I'm sending HQL using SparkSession.sql() method.
First I create a table with parititions
CREATE TABLE table_name (Id BigInt) PARTITIONED BY(Age BigInt)
After this I have following ALTER table statement as follows :
ALTER TABLE table_name ADD COLUMNS(Name String)
The ALTER command is failing with following exception :
InvalidOpearationException(message: parititions keys can not be changed)
When I try to execute ALTER statement on a table which doesn't have any partitions it runs fine.
Also the above code runs fine with Spark 1.6.
The above HQL statements also works fine if I directly run them in Hive.
I read the change log of Spark but couldn't find any explanation for following behavior.
Can anyone help me in figuring out what is happening ?

Related

Pyspark - Creating a delta table while enableHiveSupport()

I'm creating a delta table on a EMR (6.2) using the following code:
try:
self.spark.sql(f'''
CREATE TABLE default.features_scd
(`{entity_key}` {entity_value_type}, `{CURRENT}` BOOLEAN,
`{EFFECTIVE_TIMESTAMP}` TIMESTAMP, `{END_TIMESTAMP}` TIMESTAMP, `date` DATE)
USING DELTA
PARTITIONED BY (DATE)
LOCATION 's3://mybucket/some/path'
''')
except IllegalArgumentException as e:
self.logger.error('got an illegal argument exception')
pass
I have enableHiveSupport() on the spark session.
I'm getting the warning:
WARN HiveExternalCatalog: Couldn't find corresponding Hive SerDe for data source provider delta. Persisting data source table default.features_scd into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.
And the exception:
IllegalArgumentException: Can not create a Path from an empty string
Basically the table is being created well and I'm able to get my goal, but I wish to not have to try except and pass on that error.
If I run the same code without the enableHiveSupport() it runs smoothly. But I need the hive support in the same session for creating/updating a hive table.
Is there a way to prevent this exception?

Spark : Can't set table properties start with "spark.sql" to hive external table while creating

Env : linux (spark-submit xxx.py)
Target database : Hive
We used to use beeline to execute hql, but now we try to run the hql through pyspark and faced some issue when tried to set table properties while creating the table.
SQL
CREATE EXTERNAL TABLE example.a(
column_a string)
TBLPROPERTIES (
'discover.partitions'='true',
'spark.sql.sources.schema.numPartCols'='1',
'spark.sql.sources.schema.numParts'='1',
'spark.sql.sources.schema.part.0'='{"type":"struct","fields":[{"name":"column_a","type":"string","nullable":true,"metadata":{}}]}',
'spark.sql.sources.schema.partCol.0'='received_utc_date_partition');
Error message
Hive - ERROR - Cannot persist
example.a into Hive metastore as table property
keys may not start with 'spark.sql.': [spark.sql.sources.schema.partCol.0, spark.sql.sources.schema.numParts,
spark.sql.sources.schema.numPartCols, spark.sql.sources.schema.part.0];
In line 130-147 in spark source code it seems that it prevent all table properties that start with "spark.sql"
Not sure if I did it wrong or there's another way to set up the table properties for hive table.
Any kinds of suggestion is appreciated.

Error While Writing into a Hive table from Spark Sql

I am trying to insert data into a Hive External table from Spark Sql.
I am created the hive external table through the following command
CREATE EXTERNAL TABLE tab1 ( col1 type,col2 type ,col3 type) CLUSTERED BY (col1,col2) SORTED BY (col1) INTO 8 BUCKETS STORED AS PARQUET
In my spark job , I have written the following code
Dataset df = session.read().option("header","true").csv(csvInput);
df.repartition(numBuckets, somecol)
.write()
.format("parquet")
.bucketBy(numBuckets,col1,col2)
.sortBy(col1)
.saveAsTable(hiveTableName);
Each time I am running this code I am getting the following exception
org.apache.spark.sql.AnalysisException: Table `tab1` already exists.;
at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:408)
at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:393)
at somepackage.Parquet_Read_WriteNew.writeToParquetHiveMetastore(Parquet_Read_WriteNew.java:100)
You should be specifying a save mode while saving the data in hive.
df.write.mode(SaveMode.Append)
.format("parquet")
.bucketBy(numBuckets,col1,col2)
.sortBy(col1)
.insertInto(hiveTableName);
Spark provides the following save modes:
Save Mode
ErrorIfExists: Throws an exception if the target already exists. If target doesn’t exist write the data out.
Append: If target already exists, append the data to it. If the data doesn’t exist write the data out.
Overwrite: If the target already exists, delete the target. Write the data out.
Ignore: If the target already exists, silently skip writing out. Otherwise write out the data.
You are using the saveAsTable API, which create the table into Hive. Since you have already created the hive table through command, the table tab1 already exists. so when Spark API trying to create it, it throws error saying table already exists, org.apache.spark.sql.AnalysisException: Tabletab1already exists.
Either drop the table and let spark API saveAsTable create the table itself.
Or use the API insertInto to insert into an existing hive table.
df.repartition(numBuckets, somecol)
.write()
.format("parquet")
.bucketBy(numBuckets,col1,col2)
.sortBy(col1)
.insertInto(hiveTableName);

Spark returns Empty DataFrame but Populated in Hive

I have a table in hive
db.table_name
When I run the following in hive I get results back
SELECT * FROM db.table_name;
When I run the following in a spark-shell
spark.read.table("db.table_name").show
It shows nothing. Similarly
sql("SELECT * FROM db.table_name").show
Also shows nothing. Selecting arbitrary columns out before the show also displays nothing. Performing a count states the table has 0 rows.
Running the same queries works against other tables in the same database.
Spark Version: 2.2.0.cloudera1
The table is created using
table.write.mode(SaveMode.Overwrite).saveAsTable("db.table_name")
And if I read the file using the parquet files directly it works.
spark.read.parquet(<path-to-files>).show
EDIT:
I'm currently using a workaround by describing the table and getting the location and using spark.read.parquet.
Have you refresh metadata table? Maybe you need to refresh table to access to new data.
spark.catalog.refreshTable("my_table")
I solved the problem by using
query_result.write.mode(SaveMode.Overwrite).format("hive").saveAsTable("table")
which stores the results in textfile.
There is probably some incompatibility with the Hive parquet.
I also found a Cloudera report about it (CDH Release Notes): they recommend creating the Hive table manually and then load data from a temporary table or by query.

save Spark dataframe to Hive: table not readable because "parquet not a SequenceFile"

I'd like to save data in a Spark (v 1.3.0) dataframe to a Hive table using PySpark.
The documentation states:
"spark.sql.hive.convertMetastoreParquet: When set to false, Spark SQL will use the Hive SerDe for parquet tables instead of the built in support."
Looking at the Spark tutorial, is seems that this property can be set:
from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)
sqlContext.sql("SET spark.sql.hive.convertMetastoreParquet=false")
# code to create dataframe
my_dataframe.saveAsTable("my_dataframe")
However, when I try to query the saved table in Hive it returns:
hive> select * from my_dataframe;
OK
Failed with exception java.io.IOException:java.io.IOException:
hdfs://hadoop01.woolford.io:8020/user/hive/warehouse/my_dataframe/part-r-00001.parquet
not a SequenceFile
How do I save the table so that it's immediately readable in Hive?
I've been there...
The API is kinda misleading on this one.
DataFrame.saveAsTable does not create a Hive table, but an internal Spark table source.
It also stores something into Hive metastore, but not what you intend.
This remark was made by spark-user mailing list regarding Spark 1.3.
If you wish to create a Hive table from Spark, you can use this approach:
1. Use Create Table ... via SparkSQL for Hive metastore.
2. Use DataFrame.insertInto(tableName, overwriteMode) for the actual data (Spark 1.3)
I hit this issue last week and was able to find a workaround
Here's the story:
I can see the table in Hive if I created the table without partitionBy:
spark-shell>someDF.write.mode(SaveMode.Overwrite)
.format("parquet")
.saveAsTable("TBL_HIVE_IS_HAPPY")
hive> desc TBL_HIVE_IS_HAPPY;
OK
user_id string
email string
ts string
But Hive can't understand the table schema(schema is empty...) if I do this:
spark-shell>someDF.write.mode(SaveMode.Overwrite)
.format("parquet")
.saveAsTable("TBL_HIVE_IS_NOT_HAPPY")
hive> desc TBL_HIVE_IS_NOT_HAPPY;
# col_name data_type from_deserializer
[Solution]:
spark-shell>sqlContext.sql("SET spark.sql.hive.convertMetastoreParquet=false")
spark-shell>df.write
.partitionBy("ts")
.mode(SaveMode.Overwrite)
.saveAsTable("Happy_HIVE")//Suppose this table is saved at /apps/hive/warehouse/Happy_HIVE
hive> DROP TABLE IF EXISTS Happy_HIVE;
hive> CREATE EXTERNAL TABLE Happy_HIVE (user_id string,email string,ts string)
PARTITIONED BY(day STRING)
STORED AS PARQUET
LOCATION '/apps/hive/warehouse/Happy_HIVE';
hive> MSCK REPAIR TABLE Happy_HIVE;
The problem is that the datasource table created through Dataframe API(partitionBy+saveAsTable) is not compatible with Hive.(see this link). By setting spark.sql.hive.convertMetastoreParquet to false as suggested in the doc, Spark only puts data onto HDFS,but won't create table on Hive. And then you can manually go into hive shell to create an external table with proper schema&partition definition pointing to the data location.
I've tested this in Spark 1.6.1 and it worked for me. I hope this helps!
I have done in pyspark, spark version 2.3.0 :
create empty table where we need to save/overwrite data like:
create table databaseName.NewTableName like databaseName.OldTableName;
then run below command:
df1.write.mode("overwrite").partitionBy("year","month","day").format("parquet").saveAsTable("databaseName.NewTableName");
The issue is you can't read this table with hive but you can read with spark.
metadata doesn't already exist. In other words, it will add any partitions that exist on HDFS but not in metastore, to the hive metastore.

Resources