How can we convert an external table to managed table in SPARK 2.2.0? - apache-spark

The below command was successfully converting external tables to managed tables in Spark 2.0.0:
ALTER TABLE {table_name} SET TBLPROPERTIES(EXTERNAL=FLASE);
However the above command is failing in Spark 2.2.0 with the below error:
Error in query: Cannot set or change the preserved property key:
'EXTERNAL';

As #AndyBrown pointed our in a comment you have the option of dropping to the console and invoking the Hive statement there. In Scala this worked for me:
import sys.process._
val exitCode = Seq("hive", "-e", "ALTER TABLE {table_name} SET TBLPROPERTIES(\"EXTERNAL\"=\"FALSE\")").!
I faced this problem using Spark 2.1.1 where #Joha's answer does not work because spark.sessionState is not accessible due to being declared lazy.

In Spark 2.2.0 you can do the following:
import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.catalyst.catalog.CatalogTable
import org.apache.spark.sql.catalyst.catalog.CatalogTableType
val identifier = TableIdentifier("table", Some("database"))
val oldTable = spark.sessionState.catalog.getTableMetadata(identifier)
val newTableType = CatalogTableType.MANAGED
val alteredTable = oldTable.copy(tableType = newTableType)
spark.sessionState.catalog.alterTable(alteredTable)

The issue is case-sensitivity on spark-2.1 and above.
Please try setting TBLPROPERTIES in lower case -
ALTER TABLE <TABLE NAME> SET TBLPROPERTIES('external'='false')

I had the same issue while using a hive external table. I solved the problem by directly setting the propery external to false in hive metastore using a hive metastore client
Table table = hiveMetaStoreClient.getTable("db", "table");
table.putToParameters("EXTERNAL","FALSE");
hiveMetaStoreClient.alter_table("db", "table", table,true);

I tried the above option from scala databricks notebook, and the
external table was converted to MANAGED table and the good part is
that the desc formatted option from spark on the new table is still
showing the location to be on my ADLS. This was one limitation that
spark was having, that we cannot specify the location for a managed
table.
As of now i am able to do a truncate table for this. hopefully there
was a more direct option for creating a managed table with location
specified from spark sql.

Related

Spark : Can't set table properties start with "spark.sql" to hive external table while creating

Env : linux (spark-submit xxx.py)
Target database : Hive
We used to use beeline to execute hql, but now we try to run the hql through pyspark and faced some issue when tried to set table properties while creating the table.
SQL
CREATE EXTERNAL TABLE example.a(
column_a string)
TBLPROPERTIES (
'discover.partitions'='true',
'spark.sql.sources.schema.numPartCols'='1',
'spark.sql.sources.schema.numParts'='1',
'spark.sql.sources.schema.part.0'='{"type":"struct","fields":[{"name":"column_a","type":"string","nullable":true,"metadata":{}}]}',
'spark.sql.sources.schema.partCol.0'='received_utc_date_partition');
Error message
Hive - ERROR - Cannot persist
example.a into Hive metastore as table property
keys may not start with 'spark.sql.': [spark.sql.sources.schema.partCol.0, spark.sql.sources.schema.numParts,
spark.sql.sources.schema.numPartCols, spark.sql.sources.schema.part.0];
In line 130-147 in spark source code it seems that it prevent all table properties that start with "spark.sql"
Not sure if I did it wrong or there's another way to set up the table properties for hive table.
Any kinds of suggestion is appreciated.

How to Read Hive Table in Spark without header

I am trying to read one hive table in pyspark but I am getting header as well that I do not want.
File.csv
Id,Name
1,A
2,B
3,C
4,D
Hive Table
I build hive table with tblproperties("skip.header.line.count"="1") and in Hive I am getting data correctly so there is no issue with Hive.
I am facing issue while I am reading this table in pyspark.
There is Spark-11374 jira reported for this issue and closed as won't fix.
Possible ways to do this are:
1.You can directly read the HDFS file:
spark.read.option("header","true").option("delimiter",",").csv("<hdfs_path>").show()
2.using hive query:
spark.sql("select * from <table_name> where <col_name1> != 'id'").show()

PySpark cannot insertInto Hive table because "Can only write data to relations with a single path"

I have a Hive Orc table with a definition similar to the following definition
CREATE EXTERNAL TABLE `example.example_table`(
...
)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
WITH SERDEPROPERTIES (
'path'='s3a://path/to/table')
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
's3a://path/to/table'
TBLPROPERTIES (
...
)
I am attempting to use PySpark to append a dataframe to this table using "df.write.insertInto("example.example_table")". When running this, I get the following error:
org.apache.spark.sql.AnalysisException: Can only write data to relations with a single path.;
at org.apache.spark.sql.execution.datasources.DataSourceAnalysis$$anonfun$apply$1.applyOrElse(DataSourceStrategy.scala:188)
at org.apache.spark.sql.execution.datasources.DataSourceAnalysis$$anonfun$apply$1.applyOrElse(DataSourceStrategy.scala:134)
...
When looking at the underlying Scala code, the condition that throws this error is checking to see if the table location has multiple "rootPaths". Obviously, my table is defined with a single location. What else could cause this?
It is that path that you are defining that causes the error. I just ran into this same problem myself. Hive generates a location path based on the hive.metastore.warehouse.dir property, so you have that default location plus the path you specified, which is causing that linked code to fail.
If you want to pick a specific path other than the default, then try using LOCATION.
Try running a describe extended example.example_table query to see more detailed information on the table. One of the output rows will be a Detailed Table Information which contains a bunch of useful information:
Table(
tableName:
dbName:
owner:
createTime:1548335003
lastAccessTime:0
retention:0
sd:StorageDescriptor(cols:
location:[*path_to_table*]
inputFormat:org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
outputFormat:org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
compressed:false
numBuckets:-1
serdeInfo:SerDeInfo(
name:null
serializationLib:org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
parameters:{
serialization.format=1
path=[*path_to_table*]
}
)
bucketCols:[]
sortCols:[]
parameters:{}
skewedInfo:SkewedInfo(skewedColNames:[]
skewedColValues:[]
skewedColValueLocationMaps:{})
storedAsSubDirectories:false
)
partitionKeys:[]
parameters:{transient_lastDdlTime=1548335003}
viewOriginalText:null
viewExpandedText:null
tableType:MANAGED_TABLE
rewriteEnabled:false
)
We had the same problem in a project when migrating from Spark 1.x and HDFS to Spark 3.x and S3. We solve this issue setting the next Spark property to false:
spark.sql.hive.convertMetastoreParquet
You can just run
spark.sql("SET spark.sql.hive.convertMetastoreParquet=false")
Or maybe
spark.conf("spark.sql.hive.convertMetastoreParquet", False)
Being spark the SparkSession object. The explanaition of this is currently in Spark documentation.

Get data from subfolders of an unpartitioned hive table into a dataframe in spark

There is an external table in hive pointing to s3 location that is not partitioned. The table points to a folder in s3 but the data is in multiple subfolders inside that folder.
This table can be queried even though the table is not partitioned by setting few properties in hive like below,
set hive.input.dir.recursive=true;
set hive.mapred.supports.subdirectories=true;
set hive.supports.subdirectories=true;
set mapred.input.dir.recursive=true;
However, when the same table is used in spark to load the data into a dataframe using a sql statement like df = sqlContext.sql("select * from table_name"), the action fails saying 'The subfolders in the external s3 location is not a file'.
I tried setting above hive properties in spark using sc.hadoopConfiguration.set("mapred.input.dir.recursive","true") method, but it did not help. Looks like this would help only for sc.textFile kind of loading.
This can be achieved by setting the following property in spark,
sqlContext.setConf("mapreduce.input.fileinputformat.input.dir.recursive","true")
Note here that the property is set usign sqlContext instead of sparkContext.
And I tested this in spark 1.6.2

save Spark dataframe to Hive: table not readable because "parquet not a SequenceFile"

I'd like to save data in a Spark (v 1.3.0) dataframe to a Hive table using PySpark.
The documentation states:
"spark.sql.hive.convertMetastoreParquet: When set to false, Spark SQL will use the Hive SerDe for parquet tables instead of the built in support."
Looking at the Spark tutorial, is seems that this property can be set:
from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)
sqlContext.sql("SET spark.sql.hive.convertMetastoreParquet=false")
# code to create dataframe
my_dataframe.saveAsTable("my_dataframe")
However, when I try to query the saved table in Hive it returns:
hive> select * from my_dataframe;
OK
Failed with exception java.io.IOException:java.io.IOException:
hdfs://hadoop01.woolford.io:8020/user/hive/warehouse/my_dataframe/part-r-00001.parquet
not a SequenceFile
How do I save the table so that it's immediately readable in Hive?
I've been there...
The API is kinda misleading on this one.
DataFrame.saveAsTable does not create a Hive table, but an internal Spark table source.
It also stores something into Hive metastore, but not what you intend.
This remark was made by spark-user mailing list regarding Spark 1.3.
If you wish to create a Hive table from Spark, you can use this approach:
1. Use Create Table ... via SparkSQL for Hive metastore.
2. Use DataFrame.insertInto(tableName, overwriteMode) for the actual data (Spark 1.3)
I hit this issue last week and was able to find a workaround
Here's the story:
I can see the table in Hive if I created the table without partitionBy:
spark-shell>someDF.write.mode(SaveMode.Overwrite)
.format("parquet")
.saveAsTable("TBL_HIVE_IS_HAPPY")
hive> desc TBL_HIVE_IS_HAPPY;
OK
user_id string
email string
ts string
But Hive can't understand the table schema(schema is empty...) if I do this:
spark-shell>someDF.write.mode(SaveMode.Overwrite)
.format("parquet")
.saveAsTable("TBL_HIVE_IS_NOT_HAPPY")
hive> desc TBL_HIVE_IS_NOT_HAPPY;
# col_name data_type from_deserializer
[Solution]:
spark-shell>sqlContext.sql("SET spark.sql.hive.convertMetastoreParquet=false")
spark-shell>df.write
.partitionBy("ts")
.mode(SaveMode.Overwrite)
.saveAsTable("Happy_HIVE")//Suppose this table is saved at /apps/hive/warehouse/Happy_HIVE
hive> DROP TABLE IF EXISTS Happy_HIVE;
hive> CREATE EXTERNAL TABLE Happy_HIVE (user_id string,email string,ts string)
PARTITIONED BY(day STRING)
STORED AS PARQUET
LOCATION '/apps/hive/warehouse/Happy_HIVE';
hive> MSCK REPAIR TABLE Happy_HIVE;
The problem is that the datasource table created through Dataframe API(partitionBy+saveAsTable) is not compatible with Hive.(see this link). By setting spark.sql.hive.convertMetastoreParquet to false as suggested in the doc, Spark only puts data onto HDFS,but won't create table on Hive. And then you can manually go into hive shell to create an external table with proper schema&partition definition pointing to the data location.
I've tested this in Spark 1.6.1 and it worked for me. I hope this helps!
I have done in pyspark, spark version 2.3.0 :
create empty table where we need to save/overwrite data like:
create table databaseName.NewTableName like databaseName.OldTableName;
then run below command:
df1.write.mode("overwrite").partitionBy("year","month","day").format("parquet").saveAsTable("databaseName.NewTableName");
The issue is you can't read this table with hive but you can read with spark.
metadata doesn't already exist. In other words, it will add any partitions that exist on HDFS but not in metastore, to the hive metastore.

Resources