Read Hive table and transform it to Parquet Table - apache-spark

The data is from a Hive table. To be more precise, the original table has the following properties:
Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat org.apache.hadoop.mapred.SequenceFileInputFormat
OutputFormat org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
This table should be converted to Parquet so that it has the properties:
Serde Library org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
The following Scala Spark code is executed:
val df = spark.sql("SELECT * FROM table")
df.write.format("parquet").mode("append").saveAsTable("table")
This still results in the unwanted properties:
Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat org.apache.hadoop.mapred.SequenceFileInputFormat
OutputFormat org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Hopefully somebody can help me.

You cannot mix different file formats in the same table, nor can you change the file format of a table that already contains data. (To be more precise, you can do these things, but neither Hive nor Spark will be able to read data whose format does not match the metadata.)
You should write the data to a new table, make sure that it matches your expectations, then rename or remove the old table and finally rename the new table to the old name. For example:
CREATE TABLE new_table STORED AS PARQUET AS SELECT * FROM orig_table;
ALTER TABLE orig_table RENAME TO orig_table_backup;
ALTER TABLE new_table RENAME TO orig_table;
You can execute these SQL statements in a Hive session directly or from Spark using spark.sql(...) statements (one by one).
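For reference, a minimal Spark Scala sketch of that sequence, run one statement at a time (new_table and orig_table are the placeholder names from the statements above):
// Rewrite the data into a new Parquet-backed table.
spark.sql("CREATE TABLE new_table STORED AS PARQUET AS SELECT * FROM orig_table")
// Sanity-check the new table before touching the original.
spark.sql("DESCRIBE FORMATTED new_table").show(false)
spark.sql("SELECT COUNT(*) FROM new_table").show()
// Swap the tables once the new one looks right.
spark.sql("ALTER TABLE orig_table RENAME TO orig_table_backup")
spark.sql("ALTER TABLE new_table RENAME TO orig_table")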

Related

How can I create external hive table with PARQUET FORMAT

I am facing an issue creating a Hive table on top of a Parquet file. Can someone help with this? I have read many articles but am still not able to load the Parquet file.
I followed this article
As per the MS documentation, I reproduced the same in my environment and got the below results:
If you are creating a table in Parquet format, make sure to add ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
Please follow this code for creating an external Hive table with Parquet format.
CREATE EXTERNAL TABLE `demo_table`(
`first_name` string,
`Last_name` int)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
'dbfs:/FileStore/'
TBLPROPERTIES (
'totalSize'='2335',
'numRows'='240',
'rawDataSize'='2095',
'COLUMN_STATS_ACCURATE'='true',
'numFiles'='1',
'transient_lastDdlTime'='1418173653')

Unable to load Hive table with ROW FORMAT DELIMITED from spark

I am trying to load a '~'-delimited CSV file into a Hive external table.
Below is the code snippet in Spark Scala:
scala> val df = spark.read
.options(Map("delimiter" -> "~"))
.csv(s"file://${textFile}")
scala> df.write.mode("overwrite").insertInto(s"schema.parquet_table")
Now I am getting the below error.
java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.hive.serde2.io.ParquetHiveRecord
The table is created as below.
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0007'
LINES TERMINATED BY '\n'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
Please suggest what is going wrong here.
The issue seems to be the cast from text to Parquet: the table declares a delimited-text SerDe (ROW FORMAT DELIMITED, i.e. LazySimpleSerDe) but Parquet input/output formats, so the Parquet output format receives Text records it cannot handle. Try creating a sample table, say mytable, without the SerDe properties, just STORED AS PARQUET, and check whether the insert overwrite works.
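A minimal sketch of that test in Spark Scala, assuming a simple two-column schema (the column names are placeholders, not the asker's real schema):
// Create a plain Parquet-backed table with no delimited-text SerDe.
spark.sql("""
  CREATE TABLE IF NOT EXISTS mytable (col1 STRING, col2 STRING)
  STORED AS PARQUET
""")
// Insert the '~'-delimited CSV read from above; column count and order must match.
df.write.mode("overwrite").insertInto("mytable")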

Create hive table by using spark sql

I'm trying to create a Hive table with Parquet file format after reading the data frame using spark-sql. The table has been created in Hive with SequenceFile format instead of Parquet file format, but in the table path I can see that a Parquet file was created. I'm not able to query this table from Hive. This is the code I have used:
df.write.option("path","/user/hive/warehouse/test/normal").format("parquet").mode("Overwrite").saveAsTable("test.people")
I'm using Spark 2.3 and Hive 2.3.3 along with the MapR distribution.
show create table people:
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'path'='maprfs:///user/hive/warehouse/test.db/people')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.SequenceFileInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat'
LazySimpleSerDe is for CSV, TSV, and custom-delimited files, which is exactly what the SHOW CREATE TABLE output above shows for your table.
For Parquet you have to use a different SerDe, or specify STORED AS PARQUET:
STORED AS PARQUET
LOCATION ''
TBLPROPERTIES ("parquet.compress"="SNAPPY");
Since you're using Spark: if the Hive table already exists, Spark will not touch its metadata, it only updates the data. Technically it is not going to drop and recreate the table; it creates the table only if it doesn't exist.
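One way to act on that, sketched in Spark Scala with the table and path names from the question: drop the pre-existing SequenceFile-defined table first so that saveAsTable creates fresh Parquet metadata (note that dropping a managed table also removes its data, so back it up if needed).
// Remove the stale definition that was registered with the SequenceFile SerDe.
spark.sql("DROP TABLE IF EXISTS test.people")
// Let Spark recreate the table and its metadata from the dataframe.
df.write
  .option("path", "/user/hive/warehouse/test/normal")
  .format("parquet")
  .mode("overwrite")
  .saveAsTable("test.people")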

Spark Sql - Insert Into External Hive Table Error

I am trying to insert data into an external Hive table through Spark SQL.
My Hive table is bucketed by a column.
The query to create the external Hive table is this:
create external table tab1 ( col1 type,col2 type,col3 type) clustered by (col1,col2) sorted by (col1) into 8 buckets stored as parquet
Now I tried to store data from a Parquet file (stored in HDFS) into the table.
This is my code:
SparkSession session = SparkSession.builder().appName("ParquetReadWrite").
config("hive.exec.dynamic.partition", "true").
config("hive.exec.dynamic.partition.mode", "nonstrict").
config("hive.execution.engine","tez").
config("hive.exec.max.dynamic.partitions","400").
config("hive.exec.max.dynamic.partitions.pernode","400").
config("hive.enforce.bucketing","true").
config("optimize.sort.dynamic.partitionining","true").
config("hive.vectorized.execution.enabled","true").
config("hive.enforce.sorting","true").
enableHiveSupport()
.master(args[0]).getOrCreate();
String insertSql="insert into tab1 select * from"+"'"+parquetInput+"'";
session.sql(insertSql);
When I run the code, it's throwing the below error:
mismatched input ''hdfs://url:port/user/clsadmin/somedata.parquet'' expecting (line 1, pos 50)
== SQL ==
insert into UK_DISTRICT_MONTH_DATA select * from 'hdfs://url:port/user/clsadmin/somedata.parquet'
--------------------------------------------------^^^
at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:239)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:115)
at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48)
Also, what is the difference between using Tez and Spark as the Hive execution engine?
Have you tried
LOAD DATA LOCAL INPATH '/path/to/data'
OVERWRITE INTO TABLE tablename;
When creating the external table in Hive, the HDFS location needs to be specified:
create external table tab1 ( col1 type,col2 type,col3 type)
clustered by (col1,col2) sorted by (col1) into 8 buckets
stored as parquet
LOCATION 'hdfs://url:port/user/clsadmin/tab1'
There is no necessity for Hive itself to populate the data; the same application or another application can ingest the data into that location, and Hive will access the data through the schema defined on top of the location.
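As an illustrative Spark Scala sketch of that point (the paths reuse the ones above, and files written this way will not honour the table's bucketing, so treat it as a simplification):
// Any application can drop Parquet files into the external table's location;
// Hive then reads them through the schema defined on that location.
spark.read.parquet("hdfs://url:port/user/clsadmin/somedata.parquet")
  .write.mode("append").parquet("hdfs://url:port/user/clsadmin/tab1")
// The external table now exposes the newly written files.
spark.sql("select * from tab1 limit 10").show()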
== SQL ==
insert into UK_DISTRICT_MONTH_DATA select * from 'hdfs://url:port/user/clsadmin/somedata.parquet'
--------------------------------------------------^^^
parquetInput is a Parquet HDFS file path and not a Hive table name, hence the error.
There are two ways you can solve this issue:
Define an external table over "parquetInput" and use that table name in the INSERT (a Spark-side variant is sketched below).
Use LOAD DATA INPATH 'hdfs://url:port/user/clsadmin/somedata.parquet' INTO TABLE tab1
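A rough Spark Scala sketch of the first option, using a temporary view as a lightweight stand-in for a separately defined table (the view name is made up for illustration):
// Read the Parquet file into a DataFrame and give it a name the SQL parser accepts.
val parquetDf = spark.read.parquet("hdfs://url:port/user/clsadmin/somedata.parquet")
parquetDf.createOrReplaceTempView("parquet_input_view")
// Insert by selecting from the named view instead of from a raw file path.
spark.sql("insert into tab1 select * from parquet_input_view")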

save dataframe as external hive table

I have used one way to save a dataframe as an external table using the Parquet file format, but is there some other way to save a dataframe directly as an external table in Hive, like we have saveAsTable for managed tables?
You can do it this way:
df.write.format("ORC").options(Map("path"-> "yourpath")) saveAsTable "anubhav"
In PySpark, an external table can be created as below:
df.write.option('path','<External Table Path>').saveAsTable('<Table Name>')
For an external table, don't use saveAsTable. Instead, save the data at the location of the external table specified by path, then add the partition so that it is registered with the Hive metastore. This will allow you to query by partition from Hive later.
// hc is HiveContext, df is DataFrame.
df.write.mode(SaveMode.Overwrite).parquet(path)
val sql =
s"""
|alter table $targetTable
|add if not exists partition
|(year=$year,month=$month)
|location "$path"
""".stripMargin
hc.sql(sql)
You can also save the dataframe with a manual CREATE TABLE:
dataframe.registerTempTable("temp_table");
hiveSqlContext.sql("create external table if not exists table_name as select * from temp_table");
The link below has a good explanation of CREATE TABLE: https://docs.databricks.com/spark/latest/spark-sql/language-manual/create-table.html
