How can I create an external Hive table in PARQUET format? - Azure

I am facing an issue creating a Hive table on top of a parquet file. Can someone help with this? I have read many articles but am still not able to load the parquet file.
I followed this article

As per the MS documentation, I reproduced the same in my environment and got the results below.
If you are creating a table in parquet format, make sure to add ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'.
Please follow this code for creating an external Hive table in Parquet format.
CREATE EXTERNAL TABLE `demo_table`(
  `first_name` string,
  `last_name` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  'dbfs:/FileStore/'
TBLPROPERTIES (
  'totalSize'='2335',
  'numRows'='240',
  'rawDataSize'='2095',
  'COLUMN_STATS_ACCURATE'='true',
  'numFiles'='1',
  'transient_lastDdlTime'='1418173653')
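For reference, here is a minimal sketch of the same idea driven from a Databricks notebook (assumed, since the location is on dbfs:). STORED AS PARQUET is shorthand for the ParquetHiveSerDe plus the matching parquet input/output formats, and the final SELECT is just a sanity check that Hive can read the files:

// Sketch only: assumes a Databricks/Spark session and that the parquet files
// already exist under dbfs:/FileStore/ (path taken from the DDL above).
spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS demo_table (
    first_name STRING,
    last_name  STRING)
  STORED AS PARQUET
  LOCATION 'dbfs:/FileStore/'
""")

// Verify that the table can actually read the parquet data.
spark.sql("SELECT * FROM demo_table LIMIT 10").show()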

Related

BigQuery is not able to convert String to Timestamp

I have a BigQuery table where one of the columns (publishTs) is a timestamp. I am trying to upload a parquet file into the same table using the GCP UI BQ upload option, with the same column name (publishTs) but with a String datatype (e.g. "2021-08-24T16:06:21.122Z"), and BQ complains that it cannot convert the String to a Timestamp.
I am generating the parquet file using Apache Spark. I tried searching on the internet but could not find the answer.
Try generating this column as INT64 - link
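If the parquet file is produced by Spark, one hedged way to follow that advice is to parse the string into a proper timestamp column before writing and ask Spark to emit it as INT64 timestamp-micros instead of its INT96 default; df and the output path below are placeholders:

import org.apache.spark.sql.functions.{col, to_timestamp}

// Write timestamps as INT64 (timestamp-micros) rather than INT96.
spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")

// Parse the ISO-8601 string (e.g. "2021-08-24T16:06:21.122Z") into a TimestampType column.
val fixed = df.withColumn("publishTs", to_timestamp(col("publishTs")))

fixed.write.mode("overwrite").parquet("gs://my-bucket/publish_ts_fixed/") // hypothetical path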

Unable to load Hive table with ROW FORMAT DELIMITED from Spark

I am trying to load a CSV file delimited with '~' into a Hive external table.
Below is the code snippet in Spark Scala:
scala> val df = spark.read
         .options(Map("delimiter" -> "~"))
         .csv(s"file://${textFile}")
scala> df.write.mode("overwrite").insertInto(s"schema.parquet_table")
Now I am getting the below error:
java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.hive.serde2.io.ParquetHiveRecord
The table is created as below.
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0007'
LINES TERMINATED BY '\n'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
Please suggest what is going wrong here.
The issue seems to be with the cast from text to parquet. Try creating a sample table like mytable without the delimited row format, just STORED AS PARQUET, and check whether the insert overwrite works.
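A hedged sketch of that suggestion, reusing the names from the question (textFile and the table name come from the snippet above; the column list and location are placeholders to replace with the real CSV schema):

// Recreate the target table with STORED AS PARQUET and no delimited row format.
spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS schema.parquet_table (
    col1 STRING,               -- hypothetical columns: use your real CSV schema
    col2 STRING)
  STORED AS PARQUET
  LOCATION '/path/to/table'    -- hypothetical location
""")

val df = spark.read
  .options(Map("delimiter" -> "~"))
  .csv(s"file://${textFile}")

df.write.mode("overwrite").insertInto("schema.parquet_table")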

How to find out whether Spark table is parquet or delta?

I have a database with some tables in parquet format and others in delta format. If I want to append data to a table, I need to specify the format if the table is in delta format (the default is parquet).
How can I determine a table's format?
I tried show tblproperties <tbl_name> but this gives an empty result.
According to the Delta Lake API docs you can check:
DeltaTable.isDeltaTable(spark, "path")
Please see the note in the documentation
This uses the active SparkSession in the current thread to read the table data. Hence, this throws error if active SparkSession has not been set, that is, SparkSession.getActiveSession() is empty.
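Putting the two checks together, a small sketch (it assumes the delta-core library is on the classpath and an active SparkSession named spark; the path and table name are placeholders):

import io.delta.tables.DeltaTable

// Path-based check from the Delta Lake API.
val isDelta = DeltaTable.isDeltaTable(spark, "dbfs:/path/to/table") // hypothetical path
println(s"Delta table? $isDelta")

// For a catalog table, the provider recorded in the metastore can also be
// inspected (the exact row label can vary between Spark versions).
val detail = spark.sql("DESCRIBE EXTENDED my_db.my_table") // hypothetical table name
detail.filter("col_name = 'Provider'").show(false)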

Create Hive table by using Spark SQL

I'm trying to create a Hive table in parquet file format after reading a data frame using Spark SQL. The table was created in Hive with SequenceFile format instead of parquet format, but in the table path I can see that a parquet file was created, and I'm not able to query it from Hive. This is the code I have used:
df.write.option("path","/user/hive/warehouse/test/normal").format("parquet").mode("Overwrite").saveAsTable("test.people")
I'm using Spark 2.3 and Hive 2.3.3 along with the MapR distribution.
show create table people returns:
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'path'='maprfs:///user/hive/warehouse/test.db/people')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.SequenceFileInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat'
LazySimpleSerDe is for CSV, TSV, and Custom-Delimited Files
For parquet you have to use a different SerDe, or specify STORED AS PARQUET:
STORED AS PARQUET
LOCATION ''
TBLPROPERTIES ("parquet.compress"="SNAPPY");
Since you're using Spark: if the Hive table already exists, it will not touch the metadata, only update the data. Technically it is not going to drop and recreate the table; it creates the table only if the table doesn't exist.
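So one hedged workaround is to drop (or rename) the existing table first and let saveAsTable recreate it; this sketch reuses the path and table name from the question and assumes the old definition can be discarded:

// Back up anything you need first: this removes the existing table metadata.
spark.sql("DROP TABLE IF EXISTS test.people")

// saveAsTable now creates the table from scratch as a parquet source table,
// instead of leaving the old SequenceFile metadata in place.
df.write
  .option("path", "/user/hive/warehouse/test/normal")
  .format("parquet")
  .mode("overwrite")
  .saveAsTable("test.people")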

Read Hive table and transform it to Parquet Table

The data is from a Hive table. To be more precise, the source table has the properties:
Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat org.apache.hadoop.mapred.SequenceFileInputFormat
OutputFormat org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
This table should be transformed to parquet and have the properties:
Serde Library org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
The following Scala Spark code is executed:
val df = spark.sql("SELECT * FROM table")
df.write.format("parquet").mode("append").saveAsTable("table")
This still results in the unwanted properties:
Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat org.apache.hadoop.mapred.SequenceFileInputFormat
OutputFormat org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Hopefully somebody can help me.
You cannot mix different file formats in the same table, nor can you change the file format of a table that already has data in it. (To be more precise, you can do these things, but neither Hive nor Spark will be able to read data whose format does not match the metadata.)
You should write the data to a new table, make sure that it matches your expectations, then rename or remove the old table and finally rename the new table to the old name. For example:
CREATE TABLE new_table STORED AS PARQUET AS SELECT * FROM orig_table;
ALTER TABLE orig_table RENAME TO orig_table_backup;
ALTER TABLE new_table RENAME TO orig_table;
You can execute these SQL statements in a Hive session directly or from Spark using spark.sql(...) statements (one by one).
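For completeness, the same three statements driven from Spark, as a sketch (it assumes a Hive-enabled SparkSession; the table names are the ones from the example above):

// Run the migration statements one by one, in order.
Seq(
  "CREATE TABLE new_table STORED AS PARQUET AS SELECT * FROM orig_table",
  "ALTER TABLE orig_table RENAME TO orig_table_backup",
  "ALTER TABLE new_table RENAME TO orig_table"
).foreach(stmt => spark.sql(stmt))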
