I am trying to load a '~'-delimited CSV file into a Hive external table.
Below is the code snippet in Spark Scala:
scala> val df = spark.read
  .options(Map("delimiter" -> "~"))
  .csv(s"file://${textFile}")
scala> df.write.mode("overwrite").insertInto("schema.parquet_table")
Now I am getting the below error:
java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.hive.serde2.io.ParquetHiveRecord
The table is created as below.
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0007'
LINES TERMINATED BY '\n'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
Please suggest what is going wrong here.
The issue seems to be with the cast from text to Parquet. Try creating a sample table without the SerDe properties, just STORED AS PARQUET, and check whether the insert overwrite works.
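A minimal sketch of that approach, with a hypothetical test table name and illustrative column names (adjust the schema to match your DataFrame):

// Hypothetical table; the columns here are placeholders for your actual schema.
spark.sql("CREATE TABLE IF NOT EXISTS schema.parquet_table_test (col1 STRING, col2 STRING) STORED AS PARQUET")

val df = spark.read
  .options(Map("delimiter" -> "~"))
  .csv(s"file://${textFile}")

// insertInto matches columns by position, so the DataFrame column order
// must line up with the table definition.
df.write.mode("overwrite").insertInto("schema.parquet_table_test")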
Related
I am facing an issue creating a Hive table on top of a Parquet file. Can someone help with this? I have read many articles but am not able to load the Parquet file.
I followed this article.
As per the MS document, I reproduced the same in my environment and got the below results:
If you are creating a table in Parquet format, make sure to add ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'.
Please follow this code for creating an external Hive table in Parquet format.
CREATE EXTERNAL TABLE `demo_table`(
`first_name` string,
`Last_name` int)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
'dbfs:/FileStore/'
TBLPROPERTIES (
'totalSize'='2335',
'numRows'='240',
'rawDataSize'='2095',
'COLUMN_STATS_ACCURATE'='true',
'numFiles'='1',
'transient_lastDdlTime'='1418173653')
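If Hive still cannot read the data, it can help to check from Spark that the Parquet files match the DDL (a sketch, assuming the dbfs location above and a Hive-enabled Spark session):

// Inspect the Parquet schema at the table location and compare it with the DDL.
spark.read.parquet("dbfs:/FileStore/").printSchema()

// The external table should then be queryable directly.
spark.sql("SELECT first_name, Last_name FROM demo_table LIMIT 10").show()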
I'm trying to create a Hive table with the Parquet file format after reading the data frame using Spark SQL. The table has been created in Hive with the SequenceFile format instead of the Parquet format, but in the table path I can see that a Parquet file was created. I'm not able to query it from Hive. This is the code I have used:
df.write.option("path","/user/hive/warehouse/test/normal").format("parquet").mode("Overwrite").saveAsTable("test.people")
I'm using Spark 2.3 and Hive 2.3.3 along with the MapR distribution.
show create table people:
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'path'='maprfs:///user/hive/warehouse/test.db/people')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.SequenceFileInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat'
LazySimpleSerDe is for CSV, TSV, and custom-delimited files.
For Parquet you have to use a different SerDe, or specify STORED AS PARQUET:
STORED AS PARQUET
LOCATION ''
tblproperties ("parquet.compress"="SNAPPY");
Since you're using Spark: if the Hive table already exists, it will not touch the metadata, only update the data. Technically it is not going to drop and recreate the table; it creates the table only if the table doesn't exist.
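One way forward, sketched below under the assumption that dropping and recreating test.people is acceptable, is to remove the existing definition so Spark creates the table afresh:

// The existing table keeps its SequenceFile metadata, so drop it first;
// saveAsTable only creates a table when it does not already exist.
spark.sql("DROP TABLE IF EXISTS test.people")

df.write
  .option("path", "/user/hive/warehouse/test/normal")
  .format("parquet")
  .mode("overwrite")
  .saveAsTable("test.people")

Alternatively, create the table in Hive with STORED AS PARQUET over the same location, as in the fragment above, and insert into it from Spark.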
The data is from a Hive table, to be more precise.
The first table has the following properties:
Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat org.apache.hadoop.mapred.SequenceFileInputFormat
OutputFormat org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
This table should be transformed to Parquet and have the following properties:
Serde Library org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
The following Scala Spark code is executed:
val df = spark.sql("SELECT * FROM table")
df.write.format("parquet").mode("append").saveAsTable("table")
This still results in the unwanted properties:
Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat org.apache.hadoop.mapred.SequenceFileInputFormat
OutputFormat org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Hopefully somebody can help me.
You can not mix different file formats in the same table, nor can you change the file format of a table with data in it. (To be more precise, you can do these things, but neither Hive nor Spark will be able to read the data that is in a format that does not match the metadata.)
You should write the data to a new table, make sure that it matches your expectations, then rename or remove the old table and finally rename the new table to the old name. For example:
CREATE TABLE new_table STORED AS PARQUET AS SELECT * FROM orig_table;
ALTER TABLE orig_table RENAME TO orig_table_backup;
ALTER TABLE new_table RENAME TO orig_table;
You can execute these SQL statements in a Hive session directly or from Spark using spark.sql(...) statements (one by one).
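For example, from Spark (a sketch using the table names above):

// spark.sql runs one statement per call, so execute them one by one.
spark.sql("CREATE TABLE new_table STORED AS PARQUET AS SELECT * FROM orig_table")
spark.sql("ALTER TABLE orig_table RENAME TO orig_table_backup")
spark.sql("ALTER TABLE new_table RENAME TO orig_table")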
I have a Parquet table, and the table contains a column with newline data, so when a Hive query is fired on this table the newline data is treated as new records. I can overcome this in Hive by setting the parameter "set hive.query.result.fileformat=SequenceFile;". Now I am migrating this parameter and the MR query to run in Spark SQL. I also want to run some other queries, like a drop table statement, before the actual query.
My code is like below:
spark.sql(set hive.query.result.fileformat=SequenceFile;drop table output_table; create table output_table stored as orc as select * from source_table;)
With this query I am getting a parser error at the semicolon (;) position. How can I properly execute the above code in Spark SQL?
You shouldn't have semicolons in the code. Remove the semicolons, add parentheses, and include your set parameter in the Spark session config. Then it should work.
For example:
from pyspark.sql import SparkSession

spark = (SparkSession
    .builder
    .appName('hive_validation_test')
    .enableHiveSupport()
    .config("hive.query.result.fileformat", "SequenceFile")
    .getOrCreate())

spark.sql('drop table output_table')
spark.sql('create table output_table stored as orc as select * from source_table')
I am building a pipeline through Azure Data Factory. The input dataset is a CSV file with a column delimiter, and the output dataset is also a CSV file with a column delimiter. The pipeline is designed with an HDInsight activity that runs a Hive query from a file with the extension .hql. The Hive query is as follows:
set hive.exec.dynamic.partition.mode=nonstrict;
DROP TABLE IF EXISTS Table1;
CREATE EXTERNAL TABLE Table1 (
Number string,
Name string,
Address string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/your/folder/location';
SELECT * FROM Table1;
Below is the file format
Number,Name,Address
1,xyz,No 152,Chennai
2,abc,7th street,Chennai
3,wer,Chennai,Tamil Nadu
How do I parse the column header with the data in the output dataset?
As per my understanding, your question relates to the CSV file: you are putting a CSV file at the table location and it contains a header row. If my understanding is correct, please try the below property in your table DDL. I hope this helps.
tblproperties ("skip.header.line.count"="1");
Thanks,
Manu