What is valid syntax for spark hive create table with USING and PARTITIONED BY clauses? - apache-spark

I'm trying to create a Hive table in ORC format with the following command passed to SparkSession.sql(...):
CREATE TABLE `db`.`table`(
_id string,
...
)
PARTITIONED BY (load_date string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
USING ORC
I'm getting an exception like mismatched input 'USING' expecting <EOF>.
Reordering the clauses above doesn't help.
The official documentation omits this part, or at least I'm unable to find it.
What is the correct way to do it?

There is no USING clause in Hive DDL statements.
You need to use STORED AS ORC, or specify just the input and output formats:
CREATE TABLE `db`.`table`(
_id string,
...
)
PARTITIONED BY (load_date string)
STORED AS ORC
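For completeness, Spark's own datasource DDL does accept USING, but it cannot be combined with the Hive-style ROW FORMAT / STORED AS clauses, which is what triggers the parse error above. A minimal sketch of the datasource form, assuming Spark 2.2+ with Hive support enabled (table and column names taken from the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Datasource syntax: USING replaces ROW FORMAT / STORED AS, and PARTITIONED BY
# lists column names only; the partition column must appear in the schema.
spark.sql("""
    CREATE TABLE `db`.`table` (
        _id STRING,
        load_date STRING
    )
    USING ORC
    PARTITIONED BY (load_date)
""")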

Related

Error while querying parquet table in presto

A Hive Parquet table is created over a Spark DataFrame saved in Parquet format.
I am able to query the Parquet data in my Parquet table.
But when querying it in Presto, it shows an error: "Query 20200817_061959_00150_nztin failed: Can not read value at 0 in block 0 in file"
I am not using any decimal fields. Most of my fields are of string type, and some of them are date and timestamp type.
Can someone help?

Correct syntax for creating a parquet table with CTAS at a specified location

I'm trying to create a table stored as parquet with spark.sql with a pre-specified external location, but I appear to be missing something, or something is omitted from the documentation.
My reading of the documentation suggests the following should work:
create table if not exists schema.test
using PARQUET
partitioned by (year, month)
location 's3a://path/to/location'
as select '1' test1, true test2, '2017' year, '01' month
But this returns the error:
mismatched input 'location' expecting (line 4, pos 0)
The documentation suggests that EXTERNAL is automatically implied by setting LOCATION, but adding CREATE EXTERNAL TABLE anyway gives the same error.
I was able to successfully create an empty table with similar syntax:
create external table if not exists schema.test
(test1 string, test2 boolean)
partitioned by (year string, month string)
stored as PARQUET
location 's3a://path/to/location'
My alternative is to save the select results to a parquet at /path/to/location directly first, then create a table pointing to this, but this seems roundabout when the original syntax seems valid and designed for this purpose.
What's wrong with my approach?
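For reference, a minimal PySpark sketch of the roundabout alternative mentioned above, assuming Spark with Hive support: write the SELECT results as partitioned Parquet first, then create the external table over that location (the paths and names are the ones from the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Step 1: write the select results as partitioned parquet to the target location.
spark.sql("select '1' test1, true test2, '2017' year, '01' month") \
    .write.mode("overwrite").partitionBy("year", "month") \
    .parquet("s3a://path/to/location")

# Step 2: point an external table at that location (the DDL that already works).
spark.sql("""
    create external table if not exists schema.test
    (test1 string, test2 boolean)
    partitioned by (year string, month string)
    stored as PARQUET
    location 's3a://path/to/location'
""")

# Step 3: register the partitions that were written in step 1.
spark.sql("msck repair table schema.test")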

Writing Hive table from Spark specifying CSV as the format

I'm having an issue writing a Hive table from Spark. The following code works just fine; I can write the table (which defaults to the Parquet format) and read it back in Hive:
df.write.mode('overwrite').saveAsTable("db.table")
hive> describe table;
OK
val string
Time taken: 0.021 seconds, Fetched: 1 row(s)
However, if I specify the format should be csv:
df.write.mode('overwrite').format('csv').saveAsTable("db.table")
then I can save the table, but Hive doesn't recognize the schema:
hive> describe table;
OK
col array<string> from deserializer
Time taken: 0.02 seconds, Fetched: 1 row(s)
It's also worth noting that I can create a Hive table manually and then insertInto it:
spark.sql("create table db.table(val string)")
df.select('val').write.mode("overwrite").insertInto("db.table")
Doing so, Hive seems to recognize the schema. But that's clunky, and I can't figure out a way to automate the schema string anyway.
That is because the Hive SerDe does not support CSV by default.
If you insist on using the CSV format, create the table as below:
CREATE TABLE my_table(a string, b string, ...)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = "\t",
"quoteChar" = "'",
"escapeChar" = "\\"
)
STORED AS TEXTFILE;
Then insert the data through df.write.insertInto.
For more info:
https://cwiki.apache.org/confluence/display/Hive/CSV+Serde
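Putting the two steps above together from PySpark, a minimal sketch (the DataFrame and the db.table name are illustrative; without SERDEPROPERTIES the OpenCSVSerde defaults to a comma separator and double-quote quoting, which happens to match Spark's CSV writer defaults):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
df = spark.createDataFrame([("a",), ("b",)], ["val"])   # illustrative DataFrame

# Create the table with the OpenCSVSerde so Hive can read the rows back with a schema.
spark.sql("""
    CREATE TABLE IF NOT EXISTS db.table (val string)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
    STORED AS TEXTFILE
""")

# Insert through insertInto so the rows land in the existing Hive table definition.
df.select("val").write.mode("overwrite").insertInto("db.table")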
You are creating a table in text format and trying to insert CSV data into it, which may run into problems. So, as suggested in the answer by Zhang Tong, create the Hive table using the Hive OpenCSVSerde.
After that, if you are more comfortable with Hive query language than dataframes, you can try this.
df.registerTempTable("temp")
spark.sql("insert overwrite db.table select * from temp")
This happens because the Hive SerDe for CSV is different from what Spark uses. Hive by default uses the TEXTFILE format, and the delimiter has to be specified when creating the table.
One option is to use the insertInto API instead of saveAsTable when writing from Spark. With insertInto, Spark writes the contents of the DataFrame to the specified table. It requires the schema of the DataFrame to be the same as the schema of the table, and the position of the columns matters here because column names are ignored.
Seq((5, 6)).toDF("a", "b").write.insertInto("t1")
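Since insertInto matches columns by position rather than by name, it can help to select the DataFrame columns in the target table's order first. A small sketch, assuming an existing SparkSession spark, a DataFrame df, and a target table db.table:

# insertInto ignores column names and matches by position, so reorder the
# DataFrame columns to the target table's column order before writing.
cols_in_table_order = spark.table("db.table").columns
df.select(*cols_in_table_order).write.insertInto("db.table")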

spark to hive data types

Is there a way to convert an input string field into an ORC table column specified as varchar(xx) in a Spark SQL select query, or do I have to use some workaround? I'm using Spark 1.6.
I found on a Cloudera forum that Spark does not care about the length; it saves the value as a string with no size limit.
The table is inserted into Hive OK, but I'm a little bit worried about data quality.
temp_table = sqlContext.table(ext)
df = temp_table.select(temp_table.day.cast('string'))
I would like to see something like this :)
df = temp_table.select(temp_table.day.cast('varchar(100)'))
Edit:
df.write.partitionBy(part).mode('overwrite').insertInto(int)
The table I'm inserting into is saved as ORC (the line above probably should have .format('orc')).
I found here that if I specify a column as a varchar(xx) type, the input string will be cut off to xx characters.
Thx
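For illustration, a sketch of the pattern described in that last point (using the Spark 2.x API for brevity): declare the length limit as varchar(100) in the Hive ORC table definition and keep the Spark side as a plain string. The table and column names here are placeholders, not from the question:

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# The length limit lives in the Hive table definition; Spark itself only
# tracks the column as a plain string.
spark.sql("""
    CREATE TABLE IF NOT EXISTS db.orc_table (day VARCHAR(100))
    STORED AS ORC
""")

src = spark.table("ext_table")                  # placeholder source table
df = src.select(src.day.cast("string"))         # Spark side stays 'string'
df.write.mode("overwrite").insertInto("db.orc_table")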

Impala table from spark partitioned parquet files

I have generated some partitioned parquet data using Spark, and I'm wondering how to map it to an Impala table... Sadly, I haven't found any solution yet.
The schema of parquet is like :
{ key: long,
value: string,
date: long }
and I partitioned it by key and date, which gives me this kind of directory layout on my HDFS:
/data/key=1/date=20170101/files.parquet
/data/key=1/date=20170102/files.parquet
/data/key=2/date=20170101/files.parquet
/data/key=2/date=20170102/files.parquet
...
Do you know how I could tell Impala to create a table from this dataset with the corresponding partitions (and without having to loop over each partition, as I have read elsewhere)? Is it possible?
Thank you in advance
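For context, a directory layout like this is what Spark's partitionBy produces when writing Parquet; a minimal sketch of how such data might have been written (column names taken from the schema above, sample row invented for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# df has the columns key (long), value (string), date (long) described above.
df = spark.createDataFrame([(1, "a", 20170101)], ["key", "value", "date"])

# Partitioning by key and date creates /data/key=.../date=.../ directories,
# and those two columns are not stored inside the parquet files themselves.
df.write.partitionBy("key", "date").parquet("/data/")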
Assuming that by "schema of parquet" you mean the schema of the dataset, and that you then partitioned by those columns, the actual files.parquet files will contain only the value column. Now you can proceed as follows.
The solution is to use an Impala external table:
create external table mytable (value STRING) partitioned by (key BIGINT,
date BIGINT) stored as parquet location '....../data/'
Note that in the above statement you have to give the path up to the data folder.
alter table mytable recover partitions;
refresh mytable;
The above two commands will automatically detect the partitions based on the table's schema and pick up the parquet files present in the subdirectories.
Now you can start querying the data.
Hope it helps.
