While running a query on a Hive external table from Trino, getting error: Malformed ORC file. Cannot read SQL type 'double' from ORC stream of type STRING - presto

While running a query on the Hive external table from Trino, I get the error below:
SQL Error [16777219]: Query failed (#20220820_042537_00480_dszgc): Error opening Hive split hdfs://CDPSAPRODHA/warehouse/tablespace/external/hive/customer_360.db/transaction_lines/order_date=20220818/part-00167-c28380bb-d942-445e-acd6-09ec8b29d777.c000 (offset=0, length=213919): Malformed ORC file. Cannot read SQL type 'double' from ORC stream '.assortment_id' of type STRING with attributes {} [hdfs://CDPSAPRODHA/warehouse/tablespace/external/hive/customer_360.db/transaction_lines/order_date=20220818/part-00167-c28380bb-d942-445e-acd6-09ec8b29d777.c000]
data type of "assortment_id" is string
Query :
select * from transaction_lines;
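
The error suggests a mismatch between the type Trino resolves for assortment_id (double, presumably from the table or partition schema it reads from the metastore) and the physical type actually stored in the ORC file (string). One way to narrow this down is to inspect the ORC file footer directly and compare it with the table definition Trino sees. A minimal sketch using pyarrow, assuming a local copy of the problematic file and that pyarrow is installed:

import pyarrow.orc as orc

# Local copy of the split that fails in Trino (local path is an assumption).
f = orc.ORCFile("part-00167-c28380bb-d942-445e-acd6-09ec8b29d777.c000")
print(f.schema)  # shows the column names and physical types stored in the file

If the file really stores assortment_id as a string while the schema Trino resolves says double, the discrepancy is on the metastore/partition-schema side rather than in the data files.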

Related

zeppelin spark read parquet mysql write sql exception

After reading a Parquet file in Apache Spark, I selected the fields with Spark SQL.
Even after making the fields match the database table I created (types included), an SQL exception occurs in the JDBC write step.
The questionable part is: use near '"URL" TEXT , "GOD_NO" TEXT , "BRND_NM" TEXT , "CTGRY_NM" TEXT , "EVSP_ID" TEX...' at line 1.
It looks like a CREATE TABLE statement with those double-quoted columns is being generated. I'd like to know what I did wrong, thanks.
(screenshots: spark result, zeppelin interpreter)
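
For reference, a minimal PySpark sketch of the JDBC write step itself; the URL, table name, credentials and driver class below are placeholders, not taken from the question:

# df is the DataFrame read from Parquet; connection details are hypothetical.
df.write \
    .format("jdbc") \
    .option("url", "jdbc:mysql://dbhost:3306/mydb") \
    .option("dbtable", "my_table") \
    .option("user", "user") \
    .option("password", "password") \
    .option("driver", "com.mysql.cj.jdbc.Driver") \
    .mode("append") \
    .save()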

Pyspark - Creating a delta table while enableHiveSupport()

I'm creating a Delta table on an EMR (6.2) cluster using the following code:
try:
    self.spark.sql(f'''
        CREATE TABLE default.features_scd
        (`{entity_key}` {entity_value_type}, `{CURRENT}` BOOLEAN,
         `{EFFECTIVE_TIMESTAMP}` TIMESTAMP, `{END_TIMESTAMP}` TIMESTAMP, `date` DATE)
        USING DELTA
        PARTITIONED BY (DATE)
        LOCATION 's3://mybucket/some/path'
    ''')
except IllegalArgumentException as e:
    self.logger.error('got an illegal argument exception')
    pass
I have enableHiveSupport() on the spark session.
I'm getting the warning:
WARN HiveExternalCatalog: Couldn't find corresponding Hive SerDe for data source provider delta. Persisting data source table default.features_scd into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.
And the exception:
IllegalArgumentException: Can not create a Path from an empty string
Basically the table is created fine and I can achieve my goal, but I would like not to have to try/except and pass on that error.
If I run the same code without enableHiveSupport() it runs smoothly, but I need Hive support in the same session for creating/updating a Hive table.
Is there a way to prevent this exception?
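
One thing that might sidestep the SQL DDL path (and with it the SerDe warning) is creating the table through Delta's Python builder API instead of spark.sql. A rough sketch, reusing the names from the question and assuming a Delta Lake version that ships the DeltaTableBuilder API (1.0+), which may be newer than what EMR 6.2 provides out of the box:

from delta.tables import DeltaTable

# Builder-API equivalent of the CREATE TABLE statement above; whether this
# avoids the IllegalArgumentException on this EMR/Delta combination is untested.
(DeltaTable.createIfNotExists(self.spark)
    .tableName("default.features_scd")
    .addColumn(entity_key, entity_value_type)
    .addColumn(CURRENT, "BOOLEAN")
    .addColumn(EFFECTIVE_TIMESTAMP, "TIMESTAMP")
    .addColumn(END_TIMESTAMP, "TIMESTAMP")
    .addColumn("date", "DATE")
    .partitionedBy("date")
    .location("s3://mybucket/some/path")
    .execute())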

How to write data to hive table with snappy compression in Spark SQL

I have an ORC Hive table that was created using the following Hive command:
create table orc1(line string) stored as orcfile
I want to write some data to this table using Spark SQL. I use the following code and want the data to be Snappy-compressed on HDFS:
test("test spark orc file format with compression") {
import SESSION.implicits._
Seq("Hello Spark", "Hello Hadoop").toDF("a").createOrReplaceTempView("tmp")
SESSION.sql("set hive.exec.compress.output=true")
SESSION.sql("set mapred.output.compress=true")
SESSION.sql("set mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec")
SESSION.sql("set io.compression.codecs=org.apache.hadoop.io.compress.SnappyCodec")
SESSION.sql("set mapred.output.compression.type=BLOCK")
SESSION.sql("insert overwrite table orc1 select a from tmp ")
}
The data is written, but it is NOT compressed with Snappy.
If I run the insert overwrite in Beeline/Hive with the same set commands to write the data, then I can see that the table's files are compressed with Snappy.
So, how can I write Snappy-compressed data from Spark SQL 2.1 to ORC tables that were created by Hive?
You can set the compression to Snappy in the create table command, like so:
create table orc1(line string) stored as orc tblproperties ("orc.compress"="SNAPPY");
Then any inserts into the table will be Snappy-compressed (I also corrected orcfile to orc in the command).
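
If recreating the table is not convenient, the same table property can also be applied from Spark SQL before inserting. A small PySpark sketch, assuming Hive support is enabled in the session; whether Spark 2.1 honours the property for Hive-created ORC tables is something to verify:

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Set ORC compression on the existing table, then insert through Spark SQL.
spark.sql("ALTER TABLE orc1 SET TBLPROPERTIES ('orc.compress'='SNAPPY')")
spark.createDataFrame([("Hello Spark",), ("Hello Hadoop",)], ["line"]) \
    .createOrReplaceTempView("tmp")
spark.sql("INSERT OVERWRITE TABLE orc1 SELECT line FROM tmp")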

Error While Writing into a Hive table from Spark Sql

I am trying to insert data into a Hive external table from Spark SQL.
I created the Hive external table with the following command:
CREATE EXTERNAL TABLE tab1 ( col1 type,col2 type ,col3 type) CLUSTERED BY (col1,col2) SORTED BY (col1) INTO 8 BUCKETS STORED AS PARQUET
In my Spark job, I have written the following code:
Dataset df = session.read().option("header", "true").csv(csvInput);

df.repartition(numBuckets, somecol)
  .write()
  .format("parquet")
  .bucketBy(numBuckets, col1, col2)
  .sortBy(col1)
  .saveAsTable(hiveTableName);
Each time I run this code, I get the following exception:
org.apache.spark.sql.AnalysisException: Table `tab1` already exists.;
at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:408)
at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:393)
at somepackage.Parquet_Read_WriteNew.writeToParquetHiveMetastore(Parquet_Read_WriteNew.java:100)
You should specify a save mode while saving the data into Hive.
df.write.mode(SaveMode.Append)
  .format("parquet")
  .bucketBy(numBuckets, col1, col2)
  .sortBy(col1)
  .insertInto(hiveTableName);
Spark provides the following save modes:
ErrorIfExists: throws an exception if the target already exists; if the target doesn't exist, writes the data out.
Append: if the target already exists, appends the data to it; otherwise writes the data out.
Overwrite: if the target already exists, deletes it, then writes the data out.
Ignore: if the target already exists, silently skips the write; otherwise writes the data out.
You are using the saveAsTable API, which creates the table in Hive. Since you have already created the Hive table with the command above, the table tab1 already exists, so when the Spark API tries to create it, it throws an error saying the table already exists: org.apache.spark.sql.AnalysisException: Table `tab1` already exists.
Either drop the table and let the Spark saveAsTable API create the table itself,
or use the insertInto API to insert into an existing Hive table:
df.repartition(numBuckets, somecol)
  .write()
  .format("parquet")
  .bucketBy(numBuckets, col1, col2)
  .sortBy(col1)
  .insertInto(hiveTableName);
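
For completeness, a PySpark sketch of the two options; the table, column and variable names mirror the question and are placeholders, and bucketBy is only used with saveAsTable here, since that is the path where Spark creates the table definition itself:

# Option 1: drop the pre-created table and let Spark create it via saveAsTable.
spark.sql("DROP TABLE IF EXISTS tab1")
(df.repartition(numBuckets, "col1")
    .write
    .format("parquet")
    .bucketBy(numBuckets, "col1", "col2")
    .sortBy("col1")
    .saveAsTable("tab1"))

# Option 2: keep the Hive-created table and insert into it.
df.write.mode("append").insertInto("tab1")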

Spark: write timestamp to parquet and read it from Hive / Impala

I need to write a timestamp into parquet, then read it with Hive and Impala.
To write it, I tried e.g.
my.select(
    ...,
    unix_timestamp() as "myts")
  .write
  .parquet(dir)
Then, to read it, I created an external table in Hive:
CREATE EXTERNAL TABLE IF NOT EXISTS mytable (
...
myts TIMESTAMP
)
Doing so, I get the error
HiveException: java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.hive.serde2.io.TimestampWritable
I also tried replacing the unix_timestamp() with
to_utc_timestamp(lit("2018-05-06 20:30:00"), "UTC")
and got the same problem. In Impala, it returns:
Column type: TIMESTAMP, Parquet schema: optional int64
Whereas timestamps are supposed to be int96.
What is the correct way to write timestamp into parquet?
Found a workaround: a UDF that returns java.sql.Timestamp objects, with no casting; Spark then saves them as int96.
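
In the same spirit, the LongWritable error comes from unix_timestamp() producing a bigint (written to Parquet as int64) rather than a timestamp. A minimal PySpark sketch of writing a real timestamp-typed column, which Spark 2.x writes as int96 by default so Hive and Impala can read it as TIMESTAMP; the output path is hypothetical:

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit, to_timestamp

spark = SparkSession.builder.getOrCreate()

# A TimestampType column (not a bigint) is what ends up as Parquet int96.
df = spark.range(1).select(
    to_timestamp(lit("2018-05-06 20:30:00")).alias("myts")
)
df.write.mode("overwrite").parquet("/tmp/myts_parquet")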
