Spark: write timestamp to parquet and read it from Hive / Impala - apache-spark

I need to write a timestamp into parquet, then read it with Hive and Impala.
In order to write it, I tried eg
my.select(
...,
unix_timestamp() as "myts"
.write
.parquet(dir)
Then to read I created an external table in Hive:
CREATE EXTERNAL TABLE IF NOT EXISTS mytable (
...
myts TIMESTAMP
)
Doing so, I get the error
HiveException: java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.hive.serde2.io.TimestampWritable
I also tried to replaced the unix_timestamp() with
to_utc_timestamp(lit("2018-05-06 20:30:00"), "UTC")
and same problem. In Impala, it returns me:
Column type: TIMESTAMP, Parquet schema: optional int64
Whereas timestamp are supposed to be int96.
What is the correct way to write timestamp into parquet?

Found a workaround: a UDF that returns a java.sql.Timestamp objects, with no casting, then spark will save it as int96.

Related

Conversion incompatibility between timestamp type in Glue and in Spark?

I want to run a simple sql select of timestamp fields from my data using spark sql (pyspark).
However, all the timestamp fields appear as 1970-01-19 10:45:37.009 .
So looks like I have some conversion incompatibility between timestamp in Glue and in Spark.
I'm running with pyspark, and I have the glue catalog configuration so I get my database schema from Glue. In both Glue and the spark sql dataframe these columns appear with timestamp type.
However, it looks like when I read the parquet files from s3 path, the event_time column (for example) is of type long and when I get its data, I get a correct event_time as epoch in milliseconds = 1593938489000. So I can convert it and get the actual datetime.
But when I run spark.sql , the event_time column gets timestamp type but it isn’t useful and missing precision. So I get this = 1970-01-19 10:45:37.009 .
When I run the same sql query in Athena, the timestamp field looks fine so my schema in Glue looks correct.
Is there a way to overcome it?
I didn't manage to find any spark.sql configurations that solved it.
You are getting 1970, due to incorrect way of formatting. Please give a try below code to convert long to UTC timestamp
from pyspark.sql import types as T
from pyspark.sql import functions as F
df = df.withColumn('timestamp_col_original', F.lit('1593938489000'))
df = df.withColumn('timestamp_col', (F.col('timestamp_col_original') / 1000).cast(T.TimestampType()))
df.show()
While converting : 1593938489000 I was getting below
timestamp_col_original| timestamp_col|
+----------------------+-------------------+
| 1593938489000|2020-07-05 08:41:29|
| 1593938489000|2020-07-05 08:41:29|
| 1593938489000|2020-07-05 08:41:29|
| 1593938489000|2020-07-05 08:41:29|
+----------------------+-------------------+

External table on parquet files. .count() works, .show() fails

I defined an external table on a group of partitioned parquet files as such:
CREATE EXTERNAL TABLE foobarbaz (
src_file string,
[...]
temperature string
)
PARTITIONED BY (dt string)
STORED AS PARQUET
LOCATION '{1}'
If I then run
df = spark.table(foobarbaz)
print(df.count())
I get the correct non-zero result.
If I run
df = spark.table(foobarbaz)
df.show()
PySpark raises
py4j.protocol.Py4JJavaError: An error occurred while calling o95.showString. [...] Caused by: java.lang.UnsupportedOperationException
Why?
full traceback
I found an issue specific to my situation that may still be relevant to future readers. I extracted the schema using parquet-tools. One column was listed as int96 so in the schema definition I naively used int type for this column. Closer investigation revealed the column was of type datetime. Changing the schema definition accordingly resolved the issue.

PySpark cannot insertInto Hive table because "Can only write data to relations with a single path"

I have a Hive Orc table with a definition similar to the following definition
CREATE EXTERNAL TABLE `example.example_table`(
...
)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
WITH SERDEPROPERTIES (
'path'='s3a://path/to/table')
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
's3a://path/to/table'
TBLPROPERTIES (
...
)
I am attempting to use PySpark to append a dataframe to this table using "df.write.insertInto("example.example_table")". When running this, I get the following error:
org.apache.spark.sql.AnalysisException: Can only write data to relations with a single path.;
at org.apache.spark.sql.execution.datasources.DataSourceAnalysis$$anonfun$apply$1.applyOrElse(DataSourceStrategy.scala:188)
at org.apache.spark.sql.execution.datasources.DataSourceAnalysis$$anonfun$apply$1.applyOrElse(DataSourceStrategy.scala:134)
...
When looking at the underlying Scala code, the condition that throws this error is checking to see if the table location has multiple "rootPaths". Obviously, my table is defined with a single location. What else could cause this?
It is that path that you are defining that causes the error. I just ran into this same problem myself. Hive generates a location path based on the hive.metastore.warehouse.dir property, so you have that default location plus the path you specified, which is causing that linked code to fail.
If you want to pick a specific path other than the default, then try using LOCATION.
Try running a describe extended example.example_table query to see more detailed information on the table. One of the output rows will be a Detailed Table Information which contains a bunch of useful information:
Table(
tableName:
dbName:
owner:
createTime:1548335003
lastAccessTime:0
retention:0
sd:StorageDescriptor(cols:
location:[*path_to_table*]
inputFormat:org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
outputFormat:org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
compressed:false
numBuckets:-1
serdeInfo:SerDeInfo(
name:null
serializationLib:org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
parameters:{
serialization.format=1
path=[*path_to_table*]
}
)
bucketCols:[]
sortCols:[]
parameters:{}
skewedInfo:SkewedInfo(skewedColNames:[]
skewedColValues:[]
skewedColValueLocationMaps:{})
storedAsSubDirectories:false
)
partitionKeys:[]
parameters:{transient_lastDdlTime=1548335003}
viewOriginalText:null
viewExpandedText:null
tableType:MANAGED_TABLE
rewriteEnabled:false
)
We had the same problem in a project when migrating from Spark 1.x and HDFS to Spark 3.x and S3. We solve this issue setting the next Spark property to false:
spark.sql.hive.convertMetastoreParquet
You can just run
spark.sql("SET spark.sql.hive.convertMetastoreParquet=false")
Or maybe
spark.conf("spark.sql.hive.convertMetastoreParquet", False)
Being spark the SparkSession object. The explanaition of this is currently in Spark documentation.

Error While Writing into a Hive table from Spark Sql

I am trying to insert data into a Hive External table from Spark Sql.
I am created the hive external table through the following command
CREATE EXTERNAL TABLE tab1 ( col1 type,col2 type ,col3 type) CLUSTERED BY (col1,col2) SORTED BY (col1) INTO 8 BUCKETS STORED AS PARQUET
In my spark job , I have written the following code
Dataset df = session.read().option("header","true").csv(csvInput);
df.repartition(numBuckets, somecol)
.write()
.format("parquet")
.bucketBy(numBuckets,col1,col2)
.sortBy(col1)
.saveAsTable(hiveTableName);
Each time I am running this code I am getting the following exception
org.apache.spark.sql.AnalysisException: Table `tab1` already exists.;
at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:408)
at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:393)
at somepackage.Parquet_Read_WriteNew.writeToParquetHiveMetastore(Parquet_Read_WriteNew.java:100)
You should be specifying a save mode while saving the data in hive.
df.write.mode(SaveMode.Append)
.format("parquet")
.bucketBy(numBuckets,col1,col2)
.sortBy(col1)
.insertInto(hiveTableName);
Spark provides the following save modes:
Save Mode
ErrorIfExists: Throws an exception if the target already exists. If target doesn’t exist write the data out.
Append: If target already exists, append the data to it. If the data doesn’t exist write the data out.
Overwrite: If the target already exists, delete the target. Write the data out.
Ignore: If the target already exists, silently skip writing out. Otherwise write out the data.
You are using the saveAsTable API, which create the table into Hive. Since you have already created the hive table through command, the table tab1 already exists. so when Spark API trying to create it, it throws error saying table already exists, org.apache.spark.sql.AnalysisException: Tabletab1already exists.
Either drop the table and let spark API saveAsTable create the table itself.
Or use the API insertInto to insert into an existing hive table.
df.repartition(numBuckets, somecol)
.write()
.format("parquet")
.bucketBy(numBuckets,col1,col2)
.sortBy(col1)
.insertInto(hiveTableName);

save Spark dataframe to Hive: table not readable because "parquet not a SequenceFile"

I'd like to save data in a Spark (v 1.3.0) dataframe to a Hive table using PySpark.
The documentation states:
"spark.sql.hive.convertMetastoreParquet: When set to false, Spark SQL will use the Hive SerDe for parquet tables instead of the built in support."
Looking at the Spark tutorial, is seems that this property can be set:
from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)
sqlContext.sql("SET spark.sql.hive.convertMetastoreParquet=false")
# code to create dataframe
my_dataframe.saveAsTable("my_dataframe")
However, when I try to query the saved table in Hive it returns:
hive> select * from my_dataframe;
OK
Failed with exception java.io.IOException:java.io.IOException:
hdfs://hadoop01.woolford.io:8020/user/hive/warehouse/my_dataframe/part-r-00001.parquet
not a SequenceFile
How do I save the table so that it's immediately readable in Hive?
I've been there...
The API is kinda misleading on this one.
DataFrame.saveAsTable does not create a Hive table, but an internal Spark table source.
It also stores something into Hive metastore, but not what you intend.
This remark was made by spark-user mailing list regarding Spark 1.3.
If you wish to create a Hive table from Spark, you can use this approach:
1. Use Create Table ... via SparkSQL for Hive metastore.
2. Use DataFrame.insertInto(tableName, overwriteMode) for the actual data (Spark 1.3)
I hit this issue last week and was able to find a workaround
Here's the story:
I can see the table in Hive if I created the table without partitionBy:
spark-shell>someDF.write.mode(SaveMode.Overwrite)
.format("parquet")
.saveAsTable("TBL_HIVE_IS_HAPPY")
hive> desc TBL_HIVE_IS_HAPPY;
OK
user_id string
email string
ts string
But Hive can't understand the table schema(schema is empty...) if I do this:
spark-shell>someDF.write.mode(SaveMode.Overwrite)
.format("parquet")
.saveAsTable("TBL_HIVE_IS_NOT_HAPPY")
hive> desc TBL_HIVE_IS_NOT_HAPPY;
# col_name data_type from_deserializer
[Solution]:
spark-shell>sqlContext.sql("SET spark.sql.hive.convertMetastoreParquet=false")
spark-shell>df.write
.partitionBy("ts")
.mode(SaveMode.Overwrite)
.saveAsTable("Happy_HIVE")//Suppose this table is saved at /apps/hive/warehouse/Happy_HIVE
hive> DROP TABLE IF EXISTS Happy_HIVE;
hive> CREATE EXTERNAL TABLE Happy_HIVE (user_id string,email string,ts string)
PARTITIONED BY(day STRING)
STORED AS PARQUET
LOCATION '/apps/hive/warehouse/Happy_HIVE';
hive> MSCK REPAIR TABLE Happy_HIVE;
The problem is that the datasource table created through Dataframe API(partitionBy+saveAsTable) is not compatible with Hive.(see this link). By setting spark.sql.hive.convertMetastoreParquet to false as suggested in the doc, Spark only puts data onto HDFS,but won't create table on Hive. And then you can manually go into hive shell to create an external table with proper schema&partition definition pointing to the data location.
I've tested this in Spark 1.6.1 and it worked for me. I hope this helps!
I have done in pyspark, spark version 2.3.0 :
create empty table where we need to save/overwrite data like:
create table databaseName.NewTableName like databaseName.OldTableName;
then run below command:
df1.write.mode("overwrite").partitionBy("year","month","day").format("parquet").saveAsTable("databaseName.NewTableName");
The issue is you can't read this table with hive but you can read with spark.
metadata doesn't already exist. In other words, it will add any partitions that exist on HDFS but not in metastore, to the hive metastore.

Resources