Spark timestamp type is not getting accepted by a Hive timestamp column - apache-spark

I have a Spark DataFrame which contains a timestamp field. I am storing the DataFrame at an HDFS location on which a Hive external table is created. The Hive table declares the field with the timestamp type, but while reading the data from the external location Hive populates the timestamp field as a blank value in the table.
My Spark DataFrame query:
df.select($"ipAddress", $"clientIdentd", $"userId",
  to_timestamp(unix_timestamp($"dateTime", "dd/MMM/yyyy:HH:mm:ss Z").cast("timestamp")).as("dateTime"),
  $"method", $"endpoint", $"protocol", $"responseCode",
  $"contentSize", $"referrerURL", $"browserInfo")
Hive create table statement:
CREATE EXTERNAL TABLE `finalweblogs3`(
`ipAddress` string,
`clientIdentd` string,
`userId` string,
`dateTime` timestamp,
`method` string,
`endpoint` string,
`protocol` string,
`responseCode` string,
`contentSize` string,
`referrerURL` string,
`browserInfo` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'field.delim'=',',
'serialization.format'=',')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'hdfs://localhost:9000/streaming/spark/finalweblogs3'
I am not able to figure out why this is happening.

I resolved it by changing the storage format to Parquet.
I still don't know why it is not working with the CSV format.
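A likely explanation (my own addition, not from the original answer): with text storage, Hive's LazySimpleSerDe only parses timestamps written as yyyy-MM-dd HH:mm:ss[.fffffffff], while Spark's CSV writer emits an ISO-8601 string with a 'T' separator by default, so Hive silently turns the column into a blank/NULL. Parquet avoids the problem because the timestamp is stored as a typed value rather than text. If staying on CSV is required, a minimal sketch (assuming Spark 2.x and the columns/path from the question) is to force the writer's timestampFormat to Hive's layout:

import org.apache.spark.sql.functions._
import spark.implicits._

// Same select as above; the only change is telling the CSV writer to render
// timestamps in the layout Hive's text SerDe can parse back.
df.select($"ipAddress", $"clientIdentd", $"userId",
    unix_timestamp($"dateTime", "dd/MMM/yyyy:HH:mm:ss Z").cast("timestamp").as("dateTime"),
    $"method", $"endpoint", $"protocol", $"responseCode",
    $"contentSize", $"referrerURL", $"browserInfo")
  .write
  .option("timestampFormat", "yyyy-MM-dd HH:mm:ss")
  .mode("overwrite")
  .csv("hdfs://localhost:9000/streaming/spark/finalweblogs3")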

Related

DateTime datatype in BigQuery

I have a partitioned table where one of the columns is of type DateTime and the table is partitioned on that same column. According to the spark-bigquery documentation, the corresponding Spark SQL type is String.
https://github.com/GoogleCloudDataproc/spark-bigquery-connector
I tried doing the same but I am getting a datatype mismatch issue.
Code Snippet:
ZonedDateTime nowPST = ZonedDateTime.ofInstant(Instant.now(), TimeZone.getTimeZone("PST").toZoneId());
df = df.withColumn("createdDate", lit(nowPST.toLocalDateTime().toString()));
Error:
Caused by: com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryException: Failed to load to <PROJECT_ID>:<DATASET_NAME>.<TABLE_NAME> in job JobId{project=<PROJECT_ID>, job=<JOB_ID>, location=US}. BigQuery error was Provided Schema does not match Table <PROJECT_ID>:<DATASET_NAME>.<TABLE_NAME>. Field createdDate has changed type from DATETIME to STRING
at com.google.cloud.spark.bigquery.BigQueryWriteHelper.loadDataToBigQuery(BigQueryWriteHelper.scala:156)
at com.google.cloud.spark.bigquery.BigQueryWriteHelper.writeDataFrameToBigQuery(BigQueryWriteHelper.scala:89)
... 36 more
As Spark has no support for DateTime, the BigQuery connector does not support writing DateTime - there is no equivalent Spark data type that can be used. We are exploring ways to augment the DataFrame's metadata in order to support the types which are supported by BigQuery and not by Spark (DateTime, Time, Geography).
At the moment, please keep this field as a String and do the conversion on the BigQuery side.
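A minimal sketch of that workaround (my own, on the assumption that the target, or a staging table, declares createdDate as STRING and the cast to DATETIME happens afterwards in BigQuery SQL; America/Los_Angeles stands in for the short "PST" id):

import java.time.{Instant, ZoneId, ZonedDateTime}
import java.time.format.DateTimeFormatter
import org.apache.spark.sql.functions.lit

// Render the local datetime as a plain "yyyy-MM-dd HH:mm:ss" string, which
// BigQuery can later CAST(... AS DATETIME) or PARSE_DATETIME on its side.
val nowPST = ZonedDateTime.ofInstant(Instant.now(), ZoneId.of("America/Los_Angeles"))
val createdDate = nowPST.toLocalDateTime
  .format(DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss"))

val out = df.withColumn("createdDate", lit(createdDate))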
I am running into this issue now as well, with both geography (https://community.databricks.com/s/question/0D58Y000099mPyDSAU/does-databricks-support-writing-geographygeometry-data-into-bigquery)
and DateTime types. The only way I could get the table from Databricks to BigQuery (without creating a temporary table and inserting the data, which would still be costly due to the size of the table) was to write the table out as CSV into a GCS bucket:
results_df.write.format("csv").mode("overwrite").save("gs://<bucket-name>/ancillary_test")
and then load the data from the bucket into the BigQuery table, specifying the schema:
LOAD DATA INTO <dataset>.<tablename>(
PRICENODEID INTEGER,
ISONAME STRING,
PRICENODENAME STRING,
MARKETTYPE STRING,
GMTDATETIME TIMESTAMP,
TIMEZONE STRING,
LOCALDATETIME DATETIME,
ANCILLARY STRING,
PRICE FLOAT64,
CHANGE_DATE TIMESTAMP
)
FROM FILES (
format = 'CSV',
uris = ['gs://<bucket-name>/ancillary_test/*.csv']
);

Spark is unable to read a Hive ORC table whose data files have different versions of the schema

I have a hive ORC partitioned table created as below.
create table default.tab_xyz (
wid String,
update_at timestamp,
type String,
id bigint)
PARTITIONED BY (
minorversion string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
Now under the partition column there are two versions, 2.1 and 2.2. The schema taken from the data files of each version is as below.
* 2.1 Version Schema *
wid : String,
type: String,
update_at : timestamp
* 2.2 Version Schema *
wid : String,
update_at : timestamp,
type: String,
id : long
When I execute
select * from default.tab_xyz
in Hive it works fine. But when I run
spark.sql("select * from default.tab_xyz")
it throws the error below:
java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.orc.mapred.OrcTimestamp
This happens because, for the 2.1 version files, the type column of the file gets mapped onto the update_at column of the table.
Is there any way to handle this in Spark so that all the data is returned even though the schemas are different?
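Not part of the original thread, but one workaround sketch, assuming Spark 2.3+ (for unionByName) and ORC files whose footers carry real column names: bypass the Hive table, read each minorversion directory directly so Spark picks up each file's own footer schema, align the columns by name, and union the results (the warehouse path below is an assumption):

import org.apache.spark.sql.functions.lit

val base = "/apps/hive/warehouse/tab_xyz"

val v21 = spark.read.orc(s"$base/minorversion=2.1")   // wid, type, update_at
  .withColumn("id", lit(null).cast("long"))           // column missing in 2.1 files
  .withColumn("minorversion", lit("2.1"))

val v22 = spark.read.orc(s"$base/minorversion=2.2")   // wid, update_at, type, id
  .withColumn("minorversion", lit("2.2"))

// unionByName matches columns by name, so the differing physical column order
// between the two file versions no longer matters.
val all = v21.select("wid", "update_at", "type", "id", "minorversion")
  .unionByName(v22.select("wid", "update_at", "type", "id", "minorversion"))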

PySpark/Glue: When using a date column as a partition key, it's always converted into a String?

I am using PySpark on AWS Glue. It appears that when writing a data set with a date column used as the partition key, it is always converted into a string?
df = df \
.withColumn("querydatetime", to_date(df["querydatetime"], DATE_FORMAT_STR))
...
df \
.repartition("querydestinationplace", "querydatetime") \
.write \
.mode("overwrite") \
.partitionBy(["querydestinationplace", "querydatetime"]) \
.parquet("s3://xxx/flights-test")
I noticed this in my table DDL from Athena:
CREATE EXTERNAL TABLE `flights_test`(
`key` string,
`agent` int,
`queryoutbounddate` date,
`queryinbounddate` date,
`price` decimal(10,2),
`outdeparture` timestamp,
`indeparture` timestamp,
`numberoutstops` int,
`out_is_holiday` boolean,
`out_is_longweekends` boolean,
`in_is_holiday` boolean,
`in_is_longweekends` boolean)
PARTITIONED BY (
`querydestinationplace` string,
`querydatetime` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://xxx/flights-test/'
TBLPROPERTIES (...)
Notice
PARTITIONED BY (
`querydestinationplace` string,
`querydatetime` string)
Must the partition columns always be strings? In fact, querydestinationplace should be an int type. Will this string type be less efficient than an int or a date?
This is a known behavior of Parquet partition discovery. You can add the following line before reading the parquet files to avoid this behavior:
# prevent the integer id fields, which are used for partitioning,
# from being automatically converted to integers
sqlContext.setConf("spark.sql.sources.partitionColumnTypeInference.enabled", "false")

Parquet fields showing NULL when reading through Hive, but showing values when reading through Spark

I am writing my Spark streaming data frame as Parquet files to HDFS. I have created a Hive table on top of that HDFS location.
My Spark Structured Streaming write command is as follows:
parquet_frame.writeStream
  .option("compression", "none")
  .option("latestFirst", "true")
  .option("startingOffsets", "latest")
  .option("checkpointLocation", "/user/ddd/openareacheckpoint_feb/")
  .outputMode("append")
  .trigger(Trigger.ProcessingTime("10 seconds"))
  .partitionBy("dfo_data_dt")
  .format("parquet")
  .option("path", "hdfs://ddd/apps/hive/warehouse/ddddd.db/frg_drag/")
  .start()
  .awaitTermination()
If I try to read the data from Hive, I am getting NULL for the double and int columns; only the string and bigint columns come through.
But when I read the same HDFS files through the spark shell, the values come back without any NULLs.
Command in Spark to read the Parquet files:
val pp = spark.read.parquet("hdfs://ddd/apps/hive/warehouse/ddddd.db/frg_drag/dfo_data_dt=20190225/")
pp.show
My CREATE TABLE statement in Hive is as follows:
CREATE TABLE `ddddd.frg_drag`(
`unit` string,
`pol` string,
`lop` string,
`gok` string,
`dfo_call_group` string,
`dfo_dfr` double,
`dfo_dfrs` double,
`dfo_dfrf` double,
`dfo_dfra` double,
`dfo_dfrgg` double,
`dfo_dfrqq` double,
`dfo_w_percent` double,
`dfo_afv_percent` double,
`dfo_endfd` double,
`dfo_time` timestamp,
`dfo_data_hour` int,
`dfo_data_minute` int)
PARTITIONED BY (
`dfo_data_dt` bigint)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
'hdfs://ddd/apps/hive/warehouse/ddddd.db/frg_drag'
TBLPROPERTIES (
'transient_lastDdlTime'='1551108381')
Can someone help me resolve this issue? I am new to the Spark world.
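Not an answer from the original thread, but a hedged diagnostic sketch: compare the schema Spark actually wrote into the Parquet files with the types declared in the Hive DDL. A mismatch in numeric types or column names is a common reason for Hive returning NULLs while Spark, which reads the Parquet footers directly, still shows the values (the path is the one from the question):

// Types as physically stored in the Parquet files written by the stream
val pp = spark.read.parquet("hdfs://ddd/apps/hive/warehouse/ddddd.db/frg_drag/dfo_data_dt=20190225/")
pp.printSchema()

// Types the Hive table declares for the same columns
spark.sql("DESCRIBE ddddd.frg_drag").show(50, truncate = false)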

Unable to directly load hive parquet table using spark dataframe

I have gone through the related posts available on SO and couldn't find this specific issue anywhere on the internet.
I am trying to load a Hive table (a Hive external table pointing to Parquet files), but the Spark data frame cannot read the data; it is only able to read the schema. For the same Hive table I can query from the Hive shell without a problem. When I try to load the Hive table into a dataframe it does not return any data. Below is what my script and the DDL look like. I am using Spark 2.1 (MapR distribution).
Spark is unable to read data from the Hive table that has underlying Parquet files.
val df4 = spark.sql("select * from default.Tablename")
scala> df4.show()
+----------------------+------------------------+----------+---+-------------+-------------+---------+
|col1 |col2 |col3 |key |col4| record_status|source_cd|
+----------------------+------------------------+----------+---+-------------+-------------+---------+
+----------------------+------------------------+----------+---+-------------+-------------+---------+
Hive DDL
CREATE EXTERNAL TABLE `Tablename`(
`col1` string,
`col2` string,
`col3` decimal(19,0),
`key` string,
`col6` string,
`record_status` string,
`source_cd` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
'path'='maprfs:abc/bds/dbname.db/Tablename')
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
'maprfs:/Datalocation/Tablename'
TBLPROPERTIES (
'numFiles'='2',
'spark.sql.sources.provider'='parquet',
'spark.sql.sources.schema.numParts'='1',
'spark.sql.sources.schema.part.0'='{\"type\":\"struct\",\"fields\":[{\"name\":\"col1\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"col2\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"col3\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"key\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"col6\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"record_status\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"source_cd\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}}]}',
'totalSize'='68216',
'transient_lastDdlTime'='1502904476')
Remove
'spark.sql.sources.provider'='parquet'
from the TBLPROPERTIES and it will work.
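As a hedged sketch of how to do that without recreating the table (the table name is the placeholder from the question; dropping the stale Spark schema properties alongside the provider is my own assumption), ALTER TABLE ... UNSET TBLPROPERTIES can be run from the Hive shell or via spark.sql:

spark.sql("""
  ALTER TABLE default.Tablename UNSET TBLPROPERTIES (
    'spark.sql.sources.provider',
    'spark.sql.sources.schema.numParts',
    'spark.sql.sources.schema.part.0')
""")
spark.catalog.refreshTable("default.Tablename")
spark.sql("select * from default.Tablename").show()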
