PySpark reading datetime as is - apache-spark

I am using PySpark to load data over JDBC from an MSSQL database. My problem is with reading datetimes: I want to treat the fetched datetime as UTC without applying any timezone, because the datetime is already in UTC.
On MSSQL I have this table:
CREATE TABLE [config].[some_temp_table](
[id] [int] NULL,
[dt] [datetime] NULL
)
-- select * from [config].[some_temp_table]
--id dt
--1 2022-07-18 23:11:26.613
I want to read it with PySpark over JDBC.
After reading it I have this in my DataFrame:
>>> df.show(truncate=False)
+---+-----------------------+
|id |dt |
+---+-----------------------+
|1 |2022-07-18 21:11:26.613|
+---+-----------------------+
Timezone of the server where Spark runs:
# date +"%Z %z"
CEST +0200
So what I understand: Spark reads the datetime and treats it as a datetime in the local timezone of the server it runs on. So it gets '2022-07-18 23:11:26.613', assumes it is a datetime with a +0200 offset, and converts it to UTC like this: '2022-07-18 21:11:26.613'. Correct me if my thinking is wrong.
What I want to do:
Read a datetime from the MSSQL database with Spark and save it into parquet without any conversion.
For example, if Spark reads the datetime '2022-07-18 23:11:26.613' it should save the same value into parquet, so after reading the parquet file I want to see the value '2022-07-18 23:11:26.613'.
Is there any option to tell Spark to treat the datetime from the JDBC connection as already being UTC, or not to do any conversion?
What I tried:
spark.conf.set("spark.sql.session.timeZone", "UTC") - just does nothing
serverTimezone=UTC - added to the JDBC URI, also did nothing
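For reference, this is roughly how I apply both options (the connection details below are placeholders, not my real ones):
# placeholders for the real host, database and credentials
jdbc_url = "jdbc:sqlserver://myhost:1433;databaseName=mydb;serverTimezone=UTC"

spark.conf.set("spark.sql.session.timeZone", "UTC")

df = (spark.read.format("jdbc")
      .option("url", jdbc_url)
      .option("dbtable", "[config].[some_temp_table]")
      .option("user", "...")        # placeholder
      .option("password", "...")    # placeholder
      .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
      .load())

df.show(truncate=False)
df.write.mode("overwrite").parquet("/tmp/some_temp_table")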
Note:
I don't know if this is needed, but the timezone on the MSSQL database is:
--select current_timezone()
(UTC+01:00)

Related

Conversion incompatibility between timestamp type in Glue and in Spark?

I want to run a simple sql select of timestamp fields from my data using spark sql (pyspark).
However, all the timestamp fields appear as 1970-01-19 10:45:37.009.
So it looks like I have some conversion incompatibility between timestamps in Glue and in Spark.
I'm running with pyspark, and I have the Glue catalog configuration so I get my database schema from Glue. In both Glue and the Spark SQL DataFrame these columns appear with timestamp type.
However, it looks like when I read the parquet files from the S3 path, the event_time column (for example) is of type long, and when I get its data I get a correct event_time as epoch in milliseconds = 1593938489000. So I can convert it and get the actual datetime.
But when I run spark.sql, the event_time column gets a timestamp type, but it isn't useful and is missing precision. So I get this: 1970-01-19 10:45:37.009.
When I run the same sql query in Athena, the timestamp field looks fine so my schema in Glue looks correct.
Is there a way to overcome it?
I didn't manage to find any spark.sql configurations that solved it.
You are getting 1970 due to an incorrect way of formatting: the value is epoch milliseconds, so it has to be divided by 1000 before being cast. Please give the code below a try to convert the long to a UTC timestamp:
from pyspark.sql import types as T
from pyspark.sql import functions as F
df = df.withColumn('timestamp_col_original', F.lit('1593938489000'))
# epoch milliseconds -> seconds, then cast to a timestamp
df = df.withColumn('timestamp_col', (F.col('timestamp_col_original') / 1000).cast(T.TimestampType()))
df.show()
While converting 1593938489000 I was getting the output below:
+----------------------+-------------------+
|timestamp_col_original|      timestamp_col|
+----------------------+-------------------+
| 1593938489000|2020-07-05 08:41:29|
| 1593938489000|2020-07-05 08:41:29|
| 1593938489000|2020-07-05 08:41:29|
| 1593938489000|2020-07-05 08:41:29|
+----------------------+-------------------+
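Applied to the scenario in the question, the same conversion on the raw long column read from S3 might look like this (the S3 path is a placeholder and this is only a sketch, not verified against a Glue catalog):
from pyspark.sql import types as T
from pyspark.sql import functions as F

# read the raw parquet files, where event_time is a long holding epoch milliseconds
raw = spark.read.parquet("s3://my-bucket/my-prefix/")  # placeholder path
fixed = raw.withColumn(
    "event_time",
    (F.col("event_time") / 1000).cast(T.TimestampType())  # ms -> s, then cast to timestamp
)
fixed.select("event_time").show(truncate=False)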

how to force avro writer to write timestamp in UTC in spark scala dataframe [duplicate]

This question already has answers here:
How to set timezone to UTC in Apache Spark?
(5 answers)
Closed 2 years ago.
I need to write a Timestamp field to Avro and ensure the data is saved in UTC. Currently Avro converts it to a long (timestamp-millis) in the local timezone of the server, which causes issues if the server reading it back is in a different timezone. I looked at the DataFrameWriter; it seems to mention an option called timeZone, but it doesn't seem to help. Is there a way to force Avro to consider all timestamp fields received in a specific timezone?
CODE SNIPPET
--write to spark avro
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import com.databricks.spark.avro._   // provides the .avro() shorthand on read/write
val data = Seq(Row("1",java.sql.Timestamp.valueOf("2020-05-11 15:17:57.188")))
val schemaOrig = List( StructField("rowkey",StringType,true)
,StructField("txn_ts",TimestampType,true))
val sourceDf = spark.createDataFrame(spark.sparkContext.parallelize(data),StructType(schemaOrig))
sourceDf.write.option("timeZone","UTC").avro("/test4")
--now try to read back from avro
val avroDf = spark.read.avro("/test4")
avroDf.show(false)
original value in source: 2020-05-11 15:17:57.188
in avro: 1589224677188
read back from avro without formatting:
+-------------+-------------+
|rowkey |txn_ts |
+-------------+-------------+
|1 |1589224677188|
+-------------+-------------+
This mapping is fine, but the issue is that if the local time of the server writing is EST and the one reading it back is GMT, it would give a problem.
println(new java.sql.Timestamp(1589224677188L))
2020-05-11 7:17:57.188 -- time in GMT
.option("timeZone","UTC") option will not convert timestamp to UTC timezone.
Set this spark.conf.set("spark.sql.session.timeZone", "UTC") config property to set UTC as default timezone for all timestamps.
By defaul value for spark.sql.session.timeZone property is JVM system local time zone if not set.
Incase If above options are not working due to lower version of spark try using below options.
--conf "spark.driver.extraJavaOptions=-Duser.timezone=UTC" --conf "spark.executor.extraJavaOptions=-Duser.timezone=UTC"

Spark: write timestamp to parquet and read it from Hive / Impala

I need to write a timestamp into parquet, then read it with Hive and Impala.
In order to write it, I tried e.g.
my.select(
  ...,
  unix_timestamp() as "myts"
).write
  .parquet(dir)
Then to read I created an external table in Hive:
CREATE EXTERNAL TABLE IF NOT EXISTS mytable (
...
myts TIMESTAMP
)
Doing so, I get the error
HiveException: java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.hive.serde2.io.TimestampWritable
I also tried to replace the unix_timestamp() with
to_utc_timestamp(lit("2018-05-06 20:30:00"), "UTC")
and got the same problem. In Impala, it returns:
Column type: TIMESTAMP, Parquet schema: optional int64
Whereas timestamps are supposed to be int96.
What is the correct way to write timestamp into parquet?
Found a workaround: a UDF that returns java.sql.Timestamp objects, with no casting; Spark then saves it as int96.
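As a complementary sketch (in PySpark rather than the question's Scala, and only a sketch): on Spark 2.3+ the Parquet timestamp encoding can also be pinned explicitly with spark.sql.parquet.outputTimestampType, so that a timestamp-typed column is written as int96 for Hive/Impala. Here `my` is the question's DataFrame and the output path is a placeholder:
from pyspark.sql import functions as F

# force the Parquet timestamp physical type to int96 (Spark 2.3+)
spark.conf.set("spark.sql.parquet.outputTimestampType", "INT96")

out = my.select(
    F.to_timestamp(F.lit("2018-05-06 20:30:00")).alias("myts")  # timestamp type, not the long from unix_timestamp()
)
out.write.mode("overwrite").parquet("/tmp/mytable")  # placeholder output path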

Azure Data Lake Analytics - Output dates as +0000 rather than -0800

I have a datetime column in an Azure Data Lake Analytics table.
All my incoming data is UTC +0000. When using the below code, all the csv outputs convert the dates to -0800
OUTPUT @data
TO "/data.csv"
USING Outputters.Text(quoting : false, delimiter : '|');
An example datetime in the output:
2018-01-15T12:20:13.0000000-08:00
Are there any options for controlling the output format of the dates? I don't really understand why everything is suddenly in -0800 when the incoming data isn't.
Currently, ADLA does not store TimeZone information in DateTime, meaning it will always default to the local time of the cluster machine when reading (-8:00 in your case). Therefore, you can either normalize your DateTime to this local time by running
DateTime.SpecifyKind(myDate, DateTimeKind.Local)
or use
DateTime.ConvertToUtc()
to output in Utc form (but note that next time you ingest that same data, ADLA will still default to reading it in offset -0800). Examples below:
@getDates =
    EXTRACT
        id int,
        date DateTime
    FROM "/test/DateTestUtc.csv"
    USING Extractors.Csv();
@formatDates =
    SELECT
        id,
        DateTime.SpecifyKind(date, DateTimeKind.Local) AS localDate,
        date.ConvertToUtc() AS utcDate
    FROM @getDates;
OUTPUT @formatDates
TO "/test/dateTestUtcKind_AllUTC.csv"
USING Outputters.Csv();
You can file a feature request for DateTime with offset on our ADL feedback site. Let me know if you have other questions!

Spark Sql: Loading the file from excel sheet (with extension .xlsx) can not infer the schema of a date-type column properly

I have an xlsx file containing a date/time field (My Time) in the following format, with sample records:
5/16/2017 12:19:00 AM
5/16/2017 12:56:00 AM
5/16/2017 1:17:00 PM
5/16/2017 5:26:00 PM
5/16/2017 6:26:00 PM
I am reading the xlsx file in the following manner:
val inputDF = spark.sqlContext.read.format("com.crealytics.spark.excel")
.option("location","file:///C:/Users/file.xlsx")
.option("useHeader","true")
.option("treatEmptyValuesAsNulls","true")
.option("inferSchema","true")
.option("addColorColumns","false")
.load()
When I try to get the schema using inputDF.printSchema(), I get Double.
Sometimes I even get the schema as String.
And when I print the data, I get this output:
------------------
My Time
------------------
42871.014189814814
42871.03973379629
42871.553773148145
42871.72765046296
42871.76887731482
------------------
The above output is clearly not correct for the given input.
Moreover, if I convert the xlsx file to csv format and read it, I get the output correctly. Here is how I read it in csv format:
spark.sqlContext.read.format("csv")
.option("header", "true")
.option("inferSchema", true)
.load("file:///C:/Users/file.xlsx")
So, any help in this regard on how to infer the correct schema of a date-type column would be appreciated.
Note:-
Spark version is 2.0.0
Language used is Scala
I met the same problem and I also have no idea why. However, I suggest you set "inferSchema" to "false" and then supply the schema yourself, for example as shown below.
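For example, a hedged sketch in PySpark (the question uses Scala, but the idea is the same): read everything as strings, then convert the Excel serial number yourself. Excel serials count days since 1899-12-30, so serial 25569 corresponds to 1970-01-01; the column name and path are taken from the question.
from pyspark.sql import types as T
from pyspark.sql import functions as F

raw = (spark.read.format("com.crealytics.spark.excel")
       .option("location", "file:///C:/Users/file.xlsx")
       .option("useHeader", "true")
       .option("inferSchema", "false")  # read everything as plain strings
       .load())

# Excel serial day number -> seconds since the Unix epoch -> timestamp
converted = raw.withColumn(
    "My Time",
    ((F.col("My Time").cast("double") - 25569) * 86400).cast(T.TimestampType())
)
converted.show(truncate=False)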