Timestamp Timezone Wrong/Missing in Spark/Databricks SQL Output - apache-spark

When converting a timestamp between timezones in databricks/spark sql, the timezone itself seems lost in the end result, and I can't seem to either keep it or add it back.
I have a bunch of UTC times and am using from_utc_timestamp() to convert them to a different timezone based on another field. The result is calculated correctly, but if I output it with a timezone it shows as UTC. It seems the conversion is done correctly but the end result has no timezone stored with it (affirmed by this answer), so it uses the server zone for the timezone in all cases.
Example: Using the following SQL:
createTimestampUTC,
v.timezone,
date_format(from_utc_timestamp(createTimestampUTC, v.timezone),"yyyy-MM-dd'T'HH:mm:s Z") createTimestampLocal,
I get the following:
You can see that the third column has done the conversions correctly for the timezones, but the output itself still shows as being in UTC timezone.
Repeating this with a lowercase z in the date_format function shows the same; namely, the conversions occur but the end result is still treated as UTC.
createTimestampUTC,
v.timezone,
date_format(from_utc_timestamp(createTimestampUTC, v.timezone),"yyyy-MM-dd'T'HH:mm:s z") createTimestampLocal,
I can also use an O in the format output instead of a Z or z, but this just gives me GMT instead of UTC; same output basically.
All the databricks documentation or stackoverflow questions I can find seem to treat printing timezones as a matter of setting the spark server time and outputting that way, or doing the conversion without keeping the resulting timezone. I'm trying to convert to multiple different timezones though, and to keep the timezone in the output. I need to generate the end result in this format:
Is there a way to do this? How do I either keep the timezone after the conversion or add it back in the format I need based on the timezone column I have? Given that the conversion works, and that I can output the end result with a +0000 on it, all the functionality to do this seems there, how do I put it together?

Spark does not support TIMESTAMP WITH TIMEZONE datatype as defined by ANSI SQL. Even though there are some functions that convert the timestamp across timezones, this information is never stored. Databricks documentation on timestamps explains:
Spark SQL defines the timestamp type as TIMESTAMP WITH SESSION TIME
ZONE, which is a combination of the fields (YEAR, MONTH, DAY, HOUR,
MINUTE, SECOND, SESSION TZ) where the YEAR through SECOND field
identify a time instant in the UTC time zone, and where SESSION TZ is
taken from the SQL config spark.sql.session.timeZone.
In your case spark.sql.session.timeZone is UTC, so the Z symbol in the datetime pattern will always return UTC. Therefore you will never get correct behavior from date_format if you deal with multiple timezones in a single query.
The only thing you can do is to explicitly store timezone information in a column and manually append it for display.
concat(
  date_format(from_utc_timestamp(createTimestampUTC, v.timezone), "yyyy-MM-dd'T'HH:mm:ss "),
  v.timezone
) createTimestampLocal
This will display a value such as 2022-03-01T16:47:22 America/New_York. If you need an offset (-05:00) you will need to write a UDF to do the conversion and use Python or Scala native libraries that handle datetime conversions.
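As a rough illustration of the "native library" route, here is a minimal Python sketch that turns a UTC value and an IANA zone name into a string carrying the numeric offset. The function name format_local_with_offset and the sample values are hypothetical, and registering it as a Spark UDF is left out; it only shows the conversion such a UDF would wrap, assuming Python 3.9+ with time zone data installed.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def format_local_with_offset(utc_value: datetime, tz_name: str) -> str:
    """Render a UTC instant as local wall-clock time plus its UTC offset."""
    # Treat naive input as UTC, matching how the UTC column is interpreted.
    if utc_value.tzinfo is None:
        utc_value = utc_value.replace(tzinfo=timezone.utc)
    local_value = utc_value.astimezone(ZoneInfo(tz_name))
    # isoformat() appends the offset as +HH:MM or -HH:MM.
    return local_value.isoformat(timespec="seconds")


# Hypothetical sample row: a UTC timestamp plus per-row zone names.
sample = datetime(2022, 3, 1, 21, 47, 22)
print(format_local_with_offset(sample, "America/New_York"))
print(format_local_with_offset(sample, "Asia/Tokyo"))
```

Because the offset is derived from the zone rules for that specific instant, the same zone name can yield different offsets in summer and winter.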

Related

dealing with UTC dates and the future

I just discovered that storing dates in UTC is not ideal if we are also dealing with dates in the future, because timezone rules seem to change more often than we think. Fortunately, we have the IANA tzdb, which is updated periodically, but, confusingly, Postgres seems to use a specific version of the database that is fixed at build time.
So, my question is: if timezones keep changing, with daylight saving shifts and political and geographical adjustments, and our database does not have the latest tzdb, how can we keep the dates in the system accurate? Additionally, would libraries like date-fns-tz fail to account for new timezone changes?
Ideally, I would think a library would make a network call to a central server that maintains the latest changes, but that doesn't seem to be the case. How are the latest date/timezone changes usually dealt with?
The IANA time zone database collects the global knowledge about what time zone was in effect at what time in every part of the world. That information is naturally incomplete, specifically when it comes to the future. A (IANA) time zone is not an offset from UTC, but a rule that says when which offset from UTC is active. EST is not a time zone in that sense, it is an abbreviation for a certain UTC offset. If you live in New York, you will sometimes have EST, sometimes EDT, depending on the rules for the time zone America/New_York. Of course you should update the time zone database, but not because the timestamps change (they are immutable), but because the way that the timestamps are displayed in a certain time zone can change.
What is stored in the database is always a UTC timestamp, so the timestamp itself is immutable. What changes is the representation. So if you predict that the world will end next July 15 at noon Austrian time, and the Austrian government abolishes daylight saving time, your prediction will be an hour off (unless you expect the cataclysm to follow Austrian legislation). If you are worried about that, make your predictions in UTC or at least add the UTC offset to the timestamp.
If you store the timestamp with time zone in the database, and you query it today with timezone set to Europe/Vienna, you will get a certain result. If you update the time zone database, and the new legislation is reflected in the update, then the same query will return a different result tomorrow. However, it will still be the same timestamp, only the UTC offset in use will be different:
SELECT TIMESTAMP WITH TIME ZONE '2023-07-15 12:00:00+02'
= TIMESTAMP WITH TIME ZONE '2023-07-15 11:00:00+01';
?column?
══════════
t
(1 row)
To clarify @Laurenz's statement in the comments with an example, let's take the extreme case of Samoa, which switched from the GMT-11 timezone to GMT+13, skipping an entire day.
While ignoring what a timezone actually is (there are different but similar opinions in the comments), for the purpose of the calculations below, let's just consider it an offset from UTC. Also note that I use my own symbolic notation for the calculation, but hopefully it is understandable ;-)
So, Samoa skipped a day at the end of Dec 29, 2011. How? Based on what I found, when the clock struck midnight they effectively skipped Friday. But the Unix timestamp remains equivalent/unchanged:
GMT-11
(-)GMT+13
__________
= 24hrs
Let, WST=GMT-11
2011-12-29 T 24:00:00 - 11 (clock strikes midnight)
= 2011-12-30 T 00:00:00 - 11 (WST)
= 2011-12-30 T 11:00:00 (UTC)
now the switch occurs, WST=GMT+13
2011-12-31 T 00:00:00 + 13 (WST)
= 2011-12-31 T-13:00:00 (UTC)
= 2011-12-30 T 11:00:00 (UTC)
So, as far as I can see, storing future dates does not really affect the value of the date itself. What it does affect is the way the dates are displayed. For example, if the timezone info was not updated, people would still see the day after the 29th in Samoa as Friday the 30th (GMT-11), whereas with updated information it would be Saturday the 31st (GMT+13). So, all is well.
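The same arithmetic can be checked with a short Python sketch. It uses the simplified fixed offsets from the calculation above (GMT-11 and GMT+13) rather than real tzdb rules, and the variable names are only illustrative.

```python
from datetime import datetime, timedelta, timezone

OLD_OFFSET = timezone(timedelta(hours=-11))  # simplified pre-switch offset
NEW_OFFSET = timezone(timedelta(hours=13))   # simplified post-switch offset

# Midnight after Dec 29 as rendered with the old and the new offset.
before = datetime(2011, 12, 30, 0, 0, tzinfo=OLD_OFFSET)
after = datetime(2011, 12, 31, 0, 0, tzinfo=NEW_OFFSET)

# Aware datetimes compare by instant, not by wall-clock fields.
same_instant = before == after
print(same_instant)
print(before.astimezone(timezone.utc).isoformat())
print(before.isoformat(), "vs", after.isoformat())
```

Only the rendering differs: one label falls on the 30th and the other on the 31st, while the underlying instant is identical.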
More details are in the comment section of @Laurenz's answer.
Also, as @Adrian mentions above, software that deals with timezones comes packaged with a version of the tzdb, if it supports the conversion at all. That seems to be the case in Postgres as well, though it seems you can configure it to use the system's version. In such cases, you have to update the software or the system's tzdb itself.
I understand that you want to store a future point in time, like "10:00am on July 5th 2078 in the time zone of Australia/Sydney", regardless of what offset that time zone has compared to UTC when you retrieve the point in time again. And when the time comes, the point in time might not even exist, because it is being skipped for the introduction of daylight saving time (or it might exist more than once).
Speaking XML Schema, the information you want to store consists of
a dateTime without timezoneOffset, in the given example 2078-07-05T10:00:00 (no trailing Z)
plus a time zone, given as a string from the IANA database, in the given example Australia/Sydney.
I don't know how this is best stored in a PostgreSQL database, whether as two separate strings, or in a special data type. The PostgreSQL documentation says:
All timezone-aware dates and times are stored internally in UTC. They are converted to local time in the zone specified by the TimeZone configuration parameter before being displayed to the client.
That sounds to me as if the UTC value was fixed, and the local time value in a given time zone might change if daylight saving time is introduced or abolished in that time zone. (Am I correct here?) You want it the other way round: The local time remains the same and the UTC value might change after DST introduction/abolition.
For example, assume that polling stations for the next general election open at 2025-09-21T08:00:00+02:00 in my time zone. But if my country abolishes DST before then, they will open instead on 2025-09-21T08:00:00+01:00 without an explicit rescheduling. In other words: The UTC time changes, but the local time does not.
Or consider a flight whose local departure time and time zone are stored, which has a duration of 10 hours and arrives in another time zone. Its local arrival time then changes when the offset of the departure time zone changes, for example, because daylight saving time is introduced or abolished in that country on day X, but the offset of the arrival time zone does not change. An app that computes the local arrival time must then show a changed arrival time when it is executed on day X or later, although the stored data (the local departure time, departure time zone, arrival time zone and flight duration) have not changed. The required change can happen automatically if the app uses a library that is based on the IANA time zone database and receives an upgrade that includes the DST introduction/abolition before day X arrives.
For an example of such a library, see https://day.js.org/docs/en/timezone/parsing-in-zone.
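A minimal Python sketch of this "local time plus zone name" approach follows. The record layout and the helper name resolve_to_utc are assumptions for illustration; the UTC value is computed at read time from whatever tz rules are installed, so it can change after a tzdb update while the stored data stays the same. The sketch does not handle skipped or repeated local times.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def resolve_to_utc(local_wall_time: str, tz_name: str) -> datetime:
    """Resolve a stored (local time, zone name) pair using the installed tz rules."""
    naive = datetime.fromisoformat(local_wall_time)      # no offset stored
    localized = naive.replace(tzinfo=ZoneInfo(tz_name))  # attach the zone rule
    return localized.astimezone(timezone.utc)


# Hypothetical stored record: the wall-clock time is the source of truth.
stored_event = {"local_time": "2078-07-05T10:00:00", "zone": "Australia/Sydney"}
print(resolve_to_utc(stored_event["local_time"], stored_event["zone"]))
```

Storing the pair keeps the intent ("10:00 in Sydney") stable, at the cost of needing up-to-date zone data wherever the value is resolved.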

terraform setting time with timezone

I need to define a Terraform timestamp variable in a specific timezone.
All our business processes are scheduled in the Europe/Paris timezone.
This timezone alternates between GMT+1 and GMT+2 over the year.
How can I set my timestamp() variable to this timezone?
Terraform's timestamp-manipulation functions are intended for producing machine-readable timestamps in various formats, not for human-oriented timestamps. Therefore there are no built-in functions for converting to and from local ("wallclock") time.
The formatdate function is able to accept a timestamp containing a UTC offset and include that offset in its output, but Terraform has no built-in way to generate such a timestamp: the timestamp function always returns a UTC timestamp and there is no way to customize that. Also, UTC offsets are not the same thing as timezones because, as you've noted, timezones include daylight savings rules which cause the offsets to be different at different times of year.
If you wish to reinterpret Terraform's timestamps in your local timezone then you will need to do that outside of Terraform.
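One way to do that outside Terraform is a small script that takes the UTC string produced by timestamp() and renders it in the target zone. The sketch below is in Python; the function name utc_rfc3339_to_zone is hypothetical, and how the script is wired into your workflow (wrapper script, CI step, and so on) is left open.

```python
from datetime import datetime
from zoneinfo import ZoneInfo


def utc_rfc3339_to_zone(utc_text: str, tz_name: str = "Europe/Paris") -> str:
    """Convert an RFC 3339 UTC timestamp such as '2023-01-15T12:00:00Z' to a zone."""
    # Older Python versions do not parse a trailing "Z", so normalize it first.
    parsed = datetime.fromisoformat(utc_text.replace("Z", "+00:00"))
    # The zone rule picks +01:00 or +02:00 depending on the date.
    return parsed.astimezone(ZoneInfo(tz_name)).isoformat()


print(utc_rfc3339_to_zone("2023-01-15T12:00:00Z"))
print(utc_rfc3339_to_zone("2023-07-15T12:00:00Z"))
```

Unlike a fixed offset, the zone name follows the daylight saving rules automatically.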
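locals {
  # timestamp() always returns the current time in UTC
  now = timestamp()
  # Fixed +2h offset: matches Europe/Paris only during summer time (CEST).
  # In winter (CET) the offset is +1h, so this value will be one hour ahead.
  paris_tz = timeadd(local.now, "2h")
  date_fr = formatdate("D-MM-YYYY", local.paris_tz)
  date_utc = formatdate("YYYY-MM-DD", local.now)
}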
Then you can use your local var.
credit: https://clebergasparoto.com/how-to-manipulate-date-and-time-with-terraform

How to convert all the timestamps to EST in a Hive table

I have a Hive table that contains a timestamp field whose values can be in any timezone (UTC, PST, CST, ...).
I want to convert all of them to a single timezone, EST. This can be done either in Hive or in PySpark.
Basically, I am using it in my PySpark application, which has grouping logic on this datetime field, and before doing that we want all the times in the Hive table converted to EST.
Sid,
Note that the Hive TIMESTAMP type has a limited maximum value, related to the Y2K38 bug, and a JDBC compatibility issue:
TIMESTAMP type to serde2 that supports unix timestamp (1970-01-01 00:00:01 UTC to 2038-01-19 03:14:07 UTC) with optional nanosecond precision using both LazyBinary and LazySimple SerDes.
For LazySimpleSerDe, the data is stored in jdbc compliant java.sql.Timestamp parsable strings.
HIVE-2272
Here is a related discussion about supporting timestamps earlier than 1970 and later than 2038:
Hive JDBC doesn't support TIMESTAMP column
Therefore, I think it would be better to use the Hive DATE or STRING data type. Then you can apply any timezone offset as the default when persisting.
/* utc_timestamp is the column name */
/* below will convert a timestamp in UTC to EST timezone */
select from_utc_timestamp(utc_timestamp, 'EST') from table1;
Hope this helps.
Hive Data Types
Sid, Hive usually uses the local timezone of the host where the data was written. The functions from_utc_timestamp() and to_utc_timestamp() can be very helpful. Instead of stating the timezone as UTC/EST, you should use a location/region name, since that will account for daylight saving time.
Here's a helpful link for more examples: Local Time Convert To UTC Time In Hive
In case you have further questions, please share what you have already tried, along with a sample of your data, so we can investigate further.
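To show why a region name matters, here is a small plain-Python sketch (not Hive or PySpark code) comparing a fixed -05:00 offset labeled EST with the America/New_York region rule. The sample instants are arbitrary, and it assumes Python 3.9+ with time zone data installed.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

FIXED_EST = timezone(timedelta(hours=-5), "EST")  # fixed offset, never observes DST
NEW_YORK = ZoneInfo("America/New_York")           # region rule, switches EST/EDT

summer_utc = datetime(2023, 7, 1, 12, 0, tzinfo=timezone.utc)
winter_utc = datetime(2023, 1, 1, 12, 0, tzinfo=timezone.utc)

for instant in (summer_utc, winter_utc):
    # Left: fixed offset. Right: region rule for the same instant.
    print(instant.astimezone(FIXED_EST).isoformat(),
          instant.astimezone(NEW_YORK).isoformat())
```

The two renderings agree in winter and differ by one hour in summer, which would shift rows across hour or day boundaries in a grouping.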

How to convert the Node.js date format to oracle datetime format

I need to convert the Node.js datetime '2016-07-13T07:38:15.500Z' to Oracle format, where the .500Z represents the timezone.
I'm using Oracle 11g.
The .500Z doesn't represent the time zone. The .500 is fractional seconds. The Z is short for Zulu, which means the time zone has been normalized to GMT/UTC.
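A quick way to see the two parts is to parse the string. The Python sketch below is only an illustration of the format and is not part of the Oracle solution; the variable names are arbitrary.

```python
from datetime import datetime, timezone

node_text = "2016-07-13T07:38:15.500Z"

# %f consumes the fractional seconds; the literal Z is matched as plain text
# and the UTC zone it stands for is attached explicitly afterwards.
parsed = datetime.strptime(node_text, "%Y-%m-%dT%H:%M:%S.%fZ").replace(tzinfo=timezone.utc)

print(parsed.microsecond // 1000)  # fractional part in milliseconds
print(parsed.utcoffset())          # zero offset, i.e. UTC
```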
In Oracle Database, the DATE data type doesn't support fractional seconds or time zones. For fractional seconds you'd need to use any of the 3 TIMESTAMP data types. If you need to store the actual time zone, use TIMESTAMP WITH TIME ZONE. If you don't need the actual time zone (most people don't) and want to make converting time zones really easy, then use TIMESTAMP WITH LOCAL TIME ZONE.
You haven't provided enough information to offer much of an answer, but here's a simple example that may be sufficient:
select to_date('2016-07-13T07:38:15', 'YYYY-MM-DD"T"HH24:MI:SS'),
to_timestamp('2016-07-13T07:38:15.500Z', 'YYYY-MM-DD"T"HH24:MI:SS.FF"Z"')
from dual;

Cassandra select query with timezone issue

We have two different cassandra cluster on two different timezones.
Cluster1: 2.1.8 version, with IST TZ
Cluster2: 2.1.9 version, with UTC TZ
On cluster1, for a select query on a timestamp column, I don't need to specify the TZ value [+0530], whereas on the other cluster I must provide the TZ value in the select query to fetch the row. Does this have to do with the Cassandra version?
I use cqlsh to run the queries. I tried the cqlshrc file option, but it only changes the output format.
cluster1:
select * from test.check where row_timestamp = '1970-01-01 00:00:00';
cluster2:
select * from test.check where row_timestamp = '1970-01-01 00:00:00+0000';
If no TZ is specified, I get 0 rows.
I don't want to specify the TZ on cluster2; please advise how to do that.
It is a bit strange, I must admit, but there might have been some changes in time zone handling between 2.1.8 and 2.1.9. This is from the changelog:
(cqlsh) Fix timestamps before 1970 on Windows, always use UTC for
timestamp display (CASSANDRA-10000)
On the other hand, the documentation is quite clear on this issue:
If no time zone is specified, the time zone of the Cassandra
coordinator node handling the write request is used. For accuracy,
DataStax recommends specifying the time zone rather than relying on
the time zone configured on the Cassandra nodes.
So, my sincere recommendation is to always specify the time zone, and to use the same one everywhere, presumably GMT (or UTC). Save yourself the headache. (Mind that GMT is not exactly equal to UTC; there is a slight difference in meaning.) That way, you can ignore the time zone settings on the clusters. The timestamp is ultimately stored as a number of milliseconds since a fixed point in time (the epoch). The time zone information is purely a "rendering" concern. The number of milliseconds elapsed is the same for, say, 2015/03/05 14:00:00+0100 and 2015/03/05 16:00:00+0300.
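The equality of those two literals can be checked with a short Python sketch. The helper epoch_millis is hypothetical and only mirrors the idea of a millisecond count; it does not call Cassandra.

```python
from datetime import datetime

FORMAT = "%Y/%m/%d %H:%M:%S%z"  # matches literals like 2015/03/05 14:00:00+0100


def epoch_millis(text: str) -> int:
    """Parse a timestamp literal with an explicit offset into epoch milliseconds."""
    return int(datetime.strptime(text, FORMAT).timestamp() * 1000)


first = epoch_millis("2015/03/05 14:00:00+0100")
second = epoch_millis("2015/03/05 16:00:00+0300")
print(first, second, first == second)
```

A literal without an offset has no such fixed value until some zone is assumed for it, which is why results can differ between nodes configured with different zones.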
If you get 0 results when you specify nothing but do get results when you use +0000, make sure the data you expect was originally written with the expected time zone. Maybe there actually is no data in that span because of that, or the coordinator node's time zone setting is different.

Resources