How to convert all the timestamps to EST in a Hive table - apache-spark

I have a Hive table which contains a timestamp field, and it can be in any timezone (UTC/PST/CST/...).
I want to convert all of them to a single timezone, EST; it can be done either in Hive or PySpark.
Basically, I am using it in my PySpark application, which has grouping logic on this datetime field, and before doing that we want all the times in the Hive table converted to EST.
Sid

Note that the Hive TIMESTAMP type has a limitation on the maximum time, related to the Y2K38 bug, and a JDBC compatibility issue:
The TIMESTAMP type for serde2 supports unix timestamps (1970-01-01 00:00:01 UTC to 2038-01-19 03:14:07 UTC) with optional nanosecond precision using both the LazyBinary and LazySimple SerDes.
For LazySimpleSerDe, the data is stored in JDBC-compliant java.sql.Timestamp parsable strings.
HIVE-2272
Here is a discussion related to supporting timestamps earlier than 1970 and later than 2038.
Hive JDBC doesn't support TIMESTAMP column
Therefore, I think it would be better to use the Hive Date or String data type. Then you can use any timezone offset as the default when persisting.
/* utc_timestamp is the column name */
/* below will convert a timestamp in UTC to EST timezone */
select from_utc_timestamp(utc_timestamp, 'EST') from table1;
Hope this helps.
Hive Data Types

Sid, usually Hive uses the local timezone of the host where the data was written. The functions from_utc_timestamp() and to_utc_timestamp() can be very helpful. Instead of stating the timezone as UTC/EST, you should rather use a location/region name in that case, since this will account for daylight saving time.
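For example, a region-based conversion in Hive (reusing the column and table names from the earlier snippet) could look like this; 'America/New_York' follows EST/EDT automatically:
/* convert a UTC timestamp to New York local time, honoring DST */
select from_utc_timestamp(utc_timestamp, 'America/New_York') from table1;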
Here's a helpful link for more examples: Local Time Convert To UTC Time In Hive
In case you have further questions, please share what you have already tried along with a sample snippet of your data for further investigation.

Related

Timestamp Timezone Wrong/Missing in Spark/Databricks SQL Output

When converting a timestamp between timezones in databricks/spark sql, the timezone itself seems lost in the end result, and I can't seem to either keep it or add it back.
I have a bunch of UTC times and am using from_utc_timestamp() to convert them to a different timezone based on another field. The result is calculated correctly, but if I output it with a timezone it shows as UTC. It seems the conversion is done correctly but the end result has no timezone stored with it (affirmed by this answer), so it uses the server zone for the timezone in all cases.
Example: Using the following SQL:
createTimestampUTC,
v.timezone,
date_format(from_utc_timestamp(createTimestampUTC, v.timezone),"yyyy-MM-dd'T'HH:mm:s Z") createTimestampLocal,
I get the following:
You can see that the third column has done the conversions correctly for the timezones, but the output itself still shows as being in UTC timezone.
Repeating this with a lowercase z in the date_format function shows the same; namely, the conversions occur but the end result is still treated as UTC.
createTimestampUTC,
v.timezone,
date_format(from_utc_timestamp(createTimestampUTC, v.timezone),"yyyy-MM-dd'T'HH:mm:s z") createTimestampLocal,
I can also use an O in the format output instead of a Z or z, but this just gives me GMT instead of UTC; same output basically.
All the Databricks documentation and Stack Overflow questions I can find seem to treat printing timezones as a matter of setting the Spark server time and outputting that way, or doing the conversion without keeping the resulting timezone. I'm trying to convert to multiple different timezones, though, and to keep the timezone in the output. I need to generate the end result in this format:
Is there a way to do this? How do I either keep the timezone after the conversion or add it back in the format I need based on the timezone column I have? Given that the conversion works, and that I can output the end result with a +0000 on it, all the functionality to do this seems to be there; how do I put it together?
Spark does not support the TIMESTAMP WITH TIME ZONE data type as defined by ANSI SQL. Even though there are some functions that convert timestamps across timezones, this information is never stored. The Databricks documentation on timestamps explains:
Spark SQL defines the timestamp type as TIMESTAMP WITH SESSION TIME
ZONE, which is a combination of the fields (YEAR, MONTH, DAY, HOUR,
MINUTE, SECOND, SESSION TZ) where the YEAR through SECOND field
identify a time instant in the UTC time zone, and where SESSION TZ is
taken from the SQL config spark.sql.session.timeZone.
In your case spark.sql.session.timeZone is UTC, and the Z symbol in the datetime pattern will always render UTC. Therefore you will never get the correct behavior from date_format if you deal with multiple timezones in a single query.
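(If every row needed the same single target timezone, one workaround would be to change the session timezone so that z/Z render that zone instead of UTC, for example:
SET spark.sql.session.timeZone = America/New_York;
but that does not help here, since the query mixes multiple timezones.)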
The only thing you can do is to explicitly store timezone information in a column and manually append it for display.
concat(
date_format(from_utc_timestamp(createTimestampUTC, v.timezone), "yyyy-MM-dd'T'HH:mm:ss.SSS "),
v.timezone
) createTimestampLocal
This will display 2022-03-01T16:47:22.000 America/New_York. If you need an offset (-05:00) you will need to write a UDF to do the conversion and use Python or Scala native libraries that handle datetime conversions.
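For reference, a complete query around that expression might look like the following (the table name events_v is only a placeholder for whatever v refers to in your query):
select
createTimestampUTC,
v.timezone,
concat(
date_format(from_utc_timestamp(createTimestampUTC, v.timezone), "yyyy-MM-dd'T'HH:mm:ss.SSS "),
v.timezone
) createTimestampLocal
from events_v v;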

Presto sql function date_parse fails for specific date (1960-01-01)

How do I resolve this Presto SQL error for date_parse('1960-01-01', '%Y-%m-%d')?
This function works fine for other dates.
This is due to a long-standing issue with how Presto models timestamps. Long story short, the implementation of timestamps is not compliant with the SQL specification and it incorrectly attempts to treat them as "point in time" or "instant" values and interpret them within a time zone specification. For some dates and time zone rules, the values are undefined due to daylight savings transitions, etc.
This was fixed in recent versions of Trino (formerly known as Presto SQL), so you may want to update.
By the way, you can convert a varchar to a date using the date() function or by casting the value to date:
trino> select date('1960-01-01');
_col0
------------
1960-01-01
(1 row)
trino> select cast('1960-01-01' as date);
_col0
------------
1960-01-01
(1 row)
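On a Trino version that includes the fix, the original call itself should work again, and its result can be cast down to a date if that is what you need:
trino> select cast(date_parse('1960-01-01', '%Y-%m-%d') as date);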

What is the data type of timestamp in cassandra

I notice that if my model has a field expirationTime of type DateTime then I cannot store it in a timestamp field in Cassandra.
QueryBuilder.set("expiration_time", model.expirationTime) //data gets corrupted
But if I store the time as milliseconds then it works.
QueryBuilder.set("expiration_time", model.expirationTime.getMillis()) //WORKS
Question 1 - Does that mean that the timestamp field in Cassandra is of type long?
Question 2 - Is it cqlsh which converts the time into a readable format like 2018-05-18 03:21+0530?
From DataStax documentation on CQL types:
Date and time with millisecond precision, encoded as 8 bytes since epoch. Can be represented as a string, such as 2015-05-03 13:30:54.234.
In Java, as input you can use either a long with milliseconds, a string literal supported in CQL, or java.util.Date (see the code). When reading, results are mapped into java.util.Date in driver 3.x/1.x (see the full table of CQL<->Java type mappings), or into java.time.Instant in driver 4.x/2.x (see CQL<->Java type mappings).
In Python/cqlsh, yes - the data is read as an 8-byte long, which is then converted into a string representation.
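To make that concrete, both of these CQL statements write the same instant (the keyspace, table, and id column are only placeholders for illustration):
-- milliseconds since the Unix epoch
INSERT INTO ks.events (id, expiration_time) VALUES (1, 1526593860000);
-- CQL timestamp string literal with an explicit offset
INSERT INTO ks.events (id, expiration_time) VALUES (2, '2018-05-18 03:21+0530');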

How to convert the Node.js date format to oracle datetime format

I need to convert the Node.js datetime '2016-07-13T07:38:15.500Z' to Oracle format, where the .500Z represents the timezone.
I'm using Oracle 11g.
The .500z doesn't represent the time zone. The .500 is fractional seconds. The z is short for Zulu, which means the time zone has been normalized to GMT/UTC.
In Oracle Database, the DATE data type doesn't support fractional seconds or time zones. For fractional seconds you'd need to use any of the 3 TIMESTAMP data types. If you need to store the actual time zone, use TIMESTAMP WITH TIME ZONE. If you don't need the actual time zone (most people don't) and want to make converting time zones really easy, then use TIMESTAMP WITH LOCAL TIME ZONE.
You haven't provided enough information to offer much of an answer, but here's a simple example that may be sufficient:
select to_date('2016-07-13T07:38:15', 'YYYY-MM-DD"T"HH24:MI:SS'),
to_timestamp('2016-07-13T07:38:15.500Z', 'YYYY-MM-DD"T"HH24:MI:SS.FF"Z"')
from dual;
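If you do need to keep the zone, one sketch (building on the same format mask) is to parse the string as a TIMESTAMP and then attach UTC with FROM_TZ, which yields a TIMESTAMP WITH TIME ZONE:
select from_tz(to_timestamp('2016-07-13T07:38:15.500Z', 'YYYY-MM-DD"T"HH24:MI:SS.FF"Z"'), 'UTC')
from dual;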

Cassandra select query with timezone issue

We have two different Cassandra clusters in two different timezones.
Cluster1: version 2.1.8, with IST TZ
Cluster2: version 2.1.9, with UTC TZ
On cluster1, for a select query with a timestamp column, I need not mention the TZ [+0530] value, whereas on the other cluster I must provide the TZ value in the select query to fetch the row. Is this to do with the Cassandra version?
I use cqlsh to do the queries. I tried the cqlshrc file option, which only changes the output format.
cluster1:
select * from test.check where row_timestamp = '1970-01-01 00:00:00';
cluster2:
select * from test.check where row_timestamp = '1970-01-01 00:00:00+0000';
If no TZ is mentioned, I get 0 rows.
I don't want to give the TZ in cluster2; please advise how to do that.
It is a bit strange, I must admit, but there might have been some changes in time zone manipulation between 2.1.8 and 2.1.9. This is from the changelog:
(cqlsh) Fix timestamps before 1970 on Windows, always use UTC for
timestamp display (CASSANDRA-10000)
On the other hand, the documentation is quite clear on this issue:
If no time zone is specified, the time zone of the Cassandra
coordinator node handing the write request is used. For accuracy,
DataStax recommends specifying the time zone rather than relying on
the time zone configured on the Cassandra nodes.
So, my sincere recommendation is to specify the time zone, and always specify the same one, presumably GMT (or UTC). Save yourself the headache. Mind, GMT is not exactly equal to UTC; there is a slight difference in meaning. That way, you can ignore the time zone settings on the clusters. The timestamp is ultimately stored as a number of milliseconds (from a certain point in time). The time zone information is purely a "rendering" thing. The number of milliseconds passed is the same in, for example, 2015/03/05 14:00:00+0100 and 2015/03/05 16:00:00+0300.
If you are specifying nothing and getting 0 results, while you do get results when you use +0000, then make sure that the data you are expecting was originally written with the expected time zone. Maybe there actually isn't any data in that span because of that, or the coordinating node's timestamp is different.
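As a concrete illustration of being explicit on both sides, using the table from the question (assuming row_timestamp is the partition key; other columns are omitted):
-- write and read with the same explicit offset
INSERT INTO test.check (row_timestamp) VALUES ('1970-01-01 00:00:00+0000');
select * from test.check where row_timestamp = '1970-01-01 00:00:00+0000';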
