Presto SQL function date_parse fails for a specific date (1960-01-01)

How do I resolve this Presto SQL error for date_parse('1960-01-01', '%Y-%m-%d')?
The function works fine for other dates.

This is due to a long-standing issue with how Presto models timestamps. Long story short, the implementation of timestamps is not compliant with the SQL specification: it incorrectly treats them as "point in time" (instant) values and interprets them within a time zone. For some dates and time zone rules, the values are undefined because of daylight saving time transitions and similar rules.
This was fixed in recent versions of Trino (formerly known as Presto SQL), so you may want to update.
By the way, you can convert a varchar to a date using the date() function or by casting the value to date:
trino> select date('1960-01-01');
_col0
------------
1960-01-01
(1 row)
trino> select cast('1960-01-01' as date);
_col0
------------
1960-01-01
(1 row)
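For completeness, on a recent Trino version the original date_parse call itself also succeeds, and you can narrow the result down to a date if that is the type you actually need. This is a sketch rather than a captured session, so the output shown is the expected result:
trino> select cast(date_parse('1960-01-01', '%Y-%m-%d') as date);
_col0
------------
1960-01-01
(1 row)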

Related

Timestamp Timezone Wrong/Missing in Spark/Databricks SQL Output

When converting a timestamp between time zones in Databricks/Spark SQL, the time zone itself seems to be lost in the end result, and I can't seem to either keep it or add it back.
I have a bunch of UTC times and am using from_utc_timestamp() to convert them to a different time zone based on another field. The result is calculated correctly, but if I output it with a time zone it shows as UTC. It seems the conversion is done correctly but the end result has no time zone stored with it (affirmed by this answer), so it uses the server zone as the time zone in all cases.
Example: Using the following SQL:
createTimestampUTC,
v.timezone,
date_format(from_utc_timestamp(createTimestampUTC, v.timezone),"yyyy-MM-dd'T'HH:mm:s Z") createTimestampLocal,
I get the following:
You can see that the third column has done the conversions correctly for the timezones, but the output itself still shows as being in UTC timezone.
Repeating this with a lowercase z in the date_format function shows the same; namely, the conversions occur but the end result is still treated as UTC.
createTimestampUTC,
v.timezone,
date_format(from_utc_timestamp(createTimestampUTC, v.timezone),"yyyy-MM-dd'T'HH:mm:s z") createTimestampLocal,
I can also use an O in the format output instead of a Z or z, but this just gives me GMT instead of UTC; same output basically.
All the databricks documentation or stackoverflow questions I can find seem to treat printing timezones as a matter of setting the spark server time and outputting that way, or doing the conversion without keeping the resulting timezone. I'm trying to convert to multiple different timezones though, and to keep the timezone in the output. I need to generate the end result in this format:
Is there a way to do this? How do I either keep the time zone after the conversion, or add it back in the format I need based on the timezone column I have? Given that the conversion works, and that I can output the end result with a +0000 on it, all the functionality to do this seems to be there; how do I put it together?
Spark does not support the TIMESTAMP WITH TIME ZONE data type defined by ANSI SQL. Even though there are some functions that convert the timestamp across time zones, this information is never stored. The Databricks documentation on timestamps explains:
Spark SQL defines the timestamp type as TIMESTAMP WITH SESSION TIME
ZONE, which is a combination of the fields (YEAR, MONTH, DAY, HOUR,
MINUTE, SECOND, SESSION TZ) where the YEAR through SECOND field
identify a time instant in the UTC time zone, and where SESSION TZ is
taken from the SQL config spark.sql.session.timeZone.
In your case spark.sql.session.timeZone is UTC, and the Z symbol in the datetime pattern always renders the session time zone, so you get UTC. Therefore you will never get the correct behavior with date_format if you deal with multiple time zones in a single query.
The only thing you can do is to explicitly store timezone information in a column and manually append it for display.
concat(
  date_format(from_utc_timestamp(createTimestampUTC, v.timezone), "yyyy-MM-dd'T'HH:mm:ss.SSS "),
  v.timezone
) createTimestampLocal
This will display 2022-03-01T16:47:22.000 America/New_York. If you need an offset (-05:00) you will need to write a UDF to do the conversion and use Python or Scala native libraries that handle datetime conversions.
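For context, a complete query using that workaround might look like the sketch below. The table names and the join are assumptions standing in for however createTimestampUTC and v.timezone are actually sourced in the original query; only the expression in the last column is the point being illustrated:
select
  e.createTimestampUTC,
  v.timezone,
  concat(
    date_format(from_utc_timestamp(e.createTimestampUTC, v.timezone), "yyyy-MM-dd'T'HH:mm:ss.SSS "),
    v.timezone
  ) as createTimestampLocal
from events e                          -- hypothetical fact table
join timezone_lookup v on v.id = e.id  -- hypothetical source of the per-row timezone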

How to convert all the timestamps to EST in a Hive table

I have a Hive table which contains a timestamp field, and it can have any time zone (UTC/PST/CST/...).
I want to convert all of them to a single time zone, EST. It can be done either in Hive or PySpark.
Basically, I am using it in my PySpark application, which has grouping logic on this datetime field, and before doing that we want all the times in the Hive table to be converted to EST.
Sid
Note the fact that Hive's TIMESTAMP type has a limitation on the maximum time it can represent, related to the Y2K38 bug and a JDBC compatibility issue:
The TIMESTAMP type maps to serde2 and supports unix timestamps (1970-01-01 00:00:01 UTC to 2038-01-19 03:14:07 UTC) with optional nanosecond precision, using both the LazyBinary and LazySimple SerDes. For LazySimpleSerDe, the data is stored in JDBC-compliant java.sql.Timestamp parsable strings.
HIVE-2272
Here is a discussion related to supporting timestamps earlier than 1970 and later than 2038:
Hive JDBC doesn't support TIMESTAMP column
Therefore, I think it will be better if you use the Hive Date or String data type. Then you can use any time zone offset as the default when persisting.
/* utc_timestamp is the column name */
/* the query below converts a timestamp in UTC to the EST timezone */
select from_utc_timestamp(utc_timestamp, 'EST') from table1;
Hope this helps.
Hive Data Types
Sidd, usually Hive uses the local time zone of the host where the data was written. The functions from_utc_timestamp() and to_utc_timestamp() can be very helpful. Instead of stating the time zone as UTC/EST, you should rather use a location/region ID, since that accounts for daylight saving time.
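As a quick sketch of that point (the table and column names here are made up for illustration), a region-based zone ID tracks daylight saving time, while the bare 'EST' abbreviation is a fixed UTC-5 offset year-round:
select
  from_utc_timestamp(event_ts, 'America/New_York') as ts_eastern,   -- DST-aware: EST in winter, EDT in summer
  from_utc_timestamp(event_ts, 'EST')              as ts_fixed_est  -- always UTC-5, even in summer
from my_events;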
Here's a helpful link for more examples: Local Time Convert To UTC Time In Hive
In case you have further questions, please share what have you already tried and share a sample snippet of your data for investigating further.

Cassandra timeuuid column to nanoseconds precision

A Cassandra table has a timeuuid data type column, so how do I see the value of the timeuuid in nanoseconds?
timeuuid:
49cbda60-961b-11e8-9854-134d5b3f9cf8
49cbda60-961b-11e8-9854-134d5b3f9cf9
How do I convert this timeuuid to nanoseconds?
I need a select statement like:
select dateOf(timeuuid) from table a;
There is a utility method in the driver, UUIDs.unixTimestamp(UUID id), that returns a normal epoch timestamp which can be converted into a Date object.
Worth noting that nanosecond precision from the time UUID will not necessarily be meaningful. A type 1 UUID includes a timestamp which is the number of 100-nanosecond intervals since midnight, October 15, 1582 UTC (when the Gregorian calendar was first adopted). But the driver takes a 1 ms timestamp (the precision really depends on the OS, and can even be 10 or 40 ms) and keeps a monotonic counter to fill the remaining 10,000 units of unused precision, which can end up counting into the future if more than 10k values are generated within a single millisecond (note: performance limitations will ultimately prevent this). This is much more performant and guarantees no duplicates, especially as sub-millisecond time accuracy in computers is pretty meaningless in a distributed system.
So if you're looking at it from a purely CQL perspective, there's no way to do it without a UDF; not that there is much value in going beyond millisecond precision anyway, so dateOf should be sufficient. If you REALLY want it though:
CREATE OR REPLACE FUNCTION uuidToNS (id timeuuid)
CALLED ON NULL INPUT RETURNS bigint
LANGUAGE java AS '
return id.timestamp();
';
This will give you the number of 100 ns intervals since October 15, 1582. To translate that to nanoseconds from the epoch, multiply it by 100 to convert to nanos and add the difference from epoch time (-12219292800L * 1_000_000_000 in nanos). This might overflow longs, so you might need to use something different.
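If you really do want epoch nanoseconds straight out of CQL, a variant of the same UDF can apply that arithmetic inline. This is a sketch under the same caveats (user-defined functions must be enabled in cassandra.yaml, and the value is only as meaningful as the driver's counter-based fill described above):
CREATE OR REPLACE FUNCTION uuidToEpochNS (id timeuuid)
CALLED ON NULL INPUT RETURNS bigint
LANGUAGE java AS '
  // id.timestamp() is in 100 ns units counted from 1582-10-15; shift to the Unix epoch, then scale to ns
  return (id.timestamp() - 122192928000000000L) * 100L;
';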

Error while running range query on multiple clustering columns using spark cassandra connector:

Following is the Cassandra table schema:
CREATE TABLE my_table (
year text,
month text,
day text,
hour int,
min int,
sec int,
PRIMARY KEY ((year, month, day), hour, min, sec) )
If I run the following query using Cassandra CQL, it works:
SELECT * FROM my_table WHERE year ='2017' and month ='01' and day ='16' and (hour,min,sec) > (1,15,0) LIMIT 200
However, when I run the same query using the Spark Cassandra connector, it does not work:
sparkSession.read().format("org.apache.spark.sql.cassandra").options(map).load()
.where("year = '2017' and month = '01' and day = '16' and (hour, min, sec) >= (1, 15, 0)");
I am getting the following exception in the logs:
> Exception in thread "main" org.apache.spark.sql.AnalysisException:
> cannot resolve '(struct(`hour`, `min`, `sec`) >= struct(1, 15, 0))'
> due to data type mismatch: differing types in '(struct(`hour`, `min`,
> `sec`) >= struct(1, 15, 0))' and (struct<hour:int,min:int,sec:int>
> struct<col1:int,col2:int,col3:int>).; line 1 pos 96
Spark-cassandra-connector version:2.0.0-M3
Spark-version:2.0.0
Any help is much appreciated
Quite simply, CQL is not Spark SQL or Catalyst compatible. What you are seeing is a conflict in syntax.
This where clause:
.where("year = '2017' and month = '01' and day = '16' and (hour, min, sec) >= (1, 15, 0)")
is not directly pushed down to Cassandra. Instead, it is transformed into Catalyst predicates. This is where you have a problem.
Catalyst sees this:
(hour,min,sec) >= (1,15,0)
and tries to assign types to both sides.
The left-hand side becomes
struct<hour:int,min:int,sec:int>
The right-hand side becomes
struct<col1:int,col2:int,col3:int>
These are not tuples but explicitly typed structs. They cannot be directly compared, hence your error. In the DataFrame API you could define a new struct with the correct types and make a literal of it, but I'm not sure how to express that in Spark SQL.
Regardless, this tuple predicate will not be pushed down to Cassandra. The struct you are defining from hour, min, sec is hidden from Cassandra, because the underlying table doesn't provide a struct<hour, min, sec>; this means Spark thinks it needs to generate that struct after pulling the data from Cassandra.
You are better off just using separate clauses combined with AND, as suggested by @AkashSethi.
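As a sketch of that suggestion (assuming the table has been registered as a temporary view for Spark SQL; the registration call itself is not shown in the question), the tuple comparison can be expanded into plain column predicates that Catalyst has no trouble typing. Note that a lexicographic tuple comparison expands into nested OR/AND conditions, not a simple conjunction of per-column filters:
SELECT *
FROM my_table
WHERE year = '2017' AND month = '01' AND day = '16'
  AND (`hour` > 1
       OR (`hour` = 1 AND `min` > 15)
       OR (`hour` = 1 AND `min` = 15 AND `sec` >= 0))
-- Only the simple equality predicates on the partition key columns are candidates for
-- pushdown to Cassandra; the OR block is typically evaluated by Spark after the scan.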

Cassandra select query with timezone issue

We have two different Cassandra clusters in two different time zones.
Cluster1: 2.1.8 version, with IST TZ
Cluster2: 2.1.9 version, with UTC TZ
On cluster1, for a select query on a timestamp column, I do not need to mention the TZ [+0530] value, whereas on the other cluster I must provide the TZ value in the select query to fetch the row. Is this to do with the Cassandra version?
I use cqlsh to run the queries. I tried the cqlshrc file option, which only changes the format of the output.
cluster1:
select * from test.check where row_timestamp = '1970-01-01 00:00:00';
cluster2:
select * from test.check where row_timestamp = '1970-01-01 00:00:00+0000';
If no TZ is mentioned, I get 0 rows.
I don't want to give the TZ on cluster2; please advise how to do that.
It is a bit strange, I must admit, but there might have been some changes in time zone handling between 2.1.8 and 2.1.9. This is from the changelog:
(cqlsh) Fix timestamps before 1970 on Windows, always use UTC for
timestamp display (CASSANDRA-10000)
On the other hand, the documentation is quite clear on this issue:
If no time zone is specified, the time zone of the Cassandra
coordinator node handing the write request is used. For accuracy,
DataStax recommends specifying the time zone rather than relying on
the time zone configured on the Cassandra nodes.
So, my sincere recommendation is to specify the time zone, and to specify the same one everywhere, presumably GMT (or UTC). Save yourself the headache. Mind, GMT is not exactly equal to UTC; there is a slight difference in meaning. That way, you can ignore the time zone settings on the clusters. The timestamp is ultimately stored as a number of milliseconds (from a certain point in time); the time zone information is purely a "rendering" thing. The number of milliseconds elapsed is the same in, for example, 2015/03/05 14:00:00+0100 and 2015/03/05 16:00:00+0300.
If you specify nothing and get 0 results, while you do get results when you use +0000, then make sure the data you are expecting was originally written with the expected time zone. Maybe there actually isn't any data at that point in time because of that, or the coordinator node's time zone is different.
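A small illustration of why this matters, using the table from the question: the same wall-clock string names two different instants (and therefore two different stored millisecond values) depending on the offset, so spelling out the offset removes the ambiguity:
/* epoch millisecond 0 */
select * from test.check where row_timestamp = '1970-01-01 00:00:00+0000';
/* the same wall-clock time in IST is epoch millisecond -19800000 */
select * from test.check where row_timestamp = '1970-01-01 00:00:00+0530';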
