How to set the date to UTC with PySpark when converting from Epoch Time without UDF? - apache-spark

I have these two ways to convert an epoch time to a date and a time:
f.to_date(f.from_unixtime(f.lit(1625838240)))
f.date_format(f.from_unixtime(f.lit(1625838240)), 'HH:mm:ss')
and from_unixtime uses the local (session) timezone.
How can I force the above expressions to always return UTC, without using a UDF?

Change the Spark session timezone; see the documentation:
from pyspark.sql.functions import from_unixtime

spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")  # set the session timezone
time_df = spark.createDataFrame([(1428476400,)], ['unix_time'])
time_df.select(from_unixtime('unix_time').alias('ts')).collect()
# [Row(ts='2015-04-08 00:00:00')]  -- the session timezone was applied
spark.conf.unset("spark.sql.session.timeZone")
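For the question as asked, set the session timezone to "UTC" instead; a minimal sketch, assuming an active SparkSession named spark:

from pyspark.sql import functions as f

spark.conf.set("spark.sql.session.timeZone", "UTC")  # make from_unixtime and friends use UTC

df = spark.createDataFrame([(1625838240,)], ['epoch'])
df.select(
    f.to_date(f.from_unixtime('epoch')).alias('date_utc'),
    f.date_format(f.from_unixtime('epoch'), 'HH:mm:ss').alias('time_utc'),
).show()
# 1625838240 is 2021-07-09 13:44:00 UTC, so this shows 2021-07-09 and 13:44:00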

Related

Timestamp Timezone Wrong/Missing in Spark/Databricks SQL Output

When converting a timestamp between timezones in databricks/spark sql, the timezone itself seems lost in the end result, and I can't seem to either keep it or add it back.
I have a bunch of UTC times and am using from_utc_timestamp() to convert them to a different timezone based on another field. The result is calculated correctly, but if I output it with a timezone it shows as UTC. It seems the conversion is done correctly, but the end result has no timezone stored with it (confirmed by this answer), so it uses the server zone as the timezone in all cases.
Example: Using the following SQL:
createTimestampUTC,
v.timezone,
date_format(from_utc_timestamp(createTimestampUTC, v.timezone),"yyyy-MM-dd'T'HH:mm:s Z") createTimestampLocal,
In the result, the third column has the conversions done correctly for each timezone, but the output itself still shows as being in the UTC timezone.
Repeating this with a lowercase z in the date_format function shows the same; namely, the conversions occur but the end result is still treated as UTC.
createTimestampUTC,
v.timezone,
date_format(from_utc_timestamp(createTimestampUTC, v.timezone),"yyyy-MM-dd'T'HH:mm:s z") createTimestampLocal,
I can also use an O in the format output instead of a Z or z, but this just gives me GMT instead of UTC; same output basically.
All the Databricks documentation or Stack Overflow questions I can find treat printing timezones as a matter of setting the Spark server timezone and outputting that way, or they do the conversion without keeping the resulting timezone. I'm trying to convert to multiple different timezones, though, and to keep the timezone in the output. I need the end result to include the local timezone (or its offset) in the formatted string.
Is there a way to do this? How do I either keep the timezone after the conversion, or add it back in the format I need based on the timezone column I have? Given that the conversion works, and that I can output the end result with a +0000 on it, all the functionality to do this seems to be there; how do I put it together?
Spark does not support TIMESTAMP WITH TIMEZONE datatype as defined by ANSI SQL. Even though there are some functions that convert the timestamp across timezones, this information is never stored. Databricks documentation on timestamps explains:
Spark SQL defines the timestamp type as TIMESTAMP WITH SESSION TIME ZONE, which is a combination of the fields (YEAR, MONTH, DAY, HOUR, MINUTE, SECOND, SESSION TZ) where the YEAR through SECOND field identify a time instant in the UTC time zone, and where SESSION TZ is taken from the SQL config spark.sql.session.timeZone.
In your case spark.sql.session.timeZone is UTC, so the Z symbol in the datetime pattern always renders the session timezone's offset, i.e. +0000. Therefore you will never get the behavior you want from date_format if you deal with multiple timezones in a single query.
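To see this concretely, a small PySpark illustration with a literal timestamp (the values are illustrative; the session timezone is assumed to be UTC, as in the question):

from pyspark.sql import functions as f

spark.conf.set("spark.sql.session.timeZone", "UTC")
demo = spark.sql("SELECT timestamp'2022-03-01 21:47:22' AS createTimestampUTC")
demo.select(
    f.date_format(
        f.from_utc_timestamp("createTimestampUTC", "America/New_York"),
        "yyyy-MM-dd'T'HH:mm:ss Z",
    ).alias("ts_local")
).show(truncate=False)
# 2022-03-01T16:47:22 +0000 -- the wall clock shifted to New York time,
# but Z still renders the session timezone (UTC), not America/New_York.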
The only thing you can do is to explicitly store timezone information in a column and manually append it for display.
concat(
  date_format(from_utc_timestamp(createTimestampUTC, v.timezone), "yyyy-MM-dd'T'HH:mm:ss.SSS "),
  v.timezone
) createTimestampLocal
This will display 2022-03-01T16:47:22.000 America/New_York. If you need an offset instead (such as -05:00), you will need to write a UDF to do the conversion, using Python or Scala libraries that handle datetime conversions.
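A hedged sketch of such a UDF, assuming Python 3.9+ (for zoneinfo), that the timezone column holds IANA names like America/New_York, and that df is the question's table with a TIMESTAMP column createTimestampUTC:

from datetime import datetime
from zoneinfo import ZoneInfo
from pyspark.sql import functions as f
from pyspark.sql.types import StringType

@f.udf(StringType())
def format_with_offset(epoch_seconds, tz_name):
    # Render the instant in the given IANA zone, including its numeric offset.
    if epoch_seconds is None or tz_name is None:
        return None
    local = datetime.fromtimestamp(epoch_seconds, tz=ZoneInfo(tz_name))
    return local.isoformat(timespec="seconds")  # e.g. 2022-03-01T16:47:22-05:00

# Pass the instant as epoch seconds so the UDF is independent of session/worker timezones.
df = df.withColumn(
    "createTimestampLocal",
    format_with_offset(f.unix_timestamp("createTimestampUTC"), f.col("timezone")),
)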

Get Epoch timestamp accurate by the day with datetime

I want to get a day-accurate (not hour, minutes, seconds) Epoch timestamp that remains the same throughout the day.
This is accurate to the second (and therefore still too precise):
from datetime import date, datetime
timestamp = datetime.today().strftime("%s")
Is there any simple way to make it less precise?
A Unix timestamp is by necessity accurate to the second, because it is a number counting seconds. The only thing you can do is choose a specific time which "stays constant" throughout the day, for which midnight probably makes the most sense:
from datetime import datetime, timezone
timestamp = datetime.now(timezone.utc).replace(hour=0, minute=0, second=0, microsecond=0).timestamp()
It depends on what you want.
If you just want a quick way, use either time.time_ns() or time.time(). Epoch time is what the system itself uses (on many OSes), so there is no conversion; the _ns() version avoids floating-point maths, so it is faster.
If you want to store it in a more efficient way, truncate it to the start of the day:
int(time.time()) - int(time.time()) % (24*60*60)  # epoch seconds at 00:00:00 UTC of the current day
Unix time, unlike most other time scales, treats every day as exactly 24*60*60 seconds long (leap seconds are discarded), so this truncation always lands on a UTC midnight.
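Combining the two answers, a small sketch of both ways to get a value that stays constant for the whole (UTC) day:

import time
from datetime import datetime, timezone

now = int(time.time())                  # current epoch seconds
day_start = now - now % (24 * 60 * 60)  # truncated to 00:00:00 UTC of the current day

# The same value via datetime, as in the first answer:
midnight = int(datetime.now(timezone.utc)
               .replace(hour=0, minute=0, second=0, microsecond=0)
               .timestamp())

print(day_start, midnight)  # identical (barring a midnight rollover between the calls)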

terraform setting time with timezone

I need to define a Terraform timestamp variable in a specific timezone.
All our business processes are scheduled in the Europe/Paris timezone.
This timezone oscillates between GMT+1 and GMT+2 over the course of the year.
How can I set my timestamp() variable to this timezone?
Terraform's timestamp-manipulation functions are intended for producing machine-readable timestamps in various formats, not for human-oriented timestamps. Therefore there are no built-in functions for converting to and from local ("wallclock") time.
The formatdate function is able to accept a timestamp containing a UTC offset and include that offset in its output, but Terraform has no built-in way to generate such a timestamp: the timestamp function always returns a UTC timestamp and there is no way to customize that. Also, UTC offsets are not the same thing as timezones because, as you've noted, timezones include daylight savings rules which cause the offsets to be different at different times of year.
If you wish to reinterpret Terraform's timestamps in your local timezone then you will need to do that outside of Terraform.
locals {
  now      = timestamp()
  paris_tz = timeadd(local.now, "2h")  # fixed +2h offset; does not track the CET/CEST switch
  date_fr  = formatdate("D-MM-YYYY", local.paris_tz)
  date_utc = formatdate("YYYY-MM-DD", local.now)
}
Then you can reference these values elsewhere as local.date_fr and local.date_utc.
credit: https://clebergasparoto.com/how-to-manipulate-date-and-time-with-terraform
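If the Europe/Paris offset has to be correct across the DST switch, one option is to post-process the UTC timestamp outside Terraform, as the answer notes; a minimal Python sketch, assuming Python 3.9+ for zoneinfo (the timestamp value is illustrative):

from datetime import datetime
from zoneinfo import ZoneInfo

ts = "2024-03-31T01:30:00Z"  # example of what Terraform's timestamp() returns (RFC 3339, UTC)
utc = datetime.fromisoformat(ts.replace("Z", "+00:00"))
paris = utc.astimezone(ZoneInfo("Europe/Paris"))
print(paris.isoformat())     # 2024-03-31T03:30:00+02:00 -- the CET/CEST switch is applied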

How to convert all the timestamps to EST in a Hive table

I have a Hive table which contains a timestamp field, and it can be in any timezone (UTC/PST/CST/...).
I want to convert all of them to a single timezone, EST; it can be done either in Hive or PySpark.
Basically, I am using this field in my PySpark application, which has grouping logic on it, and before doing that we want all the times in the Hive table converted to EST.
Sid
Note that Hive's TIMESTAMP type has limitations: a maximum representable time tied to the Y2K38 bug, and JDBC compatibility issues.
The TIMESTAMP type (serde2) supports Unix timestamps (1970-01-01 00:00:01 UTC to 2038-01-19 03:14:07 UTC) with optional nanosecond precision, using both the LazyBinary and LazySimple SerDes.
For LazySimpleSerDe, the data is stored as JDBC-compliant, java.sql.Timestamp-parsable strings.
HIVE-2272
See the discussion there about supporting timestamps earlier than 1970 and later than 2038.
Hive JDBC doesn't support TIMESTAMP column
Therefore, I think it would be better to use the Hive DATE or STRING data types; then you can use any timezone offset as the default when persisting.
/* utc_timestamp is the column name */
/* the following converts a timestamp in UTC to the EST timezone */
select from_utc_timestamp(utc_timestamp, 'EST') from table1;
Hope this helps.
Hive Data Types
Sidd, usually Hive uses the local timezone of the host where the data was written. The functions from_utc_timestamp() and to_utc_timestamp() can be very helpful. Instead of stating the timezone as UTC/EST, you should use a location/region name (e.g. America/New_York), since this accounts for daylight saving time.
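Since the question also mentions PySpark, a minimal sketch of the same conversion there (assuming df is the Hive table loaded as a DataFrame and utc_timestamp is the UTC timestamp column, as in the SQL example above):

from pyspark.sql import functions as f

# Use the region name rather than the EST abbreviation so daylight saving is handled automatically.
df = df.withColumn("ts_est", f.from_utc_timestamp("utc_timestamp", "America/New_York"))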
Here's a helpful link for more examples: Local Time Convert To UTC Time In Hive
In case you have further questions, please share what you have already tried and a sample snippet of your data for further investigation.

How to convert UTC Date Time to Local Date time without TimeZoneInfo class?

I want to convert a UTC date time to a local date time myself, and I do not want to use the .NET TimeZoneInfo class or other classes like it.
I know Tehran is at a GMT offset of +03:30. I use the code below to convert the UTC date time to Tehran time (my local computer is in this location):
DateTime dt = DateTime.UtcNow.AddHours(3.30);
It shows a time like 5/2/2014 8:32:05 PM, but the Tehran time is 5/2/2014 9:32:05 PM; there is a one hour difference.
How can I fix it?
I know Tehran is at a GMT offset of +03:30
Well, that's its offset from UTC in standard time, but it's currently observing daylight saving time (details). So the current UTC offset is actually +04:30, hence the difference of an hour.
I suspect you're really off by more than an hour though, as you're adding an offset of 3.3 hours, which is 3 hours and 18 minutes. The literal 3.30 doesn't mean "3 hours and 30 minutes", it means 3.30 as a double literal. If you want 3 hours and 30 minutes, that's 3 and a half hours, so you'd need to use 3.5 instead. The time in Tehran when you posted was 9:46 PM... so I suspect you actually ran the code at 9:44 PM.
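To see the difference concretely (illustrated in Python here; the same arithmetic applies to AddHours):

from datetime import timedelta

print(timedelta(hours=3.30))  # 3:18:00 -> 3 hours 18 minutes
print(timedelta(hours=3.5))   # 3:30:00 -> 3 hours 30 minutes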
This sort of thing is why you should really, really, really use a proper time-zone-aware system rather than trying to code it yourself. Personally I wouldn't use TimeZoneInfo - I'd use my Noda Time library, which allows you to use either the Windows time zones via TimeZoneInfo or the IANA time zone database. The latter - also known as Olson, or TZDB, or zoneinfo - is the most commonly used time zone database on non-Windows platforms.
