I have a Spark dataframe that contains a Unix (epoch) time and a time zone name. I want to convert the epoch time to local time according to each row's tz_name.
Here is what my data looks like:
data = [
(1420088400, 'America/New_York'),
(1420088400, 'America/Los_Angeles'),
(1510401180, 'America/New_York'),
(1510401180, 'America/Los_Angeles')]
df = spark.createDataFrame(data, ["epoch_time", "tz_name"])
df.createOrReplaceTempView("df")
df1 = spark.sql("""select *, from_unixtime(epoch_time) as gmt_time,
                   from_utc_timestamp(from_unixtime(epoch_time), tz_name) as local_time
                   from df""")
df1.show(truncate=False)
Here is the result:
+----------+-------------------+-------------------+---------------------+
|epoch_time|tz_name |gmt_time |local_time |
+----------+-------------------+-------------------+---------------------+
|1420088400|America/New_York |2015-01-01 05:00:00|2015-01-01 00:00:00.0|
|1420088400|America/Los_Angeles|2015-01-01 05:00:00|2014-12-31 21:00:00.0|
|1510401180|America/New_York |2017-11-11 11:53:00|2017-11-11 06:53:00.0|
|1510401180|America/Los_Angeles|2017-11-11 11:53:00|2017-11-11 03:53:00.0|
+----------+-------------------+-------------------+---------------------+
I'm not quite sure whether this conversion is right, but it seems daylight saving time has been taken care of.
Should I first change the epoch time to a time string using from_unixtime, then change it to a UTC timestamp using to_utc_timestamp, and finally change that UTC timestamp to local time with tz_name? I tried this but got an error:
df2 = spark.sql("""select *, from_unixtime(epoch_time) as gmt_time,
from_utc_timestamp(from_unixtime(epoch_time), tz_name) as local_time,
from_utc_timestamp(to_utc_timestamp(from_unixtime(epoch_time),from_unixtime(unix_timestamp(), 'z')), tz_name) as newtime from df""")
How can I check my EMR server's time zone?
I tried the following; is this the server time zone?
spark.sql("select from_unixtime(unix_timestamp(), 'z')").show()
which gave me:
+--------------------------------------------------------------------------+
|from_unixtime(unix_timestamp(current_timestamp(), yyyy-MM-dd HH:mm:ss), z)|
+--------------------------------------------------------------------------+
| UTC|
+--------------------------------------------------------------------------+
Thank you for your clarification.
When you call from_unixtime it formats the date using your Java runtime's default time zone, since it just uses the default time zone for SimpleDateFormat. In your case that is UTC, so to convert the values to local time you only need to call from_utc_timestamp with the tz_name value passed in. However, if you were to change your system time zone, you would need to call to_utc_timestamp first.
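A minimal PySpark sketch of that advice, using the df from the question (with a UTC system time zone, from_unixtime gives the UTC wall-clock string and from_utc_timestamp shifts it into each row's tz_name):
from pyspark.sql.functions import expr, from_unixtime

# mirror the SQL version with the DataFrame API; expr() lets tz_name be passed as a column
df.withColumn("gmt_time", from_unixtime("epoch_time")) \
  .withColumn("local_time", expr("from_utc_timestamp(from_unixtime(epoch_time), tz_name)")) \
  .show(truncate=False)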
Spark 2.2 introduces a time zone setting, so you can set the time zone for your SparkSession like so
spark.conf.set("spark.sql.session.timeZone", "GMT")
In that case the time functions will use GMT instead of your system time zone; see the source here.
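For example, a small sketch of the effect (assumes an existing SparkSession named spark; 1420088400 is one of the epoch values from the question):
# with the session time zone set, from_unixtime formats using that zone instead of the JVM default
spark.conf.set("spark.sql.session.timeZone", "America/New_York")
spark.sql("select from_unixtime(1420088400) as local_str").show()
# expected 2015-01-01 00:00:00 here, versus 2015-01-01 05:00:00 when the session time zone is GMT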
I am using Spark 2.4 and the code below to cast a string datetime column (rec_dt) in a dataframe (df1) to a timestamp (rec_date) and create another dataframe (df2).
All of the datetime values are parsed correctly except for the values that fall on daylight saving transitions.
My session time zone is 'Europe/London'; I do not want to store the data in UTC, and in the end I have to write the data in the 'Europe/London' time zone only.
spark_session.conf.get("spark.sql.session.timeZone")
# Europe/London
Code:
df2 = df1.withColumn("rec_date", to_timestamp("rec_dt","yyyy-MM-dd-HH.mm.ss"))
Output:
Please help.
I am getting a datetime in UTC format from Data Factory in Databricks.
I am trying to convert it into a Databricks timestamp and insert it into the database.
Format I am receiving: 2020-11-02T01:00:00Z
Convert into: 2020-11-02T01:00:00.000+0000 (ISO format)
I tried to convert the string with isoformat()
df.selectExpr("make_timestamp(YEAR, MONTH, DAY, HOUR, MINUTE, SECOND) as MAKE_TIMESTAMP")
and then
spark.sql("INSERT INTO test VALUES (1, 1, 'IMPORT','"+ slice_date_time.isoformat() +"','deltaload',0, '0')")
But when I try to insert it I get the error:
Cannot safely cast 'start_time': string to timestamp
Cannot safely cast 'end_time': string to timestamp;
I also tried make_timestamp, but I still get the same error.
Using the Spark DataFrame API:
# create dataframe
list_data = [(1, '2020-11-02T01:00:00Z'), (2, '2020-11-03T01:00:00Z'), (3, '2020-11-04T01:00:00Z')]
df = spark.createDataFrame(list_data, ['id', 'utc_time'])
# make sure to set your timezone in spark conf
from pyspark.sql.functions import to_timestamp, date_format
spark.conf.set('spark.sql.session.timeZone', 'UTC')
df.select("utc_time").withColumn('iso_time', date_format(to_timestamp(df.utc_time, "yyyy-MM-dd'T'HH:mm:ssXXX"),
"yyyy-MM-dd'T'HH:mm:ss.SSSZ"
).alias('iso_time')).show(10, False)
+--------------------+----------------------------+
|utc_time |iso_time |
+--------------------+----------------------------+
|2020-11-02T01:00:00Z|2020-11-02T01:00:00.000+0000|
|2020-11-03T01:00:00Z|2020-11-03T01:00:00.000+0000|
|2020-11-04T01:00:00Z|2020-11-04T01:00:00.000+0000|
+--------------------+----------------------------+
Try to store iso_time directly into the database; if your database expects a different datetime format, adjust the pattern yyyy-MM-dd'T'HH:mm:ss.SSSZ accordingly.
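If the target table columns are of TIMESTAMP type (which is what the "Cannot safely cast 'start_time': string to timestamp" error suggests), a hedged alternative is to insert the parsed timestamp itself rather than a formatted string; the table name and column list below are illustrative assumptions, not the actual schema from the question:
from pyspark.sql.functions import to_timestamp

# parse the UTC string into a proper timestamp column and append it to an assumed table
out = df.withColumn("start_time", to_timestamp("utc_time", "yyyy-MM-dd'T'HH:mm:ssXXX"))
out.select("id", "start_time").write.mode("append").saveAsTable("test_times")  # hypothetical table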
Which is the correct, ideal, or preferred method to convert a CST date and/or datetime field to UTC (DST-aware) and store it in MongoDB in ISO format with Python/PyMongo? The source date/datetime field can come from any time zone (right now we know it is CST); I need to convert all of them to UTC and store them in the target MongoDB.
As per the MongoDB docs, MongoDB stores times in UTC by default and converts any local time representations into this form. Applications that must operate on or report an unmodified local time value may store the time zone alongside the UTC timestamp and compute the original local time in their application logic.
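A hedged PyMongo sketch of that recommendation (the connection, database, and collection names are illustrative assumptions): store the UTC datetime as a Date and keep the original zone in a separate field.
from datetime import datetime
import pytz
from pymongo import MongoClient

local_tz = pytz.timezone("US/Central")
local_dt = local_tz.localize(datetime(2017, 1, 2, 12, 43))
doc = {
    "event_time": local_dt.astimezone(pytz.utc),  # MongoDB stores this as a UTC Date
    "event_tz": "US/Central",                     # original time zone kept alongside
}
MongoClient()["demo_db"]["events"].insert_one(doc)  # assumes a local MongoDB instance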
Here are the methods I have tried:
Method #1: with a timestamp (local time zone defined)
from datetime import datetime
import pytz
local_timezone = pytz.timezone("US/Central")
utc_datetime = local_timezone.localize(datetime.strptime ("1/2/2017 12:43 pm",'%m/%d/%Y %H:%M %p'),is_dst=True).astimezone(pytz.utc)
print(utc_datetime)
print(type(utc_datetime))
2017-01-02 18:43:00+00:00
<class 'datetime.datetime'>
Without a timestamp, i.e. just a date, it adds an offset of 6 hours (5 hours during DST). Without astimezone(pytz.utc) it returns a date/time like 2017-01-02 00:00:00-06:00, i.e. showing the -6 hour offset. Should we really be using astimezone(pytz.utc)?
from datetime import datetime
import pytz
local_timezone = pytz.timezone("US/Central")
utc_datetime = local_timezone.localize(datetime.strptime ("1/2/2017",'%m/%d/%Y'),is_dst=True).astimezone(pytz.utc)
print(utc_datetime)
print(type(utc_datetime))
2017-01-02 06:00:00+00:00
<class 'datetime.datetime'>
Method #2: with a timestamp (local time zone NOT defined)
from datetime import datetime, timezone
utc_datetime=datetime.utcfromtimestamp(datetime.strptime ("1/2/2017 12:43 pm",'%m/%d/%Y %H:%M %p').replace(tzinfo = timezone.utc).timestamp())
print(utc_datetime)
print(type(utc_datetime))
2017-01-02 12:43:00
<class 'datetime.datetime'>
Without a timestamp, i.e. just the date part (no offset):
from datetime import datetime, timezone
utc_datetime=datetime.utcfromtimestamp(datetime.strptime ("1/2/2017",'%m/%d/%Y').replace(tzinfo = timezone.utc).timestamp())
print(utc_datetime)
print(type(utc_datetime))
2017-01-02 00:00:00
<class 'datetime.datetime'>
After loading into MongoDB, a "Z" is appended to the date/timestamp. Should I also pass tz_aware=True when creating the MongoClient?
ISO format: converting the above UTC timestamp with isoformat() returns a string, which gets loaded into MongoDB as a string instead of a Date. So how do we ensure it is still stored as an ISO Date in MongoDB?
utc_datetime_iso = datetime.utcfromtimestamp(datetime.strptime("1/2/2017", '%m/%d/%Y').replace(tzinfo=timezone.utc).timestamp()).isoformat()
print(utc_datetime_iso)
print(type(utc_datetime_iso))
2017-01-02T00:00:00
<class 'str'>
I have never worked with Python, so I can only give some general notes.
Never store date/time values as strings; use a proper Date object. Storing date/time values as strings is usually a design failure.
All Date values in MongoDB are stored in UTC, always and only. Some client applications implicitly convert UTC to local time and display local values, but internally MongoDB always stores UTC.
If you run db.collection.insertOne({ts: ISODate("2020-09-07T14:00:00+02:00")}) then MongoDB stores ISODate("2020-09-07T12:00:00Z"), the original time zone information is lost. If you need to preserve the original time zone, then you have to store it in a separate field.
ISODate is just an alias for new Date; however, there is a difference. If you don't specify any time zone (e.g. "2020-09-07T14:00:00"), then new Date() assumes local time whereas ISODate() assumes UTC. I don't know which method is used internally by Python.
So, in a UTC+02:00 local zone, new Date("2020-09-07T14:00:00") results in 2020-09-07 12:00:00Z, whereas ISODate("2020-09-07T14:00:00") results in 2020-09-07 14:00:00Z.
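From Python's side, a minimal PyMongo sketch of these points (database and collection names are assumptions): pass a datetime object rather than an isoformat() string and PyMongo stores a BSON Date; naive datetimes are treated as UTC, and tz_aware=True makes reads come back as timezone-aware UTC datetimes.
from datetime import datetime, timezone
from pymongo import MongoClient

coll = MongoClient(tz_aware=True)["demo_db"]["events"]
coll.insert_one({"ts": datetime(2020, 9, 7, 14, 0, tzinfo=timezone.utc)})  # stored as ISODate("2020-09-07T14:00:00Z")
coll.insert_one({"ts_str": "2020-09-07T14:00:00"})                         # stored as a plain string, not a Date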
I need to get the date and time of another country:
dateFormat = "%Y%m%d_%H%M"
ts=spark.sql(""" select current_timestamp() as ctime """).collect()[0]["ctime"]
ts.strftime(dateFormat)
You don't need PySpark for such a task, especially when you call .collect():
import pytz
from datetime import datetime
tz = pytz.timezone('Asia/Shanghai')
ts = datetime.now(tz)
ts.strftime('%Y%m%d_%H%M')
The session time zone is set with the configuration spark.sql.session.timeZone and defaults to the JVM system's local time zone. You can change the time zone; setting your own time zone will give you the proper date:
spark.conf.set("spark.sql.session.timeZone", "UTC")
I am hoping that the conclusion I stated in the title of this post is not correct. I have actually found a roundabout way to render timestamps in Java's DateTimeFormatter.ISO_INSTANT format, but my way is very clunky, and I am hoping there is some out-of-the-box way in Spark SQL to do this that I simply have not found yet.
Here is my cumbersome way to do this (from spark-shell):
scala> val df = List("1970-01-01 00:00:00.0").toDF("timestr").
| withColumn("ts", col("timestr").cast("timestamp")).
| withColumn("startOfEpochISO8601", expr("concat(replace(ts, ' ', 'T'), 'Z')"))
df: org.apache.spark.sql.DataFrame = [timestr: string, ts: timestamp ... 1 more field]
scala> df.show(false)
+---------------------+-------------------+--------------------+
|timestr |ts |startOfEpochISO8601 |
+---------------------+-------------------+--------------------+
|1970-01-01 00:00:00.0|1970-01-01 00:00:00|1970-01-01T00:00:00Z|
+---------------------+-------------------+--------------------+
Now, I think there are ways to render timestamps in ISO 8601 format if we are outputting to CSV or JSON. But I'd like to frame this question in terms of how one would do this when writing to some custom output format (without first writing to JSON or CSV to take advantage of the existing ISO 8601 formatting support, then reading that back and re-writing it to the custom format). The only way I can think of is what I showed above.
Please let me know if you have something better!
UPDATE:
I accepted @Gelerion's answer because it put me on the right track, but I am adding my own answer because there is also the non-obvious requirement that spark.sql.session.timeZone be set to UTC or GMT for the output to be correct.
If I understood you right:
val timestamped = List("1970-01-01 00:00:00.0").toDF("timestr")
timestamped.select(date_format($"timestr", "yyyy-MM-dd'T'HH:mm:ss.SS'Z'")).show()
+-------------------------------------------------+
|date_format(timestr, yyyy-MM-dd'T'HH:mm:ss.SS'Z')|
+-------------------------------------------------+
|1970-01-01T00:00:00.00Z |
+-------------------------------------------------+
I accepted @Gelerion's answer because it put me on the right track, but I am posting this answer as a supplement because there is also the non-obvious requirement that spark.sql.session.timeZone be set to UTC or GMT for the output to be correct.
The snippet below relativizes the time 1970-01-01T00:00:00 in a time zone one hour behind UTC to the correct UTC value, which is exactly the start of the Unix epoch. This works correctly:
spark.conf.set("spark.sql.session.timeZone", "GMT")
List("1970-01-01T00:00:00-01:00").toDF("timestr").
withColumn("ts", col("timestr").cast("timestamp")).
withColumn("tsAsInt", col("ts").cast("integer")).
withColumn("asUtc", date_format($"ts", "yyyy-MM-dd'T'HH:mm:ssX")).
show(false)
// RESULT:
// +-------------------------+-------------------+-------+--------------------+
// |timestr |ts |tsAsInt|asUtc |
// +-------------------------+-------------------+-------+--------------------+
// |1970-01-01T00:00:00-01:00|1970-01-01 01:00:00|3600 |1970-01-01T01:00:00Z|
// +-------------------------+-------------------+-------+--------------------+
This shows that if you fail to set spark.sql.session.timeZone to GMT or UTC you will not get the correct (relativized to UTC) answer:
spark.conf.set("spark.sql.session.timeZone", "PST")
List("1970-01-01T00:00:00-01:00").toDF("timestr").
withColumn("ts", col("timestr").cast("timestamp")).
withColumn("tsAsInt", col("ts").cast("integer")).
withColumn("asUtc", date_format($"ts", "yyyy-MM-dd'T'HH:mm:ssX")).
show(false)
// RESULT:
// +-------------------------+-------------------+-------+----------------------+
// |timestr |ts |tsAsInt|asUtc |
// +-------------------------+-------------------+-------+----------------------+
// |1970-01-01T00:00:00-01:00|1969-12-31 17:00:00|3600 |1969-12-31T17:00:00-08|
// +-------------------------+-------------------+-------+----------------------+