How to convert POSIX time to a regular date and time in Spark 2? - apache-spark

I am new to Spark and PySpark, and just started with Spark 2.0. I am trying to convert a timestamp from the server (in POSIX/Unix format) into a regular date and time (such as yyyy-MM-dd plus time), but am unable to do so. I have used the following two commands:
df_new = df.withColumn('fromTimestamp', f.from_unixtime(df['timestamp'], 'yyyy-mm-dd HH:mm:ss'))
and
df.select("timestamp", f.from_unixtime(f.col("timestamp"))).show()
where f is an alias for the pyspark.sql.functions module. Both commands produce results like the following:
+--------------------+--------------------+
|#RequiredResult     |ActualResult        |
+--------------------+--------------------+
|2020-06-01 00:00:03 |52385-52-27 00:52:14|
|2020-06-01 00:00:02 |52385-35-27 00:35:19|
+--------------------+--------------------+
Furthermore, I want to aggregate the data into 30-minute or 60-minute intervals. Any leads on how to do that?

The Unix timestamp is defined as the number of seconds since 1 January 1970. However, some systems and APIs report the number of milliseconds since this date instead, producing values that are 1000 times larger.
For example, the date 2020-06-01 00:00:03 would be represented by the timestamp 1590962403. If the timestamp 1590962403000 were used instead, this would result in a date in the year 52385:
spark.sql("""select from_unixtime(1590962403) as seconds,
                    from_unixtime(1590962403000) as ms""") \
    .show(truncate=False)
prints
+-------------------+---------------------+
|seconds            |ms                   |
+-------------------+---------------------+
|2020-06-01 00:00:03|+52385-08-04 18:50:00|
+-------------------+---------------------+
So you should divide the timestamp column by 1000 before applying from_unixtime.
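In PySpark, a minimal sketch of that fix, together with the 30/60-minute aggregation the question also asks about, could look like this (the column name timestamp comes from the question; the cast to timestamp and the count() aggregation are assumptions):
from pyspark.sql import functions as f

# Assuming the 'timestamp' column holds Unix time in milliseconds:
# dividing by 1000 gives seconds, which from_unixtime expects.
df_new = df.withColumn(
    'fromTimestamp',
    f.from_unixtime(f.col('timestamp') / 1000).cast('timestamp')
)

# 30- or 60-minute buckets can then be built with window();
# count() is just a placeholder aggregation.
df_new.groupBy(f.window('fromTimestamp', '30 minutes')).count().show()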

Related

bin() adx scalar function returns an aggregate every round hour but I want it to be on the half hour

| take 2000000
| summarize Value = avg(Value) by bin(TimeStampOfValue, 1h)
I have an ADX table with a Value and a Timestamp column, and when I run this query I get the average Value every hour, for example:
TimeStampOfValue       Value
2022-01-30T22:00:00    500
2022-01-30T23:00:00    499,99
I'd like it to return:
TimeStampOfValue       Value
2022-01-30T22:30:00    500
2022-01-30T23:30:00    499,99
How do I shift the 'by bin' by 30 minutes, so that it still aggregates hourly but on the half-hour mark? Is this even possible?
One solution is to use 'bin_at' with a specific start time so that the hourly bins are anchored there (sketched below); is this the only way?
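A minimal KQL sketch of that bin_at idea, using T as a stand-in table name and anchoring the hourly bins at an arbitrary half-hour mark, might look like this:
T
| summarize Value = avg(Value) by bin_at(TimeStampOfValue, 1h, datetime(2022-01-30 00:30:00))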

Python Pandas: Supporting 25 hours in datetime index

I want to use a date/time as an index for a dataframe in Pandas.
However, daylight saving time is not properly addressed in the database, so the date/time values for the day in which daylight saving time ends have 25 hours and are represented as such:
2019102700
2019102701
...
2019102724
I am using the following code to convert those values to a DateTime object that I use as an index to a Pandas dataframe:
df.index = pd.to_datetime(df["date_time"], format="%Y%m%d%H")
However, that gives an error:
ValueError: unconverted data remains: 4
Presumably because the to_datetime function is not expecting the hour to be 24. Similarly, the day in which daylight saving time starts only has 23 hours.
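For reference, the error is easy to reproduce on a single value from the question:
import pandas as pd

# Hour "24" is outside the 00-23 range that %H accepts, so strptime matches
# only the leading "2" and complains about the leftover "4".
pd.to_datetime("2019102724", format="%Y%m%d%H")
# ValueError: unconverted data remains: 4  (exact wording varies by pandas version)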
One solution I thought of was storing the dates as strings, but that seems neither elegant nor efficient. Is there any way to solve the issue of handling daylight saving time when using to_datetime?
If you know the timezone, here's a way to calculate UTC timestamps. Parse only the date part, localize to the actual time zone the data "belongs" to, and convert that to UTC. Now you can parse the hour part and add it as a time delta - e.g.
import pandas as pd

df = pd.DataFrame({'date_time_str': ['2019102722', '2019102723', '2019102724',
                                     '2019102800', '2019102801', '2019102802']})
df['date_time'] = (pd.to_datetime(df['date_time_str'].str[:-2], format='%Y%m%d')
                     .dt.tz_localize('Europe/Berlin')
                     .dt.tz_convert('UTC'))
# parse the hour part as an integer offset and add it as a timedelta
df['date_time'] += pd.to_timedelta(df['date_time_str'].str[-2:].astype(int), unit='h')
# df['date_time']
# 0 2019-10-27 20:00:00+00:00
# 1 2019-10-27 21:00:00+00:00
# 2 2019-10-27 22:00:00+00:00
# 3 2019-10-27 23:00:00+00:00
# 4 2019-10-28 00:00:00+00:00
# 5 2019-10-28 01:00:00+00:00
# Name: date_time, dtype: datetime64[ns, UTC]
I'm not sure if it is the most elegant or efficient solution, but I would:
df.loc[df.date_time.str[-2:] == '24', 'date_time'] = \
    (pd.to_numeric(df.date_time[df.date_time.str[-2:] == '24']) + 100 - 24).apply(str)
df.index = pd.to_datetime(df["date_time"], format="%Y%m%d%H")
Pick the first and the last index entries, convert them to tz-aware datetimes, then generate a date_range, which handles 25-hour days, and assign that date_range to your df index:
start = pd.to_datetime(df.index[0], format="%Y%m%d%H").tz_localize("Europe/Berlin")
end = pd.to_datetime(df.index[-1], format="%Y%m%d%H").tz_localize("Europe/Berlin")
index_ = pd.date_range(start, end, freq="1h")   # hourly, matching the data
df = df.set_index(index_)

Is there any method in pandas to convert a dataframe from day to the default d/m/y format?

I would like to convert every day value in the data-frame into a day/Feb/2020 date; the date field contains only the day of the month. My current approach is:
import datetime

y = []
for day in planned_ds.Date:
    x = datetime.datetime(2020, 5, day)
    print(x)
Is there an easy method to convert the whole day column of the data-frame to d/m/y format?
One way, assuming you have data like
df = pd.DataFrame([1,2,3,4,5], columns=["date"])
is to convert them to dates and then shift them to start when you need them to:
pd.to_datetime(df["date"], unit="D") - pd.Timestamp(1970, 1, 1) + pd.Timestamp(2020, 1, 31)
this results in
0 2020-02-01
1 2020-02-02
2 2020-02-03
3 2020-02-04
4 2020-02-05
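If the goal is literally strings like 01/Feb/2020, a slightly more direct sketch is to add the day numbers as a timedelta and format the result (the column name date is taken from the example above; the target month is an assumption based on the question):
import pandas as pd

df = pd.DataFrame([1, 2, 3, 4, 5], columns=["date"])

# Treat each day number as an offset from the day before the target month
# starts, then render as day/month-name/year.
dates = pd.Timestamp("2020-01-31") + pd.to_timedelta(df["date"], unit="D")
print(dates.dt.strftime("%d/%b/%Y"))
# 0    01/Feb/2020
# 1    02/Feb/2020
# ...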

What is the most performant way to slice a datetime in a multi-index?

What is the most 'performant' way to filter a DataFrame by time if the DataFrame has a multi-index containing a datetime index?
For example, how do I filter for business hours only in a datetime index that is contained in a multi-index?
df.index.get_level_values(1).hour.isin([9,10,11,13,14,15,16])
That's just one example--filter the second level of a MultiIndex which is a datetime column and get a boolean mask which is True wherever the hour is 9 to 5 excluding lunch break.
Need more precision?
dt = df.index.get_level_values(1)
# The hour/minute arithmetic yields an Index, which has no .between(),
# so wrap it in a Series aligned with the frame's index first.
minutes = pd.Series(dt.hour * 60 + dt.minute, index=df.index)
minutes.between(8*60 + 15, 17*60 + 45)
That's 8:15 to 17:45.
Siesta?
minutes.between(9*60+30, 15*60) | minutes.between(17*60+30, 20*60)
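To show such a mask actually being applied, here is a small self-contained sketch; the level names, frequency, and values are made up for illustration:
import numpy as np
import pandas as pd

# Hypothetical two-level index: a symbol and a timestamp.
idx = pd.MultiIndex.from_product(
    [["AAA", "BBB"], pd.date_range("2023-01-02", periods=24, freq="h")],
    names=["symbol", "ts"],
)
df = pd.DataFrame({"value": np.arange(len(idx))}, index=idx)

# Boolean mask over the datetime level, then filter the frame with it.
hours = df.index.get_level_values("ts").hour
business = df[(hours >= 9) & (hours < 17)]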

Getting milliseconds in a Hive timestamp with a timezone offset

I want to convert a timestamp to milliseconds with different formats in Hive.
Currently I'm able to convert a string to the correct timestamp using the following code, but I want to store, as the timestamp data type, values that arrive in the format YYYYMMDD-HH:MM:SS[.sss][Z | [ + | - hh[:mm]]] where:
YYYY = 0000 to 9999
MM = 01-12
DD = 01-31
HH = 00-23 hours
MM = 00-59 minutes
SS = 00-59 seconds
sss = milliseconds
hh = 01-12 offset hours
mm = 00-59 offset minutes
Example: 20060901-02:39-05 is five hours behind UTC, thus Eastern Time on 1 September 2006, and the timestamp will be in the yyyy-MM-dd HH:mm:ss.SSS format.
What I have for a UTC timestamp of the form YYYYMMDD-HH:MM:SS.sss is as follows:
cast(concat(concat_ws('-',substr(tag[52],1,4), substr(tag[52],5,2), substr(tag[52],7,2)),
space(1),
concat_ws(':',substr(tag[52],10,2), substr(tag[52],13,2), substr(tag[52],16,2)),
'.', substr(tag[52],19,3)) AS TIMESTAMP)
This takes a tag, does string manipulation on its values, and casts the result to the TIMESTAMP data type, producing yyyy-MM-dd HH:mm:ss.SSS...
I would like something similar that produces a timestamp with a timezone offset in Hive.
Is this even possible?
