I am using Spark 3.2 and trying to explode a sequence between two dates, so that I get a column containing all the months between those two dates.
dataSetFromFile = dataSetFromFile.withColumn(
"MONTH",
functions.explode(functions.sequence(
functions.lit("2016-02-01").cast(DataTypes.DateType),
functions.lit("2016-05-31").cast(DataTypes.DateType),
functions.expr("INTERVAL 1 month")
))
);
It's giving me this output:
MONTH
2016-02-01
2016-03-01
2016-03-31
2016-04-30
2016-05-31
The expected output is:
MONTH
2016-02-01
2016-03-01
2016-04-01
2016-05-01
Can you please let me know where I went wrong?
The problem is related to timezone.
Solution 1: Change DateType to TimestampType:
dataSetFromFile = dataSetFromFile.withColumn("MONTH",functions.explode(functions.sequence( functions.lit("2016-02-01").cast(DataTypes.TimestampType), functions.lit("2016-05-31").cast(DataTypes.TimestampType), functions.expr("INTERVAL 1 month"))));}
Solution 2: Set timezone to UTC:
spark.conf.set("spark.sql.session.timeZone", "UTC")
dataSetFromFile = dataSetFromFile.withColumn("MONTH",functions.explode(functions.sequence( functions.lit("2016-02-01").cast(DataTypes.DateType), functions.lit("2016-05-31").cast(DataTypes.DateType), functions.expr("INTERVAL 1 month"))));}
Related
I am exploring different date formats and trying to convert from one format to another. Currently I'm stuck in a scenario where I have input dates and times as below:
I was able to convert it to a date timestamp using concatenation:
concat_ws(' ',new_df.transaction_date,new_df.Transaction_Time)
When I try to use
withColumn("date_time2", F.to_date(col('date_time'), "MMM d yyyy hh:mmaa")) with ('spark.sql.legacy.timeParserPolicy','LEGACY')
the result is displayed as 'undefined'.
I am looking for pointers/code snippets to extract YYYY-MM-DD HH:MM:SS in CET (the input is in PST), as below:
input_date_time              output (in CET)
Mar 1, 2022 01:00:00 PM PST  2022-03-01 22:00:00
Parse the PST string to a timestamp, with the timezone resolved to UTC. Then convert it to "CET" time:
import pyspark.sql.functions as F
df = spark.createDataFrame(data=[["Mar 1, 2022 01:00:00 PM PST"]], schema=["input_date_time_pst"])
df = df.withColumn("input_date_time_pst", F.to_timestamp("input_date_time_pst", format="MMM d, yyyy hh:mm:ss a z"))
df = df.withColumn("output_cet", F.from_utc_timestamp("input_date_time_pst", "CET"))
[Out]:
+-------------------+-------------------+
|input_date_time_pst|output_cet |
+-------------------+-------------------+
|2022-03-01 21:00:00|2022-03-01 22:00:00|
+-------------------+-------------------+
Note - The 2022-03-01 21:00:00 above is Mar 1, 2022 01:00:00 PM PST displayed in UTC.
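If a plain 'YYYY-MM-DD HH:MM:SS' string is needed rather than a timestamp column, date_format can render it; a minimal sketch reusing the df from above (the output_cet_str column name is just an example):
df = df.withColumn("output_cet_str", F.date_format("output_cet", "yyyy-MM-dd HH:mm:ss"))
df.show(truncate=False)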
I have a dataframe MyDf like this
MyDate MyData
2020-06-02 4588.0
2020-06-03 4555.5
2020-06-04 4604.3
2020-06-05 4634.1
2020-06-06 4617.8
2020-06-07 4598.9
2020-06-08 4596.1
2020-06-09 4607.0
2020-06-10 4601.6
2020-06-11 4547.4
I want to calculate the weekly growth rate of the MyData column this way:
#'WeeklyRate'='MyData'/pastweek('MyData')
from pandas.tseries.offsets import Week
MyDf['WeeklyRate'] = ((MyDf['MyData'] / MyDf['MyData'].shift(1, freq=Week()).reindex(MyDf['MyDate'].index))
.fillna(value=-1)
.astype(float))
#ToDo:not sure if .astype(float)
But I get the error "Not supported for type RangeIndex" when I run the last line.
What am I doing wrong?
I think if you set "MyDate" as your index, it will be solved :)
MyDf = MyDf.set_index("MyDate")
MyDf['WeeklyRate'] = MyDf['MyData'] / MyDf['MyData'].shift(1, freq=Week())
MyDf['WeeklyRate'] = MyDf["WeeklyRate"].fillna(value=-1)
I have a date string from a source in the format 'Fri May 24 00:00:00 BST 2019' that I would like to convert to a date and store in my dataframe as '2019-05-24', using code like the example below, which works for me under Spark 2.0:
from pyspark.sql.functions import to_date, unix_timestamp, from_unixtime
df = spark.createDataFrame([("Fri May 24 00:00:00 BST 2019",)], ['date_str'])
df2 = df.select('date_str', to_date(from_unixtime(unix_timestamp('date_str', 'EEE MMM dd HH:mm:ss zzz yyyy'))).alias('date'))
df2.show(1, False)
In my sandbox environment I've updated to Spark 3.0 and now get the following error for the above code. Is there a new method of doing this in 3.0 to convert my string to a date?
: org.apache.spark.SparkUpgradeException: You may get a different
result due to the upgrading of Spark 3.0: Fail to recognize 'EEE MMM
dd HH:mm:ss zzz yyyy' pattern in the DateTimeFormatter.
You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the
behavior before Spark 3.0.
You can form a valid datetime pattern with the guide from
https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
If you want to use the legacy format in a newer version of Spark (3.0+), you need to set spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY") or
spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY"), which will resolve the issue.
Thanks for the responses, excellent advice. For the moment I'll be going with the LEGACY setting. I have a workaround in Spark 3.0 by substringing out the EEE element, but I've noticed a bug in how the BST timezone converts: it incorrectly offsets by 10 hours, while under LEGACY it correctly remains the same (I'm currently in the BST zone). I can do something with this, but will wait till the clocks change in the autumn to confirm.
spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY")
df = spark.createDataFrame([('Fri May 24 00:00:00 BST 2019',)], ['mydate'])
df = df.select('mydate',
to_timestamp(df.mydate.substr(5, 28), 'MMM dd HH:mm:ss zzz yyyy').alias('datetime'),
to_timestamp(df.mydate, 'EEE MMM dd HH:mm:ss zzz yyyy').alias('LEGACYdatetime')
).show(1, False)
df = spark.createDataFrame([('Fri May 24 00:00:00 GMT 2019',)], ['mydate'])
df = df.select('mydate',
to_timestamp(df.mydate.substr(5, 28), 'MMM dd HH:mm:ss zzz yyyy').alias('datetime'),
to_timestamp(df.mydate, 'EEE MMM dd HH:mm:ss zzz yyyy').alias('LEGACYdatetime')
).show(1, False)
spark.sql("set spark.sql.legacy.timeParserPolicy=CORRECTED")
df = spark.createDataFrame([('Fri May 24 00:00:00 BST 2019',)], ['mydate'])
df = df.select('mydate',
to_timestamp(df.mydate.substr(5, 28), 'MMM dd HH:mm:ss zzz yyyy').alias('datetime')
).show(1, False)
df = spark.createDataFrame([('Fri May 24 00:00:00 GMT 2019',)], ['mydate'])
df = df.select('mydate',
to_timestamp(df.mydate.substr(5, 28), 'MMM dd HH:mm:ss zzz yyyy').alias('datetime')
).show(1, False)
+----------------------------+-------------------+-------------------+
|mydate |datetime |LEGACYdatetime |
+----------------------------+-------------------+-------------------+
|Fri May 24 00:00:00 BST 2019|2019-05-24 00:00:00|2019-05-24 00:00:00|
+----------------------------+-------------------+-------------------+
+----------------------------+-------------------+-------------------+
|mydate |datetime |LEGACYdatetime |
+----------------------------+-------------------+-------------------+
|Fri May 24 00:00:00 GMT 2019|2019-05-24 01:00:00|2019-05-24 01:00:00|
+----------------------------+-------------------+-------------------+
+----------------------------+-------------------+
|mydate |datetime |
+----------------------------+-------------------+
|Fri May 24 00:00:00 BST 2019|2019-05-23 14:00:00|
+----------------------------+-------------------+
+----------------------------+-------------------+
|mydate |datetime |
+----------------------------+-------------------+
|Fri May 24 00:00:00 GMT 2019|2019-05-24 01:00:00|
+----------------------------+-------------------+
The difference between the legacy and the current version of Spark is subtle. For example:
spark.sql("set spark.sql.legacy.timeParserPolicy=EXCEPTION")
df = spark.createDataFrame([('12/25/2019 01:30:00 PM',),], ['Christmas'])
df.select(to_timestamp(col('Christmas'),'MM/dd/yyyy hh:mm:ss a')).show()
Outputs the following:
+----------------------------------------------+
|to_timestamp(Christmas, MM/dd/yyyy hh:mm:ss a)|
+----------------------------------------------+
| 2019-12-25 13:30:00|
+----------------------------------------------+
However
spark.sql("set spark.sql.legacy.timeParserPolicy=EXCEPTION")
df = spark.createDataFrame([('12/25/2019 01:30:00 PM',),], ['Christmas'])
df.select(to_timestamp(col('Christmas'),'MM/dd/yyyy hh:mm:ss aa')).show()
Will raise a SparkUpgradeException.
Notice we have 'aa' in the time format, not just a single 'a'.
According to the Java docs, which is what the to_timestamp function uses, 'aa' was always wrong; I guess the earlier versions of Spark were just more lenient.
So either fix the date formats or set the timeParserPolicy to 'LEGACY' as Shivam suggested.
I have a DataFrame with this column:
Mi_Meteo['Time_Instant'].head():
0 2013/11/14 17:00
1 2013/11/14 18:00
2 2013/11/14 19:00
3 2013/11/14 20:00
4 2013/11/14 21:00
Name: Time_Instant, dtype: object
After doing some inspection, this is what I realised:
Mi_Meteo['Time_Instant'].value_counts():
2013/12/09 02:00 33
2013/12/01 22:00 33
2013/12/11 10:00 33
2013/12/05 09:00 33
.
.
.
.
2013/11/16 02:00 21
2013/11/07 10:00 11
2013/11/17 22:00 11
DateTIme 3
So I stripped it:
Mi_Meteo['Time_Instant'] = Mi_Meteo['Time_Instant'].str.rstrip('DateTIme')  # because otherwise I would get this error when converting: 'Unknown string format'
And then I tried to convert it:
Mi_Meteo['Time_Instant'] = pd.to_datetime(Mi_Meteo['Time_Instant'])
But I get this error:
String does not contain a date.
Any suggestion would be much appreciated, thank you all.
A bit late, why don't you use this:
Mi_Meteo['Time_Instant'] = pd.to_datetime(Mi_Meteo['Time_Instant'], errors='coerce')
In the pandas.to_datetime documentation there is a description of the 'errors' parameter:
errors : {‘ignore’, ‘raise’, ‘coerce’}, default ‘raise’
If ‘raise’, then invalid parsing will raise an exception.
If ‘coerce’, then invalid parsing will be set as NaT.
If ‘ignore’, then invalid parsing will return the input.
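As a small illustration of what errors='coerce' does with values like the ones above (the three sample values here are made up):
import pandas as pd

s = pd.Series(['2013/11/14 17:00', 'DateTIme', ' '])
parsed = pd.to_datetime(s, errors='coerce')  # unparsable entries become NaT
clean = parsed.dropna()                      # optionally drop the NaT rows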
I got the same error - it turns out that two of my dates were empty: ' '.
To find the row index of the problematic dates I used the following list comprehension:
badRows = [n for n,x in enumerate(df['DATE'].tolist()) if x.strip() in ['']]
This returned a list, containing the indices of the rows in the 'DATE' column that were causing the problems:
[745672, 745673]
You can then delete these rows in place:
df.drop(df.index[badRows],inplace=True)
I'm having trouble reproducing your error, so I cannot be sure if this will fix the issue you have. If not then please try to provide a minimum sample of code/data that reproduces your error.
This is what I tried to reproduce your situation:
lzt = ['2013/11/16 02:00 ',
'2013/11/07 10:00 ',
'2013/11/17 22:00 ',
'DateTIme',
'DateTIme',
'DateTIme']
ser = pd.Series(lzt)
ser = ser.str.rstrip('DateTIme')
ser = pd.to_datetime(ser)
But as I said, I got no error, so either we have different versions of pandas or there's something else wrong with your data. Using rstrip leaves some blank string data:
0 2013/11/16 02:00
1 2013/11/07 10:00
2 2013/11/17 22:00
3
4
5
which for me gives NaT (not a time) when I run pd.to_datetime on it:
Out[34]:
0 2013-11-16 02:00:00
1 2013-11-07 10:00:00
2 2013-11-17 22:00:00
3 NaT
4 NaT
5 NaT
dtype: datetime64[ns]
I'd say it's better practice to remove the unwanted rows altogether:
ser = ser[ser != 'DateTIme']
Out[39]:
0 2013-11-16 02:00:00
1 2013-11-07 10:00:00
2 2013-11-17 22:00:00
dtype: datetime64[ns]
See if that works, otherwise please give enough information to reproduce the error.
There are two possible solutions to this:
Either you can make the error disappear by using 'coerce' in the errors argument of pd.to_datetime() as follows:
Mi_Meteo['Time_Instant'] = pd.to_datetime(Mi_Meteo['Time_Instant'], errors='coerce')
Or, if you are interested to know which dates have the unparsable values, you can search for them by converting one value at a time, as follows. This will work regardless of the type or format of the wrong value:
dates = []
wrong_dates = []
for i in Mi_Meteo['Time_Instant'].unique():
    try:
        pd.to_datetime(i)  # raises if the value cannot be parsed
        dates.append(i)
    except Exception:
        wrong_dates.append(i)
In the wrong_dates list you will have all the wrong values, while in dates you will have all the right ones.
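As a follow-up sketch, once wrong_dates is known you can inspect or drop the offending rows (the isin-based filter shown here is just one option):
bad_rows = Mi_Meteo[Mi_Meteo['Time_Instant'].isin(wrong_dates)]
Mi_Meteo = Mi_Meteo[~Mi_Meteo['Time_Instant'].isin(wrong_dates)]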
I've got a pandas Series containing datetime-like strings in 12h format, but without the am/pm abbreviations. It covers an entire month of data:
40 01/01/2017 11:51:00
41 01/01/2017 11:51:05
42 01/01/2017 11:55:05
43 01/01/2017 11:55:10
44 01/01/2017 11:59:30
45 01/01/2017 11:59:35
46 02/01/2017 12:00:05
47 02/01/2017 12:00:10
48 02/01/2017 12:13:20
49 02/01/2017 12:13:25
50 02/01/2017 12:24:50
51 02/01/2017 12:24:55
52 02/01/2017 12:33:30
Name: TS, dtype: object
(318621,) # shape
My goal is to convert it to datetime format, so as to obtain the appropriate unix timestamp values and make comparisons/arithmetic with other datetime data that is, this time, in 24h format. So I already tried this:
pd.to_datetime(df.TS, format = '%d/%m/%Y %I:%M:%S') # %I for 12h format
Which outputs:
64 2017-01-02 00:46:50
65 2017-01-02 00:46:55
66 2017-01-02 01:01:00
67 2017-01-02 01:01:05
68 2017-01-02 01:05:00
But the am/pm information is not taken into account. I know that, as a rule, the am/pm markers first have to be present in the strings; then one can use datetime.strptime() or pd.to_datetime() to parse them with the %p directive.
So I wanted to know: is there another way to deal with this issue through the datetime or pandas datetime modules, or do I have to manually add the 'am/pm' abbreviations before parsing?
You have data in 5-second intervals throughout multiple days. The desired end format is like this (with the AM/PM information we need to add, because pandas cannot possibly guess it, since it looks at one value at a time):
31/12/2016 11:59:55 PM
01/01/2017 12:00:00 AM
01/01/2017 12:00:05 AM
01/01/2017 11:59:55 AM
01/01/2017 12:00:00 PM
01/01/2017 12:59:55 PM
01/01/2017 01:00:00 PM
01/01/2017 01:00:05 PM
01/01/2017 11:59:55 PM
02/01/2017 12:00:00 AM
First, we can parse the whole thing without AM/PM info, as you already showed:
# Imports used below (datetime for the time comparison, numpy for the NaN placeholder).
import datetime
import numpy as np
import pandas as pd

ts = pd.to_datetime(df.TS, format='%d/%m/%Y %I:%M:%S')
We have a small problem: 12:00:00 is parsed as noon, not midnight. Let's normalize that:
ts[ts.dt.hour == 12] -= pd.Timedelta(12, 'h')
Now we have times from 00:00:00 to 11:59:55, twice per day.
Next, note that the transitions are always at 00:00:00. We can easily detect these, as well as the first instance of each date:
twelve = ts.dt.time == datetime.time(0,0,0)
newdate = ts.dt.date.diff() > pd.Timedelta(0)
midnight = twelve & newdate
noon = twelve & ~newdate
Next, build an offset series, which should be easy to inspect for correctness:
offset = pd.Series(np.nan, ts.index, dtype='timedelta64[ns]')
offset[midnight] = pd.Timedelta(0)
offset[noon] = pd.Timedelta(12, 'h')
offset.fillna(method='ffill', inplace=True)
And finally:
ts += offset
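As a quick sanity check (a sketch, assuming the raw data is in chronological order and the offsets were assigned correctly):
assert ts.is_monotonic_increasing          # reconstructed timestamps should only move forward
print(ts.dt.hour.min(), ts.dt.hour.max())  # expect 0 and 23 once the AM/PM half-days are restored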