I am using Spark 3.2 and trying to explode a sequence between two dates, so that I get a column containing all the months between those two dates.
dataSetFromFile = dataSetFromFile.withColumn(
"MONTH",
functions.explode(functions.sequence(
functions.lit("2016-02-01").cast(DataTypes.DateType),
functions.lit("2016-05-31").cast(DataTypes.DateType),
functions.expr("INTERVAL 1 month")
))
);
It's giving me this output:
MONTH
2016-02-01
2016-03-01
2016-03-31
2016-04-30
2016-05-31
The expected output is:
MONTH
2016-02-01
2016-03-01
2016-04-01
2016-05-01
Can you please let me know where I went wrong?
The problem is related to timezone.
Solution 1: Change DateType to TimestampType:
dataSetFromFile = dataSetFromFile.withColumn("MONTH",functions.explode(functions.sequence( functions.lit("2016-02-01").cast(DataTypes.TimestampType), functions.lit("2016-05-31").cast(DataTypes.TimestampType), functions.expr("INTERVAL 1 month"))));}
Solution 2: Set timezone to UTC:
spark.conf.set("spark.sql.session.timeZone", "UTC")
dataSetFromFile = dataSetFromFile.withColumn("MONTH",functions.explode(functions.sequence( functions.lit("2016-02-01").cast(DataTypes.DateType), functions.lit("2016-05-31").cast(DataTypes.DateType), functions.expr("INTERVAL 1 month"))));}
Related
I am exploring different date formats and trying to convert from one format to another. Currently I'm stuck in a scenario where I have input dates and times as below:
I was able to convert it to a date timestamp using concatenation:
concat_ws(' ',new_df.transaction_date,new_df.Transaction_Time)
When I try to use
withColumn("date_time2", F.to_date(col('date_time'), "MMM d yyyy hh:mmaa")) with ('spark.sql.legacy.timeParserPolicy','LEGACY')
the result is displayed as 'undefined'.
I am looking for pointers/code snippets to extract YYYY-MM-DD HH:MM:SS in CET (the input is in PST), as below:
input_date_time              output (in CET)
Mar 1, 2022 01:00:00 PM PST  2022-03-01 22:00:00
Parse the PST string to a timestamp, with the timezone resolved to UTC. Then convert it to "CET" time:
import pyspark.sql.functions as F
df = spark.createDataFrame(data=[["Mar 1, 2022 01:00:00 PM PST"]], schema=["input_date_time_pst"])
df = df.withColumn("input_date_time_pst", F.to_timestamp("input_date_time_pst", format="MMM d, yyyy hh:mm:ss a z"))
df = df.withColumn("output_cet", F.from_utc_timestamp("input_date_time_pst", "CET"))
[Out]:
+-------------------+-------------------+
|input_date_time_pst|output_cet |
+-------------------+-------------------+
|2022-03-01 21:00:00|2022-03-01 22:00:00|
+-------------------+-------------------+
Note - The 2022-03-01 21:00:00 above is Mar 1, 2022 01:00:00 PM PST displayed in UTC.
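If a plain 'YYYY-MM-DD HH:MM:SS' string is needed rather than a timestamp column, date_format can render it; a minimal sketch reusing the df from above (the output_cet_str column name is just an example):
df = df.withColumn("output_cet_str", F.date_format("output_cet", "yyyy-MM-dd HH:mm:ss"))
df.show(truncate=False)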
I have a dataframe MyDf like this
MyDate MyData
2020-06-02 4588.0
2020-06-03 4555.5
2020-06-04 4604.3
2020-06-05 4634.1
2020-06-06 4617.8
2020-06-07 4598.9
2020-06-08 4596.1
2020-06-09 4607.0
2020-06-10 4601.6
2020-06-11 4547.4
I want to calculate the weekly growth rate of the MyData column this way:
#'WeeklyRate'='MyData'/pastweek('MyData')
from pandas.tseries.offsets import Week
MyDf['WeeklyRate'] = ((MyDf['MyData'] / MyDf['MyData'].shift(1, freq=Week()).reindex(MyDf['MyDate'].index))
.fillna(value=-1)
.astype(float))
#ToDo:not sure if .astype(float)
But I get the error "Not supported for type RangeIndex" when I run the last line.
What am I doing wrong?
I think if you set "MyDate" as your index, it will be solved :)
MyDf = MyDf.set_index("MyDate")
MyDf['WeeklyRate'] = MyDf['MyData'] / MyDf['MyData'].shift(1, freq=Week())
MyDf['WeeklyRate'] = MyDf["WeeklyRate"].fillna(value=-1)
I have a date string from a source in the format 'Fri May 24 00:00:00 BST 2019' that I would like to convert to a date and store in my dataframe as '2019-05-24', using code like the example below, which works for me under Spark 2.0:
from pyspark.sql.functions import to_date, unix_timestamp, from_unixtime
df = spark.createDataFrame([("Fri May 24 00:00:00 BST 2019",)], ['date_str'])
df2 = df.select('date_str', to_date(from_unixtime(unix_timestamp('date_str', 'EEE MMM dd HH:mm:ss zzz yyyy'))).alias('date'))
df2.show(1, False)
In my sandbox environment I've updated to Spark 3.0 and now get the following error for the above code. Is there a new method of doing this in 3.0 to convert my string to a date?
: org.apache.spark.SparkUpgradeException: You may get a different
result due to the upgrading of Spark 3.0: Fail to recognize 'EEE MMM
dd HH:mm:ss zzz yyyy' pattern in the DateTimeFormatter.
You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the
behavior before Spark 3.0.
You can form a valid datetime pattern with the guide from
https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
If you want to use the legacy format in a newer version of Spark (3.0+), you need to set spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY") or
spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY"), which will resolve the issue.
Thanks for the responses, excellent advice. For the moment I'll be going with the LEGACY setting. I have a workaround in Spark 3.0 by substringing out the EEE element, but I've noticed a bug in how the BST timezone converts: it incorrectly offsets by 10 hours, while under LEGACY it correctly remains the same (I'm currently in the BST zone). I can do something with this, but will wait till the clocks change in the autumn to confirm.
spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY")
df = spark.createDataFrame([('Fri May 24 00:00:00 BST 2019',)], ['mydate'])
df = df.select('mydate',
to_timestamp(df.mydate.substr(5, 28), 'MMM dd HH:mm:ss zzz yyyy').alias('datetime'),
to_timestamp(df.mydate, 'EEE MMM dd HH:mm:ss zzz yyyy').alias('LEGACYdatetime')
).show(1, False)
df = spark.createDataFrame([('Fri May 24 00:00:00 GMT 2019',)], ['mydate'])
df = df.select('mydate',
to_timestamp(df.mydate.substr(5, 28), 'MMM dd HH:mm:ss zzz yyyy').alias('datetime'),
to_timestamp(df.mydate, 'EEE MMM dd HH:mm:ss zzz yyyy').alias('LEGACYdatetime')
).show(1, False)
spark.sql("set spark.sql.legacy.timeParserPolicy=CORRECTED")
df = spark.createDataFrame([('Fri May 24 00:00:00 BST 2019',)], ['mydate'])
df = df.select('mydate',
to_timestamp(df.mydate.substr(5, 28), 'MMM dd HH:mm:ss zzz yyyy').alias('datetime')
).show(1, False)
df = spark.createDataFrame([('Fri May 24 00:00:00 GMT 2019',)], ['mydate'])
df = df.select('mydate',
to_timestamp(df.mydate.substr(5, 28), 'MMM dd HH:mm:ss zzz yyyy').alias('datetime')
).show(1, False)
+----------------------------+-------------------+-------------------+
|mydate |datetime |LEGACYdatetime |
+----------------------------+-------------------+-------------------+
|Fri May 24 00:00:00 BST 2019|2019-05-24 00:00:00|2019-05-24 00:00:00|
+----------------------------+-------------------+-------------------+
+----------------------------+-------------------+-------------------+
|mydate |datetime |LEGACYdatetime |
+----------------------------+-------------------+-------------------+
|Fri May 24 00:00:00 GMT 2019|2019-05-24 01:00:00|2019-05-24 01:00:00|
+----------------------------+-------------------+-------------------+
+----------------------------+-------------------+
|mydate |datetime |
+----------------------------+-------------------+
|Fri May 24 00:00:00 BST 2019|2019-05-23 14:00:00|
+----------------------------+-------------------+
+----------------------------+-------------------+
|mydate |datetime |
+----------------------------+-------------------+
|Fri May 24 00:00:00 GMT 2019|2019-05-24 01:00:00|
+----------------------------+-------------------+
The difference between the legacy and the current version of Spark is subtle. For example:
spark.sql("set spark.sql.legacy.timeParserPolicy=EXCEPTION")
df = spark.createDataFrame([('12/25/2019 01:30:00 PM',),], ['Christmas'])
df.select(to_timestamp(col('Christmas'),'MM/dd/yyyy hh:mm:ss a')).show()
Outputs the following:
+----------------------------------------------+
|to_timestamp(Christmas, MM/dd/yyyy hh:mm:ss a)|
+----------------------------------------------+
| 2019-12-25 13:30:00|
+----------------------------------------------+
However
spark.sql("set spark.sql.legacy.timeParserPolicy=EXCEPTION")
df = spark.createDataFrame([('12/25/2019 01:30:00 PM',),], ['Christmas'])
df.select(to_timestamp(col('Christmas'),'MM/dd/yyyy hh:mm:ss aa')).show()
Will raise a SparkUpgradeException.
Notice we have 'aa' in the time format, not just a single 'a'.
According to the Java docs, which is what the to_timestamp function uses, 'aa' was always wrong; I guess the earlier versions of Spark were just more lenient.
So either fix the date formats or set the timeParserPolicy to 'LEGACY' as Shivam suggested.
I have a DataFrame with this column:
Mi_Meteo['Time_Instant'].head():
0 2013/11/14 17:00
1 2013/11/14 18:00
2 2013/11/14 19:00
3 2013/11/14 20:00
4 2013/11/14 21:00
Name: Time_Instant, dtype: object
After doing some inspection, this is what I realised:
Mi_Meteo['Time_Instant'].value_counts():
2013/12/09 02:00 33
2013/12/01 22:00 33
2013/12/11 10:00 33
2013/12/05 09:00 33
.
.
.
.
2013/11/16 02:00 21
2013/11/07 10:00 11
2013/11/17 22:00 11
DateTIme 3
So I stripped it:
Mi_Meteo['Time_Instant'] = Mi_Meteo['Time_Instant'].str.rstrip('DateTIme')  # because otherwise I would get this error when converting: 'Unknown string format'
And then I tried to convert it:
Mi_Meteo['Time_Instant'] = pd.to_datetime(Mi_Meteo['Time_Instant'])
But I get this error:
String does not contain a date.
Any suggestion would be much appreciated, thank you all.
A bit late, why don't you use this:
Mi_Meteo['Time_Instant'] = pd.to_datetime(Mi_Meteo['Time_Instant'], errors='coerce')
In the pandas.to_datetime documentation there is a description of the 'errors' parameter:
errors : {‘ignore’, ‘raise’, ‘coerce’}, default ‘raise’
If ‘raise’, then invalid parsing will raise an exception.
If ‘coerce’, then invalid parsing will be set as NaT.
If ‘ignore’, then invalid parsing will return the input.
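As a small illustration of what errors='coerce' does with values like the ones above (the three sample values here are made up):
import pandas as pd

s = pd.Series(['2013/11/14 17:00', 'DateTIme', ' '])
parsed = pd.to_datetime(s, errors='coerce')  # unparsable entries become NaT
clean = parsed.dropna()                      # optionally drop the NaT rows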
I got the same error - it turns out that two of my dates were empty: ' '.
To find the row index of the problematic dates I used the following list comprehension:
badRows = [n for n,x in enumerate(df['DATE'].tolist()) if x.strip() in ['']]
This returned a list, containing the indices of the rows in the 'DATE' column that were causing the problems:
[745672, 745673]
You can then delete these rows in place:
df.drop(df.index[badRows],inplace=True)
I'm having trouble reproducing your error, so I cannot be sure if this will fix the issue you have. If not then please try to provide a minimum sample of code/data that reproduces your error.
This is what I tried to reproduce your situation:
lzt = ['2013/11/16 02:00 ',
'2013/11/07 10:00 ',
'2013/11/17 22:00 ',
'DateTIme',
'DateTIme',
'DateTIme']
ser = pd.Series(lzt)
ser = ser.str.rstrip('DateTIme')
ser = pd.to_datetime(ser)
But as I said, I got no error, so either we have different versions of pandas or there's something else wrong with your data. Using rstrip leaves some blank string data:
0 2013/11/16 02:00
1 2013/11/07 10:00
2 2013/11/17 22:00
3
4
5
which for me gives NaT (not a time) when I run pd.to_datetime on it:
Out[34]:
0 2013-11-16 02:00:00
1 2013-11-07 10:00:00
2 2013-11-17 22:00:00
3 NaT
4 NaT
5 NaT
dtype: datetime64[ns]
I'd say it's better practice to remove the unwanted rows altogether:
ser = ser[ser != 'DateTIme']
Out[39]:
0 2013-11-16 02:00:00
1 2013-11-07 10:00:00
2 2013-11-17 22:00:00
dtype: datetime64[ns]
See if that works, otherwise please give enough information to reproduce the error.
There are two possible solutions to this:
Either you can make the error disappear by using 'coerce' in the errors argument of pd.to_datetime() as follows:
Mi_Meteo['Time_Instant'] = pd.to_datetime(Mi_Meteo['Time_Instant'], errors='coerce')
Or, if you are interested to know which dates have the unparsable values, you can search for them by converting one value at a time, as follows. This will work regardless of the type or format of the wrong value:
dates = []
wrong_dates = []
for i in Mi_Meteo['Time_Instant'].unique():
    try:
        pd.to_datetime(i)  # raises if the value cannot be parsed
        dates.append(i)
    except Exception:
        wrong_dates.append(i)
In the wrong_dates list you will have all the wrong values, while in dates you will have all the right ones.
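As a follow-up sketch, once wrong_dates is known you can inspect or drop the offending rows (the isin-based filter shown here is just one option):
bad_rows = Mi_Meteo[Mi_Meteo['Time_Instant'].isin(wrong_dates)]
Mi_Meteo = Mi_Meteo[~Mi_Meteo['Time_Instant'].isin(wrong_dates)]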
I've got a pandas Series containing datetime-like strings in 12h format, but without the am/pm abbreviations. It covers an entire month of data:
40 01/01/2017 11:51:00
41 01/01/2017 11:51:05
42 01/01/2017 11:55:05
43 01/01/2017 11:55:10
44 01/01/2017 11:59:30
45 01/01/2017 11:59:35
46 02/01/2017 12:00:05
47 02/01/2017 12:00:10
48 02/01/2017 12:13:20
49 02/01/2017 12:13:25
50 02/01/2017 12:24:50
51 02/01/2017 12:24:55
52 02/01/2017 12:33:30
Name: TS, dtype: object
(318621,) # shape
My goal is to convert it to datetime format, so as to obtain the appropriate unix timestamp values and make comparisons/arithmetic with other datetime data that is, this time, in 24h format. So I already tried this:
pd.to_datetime(df.TS, format = '%d/%m/%Y %I:%M:%S') # %I for 12h format
Which outputs:
64 2017-01-02 00:46:50
65 2017-01-02 00:46:55
66 2017-01-02 01:01:00
67 2017-01-02 01:01:05
68 2017-01-02 01:05:00
But the am/pm information is not taken into account. I know that, as a rule, the am/pm markers first have to be present in the strings; then one can use datetime.strptime() or pd.to_datetime() to parse them with the %p directive.
So I wanted to know: is there another way to deal with this issue through the datetime or pandas datetime modules, or do I have to manually add the 'am/pm' abbreviations before parsing?
You have data in 5-second intervals throughout multiple days. The desired end format is like this (with the AM/PM information we need to add, because pandas cannot possibly guess it, since it looks at one value at a time):
31/12/2016 11:59:55 PM
01/01/2017 12:00:00 AM
01/01/2017 12:00:05 AM
01/01/2017 11:59:55 AM
01/01/2017 12:00:00 PM
01/01/2017 12:59:55 PM
01/01/2017 01:00:00 PM
01/01/2017 01:00:05 PM
01/01/2017 11:59:55 PM
02/01/2017 12:00:00 AM
First, we can parse the whole thing without AM/PM info, as you already showed:
# Imports used below (datetime for the time comparison, numpy for the NaN placeholder).
import datetime
import numpy as np
import pandas as pd

ts = pd.to_datetime(df.TS, format='%d/%m/%Y %I:%M:%S')
We have a small problem: 12:00:00 is parsed as noon, not midnight. Let's normalize that:
ts[ts.dt.hour == 12] -= pd.Timedelta(12, 'h')
Now we have times from 00:00:00 to 11:59:55, twice per day.
Next, note that the transitions are always at 00:00:00. We can easily detect these, as well as the first instance of each date:
twelve = ts.dt.time == datetime.time(0,0,0)
newdate = ts.dt.date.diff() > pd.Timedelta(0)
midnight = twelve & newdate
noon = twelve & ~newdate
Next, build an offset series, which should be easy to inspect for correctness:
offset = pd.Series(np.nan, ts.index, dtype='timedelta64[ns]')
offset[midnight] = pd.Timedelta(0)
offset[noon] = pd.Timedelta(12, 'h')
offset.fillna(method='ffill', inplace=True)
And finally:
ts += offset
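As a quick sanity check (a sketch, assuming the raw data is in chronological order and the offsets were assigned correctly):
assert ts.is_monotonic_increasing          # reconstructed timestamps should only move forward
print(ts.dt.hour.min(), ts.dt.hour.max())  # expect 0 and 23 once the AM/PM half-days are restored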