I'm trying to format my timestamp column to include milliseconds without success. How can I format my time to look like this - 2019-01-04 11:09:21.152 ?
I have looked at the documentation and am following SimpleDateFormat, which the pyspark docs say is used by the to_timestamp function.
This is my dataframe.
+--------------------------+
|updated_date |
+--------------------------+
|2019-01-04 11:09:21.152815|
+--------------------------+
I used the millisecond format without any success, as shown below:
>>> df.select('updated_date').withColumn("updated_date_col2",
to_timestamp("updated_date", "YYYY-MM-dd HH:mm:ss:SSS")).show(1,False)
+--------------------------+-------------------+
|updated_date |updated_date_col2 |
+--------------------------+-------------------+
|2019-01-04 11:09:21.152815|2019-01-04 11:09:21|
+--------------------------+-------------------+
I expect updated_date_col2 to be formatted as 2019-01-04 11:09:21.152
I think you can use a UDF and Python's standard datetime module, as below.
import datetime
from pyspark.sql.functions import udf
from pyspark.sql.types import TimestampType

def _to_timestamp(s):
    # parse the full string, including the fractional seconds
    return datetime.datetime.strptime(s, '%Y-%m-%d %H:%M:%S.%f')

udf_to_timestamp = udf(_to_timestamp, TimestampType())

df.select('updated_date').withColumn("updated_date_col2", udf_to_timestamp("updated_date")).show(1, False)
This is not a solution using to_timestamp, but it does keep your column in timestamp format.
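If you only need the value rendered with three fractional digits, a simpler route may also work (a minimal sketch, not from the answer above, assuming Spark 3.x): cast the string to timestamp, which preserves the fractional seconds, and render it with date_format:

from pyspark.sql import functions as F

# casting the string keeps the fractional seconds in the TimestampType column;
# date_format then renders the millisecond part for display
df2 = df.withColumn("updated_ts", F.col("updated_date").cast("timestamp"))
df2 = df2.withColumn("updated_date_col2",
                     F.date_format("updated_ts", "yyyy-MM-dd HH:mm:ss.SSS"))
df2.show(1, False)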
The following code is one example of converting a numeric epoch value to a timestamp.
from datetime import datetime
ms = datetime.now().timestamp()  # e.g. 1547521021.83301 (epoch seconds, with a fractional part)
df = spark.createDataFrame([(1, ms)], ['obs', 'time'])
df = df.withColumn('time', df.time.cast("timestamp"))
df.show(1, False)
+---+--------------------------+
|obs|time |
+---+--------------------------+
|1 |2019-01-15 12:15:49.565263|
+---+--------------------------+
If you use new Date().getTime() or Date.now() in JS you get epoch milliseconds, while datetime.datetime.now().timestamp() in Python gives epoch seconds; divide milliseconds by 1000 before casting.
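A minimal sketch of that conversion (the column name epoch_ms is an assumption): Spark's cast to timestamp interprets a numeric value as seconds, so epoch milliseconds need to be divided by 1000 first.

from pyspark.sql import functions as F

# epoch milliseconds, e.g. from Date.now() in JavaScript
df_ms = spark.createDataFrame([(1, 1547521021833)], ['obs', 'epoch_ms'])

# divide by 1000 so the cast (which expects seconds) keeps the millisecond part
df_ms = df_ms.withColumn('time', (F.col('epoch_ms') / 1000).cast('timestamp'))
df_ms.show(truncate=False)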
The reason: pyspark's to_timestamp parses only up to seconds, while TimestampType can hold milliseconds.
The following workaround may work:
If the timestamp pattern contains S, invoke a UDF to get the string 'INTERVAL MILLISECONDS' to use in an expression.
ts_pattern = "YYYY-MM-dd HH:mm:ss:SSS"
my_col_name = "time_with_ms"

# get the time up to seconds
df = df.withColumn(my_col_name, to_timestamp(df["updated_date_col2"], ts_pattern))

# add the milliseconds as an interval
if 'S' in ts_pattern:
    df = df.withColumn(my_col_name, df[my_col_name] + expr("INTERVAL 256 MILLISECONDS"))
To get the INTERVAL 256 MILLISECONDS string dynamically, we may use a Java UDF:
df = df.withColumn(my_col_name, df[my_col_name] + expr(getIntervalStringUDF(df[my_col_name], ts_pattern)))
Inside the UDF, getIntervalStringUDF(String timeString, String pattern):
Use SimpleDateFormat to parse the date according to the pattern
Return the formatted date as a string using the pattern "'INTERVAL 'SSS' MILLISECONDS'"
Return 'INTERVAL 0 MILLISECONDS' on parse/format exceptions
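A hedged Python sketch of that UDF's logic (the answer describes a Java UDF with SimpleDateFormat; this mirrors it with Python's datetime and is only an illustration, not the original code):

import datetime

def get_interval_string(time_string, pattern='%Y-%m-%d %H:%M:%S.%f'):
    # parse the string, keep only the millisecond part, and wrap it in an INTERVAL literal
    try:
        parsed = datetime.datetime.strptime(time_string, pattern)
        millis = parsed.microsecond // 1000
        return "INTERVAL {} MILLISECONDS".format(millis)
    except (ValueError, TypeError):
        return "INTERVAL 0 MILLISECONDS"

get_interval_string("2019-01-04 11:09:21.152815")  # 'INTERVAL 152 MILLISECONDS'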
I am trying to create an array of dates containing all months from a minimum date to a maximum date.
Example:
min_date = "2021-05-31"
max_date = "2021-11-30"
.withColumn('array_date', F.expr('sequence(to_date(min_date), to_date(max_date), interval 1 month)'))
But it gives me the following Output:
['2021-05-31', '2021-06-30', '2021-07-31', '2021-08-31', '2021-09-30', '2021-10-31']
Why doesn't the upper limit appear on 11/30/2021? In the documentation, it says that the extremes are included.
My desired output is:
['2021-05-31', '2021-06-30', '2021-07-31', '2021-08-31', '2021-09-30', '2021-10-31', '2021-11-30']
Thank you!
I think this is related to the timezone. I can reproduce the same behavior in my timezone Europe/Paris, but when setting the timezone to UTC it gives the expected result:
from pyspark.sql import functions as F
spark.conf.set("spark.sql.session.timeZone", "UTC")
df = spark.createDataFrame([("2021-05-31", "2021-11-30")], ["min_date", "max_date"])
df.withColumn(
"array_date",
F.expr("sequence(to_date(min_date), to_date(max_date), interval 1 month)")
).show(truncate=False)
#+----------+----------+------------------------------------------------------------------------------------+
#|min_date |max_date |array_date |
#+----------+----------+------------------------------------------------------------------------------------+
#|2021-05-31|2021-11-30|[2021-05-31, 2021-06-30, 2021-07-31, 2021-08-31, 2021-09-30, 2021-10-31, 2021-11-30]|
#+----------+----------+------------------------------------------------------------------------------------+
Alternatively, you can use TimestampType for start and end parameters of the sequence instead of DateType:
df.withColumn(
"array_date",
F.expr("sequence(to_timestamp(min_date), to_timestamp(max_date), interval 1 month)").cast("array<date>")
).show(truncate=False)
In ISO 8601, durations are in the format PT5M (5 minutes) or PT2H5M (2 hours 5 minutes). I have a JSON file that contains values in such a format. I wanted to know if Spark can extract the duration in minutes. I tried to read it as "DateType" and used the "minutes" function to get the minutes, but it returned null values.
Example json
{"name": "Fennel Mushrooms","cookTime":"PT30M"}
Currently, I am reading it as a string and using the "regexp_extract" function (a sketch of that approach follows below). I wanted to know if there is a more efficient way.
https://www.digi.com/resources/documentation/digidocs/90001437-13/reference/r_iso_8601_duration_format.htm
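For reference, a minimal sketch of the regexp_extract approach mentioned above (the column name and patterns are assumptions, and this simple version ignores days and seconds):

from pyspark.sql import functions as F

df = spark.createDataFrame([("PT30M",), ("PT2H5M",)], ["cookTime"])

# pull the hour and minute components out of the ISO 8601 string;
# regexp_extract returns '' when a group is absent, which casts to null
hours = F.coalesce(F.regexp_extract("cookTime", r"(\d+)H", 1).cast("int"), F.lit(0))
minutes = F.coalesce(F.regexp_extract("cookTime", r"(\d+)M", 1).cast("int"), F.lit(0))
df.withColumn("duration_in_minutes", hours * 60 + minutes).show()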
Spark does not provide a way to convert an ISO 8601 duration into an interval, and neither does timedelta in Python's datetime library.
However, pd.Timedelta can parse ISO 8601 durations into time deltas. To support a wider range of ISO 8601 durations, we can wrap pd.Timedelta in a pandas_udf:
from pyspark.sql import functions as F
import pandas as pd

df = spark.createDataFrame([("PT5M",), ("PT50M",), ("PT2H5M",)], ("duration",))

@F.pandas_udf("int")
def parse_iso8601_duration(str_duration: pd.Series) -> pd.Series:
    return str_duration.apply(lambda duration: (pd.Timedelta(duration).seconds / 60))

df.withColumn("duration_in_minutes", parse_iso8601_duration(F.col("duration"))).show()
Output
+--------+-------------------+
|duration|duration_in_minutes|
+--------+-------------------+
| PT5M| 5|
| PT50M| 50|
| PT2H5M| 125|
+--------+-------------------+
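One caveat worth noting (my addition, not part of the answer): Timedelta.seconds only covers the portion of the delta below one day, so a duration such as P1DT5M would lose the day. Using total_seconds() avoids that:

@F.pandas_udf("int")
def parse_iso8601_duration(str_duration: pd.Series) -> pd.Series:
    # total_seconds() includes the days component, unlike .seconds
    return str_duration.apply(lambda d: int(pd.Timedelta(d).total_seconds() // 60))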
I want to cast a string to timestamp. The problem I'm facing is that the string shows the 1st three letters of the month, rather than the month number:
E.g. 31-JAN-20 12.03.48.759214 AM
Is there any smart way to convert the above value into something like this?
2020-01-31T12:03:48.000+0000
Thanks
Use to_timestamp to convert the string into timestamp type, then use date_format to get the desired pattern:
from pyspark.sql import functions as F
df = spark.createDataFrame([("31-JAN-20 12.03.48.759214 AM",)], ["date"])
df.withColumn(
"date2",
F.date_format(
F.to_timestamp("date", "dd-MMM-yy h.mm.ss.SSSSSS a"),
"yyyy-MM-dd'T'HH:mm:ss.SSS Z"
)
).show(truncate=False)
#+----------------------------+-----------------------------+
#|date |date2 |
#+----------------------------+-----------------------------+
#|31-JAN-20 12.03.48.759214 AM|2020-01-31T00:03:48.759 +0100|
#+----------------------------+-----------------------------+
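A brief follow-up (an assumption about the exact output wanted, not part of the answer): the trailing +0100 comes from the session timezone, so to get the +0000 form from the question you could set the session timezone to UTC and drop the space before Z in the output pattern, reusing the df and imports from the snippet above:

spark.conf.set("spark.sql.session.timeZone", "UTC")

df.withColumn(
    "date2",
    F.date_format(
        F.to_timestamp("date", "dd-MMM-yy h.mm.ss.SSSSSS a"),
        "yyyy-MM-dd'T'HH:mm:ss.SSSZ"
    )
).show(truncate=False)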
I have one field in a pandas DataFrame that was imported as string format.
It should be a datetime variable. How do I convert it to a datetime column and then filter based on date?
Example:
df = pd.DataFrame({'date': ['05SEP2014:00:00:00.000']})
Use the to_datetime function, specifying a format to match your data.
raw_data['Mycol'] = pd.to_datetime(raw_data['Mycol'], format='%d%b%Y:%H:%M:%S.%f')
If you have more than one column to be converted you can do the following:
df[["col1", "col2", "col3"]] = df[["col1", "col2", "col3"]].apply(pd.to_datetime)
You can use the DataFrame method .apply() to operate on the values in Mycol:
>>> df = pd.DataFrame(['05SEP2014:00:00:00.000'],columns=['Mycol'])
>>> df
Mycol
0 05SEP2014:00:00:00.000
>>> import datetime as dt
>>> df['Mycol'] = df['Mycol'].apply(lambda x:
dt.datetime.strptime(x,'%d%b%Y:%H:%M:%S.%f'))
>>> df
Mycol
0 2014-09-05
Use the pandas to_datetime function to parse the column as DateTime. With infer_datetime_format=True it will automatically detect the format and convert the column to DateTime (note that infer_datetime_format is deprecated as of pandas 2.0, where strict format inference is the default).
import pandas as pd
raw_data['Mycol'] = pd.to_datetime(raw_data['Mycol'], infer_datetime_format=True)
chrisb's answer works:
raw_data['Mycol'] = pd.to_datetime(raw_data['Mycol'], format='%d%b%Y:%H:%M:%S.%f')
however, it results in a pandas SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
I would guess this is due to some chained indexing.
Time Saver:
raw_data['Mycol'] = pd.to_datetime(raw_data['Mycol'])
To silence SettingWithCopyWarning
If you got this warning, then that means your dataframe was probably created by filtering another dataframe. Make a copy of your dataframe before any assignment and you're good to go.
df = df.copy()
df['date'] = pd.to_datetime(df['date'], format='%d%b%Y:%H:%M:%S.%f')
errors='coerce' is useful
If some rows are not in the correct format or are not datetime at all, the errors= parameter is very useful, so that you can convert the valid rows and handle the rows that contain invalid values later.
df['date'] = pd.to_datetime(df['date'], format='%d%b%Y:%H:%M:%S.%f', errors='coerce')
# for multiple columns
df[['start', 'end']] = df[['start', 'end']].apply(pd.to_datetime, format='%d%b%Y:%H:%M:%S.%f', errors='coerce')
Setting the correct format= is much faster than letting pandas find it out¹
Long story short, passing the correct format= from the beginning as in chrisb's post is much faster than letting pandas figure out the format, especially if the format contains a time component. The runtime difference for dataframes greater than 10k rows is huge (~25 times faster, so we're talking a couple of minutes vs. a few seconds). All valid format options can be found at https://strftime.org/.
¹ Code used to produce the timeit test plot.
import pandas as pd
import perfplot
from random import choices
from datetime import datetime

mdYHMSf = range(1, 13), range(1, 29), range(2000, 2024), range(24), *[range(60)]*2, range(1000)

perfplot.show(
    kernels=[lambda x: pd.to_datetime(x),
             lambda x: pd.to_datetime(x, format='%m/%d/%Y %H:%M:%S.%f'),
             lambda x: pd.to_datetime(x, infer_datetime_format=True),
             lambda s: s.apply(lambda x: datetime.strptime(x, '%m/%d/%Y %H:%M:%S.%f'))],
    labels=["pd.to_datetime(df['date'])",
            "pd.to_datetime(df['date'], format='%m/%d/%Y %H:%M:%S.%f')",
            "pd.to_datetime(df['date'], infer_datetime_format=True)",
            "df['date'].apply(lambda x: datetime.strptime(x, '%m/%d/%Y %H:%M:%S.%f'))"],
    n_range=[2**k for k in range(20)],
    setup=lambda n: pd.Series([f"{m}/{d}/{Y} {H}:{M}:{S}.{f}"
                               for m, d, Y, H, M, S, f in zip(*[choices(e, k=n) for e in mdYHMSf])]),
    equality_check=pd.Series.equals,
    xlabel='len(df)'
)
Just like we convert the object data type to float or int, use astype():
raw_data['Mycol']=raw_data['Mycol'].astype('datetime64[ns]')
I have a pandas dataframe with columns containing start and stop times in this format: 2016-01-01 00:00:00
I would like to convert these times to datetime objects so that I can subtract one from the other to compute total duration. I'm using the following:
import datetime
df['start_time'] = df['start_time'].apply(lambda x: datetime.datetime.strptime(x, '%Y/%m/%d/%T %I:%M:%S %p'))
However, I have the following ValueError:
ValueError: 'T' is a bad directive in format '%Y/%m/%d/%T %I:%M:%S %p'
This would convert the column into datetime64 dtype. Then you could process whatever you need using that column.
df['start_time'] = pd.to_datetime(df['start_time'], format="%Y-%m-%d %H:%M:%S")
Also, if you want to avoid explicitly specifying the datetime format, you can use the following:
df['start_time'] = pd.to_datetime(df['start_time'], infer_datetime_format=True)
Simplest is to use to_datetime:
df['start_time'] = pd.to_datetime(df['start_time'])
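As a hedged follow-up for the duration computation the question asks about (the stop_time column name is assumed), once both columns are datetime64 you can simply subtract them:

import pandas as pd

df = pd.DataFrame({'start_time': ['2016-01-01 00:00:00'],
                   'stop_time':  ['2016-01-01 02:30:00']})
df['start_time'] = pd.to_datetime(df['start_time'])
df['stop_time'] = pd.to_datetime(df['stop_time'])

# subtracting two datetime64 columns yields a Timedelta column
df['duration'] = df['stop_time'] - df['start_time']
df['duration_minutes'] = df['duration'].dt.total_seconds() / 60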