Why is Spark not recognizing this time format? - apache-spark

I get null for the timestamp 27-04-2021 14:11 with this code. What mistake am I making? Why is the timestamp format string DD-MM-yyyy HH:mm not correct here?
from pyspark.sql.functions import to_timestamp

df = spark.createDataFrame([('27-04-2021 14:11',)], ['t'])
df = df.select(to_timestamp(df.t, 'DD-MM-yyyy HH:mm').alias('dt'))
display(df)

D is for day of the year, and d is for day of the month.
Try this:
df = df.select(to_timestamp(df.t, "dd-MM-yyyy HH:mm").alias("dt"))
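For reference, a self-contained version of the fix (assuming an active SparkSession named spark, as in the question):
from pyspark.sql.functions import to_timestamp

df = spark.createDataFrame([('27-04-2021 14:11',)], ['t'])
df.select(to_timestamp(df.t, 'dd-MM-yyyy HH:mm').alias('dt')).show()
# +-------------------+
# |                 dt|
# +-------------------+
# |2021-04-27 14:11:00|
# +-------------------+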

Related

How can I convert a specific string date to date or datetime in Spark?

I have this string pattern in my Spark dataframe: 'Sep 14, 2014, 1:34:36 PM'.
I want to convert this to date or datetime format, using Databricks and Spark.
I've already tried the cast and to_date functions, but nothing works and I get a null return every time.
How can I do that?
Thanks in advance!
If we create a DataFrame like this:
import org.apache.spark.sql.functions.{col, to_timestamp}
import spark.implicits._

var ds = spark.sparkContext.parallelize(Seq(
  "Sep 14, 2014, 01:34:36 PM"
)).toDF("date")
and apply the following statement (note the single a for the AM/PM marker; aa fails on Spark 3's DateTimeFormatter-based parser, while a works on both old and new parsers):
ds = ds.withColumn("casted", to_timestamp(col("date"), "MMM dd, yyyy, hh:mm:ss a"))
You get this result:
+-------------------------+-------------------+
|date |casted |
+-------------------------+-------------------+
|Sep 14, 2014, 01:34:36 PM|2014-09-14 13:34:36|
+-------------------------+-------------------+
which should be what you need. You can also use to_date or other APIs that take a datetime format. Good luck!
Your timestamp string itself is the issue: it has a single-digit hour, 1, where the two-digit hh pattern expects 01.
#
# 1 - Create sample dataframe + view
#

# required library
from pyspark.sql.functions import col, to_timestamp

# array of tuples - data
dat1 = [
    ("1", "Sep 14, 2014, 01:34:36 pm")
]

# array of names - columns
col1 = ["row_id", "date_string1"]

# make data frame
df1 = spark.createDataFrame(data=dat1, schema=col1)

# convert the string column to a timestamp
df1 = df1.withColumn("time_stamp1", to_timestamp(col("date_string1"), "MMM dd, yyyy, hh:mm:ss a"))

# show schema
df1.printSchema()

# show data
display(df1)
This code produces the correct answer. If the data has a single-digit hour such as 1:34:36, the two-digit hh pattern fails to parse it; you can use a when clause to pick the correct conversion, as sketched below.
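A minimal sketch of that when-clause idea, assuming only these two hour widths occur in the data (the 25-character length test is a simplification that happens to fit this sample format):
from pyspark.sql.functions import col, length, to_timestamp, when

dat1 = [
    ("1", "Sep 14, 2014, 01:34:36 pm"),
    ("2", "Sep 14, 2014, 1:34:36 pm"),
]
df1 = spark.createDataFrame(data=dat1, schema=["row_id", "date_string1"])

# pick the pattern that matches the hour width of each row
df1 = df1.withColumn(
    "time_stamp1",
    when(
        length(col("date_string1")) == 25,  # two-digit hour, e.g. 01:34:36
        to_timestamp(col("date_string1"), "MMM dd, yyyy, hh:mm:ss a"),
    ).otherwise(
        to_timestamp(col("date_string1"), "MMM dd, yyyy, h:mm:ss a")
    ),
)
df1.show(truncate=False)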

How to convert a string in hh:mm:ss format to timestamp type without the year-month-day info in PySpark?

I am trying to convert a string of type hh:mm:ss to a timestamp type without the year-month-day info. Below is my code; however, it still pops out the 1970-01-01 info.
import pyspark
from pyspark.sql.functions import *

df1 = spark.createDataFrame([('10:30:00',)], ['date'])
df2 = df1.withColumn("new_date", to_timestamp("date", 'HH:mm:ss'))
df2.show(2)
Sample output: 1970-01-01 10:30:00
How can I ignore the year-month-day info in this case? Can someone please help?
Thanks a lot
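A Spark timestamp always carries a full date, so the 1970-01-01 part cannot be removed from the timestamp type itself. A minimal sketch of one workaround, assuming a time-of-day string is acceptable as the result:
from pyspark.sql.functions import date_format, to_timestamp

df1 = spark.createDataFrame([('10:30:00',)], ['date'])

# parse to a timestamp, then render only the time-of-day back out;
# the result is a string column, since timestamp values always include a date
df2 = df1.withColumn(
    "new_date",
    date_format(to_timestamp("date", "HH:mm:ss"), "HH:mm:ss"),
)
df2.show()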

How to convert a datetime column to the first day of the month?

I have a PySpark dataframe with a column that has datetime values in the format '09/19/2020 09:27:18 AM'.
I want to convert it to the first day of the month, in the dd-MMM-yyyy format (e.g. 01-Sep-2020 for the value above).
I have tried F.trunc("date_col", "month"), which results in a null date,
and
df_result = df_result.withColumn('gl_date', F.udf(lambda d: datetime.datetime.strptime(d, '%MM/%dd/%yyyy %HH:%mm:%S a').strftime('%Y/%m/1'), t.StringType())(F.col('date_col')))
The second method errors with: date format '%MM/%dd/%yyyy %HH:%mm:%S a' is not matched with '09/19/2020 09:27:18 AM'.
You can convert the column to timestamp type before calling trunc:
import pyspark.sql.functions as F
df_result2 = df_result.withColumn(
    'gl_date',
    F.date_format(
        F.trunc(
            F.to_timestamp("date_col", "MM/dd/yyyy hh:mm:ss a"),
            "month"
        ),
        "dd-MMM-yyyy"
    )
)
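A quick self-contained check of this answer on the sample value from the question (the single-column frame here is hypothetical):
import pyspark.sql.functions as F

df_result = spark.createDataFrame([('09/19/2020 09:27:18 AM',)], ['date_col'])
df_result2 = df_result.withColumn(
    'gl_date',
    F.date_format(
        F.trunc(F.to_timestamp('date_col', 'MM/dd/yyyy hh:mm:ss a'), 'month'),
        'dd-MMM-yyyy'
    )
)
df_result2.show()  # expected gl_date: 01-Sep-2020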

I want to convert 4/1/2019 to 1/4/2019 in my dataframe column

I want to swap the month and the day in a date column of a dataframe.
I have tried all of the methods below:
#df_combined1['BILLING_START_DATE_x'] = pd.to_datetime(df_combined1['BILLING_START_DATE_x'], format='%d/%m/%Y').dt.strftime('%d-%m-%Y')
#df_combined1['BILLING_START_DATE_x'] = df_combined1['BILLING_START_DATE_x'].apply(lambda x: dt.datetime.strftime(x, '%d-%m-%Y'))
#df_combined1['BILLING_START_DATE_x'] = pd.to_datetime(df_combined1['BILLING_START_DATE_x'], format='%m-%d-%Y')
I need to swap the month and the day.
If all the datetimes are in DD/MM/YYYY format:
df_combined1['BILLING_START_DATE_x'] = (pd.to_datetime(df_combined1['BILLING_START_DATE_x'],
                                                       format='%d/%m/%Y')
                                          .dt.strftime('%m/%d/%Y'))
If all the datetimes are in MM/DD/YYYY format:
df_combined1['BILLING_START_DATE_x'] = (pd.to_datetime(df_combined1['BILLING_START_DATE_x'],
                                                       format='%m/%d/%Y')
                                          .dt.strftime('%d/%m/%Y'))
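A small round-trip check of the second case (hypothetical one-column frame; note that strftime zero-pads, so 4/1/2019 comes back as 01/04/2019):
import pandas as pd

df_combined1 = pd.DataFrame({'BILLING_START_DATE_x': ['4/1/2019']})

# parse as MM/DD/YYYY (April 1st), then render as DD/MM/YYYY
df_combined1['BILLING_START_DATE_x'] = (pd.to_datetime(df_combined1['BILLING_START_DATE_x'],
                                                       format='%m/%d/%Y')
                                          .dt.strftime('%d/%m/%Y'))
print(df_combined1)  # -> 01/04/2019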

Subtracting two date columns in PySpark Python

I am trying to subtract two timestamp columns in a PySpark dataframe in Python, and I have run into a number of problems doing it. The columns are of timestamp type: date1 = 2011-01-03 13:25:59 and date2 = 2011-01-03 13:27:00. I want to compute date2 - date1 from those dataframe columns and make a separate timeDiff column showing the difference between them, such as timeDiff = 00:01:01.
How can I do this in PySpark?
I tried the following code:
#timeDiff = df.withColumn(('timeDiff', col(df['date2']) - col(df['date1'])))
This code didn't work.
I then tried this simple thing:
timeDiff = df['date2'] - df['date1']
This actually worked, but when I then tried to add it as a separate column to my dataframe with the following piece of code:
df = df.withColumn("Duration", timeDiff)
I got the following error:
Py4JJavaError: An error occurred while calling o107.withColumn.
: org.apache.spark.sql.AnalysisException: cannot resolve '(`date2` - `date1`)' due to data type mismatch: '(`date2` - `date1`)' requires (numeric or calendarinterval) type, not timestamp;;
Can anyone help me with another method, or show how I can resolve this error?
from pyspark.sql.functions import unix_timestamp

# sample data
df = sc.parallelize([
    ['2011-01-03 13:25:59', '2011-01-03 13:27:00'],
    ['2011-01-03 3:25:59', '2011-01-03 3:30:00']
]).toDF(('date1', 'date2'))

timeDiff = (unix_timestamp('date2', "yyyy-MM-dd HH:mm:ss")
            - unix_timestamp('date1', "yyyy-MM-dd HH:mm:ss"))
df = df.withColumn("Duration", timeDiff)
df.show()
Output is:
+-------------------+-------------------+--------+
| date1| date2|Duration|
+-------------------+-------------------+--------+
|2011-01-03 13:25:59|2011-01-03 13:27:00| 61|
| 2011-01-03 3:25:59| 2011-01-03 3:30:00| 241|
+-------------------+-------------------+--------+
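Duration above is in seconds, while the question asked for a 00:01:01-style value. A minimal sketch of one way to render it, assuming each gap is under 24 hours:
import pyspark.sql.functions as F

# format the integer second count as HH:mm:ss
df = df.withColumn(
    "timeDiff",
    F.format_string(
        "%02d:%02d:%02d",
        (F.col("Duration") / 3600).cast("int"),
        ((F.col("Duration") % 3600) / 60).cast("int"),
        (F.col("Duration") % 60).cast("int"),
    ),
)
df.show()  # Duration 61 -> 00:01:01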
Agreed with the answer above, and thanks!
But I think it may need to change to:
timeDiff = (unix_timestamp(F.col('date2'), "yyyy-MM-dd HH:mm:ss") - unix_timestamp(F.col('date1'), "yyyy-MM-dd HH:mm:ss"))
Given
import pyspark.sql.functions as F
