Subtracting two date columns in PySpark (Python 3)

I am trying to subtract two columns in a PySpark DataFrame in Python and have run into a number of problems. Both columns are of timestamp type, e.g. date1 = 2011-01-03 13:25:59 and date2 = 2011-01-03 13:27:00. I want to compute date2 - date1 from those DataFrame columns and put the result in a separate timeDiff column that shows the difference between the two, e.g. timeDiff = 00:01:01.
How can I do this in PySpark?
I tried the following code:
#timeDiff = df.withColumn(('timeDiff', col(df['date2']) - col(df['date1'])))
This code didn't work.
Then I tried this simpler approach:
timeDiff = df['date2'] - df['date1']
This actually worked, but when I then tried to add the result as a separate column to my DataFrame with the following piece of code
df = df.withColumn("Duration", timeDiff)
I got the following error:
Py4JJavaError: An error occurred while calling o107.withColumn.
: org.apache.spark.sql.AnalysisException: cannot resolve '(`date2` - `date1`)' due to data type mismatch: '(`date2` - `date1`)' requires (numeric or calendarinterval) type, not timestamp;;
Can anyone suggest another method, or tell me how I can resolve this error?

from pyspark.sql.functions import unix_timestamp
# sample data
df = sc.parallelize([
    ['2011-01-03 13:25:59', '2011-01-03 13:27:00'],
    ['2011-01-03 3:25:59', '2011-01-03 3:30:00']
]).toDF(('date1', 'date2'))

timeDiff = (unix_timestamp('date2', "yyyy-MM-dd HH:mm:ss")
            - unix_timestamp('date1', "yyyy-MM-dd HH:mm:ss"))
df = df.withColumn("Duration", timeDiff)
df.show()
Output is:
+-------------------+-------------------+--------+
| date1| date2|Duration|
+-------------------+-------------------+--------+
|2011-01-03 13:25:59|2011-01-03 13:27:00| 61|
| 2011-01-03 3:25:59| 2011-01-03 3:30:00| 241|
+-------------------+-------------------+--------+
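If you want the difference rendered as HH:mm:ss (the timeDiff = 00:01:01 style asked for) rather than raw seconds, one option is to format the seconds yourself. A minimal sketch, assuming Duration holds whole seconds and stays under 24 hours:
from pyspark.sql import functions as F

# Format the whole-second Duration column as HH:mm:ss
df = df.withColumn(
    "timeDiff",
    F.format_string(
        "%02d:%02d:%02d",
        (F.col("Duration") / 3600).cast("int"),         # hours
        ((F.col("Duration") % 3600) / 60).cast("int"),  # minutes
        (F.col("Duration") % 60).cast("int")            # seconds
    )
)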

Agreed with the answer above, and thanks! But I think the calls may need to change to:
timeDiff = (unix_timestamp(F.col('date2'), "yyyy-MM-dd HH:mm:ss") - unix_timestamp(F.col('date1'), "yyyy-MM-dd HH:mm:ss"))
Given
import pyspark.sql.functions as F

Related

How to cast a string column to date having two different types of date formats in Pyspark

I have a DataFrame column of type string that contains dates. I want to cast the column from string to date, but it contains two different date formats.
I tried using the to_date function, but it is not working as expected and returns null values after the function is applied.
The two formats appearing in the string column are like 9/1/2022 (M/d/yyyy) and 2022-11-24 (yyyy-MM-dd).
Please let me know how to solve this issue and get the date column in a single format.
Thanks in advance.
You can use pyspark.sql.functions.coalesce to return the first non-null result in a list of columns. So the trick here is to parse using multiple formats and take the first non-null one:
from pyspark.sql import functions as F
df = spark.createDataFrame([
    ("9/1/2022",),
    ("2022-11-24",),
], ["Alert Release Date"])
x = F.col("Alert Release Date")
df.withColumn("date", F.coalesce(F.to_date(x, "M/d/yyyy"), F.to_date(x, "yyyy-MM-dd"))).show()
+------------------+----------+
|Alert Release Date| date|
+------------------+----------+
| 9/1/2022|2022-09-01|
| 2022-11-24|2022-11-24|
+------------------+----------+
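If more than two formats can appear, the same trick scales by building the coalesce from a list of candidate formats. A small sketch (the formats list here is just an example; extend it with whatever formats you expect):
from pyspark.sql import functions as F

formats = ["M/d/yyyy", "yyyy-MM-dd"]  # example list; add any other expected formats
parsed = F.coalesce(*[F.to_date(F.col("Alert Release Date"), fmt) for fmt in formats])
df.withColumn("date", parsed).show()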

Why Spark is not recognizing this time format?

I get null for the timestamp 27-04-2021 14:11 with this code. What am I doing wrong? Why is the timestamp format string DD-MM-yyyy HH:mm not correct here?
from pyspark.sql.functions import to_timestamp

df = spark.createDataFrame([('27-04-2021 14:11',)], ['t'])
df = df.select(to_timestamp(df.t, 'DD-MM-yyyy HH:mm').alias('dt'))
display(df)
D is for day of the year, and d is for day of the month.
Try this:
df = df.select(F.to_timestamp(df.t, "dd-MM-yyyy HH:mm").alias("dt"))
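For completeness, a minimal self-contained version of the fix, assuming an active SparkSession named spark:
from pyspark.sql import functions as F

df = spark.createDataFrame([('27-04-2021 14:11',)], ['t'])
df.select(F.to_timestamp(df.t, "dd-MM-yyyy HH:mm").alias("dt")).show(truncate=False)
# dt should now come out as 2021-04-27 14:11:00 instead of null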

Transform a day of the year to day_month format

So I was wondering if it's possible with PySpark to transform a random day of the year (0-365) to day-month format. In my case, the input would be a string.
Example:
Input : "091"
Expected output (month-day): "0331"
This is possible, but you also need the year. Convert the year to a date (the first of January of that year), add the days to get the desired result, then format it.
Here's a working example
from pyspark.sql import functions as F
df = spark.createDataFrame([("2020", "091")], ["year", "day_of_year"])
df1 = df.withColumn(
    "first_day_year",
    F.concat_ws("-", "year", F.lit("01"), F.lit("01"))
).withColumn(
    "day_month",
    F.date_format(
        F.expr("date_add(first_day_year, cast(day_of_year as int) - 1)"),
        "MMdd"
    )
).drop("first_day_year")
df1.show()
#+----+-----------+---------+
#|year|day_of_year|day_month|
#+----+-----------+---------+
#|2020| 091| 0331|
#+----+-----------+---------+
You can use date_add to add the number of days to New Year's Day. (This snippet assumes a DataFrame df with a single day column holding the day of the year, as in the output below.)
import pyspark.sql.functions as F
df2 = df.withColumn(
    'day_month',
    F.expr("date_format(date_add('2020-01-01', int(day - 1)), 'MMdd')")
)
df2.show()
+---+---------+
|day|day_month|
+---+---------+
| 91| 0331|
+---+---------+
Note that the result will vary depending on whether it's a leap year or not.
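As a variant (not from the answers above), Spark 3.0+ also has make_date, which builds the first day of the year from the year column directly, so leap years are handled per row. A sketch on the same sample data:
from pyspark.sql import functions as F

df = spark.createDataFrame([("2020", "091")], ["year", "day_of_year"])
df.withColumn(
    "day_month",
    F.date_format(
        F.expr("date_add(make_date(cast(year as int), 1, 1), cast(day_of_year as int) - 1)"),
        "MMdd"
    )
).show()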

Filter pyspark by time difference

I have a dataframe in pyspark that looks like this:
+----------+-------------------+-------+-----------------------+-----------------------+--------+
|Session_Id|Instance_Id |Actions|Start_Date |End_Date |Duration|
+----------+-------------------+-------+-----------------------+-----------------------+--------+
|14252203 |i-051fc2d21fbe001e3|2 |2019-12-17 01:07:30.000|2019-12-17 01:07:30.000|0 |
|43024091 |i-051fc2d21fbe001e3|2 |2019-12-17 01:08:00.000|2019-12-17 01:08:00.000|0 |
|50961995 |i-0c733c7e356bc1615|2 |2019-12-17 01:07:30.000|2019-12-17 01:07:30.000|0 |
|56308963 |i-0c733c7e356bc1615|2 |2019-12-17 01:08:00.000|2019-12-17 01:08:00.000|0 |
|60120472 |i-0c733c7e356bc1615|2 |2019-12-17 01:07:30.000|2019-12-17 01:07:30.000|0 |
|69132492 |i-051fc2d21fbe001e3|2 |2019-12-17 01:07:30.000|2019-12-17 01:07:30.000|0 |
+----------+-------------------+-------+-----------------------+-----------------------+--------+
I'm trying to filter any rows that are too recent with this:
now = datetime.datetime.now()
filtered = grouped.filter(f.abs(f.unix_timestamp(now) - f.unix_timestamp(datetime.datetime.strptime(f.col('End_Date')[:-4], '%Y-%m-%d %H:%M:%S'))) > 100)
This is supposed to transform End_Date into a timestamp, calculate the difference between now and End_Date, and filter out anything less than 100 seconds. I adapted it from Filter pyspark dataframe based on time difference between two columns.
Every time I run this, I get this error:
TypeError: Invalid argument, not a string or column: 2019-12-19 18:55:13.268489 of type <type 'datetime.datetime'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
How can I filter by comparing timestamps?
I think you're confusing Python functions with Spark functions. The unix_timestamp function requires a string or a Column object, but you're passing a Python datetime object; that's why you get that error.
Instead, use the Spark built-in functions: current_date, which gives you a column with the current date, and to_date, which converts the End_Date column to a date.
This should work fine for you:
from pyspark.sql.functions import abs, col, current_date, to_date, unix_timestamp

filtered = grouped.filter(
    abs(unix_timestamp(current_date()) - unix_timestamp(to_date(col('End_Date'), 'yyyy-MM-dd HH:mm:ss'))) > 100
)
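Note that current_date and to_date truncate to the day, so the comparison above is only day-accurate. If you need second-level precision against End_Date values like 2019-12-17 01:07:30.000, here is a sketch using current_timestamp and to_timestamp instead (grouped is the questioner's DataFrame):
from pyspark.sql import functions as F

filtered = grouped.filter(
    F.abs(
        F.unix_timestamp(F.current_timestamp())
        - F.unix_timestamp(F.to_timestamp(F.col("End_Date"), "yyyy-MM-dd HH:mm:ss.SSS"))
    ) > 100
)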

PySpark - to_date format from column

I am currently trying to figure out how to pass the string format argument to the to_date PySpark function via a column parameter.
Specifically, I have the following setup:
sc = SparkContext.getOrCreate()
df = sc.parallelize([('a','2018-01-01','yyyy-MM-dd'),
                     ('b','2018-02-02','yyyy-MM-dd'),
                     ('c','02-02-2018','dd-MM-yyyy')]).toDF(
    ["col_name","value","format"])
I am now trying to add a new column in which each date from the column F.col("value"), which is a string, is parsed to a date.
Separately for each format, this can be done with
df = df.withColumn("test1",F.to_date(F.col("value"),"yyyy-MM-dd")).\
withColumn("test2",F.to_date(F.col("value"),"dd-MM-yyyy"))
This, however, gives me two new columns, but I want a single column containing both results. Passing the format as a column does not seem to be possible with the to_date function:
df = df.withColumn("test3",F.to_date(F.col("value"),F.col("format")))
Here, a "Column object not callable" error is thrown.
Is it possible to have a generic approach for all possible formats (so that I do not have to manually add new columns for each format)?
You can use a column value as a parameter without a udf using the spark-sql syntax:
Spark version 2.2 and above
from pyspark.sql.functions import expr
df.withColumn("test3",expr("to_date(value, format)")).show()
#+--------+----------+----------+----------+
#|col_name| value| format| test3|
#+--------+----------+----------+----------+
#| a|2018-01-01|yyyy-MM-dd|2018-01-01|
#| b|2018-02-02|yyyy-MM-dd|2018-02-02|
#| c|02-02-2018|dd-MM-yyyy|2018-02-02|
#+--------+----------+----------+----------+
Or equivalently using pyspark-sql:
df.createOrReplaceTempView("df")
spark.sql("select *, to_date(value, format) as test3 from df").show()
Spark version 1.5 and above
Older versions of Spark do not support passing a format argument to the to_date function, so you'll have to use unix_timestamp and from_unixtime:
from pyspark.sql.functions import expr
df.withColumn(
    "test3",
    expr("from_unixtime(unix_timestamp(value,format))").cast("date")
).show()
Or equivalently using pyspark-sql:
df.createOrReplaceTempView("df")
spark.sql(
"select *, cast(from_unixtime(unix_timestamp(value,format)) as date) as test3 from df"
).show()
As far as I know, your problem requires a udf (user-defined function) to apply the correct format per row. Inside a udf, however, you cannot directly use Spark functions like to_date, so I created a small workaround. First, the udf performs the Python date conversion with the format taken from the column and converts the value to ISO format. Then another withColumn converts the ISO date to the desired date type in column test3. Note that you have to adapt the formats in the original column to match the Python date format strings, e.g. yyyy -> %Y, MM -> %m, ...
import datetime
from pyspark.sql.functions import col, to_date, udf

test_df = spark.createDataFrame([
    ('a','2018-01-01','%Y-%m-%d'),
    ('b','2018-02-02','%Y-%m-%d'),
    ('c','02-02-2018','%d-%m-%Y')
], ("col_name","value","format"))

def map_to_date(s, format):
    return datetime.datetime.strptime(s, format).isoformat()

myudf = udf(map_to_date)

test_df.withColumn("test3", myudf(col("value"), col("format")))\
    .withColumn("test3", to_date("test3")).show(truncate=False)
Result:
+--------+----------+--------+----------+
|col_name|value |format |test3 |
+--------+----------+--------+----------+
|a |2018-01-01|%Y-%m-%d|2018-01-01|
|b |2018-02-02|%Y-%m-%d|2018-02-02|
|c |02-02-2018|%d-%m-%Y|2018-02-02|
+--------+----------+--------+----------+
You don't need the format column at all. You can use coalesce to check for all possible options:
def get_right_date_format(date_string):
    from pyspark.sql import functions as F
    return F.coalesce(
        F.to_date(date_string, 'yyyy-MM-dd'),
        F.to_date(date_string, 'dd-MM-yyyy'),
        F.to_date(date_string, 'yyyy-dd-MM')
    )

df = sc.parallelize([('a','2018-01-01'),
                     ('b','2018-02-02'),
                     ('c','2018-21-02'),
                     ('d','02-02-2018')]).toDF(
    ["col_name","value"])

df = df.withColumn("formatted_data", get_right_date_format(df.value))
The issue with this approach, though, is that a date like 2020-02-01 would be treated as 1 Feb 2020 (yyyy-MM-dd), when 2 Jan 2020 (yyyy-dd-MM) is also possible.
Just an alternative approach!
