Apache Spark subtract days from timestamp column

Apache Spark subtract days from timestamp column - apache-spark

I am using Spark Dataset and having trouble subtracting days from a timestamp column.
I would like to subtract days from Timestamp Column and get new Column with full datetime format. Example:
2017-09-22 13:17:39.900 - 10 ----> 2017-09-12 13:17:39.900
With date_sub functions I am getting 2017-09-12 without 13:17:39.900.

You cast data to timestamp and expr to subtract an INTERVAL:
import org.apache.spark.sql.functions.expr
val df = Seq("2017-09-22 13:17:39.900").toDF("timestamp")
df.withColumn(
"10_days_before",
$"timestamp".cast("timestamp") - expr("INTERVAL 10 DAYS")).show(false)
+-----------------------+---------------------+
|timestamp |10_days_before |
+-----------------------+---------------------+
|2017-09-22 13:17:39.900|2017-09-12 13:17:39.9|
+-----------------------+---------------------+
If data is already of TimestampType you can skip cast.

Or you can simply use date_sub function from pyspark +1.5:
from pyspark.sql.functions import *
df.withColumn("10_days_before", date_sub(col('timestamp'),10).cast('timestamp'))

Related

pyspark convert millisecond timestamp to timestamp

I am new in spark and i would like to retrieve timestamps in my DF.
checkpoint actual values
1594976390070
and i want :
checkpoint values without ms
1594976390070 / 1000
Actually i am using this piece of code to cast as timestamp:
# Casting dates as Timestamp
for d in dateFields:
df= df.withColumn(d,checkpoint.cast(TimestampType()))
I wonder how to convert it into a simple timestamp.

Divide your column by 1000 and use F.from_unixtime to convert to timestamp type:
import pyspark.sql.functions as F
for d in dateFields:
df = df.withColumn(d,
(checkpoint / F.lit(1000.)).cast('timestamp')
)

Spark cassandra sqlcontext with unix epoch timestamp column

I have Cassandra table with unix epoch timestamp column (value e.g 1599613045). I would like to use spark sqlcontext to select from this table from date to date based on this unix epoch timestamp column. I intend to convert from date, to date input into epoch timestamp and compare (>= & <=) with table epoch ts column. Is it possible ? Any suggestion ? Many thanks!

Follow the below approach,
Let's consider cassandra is running on localhost:9042
keyspace-->mykeyspace
tabe-->mytable
columnName-->timestamp
spark-scala code:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
// create SparkSession
val spark=SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._
//Read table from cassandra, spark-cassandra connector should be added to classpath
spark.conf.set("spark.cassandra.connection.host", "localhost")
spark.conf.set("spark.cassandra.connection.port", "9042")
var cassandraDF = spark.read.format("org.apache.spark.sql.cassandra")
.options(Map("keyspace" -> "mykeyspace", "table" -> "mytable")).load()
//select timestamp column
cassandraDF=cassandraDF.select('timestamp)
cassandraDF.show(false)
// let's consider following as the output
+----------+
| timestamp|
+----------+
|1576089000|
|1575916200|
|1590258600|
|1591900200|
+----------+
// To convert the above output to spark's default date format yyyy-MM-dd
val outDF=cassandraDF.withColumn("date",to_date(from_unixtime('timestamp)))
outDF.show(false)
+----------+----------+
| timestamp| date|
+----------+----------+
|1576089000|2019-12-12|
|1575916200|2019-12-10|
|1590258600|2020-05-24|
|1591900200|2020-06-12|
+----------+----------+
// You can proceed with next steps from here

Filter pyspark by time difference

I have a dataframe in pyspark that looks like this:
+----------+-------------------+-------+-----------------------+-----------------------+--------+
|Session_Id|Instance_Id |Actions|Start_Date |End_Date |Duration|
+----------+-------------------+-------+-----------------------+-----------------------+--------+
|14252203 |i-051fc2d21fbe001e3|2 |2019-12-17 01:07:30.000|2019-12-17 01:07:30.000|0 |
|43024091 |i-051fc2d21fbe001e3|2 |2019-12-17 01:08:00.000|2019-12-17 01:08:00.000|0 |
|50961995 |i-0c733c7e356bc1615|2 |2019-12-17 01:07:30.000|2019-12-17 01:07:30.000|0 |
|56308963 |i-0c733c7e356bc1615|2 |2019-12-17 01:08:00.000|2019-12-17 01:08:00.000|0 |
|60120472 |i-0c733c7e356bc1615|2 |2019-12-17 01:07:30.000|2019-12-17 01:07:30.000|0 |
|69132492 |i-051fc2d21fbe001e3|2 |2019-12-17 01:07:30.000|2019-12-17 01:07:30.000|0 |
+----------+-------------------+-------+-----------------------+-----------------------+--------+
I'm trying to filter any rows that are too recent with this:
now = datetime.datetime.now()
filtered = grouped.filter(f.abs(f.unix_timestamp(now) - f.unix_timestamp(datetime.datetime.strptime(f.col('End_Date')[:-4], '%Y-%m-%d %H:%M:%S'))) > 100)
which transforms End_Date to a timestamp and calculates the difference from now till End_Date and filters anything less than 100 seconds. Which I got from Filter pyspark dataframe based on time difference between two columns
Every time I run this, I get this error:
TypeError: Invalid argument, not a string or column: 2019-12-19 18:55:13.268489 of type <type 'datetime.datetime'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
How can I filter by comparing timestamps?

I think you're confusing between Python functions and Spark. unix_timestamp function requires a string or Column object but you're passing a Python datetime object, that why you get that error.
Instead use Spark builtin functions : current_date which gives you column with current date value and to_date to convert End_Date column to date.
This should work fine for you:
filtered = grouped.filter(abs(unix_timestamp(current_date()) - unix_timestamp(to_date(col('End_Date'), 'yyyy-MM-dd HH:mm:ss'))) > 100)

PySpark - to_date format from column

I am currently trying to figure out, how to pass the String - format argument to the to_date pyspark function via a column parameter.
Specifically, I have the following setup:
sc = SparkContext.getOrCreate()
df = sc.parallelize([('a','2018-01-01','yyyy-MM-dd'),
('b','2018-02-02','yyyy-MM-dd'),
('c','02-02-2018','dd-MM-yyyy')]).toDF(
["col_name","value","format"])
I am currently trying to add a new column, where each of the dates from the column F.col("value"), which is a string value, is parsed to a date.
Separately for each format, this can be done with
df = df.withColumn("test1",F.to_date(F.col("value"),"yyyy-MM-dd")).\
withColumn("test2",F.to_date(F.col("value"),"dd-MM-yyyy"))
This however gives me 2 new columns - but I want to have 1 column containing both results - but calling the column does not seem to be possible with the to_date function:
df = df.withColumn("test3",F.to_date(F.col("value"),F.col("format")))
Here an error "Column object not callable" is being thrown.
Is is possible to have a generic approach for all possible formats (so that I do not have to manually add new columns for each format)?

You can use a column value as a parameter without a udf using the spark-sql syntax:
Spark version 2.2 and above
from pyspark.sql.functions import expr
df.withColumn("test3",expr("to_date(value, format)")).show()
#+--------+----------+----------+----------+
#|col_name| value| format| test3|
#+--------+----------+----------+----------+
#| a|2018-01-01|yyyy-MM-dd|2018-01-01|
#| b|2018-02-02|yyyy-MM-dd|2018-02-02|
#| c|02-02-2018|dd-MM-yyyy|2018-02-02|
#+--------+----------+----------+----------+
Or equivalently using pyspark-sql:
df.createOrReplaceTempView("df")
spark.sql("select *, to_date(value, format) as test3 from df").show()
Spark version 1.5 and above
Older versions of spark do not support having a format argument to the to_date function, so you'll have to use unix_timestamp and from_unixtime:
from pyspark.sql.functions import expr
df.withColumn(
"test3",
expr("from_unixtime(unix_timestamp(value,format))").cast("date")
).show()
Or equivalently using pyspark-sql:
df.createOrReplaceTempView("df")
spark.sql(
"select *, cast(from_unixtime(unix_timestamp(value,format)) as date) as test3 from df"
).show()

As far as I know, your problem requires some udf(user defined functions) to apply the correct format. But then inside a udf you can not directly use spark functions like to_date. So I created a little workaround in the solution. First the udf takes the python date conversion with the appropriate format from the column and converts it to an iso-format. Then another withColumn converts the iso-date to the correct format in column test3. However, you have to adapt the format in the original column to match the python dateformat strings, e.g. yyyy -> %Y, MM -> %m, ...
test_df = spark.createDataFrame([
('a','2018-01-01','%Y-%m-%d'),
('b','2018-02-02','%Y-%m-%d'),
('c','02-02-2018','%d-%m-%Y')
], ("col_name","value","format"))
def map_to_date(s,format):
return datetime.datetime.strptime(s,format).isoformat()
myudf = udf(map_to_date)
test_df.withColumn("test3",myudf(col("value"),col("format")))\
.withColumn("test3",to_date("test3")).show(truncate=False)
Result:
+--------+----------+--------+----------+
|col_name|value |format |test3 |
+--------+----------+--------+----------+
|a |2018-01-01|%Y-%m-%d|2018-01-01|
|b |2018-02-02|%Y-%m-%d|2018-02-02|
|c |02-02-2018|%d-%m-%Y|2018-02-02|
+--------+----------+--------+----------+

You dont need the format column also. You can use coalesce to check for all possible options
def get_right_date_format(date_string):
from pyspark.sql import functions as F
return F.coalesce(
F.to_date(date_string, 'yyyy-MM-dd'),
F.to_date(date_string, 'dd-MM-yyyy'),
F.to_date(date_string, 'yyyy-dd-MM')
)
df = sc.parallelize([('a','2018-01-01'),
('b','2018-02-02'),
('c','2018-21-02'),
('d','02-02-2018')]).toDF(
["col_name","value"])
df = df.withColumn("formatted_data",get_right_date_format(df.value, 'dd-MM-yyyy'))
The issue with this approach though is a date like 2020-02-01 would be treated as 1st Feb 2020, when it is likely that 2nd Jan 2020 is also possible.
Just an alternative approach !!!

Create a timestamp Column in Spark Dataframe from other column having timestamp value

I have a spark dataframe having a timestamp Column.
I want to get previous day date of the column.Then add time (3,59,59) to the date.
Ex- value in current column(x1) : 2018-07-11 21:40:00
previous day date : 2018-07-10
after adding time(3,59,59) to the previous day date ,it should be like :
2018-07-10 03:59:59 (x2)
I want to add a column in the dataframe with "x2" values corresponding to "x1" values in all records.
I want one more column with values equal to difference of (x1-x2).totalDays in exact double values

Substracting day and adding time and converting to timestamp type
from pyspark.sql.types import *
from pyspark.sql import *
>>>df.withColumn('x2',concat(date_sub(col("x1"),1),lit(" 03:59:59")).cast("timestamp"))
Caluculating Time and Date difference:
Date Difference:-
Using datediff function we can caluculate date difference
>>>df1.withColumn("x3",datediff(col("x1"),col("x2")))
Time Difference
Calculate time difference for this convert to unix time then subtract x2 column from x1
>>>df1.withColumn("x3",unix_timestamp(col("x1"))-unix_timestamp(col("x2")))

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Apache Spark subtract days from timestamp column - apache-spark

Or you can simply use date_sub function from pyspark +1.5: from pyspark.sql.functions import * df.withColumn("10_days_before", date_sub(col('timestamp'),10).cast('timestamp'))

Related

pyspark convert millisecond timestamp to timestamp

Spark cassandra sqlcontext with unix epoch timestamp column

Filter pyspark by time difference

PySpark - to_date format from column

Create a timestamp Column in Spark Dataframe from other column having timestamp value

Categories

Resources