Convert a string column to timestamp when read into spark - apache-spark

I'm trying to read a CSV file into Spark on Databricks, but my time column is read as a string. An entry looks like 2019-08-01 23:59:05-07:00, and I want to convert it to timestamp type. Here's what I tried:
df = (spark.read
.option("header", "true")
.option("inferSchema", "true")
.csv(path_to_file)
.withColumn("observed", unix_timestamp("dt", "yyyy-MM-dd hh:mm:ss.SSSZ")
.cast("double")
.cast("timestamp"))
)
But I got the error message: cannot resolve '`dt`' given input columns. I'm guessing I didn't get the "yyyy-MM-dd hh:mm:ss.SSSZ" format right?

Assuming your csv looks like this:
df = spark.createDataFrame([('2019-08-01 23:59:05-07:00',)], ['dt'])
df.show()
+--------------------+
| dt|
+--------------------+
|2019-08-01 23:59:...|
+--------------------+
You can simply parse the timestamp with the to_timestamp function:
from pyspark.sql.functions import to_timestamp
df.withColumn('observed', to_timestamp('dt', "yyyy-MM-dd HH:mm:ssXXX")).show()
+--------------------+-------------------+
| dt| observed|
+--------------------+-------------------+
|2019-08-01 23:59:...|2019-08-02 08:59:05|
+--------------------+-------------------+
So, as @HristoIliev mentioned, the reason behind cannot resolve '`dt`' is that 'dt' is supposed to be the name of a column already in your dataframe, and 'observed' is supposed to be the name of the new column. Even if you adjust the names, it still won't work, because there is a format mismatch: yyyy-MM-dd hh:mm:ss.SSSZ won't parse 2019-08-01 23:59:05-07:00 (hh is the 12-hour clock, the value has no fractional seconds, and Z does not match an offset written with a colon), but "yyyy-MM-dd HH:mm:ssXXX" will.
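As a quick cross-check outside Spark, the same value can be parsed with Python's standard library. This is only a sketch for sanity-checking which instant the string denotes, not part of the answer above; the variable names are made up for illustration. It also suggests why the observed column reads 08:59:05: Spark renders timestamps in the session time zone, and that output is consistent with a zone two hours ahead of UTC (an inference, since the answer does not say which zone was used).

```python
from datetime import datetime, timedelta, timezone

# Sample value from the question: local time with a -07:00 offset
raw = "2019-08-01 23:59:05-07:00"

# fromisoformat understands the "+HH:MM" / "-HH:MM" offset form (Python 3.7+)
parsed = datetime.fromisoformat(raw)

# The same instant expressed in UTC and in a hypothetical UTC+2 zone
as_utc = parsed.astimezone(timezone.utc)
as_plus2 = parsed.astimezone(timezone(timedelta(hours=2)))

print(as_utc.strftime("%Y-%m-%d %H:%M:%S"))    # 2019-08-02 06:59:05
print(as_plus2.strftime("%Y-%m-%d %H:%M:%S"))  # 2019-08-02 08:59:05
```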

Related

Change the timestamp from UTC to given format in Pyspark

I have a timestamp value, i.e. "2021-08-18T16:49:42.175-06:00". How can I convert this to the "2021-08-18T16:49:42.175Z" format in PySpark?
You can use the PySpark function date_format to reformat your timestamp into any other format.
Example:
df = df.withColumn("ts_column", date_format("ts_column", "yyyy-MM-dd'T'HH:mm:ss.SSSZ"))
date_format expects a TimestampType column, so you might need to cast the column to timestamp first if it is currently StringType.
Set the session time zone to "UTC" and read only the first 23 characters.
Try below:
spark.conf.set("spark.sql.session.timeZone", "UTC")
spark.sql(""" select to_timestamp('2021-08-18T16:49:42.175-06:00') as ts,
date_format(to_timestamp(substr('2021-08-18T16:49:42.175-06:00',1,23)),'yyyy-MM-dd HH:mm:ss.SSSZ') as ts2 from range(1) """).show(false)
+-----------------------+----------------------------+
|ts |ts2 |
+-----------------------+----------------------------+
|2021-08-18 22:49:42.175|2021-08-18 16:49:42.175+0000|
+-----------------------+----------------------------+
Note that +0000 is UTC.
If you want to get "Z", use X in the pattern:
spark.conf.set("spark.sql.session.timeZone", "UTC")
spark.sql("""
with t1 ( select to_timestamp('2021-08-18T16:49:42.175-06:00') as ts,
to_timestamp(substr('2021-08-18T16:49:42.175-06:00',1,23)) as ts2 from range(1) )
select *, date_format(ts2,'yyyy-MM-dd HH:mm:ss.SSSX') ts3 from t1
""").show(false)
+-----------------------+-----------------------+------------------------+
|ts |ts2 |ts3 |
+-----------------------+-----------------------+------------------------+
|2021-08-18 22:49:42.175|2021-08-18 16:49:42.175|2021-08-18 16:49:42.175Z|
+-----------------------+-----------------------+------------------------+
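The two columns above answer two different questions: ts is the original instant converted to UTC, while ts2 drops the offset and keeps the wall-clock digits. The plain-Python sketch below shows the same distinction without Spark. It is an illustration only; to_z, converted and relabelled are hypothetical names.

```python
from datetime import datetime, timezone

raw = "2021-08-18T16:49:42.175-06:00"

def to_z(dt):
    # Render with millisecond precision and a literal "Z" suffix
    return dt.strftime("%Y-%m-%dT%H:%M:%S.") + f"{dt.microsecond // 1000:03d}Z"

# Option 1: honour the -06:00 offset and convert the instant to UTC
converted = to_z(datetime.fromisoformat(raw).astimezone(timezone.utc))

# Option 2: keep only the first 23 characters (dropping the offset)
# and relabel the wall-clock time as UTC, as the substr approach does
relabelled = to_z(datetime.fromisoformat(raw[:23]))

print(converted)   # 2021-08-18T22:49:42.175Z
print(relabelled)  # 2021-08-18T16:49:42.175Z
```

Only the second form matches the output the question asks for, but it no longer denotes the same instant as the input.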

How to convert all the date format to a timestamp for date column?

I am using PySpark version 3.0.1. I am reading a CSV file into a PySpark dataframe that has 2 date columns, but when I print the schema, both columns show up as string type.
The screenshot attached above shows the dataframe and its schema.
How can I convert the row values in both date columns to timestamp format using PySpark?
I have tried many things, but all of the code requires knowing the current format. How do I convert to a proper timestamp if I don't know which format will arrive in the CSV file?
I have also tried the code below, but it creates a new column with null values:
df1 = df.withColumn('datetime', col('joining_date').cast('timestamp'))
print(df1.show())
print(df1.printSchema())
Since there are two different date types, you need to convert using two different date formats, and coalesce the results.
import pyspark.sql.functions as F
result = df.withColumn(
'datetime',
F.coalesce(
F.to_timestamp('joining_date', 'MM-dd-yy'),
F.to_timestamp('joining_date', 'MM/dd/yy')
)
)
result.show()
+------------+-------------------+
|joining_date| datetime|
+------------+-------------------+
| 01-20-20|2020-01-20 00:00:00|
| 01/19/20|2020-01-19 00:00:00|
+------------+-------------------+
If you want to convert all to a single format:
import pyspark.sql.functions as F
result = df.withColumn(
'datetime',
F.date_format(
F.coalesce(
F.to_timestamp('joining_date', 'MM-dd-yy'),
F.to_timestamp('joining_date', 'MM/dd/yy')
),
'MM-dd-yy'
)
)
result.show()
+------------+--------+
|joining_date|datetime|
+------------+--------+
| 01-20-20|01-20-20|
| 01/19/20|01-19-20|
+------------+--------+
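The coalesce pattern used above amounts to "try each format in order and keep the first one that parses". A minimal plain-Python sketch of that idea follows, assuming the same two formats; parse_first_match and CANDIDATE_FORMATS are hypothetical names, and %m-%d-%y is the strptime counterpart of MM-dd-yy.

```python
from datetime import datetime

# strptime equivalents of 'MM-dd-yy' and 'MM/dd/yy'
CANDIDATE_FORMATS = ("%m-%d-%y", "%m/%d/%y")

def parse_first_match(value, formats=CANDIDATE_FORMATS):
    """Return the first successful parse, or None if no format fits."""
    for fmt in formats:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue  # this format does not fit, try the next one
    return None

print(parse_first_match("01-20-20"))    # 2020-01-20 00:00:00
print(parse_first_match("01/19/20"))    # 2020-01-19 00:00:00
print(parse_first_match("2020.01.19"))  # None
```

As with coalesce, a value that matches none of the listed formats yields a null rather than an error, so unexpected formats are worth counting after the conversion.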

Spark cassandra sqlcontext with unix epoch timestamp column

I have a Cassandra table with a Unix epoch timestamp column (value e.g. 1599613045). I would like to use the Spark SQLContext to select from this table between a from-date and a to-date based on this Unix epoch timestamp column. I intend to convert the from-date and to-date inputs into epoch timestamps and compare them (>= and <=) with the table's epoch timestamp column. Is this possible? Any suggestions? Many thanks!
Follow the approach below.
Let's assume Cassandra is running on localhost:9042
keyspace-->mykeyspace
table-->mytable
columnName-->timestamp
Spark Scala code:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
// create SparkSession
val spark=SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._
//Read table from cassandra, spark-cassandra connector should be added to classpath
spark.conf.set("spark.cassandra.connection.host", "localhost")
spark.conf.set("spark.cassandra.connection.port", "9042")
var cassandraDF = spark.read.format("org.apache.spark.sql.cassandra")
.options(Map("keyspace" -> "mykeyspace", "table" -> "mytable")).load()
//select timestamp column
cassandraDF=cassandraDF.select('timestamp)
cassandraDF.show(false)
// let's consider following as the output
+----------+
| timestamp|
+----------+
|1576089000|
|1575916200|
|1590258600|
|1591900200|
+----------+
// To convert the above output to spark's default date format yyyy-MM-dd
val outDF=cassandraDF.withColumn("date",to_date(from_unixtime('timestamp)))
outDF.show(false)
+----------+----------+
| timestamp| date|
+----------+----------+
|1576089000|2019-12-12|
|1575916200|2019-12-10|
|1590258600|2020-05-24|
|1591900200|2020-06-12|
+----------+----------+
// You can proceed with next steps from here
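The answer stops at converting the epoch column to a date. For the range comparison the question describes, one option is to turn the two date bounds into epoch seconds first and compare integers. The sketch below shows only that arithmetic in plain Python, assuming the bounds are meant as UTC midnights; to_epoch_seconds and the bound values are hypothetical, and the sample rows are the ones from the output above.

```python
from datetime import datetime, timezone

def to_epoch_seconds(date_text):
    """Epoch seconds for midnight UTC of a yyyy-MM-dd date (UTC is an assumption)."""
    dt = datetime.strptime(date_text, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return int(dt.timestamp())

# Hypothetical range: from 2019-12-01 up to, but not including, 2020-06-01
lower = to_epoch_seconds("2019-12-01")
upper = to_epoch_seconds("2020-06-01")

# Epoch values taken from the sample output above
rows = [1576089000, 1575916200, 1590258600, 1591900200]

# Half-open interval: lower <= ts < upper
in_range = [ts for ts in rows if lower <= ts < upper]
print(in_range)  # [1576089000, 1575916200, 1590258600]
```

Because the bounds are plain integers, the same >= and < comparison could be applied to the epoch column in a Spark filter. The dates printed in the output above depend on the session time zone, so bounds computed in UTC may include or exclude rows near midnight differently.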

PySpark - to_date format from column

I am currently trying to figure out how to pass the string format argument to the to_date PySpark function via a column parameter.
Specifically, I have the following setup:
sc = SparkContext.getOrCreate()
df = sc.parallelize([('a','2018-01-01','yyyy-MM-dd'),
('b','2018-02-02','yyyy-MM-dd'),
('c','02-02-2018','dd-MM-yyyy')]).toDF(
["col_name","value","format"])
I am currently trying to add a new column, where each of the dates from the column F.col("value"), which is a string value, is parsed to a date.
Separately for each format, this can be done with
df = df.withColumn("test1",F.to_date(F.col("value"),"yyyy-MM-dd")).\
withColumn("test2",F.to_date(F.col("value"),"dd-MM-yyyy"))
This however gives me 2 new columns - but I want to have 1 column containing both results - but calling the column does not seem to be possible with the to_date function:
df = df.withColumn("test3",F.to_date(F.col("value"),F.col("format")))
Here the error "Column object not callable" is thrown.
Is it possible to have a generic approach for all possible formats (so that I do not have to manually add new columns for each format)?
You can use a column value as a parameter without a udf using the spark-sql syntax:
Spark version 2.2 and above
from pyspark.sql.functions import expr
df.withColumn("test3",expr("to_date(value, format)")).show()
#+--------+----------+----------+----------+
#|col_name| value| format| test3|
#+--------+----------+----------+----------+
#| a|2018-01-01|yyyy-MM-dd|2018-01-01|
#| b|2018-02-02|yyyy-MM-dd|2018-02-02|
#| c|02-02-2018|dd-MM-yyyy|2018-02-02|
#+--------+----------+----------+----------+
Or equivalently using pyspark-sql:
df.createOrReplaceTempView("df")
spark.sql("select *, to_date(value, format) as test3 from df").show()
Spark version 1.5 and above
Older versions of spark do not support having a format argument to the to_date function, so you'll have to use unix_timestamp and from_unixtime:
from pyspark.sql.functions import expr
df.withColumn(
"test3",
expr("from_unixtime(unix_timestamp(value,format))").cast("date")
).show()
Or equivalently using pyspark-sql:
df.createOrReplaceTempView("df")
spark.sql(
"select *, cast(from_unixtime(unix_timestamp(value,format)) as date) as test3 from df"
).show()
As far as I know, your problem requires a UDF (user-defined function) to apply the correct format. But inside a UDF you cannot directly use Spark functions like to_date, so I created a little workaround. First, the UDF parses the value with Python's date conversion, using the format from the column, and converts it to ISO format. Then another withColumn converts the ISO date to a date in column test3. However, you have to adapt the format in the original column to match Python's date format strings, e.g. yyyy -> %Y, MM -> %m, ...
import datetime
from pyspark.sql.functions import col, to_date, udf
test_df = spark.createDataFrame([
('a','2018-01-01','%Y-%m-%d'),
('b','2018-02-02','%Y-%m-%d'),
('c','02-02-2018','%d-%m-%Y')
], ("col_name","value","format"))
# Parse with Python's strptime and return an ISO-8601 string that to_date can read
def map_to_date(s,format):
    return datetime.datetime.strptime(s,format).isoformat()
myudf = udf(map_to_date)
test_df.withColumn("test3",myudf(col("value"),col("format")))\
.withColumn("test3",to_date("test3")).show(truncate=False)
Result:
+--------+----------+--------+----------+
|col_name|value |format |test3 |
+--------+----------+--------+----------+
|a |2018-01-01|%Y-%m-%d|2018-01-01|
|b |2018-02-02|%Y-%m-%d|2018-02-02|
|c |02-02-2018|%d-%m-%Y|2018-02-02|
+--------+----------+--------+----------+
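The mapping mentioned above (yyyy -> %Y, MM -> %m, ...) can also be automated for simple patterns. The following is a minimal sketch, assuming only the six tokens listed in the table occur in the pattern; spark_to_strptime and TOKEN_MAP are hypothetical helper names, not part of PySpark.

```python
import re
from datetime import datetime

# Spark/Java-style tokens and their strptime counterparts (deliberately incomplete)
TOKEN_MAP = {
    "yyyy": "%Y",  # four-digit year
    "MM": "%m",    # month
    "dd": "%d",    # day of month
    "HH": "%H",    # hour, 24-hour clock
    "mm": "%M",    # minute
    "ss": "%S",    # second
}
_TOKEN_RE = re.compile("|".join(TOKEN_MAP))

def spark_to_strptime(pattern):
    """Translate a simple Spark-style pattern in a single pass."""
    return _TOKEN_RE.sub(lambda m: TOKEN_MAP[m.group(0)], pattern)

fmt = spark_to_strptime("dd-MM-yyyy")
print(fmt)                                                  # %d-%m-%Y
print(datetime.strptime("02-02-2018", fmt).date().isoformat())  # 2018-02-02
```

A single regex pass is used instead of chained str.replace calls so that an already translated directive such as %m cannot be matched again by a later token.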
You don't need the format column either. You can use coalesce to try all possible formats:
def get_right_date_format(date_string):
    from pyspark.sql import functions as F
    # Return the first non-null parse among the candidate formats
    return F.coalesce(
        F.to_date(date_string, 'yyyy-MM-dd'),
        F.to_date(date_string, 'dd-MM-yyyy'),
        F.to_date(date_string, 'yyyy-dd-MM')
    )
df = sc.parallelize([('a','2018-01-01'),
('b','2018-02-02'),
('c','2018-21-02'),
('d','02-02-2018')]).toDF(
["col_name","value"])
df = df.withColumn("formatted_data",get_right_date_format(df.value))
The issue with this approach, though, is that an ambiguous date like 2020-02-01 would be treated as 1st Feb 2020 because the first matching format wins, even though 2nd Jan 2020 is also possible.
Just an alternative approach!
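One way to see how often that ambiguity bites is to check which values parse to more than one distinct date across the candidate formats. The sketch below does this in plain Python, using strptime equivalents of the three formats above; matching_dates and FORMATS are hypothetical names.

```python
from datetime import datetime

# strptime equivalents of 'yyyy-MM-dd', 'dd-MM-yyyy' and 'yyyy-dd-MM'
FORMATS = ("%Y-%m-%d", "%d-%m-%Y", "%Y-%d-%m")

def matching_dates(value, formats=FORMATS):
    """Return every distinct date the value could mean under the given formats."""
    hits = set()
    for fmt in formats:
        try:
            hits.add(datetime.strptime(value, fmt).date())
        except ValueError:
            pass  # this format does not fit the value
    return sorted(hits)

for value in ("2020-02-01", "2018-21-02", "2018-02-02"):
    dates = matching_dates(value)
    label = "ambiguous" if len(dates) > 1 else "unambiguous"
    print(value, [d.isoformat() for d in dates], label)
```

Values flagged as ambiguous cannot be resolved by format probing alone and need some other source of truth, such as knowing which system produced them.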

Spark sql - Pyspark string to date conversion

I have a column with data such as 20180501 in string format, and I want to convert it to date format. I tried using
to_date(cast(unix_timestamp('20180501', 'YYYYMMDD') as timestamp))
but it still didn't work. I'm using Spark SQL with dataframes.
The format should be yyyyMMdd:
spark.sql("SELECT to_date(cast(unix_timestamp('20180501', 'yyyyMMdd') as timestamp))").show()
# +------------------------------------------------------------------+
# |to_date(CAST(unix_timestamp('20180501', 'yyyyMMdd') AS TIMESTAMP))|
# +------------------------------------------------------------------+
# | 2018-05-01|
# +------------------------------------------------------------------+
As pointed out in the other answer, the format you use is incorrect. But you can also use to_date directly:
spark.sql("SELECT to_date('20180501', 'yyyyMMdd')").show()
+-------------------------------+
|to_date('20180501', 'yyyyMMdd')|
+-------------------------------+
| 2018-05-01|
+-------------------------------+
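The reason letter case matters is that, in the Java-style patterns Spark's datetime functions follow, y is the year, M the month and d the day of month, whereas Y is the week-based year and D the day of year. The plain-Python sketch below illustrates the day-of-month versus day-of-year distinction for the same value; it is only an analogy using strptime directives, and the variable names are made up.

```python
from datetime import datetime

raw = "20180501"

# %Y%m%d is the strptime counterpart of yyyyMMdd: year, month, day of month
parsed = datetime.strptime(raw, "%Y%m%d").date()
print(parsed.isoformat())  # 2018-05-01

# Day of year (what an uppercase D would refer to) is a different number
day_of_year = parsed.strftime("%j")
print(day_of_year)  # 121
```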
