Convert a Spark dataframe column from string to date - apache-spark

I have a Spark DataFrame I built from a SQL context.
I truncated a datetime field using DATE_FORMAT(time, 'Y/M/d HH:00:00') AS time_hourly.
Now the column type is a string. How can I convert the string DataFrame column to a datetime type?

You can use trunc(column, format) so you don't lose the date datatype.
There is also a to_date function to convert a string to a date.
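A minimal sketch of both routes, assuming a DataFrame df with the original timestamp column time and strings such as 2016/03/05 14:00:00 in time_hourly (column names come from the question; the format string and sample value are assumptions):
from pyspark.sql import functions as F

# parse the string produced by DATE_FORMAT back into a DateType column
# (the "yyyy/MM/dd HH:mm:ss" pattern is an assumption about how the strings look)
df = df.withColumn("time_date", F.to_date("time_hourly", "yyyy/MM/dd HH:mm:ss"))

# alternatively, keep the original timestamp and truncate without losing the type:
# trunc only goes down to month/year granularity, date_trunc goes down to hours
df = df.withColumn("time_month", F.trunc("time_date", "month"))
df = df.withColumn("time_hour", F.date_trunc("hour", F.col("time")))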

Assuming that df is your dataframe and the column name to be cast is time_hourly
You can try the following:
from pyspark.sql.types import DateType
df.select(df.time_hourly.cast(DateType()).alias('datetime'))
For more info please see:
1) the documentation of "cast()"
https://spark.apache.org/docs/1.6.2/api/python/pyspark.sql.html
2) the documentation of data-types
https://spark.apache.org/docs/1.6.2/api/python/_modules/pyspark/sql/types.html

Related

How to cast a string column to date having two different types of date formats in Pyspark

I have a dataframe column which is of type string and has dates in it. I want to cast the column from string to date but the column contains two types of date formats.
I tried using the to_date function, but it is not working as expected and gives null values after applying the function.
Below are the two date formats I am getting in the DataFrame column (datatype: string).
I tried applying the to_date function and below are the results.
Please let me know how we can solve this issue and get the date column in only one format.
Thanks in advance
You can use pyspark.sql.functions.coalesce to return the first non-null result in a list of columns. So the trick here is to parse using multiple formats and take the first non-null one:
from pyspark.sql import functions as F
df = spark.createDataFrame([
    ("9/1/2022",),
    ("2022-11-24",),
], ["Alert Release Date"])
x = F.col("Alert Release Date")
df.withColumn("date", F.coalesce(F.to_date(x, "M/d/yyyy"), F.to_date(x, "yyyy-MM-dd"))).show()
+------------------+----------+
|Alert Release Date| date|
+------------------+----------+
| 9/1/2022|2022-09-01|
| 2022-11-24|2022-11-24|
+------------------+----------+

spark dataframe: date formatting not working

I have a csv file in which a date column has values like 01080600, basically MM-dd-HH-mm.
I want to add a column in dataframe which shows this in a more readable format.
I do:
spark.sql("SELECT date...")
.withColumn("readable date", to_date(col("date"), "MM:dd HH:mm"))
.show(10)
But readable date is returned as null.
What am I missing here?
When formatting or converting to a date or timestamp, you need to provide a format string that actually follows your data's pattern. In your case you need to modify the format as shown below; the result can then be reshaped into whatever final layout you want for the date column using date_format.
References to various patterns and parsing can be found here
To Timestamp
spark.sql("""
SELECT
    TO_TIMESTAMP('01080600','ddMMhhmm') as date,
    DATE_FORMAT(TO_TIMESTAMP('01080600','ddMMhhmm'),'MM/dd hh:mm') as formated_date
""").show()
+-------------------+-------------+
| date|formated_date|
+-------------------+-------------+
|1970-08-01 06:00:00| 08/01 06:00|
+-------------------+-------------+
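Note that this pattern reads 01 as the day and 08 as the month (hence the August date, and 1970 because the input contains no year). If the digits really are in the MM-dd-HH-mm order stated in the question, a hedged variant swaps the pattern letters:
spark.sql("""
SELECT
    TO_TIMESTAMP('01080600', 'MMddHHmm') AS date,
    DATE_FORMAT(TO_TIMESTAMP('01080600', 'MMddHHmm'), 'MM/dd HH:mm') AS formatted_date
""").show()
which should produce 1970-01-08 06:00:00 and 01/08 06:00 respectively.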

How to filter a PySpark DataFrame for the current date

I have a dataframe with the following fields
I'm trying to use PySpark to filter on SaleDate where the SaleDate is the current date.
My attempt is as follows
from pyspark.sql.functions import col
df.where((col("SaleDate") = to_date())
This is assuming today's date is 16/10/2021.
I keep on getting the error:
SyntaxError: keyword can't be an expression (<stdin>, line 2)
I should mention that SaleDate is actually a StringType() and not a DateType, as shown in the image.
|-- SaleDate: string (nullable = true)
You should use the current_date function to get the current date instead of to_date.
So you first need to convert the value in the SaleDate column from string to date with to_date, then compare the obtained date with current_date:
from pyspark.sql import functions as F
df.where(F.to_date('SaleDate', 'yyyy/MM/dd HH:mm:ss.SSS') == F.current_date())
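The 'yyyy/MM/dd HH:mm:ss.SSS' format above assumes SaleDate is stored like 2021/10/16 00:00:00.000. If the strings instead look like 16/10/2021 (the example date in the question), the pattern would be dd/MM/yyyy, for instance:
from pyspark.sql import functions as F

# hedged variant: assumes SaleDate strings look like "16/10/2021"
df.where(F.to_date("SaleDate", "dd/MM/yyyy") == F.current_date()).show()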

Converting datetime.datetime into Timestamp in pandas

I have a column in pandas which contains datetime.datetime objects. For instance, the rows have the following format:
datetime.datetime(2017,12,31,0,0)
I want to convert this to TimeStamp such that I get:
Timestamp('2017-12-31 00:00:00')
as output. I wonder how one does this?
Try: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_timestamp.html
So maybe: df['datetime'].to_timestamp
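Series.to_timestamp is aimed at period data, so if that route does not fit, a hedged alternative is pandas.to_datetime, which converts datetime.datetime objects element-wise into pandas Timestamps:
import datetime
import pandas as pd

# a column holding raw datetime.datetime objects (object dtype, as in the question)
df = pd.DataFrame({"datetime": pd.Series([datetime.datetime(2017, 12, 31, 0, 0)], dtype=object)})

# pd.to_datetime converts them element-wise into pandas Timestamps (datetime64[ns])
df["datetime"] = pd.to_datetime(df["datetime"])

print(repr(df["datetime"].iloc[0]))  # Timestamp('2017-12-31 00:00:00')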

Spark-Java:How to convert Dataset string column of format "yyyy-MM-ddThh:mm:ss.SSS+0000" to timestamp with a format?

I have a Dataset with one column lastModified of type string with format "yyyy-MM-ddThh:mm:ss.SSS+0000" (sample data: 2018-08-17T19:58:46.000+0000).
I have to add a new column lastModif_mapped of type Timestamp by converting the lastModified's value to format "yyyy-MM-dd hh:mm:ss.SSS".
I tried the code below, but the new column is getting the value null in it:
Dataset<Row> filtered = null;
filtered = ds1.select(ds1.col("id"),ds1.col("lastmodified"))
.withColumn("lastModif_mapped", functions.unix_timestamp(ds1.col("lastmodified"), "yyyy-MM-dd HH:mm:ss.SSS").cast("timestamp")).alias("lastModif_mapped");
Where am I going wrong?
As I answered in your original question, your input String field doesn't correspond to the allowed formats of unix_timestamp(Column s, String p):
If a string, the data must be in a format that can be cast to a timestamp, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS
For your case, you need to use to_timestamp(Column s, String fmt):
import static org.apache.spark.sql.functions.to_timestamp;
...
to_timestamp(ds1.col("lastmodified"), "yyyy-MM-dd'T'HH:mm:ss.SSSXXX")
And you don't need to cast explicitly to Timestamp, since to_timestamp already returns a Timestamp.
When you use withColumn("lastModif_mapped", ...) you don't need to add alias("lastModif_mapped"), because withColumn already creates a new column with the provided name.
