Convert string to date in pyspark - apache-spark

I have a date value in a column of string type that takes this format:
06-MAY-16 09.17.15
I want to convert it to this format:
20160506
I have tried using DATE_FORMAT(TO_DATE(<column>), 'yyyyMMdd') but a NULL value is returned.
Does anyone have any ideas about how to go about doing this in pyspark or spark SQL?
Thanks

I've got it! This is the code I used which seems to have worked:
FROM_UNIXTIME(UNIX_TIMESTAMP(<column>, 'dd-MMM-yy HH.mm.ss'), 'yyyyMMdd')
Hope this helps others!
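For anyone working in the DataFrame API rather than SQL, a rough PySpark equivalent of that expression might look like the sketch below (the column name src_col is just a placeholder):
import pyspark.sql.functions as F
# parse with the matching pattern, then re-format as yyyyMMdd
df = df.withColumn(
    'yyyymmdd',
    F.from_unixtime(F.unix_timestamp(F.col('src_col'), 'dd-MMM-yy HH.mm.ss'), 'yyyyMMdd')
)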

Your original attempt is close to the solution. You just needed to pass the format to the TO_DATE() function. This will work as well:
DATE_FORMAT(TO_DATE(<col>, 'dd-MMM-yy HH.mm.ss'), 'yyyyMMdd')
And for pyspark:
import pyspark.sql.functions as F
df = df.withColumn('<col>', F.date_format(F.to_date(F.col('<col>'), 'dd-MMM-yy HH.mm.ss'), 'yyyyMMdd'))

Convert your string to a date before you try to 'reformat' it.
Convert pyspark string to date format -- to_timestamp(df.t, 'dd-MMM-yy HH.mm.ss').alias('my_date')
Pyspark date yyyy-mmm-dd conversion -- date_format(col("my_date"), "yyyyMMdd")

Related

How to convert a timestamp with 6-digit milliseconds using the to_timestamp function in pyspark

I have a timestamp column in my dataframe with values in a format like 2022-07-28T10:38:50.926866Z that are currently strings.
I want to convert this column into actual timestamps. I've searched around, but every time I try to_timestamp with this type of data I get nulls.
Things I've tried:
import pyspark.sql.functions as F
from pyspark.sql.functions import col

df = spark.createDataFrame([("2022-07-28T10:38:50.926866Z",)], ['date_str'])
df.withColumn("ts1", F.to_timestamp(col('date_str'), "yyyy-MM-dd'T'HH:mm:ss.SSSSSS'Z'")).show(truncate=False)
This always gives me null, but when I run something similar on an example with just 3 millisecond digits, it seems to work:
df = spark.createDataFrame([("2022-07-28T10:38:50.926Z",)],['date_str'])
df.withColumn("ts1", F.to_timestamp(col('date_str'), "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")).show(truncate=False)
I'm completely lost on how to handle this string conversion.
I actually ended up solving this by removing the last 4 characters of each timestamp string first and then running to_timestamp. I don't mind losing the extra precision, so this worked for me.
df = df.withColumn("date_str", F.substring("date_str", 1, 23))
df.withColumn("date_str", F.to_timestamp(df.date_str, "yyyy-MM-dd'T'HH:mm:ss.SSS")).show()

Python convert a str date into a datetime with timezone object

In my Django project I have to convert a str variable passed as a date ("2021-11-10") into a timezone-aware datetime object, in order to run an ORM filter on a DateTime field.
In my DB the values are stored like this, for example:
2021-11-11 01:18:04.200149+00
I tried:
# test date
df = "2021-11-11"
df = df + " 00:00:00+00"
start_d = datetime.strptime(df, '%Y-%m-%d %H:%M:%S%Z')
but I get an error because the string format and the datetime representation don't match.
How can I convert a single date string into a timezone-aware datetime object set to midnight of that date?
Many thanks in advance.
That's not the way datetime.strptime works: the format string has to match the input exactly.
Read a little bit more here; I believe it will help you.
You should pass the month as a str and without the "-".
Good luck.
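For what it's worth, a minimal sketch of building a timezone-aware midnight datetime from that string, assuming UTC to match the +00 offset in the stored values:
from datetime import datetime, timezone

d = "2021-11-11"
# %z (lowercase) parses a numeric UTC offset such as +0000
start_d = datetime.strptime(d + " 00:00:00+0000", "%Y-%m-%d %H:%M:%S%z")

# or: parse just the date and attach UTC explicitly
start_d = datetime.strptime(d, "%Y-%m-%d").replace(tzinfo=timezone.utc)
print(start_d)  # 2021-11-11 00:00:00+00:00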

Pyspark change string to timestamptype

I am new to PySpark and I am trying to create a reusable function that converts an input column from StringType to TimestampType.
This is what the input column string looks like: 23/04/2021 12:00:00 AM
I want this turned into TimestampType so I can get the latest date using PySpark.
Below is the function I so far created:
def datetype_change(self, key, col):
    self.log.info("datetype_change...".format(self.app_name.upper()))
    self.df[key] = self.df[key].withColumn("column_name", F.unix_timestamp(F.col("column_name"), 'yyyy-MM-dd HH:mm:ss').cast(TimestampType()))
When I run it I'm getting an error:
NameError: name 'TimestampType' is not defined
How do I change this function so it can take the intended output?
Found my answer:
from pyspark.sql.types import TimestampType  # importing this also fixes the NameError

def datetype_change(self, key, col):
    self.log.info("-datetype_change...".format(self.app_name.upper()))
    self.df[key] = self.df[key].withColumn(col, F.unix_timestamp(self.df[key][col], 'dd/MM/yyyy hh:mm:ss aa').cast(TimestampType()))
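Outside of that class, a minimal standalone sketch of the same conversion (assuming a DataFrame df with a string column event_time, which is a placeholder name) would be:
import pyspark.sql.functions as F

# 'dd/MM/yyyy hh:mm:ss a' matches strings like 23/04/2021 12:00:00 AM
# (Spark 3's parser wants a single 'a' for the AM/PM marker; older versions also accept 'aa')
df = df.withColumn("event_ts", F.to_timestamp(F.col("event_time"), "dd/MM/yyyy hh:mm:ss a"))

# latest date
df.agg(F.max("event_ts")).show()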

convert DD-MMM-YYYY to DD_MM_YYYY in spark

I have a file that contains a date column, and the values look like 01-Feb-2019 and 01-02-2019 02:00:00.
I have to convert these into dd-MM-yyyy format in Spark.
Any suggestions?
I tried the below with no luck:
val r = dfCsvTS02.withColumn("create_dts", date_format($"create_dts", "dd-MM-yyyy hh:mm:ss"))
Is it possible that, whatever way we get the date, it will convert all of them to dd-MM-yyyy?
Simply use the function to_timestamp to convert the date and date_format to format it. Something like this:
val r = dfCsvTS02.withColumn("create_dts", date_format(to_timestamp($"create_dts", "dd-MMM-yyyy").cast("date"), "dd-MM-yyyy"))
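Since the column mixes two input formats, one way to normalise both is to try each pattern and keep whichever parses. A PySpark sketch (the same functions exist in the Scala API), assuming the default non-ANSI behaviour where to_date returns null for a non-matching string:
import pyspark.sql.functions as F

# try the 01-Feb-2019 pattern first, then the 01-02-2019 02:00:00 pattern;
# coalesce keeps the first one that parsed successfully
r = dfCsvTS02.withColumn(
    "create_dts",
    F.date_format(
        F.coalesce(
            F.to_date(F.col("create_dts"), "dd-MMM-yyyy"),
            F.to_date(F.col("create_dts"), "dd-MM-yyyy HH:mm:ss")
        ),
        "dd-MM-yyyy"
    )
)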

Filtering a spark dataframe based on date

I have a dataframe of
date, string, string
I want to select dates before a certain period. I have tried the following with no luck
data.filter(data("date") < new java.sql.Date(format.parse("2015-03-14").getTime))
I'm getting an error stating the following
org.apache.spark.sql.AnalysisException: resolved attribute(s) date#75 missing from date#72,uid#73,iid#74 in operator !Filter (date#75 < 16508);
As far as I can guess, the query is incorrect. Can anyone show me how the query should be formatted?
I checked that all entries in the dataframe have values - they do.
The following solutions are applicable since Spark 1.5:
For lower than:
// filter data where the date is less than 2015-03-14
data.filter(data("date").lt(lit("2015-03-14")))
For greater than:
// filter data where the date is greater than 2015-03-14
data.filter(data("date").gt(lit("2015-03-14")))
For equality, you can use either equalTo or ===:
data.filter(data("date") === lit("2015-03-14"))
If your DataFrame date column is of type StringType, you can convert it using the to_date function:
// filter data where the date is greater than 2015-03-14
data.filter(to_date(data("date")).gt(lit("2015-03-14")))
You can also filter according to a year using the year function:
// filter data where year is greater or equal to 2016
data.filter(year($"date").geq(lit(2016)))
Don't use this as suggested in other answers
.filter(f.col("dateColumn") < f.lit('2017-11-01'))
But use this instead
.filter(f.col("dateColumn") < f.unix_timestamp(f.lit('2017-11-01 00:00:00')).cast('timestamp'))
This will use the TimestampType instead of the StringType, which will be more performant in some cases. For example, Parquet predicate pushdown will only work with the latter (the timestamp version).
Edit: Both snippets assume this import:
from pyspark.sql import functions as f
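A related option (a sketch, not part of the answer above) is to pass a Python date object to lit, which also gives you a typed literal instead of a string:
import datetime
from pyspark.sql import functions as f

# lit() on a datetime.date produces a DateType literal, so the comparison stays typed
df.filter(f.col("dateColumn") < f.lit(datetime.date(2017, 11, 1)))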
I find the most readable way to express this is using a sql expression:
df.filter("my_date < date'2015-01-01'")
We can verify this works correctly by looking at the physical plan from .explain():
+- *(1) Filter (isnotnull(my_date#22) && (my_date#22 < 16436))
In PySpark (Python), one of the options is to have the column in unix_timestamp format. We can convert the string to a unix_timestamp and specify the format as shown below.
Note that we need to import the unix_timestamp, lit and to_date functions.
from pyspark.sql.functions import unix_timestamp, lit, to_date
df_cast = df.withColumn("tx_date", to_date(unix_timestamp(df["date"], "MM/dd/yyyy").cast("timestamp")))
Now we can apply the filters
df_cast.filter(df_cast["tx_date"] >= lit('2017-01-01')) \
.filter(df_cast["tx_date"] <= lit('2017-01-31')).show()
df = df.filter(df["columnname"] >= '2020-01-13')
We can also use a SQL-like expression inside filter.
Note: here I am showing two conditions and a date range, for future reference:
ordersDf.filter("order_status = 'PENDING_PAYMENT' AND order_date BETWEEN '2013-07-01' AND '2013-07-31' ")
imho it should be like this:
import java.util.Calendar
val jDate = Calendar.getInstance().getTime()
val sqlDateTime = new java.sql.Timestamp(jDate.getTime())
val sqlDate = new java.sql.Date(jDate.getTime())
data.filter(data("date").gt(sqlDate))
data.filter(data("date").gt(sqlDateTime))
