convert DD-MMM-YYYY to DD_MM_YYYY in spark - apache-spark

I have a file that contains a date column with values such as 01-Feb-2019 and 01-02-2019 02:00:00.
I have to convert these into DD_MM_YYYY format in Spark.
Any suggestions?
I tried the following with no luck:
val r = dfCsvTS02.withColumn("create_dts", date_format($"create_dts", "dd-MM-yyyy hh:mm:ss"))
Is it possible that, whichever way the date comes in, it all gets converted to dd-MM-yyyy?

Simply use the to_timestamp function to parse the date and date_format to format it. Something like this:
val r = dfCsvTS02.withColumn("create_dts", date_format(to_timestamp($"create_dts", "dd-MMM-yyyy").cast("date"), "dd-MM-yyyy"))
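If the column really mixes both of the sample formats (01-Feb-2019 and 01-02-2019 02:00:00), one option is to try each pattern and take the first one that parses. A minimal PySpark sketch of the idea, reusing the question's column and DataFrame names and assuming the default (non-ANSI) behaviour where unparseable values come back as null:
from pyspark.sql import functions as F
# Try each assumed input pattern; coalesce keeps the first non-null result.
parsed = F.coalesce(
    F.to_date(F.col("create_dts"), "dd-MMM-yyyy"),           # e.g. 01-Feb-2019
    F.to_date(F.col("create_dts"), "dd-MM-yyyy HH:mm:ss"))   # e.g. 01-02-2019 02:00:00
r = dfCsvTS02.withColumn("create_dts", F.date_format(parsed, "dd-MM-yyyy"))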

Related

Convert string to date in pyspark

I have a date value in a column of string type that takes this format:
06-MAY-16 09.17.15
I want to convert it to this format:
20160506
I have tried using DATE_FORMAT(TO_DATE(<column>), 'yyyyMMdd') but a NULL value is returned.
Does anyone have any ideas about how to go about doing this in pyspark or spark SQL?
Thanks
I've got it! This is the code I used which seems to have worked:
FROM_UNIXTIME(UNIX_TIMESTAMP(<column>, 'dd-MMM-yy HH.mm.ss'), 'yyyyMMdd')
Hope this helps others!
Your original attempt is close to the solution. You just needed to add the format to the TO_DATE() function. This will work as well:
DATE_FORMAT(TO_DATE(<col>, 'dd-MMM-yy HH.mm.ss'), 'yyyyMMdd')
And for pyspark:
import pyspark.sql.functions as F
df = df.withColumn('<col>', F.date_format(F.to_date(F.col('<col>'), 'dd-MMM-yy HH.mm.ss'), 'yyyyMMdd'))
Convert your string to a date before you try to 'reformat' it.
Convert pyspark string to date format -- to_timestamp(df.t, 'dd-MMM-yy HH.mm.ss').alias('my_date')
Pyspark date yyyy-mmm-dd conversion -- date_format(col("my_date"), "yyyyMMdd")
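Putting that together, a small self-contained PySpark sketch of the end-to-end conversion (the month abbreviation is written as "May" here; depending on the Spark version you may need to normalise the casing of values like "MAY" first):
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
spark = SparkSession.builder.getOrCreate()
# One sample row modelled on the value in the question.
df = spark.createDataFrame([("06-May-16 09.17.15",)], ["dt"])
df = df.withColumn("dt", F.date_format(F.to_date(F.col("dt"), "dd-MMM-yy HH.mm.ss"), "yyyyMMdd"))
df.show()  # the dt column now holds the string 20160506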

How to convert a string to pandas datetime

I have a column in my pandas data frame that is a string, and I want to convert it to a pandas date so that I can sort by it:
import pandas as pd
dat = pd.DataFrame({'col' : ['202101', '202212']})
dat['col'].astype('datetime64[ns]')
However, this raises an error. Could you please help me find the correct way to do this?
I think this code should work.
dat['date'] = pd.to_datetime(dat['col'], format= "%Y%m")
dat['date'] = dat['date'].dt.to_period('M')
dat.sort_values(by = 'date')
If you want to sort the dataframe in place, pass inplace=True to sort_values.
Your code didn't work because of the wrong date format. If you had dates in a format such as 20210131 (yyyymmdd), this code would be enough:
dat['date'] = pd.to_datetime(dat['col'], format= "%Y%m%d")

Python convert a str date into a datetime with timezone object

In my Django project I have to convert a str variable passed as a date ("2021-11-10") into a timezone-aware datetime object, in order to run an ORM filter on a DateTime field.
In my db the values are stored like this, for example:
2021-11-11 01:18:04.200149+00
I tried:
# test date
df = "2021-11-11"
df = df + " 00:00:00+00"
start_d = datetime.strptime(df, '%Y-%m-%d %H:%M:%S%Z')
but I get an error saying the str format and the datetime representation don't match.
How can I convert a single date string into a timezone-aware datetime object starting at midnight of that date?
So many thanks in advance
That's not how datetime.strptime works.
Read a little bit more here.
I believe it will help you.
You should handle the month as a str and without the "-".
Good luck.
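As a separate illustration (not the approach above, just a sketch of one common route): since the input is a plain date, you can parse it without the offset and attach UTC explicitly instead of trying to express the "+00" suffix with %Z:
from datetime import datetime, timezone
df = "2021-11-11"  # the incoming date string
start_d = datetime.strptime(df, "%Y-%m-%d").replace(tzinfo=timezone.utc)
print(start_d)     # 2021-11-11 00:00:00+00:00
In a Django project, django.utils.timezone.make_aware is an equivalent way to attach the project's timezone to a naive datetime.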

Pandas to_json date format is changing

I have this dataframe with start_date and end_date
and when I convert it to JSON using to_json with this line:
json_data = df.to_json(orient='records')
now if I print json_data, the start_date has been converted from yyyy-mm-dd to an integer format.
Please suggest a way to keep the dates in yyyy-mm-dd format.
Use DataFrame.select_dtypes to pick the datetime columns, convert them to the format YYYY-MM-DD, and finally overwrite the original data with DataFrame.update:
df.update(df.select_dtypes('datetime').apply(lambda x: x.dt.strftime('%Y-%m-%d')))
Then your solution works correctly:
json_data = df.to_json(orient='records')
First set the format of your date, then set the date_format to 'iso':
df['start_date'] = pd.to_datetime(df['start_date']).dt.strftime('%Y-%m-%d')
df['end_date'] = pd.to_datetime(df['end_date']).dt.strftime('%Y-%m-%d')
data = df.to_json(orient='records', date_format='iso')
print(data)
[{"start_date":"2020-08-10","end_date":"2020-08-16"}]

AWS Glue - How to exclude rows where string does not match a date format

I have a dataset with a datecreated column. This column is typically in the format 'dd/MM/yy', but sometimes it contains garbage text. I want to ultimately convert the column to a DATE and have the garbage text become a NULL value.
I have been trying to use resolveChoice, but it is resulting in all null values.
data_res = date_dyf.resolveChoice(specs =
[('datescanned','cast:timestamp')])
Sample data
3,1/1/18,text7
93,this is a test,text8
9,this is a test,text9
82,12/12/17,text10
Try converting the DynamicFrame into a Spark DataFrame and parsing the date using the to_date function:
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.functions import to_date
df = date_dyf.toDF()
parsedDateDf = df.withColumn("datescanned", to_date(df["datescanned"], "dd/MM/yy"))
dyf = DynamicFrame.fromDF(parsedDateDf, glueContext, "convertedDyf")
If a string doesn't match the format, a null value will be set.
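If you want to drop those rows entirely rather than keep them as nulls, a short follow-up sketch (same column name as above):
from pyspark.sql.functions import col
# Keep only the rows where the date parsed successfully.
validDf = parsedDateDf.filter(col("datescanned").isNotNull())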

Resources