How to convert a datetime column to the first day of the month? - apache-spark

I have a PySpark dataframe with a column that holds datetime values in the format '09/19/2020 09:27:18 AM'.
I want to convert each value to the first day of its month, in the format 01-Nov-2020.
I have tried F.trunc("date_col", "month"), which results in a null date,
and
df_result = df_result.withColumn('gl_date', F.udf(lambda d: datetime.datetime.strptime(d, '%MM/%dd/%yyyy %HH:%mm:%S a').strftime('%Y/%m/1'), t.StringType())(F.col('date_col')))
The second method errors out because the date format '%MM/%dd/%yyyy %HH:%mm:%S a' does not match '09/19/2020 09:27:18 AM'.

You can convert the column to timestamp type before calling trunc:
import pyspark.sql.functions as F

df_result2 = df_result.withColumn(
    'gl_date',
    F.date_format(
        F.trunc(
            F.to_timestamp("date_col", "MM/dd/yyyy hh:mm:ss a"),
            "month"
        ),
        "dd-MMM-yyyy"
    )
)
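As a quick check, here is a minimal sketch (assuming a Spark session named spark and a hypothetical one-row dataframe built from the sample value in the question) showing the expected result:

import pyspark.sql.functions as F

# Hypothetical one-row dataframe with the sample value from the question
df_result = spark.createDataFrame([('09/19/2020 09:27:18 AM',)], ['date_col'])
df_result.withColumn(
    'gl_date',
    F.date_format(
        F.trunc(F.to_timestamp('date_col', 'MM/dd/yyyy hh:mm:ss a'), 'month'),
        'dd-MMM-yyyy'
    )
).show(truncate=False)
# expected gl_date: 01-Sep-2020 (the first day of the input's month)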

Related

How to convert Excel date to numeric value using Python

How do I convert Excel date format to a number in Python? I'm importing a number of Excel files into a Pandas dataframe in a loop, and some values are formatted incorrectly in Excel. For example, the number column is imported as a date, and I'm trying to convert this date value back into a numeric value.
Original              New
1912-04-26 00:00:00   4500
How do I convert the date value in Original to the numeric value in New? I know this code can convert a numeric value to a date, but is there a similar function that does the opposite?
df.loc[0]['Date']= xlrd.xldate_as_datetime(df.loc[0]['Date'], 0)
I tried to specify the data type when I read in the files, and also tried to simply change the data type of the column to 'float', but neither worked.
Thank you.
I found that the number is the number of days since 1900-01-00 (Excel's date origin).
The following code calculates how many days have passed from that origin to the given date.
import pandas as pd
from datetime import datetime, timedelta

df = pd.DataFrame(
    {
        'date': ['1912-04-26 00:00:00'],
    }
)
print(df)
#                  date
# 0 1912-04-26 00:00:00

def date_to_int(given_date):
    given_date = datetime.strptime(given_date, '%Y-%m-%d %H:%M:%S')
    base_date = datetime(1900, 1, 1) - timedelta(days=2)
    delta = given_date - base_date
    return delta.days

df['date'] = df['date'].apply(date_to_int)
print(df)
#    date
# 0  4500
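A vectorized variant of the same idea (not from the original answer, and assuming the same 1899-12-30 base date that the two-day offset above works out to) would be:

import pandas as pd

df = pd.DataFrame({'date': ['1912-04-26 00:00:00']})
# Subtract Excel's effective epoch (1899-12-30) and keep the day count
df['date'] = (pd.to_datetime(df['date']) - pd.Timestamp('1899-12-30')).dt.days
print(df)
# expected: 4500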

Why is Spark not recognizing this time format?

I get null for the timestamp 27-04-2021 14:11 with this code. What am I doing wrong? Why is the timestamp format string DD-MM-yyyy HH:mm not correct here?
df = spark.createDataFrame([('27-04-2021 14:11',)], ['t'])
df = df.select(to_timestamp(df.t, 'DD-MM-yyyy HH:mm').alias('dt'))
display(df)
D is for day of the year, and d is for day of the month.
Try this:
df = df.select(F.to_timestamp(df.t, "dd-MM-yyyy HH:mm").alias("dt"))
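For reference, a minimal runnable sketch (assuming a Spark session named spark) with the corrected pattern:

from pyspark.sql import functions as F

df = spark.createDataFrame([('27-04-2021 14:11',)], ['t'])
df.select(F.to_timestamp(df.t, 'dd-MM-yyyy HH:mm').alias('dt')).show()
# expected dt: 2021-04-27 14:11:00 (lowercase d = day of month)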

Pyspark date format from multiple columns

I have four string columns 'hour', 'day', 'month', 'year' in my data frame. I would like to create a new column fulldate in the format 'dd/MM/yyyy HH:mm'.
df2 = df1.withColumn(
    "fulldate",
    to_date(
        concat(col('day'), lit('/'), col('month'), lit('/'), col('year'), lit(' '),
               col('hour'), lit(':'), lit('0'), lit('0')),
        'dd/MM/yyyy HH:mm'
    )
)
but it doesn't seem to work: I'm getting the format "yyyy-mm-dd".
Am I missing something?
For Spark 3+, you can use the make_timestamp function to create a timestamp column from those columns, then use date_format to convert it to the desired date pattern:
from pyspark.sql import functions as F

df2 = df1.withColumn(
    "fulldate",
    F.date_format(
        F.expr("make_timestamp(year, month, day, hour, 0, 0)"),
        "dd/MM/yyyy HH:mm"
    )
)
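For example, a minimal sketch (assuming a Spark 3+ session named spark and hypothetical string values for the four columns):

from pyspark.sql import functions as F

# Hypothetical input matching the question's four string columns
df1 = spark.createDataFrame([('7', '4', '3', '2021')], ['hour', 'day', 'month', 'year'])
df2 = df1.withColumn(
    "fulldate",
    F.date_format(
        F.expr("make_timestamp(year, month, day, hour, 0, 0)"),
        "dd/MM/yyyy HH:mm"
    )
)
df2.show()
# expected fulldate: 04/03/2021 07:00 (Spark casts the string columns for make_timestamp)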
Use date_format instead of to_date.
to_date converts a column to date type from the given format, while date_format converts a date type column to the given format.
from pyspark.sql.functions import date_format, concat, col, lit

df2 = df1.withColumn(
    "fulldate",
    date_format(
        concat(col('year'), lit('/'), col('month'), lit('/'), col('day'), lit(' '),
               col('hour'), lit(':'), lit('00'), lit(':'), lit('00')),
        'dd/MM/yyyy HH:mm'
    )
)
For better readability, you can use format_string (with %s placeholders, since the columns are strings):
from pyspark.sql.functions import date_format, format_string, col

df2 = df1.withColumn(
    "fulldate",
    date_format(
        format_string('%s/%s/%s %s:00:00', col('year'), col('month'), col('day'), col('hour')),
        'dd/MM/yyyy HH:mm'
    )
)

I want to convert 4/1/2019 to 1/4/2019 in my dataframe column

I want to swap the month and the day in a date column of a dataframe.
I have tried all of the methods below:
#df_combined1['BILLING_START_DATE_x'] = pd.to_datetime(df_combined1['BILLING_START_DATE_x'], format='%d/%m/%Y').dt.strftime('%d-%m-%Y')
#df_combined1['BILLING_START_DATE_x'] = df_combined1['BILLING_START_DATE_x'].apply(lambda x: dt.datetime.strftime(x, '%d-%m-%Y'))
#df_combined1['BILLING_START_DATE_x'] = pd.to_datetime(df_combined1['BILLING_START_DATE_x'], format='%m-%d-%Y')
I need to swap the month and the day.
If the format of all datetimes is DD/MM/YYYY:
df_combined1['BILLING_START_DATE_x'] = (
    pd.to_datetime(df_combined1['BILLING_START_DATE_x'], format='%d/%m/%Y')
      .dt.strftime('%m/%d/%Y')
)
If the format of all datetimes is MM/DD/YYYY:
df_combined1['BILLING_START_DATE_x'] = (
    pd.to_datetime(df_combined1['BILLING_START_DATE_x'], format='%m/%d/%Y')
      .dt.strftime('%d/%m/%Y')
)
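Applied to the value from the title (a minimal sketch, assuming the column holds MM/DD/YYYY strings such as '4/1/2019'):

import pandas as pd

df_combined1 = pd.DataFrame({'BILLING_START_DATE_x': ['4/1/2019']})
df_combined1['BILLING_START_DATE_x'] = (
    pd.to_datetime(df_combined1['BILLING_START_DATE_x'], format='%m/%d/%Y')
      .dt.strftime('%d/%m/%Y')
)
print(df_combined1)
# expected: 01/04/2019 (strftime zero-pads the day and month)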

Construct a DatetimeIndex from multiple columns of a dataframe

I am parsing a dataframe from a sas7bdat file and I want to convert the index into datetime to resample the data.
I have one column, Date, which is of type string, and another column with the time, which is of type datetime.time. Does anybody know how to convert these into a single datetime column?
I already tried pd.to_datetime like this, but it requires individual columns for year, month and day:
df['TimeIn']=str(df['TimeIn'])
df['datetime']=pd.to_datetime(df[['Date', 'TimeIn']], dayfirst=True)
This gives me a value error:
ValueError: to assemble mappings requires at least that [year, month, day] be specified: [day,month,year] is missing
If you convert both the date and time columns to str, you can concatenate them and then call to_datetime:
In [155]:
df = pd.DataFrame({'Date': ['08/05/2018'], 'TimeIn': ['10:32:12']})
df

Out[155]:
         Date    TimeIn
0  08/05/2018  10:32:12

In [156]:
df['new_date'] = pd.to_datetime(df['Date'] + ' ' + df['TimeIn'])
df

Out[156]:
         Date    TimeIn            new_date
0  08/05/2018  10:32:12 2018-08-05 10:32:12
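Since TimeIn in the question is a datetime.time object rather than a string, a hedged variant is to stringify both columns first (and pass dayfirst=True if the Date strings are day-first), for example:

import pandas as pd
import datetime

df = pd.DataFrame({'Date': ['08/05/2018'], 'TimeIn': [datetime.time(10, 32, 12)]})
# Cast both columns to str before concatenating, then parse as day-first
df['new_date'] = pd.to_datetime(df['Date'].astype(str) + ' ' + df['TimeIn'].astype(str), dayfirst=True)
print(df['new_date'])
# expected: 2018-05-08 10:32:12 (08/05/2018 read as 8 May with dayfirst=True)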
