How can I convert a specific string date to date or datetime in Spark? - apache-spark

I have this string pattern in my Spark dataframe: 'Sep 14, 2014, 1:34:36 PM'.
I want to convert this to date or datetime format, using Databricks and Spark.
I've already tried the cast and to_date functions, but nothing works and I get a null return every time.
How can I do that?
Thanks in advance!

If we create a DataFrame like this:
import org.apache.spark.sql.functions._
import spark.implicits._

var ds = spark.sparkContext.parallelize(Seq(
  "Sep 14, 2014, 01:34:36 PM"
)).toDF("date")
Then, with the following statement:
ds = ds.withColumn("casted", to_timestamp(col("date"), "MMM dd, yyyy, hh:mm:ss a"))
You get this result:
+-------------------------+-------------------+
|date |casted |
+-------------------------+-------------------+
|Sep 14, 2014, 01:34:36 PM|2014-09-14 13:34:36|
+-------------------------+-------------------+
which should be what you need. From there you can use to_date or any other API that expects a timestamp. Good luck!
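For reference, a minimal PySpark sketch of the same conversion (assuming a column named date as above; the extra to_date call simply truncates the timestamp to a plain date):
from pyspark.sql.functions import col, to_timestamp, to_date

ds = spark.createDataFrame([("Sep 14, 2014, 01:34:36 PM",)], ["date"])
# parse the string into a timestamp, then truncate it to a date
ds = ds.withColumn("casted", to_timestamp(col("date"), "MMM dd, yyyy, hh:mm:ss a"))
ds = ds.withColumn("as_date", to_date(col("casted")))
ds.show(truncate=False)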

The timestamp string in your question is the problem: the hour is 1 instead of 01, which an hh pattern will not match.
#
# 1 - Create sample dataframe
#
# required library
from pyspark.sql.functions import *
# array of tuples - data
dat1 = [
("1", "Sep 14, 2014, 01:34:36 pm")
]
# array of names - columns
col1 = ["row_id", "date_string1"]
# make data frame
df1 = spark.createDataFrame(data=dat1, schema=col1)
# parse the string into a timestamp
df1 = df1.withColumn("time_stamp1", to_timestamp(col("date_string1"), "MMM dd, yyyy, hh:mm:ss a"))
# show schema
df1.printSchema()
# show data
display(df1)
This code produces the correct answer.
If the data has 1:34:36 instead, the hh pattern fails; you can use a when clause to pick the correct conversion, as sketched below.
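For example, a minimal sketch of that idea (assuming one- and two-digit hours are the only variants in the data; note that the single-letter h pattern alone already accepts both widths):
from pyspark.sql.functions import col, to_timestamp, when

dat2 = [("1", "Sep 14, 2014, 01:34:36 PM"), ("2", "Sep 14, 2014, 1:34:36 PM")]
df2 = spark.createDataFrame(data=dat2, schema=["row_id", "date_string1"])
# pick the pattern based on whether the hour has two digits or one
df2 = df2.withColumn(
    "time_stamp1",
    when(
        col("date_string1").rlike(r", \d{2}:\d{2}:\d{2} "),
        to_timestamp(col("date_string1"), "MMM dd, yyyy, hh:mm:ss a"),
    ).otherwise(
        to_timestamp(col("date_string1"), "MMM dd, yyyy, h:mm:ss a")
    ),
)
df2.show(truncate=False)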

Related

How to convert excel date to numeric value using Python

How do I convert Excel date format to number in Python? I'm importing a number of Excel files into Pandas dataframe in a loop and some values are formatted incorrectly in Excel. For example, the number column is imported as date and I'm trying to convert this date value into numeric.
Original New
1912-04-26 00:00:00 4500
How do I convert the date value in original to the numeric value in new? I know this code can convert numeric to date, but is there any similar function that does the opposite?
df.loc[0]['Date']= xlrd.xldate_as_datetime(df.loc[0]['Date'], 0)
I tried to specify the data type when I read in the files and also tried to simply change the data type of the column to 'float' but both didn't work.
Thank you.
I found that the number is the count of days since Excel's day zero (nominally 1900-01-00, i.e. 1899-12-31; in practice the base date used below is 1899-12-30, which compensates for Excel treating 1900 as a leap year).
The following code calculates how many days have passed from that base date until the given date.
import pandas as pd
from datetime import datetime, timedelta
df = pd.DataFrame(
    {
        'date': ['1912-04-26 00:00:00'],
    }
)
print(df)
#                  date
# 0 1912-04-26 00:00:00
def date_to_int(given_date):
    # parse the string, then count days from Excel's effective base date (1899-12-30)
    given_date = datetime.strptime(given_date, '%Y-%m-%d %H:%M:%S')
    base_date = datetime(1900, 1, 1) - timedelta(days=2)
    delta = given_date - base_date
    return delta.days

df['date'] = df['date'].apply(date_to_int)
print(df)
#    date
# 0  4500
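Alternatively, a vectorized sketch that avoids apply (it relies on the same effective base date of 1899-12-30 as the function above):
import pandas as pd

df = pd.DataFrame({'date': ['1912-04-26 00:00:00']})
# days elapsed since Excel's effective base date
df['date'] = (pd.to_datetime(df['date']) - pd.Timestamp('1899-12-30')).dt.days
print(df)
#    date
# 0  4500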

Pyspark parse datetime field with day and month names into timestamp

I'm not even sure where to start. I want to parse a column that is currently a string into a timestamp. The records look like the following:
Thu, 28 Jan 2021 02:54:17 +0000
What is the best way to parse this as a timestamp? I wasn't even sure where to start, since it's not a super common way to store dates.
You could probably start from the docs Datetime Patterns for Formatting and Parsing:
import pyspark.sql.functions as F
df = spark.createDataFrame([("Thu, 28 Jan 2021 02:54:17 +0000",)], ['timestamp'])
df.withColumn(
    "timestamp",
    F.to_timestamp("timestamp", "E, dd MMM yyyy HH:mm:ss Z")
).show()
#+-------------------+
#|          timestamp|
#+-------------------+
#|2021-01-28 02:54:17|
#+-------------------+
However, since Spark version 3.0, you can no longer use some symbols like E while parsing to timestamp:
Symbols of ‘E’, ‘F’, ‘q’ and ‘Q’ can only be used for datetime
formatting, e.g. date_format. They are not allowed used for datetime
parsing, e.g. to_timestamp.
You can either set the time parser to legacy:
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
Or use some string functions to remove the day part from string before using to_timestamp:
df.withColumn(
    "timestamp",
    F.to_timestamp(F.split("timestamp", ",")[1], " dd MMM yyyy HH:mm:ss Z")
).show()
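Another option along the same lines (just a sketch, assuming the day name is always followed by a comma) is to strip the leading day name with regexp_replace and parse the remainder:
df.withColumn(
    "timestamp",
    F.to_timestamp(
        F.regexp_replace("timestamp", r"^\w+,\s*", ""),
        "dd MMM yyyy HH:mm:ss Z"
    )
).show()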

Sort pandas dataframe by date in day/month/year format

I am trying to parse data from a csv file, sort them by date and write the sorted dataframe in a new csv file.
Say we have a very simple csv file with date entries following the pattern day/month/year:
Date,Reference
15/11/2020,'001'
02/11/2020,'002'
10/11/2020,'003'
26/11/2020,'004'
23/10/2020,'005'
I read the csv into a Pandas dataframe. When I attempt to order the dataframe based on the dates in ascending order I expect the data to be ordered as follows:
23/10/2020,'005'
02/11/2020,'002'
10/11/2020,'003'
15/11/2020,'001'
26/11/2020,'004'
Sadly, this is not what I get.
If I attempt to convert the date to datetime and then sort, then some date entries are converted to the month/day/year (e.g. 2020-10-23 instead of 2020-23-10) which messes up the ordering:
date reference
2020-02-11 '002'
2020-10-11 '003'
2020-10-23 '005'
2020-11-15 '001'
2020-11-26 '004'
If I sort without converting to datetime, then the ordering is also wrong:
date reference
02/11/2020 '002'
10/11/2020 '003'
15/11/2020 '001'
23/10/2020 '005'
26/11/2020 '004'
Here is my code:
import pandas as pd
df = pd.read_csv('order_dates.csv',
                 header=0,
                 names=['date', 'reference'],
                 dayfirst=True)
df.reset_index(drop=True, inplace=True)
# df.date = pd.to_datetime(df.date)
df.sort_values(by='date', ascending=True, inplace=True)
print(df)
df.to_csv('sorted.csv')
Why is sorting by date so hard? Can someone explain why the above sorting attempts fail?
Ideally, I would like the sorted.csv to have the date entries in the day/month/year format.
Try:
df.loc[:, 'date'] = pd.to_datetime(df.loc[:, 'date'], format='%d/%m/%Y')
What you can do is parse the dates while reading the csv file, telling pandas that the day comes first. To do this, try:
>>> df = pd.read_csv('filename.csv', parse_dates=['Date'], dayfirst=True).sort_values(by='Date')
This will read your dates from the csv and give you this output, where the dates are sorted.
Date Reference
4 2020-10-23 '005'
1 2020-11-02 '002'
2 2020-11-10 '003'
0 2020-11-15 '001'
3 2020-11-26 '004'
What's left now is to simply change the formatting to the desired one
>>> df['Date'] = df['Date'].dt.strftime('%d/%m/%Y')
Keep in mind, however, that this will change Date back to a string (object):
>>> df
Date Reference
4 23/10/2020 '005'
1 02/11/2020 '002'
2 10/11/2020 '003'
0 15/11/2020 '001'
3 26/11/2020 '004'
>>> df.dtypes
Date         object
Reference    object
dtype: object
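Putting it together for the original goal (a sketch, assuming the input file is order_dates.csv as in the question): parse day-first while reading, sort on the real datetime, and only format back to day/month/year when writing the output.
import pandas as pd

df = pd.read_csv('order_dates.csv',
                 header=0,
                 names=['date', 'reference'],
                 parse_dates=['date'],
                 dayfirst=True)
df = df.sort_values(by='date', ascending=True)
# switch back to dd/mm/yyyy strings only for the output file
df['date'] = df['date'].dt.strftime('%d/%m/%Y')
df.to_csv('sorted.csv', index=False)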

How to convert a column of data in a DataFrame filled with string representation of non-uniformed date formats to datetime?

Let's say:
>>> print(df)
location date
paris 23/02/2010
chicago 3-23-2013
...
new york 04-23-2013
helsinki 13/10/2015
Currently, df["date"] is in str. I want to convert the date column to datetime using
>>> df["date"] = pd.to_datetime(df["date"])
I would get a ValueError due to a ParserError. This is because the date format is inconsistent (i.e. dd/mm/yyyy for one row, then m/dd/yyyy for the next).
If I were to write the code below, it still wouldn't work, because the dates are not uniform and the delimiters differ:
>>> df["date"] = pd.to_datetime(df["date"], format="%d/%m/%Y")
The last option I could think of was the code below, which replaces all of the dates that are not formatted like the first date with NaT:
>>> df["date"] = pd.to_datetime(df["date"], errors="coerce")
How do I convert the whole date column to datetime while having the dates not uniform in terms of the delimiters, and the orders of days, months and years?
Use the apply method of pandas:
df['date'] = df.apply(lambda x: pd.to_datetime(x['date']), axis=1)
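Note that this lets pandas guess the format row by row, which can silently pick the wrong day/month order. A more explicit sketch (assuming dd/mm/yyyy and m-dd-yyyy are the only formats present) tries each known format and keeps whichever parse succeeds:
import pandas as pd

# parse with each explicit format; rows that don't match become NaT
parsed_dmy = pd.to_datetime(df['date'], format='%d/%m/%Y', errors='coerce')
parsed_mdy = pd.to_datetime(df['date'], format='%m-%d-%Y', errors='coerce')
# keep the first successful parse per row
df['date'] = parsed_dmy.fillna(parsed_mdy)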

PySpark - to_date format from column

I am currently trying to figure out, how to pass the String - format argument to the to_date pyspark function via a column parameter.
Specifically, I have the following setup:
sc = SparkContext.getOrCreate()
df = sc.parallelize([('a','2018-01-01','yyyy-MM-dd'),
                     ('b','2018-02-02','yyyy-MM-dd'),
                     ('c','02-02-2018','dd-MM-yyyy')]).toDF(
    ["col_name","value","format"])
I am currently trying to add a new column, where each of the dates from the column F.col("value"), which is a string value, is parsed to a date.
Separately for each format, this can be done with
df = df.withColumn("test1",F.to_date(F.col("value"),"yyyy-MM-dd")).\
withColumn("test2",F.to_date(F.col("value"),"dd-MM-yyyy"))
This, however, gives me 2 new columns, but I want one column containing both results. Passing the format column does not seem to be possible with the to_date function:
df = df.withColumn("test3",F.to_date(F.col("value"),F.col("format")))
Here, a "Column object is not callable" error is thrown.
Is it possible to have a generic approach for all possible formats (so that I do not have to manually add new columns for each format)?
You can use a column value as a parameter without a udf using the spark-sql syntax:
Spark version 2.2 and above
from pyspark.sql.functions import expr
df.withColumn("test3",expr("to_date(value, format)")).show()
#+--------+----------+----------+----------+
#|col_name| value| format| test3|
#+--------+----------+----------+----------+
#| a|2018-01-01|yyyy-MM-dd|2018-01-01|
#| b|2018-02-02|yyyy-MM-dd|2018-02-02|
#| c|02-02-2018|dd-MM-yyyy|2018-02-02|
#+--------+----------+----------+----------+
Or equivalently using pyspark-sql:
df.createOrReplaceTempView("df")
spark.sql("select *, to_date(value, format) as test3 from df").show()
Spark version 1.5 and above
Older versions of spark do not support having a format argument to the to_date function, so you'll have to use unix_timestamp and from_unixtime:
from pyspark.sql.functions import expr
df.withColumn(
    "test3",
    expr("from_unixtime(unix_timestamp(value,format))").cast("date")
).show()
Or equivalently using pyspark-sql:
df.createOrReplaceTempView("df")
spark.sql(
    "select *, cast(from_unixtime(unix_timestamp(value,format)) as date) as test3 from df"
).show()
As far as I know, your problem requires a udf (user defined function) to apply the correct format. But inside a udf you cannot directly use Spark functions like to_date, so I created a little workaround. First the udf applies the Python date conversion with the appropriate format from the column and converts the value to an ISO-format string. Then another withColumn converts the ISO date to the desired type in column test3. However, you have to adapt the format in the original column to match the Python date-format strings, e.g. yyyy -> %Y, MM -> %m, ...
import datetime
from pyspark.sql.functions import col, to_date, udf

test_df = spark.createDataFrame([
    ('a','2018-01-01','%Y-%m-%d'),
    ('b','2018-02-02','%Y-%m-%d'),
    ('c','02-02-2018','%d-%m-%Y')
], ("col_name","value","format"))

# parse the string with the per-row Python format and return an ISO-format string
def map_to_date(s, fmt):
    return datetime.datetime.strptime(s, fmt).isoformat()

myudf = udf(map_to_date)

test_df.withColumn("test3", myudf(col("value"), col("format")))\
    .withColumn("test3", to_date("test3")).show(truncate=False)
Result:
+--------+----------+--------+----------+
|col_name|value |format |test3 |
+--------+----------+--------+----------+
|a |2018-01-01|%Y-%m-%d|2018-01-01|
|b |2018-02-02|%Y-%m-%d|2018-02-02|
|c |02-02-2018|%d-%m-%Y|2018-02-02|
+--------+----------+--------+----------+
You don't need the format column at all. You can use coalesce to try all the possible formats:
def get_right_date_format(date_string):
    from pyspark.sql import functions as F
    # return the first format that parses successfully (non-null)
    return F.coalesce(
        F.to_date(date_string, 'yyyy-MM-dd'),
        F.to_date(date_string, 'dd-MM-yyyy'),
        F.to_date(date_string, 'yyyy-dd-MM')
    )

df = sc.parallelize([('a','2018-01-01'),
                     ('b','2018-02-02'),
                     ('c','2018-21-02'),
                     ('d','02-02-2018')]).toDF(
    ["col_name","value"])

df = df.withColumn("formatted_data", get_right_date_format(df.value))
The issue with this approach, though, is that a date like 2020-02-01 would be treated as 1 Feb 2020, even though 2 Jan 2020 is also possible.
Just an alternative approach !!!
