PySpark dataframe convert unusual string format to Timestamp - apache-spark

I am using PySpark through Spark 1.5.0.
I have an unusual string format in the rows of a column holding datetime values. It looks like this:
Row(datetime='2016_08_21 11_31_08')
Is there a way to convert this unorthodox yyyy_mm_dd hh_mm_ss format into a Timestamp?
Something that can eventually come along the lines of
df = df.withColumn("date_time",df.datetime.astype('Timestamp'))
I had thought that Spark SQL functions like regexp_replace could work, but of course I need to replace
_ with - in the date half
and _ with : in the time part.
I was thinking I could split the column in two using substring, counting back from the end of the string, then run regexp_replace separately on each half and concatenate. But that seems like too many operations. Is there an easier way?
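For what it's worth, a single-pass version of that regexp_replace idea could look like the sketch below (column names taken from the question; the answers that follow show simpler options):
from pyspark.sql.functions import regexp_replace, col

# Sketch only: rewrite the date half (_ -> -) and the time half (_ -> :) with
# one capture-group substitution, then cast the canonical string to timestamp.
pattern = r"^(\d{4})_(\d{2})_(\d{2}) (\d{2})_(\d{2})_(\d{2})$"
df = df.withColumn(
    "date_time",
    regexp_replace(col("datetime"), pattern, "$1-$2-$3 $4:$5:$6").cast("timestamp"))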

Spark >= 2.2
from pyspark.sql import Row
from pyspark.sql.functions import to_timestamp

(sc
    .parallelize([Row(dt='2016_08_21 11_31_08')])
    .toDF()
    .withColumn("parsed", to_timestamp("dt", "yyyy_MM_dd HH_mm_ss"))
    .show(1, False))
## +-------------------+-------------------+
## |dt |parsed |
## +-------------------+-------------------+
## |2016_08_21 11_31_08|2016-08-21 11:31:08|
## +-------------------+-------------------+
Spark < 2.2
It is nothing that unix_timestamp cannot handle:
from pyspark.sql import Row
from pyspark.sql.functions import unix_timestamp
(sc
    .parallelize([Row(dt='2016_08_21 11_31_08')])
    .toDF()
    .withColumn("parsed", unix_timestamp("dt", "yyyy_MM_dd HH_mm_ss")
                # For Spark <= 1.5
                # See issues.apache.org/jira/browse/SPARK-11724
                .cast("double")
                .cast("timestamp"))
    .show(1, False))
## +-------------------+---------------------+
## |dt |parsed |
## +-------------------+---------------------+
## |2016_08_21 11_31_08|2016-08-21 11:31:08.0|
## +-------------------+---------------------+
In both cases the format string should be compatible with Java SimpleDateFormat.

zero323's answer answers the question, but I wanted to add that if your datetime string has a standard format, you should be able to cast it directly into timestamp type:
df.withColumn('datetime', col('datetime_str').cast('timestamp'))
This has the advantage of handling milliseconds, while unix_timestamp only has second precision (to_timestamp works with milliseconds too, but requires Spark >= 2.2 as zero323 stated). I tested it on Spark 2.3.0, using the following format: '2016-07-13 14:33:53.979' (with milliseconds, but it also works without them).
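A minimal sketch of that direct cast (the DataFrame and column names here are made up for illustration):
from pyspark.sql.functions import col

df = spark.createDataFrame([('2016-07-13 14:33:53.979',)], ['datetime_str'])
df = df.withColumn('datetime', col('datetime_str').cast('timestamp'))
df.printSchema()   # datetime should come out as timestamp type, millisecond part preserved
df.show(truncate=False)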

Adding a few more code lines to Florent F's answer, for better understanding and to run the snippet on a local machine:
import os, pdb, sys
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.sql.types import StructType, ArrayType
from pyspark.sql.types import StringType
from pyspark.sql.functions import col

sc = pyspark.SparkContext('local[*]')
spark = SparkSession.builder.getOrCreate()

# preparing some example data - df1 with String type and df2 with Timestamp type
df1 = sc.parallelize([{"key": "a", "date": "2016-02-01"},
                      {"key": "b", "date": "2016-02-02"}]).toDF()
df1.show()

df2 = df1.withColumn('datetime', col('date').cast("timestamp"))
df2.show()

Just want to add more resources and an example to this discussion:
https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
For example, if your ts string is "22 Dec 2022 19:06:36 EST", then the format is "dd MMM yyyy HH:mm:ss zzz"
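As a quick illustration of that pattern (a made-up one-row DataFrame; parsing of zone abbreviations such as EST can depend on the Spark version and the spark.sql.legacy.timeParserPolicy setting):
from pyspark.sql.functions import to_timestamp

df = spark.createDataFrame([("22 Dec 2022 19:06:36 EST",)], ["ts"])
df.withColumn("parsed", to_timestamp("ts", "dd MMM yyyy HH:mm:ss zzz")).show(truncate=False)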

Related

Spark SQL: Parse date string from dd/mm/yyyy to yyyy/mm/dd

I want to use spark SQL or pyspark to reformat a date field from 'dd/mm/yyyy' to 'yyyy/mm/dd'. The field type is string:
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType
spark = SparkSession.builder.master("local[1]") \
    .appName("date.com") \
    .getOrCreate()

my_df = spark.createDataFrame(["13/04/2020", "16/04/2020", "19/04/2020"], StringType()).toDF("date")
expected_df = spark.createDataFrame(["2020/04/12", "2020/04/16", "2020/04/19"], StringType()).toDF("date")
I have tried the following spark sql command, but this returns the date as literally 'yyyy/MM/dd' rather than '2020/04/12'.
select date_format(date, 'dd/MM/yyyy'), 'yyyy/MM/dd' as reformatted_date
FROM my_df
I have also looked at the following documentation but didn't see anything that fits my scenario: https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
If it's not possible in spark sql then pyspark would work.
Any ideas?
You need to convert to date type using to_date first:
select date_format(to_date(date, 'dd/MM/yyyy'), 'yyyy/MM/dd') as reformatted_date
from my_df
df1.select(to_date(date_format(to_date(lit("12/12/2020"), "dd/MM/yyyy"), "yyyy-MM-dd")).alias("campo")).show()
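For completeness, a DataFrame-API version of the to_date/date_format approach from the accepted answer, as a minimal sketch against the my_df defined in the question (to_date with a format argument needs Spark >= 2.2):
from pyspark.sql.functions import to_date, date_format

my_df.select(
    date_format(to_date("date", "dd/MM/yyyy"), "yyyy/MM/dd").alias("reformatted_date")
).show()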

How to convert GUID into integer in pyspark

Hi Stackoverflow fams:
I am new to pyspark and trying to learn as much as I can. But for now, I want to convert GUIDs into integers in pyspark. I can currently run the following statement in SQL to convert a GUID into an int.
CHECKSUM(HASHBYTES('sha2_512',GUID)) AS int_value_wanted
I wanted to do the same thing in pyspark, so I tried creating a temporary table out of the Spark dataframe and adding the above statement to the SQL query. But the code keeps throwing "Undefined function: 'CHECKSUM'". Is there a way I can add the "CHECKSUM" function to pyspark, or do the same thing in some other pyspark way?
from awsglue.context import GlueContext
from pyspark import SparkContext
from pyspark.sql import SQLContext

glueContext = GlueContext(SparkContext.getOrCreate())
spark_session = glueContext.spark_session
sqlContext = SQLContext(spark_session.sparkContext, spark_session)

spark_df = spark_session.createDataFrame(
    [("2540f487-7a29-400a-98a0-c03902e67f73", "1386172469"),
     ("0b32389a-ce01-4e6a-855c-15940cc91e9e", "-2013240275")],
    ("GUDI", "int_value_wanted")
)
spark_df.show(truncate=False)

spark_df.registerTempTable('temp')
new_df = sqlContext.sql("SELECT *, CHECKSUM(HASHBYTES('sha2_512', GUDI)) AS detail_id FROM temp")
new_df.show(truncate=False)
+------------------------------------+----------------+
|GUDI |int_value_wanted|
+------------------------------------+----------------+
|2540f487-7a29-400a-98a0-c03902e67f73|1386172469 |
|0b32389a-ce01-4e6a-855c-15940cc91e9e|-2013240275 |
+------------------------------------+----------------+
Thanks
There is a sha2 built-in function, which returns the checksum for the SHA-2 family as a hex string. SHA-512 is also supported.
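A minimal sketch of that, reusing the spark_df from the question (note that sha2 returns a hex string, not an integer, so it is not a drop-in replacement for SQL Server's CHECKSUM(HASHBYTES(...)) value):
from pyspark.sql.functions import sha2, col

# 512-bit SHA-2 of the GUID column, returned as a hex string
hashed_df = spark_df.withColumn("guid_sha512", sha2(col("GUDI"), 512))
hashed_df.show(truncate=False)
If any deterministic integer will do, the built-in hash (32-bit) and xxhash64 (64-bit, Spark 3.0+) functions also exist, but their values will not match what SQL Server computes.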

How do I add a new date column with constant value to a Spark DataFrame (using PySpark)?

I want to add a column with a default date ('1901-01-01') to an existing dataframe using pyspark.
I used the below code snippet:
from pyspark.sql import functions as F
strRecordStartTime="1970-01-01"
recrodStartTime = hashNonKeyData.withColumn("RECORD_START_DATE_TIME",
    lit(strRecordStartTime).cast("timestamp")
)
It gives me the following error:
org.apache.spark.sql.AnalysisException: cannot resolve '1970-01-01'
Any pointer is appreciated.
Try to use a Python native datetime with lit; sorry, I don't have access to a machine right now.
import datetime
recrodStartTime = hashNonKeyData.withColumn('RECORD_START_DATE_TIME', lit(datetime.datetime(1970, 1, 1)))
I have created one spark dataframe:
from pyspark.sql.types import StringType
df1 = spark.createDataFrame(["Ravi","Gaurav","Ketan","Mahesh"], StringType()).toDF("Name")
Now let's add one new column to the existing dataframe:
from pyspark.sql.functions import lit
import dateutil.parser

yourdate = dateutil.parser.parse('1901-01-01')
df2 = df1.withColumn('Age', lit(yourdate))  # addition of new column
df2.show()  # to print the dataframe
You can validate your schema by using the command below.
df2.printSchema()
Hope that helps.
from pyspark.sql import functions as F
strRecordStartTime = "1970-01-01"
recrodStartTime = hashNonKeyData.withColumn("RECORD_START_DATE_TIME", F.to_date(F.lit(strRecordStartTime)))

PySpark UDF not recognizing number of arguments

I have defined a Python function "DateTimeFormat" which takes three arguments:
A Spark DataFrame column which holds dates (String)
The input format of the column's values, like yyyy-mm-dd (String)
The output format, i.e. the format in which the input has to be returned, like yyyymmdd (String)
I have now registered this function as UDF in Pyspark.
udf_date_time = udf(DateTimeFormat,StringType())
I am trying to call this UDF in a dataframe select, and it seems to work fine as long as the input and output formats are different, like below:
df.select(udf_date_time('entry_date',lit('mmddyyyy'),lit('yyyy-mm-dd')))
But it fails when the input format and output format are the same, with the following error:
df.select('exit_date',udf_date_time('exit_date',lit('yyyy-mm-dd'),lit('yyyy-mm-dd')))
"DateTimeFormat" takes exactly 3 arguments. 2 given
But I'm clearly sending three arguments to the UDF
I have tried the above example on Python 2.7 and Spark 2.1
The function seems to work as expected in normal Python when input and output formats are the same
>>>DateTimeFormat('10152019','mmddyyyy','mmddyyyy')
'10152019'
>>>
But the below code gives an error when run in Spark:
import datetime

# Standard date/timestamp formatter
# Takes a string date, its format and the output format as arguments
# Returns the formatted date as a string
def DateTimeFormat(col, in_frmt, out_frmt):
    date_formatter = {'yyyy': '%Y', 'mm': '%m', 'dd': '%d', 'HH': '%H', 'MM': '%M', 'SS': '%S'}
    for key, value in date_formatter.items():
        in_frmt = in_frmt.replace(key, value)
        out_frmt = out_frmt.replace(key, value)
    return datetime.datetime.strptime(col, in_frmt).strftime(out_frmt)
Calling UDF using the code below
from pyspark.sql.functions import udf,lit
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType
# Create SPARK session
spark = SparkSession.builder.appName("DateChanger").enableHiveSupport().getOrCreate()
df = spark.read.format("csv").option("header", "true").load(file_path)
# Registering UDF
udf_date_time = udf(DateTimeFormat,StringType())
df.select('exit_date',udf_date_time('exit_date',lit('yyyy-mm-dd'),lit('yyyy-mm-dd'))).show()
CSV file input: [the input file was attached as an image in the original post]
The expected result is that the command
df.select('exit_date',udf_date_time('exit_date',lit('yyyy-mm-dd'),lit('yyyy-mm-dd'))).show()
should NOT throw any error like
DateTimeFormat takes exactly 3 arguments but 2 given
I am not sure if there's a better way to do this but you can try the following.
Here I have assumed that you want your dates in a particular format, and have set the default output format (out_frmt='yyyy-mm-dd') in your DateTimeFormat function.
I have added a new function called udf_score to help with the conversion. That might interest you:
from pyspark.sql.types import StringType
from pyspark.sql.functions import udf, lit

df = spark.createDataFrame([
    ["10-15-2019"],
    ["10-16-2019"],
    ["10-17-2019"],
], ['exit_date'])

import datetime

def DateTimeFormat(col, in_frmt, out_frmt='yyyy-mm-dd'):
    date_formatter = {'yyyy': '%Y', 'mm': '%m', 'dd': '%d', 'HH': '%H', 'MM': '%M', 'SS': '%S'}
    for key, value in date_formatter.items():
        in_frmt = in_frmt.replace(key, value)
        out_frmt = out_frmt.replace(key, value)
    return datetime.datetime.strptime(col, in_frmt).strftime(out_frmt)

def udf_score(in_frmt):
    return udf(lambda l: DateTimeFormat(l, in_frmt))

in_frmt = 'mm-dd-yyyy'
df.select('exit_date', udf_score(in_frmt)('exit_date').alias('new_dates')).show()
+----------+----------+
| exit_date| new_dates|
+----------+----------+
|10-15-2019|2019-10-15|
|10-16-2019|2019-10-16|
|10-17-2019|2019-10-17|
+----------+----------+
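For reference, the same mm-dd-yyyy to yyyy-mm-dd conversion can also be done without a Python UDF, using the built-in to_date/date_format functions shown earlier on this page; a minimal sketch (to_date with a format argument needs Spark >= 2.2):
from pyspark.sql.functions import to_date, date_format

df.select(
    'exit_date',
    date_format(to_date('exit_date', 'MM-dd-yyyy'), 'yyyy-MM-dd').alias('new_dates')
).show()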

pyspark to_timestamp function doesn't convert certain timestamps

I would like to use the to_timestamp function to format timestamps in pyspark. How can I do it without the timezone shifting or certain dates being omitted?
from pyspark.sql.types import StringType
from pyspark.sql.functions import col, udf, to_timestamp
date_format = "yyyy-MM-dd'T'HH:mm:ss"
vals = [('2018-03-11T02:39:00Z'), ('2018-03-11T01:39:00Z'), ('2018-03-11T03:39:00Z')]
testdf = spark.createDataFrame(vals, StringType())
testdf.withColumn("to_timestamp", to_timestamp("value",date_format)).show(4,False)
testdf.withColumn("to_timestamp", to_timestamp("value", date_format)).show(4,False)
+--------------------+-------------------+
|value |to_timestamp |
+--------------------+-------------------+
|2018-03-11T02:39:00Z|null |
|2018-03-11T01:39:00Z|2018-03-11 01:39:00|
|2018-03-11T03:39:00Z|2018-03-11 03:39:00|
+--------------------+-------------------+
I expected 2018-03-11T02:39:00Z to format correctly to 2018-03-11 02:39:00
Then I switched to the default to_timestamp function.
testdf.withColumn("to_timestamp", to_timestamp("value")).show(4,False)`
+--------------------+-------------------+
|value |to_timestamp |
+--------------------+-------------------+
|2018-03-11T02:39:00Z|2018-03-10 20:39:00|
|2018-03-11T01:39:00Z|2018-03-10 19:39:00|
|2018-03-11T03:39:00Z|2018-03-10 21:39:00|
+--------------------+-------------------+
The shift in time when you call to_timestamp() with default values happens because your Spark instance is set to your local timezone, not UTC. You can check by running
spark.conf.get('spark.sql.session.timeZone')
If you want your timestamps to be displayed in UTC, set the conf value:
spark.conf.set('spark.sql.session.timeZone', 'UTC')
Another important point about your code: when you define the date format as "yyyy-MM-dd'T'HH:mm:ss", you are essentially asking Spark to ignore the timezone suffix and treat all timestamps as being in the session timezone (which is also why 2018-03-11 02:39:00 comes back as null: that wall-clock time does not exist in US timezones on that DST switch date). The proper format would be date_format = "yyyy-MM-dd'T'HH:mm:ssXXX", but it's a moot point if you are calling to_timestamp() with defaults.
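A minimal sketch combining both points, i.e. a UTC session timezone plus an offset-aware pattern (exact parser behaviour can vary between Spark versions):
from pyspark.sql.functions import to_timestamp

spark.conf.set('spark.sql.session.timeZone', 'UTC')
date_format = "yyyy-MM-dd'T'HH:mm:ssXXX"
testdf.withColumn("to_timestamp", to_timestamp("value", date_format)).show(4, False)
# Expect all three rows to parse, e.g. 2018-03-11T02:39:00Z -> 2018-03-11 02:39:00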
Another suggestion is the from_utc_timestamp function, which treats the input column value as a UTC timestamp; note that it requires a target timezone as a second argument (shown here as "UTC" just as an example):
from pyspark.sql.functions import from_utc_timestamp
testdf.withColumn("to_timestamp", from_utc_timestamp("value", "UTC")).show(4, False)
