spark convert datetime to timestamp - apache-spark

I have a column in pyspark dataframe which is in the format 2021-10-28T22:19:03.0030059Z (string datatype). How to convert this into a timestamp datatype in pyspark?
I'm using the code snippet below but this returns nulls, as it's unable to convert it. Can someone please recommend on how to convert this?
df3.select(to_timestamp(df.DateTime, 'yyyy-MM-ddHH:mm:ss:SSS').alias('dt'),col('DateTime')).show()

You have to escape (put it in '') T and Z:
import pyspark.sql.functions as F
df = spark.createDataFrame([{"DateTime": "2021-10-28T22:19:03.0030059Z"}])
df.select(F.to_timestamp(df.DateTime, "yyyy-MM-dd'T'HH:mm:ss.SSSSSSS'Z'").alias('dt'),F.col('DateTime')).show(truncate = False)`

Related

Casting date to integer returns null in Spark SQL

I want to convert a date column into integer using Spark SQL.
I'm following this code, but I want to use Spark SQL and not PySpark.
Reproduce the example:
from pyspark.sql.types import *
import pyspark.sql.functions as F
# DUMMY DATA
simpleData = [("James",34,"2006-01-01","true","M",3000.60),
("Michael",33,"1980-01-10","true","F",3300.80),
("Robert",37,"1992-07-01","false","M",5000.50)
]
columns = ["firstname","age","jobStartDate","isGraduated","gender","salary"]
df = spark.createDataFrame(data = simpleData, schema = columns)
df = df.withColumn("jobStartDate", df['jobStartDate'].cast(DateType()))
df = df.withColumn("jobStartDateAsInteger1", F.unix_timestamp(df['jobStartDate']))
display(df)
What I want is to do the same transformation, but using Spark SQL. I am using the following code:
df.createOrReplaceTempView("date_to_integer")
%sql
select
seg.*,
CAST (jobStartDate AS INTEGER) as JobStartDateAsInteger2 -- return null value
from date_to_integer seg
How to solve it?
First you need to CAST your jobStartDate to DATE and then use UNIX_TIMESTAMP to transform it to UNIX integer.
SELECT
seg.*,
UNIX_TIMESTAMP(CAST (jobStartDate AS DATE)) AS JobStartDateAsInteger2
FROM date_to_integer seg

Handle Dictionary like String conversion to pyspark dataframe

I'm a pyspark dataframe as
Below code :-
data=[[1,"sai","ram",'"color":"red","green","blue","flower":"rose","tulip"'],
[2,"avi","kumar",'"color":"grey","black","white","flower":"roses","tulips"'] ,
[3,"ravi","prakash",'"color":"pink","cherry red","blue","flower":"rosey","tulipey"']
]
data_columns=["id","f_name","l_name","feature_stack"]
d=spark.createDataFrame(data=data,schema=data_columns)
d.show(truncate=False)
i wanted to convert this dictonary type string to a dataframe as:
Thanks in advance...
Try with regexp_extract function to extract color,flower from the feature_stack data.
Example:
from pyspark.sql.functions import *
d.withColumn("color",regexp_extract(col("feature_stack"),'"color":(.*),"flower"',1)).\
withColumn("flower",regexp_extract(col("feature_stack"),'"flower":(.*)',1)).\
show(10,False)

PySpark UDF not recognizing number of arguments

I have defined a Python function "DateTimeFormat" which takes three arguments
Spark Dataframe column which has date formats (String)
The input format of column's value like yyyy-mm-dd (String)
The output format i.e. the format in which the input has to be returned like yyyymmdd (String)
I have now registered this function as UDF in Pyspark.
udf_date_time = udf(DateTimeFormat,StringType())
I am trying to call this UDF in dataframe select and it seems to be working fine as long as the input format and output are different like below
df.select(udf_date_time('entry_date',lit('mmddyyyy'),lit('yyyy-mm-dd')))
But it fails, when the input format and output format are same with the following error
df.select('exit_date',udf_date_time('exit_date',lit('yyyy-mm-dd'),lit('yyyy-mm-dd')))
"DateTimeFormat" takes exactly 3 arguments. 2 given
But I'm clearly sending three arguments to the UDF
I have tried the above example on Python 2.7 and Spark 2.1
The function seems to work as expected in normal Python when input and output formats are the same
>>>DateTimeFormat('10152019','mmddyyyy','mmddyyyy')
'10152019'
>>>
But the below code is giving error when run in SPARK
import datetime
# Standard date,timestamp formatter
# Takes string date, its format and output format as arguments
# Returns string formatted date
def DateTimeFormat(col,in_frmt,out_frmt):
date_formatter ={'yyyy':'%Y','mm':'%m','dd':'%d','HH':'%H','MM':'%M','SS':'%S'}
for key,value in date_formatter.items():
in_frmt = in_frmt.replace(key,value)
out_frmt = out_frmt.replace(key,value)
return datetime.datetime.strptime(col,in_frmt).strftime(out_frmt)
Calling UDF using the code below
from pyspark.sql.functions import udf,lit
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType
# Create SPARK session
spark = SparkSession.builder.appName("DateChanger").enableHiveSupport().getOrCreate()
df = spark.read.format("csv").option("header", "true").load(file_path)
# Registering UDF
udf_date_time = udf(DateTimeFormat,StringType())
df.select('exit_date',udf_date_time('exit_date',lit('yyyy-mm-dd'),lit('yyyy-mm-dd'))).show()
CSV file input Input file
Expected result is the command
df.select('exit_date',udf_date_time('exit_date',lit('yyyy-mm-dd'),lit('yyyy-mm-dd'))).show()
should NOT throw any error like
DateTimeFormat takes exactly 3 arguments but 2 given
I am not sure if there's a better way to do this but you can try the following.
Here I have assumed that you want your dates to a particular format and have set the default for the output format (out_frmt='yyyy-mm-dd') in your DateTimeFormat function
I have added a new function called udf_score to help with conversions. That might interest you
from pyspark.sql.types import StringType
from pyspark.sql.functions import udf, lit
df = spark.createDataFrame([
["10-15-2019"],
["10-16-2019"],
["10-17-2019"],
], ['exit_date'])
import datetime
def DateTimeFormat(col,in_frmt,out_frmt='yyyy-mm-dd'):
date_formatter ={'yyyy':'%Y','mm':'%m','dd':'%d','HH':'%H','MM':'%M','SS':'%S'}
for key,value in date_formatter.items():
in_frmt = in_frmt.replace(key,value)
out_frmt = out_frmt.replace(key,value)
return datetime.datetime.strptime(col,in_frmt).strftime(out_frmt)
def udf_score(in_frmt):
return udf(lambda l: DateTimeFormat(l, in_frmt))
in_frmt = 'mm-dd-yyyy'
df.select('exit_date',udf_score(in_frmt)('exit_date').alias('new_dates')).show()
+----------+----------+
| exit_date| new_dates|
+----------+----------+
|10-15-2019|2019-10-15|
|10-16-2019|2019-10-16|
|10-17-2019|2019-10-17|
+----------+----------+

How to convert a column in H2OFrame to a python list?

I've read the PythonBooklet.pdf by H2O.ai and the python API documentation, but still can't find a clean way to do this. I know I can do either of the following:
Convert H2OFrame to Spark DataFrame and do a flatMap + collect or collect + list comprehension.
Use H2O's get_frame_data, which gives me a string of header and data separated by \n; then convert it a list (a numeric list in my case).
Is there a better way to do this? Thank you.
You can try something like this: bring an H2OFrame into python as a pandas dataframe by calling .as_data_frame(), then call .tolist() on the column of interest.
A self contained example w/ iris
import h2o
h2o.init()
df = h2o.import_file("iris_wheader.csv")
pd = df.as_data_frame()
pd['sepal_len'].tolist()
You can (1) convert the H2o frame to pandas dataframe and (2) convert pandas dataframe to list as follows:
pd=h2o.as_list(h2oFrame)
l=pd["column"].tolist()

Timestamp parsing in pyspark

df1:
Timestamp:
1995-08-01T00:00:01.000+0000
Is there a way to separate the day of the month in the timestamp column of the data frame using pyspark. Not able to provide the code, I am new to spark. I do not have a clue on how to proceed.
You can parse this timestamp using unix_timestamp:
from pyspark.sql import functions as F
format = "yyyy-MM-dd'T'HH:mm:ss.SSSZ"
df2 = df1.withColumn('Timestamp2', F.unix_timestamp('Timestamp', format).cast('timestamp'))
Then, you can use dayofmonth in the new Timestamp column:
df2.select(F.dayofmonth('Timestamp2'))
More detials about these functions can be found in the pyspark functions documentation.
Code:
df1.select(dayofmonth('Timestamp').alias('day'))

Resources