I have a column in a PySpark DataFrame in the format 2021-10-28T22:19:03.0030059Z (string datatype). How can I convert it to a timestamp datatype in PySpark?
I'm using the code snippet below, but it returns nulls because it is unable to convert the value. Can someone please recommend how to convert this?
... , 'yyyy-MM-ddHH:mm:ss:SSS').alias('dt'),col('DateTime')).show()

You have to escape the literal T and Z by wrapping each of them in single quotes:
import pyspark.sql.functions as F
df = spark.createDataFrame([{"DateTime": "2021-10-28T22:19:03.0030059Z"}])
df.select(F.to_timestamp(F.col('DateTime'), "yyyy-MM-dd'T'HH:mm:ss.SSSSSSS'Z'").alias('dt'), F.col('DateTime')).show(truncate=False)
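As a cross-check outside Spark, the same string can be parsed with Python's standard library. This is only a sketch: `parse_iso_7digit` is a hypothetical helper, and it assumes the value always ends in a literal `Z` meaning UTC. Python's `%f` accepts at most six fractional digits, so the helper keeps microseconds and drops the rest; Spark timestamps also hold microsecond precision, so the seventh digit should not be expected to survive there either.

```python
from datetime import datetime, timezone

def parse_iso_7digit(value):
    """Parse 'yyyy-MM-ddTHH:mm:ss.fffffffZ' into an aware UTC datetime."""
    # Drop the trailing 'Z' and separate the fractional seconds.
    body = value.rstrip("Z")
    seconds_part, _, fraction = body.partition(".")
    # Keep at most six fractional digits (microseconds), padding if shorter.
    micros = fraction[:6].ljust(6, "0")
    parsed = datetime.strptime(seconds_part + "." + micros, "%Y-%m-%dT%H:%M:%S.%f")
    return parsed.replace(tzinfo=timezone.utc)

ts = parse_iso_7digit("2021-10-28T22:19:03.0030059Z")
print(ts.isoformat())
```

If the value printed here and the `dt` column in Spark disagree by more than the dropped digit, the pattern string is the first thing to re-check.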


Casting date to integer returns null in Spark SQL

I want to convert a date column into integer using Spark SQL.
I'm following this code, but I want to use Spark SQL and not PySpark.
Reproduce the example:
from pyspark.sql.types import *
import pyspark.sql.functions as F
simpleData = [("James",34,"2006-01-01","true","M",3000.60)]
columns = ["firstname","age","jobStartDate","isGraduated","gender","salary"]
df = spark.createDataFrame(data = simpleData, schema = columns)
df = df.withColumn("jobStartDate", df['jobStartDate'].cast(DateType()))
df = df.withColumn("jobStartDateAsInteger1", F.unix_timestamp(df['jobStartDate']))
What I want is to do the same transformation, but using Spark SQL. I am using the following code:
SELECT CAST(jobStartDate AS INTEGER) AS JobStartDateAsInteger2 -- returns null
FROM date_to_integer seg
How to solve it?
First CAST jobStartDate to DATE, then use UNIX_TIMESTAMP to convert it to a Unix timestamp (integer seconds):
SELECT UNIX_TIMESTAMP(CAST(jobStartDate AS DATE)) AS JobStartDateAsInteger2
FROM date_to_integer seg
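To sanity-check the integer the query should return for the sample row, the epoch value can be computed without Spark. This sketch assumes the Spark session time zone is UTC; UNIX_TIMESTAMP interprets the date at midnight in the session time zone, so other zones shift the result. `epoch_seconds_utc` is a hypothetical helper name.

```python
import calendar
from datetime import date

def epoch_seconds_utc(day):
    """Seconds since 1970-01-01 00:00:00 UTC for midnight of the given date."""
    # timetuple() yields midnight of that day; timegm() reads it as UTC.
    return calendar.timegm(day.timetuple())

expected = epoch_seconds_utc(date(2006, 1, 1))
print(expected)
```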

Handle Dictionary like String conversion to pyspark dataframe

I have a PySpark DataFrame built with the code below:
[2,"avi","kumar",'"color":"grey","black","white","flower":"roses","tulips"'] ,
[3,"ravi","prakash",'"color":"pink","cherry red","blue","flower":"rosey","tulipey"']
I want to convert this dictionary-like string into a DataFrame.
Thanks in advance...
Try the regexp_extract function to extract color and flower from the feature_stack data.
from pyspark.sql.functions import *
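The extraction can be prototyped with Python's `re` module before moving it into Spark. The two patterns below are assumptions based only on the two sample rows shown in the question, and `extract_features` and `split_values` are hypothetical helper names. In PySpark the same patterns could probably be passed as `regexp_extract(column, pattern, 1)`.

```python
import re

# Assumed patterns: "color" always comes first and "flower" runs to the end.
COLOR_PATTERN = r'"color":(.*?),"flower"'
FLOWER_PATTERN = r'"flower":(.*)$'

def split_values(raw):
    """Turn '"a","b"' into ['a', 'b']."""
    return [item.strip('"') for item in raw.split(",")]

def extract_features(text):
    """Pull the color and flower value lists out of one feature string."""
    color = re.search(COLOR_PATTERN, text)
    flower = re.search(FLOWER_PATTERN, text)
    return {
        "color": split_values(color.group(1)) if color else [],
        "flower": split_values(flower.group(1)) if flower else [],
    }

row = '"color":"grey","black","white","flower":"roses","tulips"'
features = extract_features(row)
print(features)
```

Values that themselves contain a comma would break the simple split, so this only fits data shaped like the samples.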

PySpark UDF not recognizing number of arguments

I have defined a Python function "DateTimeFormat" which takes three arguments:
A Spark DataFrame column containing dates (String)
The input format of the column's values, e.g. yyyy-mm-dd (String)
The output format, i.e. the format in which the value should be returned, e.g. yyyymmdd (String)
I have now registered this function as UDF in Pyspark.
udf_date_time = udf(DateTimeFormat,StringType())
I am trying to call this UDF in a DataFrame select, and it seems to work fine as long as the input and output formats are different, as below:
df.select(udf_date_time('entry_date',lit('mmddyyyy'),lit('yyyy-mm-dd')))
But it fails with the following error when the input format and output format are the same:
df.select('exit_date',udf_date_time('exit_date',lit('yyyy-mm-dd'),lit('yyyy-mm-dd')))
"DateTimeFormat" takes exactly 3 arguments. 2 given
But I'm clearly sending three arguments to the UDF
I have tried the above example on Python 2.7 and Spark 2.1
The function seems to work as expected in normal Python when input and output formats are the same
But the below code is giving error when run in SPARK
import datetime
# Standard date,timestamp formatter
# Takes string date, its format and output format as arguments
# Returns string formatted date
def DateTimeFormat(col,in_frmt,out_frmt):
    date_formatter ={'yyyy':'%Y','mm':'%m','dd':'%d','HH':'%H','MM':'%M','SS':'%S'}
    for key,value in date_formatter.items():
        in_frmt = in_frmt.replace(key,value)
        out_frmt = out_frmt.replace(key,value)
    return datetime.datetime.strptime(col,in_frmt).strftime(out_frmt)
Calling UDF using the code below
from pyspark.sql.functions import udf,lit
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType
# Create SPARK session
spark = SparkSession.builder.appName("DateChanger").enableHiveSupport().getOrCreate()
df = spark.read.format("csv").option("header", "true").load(file_path)
# Registering UDF
udf_date_time = udf(DateTimeFormat,StringType())
df.select('exit_date',udf_date_time('exit_date',lit('yyyy-mm-dd'),lit('yyyy-mm-dd'))).show()
CSV file input Input file
The expected result is that the command
df.select('exit_date',udf_date_time('exit_date',lit('yyyy-mm-dd'),lit('yyyy-mm-dd'))).show()
should NOT throw an error like
DateTimeFormat takes exactly 3 arguments but 2 given
I am not sure if there's a better way to do this but you can try the following.
Here I have assumed that you want your dates in one particular format, so I have set a default for the output format (out_frmt='yyyy-mm-dd') in your DateTimeFormat function.
I have also added a new function called udf_score, which builds a UDF for a given input format. That might interest you.
from pyspark.sql.types import StringType
from pyspark.sql.functions import udf, lit
df = spark.createDataFrame([
], ['exit_date'])
import datetime
def DateTimeFormat(col,in_frmt,out_frmt='yyyy-mm-dd'):
    date_formatter ={'yyyy':'%Y','mm':'%m','dd':'%d','HH':'%H','MM':'%M','SS':'%S'}
    for key,value in date_formatter.items():
        in_frmt = in_frmt.replace(key,value)
        out_frmt = out_frmt.replace(key,value)
    return datetime.datetime.strptime(col,in_frmt).strftime(out_frmt)
def udf_score(in_frmt):
    return udf(lambda l: DateTimeFormat(l, in_frmt))
in_frmt = 'mm-dd-yyyy'
df.select('exit_date',udf_score(in_frmt)('exit_date').alias('new_dates')).show()
| exit_date| new_dates|
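The idea behind udf_score, binding the format strings in a closure so that only the column value is passed at call time, can be tried in plain Python first. This is a minimal sketch: `make_formatter` and `to_strptime` are hypothetical names, and the token table is copied from DateTimeFormat above. In Spark, the returned function would be wrapped as `udf(make_formatter(...), StringType())`.

```python
import datetime

# Same token mapping as DateTimeFormat.
TOKENS = {'yyyy': '%Y', 'mm': '%m', 'dd': '%d', 'HH': '%H', 'MM': '%M', 'SS': '%S'}

def to_strptime(fmt):
    """Translate a yyyy/mm/dd style format into strptime directives."""
    for key, value in TOKENS.items():
        fmt = fmt.replace(key, value)
    return fmt

def make_formatter(in_frmt, out_frmt='yyyy-mm-dd'):
    """Return a one-argument function with both formats already bound."""
    py_in, py_out = to_strptime(in_frmt), to_strptime(out_frmt)

    def convert(value):
        return datetime.datetime.strptime(value, py_in).strftime(py_out)

    return convert

same_format = make_formatter('yyyy-mm-dd', 'yyyy-mm-dd')
reorder = make_formatter('mm-dd-yyyy')
print(same_format('2017-01-31'), reorder('01-31-2017'))
```

Because the formats never travel through Spark as column arguments, it no longer matters whether the input and output formats are identical.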

How to convert a column in H2OFrame to a python list?

I've read the PythonBooklet.pdf and the Python API documentation, but still can't find a clean way to do this. I know I can do either of the following:
Convert H2OFrame to Spark DataFrame and do a flatMap + collect or collect + list comprehension.
Use H2O's get_frame_data, which gives me a string of header and data separated by \n; then convert it to a list (a numeric list in my case).
Is there a better way to do this? Thank you.
You can try something like this: bring an H2OFrame into python as a pandas dataframe by calling .as_data_frame(), then call .tolist() on the column of interest.
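A self-contained example with iris:
import h2o
h2o.init()  # connect to (or start) a local H2O cluster
df = h2o.import_file("iris_wheader.csv")
pdf = df.as_data_frame()  # pandas DataFrame; named pdf so it does not shadow the usual pandas alias pd
values = pdf[pdf.columns[0]].tolist()  # first column as a plain Python list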
You can (1) convert the H2OFrame to a pandas DataFrame and (2) convert the pandas DataFrame column to a list.
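If pandas is not available, the second option from the question (parsing the header-plus-rows string) can be done with the standard `csv` module. This is a sketch under assumptions: `column_from_csv_text` is a hypothetical helper, and the sample string is made up to stand in for whatever get_frame_data returns.

```python
import csv
import io

def column_from_csv_text(text, column):
    """Return one column of header-plus-rows CSV text as a list of floats."""
    reader = csv.DictReader(io.StringIO(text))
    return [float(row[column]) for row in reader]

# Made-up sample standing in for the string returned by get_frame_data.
sample = "sepal_len,sepal_wid\n5.1,3.5\n4.9,3.0\n"
values = column_from_csv_text(sample, "sepal_len")
print(values)
```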

Timestamp parsing in pyspark

Is there a way to extract the day of the month from the timestamp column of a DataFrame using PySpark? I can't provide any code; I am new to Spark and do not have a clue how to proceed.
You can parse this timestamp using unix_timestamp:
from pyspark.sql import functions as F
fmt = "yyyy-MM-dd'T'HH:mm:ss.SSSZ"
df2 = df1.withColumn('Timestamp2', F.unix_timestamp('Timestamp', fmt).cast('timestamp'))
Then, you can use dayofmonth on the new Timestamp2 column:
df2.select(F.dayofmonth('Timestamp2'))
More details about these functions can be found in the pyspark functions documentation.
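For a quick check of a single value without a Spark session, the same pattern has a close counterpart in Python's `strptime`. The sample timestamp below is made up, since the question does not show its data, and `day_of_month` is a hypothetical helper. Note that this reads the day in the string's own UTC offset, whereas Spark evaluates dayofmonth in the session time zone, so the two can differ near midnight.

```python
from datetime import datetime

# Python counterpart of the Spark pattern yyyy-MM-dd'T'HH:mm:ss.SSSZ
PY_FORMAT = "%Y-%m-%dT%H:%M:%S.%f%z"

def day_of_month(value):
    """Parse one timestamp string and return its day of the month."""
    return datetime.strptime(value, PY_FORMAT).day

# Made-up sample value.
day = day_of_month("2018-03-21T10:15:30.000+0000")
print(day)
```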
