Convert string datetime to timestamp (with literal months) in Spark - apache-spark

I'm trying to cast the following string to a timestamp in PySpark:
"30-Jun-2022 14:00:00"
I've tried the following approaches:
f.col("date_string").cast("timestamp")
f.to_timestamp(f.col("date_string")).alias("date_string")
.withColumn(
    "date_string",
    f.to_timestamp(f.col("date_string"))
)
But all of them return a null column. What am I doing wrong?
MVCE:
from pyspark.sql.types import StructType, StructField, StringType

data = [
    ("30-Jun-2022 14:00:00",),
    ("25-Jul-2022 11:00:00",),
    ("10-May-2022 12:00:00",),
    ("11-Jan-2022 09:00:00",)
]
schema = StructType([
    StructField("date_string", StringType(), True)
])
df = spark.createDataFrame(data=data, schema=schema)

I do not have a PySpark testing environment, but in Spark, this:
.withColumn("timestamped", to_timestamp(col("name"), "dd-MMM-yyyy HH:mm:ss"));
returns this (which I assume is what you want):
name,timestamped
30-Jun-2022 14:00:00,2022-06-30 14:00:00
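
For reference, a direct PySpark translation of that answer, applied to the MVCE above, might look like the sketch below (it assumes the df and date_string column from the question, and an English locale so that Jun, Jul, etc. parse):

import pyspark.sql.functions as f

# "MMM" matches the literal month abbreviation; without an explicit format,
# to_timestamp cannot parse this layout and returns null.
df = df.withColumn(
    "date_string",
    f.to_timestamp(f.col("date_string"), "dd-MMM-yyyy HH:mm:ss")
)
df.show(truncate=False)  # e.g. "30-Jun-2022 14:00:00" becomes 2022-06-30 14:00:00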

Related

How to convert the Zulu datetime to UTC with offset in pyspark and spark sql

I have the below datetime in string type. I want to convert it to UTC with an offset.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("Test").enableHiveSupport().getOrCreate()
print("Print statement-1")
schema = StructType([
    StructField("author", StringType(), False),
    StructField("dt", StringType(), False)
])
data = [
    ["author1", "2022-07-22T09:25:47.261Z"],
    ["author2", "2022-07-22T09:26:47.291Z"],
    ["author3", "2022-07-22T09:23:47.411Z"],
    ["author4", "2022-07-22T09:25:47.291Z"]
]
df = spark.createDataFrame(data, schema)
I want to convert the dt column to UTC with an offset.
For example, the first row's value would become 2022-07-22T09:25:47.2610000 +00:00.
How can I do that in PySpark and Spark SQL?
I can easily do that using regexp_replace:
df = df.withColumn("UTC", regexp_replace('dt', 'Z', '000 +00:00'))
because Z is the same as +00:00, but I am not sure that regexp_replace is the correct way of doing the conversion. Is there a method which does the conversion properly rather than regexp_replace?
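
One alternative, sketched here assuming Spark 3.x datetime patterns and a UTC session time zone, is to parse the string into a real timestamp and re-format it with an explicit offset instead of rewriting the raw text (note this keeps millisecond precision rather than the seven fractional digits shown above):

import pyspark.sql.functions as F

# 'X' in the parse pattern accepts the literal 'Z' zone designator.
df = df.withColumn("ts", F.to_timestamp("dt", "yyyy-MM-dd'T'HH:mm:ss.SSSX"))
# Re-render the timestamp with an explicit " +00:00" suffix.
df = df.withColumn(
    "UTC",
    F.concat(F.date_format("ts", "yyyy-MM-dd'T'HH:mm:ss.SSS"), F.lit(" +00:00"))
)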

Casting date to integer returns null in Spark SQL

I want to convert a date column into an integer using Spark SQL.
I'm following this code, but I want to use Spark SQL and not PySpark.
To reproduce the example:
from pyspark.sql.types import *
import pyspark.sql.functions as F

# DUMMY DATA
simpleData = [
    ("James", 34, "2006-01-01", "true", "M", 3000.60),
    ("Michael", 33, "1980-01-10", "true", "F", 3300.80),
    ("Robert", 37, "1992-07-01", "false", "M", 5000.50)
]
columns = ["firstname", "age", "jobStartDate", "isGraduated", "gender", "salary"]
df = spark.createDataFrame(data=simpleData, schema=columns)
df = df.withColumn("jobStartDate", df['jobStartDate'].cast(DateType()))
df = df.withColumn("jobStartDateAsInteger1", F.unix_timestamp(df['jobStartDate']))
display(df)
What I want is to do the same transformation, but using Spark SQL. I am using the following code:
df.createOrReplaceTempView("date_to_integer")
%sql
select
  seg.*,
  CAST(jobStartDate AS INTEGER) AS JobStartDateAsInteger2 -- returns a null value
from date_to_integer seg
How can I solve it?
First you need to CAST your jobStartDate to DATE and then use UNIX_TIMESTAMP to transform it into a Unix timestamp integer.
SELECT
  seg.*,
  UNIX_TIMESTAMP(CAST(jobStartDate AS DATE)) AS JobStartDateAsInteger2
FROM date_to_integer seg
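
For completeness, the same statement can be run from PySpark against the temp view registered above (a small sketch; the column alias is kept from the answer):

result = spark.sql("""
    SELECT
      seg.*,
      UNIX_TIMESTAMP(CAST(jobStartDate AS DATE)) AS JobStartDateAsInteger2
    FROM date_to_integer seg
""")
result.show()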

Casting date from string in Spark

I have a date in my dataframe stored as a string with the format dd/MM/yyyy.
When I try to convert the string to a date, all the functions return null values.
I am looking to convert the column to DateType.
It looks like your date strings contain quotes, you need to remove them, using for example regexp_replace, before calling to_date:
import pyspark.sql.functions as F
df = spark.createDataFrame([("'31-12-2021'",), ("'30-11-2021'",), ("'01-01-2022'",)], ["Birth_Date"])
df = df.withColumn(
    "Birth_Date",
    F.to_date(F.regexp_replace("Birth_Date", "'", ""), "dd-MM-yyyy")
)
df.show()
#+----------+
#|Birth_Date|
#+----------+
#|2021-12-31|
#|2021-11-30|
#|2022-01-01|
#+----------+

Writing Spark Dataframe to ORC gives the wrong timezone

Whenever I write a DataFrame to ORC, the timezone of timestamp fields is not correct.
Here's my code:
import org.apache.spark.sql.{Row, SaveMode}
import org.apache.spark.sql.types._

// schema and sample data
val schema = List(
  StructField("name", StringType),
  StructField("date", TimestampType)
)
val data = Seq(
  Row("test", java.sql.Timestamp.valueOf("2021-03-15 10:10:10.0"))
)
val df = spark.createDataFrame(
  spark.sparkContext.parallelize(data),
  StructType(schema)
)
// changing the session timezone
spark.conf.set("spark.sql.session.timeZone", "MDT")
// the displayed value of the df changes accordingly
df.show // prints 2021-03-15 08:10:10
// writing to ORC
df.write.mode(SaveMode.Overwrite).format("orc").save("/tmp/dateTest.orc/")
The value in the ORC file will be 2021-03-15 10:10:10.0.
Is there any way to control the writer's timezone? Am I missing something here?
Thanks in advance!
After much investigation, it turns out this is something that's not supported (at the moment) for ORC. It is supported for CSV, though.
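
As an illustration of the CSV point, a PySpark sketch (written in Python for consistency with the rest of the post; the path and zone are assumed): CSV stores timestamps as formatted text, so the session time zone, or a per-write timeZone option, is applied on write.

# The CSV writer formats timestamps using spark.sql.session.timeZone,
# which can be overridden per write with the timeZone option.
df.write \
    .mode("overwrite") \
    .option("timeZone", "America/Denver") \
    .format("csv") \
    .save("/tmp/dateTest.csv/")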

Setting datatypes when writing parquet files with spark

I’m reading some data from an Oracle DB and storing it in parquet files with spark-3.0.1/hadoop3.2 through the Python API.
The process works, but the datatypes aren’t maintained, and the resulting types in the parquet file are all “object” except for the date fields.
This is how the datatypes look after creating the dataframe:
conn_properties = {
    "driver": "oracle.jdbc.driver.OracleDriver",
    "url": cfg.ORACLE_URL,
    "query": "select ..... from ...",
}
df = spark.read.format("jdbc") \
    .options(**conn_properties) \
    .load()
df.dtypes:
[('idest', 'string'),
('date', 'timestamp'),
('temp', 'decimal(18,3)'),
('hum', 'decimal(18,3)'),
('prec', 'decimal(18,3)'),
('wspeed', 'decimal(18,3)'),
('wdir', 'decimal(18,3)'),
('radiation', 'decimal(18,3)')]
They all match the original database schema.
But opening the parquet file with pandas I get:
df.dtypes:
idest object
date datetime64[ns]
temp object
hum object
prec object
wspeed object
wdir object
radiation object
dtype: object
I tried to change the datatype for one column using the customSchema option; the doc says Spark SQL types can be used:
Users can specify the corresponding data types of Spark SQL instead of using the defaults.
And FloatType is included in Spark SQL datatypes.
So I expect something like this should work:
custom_schema = "radiation FloatType"
conn_properties = {
    "driver": "oracle.jdbc.driver.OracleDriver",
    "url": cfg.ORACLE_URL,
    "query": "select ... from ...",
    "customSchema": custom_schema
}
But I get this error:
pyspark.sql.utils.ParseException:
DataType floattype is not supported.(line 1, pos 10)
== SQL ==
radiation FloatType
I can’t find an option in the corresponding “write” method to specify the schema/types for the parquet DataFrameWriter.
Any idea how I can set the datatype mapping?
Thank you
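
Not answered in the thread, but the likely culprit is the type spelling: the JDBC customSchema option expects a DDL-formatted string of column/type pairs (e.g. FLOAT, DECIMAL(18,3)), not Spark class names like FloatType. A sketch mirroring the code above under that assumption:

# customSchema takes DDL type names, not Spark class names such as FloatType.
custom_schema = "radiation FLOAT"
conn_properties = {
    "driver": "oracle.jdbc.driver.OracleDriver",
    "url": cfg.ORACLE_URL,
    "query": "select ... from ...",
    "customSchema": custom_schema
}
df = spark.read.format("jdbc").options(**conn_properties).load()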
