Casting date from string spark - apache-spark

I have a date column in my dataframe stored as a string with the format dd/MM/yyyy.
When I try to convert the string to a date, all the functions return null values.
I am looking to convert the column to DateType.

It looks like your date strings contain quotes. You need to remove them, for example with regexp_replace, before calling to_date:
import pyspark.sql.functions as F
df = spark.createDataFrame([("'31-12-2021'",), ("'30-11-2021'",), ("'01-01-2022'",)], ["Birth_Date"])
df = df.withColumn(
    "Birth_Date",
    F.to_date(F.regexp_replace("Birth_Date", "'", ""), "dd-MM-yyyy")
)
df.show()
#+----------+
#|Birth_Date|
#+----------+
#|2021-12-31|
#|2021-11-30|
#|2022-01-01|
#+----------+
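Note that the question mentions a dd/MM/yyyy format with slashes; if the strings use slashes and carry no quotes, the same to_date call works with the matching pattern. A minimal sketch, assuming slash-separated values:
import pyspark.sql.functions as F

# Minimal sketch assuming slash-separated strings with no quotes, e.g. "31/12/2021"
df_slash = spark.createDataFrame([("31/12/2021",), ("01/01/2022",)], ["Birth_Date"])
df_slash = df_slash.withColumn("Birth_Date", F.to_date("Birth_Date", "dd/MM/yyyy"))
df_slash.printSchema()
#root
# |-- Birth_Date: date (nullable = true)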

Related

Casting date to integer returns null in Spark SQL

I want to convert a date column into an integer using Spark SQL.
I'm following this code, but I want to use Spark SQL rather than PySpark.
Reproduce the example:
from pyspark.sql.types import *
import pyspark.sql.functions as F
# DUMMY DATA
simpleData = [("James", 34, "2006-01-01", "true", "M", 3000.60),
              ("Michael", 33, "1980-01-10", "true", "F", 3300.80),
              ("Robert", 37, "1992-07-01", "false", "M", 5000.50)]
columns = ["firstname","age","jobStartDate","isGraduated","gender","salary"]
df = spark.createDataFrame(data = simpleData, schema = columns)
df = df.withColumn("jobStartDate", df['jobStartDate'].cast(DateType()))
df = df.withColumn("jobStartDateAsInteger1", F.unix_timestamp(df['jobStartDate']))
display(df)
What I want is to do the same transformation, but using Spark SQL. I am using the following code:
df.createOrReplaceTempView("date_to_integer")
%sql
select
    seg.*,
    CAST(jobStartDate AS INTEGER) as JobStartDateAsInteger2 -- returns null value
from date_to_integer seg
How to solve it?
First you need to CAST your jobStartDate to DATE, and then use UNIX_TIMESTAMP to transform it into a Unix epoch integer.
SELECT
    seg.*,
    UNIX_TIMESTAMP(CAST(jobStartDate AS DATE)) AS JobStartDateAsInteger2
FROM date_to_integer seg
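If you'd rather drive this from Python instead of a %sql cell, the same statement can be run through spark.sql. A short sketch, assuming the temp view date_to_integer created above:
result = spark.sql("""
    SELECT
        seg.*,
        UNIX_TIMESTAMP(CAST(jobStartDate AS DATE)) AS JobStartDateAsInteger2
    FROM date_to_integer seg
""")
result.show()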

spark convert datetime to timestamp

I have a column in a pyspark dataframe which is in the format 2021-10-28T22:19:03.0030059Z (string datatype). How can I convert this into a timestamp datatype in pyspark?
I'm using the code snippet below, but it returns nulls because it's unable to convert the value. Can someone please advise how to convert this?
df3.select(to_timestamp(df.DateTime, 'yyyy-MM-ddHH:mm:ss:SSS').alias('dt'),col('DateTime')).show()
You have to escape the literal T and Z by putting them in single quotes:
import pyspark.sql.functions as F
df = spark.createDataFrame([{"DateTime": "2021-10-28T22:19:03.0030059Z"}])
df.select(F.to_timestamp(df.DateTime, "yyyy-MM-dd'T'HH:mm:ss.SSSSSSS'Z'").alias('dt'), F.col('DateTime')).show(truncate=False)
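Depending on the Spark version, an ISO-8601 string like this can often be cast to a timestamp directly, without spelling out the pattern. A hedged sketch (fractional-second handling varies across versions, so verify on your data):
import pyspark.sql.functions as F

# Sketch: recent Spark versions usually accept a direct cast of ISO-8601 strings
df = spark.createDataFrame([{"DateTime": "2021-10-28T22:19:03.0030059Z"}])
df.withColumn("dt", F.col("DateTime").cast("timestamp")).printSchema()
#root
# |-- DateTime: string (nullable = true)
# |-- dt: timestamp (nullable = true)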

pyspark: removing microseconds from timestamp

I am working on a pyspark script, and one of the required transformations is to convert microsecond timestamps into second-precision timestamps:
Read the parquet file as input.
Determine whether any column is a "timestamp". (It will be in microseconds.)
Example - 2019-03-30 19:56:14.520138
If yes, convert it to 'yyyy-mm-dd hh:mm:ss' format.
After conversion - 2019-03-30 19:56:14
Write the dataframe back to S3 in parquet format.
I have tried the code below, but it doesn't work; the returned dataframe still shows microseconds.
df = spark.read.parquet(p_input_loc)

def customize_df(df):
    getTimestampCol = list(
        filter(lambda x: "timestamp" in x, df.dtypes))
    print(getTimestampCol)
    """[('created_at', 'timestamp'), ('updated_at', 'timestamp')]"""
    if getTimestampCol:
        for row in getTimestampCol:
            df = df.withColumn(row[0], f.to_timestamp(row[0], 'yyyy-mm-dd hh:mm:ss'))
        return df
    else:
        return df
So I need help!!
The problem here is with your function usage.
The to_timestamp function parses a date in the format it is given and converts it to a timestamp; to change the display format you need the date_format function.
Here is an example:
import pyspark.sql.functions as f
import pyspark.sql.types as t

df2 = spark.createDataFrame([("2020-01-01 11:22:59.9989", "12312020", "31122020"),
                             ("2020-01-01 11:22:59.9989", "12312020", "31122020")],
                            ["ID", "Start_date", "End_date"])
df2.withColumn('ss', f.date_format(df2.ID.cast(t.TimestampType()), 'yyyy-MM-dd HH:mm:ss')).select('ss', 'ID').show(2, False)
+-------------------+------------------------+
|ss |ID |
+-------------------+------------------------+
|2020-01-01 11:22:59|2020-01-01 11:22:59.9989|
|2020-01-01 11:22:59|2020-01-01 11:22:59.9989|
+-------------------+------------------------+
So replace
df = df.withColumn(row[0], f.to_timestamp(row[0], 'yyyy-mm-dd hh:mm:ss'))
with
df = df.withColumn(row[0], f.date_format(row[0], 'yyyy-MM-dd HH:mm:ss'))
since your column is already of TimestampType.
Hope it helps
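Putting that back into the function from the question, the loop might look like the sketch below (assuming the usual import of pyspark.sql.functions as f). Note that date_format returns a string column; if a truncated timestamp is needed instead, date_trunc("second", ...) could be an alternative worth checking.
import pyspark.sql.functions as f

# Sketch of the corrected function: reformat every detected timestamp column.
# Note: date_format returns a string; f.date_trunc("second", name) may be an
# alternative if the column should stay a timestamp.
def customize_df(df):
    timestamp_cols = [name for name, dtype in df.dtypes if dtype == "timestamp"]
    for name in timestamp_cols:
        df = df.withColumn(name, f.date_format(name, "yyyy-MM-dd HH:mm:ss"))
    return df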

Parsing timestamps from string and rounding seconds in spark

I have a spark DataFrame with a column "requestTime", which is a string representation of a timestamp. How can I convert it to get this format: YY-MM-DD HH:MM:SS, knowing that I have the following value: 20171107014824952 (which means : 2017-11-07 01:48:25)?
The seconds part is made up of 5 digits; in the example above it is 24952, while the log file displayed 25, so I have to round 24.952 up before applying the to_timestamp function. That's why I am asking for help.
Assuming you have the following spark DataFrame:
df.show()
#+-----------------+
#| requestTime|
#+-----------------+
#|20171107014824952|
#+-----------------+
With the schema:
df.printSchema()
#root
# |-- requestTime: string (nullable = true)
You can use the techniques described in Convert pyspark string to date format to convert this to a timestamp. Since the solution is dependent on your spark version, I've created the following helper function:
import pyspark.sql.functions as f

def timestamp_from_string(date_str, fmt):
    try:
        """For spark version 2.2 and above, to_timestamp is available"""
        return f.to_timestamp(date_str, fmt)
    except (TypeError, AttributeError):
        """For spark version 2.1 and below, you'll have to do it this way"""
        return f.from_unixtime(f.unix_timestamp(date_str, fmt))
Now call it on your data using the appropriate format:
df.withColumn(
    "requestTime",
    timestamp_from_string(f.col("requestTime"), "yyyyMMddhhmmssSSS")
).show()
#+-------------------+
#| requestTime|
#+-------------------+
#|2017-11-07 01:48:24|
#+-------------------+
Unfortunately, this truncates the timestamp instead of rounding.
Therefore, you need to do the rounding yourself before converting. The tricky part is that the number is stored as a string: you'll have to cast it to a double, divide by 1000, round, cast it back to a long (to chop off the decimal; you can't use int because the number is too big), and finally back to a string.
df.withColumn(
    "requestTime",
    timestamp_from_string(
        f.round(f.col("requestTime").cast("double")/1000.0).cast('long').cast('string'),
        "yyyyMMddhhmmss"
    )
).show()
#+-------------------+
#| requestTime|
#+-------------------+
#|2017-11-07 01:48:25|
#+-------------------+
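As a side note, a 17-digit value sits slightly beyond the range where a double stays exact, so if you want to avoid any floating-point wobble in the milliseconds, a decimal cast is a hypothetical alternative (same helper as above, only the cast changes):
df.withColumn(
    "requestTime",
    timestamp_from_string(
        # Sketch: decimal keeps all 17 digits exact before dividing and rounding
        f.round(f.col("requestTime").cast("decimal(20,0)") / 1000).cast("long").cast("string"),
        "yyyyMMddhhmmss"
    )
).show()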

Converting a column from string to date with to_date populates a different month in pyspark

I am using spark 1.6.3. When converting a column val1 (of datatype string) to date, the code is populating a different month in the result than what's in the source.
For example, suppose my source is 6/15/2017 18:32. The code below is producing 15-1-2017 as the result (Note that the month is incorrect).
My code snippet is as below
from pyspark.sql.functions import from_unixtime,unix_timestamp ,to_date
df5 = df.withColumn("val1", to_date(from_unixtime(unix_timestamp(("val1"), "mm/dd/yyyy"))))
Expected output is 6/15/2017 of date type. Please suggest.
You're using the incorrect date format. You need to use MM for the month (not mm).
For example:
df = sqlCtx.createDataFrame([('6/15/2017 18:32',)], ["val1"])
df.printSchema()
#root
# |-- val1: string (nullable = true)
As we can see, val1 is a string. We can convert it to a date using your code with the capital M:
from pyspark.sql.functions import from_unixtime, unix_timestamp, to_date
df5 = df.withColumn("val1", to_date(from_unixtime(unix_timestamp(("val1"), "MM/dd/yyyy"))))
df5.show()
#+----------+
#| val1|
#+----------+
#|2017-06-15|
#+----------+
The new column is a date type, which will display as YYYY-MM-DD:
df5.printSchema()
#root
# |-- val1: date (nullable = true)
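If the value should also be displayed back in the original 6/15/2017 style, that's a formatting concern rather than a type concern; date_format can render the date column as a string for display. A small sketch:
from pyspark.sql.functions import date_format

# Sketch: render the DateType column back as an M/d/yyyy string for display;
# the underlying df5 column stays a date.
df5.select(date_format("val1", "M/d/yyyy").alias("val1_display")).show()
#+------------+
#|val1_display|
#+------------+
#|   6/15/2017|
#+------------+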
