Changing string to timestamp in Pyspark - apache-spark

I'm trying to convert two string columns, c1 and c2, to Timestamp columns. The data is in this format:
c1                          c2
2019-12-10 10:07:54.000     2019-12-13 10:07:54.000
2020-06-08 15:14:49.000     2020-06-18 10:07:54.000
from pyspark.sql.functions import col, udf, to_timestamp
joined_df.select(to_timestamp(joined_df.c1, '%Y-%m-%d %H:%M:%S.%SSSS').alias('dt')).collect()
joined_df.select(to_timestamp(joined_df.c2, '%Y-%m-%d %H:%M:%S.%SSSS').alias('dt')).collect()
Once the columns are converted, I want a new column with the date difference, c2 - c1.
In pandas I do it like this:
df['c1'] = df['c1'].fillna('0000-01-01').apply(lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S.%f'))
df['c2'] = df['c2'].fillna('0000-01-01').apply(lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S.%f'))
df['days'] = (df['c2'] - df['c1']).apply(lambda x: x.days)
Can anyone help me convert this to PySpark?

If you want to get the date difference, you can use datediff:
import pyspark.sql.functions as F
df = df.withColumn('c1', F.col('c1').cast('timestamp')).withColumn('c2', F.col('c2').cast('timestamp'))
result = df.withColumn('days', F.datediff(F.col('c2'), F.col('c1')))
result.show(truncate=False)
+-----------------------+-----------------------+----+
|c1 |c2 |days|
+-----------------------+-----------------------+----+
|2019-12-10 10:07:54.000|2019-12-13 10:07:54.000|3 |
|2020-06-08 15:14:49.000|2020-06-18 10:07:54.000|10 |
+-----------------------+-----------------------+----+
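If you would rather use to_timestamp as in the original attempt, note that Spark expects Java-style datetime patterns, not Python strftime codes. A minimal sketch for the sample data (same datediff afterwards):
import pyspark.sql.functions as F

# Spark pattern letters: 'yyyy-MM-dd HH:mm:ss.SSS' (not '%Y-%m-%d %H:%M:%S.%f')
df = df.withColumn('c1', F.to_timestamp('c1', 'yyyy-MM-dd HH:mm:ss.SSS')) \
       .withColumn('c2', F.to_timestamp('c2', 'yyyy-MM-dd HH:mm:ss.SSS'))
result = df.withColumn('days', F.datediff(F.col('c2'), F.col('c1')))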

Related

Change the day of the date to a particular day

I basically have a requirement for a column that holds the PeriodEndDate. The period always ends on the 23rd of the month.
I need to take a date from a column (in this case it is the last day of each month) and set the "day" of that date to 23.
I have tried the following:
.withColumn("periodEndDate", change_day(jsonDF2.periodDate, sf.lit(23)))
but I get the error:
cannot import name 'change_day' from 'pyspark.sql.functions'
You can use make_date:
from pyspark.sql import functions as F
df = spark.createDataFrame([('2022-05-31',)], ['periodDate'])
df = df.withColumn('periodEndDate', F.expr("make_date(year(periodDate), month(periodDate), 23)"))
df.show()
# +----------+-------------+
# |periodDate|periodEndDate|
# +----------+-------------+
# |2022-05-31| 2022-05-23|
# +----------+-------------+
As far as I know, there is no change_day function; however, you can make one with a UDF: pass in a date and replace its day.
Example:
from datetime import datetime
from pyspark.sql import SparkSession
from pyspark.sql.types import TimestampType
from pyspark.sql import functions as F

def change_day(date, day):
    return date.replace(day=day)

change_day = F.udf(change_day, TimestampType())

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([{"date": datetime(2022, 1, 31)}])
df = df.withColumn("23day", change_day(F.col("date"), F.lit(23)))
df.show(20, False)
Result:
+-------------------+-------------------+
|date |23day |
+-------------------+-------------------+
|2022-01-31 00:00:00|2022-01-23 00:00:00|
+-------------------+-------------------+
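If you want to avoid a UDF and keep the time-of-day, Spark's built-in make_timestamp (Spark SQL 3.0+) can rebuild the value with only the day replaced. A minimal sketch, assuming a timestamp column named date as above:
from pyspark.sql import functions as F

# Rebuild the timestamp with the day forced to 23, keeping hour/minute/second.
df = df.withColumn(
    '23day',
    F.expr("make_timestamp(year(date), month(date), 23, hour(date), minute(date), second(date))")
)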

df.rdd.collect() converts timestamp column(UTC) to local timezone(IST) in pyspark

Spark reads a table from MySQL that has a timestamp column storing UTC values. Spark is configured with the local time zone (IST).
MySQL stores the timestamp values below.
spark.conf.set("spark.sql.session.timeZone", "UTC")
df.show(100, False)
After setting the conf above, I can see the correct records with df.show(). However, df.rdd.collect() converts these values back to the IST time zone.
for row in df.rdd.collect():
    print("row.Mindate ", row.Mindate)
row.Mindate 2021-03-02 19:30:31
row.Mindate 2021-04-01 14:05:03
row.Mindate 2021-06-15 11:39:40
row.Mindate 2021-07-07 18:14:17
row.Mindate 2021-08-03 10:48:51
row.Mindate 2021-10-06 10:21:11
The Spark DataFrame and df.rdd show different result sets.
How does it change the values back to the local time zone even after setting "spark.sql.session.timeZone" to "UTC"?
Thanks in advance
EDIT 1:
df.printSchema()
root
|-- Mindate: timestamp (nullable = true)
|-- Maxdate: timestamp (nullable = true)
TL;DR
df.rdd.collect() converts timestamp column(UTC) to local timezone(IST) in pyspark
No, it does not. In fact, the timestamp inside the DataFrame you read has no time zone; what you see is simply the behavior of show(), which renders timestamps in the session-local time zone.
Time zone info is lost when you store a datetime.datetime value in a column of type TimestampType.
As described in the docs:
Datetime type
TimestampType: Represents values comprising values of fields year, month, day, hour, minute, and second, with the session local time-zone. The timestamp value represents an absolute point in time.
As you can see in the code, TimestampType is a wrapper over Python's datetime.datetime, but it strips out the time zone and internally stores the value as epoch time.
class TimestampType(AtomicType, metaclass=DataTypeSingleton):
    """Timestamp (datetime.datetime) data type."""

    def needConversion(self):
        return True

    def toInternal(self, dt):
        if dt is not None:
            seconds = (calendar.timegm(dt.utctimetuple()) if dt.tzinfo
                       else time.mktime(dt.timetuple()))
            return int(seconds) * 1000000 + dt.microsecond

    def fromInternal(self, ts):
        if ts is not None:
            # using int to avoid precision loss in float
            return datetime.datetime.fromtimestamp(ts // 1000000).replace(microsecond=ts % 1000000)
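To see the round trip directly, you can call the same toInternal/fromInternal pair by hand (a sketch outside of Spark; the exact rendered value depends on the Python process's local time zone):
from datetime import datetime, timezone, timedelta
from pyspark.sql.types import TimestampType

ts_type = TimestampType()
aware = datetime(2021, 4, 1, 10, 0, 0, tzinfo=timezone(timedelta(hours=-5)))

internal = ts_type.toInternal(aware)       # epoch microseconds, the -05:00 offset already applied
restored = ts_type.fromInternal(internal)  # naive datetime, rendered in the process-local time zone

print(restored, restored.tzinfo)           # tzinfo is None: the original offset is gone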
Some more sample code:
from typing import Union
from pyspark.sql.types import TimestampType, StringType
from datetime import datetime
from pyspark.sql import DataFrame, functions as F

# log() appears to be a small print helper that prefixes a timestamp (see the output below);
# plain print() works the same way.
def to_str(val: Union[str, datetime]) -> str:
    type_str = f'{type(val).__name__}:'
    if isinstance(val, str):
        return type_str + val
    else:
        return f'{type_str}{val.isoformat()} tz:{val.tzinfo}'

def print_df_info(df: DataFrame):
    df.show(truncate=False)
    for row in df.collect():
        log('DF :', ','.join([to_str(cell) for cell in row]))
    for row in df.rdd.collect():
        log('RDD:', ','.join([to_str(cell) for cell in row]))

spark.conf.set("spark.sql.session.timeZone", "UTC")
timestamps = ['2021-04-01 10:00:00 -05:00']
timestamp_data = [{'col_original_str': s} for s in timestamps]
my_df = spark.createDataFrame(timestamp_data)

# 1. col_original_str -> col_to_timestamp (convert to UTC and store WITHOUT a timezone)
my_df = my_df.withColumn('col_to_timestamp', F.to_timestamp(my_df.col_original_str))

# 2. col_to_timestamp -> col_date_format (convert an epoch time, which has no timezone, to a string)
my_df = my_df.withColumn('col_date_format', F.date_format(my_df.col_to_timestamp, "yyyy-MM-dd HH:mm:ss.SSSXXX"))

# 3. col_to_timestamp -> col_reinterpret_tz (tell pyspark to interpret col_to_timestamp with
#    timezone Asia/Kolkata, and convert it to UTC). This is really confusing.
my_df = my_df.withColumn('col_reinterpret_tz', F.to_utc_timestamp(my_df.col_to_timestamp, 'Asia/Kolkata'))

my_df.printSchema()

log('#################################################')
log('df with session.timeZone set to UTC')
spark.conf.set("spark.sql.session.timeZone", "UTC")
print_df_info(my_df)

log('#################################################')
log('df with session.timeZone set to Asia/Kolkata')
spark.conf.set("spark.sql.session.timeZone", "Asia/Kolkata")
print_df_info(my_df)
Note in the output:
The DF : and RDD: lines (see the logs from print_df_info()) have exactly the same contents; they are different facades over the same data.
Changing spark.sql.session.timeZone has no impact on the internal representation (again, see the logs from print_df_info()).
Changing spark.sql.session.timeZone changes only the way show() prints values of type timestamp.
2021-11-08T12:16:22.817 spark.version: 3.0.3
root
|-- col_original_str: string (nullable = true)
|-- col_to_timestamp: timestamp (nullable = true)
|-- col_date_format: string (nullable = true)
|-- col_reinterpret_tz: timestamp (nullable = true)
2021-11-08T13:57:54.243 #################################################
2021-11-08T13:57:54.244 df with session.timeZone set to UTC
+--------------------------+-------------------+------------------------+-------------------+
|col_original_str |col_to_timestamp |col_date_format |col_reinterpret_tz |
+--------------------------+-------------------+------------------------+-------------------+
|2021-04-01 10:00:00 -05:00|2021-04-01 15:00:00|2021-04-01 15:00:00.000Z|2021-04-01 09:30:00|
+--------------------------+-------------------+------------------------+-------------------+
2021-11-08T13:57:54.506 DF : str:2021-04-01 10:00:00 -05:00,datetime:2021-04-01T10:00:00 tz:None,str:2021-04-01 15:00:00.000Z,datetime:2021-04-01T04:30:00 tz:None
2021-11-08T13:57:54.569 RDD: str:2021-04-01 10:00:00 -05:00,datetime:2021-04-01T10:00:00 tz:None,str:2021-04-01 15:00:00.000Z,datetime:2021-04-01T04:30:00 tz:None
2021-11-08T13:57:54.570 #################################################
2021-11-08T13:57:54.570 df with session.timeZone set to Asia/Kolkata
+--------------------------+-------------------+------------------------+-------------------+
|col_original_str |col_to_timestamp |col_date_format |col_reinterpret_tz |
+--------------------------+-------------------+------------------------+-------------------+
|2021-04-01 10:00:00 -05:00|2021-04-01 20:30:00|2021-04-01 15:00:00.000Z|2021-04-01 15:00:00|
+--------------------------+-------------------+------------------------+-------------------+
2021-11-08T13:57:54.828 DF : str:2021-04-01 10:00:00 -05:00,datetime:2021-04-01T10:00:00 tz:None,str:2021-04-01 15:00:00.000Z,datetime:2021-04-01T04:30:00 tz:None
2021-11-08T13:57:54.916 RDD: str:2021-04-01 10:00:00 -05:00,datetime:2021-04-01T10:00:00 tz:None,str:2021-04-01 15:00:00.000Z,datetime:2021-04-01T04:30:00 tz:None
Some references:
Pyspark to_timestamp with timezone
Pyspark converting timestamps from UTC to many timezones
Change Unix (Epoch) time to local time in pyspark

Spark SQL Datediff between columns in minutes

I have 2 columns in a table (both dates, stored as string type). I need to find the difference between them in minutes and then average that difference over a year.
The format is as below:
Requesttime: 11/10/2019 03:10:15 PM
Respondtime: 11/10/2019 03:20:10 PM
Any suggestions?
You can register user-defined functions:
import datetime

def min_diff(a, b):
    start_time = datetime.datetime.strptime(a, '%m/%d/%Y %I:%M:%S %p')
    end_time = datetime.datetime.strptime(b, '%m/%d/%Y %I:%M:%S %p')
    return (end_time - start_time).total_seconds() / 60

def year(c):
    return datetime.datetime.strptime(c, '%m/%d/%Y %I:%M:%S %p').strftime('%Y')

spark.udf.register(name='min_diff', f=lambda a, b: min_diff(a, b))
spark.udf.register(name='year', f=lambda c: year(c))

spark.sql('select avg(min_diff(start_time, end_time)) avg_time_diff, year(start_time) year from test_table group by year').show()
No need for a UDF. Just use Spark SQL functions, as below:
import pyspark.sql.functions as F

df = spark.createDataFrame([
    ['11/10/2019 03:10:15 PM', '11/10/2019 03:20:10 PM']
]).toDF('Requesttime', 'Respondtime')

df = df.withColumn(
    'diff_minutes',
    (F.to_timestamp('Respondtime', 'dd/MM/yyyy hh:mm:ss a').cast('bigint') -
     F.to_timestamp('Requesttime', 'dd/MM/yyyy hh:mm:ss a').cast('bigint')) / 60
)
df.show(truncate=False)
+----------------------+----------------------+-----------------+
|Requesttime |Respondtime |diff_minutes |
+----------------------+----------------------+-----------------+
|11/10/2019 03:10:15 PM|11/10/2019 03:20:10 PM|9.916666666666666|
+----------------------+----------------------+-----------------+
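The cast('bigint') trick works because casting a timestamp to a long yields epoch seconds; if you prefer to be explicit, unix_timestamp gives the same result (a sketch, not part of the original answer):
df = df.withColumn(
    'diff_minutes',
    (F.unix_timestamp('Respondtime', 'dd/MM/yyyy hh:mm:ss a')
     - F.unix_timestamp('Requesttime', 'dd/MM/yyyy hh:mm:ss a')) / 60
)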
If you want to average the difference over a year, you can do:
df.groupBy(
    F.year(F.to_timestamp('Requesttime', 'dd/MM/yyyy hh:mm:ss a')).alias('year')
).agg(F.avg('diff_minutes'))

How to extract time from timestamp in pyspark?

I have a requirement to extract the time from a timestamp column in a dataframe using PySpark.
Let's say the timestamp is 2019-01-03T18:21:39; I want to extract only the time, "18:21:39", so that it always appears zero-padded, like "01:01:01".
from pyspark.sql.types import StringType, TimestampType

df = spark.createDataFrame(["2020-06-17T00:44:30", "2020-06-17T06:06:56", "2020-06-17T15:04:34"], StringType()).toDF('datetime')
df = df.select(df['datetime'].cast(TimestampType()))
I tried the following but did not get the expected result:
from pyspark.sql.functions import concat, hour, lit, minute, second

df1 = df.withColumn('time', concat(hour(df['datetime']), lit(":"), minute(df['datetime']), lit(":"), second(df['datetime'])))
display(df1)
+-------------------+-------+
| datetime| time|
+-------------------+-------+
|2020-06-17 00:44:30|0:44:30|
|2020-06-17 06:06:56| 6:6:56|
|2020-06-17 15:04:34|15:4:34|
+-------------------+-------+
My results look like 6:6:56, but I want them to be 06:06:56.
Use the date_format function.
from pyspark.sql.types import StringType
from pyspark.sql.functions import date_format

df = spark \
    .createDataFrame(["2020-06-17T00:44:30", "2020-06-17T06:06:56", "2020-06-17T15:04:34"], StringType()) \
    .toDF('datetime')

q = df.withColumn('time', date_format('datetime', 'HH:mm:ss'))
>>> q.show()
+-------------------+--------+
| datetime| time|
+-------------------+--------+
|2020-06-17T00:44:30|00:44:30|
|2020-06-17T06:06:56|06:06:56|
|2020-06-17T15:04:34|15:04:34|
+-------------------+--------+
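date_format also accepts a timestamp column, so the same call works after the cast the question already performs (a minimal sketch under that assumption):
from pyspark.sql.functions import col, date_format
from pyspark.sql.types import TimestampType

q2 = df.withColumn('ts', col('datetime').cast(TimestampType())) \
       .withColumn('time', date_format('ts', 'HH:mm:ss'))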

Transform a long format date to a short format

The small program I am using (below) gives a Date0 value in a "long" form:
import pandas as pd
from datetime import datetime, timedelta

df = pd.DataFrame(
    [["A", -33.415424]],
    columns=["Country", "Time0"]
)
df = df.set_index("Country", drop=True)

d = datetime.strptime("2020-01-22", '%Y-%m-%d')
df['d'] = d
df['Date0'] = df['d'] + pd.to_timedelta(df['Time0'], unit='d')
Time0 d Date0
Country
A -33.415424 2020-01-22 2019-12-19 14:01:47.366400
How can I get Date0 to be "only" 2019-12-19?
Surely a basic question, but I'm sorry to say that I am completely lost in formatting dates in Python...
You can do it this way:
df['Date0'] = df['Date0'].dt.date
Output:
Time0 d Date0
Country
A -33.415424 2020-01-22 2019-12-19
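If you need a string in a fixed layout rather than a datetime.date object, dt.strftime on the datetime column works as well (a hedged alternative; apply it instead of, not after, the dt.date conversion):
# Format the timestamp as a 'YYYY-MM-DD' string instead of keeping a date object.
df['Date0'] = df['Date0'].dt.strftime('%Y-%m-%d')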
