casting strings to timestamp - apache-spark

I want to cast a string to timestamp. The problem I'm facing is that the string shows the first three letters of the month rather than the month number:
E.g. 31-JAN-20 12.03.48.759214 AM
Is there any smart way to convert the above value into something like this?
2020-01-31T12:03:48.000+0000
Thanks

Use to_timestamp to convert the string into timestamp type, then use date_format to get the desired pattern:
from pyspark.sql import functions as F

df = spark.createDataFrame([("31-JAN-20 12.03.48.759214 AM",)], ["date"])

df.withColumn(
    "date2",
    F.date_format(
        F.to_timestamp("date", "dd-MMM-yy h.mm.ss.SSSSSS a"),
        "yyyy-MM-dd'T'HH:mm:ss.SSS Z"
    )
).show(truncate=False)
#+----------------------------+-----------------------------+
#|date |date2 |
#+----------------------------+-----------------------------+
#|31-JAN-20 12.03.48.759214 AM|2020-01-31T00:03:48.759 +0100|
#+----------------------------+-----------------------------+
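Note that the trailing +0100 in the result comes from my Spark session timezone. If you need the +0000 offset from your expected output, one option (a sketch, not verified against your data) is to switch the session timezone to UTC before parsing and formatting:

# Render offsets relative to UTC so the formatted value ends in +0000
spark.conf.set("spark.sql.session.timeZone", "UTC")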

Related

Wrong sequence of months in PySpark sequence interval month

I am trying to create an array of dates containing all months from a minimum date to a maximum date.
Example:
min_date = "2021-05-31"
max_date = "2021-11-30"
.withColumn('array_date', F.expr('sequence(to_date(min_date), to_date(max_date), interval 1 month)'))
But it gives me the following Output:
['2021-05-31', '2021-06-30', '2021-07-31', '2021-08-31', '2021-09-30', '2021-10-31']
Why doesn't the upper limit, 2021-11-30, appear? The documentation says that both bounds are included.
My desired output is:
['2021-05-31', '2021-06-30', '2021-07-31', '2021-08-31', '2021-09-30', '2021-10-31', '2021-11-30']
Thank you!
I think this is related to the timezone. I can reproduce the same behavior in my timezone (Europe/Paris), but when setting the timezone to UTC it gives the expected result:
from pyspark.sql import functions as F

spark.conf.set("spark.sql.session.timeZone", "UTC")

df = spark.createDataFrame([("2021-05-31", "2021-11-30")], ["min_date", "max_date"])

df.withColumn(
    "array_date",
    F.expr("sequence(to_date(min_date), to_date(max_date), interval 1 month)")
).show(truncate=False)
#+----------+----------+------------------------------------------------------------------------------------+
#|min_date |max_date |array_date |
#+----------+----------+------------------------------------------------------------------------------------+
#|2021-05-31|2021-11-30|[2021-05-31, 2021-06-30, 2021-07-31, 2021-08-31, 2021-09-30, 2021-10-31, 2021-11-30]|
#+----------+----------+------------------------------------------------------------------------------------+
Alternatively, you can use TimestampType for the start and end parameters of sequence instead of DateType:
df.withColumn(
    "array_date",
    F.expr("sequence(to_timestamp(min_date), to_timestamp(max_date), interval 1 month)").cast("array<date>")
).show(truncate=False)
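If you want to confirm which timezone your session is currently using before changing it, the runtime config can be read back as well (a small sketch; Europe/Paris below is just an example value to restore afterwards):

# Inspect the session timezone that the date arithmetic above depends on
spark.conf.get("spark.sql.session.timeZone")

# Switch back after the experiment if needed, e.g. to a local zone
spark.conf.set("spark.sql.session.timeZone", "Europe/Paris")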

max aggregation on grouped spark dataframe returns wrong value

I have a spark dataframe containing 2 columns (CPID and PluginDuration). I need to find the maximum PluginDuration and the average PluginDuration for each CPID in the dataframe.
Filtering the dataframe for CPID AN04773 returns the rows below:
df.filter('CPID = "AN04773"').show(10)
Result:
+-------+--------------+
| CPID|PluginDuration|
+-------+--------------+
|AN04773| 1.933333333|
|AN04773| 13.03444444|
|AN04773| 9.2875|
|AN04773| 20.50027778|
+-------+--------------+
When I did a groupBy on the CPID column of the dataframe to find the max and avg plugin duration as below, I found that the max value returned for some CPIDs is not as expected. For example, for CPID AN04773 (the same CPID used above), the max PluginDuration should be 20.50027778, but the code below returns 9.2875, which is not right.
from pyspark.sql import functions as F
fdf = df.groupBy('CPID').agg(F.max('PluginDuration').alias('max_duration'),F.avg('PluginDuration').alias('avg_duration'))
fdf.filter('CPID = "AN04773"').show()
Result:
+-------+------------+--------------+
| CPID|max_duration| avg_duration|
+-------+------------+--------------+
|AN04773| 9.2875|11.18888888825|
+-------+------------+--------------+
I want to know why it's not functioning as expected.
The wrong calculation happens because PluginDuration is not defined as a numeric datatype but as a string column. All you have to do is cast the PluginDuration column to a numeric type (double, float, etc.).
Here is your issue (reproduced in Scala, but it works the same in PySpark):
import spark.implicits._
import org.apache.spark.sql.functions.{avg, col, max}

val data = Seq(("AN04773", "1.933333333"), ("AN04773", "13.03444444"), ("AN04773", "9.2875"), ("AN04773", "20.50027778")).toDF("id", "value")
data.groupBy("id").agg(max("value"), avg("value")).show
// output:
+-------+----------+--------------+
| id|max(value)| avg(value)|
+-------+----------+--------------+
|AN04773| 9.2875|11.18888888825|
+-------+----------+--------------+
But after casting the value column to the Double datatype, we get the correct values:
data.withColumn("value", col("value").cast("double")).groupBy("id").agg(max("value"), avg("value")).show
// output:
+-------+-----------+--------------+
| id| max(value)| avg(value)|
+-------+-----------+--------------+
|AN04773|20.50027778|11.18888888825|
+-------+-----------+--------------+
Since no numeric datatype was defined for the column, Spark treated it as a string, and as strings "9.2875" is greater than "20.50027778" (they are compared character by character, and '9' > '2'), so the maximum is 9.2875.
Note: if you keep the column as a string in PySpark, you will get the same wrong result as in the Scala example.
A PySpark version of the same fix, casting before aggregating:
def transform(self, df):
    from pyspark.sql import functions as func
    # cast PluginDuration column from string to double to perform the aggregation
    df1 = df.withColumn('PluginDuration', func.col('PluginDuration').cast('double'))
    df2 = df1.groupBy("CPID") \
        .agg(func.max("PluginDuration").alias("max_duration"),
             func.avg("PluginDuration").alias("avg_duration"))
    #df2.show(truncate=False)
    # rename and round the column values
    df3 = df2.select(func.col("CPID").alias("chargepoint_id"),
                     func.round(df2["max_duration"], 2).alias("max_duration"),
                     func.round(df2["avg_duration"], 2).alias("avg_duration"))
    #df3.show(truncate=False)
    return df3
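If you don't need the wrapper method, the same fix can also be applied inline; a minimal sketch using the column names from the question:

from pyspark.sql import functions as F

fdf = (df.withColumn("PluginDuration", F.col("PluginDuration").cast("double"))
         .groupBy("CPID")
         .agg(F.max("PluginDuration").alias("max_duration"),
              F.avg("PluginDuration").alias("avg_duration")))

fdf.filter('CPID = "AN04773"').show()
# max_duration should now be 20.50027778 instead of 9.2875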

I would like to convert an int to string in dataframe

I would like to convert a column in a dataframe to a string. It looks like this:
company department id family name start_date end_date
abc sales 38221925 Levy nali 16/05/2017 01/01/2018
I want to convert the id column from int to string.
I tried
data['id'] = data['id'].to_string()
and
data['id'] = data['id'].astype(str)
but got dtype('O').
I expect to receive a string dtype.
This is intended behaviour. This is how pandas stores strings.
From the docs
Pandas uses the object dtype for storing strings.
For a simple test, you can make a dummy dataframe and check its dtype too.
import pandas as pd
df = pd.DataFrame(["abc", "ab"])
df[0].dtype
#Output:
dtype('O')
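As a side note (assuming pandas >= 1.0), there is also a dedicated nullable string dtype if you want the column to report as string rather than object:

import pandas as pd

df = pd.DataFrame({"id": [38221925, 12345678]})
df["id"] = df["id"].astype("string")  # nullable StringDtype, available since pandas 1.0
df["id"].dtype                        # reports the string dtype instead of object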
You can do that by using the apply() function in this way:
data['id'] = data['id'].apply(lambda x: str(x))
This will convert all the values of the id column to strings.
You can check the type of the values like this:
type(data['id'][0]) (this checks the first value of the 'id' column)
This will give the output str.
And data['id'].dtype will still give dtype('O'), that is, object.
You can also use data.info() to check all the information about that DataFrame.
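Putting those checks together (a small runnable sketch; the 'data' frame below just mirrors the question's id column):

import pandas as pd

data = pd.DataFrame({"id": [38221925]})
data['id'] = data['id'].apply(lambda x: str(x))

print(type(data['id'][0]))  # <class 'str'>: the individual values are Python strings
print(data['id'].dtype)     # object: pandas still reports dtype('O') for the column
data.info()                 # dtype and memory summary for the whole DataFrame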
str(12)
>> '12'
str() can easily convert an int value to a string.

How do I convert multiple `string` columns in my dataframe to datetime columns?

I am in the process of converting multiple string columns to datetime columns, but I am running into the following issues:
Example column 1:
1/11/2018 9:00:00 AM
Code:
df = df.withColumn(df.column_name, to_timestamp(df.column_name, "MM/dd/yyyy hh:mm:ss aa"))
This works okay
Example column 2:
2019-01-10T00:00:00-05:00
Code:
df = df.withColumn(df.column_name, to_date(df.column_name, "yyyy-MM-dd'T'HH:mm:ss'-05:00'"))
This works okay
Example column 3:
20190112
Code:
df = df.withColumn(df.column_name, to_date(df.column_name, "yyyyMMdd"))
This does not work. I get this error:
AnalysisException: "cannot resolve 'unix_timestamp(t.`date`,
'yyyyMMdd')' due to data type mismatch: argument 1 requires (string or
date or timestamp) type, however, 't.`date`' is of int type.
I feel like it should be straightforward, but I am missing something.
The error is pretty self-explanatory: you need your column to be a String.
Are you sure your column is already a String? It seems not. You can cast it to String first with column.cast:
from pyspark.sql.types import StringType

df = df.withColumn("column_name", to_date(df["column_name"].cast(StringType()), "yyyyMMdd"))
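Since the question is about multiple columns, a minimal sketch for applying the right parser per column (the column names and patterns below are placeholders, adjust to your schema):

from pyspark.sql import functions as F

# hypothetical mapping of column name -> datetime pattern
formats = {
    "col_with_ampm": "MM/dd/yyyy hh:mm:ss a",
    "col_yyyymmdd": "yyyyMMdd",
}

for c, fmt in formats.items():
    # cast to string first so int-typed columns like 20190112 also parse
    df = df.withColumn(c, F.to_timestamp(F.col(c).cast("string"), fmt))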

pyspark to_timestamp does not include milliseconds

I'm trying to format my timestamp column to include milliseconds, without success. How can I format my time to look like this: 2019-01-04 11:09:21.152?
I have looked at the documentation and followed SimpleDateFormat, which the pyspark docs say is used by the to_timestamp function.
This is my dataframe.
+--------------------------+
|updated_date |
+--------------------------+
|2019-01-04 11:09:21.152815|
+--------------------------+
I used the millisecond format without any success, as below:
>>> df.select('updated_date').withColumn("updated_date_col2",
to_timestamp("updated_date", "YYYY-MM-dd HH:mm:ss:SSS")).show(1,False)
+--------------------------+-------------------+
|updated_date |updated_date_col2 |
+--------------------------+-------------------+
|2019-01-04 11:09:21.152815|2019-01-04 11:09:21|
+--------------------------+-------------------+
I expect updated_date_col2 to be formatted as 2019-01-04 11:09:21.152
I think you can use a UDF and Python's standard datetime module as below.
import datetime
from pyspark.sql.functions import udf
from pyspark.sql.types import TimestampType

def _to_timestamp(s):
    return datetime.datetime.strptime(s, '%Y-%m-%d %H:%M:%S.%f')

udf_to_timestamp = udf(_to_timestamp, TimestampType())

df.select('updated_date').withColumn("updated_date_col2", udf_to_timestamp("updated_date")).show(1, False)
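As a side note (a sketch, assuming the string already uses the default yyyy-MM-dd HH:mm:ss.SSSSSS layout as in the example), a plain cast also keeps the fractional seconds without a UDF, and date_format can then trim the value to milliseconds:

from pyspark.sql import functions as F

df2 = (df
    .withColumn("ts", F.col("updated_date").cast("timestamp"))             # keeps .152815
    .withColumn("ts_ms", F.date_format("ts", "yyyy-MM-dd HH:mm:ss.SSS")))  # string ending in .152
df2.show(1, False)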
This is not a solution with to_timestamp, but you can easily keep your column in timestamp format.
The following code is one example of converting a numeric Unix timestamp (seconds since the epoch, with a fractional part) to a timestamp column.
from datetime import datetime
ms = datetime.now().timestamp() # ex) ms = 1547521021.83301
df = spark.createDataFrame([(1, ms)], ['obs', 'time'])
df = df.withColumn('time', df.time.cast("timestamp"))
df.show(1, False)
+---+--------------------------+
|obs|time |
+---+--------------------------+
|1 |2019-01-15 12:15:49.565263|
+---+--------------------------+
If you use new Date().getTime() or Date.now() in JS you get epoch milliseconds, and datetime.datetime.now().timestamp() in Python gives epoch seconds (with a fractional part); either way you get a numeric timestamp value.
Reason: pyspark's to_timestamp parses only up to seconds, while TimestampType can hold milliseconds.
The following workaround may work:
If the timestamp pattern contains S, invoke a UDF to get an 'INTERVAL ... MILLISECONDS' string to use in the expression:
ts_pattern = "YYYY-MM-dd HH:mm:ss:SSS"
my_col_name = "time_with_ms"

# get the time up to seconds
df = df.withColumn(my_col_name, to_timestamp(df["updated_date_col2"], ts_pattern))

# add the milliseconds as an interval
if 'S' in ts_pattern:
    df = df.withColumn(my_col_name, df[my_col_name] + expr("INTERVAL 256 MILLISECONDS"))
To get the INTERVAL 256 MILLISECONDS part dynamically, we may use a Java UDF:
df = df.withColumn(my_col_name, df[my_col_name] + expr(getIntervalStringUDF(df[my_col_name], ts_pattern)))
Inside the UDF getIntervalStringUDF(String timeString, String pattern):
use SimpleDateFormat to parse the date according to the pattern;
return the formatted date as a string using the pattern "'INTERVAL 'SSS' MILLISECONDS'";
return 'INTERVAL 0 MILLISECONDS' on parse/format exceptions.
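For illustration only, a hypothetical Python rendition of the helper described above (the original answer describes a Java UDF using SimpleDateFormat; the function name, the strptime pattern, and the error handling here are assumptions):

from datetime import datetime

def get_interval_string(time_string, pattern="%Y-%m-%d %H:%M:%S.%f"):
    # Parse the raw value and return an 'INTERVAL n MILLISECONDS' snippet,
    # falling back to 'INTERVAL 0 MILLISECONDS' on parse errors, as described above.
    try:
        dt = datetime.strptime(time_string, pattern)
        return "INTERVAL {} MILLISECONDS".format(dt.microsecond // 1000)
    except (TypeError, ValueError):
        return "INTERVAL 0 MILLISECONDS"

# get_interval_string("2019-01-04 11:09:21.152815")  -> "INTERVAL 152 MILLISECONDS"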
